corpora

Functions to work with multiple corpora and to download and build sample corpora.

List available corpora


list_corpora

 list_corpora (path:str)

Scan a directory for available corpora

         Type       Details
path     str        path to the directory to scan for corpora
Returns  DataFrame  DataFrame with corpus, name, date created, document count and token count columns
print(list_corpora(save_path))
┌──────────────────────────────────────────────┬───────────────────────────────────────┬─────────────────────┬────────────────┬─────────────┐
│ corpus                                       ┆ name                                  ┆ date_created        ┆ document_count ┆ token_count │
╞══════════════════════════════════════════════╪═══════════════════════════════════════╪═════════════════════╪════════════════╪═════════════╡
│ brown.corpus                                 ┆ Brown Corpus                          ┆ 2025-06-21 10:38:11 ┆ 500            ┆ 1,138,566   │
│ gutenberg.corpus                             ┆ Gutenberg Corpus                      ┆ 2025-06-21 10:38:57 ┆ 18             ┆ 2,546,286   │
│ toy.corpus                                   ┆ Toy Corpus                            ┆ 2025-06-21 10:38:02 ┆ 6              ┆ 38          │
│ us-congressional-speeches-subset-10k.corpus  ┆ US Congressional Speeches Subset 10k  ┆ 2025-06-03 16:43:17 ┆ 10,000         ┆ 1,964,972   │
│ reuters.corpus                               ┆ Reuters Corpus                        ┆ 2025-06-21 10:38:23 ┆ 10,788         ┆ 1,552,919   │
│ us-congressional-speeches-subset-100k.corpus ┆ US Congressional Speeches Subset 100k ┆ 2025-06-03 16:45:39 ┆ 100,000        ┆ 20,027,241  │
│ garden-party.corpus                          ┆ Garden Party Corpus                   ┆ 2025-06-21 10:38:59 ┆ 15             ┆ 74,664      │
└──────────────────────────────────────────────┴───────────────────────────────────────┴─────────────────────┴────────────────┴─────────────┘
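
To give a feel for what a directory scan like this involves, here is a minimal standard-library sketch. It is not Conc's implementation: the helper name find_corpus_entries is hypothetical, and the only thing it assumes is what the output above shows, namely that corpora appear in the directory as entries whose names end in .corpus. It does not read names, dates or counts.

```python
import os
import tempfile


def find_corpus_entries(path: str) -> list[str]:
    """Return the sorted names of directory entries ending in '.corpus'.

    Hypothetical helper for illustration only; list_corpora itself also
    reports metadata such as document and token counts.
    """
    return sorted(
        entry.name
        for entry in os.scandir(path)
        if entry.name.endswith(".corpus")
    )


# Demonstrate on a throwaway directory with made-up entries.
with tempfile.TemporaryDirectory() as demo_path:
    os.mkdir(os.path.join(demo_path, "toy.corpus"))
    os.mkdir(os.path.join(demo_path, "brown.corpus"))
    os.mkdir(os.path.join(demo_path, "notes"))  # ignored: no .corpus suffix
    found = find_corpus_entries(demo_path)
    print(found)
```

For real work, list_corpora(path) is the function to call, since it returns the full table shown above.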

Get data sources


create_toy_corpus_sources

 create_toy_corpus_sources (source_path:str)

Create txt files and csv to test build of toy corpus.

             Type  Details
source_path  str   path to location of sources for building corpora

show_toy_corpus

 show_toy_corpus (csv_path:str)

Show toy corpus in a table.

          Type  Details
csv_path  str   path to location of csv for building corpora
Returns   GT
show_toy_corpus(os.path.join(source_path, 'toy.csv'))
source  text                         category  species
1.txt   The cat sat on the mat.      feline    cat
2.txt   The dog sat on the mat.      canine    dog
3.txt   The cat is meowing.          feline    cat
4.txt   The dog is barking.          canine    dog
5.txt   The cat is climbing a tree.  feline    cat
6.txt   The dog is digging a hole.   canine    dog
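
The toy sources are small enough to write by hand, which can help when checking what a build should contain. The sketch below writes the six texts from the table above as .txt files plus a toy.csv with the same four columns. The function name write_toy_sources and the choice to put everything in a single directory are assumptions for illustration; create_toy_corpus_sources may lay the files out differently.

```python
import csv
import os
import tempfile

# The six toy texts and their metadata, copied from the table above.
TOY_ROWS = [
    ("1.txt", "The cat sat on the mat.", "feline", "cat"),
    ("2.txt", "The dog sat on the mat.", "canine", "dog"),
    ("3.txt", "The cat is meowing.", "feline", "cat"),
    ("4.txt", "The dog is barking.", "canine", "dog"),
    ("5.txt", "The cat is climbing a tree.", "feline", "cat"),
    ("6.txt", "The dog is digging a hole.", "canine", "dog"),
]


def write_toy_sources(target_dir: str) -> str:
    """Write one .txt file per text plus toy.csv; return the CSV path.

    Hypothetical helper: the single-directory layout is an assumption.
    """
    os.makedirs(target_dir, exist_ok=True)
    for name, text, _category, _species in TOY_ROWS:
        with open(os.path.join(target_dir, name), "w", encoding="utf-8") as f:
            f.write(text)
    csv_path = os.path.join(target_dir, "toy.csv")
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "text", "category", "species"])
        writer.writerows(TOY_ROWS)
    return csv_path


# Write to a temporary directory and read the CSV back to check it.
with tempfile.TemporaryDirectory() as tmp:
    toy_dir = os.path.join(tmp, "toy")
    toy_csv = write_toy_sources(toy_dir)
    with open(toy_csv, newline="", encoding="utf-8") as f:
        toy_rows = list(csv.DictReader(f))
    toy_files = sorted(os.listdir(toy_dir))
    print(len(toy_rows), toy_files)
```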

get_nltk_corpus_sources

 get_nltk_corpus_sources (source_path:str)

Get NLTK corpora as sources for development or testing Conc functionality.

             Type  Details
source_path  str   path to location of sources for building corpora

The texts of the Brown Corpus from NLTK can be used to test Conc functionality. get_nltk_corpus_sources also prepares the Reuters and Gutenberg corpora. Running the function downloads the texts and saves them as .csv.gz files with the columns source and text. The Brown Corpus is also saved as .txt files to test the Corpus.build_from_texts method.
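
The .csv.gz layout described above (a gzip-compressed CSV with source and text columns) can be reproduced with the standard library alone. The following is a minimal sketch of that file format, not the code Conc uses: the helper names and the two example rows are placeholders, and it does not download anything from NLTK.

```python
import csv
import gzip
import os
import tempfile


def write_sources_csv_gz(rows, csv_gz_path):
    """Write (source, text) pairs to a gzip-compressed CSV with a header row."""
    with gzip.open(csv_gz_path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "text"])
        writer.writerows(rows)


def read_sources_csv_gz(csv_gz_path):
    """Read the compressed CSV back as a list of dicts keyed by column name."""
    with gzip.open(csv_gz_path, "rt", newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Placeholder rows for illustration; these are not real corpus texts.
example_rows = [
    ("a.txt", "First example text."),
    ("b.txt", "Second example text, with a comma."),
]

with tempfile.TemporaryDirectory() as tmp:
    example_path = os.path.join(tmp, "example.csv.gz")
    write_sources_csv_gz(example_rows, example_path)
    loaded = read_sources_csv_gz(example_path)
    print(loaded)
```

Opening the file in text mode with newline="" lets the csv module handle quoting and line endings itself, so commas and line breaks inside a text survive the round trip.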


get_garden_party

 get_garden_party (source_path:str)

Get corpus of The Garden Party by Katherine Mansfield for development of Conc and testing Conc functionality.

             Type  Details
source_path  str   path to location of sources for building corpora

The get_garden_party function downloads a zip file of an example corpus based on Katherine Mansfield's short stories. It also creates .tar and .tar.gz versions of the texts for testing Corpus build methods.

get_garden_party(source_path)
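
If you want the same kind of archives for your own texts, the standard library's tarfile module can produce both variants. This is a minimal sketch under assumptions: the helper name make_archives, the flat archive layout, and the placeholder files are all hypothetical, and the source does not say how get_garden_party names or structures its archives.

```python
import os
import tarfile
import tempfile


def make_archives(texts_dir, out_dir, stem):
    """Create <stem>.tar and <stem>.tar.gz from the files in texts_dir.

    Files are stored at the top level of each archive (an assumption).
    Returns the two archive paths in that order.
    """
    archive_paths = []
    for suffix, mode in ((".tar", "w"), (".tar.gz", "w:gz")):
        archive_path = os.path.join(out_dir, stem + suffix)
        with tarfile.open(archive_path, mode) as archive:
            for name in sorted(os.listdir(texts_dir)):
                archive.add(os.path.join(texts_dir, name), arcname=name)
        archive_paths.append(archive_path)
    return archive_paths


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder files standing in for a folder of texts.
    texts_dir = os.path.join(tmp, "texts")
    os.mkdir(texts_dir)
    for i in (1, 2):
        with open(os.path.join(texts_dir, f"story-{i}.txt"), "w", encoding="utf-8") as f:
            f.write(f"Placeholder text {i}.")

    tar_path, tar_gz_path = make_archives(texts_dir, tmp, "example")

    # List the members of each archive to confirm both hold the same files.
    with tarfile.open(tar_path, "r") as archive:
        plain_members = archive.getnames()
    with tarfile.open(tar_gz_path, "r:gz") as archive:
        gz_members = archive.getnames()
    print(plain_members, gz_members)
```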

Create large corpora for development and testing


get_large_dataset

 get_large_dataset (source_path:str)

Get 1 million rows of the dataset at https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset for testing.

             Type  Details
source_path  str   path to location of sources for building corpora

create_large_dataset_sizes

 create_large_dataset_sizes (source_path:str, sizes:list=[10000, 100000,
                             200000, 500000])

Create datasets of different sizes from data source retrieved by get_large_dataset for testing.

             Type  Default                          Details
source_path  str                                    path to location of sources for building corpora
sizes        list  [10000, 100000, 200000, 500000]  list of sizes for test datasets
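
The idea of cutting one large source into several test sizes can be sketched in a few lines. The version below is an illustration under stated assumptions, not Conc's method: the helper name make_subsets is hypothetical, each subset is simply the first n rows, and sizes larger than the data are skipped. The source does not say how create_large_dataset_sizes selects rows or handles oversized requests.

```python
def make_subsets(rows, sizes):
    """Return a dict mapping each size to the first `size` rows.

    Assumptions for illustration: subsets are leading slices, and any
    size larger than the available data is skipped rather than padded.
    """
    rows = list(rows)
    subsets = {}
    for size in sizes:
        if size > len(rows):
            continue  # not enough data for this size
        subsets[size] = rows[:size]
    return subsets


# Scaled-down demonstration with placeholder rows.
demo_rows = [f"speech {i}" for i in range(1000)]
demo_subsets = make_subsets(demo_rows, sizes=[10, 100, 500, 5000])
print({size: len(subset) for size, subset in demo_subsets.items()})
```

Because each subset is a prefix of the next, results on the smaller sets stay comparable with the larger ones, which is convenient for timing builds at increasing scale.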

Build sample corpora


build_sample_corpora

 build_sample_corpora (source_path:str, save_path:str,
                       force_rebuild:bool=False)

Build all test corpora from source files.

               Type  Default  Details
source_path    str            path to folder with source files
save_path      str            path to save corpora
force_rebuild  bool  False    force rebuild of corpora, useful for development and testing

The build_sample_corpora function downloads sources and creates sample corpora for development and release testing. These corpora are also a good way to get started with Conc. The sample corpora available are:

  • Brown Corpus (via NLTK)
  • Gutenberg Corpus (via NLTK)
  • Reuters Corpus (via NLTK)
  • Garden Party Corpus (Katherine Mansfield short stories)
  • Toy Corpus (6 very short texts for testing only)

After installing Conc you can invoke this function from the command line to download and build the sample corpora:

conc_build_sample_corpora path/to/save/sources path/to/save/corpora

Note: up to version 0.1.1, build_sample_corpora was accessible via conc.corpus as build_test_corpora. It is now only accessible from conc.corpora.
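
The same build can be run from Python. The sketch below wraps the two functions documented on this page; the wrapper name build_and_list_sample_corpora is hypothetical, and it assumes both functions are imported from conc.corpora with the signatures shown above. The import sits inside the function so the snippet can be loaded even where Conc is not installed.

```python
def build_and_list_sample_corpora(source_path: str, save_path: str,
                                  force_rebuild: bool = False):
    """Build the sample corpora, then return the table of corpora in save_path.

    Hypothetical convenience wrapper around build_sample_corpora and
    list_corpora, using the signatures documented on this page.
    """
    # Imported lazily so defining this function does not require Conc.
    from conc.corpora import build_sample_corpora, list_corpora

    build_sample_corpora(source_path, save_path, force_rebuild=force_rebuild)
    return list_corpora(save_path)
```

A call such as build_and_list_sample_corpora('path/to/save/sources', 'path/to/save/corpora') downloads the sources first, so it needs network access and may take a while on the first run.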
