corpora

Functions to work with multiple corpora and to download and build sample corpora.

List available corpora

list_corpora

 list_corpora (path:str)

Scan a directory for available corpora

	Type	Details
path	str	path to the directory to scan for corpora
Returns	DataFrame	DataFrame with corpus, corpus name, format (Corpus or List Corpus), date created, document count, token count

print(list_corpora(save_path))

┌───────────────────────────┬─────────────────────┬─────────────┬─────────────────────┬────────────────┬─────────────┐
│ corpus                    ┆ name                ┆ format      ┆ date_created        ┆ document_count ┆ token_count │
╞═══════════════════════════╪═════════════════════╪═════════════╪═════════════════════╪════════════════╪═════════════╡
│ bnc.listcorpus            ┆ BNC                 ┆ List Corpus ┆ 2025-07-09 08:54:00 ┆ 4,049          ┆ 113,536,056 │
│ quake-stories-v2.corpus   ┆ Quake Stories v2    ┆ Corpus      ┆ 2025-07-02 09:42:43 ┆ 487            ┆ 472,876     │
│ gutenberg.corpus          ┆ Gutenberg Corpus    ┆ Corpus      ┆ 2025-07-09 09:22:17 ┆ 18             ┆ 2,546,286   │
│ brown.corpus              ┆ Brown Corpus        ┆ Corpus      ┆ 2025-07-09 09:21:45 ┆ 500            ┆ 1,138,566   │
│ brown.listcorpus          ┆ Brown Corpus        ┆ List Corpus ┆ 2025-07-09 09:21:45 ┆ 500            ┆ 1,138,566   │
│ bnc.corpus                ┆ BNC                 ┆ Corpus      ┆ 2025-07-09 08:54:00 ┆ 4,049          ┆ 113,536,056 │
│ introduce-yourself.corpus ┆ Introduce Yourself  ┆ Corpus      ┆ 2025-07-01 11:58:45 ┆ 28             ┆ 9,913       │
│ reuters.corpus            ┆ Reuters Corpus      ┆ Corpus      ┆ 2025-07-09 09:21:55 ┆ 10,788         ┆ 1,552,919   │
│ toy.listcorpus            ┆ Toy Corpus          ┆ List Corpus ┆ 2025-07-09 09:21:39 ┆ 6              ┆ 38          │
│ garden-party.corpus       ┆ Garden Party Corpus ┆ Corpus      ┆ 2025-07-09 09:22:19 ┆ 15             ┆ 74,664      │
│ toy.corpus                ┆ Toy Corpus          ┆ Corpus      ┆ 2025-07-09 09:21:39 ┆ 6              ┆ 38          │
│ garden-party.listcorpus   ┆ Garden Party Corpus ┆ List Corpus ┆ 2025-07-09 09:22:19 ┆ 15             ┆ 74,664      │
│ baby-bnc.listcorpus       ┆ Baby BNC            ┆ List Corpus ┆ 2025-07-09 08:44:05 ┆ 182            ┆ 4,674,632   │
│ baby-bnc.corpus           ┆ Baby BNC            ┆ Corpus      ┆ 2025-07-09 08:44:05 ┆ 182            ┆ 4,674,632   │
└───────────────────────────┴─────────────────────┴─────────────┴─────────────────────┴────────────────┴─────────────┘
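
As a rough illustration of what scanning a directory for corpora involves, the sketch below collects entries whose names end in .corpus or .listcorpus, the two suffixes visible in the listing above. It is not Conc's implementation: find_corpus_entries is a hypothetical helper, and treating the suffix as the only marker of a corpus is an assumption. It does not read the metadata (name, date, counts) that list_corpora reports.

```python
import os
import tempfile

# Suffixes taken from the listing above; treating them as the only
# marker of a corpus is an assumption made for this sketch.
CORPUS_FORMATS = {".corpus": "Corpus", ".listcorpus": "List Corpus"}


def find_corpus_entries(path: str) -> list[tuple[str, str]]:
    """Return (entry name, format) pairs for corpus-like entries in `path`."""
    found = []
    for name in sorted(os.listdir(path)):
        suffix = os.path.splitext(name)[1]
        if suffix in CORPUS_FORMATS:
            found.append((name, CORPUS_FORMATS[suffix]))
    return found


# Demonstrate on a throwaway directory with placeholder entries.
with tempfile.TemporaryDirectory() as demo_path:
    for name in ("toy.corpus", "toy.listcorpus", "notes.txt"):
        os.mkdir(os.path.join(demo_path, name))
    demo_entries = find_corpus_entries(demo_path)

print(demo_entries)
```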

Get data sources

create_toy_corpus_sources

 create_toy_corpus_sources (source_path:str)

Create txt files and csv to test build of toy corpus.

	Type	Details
source_path	str	path to location of sources for building corpora

show_toy_corpus

 show_toy_corpus (csv_path:str)

Show toy corpus in a table.

	Type	Details
csv_path	str	path to location of csv for building corpora
Returns	GT

show_toy_corpus(os.path.join(source_path, 'toy.csv'))

source	text	category	species
1.txt	The cat sat on the mat.	feline	cat
2.txt	The dog sat on the mat.	canine	dog
3.txt	The cat is meowing.	feline	cat
4.txt	The dog is barking.	canine	dog
5.txt	The cat is climbing a tree.	feline	cat
6.txt	The dog is digging a hole.	canine	dog
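
For readers who want to recreate this data without Conc, the following sketch writes the six texts and a toy.csv index using only the standard library. The rows are copied from the table above. write_toy_sources is a hypothetical helper, and placing the .txt files alongside toy.csv in one directory is an assumption, not a description of what create_toy_corpus_sources does internally.

```python
import csv
import os
import tempfile

# Rows copied from the table above: source, text, category, species.
TOY_ROWS = [
    ("1.txt", "The cat sat on the mat.", "feline", "cat"),
    ("2.txt", "The dog sat on the mat.", "canine", "dog"),
    ("3.txt", "The cat is meowing.", "feline", "cat"),
    ("4.txt", "The dog is barking.", "canine", "dog"),
    ("5.txt", "The cat is climbing a tree.", "feline", "cat"),
    ("6.txt", "The dog is digging a hole.", "canine", "dog"),
]


def write_toy_sources(source_path: str) -> str:
    """Write one .txt file per row plus a toy.csv index; return the CSV path."""
    os.makedirs(source_path, exist_ok=True)
    for source, text, _category, _species in TOY_ROWS:
        with open(os.path.join(source_path, source), "w", encoding="utf-8") as f:
            f.write(text)
    csv_path = os.path.join(source_path, "toy.csv")
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "text", "category", "species"])
        writer.writerows(TOY_ROWS)
    return csv_path


# Write to a throwaway directory and read the CSV back to check it.
with tempfile.TemporaryDirectory() as demo_path:
    toy_csv = write_toy_sources(demo_path)
    with open(toy_csv, newline="", encoding="utf-8") as f:
        toy_rows_read = list(csv.DictReader(f))
    toy_files = sorted(os.listdir(demo_path))

print(len(toy_rows_read), toy_files)
```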

get_nltk_corpus_sources

 get_nltk_corpus_sources (source_path:str)

Get NLTK corpora as sources for development or testing Conc functionality.

	Type	Details
source_path	str	path to location of sources for building corpora

The texts of the Brown Corpus from NLTK can be used to test Conc functionality. get_nltk_corpus_sources also prepares the Reuters and Gutenberg corpora. Running the function downloads the texts and saves them as .csv.gz files with two columns: source and text. The Brown Corpus is also saved as .txt files to test the Corpus.build_from_texts method.
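
To show what a source file in this layout looks like, the sketch below writes and then reads a small gzip-compressed CSV with the columns source and text. The file name sample.csv.gz, the placeholder rows, and the read_sources helper are all hypothetical; only the column names and the .csv.gz format come from the description above.

```python
import csv
import gzip
import os
import tempfile


def read_sources(csv_gz_path: str) -> list[dict[str, str]]:
    """Read a gzip-compressed CSV into a list of row dictionaries."""
    with gzip.open(csv_gz_path, "rt", newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


with tempfile.TemporaryDirectory() as demo_path:
    # Hypothetical file name and placeholder rows, for illustration only.
    sample_path = os.path.join(demo_path, "sample.csv.gz")
    with gzip.open(sample_path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["source", "text"])
        writer.writeheader()
        writer.writerow({"source": "a.txt", "text": "First placeholder text."})
        writer.writerow({"source": "b.txt", "text": "Second placeholder text."})
    sample_rows = read_sources(sample_path)

for row in sample_rows:
    print(row["source"], "->", row["text"])
```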

parse_bnc_to_csv

 parse_bnc_to_csv (source_path:str, save_path:str,
                   output_filename:str='bnc.csv.gz')

Converts BNC XML files, as distributed in the British National Corpus, XML edition and the British National Corpus, Baby edition, to a compressed CSV with title information retained.

	Type	Default	Details
source_path	str		path to location of sources for building corpora, this is likely to be a path ending with ‘Texts’
save_path	str		path to save the csv
output_filename	str	bnc.csv.gz	name of the output file e.g. bnc.csv.gz or bnc-baby.csv.gz

The 1994 version of the British National Corpus (BNC) is available from the Oxford Text Archive. The parse_bnc_to_csv function assumes you have downloaded and unzipped the files. You need to specify the directory containing the XML files for the texts, which will be a path ending with Texts. Conc does not provide a way to download the BNC zip files directly: you need to go to the Oxford Text Archive and read the notes about restricted use. Note: this function previously used NLTK’s BNC parser, but it has been rewritten to use lxml, which is 2x faster.
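
The general shape of the conversion, pulling a title and running text out of one BNC-style XML document, can be sketched as follows. This is not the code behind parse_bnc_to_csv: it uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra dependencies, SAMPLE_XML is a hand-written and heavily simplified stand-in for a real BNC file, and the output keys (source, title, text) and the bnc_xml_to_row helper are assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace that ElementTree uses for the predefined xml: prefix.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

# Hand-written, simplified stand-in for a BNC XML document (an assumption).
SAMPLE_XML = """<bncDoc xml:id="X00">
  <teiHeader><fileDesc><titleStmt><title>Placeholder title</title></titleStmt></fileDesc></teiHeader>
  <wtext type="OTHERPUB">
    <div><p>
      <s n="1"><w c5="AT0" hw="the" pos="ART">The </w><w c5="NN1" hw="cat" pos="SUBST">cat </w><w c5="VVD" hw="sit" pos="VERB">sat</w><c c5="PUN">.</c></s>
      <s n="2"><w c5="AT0" hw="the" pos="ART">The </w><w c5="NN1" hw="dog" pos="SUBST">dog </w><w c5="VVD" hw="bark" pos="VERB">barked</w><c c5="PUN">.</c></s>
    </p></div>
  </wtext>
</bncDoc>"""


def bnc_xml_to_row(xml_text: str) -> dict[str, str]:
    """Extract document id, title and plain text from one BNC-style document."""
    root = ET.fromstring(xml_text)
    title = root.findtext("teiHeader/fileDesc/titleStmt/title", default="")
    # Written texts sit under <wtext>, spoken texts under <stext>.
    body = root.find("wtext")
    if body is None:
        body = root.find("stext")
    sentences = []
    if body is not None:
        for sentence in body.iter("s"):
            # Word and punctuation elements carry their own spacing.
            sentences.append("".join(sentence.itertext()).strip())
    return {
        "source": root.get(XML_NS + "id", ""),
        "title": title.strip(),
        "text": " ".join(sentences),
    }


bnc_row = bnc_xml_to_row(SAMPLE_XML)
print(bnc_row)
```

Each row produced this way could then be appended to a gzip-compressed CSV with the standard csv and gzip modules.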

get_garden_party

 get_garden_party (source_path:str, create_archive_variations:bool=False)

Get corpus of The Garden Party by Katherine Mansfield for development of Conc and testing Conc functionality.

	Type	Default	Details
source_path	str		path to location of sources for building corpora
create_archive_variations	bool	False	create .tar and .tar.gz files for dev/testing (leave False if you just want the zip)

The get_garden_party function downloads a zip file of an example corpus based on Katherine Mansfield short stories. When create_archive_variations is True, it also creates a .tar and a .tar.gz version of the texts for testing Corpus build methods.

get_garden_party(source_path, create_archive_variations = True)
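
One way to derive .tar and .tar.gz variations from an existing zip, using only the standard library, is sketched below. It is an illustration under assumptions rather than the code used by get_garden_party: zip_to_tar is a hypothetical helper, and the archive and member names are placeholders.

```python
import os
import tarfile
import tempfile
import zipfile


def zip_to_tar(zip_path: str, tar_path: str, mode: str) -> None:
    """Copy every file in a zip archive into a tar archive ("w" or "w:gz")."""
    with zipfile.ZipFile(zip_path) as zf, tarfile.open(tar_path, mode) as tf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            member = tarfile.TarInfo(name=info.filename)
            member.size = info.file_size
            with zf.open(info) as src:
                tf.addfile(member, src)


with tempfile.TemporaryDirectory() as demo_path:
    # Build a small placeholder zip to convert.
    zip_path = os.path.join(demo_path, "texts.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("one.txt", "Placeholder text one.")
        zf.writestr("two.txt", "Placeholder text two.")

    # Create an uncompressed and a gzip-compressed tar from the same zip.
    tar_names = {}
    for suffix, mode in ((".tar", "w"), (".tar.gz", "w:gz")):
        tar_path = os.path.join(demo_path, "texts" + suffix)
        zip_to_tar(zip_path, tar_path, mode)
        with tarfile.open(tar_path) as tf:
            tar_names[suffix] = tf.getnames()

print(tar_names)
```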

Create large corpora for development and testing

get_large_dataset

 get_large_dataset (source_path:str)

Get 1m rows of https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset for testing.

	Type	Details
source_path	str	path to location of sources for building corpora

create_large_dataset_sizes

 create_large_dataset_sizes (source_path:str, sizes:list=[10000, 100000,
                             200000, 500000])

Create datasets of different sizes from data source retrieved by get_large_dataset for testing.

	Type	Default	Details
source_path	str		path to location of sources for building corpora
sizes	list	[10000, 100000, 200000, 500000]	list of sizes for test data-sets
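
The idea of deriving several test datasets of different sizes from one large source can be sketched as below. Everything specific here is an assumption: the write_subsets helper, the subset-N.csv.gz naming, the gzip-compressed CSV format, and the choice to take the first N rows (the real function may select rows differently). The demonstration uses tiny sizes so that it runs quickly.

```python
import csv
import gzip
import os
import tempfile


def write_subsets(full_path: str, out_dir: str, sizes: list[int]) -> list[str]:
    """Write the first n data rows of a gzipped CSV to a new file for each n."""
    with gzip.open(full_path, "rt", newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    paths = []
    for n in sizes:
        # Hypothetical naming scheme for the output files.
        out_path = os.path.join(out_dir, f"subset-{n}.csv.gz")
        with gzip.open(out_path, "wt", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(rows[:n])
        paths.append(out_path)
    return paths


with tempfile.TemporaryDirectory() as demo_path:
    # Placeholder "large" dataset with ten rows.
    full_path = os.path.join(demo_path, "full.csv.gz")
    with gzip.open(full_path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "text"])
        writer.writerows((f"{i}.txt", f"Placeholder text {i}.") for i in range(10))

    subset_counts = {}
    for path in write_subsets(full_path, demo_path, [2, 5]):
        with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
            # Subtract one for the header row.
            subset_counts[os.path.basename(path)] = sum(1 for _ in csv.reader(f)) - 1

print(subset_counts)
```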

Build sample corpora

build_sample_corpora

 build_sample_corpora (source_path:str, save_path:str,
                       force_rebuild:bool=False)

Build all test corpora from source files.

	Type	Default	Details
source_path	str		path to folder for corpus source files
save_path	str		path to save corpora
force_rebuild	bool	False	force rebuild of corpora, useful for development and testing

The build_sample_corpora function downloads sources and creates sample corpora used for development and for testing releases. These datasets are also a good way to get started with Conc. The sample corpora available are:

Brown Corpus (via NLTK)
Gutenberg Corpus (via NLTK)
Reuters Corpus (via NLTK)
Garden Party Corpus (Katherine Mansfield short stories)
Toy Corpus (6 very short texts for testing only)
Brown List Corpus (a lightweight ListCorpus version of the Brown Corpus for use with keywords functionality)

After installing Conc you can invoke this function from the command line to download and build the sample corpora:

conc_build_sample_corpora path/to/save/sources path/to/save/corpora

Note: up to version 0.1.1, build_sample_corpora was accessible via conc.corpus as build_test_corpora. It is now only accessible from conc.corpora.

build_sample_corpora(source_path, save_path, force_rebuild=False) # must be left as False after dev, otherwise could destroy corpora mid test in CI