corpora

Functions to work with multiple corpora and to download and build sample corpora.

List available corpora



list_corpora

 list_corpora (path:str)

Scan a directory for available corpora

Parameter  Type       Details
path       str        path to the directory to scan for corpora
Returns    DataFrame  DataFrame with path, corpus, format (Corpus or List Corpus), corpus name, document count, token count
print(list_corpora(save_path))
┌───────────────────────────┬─────────────────────┬─────────────┬─────────────────────┬────────────────┬─────────────┐
│ corpus                    ┆ name                ┆ format      ┆ date_created        ┆ document_count ┆ token_count │
╞═══════════════════════════╪═════════════════════╪═════════════╪═════════════════════╪════════════════╪═════════════╡
│ bnc.listcorpus            ┆ BNC                 ┆ List Corpus ┆ 2025-07-09 08:54:00 ┆ 4,049          ┆ 113,536,056 │
│ quake-stories-v2.corpus   ┆ Quake Stories v2    ┆ Corpus      ┆ 2025-07-02 09:42:43 ┆ 487            ┆ 472,876     │
│ gutenberg.corpus          ┆ Gutenberg Corpus    ┆ Corpus      ┆ 2025-07-09 09:22:17 ┆ 18             ┆ 2,546,286   │
│ brown.corpus              ┆ Brown Corpus        ┆ Corpus      ┆ 2025-07-09 09:21:45 ┆ 500            ┆ 1,138,566   │
│ brown.listcorpus          ┆ Brown Corpus        ┆ List Corpus ┆ 2025-07-09 09:21:45 ┆ 500            ┆ 1,138,566   │
│ bnc.corpus                ┆ BNC                 ┆ Corpus      ┆ 2025-07-09 08:54:00 ┆ 4,049          ┆ 113,536,056 │
│ introduce-yourself.corpus ┆ Introduce Yourself  ┆ Corpus      ┆ 2025-07-01 11:58:45 ┆ 28             ┆ 9,913       │
│ reuters.corpus            ┆ Reuters Corpus      ┆ Corpus      ┆ 2025-07-09 09:21:55 ┆ 10,788         ┆ 1,552,919   │
│ toy.listcorpus            ┆ Toy Corpus          ┆ List Corpus ┆ 2025-07-09 09:21:39 ┆ 6              ┆ 38          │
│ garden-party.corpus       ┆ Garden Party Corpus ┆ Corpus      ┆ 2025-07-09 09:22:19 ┆ 15             ┆ 74,664      │
│ toy.corpus                ┆ Toy Corpus          ┆ Corpus      ┆ 2025-07-09 09:21:39 ┆ 6              ┆ 38          │
│ garden-party.listcorpus   ┆ Garden Party Corpus ┆ List Corpus ┆ 2025-07-09 09:22:19 ┆ 15             ┆ 74,664      │
│ baby-bnc.listcorpus       ┆ Baby BNC            ┆ List Corpus ┆ 2025-07-09 08:44:05 ┆ 182            ┆ 4,674,632   │
│ baby-bnc.corpus           ┆ Baby BNC            ┆ Corpus      ┆ 2025-07-09 08:44:05 ┆ 182            ┆ 4,674,632   │
└───────────────────────────┴─────────────────────┴─────────────┴─────────────────────┴────────────────┴─────────────┘
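As a rough illustration of what scanning a directory for corpora involves, the sketch below finds entries whose names end in .corpus or .listcorpus and labels their format. It is not Conc's implementation: the helper name find_corpus_entries is hypothetical, the suffix-to-format mapping is inferred from the listing above, and the sketch only inspects names, whereas list_corpora also reports the corpus name, creation date, and document and token counts.

```python
from pathlib import Path

# Suffix-to-format mapping inferred from the listing above (an assumption).
FORMATS = {".corpus": "Corpus", ".listcorpus": "List Corpus"}


def find_corpus_entries(path: str) -> list[tuple[str, str]]:
    """Return (entry name, format) pairs for corpus-like entries in `path`.

    Hypothetical helper for illustration only; it does not read any
    corpus metadata.
    """
    entries = []
    for entry in sorted(Path(path).iterdir()):
        fmt = FORMATS.get(entry.suffix)
        if fmt is not None:  # skip anything that is not a corpus or list corpus
            entries.append((entry.name, fmt))
    return entries
```

For real use, call list_corpora itself, which returns the full table shown above.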

Get data sources



create_toy_corpus_sources

 create_toy_corpus_sources (source_path:str)

Create txt files and csv to test build of toy corpus.

Parameter    Type  Details
source_path  str   path to location of sources for building corpora


show_toy_corpus

 show_toy_corpus (csv_path:str)

Show toy corpus in a table.

Parameter  Type  Details
csv_path   str   path to location of csv for building corpora
Returns    GT
show_toy_corpus(os.path.join(source_path, 'toy.csv'))
source  text                          category  species
1.txt   The cat sat on the mat.       feline    cat
2.txt   The dog sat on the mat.       canine    dog
3.txt   The cat is meowing.           feline    cat
4.txt   The dog is barking.           canine    dog
5.txt   The cat is climbing a tree.   feline    cat
6.txt   The dog is digging a hole.    canine    dog
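To make the shape of the toy sources concrete, here is a minimal sketch that writes the six texts above as .txt files along with a toy.csv index. It is an illustration only, not the code behind create_toy_corpus_sources: the helper name write_toy_sources and the toy subfolder for the .txt files are assumptions, while the rows and the toy.csv file name come from the table and call above.

```python
import csv
from pathlib import Path

# Rows copied from the toy corpus table: (source, text, category, species).
TOY_ROWS = [
    ("1.txt", "The cat sat on the mat.", "feline", "cat"),
    ("2.txt", "The dog sat on the mat.", "canine", "dog"),
    ("3.txt", "The cat is meowing.", "feline", "cat"),
    ("4.txt", "The dog is barking.", "canine", "dog"),
    ("5.txt", "The cat is climbing a tree.", "feline", "cat"),
    ("6.txt", "The dog is digging a hole.", "canine", "dog"),
]


def write_toy_sources(source_path: str) -> Path:
    """Write one .txt file per text plus a toy.csv index; return the CSV path."""
    root = Path(source_path)
    text_dir = root / "toy"  # hypothetical folder name for the .txt files
    text_dir.mkdir(parents=True, exist_ok=True)
    for source, text, _category, _species in TOY_ROWS:
        (text_dir / source).write_text(text, encoding="utf-8")

    csv_path = root / "toy.csv"
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "text", "category", "species"])
        writer.writerows(TOY_ROWS)
    return csv_path
```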


get_nltk_corpus_sources

 get_nltk_corpus_sources (source_path:str)

Get NLTK corpora as sources for development or testing Conc functionality.

Parameter    Type  Details
source_path  str   path to location of sources for building corpora

The texts of the Brown Corpus from NLTK can be used to test Conc functionality. get_nltk_corpus_sources also prepares the Reuters and Gutenberg corpora. Running the function downloads the texts and saves them as .csv.gz files with two columns: source and text. The Brown Corpus is also saved as .txt files to test the Corpus.build_from_texts method.
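The .csv.gz layout described above (a gzip-compressed CSV with source and text columns) can be written and read back with the standard library alone. The sketch below shows the file format only; the helper names are hypothetical, and it does not download anything from NLTK.

```python
import csv
import gzip


def write_sources_csv_gz(rows, csv_gz_path):
    """Write (source, text) pairs to a gzip-compressed CSV with a header row."""
    with gzip.open(csv_gz_path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "text"])
        writer.writerows(rows)


def read_sources_csv_gz(csv_gz_path):
    """Read the file back as a list of dicts keyed by 'source' and 'text'."""
    with gzip.open(csv_gz_path, "rt", newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Opening the file in text mode with newline="" lets the csv module handle quoting of texts that contain commas or line breaks.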



parse_bnc_to_csv

 parse_bnc_to_csv (source_path:str, save_path:str,
                   output_filename:str='bnc.csv.gz')

Converts BNC XML files (available as the British National Corpus, XML edition, and the British National Corpus, Baby edition) to a compressed CSV, retaining title information.

Parameter        Type  Default     Details
source_path      str               path to location of sources for building corpora; this is likely to be a path ending with ‘Texts’
save_path        str               path to save the csv
output_filename  str   bnc.csv.gz  name of the output file, e.g. bnc.csv.gz or bnc-baby.csv.gz

The 1994 version of the British National Corpus (BNC) is available from the Oxford Text Archive. The parse_bnc_to_csv function assumes you have downloaded and unzipped the files. You need to specify the directory containing the XML files for the texts; this will be a path ending with Texts. Conc does not provide a way to download the BNC zip files directly. You need to go to the Oxford Text Archive and read the notes about restricted use. Note: this function previously used NLTK’s BNC parser, but it has been rewritten to use lxml, which is about 2x faster.
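The general idea of turning a tree of XML files into one compressed CSV can be sketched as follows. This is not the code in parse_bnc_to_csv: it uses the standard library's xml.etree.ElementTree rather than lxml, the helper name is hypothetical, and the element names (title, wtext, stext) and output columns (source, title, text) are assumptions about the data rather than confirmed details of Conc's output.

```python
import csv
import gzip
import xml.etree.ElementTree as ET
from pathlib import Path


def xml_texts_to_csv_gz(texts_dir: str, output_path: str) -> int:
    """Write one CSV row per XML file under `texts_dir`; return the row count."""
    count = 0
    with gzip.open(output_path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "title", "text"])  # assumed column names
        for xml_file in sorted(Path(texts_dir).rglob("*.xml")):
            root = ET.parse(xml_file).getroot()
            title = root.findtext(".//title", default="").strip()
            # Assumption: the document body is a <wtext> (written) or
            # <stext> (spoken) element.
            body = root.find(".//wtext")
            if body is None:
                body = root.find(".//stext")
            text = "".join(body.itertext()) if body is not None else ""
            # Collapse runs of whitespace left over from the markup.
            writer.writerow([xml_file.name, title, " ".join(text.split())])
            count += 1
    return count
```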



get_garden_party

 get_garden_party (source_path:str, create_archive_variations:bool=False)

Get corpus of The Garden Party by Katherine Mansfield for development of Conc and testing Conc functionality.

Parameter                  Type  Default  Details
source_path                str            path to location of sources for building corpora
create_archive_variations  bool  False    create .tar and .tar.gz files for dev/testing (leave False if you just want the zip)

The get_garden_party function downloads a zip file of an example corpus based on Katherine Mansfield short stories. When create_archive_variations is True, it also creates .tar and .tar.gz versions of the texts for testing Corpus build methods.

get_garden_party(source_path, create_archive_variations = True)
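For readers curious about how .tar and .tar.gz variations of a zip archive can be produced, here is a minimal standard-library sketch. It is not the code inside get_garden_party; the helper name zip_to_tar_variants and the choice to write the new archives next to the zip are assumptions made for illustration.

```python
import tarfile
import tempfile
import zipfile
from pathlib import Path


def zip_to_tar_variants(zip_path: str) -> tuple[Path, Path]:
    """Repack a zip archive as .tar and .tar.gz files in the same folder."""
    zip_path = Path(zip_path)
    tar_path = zip_path.parent / (zip_path.stem + ".tar")
    tar_gz_path = zip_path.parent / (zip_path.stem + ".tar.gz")

    with tempfile.TemporaryDirectory() as tmp:
        # Unpack once, then add the extracted items to each tar variant.
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(tmp)
        for out_path, mode in ((tar_path, "w"), (tar_gz_path, "w:gz")):
            with tarfile.open(out_path, mode) as tf:
                for item in sorted(Path(tmp).iterdir()):
                    tf.add(item, arcname=item.name)
    return tar_path, tar_gz_path
```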

Create large corpora for development and testing



get_large_dataset

 get_large_dataset (source_path:str)

Get 1 million rows of https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset for testing.

Parameter    Type  Details
source_path  str   path to location of sources for building corpora


create_large_dataset_sizes

 create_large_dataset_sizes (source_path:str, sizes:list=[10000, 100000,
                             200000, 500000])

Create datasets of different sizes from data source retrieved by get_large_dataset for testing.

Parameter    Type  Default                           Details
source_path  str                                     path to location of sources for building corpora
sizes        list  [10000, 100000, 200000, 500000]   list of sizes for test datasets
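One simple way to derive datasets of several sizes from a single large file is to copy the header and the first N rows for each requested size. The sketch below does that for a plain CSV. It is an assumption-laden illustration: the page does not say which file format create_large_dataset_sizes works with, nor whether it takes the first rows or samples them, and the helper name and output file naming are hypothetical.

```python
import csv
from itertools import islice
from pathlib import Path


def write_size_variants(csv_path: str, sizes: list[int]) -> list[Path]:
    """Write one CSV per size containing the header plus the first `size` rows."""
    csv_path = Path(csv_path)
    outputs = []
    for size in sizes:
        # Hypothetical naming scheme: data.csv -> data-10000.csv
        out_path = csv_path.with_name(f"{csv_path.stem}-{size}.csv")
        with open(csv_path, newline="", encoding="utf-8") as src, \
                open(out_path, "w", newline="", encoding="utf-8") as dst:
            reader = csv.reader(src)
            writer = csv.writer(dst)
            writer.writerow(next(reader))           # copy the header row
            writer.writerows(islice(reader, size))  # then the first `size` rows
        outputs.append(out_path)
    return outputs
```

Taking the first rows keeps the smaller datasets as prefixes of the larger ones, which makes timing comparisons across sizes easier to interpret.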

Build sample corpora



build_sample_corpora

 build_sample_corpora (source_path:str, save_path:str,
                       force_rebuild:bool=False)

Build all test corpora from source files.

Parameter      Type  Default  Details
source_path    str            path to folder with corpora
save_path      str            path to save corpora
force_rebuild  bool  False    force rebuild of corpora, useful for development and testing

The build_sample_corpora function downloads sources and builds the sample corpora used for development and release testing. These corpora are also a good way to get started with Conc. The available sample corpora are:

  • Brown Corpus (via NLTK)
  • Gutenberg Corpus (via NLTK)
  • Reuters Corpus (via NLTK)
  • Garden Party Corpus (Katherine Mansfield short stories)
  • Toy Corpus (6 very short texts for testing only)
  • Brown List Corpus (a lightweight ListCorpus version of the Brown Corpus for use with keywords functionality)

After installing Conc you can invoke this function from the command line to download and build the sample corpora:

conc_build_sample_corpora path/to/save/sources path/to/save/corpora

Note: up to version 0.1.1, build_sample_corpora was accessible via conc.corpus as build_test_corpora. It is now accessible only from conc.corpora.

# set_logger_state('verbose')
build_sample_corpora(source_path, save_path, force_rebuild=False) # must be left as False after dev, otherwise could destroy corpora mid test in CI
# set_logger_state('quiet')
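The warning about force_rebuild reflects a common pattern: skip the build when the output already exists, and only delete and rebuild when explicitly asked. A minimal sketch of that pattern follows. It is not Conc's implementation: the helper name build_if_needed and the build callback are hypothetical, and it assumes a built corpus is a directory on disk.

```python
import shutil
from pathlib import Path
from typing import Callable


def build_if_needed(corpus_path: str, build: Callable[[Path], None],
                    force_rebuild: bool = False) -> bool:
    """Run `build` unless `corpus_path` already exists; return True if it ran.

    With force_rebuild=True an existing directory is deleted first, which is
    why forcing a rebuild while other code is reading the corpus is risky.
    """
    path = Path(corpus_path)
    if path.exists():
        if not force_rebuild:
            return False  # keep the existing corpus untouched
        shutil.rmtree(path)
    build(path)
    return True
```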