
Anatomy of a corpus

Information on the Conc corpus format if you want to access the data directly.

Introduction

A Conc corpus is a directory containing files with specific names and formats to represent the data. This document provides an overview of the various files and what they contain. Here is the directory structure of an example Conc corpus:

└── garden-party.corpus
    ├── vocab.parquet
    ├── tokens.parquet
    ├── puncts.parquet
    ├── spaces.parquet
    ├── metadata.parquet
    ├── corpus.json
    └── README.md

Note: by default the library creates a directory with a .corpus suffix. The directory name is generated automatically at build time from a slugified version of the corpus name you assigned.

For example, if you passed in the name:

Garden Party Corpus

The directory will be:

garden-party.corpus

The directory can be renamed and still loaded. The .corpus extension is intended to make corpora on your filesystem easier to find or identify.

To distribute a corpus, send a zip of the directory for others to extract or just share the directory as-is.
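
For example, a minimal sketch using Python’s standard library to zip a corpus directory for sharing (paths are illustrative) …

import shutil

# create garden-party-corpus.zip containing the garden-party.corpus directory
shutil.make_archive('garden-party-corpus', 'zip', root_dir='.', base_dir='garden-party.corpus')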

Below is an overview of the files in a Conc corpus directory. The data can be accessed via Conc or directly from the files.

| File | Access via Conc | Description |
| --- | --- | --- |
| README.md | - | Human-readable information about the corpus to aid distribution |
| corpus.json | specific properties, e.g. corpus.token_count | Machine-readable information about the corpus, including name, description, various summary statistics, and models used to build the corpus |
| vocab.parquet | corpus.vocab | A table mapping token strings to token IDs and frequency information |
| tokens.parquet | corpus.tokens | A table of token positions, with tokens represented by numeric IDs, used to query the corpus |
| metadata.parquet | corpus.metadata | A table with metadata for each document (if there is any) |
| spaces.parquet | corpus.spaces | A table allowing recovery of document spacing without the original texts |
| puncts.parquet | corpus.puncts | A table with punctuation positions |

Below is more information about each file. You can, of course, work with a corpus through Conc, but you can also work with the processed parquet and JSON files directly. Conc works with parquet files using the Polars library, though other libraries also support the format. Python supports JSON natively, but faster libraries exist; Conc uses the msgspec library to read and write JSON.

Notes on specific Conc corpus files and data formats

The following information will help you if you want to work with the corpus data/files directly.

README.md

Below is an example of the README.md file generated by Conc.

Brown Corpus

About

This directory contains a corpus created using the Conc Python library.

Corpus Information

A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA Revised 1971, Revised and Amplified 1979 http://www.hit.uib.no/icame/brown/bcm.html. This version downloaded via NLTK https://www.nltk.org/nltk_data/.

Date created: 2025-06-15 22:33:08
Document count: 500
Token count: 1138566
Word token count: 980144
Unique tokens: 42930
Unique word tokens: 42907
Conc Version Number: 0.1.3
spaCy model: en_core_web_sm, version 3.8.0

Using this corpus

Conc can be installed via pip. The Conc documentation site has tutorials and detailed information to get you started with Conc or to work with the corpus data directly.

Cite Conc

If you use Conc in your work, please cite it as follows:

corpus.json

Below is the schema of the corpus.json file, showing the metadata saved with a corpus. These values are loaded by Conc as attributes via Corpus.load, and are created when you build a corpus using Corpus.build_from_files or Corpus.build_from_csv. The schema, which is used to validate the JSON data, gives the names and types of the attributes.

{'name': {'type': 'string'},
 'description': {'type': 'string'},
 'slug': {'type': 'string'},
 'conc_version': {'type': 'string'},
 'document_count': {'type': 'integer'},
 'token_count': {'type': 'integer'},
 'word_token_count': {'type': 'integer'},
 'punct_token_count': {'type': 'integer'},
 'space_token_count': {'type': 'integer'},
 'unique_tokens': {'type': 'integer'},
 'unique_word_tokens': {'type': 'integer'},
 'date_created': {'type': 'string'},
 'EOF_TOKEN': {'type': 'integer'},
 'SPACY_EOF_TOKEN': {'type': 'integer'},
 'SPACY_MODEL': {'type': 'string'},
 'SPACY_MODEL_VERSION': {'type': 'string'},
 'punct_tokens': {'type': 'array', 'items': {'type': 'integer'}},
 'space_tokens': {'type': 'array', 'items': {'type': 'integer'}}}
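
Since corpus.json is plain JSON, it can also be read without Conc. A minimal sketch using Python’s standard json module (the path variable is illustrative, matching the examples below) …

import json

# read the corpus metadata directly from corpus.json
with open(f'{path_to_brown_corpus}/corpus.json') as f:
    corpus_meta = json.load(f)

print(corpus_meta['name'])              # e.g. 'Brown Corpus'
print(corpus_meta['word_token_count'])  # e.g. 980144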

Once you have built or loaded a corpus you can access the attributes. For example …

corpus = Corpus().load(path_to_brown_corpus) # loading the Brown corpus
print(corpus.name) # accessing the name of the corpus
print('Word token count: ', corpus.word_token_count) # access word_token_count
Brown Corpus
Word token count:  980144

Some of these attributes are exposed by Corpus methods. For example …

corpus.info() # Polars dataframe with summary metadata
Attribute Value
"Name" "Brown Corpus"
"Description" "A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA Revised 1971, Revised and Amplified 1979 http://www.hit.uib.no/icame/brown/bcm.html. This version …
"Date Created" "2025-06-15 22:33:08"
"Conc Version" "0.1.3"
"Corpus Path" "/home/geoff/data/conc-test-corpora/brown.corpus"
"Document Count" "500"
"Token Count" "1,138,566"
"Word Token Count" "980,144"
"Unique Tokens" "42,930"
"Unique Word Tokens" "42,907"

vocab.parquet

The vocab parquet file contains …

  1. A lookup between token_id, token (string representation), and tokens_sort_order. The sort order allows sorting tokens alphabetically directly from token IDs (see the sketch after this list).
  2. A frequency table, with counts both for lower-cased tokens and for the orthographic forms of tokens as they appeared in the text.
  3. Information on the type of token (i.e. whether it is punctuation or space, or, if neither of those, a “word” token).
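
To illustrate point 1, here is a minimal sketch listing the first five tokens alphabetically via the precomputed tokens_sort_order column (using the corpus.vocab dataframe introduced just below) …

# sort alphabetically using the precomputed sort order, without
# sorting on the token strings themselves
corpus.vocab.sort('tokens_sort_order').head(5).collect(engine='streaming')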

If you have loaded a corpus in Conc, you can access the vocab parquet data as a Polars dataframe like this …

# corpus.vocab is a Polars dataframe
corpus.vocab.head(5).collect(engine='streaming')
rank tokens_sort_order token_id token frequency_lower frequency_orth is_punct is_space
1 50087 22848 "the" 63516 62473 false false
2 28 8128 "," 58331 58331 true false
3 41 38309 "." 49907 49907 true false
4 35232 2739 "of" 36321 36122 false false
5 3351 7126 "and" 27787 27633 false false

You can also access the vocab data directly from the parquet file using Polars (or other libraries that support parquet).

display(pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').head(5).collect(engine='streaming'))
rank tokens_sort_order token_id token frequency_lower frequency_orth is_punct is_space
1 50087 22848 "the" 63516 62473 false false
2 28 8128 "," 58331 58331 true false
3 41 38309 "." 49907 49907 true false
4 35232 2739 "of" 36321 36122 false false
5 3351 7126 "and" 27787 27633 false false

To illustrate how frequencies are stored, see the rows for ‘the’. The count for each form as it appeared in the text is stored in frequency_orth: ‘The’ appears 1,043 times and lowercase ‘the’ appears 62,473 times. The frequency_lower column gives the total count of ‘the’ regardless of case (63,516).

display(pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').filter(pl.col('token').str.to_lowercase() == 'the').head(5).collect(engine='streaming'))
rank tokens_sort_order token_id token frequency_lower frequency_orth is_punct is_space
1 50087 22848 "the" 63516 62473 false false
99 50086 15682 "The" null 1043 false false

Punctuation is included in the vocab table, but these tokens can be filtered in Conc reports. If you are working with the table directly, you can use is_punct to select or remove punctuation.

display(pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').filter(pl.col('is_punct') == True).head(5).collect(engine='streaming'))
rank tokens_sort_order token_id token frequency_lower frequency_orth is_punct is_space
2 28 8128 "," 58331 58331 true false
3 41 38309 "." 49907 49907 true false
12 1577 1601 "`" 9788 9788 true false
14 14 42833 "''" 8762 8762 true false
15 29 27963 "-" 8131 8131 true false

Conc also stores space tokens from spaCy’s tokenisation process: tokens that are sequences of whitespace characters. Space tokens are included in the vocab table without counts, and are explained in more detail below the table.

display(pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').filter(pl.col('is_space') == True).head(5).collect(engine='streaming'))
rank tokens_sort_order token_id token frequency_lower frequency_orth is_punct is_space
47165 1 2956 " " null null false true
47166 2 2799 " " null null false true
47167 3 27276 " " null null false true
47168 4 22812 " " null null false true
47169 5 4112 " " null null false true

A note about space tokens

When spaCy tokenises text it records whether each token is followed by a standard space character or not. Conc stores this information during build in the has_spaces column of the tokens table (see below). spaCy also creates tokens for other whitespace sequences; Conc documentation refers to these as space tokens. Space tokens may be useful for some sequence classification problems, but more importantly for Conc they allow re-representation of source documents in their original form, with newlines, tabs and sequences of whitespace preserved.

Space tokens are not included in overall token counts stored in the corpus.json file.
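
Since the corpus.json schema above stores word, punctuation and space token counts separately, one would expect token_count to equal the word and punctuation counts combined (an inference from the schema, not a documented guarantee). A quick check …

import json

with open(f'{path_to_brown_corpus}/corpus.json') as f:
    meta = json.load(f)

# space tokens are counted separately and excluded from the overall total
assert meta['token_count'] == meta['word_token_count'] + meta['punct_token_count']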

tokens.parquet

The tokens.parquet file contains a table representing tokens in the corpus. Whitespace tokens have been removed (see discussion of space tokens above). The tokens data can be directly accessed in Conc as a Polars dataframe …

# corpus.tokens is a Polars dataframe
corpus.tokens.with_row_index('position').filter(pl.col('position').is_between(99, 104)).collect(engine='streaming')
position orth_index lower_index token2doc_index has_spaces
99 46333 46333 -1 false
100 15682 22848 1 true
101 4361 41672 1 true
102 14610 29725 1 true
103 54713 49998 1 true
104 45742 19078 1 true

You can also access the tokens data directly from the parquet file using Polars (or other libraries that support parquet).

pl.scan_parquet(f'{path_to_brown_corpus}/tokens.parquet').with_row_index('position').filter(pl.col('position').is_between(99, 112)).collect(engine='streaming')
position orth_index lower_index token2doc_index has_spaces
99 46333 46333 -1 false
100 15682 22848 1 true
101 4361 41672 1 true
102 14610 29725 1 true
103 54713 49998 1 true
104 45742 19078 1 true
105 53250 53250 1 true
106 8699 35796 1 true
107 45680 45680 1 true
108 30305 30305 1 true
109 2739 2739 1 true
110 38486 35571 1 false
111 49732 49732 1 true
112 42720 42720 1 true

The columns are as follows:

  • the orth_index column stores the token_id of the original form of the token
  • the lower_index column stores the token_id of the lowercased form of the token
  • the token2doc_index column stores the document id assigned by Conc in the order the texts were processed (starting from index position 1). This is the same order as the metadata in the metadata.parquet file.
  • the has_spaces column stores a boolean value indicating if the token is followed by a standard space character or not.

To demarcate the start and end of documents, Conc uses an end of file token (EOF_TOKEN) in the orth_index and lower_index columns. The EOF_TOKEN is stored in corpus.json and is accessible as an attribute of a Corpus object.

The token2doc_index represents token positions outside texts in the corpus as -1.
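
For example, the EOF token id can be read from the corpus object, and boundary rows can be excluded by keeping only positions assigned to a document …

import polars as pl

print(corpus.EOF_TOKEN)  # token id of the end-of-file marker

# keep only tokens that belong to a document
# (token2doc_index is -1 for positions outside texts)
corpus.tokens.filter(pl.col('token2doc_index') > 0).head(5).collect(engine='streaming')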

The view below shows tokens joined with vocab, so you can see the token string and the attributes of the tokens.

pl.scan_parquet(f'{path_to_brown_corpus}/tokens.parquet').with_row_index('position').filter(pl.col('position').is_between(99, 121)).join(
    pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').select(pl.col('token_id'), pl.col('token'), pl.col('is_punct'), pl.col('is_space')),
    left_on='orth_index', right_on='token_id', how='left', maintain_order='left').collect(engine='streaming')
position orth_index lower_index token2doc_index has_spaces token is_punct is_space
99 46333 46333 -1 false " conc-end-of-file-token" false false
100 15682 22848 1 true "The" false false
101 4361 41672 1 true "Fulton" false false
102 14610 29725 1 true "County" false false
103 54713 49998 1 true "Grand" false false
104 45742 19078 1 true "Jury" false false
105 53250 53250 1 true "said" false false
106 8699 35796 1 true "Friday" false false
107 45680 45680 1 true "an" false false
108 30305 30305 1 true "investigation" false false
109 2739 2739 1 true "of" false false
110 38486 35571 1 false "Atlanta" false false
111 49732 49732 1 true "'s" false false
112 42720 42720 1 true "recent" false false
113 12294 12294 1 true "primary" false false
114 29461 29461 1 true "election" false false
115 42473 42473 1 true "produced" false false
116 1601 1601 1 false "`" true false
117 1601 1601 1 true "`" true false
118 39507 39507 1 true "no" false false
119 9335 9335 1 true "evidence" false false
120 42833 42833 1 true "''" true false
121 13607 13607 1 true "that" false false

spaces.parquet

Space tokens (see the note above) are stored in spaces.parquet, separate from word and punctuation tokens, for consistency with most corpus linguistics tools. The spaces table follows the format of the tokens table: each space is represented by its position and the token_id of the whitespace string. The original token sequences can be reconstructed from the tokens.parquet and spaces.parquet files, and Conc has functionality to recover specific texts by recombining the data.

From Conc …

# corpus.spaces is a Polars dataframe
corpus.spaces.head(3).collect(engine='streaming')
position orth_index lower_index token2doc_index has_spaces
100 27276 27276 1 false
345 2956 2956 1 false
637 2956 2956 1 false

Directly accessing the parquet file …

pl.scan_parquet(f'{path_to_brown_corpus}/spaces.parquet').head(3).collect(engine='streaming')
position orth_index lower_index token2doc_index has_spaces
100 27276 27276 1 false
345 2956 2956 1 false
637 2956 2956 1 false
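
As an illustration of how the pieces fit together (a sketch only, not Conc’s built-in text recovery), a document’s text can be approximated from tokens.parquet and vocab.parquet alone, using has_spaces to decide whether each token is followed by a space. Newlines, tabs and other whitespace sequences stored in spaces.parquet are not restored here …

import polars as pl

# approximate the text of document 1 without the stored space tokens
doc = (pl.scan_parquet(f'{path_to_brown_corpus}/tokens.parquet')
    .filter(pl.col('token2doc_index') == 1)
    .join(pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').select('token_id', 'token'),
          left_on='orth_index', right_on='token_id', how='left', maintain_order='left')
    .collect(engine='streaming'))

text = ''.join(token + (' ' if has_space else '')
               for token, has_space in zip(doc['token'], doc['has_spaces']))
print(text[:100])  # first 100 characters of the approximated text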

puncts.parquet

The puncts.parquet file stores an index of the positions of punctuation tokens in the corpus. Below are the first three rows of a puncts.parquet file; if you look above, these positions align with punctuation tokens in the tokens.parquet file. The index is intended for filtering tokens to exclude punctuation where necessary.

From Conc …

# corpus.puncts is a Polars dataframe
corpus.puncts.head(3).collect(engine='streaming')
position
116
117
120

Directly from the parquet file …

pl.scan_parquet(f'{path_to_brown_corpus}/puncts.parquet').head(3).collect(engine='streaming')
position
116
117
120
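
For example, a minimal sketch that anti-joins against puncts.parquet to drop punctuation tokens (the cast is defensive, since the stored dtype of position may differ from the row index added here) …

import polars as pl

tokens = pl.scan_parquet(f'{path_to_brown_corpus}/tokens.parquet').with_row_index('position')
puncts = pl.scan_parquet(f'{path_to_brown_corpus}/puncts.parquet').with_columns(
    pl.col('position').cast(pl.UInt32))  # with_row_index produces a UInt32 column

# keep only rows whose position is not listed in puncts.parquet
word_tokens = tokens.join(puncts, on='position', how='anti').collect(engine='streaming')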

metadata.parquet

The metadata.parquet file should not be confused with the metadata of the corpus itself, which is accessible via corpus.json.

If populated, the metadata.parquet file contains metadata for each document in the corpus.

From Conc you can access the metadata dataframe …

corpus = Corpus().load(path_to_congress_corpus) # loading a corpus with some metadata!
corpus.metadata.head(3).collect(engine='streaming')
speech_id date speaker chamber state
530182158 "1895-01-10T00:00:00.000000" "Mr. COCKRELL" "S" "Unknown"
890274849 "1966-08-31T00:00:00.000000" "Mr. LONG of Louisiana" "S" "Louisiana"
880088363 "1963-09-11T00:00:00.000000" "Mr. FULBRIGHT" "S" "Unknown"

Directly from the parquet file …

display(pl.scan_parquet(f'{path_to_congress_corpus}/metadata.parquet').head(3).collect(engine='streaming'))
speech_id date speaker chamber state
530182158 "1895-01-10T00:00:00.000000" "Mr. COCKRELL" "S" "Unknown"
890274849 "1966-08-31T00:00:00.000000" "Mr. LONG of Louisiana" "S" "Louisiana"
880088363 "1963-09-11T00:00:00.000000" "Mr. FULBRIGHT" "S" "Unknown"

Metadata rows appear in the same order as documents are stored in the tokens.parquet file: tokens with token2doc_index 1 correspond to the first metadata row.
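
For example, a sketch that attaches each document’s metadata to its tokens by numbering the metadata rows from 1 to match token2doc_index (the casts are defensive, to align the join key dtypes) …

import polars as pl

meta = (pl.scan_parquet(f'{path_to_congress_corpus}/metadata.parquet')
    .with_row_index('doc_id', offset=1)  # documents are numbered from 1
    .with_columns(pl.col('doc_id').cast(pl.Int64)))

tokens_with_meta = (pl.scan_parquet(f'{path_to_congress_corpus}/tokens.parquet')
    .with_columns(pl.col('token2doc_index').cast(pl.Int64))
    .join(meta, left_on='token2doc_index', right_on='doc_id', how='left')
    .head(5).collect(engine='streaming'))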

For corpora created from files using the Corpus.build_from_files method, there will always be a field for the source file at the time of creation.

corpus = Corpus().load(path_to_gardenparty_corpus)
display(pl.scan_parquet(f'{corpus.corpus_path}/metadata.parquet').head(3).collect(engine='streaming'))
file
"an-ideal-family.txt"
"at-the-bay.txt"
"bank-holiday.txt"