Anatomy of a corpus
Introduction
A Conc corpus is a directory containing files with specific names and formats to represent the data. This document provides an overview of the various files and what they contain. Here is the directory structure of an example Conc corpus:
└── garden-party.corpus
    ├── vocab.parquet
    ├── tokens.parquet
    ├── puncts.parquet
    ├── spaces.parquet
    ├── metadata.parquet
    ├── corpus.json
    └── README.md
Note: by default the library creates a directory with the .corpus suffix. The directory name is created automatically on build based on a slugified version of the corpus name you assigned.
For example, if you passed in the name:
Garden Party Corpus
The directory will be:
garden-party.corpus
The directory can be renamed and still loaded. The .corpus extension is intended to make corpora on your filesystem easier to find or identify.
To distribute a corpus, send a zip of the directory for others to extract or just share the directory as-is.
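For example, a quick way to create a zip archive of a corpus directory using Python's standard library (the paths here are illustrative, not part of Conc):

import shutil

# creates garden-party.corpus.zip containing the corpus directory
shutil.make_archive('garden-party.corpus', 'zip', root_dir='.', base_dir='garden-party.corpus')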
Below is an overview of the files in a Conc corpus directory. The data can be accessed via Conc or directly from the files.
File | Access via Conc | Description |
---|---|---|
README.md | - | Human readable information about the corpus to aid distribution |
corpus.json | specific properties e.g. corpus.token_count | Machine readable information about the corpus, including name, description, various summary statistics, and models used to build the corpus |
vocab.parquet | corpus.vocab | A table mapping token strings to token IDs and frequency information |
tokens.parquet | corpus.tokens | A table of token positions used to query the corpus, with tokens represented by numeric IDs |
metadata.parquet | corpus.metadata | A table with metadata for each document (if there is any) |
spaces.parquet | corpus.spaces | A table to allow recovery of document spacing without the original texts |
puncts.parquet | corpus.puncts | A table with punctuation positions |
Below is more information about each file. You can work with a corpus via Conc, but you can also work with the processed corpus parquet and JSON files directly. Conc works with parquet files using the Polars library, but other libraries also support the format. Python provides native support for JSON, though there are more efficient libraries; Conc uses the msgspec library to read and write JSON.
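As a minimal sketch of direct access (path_to_corpus is an illustrative variable pointing at a corpus directory), you can read corpus.json with Python's built-in json module and lazily scan a parquet file with Polars:

import json
import polars as pl

with open(f'{path_to_corpus}/corpus.json') as f:
    meta = json.load(f)  # plain dict of corpus metadata
print(meta['name'], meta['token_count'])

vocab = pl.scan_parquet(f'{path_to_corpus}/vocab.parquet')  # lazy scan, nothing read yet
print(vocab.head(5).collect(engine='streaming'))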
Notes on specific Conc corpus files and data formats
The following information will help you if you want to work with the corpus data/files directly.
README.md
Below is an example of the README.md file generated by Conc.
Brown Corpus
About
This directory contains a corpus created using the Conc Python library.
Corpus Information
A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA Revised 1971, Revised and Amplified 1979 http://www.hit.uib.no/icame/brown/bcm.html. This version downloaded via NLTK https://www.nltk.org/nltk_data/.
Date created: 2025-06-15 22:33:08
Document count: 500
Token count: 1138566
Word token count: 980144
Unique tokens: 42930
Unique word tokens: 42907
Conc Version Number: 0.1.3
spaCy model: en_core_web_sm, version 3.8.0
Using this corpus
Conc can be installed via pip. The Conc documentation site has tutorials and detailed information to get you started with Conc or to work with the corpus data directly.
Cite Conc
If you use Conc in your work, please cite it as follows:
corpus.json file
Below is the schema of the corpus.json file showing metadata saved with a corpus. These are loaded by Conc as attributes using Corpus.load, or are created when you build a corpus using Corpus.build_from_files or Corpus.build_from_csv. The schema used to validate the JSON data represents the names and types of the attributes.
{'name': {'type': 'string'},
'description': {'type': 'string'},
'slug': {'type': 'string'},
'conc_version': {'type': 'string'},
'document_count': {'type': 'integer'},
'token_count': {'type': 'integer'},
'word_token_count': {'type': 'integer'},
'punct_token_count': {'type': 'integer'},
'space_token_count': {'type': 'integer'},
'unique_tokens': {'type': 'integer'},
'unique_word_tokens': {'type': 'integer'},
'date_created': {'type': 'string'},
'EOF_TOKEN': {'type': 'integer'},
'SPACY_EOF_TOKEN': {'type': 'integer'},
'SPACY_MODEL': {'type': 'string'},
'SPACY_MODEL_VERSION': {'type': 'string'},
'punct_tokens': {'type': 'array', 'items': {'type': 'integer'}},
'space_tokens': {'type': 'array', 'items': {'type': 'integer'}}}
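As a quick sketch, you could check a corpus.json file against the key names in the schema above (using the illustrative path variable from the examples below):

import json

expected = {'name', 'description', 'slug', 'conc_version', 'document_count',
            'token_count', 'word_token_count', 'punct_token_count', 'space_token_count',
            'unique_tokens', 'unique_word_tokens', 'date_created', 'EOF_TOKEN',
            'SPACY_EOF_TOKEN', 'SPACY_MODEL', 'SPACY_MODEL_VERSION',
            'punct_tokens', 'space_tokens'}

with open(f'{path_to_brown_corpus}/corpus.json') as f:
    data = json.load(f)
print('missing keys:', expected - data.keys() or 'none')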
Once you have built or loaded a corpus you can access the attributes. For example …
corpus = Corpus().load(path_to_brown_corpus) # loading the Brown corpus
print(corpus.name) # accessing the name of the corpus
print('Word token count: ', corpus.word_token_count) # access word_token_count
Brown Corpus
Word token count: 980144
Some of these attributes are exposed by Corpus methods. For example …
corpus.info() # Polars dataframe with summary metadata
Attribute | Value |
---|---|
"Name" | "Brown Corpus" |
"Description" | "A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA Revised 1971, Revised and Amplified 1979 http://www.hit.uib.no/icame/brown/bcm.html. This version … |
"Date Created" | "2025-06-15 22:33:08" |
"Conc Version" | "0.1.3" |
"Corpus Path" | "/home/geoff/data/conc-test-corpora/brown.corpus" |
"Document Count" | "500" |
"Token Count" | "1,138,566" |
"Word Token Count" | "980,144" |
"Unique Tokens" | "42,930" |
"Unique Word Tokens" | "42,907" |
vocab.parquet
The vocab parquet file contains …
- A lookup between token_id, token (string representation), and tokens_sort_order. The sort order allows sorting tokens alphabetically directly from token ids (see the example below).
- A frequency table, with counts for lower-cased tokens and for the orthographic realisation of tokens as they appeared in the text.
- Information on the type of token (i.e. whether punctuation or space - or, if neither of those, a “word” token).
If you have loaded a corpus in Conc, you can access the vocab parquet data as a Polars dataframe like this …
# corpus.vocab is a Polars dataframe
corpus.vocab.head(5).collect(engine='streaming')
rank | tokens_sort_order | token_id | token | frequency_lower | frequency_orth | is_punct | is_space |
---|---|---|---|---|---|---|---|
1 | 50087 | 22848 | "the" | 63516 | 62473 | false | false |
2 | 28 | 8128 | "," | 58331 | 58331 | true | false |
3 | 41 | 38309 | "." | 49907 | 49907 | true | false |
4 | 35232 | 2739 | "of" | 36321 | 36122 | false | false |
5 | 3351 | 7126 | "and" | 27787 | 27633 | false | false |
You can also access the vocab data directly from the parquet file using Polars (or other libraries that support parquet).
display(pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').head(5).collect(engine='streaming'))
rank | tokens_sort_order | token_id | token | frequency_lower | frequency_orth | is_punct | is_space |
---|---|---|---|---|---|---|---|
1 | 50087 | 22848 | "the" | 63516 | 62473 | false | false |
2 | 28 | 8128 | "," | 58331 | 58331 | true | false |
3 | 41 | 38309 | "." | 49907 | 49907 | true | false |
4 | 35232 | 2739 | "of" | 36321 | 36122 | false | false |
5 | 3351 | 7126 | "and" | 27787 | 27633 | false | false |
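The tokens_sort_order column can be used to order the vocab alphabetically without comparing token strings at query time. For example (a sketch; output not shown) …

pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').sort('tokens_sort_order').head(5).collect(engine='streaming')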
To illustrate how frequencies are stored, see the rows for ‘the’ below. Counts for the form as it appeared in the text are stored in frequency_orth: ‘The’ appears 1,043 times and lowercase ‘the’ appears 62,473 times. The frequency_lower column provides the total count of ‘the’ regardless of case.
display(pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').filter(pl.col('token').str.to_lowercase() == 'the').head(5).collect(engine='streaming'))
rank | tokens_sort_order | token_id | token | frequency_lower | frequency_orth | is_punct | is_space |
---|---|---|---|---|---|---|---|
1 | 50087 | 22848 | "the" | 63516 | 62473 | false | false |
99 | 50086 | 15682 | "The" | null | 1043 | false | false |
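Given these column semantics, frequency_lower should equal the sum of frequency_orth over the case variants of a token. A minimal check for ‘the’, assuming the semantics described above …

(pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet')
    .filter(pl.col('token').str.to_lowercase() == 'the')
    .select(pl.col('frequency_orth').sum())  # expect 62473 + 1043 = 63516, matching frequency_lower
    .collect(engine='streaming'))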
Punctuation is included in the token table, but these tokens can be filtered in Conc reports. If you are working with the table directly, you can use is_punct to access or remove punctuation.
display(pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').filter(pl.col('is_punct') == True).head(5).collect(engine='streaming'))
rank | tokens_sort_order | token_id | token | frequency_lower | frequency_orth | is_punct | is_space |
---|---|---|---|---|---|---|---|
2 | 28 | 8128 | "," | 58331 | 58331 | true | false |
3 | 41 | 38309 | "." | 49907 | 49907 | true | false |
12 | 1577 | 1601 | "`" | 9788 | 9788 | true | false |
14 | 14 | 42833 | "''" | 8762 | 8762 | true | false |
15 | 29 | 27963 | "-" | 8131 | 8131 | true | false |
Conc also stores space tokens from spaCy’s tokenisation process, which are sequences of whitespace characters. Space tokens are included in the vocab table, but without counts. Space tokens are explained in more detail below the table.
display(pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').filter(pl.col('is_space') == True).head(5).collect(engine='streaming'))
rank | tokens_sort_order | token_id | token | frequency_lower | frequency_orth | is_punct | is_space |
---|---|---|---|---|---|---|---|
47165 | 1 | 2956 | " " | null | null | false | true |
47166 | 2 | 2799 | " " | null | null | false | true |
47167 | 3 | 27276 | " " | null | null | false | true |
47168 | 4 | 22812 | " " | null | null | false | true |
47169 | 5 | 4112 | " " | null | null | false | true |
A note about space tokens
When spaCy tokenises text it outputs whether each token is followed by a standard space character or not. Conc stores this information during build in the has_spaces column of the tokens table (see below). spaCy also creates tokens for other whitespace sequences; Conc documentation refers to these as space tokens. Space tokens may be useful for some sequence classification problems, but more importantly for Conc, they allow re-creation of source documents in their original form, with newlines, tabs and sequences of whitespace preserved.
Space tokens are not included in overall token counts stored in the corpus.json file.
tokens.parquet
The tokens.parquet file contains a table representing tokens in the corpus. Whitespace tokens have been removed (see the discussion of space tokens above). The tokens data can be directly accessed in Conc as a Polars dataframe …
# corpus.tokens is a Polars dataframe
corpus.tokens.with_row_index('position').filter(pl.col('position').is_between(99, 104)).collect(engine='streaming')
position | orth_index | lower_index | token2doc_index | has_spaces |
---|---|---|---|---|
99 | 46333 | 46333 | -1 | false |
100 | 15682 | 22848 | 1 | true |
101 | 4361 | 41672 | 1 | true |
102 | 14610 | 29725 | 1 | true |
103 | 54713 | 49998 | 1 | true |
104 | 45742 | 19078 | 1 | true |
You can also access the tokens data directly from the parquet file using Polars (or other libraries that support parquet).
pl.scan_parquet(f'{path_to_brown_corpus}/tokens.parquet').with_row_index('position').filter(pl.col('position').is_between(99, 112)).collect(engine='streaming')
position | orth_index | lower_index | token2doc_index | has_spaces |
---|---|---|---|---|
99 | 46333 | 46333 | -1 | false |
100 | 15682 | 22848 | 1 | true |
101 | 4361 | 41672 | 1 | true |
102 | 14610 | 29725 | 1 | true |
103 | 54713 | 49998 | 1 | true |
104 | 45742 | 19078 | 1 | true |
105 | 53250 | 53250 | 1 | true |
106 | 8699 | 35796 | 1 | true |
107 | 45680 | 45680 | 1 | true |
108 | 30305 | 30305 | 1 | true |
109 | 2739 | 2739 | 1 | true |
110 | 38486 | 35571 | 1 | false |
111 | 49732 | 49732 | 1 | true |
112 | 42720 | 42720 | 1 | true |
The columns are as follows:
- the orth_index column stores the token_id of the original form of the token
- the lower_index column stores the token_id of the lowercased form of the token
- the token2doc_index column stores the document id assigned by Conc in the order the texts were processed (starting from index position 1) - this is the same order as the metadata in the metadata.parquet file
- the has_spaces column stores a boolean value indicating whether the token is followed by a standard space character or not
To demarcate the start and end of documents, Conc uses an end-of-file token (EOF_TOKEN) in the orth_index and lower_index columns. The EOF_TOKEN is stored in corpus.json and is accessible as an attribute of a Corpus object. The token2doc_index column represents token positions outside texts in the corpus as -1.
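As a sketch, you can locate document boundaries by filtering the tokens table for EOF_TOKEN positions (this assumes the EOF_TOKEN attribute on a loaded Corpus object, as described above) …

(corpus.tokens.with_row_index('position')
    .filter(pl.col('orth_index') == corpus.EOF_TOKEN)  # rows marking document boundaries
    .select('position')
    .head(3)
    .collect(engine='streaming'))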
The view below shows tokens joined with vocab, so you can see the token string and the attributes of the tokens.
pl.scan_parquet(f'{path_to_brown_corpus}/tokens.parquet').with_row_index('position').filter(pl.col('position').is_between(99, 121)).join(
    pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').select(pl.col('token_id'), pl.col('token'), pl.col('is_punct'), pl.col('is_space')),
    left_on='orth_index', right_on='token_id', how='left', maintain_order='left').collect(engine='streaming')
position | orth_index | lower_index | token2doc_index | has_spaces | token | is_punct | is_space |
---|---|---|---|---|---|---|---|
99 | 46333 | 46333 | -1 | false | " conc-end-of-file-token" | false | false |
100 | 15682 | 22848 | 1 | true | "The" | false | false |
101 | 4361 | 41672 | 1 | true | "Fulton" | false | false |
102 | 14610 | 29725 | 1 | true | "County" | false | false |
103 | 54713 | 49998 | 1 | true | "Grand" | false | false |
104 | 45742 | 19078 | 1 | true | "Jury" | false | false |
105 | 53250 | 53250 | 1 | true | "said" | false | false |
106 | 8699 | 35796 | 1 | true | "Friday" | false | false |
107 | 45680 | 45680 | 1 | true | "an" | false | false |
108 | 30305 | 30305 | 1 | true | "investigation" | false | false |
109 | 2739 | 2739 | 1 | true | "of" | false | false |
110 | 38486 | 35571 | 1 | false | "Atlanta" | false | false |
111 | 49732 | 49732 | 1 | true | "'s" | false | false |
112 | 42720 | 42720 | 1 | true | "recent" | false | false |
113 | 12294 | 12294 | 1 | true | "primary" | false | false |
114 | 29461 | 29461 | 1 | true | "election" | false | false |
115 | 42473 | 42473 | 1 | true | "produced" | false | false |
116 | 1601 | 1601 | 1 | false | "`" | true | false |
117 | 1601 | 1601 | 1 | true | "`" | true | false |
118 | 39507 | 39507 | 1 | true | "no" | false | false |
119 | 9335 | 9335 | 1 | true | "evidence" | false | false |
120 | 42833 | 42833 | 1 | true | "''" | true | false |
121 | 13607 | 13607 | 1 | true | "that" | false | false |
spaces.parquet
Space tokens (see the note above) are stored in spaces.parquet, separately from word and punctuation tokens. This allows consistency with most tools for corpus linguistics. The spaces table follows the format of the tokens table: spaces are represented using position and the corresponding token_id of the whitespace. The original token sequences can be reconstructed from the tokens.parquet and spaces.parquet files, and Conc has functionality to recover specific texts by recombining the data.
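Conc’s text-recovery functionality is the reliable route, but as a rough sketch of the idea, the word and punctuation text of one document can be rebuilt from tokens.parquet and vocab.parquet using has_spaces; exact whitespace would additionally require splicing the spaces.parquet rows back in by position, which Conc handles for you (doc_id is illustrative):

import polars as pl

doc_id = 1  # illustrative document id
vocab = pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').select('token_id', 'token')
doc = (pl.scan_parquet(f'{path_to_brown_corpus}/tokens.parquet')
    .filter(pl.col('token2doc_index') == doc_id)
    .join(vocab, left_on='orth_index', right_on='token_id', how='left', maintain_order='left')
    .select('token', 'has_spaces')
    .collect(engine='streaming'))
# append a standard space after tokens flagged by has_spaces
text = ''.join(t + (' ' if s else '') for t, s in doc.iter_rows())
print(text[:100])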
From Conc …
# corpus.spaces is a Polars dataframe
corpus.spaces.head(3).collect(engine='streaming')
position | orth_index | lower_index | token2doc_index | has_spaces |
---|---|---|---|---|
100 | 27276 | 27276 | 1 | false |
345 | 2956 | 2956 | 1 | false |
637 | 2956 | 2956 | 1 | false |
Directly accessing the parquet file …
pl.scan_parquet(f'{path_to_brown_corpus}/spaces.parquet').head(3).collect(engine='streaming')
position | orth_index | lower_index | token2doc_index | has_spaces |
---|---|---|---|---|
100 | 27276 | 27276 | 1 | false |
345 | 2956 | 2956 | 1 | false |
637 | 2956 | 2956 | 1 | false |
puncts.parquet
The puncts.parquet file stores an index of the positions of punctuation tokens in the corpus. Below are the first three rows of a puncts.parquet file. If you look above, these positions align with punctuation tokens in the tokens.parquet file. The index is intended for filtering tokens to exclude punctuation where necessary.
From Conc …
# corpus.puncts is a Polars dataframe
corpus.puncts.head(3).collect(engine='streaming')
position |
---|
116 |
117 |
120 |
Directly from the parquet file …
pl.scan_parquet(f'{path_to_brown_corpus}/puncts.parquet').head(3).collect(engine='streaming')
position |
---|
116 |
117 |
120 |
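For example, to filter punctuation out of the tokens table you could anti-join on position (a sketch working with the files directly; the casts just align the join key dtypes):

puncts = pl.scan_parquet(f'{path_to_brown_corpus}/puncts.parquet').with_columns(pl.col('position').cast(pl.Int64))
(pl.scan_parquet(f'{path_to_brown_corpus}/tokens.parquet')
    .with_row_index('position')
    .with_columns(pl.col('position').cast(pl.Int64))
    .join(puncts, on='position', how='anti')  # keep only non-punctuation positions
    .collect(engine='streaming'))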
metadata.parquet
The metadata.parquet file should not be confused with the metadata of the corpus itself, which is accessible via corpus.json. If populated, the metadata.parquet file contains metadata for each document in the corpus.
From Conc you can access the metadata dataframe …
corpus = Corpus().load(path_to_congress_corpus) # loading a corpus with some metadata!
corpus.metadata.head(3).collect(engine='streaming')
speech_id | date | speaker | chamber | state |
---|---|---|---|---|
530182158 | "1895-01-10T00:00:00.000000" | "Mr. COCKRELL" | "S" | "Unknown" |
890274849 | "1966-08-31T00:00:00.000000" | "Mr. LONG of Louisiana" | "S" | "Louisiana" |
880088363 | "1963-09-11T00:00:00.000000" | "Mr. FULBRIGHT" | "S" | "Unknown" |
Directly from the parquet file …
display(pl.scan_parquet(f'{path_to_congress_corpus}/metadata.parquet').head(3).collect(engine='streaming'))
speech_id | date | speaker | chamber | state |
---|---|---|---|---|
530182158 | "1895-01-10T00:00:00.000000" | "Mr. COCKRELL" | "S" | "Unknown" |
890274849 | "1966-08-31T00:00:00.000000" | "Mr. LONG of Louisiana" | "S" | "Louisiana" |
880088363 | "1963-09-11T00:00:00.000000" | "Mr. FULBRIGHT" | "S" | "Unknown" |
Metadata is represented in the same order as documents are stored in the tokens.parquet file: the tokens with token2doc_index 1 correspond to the first metadata row.
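As a sketch, metadata rows can be lined up with documents by matching the metadata row order against token2doc_index, for example to count tokens per document (this assumes the 1-based ordering described above; the casts align the join key dtypes):

counts = (pl.scan_parquet(f'{path_to_congress_corpus}/tokens.parquet')
    .filter(pl.col('token2doc_index') > 0)  # drop positions outside texts (-1)
    .group_by('token2doc_index').len()
    .with_columns(pl.col('token2doc_index').cast(pl.Int64)))
meta = (pl.scan_parquet(f'{path_to_congress_corpus}/metadata.parquet')
    .with_row_index('doc', offset=1)  # 1-based row index to match token2doc_index
    .with_columns(pl.col('doc').cast(pl.Int64)))
display(meta.join(counts, left_on='doc', right_on='token2doc_index').head(3).collect(engine='streaming'))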
For corpora created from files using the Corpus.build_from_files method, there will always be a field recording the source file at the time of creation.
corpus = Corpus().load(path_to_gardenparty_corpus)
display(pl.scan_parquet(f'{corpus.corpus_path}/metadata.parquet').head(3).collect(engine='streaming'))
file |
---|
"an-ideal-family.txt" |
"at-the-bay.txt" |
"bank-holiday.txt" |