Anatomy of a corpus
Introduction
A Conc corpus is a directory containing files with specific names and formats to represent the data. This document provides an overview of the various files and what they contain. Here is the directory structure of an example Conc corpus:
└── garden-party.corpus
    ├── vocab.parquet
    ├── tokens.parquet
    ├── puncts.parquet
    ├── spaces.parquet
    ├── metadata.parquet
    ├── corpus.json
    └── README.md
Note: by default the library creates a directory with the .corpus suffix. The directory name is created automatically on build based on a slugified version of the corpus name you assigned.
For example, if you passed in the name:
Garden Party Corpus
The directory will be:
garden-party.corpus
The directory can be renamed and still loaded. The .corpus extension is intended to make corpora on your filesystem easier to find or identify.
To distribute a corpus, send a zip of the directory for others to extract or just share the directory as-is.
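For example, a quick way to create a zip archive of a corpus directory using Python's standard library (the paths here are illustrative, not part of Conc):

import shutil

# creates garden-party.corpus.zip containing the corpus directory
shutil.make_archive('garden-party.corpus', 'zip', root_dir='.', base_dir='garden-party.corpus')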
Below is an overview of the files in a Conc corpus directory. The data can be accessed via Conc or directly from the files.
File | Access via Conc | Description |
---|---|---|
README.md | - | Human readable information about the corpus to aid distribution |
corpus.json | specific properties e.g. corpus.token_count | Machine readable information about the corpus, including name, description, various summary statistics, and models used to build the corpus |
vocab.parquet | corpus.vocab | A table mapping token strings to token IDs and frequency information |
tokens.parquet | corpus.tokens | A table of token positions used to query the corpus, with tokens represented by numeric IDs |
metadata.parquet | corpus.metadata | A table with metadata for each document (if there is any) |
spaces.parquet | corpus.spaces | A table to allow recovery of document spacing without the original texts |
puncts.parquet | corpus.puncts | A table with punctuation positions |
Below is more information about each file. You can work with a corpus via Conc, but you can also work with the processed corpus parquet and JSON files directly. Conc works with parquet files using the Polars library, but other libraries also support the format. Python provides native support for JSON, though there are more efficient libraries; Conc uses the msgspec library to read and write JSON.
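As a minimal sketch of direct access (path_to_corpus is an illustrative variable pointing at a corpus directory), you can read corpus.json with Python's built-in json module and lazily scan a parquet file with Polars:

import json
import polars as pl

with open(f'{path_to_corpus}/corpus.json') as f:
    meta = json.load(f)  # plain dict of corpus metadata
print(meta['name'], meta['token_count'])

vocab = pl.scan_parquet(f'{path_to_corpus}/vocab.parquet')  # lazy scan, nothing read yet
print(vocab.head(5).collect(engine='streaming'))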
Notes on specific Conc corpus files and data formats
The following information will help you if you want to work with the corpus data/files directly.
README.md
Below is an example of the README.md file generated by Conc.
Brown Corpus
About
This directory contains a corpus created using the Conc Python library.
Corpus Information
A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA Revised 1971, Revised and Amplified 1979 http://www.hit.uib.no/icame/brown/bcm.html. This version downloaded via NLTK https://www.nltk.org/nltk_data/.
Date created: 2025-06-15 22:33:08
Document count: 500
Token count: 1138566
Word token count: 980144
Unique tokens: 42930
Unique word tokens: 42907
Conc Version Number: 0.1.3
spaCy model: en_core_web_sm, version 3.8.0
Using this corpus
Conc can be installed via pip. The Conc documentation site has tutorials and detailed information to get you started with Conc or to work with the corpus data directly.
Cite Conc
If you use Conc in your work, please cite it as follows:
corpus.json file
Below is the schema of the corpus.json file showing metadata saved with a corpus. These are loaded by Conc as attributes using Corpus.load, or are created when you build a corpus using Corpus.build_from_files or Corpus.build_from_csv. The schema used to validate the JSON data represents the names and types of the attributes.
{'name': {'type': 'string'},
'description': {'type': 'string'},
'slug': {'type': 'string'},
'conc_version': {'type': 'string'},
'document_count': {'type': 'integer'},
'token_count': {'type': 'integer'},
'word_token_count': {'type': 'integer'},
'punct_token_count': {'type': 'integer'},
'space_token_count': {'type': 'integer'},
'unique_tokens': {'type': 'integer'},
'unique_word_tokens': {'type': 'integer'},
'date_created': {'type': 'string'},
'EOF_TOKEN': {'type': 'integer'},
'SPACY_EOF_TOKEN': {'type': 'integer'},
'SPACY_MODEL': {'type': 'string'},
'SPACY_MODEL_VERSION': {'type': 'string'},
'punct_tokens': {'type': 'array', 'items': {'type': 'integer'}},
'space_tokens': {'type': 'array', 'items': {'type': 'integer'}}}
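As a quick sketch, you could check a corpus.json file against the key names in the schema above (using the illustrative path variable from the examples below):

import json

expected = {'name', 'description', 'slug', 'conc_version', 'document_count',
            'token_count', 'word_token_count', 'punct_token_count', 'space_token_count',
            'unique_tokens', 'unique_word_tokens', 'date_created', 'EOF_TOKEN',
            'SPACY_EOF_TOKEN', 'SPACY_MODEL', 'SPACY_MODEL_VERSION',
            'punct_tokens', 'space_tokens'}

with open(f'{path_to_brown_corpus}/corpus.json') as f:
    data = json.load(f)
print('missing keys:', expected - data.keys() or 'none')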
Once you have built or loaded a corpus you can access the attributes. For example …
corpus = Corpus().load(path_to_brown_corpus) # loading the Brown corpus
print(corpus.name) # accessing the name of the corpus
print('Word token count: ', corpus.word_token_count) # access word_token_count
Brown Corpus
Word token count: 980144
Some of these attributes are exposed by Corpus methods. For example …
corpus.info() # Polars dataframe with summary metadata
Attribute | Value |
---|---|
"Name" | "Brown Corpus" |
"Description" | "A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA Revised 1971, Revised and Amplified 1979 http://www.hit.uib.no/icame/brown/bcm.html. This version … |
"Date Created" | "2025-06-15 22:33:08" |
"Conc Version" | "0.1.3" |
"Corpus Path" | "/home/geoff/data/conc-test-corpora/brown.corpus" |
"Document Count" | "500" |
"Token Count" | "1,138,566" |
"Word Token Count" | "980,144" |
"Unique Tokens" | "42,930" |
"Unique Word Tokens" | "42,907" |
vocab.parquet
The vocab parquet file contains …
- A lookup between token_id, token (string representation), and tokens_sort_order. The sort order allows sorting tokens alphabetically directly from token ids (see the example below).
- A frequency table, with counts for lower-cased tokens and for the orthographic realisation of tokens as they appeared in the text.
- Information on the type of token (i.e. whether punctuation or space - or, if neither of those, a “word” token).
If you have loaded a corpus in Conc, you can access the vocab parquet data as a Polars dataframe like this …
# corpus.vocab is a Polars dataframe
corpus.vocab.head(5).collect(engine='streaming')
rank | tokens_sort_order | token_id | token | frequency_lower | frequency_orth | is_punct | is_space |
---|---|---|---|---|---|---|---|
1 | 50087 | 22848 | "the" | 63516 | 62473 | false | false |
2 | 28 | 8128 | "," | 58331 | 58331 | true | false |
3 | 41 | 38309 | "." | 49907 | 49907 | true | false |
4 | 35232 | 2739 | "of" | 36321 | 36122 | false | false |
5 | 3351 | 7126 | "and" | 27787 | 27633 | false | false |
You can also access the vocab data directly from the parquet file using Polars (or other libraries that support parquet).
display(pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').head(5).collect(engine='streaming'))
rank | tokens_sort_order | token_id | token | frequency_lower | frequency_orth | is_punct | is_space |
---|---|---|---|---|---|---|---|
1 | 50087 | 22848 | "the" | 63516 | 62473 | false | false |
2 | 28 | 8128 | "," | 58331 | 58331 | true | false |
3 | 41 | 38309 | "." | 49907 | 49907 | true | false |
4 | 35232 | 2739 | "of" | 36321 | 36122 | false | false |
5 | 3351 | 7126 | "and" | 27787 | 27633 | false | false |
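The tokens_sort_order column can be used to order the vocab alphabetically without comparing token strings at query time. For example (a sketch; output not shown) …

pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').sort('tokens_sort_order').head(5).collect(engine='streaming')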
To illustrate how frequencies are stored, see the rows for ‘the’ below. Counts for the form as it appeared in the text are stored in frequency_orth: ‘The’ appears 1,043 times and lowercase ‘the’ appears 62,473 times. The frequency_lower column provides the total count of ‘the’ regardless of case.
display(pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').filter(pl.col('token').str.to_lowercase() == 'the').head(5).collect(engine='streaming'))
rank | tokens_sort_order | token_id | token | frequency_lower | frequency_orth | is_punct | is_space |
---|---|---|---|---|---|---|---|
1 | 50087 | 22848 | "the" | 63516 | 62473 | false | false |
99 | 50086 | 15682 | "The" | null | 1043 | false | false |
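Given these column semantics, frequency_lower should equal the sum of frequency_orth over the case variants of a token. A minimal check for ‘the’, assuming the semantics described above …

(pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet')
    .filter(pl.col('token').str.to_lowercase() == 'the')
    .select(pl.col('frequency_orth').sum())  # expect 62473 + 1043 = 63516, matching frequency_lower
    .collect(engine='streaming'))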
Punctuation is included in the token table, but these tokens can be filtered in Conc reports. If you are working with the table directly, you can use is_punct to access or remove punctuation.
display(pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').filter(pl.col('is_punct') == True).head(5).collect(engine='streaming'))
rank | tokens_sort_order | token_id | token | frequency_lower | frequency_orth | is_punct | is_space |
---|---|---|---|---|---|---|---|
2 | 28 | 8128 | "," | 58331 | 58331 | true | false |
3 | 41 | 38309 | "." | 49907 | 49907 | true | false |
12 | 1577 | 1601 | "`" | 9788 | 9788 | true | false |
14 | 14 | 42833 | "''" | 8762 | 8762 | true | false |
15 | 29 | 27963 | "-" | 8131 | 8131 | true | false |
Conc also stores space tokens from spaCy’s tokenisation process, which are sequences of whitespace characters. Space tokens are included in the vocab table, but without counts. Space tokens are explained in more detail below the table.
display(pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').filter(pl.col('is_space') == True).head(5).collect(engine='streaming'))
rank | tokens_sort_order | token_id | token | frequency_lower | frequency_orth | is_punct | is_space |
---|---|---|---|---|---|---|---|
47165 | 1 | 2956 | " " | null | null | false | true |
47166 | 2 | 2799 | " " | null | null | false | true |
47167 | 3 | 27276 | " " | null | null | false | true |
47168 | 4 | 22812 | " " | null | null | false | true |
47169 | 5 | 4112 | " " | null | null | false | true |
A note about space tokens
When spaCy tokenises text it outputs whether each token is followed by a standard space character or not. Conc stores this information during build in the has_spaces column of the tokens table (see below). spaCy also creates tokens for other whitespace sequences; Conc documentation refers to these as space tokens. Space tokens may be useful for some sequence classification problems, but more importantly for Conc, they allow re-creation of source documents in their original form, with newlines, tabs and sequences of whitespace preserved.
Space tokens are not included in overall token counts stored in the corpus.json file.
tokens.parquet
The tokens.parquet file contains a table representing tokens in the corpus. Whitespace tokens have been removed (see the discussion of space tokens above). The tokens data can be directly accessed in Conc as a Polars dataframe …
# corpus.tokens is a Polars dataframe
corpus.tokens.with_row_index('position').filter(pl.col('position').is_between(99, 104)).collect(engine='streaming')
position | orth_index | lower_index | token2doc_index | has_spaces |
---|---|---|---|---|
99 | 46333 | 46333 | -1 | false |
100 | 15682 | 22848 | 1 | true |
101 | 4361 | 41672 | 1 | true |
102 | 14610 | 29725 | 1 | true |
103 | 54713 | 49998 | 1 | true |
104 | 45742 | 19078 | 1 | true |
You can also access the tokens data directly from the parquet file using Polars (or other libraries that support parquet).
pl.scan_parquet(f'{path_to_brown_corpus}/tokens.parquet').with_row_index('position').filter(pl.col('position').is_between(99, 112)).collect(engine='streaming')
position | orth_index | lower_index | token2doc_index | has_spaces |
---|---|---|---|---|
99 | 46333 | 46333 | -1 | false |
100 | 15682 | 22848 | 1 | true |
101 | 4361 | 41672 | 1 | true |
102 | 14610 | 29725 | 1 | true |
103 | 54713 | 49998 | 1 | true |
104 | 45742 | 19078 | 1 | true |
105 | 53250 | 53250 | 1 | true |
106 | 8699 | 35796 | 1 | true |
107 | 45680 | 45680 | 1 | true |
108 | 30305 | 30305 | 1 | true |
109 | 2739 | 2739 | 1 | true |
110 | 38486 | 35571 | 1 | false |
111 | 49732 | 49732 | 1 | true |
112 | 42720 | 42720 | 1 | true |
The columns are as follows:
- the orth_index column stores the token_id of the original form of the token
- the lower_index column stores the token_id of the lowercased form of the token
- the token2doc_index column stores the document id assigned by Conc in the order the texts were processed (starting from index position 1) - this is the same order as the metadata in the metadata.parquet file
- the has_spaces column stores a boolean value indicating whether the token is followed by a standard space character or not
To demarcate the start and end of documents, Conc uses an end-of-file token (EOF_TOKEN) in the orth_index and lower_index columns. The EOF_TOKEN is stored in corpus.json and is accessible as an attribute of a Corpus object. The token2doc_index column represents token positions outside texts in the corpus as -1.
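As a sketch, you can locate document boundaries by filtering the tokens table for EOF_TOKEN positions (this assumes the EOF_TOKEN attribute on a loaded Corpus object, as described above) …

(corpus.tokens.with_row_index('position')
    .filter(pl.col('orth_index') == corpus.EOF_TOKEN)  # rows marking document boundaries
    .select('position')
    .head(3)
    .collect(engine='streaming'))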
The view below shows tokens joined with vocab, so you can see the token string and the attributes of the tokens.
pl.scan_parquet(f'{path_to_brown_corpus}/tokens.parquet').with_row_index('position').filter(pl.col('position').is_between(99, 121)).join(
    pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').select(pl.col('token_id'), pl.col('token'), pl.col('is_punct'), pl.col('is_space')),
    left_on='orth_index', right_on='token_id', how='left', maintain_order='left').collect(engine='streaming')
position | orth_index | lower_index | token2doc_index | has_spaces | token | is_punct | is_space |
---|---|---|---|---|---|---|---|
99 | 46333 | 46333 | -1 | false | " conc-end-of-file-token" | false | false |
100 | 15682 | 22848 | 1 | true | "The" | false | false |
101 | 4361 | 41672 | 1 | true | "Fulton" | false | false |
102 | 14610 | 29725 | 1 | true | "County" | false | false |
103 | 54713 | 49998 | 1 | true | "Grand" | false | false |
104 | 45742 | 19078 | 1 | true | "Jury" | false | false |
105 | 53250 | 53250 | 1 | true | "said" | false | false |
106 | 8699 | 35796 | 1 | true | "Friday" | false | false |
107 | 45680 | 45680 | 1 | true | "an" | false | false |
108 | 30305 | 30305 | 1 | true | "investigation" | false | false |
109 | 2739 | 2739 | 1 | true | "of" | false | false |
110 | 38486 | 35571 | 1 | false | "Atlanta" | false | false |
111 | 49732 | 49732 | 1 | true | "'s" | false | false |
112 | 42720 | 42720 | 1 | true | "recent" | false | false |
113 | 12294 | 12294 | 1 | true | "primary" | false | false |
114 | 29461 | 29461 | 1 | true | "election" | false | false |
115 | 42473 | 42473 | 1 | true | "produced" | false | false |
116 | 1601 | 1601 | 1 | false | "`" | true | false |
117 | 1601 | 1601 | 1 | true | "`" | true | false |
118 | 39507 | 39507 | 1 | true | "no" | false | false |
119 | 9335 | 9335 | 1 | true | "evidence" | false | false |
120 | 42833 | 42833 | 1 | true | "''" | true | false |
121 | 13607 | 13607 | 1 | true | "that" | false | false |
spaces.parquet
Space tokens (see the note above) are stored in spaces.parquet, separately from word and punctuation tokens. This allows consistency with most tools for corpus linguistics. The spaces table follows the format of the tokens table: spaces are represented using position and the corresponding token_id of the whitespace. The original token sequences can be reconstructed from the tokens.parquet and spaces.parquet files, and Conc has functionality to recover specific texts by recombining the data.
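Conc’s text-recovery functionality is the reliable route, but as a rough sketch of the idea, the word and punctuation text of one document can be rebuilt from tokens.parquet and vocab.parquet using has_spaces; exact whitespace would additionally require splicing the spaces.parquet rows back in by position, which Conc handles for you (doc_id is illustrative):

import polars as pl

doc_id = 1  # illustrative document id
vocab = pl.scan_parquet(f'{path_to_brown_corpus}/vocab.parquet').select('token_id', 'token')
doc = (pl.scan_parquet(f'{path_to_brown_corpus}/tokens.parquet')
    .filter(pl.col('token2doc_index') == doc_id)
    .join(vocab, left_on='orth_index', right_on='token_id', how='left', maintain_order='left')
    .select('token', 'has_spaces')
    .collect(engine='streaming'))
# append a standard space after tokens flagged by has_spaces
text = ''.join(t + (' ' if s else '') for t, s in doc.iter_rows())
print(text[:100])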
From Conc …
# corpus.spaces is a Polars dataframe
corpus.spaces.head(3).collect(engine='streaming')
position | orth_index | lower_index | token2doc_index | has_spaces |
---|---|---|---|---|
100 | 27276 | 27276 | 1 | false |
345 | 2956 | 2956 | 1 | false |
637 | 2956 | 2956 | 1 | false |
Directly accessing the parquet file …
pl.scan_parquet(f'{path_to_brown_corpus}/spaces.parquet').head(3).collect(engine='streaming')
position | orth_index | lower_index | token2doc_index | has_spaces |
---|---|---|---|---|
100 | 27276 | 27276 | 1 | false |
345 | 2956 | 2956 | 1 | false |
637 | 2956 | 2956 | 1 | false |
puncts.parquet
The puncts.parquet file stores an index of the positions of punctuation tokens in the corpus. Below are the first three rows of a puncts.parquet file. If you look above, these positions align with punctuation tokens in the tokens.parquet file. The index is intended for filtering tokens to exclude punctuation where necessary.
From Conc …
# corpus.puncts is a Polars dataframe
corpus.puncts.head(3).collect(engine='streaming')
position |
---|
116 |
117 |
120 |
Directly from the parquet file …
pl.scan_parquet(f'{path_to_brown_corpus}/puncts.parquet').head(3).collect(engine='streaming')
position |
---|
116 |
117 |
120 |
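For example, to filter punctuation out of the tokens table you could anti-join on position (a sketch working with the files directly; the casts just align the join key dtypes):

puncts = pl.scan_parquet(f'{path_to_brown_corpus}/puncts.parquet').with_columns(pl.col('position').cast(pl.Int64))
(pl.scan_parquet(f'{path_to_brown_corpus}/tokens.parquet')
    .with_row_index('position')
    .with_columns(pl.col('position').cast(pl.Int64))
    .join(puncts, on='position', how='anti')  # keep only non-punctuation positions
    .collect(engine='streaming'))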
metadata.parquet
The metadata.parquet file should not be confused with the metadata of the corpus itself, which is accessible via corpus.json. If populated, the metadata.parquet file contains metadata for each document in the corpus.
From Conc you can access the metadata dataframe …
corpus = Corpus().load(path_to_congress_corpus) # loading a corpus with some metadata!
corpus.metadata.head(3).collect(engine='streaming')
speech_id | date | speaker | chamber | state |
---|---|---|---|---|
530182158 | "1895-01-10T00:00:00.000000" | "Mr. COCKRELL" | "S" | "Unknown" |
890274849 | "1966-08-31T00:00:00.000000" | "Mr. LONG of Louisiana" | "S" | "Louisiana" |
880088363 | "1963-09-11T00:00:00.000000" | "Mr. FULBRIGHT" | "S" | "Unknown" |
Directly from the parquet file …
display(pl.scan_parquet(f'{path_to_congress_corpus}/metadata.parquet').head(3).collect(engine='streaming'))
speech_id | date | speaker | chamber | state |
---|---|---|---|---|
530182158 | "1895-01-10T00:00:00.000000" | "Mr. COCKRELL" | "S" | "Unknown" |
890274849 | "1966-08-31T00:00:00.000000" | "Mr. LONG of Louisiana" | "S" | "Louisiana" |
880088363 | "1963-09-11T00:00:00.000000" | "Mr. FULBRIGHT" | "S" | "Unknown" |
Metadata is represented in the same order as documents are stored in the tokens.parquet file: the tokens with token2doc_index 1 correspond to the first metadata row.
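As a sketch, metadata rows can be lined up with documents by matching the metadata row order against token2doc_index, for example to count tokens per document (this assumes the 1-based ordering described above; the casts align the join key dtypes):

counts = (pl.scan_parquet(f'{path_to_congress_corpus}/tokens.parquet')
    .filter(pl.col('token2doc_index') > 0)  # drop positions outside texts (-1)
    .group_by('token2doc_index').len()
    .with_columns(pl.col('token2doc_index').cast(pl.Int64)))
meta = (pl.scan_parquet(f'{path_to_congress_corpus}/metadata.parquet')
    .with_row_index('doc', offset=1)  # 1-based row index to match token2doc_index
    .with_columns(pl.col('doc').cast(pl.Int64)))
display(meta.join(counts, left_on='doc', right_on='token2doc_index').head(3).collect(engine='streaming'))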
For corpora created from files using the Corpus.build_from_files method, there will always be a field recording the source file at the time of creation.
corpus = Corpus().load(path_to_gardenparty_corpus)
display(pl.scan_parquet(f'{corpus.corpus_path}/metadata.parquet').head(3).collect(engine='streaming'))
file |
---|
"an-ideal-family.txt" |
"at-the-bay.txt" |
"bank-holiday.txt" |