listcorpus

Representation of frequency information for a corpus, which can be used as a reference corpus for keyword analysis.

Note: to generate a frequency table for a corpus, see Conc.frequencies.
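To make the role of a reference corpus concrete, here is a minimal sketch of one common keyness measure, log-likelihood, computed from two frequency lists. It uses only the Python standard library and invented counts; the function name and data are hypothetical, and Conc's own keyness calculation may differ in measure and detail.

```python
import math

def log_likelihood(target_freq, target_total, ref_freq, ref_total):
    """Log-likelihood keyness score for one token (illustrative, not Conc's code)."""
    combined = target_freq + ref_freq
    total = target_total + ref_total
    # Expected frequencies if the token were equally common in both corpora
    expected_target = target_total * combined / total
    expected_ref = ref_total * combined / total
    score = 0.0
    # A zero observed count contributes nothing to the sum
    if target_freq > 0:
        score += target_freq * math.log(target_freq / expected_target)
    if ref_freq > 0:
        score += ref_freq * math.log(ref_freq / expected_ref)
    return 2 * score

# Hypothetical frequency lists and corpus sizes (not taken from any real corpus)
target = {"garden": 30, "the": 500}
reference = {"garden": 10, "the": 5000}
target_total, ref_total = 10_000, 100_000

scores = {
    token: log_likelihood(freq, target_total, reference.get(token, 0), ref_total)
    for token, freq in target.items()
}
```

In this toy data "the" has the same relative frequency in both corpora, so its score is zero, while "garden" is relatively far more frequent in the target corpus and scores highly. A list corpus supplies the reference side of this comparison without needing the full reference texts.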

ListCorpus class

ListCorpus

 ListCorpus (name:str='', description:str='')

Representation of a corpus based on frequency information, which can be loaded as a reference corpus.

	Type	Default	Details
name	str		name of corpus
description	str		description of corpus

source

ListCorpus.build_from_corpus

 ListCorpus.build_from_corpus (source_corpus_path:str, save_path:str)

Build a List Corpus from a Conc corpus.

	Type	Details
source_corpus_path	str	path to a Conc corpus directory
save_path	str	directory where the corpus will be created; a subdirectory containing the corpus content is created automatically
Returns	None

source

ListCorpus.load

 ListCorpus.load (corpus_path:str)

Load list corpus from disk.

	Type	Details
corpus_path	str	path to the list corpus to load

source

ListCorpus.info

 ListCorpus.info (formatted:bool=True)

Return information about the list corpus.

	Type	Default	Details
formatted	bool	True	return formatted output
Returns	str		formatted information about the corpus

source

ListCorpus.report

 ListCorpus.report ()

Get information about the list corpus as a result object.

source

ListCorpus.summary

 ListCorpus.summary (include_memory_usage:bool=False)

Print information about the list corpus in a formatted table.

	Type	Default	Details
include_memory_usage	bool	False	include memory usage in output

Information on working with the list corpus format

To create a list corpus you first need to create a standard corpus using Conc.corpus. See the recipes for examples.

Note: if you intend to use the list corpus as a reference corpus for keyness analysis, it will probably be helpful to add standardize_word_token_punctuation_characters to the build method when building the source corpus. This ensures that word tokens with punctuation (e.g. n’t) use the same apostrophe character, so that Conc can match these tokens across corpora when calculating keyness.
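The following sketch illustrates why this matters. It is not Conc's implementation: the set of apostrophe variants, the helper names, and the token counts are all assumptions chosen for illustration. It shows that tokens differing only in apostrophe character fail to match between two frequency lists until they are standardized.

```python
from collections import Counter

# Assumed set of apostrophe-like characters, folded into the ASCII apostrophe
APOSTROPHE_VARIANTS = {"\u2019": "'", "\u2018": "'", "\u02bc": "'"}
TRANSLATION = str.maketrans(APOSTROPHE_VARIANTS)

def standardize_apostrophes(token):
    """Replace apostrophe variants in a token with the ASCII apostrophe."""
    return token.translate(TRANSLATION)

def standardize_counts(counts):
    """Merge frequency counts for tokens that differ only in apostrophe character."""
    merged = Counter()
    for token, freq in counts.items():
        merged[standardize_apostrophes(token)] += freq
    return merged

# Hypothetical counts: the target uses a curly apostrophe, the reference a straight one
target_counts = Counter(["n\u2019t", "n\u2019t", "do"])
reference_counts = Counter(["n't", "do", "do"])

# Tokens found in both lists before and after standardization
shared_raw = set(target_counts) & set(reference_counts)
shared_standardized = set(standardize_counts(target_counts)) & set(
    standardize_counts(reference_counts)
)
```

Before standardization only "do" is shared between the two lists, so the contracted form would look absent from the reference corpus. After standardization both lists contain the same "n't" token and their frequencies can be compared.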

Once the standard corpus has been created, you build a list corpus by passing in the path to the corpus directory …

listcorpus = ListCorpus().build_from_corpus(source_corpus_path = f'{save_path}garden-party.corpus', save_path = save_path)

Building a list corpus copies some of the data from the source corpus and adds document frequency information for each token. Conc uses the .listcorpus suffix on directories to differentiate list corpora from standard corpora. The directory for the list corpus contains corpus information in listcorpus.json, the frequency information in vocab.parquet, and a human-readable README.md to aid sharing the data.

├── garden-party.listcorpus
│   ├── README.md
│   ├── vocab.parquet
│   └── listcorpus.json
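If you receive a list corpus from someone else, a quick check of the directory layout can be useful before loading it. The sketch below uses only the standard library and the three file names shown in the tree above; the helper name is hypothetical and is not part of Conc.

```python
from pathlib import Path

# File names taken from the directory listing for a list corpus
EXPECTED_FILES = ("listcorpus.json", "vocab.parquet", "README.md")

def missing_listcorpus_files(corpus_path):
    """Return the expected file names that are missing from a list corpus directory."""
    path = Path(corpus_path)
    # List corpora are directories with the .listcorpus suffix
    if path.suffix != ".listcorpus" or not path.is_dir():
        raise ValueError(f"{path} is not a .listcorpus directory")
    return [name for name in EXPECTED_FILES if not (path / name).is_file()]
```

An empty list means all three files are present. This only checks that the files exist; it does not validate their contents.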

You can access summary information with the same methods as the Conc.corpus class.

For example …

listcorpus.summary()

List Corpus Summary

Attribute	Value
name	Garden Party Corpus
description	A corpus of short stories from The Garden Party: and Other Stories by Katherine Mansfield. Texts downloaded from Project Gutenberg https://gutenberg.org/ and are in the public domain. The text files contain the short story without the title. https://github.com/ucdh/scraping-garden-party
date_created	2025-07-23 22:15:21
conc_version	0.1.10
corpus_path	/home/geoff/data/conc-test-corpora/garden-party.listcorpus
document_count	15
token_count	74,664
word_token_count	59,514
unique_tokens	5,410
unique_word_tokens	5,392

This preview of the vocab table shows the available columns in case you want to access the data directly. The anatomy page describes the columns from the standard Conc corpus format that are relevant to working with a list corpus.

display(listcorpus.vocab.head(1000).collect().sample(10))

rank	tokens_sort_order	token_id	token	frequency_lower	frequency_orth	is_punct	is_space	document_frequency_lower	document_frequency_orth
109	35	349	"about"	88	88	false	false	14	14
883	5262	4305	"strong"	9	9	false	false	5	5
797	5474	3022	"thank"	17	10	false	false	8	7
504	4037	2788	"played"	17	17	false	false	7	7
216	6102	1300	"who"	56	44	false	false	14	14
819	868	1720	"chairs"	9	9	false	false	5	5
27	2383	4423	"had"	469	466	false	false	14	14
209	6128	6234	"will"	47	45	false	false	13	12
986	616	2240	"brass"	7	7	false	false	4	4
251	2410	587	"hands"	38	38	false	false	13	13
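The sketch below shows one way the four frequency columns could relate to each other, using a toy set of tokenized documents. The reading of the column names is an assumption drawn from the names and the sample rows (for example, "thank" has a higher frequency_lower than frequency_orth): "orth" is taken to count the exact written form and "lower" to count all case variants together. This is an illustration in plain Python, not how Conc computes the table.

```python
from collections import Counter

# Hypothetical tokenized documents (not from the Garden Party corpus)
documents = [
    ["Thank", "you"],
    ["thank", "Will", "will"],
    ["you", "will"],
]

frequency_orth = Counter()            # occurrences of the exact form
frequency_lower = Counter()           # occurrences ignoring case
document_frequency_orth = Counter()   # documents containing the exact form
document_frequency_lower = Counter()  # documents containing any case variant

for tokens in documents:
    frequency_orth.update(tokens)
    frequency_lower.update(token.lower() for token in tokens)
    # Sets ensure each document is counted at most once per token
    document_frequency_orth.update(set(tokens))
    document_frequency_lower.update({token.lower() for token in tokens})
```

Here "thank" occurs once in its exact lowercase form but twice when case is ignored, and it appears in one document as written but in two documents when case is ignored. Document frequency can never exceed the number of documents, which is consistent with the maximum of 14 in the sample rows for a corpus of 15 documents.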

source

ListCorpus.get_token_count_text

 ListCorpus.get_token_count_text (exclude_punctuation:bool=False)

Get the token count for the corpus, with adjustments, along with descriptive text for output.

	Type	Default	Details
exclude_punctuation	bool	False	exclude punctuation tokens from the count
Returns	tuple		token count with adjustments based on exclusions, token descriptor, total descriptor