listcorpus = ListCorpus().build_from_corpus(source_corpus_path = f'{save_path}garden-party.corpus', save_path = save_path)
listcorpus
Note: to generate a frequency table for a corpus, see Conc.frequencies.
ListCorpus class
ListCorpus
ListCorpus (name:str='', description:str='')
Representation of a corpus based on frequency information, which can be loaded as a reference corpus.
Parameter | Type | Default | Details |
---|---|---|---|
name | str | '' | name of corpus |
description | str | '' | description of corpus |
ListCorpus.build_from_corpus
ListCorpus.build_from_corpus (source_corpus_path:str, save_path:str)
Build a List Corpus from a Conc corpus.
Parameter | Type | Details |
---|---|---|
source_corpus_path | str | path to a Conc corpus directory |
save_path | str | directory where the list corpus will be created; a subdirectory is created automatically to hold the corpus content |
Returns | None | |
ListCorpus.load
ListCorpus.load (corpus_path:str)
Load list corpus from disk.
Parameter | Type | Details |
---|---|---|
corpus_path | str | path to load corpus |
ListCorpus.info
ListCorpus.info (formatted:bool=True)
Return information about the list corpus.
Parameter | Type | Default | Details |
---|---|---|---|
formatted | bool | True | return formatted output |
Returns | str | | formatted information about the corpus |
ListCorpus.report
ListCorpus.report ()
Get information about the list corpus as a result object.
ListCorpus.summary
ListCorpus.summary (include_memory_usage:bool=False)
Print information about the list corpus in a formatted table.
Parameter | Type | Default | Details |
---|---|---|---|
include_memory_usage | bool | False | include memory usage in output |
To create a list corpus you first need to create a standard corpus using Conc.corpus. See the recipes for examples.
Note: if you intend to use the list corpus as a reference corpus for keyness analysis, it will probably be helpful to add standardize_word_token_punctuation_characters to the build method when building the source corpus. This ensures that word tokens with punctuation (e.g. n’t) use the same apostrophe character, which allows Conc to handle these differences when calculating keyness.
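To see why the apostrophe character matters, here is a minimal standard-library sketch of the idea. It is not Conc's implementation: the `standardize_apostrophes` helper, the set of characters it maps, and the toy token lists are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical mapping for illustration; Conc's own standardization may
# cover a different set of characters.
APOSTROPHE_VARIANTS = {
    "\u2019": "'",  # right single quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u02bc": "'",  # modifier letter apostrophe
}
_TABLE = str.maketrans(APOSTROPHE_VARIANTS)


def standardize_apostrophes(token: str) -> str:
    """Map common apostrophe look-alikes to the ASCII apostrophe."""
    return token.translate(_TABLE)


# Toy token lists: the target corpus uses a curly apostrophe,
# the reference corpus a straight one.
target_tokens = ["did", "n\u2019t", "was", "n\u2019t"]
reference_tokens = ["did", "n't", "is", "n't", "n't"]

# Without standardization the two spellings are counted as different tokens.
shared_raw = set(Counter(target_tokens)) & set(Counter(reference_tokens))

# With standardization the counts line up and can be compared.
target_counts = Counter(standardize_apostrophes(t) for t in target_tokens)
reference_counts = Counter(standardize_apostrophes(t) for t in reference_tokens)
shared_standardized = set(target_counts) & set(reference_counts)
```

In this sketch the contraction only appears in both vocabularies after standardization, which is the precondition for comparing its frequency across a target and a reference corpus.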
Once the standard corpus exists, you create a list corpus by passing in the path to the corpus directory …
A list corpus copies some of the data from the source corpus and adds document frequency information for each token. Conc uses the .listcorpus suffix on directories to differentiate list corpora from standard corpora. The list corpus directory contains corpus information in listcorpus.json, frequency information in vocab.parquet, and a human-readable README.md to aid sharing the data.
├── garden-party.listcorpus
│ ├── vocab.parquet
│ ├── README.md
│ └── listcorpus.json
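If you share or receive a list corpus, a quick check of the layout above can be sketched with the standard library. The `missing_list_corpus_files` helper is hypothetical and not part of Conc; it only tests the directory suffix and the three file names listed above, not their contents.

```python
import tempfile
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = ("listcorpus.json", "vocab.parquet", "README.md")


def missing_list_corpus_files(corpus_path: str) -> list[str]:
    """Return the expected files that are absent from a list corpus directory."""
    path = Path(corpus_path)
    if path.suffix != ".listcorpus":
        raise ValueError(f"{path.name} does not use the .listcorpus suffix")
    return [name for name in EXPECTED_FILES if not (path / name).is_file()]


# Demonstration on a throwaway directory that lacks README.md.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "garden-party.listcorpus"
    demo.mkdir()
    (demo / "listcorpus.json").write_text("{}")
    (demo / "vocab.parquet").write_bytes(b"")
    missing = missing_list_corpus_files(str(demo))
```

An empty list means all three files are present; the demonstration directory deliberately omits README.md.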
You can access summary information with the same methods as the Conc.corpus class.
For example …
listcorpus.summary()
List Corpus Summary
Attribute | Value |
---|---|
name | Garden Party Corpus |
description | A corpus of short stories from The Garden Party: and Other Stories by Katherine Mansfield. Texts downloaded from Project Gutenberg https://gutenberg.org/ and are in the public domain. The text files contain the short story without the title. https://github.com/ucdh/scraping-garden-party |
date_created | 2025-07-09 11:15:56 |
conc_version | 0.1.6 |
corpus_path | /home/geoff/data/conc-test-corpora/garden-party.listcorpus |
document_count | 15 |
token_count | 74,664 |
word_token_count | 59,514 |
unique_tokens | 5,410 |
unique_word_tokens | 5,392 |
This preview of the vocab table shows the available columns in case you want to access the data directly. The anatomy page has information on the columns from the standard Conc corpus format that is relevant to working with a list corpus.
display(listcorpus.vocab.head(1000).collect().sample(10))
rank | tokens_sort_order | token_id | token | frequency_lower | frequency_orth | is_punct | is_space | document_frequency_lower | document_frequency_orth |
---|---|---|---|---|---|---|---|---|---|
509 | 4863 | 1482 | "sky" | 17 | 17 | false | false | 8 | 8 |
284 | 3999 | 4150 | "pink" | 32 | 32 | false | false | 10 | 10 |
399 | 3719 | 3175 | "On" | null | 22 | false | false | null | 8 |
450 | 2504 | 1071 | "held" | 19 | 19 | false | false | 9 | 9 |
87 | 5523 | 691 | "this" | 138 | 112 | false | false | 14 | 14 |
491 | 6171 | 1788 | "women" | 20 | 18 | false | false | 10 | 9 |
256 | 6110 | 5588 | "why" | 99 | 37 | false | false | 12 | 10 |
720 | 4974 | 3520 | "somebody" | 13 | 11 | false | false | 5 | 5 |
398 | 2922 | 2620 | "Kember" | null | 22 | false | false | null | 1 |
472 | 2016 | 1213 | "followed" | 18 | 18 | false | false | 8 | 8 |
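One reading of the preview above is that the `_orth` columns count a token exactly as written, while the `_lower` columns count all case variants together and are only filled in on lowercase tokens (hence the null values for "On" and "Kember"). The sketch below reproduces that pattern on a toy corpus with the standard library. It is an interpretation of the table, not Conc's implementation, and the `vocab_row` helper and sample documents are made up for illustration.

```python
from collections import Counter

# Toy corpus: each document is a list of tokens.
documents = [
    ["On", "the", "lawn", "the", "band", "played"],
    ["The", "band", "stopped", "on", "the", "lawn"],
    ["On", "Sunday", "the", "band", "rested"],
]

frequency_orth = Counter()            # occurrences of the exact form
frequency_lower = Counter()           # occurrences of any case variant
document_frequency_orth = Counter()   # documents containing the exact form
document_frequency_lower = Counter()  # documents containing any case variant

for tokens in documents:
    lowered = [token.lower() for token in tokens]
    frequency_orth.update(tokens)
    frequency_lower.update(lowered)
    # Sets ensure each document contributes at most once per token.
    document_frequency_orth.update(set(tokens))
    document_frequency_lower.update(set(lowered))


def vocab_row(token: str) -> dict:
    """Build one row; the _lower columns are None unless the token is lowercase."""
    is_lower = token == token.lower()
    return {
        "token": token,
        "frequency_lower": frequency_lower[token] if is_lower else None,
        "frequency_orth": frequency_orth[token],
        "document_frequency_lower": document_frequency_lower[token] if is_lower else None,
        "document_frequency_orth": document_frequency_orth[token],
    }


rows = [vocab_row(token) for token in ("the", "On", "on")]
```

Here "the" has a lowercase frequency of 5 but an orthographic frequency of 4, because "The" is counted only in the lowercase total, which mirrors rows such as "this" in the preview.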
ListCorpus.get_token_count_text
ListCorpus.get_token_count_text (exclude_punctuation:bool=False)
Get the token count for the corpus, with adjustments, along with descriptive text for output.
Parameter | Type | Default | Details |
---|---|---|---|
exclude_punctuation | bool | False | exclude punctuation tokens from the count |
Returns | tuple | | token count with adjustments based on exclusions, token descriptor, total descriptor |