listcorpus

Representation of frequency information for a corpus, which can be used as a reference corpus for keyword analysis.

Note: to generate a frequency table for a corpus, see Conc.frequencies.
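
As background on why frequency information alone is enough for a reference corpus, the sketch below computes log-likelihood, one common keyness measure, from nothing more than per-token frequencies and corpus sizes. It is a generic illustration and not Conc's implementation; the function name, parameter names and counts are hypothetical.

```python
import math

def log_likelihood(target_freq: int, target_size: int,
                   reference_freq: int, reference_size: int) -> float:
    """Log-likelihood keyness for one token (illustrative, not Conc's code)."""
    total = target_size + reference_size
    combined = target_freq + reference_freq
    # Expected counts if the token were equally common in both corpora.
    expected_target = target_size * combined / total
    expected_reference = reference_size * combined / total
    score = 0.0
    # A zero observed count contributes nothing to the sum.
    if target_freq > 0:
        score += target_freq * math.log(target_freq / expected_target)
    if reference_freq > 0:
        score += reference_freq * math.log(reference_freq / expected_reference)
    return 2 * score

# Made-up counts: same relative frequency in both corpora vs. overuse in the target.
balanced = log_likelihood(10, 1000, 20, 2000)
overused = log_likelihood(50, 1000, 20, 2000)
```

A token with the same relative frequency in both corpora scores zero, and the score grows as the relative frequencies diverge.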

ListCorpus class


ListCorpus

 ListCorpus (name:str='', description:str='')

Representation of a corpus based on frequency information, which can be loaded as a reference corpus.

| Parameter | Type | Default | Details |
|---|---|---|---|
| name | str | '' | name of corpus |
| description | str | '' | description of corpus |

ListCorpus.build_from_corpus

 ListCorpus.build_from_corpus (source_corpus_path:str, save_path:str)

Build a List Corpus from a Conc corpus.

| Parameter | Type | Details |
|---|---|---|
| source_corpus_path | str | path to a Conc corpus directory |
| save_path | str | directory where the corpus will be created; a subdirectory is created automatically to hold the corpus content |
| Returns | None | |

ListCorpus.load

 ListCorpus.load (corpus_path:str)

Load list corpus from disk.

| Parameter | Type | Details |
|---|---|---|
| corpus_path | str | path to load corpus |

ListCorpus.info

 ListCorpus.info (formatted:bool=True)

Return information about the list corpus.

| Parameter | Type | Default | Details |
|---|---|---|---|
| formatted | bool | True | return formatted output |
| Returns | str | | formatted information about the corpus |

ListCorpus.report

 ListCorpus.report ()

Get information about the list corpus as a result object.


ListCorpus.summary

 ListCorpus.summary (include_memory_usage:bool=False)

Print information about the list corpus in a formatted table.

| Parameter | Type | Default | Details |
|---|---|---|---|
| include_memory_usage | bool | False | include memory usage in output |

To create a list corpus you first need to create a standard corpus using Conc.corpus. See the recipes for examples.

Note: if you intend to use the list corpus as a reference corpus for keyness analysis, it will probably help to add standardize_word_token_punctuation_characters to the build method when building the source corpus. This ensures that word tokens with punctuation (e.g. n’t) use the same apostrophe character, which lets Conc handle these differences when calculating keyness.
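
To illustrate why the apostrophe character matters, here is a minimal sketch of the effect of standardizing it before comparing token counts. This is not Conc's implementation: the helper `standardize_apostrophes`, the character mapping and the toy token lists are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical mapping for illustration: curly apostrophes -> straight apostrophe.
APOSTROPHE_VARIANTS = {"\u2019": "'", "\u2018": "'"}

def standardize_apostrophes(token: str) -> str:
    """Replace curly apostrophe variants with a straight apostrophe."""
    for variant, replacement in APOSTROPHE_VARIANTS.items():
        token = token.replace(variant, replacement)
    return token

# The same word token as it might appear in two differently encoded corpora.
target_tokens = ["n\u2019t", "n\u2019t", "do"]
reference_tokens = ["n't", "do"]

# Without standardization the curly form has no match in the reference counts.
raw_target = Counter(target_tokens)
raw_reference = Counter(reference_tokens)
unmatched = set(raw_target) - set(raw_reference)

# After standardization both corpora share a single key for n't.
std_target = Counter(standardize_apostrophes(t) for t in target_tokens)
std_reference = Counter(standardize_apostrophes(t) for t in reference_tokens)
shared = set(std_target) & set(std_reference)
```

Without a shared form, a token like this would look absent from the reference corpus and could be reported as spuriously key.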

Once the standard corpus has been created, you create a list corpus by passing in the path to the corpus directory …

listcorpus = ListCorpus().build_from_corpus(source_corpus_path = f'{save_path}garden-party.corpus', save_path = save_path)

The list corpus copies some of the data from the source corpus and adds document frequency information for each token. Conc uses the .listcorpus suffix on directories to differentiate list corpora from standard corpora. The directory for the list corpus contains corpus information in listcorpus.json, the frequency information in vocab.parquet, and a human-readable README.md to aid sharing the data.

├── garden-party.listcorpus
│   ├── vocab.parquet
│   ├── README.md
│   └── listcorpus.json
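
If you want to check a directory before loading it, the layout above can be tested with the standard library. The sketch below is not part of Conc; the helper `looks_like_list_corpus` is hypothetical and only checks the suffix and file names described on this page, using empty placeholder files for the demonstration.

```python
from pathlib import Path
import tempfile

# File names taken from the directory listing on this page.
EXPECTED_FILES = ("listcorpus.json", "vocab.parquet", "README.md")

def looks_like_list_corpus(path: Path) -> bool:
    """Return True if `path` has the .listcorpus suffix and the expected files."""
    if not path.is_dir() or path.suffix != ".listcorpus":
        return False
    return all((path / name).is_file() for name in EXPECTED_FILES)

# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    list_dir = Path(tmp) / "garden-party.listcorpus"
    list_dir.mkdir()
    for name in EXPECTED_FILES:
        (list_dir / name).touch()

    # A standard corpus directory uses a different suffix and is rejected.
    standard_dir = Path(tmp) / "garden-party.corpus"
    standard_dir.mkdir()

    list_ok = looks_like_list_corpus(list_dir)
    standard_ok = looks_like_list_corpus(standard_dir)
```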

You can access summary information with the same methods as the Conc.corpus class.

For example …

listcorpus.summary()
List Corpus Summary
Attribute Value
name Garden Party Corpus
description A corpus of short stories from The Garden Party: and Other Stories by Katherine Mansfield. Texts downloaded from Project Gutenberg https://gutenberg.org/ and are in the public domain. The text files contain the short story without the title. https://github.com/ucdh/scraping-garden-party
date_created 2025-07-09 11:15:56
conc_version 0.1.6
corpus_path /home/geoff/data/conc-test-corpora/garden-party.listcorpus
document_count 15
token_count 74,664
word_token_count 59,514
unique_tokens 5,410
unique_word_tokens 5,392

This preview of the vocab table shows the available columns in case you want to access the data directly. The anatomy page has information on the columns from the standard Conc corpus format that are relevant to working with a list corpus.

display(listcorpus.vocab.head(1000).collect().sample(10))
| rank | tokens_sort_order | token_id | token | frequency_lower | frequency_orth | is_punct | is_space | document_frequency_lower | document_frequency_orth |
|---|---|---|---|---|---|---|---|---|---|
| 509 | 4863 | 1482 | "sky" | 17 | 17 | false | false | 8 | 8 |
| 284 | 3999 | 4150 | "pink" | 32 | 32 | false | false | 10 | 10 |
| 399 | 3719 | 3175 | "On" | null | 22 | false | false | null | 8 |
| 450 | 2504 | 1071 | "held" | 19 | 19 | false | false | 9 | 9 |
| 87 | 5523 | 691 | "this" | 138 | 112 | false | false | 14 | 14 |
| 491 | 6171 | 1788 | "women" | 20 | 18 | false | false | 10 | 9 |
| 256 | 6110 | 5588 | "why" | 99 | 37 | false | false | 12 | 10 |
| 720 | 4974 | 3520 | "somebody" | 13 | 11 | false | false | 5 | 5 |
| 398 | 2922 | 2620 | "Kember" | null | 22 | false | false | null | 1 |
| 472 | 2016 | 1213 | "followed" | 18 | 18 | false | false | 8 | 8 |
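
The frequency columns in the preview can be read as follows: the `_orth` columns count the exact form of a token, while the `_lower` columns appear to count all case variants and are null for forms that are not lowercase. That reading is inferred from the rows above rather than documented here, so treat the sketch below as an assumption. It uses toy pre-tokenized documents and is not Conc's implementation.

```python
from collections import Counter

# Toy documents, already tokenized; the tokens are made up for illustration.
documents = [
    ["On", "this", "day", "this"],
    ["This", "day"],
    ["on", "This"],
]

frequency_orth = Counter()            # exact-form token counts
frequency_lower = Counter()           # counts after lowercasing
document_frequency_orth = Counter()   # documents containing the exact form
document_frequency_lower = Counter()  # documents containing any case variant

for tokens in documents:
    frequency_orth.update(tokens)
    frequency_lower.update(t.lower() for t in tokens)
    # Sets ensure each document is counted once per token.
    document_frequency_orth.update(set(tokens))
    document_frequency_lower.update({t.lower() for t in tokens})

def vocab_row(token: str) -> dict:
    """Build one illustrative row; the _lower values are None for non-lowercase forms."""
    is_lower = token == token.lower()
    return {
        "token": token,
        "frequency_lower": frequency_lower[token] if is_lower else None,
        "frequency_orth": frequency_orth[token],
        "document_frequency_lower": document_frequency_lower[token] if is_lower else None,
        "document_frequency_orth": document_frequency_orth[token],
    }

rows = {token: vocab_row(token) for token in frequency_orth}
```

In this toy data the row for "this" has a `frequency_lower` that includes the occurrences of "This", mirroring the pattern in the preview where "this" has 138 lowercased and 112 exact-form occurrences.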

ListCorpus.get_token_count_text

 ListCorpus.get_token_count_text (exclude_punctuation:bool=False)

Get the token count for the corpus, with adjustments, along with descriptive text for output.

| Parameter | Type | Default | Details |
|---|---|---|---|
| exclude_punctuation | bool | False | exclude punctuation tokens from the count |
| Returns | tuple | | token count with adjustments based on exclusions, token descriptor, total descriptor |