
listcorpus

Representation of frequency information for a corpus, which can be used as a reference corpus for keyword analysis.

Note: to generate a frequency table for a corpus, see Conc.frequencies.

ListCorpus class


ListCorpus

 ListCorpus (name:str='', description:str='')

Representation of a corpus based on frequency information, which can be loaded as a reference corpus.

| Parameter | Type | Default | Details |
| --- | --- | --- | --- |
| name | str | '' | name of corpus |
| description | str | '' | description of corpus |

ListCorpus.build_from_corpus

 ListCorpus.build_from_corpus (source_corpus_path:str, save_path:str)

Build a List Corpus from a Conc corpus.

| Parameter | Type | Details |
| --- | --- | --- |
| source_corpus_path | str | path to a Conc corpus directory |
| save_path | str | directory where the corpus will be created; a subdirectory containing the corpus content is created automatically |
| Returns | None | |

ListCorpus.load

 ListCorpus.load (corpus_path:str)

Load list corpus from disk.

| Parameter | Type | Details |
| --- | --- | --- |
| corpus_path | str | path to load corpus |

ListCorpus.info

 ListCorpus.info (formatted:bool=True)

Return information about the list corpus.

| Parameter | Type | Default | Details |
| --- | --- | --- | --- |
| formatted | bool | True | return formatted output |
| Returns | str | | formatted information about the corpus |

source

ListCorpus.report

 ListCorpus.report ()

Get information about the list corpus as a result object.


ListCorpus.summary

 ListCorpus.summary (include_memory_usage:bool=False)

Print information about the list corpus in a formatted table.

| Parameter | Type | Default | Details |
| --- | --- | --- | --- |
| include_memory_usage | bool | False | include memory usage in output |

Information on working with the list corpus format

To create a list corpus you first need to create a standard corpus using Conc.corpus. See the recipes for examples.

Note: if you intend to use the list corpus as a reference corpus for keyness analysis, it will probably help to pass the standardize_word_token_punctuation_characters option to the build method when building the source corpus. This ensures that word tokens containing punctuation (e.g. n’t) use the same apostrophe character in both corpora, so that Conc can match them when calculating keyness.
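
To illustrate why a consistent apostrophe matters, the sketch below normalizes apostrophe variants before counting, so that n’t and n't are treated as the same token. It is plain Python for illustration only: the normalize_apostrophes helper and the character mapping are assumptions, not Conc's implementation of standardize_word_token_punctuation_characters.

```python
from collections import Counter

# Hypothetical mapping (an assumption, not Conc's actual character list):
# typographic apostrophe variants mapped to the ASCII apostrophe.
APOSTROPHE_VARIANTS = {
    "\u2019": "'",  # right single quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u02bc": "'",  # modifier letter apostrophe
}
_TABLE = str.maketrans(APOSTROPHE_VARIANTS)


def normalize_apostrophes(token: str) -> str:
    """Return the token with apostrophe variants replaced by the ASCII apostrophe."""
    return token.translate(_TABLE)


# Made-up tokens from two corpora that use different apostrophe characters.
target_tokens = ["did", "n\u2019t", "was", "n\u2019t"]
reference_tokens = ["did", "n't", "is", "n't"]

raw_target = Counter(target_tokens)
normalized_target = Counter(normalize_apostrophes(t) for t in target_tokens)
reference = Counter(reference_tokens)

# Without normalization the target form never matches the reference form.
print(raw_target["n't"], reference["n't"])         # 0 2
# After normalization both corpora count the same token.
print(normalized_target["n't"], reference["n't"])  # 2 2
```

If the two corpora disagree on the apostrophe character, a token such as n’t would look absent from one of them, which would distort any frequency comparison between the two.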

Once the source corpus exists, you create a list corpus by passing in the path to the corpus directory …

# Assumes ListCorpus has been imported (e.g. from conc.listcorpus) and save_path is already defined.
listcorpus = ListCorpus().build_from_corpus(source_corpus_path=f'{save_path}garden-party.corpus', save_path=save_path)

Building a list corpus copies some of the data from the source corpus and adds document frequency information for each token. Conc uses the .listcorpus suffix on directories to differentiate list corpora from standard corpora. The directory for the list corpus contains corpus information in listcorpus.json, the frequency information in vocab.parquet, and a human-readable README.md to aid sharing the data.

├── garden-party.listcorpus
│   ├── README.md
│   ├── vocab.parquet
│   └── listcorpus.json
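
As a quick sanity check before sharing or loading a list corpus, the layout above can be verified with the standard library. The following is a minimal sketch: missing_listcorpus_files is a hypothetical helper (not part of Conc), and the file names are taken from the directory listing above.

```python
import tempfile
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = ("README.md", "vocab.parquet", "listcorpus.json")


def missing_listcorpus_files(corpus_path: str) -> list[str]:
    """Return the expected files that are absent from a .listcorpus directory.

    Hypothetical helper for illustration; it is not part of Conc.
    """
    path = Path(corpus_path)
    if path.suffix != ".listcorpus" or not path.is_dir():
        raise ValueError(f"{corpus_path} is not a .listcorpus directory")
    return [name for name in EXPECTED_FILES if not (path / name).is_file()]


# Demonstrate on a throwaway directory that mimics the layout,
# deliberately leaving out vocab.parquet.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "garden-party.listcorpus"
    demo.mkdir()
    (demo / "README.md").write_text("placeholder\n")
    (demo / "listcorpus.json").write_text("{}")
    missing = missing_listcorpus_files(str(demo))
    print(missing)  # ['vocab.parquet']
```

The check only looks at file names; it does not validate the contents of listcorpus.json or vocab.parquet.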

You can access summary information with the same methods as the Conc.corpus class.

For example …

listcorpus.summary()
List Corpus Summary

| Attribute | Value |
| --- | --- |
| name | Garden Party Corpus |
| description | A corpus of short stories from The Garden Party: and Other Stories by Katherine Mansfield. Texts downloaded from Project Gutenberg https://gutenberg.org/ and are in the public domain. The text files contain the short story without the title. https://github.com/ucdh/scraping-garden-party |
| date_created | 2025-07-23 22:15:21 |
| conc_version | 0.1.10 |
| corpus_path | /home/geoff/data/conc-test-corpora/garden-party.listcorpus |
| document_count | 15 |
| token_count | 74,664 |
| word_token_count | 59,514 |
| unique_tokens | 5,410 |
| unique_word_tokens | 5,392 |

This preview of the vocab table shows the available columns in case you want to access the data directly. The anatomy page has information on the columns from the standard Conc corpus format that are relevant to working with a list corpus.

display(listcorpus.vocab.head(1000).collect().sample(10))
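| rank | tokens_sort_order | token_id | token | frequency_lower | frequency_orth | is_punct | is_space | document_frequency_lower | document_frequency_orth |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 109 | 35 | 349 | "about" | 88 | 88 | false | false | 14 | 14 |
| 883 | 5262 | 4305 | "strong" | 9 | 9 | false | false | 5 | 5 |
| 797 | 5474 | 3022 | "thank" | 17 | 10 | false | false | 8 | 7 |
| 504 | 4037 | 2788 | "played" | 17 | 17 | false | false | 7 | 7 |
| 216 | 6102 | 1300 | "who" | 56 | 44 | false | false | 14 | 14 |
| 819 | 868 | 1720 | "chairs" | 9 | 9 | false | false | 5 | 5 |
| 27 | 2383 | 4423 | "had" | 469 | 466 | false | false | 14 | 14 |
| 209 | 6128 | 6234 | "will" | 47 | 45 | false | false | 13 | 12 |
| 986 | 616 | 2240 | "brass" | 7 | 7 | false | false | 4 | 4 |
| 251 | 2410 | 587 | "hands" | 38 | 38 | false | false | 13 | 13 |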
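
As a small worked example of what these columns allow, the sketch below computes a normalized frequency and the share of documents containing a token, using values copied from the summary and the vocab preview above. It assumes that frequency_lower and document_frequency_lower are counts of the lowercased token form, and it normalizes by word_token_count; both are assumptions made for illustration rather than a description of how Conc reports frequencies. The helper names per_10k and document_range are hypothetical.

```python
# Values copied from the summary and vocab preview above.
WORD_TOKEN_COUNT = 59_514
DOCUMENT_COUNT = 15

# (token, frequency_lower, document_frequency_lower)
preview_rows = [
    ("had", 469, 14),
    ("about", 88, 14),
    ("brass", 7, 4),
]


def per_10k(frequency: int, total: int = WORD_TOKEN_COUNT) -> float:
    """Frequency normalized per 10,000 word tokens."""
    return frequency / total * 10_000


def document_range(document_frequency: int, documents: int = DOCUMENT_COUNT) -> float:
    """Share of documents in which the token occurs."""
    return document_frequency / documents


for token, frequency, document_frequency in preview_rows:
    print(f"{token}: {per_10k(frequency):.2f} per 10k word tokens, "
          f"in {document_range(document_frequency):.0%} of documents")
```

Normalizing by token_count instead of word_token_count would give lower values, so the denominator should be chosen to match whatever the figures are being compared against.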

ListCorpus.get_token_count_text

 ListCorpus.get_token_count_text (exclude_punctuation:bool=False)

Get the token count for the corpus, adjusted for any exclusions, along with descriptor text for output.

| Parameter | Type | Default | Details |
| --- | --- | --- | --- |
| exclude_punctuation | bool | False | exclude punctuation tokens from the count |
| Returns | tuple | | token count with adjustments based on exclusions, token descriptor, total descriptor |