
corpus

Create a conc corpus.

Corpus class


source

Corpus

 Corpus (name:str='', description:str='')

Representation of a text corpus, with methods to build, load and save a corpus from a variety of formats, and to work with the corpus data.

Type Default Details
name str name of corpus
description str description of corpus

Build and save a corpus

Conc defines a punctuation token as a token consisting only of punctuation characters. Punctuation characters are defined by combining Python's string.punctuation characters with unicode characters categorised as punctuation (i.e. unicode characters with a general category starting with P) or as currency symbols (i.e. general category Sc). This means, for example, that various forms of dashes and quotation marks are identified as punctuation. It also means that emoticons built from sequences of punctuation characters, like :), are treated as punctuation tokens. Reporting on punctuation is still possible in Conc reports via the relevant parameters. There are still many unicode symbol characters that Conc does not define as punctuation. This may change in future versions of Conc, including the ability to define punctuation strings or exceptions; any changes will be documented.
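
The rule above can be sketched directly with Python's unicodedata module; this is an illustrative reconstruction, not Conc's actual implementation:

import string
import sys
import unicodedata

# string.punctuation plus any unicode character whose general category
# starts with P (punctuation) or is Sc (currency symbol)
punctuation = set(string.punctuation)
for codepoint in range(sys.maxunicode + 1):
    char = chr(codepoint)
    category = unicodedata.category(char)
    if category.startswith('P') or category == 'Sc':
        punctuation.add(char)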

print(len(PUNCTUATION_STRINGS))
print(PUNCTUATION_STRINGS)
890
〃⳺⁈⟮༐﹛𑑋𐫰𐩓܅⸦「꠸⹍⁾꣸,!⦘፣₠⸠‑⁃❵᠃⦏𑙃⸌𑿟៙。𑱃⹔⹉‰܉꧅𑂿𑪡𐽖՞᨟܃𐩒{⸳′𑗐꩜𐩐𑙣꘍𑃁‱𑈹⸙﹩᰿︐𑙢⧼︻𞥞𑗊﹖܀❮𑂾₴₸⁎>׀܋𑪛〟𐾆﹂⹆𑗏₳︗<꘏⦅#⟬⟆¥⳿𐄀𑙦𑇍𑑝༉【⁔₦﹔𑓆⁌︔⹖࿔⦉𐩿﹠𐫲⧘)᪠︽꧋𐾈᭜“⹎𑙩⸋࠷꧞﴿፥༊।𖺗꧇‥᪣࠼﹙₲⟫࠳⸿}⸃⸀𑪟𖬹࿙៚᳀𑪢฿꓿𑇈౷»༈'࠲⦍࠾⸶𑅂︑꩟⁐﹞⦓᠅⦕︾₽෴᭽৳゠𐽘𖺘᳄𖬺࠹༻⦊𑅴𑱅⁖᠇༽၊־︱𑑏‵§𑩄𑚹﹁𐫴᠂"৻܌⹛܊࿒᭞𑱁𐩗﹜︓𖺚⹅;⸆:)𑗂༄𑥄𐄂𑿠᰽𑨿@⹕᥄॰⁋⹃﹒﹝⦆"಄𑗉꣺€؛᭝❭₧𐽙※『_\`𑙫𑩅᭟𝪉⸢❰⸘꣎⦅﹄﹐›𖩯†𐬺❱𖬻᥅܇𑗗‟𝪇𑁇᚜᭾𐩕᪭⦆𑗒⟅₶᠈𑁌⹋་᪢﹗%𑃀⸽⸾٫܂࠺᪥〘$./‚:𐬻⦒﹅᚛᰻‖᪩𐽕〞፠𑙁⸧/𑁋꛳⦗𐺭᱾٬၍᳁‘⸚﹌(⸴⦃⁗❯︕𑥆❬᳂𐬾⸥︴၌⁇❩⟪〕﹇.𑗈᰼𑗔﴾⟨¤⧛】᙮𑅀𑇟⹗꩝⟩؞𒑳‛𖬷₤״〉𐮙⁂⸈⟯፦𑈸\܁¢⧙۔❲՟❪𑈼꡷﹫𖿢᠀⹙⸺𑂻❨〉「₢᭛⃀@؝𒿱𑇅𐩑𑜾᨞։︹៖𑜼}𑻸᯼⌈‣⹈﹡〈꣼༅𐩔[𑇆𞲰⁓𑇛𑥅༼᪨꥟𑅵𖩮𖺙⁍₮𑱄࠰𐬽᯿―፨⟭՚⸡𑧢𖬸︸﹨𐏐…⁝᜶⁊⸹៕⁁⸻࿑꧃₷⹂𑗃؋׃𑙥︷․⁑𑈻₹⸕』⸵𝪋︳_꧄—⁞⸎⹁〙𐬹₭꓾𐬼𑩃⸷⸞𑩀·᳃⁛⹝𑗇܄᠆﹆⟦࿓₯︼[⸇᪫〽𑁈𒑴⳹❴߷₡《።࿐꣏•꧆𑗅𑈽。།„=௹᳅₱⹇⵰⸤⦔𑗁⸔࠽٭𒑱𑗕₩⹀⹓꡴𑅃︶︰‒‾・߸𑱂፡࠵⸨𑑍﹊᪡❳𛲟₍𑊩؍⹏𑗄﹈𐽗᯽᠉𐾇𐡗〔𑑚⌋⸁﹘࠶܍⸱𒑲𑙧⸍꫰-’᠊(⹊⸊+〚𝪊𐮛׆⁙⸗⦌࠱𑪜܆︘₵⦖¢᰾₺︒՝᪤๚༑⸒࡞༇𑙂⦋𑗎]𐫳᐀⸰¶‹⸅﹀‗﹕𑠻༏𑇇𖫵⁜၎‐។꧍」'𐫶𑗑؉″꫟?𑈺៘𐬿᪪꛷*྅꛵﹋૰᜵᛬꙳£﹍𑑎^༆᠁₾⹚*꡶𝪈⳾༌︙₨﹉⸂¿⸐⁅⸸!𑪚𑗌₥꧟𑙤꡵‧⁘߿‡⸜᪬︖𑪞⁏》𑙬〗༒–‶᱿⸖𐫵،𑇞⸓๏՜᯾、؊𐤟꛲·‼〖𐮚៛₰𑁍૱𑙨⸟৽⹄₪𑿝𖭄٪⸼」𑿞᳆꩞׳⸏₼❫⌉࠴⦄;፧॥⸩࠻¥፤𐩖𞥟꧁〝𑙠꧊⧚⁀﹃﹏〛𑁊‸᛫⧽$꤮֏๛֊₫‽₩‴𐄁𐕯⦐𑪠〜⹘⸛༎꧂꫞𑱱꘎꯫⁆𑅁﹟܈᭠꙾𑿿৲᭚𑩂᪦﹪𑂼𑜽𑗓؟”%⹜]¡꛴՛𐫱₻༺꣹𑗋𑩆⳼꤯£⸫𑁉𑑌᠄⁕⦇﹎﷼꧌꫱⸮﹣#߹&‿{︿、𐾉𐮜⸉₣⸲‷𑗖⸣࠸𑗆«⹒᳓༔੶𑱰᛭𑇝₿⸭︺჻⹌︵𑙪𑩁-𒿲᳇⦈꧉꧈𞋿⌊⸬𑙡߾〰︲⁉၏⦎・|,;⸪﹚𑗍။𐤿⸝⸄𐩘&𒑰⁽𑻷⸑⁚〈﹑࿚⳻₎꛶𐎟⦑⟧𑑛?~

spaCy includes space tokens in the vocab for non-destructive tokenisation. Positions of space tokens are stored so they can be filtered out for analysis and reporting.

Tokens consisting of only punctuation are defined as punctuation tokens. These can be removed or included in analysis and reporting.

NOTE: currently, streaming with either sink_parquet or collect(engine='streaming') can break the order of the dataframe (not just whole rows, but within specific columns, leading to misaligned data). Streaming is therefore not used for the build; this will be reassessed as the new Polars streaming functionality matures.


source

Corpus.save_corpus_metadata

 Corpus.save_corpus_metadata ()

Save corpus metadata.


source

Corpus.build

 Corpus.build (save_path:str, iterator:iter,
               model:str='en_core_web_sm', spacy_batch_size:int=500,
               build_process_batch_size:int=5000,
               build_process_cleanup:bool=True)

Build a corpus from an iterator of texts.

Type Default Details
save_path str directory where corpus will be created, a subdirectory will be automatically created with the corpus content
iterator iter iterator of texts
model str en_core_web_sm spaCy model to use for tokenisation
spacy_batch_size int 500 batch size for the spaCy tokenizer
build_process_batch_size int 5000 save in-progress build to disk every n docs
build_process_cleanup bool True remove the build files after the build is complete (set to False to retain them for development and testing)
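
A minimal usage sketch (the texts and save path here are illustrative, not from the Conc docs):

from conc.corpus import Corpus

texts = ['The cat sat on the mat.', 'A dog barked at the cat.']
corpus = Corpus(name='demo', description='A tiny demo corpus')
corpus.build(save_path='/tmp/corpora/', iterator=iter(texts))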

source

Corpus.build_from_files

 Corpus.build_from_files (source_path:str, save_path:str,
                          file_mask:str='*.txt',
                          metadata_file:str|None=None,
                          metadata_file_column:str='file',
                          metadata_columns:list[str]=[],
                          encoding:str='utf-8',
                          model:str='en_core_web_sm',
                          spacy_batch_size:int=1000,
                          build_process_batch_size:int=5000,
                          build_process_cleanup:bool=True)

Build a corpus from text files in a folder.

Type Default Details
source_path str path to folder with text files
save_path str path to save corpus
file_mask str *.txt mask to select files
metadata_file str | None None path to a CSV with metadata
metadata_file_column str file column in metadata file with file names to align texts with metadata
metadata_columns list [] list of column names to import from metadata
encoding str utf-8 encoding of text files
model str en_core_web_sm spaCy model to use for tokenisation
spacy_batch_size int 1000 batch size for the spaCy tokenizer
build_process_batch_size int 5000 save in-progress build to disk every n docs
build_process_cleanup bool True remove the build files after the build is complete (set to False to retain them for development and testing)
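
A hypothetical example pairing text files with a metadata CSV; the paths and column names are illustrative:

corpus = Corpus(name='letters', description='Example letters corpus')
corpus.build_from_files(
    source_path='/data/letters/',                # folder of *.txt files
    save_path='/data/corpora/',
    metadata_file='/data/letters/metadata.csv',  # optional metadata CSV
    metadata_file_column='file',                 # matches texts to rows by file name
    metadata_columns=['author', 'year'],         # metadata columns to import
)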

source

Corpus.build_from_csv

 Corpus.build_from_csv (source_path:str, save_path:str,
                        text_column:str='text',
                        metadata_columns:list[str]=[],
                        encoding:str='utf8', model:str='en_core_web_sm',
                        spacy_batch_size:int=1000,
                        build_process_batch_size:int=5000,
                        build_process_cleanup:bool=True)

Build a corpus from a csv file.

Type Default Details
source_path str path to csv file
save_path str path to save corpus
text_column str text column in csv with text
metadata_columns list [] list of column names to import from csv
encoding str utf8 encoding of csv passed to Polars read_csv, see their documentation
model str en_core_web_sm spaCy model to use for tokenisation
spacy_batch_size int 1000 batch size for the spaCy tokenizer
build_process_batch_size int 5000 save in-progress build to disk every n docs
build_process_cleanup bool True remove the build files after the build is complete (set to False to retain them for development and testing)
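
A hypothetical example; the CSV path and column names are illustrative:

corpus = Corpus(name='reviews', description='Example reviews corpus')
corpus.build_from_csv(
    source_path='/data/reviews.csv',
    save_path='/data/corpora/',
    text_column='text',           # column containing the texts
    metadata_columns=['rating'],  # extra columns to import
)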

Load a corpus


source

Corpus.load

 Corpus.load (corpus_path:str)

Load corpus from disk and load the corresponding spaCy model.

Type Details
corpus_path str path to load corpus
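
For example, using the corpus path from the summaries below (this assumes load returns the loaded Corpus, so it can be assigned directly):

brown = Corpus().load(corpus_path='/home/geoff/data/conc-test-corpora/brown.corpus')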

Information about the corpus


source

Corpus.info

 Corpus.info (include_disk_usage:bool=False, formatted:bool=True)

Return information about the corpus.

Type Default Details
include_disk_usage bool False include size on disk in the output
formatted bool True return formatted output
Returns str formatted information about the corpus

source

Corpus.report

 Corpus.report (include_memory_usage:bool=False)

Get information about the corpus as a result object.

Type Default Details
include_memory_usage bool False include memory usage in output
Returns Result returns Result object with corpus summary information
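
For example, to capture the summary for further use rather than printing it (see the result API page for what Result supports):

result = brown.report(include_memory_usage=True)  # a Result object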

source

Corpus.summary

 Corpus.summary (include_memory_usage:bool=False)

Print information about the corpus in a formatted table.

Type Default Details
include_memory_usage bool False include memory usage in output

You can get summary information on your corpus, including the number of documents, the token count and the number of unique tokens, using the info method. You can also just print the corpus itself.

print(brown) # equivalent to print(brown.info())
┌────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Attribute          ┆ Value                                                                                                                                                                                                                                              │
╞════════════════════╪════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Name               ┆ Brown Corpus                                                                                                                                                                                                                                       │
│ Description        ┆ A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA Revised 1971, Revised and Amplified 1979 │
│                    ┆ http://www.hit.uib.no/icame/brown/bcm.html. This version …                                                                                                                                                                                         │
│ Date Created       ┆ 2025-06-23 13:16:15                                                                                                                                                                                                                                │
│ Conc Version       ┆ 0.1.4                                                                                                                                                                                                                                              │
│ Corpus Path        ┆ /home/geoff/data/conc-test-corpora/brown.corpus                                                                                                                                                                                                    │
│ Document Count     ┆ 500                                                                                                                                                                                                                                                │
│ Token Count        ┆ 1,138,566                                                                                                                                                                                                                                          │
│ Word Token Count   ┆ 980,144                                                                                                                                                                                                                                            │
│ Unique Tokens      ┆ 42,930                                                                                                                                                                                                                                             │
│ Unique Word Tokens ┆ 42,907                                                                                                                                                                                                                                             │
└────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

The info method can also report the disk usage of the corpus by setting the include_disk_usage parameter to True.

print(brown.info(include_disk_usage=True))
┌────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Attribute                  ┆ Value                                                                                                                                                                                                                                              │
╞════════════════════════════╪════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Name                       ┆ Brown Corpus                                                                                                                                                                                                                                       │
│ Description                ┆ A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA Revised 1971, Revised and Amplified 1979 │
│                            ┆ http://www.hit.uib.no/icame/brown/bcm.html. This version …                                                                                                                                                                                         │
│ Date Created               ┆ 2025-06-23 13:16:15                                                                                                                                                                                                                                │
│ Conc Version               ┆ 0.1.4                                                                                                                                                                                                                                              │
│ Corpus Path                ┆ /home/geoff/data/conc-test-corpora/brown.corpus                                                                                                                                                                                                    │
│ Document Count             ┆ 500                                                                                                                                                                                                                                                │
│ Token Count                ┆ 1,138,566                                                                                                                                                                                                                                          │
│ Word Token Count           ┆ 980,144                                                                                                                                                                                                                                            │
│ Unique Tokens              ┆ 42,930                                                                                                                                                                                                                                             │
│ Unique Word Tokens         ┆ 42,907                                                                                                                                                                                                                                             │
│ Corpus Metadata (Mb)       ┆ 0.001                                                                                                                                                                                                                                              │
│ Document Metadata (Mb)     ┆ 0.001                                                                                                                                                                                                                                              │
│ Tokens (Mb)                ┆ 4.468                                                                                                                                                                                                                                              │
│ Vocab (Mb)                 ┆ 0.678                                                                                                                                                                                                                                              │
│ Punctuation Positions (Mb) ┆ 0.425                                                                                                                                                                                                                                              │
│ Space Positions (Mb)       ┆ 0.012                                                                                                                                                                                                                                              │
└────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

You can get the same information in a table format by using the summary method.

brown.summary()
Corpus Summary
Attribute Value
Name Brown Corpus
Description A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA Revised 1971, Revised and Amplified 1979 http://www.hit.uib.no/icame/brown/bcm.html. This version downloaded via NLTK https://www.nltk.org/nltk_data/.
Date Created 2025-06-23 13:16:15
Conc Version 0.1.4
Corpus Path /home/geoff/data/conc-test-corpora/brown.corpus
Document Count 500
Token Count 1,138,566
Word Token Count 980,144
Unique Tokens 42,930
Unique Word Tokens 42,907

Working with tokens

Internally, Conc uses Polars and NumPy vector operations where possible to speed up processing.


source

Corpus.token_ids_to_tokens

 Corpus.token_ids_to_tokens (token_ids:numpy.ndarray|list)

Get token strings for a list of token ids.

Type Details
token_ids numpy.ndarray | list token ids to return token strings for
Returns ndarray return token strings for token ids

source

Corpus.tokens_to_token_ids

 Corpus.tokens_to_token_ids (tokens:list[str]|numpy.ndarray[str])

Convert a list or np.array of token strings to token ids.

Type Details
tokens list[str] | numpy.ndarray[str] list of tokens to get ids for
Returns ndarray array of token ids, 0 for unknown tokens

source

Corpus.token_to_id

 Corpus.token_to_id (token:str)

Get the token id of a token string.

Type Details
token str token to get id for
Returns int return token id (0 if token not found in the corpus)

A list or numpy array of token strings can be converted to a numpy array of token ids using tokens_to_token_ids …

tokens = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
token_ids = brown.tokens_to_token_ids(tokens)
print(token_ids)
[15682 37698 47121 13458   526 16875 22848 25923 23289]

To reverse this use token_ids_to_tokens …

tokens = brown.token_ids_to_tokens(token_ids) # token_ids was set above
print(tokens)
['The' 'quick' 'brown' 'fox' 'jumps' 'over' 'the' 'lazy' 'dog']

The tokens_to_token_ids method will return a 0 for any tokens not in the corpus vocabulary.

tokens = ['some', 'random', 'gazupinfava', 'words']
brown.tokens_to_token_ids(tokens)
array([21572, 28602,     0, 31327])

If zero is passed to token_ids_to_tokens it will return an error token as shown below. A negative value will raise a ValueError.

brown.token_ids_to_tokens([0])
array(['ERROR: not a token'], dtype=object)

The token_to_id method wraps tokens_to_token_ids. You can pass a single token string and get the token id back. As with tokens_to_token_ids, if the token is not in the vocabulary it will return 0.

print(brown.token_to_id('brown')) # returns token id
print(brown.token_to_id('Supercalifragilisticexpialidocious')) # returns 0 if token not in corpus
47121
0

source

Corpus.token_ids_to_sort_order

 Corpus.token_ids_to_sort_order (token_ids:numpy.ndarray|list)

Get the sort order of token strings corresponding to token ids

Type Details
token_ids numpy.ndarray | list token ids to return sort order for
Returns ndarray rank of token ids

import numpy as np

tokens = np.array(['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog'])
token_ids = brown.tokens_to_token_ids(tokens)
sort_order = brown.token_ids_to_sort_order(token_ids)
sorted_tokens = tokens[np.argsort(sort_order)]

print(tokens)
print(token_ids)
print(sort_order)
print(sorted_tokens)
['The' 'quick' 'brown' 'fox' 'jumps' 'over' 'the' 'lazy' 'dog']
[15682 37698 47121 13458   526 16875 22848 25923 23289]
[50086 40359  7940 20497 27663 35982 50087 29054 15849]
['brown' 'dog' 'fox' 'jumps' 'lazy' 'over' 'quick' 'The' 'the']

source

Corpus.get_token_count_text

 Corpus.get_token_count_text (exclude_punctuation:bool=False)

Get the token count for the corpus with adjustments and text for output

Type Default Details
exclude_punctuation bool False exclude punctuation tokens from the count
Returns tuple token count with adjustments based on exclusions, token descriptor, total descriptor
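
A sketch of unpacking the returned tuple (the variable names are illustrative):

token_count, token_descriptor, total_descriptor = brown.get_token_count_text(exclude_punctuation=True)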

Tokenization


source

Corpus.tokenize

 Corpus.tokenize (string:str, simple_indexing=False)

Tokenize a string using the spaCy tokenizer.

Type Default Details
string str string to tokenize
simple_indexing bool False use simple indexing

Work with specific texts in the corpus


source

Corpus.text

 Corpus.text (doc_id:int)

Get a text document

Type Details
doc_id int the id of the document
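
For example (the doc id is illustrative; see the text API page for what the returned document object supports):

doc = brown.text(1)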

Find positions of tokens


source

Corpus.get_tokens_by_index

 Corpus.get_tokens_by_index (index:str='orth_index',
                             exclude_punctuation:bool=False)

Get tokens for a given index.

Type Default Details
index str orth_index index to get tokens from, i.e. 'orth_index', 'lower_index', 'token2doc_index'
exclude_punctuation bool False exclude punctuation tokens from the result (unused currently)
Returns ndarray
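
For example, to get the array of lower-cased token ids for the whole corpus (a sketch):

lower_ids = brown.get_tokens_by_index(index='lower_index')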

source

Corpus.get_ngrams_by_index

 Corpus.get_ngrams_by_index (ngram_length:int, index:str)

Get ngrams for a given index and ngram length.

Type Details
ngram_length int length of ngrams to get
index str index to get tokens from, e.g. 'orth_index', 'lower_index'
Returns ndarray

toy.get_ngrams_by_index(ngram_length=2, index='lower_index')[100:110]
array([[10,  6],
       [ 6, 12],
       [12,  8],
       [ 8, 10],
       [10, 13],
       [13, 15],
       [15, 17],
       [17, 10],
       [10, 11],
       [11, 12]], dtype=uint32)

source

Corpus.get_token_positions

 Corpus.get_token_positions (token_sequence:list[numpy.ndarray],
                             index_id:int)

Get the positions of a token sequence in the corpus.

Type Details
token_sequence list token sequence to get index for
index_id int index to search (i.e. ORTH, LOWER)
Returns ndarray positions of token sequence

token_str = 'dog'
token_sequence, index_id = brown.tokenize(token_str, simple_indexing=True)
token_positions = brown.get_token_positions(token_sequence, index_id)
print(token_positions)
[array([  18833,   18870,   18880,   18950,   18957,   37578,   88691,
        125019,  137037,  137687,  137722,  137731,  137775,  143860,
        188374,  248842,  248982,  249204,  249217,  249243,  249311,
        249337,  249397,  249425,  249535,  250476,  250495,  250554,
        250613,  250645,  250699,  250709,  251033,  252740,  253700,
        255256,  255360,  255532,  330282,  359785,  437987,  437991,
        438046,  438051,  463456,  463485,  463507,  521175,  648316,
        694080,  694129,  694289,  694481,  694760,  695139,  695216,
        695313,  861865,  861872,  863503,  863521,  875531,  875573,
        875660,  887598,  994901, 1012130, 1028088, 1050598, 1050607,
       1052032, 1074911, 1084765, 1086020, 1086052, 1086639, 1104994,
       1128317, 1137426])]

source

Corpus.get_tokens_in_context

 Corpus.get_tokens_in_context (token_positions:numpy.ndarray, index:str,
                               context_length:int=5,
                               position_offset:int=1,
                               position_offset_step:int=1,
                               exclude_punctuation:bool=True,
                               convert_eof:bool=True)

Get tokens in context for given token positions, context length and direction; operates on one side at a time.

Type Default Details
token_positions ndarray Numpy array of token positions in the corpus
index str Index to use - lower_index, orth_index
context_length int 5 Number of context words to consider on each side of the token
position_offset int 1 offset to start retrieving context words: negative is left of the node, positive is right; may need adjusting if sequence_len > 1
position_offset_step int 1 step to move the position offset by; this sets the direction, -1 for left, 1 for right
exclude_punctuation bool True exclude punctuation from the retrieved context
convert_eof bool True if True (used for collocation functionality), contexts containing end-of-file tokens will have the EOF token and any tokens after it set to zero; otherwise the EOF is retained (e.g. False is used for ngrams)
Returns Result
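
Building on the 'dog' example above, a hypothetical call retrieving contexts to the left of each occurrence (parameter choices follow the table above; the negative offset and step select the left side):

left_contexts = brown.get_tokens_in_context(token_positions=token_positions[0],
                                            index='lower_index',
                                            context_length=5,
                                            position_offset=-1,
                                            position_offset_step=-1)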

source

build_test_corpora

 build_test_corpora (source_path:str, save_path:str,
                     force_rebuild:bool=False)

(Deprecated - moved to conc.corpora) Build all test corpora from source files.

Type Default Details
source_path str path to folder with corpora
save_path str path to save corpora
force_rebuild bool False force rebuild of corpora, useful for development and testing

Note: build_sample_corpora was accessible via conc.corpus as build_test_corpora up to version 0.1.1. Calling it this way will raise a deprecation warning. It will be removed in version 1.0.
