core

Helper functions and classes for Conc.

Logging



set_logger_state

 set_logger_state (state:str)

Set the state of the conc logger to either ‘quiet’ or ‘verbose’

Parameter  Type  Details
state      str   ‘quiet’ or ‘verbose’
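
As a rough illustration of how a two-state switch like this could work, the sketch below maps ‘quiet’ and ‘verbose’ onto standard `logging` levels. The logger name `conc_sketch`, the helper `set_state`, and the choice of WARNING and INFO levels are assumptions made for this example, not Conc's actual implementation.

```python
import logging

# Hypothetical logger name; Conc's real logger may be named differently.
logger = logging.getLogger("conc_sketch")

# Assumed mapping from state names to logging levels.
_LEVELS = {"quiet": logging.WARNING, "verbose": logging.INFO}


def set_state(state: str) -> None:
    """Switch the sketch logger between 'quiet' and 'verbose'."""
    if state not in _LEVELS:
        raise ValueError(f"state must be 'quiet' or 'verbose', got {state!r}")
    logger.setLevel(_LEVELS[state])


set_state("verbose")
verbose_shows_info = logger.isEnabledFor(logging.INFO)  # INFO messages pass

set_state("quiet")
quiet_shows_info = logger.isEnabledFor(logging.INFO)  # INFO messages are filtered
```

Rejecting unknown states up front keeps a typo such as 'verbse' from silently leaving the logger in its previous state.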

spaCy



spacy_attribute_name

 spacy_attribute_name (index)

Get the name of a spaCy attribute from its index.

Corpus metadata schema



CorpusMetadata

 CorpusMetadata (name:str, description:str, slug:str, conc_version:str,
                 document_count:int, token_count:int,
                 word_token_count:int, punct_token_count:int,
                 space_token_count:int, unique_tokens:int,
                 unique_word_tokens:int, date_created:str, EOF_TOKEN:int,
                 SPACY_EOF_TOKEN:int, SPACY_MODEL:str,
                 SPACY_MODEL_VERSION:str, punct_tokens:list[int],
                 space_tokens:list[int])

JSON validation schema for corpus metadata

properties = msgspec.json.schema(CorpusMetadata)['$defs']['CorpusMetadata']['properties']
display(properties)
{'name': {'type': 'string'},
 'description': {'type': 'string'},
 'slug': {'type': 'string'},
 'conc_version': {'type': 'string'},
 'document_count': {'type': 'integer'},
 'token_count': {'type': 'integer'},
 'word_token_count': {'type': 'integer'},
 'punct_token_count': {'type': 'integer'},
 'space_token_count': {'type': 'integer'},
 'unique_tokens': {'type': 'integer'},
 'unique_word_tokens': {'type': 'integer'},
 'date_created': {'type': 'string'},
 'EOF_TOKEN': {'type': 'integer'},
 'SPACY_EOF_TOKEN': {'type': 'integer'},
 'SPACY_MODEL': {'type': 'string'},
 'SPACY_MODEL_VERSION': {'type': 'string'},
 'punct_tokens': {'type': 'array', 'items': {'type': 'integer'}},
 'space_tokens': {'type': 'array', 'items': {'type': 'integer'}}}
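
To show how the property types listed above could be used to check a metadata record, here is a minimal sketch that uses only the standard library rather than msgspec. `EXPECTED_TYPES` copies a subset of the fields shown above; the helper `check_metadata` and the sample values are hypothetical and only illustrate the idea.

```python
# Subset of the schema properties shown above, mapped to Python types.
EXPECTED_TYPES = {
    "name": str,
    "slug": str,
    "document_count": int,
    "token_count": int,
    "punct_tokens": list,  # array of integers in the schema
}


def check_metadata(record: dict) -> list[str]:
    """Return a list of problems found in record; an empty list means it passed."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{field}: expected {expected.__name__}")
        elif expected is list and not all(
            isinstance(v, int) and not isinstance(v, bool) for v in value
        ):
            problems.append(f"{field}: expected a list of integers")
    return problems


# Made-up sample values, for illustration only.
good = {"name": "Toy", "slug": "toy", "document_count": 6,
        "token_count": 38, "punct_tokens": [1, 2]}
bad = {"name": "Toy", "slug": "toy", "document_count": "6",
       "punct_tokens": [1, "x"]}

good_problems = check_metadata(good)
bad_problems = check_metadata(bad)
```

In this sketch `good_problems` is empty, while `bad_problems` reports the string-valued `document_count`, the missing `token_count`, and the non-integer entry in `punct_tokens`.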

Get word lists



get_stop_words

 get_stop_words (save_path:str, spacy_model:str='en_core_web_sm')

Get stop words from spaCy and cache to disk

Parameter    Type  Default         Details
save_path    str                   directory to save stop words to; the file name is based on the spaCy model name
spacy_model  str   en_core_web_sm  model to get stop words for

print(get_stop_words(save_path=save_path, spacy_model='en_core_web_sm'))
{'last', 'go', 'whither', 'somewhere', 'former', '’re', 'out', 'take', 'neither', '’d', 'next', 'part', 'though', 'first', 'whence', 'by', 'whether', 'thereafter', 'above', 'but', 'so', 'namely', 'both', 'itself', 'fifteen', 'up', 'while', 'below', 'sixty', 'everything', 'these', 'than', 'nevertheless', 'must', 'be', 'hundred', 'elsewhere', 'anything', '‘m', 'themselves', 'since', 'they', 'any', 'more', 'ours', 'quite', 'where', 'wherever', 'and', 'indeed', 'under', 'beyond', 'ten', 'still', 'n’t', 'anyway', 'ourselves', 'your', 'however', 'he', 'front', 'becomes', 'along', 'wherein', 'nor', 'via', 'really', 'might', 'mine', 'thereby', 'whenever', 'into', 'every', 'been', '‘ll', 'has', 'doing', 'keep', 'off', 'anyhow', 'say', 'twelve', 'now', 'hereby', 'back', 'did', 'four', 'beside', '‘d', 'this', 'else', 'often', 'all', 'toward', 'nobody', 'latter', 'is', 'do', 'perhaps', 'herself', 'me', 'at', 'should', 'alone', 'meanwhile', 'as', 'seeming', 'afterwards', 'whereas', 'their', "'m", 'except', 'regarding', "n't", 'between', 'when', 'full', 'together', 'noone', 'of', 'across', 're', 'how', 'several', 'them', 'make', 'per', 'twenty', 'none', 'get', 'rather', 'sometime', 'who', 'whereby', 'hers', 'were', 'yourself', 'three', 'because', 'to', 'name', 'used', 'made', 'our', 'n‘t', 'much', 'formerly', 'hereafter', 'become', 'least', 'whom', 'she', 'about', 'give', 'hereupon', 'anywhere', 'against', 'please', 'call', 'therefore', 'it', '’m', 'the', 'may', 'thereupon', 'had', 'whereupon', 'seemed', 'are', 'although', 'we', 'myself', 'his', 'him', '’ve', 'does', 'again', 'himself', 'own', 'an', 'within', 'amount', 'for', 'some', 'another', 'after', 'thus', 'down', 'due', 'such', 'us', 'somehow', 'only', 'otherwise', 'yours', 'side', 'other', '‘re', 'nothing', 'behind', 'during', 'serious', "'ve", 'just', 'you', 'put', 'over', 'not', 'yet', 'enough', "'d", 'being', 'further', 'nowhere', 'am', 'bottom', 'or', 'was', 'always', 'others', 'moreover', '’ll', 'my', 'yourselves', 
'whoever', 'unless', 'sometimes', 'throughout', 'from', 'that', 'why', 'also', 'without', 'became', 'once', 'thru', 'around', 'never', 'upon', 'which', 'many', 'well', 'through', 'already', "'re", 'no', 'even', 'using', 'if', 'done', 'eight', 'towards', 'amongst', 'everyone', 'among', 'there', 'will', 'seem', 'few', 'various', 'therein', 'a', 'herein', 'nine', 'ca', "'s", 'thence', 'something', 'then', 'can', 'cannot', 'hence', 'almost', 'same', 'each', 'either', 'could', 'six', 'show', 'move', 'one', 'forty', 'those', 'beforehand', 'whose', '’s', 'besides', 'becoming', 'her', 'fifty', 'seems', 'five', 'would', 'anyone', 'before', '‘ve', 'less', 'mostly', 'too', 'here', 'eleven', 'on', 'onto', 'third', 'everywhere', 'whole', 'very', 'top', 'empty', 'most', 'someone', 'have', 'whereafter', 'ever', 'what', 'latterly', 'two', 'see', 'in', '‘s', 'its', 'whatever', "'ll", 'until', 'with', 'i'}
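
The "cache to disk" behaviour described above can be sketched with the standard library alone. This is an assumption-laden illustration, not Conc's code: the helper `cached_word_list`, the `<model>_stop_words.json` naming scheme, and the `fake_loader` stand-in (used in place of loading a spaCy model) are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def cached_word_list(save_path: str, model_name: str,
                     load_words: Callable[[], set]) -> set:
    """Return the word list for model_name, reading a cached copy if present."""
    # Assumed naming scheme: one JSON file per model name.
    cache_file = Path(save_path) / f"{model_name}_stop_words.json"
    if cache_file.exists():
        return set(json.loads(cache_file.read_text(encoding="utf-8")))
    words = set(load_words())
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(sorted(words)), encoding="utf-8")
    return words


calls = []


def fake_loader() -> set:
    """Stand-in for loading a spaCy model's stop words."""
    calls.append(1)
    return {"the", "a", "of"}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_word_list(tmp, "demo_model", fake_loader)   # loads, then writes the cache
    second = cached_word_list(tmp, "demo_model", fake_loader)  # reads from the cache
```

The second call never invokes the loader, which is the point of caching: the expensive step (loading a model) happens once per model name and directory.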

Access these functions from conc.corpora

Up to version 0.1.1, conc.core included helper functions to list, download and build corpora. These have been moved to the conc.corpora module. Calling them from conc.core triggers a deprecation warning that points to their new location. From Conc version 1.0.0, these functions will only be accessible via conc.corpora.
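
A common way to implement this kind of move is a thin wrapper that warns and then forwards the call. The sketch below shows the general pattern with the standard `warnings` module; the decorator `moved_to`, the stub `list_corpora_stub`, and the use of `DeprecationWarning` are assumptions for illustration, and Conc's own wrappers may differ.

```python
import functools
import warnings


def moved_to(new_location: str):
    """Decorator that warns callers a function now lives at new_location."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated here; call it via {new_location}.",
                DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller's line
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


@moved_to("conc.corpora")
def list_corpora_stub(path: str) -> list:
    """Hypothetical stand-in that returns an empty listing."""
    return []


# Capture warnings so the example can inspect what was emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = list_corpora_stub("/tmp/corpora")
```

In new code, importing the functions from conc.corpora directly avoids the warning altogether.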



list_corpora

 list_corpora (path:str)

(Deprecated - call via conc.corpora) Scan a directory for available corpora

Parameter  Type       Details
path       str        path to load corpus
Returns    DataFrame  DataFrame with path, corpus, corpus name, document count, token count


create_toy_corpus_sources

 create_toy_corpus_sources (source_path:str)

(Deprecated - call via conc.corpora) Create txt files and csv to test build of toy corpus.

Parameter    Type  Details
source_path  str   path to location of sources for building corpora


show_toy_corpus

 show_toy_corpus (csv_path:str)

(Deprecated - call via conc.corpora) Show toy corpus in a table.

Parameter  Type  Details
csv_path   str   path to location of csv for building corpora
Returns    GT


get_nltk_corpus_sources

 get_nltk_corpus_sources (source_path:str)

(Deprecated - call via conc.corpora) Get NLTK corpora as sources for development or testing Conc functionality.

Parameter    Type  Details
source_path  str   path to location of sources for building corpora


get_garden_party

 get_garden_party (source_path:str)

(Deprecated - call via conc.corpora) Get corpus of The Garden Party by Katherine Mansfield for development of Conc and testing Conc functionality.

Parameter    Type  Details
source_path  str   path to location of sources for building corpora


get_large_dataset

 get_large_dataset (source_path:str)

(Deprecated - call via conc.corpora) Get 1 million rows of the dataset at https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset for testing.

Parameter    Type  Details
source_path  str   path to location of sources for building corpora


create_large_dataset_sizes

 create_large_dataset_sizes (source_path:str, sizes:list=[10000, 100000,
                             200000, 500000])

(Deprecated - call via conc.corpora) Create datasets of different sizes from data source retrieved by get_large_dataset for testing.

Parameter    Type  Default                          Details
source_path  str                                    path to location of sources for building corpora
sizes        list  [10000, 100000, 200000, 500000]  list of sizes for test datasets