core

Helper functions and classes for Conc.

Logging

set_logger_state

 set_logger_state (state:str)

Set the state of the conc logger to either ‘quiet’ or ‘verbose’

	Type	Details
state	str	‘quiet’ or ‘verbose’
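
A minimal sketch of how a two-state logger toggle like this could work, using only Python's standard logging module. The logger name `conc_demo`, the function `set_demo_logger_state`, and the mapping of 'quiet' to WARNING and 'verbose' to DEBUG are assumptions for illustration, not Conc's actual implementation.

```python
import logging

# Hypothetical logger name, used for illustration only.
demo_logger = logging.getLogger("conc_demo")

# Assumed mapping from state name to logging level.
_STATE_LEVELS = {"quiet": logging.WARNING, "verbose": logging.DEBUG}


def set_demo_logger_state(state: str) -> None:
    """Set the demo logger level from a state name ('quiet' or 'verbose')."""
    if state not in _STATE_LEVELS:
        raise ValueError(f"state must be 'quiet' or 'verbose', got {state!r}")
    demo_logger.setLevel(_STATE_LEVELS[state])


set_demo_logger_state("verbose")  # level is now DEBUG
set_demo_logger_state("quiet")    # level is now WARNING
```

Rejecting unknown state names with a ValueError keeps a typo such as 'verbos' from silently leaving the logger in its previous state.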

spaCy

spacy_attribute_name

 spacy_attribute_name (index)

Get the name of a spaCy attribute from its index.

Corpus metadata schema

CorpusMetadata

 CorpusMetadata (name:str, description:str, slug:str, conc_version:str,
                 document_count:int, token_count:int,
                 word_token_count:int, punct_token_count:int,
                 space_token_count:int, unique_tokens:int,
                 unique_word_tokens:int, date_created:str, EOF_TOKEN:int,
                 SPACY_EOF_TOKEN:int, SPACY_MODEL:str,
                 SPACY_MODEL_VERSION:str, punct_tokens:list[int],
                 space_tokens:list[int])

JSON validation schema for corpus metadata

properties = msgspec.json.schema(CorpusMetadata)['$defs']['CorpusMetadata']['properties']
display(properties)

{'name': {'type': 'string'},
 'description': {'type': 'string'},
 'slug': {'type': 'string'},
 'conc_version': {'type': 'string'},
 'document_count': {'type': 'integer'},
 'token_count': {'type': 'integer'},
 'word_token_count': {'type': 'integer'},
 'punct_token_count': {'type': 'integer'},
 'space_token_count': {'type': 'integer'},
 'unique_tokens': {'type': 'integer'},
 'unique_word_tokens': {'type': 'integer'},
 'date_created': {'type': 'string'},
 'EOF_TOKEN': {'type': 'integer'},
 'SPACY_EOF_TOKEN': {'type': 'integer'},
 'SPACY_MODEL': {'type': 'string'},
 'SPACY_MODEL_VERSION': {'type': 'string'},
 'punct_tokens': {'type': 'array', 'items': {'type': 'integer'}},
 'space_tokens': {'type': 'array', 'items': {'type': 'integer'}}}
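
To make the role of the schema concrete, here is a hand-rolled, standard-library sketch that checks a metadata dict against a properties mapping like the one above. It is not msgspec's validator; the helper names `matches` and `find_problems` and the sample values are hypothetical, and only a subset of the fields is used.

```python
# A subset of the properties mapping shown above.
properties = {
    "name": {"type": "string"},
    "document_count": {"type": "integer"},
    "punct_tokens": {"type": "array", "items": {"type": "integer"}},
}


def matches(value, spec) -> bool:
    """Return True if value conforms to a simple JSON-schema-style spec."""
    kind = spec["type"]
    if kind == "string":
        return isinstance(value, str)
    if kind == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if kind == "array":
        return isinstance(value, list) and all(matches(v, spec["items"]) for v in value)
    return False


def find_problems(metadata: dict, props: dict) -> list[str]:
    """List fields that are missing or have the wrong type."""
    problems = []
    for field, spec in props.items():
        if field not in metadata:
            problems.append(f"missing: {field}")
        elif not matches(metadata[field], spec):
            problems.append(f"wrong type: {field}")
    return problems


# Illustrative values only, not taken from a real corpus.
good = {"name": "Toy Corpus", "document_count": 6, "punct_tokens": [3, 7]}
bad = {"name": "Toy Corpus", "document_count": "6"}

print(find_problems(good, properties))  # []
print(find_problems(bad, properties))   # ['wrong type: document_count', 'missing: punct_tokens']
```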

Get word lists

get_stop_words

 get_stop_words (save_path:str, spacy_model:str='en_core_web_sm')

Get stop words from spaCy and cache them to disk.

	Type	Default	Details
save_path	str		directory to save stop words to; the file name is derived from the spaCy model name
spacy_model	str	en_core_web_sm	model to get stop words for

print(get_stop_words(save_path = save_path, spacy_model='en_core_web_sm'))

["'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'both', 'bottom', 'but', 'by', 'ca', 'call', 'can', 'cannot', 'could', 'did', 'do', 'does', 'doing', 'done', 'down', 'due', 'during', 'each', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i', 'if', 'in', 'indeed', 'into', 'is', 'it', 'its', 'itself', 'just', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'made', 'make', 'many', 'may', 'me', 'meanwhile', 'might', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', "n't", 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'n‘t', 'n’t', 'of', 'off', 'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'part', 'per', 'perhaps', 'please', 'put', 'quite', 'rather', 're', 'really', 'regarding', 'same', 'say', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side', 'since', 'six', 'sixty', 'so', 'some', 'somehow', 'someone', 'something', 'sometime', 
'sometimes', 'somewhere', 'still', 'such', 'take', 'ten', 'than', 'that', 'the', 'their', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these', 'they', 'third', 'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 'under', 'unless', 'until', 'up', 'upon', 'us', 'used', 'using', 'various', 'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves', '‘d', '‘ll', '‘m', '‘re', '‘s', '‘ve', '’d', '’ll', '’m', '’re', '’s', '’ve']
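
The caching behaviour described for get_stop_words can be sketched with the standard library alone. This is an illustration of the compute-once, read-from-disk pattern, not Conc's code: `load_stop_words_for` is a hypothetical stand-in for the spaCy lookup, and the `<model>_stop_words.json` file name is an assumed naming scheme.

```python
import json
import os
import tempfile


def load_stop_words_for(model_name: str) -> list[str]:
    """Hypothetical stand-in for fetching stop words from a spaCy model."""
    return sorted({"the", "a", "of"})


def cached_stop_words(save_path: str, model_name: str = "en_core_web_sm") -> list[str]:
    """Return stop words, reading from a per-model cache file when present."""
    # Assumed file naming: one JSON file per model inside save_path.
    cache_file = os.path.join(save_path, f"{model_name}_stop_words.json")
    if os.path.exists(cache_file):
        with open(cache_file, encoding="utf-8") as f:
            return json.load(f)
    words = load_stop_words_for(model_name)
    os.makedirs(save_path, exist_ok=True)
    with open(cache_file, "w", encoding="utf-8") as f:
        json.dump(words, f)
    return words


with tempfile.TemporaryDirectory() as tmp:
    first = cached_stop_words(tmp)   # computes the list and writes the cache file
    second = cached_stop_words(tmp)  # reads the list back from the cache file
    print(first == second)           # True
```

Keying the file name on the model name lets several models share one save_path without overwriting each other's lists.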

Access these functions from conc.corpora

Up to version 0.1.1, conc.core included helper functions to list, download and build corpora. These have moved to the conc.corpora module. Calling them from conc.core triggers a deprecation warning that notes their new location. By Conc version 1.0.0, these functions will be accessible only via conc.corpora.
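
One common way to keep a moved function callable while warning about its new home is a small decorator around the old name. The sketch below shows that general pattern with the standard warnings module; the decorator `moved_to` and the placeholder `list_corpora_demo` are hypothetical, and Conc's own warning mechanism may differ.

```python
import functools
import warnings


def moved_to(new_location: str):
    """Decorator that warns callers a function has moved to new_location."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated here; call it via {new_location}.",
                DeprecationWarning,
                stacklevel=2,  # point the warning at the caller, not the wrapper
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


@moved_to("conc.corpora")
def list_corpora_demo(path: str) -> list[str]:
    """Hypothetical placeholder; returns an empty listing."""
    return []


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    list_corpora_demo("some/path")
    print(caught[0].category.__name__, "-", caught[0].message)
```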

list_corpora

 list_corpora (path:str)

(Deprecated - call via conc.corpora) Scan a directory for available corpora

	Type	Details
path	str	path to the directory to scan for corpora
Returns	DataFrame	DataFrame with path, corpus, corpus name, document count, token count

create_toy_corpus_sources

 create_toy_corpus_sources (source_path:str)

(Deprecated - call via conc.corpora) Create txt files and csv to test build of toy corpus.

	Type	Details
source_path	str	path to location of sources for building corpora

show_toy_corpus

 show_toy_corpus (csv_path:str)

(Deprecated - call via conc.corpora) Show toy corpus in a table.

	Type	Details
csv_path	str	path to location of csv for building corpora
Returns	GT

get_nltk_corpus_sources

 get_nltk_corpus_sources (source_path:str)

(Deprecated - call via conc.corpora) Get NLTK corpora as sources for development or testing Conc functionality.

	Type	Details
source_path	str	path to location of sources for building corpora

get_garden_party

 get_garden_party (source_path:str)

(Deprecated - call via conc.corpora) Get corpus of The Garden Party by Katherine Mansfield for development of Conc and testing Conc functionality.

	Type	Details
source_path	str	path to location of sources for building corpora

get_large_dataset

 get_large_dataset (source_path:str)

(Deprecated - call via conc.corpora) Get 1m rows of https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset for testing.

	Type	Details
source_path	str	path to location of sources for building corpora

create_large_dataset_sizes

 create_large_dataset_sizes (source_path:str, sizes:list=[10000, 100000,
                             200000, 500000])

(Deprecated - call via conc.corpora) Create datasets of different sizes from data source retrieved by get_large_dataset for testing.

	Type	Default	Details
source_path	str		path to location of sources for building corpora
sizes	list	[10000, 100000, 200000, 500000]	list of sizes for test datasets
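
As a rough illustration of deriving several test datasets from one source, the sketch below slices an in-memory list of rows to each requested size. The function `make_subsets` is hypothetical. Whether Conc takes the first rows or samples them, and how it handles sizes larger than the source, is not stated above; this sketch assumes the first rows and skips oversized requests.

```python
def make_subsets(rows: list, sizes: list[int]) -> dict[int, list]:
    """Map each requested size to the first `size` rows.

    Sizes larger than the available data are skipped (an assumption).
    """
    return {size: rows[:size] for size in sizes if size <= len(rows)}


# Illustrative stand-in data, not the congressional speeches dataset.
rows = [f"speech {i}" for i in range(1000)]
subsets = make_subsets(rows, [10, 100, 5000])
print({size: len(subset) for size, subset in subsets.items()})  # {10: 10, 100: 100}
```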