
corpus

Create a conc corpus.

Corpus class


source

Corpus

 Corpus (name:str='', description:str='')

Representation of a text corpus, with methods to build, load and save a corpus from a variety of formats, and to work with the corpus data.

Type Default Details
name str name of corpus
description str description of corpus

Build and save a corpus

Conc defines a punctuation token as a token consisting only of punctuation characters. Punctuation characters are defined by combining Python's string.punctuation characters with unicode characters categorised as punctuation (i.e. unicode characters with a general category starting with P) or as currency symbols (i.e. general category Sc). This means, for example, that various forms of dashes and quotation marks are identified as punctuation. It also means that emoticons built from sequences of punctuation characters, like :), are treated as punctuation tokens. Reporting on punctuation is still possible in Conc reports via the relevant parameters. There are still many unicode symbol characters that Conc does not define as punctuation. This may change in future versions of Conc, including the ability to define punctuation strings or exceptions; any changes will be documented.
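
The rule above can be sketched directly with Python's unicodedata module; this is an illustrative reconstruction, not Conc's actual implementation:

import string
import sys
import unicodedata

# string.punctuation plus any unicode character whose general category
# starts with P (punctuation) or is Sc (currency symbol)
punctuation = set(string.punctuation)
for codepoint in range(sys.maxunicode + 1):
    char = chr(codepoint)
    category = unicodedata.category(char)
    if category.startswith('P') or category == 'Sc':
        punctuation.add(char)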

print(len(PUNCTUATION_STRINGS))
print(PUNCTUATION_STRINGS)
890
〃⳺⁈⟮༐﹛𑑋𐫰𐩓܅⸦「꠸⹍⁾꣸,!⦘፣₠⸠‑⁃❵᠃⦏𑙃⸌𑿟៙。𑱃⹔⹉‰܉꧅𑂿𑪡𐽖՞᨟܃𐩒{⸳′𑗐꩜𐩐𑙣꘍𑃁‱𑈹⸙﹩᰿︐𑙢⧼︻𞥞𑗊﹖܀❮𑂾₴₸⁎>׀܋𑪛〟𐾆﹂⹆𑗏₳︗<꘏⦅#⟬⟆¥⳿𐄀𑙦𑇍𑑝༉【⁔₦﹔𑓆⁌︔⹖࿔⦉𐩿﹠𐫲⧘)᪠︽꧋𐾈᭜“⹎𑙩⸋࠷꧞﴿፥༊।𖺗꧇‥᪣࠼﹙₲⟫࠳⸿}⸃⸀𑪟𖬹࿙៚᳀𑪢฿꓿𑇈౷»༈'࠲⦍࠾⸶𑅂︑꩟⁐﹞⦓᠅⦕︾₽෴᭽৳゠𐽘𖺘᳄𖬺࠹༻⦊𑅴𑱅⁖᠇༽၊־︱𑑏‵§𑩄𑚹﹁𐫴᠂"৻܌⹛܊࿒᭞𑱁𐩗﹜︓𖺚⹅;⸆:)𑗂༄𑥄𐄂𑿠᰽𑨿@⹕᥄॰⁋⹃﹒﹝⦆"಄𑗉꣺€؛᭝❭₧𐽙※『_\`𑙫𑩅᭟𝪉⸢❰⸘꣎⦅﹄﹐›𖩯†𐬺❱𖬻᥅܇𑗗‟𝪇𑁇᚜᭾𐩕᪭⦆𑗒⟅₶᠈𑁌⹋་᪢﹗%𑃀⸽⸾٫܂࠺᪥〘$./‚:𐬻⦒﹅᚛᰻‖᪩𐽕〞፠𑙁⸧/𑁋꛳⦗𐺭᱾٬၍᳁‘⸚﹌(⸴⦃⁗❯︕𑥆❬᳂𐬾⸥︴၌⁇❩⟪〕﹇.𑗈᰼𑗔﴾⟨¤⧛】᙮𑅀𑇟⹗꩝⟩؞𒑳‛𖬷₤״〉𐮙⁂⸈⟯፦𑈸\܁¢⧙۔❲՟❪𑈼꡷﹫𖿢᠀⹙⸺𑂻❨〉「₢᭛⃀@؝𒿱𑇅𐩑𑜾᨞։︹៖𑜼}𑻸᯼⌈‣⹈﹡〈꣼༅𐩔[𑇆𞲰⁓𑇛𑥅༼᪨꥟𑅵𖩮𖺙⁍₮𑱄࠰𐬽᯿―፨⟭՚⸡𑧢𖬸︸﹨𐏐…⁝᜶⁊⸹៕⁁⸻࿑꧃₷⹂𑗃؋׃𑙥︷․⁑𑈻₹⸕』⸵𝪋︳_꧄—⁞⸎⹁〙𐬹₭꓾𐬼𑩃⸷⸞𑩀·᳃⁛⹝𑗇܄᠆﹆⟦࿓₯︼[⸇᪫〽𑁈𒑴⳹❴߷₡《።࿐꣏•꧆𑗅𑈽。།„=௹᳅₱⹇⵰⸤⦔𑗁⸔࠽٭𒑱𑗕₩⹀⹓꡴𑅃︶︰‒‾・߸𑱂፡࠵⸨𑑍﹊᪡❳𛲟₍𑊩؍⹏𑗄﹈𐽗᯽᠉𐾇𐡗〔𑑚⌋⸁﹘࠶܍⸱𒑲𑙧⸍꫰-’᠊(⹊⸊+〚𝪊𐮛׆⁙⸗⦌࠱𑪜܆︘₵⦖¢᰾₺︒՝᪤๚༑⸒࡞༇𑙂⦋𑗎]𐫳᐀⸰¶‹⸅﹀‗﹕𑠻༏𑇇𖫵⁜၎‐។꧍」'𐫶𑗑؉″꫟?𑈺៘𐬿᪪꛷*྅꛵﹋૰᜵᛬꙳£﹍𑑎^༆᠁₾⹚*꡶𝪈⳾༌︙₨﹉⸂¿⸐⁅⸸!𑪚𑗌₥꧟𑙤꡵‧⁘߿‡⸜᪬︖𑪞⁏》𑙬〗༒–‶᱿⸖𐫵،𑇞⸓๏՜᯾、؊𐤟꛲·‼〖𐮚៛₰𑁍૱𑙨⸟৽⹄₪𑿝𖭄٪⸼」𑿞᳆꩞׳⸏₼❫⌉࠴⦄;፧॥⸩࠻¥፤𐩖𞥟꧁〝𑙠꧊⧚⁀﹃﹏〛𑁊‸᛫⧽$꤮֏๛֊₫‽₩‴𐄁𐕯⦐𑪠〜⹘⸛༎꧂꫞𑱱꘎꯫⁆𑅁﹟܈᭠꙾𑿿৲᭚𑩂᪦﹪𑂼𑜽𑗓؟”%⹜]¡꛴՛𐫱₻༺꣹𑗋𑩆⳼꤯£⸫𑁉𑑌᠄⁕⦇﹎﷼꧌꫱⸮﹣#߹&‿{︿、𐾉𐮜⸉₣⸲‷𑗖⸣࠸𑗆«⹒᳓༔੶𑱰᛭𑇝₿⸭︺჻⹌︵𑙪𑩁-𒿲᳇⦈꧉꧈𞋿⌊⸬𑙡߾〰︲⁉၏⦎・|,;⸪﹚𑗍။𐤿⸝⸄𐩘&𒑰⁽𑻷⸑⁚〈﹑࿚⳻₎꛶𐎟⦑⟧𑑛?~

spaCy includes space tokens in the vocab for non-destructive tokenisation. Positions of space tokens are stored so they can be filtered out for analysis and reporting.

Tokens consisting of only punctuation are defined as punctuation tokens. These can be removed or included in analysis and reporting.

NOTE: currently, streaming with either sink_parquet or collect(engine='streaming') can break the order of the dataframe (not just whole rows, but within specific columns, leading to misaligned data). Streaming is therefore not used for the build; this will be reassessed as the new Polars streaming functionality matures.


source

Corpus.save_corpus_metadata

 Corpus.save_corpus_metadata ()

Save corpus metadata.


source

Corpus.build

 Corpus.build (save_path:str, iterator:iter,
               model:str='en_core_web_sm', spacy_batch_size:int=500,
               build_process_batch_size:int=5000,
               build_process_cleanup:bool=True)

Build a corpus from an iterator of texts.

Type Default Details
save_path str directory where corpus will be created, a subdirectory will be automatically created with the corpus content
iterator iter iterator of texts
model str en_core_web_sm spaCy model to use for tokenisation
spacy_batch_size int 500 batch size for the spaCy tokenizer
build_process_batch_size int 5000 save in-progress build to disk every n docs
build_process_cleanup bool True remove the build files after the build is complete (set to False to retain them for development and testing)
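
A minimal usage sketch (the texts and save path here are illustrative, not from the Conc docs):

from conc.corpus import Corpus

texts = ['The cat sat on the mat.', 'A dog barked at the cat.']
corpus = Corpus(name='demo', description='A tiny demo corpus')
corpus.build(save_path='/tmp/corpora/', iterator=iter(texts))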

source

Corpus.build_from_files

 Corpus.build_from_files (source_path:str, save_path:str,
                          file_mask:str='*.txt',
                          metadata_file:str|None=None,
                          metadata_file_column:str='file',
                          metadata_columns:list[str]=[],
                          encoding:str='utf-8',
                          model:str='en_core_web_sm',
                          spacy_batch_size:int=1000,
                          build_process_batch_size:int=5000,
                          build_process_cleanup:bool=True)

Build a corpus from text files in a folder.

Type Default Details
source_path str path to folder with text files
save_path str path to save corpus
file_mask str *.txt mask to select files
metadata_file str | None None path to a CSV with metadata
metadata_file_column str file column in metadata file with file names to align texts with metadata
metadata_columns list [] list of column names to import from metadata
encoding str utf-8 encoding of text files
model str en_core_web_sm spaCy model to use for tokenisation
spacy_batch_size int 1000 batch size for the spaCy tokenizer
build_process_batch_size int 5000 save in-progress build to disk every n docs
build_process_cleanup bool True remove the build files after the build is complete (set to False to retain them for development and testing)
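
A hypothetical example pairing text files with a metadata CSV; the paths and column names are illustrative:

corpus = Corpus(name='letters', description='Example letters corpus')
corpus.build_from_files(
    source_path='/data/letters/',                # folder of *.txt files
    save_path='/data/corpora/',
    metadata_file='/data/letters/metadata.csv',  # optional metadata CSV
    metadata_file_column='file',                 # matches texts to rows by file name
    metadata_columns=['author', 'year'],         # metadata columns to import
)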

source

Corpus.build_from_csv

 Corpus.build_from_csv (source_path:str, save_path:str,
                        text_column:str='text',
                        metadata_columns:list[str]=[],
                        encoding:str='utf8', model:str='en_core_web_sm',
                        spacy_batch_size:int=1000,
                        build_process_batch_size:int=5000,
                        build_process_cleanup:bool=True)

Build a corpus from a csv file.

Type Default Details
source_path str path to csv file
save_path str path to save corpus
text_column str text column in csv with text
metadata_columns list [] list of column names to import from csv
encoding str utf8 encoding of csv passed to Polars read_csv, see their documentation
model str en_core_web_sm spaCy model to use for tokenisation
spacy_batch_size int 1000 batch size for the spaCy tokenizer
build_process_batch_size int 5000 save in-progress build to disk every n docs
build_process_cleanup bool True remove the build files after the build is complete (set to False to retain them for development and testing)
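
A hypothetical example; the CSV path and column names are illustrative:

corpus = Corpus(name='reviews', description='Example reviews corpus')
corpus.build_from_csv(
    source_path='/data/reviews.csv',
    save_path='/data/corpora/',
    text_column='text',           # column containing the texts
    metadata_columns=['rating'],  # extra columns to import
)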

Load a corpus


source

Corpus.load

 Corpus.load (corpus_path:str)

Load corpus from disk and load the corresponding spaCy model.

Type Details
corpus_path str path to load corpus
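
For example, using the corpus path from the summaries below (this assumes load returns the loaded Corpus, so it can be assigned directly):

brown = Corpus().load(corpus_path='/home/geoff/data/conc-test-corpora/brown.corpus')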

Information about the corpus


source

Corpus.info

 Corpus.info (include_disk_usage:bool=False, formatted:bool=True)

Return information about the corpus.

Type Default Details
include_disk_usage bool False include size on disk in the output
formatted bool True return formatted output
Returns str formatted information about the corpus

source

Corpus.report

 Corpus.report (include_memory_usage:bool=False)

Get information about the corpus as a result object.

Type Default Details
include_memory_usage bool False include memory usage in output
Returns Result returns Result object with corpus summary information
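
For example, to capture the summary for further use rather than printing it (see the result API page for what Result supports):

result = brown.report(include_memory_usage=True)  # a Result object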

source

Corpus.summary

 Corpus.summary (include_memory_usage:bool=False)

Print information about the corpus in a formatted table.

Type Default Details
include_memory_usage bool False include memory usage in output

You can get summary information on your corpus, including the number of documents, the token count and the number of unique tokens, using the info method. You can also just print the corpus itself.

print(brown) # equivalent to print(brown.info())
┌────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Attribute          ┆ Value                                                                                                                                                                                                                                              │
╞════════════════════╪════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Name               ┆ Brown Corpus                                                                                                                                                                                                                                       │
│ Description        ┆ A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA Revised 1971, Revised and Amplified 1979 │
│                    ┆ http://www.hit.uib.no/icame/brown/bcm.html. This version …                                                                                                                                                                                         │
│ Date Created       ┆ 2025-06-23 13:16:15                                                                                                                                                                                                                                │
│ Conc Version       ┆ 0.1.4                                                                                                                                                                                                                                              │
│ Corpus Path        ┆ /home/geoff/data/conc-test-corpora/brown.corpus                                                                                                                                                                                                    │
│ Document Count     ┆ 500                                                                                                                                                                                                                                                │
│ Token Count        ┆ 1,138,566                                                                                                                                                                                                                                          │
│ Word Token Count   ┆ 980,144                                                                                                                                                                                                                                            │
│ Unique Tokens      ┆ 42,930                                                                                                                                                                                                                                             │
│ Unique Word Tokens ┆ 42,907                                                                                                                                                                                                                                             │
└────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

The info method can also report the disk usage of the corpus by setting the include_disk_usage parameter to True.

print(brown.info(include_disk_usage=True))
┌────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Attribute                  ┆ Value                                                                                                                                                                                                                                              │
╞════════════════════════════╪════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Name                       ┆ Brown Corpus                                                                                                                                                                                                                                       │
│ Description                ┆ A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA Revised 1971, Revised and Amplified 1979 │
│                            ┆ http://www.hit.uib.no/icame/brown/bcm.html. This version …                                                                                                                                                                                         │
│ Date Created               ┆ 2025-06-23 13:16:15                                                                                                                                                                                                                                │
│ Conc Version               ┆ 0.1.4                                                                                                                                                                                                                                              │
│ Corpus Path                ┆ /home/geoff/data/conc-test-corpora/brown.corpus                                                                                                                                                                                                    │
│ Document Count             ┆ 500                                                                                                                                                                                                                                                │
│ Token Count                ┆ 1,138,566                                                                                                                                                                                                                                          │
│ Word Token Count           ┆ 980,144                                                                                                                                                                                                                                            │
│ Unique Tokens              ┆ 42,930                                                                                                                                                                                                                                             │
│ Unique Word Tokens         ┆ 42,907                                                                                                                                                                                                                                             │
│ Corpus Metadata (Mb)       ┆ 0.001                                                                                                                                                                                                                                              │
│ Document Metadata (Mb)     ┆ 0.001                                                                                                                                                                                                                                              │
│ Tokens (Mb)                ┆ 4.468                                                                                                                                                                                                                                              │
│ Vocab (Mb)                 ┆ 0.678                                                                                                                                                                                                                                              │
│ Punctuation Positions (Mb) ┆ 0.425                                                                                                                                                                                                                                              │
│ Space Positions (Mb)       ┆ 0.012                                                                                                                                                                                                                                              │
└────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

You can get the same information in a table format by using the summary method.

brown.summary()
Corpus Summary
Attribute Value
Name Brown Corpus
Description A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA Revised 1971, Revised and Amplified 1979 http://www.hit.uib.no/icame/brown/bcm.html. This version downloaded via NLTK https://www.nltk.org/nltk_data/.
Date Created 2025-06-23 13:16:15
Conc Version 0.1.4
Corpus Path /home/geoff/data/conc-test-corpora/brown.corpus
Document Count 500
Token Count 1,138,566
Word Token Count 980,144
Unique Tokens 42,930
Unique Word Tokens 42,907

Working with tokens

Internally, Conc uses Polars and NumPy vector operations where possible to speed up processing.


source

Corpus.token_ids_to_tokens

 Corpus.token_ids_to_tokens (token_ids:numpy.ndarray|list)

Get token strings for a list of token ids.

Type Details
token_ids numpy.ndarray | list token ids to return token strings for
Returns ndarray return token strings for token ids

source

Corpus.tokens_to_token_ids

 Corpus.tokens_to_token_ids (tokens:list[str]|numpy.ndarray[str])

Convert a list or np.array of token strings to token ids.

Type Details
tokens list[str] | numpy.ndarray[str] list of tokens to get ids for
Returns ndarray array of token ids, 0 for unknown tokens

source

Corpus.token_to_id

 Corpus.token_to_id (token:str)

Get the token id of a token string.

Type Details
token str token to get id for
Returns int return token id (0 if token not found in the corpus)

A list or numpy array of token strings can be converted to a numpy array of token ids using tokens_to_token_ids …

tokens = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
token_ids = brown.tokens_to_token_ids(tokens)
print(token_ids)
[15682 37698 47121 13458   526 16875 22848 25923 23289]

To reverse this use token_ids_to_tokens …

tokens = brown.token_ids_to_tokens(token_ids) # token_ids was set above
print(tokens)
['The' 'quick' 'brown' 'fox' 'jumps' 'over' 'the' 'lazy' 'dog']

The tokens_to_token_ids method will return a 0 for any tokens not in the corpus vocabulary.

tokens = ['some', 'random', 'gazupinfava', 'words']
brown.tokens_to_token_ids(tokens)
array([21572, 28602,     0, 31327])

If zero is passed to token_ids_to_tokens it will return an error token as shown below. A negative value will raise a ValueError.

brown.token_ids_to_tokens([0])
array(['ERROR: not a token'], dtype=object)

The token_to_id method wraps tokens_to_token_ids. You can pass a single token string and get the token id back. As with tokens_to_token_ids, if the token is not in the vocabulary it will return 0.

print(brown.token_to_id('brown')) # returns token id
print(brown.token_to_id('Supercalifragilisticexpialidocious')) # returns 0 if token not in corpus
47121
0

source

Corpus.token_ids_to_sort_order

 Corpus.token_ids_to_sort_order (token_ids:numpy.ndarray|list)

Get the sort order of token strings corresponding to token ids

Type Details
token_ids numpy.ndarray | list token ids to return sort order for
Returns ndarray rank of token ids

import numpy as np

tokens = np.array(['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog'])
token_ids = brown.tokens_to_token_ids(tokens)
sort_order = brown.token_ids_to_sort_order(token_ids)
sorted_tokens = tokens[np.argsort(sort_order)]

print(tokens)
print(token_ids)
print(sort_order)
print(sorted_tokens)
['The' 'quick' 'brown' 'fox' 'jumps' 'over' 'the' 'lazy' 'dog']
[15682 37698 47121 13458   526 16875 22848 25923 23289]
[50086 40359  7940 20497 27663 35982 50087 29054 15849]
['brown' 'dog' 'fox' 'jumps' 'lazy' 'over' 'quick' 'The' 'the']

source

Corpus.get_token_count_text

 Corpus.get_token_count_text (exclude_punctuation:bool=False)

Get the token count for the corpus with adjustments and text for output

Type Default Details
exclude_punctuation bool False exclude punctuation tokens from the count
Returns tuple token count with adjustments based on exclusions, token descriptor, total descriptor
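
A sketch of unpacking the returned tuple (the variable names are illustrative):

token_count, token_descriptor, total_descriptor = brown.get_token_count_text(exclude_punctuation=True)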

Tokenization


source

Corpus.tokenize

 Corpus.tokenize (string:str, simple_indexing=False)

Tokenize a string using the spaCy tokenizer.

Type Default Details
string str string to tokenize
simple_indexing bool False use simple indexing

Work with specific texts in the corpus


source

Corpus.text

 Corpus.text (doc_id:int)

Get a text document

Type Details
doc_id int the id of the document
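
For example (the doc id is illustrative; see the text API page for what the returned document object supports):

doc = brown.text(1)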

Find positions of tokens


source

Corpus.get_tokens_by_index

 Corpus.get_tokens_by_index (index:str='orth_index',
                             exclude_punctuation:bool=False)

Get tokens for a given index.

Type Default Details
index str orth_index index to get tokens from, i.e. 'orth_index', 'lower_index', 'token2doc_index'
exclude_punctuation bool False exclude punctuation tokens from the result (unused currently)
Returns ndarray
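
For example, to get the array of lower-cased token ids for the whole corpus (a sketch):

lower_ids = brown.get_tokens_by_index(index='lower_index')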

source

Corpus.get_ngrams_by_index

 Corpus.get_ngrams_by_index (ngram_length:int, index:str)

Get ngrams for a given index and ngram length.

Type Details
ngram_length int length of ngrams to get
index str index to get tokens from, e.g. 'orth_index', 'lower_index'
Returns ndarray

toy.get_ngrams_by_index(ngram_length=2, index='lower_index')[100:110]
array([[10,  6],
       [ 6, 12],
       [12,  8],
       [ 8, 10],
       [10, 13],
       [13, 15],
       [15, 17],
       [17, 10],
       [10, 11],
       [11, 12]], dtype=uint32)

source

Corpus.get_token_positions

 Corpus.get_token_positions (token_sequence:list[numpy.ndarray],
                             index_id:int)

Get the positions of a token sequence in the corpus.

Type Details
token_sequence list token sequence to get index for
index_id int index to search (i.e. ORTH, LOWER)
Returns ndarray positions of token sequence

token_str = 'dog'
token_sequence, index_id = brown.tokenize(token_str, simple_indexing=True)
token_positions = brown.get_token_positions(token_sequence, index_id)
print(token_positions)
[array([  18833,   18870,   18880,   18950,   18957,   37578,   88691,
        125019,  137037,  137687,  137722,  137731,  137775,  143860,
        188374,  248842,  248982,  249204,  249217,  249243,  249311,
        249337,  249397,  249425,  249535,  250476,  250495,  250554,
        250613,  250645,  250699,  250709,  251033,  252740,  253700,
        255256,  255360,  255532,  330282,  359785,  437987,  437991,
        438046,  438051,  463456,  463485,  463507,  521175,  648316,
        694080,  694129,  694289,  694481,  694760,  695139,  695216,
        695313,  861865,  861872,  863503,  863521,  875531,  875573,
        875660,  887598,  994901, 1012130, 1028088, 1050598, 1050607,
       1052032, 1074911, 1084765, 1086020, 1086052, 1086639, 1104994,
       1128317, 1137426])]

source

Corpus.get_tokens_in_context

 Corpus.get_tokens_in_context (token_positions:numpy.ndarray, index:str,
                               context_length:int=5,
                               position_offset:int=1,
                               position_offset_step:int=1,
                               exclude_punctuation:bool=True,
                               convert_eof:bool=True)

Get tokens in context for given token positions, context length and direction; operates on one side at a time.

Type Default Details
token_positions ndarray Numpy array of token positions in the corpus
index str Index to use - lower_index, orth_index
context_length int 5 Number of context words to consider on each side of the token
position_offset int 1 offset to start retrieving context words: negative is left of the node, positive is right; may need adjusting if sequence_len > 1
position_offset_step int 1 step to move the position offset by; this sets the direction, -1 for left, 1 for right
exclude_punctuation bool True exclude punctuation from the retrieved context
convert_eof bool True if True (used for collocation functionality), contexts containing end-of-file tokens will have the EOF token and any tokens after it set to zero; otherwise the EOF is retained (e.g. False is used for ngrams)
Returns Result
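
Building on the 'dog' example above, a hypothetical call retrieving contexts to the left of each occurrence (parameter choices follow the table above; the negative offset and step select the left side):

left_contexts = brown.get_tokens_in_context(token_positions=token_positions[0],
                                            index='lower_index',
                                            context_length=5,
                                            position_offset=-1,
                                            position_offset_step=-1)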

source

build_test_corpora

 build_test_corpora (source_path:str, save_path:str,
                     force_rebuild:bool=False)

(Deprecated - moved to conc.corpora) Build all test corpora from source files.

Type Default Details
source_path str path to folder with corpora
save_path str path to save corpora
force_rebuild bool False force rebuild of corpora, useful for development and testing

Note: build_sample_corpora was accessible via conc.corpus as build_test_corpora up to version 0.1.1. Calling it this way will raise a deprecation warning. It will be removed in version 1.0.
