Unified interface to Conc reporting for analysis of frequency, ngrams, concordances, keyness, and collocates.
| Parameter | Details |
| --- | --- |
| corpus | Corpus instance |
# load (or build) a corpus
reuters = Corpus('reuters').load(path_to_reuters_corpus)
# get a summary
reuters.summary()
Corpus Summary

| Attribute | Value |
| --- | --- |
| Name | Reuters Corpus |
| Description | Reuters corpus (Reuters-21578, Distribution 1.0). "The copyright for the text of newswire articles and Reuters annotations in the Reuters-21578 collection resides with Reuters Ltd. Reuters Ltd. and Carnegie Group, Inc. have agreed to allow the free distribution of this data *for research purposes only*. If you publish results based on this data set, please acknowledge its use, refer to the data set by the name (Reuters-21578, Distribution 1.0), and inform your readers of the current location of the data set." https://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html. This version downloaded via NLTK https://www.nltk.org/nltk_data/. |
| Date Created | 2025-06-09 12:44:27 |
| Conc Version | 0.0.1 |
| Corpus Path | /home/geoff/data/conc-test-corpora/reuters.corpus |
| Document Count | 10,788 |
| Token Count | 1,552,919 |
| Word Token Count | 1,398,782 |
| Unique Tokens | 49,901 |
| Unique Word Tokens | 49,860 |
# create a Conc report instance for the corpus
conc = Conc(reuters)
# load a corpus as a reference corpus
brown = Corpus('brown').load(path_to_brown_corpus)
# set corpus as reference corpus
conc.set_reference_corpus(brown)
effect size measure to use, currently only ‘log_ratio’ is supported
| Parameter | Type | Default | Details |
| --- | --- | --- | --- |
| statistical_significance_measure | str | log_likelihood | statistical significance measure to use; currently only ‘log_likelihood’ is supported |
| order | str \| None | None | default of None orders by effect size measure; results can also be ordered by: frequency, frequency_reference, document_frequency, document_frequency_reference, log_likelihood |
| order_descending | bool | True | True for descending order, False for ascending |
| statistical_significance_cut | float \| None | None | statistical significance p-value used to filter results, e.g. 0.05, 0.01 or 0.001; ignored if None or 0 |
| apply_bonferroni | bool | False | apply Bonferroni correction to the statistical significance cut-off |
| min_document_frequency | int | 0 | minimum document frequency in target for token to be included in the report |
| min_document_frequency_reference | int | 0 | minimum document frequency in reference for token to be included in the report |
| min_frequency | int | 0 | minimum frequency in target for token to be included in the report |
| min_frequency_reference | int | 0 | minimum frequency in reference for token to be included in the report |
| case_sensitive | bool | False | frequencies for tokens with or without case preserved |
| normalize_by | int | 10000 | normalize frequencies by a number (e.g. 10000) |
| page_size | int | 20 | number of rows to return; if 0, returns all |
| page_current | int | 1 | current page; ignored if page_size is 0 |
| show_document_frequency | bool | False | show document frequency in output |
| exclude_tokens | list | [] | exclude specific tokens from report results |
| exclude_tokens_text | str | | text to explain which tokens have been excluded; will be added to the report notes |
| restrict_tokens | list | [] | restrict report to return results for a list of specific tokens |
| restrict_tokens_text | str | | text to explain which tokens are included; will be added to the report notes |
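The measures named in the parameters above can be illustrated with a small, library-independent sketch. It is not Conc's implementation: the function names are hypothetical, the formulas are the commonly used two-corpus log ratio and log-likelihood, and replacing a zero count with 0.5 in the log ratio is an assumption made here so the ratio stays defined. The p-value treats log-likelihood as a chi-square statistic with one degree of freedom.

```python
import math


def log_ratio(freq_target, freq_reference, size_target, size_reference):
    """Binary log of the ratio of relative frequencies (an effect size)."""
    # Assumption: a zero count is replaced by 0.5 so the ratio stays defined.
    freq_target = freq_target or 0.5
    freq_reference = freq_reference or 0.5
    return math.log2((freq_target / size_target) / (freq_reference / size_reference))


def log_likelihood(freq_target, freq_reference, size_target, size_reference):
    """Two-corpus log-likelihood statistic for one token."""
    total = freq_target + freq_reference
    combined_size = size_target + size_reference
    expected_target = size_target * total / combined_size
    expected_reference = size_reference * total / combined_size
    ll = 0.0
    for observed, expected in (
        (freq_target, expected_target),
        (freq_reference, expected_reference),
    ):
        if observed > 0:  # the 0 * ln(0) term is treated as 0
            ll += observed * math.log(observed / expected)
    return 2 * ll


def p_value(ll):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(ll, 0.0) / 2))


def passes_cut(ll, cut, comparisons=1, apply_bonferroni=False):
    """Apply a p-value cut-off; None or 0 disables filtering."""
    if not cut:
        return True
    if apply_bonferroni:
        # Bonferroni: divide the cut-off by the number of comparisons made.
        cut = cut / comparisons
    return p_value(ll) < cut


def normalize(frequency, corpus_size, normalize_by=10000):
    """Express a raw frequency per normalize_by tokens."""
    return frequency * normalize_by / corpus_size
```

With a cut-off of 0.05, a log-likelihood of about 3.84 sits on the threshold. Applying the Bonferroni correction over many tokens lowers the cut-off, so fewer tokens pass the filter.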
statistical measure to use for collocation calculation: logdice, mutual_information
| Parameter | Type | Default | Details |
| --- | --- | --- | --- |
| statistical_significance_measure | str | log_likelihood | statistical significance measure to use; currently only ‘log_likelihood’ is supported |
| order | str \| None | None | default of None orders by collocation_measure; results can also be ordered by: collocate_frequency, frequency, log_likelihood |
| order_descending | bool | True | True for descending order, False for ascending |
| statistical_significance_cut | float \| None | None | statistical significance p-value used to filter results, e.g. 0.05, 0.01 or 0.001; ignored if None or 0 |
| apply_bonferroni | bool | False | apply Bonferroni correction to the statistical significance cut-off |
| context_length | int \| tuple[int, int] | 5 | window size per side in tokens; if an int (e.g. 5), the context length is the same on the left and right; for independent control of left and right context length, pass a tuple (context_length_left, context_length_right), e.g. (0, 5) |
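The window and the two collocation measures named above can be sketched in plain Python. This is an illustration, not Conc's implementation: the function names are hypothetical, the window is applied to a flat token list with no document boundaries, and the mutual information formula shown is the simple observed/expected form with no window-size adjustment.

```python
import math
from collections import Counter


def window_bounds(context_length):
    """Return (left, right) window sizes from an int or a (left, right) tuple."""
    if isinstance(context_length, int):
        return context_length, context_length
    left, right = context_length
    return left, right


def collocate_counts(tokens, node, context_length=5):
    """Count tokens that fall inside the window around each occurrence of node."""
    left, right = window_bounds(context_length)
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - left)  # clamp so the window never wraps around
        counts.update(tokens[start:i])
        counts.update(tokens[i + 1:i + 1 + right])
    return counts


def log_dice(cooccurrence, node_frequency, collocate_frequency):
    """logDice: 14 + log2 of the Dice coefficient."""
    return 14 + math.log2(2 * cooccurrence / (node_frequency + collocate_frequency))


def mutual_information(cooccurrence, node_frequency, collocate_frequency, corpus_size):
    """MI: log2 of observed co-occurrence over the count expected by chance."""
    expected = node_frequency * collocate_frequency / corpus_size
    return math.log2(cooccurrence / expected)


# Usage: only look at the token to the right of the node.
tokens = "the price of oil rose as oil prices fell".split()
right_only = collocate_counts(tokens, "oil", context_length=(0, 1))
```

Passing a tuple such as `(0, 1)` restricts counting to one side of the node, which is useful when only following (or only preceding) collocates are of interest.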