conc

An interface to create Conc reports for corpus linguistic analysis of frequency, concordances, ngrams, keyness, and collocation.

source

Conc

 Conc (corpus)

Unified interface to Conc reporting for analysis of frequency, ngrams, concordances, keyness, and collocates.

Parameter Details
corpus Corpus instance
# load (or build) a corpus
reuters = Corpus('reuters').load(path_to_reuters_corpus)
# get a summary
reuters.summary()
Corpus Summary
Attribute Value
Name Reuters Corpus
Description Reuters corpus (Reuters-21578, Distribution 1.0). "The copyright for the text of newswire articles and Reuters annotations in the Reuters-21578 collection resides with Reuters Ltd. Reuters Ltd. and Carnegie Group, Inc. have agreed to allow the free distribution of this data *for research purposes only*. If you publish results based on this data set, please acknowledge its use, refer to the data set by the name (Reuters-21578, Distribution 1.0), and inform your readers of the current location of the data set." https://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html. This version downloaded via NLTK https://www.nltk.org/nltk_data/.
Date Created 2025-07-09 09:21:55
Conc Version 0.1.6
Corpus Path /home/geoff/data/conc-test-corpora/reuters.corpus
Document Count 10,788
Token Count 1,552,919
Word Token Count 1,398,782
Unique Tokens 49,901
Unique Word Tokens 49,860
# create a Conc report instance for the corpus
conc = Conc(reuters)

source

Conc.frequencies

 Conc.frequencies (case_sensitive:bool=False, normalize_by:int=10000,
                   page_size:int=20, page_current:int=1,
                   show_token_id:bool=False,
                   show_document_frequency:bool=False,
                   exclude_tokens:list[str]=[],
                   exclude_tokens_text:str='',
                   restrict_tokens:list[str]=[],
                   restrict_tokens_text:str='',
                   exclude_punctuation:bool=True)

Report frequent tokens.

Parameter Type Default Details
case_sensitive bool False frequencies for tokens with or without case preserved
normalize_by int 10000 normalize frequencies by a number (e.g. 10000)
page_size int 20 number of rows to return, if 0 returns all
page_current int 1 current page, ignored if page_size is 0
show_token_id bool False show token_id in output
show_document_frequency bool False show document frequency in output
exclude_tokens list [] exclude specific tokens from frequency report, can be used to remove stopwords
exclude_tokens_text str text to explain which tokens have been excluded, will be added to the report notes
restrict_tokens list [] restrict frequency report to return frequencies for a list of specific tokens
restrict_tokens_text str text to explain which tokens are included, will be added to the report notes
exclude_punctuation bool True exclude punctuation tokens
Returns Result return a Result object with the frequency table
conc.frequencies(normalize_by=10000).display()
Frequencies
Frequencies of word tokens, Reuters Corpus
Rank Token Frequency Normalized Frequency
1 the 69,263 495.17
2 of 36,779 262.94
3 to 36,328 259.71
4 in 29,252 209.12
5 and 25,645 183.34
6 said 25,379 181.44
7 a 24,844 177.61
8 mln 18,621 133.12
9 vs 14,332 102.46
10 for 13,720 98.09
11 dlrs 12,411 88.73
12 it 11,104 79.38
13 pct 9,810 70.13
14 's 9,627 68.82
15 on 9,244 66.09
16 cts 8,357 59.74
17 from 8,216 58.74
18 is 7,673 54.85
19 that 7,540 53.90
20 year 7,523 53.78
Report based on word tokens
Normalized Frequency is per 10,000 tokens
Total word tokens: 1,398,782
Unique word tokens: 49,860
Showing 20 rows
Page 1 of 2494
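
The exclude_tokens and restrict_tokens parameters filter the report, for example to remove stopwords. A minimal sketch, assuming the corpus loaded above (the excluded word list is illustrative only):

# exclude a few high-frequency function words and note this in the report
conc.frequencies(exclude_tokens = ['the', 'of', 'to', 'in', 'and'],
                 exclude_tokens_text = 'selected function words excluded',
                 page_size = 10).display()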

source

Conc.ngrams

 Conc.ngrams (token_str:str, ngram_length:int=2,
              ngram_token_position:str='LEFT', normalize_by:int=10000,
              page_size:int=20, page_current:int=1,
              show_all_columns:bool=False, exclude_punctuation:bool=True,
              use_cache:bool=True)

Report ngram frequencies containing a token string.

Parameter Type Default Details
token_str str token string to get ngrams for
ngram_length int 2 length of ngram
ngram_token_position str LEFT specify whether the token string is at the LEFT, RIGHT or MIDDLE of the ngram (support for other positions is in development)
normalize_by int 10000 normalize frequencies by a number (e.g. 10000)
page_size int 20 number of results to display per results page
page_current int 1 current page of results
show_all_columns bool False return raw df with all columns or just ngram and frequency
exclude_punctuation bool True do not return ngrams with punctuation tokens
use_cache bool True retrieve the results from cache if available
Returns Result return a Result object with ngram data
conc.ngrams(token_str = 'said', ngram_length = 3, ngram_token_position = 'RIGHT', exclude_punctuation = True).display()
Ngrams for "said"
Reuters Corpus
Rank Ngram Frequency Normalized Frequency
1 the company said 1,173 8.39
2 the department said 194 1.39
3 the sources said 165 1.18
4 of england said 122 0.87
5 the spokesman said 116 0.83
6 the bank said 114 0.81
7 agriculture department said 106 0.76
8 trade sources said 95 0.68
9 company also said 93 0.66
10 the report said 93 0.66
11 but he said 75 0.54
12 it also said 71 0.51
13 the official said 71 0.51
14 he also said 70 0.50
15 industry sources said 68 0.49
16 industries inc said 68 0.49
17 the group said 66 0.47
18 the officials said 64 0.46
19 the statement said 59 0.42
20 company spokesman said 54 0.39
Report based on word tokens
Ngram length: 3, Token position: right
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 4,698
Total ngrams: 12,707
Showing 20 rows
Page 1 of 235
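
Besides LEFT and RIGHT, ngram_token_position accepts MIDDLE to place the token string inside the ngram. A sketch using the same corpus:

# trigrams with "said" as the middle token
conc.ngrams(token_str = 'said', ngram_length = 3, ngram_token_position = 'MIDDLE').display()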

source

Conc.ngram_frequencies

 Conc.ngram_frequencies (ngram_length:int=2, case_sensitive:bool=False,
                         normalize_by:int=10000, page_size:int=20,
                         page_current:int=1,
                         show_document_frequency:bool=False,
                         exclude_punctuation:bool=True)

Report frequent ngrams.

Parameter Type Default Details
ngram_length int 2 length of ngram
case_sensitive bool False frequencies for tokens lowercased or with case preserved
normalize_by int 10000 normalize frequencies by a number (e.g. 10000)
page_size int 20 number of rows to return
page_current int 1 current page
show_document_frequency bool False show document frequency in output (slow to compute for large corpora)
exclude_punctuation bool True exclude ngrams containing punctuation tokens
Returns Result return a Result object with the frequency table
conc.ngram_frequencies(ngram_length = 3, case_sensitive = False, exclude_punctuation = True, page_current = 1).display()
Ngram Frequencies
Reuters Corpus
Rank Ngram Frequency Normalized Frequency
1 the company said 1,173 8.39
2 mln dlrs in 795 5.68
3 cts vs loss 665 4.75
4 said it has 636 4.55
5 mln avg shrs 620 4.43
6 pct of the 608 4.35
7 the united states 603 4.31
8 qtr net shr 574 4.10
9 dlrs a share 546 3.90
10 inc said it 523 3.74
11 the company 's 518 3.70
12 cts net loss 517 3.70
13 the end of 501 3.58
14 cts a share 494 3.53
15 is expected to 429 3.07
16 corp said it 412 2.95
17 nine mths shr 412 2.95
18 said in a 407 2.91
19 the bank of 380 2.72
20 billion dlrs in 373 2.67
Report based on word tokens
Ngram length: 3
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 684,778
Total ngrams: 1,128,352
Showing 20 rows
Page 1 of 34239
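
Setting show_document_frequency adds a document frequency column, though as noted above this is slow to compute for large corpora. A sketch:

# bigram frequencies with document frequency included
conc.ngram_frequencies(ngram_length = 2, show_document_frequency = True, page_size = 10).display()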

source

Conc.concordance

 Conc.concordance (token_str:str, context_length:int=5,
                   order:str='1R2R3R', page_size:int=20,
                   page_current:int=1, show_all_columns:bool=False,
                   use_cache:bool=True, ignore_punctuation:bool=True,
                   filter_context_str:str|None=None,
                   filter_context_length:int|tuple[int,int]=5)

Report concordance for a token string.

Parameter Type Default Details
token_str str token string to get concordance for
context_length int 5 number of words to show on left and right of token string
order str 1R2R3R order of sort columns - one of 1L2L3L, 3L2L1L, 2L1L1R, 1L1R2R, 1R2R3R (default if omitted), LEFT, RIGHT
page_size int 20 number of results to display per results page
page_current int 1 current page of results
show_all_columns bool False df with all columns or just essentials
use_cache bool True retrieve the results from cache if available (currently ignored)
ignore_punctuation bool True whether to ignore punctuation in the concordance sort
filter_context_str str | None None if a string is provided, the concordance lines will be filtered to show lines with contexts containing this string
filter_context_length int | tuple[int, int] 5 ignored if filter_context_str is None, otherwise the context window size per side in tokens - if an int (e.g. 5), context lengths on left and right are the same; for independent control of left and right context lengths pass a tuple (context_length_left, context_length_right)
Returns Result concordance report results
conc.concordance('the company said', context_length = 5, order='1R2R3R').display()
Concordance for "the company said"
Reuters Corpus, Context tokens: 5, Order: 1R2R3R
Doc Id Left Node Right
3754 dividend payment since 1981 , the company said .
10165 second quarter of 1987 , the company said .
7883 of Metex common stock , the company said .
4520 110 dlrs per tonne , the company said .
1801 close in early July , the company said .
10681 Transamerica Occidental Life subsidiary , the company said .
7856 grades of crude oil , the company said .
10162 of record March 25 , the company said .
8573 adhesives for engineering plastics , the company said .
5681 saleable for 13 months , the company said .
4497 Corp earlier this year , the company said .
8740 encouraging signs at yearend , the company said .
10143 close of business today , the company said .
10163 heavy to 14.60 dlrs , the company said .
7484 operated by National Pizza , the company said .
3651 its sixth largest overall , the company said .
9655 May one , 1987 , the company said .
10741 approvals and other conditions , the company said .
8987 eastward from that point , the company said .
7408 of any other shareholder , the company said .
Total Concordance Lines: 1173
Total Documents: 911
Showing 20 lines
Page 1 of 59
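
The filter_context_str and filter_context_length parameters restrict the report to lines whose context contains a given string. A sketch (the filter token "bank" is illustrative only):

# concordance lines for "said" with "bank" within 5 tokens on either side
conc.concordance('said', context_length = 5,
                 filter_context_str = 'bank', filter_context_length = 5).display()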

source

Conc.concordance_plot

 Conc.concordance_plot (token_str:str, page_size:int=10,
                        append_info:bool=True)

Create a concordance plot.

Parameter Type Default Details
token_str str token string for concordance plot
page_size int 10 number of plots per page
append_info bool True append token position info to the concordance line previews shown when hovering over the plot lines
Returns Plot concordance plot object, add .display() to view in notebook
conc.concordance_plot('cause', page_size=10).display()

Concordance Plot for "cause"
Reuters Corpus
Total Documents: 121
Total Concordance Lines: 135
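
For frequent tokens, page_size controls how many plots appear per page, and append_info can suppress the token position info in the hover previews. A sketch:

# larger page of plots without token position info on hover
conc.concordance_plot('said', page_size = 20, append_info = False).display()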

source

Conc.set_reference_corpus

 Conc.set_reference_corpus (corpus:conc.corpus.Corpus|conc.listcorpus.ListCorpus)

Set a reference corpus for keyness analysis.

Parameter Type Details
corpus conc.corpus.Corpus | conc.listcorpus.ListCorpus Reference corpus
Returns None
# load a corpus as a reference corpus
brown = Corpus('brown').load(path_to_brown_corpus)

# set corpus as reference corpus
conc.set_reference_corpus(brown)

source

Conc.keywords

 Conc.keywords (effect_size_measure:str='log_ratio',
                statistical_significance_measure:str='log_likelihood',
                order:str|None=None, order_descending:bool=True,
                statistical_significance_cut:float|None=None,
                apply_bonferroni:bool=False, min_document_frequency:int=0,
                min_document_frequency_reference:int=0,
                min_frequency:int=0, min_frequency_reference:int=0,
                case_sensitive:bool=False, normalize_by:int=10000,
                page_size:int=20, page_current:int=1,
                show_document_frequency:bool=False,
                exclude_tokens:list[str]=[], exclude_tokens_text:str='',
                restrict_tokens:list[str]=[], restrict_tokens_text:str='',
                exclude_punctuation:bool=True,
                handle_common_typographic_differences:bool=True,
                exclude_negative_keywords:bool=True)

Get keywords for the corpus.

Parameter Type Default Details
effect_size_measure str log_ratio effect size measure to use, currently only ‘log_ratio’ is supported
statistical_significance_measure str log_likelihood statistical significance measure to use, currently only ‘log_likelihood’ is supported
order str | None None default of None orders by effect size measure, results can also be ordered by: frequency, frequency_reference, document_frequency, document_frequency_reference, log_likelihood
order_descending bool True order is descending or ascending
statistical_significance_cut float | None None statistical significance p-value to filter results, e.g. 0.05 or 0.01 or 0.001 - ignored if None or 0
apply_bonferroni bool False apply Bonferroni correction to the statistical significance cut-off
min_document_frequency int 0 minimum document frequency in target for token to be included in the report
min_document_frequency_reference int 0 minimum document frequency in reference for token to be included in the report
min_frequency int 0 minimum frequency in target for token to be included in the report
min_frequency_reference int 0 minimum frequency in reference for token to be included in the report
case_sensitive bool False frequencies for tokens with or without case preserved
normalize_by int 10000 normalize frequencies by a number (e.g. 10000)
page_size int 20 number of rows to return, if 0 returns all
page_current int 1 current page, ignored if page_size is 0
show_document_frequency bool False show document frequency in output
exclude_tokens list [] exclude specific tokens from report results
exclude_tokens_text str text to explain which tokens have been excluded, will be added to the report notes
restrict_tokens list [] restrict report to return results for a list of specific tokens
restrict_tokens_text str text to explain which tokens are included, will be added to the report notes
exclude_punctuation bool True exclude punctuation tokens
handle_common_typographic_differences bool True whether to detect and normalize common typographic differences in word tokens (currently focused on apostrophes in common English contractions); ignored when exclude_punctuation is False
exclude_negative_keywords bool True whether to exclude negative keywords from the report
Returns Result return a Result object with the frequency table
conc.keywords(statistical_significance_cut = 0.0001, min_document_frequency_reference = 5).display()
Keywords
Target corpus: Reuters Corpus, Reference corpus: Brown Corpus
Rank Token Frequency Frequency Reference Normalized Frequency Normalized Frequency Reference Relative Risk Log Ratio Log Likelihood
1 said 25,379 1,944 181.44 19.83 9.15 3.19 16,380.21
2 net 6,988 31 49.96 0.32 157.95 7.30 7,078.84
3 billion 5,828 65 41.66 0.66 62.83 5.97 5,589.95
4 loss 5,124 85 36.63 0.87 42.24 5.40 4,724.67
5 u.s. 5,496 155 39.29 1.58 24.85 4.63 4,691.63
6 year 7,523 831 53.78 8.48 6.34 2.67 4,051.72
7 bank 3,640 83 26.02 0.85 30.73 4.94 3,217.71
8 company 4,670 319 33.39 3.25 10.26 3.36 3,154.17
9 profit 2,960 36 21.16 0.37 57.61 5.85 2,817.73
10 oil 3,262 100 23.32 1.02 22.86 4.51 2,741.87
11 share 3,146 99 22.49 1.01 22.27 4.48 2,631.00
12 shares 2,652 45 18.96 0.46 41.30 5.37 2,438.84
13 trade 3,094 143 22.12 1.46 15.16 3.92 2,367.94
14 market 2,810 158 20.09 1.61 12.46 3.64 2,030.41
15 its 7,402 1,780 52.92 18.16 2.91 1.54 1,987.46
16 stock 2,629 146 18.79 1.49 12.62 3.66 1,907.10
17 prices 2,194 61 15.69 0.62 25.20 4.66 1,877.65
18 japan 1,854 34 13.25 0.35 38.21 5.26 1,688.89
19 april 2,039 71 14.58 0.72 20.12 4.33 1,670.31
20 quarter 1,852 44 13.24 0.45 29.49 4.88 1,626.89
Report based on word tokens
Filtered tokens by minimum document frequency in reference corpus (5)
Keywords filtered based on p-value: 0.0001
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 1,398,782
Total word tokens in reference corpus: 980,144
Keywords: 3,691
Showing 20 rows
Page 1 of 185
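
The order, statistical_significance_cut and apply_bonferroni parameters reorder and filter the keyword report. A sketch, assuming the reference corpus set above:

# keywords ordered by log likelihood with a Bonferroni-corrected cut-off
conc.keywords(order = 'log_likelihood',
              statistical_significance_cut = 0.01,
              apply_bonferroni = True,
              min_frequency = 5).display()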

source

Conc.collocates

 Conc.collocates (token_str:str, effect_size_measure:str='logdice',
                  statistical_significance_measure:str='log_likelihood',
                  order:str|None=None, order_descending:bool=True,
                  statistical_significance_cut:float|None=None,
                  apply_bonferroni:bool=False,
                  context_length:int|tuple[int,int]=5,
                  min_collocate_frequency:int=5, page_size:int=20,
                  page_current:int=1, exclude_punctuation:bool=True)

Report collocates for a given token string.

Parameter Type Default Details
token_str str Token to search for
effect_size_measure str logdice effect size measure to use for collocation scoring: logdice, mutual_information
statistical_significance_measure str log_likelihood statistical significance measure to use, currently only ‘log_likelihood’ is supported
order str | None None default of None orders by collocation_measure, results can also be ordered by: collocate_frequency, frequency, log_likelihood
order_descending bool True order is descending or ascending
statistical_significance_cut float | None None statistical significance p-value to filter results, e.g. 0.05 or 0.01 or 0.001 - ignored if None or 0
apply_bonferroni bool False apply Bonferroni correction to the statistical significance cut-off
context_length int | tuple[int, int] 5 window size per side in tokens - if an int (e.g. 5), context lengths on left and right are the same; for independent control of left and right context lengths pass a tuple (context_length_left, context_length_right) (e.g. (0, 5))
min_collocate_frequency int 5 Minimum count of collocates
page_size int 20 number of rows to return, if 0 returns all
page_current int 1 current page, ignored if page_size is 0
exclude_punctuation bool True exclude punctuation tokens
Returns Result
conc.collocates('the company', context_length = (0, 1), exclude_punctuation = False).display()
Collocates of "the company"
Reuters Corpus
Rank Token Collocate Frequency Frequency Logdice Log Likelihood
1 said 1,173 25,379 10.40 5,149.40
2 's 518 9,627 10.38 2,429.14
3 also 107 2,532 9.28 450.99
4 reported 51 775 8.74 259.63
5 has 69 4,874 8.14 151.05
6 had 47 2,975 7.98 111.88
7 earned 22 159 7.78 145.70
8 would 43 4,688 7.49 63.27
9 is 59 7,673 7.48 70.93
10 will 49 5,951 7.47 63.94
11 did 20 673 7.43 70.77
12 was 46 5,826 7.40 57.12
13 today 15 1,445 6.75 25.03
14 . 165 49,406 6.69 35.53
15 added 13 1,116 6.65 24.16
16 expects 11 628 6.59 28.20
17 might 10 402 6.54 32.04
18 does 9 408 6.38 26.84
19 lost 8 150 6.32 37.38
20 earlier 10 1,100 6.28 14.57
Report based on word and punctuation tokens
Context tokens left: 0, context tokens right: 1
Filtered tokens by minimum collocation frequency (5)
Unique collocates: 49
Showing 20 rows
Page 1 of 3
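
The effect_size_measure parameter also supports mutual_information as an alternative to the default logdice. A sketch (the token "oil" is illustrative only):

# collocates of "oil" scored by mutual information, 5 tokens each side
conc.collocates('oil', effect_size_measure = 'mutual_information',
                context_length = 5, min_collocate_frequency = 10).display()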