
conc

An interface to create Conc reports for corpus linguistic analysis of frequency, concordances, ngrams, keyness, and collocation.

source

Conc

 Conc (corpus)

Unified interface to Conc reporting for analysis of frequency, ngrams, concordances, keyness, and collocates.

Parameter Type Details
corpus Corpus Corpus instance
# load (or build) a corpus
reuters = Corpus('reuters').load(path_to_reuters_corpus)
# get a summary
reuters.summary()
Corpus Summary
Attribute Value
Name Reuters Corpus
Description Reuters corpus (Reuters-21578, Distribution 1.0). "The copyright for the text of newswire articles and Reuters annotations in the Reuters-21578 collection resides with Reuters Ltd. Reuters Ltd. and Carnegie Group, Inc. have agreed to allow the free distribution of this data *for research purposes only*. If you publish results based on this data set, please acknowledge its use, refer to the data set by the name (Reuters-21578, Distribution 1.0), and inform your readers of the current location of the data set." https://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html. This version downloaded via NLTK https://www.nltk.org/nltk_data/.
Date Created 2025-06-09 12:44:27
Conc Version 0.0.1
Corpus Path /home/geoff/data/conc-test-corpora/reuters.corpus
Document Count 10,788
Token Count 1,552,919
Word Token Count 1,398,782
Unique Tokens 49,901
Unique Word Tokens 49,860
# create a Conc report instance for the corpus
conc = Conc(reuters)

source

Conc.frequencies

 Conc.frequencies (case_sensitive:bool=False, normalize_by:int=10000,
                   page_size:int=20, page_current:int=1,
                   show_token_id:bool=False,
                   show_document_frequency:bool=False,
                   exclude_tokens:list[str]=[],
                   exclude_tokens_text:str='',
                   restrict_tokens:list[str]=[],
                   restrict_tokens_text:str='',
                   exclude_punctuation:bool=True)

Report frequent tokens.

Parameter Type Default Details
case_sensitive bool False frequencies for tokens with or without case preserved
normalize_by int 10000 normalize frequencies by a number (e.g. 10000)
page_size int 20 number of rows to return, if 0 returns all
page_current int 1 current page, ignored if page_size is 0
show_token_id bool False show token_id in output
show_document_frequency bool False show document frequency in output
exclude_tokens list [] exclude specific tokens from frequency report, can be used to remove stopwords
exclude_tokens_text str '' text to explain which tokens have been excluded, will be added to the report notes
restrict_tokens list [] restrict frequency report to return frequencies for a list of specific tokens
restrict_tokens_text str '' text to explain which tokens are included, will be added to the report notes
exclude_punctuation bool True exclude punctuation tokens
Returns Result return a Result object with the frequency table
conc.frequencies(normalize_by=10000).display()
Frequencies
Frequencies of word tokens, Reuters Corpus
Rank Token Frequency Normalized Frequency
1 the 69,263 495.17
2 of 36,779 262.94
3 to 36,328 259.71
4 in 29,252 209.12
5 and 25,645 183.34
6 said 25,379 181.44
7 a 24,844 177.61
8 mln 18,621 133.12
9 vs 14,332 102.46
10 for 13,720 98.09
11 dlrs 12,411 88.73
12 it 11,104 79.38
13 pct 9,810 70.13
14 's 9,627 68.82
15 on 9,244 66.09
16 cts 8,357 59.74
17 from 8,216 58.74
18 is 7,673 54.85
19 that 7,540 53.90
20 year 7,523 53.78
Report based on word tokens
Normalized Frequency is per 10,000 tokens
Total word tokens: 1,398,782
Unique word tokens: 49,860
Showing 20 rows
Page 1 of 2494
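
The exclusion and restriction parameters tailor the report; a minimal sketch (the stopword list here is illustrative only):
# exclude some common function words and note the exclusion in the report
conc.frequencies(exclude_tokens = ['the', 'of', 'to', 'in', 'and'], exclude_tokens_text = 'excluding five common function words', page_size = 10).display()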

source

Conc.ngrams

 Conc.ngrams (token_str:str, ngram_length:int=2,
              ngram_token_position:str='LEFT', normalize_by:int=10000,
              page_size:int=20, page_current:int=1,
              show_all_columns:bool=False, exclude_punctuation:bool=True,
              use_cache:bool=True)

Report ngram frequencies containing a token string.

Parameter Type Default Details
token_str str token string to get ngrams for
ngram_length int 2 length of ngram
ngram_token_position str LEFT specify whether the token sequence is the LEFT or RIGHT part of the ngram (support for ngrams with the token in the middle of the sequence is in development)
normalize_by int 10000 normalize frequencies by a number (e.g. 10000)
page_size int 20 number of results to display per results page
page_current int 1 current page of results
show_all_columns bool False return raw df with all columns or just ngram and frequency
exclude_punctuation bool True do not return ngrams with punctuation tokens
use_cache bool True retrieve the results from cache if available
Returns Result return a Result object with ngram data
conc.ngrams(token_str = 'said', ngram_length = 3, ngram_token_position = 'RIGHT', exclude_punctuation = True).display()
Ngrams for "said"
Reuters Corpus
Rank Ngram Frequency Normalized Frequency
1 the company said 1,173 8.39
2 the department said 194 1.39
3 the sources said 165 1.18
4 of england said 122 0.87
5 the spokesman said 116 0.83
6 the bank said 114 0.81
7 agriculture department said 106 0.76
8 trade sources said 95 0.68
9 company also said 93 0.66
10 the report said 93 0.66
11 but he said 75 0.54
12 it also said 71 0.51
13 the official said 71 0.51
14 he also said 70 0.50
15 industry sources said 68 0.49
16 industries inc said 68 0.49
17 the group said 66 0.47
18 the officials said 64 0.46
19 the statement said 59 0.42
20 company spokesman said 54 0.39
Report based on word tokens
Ngram length: 3, Token position: right
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 4,698
Total ngrams: 12,707
Showing 20 rows
Page 1 of 235
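
Setting ngram_token_position to 'LEFT' (the default) anchors the token at the start of each ngram instead; a sketch:
# bigrams beginning with "said"
conc.ngrams(token_str = 'said', ngram_length = 2, ngram_token_position = 'LEFT').display()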

source

Conc.ngram_frequencies

 Conc.ngram_frequencies (ngram_length:int=2, case_sensitive:bool=False,
                         normalize_by:int=10000, page_size:int=20,
                         page_current:int=1,
                         exclude_punctuation:bool=True)

Report frequent ngrams.

Parameter Type Default Details
ngram_length int 2 length of ngram
case_sensitive bool False frequencies for tokens lowercased or with case preserved
normalize_by int 10000 normalize frequencies by a number (e.g. 10000)
page_size int 20 number of rows to return
page_current int 1 current page
exclude_punctuation bool True exclude ngrams containing punctuation tokens
Returns Result return a Result object with the frequency table
conc.ngram_frequencies(ngram_length = 3, case_sensitive = False, exclude_punctuation = True, page_current = 1).display()
Ngram Frequencies
Reuters Corpus
Rank Ngram Frequency Normalized Frequency
1 the company said 1,173 8.39
2 mln dlrs in 795 5.68
3 cts vs loss 665 4.75
4 said it has 636 4.55
5 mln avg shrs 620 4.43
6 pct of the 608 4.35
7 the united states 603 4.31
8 qtr net shr 574 4.10
9 dlrs a share 546 3.90
10 inc said it 523 3.74
11 the company 's 518 3.70
12 cts net loss 517 3.70
13 the end of 501 3.58
14 cts a share 494 3.53
15 is expected to 429 3.07
16 nine mths shr 412 2.95
17 corp said it 412 2.95
18 said in a 407 2.91
19 the bank of 380 2.72
20 billion dlrs in 373 2.67
Report based on word tokens
Ngram length: 3
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 684,778
Total ngrams: 1,128,352
Showing 20 rows
Page 1 of 34239
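
Paging works as in the other reports; a sketch retrieving the next page of the same table:
# page 2 of the trigram frequency report
conc.ngram_frequencies(ngram_length = 3, exclude_punctuation = True, page_current = 2).display()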

source

Conc.concordance

 Conc.concordance (token_str:str, context_length:int=5,
                   order:str='1R2R3R', page_size:int=20,
                   page_current:int=1, show_all_columns:bool=False,
                   use_cache:bool=True)

Report concordance for a token string.

Parameter Type Default Details
token_str str token string to get concordance for
context_length int 5 number of words to show on left and right of token string
order str 1R2R3R order of sort columns
page_size int 20 number of results to display per results page
page_current int 1 current page of results
show_all_columns bool False df with all columns or just essentials
use_cache bool True retrieve the results from cache if available (currently ignored)
Returns Result concordance report results
conc.concordance('the company said', context_length = 5, order='1R2R3R').display()
Concordance for "the company said"
Reuters Corpus, Context tokens: 5, Order: 1R2R3R
Doc Id Left Node Right
2744 through a tender offer . The company said " The negotiations would determine
10501 1.25 dlrs a share . The company said " this could bring earnings
8353 of gold per ton . The company said & lt;Manitoba Mineral Resources Ltd
2186 . In a statement , the company said , " The SEC action
8898 Co > of Japan . The company said , " The discussions have
6379 In a brief statement , the company said , " We are studying
6221 special cost escrow accounts , the company said , adding , that there
4264 close in near future , the company said , adding it is prepared
6319 taxes . In addition , the company said , Georgia Power 's contracts
4664 the conversion of debentures . The company said , however , it expects
10302 against 1987 net income . The company said , however , that the
6464 the distribution of assets , the company said , it expects shareholders to
1364 of the public offering , the company said , it expects the secured
2911 part of the transaction , the company said , it granted IDC Acquisition
6660 While awaiting FDA approval , the company said , it is proceeding with
3545 is a preliminary estimate , the company said , it may be used
1702 . In the meantime , the company said , it plans today to
4788 Based on preliminary results , the company said , net income rose to
3595 the rights become exercisable , the company said , those held by shareholders
4889 group of First Delaware , the company said .
Total Concordance Lines: 1173
Total Documents: 911
Showing 20 lines
Page 1 of 59
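
The order string is a sequence of position codes: the default '1R2R3R' sorts by the first, second and third tokens to the right of the node. Assuming matching left-context codes (an assumption, not shown above), sorting by the preceding context would look like:
# sort lines by the tokens immediately to the left of the node (assumed '1L2L3L' codes)
conc.concordance('the company said', context_length = 5, order = '1L2L3L').display()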

source

Conc.concordance_plot

 Conc.concordance_plot (token_str:str, page_size:int=10,
                        append_info:bool=True)

Create a concordance plot.

Parameter Type Default Details
token_str str token string for concordance plot
page_size int 10 number of plots per page
append_info bool True append token position info to the concordance line previews shown when hovering over the plot lines
Returns Plot concordance plot object, add .display() to view in notebook
conc.concordance_plot('cause', page_size=10).display()

Concordance Plot for "cause"
Reuters Corpus
Total Documents: 121
Total Concordance Lines: 135
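
A sketch showing more plots per page with the hover info disabled:
# 20 plots per page, without token position info in the hover previews
conc.concordance_plot('cause', page_size = 20, append_info = False).display()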

source

Conc.set_reference_corpus

 Conc.set_reference_corpus (corpus:conc.corpus.Corpus)

Set a reference corpus for keyness analysis.

Parameter Type Details
corpus Corpus Reference corpus
Returns None
# load a corpus as a reference corpus
brown = Corpus('brown').load(path_to_brown_corpus)

# set corpus as reference corpus
conc.set_reference_corpus(brown)

source

Conc.keywords

 Conc.keywords (effect_size_measure:str='log_ratio',
                statistical_significance_measure:str='log_likelihood',
                order:str|None=None, order_descending:bool=True,
                statistical_significance_cut:float|None=None,
                apply_bonferroni:bool=False, min_document_frequency:int=0,
                min_document_frequency_reference:int=0,
                min_frequency:int=0, min_frequency_reference:int=0,
                case_sensitive:bool=False, normalize_by:int=10000,
                page_size:int=20, page_current:int=1,
                show_document_frequency:bool=False,
                exclude_tokens:list[str]=[], exclude_tokens_text:str='',
                restrict_tokens:list[str]=[], restrict_tokens_text:str='',
                exclude_punctuation:bool=True)

Get keywords for the corpus.

Parameter Type Default Details
effect_size_measure str log_ratio effect size measure to use, currently only 'log_ratio' is supported
statistical_significance_measure str log_likelihood statistical significance measure to use, currently only 'log_likelihood' is supported
order str | None None default of None orders by effect size measure, results can also be ordered by: frequency, frequency_reference, document_frequency, document_frequency_reference, log_likelihood
order_descending bool True order is descending or ascending
statistical_significance_cut float | None None statistical significance p-value to filter results, e.g. 0.05 or 0.01 or 0.001 - ignored if None or 0
apply_bonferroni bool False apply Bonferroni correction to the statistical significance cut-off
min_document_frequency int 0 minimum document frequency in target for token to be included in the report
min_document_frequency_reference int 0 minimum document frequency in reference for token to be included in the report
min_frequency int 0 minimum frequency in target for token to be included in the report
min_frequency_reference int 0 minimum frequency in reference for token to be included in the report
case_sensitive bool False frequencies for tokens with or without case preserved
normalize_by int 10000 normalize frequencies by a number (e.g. 10000)
page_size int 20 number of rows to return, if 0 returns all
page_current int 1 current page, ignored if page_size is 0
show_document_frequency bool False show document frequency in output
exclude_tokens list [] exclude specific tokens from report results
exclude_tokens_text str '' text to explain which tokens have been excluded, will be added to the report notes
restrict_tokens list [] restrict report to return results for a list of specific tokens
restrict_tokens_text str '' text to explain which tokens are included, will be added to the report notes
exclude_punctuation bool True exclude punctuation tokens
Returns Result return a Result object with the frequency table
conc.keywords(statistical_significance_cut = 0.0001, min_document_frequency_reference = 5).display()
Keywords
Target corpus: Reuters Corpus, Reference corpus: Brown Corpus
Rank Token Frequency Frequency Reference Normalized Frequency Normalized Frequency Reference Relative Risk Log Ratio Log Likelihood
1 net 6,988 31 49.96 0.32 157.95 7.30 7,078.84
2 dividend 1,041 6 7.44 0.06 121.57 6.93 1,042.37
3 exports 1,214 10 8.68 0.10 85.07 6.41 1,191.05
4 4th 840 8 6.01 0.08 73.57 6.20 815.81
5 securities 839 8 6.00 0.08 73.49 6.20 814.76
6 currency 818 8 5.85 0.08 71.65 6.16 792.86
7 subsidiary 630 7 4.50 0.07 63.06 5.98 604.46
8 billion 5,828 65 41.66 0.66 62.83 5.97 5,589.95
9 pact 428 5 3.06 0.05 59.98 5.91 408.89
10 profit 2,960 36 21.16 0.37 57.61 5.85 2,817.73
11 spokesman 971 13 6.94 0.13 52.34 5.71 916.03
12 deficit 874 12 6.25 0.12 51.04 5.67 822.47
13 tender 752 11 5.38 0.11 47.90 5.58 703.10
14 brazil 545 8 3.90 0.08 47.74 5.58 509.37
15 economists 325 5 2.32 0.05 45.55 5.51 302.23
16 canadian 647 10 4.63 0.10 45.34 5.50 601.36
17 wheat 1,028 16 7.35 0.16 45.02 5.49 954.75
18 imports 946 15 6.76 0.15 44.19 5.47 876.78
19 barrels 433 7 3.10 0.07 43.34 5.44 400.44
20 loss 5,124 85 36.63 0.87 42.24 5.40 4,724.67
Report based on word tokens
Filtered tokens by minimum document frequency in reference corpus (5)
Keywords filtered based on p-value: 0.0001
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 1,398,782
Total word tokens in reference corpus: 980,144
Keywords: 3,691
Showing 20 rows
Page 1 of 185
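
A sketch combining frequency floors with ordering by statistical significance rather than effect size:
# order by log likelihood and apply a Bonferroni-corrected cut-off
conc.keywords(order = 'log_likelihood', statistical_significance_cut = 0.01, apply_bonferroni = True, min_frequency = 5, min_frequency_reference = 5).display()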

source

Conc.collocates

 Conc.collocates (token_str:str, effect_size_measure:str='logdice',
                  statistical_significance_measure:str='log_likelihood',
                  order:str|None=None, order_descending:bool=True,
                  statistical_significance_cut:float|None=None,
                  apply_bonferroni:bool=False,
                  context_length:int|tuple[int,int]=5,
                  min_collocate_frequency:int=5, page_size:int=20,
                  page_current:int=1, exclude_punctuation:bool=True)

Report collocates for a given token string.

Parameter Type Default Details
token_str str Token to search for
effect_size_measure str logdice statistical measure to use for collocation calculation: logdice, mutual_information
statistical_significance_measure str log_likelihood statistical significance measure to use, currently only 'log_likelihood' is supported
order str | None None default of None orders by collocation_measure, results can also be ordered by: collocate_frequency, frequency, log_likelihood
order_descending bool True order is descending or ascending
statistical_significance_cut float | None None statistical significance p-value to filter results, e.g. 0.05 or 0.01 or 0.001 - ignored if None or 0
apply_bonferroni bool False apply Bonferroni correction to the statistical significance cut-off
context_length int | tuple[int, int] 5 Window size per side in tokens - if an int (e.g. 5), left and right context lengths are the same; for independent control pass a tuple (context_length_left, context_length_right) (e.g. (0, 5))
min_collocate_frequency int 5 Minimum count of collocates
page_size int 20 number of rows to return, if 0 returns all
page_current int 1 current page, ignored if page_size is 0
exclude_punctuation bool True exclude punctuation tokens
Returns Result return a Result object with the collocates table
conc.collocates('the company', context_length = (0, 1), exclude_punctuation = False).display()
Collocates of "the company"
Reuters Corpus
Rank Token Collocate Frequency Frequency Logdice Log Likelihood
1 said 1,173 25,379 10.40 5,149.40
2 's 518 9,627 10.38 2,429.14
3 also 107 2,532 9.28 450.99
4 reported 51 775 8.74 259.63
5 has 69 4,874 8.14 151.05
6 had 47 2,975 7.98 111.88
7 earned 22 159 7.78 145.70
8 would 43 4,688 7.49 63.27
9 is 59 7,673 7.48 70.93
10 will 49 5,951 7.47 63.94
11 did 20 673 7.43 70.77
12 was 46 5,826 7.40 57.12
13 today 15 1,445 6.75 25.03
14 . 165 49,406 6.69 35.53
15 added 13 1,116 6.65 24.16
16 expects 11 628 6.59 28.20
17 might 10 402 6.54 32.04
18 does 9 408 6.38 26.84
19 lost 8 150 6.32 37.38
20 earlier 10 1,100 6.28 14.57
Report based on word and punctuation tokens
Context tokens left: 0, context tokens right: 1
Filtered tokens by minimum collocation frequency (5)
Unique collocates: 49
Showing 20 rows
Page 1 of 3
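
A sketch using mutual information over a symmetric window, with a higher frequency floor to reduce low-frequency noise:
# rank collocates of "company" by mutual information within 5 tokens either side
conc.collocates('company', effect_size_measure = 'mutual_information', context_length = 5, min_collocate_frequency = 10).display()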