conc

An interface to create Conc reports for corpus linguistic analysis of frequency, concordances, ngrams, keyness, and collocation.

source

Conc

 Conc (corpus)

Unified interface to Conc reporting for analysis of frequency, ngrams, concordances, keyness, and collocates.

Parameter Details
corpus Corpus instance
# load (or build) a corpus
reuters = Corpus('reuters').load(path_to_reuters_corpus)
# get a summary
reuters.summary()
Corpus Summary
Attribute Value
Name Reuters Corpus
Description Reuters corpus (Reuters-21578, Distribution 1.0). "The copyright for the text of newswire articles and Reuters annotations in the Reuters-21578 collection resides with Reuters Ltd. Reuters Ltd. and Carnegie Group, Inc. have agreed to allow the free distribution of this data *for research purposes only*. If you publish results based on this data set, please acknowledge its use, refer to the data set by the name (Reuters-21578, Distribution 1.0), and inform your readers of the current location of the data set." https://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html. This version downloaded via NLTK https://www.nltk.org/nltk_data/.
Date Created 2025-07-09 09:21:55
Conc Version 0.1.6
Corpus Path /home/geoff/data/conc-test-corpora/reuters.corpus
Document Count 10,788
Token Count 1,552,919
Word Token Count 1,398,782
Unique Tokens 49,901
Unique Word Tokens 49,860
# create a Conc report instance for the corpus
conc = Conc(reuters)

source

Conc.frequencies

 Conc.frequencies (case_sensitive:bool=False, normalize_by:int=10000,
                   page_size:int=20, page_current:int=1,
                   show_token_id:bool=False,
                   show_document_frequency:bool=False,
                   exclude_tokens:list[str]=[],
                   exclude_tokens_text:str='',
                   restrict_tokens:list[str]=[],
                   restrict_tokens_text:str='',
                   exclude_punctuation:bool=True)

Report frequent tokens.

Parameter Type Default Details
case_sensitive bool False frequencies for tokens with or without case preserved
normalize_by int 10000 normalize frequencies by a number (e.g. 10000)
page_size int 20 number of rows to return, if 0 returns all
page_current int 1 current page, ignored if page_size is 0
show_token_id bool False show token_id in output
show_document_frequency bool False show document frequency in output
exclude_tokens list [] exclude specific tokens from frequency report, can be used to remove stopwords
exclude_tokens_text str text to explain which tokens have been excluded, will be added to the report notes
restrict_tokens list [] restrict frequency report to return frequencies for a list of specific tokens
restrict_tokens_text str text to explain which tokens are included, will be added to the report notes
exclude_punctuation bool True exclude punctuation tokens
Returns Result return a Result object with the frequency table
conc.frequencies(normalize_by=10000).display()
Frequencies
Frequencies of word tokens, Reuters Corpus
Rank Token Frequency Normalized Frequency
1 the 69,263 495.17
2 of 36,779 262.94
3 to 36,328 259.71
4 in 29,252 209.12
5 and 25,645 183.34
6 said 25,379 181.44
7 a 24,844 177.61
8 mln 18,621 133.12
9 vs 14,332 102.46
10 for 13,720 98.09
11 dlrs 12,411 88.73
12 it 11,104 79.38
13 pct 9,810 70.13
14 's 9,627 68.82
15 on 9,244 66.09
16 cts 8,357 59.74
17 from 8,216 58.74
18 is 7,673 54.85
19 that 7,540 53.90
20 year 7,523 53.78
Report based on word tokens
Normalized Frequency is per 10,000 tokens
Total word tokens: 1,398,782
Unique word tokens: 49,860
Showing 20 rows
Page 1 of 2494
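
The exclude_tokens and restrict_tokens parameters filter the report, for example to remove stopwords. A minimal sketch, assuming the corpus loaded above (the excluded word list is illustrative only):

# exclude a few high-frequency function words and note this in the report
conc.frequencies(exclude_tokens = ['the', 'of', 'to', 'in', 'and'],
                 exclude_tokens_text = 'selected function words excluded',
                 page_size = 10).display()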

source

Conc.ngrams

 Conc.ngrams (token_str:str, ngram_length:int=2,
              ngram_token_position:str='LEFT', normalize_by:int=10000,
              page_size:int=20, page_current:int=1,
              show_all_columns:bool=False, exclude_punctuation:bool=True,
              use_cache:bool=True)

Report ngram frequencies containing a token string.

Parameter Type Default Details
token_str str token string to get ngrams for
ngram_length int 2 length of ngram
ngram_token_position str LEFT specify whether the token string is at the LEFT, RIGHT or MIDDLE of the ngram (support for other positions is in development)
normalize_by int 10000 normalize frequencies by a number (e.g. 10000)
page_size int 20 number of results to display per results page
page_current int 1 current page of results
show_all_columns bool False return raw df with all columns or just ngram and frequency
exclude_punctuation bool True do not return ngrams with punctuation tokens
use_cache bool True retrieve the results from cache if available
Returns Result return a Result object with ngram data
conc.ngrams(token_str = 'said', ngram_length = 3, ngram_token_position = 'RIGHT', exclude_punctuation = True).display()
Ngrams for "said"
Reuters Corpus
Rank Ngram Frequency Normalized Frequency
1 the company said 1,173 8.39
2 the department said 194 1.39
3 the sources said 165 1.18
4 of england said 122 0.87
5 the spokesman said 116 0.83
6 the bank said 114 0.81
7 agriculture department said 106 0.76
8 trade sources said 95 0.68
9 company also said 93 0.66
10 the report said 93 0.66
11 but he said 75 0.54
12 it also said 71 0.51
13 the official said 71 0.51
14 he also said 70 0.50
15 industry sources said 68 0.49
16 industries inc said 68 0.49
17 the group said 66 0.47
18 the officials said 64 0.46
19 the statement said 59 0.42
20 company spokesman said 54 0.39
Report based on word tokens
Ngram length: 3, Token position: right
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 4,698
Total ngrams: 12,707
Showing 20 rows
Page 1 of 235
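
Besides LEFT and RIGHT, ngram_token_position accepts MIDDLE to place the token string inside the ngram. A sketch using the same corpus:

# trigrams with "said" as the middle token
conc.ngrams(token_str = 'said', ngram_length = 3, ngram_token_position = 'MIDDLE').display()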

source

Conc.ngram_frequencies

 Conc.ngram_frequencies (ngram_length:int=2, case_sensitive:bool=False,
                         normalize_by:int=10000, page_size:int=20,
                         page_current:int=1,
                         show_document_frequency:bool=False,
                         exclude_punctuation:bool=True)

Report frequent ngrams.

Parameter Type Default Details
ngram_length int 2 length of ngram
case_sensitive bool False frequencies for tokens lowercased or with case preserved
normalize_by int 10000 normalize frequencies by a number (e.g. 10000)
page_size int 20 number of rows to return
page_current int 1 current page
show_document_frequency bool False show document frequency in output (slow to compute for large corpora)
exclude_punctuation bool True exclude ngrams containing punctuation tokens
Returns Result return a Result object with the frequency table
conc.ngram_frequencies(ngram_length = 3, case_sensitive = False, exclude_punctuation = True, page_current = 1).display()
Ngram Frequencies
Reuters Corpus
Rank Ngram Frequency Normalized Frequency
1 the company said 1,173 8.39
2 mln dlrs in 795 5.68
3 cts vs loss 665 4.75
4 said it has 636 4.55
5 mln avg shrs 620 4.43
6 pct of the 608 4.35
7 the united states 603 4.31
8 qtr net shr 574 4.10
9 dlrs a share 546 3.90
10 inc said it 523 3.74
11 the company 's 518 3.70
12 cts net loss 517 3.70
13 the end of 501 3.58
14 cts a share 494 3.53
15 is expected to 429 3.07
16 corp said it 412 2.95
17 nine mths shr 412 2.95
18 said in a 407 2.91
19 the bank of 380 2.72
20 billion dlrs in 373 2.67
Report based on word tokens
Ngram length: 3
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 684,778
Total ngrams: 1,128,352
Showing 20 rows
Page 1 of 34239
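
Setting show_document_frequency adds a document frequency column, though as noted above this is slow to compute for large corpora. A sketch:

# bigram frequencies with document frequency included
conc.ngram_frequencies(ngram_length = 2, show_document_frequency = True, page_size = 10).display()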

source

Conc.concordance

 Conc.concordance (token_str:str, context_length:int=5,
                   order:str='1R2R3R', page_size:int=20,
                   page_current:int=1, show_all_columns:bool=False,
                   use_cache:bool=True, ignore_punctuation:bool=True,
                   filter_context_str:str|None=None,
                   filter_context_length:int|tuple[int,int]=5)

Report concordance for a token string.

Parameter Type Default Details
token_str str token string to get concordance for
context_length int 5 number of words to show on left and right of token string
order str 1R2R3R order of sort columns - one of 1L2L3L, 3L2L1L, 2L1L1R, 1L1R2R, 1R2R3R (default if omitted), LEFT, RIGHT
page_size int 20 number of results to display per results page
page_current int 1 current page of results
show_all_columns bool False df with all columns or just essentials
use_cache bool True retrieve the results from cache if available (currently ignored)
ignore_punctuation bool True whether to ignore punctuation in the concordance sort
filter_context_str str | None None if a string is provided, the concordance lines will be filtered to show lines with contexts containing this string
filter_context_length int | tuple[int, int] 5 ignored if filter_context_str is None, otherwise the context window size per side in tokens - if an int (e.g. 5), context lengths on left and right are the same; for independent control of left and right context lengths pass a tuple (context_length_left, context_length_right)
Returns Result concordance report results
conc.concordance('the company said', context_length = 5, order='1R2R3R').display()
Concordance for "the company said"
Reuters Corpus, Context tokens: 5, Order: 1R2R3R
Doc Id Left Node Right
3754 dividend payment since 1981 , the company said .
10165 second quarter of 1987 , the company said .
7883 of Metex common stock , the company said .
4520 110 dlrs per tonne , the company said .
1801 close in early July , the company said .
10681 Transamerica Occidental Life subsidiary , the company said .
7856 grades of crude oil , the company said .
10162 of record March 25 , the company said .
8573 adhesives for engineering plastics , the company said .
5681 saleable for 13 months , the company said .
4497 Corp earlier this year , the company said .
8740 encouraging signs at yearend , the company said .
10143 close of business today , the company said .
10163 heavy to 14.60 dlrs , the company said .
7484 operated by National Pizza , the company said .
3651 its sixth largest overall , the company said .
9655 May one , 1987 , the company said .
10741 approvals and other conditions , the company said .
8987 eastward from that point , the company said .
7408 of any other shareholder , the company said .
Total Concordance Lines: 1173
Total Documents: 911
Showing 20 lines
Page 1 of 59
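
The filter_context_str and filter_context_length parameters restrict the report to lines whose context contains a given string. A sketch (the filter token "bank" is illustrative only):

# concordance lines for "said" with "bank" within 5 tokens on either side
conc.concordance('said', context_length = 5,
                 filter_context_str = 'bank', filter_context_length = 5).display()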

source

Conc.concordance_plot

 Conc.concordance_plot (token_str:str, page_size:int=10,
                        append_info:bool=True)

Create a concordance plot.

Parameter Type Default Details
token_str str token string for concordance plot
page_size int 10 number of plots per page
append_info bool True append token position info to the concordance line previews shown when hovering over the plot lines
Returns Plot concordance plot object, add .display() to view in notebook
conc.concordance_plot('cause', page_size=10).display()

Concordance Plot for "cause"
Reuters Corpus
Total Documents: 121
Total Concordance Lines: 135
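
For frequent tokens, page_size controls how many plots appear per page, and append_info can suppress the token position info in the hover previews. A sketch:

# larger page of plots without token position info on hover
conc.concordance_plot('said', page_size = 20, append_info = False).display()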

source

Conc.set_reference_corpus

 Conc.set_reference_corpus (corpus:conc.corpus.Corpus|conc.listcorpus.ListCorpus)

Set a reference corpus for keyness analysis.

Parameter Type Details
corpus conc.corpus.Corpus | conc.listcorpus.ListCorpus Reference corpus
Returns None
# load a corpus as a reference corpus
brown = Corpus('brown').load(path_to_brown_corpus)

# set corpus as reference corpus
conc.set_reference_corpus(brown)

source

Conc.keywords

 Conc.keywords (effect_size_measure:str='log_ratio',
                statistical_significance_measure:str='log_likelihood',
                order:str|None=None, order_descending:bool=True,
                statistical_significance_cut:float|None=None,
                apply_bonferroni:bool=False, min_document_frequency:int=0,
                min_document_frequency_reference:int=0,
                min_frequency:int=0, min_frequency_reference:int=0,
                case_sensitive:bool=False, normalize_by:int=10000,
                page_size:int=20, page_current:int=1,
                show_document_frequency:bool=False,
                exclude_tokens:list[str]=[], exclude_tokens_text:str='',
                restrict_tokens:list[str]=[], restrict_tokens_text:str='',
                exclude_punctuation:bool=True,
                handle_common_typographic_differences:bool=True,
                exclude_negative_keywords:bool=True)

Get keywords for the corpus.

Parameter Type Default Details
effect_size_measure str log_ratio effect size measure to use, currently only ‘log_ratio’ is supported
statistical_significance_measure str log_likelihood statistical significance measure to use, currently only ‘log_likelihood’ is supported
order str | None None default of None orders by effect size measure, results can also be ordered by: frequency, frequency_reference, document_frequency, document_frequency_reference, log_likelihood
order_descending bool True order is descending or ascending
statistical_significance_cut float | None None statistical significance p-value to filter results, e.g. 0.05 or 0.01 or 0.001 - ignored if None or 0
apply_bonferroni bool False apply Bonferroni correction to the statistical significance cut-off
min_document_frequency int 0 minimum document frequency in target for token to be included in the report
min_document_frequency_reference int 0 minimum document frequency in reference for token to be included in the report
min_frequency int 0 minimum frequency in target for token to be included in the report
min_frequency_reference int 0 minimum frequency in reference for token to be included in the report
case_sensitive bool False frequencies for tokens with or without case preserved
normalize_by int 10000 normalize frequencies by a number (e.g. 10000)
page_size int 20 number of rows to return, if 0 returns all
page_current int 1 current page, ignored if page_size is 0
show_document_frequency bool False show document frequency in output
exclude_tokens list [] exclude specific tokens from report results
exclude_tokens_text str text to explain which tokens have been excluded, will be added to the report notes
restrict_tokens list [] restrict report to return results for a list of specific tokens
restrict_tokens_text str text to explain which tokens are included, will be added to the report notes
exclude_punctuation bool True exclude punctuation tokens
handle_common_typographic_differences bool True whether to detect and normalize common typographic differences in word tokens (currently focused on apostrophes in common English contractions); ignored when exclude_punctuation is False
exclude_negative_keywords bool True whether to exclude negative keywords from the report
Returns Result return a Result object with the frequency table
conc.keywords(statistical_significance_cut = 0.0001, min_document_frequency_reference = 5).display()
Keywords
Target corpus: Reuters Corpus, Reference corpus: Brown Corpus
Rank Token Frequency Frequency Reference Normalized Frequency Normalized Frequency Reference Relative Risk Log Ratio Log Likelihood
1 said 25,379 1,944 181.44 19.83 9.15 3.19 16,380.21
2 net 6,988 31 49.96 0.32 157.95 7.30 7,078.84
3 billion 5,828 65 41.66 0.66 62.83 5.97 5,589.95
4 loss 5,124 85 36.63 0.87 42.24 5.40 4,724.67
5 u.s. 5,496 155 39.29 1.58 24.85 4.63 4,691.63
6 year 7,523 831 53.78 8.48 6.34 2.67 4,051.72
7 bank 3,640 83 26.02 0.85 30.73 4.94 3,217.71
8 company 4,670 319 33.39 3.25 10.26 3.36 3,154.17
9 profit 2,960 36 21.16 0.37 57.61 5.85 2,817.73
10 oil 3,262 100 23.32 1.02 22.86 4.51 2,741.87
11 share 3,146 99 22.49 1.01 22.27 4.48 2,631.00
12 shares 2,652 45 18.96 0.46 41.30 5.37 2,438.84
13 trade 3,094 143 22.12 1.46 15.16 3.92 2,367.94
14 market 2,810 158 20.09 1.61 12.46 3.64 2,030.41
15 its 7,402 1,780 52.92 18.16 2.91 1.54 1,987.46
16 stock 2,629 146 18.79 1.49 12.62 3.66 1,907.10
17 prices 2,194 61 15.69 0.62 25.20 4.66 1,877.65
18 japan 1,854 34 13.25 0.35 38.21 5.26 1,688.89
19 april 2,039 71 14.58 0.72 20.12 4.33 1,670.31
20 quarter 1,852 44 13.24 0.45 29.49 4.88 1,626.89
Report based on word tokens
Filtered tokens by minimum document frequency in reference corpus (5)
Keywords filtered based on p-value: 0.0001
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 1,398,782
Total word tokens in reference corpus: 980,144
Keywords: 3,691
Showing 20 rows
Page 1 of 185
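
The order, statistical_significance_cut and apply_bonferroni parameters reorder and filter the keyword report. A sketch, assuming the reference corpus set above:

# keywords ordered by log likelihood with a Bonferroni-corrected cut-off
conc.keywords(order = 'log_likelihood',
              statistical_significance_cut = 0.01,
              apply_bonferroni = True,
              min_frequency = 5).display()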

source

Conc.collocates

 Conc.collocates (token_str:str, effect_size_measure:str='logdice',
                  statistical_significance_measure:str='log_likelihood',
                  order:str|None=None, order_descending:bool=True,
                  statistical_significance_cut:float|None=None,
                  apply_bonferroni:bool=False,
                  context_length:int|tuple[int,int]=5,
                  min_collocate_frequency:int=5, page_size:int=20,
                  page_current:int=1, exclude_punctuation:bool=True)

Report collocates for a given token string.

Parameter Type Default Details
token_str str Token to search for
effect_size_measure str logdice effect size measure to use for collocation scoring: logdice, mutual_information
statistical_significance_measure str log_likelihood statistical significance measure to use, currently only ‘log_likelihood’ is supported
order str | None None default of None orders by collocation_measure, results can also be ordered by: collocate_frequency, frequency, log_likelihood
order_descending bool True order is descending or ascending
statistical_significance_cut float | None None statistical significance p-value to filter results, e.g. 0.05 or 0.01 or 0.001 - ignored if None or 0
apply_bonferroni bool False apply Bonferroni correction to the statistical significance cut-off
context_length int | tuple[int, int] 5 window size per side in tokens - if an int (e.g. 5), context lengths on left and right are the same; for independent control of left and right context lengths pass a tuple (context_length_left, context_length_right) (e.g. (0, 5))
min_collocate_frequency int 5 Minimum count of collocates
page_size int 20 number of rows to return, if 0 returns all
page_current int 1 current page, ignored if page_size is 0
exclude_punctuation bool True exclude punctuation tokens
Returns Result
conc.collocates('the company', context_length = (0, 1), exclude_punctuation = False).display()
Collocates of "the company"
Reuters Corpus
Rank Token Collocate Frequency Frequency Logdice Log Likelihood
1 said 1,173 25,379 10.40 5,149.40
2 's 518 9,627 10.38 2,429.14
3 also 107 2,532 9.28 450.99
4 reported 51 775 8.74 259.63
5 has 69 4,874 8.14 151.05
6 had 47 2,975 7.98 111.88
7 earned 22 159 7.78 145.70
8 would 43 4,688 7.49 63.27
9 is 59 7,673 7.48 70.93
10 will 49 5,951 7.47 63.94
11 did 20 673 7.43 70.77
12 was 46 5,826 7.40 57.12
13 today 15 1,445 6.75 25.03
14 . 165 49,406 6.69 35.53
15 added 13 1,116 6.65 24.16
16 expects 11 628 6.59 28.20
17 might 10 402 6.54 32.04
18 does 9 408 6.38 26.84
19 lost 8 150 6.32 37.38
20 earlier 10 1,100 6.28 14.57
Report based on word and punctuation tokens
Context tokens left: 0, context tokens right: 1
Filtered tokens by minimum collocation frequency (5)
Unique collocates: 49
Showing 20 rows
Page 1 of 3
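
The effect_size_measure parameter also supports mutual_information as an alternative to the default logdice. A sketch (the token "oil" is illustrative only):

# collocates of "oil" scored by mutual information, 5 tokens each side
conc.collocates('oil', effect_size_measure = 'mutual_information',
                context_length = 5, min_collocate_frequency = 10).display()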