collocates

Functionality for collocation analysis.

About Conc’s Collocates functionality

Conc implements logDice as introduced in Rychlý’s (2008) paper “A Lexicographer-Friendly Association Score”. Conc’s implementation of Mutual Information follows the formula given in the same paper.
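As a rough illustration of the two measures, the sketch below computes logDice and Mutual Information from raw frequencies using the formulas in Rychlý (2008): logDice = 14 + log2(2 * f_xy / (f_x + f_y)) and MI = log2(f_xy * N / (f_x * f_y)). The function and argument names are illustrative and are not part of Conc's API; Conc's own implementation may differ in how it derives the counts.

```python
import math


def logdice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice from Rychlý (2008): 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy: co-occurrence frequency of node and collocate
    f_x:  frequency of the node token
    f_y:  frequency of the collocate token
    """
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def mutual_information(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """MI score as given in Rychlý (2008): log2(f_xy * N / (f_x * f_y)).

    n: corpus size in tokens
    """
    return math.log2(f_xy * n / (f_x * f_y))


# Counts taken from the "economy" table further down this page.
# Assumption: the node frequency is 621, the Frequency value shown in the
# table's own "economy" row.
stimulate = logdice(f_xy=29, f_x=621, f_y=85)   # table reports 10.39
boost = logdice(f_xy=20, f_x=621, f_y=222)      # table reports 9.60
print(f"{stimulate:.2f} {boost:.2f}")
```

Under the stated assumption about the node frequency, these values agree with the Logdice column of the example table to two decimal places. logDice has a theoretical maximum of 14, reached when the two tokens only ever occur together.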

Using the Collocates class

The examples below use the Collocates class directly to output collocation tables. The recommended way to use this functionality is through the Conc class, which provides an interface to create frequency lists, concordances, collocation tables, keyword tables and more.

Collocates class API reference


Collocates

 Collocates (corpus:conc.corpus.Corpus)

Class for collocation analysis reporting.

	Type	Details
corpus	Corpus	Corpus instance


Collocates.collocates

 Collocates.collocates (token_str:str, effect_size_measure:str='logdice',
                        statistical_significance_measure:str='log_likelihood',
                        order:str|None=None, order_descending:bool=True,
                        statistical_significance_cut:float|None=None,
                        apply_bonferroni:bool=False,
                        context_length:int|tuple[int,int]=5,
                        min_collocate_frequency:int=5, page_size:int=20,
                        page_current:int=1, exclude_punctuation:bool=True)

Report collocates for a given token string.

	Type	Default	Details
token_str	str		Token to search for
effect_size_measure	str	logdice	statistical measure to use for collocation calculation: logdice, mutual_information
statistical_significance_measure	str	log_likelihood	statistical significance measure to use, currently only ‘log_likelihood’ is supported
order	str | None	None	None (the default) orders by the collocation measure; results can also be ordered by collocate_frequency, frequency or log_likelihood
order_descending	bool	True	if True, order results descending; if False, ascending
statistical_significance_cut	float | None	None	p-value threshold used to filter results, e.g. 0.05, 0.01 or 0.001; ignored if None or 0
apply_bonferroni	bool	False	apply Bonferroni correction to the statistical significance cut-off
context_length	int | tuple[int, int]	5	Window size per side in tokens. An int (e.g. 5) sets the same context length on the left and right; for independent control pass a tuple (context_length_left, context_length_right), e.g. (0, 5)
min_collocate_frequency	int	5	Minimum count of collocates
page_size	int	20	number of rows to return, if 0 returns all
page_current	int	1	current page, ignored if page_size is 0
exclude_punctuation	bool	True	exclude punctuation tokens
Returns	Result
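To make the context_length parameter concrete, here is a minimal sketch of how a collocation window with an int or a (left, right) tuple can be interpreted. It is a naive pure-Python illustration, not Conc's implementation: the function name window_cooccurrences is hypothetical, and overlapping windows are counted once per node occurrence, which may differ from what Conc does.

```python
from collections import Counter


def window_cooccurrences(tokens, node, context_length=5):
    """Count tokens occurring within a window around each occurrence of node.

    context_length may be an int (same span on both sides) or a
    (left, right) tuple for independent control of each side.
    """
    if isinstance(context_length, int):
        left = right = context_length
    else:
        left, right = context_length

    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - left)                 # clip at start of text
        end = min(len(tokens), i + right + 1)    # clip at end of text
        counts.update(tokens[start:i])           # left context
        counts.update(tokens[i + 1:end])         # right context
    return counts


tokens = "the world economy grew as the domestic economy slowed".split()
both_sides = window_cooccurrences(tokens, "economy", context_length=2)
right_only = window_cooccurrences(tokens, "economy", context_length=(0, 1))
print(both_sides)
print(right_only)
```

Passing (0, 1) restricts the window to the single token following the node, so only "grew" and "slowed" are counted in this toy sentence.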

Examples

See the note above about accessing this functionality through the Conc class.

collocates = Collocates(reuters)

query = 'economy'
collocates.collocates(query, order=None, order_descending=True,
                      statistical_significance_cut=0.0001, apply_bonferroni=True,
                      effect_size_measure='logdice', context_length=5,
                      min_collocate_frequency=5, page_current=1).display()

Collocates of "economy"
Reuters Corpus
Rank	Token	Collocate Frequency	Frequency	Logdice	Log Likelihood
1	stimulate	29	85	10.39	206.37
2	boost	20	222	9.60	84.59
3	japanese	35	944	9.52	88.82
4	domestic	27	700	9.39	70.45
5	german	23	537	9.35	64.41
6	world	35	1,173	9.32	75.37
7	grew	12	103	9.09	57.00
8	sluggish	10	44	8.94	61.75
9	economy	18	621	8.89	195.51
10	measures	13	288	8.87	37.66
11	sectors	10	89	8.85	46.76
12	performance	11	165	8.84	40.00
13	signs	10	107	8.81	43.03
14	economists	12	325	8.70	30.36
15	impact	11	249	8.69	31.43
16	west	20	964	8.69	30.92
17	strength	9	95	8.69	38.97
18	good	12	361	8.65	28.11
19	shows	8	65	8.58	38.90
20	u.s.	70	5,496	8.55	57.99
Report based on word tokens
Context tokens left: 5, context tokens right: 5
Filtered tokens by minimum collocation frequency (5)
Keywords filtered based on p-value 0.0001 with Bonferroni correction (based on 204 tests)
Unique collocates: 34
Showing 20 rows
Page 1 of 2
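The report footer mentions a p-value of 0.0001 with Bonferroni correction based on 204 tests. The sketch below shows how such a cut-off is conventionally derived and checked. It assumes the standard Bonferroni adjustment (dividing the p-value cut by the number of tests) and assumes the log-likelihood statistic is compared against a chi-square distribution with one degree of freedom, as is common in corpus linguistics. The function names are illustrative, and Conc's internal calculation may differ.

```python
import math


def bonferroni_cut(p_cut: float, n_tests: int) -> float:
    """Standard Bonferroni adjustment: divide the cut-off by the number of tests."""
    return p_cut / n_tests


def log_likelihood_p_value(ll: float) -> float:
    """p-value for a log-likelihood score, assuming chi-square with 1 degree of freedom.

    For one degree of freedom the survival function is erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(ll / 2))


# Values from the report above: p-value cut 0.0001, 204 tests.
corrected = bonferroni_cut(0.0001, 204)

# Lowest Log Likelihood shown on page 1 of the table ("good", 28.11).
p_good = log_likelihood_p_value(28.11)

print(f"corrected cut-off: {corrected:.2e}")
print(f"p-value for LL=28.11: {p_good:.2e}")
print("passes filter:", p_good < corrected)
```

Under these assumptions, even the lowest log-likelihood score on the first page of the table falls below the corrected cut-off, which is consistent with that row surviving the filter.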