collocates = Collocates(reuters)
collocates
Functionality for collocation analysis.
Collocates
Collocates (corpus:conc.corpus.Corpus)
Class for collocation analysis reporting.
Parameter | Type | Details |
---|---|---|
corpus | Corpus | Corpus instance |
Collocates.collocates
Collocates.collocates (token_str:str, effect_size_measure:str='logdice', statistical_significance_measure:str='log_likelihood', order:str|None=None, order_descending:bool=True, statistical_significance_cut:float|None=None, apply_bonferroni:bool=False, context_length:int|tuple[int,int]=5, min_collocate_frequency:int=5, page_size:int=20, page_current:int=1, exclude_punctuation:bool=True)
Report collocates for a given token string.
Parameter | Type | Default | Details |
---|---|---|---|
token_str | str | | Token to search for |
effect_size_measure | str | logdice | Statistical measure used for the collocation calculation: logdice or mutual_information |
statistical_significance_measure | str | log_likelihood | Statistical significance measure to use; currently only 'log_likelihood' is supported |
order | str \| None | None | The default of None orders by the collocation measure; results can also be ordered by collocate_frequency, frequency, or log_likelihood |
order_descending | bool | True | True for descending order, False for ascending |
statistical_significance_cut | float \| None | None | Statistical significance p-value used to filter results, e.g. 0.05, 0.01, or 0.001; ignored if None or 0 |
apply_bonferroni | bool | False | Apply Bonferroni correction to the statistical significance cut-off |
context_length | int \| tuple[int, int] | 5 | Window size per side in tokens. An int (e.g. 5) sets the same context length on the left and right; for independent control of each side pass a tuple (context_length_left, context_length_right), e.g. (0, 5) |
min_collocate_frequency | int | 5 | Minimum collocate frequency for a token to be reported |
page_size | int | 20 | Number of rows to return; if 0, returns all rows |
page_current | int | 1 | Current page; ignored if page_size is 0 |
exclude_punctuation | bool | True | Exclude punctuation tokens |
Returns | Result | | |
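To make `context_length` and `min_collocate_frequency` concrete, the sketch below counts the tokens that fall inside a window around each occurrence of a node token. It is a plain-Python illustration, not the library's implementation: `normalize_context` and `count_collocates` are hypothetical helpers, and the token list is toy data invented for the example.

```python
from collections import Counter


def normalize_context(context_length):
    """Return (left, right) window sizes from an int or a (left, right) tuple."""
    if isinstance(context_length, int):
        return context_length, context_length
    left, right = context_length
    return left, right


def count_collocates(tokens, node, context_length=5, min_collocate_frequency=1):
    """Count tokens that co-occur with `node` inside the context window."""
    left, right = normalize_context(context_length)
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - left)
        # The window covers both sides but skips the node position itself.
        window = tokens[start:i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    # Drop collocates seen fewer times than the minimum frequency.
    return {t: c for t, c in counts.items() if c >= min_collocate_frequency}


# Toy data, invented for illustration only.
tokens = "measures to stimulate the economy will boost the economy".split()
both_sides = count_collocates(tokens, "economy", context_length=2)
right_only = count_collocates(tokens, "economy", context_length=(0, 2))
print(both_sides)
print(right_only)
```

Passing a tuple such as `(0, 2)` restricts the count to tokens that follow the node, which is the kind of independent left/right control the `context_length` parameter describes.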
for word in ["economy"]: # brown used 'i went in', 'any of us', for testing "economy"
Collocates of "economy"
Reuters Corpus

Rank | Token | Collocate Frequency | Frequency | Logdice | Log Likelihood |
---|---|---|---|---|---|
1 | stimulate | 29 | 85 | 10.39 | 206.37 |
2 | boost | 20 | 222 | 9.60 | 84.59 |
3 | japanese | 35 | 944 | 9.52 | 88.82 |
4 | domestic | 27 | 700 | 9.39 | 70.45 |
5 | german | 23 | 537 | 9.35 | 64.41 |
6 | world | 35 | 1,173 | 9.32 | 75.37 |
7 | grew | 12 | 103 | 9.09 | 57.00 |
8 | sluggish | 10 | 44 | 8.94 | 61.75 |
9 | economy | 18 | 621 | 8.89 | 195.51 |
10 | measures | 13 | 288 | 8.87 | 37.66 |
11 | sectors | 10 | 89 | 8.85 | 46.76 |
12 | performance | 11 | 165 | 8.84 | 40.00 |
13 | signs | 10 | 107 | 8.81 | 43.03 |
14 | economists | 12 | 325 | 8.70 | 30.36 |
15 | impact | 11 | 249 | 8.69 | 31.43 |
16 | west | 20 | 964 | 8.69 | 30.92 |
17 | strength | 9 | 95 | 8.69 | 38.97 |
18 | good | 12 | 361 | 8.65 | 28.11 |
19 | shows | 8 | 65 | 8.58 | 38.90 |
20 | u.s. | 70 | 5,496 | 8.55 | 57.99 |

Report based on word tokens
Context tokens left: 5, context tokens right: 5
Filtered tokens by minimum collocation frequency (5)
Keywords filtered based on p-value 0.0001 with Bonferroni correction (based on 204 tests)
Unique collocates: 34
Showing 20 rows
Page 1 of 2
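The figures in this report can be sanity-checked by hand. The sketch below rests on two assumptions that the report itself does not state. First, it assumes Logdice follows the standard logDice formula, 14 + log2(2 * f_xy / (f_x + f_y)). Second, it assumes the log-likelihood score is compared against a chi-square distribution with one degree of freedom, which is the usual convention. `logdice`, `p_value_from_log_likelihood`, and `bonferroni_cut` are hypothetical helpers written for this check and are not part of the library; 621 is the frequency of "economy" read from its own row.

```python
import math


def logdice(collocate_frequency, node_frequency, token_frequency):
    """Standard logDice: 14 + log2(2 * f_xy / (f_x + f_y))."""
    return 14 + math.log2(
        2 * collocate_frequency / (node_frequency + token_frequency)
    )


def p_value_from_log_likelihood(log_likelihood):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(log_likelihood / 2))


def bonferroni_cut(p_value, n_tests):
    """Bonferroni-adjusted cut-off: the p-value divided by the number of tests."""
    return p_value / n_tests


NODE_FREQUENCY = 621  # frequency of "economy", taken from its row in the report

# (token, collocate frequency, frequency, reported logDice, reported log-likelihood)
rows = [
    ("stimulate", 29, 85, 10.39, 206.37),
    ("boost", 20, 222, 9.60, 84.59),
    ("good", 12, 361, 8.65, 28.11),
]

# Cut-off reported in the footer: p-value 0.0001, corrected for 204 tests.
cut = bonferroni_cut(0.0001, 204)

for token, f_xy, f_y, reported_logdice, log_likelihood in rows:
    computed = logdice(f_xy, NODE_FREQUENCY, f_y)
    p_value = p_value_from_log_likelihood(log_likelihood)
    print(
        f"{token}: logDice computed {computed:.2f} (reported {reported_logdice:.2f}), "
        f"passes adjusted cut: {p_value < cut}"
    )
```

Under these assumptions the recomputed logDice values agree with the reported column to two decimal places, and the adjusted cut-off is roughly 4.9e-7, which even the lowest log-likelihood score on this page (28.11) satisfies.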
CPU times: user 82.1 ms, sys: 165 ms, total: 247 ms
Wall time: 159 ms