```python
# load the target corpus
gardenparty = Corpus().load(path_to_gardenparty_corpus)
```
keyness
Functionality for keyness analysis.
Keyness
```python
Keyness(corpus: conc.corpus.Corpus, reference_corpus: conc.corpus.Corpus)
```
Class for keyness analysis reporting.
| | Type | Details |
|---|---|---|
| corpus | Corpus | Corpus instance |
| reference_corpus | Corpus | Corpus for comparison |
Keyness.keywords
```python
Keyness.keywords(
    effect_size_measure: str = 'log_ratio',
    statistical_significance_measure: str = 'log_likelihood',
    order: str | None = None,
    order_descending: bool = True,
    statistical_significance_cut: float | None = None,
    apply_bonferroni: bool = False,
    min_document_frequency: int = 0,
    min_document_frequency_reference: int = 0,
    min_frequency: int = 0,
    min_frequency_reference: int = 0,
    case_sensitive: bool = False,
    normalize_by: int = 10000,
    page_size: int = 20,
    page_current: int = 1,
    show_document_frequency: bool = False,
    exclude_tokens: list[str] = [],
    exclude_tokens_text: str = '',
    restrict_tokens: list[str] = [],
    restrict_tokens_text: str = '',
    exclude_punctuation: bool = True
)
```
Get keywords for the corpus.
| | Type | Default | Details |
|---|---|---|---|
| effect_size_measure | str | log_ratio | effect size measure to use, currently only 'log_ratio' is supported |
| statistical_significance_measure | str | log_likelihood | statistical significance measure to use, currently only 'log_likelihood' is supported |
| order | str \| None | None | default of None orders by the effect size measure; results can also be ordered by: frequency, frequency_reference, document_frequency, document_frequency_reference, log_likelihood |
| order_descending | bool | True | order descending or ascending |
| statistical_significance_cut | float \| None | None | statistical significance p-value to filter results, e.g. 0.05, 0.01 or 0.001; ignored if None or 0 |
| apply_bonferroni | bool | False | apply Bonferroni correction to the statistical significance cut-off |
| min_document_frequency | int | 0 | minimum document frequency in the target corpus for a token to be included in the report |
| min_document_frequency_reference | int | 0 | minimum document frequency in the reference corpus for a token to be included in the report |
| min_frequency | int | 0 | minimum frequency in the target corpus for a token to be included in the report |
| min_frequency_reference | int | 0 | minimum frequency in the reference corpus for a token to be included in the report |
| case_sensitive | bool | False | frequencies for tokens with or without case preserved |
| normalize_by | int | 10000 | normalize frequencies by a number (e.g. 10000) |
| page_size | int | 20 | number of rows to return; if 0, returns all |
| page_current | int | 1 | current page, ignored if page_size is 0 |
| show_document_frequency | bool | False | show document frequency in output |
| exclude_tokens | list | [] | exclude specific tokens from report results |
| exclude_tokens_text | str | '' | text to explain which tokens have been excluded, will be added to the report notes |
| restrict_tokens | list | [] | restrict report to return results for a list of specific tokens |
| restrict_tokens_text | str | '' | text to explain which tokens are included, will be added to the report notes |
| exclude_punctuation | bool | True | exclude punctuation tokens |
| Returns | Result | | return a Result object with the frequency table |
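The report further down this page shows Relative Risk, Log Ratio and Log Likelihood columns side by side. Reading off those values, Log Ratio appears to be the binary log of the Relative Risk of normalized frequencies; this is an inference from the reported numbers, not a statement of Conc's internal implementation. A minimal sketch checking the first report row ('laura') against that reading:

```python
import math

# Assumption: Relative Risk is the ratio of normalized frequencies
# (target vs reference) and Log Ratio is log2 of that ratio.
# Raw counts and corpus sizes are taken from the report below.
frequency_target = 86
frequency_reference = 14
tokens_target = 63_311
tokens_reference = 980_144

relative_risk = (frequency_target / tokens_target) / (frequency_reference / tokens_reference)
log_ratio = math.log2(relative_risk)

print(round(relative_risk, 2))  # 95.1, matching the Relative Risk column
print(round(log_ratio, 2))      # 6.57, matching the Log Ratio column
```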
```python
# load the reference corpus
brown = Corpus().load(path_to_brown_corpus)
```
```python
# instantiate the Keyness class
keyness = Keyness(corpus = gardenparty, reference_corpus = brown)
```
```python
# generate and display the keywords report
keyness.keywords(show_document_frequency = True,
                 min_document_frequency_reference = 5,
                 statistical_significance_cut = 0.0001,
                 apply_bonferroni = True,
                 order_descending = True,
                 page_current = 1).display()
```
Keywords
Target corpus: Garden Party Corpus, Reference corpus: Brown Corpus

| Rank | Token | Frequency | Frequency Reference | Document Frequency | Document Frequency Reference | Normalized Frequency | Normalized Frequency Reference | Relative Risk | Log Ratio | Log Likelihood |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | laura | 86 | 14 | 2 | 6 | 13.58 | 0.14 | 95.10 | 6.57 | 402.74 |
| 2 | jug | 30 | 6 | 2 | 5 | 4.74 | 0.06 | 77.41 | 6.27 | 136.44 |
| 3 | grandma | 73 | 15 | 2 | 5 | 11.53 | 0.15 | 75.34 | 6.24 | 330.64 |
| 4 | meadows | 33 | 7 | 1 | 5 | 5.21 | 0.07 | 72.98 | 6.19 | 148.73 |
| 5 | con | 27 | 7 | 1 | 5 | 4.26 | 0.07 | 59.71 | 5.90 | 117.62 |
| 6 | bye | 25 | 7 | 9 | 7 | 3.95 | 0.07 | 55.29 | 5.79 | 107.37 |
| 7 | velvet | 14 | 5 | 6 | 5 | 2.21 | 0.05 | 43.35 | 5.44 | 57.19 |
| 8 | shone | 13 | 5 | 7 | 5 | 2.05 | 0.05 | 40.25 | 5.33 | 52.21 |
| 9 | queer | 15 | 6 | 5 | 6 | 2.37 | 0.06 | 38.70 | 5.27 | 59.69 |
| 10 | gloves | 17 | 7 | 7 | 5 | 2.69 | 0.07 | 37.60 | 5.23 | 67.18 |
| 11 | cried | 59 | 26 | 12 | 23 | 9.32 | 0.27 | 35.13 | 5.13 | 229.24 |
| 12 | faintly | 14 | 7 | 7 | 6 | 2.21 | 0.07 | 30.96 | 4.95 | 52.61 |
| 13 | darling | 36 | 18 | 8 | 13 | 5.69 | 0.18 | 30.96 | 4.95 | 135.27 |
| 14 | sandy | 11 | 6 | 3 | 6 | 1.74 | 0.06 | 28.38 | 4.83 | 40.33 |
| 15 | alice | 21 | 13 | 2 | 6 | 3.32 | 0.13 | 25.01 | 4.64 | 74.09 |
| 16 | oh | 149 | 93 | 15 | 62 | 23.53 | 0.95 | 24.80 | 4.63 | 524.30 |
| 17 | handkerchief | 14 | 9 | 8 | 6 | 2.21 | 0.09 | 24.08 | 4.59 | 48.80 |
| 18 | charlotte | 22 | 15 | 1 | 5 | 3.47 | 0.15 | 22.71 | 4.51 | 75.22 |
| 19 | dear | 78 | 54 | 13 | 36 | 12.32 | 0.55 | 22.36 | 4.48 | 265.31 |
| 20 | breathed | 13 | 9 | 7 | 9 | 2.05 | 0.09 | 22.36 | 4.48 | 44.22 |

Report based on word tokens
Filtered tokens by minimum document frequency in reference corpus (5)
Keywords filtered based on p-value 0.0001 with Bonferroni correction (based on 3378 tests)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 63,311
Total word tokens in reference corpus: 980,144
Keywords: 243
Showing 20 rows
Page 1 of 13
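The same Keyness instance can produce differently ordered, filtered or paged reports. The calls below are a sketch using only the parameters documented above; the excluded token list and note text are illustrative and not part of the original example.

```python
# order the report by log likelihood instead of the effect size measure
keyness.keywords(order = 'log_likelihood',
                 statistical_significance_cut = 0.0001,
                 apply_bonferroni = True,
                 min_document_frequency_reference = 5).display()

# exclude some character names (illustrative list) and note the exclusion in the report
keyness.keywords(exclude_tokens = ['laura', 'alice', 'charlotte'],
                 exclude_tokens_text = 'Excluded character names',
                 min_document_frequency_reference = 5).display()

# retrieve the next page of results, or all rows at once with page_size = 0
keyness.keywords(min_document_frequency_reference = 5, page_current = 2).display()
keyness.keywords(min_document_frequency_reference = 5, page_size = 0).display()
```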