# load the target corpus
gardenparty = Corpus().load(path_to_gardenparty_corpus)
keyness
About Conc’s Keyness functionality
Conc implements Log Ratio, a keyness measure introduced by Andrew Hardie in this “informal introduction”. In that piece Hardie also discusses Relative Risk, which is implemented in Conc as well. The Log Likelihood implementation is based on the discussion accompanying Paul Rayson’s Log-likelihood and effect size calculator, which covers a variety of keyness measures and the practical issues involved in implementing them. Conc follows the approach mentioned by Rayson of substituting an observed frequency of 0.5 for words that do not appear in the target or reference corpus, to avoid division-by-zero issues. Conc also follows Rayson’s handling of zero values when calculating log likelihood (see Note 2 on that page).
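To make these calculations concrete, the sketch below shows how the three measures can be computed, including the 0.5 substitution for zero frequencies and the handling of zero values in the log likelihood sum. It is a minimal illustration of the formulas described by Hardie and Rayson, not Conc’s internal implementation, and the function names are hypothetical.

```python
import math

# Illustrative sketch only (hypothetical function names, not Conc's internals).

def log_ratio(freq, total, freq_ref, total_ref):
    # Substitute 0.5 for a zero observed frequency to avoid division by zero.
    f = freq if freq > 0 else 0.5
    f_ref = freq_ref if freq_ref > 0 else 0.5
    return math.log2((f / total) / (f_ref / total_ref))

def relative_risk(freq, total, freq_ref, total_ref):
    # Ratio of relative frequencies, with the same 0.5 substitution.
    f = freq if freq > 0 else 0.5
    f_ref = freq_ref if freq_ref > 0 else 0.5
    return (f / total) / (f_ref / total_ref)

def log_likelihood(freq, total, freq_ref, total_ref):
    # Expected frequencies under the assumption that the token is
    # (proportionally) equally frequent in both corpora.
    expected = total * (freq + freq_ref) / (total + total_ref)
    expected_ref = total_ref * (freq + freq_ref) / (total + total_ref)
    ll = 0.0
    for observed, exp in ((freq, expected), (freq_ref, expected_ref)):
        if observed > 0:  # zero-frequency terms contribute nothing to the sum
            ll += observed * math.log(observed / exp)
    return 2 * ll

# Roughly reproduces the 'josephine' row of the example report below
# (117 occurrences in 59,514 target tokens, 0 in 980,144 reference tokens):
# log_ratio(117, 59514, 0, 980144)       -> ~11.91
# relative_risk(117, 59514, 0, 980144)   -> ~3853.8
# log_likelihood(117, 59514, 0, 980144)  -> ~669.3
```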
Using the Keyness class
There are examples below showing how to use the Keyness class directly to output keyword tables. The recommended way to access this functionality, however, is through the Conc class, which provides an interface for creating frequency lists, concordances, collocation tables, keyword tables and more.
Keyness class API reference
Keyness
Keyness (corpus:conc.corpus.Corpus|conc.listcorpus.ListCorpus, reference_corpus:conc.corpus.Corpus|conc.listcorpus.ListCorpus)
Class for keyness analysis reporting.
|  | Type | Details |
|---|---|---|
| corpus | conc.corpus.Corpus \| conc.listcorpus.ListCorpus | Corpus instance |
| reference_corpus | conc.corpus.Corpus \| conc.listcorpus.ListCorpus | Corpus for comparison |
Keyness.keywords
Keyness.keywords (effect_size_measure:str='log_ratio', statistical_significance_measure:str='log_likelihood', order:str|None=None, order_descending:bool=True, statistical_significance_cut:float|None=None, apply_bonferroni:bool=False, min_document_frequency:int=0, min_document_frequency_reference:int=0, min_frequency:int=0, min_frequency_reference:int=0, case_sensitive:bool=False, normalize_by:int=10000, page_size:int=20, page_current:int=1, show_document_frequency:bool=False, exclude_tokens:list[str]=[], exclude_tokens_text:str='', restrict_tokens:list[str]=[], restrict_tokens_text:str='', exclude_punctuation:bool=True, handle_common_typographic_differences:bool=True, exclude_negative_keywords:bool=True)
Get keywords for the corpus.
|  | Type | Default | Details |
|---|---|---|---|
| effect_size_measure | str | log_ratio | effect size measure to use, currently only ‘log_ratio’ is supported |
| statistical_significance_measure | str | log_likelihood | statistical significance measure to use, currently only ‘log_likelihood’ is supported |
| order | str \| None | None | default of None orders by the statistical significance measure; results can also be ordered by: frequency, frequency_reference, document_frequency, document_frequency_reference, log_likelihood |
| order_descending | bool | True | whether ordering is descending (True) or ascending (False) |
| statistical_significance_cut | float \| None | None | statistical significance p-value to filter results, e.g. 0.05, 0.01 or 0.001; ignored if None or 0 |
| apply_bonferroni | bool | False | apply Bonferroni correction to the statistical significance cut-off |
| min_document_frequency | int | 0 | minimum document frequency in the target corpus for a token to be included in the report |
| min_document_frequency_reference | int | 0 | minimum document frequency in the reference corpus for a token to be included in the report |
| min_frequency | int | 0 | minimum frequency in the target corpus for a token to be included in the report |
| min_frequency_reference | int | 0 | minimum frequency in the reference corpus for a token to be included in the report |
| case_sensitive | bool | False | whether token frequencies are computed with case preserved |
| normalize_by | int | 10000 | normalize frequencies by a number (e.g. 10000) |
| page_size | int | 20 | number of rows to return; if 0, returns all |
| page_current | int | 1 | current page, ignored if page_size is 0 |
| show_document_frequency | bool | False | show document frequency in output |
| exclude_tokens | list | [] | exclude specific tokens from report results |
| exclude_tokens_text | str | '' | text to explain which tokens have been excluded, added to the report notes |
| restrict_tokens | list | [] | restrict report to return results for a list of specific tokens |
| restrict_tokens_text | str | '' | text to explain which tokens are included, added to the report notes |
| exclude_punctuation | bool | True | exclude punctuation tokens |
| handle_common_typographic_differences | bool | True | whether to detect and normalize common differences in word tokens due to typographic differences (currently focused on apostrophes in common English contractions); ignored when exclude_punctuation is False |
| exclude_negative_keywords | bool | True | whether to exclude negative keywords from the report |
| **Returns** | Result |  | a Result object with the keyword table |
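To illustrate how some of these parameters combine, a call along the following lines restricts the report to tokens that occur at least five times in the target corpus and orders rows by target-corpus frequency, returning all rows rather than a single page. This is an illustrative sketch that assumes a Keyness instance named `keyness`, created as in the Examples section below.

```python
# assumes `keyness` has been created as in the Examples section below
keyness.keywords(
    min_frequency = 5,        # only tokens with at least 5 occurrences in the target corpus
    order = 'frequency',      # order by frequency in the target corpus
    order_descending = True,
    page_size = 0             # return all rows rather than a single page
).display()
```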
Examples
See the note above about accessing this functionality through the Conc class.
# load the reference corpus
brown = Corpus().load(path_to_brown_corpus)
# instantiate the Keyness class
keyness = Keyness(corpus = gardenparty, reference_corpus = brown)
# generate and display the keywords report
keyness.keywords(show_document_frequency = True, statistical_significance_cut = 0.0001, apply_bonferroni = True, order_descending = True, page_current = 1).display()
Keywords
Target corpus: Garden Party Corpus, Reference corpus: Brown Corpus

| Rank | Token | Frequency | Frequency Reference | Document Frequency | Document Frequency Reference | Normalized Frequency | Normalized Frequency Reference | Relative Risk | Log Ratio | Log Likelihood |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | she | 1,171 | 2,060 | 15 | 216 | 196.76 | 21.02 | 9.36 | 3.23 | 2,710.68 |
| 2 | her | 937 | 2,887 | 15 | 253 | 157.44 | 29.45 | 5.35 | 2.42 | 1,442.33 |
| 3 | josephine | 117 | 0 | 1 | 0 | 19.66 | 0.00 | 3,853.78 | 11.91 | 669.34 |
| 4 | said | 514 | 1,944 | 15 | 315 | 86.37 | 19.83 | 4.35 | 2.12 | 648.89 |
| 5 | n’t | 522 | 2,016 | 15 | 286 | 87.71 | 20.57 | 4.26 | 2.09 | 644.51 |
| 6 | you | 642 | 3,265 | 15 | 293 | 107.87 | 33.31 | 3.24 | 1.70 | 566.70 |
| 7 | it | 1,021 | 6,991 | 15 | 498 | 171.56 | 71.33 | 2.41 | 1.27 | 552.39 |
| 8 | oh | 149 | 93 | 15 | 62 | 25.04 | 0.95 | 26.39 | 4.72 | 540.97 |
| 9 | little | 307 | 823 | 15 | 322 | 51.58 | 8.40 | 6.14 | 2.62 | 531.41 |
| 10 | constantia | 91 | 0 | 1 | 0 | 15.29 | 0.00 | 2,997.38 | 11.55 | 520.60 |
| 11 | i | 719 | 4,370 | 15 | 335 | 120.81 | 44.59 | 2.71 | 1.44 | 483.12 |
| 12 | laura | 86 | 14 | 2 | 6 | 14.45 | 0.14 | 101.17 | 6.66 | 412.65 |
| 13 | isabel | 71 | 1 | 2 | 1 | 11.93 | 0.01 | 1,169.31 | 10.19 | 395.76 |
| 14 | grandma | 73 | 15 | 2 | 5 | 12.27 | 0.15 | 80.15 | 6.32 | 339.03 |
| 15 | was | 1,102 | 9,931 | 15 | 466 | 185.17 | 101.32 | 1.83 | 0.87 | 307.65 |
| 16 | fenella | 49 | 0 | 1 | 0 | 8.23 | 0.00 | 1,613.98 | 10.66 | 280.32 |
| 17 | dear | 78 | 54 | 13 | 36 | 13.11 | 0.55 | 23.79 | 4.57 | 273.99 |
| 18 | beryl | 50 | 3 | 1 | 3 | 8.40 | 0.03 | 274.49 | 8.10 | 263.34 |
| 19 | hammond | 47 | 2 | 1 | 2 | 7.90 | 0.02 | 387.02 | 8.60 | 252.40 |
| 20 | yes | 87 | 109 | 14 | 74 | 14.62 | 1.11 | 13.15 | 3.72 | 241.33 |

Report based on word tokens
Frequent tokens with apostrophes have been normalized in reference corpus to match target usage
Keywords filtered based on p-value 0.0001 with Bonferroni correction (based on 5392 tests)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 59,514
Total word tokens in reference corpus: 980,144
Keywords: 337
Showing 20 rows
Page 1 of 17
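The report footer shows that this report runs to 17 pages. Later pages can be requested by changing page_current in the same call (or page_size can be set to 0 to return all rows), for example:

```python
# same report as above, requesting the second page of results
keyness.keywords(show_document_frequency = True, statistical_significance_cut = 0.0001, apply_bonferroni = True, order_descending = True, page_current = 2).display()
```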