```python
from conc.corpus import Corpus
from conc.keyness import Keyness

# load the target corpus
gardenparty = Corpus().load(path_to_gardenparty_corpus)
```
# keyness

Functionality for keyness analysis.

## Keyness

```python
Keyness(corpus: conc.corpus.Corpus | conc.listcorpus.ListCorpus,
        reference_corpus: conc.corpus.Corpus | conc.listcorpus.ListCorpus)
```

Class for keyness analysis reporting.
|  | Type | Details |
|---|---|---|
| corpus | conc.corpus.Corpus \| conc.listcorpus.ListCorpus | Corpus instance |
| reference_corpus | conc.corpus.Corpus \| conc.listcorpus.ListCorpus | Corpus for comparison |
### Keyness.keywords

```python
Keyness.keywords(effect_size_measure: str = 'log_ratio',
                 statistical_significance_measure: str = 'log_likelihood',
                 order: str | None = None,
                 order_descending: bool = True,
                 statistical_significance_cut: float | None = None,
                 apply_bonferroni: bool = False,
                 min_document_frequency: int = 0,
                 min_document_frequency_reference: int = 0,
                 min_frequency: int = 0,
                 min_frequency_reference: int = 0,
                 case_sensitive: bool = False,
                 normalize_by: int = 10000,
                 page_size: int = 20,
                 page_current: int = 1,
                 show_document_frequency: bool = False,
                 exclude_tokens: list[str] = [],
                 exclude_tokens_text: str = '',
                 restrict_tokens: list[str] = [],
                 restrict_tokens_text: str = '',
                 exclude_punctuation: bool = True,
                 handle_common_typographic_differences: bool = True,
                 exclude_negative_keywords: bool = True)
```

Get keywords for the corpus.
|  | Type | Default | Details |
|---|---|---|---|
| effect_size_measure | str | log_ratio | effect size measure to use, currently only 'log_ratio' is supported |
| statistical_significance_measure | str | log_likelihood | statistical significance measure to use, currently only 'log_likelihood' is supported |
| order | str \| None | None | default of None orders by the statistical significance measure; results can also be ordered by: frequency, frequency_reference, document_frequency, document_frequency_reference, log_likelihood |
| order_descending | bool | True | order is descending or ascending |
| statistical_significance_cut | float \| None | None | statistical significance p-value to filter results, e.g. 0.05, 0.01 or 0.001; ignored if None or 0 |
| apply_bonferroni | bool | False | apply Bonferroni correction to the statistical significance cut-off |
| min_document_frequency | int | 0 | minimum document frequency in target corpus for a token to be included in the report |
| min_document_frequency_reference | int | 0 | minimum document frequency in reference corpus for a token to be included in the report |
| min_frequency | int | 0 | minimum frequency in target corpus for a token to be included in the report |
| min_frequency_reference | int | 0 | minimum frequency in reference corpus for a token to be included in the report |
| case_sensitive | bool | False | whether token frequencies are computed with case preserved |
| normalize_by | int | 10000 | normalize frequencies by a number (e.g. 10000) |
| page_size | int | 20 | number of rows to return; if 0, returns all |
| page_current | int | 1 | current page, ignored if page_size is 0 |
| show_document_frequency | bool | False | show document frequency in output |
| exclude_tokens | list | [] | exclude specific tokens from report results |
| exclude_tokens_text | str | '' | text to explain which tokens have been excluded, will be added to the report notes |
| restrict_tokens | list | [] | restrict report to return results for a list of specific tokens |
| restrict_tokens_text | str | '' | text to explain which tokens are included, will be added to the report notes |
| exclude_punctuation | bool | True | exclude punctuation tokens |
| handle_common_typographic_differences | bool | True | whether to detect and normalize common differences in word tokens due to typographic differences (currently focused on apostrophes in common English contractions); ignored when exclude_punctuation is False |
| exclude_negative_keywords | bool | True | whether to exclude negative keywords from the report |
| **Returns** | Result |  | a Result object with the keywords table |
```python
# load the reference corpus
brown = Corpus().load(path_to_brown_corpus)
```

```python
# instantiate the Keyness class
keyness = Keyness(corpus = gardenparty, reference_corpus = brown)
```

```python
# generate and display the keywords report
keyness.keywords(show_document_frequency = True,
                 statistical_significance_cut = 0.0001,
                 apply_bonferroni = True,
                 order_descending = True,
                 page_current = 1).display()
```
**Keywords**
Target corpus: Garden Party Corpus, Reference corpus: Brown Corpus

| Rank | Token | Frequency | Frequency Reference | Document Frequency | Document Frequency Reference | Normalized Frequency | Normalized Frequency Reference | Relative Risk | Log Ratio | Log Likelihood |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | she | 1,171 | 2,060 | 15 | 216 | 196.76 | 21.02 | 9.36 | 3.23 | 2,710.68 |
| 2 | her | 937 | 2,887 | 15 | 253 | 157.44 | 29.45 | 5.35 | 2.42 | 1,442.33 |
| 3 | josephine | 117 | 0 | 1 | 0 | 19.66 | 0.00 | 3,853.78 | 11.91 | 669.34 |
| 4 | said | 514 | 1,944 | 15 | 315 | 86.37 | 19.83 | 4.35 | 2.12 | 648.89 |
| 5 | n’t | 522 | 2,016 | 15 | 286 | 87.71 | 20.57 | 4.26 | 2.09 | 644.51 |
| 6 | you | 642 | 3,265 | 15 | 293 | 107.87 | 33.31 | 3.24 | 1.70 | 566.70 |
| 7 | it | 1,021 | 6,991 | 15 | 498 | 171.56 | 71.33 | 2.41 | 1.27 | 552.39 |
| 8 | oh | 149 | 93 | 15 | 62 | 25.04 | 0.95 | 26.39 | 4.72 | 540.97 |
| 9 | little | 307 | 823 | 15 | 322 | 51.58 | 8.40 | 6.14 | 2.62 | 531.41 |
| 10 | constantia | 91 | 0 | 1 | 0 | 15.29 | 0.00 | 2,997.38 | 11.55 | 520.60 |
| 11 | i | 719 | 4,370 | 15 | 335 | 120.81 | 44.59 | 2.71 | 1.44 | 483.12 |
| 12 | laura | 86 | 14 | 2 | 6 | 14.45 | 0.14 | 101.17 | 6.66 | 412.65 |
| 13 | isabel | 71 | 1 | 2 | 1 | 11.93 | 0.01 | 1,169.31 | 10.19 | 395.76 |
| 14 | grandma | 73 | 15 | 2 | 5 | 12.27 | 0.15 | 80.15 | 6.32 | 339.03 |
| 15 | was | 1,102 | 9,931 | 15 | 466 | 185.17 | 101.32 | 1.83 | 0.87 | 307.65 |
| 16 | fenella | 49 | 0 | 1 | 0 | 8.23 | 0.00 | 1,613.98 | 10.66 | 280.32 |
| 17 | dear | 78 | 54 | 13 | 36 | 13.11 | 0.55 | 23.79 | 4.57 | 273.99 |
| 18 | beryl | 50 | 3 | 1 | 3 | 8.40 | 0.03 | 274.49 | 8.10 | 263.34 |
| 19 | hammond | 47 | 2 | 1 | 2 | 7.90 | 0.02 | 387.02 | 8.60 | 252.40 |
| 20 | yes | 87 | 109 | 14 | 74 | 14.62 | 1.11 | 13.15 | 3.72 | 241.33 |

Report based on word tokens
Frequent tokens with apostrophes have been normalized in reference corpus to match target usage
Keywords filtered based on p-value 0.0001 with Bonferroni correction (based on 5392 tests)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 59,514
Total word tokens in reference corpus: 980,144
Keywords: 337
Showing 20 rows
Page 1 of 17
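
To relate the columns of the report to one another, the sketch below reproduces the figures for the top keyword 'she' from the corpus totals given in the report notes, using the standard normalized-frequency, relative-risk, binary log-ratio and Dunning log-likelihood formulations common in keyness analysis. It is illustrative only and not necessarily conc's exact implementation; in particular, tokens with a reference frequency of zero (such as 'josephine') evidently receive some adjustment before the ratio-based measures are computed.

```python
import math

# figures taken from the report above for the token 'she'
freq_target, freq_reference = 1171, 2060
tokens_target, tokens_reference = 59_514, 980_144  # total word tokens in each corpus

# normalized frequency per 10,000 tokens (normalize_by = 10000)
norm_target = freq_target / tokens_target * 10_000           # ~196.76
norm_reference = freq_reference / tokens_reference * 10_000  # ~21.02

# relative risk is the ratio of normalized frequencies; log ratio is its base-2 log
relative_risk = norm_target / norm_reference  # ~9.36
log_ratio = math.log2(relative_risk)          # ~3.23

# Dunning log-likelihood (G2), comparing observed and expected frequencies
total = freq_target + freq_reference
expected_target = total * tokens_target / (tokens_target + tokens_reference)
expected_reference = total * tokens_reference / (tokens_target + tokens_reference)
log_likelihood = 2 * (freq_target * math.log(freq_target / expected_target)
                      + freq_reference * math.log(freq_reference / expected_reference))  # ~2710.68

# with apply_bonferroni = True, the p-value cut-off is divided by the number of tests
adjusted_cut = 0.0001 / 5392  # 5,392 tests, as stated in the report notes
```

Under these assumptions the computed values (196.76, 21.02, 9.36, 3.23 and 2,710.68) match the first row of the report above.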
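
The report above uses mostly default filtering. The other parameters documented for Keyness.keywords can be combined to narrow or reorder the report; the parameter values and excluded tokens in this sketch are illustrative only, reusing the keyness instance created above.

```python
# illustrative settings: require a minimum frequency and document frequency in the
# target corpus, drop some character names, and order by frequency in the target corpus
keyness.keywords(
    min_frequency = 5,
    min_document_frequency = 3,
    exclude_tokens = ['josephine', 'constantia', 'laura'],
    exclude_tokens_text = 'Character names excluded',
    order = 'frequency',
    order_descending = True,
    page_size = 10,
    page_current = 1,
).display()
```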