keyness

Functionality for keyness analysis.


Keyness

 Keyness (corpus:conc.corpus.Corpus|conc.listcorpus.ListCorpus,
          reference_corpus:conc.corpus.Corpus|conc.listcorpus.ListCorpus)

Class for keyness analysis reporting.

corpus (conc.corpus.Corpus | conc.listcorpus.ListCorpus): Corpus instance
reference_corpus (conc.corpus.Corpus | conc.listcorpus.ListCorpus): Corpus for comparison


Keyness.keywords

 Keyness.keywords (effect_size_measure:str='log_ratio',
                   statistical_significance_measure:str='log_likelihood',
                   order:str|None=None, order_descending:bool=True,
                   statistical_significance_cut:float|None=None,
                   apply_bonferroni:bool=False,
                   min_document_frequency:int=0,
                   min_document_frequency_reference:int=0,
                   min_frequency:int=0, min_frequency_reference:int=0,
                   case_sensitive:bool=False, normalize_by:int=10000,
                   page_size:int=20, page_current:int=1,
                   show_document_frequency:bool=False,
                   exclude_tokens:list[str]=[],
                   exclude_tokens_text:str='',
                   restrict_tokens:list[str]=[],
                   restrict_tokens_text:str='',
                   exclude_punctuation:bool=True,
                   handle_common_typographic_differences:bool=True,
                   exclude_negative_keywords:bool=True)

Get keywords for the corpus.

effect_size_measure (str, default 'log_ratio'): effect size measure to use; currently only 'log_ratio' is supported
statistical_significance_measure (str, default 'log_likelihood'): statistical significance measure to use; currently only 'log_likelihood' is supported
order (str | None, default None): the default of None orders by the statistical significance measure; results can also be ordered by: frequency, frequency_reference, document_frequency, document_frequency_reference, log_likelihood
order_descending (bool, default True): whether ordering is descending or ascending
statistical_significance_cut (float | None, default None): statistical significance p-value used to filter results, e.g. 0.05, 0.01 or 0.001; ignored if None or 0
apply_bonferroni (bool, default False): apply a Bonferroni correction to the statistical significance cut-off (see the sketch after this table)
min_document_frequency (int, default 0): minimum document frequency in the target corpus for a token to be included in the report
min_document_frequency_reference (int, default 0): minimum document frequency in the reference corpus for a token to be included in the report
min_frequency (int, default 0): minimum frequency in the target corpus for a token to be included in the report
min_frequency_reference (int, default 0): minimum frequency in the reference corpus for a token to be included in the report
case_sensitive (bool, default False): report frequencies for tokens with or without case preserved
normalize_by (int, default 10000): normalize frequencies by this number (e.g. 10000)
page_size (int, default 20): number of rows to return; if 0, all rows are returned
page_current (int, default 1): current page; ignored if page_size is 0
show_document_frequency (bool, default False): show document frequency in the output
exclude_tokens (list[str], default []): exclude specific tokens from report results
exclude_tokens_text (str, default ''): text explaining which tokens have been excluded, added to the report notes
restrict_tokens (list[str], default []): restrict the report to results for a list of specific tokens
restrict_tokens_text (str, default ''): text explaining which tokens are included, added to the report notes
exclude_punctuation (bool, default True): exclude punctuation tokens
handle_common_typographic_differences (bool, default True): whether to detect and normalize common differences in word tokens due to typographic differences (currently focused on apostrophes in common English contractions); ignored when exclude_punctuation is False
exclude_negative_keywords (bool, default True): whether to exclude negative keywords from the report
Returns (Result): a Result object with the keywords table
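When apply_bonferroni is True, the statistical_significance_cut is tightened to account for the number of comparisons made. A minimal sketch of a standard Bonferroni correction follows (an illustration of the arithmetic, not necessarily conc's exact implementation), using the 5,392 tests reported in the example keywords report further below:

# Bonferroni correction: divide the requested p-value cut-off by the number of tests
statistical_significance_cut = 0.0001
number_of_tests = 5392
corrected_cut = statistical_significance_cut / number_of_tests  # ~1.85e-08

The example below loads the Garden Party Corpus as the target and the Brown Corpus as the reference, instantiates Keyness, and displays the first page of the keywords report: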
from conc.corpus import Corpus
from conc.keyness import Keyness

# load the target corpus
gardenparty = Corpus().load(path_to_gardenparty_corpus)
# load the reference corpus
brown = Corpus().load(path_to_brown_corpus)
# instantiate the Keyness class with the target and reference corpora
keyness = Keyness(corpus = gardenparty, reference_corpus = brown)
# generate and display the keywords report
keyness.keywords(show_document_frequency = True,
                 statistical_significance_cut = 0.0001,
                 apply_bonferroni = True,
                 order_descending = True,
                 page_current = 1).display()
Keywords
Target corpus: Garden Party Corpus, Reference corpus: Brown Corpus
Rank Token Frequency Frequency Reference Document Frequency Document Frequency Reference Normalized Frequency Normalized Frequency Reference Relative Risk Log Ratio Log Likelihood
1 she 1,171 2,060 15 216 196.76 21.02 9.36 3.23 2,710.68
2 her 937 2,887 15 253 157.44 29.45 5.35 2.42 1,442.33
3 josephine 117 0 1 0 19.66 0.00 3,853.78 11.91 669.34
4 said 514 1,944 15 315 86.37 19.83 4.35 2.12 648.89
5 n’t 522 2,016 15 286 87.71 20.57 4.26 2.09 644.51
6 you 642 3,265 15 293 107.87 33.31 3.24 1.70 566.70
7 it 1,021 6,991 15 498 171.56 71.33 2.41 1.27 552.39
8 oh 149 93 15 62 25.04 0.95 26.39 4.72 540.97
9 little 307 823 15 322 51.58 8.40 6.14 2.62 531.41
10 constantia 91 0 1 0 15.29 0.00 2,997.38 11.55 520.60
11 i 719 4,370 15 335 120.81 44.59 2.71 1.44 483.12
12 laura 86 14 2 6 14.45 0.14 101.17 6.66 412.65
13 isabel 71 1 2 1 11.93 0.01 1,169.31 10.19 395.76
14 grandma 73 15 2 5 12.27 0.15 80.15 6.32 339.03
15 was 1,102 9,931 15 466 185.17 101.32 1.83 0.87 307.65
16 fenella 49 0 1 0 8.23 0.00 1,613.98 10.66 280.32
17 dear 78 54 13 36 13.11 0.55 23.79 4.57 273.99
18 beryl 50 3 1 3 8.40 0.03 274.49 8.10 263.34
19 hammond 47 2 1 2 7.90 0.02 387.02 8.60 252.40
20 yes 87 109 14 74 14.62 1.11 13.15 3.72 241.33
Report based on word tokens
Frequent tokens with apostrophes have been normalized in reference corpus to match target usage
Keywords filtered based on p-value 0.0001 with Bonferroni correction (based on 5392 tests)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 59,514
Total word tokens in reference corpus: 980,144
Keywords: 337
Showing 20 rows
Page 1 of 17
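The effect size and significance columns in the report follow standard keyness definitions: Relative Risk is the ratio of the normalized frequencies, Log Ratio is the base-2 logarithm of the relative risk, and Log Likelihood is the Dunning log-likelihood (G2) statistic. The sketch below reproduces the report's figures for the keyword "she" under those assumptions; it illustrates the arithmetic rather than conc's implementation.

import math

# figures for "she" taken from the keywords report above
frequency, frequency_reference = 1171, 2060
total_tokens, total_tokens_reference = 59514, 980144
normalize_by = 10000

# normalized frequency per 10,000 tokens (196.76 and 21.02 in the report)
normalized = frequency / total_tokens * normalize_by
normalized_reference = frequency_reference / total_tokens_reference * normalize_by

# relative risk (9.36) and log ratio (3.23), the base-2 log of the relative risk
relative_risk = normalized / normalized_reference
log_ratio = math.log2(relative_risk)

# log likelihood (2,710.68): Dunning's G2 from observed and expected frequencies
expected = total_tokens * (frequency + frequency_reference) / (total_tokens + total_tokens_reference)
expected_reference = total_tokens_reference * (frequency + frequency_reference) / (total_tokens + total_tokens_reference)
log_likelihood = 2 * (frequency * math.log(frequency / expected)
                      + frequency_reference * math.log(frequency_reference / expected_reference))

Tokens that do not occur in the reference corpus (such as the character names in the report) need additional handling, which this sketch omits.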