# load the target corpus
gardenparty = Corpus().load(path_to_gardenparty_corpus)
keyness
About Conc’s Keyness functionality
Conc implements Log Ratio, a keyness measure introduced by Andrew Hardie in this “informal introduction”. In that piece Hardie also discusses Relative Risk, which is implemented in Conc as well. The Log Likelihood implementation is based on the discussion accompanying Paul Rayson’s Log-likelihood and effect size calculator, which covers a variety of keyness measures and the practical issues involved in implementing them. Conc follows the approach mentioned by Rayson of substituting an observed frequency of 0.5 for words that do not appear in the target or reference corpus, to avoid division-by-zero issues. Conc also follows Rayson’s handling of zero values when calculating log likelihood (see Note 2 on that page).
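To make these calculations concrete, the sketch below shows how the three measures can be computed, including the 0.5 substitution for zero frequencies and the handling of zero values in the log likelihood sum. It is a minimal illustration of the formulas described by Hardie and Rayson, not Conc’s internal implementation, and the function names are hypothetical.

```python
import math

# Illustrative sketch only (hypothetical function names, not Conc's internals).

def log_ratio(freq, total, freq_ref, total_ref):
    # Substitute 0.5 for a zero observed frequency to avoid division by zero.
    f = freq if freq > 0 else 0.5
    f_ref = freq_ref if freq_ref > 0 else 0.5
    return math.log2((f / total) / (f_ref / total_ref))

def relative_risk(freq, total, freq_ref, total_ref):
    # Ratio of relative frequencies, with the same 0.5 substitution.
    f = freq if freq > 0 else 0.5
    f_ref = freq_ref if freq_ref > 0 else 0.5
    return (f / total) / (f_ref / total_ref)

def log_likelihood(freq, total, freq_ref, total_ref):
    # Expected frequencies under the assumption that the token is
    # (proportionally) equally frequent in both corpora.
    expected = total * (freq + freq_ref) / (total + total_ref)
    expected_ref = total_ref * (freq + freq_ref) / (total + total_ref)
    ll = 0.0
    for observed, exp in ((freq, expected), (freq_ref, expected_ref)):
        if observed > 0:  # zero-frequency terms contribute nothing to the sum
            ll += observed * math.log(observed / exp)
    return 2 * ll

# Roughly reproduces the 'josephine' row of the example report below
# (117 occurrences in 59,514 target tokens, 0 in 980,144 reference tokens):
# log_ratio(117, 59514, 0, 980144)       -> ~11.91
# relative_risk(117, 59514, 0, 980144)   -> ~3853.8
# log_likelihood(117, 59514, 0, 980144)  -> ~669.3
```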
Using the Keyness class
There are examples below showing how to use the Keyness class directly to output keyword tables. The recommended way to access this functionality, however, is through the Conc class, which provides an interface for creating frequency lists, concordances, collocation tables, keyword tables and more.
Keyness class API reference
Keyness
Keyness (corpus:conc.corpus.Corpus|conc.listcorpus.ListCorpus, reference_corpus:conc.corpus.Corpus|conc.listcorpus.ListCorpus)
Class for keyness analysis reporting.
|  | Type | Details |
|---|---|---|
| corpus | conc.corpus.Corpus \| conc.listcorpus.ListCorpus | Corpus instance |
| reference_corpus | conc.corpus.Corpus \| conc.listcorpus.ListCorpus | Corpus for comparison |
Keyness.keywords
Keyness.keywords (effect_size_measure:str='log_ratio', statistical_significance_measure:str='log_likelihood', order:str|None=None, order_descending:bool=True, statistical_significance_cut:float|None=None, apply_bonferroni:bool=False, min_document_frequency:int=0, min_document_frequency_reference:int=0, min_frequency:int=0, min_frequency_reference:int=0, case_sensitive:bool=False, normalize_by:int=10000, page_size:int=20, page_current:int=1, show_document_frequency:bool=False, exclude_tokens:list[str]=[], exclude_tokens_text:str='', restrict_tokens:list[str]=[], restrict_tokens_text:str='', exclude_punctuation:bool=True, handle_common_typographic_differences:bool=True, exclude_negative_keywords:bool=True)
Get keywords for the corpus.
|  | Type | Default | Details |
|---|---|---|---|
| effect_size_measure | str | log_ratio | effect size measure to use, currently only ‘log_ratio’ is supported |
| statistical_significance_measure | str | log_likelihood | statistical significance measure to use, currently only ‘log_likelihood’ is supported |
| order | str \| None | None | default of None orders by the statistical significance measure; results can also be ordered by: frequency, frequency_reference, document_frequency, document_frequency_reference, log_likelihood |
| order_descending | bool | True | whether ordering is descending (True) or ascending (False) |
| statistical_significance_cut | float \| None | None | statistical significance p-value to filter results, e.g. 0.05, 0.01 or 0.001; ignored if None or 0 |
| apply_bonferroni | bool | False | apply Bonferroni correction to the statistical significance cut-off |
| min_document_frequency | int | 0 | minimum document frequency in the target corpus for a token to be included in the report |
| min_document_frequency_reference | int | 0 | minimum document frequency in the reference corpus for a token to be included in the report |
| min_frequency | int | 0 | minimum frequency in the target corpus for a token to be included in the report |
| min_frequency_reference | int | 0 | minimum frequency in the reference corpus for a token to be included in the report |
| case_sensitive | bool | False | whether token frequencies are computed with case preserved |
| normalize_by | int | 10000 | normalize frequencies by a number (e.g. 10000) |
| page_size | int | 20 | number of rows to return; if 0, returns all |
| page_current | int | 1 | current page, ignored if page_size is 0 |
| show_document_frequency | bool | False | show document frequency in output |
| exclude_tokens | list | [] | exclude specific tokens from report results |
| exclude_tokens_text | str | '' | text to explain which tokens have been excluded, added to the report notes |
| restrict_tokens | list | [] | restrict report to return results for a list of specific tokens |
| restrict_tokens_text | str | '' | text to explain which tokens are included, added to the report notes |
| exclude_punctuation | bool | True | exclude punctuation tokens |
| handle_common_typographic_differences | bool | True | whether to detect and normalize common differences in word tokens due to typographic differences (currently focused on apostrophes in common English contractions); ignored when exclude_punctuation is False |
| exclude_negative_keywords | bool | True | whether to exclude negative keywords from the report |
| **Returns** | Result |  | a Result object with the keyword table |
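To illustrate how some of these parameters combine, a call along the following lines restricts the report to tokens that occur at least five times in the target corpus and orders rows by target-corpus frequency, returning all rows rather than a single page. This is an illustrative sketch that assumes a Keyness instance named `keyness`, created as in the Examples section below.

```python
# assumes `keyness` has been created as in the Examples section below
keyness.keywords(
    min_frequency = 5,        # only tokens with at least 5 occurrences in the target corpus
    order = 'frequency',      # order by frequency in the target corpus
    order_descending = True,
    page_size = 0             # return all rows rather than a single page
).display()
```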
Examples
See the note above about accessing this functionality through the Conc class.
# load the reference corpus
brown = Corpus().load(path_to_brown_corpus)
# instantiate the Keyness class
keyness = Keyness(corpus = gardenparty, reference_corpus = brown)
# generate and display the keywords report
keyness.keywords(show_document_frequency = True, statistical_significance_cut = 0.0001, apply_bonferroni = True, order_descending = True, page_current = 1).display()
Keywords
Target corpus: Garden Party Corpus, Reference corpus: Brown Corpus

| Rank | Token | Frequency | Frequency Reference | Document Frequency | Document Frequency Reference | Normalized Frequency | Normalized Frequency Reference | Relative Risk | Log Ratio | Log Likelihood |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | she | 1,171 | 2,060 | 15 | 216 | 196.76 | 21.02 | 9.36 | 3.23 | 2,710.68 |
| 2 | her | 937 | 2,887 | 15 | 253 | 157.44 | 29.45 | 5.35 | 2.42 | 1,442.33 |
| 3 | josephine | 117 | 0 | 1 | 0 | 19.66 | 0.00 | 3,853.78 | 11.91 | 669.34 |
| 4 | said | 514 | 1,944 | 15 | 315 | 86.37 | 19.83 | 4.35 | 2.12 | 648.89 |
| 5 | n’t | 522 | 2,016 | 15 | 286 | 87.71 | 20.57 | 4.26 | 2.09 | 644.51 |
| 6 | you | 642 | 3,265 | 15 | 293 | 107.87 | 33.31 | 3.24 | 1.70 | 566.70 |
| 7 | it | 1,021 | 6,991 | 15 | 498 | 171.56 | 71.33 | 2.41 | 1.27 | 552.39 |
| 8 | oh | 149 | 93 | 15 | 62 | 25.04 | 0.95 | 26.39 | 4.72 | 540.97 |
| 9 | little | 307 | 823 | 15 | 322 | 51.58 | 8.40 | 6.14 | 2.62 | 531.41 |
| 10 | constantia | 91 | 0 | 1 | 0 | 15.29 | 0.00 | 2,997.38 | 11.55 | 520.60 |
| 11 | i | 719 | 4,370 | 15 | 335 | 120.81 | 44.59 | 2.71 | 1.44 | 483.12 |
| 12 | laura | 86 | 14 | 2 | 6 | 14.45 | 0.14 | 101.17 | 6.66 | 412.65 |
| 13 | isabel | 71 | 1 | 2 | 1 | 11.93 | 0.01 | 1,169.31 | 10.19 | 395.76 |
| 14 | grandma | 73 | 15 | 2 | 5 | 12.27 | 0.15 | 80.15 | 6.32 | 339.03 |
| 15 | was | 1,102 | 9,931 | 15 | 466 | 185.17 | 101.32 | 1.83 | 0.87 | 307.65 |
| 16 | fenella | 49 | 0 | 1 | 0 | 8.23 | 0.00 | 1,613.98 | 10.66 | 280.32 |
| 17 | dear | 78 | 54 | 13 | 36 | 13.11 | 0.55 | 23.79 | 4.57 | 273.99 |
| 18 | beryl | 50 | 3 | 1 | 3 | 8.40 | 0.03 | 274.49 | 8.10 | 263.34 |
| 19 | hammond | 47 | 2 | 1 | 2 | 7.90 | 0.02 | 387.02 | 8.60 | 252.40 |
| 20 | yes | 87 | 109 | 14 | 74 | 14.62 | 1.11 | 13.15 | 3.72 | 241.33 |

Report based on word tokens
Frequent tokens with apostrophes have been normalized in reference corpus to match target usage
Keywords filtered based on p-value 0.0001 with Bonferroni correction (based on 5392 tests)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 59,514
Total word tokens in reference corpus: 980,144
Keywords: 337
Showing 20 rows
Page 1 of 17
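The report footer shows that this report runs to 17 pages. Later pages can be requested by changing page_current in the same call (or page_size can be set to 0 to return all rows), for example:

```python
# same report as above, requesting the second page of results
keyness.keywords(show_document_frequency = True, statistical_significance_cut = 0.0001, apply_bonferroni = True, order_descending = True, page_current = 2).display()
```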