# frequency

Functionality for frequency analysis.

```python
# load the corpus
from conc.corpus import Corpus

brown = Corpus().load(path_to_brown_corpus)
```

## Using the Frequency class

The examples below show how to use the Frequency class directly to create frequency lists for a corpus, along with sample output. Note that the recommended way to access this functionality is through the Conc class, which provides a single interface for creating frequency lists, concordances, collocation tables, keyword tables and more.
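Before looking at the API, it may help to see the core computation a frequency list performs. The sketch below is a plain-Python illustration of the concept (counting, optional case folding, optional punctuation filtering, ranking), not conc's actual implementation:

```python
from collections import Counter

def frequency_list(tokens, case_sensitive=False, exclude_punctuation=True):
    """Rank tokens by raw frequency: (rank, token, frequency) rows."""
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
    if exclude_punctuation:
        # keep only tokens containing at least one alphanumeric character
        tokens = [t for t in tokens if any(c.isalnum() for c in t)]
    counts = Counter(tokens)
    # most_common() yields (token, frequency) pairs sorted by descending frequency
    return [(rank, tok, freq)
            for rank, (tok, freq) in enumerate(counts.most_common(), start=1)]

rows = frequency_list("The cat sat on the mat . The mat sat .".split())
```

With the toy sentence above, "the" ranks first with a frequency of 3 and the punctuation token "." is dropped.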
## Frequency class API reference

### Frequency

```python
Frequency (corpus:conc.corpus.Corpus|conc.listcorpus.ListCorpus)
```

Class for frequency analysis reporting.

| | Type | Details |
|---|---|---|
| corpus | conc.corpus.Corpus \| conc.listcorpus.ListCorpus | Corpus instance |
### Frequency.frequencies

```python
Frequency.frequencies (case_sensitive:bool=False, normalize_by:int=10000,
                       page_size:int=20, page_current:int=1,
                       show_token_id:bool=False,
                       show_document_frequency:bool=False,
                       exclude_tokens:list[str]=[], exclude_tokens_text:str='',
                       restrict_tokens:list[str]=[], restrict_tokens_text:str='',
                       exclude_punctuation:bool=True)
```

Report frequent tokens.

| | Type | Default | Details |
|---|---|---|---|
| case_sensitive | bool | False | frequencies for tokens with or without case preserved |
| normalize_by | int | 10000 | normalize frequencies by a number (e.g. 10000) |
| page_size | int | 20 | number of rows to return; if 0, returns all |
| page_current | int | 1 | current page, ignored if page_size is 0 |
| show_token_id | bool | False | show token_id in output |
| show_document_frequency | bool | False | show document frequency in output |
| exclude_tokens | list | [] | exclude specific tokens from the frequency report, e.g. to remove stopwords |
| exclude_tokens_text | str | '' | text explaining which tokens have been excluded, added to the report notes |
| restrict_tokens | list | [] | restrict the frequency report to a specific list of tokens |
| restrict_tokens_text | str | '' | text explaining which tokens are included, added to the report notes |
| exclude_punctuation | bool | True | exclude punctuation tokens |
| **Returns** | Result | | a Result object with the frequency table |
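The normalized frequency column reported by `frequencies` is the raw count scaled to a rate per `normalize_by` tokens. A minimal sketch of that arithmetic (not conc's internals), using the Brown Corpus figures from the example output below:

```python
def normalized_frequency(count, total_tokens, normalize_by=10000):
    # frequency per `normalize_by` tokens, rounded to two decimal places
    return round(count / total_tokens * normalize_by, 2)

# "the": 63,516 occurrences in 980,144 word tokens
normalized_frequency(63516, 980144)  # 648.03, matching the report
```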
## Examples

See the note above about accessing this functionality through the Conc class.

```python
# instantiate the Frequency class
freq_brown = Frequency(brown)

# run the frequencies method and display the results
freq_brown.frequencies(normalize_by=10000, page_size=20).display()
```
**Frequencies** (word tokens, Brown Corpus)

Rank | Token | Frequency | Normalized Frequency
---|---|---|---
1 | the | 63,516 | 648.03 |
2 | of | 36,321 | 370.57 |
3 | and | 27,787 | 283.50 |
4 | to | 25,868 | 263.92 |
5 | a | 22,190 | 226.40 |
6 | in | 19,751 | 201.51 |
7 | that | 10,409 | 106.20 |
8 | is | 10,138 | 103.43 |
9 | was | 9,931 | 101.32 |
10 | for | 8,905 | 90.85 |
11 | with | 7,043 | 71.86 |
12 | it | 6,991 | 71.33 |
13 | he | 6,772 | 69.09 |
14 | as | 6,738 | 68.75 |
15 | his | 6,523 | 66.55 |
16 | on | 6,459 | 65.90 |
17 | be | 6,365 | 64.94 |
18 | 's | 5,285 | 53.92 |
19 | had | 5,200 | 53.05 |
20 | by | 5,156 | 52.60 |
Report based on word tokens. Normalized Frequency is per 10,000 tokens. Total word tokens: 980,144. Unique word tokens: 42,907. Showing 20 rows. Page 1 of 2146.
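The "Page 1 of 2146" note follows from paging the 42,907 unique tokens 20 rows at a time. A one-line sketch of that calculation (an illustration, not conc's code):

```python
import math

def page_count(unique_tokens, page_size=20):
    # total pages needed to show every unique token, page_size rows per page
    return math.ceil(unique_tokens / page_size)

page_count(42907)  # 2146, as shown in the report footer
```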
```python
# build a stop word list and exclude those tokens from the frequency report
from conc.core import get_stop_words

stop_words = get_stop_words(save_path, spacy_model='en_core_web_sm')
freq_brown.frequencies(normalize_by=10000, show_document_frequency=True, exclude_tokens=stop_words, page_size=20).display()
```
**Frequencies** (word tokens, Brown Corpus)

Rank | Token | Frequency | Document Frequency | Normalized Frequency
---|---|---|---|---
1 | said | 1,944 | 315 | 19.83 |
2 | time | 1,667 | 450 | 17.01 |
3 | new | 1,595 | 390 | 16.27 |
4 | man | 1,346 | 326 | 13.73 |
5 | like | 1,287 | 366 | 13.13 |
6 | af | 989 | 49 | 10.09 |
7 | years | 953 | 346 | 9.72 |
8 | way | 925 | 365 | 9.44 |
9 | state | 883 | 200 | 9.01 |
10 | long | 863 | 354 | 8.80 |
11 | people | 851 | 286 | 8.68 |
12 | world | 848 | 274 | 8.65 |
13 | year | 831 | 242 | 8.48 |
14 | little | 823 | 322 | 8.40 |
15 | good | 813 | 320 | 8.29 |
16 | men | 772 | 248 | 7.88 |
17 | work | 767 | 310 | 7.83 |
18 | day | 767 | 311 | 7.83 |
19 | old | 734 | 278 | 7.49 |
20 | life | 728 | 284 | 7.43 |
Report based on word tokens. Tokens excluded from report: 306. Normalized Frequency is per 10,000 tokens. Total word tokens: 980,144. Unique word tokens: 42,601. Showing 20 rows. Page 1 of 2131.
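The Document Frequency column counts the number of corpus documents a token occurs in at least once, rather than its total occurrences, which is why "said" can have a frequency of 1,944 but a document frequency of only 315. A plain-Python sketch of the concept (not conc's implementation):

```python
from collections import Counter

def document_frequency(documents):
    """Count the number of documents each token occurs in at least once."""
    df = Counter()
    for doc_tokens in documents:
        # set() ensures each document contributes at most 1 per token
        df.update(set(t.lower() for t in doc_tokens))
    return df

docs = [["said", "the", "man"], ["the", "man", "said", "said"], ["time"]]
df = document_frequency(docs)
# "said" occurs 3 times overall but in only 2 documents, so df["said"] == 2
```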