Frequency

Frequency (corpus:conc.corpus.Corpus)

Class for frequency analysis reporting.

|        | Type   | Details         |
|--------|--------|-----------------|
| corpus | Corpus | Corpus instance |
Frequency.frequencies
Frequency.frequencies (case_sensitive:bool=False, normalize_by:int=10000,
page_size:int=20, page_current:int=1,
show_token_id:bool=False,
show_document_frequency:bool=False,
exclude_tokens:list[str]=[],
exclude_tokens_text:str='',
restrict_tokens:list[str]=[],
restrict_tokens_text:str='',
exclude_punctuation:bool=True)
Report frequent tokens.
|                         | Type   | Default | Details |
|-------------------------|--------|---------|---------|
| case_sensitive          | bool   | False   | report frequencies with or without case preserved |
| normalize_by            | int    | 10000   | normalize frequencies by a number (e.g. 10000) |
| page_size               | int    | 20      | number of rows to return; if 0, returns all |
| page_current            | int    | 1       | current page, ignored if page_size is 0 |
| show_token_id           | bool   | False   | show token_id in output |
| show_document_frequency | bool   | False   | show document frequency in output |
| exclude_tokens          | list   | []      | exclude specific tokens from the frequency report, e.g. to remove stopwords |
| exclude_tokens_text     | str    |         | text explaining which tokens have been excluded, added to the report notes |
| restrict_tokens         | list   | []      | restrict the frequency report to frequencies for a specific list of tokens |
| restrict_tokens_text    | str    |         | text explaining which tokens are included, added to the report notes |
| exclude_punctuation     | bool   | True    | exclude punctuation tokens |
| Returns                 | Result |         | a Result object with the frequency table |
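The normalized frequency reported by this method is a rate per `normalize_by` tokens, which makes counts comparable across corpora of different sizes. A minimal sketch of that arithmetic using `collections.Counter` (an illustration only, not conc's internal implementation):

```python
from collections import Counter

def normalized_frequencies(tokens, normalize_by=10000):
    """Count tokens and scale raw counts to a rate per normalize_by tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # normalized frequency = raw count / total tokens * normalize_by
    return {tok: (n, n / total * normalize_by) for tok, n in counts.items()}

tokens = ["the", "cat", "sat", "on", "the", "mat"]
freqs = normalized_frequencies(tokens)
# "the" occurs 2 times in 6 tokens: 2 / 6 * 10000 ≈ 3333.33
```

On this scale, "the" in the Brown corpus example below (63,516 of 980,144 tokens) yields the 648.03 per-10,000 figure shown in the report.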
# load the corpus
brown = Corpus().load(path_to_brown_corpus)
# instantiate the Frequency class
freq_brown = Frequency(brown)
# run the frequencies method and display the results
freq_brown.frequencies(normalize_by=10000, page_size=20).display()
Frequencies of word tokens, Brown Corpus

| Rank | Token | Frequency | Normalized Frequency |
|------|-------|-----------|----------------------|
| 1    | the   | 63,516    | 648.03               |
| 2    | of    | 36,321    | 370.57               |
| 3    | and   | 27,787    | 283.50               |
| 4    | to    | 25,868    | 263.92               |
| 5    | a     | 22,190    | 226.40               |
| 6    | in    | 19,751    | 201.51               |
| 7    | that  | 10,409    | 106.20               |
| 8    | is    | 10,138    | 103.43               |
| 9    | was   | 9,931     | 101.32               |
| 10   | for   | 8,905     | 90.85                |
| 11   | with  | 7,043     | 71.86                |
| 12   | it    | 6,991     | 71.33                |
| 13   | he    | 6,772     | 69.09                |
| 14   | as    | 6,738     | 68.75                |
| 15   | his   | 6,523     | 66.55                |
| 16   | on    | 6,459     | 65.90                |
| 17   | be    | 6,365     | 64.94                |
| 18   | 's    | 5,285     | 53.92                |
| 19   | had   | 5,200     | 53.05                |
| 20   | by    | 5,156     | 52.60                |

Report based on word tokens
Normalized Frequency is per 10,000 tokens
Total word tokens: 980,144
Unique tokens: 42,907
Showing 20 rows
Page 1 of 2146
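The "Showing 20 rows, Page 1 of 2146" footer above follows standard page_size/page_current slicing: 42,907 unique tokens at 20 rows per page gives 2,146 pages. A sketch of that paging arithmetic (an assumption about the behaviour, not conc's internals):

```python
import math

def paginate(rows, page_size=20, page_current=1):
    """Return one page of rows plus the total page count; page_size=0 returns all rows."""
    if page_size == 0:
        return rows, 1
    total_pages = math.ceil(len(rows) / page_size)
    start = (page_current - 1) * page_size
    return rows[start:start + page_size], total_pages

# 42,907 unique tokens at 20 rows per page -> 2,146 pages
page, total_pages = paginate(list(range(42907)), page_size=20, page_current=1)
```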
# get a stop word list, here using spaCy's English stop words
from conc.core import get_stop_words
stop_words = get_stop_words(save_path, spacy_model = 'en_core_web_sm')
# exclude stop words and add a document frequency column to the report
freq_brown.frequencies(normalize_by=10000, show_document_frequency = True, exclude_tokens = stop_words, page_size=20).display()
Frequencies of word tokens, Brown Corpus

| Rank | Token  | Frequency | Document Frequency | Normalized Frequency |
|------|--------|-----------|--------------------|----------------------|
| 1    | said   | 1,944     | 315                | 19.83                |
| 2    | time   | 1,667     | 450                | 17.01                |
| 3    | new    | 1,595     | 390                | 16.27                |
| 4    | man    | 1,346     | 326                | 13.73                |
| 5    | like   | 1,287     | 366                | 13.13                |
| 6    | af     | 989       | 49                 | 10.09                |
| 7    | years  | 953       | 346                | 9.72                 |
| 8    | way    | 925       | 365                | 9.44                 |
| 9    | state  | 883       | 200                | 9.01                 |
| 10   | long   | 863       | 354                | 8.80                 |
| 11   | people | 851       | 286                | 8.68                 |
| 12   | world  | 848       | 274                | 8.65                 |
| 13   | year   | 831       | 242                | 8.48                 |
| 14   | little | 823       | 322                | 8.40                 |
| 15   | good   | 813       | 320                | 8.29                 |
| 16   | men    | 772       | 248                | 7.88                 |
| 17   | work   | 767       | 310                | 7.83                 |
| 18   | day    | 767       | 311                | 7.83                 |
| 19   | old    | 734       | 278                | 7.49                 |
| 20   | life   | 728       | 284                | 7.43                 |

Report based on word tokens
Tokens excluded from report: 306
Normalized Frequency is per 10,000 tokens
Total word tokens: 980,144
Unique tokens: 42,601
Showing 20 rows
Page 1 of 2131
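The exclude_tokens example above drops stop words before reporting; restrict_tokens does the opposite, keeping only a supplied list. Both behave like simple set filters on the frequency counts. A minimal sketch of that filtering behaviour (an illustration, not conc's implementation):

```python
from collections import Counter

def filter_counts(counts, exclude_tokens=(), restrict_tokens=()):
    """Drop excluded tokens; if restrict_tokens is given, keep only those tokens."""
    excluded = set(exclude_tokens)
    filtered = {t: n for t, n in counts.items() if t not in excluded}
    if restrict_tokens:
        restricted = set(restrict_tokens)
        filtered = {t: n for t, n in filtered.items() if t in restricted}
    return filtered

counts = Counter(["the", "man", "said", "the", "time"])
only_listed = filter_counts(counts, restrict_tokens=["man", "time"])
no_stops = filter_counts(counts, exclude_tokens=["the"])
```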