
frequency

Functionality for frequency analysis.


Frequency

 Frequency (corpus:conc.corpus.Corpus)

Class for frequency analysis reporting.

Name    Type    Details
corpus  Corpus  Corpus instance
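
As a quick illustration, a minimal sketch of creating a Frequency reporter from an already loaded corpus (the worked example below does the same with the Brown Corpus):

# assumes corpus is a conc.corpus.Corpus instance that has already been built and loaded
freq = Frequency(corpus)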


Frequency.frequencies

 Frequency.frequencies (case_sensitive:bool=False, normalize_by:int=10000,
                        page_size:int=20, page_current:int=1,
                        show_token_id:bool=False,
                        show_document_frequency:bool=False,
                        exclude_tokens:list[str]=[],
                        exclude_tokens_text:str='',
                        restrict_tokens:list[str]=[],
                        restrict_tokens_text:str='',
                        exclude_punctuation:bool=True)

Report frequent tokens.

Name                     Type    Default  Details
case_sensitive           bool    False    frequencies for tokens with or without case preserved
normalize_by             int     10000    normalize frequencies by a number (e.g. 10000)
page_size                int     20       number of rows to return; if 0, all rows are returned
page_current             int     1        current page; ignored if page_size is 0
show_token_id            bool    False    show token_id in output
show_document_frequency  bool    False    show document frequency in output
exclude_tokens           list    []       exclude specific tokens from the frequency report (e.g. to remove stopwords)
exclude_tokens_text      str     ''       text explaining which tokens have been excluded, added to the report notes
restrict_tokens          list    []       restrict the frequency report to frequencies for a list of specific tokens
restrict_tokens_text     str     ''       text explaining which tokens are included, added to the report notes
exclude_punctuation      bool    True     exclude punctuation tokens
Returns                  Result           a Result object with the frequency table
from conc.corpus import Corpus
from conc.frequency import Frequency

# load a previously built corpus (path_to_brown_corpus points to the saved Brown Corpus)
brown = Corpus().load(path_to_brown_corpus)
# instantiate the Frequency class with the corpus
freq_brown = Frequency(brown)
# run the frequencies method and display the results
freq_brown.frequencies(normalize_by=10000, page_size=20).display()
Frequencies
Frequencies of word tokens, Brown Corpus
Rank  Token  Frequency  Normalized Frequency
1     the       63,516                648.03
2     of        36,321                370.57
3     and       27,787                283.50
4     to        25,868                263.92
5     a         22,190                226.40
6     in        19,751                201.51
7     that      10,409                106.20
8     is        10,138                103.43
9     was        9,931                101.32
10    for        8,905                 90.85
11    with       7,043                 71.86
12    it         6,991                 71.33
13    he         6,772                 69.09
14    as         6,738                 68.75
15    his        6,523                 66.55
16    on         6,459                 65.90
17    be         6,365                 64.94
18    's         5,285                 53.92
19    had        5,200                 53.05
20    by         5,156                 52.60
Report based on word tokens
Normalized Frequency is per 10,000 tokens
Total word tokens: 980,144
Unique tokens: 42,907
Showing 20 rows
Page 1 of 2146
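
Reports are paged. As a quick sketch (output not shown), the next 20 rows of the same report can be retrieved with the page_current parameter documented above:

# request the second page of 20 rows from the same frequency report
freq_brown.frequencies(normalize_by=10000, page_size=20, page_current=2).display()
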
# get a stop word list (based on spaCy's en_core_web_sm model); save_path is defined elsewhere
from conc.core import get_stop_words
stop_words = get_stop_words(save_path, spacy_model='en_core_web_sm')
# exclude the stop words and add document frequencies to the report
freq_brown.frequencies(normalize_by=10000, show_document_frequency=True, exclude_tokens=stop_words, page_size=20).display()
Frequencies
Frequencies of word tokens, Brown Corpus
Rank  Token   Frequency  Document Frequency  Normalized Frequency
1     said        1,944                 315                 19.83
2     time        1,667                 450                 17.01
3     new         1,595                 390                 16.27
4     man         1,346                 326                 13.73
5     like        1,287                 366                 13.13
6     af            989                  49                 10.09
7     years         953                 346                  9.72
8     way           925                 365                  9.44
9     state         883                 200                  9.01
10    long          863                 354                  8.80
11    people        851                 286                  8.68
12    world         848                 274                  8.65
13    year          831                 242                  8.48
14    little        823                 322                  8.40
15    good          813                 320                  8.29
16    men           772                 248                  7.88
17    work          767                 310                  7.83
18    day           767                 311                  7.83
19    old           734                 278                  7.49
20    life          728                 284                  7.43
Report based on word tokens
Tokens excluded from report: 306
Normalized Frequency is per 10,000 tokens
Total word tokens: 980,144
Unique tokens: 42,601
Showing 20 rows
Page 1 of 2131
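
A report can also be limited to specific tokens. The sketch below (output not shown) uses the restrict_tokens, restrict_tokens_text and case_sensitive parameters documented above; the pronoun list is purely illustrative:

# restrict the report to an illustrative set of pronoun tokens, preserving case
freq_brown.frequencies(
    case_sensitive=True,
    restrict_tokens=['he', 'He', 'she', 'She'],
    restrict_tokens_text='Report restricted to third-person singular pronouns.',
    page_size=10
).display()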