
ngrams

Functionality for ngram analysis.

Ngrams

 Ngrams (corpus:conc.corpus.Corpus)

Class for n-gram analysis reporting.

| Parameter | Type | Details |
|---|---|---|
| corpus | Corpus | Corpus instance |

Ngrams.ngrams

 Ngrams.ngrams (token_str:str, ngram_length:int|None=2,
                ngram_token_position:str='LEFT', normalize_by:int=10000,
                page_size:int=20, page_current:int=1,
                show_all_columns:bool=False,
                exclude_punctuation:bool=True, use_cache:bool=True)

Report ngram frequencies containing a token string.

| Parameter | Type | Default | Details |
|---|---|---|---|
| token_str | str | | token string to get ngrams for |
| ngram_length | int \| None | 2 | length of the ngram; if set to None, the number of tokens in token_str plus 1 is used |
| ngram_token_position | str | LEFT | whether the token sequence is on the LEFT or RIGHT of the ngram (support for ngrams with the token in the middle of the sequence is in development) |
| normalize_by | int | 10000 | normalize frequencies by a number (e.g. 10000) |
| page_size | int | 20 | number of results to display per results page |
| page_current | int | 1 | current page of results |
| show_all_columns | bool | False | return the raw df with all columns, or just ngram and frequency |
| exclude_punctuation | bool | True | do not return ngrams containing punctuation tokens |
| use_cache | bool | True | retrieve the results from cache if available (currently ignored) |
| Returns | Result | | a Result object with ngram data |
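To make `ngram_length` and `ngram_token_position` concrete, here is a minimal pure-Python sketch on a toy token list. It is not Conc's implementation: the helper `ngrams_containing` is hypothetical, and it assumes that LEFT means the ngram starts with the token sequence and RIGHT means the ngram ends with it.

```python
from collections import Counter

def ngrams_containing(tokens, token_str, ngram_length=2, position="LEFT"):
    """Count ngrams that start (LEFT) or end (RIGHT) with the token sequence.

    Hypothetical helper for illustration only, not part of Conc.
    """
    query = token_str.split()
    # Mirror the documented default: None means len(query) + 1
    n = len(query) + 1 if ngram_length is None else ngram_length
    if n < len(query):
        raise ValueError("ngram_length is shorter than the token sequence")
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        # Compare the start or the end of the window with the query
        part = window[:len(query)] if position == "LEFT" else window[-len(query):]
        if part == query:
            counts[" ".join(window)] += 1
    return counts

tokens = "the highest level since the highest rate and the highest level".split()

# ngram_length=None resolves to 3 for the two-token query "the highest"
left = ngrams_containing(tokens, "the highest", ngram_length=None)
right = ngrams_containing(tokens, "highest", ngram_length=2, position="RIGHT")

print(left.most_common())
print(right.most_common())
```

With the toy tokens, the LEFT query counts "the highest level" twice and "the highest rate" once, and the RIGHT query counts "the highest" three times.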
from conc.corpus import Corpus
from conc.ngrams import Ngrams

# load the corpus (path_to_reuters_corpus is assumed to be defined already)
reuters = Corpus().load(path_to_reuters_corpus)

# instantiate the Ngrams class
ngrams_reuters = Ngrams(reuters)

# run the ngrams method and display the results
ngrams_reuters.ngrams('environmental', ngram_length = 2, ngram_token_position = 'LEFT').display()
Ngrams for "environmental"
Reuters Corpus
| Rank | Ngram | Frequency | Normalized Frequency |
|---|---|---|---|
| 1 | environmental protection | 4 | 0.03 |
| 2 | environmental systems | 4 | 0.03 |
| 3 | environmental services | 3 | 0.02 |
| 4 | environmental damage | 2 | 0.01 |
| 5 | environmental regulations | 2 | 0.01 |
| 6 | environmental impact | 2 | 0.01 |
| 7 | environmental controls | 1 | 0.01 |
| 8 | environmental approval | 1 | 0.01 |
| 9 | environmental and | 1 | 0.01 |
| 10 | environmental sciences | 1 | 0.01 |
| 11 | environmental service | 1 | 0.01 |
| 12 | environmental concerns | 1 | 0.01 |
| 13 | environmental issues | 1 | 0.01 |
| 14 | environmental power | 1 | 0.01 |
| 15 | environmental plan | 1 | 0.01 |
| 16 | environmental had | 1 | 0.01 |
| 17 | environmental control | 1 | 0.01 |
| 18 | environmental management | 1 | 0.01 |
| 19 | environmental subsidiary | 1 | 0.01 |
| 20 | environmental was | 1 | 0.01 |
Report based on word tokens
Ngram length: 2, Token position: left
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 21
Total ngrams: 32
Showing 20 rows
Page 1 of 2
# run the ngrams method and display the results
ngrams_reuters.ngrams('the highest', ngram_length = 3, ngram_token_position = 'LEFT', page_size = 10).display()
Ngrams for "the highest"
Reuters Corpus
| Rank | Ngram | Frequency | Normalized Frequency |
|---|---|---|---|
| 1 | the highest since | 8 | 0.06 |
| 2 | the highest level | 4 | 0.03 |
| 3 | the highest in | 3 | 0.02 |
| 4 | the highest rate | 2 | 0.01 |
| 5 | the highest interest | 2 | 0.01 |
| 6 | the highest priority | 2 | 0.01 |
| 7 | the highest number | 2 | 0.01 |
| 8 | the highest agriculture | 2 | 0.01 |
| 9 | the highest such | 2 | 0.01 |
| 10 | the highest positive | 2 | 0.01 |
Report based on word tokens
Ngram length: 3, Token position: left
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 31
Total ngrams: 50
Showing 10 rows
Page 1 of 4
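The report footers tie together "Total unique ngrams", `page_size` and the "Page 1 of N" line. The sketch below assumes the page count is the ceiling of unique ngrams divided by `page_size`, which is consistent with the two reports above (21 ngrams at 20 per page, 31 ngrams at 10 per page). The helper names are hypothetical and are not part of Conc.

```python
import math

def page_count(total_unique_ngrams, page_size):
    """Number of result pages needed to show every unique ngram."""
    return math.ceil(total_unique_ngrams / page_size)

def page_rows(rows, page_size, page_current):
    """Return the rows that fall on a 1-indexed results page."""
    start = (page_current - 1) * page_size
    return rows[start:start + page_size]

# 'environmental' report: 21 unique ngrams, 20 rows per page
pages_environmental = page_count(21, 20)
# 'the highest' report: 31 unique ngrams, 10 rows per page
pages_highest = page_count(31, 10)

print(pages_environmental, pages_highest)  # 2 4

# The second page of a 21-row result at 20 rows per page holds one row
ranks = list(range(1, 22))
print(page_rows(ranks, page_size=20, page_current=2))  # [21]
```

Under this assumption, passing `page_current = 2` to the 'environmental' query would show only the single remaining ngram.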

Ngrams.ngram_frequencies

 Ngrams.ngram_frequencies (ngram_length:int=2, case_sensitive:bool=False,
                           normalize_by:int=10000, page_size:int=20,
                           page_current:int=1,
                           exclude_punctuation:bool=True)

Report frequent ngrams.

| Parameter | Type | Default | Details |
|---|---|---|---|
| ngram_length | int | 2 | length of ngram |
| case_sensitive | bool | False | frequencies for tokens lowercased or with case preserved |
| normalize_by | int | 10000 | normalize frequencies by a number (e.g. 10000) |
| page_size | int | 20 | number of rows to return |
| page_current | int | 1 | current page |
| exclude_punctuation | bool | True | exclude ngrams containing punctuation tokens |
| Returns | Result | | a Result object with the frequency table |
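As a rough illustration of what these parameters control, the following pure-Python sketch counts ngrams in a toy token list. It is not how Conc computes frequencies. The helper `toy_ngram_frequencies` is hypothetical, and it assumes that normalized frequency is the count divided by the total number of tokens, multiplied by `normalize_by`, and that a punctuation token is one made up entirely of punctuation characters.

```python
from collections import Counter
import string

def toy_ngram_frequencies(tokens, ngram_length=2, case_sensitive=False,
                          normalize_by=10000, exclude_punctuation=True):
    """Return (ngram, frequency, normalized frequency) rows, most frequent first."""
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
    total_tokens = len(tokens)  # assumption: denominator is all tokens

    def is_punct(token):
        return all(ch in string.punctuation for ch in token)

    counts = Counter()
    for i in range(len(tokens) - ngram_length + 1):
        window = tokens[i:i + ngram_length]
        # Skip any ngram that includes a punctuation token
        if exclude_punctuation and any(is_punct(t) for t in window):
            continue
        counts[" ".join(window)] += 1

    return [(ngram, freq, round(freq / total_tokens * normalize_by, 2))
            for ngram, freq in counts.most_common()]

toy_tokens = "The company said it has , the company said".split()
rows = toy_ngram_frequencies(toy_tokens, ngram_length=3)
rows_cased = toy_ngram_frequencies(toy_tokens, ngram_length=3, case_sensitive=True)

for row in rows:
    print(row)
```

With `case_sensitive=False` the two spellings of "the company said" are merged into one row with a frequency of 2; with `case_sensitive=True` they are counted separately. The three windows that include the comma are skipped in both cases.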
ngrams_reuters.ngram_frequencies(ngram_length = 3, case_sensitive = False).display()
Ngram Frequencies
Reuters Corpus
| Rank | Ngram | Frequency | Normalized Frequency |
|---|---|---|---|
| 1 | the company said | 1,173 | 8.39 |
| 2 | mln dlrs in | 795 | 5.68 |
| 3 | cts vs loss | 665 | 4.75 |
| 4 | said it has | 636 | 4.55 |
| 5 | mln avg shrs | 620 | 4.43 |
| 6 | pct of the | 608 | 4.35 |
| 7 | the united states | 603 | 4.31 |
| 8 | qtr net shr | 574 | 4.10 |
| 9 | dlrs a share | 546 | 3.90 |
| 10 | inc said it | 523 | 3.74 |
| 11 | the company 's | 518 | 3.70 |
| 12 | cts net loss | 517 | 3.70 |
| 13 | the end of | 501 | 3.58 |
| 14 | cts a share | 494 | 3.53 |
| 15 | is expected to | 429 | 3.07 |
| 16 | corp said it | 412 | 2.95 |
| 17 | nine mths shr | 412 | 2.95 |
| 18 | said in a | 407 | 2.91 |
| 19 | the bank of | 380 | 2.72 |
| 20 | billion dlrs in | 373 | 2.67 |
Report based on word tokens
Ngram length: 3
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 684,778
Total ngrams: 1,128,352
Showing 20 rows
Page 1 of 34239
ngrams_reuters.ngram_frequencies(ngram_length = 3, case_sensitive = True).display()
Ngram Frequencies
Reuters Corpus
| Rank | Ngram | Frequency | Normalized Frequency |
|---|---|---|---|
| 1 | The company said | 747 | 5.34 |
| 2 | mln dlrs in | 726 | 5.19 |
| 3 | cts vs loss | 645 | 4.61 |
| 4 | said it has | 632 | 4.52 |
| 5 | mln Avg shrs | 615 | 4.40 |
| 6 | pct of the | 608 | 4.35 |
| 7 | QTR NET Shr | 559 | 4.00 |
| 8 | the United States | 524 | 3.75 |
| 9 | dlrs a share | 519 | 3.71 |
| 10 | Inc said it | 514 | 3.67 |
| 11 | cts Net loss | 509 | 3.64 |
| 12 | the end of | 501 | 3.58 |
| 13 | cts a share | 490 | 3.50 |
| 14 | the company 's | 476 | 3.40 |
| 15 | is expected to | 426 | 3.05 |
| 16 | the company said | 426 | 3.05 |
| 17 | said in a | 407 | 2.91 |
| 18 | Corp said it | 392 | 2.80 |
| 19 | Nine mths Shr | 370 | 2.65 |
| 20 | cts Oper net | 363 | 2.60 |
Report based on word tokens
Ngram length: 3
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 702,051
Total ngrams: 1,128,352
Showing 20 rows
Page 1 of 35103
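Comparing the two reports shows what `case_sensitive` changes: the total ngram count stays at 1,128,352, while the number of unique ngrams drops from 702,051 to 684,778 when case is ignored, because case variants collapse into one entry. The short sketch below folds two rows from the case-sensitive report to illustrate this. It is a worked example rather than Conc's code, and it assumes these two spellings are the only case variants of this ngram in the corpus.

```python
from collections import Counter

# Two rows taken from the case-sensitive report above
case_sensitive_counts = Counter({
    "The company said": 747,
    "the company said": 426,
})

# Fold case by lowercasing each ngram and summing the frequencies
folded = Counter()
for ngram, freq in case_sensitive_counts.items():
    folded[ngram.lower()] += freq

print(folded["the company said"])  # 1173
```

The folded total of 1,173 matches the rank 1 row of the case-insensitive report. Other ngrams may have further variants outside the top 20: for example, "the United States" at 524 alone does not account for the case-insensitive count of 603.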