Report ngram frequencies containing a token string.
| | Type | Default | Details |
|---|---|---|---|
| token_str | str | | token string to get ngrams for |
| ngram_length | int \| None | 2 | length of ngram; if set to None, the number of tokens in token_str + 1 is used |
| ngram_token_position | str | LEFT | specify whether the token sequence is on the LEFT or RIGHT (support for ngrams with the token in the middle of the sequence is in development) |
| normalize_by | int | 10000 | normalize frequencies by a number (e.g. 10000) |
| page_size | int | 20 | number of results to display per results page |
| page_current | int | 1 | current page of results |
| show_all_columns | bool | False | return the raw df with all columns, or just ngram and frequency |
| exclude_punctuation | bool | True | do not return ngrams with punctuation tokens |
| use_cache | bool | True | retrieve the results from cache if available (currently ignored) |
| Returns | Result | | return a Result object with ngram data |
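To make the `ngram_length` and `ngram_token_position` parameters concrete, here is a minimal pure-Python sketch of the counting they describe. It is not the library's implementation: `count_ngrams` is a hypothetical helper, it works on a plain list of whitespace-split tokens, and it assumes exact (case-sensitive) matching. The only behaviour taken from the table above is that `ngram_length=None` falls back to the number of tokens in `token_str` plus one, and that the token sequence sits at the LEFT or RIGHT edge of the ngram.

```python
from collections import Counter


def count_ngrams(tokens, token_str, ngram_length=2, ngram_token_position="LEFT"):
    """Hypothetical helper: count ngrams that start (LEFT) or end (RIGHT) with token_str."""
    query = token_str.split()
    if not query:
        raise ValueError("token_str must contain at least one token")
    # Documented default: None means "number of tokens in token_str + 1"
    if ngram_length is None:
        ngram_length = len(query) + 1
    if ngram_length < len(query):
        raise ValueError("ngram_length must be at least the length of token_str")
    if ngram_token_position not in ("LEFT", "RIGHT"):
        raise ValueError("ngram_token_position must be 'LEFT' or 'RIGHT'")

    counts = Counter()
    # Slide a window of ngram_length tokens across the token list
    for start in range(len(tokens) - ngram_length + 1):
        window = tokens[start:start + ngram_length]
        if ngram_token_position == "LEFT":
            matched = window[:len(query)] == query
        else:
            matched = window[-len(query):] == query
        if matched:
            counts[" ".join(window)] += 1
    return counts


# Toy token list (made up for illustration, not taken from the Reuters corpus)
tokens = "the highest price and the highest rate since the highest price".split()

left_counts = count_ngrams(tokens, "the highest", ngram_length=None)
right_counts = count_ngrams(tokens, "highest", ngram_length=2, ngram_token_position="RIGHT")
print(left_counts.most_common())
print(right_counts.most_common())
```

With a two-token `token_str` and `ngram_length=None`, the sketch counts trigrams, which matches the documented fallback.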
# load the corpus
reuters = Corpus().load(path_to_reuters_corpus)
# instantiate the Ngrams class
ngrams_reuters = Ngrams(reuters)
# run the ngrams method and display the results
ngrams_reuters.ngrams('environmental', ngram_length=2, ngram_token_position='LEFT').display()
Ngrams for "environmental"
Reuters Corpus
| Rank | Ngram | Frequency | Normalized Frequency |
|---|---|---|---|
| 1 | environmental protection | 4 | 0.03 |
| 2 | environmental systems | 4 | 0.03 |
| 3 | environmental services | 3 | 0.02 |
| 4 | environmental damage | 2 | 0.01 |
| 5 | environmental regulations | 2 | 0.01 |
| 6 | environmental impact | 2 | 0.01 |
| 7 | environmental controls | 1 | 0.01 |
| 8 | environmental approval | 1 | 0.01 |
| 9 | environmental and | 1 | 0.01 |
| 10 | environmental sciences | 1 | 0.01 |
| 11 | environmental service | 1 | 0.01 |
| 12 | environmental concerns | 1 | 0.01 |
| 13 | environmental issues | 1 | 0.01 |
| 14 | environmental power | 1 | 0.01 |
| 15 | environmental plan | 1 | 0.01 |
| 16 | environmental had | 1 | 0.01 |
| 17 | environmental control | 1 | 0.01 |
| 18 | environmental management | 1 | 0.01 |
| 19 | environmental subsidiary | 1 | 0.01 |
| 20 | environmental was | 1 | 0.01 |
Report based on word tokens
Ngram length: 2, Token position: left
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 21
Total ngrams: 32
Showing 20 rows
Page 1 of 2
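The footer figures can be read as simple arithmetic. The sketch below is an assumption about how such values are typically derived, not the library's code: `normalized_frequency`, `page_count` and `page_rows` are hypothetical helpers, and the corpus size used in the example is invented because the report does not state the total token count. Only the page calculation is checked against the report itself, where 21 unique ngrams at 20 rows per page give 2 pages.

```python
import math


def normalized_frequency(frequency, total_tokens, normalize_by=10000):
    """Occurrences per `normalize_by` tokens (hypothetical helper)."""
    return frequency / total_tokens * normalize_by


def page_count(total_rows, page_size=20):
    """Number of result pages needed to show every row."""
    return math.ceil(total_rows / page_size)


def page_rows(rows, page_size=20, page_current=1):
    """Rows shown on a given 1-indexed page."""
    start = (page_current - 1) * page_size
    return rows[start:start + page_size]


# Invented corpus size for illustration only; NOT the Reuters token count
example_total_tokens = 1_000_000
print(round(normalized_frequency(4, example_total_tokens), 2))

# 21 unique ngrams at 20 per page, as in the report footer
print(page_count(21, page_size=20))
print(len(page_rows(list(range(21)), page_size=20, page_current=2)))
```

Under these assumptions the second page would hold the single remaining ngram.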
# run the ngrams method and display the results
ngrams_reuters.ngrams('the highest', ngram_length=3, ngram_token_position='LEFT', page_size=10).display()