ngrams

Functionality for ngram analysis.

Ngrams

 Ngrams (corpus:conc.corpus.Corpus)

Class for n-gram analysis reporting.

| Parameter | Type | Details |
|---|---|---|
| corpus | Corpus | Corpus instance |

Ngrams.ngrams

 Ngrams.ngrams (token_str:str, ngram_length:int|None=2,
                ngram_token_position:str='LEFT', normalize_by:int=10000,
                page_size:int=20, page_current:int=1,
                show_all_columns:bool=False,
                exclude_punctuation:bool=True, use_cache:bool=True)

Report frequencies of ngrams that contain a token string.

| Parameter | Type | Default | Details |
|---|---|---|---|
| token_str | str | | token string to get ngrams for |
| ngram_length | int \| None | 2 | length of ngram; if set to None, the number of tokens in token_str + 1 is used |
| ngram_token_position | str | LEFT | whether the token sequence is at the LEFT, RIGHT or MIDDLE of the ngram (support for other positions is in development) |
| normalize_by | int | 10000 | normalize frequencies by a number (e.g. 10000) |
| page_size | int | 20 | number of results to display per results page |
| page_current | int | 1 | current page of results |
| show_all_columns | bool | False | return raw df with all columns or just ngram and frequency |
| exclude_punctuation | bool | True | do not return ngrams with punctuation tokens |
| use_cache | bool | True | retrieve the results from cache if available (currently ignored) |
| Returns | Result | | return a Result object with ngram data |
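# import the classes used below
from conc.corpus import Corpus
from conc.ngrams import Ngrams

# load the corpus (path_to_reuters_corpus is assumed to be defined already)
reuters = Corpus().load(path_to_reuters_corpus)

# instantiate the Ngrams class
ngrams_reuters = Ngrams(reuters)

# run the ngrams method and display the results
ngrams_reuters.ngrams('environmental', ngram_length = 2, ngram_token_position = 'LEFT').display()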
Ngrams for "environmental"
Reuters Corpus
| Rank | Ngram | Frequency | Normalized Frequency |
|---|---|---|---|
| 1 | environmental protection | 4 | 0.03 |
| 2 | environmental systems | 4 | 0.03 |
| 3 | environmental services | 3 | 0.02 |
| 4 | environmental damage | 2 | 0.01 |
| 5 | environmental regulations | 2 | 0.01 |
| 6 | environmental impact | 2 | 0.01 |
| 7 | environmental controls | 1 | 0.01 |
| 8 | environmental approval | 1 | 0.01 |
| 9 | environmental and | 1 | 0.01 |
| 10 | environmental sciences | 1 | 0.01 |
| 11 | environmental service | 1 | 0.01 |
| 12 | environmental concerns | 1 | 0.01 |
| 13 | environmental issues | 1 | 0.01 |
| 14 | environmental power | 1 | 0.01 |
| 15 | environmental plan | 1 | 0.01 |
| 16 | environmental had | 1 | 0.01 |
| 17 | environmental control | 1 | 0.01 |
| 18 | environmental management | 1 | 0.01 |
| 19 | environmental subsidiary | 1 | 0.01 |
| 20 | environmental was | 1 | 0.01 |
Report based on word tokens
Ngram length: 2, Token position: left
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 21
Total ngrams: 32
Showing 20 rows
Page 1 of 2
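
The footer of this report (21 unique ngrams, 20 rows shown, page 1 of 2) is consistent with a plain ceiling division of the row count by page_size. The helpers below are hypothetical and are not part of Conc's API; they only sketch that arithmetic under the assumption that pages are 1-indexed, as the page_current default suggests.

```python
def page_count(total_rows: int, page_size: int) -> int:
    """Number of pages needed to show total_rows at page_size rows per page."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Integer ceiling division, avoids floating point rounding.
    return (total_rows + page_size - 1) // page_size


def page_bounds(page_current: int, page_size: int) -> tuple[int, int]:
    """Zero-based [start, stop) row offsets for a 1-indexed page number."""
    start = (page_current - 1) * page_size
    return start, start + page_size


print(page_count(21, 20))   # 2
print(page_bounds(2, 20))   # (20, 40)
```

To view the remaining row of this report, the same call can be repeated with the documented page_current parameter set to 2.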
# run the ngrams method and display the results
ngrams_reuters.ngrams('the highest', ngram_length = 3, ngram_token_position = 'LEFT', page_size = 10).display()
Ngrams for "the highest"
Reuters Corpus
| Rank | Ngram | Frequency | Normalized Frequency |
|---|---|---|---|
| 1 | the highest since | 8 | 0.06 |
| 2 | the highest level | 4 | 0.03 |
| 3 | the highest in | 3 | 0.02 |
| 4 | the highest rate | 2 | 0.01 |
| 5 | the highest interest | 2 | 0.01 |
| 6 | the highest priority | 2 | 0.01 |
| 7 | the highest number | 2 | 0.01 |
| 8 | the highest agriculture | 2 | 0.01 |
| 9 | the highest such | 2 | 0.01 |
| 10 | the highest positive | 2 | 0.01 |
Report based on word tokens
Ngram length: 3, Token position: left
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 31
Total ngrams: 50
Showing 10 rows
Page 1 of 4
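
The effect of ngram_length and ngram_token_position in the two reports above can be pictured with a small pure-Python sketch. This is an illustration only, not Conc's implementation: `ngrams_containing` is a hypothetical helper, whitespace splitting stands in for real tokenization, and only the LEFT and RIGHT positions are covered.

```python
def ngrams_containing(tokens, query, ngram_length=None, position="LEFT"):
    """Return ngrams from tokens that have the query at their left or right edge."""
    query_tokens = query.split()
    k = len(query_tokens)
    # Mirror the documented default: None means "query length + 1".
    n = k + 1 if ngram_length is None else ngram_length
    if n < k:
        raise ValueError("ngram_length must be at least the query length")
    if position not in ("LEFT", "RIGHT"):
        raise ValueError("this sketch only handles LEFT and RIGHT")

    results = []
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        # Compare the query with the first or last k tokens of the window.
        edge = window[:k] if position == "LEFT" else window[n - k:]
        if edge == query_tokens:
            results.append(" ".join(window))
    return results


tokens = "the highest level since the highest rate in years".split()
print(ngrams_containing(tokens, "the highest", ngram_length=3))
# ['the highest level', 'the highest rate']
print(ngrams_containing(tokens, "the highest", ngram_length=3, position="RIGHT"))
# ['since the highest']
```

Counting the returned strings (for example with collections.Counter) would give a frequency column like the one in the reports.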

Ngrams.ngram_frequencies

 Ngrams.ngram_frequencies (ngram_length:int=2, case_sensitive:bool=False,
                           normalize_by:int=10000, page_size:int=20,
                           page_current:int=1,
                           show_document_frequency:bool=False,
                           exclude_punctuation:bool=True)

Report frequent ngrams.

| Parameter | Type | Default | Details |
|---|---|---|---|
| ngram_length | int | 2 | length of ngram |
| case_sensitive | bool | False | frequencies for tokens lowercased or with case preserved |
| normalize_by | int | 10000 | normalize frequencies by a number (e.g. 10000) |
| page_size | int | 20 | number of rows to return |
| page_current | int | 1 | current page |
| show_document_frequency | bool | False | show document frequency in output |
| exclude_punctuation | bool | True | exclude ngrams containing punctuation tokens |
| Returns | Result | | return a Result object with the frequency table |

Computing ngram frequencies is currently the slowest operation in Conc and will be optimised in the future.
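
As a rough picture of what this report contains, ngram frequencies can be thought of as a sliding-window count over the tokens, with each count scaled to a rate per normalize_by tokens. The sketch below is not Conc's implementation: `count_ngrams` is a hypothetical helper, whitespace splitting stands in for real tokenization, and dividing by the total token count is an assumption based on the "per 10,000 tokens" note in the reports.

```python
from collections import Counter


def count_ngrams(tokens, ngram_length=2, normalize_by=10000):
    """Return (ngram, frequency, normalized frequency) rows, most frequent first."""
    counts = Counter(
        tuple(tokens[i:i + ngram_length])
        for i in range(len(tokens) - ngram_length + 1)
    )
    total_tokens = len(tokens)
    return [
        (" ".join(ngram), freq, round(freq / total_tokens * normalize_by, 2))
        for ngram, freq in counts.most_common()
    ]


tokens = "the company said the company said it has".split()
for row in count_ngrams(tokens, ngram_length=3):
    print(row)
# first row: ('the company said', 2, 2500.0)
```

The normalized value is large here only because the toy input has eight tokens; on a corpus of realistic size the same formula yields small rates like those shown in the reports.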

ngrams_reuters.ngram_frequencies(ngram_length = 3, case_sensitive = False).display()
Ngram Frequencies
Reuters Corpus
| Rank | Ngram | Frequency | Normalized Frequency |
|---|---|---|---|
| 1 | the company said | 1,173 | 8.39 |
| 2 | mln dlrs in | 795 | 5.68 |
| 3 | cts vs loss | 665 | 4.75 |
| 4 | said it has | 636 | 4.55 |
| 5 | mln avg shrs | 620 | 4.43 |
| 6 | pct of the | 608 | 4.35 |
| 7 | the united states | 603 | 4.31 |
| 8 | qtr net shr | 574 | 4.10 |
| 9 | dlrs a share | 546 | 3.90 |
| 10 | inc said it | 523 | 3.74 |
| 11 | the company 's | 518 | 3.70 |
| 12 | cts net loss | 517 | 3.70 |
| 13 | the end of | 501 | 3.58 |
| 14 | cts a share | 494 | 3.53 |
| 15 | is expected to | 429 | 3.07 |
| 16 | corp said it | 412 | 2.95 |
| 17 | nine mths shr | 412 | 2.95 |
| 18 | said in a | 407 | 2.91 |
| 19 | the bank of | 380 | 2.72 |
| 20 | billion dlrs in | 373 | 2.67 |
Report based on word tokens
Ngram length: 3
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 684,778
Total ngrams: 1,128,352
Showing 20 rows
Page 1 of 34239
ngrams_reuters.ngram_frequencies(ngram_length = 3, case_sensitive = True).display()
Ngram Frequencies
Reuters Corpus
| Rank | Ngram | Frequency | Normalized Frequency |
|---|---|---|---|
| 1 | The company said | 747 | 5.34 |
| 2 | mln dlrs in | 726 | 5.19 |
| 3 | cts vs loss | 645 | 4.61 |
| 4 | said it has | 632 | 4.52 |
| 5 | mln Avg shrs | 615 | 4.40 |
| 6 | pct of the | 608 | 4.35 |
| 7 | QTR NET Shr | 559 | 4.00 |
| 8 | the United States | 524 | 3.75 |
| 9 | dlrs a share | 519 | 3.71 |
| 10 | Inc said it | 514 | 3.67 |
| 11 | cts Net loss | 509 | 3.64 |
| 12 | the end of | 501 | 3.58 |
| 13 | cts a share | 490 | 3.50 |
| 14 | the company 's | 476 | 3.40 |
| 15 | the company said | 426 | 3.05 |
| 16 | is expected to | 426 | 3.05 |
| 17 | said in a | 407 | 2.91 |
| 18 | Corp said it | 392 | 2.80 |
| 19 | Nine mths Shr | 370 | 2.65 |
| 20 | cts Oper net | 363 | 2.60 |
Report based on word tokens
Ngram length: 3
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 702,051
Total ngrams: 1,128,352
Showing 20 rows
Page 1 of 35103
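
Comparing this report with the case-insensitive one shows what case_sensitive changes: with case preserved, "The company said" (747) and "the company said" (426) are separate rows, while the lowercased report has a single row with 1,173, which is their sum. The sketch below reproduces that merge from the two reported counts. `fold_case` is a hypothetical helper, and folding after counting is only one way to get this result; it is not a description of how Conc does it.

```python
from collections import Counter

# Case-preserved counts for two rows of the report above.
case_sensitive_counts = Counter({
    ("The", "company", "said"): 747,
    ("the", "company", "said"): 426,
})


def fold_case(counts):
    """Merge ngram counts whose tokens differ only by letter case."""
    folded = Counter()
    for ngram, freq in counts.items():
        folded[tuple(token.lower() for token in ngram)] += freq
    return folded


folded = fold_case(case_sensitive_counts)
print(folded[("the", "company", "said")])  # 1173
```

The same merging explains why the case-sensitive report lists more unique ngrams (702,051) than the lowercased one (684,778) while the total number of ngrams is unchanged.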
ngrams_reuters.ngram_frequencies(ngram_length = 3, case_sensitive = False, show_document_frequency = True).display()
Ngram Frequencies
Reuters Corpus
| Rank | Ngram | Frequency | Normalized Frequency | Document Frequency |
|---|---|---|---|---|
| 1 | the company said | 1,173 | 8.39 | 911 |
| 2 | mln dlrs in | 795 | 5.68 | 549 |
| 3 | cts vs loss | 665 | 4.75 | 474 |
| 4 | said it has | 636 | 4.55 | 586 |
| 5 | mln avg shrs | 620 | 4.43 | 450 |
| 6 | pct of the | 608 | 4.35 | 508 |
| 7 | the united states | 603 | 4.31 | 391 |
| 8 | qtr net shr | 574 | 4.10 | 573 |
| 9 | dlrs a share | 546 | 3.90 | 375 |
| 10 | inc said it | 523 | 3.74 | 521 |
| 11 | the company 's | 518 | 3.70 | 406 |
| 12 | cts net loss | 517 | 3.70 | 389 |
| 13 | the end of | 501 | 3.58 | 384 |
| 14 | cts a share | 494 | 3.53 | 290 |
| 15 | is expected to | 429 | 3.07 | 381 |
| 16 | corp said it | 412 | 2.95 | 410 |
| 17 | nine mths shr | 412 | 2.95 | 412 |
| 18 | said in a | 407 | 2.91 | 393 |
| 19 | the bank of | 380 | 2.72 | 304 |
| 20 | billion dlrs in | 373 | 2.67 | 235 |
Report based on word tokens
Ngram length: 3
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 684,778
Total ngrams: 1,128,352
Showing 20 rows
Page 1 of 34239
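
Document frequency is read here in its usual sense: the number of documents in which an ngram occurs at least once. That reading fits the table, where "nine mths shr" has a frequency and a document frequency of 412, while "the company said" occurs 1,173 times across 911 documents. A minimal sketch under that assumption follows. `document_frequencies` is a hypothetical helper operating on pre-tokenized documents, not part of Conc.

```python
from collections import Counter


def document_frequencies(documents, ngram_length=2):
    """Count, for each ngram, how many documents contain it at least once."""
    doc_freq = Counter()
    for tokens in documents:
        # A set keeps each ngram once per document, however often it repeats.
        seen = {
            tuple(tokens[i:i + ngram_length])
            for i in range(len(tokens) - ngram_length + 1)
        }
        doc_freq.update(seen)
    return doc_freq


documents = [
    "nine mths shr nine mths shr".split(),  # ngram appears twice in this document
    "nine mths shr".split(),
    "qtr net shr".split(),
]
doc_freq = document_frequencies(documents, ngram_length=3)
print(doc_freq[("nine", "mths", "shr")])  # 2
```

The ngram occurs three times in total but in only two documents, which is the kind of gap the Frequency and Document Frequency columns make visible.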