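from conc.corpus import Corpus
from conc.ngrams import Ngrams  # module path assumed from this page's title

# load the corpus
reuters = Corpus().load(path_to_reuters_corpus)
# instantiate the Ngrams class
ngrams_reuters = Ngrams(reuters)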
ngrams
Functionality for ngram analysis.
Using the Ngrams class
The examples below show how to use the Ngrams class directly to report ngram clusters or ngram-based frequency lists. The recommended way to access this functionality is through the Conc class, which provides a single interface for creating frequency lists, concordances, collocation tables, keyword tables and more.
Ngrams class API reference
Ngrams
Ngrams (corpus:conc.corpus.Corpus)
Class for n-gram analysis reporting.
Parameter | Type | Details |
---|---|---|
corpus | Corpus | Corpus instance |
Ngrams.ngrams
Ngrams.ngrams (token_str:str, ngram_length:int|None=2, ngram_token_position:str='LEFT', normalize_by:int=10000, page_size:int=20, page_current:int=1, show_all_columns:bool=False, exclude_punctuation:bool=True, use_cache:bool=True)
Report ngram frequencies containing a token string.
Parameter | Type | Default | Details |
---|---|---|---|
token_str | str | | token string to get ngrams for |
ngram_length | int \| None | 2 | length of ngram; if set to None, the number of tokens in token_str + 1 is used |
ngram_token_position | str | LEFT | specify whether the token sequence is on the LEFT, RIGHT or MIDDLE (support for other positions is in development) |
normalize_by | int | 10000 | normalize frequencies by a number (e.g. 10000) |
page_size | int | 20 | number of results to display per results page |
page_current | int | 1 | current page of results |
show_all_columns | bool | False | return raw df with all columns or just ngram and frequency |
exclude_punctuation | bool | True | do not return ngrams with punctuation tokens |
use_cache | bool | True | retrieve the results from cache if available (currently ignored) |
Returns | Result | | return a Result object with ngram data |
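To make the `ngram_length` and `ngram_token_position` parameters concrete, here is a minimal pure-Python sketch of the idea. It is not Conc's implementation: the helper names `resolve_ngram_length` and `ngrams_with_token` are hypothetical, tokens are assumed to be a plain list of strings, and matching is assumed to be case-insensitive.

```python
def resolve_ngram_length(token_str, ngram_length=None):
    """Return ngram_length, or the number of tokens in token_str + 1 when it is None."""
    if ngram_length is None:
        return len(token_str.split()) + 1
    return ngram_length


def ngrams_with_token(tokens, token_str, ngram_length=None, position="LEFT"):
    """Collect ngrams whose LEFT or RIGHT edge matches the token sequence.

    Hypothetical helper for illustration only; whitespace splitting of
    token_str is an assumption, not how Conc tokenises text.
    """
    query = token_str.lower().split()
    n = resolve_ngram_length(token_str, ngram_length)
    lowered = [t.lower() for t in tokens]
    results = []
    for start in range(len(lowered) - n + 1):
        window = lowered[start:start + n]
        if position == "LEFT":
            matched = window[:len(query)] == query
        elif position == "RIGHT":
            matched = window[-len(query):] == query
        else:
            raise ValueError("this sketch only handles LEFT and RIGHT")
        if matched:
            results.append(" ".join(window))
    return results


tokens = "the highest level since the highest rate".split()
left = ngrams_with_token(tokens, "the highest")  # length defaults to 2 + 1 = 3
right = ngrams_with_token(tokens, "highest", ngram_length=2, position="RIGHT")
```

With `position="LEFT"` the query tokens open each ngram, and with `position="RIGHT"` they close it, which is the distinction the parameter table describes.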
Examples
See the note above about accessing this functionality through the Conc class.
# run the ngrams method and display the results
ngrams_reuters.ngrams('environmental', ngram_length = 2, ngram_token_position = 'LEFT').display()
Ngrams for "environmental" | |||
---|---|---|---|
Reuters Corpus | |||
Rank | Ngram | Frequency | Normalized Frequency |
1 | environmental protection | 4 | 0.03 |
2 | environmental systems | 4 | 0.03 |
3 | environmental services | 3 | 0.02 |
4 | environmental damage | 2 | 0.01 |
5 | environmental regulations | 2 | 0.01 |
6 | environmental impact | 2 | 0.01 |
7 | environmental controls | 1 | 0.01 |
8 | environmental approval | 1 | 0.01 |
9 | environmental and | 1 | 0.01 |
10 | environmental sciences | 1 | 0.01 |
11 | environmental service | 1 | 0.01 |
12 | environmental concerns | 1 | 0.01 |
13 | environmental issues | 1 | 0.01 |
14 | environmental power | 1 | 0.01 |
15 | environmental plan | 1 | 0.01 |
16 | environmental had | 1 | 0.01 |
17 | environmental control | 1 | 0.01 |
18 | environmental management | 1 | 0.01 |
19 | environmental subsidiary | 1 | 0.01 |
20 | environmental was | 1 | 0.01 |
Report based on word tokens | |||
Ngram length: 2, Token position: left | |||
Ngrams containing punctuation tokens excluded | |||
Normalized Frequency is per 10,000 tokens | |||
Total unique ngrams: 21 | |||
Total ngrams: 32 | |||
Showing 20 rows | |||
Page 1 of 2 |
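The report footer states that Normalized Frequency is per 10,000 tokens. A minimal sketch of that calculation follows, assuming the usual definition (raw frequency divided by the corpus token count, scaled by `normalize_by`). The function name and the corpus size of 50,000 tokens are hypothetical; the Reuters token count is not given in this section.

```python
def normalized_frequency(frequency, total_tokens, normalize_by=10000):
    """Scale a raw count to a rate per `normalize_by` tokens, rounded to 2 places."""
    return round(frequency * normalize_by / total_tokens, 2)


# Hypothetical corpus of 50,000 tokens in which an ngram occurs 4 times.
rate = normalized_frequency(4, 50_000)
```

Normalizing this way lets frequencies be compared across corpora of different sizes.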
# run the ngrams method and display the results
ngrams_reuters.ngrams('the highest', ngram_length = 3, ngram_token_position = 'LEFT', page_size = 10).display()
Ngrams for "the highest" | |||
---|---|---|---|
Reuters Corpus | |||
Rank | Ngram | Frequency | Normalized Frequency |
1 | the highest since | 8 | 0.06 |
2 | the highest level | 4 | 0.03 |
3 | the highest in | 3 | 0.02 |
4 | the highest rate | 2 | 0.01 |
5 | the highest interest | 2 | 0.01 |
6 | the highest priority | 2 | 0.01 |
7 | the highest number | 2 | 0.01 |
8 | the highest agriculture | 2 | 0.01 |
9 | the highest such | 2 | 0.01 |
10 | the highest positive | 2 | 0.01 |
Report based on word tokens | |||
Ngram length: 3, Token position: left | |||
Ngrams containing punctuation tokens excluded | |||
Normalized Frequency is per 10,000 tokens | |||
Total unique ngrams: 31 | |||
Total ngrams: 50 | |||
Showing 10 rows | |||
Page 1 of 4 |
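The page counts in the two reports above are consistent with dividing the number of unique ngrams by `page_size` and rounding up: 21 ngrams at 20 per page gives 2 pages, and 31 ngrams at 10 per page gives 4. The sketch below illustrates that arithmetic with hypothetical helpers; it is an assumption about the behaviour, not Conc's paging code.

```python
import math


def page_count(total_rows, page_size=20):
    """Number of pages needed to show total_rows at page_size rows per page."""
    return math.ceil(total_rows / page_size)


def page_slice(rows, page_size=20, page_current=1):
    """Return the rows that fall on page_current (pages are numbered from 1)."""
    start = (page_current - 1) * page_size
    return rows[start:start + page_size]


pages_first_report = page_count(21, page_size=20)
pages_second_report = page_count(31, page_size=10)
last_page = page_slice(list(range(31)), page_size=10, page_current=4)
```

Here the final page of the second report holds a single row, because 31 rows leave one over after three full pages of 10.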
Ngrams.ngram_frequencies
Ngrams.ngram_frequencies (ngram_length:int=2, case_sensitive:bool=False, normalize_by:int=10000, page_size:int=20, page_current:int=1, show_document_frequency:bool=False, exclude_punctuation:bool=True)
Report frequent ngrams.
Parameter | Type | Default | Details |
---|---|---|---|
ngram_length | int | 2 | length of ngram |
case_sensitive | bool | False | frequencies for tokens lowercased or with case preserved |
normalize_by | int | 10000 | normalize frequencies by a number (e.g. 10000) |
page_size | int | 20 | number of rows to return |
page_current | int | 1 | current page |
show_document_frequency | bool | False | show document frequency in output |
exclude_punctuation | bool | True | exclude ngrams containing punctuation tokens |
Returns | Result | | return a Result object with the frequency table |
Computing ngram frequencies is currently the slowest operation in Conc and will be optimised in the future.
Examples
See the note above about accessing this functionality through the Conc class.
ngrams_reuters.ngram_frequencies(ngram_length = 3, case_sensitive = False).display()
Ngram Frequencies | |||
---|---|---|---|
Reuters Corpus | |||
Rank | Ngram | Frequency | Normalized Frequency |
1 | the company said | 1,173 | 8.39 |
2 | mln dlrs in | 795 | 5.68 |
3 | cts vs loss | 665 | 4.75 |
4 | said it has | 636 | 4.55 |
5 | mln avg shrs | 620 | 4.43 |
6 | pct of the | 608 | 4.35 |
7 | the united states | 603 | 4.31 |
8 | qtr net shr | 574 | 4.10 |
9 | dlrs a share | 546 | 3.90 |
10 | inc said it | 523 | 3.74 |
11 | the company 's | 518 | 3.70 |
12 | cts net loss | 517 | 3.70 |
13 | the end of | 501 | 3.58 |
14 | cts a share | 494 | 3.53 |
15 | is expected to | 429 | 3.07 |
16 | corp said it | 412 | 2.95 |
17 | nine mths shr | 412 | 2.95 |
18 | said in a | 407 | 2.91 |
19 | the bank of | 380 | 2.72 |
20 | billion dlrs in | 373 | 2.67 |
Report based on word tokens | |||
Ngram length: 3 | |||
Ngrams containing punctuation tokens excluded | |||
Normalized Frequency is per 10,000 tokens | |||
Total unique ngrams: 684,778 | |||
Total ngrams: 1,128,352 | |||
Showing 20 rows | |||
Page 1 of 34239 |
ngrams_reuters.ngram_frequencies(ngram_length = 3, case_sensitive = True).display()
Ngram Frequencies | |||
---|---|---|---|
Reuters Corpus | |||
Rank | Ngram | Frequency | Normalized Frequency |
1 | The company said | 747 | 5.34 |
2 | mln dlrs in | 726 | 5.19 |
3 | cts vs loss | 645 | 4.61 |
4 | said it has | 632 | 4.52 |
5 | mln Avg shrs | 615 | 4.40 |
6 | pct of the | 608 | 4.35 |
7 | QTR NET Shr | 559 | 4.00 |
8 | the United States | 524 | 3.75 |
9 | dlrs a share | 519 | 3.71 |
10 | Inc said it | 514 | 3.67 |
11 | cts Net loss | 509 | 3.64 |
12 | the end of | 501 | 3.58 |
13 | cts a share | 490 | 3.50 |
14 | the company 's | 476 | 3.40 |
15 | the company said | 426 | 3.05 |
16 | is expected to | 426 | 3.05 |
17 | said in a | 407 | 2.91 |
18 | Corp said it | 392 | 2.80 |
19 | Nine mths Shr | 370 | 2.65 |
20 | cts Oper net | 363 | 2.60 |
Report based on word tokens | |||
Ngram length: 3 | |||
Ngrams containing punctuation tokens excluded | |||
Normalized Frequency is per 10,000 tokens | |||
Total unique ngrams: 702,051 | |||
Total ngrams: 1,128,352 | |||
Showing 20 rows | |||
Page 1 of 35103 |
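Comparing the two reports above shows what `case_sensitive` does: with case preserved, "The company said" (747) and "the company said" (426) are counted separately, and together they account for the 1,173 occurrences reported when tokens are lowercased. The sketch below reproduces that effect on a toy token list. It is an illustrative counter built on `collections.Counter`, not Conc's implementation, and treating a token as punctuation when every character is in `string.punctuation` is an assumption.

```python
import string
from collections import Counter


def ngram_frequencies(tokens, ngram_length=2, case_sensitive=False,
                      exclude_punctuation=True):
    """Count ngrams in a token list (toy illustration, not Conc's code)."""
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
    counts = Counter()
    for start in range(len(tokens) - ngram_length + 1):
        window = tokens[start:start + ngram_length]
        # Skip ngrams that contain a token made up only of punctuation.
        if exclude_punctuation and any(
            all(ch in string.punctuation for ch in tok) for tok in window
        ):
            continue
        counts[" ".join(window)] += 1
    return counts


tokens = "The company said . the company said it".split()
folded = ngram_frequencies(tokens, ngram_length=3, case_sensitive=False)
preserved = ngram_frequencies(tokens, ngram_length=3, case_sensitive=True)
```

In `folded` the two spellings merge into a single entry, while `preserved` keeps them apart, mirroring the split seen in the Reuters tables.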
ngrams_reuters.ngram_frequencies(ngram_length = 3, case_sensitive = False, show_document_frequency = True).display()
Ngram Frequencies | ||||
---|---|---|---|---|
Reuters Corpus | ||||
Rank | Ngram | Frequency | Normalized Frequency | Document Frequency |
1 | the company said | 1,173 | 8.39 | 911 |
2 | mln dlrs in | 795 | 5.68 | 549 |
3 | cts vs loss | 665 | 4.75 | 474 |
4 | said it has | 636 | 4.55 | 586 |
5 | mln avg shrs | 620 | 4.43 | 450 |
6 | pct of the | 608 | 4.35 | 508 |
7 | the united states | 603 | 4.31 | 391 |
8 | qtr net shr | 574 | 4.10 | 573 |
9 | dlrs a share | 546 | 3.90 | 375 |
10 | inc said it | 523 | 3.74 | 521 |
11 | the company 's | 518 | 3.70 | 406 |
12 | cts net loss | 517 | 3.70 | 389 |
13 | the end of | 501 | 3.58 | 384 |
14 | cts a share | 494 | 3.53 | 290 |
15 | is expected to | 429 | 3.07 | 381 |
16 | corp said it | 412 | 2.95 | 410 |
17 | nine mths shr | 412 | 2.95 | 412 |
18 | said in a | 407 | 2.91 | 393 |
19 | the bank of | 380 | 2.72 | 304 |
20 | billion dlrs in | 373 | 2.67 | 235 |
Report based on word tokens | ||||
Ngram length: 3 | ||||
Ngrams containing punctuation tokens excluded | ||||
Normalized Frequency is per 10,000 tokens | ||||
Total unique ngrams: 684,778 | ||||
Total ngrams: 1,128,352 | ||||
Showing 20 rows | ||||
Page 1 of 34239 |
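Document Frequency is read here as the number of documents that contain the ngram at least once, in contrast to Frequency, which counts every occurrence. That reading fits the table above: "nine mths shr" has 412 occurrences in 412 documents, while "the company said" has 1,173 occurrences in 911 documents. A minimal sketch under that assumption, with a hypothetical helper name and documents represented as lists of tokens:

```python
from collections import Counter


def document_frequencies(documents, ngram_length=2):
    """Count, for each ngram, how many documents contain it at least once."""
    doc_freq = Counter()
    for tokens in documents:
        lowered = [t.lower() for t in tokens]
        # A set keeps each ngram once per document, however often it repeats.
        seen = {
            " ".join(lowered[i:i + ngram_length])
            for i in range(len(lowered) - ngram_length + 1)
        }
        doc_freq.update(seen)
    return doc_freq


documents = [
    "the company said the company said".split(),
    "The company said".split(),
]
doc_freq = document_frequencies(documents, ngram_length=3)
```

In this toy data "the company said" occurs three times overall but has a document frequency of 2, because repeats within one document are counted once.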