ngrams

Functionality for ngram analysis.

Using the Ngrams class

The examples below show how to use the Ngrams class directly to report ngram clusters or ngram frequency lists. The recommended way to access this functionality is through the Conc class, which provides a single interface for frequency lists, concordances, collocation tables, keyword tables and more.

Ngrams class API reference

Ngrams

 Ngrams (corpus:conc.corpus.Corpus)

Class for n-gram analysis reporting.

Parameter	Type	Details
corpus	Corpus	Corpus instance

Ngrams.ngrams

 Ngrams.ngrams (token_str:str, ngram_length:int|None=2,
                ngram_token_position:str='LEFT', normalize_by:int=10000,
                page_size:int=20, page_current:int=1,
                show_all_columns:bool=False,
                exclude_punctuation:bool=True, use_cache:bool=True)

Report ngram frequencies containing a token string.

Parameter	Type	Default	Details
token_str	str		token string to get ngrams for
ngram_length	int | None	2	length of ngram; if set to None, the number of tokens in token_str plus 1 is used
ngram_token_position	str	LEFT	specify whether the token sequence is on the LEFT, RIGHT or MIDDLE (support for other positions is in development)
normalize_by	int	10000	normalize frequencies by a number (e.g. 10000)
page_size	int	20	number of results to display per results page
page_current	int	1	current page of results
show_all_columns	bool	False	if True, return the raw dataframe with all columns; if False, return just ngram and frequency
exclude_punctuation	bool	True	do not return ngrams with punctuation tokens
use_cache	bool	True	retrieve the results from cache if available (currently ignored)
Returns	Result		return a Result object with ngram data
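
To make the ngram_length and ngram_token_position parameters concrete, here is a minimal pure-Python sketch of the matching logic for the LEFT and RIGHT positions. It is not Conc's implementation: the function ngrams_containing and the sample tokens are hypothetical, and lowercasing before matching is an assumption of the sketch. It only illustrates the documented behaviour, including the ngram_length=None case (number of tokens in token_str plus 1).

```python
from collections import Counter

def ngrams_containing(tokens, token_str, ngram_length=2, position="LEFT"):
    """Count ngrams that start (LEFT) or end (RIGHT) with the token sequence.

    Hypothetical helper for illustration only, not part of Conc.
    """
    query = token_str.lower().split()
    if ngram_length is None:
        # documented default: number of tokens in token_str + 1
        ngram_length = len(query) + 1
    lowered = [t.lower() for t in tokens]  # assumption: case-insensitive match
    counts = Counter()
    for i in range(len(lowered) - ngram_length + 1):
        window = lowered[i:i + ngram_length]
        if position == "LEFT":
            matched = window[:len(query)] == query
        elif position == "RIGHT":
            matched = window[-len(query):] == query
        else:
            raise ValueError("only LEFT and RIGHT are sketched here")
        if matched:
            counts[" ".join(window)] += 1
    return counts

tokens = "the highest level since the highest rate".split()

# ngram_length=None -> 3-grams, because "the highest" has two tokens
left = ngrams_containing(tokens, "the highest", ngram_length=None)

# bigrams that end with "highest"
right = ngrams_containing(tokens, "highest", ngram_length=2, position="RIGHT")
```

Here left holds "the highest level" and "the highest rate" once each, and right holds "the highest" twice.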

Examples

See the note above about accessing this functionality through the Conc class.

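# import the classes (the conc.ngrams module path is assumed from this page's name)
from conc.corpus import Corpus
from conc.ngrams import Ngrams

# load the corpus (path_to_reuters_corpus must be defined beforehand)
reuters = Corpus().load(path_to_reuters_corpus)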

# instantiate the Ngrams class
ngrams_reuters = Ngrams(reuters)

# run the ngrams method and display the results
ngrams_reuters.ngrams('environmental', ngram_length = 2, ngram_token_position = 'LEFT').display()

Ngrams for "environmental"
Reuters Corpus
Rank	Ngram	Frequency	Normalized Frequency
1	environmental protection	4	0.03
2	environmental systems	4	0.03
3	environmental services	3	0.02
4	environmental damage	2	0.01
5	environmental regulations	2	0.01
6	environmental impact	2	0.01
7	environmental controls	1	0.01
8	environmental approval	1	0.01
9	environmental and	1	0.01
10	environmental sciences	1	0.01
11	environmental service	1	0.01
12	environmental concerns	1	0.01
13	environmental issues	1	0.01
14	environmental power	1	0.01
15	environmental plan	1	0.01
16	environmental had	1	0.01
17	environmental control	1	0.01
18	environmental management	1	0.01
19	environmental subsidiary	1	0.01
20	environmental was	1	0.01
Report based on word tokens
Ngram length: 2, Token position: left
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 21
Total ngrams: 32
Showing 20 rows
Page 1 of 2
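
The "Normalized Frequency is per 10,000 tokens" line in the report footer can be read as a simple rate. The sketch below assumes the value is the raw frequency divided by the corpus token count and scaled by normalize_by; the function name and the corpus size are hypothetical, and whether Conc's token count includes punctuation tokens is not stated on this page.

```python
def normalized_frequency(frequency, total_tokens, normalize_by=10000):
    """Frequency per `normalize_by` tokens, rounded to two decimals."""
    return round(frequency / total_tokens * normalize_by, 2)

# hypothetical corpus size, not the Reuters token count
total_tokens = 200_000

per_10k = normalized_frequency(50, total_tokens)
per_million = normalized_frequency(50, total_tokens, normalize_by=1_000_000)
```

With these made-up numbers, 50 occurrences in 200,000 tokens is 2.5 per 10,000 tokens, or 250.0 per million.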

# report 3-grams starting with "the highest", 10 rows per page
ngrams_reuters.ngrams('the highest', ngram_length = 3, ngram_token_position = 'LEFT', page_size = 10).display()

Ngrams for "the highest"
Reuters Corpus
Rank	Ngram	Frequency	Normalized Frequency
1	the highest since	8	0.06
2	the highest level	4	0.03
3	the highest in	3	0.02
4	the highest rate	2	0.01
5	the highest interest	2	0.01
6	the highest priority	2	0.01
7	the highest number	2	0.01
8	the highest agriculture	2	0.01
9	the highest such	2	0.01
10	the highest positive	2	0.01
Report based on word tokens
Ngram length: 3, Token position: left
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 31
Total ngrams: 50
Showing 10 rows
Page 1 of 4
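
The page counts in the two report footers above are consistent with dividing the number of unique ngrams by page_size and rounding up. The sketch below works through that arithmetic; page_count and page_slice are hypothetical helpers, and the 1-based page_current convention is an assumption based on the default value of 1.

```python
import math

def page_count(total_unique_ngrams, page_size):
    """Number of result pages needed to show every unique ngram."""
    return math.ceil(total_unique_ngrams / page_size)

def page_slice(rows, page_size, page_current):
    """Rows shown on a 1-based page number."""
    start = (page_current - 1) * page_size
    return rows[start:start + page_size]

pages_first_report = page_count(21, 20)    # 21 unique ngrams, 20 per page
pages_second_report = page_count(31, 10)   # 31 unique ngrams, 10 per page

# the last page of the first report would hold the one remaining row
last_page = page_slice(list(range(1, 22)), page_size=20, page_current=2)
```

This gives 2 and 4 pages, matching "Page 1 of 2" and "Page 1 of 4" in the reports.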

Ngrams.ngram_frequencies

 Ngrams.ngram_frequencies (ngram_length:int=2, case_sensitive:bool=False,
                           normalize_by:int=10000, page_size:int=20,
                           page_current:int=1,
                           show_document_frequency:bool=False,
                           exclude_punctuation:bool=True)

Report frequent ngrams.

Parameter	Type	Default	Details
ngram_length	int	2	length of ngram
case_sensitive	bool	False	if True, count ngrams with case preserved; if False, lowercase tokens before counting
normalize_by	int	10000	normalize frequencies by a number (e.g. 10000)
page_size	int	20	number of results to display per results page
page_current	int	1	current page of results
show_document_frequency	bool	False	show document frequency in output
exclude_punctuation	bool	True	exclude ngrams containing punctuation tokens
Returns	Result		return a Result object with the frequency table

ngram_frequencies is currently the slowest operation in Conc and will be optimised in the future.
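
As a rough illustration of what the case_sensitive and exclude_punctuation parameters control, the following pure-Python sketch counts ngrams with a sliding window. It is a simplified stand-in rather than Conc's implementation: the function count_ngrams, the punctuation test and the sample tokens are all assumptions made for the example.

```python
from collections import Counter
import string

def count_ngrams(tokens, ngram_length=2, case_sensitive=False,
                 exclude_punctuation=True):
    """Count ngrams over a token list (hypothetical helper, not Conc's API)."""
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
    counts = Counter()
    for i in range(len(tokens) - ngram_length + 1):
        window = tokens[i:i + ngram_length]
        # assumption: a punctuation token consists only of punctuation characters
        has_punctuation = any(
            all(ch in string.punctuation for ch in t) for t in window
        )
        if exclude_punctuation and has_punctuation:
            continue
        counts[" ".join(window)] += 1
    return counts

tokens = ["The", "company", "said", ",", "the", "company", "said"]

folded = count_ngrams(tokens, ngram_length=3)
exact = count_ngrams(tokens, ngram_length=3, case_sensitive=True)
```

In this toy input, folded counts "the company said" twice, while exact keeps "The company said" and "the company said" as separate entries. Windows that include the comma are skipped in both cases.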

Examples

See the note above about accessing this functionality through the Conc class.

ngrams_reuters.ngram_frequencies(ngram_length = 3, case_sensitive = False).display()

Ngram Frequencies
Reuters Corpus
Rank	Ngram	Frequency	Normalized Frequency
1	the company said	1,173	8.39
2	mln dlrs in	795	5.68
3	cts vs loss	665	4.75
4	said it has	636	4.55
5	mln avg shrs	620	4.43
6	pct of the	608	4.35
7	the united states	603	4.31
8	qtr net shr	574	4.10
9	dlrs a share	546	3.90
10	inc said it	523	3.74
11	the company 's	518	3.70
12	cts net loss	517	3.70
13	the end of	501	3.58
14	cts a share	494	3.53
15	is expected to	429	3.07
16	corp said it	412	2.95
17	nine mths shr	412	2.95
18	said in a	407	2.91
19	the bank of	380	2.72
20	billion dlrs in	373	2.67
Report based on word tokens
Ngram length: 3
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 684,778
Total ngrams: 1,128,352
Showing 20 rows
Page 1 of 34239

ngrams_reuters.ngram_frequencies(ngram_length = 3, case_sensitive = True).display()

Ngram Frequencies
Reuters Corpus
Rank	Ngram	Frequency	Normalized Frequency
1	The company said	747	5.34
2	mln dlrs in	726	5.19
3	cts vs loss	645	4.61
4	said it has	632	4.52
5	mln Avg shrs	615	4.40
6	pct of the	608	4.35
7	QTR NET Shr	559	4.00
8	the United States	524	3.75
9	dlrs a share	519	3.71
10	Inc said it	514	3.67
11	cts Net loss	509	3.64
12	the end of	501	3.58
13	cts a share	490	3.50
14	the company 's	476	3.40
15	the company said	426	3.05
16	is expected to	426	3.05
17	said in a	407	2.91
18	Corp said it	392	2.80
19	Nine mths Shr	370	2.65
20	cts Oper net	363	2.60
Report based on word tokens
Ngram length: 3
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 702,051
Total ngrams: 1,128,352
Showing 20 rows
Page 1 of 35103
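
The two reports above can be cross-checked against each other: lowercasing the case-sensitive rows and summing them should give the case-insensitive counts. The sketch below does this for one ngram using frequencies copied from the case-sensitive table. Only the two variants visible in the top 20 are included; in this case they account for the full case-insensitive count.

```python
from collections import Counter

# rows copied from the case-sensitive report above
case_sensitive_rows = {
    "The company said": 747,
    "the company said": 426,
}

# fold case variants together, as case_sensitive=False would
folded = Counter()
for ngram, frequency in case_sensitive_rows.items():
    folded[ngram.lower()] += frequency

combined = folded["the company said"]
```

The combined value is 1,173, which is the frequency reported for "the company said" in the case-insensitive table.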

ngrams_reuters.ngram_frequencies(ngram_length = 3, case_sensitive = False, show_document_frequency = True).display()

Ngram Frequencies
Reuters Corpus
Rank	Ngram	Frequency	Normalized Frequency	Document Frequency
1	the company said	1,173	8.39	911
2	mln dlrs in	795	5.68	549
3	cts vs loss	665	4.75	474
4	said it has	636	4.55	586
5	mln avg shrs	620	4.43	450
6	pct of the	608	4.35	508
7	the united states	603	4.31	391
8	qtr net shr	574	4.10	573
9	dlrs a share	546	3.90	375
10	inc said it	523	3.74	521
11	the company 's	518	3.70	406
12	cts net loss	517	3.70	389
13	the end of	501	3.58	384
14	cts a share	494	3.53	290
15	is expected to	429	3.07	381
16	corp said it	412	2.95	410
17	nine mths shr	412	2.95	412
18	said in a	407	2.91	393
19	the bank of	380	2.72	304
20	billion dlrs in	373	2.67	235
Report based on word tokens
Ngram length: 3
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 684,778
Total ngrams: 1,128,352
Showing 20 rows
Page 1 of 34239
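
The Document Frequency column counts documents rather than occurrences, so it can never exceed Frequency (for example, "qtr net shr" occurs 574 times across 573 documents). The sketch below shows the distinction on a toy corpus. It assumes ngrams do not cross document boundaries; the function name and the sample documents are hypothetical, not part of Conc.

```python
from collections import Counter

def ngram_document_frequency(documents, ngram_length=2):
    """Return (frequency, document_frequency) counters for a list of token lists."""
    frequency = Counter()
    document_frequency = Counter()
    for tokens in documents:
        tokens = [t.lower() for t in tokens]
        ngrams = [
            " ".join(tokens[i:i + ngram_length])
            for i in range(len(tokens) - ngram_length + 1)
        ]
        frequency.update(ngrams)                # every occurrence counts
        document_frequency.update(set(ngrams))  # at most once per document
    return frequency, document_frequency

documents = [
    "the company said the company said".split(),
    "the company said".split(),
    "the bank of".split(),
]

frequency, document_frequency = ngram_document_frequency(documents, ngram_length=3)
```

Here "the company said" has a frequency of 3 but a document frequency of 2, because two of its occurrences fall in the same document.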