Performance

Information on Conc performance across different corpus sizes.

This page reports timings for building and loading corpora and for running Conc report methods on corpora of different sizes. All timings were recorded on a machine with an Intel Core i7-14700F, an NVMe SSD and 16GB of usable RAM, running under WSL.

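import logging

from conc.corpus import Corpus
from conc.conc import Conc

logger = logging.getLogger(__name__)
# Directory holding the test corpora, taken from the Corpus Path values in the summaries below
save_path = '/home/geoff/data/conc-test-corpora/'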

test_corpora = {
                'us-congressional-speeches-subset-10k': 'US Congressional Speeches Subset 10k',
                'us-congressional-speeches-subset-100k': 'US Congressional Speeches Subset 100k',
                'us-congressional-speeches-subset-200k': 'US Congressional Speeches Subset 200k',
                'us-congressional-speeches-subset-500k': 'US Congressional Speeches Subset 500k'
                }

Corpus build time varies from about 4 seconds for a 2m token data source (10k texts) to about 150 seconds for a 100m token data source (500k texts). Building corpora larger than this currently requires a large amount of RAM. Work on memory management is ongoing, and this should improve as the new Polars streaming engine matures. This is on the Roadmap for the library.

corpora = {}
for slug, name in test_corpora.items():
    logger.info(f'Starting {name} build ...')
    description = f'1 million speeches sampled from https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset to create corpora of varying sizes for development and testing. The dataset card at Huggingface is empty, so there is no further information available on the contents. The title indicates how many speeches are included in this corpus. '
    try:
        ...  # the corpus build call is not shown on this page
    except Exception as e:
        raise e

CPU times: user 4.45 s, sys: 224 ms, total: 4.67 s
Wall time: 3.82 s
CPU times: user 46.3 s, sys: 2.45 s, total: 48.7 s
Wall time: 30 s
CPU times: user 1min 38s, sys: 10.8 s, total: 1min 49s
Wall time: 1min 2s
CPU times: user 3min 55s, sys: 32 s, total: 4min 27s
Wall time: 2min 26s
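
The build timings can be turned into a rough throughput figure. The sketch below pairs each wall time above with the Token Count reported in the corresponding Corpus Summary on this page. The pairing assumes the timings are listed in the same order as `test_corpora`, and the variable names are illustrative only.

```python
# (token count, build wall time in seconds) per corpus, copied from this page.
# Assumption: timings appear in the same order as the test_corpora dict.
builds = {
    '10k': (1_954_972, 3.82),
    '100k': (19_927_241, 30.0),
    '200k': (39_963_039, 62.0),
    '500k': (99_902_593, 146.0),
}

# Tokens processed per second of wall time for each build
throughput = {label: tokens / seconds for label, (tokens, seconds) in builds.items()}

for label, rate in throughput.items():
    print(f'{label}: {rate:,.0f} tokens/s')
```

On these figures the build rate sits at roughly 0.5 to 0.7 million tokens per second across all four corpora, which suggests build time grows close to linearly with corpus size on this machine.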

Corpora are loaded lazily, meaning large data tables are only accessed when required. As a result, load times are similar regardless of corpus size (roughly 220 to 270 ms in the runs below):

for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    corpus.summary()
    del corpus

CPU times: user 211 ms, sys: 15.7 ms, total: 227 ms
Wall time: 266 ms

Corpus Summary

Attribute	Value
Name	US Congressional Speeches Subset 10k
Description	1 million speeches sampled from https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset to create corpora of varying sizes for development and testing. The dataset card at Huggingface is empty, so there is no further information available on the contents. The title indicates how many speeches are included in this corpus.
Date Created	2025-06-09 15:03:14
Conc Version	0.0.1
Corpus Path	/home/geoff/data/conc-test-corpora/us-congressional-speeches-subset-10k.corpus
Document Count	10,000
Token Count	1,954,972
Word Token Count	1,767,904
Unique Tokens	50,640
Unique Word Tokens	50,520

CPU times: user 182 ms, sys: 27.6 ms, total: 209 ms
Wall time: 220 ms

Corpus Summary

Attribute	Value
Name	US Congressional Speeches Subset 100k
Description	1 million speeches sampled from https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset to create corpora of varying sizes for development and testing. The dataset card at Huggingface is empty, so there is no further information available on the contents. The title indicates how many speeches are included in this corpus.
Date Created	2025-06-09 15:03:44
Conc Version	0.0.1
Corpus Path	/home/geoff/data/conc-test-corpora/us-congressional-speeches-subset-100k.corpus
Document Count	100,000
Token Count	19,927,241
Word Token Count	18,020,769
Unique Tokens	214,502
Unique Word Tokens	214,175

CPU times: user 209 ms, sys: 0 ns, total: 209 ms
Wall time: 219 ms

Corpus Summary

Attribute	Value
Name	US Congressional Speeches Subset 200k
Description	1 million speeches sampled from https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset to create corpora of varying sizes for development and testing. The dataset card at Huggingface is empty, so there is no further information available on the contents. The title indicates how many speeches are included in this corpus.
Date Created	2025-06-09 15:04:47
Conc Version	0.0.1
Corpus Path	/home/geoff/data/conc-test-corpora/us-congressional-speeches-subset-200k.corpus
Document Count	200,000
Token Count	39,963,039
Word Token Count	36,136,744
Unique Tokens	345,631
Unique Word Tokens	345,310

CPU times: user 207 ms, sys: 0 ns, total: 207 ms
Wall time: 217 ms

Corpus Summary

Attribute	Value
Name	US Congressional Speeches Subset 500k
Description	1 million speeches sampled from https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset to create corpora of varying sizes for development and testing. The dataset card at Huggingface is empty, so there is no further information available on the contents. The title indicates how many speeches are included in this corpus.
Date Created	2025-06-09 15:07:14
Conc Version	0.0.1
Corpus Path	/home/geoff/data/conc-test-corpora/us-congressional-speeches-subset-500k.corpus
Document Count	500,000
Token Count	99,902,593
Word Token Count	90,341,944
Unique Tokens	655,344
Unique Word Tokens	654,824
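
The near-constant load times above follow from lazy loading: summary information comes from small metadata, and large tables are only read when a report needs them. The sketch below illustrates the general pattern in plain Python. `LazyCorpus` and its attributes are hypothetical and are not Conc's classes or implementation.

```python
from functools import cached_property


class LazyCorpus:
    """Hypothetical stand-in for a corpus object; not Conc's actual class."""

    def __init__(self, metadata, token_loader):
        self.metadata = metadata            # small, cheap to read
        self._token_loader = token_loader   # callable that reads the large table
        self.loads = 0                      # counts how often the table is read

    @cached_property
    def tokens(self):
        # Runs only on first access; the result is cached afterwards
        self.loads += 1
        return self._token_loader()

    def summary(self):
        # Uses metadata only, so it never touches the token table
        return f"{self.metadata['name']}: {self.metadata['token_count']:,} tokens"


corpus = LazyCorpus({'name': 'demo', 'token_count': 3}, lambda: ['a', 'b', 'c'])
summary = corpus.summary()
loads_after_summary = corpus.loads   # still 0: nothing large was read
first_token = corpus.tokens[0]       # first access triggers the load
loads_after_access = corpus.loads
```

With this pattern the cost of producing a summary does not depend on how large the token table is, which matches the flat load times reported above.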

for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)

    del corpus

Frequencies
Frequencies of word tokens, US Congressional Speeches Subset 10k
Rank	Token	Frequency	Normalized Frequency
1	the	135,984	769.18
2	of	67,597	382.36
3	to	60,132	340.13
4	and	44,832	253.59
5	in	36,959	209.06
6	that	34,135	193.08
7	a	29,557	167.19
8	i	29,329	165.90
9	is	25,175	142.40
10	this	19,173	108.45
Report based on word tokens
Normalized Frequency is per 10,000 tokens
Total word tokens: 1,767,904
Unique word tokens: 50,520
Showing 10 rows
Page 1 of 5053

CPU times: user 22.9 ms, sys: 10.1 ms, total: 33 ms
Wall time: 37.3 ms

Frequencies
Frequencies of word tokens, US Congressional Speeches Subset 100k
Rank	Token	Frequency	Normalized Frequency
1	the	1,389,439	771.02
2	of	687,127	381.30
3	to	610,266	338.65
4	and	459,220	254.83
5	in	379,946	210.84
6	that	346,216	192.12
7	a	302,256	167.73
8	i	297,077	164.85
9	is	250,677	139.10
10	this	192,933	107.06
Report based on word tokens
Normalized Frequency is per 10,000 tokens
Total word tokens: 18,020,769
Unique word tokens: 214,175
Showing 10 rows
Page 1 of 21418

CPU times: user 61.5 ms, sys: 38 ms, total: 99.4 ms
Wall time: 46.1 ms

Frequencies
Frequencies of word tokens, US Congressional Speeches Subset 200k
Rank	Token	Frequency	Normalized Frequency
1	the	2,781,475	769.71
2	of	1,377,003	381.05
3	to	1,225,404	339.10
4	and	922,720	255.34
5	in	760,867	210.55
6	that	695,665	192.51
7	a	606,747	167.90
8	i	593,766	164.31
9	is	504,385	139.58
10	this	386,922	107.07
Report based on word tokens
Normalized Frequency is per 10,000 tokens
Total word tokens: 36,136,744
Unique word tokens: 345,310
Showing 10 rows
Page 1 of 34532

CPU times: user 53.7 ms, sys: 78.1 ms, total: 132 ms
Wall time: 49.8 ms

Frequencies
Frequencies of word tokens, US Congressional Speeches Subset 500k
Rank	Token	Frequency	Normalized Frequency
1	the	6,951,503	769.47
2	of	3,446,705	381.52
3	to	3,059,159	338.62
4	and	2,308,134	255.49
5	in	1,902,118	210.55
6	that	1,737,689	192.35
7	a	1,514,676	167.66
8	i	1,481,424	163.98
9	is	1,261,935	139.68
10	this	966,165	106.95
Report based on word tokens
Normalized Frequency is per 10,000 tokens
Total word tokens: 90,341,944
Unique word tokens: 654,824
Showing 10 rows
Page 1 of 65483

CPU times: user 104 ms, sys: 105 ms, total: 210 ms
Wall time: 53.2 ms
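
The Normalized Frequency column in these reports can be checked against the raw counts. This is a minimal sketch, assuming the value is the token frequency divided by the total word tokens and scaled to 10,000 tokens. The helper name is hypothetical and not part of Conc.

```python
def normalized_frequency(frequency, total_tokens, per=10_000):
    """Return the frequency expressed per `per` tokens."""
    return frequency / total_tokens * per


# "the" in the 10k corpus: 135,984 occurrences in 1,767,904 word tokens
the_10k = normalized_frequency(135_984, 1_767_904)
print(round(the_10k, 2))
```

The result agrees with the 769.18 reported for "the" in the 10k corpus.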

for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)

    del corpus

Ngrams for "economy"
US Congressional Speeches Subset 10k
Rank	Ngram	Frequency	Normalized Frequency
1	the economy	94	0.53
2	our economy	59	0.33
3	of economy	23	0.13
4	american economy	11	0.06
5	for economy	8	0.05
Report based on word tokens
Ngram length: 2, Token position: right
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 106
Total ngrams: 355
Showing 5 rows
Page 1 of 22

CPU times: user 49.3 ms, sys: 28 ms, total: 77.3 ms
Wall time: 48.8 ms

Ngrams for "economy"
US Congressional Speeches Subset 100k
Rank	Ngram	Frequency	Normalized Frequency
1	the economy	930	0.52
2	our economy	643	0.36
3	of economy	203	0.11
4	american economy	116	0.06
5	national economy	84	0.05
Report based on word tokens
Ngram length: 2, Token position: right
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 464
Total ngrams: 3,725
Showing 5 rows
Page 1 of 93

CPU times: user 338 ms, sys: 57 ms, total: 395 ms
Wall time: 198 ms

Ngrams for "economy"
US Congressional Speeches Subset 200k
Rank	Ngram	Frequency	Normalized Frequency
1	the economy	1,924	0.53
2	our economy	1,312	0.36
3	of economy	401	0.11
4	american economy	242	0.07
5	national economy	172	0.05
Report based on word tokens
Ngram length: 2, Token position: right
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 682
Total ngrams: 7,668
Showing 5 rows
Page 1 of 137

CPU times: user 578 ms, sys: 233 ms, total: 811 ms
Wall time: 435 ms

Ngrams for "economy"
US Congressional Speeches Subset 500k
Rank	Ngram	Frequency	Normalized Frequency
1	the economy	4,818	0.53
2	our economy	3,258	0.36
3	of economy	1,039	0.12
4	american economy	588	0.07
5	national economy	448	0.05
Report based on word tokens
Ngram length: 2, Token position: right
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 1,193
Total ngrams: 19,211
Showing 5 rows
Page 1 of 239

CPU times: user 1.66 s, sys: 552 ms, total: 2.21 s
Wall time: 1.02 s
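
The reports above list two-token ngrams with "economy" in the right-most position. The sketch below shows one way such counts could be produced from a token list. It is a simplified illustration in plain Python, not Conc's implementation: the function name is hypothetical, matching is done on lowercased tokens, and punctuation filtering is left out.

```python
from collections import Counter


def ngrams_ending_with(tokens, node, n=2):
    """Count n-token sequences whose last token equals `node` (case-insensitive)."""
    node = node.lower()
    lowered = [token.lower() for token in tokens]
    counts = Counter()
    # Start at n - 1 so there are always enough tokens to the left
    for i in range(n - 1, len(lowered)):
        if lowered[i] == node:
            counts[' '.join(lowered[i - n + 1:i + 1])] += 1
    return counts


sample = 'The economy is growing and our economy needs the economy'.split()
counts = ngrams_ending_with(sample, 'economy')
print(counts.most_common())
```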

# still working on this!
for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)

    del corpus

Ngram Frequencies
US Congressional Speeches Subset 10k
Rank	Ngram	Frequency	Normalized Frequency
1	of the	22,312	126.21
2	in the	10,982	62.12
3	to the	9,119	51.58
4	it is	5,140	29.07
5	that the	5,123	28.98
Report based on word tokens
Ngram length: 2
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 396,623
Total ngrams: 1,584,710
Showing 5 rows
Page 1 of 79325

CPU times: user 1.5 s, sys: 147 ms, total: 1.65 s
Wall time: 209 ms

Ngram Frequencies
US Congressional Speeches Subset 100k
Rank	Ngram	Frequency	Normalized Frequency
1	of the	227,943	126.49
2	in the	114,241	63.39
3	to the	92,967	51.59
4	it is	51,659	28.67
5	that the	51,620	28.64
Report based on word tokens
Ngram length: 2
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 2,046,190
Total ngrams: 16,153,485
Showing 5 rows
Page 1 of 409238

CPU times: user 35.7 s, sys: 1.09 s, total: 36.8 s
Wall time: 831 ms

Ngram Frequencies
US Congressional Speeches Subset 200k
Rank	Ngram	Frequency	Normalized Frequency
1	of the	457,057	126.48
2	in the	228,891	63.34
3	to the	186,449	51.60
4	it is	103,619	28.67
5	that the	103,418	28.62
Report based on word tokens
Ngram length: 2
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 3,304,755
Total ngrams: 32,389,849
Showing 5 rows
Page 1 of 660951

CPU times: user 1min 9s, sys: 2.22 s, total: 1min 11s
Wall time: 4.33 s

Ngram Frequencies
US Congressional Speeches Subset 500k
Rank	Ngram	Frequency	Normalized Frequency
1	of the	1,140,304	126.22
2	in the	570,295	63.13
3	to the	467,816	51.78
4	it is	259,770	28.75
5	that the	258,068	28.57
Report based on word tokens
Ngram length: 2
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 6,158,427
Total ngrams: 80,976,586
Showing 5 rows
Page 1 of 1231686

CPU times: user 3min 16s, sys: 10.2 s, total: 3min 26s
Wall time: 11.9 s

for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)

    del corpus

Concordance for "economy"
US Congressional Speeches Subset 10k, Context tokens: 5, Order: 1R2R3R
Document Id	Left	Node	Right
5,878	ruled by a government of	economy	.
1,163	. help strengthen our Nations	economy	.
316	otherwise generally strong and prosperous	economy	.
6,910	this critical sector in our	economy	.
9,517	health care pressures in this	economy	.
Total Concordance Lines: 358
Total Documents: 251
Showing 5 lines
Page 1 of 72

CPU times: user 89.7 ms, sys: 1.14 ms, total: 90.8 ms
Wall time: 61.2 ms

Concordance for "economy"
US Congressional Speeches Subset 100k, Context tokens: 5, Order: 1R2R3R
Document Id	Left	Node	Right
82,659	amounts that it throws our	economy
75,018	Honey . I shrun the	economy	" ? It is honest
19,176	getting away from " Coolidge	economy	" already . and making
6,729	. We are talking "	economy	" and at the same
83,170	further into an " innovating	economy	" based on a highly
Total Concordance Lines: 3758
Total Documents: 2684
Showing 5 lines
Page 1 of 752

CPU times: user 414 ms, sys: 340 ms, total: 755 ms
Wall time: 437 ms

Concordance for "economy"
US Congressional Speeches Subset 200k, Context tokens: 5, Order: 1R2R3R
Document Id	Left	Node	Right
77,084	the way it is .	ECONOMY
6,026	its central office . Political	Economy
130,531	the maintenance of her national	economy
20,685	on something else . Coolidge	economy	! I am for it
132,603	railroads of this country .	Economy	! What about this pitpible
Total Concordance Lines: 7753
Total Documents: 5480
Showing 5 lines
Page 1 of 1551

CPU times: user 871 ms, sys: 596 ms, total: 1.47 s
Wall time: 831 ms

Concordance for "economy"
US Congressional Speeches Subset 500k, Context tokens: 5, Order: 1R2R3R
Document Id	Left	Node	Right
140,837	prayers are with them .	ECONOMY
162,997	its central office . Political	Economy
325,086	WHAT ARE CoNDrrONS IN THE	ECONOMY
64,711	country . Condition of Nations	Economy
360,787	country ! This spasm of	economy	!
Total Concordance Lines: 19399
Total Documents: 13564
Showing 5 lines
Page 1 of 3880

CPU times: user 2.83 s, sys: 1.42 s, total: 4.24 s
Wall time: 1.86 s

reference = Corpus().load(f'{save_path}brown.corpus')
for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)
    conc.set_reference_corpus(reference)

    del corpus

Keywords
Target corpus: US Congressional Speeches Subset 10k, Reference corpus: Brown Corpus
Rank	Token	Frequency	Frequency Reference	Normalized Frequency	Normalized Frequency Reference	Relative Risk	Log Ratio	Log Likelihood
1	unanimous	907	5	5.13	0.05	100.57	6.65	748.42
2	amendment	4,039	24	22.85	0.24	93.30	6.54	3,318.48
3	appropriation	716	5	4.05	0.05	79.39	6.31	582.28
4	senator	5,488	39	31.04	0.40	78.02	6.29	4,457.76
5	subcommittee	585	5	3.31	0.05	64.87	6.02	468.73
Report based on word tokens
Filtered tokens by minimum frequency in target corpus (5), minimum frequency in reference corpus (5)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 1,767,904
Total word tokens in reference corpus: 980,144
Keywords: 8,291
Showing 5 rows
Page 1 of 1659

CPU times: user 369 ms, sys: 220 ms, total: 589 ms
Wall time: 94.3 ms

Keywords
Target corpus: US Congressional Speeches Subset 100k, Reference corpus: Brown Corpus
Rank	Token	Frequency	Frequency Reference	Normalized Frequency	Normalized Frequency Reference	Relative Risk	Log Ratio	Log Likelihood
1	unanimous	8,978	5	4.98	0.05	97.66	6.61	895.70
2	amendment	39,940	24	22.16	0.24	90.51	6.50	3,968.88
3	appropriation	6,847	5	3.80	0.05	74.48	6.22	672.68
4	senator	52,772	39	29.28	0.40	73.60	6.20	5,180.64
5	gentleman	32,178	28	17.86	0.29	62.51	5.97	3,123.80
Report based on word tokens
Filtered tokens by minimum frequency in target corpus (5), minimum frequency in reference corpus (5)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 18,020,769
Total word tokens in reference corpus: 980,144
Keywords: 12,136
Showing 5 rows
Page 1 of 2428

CPU times: user 2.78 s, sys: 417 ms, total: 3.19 s
Wall time: 274 ms

Keywords
Target corpus: US Congressional Speeches Subset 200k, Reference corpus: Brown Corpus
Rank	Token	Frequency	Frequency Reference	Normalized Frequency	Normalized Frequency Reference	Relative Risk	Log Ratio	Log Likelihood
1	unanimous	17,813	5	4.93	0.05	96.63	6.59	897.98
2	amendment	80,078	24	22.16	0.24	90.50	6.50	4,023.10
3	appropriation	13,896	5	3.85	0.05	75.38	6.24	690.81
4	senator	105,824	39	29.28	0.40	73.60	6.20	5,252.88
5	gentleman	63,852	28	17.67	0.29	61.85	5.95	3,132.10
Report based on word tokens
Filtered tokens by minimum frequency in target corpus (5), minimum frequency in reference corpus (5)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 36,136,744
Total word tokens in reference corpus: 980,144
Keywords: 12,704
Showing 5 rows
Page 1 of 2541

CPU times: user 6.71 s, sys: 451 ms, total: 7.16 s
Wall time: 516 ms

Keywords
Target corpus: US Congressional Speeches Subset 500k, Reference corpus: Brown Corpus
Rank	Token	Frequency	Frequency Reference	Normalized Frequency	Normalized Frequency Reference	Relative Risk	Log Ratio	Log Likelihood
1	unanimous	44,193	5	4.89	0.05	95.89	6.58	898.23
2	amendment	198,132	24	21.93	0.24	89.57	6.48	4,012.78
3	appropriation	34,215	5	3.79	0.05	74.24	6.21	685.45
4	senator	264,478	39	29.28	0.40	73.57	6.20	5,295.45
5	gentleman	159,877	28	17.70	0.29	61.95	5.95	3,163.94
Report based on word tokens
Filtered tokens by minimum frequency in target corpus (5), minimum frequency in reference corpus (5)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 90,341,944
Total word tokens in reference corpus: 980,144
Keywords: 13,118
Showing 5 rows
Page 1 of 2624

CPU times: user 20 s, sys: 802 ms, total: 20.8 s
Wall time: 1.17 s
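
The keyness columns can be checked by hand. The sketch below assumes Relative Risk is the ratio of the two relative frequencies, Log Ratio is its base-2 logarithm, and Log Likelihood is the two-term statistic computed from observed and expected frequencies. These formulas reproduce the reported values for "unanimous" in the 10k corpus, but they are inferred from the output rather than taken from Conc's source, and the function name is hypothetical.

```python
import math


def keyness(freq_target, freq_ref, total_target, total_ref):
    """Return (relative risk, log ratio, log likelihood) for one token.

    Both frequencies must be greater than zero, which the minimum
    frequency filter of 5 in the reports above guarantees.
    """
    relative_risk = (freq_target / total_target) / (freq_ref / total_ref)
    log_ratio = math.log2(relative_risk)

    # Expected frequencies if the token were equally common in both corpora
    combined_rate = (freq_target + freq_ref) / (total_target + total_ref)
    expected_target = total_target * combined_rate
    expected_ref = total_ref * combined_rate

    log_likelihood = 2 * (
        freq_target * math.log(freq_target / expected_target)
        + freq_ref * math.log(freq_ref / expected_ref)
    )
    return relative_risk, log_ratio, log_likelihood


# "unanimous": 907 of 1,767,904 target tokens, 5 of 980,144 reference tokens
rr, lr, ll = keyness(907, 5, 1_767_904, 980_144)
print(round(rr, 2), round(lr, 2), round(ll, 2))
```

The output is in line with the reported row for "unanimous" (Relative Risk 100.57, Log Ratio 6.65, Log Likelihood 748.42).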

for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)

    del corpus

Collocates of "economy"
US Congressional Speeches Subset 10k
Rank	Token	Collocate Frequency	Frequency	Logdice	Log Likelihood
1	economy	20	358	9.84	248.59
2	healthy	10	50	9.65	74.41
3	segment	9	24	9.59	80.17
4	our	93	5,938	8.92	221.67
5	false	6	55	8.89	36.86
Report based on word tokens
Context tokens left: 5, context tokens right: 5
Filtered tokens by minimum collocation frequency (5)
Unique collocates: 115
Showing 5 rows
Page 1 of 24

CPU times: user 123 ms, sys: 21.9 ms, total: 145 ms
Wall time: 54.8 ms

Collocates of "economy"
US Congressional Speeches Subset 100k
Rank	Token	Collocate Frequency	Frequency	Logdice	Log Likelihood
1	our	1,084	60,051	9.12	2,801.70
2	efficiency	60	732	8.77	329.93
3	stimulate	51	299	8.69	358.80
4	global	55	618	8.69	311.68
5	jobs	83	3,100	8.63	274.52
Report based on word tokens
Context tokens left: 5, context tokens right: 5
Filtered tokens by minimum collocation frequency (5)
Unique collocates: 864
Showing 5 rows
Page 1 of 173

CPU times: user 413 ms, sys: 244 ms, total: 657 ms
Wall time: 328 ms

Collocates of "economy"
US Congressional Speeches Subset 200k
Rank	Token	Collocate Frequency	Frequency	Logdice	Log Likelihood
1	our	2,219	121,489	9.14	5,670.76
2	global	119	1,221	8.76	689.99
3	sector	119	1,741	8.68	604.09
4	stimulate	101	611	8.63	698.06
5	jobs	166	6,312	8.60	534.83
Report based on word tokens
Context tokens left: 5, context tokens right: 5
Filtered tokens by minimum collocation frequency (5)
Unique collocates: 1,524
Showing 5 rows
Page 1 of 305

CPU times: user 710 ms, sys: 212 ms, total: 922 ms
Wall time: 523 ms

Collocates of "economy"
US Congressional Speeches Subset 500k
Rank	Token	Collocate Frequency	Frequency	Logdice	Log Likelihood
1	our	5,656	304,919	9.16	14,596.00
2	stimulate	267	1,472	8.71	1,898.46
3	global	283	2,924	8.70	1,636.06
4	jobs	418	15,339	8.62	1,373.41
5	economy	446	19,399	8.56	5,491.13
Report based on word tokens
Context tokens left: 5, context tokens right: 5
Filtered tokens by minimum collocation frequency (5)
Unique collocates: 2,786
Showing 5 rows
Page 1 of 558

CPU times: user 2.14 s, sys: 589 ms, total: 2.73 s
Wall time: 1.34 s
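
The Logdice column is consistent with the standard logDice formula, 14 + log2(2 * f_xy / (f_x + f_y)), where f_xy is the collocate frequency, f_x the frequency of the node and f_y the frequency of the collocate. The sketch below checks this against the 10k report, taking the node frequency for "economy" to be 358 as shown in that table. The function name is hypothetical, and the formula is inferred from the output rather than from Conc's source.

```python
import math


def log_dice(collocate_frequency, node_frequency, token_frequency):
    """logDice association score for a node and collocate pair."""
    return 14 + math.log2(
        2 * collocate_frequency / (node_frequency + token_frequency)
    )


# "our" near "economy" in the 10k corpus: 93 co-occurrences,
# node frequency 358, collocate frequency 5,938
our_10k = log_dice(93, 358, 5_938)
healthy_10k = log_dice(10, 358, 50)
print(round(our_10k, 2), round(healthy_10k, 2))
```

Both values agree with the reported scores of 8.92 for "our" and 9.65 for "healthy".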