Information on Conc performance across different corpus sizes.
This page reports timings for building and loading corpora, and for Conc report methods, across corpora of different sizes. The tests ran on a machine with an Intel Core i7-14700F, an NVMe SSD and 16GB of usable RAM under WSL.
from conc.corpus import Corpus
from conc.conc import Conc
Corpus build time ranges from about 4 seconds for a 2m-token data source (10k texts) to about 150 seconds for a 100m-token data source (500k texts). Building corpora larger than this currently requires a large amount of RAM. Work on memory management is ongoing, and memory use should improve as Polars' new streaming engine matures. This is on the library's Roadmap.
corpora = {}
for slug, name in test_corpora.items():
    logger.info(f'Starting {name} build ...')
    description = f'1 million speeches sampled from https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset to create corpora of varying sizes for development and testing. The dataset card at Huggingface is empty, so there is no further information available on the contents. The title indicates how many speeches are included in this corpus. '
    try:
        ...
    except Exception as e:
        raise e
CPU times: user 4.45 s, sys: 224 ms, total: 4.67 s
Wall time: 3.82 s
CPU times: user 46.3 s, sys: 2.45 s, total: 48.7 s
Wall time: 30 s
CPU times: user 1min 38s, sys: 10.8 s, total: 1min 49s
Wall time: 1min 2s
CPU times: user 3min 55s, sys: 32 s, total: 4min 27s
Wall time: 2min 26s
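To put the build timings on a common scale, the sketch below converts them into word tokens per second. It assumes the four timing pairs above correspond, in order, to the 10k, 100k, 200k and 500k corpora, and it uses the word-token totals from the report footers further down this page as the size measure. The `builds` dictionary and `tokens_per_second` helper are illustrative names, not part of Conc.

```python
# Word-token totals are taken from the report footers on this page.
# Wall times are the build timings above, converted to seconds.
# Assumption: the timings appear in the order 10k, 100k, 200k, 500k.
builds = {
    "10k": (1_767_904, 3.82),
    "100k": (18_020_769, 30.0),
    "200k": (36_136_744, 62.0),   # 1min 2s
    "500k": (90_341_944, 146.0),  # 2min 26s
}


def tokens_per_second(word_tokens: int, wall_seconds: float) -> float:
    """Return build throughput in word tokens per second of wall time."""
    return word_tokens / wall_seconds


throughput = {label: tokens_per_second(*values) for label, values in builds.items()}
for label, rate in throughput.items():
    print(f"{label}: {rate:,.0f} word tokens per second")
```

Under these assumptions every corpus builds at between roughly 0.4 and 0.7 million word tokens per second, which suggests build time grows approximately linearly with corpus size over this range.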
Corpora are loaded lazily, meaning large data tables are only accessed when required. As a result, load times are similar regardless of corpus size (roughly 0.2 seconds for each corpus below).
for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')  # the timed step
    corpus.summary()
    del corpus
CPU times: user 211 ms, sys: 15.7 ms, total: 227 ms
Wall time: 266 ms
Corpus Summary
| Attribute | Value |
|---|---|
| Name | US Congressional Speeches Subset 10k |
| Description | 1 million speeches sampled from https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset to create corpora of varying sizes for development and testing. The dataset card at Huggingface is empty, so there is no further information available on the contents. The title indicates how many speeches are included in this corpus. |
CPU times: user 182 ms, sys: 27.6 ms, total: 209 ms
Wall time: 220 ms
Corpus Summary
| Attribute | Value |
|---|---|
| Name | US Congressional Speeches Subset 100k |
| Description | 1 million speeches sampled from https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset to create corpora of varying sizes for development and testing. The dataset card at Huggingface is empty, so there is no further information available on the contents. The title indicates how many speeches are included in this corpus. |
CPU times: user 209 ms, sys: 0 ns, total: 209 ms
Wall time: 219 ms
Corpus Summary
| Attribute | Value |
|---|---|
| Name | US Congressional Speeches Subset 200k |
| Description | 1 million speeches sampled from https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset to create corpora of varying sizes for development and testing. The dataset card at Huggingface is empty, so there is no further information available on the contents. The title indicates how many speeches are included in this corpus. |
CPU times: user 207 ms, sys: 0 ns, total: 207 ms
Wall time: 217 ms
Corpus Summary
| Attribute | Value |
|---|---|
| Name | US Congressional Speeches Subset 500k |
| Description | 1 million speeches sampled from https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset to create corpora of varying sizes for development and testing. The dataset card at Huggingface is empty, so there is no further information available on the contents. The title indicates how many speeches are included in this corpus. |
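The lazy-loading behaviour described above can be illustrated with a small generic sketch. This is not Conc's implementation: `LazyCorpus`, its `loader` argument and the `load_count` counter are hypothetical names used only to show why load time can stay flat while corpus size grows.

```python
from functools import cached_property


class LazyCorpus:
    """Hypothetical stand-in for a corpus whose large table loads on demand."""

    def __init__(self, name, loader):
        self.name = name        # small metadata is available immediately
        self._loader = loader   # callable that produces the large table
        self.load_count = 0     # how many times the expensive load has run

    @cached_property
    def tokens(self):
        # Runs only on first access; the result is cached on the instance.
        self.load_count += 1
        return self._loader()


corpus = LazyCorpus("demo", loader=lambda: list(range(1_000_000)))
print(corpus.name, corpus.load_count)         # nothing large loaded yet
print(len(corpus.tokens), corpus.load_count)  # first access triggers the load
```

Creating the object only stores metadata, so its cost does not depend on the size of the table. The expensive work is deferred until a report actually needs the data.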
for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)
    del corpus
Frequencies
Frequencies of word tokens, US Congressional Speeches Subset 10k
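| Rank | Token | Frequency | Normalized Frequency |
|---|---|---|---|
| 1 | the | 135,984 | 769.18 |
| 2 | of | 67,597 | 382.36 |
| 3 | to | 60,132 | 340.13 |
| 4 | and | 44,832 | 253.59 |
| 5 | in | 36,959 | 209.06 |
| 6 | that | 34,135 | 193.08 |
| 7 | a | 29,557 | 167.19 |
| 8 | i | 29,329 | 165.90 |
| 9 | is | 25,175 | 142.40 |
| 10 | this | 19,173 | 108.45 |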
Report based on word tokens
Normalized Frequency is per 10,000 tokens
Total word tokens: 1,767,904
Unique word tokens: 50,520
Showing 10 rows
Page 1 of 5053
CPU times: user 22.9 ms, sys: 10.1 ms, total: 33 ms
Wall time: 37.3 ms
Frequencies
Frequencies of word tokens, US Congressional Speeches Subset 100k
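| Rank | Token | Frequency | Normalized Frequency |
|---|---|---|---|
| 1 | the | 1,389,439 | 771.02 |
| 2 | of | 687,127 | 381.30 |
| 3 | to | 610,266 | 338.65 |
| 4 | and | 459,220 | 254.83 |
| 5 | in | 379,946 | 210.84 |
| 6 | that | 346,216 | 192.12 |
| 7 | a | 302,256 | 167.73 |
| 8 | i | 297,077 | 164.85 |
| 9 | is | 250,677 | 139.10 |
| 10 | this | 192,933 | 107.06 |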
Report based on word tokens
Normalized Frequency is per 10,000 tokens
Total word tokens: 18,020,769
Unique word tokens: 214,175
Showing 10 rows
Page 1 of 21418
CPU times: user 61.5 ms, sys: 38 ms, total: 99.4 ms
Wall time: 46.1 ms
Frequencies
Frequencies of word tokens, US Congressional Speeches Subset 200k
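| Rank | Token | Frequency | Normalized Frequency |
|---|---|---|---|
| 1 | the | 2,781,475 | 769.71 |
| 2 | of | 1,377,003 | 381.05 |
| 3 | to | 1,225,404 | 339.10 |
| 4 | and | 922,720 | 255.34 |
| 5 | in | 760,867 | 210.55 |
| 6 | that | 695,665 | 192.51 |
| 7 | a | 606,747 | 167.90 |
| 8 | i | 593,766 | 164.31 |
| 9 | is | 504,385 | 139.58 |
| 10 | this | 386,922 | 107.07 |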
Report based on word tokens
Normalized Frequency is per 10,000 tokens
Total word tokens: 36,136,744
Unique word tokens: 345,310
Showing 10 rows
Page 1 of 34532
CPU times: user 53.7 ms, sys: 78.1 ms, total: 132 ms
Wall time: 49.8 ms
Frequencies
Frequencies of word tokens, US Congressional Speeches Subset 500k
| Rank | Token | Frequency | Normalized Frequency |
|---|---|---|---|
| 1 | the | 6,951,503 | 769.47 |
| 2 | of | 3,446,705 | 381.52 |
| 3 | to | 3,059,159 | 338.62 |
| 4 | and | 2,308,134 | 255.49 |
| 5 | in | 1,902,118 | 210.55 |
| 6 | that | 1,737,689 | 192.35 |
| 7 | a | 1,514,676 | 167.66 |
| 8 | i | 1,481,424 | 163.98 |
| 9 | is | 1,261,935 | 139.68 |
| 10 | this | 966,165 | 106.95 |
Report based on word tokens
Normalized Frequency is per 10,000 tokens
Total word tokens: 90,341,944
Unique word tokens: 654,824
Showing 10 rows
Page 1 of 65483
CPU times: user 104 ms, sys: 105 ms, total: 210 ms
Wall time: 53.2 ms
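The reports state that Normalized Frequency is per 10,000 tokens. As a worked check, the sketch below recomputes the value for "the" in the 10k and 500k corpora from the raw frequency and the total word tokens shown in the report footers. The function name `normalized_frequency` is illustrative and is not taken from Conc.

```python
def normalized_frequency(count: int, total_tokens: int, per: int = 10_000) -> float:
    """Scale a raw count to a rate per `per` tokens."""
    return count / total_tokens * per


# "the" in the 10k corpus: 135,984 occurrences in 1,767,904 word tokens.
print(round(normalized_frequency(135_984, 1_767_904), 2))

# "the" in the 500k corpus: 6,951,503 occurrences in 90,341,944 word tokens.
print(round(normalized_frequency(6_951_503, 90_341_944), 2))
```

Both results agree with the reported values (769.18 and 769.47). Normalizing this way is what makes frequencies comparable across corpora of very different sizes.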
for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)
    del corpus
Ngrams for "economy"
US Congressional Speeches Subset 10k
| Rank | Ngram | Frequency | Normalized Frequency |
|---|---|---|---|
| 1 | the economy | 94 | 0.53 |
| 2 | our economy | 59 | 0.33 |
| 3 | of economy | 23 | 0.13 |
| 4 | american economy | 11 | 0.06 |
| 5 | for economy | 8 | 0.05 |
Report based on word tokens
Ngram length: 2, Token position: right
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 106
Total ngrams: 355
Showing 5 rows
Page 1 of 22
CPU times: user 49.3 ms, sys: 28 ms, total: 77.3 ms
Wall time: 48.8 ms
Ngrams for "economy"
US Congressional Speeches Subset 100k
| Rank | Ngram | Frequency | Normalized Frequency |
|---|---|---|---|
| 1 | the economy | 930 | 0.52 |
| 2 | our economy | 643 | 0.36 |
| 3 | of economy | 203 | 0.11 |
| 4 | american economy | 116 | 0.06 |
| 5 | national economy | 84 | 0.05 |
Report based on word tokens
Ngram length: 2, Token position: right
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 464
Total ngrams: 3,725
Showing 5 rows
Page 1 of 93
CPU times: user 338 ms, sys: 57 ms, total: 395 ms
Wall time: 198 ms
Ngrams for "economy"
US Congressional Speeches Subset 200k
| Rank | Ngram | Frequency | Normalized Frequency |
|---|---|---|---|
| 1 | the economy | 1,924 | 0.53 |
| 2 | our economy | 1,312 | 0.36 |
| 3 | of economy | 401 | 0.11 |
| 4 | american economy | 242 | 0.07 |
| 5 | national economy | 172 | 0.05 |
Report based on word tokens
Ngram length: 2, Token position: right
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 682
Total ngrams: 7,668
Showing 5 rows
Page 1 of 137
CPU times: user 578 ms, sys: 233 ms, total: 811 ms
Wall time: 435 ms
Ngrams for "economy"
US Congressional Speeches Subset 500k
| Rank | Ngram | Frequency | Normalized Frequency |
|---|---|---|---|
| 1 | the economy | 4,818 | 0.53 |
| 2 | our economy | 3,258 | 0.36 |
| 3 | of economy | 1,039 | 0.12 |
| 4 | american economy | 588 | 0.07 |
| 5 | national economy | 448 | 0.05 |
Report based on word tokens
Ngram length: 2, Token position: right
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 1,193
Total ngrams: 19,211
Showing 5 rows
Page 1 of 239
CPU times: user 1.66 s, sys: 552 ms, total: 2.21 s
Wall time: 1.02 s
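The ngram reports above use "Ngram length: 2, Token position: right", meaning the search token is the last token of each ngram. The sketch below shows that idea on a toy token list with plain Python. It is not how Conc computes the report; `ngrams_with_token` and its parameters are hypothetical, only the left and right positions are handled, and punctuation filtering is left out.

```python
from collections import Counter


def ngrams_with_token(tokens, token, n=2, position="right"):
    """Count ngrams of length n with `token` at the left or right edge."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        anchor = window[-1] if position == "right" else window[0]
        if anchor == token:
            counts[" ".join(window)] += 1
    return counts


text = "the economy is strong and our economy is growing because the economy grows"
tokens = text.split()
counts = ngrams_with_token(tokens, "economy")
for ngram, frequency in counts.most_common():
    print(ngram, frequency)
```

Passing `position="left"` instead counts ngrams that start with the token, such as "economy is".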
# still working on this!
for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)
    del corpus
Ngram Frequencies
US Congressional Speeches Subset 10k
| Rank | Ngram | Frequency | Normalized Frequency |
|---|---|---|---|
| 1 | of the | 22,312 | 126.21 |
| 2 | in the | 10,982 | 62.12 |
| 3 | to the | 9,119 | 51.58 |
| 4 | it is | 5,140 | 29.07 |
| 5 | that the | 5,123 | 28.98 |
Report based on word tokens
Ngram length: 2
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 396,623
Total ngrams: 1,584,710
Showing 5 rows
Page 1 of 79325
CPU times: user 1.5 s, sys: 147 ms, total: 1.65 s
Wall time: 209 ms
Ngram Frequencies
US Congressional Speeches Subset 100k
| Rank | Ngram | Frequency | Normalized Frequency |
|---|---|---|---|
| 1 | of the | 227,943 | 126.49 |
| 2 | in the | 114,241 | 63.39 |
| 3 | to the | 92,967 | 51.59 |
| 4 | it is | 51,659 | 28.67 |
| 5 | that the | 51,620 | 28.64 |
Report based on word tokens
Ngram length: 2
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 2,046,190
Total ngrams: 16,153,485
Showing 5 rows
Page 1 of 409238
CPU times: user 35.7 s, sys: 1.09 s, total: 36.8 s
Wall time: 831 ms
Ngram Frequencies
US Congressional Speeches Subset 200k
| Rank | Ngram | Frequency | Normalized Frequency |
|---|---|---|---|
| 1 | of the | 457,057 | 126.48 |
| 2 | in the | 228,891 | 63.34 |
| 3 | to the | 186,449 | 51.60 |
| 4 | it is | 103,619 | 28.67 |
| 5 | that the | 103,418 | 28.62 |
Report based on word tokens
Ngram length: 2
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 3,304,755
Total ngrams: 32,389,849
Showing 5 rows
Page 1 of 660951
CPU times: user 1min 9s, sys: 2.22 s, total: 1min 11s
Wall time: 4.33 s
Ngram Frequencies
US Congressional Speeches Subset 500k
| Rank | Ngram | Frequency | Normalized Frequency |
|---|---|---|---|
| 1 | of the | 1,140,304 | 126.22 |
| 2 | in the | 570,295 | 63.13 |
| 3 | to the | 467,816 | 51.78 |
| 4 | it is | 259,770 | 28.75 |
| 5 | that the | 258,068 | 28.57 |
Report based on word tokens
Ngram length: 2
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 6,158,427
Total ngrams: 80,976,586
Showing 5 rows
Page 1 of 1231686
CPU times: user 3min 16s, sys: 10.2 s, total: 3min 26s
Wall time: 11.9 s
for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)
    del corpus
Concordance for "economy"
US Congressional Speeches Subset 10k, Context tokens: 5, Order: 1R2R3R
| Document Id | Left | Node | Right |
|---|---|---|---|
| 5,878 | ruled by a government of | economy | . |
| 1,163 | . help strengthen our Nations | economy | . |
| 316 | otherwise generally strong and prosperous | economy | . |
| 6,910 | this critical sector in our | economy | . |
| 9,517 | health care pressures in this | economy | . |
Total Concordance Lines: 358
Total Documents: 251
Showing 5 lines
Page 1 of 72
CPU times: user 89.7 ms, sys: 1.14 ms, total: 90.8 ms
Wall time: 61.2 ms
Concordance for "economy"
US Congressional Speeches Subset 100k, Context tokens: 5, Order: 1R2R3R
| Document Id | Left | Node | Right |
|---|---|---|---|
| 82,659 | amounts that it throws our | economy | |
| 75,018 | Honey . I shrun the | economy | " ? It is honest |
| 19,176 | getting away from " Coolidge | economy | " already . and making |
| 6,729 | . We are talking " | economy | " and at the same |
| 83,170 | further into an " innovating | economy | " based on a highly |
Total Concordance Lines: 3758
Total Documents: 2684
Showing 5 lines
Page 1 of 752
CPU times: user 414 ms, sys: 340 ms, total: 755 ms
Wall time: 437 ms
Concordance for "economy"
US Congressional Speeches Subset 200k, Context tokens: 5, Order: 1R2R3R
| Document Id | Left | Node | Right |
|---|---|---|---|
| 77,084 | the way it is . | ECONOMY | |
| 6,026 | its central office . Political | Economy | |
| 130,531 | the maintenance of her national | economy | |
| 20,685 | on something else . Coolidge | economy | ! I am for it |
| 132,603 | railroads of this country . | Economy | ! What about this pitpible |
Total Concordance Lines: 7753
Total Documents: 5480
Showing 5 lines
Page 1 of 1551
CPU times: user 871 ms, sys: 596 ms, total: 1.47 s
Wall time: 831 ms
Concordance for "economy"
US Congressional Speeches Subset 500k, Context tokens: 5, Order: 1R2R3R
| Document Id | Left | Node | Right |
|---|---|---|---|
| 140,837 | prayers are with them . | ECONOMY | |
| 162,997 | its central office . Political | Economy | |
| 325,086 | WHAT ARE CoNDrrONS IN THE | ECONOMY | |
| 64,711 | country . Condition of Nations | Economy | |
| 360,787 | country ! This spasm of | economy | ! |
Total Concordance Lines: 19399
Total Documents: 13564
Showing 5 lines
Page 1 of 3880
CPU times: user 2.83 s, sys: 1.42 s, total: 4.24 s
Wall time: 1.86 s
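The concordance reports above show a node word with a fixed number of context tokens on each side, ordered by "1R2R3R", that is by the first, second and third tokens to the right of the node. The sketch below is a minimal keyword-in-context routine on a toy token list, written in plain Python. It is not Conc's implementation: `concordance` and `sort_1r2r3r` are hypothetical helpers, and plain case-insensitive alphabetical ordering is an assumption that may differ from the ordering Conc applies.

```python
def concordance(tokens, node, context=5):
    """Return (left, node, right) token lists for each case-insensitive match."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = tokens[max(0, i - context):i]
            right = tokens[i + 1:i + 1 + context]
            lines.append((left, tok, right))
    return lines


def sort_1r2r3r(lines):
    """Order lines by the first, second, then third token right of the node."""
    def key(line):
        right = [t.lower() for t in line[2][:3]]
        # Pad short contexts so every key has three elements.
        return right + [""] * (3 - len(right))
    return sorted(lines, key=key)


text = "Our economy needs jobs . The economy is growing . A strong economy helps"
tokens = text.split()
sorted_lines = sort_1r2r3r(concordance(tokens, "economy", context=3))
for left, node, right in sorted_lines:
    print(" ".join(left).rjust(20), node, " ".join(right))
```

Matching is case-insensitive here because the reports above return "economy", "Economy" and "ECONOMY" for the same query.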
reference = Corpus().load(f'{save_path}brown.corpus')
for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)
    conc.set_reference_corpus(reference)
    del corpus
Keywords
Target corpus: US Congressional Speeches Subset 10k, Reference corpus: Brown Corpus
| Rank | Token | Frequency | Frequency Reference | Normalized Frequency | Normalized Frequency Reference | Relative Risk | Log Ratio | Log Likelihood |
|---|---|---|---|---|---|---|---|---|
| 1 | unanimous | 907 | 5 | 5.13 | 0.05 | 100.57 | 6.65 | 748.42 |
| 2 | amendment | 4,039 | 24 | 22.85 | 0.24 | 93.30 | 6.54 | 3,318.48 |
| 3 | appropriation | 716 | 5 | 4.05 | 0.05 | 79.39 | 6.31 | 582.28 |
| 4 | senator | 5,488 | 39 | 31.04 | 0.40 | 78.02 | 6.29 | 4,457.76 |
| 5 | subcommittee | 585 | 5 | 3.31 | 0.05 | 64.87 | 6.02 | 468.73 |
Report based on word tokens
Filtered tokens by minimum frequency in target corpus (5), minimum frequency in reference corpus (5)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 1,767,904
Total word tokens in reference corpus: 980,144
Keywords: 8,291
Showing 5 rows
Page 1 of 1659
CPU times: user 369 ms, sys: 220 ms, total: 589 ms
Wall time: 94.3 ms
Keywords
Target corpus: US Congressional Speeches Subset 100k, Reference corpus: Brown Corpus
| Rank | Token | Frequency | Frequency Reference | Normalized Frequency | Normalized Frequency Reference | Relative Risk | Log Ratio | Log Likelihood |
|---|---|---|---|---|---|---|---|---|
| 1 | unanimous | 8,978 | 5 | 4.98 | 0.05 | 97.66 | 6.61 | 895.70 |
| 2 | amendment | 39,940 | 24 | 22.16 | 0.24 | 90.51 | 6.50 | 3,968.88 |
| 3 | appropriation | 6,847 | 5 | 3.80 | 0.05 | 74.48 | 6.22 | 672.68 |
| 4 | senator | 52,772 | 39 | 29.28 | 0.40 | 73.60 | 6.20 | 5,180.64 |
| 5 | gentleman | 32,178 | 28 | 17.86 | 0.29 | 62.51 | 5.97 | 3,123.80 |
Report based on word tokens
Filtered tokens by minimum frequency in target corpus (5), minimum frequency in reference corpus (5)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 18,020,769
Total word tokens in reference corpus: 980,144
Keywords: 12,136
Showing 5 rows
Page 1 of 2428
CPU times: user 2.78 s, sys: 417 ms, total: 3.19 s
Wall time: 274 ms
Keywords
Target corpus: US Congressional Speeches Subset 200k, Reference corpus: Brown Corpus
| Rank | Token | Frequency | Frequency Reference | Normalized Frequency | Normalized Frequency Reference | Relative Risk | Log Ratio | Log Likelihood |
|---|---|---|---|---|---|---|---|---|
| 1 | unanimous | 17,813 | 5 | 4.93 | 0.05 | 96.63 | 6.59 | 897.98 |
| 2 | amendment | 80,078 | 24 | 22.16 | 0.24 | 90.50 | 6.50 | 4,023.10 |
| 3 | appropriation | 13,896 | 5 | 3.85 | 0.05 | 75.38 | 6.24 | 690.81 |
| 4 | senator | 105,824 | 39 | 29.28 | 0.40 | 73.60 | 6.20 | 5,252.88 |
| 5 | gentleman | 63,852 | 28 | 17.67 | 0.29 | 61.85 | 5.95 | 3,132.10 |
Report based on word tokens
Filtered tokens by minimum frequency in target corpus (5), minimum frequency in reference corpus (5)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 36,136,744
Total word tokens in reference corpus: 980,144
Keywords: 12,704
Showing 5 rows
Page 1 of 2541
CPU times: user 6.71 s, sys: 451 ms, total: 7.16 s
Wall time: 516 ms
Keywords
Target corpus: US Congressional Speeches Subset 500k, Reference corpus: Brown Corpus
| Rank | Token | Frequency | Frequency Reference | Normalized Frequency | Normalized Frequency Reference | Relative Risk | Log Ratio | Log Likelihood |
|---|---|---|---|---|---|---|---|---|
| 1 | unanimous | 44,193 | 5 | 4.89 | 0.05 | 95.89 | 6.58 | 898.23 |
| 2 | amendment | 198,132 | 24 | 21.93 | 0.24 | 89.57 | 6.48 | 4,012.78 |
| 3 | appropriation | 34,215 | 5 | 3.79 | 0.05 | 74.24 | 6.21 | 685.45 |
| 4 | senator | 264,478 | 39 | 29.28 | 0.40 | 73.57 | 6.20 | 5,295.45 |
| 5 | gentleman | 159,877 | 28 | 17.70 | 0.29 | 61.95 | 5.95 | 3,163.94 |
Report based on word tokens
Filtered tokens by minimum frequency in target corpus (5), minimum frequency in reference corpus (5)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 90,341,944
Total word tokens in reference corpus: 980,144
Keywords: 13,118
Showing 5 rows
Page 1 of 2624
CPU times: user 20 s, sys: 802 ms, total: 20.8 s
Wall time: 1.17 s
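The keyness columns in the reports above can be recomputed from the raw counts. The sketch below uses the commonly cited definitions: relative risk as the ratio of the two relative frequencies, log ratio as its base-2 logarithm, and the two-term log-likelihood statistic often used for corpus comparison. That Conc uses exactly these formulas is an assumption, although they match the figures reported for "unanimous" in the 10k corpus. The `keyness` function is illustrative and requires non-zero frequencies in both corpora.

```python
import math


def keyness(freq_target, freq_reference, total_target, total_reference):
    """Return (relative risk, log ratio, log likelihood) for one token."""
    rate_target = freq_target / total_target
    rate_reference = freq_reference / total_reference
    relative_risk = rate_target / rate_reference
    log_ratio = math.log2(relative_risk)

    # Expected counts if the token were equally frequent in both corpora.
    combined = freq_target + freq_reference
    total = total_target + total_reference
    expected_target = total_target * combined / total
    expected_reference = total_reference * combined / total
    log_likelihood = 2 * (
        freq_target * math.log(freq_target / expected_target)
        + freq_reference * math.log(freq_reference / expected_reference)
    )
    return relative_risk, log_ratio, log_likelihood


# "unanimous": 907 occurrences in the 10k corpus, 5 in the Brown Corpus.
rr, lr, ll = keyness(907, 5, 1_767_904, 980_144)
print(f"{rr:.2f} {lr:.2f} {ll:.2f}")
```

The reported values for this row are 100.57, 6.65 and 748.42.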
for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)
    del corpus
Collocates of "economy"
US Congressional Speeches Subset 10k
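| Rank | Token | Collocate Frequency | Frequency | Logdice | Log Likelihood |
|---|---|---|---|---|---|
| 1 | economy | 20 | 358 | 9.84 | 248.59 |
| 2 | healthy | 10 | 50 | 9.65 | 74.41 |
| 3 | segment | 9 | 24 | 9.59 | 80.17 |
| 4 | our | 93 | 5,938 | 8.92 | 221.67 |
| 5 | false | 6 | 55 | 8.89 | 36.86 |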
Report based on word tokens
Context tokens left: 5, context tokens right: 5
Filtered tokens by minimum collocation frequency (5)
Unique collocates: 115
Showing 5 rows
Page 1 of 24
CPU times: user 123 ms, sys: 21.9 ms, total: 145 ms
Wall time: 54.8 ms
Collocates of "economy"
US Congressional Speeches Subset 100k
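| Rank | Token | Collocate Frequency | Frequency | Logdice | Log Likelihood |
|---|---|---|---|---|---|
| 1 | our | 1,084 | 60,051 | 9.12 | 2,801.70 |
| 2 | efficiency | 60 | 732 | 8.77 | 329.93 |
| 3 | stimulate | 51 | 299 | 8.69 | 358.80 |
| 4 | global | 55 | 618 | 8.69 | 311.68 |
| 5 | jobs | 83 | 3,100 | 8.63 | 274.52 |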
Report based on word tokens
Context tokens left: 5, context tokens right: 5
Filtered tokens by minimum collocation frequency (5)
Unique collocates: 864
Showing 5 rows
Page 1 of 173
CPU times: user 413 ms, sys: 244 ms, total: 657 ms
Wall time: 328 ms
Collocates of "economy"
US Congressional Speeches Subset 200k
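| Rank | Token | Collocate Frequency | Frequency | Logdice | Log Likelihood |
|---|---|---|---|---|---|
| 1 | our | 2,219 | 121,489 | 9.14 | 5,670.76 |
| 2 | global | 119 | 1,221 | 8.76 | 689.99 |
| 3 | sector | 119 | 1,741 | 8.68 | 604.09 |
| 4 | stimulate | 101 | 611 | 8.63 | 698.06 |
| 5 | jobs | 166 | 6,312 | 8.60 | 534.83 |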
Report based on word tokens
Context tokens left: 5, context tokens right: 5
Filtered tokens by minimum collocation frequency (5)
Unique collocates: 1,524
Showing 5 rows
Page 1 of 305
CPU times: user 710 ms, sys: 212 ms, total: 922 ms
Wall time: 523 ms
Collocates of "economy"
US Congressional Speeches Subset 500k
| Rank | Token | Collocate Frequency | Frequency | Logdice | Log Likelihood |
|---|---|---|---|---|---|
| 1 | our | 5,656 | 304,919 | 9.16 | 14,596.00 |
| 2 | stimulate | 267 | 1,472 | 8.71 | 1,898.46 |
| 3 | global | 283 | 2,924 | 8.70 | 1,636.06 |
| 4 | jobs | 418 | 15,339 | 8.62 | 1,373.41 |
| 5 | economy | 446 | 19,399 | 8.56 | 5,491.13 |
Report based on word tokens
Context tokens left: 5, context tokens right: 5
Filtered tokens by minimum collocation frequency (5)
Unique collocates: 2,786
Showing 5 rows
Page 1 of 558
CPU times: user 2.14 s, sys: 589 ms, total: 2.73 s
Wall time: 1.34 s
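The Logdice column in the collocation reports can be checked against the standard logDice definition, 14 + log2(2 × f(node, collocate) / (f(node) + f(collocate))). The sketch below applies it to rows from the 10k report, where the node "economy" occurs 358 times. That Conc computes the score exactly this way is an assumption, and `log_dice` is an illustrative name.

```python
import math


def log_dice(collocate_frequency, node_frequency, token_frequency):
    """Standard logDice association score for a node and one collocate."""
    return 14 + math.log2(
        2 * collocate_frequency / (node_frequency + token_frequency)
    )


NODE_FREQUENCY = 358  # occurrences of "economy" in the 10k corpus

# (collocate, collocate frequency, overall frequency) from the 10k report.
rows = [("economy", 20, 358), ("healthy", 10, 50), ("our", 93, 5_938)]
for token, collocate_frequency, frequency in rows:
    score = log_dice(collocate_frequency, NODE_FREQUENCY, frequency)
    print(token, round(score, 2))
```

The reported scores for these rows are 9.84, 9.65 and 8.92. Because logDice depends only on the three frequencies and not on corpus size, scores stay comparable across the four corpora, which is consistent with "our" scoring between 8.92 and 9.16 in every report.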