
Performance

Information on Conc performance across different corpus sizes.

This page reports timings for building and loading corpora and for running Conc report methods on corpora of different sizes. All runs used a machine with an Intel Core i7-14700F, an NVMe SSD and 16GB of usable RAM under WSL.
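The timings on this page are in the `CPU times` / `Wall time` format printed by IPython's `%time` magic. As a rough sketch of the same kind of measurement using only the standard library, the snippet below records wall-clock and process CPU time around a single call. The `time_call` helper is hypothetical and is not part of Conc. CPU time can exceed wall time when work is spread across several threads, which is why some reports on this page show minutes of CPU time but only seconds of wall time.

```python
import time

def time_call(func, *args, **kwargs):
    """Run func once and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time used by this process (all threads)
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds

# Example: time a simple computation
result, wall, cpu = time_call(sum, range(1_000_000))
print(f"Wall time: {wall * 1000:.1f} ms, CPU time: {cpu * 1000:.1f} ms")
```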

from conc.corpus import Corpus
from conc.conc import Conc
test_corpora = {
                'us-congressional-speeches-subset-10k': 'US Congressional Speeches Subset 10k',
                'us-congressional-speeches-subset-100k': 'US Congressional Speeches Subset 100k',
                'us-congressional-speeches-subset-200k': 'US Congressional Speeches Subset 200k',
                'us-congressional-speeches-subset-500k': 'US Congressional Speeches Subset 500k'
                }

Corpus build time ranges from about 4 seconds for a 2-million-token data source (10k texts) to about 150 seconds for a 100-million-token data source (500k texts). Building corpora larger than this currently requires a large amount of RAM. Work on memory management is ongoing, and this should improve as Polars' new streaming engine matures. This work is on the library's Roadmap.

corpora = {}
for slug, name in test_corpora.items():
    logger.info(f'Starting {name} build ...')
    description = f'1 million speeches sampled from https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset to create corpora of varying sizes for development and testing. The dataset card at Huggingface is empty, so there is no further information available on the contents. The title indicates how many speeches are included in this corpus. '
    try:

    except Exception as e:
        raise e
CPU times: user 4.45 s, sys: 224 ms, total: 4.67 s
Wall time: 3.82 s
CPU times: user 46.3 s, sys: 2.45 s, total: 48.7 s
Wall time: 30 s
CPU times: user 1min 38s, sys: 10.8 s, total: 1min 49s
Wall time: 1min 2s
CPU times: user 3min 55s, sys: 32 s, total: 4min 27s
Wall time: 2min 26s
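As a quick check on how build time scales, the wall times above can be combined with the token counts reported in the corpus summaries on this page. The sketch below is only arithmetic on those published figures (wall times converted to seconds). It suggests that build throughput stays in the range of roughly 0.5 to 0.7 million tokens per second across the four corpora, so build time grows approximately linearly with corpus size on this machine.

```python
# (token count from the corpus summary, build wall time in seconds)
builds = {
    '10k': (1_954_972, 3.82),
    '100k': (19_927_241, 30.0),
    '200k': (39_963_039, 62.0),    # 1min 2s
    '500k': (99_902_593, 146.0),   # 2min 26s
}

# Tokens processed per second of wall time for each build
throughput = {label: tokens / seconds for label, (tokens, seconds) in builds.items()}

for label, rate in throughput.items():
    print(f"{label}: {rate:,.0f} tokens per second")
```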

Corpora are loaded lazily, meaning large data tables are only accessed when required. As a result, load times are similar regardless of corpus size (roughly 0.2 to 0.3 seconds in the runs below).

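for slug, name in test_corpora.items():
    # Load the saved corpus from disk, then display its summary
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    corpus.summary()
    del corpus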
CPU times: user 211 ms, sys: 15.7 ms, total: 227 ms
Wall time: 266 ms
Corpus Summary
Attribute Value
Name US Congressional Speeches Subset 10k
Description 1 million speeches sampled from https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset to create corpora of varying sizes for development and testing. The dataset card at Huggingface is empty, so there is no further information available on the contents. The title indicates how many speeches are included in this corpus.
Date Created 2025-06-09 15:03:14
Conc Version 0.0.1
Corpus Path /home/geoff/data/conc-test-corpora/us-congressional-speeches-subset-10k.corpus
Document Count 10,000
Token Count 1,954,972
Word Token Count 1,767,904
Unique Tokens 50,640
Unique Word Tokens 50,520
CPU times: user 182 ms, sys: 27.6 ms, total: 209 ms
Wall time: 220 ms
Corpus Summary
Attribute Value
Name US Congressional Speeches Subset 100k
Description 1 million speeches sampled from https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset to create corpora of varying sizes for development and testing. The dataset card at Huggingface is empty, so there is no further information available on the contents. The title indicates how many speeches are included in this corpus.
Date Created 2025-06-09 15:03:44
Conc Version 0.0.1
Corpus Path /home/geoff/data/conc-test-corpora/us-congressional-speeches-subset-100k.corpus
Document Count 100,000
Token Count 19,927,241
Word Token Count 18,020,769
Unique Tokens 214,502
Unique Word Tokens 214,175
CPU times: user 209 ms, sys: 0 ns, total: 209 ms
Wall time: 219 ms
Corpus Summary
Attribute Value
Name US Congressional Speeches Subset 200k
Description 1 million speeches sampled from https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset to create corpora of varying sizes for development and testing. The dataset card at Huggingface is empty, so there is no further information available on the contents. The title indicates how many speeches are included in this corpus.
Date Created 2025-06-09 15:04:47
Conc Version 0.0.1
Corpus Path /home/geoff/data/conc-test-corpora/us-congressional-speeches-subset-200k.corpus
Document Count 200,000
Token Count 39,963,039
Word Token Count 36,136,744
Unique Tokens 345,631
Unique Word Tokens 345,310
CPU times: user 207 ms, sys: 0 ns, total: 207 ms
Wall time: 217 ms
Corpus Summary
Attribute Value
Name US Congressional Speeches Subset 500k
Description 1 million speeches sampled from https://huggingface.co/datasets/Eugleo/us-congressional-speeches-subset to create corpora of varying sizes for development and testing. The dataset card at Huggingface is empty, so there is no further information available on the contents. The title indicates how many speeches are included in this corpus.
Date Created 2025-06-09 15:07:14
Conc Version 0.0.1
Corpus Path /home/geoff/data/conc-test-corpora/us-congressional-speeches-subset-500k.corpus
Document Count 500,000
Token Count 99,902,593
Word Token Count 90,341,944
Unique Tokens 655,344
Unique Word Tokens 654,824
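The near-constant load times above are what lazy loading would predict: summary metadata is small, and the large tables are left untouched until a report needs them. Conc's internals are not shown on this page, so the following is only an illustration of the general idea in plain Python. The `LazyCorpus` class and its `token_loader` argument are hypothetical names, not Conc's API.

```python
from functools import cached_property

class LazyCorpus:
    """Hypothetical stand-in for a corpus object; not Conc's actual class."""

    def __init__(self, metadata, token_loader):
        self.metadata = metadata            # small and cheap to read
        self._token_loader = token_loader   # callable that reads the large table
        self.load_count = 0                 # how many times the table was read

    @cached_property
    def tokens(self):
        # Runs only on first access; the result is cached afterwards
        self.load_count += 1
        return self._token_loader()

    def summary(self):
        # Uses metadata only, so it never triggers the expensive load
        return dict(self.metadata)

corpus = LazyCorpus(
    {'name': 'demo', 'token_count': 5},
    lambda: ['the', 'economy', 'of', 'the', 'nation'],
)
info = corpus.summary()
loads_after_summary = corpus.load_count   # still 0: no table access yet
first = corpus.tokens                     # first access triggers the load
second = corpus.tokens                    # served from the cache
print(loads_after_summary, corpus.load_count)
```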
for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)

    del corpus
Frequencies
Frequencies of word tokens, US Congressional Speeches Subset 10k
Rank  Token  Frequency  Normalized Frequency
1     the    135,984    769.18
2     of     67,597     382.36
3     to     60,132     340.13
4     and    44,832     253.59
5     in     36,959     209.06
6     that   34,135     193.08
7     a      29,557     167.19
8     i      29,329     165.90
9     is     25,175     142.40
10    this   19,173     108.45
Report based on word tokens
Normalized Frequency is per 10,000 tokens
Total word tokens: 1,767,904
Unique word tokens: 50,520
Showing 10 rows
Page 1 of 5053
CPU times: user 22.9 ms, sys: 10.1 ms, total: 33 ms
Wall time: 37.3 ms
Frequencies
Frequencies of word tokens, US Congressional Speeches Subset 100k
Rank  Token  Frequency  Normalized Frequency
1     the    1,389,439  771.02
2     of     687,127    381.30
3     to     610,266    338.65
4     and    459,220    254.83
5     in     379,946    210.84
6     that   346,216    192.12
7     a      302,256    167.73
8     i      297,077    164.85
9     is     250,677    139.10
10    this   192,933    107.06
Report based on word tokens
Normalized Frequency is per 10,000 tokens
Total word tokens: 18,020,769
Unique word tokens: 214,175
Showing 10 rows
Page 1 of 21418
CPU times: user 61.5 ms, sys: 38 ms, total: 99.4 ms
Wall time: 46.1 ms
Frequencies
Frequencies of word tokens, US Congressional Speeches Subset 200k
Rank  Token  Frequency  Normalized Frequency
1     the    2,781,475  769.71
2     of     1,377,003  381.05
3     to     1,225,404  339.10
4     and    922,720    255.34
5     in     760,867    210.55
6     that   695,665    192.51
7     a      606,747    167.90
8     i      593,766    164.31
9     is     504,385    139.58
10    this   386,922    107.07
Report based on word tokens
Normalized Frequency is per 10,000 tokens
Total word tokens: 36,136,744
Unique word tokens: 345,310
Showing 10 rows
Page 1 of 34532
CPU times: user 53.7 ms, sys: 78.1 ms, total: 132 ms
Wall time: 49.8 ms
Frequencies
Frequencies of word tokens, US Congressional Speeches Subset 500k
Rank  Token  Frequency  Normalized Frequency
1     the    6,951,503  769.47
2     of     3,446,705  381.52
3     to     3,059,159  338.62
4     and    2,308,134  255.49
5     in     1,902,118  210.55
6     that   1,737,689  192.35
7     a      1,514,676  167.66
8     i      1,481,424  163.98
9     is     1,261,935  139.68
10    this   966,165    106.95
Report based on word tokens
Normalized Frequency is per 10,000 tokens
Total word tokens: 90,341,944
Unique word tokens: 654,824
Showing 10 rows
Page 1 of 65483
CPU times: user 104 ms, sys: 105 ms, total: 210 ms
Wall time: 53.2 ms
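The report footers state that Normalized Frequency is per 10,000 tokens. A minimal worked example of that calculation, using the figures for "the" from the reports above, is shown below. The `normalized_frequency` function is a hypothetical helper written for this illustration and is not Conc's implementation.

```python
def normalized_frequency(frequency, total_tokens, per=10_000):
    """Occurrences per `per` tokens."""
    return frequency / total_tokens * per

# "the" in the 10k corpus: 135,984 occurrences in 1,767,904 word tokens
value_10k = normalized_frequency(135_984, 1_767_904)
# "the" in the 500k corpus: 6,951,503 occurrences in 90,341,944 word tokens
value_500k = normalized_frequency(6_951_503, 90_341_944)

print(round(value_10k, 2), round(value_500k, 2))
```

Rounded to two decimal places, these match the 769.18 and 769.47 shown in the reports. Normalizing in this way is what makes the frequencies comparable across corpora of very different sizes.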
for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)

    del corpus
Ngrams for "economy"
US Congressional Speeches Subset 10k
Rank  Ngram             Frequency  Normalized Frequency
1     the economy       94         0.53
2     our economy       59         0.33
3     of economy        23         0.13
4     american economy  11         0.06
5     for economy       8          0.05
Report based on word tokens
Ngram length: 2, Token position: right
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 106
Total ngrams: 355
Showing 5 rows
Page 1 of 22
CPU times: user 49.3 ms, sys: 28 ms, total: 77.3 ms
Wall time: 48.8 ms
Ngrams for "economy"
US Congressional Speeches Subset 100k
Rank  Ngram             Frequency  Normalized Frequency
1     the economy       930        0.52
2     our economy       643        0.36
3     of economy        203        0.11
4     american economy  116        0.06
5     national economy  84         0.05
Report based on word tokens
Ngram length: 2, Token position: right
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 464
Total ngrams: 3,725
Showing 5 rows
Page 1 of 93
CPU times: user 338 ms, sys: 57 ms, total: 395 ms
Wall time: 198 ms
Ngrams for "economy"
US Congressional Speeches Subset 200k
Rank  Ngram             Frequency  Normalized Frequency
1     the economy       1,924      0.53
2     our economy       1,312      0.36
3     of economy        401        0.11
4     american economy  242        0.07
5     national economy  172        0.05
Report based on word tokens
Ngram length: 2, Token position: right
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 682
Total ngrams: 7,668
Showing 5 rows
Page 1 of 137
CPU times: user 578 ms, sys: 233 ms, total: 811 ms
Wall time: 435 ms
Ngrams for "economy"
US Congressional Speeches Subset 500k
Rank  Ngram             Frequency  Normalized Frequency
1     the economy       4,818      0.53
2     our economy       3,258      0.36
3     of economy        1,039      0.12
4     american economy  588        0.07
5     national economy  448        0.05
Report based on word tokens
Ngram length: 2, Token position: right
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 1,193
Total ngrams: 19,211
Showing 5 rows
Page 1 of 239
CPU times: user 1.66 s, sys: 552 ms, total: 2.21 s
Wall time: 1.02 s
# still working on this!
for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)

    del corpus
Ngram Frequencies
US Congressional Speeches Subset 10k
Rank  Ngram     Frequency  Normalized Frequency
1     of the    22,312     126.21
2     in the    10,982     62.12
3     to the    9,119      51.58
4     it is     5,140      29.07
5     that the  5,123      28.98
Report based on word tokens
Ngram length: 2
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 396,623
Total ngrams: 1,584,710
Showing 5 rows
Page 1 of 79325
CPU times: user 1.5 s, sys: 147 ms, total: 1.65 s
Wall time: 209 ms
Ngram Frequencies
US Congressional Speeches Subset 100k
Rank  Ngram     Frequency  Normalized Frequency
1     of the    227,943    126.49
2     in the    114,241    63.39
3     to the    92,967     51.59
4     it is     51,659     28.67
5     that the  51,620     28.64
Report based on word tokens
Ngram length: 2
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 2,046,190
Total ngrams: 16,153,485
Showing 5 rows
Page 1 of 409238
CPU times: user 35.7 s, sys: 1.09 s, total: 36.8 s
Wall time: 831 ms
Ngram Frequencies
US Congressional Speeches Subset 200k
Rank  Ngram     Frequency  Normalized Frequency
1     of the    457,057    126.48
2     in the    228,891    63.34
3     to the    186,449    51.60
4     it is     103,619    28.67
5     that the  103,418    28.62
Report based on word tokens
Ngram length: 2
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 3,304,755
Total ngrams: 32,389,849
Showing 5 rows
Page 1 of 660951
CPU times: user 1min 9s, sys: 2.22 s, total: 1min 11s
Wall time: 4.33 s
Ngram Frequencies
US Congressional Speeches Subset 500k
Rank  Ngram     Frequency  Normalized Frequency
1     of the    1,140,304  126.22
2     in the    570,295    63.13
3     to the    467,816    51.78
4     it is     259,770    28.75
5     that the  258,068    28.57
Report based on word tokens
Ngram length: 2
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 6,158,427
Total ngrams: 80,976,586
Showing 5 rows
Page 1 of 1231686
CPU times: user 3min 16s, sys: 10.2 s, total: 3min 26s
Wall time: 11.9 s
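To make the ngram frequency report concrete, the sketch below counts bigrams in a tiny token list and excludes any bigram containing a punctuation token, as the report footers describe. It is a plain-Python illustration with a hypothetical `bigram_frequencies` helper, not Conc's implementation. Lowercasing is an assumption based on the lowercase tokens shown in the reports. The reported normalized values are consistent with dividing by total word tokens rather than total ngrams (for example, 22,312 / 1,767,904 × 10,000 ≈ 126.21 for "of the" in the 10k corpus), so the sketch normalizes the same way.

```python
from collections import Counter
import string

def is_punctuation(token):
    """True if every character in the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)

def bigram_frequencies(tokens):
    """Count adjacent token pairs, skipping pairs that contain punctuation."""
    words = [t.lower() for t in tokens]   # assumption: case-insensitive counting
    pairs = zip(words, words[1:])
    return Counter(p for p in pairs if not any(is_punctuation(t) for t in p))

tokens = ['The', 'state', 'of', 'the', 'economy', '.',
          'The', 'health', 'of', 'the', 'economy', '.']
counts = bigram_frequencies(tokens)
word_total = sum(1 for t in tokens if not is_punctuation(t))

for (first, second), freq in counts.most_common(3):
    # Normalized per 10,000 word tokens
    print(first, second, freq, round(freq / word_total * 10_000, 2))
```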
for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)

    del corpus
Concordance for "economy"
US Congressional Speeches Subset 10k, Context tokens: 5, Order: 1R2R3R
Document Id                                       Left  Node     Right
5,878                         ruled by a government of  economy  .
1,163                    . help strengthen our Nations  economy  .
316          otherwise generally strong and prosperous  economy  .
6,910                      this critical sector in our  economy  .
9,517                    health care pressures in this  economy  .
Total Concordance Lines: 358
Total Documents: 251
Showing 5 lines
Page 1 of 72
CPU times: user 89.7 ms, sys: 1.14 ms, total: 90.8 ms
Wall time: 61.2 ms
Concordance for "economy"
US Congressional Speeches Subset 100k, Context tokens: 5, Order: 1R2R3R
Document Id                          Left  Node     Right
82,659         amounts that it throws our  economy
75,018                Honey . I shrun the  economy  " ? It is honest
19,176       getting away from " Coolidge  economy  " already . and making
6,729                  . We are talking "  economy  " and at the same
83,170       further into an " innovating  economy  " based on a highly
Total Concordance Lines: 3758
Total Documents: 2684
Showing 5 lines
Page 1 of 752
CPU times: user 414 ms, sys: 340 ms, total: 755 ms
Wall time: 437 ms
Concordance for "economy"
US Congressional Speeches Subset 200k, Context tokens: 5, Order: 1R2R3R
Document Id                             Left  Node     Right
77,084                       the way it is .  ECONOMY
6,026         its central office . Political  Economy
130,531      the maintenance of her national  economy
20,685          on something else . Coolidge  economy  ! I am for it
132,603          railroads of this country .  Economy  ! What about this pitpible
Total Concordance Lines: 7753
Total Documents: 5480
Showing 5 lines
Page 1 of 1551
CPU times: user 871 ms, sys: 596 ms, total: 1.47 s
Wall time: 831 ms
Concordance for "economy"
US Congressional Speeches Subset 500k, Context tokens: 5, Order: 1R2R3R
Document Id                            Left  Node     Right
140,837             prayers are with them .  ECONOMY
162,997      its central office . Political  Economy
325,086           WHAT ARE CoNDrrONS IN THE  ECONOMY
64,711       country . Condition of Nations  Economy
360,787             country ! This spasm of  economy  !
Total Concordance Lines: 19399
Total Documents: 13564
Showing 5 lines
Page 1 of 3880
CPU times: user 2.83 s, sys: 1.42 s, total: 4.24 s
Wall time: 1.86 s
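The concordance reports above show each match of the node with five context tokens on either side, ordered by the tokens to the right of the node. The following is a minimal keyword-in-context sketch of that idea in plain Python. The `concordance` function is hypothetical and is not Conc's implementation. Reading `1R2R3R` as "sort by the first, then second, then third token to the right" follows common concordancer convention and is an assumption here, as is the case-insensitive matching suggested by lines such as "ECONOMY" and "Economy".

```python
def concordance(tokens, node, context=5):
    """Return (left, node, right) tuples for each case-insensitive match."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = tokens[max(0, i - context):i]
            right = tokens[i + 1:i + 1 + context]
            lines.append((left, token, right))
    # Sort by the first three tokens to the right of the node
    lines.sort(key=lambda line: [t.lower() for t in line[2][:3]])
    return lines

text = "We must grow our economy now . A strong economy creates jobs . The Economy !"
lines = concordance(text.split(), 'economy')

for left, node, right in lines:
    print(f"{' '.join(left):>30}  {node}  {' '.join(right)}")
```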
reference = Corpus().load(f'{save_path}brown.corpus')
for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)
    conc.set_reference_corpus(reference)

    del corpus
Keywords
Target corpus: US Congressional Speeches Subset 10k, Reference corpus: Brown Corpus
Rank  Token          Frequency  Frequency Reference  Normalized Frequency  Normalized Frequency Reference  Relative Risk  Log Ratio  Log Likelihood
1     unanimous      907        5                    5.13                  0.05                            100.57         6.65       748.42
2     amendment      4,039      24                   22.85                 0.24                            93.30          6.54       3,318.48
3     appropriation  716        5                    4.05                  0.05                            79.39          6.31       582.28
4     senator        5,488      39                   31.04                 0.40                            78.02          6.29       4,457.76
5     subcommittee   585        5                    3.31                  0.05                            64.87          6.02       468.73
Report based on word tokens
Filtered tokens by minimum frequency in target corpus (5), minimum frequency in reference corpus (5)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 1,767,904
Total word tokens in reference corpus: 980,144
Keywords: 8,291
Showing 5 rows
Page 1 of 1659
CPU times: user 369 ms, sys: 220 ms, total: 589 ms
Wall time: 94.3 ms
Keywords
Target corpus: US Congressional Speeches Subset 100k, Reference corpus: Brown Corpus
Rank  Token          Frequency  Frequency Reference  Normalized Frequency  Normalized Frequency Reference  Relative Risk  Log Ratio  Log Likelihood
1     unanimous      8,978      5                    4.98                  0.05                            97.66          6.61       895.70
2     amendment      39,940     24                   22.16                 0.24                            90.51          6.50       3,968.88
3     appropriation  6,847      5                    3.80                  0.05                            74.48          6.22       672.68
4     senator        52,772     39                   29.28                 0.40                            73.60          6.20       5,180.64
5     gentleman      32,178     28                   17.86                 0.29                            62.51          5.97       3,123.80
Report based on word tokens
Filtered tokens by minimum frequency in target corpus (5), minimum frequency in reference corpus (5)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 18,020,769
Total word tokens in reference corpus: 980,144
Keywords: 12,136
Showing 5 rows
Page 1 of 2428
CPU times: user 2.78 s, sys: 417 ms, total: 3.19 s
Wall time: 274 ms
Keywords
Target corpus: US Congressional Speeches Subset 200k, Reference corpus: Brown Corpus
Rank  Token          Frequency  Frequency Reference  Normalized Frequency  Normalized Frequency Reference  Relative Risk  Log Ratio  Log Likelihood
1     unanimous      17,813     5                    4.93                  0.05                            96.63          6.59       897.98
2     amendment      80,078     24                   22.16                 0.24                            90.50          6.50       4,023.10
3     appropriation  13,896     5                    3.85                  0.05                            75.38          6.24       690.81
4     senator        105,824    39                   29.28                 0.40                            73.60          6.20       5,252.88
5     gentleman      63,852     28                   17.67                 0.29                            61.85          5.95       3,132.10
Report based on word tokens
Filtered tokens by minimum frequency in target corpus (5), minimum frequency in reference corpus (5)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 36,136,744
Total word tokens in reference corpus: 980,144
Keywords: 12,704
Showing 5 rows
Page 1 of 2541
CPU times: user 6.71 s, sys: 451 ms, total: 7.16 s
Wall time: 516 ms
Keywords
Target corpus: US Congressional Speeches Subset 500k, Reference corpus: Brown Corpus
Rank  Token          Frequency  Frequency Reference  Normalized Frequency  Normalized Frequency Reference  Relative Risk  Log Ratio  Log Likelihood
1     unanimous      44,193     5                    4.89                  0.05                            95.89          6.58       898.23
2     amendment      198,132    24                   21.93                 0.24                            89.57          6.48       4,012.78
3     appropriation  34,215     5                    3.79                  0.05                            74.24          6.21       685.45
4     senator        264,478    39                   29.28                 0.40                            73.57          6.20       5,295.45
5     gentleman      159,877    28                   17.70                 0.29                            61.95          5.95       3,163.94
Report based on word tokens
Filtered tokens by minimum frequency in target corpus (5), minimum frequency in reference corpus (5)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 90,341,944
Total word tokens in reference corpus: 980,144
Keywords: 13,118
Showing 5 rows
Page 1 of 2624
CPU times: user 20 s, sys: 802 ms, total: 20.8 s
Wall time: 1.17 s
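The keyness columns above can be related to the raw frequencies with standard corpus-linguistics formulas. The sketch below assumes that Relative Risk is the ratio of relative frequencies, that Log Ratio is its base-2 logarithm, and that Log Likelihood is the usual two-corpus G² statistic. The `keyness` function is a hypothetical helper and may not match Conc's implementation in detail, but with the inputs for "unanimous" in the 10k corpus it gives values close to the reported row (100.57, 6.65, 748.42).

```python
import math

def keyness(freq_target, total_target, freq_ref, total_ref):
    """Return (relative_risk, log_ratio, log_likelihood) for one token.

    Assumes both frequencies are greater than zero.
    """
    relative_risk = (freq_target / total_target) / (freq_ref / total_ref)
    log_ratio = math.log2(relative_risk)

    # Expected frequencies if the token were equally common in both corpora
    combined = freq_target + freq_ref
    total = total_target + total_ref
    expected_target = total_target * combined / total
    expected_ref = total_ref * combined / total

    log_likelihood = 2 * (
        freq_target * math.log(freq_target / expected_target)
        + freq_ref * math.log(freq_ref / expected_ref)
    )
    return relative_risk, log_ratio, log_likelihood

# "unanimous": 907 of 1,767,904 target tokens; 5 of 980,144 reference tokens
rr, lr, ll = keyness(907, 1_767_904, 5, 980_144)
print(f"{rr:.2f} {lr:.2f} {ll:.2f}")
```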
for slug, name in test_corpora.items():
    corpus = Corpus().load(f'{save_path}{slug}.corpus')
    conc = Conc(corpus)

    del corpus
Collocates of "economy"
US Congressional Speeches Subset 10k
Rank  Token       Collocate Frequency  Frequency  Logdice  Log Likelihood
1     economy     20                   358        9.84     248.59
2     healthy     10                   50         9.65     74.41
3     segment     9                    24         9.59     80.17
4     our         93                   5,938      8.92     221.67
5     false       6                    55         8.89     36.86
Report based on word tokens
Context tokens left: 5, context tokens right: 5
Filtered tokens by minimum collocation frequency (5)
Unique collocates: 115
Showing 5 rows
Page 1 of 24
CPU times: user 123 ms, sys: 21.9 ms, total: 145 ms
Wall time: 54.8 ms
Collocates of "economy"
US Congressional Speeches Subset 100k
Rank  Token       Collocate Frequency  Frequency  Logdice  Log Likelihood
1     our         1,084                60,051     9.12     2,801.70
2     efficiency  60                   732        8.77     329.93
3     stimulate   51                   299        8.69     358.80
4     global      55                   618        8.69     311.68
5     jobs        83                   3,100      8.63     274.52
Report based on word tokens
Context tokens left: 5, context tokens right: 5
Filtered tokens by minimum collocation frequency (5)
Unique collocates: 864
Showing 5 rows
Page 1 of 173
CPU times: user 413 ms, sys: 244 ms, total: 657 ms
Wall time: 328 ms
Collocates of "economy"
US Congressional Speeches Subset 200k
Rank  Token       Collocate Frequency  Frequency  Logdice  Log Likelihood
1     our         2,219                121,489    9.14     5,670.76
2     global      119                  1,221      8.76     689.99
3     sector      119                  1,741      8.68     604.09
4     stimulate   101                  611        8.63     698.06
5     jobs        166                  6,312      8.60     534.83
Report based on word tokens
Context tokens left: 5, context tokens right: 5
Filtered tokens by minimum collocation frequency (5)
Unique collocates: 1,524
Showing 5 rows
Page 1 of 305
CPU times: user 710 ms, sys: 212 ms, total: 922 ms
Wall time: 523 ms
Collocates of "economy"
US Congressional Speeches Subset 500k
Rank  Token       Collocate Frequency  Frequency  Logdice  Log Likelihood
1     our         5,656                304,919    9.16     14,596.00
2     stimulate   267                  1,472      8.71     1,898.46
3     global      283                  2,924      8.70     1,636.06
4     jobs        418                  15,339     8.62     1,373.41
5     economy     446                  19,399     8.56     5,491.13
Report based on word tokens
Context tokens left: 5, context tokens right: 5
Filtered tokens by minimum collocation frequency (5)
Unique collocates: 2,786
Showing 5 rows
Page 1 of 558
CPU times: user 2.14 s, sys: 589 ms, total: 2.73 s
Wall time: 1.34 s
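The Logdice column can be checked against the other columns using the standard logDice formula, 14 + log2(2 × f(node, collocate) / (f(node) + f(collocate))). The sketch below assumes that this is the formula used and that the node frequency is the corpus frequency of "economy" (358 in the 10k corpus, as shown in that report). The `log_dice` function is a hypothetical helper, not Conc's implementation.

```python
import math

def log_dice(collocate_frequency, node_frequency, token_frequency):
    """logDice score for a node and collocate pair.

    collocate_frequency: co-occurrences of the collocate within the context window
    node_frequency: corpus frequency of the node
    token_frequency: corpus frequency of the collocate
    """
    dice = 2 * collocate_frequency / (node_frequency + token_frequency)
    return 14 + math.log2(dice)

# 10k corpus, node "economy" (frequency 358)
healthy = log_dice(10, 358, 50)      # reported as 9.65
our = log_dice(93, 358, 5_938)       # reported as 8.92

print(round(healthy, 2), round(our, 2))
```

Because logDice depends only on these three frequencies and not on corpus size, scores for the same collocate stay broadly similar as the corpus grows (for example, "our" scores between 8.92 and 9.16 across the four corpora above).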