
Quick Conc Recipes

Code snippets for common tasks in Conc.

The Get started with Conc tutorial (in progress) is a detailed walk through Conc functionality. This page provides simple code recipes for common Conc tasks. See the Conc API Reference for information on available methods, functions and parameters.

Install Conc

Basic install

See the installation page if you have a pre-2013 device, want to install the latest development version, or want to install optional dependencies.

Conc is tested with Python 3.10+. Create a new environment (with venv, conda or similar) and run the following command in your terminal to install Conc. If you are working within a notebook environment, you can usually run commands in a code cell by prefixing the command with an ! symbol.

pip install conc
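For example, in a notebook code cell the same install command can usually be run like this:

!pip install conc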

The Conc install process installs spaCy, but you will need a spaCy model. Once the package and its dependencies are installed, run the following command to download the English language model for spaCy. If you want to use a different language, consult the spaCy models page for information on available models.

python -m spacy download en_core_web_sm

Building or loading a corpus for analysis

Build a Conc corpus from text files (or a compressed archive of text files) and prepare to report on it

The example assumes you’ve set the name and description of your corpus, as well as the path_to_source_file and save_path variables. The path_to_source_file can be a directory of text files or a compressed archive of text files (e.g. .zip, .tar.gz).
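For illustration, you might define these variables like this (the paths below are placeholders, adjust them for your own files):

name = 'Garden Party Corpus'
description = 'A corpus of short stories from The Garden Party: and Other Stories by Katherine Mansfield.'
path_to_source_file = '/path/to/garden-party-texts/' # a directory of .txt files or a compressed archive
save_path = '/path/to/save/corpora/' # where the built corpus will be saved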

from conc.corpus import Corpus
from conc.conc import Conc
corpus = Corpus(name=name, description=description).build_from_files(source_path = path_to_source_file, save_path = save_path)
corpus.summary()
conc = Conc(corpus=corpus) # prerequisite for running reports
Corpus Summary
Attribute Value
Name Garden Party Corpus
Description A corpus of short stories from The Garden Party: and Other Stories by Katherine Mansfield. Texts downloaded from Project Gutenberg https://gutenberg.org/ and are in the public domain. The text files contain the short story without the title. https://github.com/ucdh/scraping-garden-party
Date Created 2025-06-23 13:16:12
Conc Version 0.1.4
Corpus Path /home/geoff/data/conc-test-corpora/garden-party.corpus
Document Count 15
Token Count 74,664
Word Token Count 59,514
Unique Tokens 5,410
Unique Word Tokens 5,392
Build a Conc corpus from a CSV (or a compressed CSV) and prepare to report on it

The example assumes you’ve set the name and description of your corpus, as well as the path_to_source_file and save_path variables. The path_to_source_file can be a .csv file or a compressed .csv.gz file. This example assumes the CSV has a column called text containing the text to be analysed and another column called source, which will be retained as metadata.
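Purely as an illustration of the expected layout (your column contents will differ), the first rows of such a CSV might look like this:

text,source
"Full text of the first document ...",newspaper-a
"Full text of the second document ...",newspaper-b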

from conc.corpus import Corpus
from conc.conc import Conc
corpus = Corpus(name=name, description=description).build_from_csv(source_path=path_to_source_file, save_path=save_path, text_column='text', metadata_columns=['source'])
corpus.summary()
conc = Conc(corpus=corpus) # prerequisite for running reports
Corpus Summary
Attribute Value
Name Brown Corpus
Description A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA Revised 1971, Revised and Amplified 1979 http://www.hit.uib.no/icame/brown/bcm.html. This version downloaded via NLTK https://www.nltk.org/nltk_data/.
Date Created 2025-06-23 13:16:15
Conc Version 0.1.4
Corpus Path /home/geoff/data/conc-test-corpora/brown.corpus
Document Count 500
Token Count 1,138,566
Word Token Count 980,144
Unique Tokens 42,930
Unique Word Tokens 42,907
Load a Conc corpus and prepare to report on it

The example assumes you’ve previously built a corpus with Conc and set the corpus_path variable (i.e. the path to the corpus created when you built it).
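For example (the path is a placeholder for wherever your built corpus was saved, typically a directory ending in .corpus):

corpus_path = '/path/to/save/corpora/garden-party.corpus'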

from conc.corpus import Corpus
from conc.conc import Conc
corpus = Corpus().load(corpus_path=corpus_path)
corpus.summary()
conc = Conc(corpus=corpus) # prerequisite for running reports
Corpus Summary
Attribute Value
Name Garden Party Corpus
Description A corpus of short stories from The Garden Party: and Other Stories by Katherine Mansfield. Texts downloaded from Project Gutenberg https://gutenberg.org/ and are in the public domain. The text files contain the short story without the title. https://github.com/ucdh/scraping-garden-party
Date Created 2025-06-23 13:16:12
Conc Version 0.1.4
Corpus Path /home/geoff/data/conc-test-corpora/garden-party.corpus
Document Count 15
Token Count 74,664
Word Token Count 59,514
Unique Tokens 5,410
Unique Word Tokens 5,392

Reports for corpus analysis

The example snippets below assume you’ve built or loaded a corpus and prepared it for reporting (see above).

Frequency report
conc.frequencies().display()
Frequencies
Frequencies of word tokens, Garden Party Corpus
Rank Token Frequency Normalized Frequency
1 the 2,911 489.13
2 and 1,798 302.11
3 a 1,407 236.41
4 to 1,376 231.21
5 she 1,171 196.76
6 was 1,102 185.17
7 it 1,021 171.56
8 her 937 157.44
9 of 908 152.57
10 i 719 120.81
11 he 718 120.64
12 in 683 114.76
13 that 643 108.04
14 you 642 107.87
15 ’s 524 88.05
16 n’t 522 87.71
17 said 514 86.37
18 on 504 84.69
19 had 469 78.80
20 his 440 73.93
Report based on word tokens
Normalized Frequency is per 10,000 tokens
Total word tokens: 59,514
Unique word tokens: 5,392
Showing 20 rows
Page 1 of 270
Ngram frequencies report
conc.ngram_frequencies(ngram_length = 2).display()
Ngram Frequencies
Garden Party Corpus
Rank Ngram Frequency Normalized Frequency
1 it was 247 41.50
2 in the 214 35.96
3 on the 183 30.75
4 of the 156 26.21
5 to the 139 23.36
6 at the 133 22.35
7 she was 132 22.18
8 and the 124 20.84
9 it ’s 120 20.16
10 do n’t 105 17.64
11 they were 104 17.47
12 he was 100 16.80
13 she had 95 15.96
14 and she 88 14.79
15 could n’t 88 14.79
16 to be 88 14.79
17 she said 87 14.62
18 there was 84 14.11
19 a little 83 13.95
20 out of 83 13.95
Report based on word tokens
Ngram length: 2
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 24,351
Total ngrams: 47,706
Showing 20 rows
Page 1 of 1218
Keywords report

For a keywords report you will need another corpus to compare against. This example assumes you’ve already built that corpus and defined the reference_corpus_path variable. When you load the reference corpus, assign it to reference_corpus or some other variable name that helps you distinguish it from the corpus you are reporting on.
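For example, the reference corpus path might be set like this (a placeholder path to a corpus built earlier):

reference_corpus_path = '/path/to/save/corpora/brown.corpus'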

reference_corpus = Corpus().load(corpus_path=reference_corpus_path)
conc.set_reference_corpus(reference_corpus)
conc.keywords(min_document_frequency = 5, min_document_frequency_reference = 5).display()
Keywords
Target corpus: Garden Party Corpus, Reference corpus: Brown Corpus
Rank Token Frequency Frequency Reference Normalized Frequency Normalized Frequency Reference Relative Risk Log Ratio Log Likelihood
1 bye 25 7 4.20 0.07 58.82 5.88 110.23
2 velvet 14 5 2.35 0.05 46.11 5.53 58.78
3 shone 13 5 2.18 0.05 42.82 5.42 53.69
4 queer 15 6 2.52 0.06 41.17 5.36 61.39
5 gloves 17 7 2.86 0.07 40.00 5.32 69.11
6 cried 59 26 9.91 0.27 37.37 5.22 235.92
7 darling 36 18 6.05 0.18 32.94 5.04 139.33
8 faintly 14 7 2.35 0.07 32.94 5.04 54.18
9 oh 149 93 25.04 0.95 26.39 4.72 540.97
10 handkerchief 14 9 2.35 0.09 25.62 4.68 50.36
11 dear 78 54 13.11 0.55 23.79 4.57 273.99
12 breathed 13 9 2.18 0.09 23.79 4.57 45.67
13 awful 23 17 3.86 0.17 22.28 4.48 79.04
14 ah 26 20 4.37 0.20 21.41 4.42 88.12
15 breast 14 11 2.35 0.11 20.96 4.39 47.09
16 dashed 10 8 1.68 0.08 20.59 4.36 33.42
17 gasped 6 5 1.01 0.05 19.76 4.30 19.76
18 parted 6 5 1.01 0.05 19.76 4.30 19.76
19 timid 6 5 1.01 0.05 19.76 4.30 19.76
20 sigh 13 11 2.18 0.11 19.46 4.28 42.56
Report based on word tokens
Filtered tokens by minimum document frequency in target corpus (5), minimum document frequency in reference corpus (5)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 59,514
Total word tokens in reference corpus: 980,144
Keywords: 825
Showing 20 rows
Page 1 of 42
Collocates report
conc.collocates('could').display()
Collocates of "could"
Garden Party Corpus
Rank Token Collocate Frequency Frequency Logdice Log Likelihood
1 n’t 111 522 12.28 233.85
2 have 35 240 11.33 50.03
3 she 94 1,171 11.13 52.86
4 he 55 718 10.93 27.90
5 they 35 398 10.89 23.65
6 it 71 1,021 10.89 28.44
7 be 26 251 10.86 23.35
8 help 12 27 10.71 44.49
9 what 23 270 10.63 14.61
10 not 19 229 10.48 11.45
11 but 25 417 10.36 6.43
12 how 13 126 10.32 11.60
13 could 16 207 10.31 107.37
14 why 11 99 10.20 11.00
15 hardly 8 17 10.19 30.81
16 do 16 242 10.19 5.58
17 understand 8 19 10.18 28.60
18 that 30 643 10.18 2.39
19 was 44 1,102 10.11 0.79
20 no 13 181 10.10 5.66
Report based on word tokens
Context tokens left: 5, context tokens right: 5
Filtered tokens by minimum collocation frequency (5)
Unique collocates: 83
Showing 20 rows
Page 1 of 5
Concordance
conc.concordance('could', order = '1R2R3R', context_length = 10).display()
Concordance for "could"
Garden Party Corpus, Context tokens: 10, Order: 1R2R3R
Doc Id Left Node Right
4 herself and got as close to the sea as she could , and sung something , something she had made up
11 I sat up and called out as loud as I could , “ _ I do want to go on a
13 ’ll not be a minute . ” And before he could answer she was gone . He had half a mind
4 until I ’ve had something . Do you think we could ask Kate for two cups of hot water ? ”
2 , and there will be no time to explain what could be explained so simply .... But to - night it
1 Here ’s this huge house and garden . Surely you could be happy in — in — appreciating it for a
12 away from the Listening Ear . Good Heavens , what could be more tragic than that lament ! Every note was
1 even a successful , established , big paying concern — could be played with . A man had either to put
8 play . It was exactly like a play . Who could believe the sky at the back was n’t painted ?
13 . Was her luggage ready ? In that case they could cut off sharp with her cabin luggage and let the
2 but , after all , they ’re boys . I could cut off to sea , or get a job up
2 one leg over . But which leg ? She never could decide . And when she did finally put one leg
10 loved having to arrange things ; she always felt she could do it so much better than anybody else . Four
2 sitting there in the washhouse ; it was all they could do not to burst into a little chorus of animals
11 to practise . ” Oh , it was all I could do not to burst out crying . I went over
7 . But after supper they were all so tired they could do nothing but yawn until it was late enough to
14 astonished Fenella . “ You did n’t think your grandma could do that , did you ? ” said she .
2 wanted , somehow , to celebrate the fact that they could do what they liked now . There was no man
9 bent over her . This was such bliss that he could dream no further . But it gave him the courage
6 away — anywhere , as though by walking away he could escape .... It was cold in the street . There
Total Concordance Lines: 207
Total Documents: 15
Showing 20 lines
Page 1 of 11
Concordance plot
conc.concordance_plot('he').display()

[Concordance plot for "he", Garden Party Corpus; Total Documents: 15, Total Concordance Lines: 718]
You can still output individual texts (and access the text and tokens programmatically)! Here is how …

The texts are accessible via the Corpus object. If you display a text (as below) it will show any available metadata. Add show_metadata=False to just show the text. If you leave off the max_tokens parameter, it will display the entire text.

corpus.text(2).display(max_tokens = 300)
Metadata
Attribute Value
document_id 2
file at-the-bay.txt
I Very early morning. The sun was not yet risen, and the whole of Crescent Bay was hidden under a white sea-mist. The big bush-covered hills at the back were smothered. You could not see where they ended and the paddocks and bungalows began. The sandy road was gone and the paddocks and bungalows the other side of it; there were no white dunes covered with reddish grass beyond them; there was nothing to mark which was beach and where was the sea. A heavy dew had fallen. The grass was blue. Big drops hung on the bushes and just did not fall; the silvery, fluffy toi-toi was limp on its long stalks, and all the marigolds and the pinks in the bungalow gardens were bowed to the earth with wetness. Drenched were the cold fuchsias, round pearls of dew lay on the flat nasturtium leaves. It looked as though the sea had beaten up softly in the darkness, as though one immense wave had come rippling, rippling—how far? Perhaps if you had waked up in the middle of the night you might have seen a big fish flicking in at the window and gone again.... Ah-Aah! sounded the sleepy sea. And from the bush there came the sound of little streams flowing, quickly, lightly, slipping between the smooth stones, gushing into ferny basins and out again; and there was the splashing of big drops on large leaves, and something else—… [300 of 18238 tokens]
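To display just the text without the metadata table shown above, add show_metadata=False (as mentioned above), for example:

corpus.text(2).display(show_metadata = False, max_tokens = 300)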

Texts are also accessible as strings if you want to work with them like that …

text = corpus.text(1).as_string(max_tokens = 50)
print(text)
That evening for the first time in his life, as he pressed through the
swing door and descended the three broad steps to the pavement, old Mr.
Neave felt he was too old for the spring. Spring—warm, eager,
restless—

or as tokens (this example loops through all texts in the corpus and displays the first 10 tokens) …

for i in range(1, corpus.document_count + 1):
    tokens = corpus.text(i).as_tokens() # this is all the tokens
    print(tokens[:10]) # slicing to first 10 tokens
['That', 'evening', 'for', 'the', 'first', 'time', 'in', 'his', 'life', ',']
['I', '\r\n\r\n', 'Very', 'early', 'morning', '.', 'The', 'sun', 'was', 'not']
['A', 'stout', 'man', 'with', 'a', 'pink', 'face', 'wears', 'dingy', 'white']
['I', '\r\n\r\n', 'The', 'week', 'after', 'was', 'one', 'of', 'the', 'busiest']
['Exactly', 'when', 'the', 'ball', 'began', 'Leila', 'would', 'have', 'found', 'it']
['When', 'the', 'literary', 'gentleman', ',', 'whose', 'flat', 'old', 'Ma', 'Parker']
['On', 'his', 'way', 'to', 'the', 'station', 'William', 'remembered', 'with', 'a']
['Although', 'it', 'was', 'so', 'brilliantly', 'fine', '—', 'the', 'blue', 'sky']
['Of', 'course', 'he', 'knew', '—', 'no', 'man', 'better', '—', 'that']
['And', 'after', 'all', 'the', 'weather', 'was', 'ideal', '.', 'They', 'could']
['_', 'Eleven', 'o’clock', '.', 'A', 'knock', 'at', 'the', 'door', '.']
['With', 'despair', '—', 'cold', ',', 'sharp', 'despair', '—', 'buried', 'deep']
['It', 'seemed', 'to', 'the', 'little', 'crowd', 'on', 'the', 'wharf', 'that']
['The', 'Picton', 'boat', 'was', 'due', 'to', 'leave', 'at', 'half', '-']
['In', 'her', 'blue', 'dress', ',', 'with', 'her', 'cheeks', 'lightly', 'flushed']

Working with Conc results

Accessing and working with Conc results

In the examples above we run reports and use .display() to output them. You can access report data directly as a Polars dataframe using .to_frame(), which means you can work with the data in Polars to further filter the results or extend your analysis.

# specific tokens to restrict the results below to
pronouns = ['i', 'you', 'he', 'she', 'it', 'we', 'they', 'me', 'him', 'her', 'us', 'them', 'mine', 'yours', 
            'his', 'hers', 'its', 'ours', 'theirs']

# retrieve keyword report for these pronouns, which has normalized frequencies for each corpus
# the next line returns the result as a Polars dataframe using the `to_frame` method
df = conc.keywords(restrict_tokens = pronouns).to_frame()
# now go for it - work with the Polars dataframe
# e.g. show a table of results showing specific columns, display top 10 results ordered by normalized frequency descending
df.select(['token', 'normalized_frequency', 'normalized_frequency_reference']).sort('normalized_frequency', descending=True).head(10)
token normalized_frequency normalized_frequency_reference
"she" 196.760426 21.01732
"it" 171.556272 71.326254
"her" 157.441946 29.454856
"i" 120.81191 44.585285
"he" 120.643882 69.091889
"you" 107.873778 33.311432
"his" 73.932184 66.551446
"they" 66.875021 28.995739
"them" 35.285815 18.232015
"him" 32.093289 26.720564
Working with Conc results using Pandas

You are not restricted to Conc for your analysis. Conc report data can be exported to other formats. For example, although Conc uses Polars internally for its efficiency, you can convert report results into a Pandas dataframe, which is flexible and interoperable with many Python libraries for data analysis. Here is an example …

import pandas as pd # Conc does not install Pandas - so if you haven't already, install it with "pip install pandas"
# same specific tokens to restrict the results below to
pronouns = ['i', 'you', 'he', 'she', 'it', 'we', 'they', 'me', 'him', 'her', 'us', 'them', 'mine', 'yours', 
            'his', 'hers', 'its', 'ours', 'theirs']

# retrieve keyword report for these pronouns, which has normalized frequencies for each corpus
# the next line returns the result as a Polars dataframe, which is then converted to a Pandas dataframe
df = conc.keywords(restrict_tokens = pronouns).to_frame().to_pandas()
# now go for it - do the pandas stuff you are familiar with ...
# e.g. show a table of results showing specific columns, display top 10 results ordered by normalized frequency descending
# note you could do this in Polars (see above), but this is assuming you are more familiar with Pandas
df.sort_values(by='normalized_frequency', ascending=False)[['token', 'normalized_frequency', 'normalized_frequency_reference']].head(10)
token normalized_frequency normalized_frequency_reference
0 she 196.760426 21.017320
7 it 171.556272 71.326254
2 her 157.441946 29.454856
6 i 120.811910 44.585285
12 he 120.643882 69.091889
5 you 107.873778 33.311432
15 his 73.932184 66.551446
9 they 66.875021 28.995739
10 them 35.285815 18.232015
14 him 32.093289 26.720564
Want to work with Conc results using other software/libraries? Exporting Conc results to CSV, JSON and other formats

If you want to work with Conc results in other libraries or software, you can write the results to an interoperable data format. Conc results can be accessed as Polars dataframes, which can be exported to CSV or JSON formats. See the Polars input/output documentation for other formats supported, including Parquet, newline-delimited JSON, Excel and more.

# same specific tokens to restrict the results below to
pronouns = ['i', 'you', 'he', 'she', 'it', 'we', 'they', 'me', 'him', 'her', 'us', 'them', 'mine', 'yours', 
            'his', 'hers', 'its', 'ours', 'theirs']

# retrieve keyword report for these pronouns, which has normalized frequencies for each corpus
# the next line returns the result as a Polars dataframe using the `to_frame` method
df = conc.keywords(restrict_tokens = pronouns).to_frame()

The result in df is a Polars dataframe; write it to a CSV file like this …

df.write_csv('pronoun-results.csv')

Here is how to write the results to a JSON file …

df.write_json('pronoun-results.json')
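Other Polars output formats work the same way; for example, writing the results to a Parquet file (this uses standard Polars functionality rather than anything Conc-specific) …

df.write_parquet('pronoun-results.parquet')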

Working with Conc corpus data

If Conc does not do something you want, you can work with the Conc corpus data directly.

Working with a loaded Corpus

When you build or load a corpus, the token, vocab and other data are accessible via the Corpus object. These can be very large and are not loaded into memory, but they can be queried via the Polars lazy API. Read about the Anatomy of a Conc Corpus for more information. Here is an example …

import polars as pl
# e.g. filter vocab by tokens ending in 'ing'
corpus.vocab.filter(pl.col('token').str.ends_with('ing')).head(5).collect(engine='streaming')
rank tokens_sort_order token_id token frequency_lower frequency_orth is_punct is_space
99 4979 5920 "something" 100 96 false false
171 2275 4862 "going" 59 59 false false
205 3206 5170 "looking" 48 46 false false
214 5509 830 "thing" 44 44 false false
236 3650 4679 "nothing" 51 41 false false
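As a further sketch using the same vocab columns shown above (and assuming, as in the example above, that corpus.vocab is a Polars LazyFrame), here is a lazy query for the ten most frequent tokens, excluding punctuation and whitespace tokens:

# e.g. ten most frequent tokens that are not punctuation or whitespace
corpus.vocab.filter(~pl.col('is_punct') & ~pl.col('is_space')).sort('frequency_orth', descending=True).head(10).collect(engine='streaming')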
Working with Conc corpus data files directly

There are a number of different data files that make up a Conc corpus. These can be queried directly if required. Read about the Anatomy of a Conc Corpus for an explanation of the format and data. Here is an example …

import polars as pl
# e.g. lazy query to get tokens from vocab ending in 'ing'
display(pl.scan_parquet(f'{save_path}garden-party.corpus/vocab.parquet').filter(pl.col('token').str.ends_with('ing')).head(5).collect(engine='streaming'))
rank tokens_sort_order token_id token frequency_lower frequency_orth is_punct is_space
99 4979 5920 "something" 100 96 false false
171 2275 4862 "going" 59 59 false false
205 3206 5170 "looking" 48 46 false false
214 5509 830 "thing" 44 44 false false
236 3650 4679 "nothing" 51 41 false false