Quick Conc Recipes
The Get started with Conc tutorial (in progress) is a detailed step-by-step walk through Conc functionality. This page provides simple code recipes for common Conc tasks. See the Conc API Reference for information on available methods, functions and parameters.
Install Conc
See the installation page if you have an older (pre-2013) machine, want to install the latest development version, or want to install optional dependencies.
Conc is tested with Python 3.10+. Create a new environment (with venv, conda or similar) and run the following command in your terminal to install Conc. If you are working within a notebook environment, you can usually run shell commands in a code cell by prefixing the command with an `!` symbol.
pip install conc
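For example, in a notebook code cell:

!pip install conc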
The Conc install process installs spaCy, but you will need a spaCy model. Once the package and its dependencies are installed, run the following command to download the English language model for spaCy. If you want to use a different language, consult the spaCy models page for information on available models.
python -m spacy download en_core_web_sm
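To confirm the model downloaded correctly, you can try loading it with spaCy (a quick sanity check, not required by Conc):

import spacy

nlp = spacy.load('en_core_web_sm') # raises OSError if the model is not installed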
Building or loading a corpus for analysis
This example assumes you've set the `name` and `description` of your corpus, as well as the `path_to_source_file` and `save_path` variables. The `path_to_source_file` can be a directory of text files or a compressed archive of text files (e.g. .zip, .tar.gz).
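As a sketch, the assumed variables might look like this (the values are illustrative; substitute your own paths):

name = 'Garden Party Corpus'
description = 'Short stories from The Garden Party by Katherine Mansfield.'
path_to_source_file = 'source-texts/garden-party/' # a directory of .txt files, or a .zip/.tar.gz archive
save_path = 'corpora/' # directory where the built corpus will be saved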
from conc.corpus import Corpus
from conc.conc import Conc

corpus = Corpus(name=name, description=description).build_from_files(source_path=path_to_source_file, save_path=save_path)
corpus.summary()
conc = Conc(corpus=corpus) # prerequisite for running reports
Corpus Summary

Attribute | Value |
---|---|
Name | Garden Party Corpus |
Description | A corpus of short stories from The Garden Party: and Other Stories by Katherine Mansfield. Texts downloaded from Project Gutenberg https://gutenberg.org/ and are in the public domain. The text files contain the short story without the title. https://github.com/ucdh/scraping-garden-party |
Date Created | 2025-06-23 13:16:12 |
Conc Version | 0.1.4 |
Corpus Path | /home/geoff/data/conc-test-corpora/garden-party.corpus |
Document Count | 15 |
Token Count | 74,664 |
Word Token Count | 59,514 |
Unique Tokens | 5,410 |
Unique Word Tokens | 5,392 |
This example assumes you've set the `name` and `description` of your corpus, as well as the `path_to_source_file` and `save_path` variables. The `path_to_source_file` can be a .csv file or a compressed .csv.gz file. The example also assumes the CSV has a column called `text` containing the text to be analysed and another column called `source`, which will be retained as metadata.
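If you are unsure of the expected input shape, here is a minimal illustrative CSV written with Polars (the file name and contents are hypothetical; only the `text` and `source` column names matter for this recipe):

import polars as pl

# illustrative only: one row per document, with 'text' and a 'source' metadata column
pl.DataFrame({
    'source': ['doc-one.txt', 'doc-two.txt'],
    'text': ['Text of the first document ...', 'Text of the second document ...'],
}).write_csv('example-corpus.csv')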
from conc.corpus import Corpus
from conc.conc import Conc
corpus = Corpus(name=name, description=description).build_from_csv(source_path=path_to_source_file, save_path=save_path, text_column='text', metadata_columns=['source'])
corpus.summary()
conc = Conc(corpus=corpus) # prerequisite for running reports
Corpus Summary

Attribute | Value |
---|---|
Name | Brown Corpus |
Description | A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA Revised 1971, Revised and Amplified 1979 http://www.hit.uib.no/icame/brown/bcm.html. This version downloaded via NLTK https://www.nltk.org/nltk_data/. |
Date Created | 2025-06-23 13:16:15 |
Conc Version | 0.1.4 |
Corpus Path | /home/geoff/data/conc-test-corpora/brown.corpus |
Document Count | 500 |
Token Count | 1,138,566 |
Word Token Count | 980,144 |
Unique Tokens | 42,930 |
Unique Word Tokens | 42,907 |
This example assumes you've previously built a corpus with Conc and set the `corpus_path` variable (i.e. the corpus path created when you built your corpus).
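For example (an illustrative path, following the save_path and corpus name used when building):

corpus_path = 'corpora/garden-party.corpus' # the directory created when the corpus was built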
from conc.corpus import Corpus
from conc.conc import Conc
corpus = Corpus().load(corpus_path=corpus_path)
corpus.summary()
conc = Conc(corpus=corpus) # prerequisite for running reports
Corpus Summary

Attribute | Value |
---|---|
Name | Garden Party Corpus |
Description | A corpus of short stories from The Garden Party: and Other Stories by Katherine Mansfield. Texts downloaded from Project Gutenberg https://gutenberg.org/ and are in the public domain. The text files contain the short story without the title. https://github.com/ucdh/scraping-garden-party |
Date Created | 2025-06-23 13:16:12 |
Conc Version | 0.1.4 |
Corpus Path | /home/geoff/data/conc-test-corpora/garden-party.corpus |
Document Count | 15 |
Token Count | 74,664 |
Word Token Count | 59,514 |
Unique Tokens | 5,410 |
Unique Word Tokens | 5,392 |
Reports for corpus analysis
The example snippets below assume you’ve built or loaded a corpus and prepared it for reporting (see above).
conc.frequencies().display()
Frequencies
Frequencies of word tokens, Garden Party Corpus

Rank | Token | Frequency | Normalized Frequency |
---|---|---|---|
1 | the | 2,911 | 489.13 |
2 | and | 1,798 | 302.11 |
3 | a | 1,407 | 236.41 |
4 | to | 1,376 | 231.21 |
5 | she | 1,171 | 196.76 |
6 | was | 1,102 | 185.17 |
7 | it | 1,021 | 171.56 |
8 | her | 937 | 157.44 |
9 | of | 908 | 152.57 |
10 | i | 719 | 120.81 |
11 | he | 718 | 120.64 |
12 | in | 683 | 114.76 |
13 | that | 643 | 108.04 |
14 | you | 642 | 107.87 |
15 | ’s | 524 | 88.05 |
16 | n’t | 522 | 87.71 |
17 | said | 514 | 86.37 |
18 | on | 504 | 84.69 |
19 | had | 469 | 78.80 |
20 | his | 440 | 73.93 |

Report based on word tokens
Normalized Frequency is per 10,000 tokens
Total word tokens: 59,514
Unique word tokens: 5,392
Showing 20 rows
Page 1 of 270
conc.ngram_frequencies(ngram_length=2).display()
Ngram Frequencies
Garden Party Corpus

Rank | Ngram | Frequency | Normalized Frequency |
---|---|---|---|
1 | it was | 247 | 41.50 |
2 | in the | 214 | 35.96 |
3 | on the | 183 | 30.75 |
4 | of the | 156 | 26.21 |
5 | to the | 139 | 23.36 |
6 | at the | 133 | 22.35 |
7 | she was | 132 | 22.18 |
8 | and the | 124 | 20.84 |
9 | it ’s | 120 | 20.16 |
10 | do n’t | 105 | 17.64 |
11 | they were | 104 | 17.47 |
12 | he was | 100 | 16.80 |
13 | she had | 95 | 15.96 |
14 | and she | 88 | 14.79 |
15 | could n’t | 88 | 14.79 |
16 | to be | 88 | 14.79 |
17 | she said | 87 | 14.62 |
18 | there was | 84 | 14.11 |
19 | a little | 83 | 13.95 |
20 | out of | 83 | 13.95 |

Report based on word tokens
Ngram length: 2
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 24,351
Total ngrams: 47,706
Showing 20 rows
Page 1 of 1218
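The same report can be run for other ngram lengths; for example, trigram frequencies:

conc.ngram_frequencies(ngram_length=3).display()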
For a keywords report you will need another corpus to compare against. This example assumes you've already built one and defined the `reference_corpus_path` variable. When you load the reference corpus, assign it to `reference_corpus` or some other variable name that helps you distinguish it from the corpus you are reporting on.
reference_corpus = Corpus().load(corpus_path=reference_corpus_path)
conc.set_reference_corpus(reference_corpus)
conc.keywords(min_document_frequency=5, min_document_frequency_reference=5).display()
Keywords
Target corpus: Garden Party Corpus, Reference corpus: Brown Corpus

Rank | Token | Frequency | Frequency Reference | Normalized Frequency | Normalized Frequency Reference | Relative Risk | Log Ratio | Log Likelihood |
---|---|---|---|---|---|---|---|---|
1 | bye | 25 | 7 | 4.20 | 0.07 | 58.82 | 5.88 | 110.23 |
2 | velvet | 14 | 5 | 2.35 | 0.05 | 46.11 | 5.53 | 58.78 |
3 | shone | 13 | 5 | 2.18 | 0.05 | 42.82 | 5.42 | 53.69 |
4 | queer | 15 | 6 | 2.52 | 0.06 | 41.17 | 5.36 | 61.39 |
5 | gloves | 17 | 7 | 2.86 | 0.07 | 40.00 | 5.32 | 69.11 |
6 | cried | 59 | 26 | 9.91 | 0.27 | 37.37 | 5.22 | 235.92 |
7 | darling | 36 | 18 | 6.05 | 0.18 | 32.94 | 5.04 | 139.33 |
8 | faintly | 14 | 7 | 2.35 | 0.07 | 32.94 | 5.04 | 54.18 |
9 | oh | 149 | 93 | 25.04 | 0.95 | 26.39 | 4.72 | 540.97 |
10 | handkerchief | 14 | 9 | 2.35 | 0.09 | 25.62 | 4.68 | 50.36 |
11 | dear | 78 | 54 | 13.11 | 0.55 | 23.79 | 4.57 | 273.99 |
12 | breathed | 13 | 9 | 2.18 | 0.09 | 23.79 | 4.57 | 45.67 |
13 | awful | 23 | 17 | 3.86 | 0.17 | 22.28 | 4.48 | 79.04 |
14 | ah | 26 | 20 | 4.37 | 0.20 | 21.41 | 4.42 | 88.12 |
15 | breast | 14 | 11 | 2.35 | 0.11 | 20.96 | 4.39 | 47.09 |
16 | dashed | 10 | 8 | 1.68 | 0.08 | 20.59 | 4.36 | 33.42 |
17 | gasped | 6 | 5 | 1.01 | 0.05 | 19.76 | 4.30 | 19.76 |
18 | parted | 6 | 5 | 1.01 | 0.05 | 19.76 | 4.30 | 19.76 |
19 | timid | 6 | 5 | 1.01 | 0.05 | 19.76 | 4.30 | 19.76 |
20 | sigh | 13 | 11 | 2.18 | 0.11 | 19.46 | 4.28 | 42.56 |

Report based on word tokens
Filtered tokens by minimum document frequency in target corpus (5), minimum document frequency in reference corpus (5)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 59,514
Total word tokens in reference corpus: 980,144
Keywords: 825
Showing 20 rows
Page 1 of 42
conc.collocates('could').display()
Collocates of "could"
Garden Party Corpus

Rank | Token | Collocate Frequency | Frequency | Logdice | Log Likelihood |
---|---|---|---|---|---|
1 | n’t | 111 | 522 | 12.28 | 233.85 |
2 | have | 35 | 240 | 11.33 | 50.03 |
3 | she | 94 | 1,171 | 11.13 | 52.86 |
4 | he | 55 | 718 | 10.93 | 27.90 |
5 | they | 35 | 398 | 10.89 | 23.65 |
6 | it | 71 | 1,021 | 10.89 | 28.44 |
7 | be | 26 | 251 | 10.86 | 23.35 |
8 | help | 12 | 27 | 10.71 | 44.49 |
9 | what | 23 | 270 | 10.63 | 14.61 |
10 | not | 19 | 229 | 10.48 | 11.45 |
11 | but | 25 | 417 | 10.36 | 6.43 |
12 | how | 13 | 126 | 10.32 | 11.60 |
13 | could | 16 | 207 | 10.31 | 107.37 |
14 | why | 11 | 99 | 10.20 | 11.00 |
15 | hardly | 8 | 17 | 10.19 | 30.81 |
16 | do | 16 | 242 | 10.19 | 5.58 |
17 | understand | 8 | 19 | 10.18 | 28.60 |
18 | that | 30 | 643 | 10.18 | 2.39 |
19 | was | 44 | 1,102 | 10.11 | 0.79 |
20 | no | 13 | 181 | 10.10 | 5.66 |

Report based on word tokens
Context tokens left: 5, context tokens right: 5
Filtered tokens by minimum collocation frequency (5)
Unique collocates: 83
Showing 20 rows
Page 1 of 5
conc.concordance('could', order='1R2R3R', context_length=10).display()
Concordance for "could"
Garden Party Corpus, Context tokens: 10, Order: 1R2R3R

Doc Id | Left | Node | Right |
---|---|---|---|
4 | herself and got as close to the sea as she | could | , and sung something , something she had made up |
11 | I sat up and called out as loud as I | could | , “ _ I do want to go on a |
13 | ’ll not be a minute . ” And before he | could | answer she was gone . He had half a mind |
4 | until I ’ve had something . Do you think we | could | ask Kate for two cups of hot water ? ” |
2 | , and there will be no time to explain what | could | be explained so simply .... But to - night it |
1 | Here ’s this huge house and garden . Surely you | could | be happy in — in — appreciating it for a |
12 | away from the Listening Ear . Good Heavens , what | could | be more tragic than that lament ! Every note was |
1 | even a successful , established , big paying concern — | could | be played with . A man had either to put |
8 | play . It was exactly like a play . Who | could | believe the sky at the back was n’t painted ? |
13 | . Was her luggage ready ? In that case they | could | cut off sharp with her cabin luggage and let the |
2 | but , after all , they ’re boys . I | could | cut off to sea , or get a job up |
2 | one leg over . But which leg ? She never | could | decide . And when she did finally put one leg |
10 | loved having to arrange things ; she always felt she | could | do it so much better than anybody else . Four |
2 | sitting there in the washhouse ; it was all they | could | do not to burst into a little chorus of animals |
11 | to practise . ” Oh , it was all I | could | do not to burst out crying . I went over |
7 | . But after supper they were all so tired they | could | do nothing but yawn until it was late enough to |
14 | astonished Fenella . “ You did n’t think your grandma | could | do that , did you ? ” said she . |
2 | wanted , somehow , to celebrate the fact that they | could | do what they liked now . There was no man |
9 | bent over her . This was such bliss that he | could | dream no further . But it gave him the courage |
6 | away — anywhere , as though by walking away he | could | escape .... It was cold in the street . There |

Total Concordance Lines: 207
Total Documents: 15
Showing 20 lines
Page 1 of 11
conc.concordance_plot('he').display()
[Concordance plot for "he": Garden Party Corpus, 718 concordance lines]
The texts are accessible via the Corpus object. If you display a text (as below) it will show any available metadata. Add `show_metadata=False` to show just the text. If you leave off the `max_tokens` parameter, the entire text is displayed.
corpus.text(2).display(max_tokens=300)
Metadata

Attribute | Value |
---|---|
document_id | 2 |
file | at-the-bay.txt |
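To suppress the metadata table and show only the text, pass `show_metadata=False` as noted above:

corpus.text(2).display(show_metadata=False, max_tokens=300)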
Texts are also accessible as strings if you want to work with them like that …
text = corpus.text(1).as_string(max_tokens=50)
print(text)
That evening for the first time in his life, as he pressed through the
swing door and descended the three broad steps to the pavement, old Mr.
Neave felt he was too old for the spring. Spring—warm, eager,
restless—
or as tokens (this example loops through all texts in the corpus and displays the first 10 tokens) …
for i in range(1, corpus.document_count + 1):
    tokens = corpus.text(i).as_tokens() # this is all the tokens for the text
    print(tokens[:10]) # slicing to the first 10 tokens
['That', 'evening', 'for', 'the', 'first', 'time', 'in', 'his', 'life', ',']
['I', '\r\n\r\n', 'Very', 'early', 'morning', '.', 'The', 'sun', 'was', 'not']
['A', 'stout', 'man', 'with', 'a', 'pink', 'face', 'wears', 'dingy', 'white']
['I', '\r\n\r\n', 'The', 'week', 'after', 'was', 'one', 'of', 'the', 'busiest']
['Exactly', 'when', 'the', 'ball', 'began', 'Leila', 'would', 'have', 'found', 'it']
['When', 'the', 'literary', 'gentleman', ',', 'whose', 'flat', 'old', 'Ma', 'Parker']
['On', 'his', 'way', 'to', 'the', 'station', 'William', 'remembered', 'with', 'a']
['Although', 'it', 'was', 'so', 'brilliantly', 'fine', '—', 'the', 'blue', 'sky']
['Of', 'course', 'he', 'knew', '—', 'no', 'man', 'better', '—', 'that']
['And', 'after', 'all', 'the', 'weather', 'was', 'ideal', '.', 'They', 'could']
['_', 'Eleven', 'o’clock', '.', 'A', 'knock', 'at', 'the', 'door', '.']
['With', 'despair', '—', 'cold', ',', 'sharp', 'despair', '—', 'buried', 'deep']
['It', 'seemed', 'to', 'the', 'little', 'crowd', 'on', 'the', 'wharf', 'that']
['The', 'Picton', 'boat', 'was', 'due', 'to', 'leave', 'at', 'half', '-']
['In', 'her', 'blue', 'dress', ',', 'with', 'her', 'cheeks', 'lightly', 'flushed']
Working with Conc results
In the examples above we run reports and use `.display()` to output them. You can access report data directly as a Polars dataframe using `.to_frame()`, which means you can work with the data in Polars to further filter the results or extend your analysis.
# some specific tokens to restrict the results below to
pronouns = ['i', 'you', 'he', 'she', 'it', 'we', 'they', 'me', 'him', 'her', 'us', 'them', 'mine', 'yours',
            'his', 'hers', 'its', 'ours', 'theirs']
# retrieve keyword report for these pronouns, which has normalized frequencies for each corpus
# the next line returns the result as a Polars dataframe using the `to_frame` method
df = conc.keywords(restrict_tokens=pronouns).to_frame()
# now go for it - work with the Polars dataframe
# e.g. show a table of results with specific columns, displaying the top 10 ordered by normalized frequency descending
df.select(['token', 'normalized_frequency', 'normalized_frequency_reference']).sort('normalized_frequency', descending=True).head(10)
token | normalized_frequency | normalized_frequency_reference |
---|---|---|
"she" | 196.760426 | 21.01732 |
"it" | 171.556272 | 71.326254 |
"her" | 157.441946 | 29.454856 |
"i" | 120.81191 | 44.585285 |
"he" | 120.643882 | 69.091889 |
"you" | 107.873778 | 33.311432 |
"his" | 73.932184 | 66.551446 |
"they" | 66.875021 | 28.995739 |
"them" | 35.285815 | 18.232015 |
"him" | 32.093289 | 26.720564 |
You are not restricted to Conc for your analysis. Conc report data can be exported to other formats. For example, although Conc uses Polars internally for its efficiency, you can convert report results into a Pandas dataframe, which is flexible and interoperable with many Python libraries for data analysis. Here is an example …
import pandas as pd # Conc does not install Pandas - so if you haven't already, install it with "pip install pandas"

# some specific tokens to restrict the results below to
pronouns = ['i', 'you', 'he', 'she', 'it', 'we', 'they', 'me', 'him', 'her', 'us', 'them', 'mine', 'yours',
            'his', 'hers', 'its', 'ours', 'theirs']
# retrieve keyword report for these pronouns, which has normalized frequencies for each corpus
# the next line returns the result as a Polars dataframe, which is then converted to a Pandas dataframe
df = conc.keywords(restrict_tokens=pronouns).to_frame().to_pandas()
# now go for it - do the Pandas operations you are familiar with ...
# e.g. show a table of results with specific columns, displaying the top 10 ordered by normalized frequency descending
# note you could do this in Polars (see above), but this assumes you are more familiar with Pandas
df.sort_values(by='normalized_frequency', ascending=False)[['token', 'normalized_frequency', 'normalized_frequency_reference']].head(10)
 | token | normalized_frequency | normalized_frequency_reference |
---|---|---|---|
0 | she | 196.760426 | 21.017320 |
7 | it | 171.556272 | 71.326254 |
2 | her | 157.441946 | 29.454856 |
6 | i | 120.811910 | 44.585285 |
12 | he | 120.643882 | 69.091889 |
5 | you | 107.873778 | 33.311432 |
15 | his | 73.932184 | 66.551446 |
9 | they | 66.875021 | 28.995739 |
10 | them | 35.285815 | 18.232015 |
14 | him | 32.093289 | 26.720564 |
If you want to work with Conc results in other libraries or software, you can write the results to an interoperable data format. Conc results can be accessed as Polars dataframes, which can be exported to CSV or JSON formats. See the Polars input/output documentation for other formats supported, including Parquet, newline-delimited JSON, Excel and more.
# some specific tokens to restrict the results below to
pronouns = ['i', 'you', 'he', 'she', 'it', 'we', 'they', 'me', 'him', 'her', 'us', 'them', 'mine', 'yours',
            'his', 'hers', 'its', 'ours', 'theirs']
# retrieve keyword report for these pronouns, which has normalized frequencies for each corpus
# the next line returns the result as a Polars dataframe using the `to_frame` method
df = conc.keywords(restrict_tokens=pronouns).to_frame()
The result in `df` is a Polars dataframe; here is how to write it to CSV …
df.write_csv('pronoun-results.csv')
Here is how to write the results to a JSON file …
df.write_json('pronoun-results.json')
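Other Polars writers work the same way; for example, Parquet:

df.write_parquet('pronoun-results.parquet')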
Working with Conc corpus data
If Conc does not do something you want, you can work with Conc corpus data directly.
When you build or load a corpus, the token, vocab and other data are accessible via the Corpus object. These tables could be very large and are not loaded into memory, but they can be queried via the Polars lazy API. Read about the Anatomy of a Conc Corpus for more information. Here is an example …
import polars as pl

# e.g. filter vocab by tokens ending in 'ing'
corpus.vocab.filter(pl.col('token').str.ends_with('ing')).head(5).collect(engine='streaming')
rank | tokens_sort_order | token_id | token | frequency_lower | frequency_orth | is_punct | is_space |
---|---|---|---|---|---|---|---|
99 | 4979 | 5920 | "something" | 100 | 96 | false | false |
171 | 2275 | 4862 | "going" | 59 | 59 | false | false |
205 | 3206 | 5170 | "looking" | 48 | 46 | false | false |
214 | 5509 | 830 | "thing" | 44 | 44 | false | false |
236 | 3650 | 4679 | "nothing" | 51 | 41 | false | false |
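Any Polars lazy query works against `corpus.vocab`. As another sketch, this uses the `is_punct` and `is_space` columns shown above to get the five most frequent word tokens:

import polars as pl

# five most frequent non-punctuation, non-space tokens (by lowercased frequency)
corpus.vocab.filter(~pl.col('is_punct') & ~pl.col('is_space')).sort('frequency_lower', descending=True).head(5).collect(engine='streaming')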
There are a number of different data files that make up a Conc corpus. These can be queried directly if required. Read about the Anatomy of a Conc Corpus for an explanation of the format and data. Here is an example …
import polars as pl

# e.g. lazy query to get tokens from vocab ending in 'ing'
display(pl.scan_parquet(f'{save_path}garden-party.corpus/vocab.parquet').filter(pl.col('token').str.ends_with('ing')).head(5).collect(engine='streaming'))
rank | tokens_sort_order | token_id | token | frequency_lower | frequency_orth | is_punct | is_space |
---|---|---|---|---|---|---|---|
99 | 4979 | 5920 | "something" | 100 | 96 | false | false |
171 | 2275 | 4862 | "going" | 59 | 59 | false | false |
205 | 3206 | 5170 | "looking" | 48 | 46 | false | false |
214 | 5509 | 830 | "thing" | 44 | 44 | false | false |
236 | 3650 | 4679 | "nothing" | 51 | 41 | false | false |