Quick Conc Recipes
The Get started with Conc tutorial (in progress) is a detailed step-by-step walk through Conc functionality. This page provides simple code recipes for common Conc tasks. See the Conc API Reference for information on available methods, functions and parameters.
Install Conc
See the installation page if you have an older (pre-2013) machine, want to install the latest development version, or want to install optional dependencies.
Conc is tested with Python 3.10+. Create a new environment (with venv, conda or similar) and run the following command in your terminal to install Conc. If you are working within a notebook environment, you can usually run shell commands in a code cell by prefixing the command with an `!` symbol.
pip install conc
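For example, in a notebook code cell:

!pip install conc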
The Conc install process installs spaCy, but you will need a spaCy model. Once the package and its dependencies are installed, run the following command to download the English language model for spaCy. If you want to use a different language, consult the spaCy models page for information on available models.
python -m spacy download en_core_web_sm
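To confirm the model downloaded correctly, you can try loading it with spaCy (a quick sanity check, not required by Conc):

import spacy

nlp = spacy.load('en_core_web_sm') # raises OSError if the model is not installed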
Building or loading a corpus for analysis
This example assumes you've set the `name` and `description` of your corpus, as well as the `path_to_source_file` and `save_path` variables. The `path_to_source_file` can be a directory of text files or a compressed archive of text files (e.g. .zip, .tar.gz).
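As a sketch, the assumed variables might look like this (the values are illustrative; substitute your own paths):

name = 'Garden Party Corpus'
description = 'Short stories from The Garden Party by Katherine Mansfield.'
path_to_source_file = 'source-texts/garden-party/' # a directory of .txt files, or a .zip/.tar.gz archive
save_path = 'corpora/' # directory where the built corpus will be saved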
from conc.corpus import Corpus
from conc.conc import Conc

corpus = Corpus(name=name, description=description).build_from_files(source_path=path_to_source_file, save_path=save_path)
corpus.summary()
conc = Conc(corpus=corpus) # prerequisite for running reports
Corpus Summary

Attribute | Value |
---|---|
Name | Garden Party Corpus |
Description | A corpus of short stories from The Garden Party: and Other Stories by Katherine Mansfield. Texts downloaded from Project Gutenberg https://gutenberg.org/ and are in the public domain. The text files contain the short story without the title. https://github.com/ucdh/scraping-garden-party |
Date Created | 2025-06-23 13:16:12 |
Conc Version | 0.1.4 |
Corpus Path | /home/geoff/data/conc-test-corpora/garden-party.corpus |
Document Count | 15 |
Token Count | 74,664 |
Word Token Count | 59,514 |
Unique Tokens | 5,410 |
Unique Word Tokens | 5,392 |
This example assumes you've set the `name` and `description` of your corpus, as well as the `path_to_source_file` and `save_path` variables. The `path_to_source_file` can be a .csv file or a compressed .csv.gz file. The example also assumes the CSV has a column called `text` containing the text to be analysed and another column called `source`, which will be retained as metadata.
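If you are unsure of the expected input shape, here is a minimal illustrative CSV written with Polars (the file name and contents are hypothetical; only the `text` and `source` column names matter for this recipe):

import polars as pl

# illustrative only: one row per document, with 'text' and a 'source' metadata column
pl.DataFrame({
    'source': ['doc-one.txt', 'doc-two.txt'],
    'text': ['Text of the first document ...', 'Text of the second document ...'],
}).write_csv('example-corpus.csv')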
from conc.corpus import Corpus
from conc.conc import Conc
corpus = Corpus(name=name, description=description).build_from_csv(source_path=path_to_source_file, save_path=save_path, text_column='text', metadata_columns=['source'])
corpus.summary()
conc = Conc(corpus=corpus) # prerequisite for running reports
Corpus Summary

Attribute | Value |
---|---|
Name | Brown Corpus |
Description | A Standard Corpus of Present-Day Edited American English, for use with Digital Computers. by W. N. Francis and H. Kucera (1964) Department of Linguistics, Brown University Providence, Rhode Island, USA Revised 1971, Revised and Amplified 1979 http://www.hit.uib.no/icame/brown/bcm.html. This version downloaded via NLTK https://www.nltk.org/nltk_data/. |
Date Created | 2025-06-23 13:16:15 |
Conc Version | 0.1.4 |
Corpus Path | /home/geoff/data/conc-test-corpora/brown.corpus |
Document Count | 500 |
Token Count | 1,138,566 |
Word Token Count | 980,144 |
Unique Tokens | 42,930 |
Unique Word Tokens | 42,907 |
This example assumes you've previously built a corpus with Conc and set the `corpus_path` variable (i.e. the corpus path created when you built your corpus).
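For example (an illustrative path, following the save_path and corpus name used when building):

corpus_path = 'corpora/garden-party.corpus' # the directory created when the corpus was built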
from conc.corpus import Corpus
from conc.conc import Conc
corpus = Corpus().load(corpus_path=corpus_path)
corpus.summary()
conc = Conc(corpus=corpus) # prerequisite for running reports
Corpus Summary

Attribute | Value |
---|---|
Name | Garden Party Corpus |
Description | A corpus of short stories from The Garden Party: and Other Stories by Katherine Mansfield. Texts downloaded from Project Gutenberg https://gutenberg.org/ and are in the public domain. The text files contain the short story without the title. https://github.com/ucdh/scraping-garden-party |
Date Created | 2025-06-23 13:16:12 |
Conc Version | 0.1.4 |
Corpus Path | /home/geoff/data/conc-test-corpora/garden-party.corpus |
Document Count | 15 |
Token Count | 74,664 |
Word Token Count | 59,514 |
Unique Tokens | 5,410 |
Unique Word Tokens | 5,392 |
Reports for corpus analysis
The example snippets below assume you’ve built or loaded a corpus and prepared it for reporting (see above).
conc.frequencies().display()
Frequencies
Frequencies of word tokens, Garden Party Corpus

Rank | Token | Frequency | Normalized Frequency |
---|---|---|---|
1 | the | 2,911 | 489.13 |
2 | and | 1,798 | 302.11 |
3 | a | 1,407 | 236.41 |
4 | to | 1,376 | 231.21 |
5 | she | 1,171 | 196.76 |
6 | was | 1,102 | 185.17 |
7 | it | 1,021 | 171.56 |
8 | her | 937 | 157.44 |
9 | of | 908 | 152.57 |
10 | i | 719 | 120.81 |
11 | he | 718 | 120.64 |
12 | in | 683 | 114.76 |
13 | that | 643 | 108.04 |
14 | you | 642 | 107.87 |
15 | ’s | 524 | 88.05 |
16 | n’t | 522 | 87.71 |
17 | said | 514 | 86.37 |
18 | on | 504 | 84.69 |
19 | had | 469 | 78.80 |
20 | his | 440 | 73.93 |

Report based on word tokens
Normalized Frequency is per 10,000 tokens
Total word tokens: 59,514
Unique word tokens: 5,392
Showing 20 rows
Page 1 of 270
conc.ngram_frequencies(ngram_length=2).display()
Ngram Frequencies
Garden Party Corpus

Rank | Ngram | Frequency | Normalized Frequency |
---|---|---|---|
1 | it was | 247 | 41.50 |
2 | in the | 214 | 35.96 |
3 | on the | 183 | 30.75 |
4 | of the | 156 | 26.21 |
5 | to the | 139 | 23.36 |
6 | at the | 133 | 22.35 |
7 | she was | 132 | 22.18 |
8 | and the | 124 | 20.84 |
9 | it ’s | 120 | 20.16 |
10 | do n’t | 105 | 17.64 |
11 | they were | 104 | 17.47 |
12 | he was | 100 | 16.80 |
13 | she had | 95 | 15.96 |
14 | and she | 88 | 14.79 |
15 | could n’t | 88 | 14.79 |
16 | to be | 88 | 14.79 |
17 | she said | 87 | 14.62 |
18 | there was | 84 | 14.11 |
19 | a little | 83 | 13.95 |
20 | out of | 83 | 13.95 |

Report based on word tokens
Ngram length: 2
Ngrams containing punctuation tokens excluded
Normalized Frequency is per 10,000 tokens
Total unique ngrams: 24,351
Total ngrams: 47,706
Showing 20 rows
Page 1 of 1218
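The same report can be run for other ngram lengths; for example, trigram frequencies:

conc.ngram_frequencies(ngram_length=3).display()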
For a keywords report you will need another corpus to compare against. This example assumes you've already built one and defined the `reference_corpus_path` variable. When you load the reference corpus, assign it to `reference_corpus` or some other variable name that helps you distinguish it from the corpus you are reporting on.
reference_corpus = Corpus().load(corpus_path=reference_corpus_path)
conc.set_reference_corpus(reference_corpus)
conc.keywords(min_document_frequency=5, min_document_frequency_reference=5).display()
Keywords
Target corpus: Garden Party Corpus, Reference corpus: Brown Corpus

Rank | Token | Frequency | Frequency Reference | Normalized Frequency | Normalized Frequency Reference | Relative Risk | Log Ratio | Log Likelihood |
---|---|---|---|---|---|---|---|---|
1 | bye | 25 | 7 | 4.20 | 0.07 | 58.82 | 5.88 | 110.23 |
2 | velvet | 14 | 5 | 2.35 | 0.05 | 46.11 | 5.53 | 58.78 |
3 | shone | 13 | 5 | 2.18 | 0.05 | 42.82 | 5.42 | 53.69 |
4 | queer | 15 | 6 | 2.52 | 0.06 | 41.17 | 5.36 | 61.39 |
5 | gloves | 17 | 7 | 2.86 | 0.07 | 40.00 | 5.32 | 69.11 |
6 | cried | 59 | 26 | 9.91 | 0.27 | 37.37 | 5.22 | 235.92 |
7 | darling | 36 | 18 | 6.05 | 0.18 | 32.94 | 5.04 | 139.33 |
8 | faintly | 14 | 7 | 2.35 | 0.07 | 32.94 | 5.04 | 54.18 |
9 | oh | 149 | 93 | 25.04 | 0.95 | 26.39 | 4.72 | 540.97 |
10 | handkerchief | 14 | 9 | 2.35 | 0.09 | 25.62 | 4.68 | 50.36 |
11 | dear | 78 | 54 | 13.11 | 0.55 | 23.79 | 4.57 | 273.99 |
12 | breathed | 13 | 9 | 2.18 | 0.09 | 23.79 | 4.57 | 45.67 |
13 | awful | 23 | 17 | 3.86 | 0.17 | 22.28 | 4.48 | 79.04 |
14 | ah | 26 | 20 | 4.37 | 0.20 | 21.41 | 4.42 | 88.12 |
15 | breast | 14 | 11 | 2.35 | 0.11 | 20.96 | 4.39 | 47.09 |
16 | dashed | 10 | 8 | 1.68 | 0.08 | 20.59 | 4.36 | 33.42 |
17 | gasped | 6 | 5 | 1.01 | 0.05 | 19.76 | 4.30 | 19.76 |
18 | parted | 6 | 5 | 1.01 | 0.05 | 19.76 | 4.30 | 19.76 |
19 | timid | 6 | 5 | 1.01 | 0.05 | 19.76 | 4.30 | 19.76 |
20 | sigh | 13 | 11 | 2.18 | 0.11 | 19.46 | 4.28 | 42.56 |

Report based on word tokens
Filtered tokens by minimum document frequency in target corpus (5), minimum document frequency in reference corpus (5)
Normalized Frequency is per 10,000 tokens
Total word tokens in target corpus: 59,514
Total word tokens in reference corpus: 980,144
Keywords: 825
Showing 20 rows
Page 1 of 42
conc.collocates('could').display()
Collocates of "could"
Garden Party Corpus

Rank | Token | Collocate Frequency | Frequency | Logdice | Log Likelihood |
---|---|---|---|---|---|
1 | n’t | 111 | 522 | 12.28 | 233.85 |
2 | have | 35 | 240 | 11.33 | 50.03 |
3 | she | 94 | 1,171 | 11.13 | 52.86 |
4 | he | 55 | 718 | 10.93 | 27.90 |
5 | they | 35 | 398 | 10.89 | 23.65 |
6 | it | 71 | 1,021 | 10.89 | 28.44 |
7 | be | 26 | 251 | 10.86 | 23.35 |
8 | help | 12 | 27 | 10.71 | 44.49 |
9 | what | 23 | 270 | 10.63 | 14.61 |
10 | not | 19 | 229 | 10.48 | 11.45 |
11 | but | 25 | 417 | 10.36 | 6.43 |
12 | how | 13 | 126 | 10.32 | 11.60 |
13 | could | 16 | 207 | 10.31 | 107.37 |
14 | why | 11 | 99 | 10.20 | 11.00 |
15 | hardly | 8 | 17 | 10.19 | 30.81 |
16 | do | 16 | 242 | 10.19 | 5.58 |
17 | understand | 8 | 19 | 10.18 | 28.60 |
18 | that | 30 | 643 | 10.18 | 2.39 |
19 | was | 44 | 1,102 | 10.11 | 0.79 |
20 | no | 13 | 181 | 10.10 | 5.66 |

Report based on word tokens
Context tokens left: 5, context tokens right: 5
Filtered tokens by minimum collocation frequency (5)
Unique collocates: 83
Showing 20 rows
Page 1 of 5
conc.concordance('could', order='1R2R3R', context_length=10).display()
Concordance for "could"
Garden Party Corpus, Context tokens: 10, Order: 1R2R3R

Doc Id | Left | Node | Right |
---|---|---|---|
4 | herself and got as close to the sea as she | could | , and sung something , something she had made up |
11 | I sat up and called out as loud as I | could | , “ _ I do want to go on a |
13 | ’ll not be a minute . ” And before he | could | answer she was gone . He had half a mind |
4 | until I ’ve had something . Do you think we | could | ask Kate for two cups of hot water ? ” |
2 | , and there will be no time to explain what | could | be explained so simply .... But to - night it |
1 | Here ’s this huge house and garden . Surely you | could | be happy in — in — appreciating it for a |
12 | away from the Listening Ear . Good Heavens , what | could | be more tragic than that lament ! Every note was |
1 | even a successful , established , big paying concern — | could | be played with . A man had either to put |
8 | play . It was exactly like a play . Who | could | believe the sky at the back was n’t painted ? |
13 | . Was her luggage ready ? In that case they | could | cut off sharp with her cabin luggage and let the |
2 | but , after all , they ’re boys . I | could | cut off to sea , or get a job up |
2 | one leg over . But which leg ? She never | could | decide . And when she did finally put one leg |
10 | loved having to arrange things ; she always felt she | could | do it so much better than anybody else . Four |
2 | sitting there in the washhouse ; it was all they | could | do not to burst into a little chorus of animals |
11 | to practise . ” Oh , it was all I | could | do not to burst out crying . I went over |
7 | . But after supper they were all so tired they | could | do nothing but yawn until it was late enough to |
14 | astonished Fenella . “ You did n’t think your grandma | could | do that , did you ? ” said she . |
2 | wanted , somehow , to celebrate the fact that they | could | do what they liked now . There was no man |
9 | bent over her . This was such bliss that he | could | dream no further . But it gave him the courage |
6 | away — anywhere , as though by walking away he | could | escape .... It was cold in the street . There |

Total Concordance Lines: 207
Total Documents: 15
Showing 20 lines
Page 1 of 11
conc.concordance_plot('he').display()
[Concordance plot for "he": Garden Party Corpus, 718 concordance lines]
The texts are accessible via the Corpus object. If you display a text (as below) it will show any available metadata. Add `show_metadata=False` to show just the text. If you leave off the `max_tokens` parameter, the entire text is displayed.
corpus.text(2).display(max_tokens=300)
Metadata

Attribute | Value |
---|---|
document_id | 2 |
file | at-the-bay.txt |
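To suppress the metadata table and show only the text, pass `show_metadata=False` as noted above:

corpus.text(2).display(show_metadata=False, max_tokens=300)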
Texts are also accessible as strings if you want to work with them like that …
text = corpus.text(1).as_string(max_tokens=50)
print(text)
That evening for the first time in his life, as he pressed through the
swing door and descended the three broad steps to the pavement, old Mr.
Neave felt he was too old for the spring. Spring—warm, eager,
restless—
or as tokens (this example loops through all texts in the corpus and displays the first 10 tokens) …
for i in range(1, corpus.document_count + 1):
    tokens = corpus.text(i).as_tokens() # this is all the tokens for the text
    print(tokens[:10]) # slicing to the first 10 tokens
['That', 'evening', 'for', 'the', 'first', 'time', 'in', 'his', 'life', ',']
['I', '\r\n\r\n', 'Very', 'early', 'morning', '.', 'The', 'sun', 'was', 'not']
['A', 'stout', 'man', 'with', 'a', 'pink', 'face', 'wears', 'dingy', 'white']
['I', '\r\n\r\n', 'The', 'week', 'after', 'was', 'one', 'of', 'the', 'busiest']
['Exactly', 'when', 'the', 'ball', 'began', 'Leila', 'would', 'have', 'found', 'it']
['When', 'the', 'literary', 'gentleman', ',', 'whose', 'flat', 'old', 'Ma', 'Parker']
['On', 'his', 'way', 'to', 'the', 'station', 'William', 'remembered', 'with', 'a']
['Although', 'it', 'was', 'so', 'brilliantly', 'fine', '—', 'the', 'blue', 'sky']
['Of', 'course', 'he', 'knew', '—', 'no', 'man', 'better', '—', 'that']
['And', 'after', 'all', 'the', 'weather', 'was', 'ideal', '.', 'They', 'could']
['_', 'Eleven', 'o’clock', '.', 'A', 'knock', 'at', 'the', 'door', '.']
['With', 'despair', '—', 'cold', ',', 'sharp', 'despair', '—', 'buried', 'deep']
['It', 'seemed', 'to', 'the', 'little', 'crowd', 'on', 'the', 'wharf', 'that']
['The', 'Picton', 'boat', 'was', 'due', 'to', 'leave', 'at', 'half', '-']
['In', 'her', 'blue', 'dress', ',', 'with', 'her', 'cheeks', 'lightly', 'flushed']
Working with Conc results
In the examples above we run reports and use `.display()` to output them. You can access report data directly as a Polars dataframe using `.to_frame()`, which means you can work with the data in Polars to further filter the results or extend your analysis.
# some specific tokens to restrict the results below to
pronouns = ['i', 'you', 'he', 'she', 'it', 'we', 'they', 'me', 'him', 'her', 'us', 'them', 'mine', 'yours',
            'his', 'hers', 'its', 'ours', 'theirs']
# retrieve keyword report for these pronouns, which has normalized frequencies for each corpus
# the next line returns the result as a Polars dataframe using the `to_frame` method
df = conc.keywords(restrict_tokens=pronouns).to_frame()
# now go for it - work with the Polars dataframe
# e.g. show a table of results with specific columns, displaying the top 10 ordered by normalized frequency descending
df.select(['token', 'normalized_frequency', 'normalized_frequency_reference']).sort('normalized_frequency', descending=True).head(10)
token | normalized_frequency | normalized_frequency_reference |
---|---|---|
"she" | 196.760426 | 21.01732 |
"it" | 171.556272 | 71.326254 |
"her" | 157.441946 | 29.454856 |
"i" | 120.81191 | 44.585285 |
"he" | 120.643882 | 69.091889 |
"you" | 107.873778 | 33.311432 |
"his" | 73.932184 | 66.551446 |
"they" | 66.875021 | 28.995739 |
"them" | 35.285815 | 18.232015 |
"him" | 32.093289 | 26.720564 |
You are not restricted to Conc for your analysis. Conc report data can be exported to other formats. For example, although Conc uses Polars internally for its efficiency, you can convert report results into a Pandas dataframe, which is flexible and interoperable with many Python libraries for data analysis. Here is an example …
import pandas as pd # Conc does not install Pandas - so if you haven't already, install it with "pip install pandas"

# some specific tokens to restrict the results below to
pronouns = ['i', 'you', 'he', 'she', 'it', 'we', 'they', 'me', 'him', 'her', 'us', 'them', 'mine', 'yours',
            'his', 'hers', 'its', 'ours', 'theirs']
# retrieve keyword report for these pronouns, which has normalized frequencies for each corpus
# the next line returns the result as a Polars dataframe, which is then converted to a Pandas dataframe
df = conc.keywords(restrict_tokens=pronouns).to_frame().to_pandas()
# now go for it - do the Pandas operations you are familiar with ...
# e.g. show a table of results with specific columns, displaying the top 10 ordered by normalized frequency descending
# note you could do this in Polars (see above), but this assumes you are more familiar with Pandas
df.sort_values(by='normalized_frequency', ascending=False)[['token', 'normalized_frequency', 'normalized_frequency_reference']].head(10)
 | token | normalized_frequency | normalized_frequency_reference |
---|---|---|---|
0 | she | 196.760426 | 21.017320 |
7 | it | 171.556272 | 71.326254 |
2 | her | 157.441946 | 29.454856 |
6 | i | 120.811910 | 44.585285 |
12 | he | 120.643882 | 69.091889 |
5 | you | 107.873778 | 33.311432 |
15 | his | 73.932184 | 66.551446 |
9 | they | 66.875021 | 28.995739 |
10 | them | 35.285815 | 18.232015 |
14 | him | 32.093289 | 26.720564 |
If you want to work with Conc results in other libraries or software, you can write the results to an interoperable data format. Conc results can be accessed as Polars dataframes, which can be exported to CSV or JSON formats. See the Polars input/output documentation for other formats supported, including Parquet, newline-delimited JSON, Excel and more.
# some specific tokens to restrict the results below to
pronouns = ['i', 'you', 'he', 'she', 'it', 'we', 'they', 'me', 'him', 'her', 'us', 'them', 'mine', 'yours',
            'his', 'hers', 'its', 'ours', 'theirs']
# retrieve keyword report for these pronouns, which has normalized frequencies for each corpus
# the next line returns the result as a Polars dataframe using the `to_frame` method
df = conc.keywords(restrict_tokens=pronouns).to_frame()
The result in `df` is a Polars dataframe; here is how to write it to CSV …
df.write_csv('pronoun-results.csv')
Here is how to write the results to a JSON file …
df.write_json('pronoun-results.json')
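Other Polars writers work the same way; for example, Parquet:

df.write_parquet('pronoun-results.parquet')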
Working with Conc corpus data
If Conc does not do something you want, you can work with Conc corpus data directly.
When you build or load a corpus, the token, vocab and other data are accessible via the Corpus object. These tables could be very large and are not loaded into memory, but they can be queried via the Polars lazy API. Read about the Anatomy of a Conc Corpus for more information. Here is an example …
import polars as pl

# e.g. filter vocab by tokens ending in 'ing'
corpus.vocab.filter(pl.col('token').str.ends_with('ing')).head(5).collect(engine='streaming')
rank | tokens_sort_order | token_id | token | frequency_lower | frequency_orth | is_punct | is_space |
---|---|---|---|---|---|---|---|
99 | 4979 | 5920 | "something" | 100 | 96 | false | false |
171 | 2275 | 4862 | "going" | 59 | 59 | false | false |
205 | 3206 | 5170 | "looking" | 48 | 46 | false | false |
214 | 5509 | 830 | "thing" | 44 | 44 | false | false |
236 | 3650 | 4679 | "nothing" | 51 | 41 | false | false |
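Any Polars lazy query works against `corpus.vocab`. As another sketch, this uses the `is_punct` and `is_space` columns shown above to get the five most frequent word tokens:

import polars as pl

# five most frequent non-punctuation, non-space tokens (by lowercased frequency)
corpus.vocab.filter(~pl.col('is_punct') & ~pl.col('is_space')).sort('frequency_lower', descending=True).head(5).collect(engine='streaming')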
There are a number of different data files that make up a Conc corpus. These can be queried directly if required. Read about the Anatomy of a Conc Corpus for an explanation of the format and data. Here is an example …
import polars as pl

# e.g. lazy query to get tokens from vocab ending in 'ing'
display(pl.scan_parquet(f'{save_path}garden-party.corpus/vocab.parquet').filter(pl.col('token').str.ends_with('ing')).head(5).collect(engine='streaming'))
rank | tokens_sort_order | token_id | token | frequency_lower | frequency_orth | is_punct | is_space |
---|---|---|---|---|---|---|---|
99 | 4979 | 5920 | "something" | 100 | 96 | false | false |
171 | 2275 | 4862 | "going" | 59 | 59 | false | false |
205 | 3206 | 5170 | "looking" | 48 | 46 | false | false |
214 | 5509 | 830 | "thing" | 44 | 44 | false | false |
236 | 3650 | 4679 | "nothing" | 51 | 41 | false | false |