Get Started with Conc
This is a quick, no-frills introduction to using Conc. You can skip part 1 if you already have some data you want to work with.
1. Get some sample texts
For this getting started guide I’m going to use the example of a collection of short stories from Katherine Mansfield’s The Garden Party as sample texts. This corpus is available as a zip file of text files and can be downloaded via the conc.corpora submodule. First, we will import the function from conc.corpora to get the sample data.
import os
from conc.corpora import get_garden_party
Now we define where we want the data to be stored (source_path) and where we want the corpus to be saved (save_path). When the corpus is built it will be saved in a new directory in save_path. Note: the os.environ.get calls in the paths below are not required. You can specify paths directly as strings (e.g. /some/path/).
source_path = f'{os.environ.get("HOME")}/data/'
save_path = f'{os.environ.get("HOME")}/data/conc-test-corpora/'
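The note above says the environment lookup is optional. As a minimal sketch using only the standard library, the same two paths can be built with os.path.expanduser, which resolves ~ to the current user's home directory. The helper with_trailing_sep is hypothetical and not part of Conc; it keeps the trailing separator because later code in this guide joins a path and a file name with an f-string.

```python
import os

def with_trailing_sep(path: str) -> str:
    """Return path with exactly one trailing separator (hypothetical helper, not part of Conc)."""
    return path.rstrip(os.sep) + os.sep

# expanduser resolves "~" to the current user's home directory
source_path = with_trailing_sep(os.path.expanduser('~/data'))
save_path = with_trailing_sep(os.path.join(source_path, 'conc-test-corpora'))

print(source_path)
print(save_path)
```

On most Unix-like systems this should give the same values as the HOME-based f-strings, without failing when HOME is unset.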
Now we download the data. This will create the source_path directory defined above if it is not already there (and it is somewhere your user can write).
get_garden_party(source_path=source_path)
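If you want to confirm the download before building, one option is to list the text files inside the zip with the standard library. This is a sketch, not something Conc requires: list_text_files is a hypothetical helper, and the zip location is assumed from this guide (source_path plus the file name garden-party-corpus.zip used in part 2).

```python
import os
import zipfile

def list_text_files(zip_path: str) -> list[str]:
    """Return the names of .txt members in a zip archive, sorted."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(n for n in archive.namelist() if n.lower().endswith('.txt'))

# Assumed location: source_path from this guide plus the zip file name
zip_path = os.path.expanduser('~/data/garden-party-corpus.zip')
if os.path.exists(zip_path):
    names = list_text_files(zip_path)
    print(f'{len(names)} text files, e.g. {names[:3]}')
else:
    print(f'{zip_path} not found; run the download step first')
```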
2. Build the corpus
You can currently build a Conc corpus from:
- a directory of text files or a .zip/.tar/.tar.gz containing text files (Corpus.build_from_files)
- a .csv file (or .csv.gz file) with a column containing your text (Corpus.build_from_csv)
More source types will be added in the future, but lots of data can be wrangled into these formats.
Both methods support importing metadata. See the documentation links above for more details.
For information on the Conc corpus format, see the Anatomy of a Conc Corpus.
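As an illustration of wrangling data into the CSV shape described above (one column holding the text, other columns available as metadata), here is a minimal standard-library sketch. The records and the column names title, year and text are made up for the example and are not names Conc requires; check the Corpus.build_from_csv documentation for how to tell Conc which columns to use.

```python
import csv
import os
import tempfile

# Hypothetical records; the field names are illustrative only
records = [
    {'title': 'First story', 'year': 1922, 'text': 'Some text,\nwith a line break.'},
    {'title': 'Second story', 'year': 1922, 'text': 'More "quoted" text.'},
]

def write_text_csv(rows, csv_path, fieldnames):
    """Write dict rows to a CSV file with a header row; the csv module handles quoting."""
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

# Written to the system temp directory to keep the example self-contained
csv_path = os.path.join(tempfile.gettempdir(), 'stories.csv')
write_text_csv(records, csv_path, ['title', 'year', 'text'])
print(csv_path)
```

Using csv.DictWriter rather than joining strings by hand means commas, quotes and line breaks inside the text column are escaped correctly.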
The following code imports the Corpus class from conc.corpus.
from conc.corpus import Corpus
The following line creates a Corpus, gives it a name and description, and builds it from the Garden Party source files.
Remember, a new directory for your corpus will be created in save_path. The name of that directory is a slugified version of the name you pass in. For the Garden Party Corpus, the directory garden-party.corpus will be created. The folder name can be changed later if you want. You can distribute your corpus by sharing the directory and its contents.
Build time depends on the size of your corpus. The build process produces a corpus format that is quick to load and use. In this case, the corpus is small and the build finishes in a couple of seconds, even on an old, slow computer.
name = 'Garden Party Corpus'
description = 'A corpus of short stories from The Garden Party: and Other Stories by Katherine Mansfield. Texts downloaded from Project Gutenberg https://gutenberg.org/ and are in the public domain. The text files contain the short story without the title. https://github.com/ucdh/scraping-garden-party'
source_file = 'garden-party-corpus.zip'
corpus = Corpus(name=name, description=description).build_from_files(source_path = f'{source_path}{source_file}', save_path = save_path)
To get information on the corpus, including various summary counts and information on the path of the corpus, you can use the Corpus.summary method.
corpus.summary()
Corpus Summary | |
---|---|
Attribute | Value |
Name | Garden Party Corpus |
Description | A corpus of short stories from The Garden Party: and Other Stories by Katherine Mansfield. Texts downloaded from Project Gutenberg https://gutenberg.org/ and are in the public domain. The text files contain the short story without the title. https://github.com/ucdh/scraping-garden-party |
Date Created | 2025-06-18 10:37:28 |
Conc Version | 0.1.3 |
Corpus Path | /home/geoff/data/conc-test-corpora/garden-party.corpus |
Document Count | 15 |
Token Count | 74,664 |
Word Token Count | 63,311 |
Unique Tokens | 5,410 |
Unique Word Tokens | 5,398 |
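The counts in the summary relate to each other by simple arithmetic. The sketch below copies the figures from the table above and assumes that word tokens are a subset of all tokens, so the difference is the number of non-word tokens (presumably punctuation and similar). The variable names are illustrative and are not Conc attributes.

```python
# Figures copied from the Corpus Summary table above
token_count = 74_664
word_token_count = 63_311
unique_tokens = 5_410
unique_word_tokens = 5_398

# Assumption: word tokens are a subset of all tokens
non_word_tokens = token_count - word_token_count
non_word_types = unique_tokens - unique_word_tokens
word_share = word_token_count / token_count

print(f'Non-word tokens: {non_word_tokens:,}')
print(f'Non-word types: {non_word_types}')
print(f'Word tokens as a share of all tokens: {word_share:.1%}')
```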
3. Load a Conc corpus
Here is how we can load the corpus we just built. We don’t need to pass in a name and description; we just need the path to the corpus.
corpus = Corpus().load(corpus_path=f'{save_path}garden-party.corpus')
Let’s check our corpus information again. We could use the summary method again here, but we can also access this information using the Corpus.info method. Here we include the include_disk_usage parameter to get additional information on how much disk space our corpus is using.
print(corpus.info(include_disk_usage=True))
┌────────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Attribute ┆ Value │
╞════════════════════════════╪═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Name ┆ Garden Party Corpus │
│ Description ┆ A corpus of short stories from The Garden Party: and Other Stories by Katherine Mansfield. Texts downloaded from Project Gutenberg https://gutenberg.org/ and are in the public domain. The text files contain the short story without the title. │
│ ┆ https://github.com/ucdh/scraping-garden-party │
│ Date Created ┆ 2025-06-18 10:37:28 │
│ Conc Version ┆ 0.1.3 │
│ Corpus Path ┆ /home/geoff/data/conc-test-corpora/garden-party.corpus │
│ Document Count ┆ 15 │
│ Token Count ┆ 74,664 │
│ Word Token Count ┆ 63,311 │
│ Unique Tokens ┆ 5,410 │
│ Unique Word Tokens ┆ 5,398 │
│ Corpus Metadata (Mb) ┆ 0.001 │
│ Document Metadata (Mb) ┆ 0.001 │
│ Tokens (Mb) ┆ 0.259 │
│ Vocab (Mb) ┆ 0.073 │
│ Punctuation Positions (Mb) ┆ 0.028 │
│ Space Positions (Mb) ┆ 0.017 │
└────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
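To cross-check the disk usage rows independently of Conc, you could total the file sizes in the corpus directory yourself. This is a rough sketch, not how Conc computes its figures: directory_size_mb is a hypothetical helper, it treats 1 MB as 1,000,000 bytes (Conc may use a different convention), and the corpus path is assumed from the paths used earlier in this guide.

```python
from pathlib import Path

def directory_size_mb(directory: str) -> float:
    """Sum the sizes of all files under directory, in megabytes (1 MB = 1,000,000 bytes)."""
    total_bytes = sum(p.stat().st_size for p in Path(directory).rglob('*') if p.is_file())
    return total_bytes / 1_000_000

# Assumed location, based on the save_path used in this guide
corpus_dir = Path.home() / 'data' / 'conc-test-corpora' / 'garden-party.corpus'
if corpus_dir.is_dir():
    print(f'{directory_size_mb(str(corpus_dir)):.3f} MB')
else:
    print(f'{corpus_dir} not found; build the corpus first')
```

For comparison, the six (Mb) rows in the table above add up to 0.379.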
4. Using Conc
To use the corpus we need to import the Conc class from conc.conc.
from conc.conc import Conc
The Conc class is the main interface for working with your corpus. It provides methods for a range of corpus analyses, including frequency, ngrams, concordances, collocates, and keyness. There are separate classes for each of these analyses, but the Conc class provides the most straightforward way to run them.
Here we instantiate a Conc object with the corpus just loaded.
conc = Conc(corpus=corpus)
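If the analysis types named above are unfamiliar, the plain-Python sketch below shows, on a made-up sentence, what a frequency count and a concordance (keyword in context) contain. It does not use Conc and is not how Conc implements these reports; tokenize and kwic are hypothetical helpers, and the tokenisation is deliberately crude.

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def kwic(tokens: list[str], node: str, width: int = 2) -> list[tuple[str, str, str]]:
    """Return (left context, node, right context) for each occurrence of node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Made-up example sentence, not taken from the corpus
tokens = tokenize('The cat sat on the mat and the dog sat on the rug.')
freq = Counter(tokens)

print(freq.most_common(3))
for left, node, right in kwic(tokens, 'sat'):
    print(f'{left:>10} | {node} | {right}')
```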
This getting started guide is a work in progress. Check out the Conc code recipes for example code that generates Conc reports, along with the tables or visualisations they create. More documentation on Conc reports will be available soon, but for now refer to the recipes and API reference. There are API documentation pages for the main analysis types (e.g. concordancing, keyness analysis) with information on the various parameters available.