Get Started with Conc

Installed Conc? The getting started guide steps you through building and loading corpora and introduces how to use Conc for analysis.

This is a quick, no-frills introduction to using Conc. You can skip part 1 if you already have some data you want to work with.

1. Get some sample texts

For this getting started guide I’m going to use a collection of short stories from Katherine Mansfield’s The Garden Party as sample texts. This corpus is available as a zip file of text files and can be downloaded via the conc.corpora submodule. First, we import the function from conc.corpora that fetches the sample data.

import os
from conc.corpora import get_garden_party

Now we define where we want the source data to be stored (source_path) and where we want the corpus to be saved (save_path). When the corpus is built it will be saved in a new directory inside save_path. Note: the os.environ.get calls in the paths below are not required. You can specify paths directly as strings (e.g. /some/path/).

source_path = f'{os.environ.get("HOME")}/data/'  
save_path = f'{os.environ.get("HOME")}/data/conc-test-corpora/'
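One caveat with the paths above: if the HOME environment variable is not set (for example, it is often absent on Windows), os.environ.get("HOME") returns None and the resulting path starts with "None/". Here is a minimal alternative sketch using only the standard library. The directory names simply mirror the ones above, and this is one option rather than something Conc requires.

```python
from pathlib import Path

# Path.home() resolves the user's home directory on Linux, macOS and Windows.
home = Path.home()

# Trailing separators are kept because later code joins paths with f-strings.
source_path = f'{home / "data"}/'
save_path = f'{home / "data" / "conc-test-corpora"}/'

print(source_path)
print(save_path)
```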

Now we download the data. This will create the source_path directory defined above if it does not already exist, provided it is somewhere your user can write.

get_garden_party(source_path=source_path)
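If you are unsure whether your user can write to a location, you can check before downloading. The sketch below is plain standard-library Python. ensure_writable_dir is a hypothetical helper written for this guide, not part of Conc, and it is demonstrated on a temporary directory so nothing touches your real data.

```python
import os
import tempfile

def ensure_writable_dir(path):
    """Create path if needed and report whether the current user can write to it."""
    os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return os.access(path, os.W_OK)

# Demonstrated on a temporary directory so nothing is written to your home directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'data')
    writable = ensure_writable_dir(demo_path)
    created = os.path.isdir(demo_path)

print(writable, created)
```

In your own code you would pass source_path (or save_path) instead of the temporary path.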

2. Build the corpus

You can currently build a Conc corpus from:

  • a directory of text files or a .zip/.tar/.tar.gz containing text files (Corpus.build_from_files)
  • a .csv file (or .csv.gz file) with a column containing your text (Corpus.build_from_csv)

More source types will be added in the future, but lots of data can be wrangled into these formats.

Both methods support importing metadata. See the documentation links above for more details.
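As an illustration of wrangling data into one of these formats, here is a rough sketch that writes a few records to a .csv.gz file with one text column and two metadata columns. The records and the column names (title, year, text) are made up for this example. How you tell Conc which column holds the text and which hold metadata is covered in the Corpus.build_from_csv documentation.

```python
import csv
import gzip
import os
import tempfile

# Hypothetical records; in practice these might come from a database or an API.
records = [
    {'title': 'First note', 'year': 2024, 'text': 'A short example text.'},
    {'title': 'Second note', 'year': 2025, 'text': 'Another text, with a comma.'},
]

def write_csv_gz(rows, path, fieldnames):
    """Write dicts to a gzip-compressed CSV with a header row."""
    # newline='' lets the csv module control line endings, as its documentation recommends.
    with gzip.open(path, 'wt', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'notes.csv.gz')
    write_csv_gz(records, csv_path, fieldnames=['title', 'year', 'text'])
    # Read the file back to confirm the round trip.
    with gzip.open(csv_path, 'rt', encoding='utf-8', newline='') as f:
        rows_read = list(csv.DictReader(f))

print(rows_read[1]['text'])
```

The csv module quotes fields containing commas or newlines for you, which is why it is safer than joining strings by hand.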

For information on the Conc corpus format, see the Anatomy of a Conc Corpus.

The following code imports the Corpus class from conc.corpus.

from conc.corpus import Corpus

The code below creates a Corpus, gives it a name and description, and builds it from the Garden Party source files.

Remember, a new directory for your corpus will be created in save_path. The name of that directory is a slugified version of the name you pass in. For the Garden Party Corpus, the directory garden-party.corpus will be created. The folder name can be changed later if you want. You can distribute your corpus by sharing the directory and its contents.

The build time depends on the size of your corpus. The build process produces a corpus format that is quick to load and use. In this case, the corpus is small and the build finishes in a couple of seconds, even on an old, slow computer.

name = 'Garden Party Corpus'
description = 'A corpus of short stories from The Garden Party: and Other Stories by Katherine Mansfield. Texts downloaded from Project Gutenberg https://gutenberg.org/ and are in the public domain. The text files contain the short story without the title. https://github.com/ucdh/scraping-garden-party'
source_file = 'garden-party-corpus.zip'

corpus = Corpus(name=name, description=description).build_from_files(
    source_path=f'{source_path}{source_file}',
    save_path=save_path,
)
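To give a feel for the directory naming described above, here is a hypothetical approximation. Conc's actual slugification rules are not spelled out in this guide, so treat the function below as a guess that happens to reproduce the one example given (Garden Party Corpus becomes garden-party.corpus). In particular, dropping a trailing "corpus" from the slug is an assumption.

```python
import re

def corpus_dir_name(name):
    """Hypothetical approximation of how a corpus name becomes a directory name."""
    # Lowercase, then collapse runs of non-alphanumeric characters into hyphens.
    slug = re.sub(r'[^a-z0-9]+', '-', name.lower()).strip('-')
    # Assumption: a trailing "corpus" is dropped so the name does not read "...-corpus.corpus".
    slug = re.sub(r'-?corpus$', '', slug)
    return f'{slug}.corpus'

print(corpus_dir_name('Garden Party Corpus'))  # garden-party.corpus
```

Rather than predicting the directory name, the reliable approach is to read the Corpus Path reported by the corpus itself after building.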

To get information on the corpus, including various summary counts and information on the path of the corpus, you can use the Corpus.summary method.

corpus.summary()
Corpus Summary

Attribute            Value
Name                 Garden Party Corpus
Description          A corpus of short stories from The Garden Party: and Other Stories by Katherine Mansfield. Texts downloaded from Project Gutenberg https://gutenberg.org/ and are in the public domain. The text files contain the short story without the title. https://github.com/ucdh/scraping-garden-party
Date Created         2025-06-18 10:37:28
Conc Version         0.1.3
Corpus Path          /home/geoff/data/conc-test-corpora/garden-party.corpus
Document Count       15
Token Count          74,664
Word Token Count     63,311
Unique Tokens        5,410
Unique Word Tokens   5,398
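A few quick derived figures help in reading these counts. The sketch below just does arithmetic on the numbers reported above. Interpreting the gap between Token Count and Word Token Count as non-word tokens (presumably mostly punctuation) is my reading of the labels, not something the summary states.

```python
# Figures copied from the summary above.
document_count = 15
token_count = 74_664
word_token_count = 63_311
unique_tokens = 5_410
unique_word_tokens = 5_398

non_word_tokens = token_count - word_token_count          # 11,353
non_word_types = unique_tokens - unique_word_tokens       # 12
mean_tokens_per_doc = token_count / document_count        # 4977.6
type_token_ratio = unique_word_tokens / word_token_count  # roughly 0.085

print(f'{non_word_tokens:,} non-word tokens of {non_word_types} distinct types')
print(f'{mean_tokens_per_doc:,.1f} tokens per document on average')
print(f'word type/token ratio: {type_token_ratio:.3f}')
```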

3. Load a Conc corpus

Here is how we can load the corpus we just built. We don’t need to pass in a name and description; we just need the path to the corpus.

corpus = Corpus().load(corpus_path=f'{save_path}garden-party.corpus')
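If you have forgotten what a corpus directory was called, you can list what is in save_path. This is a small standard-library sketch. list_corpus_dirs is a hypothetical helper, and it assumes corpus directories end in .corpus, as the Garden Party example does. It runs against a temporary directory standing in for save_path.

```python
import tempfile
from pathlib import Path

def list_corpus_dirs(save_path):
    """Return the names of directories ending in .corpus directly under save_path."""
    return sorted(p.name for p in Path(save_path).glob('*.corpus') if p.is_dir())

# Demonstrated on a temporary directory standing in for save_path.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'garden-party.corpus').mkdir()
    (Path(tmp) / 'notes.txt').write_text('not a corpus')
    found = list_corpus_dirs(tmp)

print(found)  # ['garden-party.corpus']
```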

Let’s check our corpus information again. We could use the summary method again here, but we can also access this information using the Corpus.info method. Here we include the include_disk_usage parameter to get additional information on how much disk space our corpus is using.

print(corpus.info(include_disk_usage=True))
┌────────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Attribute                  ┆ Value                                                                                                                                                                                                                                             │
╞════════════════════════════╪═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Name                       ┆ Garden Party Corpus                                                                                                                                                                                                                               │
│ Description                ┆ A corpus of short stories from The Garden Party: and Other Stories by Katherine Mansfield. Texts downloaded from Project Gutenberg https://gutenberg.org/ and are in the public domain. The text files contain the short story without the title. │
│                            ┆ https://github.com/ucdh/scraping-garden-party                                                                                                                                                                                                     │
│ Date Created               ┆ 2025-06-18 10:37:28                                                                                                                                                                                                                               │
│ Conc Version               ┆ 0.1.3                                                                                                                                                                                                                                             │
│ Corpus Path                ┆ /home/geoff/data/conc-test-corpora/garden-party.corpus                                                                                                                                                                                            │
│ Document Count             ┆ 15                                                                                                                                                                                                                                                │
│ Token Count                ┆ 74,664                                                                                                                                                                                                                                            │
│ Word Token Count           ┆ 63,311                                                                                                                                                                                                                                            │
│ Unique Tokens              ┆ 5,410                                                                                                                                                                                                                                             │
│ Unique Word Tokens         ┆ 5,398                                                                                                                                                                                                                                             │
│ Corpus Metadata (Mb)       ┆ 0.001                                                                                                                                                                                                                                             │
│ Document Metadata (Mb)     ┆ 0.001                                                                                                                                                                                                                                             │
│ Tokens (Mb)                ┆ 0.259                                                                                                                                                                                                                                             │
│ Vocab (Mb)                 ┆ 0.073                                                                                                                                                                                                                                             │
│ Punctuation Positions (Mb) ┆ 0.028                                                                                                                                                                                                                                             │
│ Space Positions (Mb)       ┆ 0.017                                                                                                                                                                                                                                             │
└────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
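The disk usage rows can be totalled to see the overall footprint and which component dominates. The values below are copied by hand from the table above; the dictionary is just for illustration and is not a structure returned by Conc.

```python
# Component sizes in Mb, copied from the table above.
disk_usage_mb = {
    'Corpus Metadata': 0.001,
    'Document Metadata': 0.001,
    'Tokens': 0.259,
    'Vocab': 0.073,
    'Punctuation Positions': 0.028,
    'Space Positions': 0.017,
}

# Round to the table's precision to hide floating-point noise in the sum.
total_mb = round(sum(disk_usage_mb.values()), 3)
largest = max(disk_usage_mb, key=disk_usage_mb.get)

print(f'{total_mb} Mb in total; largest component: {largest}')
```

For this corpus that comes to 0.379 Mb, with the token data accounting for roughly two thirds of it.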

4. Using Conc

To use the corpus we need to import the Conc class from conc.conc.

from conc.conc import Conc

The Conc class is the main interface for working with your corpus. It provides methods for a range of corpus analyses, including frequency, ngrams, concordances, collocates, and keyness. There are separate classes for each of these analyses, but the Conc class provides the most straightforward way to run them.

Here we instantiate a Conc object with the corpus just loaded.

conc = Conc(corpus=corpus)
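If terms like frequency and concordance are new to you, the toy sketch below shows the underlying ideas in plain Python. It does not use Conc or reflect how Conc implements these analyses. The token list and the kwic helper are made up for illustration, and Conc's own reports are far faster and richer.

```python
from collections import Counter

# A toy token list; Conc handles tokenization itself when it builds a corpus.
tokens = 'the garden party was a success and the garden was lovely'.split()

# Frequency analysis: how often each token occurs.
frequencies = Counter(tokens)

def kwic(tokens, node, context=2):
    """Return (left, node, right) tuples for each occurrence of node."""
    lines = []
    for i, token in enumerate(tokens):
        if token == node:
            left = ' '.join(tokens[max(0, i - context):i])
            right = ' '.join(tokens[i + 1:i + 1 + context])
            lines.append((left, token, right))
    return lines

print(frequencies.most_common(2))
# Concordance: each occurrence of a word shown with its surrounding context.
for left, node, right in kwic(tokens, 'garden'):
    print(f'{left:>12} | {node} | {right}')
```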

This getting started guide is a work in progress. Check out the Conc code recipes to see example code that generates Conc reports, along with the tables and visualisations they create. More documentation on Conc reports will be available soon, but for now refer to the recipes and the API reference. There are API documentation pages for the main analysis types (e.g. concordancing, keyness analysis) with information on the parameters available.
