Geoff Ford Political Science x Digital Methods

Generate a corpus with an LLM

This semester I’m introducing postgraduate students to linguistic features of Large Language Model (LLM) output, teaching practical skills to generate and evaluate text output, and developing their critical understanding of the technology and its applications. I’ve released a Github repository I’m using in class this semester. The notebook in the repository demonstrates how to use the OpenRouter API to generate a corpus using an LLM. Students are using this code in labs and will adapt the code and prompts to generate their own data for a group assignment.

The Corpus Building Project requires the students to come up with a novel research question, collect appropriate data, and write a report to document their process, decisions and features of their corpus. Many of the students are developing corpora to study features of specific kinds of generated and human-authored text. The initial project is focused on building text data-sets, which is very relevant with current interest in the possibilities of synthetic data. Students have opportunities through the remainder of the course to apply different text analytic techniques to analyse their corpora, including information extraction, text classification, and corpus-assisted discourse analysis.

The notebook provides multiple examples for students:

  1. Generating multiple texts from a single prompt.
  2. Generating structured data to seed generation of more complex texts.
  3. Generating a corpus using prompts from a CSV (e.g. generating text based on scraped news headlines).
  4. Generating chatbot-like texts, with memory of preceeding turns.
  5. Generating structured data and saving this to a CSV.

OpenRouter is helpful, because it provides access to a wide range of LLMs, including well-known commercial models and the latest open source models. They also provide free API access to some of the best small models (with rate limits).

For more information on the repository check out the README file.

Github Repository

Here is the GitHub repository:
https://github.com/polsci/generate-a-corpus-with-an-LLM