corpress

Create a text corpus from a WordPress site using the WordPress API.

Geoff Ford
https://geoffford.nz/corpress-release


Corpress documentation

Corpress (Cor from Corpus, Press from WordPress) provides a simple way to retrieve posts or pages from a WordPress site’s REST API and create a corpus (i.e. a data-set of texts). It is an efficient, standardized way to collect text data from WordPress sites, avoiding the need for customized scrapers. Not all WordPress sites provide access to the REST API, but many do.

I’m a political scientist who applies corpus linguistics and digital methods in my research. I’m releasing Corpress with academic researchers in mind. This tool is intended for academic research. Please cite Corpress if you use it in your research.

Corpress attempts to detect a REST API endpoint from a website URL for posts (default) or pages, downloads the JSON from the API, and then processes the JSON to create a corpus. You can create a corpus in:

1. ‘txt’ format: texts are saved in separate .txt files, compatible with common corpus linguistics tools like AntConc. An optional meta-data file can be output with the link to each text, its title, and date; or
2. ‘csv’ format: meta-data and text are saved in a single CSV file.

I’ve used nbdev to develop this library, which uses Jupyter notebooks to develop code, documentation, code examples and tests. If you want to contribute, you will need to clone the GitHub repo and set up nbdev.

Acknowledgements

This library was developed through my research on these projects:
* Mapping LAWS project: Issue Mapping and Analysing the Lethal Autonomous Weapons Debate (Funded by Royal Society of New Zealand’s Marsden Fund, Grant 19-UOC-068)
* Into the Deep: Analysing the Actors and Controversies Driving the Adoption of the World’s First Deep Sea Mining Governance (Funded by Royal Society of New Zealand’s Marsden Fund, Grant 22-UOC-059)

TODO

  • Add a way to zip a ‘txt’ format corpus.
  • Sort out encoding - currently assumes UTF-8 all the way.
  • Add checks on JSON save path.

Install

pip install corpress

Before using

  • There are good reasons not to collect and/or distribute corpora and it is the end-user’s responsibility to use this software in an ethical way.
  • Depending on the nature of the texts collected, what you are doing when analyzing the texts, and how you disseminate your research, it may be appropriate to process the texts further (e.g. to remove personally identifying information).
  • Not all WordPress sites make the REST API accessible. See the example output below for what happens when no REST API is available.
  • It is possible the API data may differ from what is visible online. You should check the texts in your corpus to make sure you have what you expect!
  • Corpress will exit with appropriate logging information if an API endpoint is not found, is not accessible, or returns unexpected data. Read the log output to understand what happened.
  • Collecting data uses energy and server resources. It is your responsibility to set an appropriate User Agent and seconds between requests to the API to be thoughtful and respectful in your use of this tool.

How to use

The corpress function is the intended way to invoke Corpress and create a corpus. Other functions are relevant if you just want to get the API endpoint or download the JSON data. If you want the data in a different format, you could just generate the CSV and then convert that to whatever format you need.
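
For example, if your analysis tool expects JSON Lines, you can convert the CSV with pandas (which is installed with Corpress). A minimal sketch; the file paths are illustrative:

import pandas as pd

# Load the meta-data and text CSV produced by Corpress ('csv' corpus format).
metadata = pd.read_csv('../test_data/example/metadata.csv')

# Convert to JSON Lines: one JSON record per text.
metadata.to_json('../test_data/example/corpus.jsonl', orient='records', lines=True)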

Corpress is intentionally verbose in terms of log output. This is helpful to record and understand the process of collecting the data.

Most WordPress sites have no more than hundreds to thousands of posts. Running Corpress in a Jupyter notebook can be a helpful way to view and capture the log output and scope the corpus.
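
Corpress’s log output format suggests it logs through Python’s standard logging module. If so, you can keep a copy of the log in a file for your records; here’s a minimal sketch using only the standard library (the log file name is arbitrary):

import logging

# Echo INFO-level messages to the console and keep a copy in a log file.
# force=True (Python 3.8+) replaces any handlers configured earlier in the session.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[logging.StreamHandler(), logging.FileHandler('corpress.log')],
    force=True,
)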

Here’s a step-by-step description, with discussion of the key functionality.

First import the corpress function.

from corpress.core import corpress

You are going to need to set a few arguments for corpress. The corpress function is documented in full here. Below I break it down and show an example.

  • url: Set the URL of the WordPress website, Corpress will try to determine the endpoint from this.
  • endpoint_type: Do you want ‘posts’ or ‘pages’? If you want both, see the note below on collecting both posts and pages.
  • corpus_format: How do you want your corpus saved? ‘txt’ is a directory of .txt files; ‘csv’ is a single CSV file with meta-data and text.

url = 'https://www.adho.org/'
endpoint_type = 'posts'
corpus_format = 'txt'

Set up where and how to save the data. Corpress will try to create directory paths if they don’t exist.

  • json_save_path (required): the directory where Corpress will save the JSON data. Note: you should set a new path for every new WordPress site you collect.
  • corpus_save_path: required for the ‘txt’ corpus format; this is where the .txt files will be saved. Set as None or omit if using the ‘csv’ format.
  • csv_save_file: optional for the ‘txt’ corpus format, where it provides a way to export meta-data (date, title, link, etc.) for each text in the corpus; required for the ‘csv’ corpus format, where it specifies the file in which the meta-data and text will be saved.
  • include_title_in_text: depending on the data you are collecting and what you want to do with it, you can save the title of the post/page as part of the text or not. This is set to True by default.

json_save_path = '../test_data/example/json/'
corpus_save_path = '../test_data/example/txt/'
csv_save_file = '../test_data/example/metadata.csv'
include_title_in_text = True

Set how you query the API:

  • seconds_between_requests: by default this is set to one request every 5 seconds. You can’t specify less than 1 second. If you are collecting lots of texts, it may be appropriate to set a larger number of seconds between requests.
  • headers: Corpress uses the Requests Python library for HTTP requests. You can pass any headers you want in HTTP requests directly as a dict; see the Requests documentation here. The most relevant header is User-Agent. See the note below about how to set an appropriate User-Agent.
  • params: the posts and pages endpoints support a number of parameters, including parameters to specify a search term, restrict dates, and set the way results are ordered. Set additional parameters as a dict (see the example after this list), and see the Requests library documentation on passing parameters in URLs to understand how they are sent.
  • max_pages: by default Corpress will collect all posts (or pages). That might not be necessary. Interpret max_pages as the maximum number of successful API requests. The REST API normally returns 10 posts/pages per request, so if you want 100 posts you would set max_pages to 10.
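
For example, the WordPress REST API’s posts endpoint accepts parameters such as search, after, before, orderby and order, so a more targeted query might look like this (the values here are arbitrary):

# Restrict to posts mentioning 'mining', published from 2020 onwards, newest first.
# See the WordPress REST API reference for the full list of supported parameters.
params = {
    'search': 'mining',
    'after': '2020-01-01T00:00:00',  # ISO 8601 date-time
    'orderby': 'date',
    'order': 'desc',
}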

Set an appropriate User-Agent

Here’s a suggested format: Your Research Project (https://university.edu/webpage). See how to set this below.

seconds_between_requests = 5
headers = {'User-Agent': 'Your Research Project (https://university.edu/webpage)'}
params = {'search': 'common'} # comment out or remove this line to collect every post; the search word here is arbitrary
max_pages = None # collect all available data; set to an integer if you want less

Now you can call the corpress function and create a corpus. There will be lots of information logged about collecting and processing the texts. When completed it will output a table with a summary of the process and texts collected. This is the same data returned by the corpress function.

result = corpress(url=url, 
                  endpoint_type=endpoint_type, 
                  corpus_format=corpus_format, 
                  json_save_path=json_save_path, 
                  corpus_save_path=corpus_save_path, 
                  csv_save_file=csv_save_file, 
                  include_title_in_text=include_title_in_text, 
                  seconds_between_requests=seconds_between_requests, 
                  headers=headers, 
                  params=params, 
                  max_pages=max_pages)
2024-08-23 11:21:25 - INFO - Found REST API endpoint link
2024-08-23 11:21:25 - INFO - Setting posts route https://adho.org/wp-json/wp/v2/posts
2024-08-23 11:21:25 - INFO - Using JSON save path: ../test_data/example/json/
2024-08-23 11:21:27 - INFO - Downloading https://adho.org/wp-json/wp/v2/posts?search=common&page=1
2024-08-23 11:21:27 - INFO - Total pages to retrieve is 3
2024-08-23 11:21:34 - INFO - Downloading https://adho.org/wp-json/wp/v2/posts?search=common&page=2
2024-08-23 11:21:40 - INFO - Downloading https://adho.org/wp-json/wp/v2/posts?search=common&page=3
2024-08-23 11:21:45 - INFO - Creating corpus in txt format
2024-08-23 11:21:45 - INFO - Using corpus save path: ../test_data/example/txt/
2024-08-23 11:21:45 - INFO - Creating CSV file for metadata: ../test_data/example/metadata.csv
2024-08-23 11:21:45 - INFO - Processing JSON: posts-3.json
2024-08-23 11:21:45 - INFO - Processing JSON: posts-2.json
2024-08-23 11:21:45 - INFO - Processing JSON: posts-1.json
    Key                 Value
0   url                 https://www.adho.org/
1   endpoint_url        https://adho.org/wp-json/wp/v2/posts
2   headers             {'User-Agent': 'Your Research Project (https:/...
3   params              {'search': 'common'}
4   get_api_url         True
5   get_json            True
6   create_corpus       True
7   corpus_format       txt
8   corpus_save_path    ../test_data/example/txt/
9   csv_save_file       ../test_data/example/metadata.csv
10  corpus_texts_count  29

You can now preview the data you’ve collected.

import pandas as pd
pd.set_option('display.max_colwidth', None) # to display full text in pandas dataframe
metadata = pd.read_csv(csv_save_file)
metadata = metadata.sort_values('date')
metadata[['date', 'title', 'link', 'filename']].head(5) # display the first 5 rows of metadata; these are not all the available fields
date title link filename
8 2012-12-06 ADHO Adopts Creative Commons License for Its Web Site https://adho.org/2012/12/06/adho-adopts-creative-commons-license-for-its-web-site/ 2012-12-06-post-382-adho-adopts-creative-commons-license-for-its-web-site.txt
7 2013-03-28 Apply to be ADHO’s Publications Liaison https://adho.org/2013/03/28/apply-to-be-adhos-publications-liaison/ 2013-03-28-post-366-apply-to-be-adhos-publications-liaison.txt
6 2013-06-23 ADHO Calls for Proposals for New Special Interest Groups https://adho.org/2013/06/23/adho-calls-for-proposals-for-new-special-interest-groups/ 2013-06-23-post-338-adho-calls-for-proposals-for-new-special-interest-groups.txt
5 2013-07-09 Participate in the Joint ADHO and centerNet AGM at Digital Humanities 2013 https://adho.org/2013/07/09/participate-in-the-joint-adho-and-centernet-agm-at-digital-humanities-2013/ 2013-07-09-post-408-participate-in-the-joint-adho-and-centernet-agm-at-digital-humanities-2013.txt
4 2013-07-14 Digital Humanities 2015 to be held in Sydney, Australia https://adho.org/2013/07/14/digital-humanities-2015-to-be-held-in-sydney-australia/ 2013-07-14-post-288-digital-humanities-2015-to-be-held-in-sydney-australia.txt

You can view a specific text file (if you used the ‘txt’ format) like this:

import os
filename = '2012-12-06-post-382-adho-adopts-creative-commons-license-for-its-web-site.txt'
with open(os.path.join(corpus_save_path, filename), 'r', encoding='utf-8') as file:
    text = file.read()
    print(text)
ADHO Adopts Creative Commons License for Its Web Site

The Alliance of Digital Humanities Organizations (ADHO) is pleased to announce that all content on its web site is now available under a Creative Commons Attribution (CC-BY) license. This means that individuals and organizations are welcome to re-use and adapt ADHO’s documents and resources, so long as ADHO is cited as the source. Neil Fraistat, Chair of ADHO’s Steering Committee, notes that “this is one of an ongoing series of actions this year that are being designed to make ADHO resources more open and available to the larger community.”
 
ADHO’s decision to adopt the CC-BY license was prompted by the recognition that through explicitly sharing its work it can have a greater impact, contribute to best practices, and demonstrate its support for open access. Recently the Program Committee for the 2013 Digital Humanities conference  revamped ADHO’s Guidelines for Proposal Authors & Reviewers, making them more inclusive, concrete, and transparent. PC chair Bethany Nowviskie received a request from the organizers of another conference to re-use these guidelines. Prompted by Nowviskie's suggestion, the ADHO Steering Committee determined that not only should the conference guidelines be made freely available, but its entire web site.
 
In adopting a Creative Commons license for its website, ADHO follows suit with several of its existing publications, including Digital Studies/Le Champ Numerique, Digital Humanities Quarterly, and DH Answers.

Collecting both posts and pages

If you want to collect both posts and pages, just invoke corpress twice: once with endpoint_type set to ‘posts’ and once with it set to ‘pages’.

If you are outputting in the ‘txt’ corpus format without a metadata file (i.e. csv_save_file set to None or omitted from the function call), you won’t have a problem. The filenames for posts/pages won’t conflict.

If you are specifying a csv_save_file - either because you are outputting in the ‘csv’ corpus format, or because you are using the ‘txt’ format and want the meta-data - make sure you use a separate csv_save_file for ‘posts’ and for ‘pages’. You will get two separate files; combining them with pandas (which is installed with Corpress) is trivial, as the sketch below shows.
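
Here’s a minimal sketch of the whole process, reusing the corpress arguments shown above; the file paths are illustrative:

import pandas as pd
from corpress.core import corpress

# Collect posts and pages separately, each with its own metadata file.
corpress(url=url, endpoint_type='posts', corpus_format='csv',
         json_save_path='../test_data/example/json_posts/',
         csv_save_file='../test_data/example/posts.csv')
corpress(url=url, endpoint_type='pages', corpus_format='csv',
         json_save_path='../test_data/example/json_pages/',
         csv_save_file='../test_data/example/pages.csv')

# Combine the two CSV files into one.
combined = pd.concat([pd.read_csv('../test_data/example/posts.csv'),
                      pd.read_csv('../test_data/example/pages.csv')],
                     ignore_index=True)
combined.to_csv('../test_data/example/combined.csv', index=False)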

No REST API available

Here’s an example showing what you will see if no REST API is accessible.

# test a site that has no endpoint
result = corpress(url='https://www.whitehouse.gov/',
                  endpoint_type='posts',
                  corpus_format='txt',
                  json_save_path='../test_data/json/',
                  corpus_save_path='../test_data/corpus/',
                  max_pages=2)
2024-08-23 11:21:46 - INFO - No REST API endpoint link in markup
2024-08-23 11:21:46 - INFO - Guessing posts route based on URL https://www.whitehouse.gov/wp-json/wp/v2/posts
2024-08-23 11:21:46 - INFO - Using JSON save path: ../test_data/json/
2024-08-23 11:21:46 - INFO - Max pages to retrieve from API is set: 2
2024-08-23 11:21:47 - INFO - Downloading https://www.whitehouse.gov/wp-json/wp/v2/posts?page=1
2024-08-23 11:21:47 - ERROR - Error downloading page 1 from https://www.whitehouse.gov/wp-json/wp/v2/posts
2024-08-23 11:21:47 - ERROR - Status code: 403
2024-08-23 11:21:47 - ERROR - It appears that this website does not provide access to the REST API. Exiting.
2024-08-23 11:21:47 - ERROR - Error downloading data. Exiting.
    Key                 Value
0   url                 https://www.whitehouse.gov/
1   endpoint_url        https://www.whitehouse.gov/wp-json/wp/v2/posts
2   headers             None
3   params              None
4   get_api_url         True
5   get_json            False
6   create_corpus       False
7   corpus_format       txt
8   corpus_save_path    ../test_data/corpus/
9   csv_save_file       None
10  corpus_texts_count  0
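
Because the same summary is returned by the corpress function, you can also check the outcome programmatically. A sketch, assuming the returned summary is dict-like with the keys shown in the tables above:

# Assumes the summary behaves like a dict with the keys shown above.
if result.get('get_json') and result.get('create_corpus'):
    print(f"Collected {result.get('corpus_texts_count')} texts")
else:
    print('Collection failed; check the log output for the reason')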