Geoff Ford Political Science x Digital Methods

CorPress: a Python library to build corpora from WordPress sites

I’ve released a pip-installable Python library, CorPress, which provides an efficient and standardised way to create corpora (i.e. text data-sets) from WordPress sites.

WordPress is a pervasive content management system. Look closely and you will find governments, think tanks, NGOs, independent media, GLAM institutions, community groups, corporations, and individuals who use WordPress for their websites. This tool is intended for academic researchers who want to collect and analyse discourse on the web, including researchers in corpus linguistics, digital methods, media studies, and digital humanities. The tool will also be helpful for building small custom corpora for teaching and student projects.

CorPress provides a standardised, targeted way to gather web text without the need for a custom scraper. It does this through the WordPress REST API, a machine-readable interface that typically provides access to all posts and pages on a WordPress site, as well as ways to search and filter posts by keywords, dates and other criteria. CorPress detects a REST API if available (it isn’t always), queries the API for posts or pages, and outputs a corpus in two flexible formats (CSV, and a txt file format compatible with common corpus linguistics software tools). As well as the texts themselves, CorPress outputs meta-data (date published, title, and a link to view the original page), and the tool output provides information on the process and the resulting data-set.
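To make the query-and-export idea concrete, here is a minimal sketch using only the Python standard library. It is not CorPress’s own code or API: the function names (`posts_url`, `html_to_text`, `post_to_row`, `rows_to_csv`) are hypothetical and exist only for illustration. It assumes the site exposes the standard WordPress REST route `/wp-json/wp/v2/posts`, and that each post object carries `date`, `link`, `title.rendered` and `content.rendered` fields, as that API normally returns.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urlencode

# Tags treated as block-level, so that text from adjacent blocks is not fused.
BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


def posts_url(site, page=1, per_page=100, search=None, after=None):
    """Build a query URL for the standard WordPress posts endpoint.

    The REST API caps per_page at 100, so larger sites need several pages.
    """
    params = {"page": page, "per_page": per_page}
    if search:
        params["search"] = search  # keyword filter
    if after:
        params["after"] = after  # ISO 8601 date-time lower bound
    return site.rstrip("/") + "/wp-json/wp/v2/posts?" + urlencode(params)


class _TextExtractor(HTMLParser):
    """Collect text nodes, inserting a space around block-level tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Strip tags, decode entities and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def post_to_row(post):
    """Reduce one post object from the API to the corpus meta-data and text."""
    return {
        "date": post["date"],
        "title": html_to_text(post["title"]["rendered"]),
        "link": post["link"],
        "text": html_to_text(post["content"]["rendered"]),
    }


def rows_to_csv(rows):
    """Serialise rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["date", "title", "link", "text"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

In practice you would fetch each URL from `posts_url` with an HTTP client, parse the JSON list of posts, and pass each item through `post_to_row`. The sketch leaves the network step out so that the data handling can be read and tested on its own.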

The documentation provides an overview of the functionality and explains how to use the library.

If you use CorPress in academic research, please cite the GitHub repository for CorPress (DOI), and please let me know! If you have feedback about CorPress, I’d love to hear from you.

Acknowledgements

I’ve worked on this over the last couple of years as part of the Mapping LAWS and Mining the Sea projects, and it is an output of both. Thanks to the Marsden Fund for supporting these projects and for the opportunity to develop and release this tool.