core

The core Corpress functions. Each function can be called directly; to gather data and output a corpus in one step, use the corpress function.


get_api_url

 get_api_url (url:str, endpoint_type:str='posts', headers:dict=None)

Queries a URL to get the REST API route for the endpoint type provided.

| | Type | Default | Details |
|---|---|---|---|
| url | str | | the URL of the WordPress website |
| endpoint_type | str | posts | posts or pages |
| headers | dict | None | optional headers for requests |
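As a rough illustration of what route discovery involves: WordPress sites advertise their REST API root in an HTTP `Link` header with the relation `https://api.w.org/`, and the core content routes live under `wp/v2/`. The sketch below parses such a header and builds the endpoint URL. It is not the Corpress implementation; the helper names `api_root_from_link_header` and `build_endpoint_url` are hypothetical, and `example.com` is a placeholder.

```python
import re
from typing import Optional

# Link relation that WordPress uses to advertise its REST API root.
WP_API_REL = "https://api.w.org/"


def api_root_from_link_header(link_header: str) -> Optional[str]:
    """Return the REST API root found in an HTTP Link header, or None."""
    # Each entry in the header looks like: <url>; rel="relation"
    for match in re.finditer(r'<([^>]+)>\s*;\s*rel="([^"]+)"', link_header):
        target, rel = match.groups()
        if rel == WP_API_REL:
            return target
    return None


def build_endpoint_url(api_root: str, endpoint_type: str = "posts") -> str:
    """Join the API root with the wp/v2 route for posts or pages."""
    if endpoint_type not in ("posts", "pages"):
        raise ValueError("endpoint_type must be 'posts' or 'pages'")
    return api_root.rstrip("/") + "/wp/v2/" + endpoint_type


# Placeholder header of the kind a WordPress site might return.
header = '<https://example.com/wp-json/>; rel="https://api.w.org/"'
root = api_root_from_link_header(header)
endpoint = build_endpoint_url(root, "posts")
print(endpoint)
```

In practice the header would come from an HTTP response for the site URL; the parsing is kept separate here so it can be checked without network access.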


get_json

 get_json (endpoint_url:str, endpoint_type:str='posts', headers:dict=None,
           params:dict=None, json_save_path:str=None,
           seconds_between_requests:int=5, max_pages:int=None)

Download and save JSON data from a specific REST API endpoint.

| | Type | Default | Details |
|---|---|---|---|
| endpoint_url | str | | the URL of the WordPress REST API endpoint |
| endpoint_type | str | posts | the type of data to download |
| headers | dict | None | optional headers for requests |
| params | dict | None | optional parameters to pass to the API |
| json_save_path | str | None | path to save the JSON data |
| seconds_between_requests | int | 5 | number of seconds to wait between requests, must be at least 1 |
| max_pages | int | None | maximum number of pages to download |
| Returns | bool | | True if successful, False otherwise |
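The interaction between `seconds_between_requests` and `max_pages` can be pictured with a small pagination loop. The WordPress REST API returns collections one page at a time and reports the page count in the `X-WP-TotalPages` response header. The sketch below is a minimal illustration under those assumptions, not the Corpress code: `download_all` and `fake_fetch` are hypothetical names, and the HTTP request is replaced by an injected function so the loop runs offline.

```python
import time
from typing import Callable, List, Optional, Tuple

# fetch_page(page_number) -> (records, total_pages); stands in for an HTTP GET.
FetchPage = Callable[[int], Tuple[List[dict], int]]


def download_all(fetch_page: FetchPage,
                 seconds_between_requests: int = 5,
                 max_pages: Optional[int] = None,
                 sleep: Callable[[float], None] = time.sleep) -> List[dict]:
    """Collect records page by page, pausing between requests."""
    if seconds_between_requests < 1:
        raise ValueError("seconds_between_requests must be at least 1")
    records: List[dict] = []
    page = 1
    while True:
        page_records, total_pages = fetch_page(page)
        records.extend(page_records)
        # Stop at the last available page, or earlier if max_pages is set.
        last_page = total_pages if max_pages is None else min(total_pages, max_pages)
        if page >= last_page:
            break
        sleep(seconds_between_requests)  # be polite to the server
        page += 1
    return records


def fake_fetch(page: int) -> Tuple[List[dict], int]:
    """Pretend API with three pages of two records each."""
    return [{"id": page * 10 + i} for i in range(2)], 3


waits_full: List[float] = []
all_records = download_all(fake_fetch, seconds_between_requests=1,
                           sleep=waits_full.append)

waits_capped: List[float] = []
capped_records = download_all(fake_fetch, seconds_between_requests=1,
                              max_pages=2, sleep=waits_capped.append)
print(len(all_records), len(capped_records))
```

Passing `sleep` as a parameter is only a testing convenience; it lets the delays be recorded instead of actually waiting.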


create_corpus

 create_corpus (corpus_format:str='txt', json_save_path:str=None,
                corpus_save_path:str=None, csv_save_file:str=None,
                include_title_in_text:bool=True)

Create a corpus from downloaded JSON data in txt or csv format.

| | Type | Default | Details |
|---|---|---|---|
| corpus_format | str | txt | format of the corpus files, txt or csv |
| json_save_path | str | None | path to JSON data |
| corpus_save_path | str | None | path to save corpus in txt format |
| csv_save_file | str | None | path to CSV file to output corpus in CSV format (or metadata if txt corpus) |
| include_title_in_text | bool | True | include the title in the text file |
| Returns | bool | | True if successful, False if there are errors parsing the JSON |
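To give a sense of what turning downloaded JSON into corpus text involves: WordPress post records carry their HTML in `title.rendered` and `content.rendered`. The sketch below strips the markup with the standard library and optionally prepends the title, mirroring the `include_title_in_text` option. It is an illustration only and may differ from how Corpress extracts text; `rendered_to_text` and `post_to_text` are hypothetical helpers and the sample post is made up.

```python
from html.parser import HTMLParser
from typing import List

# Closing one of these tags ends a line of text.
BLOCK_TAGS = {"p", "div", "li", "blockquote", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    """Collect text nodes, turning block boundaries into newlines."""

    def __init__(self) -> None:
        super().__init__()  # character references are decoded by default
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def rendered_to_text(rendered: str) -> str:
    """Strip tags from rendered HTML and tidy the whitespace."""
    parser = _TextExtractor()
    parser.feed(rendered)
    parser.close()  # flush any buffered text
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


def post_to_text(post: dict, include_title_in_text: bool = True) -> str:
    """Build the text for one post record."""
    body = rendered_to_text(post["content"]["rendered"])
    if include_title_in_text:
        return rendered_to_text(post["title"]["rendered"]) + "\n\n" + body
    return body


# Made-up record shaped like a WordPress REST API post.
post = {
    "title": {"rendered": "Hello &amp; welcome"},
    "content": {"rendered": "<p>First <em>paragraph</em>.</p>\n<p>Second one.</p>"},
}
print(post_to_text(post))
```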


result_reporting

 result_reporting (result:dict, output:bool=True)

Outputs the results of the corpress process.

| | Type | Default | Details |
|---|---|---|---|
| result | dict | | the result dictionary |
| output | bool | True | output the results |
| Returns | dict | | returns the result dictionary |


corpress

 corpress (url:str, endpoint_type:str='posts', headers:dict=None,
           params:dict=None, corpus_format:str='txt',
           json_save_path:str=None, corpus_save_path:str=None,
           csv_save_file:str=None, seconds_between_requests:int=5,
           max_pages:int=None, include_title_in_text:bool=True,
           output:bool=True)

Retrieve data from the REST API and create a corpus.

| | Type | Default | Details |
|---|---|---|---|
| url | str | | the URL of the WordPress website |
| endpoint_type | str | posts | posts or pages |
| headers | dict | None | optional headers for requests |
| params | dict | None | optional parameters to pass to the API |
| corpus_format | str | txt | format of the corpus files, txt or csv |
| json_save_path | str | None | path to save the JSON data |
| corpus_save_path | str | None | path to save the corpus in txt format |
| csv_save_file | str | None | path to CSV file to output corpus in CSV format (or metadata if txt corpus) |
| seconds_between_requests | int | 5 | number of seconds to wait between requests |
| max_pages | int | None | maximum number of pages to download |
| include_title_in_text | bool | True | option to include the title in the text file |
| output | bool | True | option to output the results of the process |
| Returns | dict | | dictionary with results of each stage of the process and the number of texts in the corpus |
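A one-step call might look like the sketch below. The keyword arguments follow the signature documented above; the site URL and the output paths are placeholders. The import assumes the package is installed and that the function is importable from a module named `corpress.core`, which is inferred from this page's title rather than stated on it. The helper `run_corpress` is hypothetical and only wraps the call.

```python
# Keyword arguments mirror the documented corpress signature.
# The URL and paths are placeholders, not real locations.
corpress_kwargs = dict(
    url="https://example.com/",
    endpoint_type="posts",
    corpus_format="txt",
    json_save_path="example_json/",
    corpus_save_path="example_txt/",
    csv_save_file="example_metadata.csv",
    seconds_between_requests=5,
    max_pages=2,  # keep a first trial run small
    include_title_in_text=True,
    output=True,
)


def run_corpress() -> dict:
    """Download the data and build the corpus in one step."""
    # Assumed import path; imported here so the settings above can be
    # inspected without the package installed.
    from corpress.core import corpress

    # Returns a dictionary describing the result of each stage.
    return corpress(**corpress_kwargs)
```

Calling `run_corpress()` sends requests to the site, so it is left uncalled here. A low `max_pages` is a reasonable choice for a first run before downloading a whole site.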