core
Core Corpress functions. You can call any of these functions directly; to gather data and output a corpus in one step, use the corpress function.
get_api_url
get_api_url (url:str, endpoint_type:str='posts', headers:dict=None)
Queries a URL to get the REST API route for the endpoint type provided.
Parameter | Type | Default | Details
---|---|---|---
url | str | | the URL of the WordPress website
endpoint_type | str | posts | posts or pages
headers | dict | None | optional headers for requests
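To make the idea of route discovery concrete, here is a minimal sketch of one common way a WordPress site advertises its REST API: a `Link` response header with `rel="https://api.w.org/"`. Whether get_api_url works this way is not stated here, so treat this as an illustration only. The helper names `api_root_from_link_header` and `endpoint_url` are hypothetical and are not part of Corpress.

```python
import re

# Hypothetical helpers for illustration; not part of the Corpress API.
WP_API_REL = "https://api.w.org/"


def api_root_from_link_header(link_header):
    """Return the REST API root advertised in a Link header, or None."""
    # Naive split on commas; good enough for a sketch, not a full RFC 8288 parser.
    for part in link_header.split(","):
        match = re.match(r'\s*<([^>]+)>\s*;\s*rel="([^"]+)"', part)
        if match and match.group(2) == WP_API_REL:
            return match.group(1)
    return None


def endpoint_url(api_root, endpoint_type="posts"):
    """Build the route for posts or pages under the standard wp/v2 namespace."""
    if endpoint_type not in ("posts", "pages"):
        raise ValueError("endpoint_type must be 'posts' or 'pages'")
    return api_root.rstrip("/") + "/wp/v2/" + endpoint_type


# Example header of the kind a WordPress site might send (example.com is a placeholder).
sample_header = (
    '<https://example.com/wp-json/>; rel="https://api.w.org/", '
    '<https://example.com/wp-json/wp/v2/pages/2>; rel="alternate"; type="application/json"'
)
root = api_root_from_link_header(sample_header)
posts_route = endpoint_url(root, "posts")
```

In this sketch, `posts_route` is the kind of value that would then be passed on as an endpoint URL for downloading.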
get_json
get_json (endpoint_url:str, endpoint_type:str='posts', headers:dict=None, params:dict=None, json_save_path:str=None, seconds_between_requests:int=5, max_pages:int=None)
Download and save JSON data from a specific REST API endpoint.
Parameter | Type | Default | Details
---|---|---|---
endpoint_url | str | | the URL of the WordPress REST API endpoint
endpoint_type | str | posts | the type of data to download
headers | dict | None | optional headers for requests
params | dict | None | optional parameters to pass to the API
json_save_path | str | None | path to save the JSON data
seconds_between_requests | int | 5 | number of seconds to wait between requests, must be at least 1
max_pages | int | None | maximum number of pages to download
Returns | bool | | True if successful, False otherwise
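The interaction between `seconds_between_requests` and `max_pages` can be sketched as a simple polite pagination loop. This is an assumption about the general technique, not the actual get_json implementation. The function `download_pages` and the `fetch_page` callable are hypothetical; `fetch_page` stands in for whatever performs the HTTP request, so the loop can be tried without network access.

```python
import time


def download_pages(fetch_page, seconds_between_requests=5, max_pages=None,
                   sleep=time.sleep):
    """Collect records page by page, pausing between consecutive requests.

    fetch_page(page_number) is a hypothetical callable returning a list of
    records, or an empty list when there are no more pages.
    """
    if seconds_between_requests < 1:
        raise ValueError("seconds_between_requests must be at least 1")

    records = []
    page = 1
    while max_pages is None or page <= max_pages:
        if page > 1:
            # Wait before every request after the first one.
            sleep(seconds_between_requests)
        batch = fetch_page(page)
        if not batch:
            break  # an empty page means everything has been downloaded
        records.extend(batch)
        page += 1
    return records


# Stand-in data source: two pages of fake records instead of a real API.
fake_pages = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
pauses = []  # records each pause instead of actually sleeping
all_records = download_pages(lambda p: fake_pages.get(p, []),
                             seconds_between_requests=1, sleep=pauses.append)
```

Passing `sleep` as a parameter is a design choice made only for this sketch, so the delay can be observed without waiting.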
create_corpus
create_corpus (corpus_format:str='txt', json_save_path:str=None, corpus_save_path:str=None, csv_save_file:str=None, include_title_in_text:bool=True)
Create a corpus from downloaded JSON data in txt or csv format.
Parameter | Type | Default | Details
---|---|---|---
corpus_format | str | txt | format of the corpus files, txt or csv
json_save_path | str | None | path to JSON data
corpus_save_path | str | None | path to save corpus in txt format
csv_save_file | str | None | path to CSV file to output corpus in CSV format (or metadata if txt corpus)
include_title_in_text | bool | True | include the title in the text file
Returns | bool | | True if successful, False if there are errors parsing the JSON
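As a rough illustration of what `include_title_in_text` controls, the sketch below converts a single WordPress post record into plain text. It assumes the standard REST API fields `title.rendered` and `content.rendered`; the helpers `html_to_text` and `post_to_text` are hypothetical, and create_corpus may clean and lay out text differently.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags around them."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Strip tags, decode entities, and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def post_to_text(post, include_title_in_text=True):
    """Hypothetical helper: turn one post record into corpus text."""
    body = html_to_text(post["content"]["rendered"])
    if include_title_in_text:
        title = html_to_text(post["title"]["rendered"])
        return f"{title}\n\n{body}"
    return body


# A made-up record shaped like a WordPress REST API post.
sample_post = {
    "title": {"rendered": "Hello &amp; welcome"},
    "content": {"rendered": "<p>First <em>post</em>.</p>\n<p>Second.</p>"},
}
with_title = post_to_text(sample_post)
without_title = post_to_text(sample_post, include_title_in_text=False)
```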
result_reporting
result_reporting (result:dict, output:bool=True)
Outputs the results of the corpress process.
Parameter | Type | Default | Details
---|---|---|---
result | dict | | the result dictionary
output | bool | True | output the results
Returns | dict | | returns the result dictionary
corpress
corpress (url:str, endpoint_type:str='posts', headers:dict=None, params:dict=None, corpus_format:str='txt', json_save_path:str=None, corpus_save_path:str=None, csv_save_file:str=None, seconds_between_requests:int=5, max_pages:int=None, include_title_in_text:bool=True, output:bool=True)
Retrieve data from the REST API and create a corpus.
Parameter | Type | Default | Details
---|---|---|---
url | str | | the URL of the WordPress website
endpoint_type | str | posts | posts or pages
headers | dict | None | optional headers for requests
params | dict | None | optional parameters to pass to the API
corpus_format | str | txt | format of the corpus files, txt or csv
json_save_path | str | None | path to save the JSON data
corpus_save_path | str | None | path to save the corpus in txt format
csv_save_file | str | None | path to CSV file to output corpus in CSV format (or metadata if txt corpus)
seconds_between_requests | int | 5 | number of seconds to wait between requests
max_pages | int | None | maximum number of pages to download
include_title_in_text | bool | True | option to include the title in the text file
output | bool | True | option to output the results of the process
Returns | dict | | dictionary with results of each stage of the process and the number of texts in the corpus
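A one-step run might look like the sketch below, which only uses parameter names from the signature above. The import path `corpress.core` is an assumption based on this page's title, `https://example.com` is a placeholder site, and the output directory layout and the helpers `build_corpress_kwargs` and `run_corpress` are hypothetical.

```python
from pathlib import Path


def build_corpress_kwargs(url, output_dir, max_pages=2):
    """Assemble keyword arguments for corpress (directory layout is illustrative)."""
    out = Path(output_dir)
    return {
        "url": url,
        "endpoint_type": "posts",
        "corpus_format": "txt",
        "json_save_path": str(out / "json"),
        "corpus_save_path": str(out / "txt"),
        # With a txt corpus, the CSV file receives metadata.
        "csv_save_file": str(out / "metadata.csv"),
        "seconds_between_requests": 5,
        "max_pages": max_pages,  # keep small while testing
        "include_title_in_text": True,
        "output": True,
    }


def run_corpress(url, output_dir, max_pages=2):
    """Call corpress with the arguments above; requires the package and network access."""
    # Assumed import path; adjust if the package exposes the function elsewhere.
    from corpress.core import corpress

    return corpress(**build_corpress_kwargs(url, output_dir, max_pages))


# Building the arguments has no side effects; run_corpress performs the real work.
example_kwargs = build_corpress_kwargs("https://example.com", "corpus_output")
```

Calling `run_corpress("https://example.com", "corpus_output")` would then return the result dictionary described in the table.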