core

Helper functions for textplumber.

source

pass_tokens

 pass_tokens (tokens:list)

Pass-through function so that pre-tokenized input can be passed directly to CountVectorizer or TfidfVectorizer.


source

get_stop_words

 get_stop_words (save_to:str|None='lexicons_stop_words.txt',
                 spacy_compatible:bool=True)

Get stop words from NLTK (with option to cache to disk).

Parameter         Type        Default                  Details
save_to           str | None  lexicons_stop_words.txt  where to save the file; None will not save
spacy_compatible  bool        True                     whether to make the stop words compatible with spaCy

source

get_example_data

 get_example_data (train_split_name:str='train',
                   test_split_name:str='validation',
                   label_column:str='category',
                   target_labels:list=['blog', 'author', 'speech'])

Get data for examples using the Hugging Face dataset hallisky/AuthorMix. Majority classes are automatically undersampled.

Parameter         Type  Default                       Details
train_split_name  str   train                         configurable, but probably unnecessary to change
test_split_name   str   validation                    could be 'test'
label_column      str   category                      'category' or 'style'
target_labels     list  ['blog', 'author', 'speech']  see the dataset card for information on labels: https://huggingface.co/datasets/hallisky/AuthorMix

The AuthorMix dataset is used for testing and for examples documenting specific components. Two fields can be used as label columns: the default is 'category', but 'style' is also available. Here are some possible configurations.

from textplumber.report import preview_splits
print('Load all classes ...')
X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data()
preview_splits(X_train, y_train, X_test, y_test, target_classes = target_classes, target_names = target_names)
Load all classes ...
Train: 9444 samples, 3 classes
   label_name  count
0      speech   3148
1      author   3148
2        blog   3148
Test: 3598 samples, 3 classes
   label_name  count
1      author   1877
2        blog    981
0      speech    740
print('With specific target labels ...')
X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data(target_labels = ['author', 'speech'])
preview_splits(X_train, y_train, X_test, y_test, target_classes = target_classes, target_names = target_names)
With specific target labels ...
Train: 6296 samples, 2 classes
   label_name  count
0      speech   3148
1      author   3148
Test: 2617 samples, 2 classes
   label_name  count
1      author   1877
0      speech    740
print('Using style rather than category with author names as labels ...')
X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data(label_column = 'style', target_labels = ['fitzgerald', 'hemingway', 'woolf'])
preview_splits(X_train, y_train, X_test, y_test, target_classes = target_classes, target_names = target_names)
Using style rather than category with author names as labels ...
Train: 4407 samples, 3 classes
   label_name  count
3       woolf   1469
4  fitzgerald   1469
5   hemingway   1469
Test: 1877 samples, 3 classes
   label_name  count
4  fitzgerald    885
5   hemingway    504
3       woolf    488
print('Using style rather than category with president names as labels ...')
X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data(label_column = 'style', target_labels = ['obama', 'trump'])
preview_splits(X_train, y_train, X_test, y_test, target_classes = target_classes, target_names = target_names)
Using style rather than category with president names as labels ...
Train: 2336 samples, 2 classes
   label_name  count
0       obama   1168
2       trump   1168
Test: 601 samples, 2 classes
   label_name  count
2       trump    328
0       obama    273