core

Helper functions for textplumber.

source

pass_tokens

 pass_tokens (tokens:list)

Pass-through function so that pre-tokenized input can be passed directly to CountVectorizer or TfidfVectorizer.


source

get_stop_words

 get_stop_words (save_to:str|None='lexicons_stop_words.txt',
                 spacy_compatible:bool=True)

Get stop words from NLTK (with option to cache to disk).

Parameter         Type        Default                  Details
save_to           str | None  lexicons_stop_words.txt  where to save the file; None will not save
spacy_compatible  bool        True                     whether to make the stop words compatible with spaCy

source

get_example_data

 get_example_data (train_split_name:str='train',
                   test_split_name:str='validation',
                   label_column:str='category',
                   target_labels:list=['blog', 'author', 'speech'])

Get data for examples using the Hugging Face dataset hallisky/AuthorMix. Majority classes are automatically undersampled.

Parameter         Type  Default                       Details
train_split_name  str   train                         configurable, but probably unnecessary to change
test_split_name   str   validation                    could be 'test'
label_column      str   category                      'category' or 'style'
target_labels     list  ['blog', 'author', 'speech']  see the dataset card for information on labels: https://huggingface.co/datasets/hallisky/AuthorMix

The AuthorMix dataset is used for testing and for examples documenting specific components. Two fields can be used as label columns: the default is 'category', but 'style' is also available. Here are some possible configurations.

from textplumber.report import preview_splits
print('Load all classes ...')
X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data()
preview_splits(X_train, y_train, X_test, y_test, target_classes = target_classes, target_names = target_names)
Load all classes ...
Train: 9444 samples, 3 classes
   label_name  count
0      speech   3148
1      author   3148
2        blog   3148
Test: 3598 samples, 3 classes
   label_name  count
1      author   1877
2        blog    981
0      speech    740
print('With specific target labels ...')
X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data(target_labels = ['author', 'speech'])
preview_splits(X_train, y_train, X_test, y_test, target_classes = target_classes, target_names = target_names)
With specific target labels ...
Train: 6296 samples, 2 classes
   label_name  count
0      speech   3148
1      author   3148
Test: 2617 samples, 2 classes
   label_name  count
1      author   1877
0      speech    740
print('Using style rather than category with author names as labels ...')
X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data(label_column = 'style', target_labels = ['fitzgerald', 'hemingway', 'woolf'])
preview_splits(X_train, y_train, X_test, y_test, target_classes = target_classes, target_names = target_names)
Using style rather than category with author names as labels ...
Train: 4407 samples, 3 classes
   label_name  count
3       woolf   1469
4  fitzgerald   1469
5   hemingway   1469
Test: 1877 samples, 3 classes
   label_name  count
4  fitzgerald    885
5   hemingway    504
3       woolf    488
print('Using style rather than category with president names as labels ...')
X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data(label_column = 'style', target_labels = ['obama', 'trump'])
preview_splits(X_train, y_train, X_test, y_test, target_classes = target_classes, target_names = target_names)
Using style rather than category with president names as labels ...
Train: 2336 samples, 2 classes
   label_name  count
0       obama   1168
2       trump   1168
Test: 601 samples, 2 classes
   label_name  count
2       trump    328
0       obama    273