### pass_tokens

```
pass_tokens (tokens:list)
```

Pass-through function so that pre-tokenized input can be passed to CountVectorizer or TfidfVectorizer.
### get_stop_words

```
get_stop_words (save_to:str|None='lexicons_stop_words.txt',
                spacy_compatible:bool=True)
```

Get stop words from NLTK (with the option to cache them to disk).

| | Type | Default | Details |
|---|---|---|---|
| save_to | str \| None | lexicons_stop_words.txt | where to save the file; None will not save |
| spacy_compatible | bool | True | whether to make the stop words compatible with spaCy |
### get_example_data

```
get_example_data (train_split_name:str='train',
                  test_split_name:str='validation',
                  label_column:str='category',
                  target_labels:list=['blog', 'author', 'speech'])
```

Get data for examples using the Hugging Face dataset hallisky/AuthorMix. Majority classes are automatically undersampled.

| | Type | Default | Details |
|---|---|---|---|
| train_split_name | str | train | this can be changed, but it is probably unnecessary to do so |
| test_split_name | str | validation | could be 'test' |
| label_column | str | category | 'category' or 'style' |
| target_labels | list | ['blog', 'author', 'speech'] | see the dataset card for information on labels: https://huggingface.co/datasets/hallisky/AuthorMix |

The AuthorMix dataset is used for testing and for examples documenting specific components. There are two fields that can be used as label columns: the default is 'category', but 'style' is also available. Here are some possible configurations.
```python
from textplumber.report import preview_splits

print('Load all classes ...')
X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data()
preview_splits(X_train, y_train, X_test, y_test, target_classes=target_classes, target_names=target_names)
```

```
Load all classes ...
Train: 9444 samples, 3 classes
  0  speech  3148
  1  author  3148
  2  blog    3148
Test: 3598 samples, 3 classes
  1  author  1877
  2  blog     981
  0  speech   740
```
```python
print('With specific target labels ...')
X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data(target_labels=['author', 'speech'])
preview_splits(X_train, y_train, X_test, y_test, target_classes=target_classes, target_names=target_names)
```

```
With specific target labels ...
Train: 6296 samples, 2 classes
  0  speech  3148
  1  author  3148
Test: 2617 samples, 2 classes
  1  author  1877
  0  speech   740
```
```python
print('Using style rather than category with author names as labels ...')
X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data(label_column='style', target_labels=['fitzgerald', 'hemingway', 'woolf'])
preview_splits(X_train, y_train, X_test, y_test, target_classes=target_classes, target_names=target_names)
```

```
Using style rather than category with author names as labels ...
Train: 4407 samples, 3 classes
  3  woolf       1469
  4  fitzgerald  1469
  5  hemingway   1469
Test: 1877 samples, 3 classes
  4  fitzgerald   885
  5  hemingway    504
  3  woolf        488
```
```python
print('Using style rather than category with president names as labels ...')
X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data(label_column='style', target_labels=['obama', 'trump'])
preview_splits(X_train, y_train, X_test, y_test, target_classes=target_classes, target_names=target_names)
```

```
Using style rather than category with president names as labels ...
Train: 2336 samples, 2 classes
  0  obama  1168
  2  trump  1168
Test: 601 samples, 2 classes
  2  trump  328
  0  obama  273
```
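The splits returned by `get_example_data` are plain sequences of texts and labels, so they can be fed directly into a scikit-learn pipeline. A minimal sketch follows; the four toy documents and labels here are hypothetical stand-ins for the real `X_train`/`y_train` splits shown above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy stand-ins for the X_train/y_train splits returned by get_example_data
X_train = ['four score and seven years ago', 'it was a dark and stormy night',
           'ask not what your country can do', 'call me ishmael']
y_train = [0, 1, 0, 1]

# vectorize the texts and fit a simple linear classifier
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(X_train, y_train)
print(pipeline.predict(['my fellow citizens']))
```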