from textplumber.tokens import TokensVectorizer
from textplumber.preprocess import SpacyPreprocessor
from textplumber.store import TextFeatureStore
from textplumber.core import get_example_data
from textplumber.report import plot_confusion_matrix
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.feature_selection import SelectKBest, mutual_info_classif, chi2
tokens
TokensVectorizer
TokensVectorizer (feature_store:textplumber.store.TextFeatureStore, vectorizer_type:str='count', lowercase:bool=False, min_token_length:int=0, remove_punctuation:bool=False, remove_numbers:bool=False, stop_words:list[str]|None=None, min_df:float|int=1, max_df:float|int=1.0, max_features:int=5000, ngram_range:tuple=(1, 1), vocabulary:list|None=None, encoding:str='utf-8', decode_error:str='ignore')
Scikit-learn pipeline component to extract token features. This component should be used after the SpacyPreprocessor component with the same feature store. The component gets the tokens from the feature store and returns a matrix of counts (via CountVectorizer) or tf-idf scores (via TfidfVectorizer).
| | Type | Default | Details |
|---|---|---|---|
| feature_store | TextFeatureStore | | the feature store to use - this should be the same feature store used in the SpacyPreprocessor component |
| vectorizer_type | str | count | the type of vectorizer to use - 'count' for CountVectorizer or 'tfidf' for TfidfVectorizer |
| lowercase | bool | False | whether to lowercase the tokens |
| min_token_length | int | 0 | the minimum token length to use |
| remove_punctuation | bool | False | whether to remove punctuation from the tokens |
| remove_numbers | bool | False | whether to remove numbers from the tokens |
| stop_words | list[str] \| None | None | the stop words to use - passed to CountVectorizer or TfidfVectorizer |
| min_df | float \| int | 1 | the minimum document frequency to use - passed to CountVectorizer or TfidfVectorizer |
| max_df | float \| int | 1.0 | the maximum document frequency to use - passed to CountVectorizer or TfidfVectorizer |
| max_features | int | 5000 | the maximum number of features to use, with a default set to avoid memory issues - passed to CountVectorizer or TfidfVectorizer |
| ngram_range | tuple | (1, 1) | the ngram range to use (min_n, max_n) - passed to CountVectorizer or TfidfVectorizer |
| vocabulary | list \| None | None | list of tokens to use - passed to CountVectorizer or TfidfVectorizer |
| encoding | str | utf-8 | the encoding to use - passed to CountVectorizer or TfidfVectorizer |
| decode_error | str | ignore | what to do if there is an error decoding: 'strict', 'ignore' or 'replace' - passed to CountVectorizer or TfidfVectorizer |
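For example, a count-based vectorizer restricted to lowercased unigrams and bigrams of at least three characters might be configured as below. This is a sketch only; the store file name and parameter values are illustrative.

from textplumber.store import TextFeatureStore
from textplumber.tokens import TokensVectorizer

# Hypothetical store path - reuse the same store as the SpacyPreprocessor step.
feature_store = TextFeatureStore('feature_store_sketch.sqlite')

vectorizer = TokensVectorizer(
    feature_store=feature_store,
    vectorizer_type='count',    # 'tfidf' switches to TfidfVectorizer weighting
    lowercase=True,
    min_token_length=3,
    remove_punctuation=True,
    ngram_range=(1, 2),         # unigrams and bigrams
    max_features=1000,
)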
TokensVectorizer.fit
TokensVectorizer.fit (X, y=None)
Fit the vectorizer to the tokens.
TokensVectorizer.transform
TokensVectorizer.transform (X)
Transform the texts to a matrix of counts or tf-idf scores.
TokensVectorizer.get_feature_names_out
TokensVectorizer.get_feature_names_out (input_features=None)
Get the feature names out from the vectorizer.
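Outside a Pipeline, the same flow can be run step by step. Here is a minimal sketch, assuming standard scikit-learn fit/transform semantics and that SpacyPreprocessor passes the texts through after caching their tokens in the store; the texts and store file name are illustrative.

from textplumber.preprocess import SpacyPreprocessor
from textplumber.store import TextFeatureStore
from textplumber.tokens import TokensVectorizer

store = TextFeatureStore('feature_store_methods_sketch.sqlite')  # hypothetical path
texts = ['The sun rose over the river.', 'The waves broke on the shore.']

# Cache tokens in the store first, then vectorize from the cached tokens.
preprocessor = SpacyPreprocessor(feature_store=store)
texts = preprocessor.fit(texts).transform(texts)

vectorizer = TokensVectorizer(feature_store=store)
vectorizer.fit(texts)
X = vectorizer.transform(texts)            # sparse matrix of token counts
print(vectorizer.get_feature_names_out())  # the vocabulary backing the columns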
Example
Here is an example of using TokensVectorizer in a classification pipeline. We load text samples from Ernest Hemingway and Virginia Woolf, available in the AuthorMix dataset.
X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data(label_column = 'style', target_labels = ['hemingway', 'woolf'])
Create a feature store to save preprocessed texts.
feature_store = TextFeatureStore('feature_store_example_tokens.sqlite')
The SpacyPreprocessor component is required before the TokensVectorizer. Here we train a model with 500 token features based on token counts.
pipeline = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store)),
    ('tokens', TokensVectorizer(feature_store=feature_store, max_features=500)),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)
display(pipeline)
Pipeline(steps=[('preprocessor', SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7f0df71abdd0>)), ('tokens', TokensVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7f0df71abdd0>, max_features=500)), ('classifier', LogisticRegression(max_iter=5000, random_state=55))], verbose=True)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
[Pipeline] ...... (step 1 of 3) Processing preprocessor, total= 17.3s
[Pipeline] ............ (step 2 of 3) Processing tokens, total= 0.4s
[Pipeline] ........ (step 3 of 3) Processing classifier, total= 0.1s
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
precision recall f1-score support
hemingway 0.908 0.954 0.930 504
woolf 0.950 0.900 0.924 488
accuracy 0.927 992
macro avg 0.929 0.927 0.927 992
weighted avg 0.929 0.927 0.927 992
Here we use TokensVectorizer in a more complex pipeline that extracts tf-idf weights for unigrams and bigrams, then selects 500 unigrams and 500 bigrams as features using scikit-learn's SelectKBest transformer with chi2 scores. Since the feature store has already been populated, this training run is faster.
pipeline = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store)),
    ('features', FeatureUnion([
        ('tokens', Pipeline([
            ('vectorizer', TokensVectorizer(feature_store=feature_store, vectorizer_type = 'tfidf', ngram_range = (1, 1), max_features=20000)),
            ('selector', SelectKBest(score_func=chi2, k=500)),
        ])),
        ('ngrams', Pipeline([
            ('vectorizer', TokensVectorizer(feature_store=feature_store, vectorizer_type = 'tfidf', ngram_range = (2, 2), max_features=20000)),
            ('selector', SelectKBest(score_func=chi2, k=500)),
        ])),
    ])),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)
display(pipeline)
Pipeline(steps=[('preprocessor', SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7f0df71abdd0>)), ('features', FeatureUnion(transformer_list=[('tokens', Pipeline(steps=[('vectorizer', TokensVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7f0df71abdd0>, max_features=20000, vectorizer_type='tfidf')), ('selec... ('ngrams', Pipeline(steps=[('vectorizer', TokensVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7f0df71abdd0>, max_features=20000, ngram_range=(2, 2), vectorizer_type='tfidf')), ('selector', SelectKBest(k=500, score_func=<function chi2 at 0x7f0df8cb36a0>))]))])), ('classifier', LogisticRegression(max_iter=5000, random_state=55))], verbose=True)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
[Pipeline] ...... (step 1 of 3) Processing preprocessor, total= 0.2s
[Pipeline] .......... (step 2 of 3) Processing features, total= 1.6s
[Pipeline] ........ (step 3 of 3) Processing classifier, total= 0.1s
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
precision recall f1-score support
hemingway 0.953 0.962 0.958 504
woolf 0.961 0.951 0.956 488
accuracy 0.957 992
macro avg 0.957 0.957 0.957 992
weighted avg 0.957 0.957 0.957 992
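To see which unigrams and bigrams survived selection, get_feature_names_out can be called on the fitted FeatureUnion step. This sketch assumes standard scikit-learn feature-name propagation through Pipeline, FeatureUnion and SelectKBest; the tokens__ and ngrams__ prefixes come from the FeatureUnion step names.

# Names of the 500 + 500 selected features from the fitted pipeline above.
feature_names = pipeline.named_steps['features'].get_feature_names_out()
print(len(feature_names))   # expected: 1000
print(feature_names[:5])    # entries prefixed with 'tokens__' or 'ngrams__'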