tokens

Extract token features.

source

TokensVectorizer

 TokensVectorizer (feature_store:textplumber.store.TextFeatureStore,
                   vectorizer_type:str='count', lowercase:bool=False,
                   min_token_length:int=0, remove_punctuation:bool=False,
                   remove_numbers:bool=False,
                   stop_words:list[str]|None=None, min_df:float|int=1,
                   max_df:float|int=1.0, max_features:int=5000,
                   ngram_range:tuple=(1, 1), vocabulary:list|None=None,
                   encoding:str='utf-8', decode_error:str='ignore')

Scikit-learn pipeline component to extract token features. This component should be used after the SpacyPreprocessor component with the same feature store. It retrieves the tokens from the feature store and returns a matrix of token counts (via CountVectorizer) or tf-idf scores (via TfidfVectorizer).

| | Type | Default | Details |
|---|---|---|---|
| feature_store | TextFeatureStore | | the feature store to use - this should be the same feature store used in the SpacyPreprocessor component |
| vectorizer_type | str | count | the type of vectorizer to use - 'count' for CountVectorizer or 'tfidf' for TfidfVectorizer |
| lowercase | bool | False | whether to lowercase the tokens |
| min_token_length | int | 0 | the minimum token length to keep |
| remove_punctuation | bool | False | whether to remove punctuation from the tokens |
| remove_numbers | bool | False | whether to remove numbers from the tokens |
| stop_words | list[str] \| None | None | the stop words to use - passed to CountVectorizer or TfidfVectorizer |
| min_df | float \| int | 1 | the minimum document frequency - passed to CountVectorizer or TfidfVectorizer |
| max_df | float \| int | 1.0 | the maximum document frequency - passed to CountVectorizer or TfidfVectorizer |
| max_features | int | 5000 | the maximum number of features; the default avoids memory issues - passed to CountVectorizer or TfidfVectorizer |
| ngram_range | tuple | (1, 1) | the ngram range to use as (min_n, max_n) - passed to CountVectorizer or TfidfVectorizer |
| vocabulary | list \| None | None | list of tokens to use as the vocabulary - passed to CountVectorizer or TfidfVectorizer |
| encoding | str | utf-8 | the encoding to use - passed to CountVectorizer or TfidfVectorizer |
| decode_error | str | ignore | what to do on a decoding error - 'strict', 'ignore' or 'replace' - passed to CountVectorizer or TfidfVectorizer |
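A minimal configuration sketch (not from the library's own docs): every keyword below comes from the signature above, and the values are illustrative choices rather than recommendations.

from textplumber.store import TextFeatureStore
from textplumber.tokens import TokensVectorizer

feature_store = TextFeatureStore('feature_store_example_tokens.sqlite')
vectorizer = TokensVectorizer(
    feature_store=feature_store,  # must be the store used by SpacyPreprocessor
    vectorizer_type='tfidf',      # 'count' (default) or 'tfidf'
    lowercase=True,               # lowercase tokens before counting
    min_token_length=2,           # drop single-character tokens
    remove_punctuation=True,      # drop punctuation tokens
    ngram_range=(1, 2),           # unigrams and bigrams
    max_features=10000,           # cap vocabulary size to limit memory use
)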

source

TokensVectorizer.fit

 TokensVectorizer.fit (X, y=None)

Fit the vectorizer to the tokens.


source

TokensVectorizer.transform

 TokensVectorizer.transform (X)

Transform the texts to a matrix of counts or tf-idf scores.


source

TokensVectorizer.get_feature_names_out

 TokensVectorizer.get_feature_names_out (input_features=None)

Get the output feature names from the underlying vectorizer.
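To make the fit/transform flow concrete, here is a minimal sketch. It assumes feature_store has already been populated for texts (a list of strings) by a fitted SpacyPreprocessor, as described above; the variable names are illustrative.

vectorizer = TokensVectorizer(feature_store=feature_store, max_features=100)
vectorizer.fit(texts)                           # build the vocabulary from stored tokens
matrix = vectorizer.transform(texts)            # sparse matrix of token counts
print(matrix.shape)                             # (len(texts), n_features)
print(vectorizer.get_feature_names_out()[:10])  # first ten vocabulary entries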

Example

Here is an example using TokensVectorizer in a text classification pipeline.

from textplumber.tokens import TokensVectorizer
from textplumber.preprocess import SpacyPreprocessor
from textplumber.store import TextFeatureStore
from textplumber.core import get_example_data
from textplumber.report import plot_confusion_matrix

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.feature_selection import SelectKBest, chi2

Here we load text samples by Ernest Hemingway and Virginia Woolf from the AuthorMix dataset.

X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data(label_column = 'style', target_labels = ['hemingway', 'woolf'])

Create a feature store to save preprocessed texts.

feature_store = TextFeatureStore('feature_store_example_tokens.sqlite')

The SpacyPreprocessor component is required before the TokensVectorizer. Here we train a model with 500 token features based on token counts.

pipeline = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store)),
    ('tokens', TokensVectorizer(feature_store=feature_store, max_features=500)),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)

display(pipeline)
Pipeline(steps=[('preprocessor',
                 SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7f0df71abdd0>)),
                ('tokens',
                 TokensVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7f0df71abdd0>,
                                  max_features=500)),
                ('classifier',
                 LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
[Pipeline] ...... (step 1 of 3) Processing preprocessor, total=  17.3s
[Pipeline] ............ (step 2 of 3) Processing tokens, total=   0.4s
[Pipeline] ........ (step 3 of 3) Processing classifier, total=   0.1s
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
              precision    recall  f1-score   support

   hemingway      0.908     0.954     0.930       504
       woolf      0.950     0.900     0.924       488

    accuracy                          0.927       992
   macro avg      0.929     0.927     0.927       992
weighted avg      0.929     0.927     0.927       992
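Because the classifier is a linear model, we can pair each token feature with its coefficient to see which tokens drive the predictions. This is a hedged sketch rather than part of textplumber: the class ordering is read from the fitted classifier, and the step names match the pipeline above.

import numpy as np

clf = pipeline.named_steps['classifier']
feature_names = pipeline.named_steps['tokens'].get_feature_names_out()
order = np.argsort(clf.coef_[0])  # ascending: most negative first
print('most indicative of', clf.classes_[0], [feature_names[i] for i in order[:10]])
print('most indicative of', clf.classes_[1], [feature_names[i] for i in order[-10:]])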

Here we use TokensVectorizer in a more complex pipeline that extracts tf-idf weights for unigrams and bigrams and selects 500 unigrams and 500 bigrams as features using scikit-learn's SelectKBest transformer with chi2 scores. Since the feature store has already been populated, this training run is faster.

pipeline = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store)),
    ('features', FeatureUnion([
        ('tokens', Pipeline([
            ('vectorizer', TokensVectorizer(feature_store=feature_store, vectorizer_type = 'tfidf', ngram_range = (1, 1), max_features=20000)),
            ('selector', SelectKBest(score_func=chi2, k=500)),
        ])),
        ('ngrams', Pipeline([
            ('vectorizer', TokensVectorizer(feature_store=feature_store, vectorizer_type = 'tfidf', ngram_range = (2, 2), max_features=20000)),
            ('selector', SelectKBest(score_func=chi2, k=500)),
        ])),
    ])),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)

display(pipeline)
Pipeline(steps=[('preprocessor',
                 SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7f0df71abdd0>)),
                ('features',
                 FeatureUnion(transformer_list=[('tokens',
                                                 Pipeline(steps=[('vectorizer',
                                                                  TokensVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7f0df71abdd0>,
                                                                                   max_features=20000,
                                                                                   vectorizer_type='tfidf')),
                                                                 ('selec...
                                                ('ngrams',
                                                 Pipeline(steps=[('vectorizer',
                                                                  TokensVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7f0df71abdd0>,
                                                                                   max_features=20000,
                                                                                   ngram_range=(2,
                                                                                                2),
                                                                                   vectorizer_type='tfidf')),
                                                                 ('selector',
                                                                  SelectKBest(k=500,
                                                                              score_func=<function chi2 at 0x7f0df8cb36a0>))]))])),
                ('classifier',
                 LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
[Pipeline] ...... (step 1 of 3) Processing preprocessor, total=   0.2s
[Pipeline] .......... (step 2 of 3) Processing features, total=   1.6s
[Pipeline] ........ (step 3 of 3) Processing classifier, total=   0.1s
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
              precision    recall  f1-score   support

   hemingway      0.953     0.962     0.958       504
       woolf      0.961     0.951     0.956       488

    accuracy                          0.957       992
   macro avg      0.957     0.957     0.957       992
weighted avg      0.957     0.957     0.957       992
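As a follow-up sketch (assuming a recent scikit-learn in which Pipeline, FeatureUnion and SelectKBest all implement get_feature_names_out), we can list which unigrams and bigrams survived the chi2 selection; FeatureUnion prefixes each name with its transformer label.

selected = pipeline.named_steps['features'].get_feature_names_out()
print(len(selected))  # 1000 features: 500 unigrams + 500 bigrams
print(selected[:5])   # unigram features, prefixed with 'tokens__'
print(selected[-5:])  # bigram features, prefixed with 'ngrams__'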