tokens

Extract token features.

source

TokensVectorizer

 TokensVectorizer (feature_store:textplumber.store.TextFeatureStore,
                   vectorizer_type:str='count', lowercase:bool=False,
                   min_token_length:int=0, remove_punctuation:bool=False,
                   remove_numbers:bool=False,
                   stop_words:list[str]|None=None, min_df:float|int=1,
                   max_df:float|int=1.0, max_features:int=5000,
                   ngram_range:tuple=(1, 1), vocabulary:list|None=None,
                   encoding:str='utf-8', decode_error:str='ignore')

Scikit-learn pipeline component to extract token features. This component should be used after the SpacyPreprocessor component with the same feature store. It retrieves the tokens from the feature store and returns a matrix of token counts (via CountVectorizer) or tf-idf scores (via TfidfVectorizer).

| | Type | Default | Details |
|---|---|---|---|
| feature_store | TextFeatureStore | | the feature store to use - this should be the same feature store used in the SpacyPreprocessor component |
| vectorizer_type | str | count | the type of vectorizer to use - 'count' for CountVectorizer or 'tfidf' for TfidfVectorizer |
| lowercase | bool | False | whether to lowercase the tokens |
| min_token_length | int | 0 | the minimum token length to keep |
| remove_punctuation | bool | False | whether to remove punctuation from the tokens |
| remove_numbers | bool | False | whether to remove numbers from the tokens |
| stop_words | list[str] \| None | None | the stop words to use - passed to CountVectorizer or TfidfVectorizer |
| min_df | float \| int | 1 | the minimum document frequency - passed to CountVectorizer or TfidfVectorizer |
| max_df | float \| int | 1.0 | the maximum document frequency - passed to CountVectorizer or TfidfVectorizer |
| max_features | int | 5000 | the maximum number of features; the default avoids memory issues - passed to CountVectorizer or TfidfVectorizer |
| ngram_range | tuple | (1, 1) | the ngram range to use as (min_n, max_n) - passed to CountVectorizer or TfidfVectorizer |
| vocabulary | list \| None | None | list of tokens to use as the vocabulary - passed to CountVectorizer or TfidfVectorizer |
| encoding | str | utf-8 | the encoding to use - passed to CountVectorizer or TfidfVectorizer |
| decode_error | str | ignore | what to do on a decoding error - 'strict', 'ignore' or 'replace' - passed to CountVectorizer or TfidfVectorizer |
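A minimal configuration sketch (not from the library's own docs): every keyword below comes from the signature above, and the values are illustrative choices rather than recommendations.

from textplumber.store import TextFeatureStore
from textplumber.tokens import TokensVectorizer

feature_store = TextFeatureStore('feature_store_example_tokens.sqlite')
vectorizer = TokensVectorizer(
    feature_store=feature_store,  # must be the store used by SpacyPreprocessor
    vectorizer_type='tfidf',      # 'count' (default) or 'tfidf'
    lowercase=True,               # lowercase tokens before counting
    min_token_length=2,           # drop single-character tokens
    remove_punctuation=True,      # drop punctuation tokens
    ngram_range=(1, 2),           # unigrams and bigrams
    max_features=10000,           # cap vocabulary size to limit memory use
)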

source

TokensVectorizer.fit

 TokensVectorizer.fit (X, y=None)

Fit the vectorizer to the tokens.


source

TokensVectorizer.transform

 TokensVectorizer.transform (X)

Transform the texts to a matrix of counts or tf-idf scores.


source

TokensVectorizer.get_feature_names_out

 TokensVectorizer.get_feature_names_out (input_features=None)

Get the output feature names from the underlying vectorizer.
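To make the fit/transform flow concrete, here is a minimal sketch. It assumes feature_store has already been populated for texts (a list of strings) by a fitted SpacyPreprocessor, as described above; the variable names are illustrative.

vectorizer = TokensVectorizer(feature_store=feature_store, max_features=100)
vectorizer.fit(texts)                           # build the vocabulary from stored tokens
matrix = vectorizer.transform(texts)            # sparse matrix of token counts
print(matrix.shape)                             # (len(texts), n_features)
print(vectorizer.get_feature_names_out()[:10])  # first ten vocabulary entries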

Example

Here is an example using TokensVectorizer in a text classification pipeline.

from textplumber.tokens import TokensVectorizer
from textplumber.preprocess import SpacyPreprocessor
from textplumber.store import TextFeatureStore
from textplumber.core import get_example_data
from textplumber.report import plot_confusion_matrix

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.feature_selection import SelectKBest, chi2

Here we load text samples by Ernest Hemingway and Virginia Woolf from the AuthorMix dataset.

X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data(label_column = 'style', target_labels = ['hemingway', 'woolf'])

Create a feature store to save preprocessed texts.

feature_store = TextFeatureStore('feature_store_example_tokens.sqlite')

The SpacyPreprocessor component is required before the TokensVectorizer. Here we train a model with 500 token features based on token counts.

pipeline = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store)),
    ('tokens', TokensVectorizer(feature_store=feature_store, max_features=500)),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)

display(pipeline)
Pipeline(steps=[('preprocessor',
                 SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7f0df71abdd0>)),
                ('tokens',
                 TokensVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7f0df71abdd0>,
                                  max_features=500)),
                ('classifier',
                 LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
[Pipeline] ...... (step 1 of 3) Processing preprocessor, total=  17.3s
[Pipeline] ............ (step 2 of 3) Processing tokens, total=   0.4s
[Pipeline] ........ (step 3 of 3) Processing classifier, total=   0.1s
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
              precision    recall  f1-score   support

   hemingway      0.908     0.954     0.930       504
       woolf      0.950     0.900     0.924       488

    accuracy                          0.927       992
   macro avg      0.929     0.927     0.927       992
weighted avg      0.929     0.927     0.927       992
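Because the classifier is a linear model, we can pair each token feature with its coefficient to see which tokens drive the predictions. This is a hedged sketch rather than part of textplumber: the class ordering is read from the fitted classifier, and the step names match the pipeline above.

import numpy as np

clf = pipeline.named_steps['classifier']
feature_names = pipeline.named_steps['tokens'].get_feature_names_out()
order = np.argsort(clf.coef_[0])  # ascending: most negative first
print('most indicative of', clf.classes_[0], [feature_names[i] for i in order[:10]])
print('most indicative of', clf.classes_[1], [feature_names[i] for i in order[-10:]])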

Here we use TokensVectorizer in a more complex pipeline that extracts tf-idf weights for unigrams and bigrams and selects 500 unigrams and 500 bigrams as features using scikit-learn's SelectKBest transformer with chi2 scores. Since the feature store has already been populated, this training run is faster.

pipeline = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store)),
    ('features', FeatureUnion([
        ('tokens', Pipeline([
            ('vectorizer', TokensVectorizer(feature_store=feature_store, vectorizer_type = 'tfidf', ngram_range = (1, 1), max_features=20000)),
            ('selector', SelectKBest(score_func=chi2, k=500)),
        ])),
        ('ngrams', Pipeline([
            ('vectorizer', TokensVectorizer(feature_store=feature_store, vectorizer_type = 'tfidf', ngram_range = (2, 2), max_features=20000)),
            ('selector', SelectKBest(score_func=chi2, k=500)),
        ])),
    ])),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)

display(pipeline)
Pipeline(steps=[('preprocessor',
                 SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7f0df71abdd0>)),
                ('features',
                 FeatureUnion(transformer_list=[('tokens',
                                                 Pipeline(steps=[('vectorizer',
                                                                  TokensVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7f0df71abdd0>,
                                                                                   max_features=20000,
                                                                                   vectorizer_type='tfidf')),
                                                                 ('selec...
                                                ('ngrams',
                                                 Pipeline(steps=[('vectorizer',
                                                                  TokensVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7f0df71abdd0>,
                                                                                   max_features=20000,
                                                                                   ngram_range=(2,
                                                                                                2),
                                                                                   vectorizer_type='tfidf')),
                                                                 ('selector',
                                                                  SelectKBest(k=500,
                                                                              score_func=<function chi2 at 0x7f0df8cb36a0>))]))])),
                ('classifier',
                 LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
[Pipeline] ...... (step 1 of 3) Processing preprocessor, total=   0.2s
[Pipeline] .......... (step 2 of 3) Processing features, total=   1.6s
[Pipeline] ........ (step 3 of 3) Processing classifier, total=   0.1s
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
              precision    recall  f1-score   support

   hemingway      0.953     0.962     0.958       504
       woolf      0.961     0.951     0.956       488

    accuracy                          0.957       992
   macro avg      0.957     0.957     0.957       992
weighted avg      0.957     0.957     0.957       992
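As a follow-up sketch (assuming a recent scikit-learn in which Pipeline, FeatureUnion and SelectKBest all implement get_feature_names_out), we can list which unigrams and bigrams survived the chi2 selection; FeatureUnion prefixes each name with its transformer label.

selected = pipeline.named_steps['features'].get_feature_names_out()
print(len(selected))  # 1000 features: 500 unigrams + 500 bigrams
print(selected[:5])   # unigram features, prefixed with 'tokens__'
print(selected[-5:])  # bigram features, prefixed with 'ngrams__'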