```python
from textplumber.chars import CharNgramVectorizer
from textplumber.core import get_example_data
from textplumber.report import plot_confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.feature_selection import SelectKBest, mutual_info_classif, chi2
```
chars
This functionality is not available in the latest version on PyPI (0.0.8), but will be released as part of version 0.0.9.
CharNgramVectorizer
CharNgramVectorizer (feature_store:textplumber.store.TextFeatureStore=None, vectorizer_type:str='count', ngram_range:tuple=(2, 2), lowercase:bool=False, min_df:float|int=1, max_df:float|int=1.0, max_features:int=5000, vocabulary:list|None=None, analyzer:str='char', encoding:str='utf-8', decode_error:str='ignore')
Scikit-learn pipeline component to extract character ngram features.
|  | Type | Default | Details |
|---|---|---|---|
| feature_store | TextFeatureStore | None | (not currently implemented) |
| vectorizer_type | str | count | the type of vectorizer to use - 'count' for CountVectorizer or 'tfidf' for TfidfVectorizer |
| ngram_range | tuple | (2, 2) | the ngram range to use (min_n, max_n) - passed to CountVectorizer or TfidfVectorizer |
| lowercase | bool | False | whether to lowercase the character ngrams - passed to CountVectorizer or TfidfVectorizer |
| min_df | float \| int | 1 | the minimum document frequency to use - passed to CountVectorizer or TfidfVectorizer |
| max_df | float \| int | 1.0 | the maximum document frequency to use - passed to CountVectorizer or TfidfVectorizer |
| max_features | int | 5000 | the maximum number of features to use, with a default set to avoid memory issues - passed to CountVectorizer or TfidfVectorizer |
| vocabulary | list \| None | None | list of tokens to use - passed to CountVectorizer or TfidfVectorizer |
| analyzer | str | char | the analyzer to use - 'char' or 'char_wb' - passed to CountVectorizer or TfidfVectorizer |
| encoding | str | utf-8 | the encoding to use - passed to CountVectorizer or TfidfVectorizer |
| decode_error | str | ignore | what to do if there is an error decoding - 'strict', 'ignore' or 'replace' - passed to CountVectorizer or TfidfVectorizer |
CharNgramVectorizer.fit
CharNgramVectorizer.fit (X, y=None)
Fit the vectorizer.
CharNgramVectorizer.transform
CharNgramVectorizer.transform (X)
Transform the texts to a matrix of counts or tf-idf scores.
CharNgramVectorizer.get_feature_names_out
CharNgramVectorizer.get_feature_names_out (input_features=None)
Get the feature names out from the model.
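The following is a minimal sketch, not taken from the original documentation, of the vectorizer used on its own with the methods above. It uses the 'tfidf' vectorizer type and the 'char_wb' analyzer from the parameter table; the sample texts are invented for illustration, and the return value of `transform` follows the underlying CountVectorizer/TfidfVectorizer behaviour.

```python
from textplumber.chars import CharNgramVectorizer

# invented sample texts for illustration only
texts = [
    'The old man fished alone in the skiff.',
    'The waves broke and spread their waters across the shore.',
]

# character trigrams within word boundaries, weighted by tf-idf
vectorizer = CharNgramVectorizer(vectorizer_type='tfidf', ngram_range=(3, 3),
                                 analyzer='char_wb', lowercase=True, max_features=100)
vectorizer.fit(texts)

X = vectorizer.transform(texts)                   # matrix of tf-idf scores (documents x ngrams)
print(X.shape)
print(vectorizer.get_feature_names_out()[:10])    # first few character trigrams
```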
Example
Here is an example demonstrating how to use CharNgramVectorizer
in a pipeline.
Here we load text samples from Ernest Hemingway and Virginia Woolf available in the AuthorMix dataset.
```python
X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data(label_column = 'style', target_labels = ['hemingway', 'woolf'])
```
The next cell creates a very simple classification pipeline that extracts 1000 lower-cased character bigrams as features.
```python
pipeline = Pipeline([
        ('charngrams', CharNgramVectorizer(ngram_range = (2, 2), lowercase = True, max_features=1000)),
        ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
    ], verbose=True)

display(pipeline)
```
```
Pipeline(steps=[('charngrams', CharNgramVectorizer(max_features=1000)),
                ('classifier', LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)
```
```python
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```

```
[Pipeline] ........ (step 1 of 2) Processing charngrams, total=   0.8s
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   0.9s
```
```python
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
```

```
              precision    recall  f1-score   support

   hemingway      0.921     0.921     0.921       504
       woolf      0.918     0.918     0.918       488

    accuracy                          0.919       992
   macro avg      0.919     0.919     0.919       992
weighted avg      0.919     0.919     0.919       992
```
The lowercase parameter is set to False by default, meaning 'Go' is treated as different from 'go'. As the next example shows, preserving case can make a difference to accuracy.
```python
pipeline = Pipeline([
        ('charngrams', CharNgramVectorizer(ngram_range = (2, 2), lowercase = False, max_features=1000)),
        ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
    ], verbose=True)

display(pipeline)
```
```
Pipeline(steps=[('charngrams', CharNgramVectorizer(lowercase=False, max_features=1000)),
                ('classifier', LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)
```
```python
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```

```
[Pipeline] ........ (step 1 of 2) Processing charngrams, total=   0.9s
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   0.9s
```
```python
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
```

```
              precision    recall  f1-score   support

   hemingway      0.927     0.937     0.932       504
       woolf      0.934     0.924     0.929       488

    accuracy                          0.930       992
   macro avg      0.931     0.930     0.930       992
weighted avg      0.930     0.930     0.930       992
```
In this example the ngram range and max_features are adjusted to extract more ngrams of varying lengths. However, only 500 of these are used as features for classification (i.e. half the number used in the examples above), selected on the basis of their mutual information scores.
```python
pipeline = Pipeline([
        ('charngrams', CharNgramVectorizer(ngram_range = (2, 4), lowercase = False, max_features=20000)),
        ('selector', SelectKBest(score_func=mutual_info_classif, k=500)),
        ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
    ], verbose=True)

display(pipeline)
```
```
Pipeline(steps=[('charngrams',
                 CharNgramVectorizer(lowercase=False, max_features=20000,
                                     ngram_range=(2, 4))),
                ('selector',
                 SelectKBest(k=500,
                             score_func=<function mutual_info_classif at 0x7f0c49b2c5e0>)),
                ('classifier', LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)
```
```python
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```

```
[Pipeline] ........ (step 1 of 3) Processing charngrams, total=   3.3s
[Pipeline] .......... (step 2 of 3) Processing selector, total=  27.0s
[Pipeline] ........ (step 3 of 3) Processing classifier, total=   1.8s
```
```python
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
```

```
              precision    recall  f1-score   support

   hemingway      0.938     0.933     0.935       504
       woolf      0.931     0.936     0.934       488

    accuracy                          0.934       992
   macro avg      0.934     0.935     0.934       992
weighted avg      0.934     0.934     0.934       992
```
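Mutual information scoring dominates the fitting time above (the selector step took around 27 seconds). As a variation, not part of the original examples, the sketch below swaps in the chi-squared statistic imported at the top of the page as the score function. chi2 is computed directly from contingency counts and is typically much faster, but it requires non-negative features (which character ngram counts are), and the selected features and resulting scores may differ slightly.

```python
# A sketch (not from the original notebook): chi2 as a faster score function
# for SelectKBest. chi2 requires non-negative feature values, which character
# ngram counts satisfy.
pipeline = Pipeline([
        ('charngrams', CharNgramVectorizer(ngram_range = (2, 4), lowercase = False, max_features=20000)),
        ('selector', SelectKBest(score_func=chi2, k=500)),
        ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
    ], verbose=True)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
```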