```python
from textplumber.chars import CharNgramVectorizer
from textplumber.core import get_example_data
from textplumber.report import plot_confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.feature_selection import SelectKBest, mutual_info_classif, chi2
```
chars
This functionality is not available in the latest version on PyPI (0.0.8), but will be released as part of version 0.0.9.
CharNgramVectorizer
CharNgramVectorizer (feature_store:textplumber.store.TextFeatureStore=None, vectorizer_type:str='count', ngram_range:tuple=(2, 2), lowercase:bool=False, min_df:float|int=1, max_df:float|int=1.0, max_features:int=5000, vocabulary:list|None=None, analyzer:str='char', encoding:str='utf-8', decode_error:str='ignore')
Scikit-learn pipeline component to extract character ngram features.
|  | Type | Default | Details |
|---|---|---|---|
| feature_store | TextFeatureStore | None | (not currently implemented) |
| vectorizer_type | str | count | the type of vectorizer to use - 'count' for CountVectorizer or 'tfidf' for TfidfVectorizer |
| ngram_range | tuple | (2, 2) | the ngram range to use (min_n, max_n) - passed to CountVectorizer or TfidfVectorizer |
| lowercase | bool | False | whether to lowercase the character ngrams - passed to CountVectorizer or TfidfVectorizer |
| min_df | float \| int | 1 | the minimum document frequency to use - passed to CountVectorizer or TfidfVectorizer |
| max_df | float \| int | 1.0 | the maximum document frequency to use - passed to CountVectorizer or TfidfVectorizer |
| max_features | int | 5000 | the maximum number of features to use, with a default set to avoid memory issues - passed to CountVectorizer or TfidfVectorizer |
| vocabulary | list \| None | None | list of tokens to use - passed to CountVectorizer or TfidfVectorizer |
| analyzer | str | char | the analyzer to use - 'char' or 'char_wb' - passed to CountVectorizer or TfidfVectorizer |
| encoding | str | utf-8 | the encoding to use - passed to CountVectorizer or TfidfVectorizer |
| decode_error | str | ignore | what to do if there is an error decoding - 'strict', 'ignore' or 'replace' - passed to CountVectorizer or TfidfVectorizer |
CharNgramVectorizer.fit
CharNgramVectorizer.fit (X, y=None)
Fit the vectorizer.
CharNgramVectorizer.transform
CharNgramVectorizer.transform (X)
Transform the texts to a matrix of counts or tf-idf scores.
CharNgramVectorizer.get_feature_names_out
CharNgramVectorizer.get_feature_names_out (input_features=None)
Get the feature names out from the model.
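The following is a minimal sketch, not taken from the original documentation, of the vectorizer used on its own with the methods above. It uses the 'tfidf' vectorizer type and the 'char_wb' analyzer from the parameter table; the sample texts are invented for illustration, and the return value of `transform` follows the underlying CountVectorizer/TfidfVectorizer behaviour.

```python
from textplumber.chars import CharNgramVectorizer

# invented sample texts for illustration only
texts = [
    'The old man fished alone in the skiff.',
    'The waves broke and spread their waters across the shore.',
]

# character trigrams within word boundaries, weighted by tf-idf
vectorizer = CharNgramVectorizer(vectorizer_type='tfidf', ngram_range=(3, 3),
                                 analyzer='char_wb', lowercase=True, max_features=100)
vectorizer.fit(texts)

X = vectorizer.transform(texts)                   # matrix of tf-idf scores (documents x ngrams)
print(X.shape)
print(vectorizer.get_feature_names_out()[:10])    # first few character trigrams
```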
Example
Here is an example demonstrating how to use CharNgramVectorizer
in a pipeline.
Here we load text samples from Ernest Hemingway and Virginia Woolf available in the AuthorMix dataset.
```python
X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data(label_column = 'style', target_labels = ['hemingway', 'woolf'])
```
The next cell creates a very simple classification pipeline that extracts 1000 lower-cased character bigrams as features.
```python
pipeline = Pipeline([
        ('charngrams', CharNgramVectorizer(ngram_range = (2, 2), lowercase = True, max_features=1000)),
        ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
    ], verbose=True)

display(pipeline)
```
```
Pipeline(steps=[('charngrams', CharNgramVectorizer(max_features=1000)),
                ('classifier', LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)
```
```python
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```

```
[Pipeline] ........ (step 1 of 2) Processing charngrams, total=   0.8s
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   0.9s
```
```python
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
```

```
              precision    recall  f1-score   support

   hemingway      0.921     0.921     0.921       504
       woolf      0.918     0.918     0.918       488

    accuracy                          0.919       992
   macro avg      0.919     0.919     0.919       992
weighted avg      0.919     0.919     0.919       992
```
The lowercase parameter is set to False by default, meaning 'Go' is treated as different from 'go'. As the next example shows, preserving case can make a difference to accuracy.
```python
pipeline = Pipeline([
        ('charngrams', CharNgramVectorizer(ngram_range = (2, 2), lowercase = False, max_features=1000)),
        ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
    ], verbose=True)

display(pipeline)
```
```
Pipeline(steps=[('charngrams', CharNgramVectorizer(lowercase=False, max_features=1000)),
                ('classifier', LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)
```
```python
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```

```
[Pipeline] ........ (step 1 of 2) Processing charngrams, total=   0.9s
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   0.9s
```
```python
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
```

```
              precision    recall  f1-score   support

   hemingway      0.927     0.937     0.932       504
       woolf      0.934     0.924     0.929       488

    accuracy                          0.930       992
   macro avg      0.931     0.930     0.930       992
weighted avg      0.930     0.930     0.930       992
```
In this example the ngram range and max_features are adjusted to extract more ngrams of varying lengths. However, only 500 of these are used as features for classification (i.e. half the number used in the examples above), selected on the basis of their mutual information scores.
```python
pipeline = Pipeline([
        ('charngrams', CharNgramVectorizer(ngram_range = (2, 4), lowercase = False, max_features=20000)),
        ('selector', SelectKBest(score_func=mutual_info_classif, k=500)),
        ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
    ], verbose=True)

display(pipeline)
```
```
Pipeline(steps=[('charngrams',
                 CharNgramVectorizer(lowercase=False, max_features=20000,
                                     ngram_range=(2, 4))),
                ('selector',
                 SelectKBest(k=500,
                             score_func=<function mutual_info_classif at 0x7f0c49b2c5e0>)),
                ('classifier', LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)
```
```python
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```

```
[Pipeline] ........ (step 1 of 3) Processing charngrams, total=   3.3s
[Pipeline] .......... (step 2 of 3) Processing selector, total=  27.0s
[Pipeline] ........ (step 3 of 3) Processing classifier, total=   1.8s
```
```python
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
```

```
              precision    recall  f1-score   support

   hemingway      0.938     0.933     0.935       504
       woolf      0.931     0.936     0.934       488

    accuracy                          0.934       992
   macro avg      0.934     0.935     0.934       992
weighted avg      0.934     0.934     0.934       992
```
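Mutual information scoring dominates the fitting time above (the selector step took around 27 seconds). As a variation, not part of the original examples, the sketch below swaps in the chi-squared statistic imported at the top of the page as the score function. chi2 is computed directly from contingency counts and is typically much faster, but it requires non-negative features (which character ngram counts are), and the selected features and resulting scores may differ slightly.

```python
# A sketch (not from the original notebook): chi2 as a faster score function
# for SelectKBest. chi2 requires non-negative feature values, which character
# ngram counts satisfy.
pipeline = Pipeline([
        ('charngrams', CharNgramVectorizer(ngram_range = (2, 4), lowercase = False, max_features=20000)),
        ('selector', SelectKBest(score_func=chi2, k=500)),
        ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
    ], verbose=True)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
```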