from textplumber.embeddings import Model2VecEmbedder
from textplumber.core import get_example_data
from textplumber.report import plot_confusion_matrix
from textplumber.store import TextFeatureStore
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
embeddings
Model2VecEmbedder
Model2VecEmbedder (feature_store:textplumber.store.TextFeatureStore, model_name:str='minishlab/potion-base-8M', batch_size:int=5000)
Scikit-learn pipeline component to extract embeddings using Model2Vec.
| | Type | Default | Details |
|---|---|---|---|
| feature_store | TextFeatureStore | | the feature store to use - this should be the same feature store used in the SpacyPreprocessor component |
| model_name | str | minishlab/potion-base-8M | the model name to use |
| batch_size | int | 5000 | batch size for encoding text |
From the Model2Vec documentation: “Model2Vec is a technique to turn any sentence transformer into a really small static model, reducing model size by a factor up to 50 and making the models up to 500 times faster, with a small drop in performance.”
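As a minimal sketch of constructing the component on its own, with the defaults from the table above written out explicitly (the store filename is illustrative; a full pipeline example follows below):

from textplumber.store import TextFeatureStore
from textplumber.embeddings import Model2VecEmbedder

# the feature store caches computed embeddings in a local SQLite file (illustrative filename)
feature_store = TextFeatureStore('embeddings_cache.sqlite')

# defaults shown explicitly: the potion-base-8M model, encoding in batches of 5000 texts
embedder = Model2VecEmbedder(feature_store=feature_store,
                             model_name='minishlab/potion-base-8M',
                             batch_size=5000)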
Model2VecEmbedder.fit
Model2VecEmbedder.fit (X, y=None)
Fit is implemented, but does nothing.
Model2VecEmbedder.transform
Model2VecEmbedder.transform (X)
Generate embeddings for the texts using Model2Vec. If the embeddings are already in the feature store, they are used instead of recomputing them. Processing is done in batches of 1000 texts to avoid memory issues.
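Because results are cached, calling transform twice over the same texts only encodes them once. A sketch, reusing the embedder constructed above and assuming transform returns a NumPy array of shape (n_texts, n_dimensions):

texts = ['first example document', 'second example document']

# first call encodes the texts with Model2Vec and writes the vectors to the feature store
X_first = embedder.transform(texts)

# second call retrieves the cached vectors from the feature store instead of re-encoding
X_second = embedder.transform(texts)

print(X_first.shape)  # (2, n_dimensions)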
Model2VecEmbedder.get_feature_names_out
Model2VecEmbedder.get_feature_names_out (input_features=None)
Get the feature names out from the model.
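This can be used to label the columns of the transformed output, for example when building a DataFrame. A sketch, assuming one generated name per embedding dimension:

# one name per embedding dimension, aligned with the columns of transform's output
feature_names = embedder.get_feature_names_out()
print(len(feature_names))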
Example
Here is an example demonstrating how to use Model2VecEmbedder in a pipeline.
Here we load text samples from Barack Obama and Donald Trump, available in the AuthorMix dataset.
X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data(label_column = 'style', target_labels = ['obama', 'trump'])
Create a feature store to store embeddings …
feature_store = TextFeatureStore('feature_store_example_embeddings.sqlite')
A very simple classification pipeline …
pipeline = Pipeline([
    ('embeddings', Model2VecEmbedder(feature_store=feature_store)),
    ('classifier', LogisticRegression(random_state=55))
], verbose=True)
display(pipeline)
Pipeline(steps=[('embeddings', Model2VecEmbedder(feature_store=<textplumber.store.TextFeatureStore object at 0x7f3e0972d4d0>)), ('classifier', LogisticRegression(random_state=55))], verbose=True)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
[Pipeline] ........ (step 1 of 2) Processing embeddings, total= 0.6s
[Pipeline] ........ (step 2 of 2) Processing classifier, total= 1.7s
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
precision recall f1-score support
obama 0.784 0.824 0.804 273
trump 0.847 0.811 0.829 328
accuracy 0.817 601
macro avg 0.816 0.818 0.816 601
weighted avg 0.818 0.817 0.817 601
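Because the feature store persists to SQLite ('feature_store_example_embeddings.sqlite' above), rerunning this example reuses the cached embeddings rather than re-encoding the training and test texts. The fitted pipeline can also score new texts directly; a sketch with illustrative input:

# predict labels for unseen texts with the fitted pipeline
new_texts = ['A short speech excerpt to classify.']
print(pipeline.predict(new_texts))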