embeddings

Extract text embedding features.

source

Model2VecEmbedder

 Model2VecEmbedder (feature_store:textplumber.store.TextFeatureStore,
                    model_name:str='minishlab/potion-base-8M',
                    batch_size:int=5000)

Scikit-learn pipeline component to extract embeddings using Model2Vec.

|  | Type | Default | Details |
|---|---|---|---|
| feature_store | TextFeatureStore |  | the feature store to use; this should be the same feature store used in the SpacyPreprocessor component |
| model_name | str | minishlab/potion-base-8M | the model name to use |
| batch_size | int | 5000 | batch size for encoding text |
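
For instance, a minimal construction sketch (the sqlite path is illustrative, and the keyword values shown are simply the documented defaults):

from textplumber.store import TextFeatureStore
from textplumber.embeddings import Model2VecEmbedder

feature_store = TextFeatureStore('features.sqlite')  # illustrative path
embedder = Model2VecEmbedder(
    feature_store=feature_store,            # share this store with SpacyPreprocessor
    model_name='minishlab/potion-base-8M',  # documented default model
    batch_size=5000                         # documented default batch size
)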

“Model2Vec is a technique to turn any sentence transformer into a really small static model, reducing model size by a factor up to 50 and making the models up to 500 times faster, with a small drop in performance.”

Model2Vec
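
For context, this is roughly what the component wraps: encoding texts with a Model2Vec static model via the model2vec package. This is a sketch using that package's documented API; Model2VecEmbedder may differ in its internals:

from model2vec import StaticModel

model = StaticModel.from_pretrained('minishlab/potion-base-8M')
embeddings = model.encode(['We must act now.', 'Thank you very much.'])
print(embeddings.shape)  # (2, embedding_dim), e.g. 256 dimensions for potion-base-8M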


source

Model2VecEmbedder.fit

 Model2VecEmbedder.fit (X, y=None)

Fit is implemented but does nothing; it exists to satisfy the scikit-learn transformer interface.
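
Since no state is learned from the training data, fit only needs to return self so the component can slot into a Pipeline. A sketch of the likely shape (assumed, not copied from the source):

def fit(self, X, y=None):
    # stateless: nothing is learned here, but Pipeline.fit() still calls it
    return self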


source

Model2VecEmbedder.transform

 Model2VecEmbedder.transform (X)

Generate embeddings for the texts using Model2Vec. If the embeddings are already in the feature store, they are reused rather than recomputed. Processing is done in batches (the batch_size parameter, default 5000) to avoid memory issues.
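
A sketch of calling transform directly, reusing the embedder from the construction sketch above (the texts and output shape are illustrative; the embedding width depends on the chosen model):

X_emb = embedder.transform(['first document', 'second document'])
print(X_emb.shape)  # (2, embedding_dim); repeated calls reuse cached vectors from the store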


source

Model2VecEmbedder.get_feature_names_out

 Model2VecEmbedder.get_feature_names_out (input_features=None)

Get the output feature names from the model.
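
Continuing the sketch above, get_feature_names_out should return one name per embedding dimension (the exact labels depend on the implementation):

names = embedder.get_feature_names_out()
print(len(names))  # one entry per embedding dimension
print(names[:3])   # exact naming scheme is implementation-defined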

Example

Here is an example demonstrating how to use Model2VecEmbedder in a pipeline.

from textplumber.embeddings import Model2VecEmbedder
from textplumber.core import get_example_data
from textplumber.report import plot_confusion_matrix
from textplumber.store import TextFeatureStore

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

Here we load text samples from Barack Obama and Donald Trump, available in the AuthorMix dataset.

X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data(label_column = 'style', target_labels = ['obama', 'trump'])

Create a feature store to cache the embeddings …

feature_store = TextFeatureStore('feature_store_example_embeddings.sqlite')

A very simple classification pipeline …

pipeline = Pipeline([
    ('embeddings', Model2VecEmbedder(feature_store=feature_store)),
    ('classifier', LogisticRegression(random_state=55))
], verbose=True)

display(pipeline)
Pipeline(steps=[('embeddings',
                 Model2VecEmbedder(feature_store=<textplumber.store.TextFeatureStore object at 0x7f3e0972d4d0>)),
                ('classifier', LogisticRegression(random_state=55))],
         verbose=True)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
[Pipeline] ........ (step 1 of 2) Processing embeddings, total=   0.6s
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   1.7s
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
              precision    recall  f1-score   support

       obama      0.784     0.824     0.804       273
       trump      0.847     0.811     0.829       328

    accuracy                          0.817       601
   macro avg      0.816     0.818     0.816       601
weighted avg      0.818     0.817     0.817       601