```python
from datasets import load_dataset
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
import pandas as pd

from textplumber.report import plot_confusion_matrix, get_label_names, save_results, plot_logistic_regression_features_from_pipeline
from textplumber.vader import VaderSentimentEstimator, VaderSentimentExtractor
from textplumber.preprocess import SpacyPreprocessor
from textplumber.embeddings import Model2VecEmbedder
from textplumber.store import TextFeatureStore  # used below to store preprocessed texts
from textplumber.core import get_stop_words
```
vader
Textplumber implements feature extraction using VADER, “a lexicon and rule-based sentiment analysis tool”. VADER’s GitHub repository and Hutto and Gilbert’s 2014 paper introducing VADER are the best explanations of how VADER works and how its lexicon and rules were derived.
This functionality is not available in the latest version on PyPI (0.0.8); it will be released as part of version 0.0.9.
VaderSentimentExtractor
VaderSentimentExtractor (feature_store:textplumber.store.TextFeatureStore=None, output:str='polarity', neutral_threshold:float=0.05, profile_first_n:int=3, profile_last_n:int=3, profile_sample_n:int=4, profile_min_sentence_chars:int=10, profile_sections:int=10)
Scikit-learn pipeline component to extract sentiment features using VADER.
| | Type | Default | Details |
|---|---|---|---|
| feature_store | TextFeatureStore | None | not currently implemented |
| output | str | polarity | ‘polarity’ (VADER’s compound score), ‘proportions’ (the proportions of the text that are positive, neutral or negative), ‘allstats’ (equivalent to ‘polarity’ + ‘proportions’), ‘labels’ (positive, neutral, negative), or ‘profile’ (a document sentiment profile vector of document-level and sentence-level features, preserving their order in the document) |
| neutral_threshold | float | 0.05 | threshold for neutral sentiment |
| profile_first_n | int | 3 | number of sentences at the start of the document to profile |
| profile_last_n | int | 3 | number of sentences at the end of the document to profile |
| profile_sample_n | int | 4 | number of sentences to sample from the body after the first and last sentences are removed |
| profile_min_sentence_chars | int | 10 | minimum number of characters for a body sentence to be included in the profile |
| profile_sections | int | 10 | number of sections to split the document into for profiling |
By default neutral_threshold is set to 0.05, which means that any text with a polarity greater than -0.05 and less than 0.05 will be ‘neutral’. The 0.05 default follows the recommendation on the VADER GitHub page, but it can be tuned as needed.
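To make the thresholding concrete, here is a minimal illustrative sketch of the mapping (this is not textplumber's implementation, which is provided by convert_score_to_label below):

```python
# Illustrative only: how a compound polarity score maps to a label
# under the default neutral_threshold of 0.05.
def to_label(compound: float, neutral_threshold: float = 0.05) -> str:
    if compound >= neutral_threshold:
        return 'positive'
    if compound <= -neutral_threshold:
        return 'negative'
    return 'neutral'

print(to_label(0.03))   # 'neutral'
print(to_label(-0.42))  # 'negative'
```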
VaderSentimentExtractor.fit
VaderSentimentExtractor.fit (X, y=None)
Fit is implemented, but does nothing.
VaderSentimentExtractor.convert_score_to_label
VaderSentimentExtractor.convert_score_to_label (score:float, label_mapping=None)
Convert VADER score to label.
VaderSentimentExtractor.convert_scores_to_labels
VaderSentimentExtractor.convert_scores_to_labels (scores:list[float], label_mapping=None)
Convert VADER scores to labels.
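A hypothetical usage sketch of the two conversion helpers, assuming the signatures documented above and the default neutral_threshold:

```python
from textplumber.vader import VaderSentimentExtractor

# An extractor instance is constructed only to access the conversion helpers.
extractor = VaderSentimentExtractor()
print(extractor.convert_score_to_label(0.62))                 # expected: 'positive'
print(extractor.convert_scores_to_labels([0.62, -0.3, 0.0]))  # expected: positive, negative, neutral labels
```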
VaderSentimentExtractor.section_profile
VaderSentimentExtractor.section_profile (text)
Mean pooling of VADER scores across document sections.
VaderSentimentExtractor.profile
VaderSentimentExtractor.profile (text:str, doc_level_scores:dict)
Create a document profile with VADER scores, which makes use of document-level scores and sentence-level scores across the document.
| | Type | Details |
|---|---|---|
| text | str | the document text |
| doc_level_scores | dict | VADER scores for the document text |
| Returns | list | a document profile vector consisting of document-level scores and sentence-level scores across the document |
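A minimal sketch of extracting profiles via transform, which presumably calls profile internally (assuming the 0.0.9 API documented above):

```python
from textplumber.vader import VaderSentimentExtractor

texts = ["I loved the start. The middle dragged on for far too long. What a great ending!"]
profiler = VaderSentimentExtractor(output='profile')
profiler.fit(texts)                  # fit is a no-op (see VaderSentimentExtractor.fit above)
vectors = profiler.transform(texts)  # one fixed-length profile vector per document
```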
VaderSentimentExtractor.transform
VaderSentimentExtractor.transform (X)
Extracts the sentiment from the text using VADER.
VaderSentimentExtractor.get_feature_names_out
VaderSentimentExtractor.get_feature_names_out (input_features=None)
Get the feature names out from the model.
VaderSentimentExtractor.plot_sentiment_structure
VaderSentimentExtractor.plot_sentiment_structure (X:list[str], y:list, target_classes:list=None, target_names:list=None, n_sections:int=10, n_clusters:int=5, samples_per_cluster:int=5, renderer:str='svg')
Plot the sentiment structure of documents. For each class, cluster the documents by sentiment structure and plot up to samples_per_cluster samples per cluster. Adds space and labels between clusters, with a border around each cluster.
| | Type | Default | Details |
|---|---|---|---|
| X | list | | |
| y | list | | |
| target_classes | list | None | |
| target_names | list | None | |
| n_sections | int | 10 | number of chunks per document |
| n_clusters | int | 5 | number of clusters per class |
| samples_per_cluster | int | 5 | |
| renderer | str | svg | ‘svg’ or ‘png’ |
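A hypothetical usage sketch based on the signature above; the texts and labels are illustrative placeholders, and a real corpus would be much larger (the cluster counts are reduced here to suit the tiny sample):

```python
from textplumber.vader import VaderSentimentExtractor

texts = ["Great start, but a weak finish.", "Awful from beginning to end.", "Loved every minute of it!"]
labels = [1, 0, 2]

extractor = VaderSentimentExtractor()
extractor.plot_sentiment_structure(texts, labels,
                                   target_classes=[0, 1, 2],
                                   target_names=['negative', 'neutral', 'positive'],
                                   n_clusters=1, samples_per_cluster=1)
```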
VaderSentimentEstimator
VaderSentimentEstimator (output:str='labels', neutral_threshold:float=0.05, label_mapping:dict|None=None)
Scikit-learn pipeline component to predict sentiment using VADER.
| | Type | Default | Details |
|---|---|---|---|
| output | str | labels | ‘polarity’ (VADER’s compound score) or ‘labels’ (positive, neutral, negative) |
| neutral_threshold | float | 0.05 | threshold for neutral sentiment (see the note for VaderSentimentExtractor) |
| label_mapping | dict \| None | None | mapping of the default labels to the desired labels: keys should be ‘positive’, ‘neutral’ and ‘negative’, values the desired labels (ignored if None) |
If `output` is set to 'labels' then `VaderSentimentEstimator` functions as a pseudo-classifier. With `output` set to 'polarity', it functions as a regressor or scorer.
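For example, a minimal sketch of the two modes:

```python
from textplumber.vader import VaderSentimentEstimator

clf = VaderSentimentEstimator(output='labels')    # pseudo-classifier: predict returns labels
reg = VaderSentimentEstimator(output='polarity')  # regressor/scorer: predict returns compound scores
```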
By default the VaderSentimentEstimator is set up to work with three classes (i.e. it returns ‘positive’, ‘neutral’ or ‘negative’). If you only have two classes (negative/positive), set neutral_threshold to 0 to remove the neutral class: even slightly negative or positive scores will then be assigned a non-neutral label. A polarity score of exactly 0 is assigned to positive in the two-class case, so the classes may be better interpreted as negative and not-negative.

```python
neutral_threshold = 0
```
You may also want or need to create a label mapping so that the correct class IDs are returned by the estimator. For example:

```python
label_mapping = {
    'positive': 1,
    'negative': 0
}
```
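Putting the two-class configuration together (a sketch using the parameters documented above):

```python
from textplumber.vader import VaderSentimentEstimator

binary_estimator = VaderSentimentEstimator(output='labels',
                                           neutral_threshold=0,
                                           label_mapping={'positive': 1, 'negative': 0})
```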
VaderSentimentEstimator.predict
VaderSentimentEstimator.predict (X)
Predict the sentiment of texts using VADER.
Example using VADER as a predictor
Here is an example demonstrating how to use VaderSentimentEstimator in a pipeline.
Here we load text samples from a sentiment dataset. The dataset has validation and test sets, but we are just working with the train split in this instance.
```python
dataset_name = 'cardiffnlp/tweet_eval'
dataset_dir = 'sentiment'
dataset = load_dataset(dataset_name, data_dir = dataset_dir, split='train')
```
We sample just 5000 rows for each class …
```python
label_column = 'label'
target_names = get_label_names(dataset, label_column)
target_classes = list(range(len(target_names)))
```
```python
# selecting 5000 per class here ...
df = dataset.to_pandas()
sampled_dfs = [
    group.sample(n=5000, random_state=42)
    for _, group in df.groupby('label')
]
df = pd.concat(sampled_dfs, ignore_index=True)
```
```python
X = df['text']
y = df[label_column]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
```
```python
pd.set_option('display.max_colwidth', 200)
df = pd.DataFrame({'text': X_train, 'label': y_train})
df['label_name'] = df['label'].apply(lambda x: target_names[x])
df.head(5)
```
| | text | label | label_name |
|---|---|---|---|
| 5497 | Harper walks. #Nats now have two men on with one out. Escobar will bat next. Still 8-7 Mets in the bottom of the 9th. | 1 | neutral |
| 12677 | May or may not be getting ready to watch the very first season EVER of Big Brother. @user I'm so ready to see what Julie looks like! | 2 | positive |
| 8037 | Patrick Leahy & Christian Bale together again. U.S. Senator to make his 2nd cameo in Batman movie in Dark Knight Rises | 1 | neutral |
| 6670 | "A smartphone review that the tech press needs to read twice, obviously great for consumers too: Moto G 3rd gen review | 1 | neutral |
| 79 | Armed with the tools of power & intimidation Harper is trying to steal our country. On October 19th, armed with pencils LET'S TAKE IT BACK. | 0 | negative |
Set up a label mapping so that the label values matching the dataset are returned, rather than the default ‘positive’, ‘neutral’, ‘negative’ labels returned by VaderSentimentEstimator.
```python
label_mapping = {
    'negative': 0,
    'neutral': 1,
    'positive': 2
}
```
Our pipeline only has one component! Notice that the label_mapping is passed to the estimator to ensure the labels returned are comparable to the dataset labels.
```python
pipeline = Pipeline([
    ('vader_estimator', VaderSentimentEstimator(output = 'labels', label_mapping = label_mapping)),
], verbose=True)
display(pipeline)
```
```
Pipeline(steps=[('vader_estimator',
                 VaderSentimentEstimator(label_mapping={'negative': 0, 'neutral': 1,
                                                        'positive': 2}))],
         verbose=True)
```
Note: VADER is based on a lexicon and heuristics that are independent of the training data, so the fit call below is not really required (it does nothing).
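To illustrate, predictions could in principle be obtained without fitting at all (a sketch only; the cell below keeps the fit call for scikit-learn convention):

```python
from textplumber.vader import VaderSentimentEstimator

est = VaderSentimentEstimator(output='labels',
                              label_mapping={'negative': 0, 'neutral': 1, 'positive': 2})
print(est.predict(["I love this!", "This is terrible."]))  # expected: [2, 0]
```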
```python
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```
```
[Pipeline] ... (step 1 of 1) Processing vader_estimator, total=   0.0s
```
Log some results for the final cell summary.
```python
dataset_descriptor = 'tweet_eval sentiment dataset, 5000 rows per class randomly sampled from train split (per class - 4000 train, 1000 test)'
experiment_descriptor = 'Assigning labels using VADER'
results = save_results('results_example_vader.csv', pipeline, experiment_descriptor, dataset_descriptor, y_test, y_pred, target_classes, target_names, classifier_step_name = 'vader_estimator')
```
Here are the results …
```python
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
```
```
              precision    recall  f1-score   support

    negative      0.657     0.591     0.622      1000
     neutral      0.533     0.392     0.452      1000
    positive      0.514     0.702     0.594      1000

    accuracy                          0.562      3000
   macro avg      0.568     0.562     0.556      3000
weighted avg      0.568     0.562     0.556      3000
```
Example using VADER as a feature extractor
Here is an example demonstrating how to use VaderSentimentExtractor in a pipeline. We will use the same dataset as above so we can compare results between predictions based on VADER and the predictions of a machine learning model that uses VADER statistics as features.
The VaderSentimentExtractor can return VADER’s compound polarity score (‘polarity’), proportions of positive/neutral/negative (‘proportions’), or all statistics (‘allstats’).
Note: the proportions returned by VADER do not make use of the rules that are reflected in the compound polarity score. See VADER’s GitHub documentation for more information.
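To see the raw statistics that ‘allstats’ draws on, VADER can be queried directly via the vaderSentiment package (a standalone illustration, independent of textplumber):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("VADER is smart, handsome, and funny!")
print(scores)  # a dict with 'neg', 'neu', 'pos' proportions and the rule-based 'compound' score
```

The pipeline below feeds these statistics to a logistic regression classifier via output='allstats'.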
```python
pipeline = Pipeline([
    ('vader_extractor', VaderSentimentExtractor(output = 'allstats')),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)
display(pipeline)
```
```
Pipeline(steps=[('vader_extractor', VaderSentimentExtractor(output='allstats')),
                ('classifier', LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)
```
```python
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```
```
[Pipeline] ... (step 1 of 2) Processing vader_extractor, total=   0.4s
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   0.0s
```
Logging results for summary …
```python
experiment_descriptor = 'Logistic Regression classifier using VADER features'
results = save_results('results_example_vader.csv', pipeline, experiment_descriptor, dataset_descriptor, y_test, y_pred, target_classes, target_names, classifier_step_name = 'classifier')
```
As might be expected, the performance of a classifier based on statistical output from VADER is similar to the results of the first experiment, which used VADER to assign the class …
```python
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
```
```
              precision    recall  f1-score   support

    negative      0.660     0.632     0.646      1000
     neutral      0.527     0.493     0.510      1000
    positive      0.570     0.631     0.599      1000

    accuracy                          0.585      3000
   macro avg      0.586     0.585     0.585      3000
weighted avg      0.586     0.585     0.585      3000
```
It is anticipated that the VaderSentimentExtractor component will be used alongside other features. Here is an example where the VADER statistics are supplemented with document embeddings.
Set up a feature store to save preprocessed texts …
```python
feature_store = TextFeatureStore('feature_store_example_vader.sqlite')
```
Augment the VADER features with embeddings …
```python
pipeline = Pipeline([
    ('preprocess', SpacyPreprocessor(feature_store=feature_store)),
    ('features', FeatureUnion([
        ('vader', VaderSentimentExtractor(output = 'allstats')),
        ('embeddings', Model2VecEmbedder(feature_store=feature_store)),
    ], verbose=True)),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55)),
], verbose=True)
display(pipeline)
```
```
Pipeline(steps=[('preprocess',
                 SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7fba26048d90>)),
                ('features',
                 FeatureUnion(transformer_list=[('vader', VaderSentimentExtractor(output='allstats')),
                                                ('embeddings',
                                                 Model2VecEmbedder(feature_store=<textplumber.store.TextFeatureStore object at 0x7fba26048d90>))],
                              verbose=True)),
                ('classifier', LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)
```
```python
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```
```
[Pipeline] ........ (step 1 of 3) Processing preprocess, total=  11.0s
[FeatureUnion] ......... (step 1 of 2) Processing vader, total=   0.4s
[FeatureUnion] .... (step 2 of 2) Processing embeddings, total=   0.5s
[Pipeline] .......... (step 2 of 3) Processing features, total=   0.8s
[Pipeline] ........ (step 3 of 3) Processing classifier, total=   8.1s
```
Logging results for summary …
```python
experiment_descriptor = 'Logistic Regression classifier using VADER features and Model2Vec embeddings'
results = save_results('results_example_vader.csv', pipeline, experiment_descriptor, dataset_descriptor, y_test, y_pred, target_classes, target_names, classifier_step_name = 'classifier')
```
```python
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
```
```
              precision    recall  f1-score   support

    negative      0.704     0.725     0.714      1000
     neutral      0.594     0.574     0.584      1000
    positive      0.663     0.666     0.665      1000

    accuracy                          0.655      3000
   macro avg      0.654     0.655     0.654      3000
weighted avg      0.654     0.655     0.654      3000
```
Do the VADER statistics contribute to the accuracy? In other words, could we remove the VADER statistics and get similar performance? This sanity check is run in the next cell, with results shown in the results summary.
```python
pipeline = Pipeline([
    ('preprocess', SpacyPreprocessor(feature_store=feature_store)),
    ('features', FeatureUnion([
        ('embeddings', Model2VecEmbedder(feature_store=feature_store)),
    ], verbose=True)),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55)),
], verbose=True)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```
```python
experiment_descriptor = 'Logistic Regression classifier using only Model2Vec embeddings (Sanity Check)'
results = save_results('results_example_vader.csv', pipeline, experiment_descriptor, dataset_descriptor, y_test, y_pred, target_classes, target_names, classifier_step_name = 'classifier')
```
```
[Pipeline] ........ (step 1 of 3) Processing preprocess, total=   0.4s
[FeatureUnion] .... (step 1 of 1) Processing embeddings, total=   0.5s
[Pipeline] .......... (step 2 of 3) Processing features, total=   0.5s
[Pipeline] ........ (step 3 of 3) Processing classifier, total=   5.5s
```
Results for the four experiments are shown below. The best model achieves an F1 score of 0.655. Although much better performance is possible for sentiment classification (and only a small amount of training data was used here), the model combining VADER statistics with embeddings outperformed:
- a model trained on embeddings alone
- a model trained on VADER’s statistical output
- the predictions of the VADER algorithm itself.
```python
fields = ['experiment', 'accuracy_f1', 'negative_f1', 'neutral_f1', 'positive_f1']
display(pd.read_csv('results_example_vader.csv').sort_values(by='accuracy_f1', ascending=False)[fields])
```
| | experiment | accuracy_f1 | negative_f1 | neutral_f1 | positive_f1 |
|---|---|---|---|---|---|
| 2 | Logistic Regression classifier using VADER features and Model2Vec embeddings | 0.655000 | 0.714286 | 0.583927 | 0.664671 |
| 3 | Logistic Regression classifier using only Model2Vec embeddings (Sanity Check) | 0.634333 | 0.680997 | 0.559836 | 0.659023 |
| 1 | Logistic Regression classifier using VADER features | 0.585333 | 0.645557 | 0.509561 | 0.598956 |
| 0 | Assigning labels using VADER | 0.561667 | 0.622105 | 0.451873 | 0.593658 |