```python
from datasets import load_dataset
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
import pandas as pd

from textplumber.report import plot_confusion_matrix, get_label_names, save_results, plot_logistic_regression_features_from_pipeline
from textplumber.vader import VaderSentimentEstimator, VaderSentimentExtractor
from textplumber.preprocess import SpacyPreprocessor
from textplumber.embeddings import Model2VecEmbedder
from textplumber.store import TextFeatureStore  # used below to store preprocessed texts
from textplumber.core import get_stop_words
```
vader
Textplumber implements feature extraction using VADER, “a lexicon and rule-based sentiment analysis tool”. VADER’s GitHub repository and Hutto and Gilbert’s 2014 paper introducing VADER are the best explanations of how VADER works and how its lexicon and rules were derived.
This functionality is not available in the latest version on PyPI (0.0.8); it will be released as part of version 0.0.9.
VaderSentimentExtractor
VaderSentimentExtractor (feature_store:textplumber.store.TextFeatureStore=None, output:str='polarity', neutral_threshold:float=0.05, profile_first_n:int=3, profile_last_n:int=3, profile_sample_n:int=4, profile_min_sentence_chars:int=10, profile_sections:int=10)
Scikit-learn pipeline component to extract sentiment features using VADER.
| | Type | Default | Details |
|---|---|---|---|
| feature_store | TextFeatureStore | None | not currently implemented |
| output | str | polarity | ‘polarity’ (VADER’s compound score), ‘proportions’ (the proportions of the text that are positive, neutral or negative), ‘allstats’ (equivalent to ‘polarity’ + ‘proportions’), ‘labels’ (positive, neutral, negative), or ‘profile’ (a document sentiment profile vector of document-level and sentence-level features, preserving their order in the document) |
| neutral_threshold | float | 0.05 | threshold for neutral sentiment |
| profile_first_n | int | 3 | number of sentences at the start of the document to profile |
| profile_last_n | int | 3 | number of sentences at the end of the document to profile |
| profile_sample_n | int | 4 | number of sentences to sample from the body after the first and last sentences are removed |
| profile_min_sentence_chars | int | 10 | minimum number of characters for a body sentence to be included in the profile |
| profile_sections | int | 10 | number of sections to split the document into for profiling |
By default neutral_threshold is set to 0.05, which means that any text with a polarity greater than -0.05 and less than 0.05 will be ‘neutral’. The 0.05 default follows the recommendation on the VADER GitHub page, but it can be tuned as needed.
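To make the thresholding concrete, here is a minimal illustrative sketch of the mapping (this is not textplumber's implementation, which is provided by convert_score_to_label below):

```python
# Illustrative only: how a compound polarity score maps to a label
# under the default neutral_threshold of 0.05.
def to_label(compound: float, neutral_threshold: float = 0.05) -> str:
    if compound >= neutral_threshold:
        return 'positive'
    if compound <= -neutral_threshold:
        return 'negative'
    return 'neutral'

print(to_label(0.03))   # 'neutral'
print(to_label(-0.42))  # 'negative'
```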
VaderSentimentExtractor.fit
VaderSentimentExtractor.fit (X, y=None)
Fit is implemented, but does nothing.
VaderSentimentExtractor.convert_score_to_label
VaderSentimentExtractor.convert_score_to_label (score:float, label_mapping=None)
Convert VADER score to label.
VaderSentimentExtractor.convert_scores_to_labels
VaderSentimentExtractor.convert_scores_to_labels (scores:list[float], label_mapping=None)
Convert VADER scores to labels.
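A hypothetical usage sketch of the two conversion helpers, assuming the signatures documented above and the default neutral_threshold:

```python
from textplumber.vader import VaderSentimentExtractor

# An extractor instance is constructed only to access the conversion helpers.
extractor = VaderSentimentExtractor()
print(extractor.convert_score_to_label(0.62))                 # expected: 'positive'
print(extractor.convert_scores_to_labels([0.62, -0.3, 0.0]))  # expected: positive, negative, neutral labels
```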
VaderSentimentExtractor.section_profile
VaderSentimentExtractor.section_profile (text)
Mean pooling of VADER scores across document sections.
VaderSentimentExtractor.profile
VaderSentimentExtractor.profile (text:str, doc_level_scores:dict)
Create a document profile with VADER scores, which makes use of document-level scores and sentence-level scores across the document.
| | Type | Details |
|---|---|---|
| text | str | the document text |
| doc_level_scores | dict | VADER scores for the document text |
| Returns | list | a document profile vector consisting of document-level scores and sentence-level scores across the document |
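A minimal sketch of extracting profiles via transform, which presumably calls profile internally (assuming the 0.0.9 API documented above):

```python
from textplumber.vader import VaderSentimentExtractor

texts = ["I loved the start. The middle dragged on for far too long. What a great ending!"]
profiler = VaderSentimentExtractor(output='profile')
profiler.fit(texts)                  # fit is a no-op (see VaderSentimentExtractor.fit above)
vectors = profiler.transform(texts)  # one fixed-length profile vector per document
```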
VaderSentimentExtractor.transform
VaderSentimentExtractor.transform (X)
Extracts the sentiment from the text using VADER.
VaderSentimentExtractor.get_feature_names_out
VaderSentimentExtractor.get_feature_names_out (input_features=None)
Get the feature names out from the model.
VaderSentimentExtractor.plot_sentiment_structure
VaderSentimentExtractor.plot_sentiment_structure (X:list[str], y:list, target_classes:list=None, target_names:list=None, n_sections:int=10, n_clusters:int=5, samples_per_cluster:int=5, renderer:str='svg')
Plot the sentiment structure of documents. For each class, cluster the documents by sentiment structure and plot up to samples_per_cluster samples per cluster. Adds space and labels between clusters, with a border around each cluster.
| | Type | Default | Details |
|---|---|---|---|
| X | list | | |
| y | list | | |
| target_classes | list | None | |
| target_names | list | None | |
| n_sections | int | 10 | number of chunks per document |
| n_clusters | int | 5 | number of clusters per class |
| samples_per_cluster | int | 5 | |
| renderer | str | svg | ‘svg’ or ‘png’ |
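A hypothetical usage sketch based on the signature above; the texts and labels are illustrative placeholders, and a real corpus would be much larger (the cluster counts are reduced here to suit the tiny sample):

```python
from textplumber.vader import VaderSentimentExtractor

texts = ["Great start, but a weak finish.", "Awful from beginning to end.", "Loved every minute of it!"]
labels = [1, 0, 2]

extractor = VaderSentimentExtractor()
extractor.plot_sentiment_structure(texts, labels,
                                   target_classes=[0, 1, 2],
                                   target_names=['negative', 'neutral', 'positive'],
                                   n_clusters=1, samples_per_cluster=1)
```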
VaderSentimentEstimator
VaderSentimentEstimator (output:str='labels', neutral_threshold:float=0.05, label_mapping:dict|None=None)
Scikit-learn pipeline component to predict sentiment using VADER.
| | Type | Default | Details |
|---|---|---|---|
| output | str | labels | ‘polarity’ (VADER’s compound score) or ‘labels’ (positive, neutral, negative) |
| neutral_threshold | float | 0.05 | threshold for neutral sentiment (see the note for VaderSentimentExtractor) |
| label_mapping | dict \| None | None | mapping of the default labels to the desired labels: keys should be ‘positive’, ‘neutral’ and ‘negative’, values the desired labels (ignored if None) |
If `output` is set to 'labels' then `VaderSentimentEstimator` functions as a pseudo-classifier. With `output` set to 'polarity', it functions as a regressor or scorer.
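For example, a minimal sketch of the two modes:

```python
from textplumber.vader import VaderSentimentEstimator

clf = VaderSentimentEstimator(output='labels')    # pseudo-classifier: predict returns labels
reg = VaderSentimentEstimator(output='polarity')  # regressor/scorer: predict returns compound scores
```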
By default the VaderSentimentEstimator is set up to work with three classes (i.e. it returns ‘positive’, ‘neutral’ or ‘negative’). If you only have two classes (negative/positive), set neutral_threshold to 0 to remove the neutral class: even slightly negative or positive scores will then be assigned a non-neutral label. A polarity score of exactly 0 is assigned to positive in the two-class case, so the classes may be better interpreted as negative and not-negative.

```python
neutral_threshold = 0
```
You may also want or need to create a label mapping so that the correct class IDs are returned by the estimator. For example:

```python
label_mapping = {
    'positive': 1,
    'negative': 0
}
```
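Putting the two-class configuration together (a sketch using the parameters documented above):

```python
from textplumber.vader import VaderSentimentEstimator

binary_estimator = VaderSentimentEstimator(output='labels',
                                           neutral_threshold=0,
                                           label_mapping={'positive': 1, 'negative': 0})
```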
VaderSentimentEstimator.predict
VaderSentimentEstimator.predict (X)
Predict the sentiment of texts using VADER.
Example using VADER as a predictor
Here is an example demonstrating how to use VaderSentimentEstimator in a pipeline.
Here we load text samples from a sentiment dataset. The dataset has validation and test sets, but we are just working with the train split in this instance.
```python
dataset_name = 'cardiffnlp/tweet_eval'
dataset_dir = 'sentiment'
dataset = load_dataset(dataset_name, data_dir = dataset_dir, split='train')
```
We sample just 5000 rows for each class …
```python
label_column = 'label'
target_names = get_label_names(dataset, label_column)
target_classes = list(range(len(target_names)))
```
```python
# selecting 5000 per class here ...
df = dataset.to_pandas()
sampled_dfs = [
    group.sample(n=5000, random_state=42)
    for _, group in df.groupby('label')
]
df = pd.concat(sampled_dfs, ignore_index=True)
```
```python
X = df['text']
y = df[label_column]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
```
```python
pd.set_option('display.max_colwidth', 200)
df = pd.DataFrame({'text': X_train, 'label': y_train})
df['label_name'] = df['label'].apply(lambda x: target_names[x])
df.head(5)
```
| | text | label | label_name |
|---|---|---|---|
| 5497 | Harper walks. #Nats now have two men on with one out. Escobar will bat next. Still 8-7 Mets in the bottom of the 9th. | 1 | neutral |
| 12677 | May or may not be getting ready to watch the very first season EVER of Big Brother. @user I'm so ready to see what Julie looks like! | 2 | positive |
| 8037 | Patrick Leahy & Christian Bale together again. U.S. Senator to make his 2nd cameo in Batman movie in Dark Knight Rises | 1 | neutral |
| 6670 | "A smartphone review that the tech press needs to read twice, obviously great for consumers too: Moto G 3rd gen review | 1 | neutral |
| 79 | Armed with the tools of power & intimidation Harper is trying to steal our country. On October 19th, armed with pencils LET'S TAKE IT BACK. | 0 | negative |
Set up a label mapping so that the label values matching the dataset are returned, rather than the default ‘positive’, ‘neutral’, ‘negative’ labels returned by VaderSentimentEstimator.
```python
label_mapping = {
    'negative': 0,
    'neutral': 1,
    'positive': 2
}
```
Our pipeline only has one component! Notice that the label_mapping is passed to the estimator to ensure the labels returned are comparable to the dataset labels.
```python
pipeline = Pipeline([
    ('vader_estimator', VaderSentimentEstimator(output = 'labels', label_mapping = label_mapping)),
], verbose=True)
display(pipeline)
```
```
Pipeline(steps=[('vader_estimator',
                 VaderSentimentEstimator(label_mapping={'negative': 0, 'neutral': 1,
                                                        'positive': 2}))],
         verbose=True)
```
Note: VADER is based on a lexicon and heuristics that are independent of the training data, so the fit call below is not really required (it does nothing).
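To illustrate, predictions could in principle be obtained without fitting at all (a sketch only; the cell below keeps the fit call for scikit-learn convention):

```python
from textplumber.vader import VaderSentimentEstimator

est = VaderSentimentEstimator(output='labels',
                              label_mapping={'negative': 0, 'neutral': 1, 'positive': 2})
print(est.predict(["I love this!", "This is terrible."]))  # expected: [2, 0]
```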
```python
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```
```
[Pipeline] ... (step 1 of 1) Processing vader_estimator, total=   0.0s
```
Log some results for the final cell summary.
```python
dataset_descriptor = 'tweet_eval sentiment dataset, 5000 rows per class randomly sampled from train split (per class - 4000 train, 1000 test)'
experiment_descriptor = 'Assigning labels using VADER'
results = save_results('results_example_vader.csv', pipeline, experiment_descriptor, dataset_descriptor, y_test, y_pred, target_classes, target_names, classifier_step_name = 'vader_estimator')
```
Here are the results …
```python
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
```
```
              precision    recall  f1-score   support

    negative      0.657     0.591     0.622      1000
     neutral      0.533     0.392     0.452      1000
    positive      0.514     0.702     0.594      1000

    accuracy                          0.562      3000
   macro avg      0.568     0.562     0.556      3000
weighted avg      0.568     0.562     0.556      3000
```
Example using VADER as a feature extractor
Here is an example demonstrating how to use VaderSentimentExtractor in a pipeline. We will use the same dataset as above so we can compare results between predictions based on VADER and the predictions of a machine learning model that uses VADER statistics as features.
The VaderSentimentExtractor can return VADER’s compound polarity score (‘polarity’), proportions of positive/neutral/negative (‘proportions’), or all statistics (‘allstats’).
Note: the proportions returned by VADER do not make use of the rules that are reflected in the compound polarity score. See VADER’s GitHub documentation for more information.
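To see the raw statistics that ‘allstats’ draws on, VADER can be queried directly via the vaderSentiment package (a standalone illustration, independent of textplumber):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("VADER is smart, handsome, and funny!")
print(scores)  # a dict with 'neg', 'neu', 'pos' proportions and the rule-based 'compound' score
```

The pipeline below feeds these statistics to a logistic regression classifier via output='allstats'.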
```python
pipeline = Pipeline([
    ('vader_extractor', VaderSentimentExtractor(output = 'allstats')),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)
display(pipeline)
```
```
Pipeline(steps=[('vader_extractor', VaderSentimentExtractor(output='allstats')),
                ('classifier', LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)
```
```python
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```
```
[Pipeline] ... (step 1 of 2) Processing vader_extractor, total=   0.4s
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   0.0s
```
Logging results for summary …
```python
experiment_descriptor = 'Logistic Regression classifier using VADER features'
results = save_results('results_example_vader.csv', pipeline, experiment_descriptor, dataset_descriptor, y_test, y_pred, target_classes, target_names, classifier_step_name = 'classifier')
```
As might be expected, the performance of a classifier based on statistical output from VADER is similar to the results of the first experiment, which used VADER to assign the class …
```python
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
```
```
              precision    recall  f1-score   support

    negative      0.660     0.632     0.646      1000
     neutral      0.527     0.493     0.510      1000
    positive      0.570     0.631     0.599      1000

    accuracy                          0.585      3000
   macro avg      0.586     0.585     0.585      3000
weighted avg      0.586     0.585     0.585      3000
```
It is anticipated that the VaderSentimentExtractor component will be used alongside other features. Here is an example where the VADER statistics are supplemented with document embeddings.
Set up a feature store to save preprocessed texts …
```python
feature_store = TextFeatureStore('feature_store_example_vader.sqlite')
```
Augment the VADER features with embeddings …
```python
pipeline = Pipeline([
    ('preprocess', SpacyPreprocessor(feature_store=feature_store)),
    ('features', FeatureUnion([
        ('vader', VaderSentimentExtractor(output = 'allstats')),
        ('embeddings', Model2VecEmbedder(feature_store=feature_store)),
    ], verbose=True)),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55)),
], verbose=True)
display(pipeline)
```
```
Pipeline(steps=[('preprocess',
                 SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7fba26048d90>)),
                ('features',
                 FeatureUnion(transformer_list=[('vader', VaderSentimentExtractor(output='allstats')),
                                                ('embeddings',
                                                 Model2VecEmbedder(feature_store=<textplumber.store.TextFeatureStore object at 0x7fba26048d90>))],
                              verbose=True)),
                ('classifier', LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)
```
```python
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```
```
[Pipeline] ........ (step 1 of 3) Processing preprocess, total=  11.0s
[FeatureUnion] ......... (step 1 of 2) Processing vader, total=   0.4s
[FeatureUnion] .... (step 2 of 2) Processing embeddings, total=   0.5s
[Pipeline] .......... (step 2 of 3) Processing features, total=   0.8s
[Pipeline] ........ (step 3 of 3) Processing classifier, total=   8.1s
```
Logging results for summary …
```python
experiment_descriptor = 'Logistic Regression classifier using VADER features and Model2Vec embeddings'
results = save_results('results_example_vader.csv', pipeline, experiment_descriptor, dataset_descriptor, y_test, y_pred, target_classes, target_names, classifier_step_name = 'classifier')
```
```python
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
```
```
              precision    recall  f1-score   support

    negative      0.704     0.725     0.714      1000
     neutral      0.594     0.574     0.584      1000
    positive      0.663     0.666     0.665      1000

    accuracy                          0.655      3000
   macro avg      0.654     0.655     0.654      3000
weighted avg      0.654     0.655     0.654      3000
```
Do the VADER statistics contribute to the accuracy? In other words, could we remove the VADER statistics and get similar performance? This sanity check is run in the next cell, with results shown in the results summary.
```python
pipeline = Pipeline([
    ('preprocess', SpacyPreprocessor(feature_store=feature_store)),
    ('features', FeatureUnion([
        ('embeddings', Model2VecEmbedder(feature_store=feature_store)),
    ], verbose=True)),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55)),
], verbose=True)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```
```python
experiment_descriptor = 'Logistic Regression classifier using only Model2Vec embeddings (Sanity Check)'
results = save_results('results_example_vader.csv', pipeline, experiment_descriptor, dataset_descriptor, y_test, y_pred, target_classes, target_names, classifier_step_name = 'classifier')
```
```
[Pipeline] ........ (step 1 of 3) Processing preprocess, total=   0.4s
[FeatureUnion] .... (step 1 of 1) Processing embeddings, total=   0.5s
[Pipeline] .......... (step 2 of 3) Processing features, total=   0.5s
[Pipeline] ........ (step 3 of 3) Processing classifier, total=   5.5s
```
Results for the four experiments are shown below. The best model achieves an F1 score of 0.655. Although much better performance is possible for sentiment classification (and only a small amount of training data was used here), the model combining VADER statistics with embeddings outperformed:
- a model trained on embeddings alone
- a model trained on VADER’s statistical output
- the predictions of the VADER algorithm itself.
```python
fields = ['experiment', 'accuracy_f1', 'negative_f1', 'neutral_f1', 'positive_f1']
display(pd.read_csv('results_example_vader.csv').sort_values(by='accuracy_f1', ascending=False)[fields])
```
| | experiment | accuracy_f1 | negative_f1 | neutral_f1 | positive_f1 |
|---|---|---|---|---|---|
| 2 | Logistic Regression classifier using VADER features and Model2Vec embeddings | 0.655000 | 0.714286 | 0.583927 | 0.664671 |
| 3 | Logistic Regression classifier using only Model2Vec embeddings (Sanity Check) | 0.634333 | 0.680997 | 0.559836 | 0.659023 |
| 1 | Logistic Regression classifier using VADER features | 0.585333 | 0.645557 | 0.509561 | 0.598956 |
| 0 | Assigning labels using VADER | 0.561667 | 0.622105 | 0.451873 | 0.593658 |