vader

Sentiment estimator and feature extractor using VADER.

Textplumber implements a sentiment estimator and feature extractors using VADER, “a lexicon and rule-based sentiment analysis tool”. VADER’s GitHub repository and Hutto and Gilbert’s 2014 paper are the best explanations of VADER and of how the VADER lexicon and rules were derived.

This functionality requires Textplumber version 0.0.9 or above.

Note: Two experimental components are provided to extract sentiment scores across a document or to extract POS ngrams tagged with sentiment. These experimental components are likely to change.


source

VaderSentimentExtractor

 VaderSentimentExtractor (feature_store:textplumber.store.TextFeatureStore=None,
                          output:str='polarity', neutral_threshold:float=0.05)

Scikit-learn pipeline component to extract sentiment features using VADER.

Type Default Details
feature_store TextFeatureStore None (not implemented currently)
output str polarity ‘polarity’ (VADER’s compound score), ‘proportions’ (proportions of the text that are positive, neutral or negative), ‘allstats’ (equivalent to ‘polarity’ + ‘proportions’), or ‘labels’ (positive, neutral, negative)
neutral_threshold float 0.05 threshold for neutral sentiment

By default neutral_threshold is set to 0.05, meaning any text with polarity greater than -0.05 and less than 0.05 will be labelled ‘neutral’. The default of 0.05 follows the recommendation on the VADER GitHub page, but it can be tuned as needed.
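
For example, a minimal sketch (the example texts are illustrative; transform accepts raw texts, as in the pipelines below):

from textplumber.vader import VaderSentimentExtractor

extractor = VaderSentimentExtractor(output = 'labels', neutral_threshold = 0.05)
labels = extractor.transform([
    'I love this!',    # clearly positive
    'It is a table.',  # polarity close to 0, so labelled 'neutral'
    'I hate this!',    # clearly negative
])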


source

VaderSentimentExtractor.fit

 VaderSentimentExtractor.fit (X, y=None)

Fit is implemented, but does nothing.


source

VaderSentimentExtractor.convert_score_to_label

 VaderSentimentExtractor.convert_score_to_label (score:float,
                                                 label_mapping=None)

Convert VADER score to label.
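
The conversion follows the neutral_threshold rule described above. A hypothetical equivalent, for illustration only (not the library’s implementation):

def score_to_label(score, neutral_threshold = 0.05):
    # Scores at or above the threshold are positive, scores at or below
    # the negated threshold are negative, anything in between is neutral.
    if score >= neutral_threshold:
        return 'positive'
    if score <= -neutral_threshold:
        return 'negative'
    return 'neutral'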


source

VaderSentimentExtractor.convert_scores_to_labels

 VaderSentimentExtractor.convert_scores_to_labels (scores:list[float],
                                                   label_mapping=None)

Convert a list of VADER scores to labels.


source

VaderSentimentExtractor.transform

 VaderSentimentExtractor.transform (X)

Extracts the sentiment from the text using VADER.


source

VaderSentimentExtractor.get_feature_names_out

 VaderSentimentExtractor.get_feature_names_out (input_features=None)

Get the feature names out from the model.


source

VaderSentimentEstimator

 VaderSentimentEstimator (output:str='labels',
                          neutral_threshold:float=0.05,
                          label_mapping:dict|None=None)

Scikit-learn pipeline component to predict sentiment using VADER.

Type Default Details
output str labels ‘polarity’ (VADER’s compound score) or ‘labels’ (positive, neutral, negative)
neutral_threshold float 0.05 threshold for neutral sentiment (see note for VaderSentimentExtractor)
label_mapping dict | None None mapping of VADER labels to desired labels (ignored if None) - keys should be ‘positive’, ‘neutral’ and ‘negative’, values the desired labels

If output is set to ‘labels’ then VaderSentimentEstimator functions as a pseudo-classifier. With output set to ‘polarity’, it functions as a regressor or scorer.
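
As a minimal sketch of the scorer behaviour (the example texts are illustrative; as noted later, VADER is lexicon-based so no fit is needed before predict):

from textplumber.vader import VaderSentimentEstimator

scorer = VaderSentimentEstimator(output = 'polarity')
scores = scorer.predict(['Great movie!', 'Terrible movie.'])  # compound scores between -1 and 1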

By default the VaderSentimentEstimator is set up to work with three classes (i.e. it returns ‘positive’, ‘neutral’ or ‘negative’). If you only have two classes (negative/positive), set the neutral_threshold to 0 to remove the neutral class. Even slightly negative or positive scores will then be assigned a non-neutral label, and a polarity score of exactly 0 will be assigned to the positive class. The classes might be better interpreted as negative and not-negative in this case.

neutral_threshold = 0

You may also want/need to create a label mapping so that the correct class IDs are returned by the estimator.

For example …

label_mapping = {
    'positive': 1,
    'negative': 0
}
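
Putting the two settings together (this mirrors the two-class pipeline used later on this page):

estimator = VaderSentimentEstimator(output = 'labels', neutral_threshold = 0, label_mapping = label_mapping)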

source

VaderSentimentEstimator.predict

 VaderSentimentEstimator.predict (X)

Predict the sentiment of texts using VADER.

Example using VADER as a predictor

Here is an example demonstrating how to use VaderSentimentEstimator in a pipeline.

from textplumber.report import plot_confusion_matrix, get_label_names, save_results, plot_logistic_regression_features_from_pipeline
from textplumber.vader import VaderSentimentEstimator, VaderSentimentExtractor
from textplumber.preprocess import SpacyPreprocessor
from textplumber.embeddings import Model2VecEmbedder
from textplumber.core import get_stop_words
from textplumber.store import TextFeatureStore

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

from datasets import load_dataset
from sklearn.model_selection import train_test_split

import pandas as pd

Here we load text samples from a sentiment dataset. The dataset has validation and test sets, but we are just working with the train split in this instance.

dataset_name = 'cardiffnlp/tweet_eval'
dataset_dir = 'sentiment'
dataset = load_dataset(dataset_name, data_dir = dataset_dir, split='train')

Sampling only 5000 texts for each class …

label_column = 'label'
target_names = get_label_names(dataset, label_column)
target_classes = list(range(len(target_names)))

# selecting 5000 per class here ...
df = dataset.to_pandas()
sampled_dfs = [
    group.sample(n=5000, random_state=42)
    for _, group in df.groupby('label')
]
df = pd.concat(sampled_dfs, ignore_index=True)

X = df['text']
y = df[label_column]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
pd.set_option('display.max_colwidth', 200)
df = pd.DataFrame({'text': X_train, 'label': y_train})
df['label_name'] = df['label'].apply(lambda x: target_names[x])
df.head(5)
text label label_name
5497 Harper walks. #Nats now have two men on with one out. Escobar will bat next. Still 8-7 Mets in the bottom of the 9th. 1 neutral
12677 May or may not be getting ready to watch the very first season EVER of Big Brother. @user I'm so ready to see what Julie looks like! 2 positive
8037 Patrick Leahy & Christian Bale together again. U.S. Senator to make his 2nd cameo in Batman movie in Dark Knight Rises 1 neutral
6670 "A smartphone review that the tech press needs to read twice, obviously great for consumers too: Moto G 3rd gen review 1 neutral
79 Armed with the tools of power & intimidation Harper is trying to steal our country. On October 19th, armed with pencils LET'S TAKE IT BACK. 0 negative

Setup a label mapping so that the label values that match the dataset are returned rather than the default ‘positive’, ‘neutral’, ‘negative’ labels returned by VaderSentimentEstimator.

label_mapping = {
    'negative': 0,
    'neutral': 1, 
    'positive': 2
}

Our pipeline has only one component! Notice that the label_mapping is passed to the estimator so that it returns labels comparable with the dataset labels.

pipeline = Pipeline([
    ('vader_estimator', VaderSentimentEstimator(output = 'labels', label_mapping = label_mapping)),
], verbose=True)

display(pipeline)
Pipeline(steps=[('vader_estimator',
                 VaderSentimentEstimator(label_mapping={'negative': 0,
                                                        'neutral': 1,
                                                        'positive': 2}))],
         verbose=True)

Note: VADER is based on a lexicon and heuristics that are independent of the training data. The fit call is not really required below (and it does nothing).

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
[Pipeline] ... (step 1 of 1) Processing vader_estimator, total=   0.0s

Log some results for the final cell summary.

dataset_descriptor = 'tweet_eval sentiment dataset, 5000 rows per class randomly sampled from train split (per class - 4000 train, 1000 test)'
experiment_descriptor = 'Assigning labels using VADER'
results = save_results('results_example_vader.csv', pipeline, experiment_descriptor, dataset_descriptor, y_test, y_pred, target_classes, target_names, classifier_step_name = 'vader_estimator')

Here are the results …

print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
              precision    recall  f1-score   support

    negative      0.657     0.591     0.622      1000
     neutral      0.533     0.392     0.452      1000
    positive      0.514     0.702     0.594      1000

    accuracy                          0.562      3000
   macro avg      0.568     0.562     0.556      3000
weighted avg      0.568     0.562     0.556      3000

Example using VADER as a feature extractor

Here is an example demonstrating how to use VaderSentimentExtractor in a pipeline. We will use the same dataset as above so we can compare results between predictions based on VADER and the predictions of a machine learning model that uses VADER statistics as features.

The VaderSentimentExtractor can return VADER’s compound polarity score (‘polarity’), proportions of positive/neutral/negative (‘proportions’), or all statistics (‘allstats’).

Note: the proportions returned by VADER do not make use of the rules that are reflected in the compound polarity score. See VADER’s GitHub documentation for more information.
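
A quick way to inspect the columns each setting produces is via get_feature_names_out (a sketch; the exact feature names are whatever the library defines):

for output in ['polarity', 'proportions', 'allstats']:
    extractor = VaderSentimentExtractor(output = output)
    extractor.fit(X_train)  # fit does nothing, but keeps the sklearn API happy
    print(output, extractor.get_feature_names_out())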

pipeline = Pipeline([
    ('vader_extractor', VaderSentimentExtractor(output = 'allstats')),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)

display(pipeline)
Pipeline(steps=[('vader_extractor', VaderSentimentExtractor(output='allstats')),
                ('classifier',
                 LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
[Pipeline] ... (step 1 of 2) Processing vader_extractor, total=   0.4s
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   0.0s

Logging results for summary …

experiment_descriptor = 'Logistic Regression classifier using VADER features'
results = save_results('results_example_vader.csv', pipeline, experiment_descriptor, dataset_descriptor, y_test, y_pred, target_classes, target_names, classifier_step_name = 'classifier')

As might be expected, the performance of a classifier based on statistical output from VADER is similar to the results of the first experiment, which used VADER to assign the class …

print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
              precision    recall  f1-score   support

    negative      0.660     0.632     0.646      1000
     neutral      0.527     0.493     0.510      1000
    positive      0.570     0.631     0.599      1000

    accuracy                          0.585      3000
   macro avg      0.586     0.585     0.585      3000
weighted avg      0.586     0.585     0.585      3000

It is anticipated that the VaderSentimentExtractor component will be used alongside other features. Here is an example where the VADER statistics are supplemented by other features.

Set up a feature store to save preprocessed texts …

feature_store = TextFeatureStore('feature_store_example_vader.sqlite')

Augment the VADER features with embeddings …

pipeline = Pipeline([
        ('preprocess', SpacyPreprocessor(feature_store=feature_store)),
        ('features', FeatureUnion([
                ('vader', VaderSentimentExtractor(output = 'allstats')),
                ('embeddings', Model2VecEmbedder(feature_store=feature_store)),
        ], verbose=True)),
        ('classifier', LogisticRegression(max_iter = 5000, random_state=55)),
], verbose=True)

display(pipeline)
Pipeline(steps=[('preprocess',
                 SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7f44dd61b450>)),
                ('features',
                 FeatureUnion(transformer_list=[('vader',
                                                 VaderSentimentExtractor(output='allstats')),
                                                ('embeddings',
                                                 Model2VecEmbedder(feature_store=<textplumber.store.TextFeatureStore object at 0x7f44dd61b450>))],
                              verbose=True)),
                ('classifier',
                 LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
[Pipeline] ........ (step 1 of 3) Processing preprocess, total=  12.1s
[FeatureUnion] ......... (step 1 of 2) Processing vader, total=   0.4s
[FeatureUnion] .... (step 2 of 2) Processing embeddings, total=   0.5s
[Pipeline] .......... (step 2 of 3) Processing features, total=   0.8s
[Pipeline] ........ (step 3 of 3) Processing classifier, total=   8.6s

Logging results for summary …

experiment_descriptor = 'Logistic Regression classifier using VADER features and Model2Vec embeddings'
results = save_results('results_example_vader.csv', pipeline, experiment_descriptor, dataset_descriptor, y_test, y_pred, target_classes, target_names, classifier_step_name = 'classifier')
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
              precision    recall  f1-score   support

    negative      0.704     0.725     0.714      1000
     neutral      0.594     0.574     0.584      1000
    positive      0.663     0.666     0.665      1000

    accuracy                          0.655      3000
   macro avg      0.654     0.655     0.654      3000
weighted avg      0.654     0.655     0.654      3000

Do the VADER statistics contribute to the accuracy? In other words, could we remove the VADER statistics and still get similar performance? This sanity check is run in the next cell, with results shown in the results summary.

pipeline = Pipeline([
        ('preprocess', SpacyPreprocessor(feature_store=feature_store)),
        ('features', FeatureUnion([
                ('embeddings', Model2VecEmbedder(feature_store=feature_store)),
        ], verbose=True)),
        ('classifier', LogisticRegression(max_iter = 5000, random_state=55)),
], verbose=True)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

experiment_descriptor = 'Logistic Regression classifier using only Model2Vec embeddings (Sanity Check)'
results = save_results('results_example_vader.csv', pipeline, experiment_descriptor, dataset_descriptor, y_test, y_pred, target_classes, target_names, classifier_step_name = 'classifier')
[Pipeline] ........ (step 1 of 3) Processing preprocess, total=   0.5s
[FeatureUnion] .... (step 1 of 1) Processing embeddings, total=   0.5s
[Pipeline] .......... (step 2 of 3) Processing features, total=   0.5s
[Pipeline] ........ (step 3 of 3) Processing classifier, total=   5.0s

Results for the four experiments are shown below. The best model achieves an F1 score of 0.655. Although much better performance is possible for sentiment classification, and only a small amount of training data was used here, the model with VADER statistics and embeddings outperformed:

  • a model trained on embeddings alone
  • a model trained on VADER’s statistical output
  • the predictions of the VADER algorithm itself.

fields = ['experiment', 'accuracy_f1', 'negative_f1', 'neutral_f1', 'positive_f1']
display(pd.read_csv('results_example_vader.csv').sort_values(by='accuracy_f1', ascending=False)[fields])
experiment accuracy_f1 negative_f1 neutral_f1 positive_f1
2 Logistic Regression classifier using VADER features and Model2Vec embeddings 0.655000 0.714286 0.583927 0.664671
3 Logistic Regression classifier using only Model2Vec embeddings (Sanity Check) 0.634333 0.680997 0.559836 0.659023
1 Logistic Regression classifier using VADER features 0.585333 0.645557 0.509561 0.598956
0 Assigning labels using VADER 0.561667 0.622105 0.451873 0.593658

Experimental: Extract sentiment profile features (VADER scores across a document)


source

VaderSentimentProfileExtractor

 VaderSentimentProfileExtractor (feature_store:textplumber.store.TextFeatureStore=None,
                                 output:str='profile', profile_first_n:int=3,
                                 profile_last_n:int=3, profile_sample_n:int=4,
                                 profile_min_sentence_chars:int=10,
                                 profile_sections:int=10)

Scikit-learn pipeline component to extract document-level sentiment profiles using VADER, combining document-level scores with sentence-level scores ordered as they occur in the document. (This class is experimental and there may be breaking changes in the future.)

Type Default Details
feature_store TextFeatureStore None (not implemented currently)
output str profile ‘profile’ (a document sentiment profile vector) - the possible values (‘profile’, ‘profilesections’, ‘profileallstats’, ‘profileonly’) are likely to change
profile_first_n int 3 number of sentences at start of doc to profile
profile_last_n int 3 number of sentences at end of doc to profile
profile_sample_n int 4 number of sentences to sample from doc sentences after first and last removed
profile_min_sentence_chars int 10 minimum number of characters in body sentences to be included in the profile
profile_sections int 10 number of sections to split the document into for profiling
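
The parameters above control how the profile vector is built. A construction sketch (the values are illustrative, not recommendations):

profile_extractor = VaderSentimentProfileExtractor(
    output = 'profile',
    profile_first_n = 5,              # profile the first 5 sentences
    profile_last_n = 5,               # and the last 5 sentences
    profile_sample_n = 6,             # sample 6 sentences from the remaining body
    profile_min_sentence_chars = 10,  # skip very short body sentences
    profile_sections = 10,            # mean-pool VADER scores across 10 sections
)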

source

VaderSentimentProfileExtractor.fit

 VaderSentimentProfileExtractor.fit (X, y=None)

Fit is implemented, but does nothing.


source

VaderSentimentProfileExtractor.section_profile

 VaderSentimentProfileExtractor.section_profile (text)

Mean pooling of VADER scores across document sections.


source

VaderSentimentProfileExtractor.profile

 VaderSentimentProfileExtractor.profile (text:str, doc_level_scores:dict)

Create a document profile with VADER scores, making use of document-level scores and sentence-level scores across the document.

Type Details
text str the document text
doc_level_scores dict VADER scores for document text
Returns list a document profile vector consisting of the document level scores and sentence-level scores across the document
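
A sketch of calling profile directly, assuming doc_level_scores is the dict returned by the vaderSentiment package’s SentimentIntensityAnalyzer (an assumption - in a pipeline this is handled internally):

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
text = 'The opening is bleak. The middle drags on. The ending is wonderful.'
doc_level_scores = analyzer.polarity_scores(text)  # keys: 'neg', 'neu', 'pos', 'compound'
profile_vector = VaderSentimentProfileExtractor().profile(text, doc_level_scores)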

source

VaderSentimentProfileExtractor.transform

 VaderSentimentProfileExtractor.transform (X)

Extracts the sentiment from the text using VADER.


source

VaderSentimentProfileExtractor.get_feature_names_out

 VaderSentimentProfileExtractor.get_feature_names_out (input_features=None)

Get the feature names out from the model.


source

VaderSentimentProfileExtractor.plot_sentiment_structure

 VaderSentimentProfileExtractor.plot_sentiment_structure (X:list[str], y:list,
                                                          target_classes:list=None,
                                                          target_names:list=None,
                                                          n_sections:int=10,
                                                          n_clusters:int=5,
                                                          samples_per_cluster:int=5,
                                                          renderer:str='svg')

Plot the sentiment structure of documents. For each class, cluster the documents by sentiment structure and plot up to 5 samples per cluster. Adds space and labels between clusters, with a border around each cluster. (Experimental feature; likely to change in future.)

Type Default Details
X list
y list
target_classes list None
target_names list None
n_sections int 10 Number of chunks per document
n_clusters int 5 Number of clusters per class
samples_per_cluster int 5
renderer str svg ‘svg’ or ‘png’

Loading a dataset of longer texts to demonstrate …

dataset_name = 'polsci/sentiment-polarity-dataset-v2.0'
dataset_dir = ''
dataset = load_dataset(dataset_name, data_dir = dataset_dir, split='train')
label_column = 'label'
target_names = get_label_names(dataset, label_column)
target_classes = list(range(len(target_names)))

df = dataset.to_pandas()
X = df['text']
y = df[label_column]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
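
plot_sentiment_structure (documented above) can be used to visualise the sentiment structure of each class before modelling. A sketch with illustrative settings:

VaderSentimentProfileExtractor().plot_sentiment_structure(
    list(X_train), list(y_train),
    target_classes = target_classes, target_names = target_names,
    n_sections = 10, n_clusters = 5, samples_per_cluster = 5, renderer = 'svg')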

Below, the VaderSentimentProfileExtractor is compared with whole-text VADER scoring (using VaderSentimentEstimator) and with a classifier that uses whole-text VADER statistics as features (using VaderSentimentExtractor). The profile-based features outperform both approaches.

pipeline = Pipeline([
        ('classifier', VaderSentimentEstimator(output = 'labels', neutral_threshold = 0, label_mapping = {'positive': 1, 'negative': 0})),
], verbose=True)

display(pipeline)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)

pipeline = Pipeline([
    ('vader_extractor', VaderSentimentExtractor(output = 'allstats')),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)

display(pipeline)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)

pipeline = Pipeline([
    ('vader_extractor', VaderSentimentProfileExtractor()),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)

display(pipeline)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
Pipeline(steps=[('classifier',
                 VaderSentimentEstimator(label_mapping={'negative': 0,
                                                        'positive': 1},
                                         neutral_threshold=0))],
         verbose=True)
[Pipeline] ........ (step 1 of 1) Processing classifier, total=   0.0s
              precision    recall  f1-score   support

         neg      0.727     0.465     0.567       200
         pos      0.607     0.825     0.699       200

    accuracy                          0.645       400
   macro avg      0.667     0.645     0.633       400
weighted avg      0.667     0.645     0.633       400
Pipeline(steps=[('vader_extractor', VaderSentimentExtractor(output='allstats')),
                ('classifier',
                 LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)
[Pipeline] ... (step 1 of 2) Processing vader_extractor, total=   9.7s
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   0.0s
              precision    recall  f1-score   support

         neg      0.720     0.515     0.601       200
         pos      0.623     0.800     0.700       200

    accuracy                          0.657       400
   macro avg      0.671     0.657     0.650       400
weighted avg      0.671     0.657     0.650       400
Pipeline(steps=[('vader_extractor', VaderSentimentProfileExtractor()),
                ('classifier',
                 LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)
[Pipeline] ... (step 1 of 2) Processing vader_extractor, total=  10.6s
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   0.0s
              precision    recall  f1-score   support

         neg      0.714     0.675     0.694       200
         pos      0.692     0.730     0.710       200

    accuracy                          0.703       400
   macro avg      0.703     0.703     0.702       400
weighted avg      0.703     0.703     0.702       400

Experimental: Extract sentiment-tagged ngrams as features


source

VaderSentimentPOSNgramsExtractor

 VaderSentimentPOSNgramsExtractor (feature_store:textplumber.store.TextFeatureStore=None,
                                   output:str='sentimentposngrams',
                                   ngram_range:tuple=(2, 2))

Scikit-learn pipeline component to extract ngrams based on POS tags and sentiment from the VADER lexicon. (This class is experimental and there may be breaking changes in the future, including the possibility of complete removal.)

Type Default Details
feature_store TextFeatureStore None (not implemented currently)
output str sentimentposngrams ‘sentimentposngrams’ or ‘sentintensityposngrams’; this is experimental and likely to change
ngram_range tuple (2, 2) ngram range for POS ngrams
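
A usage sketch mirroring the earlier pipelines (the settings are illustrative; Pipeline and LogisticRegression are as imported earlier):

from textplumber.vader import VaderSentimentPOSNgramsExtractor

pipeline = Pipeline([
    ('sentiment_pos_ngrams', VaderSentimentPOSNgramsExtractor(ngram_range = (2, 2))),
    ('classifier', LogisticRegression(max_iter = 5000, random_state = 55)),
], verbose=True)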

source

VaderSentimentPOSNgramsExtractor.convert_score_to_token_label

 VaderSentimentPOSNgramsExtractor.convert_score_to_token_label (score:float)

Convert VADER score to a token label (experimental).


source

VaderSentimentPOSNgramsExtractor.get_sentiment_pos_ngrams

 VaderSentimentPOSNgramsExtractor.get_sentiment_pos_ngrams (text)

Get ngrams of POS features and lexicon+pos features (experimental).


source

VaderSentimentPOSNgramsExtractor.fit

 VaderSentimentPOSNgramsExtractor.fit (X, y=None)

Fit derives all ngrams.


source

VaderSentimentPOSNgramsExtractor.transform

 VaderSentimentPOSNgramsExtractor.transform (X)

Transform into sentiment ngrams.


source

VaderSentimentPOSNgramsExtractor.get_feature_names_out

 VaderSentimentPOSNgramsExtractor.get_feature_names_out (input_features=None)

Get the feature names out from the model.