lexicons

Extract features from texts based on lexicons.

source

LexiconCountVectorizer

 LexiconCountVectorizer (feature_store:textplumber.store.TextFeatureStore,
                         lexicons:dict, lowercase:bool=True)

A scikit-learn pipeline component to get document-level counts for one or more lexicons. This component should be used after the SpacyPreprocessor component and must share the same feature store.

feature_store (TextFeatureStore): the feature store to use - this should be the same feature store used in the SpacyPreprocessor component
lexicons (dict): the lexicons to use - a dictionary with the lexicon name as the key and the lexicon (a list of tokens to count) as the value
lowercase (bool, default True): whether to lowercase the tokens
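
For illustration, the lexicons argument is a plain dictionary mapping each lexicon name to its list of tokens to count. The names and tokens below are made up:

lexicons = {
    'positive_words': ['happy', 'hopeful', 'great'],
    'negative_words': ['sad', 'angry', 'terrible'],
}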

source

LexiconCountVectorizer.fit

 LexiconCountVectorizer.fit (X, y=None)

Fit the vectorizer to the tokens in the feature store.


source

LexiconCountVectorizer.transform

 LexiconCountVectorizer.transform (X)

Transform the texts to a matrix of counts.


source

LexiconCountVectorizer.get_feature_names_out

 LexiconCountVectorizer.get_feature_names_out (input_features=None)

Get the feature names out from the vectorizer.

Note: lexicon counts are not currently stored in the feature store. Caching of counts will be added in the future to avoid recomputing them. A feature store is still required, as it is the source of the tokens from which the counts are calculated.
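
For orientation, here is a minimal sketch of how these methods fit together outside a Pipeline. The lexicon names, tokens and feature store filename are made up, and it assumes (as in the pipeline examples below) that SpacyPreprocessor caches tokens in the feature store and passes the texts through to the next step.

from textplumber.store import TextFeatureStore
from textplumber.preprocess import SpacyPreprocessor
from textplumber.lexicons import LexiconCountVectorizer

texts = ['We are happy and hopeful.', 'This was a sad, angry speech.']

# Hypothetical feature store file and lexicons, for illustration only.
feature_store = TextFeatureStore('feature_store_sketch.sqlite')
lexicons = {
    'positive_words': ['happy', 'hopeful', 'great'],
    'negative_words': ['sad', 'angry', 'terrible'],
}

preprocessor = SpacyPreprocessor(feature_store=feature_store)
vectorizer = LexiconCountVectorizer(feature_store=feature_store, lexicons=lexicons)

# SpacyPreprocessor must run first so the tokens are available in the feature store.
texts = preprocessor.fit_transform(texts)
counts = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())  # the lexicon names
print(counts)  # one row per text, one column per lexicon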

Available lexicons


source

get_empath_lexicons

 get_empath_lexicons (save_to:str|None='lexicons_empath.txt')

Get the Empath lexicons from the Empath GitHub repo.

save_to (str | None, default 'lexicons_empath.txt'): where to save the file; None will not save
Returns (dict): a dictionary with the name of each Empath category as the key and the lexicon (the corresponding list of tokens to count) as the value

The Empath library provides word lists corresponding to a number of lexical categories. These can be loaded using get_empath_lexicons to extract lexicon-based features from texts. A full pipeline example appears below. Here is a preview of the data …

empath_lexicons = get_empath_lexicons()
print('Empath lexicon examples: ')
for key in ['love', 'hate', 'help', 'sleep']:
    print(f'{key}: {empath_lexicons[key][0:5]} (First 5 tokens of {len(empath_lexicons[key])})')

print()
print('Available lexicons: ')
for i, key in enumerate(empath_lexicons.keys()):
    if i % 8 == 0:
        print()
    print(f'{key}, ', end='')
Empath lexicon examples: 
love: ['love', 'indulge', 'closeness', 'yearn', 'love'] (First 5 tokens of 86)
hate: ['hate', 'despise', 'vindictive', 'infuriate', 'sexist'] (First 5 tokens of 102)
help: ['help', 'chore', 'responsible', 'help', 'grateful'] (First 5 tokens of 60)
sleep: ['sleep', 'sleepy', 'rest', 'sleep', 'bedroom'] (First 5 tokens of 53)

Available lexicons: 

help, office, dance, money, wedding, domestic_work, sleep, medical_emergency, 
cold, hate, cheerfulness, aggression, occupation, envy, anticipation, family, 
vacation, crime, attractive, masculine, prison, health, pride, dispute, 
nervousness, government, weakness, horror, swearing_terms, leisure, suffering, royalty, 
wealthy, tourism, furniture, school, magic, beach, journalism, morning, 
banking, social_media, exercise, night, kill, blue_collar_job, art, ridicule, 
play, computer, college, optimism, stealing, real_estate, home, divine, 
sexual, fear, irritability, superhero, business, driving, pet, childish, 
cooking, exasperation, religion, hipster, internet, surprise, reading, worship, 
leader, independence, movement, body, noise, eating, medieval, zest, 
confusion, water, sports, death, healing, legend, heroic, celebration, 
restaurant, violence, programming, dominant_heirarchical, military, neglect, swimming, exotic, 
love, hiking, communication, hearing, order, sympathy, hygiene, weather, 
anonymity, trust, ancient, deception, fabric, air_travel, fight, dominant_personality, 
music, vehicle, politeness, toy, farming, meeting, war, speaking, 
listen, urban, shopping, disgust, fire, tool, phone, gain, 
sound, injury, sailing, rage, science, work, appearance, valuable, 
warmth, youth, sadness, fun, emotional, joy, affection, traveling, 
fashion, ugliness, lust, shame, torment, economics, anger, politics, 
ship, clothing, car, strength, technology, breaking, shape_and_size, power, 
white_collar_job, animal, party, terrorism, smell, disappointment, poor, plant, 
pain, beauty, timidity, philosophy, negotiate, negative_emotion, cleaning, messaging, 
competing, law, friends, payment, achievement, alcohol, liquid, feminine, 
weapon, children, monster, ocean, giving, contentment, writing, rural, 
positive_emotion, musical, 

source

get_sentiment_lexicons

 get_sentiment_lexicons (save_to:str|None='lexicons_sentiment.txt')
save_to (str | None, default 'lexicons_sentiment.txt'): where to save the file; None will not save

The get_sentiment_lexicons function can be used to download positive and negative lexicons derived from the Vader sentiment lexicon. Vader’s valence scores are based on human raters, and the Vader lexicon already excludes words with strong disagreement between raters. The get_sentiment_lexicons function additionally filters out words where raters gave both positive and negative valence scores. See the function code for details.

Note: when these lexicons are used with the LexiconCountVectorizer, the features returned are simple counts; the vectorizer does not apply Vader’s rules or valence scores.
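
The sign-agreement filter described above could be approximated as follows. This is a rough sketch, not textplumber’s actual implementation (which may filter further). It assumes a local copy of the Vader lexicon file, vader_lexicon.txt, whose tab-separated columns are the token, the mean valence, the standard deviation, and the list of individual rater scores.

import ast

positive, negative = [], []
with open('vader_lexicon.txt', encoding='utf-8') as f:
    for line in f:
        if not line.strip():
            continue
        token, mean_valence, _stddev, ratings = line.rstrip('\n').split('\t')
        ratings = ast.literal_eval(ratings)  # e.g. [-1, -1, -2, -1, ...]
        # drop words where raters disagreed on polarity (both positive and negative scores given)
        if any(r > 0 for r in ratings) and any(r < 0 for r in ratings):
            continue
        if float(mean_valence) > 0:
            positive.append(token)
        elif float(mean_valence) < 0:
            negative.append(token)

sentiment_lexicons = {'positive': positive, 'negative': negative}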

Here is a preview of the positive and negative lexicons …

sentiment_lexicons = get_sentiment_lexicons()
print('Sentiment lexicon examples: ')
for key in ['positive', 'negative']:
    print(f'{key}:')
    print(f'{sentiment_lexicons[key][0:5]}')
    print(f'(First 5 tokens of {len(sentiment_lexicons[key])})')
Sentiment lexicon examples: 
positive:
['absolved', 'absolving', 'accept', 'acceptable', 'accepted']
(First 5 tokens of 858)
negative:
['abandon', 'abandoned', 'abandoning', 'abducted', 'abhor']
(First 5 tokens of 1136)

Example

Here is an example …

from textplumber.preprocess import SpacyPreprocessor
from textplumber.lexicons import LexiconCountVectorizer, get_empath_lexicons, get_sentiment_lexicons
from textplumber.store import TextFeatureStore
from textplumber.core import get_example_data
from textplumber.report import plot_confusion_matrix, plot_logistic_regression_features_from_pipeline

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

Here we load text samples from Barack Obama and Donald Trump available in the AuthorMix dataset.

X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data(label_column = 'style', target_labels = ['obama', 'trump'])

Create a feature store …

feature_store = TextFeatureStore('feature_store_example_lexicon.sqlite')

How accurately can we predict Obama or Trump using positive and negative sentiment lexicons (from get_sentiment_lexicons)?

sentiment_lexicons = get_sentiment_lexicons()

A SpacyPreprocessor component must be included in the pipeline prior to the LexiconCountVectorizer.

pipeline = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store)),
    ('lexicons', LexiconCountVectorizer(feature_store=feature_store, lexicons=sentiment_lexicons)),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)

display(pipeline)
Pipeline(steps=[('preprocessor',
                 SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7fd08bb1cb10>)),
                ('lexicons',
                 LexiconCountVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7fd08bb1cb10>,
                                        lexicons={'negative': ['abandon',
                                                               'abandoned',
                                                               'abandoning',
                                                               'abducted',
                                                               'abhor',
                                                               'abhorred',
                                                               'abhorrent',
                                                               'abuse',
                                                               'abused',
                                                               '...
                                                               'accomplish',
                                                               'accomplished',
                                                               'achievable',
                                                               'acquitting',
                                                               'active',
                                                               'adequate',
                                                               'admirable',
                                                               'admire',
                                                               'admired',
                                                               'admiring',
                                                               'admit',
                                                               'admitted',
                                                               'adopt',
                                                               'adorable',
                                                               'adore',
                                                               'adored',
                                                               'adoring',
                                                               'adorn',
                                                               'adorning',
                                                               'advanced',
                                                               'advantage',
                                                               'advantaged',
                                                               'advantageous',
                                                               'advantaging', ...]})),
                ('classifier',
                 LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
[Pipeline] ...... (step 1 of 3) Processing preprocessor, total=  11.1s
[Pipeline] .......... (step 2 of 3) Processing lexicons, total=   0.7s
[Pipeline] ........ (step 3 of 3) Processing classifier, total=   0.0s
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
              precision    recall  f1-score   support

       obama      0.571     0.487     0.526       273
       trump      0.620     0.695     0.655       328

    accuracy                          0.601       601
   macro avg      0.595     0.591     0.590       601
weighted avg      0.597     0.601     0.596       601

How does this compare to a model using features based on Empath’s categories (using get_empath_lexicons)?

empath_lexicons = get_empath_lexicons()

Train a model using features based on counts for each Empath category.

pipeline = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store)),
    ('lexicons', LexiconCountVectorizer(feature_store=feature_store, lexicons=empath_lexicons)),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)

display(pipeline)
Pipeline(steps=[('preprocessor',
                 SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7fd08bb1cb10>)),
                ('lexicons',
                 LexiconCountVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7fd08bb1cb10>,
                                        lexicons={'achievement': ['achievement',
                                                                  'milestone',
                                                                  'scoreboard',
                                                                  'competitive',
                                                                  'renown',
                                                                  'surpass',
                                                                  'scoring',
                                                                  'winner',
                                                                  '...
                                                                    'decode',
                                                                    'clarification',
                                                                    'explain',
                                                                    'confide',
                                                                    'confirmation',
                                                                    'read',
                                                                    'introduce',
                                                                    'mention',
                                                                    'report',
                                                                    'discuss',
                                                                    'texts',
                                                                    'list',
                                                                    'email',
                                                                    'handwritten',
                                                                    'wrote',
                                                                    'respond',
                                                                    'communication',
                                                                    'idea',
                                                                    'convey',
                                                                    'writing',
                                                                    'informed',
                                                                    'exchange',
                                                                    'communicate',
                                                                    'socialize',
                                                                    'mobile', ...], ...})),
                ('classifier',
                 LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
[Pipeline] ...... (step 1 of 3) Processing preprocessor, total=   0.1s
[Pipeline] .......... (step 2 of 3) Processing lexicons, total=   4.0s
[Pipeline] ........ (step 3 of 3) Processing classifier, total=   0.1s
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
              precision    recall  f1-score   support

       obama      0.699     0.722     0.710       273
       trump      0.762     0.741     0.751       328

    accuracy                          0.732       601
   macro avg      0.730     0.731     0.731       601
weighted avg      0.733     0.732     0.732       601

Taking a closer look at discriminative features …

plot_logistic_regression_features_from_pipeline(pipeline, target_classes, target_names, top_n=20, classifier_step_name = 'classifier', features_step_name = 'lexicons')
Feature Log Odds (Logit) Odds Ratio
141 joy 1.297911 3.661641
58 irritability -1.270415 0.280715
78 medieval -1.177288 0.308113
49 computer -1.085358 0.337781
47 ridicule -1.020392 0.360453
130 sailing 0.982274 2.670523
81 water -0.959476 0.383094
45 blue_collar_job 0.869111 2.384791
99 hearing 0.832905 2.299991
148 torment 0.823888 2.279345
77 eating -0.807135 0.446134
6 sleep -0.778338 0.459169
109 air_travel -0.773281 0.461496
103 weather -0.765938 0.464898
80 confusion -0.763801 0.465892
56 sexual -0.693234 0.499956
131 rage 0.678534 1.970985
48 play 0.676723 1.967420
101 sympathy -0.665973 0.513773
190 writing 0.664452 1.943425
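
The Odds Ratio column is simply the exponential of the Log Odds (Logit) column, which can be checked directly, e.g. for the 'joy' row:

import math
print(math.exp(1.297911))  # ≈ 3.6616, the odds ratio reported for 'joy'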