lexicons

Extract features from texts based on lexicons.

source

LexiconCountVectorizer

 LexiconCountVectorizer (feature_store:textplumber.store.TextFeatureStore,
                         lexicons:dict, lowercase:bool=True)

A scikit-learn pipeline component to get document-level counts for one or more lexicons. This component should be used after the SpacyPreprocessor component and must share the same feature store.

feature_store (TextFeatureStore): the feature store to use - this should be the same feature store used in the SpacyPreprocessor component
lexicons (dict): the lexicons to use - a dictionary with the lexicon name as the key and the lexicon (a list of tokens to count) as the value
lowercase (bool, default True): whether to lowercase the tokens
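
For illustration, the lexicons argument is a plain dictionary mapping each lexicon name to its list of tokens to count. The names and tokens below are made up:

lexicons = {
    'positive_words': ['happy', 'hopeful', 'great'],
    'negative_words': ['sad', 'angry', 'terrible'],
}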

source

LexiconCountVectorizer.fit

 LexiconCountVectorizer.fit (X, y=None)

Fit the vectorizer to the tokens in the feature store.


source

LexiconCountVectorizer.transform

 LexiconCountVectorizer.transform (X)

Transform the texts to a matrix of counts.


source

LexiconCountVectorizer.get_feature_names_out

 LexiconCountVectorizer.get_feature_names_out (input_features=None)

Get the feature names out from the vectorizer.

Note: lexicon counts are not currently stored in the feature store. Caching of counts will be added in the future to avoid recomputing them. A feature store is still required, as it is the source of the tokens from which the counts are calculated.
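
For orientation, here is a minimal sketch of how these methods fit together outside a Pipeline. The lexicon names, tokens and feature store filename are made up, and it assumes (as in the pipeline examples below) that SpacyPreprocessor caches tokens in the feature store and passes the texts through to the next step.

from textplumber.store import TextFeatureStore
from textplumber.preprocess import SpacyPreprocessor
from textplumber.lexicons import LexiconCountVectorizer

texts = ['We are happy and hopeful.', 'This was a sad, angry speech.']

# Hypothetical feature store file and lexicons, for illustration only.
feature_store = TextFeatureStore('feature_store_sketch.sqlite')
lexicons = {
    'positive_words': ['happy', 'hopeful', 'great'],
    'negative_words': ['sad', 'angry', 'terrible'],
}

preprocessor = SpacyPreprocessor(feature_store=feature_store)
vectorizer = LexiconCountVectorizer(feature_store=feature_store, lexicons=lexicons)

# SpacyPreprocessor must run first so the tokens are available in the feature store.
texts = preprocessor.fit_transform(texts)
counts = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())  # the lexicon names
print(counts)  # one row per text, one column per lexicon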

Available lexicons


source

get_empath_lexicons

 get_empath_lexicons (save_to:str|None='lexicons_empath.txt')

Get the Empath lexicons from the Empath GitHub repo.

save_to (str | None, default 'lexicons_empath.txt'): where to save the file; None will not save
Returns (dict): a dictionary with the name of each Empath category as the key and the lexicon (the corresponding list of tokens to count) as the value

The Empath library provides word lists corresponding to a number of lexical categories. These can be loaded using get_empath_lexicons to extract lexicon-based features from texts. A full pipeline example appears below. Here is a preview of the data …

empath_lexicons = get_empath_lexicons()
print('Empath lexicon examples: ')
for key in ['love', 'hate', 'help', 'sleep']:
    print(f'{key}: {empath_lexicons[key][0:5]} (First 5 tokens of {len(empath_lexicons[key])})')

print()
print('Available lexicons: ')
for i, key in enumerate(empath_lexicons.keys()):
    if i % 8 == 0:
        print()
    print(f'{key}, ', end='')
Empath lexicon examples: 
love: ['love', 'indulge', 'closeness', 'yearn', 'love'] (First 5 tokens of 86)
hate: ['hate', 'despise', 'vindictive', 'infuriate', 'sexist'] (First 5 tokens of 102)
help: ['help', 'chore', 'responsible', 'help', 'grateful'] (First 5 tokens of 60)
sleep: ['sleep', 'sleepy', 'rest', 'sleep', 'bedroom'] (First 5 tokens of 53)

Available lexicons: 

help, office, dance, money, wedding, domestic_work, sleep, medical_emergency, 
cold, hate, cheerfulness, aggression, occupation, envy, anticipation, family, 
vacation, crime, attractive, masculine, prison, health, pride, dispute, 
nervousness, government, weakness, horror, swearing_terms, leisure, suffering, royalty, 
wealthy, tourism, furniture, school, magic, beach, journalism, morning, 
banking, social_media, exercise, night, kill, blue_collar_job, art, ridicule, 
play, computer, college, optimism, stealing, real_estate, home, divine, 
sexual, fear, irritability, superhero, business, driving, pet, childish, 
cooking, exasperation, religion, hipster, internet, surprise, reading, worship, 
leader, independence, movement, body, noise, eating, medieval, zest, 
confusion, water, sports, death, healing, legend, heroic, celebration, 
restaurant, violence, programming, dominant_heirarchical, military, neglect, swimming, exotic, 
love, hiking, communication, hearing, order, sympathy, hygiene, weather, 
anonymity, trust, ancient, deception, fabric, air_travel, fight, dominant_personality, 
music, vehicle, politeness, toy, farming, meeting, war, speaking, 
listen, urban, shopping, disgust, fire, tool, phone, gain, 
sound, injury, sailing, rage, science, work, appearance, valuable, 
warmth, youth, sadness, fun, emotional, joy, affection, traveling, 
fashion, ugliness, lust, shame, torment, economics, anger, politics, 
ship, clothing, car, strength, technology, breaking, shape_and_size, power, 
white_collar_job, animal, party, terrorism, smell, disappointment, poor, plant, 
pain, beauty, timidity, philosophy, negotiate, negative_emotion, cleaning, messaging, 
competing, law, friends, payment, achievement, alcohol, liquid, feminine, 
weapon, children, monster, ocean, giving, contentment, writing, rural, 
positive_emotion, musical, 

source

get_sentiment_lexicons

 get_sentiment_lexicons (save_to:str|None='lexicons_sentiment.txt')
save_to (str | None, default 'lexicons_sentiment.txt'): where to save the file; None will not save

The get_sentiment_lexicons function can be used to download positive and negative lexicons derived from the Vader sentiment lexicon. Vader’s valence scores are based on human raters, and the Vader lexicon already excludes words with strong disagreement between raters. The get_sentiment_lexicons function additionally filters out words where raters gave both positive and negative valence scores. See the function code for details.

Note: when these lexicons are used with the LexiconCountVectorizer, the features returned are simple counts; the vectorizer does not apply Vader’s rules or valence scores.
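
The sign-agreement filter described above could be approximated as follows. This is a rough sketch, not textplumber’s actual implementation (which may filter further). It assumes a local copy of the Vader lexicon file, vader_lexicon.txt, whose tab-separated columns are the token, the mean valence, the standard deviation, and the list of individual rater scores.

import ast

positive, negative = [], []
with open('vader_lexicon.txt', encoding='utf-8') as f:
    for line in f:
        if not line.strip():
            continue
        token, mean_valence, _stddev, ratings = line.rstrip('\n').split('\t')
        ratings = ast.literal_eval(ratings)  # e.g. [-1, -1, -2, -1, ...]
        # drop words where raters disagreed on polarity (both positive and negative scores given)
        if any(r > 0 for r in ratings) and any(r < 0 for r in ratings):
            continue
        if float(mean_valence) > 0:
            positive.append(token)
        elif float(mean_valence) < 0:
            negative.append(token)

sentiment_lexicons = {'positive': positive, 'negative': negative}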

Here is a preview of the positive and negative lexicons …

sentiment_lexicons = get_sentiment_lexicons()
print('Sentiment lexicon examples: ')
for key in ['positive', 'negative']:
    print(f'{key}:')
    print(f'{sentiment_lexicons[key][0:5]}')
    print(f'(First 5 tokens of {len(sentiment_lexicons[key])})')
Sentiment lexicon examples: 
positive:
['absolved', 'absolving', 'accept', 'acceptable', 'accepted']
(First 5 tokens of 858)
negative:
['abandon', 'abandoned', 'abandoning', 'abducted', 'abhor']
(First 5 tokens of 1136)

Example

Here is an example …

from textplumber.preprocess import SpacyPreprocessor
from textplumber.lexicons import LexiconCountVectorizer, get_empath_lexicons, get_sentiment_lexicons
from textplumber.store import TextFeatureStore
from textplumber.core import get_example_data
from textplumber.report import plot_confusion_matrix, plot_logistic_regression_features_from_pipeline

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

Here we load text samples from Barack Obama and Donald Trump available in the AuthorMix dataset.

X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data(label_column = 'style', target_labels = ['obama', 'trump'])

Create a feature store …

feature_store = TextFeatureStore('feature_store_example_lexicon.sqlite')

How accurately can we predict Obama or Trump using positive and negative sentiment lexicons (from get_sentiment_lexicons)?

sentiment_lexicons = get_sentiment_lexicons()

A SpacyPreprocessor component must be included in the pipeline prior to the LexiconCountVectorizer.

pipeline = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store)),
    ('lexicons', LexiconCountVectorizer(feature_store=feature_store, lexicons=sentiment_lexicons)),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)

display(pipeline)
Pipeline(steps=[('preprocessor',
                 SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7fd08bb1cb10>)),
                ('lexicons',
                 LexiconCountVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7fd08bb1cb10>,
                                        lexicons={'negative': ['abandon',
                                                               'abandoned',
                                                               'abandoning',
                                                               'abducted',
                                                               'abhor',
                                                               'abhorred',
                                                               'abhorrent',
                                                               'abuse',
                                                               'abused',
                                                               '...
                                                               'accomplish',
                                                               'accomplished',
                                                               'achievable',
                                                               'acquitting',
                                                               'active',
                                                               'adequate',
                                                               'admirable',
                                                               'admire',
                                                               'admired',
                                                               'admiring',
                                                               'admit',
                                                               'admitted',
                                                               'adopt',
                                                               'adorable',
                                                               'adore',
                                                               'adored',
                                                               'adoring',
                                                               'adorn',
                                                               'adorning',
                                                               'advanced',
                                                               'advantage',
                                                               'advantaged',
                                                               'advantageous',
                                                               'advantaging', ...]})),
                ('classifier',
                 LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
[Pipeline] ...... (step 1 of 3) Processing preprocessor, total=  11.1s
[Pipeline] .......... (step 2 of 3) Processing lexicons, total=   0.7s
[Pipeline] ........ (step 3 of 3) Processing classifier, total=   0.0s
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
              precision    recall  f1-score   support

       obama      0.571     0.487     0.526       273
       trump      0.620     0.695     0.655       328

    accuracy                          0.601       601
   macro avg      0.595     0.591     0.590       601
weighted avg      0.597     0.601     0.596       601

How does this compare to a model using features based on Empath’s categories (using get_empath_lexicons)?

empath_lexicons = get_empath_lexicons()

Train a model using features based on counts for each Empath category.

pipeline = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store)),
    ('lexicons', LexiconCountVectorizer(feature_store=feature_store, lexicons=empath_lexicons)),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)

display(pipeline)
Pipeline(steps=[('preprocessor',
                 SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7fd08bb1cb10>)),
                ('lexicons',
                 LexiconCountVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7fd08bb1cb10>,
                                        lexicons={'achievement': ['achievement',
                                                                  'milestone',
                                                                  'scoreboard',
                                                                  'competitive',
                                                                  'renown',
                                                                  'surpass',
                                                                  'scoring',
                                                                  'winner',
                                                                  '...
                                                                    'decode',
                                                                    'clarification',
                                                                    'explain',
                                                                    'confide',
                                                                    'confirmation',
                                                                    'read',
                                                                    'introduce',
                                                                    'mention',
                                                                    'report',
                                                                    'discuss',
                                                                    'texts',
                                                                    'list',
                                                                    'email',
                                                                    'handwritten',
                                                                    'wrote',
                                                                    'respond',
                                                                    'communication',
                                                                    'idea',
                                                                    'convey',
                                                                    'writing',
                                                                    'informed',
                                                                    'exchange',
                                                                    'communicate',
                                                                    'socialize',
                                                                    'mobile', ...], ...})),
                ('classifier',
                 LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
[Pipeline] ...... (step 1 of 3) Processing preprocessor, total=   0.1s
[Pipeline] .......... (step 2 of 3) Processing lexicons, total=   4.0s
[Pipeline] ........ (step 3 of 3) Processing classifier, total=   0.1s
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
              precision    recall  f1-score   support

       obama      0.699     0.722     0.710       273
       trump      0.762     0.741     0.751       328

    accuracy                          0.732       601
   macro avg      0.730     0.731     0.731       601
weighted avg      0.733     0.732     0.732       601

Taking a closer look at discriminative features …

plot_logistic_regression_features_from_pipeline(pipeline, target_classes, target_names, top_n=20, classifier_step_name = 'classifier', features_step_name = 'lexicons')
Feature Log Odds (Logit) Odds Ratio
141 joy 1.297911 3.661641
58 irritability -1.270415 0.280715
78 medieval -1.177288 0.308113
49 computer -1.085358 0.337781
47 ridicule -1.020392 0.360453
130 sailing 0.982274 2.670523
81 water -0.959476 0.383094
45 blue_collar_job 0.869111 2.384791
99 hearing 0.832905 2.299991
148 torment 0.823888 2.279345
77 eating -0.807135 0.446134
6 sleep -0.778338 0.459169
109 air_travel -0.773281 0.461496
103 weather -0.765938 0.464898
80 confusion -0.763801 0.465892
56 sexual -0.693234 0.499956
131 rage 0.678534 1.970985
48 play 0.676723 1.967420
101 sympathy -0.665973 0.513773
190 writing 0.664452 1.943425
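
The Odds Ratio column is simply the exponential of the Log Odds (Logit) column, which can be checked directly, e.g. for the 'joy' row:

import math
print(math.exp(1.297911))  # ≈ 3.6616, the odds ratio reported for 'joy'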