report

Functions to help understand text data and evaluate text classification models.

The reporting functionality is in active development. Documentation is also in development, but should reflect the latest version.



preview_dataset

 preview_dataset (dataset:datasets.arrow_dataset.Dataset|datasets.dataset_dict.DatasetDict)

Output information about a Huggingface dataset.

Type Details
dataset datasets.arrow_dataset.Dataset | datasets.dataset_dict.DatasetDict A Huggingface dataset or dataset dict, typically the result of load_dataset()

The preview_dataset function outputs information about a dataset: its splits, a summary of the field types in each split, and the count of unique values for each field. Expand a split to see more detail. The notices shown after the dataset summary identify probable text and label columns, and flag label columns that should be cast as ClassLabel.

dataset = load_dataset('hallisky/AuthorMix')
preview_dataset(dataset)
Split: train (14579 samples)

Available fields: style, text, category

  • Field 'style' has 14 unique values
    Value(dtype='string', id=None)
  • Field 'text' has 14489 unique values
    Value(dtype='string', id=None)
  • Field 'category' has 4 unique values
    Value(dtype='string', id=None)
Split: validation (3642 samples)

Available fields: style, text, category

  • Field 'style' has 14 unique values
    Value(dtype='string', id=None)
  • Field 'text' has 3636 unique values
    Value(dtype='string', id=None)
  • Field 'category' has 4 unique values
    Value(dtype='string', id=None)
Split: test (4747 samples)

Available fields: style, text, category

  • Field 'style' has 14 unique values
    Value(dtype='string', id=None)
  • Field 'text' has 4729 unique values
    Value(dtype='string', id=None)
  • Field 'category' has 4 unique values
    Value(dtype='string', id=None)
Warnings/Notices
  • Field 'style' appears to be a label column and should probably be cast as ClassLabel with cast_column_to_label(dataset, 'style').
    • Unique counts are identical (14)
    • Unique counts are a low proportion of total rows.
    • Unique values are identical between splits: ('blog11518', 'blog25872', 'blog30102', 'blog30407', 'blog5546', 'bush', 'fitzgerald', 'h', 'hemingway', 'obama', 'pp', 'qq', 'trump', 'woolf')
  • Field 'text' appears to be a text column.
  • Field 'category' appears to be a label column and should probably be cast as ClassLabel with cast_column_to_label(dataset, 'category').
    • Unique counts are identical (4)
    • Unique counts are a low proportion of total rows.
    • Unique values are identical between splits: ('amt', 'author', 'blog', 'speech')
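
The notices above suggest the kind of checks involved in spotting a label column: few unique values relative to the number of rows, and the same set of values in every split. A minimal sketch of such a heuristic follows. It is not textplumber's implementation; the function name `guess_label_columns` and the `max_unique_ratio` threshold are assumptions made for illustration.

```python
# Hypothetical sketch of a label-column heuristic; not the library's own code.
def guess_label_columns(splits, max_unique_ratio=0.1):
    """Return field names that look like label columns.

    `splits` maps a split name to a list of row dicts.
    `max_unique_ratio` is an illustrative threshold, not a library default.
    """
    first_rows = next(iter(splits.values()))
    candidates = []
    for field in first_rows[0]:
        # Collect the set of unique values for this field in each split.
        unique_per_split = [
            frozenset(row[field] for row in rows) for rows in splits.values()
        ]
        # Label columns tend to have identical values in every split ...
        same_values = len(set(unique_per_split)) == 1
        # ... and few unique values compared with the number of rows.
        low_ratio = all(
            len(unique) / len(rows) <= max_unique_ratio
            for unique, rows in zip(unique_per_split, splits.values())
        )
        if same_values and low_ratio:
            candidates.append(field)
    return candidates


# Toy data: 'text' is unique per row, 'category' has two repeated values.
train = [{"text": f"train text {i}", "category": ["blog", "speech"][i % 2]} for i in range(40)]
test = [{"text": f"test text {i}", "category": ["blog", "speech"][i % 2]} for i in range(40)]
print(guess_label_columns({"train": train, "test": test}))  # prints ['category']
```

A text column fails both checks here, since nearly every value is unique and the values differ between splits.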

Some functions for working with datasets are below. The first casts a column as a label column (ClassLabel). This lets you use the same code workflow regardless of how labels were defined by the dataset’s creator. It currently assumes a DatasetDict. It may be altered in the future to work with Dataset objects as well.



cast_column_to_label

 cast_column_to_label (dataset:datasets.dataset_dict.DatasetDict,
                       label_column:str)

Cast a column to a ClassLabel.

Type Details
dataset DatasetDict A Huggingface dataset dict, typically the result of load_dataset()
label_column str The name of the column to cast to ClassLabel
Returns DatasetDict

Second, a function to get the text representation of label names …



get_label_names

 get_label_names (dataset:datasets.dataset_dict.DatasetDict|datasets.arrow_dataset.Dataset,
                  label_column:str)

Get label names from field in a Huggingface dataset.

Type Details
dataset datasets.dataset_dict.DatasetDict | datasets.arrow_dataset.Dataset A Huggingface dataset or dataset dict, typically the result of load_dataset()
label_column str The name of the column to get the label names from
Returns list list of label names

An example of casting columns as labels and getting label names …

cast_column_to_label(dataset, 'category')
cast_column_to_label(dataset, 'style')
print(get_label_names(dataset, label_column = 'category'))
print(get_label_names(dataset, label_column = 'style'))
['speech', 'author', 'blog', 'amt']
['obama', 'bush', 'trump', 'woolf', 'fitzgerald', 'hemingway', 'blog5546', 'blog11518', 'blog25872', 'blog30102', 'blog30407', 'h', 'pp', 'qq']
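
To illustrate what a label column provides, here is a plain-Python sketch of the idea: string values are stored as integer ids, and a list of names maps each id back to its text. The helper `encode_labels` is hypothetical, and ordering the names by first appearance is an assumption chosen for the example, not a statement about how cast_column_to_label orders them.

```python
# Hypothetical helper illustrating the id/name mapping behind a label column.
def encode_labels(values):
    """Map string labels to integer ids, keeping order of first appearance."""
    names = []   # position in this list is the label id
    index = {}   # label name -> label id
    ids = []
    for value in values:
        if value not in index:
            index[value] = len(names)
            names.append(value)
        ids.append(index[value])
    return ids, names


categories = ["speech", "author", "speech", "blog", "amt", "author"]
ids, names = encode_labels(categories)
print(ids)    # [0, 1, 0, 2, 3, 1]
print(names)  # ['speech', 'author', 'blog', 'amt']

# Integer ids convert back to text through the names list.
print([names[i] for i in ids] == categories)  # True
```

Working with ids plus a names list means downstream code can treat every dataset the same way, whatever form the labels originally took.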

Output of preview_dataset after columns have been cast …

preview_dataset(dataset)
Split: train (14579 samples)

Available fields: style, text, category

  • Field 'style' has 14 unique values
    ClassLabel(names=['obama', 'bush', 'trump', 'woolf', 'fitzgerald', 'hemingway', 'blog5546', 'blog11518', 'blog25872', 'blog30102', 'blog30407', 'h', 'pp', 'qq'], id=None)
  • Field 'text' has 14489 unique values
    Value(dtype='string', id=None)
  • Field 'category' has 4 unique values
    ClassLabel(names=['speech', 'author', 'blog', 'amt'], id=None)
Split: validation (3642 samples)

Available fields: style, text, category

  • Field 'style' has 14 unique values
    ClassLabel(names=['obama', 'bush', 'trump', 'woolf', 'fitzgerald', 'hemingway', 'blog5546', 'blog11518', 'blog25872', 'blog30102', 'blog30407', 'h', 'pp', 'qq'], id=None)
  • Field 'text' has 3636 unique values
    Value(dtype='string', id=None)
  • Field 'category' has 4 unique values
    ClassLabel(names=['speech', 'author', 'blog', 'amt'], id=None)
Split: test (4747 samples)

Available fields: style, text, category

  • Field 'style' has 14 unique values
    ClassLabel(names=['obama', 'bush', 'trump', 'woolf', 'fitzgerald', 'hemingway', 'blog5546', 'blog11518', 'blog25872', 'blog30102', 'blog30407', 'h', 'pp', 'qq'], id=None)
  • Field 'text' has 4729 unique values
    Value(dtype='string', id=None)
  • Field 'category' has 4 unique values
    ClassLabel(names=['speech', 'author', 'blog', 'amt'], id=None)
Notices
  • Field 'style' is a label column (ClassLabel).
  • Field 'text' appears to be a text column.
  • Field 'category' is a label column (ClassLabel).


preview_label_counts

 preview_label_counts (df, label_column, label_names)

Preview label counts from a dataframe (this will be made an internal function in a future version - use preview_split_by_label_column instead).



preview_split_by_label_column

 preview_split_by_label_column (dataset:datasets.dataset_dict.DatasetDict,
                                label_column:str)

Output label counts per split for a Huggingface dataset.

Type Details
dataset DatasetDict A Huggingface dataset dict, typically the result of load_dataset()
label_column str The name of the column to preview

Here is how to preview counts of a specific label column for each split in a dataset. This currently assumes input is a DatasetDict. It may be altered in the future to work with Dataset objects as well.

label_column = 'style'
preview_split_by_label_column(dataset, label_column)
style  label_name  count
0      obama        1168
1      bush          619
2      trump        1361
3      woolf        1469
4      fitzgerald   2658
5      hemingway    1516
6      blog5546      904
7      blog11518    2889
8      blog25872     336
9      blog30102     505
10     blog30407     912
11     h              77
12     pp             93
13     qq             72

style  label_name  count
0      obama         273
1      bush          139
2      trump         328
3      woolf         488
4      fitzgerald    885
5      hemingway     504
6      blog5546      160
7      blog11518     510
8      blog25872      60
9      blog30102      90
10     blog30407     161
11     h              14
12     pp             17
13     qq             13

style  label_name  count
0      obama         475
1      bush          251
2      trump         558
3      woolf         488
4      fitzgerald    885
5      hemingway     504
6      blog5546      210
7      blog11518     677
8      blog25872     142
9      blog30102     217
10     blog30407     143
11     h              45
12     pp             85
13     qq             67
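
The per-split counts can be reproduced conceptually with a counter over each split. The sketch below uses plain Python lists of row dicts rather than a DatasetDict, and the function name `count_labels_by_split` is hypothetical. It shows the idea, not the library's code.

```python
from collections import Counter


# Hypothetical sketch: count label ids per split and attach label names.
def count_labels_by_split(splits, label_column, label_names):
    """Return {split_name: [(label_name, count), ...]} in label id order."""
    result = {}
    for split_name, rows in splits.items():
        counts = Counter(row[label_column] for row in rows)
        # Report every known label, including those absent from a split.
        result[split_name] = [
            (name, counts.get(label_id, 0))
            for label_id, name in enumerate(label_names)
        ]
    return result


label_names = ["obama", "bush", "trump"]
splits = {
    "train": [{"style": 0}, {"style": 0}, {"style": 2}],
    "test": [{"style": 1}, {"style": 2}],
}
counts = count_labels_by_split(splits, "style", label_names)
for split_name, rows in counts.items():
    print(split_name, rows)
```

Listing labels with a zero count makes it easy to spot classes that are missing from a split.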


preview_text_field

 preview_text_field (text:str, width:int=80)

Display a text field, wrapping the text to a given width (80 characters by default). This may be moved to an internal function in a future version.

Type Default Details
text str Text to preview
width int 80 Width to wrap the text to
preview_text_field(dataset['train'][0]['text'])
You see it in Melinda Lopez, who came to her family's old home. And as she was
walking the streets, an elderly woman recognized her as her mother's daughter,
and began to cry. She took her into her home and showed her a pile of photos
that included Melinda's baby picture, which her mother had sent 50 years ago.
Melinda later said, "So many of us are now getting so much back."
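
Wrapping of this kind can be done with Python's standard textwrap module. The sketch below is an illustration only: whether preview_text_field uses textwrap internally is an assumption, and `wrap_text` is a hypothetical name.

```python
import textwrap


# Hypothetical sketch of wrapping text to a fixed width.
def wrap_text(text, width=80):
    """Wrap text so that no output line exceeds `width` characters."""
    return textwrap.fill(text, width=width)


sample = "She took her into her home and showed her a pile of photos " * 3
wrapped = wrap_text(sample, width=40)
print(wrapped)
```

A narrower width is handy when previewing long texts in a side panel or a terminal.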

The preview_row_text function may be changed in future to work with Dataset/DatasetDict objects. It currently assumes you’ve done something like dataset.to_pandas() to get a pandas DataFrame. No example is provided here, as the function may change in the near future.



preview_row_text

 preview_row_text (df:pandas.core.frame.DataFrame, selected_index:int,
                   text_column:str='text', limit:int=-1)

Output the text fields of a row in the DataFrame

Type Default Details
df DataFrame DataFrame containing the data
selected_index int Index of the row to preview
text_column str text column name for text field
limit int -1 Limit the length of the text field

Loading example data for remaining code examples …

# for testing multi-class classification
X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data(label_column = 'style', target_labels = ['fitzgerald', 'hemingway', 'woolf'])

TODO: document …



preview_splits

 preview_splits (X_train, y_train, X_test, y_test, label_names=None,
                 target_classes=None, target_names=None)

Display the number of samples in each class for train and test sets.

preview_splits(X_train, y_train, X_test, y_test, target_names = target_names, target_classes = target_classes)
Train: 4407 samples, 3 classes
label_name count
0
3 woolf 1469
4 fitzgerald 1469
5 hemingway 1469
Test: 1877 samples, 3 classes
label_name count
0
4 fitzgerald 885
5 hemingway 504
3 woolf 488
feature_store = TextFeatureStore('feature_store_example_report.sqlite')
pipeline = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store, pos_tagset = 'detailed')),
    ('pos', POSVectorizer(feature_store=feature_store, ngram_range = (1, 1))),
    ('scaler', StandardScaler(with_mean=False)),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)

display(pipeline)
Pipeline(steps=[('preprocessor',
                 SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>,
                                   pos_tagset='detailed')),
                ('pos',
                 POSVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)),
                ('scaler', StandardScaler(with_mean=False)),
                ('classifier',
                 LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
[Pipeline] ...... (step 1 of 4) Processing preprocessor, total=  11.5s
[Pipeline] ............... (step 2 of 4) Processing pos, total=   0.2s
[Pipeline] ............ (step 3 of 4) Processing scaler, total=   0.0s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   0.1s


plt_svg

 plt_svg (fig:matplotlib.figure.Figure=None)

Display an SVG in a notebook with save functionality (see note)

Type Default Details
fig Figure None Optional figure to display, if None uses current figure

The plt_svg function is based on code from this gist, but is amended so that text is rendered as text.

TODO: document …



plot_confusion_matrix

 plot_confusion_matrix (y_test, y_predicted, target_classes, target_names,
                        figsize=(10, 8), renderer:str='svg',
                        title:str|None=None)

Output a confusion matrix with counts and proportions and appropriate labels.

Type Default Details
y_test
y_predicted
target_classes
target_names
figsize tuple (10, 8)
renderer str svg ‘svg’ or ‘img’
title str | None None Title for the plot
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
              precision    recall  f1-score   support

  fitzgerald      0.748     0.538     0.626       885
   hemingway      0.567     0.819     0.670       504
       woolf      0.585     0.615     0.599       488

    accuracy                          0.633      1877
   macro avg      0.634     0.657     0.632      1877
weighted avg      0.657     0.633     0.631      1877
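
For readers who want to see what sits behind a confusion matrix with counts and proportions, here is a small plain-Python sketch. The function name `confusion_with_proportions` is hypothetical, and normalising each row by its true-class total is an assumption. The source does not state which denominator plot_confusion_matrix uses.

```python
# Hypothetical sketch: confusion matrix counts plus row-wise proportions.
def confusion_with_proportions(y_true, y_pred, classes):
    """Return (counts, proportions); rows are true classes, columns predicted."""
    position = {label: i for i, label in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for true_label, predicted_label in zip(y_true, y_pred):
        counts[position[true_label]][position[predicted_label]] += 1
    proportions = []
    for row in counts:
        total = sum(row)
        # Normalise by row so each cell is a share of that true class.
        proportions.append([cell / total if total else 0.0 for cell in row])
    return counts, proportions


y_true = [3, 3, 4, 4, 4, 5]
y_pred = [3, 4, 4, 4, 5, 5]
counts, proportions = confusion_with_proportions(y_true, y_pred, classes=[3, 4, 5])
print(counts)       # [[1, 1, 0], [0, 2, 1], [0, 0, 1]]
print(proportions)
```

With row-wise normalisation, the diagonal of the proportions matrix is each class's recall, which ties the plot back to the classification report.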

TODO: document …



save_results

 save_results (results_file, pipeline, experiment_descriptor,
               dataset_descriptor, y_test, y_pred, target_classes,
               target_names, classifier_step_name='classifier')

Save results from an experiment

experiment_descriptor = 'POS features, Logistic Regression'
dataset_descriptor = 'hallisky/AuthorMix dataset, authorship classification'
results_file = 'results.csv'

save_results(results_file, pipeline, experiment_descriptor, dataset_descriptor, y_test, y_pred, target_classes, target_names, classifier_step_name = 'classifier')
experiment dataset classifier parameters accuracy_f1 macro_precision macro_recall macro_f1 weighted_precision weighted_recall weighted_f1 fitzgerald_precision fitzgerald_recall fitzgerald_f1 hemingway_precision hemingway_recall hemingway_f1 woolf_precision woolf_recall woolf_f1
0 POS features, Logistic Regression hallisky/AuthorMix dataset, authorship classif... LogisticRegression {'memory': None, 'steps': [('preprocessor', Sp... 0.633458 0.63351 0.657351 0.63192 0.657252 0.633458 0.630976 0.748428 0.537853 0.625904 0.567308 0.819444 0.670455 0.584795 0.614754 0.599401
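
The column names in the saved row (macro_precision, weighted_f1, fitzgerald_recall and so on) suggest that a nested report is flattened into a single row. The sketch below shows one way that could be done. The input is assumed to have the shape returned by scikit-learn's classification_report with output_dict=True, the values are the rounded figures from the report above, and `flatten_report` is a hypothetical helper rather than part of save_results.

```python
# Hypothetical sketch: flatten a nested metrics dict into one results row.
def flatten_report(report, label_names):
    """Flatten a nested metrics dict into one row of named columns."""
    metrics = (("precision", "precision"), ("recall", "recall"), ("f1", "f1-score"))
    # The accuracy column is named accuracy_f1 in the table above.
    row = {"accuracy_f1": report["accuracy"]}
    for prefix, key in (("macro", "macro avg"), ("weighted", "weighted avg")):
        for column, source in metrics:
            row[f"{prefix}_{column}"] = report[key][source]
    for name in label_names:
        for column, source in metrics:
            row[f"{name}_{column}"] = report[name][source]
    return row


report = {
    "fitzgerald": {"precision": 0.748, "recall": 0.538, "f1-score": 0.626, "support": 885},
    "hemingway": {"precision": 0.567, "recall": 0.819, "f1-score": 0.670, "support": 504},
    "woolf": {"precision": 0.585, "recall": 0.615, "f1-score": 0.599, "support": 488},
    "accuracy": 0.633,
    "macro avg": {"precision": 0.634, "recall": 0.657, "f1-score": 0.632, "support": 1877},
    "weighted avg": {"precision": 0.657, "recall": 0.633, "f1-score": 0.631, "support": 1877},
}
row = flatten_report(report, ["fitzgerald", "hemingway", "woolf"])
print(list(row))
```

Because per-class columns are named after the labels, experiments with different label sets produce different columns in the same results file.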

The following code is in development.

TODO: document …



plot_confusion_matrices

 plot_confusion_matrices (tests, predictions, target_classes,
                          target_names, model_names=None, n_col=2,
                          figsize=(16, 8), renderer:str='svg',
                          title:str|None=None)

Plot grid of confusion matrices for multiple models.

Type Default Details
tests
predictions
target_classes
target_names
model_names NoneType None Optional: list of names for each prediction
n_col int 2
figsize tuple (16, 8)
renderer str svg ‘svg’ or ‘img’
title str | None None
pipeline1 = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store, pos_tagset = 'detailed')),
    ('pos', POSVectorizer(feature_store=feature_store, ngram_range = (1, 1))),
    ('scaler', StandardScaler(with_mean=False)),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)

experiment_descriptor = 'POS features, Unigrams, Logistic Regression'
pipeline1.fit(X_train, y_train)
y_pred1 = pipeline1.predict(X_test)
results = save_results(results_file, pipeline1, experiment_descriptor, dataset_descriptor, y_test, y_pred1, target_classes, target_names, classifier_step_name = 'classifier')

pipeline2 = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store, pos_tagset = 'detailed')),
    ('pos', POSVectorizer(feature_store=feature_store, ngram_range = (2, 2))),
    ('scaler', StandardScaler(with_mean=False)),
    ('classifier', LogisticRegression(max_iter=5000, random_state=55))
], verbose=True)

experiment_descriptor = 'POS features, Bigrams, Logistic Regression'
pipeline2.fit(X_train, y_train)
y_pred2 = pipeline2.predict(X_test)
results = save_results(results_file, pipeline2, experiment_descriptor, dataset_descriptor, y_test, y_pred2, target_classes, target_names, classifier_step_name = 'classifier')
[Pipeline] ...... (step 1 of 4) Processing preprocessor, total=   0.2s
[Pipeline] ............... (step 2 of 4) Processing pos, total=   0.2s
[Pipeline] ............ (step 3 of 4) Processing scaler, total=   0.0s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   0.1s
[Pipeline] ...... (step 1 of 4) Processing preprocessor, total=   0.1s
[Pipeline] ............... (step 2 of 4) Processing pos, total=   0.3s
[Pipeline] ............ (step 3 of 4) Processing scaler, total=   0.0s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   0.2s
plot_confusion_matrices([y_test, y_test], [y_pred1, y_pred2], target_classes, target_names, model_names=['Logistic Regression (1-gram)', 'Logistic Regression (2-gram)'], n_col=2, figsize=(16, 8), renderer='svg', title='Confusion Matrices for Different Models')


get_classifier_feature_names_in

 get_classifier_feature_names_in (pipeline:sklearn.pipeline.Pipeline,
                                  classifier_step_name='classifier')

Get the feature names that were the input to the classifier step from a fitted pipeline.

Type Default Details
pipeline Pipeline fitted pipeline
classifier_step_name str classifier name of the classifier step in pipeline
print(get_classifier_feature_names_in(pipeline, classifier_step_name = 'classifier'))
["''" ',' '-LRB-' '-RRB-' '.' ':' 'ADD' 'CC' 'CD' 'DT' 'EX' 'FW' 'HYPH'
 'IN' 'JJ' 'JJR' 'JJS' 'LS' 'MD' 'NFP' 'NN' 'NNP' 'NNPS' 'NNS' 'PDT' 'POS'
 'PRP' 'PRP$' 'RB' 'RBR' 'RBS' 'RP' 'SYM' 'TO' 'UH' 'VB' 'VBD' 'VBG' 'VBN'
 'VBP' 'VBZ' 'WDT' 'WP' 'WP$' 'WRB' 'XX' '_SP' '``']


plot_logistic_regression_features_from_pipeline

 plot_logistic_regression_features_from_pipeline (pipeline,
                                                  target_classes,
                                                  target_names, top_n=20,
                                                  classifier_step_name='classifier',
                                                  features_step_name='features',
                                                  renderer='svg')

Plot the most discriminative features for a logistic regression classifier in a fitted pipeline.

plot_logistic_regression_features_from_pipeline(pipeline, target_classes, target_names, top_n=10, classifier_step_name='classifier', features_step_name='features')
Feature Log Odds (Logit) Odds Ratio
5 : 1.120742 3.067128
1 , 0.940030 2.560059
13 IN 0.787940 2.198862
9 DT -0.586936 0.556029
26 PRP -0.396887 0.672410
20 NN -0.396228 0.672853
33 TO 0.332592 1.394578
4 . 0.241355 1.272973
35 VB -0.236384 0.789477
41 WDT 0.229801 1.258350
Feature Log Odds (Logit) Odds Ratio
5 : 0.800501 2.226657
20 NN 0.732279 2.079814
9 DT -0.618647 0.538673
14 JJ 0.498787 1.646723
36 VBD -0.491005 0.612011
4 . -0.460041 0.631258
1 , -0.418957 0.657732
40 VBZ -0.312551 0.731578
26 PRP 0.302341 1.353022
13 IN 0.268820 1.308420
Feature Log Odds (Logit) Odds Ratio
5 : -1.921243 0.146425
9 DT 1.205582 3.338703
13 IN -1.056760 0.347580
1 , -0.521073 0.593883
33 TO -0.404040 0.667617
35 VB 0.373301 1.452521
14 JJ -0.346658 0.707047
20 NN -0.336050 0.714587
43 WP$ -0.326951 0.721119
36 VBD 0.318487 1.375046
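
The two numeric columns in these tables are related: the odds ratio is the exponential of the log odds (the logistic regression coefficient). The short check below uses two rows from the first table. The helper name `odds_ratio` is introduced here for illustration.

```python
import math


# Illustrative helper: convert a log-odds coefficient to an odds ratio.
def odds_ratio(log_odds):
    """Return exp(log_odds)."""
    return math.exp(log_odds)


# Coefficients copied from the first table above.
for feature, log_odds in [(":", 1.120742), ("DT", -0.586936)]:
    print(feature, round(odds_ratio(log_odds), 3))
```

An odds ratio above 1 means the feature raises the odds of the class, and a value below 1 means it lowers them, which is why negative log odds map to ratios between 0 and 1.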
X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data()
pipeline = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store, pos_tagset = 'detailed')),
    ('pos', POSVectorizer(feature_store=feature_store, ngram_range = (1, 1))),
    ('classifier', DecisionTreeClassifier(max_depth=4, random_state=55))
], verbose=True)

display(pipeline)
Pipeline(steps=[('preprocessor',
                 SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>,
                                   pos_tagset='detailed')),
                ('pos',
                 POSVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)),
                ('classifier',
                 DecisionTreeClassifier(max_depth=4, random_state=55))],
         verbose=True)
experiment_descriptor = 'POS features, Decision Tree Classifier'
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
results = save_results(results_file, pipeline, experiment_descriptor, dataset_descriptor, y_test, y_pred, target_classes, target_names, classifier_step_name = 'classifier')
[Pipeline] ...... (step 1 of 3) Processing preprocessor, total=  20.5s
[Pipeline] ............... (step 2 of 3) Processing pos, total=   0.7s
[Pipeline] ........ (step 3 of 3) Processing classifier, total=   0.0s
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
              precision    recall  f1-score   support

        blog      0.634     0.583     0.608       981
      author      0.847     0.790     0.817      1877
      speech      0.614     0.785     0.689       740

    accuracy                          0.732      3598
   macro avg      0.698     0.719     0.705      3598
weighted avg      0.741     0.732     0.734      3598


plot_decision_tree_from_pipeline

 plot_decision_tree_from_pipeline (pipeline, X_train, y_train,
                                   target_classes, target_names,
                                   classifier_step_name='classifier',
                                   features_step_name='features')

Plot the decision tree of the classifier from a pipeline using SuperTree

Type Default Details
pipeline The pipeline containing the classifier
X_train The training data
y_train The training labels
target_classes The target classes
target_names The target names
classifier_step_name str classifier The name of the classifier step in the pipeline
features_step_name str features The name of the final preprocessing step, probably the step immediately before the classifier

SuperTree creates interactive decision tree visualisations.

plot_decision_tree_from_pipeline(pipeline, X_train, y_train, target_classes, target_names, classifier_step_name = 'classifier', features_step_name = 'pos')


get_selected_feature_names

 get_selected_feature_names (pipeline, features_step_name='features',
                             selector_step_name='selector')

Get the selected features from the pipeline (Deprecated).

Type Default Details
pipeline the pipeline to get the feature names from
features_step_name str features the name of the step in the pipeline that contains the features
selector_step_name str selector the name of the step in the pipeline that contains the selector
Returns list returns a list of the selected feature names

If a pipeline uses feature selection, here is how to get the selected features. First, here is a pipeline with a selector. A more complex pipeline is used for demonstration …

pipeline = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store, pos_tagset = 'detailed')),
    ('features', FeatureUnion([
        ('pos_unigrams', POSVectorizer(feature_store=feature_store, ngram_range = (1, 1))),
        ('textstats', Pipeline([
            ('textstats_vectors', TextstatsTransformer(feature_store=feature_store)),
            ('selector', SelectKBest(mutual_info_classif, k=5)),
        ], verbose=True)),
    ], verbose=True)),
    ('scaler', StandardScaler(with_mean=False)),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)

display(pipeline)
Pipeline(steps=[('preprocessor',
                 SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>,
                                   pos_tagset='detailed')),
                ('features',
                 FeatureUnion(transformer_list=[('pos_unigrams',
                                                 POSVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)),
                                                ('textstats',
                                                 Pipeline(steps=[('textstats_vectors',
                                                                  TextstatsTransformer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)),
                                                                 ('selector',
                                                                  SelectKBest(k=5,
                                                                              score_func=<function mutual_info_classif at 0x7efb5a59e020>))],
                                                          verbose=True))],
                              verbose=True)),
                ('scaler', StandardScaler(with_mean=False)),
                ('classifier',
                 LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)

The functions to preview features require a fitted pipeline …

pipeline.fit(X_train, y_train)
[Pipeline] ...... (step 1 of 4) Processing preprocessor, total=   0.3s
[FeatureUnion] .. (step 1 of 2) Processing pos_unigrams, total=   0.7s
[Pipeline] . (step 1 of 2) Processing textstats_vectors, total=   0.3s
[Pipeline] .......... (step 2 of 2) Processing selector, total=   0.2s
[FeatureUnion] ..... (step 2 of 2) Processing textstats, total=   0.5s
[Pipeline] .......... (step 2 of 4) Processing features, total=   1.2s
[Pipeline] ............ (step 3 of 4) Processing scaler, total=   0.0s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   0.6s
Pipeline(steps=[('preprocessor',
                 SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>,
                                   pos_tagset='detailed')),
                ('features',
                 FeatureUnion(transformer_list=[('pos_unigrams',
                                                 POSVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)),
                                                ('textstats',
                                                 Pipeline(steps=[('textstats_vectors',
                                                                  TextstatsTransformer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)),
                                                                 ('selector',
                                                                  SelectKBest(k=5,
                                                                              score_func=<function mutual_info_classif at 0x7efb5a59e020>))],
                                                          verbose=True))],
                              verbose=True)),
                ('scaler', StandardScaler(with_mean=False)),
                ('classifier',
                 LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)


preview_selected_features

 preview_selected_features (pipeline, features_step_name='features',
                            selector_step_name='selector')

Preview (i.e. print) the selected features from the pipeline (Deprecated - this will be removed in 0.0.10).

Type Default Details
pipeline the pipeline to preview the selected features from
features_step_name str features the name of the step in the pipeline that contains the features
selector_step_name str selector the name of the step in the pipeline that contains the selector

This functionality is new in v0.0.9.



preview_pipeline_features

 preview_pipeline_features (pipeline:sklearn.pipeline.Pipeline)

Outputs the features at each step in a pipeline.

Type Details
pipeline Pipeline pipeline to preview

To see how features are extracted and/or selected at each stage of a pipeline, use preview_pipeline_features. Expand each pipeline step to see the features it outputs.

preview_pipeline_features(pipeline)
preprocessor SpacyPreprocessor

This step receives and returns text.

features FeatureUnion

pos_unigrams POSVectorizer

Features Out (49)

$, '', ,, -LRB-, -RRB-, ., :, ADD, CC, CD, DT, EX, FW, HYPH, IN, JJ, JJR, JJS, LS, MD, NFP, NN, NNP, NNPS, NNS, PDT, POS, PRP, PRP$, RB, RBR, RBS, RP, SYM, TO, UH, VB, VBD, VBG, VBN, VBP, VBZ, WDT, WP, WP$, WRB, XX, _SP, ``

textstats_vectors TextstatsTransformer

Features Out (12)

tokens_count, sentences_count, characters_count, monosyllabic_words_relfreq, polysyllabic_words_relfreq, unique_tokens_relfreq, average_characters_per_token, average_tokens_per_sentence, characters_proportion_letters, characters_proportion_uppercase, hapax_legomena_count, hapax_legomena_to_unique

selector SelectKBest

Features Out (5)

tokens_count, characters_count, polysyllabic_words_relfreq, unique_tokens_relfreq, average_tokens_per_sentence

scaler StandardScaler

Features Out (54)

pos_unigrams__$, pos_unigrams__'', pos_unigrams__,, pos_unigrams__-LRB-, pos_unigrams__-RRB-, pos_unigrams__., pos_unigrams__:, pos_unigrams__ADD, pos_unigrams__CC, pos_unigrams__CD, pos_unigrams__DT, pos_unigrams__EX, pos_unigrams__FW, pos_unigrams__HYPH, pos_unigrams__IN, pos_unigrams__JJ, pos_unigrams__JJR, pos_unigrams__JJS, pos_unigrams__LS, pos_unigrams__MD, pos_unigrams__NFP, pos_unigrams__NN, pos_unigrams__NNP, pos_unigrams__NNPS, pos_unigrams__NNS, pos_unigrams__PDT, pos_unigrams__POS, pos_unigrams__PRP, pos_unigrams__PRP$, pos_unigrams__RB, pos_unigrams__RBR, pos_unigrams__RBS, pos_unigrams__RP, pos_unigrams__SYM, pos_unigrams__TO, pos_unigrams__UH, pos_unigrams__VB, pos_unigrams__VBD, pos_unigrams__VBG, pos_unigrams__VBN, pos_unigrams__VBP, pos_unigrams__VBZ, pos_unigrams__WDT, pos_unigrams__WP, pos_unigrams__WP$, pos_unigrams__WRB, pos_unigrams__XX, pos_unigrams___SP, pos_unigrams__``, textstats__tokens_count, textstats__characters_count, textstats__polysyllabic_words_relfreq, textstats__unique_tokens_relfreq, textstats__average_tokens_per_sentence

classifier LogisticRegression

Features In (54)

pos_unigrams__$, pos_unigrams__'', pos_unigrams__,, pos_unigrams__-LRB-, pos_unigrams__-RRB-, pos_unigrams__., pos_unigrams__:, pos_unigrams__ADD, pos_unigrams__CC, pos_unigrams__CD, pos_unigrams__DT, pos_unigrams__EX, pos_unigrams__FW, pos_unigrams__HYPH, pos_unigrams__IN, pos_unigrams__JJ, pos_unigrams__JJR, pos_unigrams__JJS, pos_unigrams__LS, pos_unigrams__MD, pos_unigrams__NFP, pos_unigrams__NN, pos_unigrams__NNP, pos_unigrams__NNPS, pos_unigrams__NNS, pos_unigrams__PDT, pos_unigrams__POS, pos_unigrams__PRP, pos_unigrams__PRP$, pos_unigrams__RB, pos_unigrams__RBR, pos_unigrams__RBS, pos_unigrams__RP, pos_unigrams__SYM, pos_unigrams__TO, pos_unigrams__UH, pos_unigrams__VB, pos_unigrams__VBD, pos_unigrams__VBG, pos_unigrams__VBN, pos_unigrams__VBP, pos_unigrams__VBZ, pos_unigrams__WDT, pos_unigrams__WP, pos_unigrams__WP$, pos_unigrams__WRB, pos_unigrams__XX, pos_unigrams___SP, pos_unigrams__``, textstats__tokens_count, textstats__characters_count, textstats__polysyllabic_words_relfreq, textstats__unique_tokens_relfreq, textstats__average_tokens_per_sentence
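
The output above shows the 49 POS features and the 5 selected text statistics arriving at the scaler as 54 features, each prefixed with the name of the FeatureUnion branch it came from. The plain-Python sketch below mimics that `name__feature` naming. The function `union_feature_names` is hypothetical and only illustrates the convention visible in the output.

```python
# Hypothetical sketch of how combined feature names are prefixed by branch.
def union_feature_names(named_outputs):
    """Prefix each feature with its branch name, joined by a double underscore."""
    names = []
    for transformer_name, feature_names in named_outputs:
        names.extend(f"{transformer_name}__{feature}" for feature in feature_names)
    return names


pos_features = ["DT", "IN", "NN"]  # stands in for the 49 POS tags above
textstats_selected = ["tokens_count", "characters_count"]  # stands in for the 5 selected
combined = union_feature_names(
    [("pos_unigrams", pos_features), ("textstats", textstats_selected)]
)
print(len(combined))  # 5
print(combined)
```

The prefix keeps names unique when two branches produce features with the same name, and it shows at a glance which branch a feature came from.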

pd.read_csv('results.csv')
experiment dataset classifier parameters accuracy_f1 macro_precision macro_recall macro_f1 weighted_precision weighted_recall ... woolf_f1 blog_precision blog_recall blog_f1 author_precision author_recall author_f1 speech_precision speech_recall speech_f1
0 POS features, Logistic Regression hallisky/AuthorMix dataset, authorship classif... LogisticRegression {'memory': None, 'steps': [('preprocessor', Sp... 0.633458 0.633510 0.657351 0.631920 0.657252 0.633458 ... 0.599401 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 POS features, Unigrams, Logistic Regression hallisky/AuthorMix dataset, authorship classif... LogisticRegression {'memory': None, 'steps': [('preprocessor', Sp... 0.633458 0.633510 0.657351 0.631920 0.657252 0.633458 ... 0.599401 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 POS features, Bigrams, Logistic Regression hallisky/AuthorMix dataset, authorship classif... LogisticRegression {'memory': None, 'steps': [('preprocessor', Sp... 0.636121 0.629850 0.649835 0.634696 0.649218 0.636121 ... 0.583411 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 POS features, Decision Tree Classifier hallisky/AuthorMix dataset, authorship classif... DecisionTreeClassifier {'memory': None, 'steps': [('preprocessor', Sp... 0.732351 0.698335 0.719257 0.704589 0.741123 0.732351 ... NaN 0.634146 0.583078 0.607541 0.847341 0.789558 0.81743 0.613516 0.785135 0.688797

4 rows × 29 columns