dataset = load_dataset('hallisky/AuthorMix')
report
The reporting functionality is in active development. Documentation is also in development, but should reflect the latest version.
preview_dataset
preview_dataset (dataset:datasets.arrow_dataset.Dataset|datasets.dataset_dict.DatasetDict)
Output information about a Huggingface dataset.
Type | Details | |
---|---|---|
dataset | datasets.arrow_dataset.Dataset | datasets.dataset_dict.DatasetDict | A Huggingface dataset or dataset dict, typically the result of load_dataset() |
The preview_dataset function provides information on the dataset. This includes the splits, a summary of field types for each split, and counts of unique values for each field. Expand a split to get more information. The notices output after the dataset summary identify probable text and label column candidates, and flag candidate label columns that should be cast as ClassLabel.
preview_dataset(dataset)
Split: train (14579 samples)
Available fields: style, text, category
- Field 'style' has 14 unique values
Value(dtype='string', id=None)
- Field 'text' has 14489 unique values
Value(dtype='string', id=None)
- Field 'category' has 4 unique values
Value(dtype='string', id=None)
Split: validation (3642 samples)
Available fields: style, text, category
- Field 'style' has 14 unique values
Value(dtype='string', id=None)
- Field 'text' has 3636 unique values
Value(dtype='string', id=None)
- Field 'category' has 4 unique values
Value(dtype='string', id=None)
Split: test (4747 samples)
Available fields: style, text, category
- Field 'style' has 14 unique values
Value(dtype='string', id=None)
- Field 'text' has 4729 unique values
Value(dtype='string', id=None)
- Field 'category' has 4 unique values
Value(dtype='string', id=None)
Warnings/Notices
- Field 'style' appears to be a label column and should probably be cast as ClassLabel with cast_column_to_label(dataset, 'style').
- Unique counts are identical (14)
- Unique counts are a low proportion of total rows.
- Unique values are identical between splits: ('blog11518', 'blog25872', 'blog30102', 'blog30407', 'blog5546', 'bush', 'fitzgerald', 'h', 'hemingway', 'obama', 'pp', 'qq', 'trump', 'woolf')
- Field 'text' appears to be a text column.
- Field 'category' appears to be a label column and should probably be cast as ClassLabel with cast_column_to_label(dataset, 'category').
- Unique counts are identical (4)
- Unique counts are a low proportion of total rows.
- Unique values are identical between splits: ('amt', 'author', 'blog', 'speech')
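The notices above rest on two observations per field: the unique values are the same in every split, and there are few unique values relative to the number of rows. The following is a minimal sketch of that kind of check in plain Python. It is not the library's implementation; the function name looks_like_label_column and the max_unique_ratio threshold are hypothetical and chosen only for illustration.

```python
def looks_like_label_column(values_by_split, max_unique_ratio=0.05):
    """Hypothetical heuristic: does a column look categorical in every split?"""
    unique_sets = [set(values) for values in values_by_split.values()]
    # Condition 1: every split contains the same set of unique values
    same_values = all(s == unique_sets[0] for s in unique_sets)
    # Condition 2: unique values are a low proportion of the rows in each split
    low_ratio = all(
        len(set(values)) / len(values) <= max_unique_ratio
        for values in values_by_split.values()
    )
    return same_values and low_ratio


# Toy stand-in for a dataset: split name -> list of column values
category = {
    "train": ["blog", "speech", "author", "amt"] * 25,
    "test": ["amt", "author", "blog", "speech"] * 25,
}
text = {
    "train": [f"document {i}" for i in range(100)],
    "test": [f"other document {i}" for i in range(100)],
}

print(looks_like_label_column(category))  # 4 unique values per 100 rows
print(looks_like_label_column(text))      # every value is unique
```

A real check would also need to handle empty splits and decide on a sensible threshold for the data at hand.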
Some functions for working with datasets are below. First, a function to cast a column as a label column. This makes it possible to use the same code workflow regardless of how the dataset's creator defined the labels. It currently assumes a DatasetDict. It may be altered in the future to work with Dataset objects as well.
cast_column_to_label
cast_column_to_label (dataset:datasets.dataset_dict.DatasetDict, label_column:str)
Cast a column to a ClassLabel.
Type | Details | |
---|---|---|
dataset | DatasetDict | A Huggingface dataset dict, typically the result of load_dataset() |
label_column | str | The name of the column to cast to ClassLabel |
Returns | DatasetDict |
And, second - a function to get the text representation of label names …
get_label_names
get_label_names (dataset:datasets.dataset_dict.DatasetDict|datasets.arrow_dataset.Dataset, label_column:str)
Get label names from field in a Huggingface dataset.
Type | Details | |
---|---|---|
dataset | datasets.dataset_dict.DatasetDict | datasets.arrow_dataset.Dataset | A Huggingface dataset or dataset dict, typically the result of load_dataset() |
label_column | str | The name of the column to get the label names from |
Returns | list | list of label names |
An example of casting columns as labels and getting label names …
cast_column_to_label(dataset, 'category')
cast_column_to_label(dataset, 'style')
print(get_label_names(dataset, label_column = 'category'))
print(get_label_names(dataset, label_column = 'style'))
['speech', 'author', 'blog', 'amt']
['obama', 'bush', 'trump', 'woolf', 'fitzgerald', 'hemingway', 'blog5546', 'blog11518', 'blog25872', 'blog30102', 'blog30407', 'h', 'pp', 'qq']
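Once a column is a ClassLabel, each row stores an integer id and the list of names gives the text for each id. The sketch below mimics that mapping in plain Python using the category names printed above. The helper names encode and decode are hypothetical; in the datasets library the ClassLabel feature offers str2int and int2str for the same purpose.

```python
# Label names in the order reported for the 'category' column
names = ['speech', 'author', 'blog', 'amt']

# The position of each name in the list is its integer id
name_to_id = {name: i for i, name in enumerate(names)}


def encode(labels):
    """Map label strings to integer ids."""
    return [name_to_id[label] for label in labels]


def decode(ids):
    """Map integer ids back to label strings."""
    return [names[i] for i in ids]


ids = encode(['blog', 'speech', 'amt'])
print(ids)
print(decode(ids))
```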
Output of preview_dataset after columns have been cast …
preview_dataset(dataset)
Split: train (14579 samples)
Available fields: style, text, category
- Field 'style' has 14 unique values
ClassLabel(names=['obama', 'bush', 'trump', 'woolf', 'fitzgerald', 'hemingway', 'blog5546', 'blog11518', 'blog25872', 'blog30102', 'blog30407', 'h', 'pp', 'qq'], id=None)
- Field 'text' has 14489 unique values
Value(dtype='string', id=None)
- Field 'category' has 4 unique values
ClassLabel(names=['speech', 'author', 'blog', 'amt'], id=None)
Split: validation (3642 samples)
Available fields: style, text, category
- Field 'style' has 14 unique values
ClassLabel(names=['obama', 'bush', 'trump', 'woolf', 'fitzgerald', 'hemingway', 'blog5546', 'blog11518', 'blog25872', 'blog30102', 'blog30407', 'h', 'pp', 'qq'], id=None)
- Field 'text' has 3636 unique values
Value(dtype='string', id=None)
- Field 'category' has 4 unique values
ClassLabel(names=['speech', 'author', 'blog', 'amt'], id=None)
Split: test (4747 samples)
Available fields: style, text, category
- Field 'style' has 14 unique values
ClassLabel(names=['obama', 'bush', 'trump', 'woolf', 'fitzgerald', 'hemingway', 'blog5546', 'blog11518', 'blog25872', 'blog30102', 'blog30407', 'h', 'pp', 'qq'], id=None)
- Field 'text' has 4729 unique values
Value(dtype='string', id=None)
- Field 'category' has 4 unique values
ClassLabel(names=['speech', 'author', 'blog', 'amt'], id=None)
Notices
- Field 'style' is a label column (ClassLabel).
- Field 'text' appears to be a text column.
- Field 'category' is a label column (ClassLabel).
preview_label_counts
preview_label_counts (df, label_column, label_names)
Preview label counts from a dataframe (this will be made an internal function in a future version - use preview_split_by_label_column instead).
preview_split_by_label_column
preview_split_by_label_column (dataset:datasets.dataset_dict.DatasetDict, label_column:str)
Output label counts per split for a Huggingface dataset.
Type | Details | |
---|---|---|
dataset | DatasetDict | A Huggingface dataset dict, typically the result of load_dataset() |
label_column | str | The name of the column to preview |
Here is how to preview counts of a specific label column for each split in a dataset. This currently assumes input is a DatasetDict. It may be altered in the future to work with Dataset objects as well.
label_column = 'style'
preview_split_by_label_column(dataset, label_column)
label_name | count | |
---|---|---|
style | ||
0 | obama | 1168 |
1 | bush | 619 |
2 | trump | 1361 |
3 | woolf | 1469 |
4 | fitzgerald | 2658 |
5 | hemingway | 1516 |
6 | blog5546 | 904 |
7 | blog11518 | 2889 |
8 | blog25872 | 336 |
9 | blog30102 | 505 |
10 | blog30407 | 912 |
11 | h | 77 |
12 | pp | 93 |
13 | qq | 72 |
label_name | count | |
---|---|---|
style | ||
0 | obama | 273 |
1 | bush | 139 |
2 | trump | 328 |
3 | woolf | 488 |
4 | fitzgerald | 885 |
5 | hemingway | 504 |
6 | blog5546 | 160 |
7 | blog11518 | 510 |
8 | blog25872 | 60 |
9 | blog30102 | 90 |
10 | blog30407 | 161 |
11 | h | 14 |
12 | pp | 17 |
13 | qq | 13 |
label_name | count | |
---|---|---|
style | ||
0 | obama | 475 |
1 | bush | 251 |
2 | trump | 558 |
3 | woolf | 488 |
4 | fitzgerald | 885 |
5 | hemingway | 504 |
6 | blog5546 | 210 |
7 | blog11518 | 677 |
8 | blog25872 | 142 |
9 | blog30102 | 217 |
10 | blog30407 | 143 |
11 | h | 45 |
12 | pp | 85 |
13 | qq | 67 |
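Each table above is a count of label ids within one split, joined to the label names. A minimal sketch of that tally with the standard library follows. The function count_labels and the toy ids are illustrative assumptions, not the library's code, and the real function works on a DatasetDict.

```python
from collections import Counter


def count_labels(label_ids, label_names):
    """Return (id, name, count) rows for one split, in label-id order."""
    counts = Counter(label_ids)
    return [(i, name, counts.get(i, 0)) for i, name in enumerate(label_names)]


# Toy split: ids refer to positions in label_names
label_names = ['obama', 'bush', 'trump']
split_label_ids = [0, 2, 2, 1, 0, 2]

for label_id, name, count in count_labels(split_label_ids, label_names):
    print(label_id, name, count)
```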
preview_text_field
preview_text_field (text:str, width:int=80)
Display a text field, wrapping the text to 80 characters. This may be moved to an internal function in a future version.
Type | Default | Details | |
---|---|---|---|
text | str | Text to preview | |
width | int | 80 | Width to wrap the text to |
preview_text_field(dataset['train'][0]['text'])
You see it in Melinda Lopez, who came to her family's old home. And as she was
walking the streets, an elderly woman recognized her as her mother's daughter,
and began to cry. She took her into her home and showed her a pile of photos
that included Melinda's baby picture, which her mother had sent 50 years ago.
Melinda later said, "So many of us are now getting so much back."
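Wrapping like this can be approximated with Python's textwrap module. The sketch below is an assumption about the behaviour rather than the library's code, and wrap_preview is a hypothetical name.

```python
import textwrap


def wrap_preview(text, width=80):
    """Wrap text so that no output line is longer than `width` characters."""
    return textwrap.fill(text, width=width)


sample = (
    "You see it in Melinda Lopez, who came to her family's old home. "
    "And as she was walking the streets, an elderly woman recognized her "
    "as her mother's daughter, and began to cry."
)

wrapped = wrap_preview(sample, width=80)
print(wrapped)
```

Passing a smaller width, for example wrap_preview(sample, width=40), gives narrower output for side-by-side comparisons.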
The next function, preview_row_text, may be changed in future to work with Dataset/DatasetDict objects. It currently assumes you've done something like dataset.to_pandas() to get a pandas dataframe. No example is provided here, as it may change in the near future.
preview_row_text
preview_row_text (df:pandas.core.frame.DataFrame, selected_index:int, text_column:str='text', limit:int=-1)
Output the text fields of a row in the DataFrame
Type | Default | Details | |
---|---|---|---|
df | DataFrame | DataFrame containing the data | |
selected_index | int | Index of the row to preview | |
text_column | str | text | column name for text field |
limit | int | -1 | Limit the length of the text field |
Loading example data for remaining code examples …
# for testing multi-class classification
X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data(label_column = 'style', target_labels = ['fitzgerald', 'hemingway', 'woolf'])
TODO: document …
preview_splits
preview_splits (X_train, y_train, X_test, y_test, label_names=None, target_classes=None, target_names=None)
Display the number of samples in each class for train and test sets.
preview_splits(X_train, y_train, X_test, y_test, target_names = target_names, target_classes = target_classes)
Train: 4407 samples, 3 classes
label_name | count | |
---|---|---|
0 | ||
3 | woolf | 1469 |
4 | fitzgerald | 1469 |
5 | hemingway | 1469 |
Test: 1877 samples, 3 classes
label_name | count | |
---|---|---|
0 | ||
4 | fitzgerald | 885 |
5 | hemingway | 504 |
3 | woolf | 488 |
feature_store = TextFeatureStore('feature_store_example_report.sqlite')

pipeline = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store, pos_tagset = 'detailed')),
    ('pos', POSVectorizer(feature_store=feature_store, ngram_range = (1, 1))),
    ('scaler', StandardScaler(with_mean=False)),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)

display(pipeline)
Pipeline(steps=[('preprocessor', SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>, pos_tagset='detailed')), ('pos', POSVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)), ('scaler', StandardScaler(with_mean=False)), ('classifier', LogisticRegression(max_iter=5000, random_state=55))], verbose=True)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
[Pipeline] ...... (step 1 of 4) Processing preprocessor, total= 11.5s
[Pipeline] ............... (step 2 of 4) Processing pos, total= 0.2s
[Pipeline] ............ (step 3 of 4) Processing scaler, total= 0.0s
[Pipeline] ........ (step 4 of 4) Processing classifier, total= 0.1s
plt_svg
plt_svg (fig:matplotlib.figure.Figure=None)
Display an SVG in a notebook with save functionality (see note)
Type | Default | Details | |
---|---|---|---|
fig | Figure | None | Optional figure to display, if None uses current figure |
The plt_svg function is based on code from this gist, amended so that text in the SVG is rendered as text.
TODO: document …
plot_confusion_matrix
plot_confusion_matrix (y_test, y_predicted, target_classes, target_names, figsize=(10, 8), renderer:str='svg', title:str|None=None)
Output a confusion matrix with counts and proportions and appropriate labels.
Type | Default | Details | |
---|---|---|---|
y_test | |||
y_predicted | |||
target_classes | |||
target_names | |||
figsize | tuple | (10, 8) | |
renderer | str | svg | ‘svg’ or ‘img’ |
title | str | None | None | Title for the plot |
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
precision recall f1-score support
fitzgerald 0.748 0.538 0.626 885
hemingway 0.567 0.819 0.670 504
woolf 0.585 0.615 0.599 488
accuracy 0.633 1877
macro avg 0.634 0.657 0.632 1877
weighted avg 0.657 0.633 0.631 1877
TODO: document …
save_results
save_results (results_file, pipeline, experiment_descriptor, dataset_descriptor, y_test, y_pred, target_classes, target_names, classifier_step_name='classifier')
Save results from an experiment
experiment_descriptor = 'POS features, Logistic Regression'
dataset_descriptor = 'hallisky/AuthorMix dataset, authorship classification'
results_file = 'results.csv'

save_results(results_file, pipeline, experiment_descriptor, dataset_descriptor, y_test, y_pred, target_classes, target_names, classifier_step_name = 'classifier')
experiment | dataset | classifier | parameters | accuracy_f1 | macro_precision | macro_recall | macro_f1 | weighted_precision | weighted_recall | weighted_f1 | fitzgerald_precision | fitzgerald_recall | fitzgerald_f1 | hemingway_precision | hemingway_recall | hemingway_f1 | woolf_precision | woolf_recall | woolf_f1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | POS features, Logistic Regression | hallisky/AuthorMix dataset, authorship classif... | LogisticRegression | {'memory': None, 'steps': [('preprocessor', Sp... | 0.633458 | 0.63351 | 0.657351 | 0.63192 | 0.657252 | 0.633458 | 0.630976 | 0.748428 | 0.537853 | 0.625904 | 0.567308 | 0.819444 | 0.670455 | 0.584795 | 0.614754 | 0.599401 |
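The macro and weighted columns in the saved row can be recomputed from the per-class columns, which is a quick way to check what each average means. The sketch below uses the per-class precision and recall from the row above and the support counts from the classification report. The helper names are hypothetical, and small differences in the last digit are expected because the inputs are already rounded.

```python
# Per-class scores copied from the saved results row; support is taken
# from the classification report for the same run.
per_class = {
    "fitzgerald": {"precision": 0.748428, "recall": 0.537853, "support": 885},
    "hemingway": {"precision": 0.567308, "recall": 0.819444, "support": 504},
    "woolf": {"precision": 0.584795, "recall": 0.614754, "support": 488},
}


def macro_average(metric):
    """Unweighted mean over classes: every class counts equally."""
    values = [scores[metric] for scores in per_class.values()]
    return sum(values) / len(values)


def weighted_average(metric):
    """Mean over classes weighted by the number of test samples per class."""
    total = sum(scores["support"] for scores in per_class.values())
    weighted_sum = sum(scores[metric] * scores["support"] for scores in per_class.values())
    return weighted_sum / total


print("macro precision:", round(macro_average("precision"), 6))
print("macro recall:", round(macro_average("recall"), 6))
print("weighted precision:", round(weighted_average("precision"), 6))
print("weighted recall:", round(weighted_average("recall"), 6))
```

Weighted recall works out to the same value as accuracy, which is why the two columns match in the saved row.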
The following code is in development.
TODO: document …
plot_confusion_matrices
plot_confusion_matrices (tests, predictions, target_classes, target_names, model_names=None, n_col=2, figsize=(16, 8), renderer:str='svg', title:str|None=None)
Plot grid of confusion matrices for multiple models.
Type | Default | Details | |
---|---|---|---|
tests | |||
predictions | |||
target_classes | |||
target_names | |||
model_names | NoneType | None | Optional: list of names for each prediction |
n_col | int | 2 | |
figsize | tuple | (16, 8) | |
renderer | str | svg | ‘svg’ or ‘img’ |
title | str | None | None |
pipeline1 = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store, pos_tagset = 'detailed')),
    ('pos', POSVectorizer(feature_store=feature_store, ngram_range = (1, 1))),
    ('scaler', StandardScaler(with_mean=False)),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)

experiment_descriptor = 'POS features, Unigrams, Logistic Regression'

pipeline1.fit(X_train, y_train)
y_pred1 = pipeline1.predict(X_test)
results = save_results(results_file, pipeline1, experiment_descriptor, dataset_descriptor, y_test, y_pred1, target_classes, target_names, classifier_step_name = 'classifier')

pipeline2 = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store, pos_tagset = 'detailed')),
    ('pos', POSVectorizer(feature_store=feature_store, ngram_range = (2, 2))),
    ('scaler', StandardScaler(with_mean=False)),
    ('classifier', LogisticRegression(max_iter=5000, random_state=55))
], verbose=True)

experiment_descriptor = 'POS features, Bigrams, Logistic Regression'

pipeline2.fit(X_train, y_train)
y_pred2 = pipeline2.predict(X_test)
results = save_results(results_file, pipeline2, experiment_descriptor, dataset_descriptor, y_test, y_pred2, target_classes, target_names, classifier_step_name = 'classifier')
[Pipeline] ...... (step 1 of 4) Processing preprocessor, total= 0.2s
[Pipeline] ............... (step 2 of 4) Processing pos, total= 0.2s
[Pipeline] ............ (step 3 of 4) Processing scaler, total= 0.0s
[Pipeline] ........ (step 4 of 4) Processing classifier, total= 0.1s
[Pipeline] ...... (step 1 of 4) Processing preprocessor, total= 0.1s
[Pipeline] ............... (step 2 of 4) Processing pos, total= 0.3s
[Pipeline] ............ (step 3 of 4) Processing scaler, total= 0.0s
[Pipeline] ........ (step 4 of 4) Processing classifier, total= 0.2s
plot_confusion_matrices([y_test, y_test], [y_pred1, y_pred2], target_classes, target_names, model_names=['Logistic Regression (1-gram)', 'Logistic Regression (2-gram)'], n_col=2, figsize=(16, 8), renderer='svg', title='Confusion Matrices for Different Models')
get_classifier_feature_names_in
get_classifier_feature_names_in (pipeline:sklearn.pipeline.Pipeline, classifier_step_name='classifier')
Get the feature names that were the input to the classifier step from a fitted pipeline.
Type | Default | Details | |
---|---|---|---|
pipeline | Pipeline | fitted pipeline | |
classifier_step_name | str | classifier | name of the classifier step in pipeline |
print(get_classifier_feature_names_in(pipeline, classifier_step_name = 'classifier'))
["''" ',' '-LRB-' '-RRB-' '.' ':' 'ADD' 'CC' 'CD' 'DT' 'EX' 'FW' 'HYPH'
'IN' 'JJ' 'JJR' 'JJS' 'LS' 'MD' 'NFP' 'NN' 'NNP' 'NNPS' 'NNS' 'PDT' 'POS'
'PRP' 'PRP$' 'RB' 'RBR' 'RBS' 'RP' 'SYM' 'TO' 'UH' 'VB' 'VBD' 'VBG' 'VBN'
'VBP' 'VBZ' 'WDT' 'WP' 'WP$' 'WRB' 'XX' '_SP' '``']
plot_logistic_regression_features_from_pipeline
plot_logistic_regression_features_from_pipeline (pipeline, target_classes, target_names, top_n=20, classifier_step_name='classifier', features_step_name='features', renderer='svg')
Plot the most discriminative features for a logistic regression classifier in a fitted pipeline.
plot_logistic_regression_features_from_pipeline(pipeline, target_classes, target_names, top_n=10, classifier_step_name='classifier', features_step_name='features')
Feature | Log Odds (Logit) | Odds Ratio | |
---|---|---|---|
5 | : | 1.120742 | 3.067128 |
1 | , | 0.940030 | 2.560059 |
13 | IN | 0.787940 | 2.198862 |
9 | DT | -0.586936 | 0.556029 |
26 | PRP | -0.396887 | 0.672410 |
20 | NN | -0.396228 | 0.672853 |
33 | TO | 0.332592 | 1.394578 |
4 | . | 0.241355 | 1.272973 |
35 | VB | -0.236384 | 0.789477 |
41 | WDT | 0.229801 | 1.258350 |
Feature | Log Odds (Logit) | Odds Ratio | |
---|---|---|---|
5 | : | 0.800501 | 2.226657 |
20 | NN | 0.732279 | 2.079814 |
9 | DT | -0.618647 | 0.538673 |
14 | JJ | 0.498787 | 1.646723 |
36 | VBD | -0.491005 | 0.612011 |
4 | . | -0.460041 | 0.631258 |
1 | , | -0.418957 | 0.657732 |
40 | VBZ | -0.312551 | 0.731578 |
26 | PRP | 0.302341 | 1.353022 |
13 | IN | 0.268820 | 1.308420 |
Feature | Log Odds (Logit) | Odds Ratio | |
---|---|---|---|
5 | : | -1.921243 | 0.146425 |
9 | DT | 1.205582 | 3.338703 |
13 | IN | -1.056760 | 0.347580 |
1 | , | -0.521073 | 0.593883 |
33 | TO | -0.404040 | 0.667617 |
35 | VB | 0.373301 | 1.452521 |
14 | JJ | -0.346658 | 0.707047 |
20 | NN | -0.336050 | 0.714587 |
43 | WP$ | -0.326951 | 0.721119 |
36 | VBD | 0.318487 | 1.375046 |
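The two numeric columns in these tables carry the same information: the odds ratio is the exponential of the log odds, so a negative coefficient gives an odds ratio below 1 and a positive one gives a ratio above 1. The sketch below checks this against a few rows copied from the first table. The function name odds_ratio is hypothetical.

```python
import math


def odds_ratio(log_odds):
    """Convert a logistic regression coefficient (log odds) to an odds ratio."""
    return math.exp(log_odds)


# (feature, log odds, reported odds ratio) copied from the first table above
rows = [
    (":", 1.120742, 3.067128),
    ("DT", -0.586936, 0.556029),
    ("NN", -0.396228, 0.672853),
]

for feature, log_odds, reported in rows:
    computed = odds_ratio(log_odds)
    print(f"{feature}: computed={computed:.6f} reported={reported:.6f}")
```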
X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data()

pipeline = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store, pos_tagset = 'detailed')),
    ('pos', POSVectorizer(feature_store=feature_store, ngram_range = (1, 1))),
    ('classifier', DecisionTreeClassifier(max_depth=4, random_state=55))
], verbose=True)

display(pipeline)
Pipeline(steps=[('preprocessor', SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>, pos_tagset='detailed')), ('pos', POSVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)), ('classifier', DecisionTreeClassifier(max_depth=4, random_state=55))], verbose=True)
experiment_descriptor = 'POS features, Decision Tree Classifier'

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
results = save_results(results_file, pipeline, experiment_descriptor, dataset_descriptor, y_test, y_pred, target_classes, target_names, classifier_step_name = 'classifier')
[Pipeline] ...... (step 1 of 3) Processing preprocessor, total= 20.5s
[Pipeline] ............... (step 2 of 3) Processing pos, total= 0.7s
[Pipeline] ........ (step 3 of 3) Processing classifier, total= 0.0s
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
precision recall f1-score support
blog 0.634 0.583 0.608 981
author 0.847 0.790 0.817 1877
speech 0.614 0.785 0.689 740
accuracy 0.732 3598
macro avg 0.698 0.719 0.705 3598
weighted avg 0.741 0.732 0.734 3598
plot_decision_tree_from_pipeline
plot_decision_tree_from_pipeline (pipeline, X_train, y_train, target_classes, target_names, classifier_step_name='classifier', features_step_name='features')
Plot the decision tree of the classifier from a pipeline using SuperTree
Type | Default | Details | |
---|---|---|---|
pipeline | The pipeline containing the classifier | ||
X_train | The training data | ||
y_train | The training labels | ||
target_classes | The target classes | ||
target_names | The target names | ||
classifier_step_name | str | classifier | The name of the classifier step in the pipeline |
features_step_name | str | features | The name of the final preprocessing step, probably the name of the step prior to the classifier |
SuperTree creates interactive decision tree visualisations.
plot_decision_tree_from_pipeline(pipeline, X_train, y_train, target_classes, target_names, classifier_step_name = 'classifier', features_step_name = 'pos')
get_selected_feature_names
get_selected_feature_names (pipeline, features_step_name='features', selector_step_name='selector')
Get the selected features from the pipeline (Deprecated).
Type | Default | Details | |
---|---|---|---|
pipeline | the pipeline to get the feature names from | ||
features_step_name | str | features | the name of the step in the pipeline that contains the features |
selector_step_name | str | selector | the name of the step in the pipeline that contains the selector |
Returns | list | returns a list of the selected feature names |
If a pipeline uses feature selection, here is how to get the selected features. First, here is a pipeline with a selector. A more complex pipeline is used for demonstration …
pipeline = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store, pos_tagset = 'detailed')),
    ('features', FeatureUnion([
        ('pos_unigrams', POSVectorizer(feature_store=feature_store, ngram_range = (1, 1))),
        ('textstats', Pipeline([
            ('textstats_vectors', TextstatsTransformer(feature_store=feature_store)),
            ('selector', SelectKBest(mutual_info_classif, k=5)),
        ], verbose=True)),
    ], verbose=True)),
    ('scaler', StandardScaler(with_mean=False)),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)

display(pipeline)
Pipeline(steps=[('preprocessor', SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>, pos_tagset='detailed')), ('features', FeatureUnion(transformer_list=[('pos_unigrams', POSVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)), ('textstats', Pipeline(steps=[('textstats_vectors', TextstatsTransformer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)), ('selector', SelectKBest(k=5, score_func=<function mutual_info_classif at 0x7efb5a59e020>))], verbose=True))], verbose=True)), ('scaler', StandardScaler(with_mean=False)), ('classifier', LogisticRegression(max_iter=5000, random_state=55))], verbose=True)
The functions to preview features require a fitted pipeline …
pipeline.fit(X_train, y_train)
[Pipeline] ...... (step 1 of 4) Processing preprocessor, total= 0.3s
[FeatureUnion] .. (step 1 of 2) Processing pos_unigrams, total= 0.7s
[Pipeline] . (step 1 of 2) Processing textstats_vectors, total= 0.3s
[Pipeline] .......... (step 2 of 2) Processing selector, total= 0.2s
[FeatureUnion] ..... (step 2 of 2) Processing textstats, total= 0.5s
[Pipeline] .......... (step 2 of 4) Processing features, total= 1.2s
[Pipeline] ............ (step 3 of 4) Processing scaler, total= 0.0s
[Pipeline] ........ (step 4 of 4) Processing classifier, total= 0.6s
Pipeline(steps=[('preprocessor', SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>, pos_tagset='detailed')), ('features', FeatureUnion(transformer_list=[('pos_unigrams', POSVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)), ('textstats', Pipeline(steps=[('textstats_vectors', TextstatsTransformer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)), ('selector', SelectKBest(k=5, score_func=<function mutual_info_classif at 0x7efb5a59e020>))], verbose=True))], verbose=True)), ('scaler', StandardScaler(with_mean=False)), ('classifier', LogisticRegression(max_iter=5000, random_state=55))], verbose=True)
preview_selected_features
preview_selected_features (pipeline, features_step_name='features', selector_step_name='selector')
Preview (i.e. print) the selected features from the pipeline (Deprecated - this will be removed in 0.0.10).
Type | Default | Details | |
---|---|---|---|
pipeline | the pipeline to preview the selected features from | ||
features_step_name | str | features | the name of the step in the pipeline that contains the features |
selector_step_name | str | selector | the name of the step in the pipeline that contains the selector |
This functionality is new in v0.0.9.
preview_pipeline_features
preview_pipeline_features (pipeline:sklearn.pipeline.Pipeline)
Outputs the features at each step in a pipeline.
Type | Details | |
---|---|---|
pipeline | Pipeline | pipeline to preview |
To see how features are extracted and/or selected at each stage of a pipeline, use preview_pipeline_features. Expand each pipeline step to see the features output by that pipeline step.
preview_pipeline_features(pipeline)
preprocessor SpacyPreprocessor
This step receives and returns text.
features FeatureUnion
pos_unigrams POSVectorizer
Features Out (49)
$, '', ,, -LRB-, -RRB-, ., :, ADD, CC, CD, DT, EX, FW, HYPH, IN, JJ, JJR, JJS, LS, MD, NFP, NN, NNP, NNPS, NNS, PDT, POS, PRP, PRP$, RB, RBR, RBS, RP, SYM, TO, UH, VB, VBD, VBG, VBN, VBP, VBZ, WDT, WP, WP$, WRB, XX, _SP, ``
textstats_vectors TextstatsTransformer
Features Out (12)
tokens_count, sentences_count, characters_count, monosyllabic_words_relfreq, polysyllabic_words_relfreq, unique_tokens_relfreq, average_characters_per_token, average_tokens_per_sentence, characters_proportion_letters, characters_proportion_uppercase, hapax_legomena_count, hapax_legomena_to_unique
selector SelectKBest
Features Out (5)
tokens_count, characters_count, polysyllabic_words_relfreq, unique_tokens_relfreq, average_tokens_per_sentence
scaler StandardScaler
Features Out (54)
pos_unigrams__$, pos_unigrams__'', pos_unigrams__,, pos_unigrams__-LRB-, pos_unigrams__-RRB-, pos_unigrams__., pos_unigrams__:, pos_unigrams__ADD, pos_unigrams__CC, pos_unigrams__CD, pos_unigrams__DT, pos_unigrams__EX, pos_unigrams__FW, pos_unigrams__HYPH, pos_unigrams__IN, pos_unigrams__JJ, pos_unigrams__JJR, pos_unigrams__JJS, pos_unigrams__LS, pos_unigrams__MD, pos_unigrams__NFP, pos_unigrams__NN, pos_unigrams__NNP, pos_unigrams__NNPS, pos_unigrams__NNS, pos_unigrams__PDT, pos_unigrams__POS, pos_unigrams__PRP, pos_unigrams__PRP$, pos_unigrams__RB, pos_unigrams__RBR, pos_unigrams__RBS, pos_unigrams__RP, pos_unigrams__SYM, pos_unigrams__TO, pos_unigrams__UH, pos_unigrams__VB, pos_unigrams__VBD, pos_unigrams__VBG, pos_unigrams__VBN, pos_unigrams__VBP, pos_unigrams__VBZ, pos_unigrams__WDT, pos_unigrams__WP, pos_unigrams__WP$, pos_unigrams__WRB, pos_unigrams__XX, pos_unigrams___SP, pos_unigrams__``, textstats__tokens_count, textstats__characters_count, textstats__polysyllabic_words_relfreq, textstats__unique_tokens_relfreq, textstats__average_tokens_per_sentence
classifier LogisticRegression
Features In (54)
pos_unigrams__$, pos_unigrams__'', pos_unigrams__,, pos_unigrams__-LRB-, pos_unigrams__-RRB-, pos_unigrams__., pos_unigrams__:, pos_unigrams__ADD, pos_unigrams__CC, pos_unigrams__CD, pos_unigrams__DT, pos_unigrams__EX, pos_unigrams__FW, pos_unigrams__HYPH, pos_unigrams__IN, pos_unigrams__JJ, pos_unigrams__JJR, pos_unigrams__JJS, pos_unigrams__LS, pos_unigrams__MD, pos_unigrams__NFP, pos_unigrams__NN, pos_unigrams__NNP, pos_unigrams__NNPS, pos_unigrams__NNS, pos_unigrams__PDT, pos_unigrams__POS, pos_unigrams__PRP, pos_unigrams__PRP$, pos_unigrams__RB, pos_unigrams__RBR, pos_unigrams__RBS, pos_unigrams__RP, pos_unigrams__SYM, pos_unigrams__TO, pos_unigrams__UH, pos_unigrams__VB, pos_unigrams__VBD, pos_unigrams__VBG, pos_unigrams__VBN, pos_unigrams__VBP, pos_unigrams__VBZ, pos_unigrams__WDT, pos_unigrams__WP, pos_unigrams__WP$, pos_unigrams__WRB, pos_unigrams__XX, pos_unigrams___SP, pos_unigrams__``, textstats__tokens_count, textstats__characters_count, textstats__polysyllabic_words_relfreq, textstats__unique_tokens_relfreq, textstats__average_tokens_per_sentence
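The 54 names reaching the scaler and classifier are the 49 POS features plus the 5 selected text statistics, each prefixed with the name of the FeatureUnion branch it came from and a double underscore. The sketch below reproduces that naming convention in plain Python. The function union_feature_names is hypothetical and the POS list is shortened to three tags for illustration.

```python
def union_feature_names(branches):
    """Prefix each feature with its branch name, as in the output above."""
    return [
        f"{branch_name}__{feature}"
        for branch_name, features in branches
        for feature in features
    ]


# A shortened POS list (the real branch outputs 49 features) and the
# five text statistics kept by the selector in the output above
pos_features = ["DT", "IN", "NN"]
selected_textstats = [
    "tokens_count",
    "characters_count",
    "polysyllabic_words_relfreq",
    "unique_tokens_relfreq",
    "average_tokens_per_sentence",
]

names = union_feature_names([
    ("pos_unigrams", pos_features),
    ("textstats", selected_textstats),
])
print(len(names))
print(names)
```

The branch order in the list determines the column order, which is why the POS features come before the text statistics.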
pd.read_csv('results.csv')
experiment | dataset | classifier | parameters | accuracy_f1 | macro_precision | macro_recall | macro_f1 | weighted_precision | weighted_recall | ... | woolf_f1 | blog_precision | blog_recall | blog_f1 | author_precision | author_recall | author_f1 | speech_precision | speech_recall | speech_f1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | POS features, Logistic Regression | hallisky/AuthorMix dataset, authorship classif... | LogisticRegression | {'memory': None, 'steps': [('preprocessor', Sp... | 0.633458 | 0.633510 | 0.657351 | 0.631920 | 0.657252 | 0.633458 | ... | 0.599401 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | POS features, Unigrams, Logistic Regression | hallisky/AuthorMix dataset, authorship classif... | LogisticRegression | {'memory': None, 'steps': [('preprocessor', Sp... | 0.633458 | 0.633510 | 0.657351 | 0.631920 | 0.657252 | 0.633458 | ... | 0.599401 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | POS features, Bigrams, Logistic Regression | hallisky/AuthorMix dataset, authorship classif... | LogisticRegression | {'memory': None, 'steps': [('preprocessor', Sp... | 0.636121 | 0.629850 | 0.649835 | 0.634696 | 0.649218 | 0.636121 | ... | 0.583411 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | POS features, Decision Tree Classifier | hallisky/AuthorMix dataset, authorship classif... | DecisionTreeClassifier | {'memory': None, 'steps': [('preprocessor', Sp... | 0.732351 | 0.698335 | 0.719257 | 0.704589 | 0.741123 | 0.732351 | ... | NaN | 0.634146 | 0.583078 | 0.607541 | 0.847341 | 0.789558 | 0.81743 | 0.613516 | 0.785135 | 0.688797 |
4 rows × 29 columns