report

Functions to help understand text data and evaluate text classification models.

The reporting functionality is in active development. Documentation is also in development, but should reflect the latest version.

preview_dataset

 preview_dataset (dataset:datasets.arrow_dataset.Dataset|datasets.dataset_dict.DatasetDict)

Output information about a Huggingface dataset.

	Type	Details
dataset	datasets.arrow_dataset.Dataset \| datasets.dataset_dict.DatasetDict	A Huggingface dataset or dataset dict, typically the result of load_dataset()

The preview_dataset function summarises a dataset: its splits, the field types in each split, and the number of unique values in each field. Expand a split to get more information. The notices printed after the summary identify probable text and label columns, and flag label candidates that should be cast to ClassLabel.

from datasets import load_dataset

dataset = load_dataset('hallisky/AuthorMix')

preview_dataset(dataset)

Split: train (14579 samples)

Available fields: style, text, category

Field 'style' has 14 unique values
```
Value(dtype='string', id=None)
```
Field 'text' has 14489 unique values
```
Value(dtype='string', id=None)
```
Field 'category' has 4 unique values
```
Value(dtype='string', id=None)
```

Split: validation (3642 samples)

Available fields: style, text, category

Field 'style' has 14 unique values
```
Value(dtype='string', id=None)
```
Field 'text' has 3636 unique values
```
Value(dtype='string', id=None)
```
Field 'category' has 4 unique values
```
Value(dtype='string', id=None)
```

Split: test (4747 samples)

Available fields: style, text, category

Field 'style' has 14 unique values
```
Value(dtype='string', id=None)
```
Field 'text' has 4729 unique values
```
Value(dtype='string', id=None)
```
Field 'category' has 4 unique values
```
Value(dtype='string', id=None)
```

Warnings/Notices

Field 'style' appears to be a label column and should probably be cast as ClassLabel with cast_column_to_label(dataset, 'style').
- Unique counts are identical (14)
- Unique counts are a low proportion of total rows.
- Unique values are identical between splits: ('blog11518', 'blog25872', 'blog30102', 'blog30407', 'blog5546', 'bush', 'fitzgerald', 'h', 'hemingway', 'obama', 'pp', 'qq', 'trump', 'woolf')
Field 'text' appears to be a text column.
Field 'category' appears to be a label column and should probably be cast as ClassLabel with cast_column_to_label(dataset, 'category').
- Unique counts are identical (4)
- Unique counts are a low proportion of total rows.
- Unique values are identical between splits: ('amt', 'author', 'blog', 'speech')
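
The notices above point to three checks: a field looks like a label column when its unique-value count is the same in every split, is small relative to the number of rows, and the values themselves match across splits. The sketch below applies those checks to plain Python lists. It is an illustration only, not the library's implementation: `looks_like_label_column`, the `max_unique_ratio` threshold of 0.05, and the toy data are all assumptions.

```python
def looks_like_label_column(splits, max_unique_ratio=0.05):
    """Return True if a field looks like a label column.

    `splits` maps split name -> list of values for one field.
    Hypothetical helper; the 0.05 threshold is an assumed cut-off.
    """
    unique_per_split = {name: set(values) for name, values in splits.items()}
    unique_sets = list(unique_per_split.values())

    # Check 1: every split has the same number of unique values.
    same_counts = len({len(u) for u in unique_sets}) == 1
    # Check 2: unique values are a low proportion of the rows in each split.
    low_ratio = all(
        len(unique_per_split[name]) / len(values) <= max_unique_ratio
        for name, values in splits.items()
    )
    # Check 3: the unique values are identical between splits.
    same_values = all(u == unique_sets[0] for u in unique_sets)
    return same_counts and low_ratio and same_values


# Toy data standing in for one field of a dataset with two splits.
category = {
    "train": ["blog", "speech", "author", "amt"] * 25,
    "test": ["amt", "author", "blog", "speech"] * 25,
}
text = {
    "train": [f"document {i}" for i in range(100)],
    "test": [f"another document {i}" for i in range(100)],
}

print(looks_like_label_column(category))  # True
print(looks_like_label_column(text))      # False
```

A field that fails these checks is not necessarily free text, so a real heuristic would probably look at value lengths as well.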

Some functions for working with datasets follow. The first casts a column to a label column, so the same code workflow applies regardless of how the dataset’s creator defined the labels. It currently assumes a DatasetDict; it may be altered in the future to work with Dataset objects as well.

cast_column_to_label

 cast_column_to_label (dataset:datasets.dataset_dict.DatasetDict,
                       label_column:str)

Cast a column to a ClassLabel.

	Type	Details
dataset	DatasetDict	A Huggingface dataset dict, typically the result of load_dataset()
label_column	str	The name of the column to cast to ClassLabel
Returns	DatasetDict

The second gets the text representation of label names …

get_label_names

 get_label_names (dataset:datasets.dataset_dict.DatasetDict|datasets.arrow_dataset.Dataset,
                  label_column:str)

Get label names from field in a Huggingface dataset.

	Type	Details
dataset	datasets.dataset_dict.DatasetDict \| datasets.arrow_dataset.Dataset	A Huggingface dataset or dataset dict, typically the result of load_dataset()
label_column	str	The name of the column to get the label names from
Returns	list	list of label names

An example of casting columns as labels and getting label names …

cast_column_to_label(dataset, 'category')
cast_column_to_label(dataset, 'style')
print(get_label_names(dataset, label_column = 'category'))
print(get_label_names(dataset, label_column = 'style'))

['speech', 'author', 'blog', 'amt']
['obama', 'bush', 'trump', 'woolf', 'fitzgerald', 'hemingway', 'blog5546', 'blog11518', 'blog25872', 'blog30102', 'blog30407', 'h', 'pp', 'qq']
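
Once a column is cast to ClassLabel, each row stores an integer and the names are kept in an ordered list, so a name's position in the list is its label ID. A minimal plain-Python sketch of that mapping, using the `category` names printed above (the helper variables are illustrative and not part of the library):

```python
# Label names in ID order, as printed for the 'category' column above.
label_names = ['speech', 'author', 'blog', 'amt']

# Name -> integer ID; going back is just indexing into the list.
name_to_id = {name: idx for idx, name in enumerate(label_names)}

encoded = [name_to_id[name] for name in ['blog', 'amt', 'speech']]
decoded = [label_names[idx] for idx in encoded]

print(encoded)  # [2, 3, 0]
print(decoded)  # ['blog', 'amt', 'speech']
```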

Output of preview_dataset after columns have been cast …

preview_dataset(dataset)

Split: train (14579 samples)

Available fields: style, text, category

Field 'style' has 14 unique values

ClassLabel(names=['obama', 'bush', 'trump', 'woolf', 'fitzgerald', 'hemingway', 'blog5546', 'blog11518', 'blog25872', 'blog30102', 'blog30407', 'h', 'pp', 'qq'], id=None)

Field 'text' has 14489 unique values
```
Value(dtype='string', id=None)
```

Field 'category' has 4 unique values

ClassLabel(names=['speech', 'author', 'blog', 'amt'], id=None)

Split: validation (3642 samples)

Available fields: style, text, category

Field 'style' has 14 unique values

ClassLabel(names=['obama', 'bush', 'trump', 'woolf', 'fitzgerald', 'hemingway', 'blog5546', 'blog11518', 'blog25872', 'blog30102', 'blog30407', 'h', 'pp', 'qq'], id=None)

Field 'text' has 3636 unique values
```
Value(dtype='string', id=None)
```

Field 'category' has 4 unique values

ClassLabel(names=['speech', 'author', 'blog', 'amt'], id=None)

Split: test (4747 samples)

Available fields: style, text, category

Field 'style' has 14 unique values

ClassLabel(names=['obama', 'bush', 'trump', 'woolf', 'fitzgerald', 'hemingway', 'blog5546', 'blog11518', 'blog25872', 'blog30102', 'blog30407', 'h', 'pp', 'qq'], id=None)

Field 'text' has 4729 unique values
```
Value(dtype='string', id=None)
```

Field 'category' has 4 unique values

ClassLabel(names=['speech', 'author', 'blog', 'amt'], id=None)

Notices

Field 'style' is a label column (ClassLabel).
Field 'text' appears to be a text column.
Field 'category' is a label column (ClassLabel).

preview_label_counts

 preview_label_counts (df, label_column, label_names)

Preview label counts from a dataframe (this will be made an internal function in a future version - use preview_split_by_label_column instead).

preview_split_by_label_column

 preview_split_by_label_column (dataset:datasets.dataset_dict.DatasetDict,
                                label_column:str)

Output label counts per split for a Huggingface dataset.

	Type	Details
dataset	DatasetDict	A Huggingface dataset dict, typically the result of load_dataset()
label_column	str	The name of the column to preview

Here is how to preview counts of a specific label column for each split in a dataset. This currently assumes input is a DatasetDict. It may be altered in the future to work with Dataset objects as well.

label_column = 'style'
preview_split_by_label_column(dataset, label_column)

	label_name	count
style
0	obama	1168
1	bush	619
2	trump	1361
3	woolf	1469
4	fitzgerald	2658
5	hemingway	1516
6	blog5546	904
7	blog11518	2889
8	blog25872	336
9	blog30102	505
10	blog30407	912
11	h	77
12	pp	93
13	qq	72

	label_name	count
style
0	obama	273
1	bush	139
2	trump	328
3	woolf	488
4	fitzgerald	885
5	hemingway	504
6	blog5546	160
7	blog11518	510
8	blog25872	60
9	blog30102	90
10	blog30407	161
11	h	14
12	pp	17
13	qq	13

	label_name	count
style
0	obama	475
1	bush	251
2	trump	558
3	woolf	488
4	fitzgerald	885
5	hemingway	504
6	blog5546	210
7	blog11518	677
8	blog25872	142
9	blog30102	217
10	blog30407	143
11	h	45
12	pp	85
13	qq	67
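
For a quick check without the library, the same kind of per-split tally can be sketched with `collections.Counter`. The toy `splits` dictionary and the `count_labels_per_split` helper are assumptions for illustration; the real function works on a DatasetDict and renders one table per split.

```python
from collections import Counter


def count_labels_per_split(splits, label_names):
    """Count rows per label for each split, in label ID order.

    `splits` maps split name -> list of integer label IDs (hypothetical input).
    """
    counts = {}
    for split_name, label_ids in splits.items():
        tally = Counter(label_ids)
        # Report every known label, including any with zero rows in this split.
        counts[split_name] = [
            (label_id, name, tally.get(label_id, 0))
            for label_id, name in enumerate(label_names)
        ]
    return counts


label_names = ['obama', 'bush', 'trump']
splits = {
    'train': [0, 0, 1, 2, 2, 2],
    'test': [0, 2],
}

split_counts = count_labels_per_split(splits, label_names)
for split_name, rows in split_counts.items():
    print(split_name)
    for label_id, name, count in rows:
        print(f"{label_id}\t{name}\t{count}")
```

Listing zero counts explicitly makes it easier to spot a label that is missing from one split.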

preview_text_field

 preview_text_field (text:str, width:int=80)

Display a text field, wrapping the text to width characters (80 by default). This may be moved to an internal function in a future version.

	Type	Default	Details
text	str		Text to preview
width	int	80	Width to wrap the text to

preview_text_field(dataset['train'][0]['text'])

You see it in Melinda Lopez, who came to her family's old home. And as she was
walking the streets, an elderly woman recognized her as her mother's daughter,
and began to cry. She took her into her home and showed her a pile of photos
that included Melinda's baby picture, which her mother had sent 50 years ago.
Melinda later said, "So many of us are now getting so much back."
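
Python's standard `textwrap` module produces this kind of wrapping. The sketch below is a stand-in to show the effect of `width`; whether preview_text_field uses `textwrap` internally is not stated here, and `wrap_text` is a hypothetical name.

```python
import textwrap


def wrap_text(text, width=80):
    """Wrap text so that no line is longer than `width` characters."""
    return textwrap.fill(text, width=width)


sample = (
    "You see it in Melinda Lopez, who came to her family's old home. "
    "And as she was walking the streets, an elderly woman recognized "
    "her as her mother's daughter, and began to cry."
)

wrapped = wrap_text(sample, width=40)
print(wrapped)
```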

The preview_row_text function below may be changed in future to work with Dataset/DatasetDict objects. It currently assumes you’ve done something like dataset.to_pandas() to get a pandas DataFrame. No example is provided here, as it may change in the near future.

preview_row_text

 preview_row_text (df:pandas.core.frame.DataFrame, selected_index:int,
                   text_column:str='text', limit:int=-1)

Output the text fields of a row in the DataFrame

	Type	Default	Details
df	DataFrame		DataFrame containing the data
selected_index	int		Index of the row to preview
text_column	str	text	column name for text field
limit	int	-1	Limit the length of the text field

Loading example data for remaining code examples …

# for testing multi-class classification
X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data(label_column = 'style', target_labels = ['fitzgerald', 'hemingway', 'woolf'])

TODO: document …

preview_splits

 preview_splits (X_train, y_train, X_test, y_test, label_names=None,
                 target_classes=None, target_names=None)

Display the number of samples in each class for train and test sets.

preview_splits(X_train, y_train, X_test, y_test, target_names = target_names, target_classes = target_classes)

Train: 4407 samples, 3 classes

	label_name	count
0
3	woolf	1469
4	fitzgerald	1469
5	hemingway	1469

Test: 1877 samples, 3 classes

	label_name	count
0
4	fitzgerald	885
5	hemingway	504
3	woolf	488

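from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from textplumber.store import TextFeatureStore

feature_store = TextFeatureStore('feature_store_example_report.sqlite')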

pipeline = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store, pos_tagset = 'detailed')),
    ('pos', POSVectorizer(feature_store=feature_store, ngram_range = (1, 1))),
    ('scaler', StandardScaler(with_mean=False)),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)

display(pipeline)

Pipeline(steps=[('preprocessor',
                 SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>,
                                   pos_tagset='detailed')),
                ('pos',
                 POSVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)),
                ('scaler', StandardScaler(with_mean=False)),
                ('classifier',
                 LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

[Pipeline] ...... (step 1 of 4) Processing preprocessor, total=  11.5s
[Pipeline] ............... (step 2 of 4) Processing pos, total=   0.2s
[Pipeline] ............ (step 3 of 4) Processing scaler, total=   0.0s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   0.1s

plt_svg

 plt_svg (fig:matplotlib.figure.Figure=None)

Display an SVG in a notebook with save functionality (see note)

	Type	Default	Details
fig	Figure	None	Optional figure to display, if None uses current figure

The plt_svg function is based on code from this gist, but is amended to render text as text.

TODO: document …

plot_confusion_matrix

 plot_confusion_matrix (y_test, y_predicted, target_classes, target_names,
                        figsize=(10, 8), renderer:str='svg',
                        title:str|None=None)

Output a confusion matrix with counts and proportions and appropriate labels.

	Type	Default	Details
y_test
y_predicted
target_classes
target_names
figsize	tuple	(10, 8)
renderer	str	svg	‘svg’ or ‘img’
title	str \| None	None	Title for the plot
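
To make "counts and proportions" concrete, here is a small plain-Python sketch that tallies a confusion matrix and normalises each row by the number of true samples in that class. Row normalisation is an assumption about what the proportions mean, and `confusion_counts` and the toy labels are hypothetical; the plot itself is produced by the library.

```python
def confusion_counts(y_true, y_pred, classes):
    """Return (counts, proportions) with rows = true class, columns = predicted."""
    index = {label: i for i, label in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for true_label, pred_label in zip(y_true, y_pred):
        counts[index[true_label]][index[pred_label]] += 1

    proportions = []
    for row in counts:
        total = sum(row)
        # Guard against classes with no true samples.
        proportions.append([c / total if total else 0.0 for c in row])
    return counts, proportions


# Toy labels: 3 = woolf, 4 = fitzgerald, 5 = hemingway (IDs as in the splits above).
classes = [3, 4, 5]
y_true = [3, 3, 4, 4, 4, 5, 5, 5, 5, 3]
y_hat = [3, 4, 4, 4, 5, 5, 5, 3, 5, 3]

counts, proportions = confusion_counts(y_true, y_hat, classes)
for row_counts, row_props in zip(counts, proportions):
    print(row_counts, [round(p, 2) for p in row_props])
```

With rows normalised this way, the diagonal of the proportions matrix is the recall for each class.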

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)

              precision    recall  f1-score   support

  fitzgerald      0.748     0.538     0.626       885
   hemingway      0.567     0.819     0.670       504
       woolf      0.585     0.615     0.599       488

    accuracy                          0.633      1877
   macro avg      0.634     0.657     0.632      1877
weighted avg      0.657     0.633     0.631      1877
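
The averages in the report can be reproduced by hand, which helps when reading macro versus weighted figures. The per-class counts below (true positives and number of predictions) are back-calculated from the precision, recall, and support values shown above rather than read from the confusion matrix, so treat this as a worked example; the variable names are illustrative.

```python
# class -> (true positives, times predicted, support), back-calculated
# from the classification report above (an assumption, not library output).
per_class = {
    'fitzgerald': (476, 636, 885),
    'hemingway': (413, 728, 504),
    'woolf': (300, 513, 488),
}

total = sum(support for _, _, support in per_class.values())
metrics = {}
for name, (tp, predicted, support) in per_class.items():
    precision = tp / predicted
    recall = tp / support
    f1 = 2 * precision * recall / (precision + recall)
    metrics[name] = (precision, recall, f1, support)

accuracy = sum(tp for tp, _, _ in per_class.values()) / total
# Macro average: unweighted mean over classes.
macro_f1 = sum(m[2] for m in metrics.values()) / len(metrics)
# Weighted average: mean over classes, weighted by support.
weighted_f1 = sum(m[2] * m[3] for m in metrics.values()) / total

print(f"accuracy    {accuracy:.3f}")
print(f"macro f1    {macro_f1:.3f}")
print(f"weighted f1 {weighted_f1:.3f}")
```

The macro average treats the three authors equally, while the weighted average leans towards fitzgerald because that class has the most test samples.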

TODO: document …

save_results

 save_results (results_file, pipeline, experiment_descriptor,
               dataset_descriptor, y_test, y_pred, target_classes,
               target_names, classifier_step_name='classifier')

Save results from an experiment

experiment_descriptor = 'POS features, Logistic Regression'
dataset_descriptor = 'hallisky/AuthorMix dataset, authorship classification'
results_file = 'results.csv'

save_results(results_file, pipeline, experiment_descriptor, dataset_descriptor, y_test, y_pred, target_classes, target_names, classifier_step_name = 'classifier')

	experiment	dataset	classifier	parameters	accuracy_f1	macro_precision	macro_recall	macro_f1	weighted_precision	weighted_recall	weighted_f1	fitzgerald_precision	fitzgerald_recall	fitzgerald_f1	hemingway_precision	hemingway_recall	hemingway_f1	woolf_precision	woolf_recall	woolf_f1
0	POS features, Logistic Regression	hallisky/AuthorMix dataset, authorship classif...	LogisticRegression	{'memory': None, 'steps': [('preprocessor', Sp...	0.633458	0.63351	0.657351	0.63192	0.657252	0.633458	0.630976	0.748428	0.537853	0.625904	0.567308	0.819444	0.670455	0.584795	0.614754	0.599401
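
If results from several experiments are collected in one CSV file, a common pattern is to append a row per experiment and write the header only when the file is first created. The sketch below shows that pattern with the standard `csv` module. It is not the save_results implementation: `append_result`, `read_results`, the reduced column set, and the temporary file are assumptions for illustration, and the metric values are copied from reports on this page.

```python
import csv
import os
import tempfile

FIELDS = ['experiment', 'dataset', 'classifier', 'macro_f1']


def append_result(path, row):
    """Append one experiment row, writing the header if the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


def read_results(path):
    """Read all rows back as a list of dictionaries."""
    with open(path, newline='', encoding='utf-8') as handle:
        return list(csv.DictReader(handle))


# Use a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = os.path.join(tmp_dir, 'results_demo.csv')
    append_result(demo_file, {
        'experiment': 'POS features, Logistic Regression',
        'dataset': 'hallisky/AuthorMix dataset, authorship classification',
        'classifier': 'LogisticRegression',
        'macro_f1': 0.63192,
    })
    append_result(demo_file, {
        'experiment': 'POS features, Decision Tree Classifier',
        'dataset': 'hallisky/AuthorMix dataset, authorship classification',
        'classifier': 'DecisionTreeClassifier',
        'macro_f1': 0.705,
    })
    rows = read_results(demo_file)

print(len(rows), [row['classifier'] for row in rows])
```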

The following code is in development.

TODO: document …

plot_confusion_matrices

 plot_confusion_matrices (tests, predictions, target_classes,
                          target_names, model_names=None, n_col=2,
                          figsize=(16, 8), renderer:str='svg',
                          title:str|None=None)

Plot grid of confusion matrices for multiple models.

	Type	Default	Details
tests
predictions
target_classes
target_names
model_names	NoneType	None	Optional: list of names for each prediction
n_col	int	2
figsize	tuple	(16, 8)
renderer	str	svg	‘svg’ or ‘img’
title	str \| None	None

pipeline1 = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store, pos_tagset = 'detailed')),
    ('pos', POSVectorizer(feature_store=feature_store, ngram_range = (1, 1))),
    ('scaler', StandardScaler(with_mean=False)),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)

experiment_descriptor = 'POS features, Unigrams, Logistic Regression'
pipeline1.fit(X_train, y_train)
y_pred1 = pipeline1.predict(X_test)
results = save_results(results_file, pipeline1, experiment_descriptor, dataset_descriptor, y_test, y_pred1, target_classes, target_names, classifier_step_name = 'classifier')

pipeline2 = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store, pos_tagset = 'detailed')),
    ('pos', POSVectorizer(feature_store=feature_store, ngram_range = (2, 2))),
    ('scaler', StandardScaler(with_mean=False)),
    ('classifier', LogisticRegression(max_iter=5000, random_state=55))
], verbose=True)

experiment_descriptor = 'POS features, Bigrams, Logistic Regression'
pipeline2.fit(X_train, y_train)
y_pred2 = pipeline2.predict(X_test)
results = save_results(results_file, pipeline2, experiment_descriptor, dataset_descriptor, y_test, y_pred2, target_classes, target_names, classifier_step_name = 'classifier')

[Pipeline] ...... (step 1 of 4) Processing preprocessor, total=   0.2s
[Pipeline] ............... (step 2 of 4) Processing pos, total=   0.2s
[Pipeline] ............ (step 3 of 4) Processing scaler, total=   0.0s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   0.1s
[Pipeline] ...... (step 1 of 4) Processing preprocessor, total=   0.1s
[Pipeline] ............... (step 2 of 4) Processing pos, total=   0.3s
[Pipeline] ............ (step 3 of 4) Processing scaler, total=   0.0s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   0.2s

plot_confusion_matrices([y_test, y_test], [y_pred1, y_pred2], target_classes, target_names, model_names=['Logistic Regression (1-gram)', 'Logistic Regression (2-gram)'], n_col=2, figsize=(16, 8), renderer='svg', title='Confusion Matrices for Different Models')

get_classifier_feature_names_in

 get_classifier_feature_names_in (pipeline:sklearn.pipeline.Pipeline,
                                  classifier_step_name='classifier')

Get the feature names that were the input to the classifier step from a fitted pipeline.

	Type	Default	Details
pipeline	Pipeline		fitted pipeline
classifier_step_name	str	classifier	name of the classifier step in pipeline

print(get_classifier_feature_names_in(pipeline, classifier_step_name = 'classifier'))

["''" ',' '-LRB-' '-RRB-' '.' ':' 'ADD' 'CC' 'CD' 'DT' 'EX' 'FW' 'HYPH'
 'IN' 'JJ' 'JJR' 'JJS' 'LS' 'MD' 'NFP' 'NN' 'NNP' 'NNPS' 'NNS' 'PDT' 'POS'
 'PRP' 'PRP$' 'RB' 'RBR' 'RBS' 'RP' 'SYM' 'TO' 'UH' 'VB' 'VBD' 'VBG' 'VBN'
 'VBP' 'VBZ' 'WDT' 'WP' 'WP$' 'WRB' 'XX' '_SP' '``']

plot_logistic_regression_features_from_pipeline

 plot_logistic_regression_features_from_pipeline (pipeline,
                                                  target_classes,
                                                  target_names, top_n=20,
                                                  classifier_step_name='classifier',
                                                  features_step_name='features',
                                                  renderer='svg')

Plot the most discriminative features for a logistic regression classifier in a fitted pipeline.

plot_logistic_regression_features_from_pipeline(pipeline, target_classes, target_names, top_n=10, classifier_step_name='classifier', features_step_name='features')

	Feature	Log Odds (Logit)	Odds Ratio
5	:	1.120742	3.067128
1	,	0.940030	2.560059
13	IN	0.787940	2.198862
9	DT	-0.586936	0.556029
26	PRP	-0.396887	0.672410
20	NN	-0.396228	0.672853
33	TO	0.332592	1.394578
4	.	0.241355	1.272973
35	VB	-0.236384	0.789477
41	WDT	0.229801	1.258350

	Feature	Log Odds (Logit)	Odds Ratio
5	:	0.800501	2.226657
20	NN	0.732279	2.079814
9	DT	-0.618647	0.538673
14	JJ	0.498787	1.646723
36	VBD	-0.491005	0.612011
4	.	-0.460041	0.631258
1	,	-0.418957	0.657732
40	VBZ	-0.312551	0.731578
26	PRP	0.302341	1.353022
13	IN	0.268820	1.308420

	Feature	Log Odds (Logit)	Odds Ratio
5	:	-1.921243	0.146425
9	DT	1.205582	3.338703
13	IN	-1.056760	0.347580
1	,	-0.521073	0.593883
33	TO	-0.404040	0.667617
35	VB	0.373301	1.452521
14	JJ	-0.346658	0.707047
20	NN	-0.336050	0.714587
43	WP$	-0.326951	0.721119
36	VBD	0.318487	1.375046
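
The two numeric columns are related by exponentiation: the odds ratio is `exp` of the log odds, so a positive coefficient gives a ratio above 1 and a negative one gives a ratio below 1. A quick check against a few rows copied from the tables above (the variable names are illustrative):

```python
import math

# (feature, log odds) pairs copied from the tables above.
rows = [(':', 1.120742), ('DT', -0.586936), ('DT', 1.205582), (':', -1.921243)]

odds_ratios = [(feature, log_odds, math.exp(log_odds)) for feature, log_odds in rows]
for feature, log_odds, ratio in odds_ratios:
    print(f"{feature}\t{log_odds:+.6f}\t{ratio:.4f}")
```

Because the features pass through a StandardScaler before the classifier, each ratio describes the change in odds for a one-unit change in the scaled feature rather than for one additional tag.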

X_train, y_train, X_test, y_test, target_classes, target_names = get_example_data()

from sklearn.tree import DecisionTreeClassifier

pipeline = Pipeline([
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store, pos_tagset = 'detailed')),
    ('pos', POSVectorizer(feature_store=feature_store, ngram_range = (1, 1))),
    ('classifier', DecisionTreeClassifier(max_depth=4, random_state=55))
], verbose=True)

display(pipeline)

Pipeline(steps=[('preprocessor',
                 SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>,
                                   pos_tagset='detailed')),
                ('pos',
                 POSVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)),
                ('classifier',
                 DecisionTreeClassifier(max_depth=4, random_state=55))],
         verbose=True)

experiment_descriptor = 'POS features, Decision Tree Classifier'
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
results = save_results(results_file, pipeline, experiment_descriptor, dataset_descriptor, y_test, y_pred, target_classes, target_names, classifier_step_name = 'classifier')

[Pipeline] ...... (step 1 of 3) Processing preprocessor, total=  20.5s
[Pipeline] ............... (step 2 of 3) Processing pos, total=   0.7s
[Pipeline] ........ (step 3 of 3) Processing classifier, total=   0.0s

print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)

              precision    recall  f1-score   support

        blog      0.634     0.583     0.608       981
      author      0.847     0.790     0.817      1877
      speech      0.614     0.785     0.689       740

    accuracy                          0.732      3598
   macro avg      0.698     0.719     0.705      3598
weighted avg      0.741     0.732     0.734      3598

plot_decision_tree_from_pipeline

 plot_decision_tree_from_pipeline (pipeline, X_train, y_train,
                                   target_classes, target_names,
                                   classifier_step_name='classifier',
                                   features_step_name='features')

Plot the decision tree of the classifier from a pipeline using SuperTree

	Type	Default	Details
pipeline			The pipeline containing the classifier
X_train			The training data
y_train			The training labels
target_classes			The target classes
target_names			The target names
classifier_step_name	str	classifier	The name of the classifier step in the pipeline
features_step_name	str	features	The name of the final preprocessing step, probably the step immediately before the classifier

SuperTree creates interactive decision tree visualisations.

plot_decision_tree_from_pipeline(pipeline, X_train, y_train, target_classes, target_names, classifier_step_name = 'classifier', features_step_name = 'pos')

get_selected_feature_names

 get_selected_feature_names (pipeline, features_step_name='features',
                             selector_step_name='selector')

Get the selected features from the pipeline (Deprecated).

	Type	Default	Details
pipeline			the pipeline to get the feature names from
features_step_name	str	features	the name of the step in the pipeline that contains the features
selector_step_name	str	selector	the name of the step in the pipeline that contains the selector
Returns	list		returns a list of the selected feature names

If a pipeline uses feature selection, here is how to get the selected features. First, here is a pipeline with a selector. A more complex pipeline is used for demonstration …

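from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import FeatureUnion

pipeline = Pipeline([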
    ('preprocessor', SpacyPreprocessor(feature_store=feature_store, pos_tagset = 'detailed')),
    ('features', FeatureUnion([
        ('pos_unigrams', POSVectorizer(feature_store=feature_store, ngram_range = (1, 1))),
        ('textstats', Pipeline([
            ('textstats_vectors', TextstatsTransformer(feature_store=feature_store)),
            ('selector', SelectKBest(mutual_info_classif, k=5)),
        ], verbose=True)),
    ], verbose=True)),
    ('scaler', StandardScaler(with_mean=False)),
    ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)

display(pipeline)

Pipeline(steps=[('preprocessor',
                 SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>,
                                   pos_tagset='detailed')),
                ('features',
                 FeatureUnion(transformer_list=[('pos_unigrams',
                                                 POSVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)),
                                                ('textstats',
                                                 Pipeline(steps=[('textstats_vectors',
                                                                  TextstatsTransformer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)),
                                                                 ('selector',
                                                                  SelectKBest(k=5,
                                                                              score_func=<function mutual_info_classif at 0x7efb5a59e020>))],
                                                          verbose=True))],
                              verbose=True)),
                ('scaler', StandardScaler(with_mean=False)),
                ('classifier',
                 LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)

The functions to preview features require a fitted pipeline …

pipeline.fit(X_train, y_train)

[Pipeline] ...... (step 1 of 4) Processing preprocessor, total=   0.3s
[FeatureUnion] .. (step 1 of 2) Processing pos_unigrams, total=   0.7s
[Pipeline] . (step 1 of 2) Processing textstats_vectors, total=   0.3s
[Pipeline] .......... (step 2 of 2) Processing selector, total=   0.2s
[FeatureUnion] ..... (step 2 of 2) Processing textstats, total=   0.5s
[Pipeline] .......... (step 2 of 4) Processing features, total=   1.2s
[Pipeline] ............ (step 3 of 4) Processing scaler, total=   0.0s
[Pipeline] ........ (step 4 of 4) Processing classifier, total=   0.6s

Pipeline(steps=[('preprocessor',
                 SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>,
                                   pos_tagset='detailed')),
                ('features',
                 FeatureUnion(transformer_list=[('pos_unigrams',
                                                 POSVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)),
                                                ('textstats',
                                                 Pipeline(steps=[('textstats_vectors',
                                                                  TextstatsTransformer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)),
                                                                 ('selector',
                                                                  SelectKBest(k=5,
                                                                              score_func=<function mutual_info_classif at 0x7efb5a59e020>))],
                                                          verbose=True))],
                              verbose=True)),
                ('scaler', StandardScaler(with_mean=False)),
                ('classifier',
                 LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)

Pipeline

?Documentation for PipelineiFitted

Pipeline(steps=[('preprocessor',
                 SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>,
                                   pos_tagset='detailed')),
                ('features',
                 FeatureUnion(transformer_list=[('pos_unigrams',
                                                 POSVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)),
                                                ('textstats',
                                                 Pipeline(steps=[('textstats_vectors',
                                                                  TextstatsTransformer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)),
                                                                 ('selector',
                                                                  SelectKBest(k=5,
                                                                              score_func=<function mutual_info_classif at 0x7efb5a59e020>))],
                                                          verbose=True))],
                              verbose=True)),
                ('scaler', StandardScaler(with_mean=False)),
                ('classifier',
                 LogisticRegression(max_iter=5000, random_state=55))],
         verbose=True)

SpacyPreprocessor

SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>,
                  pos_tagset='detailed')

features: FeatureUnion

?Documentation for features: FeatureUnion

FeatureUnion(transformer_list=[('pos_unigrams',
                                POSVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)),
                               ('textstats',
                                Pipeline(steps=[('textstats_vectors',
                                                 TextstatsTransformer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)),
                                                ('selector',
                                                 SelectKBest(k=5,
                                                             score_func=<function mutual_info_classif at 0x7efb5a59e020>))],
                                         verbose=True))],
             verbose=True)

pos_unigrams

POSVectorizer

POSVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)

textstats

TextstatsTransformer

TextstatsTransformer(feature_store=<textplumber.store.TextFeatureStore object at 0x7efb5562dc90>)

SelectKBest

SelectKBest(k=5, score_func=<function mutual_info_classif at 0x7efb5a59e020>)

StandardScaler

StandardScaler(with_mean=False)

LogisticRegression

LogisticRegression(max_iter=5000, random_state=55)

source

preview_selected_features

 preview_selected_features (pipeline, features_step_name='features',
                            selector_step_name='selector')

Previews (i.e. prints) the selected features from the pipeline (Deprecated - this will be removed in 0.0.10).

	Type	Default	Details
pipeline			the pipeline to preview the selected features from
features_step_name	str	features	the name of the step in the pipeline that contains the features
selector_step_name	str	selector	the name of the step in the pipeline that contains the selector
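
The underlying idea, reading the selected features from a fitted selector step, can be sketched with plain scikit-learn. This is a minimal illustration and not the textplumber implementation: the toy data, the feature names (`noise_a`, `signal`, `noise_b`) and the flat pipeline layout are assumptions made for the example, and `f_classif` is swapped in for `mutual_info_classif` so that the result is deterministic.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): only the "signal" column separates the two classes.
rng = np.random.default_rng(0)
y = np.array([0] * 20 + [1] * 20)
X = rng.normal(size=(40, 3))
X[:, 1] += y * 10
feature_names = np.array(["noise_a", "signal", "noise_b"])

# A flat pipeline with a step named 'selector', mirroring the default
# selector_step_name in the signature above.
pipeline = Pipeline([
    ("selector", SelectKBest(score_func=f_classif, k=1)),
    ("scaler", StandardScaler()),
])
pipeline.fit(X, y)

# get_support() returns a boolean mask over the selector's input features.
mask = pipeline.named_steps["selector"].get_support()
selected = feature_names[mask].tolist()
print(selected)
```

In a nested pipeline, such as the one shown earlier where the selector sits inside a `FeatureUnion`, the selector would first have to be located within the features step before calling `get_support()`.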

This functionality is new in v0.0.9.

source

preview_pipeline_features

 preview_pipeline_features (pipeline:sklearn.pipeline.Pipeline)

Outputs the features at each step in a pipeline.

	Type	Details
pipeline	Pipeline	pipeline to preview

To see how features are extracted and/or selected at each stage of a pipeline, use preview_pipeline_features. Expand each pipeline step to see the features output by that step.

preview_pipeline_features(pipeline)

preprocessor SpacyPreprocessor

This step receives and returns text.

features FeatureUnion

pos_unigrams POSVectorizer

Features Out (49)

$, '', ,, -LRB-, -RRB-, ., :, ADD, CC, CD, DT, EX, FW, HYPH, IN, JJ, JJR, JJS, LS, MD, NFP, NN, NNP, NNPS, NNS, PDT, POS, PRP, PRP$, RB, RBR, RBS, RP, SYM, TO, UH, VB, VBD, VBG, VBN, VBP, VBZ, WDT, WP, WP$, WRB, XX, _SP, ``

textstats_vectors TextstatsTransformer

Features Out (12)

tokens_count, sentences_count, characters_count, monosyllabic_words_relfreq, polysyllabic_words_relfreq, unique_tokens_relfreq, average_characters_per_token, average_tokens_per_sentence, characters_proportion_letters, characters_proportion_uppercase, hapax_legomena_count, hapax_legomena_to_unique

selector SelectKBest

Features Out (5)

tokens_count, characters_count, polysyllabic_words_relfreq, unique_tokens_relfreq, average_tokens_per_sentence

scaler StandardScaler

Features Out (54)

pos_unigrams__$, pos_unigrams__'', pos_unigrams__,, pos_unigrams__-LRB-, pos_unigrams__-RRB-, pos_unigrams__., pos_unigrams__:, pos_unigrams__ADD, pos_unigrams__CC, pos_unigrams__CD, pos_unigrams__DT, pos_unigrams__EX, pos_unigrams__FW, pos_unigrams__HYPH, pos_unigrams__IN, pos_unigrams__JJ, pos_unigrams__JJR, pos_unigrams__JJS, pos_unigrams__LS, pos_unigrams__MD, pos_unigrams__NFP, pos_unigrams__NN, pos_unigrams__NNP, pos_unigrams__NNPS, pos_unigrams__NNS, pos_unigrams__PDT, pos_unigrams__POS, pos_unigrams__PRP, pos_unigrams__PRP$, pos_unigrams__RB, pos_unigrams__RBR, pos_unigrams__RBS, pos_unigrams__RP, pos_unigrams__SYM, pos_unigrams__TO, pos_unigrams__UH, pos_unigrams__VB, pos_unigrams__VBD, pos_unigrams__VBG, pos_unigrams__VBN, pos_unigrams__VBP, pos_unigrams__VBZ, pos_unigrams__WDT, pos_unigrams__WP, pos_unigrams__WP$, pos_unigrams__WRB, pos_unigrams__XX, pos_unigrams___SP, pos_unigrams__``, textstats__tokens_count, textstats__characters_count, textstats__polysyllabic_words_relfreq, textstats__unique_tokens_relfreq, textstats__average_tokens_per_sentence

classifier LogisticRegression

Features In (54)

pd.read_csv('results.csv')

	experiment	dataset	classifier	parameters	accuracy_f1	macro_precision	macro_recall	macro_f1	weighted_precision	weighted_recall	...	woolf_f1	blog_precision	blog_recall	blog_f1	author_precision	author_recall	author_f1	speech_precision	speech_recall	speech_f1
0	POS features, Logistic Regression	hallisky/AuthorMix dataset, authorship classif...	LogisticRegression	{'memory': None, 'steps': [('preprocessor', Sp...	0.633458	0.633510	0.657351	0.631920	0.657252	0.633458	...	0.599401	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	POS features, Unigrams, Logistic Regression	hallisky/AuthorMix dataset, authorship classif...	LogisticRegression	{'memory': None, 'steps': [('preprocessor', Sp...	0.633458	0.633510	0.657351	0.631920	0.657252	0.633458	...	0.599401	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	POS features, Bigrams, Logistic Regression	hallisky/AuthorMix dataset, authorship classif...	LogisticRegression	{'memory': None, 'steps': [('preprocessor', Sp...	0.636121	0.629850	0.649835	0.634696	0.649218	0.636121	...	0.583411	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	POS features, Decision Tree Classifier	hallisky/AuthorMix dataset, authorship classif...	DecisionTreeClassifier	{'memory': None, 'steps': [('preprocessor', Sp...	0.732351	0.698335	0.719257	0.704589	0.741123	0.732351	...	NaN	0.634146	0.583078	0.607541	0.847341	0.789558	0.81743	0.613516	0.785135	0.688797

4 rows × 29 columns
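
Once the results are in a DataFrame, experiments can be compared with ordinary pandas operations. The following sketch ranks experiments by `macro_f1`. To stay self-contained it builds a small in-memory frame from the `experiment`, `classifier` and `macro_f1` values shown in the table above instead of reading `results.csv`; the choice of `macro_f1` as the ranking metric is an assumption made for illustration.

```python
import pandas as pd

# Stand-in for pd.read_csv('results.csv'), using three columns from the table above.
results = pd.DataFrame({
    "experiment": [
        "POS features, Logistic Regression",
        "POS features, Unigrams, Logistic Regression",
        "POS features, Bigrams, Logistic Regression",
        "POS features, Decision Tree Classifier",
    ],
    "classifier": [
        "LogisticRegression",
        "LogisticRegression",
        "LogisticRegression",
        "DecisionTreeClassifier",
    ],
    "macro_f1": [0.631920, 0.631920, 0.634696, 0.704589],
})

# Rank experiments from highest to lowest macro F1.
ranked = results.sort_values("macro_f1", ascending=False).reset_index(drop=True)
best = ranked.loc[0, "experiment"]
print(ranked[["experiment", "macro_f1"]])
```

Note that the fourth row in the table reports per-class columns for different labels (`blog`, `author`, `speech`) than the first three, so aggregate metrics across these rows may not be directly comparable.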