preprocess

Preprocess text data.

Note: Versions 0.0.8 and earlier attempted to install spaCy’s en_core_web_sm model automatically, which goes against spaCy’s recommendation. Automatic installation of the model was removed in version 0.0.9. When installing textplumber, install the model manually as shown below:

pip install textplumber
python -m spacy download en_core_web_sm

If you are working with a different language or want to use a different English (‘en’) model, check the spaCy models documentation for the relevant model name.
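
Because spaCy models are installed as ordinary Python packages, one way to confirm that a model is available before building a pipeline is to look the package up by name. The following is a minimal sketch using only the standard library; the helper name `is_package_installed` is hypothetical and is not part of textplumber or spaCy.

```python
import importlib.util


def is_package_installed(name: str) -> bool:
    """Return True if a top-level package called `name` can be found."""
    # find_spec returns None when no importable package has this name.
    return importlib.util.find_spec(name) is not None


model_name = "en_core_web_sm"
if not is_package_installed(model_name):
    # Same command as in the installation instructions above.
    print(f"Model not found. Install it with: python -m spacy download {model_name}")
```

The same check should work for any other model name taken from the spaCy models documentation.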

SpacyPreprocessor

 SpacyPreprocessor (feature_store:textplumber.store.TextFeatureStore,
                    pos_tagset:str='simple',
                    model_name:str='en_core_web_sm',
                    disable:list[str]=['parser', 'ner'],
                    enable:list[str]=['sentencizer'], batch_size:int=200,
                    n_process:int=1)

A scikit-learn pipeline component that preprocesses text using spaCy. The component receives and returns texts, but also prepares tokens, part-of-speech tags, and text statistics as input to compatible Textplumber classes later in the pipeline.

Parameter	Type	Default	Details
feature_store	TextFeatureStore		the feature store to use
pos_tagset	str	simple	'simple' or 'detailed' (see the note below on the tag sets used)
model_name	str	en_core_web_sm	the spaCy model to use
disable	list	['parser', 'ner']	the spaCy components to disable
enable	list	['sentencizer']	the spaCy components to enable
batch_size	int	200	the batch size for spaCy processing
n_process	int	1	the number of processes for spaCy to use

The pos_tagset argument controls whether part-of-speech preprocessing uses the ‘simple’ tag set (Universal POS tags, UPOS) or the ‘detailed’ tag set. For the English-language spaCy models, the ‘detailed’ tag set is based on the Penn Treebank tag set.
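
To make the difference between the two tag sets concrete: spaCy exposes the coarse UPOS tag of a token as `Token.pos_` and the fine-grained tag as `Token.tag_`. The sketch below assumes that ‘simple’ corresponds to `pos_` and ‘detailed’ to `tag_`, which this page does not confirm. The function `pos_tags` is hypothetical, and the stand-in token objects are used only so the example runs without a spaCy model installed.

```python
from types import SimpleNamespace


def pos_tags(tokens, pos_tagset="simple"):
    """Return one tag per token from the chosen tag set (illustrative only)."""
    if pos_tagset == "simple":
        # Coarse-grained Universal POS tags, e.g. NOUN, VERB.
        return [token.pos_ for token in tokens]
    if pos_tagset == "detailed":
        # Fine-grained tags; Penn Treebank style for English models, e.g. NNS, VBD.
        return [token.tag_ for token in tokens]
    raise ValueError("pos_tagset must be 'simple' or 'detailed'")


# Stand-ins carrying the two attributes that spaCy tokens provide.
tokens = [
    SimpleNamespace(text="cats", pos_="NOUN", tag_="NNS"),
    SimpleNamespace(text="slept", pos_="VERB", tag_="VBD"),
]

print(pos_tags(tokens, "simple"))    # ['NOUN', 'VERB']
print(pos_tags(tokens, "detailed"))  # ['NNS', 'VBD']
```

The ‘detailed’ tags separate plural nouns (NNS) from singular ones and past-tense verbs (VBD) from other verb forms, distinctions that the ‘simple’ tags collapse.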

SpacyPreprocessor.fit

 SpacyPreprocessor.fit (X, y=None)

Fit is implemented, but does nothing.

SpacyPreprocessor.transform

 SpacyPreprocessor.transform (X)

Preprocess the texts using spaCy and populate the feature store ready for use by Textplumber components later in a pipeline.
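
The pattern described here, where `transform` returns the texts unchanged and records derived features as a side effect, can be illustrated with a toy example. This is a sketch of the general idea only and not Textplumber's implementation: `DictFeatureStore` and `PassThroughPreprocessor` are hypothetical names, whitespace splitting stands in for spaCy tokenization, and returning `True` from `is_text_handler` is an assumption.

```python
class DictFeatureStore:
    """Hypothetical in-memory stand-in for a feature store, keyed by raw text."""

    def __init__(self):
        self.tokens = {}


class PassThroughPreprocessor:
    """Toy preprocessor: stores tokens as a side effect, returns texts as-is."""

    def __init__(self, feature_store):
        self.feature_store = feature_store

    def fit(self, X, y=None):
        # Nothing to learn; returning self follows the scikit-learn convention.
        return self

    def transform(self, X):
        for text in X:
            # Whitespace splitting stands in for real tokenization.
            self.feature_store.tokens[text] = text.split()
        # Later pipeline steps still receive the original texts.
        return X

    def is_text_handler(self):
        # Assumed signal that this step receives and returns text.
        return True


store = DictFeatureStore()
texts = ["The cat sat.", "Dogs bark loudly."]
out = PassThroughPreprocessor(store).fit(texts).transform(texts)

print(out == texts)                  # True
print(store.tokens["The cat sat."])  # ['The', 'cat', 'sat.']
```

A later step in such a pipeline could look up the stored tokens for each text instead of tokenizing again.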

SpacyPreprocessor.is_text_handler

 SpacyPreprocessor.is_text_handler ()

This is used by preview_pipeline_features to detect whether the component receives and returns text.

Examples

Check out TokensVectorizer, POSVectorizer, TextstatsTransformer and LexiconCountVectorizer for examples.

NLTKPreprocessor

 NLTKPreprocessor (feature_store:textplumber.store.TextFeatureStore)

A scikit-learn pipeline component that preprocesses English-language text using NLTK. The component receives and returns texts, but also prepares tokens, part-of-speech tags, and text statistics as input to compatible Textplumber classes later in the pipeline.

Parameter	Type	Details
feature_store	TextFeatureStore	the feature store to use

NLTKPreprocessor.fit

 NLTKPreprocessor.fit (X, y=None)

Fit is implemented, but does nothing.

NLTKPreprocessor.transform

 NLTKPreprocessor.transform (X)

Preprocess the texts using NLTK and populate the feature store ready for use later in a pipeline.

NLTKPreprocessor.is_text_handler

 NLTKPreprocessor.is_text_handler ()

This is used by preview_pipeline_features to detect whether the component receives and returns text.

Examples

Check out TokensVectorizer, POSVectorizer, TextstatsTransformer and LexiconCountVectorizer for examples.