preprocess

Preprocess text data.

Note: v0.0.8 and earlier attempted to install spaCy’s en_core_web_sm model automatically, which goes against spaCy’s recommendation. Automatic installation of the model was removed in version 0.0.9. When installing textplumber, install the model manually as shown below:

pip install textplumber
python -m spacy download en_core_web_sm

If you are working with a different language or want to use a different ‘en’ model, check the spaCy models documentation for the relevant model name.
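Because spaCy models are installed as ordinary Python packages, one way to confirm the manual install worked is to look the package up before building a pipeline. The following is a minimal sketch using only the standard library; the helper name model_is_installed is hypothetical and is not part of textplumber or spaCy.

```python
from importlib.util import find_spec


def model_is_installed(model_name: str) -> bool:
    """Return True if a top-level package with this name can be found.

    Hypothetical helper: models installed with
    `python -m spacy download <name>` are regular Python packages,
    so a package lookup is enough for a quick check.
    """
    return find_spec(model_name) is not None


if __name__ == "__main__":
    name = "en_core_web_sm"
    if model_is_installed(name):
        print(f"{name} is installed")
    else:
        print(f"{name} is missing; run: python -m spacy download {name}")
```

The same check applies to any other model name passed as model_name.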



SpacyPreprocessor

 SpacyPreprocessor (feature_store:textplumber.store.TextFeatureStore,
                    pos_tagset:str='simple',
                    model_name:str='en_core_web_sm',
                    disable:list[str]=['parser', 'ner'],
                    enable:list[str]=['sentencizer'], batch_size:int=500,
                    n_process:int=1)

A scikit-learn pipeline component that preprocesses text using spaCy. The component receives and returns texts, but prepares tokens, POS tags, and text statistics as input to other compatible classes in a pipeline.

| Parameter | Type | Default | Details |
|---|---|---|---|
| feature_store | TextFeatureStore | | the feature store to use |
| pos_tagset | str | simple | ‘simple’ or ‘detailed’ (see note in documentation about the tag sets used) |
| model_name | str | en_core_web_sm | the spaCy model to use |
| disable | list | ['parser', 'ner'] | the spaCy components to disable |
| enable | list | ['sentencizer'] | the spaCy components to enable |
| batch_size | int | 500 | the batch size for spaCy processing |
| n_process | int | 1 | the number of processes for spaCy to use |

The pos_tagset argument controls whether POS tag preprocessing uses the ‘simple’ UPOS (Universal POS) tag set or the ‘detailed’ tag set. For the English-language spaCy models, the ‘detailed’ tag set is based on the Penn Treebank tag set.
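To illustrate the difference in granularity between the two tag sets, the sketch below collapses a handful of Penn Treebank tags into the UPOS tag each typically falls under. The mapping is a small hand-picked subset for illustration only; it is not the table spaCy or textplumber uses, and the names PTB_TO_UPOS and to_simple are assumptions.

```python
# A few Penn Treebank tags and the coarser UPOS tag each typically maps to.
# Illustrative subset only: auxiliary uses of verb tags map to AUX in UPOS,
# which this sketch ignores.
PTB_TO_UPOS = {
    "NN": "NOUN",    # noun, singular
    "NNS": "NOUN",   # noun, plural
    "NNP": "PROPN",  # proper noun, singular
    "VB": "VERB",    # verb, base form
    "VBD": "VERB",   # verb, past tense
    "VBZ": "VERB",   # verb, 3rd person singular present
    "JJ": "ADJ",     # adjective
    "JJR": "ADJ",    # adjective, comparative
    "RB": "ADV",     # adverb
    "DT": "DET",     # determiner
}


def to_simple(detailed_tags: list[str]) -> list[str]:
    """Map detailed tags to UPOS, using 'X' for tags outside the subset."""
    return [PTB_TO_UPOS.get(tag, "X") for tag in detailed_tags]


detailed = ["DT", "NNS", "VBD", "RB"]
simple = to_simple(detailed)
print(simple)
```

Several detailed tags share one simple tag (for example NN and NNS both become NOUN), so the ‘detailed’ setting keeps distinctions such as number and tense that the ‘simple’ setting drops.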



SpacyPreprocessor.fit

 SpacyPreprocessor.fit (X, y=None)

Fit is implemented, but does nothing.



SpacyPreprocessor.transform

 SpacyPreprocessor.transform (X)

Preprocess the texts using spaCy and populate the feature store ready for use later in a pipeline.
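The pattern described for fit and transform (a no-op fit, and a transform that returns its input unchanged while recording features on the side) can be sketched without spaCy or scikit-learn. Everything in this block is hypothetical: the class name, the plain dict standing in for a feature store, and the whitespace tokenizer are assumptions for illustration, not textplumber's implementation.

```python
class PassThroughPreprocessor:
    """Hypothetical stand-in showing the fit/transform contract only."""

    def __init__(self, store: dict):
        # A plain dict plays the role of the feature store in this sketch.
        self.store = store

    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained.
        return self

    def transform(self, X):
        for text in X:
            # Skip texts that were already processed and cached.
            if text not in self.store:
                # Whitespace splitting stands in for real tokenization.
                self.store[text] = {"tokens": text.split()}
        # The texts pass through unchanged for later pipeline steps.
        return X


store = {}
texts = ["A short text.", "Another one"]
output = PassThroughPreprocessor(store).fit(texts).transform(texts)
print(output == texts, store["Another one"]["tokens"])
```

Returning the texts unchanged is what lets later steps look up the cached tokens for each text instead of reprocessing it.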



SpacyPreprocessor.is_text_handler

 SpacyPreprocessor.is_text_handler ()

This is used by preview_pipeline_features to detect whether the component receives and returns text.

Examples

Check out TokensVectorizer, POSVectorizer, TextstatsTransformer and LexiconCountVectorizer for examples.