preprocess
Note: v0.0.8 and earlier attempted to install spaCy’s en_core_web_sm model automatically, which goes against spaCy’s recommendation. Automatic installation of the model was removed in version 0.0.9. When installing textplumber, install the model manually as shown below:
pip install textplumber
python -m spacy download en_core_web_sm
If you are working with a different language or want to use a different ‘en’ model, check the spaCy models documentation for the relevant model name.
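Because the model is no longer installed automatically, it can help to check for it before building a pipeline. The sketch below is one way to do that, assuming the model was installed as a regular Python package (which is what `python -m spacy download` does). The helper name `model_is_installed` is hypothetical and not part of textplumber.

```python
import importlib.util


def model_is_installed(model_name: str = "en_core_web_sm") -> bool:
    """Return True if a top-level package with this name can be found."""
    # find_spec returns None for a missing top-level package and does not
    # import the package itself.
    return importlib.util.find_spec(model_name) is not None


if not model_is_installed():
    print("Model missing. Run: python -m spacy download en_core_web_sm")
```

Pass a different model name to check for another language or a larger English model.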
SpacyPreprocessor
SpacyPreprocessor (feature_store:textplumber.store.TextFeatureStore, pos_tagset:str='simple', model_name:str='en_core_web_sm', disable:list[str]=['parser', 'ner'], enable:list[str]=['sentencizer'], batch_size:int=500, n_process:int=1)
A scikit-learn pipeline component that preprocesses text using spaCy. The component receives and returns texts, but prepares tokens, POS tags, and text statistics as input to other compatible classes in a pipeline.
| | Type | Default | Details |
|---|---|---|---|
| feature_store | TextFeatureStore | | the feature store to use |
| pos_tagset | str | simple | 'simple' or 'detailed' (see note in documentation about the tag sets used) |
| model_name | str | en_core_web_sm | the spaCy model to use |
| disable | list | ['parser', 'ner'] | the spaCy components to disable |
| enable | list | ['sentencizer'] | the spaCy components to enable |
| batch_size | int | 500 | the batch size for the spaCy processing |
| n_process | int | 1 | the number of processes for spaCy to use |
The pos_tagset argument controls whether POS tag preprocessing uses the ‘simple’ UPOS part-of-speech tag set or the ‘detailed’ part-of-speech tag set. For the English-language spaCy models, the detailed tag set is based on the Penn Treebank tag set.
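To illustrate the difference in granularity, the sketch below collapses a few Penn Treebank tags into their usual UPOS categories. The mapping is a small hand-picked subset for illustration only, and the `to_simple` helper is hypothetical. spaCy assigns both kinds of tag itself (exposed as `token.pos_` and `token.tag_`), and textplumber's own handling may differ.

```python
# Hand-picked subset of Penn Treebank tags and their usual UPOS categories.
PTB_TO_UPOS = {
    "NN": "NOUN", "NNS": "NOUN",
    "NNP": "PROPN", "NNPS": "PROPN",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
}


def to_simple(detailed_tags: list[str]) -> list[str]:
    """Map detailed tags to coarse ones; tags outside the subset become 'X'."""
    return [PTB_TO_UPOS.get(tag, "X") for tag in detailed_tags]


detailed = ["JJ", "NNS", "RBR", "NNP"]
print(to_simple(detailed))  # ['ADJ', 'NOUN', 'ADV', 'PROPN']
```

Several detailed tags collapse into one simple tag, so the ‘detailed’ setting keeps distinctions (such as singular versus plural nouns) that ‘simple’ discards.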
SpacyPreprocessor.fit
SpacyPreprocessor.fit (X, y=None)
Fit is implemented, but does nothing.
SpacyPreprocessor.transform
SpacyPreprocessor.transform (X)
Preprocess the texts using spaCy and populate the feature store ready for use later in a pipeline.
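The pass-through pattern described here can be sketched in plain Python. This is not textplumber's implementation: the class name, the dict used as a store, and the whitespace tokenization are all stand-ins chosen for illustration, where the real component uses spaCy and a TextFeatureStore.

```python
class PassThroughPreprocessor:
    """Returns texts unchanged, caching derived features as a side effect."""

    def __init__(self, store: dict):
        # A plain dict stands in for the feature store (assumption).
        self.store = store

    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained.
        return self

    def transform(self, X):
        for text in X:
            if text not in self.store:
                # str.split() stands in for spaCy tokenization here.
                tokens = text.split()
                self.store[text] = {"tokens": tokens, "n_tokens": len(tokens)}
        # The texts themselves are passed on untouched.
        return X


store = {}
texts = ["A short text.", "Another one here, slightly longer."]
out = PassThroughPreprocessor(store).fit(texts).transform(texts)
print(out is texts)                        # True
print(store["A short text."]["n_tokens"])  # 3
```

In a design like this, later pipeline steps can look up cached features by text rather than re-processing each document.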
SpacyPreprocessor.is_text_handler
SpacyPreprocessor.is_text_handler ()
This is used by preview_pipeline_features to detect whether the component receives and returns text.
Examples
Check out TokensVectorizer, POSVectorizer, TextstatsTransformer and LexiconCountVectorizer for examples.