cleaner

Clean text data before feature extraction.

source

TextCleaner

 TextCleaner (character_replacements:dict=None, remove_html:bool=False,
              strip_whitespace:bool=False,
              normalize_whitespace:bool=False)

A component for a scikit-learn pipeline that cleans text data, including normalizing characters and whitespace, stripping whitespace from the start and end of text, and removing HTML tags.

Type Default Details
character_replacements dict None character replacements for character normalization: a dict mapping each character to replace to its replacement character
remove_html bool False whether to remove html tags
strip_whitespace bool False whether to remove whitespace from the start and end of the text
normalize_whitespace bool False whether to replace one or more whitespace characters with a single space

source

TextCleaner.fit

 TextCleaner.fit (X, y=None)

Fit is implemented but does nothing; it exists only to satisfy the scikit-learn transformer interface.
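
Because fit follows the scikit-learn transformer interface, TextCleaner can sit at the front of a Pipeline ahead of a vectorizer. A minimal sketch, assuming scikit-learn is installed; the CountVectorizer step and the documents are illustrative, not part of this library.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

pipeline = Pipeline([
    # cleaning happens first, so the vectorizer sees normalized text
    ("cleaner", TextCleaner(remove_html=True, normalize_whitespace=True)),
    ("vectorizer", CountVectorizer()),
])
features = pipeline.fit_transform([
    "<p>Some   text with <b>html</b></p>",
    "More text",
])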


source

TextCleaner.transform

 TextCleaner.transform (X:list)

Apply transformations to the text data.

Type Details
X list the text to transform
Returns list the transformed text
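
A minimal sketch of calling transform directly; the input strings are illustrative, and the expected output assumes runs of whitespace collapse to a single space.

text_cleaner = TextCleaner(normalize_whitespace=True)
text_cleaner.fit([])  # no-op, kept for the scikit-learn interface
text_cleaner.transform(["Some   text", "Text\twith\ta tab"])
# expected: ['Some text', 'Text with a tab']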

source

TextCleaner.apply_transformations

 TextCleaner.apply_transformations (text:str, transformations:list)

Apply a series of transformations to a text.

Type Details
text str the text to transform
transformations list a list of transformation methods to apply
Returns str the transformed text
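
As an illustration of the idea, here is a standalone sketch of applying a list of transformations in order, assuming each transformation is a callable that takes a string and returns a string. This is not the method's actual implementation.

def apply_in_order(text: str, transformations: list) -> str:
    # Hypothetical helper mirroring the behaviour described above:
    # each transformation receives the output of the previous one.
    for transformation in transformations:
        text = transformation(text)
    return text

apply_in_order("  Some   text  ", [str.strip, lambda t: " ".join(t.split())])
# -> 'Some text'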

source

TextCleaner.is_text_handler

 TextCleaner.is_text_handler ()

This is used by preview_pipeline_features to detect whether a component receives and returns text.
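
A hypothetical sketch of how pipeline tooling might use this marker; preview_pipeline_features itself is not shown, the helper below is an assumption, and it assumes is_text_handler returns True for text handlers.

def step_handles_text(step) -> bool:
    # Hypothetical check: a step is treated as text-in/text-out
    # only if it exposes is_text_handler and that method returns True.
    return getattr(step, "is_text_handler", lambda: False)()

step_handles_text(TextCleaner())  # True, assuming is_text_handler returns True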

Example

Here’s an example that removes HTML, strips and normalizes whitespace, and normalizes curly single quotes.

Only apply what makes sense for your use case.

Note: The character_replacements dictionary is used to specify single-character replacements; a ValueError is raised if any of its strings is longer than one character (see the short illustration after the example below).

documents = [
    "<p>Some text with <b>html</b> tags</p>",
    "Some text with      extra whitespace",
    "Some text with ‘single quotes’",
    "Some text with  \t a tab character",
    "   Some text with whitespace before and after the text  \n ",
]

character_replacements ={
    "‘": "'",
    "’": "'",
}

text_cleaner = TextCleaner(
    character_replacements=character_replacements,
    remove_html=True,
    strip_whitespace=True,
    normalize_whitespace=True,
)
cleaned_documents = text_cleaner.fit_transform(documents)

for i, doc in enumerate(cleaned_documents):
    print(f"Original: {documents[i]}")
    print(f"Cleaned: {doc}")
    print()
Original: <p>Some text with <b>html</b> tags</p>
Cleaned: Some text with html tags

Original: Some text with      extra whitespace
Cleaned: Some text with extra whitespace

Original: Some text with ‘single quotes’
Cleaned: Some text with 'single quotes'

Original: Some text with     a tab character
Cleaned: Some text with a tab character

Original:    Some text with whitespace before and after the text  
 
Cleaned: Some text with whitespace before and after the text
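
As noted above, character_replacements only accepts single-character strings. A hedged illustration of the constraint; whether the check runs at construction time or when the cleaner is applied is an assumption, so both are wrapped here.

try:
    cleaner = TextCleaner(character_replacements={"--": "-"})
    cleaner.fit_transform(["a -- b"])
except ValueError as err:
    print(err)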