cleaner

Clean text data before feature extraction.

source

TextCleaner

 TextCleaner (character_replacements:dict=None, remove_html:bool=False,
              strip_whitespace:bool=False,
              normalize_whitespace:bool=False)

A component for a scikit-learn pipeline that cleans text data, including normalizing characters and whitespace, stripping whitespace from the start and end of text, and removing HTML tags.

Type Default Details
character_replacements dict None character replacements for character normalization: a dict mapping each character to replace to its replacement character
remove_html bool False whether to remove html tags
strip_whitespace bool False whether to remove whitespace from the start and end of the text
normalize_whitespace bool False whether to replace one or more whitespace characters with a single space

source

TextCleaner.fit

 TextCleaner.fit (X, y=None)

Fit is implemented but does nothing; it exists only to satisfy the scikit-learn transformer interface.
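
Because fit follows the scikit-learn transformer interface, TextCleaner can sit at the front of a Pipeline ahead of a vectorizer. A minimal sketch, assuming scikit-learn is installed; the CountVectorizer step and the documents are illustrative, not part of this library.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

pipeline = Pipeline([
    # cleaning happens first, so the vectorizer sees normalized text
    ("cleaner", TextCleaner(remove_html=True, normalize_whitespace=True)),
    ("vectorizer", CountVectorizer()),
])
features = pipeline.fit_transform([
    "<p>Some   text with <b>html</b></p>",
    "More text",
])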


source

TextCleaner.transform

 TextCleaner.transform (X:list)

Apply transformations to the text data.

Type Details
X list the text to transform
Returns list the transformed text
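
A minimal sketch of calling transform directly; the input strings are illustrative, and the expected output assumes runs of whitespace collapse to a single space.

text_cleaner = TextCleaner(normalize_whitespace=True)
text_cleaner.fit([])  # no-op, kept for the scikit-learn interface
text_cleaner.transform(["Some   text", "Text\twith\ta tab"])
# expected: ['Some text', 'Text with a tab']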

source

TextCleaner.apply_transformations

 TextCleaner.apply_transformations (text:str, transformations:list)

Apply a series of transformations to a text.

Type Details
text str the text to transform
transformations list a list of transformation methods to apply
Returns str the transformed text
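
As an illustration of the idea, here is a standalone sketch of applying a list of transformations in order, assuming each transformation is a callable that takes a string and returns a string. This is not the method's actual implementation.

def apply_in_order(text: str, transformations: list) -> str:
    # Hypothetical helper mirroring the behaviour described above:
    # each transformation receives the output of the previous one.
    for transformation in transformations:
        text = transformation(text)
    return text

apply_in_order("  Some   text  ", [str.strip, lambda t: " ".join(t.split())])
# -> 'Some text'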

source

TextCleaner.is_text_handler

 TextCleaner.is_text_handler ()

This is used by preview_pipeline_features to detect whether a component receives and returns text.
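
A hypothetical sketch of how pipeline tooling might use this marker; preview_pipeline_features itself is not shown, the helper below is an assumption, and it assumes is_text_handler returns True for text handlers.

def step_handles_text(step) -> bool:
    # Hypothetical check: a step is treated as text-in/text-out
    # only if it exposes is_text_handler and that method returns True.
    return getattr(step, "is_text_handler", lambda: False)()

step_handles_text(TextCleaner())  # True, assuming is_text_handler returns True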

Example

Here’s an example that removes HTML, strips and normalizes whitespace, and normalizes curly single quotes.

Only apply what makes sense for your use case.

Note: The character_replacements dictionary is used to specify single-character replacements; a ValueError is raised if any of its strings is longer than one character (see the short illustration after the example below).

documents = [
    "<p>Some text with <b>html</b> tags</p>",
    "Some text with      extra whitespace",
    "Some text with ‘single quotes’",
    "Some text with  \t a tab character",
    "   Some text with whitespace before and after the text  \n ",
]

character_replacements ={
    "‘": "'",
    "’": "'",
}

text_cleaner = TextCleaner(
    character_replacements=character_replacements,
    remove_html=True,
    strip_whitespace=True,
    normalize_whitespace=True,
)
cleaned_documents = text_cleaner.fit_transform(documents)

for i, doc in enumerate(cleaned_documents):
    print(f"Original: {documents[i]}")
    print(f"Cleaned: {doc}")
    print()
Original: <p>Some text with <b>html</b> tags</p>
Cleaned: Some text with html tags

Original: Some text with      extra whitespace
Cleaned: Some text with extra whitespace

Original: Some text with ‘single quotes’
Cleaned: Some text with 'single quotes'

Original: Some text with     a tab character
Cleaned: Some text with a tab character

Original:    Some text with whitespace before and after the text  
 
Cleaned: Some text with whitespace before and after the text
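
As noted above, character_replacements only accepts single-character strings. A hedged illustration of the constraint; whether the check runs at construction time or when the cleaner is applied is an assumption, so both are wrapped here.

try:
    cleaner = TextCleaner(character_replacements={"--": "-"})
    cleaner.fit_transform(["a -- b"])
except ValueError as err:
    print(err)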