A component for a scikit-learn pipeline that cleans text data. It can normalize characters and whitespace, strip whitespace from the start and end of the text, and remove HTML tags.
| | Type | Default | Details |
|---|---|---|---|
| character_replacements | dict | None | character replacements for character normalization: a dict with the character to replace as the key and the replacement character as the value |
| remove_html | bool | False | whether to remove HTML tags |
| strip_whitespace | bool | False | whether to remove whitespace from the start and end of the text |
| normalize_whitespace | bool | False | whether to replace one or more whitespace characters with a single space |
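To make the four options concrete, here is a minimal standalone sketch of the kind of cleaning they describe. It is not the TextCleaner implementation: the function name `clean_text`, the regex-based tag removal, and the order in which the steps run are all assumptions made for illustration.

```python
import re

# Hypothetical patterns, assumed for this sketch only.
_TAG_RE = re.compile(r"<[^>]+>")      # naive matcher for anything that looks like a tag
_WHITESPACE_RE = re.compile(r"\s+")   # one or more whitespace characters


def clean_text(
    text,
    character_replacements=None,
    remove_html=False,
    strip_whitespace=False,
    normalize_whitespace=False,
):
    """Apply each enabled cleaning step to a single string (illustrative only)."""
    if character_replacements:
        # Map each character to its replacement in one pass.
        text = text.translate(str.maketrans(character_replacements))
    if remove_html:
        # Drop the tags but keep the text between them.
        text = _TAG_RE.sub("", text)
    if normalize_whitespace:
        # Collapse runs of spaces, tabs and newlines into a single space.
        text = _WHITESPACE_RE.sub(" ", text)
    if strip_whitespace:
        # Remove whitespace from the start and end of the text.
        text = text.strip()
    return text
```

With every option left at its default, the sketch returns the text unchanged, which mirrors the defaults listed in the table above. A regular expression is a simplification here. Real-world HTML may need a proper parser.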
This is used by preview_pipeline_features to detect whether the component receives and returns text.
Example
Here’s an example that removes HTML, strips and normalizes whitespace, and normalizes single quotes.
Only apply what makes sense for your use case.
Note: The character_replacements dictionary specifies single-character replacements. A ValueError is raised if any string in the dictionary is longer than one character.
documents = [
    "<p>Some text with <b>html</b> tags</p>",
    "Some text with extra whitespace",
    "Some text with ‘single quotes’",
    "Some text with \t a tab character",
    " Some text with whitespace before and after the text \n ",
]

# Map curly single quotes to straight ones.
character_replacements = {
    "‘": "'",
    "’": "'",
}

text_cleaner = TextCleaner(
    character_replacements=character_replacements,
    remove_html=True,
    strip_whitespace=True,
    normalize_whitespace=True,
)

cleaned_documents = text_cleaner.fit_transform(documents)

# Print each original document next to its cleaned version.
for i, doc in enumerate(cleaned_documents):
    print(f"Original: {documents[i]}")
    print(f"Cleaned: {doc}")
    print()
Original: <p>Some text with <b>html</b> tags</p>
Cleaned: Some text with html tags
Original: Some text with extra whitespace
Cleaned: Some text with extra whitespace
Original: Some text with ‘single quotes’
Cleaned: Some text with 'single quotes'
Original: Some text with a tab character
Cleaned: Some text with a tab character
Original: Some text with whitespace before and after the text
Cleaned: Some text with whitespace before and after the text
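
The single-character restriction on character_replacements mentioned in the note can be illustrated with a small validation helper. This is a sketch under assumptions: the name `build_translation_table` is hypothetical, and checking both keys and values is a guess at the intended rule rather than a description of what TextCleaner actually checks.

```python
def build_translation_table(character_replacements):
    """Return a translation table, rejecting multi-character entries (illustrative only)."""
    for old, new in character_replacements.items():
        # Assumption: both the character to replace and its replacement must be one character.
        if len(old) != 1 or len(new) != 1:
            raise ValueError(
                f"Replacements must be single characters, got {old!r} -> {new!r}"
            )
    return str.maketrans(character_replacements)


# Curly quotes are single characters, so this mapping is accepted.
table = build_translation_table({"‘": "'", "’": "'"})
print("Some text with ‘single quotes’".translate(table))
```

A mapping such as `{"...": "…"}` would be rejected by this helper because the key is three characters long.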