store

Store text features to avoid recomputing them.

source

TextFeatureStore

 TextFeatureStore (path:str)

A class to store features extracted for a text classification pipeline and cache them to disk to avoid recomputing them.

Type Details
path str where to save SQLite db to persist the feature store between runs

Check out TokensVectorizer, POSVectorizer, TextstatsTransformer and LexiconCountVectorizer for examples using a feature store with a scikit-learn pipeline.

Initiating a TextFeatureStore creates an SQLite database with a texts table for core features (tokens, part of speech tags, and textstats, i.e. document-level statistics), an embeddings table for embedding features, and a lexicons table for lexicon features.

The store is used by a number of Textplumber components. The methods are documented below with examples, primarily as a reference or as a guide for implementing new components.

feature_store = TextFeatureStore(path='feature_store_example.sqlite')

source

TextFeatureStore.dump

 TextFeatureStore.dump (structure_only=False)

Outputs the structure or contents of the feature store (intended for debugging and development).

Type Default Details
structure_only bool False if True, only show the schema of the feature store

Features are retrieved with the hash column. Texts are hashed using MD5. Here is the structure of a store …

feature_store.dump(structure_only=True)
texts
('hash', 'TEXT', 1, None, 1)
('tokens', 'BLOB', 1, None, 0)
('pos', 'BLOB', 1, None, 0)
('textstats', 'BLOB', 1, None, 0)

embeddings
('hash', 'TEXT', 1, None, 1)
('embeddings', 'BLOB', 1, None, 0)

lexicons
('hash', 'TEXT', 1, None, 1)
('lexicons', 'BLOB', 1, None, 0)
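Each row in the structure output above is a column description in the format returned by SQLite's PRAGMA table_info: name, declared type, NOT NULL flag, default value, and primary-key flag. A minimal sketch reproducing the texts table output with the standard sqlite3 module (the CREATE TABLE statement here is an assumption reconstructed from the dump, not the library's actual schema definition):

```python
import sqlite3

# in-memory database mimicking the store's texts table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE texts ("
    "hash TEXT NOT NULL PRIMARY KEY, "
    "tokens BLOB NOT NULL, "
    "pos BLOB NOT NULL, "
    "textstats BLOB NOT NULL)"
)
# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk);
# dropping cid gives the tuples shown in the dump output above
for row in conn.execute("PRAGMA table_info(texts)"):
    print(tuple(row)[1:])
```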

source

TextFeatureStore.update

 TextFeatureStore.update (text:str, tokens:list, pos:list, textstats:list)

Update (insert or replace) the feature store with the token, part of speech tag and textstat (document-level statistics) features for a specific text.

Type Details
text str the text to update
tokens list the tokens to update
pos list the part of speech tags to update
textstats list the text statistics to update
feature_store.update(
        text="This is an example.",
        tokens=["This", "is", "an", "example", "."],
        pos=["DET", "VERB", "DET", "NOUN", "PUNCT"],
        textstats=[1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11]
)
feature_store.dump()
Table: texts (1)
hash tokens pos textstats
0 263fb1aa85489991a2ef832ef10308a0 [This, is, an, example, .] [DET, VERB, DET, NOUN, PUNCT] [1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11]
Table: embeddings (0)
Table: lexicons (0)
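The hash keys shown in the dump can be reproduced with Python's hashlib, assuming the store applies MD5 to the UTF-8 encoded text (the exact encoding and any pre-hash normalisation are assumptions based on the note above):

```python
import hashlib

# compute the MD5 lookup key for a text, assuming UTF-8 encoding
text = "This is an example."
key = hashlib.md5(text.encode("utf-8")).hexdigest()
print(key)  # a 32-character hex digest used as the store's lookup key
```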

source

TextFeatureStore.update_embeddings

 TextFeatureStore.update_embeddings (texts:list, embeddings:list)

Update the feature store with embeddings for a list of texts.

Type Details
texts list the texts to update
embeddings list the embeddings to update
feature_store.update_embeddings(["This is an example."], [[0.1, 0.2, 0.3, 0.4, 0.5]])
feature_store.dump()
Table: texts (1)
hash tokens pos textstats
0 263fb1aa85489991a2ef832ef10308a0 [This, is, an, example, .] [DET, VERB, DET, NOUN, PUNCT] [1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11]
Table: embeddings (1)
hash embeddings
0 263fb1aa85489991a2ef832ef10308a0 [0.1, 0.2, 0.3, 0.4, 0.5]
Table: lexicons (0)

source

TextFeatureStore.update_lexicons

 TextFeatureStore.update_lexicons (texts:list, lexicons:list)

Update the feature store with lexicon features for a list of texts.

Type Details
texts list the texts to update
lexicons list the lexicon scores to update
feature_store.update_lexicons(["This is an example."], [[1, 7]])
feature_store.dump()
Table: texts (1)
hash tokens pos textstats
0 263fb1aa85489991a2ef832ef10308a0 [This, is, an, example, .] [DET, VERB, DET, NOUN, PUNCT] [1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11]
Table: embeddings (1)
hash embeddings
0 263fb1aa85489991a2ef832ef10308a0 [0.1, 0.2, 0.3, 0.4, 0.5]
Table: lexicons (1)
hash lexicons
0 263fb1aa85489991a2ef832ef10308a0 [1, 7]

source

TextFeatureStore.empty

 TextFeatureStore.empty ()

Clear the contents of the feature store.

feature_store.empty()
feature_store.dump()
Table: texts (0)
Table: embeddings (0)
Table: lexicons (0)

source

TextFeatureStore.buffered_update

 TextFeatureStore.buffered_update (text:str, tokens:list, pos:list,
                                   textstats:list)

Buffer an update of the tokens, part of speech tags and text statistics (textstats) for a single text. Buffered updates are written to the database when flush is called.

Type Details
text str the text to update
tokens list the tokens to update
pos list the part of speech tags to update
textstats list the text statistics to update

source

TextFeatureStore.flush

 TextFeatureStore.flush ()

Flush the buffer to the database.

feature_store.buffered_update(
        text="This is an example.",
        tokens=["This", "is", "an", "example", "."],
        pos=["DET", "VERB", "DET", "NOUN", "PUNCT"],
        textstats=[2, 4, 6, 8, 10, 12, 2, 4, 6, 8, 10, 12]
)
print('Updates pending, not flushed ...')
feature_store.dump() # nothing added yet
feature_store.flush() # flush writes buffered updates to the store
print()
print('Updates flushed ...')
feature_store.dump() # now the store is updated
Updates pending, not flushed ...
Table: texts (0)
Table: embeddings (0)
Table: lexicons (0)

Updates flushed ...
Table: texts (1)
hash tokens pos textstats
0 263fb1aa85489991a2ef832ef10308a0 [This, is, an, example, .] [DET, VERB, DET, NOUN, PUNCT] [2, 4, 6, 8, 10, 12, 2, 4, 6, 8, 10, 12]
Table: embeddings (0)
Table: lexicons (0)

source

TextFeatureStore.get

 TextFeatureStore.get (text:str, type:str=None)

Get features for a text.

Type Default Details
text str the text to get features for
type str None the type of features to get - ‘tokens’, ‘pos’, ‘textstats’, ‘embeddings’, ‘lexicons’
Returns dict | list the features for the text

The get method can return a specific feature type for a text …

feature_store.get('This is an example.', type='tokens')
['This', 'is', 'an', 'example', '.']

or all features as a dict …

feature_store.get('This is an example.')
{'tokens': ['This', 'is', 'an', 'example', '.'],
 'pos': ['DET', 'VERB', 'DET', 'NOUN', 'PUNCT'],
 'textstats': [2, 4, 6, 8, 10, 12, 2, 4, 6, 8, 10, 12],
 'embeddings': None,
 'lexicons': None}

source

TextFeatureStore.get_features_from_texts_by_type

 TextFeatureStore.get_features_from_texts_by_type (texts:list, type:str)

Get features for a list of texts by type; if there is no match, None is returned for that text.

Type Details
texts list the texts to get features for
type str the type of features to get - ‘tokens’, ‘pos’, ‘textstats’, ‘embeddings’, ‘lexicons’
Returns list the features for the texts

Rather than calling get_features_from_texts_by_type directly, use the specific method for the feature type …
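The None-for-miss behaviour shared by all of these lookup methods can be pictured as a hash-keyed dictionary lookup. A simplified sketch of the pattern (a plain-Python illustration, not the library's implementation, which queries SQLite):

```python
import hashlib

def get_features_by_type(store: dict, texts: list, type: str) -> list:
    """Look up one feature type per text, returning None on a miss."""
    results = []
    for text in texts:
        key = hashlib.md5(text.encode("utf-8")).hexdigest()
        row = store.get(key)
        results.append(row.get(type) if row is not None else None)
    return results

# a toy in-memory store keyed by text hash
store = {
    hashlib.md5("This is an example.".encode("utf-8")).hexdigest(): {
        "tokens": ["This", "is", "an", "example", "."]
    }
}
print(get_features_by_type(store, ["This is an example.", "Not stored."], "tokens"))
# [['This', 'is', 'an', 'example', '.'], None]
```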


source

TextFeatureStore.get_tokens_from_texts

 TextFeatureStore.get_tokens_from_texts (texts:list, lowercase:bool=False,
                                         min_token_length:int=0,
                                         remove_punctuation:bool=False,
                                         remove_numbers:bool=False)

Get (and optionally filter or transform) tokens for a list of texts.

Type Default Details
texts list the texts to get tokens for
lowercase bool False whether to return tokens as lowercase
min_token_length int 0 the minimum token length to include
remove_punctuation bool False whether to remove punctuation
remove_numbers bool False whether to remove numbers
Returns list the tokens for the texts
feature_store.update(
        text="This is example 2.",
        tokens=["This", "is", "example", "2", "."],
        pos=["DET", "VERB", "NOUN", "NUM", "PUNCT"],
        textstats=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
)

# defaults - if features for a text are not in the store, None is returned
print(feature_store.get_tokens_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."]))
# lowercase
print(feature_store.get_tokens_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."],
                                    lowercase=True))
# min token length
print(feature_store.get_tokens_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."],
                                    min_token_length=3))
# remove punctuation
print(feature_store.get_tokens_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."],
                                    remove_punctuation=True))
# remove numbers
print(feature_store.get_tokens_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."],
                                    remove_numbers=True))
[['This', 'is', 'an', 'example', '.'], ['This', 'is', 'example', '2', '.'], None]
[['this', 'is', 'an', 'example', '.'], ['this', 'is', 'example', '2', '.'], None]
[['This', 'example'], ['This', 'example'], None]
[['This', 'is', 'an', 'example'], ['This', 'is', 'example', '2'], None]
[['This', 'is', 'an', 'example', '.'], ['This', 'is', 'example', '.'], None]
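The filtering options above can be reproduced on a plain token list. A sketch of equivalent logic using simple string methods (the library's exact definitions of punctuation and numbers may differ):

```python
def filter_tokens(tokens, lowercase=False, min_token_length=0,
                  remove_punctuation=False, remove_numbers=False):
    """Apply token filters in the style of get_tokens_from_texts."""
    out = []
    for token in tokens:
        # treat tokens with no alphanumeric characters as punctuation
        if remove_punctuation and not any(c.isalnum() for c in token):
            continue
        if remove_numbers and token.isdigit():
            continue
        if len(token) < min_token_length:
            continue
        out.append(token.lower() if lowercase else token)
    return out

tokens = ["This", "is", "example", "2", "."]
print(filter_tokens(tokens, lowercase=True))           # ['this', 'is', 'example', '2', '.']
print(filter_tokens(tokens, min_token_length=3))       # ['This', 'example']
print(filter_tokens(tokens, remove_punctuation=True))  # ['This', 'is', 'example', '2']
print(filter_tokens(tokens, remove_numbers=True))      # ['This', 'is', 'example', '.']
```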

source

TextFeatureStore.get_textstats_from_texts

 TextFeatureStore.get_textstats_from_texts (texts:list,
                                            columns_out=['tokens_count',
                                            'sentences_count',
                                            'characters_count',
                                            'monosyllabic_words_relfreq',
                                            'polysyllabic_words_relfreq',
                                            'unique_tokens_relfreq',
                                            'average_characters_per_token',
                                            'average_tokens_per_sentence',
                                            'characters_proportion_letters',
                                            'characters_proportion_uppercase',
                                            'hapax_legomena_count',
                                            'hapax_legomena_to_unique'],
                                            columns_in=['tokens_count',
                                            'sentences_count',
                                            'characters_count',
                                            'monosyllabic_words_relfreq',
                                            'polysyllabic_words_relfreq',
                                            'unique_tokens_relfreq',
                                            'average_characters_per_token',
                                            'average_tokens_per_sentence',
                                            'characters_proportion_letters',
                                            'characters_proportion_uppercase',
                                            'hapax_legomena_count',
                                            'hapax_legomena_to_unique'])

Get document-level text statistics for a list of texts.

Type Default Details
texts list the texts to get text statistics for
columns_out list [‘tokens_count’, ‘sentences_count’, ‘characters_count’, ‘monosyllabic_words_relfreq’, ‘polysyllabic_words_relfreq’, ‘unique_tokens_relfreq’, ‘average_characters_per_token’, ‘average_tokens_per_sentence’, ‘characters_proportion_letters’, ‘characters_proportion_uppercase’, ‘hapax_legomena_count’, ‘hapax_legomena_to_unique’] the columns to return
columns_in list [‘tokens_count’, ‘sentences_count’, ‘characters_count’, ‘monosyllabic_words_relfreq’, ‘polysyllabic_words_relfreq’, ‘unique_tokens_relfreq’, ‘average_characters_per_token’, ‘average_tokens_per_sentence’, ‘characters_proportion_letters’, ‘characters_proportion_uppercase’, ‘hapax_legomena_count’, ‘hapax_legomena_to_unique’] the possible columns
Returns list the text statistics for the texts

To restrict to specific textstats use columns_out. The columns_in argument is provided to allow a different definition of available statistics.

# defaults - if there are no features for a text returns None
print(feature_store.get_textstats_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."]))

# restrict to specific columns - tokens_count, sentences_count ...
print(feature_store.get_textstats_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."],
                                             columns_out = ['tokens_count', 'sentences_count']))
[[2, 4, 6, 8, 10, 12, 2, 4, 6, 8, 10, 12], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], None]
[[2, 4], [1, 2], None]
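The column selection can be understood as index-based slicing of each stored statistics vector, where columns_in defines the stored order. A sketch of that logic (a hypothetical helper, not the library's code):

```python
def select_columns(stats, columns_in, columns_out):
    """Pick the requested statistics by position in the stored order."""
    indices = [columns_in.index(col) for col in columns_out]
    # None marks a text whose features are not in the store
    return [None if row is None else [row[i] for i in indices] for row in stats]

columns_in = ["tokens_count", "sentences_count", "characters_count"]
stored = [[2, 4, 6], [1, 2, 3], None]
print(select_columns(stored, columns_in, ["tokens_count", "sentences_count"]))
# [[2, 4], [1, 2], None]
```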

source

TextFeatureStore.get_pos_from_texts

 TextFeatureStore.get_pos_from_texts (texts:list)

Get parts of speech for a list of texts.

Type Details
texts list the texts to get part of speech tags for
Returns list the part of speech tags for the texts
# if there are no features for a text returns None
print(feature_store.get_pos_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."]))
[['DET', 'VERB', 'DET', 'NOUN', 'PUNCT'], ['DET', 'VERB', 'NOUN', 'NUM', 'PUNCT'], None]

source

TextFeatureStore.get_embeddings_from_texts

 TextFeatureStore.get_embeddings_from_texts (texts:list)

Get embeddings for multiple texts.

Type Details
texts list the texts to get embeddings for
Returns list the embeddings for the texts
feature_store.update_embeddings(["This is an example.", "This is example 2."],[[0.9, -0.5, 3.0, 1.7], [0.5, 0.6, 0.8, -0.9]])

# if there are no features for a text returns None
print(feature_store.get_embeddings_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."]))
[[0.9, -0.5, 3.0, 1.7], [0.5, 0.6, 0.8, -0.9], None]

source

TextFeatureStore.get_lexicons_from_texts

 TextFeatureStore.get_lexicons_from_texts (texts:list)

Get lexicon features for multiple texts.

Type Details
texts list the texts to get lexicon features for
Returns list the lexicon features for the texts
feature_store.update_lexicons(["This is an example.", "This is example 2."], [[1, 5], [7, 1]])

# if there are no features for a text returns None
print(feature_store.get_lexicons_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."]))
[[1, 5], [7, 1], None]