feature_store = TextFeatureStore(path='feature_store_example.sqlite')
store
TextFeatureStore
TextFeatureStore (path:str)
A class to store features extracted for a text classification pipeline and cache them to disk to avoid recomputing them.
| | Type | Details |
|---|---|---|
| path | str | where to save SQLite db to persist the feature store between runs |
Check out TokensVectorizer, POSVectorizer, TextstatsTransformer and LexiconCountVectorizer for examples using a feature store with a scikit-learn pipeline.
Initializing a TextFeatureStore creates an SQLite database with a texts table for core features (tokens, part-of-speech tags and textstats, i.e. document-level statistics), an embeddings table for embedding features, and a lexicons table for lexicon features.
The store is used by a number of Textplumber components. The methods are documented below with examples, primarily as a reference and as a guide for implementing new components.
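As an orientation, the caching pattern a component follows looks roughly like the sketch below: look up cached features by text, compute only what is missing, and write the results back through the store. The tokenize, tag and stats helpers here are hypothetical placeholders, not Textplumber functions.

```python
def get_or_compute_tokens(feature_store, texts, tokenize, tag, stats):
    """Hypothetical helper: return cached tokens, computing and caching any misses."""
    cached = feature_store.get_tokens_from_texts(texts)
    for text, tokens in zip(texts, cached):
        if tokens is None:  # not in the store yet
            feature_store.buffered_update(
                text=text,
                tokens=tokenize(text),
                pos=tag(text),
                textstats=stats(text),
            )
    feature_store.flush()  # write any buffered updates to SQLite
    return feature_store.get_tokens_from_texts(texts)
```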
TextFeatureStore.dump
TextFeatureStore.dump (structure_only=False)
Outputs the structure or contents of the feature store (intended for debugging/development)
| | Type | Default | Details |
|---|---|---|---|
| structure_only | bool | False | if True, only show the schema of the feature store |
Features are retrieved with the hash column. Texts are hashed using MD5. Here is the structure of a store …
feature_store.dump(structure_only=True)
texts
('hash', 'TEXT', 1, None, 1)
('tokens', 'BLOB', 1, None, 0)
('pos', 'BLOB', 1, None, 0)
('textstats', 'BLOB', 1, None, 0)
embeddings
('hash', 'TEXT', 1, None, 1)
('embeddings', 'BLOB', 1, None, 0)
lexicons
('hash', 'TEXT', 1, None, 1)
('lexicons', 'BLOB', 1, None, 0)
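The hash values that appear in the dumps below can be reproduced with Python's hashlib (a sketch, assuming the text is UTF-8 encoded before hashing):

```python
import hashlib

text = "This is an example."
key = hashlib.md5(text.encode('utf-8')).hexdigest()
print(key)  # should match the hash column for this text in the dumps below
```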
TextFeatureStore.update
TextFeatureStore.update (text:str, tokens:list, pos:list, textstats:list)
Update (insert or replace) the feature store with the token, part-of-speech tag and textstat (document-level statistics) features for a specific text.
| | Type | Details |
|---|---|---|
| text | str | the text to update |
| tokens | list | the tokens to update |
| pos | list | the part of speech tags to update |
| textstats | list | the text statistics to update |
feature_store.update(="This is an example.",
text=["This", "is", "an", "example", "."],
tokens=["DET", "VERB", "DET", "NOUN", "PUNCT"],
pos=[1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11]
textstats )
feature_store.dump()
Table: texts (1)
| | hash | tokens | pos | textstats |
|---|---|---|---|---|
| 0 | 263fb1aa85489991a2ef832ef10308a0 | [This, is, an, example, .] | [DET, VERB, DET, NOUN, PUNCT] | [1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11] |
Table: embeddings (0)
Table: lexicons (0)
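Because rows are keyed by the text's hash, calling update again for the same text should replace the existing row rather than add a second one. A quick check, reusing the values from above (the texts table should still report a single row):

```python
feature_store.update(
    text="This is an example.",
    tokens=["This", "is", "an", "example", "."],
    pos=["DET", "VERB", "DET", "NOUN", "PUNCT"],
    textstats=[1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11]
)
feature_store.dump()  # the texts table should still contain one row
```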
TextFeatureStore.update_embeddings
TextFeatureStore.update_embeddings (texts:str, embeddings:list)
Update the feature store with embeddings for a list of texts.
| | Type | Details |
|---|---|---|
| texts | str | the texts to update |
| embeddings | list | the embeddings to update |
"This is an example."], [[0.1, 0.2, 0.3, 0.4, 0.5]])
feature_store.update_embeddings([ feature_store.dump()
Table: texts (1)
| | hash | tokens | pos | textstats |
|---|---|---|---|---|
| 0 | 263fb1aa85489991a2ef832ef10308a0 | [This, is, an, example, .] | [DET, VERB, DET, NOUN, PUNCT] | [1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11] |
Table: embeddings (1)
| | hash | embeddings |
|---|---|---|
| 0 | 263fb1aa85489991a2ef832ef10308a0 | [0.1, 0.2, 0.3, 0.4, 0.5] |
Table: lexicons (0)
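The embeddings table is keyed by the same text hash as the texts table, so the stored vector can be looked up per text with get (documented further below). With the state above, this should return the vector that was just written:

```python
feature_store.get("This is an example.", type='embeddings')
# expected: [0.1, 0.2, 0.3, 0.4, 0.5]
```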
TextFeatureStore.update_lexicons
TextFeatureStore.update_lexicons (texts:str, lexicons:list)
Update the feature store with lexicon features for a list of texts.
| | Type | Details |
|---|---|---|
| texts | str | the texts to update |
| lexicons | list | the lexicon scores to update |
"This is an example."], [[1, 7]])
feature_store.update_lexicons([ feature_store.dump()
Table: texts (1)
| | hash | tokens | pos | textstats |
|---|---|---|---|---|
| 0 | 263fb1aa85489991a2ef832ef10308a0 | [This, is, an, example, .] | [DET, VERB, DET, NOUN, PUNCT] | [1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11] |
Table: embeddings (1)
| | hash | embeddings |
|---|---|---|
| 0 | 263fb1aa85489991a2ef832ef10308a0 | [0.1, 0.2, 0.3, 0.4, 0.5] |
Table: lexicons (1)
| | hash | lexicons |
|---|---|---|
| 0 | 263fb1aa85489991a2ef832ef10308a0 | [1, 7] |
TextFeatureStore.empty
TextFeatureStore.empty ()
Clear the contents of the feature store.
feature_store.empty()
feature_store.dump()
Table: texts (0)
Table: embeddings (0)
Table: lexicons (0)
TextFeatureStore.buffered_update
TextFeatureStore.buffered_update (text:str, tokens:list, pos:list, textstats:list)
Buffer an update of the tokens, part-of-speech tags and text statistics for a text; buffered updates for multiple texts are written to the store when flush is called.
| | Type | Details |
|---|---|---|
| text | str | the text to update |
| tokens | list | the tokens to update |
| pos | list | the part of speech tags to update |
| textstats | list | the text statistics to update |
TextFeatureStore.flush
TextFeatureStore.flush ()
Flush the buffer to the database.
feature_store.buffered_update(
    text="This is an example.",
    tokens=["This", "is", "an", "example", "."],
    pos=["DET", "VERB", "DET", "NOUN", "PUNCT"],
    textstats=[2, 4, 6, 8, 10, 12, 2, 4, 6, 8, 10, 12]
)
print('Updates pending, not flushed ...')
# nothing added yet
feature_store.dump()

# flush adds buffered updates to the store
feature_store.flush()

print()
print('Updates flushed ...')
# now the store is updated
feature_store.dump()
Updates pending, not flushed ...
Table: texts (0)
Table: embeddings (0)
Table: lexicons (0)
Updates flushed ...
Table: texts (1)
| | hash | tokens | pos | textstats |
|---|---|---|---|---|
| 0 | 263fb1aa85489991a2ef832ef10308a0 | [This, is, an, example, .] | [DET, VERB, DET, NOUN, PUNCT] | [2, 4, 6, 8, 10, 12, 2, 4, 6, 8, 10, 12] |
Table: embeddings (0)
Table: lexicons (0)
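In a batch setting the buffered interface is typically used in a loop, with a single flush once all texts have been processed. A minimal sketch (the token, POS and textstat values here are placeholders, not real Textplumber output):

```python
texts = ["First placeholder text.", "Second placeholder text."]
for text in texts:
    feature_store.buffered_update(
        text=text,
        tokens=text.split(),             # placeholder tokenisation
        pos=["X"] * len(text.split()),   # placeholder POS tags
        textstats=[0.0] * 12,            # placeholder document-level statistics
    )
feature_store.flush()  # nothing is written to SQLite until this point
```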
TextFeatureStore.get
TextFeatureStore.get (text:str, type:str=None)
Get features for a text.
| | Type | Default | Details |
|---|---|---|---|
| text | str | | the text to get features for |
| type | str | None | the type of features to get - 'tokens', 'pos', 'textstats', 'embeddings', 'lexicons' |
| Returns | dict \| list | | the features for the text |
The get method can return a specific feature type for a text …
feature_store.get('This is an example.', type='tokens')
['This', 'is', 'an', 'example', '.']
or all features as a dict …
feature_store.get('This is an example.')
{'tokens': ['This', 'is', 'an', 'example', '.'],
'pos': ['DET', 'VERB', 'DET', 'NOUN', 'PUNCT'],
'textstats': [2, 4, 6, 8, 10, 12, 2, 4, 6, 8, 10, 12],
'embeddings': None,
'lexicons': None}
TextFeatureStore.get_features_from_texts_by_type
TextFeatureStore.get_features_from_texts_by_type (texts:list, type:str)
Get features for a list of texts by type; if there is no match for a text, None is returned for that text.
| | Type | Details |
|---|---|---|
| texts | list | the texts to get features for |
| type | str | the type of features to get - 'tokens', 'pos', 'textstats', 'embeddings', 'lexicons' |
| Returns | list | the features for the texts |
Rather than calling get_features_from_texts_by_type directly, use the specific method for the feature type …
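For example, with no filtering options, the following two calls should return the same token lists:

```python
texts = ["This is an example."]
by_type = feature_store.get_features_from_texts_by_type(texts, 'tokens')
direct = feature_store.get_tokens_from_texts(texts)
assert by_type == direct
```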
TextFeatureStore.get_tokens_from_texts
TextFeatureStore.get_tokens_from_texts (texts:list, lowercase:bool=False, min_token_length:int=0, remove_punctuation:bool=False, remove_numbers:bool=False)
Get (and optionally filter or transform) tokens for a list of texts.
| | Type | Default | Details |
|---|---|---|---|
| texts | list | | the texts to get tokens for |
| lowercase | bool | False | whether to return tokens as lowercase |
| min_token_length | int | 0 | the minimum token length to include |
| remove_punctuation | bool | False | whether to remove punctuation |
| remove_numbers | bool | False | whether to remove numbers |
| Returns | list | | the tokens for the texts |
feature_store.update(="This is example 2.",
text=["This", "is", "example", "2", "."],
tokens=["DET", "VERB", "NOUN", "NUM", "PUNCT"],
pos=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
textstats
)
# defaults - if features for a text are not in the store, None is returned
print(feature_store.get_tokens_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."]))
# lowercase
print(feature_store.get_tokens_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."],
=True))
lowercase# min token length
print(feature_store.get_tokens_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."],
=3))
min_token_length# remove punctuation
print(feature_store.get_tokens_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."],
=True))
remove_punctuation# remove numbers
print(feature_store.get_tokens_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."],
=True)) remove_numbers
[['This', 'is', 'an', 'example', '.'], ['This', 'is', 'example', '2', '.'], None]
[['this', 'is', 'an', 'example', '.'], ['this', 'is', 'example', '2', '.'], None]
[['This', 'example'], ['This', 'example'], None]
[['This', 'is', 'an', 'example'], ['This', 'is', 'example', '2'], None]
[['This', 'is', 'an', 'example', '.'], ['This', 'is', 'example', '.'], None]
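The filtering options can be combined; for example, the call below should reduce the first text to its lowercased content words (the expected output is an assumption based on the options documented above):

```python
print(feature_store.get_tokens_from_texts(
    ["This is an example."],
    lowercase=True,
    remove_punctuation=True,
    min_token_length=3))
# expected: [['this', 'example']]
```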
TextFeatureStore.get_textstats_from_texts
TextFeatureStore.get_textstats_from_texts (texts:list, columns_out=['tokens_count', 'sentences_count', 'characters_count', 'monosyllabic_words_relfreq', 'polysyllabic_words_relfreq', 'unique_tokens_relfreq', 'average_characters_per_token', 'average_tokens_per_sentence', 'characters_proportion_letters', 'characters_proportion_uppercase', 'hapax_legomena_count', 'hapax_legomena_to_unique'], columns_in=['tokens_count', 'sentences_count', 'characters_count', 'monosyllabic_words_relfreq', 'polysyllabic_words_relfreq', 'unique_tokens_relfreq', 'average_characters_per_token', 'average_tokens_per_sentence', 'characters_proportion_letters', 'characters_proportion_uppercase', 'hapax_legomena_count', 'hapax_legomena_to_unique'])
Get document-level text statistics for a list of texts.
| | Type | Default | Details |
|---|---|---|---|
| texts | list | | the texts to get text statistics for |
| columns_out | list | ['tokens_count', 'sentences_count', 'characters_count', 'monosyllabic_words_relfreq', 'polysyllabic_words_relfreq', 'unique_tokens_relfreq', 'average_characters_per_token', 'average_tokens_per_sentence', 'characters_proportion_letters', 'characters_proportion_uppercase', 'hapax_legomena_count', 'hapax_legomena_to_unique'] | the columns to return |
| columns_in | list | ['tokens_count', 'sentences_count', 'characters_count', 'monosyllabic_words_relfreq', 'polysyllabic_words_relfreq', 'unique_tokens_relfreq', 'average_characters_per_token', 'average_tokens_per_sentence', 'characters_proportion_letters', 'characters_proportion_uppercase', 'hapax_legomena_count', 'hapax_legomena_to_unique'] | the possible columns |
| Returns | list | | the text statistics for the texts |
To restrict to specific textstats use columns_out. The columns_in argument is provided to allow a different definition of available statistics.
# defaults - if there are no features for a text returns None
print(feature_store.get_textstats_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."]))

# restrict to specific columns - tokens_count, sentences_count ...
print(feature_store.get_textstats_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."],
    columns_out=['tokens_count', 'sentences_count']))
[[2, 4, 6, 8, 10, 12, 2, 4, 6, 8, 10, 12], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], None]
[[2, 4], [1, 2], None]
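The selection appears to be positional: each name in columns_out is matched against columns_in to pick the corresponding element out of the stored textstats vector. An illustration of that indexing using the values stored above (plain Python, not a call into the store):

```python
columns_in = ['tokens_count', 'sentences_count', 'characters_count',
              'monosyllabic_words_relfreq', 'polysyllabic_words_relfreq',
              'unique_tokens_relfreq', 'average_characters_per_token',
              'average_tokens_per_sentence', 'characters_proportion_letters',
              'characters_proportion_uppercase', 'hapax_legomena_count',
              'hapax_legomena_to_unique']
stored = [2, 4, 6, 8, 10, 12, 2, 4, 6, 8, 10, 12]  # textstats for "This is an example."
columns_out = ['tokens_count', 'sentences_count']
print([stored[columns_in.index(c)] for c in columns_out])  # [2, 4], matching the output above
```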
TextFeatureStore.get_pos_from_texts
TextFeatureStore.get_pos_from_texts (texts:list)
Get parts of speech for a list of texts.
| | Type | Details |
|---|---|---|
| texts | list | the texts to get part of speech tags for |
| Returns | list | the part of speech tags for the texts |
# if there are no features for a text returns None
print(feature_store.get_pos_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."]))
[['DET', 'VERB', 'DET', 'NOUN', 'PUNCT'], ['DET', 'VERB', 'NOUN', 'NUM', 'PUNCT'], None]
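The per-text tag sequences can be fed to standard scikit-learn text components by joining them into strings. A minimal sketch (this is an illustration of the idea, not how POSVectorizer is implemented):

```python
from sklearn.feature_extraction.text import CountVectorizer

pos_lists = feature_store.get_pos_from_texts(["This is an example.", "This is example 2."])
pos_docs = [' '.join(tags) for tags in pos_lists]  # e.g. 'DET VERB DET NOUN PUNCT'
vectorizer = CountVectorizer(lowercase=False, token_pattern=r"\S+", ngram_range=(1, 2))
X = vectorizer.fit_transform(pos_docs)  # counts of POS unigrams and bigrams
```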
TextFeatureStore.get_embeddings_from_texts
TextFeatureStore.get_embeddings_from_texts (texts:str)
Get embeddings for multiple texts.
| | Type | Details |
|---|---|---|
| texts | str | the texts to get embeddings for |
| Returns | list | the embeddings for the texts |
"This is an example.", "This is example 2."],[[0.9, -0.5, 3.0, 1.7], [0.5, 0.6, 0.8, -0.9]])
feature_store.update_embeddings([
# if there are no features for a text returns None
print(feature_store.get_embeddings_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."]))
[[0.9, -0.5, 3.0, 1.7], [0.5, 0.6, 0.8, -0.9], None]
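The returned vectors are plain Python lists, with None for texts that are not in the store, so for use as a feature matrix the misses are typically computed and cached first. A minimal sketch, assuming every text already has a cached embedding:

```python
import numpy as np

embeddings = feature_store.get_embeddings_from_texts(["This is an example.", "This is example 2."])
X = np.array(embeddings)  # shape (2, 4) using the vectors stored above
```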
TextFeatureStore.get_lexicons_from_texts
TextFeatureStore.get_lexicons_from_texts (texts:str)
Get lexicon features for multiple texts.
| | Type | Details |
|---|---|---|
| texts | str | the texts to get lexicons for |
| Returns | list | the lexicon features for the texts |
"This is an example.", "This is example 2."],[[1, 5], [7, 1]])
feature_store.update_embeddings([
# if there are no features for a text returns None
print(feature_store.get_embeddings_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."]))
[[1, 5], [7, 1], None]