TextFeatureStore
TextFeatureStore (path:str)
A class to store features extracted for a text classification pipeline and cache them to disk to avoid recomputing them.
path
str
where to save SQLite db to persist the feature store between runs
Check out TokensVectorizer, POSVectorizer, TextstatsTransformer and LexiconCountVectorizer for examples using a feature store with a scikit-learn pipeline.
Instantiating a TextFeatureStore creates an SQLite database with a texts table for core features (tokens, parts of speech, and textstats, i.e. document-level statistics), an embeddings table for embedding features, and a lexicons table for lexicon features. From Textplumber 0.0.9, a config table allows the store to auto-refresh when the preprocessor, the embedder, or important settings of these components change (e.g. the model used).
The store is used in a number of Textplumber components. The methods are documented below with examples, primarily as a reference or for implementing new components.
feature_store = TextFeatureStore(path='feature_store_example.sqlite')
TextFeatureStore.dump
TextFeatureStore.dump (structure_only=False)
Outputs the structure or contents of the feature store (intended for debugging/development)
structure_only
bool
False
if True, only show the schema of the feature store
Features are retrieved with the hash column. Texts are hashed using MD5. Here is the structure of a store …
feature_store.dump(structure_only=True)
config
('key', 'TEXT', 1, None, 1)
('value', 'TEXT', 1, None, 0)
texts
('hash', 'TEXT', 1, None, 1)
('tokens', 'BLOB', 1, None, 0)
('pos', 'BLOB', 1, None, 0)
('textstats', 'BLOB', 1, None, 0)
embeddings
('hash', 'TEXT', 1, None, 1)
('embeddings', 'BLOB', 1, None, 0)
lexicons
('hash', 'TEXT', 1, None, 1)
('lexicons', 'BLOB', 1, None, 0)
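The tuples in the dump above look like SQLite `PRAGMA table_info` rows without the leading column id, i.e. (name, type, notnull, default, primary key). As a rough illustration only, the following sketch builds tables that produce the same structure. The DDL is inferred from the dump output and is an assumption, not the library's actual statements; `SCHEMA` and `describe` are hypothetical names.

```python
import sqlite3

# Hypothetical DDL inferred from the dump output above;
# the statements Textplumber actually runs may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS config (
    key TEXT NOT NULL PRIMARY KEY,
    value TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS texts (
    hash TEXT NOT NULL PRIMARY KEY,
    tokens BLOB NOT NULL,
    pos BLOB NOT NULL,
    textstats BLOB NOT NULL
);
CREATE TABLE IF NOT EXISTS embeddings (
    hash TEXT NOT NULL PRIMARY KEY,
    embeddings BLOB NOT NULL
);
CREATE TABLE IF NOT EXISTS lexicons (
    hash TEXT NOT NULL PRIMARY KEY,
    lexicons BLOB NOT NULL
);
"""

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
conn.executescript(SCHEMA)


def describe(conn, table):
    """Return (name, type, notnull, default, pk) for each column of a table."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk); drop cid
    return [row[1:] for row in conn.execute(f"PRAGMA table_info({table})")]


for table in ("config", "texts", "embeddings", "lexicons"):
    print(table)
    for column in describe(conn, table):
        print(column)
```

Each feature table uses the hash as its primary key, which is what lets a later write for the same text replace the earlier row.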
TextFeatureStore.set_config
TextFeatureStore.set_config (key:str, value:str)
Updates a configuration key in the feature store.
key
str
the key to update
value
str
the value to set for the key
TextFeatureStore.get_config
TextFeatureStore.get_config (key:str)
Retrieves a configuration key from the feature store.
key
str
the key to retrieve
Returns
str
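To illustrate how a key/value config table can support the auto-refresh behaviour described earlier, here is a minimal sketch using plain `sqlite3`. It is not the library's implementation: the free functions, the `needs_refresh` helper, the `embedding_model` key, and returning `None` for a missing key are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE config (key TEXT NOT NULL PRIMARY KEY, value TEXT NOT NULL)")


def set_config(conn, key, value):
    """Insert the key, or overwrite its value if the key already exists."""
    conn.execute("INSERT OR REPLACE INTO config (key, value) VALUES (?, ?)", (key, value))
    conn.commit()


def get_config(conn, key):
    """Return the stored value, or None if the key is absent (an assumption)."""
    row = conn.execute("SELECT value FROM config WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None


def needs_refresh(conn, key, current_value):
    """Hypothetical check: True when a stored setting differs from the current one."""
    stored = get_config(conn, key)
    return stored is not None and stored != current_value


# "embedding_model" is a made-up key used only for this illustration
set_config(conn, "embedding_model", "model-a")
same_model = needs_refresh(conn, "embedding_model", "model-a")
changed_model = needs_refresh(conn, "embedding_model", "model-b")
print(same_model, changed_model)  # False True
```

A component following this pattern could clear the affected features when the check returns True, then record the new setting.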
TextFeatureStore.update
TextFeatureStore.update (text:str, tokens:list, pos:list, textstats:list)
Update (insert or replace) the feature store with the token, part-of-speech tag and textstat (document-level statistics) features for a specific text.
text
str
the text to update
tokens
list
the tokens to update
pos
list
the part of speech tags to update
textstats
list
the text statistics to update
feature_store.update(
    text="This is an example.",
    tokens=["This", "is", "an", "example", "."],
    pos=["DET", "VERB", "DET", "NOUN", "PUNCT"],
    textstats=[1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11]
)
feature_store.dump()  # show the contents of the store after the update
0
263fb1aa85489991a2ef832ef10308a0
[This, is, an, example, .]
[DET, VERB, DET, NOUN, PUNCT]
[1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11]
Table: embeddings (0)
Table: lexicons (0)
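The insert-or-replace behaviour keyed on an MD5 hash can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: the UTF-8 encoding before hashing, the use of `pickle` to serialise lists into BLOB columns, and the helper names `text_hash` and `update` are all hypothetical.

```python
import hashlib
import pickle
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE texts (hash TEXT NOT NULL PRIMARY KEY, tokens BLOB NOT NULL, "
    "pos BLOB NOT NULL, textstats BLOB NOT NULL)"
)


def text_hash(text):
    """MD5 hex digest of the text (UTF-8 encoding is an assumption)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def update(conn, text, tokens, pos, textstats):
    """Insert features for a text, replacing any existing row with the same hash."""
    conn.execute(
        "INSERT OR REPLACE INTO texts (hash, tokens, pos, textstats) VALUES (?, ?, ?, ?)",
        # pickle is an assumed serialisation format for the BLOB columns
        (text_hash(text), pickle.dumps(tokens), pickle.dumps(pos), pickle.dumps(textstats)),
    )
    conn.commit()


tokens = ["This", "is", "an", "example", "."]
pos = ["DET", "VERB", "DET", "NOUN", "PUNCT"]

# Two updates for the same text leave a single row holding the latest values
update(conn, "This is an example.", tokens, pos, [1, 3, 5, 7, 9, 11])
update(conn, "This is an example.", tokens, pos, [2, 4, 6, 8, 10, 12])

row_count = conn.execute("SELECT COUNT(*) FROM texts").fetchone()[0]
stored_stats = pickle.loads(conn.execute("SELECT textstats FROM texts").fetchone()[0])
print(row_count, stored_stats)  # 1 [2, 4, 6, 8, 10, 12]
```

Because the hash is the primary key, identical texts always map to the same row, so recomputed features overwrite stale ones rather than accumulating.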
TextFeatureStore.update_embeddings
TextFeatureStore.update_embeddings (texts:str, embeddings:list)
Update the feature store with embeddings for a list of texts.
texts
str
the texts to update
embeddings
list
the embeddings to update
feature_store.update_embeddings(["This is an example."], [[0.1, 0.2, 0.3, 0.4, 0.5]])
feature_store.dump()
0
263fb1aa85489991a2ef832ef10308a0
[This, is, an, example, .]
[DET, VERB, DET, NOUN, PUNCT]
[1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11]
0
263fb1aa85489991a2ef832ef10308a0
[0.1, 0.2, 0.3, 0.4, 0.5]
TextFeatureStore.update_lexicons
TextFeatureStore.update_lexicons (texts:str, lexicons:list)
Update the feature store with lexicon features for a list of texts.
texts
str
the texts to update
lexicons
list
the lexicon scores to update
feature_store.update_lexicons(["This is an example."], [[1, 7]])
feature_store.dump()
0
263fb1aa85489991a2ef832ef10308a0
[This, is, an, example, .]
[DET, VERB, DET, NOUN, PUNCT]
[1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11]
0
263fb1aa85489991a2ef832ef10308a0
[0.1, 0.2, 0.3, 0.4, 0.5]
0
263fb1aa85489991a2ef832ef10308a0
[1, 7]
TextFeatureStore.empty
TextFeatureStore.empty ()
Clear the contents of the feature store.
feature_store.empty()
feature_store.dump()
Table: texts (0)
Table: embeddings (0)
Table: lexicons (0)
TextFeatureStore.buffered_update
TextFeatureStore.buffered_update (text:str, tokens:list, pos:list, textstats:list)
Buffer an update of the tokens, part-of-speech tags and text statistics for a text. Buffered updates are only written to the store when flush is called, which suits updating multiple texts.
text
str
the text to update
tokens
list
the tokens to update
pos
list
the part of speech tags to update
textstats
list
the text statistics to update
TextFeatureStore.flush
TextFeatureStore.flush ()
Flush the buffer to the database.
feature_store.buffered_update(
    text="This is an example.",
    tokens=["This", "is", "an", "example", "."],
    pos=["DET", "VERB", "DET", "NOUN", "PUNCT"],
    textstats=[2, 4, 6, 8, 10, 12, 2, 4, 6, 8, 10, 12]
)
print('Updates pending, not flushed ...')
feature_store.dump()  # nothing added yet
feature_store.flush()  # flush adds the buffered updates to the store
print()
print('Updates flushed ...')
feature_store.dump()  # now the store is updated
Updates pending, not flushed ...
Table: config (1)
Table: texts (0)
Table: embeddings (0)
Table: lexicons (0)
Updates flushed ...
Table: config (1)
0
263fb1aa85489991a2ef832ef10308a0
[This, is, an, example, .]
[DET, VERB, DET, NOUN, PUNCT]
[2, 4, 6, 8, 10, 12, 2, 4, 6, 8, 10, 12]
Table: embeddings (0)
Table: lexicons (0)
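The buffer-then-flush pattern shown above can be sketched as follows. This is a hypothetical stand-in class written for illustration, not the Textplumber implementation; the `pickle` serialisation, the UTF-8 encoding before hashing, and the `count` helper are assumptions.

```python
import hashlib
import pickle
import sqlite3


class BufferedTextStore:
    """Hypothetical stand-in that collects rows in memory and writes them in one batch."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE texts (hash TEXT NOT NULL PRIMARY KEY, tokens BLOB NOT NULL, "
            "pos BLOB NOT NULL, textstats BLOB NOT NULL)"
        )
        self.buffer = []

    def buffered_update(self, text, tokens, pos, textstats):
        # Only append to the in-memory buffer; the database is untouched here
        key = hashlib.md5(text.encode("utf-8")).hexdigest()
        self.buffer.append(
            (key, pickle.dumps(tokens), pickle.dumps(pos), pickle.dumps(textstats))
        )

    def flush(self):
        # Write all pending rows in a single batch, then clear the buffer
        if not self.buffer:
            return
        self.conn.executemany(
            "INSERT OR REPLACE INTO texts (hash, tokens, pos, textstats) VALUES (?, ?, ?, ?)",
            self.buffer,
        )
        self.conn.commit()
        self.buffer = []

    def count(self):
        return self.conn.execute("SELECT COUNT(*) FROM texts").fetchone()[0]


store = BufferedTextStore()
store.buffered_update(
    "This is an example.",
    ["This", "is", "an", "example", "."],
    ["DET", "VERB", "DET", "NOUN", "PUNCT"],
    [2, 4, 6, 8, 10, 12],
)
store.buffered_update(
    "This is example 2.",
    ["This", "is", "example", "2", "."],
    ["DET", "VERB", "NOUN", "NUM", "PUNCT"],
    [1, 2, 3, 4, 5, 6],
)
rows_before_flush = store.count()
store.flush()
rows_after_flush = store.count()
print(rows_before_flush, rows_after_flush)  # 0 2
```

Batching writes this way trades a little memory for fewer database transactions, which tends to matter when a pipeline processes many texts in one pass.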
TextFeatureStore.get
TextFeatureStore.get (text:str, type:str=None)
Get features for a text.
text
str
the text to get features for
type
str
None
the type of features to get - ‘tokens’, ‘pos’, ‘textstats’, ‘embeddings’, ‘lexicons’
Returns
dict | list
the features for the text
The get method can return a specific feature type for a text …
feature_store.get('This is an example.', type='tokens')
['This', 'is', 'an', 'example', '.']
or all features as a dict …
feature_store.get('This is an example.')
{'tokens': ['This', 'is', 'an', 'example', '.'],
'pos': ['DET', 'VERB', 'DET', 'NOUN', 'PUNCT'],
'textstats': [2, 4, 6, 8, 10, 12, 2, 4, 6, 8, 10, 12],
'embeddings': None,
'lexicons': None}
TextFeatureStore.get_features_from_texts_by_type
TextFeatureStore.get_features_from_texts_by_type (texts:list, type:str)
Get features for a list of texts by type. If a text has no match, None is returned for that text.
texts
list
the texts to get features for
type
str
the type of features to get - ‘tokens’, ‘pos’, ‘textstats’, ‘embeddings’, ‘lexicons’
Returns
list
the features for the texts
Rather than calling get_features_from_texts_by_type, use the specific method for the feature type …
TextFeatureStore.get_tokens_from_texts
TextFeatureStore.get_tokens_from_texts (texts:list)
Get (and optionally filter or transform) tokens for a list of texts.
texts
list
the texts to get tokens for
Returns
list
the tokens for the texts
feature_store.update(
    text="This is example 2.",
    tokens=["This", "is", "example", "2", "."],
    pos=["DET", "VERB", "NOUN", "NUM", "PUNCT"],
    textstats=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
)
# defaults - if features for a text are not in the store, None is returned
print(feature_store.get_tokens_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."]))
[['This', 'is', 'an', 'example', '.'], ['This', 'is', 'example', '2', '.'], None]
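Since a missing text yields None, a component can use the result to work out which texts still need processing. The sketch below shows one way to do that with plain Python; `split_cached` is a hypothetical helper, and the `features` list simply copies the output printed above rather than calling the store.

```python
def split_cached(texts, features):
    """Separate texts with cached features from texts whose lookup returned None."""
    cached = {text: feats for text, feats in zip(texts, features) if feats is not None}
    missing = [text for text, feats in zip(texts, features) if feats is None]
    return cached, missing


texts = [
    "This is an example.",
    "This is example 2.",
    "This is an example that is not in the store.",
]
# Copied from the output above: the third text is not in the store
features = [
    ["This", "is", "an", "example", "."],
    ["This", "is", "example", "2", "."],
    None,
]

cached, missing = split_cached(texts, features)
print(missing)  # ['This is an example that is not in the store.']
```

Only the texts in `missing` would then need to be tokenized and written back to the store.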
TextFeatureStore.get_textstats_from_texts
TextFeatureStore.get_textstats_from_texts (texts:list,
columns_out=['tokens_count', 'sentences_count', 'characters_count', 'monosyllabic_words_relfreq', 'polysyllabic_words_relfreq', 'unique_tokens_relfreq', 'average_characters_per_token', 'average_tokens_per_sentence', 'characters_proportion_letters', 'characters_proportion_uppercase', 'hapax_legomena_count', 'hapax_legomena_to_unique'],
columns_in=['tokens_count', 'sentences_count', 'characters_count', 'monosyllabic_words_relfreq', 'polysyllabic_words_relfreq', 'unique_tokens_relfreq', 'average_characters_per_token', 'average_tokens_per_sentence', 'characters_proportion_letters', 'characters_proportion_uppercase', 'hapax_legomena_count', 'hapax_legomena_to_unique'])
Get document-level text statistics for a list of texts.
texts
list
the texts to get text statistics for
columns_out
list
[‘tokens_count’, ‘sentences_count’, ‘characters_count’, ‘monosyllabic_words_relfreq’, ‘polysyllabic_words_relfreq’, ‘unique_tokens_relfreq’, ‘average_characters_per_token’, ‘average_tokens_per_sentence’, ‘characters_proportion_letters’, ‘characters_proportion_uppercase’, ‘hapax_legomena_count’, ‘hapax_legomena_to_unique’]
the columns to return
columns_in
list
[‘tokens_count’, ‘sentences_count’, ‘characters_count’, ‘monosyllabic_words_relfreq’, ‘polysyllabic_words_relfreq’, ‘unique_tokens_relfreq’, ‘average_characters_per_token’, ‘average_tokens_per_sentence’, ‘characters_proportion_letters’, ‘characters_proportion_uppercase’, ‘hapax_legomena_count’, ‘hapax_legomena_to_unique’]
the possible columns
Returns
list
the text statistics for the texts
To restrict to specific textstats use columns_out. The columns_in argument is provided to allow a different definition of available statistics.
# defaults - if there are no features for a text, None is returned
print(feature_store.get_textstats_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."]))
# restrict to specific columns - tokens_count, sentences_count ...
print(feature_store.get_textstats_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."],
                                             columns_out=['tokens_count', 'sentences_count']))
[[2, 4, 6, 8, 10, 12, 2, 4, 6, 8, 10, 12], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], None]
[[2, 4], [1, 2], None]
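One plausible reading of columns_out and columns_in is that each stored textstats list is positional, with columns_in naming the positions and columns_out picking a subset by name. The sketch below illustrates that reading; `select_columns` is a hypothetical helper and the library's internal logic may differ.

```python
COLUMNS_IN = [
    "tokens_count", "sentences_count", "characters_count",
    "monosyllabic_words_relfreq", "polysyllabic_words_relfreq",
    "unique_tokens_relfreq", "average_characters_per_token",
    "average_tokens_per_sentence", "characters_proportion_letters",
    "characters_proportion_uppercase", "hapax_legomena_count",
    "hapax_legomena_to_unique",
]


def select_columns(textstats, columns_out, columns_in=COLUMNS_IN):
    """Pick the named statistics from a positional list; pass None through unchanged."""
    if textstats is None:
        return None
    # Look up the position of each requested column name in columns_in
    indices = [columns_in.index(name) for name in columns_out]
    return [textstats[i] for i in indices]


# Values copied from the example output above
stored = [
    [2, 4, 6, 8, 10, 12, 2, 4, 6, 8, 10, 12],
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    None,
]
selected = [select_columns(row, ["tokens_count", "sentences_count"]) for row in stored]
print(selected)  # [[2, 4], [1, 2], None]
```

Under this reading, passing a different columns_in would let a store populated with a different set of statistics be queried by name in the same way.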
TextFeatureStore.get_pos_from_texts
TextFeatureStore.get_pos_from_texts (texts:list)
Get parts of speech for a list of texts.
texts
list
the texts to get part of speech tags for
Returns
list
the part of speech tags for the texts
# if there are no features for a text, None is returned
print(feature_store.get_pos_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."]))
[['DET', 'VERB', 'DET', 'NOUN', 'PUNCT'], ['DET', 'VERB', 'NOUN', 'NUM', 'PUNCT'], None]
TextFeatureStore.get_embeddings_from_texts
TextFeatureStore.get_embeddings_from_texts (texts:str)
Get embeddings for multiple texts.
texts
str
the texts to get embeddings for
Returns
list
the embeddings for the texts
feature_store.update_embeddings(["This is an example.", "This is example 2."], [[0.9, -0.5, 3.0, 1.7], [0.5, 0.6, 0.8, -0.9]])
# if there are no features for a text, None is returned
print(feature_store.get_embeddings_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."]))
[[0.9, -0.5, 3.0, 1.7], [0.5, 0.6, 0.8, -0.9], None]
TextFeatureStore.get_lexicons_from_texts
TextFeatureStore.get_lexicons_from_texts (texts:str)
Get lexicon features for multiple texts.
texts
str
the texts to get lexicons for
Returns
list
feature_store.update_lexicons(["This is an example.", "This is example 2."], [[1, 5], [7, 1]])
# if there are no features for a text, None is returned
print(feature_store.get_lexicons_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."]))