store

Store text features to avoid recomputing them.

source

TextFeatureStore

 TextFeatureStore (path:str)

A class to store features extracted for a text classification pipeline and cache them to disk to avoid recomputing them.

Type Details
path str where to save SQLite db to persist the feature store between runs

Check out TokensVectorizer, POSVectorizer, TextstatsTransformer and LexiconCountVectorizer for examples using a feature store with a scikit-learn pipeline.

Initiating a TextFeatureStore creates an SQLite database with a texts table for core features (tokens, part of speech tags, and textstats, i.e. document-level statistics), an embeddings table for embedding features, and a lexicons table for lexicon features.

The store is used by a number of Textplumber components. The methods are documented below with examples, primarily as a reference or as a guide for implementing new components.

feature_store = TextFeatureStore(path='feature_store_example.sqlite')

source

TextFeatureStore.dump

 TextFeatureStore.dump (structure_only=False)

Outputs the structure or contents of the feature store (intended for debugging and development).

Type Default Details
structure_only bool False if True, only show the schema of the feature store

Features are retrieved with the hash column. Texts are hashed using MD5. Here is the structure of a store …

feature_store.dump(structure_only=True)
texts
('hash', 'TEXT', 1, None, 1)
('tokens', 'BLOB', 1, None, 0)
('pos', 'BLOB', 1, None, 0)
('textstats', 'BLOB', 1, None, 0)

embeddings
('hash', 'TEXT', 1, None, 1)
('embeddings', 'BLOB', 1, None, 0)

lexicons
('hash', 'TEXT', 1, None, 1)
('lexicons', 'BLOB', 1, None, 0)
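Each row in the structure output above is a column description in the format returned by SQLite's PRAGMA table_info: name, declared type, NOT NULL flag, default value, and primary-key flag. A minimal sketch reproducing the texts table output with the standard sqlite3 module (the CREATE TABLE statement here is an assumption reconstructed from the dump, not the library's actual schema definition):

```python
import sqlite3

# in-memory database mimicking the store's texts table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE texts ("
    "hash TEXT NOT NULL PRIMARY KEY, "
    "tokens BLOB NOT NULL, "
    "pos BLOB NOT NULL, "
    "textstats BLOB NOT NULL)"
)
# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk);
# dropping cid gives the tuples shown in the dump output above
for row in conn.execute("PRAGMA table_info(texts)"):
    print(tuple(row)[1:])
```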

source

TextFeatureStore.update

 TextFeatureStore.update (text:str, tokens:list, pos:list, textstats:list)

Update (insert or replace) the feature store with the token, part of speech tag and textstat (document-level statistics) features for a specific text.

Type Details
text str the text to update
tokens list the tokens to update
pos list the part of speech tags to update
textstats list the text statistics to update
feature_store.update(
        text="This is an example.",
        tokens=["This", "is", "an", "example", "."],
        pos=["DET", "VERB", "DET", "NOUN", "PUNCT"],
        textstats=[1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11]
)
feature_store.dump()
Table: texts (1)
hash tokens pos textstats
0 263fb1aa85489991a2ef832ef10308a0 [This, is, an, example, .] [DET, VERB, DET, NOUN, PUNCT] [1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11]
Table: embeddings (0)
Table: lexicons (0)
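The hash keys shown in the dump can be reproduced with Python's hashlib, assuming the store applies MD5 to the UTF-8 encoded text (the exact encoding and any pre-hash normalisation are assumptions based on the note above):

```python
import hashlib

# compute the MD5 lookup key for a text, assuming UTF-8 encoding
text = "This is an example."
key = hashlib.md5(text.encode("utf-8")).hexdigest()
print(key)  # a 32-character hex digest used as the store's lookup key
```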

source

TextFeatureStore.update_embeddings

 TextFeatureStore.update_embeddings (texts:list, embeddings:list)

Update the feature store with embeddings for a list of texts.

Type Details
texts list the texts to update
embeddings list the embeddings to update
feature_store.update_embeddings(["This is an example."], [[0.1, 0.2, 0.3, 0.4, 0.5]])
feature_store.dump()
Table: texts (1)
hash tokens pos textstats
0 263fb1aa85489991a2ef832ef10308a0 [This, is, an, example, .] [DET, VERB, DET, NOUN, PUNCT] [1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11]
Table: embeddings (1)
hash embeddings
0 263fb1aa85489991a2ef832ef10308a0 [0.1, 0.2, 0.3, 0.4, 0.5]
Table: lexicons (0)

source

TextFeatureStore.update_lexicons

 TextFeatureStore.update_lexicons (texts:list, lexicons:list)

Update the feature store with lexicon features for a list of texts.

Type Details
texts list the texts to update
lexicons list the lexicon scores to update
feature_store.update_lexicons(["This is an example."], [[1, 7]])
feature_store.dump()
Table: texts (1)
hash tokens pos textstats
0 263fb1aa85489991a2ef832ef10308a0 [This, is, an, example, .] [DET, VERB, DET, NOUN, PUNCT] [1, 3, 5, 7, 9, 11, 1, 3, 5, 7, 9, 11]
Table: embeddings (1)
hash embeddings
0 263fb1aa85489991a2ef832ef10308a0 [0.1, 0.2, 0.3, 0.4, 0.5]
Table: lexicons (1)
hash lexicons
0 263fb1aa85489991a2ef832ef10308a0 [1, 7]

source

TextFeatureStore.empty

 TextFeatureStore.empty ()

Clear the contents of the feature store.

feature_store.empty()
feature_store.dump()
Table: texts (0)
Table: embeddings (0)
Table: lexicons (0)

source

TextFeatureStore.buffered_update

 TextFeatureStore.buffered_update (text:str, tokens:list, pos:list,
                                   textstats:list)

Buffer an update of the tokens, part of speech tags and text statistics (textstats) for a single text. Buffered updates are written to the database when flush is called.

Type Details
text str the text to update
tokens list the tokens to update
pos list the part of speech tags to update
textstats list the text statistics to update

source

TextFeatureStore.flush

 TextFeatureStore.flush ()

Flush the buffer to the database.

feature_store.buffered_update(
        text="This is an example.",
        tokens=["This", "is", "an", "example", "."],
        pos=["DET", "VERB", "DET", "NOUN", "PUNCT"],
        textstats=[2, 4, 6, 8, 10, 12, 2, 4, 6, 8, 10, 12]
)
print('Updates pending, not flushed ...')
feature_store.dump() # nothing added yet
feature_store.flush() # flush writes buffered updates to the store
print()
print('Updates flushed ...')
feature_store.dump() # now the store is updated
Updates pending, not flushed ...
Table: texts (0)
Table: embeddings (0)
Table: lexicons (0)

Updates flushed ...
Table: texts (1)
hash tokens pos textstats
0 263fb1aa85489991a2ef832ef10308a0 [This, is, an, example, .] [DET, VERB, DET, NOUN, PUNCT] [2, 4, 6, 8, 10, 12, 2, 4, 6, 8, 10, 12]
Table: embeddings (0)
Table: lexicons (0)

source

TextFeatureStore.get

 TextFeatureStore.get (text:str, type:str=None)

Get features for a text.

Type Default Details
text str the text to get features for
type str None the type of features to get - ‘tokens’, ‘pos’, ‘textstats’, ‘embeddings’, ‘lexicons’
Returns dict | list the features for the text

The get method can return a specific feature type for a text …

feature_store.get('This is an example.', type='tokens')
['This', 'is', 'an', 'example', '.']

or all features as a dict …

feature_store.get('This is an example.')
{'tokens': ['This', 'is', 'an', 'example', '.'],
 'pos': ['DET', 'VERB', 'DET', 'NOUN', 'PUNCT'],
 'textstats': [2, 4, 6, 8, 10, 12, 2, 4, 6, 8, 10, 12],
 'embeddings': None,
 'lexicons': None}

source

TextFeatureStore.get_features_from_texts_by_type

 TextFeatureStore.get_features_from_texts_by_type (texts:list, type:str)

Get features for a list of texts by type; if there is no match, None is returned for that text.

Type Details
texts list the texts to get features for
type str the type of features to get - ‘tokens’, ‘pos’, ‘textstats’, ‘embeddings’, ‘lexicons’
Returns list the features for the texts

Rather than calling get_features_from_texts_by_type directly, use the specific method for the feature type …
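The None-for-miss behaviour shared by all of these lookup methods can be pictured as a hash-keyed dictionary lookup. A simplified sketch of the pattern (a plain-Python illustration, not the library's implementation, which queries SQLite):

```python
import hashlib

def get_features_by_type(store: dict, texts: list, type: str) -> list:
    """Look up one feature type per text, returning None on a miss."""
    results = []
    for text in texts:
        key = hashlib.md5(text.encode("utf-8")).hexdigest()
        row = store.get(key)
        results.append(row.get(type) if row is not None else None)
    return results

# a toy in-memory store keyed by text hash
store = {
    hashlib.md5("This is an example.".encode("utf-8")).hexdigest(): {
        "tokens": ["This", "is", "an", "example", "."]
    }
}
print(get_features_by_type(store, ["This is an example.", "Not stored."], "tokens"))
# [['This', 'is', 'an', 'example', '.'], None]
```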


source

TextFeatureStore.get_tokens_from_texts

 TextFeatureStore.get_tokens_from_texts (texts:list, lowercase:bool=False,
                                         min_token_length:int=0,
                                         remove_punctuation:bool=False,
                                         remove_numbers:bool=False)

Get (and optionally filter or transform) tokens for a list of texts.

Type Default Details
texts list the texts to get tokens for
lowercase bool False whether to return tokens as lowercase
min_token_length int 0 the minimum token length to include
remove_punctuation bool False whether to remove punctuation
remove_numbers bool False whether to remove numbers
Returns list the tokens for the texts
feature_store.update(
        text="This is example 2.",
        tokens=["This", "is", "example", "2", "."],
        pos=["DET", "VERB", "NOUN", "NUM", "PUNCT"],
        textstats=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
)

# defaults - if features for a text are not in the store, None is returned
print(feature_store.get_tokens_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."]))
# lowercase
print(feature_store.get_tokens_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."],
                                    lowercase=True))
# min token length
print(feature_store.get_tokens_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."],
                                    min_token_length=3))
# remove punctuation
print(feature_store.get_tokens_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."],
                                    remove_punctuation=True))
# remove numbers
print(feature_store.get_tokens_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."],
                                    remove_numbers=True))
[['This', 'is', 'an', 'example', '.'], ['This', 'is', 'example', '2', '.'], None]
[['this', 'is', 'an', 'example', '.'], ['this', 'is', 'example', '2', '.'], None]
[['This', 'example'], ['This', 'example'], None]
[['This', 'is', 'an', 'example'], ['This', 'is', 'example', '2'], None]
[['This', 'is', 'an', 'example', '.'], ['This', 'is', 'example', '.'], None]
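The filtering options above can be reproduced on a plain token list. A sketch of equivalent logic using simple string methods (the library's exact definitions of punctuation and numbers may differ):

```python
def filter_tokens(tokens, lowercase=False, min_token_length=0,
                  remove_punctuation=False, remove_numbers=False):
    """Apply token filters in the style of get_tokens_from_texts."""
    out = []
    for token in tokens:
        # treat tokens with no alphanumeric characters as punctuation
        if remove_punctuation and not any(c.isalnum() for c in token):
            continue
        if remove_numbers and token.isdigit():
            continue
        if len(token) < min_token_length:
            continue
        out.append(token.lower() if lowercase else token)
    return out

tokens = ["This", "is", "example", "2", "."]
print(filter_tokens(tokens, lowercase=True))           # ['this', 'is', 'example', '2', '.']
print(filter_tokens(tokens, min_token_length=3))       # ['This', 'example']
print(filter_tokens(tokens, remove_punctuation=True))  # ['This', 'is', 'example', '2']
print(filter_tokens(tokens, remove_numbers=True))      # ['This', 'is', 'example', '.']
```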

source

TextFeatureStore.get_textstats_from_texts

 TextFeatureStore.get_textstats_from_texts (texts:list,
                                            columns_out=['tokens_count',
                                            'sentences_count',
                                            'characters_count',
                                            'monosyllabic_words_relfreq',
                                            'polysyllabic_words_relfreq',
                                            'unique_tokens_relfreq',
                                            'average_characters_per_token',
                                            'average_tokens_per_sentence',
                                            'characters_proportion_letters',
                                            'characters_proportion_uppercase',
                                            'hapax_legomena_count',
                                            'hapax_legomena_to_unique'],
                                            columns_in=['tokens_count',
                                            'sentences_count',
                                            'characters_count',
                                            'monosyllabic_words_relfreq',
                                            'polysyllabic_words_relfreq',
                                            'unique_tokens_relfreq',
                                            'average_characters_per_token',
                                            'average_tokens_per_sentence',
                                            'characters_proportion_letters',
                                            'characters_proportion_uppercase',
                                            'hapax_legomena_count',
                                            'hapax_legomena_to_unique'])

Get document-level text statistics for a list of texts.

Type Default Details
texts list the texts to get text statistics for
columns_out list [‘tokens_count’, ‘sentences_count’, ‘characters_count’, ‘monosyllabic_words_relfreq’, ‘polysyllabic_words_relfreq’, ‘unique_tokens_relfreq’, ‘average_characters_per_token’, ‘average_tokens_per_sentence’, ‘characters_proportion_letters’, ‘characters_proportion_uppercase’, ‘hapax_legomena_count’, ‘hapax_legomena_to_unique’] the columns to return
columns_in list [‘tokens_count’, ‘sentences_count’, ‘characters_count’, ‘monosyllabic_words_relfreq’, ‘polysyllabic_words_relfreq’, ‘unique_tokens_relfreq’, ‘average_characters_per_token’, ‘average_tokens_per_sentence’, ‘characters_proportion_letters’, ‘characters_proportion_uppercase’, ‘hapax_legomena_count’, ‘hapax_legomena_to_unique’] the possible columns
Returns list the text statistics for the texts

To restrict to specific textstats use columns_out. The columns_in argument is provided to allow a different definition of available statistics.

# defaults - if there are no features for a text returns None
print(feature_store.get_textstats_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."]))

# restrict to specific columns - tokens_count, sentences_count ...
print(feature_store.get_textstats_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."],
                                             columns_out = ['tokens_count', 'sentences_count']))
[[2, 4, 6, 8, 10, 12, 2, 4, 6, 8, 10, 12], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], None]
[[2, 4], [1, 2], None]
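The column selection can be understood as index-based slicing of each stored statistics vector, where columns_in defines the stored order. A sketch of that logic (a hypothetical helper, not the library's code):

```python
def select_columns(stats, columns_in, columns_out):
    """Pick the requested statistics by position in the stored order."""
    indices = [columns_in.index(col) for col in columns_out]
    # None marks a text whose features are not in the store
    return [None if row is None else [row[i] for i in indices] for row in stats]

columns_in = ["tokens_count", "sentences_count", "characters_count"]
stored = [[2, 4, 6], [1, 2, 3], None]
print(select_columns(stored, columns_in, ["tokens_count", "sentences_count"]))
# [[2, 4], [1, 2], None]
```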

source

TextFeatureStore.get_pos_from_texts

 TextFeatureStore.get_pos_from_texts (texts:list)

Get parts of speech for a list of texts.

Type Details
texts list the texts to get part of speech tags for
Returns list the part of speech tags for the texts
# if there are no features for a text returns None
print(feature_store.get_pos_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."]))
[['DET', 'VERB', 'DET', 'NOUN', 'PUNCT'], ['DET', 'VERB', 'NOUN', 'NUM', 'PUNCT'], None]

source

TextFeatureStore.get_embeddings_from_texts

 TextFeatureStore.get_embeddings_from_texts (texts:list)

Get embeddings for multiple texts.

Type Details
texts list the texts to get embeddings for
Returns list the embeddings for the texts
feature_store.update_embeddings(["This is an example.", "This is example 2."],[[0.9, -0.5, 3.0, 1.7], [0.5, 0.6, 0.8, -0.9]])

# if there are no features for a text returns None
print(feature_store.get_embeddings_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."]))
[[0.9, -0.5, 3.0, 1.7], [0.5, 0.6, 0.8, -0.9], None]

source

TextFeatureStore.get_lexicons_from_texts

 TextFeatureStore.get_lexicons_from_texts (texts:list)

Get lexicon features for multiple texts.

Type Details
texts list the texts to get lexicon features for
Returns list the lexicon features for the texts
feature_store.update_lexicons(["This is an example.", "This is example 2."], [[1, 5], [7, 1]])

# if there are no features for a text returns None
print(feature_store.get_lexicons_from_texts(["This is an example.", "This is example 2.", "This is an example that is not in the store."]))
[[1, 5], [7, 1], None]