- word_index
Dictionary mapping each word to its integer index.
print(tokenizer.word_index)
>>>
{'barber': 1, 'secret': 2, 'huge': 3, 'kept': 4, 'person': 5, 'word': 6, 'keeping': 7, 'good': 8, 'knew': 9, 'driving': 10, 'crazy': 11, 'went': 12, 'mountain': 13}
- word_counts
Counts the number of occurrences of each word.
print(tokenizer.word_counts)
>>>
OrderedDict([('barber', 8), ('person', 3), ('good', 1), ('huge', 5), ('knew', 1), ('secret', 6), ('kept', 4), ('word', 2), ('keeping', 2), ('driving', 1), ('crazy', 1), ('went', 1), ('mountain', 1)])
- texts_to_sequences
Converts each word in the texts to its integer index.
encoded=tokenizer.texts_to_sequences(preprocessed_sentences)
print(encoded)
>>>
[[1, 5], [1, 8, 5], [1, 3, 5], [9, 2], [2, 4, 3, 2], [3, 2], [1, 4, 6], [1, 4, 6], [1, 4, 2], [7, 7, 3, 2, 10, 1, 11], [1, 12, 3, 13]]
- pad_sequences
Pads the sequences into a 2D NumPy array (by default, zeros are prepended).
pad_sequences(encoded)
>>>
array([[ 0,  0,  0,  0,  0,  1,  5],
       [ 0,  0,  0,  0,  1,  8,  5],
       ...], dtype=int32)
- padding : pad either before or after
- truncating : remove values either at the beginning or at the end
pad_sequences(encoded, padding='post', truncating='post', value=len(tokenizer.word_index)+1)
>>>
array([[ 1,  5, 14, 14, 14, 14, 14],
       [ 1,  8,  5, 14, 14, 14, 14],
       ...], dtype=int32)
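The notes above cover padding and truncating but not maxlen, which fixes the row length. A minimal sketch with two hypothetical sequences (not the barber corpus):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

encoded = [[1, 5], [7, 7, 3, 2, 10, 1, 11]]

# maxlen fixes every row to the same length: shorter rows are
# zero-padded, longer rows are truncated (both at the front by default)
print(pad_sequences(encoded, maxlen=5))
# [[ 0  0  0  1  5]
#  [ 3  2 10  1 11]]
```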
- num_words
The maximum number of words to keep, based on word frequency.
Tokenizer(num_words=10)
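The line above constructs the tokenizer but does not show the effect of num_words. A minimal sketch with a hypothetical two-sentence corpus: word_index still records every word, and the limit is applied only when converting texts, keeping indices strictly below num_words.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Hypothetical toy corpus, not the barber corpus used elsewhere in these notes
texts = ["a a a b b c", "a b d"]

tokenizer = Tokenizer(num_words=3)
tokenizer.fit_on_texts(texts)

# The full vocabulary is still indexed...
print(tokenizer.word_index)  # {'a': 1, 'b': 2, 'c': 3, 'd': 4}

# ...but texts_to_sequences keeps only indices < num_words (here 1 and 2)
print(tokenizer.texts_to_sequences(texts))  # [[1, 1, 1, 2, 2], [1, 2]]
```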
- fit_on_texts
Builds the vocabulary from the texts, assigning lower integer indices to more frequent words. It returns nothing; the resulting indices are inspected via word_index.
tokenizer.fit_on_texts(sentences)
print(tokenizer.word_index)
>>>
{'barber': 1, 'secret': 2, 'huge': 3, 'kept': 4, 'person': 5, 'word': 6, 'keeping': 7, 'good': 8, 'knew': 9, 'driving': 10, 'crazy': 11, 'went': 12, 'mountain': 13}
- texts_to_matrix
Converts a list of texts to a NumPy matrix with one row per text and one column per word index.
- mode : one of "binary", "count", "tfidf", "freq"
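A minimal sketch of the modes, using a hypothetical toy corpus (not the barber corpus above). Column 0 of the matrix is reserved and always zero; column j corresponds to the word with index j.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["b b b a", "a c"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)  # word_index: {'b': 1, 'a': 2, 'c': 3}

# 'count' puts the raw occurrence count of each word in its column
print(tokenizer.texts_to_matrix(texts, mode='count'))
# [[0. 3. 1. 0.]
#  [0. 0. 1. 1.]]

# 'binary' records only presence/absence
print(tokenizer.texts_to_matrix(texts, mode='binary'))
# [[0. 1. 1. 0.]
#  [0. 0. 1. 1.]]
```

The 'freq' mode divides each count by the text's length, and 'tfidf' applies a TF-IDF weighting instead of raw counts.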