Deep Learning/Tensorflow

keras-Tokenizer

Naranjito 2021. 3. 8. 17:25
  • word_index

Grants an integer index to each word; more frequent words receive lower indices.

print(tokenizer.word_index)

>>>
{'barber': 1, 'secret': 2, 'huge': 3, 'kept': 4, 'person': 5, 'word': 6, 'keeping': 7, 'good': 8, 'knew': 9, 'driving': 10, 'crazy': 11, 'went': 12, 'mountain': 13}
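A minimal, self-contained sketch of how `word_index` is produced, using a made-up mini-corpus (the corpus below is illustrative, not the one from the output above):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Hypothetical mini-corpus for illustration
corpus = ['the cat sat', 'the cat ran', 'the dog ran']

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)

# Indices start at 1; the most frequent word gets index 1
print(tokenizer.word_index)
```

Note that index 0 is never assigned to a word; it is reserved (typically used for padding).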

 

  • word_counts

Counts the occurrences of each word in the corpus.

print(tokenizer.word_counts)

>>>
OrderedDict([('barber', 8), ('person', 3), ('good', 1), ('huge', 5), ('knew', 1), ('secret', 6), ('kept', 4), ('word', 2), ('keeping', 2), ('driving', 1), ('crazy', 1), ('went', 1), ('mountain', 1)])
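A small sketch showing that `word_counts` holds raw frequencies (not indices), again on a hypothetical mini-corpus:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Hypothetical mini-corpus for illustration
tokenizer = Tokenizer()
tokenizer.fit_on_texts(['the cat sat', 'the cat ran'])

# word_counts stores raw frequencies, in order of first appearance
print(tokenizer.word_counts)
```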

 

  • texts_to_sequences

Converts each text to a sequence of word indices.

encoded=tokenizer.texts_to_sequences(preprocessed_sentences)
print(encoded)

>>>
[[1, 5], [1, 8, 5], [1, 3, 5], [9, 2], [2, 4, 3, 2], [3, 2], [1, 4, 6], [1, 4, 6], [1, 4, 2], [7, 7, 3, 2, 10, 1, 11], [1, 12, 3, 13]]
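Words not seen during fitting are silently dropped by `texts_to_sequences`, unless an `oov_token` was set, in which case they map to the reserved OOV index. A sketch with a hypothetical corpus:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Hypothetical mini-corpus for illustration
corpus = ['the cat sat', 'the dog ran']

# oov_token reserves index 1 for out-of-vocabulary words
tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(corpus)

# 'flew' was never seen during fitting, so it maps to the <OOV> index (1)
seqs = tokenizer.texts_to_sequences(['the cat flew'])
print(seqs)
```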

 

  • pad_sequences

Pads the sequences to equal length and returns a 2D NumPy array. By default, zeros are added at the beginning of each sequence.

pad_sequences(encoded)

>>>
array([[ 0,  0,  0,  0,  0,  1,  5],
       [ 0,  0,  0,  0,  1,  8,  5],
       ...], dtype=int32)

- padding : 'pre' (default) or 'post'; pad before or after each sequence

- truncating : 'pre' (default) or 'post'; remove values from the beginning or the end of sequences longer than maxlen

pad_sequences(encoded, padding='post', truncating='post', value=len(tokenizer.word_index)+1)

>>>
array([[ 1,  5, 14, 14, 14, 14, 14],
       [ 1,  8,  5, 14, 14, 14, 14],
       ...], dtype=int32)
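A runnable sketch of `maxlen` together with `padding` and `truncating`, on made-up sequences:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical sequences of varying length
seqs = [[1, 5], [1, 8, 5], [7, 7, 3, 2]]

# maxlen fixes the row length; 'post' pads and truncates at the end
padded = pad_sequences(seqs, maxlen=3, padding='post', truncating='post')
print(padded)
# [[1 5 0]
#  [1 8 5]
#  [7 7 3]]
```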

 

  • num_words

The maximum number of words to keep, based on word frequency. Only the most common num_words - 1 words are kept, since index 0 is reserved.

Tokenizer(num_words=10)
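A sketch of what `num_words` actually does (hypothetical corpus): `word_index` still lists every word, but encoding keeps only the most frequent ones.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Hypothetical mini-corpus for illustration
corpus = ['the cat sat', 'the cat ran', 'the dog ran']

# num_words=3 keeps only indices 1 and 2 (index 0 is reserved)
tokenizer = Tokenizer(num_words=3)
tokenizer.fit_on_texts(corpus)

# word_index still lists the full vocabulary...
print(tokenizer.word_index)
# ...but texts_to_sequences drops words whose index is >= num_words
print(tokenizer.texts_to_sequences(['the dog sat']))
```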

 

  • fit_on_texts

Builds the vocabulary from the given texts, assigning lower integer indices to more frequent words.

tokenizer.fit_on_texts(sentences)
print(tokenizer.word_index)

>>>
{'barber': 1, 'secret': 2, 'huge': 3, 'kept': 4, 'person': 5, 'word': 6, 'keeping': 7, 'good': 8, 'knew': 9, 'driving': 10, 'crazy': 11, 'went': 12, 'mountain': 13}
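`fit_on_texts` can also be called more than once; counts accumulate across calls and the indices are recomputed. A small sketch with made-up texts:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
# Counts accumulate across calls: red=2, blue=1 after the first call...
tokenizer.fit_on_texts(['red red blue'])
# ...then blue=4 after the second, so blue ends up with the lower index
tokenizer.fit_on_texts(['blue blue blue'])

print(tokenizer.word_index)  # {'blue': 1, 'red': 2}
```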

 

  • texts_to_matrix

Converts a list of texts to a NumPy matrix, with one row per text and one column per word index.

- mode : one of "binary", "count", "tfidf", "freq"
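A sketch of `texts_to_matrix` in "count" mode, on a hypothetical corpus; column 0 is unused because index 0 is reserved:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Hypothetical mini-corpus for illustration
tokenizer = Tokenizer()
tokenizer.fit_on_texts(['the cat sat', 'the dog'])

# One row per text; column i holds the count of the word with index i
m = tokenizer.texts_to_matrix(['the the cat'], mode='count')
print(m)  # [[0. 2. 1. 0. 0.]]
```

With mode="binary" each column would instead hold 1 if the word occurs at all, and "tfidf" / "freq" give weighted variants.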