
Normalization, WordNetLemmatizer, PorterStemmer, LancasterStemmer, Stopword

  • Normalization

Normalization integrates different surface forms of a word into a single canonical form; for example, 'US' and 'USA' can both be normalized to 'US'.
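
In the simplest case this is a dictionary lookup that folds known variants onto one canonical form. A minimal sketch (the mapping table below is a made-up example):

canonical = {'USA': 'US', 'U.S.': 'US', 'U.S.A.': 'US'}

tokens = ['US', 'USA', 'U.S.', 'Korea']
print([canonical.get(token, token) for token in tokens])

>>>
['US', 'US', 'US', 'Korea']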

 

1. WordNetLemmatizer

Lemmatization reduces different inflected forms of a word to its dictionary root (the lemma); for example, the lemma of 'am', 'are', and 'is' is 'be'.

from nltk.stem import WordNetLemmatizer

# requires: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
words = ['have', 'going', 'loves', 'lives', 'flies', 'dies', 'watched', 'has', 'starting']
print('before lemmatize : ', words)
print('after lemmatize : ', [lemmatizer.lemmatize(word) for word in words])

>>>
before lemmatize :  ['have', 'going', 'loves', 'lives', 'flies', 'dies', 'watched', 'has', 'starting']
after lemmatize :  ['have', 'going', 'love', 'life', 'fly', 'dy', 'watched', 'ha', 'starting']
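
Note that lemmatize() assumes the noun POS by default, which is why it returns odd lemmas such as 'dy' and 'ha' above. Passing the correct POS tag ('v' for verb) fixes this:

print(lemmatizer.lemmatize('dies', 'v'))
print(lemmatizer.lemmatize('has', 'v'))
print(lemmatizer.lemmatize('watched', 'v'))

>>>
die
have
watch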

 

2. PorterStemmer

It removes morphological affixes from words, leaving only the word stem.

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# requires: nltk.download('punkt')
stemmer = PorterStemmer()
sentence = "Complete in all things--names and heights and soundings--with the single exception of the red crosses and the written notes."
tokenized_sentence = word_tokenize(sentence)
print('before porterstemmer : ', tokenized_sentence)
print('after porterstemmer : ', [stemmer.stem(word) for word in tokenized_sentence])

>>>
before porterstemmer :  ['Complete', 'in', 'all', 'things', '--', 'names', 'and', 'heights', 'and', 'soundings', '--', 'with', 'the', 'single', 'exception', 'of', 'the', 'red', 'crosses', 'and', 'the', 'written', 'notes', '.']
after porterstemmer :  ['complet', 'in', 'all', 'thing', '--', 'name', 'and', 'height', 'and', 'sound', '--', 'with', 'the', 'singl', 'except', 'of', 'the', 'red', 'cross', 'and', 'the', 'written', 'note', '.']
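
The Porter algorithm is a fixed sequence of suffix-rewriting rules (for example ALIZE becomes AL, and ANCE is dropped), which is why the stems above are often not real dictionary words. A small illustration with three more words:

words = ['formalize', 'allowance', 'running']
print([stemmer.stem(word) for word in words])

>>>
['formal', 'allow', 'run']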

 

3. LancasterStemmer

It reduces words to the shortest possible stem and is more aggressive than PorterStemmer.

from nltk.stem import LancasterStemmer

# reuses tokenized_sentence from the PorterStemmer example above
stemmer = LancasterStemmer()
print('after LancasterStemmer : ', [stemmer.stem(word) for word in tokenized_sentence])

>>>
after LancasterStemmer :  ['complet', 'in', 'al', 'thing', '--', 'nam', 'and', 'height', 'and', 'sound', '--', 'with', 'the', 'singl', 'exceiv', 'of', 'the', 'red', 'cross', 'and', 'the', 'writ', 'not', '.']
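
Because the two stemmers use different rule sets, they can disagree noticeably on the same input. A quick side-by-side comparison (the word list is just an example):

from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
words = ['policy', 'organization', 'have']
print('porter : ', [porter.stem(word) for word in words])
print('lancaster : ', [lancaster.stem(word) for word in words])

>>>
porter :  ['polici', 'organ', 'have']
lancaster :  ['policy', 'org', 'hav']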

 

  • Stopword

Stopwords are words that appear frequently but carry little meaning, such as 'I', 'my', and 'me'. They are usually filtered out before further processing.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# requires: nltk.download('stopwords') and nltk.download('punkt')
example = "Complete in all things--names and heights and soundings--with the single exception of the red crosses and the written notes."
word_tokens = word_tokenize(example)
stop_words = set(stopwords.words('english'))
print([word for word in word_tokens if word not in stop_words])

>>>
['Complete', 'things', '--', 'names', 'heights', 'soundings', '--', 'single', 'exception', 'red', 'crosses', 'written', 'notes', '.']
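
Since stopwords.words('english') is just a list of lowercase strings, the set can be extended with domain-specific words, and tokens can be lowercased before comparison. A minimal sketch (the two added words are a made-up example):

stop_words = set(stopwords.words('english'))
stop_words.update(['red', 'single'])  # hypothetical domain-specific stopwords
print([word for word in word_tokens if word.lower() not in stop_words])

>>>
['Complete', 'things', '--', 'names', 'heights', 'soundings', '--', 'exception', 'crosses', 'written', 'notes', '.']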