- Normalization
Normalization unifies different surface forms of the same word into a single representation. For example, 'US' and 'USA' mean the same thing, so both can be normalized to 'US'.
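A minimal sketch of this kind of normalization uses a simple lookup table; the mapping and token list below are hypothetical examples chosen just for illustration:
norm_map = {'USA': 'US', 'U.S.': 'US'}  # hypothetical variant-to-canonical mapping
tokens = ['The', 'USA', 'and', 'the', 'US']
print([norm_map.get(token, token) for token in tokens])
>>>
['The', 'US', 'and', 'the', 'US']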
1. WordNetLemmatizer
Lemmatization reduces the different inflected forms of a word to its dictionary form (lemma). For example, the lemma of 'am', 'are', and 'is' is 'be'.
from nltk.stem import WordNetLemmatizer  # needs a one-time nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
words = ['have', 'going', 'loves', 'lives', 'flies', 'dies', 'watched', 'has', 'starting']
print('before lemmatize : ', words)
print('after lemmatize : ', [lemmatizer.lemmatize(word) for word in words])
>>>
before lemmatize : ['have', 'going', 'loves', 'lives', 'flies', 'dies', 'watched', 'has', 'starting']
after lemmatize : ['have', 'going', 'love', 'life', 'fly', 'dy', 'watched', 'ha', 'starting']
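Some results look wrong ('dies' becomes 'dy', 'has' becomes 'ha') because lemmatize() assumes the part of speech is a noun by default. Passing the correct POS tag gives much better lemmas; a minimal sketch that treats every word as a verb:
print('after lemmatize (pos=v) : ', [lemmatizer.lemmatize(word, pos='v') for word in words])
>>>
after lemmatize (pos=v) : ['have', 'go', 'love', 'live', 'fly', 'die', 'watch', 'have', 'start']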
2. PorterStemmer
It removes morphological affixes from words, leaving only the word stem. Because stemming applies fixed suffix-stripping rules, the result is not always a valid word.
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize  # needs a one-time nltk.download('punkt')
stemmer = PorterStemmer()
sentence = "Complete in all things--names and heights and soundings--with the single exception of the red crosses and the written notes."
tokenized_sentence = word_tokenize(sentence)
print('before porterstemmer : ', tokenized_sentence)
print('after porterstemmer : ', [stemmer.stem(word) for word in tokenized_sentence])
>>>
before porterstemmer : ['Complete', 'in', 'all', 'things', '--', 'names', 'and', 'heights', 'and', 'soundings', '--', 'with', 'the', 'single', 'exception', 'of', 'the', 'red', 'crosses', 'and', 'the', 'written', 'notes', '.']
after porterstemmer : ['complet', 'in', 'all', 'thing', '--', 'name', 'and', 'height', 'and', 'sound', '--', 'with', 'the', 'singl', 'except', 'of', 'the', 'red', 'cross', 'and', 'the', 'written', 'note', '.']
3. LancasterStemmer
It reduces words to the shortest possible stem and is more aggressive than PorterStemmer.
from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()
print('after LancasterStemmer : ', [stemmer.stem(word) for word in tokenized_sentence])
>>>
after LancasterStemmer : ['complet', 'in', 'al', 'thing', '--', 'nam', 'and', 'height', 'and', 'sound', '--', 'with', 'the', 'singl', 'exceiv', 'of', 'the', 'red', 'cross', 'and', 'the', 'writ', 'not', '.']
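Because the two stemmers apply different rule sets, they can stem the same word very differently, and neither result is guaranteed to be a real word. A quick side-by-side comparison (the outputs below are what recent NLTK versions typically produce):
from nltk.stem import PorterStemmer, LancasterStemmer
words = ['policy', 'doing', 'organization', 'have', 'going', 'love', 'lives', 'fly', 'dies', 'watched', 'has', 'starting']
print('porter : ', [PorterStemmer().stem(word) for word in words])
print('lancaster : ', [LancasterStemmer().stem(word) for word in words])
>>>
porter : ['polici', 'do', 'organ', 'have', 'go', 'love', 'live', 'fli', 'die', 'watch', 'ha', 'start']
lancaster : ['policy', 'doing', 'org', 'hav', 'going', 'lov', 'liv', 'fly', 'die', 'watch', 'has', 'start']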
- Stopwords
Stopwords are words that appear frequently but carry little meaning, such as 'I', 'my', and 'me'. NLTK ships a built-in stopword list for English.
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')
example = "Complete in all things--names and heights and soundings--with the single exception of the red crosses and the written notes."
word_tokens = word_tokenize(example)
stop_words = set(stopwords.words('english'))
print([word for word in word_tokens if word not in stop_words])
>>>
['Complete', 'things', '--', 'names', 'heights', 'soundings', '--', 'single', 'exception', 'red', 'crosses', 'written', 'notes', '.']
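The default stopword list can also be extended with domain-specific words. A minimal sketch, where 'red' and 'single' are hypothetical additions chosen just for illustration (note that matching is case-sensitive, so tokens are often lowercased before filtering in practice):
stop_words.update({'red', 'single'})  # hypothetical domain-specific stopwords
print([word for word in word_tokens if word not in stop_words])
>>>
['Complete', 'things', '--', 'names', 'heights', 'soundings', '--', 'exception', 'crosses', 'written', 'notes', '.']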