- Normalization
Normalization unifies different surface forms of the same word into a single representation. For example, 'US' and 'USA' mean the same thing, so both can be normalized to 'US'.
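A minimal sketch of this kind of normalization uses a simple lookup table; the mapping and token list below are hypothetical examples chosen just for illustration:
norm_map = {'USA': 'US', 'U.S.': 'US'}  # hypothetical variant-to-canonical mapping
tokens = ['The', 'USA', 'and', 'the', 'US']
print([norm_map.get(token, token) for token in tokens])
>>>
['The', 'US', 'and', 'the', 'US']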
1. WordNetLemmatizer
Lemmatization reduces the different inflected forms of a word to its dictionary form (lemma). For example, the lemma of 'am', 'are', and 'is' is 'be'.
from nltk.stem import WordNetLemmatizer  # needs a one-time nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
words = ['have', 'going', 'loves', 'lives', 'flies', 'dies', 'watched', 'has', 'starting']
print('before lemmatize : ', words)
print('after lemmatize : ', [lemmatizer.lemmatize(word) for word in words])
>>>
before lemmatize : ['have', 'going', 'loves', 'lives', 'flies', 'dies', 'watched', 'has', 'starting']
after lemmatize : ['have', 'going', 'love', 'life', 'fly', 'dy', 'watched', 'ha', 'starting']
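Some results look wrong ('dies' becomes 'dy', 'has' becomes 'ha') because lemmatize() assumes the part of speech is a noun by default. Passing the correct POS tag gives much better lemmas; a minimal sketch that treats every word as a verb:
print('after lemmatize (pos=v) : ', [lemmatizer.lemmatize(word, pos='v') for word in words])
>>>
after lemmatize (pos=v) : ['have', 'go', 'love', 'live', 'fly', 'die', 'watch', 'have', 'start']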
2. PorterStemmer
It removes morphological affixes from words, leaving only the word stem. Because stemming applies fixed suffix-stripping rules, the result is not always a valid word.
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize  # needs a one-time nltk.download('punkt')
stemmer = PorterStemmer()
sentence = "Complete in all things--names and heights and soundings--with the single exception of the red crosses and the written notes."
tokenized_sentence = word_tokenize(sentence)
print('before porterstemmer : ', tokenized_sentence)
print('after porterstemmer : ', [stemmer.stem(word) for word in tokenized_sentence])
>>>
before porterstemmer : ['Complete', 'in', 'all', 'things', '--', 'names', 'and', 'heights', 'and', 'soundings', '--', 'with', 'the', 'single', 'exception', 'of', 'the', 'red', 'crosses', 'and', 'the', 'written', 'notes', '.']
after porterstemmer : ['complet', 'in', 'all', 'thing', '--', 'name', 'and', 'height', 'and', 'sound', '--', 'with', 'the', 'singl', 'except', 'of', 'the', 'red', 'cross', 'and', 'the', 'written', 'note', '.']
3. LancasterStemmer
It reduces words to the shortest possible stem and is more aggressive than PorterStemmer.
from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()
print('after LancasterStemmer : ', [stemmer.stem(word) for word in tokenized_sentence])
>>>
after LancasterStemmer : ['complet', 'in', 'al', 'thing', '--', 'nam', 'and', 'height', 'and', 'sound', '--', 'with', 'the', 'singl', 'exceiv', 'of', 'the', 'red', 'cross', 'and', 'the', 'writ', 'not', '.']
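Because the two stemmers apply different rule sets, they can stem the same word very differently, and neither result is guaranteed to be a real word. A quick side-by-side comparison (the outputs below are what recent NLTK versions typically produce):
from nltk.stem import PorterStemmer, LancasterStemmer
words = ['policy', 'doing', 'organization', 'have', 'going', 'love', 'lives', 'fly', 'dies', 'watched', 'has', 'starting']
print('porter : ', [PorterStemmer().stem(word) for word in words])
print('lancaster : ', [LancasterStemmer().stem(word) for word in words])
>>>
porter : ['polici', 'do', 'organ', 'have', 'go', 'love', 'live', 'fli', 'die', 'watch', 'ha', 'start']
lancaster : ['policy', 'doing', 'org', 'hav', 'going', 'lov', 'liv', 'fly', 'die', 'watch', 'has', 'start']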
- Stopwords
Stopwords are words that appear frequently but carry little meaning, such as 'I', 'my', and 'me'. NLTK ships a built-in stopword list for English.
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')
example = "Complete in all things--names and heights and soundings--with the single exception of the red crosses and the written notes."
word_tokens = word_tokenize(example)
stop_words = set(stopwords.words('english'))
print([word for word in word_tokens if word not in stop_words])
>>>
['Complete', 'things', '--', 'names', 'heights', 'soundings', '--', 'single', 'exception', 'red', 'crosses', 'written', 'notes', '.']
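The default stopword list can also be extended with domain-specific words. A minimal sketch, where 'red' and 'single' are hypothetical additions chosen just for illustration (note that matching is case-sensitive, so tokens are often lowercased before filtering in practice):
stop_words.update({'red', 'single'})  # hypothetical domain-specific stopwords
print([word for word in word_tokens if word not in stop_words])
>>>
['Complete', 'things', '--', 'names', 'heights', 'soundings', '--', 'exception', 'crosses', 'written', 'notes', '.']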