
Normalization, WordNetLemmatizer, PorterStemmer, LancasterStemmer, Stopword

  • Normalization

Normalization integrates different surface forms of a word into a single canonical form; for example, 'US' and 'USA' can both be normalized to 'US'.
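
In the simplest case this is a dictionary lookup that folds known variants onto one canonical form. A minimal sketch (the mapping table below is a made-up example):

canonical = {'USA': 'US', 'U.S.': 'US', 'U.S.A.': 'US'}

tokens = ['US', 'USA', 'U.S.', 'Korea']
print([canonical.get(token, token) for token in tokens])

>>>
['US', 'US', 'US', 'Korea']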

 

1. WordNetLemmatizer

Lemmatization reduces different inflected forms of a word to its dictionary root (the lemma); for example, the lemma of 'am', 'are', and 'is' is 'be'.

from nltk.stem import WordNetLemmatizer

# requires: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
words = ['have', 'going', 'loves', 'lives', 'flies', 'dies', 'watched', 'has', 'starting']
print('before lemmatize : ', words)
print('after lemmatize : ', [lemmatizer.lemmatize(word) for word in words])

>>>
before lemmatize :  ['have', 'going', 'loves', 'lives', 'flies', 'dies', 'watched', 'has', 'starting']
after lemmatize :  ['have', 'going', 'love', 'life', 'fly', 'dy', 'watched', 'ha', 'starting']
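
Note that lemmatize() assumes the noun POS by default, which is why it returns odd lemmas such as 'dy' and 'ha' above. Passing the correct POS tag ('v' for verb) fixes this:

print(lemmatizer.lemmatize('dies', 'v'))
print(lemmatizer.lemmatize('has', 'v'))
print(lemmatizer.lemmatize('watched', 'v'))

>>>
die
have
watch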

 

2. PorterStemmer

It removes morphological affixes from words, leaving only the word stem.

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# requires: nltk.download('punkt')
stemmer = PorterStemmer()
sentence = "Complete in all things--names and heights and soundings--with the single exception of the red crosses and the written notes."
tokenized_sentence = word_tokenize(sentence)
print('before porterstemmer : ', tokenized_sentence)
print('after porterstemmer : ', [stemmer.stem(word) for word in tokenized_sentence])

>>>
before porterstemmer :  ['Complete', 'in', 'all', 'things', '--', 'names', 'and', 'heights', 'and', 'soundings', '--', 'with', 'the', 'single', 'exception', 'of', 'the', 'red', 'crosses', 'and', 'the', 'written', 'notes', '.']
after porterstemmer :  ['complet', 'in', 'all', 'thing', '--', 'name', 'and', 'height', 'and', 'sound', '--', 'with', 'the', 'singl', 'except', 'of', 'the', 'red', 'cross', 'and', 'the', 'written', 'note', '.']
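
The Porter algorithm is a fixed sequence of suffix-rewriting rules (for example ALIZE becomes AL, and ANCE is dropped), which is why the stems above are often not real dictionary words. A small illustration with three more words:

words = ['formalize', 'allowance', 'running']
print([stemmer.stem(word) for word in words])

>>>
['formal', 'allow', 'run']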

 

3. LancasterStemmer

It reduces words to the shortest possible stem and is more aggressive than PorterStemmer.

from nltk.stem import LancasterStemmer

# reuses tokenized_sentence from the PorterStemmer example above
stemmer = LancasterStemmer()
print('after LancasterStemmer : ', [stemmer.stem(word) for word in tokenized_sentence])

>>>
after LancasterStemmer :  ['complet', 'in', 'al', 'thing', '--', 'nam', 'and', 'height', 'and', 'sound', '--', 'with', 'the', 'singl', 'exceiv', 'of', 'the', 'red', 'cross', 'and', 'the', 'writ', 'not', '.']
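
Because the two stemmers use different rule sets, they can disagree noticeably on the same input. A quick side-by-side comparison (the word list is just an example):

from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
words = ['policy', 'organization', 'have']
print('porter : ', [porter.stem(word) for word in words])
print('lancaster : ', [lancaster.stem(word) for word in words])

>>>
porter :  ['polici', 'organ', 'have']
lancaster :  ['policy', 'org', 'hav']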

 

  • Stopword

Stopwords are words that appear frequently but carry little meaning, such as 'I', 'my', and 'me'. They are usually filtered out before further processing.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# requires: nltk.download('stopwords') and nltk.download('punkt')
example = "Complete in all things--names and heights and soundings--with the single exception of the red crosses and the written notes."
word_tokens = word_tokenize(example)
stop_words = set(stopwords.words('english'))
print([word for word in word_tokens if word not in stop_words])

>>>
['Complete', 'things', '--', 'names', 'heights', 'soundings', '--', 'single', 'exception', 'red', 'crosses', 'written', 'notes', '.']
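
Since stopwords.words('english') is just a list of lowercase strings, the set can be extended with domain-specific words, and tokens can be lowercased before comparison. A minimal sketch (the two added words are a made-up example):

stop_words = set(stopwords.words('english'))
stop_words.update(['red', 'single'])  # hypothetical domain-specific stopwords
print([word for word in word_tokens if word.lower() not in stop_words])

>>>
['Complete', 'things', '--', 'names', 'heights', 'soundings', '--', 'exception', 'crosses', 'written', 'notes', '.']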