gensim, Scikit-learn, NLTK, TreebankWordTokenizer, WordPunctTokenizer, sent_tokenize, pos_tag, word_tokenize, NLP, text_to_word_sequence, Corpus

Naranjito 2021. 3. 5. 17:56

Corpus

- A collection of natural language data, the raw material for natural language processing (NLP).

gensim

- It is an open source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning.

Scikit-learn

- Short for SciPy Toolkit (SciKit). It is a machine learning library featuring various classification, regression, and clustering algorithms, including support vector machines.

NLTK

- The Natural Language Toolkit is a suite of libraries and programs for symbolic and statistical natural language processing of English, written in Python.

Word Tokenization

1. word_tokenize

Splits text into tokens word by word. Contractions are split (Don't becomes Do and n't), and hyphenated words are kept whole.

from nltk.tokenize import word_tokenize
print('word tokenize : ', word_tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop."))

>>>
word tokenize :  ['Do', "n't", 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', ',', 'Mr.', 'Jone', "'s", 'Orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry-shop', '.']

2. WordPunctTokenizer

Splits on punctuation (., ,, ?, ;, !). Every punctuation mark becomes its own token, so Don't becomes Don, ', and t.

from nltk.tokenize import WordPunctTokenizer
print('word punctuation tokenize : ',WordPunctTokenizer().tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop."))

>>>
word punctuation tokenize :  ['Don', "'", 't', 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', ',', 'Mr', '.', 'Jone', "'", 's', 'Orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', '-', 'shop', '.']
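The behavior above can be reproduced with a single regular expression. A minimal sketch, assuming the pattern `\w+|[^\w\s]+` that the NLTK documentation gives for WordPunctTokenizer; the function name `word_punct_tokenize` is hypothetical and only for illustration.

```python
import re

# Runs of word characters, OR runs of characters that are neither
# word characters nor whitespace (i.e. punctuation).
WORD_PUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokenize(text):
    """Approximate WordPunctTokenizer().tokenize(text) using only re."""
    return WORD_PUNCT_PATTERN.findall(text)


print(word_punct_tokenize("Don't be fooled, Mr. Jone's pastry-shop."))
```

This makes it easy to see why the apostrophe and the hyphen end up as separate tokens: neither is a word character, so each one breaks the surrounding word in two.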

3. text_to_word_sequence

Converts all letters to lowercase and removes punctuation marks, but preserves apostrophes, as in don't or jone's.

from tensorflow.keras.preprocessing.text import text_to_word_sequence
print('text to word sequence tokenize : ', text_to_word_sequence("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop."))

>>>
text to word sequence tokenize :  ["don't", 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', 'mr', "jone's", 'orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', 'shop']
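Roughly the same result can be obtained in plain Python. A minimal sketch, assuming the default filter string documented for Keras' text_to_word_sequence; the function name `simple_text_to_word_sequence` is hypothetical, and this is not the Keras implementation itself.

```python
# Default filter characters as documented for text_to_word_sequence.
# The apostrophe is not in this string, which is why "don't" survives.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_text_to_word_sequence(text, filters=FILTERS):
    """Lowercase the text, blank out filter characters, split on whitespace."""
    text = text.lower()
    # Map every filter character to a space so words stay separated.
    table = str.maketrans({ch: " " for ch in filters})
    return text.translate(table).split()


print(simple_text_to_word_sequence("Don't be fooled, Mr. Jone's pastry-shop."))
```

Because the hyphen is in the filter string, pastry-shop comes out as two tokens, unlike with word_tokenize.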

4. TreebankWordTokenizer

Keeps a hyphenated word as a single token.

Separates a clitic at the apostrophe, as in doesn't (does and n't).

from nltk.tokenize import TreebankWordTokenizer
tokenizer=TreebankWordTokenizer()
text="Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop."
print('Tree bank Word Tokenizer : ',tokenizer.tokenize(text))

>>>
Tree bank Word Tokenizer :  ['Do', "n't", 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', ',', 'Mr.', 'Jone', "'s", 'Orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry-shop', '.']
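The two rules above can be illustrated with a couple of regular expressions. This is only a rough sketch of those two rules, not the real TreebankWordTokenizer (which also handles punctuation, quotes, and many more contractions); the function name `split_clitics` is hypothetical.

```python
import re


def split_clitics(text):
    """Split off n't and 's clitics, then split on whitespace only."""
    # "Don't" -> "Do n't", "doesn't" -> "does n't"
    text = re.sub(r"(?i)(\w)(n't)\b", r"\1 \2", text)
    # "Jone's" -> "Jone 's"
    text = re.sub(r"(?i)(\w)('s)\b", r"\1 \2", text)
    # Splitting on whitespace leaves hyphenated words such as
    # "pastry-shop" untouched.
    return text.split()


print(split_clitics("Jone's pastry-shop doesn't close"))
```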

Sentence Tokenization

1. sent_tokenize

Splits a text containing multiple sentences into individual sentences.

from nltk.tokenize import sent_tokenize
text="Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop. I am actively looking for Ph.D. students. and you are a Ph.D student."
print('sentence tokenize :',sent_tokenize(text))

>>>
sentence tokenize : ["Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop.", 'I am actively looking for Ph.D. students.', 'and you are a Ph.D student.']
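To see why a dedicated sentence tokenizer is useful, compare it with a naive split on sentence-ending punctuation. A minimal sketch using only `re`; the function name `naive_sent_split` is hypothetical, and the sample text is a shortened version of the one above.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' followed by whitespace.

    No abbreviation handling, so "Mr." and "Ph.D." end a sentence here.
    """
    return re.split(r"(?<=[.!?])\s+", text.strip())


text = "Don't be fooled, Mr. Jone's Orphanage is cheery. I am looking for Ph.D. students."
for sentence in naive_sent_split(text):
    print(repr(sentence))
```

The naive version breaks the text after Mr. and after Ph.D., while the sent_tokenize output above keeps both abbreviations inside their sentences.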

Part of speech tagging

Each word is tagged with its part of speech.

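from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
text="Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop. I am actively looking for Ph.D. students. and you are a Ph.D student."
tokenized_sentence=word_tokenize(text)
# Print the (word, tag) pairs; the output below shows only the first sentence's tags.
print('Part of speech tagging : ', pos_tag(tokenized_sentence))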

>>>
Part of speech tagging :  [('Do', 'VBP'), ("n't", 'RB'), ('be', 'VB'), ('fooled', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('dark', 'NN'), ('sounding', 'VBG'), ('name', 'NN'), (',', ','), ('Mr.', 'NNP'), ('Jone', 'NNP'), ("'s", 'POS'), ('Orphanage', 'NN'), ('is', 'VBZ'), ('as', 'RB'), ('cheery', 'JJ'), ('as', 'IN'), ('cheery', 'NN'), ('goes', 'VBZ'), ('for', 'IN'), ('a', 'DT'), ('pastry-shop', 'NN'), ('.', '.')]
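The result is a list of (word, tag) tuples using Penn Treebank tags (NN noun, NNP proper noun, VB* verb, JJ adjective, RB adverb, and so on), so it can be filtered with ordinary Python. A minimal sketch with a few pairs copied from the output above; the helper name `words_with_tag_prefix` is hypothetical.

```python
# A few (word, tag) pairs copied from the pos_tag output above.
tagged = [
    ('Mr.', 'NNP'), ('Jone', 'NNP'), ("'s", 'POS'), ('Orphanage', 'NN'),
    ('is', 'VBZ'), ('as', 'RB'), ('cheery', 'JJ'),
]


def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


nouns = words_with_tag_prefix(tagged, 'NN')  # matches NN and NNP
verbs = words_with_tag_prefix(tagged, 'VB')  # matches VB, VBZ, VBN, ...
print(nouns)
print(verbs)
```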
