- Corpus
- A collection of natural language (text) data used for analysis.
- NLP
- Natural Language Processing
- gensim
- An open-source library for unsupervised topic modeling and natural language processing, built on modern statistical machine learning.
- Scikit-learn
- A machine learning library that began as a SciPy Toolkit (SciKit). It features various classification, regression, and clustering algorithms, including support vector machines.
- NLTK
- The Natural Language Toolkit is a suite of libraries and programs for symbolic and statistical natural language processing of English, written in the Python programming language.
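The NLTK tokenizers and taggers used below rely on downloadable model data; a typical one-time setup (resource names are NLTK's standard identifiers) looks like:

```shell
pip install nltk
# 'punkt' is used by word_tokenize / sent_tokenize,
# 'averaged_perceptron_tagger' by pos_tag
python -c "import nltk; nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')"
```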
- Word Tokenization
1. word_tokenize
Splits text into word-level tokens; note that contractions such as Don't are split into Do and n't.
from nltk.tokenize import word_tokenize
print('word tokenize : ', word_tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop."))
>>>
word tokenize : ['Do', "n't", 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', ',', 'Mr.', 'Jone', "'s", 'Orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry-shop', '.']
2. WordPunctTokenizer
Splits on punctuation characters (., ,, ?, ;, !), treating each punctuation run as its own token.
from nltk.tokenize import WordPunctTokenizer
print('word punctuation tokenize : ',WordPunctTokenizer().tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop."))
>>>
word punctuation tokenize : ['Don', "'", 't', 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', ',', 'Mr', '.', 'Jone', "'", 's', 'Orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', '-', 'shop', '.']
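The behavior above can be reproduced with a single regular expression that matches either a run of word characters or a run of non-space punctuation (a sketch of the idea, not the library's internals):

```python
import re

def word_punct_tokenize(text):
    # A run of word characters, or a run of non-whitespace punctuation
    return re.findall(r"\w+|[^\w\s]+", text)

print(word_punct_tokenize("Don't be fooled by the dark sounding name."))
# → ['Don', "'", 't', 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', '.']
```

This is why Don't becomes three tokens: the apostrophe is punctuation, so it splits the surrounding word characters.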
3. text_to_word_sequence
Converts all letters to lowercase and removes punctuation marks, but the apostrophe in words such as don't or jone's is preserved.
from tensorflow.keras.preprocessing.text import text_to_word_sequence
print('text to word sequence tokenize : ', text_to_word_sequence("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop."))
>>>
text to word sequence tokenize : ["don't", 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', 'mr', "jone's", 'orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', 'shop']
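The lowercase-and-strip behavior can be approximated without Keras. A minimal sketch (simple_word_sequence is a hypothetical helper written for illustration, not part of any library):

```python
import re

def simple_word_sequence(text):
    # Lowercase, then keep runs of letters/digits, allowing internal apostrophes
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())

print(simple_word_sequence("Don't be fooled, Mr. Jone's pastry-shop is cheery."))
# → ["don't", 'be', 'fooled', 'mr', "jone's", 'pastry', 'shop', 'is', 'cheery']
```

As with text_to_word_sequence, the hyphen in pastry-shop is dropped, so the word splits in two, while the apostrophes in don't and jone's survive.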
4. TreebankWordTokenizer
Keeps hyphenated words such as pastry-shop as one token.
Separates clitics marked with an apostrophe, such as the n't in doesn't.
from nltk.tokenize import TreebankWordTokenizer
tokenizer=TreebankWordTokenizer()
text="Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop."
print('Tree bank Word Tokenizer : ',tokenizer.tokenize(text))
>>>
Tree bank Word Tokenizer : ['Do', "n't", 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', ',', 'Mr.', 'Jone', "'s", 'Orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry-shop', '.']
- Sentence Tokenization
1. sent_tokenize
Splits a text containing multiple sentences into individual sentences.
from nltk.tokenize import sent_tokenize
text="Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop. I am actively looking for Ph.D. students. and you are a Ph.D student."
print('sentence tokenize :',sent_tokenize(text))
>>>
sentence tokenize : ["Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop.", 'I am actively looking for Ph.D. students.', 'and you are a Ph.D student.']
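The output above shows why a trained sentence tokenizer matters: sent_tokenize keeps "Ph.D. students." together as one sentence. A naive split on period-plus-space, sketched below, breaks the abbreviation apart:

```python
text = "I am actively looking for Ph.D. students. and you are a Ph.D student."

# Naive splitting on '. ' wrongly cuts inside the abbreviation "Ph.D."
naive = text.split('. ')
print(naive)
# → ['I am actively looking for Ph.D', 'students', 'and you are a Ph.D student.']
```

The naive approach yields three fragments where there are only two sentences, because it cannot tell a sentence-ending period from one inside an abbreviation.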
- Part of speech tagging
Each token is labeled with its part of speech (here, Penn Treebank tags).
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
text="Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop. I am actively looking for Ph.D. students. and you are a Ph.D student."
tokenized_sentence=word_tokenize(text)
print('Part of speech tagging :', pos_tag(tokenized_sentence))
>>>
Part of speech tagging : [('Do', 'VBP'), ("n't", 'RB'), ('be', 'VB'), ('fooled', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('dark', 'NN'), ('sounding', 'VBG'), ('name', 'NN'), (',', ','), ('Mr.', 'NNP'), ('Jone', 'NNP'), ("'s", 'POS'), ('Orphanage', 'NN'), ('is', 'VBZ'), ('as', 'RB'), ('cheery', 'JJ'), ('as', 'IN'), ('cheery', 'NN'), ('goes', 'VBZ'), ('for', 'IN'), ('a', 'DT'), ('pastry-shop', 'NN'), ('.', '.')]
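For reference, the meanings of the Penn Treebank tags appearing in the output above (a small lookup dict written here for convenience, not an NLTK API):

```python
# Penn Treebank tags seen in the pos_tag output above
PENN_TAGS = {
    'VBP': 'verb, non-3rd person singular present',
    'RB':  'adverb',
    'VB':  'verb, base form',
    'VBN': 'verb, past participle',
    'IN':  'preposition or subordinating conjunction',
    'DT':  'determiner',
    'NN':  'noun, singular or mass',
    'VBG': 'verb, gerund or present participle',
    'NNP': 'proper noun, singular',
    'POS': 'possessive ending',
    'VBZ': 'verb, 3rd person singular present',
    'JJ':  'adjective',
}

print(PENN_TAGS['NN'])
# → noun, singular or mass
```

NLTK can also look these up directly with nltk.help.upenn_tagset('NN'), after downloading the 'tagsets' resource.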