Deep Learning/Tensorflow

gensim, Scikit-learn, NLTK, TreebankWordTokenizer, WordPunctTokenizer, sent_tokenize, pos_tag, word_tokenize, NLP, text_to_word_sequence, Corpus

Naranjito 2021. 3. 5. 17:56
  • Corpus

- A collection of natural language data (text) gathered for analysis or model training.

 

  • NLP

- Natural Language Processing

 

  • gensim

- An open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning.

 

  • Scikit-learn

- The name comes from "SciPy Toolkit" (SciKit). It features various classification, regression, and clustering algorithms, including support vector machines.

 

  • NLTK

- The Natural Language Toolkit is a suite of Python libraries and programs for symbolic and statistical natural language processing of English text.

 

  • Word Tokenization

 

1. word_tokenize

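Splits text into word tokens. Punctuation marks become separate tokens, and contractions are split into parts (Don't becomes Do and n't).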

from nltk.tokenize import word_tokenize
print('word tokenize : ', word_tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop."))

>>>
word tokenize :  ['Do', "n't", 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', ',', 'Mr.', 'Jone', "'s", 'Orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry-shop', '.']

 

2. WordPunctTokenizer

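Splits text into runs of word characters and runs of punctuation (., ,, ?, ;, !, ', -), so each punctuation mark becomes its own token: Don't becomes Don, ', t.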

from nltk.tokenize import WordPunctTokenizer
print('word punctuation tokenize : ',WordPunctTokenizer().tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop."))

>>>
word punctuation tokenize :  ['Don', "'", 't', 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', ',', 'Mr', '.', 'Jone', "'", 's', 'Orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', '-', 'shop', '.']
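
The behaviour above can be approximated with a single regular expression. The sketch below uses only the standard library; the function name word_punct_tokenize is hypothetical, and the pattern is assumed to be equivalent to the one WordPunctTokenizer applies (runs of word characters, or runs of non-space punctuation).

```python
import re

def word_punct_tokenize(text):
    # Hypothetical helper: match either a run of word characters
    # or a run of characters that are neither word chars nor whitespace.
    return re.findall(r"\w+|[^\w\s]+", text)

text = "Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop."
tokens = word_punct_tokenize(text)
print(tokens[:3])   # the apostrophe splits "Don't" into three tokens
print(tokens[-4:])  # the hyphen splits "pastry-shop" as well
```

Because the apostrophe and hyphen count as punctuation, this approach breaks up contractions and hyphenated words, which is why the other tokenizers in this post treat them differently.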

 

3. text_to_word_sequence

Converts all letters to lowercase and removes punctuation marks, but preserves apostrophes, as in don't or jone's.

from tensorflow.keras.preprocessing.text import text_to_word_sequence
print('text to word sequence tokenize : ', text_to_word_sequence("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop."))

>>>
text to word sequence tokenize :  ["don't", 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', 'mr', "jone's", 'orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry', 'shop']
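
To see why the apostrophe survives, here is a rough pure-Python sketch of the same steps: lowercase, replace filtered characters with a space, then split. The function name simple_word_sequence is hypothetical, and the FILTERS string is an assumption meant to mirror the Keras default filter set, which does not include the apostrophe.

```python
# Assumed to mirror the default Keras filter characters (no apostrophe in it).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=FILTERS):
    # Hypothetical helper: lowercase, blank out filtered characters, split on spaces.
    text = text.lower()
    table = str.maketrans({ch: " " for ch in filters})
    return [tok for tok in text.translate(table).split(" ") if tok]

text = "Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop."
words = simple_word_sequence(text)
print(words[0], words[9], words[-2:])
```

Since the hyphen is in the filter set, pastry-shop ends up as two tokens, while don't and jone's stay whole.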

 

4. TreebankWordTokenizer

Keeps a hyphenated word as a single token.

Separates clitics joined by an apostrophe, such as the n't in doesn't (does, n't).

from nltk.tokenize import TreebankWordTokenizer
tokenizer=TreebankWordTokenizer()
text="Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop."
print('Tree bank Word Tokenizer : ',tokenizer.tokenize(text))

>>>
Tree bank Word Tokenizer :  ['Do', "n't", 'be', 'fooled', 'by', 'the', 'dark', 'sounding', 'name', ',', 'Mr.', 'Jone', "'s", 'Orphanage', 'is', 'as', 'cheery', 'as', 'cheery', 'goes', 'for', 'a', 'pastry-shop', '.']
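
The two rules can be illustrated with a few regular expressions. This is only a simplified sketch, not the actual Penn Treebank rule set that NLTK implements; the function name treebank_like and the short list of clitics are assumptions chosen for this example.

```python
import re

def treebank_like(text):
    # Hypothetical, simplified imitation of two Treebank conventions.
    text = re.sub(r"([,;:?!])", r" \1 ", text)                 # set off common punctuation
    text = re.sub(r"\.\s*$", " .", text)                       # split only the final period
    text = re.sub(r"(?i)n't\b", " n't", text)                  # doesn't -> does n't
    text = re.sub(r"(?i)'(s|m|d|ll|re|ve)\b", r" '\1", text)   # Jone's -> Jone 's
    return text.split()                                        # hyphens are never touched

text = "Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop."
tokens = treebank_like(text)
print(tokens[:2], tokens[10:13], tokens[-2:])
```

Only the sentence-final period is separated, so an abbreviation such as Mr. keeps its period, and pastry-shop stays one token because no rule mentions the hyphen.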

 

  • Sentence Tokenization

1. sent_tokenize

Splits a text that contains multiple sentences into individual sentences.

 

from nltk.tokenize import sent_tokenize
text="Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop. I am actively looking for Ph.D. students. and you are a Ph.D student."
print('sentence tokenize :',sent_tokenize(text))

>>>
sentence tokenize : ["Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop.", 'I am actively looking for Ph.D. students.', 'and you are a Ph.D student.']
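
For contrast, a naive rule such as "split after every period followed by a space" handles abbreviations poorly. The sketch below is a deliberately simple baseline, not how sent_tokenize works; the function name naive_sent_split is hypothetical.

```python
import re

def naive_sent_split(text):
    # Hypothetical baseline: split on whitespace that follows '.', '!' or '?'.
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop. I am actively looking for Ph.D. students. and you are a Ph.D student."
parts = naive_sent_split(text)
for part in parts:
    print(repr(part))
```

This baseline cuts the text after Mr. and after Ph.D., producing five pieces instead of three, which is the kind of mistake a sentence tokenizer that knows about abbreviations is meant to avoid.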

 

  • Part of speech tagging

Assigns each token a part-of-speech tag, for example NN for a noun or VB for a verb.

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
text="Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry-shop."
# Tokenize first, then tag each token
tokenized_sentence=word_tokenize(text)
print('Part of speech tagging : ',pos_tag(tokenized_sentence))

>>>
Part of speech tagging :  [('Do', 'VBP'), ("n't", 'RB'), ('be', 'VB'), ('fooled', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('dark', 'NN'), ('sounding', 'VBG'), ('name', 'NN'), (',', ','), ('Mr.', 'NNP'), ('Jone', 'NNP'), ("'s", 'POS'), ('Orphanage', 'NN'), ('is', 'VBZ'), ('as', 'RB'), ('cheery', 'JJ'), ('as', 'IN'), ('cheery', 'NN'), ('goes', 'VBZ'), ('for', 'IN'), ('a', 'DT'), ('pastry-shop', 'NN'), ('.', '.')]
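
The result is a plain list of (token, tag) tuples, so it can be filtered with ordinary Python. As a small sketch, the snippet below copies the first few pairs from the output above (no NLTK needed) and picks out the nouns, relying on the fact that noun tags in this tag set start with NN; the variable names are only illustrative.

```python
from collections import Counter

# A few (token, tag) pairs copied from the tagging output above.
tagged = [('Do', 'VBP'), ("n't", 'RB'), ('be', 'VB'), ('fooled', 'VBN'),
          ('by', 'IN'), ('the', 'DT'), ('dark', 'NN'), ('sounding', 'VBG'),
          ('name', 'NN'), (',', ','), ('Mr.', 'NNP'), ('Jone', 'NNP'),
          ("'s", 'POS'), ('Orphanage', 'NN')]

# Noun tags (NN, NNP, ...) all share the NN prefix.
nouns = [token for token, tag in tagged if tag.startswith('NN')]
tag_counts = Counter(tag for _, tag in tagged)

print(nouns)
print(tag_counts.most_common(2))
```

The same pattern works for verbs (tags starting with VB) or any other tag prefix.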