Deep Learning

Bag of Words (BoW), DTM, TDM, TF-IDF

Naranjito 2021. 3. 10. 17:03

  • Bag of Words (BoW)

 

It is a way of extracting features from text: a representation of text that describes the occurrence of words within a document. For example, pour the sentences into a bag and shuffle it; any information about the order or structure of the words is discarded, and only the words and how often they occur remain.

 

Review 1 : This movie is very scary and long

Review 2 : This movie is not scary and is slow

Review 3 : This movie is spooky and good

Reference : www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
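As a minimal sketch (assuming simple whitespace tokenization and lowercasing), Python's collections.Counter shows how a review reduces to a bag of word counts:

from collections import Counter

review='This movie is very scary and long'

# Splitting on whitespace discards all word order; only the counts remain
bow=Counter(review.lower().split())
print(bow)
>>>Counter({'this': 1, 'movie': 1, 'is': 1, 'very': 1, 'scary': 1, 'and': 1, 'long': 1})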

 

  • DTM or TDM

A Document-Term Matrix (DTM) or Term-Document Matrix (TDM) is a mathematical matrix that describes the frequency of terms occurring in a collection of documents. In a DTM,

- each row represents one document

- each column represents one term (word)

- each value (typically) contains the number of appearances of that term in that document

It is often stored as a sparse matrix or sparse vector, in which most of the values are 0. A TDM is simply the transpose of a DTM: each row represents one term and each column one document.
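As a sketch, scikit-learn's CountVectorizer (the same package used at the end of this post) builds a DTM for the three reviews above and returns it in a sparse format; note it lowercases tokens by default:

from sklearn.feature_extraction.text import CountVectorizer

docs=['This movie is very scary and long',
      'This movie is not scary and is slow',
      'This movie is spooky and good']

dtm=CountVectorizer().fit_transform(docs)  # sparse matrix: rows=documents, columns=terms
tdm=dtm.T                                  # the TDM is simply the transpose
print(dtm.shape, tdm.shape)
>>>(3, 11) (11, 3)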

 

  • TF-IDF

Term Frequency-Inverse Document Frequency is a technique to quantify a word in documents: we compute a weight for each word which signifies the importance of that word in the document and the corpus. The weight is simply the product of the two factors below, TF × IDF.

 

1. TF

Term Frequency measures how frequently a word occurs in a document.

 

TF(t, d) = (number of times the term 't' appears in the document 'd') / (total number of terms in the document 'd')

The numerator is the number of times the term 't' appears in the document 'd'; the denominator is the total number of terms in 'd'.

 

Let's take the same vocabulary built above and compute TF for Review 2, which contains 8 terms.

 

TF(‘movie’) = 1/8

TF(‘is’) = 2/8 = 1/4

TF(‘very’) = 0/8 = 0

TF(‘scary’) = 1/8

TF(‘and’) = 1/8

TF(‘long’) = 0/8 = 0

TF(‘not’) = 1/8

TF(‘slow’) = 1/8

TF( ‘spooky’) = 0/8 = 0

TF(‘good’) = 0/8 = 0

 

And now, let's calculate the term frequencies for all the terms and all the reviews.
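A minimal sketch of that table, assuming whitespace tokenization and each review's own word count as the denominator:

import pandas as pd

docs=['This movie is very scary and long',
      'This movie is not scary and is slow',
      'This movie is spooky and good']

vocab=sorted(set(w for doc in docs for w in doc.split()))

# TF(t, d) = count of t in d / number of words in d
tf_table=[[doc.split().count(t)/len(doc.split()) for t in vocab] for doc in docs]
print(pd.DataFrame(tf_table, columns=vocab).round(3))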

 

 

2. IDF

Inverse Document Frequency measures how common or rare a word is across the entire document set; in other words, how important a term is. The closer the value is to 0, the more common the word. It is calculated by taking the total number of documents, dividing it by the number of documents that contain the word, and taking the logarithm. So, if the word is common and appears in many documents, this number approaches 0 (the numerator is close to the denominator, and log 1 is 0); the rarer the word, the larger the value becomes.

d : Document

t : Term (word)

n : Total number of documents

df(t) : Number of documents containing the term 't'

IDF(t) = log(n / (1 + df(t)))

For example, 'movie' appears in all 3 reviews, so:

IDF(‘movie’) = log(3 / (1 + 3)) = log(3/4) ≈ -0.288

 

Hence, words like “is”, “This”, and “and”, which appear in every review, are pushed down to (or below) 0 and carry little importance, while rarer words like “long”, “good”, and “spooky” get a higher value and thus more importance.

 

Let's now calculate the TF-IDF score.

import pandas as pd
from math import log

docs=['This movie is very scary and long',
      'This movie is not scary and is slow',
      'This movie is spooky and good']

vocab=list(set(w for doc in docs for w in doc.split()))
vocab
>>>['is',
 'slow',
 'and',
 'This',
 'not',
 'spooky',
 'good',
 'very',
 'scary',
 'movie',
 'long']
 
vocab.sort()
N=len(docs)

def tf(t,d):
  # count whole tokens; d.count(t) would also match substrings, e.g. 'is' inside 'This'
  return d.split().count(t)

def idf(t):
  df=0
  for doc in docs:
    df+=t in doc.split()  # document frequency: how many docs contain the term
  return log(N/(1+df))

def tfidf(t,d):
  return tf(t,d)*idf(t)

result=[]
for i in range(N):
  result.append([])
  d=docs[i]
  for j in range(len(vocab)):
    t=vocab[j]
    result[-1].append(tf(t,d))

print(pd.DataFrame(result, columns=vocab))
print(result)
>>>   This  and  good  is  long  movie  not  scary  slow  spooky  very
0     1    1     0   1     1      1    0      1     0       0     1
1     1    1     0   2     0      1    1      1     1       0     0
2     1    1     1   1     0      1    0      0     0       1     0
[[1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1], [1, 1, 0, 2, 0, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0]]

result=[]
for j in range(len(vocab)):
  t=vocab[j]
  result.append(idf(t))

pd.DataFrame(result, index=vocab, columns=['IDF'])
>>>
IDF
This	-0.287682
and	-0.287682
good	0.405465
is	-0.287682
long	0.405465
movie	-0.287682
not	0.405465
scary	0.000000
slow	0.405465
spooky	0.405465
very	0.405465

result=[]
for i in range(N):
  result.append([])
  d=docs[i]
  for j in range(len(vocab)):
    t=vocab[j]
    result[-1].append(tfidf(t,d))  # call the function; a bare 'tfidf' would append the function object itself

pd.DataFrame(result, columns=vocab)
>>>
       This       and      good        is      long     movie       not  scary      slow    spooky      very
0 -0.287682 -0.287682  0.000000 -0.287682  0.405465 -0.287682  0.000000    0.0  0.000000  0.000000  0.405465
1 -0.287682 -0.287682  0.000000 -0.575364  0.000000 -0.287682  0.405465    0.0  0.405465  0.000000  0.000000
2 -0.287682 -0.287682  0.405465 -0.287682  0.000000 -0.287682  0.000000    0.0  0.000000  0.405465  0.000000

 

  • Python sklearn package
from sklearn.feature_extraction.text import TfidfVectorizer

corpus=['you know I want your love',
    'I like you',
    'what should I do ',]
tfidf=TfidfVectorizer().fit(corpus)
print(tfidf.transform(corpus).toarray())  # TF-IDF matrix: rows=documents, columns=vocabulary indices
print(tfidf.vocabulary_)                  # word -> column index
>>>[[0.         0.46735098 0.         0.46735098 0.         0.46735098
  0.         0.35543247 0.46735098]
 [0.         0.         0.79596054 0.         0.         0.
  0.         0.60534851 0.        ]
 [0.57735027 0.         0.         0.         0.57735027 0.
  0.57735027 0.         0.        ]]
{'you': 7, 'know': 1, 'want': 5, 'your': 8, 'love': 3, 'like': 2, 'what': 6, 'should': 4, 'do': 0}
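Note that these scores differ from the hand-rolled computation above: by default TfidfVectorizer uses a smoothed IDF, log((1 + n) / (1 + df)) + 1, which never goes negative, drops single-character tokens such as 'I' via its default token pattern, and finally L2-normalizes each document's vector.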