OOV : Out-Of-Vocabulary, i.e., words that are not in the model's vocabulary.
Subword segmentation : A preprocessing step that splits one word into multiple subwords, because words often consist of a combination of meaningful subwords (Ex : birthplace=birth+place). This lets a model handle rare or OOV words through subwords it already knows.
</w> : A special symbol appended to a word to mark where it ends.
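As a small illustration of the end-of-word marker, the sketch below splits a word into characters and appends `</w>`. The helper name `to_symbols` is hypothetical and chosen only for this example; real tokenizers may represent symbols differently.

```python
def to_symbols(word):
    """Split a word into single characters and append the end-of-word marker."""
    return list(word) + ["</w>"]

# "birth" as a whole word ends with the marker right after "h".
print(to_symbols("birth"))
# In "birthplace" the same "h" is followed by "p", not by the marker.
print(to_symbols("birthplace"))
```

With the marker in place, "birth" as a complete word and "birth" as the first part of "birthplace" produce different symbol sequences, so a subword tokenizer can tell them apart.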
- BPE (Byte Pair Encoding)
It repeatedly finds the most frequent pair of adjacent characters and merges it into a single new symbol, until no pair occurs more than once.
aaabdaaabac
>>>
Z=aa
ZabdZabac
>>>
Y=ab
Z=aa
ZYdZYac
>>>
X=ZY
Y=ab
Z=aa
XdXac # result
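The steps above can be sketched in a few lines of Python. This is a minimal sketch rather than the linked notebook's implementation: the function name `bpe_compress` and the pool of replacement symbols are assumptions, and the input is assumed not to contain those symbols already. When two pairs are equally frequent (as with `Za` and `ab` in the second step), the sketch picks the lexicographically larger pair, an arbitrary rule chosen only so that the intermediate strings match the example.

```python
from collections import Counter

def bpe_compress(text, new_symbols="ZYXWVU"):
    """Repeatedly replace the most frequent adjacent pair with an unused symbol."""
    merges = []  # (new_symbol, pair) in the order the merges were made
    for symbol in new_symbols:
        # Count every adjacent pair of symbols in the current string.
        pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
        if not pairs:
            break
        # Most frequent pair; ties go to the lexicographically larger pair
        # (an arbitrary rule, used only to reproduce the example's order).
        best = max(pairs, key=lambda p: (pairs[p], p))
        if pairs[best] < 2:
            break  # no pair repeats, so another merge would not help
        text = text.replace(best, symbol)
        merges.append((symbol, best))
    return text, merges

result, merges = bpe_compress("aaabdaaabac")
print(result)  # XdXac
print(merges)  # [('Z', 'aa'), ('Y', 'ab'), ('X', 'ZY')]
```

The recorded merges act as the lookup table: applying them in reverse order (X, then Y, then Z) restores the original string.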
code : https://github.com/mellamonaranja/category_separation/blob/main/tokenizer%20train%20-%20BPE.ipynb
reference : https://www.youtube.com/watch?v=zjaRNfvNMTs