Deep Learning
BPE(Byte Pair Encoding)
Naranjito
2022. 11. 30. 14:42
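OOV : Out-Of-Vocabulary, words that are not in the model's vocabulary, so the machine does not know them.
Subword segmentation : A preprocessing step that splits one word into multiple subwords, because words often consist of a combination of meaningful subwords (Ex : birthplace=birth+place). This lets a model handle rare or OOV words through subwords it already knows.
</w> : A special symbol appended to the end of each word to mark the word boundary.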
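A minimal sketch of how the </w> marker is typically used before any merging happens: each word is split into characters and </w> is appended, so word boundaries can be recovered later. The helper names to_symbols and from_symbols are hypothetical, chosen only for this illustration.

```python
def to_symbols(word):
    """Split a word into characters and append the end-of-word marker."""
    return list(word) + ["</w>"]


def from_symbols(symbols):
    """Rebuild the text; every </w> becomes a space between words."""
    return "".join(symbols).replace("</w>", " ").strip()


symbols = to_symbols("birth") + to_symbols("place")
print(symbols)                # characters of both words, each ending in </w>
print(from_symbols(symbols))  # birth place
```

Without the marker, the symbols of "birth" followed by "place" would be indistinguishable from those of the single word "birthplace".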
- BPE(Byte Pair Encoding)
It repeatedly finds the most frequent pair of adjacent characters and replaces it with a single new symbol. In the example below, this continues until no pair appears more than once.
aaabdaaabac
>>>
Z=aa
ZabdZabac
>>>
Y=ab
Z=aa
ZYdZYac
>>>
X=ZY
Y=ab
Z=aa
XdXac # result
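The walkthrough above can be sketched in a few lines of Python. This is only an illustration, not the code from the linked notebook: the function name bpe_compress and the replacement symbols "ZYX" are assumptions. When two pairs are equally frequent (here "Za" and "ab" after the first merge), BPE may pick either one; this sketch breaks the tie by taking the lexicographically larger pair, purely so that the trace matches the example.

```python
from collections import Counter


def bpe_compress(text, symbols="ZYX"):
    """Repeatedly replace the most frequent adjacent pair with a new symbol."""
    merges = []  # (new_symbol, pair) in the order the merges were made
    for symbol in symbols:
        # Count every adjacent pair (overlapping occurrences are included).
        pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
        if not pairs:
            break
        # Highest count wins; ties go to the lexicographically larger pair
        # (an assumption made only to reproduce the example above).
        pair, count = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if count < 2:
            break  # nothing repeats, so another merge would not help
        text = text.replace(pair, symbol)
        merges.append((symbol, pair))
    return text, merges


result, merges = bpe_compress("aaabdaaabac")
print(result)  # XdXac
print(merges)  # [('Z', 'aa'), ('Y', 'ab'), ('X', 'ZY')]
```

For tokenization, the same loop is run over a corpus of words split into characters plus </w>, and the learned merge list becomes the subword vocabulary.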
code : https://github.com/mellamonaranja/category_separation/blob/main/tokenizer%20train%20-%20BPE.ipynb
reference : https://www.youtube.com/watch?v=zjaRNfvNMTs