BPE(Byte Pair Encoding)

Naranjito 2022. 11. 30. 14:42

OOV : Out-Of-Vocabulary. Words that are not in the model's vocabulary, so the machine does not know them.

 

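Subword segmentation : A preprocessing step that splits one word into multiple subwords, because words often consist of a combination of meaningful subwords (Ex : birthplace = birth + place). A rare or unseen word can then be represented by subwords the machine already knows, which reduces the OOV problem.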
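
To make the birthplace = birth + place idea concrete, here is a minimal sketch of subword segmentation. It is a simple greedy longest-match split against a hand-written vocabulary, not BPE itself. The function name `segment` and the vocabulary contents are assumptions made up for this illustration.

```python
def segment(word, vocab):
    """Split word into the longest known subwords, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known subword starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


# Hypothetical subword vocabulary, chosen only for this example.
vocab = {"birth", "place", "day"}
print(segment("birthplace", vocab))  # ['birth', 'place']
print(segment("birthday", vocab))    # ['birth', 'day']
```

Even though "birthday" is not in the vocabulary as a whole word, it can still be expressed with known subwords.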

 

</w> : A special symbol appended to the end of each word, so that the word boundary is still visible after the word is split into subwords.
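
As a rough sketch of how the </w> marker might be used, the snippet below turns each word of a tiny corpus into a sequence of characters followed by </w> and counts how often each sequence appears. The names `END_OF_WORD`, `to_symbols`, and `build_vocab`, and the sample corpus, are assumptions for illustration only.

```python
from collections import Counter

END_OF_WORD = "</w>"


def to_symbols(word):
    """Split a word into characters and append the end-of-word marker."""
    return tuple(word) + (END_OF_WORD,)


def build_vocab(corpus):
    """Count each whitespace-separated word as a tuple of symbols."""
    return Counter(to_symbols(word) for word in corpus.split())


print(to_symbols("low"))  # ('l', 'o', 'w', '</w>')

vocab = build_vocab("low low lower")
print(vocab[("l", "o", "w", "</w>")])  # 2
```

Because of the marker, the "w" that ends "low" and the "w" in the middle of "lower" are followed by different symbols, so they can be treated differently in later merges.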

 

  • BPE(Byte Pair Encoding)

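It repeatedly finds the most frequent pair of adjacent characters and merges that pair into one new symbol. The process stops when no pair appears more than once (or after a chosen number of merges).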

 

aaabdaaabac

>>>
Z=aa
ZabdZabac

>>>
Y=ab
Z=aa
ZYdZYac

>>>
X=ZY
Y=ab
Z=aa
XdXac # result
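
The walkthrough above can be reproduced with a short script. This is a minimal sketch, not the code from the notebook linked below. The function names and the replacement symbols Z, Y, X are assumptions. When two pairs are equally frequent (here "Za" and "ab" both appear twice in the second step), the sketch picks the lexicographically larger pair, an arbitrary tie-break that happens to match the steps shown.

```python
from collections import Counter


def most_frequent_pair(symbols):
    """Return the most frequent adjacent pair, or None if no pair repeats."""
    # Every adjacent pair is counted, including overlapping ones ("aaa" has two "aa").
    counts = Counter(zip(symbols, symbols[1:]))
    if not counts:
        return None
    # Assumption: ties go to the lexicographically larger pair.
    best = max(counts, key=lambda pair: (counts[pair], pair))
    return best if counts[best] > 1 else None


def merge_pair(symbols, pair, new_symbol):
    """Replace each non-overlapping occurrence of pair, left to right."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(new_symbol)
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def bpe_compress(text, new_symbols="ZYX"):
    """Merge the most frequent pair until none repeats or symbols run out."""
    symbols = list(text)
    rules = {}
    for new_symbol in new_symbols:
        pair = most_frequent_pair(symbols)
        if pair is None:
            break
        rules[new_symbol] = "".join(pair)
        symbols = merge_pair(symbols, pair, new_symbol)
    return "".join(symbols), rules


def bpe_decompress(text, rules):
    """Undo the merges by expanding the newest rule first."""
    for new_symbol, pair in reversed(list(rules.items())):
        text = text.replace(new_symbol, pair)
    return text


compressed, rules = bpe_compress("aaabdaaabac")
print(compressed)                         # XdXac
print(rules)                              # {'Z': 'aa', 'Y': 'ab', 'X': 'ZY'}
print(bpe_decompress(compressed, rules))  # aaabdaaabac
```

Keeping the rules in the order they were learned matters: decompression has to expand X before Y and Z, because X is defined in terms of them.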

code : https://github.com/mellamonaranja/category_separation/blob/main/tokenizer%20train%20-%20BPE.ipynb

reference : https://www.youtube.com/watch?v=zjaRNfvNMTs
