Deep Learning

How Deep Learning Learns

Naranjito 2022. 3. 17. 16:34
  • Loss function

In deep learning, we typically use a gradient-based optimization strategy to train a model f(x) using some loss function l(f(x_i), y_i), where (x_i, y_i) is an input-output pair. The loss function tells the model how "wrong" it is and, based on that "wrongness," lets it improve itself. It is a measure of error, and our goal throughout training is to minimize this error/loss.

https://wandb.ai/sauravmaheshkar/cross-entropy/reports/What-Is-Cross-Entropy-Loss-A-Tutorial-With-Code--VmlldzoxMDA5NTMx
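As a concrete example in the spirit of the cross-entropy tutorial linked above, here is a minimal NumPy sketch of cross-entropy loss for a single pair (x_i, y_i); the probabilities and the label are made-up values for illustration.

```python
import numpy as np

def cross_entropy(probs, label):
    """Cross-entropy loss for one sample:
    -log of the probability assigned to the true class."""
    return -np.log(probs[label])

# made-up model output f(x_i): probabilities over 3 classes
probs = np.array([0.1, 0.7, 0.2])
y_i = 1                            # made-up true class index

print(cross_entropy(probs, y_i))   # -log(0.7) ≈ 0.357
```

The closer the predicted probability of the true class is to 1, the closer the loss is to 0.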

 

  • Gradient Descent

An iterative method for reducing the value of the loss function: at every step, the parameters are moved a small step in the direction of the negative gradient.

Gradient Descent by batch size:

- Gradient Descent: trains on all the data at once; parameters are updated once per epoch.
- Stochastic Gradient Descent (SGD): trains on randomly drawn data; parameters are updated per batch (in its classic form, one sample at a time).
- Mini-batch Gradient Descent: trains on a designated chunk of the data; parameters are updated per batch.
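The only difference between the three variants is how much data goes into each parameter update. Here is a minimal NumPy sketch for a linear model with squared error; the data, learning rate, and batch size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(3000, 256)), rng.normal(size=3000)  # toy data
w, lr = np.zeros(256), 0.01

def grad(w, X_b, y_b):
    """Gradient of mean squared error for a linear model X_b @ w."""
    return 2 * X_b.T @ (X_b @ w - y_b) / len(y_b)

# Gradient Descent: one update per epoch, computed on ALL 3,000 samples
w -= lr * grad(w, X, y)

# Stochastic Gradient Descent: one update per randomly drawn sample
i = rng.integers(len(y))
w -= lr * grad(w, X[i:i+1], y[i:i+1])

# Mini-batch Gradient Descent: one update per designated chunk (here 64)
w -= lr * grad(w, X[:64], y[:64])
```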

 

  • Optimizer

- Momentum: adds a fraction of the previous update to the current one, so the optimizer builds up speed in directions where the gradient is consistent.
- Adagrad: parameters with many changes get a small learning rate; parameters with few changes get a high learning rate.
- RMSprop: improves Adagrad by replacing the accumulated sum of squared gradients with a decaying average, so the learning rate does not shrink toward zero.
- Adam: combines RMSprop and momentum.
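All four are available out of the box in PyTorch; a minimal sketch (the toy model and the learning rates are arbitrary choices for illustration):

```python
import torch

model = torch.nn.Linear(256, 10)   # toy model, sizes are arbitrary

# SGD + Momentum: past updates accelerate the current one
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adagrad: per-parameter learning rates shrink as updates accumulate
opt = torch.optim.Adagrad(model.parameters(), lr=0.01)

# RMSprop: Adagrad with a decaying average, so rates don't vanish
opt = torch.optim.RMSprop(model.parameters(), lr=0.01)

# Adam: RMSprop + momentum
opt = torch.optim.Adam(model.parameters(), lr=0.001)
```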

  • Epochs

How many times the entire training dataset is passed through the model.
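In a training loop, the epoch count is simply the outer loop. A minimal PyTorch-style sketch; the model, loss, optimizer, and data are made-up placeholders just to make the loop runnable:

```python
import torch

# toy setup (made-up shapes and values, for illustration only)
model = torch.nn.Linear(256, 1)
loss_fn = torch.nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
X, y = torch.randn(3000, 256), torch.randn(3000, 1)
train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y), batch_size=64)

num_epochs = 10  # pass the whole dataset through the model 10 times

for epoch in range(num_epochs):
    for x_batch, y_batch in train_loader:        # one full pass = 1 epoch
        opt.zero_grad()                          # clear old gradients
        loss = loss_fn(model(x_batch), y_batch)  # measure "wrongness"
        loss.backward()                          # compute gradients
        opt.step()                               # update parameters
```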

 

  • Batch size

The unit of data the model processes in one step.

Let's say one sample has size 256. For instance, it consists of [3, 1, 2, 5, ...] and its length is 256.

In other words, one sample's size = vector dimension = 256.

If the number of samples is 3,000, the total data size is 3,000 × 256.

The computer processes the data in chunks rather than one by one.

If you take 64 of the 3,000 samples at a time, the batch size is 64.

Therefore, what the computer processes at once is (batch size × dim) = 64 × 256.

 

- One sample

[3, 1, 2, 5, ...]   ← length = 256

- Full dataset (3,000 samples)

[3, 1, 2, 5, ...]   ← length = 256
[3, 1, 2, 5, ...]   ← length = 256
            ...
[3, 1, 2, 5, ...]   ← length = 256

(3,000 rows in total)
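Putting those numbers into code: a minimal NumPy sketch that takes the 3,000 vectors of dimension 256 in chunks of 64, so each chunk the computer processes at once has shape (64, 256). The values are random placeholders.

```python
import numpy as np

data = np.random.rand(3000, 256)   # 3,000 samples, each a vector of dim 256
batch_size = 64

for start in range(0, len(data), batch_size):
    batch = data[start:start + batch_size]
    # each chunk processed at once: (batch size × dim) = (64, 256)
    print(batch.shape)

# note: the last chunk is smaller, since 3000 % 64 = 56
```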