Deep Learning

How Deep Learning Learns

Naranjito 2022. 3. 17. 16:34
  • Loss function


A gradient-based optimization strategy trains a model $f(x)$ using some loss function $l(f(x_i), y_i)$, where $(x_i, y_i)$ are input-output pairs. The loss function helps the model determine how "wrong" it is and, based on that "wrongness," improve itself. It is a measure of error. Our goal throughout training is to minimize this error/loss.
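
To make the idea of "wrongness" concrete, here is a minimal sketch of one common loss, the mean squared error. The choice of loss, the function name mse_loss, and the sample numbers are assumptions for illustration; the notes above do not commit to a specific loss function.

```python
import numpy as np

def mse_loss(predictions, targets):
    """Mean squared error: the average of (f(x_i) - y_i)^2 over all pairs."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.mean((predictions - targets) ** 2))

# Hypothetical true outputs y_i and two sets of model predictions f(x_i)
y_true = [1.0, 2.0, 3.0]
good_guess = [1.0, 2.0, 3.5]
bad_guess = [3.0, 0.0, 6.0]

good_loss = mse_loss(good_guess, y_true)
bad_loss = mse_loss(bad_guess, y_true)
print(good_loss, bad_loss)
```

The predictions that sit closer to the true outputs produce the smaller loss, which is the signal training tries to drive down.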

  • Gradient Descent


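Gradient descent reduces the value of the loss function by repeatedly moving the parameters a small step in the direction opposite to the gradient of the loss.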
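
As a rough sketch of what reducing the loss looks like, the snippet below runs gradient descent on a toy one-parameter loss. The loss (w - 3)^2, the learning rate, and the step count are assumed values chosen only for illustration.

```python
def loss(w):
    # Toy loss with its minimum at w = 3 (an assumption for this example)
    return (w - 3.0) ** 2

def grad(w):
    # Derivative of the toy loss with respect to w
    return 2.0 * (w - 3.0)

learning_rate = 0.1
w = 0.0
history = [loss(w)]

for step in range(50):
    # Step against the gradient, scaled by the learning rate
    w = w - learning_rate * grad(w)
    history.append(loss(w))

print(w, history[0], history[-1])
```

Each step shrinks the loss, and w ends up very close to 3, the point where this toy loss is smallest.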


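  Gradient Descent by batch size

- Gradient Descent: trains on all data; parameters are updated once per epoch.
- Stochastic Gradient Descent (SGD): trains on randomly selected data; parameters are updated per batch (in the strict definition, a batch of one random sample).
- Mini-batch Gradient Descent: trains on a designated number of samples; parameters are updated per batch.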
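
The three variants differ mainly in how many samples feed each parameter update. The sketch below counts the updates made in one epoch for each; the dataset size of 3,000, the mini-batch size of 64, and the helper name batches are assumptions for illustration.

```python
import numpy as np

def batches(num_samples, batch_size, rng):
    """Yield shuffled index arrays, one array per parameter update."""
    order = rng.permutation(num_samples)
    for start in range(0, num_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
num_samples = 3000  # hypothetical dataset size

# Batch size used by each variant (SGD in the strict, one-sample sense)
settings = {
    "gradient descent": num_samples,
    "stochastic gradient descent": 1,
    "mini-batch gradient descent": 64,
}

updates_per_epoch = {}
for name, batch_size in settings.items():
    updates_per_epoch[name] = sum(1 for _ in batches(num_samples, batch_size, rng))
    print(name, "->", updates_per_epoch[name], "updates per epoch")
```

Using all data gives one update per epoch, one sample at a time gives 3,000, and batches of 64 give 47 (the last batch is smaller than the rest).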

  • Optimizer


Linear regression is the task of finding the one straight line that best fits the training data. The hypothesis of linear regression takes the following form.


$H(x) = Wx + b$

W : Weight

b : bias

An optimizer is the method used to find the W and b that minimize the value of the cost function.
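
As a minimal sketch of an optimizer at work, the snippet below uses plain gradient descent to find W and b for H(x) = Wx + b. The training data (generated from y = 2x + 1), the mean squared error cost, the learning rate, and the number of epochs are all assumptions for illustration.

```python
import numpy as np

# Hypothetical training data generated from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

W, b = 0.0, 0.0
learning_rate = 0.05

for epoch in range(2000):
    error = (W * x + b) - y            # H(x) - y for every sample
    cost = np.mean(error ** 2)         # mean squared error cost
    grad_W = 2.0 * np.mean(error * x)  # d(cost)/dW
    grad_b = 2.0 * np.mean(error)      # d(cost)/db
    W -= learning_rate * grad_W
    b -= learning_rate * grad_b

print(W, b, cost)
```

The fitted values approach W = 2 and b = 1, the line the data was generated from, while the cost falls toward zero.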



- Adagrad: parameters that have changed a lot get a small learning rate; parameters that have changed little get a large learning rate.
- RMSprop: improves on Adagrad.
- Adam: combines RMSprop and momentum.
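
The update rules behind these descriptions can be sketched as follows. This covers the Adagrad-style rule, RMSprop, and the rule that combines RMSprop with momentum (commonly known as Adam). The function names, the state dictionaries, and the hyperparameter defaults are assumptions for illustration, not any library's API.

```python
import numpy as np

def adagrad_step(w, g, state, lr=0.1, eps=1e-8):
    # Accumulate all squared gradients: often-updated parameters get smaller steps
    state["G"] = state.get("G", np.zeros_like(w)) + g ** 2
    return w - lr * g / (np.sqrt(state["G"]) + eps)

def rmsprop_step(w, g, state, lr=0.1, rho=0.9, eps=1e-8):
    # Decaying average of squared gradients, so the step size does not shrink forever
    state["G"] = rho * state.get("G", np.zeros_like(w)) + (1 - rho) * g ** 2
    return w - lr * g / (np.sqrt(state["G"]) + eps)

def adam_step(w, g, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum (m) plus RMSprop-style scaling (v), both bias-corrected
    state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", np.zeros_like(w)) + (1 - beta1) * g
    state["v"] = beta2 * state.get("v", np.zeros_like(w)) + (1 - beta2) * g ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

# Hypothetical gradient: one parameter with a large gradient, one with a small gradient
w0 = np.array([0.0, 0.0])
g = np.array([10.0, 0.1])

adagrad_w = adagrad_step(w0, g, {})
rmsprop_w = rmsprop_step(w0, g, {})
adam_w = adam_step(w0, g, {})
print(adagrad_w, rmsprop_w, adam_w)
```

Even though the two gradients differ by a factor of 100, each rule moves both parameters by the same amount on this first step, because every parameter is scaled by its own gradient history.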

  • Epochs


The number of times the model trains on the entire dataset.

  • Batch size


The unit of data (number of samples) processed together at one time.


Let's say the size of one sample is 256. For instance, it is a vector such as [3,1,2,5, ...] whose length is 256.

In other words, size of one sample = vector dimension = 256.

If the number of samples is 3,000, the total data size is 3,000 × 256.

The computer processes the data in chunks rather than one sample at a time.

If it takes 64 of the 3,000 samples at a time, the batch size is 64.

Therefore, the amount of data the computer processes at once is (batch size × dim) = 64 × 256.

- One sample

[3,1,2,5, ...]    length = 256

- All samples (3,000 in total)

[3,1,2,5, ...]    length = 256
...
[3,1,2,5, ...]    length = 256
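
The shapes described above can be checked with a short NumPy sketch. The zero-filled array is a placeholder standing in for real data, and the variable names are assumptions for illustration.

```python
import numpy as np

num_samples, dim, batch_size = 3000, 256, 64

# Placeholder dataset: 3,000 samples, each a vector of length 256
data = np.zeros((num_samples, dim))

# The first chunk the computer processes at once
first_batch = data[:batch_size]

# 3,000 is not a multiple of 64, so the final chunk holds the leftover samples
last_batch = data[(num_samples // batch_size) * batch_size:]

print(data.shape)         # (3000, 256)
print(first_batch.shape)  # (64, 256)
print(last_batch.shape)   # (56, 256)
```

The first batch has the (batch size × dim) = 64 × 256 shape from the text, while the last batch is smaller because 3,000 does not divide evenly by 64.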