Reinforcement Learning

Q-Learning, Greedy action, Q-Value, Exploration, ϵ-greedy, epsilon-greedy, Exploitation, Discount factor, Q-update

Naranjito 2024. 9. 19. 18:13
  • Q-Learning

 

Q-Learning can be pictured as finding a famous restaurant by moving up, down, left, and right from a starting point on a grid map. In other words, it is a process that learns the fastest way to reach the restaurant, that is, the process of learning Q-values.
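As a rough sketch (the grid size and the start and restaurant positions below are made-up assumptions, not taken from the post), such a map can be represented as a table of Q-values with one row per state and one column per action:

```python
import numpy as np

# Hypothetical 4x4 grid: 16 states, 4 actions (up, down, left, right).
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))   # every Q-value starts at 0

start_state = 0        # assumed starting cell (top-left)
restaurant_state = 15  # assumed goal cell (bottom-right)
```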

 


  • Episode

 

A sequence of interactions between an agent and its environment, starting from an initial state and ending at a terminal state. In other words, the first mission is the first episode, the second mission is the second episode, and so forth.


  • Greedy action

 

Moving in the direction with the highest score, that is, choosing the action with the largest Q-value.

When updating, the largest (nonzero) Q-value of the next state is written into the current state.
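A minimal sketch of a greedy action, assuming a NumPy Q-table like the one above: the agent simply picks the action with the largest Q-value in its current state.

```python
import numpy as np

def greedy_action(Q, state):
    """Pick the action with the largest Q-value in the current state."""
    return int(np.argmax(Q[state]))

# Propagating the value back (ignoring the discount for now):
# the current state/action takes on the largest Q-value of the next state.
# Q[state, action] = Q[next_state].max()
```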

 

 


  • Action

 

Moving up, down, left, and right.

 


  • Reward

 

I moved up, down, left, and right from the starting point and eventually arrived right in front of the restaurant. When I arrive at the restaurant, the episode ends.
So that the next episode can find the restaurant more easily, I give a score (reward) to the last action taken right before finding the restaurant.

 

 


  • Q-Value

 

Each state holds four values, one per action (up, down, left, right). For example, a state whose only rewarded action leads to the restaurant would have the Q-values 0, 0, 0, 1.
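For illustration (the state index 5 is an arbitrary choice), one row of such a Q-table might hold exactly those four values:

```python
import numpy as np

Q = np.zeros((16, 4))        # hypothetical 16 states x 4 actions
Q[5] = [0, 0, 0, 1]          # the four Q-values of one state: 0, 0, 0, 1
print(int(np.argmax(Q[5])))  # -> 3, the action a greedy step would take
```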


  • Exploration

 

Trying something other than the greedy action in search of a better way: a new path, or a new, better restaurant.

 


  • ϵ-greedy(epsilon-greedy)

 

ϵ (epsilon) : a value between 0 and 1.

For example, if ϵ is 0.1, the agent moves randomly 10% of the time and follows the greedy action the other 90%.

If ϵ is 1, the agent always moves randomly, ignoring the action with the best Q-value.
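A minimal ϵ-greedy sketch, again assuming a NumPy Q-table; the probabilities in the comments match the examples above:

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, n_actions=4):
    """Explore with probability epsilon, otherwise exploit the greedy action."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # explore: random move
    return int(np.argmax(Q[state]))           # exploit: greedy move

# epsilon = 0.1 -> roughly 10% random moves, 90% greedy moves
# epsilon = 1.0 -> always random, the Q-values are ignored
```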


  • Exploitation

 

Always following the greedy action in the current environment. Pure exploitation misses the chance to find a better way.


  • Decaying ϵ-greedy(epsilon-greedy)

 

A way to trade off between exploration and exploitation.

 

ϵ decreases from 0.9 (or another number less than 1) toward 0.

In other words, explore a lot at first, then gradually reduce exploration as the episodes go on.
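One possible decay schedule (the start value, end value, and decay rate are arbitrary assumptions):

```python
def decayed_epsilon(episode, start=0.9, end=0.0, decay=0.99):
    """Shrink epsilon from `start` toward `end` as episodes go on."""
    return max(end, start * decay ** episode)

# Early episodes explore a lot, later episodes mostly exploit.
for episode in (0, 100, 500):
    print(episode, round(decayed_epsilon(episode), 4))
```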


  • Discount factor

 

Used to find a better (shorter) way. When updating, the current state does not take the largest Q-value of the next state as is; it takes that value multiplied by gamma, so the value shrinks a little with every extra step.

 

𝛾 (gamma) : a value between 0 and 1.


- The smaller the gamma : the more the final reward shrinks with every step away from the goal, so states far from the goal barely feel the final reward. For example, with 𝛾 = 0.001 and a final reward of 1, almost nothing of that reward reaches the starting point.

- The larger the gamma : the less the final reward shrinks per step, so even distant states feel that the final reward is almost within reach. For example, with 𝛾 = 0.9 and a final reward of 1, a large part of that reward still reaches states far from the goal.
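A small numeric sketch of how the discount shrinks the final reward as it is propagated back, step by step (the number of steps is an arbitrary assumption):

```python
final_reward = 1.0
steps_to_goal = 5   # assumed number of moves between this state and the restaurant

for gamma in (0.001, 0.9):
    discounted = final_reward * gamma ** steps_to_goal
    print(gamma, discounted)
# gamma = 0.001 -> the final reward is nearly invisible from far away
# gamma = 0.9   -> the final reward still clearly reaches distant states
```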

 


  • Q-update

 

The rule that writes the largest action value of the next state back into the current state, discounted by gamma, while blending it with the value already there:

Q(S, A) ← Q(S, A) + α [ R + 𝛾 · max Q(S′, a) − Q(S, A) ]

Q(S, A) : the value being updated, for the action (A) taken in state (S).

← : updating (assigning the new value).

R : the reward received for the action.

max Q(S′, a) : the largest action value in the following state S′ (taken over all actions a).

α : the learning rate, an indicator of how much you accept new information.
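Putting the pieces together, here is a minimal, self-contained sketch of tabular Q-learning on a made-up 4x4 grid; the grid layout, rewards, and hyperparameters are illustrative assumptions, not taken from the videos below:

```python
import numpy as np

# Hypothetical 4x4 grid: state = row * 4 + col, restaurant at the bottom-right.
N, START, GOAL = 4, 0, 15
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

alpha, gamma = 0.1, 0.9        # learning rate and discount factor
epsilon, decay = 0.9, 0.99     # decaying epsilon-greedy
Q = np.zeros((N * N, len(ACTIONS)))

def step(state, action):
    """Apply one move; bumping into a wall leaves the state unchanged."""
    row, col = divmod(state, N)
    d_row, d_col = ACTIONS[action]
    row = min(max(row + d_row, 0), N - 1)
    col = min(max(col + d_col, 0), N - 1)
    next_state = row * N + col
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

for episode in range(500):
    state = START
    for _ in range(200):                         # safety cap on episode length
        if np.random.rand() < epsilon:           # explore
            action = np.random.randint(len(ACTIONS))
        else:                                    # exploit (greedy action)
            action = int(np.argmax(Q[state]))
        next_state, reward = step(state, action)
        # Q-update: Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a Q(S',a) - Q(S,A))
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state
        if state == GOAL:                        # episode ends at the restaurant
            break
    epsilon *= decay                             # explore less over time

print(np.round(Q.max(axis=1).reshape(N, N), 2))  # learned value of each grid cell
```

Every piece above appears here: the random-or-greedy choice is the ϵ-greedy policy, the shrinking epsilon trades exploration for exploitation, and the update line applies the Q-update rule with the discount factor.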

 

https://youtu.be/cvctS4xWSaU?si=_hlXyNT8mdxa2mol

https://youtu.be/3Ch14GDY5Y8?si=95qIqn8Qwh0J0dG_