- Q-Learning
The process of finding a famous restaurant by moving up, down, left, and right from the starting point on the map. In other words, it is the process of learning the fastest way to reach the restaurant, i.e., the process of learning Q-values.
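As a rough sketch (the grid size, start, and goal coordinates below are assumptions made for illustration, not values from this post), the setup can be pictured as a table of Q-values over the map:

```python
import numpy as np

# A hypothetical 5x5 map: the agent starts at (0, 0) and the famous
# restaurant (the goal) sits at (4, 4). These are illustrative choices.
GRID_SIZE = 5
START = (0, 0)
GOAL = (4, 4)

# One Q-value per (row, column, action); 4 actions: up, down, left, right.
# Everything starts at 0 and gets filled in as the agent explores.
Q = np.zeros((GRID_SIZE, GRID_SIZE, 4))
```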
- Episode
A sequence of interactions between an agent and its environment, starting from an initial state and ending at a terminal state. In other words, the first mission is the first episode, the second mission is the second episode, and so forth.
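A minimal sketch of one episode, assuming a hypothetical `env` object whose `reset()` returns the initial state and whose `step(action)` returns the next state, the reward, and a done flag:

```python
def run_episode(env, policy):
    state = env.reset()            # initial state
    done = False
    while not done:                # one episode = one full trajectory
        action = policy(state)
        state, reward, done = env.step(action)
    # Reaching the terminal state ends this episode; calling run_episode
    # again starts the next "mission", i.e., the next episode.
```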
- Greedy action
Moving in the direction with the highest score, i.e., taking the action with the largest Q-value.
When updating, the largest non-zero Q-value of the next state is written into the current state.
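For example, assuming the hypothetical Q-table from the sketch above, the greedy action is simply the index of the largest Q-value in the current state:

```python
import numpy as np

def greedy_action(Q, state):
    # Q[state] holds the four scores (up, down, left, right) for this state;
    # the greedy action is the one with the largest score.
    return int(np.argmax(Q[state]))
```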
- Action
One of the four moves: up, down, left, or right.
- Reward
I moved up, down, left, and right from the starting point and eventually arrived right in front of the restaurant. When I arrive at the restaurant, the episode ends.
So that the next episode can find the good restaurant more easily, I give a score (reward) to the last action taken right before finding the good restaurant.
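A minimal sketch of such a reward, assuming the goal cell from the earlier illustration; only the step that reaches the restaurant scores 1:

```python
def reward_of(next_state, goal=(4, 4)):
    # Hypothetical reward function: score 1 only when the agent steps onto
    # the restaurant cell (the terminal state), 0 for every other move.
    # The Q-update later passes this score back to the last action taken
    # right before the restaurant was found.
    return 1.0 if next_state == goal else 0.0
```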
- Q-Value
The four action values of a state, one per direction. In this case, (0, 0, 0, 1).
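For example, with NumPy the four values of one state could be held in a small array (the numbers match the (0, 0, 0, 1) case above):

```python
import numpy as np

# The Q-values of a single state: one score per action
# (up, down, left, right). Here only the action that leads toward the
# restaurant has been rewarded, so it alone is non-zero.
q_values = np.array([0.0, 0.0, 0.0, 1.0])
```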
- Exploration
Exploring for a better way: a new path, a new good restaurant.
- ϵ-greedy(epsilon-greedy)
ϵ (epsilon): a value between 0 and 1.
For example, if ϵ is 0.1 in the first state, the agent moves randomly 10% of the time and follows the greedy action 90% of the time.
If ϵ is 1 in the first state, the agent keeps moving randomly, ignoring the best Q-value.
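A minimal sketch of ϵ-greedy action selection, reusing the hypothetical 4-action Q-table from above:

```python
import random
import numpy as np

def epsilon_greedy(Q, state, epsilon):
    # With probability epsilon: explore (pick one of the 4 moves at random).
    # With probability 1 - epsilon: exploit (pick the greedy action).
    if random.random() < epsilon:
        return random.randrange(4)
    return int(np.argmax(Q[state]))

# epsilon = 0.1 -> roughly 10% random moves, 90% greedy moves.
# epsilon = 1.0 -> always random; the Q-values are ignored.
```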
- Exploitation
Following the greedy action in the current environment. Relying only on exploitation means missing the chance to find a better way.
- Decaying ϵ-greedy(epsilon-greedy)
A way to trade off between exploration and exploitation.
ϵ is decreased from 0.9 (or another value less than 1) down to 0.
In other words, the agent explores a lot at first but gradually explores less as the episodes go on.
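One possible decay schedule, shown here as a simple linear decrease (the schedule and the numbers are illustrative assumptions, not from this post):

```python
def decayed_epsilon(episode, num_episodes, start=0.9, end=0.0):
    # Linearly shrink epsilon from `start` to `end` over the episodes.
    fraction = min(episode / max(num_episodes - 1, 1), 1.0)
    return start + fraction * (end - start)

# Early episodes explore a lot, later episodes mostly exploit:
# decayed_epsilon(0, 100)  -> 0.9
# decayed_epsilon(99, 100) -> 0.0
```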
- Discount factor
Used to prefer shorter, better paths. When updating the path-finding process, the current state does not take the largest value from the next state as-is, but multiplies it by gamma first.
𝛾 (gamma): a value between 0 and 1.
- The smaller the gamma, the larger the gap between the discounted value and the final reward, so the agent barely considers a far-away final reward. For example, 𝛾 is 0.001 and the final reward is 1.
- The larger the gamma, the smaller the gap between the discounted value and the final reward, so the agent treats the final reward as if it were almost there. For example, 𝛾 is 0.9 and the final reward is 1.
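A small worked example of how gamma shrinks a far-away reward; the 5-step distance to the goal is an assumption made just for this illustration:

```python
# A reward of 1 that lies n steps away is worth gamma ** n right now.
final_reward = 1.0
steps_to_goal = 5

for gamma in (0.001, 0.9):
    discounted = (gamma ** steps_to_goal) * final_reward
    print(f"gamma={gamma}: discounted value = {discounted}")
# gamma=0.001 -> about 1e-15  (the far-away reward is effectively ignored)
# gamma=0.9   -> 0.59049      (the far-away reward still counts for a lot)
```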
- Q-update
Instead of copying the largest value from the next state directly, the current state's value is updated using that value multiplied by gamma:

Q(S, A) ← Q(S, A) + α · (R + γ · max Q(S', A') − Q(S, A))

- Q(S, A): the value of action A in state S that is being updated.
- ← : the update (assignment).
- max Q(S', A'): the largest action value in the following state S'.
- α (alpha): an indicator of how much you accept new information (the learning rate).
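A minimal sketch of this update in code, with illustrative default values for α and γ (not taken from this post):

```python
import numpy as np

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    # Q(S, A) <- Q(S, A) + alpha * (R + gamma * max Q(S', A') - Q(S, A))
    # alpha: how much of the new information is accepted (learning rate).
    # gamma: how much the next state's best value is discounted.
    best_next = np.max(Q[next_state])
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
```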