- Q-Learning
The process of finding a famous restaurant by moving up, down, left, and right from the starting point on the map. In other words, it is the process of learning the fastest way to reach the restaurant, i.e., the process of learning Q-values.
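As a rough sketch (the grid size, start, and goal coordinates below are assumptions made for illustration, not values from this post), the setup can be pictured as a table of Q-values over the map:

```python
import numpy as np

# A hypothetical 5x5 map: the agent starts at (0, 0) and the famous
# restaurant (the goal) sits at (4, 4). These are illustrative choices.
GRID_SIZE = 5
START = (0, 0)
GOAL = (4, 4)

# One Q-value per (row, column, action); 4 actions: up, down, left, right.
# Everything starts at 0 and gets filled in as the agent explores.
Q = np.zeros((GRID_SIZE, GRID_SIZE, 4))
```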
- Episode
A sequence of interactions between an agent and its environment, starting from an initial state and ending at a terminal state. In other words, the first mission is the first episode, the second mission is the second episode, and so forth.
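A minimal sketch of one episode, assuming a hypothetical `env` object whose `reset()` returns the initial state and whose `step(action)` returns the next state, the reward, and a done flag:

```python
def run_episode(env, policy):
    state = env.reset()            # initial state
    done = False
    while not done:                # one episode = one full trajectory
        action = policy(state)
        state, reward, done = env.step(action)
    # Reaching the terminal state ends this episode; calling run_episode
    # again starts the next "mission", i.e., the next episode.
```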
- Greedy action
Moving in the direction with the highest score, i.e., taking the action with the largest Q-value.
When updating, the largest non-zero Q-value of the next state is written into the current state.
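For example, assuming the hypothetical Q-table from the sketch above, the greedy action is simply the index of the largest Q-value in the current state:

```python
import numpy as np

def greedy_action(Q, state):
    # Q[state] holds the four scores (up, down, left, right) for this state;
    # the greedy action is the one with the largest score.
    return int(np.argmax(Q[state]))
```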
- Action
One of the four moves: up, down, left, or right.
- Reward
I moved up, down, left, and right from the starting point and eventually arrived right in front of the restaurant. When I arrive at the restaurant, the episode ends.
So that the next episode can find the good restaurant more easily, I give a score (reward) to the last action taken right before finding the good restaurant.
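A minimal sketch of such a reward, assuming the goal cell from the earlier illustration; only the step that reaches the restaurant scores 1:

```python
def reward_of(next_state, goal=(4, 4)):
    # Hypothetical reward function: score 1 only when the agent steps onto
    # the restaurant cell (the terminal state), 0 for every other move.
    # The Q-update later passes this score back to the last action taken
    # right before the restaurant was found.
    return 1.0 if next_state == goal else 0.0
```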
- Q-Value
The four action values of a state, one per direction. In this case, (0, 0, 0, 1).
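For example, with NumPy the four values of one state could be held in a small array (the numbers match the (0, 0, 0, 1) case above):

```python
import numpy as np

# The Q-values of a single state: one score per action
# (up, down, left, right). Here only the action that leads toward the
# restaurant has been rewarded, so it alone is non-zero.
q_values = np.array([0.0, 0.0, 0.0, 1.0])
```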
- Exploration
Exploring for a better way: a new path, a new good restaurant.
- ϵ-greedy(epsilon-greedy)
ϵ (epsilon): a value between 0 and 1.
For example, if ϵ is 0.1 in the first state, the agent moves randomly 10% of the time and follows the greedy action 90% of the time.
If ϵ is 1 in the first state, the agent keeps moving randomly, ignoring the best Q-value.
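A minimal sketch of ϵ-greedy action selection, reusing the hypothetical 4-action Q-table from above:

```python
import random
import numpy as np

def epsilon_greedy(Q, state, epsilon):
    # With probability epsilon: explore (pick one of the 4 moves at random).
    # With probability 1 - epsilon: exploit (pick the greedy action).
    if random.random() < epsilon:
        return random.randrange(4)
    return int(np.argmax(Q[state]))

# epsilon = 0.1 -> roughly 10% random moves, 90% greedy moves.
# epsilon = 1.0 -> always random; the Q-values are ignored.
```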
- Exploitation
Following the greedy action in the current environment. Relying only on exploitation means missing the chance to find a better way.
- Decaying ϵ-greedy(epsilon-greedy)
A way to trade off between exploration and exploitation.
ϵ is decreased from 0.9 (or another value less than 1) down to 0.
In other words, the agent explores a lot at first but gradually explores less as the episodes go on.
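One possible decay schedule, shown here as a simple linear decrease (the schedule and the numbers are illustrative assumptions, not from this post):

```python
def decayed_epsilon(episode, num_episodes, start=0.9, end=0.0):
    # Linearly shrink epsilon from `start` to `end` over the episodes.
    fraction = min(episode / max(num_episodes - 1, 1), 1.0)
    return start + fraction * (end - start)

# Early episodes explore a lot, later episodes mostly exploit:
# decayed_epsilon(0, 100)  -> 0.9
# decayed_epsilon(99, 100) -> 0.0
```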
- Discount factor
Used to prefer shorter, better paths. When updating the path-finding process, the current state does not take the largest value from the next state as-is, but multiplies it by gamma first.
𝛾 (gamma): a value between 0 and 1.
- The smaller the gamma, the larger the gap between the discounted value and the final reward, so the agent barely considers a far-away final reward. For example, 𝛾 is 0.001 and the final reward is 1.
- The larger the gamma, the smaller the gap between the discounted value and the final reward, so the agent treats the final reward as if it were almost there. For example, 𝛾 is 0.9 and the final reward is 1.
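A small worked example of how gamma shrinks a far-away reward; the 5-step distance to the goal is an assumption made just for this illustration:

```python
# A reward of 1 that lies n steps away is worth gamma ** n right now.
final_reward = 1.0
steps_to_goal = 5

for gamma in (0.001, 0.9):
    discounted = (gamma ** steps_to_goal) * final_reward
    print(f"gamma={gamma}: discounted value = {discounted}")
# gamma=0.001 -> about 1e-15  (the far-away reward is effectively ignored)
# gamma=0.9   -> 0.59049      (the far-away reward still counts for a lot)
```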
- Q-update
Instead of copying the largest value from the next state directly, the current state's value is updated using that value multiplied by gamma:

Q(S, A) ← Q(S, A) + α · (R + γ · max Q(S', A') − Q(S, A))

- Q(S, A): the value of action A in state S that is being updated.
- ← : the update (assignment).
- max Q(S', A'): the largest action value in the following state S'.
- α (alpha): an indicator of how much you accept new information (the learning rate).
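A minimal sketch of this update in code, with illustrative default values for α and γ (not taken from this post):

```python
import numpy as np

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    # Q(S, A) <- Q(S, A) + alpha * (R + gamma * max Q(S', A') - Q(S, A))
    # alpha: how much of the new information is accepted (learning rate).
    # gamma: how much the next state's best value is discounted.
    best_next = np.max(Q[next_state])
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
```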