- Markov Decision Process
- Decision : a sequence of actions.
- S1 : it absorbs the information in S0 and a0, so S1 alone is enough to determine a1.
- a1 : it is given by S1. If S1 is given, a1 is determined regardless of S0 and a0 (the Markov property).
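In standard notation, the Markov property the two points above describe: given the current state, the earlier history adds nothing.
P(s_{t+1} | s_0, a_0, …, s_t, a_t) = P(s_{t+1} | s_t, a_t), and likewise a_t depends only on s_t.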
https://youtu.be/DbbcaspZATg?si=KgUq5CdJKzHj9QOJ
- Policy : the probability of which action to take at time t, in state s_t. That is, a distribution over actions given a particular state. The policy determines the action (see the sketch below).
A state and an action form a pair.
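As a rough sketch, a stochastic policy π(a | s) can be represented as a per-state distribution over actions; the state names, actions, and probabilities below are made up for illustration.
```python
import random

# Toy stochastic policy: for each state, a distribution over actions.
# The state names, actions, and probabilities are hypothetical.
policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(state):
    """Sample a_t ~ pi(. | s_t): draw an action from the policy's distribution."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action("s0"))  # "right" with probability 0.8, "left" with 0.2
```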
- Return
The return is the (discounted) sum of future rewards:
G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ⋯ , where γ is the discount factor.
→ This can be written recursively as G_t = R_{t+1} + γ G_{t+1}, because everything after R_{t+1} is γ times the return from time t+1.
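A minimal sketch of computing this quantity from a recorded reward sequence, using the recursive form; the reward values and γ here are arbitrary.
```python
def discounted_return(rewards, gamma=0.9):
    """G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ..."""
    g = 0.0
    # Work backwards so each step applies the recursion G_t = r + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9*0 + 0.81*2 = 2.62
```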
- Goal of Reinforcement Learning :
Maximize the expected (average) return (the sum of rewards).
To do that, find the policy that maximizes the expected return.
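Written as an objective in standard notation:
π* = argmax_π E_π[ G_t ] = argmax_π E_π[ Σ_{k≥0} γ^k R_{t+k+1} ]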
- Ways to express the expected (average) return
1. State value function
The expected return from now on.
What is the expected sum of future rewards, given that we are in this state at this moment?
The value of the current state (what matters is the present, not the past).
Average over all the actions and all the states that can follow, adding up all the (discounted) rewards.
s_t → Action a_t → Reward r_{t+1}, next state s_{t+1} → Action a_{t+1} → ⋯
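In standard notation, the state value function under a policy π is the expected return conditioned on the current state:
V^π(s) = E_π[ G_t | S_t = s ] = E_π[ Σ_{k≥0} γ^k R_{t+k+1} | S_t = s ]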
2. Action value function
The expected return given the current action (this is the Q-value used in Q-learning).
When a specific action is chosen in the current state, the sum of (γ-discounted) rewards that can be received after that choice.
s_t, a_t : the current state and the action taken in that state.
Q(s_t, a_t) : the expected return for that state-action pair.
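In standard notation:
Q^π(s, a) = E_π[ G_t | S_t = s, A_t = a ]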
https://youtu.be/7MdQ-UAhsxA?si=7nMOzl00vo6658H_
- Optimal policy
The policy that maximizes the state value function.
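In standard notation, the optimal value and the optimal policy:
V*(s) = max_π V^π(s),  π* = argmax_π V^π(s)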
https://youtu.be/cn7IAfgPasE?si=Xp6C2bacSXXq9CnU
- Bellman equation
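For reference, the standard Bellman expectation equation relates the value of a state to the expected immediate reward plus the discounted value of the next state:
V^π(s) = E_π[ R_{t+1} + γ V^π(S_{t+1}) | S_t = s ]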