Reinforcement Learning

Markov Decision Process, State value function, Action value function, Optimal policy, Bellman equation

Naranjito 2024. 9. 22. 10:36
  • Markov Decision Process

 

- Decision : A sequence of actions.

- S1 : It absorbs S0 and a0, so the current state summarizes everything that came before.

- a1 : It is given by S1. Once S1 is known, a1 is determined regardless of S0 and a0 (the Markov property).


 

https://youtu.be/DbbcaspZATg?si=KgUq5CdJKzHj9QOJ


- Policy : The probability of which action to take at time t, in the state at time t.
That is, a distribution over actions for a particular state.
The policy determines the action.

A state and an action form a pair.
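A policy as a distribution over actions can be sketched in Python (a minimal sketch; the state and action names below are invented for illustration):

```python
import random

# A toy stochastic policy: for each state, a probability
# distribution over the available actions.
policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(policy, state, rng=random):
    """Sample an action from the policy's distribution for `state`."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

action = sample_action(policy, "s0")  # "left" or "right"
```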

 


  • Return

 

 

In here, the return is defined as

G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... = Σ_k γ^k R_{t+k+1}

→ This can be written recursively as

G_t = R_{t+1} + γG_{t+1}

because the discounted sum from t+1 onward is exactly γ times the next return G_{t+1}.

- Goal of Reinforcement Learning : 

 

Maximize the expected (average) return, i.e. the sum of rewards.

To do that, find the policy that maximizes the expected return.
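The discounted sum of rewards can be computed with the recursive form G_t = R_{t+1} + γG_{t+1} (a minimal sketch; the reward list is invented):

```python
def discounted_return(rewards, gamma=0.9):
    """G = r_1 + gamma*r_2 + gamma^2*r_3 + ...

    Computed backwards using the recursion G_t = r_t + gamma * G_{t+1}.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

g = discounted_return([1, 1, 1], gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
```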


  • How to express the Expected (Average) Return

 

1. State value function

 

The expected return from now on.

What is the expected sum of future rewards, given that we are in this state right now?

The value of the current state. (What matters is now, not the past.)

Average over all the actions that can be taken from each state, adding up the resulting rewards.

 


V_π(s) = E_π[ G_t | S_t = s ]

State → Action → Next state → Action → ...
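The state value as an average of observed returns can be estimated by Monte Carlo sampling (a minimal sketch; the two sample episodes below are invented):

```python
def mc_state_value(episodes, gamma=0.9):
    """Estimate V(s) as the average return over episodes that start in s.

    `episodes` is a list of reward sequences observed after visiting the state.
    """
    returns = []
    for rewards in episodes:
        # Discounted return for this episode, computed backwards.
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g
        returns.append(g)
    return sum(returns) / len(returns)

v = mc_state_value([[1, 0, 1], [0, 1, 1]], gamma=0.5)  # (1.25 + 0.75) / 2 = 1.0
```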

2. Action value function

 

The expected return from the current state and action (this is the quantity Q-learning estimates).

When a specific action is chosen in the current state, it is the sum of the (gamma-discounted) rewards that can be received after that choice.

 

 


Q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ]

- s, a : the current state and the action taken in that state
- Q_π(s, a) : the expected return
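A single tabular Q-learning update can be sketched as follows (a minimal sketch; the state and action names are invented, and alpha is a hypothetical learning rate):

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    # Best action value available from the next state (0.0 for unseen pairs).
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return Q[(s, a)]

Q = {}
q_update(Q, "s0", "right", 1.0, "s1", ["left", "right"])  # 0 + 0.1 * 1.0 = 0.1
```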

 

https://youtu.be/7MdQ-UAhsxA?si=7nMOzl00vo6658H_


  • Optimal policy

 

The optimal policy π* is the one that maximizes the state value function in every state:

π* = argmax_π V_π(s)
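Once action values are known, a greedy policy picks the best action in each state (a minimal sketch; the Q table below is invented):

```python
def greedy_policy(Q, states, actions):
    """pi*(s) = argmax_a Q(s, a): pick the highest-valued action per state."""
    return {s: max(actions, key=lambda a: Q.get((s, a), 0.0)) for s in states}

Q = {("s0", "left"): 0.2, ("s0", "right"): 0.8}
pi = greedy_policy(Q, ["s0"], ["left", "right"])  # {"s0": "right"}
```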

 

 


 

https://youtu.be/cn7IAfgPasE?si=Xp6C2bacSXXq9CnU


  • Bellman equation

The value of a state equals the expected immediate reward plus the discounted value of the next state:

V_π(s) = E_π[ R_{t+1} + γ V_π(S_{t+1}) | S_t = s ]

 


 

https://youtu.be/gA-6J-nl4c4?si=7uaKcJcokkQEKhng
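Repeatedly applying the Bellman optimality backup V(s) ← max_a [ R(s,a) + γ V(s') ] gives value iteration. A minimal sketch on a tiny deterministic MDP (the states, transitions, and rewards below are invented):

```python
def value_iteration(states, actions, next_state, reward, gamma=0.9, iters=100):
    """Sweep the Bellman optimality backup until the values settle."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        # Each sweep uses the previous V to build the new one.
        V = {s: max(reward(s, a) + gamma * V[next_state(s, a)] for a in actions)
             for s in states}
    return V

# Toy MDP: "go" moves s0 -> s1 with reward 1; s1 is absorbing with reward 0.
ns = lambda s, a: "s1" if (s == "s0" and a == "go") or s == "s1" else "s0"
rw = lambda s, a: 1.0 if s == "s0" and a == "go" else 0.0
V = value_iteration(["s0", "s1"], ["go", "stay"], ns, rw, gamma=0.9)
```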