reinforcement learning in a nutshell - 1

RL concepts

at timestep t, define:

  • the environment state S(t)
  • agent’s action A(t)
  • discounted future reward R(t), which satisfies:

    R(t) = r(t) + \gamma r(t+1) + \gamma^2 r(t+2) + ... = r(t) + \gamma R(t+1)

with current state `s` and current action `a`, the environment returns the reward `r`. `\gamma` is the discount ratio, usually in [0, 1]; when $\gamma = 0$, the agent's value considers only the current reward. check [discounted future reward]() in a following chapter.
  • agent’s policy P(a|s): the probability of taking action a in the current state s

  • agent’s state value V(s): the value of being in state s while following policy \pi, usually described as an expectation:

    $ v_\pi(s) = E_\pi[ R(t) \mid s(t) = s ] $

  • agent’s action value Q(s, a), which considers the effect of both state s and action a on the value:

    $ q_\pi(s, a) = E_\pi[ R(t) \mid s(t) = s, a(t) = a ] $

namely, the agent’s state value is the expectation of its action value under the action distribution P(a|s).
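
in symbols:

$ v_\pi(s) = \sum_a P(a|s)\, q_\pi(s, a) $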

  • state transition probability P(s'|s, a)
  • exploration rate \epsilon: the chance to choose an action other than the current max-value one during iteration, similar to mutation in evolutionary algorithms.


Markov decision process

assume the state transition probability, the agent's policy, and the agent's value all follow the Markov assumption: namely, they depend only on the current state, not on the earlier history.

agent’s value function V(s) satisfies Bellman’s equation:

$ v_\pi(s) = E_\pi[ r(t) + \gamma\, v_\pi(s(t+1)) \mid s(t) = s ] $

and agent’s action value function q(s, a) satisfies its own Bellman equation:

$ q_\pi(s, a) = E_\pi[ r(t) + \gamma\, q_\pi(s(t+1), a(t+1)) \mid s(t) = s, a(t) = a ] $

in general, the environment’s state is random and actions are chosen according to the policy P(a|s). a Markov decision process unrolls as a sequence of states, actions, and rewards; one complete run is an episode, which gives the state-action-reward series:

s(0), a(0), r(1), s(1), a(1), r(2), ..., s(n-1), a(n-1), r(n), s(n)

the Markov assumption is that s(n+1) depends only on s(n) (and the action taken there), not on the earlier history.

Bellman’s equation

Bellman’s equation gives a recurrence relation between two timesteps: the current state value is computed from the next state’s value, given the current environment state.
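
this follows by combining the expectation that defines v with the return recursion R(t) = r(t) + \gamma R(t+1) from the next section:

$ v_\pi(s) = E_\pi[ R(t) \mid s(t) = s ] = E_\pi[ r(t) + \gamma\, v_\pi(s(t+1)) \mid s(t) = s ] $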

discounted future reward

to achieve a better reward in the long term, consider the reward as the sum of the current and future rewards:

R(t) = r(t) + r(t+1) + ... + r(n)

on the other hand, the environment state is random over time, so the same action does not usually give the same reward, and the further into the future, the larger the difference. think of a B-tree: the deeper it goes, the more nodes there are. so a future reward should not carry the same weight as an earlier reward; define the discount ratio \gamma \in [0, 1]:

R(t) = r(t) + \gamma r(t+1) + ... + \gamma^{n-t} r(n)

R(t) = r(t) + \gamma R(t+1)
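
a minimal sketch of computing R(t) from a recorded reward list using the backward recursion R(t) = r(t) + \gamma R(t+1); the function name and inputs are illustrative, not from any particular library:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R(t) for every t from a list of per-step rewards r(t).

    Uses the backward recursion R(t) = r(t) + gamma * R(t+1), which is
    equivalent to the explicit sum r(t) + gamma*r(t+1) + ... + gamma^(n-t)*r(n).
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# example: three steps of reward
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))
# R(2) = 2.0, R(1) = 0.0 + 0.9*2.0 = 1.8, R(0) = 1.0 + 0.9*1.8 = 2.62
```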

policy iteration

there are two steps:

  • policy evaluation: with the current policy \pi, evaluate the state value V'

  • policy improvement: with that state value V', update the policy using a chosen policy-update strategy (e.g. greedy).

the ideal result is to find a fixed point in the value space, which corresponds to the policy that is optimal under the current policy-update strategy.

in policy iteration, we only consider the policy-value mapping; see the sketch below.
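
a compact sketch of the two alternating steps on a small tabular MDP; the transition table format `P[s][a] = [(prob, next_state, reward), ...]` is an assumed toy layout, not a standard API:

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-6):
    """Alternate policy evaluation and greedy policy improvement
    until the policy stops changing (a fixed point)."""
    policy = np.zeros(n_states, dtype=int)   # deterministic policy: state -> action
    V = np.zeros(n_states)
    while True:
        # policy evaluation: sweep the Bellman expectation equation for the current policy
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break
        # policy improvement: act greedily with respect to the evaluated V
        stable = True
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best = int(np.argmax(q))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:                            # fixed point reached
            return policy, V
```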

value iteration

in policy iteration the value update is implicit (it happens through policy evaluation); in value iteration the value is updated explicitly with Bellman's equation. this is again a fixed-point application: Bellman's equation is an explicit mapping on the state-value space, so a fixed point, and with it an optimal policy, is guaranteed to exist (for \gamma < 1 the mapping is a contraction).

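a sketch of value iteration on the same assumed toy transition format as the policy iteration example above: the value itself is updated with the Bellman optimality equation until it stops changing, then the greedy policy is read off:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-6):
    """Apply the Bellman optimality update until the value function
    reaches its fixed point, then extract the greedy policy."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            v_new = max(q)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    policy = np.array([
        int(np.argmax([sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                       for a in range(n_actions)]))
        for s in range(n_states)
    ])
    return policy, V
```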

an optimization problem

reinforcement learning looks for the optimal value function that achieves the maximum reward at every timestep, which corresponds to the optimal policy \pi*, or equivalently the maximal value function or maximal action value function:

$$ v_*(s) = \max_\pi v_\pi(s) $$

$$ q_*(s, a) = \max_\pi q_\pi(s, a) $$

where the max is taken over all value functions induced by candidate policies.

usually the solution space is not known in advance; if it were, the optimal solution could be read off directly. on the other hand, the current state and action set do reveal a little about the solution space, since they are part of it, so optimization algorithms used on convex spaces can be applied here.

a few common optimization algorithms include:

dynamic programming (DP)

DP can be used for both policy evaluation and policy iteration, but in each calculation step (not even one physical step; one physical step can contain hundreds or thousands of calculation steps) DP has to recompute the values of all states, from the current state back to the very first state, which is a disaster for high-dimensional problems. DP is also essentially a tabular (integer-indexed) method, so it does not fit continuous domains either.

Monte Carlo (MC)

MC draws a few sample solutions from the solution space and uses these samples to approximate the whole space. in a geometric interpretation: the basis vectors of the solution space are unknown, but MC finds a few points in the space and approximates a (near-)basis from them, so that any point in the space can then be approximated by a linear combination of these near-basis vectors.

MC knows nothing about the state transition probability and does not care about the inner relations of the state variables; in other words, it is model-free.

being model-free makes MC robust. on the other hand, if the solution space is high-dimensional, there is a high chance the sampled points are confined to a lower-dimensional subspace, which weakens the MC approximation; also, MC needs to reach a terminal state, which may not be available in some applications.
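
a sketch of (every-visit) Monte Carlo evaluation, under the assumption that `sample_episode()` is some user-supplied function returning one finished episode as a list of (state, reward) pairs:

```python
from collections import defaultdict

def mc_evaluate(sample_episode, n_episodes=1000, gamma=0.99):
    """Estimate V(s) as the average discounted return observed after
    visits to s; MC needs each episode to terminate."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for _ in range(n_episodes):
        episode = sample_episode()            # [(s0, r1), (s1, r2), ...]
        G = 0.0
        # walk backwards so G is the discounted return from step t onward
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G
            returns_cnt[state] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```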

Temporal Difference (TD)

MC uses the whole list of states for each policy evaluation; in 1st-order TD (TD(0)), the policy evaluation target uses only the next reward and the next state value:

$ v(S(t)) \approx R(t+1) + \gamma\, v(S(t+1)) $

for n-th order (n-step) TD, the target uses the next n rewards:

$ v(S(t)) \approx R(t+1) + \gamma R(t+2) + ... + \gamma^{n-1} R(t+n) + \gamma^n\, v(S(t+n)) $

clearly, as n → ∞, n-th order TD gets closer and closer to MC. so how close is close enough in practice, namely which n is enough?
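
a sketch of one TD(0) evaluation step; `alpha` is a step size, a hyperparameter not named in the text above, and `V` is assumed to be a dict from state to value:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step: move V(s) toward the bootstrapped target
    r + gamma * V(s_next) instead of waiting for the full return."""
    target = r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V
```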

SARSA

in TD there are two families: on-policy methods, where the same policy is used both to update the value function and to select actions; and off-policy methods, where one policy is used to update the value function while a different policy selects actions.

SARSA is an on-policy method, and its policy is \epsilon-greedy: with high probability (1 - \epsilon) it chooses the action with the maximal value, where \epsilon is small.

$ Q(S, A) \leftarrow Q(S, A) + \alpha \, [\, R + \gamma\, Q(S', A') - Q(S, A) \,] $

in the update equation above, \alpha is the step size and the bracketed term $ R + \gamma Q(S', A') - Q(S, A) $ is the TD error; S' and A' are the next state and next action. instead of overwriting the value with the raw 1st-order TD target, the value function is nudged by a fraction of the TD error at every step, which acts as a local correction; a plain replacement with the TD target can be highly unstable or diverge over long time series.

since time-series TD by itself cannot guarantee convergence, this local correction is what makes SARSA numerically robust.
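
a sketch of one SARSA step with \epsilon-greedy action selection; the Q table as a dict keyed by (state, action) is an assumed layout:

```python
import random

def epsilon_greedy(Q, state, actions, eps=0.1):
    """With probability eps explore, otherwise take the max-Q action."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

def sarsa_step(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy update: the bootstrap uses the action a_next that the
    same epsilon-greedy policy actually chose in s_next."""
    td_error = r + gamma * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q
```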

multi-step SARSA(\lambda) relates to SARSA in the same way n-th order TD relates to TD(0); the n-step target is:

$ R(t+1) + \gamma R(t+2) + ... + \gamma^{n-1} R(t+n) + \gamma^n\, Q(S(t+n), A(t+n)) $

the problem with SARSA is that the Q(S, A) table can become huge when the state and action spaces are large, which requires a lot of memory.

Q-learning

Q-learning is an off-policy TD method. action selection uses an \epsilon-greedy policy, the same as SARSA, while the policy evaluation that updates the value function uses the greedy policy.

in a word: from state S, use the \epsilon-greedy policy to choose action A, get reward R and land in state S', then use the greedy policy to pick A' for the update.

$ Q(S, A) \leftarrow Q(S, A) + \alpha \, [\, R + \gamma\, Q(S', A') - Q(S, A) \,] $   (a)

where $ A' = \arg\max_a Q(S', a) $, so $ \gamma\, Q(S', A') = \gamma \max_a Q(S', a) $; in SARSA, by contrast, both S' and A' come from the \epsilon-greedy policy.

usually Q(S, A) is called the Q-value.
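
a sketch of one Q-learning step; the behaviour action is still chosen \epsilon-greedily (as in the SARSA sketch above), but the update bootstraps with the greedy max over the next actions:

```python
def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Off-policy update: the target uses max over Q(s', a') regardless
    of which action the behaviour policy will actually take next."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_error = r + gamma * best_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q
```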

the benefit of using a different policy to update the value is a kind of decoupling of state and action: the agent can reach more of the real state-action space, which also makes the resulting solution a little more robust.

but equation (a) is still not guaranteed to converge as time goes on. a converged Q(S, A) should be locally concave, meaning its second-order derivative is non-positive, so the maximal Q value (the maximum) is reached where the first-order derivative is zero.

Q-learning, however, has the same problem as SARSA: the huge memory needed to store the Q(S, A) table.

for an intuitive example, think of a robot that makes 2 choices at each step (e.g. turn left, turn right) over a line of 20 grid boxes: that is already on the order of 2^20 ≈ 10^6 entries in Q(S, A). it cannot scale to any real problem.

references

  • Sutton & Barto, Reinforcement Learning: An Introduction
  • Pinard
  • fromeast
  • mathxml editor
  • equation html editor
  • fixed point mechanism