reinforcement learning in nutshell-2

function approximation

write the mapping from a (state, action) pair to its value as $ S \times A \rightarrow Q $.

usually S and A can be continuous, high-dimensional spaces, and even in discrete-state problems the table can be enormous. e.g. in the game of Go, the state space is about $10^{170}$, so storing each (state, action) pair in a table is practically impossible. a good approximate way is something like a hash map: input a state, and get its value back through this mapping in one step, rather than in many steps even with a tree structure over the whole state-action/value space.

in math this is called function approximation. it's a very intuitive idea: fit a function to the sample points; then any point in the space can be represented through the function's parameters, rather than searched for among all the sample points, which may be huge. further, only the function's parameters need to be stored, rather than the whole set of sample points, which is really a pleasure.
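a minimal sketch of this idea (not from the original post): sample some (state, value) points from a made-up target function, fit a small set of parameters by least squares, and then evaluate an unseen point from the parameters alone.

```python
# function approximation in miniature: store parameters w, not the samples.
# the "true" value function here is hypothetical, just for illustration.
import numpy as np

rng = np.random.default_rng(0)

# pretend these are sampled (state, value) pairs from interaction
states = rng.uniform(-1.0, 1.0, size=200)
values = np.sin(3.0 * states) + 0.1 * rng.normal(size=200)   # unknown target + noise

# choose a feature map phi(s) and fit weights w so that v(s) ~= phi(s) . w
def phi(s):
    return np.stack([np.ones_like(s), s, s**2, s**3, s**4, s**5], axis=-1)

w, *_ = np.linalg.lstsq(phi(states), values, rcond=None)

# any point in the space is now represented by the parameters w alone
print("approx value at s=0.37:", phi(np.array([0.37])) @ w)
```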

Deep Q-Learning

previously in Q-Learning, to update Q(S,A):

$ Q(S,A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_{a'} Q(S', a') - Q(S, A) \right] $

where $\alpha$ is the learning rate and $\gamma$ is the discount factor.
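a minimal sketch of this update in tabular form; the epsilon-greedy policy and the action list are assumptions for illustration, not part of the post.

```python
# tabular Q-learning update; Q defaults every unseen (s, a) pair to 0
import random
from collections import defaultdict

alpha, gamma, eps = 0.1, 0.99, 0.1
Q = defaultdict(float)                      # Q[(s, a)] -> value

def epsilon_greedy(s, actions):
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_learning_step(s, a, r, s_next, actions):
    # Q(S,A) <- Q(S,A) + alpha * [R + gamma * max_a' Q(S',a') - Q(S,A)]
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```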

to approximate q(s,a) we can use a neural network (NN); the benefit of an NN is that it can approximate any function to arbitrary precision, similar to a complete orthonormal sequence, e.g. a Fourier series.

CNN approximator

the input to the Q neural network is the feature vector of state S, assuming the action set is finite. the output is the action value parameterized by the network weights $\theta$: $q(s, a, \theta)$,

$ Q(S, A, \theta) \approx Q(S, A) $

(image: Q-network architecture, one output per action)

as shown in the rightmost section of the figure: since the action set is assumed finite, there is one output per action.
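a minimal sketch of such a network, assuming PyTorch; the post's figure uses a CNN over image input, while here a small fully-connected net over a state feature vector shows the same input/output structure (state_dim and n_actions are placeholders).

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        # for image states the first layers would be convolutional (the CNN case);
        # a small MLP is enough to show the shape of the approximator
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),          # one q(s, a, theta) per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)                 # shape: (batch, n_actions)

q_net = QNetwork(state_dim=8, n_actions=4)
q_values = q_net(torch.randn(32, 8))           # q(s, a, theta) for a batch of states
```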

experience replay

instead of discarding experiences after one stochastic gradient descent step, the agent remembers past experiences and learns from them repeatedly, as if the experience had happened again. this allows for greater data efficiency. another reason: a DNN easily overfits the current episodes, and once the DNN is overfitted, it's difficult to produce varied experiences. so experience replay stores experiences including state transitions, rewards, and actions, which are the data needed to perform Q learning.

at time t, the agent's experience $e_t$ is defined as the tuple $ e_t = (s_t, a_t, r_t, s_{t+1}) $.

all of the agent's experiences at each time step, over all episodes played by the agent, are stored in the replay memory. in practice, the replay memory/buffer is usually given a finite size limit, i.e. only the most recent N experiences are stored. random samples are then drawn from the replay buffer to train the network.

the key reason here is to break the correlation between consecutive samples. if the network learned from consecutive samples of experience, in the order they occurred in the environment, the samples would be highly correlated; taking random samples from the replay buffer avoids this.
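a minimal sketch of a replay buffer with a finite size limit and uniform random sampling (names and capacity are illustrative, not from the post):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)   # old experiences fall out automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)   # uniform, uncorrelated draws
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```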

target network

use a separate network to estimate the target. this target network has the same architecture as the Q-network approximator but with frozen parameters. every T steps (a hyperparameter) the parameters from the Q network are copied to this target network, which leads to more stable training because it keeps the target function fixed (for a while).
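a minimal sketch of the target-network trick, assuming PyTorch and the QNetwork sketch above; T_SYNC stands in for the hyperparameter T, and rewards/dones are float tensors.

```python
import copy
import torch

T_SYNC, gamma = 1000, 0.99

q_net = QNetwork(state_dim=8, n_actions=4)
target_net = copy.deepcopy(q_net)              # same architecture, updated only by copying
target_net.eval()

def td_target(rewards, next_states, dones):
    # targets come from the frozen network, so they stay fixed between syncs
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * next_q * (1.0 - dones)

def maybe_sync(step: int):
    if step % T_SYNC == 0:
        target_net.load_state_dict(q_net.state_dict())   # copy parameters every T steps
```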

go further in DQN

Double DQN

here two networks are used: the DQN network is for action selection and the target network is for target Q-value generation.

the problem: how can we be sure the best action for the next state really is the action with the highest Q-value? with a single network, the max operator both selects and evaluates the action, which tends to overestimate values; Double DQN decouples selection (online network) from evaluation (target network).
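a minimal sketch of the Double DQN target, assuming the two networks from the sketches above: the online q_net selects the best next action, the frozen target_net evaluates it.

```python
import torch

def double_dqn_target(rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)        # selection
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # evaluation
    return rewards + gamma * next_q * (1.0 - dones)
```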

Prioritized Replay DQN

Dueling DQN

a summary, for now

at the beginning of this series, I was thinking of making a simple summary about DQN, which looks promising for self-driving. after almost one month, more and more topics and great blogs keep jumping out, and the simple idea of making a summary is not so easy any more. I will stop here, and come back to it later.
