reinforcement learning, from David Silver's course. in autonomous vehicle training, more and more practice depends on simulator training, e.g. CARLA.
a commonly used open-source framework is OpenAI Gym.
each agent has a state, an observation, and actions, which can be defined through state_space, observation_space, and action_space.
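a minimal sketch of what such an environment can look like in Gym is below; the class name DrivingEnv, the 8-dimensional observation, the 4 discrete actions, and the dummy dynamics are illustrative assumptions, not a real CARLA wrapper:

```python
import numpy as np
import gym
from gym import spaces

class DrivingEnv(gym.Env):
    """Illustrative environment; names, shapes and dynamics are placeholders."""

    def __init__(self):
        # observation: a vector of sensor readings (continuous, here 8-dimensional)
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(8,), dtype=np.float32)
        # action: turn left, turn right, accelerate, brake (discrete)
        self.action_space = spaces.Discrete(4)

    def reset(self):
        self._state = np.zeros(8, dtype=np.float32)
        return self._state

    def step(self, action):
        # dummy transition and reward, only to show the interface
        self._state = self._state + 0.01 * np.random.randn(8).astype(np.float32)
        reward = 0.0
        done = False
        return self._state, reward, done, {}
```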
background
1) infinite or finite :
in a real case, the state and observation spaces are easily infinite, e.g. in every frame the agent sees a different environment,
which means a different state. the elements in action_space may be limited, e.g. turn left, turn right, accelerate, brake, etc.
2) continuous or discrete :
when driving a car, the control is continuous, namely a continuous function of time t.
on the other hand, in a game like Go, the action is not tied to continuous time; each action is discrete from the next
(see the sketch after this list).
3) local or global perception
for a grid-world game, the agent knows the whole world; but when a car runs in a city, it can only perceive the environment locally.
this makes a big difference when making decisions.
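as a quick illustration of 1) and 2), here is how discrete and continuous action spaces look in Gym; the action ordering and the steering/throttle ranges are assumptions:

```python
import numpy as np
from gym import spaces

# discrete action space: 4 driving commands (illustrative ordering)
discrete_actions = spaces.Discrete(4)   # 0: left, 1: right, 2: accelerate, 3: brake

# continuous action space: steering in [-1, 1], throttle in [0, 1] (assumed ranges)
continuous_actions = spaces.Box(
    low=np.array([-1.0, 0.0], dtype=np.float32),
    high=np.array([1.0, 1.0], dtype=np.float32),
    dtype=np.float32)

print(discrete_actions.sample())     # e.g. 2
print(continuous_actions.sample())   # e.g. [0.13 0.78]
```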
the following talks about online policy RL, which means the agent follows some high-level policy, e.g. respect the velocity limit, do not collide with other cars, etc., while learning how to drive. we often need to define an environment reward, which is used to update the detail-level policy, e.g. when the agent is in stateA, policyI will be used. but keep in mind, neither the human nor the agent knows exactly at first which is the best detail-level value for the chosen policy, and the input (state, policy/action) is not even accurate. so the learning process is actually to iterate over episodes and update the table of (state, action) values, in which the policy is updated implicitly; this table is usually called the Q-value. after training, the (state, action) table handles the simulation environment better.
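as a concrete sketch, the per-step update of that (state, action) table usually looks like the standard tabular Q-learning rule; the learning rate alpha and discount gamma below are assumed values:

```python
from collections import defaultdict

alpha = 0.1    # learning rate (assumed value)
gamma = 0.99   # discount factor (assumed value)

# Q-table: maps a (state, action) pair to its estimated value
Q = defaultdict(float)

def update(state, action, reward, next_state, actions):
    """One tabular Q-learning update of the (state, action) table."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
```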
so the high-level computing flow is an episode loop: observe the state, pick an action from the current policy, receive a reward, and update the (state, action) table. in the real case, this while loop basically updates the policy, so next time in a certain state, the agent will perform better. but there is a problem: most real problems don't have a Q-table, since the state space is infinite.
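a minimal sketch of that while loop with the Gym-style interface, reusing the Q table and update function from the sketch above; the episode count and the epsilon-greedy action choice are assumptions:

```python
import random

def train(env, episodes=500, epsilon=0.1):
    """High-level flow: run episodes, pick actions, update the Q-table."""
    actions = range(env.action_space.n)
    for _ in range(episodes):
        state = tuple(env.reset())          # make the state hashable for the table
        done = False
        while not done:
            # epsilon-greedy: mostly follow the current policy, sometimes explore
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done, _ = env.step(action)
            next_state = tuple(next_state)
            update(state, action, reward, next_state, actions)
            state = next_state
```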
1) linear space approximator
the real state space is high-dimensional, even infinite-dimensional. mathematically it is fine to use a linear space to approximate the real space (a Hilbert space) as closely as possible, e.g. in elastic mechanics, the finite element space can approximate the continuous elastic space to any accuracy.
so the idea is to construct the basis vectors of the linear space
from existing sample states; then every state can be described as a linear combination of the basis vectors.
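one possible sketch of this idea, assuming radial-basis features built from a handful of sample states as the basis vectors; the feature choice, dimensions and counts are all illustrative:

```python
import numpy as np

N_ACTIONS = 4                           # matches the driving actions above (assumed)
centers = np.random.randn(16, 8)        # 16 sample states of dimension 8 (assumed)
w = np.zeros(N_ACTIONS * len(centers))  # weights of the linear combination

def features(state, action):
    """Basis vector: radial-basis features around the sample states, one block per action."""
    phi = np.exp(-np.linalg.norm(centers - state, axis=1) ** 2)
    onehot = np.zeros(N_ACTIONS)
    onehot[action] = 1.0
    return np.outer(onehot, phi).ravel()

def q_value(state, action):
    # the value of a (state, action) pair is a linear combination of the basis features
    return w @ features(state, action)
```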
2) neural-network approximator
for nonlinear fitting, construct a neural network to describe the state space. how to train the network and how to make sure it is robust and convergent is another topic.
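a minimal sketch of such a network, written here with PyTorch and arbitrary layer sizes; training it (e.g. DQN-style) is the separate topic mentioned above:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a continuous state vector to one value per discrete action."""

    def __init__(self, state_dim=8, n_actions=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

# greedy action for a single state; learning the weights is the separate topic
q_net = QNetwork()
state = torch.zeros(1, 8)
action = q_net(state).argmax(dim=1).item()
```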