Q-learning: Playing OpenAI's Taxi-v2
Originally written on: 5th January, 2019.
Introduction

Reinforcement Learning

Reinforcement Learning is the science of making optimal decisions using experiences. Breaking it down, the process of Reinforcement Learning involves these simple steps (a minimal sketch of this loop in code follows the list):
1. Observation of the environment
2. Deciding how to act using some strategy
3. Acting accordingly
4. Receiving a reward or penalty
5. Learning from the past experiences and refining the strategy
6. Iterate until an optimal strategy is found
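Here is a minimal sketch of that loop in code. It assumes gym is installed and uses the Taxi-v2 environment that we set up properly further below; for now the "strategy" is just picking random actions, which the rest of this post replaces with a learned Q-table.

import gym

env = gym.make("Taxi-v2").env
state = env.reset()                                # 1. observe the environment
done = False

for step in range(200):                            # cap the steps, random play rarely finishes
    action = env.action_space.sample()             # 2-3. decide how to act, then act
    state, reward, done, info = env.step(action)   # 4. receive a reward or penalty
    # 5-6. a learning agent would use (state, action, reward) here to refine its strategy
    if done:
        break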
Let's now understand Reinforcement Learning by actually developing an agent to learn to play a game automatically on its own.
The game we will play is OpenAI Gym's Taxi-v2, a 5x5 parking-lot grid, shown here with its row and column indices:

   0 1 2 3 4
  +---------+
0 |R: | : :G|
1 | : : : : |
2 | : : : : |
3 | | : | : |
4 |Y| : |B: |
  +---------+
#code
import numpy as np
import random
import gym

env = gym.make("Taxi-v2").env
env.render()
o/p
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
We have imported gym. If you do not have gym installed, you can run pip install gym in your command prompt and gym will be installed.
1) State space

Let's assume the Taxi is the only vehicle in this parking lot. We can break up the parking lot into a 5x5 grid, which gives us 25 possible taxi locations. These 25 locations are one part of our state space. Notice that the current location state of our taxi in the above output is at coordinate (1,3).
You'll also notice there are four (4) locations where we can pick up and drop off a passenger: R, G, Y, B, or [(0,0), (0,4), (4,0), (4,3)] in (row, col) coordinates. Our illustrated passenger is at location Y, and they wish to go to location R.
When we also account for one (1) additional passenger state of being inside the taxi, we can take all combinations of passenger locations and destination locations to arrive at the total number of states for our taxi environment: there are four (4) destinations and five (4 + 1) passenger locations.

So, our taxi environment has 5×5×5×4 = 500 total possible states.
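To make this state encoding concrete, the Taxi environment exposes an encode method that turns (taxi row, taxi column, passenger location index, destination index) into a single state index, with the location indices following the order R, G, Y, B and 4 meaning "inside the taxi". The particular coordinates below are just an illustrative example:

# Map a concrete situation to one of the 500 state indices
state = env.encode(3, 1, 2, 0)   # taxi at row 3, col 1; passenger at Y; destination R
print("State index:", state)

env.s = state                    # force the environment into that state and look at it
env.render()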
2) Action space
The agent encounters one of the 500 states and it takes an action. The action in our case can be to move in a direction or decide to pickup/dropoff a passenger.

In other words, we have six possible actions:
1. south
2. north
3. east
4. west
5. pickup
6. dropoff

This is the action space: the set of all the actions that our agent can take in a given state.
You'll notice in the illustration above that the taxi cannot perform certain actions in certain states due to walls. In the environment's code, we will simply provide a -1 penalty for every wall hit, and the taxi won't move anywhere. This will just rack up penalties, causing the taxi to consider going around the wall.
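You can see this reward structure directly: the environment keeps a transition/reward table, env.P, indexed by state, where each entry maps an action to a list of (probability, next state, reward, done) tuples. A quick peek, using an arbitrary example state index:

# Inspect the built-in reward table for one state
# env.P[state][action] -> [(probability, next_state, reward, done)]
state = 328                      # arbitrary example state index
for action, transitions in env.P[state].items():
    print(action, transitions)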
env.reset()  # reset environment to a new, random state
env.render()

action_size = env.action_space.n
print("Action size ", action_size)

state_size = env.observation_space.n
print("State size ", state_size)
o/p
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
Action size 6
State size 500
Next we create our Q-table. To know how many rows (states) and columns (actions) it needs, we use the action_size and state_size computed above; OpenAI Gym provides a way to get them: env.action_space.n and env.observation_space.n. After that, we display the matrix.
q_table = np.zeros([env.observation_space.n, env.action_space.n])
print(q_table)
o/p
[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 ...
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]
Intro to Q-learning
Essentially, Q-learning lets the agent use the environment's rewards to learn, over time, the best action to take in a given state.
In our Taxi environment, we have a reward table from which the agent learns. It does this by receiving a reward for taking an action in the current state, then updating a Q-value to remember whether that action was beneficial.
The values stored in the Q-table are called Q-values, and they map to a (state, action) combination.
A Q-value for a particular state-action combination is representative of the "quality" of an action taken from that state. Better Q-values imply better chances of getting greater rewards.

For example, if the taxi is faced with a state that includes a passenger at its current location, it is highly likely that the Q-value for pickup is higher when compared to other actions, like dropoff or north.
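Once the table has been trained (code below), reading off the best action for a state is just a row lookup. A small sketch, assuming the action indices follow the same order as the list above (0 = south, ..., 5 = dropoff) and using an arbitrary example state:

action_names = ["south", "north", "east", "west", "pickup", "dropoff"]

state = 328                                  # arbitrary example state index
print(q_table[state])                        # the six Q-values for this state
best_action = np.argmax(q_table[state])      # column with the highest Q-value
print("Best action:", action_names[best_action])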
Q-values are initialized to an arbitrary value, and as the agent exposes itself to the environment and receives different rewards by executing different actions, the Q-values are updated using the equation:

Q(state, action) ← (1 − α) · Q(state, action) + α · (reward + γ · max Q(next state, all actions))

Where:
· α (alpha) is the learning rate (0 < α ≤ 1) - just like in supervised learning settings, α is the extent to which our Q-values are updated in every iteration.
· γ (gamma) is the discount factor (0 ≤ γ ≤ 1) - it determines how much importance we want to give to future rewards. A high value for the discount factor (close to 1) captures the long-term effective reward, whereas a discount factor of 0 makes our agent consider only the immediate reward, hence making it greedy.
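As a minimal sketch, this update can also be written as a standalone helper (the function name is mine; it restates exactly the update line used inside the training loop below):

def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """One Q-learning update for a single (state, action, reward, next_state) step."""
    old_value = q_table[state, action]               # current estimate Q(s, a)
    next_max = np.max(q_table[next_state])           # best value achievable from next_state
    q_table[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)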
%%time
"""Training the agent"""

import random
from IPython.display import clear_output

# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1
max_steps = 99
total_test_episodes = 100

# For plotting metrics
all_epochs = []
all_penalties = []

for i in range(1, 100001):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore the action space
        else:
            action = np.argmax(q_table[state])  # Exploit learned values

        next_state, reward, done, info = env.step(action)

        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])

        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        if reward == -10:  # counting penalties (illegal pickup/dropoff)
            penalties += 1

        state = next_state
        epochs += 1

    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")
o/p
Episode: 100000
Training finished.

Wall time: 40.5 s
Next, we evaluate the performance of the agent.
#evaluate the agent's performance
from IPython.display import clear_output
from time import sleep

env.reset()
rewards = []

for episode in range(total_test_episodes):
    state = env.reset()
    step = 0
    done = False
    total_rewards = 0

    for step in range(max_steps):
        # The AGENT is PLAYING
        env.render()

        # Take the action (index) that has the maximum expected future reward given that state
        action = np.argmax(q_table[state])

        new_state, reward, done, info = env.step(action)

        total_rewards += reward

        if done:
            rewards.append(total_rewards)
            print("Score", total_rewards)
            break

        state = new_state

env.close()
print("Score over time: " + str(sum(rewards) / total_test_episodes))
Full code is available on my GitHub.
Output after evaluating the agent's performance:
(North)
+---------+
|R: | :_:G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(East)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(East)
Score 11
Score over time: 8.41
This is the output after 100 episodes.