Q-learning: Playing OpenAI's Taxi-v2

Originally written on 5th January, 2019.

OpenAI Taxi-v2 (image source: Google Images)




Introduction

Reinforcement Learning


Reinforcement Learning is the science of making optimal decisions from experience. Breaking it down, the process of Reinforcement Learning involves these simple steps:

1.     Observation of the environment
2.     Deciding how to act using some strategy
3.     Acting accordingly
4.     Receiving a reward or penalty
5.     Learning from past experiences and refining the strategy
6.     Iterating until an optimal strategy is found (see the sketch just below)
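
In gym terms, these six steps become a simple observe-act-learn loop. The sketch below uses a purely random strategy just to show the structure; the Q-learning agent built in the rest of this post replaces the random choice with a learned one:

import gym

env = gym.make("Taxi-v2").env
state = env.reset()                               # 1. observe the environment
done = False
while not done:
    action = env.action_space.sample()            # 2-3. decide how to act (random here) and act
    state, reward, done, info = env.step(action)  # 4. receive a reward or penalty
    # 5-6. a learning algorithm would use (state, action, reward) to refine its
    #      strategy, iterating until an optimal strategy is found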


Let's now understand Reinforcement Learning by actually developing an agent that learns to play a game on its own.

    0 1 2 3 4
   +---------+
 0 |R: | : :G|
 1 | : : : : |
 2 | : : : : |
 3 | | : | : |
 4 |Y| : |B: |
   +---------+

 #code






import numpy as np
import random
import gym

env = gym.make("Taxi-v2").env

env.render()

o/p

+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

We have imported gym. If you do not have gym installed, run pip install gym in your command prompt to install it.

1) State space

Let's assume the taxi is the only vehicle in this parking lot. We can break up the parking lot into a 5x5 grid, which gives us 25 possible taxi locations. These 25 locations are one part of our state space. Notice that the current location of our taxi in the output above is the coordinate (1,3).
You'll also notice there are four (4) locations that we can pick up and drop off a passenger: R, G, Y, B or [(0,0), (0,4), (4,0), (4,3)] in (row, col) coordinates. Our illustrated passenger is in location Y and they wish to go to location R.
When we also account for one (1) additional passenger state, being inside the taxi, we can take all combinations of passenger locations and destination locations to get the total number of states for our taxi environment: there are four (4) destinations and five (4 + 1) passenger locations.
So, our taxi environment has 5×5×5×4=500 total possible states.
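
As a quick sanity check on this count, the unwrapped Taxi environment provides an encode helper that maps (taxi row, taxi column, passenger location, destination) to one of these 500 indices; the following is a minimal sketch, with arbitrarily chosen taxi coordinates, passenger location index 2 (Y) and destination index 0 (R) matching the illustration above:

# Encode one concrete situation into a state index between 0 and 499
state = env.encode(3, 1, 2, 0)   # (taxi row, taxi column, passenger index, destination index)
print("State:", state)

env.s = state     # put the environment into that state
env.render()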
2) Action space
The agent encounters one of the 500 states and it takes an action. The action in our case can be to move in a direction or decide to pickup/dropoff a passenger.
In other words, we have six possible actions:
1.     south
2.     north
3.     east
4.     west
5.     pickup
6.     dropoff
This is the action space: the set of all the actions that our agent can take in a given state.
You'll notice in the illustration above that the taxi cannot perform certain actions in certain states due to walls. In the environment's code, every wall hit simply gives a -1 penalty and the taxi doesn't move. This just racks up penalties, nudging the taxi to learn to go around the wall.
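
For a concrete look at how these rewards and wall hits are encoded, gym's toy-text environments expose a transition table, env.P, mapping each (state, action) pair to (probability, next_state, reward, done) tuples. A minimal sketch (the exact output depends on the random state returned by reset):

state = env.reset()
print(env.P[state])   # {action: [(probability, next_state, reward, done)], ...}
# Movement actions cost -1; an action that runs into a wall keeps next_state equal
# to the current state and still costs -1; an illegal pickup/dropoff costs -10.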

env.reset() # reset environment to a new, random state
env.render()

action_size = env.action_space.n
print("Action size ", action_size)

state_size = env.observation_space.n
print("State size ", state_size)


o/p
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

Action size  6
State size  500

To know how many rows (states) and columns (actions) our Q-table needs, we calculate the action_size and the state_size. OpenAI Gym provides a way to do that: env.action_space.n and env.observation_space.n. With those two numbers we can now create the Q-table and display it.


q_table = np.zeros([env.observation_space.n, env.action_space.n])
print (q_table)

 o/p
[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 ...
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]



Intro to Q-learning

Essentially, Q-learning lets the agent use the environment's rewards to learn, over time, the best action to take in a given state.
In our Taxi environment, we have a reward table that the agent learns from. The agent receives a reward for taking an action in the current state, then updates a Q-value to remember whether that action was beneficial.
The values stored in the Q-table are called Q-values, and each one maps to a (state, action) combination.
A Q-value for a particular state-action combination is representative of the "quality" of an action taken from that state. Better Q-values imply better chances of getting greater rewards.
For example, if the taxi is faced with a state that includes a passenger at its current location, it is highly likely that the Q-value for pickup is higher when compared to other actions, like dropoff or north.
Q-values are initialized to an arbitrary value, and as the agent explores the environment and receives different rewards by executing different actions, the Q-values are updated using the equation:

Q(state, action) ← (1 − α) · Q(state, action) + α · (reward + γ · max Q(next state, all actions))

Where:
·        α (alpha) is the learning rate (0<α≤1) - Just like in supervised learning settings, α is the extent to which our Q-values are being updated in every iteration.
·        γ (gamma) is the discount factor (0≤γ≤1) - determines how much importance we want to give to future rewards. A high value for the discount factor (close to 1) captures the long-term effective reward, whereas a discount factor of 0 makes our agent consider only the immediate reward, hence making it greedy.
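
To make the update rule concrete, here is a single hand-computed step using the same α = 0.1 and γ = 0.6 as the training code below; this is just an illustrative calculation, not part of the training loop:

alpha, gamma = 0.1, 0.6
old_value = 0.0    # Q(state, action) before the update; the table starts at zero
reward = -1        # an ordinary movement step in Taxi costs -1
next_max = 0.0     # best Q-value in the next state, also still zero early in training
new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
print(new_value)   # -0.1: the Q-value for this state-action pair drifts slightly downward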

%%time
"""Training the agent"""

import random
from IPython.display import clear_output

# Hyperparameters
alpha = 0.1                  # learning rate
gamma = 0.6                  # discount factor
epsilon = 0.1                # exploration rate for the epsilon-greedy policy
max_steps = 99               # max steps per episode when evaluating the agent below
total_test_episodes = 100    # number of episodes used to evaluate the agent below

# For plotting metrics
all_epochs = []
all_penalties = []

for i in range(1, 100001):
    state = env.reset()

    epochs, penalties, reward = 0, 0, 0
    done = False
   
    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample() # Explore action space
        else:
            action = np.argmax(q_table[state]) # Exploit learned values

        next_state, reward, done, info = env.step(action)
       
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
       
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        if reward == -10:  # count illegal pickup/dropoff penalties
            penalties += 1

        state = next_state
        epochs += 1
       
    if i % 100 == 0:
        clear_output(wait=True)
        print("Episode:",{i})

print("Training finished.\n")
o/p
Episode: 100000
Training finished.

Wall time: 40.5 s
Next, we evaluate the performance of the agent.
#evaluate the agent's performance

from IPython.display import clear_output
from time import sleep


env.reset()
rewards = []

for episode in range(total_test_episodes):
    state = env.reset()
    step = 0
    done = False
    total_rewards = 0

    for step in range(max_steps):
        #  The AGENT is PLAYING
        env.render()
        # Take the action (index) that has the maximum expected future reward given that state
        action = np.argmax(q_table[state])
       
        new_state, reward, done, info = env.step(action)
        total_rewards += reward
        if done:
            rewards.append(total_rewards)
            print ("Score", total_rewards)
            break
        state = new_state
env.close()
print ("Score over time: " +  str(sum(rewards)/total_test_episodes))



Full code is available on my GitHub.


 Output after evaluating the agent's performance
(North)
+---------+
|R: | :_:G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
Score 11
Score over time: 8.41

This is the output after 100 test episodes.
