Q-learning: Playing OpenAI's Taxi-v2
Originally written on: 5th January, 2019.
Introduction

Reinforcement Learning

Reinforcement Learning is the science of making optimal decisions using experiences. Breaking it down, the process of Reinforcement Learning involves these simple steps (a minimal sketch of this loop in code follows the list):
1. Observation of the environment
2. Deciding how to act using some strategy
3. Acting accordingly
4. Receiving a reward or penalty
5. Learning from the past experiences and refining the strategy
6. Iterate until an optimal strategy is found
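Here is a minimal sketch of that loop in code. It assumes gym is installed and uses the Taxi-v2 environment that we set up properly further below; for now the "strategy" is just picking random actions, which the rest of this post replaces with a learned Q-table.

import gym

env = gym.make("Taxi-v2").env
state = env.reset()                                # 1. observe the environment
done = False

for step in range(200):                            # cap the steps, random play rarely finishes
    action = env.action_space.sample()             # 2-3. decide how to act, then act
    state, reward, done, info = env.step(action)   # 4. receive a reward or penalty
    # 5-6. a learning agent would use (state, action, reward) here to refine its strategy
    if done:
        break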
Let's now understand Reinforcement Learning by actually developing an agent to learn to play a game automatically on its own.
The game we will play is OpenAI Gym's Taxi-v2, a 5x5 parking-lot grid, shown here with its row and column indices:

   0 1 2 3 4
  +---------+
0 |R: | : :G|
1 | : : : : |
2 | : : : : |
3 | | : | : |
4 |Y| : |B: |
  +---------+
#code
import numpy as np
import random
import gym

env = gym.make("Taxi-v2").env
env.render()
o/p
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
We have imported gym. If you do not have gym installed, you can run pip install gym in your command prompt and gym will be installed.
1) State space

Let's assume the Taxi is the only vehicle in this parking lot. We can break up the parking lot into a 5x5 grid, which gives us 25 possible taxi locations. These 25 locations are one part of our state space. Notice that the current location state of our taxi in the above output is at coordinate (1,3).
You'll also notice there are four (4) locations where we can pick up and drop off a passenger: R, G, Y, B, or [(0,0), (0,4), (4,0), (4,3)] in (row, col) coordinates. Our illustrated passenger is at location Y, and they wish to go to location R.
When we also account for one (1) additional passenger state of being inside the taxi, we can take all combinations of passenger locations and destination locations to arrive at the total number of states for our taxi environment: there are four (4) destinations and five (4 + 1) passenger locations.

So, our taxi environment has 5×5×5×4 = 500 total possible states.
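To make this state encoding concrete, the Taxi environment exposes an encode method that turns (taxi row, taxi column, passenger location index, destination index) into a single state index, with the location indices following the order R, G, Y, B and 4 meaning "inside the taxi". The particular coordinates below are just an illustrative example:

# Map a concrete situation to one of the 500 state indices
state = env.encode(3, 1, 2, 0)   # taxi at row 3, col 1; passenger at Y; destination R
print("State index:", state)

env.s = state                    # force the environment into that state and look at it
env.render()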
2) Action space
The agent encounters one of the 500 states and it takes an action. The action in our case can be to move in a direction or decide to pickup/dropoff a passenger.

In other words, we have six possible actions:
1. south
2. north
3. east
4. west
5. pickup
6. dropoff

This is the action space: the set of all the actions that our agent can take in a given state.
You'll notice in the illustration above that the taxi cannot perform certain actions in certain states due to walls. In the environment's code, we will simply provide a -1 penalty for every wall hit, and the taxi won't move anywhere. This will just rack up penalties, causing the taxi to consider going around the wall.
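You can see this reward structure directly: the environment keeps a transition/reward table, env.P, indexed by state, where each entry maps an action to a list of (probability, next state, reward, done) tuples. A quick peek, using an arbitrary example state index:

# Inspect the built-in reward table for one state
# env.P[state][action] -> [(probability, next_state, reward, done)]
state = 328                      # arbitrary example state index
for action, transitions in env.P[state].items():
    print(action, transitions)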
env.reset()  # reset environment to a new, random state
env.render()

action_size = env.action_space.n
print("Action size ", action_size)

state_size = env.observation_space.n
print("State size ", state_size)
o/p
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
Action size 6
State size 500
Next we create our Q-table. To know how many rows (states) and columns (actions) it needs, we use the action_size and state_size computed above; OpenAI Gym provides a way to get them: env.action_space.n and env.observation_space.n. After that, we display the matrix.
q_table = np.zeros([env.observation_space.n, env.action_space.n])
print(q_table)
o/p
[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 ...
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]
Intro to Q-learning
Essentially, Q-learning lets the agent use the environment's rewards to learn, over time, the best action to take in a given state.
In our Taxi environment, we have a reward table from which the agent learns. It does this by receiving a reward for taking an action in the current state, then updating a Q-value to remember whether that action was beneficial.
The values stored in the Q-table are called Q-values, and they map to a (state, action) combination.
A Q-value for a particular state-action combination is representative of the "quality" of an action taken from that state. Better Q-values imply better chances of getting greater rewards.

For example, if the taxi is faced with a state that includes a passenger at its current location, it is highly likely that the Q-value for pickup is higher when compared to other actions, like dropoff or north.
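Once the table has been trained (code below), reading off the best action for a state is just a row lookup. A small sketch, assuming the action indices follow the same order as the list above (0 = south, ..., 5 = dropoff) and using an arbitrary example state:

action_names = ["south", "north", "east", "west", "pickup", "dropoff"]

state = 328                                  # arbitrary example state index
print(q_table[state])                        # the six Q-values for this state
best_action = np.argmax(q_table[state])      # column with the highest Q-value
print("Best action:", action_names[best_action])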
Q-values are initialized to an arbitrary value, and as the agent exposes itself to the environment and receives different rewards by executing different actions, the Q-values are updated using the equation:

Q(state, action) ← (1 − α) · Q(state, action) + α · (reward + γ · max Q(next state, all actions))

Where:
· α (alpha) is the learning rate (0 < α ≤ 1) - just like in supervised learning settings, α is the extent to which our Q-values are updated in every iteration.
· γ (gamma) is the discount factor (0 ≤ γ ≤ 1) - it determines how much importance we want to give to future rewards. A high value for the discount factor (close to 1) captures the long-term effective reward, whereas a discount factor of 0 makes our agent consider only the immediate reward, hence making it greedy.
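As a minimal sketch, this update can also be written as a standalone helper (the function name is mine; it restates exactly the update line used inside the training loop below):

def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """One Q-learning update for a single (state, action, reward, next_state) step."""
    old_value = q_table[state, action]               # current estimate Q(s, a)
    next_max = np.max(q_table[next_state])           # best value achievable from next_state
    q_table[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)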
%%time
"""Training the agent"""

import random
from IPython.display import clear_output

# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1
max_steps = 99
total_test_episodes = 100

# For plotting metrics
all_epochs = []
all_penalties = []

for i in range(1, 100001):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore the action space
        else:
            action = np.argmax(q_table[state])  # Exploit learned values

        next_state, reward, done, info = env.step(action)

        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])

        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        if reward == -10:  # counting penalties (illegal pickup/dropoff)
            penalties += 1

        state = next_state
        epochs += 1

    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")
o/p
Episode: 100000
Training finished.

Wall time: 40.5 s
Next, we evaluate the performance of the agent.
#evaluate the agent's performance
from IPython.display import clear_output
from time import sleep

env.reset()
rewards = []

for episode in range(total_test_episodes):
    state = env.reset()
    step = 0
    done = False
    total_rewards = 0

    for step in range(max_steps):
        # The AGENT is PLAYING
        env.render()

        # Take the action (index) that has the maximum expected future reward given that state
        action = np.argmax(q_table[state])

        new_state, reward, done, info = env.step(action)

        total_rewards += reward

        if done:
            rewards.append(total_rewards)
            print("Score", total_rewards)
            break

        state = new_state

env.close()
print("Score over time: " + str(sum(rewards) / total_test_episodes))
Full code is available on my GitHub.
Output after evaluating the agent's performance:
(North)
+---------+
|R: | :_:G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(East)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(East)
Score 11
Score over time: 8.41
This is the output after 100 episodes.