A simple example of reinforcement learning using Q-learning to navigate a grid.
Imagine teaching a dog to fetch a ball. You don't explain the rules - instead, the dog tries different things and learns from the results:
- Fetch the ball → gets a treat (positive reward)
- Run away → no treat (no reward)
Over time, the dog figures out: "If I fetch the ball, I get a treat!"
Reinforcement Learning (RL) works the same way. An agent (like a robot or game character) learns by:
- Trying actions in an environment
- Getting rewards (positive or negative) based on what happens
- Learning which actions lead to the best outcomes
Q-learning is a specific RL method where the agent builds a "cheat sheet" of how good each action is in each situation.
The Q-Table: Think of it as a lookup table:
- Rows = all possible situations (states)
- Columns = all possible actions
- Values = how good that action is in that situation
Example:
| State | Go Up | Go Down | Go Left | Go Right |
|---|---|---|---|---|
| (0,0) | -2.5 | 3.1 | -1.0 | 2.8 |
| (0,1) | 1.2 | 4.5 | -0.5 | 3.2 |
Looking at state (0,1), "Go Down" has the highest value (4.5), so that's the best action.
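For example, the table can just be a NumPy array with one row per cell and one column per action. This is only a sketch of the idea (the flattening scheme and the action order are assumptions, not necessarily what the script does):

```python
import numpy as np

n_states, n_actions = 5 * 5, 4              # 25 grid cells, 4 moves
q_table = np.zeros((n_states, n_actions))   # all zeros: the agent knows nothing yet

# Fill in the example row for state (0,1) from the table above
# (column order assumed here: up, down, left, right).
state = 0 * 5 + 1                           # flatten (row, col) into a row index
q_table[state] = [1.2, 4.5, -0.5, 3.2]

best_action = int(np.argmax(q_table[state]))
print(best_action)                          # -> 1, i.e. "Go Down" (value 4.5)
```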
The environment is a 5x5 grid:

```
+---+---+---+---+---+
| S |   |   |   |   |   S = Start (where the agent begins)
+---+---+---+---+---+
|   | X |   | X |   |   X = Wall (can't walk through)
+---+---+---+---+---+
|   |   |   |   |   |   G = Goal (where we want to go)
+---+---+---+---+---+
|   | X |   | X |   |
+---+---+---+---+---+
|   |   |   |   | G |
+---+---+---+---+---+
```
Goal: The agent starts at S (top-left) and must learn to reach G (bottom-right) while avoiding walls.
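In code, that layout is just a small array. Here is one way it could be written down (0 = free, 1 = wall); the variable names are only illustrative:

```python
import numpy as np

# 0 = free cell, 1 = wall; same layout as the picture above
grid = np.array([
    [0, 0, 0, 0, 0],   # S is at (0, 0)
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],   # G is at (4, 4)
])
start, goal = (0, 0), (4, 4)
```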
```
┌─────────────────────────────────────────────────────┐
│ 1. Look at current position (state)                 │
│                          ↓                          │
│ 2. Pick an action (up/down/left/right)              │
│                          ↓                          │
│ 3. Move and get reward (+10 for goal, -1 per step)  │
│                          ↓                          │
│ 4. Update the Q-table based on what happened        │
│                          ↓                          │
│ 5. Repeat until reaching the goal                   │
└─────────────────────────────────────────────────────┘
```
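Squeezed into one self-contained snippet, those five steps look roughly like this. It is a compressed sketch (reusing the grid layout above and the parameter values from the table near the end of this README), not the actual script:

```python
import numpy as np

# Same 5x5 layout as above: 0 = free, 1 = wall
grid = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
])
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
q = np.zeros((5, 5, 4))                      # one Q-value per cell per action
alpha, gamma, epsilon = 0.1, 0.99, 1.0

for episode in range(500):
    r, c = 0, 0                                          # 1. start at S
    while (r, c) != (4, 4):
        if np.random.rand() < epsilon:                   # 2. pick an action:
            a = np.random.randint(4)                     #    explore...
        else:
            a = int(np.argmax(q[r, c]))                  #    ...or exploit
        nr, nc = r + moves[a][0], c + moves[a][1]        # 3. try to move
        if not (0 <= nr < 5 and 0 <= nc < 5) or grid[nr, nc]:
            nr, nc = r, c                                #    blocked: stay put
        reward = 10.0 if (nr, nc) == (4, 4) else -1.0
        q[r, c, a] += alpha * (reward + gamma * q[nr, nc].max() - q[r, c, a])  # 4. update
        r, c = nr, nc                                    # 5. repeat until G
    epsilon = max(0.01, epsilon * 0.995)                 # explore less over time
```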
The agent faces a dilemma:
- Exploit: Do what it knows works best
- Explore: Try new things that might be better
We use epsilon-greedy strategy:
- With probability ε (epsilon): pick a random action (explore)
- Otherwise: pick the best known action (exploit)
At first, ε = 1.0 (100% random). Over time, it decreases so the agent explores less and uses its knowledge more.
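A minimal epsilon-greedy helper could look like this (the function name is illustrative; the 0.995 decay factor and the 0.01 floor come from the parameters listed later in this README):

```python
import numpy as np

def choose_action(q_row, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_row))   # explore: any action at random
    return int(np.argmax(q_row))               # exploit: best-known action

# Decay schedule: start fully random, never drop below 1% exploration.
epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995
for episode in range(500):
    # ... run one episode here, picking actions with choose_action(...) ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
```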
After each move, the agent updates its cheat sheet with the Q-learning rule:

```
Q(state, action) = Q(state, action) + α × [reward + γ × max(Q(next_state)) - Q(state, action)]
```
In plain English:
- α (alpha = 0.1): Learning rate. How much to trust new information vs old knowledge.
- γ (gamma = 0.99): Discount factor. How much to care about future rewards vs immediate rewards.
- reward: What we got for this action (+10 for goal, -1 for each step).
- max(Q(next_state)): The best possible value from the next state.
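In code the whole update is one line; the small helper below just spells out the pieces so they line up with the formula (variable names are illustrative):

```python
import numpy as np

alpha, gamma = 0.1, 0.99
q_table = np.zeros((25, 4))     # 25 states (5x5 cells), 4 actions

def learn(state, action, reward, next_state):
    # "reward + gamma * max(Q(next_state))" is the target we nudge towards;
    # alpha controls how big the nudge is.
    best_next = np.max(q_table[next_state])
    q_table[state, action] += alpha * (reward + gamma * best_next - q_table[state, action])

# Example: moving right (action 3) from state 0 cost one step (-1) and led to state 1.
learn(state=0, action=3, reward=-1.0, next_state=1)
```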
The program is organized into a few pieces (see the interface sketch after this list):

- Environment (the grid world):
  - Creates the 5x5 grid with walls
  - `reset()`: Put the agent back at the start
  - `step(action)`: Move the agent and return the reward
  - `render()`: Display the grid
- Agent (the learner):
  - Holds the Q-table
  - `choose_action()`: Pick an action (explore or exploit)
  - `learn()`: Update Q-values after each action
- Training: Runs 500 episodes (attempts) where the agent tries to reach the goal, learning a little bit each time.
- Testing: After training, test the learned policy without any random exploration.
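The class and method names below are a guess at how that split might look; only the interfaces are sketched, with the real behavior left out:

```python
import numpy as np

class GridWorld:
    """Environment: the 5x5 grid, the walls, and the reward rules."""
    def reset(self):
        """Put the agent back at the start and return the starting state."""
    def step(self, action):
        """Move the agent; return the new state, the reward, and whether G was reached."""
    def render(self):
        """Print the grid (start, goal, walls, and the agent's position)."""

class QLearningAgent:
    """Agent: owns the Q-table and the learning rule."""
    def __init__(self, n_states=25, n_actions=4):
        self.q_table = np.zeros((n_states, n_actions))
    def choose_action(self, state):
        """Explore with probability epsilon, otherwise exploit the Q-table."""
    def learn(self, state, action, reward, next_state):
        """Apply the Q-learning update from the formula above."""
```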
Requirements:

- Python 3.x
- NumPy
```bash
# Install numpy if you don't have it
pip install numpy

# Run the program
python q_learning_gridworld.py
```

Running it shows:

- Training progress - Shows how the agent improves over 500 episodes
- Learned policy - A grid with arrows showing the best direction to move from each cell
- Test run - Watch the trained agent navigate from start to goal
Example output:
```
Episode 100/500 | Avg Reward: -12.45 | Epsilon: 0.606 | Steps: 23
Episode 200/500 | Avg Reward: 1.23 | Epsilon: 0.367 | Steps: 9
...

LEARNED POLICY (arrows show best action per cell)
=========================
S → → ↓ ↓
↓ █ ↓ █ ↓
↓ → → → ↓
↓ █ ↑ █ ↓
→ → ↑ → G
=========================
```
The arrows show the shortest path the agent learned!
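That arrow picture is just the argmax of each cell's row in the Q-table. Here is a sketch of how it can be printed, assuming the 5x5x4 Q-array and the action order from the training snippet above:

```python
import numpy as np

arrows = ["↑", "↓", "←", "→"]                # same action order as before

def print_policy(q, grid, start=(0, 0), goal=(4, 4)):
    """Print the best learned action for every free cell."""
    for r in range(grid.shape[0]):
        row = []
        for c in range(grid.shape[1]):
            if (r, c) == start:
                row.append("S")
            elif (r, c) == goal:
                row.append("G")
            elif grid[r, c] == 1:
                row.append("█")              # wall
            else:
                row.append(arrows[int(np.argmax(q[r, c]))])
        print(" ".join(row))

# With an untrained (all-zero) table every free cell just shows "↑".
print_policy(np.zeros((5, 5, 4)), np.zeros((5, 5), dtype=int))
```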
| Parameter | Value | What it does |
|---|---|---|
| `learning_rate` | 0.1 | How fast the agent learns (higher = faster but less stable) |
| `discount_factor` | 0.99 | How much future rewards matter (1 = future is as important as now) |
| `epsilon` | 1.0 → 0.01 | Exploration rate (starts high, decreases over time) |
| `episodes` | 500 | Number of training attempts |
Things to try (see the sketch after this list for what the tweaks look like as code):

- Make learning faster: Increase `learning_rate` to 0.5
- More exploration: Slow down `epsilon_decay` (e.g., 0.999 instead of 0.995)
- Different rewards: Change the step penalty from -1 to -0.1 (the agent will wander more)
- Bigger grid: Modify `grid_size` and add more walls
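If the script exposes these as constants, the tweaks amount to something like the values below (the exact names and where they live depend on the script; `step_reward` and the size 7 are made-up examples):

```python
learning_rate   = 0.5      # was 0.1: learns faster, but less stable
discount_factor = 0.99     # unchanged: future rewards still matter a lot
epsilon_decay   = 0.999    # was 0.995: keeps exploring for longer
step_reward     = -0.1     # was -1: gentler step penalty, so more wandering
grid_size       = 7        # e.g. 7 instead of 5 (remember to add more walls)
```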
In short:

- The agent starts knowing nothing (Q-table is all zeros)
- It tries random actions and learns from rewards
- Gradually, it figures out the best path to the goal
- After training, it can consistently navigate from start to goal