{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ๐ŸŽฏ RL Bootcamp 2026 โ€” S1 Coding Lab\n", "## Build Your First Agent: GridWorld + Random Policy\n", "\n", "**Section 1** ยท *Welcome to Reinforcement Learning* \n", "**Duration:** 60 minutes ยท **Pre-requisites:** Python basics, NumPy\n", "\n", "---\n", "\n", "## What you will do in this lab\n", "\n", "1. **Build** a 4ร—4 GridWorld environment from scratch (no library needed).\n", "2. **Run** a random-policy agent through it and measure performance.\n", "3. **Observe** what \"no learning\" looks like โ€” episode lengths, reward, goal-reach rate.\n", "4. **Visualise** the agent's behaviour as a printed grid.\n", "\n", "By the end you will have a working RL environment and the baseline metrics every learning algorithm in this course will be compared against.\n", "\n", "## How this connects to your assignment\n", "\n", "The Week 1 capstone asks you to extend exactly this GridWorld โ€” adding walls, lava, and statistics over 100 episodes. **The classes and functions you write here are the foundation your assignment builds on.** Save this notebook; you will copy code from it.\n", "\n", "## How to run this notebook in Google Colab\n", "\n", "1. File โ†’ Open notebook โ†’ Upload โ†’ select this `.ipynb`.\n", "2. Runtime โ†’ Run all (or `Shift+Enter` on each cell).\n", "3. No GPU needed. No installs needed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Part 0 ยท Setup\n", "\n", "We need only NumPy (for arrays) and Python's `random` module (for the random policy). Both come pre-installed in Colab โ€” no `pip install` needed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import random\n", "from collections import Counter\n", "\n", "# Fix the random seed so your results match the lecture exactly.\n", "# Change the seed (or remove it) and re-run to see how randomness affects outcomes.\n", "np.random.seed(42)\n", "random.seed(42)\n", "\n", "print(\"โœ… Setup complete. NumPy version:\", np.__version__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Part 1 ยท Recap โ€” The 8 Words That Unlock Everything\n", "\n", "Before code, the vocabulary from Lecture 2:\n", "\n", "| Term | Symbol | Meaning in our GridWorld |\n", "|---|---|---|\n", "| **Agent** | ฯ€ | The decision-maker we'll build |\n", "| **Environment** | ฮต | The 4ร—4 grid |\n", "| **State** | Sโ‚œ | The agent's `(row, col)` position |\n", "| **Action** | Aโ‚œ | One of {up, down, left, right} |\n", "| **Reward** | Rโ‚œ | +10 goal, โˆ’1 step, โˆ’5 wall bump |\n", "| **Policy** | ฯ€(s) | A function: state โ†’ action |\n", "| **Value** | V(s) | How good is state s? (we'll learn this in S2) |\n", "| **Episode** | G | One run, from start to goal (or step limit) |\n", "\n", "Every line of code below maps directly to one of these terms. Watch for them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Part 2 ยท Build the GridWorld Environment\n", "\n", "### Specification\n", "\n", "* **Grid size:** 4 ร— 4 = **16 states**.\n", "* **Start:** top-left `(0, 0)`.\n", "* **Goal:** bottom-right `(3, 3)` โ€” reward **+10**.\n", "* **Step cost:** **โˆ’1** for every move (encourages efficiency).\n", "* **Wall bump:** **โˆ’5** for trying to leave the grid (action wastes a turn).\n", "* **Episode ends** when the agent reaches the goal **or** uses 50 steps.\n", "\n", "### Why a class?\n", "\n", "Wrapping the environment in a class gives us a clean `reset() / step()` interface โ€” the same interface OpenAI Gym uses. Future code will plug straight into this contract." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class GridWorld:\n", " \"\"\"A minimal 4x4 GridWorld matching the Gymnasium interface.\n", " \n", " Methods:\n", " reset() -> returns starting state\n", " step(action) -> returns (next_state, reward, done, info)\n", " render() -> prints the grid\n", " \n", " State representation: (row, col) tuple. Rows count down, cols count right.\n", " Action representation: integer 0..3 (UP / DOWN / LEFT / RIGHT).\n", " \"\"\"\n", " \n", " # Action constants โ€” readable code beats magic numbers\n", " UP, DOWN, LEFT, RIGHT = 0, 1, 2, 3\n", " ACTION_NAMES = ['โ†‘ UP', 'โ†“ DOWN', 'โ† LEFT', 'โ†’ RIGHT']\n", " \n", " def __init__(self, size=4, max_steps=50):\n", " self.size = size\n", " self.max_steps = max_steps\n", " self.start = (0, 0)\n", " self.goal = (size - 1, size - 1)\n", " self.reset()\n", " \n", " def reset(self):\n", " \"\"\"Reset the agent to the start; return the starting state.\"\"\"\n", " self.agent_pos = self.start\n", " self.step_count = 0\n", " self.done = False\n", " return self.agent_pos\n", " \n", " def step(self, action):\n", " \"\"\"Apply one action; return (next_state, reward, done, info).\"\"\"\n", " if self.done:\n", " raise RuntimeError(\"Episode has ended. Call reset().\")\n", " \n", " r, c = self.agent_pos\n", " \n", " # Map action to a position delta\n", " if action == self.UP: new_r, new_c = r - 1, c\n", " elif action == self.DOWN: new_r, new_c = r + 1, c\n", " elif action == self.LEFT: new_r, new_c = r, c - 1\n", " elif action == self.RIGHT: new_r, new_c = r, c + 1\n", " else: raise ValueError(f\"Unknown action: {action}\")\n", " \n", " # Check wall: if we try to leave the grid, stay put and penalise\n", " if not (0 <= new_r < self.size and 0 <= new_c < self.size):\n", " reward = -5\n", " # Agent stays at (r, c)\n", " else:\n", " self.agent_pos = (new_r, new_c)\n", " reward = -1 # step cost\n", " \n", " self.step_count += 1\n", " \n", " # Goal check\n", " if self.agent_pos == self.goal:\n", " reward = 10\n", " self.done = True\n", " elif self.step_count >= self.max_steps:\n", " self.done = True # timeout\n", " \n", " info = {\"step_count\": self.step_count}\n", " return self.agent_pos, reward, self.done, info\n", " \n", " def render(self):\n", " \"\"\"Print the grid with A = agent, G = goal, . = empty.\"\"\"\n", " for r in range(self.size):\n", " row = \"\"\n", " for c in range(self.size):\n", " if (r, c) == self.agent_pos and (r, c) == self.goal:\n", " row += \" โ˜… \" # agent at goal\n", " elif (r, c) == self.agent_pos:\n", " row += \" A \"\n", " elif (r, c) == self.goal:\n", " row += \" G \"\n", " else:\n", " row += \" . \"\n", " print(row)\n", " print(f\"Step: {self.step_count}/{self.max_steps} Done: {self.done}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### โœ‹ Quick check: did it build?\n", "\n", "Let's instantiate the environment and look at the starting state." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "env = GridWorld(size=4, max_steps=50)\n", "state = env.reset()\n", "print(f\"Starting state: {state}\")\n", "print(f\"Goal location: {env.goal}\")\n", "print()\n", "env.render()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should see a 4ร—4 grid with **A** in the top-left and **G** in the bottom-right.\n", "\n", "Step counter shows **0** โ€” no moves yet." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Part 3 ยท Take One Action\n", "\n", "Let's manually step through three actions and watch the state, reward, and done flag change. This is what every RL algorithm does โ€” just much faster, and choosing actions intelligently." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "state = env.reset()\n", "print(\"Initial:\")\n", "env.render()\n", "print()\n", "\n", "# Action 1: move RIGHT\n", "state, reward, done, info = env.step(env.RIGHT)\n", "print(f\"After RIGHT โ†’ state={state} reward={reward} done={done}\")\n", "env.render()\n", "print()\n", "\n", "# Action 2: move DOWN\n", "state, reward, done, info = env.step(env.DOWN)\n", "print(f\"After DOWN โ†’ state={state} reward={reward} done={done}\")\n", "env.render()\n", "print()\n", "\n", "# Action 3: try to go UP off the grid (wall bump)\n", "state, reward, done, info = env.step(env.UP)\n", "print(f\"After UP โ†’ state={state} reward={reward} done={done}\")\n", "print(\"(notice the -5 reward and unchanged position โ€” that's the wall penalty)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ๐Ÿง  What you just saw\n", "\n", "* Each `step()` returns **four things**: next state, reward, done flag, info dict.\n", "* Walking into the wall **costs โˆ’5 reward** AND **wastes a turn** โ€” the step counter still advanced.\n", "* The agent's policy choice (RIGHT, DOWN, UP) was hard-coded by us. In Part 4 we'll let a random policy choose." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Part 4 ยท The Random-Policy Agent\n", "\n", "A **policy** is a function: state โ†’ action. The simplest possible policy is **random** โ€” pick any action uniformly. This is our baseline, the dumbest possible agent.\n", "\n", "Why bother? Because if a learning algorithm doesn't beat random, it isn't learning." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def random_policy(state, num_actions=4):\n", " \"\"\"Return a uniformly random action regardless of state.\"\"\"\n", " return random.randint(0, num_actions - 1)\n", "\n", "# Quick demo: 5 random action choices\n", "print(\"5 random actions from the policy:\")\n", "for i in range(5):\n", " a = random_policy(state)\n", " print(f\" Action {i+1}: {a} ({env.ACTION_NAMES[a]})\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice: the policy **ignores the state**. A learning agent would use the state to pick smarter actions. By the end of Section 5, yours will." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Part 5 ยท Run One Full Episode\n", "\n", "Now we glue it all together: agent โ†” environment loop until `done=True`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def run_episode(env, policy, render=False):\n", " \"\"\"Run one episode. Return (total_reward, length, reached_goal).\"\"\"\n", " state = env.reset()\n", " total_reward = 0\n", " length = 0\n", " done = False\n", " \n", " while not done:\n", " action = policy(state)\n", " state, reward, done, info = env.step(action)\n", " total_reward += reward\n", " length += 1\n", " if render:\n", " print(f\"Step {length}: action={env.ACTION_NAMES[action]:<8} \"\n", " f\"state={state} reward={reward:+d} total={total_reward:+d}\")\n", " \n", " reached_goal = (state == env.goal)\n", " return total_reward, length, reached_goal\n", "\n", "# Run one episode and watch every step\n", "total_reward, length, reached = run_episode(env, random_policy, render=True)\n", "print()\n", "print(f\"Episode summary: reward={total_reward} length={length} reached_goal={reached}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Expect most random episodes to **time out at 50 steps** without reaching the goal โ€” and to accumulate a deeply negative reward. That is the entire pedagogical point.\n", "\n", "**Try this:** re-run the cell above several times. Each run uses different random actions. See how chaotic the results are." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Part 6 ยท Run 100 Episodes โ€” Measure What \"No Learning\" Costs You\n", "\n", "One episode is noise. **100 episodes** is data. We'll compute:\n", "\n", "* Average **episode length**\n", "* Average **total reward**\n", "* **Goal-reach rate** (fraction of episodes that found G)\n", "\n", "These three numbers are the baseline your assignment will compare against." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def evaluate(env, policy, n_episodes=100):\n", " \"\"\"Run n episodes and return aggregate statistics.\"\"\"\n", " rewards, lengths, reached = [], [], []\n", " for _ in range(n_episodes):\n", " r, l, g = run_episode(env, policy, render=False)\n", " rewards.append(r)\n", " lengths.append(l)\n", " reached.append(g)\n", " return {\n", " \"avg_reward\": np.mean(rewards),\n", " \"avg_length\": np.mean(lengths),\n", " \"goal_reach_pct\": 100 * np.mean(reached),\n", " \"rewards\": rewards,\n", " \"lengths\": lengths,\n", " }\n", "\n", "# Reset seeds for reproducibility\n", "np.random.seed(42)\n", "random.seed(42)\n", "\n", "stats = evaluate(env, random_policy, n_episodes=100)\n", "\n", "print(\"=\" * 50)\n", "print(\"RANDOM POLICY โ€” 100 EPISODES\")\n", "print(\"=\" * 50)\n", "print(f\"Average reward: {stats['avg_reward']:>8.2f}\")\n", "print(f\"Average length: {stats['avg_length']:>8.2f} (of 50 max)\")\n", "print(f\"Goal-reach rate: {stats['goal_reach_pct']:>7.1f}%\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ๐Ÿ“Š Expected results\n", "\n", "| Metric | Typical value |\n", "|---|---|\n", "| Average reward | **roughly โˆ’60 to โˆ’90** |\n", "| Average length | **roughly 45โ€“50** steps (timeouts dominate) |\n", "| Goal-reach rate | **5โ€“25 %** |\n", "\n", "Your exact numbers will vary slightly. The key observation: a random agent is **bad on every metric**. Even when it does find G, it's by accident." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Part 7 ยท Visualise the Random Agent's Path\n", "\n", "Let's overlay one episode's trajectory onto the grid to see *how* the agent failed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def run_and_trace(env, policy):\n", " \"\"\"Run one episode and return the list of visited positions.\"\"\"\n", " state = env.reset()\n", " visited = [state]\n", " done = False\n", " while not done:\n", " action = policy(state)\n", " state, _, done, _ = env.step(action)\n", " visited.append(state)\n", " return visited\n", "\n", "def render_path(env, path):\n", " \"\"\"Print the grid with #N showing the step number at each visited cell.\"\"\"\n", " visit_counts = Counter(path)\n", " for r in range(env.size):\n", " line = \"\"\n", " for c in range(env.size):\n", " cell = (r, c)\n", " if cell == env.goal:\n", " line += \" G \"\n", " elif cell in visit_counts:\n", " line += f\" x{visit_counts[cell]:<2}\"\n", " else:\n", " line += \" . \"\n", " print(line)\n", " print(f\"Path length: {len(path) - 1} steps. Cells visited: {len(visit_counts)}.\")\n", "\n", "np.random.seed(7); random.seed(7) # different seed for variety\n", "path = run_and_trace(env, random_policy)\n", "render_path(env, path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each cell shows how many times the random agent visited it.\n", "\n", "Look for: **revisits, wasted moves, cells near the goal that never got explored**. This visual is what makes \"random is bad\" feel concrete." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Part 8 ยท Stretch Exercises (Optional, but Recommended)\n", "\n", "These are warm-ups for the assignment. Try at least one before moving on.\n", "\n", "### ๐Ÿงช Exercise 8.1 โ€” A Smarter-Than-Random Baseline\n", "\n", "Write a policy that **always picks RIGHT** (or alternates RIGHT and DOWN). Run 100 episodes. Compare to random.\n", "\n", "*Hint: a policy is just a function `state โ†’ action`. It doesn't have to be smart, it just has to be a function.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def always_right(state):\n", " \"\"\"Always go right. The dumbest deterministic policy.\"\"\"\n", " return GridWorld.RIGHT\n", "\n", "# Reset seeds for fair comparison\n", "np.random.seed(42); random.seed(42)\n", "stats_right = evaluate(env, always_right, n_episodes=100)\n", "print(f\"ALWAYS-RIGHT policy: reward={stats_right['avg_reward']:.2f} \"\n", " f\"length={stats_right['avg_length']:.2f} \"\n", " f\"goal-rate={stats_right['goal_reach_pct']:.1f}%\")\n", "\n", "# Now write your own โ€” alternate RIGHT/DOWN\n", "step_counter = {\"n\": 0}\n", "def alternate(state):\n", " step_counter[\"n\"] += 1\n", " return GridWorld.RIGHT if step_counter[\"n\"] % 2 == 0 else GridWorld.DOWN\n", "\n", "step_counter[\"n\"] = 0\n", "np.random.seed(42); random.seed(42)\n", "stats_alt = evaluate(env, alternate, n_episodes=100)\n", "print(f\"ALTERNATE policy: reward={stats_alt['avg_reward']:.2f} \"\n", " f\"length={stats_alt['avg_length']:.2f} \"\n", " f\"goal-rate={stats_alt['goal_reach_pct']:.1f}%\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ๐Ÿงช Exercise 8.2 โ€” Add a Wall\n", "\n", "Modify `GridWorld.__init__` to accept a list of **wall cells** that block movement. A wall at `(1, 1)` should: (a) not be enterable, (b) cost the โˆ’5 wall-bump reward if the agent tries.\n", "\n", "This is exactly what your **assignment Part 2** asks for. Get it working here first.\n", "\n", "### ๐Ÿงช Exercise 8.3 โ€” Add Lava\n", "\n", "Add a `lava` cell list. Entering lava gives **โˆ’10** reward AND ends the episode. This adds a real strategic dimension โ€” the agent must learn to avoid lava on its way to G.\n", "\n", "*Hint: handle lava in `step()`, right where you handle the goal.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## โœ… Recap โ€” What You Built\n", "\n", "| You wrote | What it represents (in RL terms) |\n", "|---|---|\n", "| `GridWorld` class | The **environment** ฮต |\n", "| `agent_pos` tuple | The **state** Sโ‚œ |\n", "| `step(action)` | The **transition function** P (deterministic here) |\n", "| Reward in `step()` | The **reward function** R |\n", "| `random_policy()` | A **policy** ฯ€ (the trivial one) |\n", "| `run_episode()` | One **episode** G |\n", "| `evaluate()` | Statistics across **many episodes** |\n", "\n", "**Every single algorithm in this course replaces just one thing: `random_policy`.**\n", "\n", "Section 2 will swap it for a policy computed by value iteration. Section 3 with Monte Carlo. Section 4 with TD/SARSA. The environment stays. The policy gets smarter.\n", "\n", "---\n", "\n", "## ๐Ÿ“Œ What this gives you for Week 1's assignment\n", "\n", "* **Part 1** (Environment Setup): you already have it โ€” done.\n", "* **Part 2** (5ร—5 GridWorld with walls + lava + goal): start by copying this notebook, change `size=4` to `size=5`, and complete Exercise 8.2 and 8.3.\n", "* **Part 3** (random-walk agent, 100 episodes, learning curve): `evaluate()` already returns `rewards` per episode. Plot it with `matplotlib`.\n", "* **Part 4** (analysis & reflection): your numbers from Part 6 are exactly the stats the assignment asks for. Use the vocabulary table in Part 1 in your reflection.\n", "\n", "---\n", "\n", "## โญ๏ธ Next coding lab\n", "\n", "**S2 โ€” Value Iteration & Policy Iteration.** Same GridWorld, but the policy stops guessing. We compute the *optimal* policy from the Bellman equation.\n", "\n", "Save this notebook before you close it โ€” the S2 lab will import from it." ] } ], "metadata": { "colab": { "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }