{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🎯 RL Bootcamp 2026 — S1 Coding Lab\n",
    "## Build Your First Agent: GridWorld + Random Policy\n",
    "\n",
    "**Section 1** · *Welcome to Reinforcement Learning*  \n",
    "**Duration:** 60 minutes · **Pre-requisites:** Python basics, NumPy\n",
    "\n",
    "---\n",
    "\n",
    "## What you will do in this lab\n",
    "\n",
    "1. **Build** a 4×4 GridWorld environment from scratch (no library needed).\n",
    "2. **Run** a random-policy agent through it and measure performance.\n",
    "3. **Observe** what \"no learning\" looks like — episode lengths, reward, goal-reach rate.\n",
    "4. **Visualise** the agent's behaviour as a printed grid.\n",
    "\n",
    "By the end you will have a working RL environment and the baseline metrics every learning algorithm in this course will be compared against.\n",
    "\n",
    "## How this connects to your assignment\n",
    "\n",
    "The Week 1 capstone asks you to extend exactly this GridWorld — adding walls, lava, and statistics over 100 episodes. **The classes and functions you write here are the foundation your assignment builds on.** Save this notebook; you will copy code from it.\n",
    "\n",
    "## How to run this notebook in Google Colab\n",
    "\n",
    "1. File → Open notebook → Upload → select this `.ipynb`.\n",
    "2. Runtime → Run all (or `Shift+Enter` on each cell).\n",
    "3. No GPU needed. No installs needed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Part 0 · Setup\n",
    "\n",
    "We need only NumPy (for arrays) and Python's `random` module (for the random policy). Both come pre-installed in Colab — no `pip install` needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import random\n",
    "from collections import Counter\n",
    "\n",
    "# Fix the random seed so your results match the lecture exactly.\n",
    "# Change the seed (or remove it) and re-run to see how randomness affects outcomes.\n",
    "np.random.seed(42)\n",
    "random.seed(42)\n",
    "\n",
    "print(\"✅ Setup complete. NumPy version:\", np.__version__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Part 1 · Recap — The 8 Words That Unlock Everything\n",
    "\n",
    "Before code, the vocabulary from Lecture 2:\n",
    "\n",
    "| Term | Symbol | Meaning in our GridWorld |\n",
    "|---|---|---|\n",
    "| **Agent** | π | The decision-maker we'll build |\n",
    "| **Environment** | ε | The 4×4 grid |\n",
    "| **State** | Sₜ | The agent's `(row, col)` position |\n",
    "| **Action** | Aₜ | One of {up, down, left, right} |\n",
    "| **Reward** | Rₜ | +10 goal, −1 step, −5 wall bump |\n",
    "| **Policy** | π(s) | A function: state → action |\n",
    "| **Value** | V(s) | How good is state s? (we'll learn this in S2) |\n",
    "| **Episode** | G | One run, from start to goal (or step limit) |\n",
    "\n",
    "Every line of code below maps directly to one of these terms. Watch for them."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Part 2 · Build the GridWorld Environment\n",
    "\n",
    "### Specification\n",
    "\n",
    "* **Grid size:** 4 × 4 = **16 states**.\n",
    "* **Start:** top-left `(0, 0)`.\n",
    "* **Goal:** bottom-right `(3, 3)` — reward **+10**.\n",
    "* **Step cost:** **−1** for every move (encourages efficiency).\n",
    "* **Wall bump:** **−5** for trying to leave the grid (action wastes a turn).\n",
    "* **Episode ends** when the agent reaches the goal **or** uses 50 steps.\n",
    "\n",
    "### Why a class?\n",
    "\n",
    "Wrapping the environment in a class gives us a clean `reset() / step()` interface — the same interface OpenAI Gym uses. Future code will plug straight into this contract."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class GridWorld:\n",
    "    \"\"\"A minimal 4x4 GridWorld matching the Gymnasium interface.\n",
    "    \n",
    "    Methods:\n",
    "        reset()                 -> returns starting state\n",
    "        step(action)            -> returns (next_state, reward, done, info)\n",
    "        render()                -> prints the grid\n",
    "        \n",
    "    State representation:  (row, col) tuple. Rows count down, cols count right.\n",
    "    Action representation: integer 0..3  (UP / DOWN / LEFT / RIGHT).\n",
    "    \"\"\"\n",
    "    \n",
    "    # Action constants — readable code beats magic numbers\n",
    "    UP, DOWN, LEFT, RIGHT = 0, 1, 2, 3\n",
    "    ACTION_NAMES = ['↑ UP', '↓ DOWN', '← LEFT', '→ RIGHT']\n",
    "    \n",
    "    def __init__(self, size=4, max_steps=50):\n",
    "        self.size = size\n",
    "        self.max_steps = max_steps\n",
    "        self.start = (0, 0)\n",
    "        self.goal = (size - 1, size - 1)\n",
    "        self.reset()\n",
    "    \n",
    "    def reset(self):\n",
    "        \"\"\"Reset the agent to the start; return the starting state.\"\"\"\n",
    "        self.agent_pos = self.start\n",
    "        self.step_count = 0\n",
    "        self.done = False\n",
    "        return self.agent_pos\n",
    "    \n",
    "    def step(self, action):\n",
    "        \"\"\"Apply one action; return (next_state, reward, done, info).\"\"\"\n",
    "        if self.done:\n",
    "            raise RuntimeError(\"Episode has ended. Call reset().\")\n",
    "        \n",
    "        r, c = self.agent_pos\n",
    "        \n",
    "        # Map action to a position delta\n",
    "        if action == self.UP:    new_r, new_c = r - 1, c\n",
    "        elif action == self.DOWN:  new_r, new_c = r + 1, c\n",
    "        elif action == self.LEFT:  new_r, new_c = r, c - 1\n",
    "        elif action == self.RIGHT: new_r, new_c = r, c + 1\n",
    "        else: raise ValueError(f\"Unknown action: {action}\")\n",
    "        \n",
    "        # Check wall: if we try to leave the grid, stay put and penalise\n",
    "        if not (0 <= new_r < self.size and 0 <= new_c < self.size):\n",
    "            reward = -5\n",
    "            # Agent stays at (r, c)\n",
    "        else:\n",
    "            self.agent_pos = (new_r, new_c)\n",
    "            reward = -1  # step cost\n",
    "        \n",
    "        self.step_count += 1\n",
    "        \n",
    "        # Goal check\n",
    "        if self.agent_pos == self.goal:\n",
    "            reward = 10\n",
    "            self.done = True\n",
    "        elif self.step_count >= self.max_steps:\n",
    "            self.done = True  # timeout\n",
    "        \n",
    "        info = {\"step_count\": self.step_count}\n",
    "        return self.agent_pos, reward, self.done, info\n",
    "    \n",
    "    def render(self):\n",
    "        \"\"\"Print the grid with A = agent, G = goal, . = empty.\"\"\"\n",
    "        for r in range(self.size):\n",
    "            row = \"\"\n",
    "            for c in range(self.size):\n",
    "                if (r, c) == self.agent_pos and (r, c) == self.goal:\n",
    "                    row += \" ★ \"  # agent at goal\n",
    "                elif (r, c) == self.agent_pos:\n",
    "                    row += \" A \"\n",
    "                elif (r, c) == self.goal:\n",
    "                    row += \" G \"\n",
    "                else:\n",
    "                    row += \" . \"\n",
    "            print(row)\n",
    "        print(f\"Step: {self.step_count}/{self.max_steps}   Done: {self.done}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ✋ Quick check: did it build?\n",
    "\n",
    "Let's instantiate the environment and look at the starting state."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = GridWorld(size=4, max_steps=50)\n",
    "state = env.reset()\n",
    "print(f\"Starting state: {state}\")\n",
    "print(f\"Goal location:  {env.goal}\")\n",
    "print()\n",
    "env.render()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should see a 4×4 grid with **A** in the top-left and **G** in the bottom-right.\n",
    "\n",
    "Step counter shows **0** — no moves yet."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Part 3 · Take One Action\n",
    "\n",
    "Let's manually step through three actions and watch the state, reward, and done flag change. This is what every RL algorithm does — just much faster, and choosing actions intelligently."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "state = env.reset()\n",
    "print(\"Initial:\")\n",
    "env.render()\n",
    "print()\n",
    "\n",
    "# Action 1: move RIGHT\n",
    "state, reward, done, info = env.step(env.RIGHT)\n",
    "print(f\"After RIGHT  →  state={state}  reward={reward}  done={done}\")\n",
    "env.render()\n",
    "print()\n",
    "\n",
    "# Action 2: move DOWN\n",
    "state, reward, done, info = env.step(env.DOWN)\n",
    "print(f\"After DOWN   →  state={state}  reward={reward}  done={done}\")\n",
    "env.render()\n",
    "print()\n",
    "\n",
    "# Action 3: try to go UP off the grid (wall bump)\n",
    "state, reward, done, info = env.step(env.UP)\n",
    "print(f\"After UP     →  state={state}  reward={reward}  done={done}\")\n",
    "print(\"(notice the -5 reward and unchanged position — that's the wall penalty)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 🧠 What you just saw\n",
    "\n",
    "* Each `step()` returns **four things**: next state, reward, done flag, info dict.\n",
    "* Walking into the wall **costs −5 reward** AND **wastes a turn** — the step counter still advanced.\n",
    "* The agent's policy choice (RIGHT, DOWN, UP) was hard-coded by us. In Part 4 we'll let a random policy choose."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Part 4 · The Random-Policy Agent\n",
    "\n",
    "A **policy** is a function: state → action. The simplest possible policy is **random** — pick any action uniformly. This is our baseline, the dumbest possible agent.\n",
    "\n",
    "Why bother? Because if a learning algorithm doesn't beat random, it isn't learning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def random_policy(state, num_actions=4):\n",
    "    \"\"\"Return a uniformly random action regardless of state.\"\"\"\n",
    "    return random.randint(0, num_actions - 1)\n",
    "\n",
    "# Quick demo: 5 random action choices\n",
    "print(\"5 random actions from the policy:\")\n",
    "for i in range(5):\n",
    "    a = random_policy(state)\n",
    "    print(f\"  Action {i+1}: {a}  ({env.ACTION_NAMES[a]})\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice: the policy **ignores the state**. A learning agent would use the state to pick smarter actions. By the end of Section 5, yours will."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Part 5 · Run One Full Episode\n",
    "\n",
    "Now we glue it all together: agent ↔ environment loop until `done=True`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def run_episode(env, policy, render=False):\n",
    "    \"\"\"Run one episode. Return (total_reward, length, reached_goal).\"\"\"\n",
    "    state = env.reset()\n",
    "    total_reward = 0\n",
    "    length = 0\n",
    "    done = False\n",
    "    \n",
    "    while not done:\n",
    "        action = policy(state)\n",
    "        state, reward, done, info = env.step(action)\n",
    "        total_reward += reward\n",
    "        length += 1\n",
    "        if render:\n",
    "            print(f\"Step {length}: action={env.ACTION_NAMES[action]:<8} \"\n",
    "                  f\"state={state}  reward={reward:+d}  total={total_reward:+d}\")\n",
    "    \n",
    "    reached_goal = (state == env.goal)\n",
    "    return total_reward, length, reached_goal\n",
    "\n",
    "# Run one episode and watch every step\n",
    "total_reward, length, reached = run_episode(env, random_policy, render=True)\n",
    "print()\n",
    "print(f\"Episode summary:  reward={total_reward}  length={length}  reached_goal={reached}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Expect most random episodes to **time out at 50 steps** without reaching the goal — and to accumulate a deeply negative reward. That is the entire pedagogical point.\n",
    "\n",
    "**Try this:** re-run the cell above several times. Each run uses different random actions. See how chaotic the results are."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Part 6 · Run 100 Episodes — Measure What \"No Learning\" Costs You\n",
    "\n",
    "One episode is noise. **100 episodes** is data. We'll compute:\n",
    "\n",
    "* Average **episode length**\n",
    "* Average **total reward**\n",
    "* **Goal-reach rate** (fraction of episodes that found G)\n",
    "\n",
    "These three numbers are the baseline your assignment will compare against."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def evaluate(env, policy, n_episodes=100):\n",
    "    \"\"\"Run n episodes and return aggregate statistics.\"\"\"\n",
    "    rewards, lengths, reached = [], [], []\n",
    "    for _ in range(n_episodes):\n",
    "        r, l, g = run_episode(env, policy, render=False)\n",
    "        rewards.append(r)\n",
    "        lengths.append(l)\n",
    "        reached.append(g)\n",
    "    return {\n",
    "        \"avg_reward\":     np.mean(rewards),\n",
    "        \"avg_length\":     np.mean(lengths),\n",
    "        \"goal_reach_pct\": 100 * np.mean(reached),\n",
    "        \"rewards\":        rewards,\n",
    "        \"lengths\":        lengths,\n",
    "    }\n",
    "\n",
    "# Reset seeds for reproducibility\n",
    "np.random.seed(42)\n",
    "random.seed(42)\n",
    "\n",
    "stats = evaluate(env, random_policy, n_episodes=100)\n",
    "\n",
    "print(\"=\" * 50)\n",
    "print(\"RANDOM POLICY — 100 EPISODES\")\n",
    "print(\"=\" * 50)\n",
    "print(f\"Average reward:       {stats['avg_reward']:>8.2f}\")\n",
    "print(f\"Average length:       {stats['avg_length']:>8.2f}  (of 50 max)\")\n",
    "print(f\"Goal-reach rate:      {stats['goal_reach_pct']:>7.1f}%\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 📊 Expected results\n",
    "\n",
    "| Metric | Typical value |\n",
    "|---|---|\n",
    "| Average reward | **roughly −60 to −90** |\n",
    "| Average length | **roughly 45–50** steps (timeouts dominate) |\n",
    "| Goal-reach rate | **5–25 %** |\n",
    "\n",
    "Your exact numbers will vary slightly. The key observation: a random agent is **bad on every metric**. Even when it does find G, it's by accident."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Part 7 · Visualise the Random Agent's Path\n",
    "\n",
    "Let's overlay one episode's trajectory onto the grid to see *how* the agent failed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def run_and_trace(env, policy):\n",
    "    \"\"\"Run one episode and return the list of visited positions.\"\"\"\n",
    "    state = env.reset()\n",
    "    visited = [state]\n",
    "    done = False\n",
    "    while not done:\n",
    "        action = policy(state)\n",
    "        state, _, done, _ = env.step(action)\n",
    "        visited.append(state)\n",
    "    return visited\n",
    "\n",
    "def render_path(env, path):\n",
    "    \"\"\"Print the grid with #N showing the step number at each visited cell.\"\"\"\n",
    "    visit_counts = Counter(path)\n",
    "    for r in range(env.size):\n",
    "        line = \"\"\n",
    "        for c in range(env.size):\n",
    "            cell = (r, c)\n",
    "            if cell == env.goal:\n",
    "                line += \"  G  \"\n",
    "            elif cell in visit_counts:\n",
    "                line += f\" x{visit_counts[cell]:<2}\"\n",
    "            else:\n",
    "                line += \"  .  \"\n",
    "        print(line)\n",
    "    print(f\"Path length: {len(path) - 1} steps. Cells visited: {len(visit_counts)}.\")\n",
    "\n",
    "np.random.seed(7); random.seed(7)  # different seed for variety\n",
    "path = run_and_trace(env, random_policy)\n",
    "render_path(env, path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each cell shows how many times the random agent visited it.\n",
    "\n",
    "Look for: **revisits, wasted moves, cells near the goal that never got explored**. This visual is what makes \"random is bad\" feel concrete."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Part 8 · Stretch Exercises (Optional, but Recommended)\n",
    "\n",
    "These are warm-ups for the assignment. Try at least one before moving on.\n",
    "\n",
    "### 🧪 Exercise 8.1 — A Smarter-Than-Random Baseline\n",
    "\n",
    "Write a policy that **always picks RIGHT** (or alternates RIGHT and DOWN). Run 100 episodes. Compare to random.\n",
    "\n",
    "*Hint: a policy is just a function `state → action`. It doesn't have to be smart, it just has to be a function.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def always_right(state):\n",
    "    \"\"\"Always go right. The dumbest deterministic policy.\"\"\"\n",
    "    return GridWorld.RIGHT\n",
    "\n",
    "# Reset seeds for fair comparison\n",
    "np.random.seed(42); random.seed(42)\n",
    "stats_right = evaluate(env, always_right, n_episodes=100)\n",
    "print(f\"ALWAYS-RIGHT policy: reward={stats_right['avg_reward']:.2f}  \"\n",
    "      f\"length={stats_right['avg_length']:.2f}  \"\n",
    "      f\"goal-rate={stats_right['goal_reach_pct']:.1f}%\")\n",
    "\n",
    "# Now write your own — alternate RIGHT/DOWN\n",
    "step_counter = {\"n\": 0}\n",
    "def alternate(state):\n",
    "    step_counter[\"n\"] += 1\n",
    "    return GridWorld.RIGHT if step_counter[\"n\"] % 2 == 0 else GridWorld.DOWN\n",
    "\n",
    "step_counter[\"n\"] = 0\n",
    "np.random.seed(42); random.seed(42)\n",
    "stats_alt = evaluate(env, alternate, n_episodes=100)\n",
    "print(f\"ALTERNATE policy:    reward={stats_alt['avg_reward']:.2f}  \"\n",
    "      f\"length={stats_alt['avg_length']:.2f}  \"\n",
    "      f\"goal-rate={stats_alt['goal_reach_pct']:.1f}%\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 🧪 Exercise 8.2 — Add a Wall\n",
    "\n",
    "Modify `GridWorld.__init__` to accept a list of **wall cells** that block movement. A wall at `(1, 1)` should: (a) not be enterable, (b) cost the −5 wall-bump reward if the agent tries.\n",
    "\n",
    "This is exactly what your **assignment Part 2** asks for. Get it working here first.\n",
    "\n",
    "### 🧪 Exercise 8.3 — Add Lava\n",
    "\n",
    "Add a `lava` cell list. Entering lava gives **−10** reward AND ends the episode. This adds a real strategic dimension — the agent must learn to avoid lava on its way to G.\n",
    "\n",
    "*Hint: handle lava in `step()`, right where you handle the goal.*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## ✅ Recap — What You Built\n",
    "\n",
    "| You wrote | What it represents (in RL terms) |\n",
    "|---|---|\n",
    "| `GridWorld` class | The **environment** ε |\n",
    "| `agent_pos` tuple | The **state** Sₜ |\n",
    "| `step(action)` | The **transition function** P (deterministic here) |\n",
    "| Reward in `step()` | The **reward function** R |\n",
    "| `random_policy()` | A **policy** π (the trivial one) |\n",
    "| `run_episode()` | One **episode** G |\n",
    "| `evaluate()` | Statistics across **many episodes** |\n",
    "\n",
    "**Every single algorithm in this course replaces just one thing: `random_policy`.**\n",
    "\n",
    "Section 2 will swap it for a policy computed by value iteration. Section 3 with Monte Carlo. Section 4 with TD/SARSA. The environment stays. The policy gets smarter.\n",
    "\n",
    "---\n",
    "\n",
    "## 📌 What this gives you for Week 1's assignment\n",
    "\n",
    "* **Part 1** (Environment Setup): you already have it — done.\n",
    "* **Part 2** (5×5 GridWorld with walls + lava + goal): start by copying this notebook, change `size=4` to `size=5`, and complete Exercise 8.2 and 8.3.\n",
    "* **Part 3** (random-walk agent, 100 episodes, learning curve): `evaluate()` already returns `rewards` per episode. Plot it with `matplotlib`.\n",
    "* **Part 4** (analysis & reflection): your numbers from Part 6 are exactly the stats the assignment asks for. Use the vocabulary table in Part 1 in your reflection.\n",
    "\n",
    "---\n",
    "\n",
    "## ⏭️ Next coding lab\n",
    "\n",
    "**S2 — Value Iteration & Policy Iteration.** Same GridWorld, but the policy stops guessing. We compute the *optimal* policy from the Bellman equation.\n",
    "\n",
    "Save this notebook before you close it — the S2 lab will import from it."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}