{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tutorial 5 - Bayes theorem, MLE, MAP, naive Bayes and k-NN\n",
    "\n",
    "In this tutorial, we'll look at the Bayesian view of probability and use it to motivate two simple, but very useful classifiers: **naive Bayes** and **k-nearest neighbors**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from matplotlib import pyplot as plt\n",
    "from scipy import stats\n",
    "from sklearn.datasets import load_iris, make_moons\n",
    "from sklearn.metrics import accuracy_score\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.naive_bayes import CategoricalNB, GaussianNB\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "\n",
    "np.random.seed(42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As in the previous tutorials, we'll use the Iris dataset for the classification examples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iris = load_iris()\n",
    "X = iris.data\n",
    "y = iris.target\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
    "\n",
    "print(f\"Features: {iris.feature_names}\")\n",
    "print(f\"Target Classes: {iris.target_names}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bayes' theorem\n",
    "\n",
    "Most of probabilistic machine learning rests on a single identity. For two events $A$ and $B$, the conditional probability is:\n",
    "\n",
    "$$\n",
    "P(A, B) = P(A \\mid B) P(B) = P(B \\mid A) P(A)\n",
    "$$\n",
    "\n",
    "Assuming $P(B) > 0$, we can rewrite this to obtain the Bayes' theorem:\n",
    "\n",
    "$$\n",
    "P(A \\mid B) = \\frac{P(B \\mid A) \\, P(A)}{P(B)}.\n",
    "$$\n",
    "\n",
    "It is just a rearrangement of the definition of conditional probability, but the way we read it matters. In machine learning we usually want to reason about an unknown quantity $\\theta$ (a parameter, a class label, ...) given some observed data $D$:\n",
    "\n",
    "$$\n",
    "\\underbrace{P(\\theta \\mid D)}_{\\text{posterior}}\n",
    "= \\frac{\\overbrace{P(D \\mid \\theta)}^{\\text{likelihood}} \\; \\overbrace{P(\\theta)}^{\\text{prior}}}{\\underbrace{P(D)}_{\\text{evidence}}}.\n",
    "$$\n",
    "\n",
    "- The **prior** $P(\\theta)$ encodes what we believed about $\\theta$ before seeing the data.\n",
    "- The **likelihood** $P(D \\mid \\theta)$ tells us how plausible the data is for a given value of $\\theta$.\n",
    "- The **posterior** $P(\\theta \\mid D)$ is the updated belief after seeing the data.\n",
    "- The **evidence** $P(D) = \\sum_\\theta P(D \\mid \\theta) P(\\theta)$ is just a normalising constant; it does not depend on $\\theta$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example: medical testing\n",
    "\n",
    "A standard introductory example: a disease has prevalence $P(\\text{sick}) = 0.01$ in the population. A test is $99\\%$ sensitive ($P(+ \\mid \\text{sick}) = 0.99$) and $95\\%$ specific ($P(- \\mid \\text{healthy}) = 0.95$). What is the probability that you are actually sick if you tested positive?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p_sick = 0.01\n",
    "p_pos_given_sick = 0.99\n",
    "p_pos_given_healthy = 1 - 0.95  # 1 - specificity\n",
    "\n",
    "p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)\n",
    "p_sick_given_pos = p_pos_given_sick * p_sick / p_pos\n",
    "\n",
    "print(f\"P(sick | +) = {p_sick_given_pos:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So we see that even with a positive test, the probability of actually being sick is only about 17%. The low prior dominates the update from the likelihood. This is exactly the kind of intuition Bayes' theorem formalises."
   ]
  },
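  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see how strongly the prior drives this result, we can recompute the posterior for a few hypothetical prevalence values, keeping the test's sensitivity and specificity fixed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Recompute P(sick | +) for several hypothetical prevalence values, keeping the\n",
    "# same sensitivity and specificity as above.\n",
    "for p_sick_prior in [0.001, 0.01, 0.1, 0.5]:\n",
    "    p_pos_prior = p_pos_given_sick * p_sick_prior + p_pos_given_healthy * (1 - p_sick_prior)\n",
    "    posterior = p_pos_given_sick * p_sick_prior / p_pos_prior\n",
    "    print(f\"P(sick) = {p_sick_prior:>5.3f}  ->  P(sick | +) = {posterior:.3f}\")"
   ]
  },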
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Maximum likelihood vs. maximum a-posteriori estimation\n",
    "\n",
    "Suppose we have data $D = \\{x_1, \\dots, x_n\\}$ assumed to come from a parametric distribution $p(x \\mid \\theta)$ and we want to estimate $\\theta$. There are two natural point estimates:\n",
    "\n",
    "**Maximum likelihood estimation (MLE)**: pick the parameter that makes the observed data most likely:\n",
    "\n",
    "$$\n",
    "\\hat\\theta_\\text{MLE} = \\arg\\max_{\\theta} \\; p(D \\mid \\theta) = \\arg\\max_{\\theta} \\; \\prod_{i=1}^n p(x_i \\mid \\theta).\n",
    "$$\n",
    "\n",
    "**Maximum a-posteriori estimation (MAP)**: combine the likelihood with a prior $p(\\theta)$ and pick the mode of the posterior:\n",
    "\n",
    "$$\n",
    "\\hat\\theta_\\text{MAP} = \\arg\\max_{\\theta} \\; p(\\theta \\mid D) = \\arg\\max_{\\theta} \\; p(D \\mid \\theta) \\, p(\\theta).\n",
    "$$\n",
    "\n",
    "Notice that MLE is the special case of MAP with a uniform prior.\n",
    "\n",
    "In practice we maximise the *log* likelihood (or log posterior), turning products into sums and avoiding numerical underflow."
   ]
  },
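  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small numerical aside (a sketch using 2000 simulated fair-coin flips), the raw product of per-sample likelihoods underflows to zero in floating point, while the sum of log-likelihoods stays perfectly usable:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Why we work with log-likelihoods: the product of many per-sample probabilities\n",
    "# underflows to zero, while the sum of their logs stays finite.\n",
    "rng = np.random.default_rng(0)\n",
    "flips = rng.random(2000) < 0.5                  # simulated fair-coin flips (True = heads)\n",
    "theta = 0.5\n",
    "per_sample = np.where(flips, theta, 1 - theta)  # p(x_i | theta) for every flip\n",
    "\n",
    "print(f\"product of likelihoods:  {np.prod(per_sample)}\")\n",
    "print(f\"sum of log-likelihoods:  {np.log(per_sample).sum():.2f}\")"
   ]
  },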
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example: estimating the bias of a coin\n",
    "\n",
    "Let $\\theta \\in [0, 1]$ be the probability of heads of a biased coin. We flip it $n$ times and observe $k$ heads. The likelihood is binomial:\n",
    "\n",
    "$$\n",
    "p(D \\mid \\theta) = \\binom{n}{k} \\theta^k (1 - \\theta)^{n-k}.\n",
    "$$\n",
    "\n",
    "Setting the derivative of the log-likelihood to zero gives\n",
    "\n",
    "$$\n",
    "\\hat\\theta_\\text{MLE} = \\frac{k}{n}.\n",
    "$$\n",
    "\n",
    "For MAP we need a prior on $\\theta$. We have two requirements:\n",
    "\n",
    "1. The prior has to be supported on $[0, 1]$ (a probability can't be negative or larger than one). This rules out a Gaussian.\n",
    "2. The prior should be flexible enough to express different beliefs: maybe we suspect the coin is fair, maybe we suspect it's biased toward heads, maybe we have no idea at all.\n",
    "\n",
    "The Beta distribution $\\theta \\sim \\mathrm{Beta}(\\alpha, \\beta)$ with $\\alpha, \\beta > 0$ is the natural choice — it is supported on $[0, 1]$ and its two parameters have a very direct interpretation:\n",
    "\n",
    "$$\n",
    "\\mathbb{E}[\\theta] = \\frac{\\alpha}{\\alpha + \\beta}, \\qquad\n",
    "\\mathrm{Var}[\\theta] = \\frac{\\alpha\\beta}{(\\alpha+\\beta)^2 (\\alpha+\\beta+1)}.\n",
    "$$\n",
    "\n",
    "So $\\alpha / (\\alpha+\\beta)$ controls where we think $\\theta$ lies, and $\\alpha + \\beta$ controls how strongly we believe it: large $\\alpha + \\beta$ gives a sharply peaked prior, small values give a vague one. Special cases:\n",
    "\n",
    "- $\\alpha = \\beta = 1$: uniform on $[0,1]$ — \"I have no idea.\" MAP then coincides with MLE.\n",
    "- $\\alpha = \\beta \\gg 1$: peaked at $0.5$ — \"I strongly believe the coin is fair.\"\n",
    "- $\\alpha \\gg \\beta$: skewed toward $1$ — \"I expect mostly heads.\"\n",
    "\n",
    "A useful way to set the parameters is a pseudo-count interpretation: $\\mathrm{Beta}(\\alpha, \\beta)$ behaves as if we had already seen $\\alpha - 1$ heads and $\\beta - 1$ tails before any real data. So if we would only feel confident about a coin's bias after $\\sim 20$ flips, we should pick $\\alpha + \\beta \\approx 20$; if we would need $\\sim 200$, we shoukd pick $\\alpha + \\beta \\approx 200$. This gives us a rule for translating \"how much do I trust my prior?\" into $\\alpha$ and $\\beta$. This is generally a sound approach, picking the prior from something that we know to be realistic.\n",
    "\n",
    "With the Beta in hand the MAP derivation proceeds as before, and we get a happy bonus: the posterior turns out to also be Beta,\n",
    "\n",
    "$$\n",
    "p(\\theta \\mid D) \\propto \\theta^{k + \\alpha - 1} (1 - \\theta)^{n - k + \\beta - 1} = \\mathrm{Beta}(k + \\alpha, ; n - k + \\beta).\n",
    "$$\n",
    "\n",
    "A prior whose family is preserved under Bayesian updating is called conjugate to the likelihood; Beta is conjugate to the binomial. Conjugacy gives our choice of the prior mathematical cleanliness, but isn't required.\n",
    "\n",
    "The mode of $\\mathrm{Beta}(a, b)$ for $a, b > 1$ is $(a - 1) / (a + b - 2)$, so\n",
    "\n",
    "$$\n",
    "\\hat\\theta_\\text{MAP} = \\frac{k + \\alpha - 1}{n + \\alpha + \\beta - 2}.\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Suppose we observe 7 heads out of 10 flips of a coin we suspect is fair.\n",
    "n, k = 10, 7\n",
    "\n",
    "# Prior: Beta(11, 11) -- centred on 0.5, fairly informative (20 pseudo-flips).\n",
    "prior_heads = 10\n",
    "prior_tails = 10\n",
    "alpha, beta = prior_heads + 1, prior_tails + 1\n",
    "\n",
    "theta_mle = k / n\n",
    "theta_map = (k + alpha - 1) / (n + alpha + beta - 2)\n",
    "\n",
    "print(f\"MLE estimate of P(heads) = {theta_mle:.3f}\")\n",
    "print(f\"MAP estimate of P(heads) = {theta_map:.3f}  (with Beta({alpha},{beta}) prior)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualise prior, likelihood and posterior across a range of theta values.\n",
    "theta_grid = np.linspace(0, 1, 500)\n",
    "\n",
    "prior = stats.beta.pdf(theta_grid, alpha, beta)\n",
    "likelihood = stats.binom.pmf(k, n, theta_grid)\n",
    "likelihood = likelihood / np.trapezoid(likelihood, theta_grid)  # normalise for plotting\n",
    "posterior = stats.beta.pdf(theta_grid, k + alpha, n - k + beta)\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(8, 5))\n",
    "ax.plot(theta_grid, prior, label=f\"prior  Beta({alpha},{beta})\", linestyle=\"--\")\n",
    "ax.plot(theta_grid, likelihood, label=\"likelihood (normalised)\", linestyle=\":\")\n",
    "ax.plot(theta_grid, posterior, label=f\"posterior  Beta({k+alpha},{n-k+beta})\")\n",
    "ax.axvline(theta_mle, color=\"C1\", linestyle=\":\", alpha=0.6, label=f\"MLE = {theta_mle:.2f}\")\n",
    "ax.axvline(theta_map, color=\"C2\", linestyle=\"-.\", alpha=0.6, label=f\"MAP = {theta_map:.2f}\")\n",
    "ax.set_xlabel(r\"$\\theta$ = P(heads)\")\n",
    "ax.set_ylabel(\"density\")\n",
    "ax.set_title(\"Prior, likelihood and posterior for the coin example\")\n",
    "ax.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice how the MAP estimate sits between the MLE (driven only by the data) and the prior mean ($0.5$). As we collect more data the likelihood becomes much sharper than the prior, and the two estimates converge."
   ]
  },
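  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick check of that claim, we can simulate longer and longer runs of flips from a coin with a hypothetical true bias of $0.7$, keeping the same $\\mathrm{Beta}(11, 11)$ prior, and watch the two estimates converge:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Simulate increasingly long runs of flips from a coin with true P(heads) = 0.7\n",
    "# and compare MLE with MAP under the same Beta(alpha, beta) prior as above.\n",
    "rng = np.random.default_rng(0)\n",
    "true_theta = 0.7\n",
    "for n_flips in [10, 100, 1000, 10000]:\n",
    "    k_heads = rng.binomial(n_flips, true_theta)\n",
    "    mle = k_heads / n_flips\n",
    "    map_est = (k_heads + alpha - 1) / (n_flips + alpha + beta - 2)\n",
    "    print(f\"n={n_flips:>5d}  MLE={mle:.3f}  MAP={map_est:.3f}\")"
   ]
  },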
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 1\n",
    "\n",
    "A common task is estimating the mean of a Gaussian when we already have prior knowledge about it. Assume $x_1, \\dots, x_n \\sim \\mathcal{N}(\\mu, \\sigma^2)$ with **known** variance $\\sigma^2$ and a Gaussian prior $\\mu \\sim \\mathcal{N}(\\mu_0, \\tau^2)$.\n",
    "\n",
    "Implement the two functions below:\n",
    "\n",
    "1. `gaussian_mle(samples)` — return $\\hat\\mu_\\text{MLE}$.\n",
    "2. `gaussian_map(samples, sigma_square, mu0, tau_square)` — return $\\hat\\mu_\\text{MAP}$.\n",
    "\n",
    "For the MAP estimate, derive the posterior. Since the Gaussian is its own conjugate prior, the posterior is again Gaussian and its mean (which is also the mode) is\n",
    "\n",
    "$$\n",
    "\\hat\\mu_\\text{MAP}\n",
    "= \\frac{\\frac{n}{\\sigma^2} \\bar x + \\frac{1}{\\tau^2} \\mu_0}{\\frac{n}{\\sigma^2} + \\frac{1}{\\tau^2}}\n",
    "\\; ,\n",
    "$$\n",
    "\n",
    "i.e. a precision-weighted average of the sample mean $\\bar x$ and the prior mean $\\mu_0$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def gaussian_mle(samples):\n",
    "    # TODO\n",
    "\n",
    "\n",
    "def gaussian_map(samples, sigma_square, mu0, tau_square):\n",
    "    # TODO\n",
    "\n",
    "\n",
    "# Truth: mu = 5, sigma^2 = 4. Prior: weakly centred on 0 with tau^2 = 1.\n",
    "true_mu, sigma_square = 5.0, 4.0\n",
    "mu0, tau_square = 0.0, 1.0\n",
    "\n",
    "for n in [3, 30, 300]:\n",
    "    samples = np.random.normal(true_mu, np.sqrt(sigma_square), size=n)\n",
    "    print(f\"n={n:>3d}  MLE={gaussian_mle(samples):.3f}  \"\n",
    "          f\"MAP={gaussian_map(samples, sigma_square, mu0, tau_square):.3f}  (true mu={true_mu})\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With only a handful of samples the MAP estimate is heavily pulled toward the prior mean ($0$); as $n$ grows it concentrates on the true mean and matches the MLE."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Naive Bayes classifier\n",
    "\n",
    "We can use Bayes' theorem directly to build a classifier. Given a feature vector $\\mathbf{x} = (x_1, \\dots, x_d)$ and a class label $y \\in \\{1, \\dots, K\\}$:\n",
    "\n",
    "$$\n",
    "P(y \\mid \\mathbf{x}) = \\frac{P(\\mathbf{x} \\mid y) \\, P(y)}{P(\\mathbf{x})}.\n",
    "$$\n",
    "\n",
    "To classify, we pick the class with the highest posterior:\n",
    "\n",
    "$$\n",
    "\\hat y = \\arg\\max_{y} \\; P(y \\mid \\mathbf{x}) = \\arg\\max_{y} \\; P(\\mathbf{x} \\mid y) \\, P(y).\n",
    "$$\n",
    "\n",
    "The hard part is modelling the joint $P(\\mathbf{x} \\mid y)$. Naive Bayes makes the **conditional independence assumption**: features are independent given the class. This is rarely true in practice (that's why it's called \"naive\"), but it makes the model simpler to calculate:\n",
    "\n",
    "$$\n",
    "P(\\mathbf{x} \\mid y) \\approx \\prod_{j=1}^{d} P(x_j \\mid y).\n",
    "$$\n",
    "\n",
    "For continuous features (like the Iris measurements) we model each $P(x_j \\mid y)$ as a univariate Gaussian fitted by MLE on the training data — this is **Gaussian naive Bayes**:\n",
    "\n",
    "$$\n",
    "P(x_j \\mid y) = \\frac{1}{\\sqrt{2\\pi \\sigma_{j,y}^2}} \\exp\\!\\left(-\\frac{(x_j - \\mu_{j,y})^2}{2 \\sigma_{j,y}^2}\\right).\n",
    "$$\n",
    "\n",
    "We compare log-posteriors (again, sums are nicer than products):\n",
    "\n",
    "$$\n",
    "\\log P(y \\mid \\mathbf{x}) \\;\\propto\\; \\log P(y) + \\sum_{j=1}^{d} \\log P(x_j \\mid y).\n",
    "$$"
   ]
  },
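  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How \"naive\" is the independence assumption on this dataset? A rough diagnostic (not part of the classifier itself) is to look at the correlations between features *within* a single class; conditional independence would imply they are all close to zero:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Within-class feature correlations for one Iris class. Under conditional\n",
    "# independence the off-diagonal entries would be (roughly) zero -- they are not.\n",
    "X_setosa = X_train[y_train == 0]\n",
    "corr = np.corrcoef(X_setosa, rowvar=False)\n",
    "print(\"feature correlations within class 'setosa':\")\n",
    "print(np.round(corr, 2))"
   ]
  },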
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_nb_sklearn = GaussianNB()\n",
    "model_nb_sklearn.fit(X_train, y_train)\n",
    "\n",
    "print(f\"Train accuracy: {accuracy_score(y_train, model_nb_sklearn.predict(X_train)):.2f}\")\n",
    "print(f\"Test accuracy:  {accuracy_score(y_test, model_nb_sklearn.predict(X_test)):.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 2\n",
    "\n",
    "Implement Gaussian naive Bayes from scratch. Fill in `fit` and `predict` below.\n",
    "\n",
    "- In `fit`, estimate the per-class prior (probability a random sample is of this class), mean and variance for every feature using MLE.\n",
    "- In `predict`, return the class with the highest log-posterior."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class MyGaussianNB:\n",
    "    def __init__(self):\n",
    "        self.classes_ = None\n",
    "        self.priors_ = None\n",
    "        self.means_ = None\n",
    "        self.variances_ = None\n",
    "\n",
    "    def fit(self, X, y):\n",
    "        self.classes_ = np.unique(y)\n",
    "        n_classes = len(self.classes_)\n",
    "        n_features = X.shape[1]\n",
    "\n",
    "        self.priors_ = np.zeros(n_classes)\n",
    "        self.means_ = np.zeros((n_classes, n_features))\n",
    "        self.variances_ = np.zeros((n_classes, n_features))\n",
    "\n",
    "        # TODO\n",
    "\n",
    "    def predict(self, X):\n",
    "        # TODO"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_nb_manual = MyGaussianNB()\n",
    "model_nb_manual.fit(X_train, y_train)\n",
    "\n",
    "print(f\"Train accuracy: {accuracy_score(y_train, model_nb_manual.predict(X_train)):.2f}\")\n",
    "print(f\"Test accuracy:  {accuracy_score(y_test, model_nb_manual.predict(X_test)):.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Discrete features and Laplace smoothing\n",
    "\n",
    "The Iris features are continuous and Gaussian NB worked well. But naive Bayes is also commonly used with **discrete** features, for example, in document classification where each $x_j \\in \\{0, 1\\}$ indicates whether word $j$ appears in the text.\n",
    "\n",
    "For categorical features we estimate $P(x_j = v \\mid y)$ directly from training counts:\n",
    "\n",
    "$$\n",
    "\\hat P(x_j = v \\mid y = c) = \\frac{N_{c, j, v}}{N_c},\n",
    "$$\n",
    "\n",
    "where $N_{c,j,v}$ is the number of training examples with class $c$ and feature value $v$ at position $j$, and $N_c$ is the number of training examples in class $c$. This MLE has a failure mode: if some value $v$ is **never observed** for class $c$ in training, $\\hat P(x_j = v \\mid y = c) = 0$, which makes the *whole* product $\\prod_j P(x_j \\mid y = c)$ collapse to zero. Class $c$ then gets ruled out for any test point with that value, no matter how strong the evidence from the other features.\n",
    "\n",
    "The standard fix is **Laplace smoothing** (also called add-$\\alpha$ / additive smoothing): pretend you saw $\\alpha$ extra examples of every value in every class.\n",
    "\n",
    "$$\n",
    "\\hat P(x_j = v \\mid y = c) = \\frac{N_{c, j, v} + \\alpha}{N_c + \\alpha \\cdot K_j},\n",
    "$$\n",
    "\n",
    "where $K_j$ is the number of possible values of feature $j$. With $\\alpha = 1$ (\"add-one\"), no probability is ever exactly zero. As an aside, this is exactly the **MAP estimate** of $P(x_j \\mid y = c)$ under a symmetric $\\mathrm{Dir}(\\alpha, \\dots, \\alpha)$ prior. As usual, the prior matters most with low data, with enough data the smoothing term is negligible.\n",
    "\n",
    "The cleanest way to see why this matters is a small toy \"spam classifier\" with three binary features (think: presence/absence of three keywords). In the training set below, the third keyword *never* appears in spam — so the MLE assigns it probability zero. A test email containing that keyword then has spam posterior collapse to zero, no matter what the other two features say."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy 'spam' dataset: 3 binary features, 2 classes. Crucially, feature 2 is\n",
    "# never 1 for any spam example -- so MLE estimates P(x_2=1 | spam) = 0.\n",
    "X_toy = np.array([\n",
    "    [1, 0, 1],  # ham\n",
    "    [0, 1, 1],  # ham\n",
    "    [1, 1, 1],  # ham\n",
    "    [1, 0, 0],  # spam\n",
    "    [0, 1, 0],  # spam\n",
    "    [1, 1, 0],  # spam\n",
    "])\n",
    "y_toy = np.array([0, 0, 0, 1, 1, 1])  # 0 = ham, 1 = spam\n",
    "class_names = [\"ham\", \"spam\"]\n",
    "\n",
    "# A new email that has feature 2 = 1. Under unsmoothed MLE the spam posterior\n",
    "# becomes zero regardless of the rest of the features.\n",
    "x_new = np.array([[1, 0, 1]])\n",
    "\n",
    "for alpha, label in [(1e-10, \"alpha~=0 (no smoothing)\"), (1.0, \"alpha=1 (Laplace)   \")]:\n",
    "    model = CategoricalNB(alpha=alpha, force_alpha=True, min_categories=2)\n",
    "    model.fit(X_toy, y_toy)\n",
    "    log_probs = model.predict_log_proba(x_new)[0]\n",
    "    pred = class_names[int(model.predict(x_new)[0])]\n",
    "    print(f\"{label}  log P(ham|x)={log_probs[0]:7.2f}  log P(spam|x)={log_probs[1]:7.2f}  -> {pred}\")"
   ]
  },
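  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To connect the sklearn output back to the counting formula above, here is a minimal by-hand computation of $\\hat P(x_2 = 1 \\mid \\text{spam})$ on the same toy data, with and without smoothing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# By-hand estimate of P(x_2 = 1 | spam), unsmoothed vs. Laplace-smoothed.\n",
    "spam_mask = (y_toy == 1)\n",
    "N_c = spam_mask.sum()                       # training examples labelled spam\n",
    "N_cjv = (X_toy[spam_mask, 2] == 1).sum()    # spam examples with feature 2 == 1\n",
    "K_j = 2                                     # feature 2 is binary\n",
    "\n",
    "p_unsmoothed = N_cjv / N_c\n",
    "p_laplace = (N_cjv + 1) / (N_c + 1 * K_j)\n",
    "print(f\"unsmoothed: P(x_2=1 | spam) = {p_unsmoothed:.3f}\")\n",
    "print(f\"Laplace:    P(x_2=1 | spam) = {p_laplace:.3f}\")"
   ]
  },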
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## k-nearest neighbors classifier\n",
    "\n",
    "Naive Bayes assumes a parametric family for $P(\\mathbf{x} \\mid y)$. **k-nearest neighbors** (k-NN) takes the opposite, fully *non-parametric* route: it does not assume any shape for the data distribution and stores the entire training set instead.\n",
    "\n",
    "To classify a new point $\\mathbf{x}$:\n",
    "\n",
    "1. Compute the distance from $\\mathbf{x}$ to every training sample (typically Euclidean).\n",
    "2. Pick the $k$ closest training points.\n",
    "3. Predict the most common label among them (majority vote).\n",
    "\n",
    "This can be motivated by Bayes' rule too: the local fraction of class $y$ among the $k$ neighbors is an estimate of $P(y \\mid \\mathbf{x})$. Picking the most common label is then approximately the MAP class.\n",
    "\n",
    "The hyperparameter $k$ controls the bias–variance trade-off:\n",
    "- small $k$ — flexible, low bias, high variance (sensitive to noise);\n",
    "- large $k$ — smoother decision boundary, higher bias, lower variance.\n",
    "\n",
    "Distances are sensitive to feature scaling, so in practice you usually standardise the features before running k-NN. We skip that here because Iris features are already on similar scales."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_knn_sklearn = KNeighborsClassifier(n_neighbors=5)\n",
    "model_knn_sklearn.fit(X_train, y_train)\n",
    "\n",
    "print(f\"Train accuracy: {accuracy_score(y_train, model_knn_sklearn.predict(X_train)):.2f}\")\n",
    "print(f\"Test accuracy:  {accuracy_score(y_test, model_knn_sklearn.predict(X_test)):.2f}\")"
   ]
  },
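  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just to illustrate the scaling remark above (Iris does not really need it), here is what a standardised k-NN looks like with a sklearn `Pipeline`; treat it as a template for datasets whose features live on very different scales:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# k-NN with feature standardisation, shown as a template. On Iris the features\n",
    "# are already on comparable scales, so the accuracy barely changes.\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))\n",
    "scaled_knn.fit(X_train, y_train)\n",
    "print(f\"Test accuracy with scaling: {accuracy_score(y_test, scaled_knn.predict(X_test)):.2f}\")"
   ]
  },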
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 3\n",
    "\n",
    "Implement k-NN from scratch. Use Euclidean distance and break ties by picking the smallest class label (this is what `np.bincount(...).argmax()` does)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class MyKNN:\n",
    "    def __init__(self, n_neighbors=5):\n",
    "        self.n_neighbors = n_neighbors\n",
    "        self.X_train = None\n",
    "        self.y_train = None\n",
    "\n",
    "    def fit(self, X, y):\n",
    "        # TODO\n",
    "\n",
    "    def predict(self, X):\n",
    "        # TODO"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_knn_manual = MyKNN(n_neighbors=5)\n",
    "model_knn_manual.fit(X_train, y_train)\n",
    "\n",
    "print(f\"Train accuracy: {accuracy_score(y_train, model_knn_manual.predict(X_train)):.2f}\")\n",
    "print(f\"Test accuracy:  {accuracy_score(y_test, model_knn_manual.predict(X_test)):.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Effect of $k$\n",
    "\n",
    "Iris is too easy to expose the bias–variance trade-off — every $k$ from 1 to 20 gives test accuracy near $1$. To see the trade-off in action we'll switch to the synthetic *two moons* dataset: two interleaved half-circles with Gaussian noise, where the boundary is genuinely fuzzy and $k$ matters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_moons, y_moons = make_moons(n_samples=350, noise=0.3, random_state=1)\n",
    "X_moons_train, X_moons_test = X_moons[:150], X_moons[150:]\n",
    "y_moons_train, y_moons_test = y_moons[:150], y_moons[150:]\n",
    "\n",
    "ks = list(range(1, 51, 2))\n",
    "train_acc, test_acc = [], []\n",
    "for k in ks:\n",
    "    model = MyKNN(n_neighbors=k)\n",
    "    model.fit(X_moons_train, y_moons_train)\n",
    "    train_acc.append(accuracy_score(y_moons_train, model.predict(X_moons_train)))\n",
    "    test_acc.append(accuracy_score(y_moons_test, model.predict(X_moons_test)))\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(8, 5))\n",
    "ax.plot(ks, train_acc, marker=\"o\", label=\"train accuracy\")\n",
    "ax.plot(ks, test_acc, marker=\"s\", label=\"test accuracy\")\n",
    "ax.set_xlabel(\"k\")\n",
    "ax.set_ylabel(\"accuracy\")\n",
    "ax.set_title(\"k-NN accuracy on the two-moons dataset as a function of k\")\n",
    "ax.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For $k = 1$ the train accuracy is exactly $1$ (each training point is its own nearest neighbor) but the test accuracy is the worst on the curve — the model has memorised the noise. As $k$ grows the decision boundary smooths out: the training accuracy drops (we are no longer memorising) and the test accuracy improves, peaking somewhere in the middle. Past that peak, $k$ becomes so large that we're averaging over points from the wrong class and accuracy starts to drop again."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generative vs. discriminative classifiers\n",
    "\n",
    "Naive Bayes and k-NN are useful representatives of two opposing philosophies of classification.\n",
    "\n",
    "**Generative classifiers** model the *joint* distribution $p(\\mathbf{x}, y) = p(\\mathbf{x} \\mid y) p(y)$ and then use Bayes' theorem to recover the posterior $p(y \\mid \\mathbf{x})$. Once you have the joint, you can do anything with it: sample new $\\mathbf{x}$, detect outliers, fill in missing features by marginalisation. The cost is that you have to commit to a model of $p(\\mathbf{x} \\mid y)$, and a wrong model can hurt classification even when a much simpler discriminative model would have nailed it. Naive Bayes (in any of its variants) is generative - that's exactly why we need to specify a Gaussian / Bernoulli / categorical likelihood per feature.\n",
    "\n",
    "**Discriminative classifiers** model the conditional $p(y \\mid \\mathbf{x})$ (or just the decision boundary) directly, without bothering to model how $\\mathbf{x}$ is distributed. They tend to give better classification accuracy when data is plentiful, but can't generate new data and don't natively handle missing features. Logistic regression is the canonical discriminative classifier. k-NN is a bit of astrange example of this family as well: it doesn't model anything explicitly, but its output is a local non-parametric estimate of $p(y \\mid \\mathbf{x})$, which puts it on the discriminative side in spirit.\n",
    "\n",
    "A useful rule of thumb (Ng & Jordan, 2001): generative models tend to converge faster,but discriminative models eventually win as $n$ grows, because they don't waste capacity modelling $\\mathbf{x}$ and they're robust when the generative assumption (e.g. conditional independence in naive Bayes) is wrong."
   ]
  },
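  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the \"generative\" point concrete: a fitted Gaussian naive Bayes contains everything needed to sample synthetic feature vectors per class. The sketch below reads the fitted per-class means and variances from the model's `theta_` and `var_` attributes (as named in recent scikit-learn versions) and draws a few fake Iris-like samples; since it inherits the independence assumption, the samples are only as realistic as the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Draw synthetic 'setosa' feature vectors from the fitted class-conditional Gaussians.\n",
    "# GaussianNB stores the per-class means in theta_ and the per-class variances in var_.\n",
    "rng = np.random.default_rng(0)\n",
    "class_idx = 0  # setosa\n",
    "means = model_nb_sklearn.theta_[class_idx]\n",
    "stds = np.sqrt(model_nb_sklearn.var_[class_idx])\n",
    "\n",
    "fake_samples = rng.normal(means, stds, size=(3, len(means)))\n",
    "print(\"synthetic 'setosa' samples (cm):\")\n",
    "print(np.round(fake_samples, 2))"
   ]
  },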
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualising decision boundaries\n",
    "\n",
    "A 2D projection (petal length × petal width) makes the parametric vs. non-parametric distinction very visual. Gaussian NB carves the plane with smooth quadratic curves — that's the shape of the boundary between two Gaussian densities with different means and variances. k-NN produces a piecewise-linear, Voronoi-like boundary that follows the data points instead of any global density model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feat_idx = [2, 3]  # petal length, petal width\n",
    "X_train_2d = X_train[:, feat_idx]\n",
    "X_test_2d = X_test[:, feat_idx]\n",
    "\n",
    "models_2d = {\n",
    "    \"Gaussian naive Bayes\": GaussianNB().fit(X_train_2d, y_train),\n",
    "    \"k-NN (k=5)\": KNeighborsClassifier(n_neighbors=5).fit(X_train_2d, y_train),\n",
    "}\n",
    "\n",
    "x_min, x_max = X_train_2d[:, 0].min() - 0.5, X_train_2d[:, 0].max() + 0.5\n",
    "y_min, y_max = X_train_2d[:, 1].min() - 0.5, X_train_2d[:, 1].max() + 0.5\n",
    "xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300), np.linspace(y_min, y_max, 300))\n",
    "grid = np.column_stack([xx.ravel(), yy.ravel()])\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharex=True, sharey=True)\n",
    "for ax, (name, model) in zip(axes, models_2d.items()):\n",
    "    Z = model.predict(grid).reshape(xx.shape)\n",
    "    ax.contourf(xx, yy, Z, alpha=0.3, levels=np.arange(-0.5, 3.5), cmap=\"viridis\")\n",
    "    ax.scatter(X_train_2d[:, 0], X_train_2d[:, 1], c=y_train, edgecolor=\"k\", cmap=\"viridis\", s=40)\n",
    "    ax.set_xlabel(iris.feature_names[feat_idx[0]])\n",
    "    ax.set_ylabel(iris.feature_names[feat_idx[1]])\n",
    "    ax.set_title(name)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "nsn",
   "language": "python",
   "name": "nsn"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
