{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d65b6d0f",
   "metadata": {},
   "source": [
    "# 🗺️ raster2poly — Classification & Vectorisation Cookbook\n",
    "\n",
    "This notebook demonstrates every classification method in **`raster2poly`**\n",
    "using Landsat 9 geological ratio imagery.\n",
    "\n",
    "**Input raster:** `LC09_193036_20230713_geo_ratios.tif` — a 15-band geological\n",
    "ratio stack produced by the `landsat9geo` pipeline.\n",
    "\n",
    "| Band | Ratio | Geological target |\n",
    "|------|-------|-------------------|\n",
    "| 1 | Iron Oxide (Red/Blue) | Fe³⁺ gossans, laterite |\n",
    "| 2 | Ferrous Iron (SWIR1/Red) | Fe²⁺ mafics, chlorite |\n",
    "| 3 | Clay/Hydroxyl (SWIR1/SWIR2) | Al-OH kaolinite, illite |\n",
    "| 4 | Carbonate (SWIR2/NIR) | CO₃²⁻ calcite, dolomite |\n",
    "| 5 | Ferric Oxide (Red/Green) | Hematite / goethite |\n",
    "| 6 | NDVI | Vegetation |\n",
    "| … | … | … |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b91b7e49",
   "metadata": {},
   "source": [
    "## 0 — Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "31e24cf6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install raster2poly\n",
    "from raster2poly import RasterClassifier\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa225609",
   "metadata": {},
   "source": [
    "## 1 — Load raster & inspect\n",
    "\n",
    "`RasterClassifier` reads all bands on init and reports dimensions,\n",
    "band count, and the percentage of valid (non-nodata) pixels.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ca61577",
   "metadata": {},
   "outputs": [],
   "source": [
    "RASTER_PATH = \"LC09_193036_20230713_geo_ratios.tif\"\n",
    "clf = RasterClassifier(RASTER_PATH)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "82022117",
   "metadata": {},
   "source": [
    "### 1a — List available algorithms\n",
    "\n",
    "A quick reference for all supported methods — callable as a\n",
    "static method or on an instance.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "de5b245b",
   "metadata": {},
   "outputs": [],
   "source": [
    "RasterClassifier.available_algorithms()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b96f64da",
   "metadata": {},
   "source": [
    "### 1b — Band statistics\n",
    "\n",
    "Before designing DN-range rules you need to know the value range\n",
    "of each band.  `band_stats()` prints min / max / mean / std for\n",
    "every band in one call — no manual rasterio loop needed.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d26c3460",
   "metadata": {},
   "outputs": [],
   "source": [
    "clf.band_stats()\n"
   ]
  },
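  {
   "cell_type": "markdown",
   "id": "7c1e4a90",
   "metadata": {},
   "source": [
    "For readers curious what a per-band summary involves, the cell below is\n",
    "a minimal NumPy sketch on a small synthetic array. It is not the\n",
    "`band_stats()` implementation: the array `stack`, the nodata value\n",
    "`-9999.0`, and the helper name `band_stats_sketch` are all assumptions\n",
    "made for illustration.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f8b2d15",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def band_stats_sketch(stack, nodata):\n",
    "    # Return (band, min, max, mean, std) per band, ignoring nodata pixels\n",
    "    rows = []\n",
    "    for i, band in enumerate(stack, start=1):\n",
    "        valid = band[band != nodata]      # keep only valid pixels\n",
    "        if valid.size == 0:\n",
    "            rows.append((i, np.nan, np.nan, np.nan, np.nan))\n",
    "        else:\n",
    "            rows.append((i, valid.min(), valid.max(), valid.mean(), valid.std()))\n",
    "    return rows\n",
    "\n",
    "# Synthetic 2-band, 2x3 raster; -9999.0 is a hypothetical nodata value\n",
    "stack = np.array([\n",
    "    [[1.0, 2.0, 3.0], [-9999.0, 4.0, 5.0]],\n",
    "    [[0.5, 0.5, -9999.0], [1.5, 1.5, -9999.0]],\n",
    "])\n",
    "\n",
    "for band, lo, hi, mean, std in band_stats_sketch(stack, nodata=-9999.0):\n",
    "    print(f\"Band {band}: min={lo:.2f} max={hi:.2f} mean={mean:.2f} std={std:.2f}\")\n"
   ]
  },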
  {
   "cell_type": "markdown",
   "id": "089a0493",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 2 — Unsupervised classification (clustering)\n",
    "\n",
    "No training data required.  The algorithm groups pixels into *k*\n",
    "spectral clusters based purely on their multi-band signature.\n",
    "\n",
    "### 2a — Standard KMeans\n",
    "\n",
    "Best for small-to-medium rasters (< 10 M pixels).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "391a11b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "gdf_kmeans = clf.unsupervised(\n",
    "    n_clusters=5,\n",
    "    algorithm=\"kmeans\",\n",
    "    dissolve=True,      # merge adjacent same-class polygons\n",
    "    min_area=50.0,      # drop tiny speckle polygons (m²)\n",
    ")\n",
    "\n",
    "clf.save(gdf_kmeans, \"output_kmeans.gpkg\")\n",
    "print(f\"\\nClasses found: {sorted(gdf_kmeans['class_id'].unique())}\")\n",
    "print(f\"Total polygons: {len(gdf_kmeans)}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1bff5b62",
   "metadata": {},
   "source": [
    "### 2b — MiniBatchKMeans (recommended for large rasters)\n",
    "\n",
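    "MiniBatchKMeans fits the centroids on small random batches of pixels\n",
    "instead of the full array. This lowers RAM usage and runtime on rasters\n",
    "with tens of millions of pixels, usually at the cost of slightly less\n",
    "compact clusters.\n"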
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6da1f960",
   "metadata": {},
   "outputs": [],
   "source": [
    "gdf_mini = clf.unsupervised(\n",
    "    n_clusters=8,\n",
    "    algorithm=\"mini_batch_kmeans\",\n",
    "    dissolve=True,\n",
    "    min_area=100.0,\n",
    ")\n",
    "\n",
    "clf.save(gdf_mini, \"output_minibatch.shp\")\n"
   ]
  },
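  {
   "cell_type": "markdown",
   "id": "a5d07e63",
   "metadata": {},
   "source": [
    "Both options cluster a table with one row per pixel and one column per\n",
    "band. The sketch below shows that reshaping step with scikit-learn on a\n",
    "random synthetic stack. Whether `raster2poly` uses scikit-learn\n",
    "internally is an assumption, and the names `stack`, `pixels`, and the\n",
    "value `batch_size=1024` are chosen only for illustration.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e2946b1c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.cluster import KMeans, MiniBatchKMeans\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "stack = rng.random((4, 60, 80))            # synthetic (bands, rows, cols) array\n",
    "\n",
    "# One row per pixel, one column per band\n",
    "n_bands, n_rows, n_cols = stack.shape\n",
    "pixels = stack.reshape(n_bands, -1).T      # shape (4800, 4)\n",
    "\n",
    "full = KMeans(n_clusters=5, n_init=10, random_state=0).fit(pixels)\n",
    "mini = MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init=3,\n",
    "                       random_state=0).fit(pixels)\n",
    "\n",
    "# Back to raster shape: one cluster label per pixel\n",
    "labels_full = full.labels_.reshape(n_rows, n_cols)\n",
    "labels_mini = mini.labels_.reshape(n_rows, n_cols)\n",
    "\n",
    "# Inertia is the within-cluster sum of squared distances (lower = tighter)\n",
    "print(f\"KMeans inertia:          {full.inertia_:.1f}\")\n",
    "print(f\"MiniBatchKMeans inertia: {mini.inertia_:.1f}\")\n"
   ]
  },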
  {
   "cell_type": "markdown",
   "id": "2ee5981b",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 3 — Supervised classification (Random Forest from ROI)\n",
    "\n",
    "Requires a shapefile / GeoPackage with labelled training geometries\n",
    "(Points **or** Polygons — or both).  The column holding the labels\n",
    "must contain **integer** class IDs.\n",
    "\n",
    "### 3a — Encode string labels → integer IDs\n",
    "\n",
    "If your ROI has a text column (e.g. `\"Age\"` with values like\n",
    "*Holocene*, *Jurassic*, …) you can convert it with `encode_roi()`\n",
    "— no external pandas step needed.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6e6aa652",
   "metadata": {},
   "outputs": [],
   "source": [
    "roi_encoded_path, label_mapping = clf.encode_roi(\n",
    "    \"ages2.shp\",\n",
    "    label_col=\"Age\",\n",
    "    # output_path=\"ages2_encoded.shp\",  # optional, auto-generated if omitted\n",
    ")\n",
    "\n",
    "# label_mapping is a dict: {'Holocene': 1, 'Jurassic': 2, ...}\n",
    "print(\"Label mapping:\")\n",
    "for name, idx in label_mapping.items():\n",
    "    print(f\"  {idx}: {name}\")\n"
   ]
  },
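  {
   "cell_type": "markdown",
   "id": "0b6f93d8",
   "metadata": {},
   "source": [
    "The encoding itself amounts to mapping each distinct label to an\n",
    "integer. A rough pandas sketch of that idea follows, on a plain\n",
    "DataFrame with made-up rows. The 1-based, alphabetically sorted\n",
    "numbering is an assumption and may not match what `encode_roi()`\n",
    "produces.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "58c2a7f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical attribute table of an ROI layer\n",
    "roi = pd.DataFrame({\"Age\": [\"Jurassic\", \"Holocene\", \"Jurassic\", \"Triassic\"]})\n",
    "\n",
    "# factorize(sort=True) numbers the sorted unique labels from 0; shift to start at 1\n",
    "codes, uniques = pd.factorize(roi[\"Age\"], sort=True)\n",
    "roi[\"class_id\"] = codes + 1\n",
    "mapping = {name: i for i, name in enumerate(uniques, start=1)}\n",
    "\n",
    "print(mapping)\n",
    "print(roi)\n"
   ]
  },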
  {
   "cell_type": "markdown",
   "id": "89857ffd",
   "metadata": {},
   "source": [
    "### 3b — Train & classify\n",
    "\n",
    "Every pixel inside each polygon ROI is used as a training sample, so the\n",
    "model sees the spectral variability within each class rather than a single\n",
    "mean vector per polygon, as in the old zonal-mean approach.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "edbb3b57",
   "metadata": {},
   "outputs": [],
   "source": [
    "gdf_rf = clf.supervised(\n",
    "    roi_path=roi_encoded_path,\n",
    "    class_col=\"class_id\",       # the column created by encode_roi()\n",
    "    n_estimators=100,            # number of Random Forest trees\n",
    "    dissolve=True,\n",
    "    min_area=25.0,\n",
    ")\n",
    "\n",
    "clf.save(gdf_rf, \"output_random_forest.geojson\")\n",
    "print(f\"\\n{len(gdf_rf)} polygons across {gdf_rf['class_id'].nunique()} classes\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2bf27eb7",
   "metadata": {},
   "source": [
    "### 3c — Inspect feature importances\n",
    "\n",
    "After `.supervised()` the fitted model is stored on\n",
    "`clf._last_model` (a private attribute, so it may change between\n",
    "releases). Use it to check which bands the classifier found most\n",
    "discriminating.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60bb8e25",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "importances = clf._last_model.feature_importances_\n",
    "band_labels = [f\"Band {i+1}\" for i in range(len(importances))]\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(10, 4))\n",
    "ax.barh(band_labels, importances)\n",
    "ax.set_xlabel(\"Feature importance\")\n",
    "ax.set_title(\"Random Forest — band importance\", fontweight=\"bold\")\n",
    "plt.tight_layout()\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "80597880",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 4 — Rule-based classification (DN ranges)\n",
    "\n",
    "Define per-band thresholds manually.  Useful when you have domain\n",
    "knowledge about the spectral signature of each class.\n",
    "\n",
    "> **Tip:** run `clf.band_stats()` first (section 1b) to see the\n",
    "> actual value ranges before writing rules.\n",
    "\n",
    "Rules format: `{class_id: [(band, min, max), …]}`\n",
    "\n",
    "- Band numbers are **1-based**\n",
    "- A pixel must satisfy **all** conditions in the list\n",
    "- Pixels that match no rule get class 0 (nodata) and are dropped from the output\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46f2cb8e",
   "metadata": {},
   "outputs": [],
   "source": [
    "dn_rules = {\n",
    "    # Class 1: Vegetation-like — high clay ratio, moderate carbonate\n",
    "    1: [\n",
    "        (3, 1.0, 3.5),     # Band 3 (Clay/Hydroxyl) high\n",
    "        (4, 0.9, 1.8),     # Band 4 (Carbonate) moderate\n",
    "        (2, 0.6, 1.2),     # Band 2 (Ferrous) moderate\n",
    "        (1, 1.0, 6.0),     # Band 1 (Iron Oxide) not too low\n",
    "    ],\n",
    "\n",
    "    # Class 2: Bare soil / dry surfaces — balanced mid-range values\n",
    "    2: [\n",
    "        (1, 1.0, 4.0),\n",
    "        (2, 0.7, 1.3),\n",
    "        (3, 0.5, 2.0),\n",
    "        (5, 0.4, 1.5),\n",
    "    ],\n",
    "\n",
    "    # Class 3: Water / very low reflectance\n",
    "    3: [\n",
    "        (2, 0.49, 0.8),\n",
    "        (3, 0.27, 0.8),\n",
    "        (4, 0.72, 1.0),\n",
    "        (5, 0.18, 0.6),\n",
    "    ],\n",
    "\n",
    "    # Class 4: Bright surfaces — urban / light-coloured rocks\n",
    "    4: [\n",
    "        (1, 4.0, 10.4),    # very high iron-oxide ratio\n",
    "        (3, 2.0, 3.5),\n",
    "        (4, 1.2, 2.0),\n",
    "        (5, 1.0, 2.3),\n",
    "    ],\n",
    "}\n",
    "\n",
    "gdf_dn = clf.from_dn_ranges(\n",
    "    rules=dn_rules,\n",
    "    dissolve=True,\n",
    "    min_area=0.0,   # keep all polygons for strict extraction\n",
    ")\n",
    "\n",
    "clf.save(gdf_dn, \"output_dn_rules.gpkg\")\n",
    "print(f\"\\nClasses: {sorted(gdf_dn['class_id'].unique())}\")\n"
   ]
  },
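  {
   "cell_type": "markdown",
   "id": "d13e80a7",
   "metadata": {},
   "source": [
    "The rule semantics listed above can be sketched with NumPy boolean\n",
    "masks. This is an illustration on synthetic data, not the\n",
    "`from_dn_ranges()` implementation: inclusive bounds and letting the\n",
    "first matching class win when rules overlap are both assumptions, as is\n",
    "the helper name `apply_rules_sketch`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "94f5c62b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def apply_rules_sketch(stack, rules):\n",
    "    # Return a (rows, cols) class map; 0 marks pixels that match no rule\n",
    "    classes = np.zeros(stack.shape[1:], dtype=np.uint8)\n",
    "    for class_id, conditions in rules.items():\n",
    "        mask = np.ones(stack.shape[1:], dtype=bool)\n",
    "        for band, lo, hi in conditions:\n",
    "            layer = stack[band - 1]                 # band numbers are 1-based\n",
    "            mask &= (layer >= lo) & (layer <= hi)   # all conditions must hold\n",
    "        classes[mask & (classes == 0)] = class_id   # assumed: first match wins\n",
    "    return classes\n",
    "\n",
    "# Synthetic 2-band, 2x2 raster\n",
    "stack = np.array([\n",
    "    [[1.5, 5.0], [0.2, 1.2]],   # band 1\n",
    "    [[0.8, 1.5], [0.6, 3.0]],   # band 2\n",
    "])\n",
    "rules = {\n",
    "    1: [(1, 1.0, 2.0), (2, 0.5, 1.0)],\n",
    "    2: [(1, 4.0, 6.0)],\n",
    "}\n",
    "print(apply_rules_sketch(stack, rules))\n"
   ]
  },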
  {
   "cell_type": "markdown",
   "id": "267bdbfb",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 5 — Quick visual comparison\n",
    "\n",
    "Plot the polygon counts per class side by side for three of the methods:\n",
    "KMeans, Random Forest, and DN rules.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fed3614d",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "fig, axes = plt.subplots(1, 3, figsize=(16, 4))\n",
    "\n",
    "for ax, (gdf, title) in zip(axes, [\n",
    "    (gdf_kmeans, \"KMeans (5 clusters)\"),\n",
    "    (gdf_rf, \"Random Forest\"),\n",
    "    (gdf_dn, \"DN Rules\"),\n",
    "]):\n",
    "    counts = gdf[\"class_id\"].value_counts().sort_index()\n",
    "    counts.plot.bar(ax=ax, color=\"steelblue\")\n",
    "    ax.set_title(title, fontweight=\"bold\")\n",
    "    ax.set_xlabel(\"Class ID\")\n",
    "    ax.set_ylabel(\"Polygon count\")\n",
    "\n",
    "plt.suptitle(\"Polygon counts per class — three classification methods\",\n",
    "             fontsize=13, fontweight=\"bold\", y=1.03)\n",
    "plt.tight_layout()\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "75362460",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 📋 Output files\n",
    "\n",
    "| File | Method | Format |\n",
    "|------|--------|--------|\n",
    "| `output_kmeans.gpkg` | Unsupervised KMeans | GeoPackage |\n",
    "| `output_minibatch.shp` | Unsupervised MiniBatchKMeans | Shapefile |\n",
    "| `output_random_forest.geojson` | Supervised Random Forest | GeoJSON |\n",
    "| `output_dn_rules.gpkg` | Rule-based DN ranges | GeoPackage |\n",
    "\n",
    "All outputs have a `class_id` column and dissolved, cleaned polygon geometries.\n",
    "Load directly into QGIS / ArcGIS for further analysis.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}