🗺️ raster2poly — Classification & Vectorisation Cookbook

This notebook demonstrates every classification method in ``raster2poly`` using Landsat 9 geological ratio imagery.

Input raster: LC09_193036_20230713_geo_ratios.tif — a 15-band geological ratio stack produced by the landsat9geo pipeline.

Band	Ratio	Geological target
1	Iron Oxide (Red/Blue)	Fe³⁺ gossans, laterite
2	Ferrous Iron (SWIR1/Red)	Fe²⁺ mafics, chlorite
3	Clay/Hydroxyl (SWIR1/SWIR2)	Al-OH kaolinite, illite
4	Carbonate (SWIR2/NIR)	CO₃²⁻ calcite, dolomite
5	Ferric Oxide (Red/Green)	Hematite / goethite
6	NDVI	Vegetation
…	…	…

0 — Setup

[ ]:

# !pip install raster2poly
from raster2poly import RasterClassifier

1 — Load raster & inspect

RasterClassifier reads all bands on init and reports dimensions, band count, and the percentage of valid (non-nodata) pixels.

[ ]:

RASTER_PATH = "LC09_193036_20230713_geo_ratios.tif"
clf = RasterClassifier(RASTER_PATH)

1a — List available algorithms

A quick reference for all supported methods — callable as a static method or on an instance.

[ ]:

RasterClassifier.available_algorithms()

1b — Band statistics

Before designing DN-range rules you need to know the value range of each band. band_stats() prints min / max / mean / std for every band in one call — no manual rasterio loop needed.

[ ]:

clf.band_stats()

2 — Unsupervised classification (clustering)

No training data required. The algorithm groups pixels into k spectral clusters based purely on their multi-band signature.

2a — Standard KMeans

Best for small-to-medium rasters (< 10 M pixels).

[ ]:

gdf_kmeans = clf.unsupervised(
    n_clusters=5,
    algorithm="kmeans",
    dissolve=True,      # merge adjacent same-class polygons
    min_area=50.0,      # drop tiny speckle polygons (m²)
)

clf.save(gdf_kmeans, "output_kmeans.gpkg")
print(f"\nClasses found: {sorted(gdf_kmeans['class_id'].unique())}")
print(f"Total polygons: {len(gdf_kmeans)}")

2b — MiniBatchKMeans (recommended for large rasters)

Processes the data in small batches → drastically lower RAM usage and faster convergence on rasters with tens of millions of pixels.

[ ]:

gdf_mini = clf.unsupervised(
    n_clusters=8,
    algorithm="mini_batch_kmeans",
    dissolve=True,
    min_area=100.0,
)

clf.save(gdf_mini, "output_minibatch.shp")

3 — Supervised classification (Random Forest from ROI)

Requires a shapefile / GeoPackage with labelled training geometries (Points or Polygons — or both). The column holding the labels must contain integer class IDs.

3a — Encode string labels → integer IDs

If your ROI has a text column (e.g. "Age" with values like Holocene, Jurassic, …) you can convert it with encode_roi() — no external pandas step needed.

[ ]:

roi_encoded_path, label_mapping = clf.encode_roi(
    "ages2.shp",
    label_col="Age",
    # output_path="ages2_encoded.shp",  # optional, auto-generated if omitted
)

# label_mapping is a dict: {'Holocene': 1, 'Jurassic': 2, ...}
print("Label mapping:")
for name, idx in label_mapping.items():
    print(f"  {idx}: {name}")

3b — Train & classify

Every pixel inside each polygon ROI is used as a training sample. This is far more robust than the old zonal-mean approach.

[ ]:

gdf_rf = clf.supervised(
    roi_path=roi_encoded_path,
    class_col="class_id",       # the column created by encode_roi()
    n_estimators=100,            # number of Random Forest trees
    dissolve=True,
    min_area=25.0,
)

clf.save(gdf_rf, "output_random_forest.geojson")
print(f"\n{len(gdf_rf)} polygons across {gdf_rf['class_id'].nunique()} classes")

3c — Inspect feature importances

After .supervised() the fitted model is stored on clf._last_model — use it to check which bands the classifier found most discriminating.

[ ]:

import matplotlib.pyplot as plt

importances = clf._last_model.feature_importances_
band_labels = [f"Band {i+1}" for i in range(len(importances))]

fig, ax = plt.subplots(figsize=(10, 4))
ax.barh(band_labels, importances)
ax.set_xlabel("Feature importance")
ax.set_title("Random Forest — band importance", fontweight="bold")
plt.tight_layout()
plt.show()

4 — Rule-based classification (DN ranges)

Define per-band thresholds manually. Useful when you have domain knowledge about the spectral signature of each class.

Tip: run clf.band_stats() first (section 1b) to see the actual value ranges before writing rules.

Rules format: {class_id: [(band, min, max), …]}

Band numbers are 1-based
A pixel must satisfy all conditions in the list
Unclassified pixels are dropped (class 0 = nodata)

[ ]:

dn_rules = {
    # Class 1: Vegetation-like — high clay ratio, moderate carbonate
    1: [
        (3, 1.0, 3.5),     # Band 3 (Clay/Hydroxyl) high
        (4, 0.9, 1.8),     # Band 4 (Carbonate) moderate
        (2, 0.6, 1.2),     # Band 2 (Ferrous) moderate
        (1, 1.0, 6.0),     # Band 1 (Iron Oxide) not too low
    ],

    # Class 2: Bare soil / dry surfaces — balanced mid-range values
    2: [
        (1, 1.0, 4.0),
        (2, 0.7, 1.3),
        (3, 0.5, 2.0),
        (5, 0.4, 1.5),
    ],

    # Class 3: Water / very low reflectance
    3: [
        (2, 0.49, 0.8),
        (3, 0.27, 0.8),
        (4, 0.72, 1.0),
        (5, 0.18, 0.6),
    ],

    # Class 4: Bright surfaces — urban / light-coloured rocks
    4: [
        (1, 4.0, 10.4),    # very high iron-oxide ratio
        (3, 2.0, 3.5),
        (4, 1.2, 2.0),
        (5, 1.0, 2.3),
    ],
}

gdf_dn = clf.from_dn_ranges(
    rules=dn_rules,
    dissolve=True,
    min_area=0.0,   # keep all polygons for strict extraction
)

clf.save(gdf_dn, "output_dn_rules.gpkg")
print(f"\nClasses: {sorted(gdf_dn['class_id'].unique())}")

5 — Quick visual comparison

Plot the polygon counts per class for each method side by side.

[ ]:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(16, 4))

for ax, (gdf, title) in zip(axes, [
    (gdf_kmeans, "KMeans (5 clusters)"),
    (gdf_rf, "Random Forest"),
    (gdf_dn, "DN Rules"),
]):
    counts = gdf["class_id"].value_counts().sort_index()
    counts.plot.bar(ax=ax, color="steelblue")
    ax.set_title(title, fontweight="bold")
    ax.set_xlabel("Class ID")
    ax.set_ylabel("Polygon count")

plt.suptitle("Polygon counts per class — three classification methods",
             fontsize=13, fontweight="bold", y=1.03)
plt.tight_layout()
plt.show()

📋 Output files

File	Method	Format
`output_kmeans.gpkg`	Unsupervised KMeans	GeoPackage
`output_minibatch.shp`	Unsupervised MiniBatchKMeans	Shapefile
`output_random_forest.geojson`	Supervised Random Forest	GeoJSON
`output_dn_rules.gpkg`	Rule-based DN ranges	GeoPackage

All outputs have a class_id column and dissolved, cleaned polygon geometries. Load directly into QGIS / ArcGIS for further analysis.