πŸ—ΊοΈ raster2poly β€” Classification & Vectorisation Cookbook

This notebook demonstrates every classification method in ``raster2poly`` using Landsat 9 geological ratio imagery.

Input raster: LC09_193036_20230713_geo_ratios.tif β€” a 15-band geological ratio stack produced by the landsat9geo pipeline.

Band

Ratio

Geological target

1

Iron Oxide (Red/Blue)

Fe³⁺ gossans, laterite

2

Ferrous Iron (SWIR1/Red)

Fe²⁺ mafics, chlorite

3

Clay/Hydroxyl (SWIR1/SWIR2)

Al-OH kaolinite, illite

4

Carbonate (SWIR2/NIR)

CO₃²⁻ calcite, dolomite

5

Ferric Oxide (Red/Green)

Hematite / goethite

6

NDVI

Vegetation

…

…

…

0 β€” Setup

[ ]:
# !pip install raster2poly
from raster2poly import RasterClassifier

1 β€” Load raster & inspect

RasterClassifier reads all bands on init and reports dimensions, band count, and the percentage of valid (non-nodata) pixels.

[ ]:
RASTER_PATH = "LC09_193036_20230713_geo_ratios.tif"
clf = RasterClassifier(RASTER_PATH)

1a β€” List available algorithms

A quick reference for all supported methods β€” callable as a static method or on an instance.

[ ]:
RasterClassifier.available_algorithms()

1b β€” Band statistics

Before designing DN-range rules you need to know the value range of each band. band_stats() prints min / max / mean / std for every band in one call β€” no manual rasterio loop needed.

[ ]:
clf.band_stats()


2 β€” Unsupervised classification (clustering)

No training data required. The algorithm groups pixels into k spectral clusters based purely on their multi-band signature.

2a β€” Standard KMeans

Best for small-to-medium rasters (< 10 M pixels).

[ ]:
gdf_kmeans = clf.unsupervised(
    n_clusters=5,
    algorithm="kmeans",
    dissolve=True,      # merge adjacent same-class polygons
    min_area=50.0,      # drop tiny speckle polygons (mΒ²)
)

clf.save(gdf_kmeans, "output_kmeans.gpkg")
print(f"\nClasses found: {sorted(gdf_kmeans['class_id'].unique())}")
print(f"Total polygons: {len(gdf_kmeans)}")

2b β€” MiniBatchKMeans (recommended for large rasters)

Processes the data in small batches β†’ drastically lower RAM usage and faster convergence on rasters with tens of millions of pixels.

[ ]:
gdf_mini = clf.unsupervised(
    n_clusters=8,
    algorithm="mini_batch_kmeans",
    dissolve=True,
    min_area=100.0,
)

clf.save(gdf_mini, "output_minibatch.shp")


3 β€” Supervised classification (Random Forest from ROI)

Requires a shapefile / GeoPackage with labelled training geometries (Points or Polygons β€” or both). The column holding the labels must contain integer class IDs.

3a β€” Encode string labels β†’ integer IDs

If your ROI has a text column (e.g. "Age" with values like Holocene, Jurassic, …) you can convert it with encode_roi() β€” no external pandas step needed.

[ ]:
roi_encoded_path, label_mapping = clf.encode_roi(
    "ages2.shp",
    label_col="Age",
    # output_path="ages2_encoded.shp",  # optional, auto-generated if omitted
)

# label_mapping is a dict: {'Holocene': 1, 'Jurassic': 2, ...}
print("Label mapping:")
for name, idx in label_mapping.items():
    print(f"  {idx}: {name}")

3b β€” Train & classify

Every pixel inside each polygon ROI is used as a training sample. This is far more robust than the old zonal-mean approach.

[ ]:
gdf_rf = clf.supervised(
    roi_path=roi_encoded_path,
    class_col="class_id",       # the column created by encode_roi()
    n_estimators=100,            # number of Random Forest trees
    dissolve=True,
    min_area=25.0,
)

clf.save(gdf_rf, "output_random_forest.geojson")
print(f"\n{len(gdf_rf)} polygons across {gdf_rf['class_id'].nunique()} classes")

3c β€” Inspect feature importances

After .supervised() the fitted model is stored on clf._last_model β€” use it to check which bands the classifier found most discriminating.

[ ]:
import matplotlib.pyplot as plt

importances = clf._last_model.feature_importances_
band_labels = [f"Band {i+1}" for i in range(len(importances))]

fig, ax = plt.subplots(figsize=(10, 4))
ax.barh(band_labels, importances)
ax.set_xlabel("Feature importance")
ax.set_title("Random Forest β€” band importance", fontweight="bold")
plt.tight_layout()
plt.show()


4 β€” Rule-based classification (DN ranges)

Define per-band thresholds manually. Useful when you have domain knowledge about the spectral signature of each class.

Tip: run clf.band_stats() first (section 1b) to see the actual value ranges before writing rules.

Rules format: {class_id: [(band, min, max), …]}

  • Band numbers are 1-based

  • A pixel must satisfy all conditions in the list

  • Unclassified pixels are dropped (class 0 = nodata)

[ ]:
dn_rules = {
    # Class 1: Vegetation-like β€” high clay ratio, moderate carbonate
    1: [
        (3, 1.0, 3.5),     # Band 3 (Clay/Hydroxyl) high
        (4, 0.9, 1.8),     # Band 4 (Carbonate) moderate
        (2, 0.6, 1.2),     # Band 2 (Ferrous) moderate
        (1, 1.0, 6.0),     # Band 1 (Iron Oxide) not too low
    ],

    # Class 2: Bare soil / dry surfaces β€” balanced mid-range values
    2: [
        (1, 1.0, 4.0),
        (2, 0.7, 1.3),
        (3, 0.5, 2.0),
        (5, 0.4, 1.5),
    ],

    # Class 3: Water / very low reflectance
    3: [
        (2, 0.49, 0.8),
        (3, 0.27, 0.8),
        (4, 0.72, 1.0),
        (5, 0.18, 0.6),
    ],

    # Class 4: Bright surfaces β€” urban / light-coloured rocks
    4: [
        (1, 4.0, 10.4),    # very high iron-oxide ratio
        (3, 2.0, 3.5),
        (4, 1.2, 2.0),
        (5, 1.0, 2.3),
    ],
}

gdf_dn = clf.from_dn_ranges(
    rules=dn_rules,
    dissolve=True,
    min_area=0.0,   # keep all polygons for strict extraction
)

clf.save(gdf_dn, "output_dn_rules.gpkg")
print(f"\nClasses: {sorted(gdf_dn['class_id'].unique())}")


5 β€” Quick visual comparison

Plot the polygon counts per class for each method side by side.

[ ]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(16, 4))

for ax, (gdf, title) in zip(axes, [
    (gdf_kmeans, "KMeans (5 clusters)"),
    (gdf_rf, "Random Forest"),
    (gdf_dn, "DN Rules"),
]):
    counts = gdf["class_id"].value_counts().sort_index()
    counts.plot.bar(ax=ax, color="steelblue")
    ax.set_title(title, fontweight="bold")
    ax.set_xlabel("Class ID")
    ax.set_ylabel("Polygon count")

plt.suptitle("Polygon counts per class β€” three classification methods",
             fontsize=13, fontweight="bold", y=1.03)
plt.tight_layout()
plt.show()


πŸ“‹ Output files

File

Method

Format

output_kmeans.gpkg

Unsupervised KMeans

GeoPackage

output_minibatch.shp

Unsupervised MiniBatchKMeans

Shapefile

output_random_forest.geojson

Supervised Random Forest

GeoJSON

output_dn_rules.gpkg

Rule-based DN ranges

GeoPackage

All outputs have a class_id column and dissolved, cleaned polygon geometries. Load directly into QGIS / ArcGIS for further analysis.