πΊοΈ raster2poly β Classification & Vectorisation Cookbookο
This notebook demonstrates every classification method in ``raster2poly`` using Landsat 9 geological ratio imagery.
Input raster: LC09_193036_20230713_geo_ratios.tif β a 15-band geological ratio stack produced by the landsat9geo pipeline.
Band |
Ratio |
Geological target |
|---|---|---|
1 |
Iron Oxide (Red/Blue) |
FeΒ³βΊ gossans, laterite |
2 |
Ferrous Iron (SWIR1/Red) |
FeΒ²βΊ mafics, chlorite |
3 |
Clay/Hydroxyl (SWIR1/SWIR2) |
Al-OH kaolinite, illite |
4 |
Carbonate (SWIR2/NIR) |
COβΒ²β» calcite, dolomite |
5 |
Ferric Oxide (Red/Green) |
Hematite / goethite |
6 |
NDVI |
Vegetation |
β¦ |
β¦ |
β¦ |
0 β Setupο
[ ]:
# !pip install raster2poly
from raster2poly import RasterClassifier
1 β Load raster & inspectο
RasterClassifier reads all bands on init and reports dimensions, band count, and the percentage of valid (non-nodata) pixels.
[ ]:
RASTER_PATH = "LC09_193036_20230713_geo_ratios.tif"
clf = RasterClassifier(RASTER_PATH)
1a β List available algorithmsο
A quick reference for all supported methods β callable as a static method or on an instance.
[ ]:
RasterClassifier.available_algorithms()
1b β Band statisticsο
Before designing DN-range rules you need to know the value range of each band. band_stats() prints min / max / mean / std for every band in one call β no manual rasterio loop needed.
[ ]:
clf.band_stats()
2 β Unsupervised classification (clustering)ο
No training data required. The algorithm groups pixels into k spectral clusters based purely on their multi-band signature.
2a β Standard KMeansο
Best for small-to-medium rasters (< 10 M pixels).
[ ]:
gdf_kmeans = clf.unsupervised(
n_clusters=5,
algorithm="kmeans",
dissolve=True, # merge adjacent same-class polygons
min_area=50.0, # drop tiny speckle polygons (mΒ²)
)
clf.save(gdf_kmeans, "output_kmeans.gpkg")
print(f"\nClasses found: {sorted(gdf_kmeans['class_id'].unique())}")
print(f"Total polygons: {len(gdf_kmeans)}")
2b β MiniBatchKMeans (recommended for large rasters)ο
Processes the data in small batches β drastically lower RAM usage and faster convergence on rasters with tens of millions of pixels.
[ ]:
gdf_mini = clf.unsupervised(
n_clusters=8,
algorithm="mini_batch_kmeans",
dissolve=True,
min_area=100.0,
)
clf.save(gdf_mini, "output_minibatch.shp")
3 β Supervised classification (Random Forest from ROI)ο
Requires a shapefile / GeoPackage with labelled training geometries (Points or Polygons β or both). The column holding the labels must contain integer class IDs.
3a β Encode string labels β integer IDsο
If your ROI has a text column (e.g. "Age" with values like Holocene, Jurassic, β¦) you can convert it with encode_roi() β no external pandas step needed.
[ ]:
roi_encoded_path, label_mapping = clf.encode_roi(
"ages2.shp",
label_col="Age",
# output_path="ages2_encoded.shp", # optional, auto-generated if omitted
)
# label_mapping is a dict: {'Holocene': 1, 'Jurassic': 2, ...}
print("Label mapping:")
for name, idx in label_mapping.items():
print(f" {idx}: {name}")
3b β Train & classifyο
Every pixel inside each polygon ROI is used as a training sample. This is far more robust than the old zonal-mean approach.
[ ]:
gdf_rf = clf.supervised(
roi_path=roi_encoded_path,
class_col="class_id", # the column created by encode_roi()
n_estimators=100, # number of Random Forest trees
dissolve=True,
min_area=25.0,
)
clf.save(gdf_rf, "output_random_forest.geojson")
print(f"\n{len(gdf_rf)} polygons across {gdf_rf['class_id'].nunique()} classes")
3c β Inspect feature importancesο
After .supervised() the fitted model is stored on clf._last_model β use it to check which bands the classifier found most discriminating.
[ ]:
import matplotlib.pyplot as plt
importances = clf._last_model.feature_importances_
band_labels = [f"Band {i+1}" for i in range(len(importances))]
fig, ax = plt.subplots(figsize=(10, 4))
ax.barh(band_labels, importances)
ax.set_xlabel("Feature importance")
ax.set_title("Random Forest β band importance", fontweight="bold")
plt.tight_layout()
plt.show()
4 β Rule-based classification (DN ranges)ο
Define per-band thresholds manually. Useful when you have domain knowledge about the spectral signature of each class.
Tip: run
clf.band_stats()first (section 1b) to see the actual value ranges before writing rules.
Rules format: {class_id: [(band, min, max), β¦]}
Band numbers are 1-based
A pixel must satisfy all conditions in the list
Unclassified pixels are dropped (class 0 = nodata)
[ ]:
dn_rules = {
# Class 1: Vegetation-like β high clay ratio, moderate carbonate
1: [
(3, 1.0, 3.5), # Band 3 (Clay/Hydroxyl) high
(4, 0.9, 1.8), # Band 4 (Carbonate) moderate
(2, 0.6, 1.2), # Band 2 (Ferrous) moderate
(1, 1.0, 6.0), # Band 1 (Iron Oxide) not too low
],
# Class 2: Bare soil / dry surfaces β balanced mid-range values
2: [
(1, 1.0, 4.0),
(2, 0.7, 1.3),
(3, 0.5, 2.0),
(5, 0.4, 1.5),
],
# Class 3: Water / very low reflectance
3: [
(2, 0.49, 0.8),
(3, 0.27, 0.8),
(4, 0.72, 1.0),
(5, 0.18, 0.6),
],
# Class 4: Bright surfaces β urban / light-coloured rocks
4: [
(1, 4.0, 10.4), # very high iron-oxide ratio
(3, 2.0, 3.5),
(4, 1.2, 2.0),
(5, 1.0, 2.3),
],
}
gdf_dn = clf.from_dn_ranges(
rules=dn_rules,
dissolve=True,
min_area=0.0, # keep all polygons for strict extraction
)
clf.save(gdf_dn, "output_dn_rules.gpkg")
print(f"\nClasses: {sorted(gdf_dn['class_id'].unique())}")
5 β Quick visual comparisonο
Plot the polygon counts per class for each method side by side.
[ ]:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 3, figsize=(16, 4))
for ax, (gdf, title) in zip(axes, [
(gdf_kmeans, "KMeans (5 clusters)"),
(gdf_rf, "Random Forest"),
(gdf_dn, "DN Rules"),
]):
counts = gdf["class_id"].value_counts().sort_index()
counts.plot.bar(ax=ax, color="steelblue")
ax.set_title(title, fontweight="bold")
ax.set_xlabel("Class ID")
ax.set_ylabel("Polygon count")
plt.suptitle("Polygon counts per class β three classification methods",
fontsize=13, fontweight="bold", y=1.03)
plt.tight_layout()
plt.show()
π Output filesο
File |
Method |
Format |
|---|---|---|
|
Unsupervised KMeans |
GeoPackage |
|
Unsupervised MiniBatchKMeans |
Shapefile |
|
Supervised Random Forest |
GeoJSON |
|
Rule-based DN ranges |
GeoPackage |
All outputs have a class_id column and dissolved, cleaned polygon geometries. Load directly into QGIS / ArcGIS for further analysis.