Classification methods

This page explains each method in detail, when to use it, and what the parameters do.

Unsupervised — `clf.unsupervised()`

Groups pixels into k spectral clusters with no training data.

gdf = clf.unsupervised(
    n_clusters=5,
    algorithm="kmeans",          # or "mini_batch_kmeans"
    dissolve=True,               # merge adjacent same-class polygons
    min_area=50.0,               # drop polygons < 50 map units² (m² in a metric CRS)
)

Algorithm choice:

"kmeans": Standard Lloyd’s algorithm. Loads all valid pixels into RAM. Best for rasters under ~10 M pixels.
"mini_batch_kmeans": Processes pixels in batches of 10 000. Convergence is slightly noisier but uses far less memory and runs much faster on large scenes.
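
As a rough illustration of the guidance above, a small helper can pick the algorithm name from the raster dimensions. The function name `pick_algorithm` and the hard 10 M cutoff are assumptions made for this sketch, not part of the library. It also counts every pixel rather than only valid ones, so it errs towards the mini-batch option.

```python
def pick_algorithm(width: int, height: int, limit: int = 10_000_000) -> str:
    """Return an algorithm name for clf.unsupervised() based on pixel count.

    Hypothetical helper: the cutoff mirrors the ~10 M pixel guidance and
    is a rule of thumb, not a hard limit.
    """
    n_pixels = width * height
    return "kmeans" if n_pixels < limit else "mini_batch_kmeans"


print(pick_algorithm(2_000, 2_000))    # 4 M pixels
print(pick_algorithm(10_980, 10_980))  # roughly 120 M pixels
```

The returned string can then be passed as the `algorithm` argument.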

Tip

Call RasterClassifier.available_algorithms() to see all options with inline usage examples.

Supervised — `clf.supervised()`

Trains a Random Forest classifier from labelled training geometries.

gdf = clf.supervised(
    roi_path="training_rois.shp",
    class_col="class_id",        # column with integer labels
    n_estimators=100,             # RF trees
    dissolve=True,
    min_area=25.0,
)

ROI requirements:

The file may contain Points, Polygons, or both (mixed geometry).
For polygons, every pixel inside the geometry is used as a training sample — no lossy zonal-mean shortcut.
The label column must contain integers. If you have string labels, use encode_roi() first.
ROI geometries are automatically reprojected to the raster's CRS; the raster itself is never reprojected.

After classification the fitted model is available at clf._last_model. Use it to inspect feature importances:

importances = clf._last_model.feature_importances_
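
To read the importances per band, it can help to pair each value with its band number and sort. The sketch below assumes one feature per raster band, in band order, which this page does not state explicitly. `rank_bands` is a hypothetical helper and the numbers are placeholders, not real output.

```python
def rank_bands(importances):
    """Pair each importance with its 1-based band number, highest first."""
    numbered = enumerate(importances, start=1)
    return sorted(numbered, key=lambda pair: pair[1], reverse=True)


# Placeholder values standing in for clf._last_model.feature_importances_
example = [0.10, 0.45, 0.05, 0.40]
for band, score in rank_bands(example):
    print(f"Band {band}: {score:.2f}")
```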

Rule-based — `clf.from_dn_ranges()`

Classify pixels by per-band value thresholds. No training data and no statistical model — you define the rules from domain knowledge.

rules = {
    1: [(4, 0.15, 1.0), (5, 0.0, 0.10)],   # class 1: high B4 AND low B5
    2: [(5, 0.25, 1.0)],                      # class 2: high B5
}
gdf = clf.from_dn_ranges(rules)

Rule format: {class_id: [(band, min, max), …]}

Band numbers are 1-based.
A pixel must satisfy all conditions in the list to be assigned that class.
If multiple rules overlap, the last matching class wins.
Class 0 is reserved for nodata / unclassified.
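
The rule semantics listed above can be sketched for a single pixel in plain Python. This is only an illustration of the described behaviour, not the library's implementation. `classify_pixel` is a hypothetical name, the bounds are assumed to be inclusive, and "last" is assumed to mean dictionary insertion order.

```python
def classify_pixel(pixel, rules):
    """Classify one pixel given as a list of band values (band 1 first).

    rules follows the {class_id: [(band, min, max), ...]} format.
    """
    assigned = 0  # 0 = nodata / unclassified
    for class_id, conditions in rules.items():
        # Every (band, min, max) condition must hold; bands are 1-based.
        if all(lo <= pixel[band - 1] <= hi for band, lo, hi in conditions):
            assigned = class_id  # a later match overwrites an earlier one
    return assigned


rules = {
    1: [(4, 0.15, 1.0), (5, 0.0, 0.10)],
    2: [(5, 0.25, 1.0)],
}
print(classify_pixel([0.0, 0.0, 0.0, 0.20, 0.05], rules))
```

A pixel that matches no rule keeps the value 0, in line with class 0 being reserved.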

Tip

Run clf.band_stats() first to see min / max / mean / std for every band, then design rules accordingly.

Utility helpers

`available_algorithms()`

Print a formatted list of all supported algorithms with usage snippets. Works as a static method or on an instance:

RasterClassifier.available_algorithms()
# — or —
clf.available_algorithms()

`band_stats()`

Print per-band statistics directly from the raster:

clf.band_stats()
# Band  1: min=0.8706  max=10.4115  mean=2.1423  std=0.9812
# Band  2: min=0.4902  max=1.6621   mean=0.9134  std=0.1089
# ...

`encode_roi()`

Convert a string/categorical label column to consecutive integer IDs:

out_path, mapping = clf.encode_roi(
    "geology.shp",
    label_col="Formation",
)
# mapping = {'Alluvium': 1, 'Granite': 2, 'Schist': 3}

# Feed directly into supervised():
gdf = clf.supervised(roi_path=out_path, class_col="class_id")

Labels are sorted alphabetically and numbered from 1 (0 is reserved for nodata). The encoded file is saved next to the input by default.
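
The numbering scheme described above can be reproduced in a few lines, which is useful for predicting the mapping before encoding. `build_label_mapping` is a hypothetical helper, not the library's code. It uses Python's default string ordering, which matches alphabetical order only when the labels use consistent capitalisation.

```python
def build_label_mapping(labels):
    """Map distinct string labels to consecutive integers starting at 1."""
    unique_sorted = sorted(set(labels))
    return {label: i for i, label in enumerate(unique_sorted, start=1)}


mapping = build_label_mapping(["Granite", "Schist", "Alluvium", "Granite"])
print(mapping)  # {'Alluvium': 1, 'Granite': 2, 'Schist': 3}
```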

Polygon options (all methods)

dissolve (default True): Merge adjacent polygons of the same class into single features, then explode MultiPolygons to simple Polygons.
min_area (default 0.0): Drop polygons smaller than this threshold (in map units²). Set to e.g. 50.0 to remove 1–2 pixel speckle.
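
A suitable `min_area` depends on the pixel size, so it can be derived rather than guessed. The sketch below assumes square pixels and a strict "smaller than" comparison, which is why it adds half a pixel of margin. `speckle_min_area` is a hypothetical helper, not part of the library.

```python
def speckle_min_area(pixel_size: float, max_pixels: int = 2) -> float:
    """Area threshold that drops polygons of up to max_pixels pixels.

    Half a pixel is added so that a polygon of exactly max_pixels pixels
    falls strictly below the threshold.
    """
    pixel_area = pixel_size * pixel_size
    return (max_pixels + 0.5) * pixel_area


print(speckle_min_area(5.0))  # 5 m pixels, remove 1-2 pixel polygons
```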

Output formats

clf.save() auto-detects format from the file extension:

clf.save(gdf, "out.shp")       # Shapefile
clf.save(gdf, "out.gpkg")      # GeoPackage (recommended)
clf.save(gdf, "out.geojson")   # GeoJSON
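
Extension-based detection of this kind can be sketched as a lookup from file suffix to an OGR driver name. This is an illustration under assumptions, not how `clf.save()` is necessarily implemented. `DRIVERS` and `driver_for` are hypothetical names, and only the three extensions shown above are covered.

```python
from pathlib import Path

# Suffix -> OGR vector driver name
DRIVERS = {
    ".shp": "ESRI Shapefile",
    ".gpkg": "GPKG",
    ".geojson": "GeoJSON",
}


def driver_for(path: str) -> str:
    """Return the driver name matching the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    try:
        return DRIVERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported output extension: {suffix!r}") from None


print(driver_for("out.gpkg"))
```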
