Statistics#

GeoUtils supports statistical analysis tailored to geospatial objects.

For a Raster or a PointCloud, statistics are computed on the data attribute, which is unambiguously defined for both objects.

Warning

The API for statistical features is preliminary and might change with the release of zonal and grouped statistics.

Estimators#

The get_stats() method extracts key statistical estimators from a raster or a point cloud, optionally restricted to an inlier mask.

Supported statistics are:

Mean: arithmetic mean of the data, ignoring masked values.
Median: middle value when the valid data points are sorted in increasing order, ignoring masked values.
Max: maximum value among the data, ignoring masked values.
Min: minimum value among the data, ignoring masked values.
Sum: sum of all data, ignoring masked values.
Sum of squares: sum of the squares of all data, ignoring masked values.
90th percentile: point below which 90% of the data falls, ignoring masked values.
IQR (Interquartile Range): difference between the 75th and 25th percentile of a dataset, ignoring masked values.
LE90 (Linear Error with 90% confidence): difference between the 95th and 5th percentiles of a dataset, representing the range within which 90% of the data points lie, ignoring masked values.
NMAD (Normalized Median Absolute Deviation): robust measure of variability in the data, less sensitive to outliers than the standard deviation, ignoring masked values.
RMSE (Root Mean Square Error): commonly used to express the magnitude of errors or variability, and can give insight into the spread of the data, ignoring masked values. Only relevant when the raster represents a difference of two objects.
Std (Standard deviation): measures the spread or dispersion of the data around the mean, ignoring masked values.
Valid count: number of finite, non-masked data points in the array.
Total count: total size of the raster.
Percentage valid points: ratio between Valid count and Total count, expressed as a percentage.
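The robust estimators listed above can be reproduced with plain NumPy to see what they measure. The following is a minimal sketch, not the GeoUtils implementation: the helper name robust_stats is hypothetical, the NMAD uses the conventional scaling factor 1.4826 (an assumption about what get_stats() applies), and percentiles rely on NumPy's default linear interpolation.

```python
import numpy as np


def robust_stats(values: np.ma.MaskedArray) -> dict:
    """Hypothetical helper: NMAD, IQR, LE90 and RMSE of the valid values."""
    # Keep only unmasked, finite values
    valid = values.compressed().astype("float64")
    valid = valid[np.isfinite(valid)]

    median = np.median(valid)
    p5, p25, p75, p95 = np.percentile(valid, [5, 25, 75, 95])

    return {
        # Assumed conventional factor making the NMAD comparable to a
        # standard deviation for normally distributed data
        "NMAD": 1.4826 * np.median(np.abs(valid - median)),
        "IQR": p75 - p25,
        "LE90": p95 - p5,
        # Root of the mean of squares: meaningful for difference data
        "RMSE": np.sqrt(np.mean(valid**2)),
    }


# Small illustrative array: the NaN is masked and therefore ignored
data = np.ma.masked_invalid([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])
stats = robust_stats(data)
print(stats)
```

On this toy array the median is 3 and the median absolute deviation is 1, so the NMAD equals the scaling factor itself.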

If an inlier mask is passed:

Total inlier count: number of data points in the inlier mask.
Valid inlier count: number of unmasked data points in the array after applying the inlier mask.
Percentage inlier points: ratio between Valid inlier count and Valid count, expressed as a percentage. Useful for classification statistics.
Percentage valid inlier points: ratio between Valid inlier count and Total inlier count, expressed as a percentage.
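To make the relationship between the counts and percentages concrete, the sketch below recomputes them with NumPy on a small hypothetical array. It follows the definitions given above and is not taken from the GeoUtils source; all variable names are illustrative.

```python
import numpy as np

# Hypothetical 2x3 array with one masked (invalid) value
data = np.ma.masked_array(
    [[1.0, 5.0, 9.0], [2.0, 7.0, 3.0]],
    mask=[[False, False, True], [False, False, False]],
)
# Hypothetical inlier mask: True where a pixel counts as an inlier
inlier_mask = np.array([[False, True, True], [False, True, False]])

valid = ~np.ma.getmaskarray(data)

valid_count = int(valid.sum())
total_count = int(data.size)
total_inlier_count = int(inlier_mask.sum())
valid_inlier_count = int((valid & inlier_mask).sum())

# Percentages as defined in the lists above
pct_valid = 100 * valid_count / total_count
pct_inlier = 100 * valid_inlier_count / valid_count
pct_valid_inlier = 100 * valid_inlier_count / total_inlier_count

print(valid_count, total_count, total_inlier_count, valid_inlier_count)
print(pct_valid, pct_inlier, pct_valid_inlier)
```

Here one of the three inlier pixels is masked, so Valid inlier count is 2 while Total inlier count is 3.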

Custom callable functions can also be passed as statistics.

import geoutils as gu
import numpy as np

# Instantiate a raster from a filename on disk
filename_rast = gu.examples.get_path("exploradores_aster_dem")
rast = gu.Raster(filename_rast)
rast

Raster(
  data=not_loaded; shape on disk (1, 618, 539); will load (618, 539)
  transform=| 30.00, 0.00, 627175.00|
            | 0.00,-30.00, 4852085.00|
            | 0.00, 0.00, 1.00|
  crs=EPSG:32718
  nodata=-9999.0)

Get all default statistics:

rast.get_stats()

{'Mean': np.float64(1497.2627007285762),
 'Median': np.float32(1276.0637),
 'Max': np.float32(3959.903),
 'Min': np.float32(317.71017),
 'Sum': np.float32(4.854036e+08),
 'Sum of squares': np.float32(8.3845474e+11),
 '90th percentile': np.float64(2359.6575439453127),
 'LE90': np.float64(1939.557562255859),
 'IQR': np.float64(530.6628723144531),
 'NMAD': np.float32(245.85695),
 'RMSE': np.float64(1608.1898819665316),
 'Standard deviation': np.float64(586.9235763179371),
 'Valid count': np.int64(324194),
 'Total count': 333102,
 'Percentage valid points': np.float64(97.32574406638207)}

Get a single statistic (e.g., ‘mean’) as a float:

rast.get_stats("mean")

np.float64(1497.2627007285762)

Get multiple statistics:

rast.get_stats(["mean", "max", "std"])

{'mean': np.float64(1497.2627007285762),
 'max': np.float32(3959.903),
 'std': np.float64(586.9235763179371)}

Using a custom callable statistic:

def custom_stat(data):
    # Count the number of pixels above 100
    return np.nansum(data > 100)


rast.get_stats(custom_stat)

np.int64(324194)

Passing an inlier mask:

inlier_mask = rast > 1500
rast.get_stats(inlier_mask=inlier_mask)

{'Mean': np.float64(2251.6460131830745),
 'Median': np.float32(2107.4634),
 'Max': np.float32(3959.903),
 'Min': np.float32(1500.0018),
 'Sum': np.float32(2.1178982e+08),
 'Sum of squares': np.float32(5.0793883e+11),
 '90th percentile': np.float64(3167.4931884765624),
 'LE90': np.float64(1787.9011047363283),
 'IQR': np.float64(816.0797119140625),
 'NMAD': np.float32(542.05273),
 'RMSE': np.float64(2323.8239371057207),
 'Standard deviation': np.float64(574.6724350254293),
 'Valid count': np.int64(324194),
 'Total count': 333102,
 'Percentage valid points': np.float64(97.32574406638207),
 'Valid inlier count': np.int64(94060),
 'Total inlier count': np.int64(94060),
 'Percentage inlier points': np.float64(29.013491921503793),
 'Percentage valid inlier points': np.float64(100.0)}

Subsampling#

The subsample() method efficiently extracts a random subsample of valid values from a raster or a point cloud. It can return the output either as a point cloud or as an array.

The subsample size can be defined either as a fraction of valid values (a float strictly between 0 and 1) or as a number of samples (an integer greater than 1).

# Subsample 10% of the raster valid values
rast.subsample(subsample=0.1)

masked_array(data=[1234.7056884765625, 1139.477294921875,
                   1033.6265869140625, ..., 1050.069091796875,
                   2192.496826171875, 907.2034301757812],
             mask=[False, False, False, ..., False, False, False],
       fill_value=-9999.0,
            dtype=float32)
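The two ways of specifying a subsample size can be illustrated with a NumPy-only sketch. This is an assumption-laden illustration rather than the GeoUtils implementation: the function subsample_valid and its random_state argument are hypothetical names, and sampling is done without replacement.

```python
import numpy as np


def subsample_valid(values, subsample, random_state=None):
    """Hypothetical helper: random subsample of the valid (unmasked) values."""
    rng = np.random.default_rng(random_state)
    # Mask non-finite values, then flatten to the valid values only
    valid = np.ma.masked_invalid(values).compressed()

    if isinstance(subsample, float) and 0 < subsample < 1:
        # Fraction of the valid values
        size = int(subsample * valid.size)
    elif isinstance(subsample, (int, np.integer)) and subsample > 1:
        # Absolute number of samples, capped at the number of valid values
        size = min(int(subsample), valid.size)
    else:
        raise ValueError("subsample must be a float in (0, 1) or an integer > 1")

    # Draw without replacement so that no value is picked twice
    return rng.choice(valid, size=size, replace=False)


# 100 valid values plus one NaN that must never be sampled
data = np.array([np.nan] + list(range(1, 101)), dtype="float64")
by_fraction = subsample_valid(data, 0.1, random_state=42)
by_count = subsample_valid(data, 25, random_state=42)
print(by_fraction.size, by_count.size)
```

Fixing the seed makes the draw reproducible, which is convenient when comparing statistics across runs.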
