This article uses examples to explain in detail how to compute zonal statistics on raster data with Python. To obtain the sample data and code, including notes on how the code of this tool was designed, follow the official account GeodataAnalysis and reply 20230401.
A zonal statistics operation computes statistics on the cell values of a raster (the value raster) within zones defined by another dataset. The raster is first divided into zones according to the vector data; the pixel values of each zone are then extracted and summarized separately, and the results are written out (usually into a field of the input vector).
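As a rough illustration of the idea, independent of any GIS library, the sketch below assumes the zones have already been rasterized into an integer label array aligned with the value raster. The arrays `values` and `zones` and the helper `zonal_mean` are hypothetical names made up for this example; it computes one mean per zone with NumPy.

```python
import numpy as np

# Hypothetical 3x4 value raster and a zone label raster of the same shape.
values = np.array([[1.0, 2.0, 3.0, 4.0],
                   [5.0, 6.0, 7.0, 8.0],
                   [9.0, 10.0, 11.0, 12.0]])
zones = np.array([[1, 1, 2, 2],
                  [1, 1, 2, 2],
                  [3, 3, 3, 3]])


def zonal_mean(values, zones):
    """Return {zone_id: mean of the value pixels that fall in that zone}."""
    result = {}
    for zone_id in np.unique(zones):
        # Boolean indexing extracts the pixels belonging to this zone.
        result[int(zone_id)] = float(values[zones == zone_id].mean())
    return result


zone_means = zonal_mean(values, zones)
print(zone_means)
```

The tool described in this article follows the same pattern, except that each zone comes from a vector geometry rather than from a precomputed label raster.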
ArcGIS provides raster zonal statistics tools, but they are not very flexible: they only support built-in statistics such as the maximum and minimum. This article introduces a zonal statistics tool for raster data built on the open-source Python libraries rasterio and geopandas, which handles zonal statistics more flexibly and can meet more diverse needs.
Let's first see how to use it. The whole tool is wrapped in a Python class. To use it, first create an instance of the class; the initialization parameters are the file paths of the raster and the vector data, for example:
zs = ZonalStatistics(tif_path, shp_path)
To run zonal statistics, you only need to call the zonal_statistics method of the class instance. This method has four parameters, with the following meanings:
zone_field: String, required. The field of the vector data that receives the result; if the field does not exist, it is created.
statistics_type: String, the statistic to compute. Required when func is empty. The available options are min, max, mean, std, sum, range, and median, which stand for the minimum, maximum, mean, standard deviation, sum, range, and median respectively.
func: User-defined function. Required when statistics_type is empty. Note that at run time the input of this function is a numpy masked array.
all_touched: Boolean, optional. If True, all pixels touched by the vector's geometries take part in the calculation; if False, only pixels whose center lies inside the polygon take part.
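To make the all_touched option concrete, the following sketch mimics both rules for an axis-aligned rectangular zone on a grid of unit pixels. It is a simplified stand-in written for illustration, not the rasterizer that rasterio uses; the helper name `rect_mask` and the grid layout (pixel (row, col) spans x in [col, col + 1] and y in [row, row + 1]) are assumptions.

```python
import numpy as np


def rect_mask(shape, xmin, ymin, xmax, ymax, all_touched=False):
    """Boolean mask of unit pixels selected by an axis-aligned rectangle.

    Pixel (row, col) covers x in [col, col + 1] and y in [row, row + 1].
    """
    rows, cols = np.indices(shape)
    if all_touched:
        # Select every pixel whose area overlaps the rectangle.
        in_x = (cols < xmax) & (cols + 1 > xmin)
        in_y = (rows < ymax) & (rows + 1 > ymin)
    else:
        # Select only pixels whose center lies inside the rectangle.
        in_x = (cols + 0.5 >= xmin) & (cols + 0.5 <= xmax)
        in_y = (rows + 0.5 >= ymin) & (rows + 0.5 <= ymax)
    return in_x & in_y


# A rectangle from (0.6, 0.6) to (2.4, 2.4) on a 4x4 grid.
center_only = rect_mask((4, 4), 0.6, 0.6, 2.4, 2.4, all_touched=False)
touched = rect_mask((4, 4), 0.6, 0.6, 2.4, 2.4, all_touched=True)
print(center_only.sum(), touched.sum())
```

In this example the center rule keeps a single pixel while the touch rule keeps nine, so the choice matters most when polygons are small relative to the pixel size.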
Example calls to this method are as follows:
zs.zonal_statistics('min', 'min')
zs.zonal_statistics('max', 'max')
zs.zonal_statistics('range', func=lambda x: np.ma.max(x)-np.ma.min(x))
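Because func receives a NumPy masked array, a custom statistic should take the mask into account. The sketch below shows one possible way to write a percentile function: it drops masked pixels with compressed() before calling np.percentile, and returns np.ma.masked for a zone without valid pixels, a value that the class code given at the end of this article skips. The names `percentile_func`, `p50` and `sample` are hypothetical.

```python
import numpy as np


def percentile_func(q):
    """Build a function that computes the q-th percentile of unmasked pixels."""
    def _func(arr):
        valid = arr.compressed()  # 1-D array holding only the unmasked values
        if valid.size == 0:
            return np.ma.masked  # no valid pixel in this zone
        return float(np.percentile(valid, q))
    return _func


p50 = percentile_func(50)
# The last value is masked, so it does not influence the result.
sample = np.ma.masked_array([1, 2, 3, 4, 100], mask=[0, 0, 0, 0, 1])
print(p50(sample))
```

Such a function could then be passed to the tool as, for example, zs.zonal_statistics('p90', func=percentile_func(90)).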
Saving the result is also very simple. The class uses geopandas to handle the vector data, and its gdf attribute is a GeoDataFrame, so the result can be saved directly through it, as in the following example:
zs.gdf.to_file('./result/zonal_statistics.shp', encoding='utf-8')
The calculation results are as follows, displayed by the maximum value field:
The code of the ZonalStatistics class is as follows for reference:
import geopandas as gpd
import numpy as np
import rasterio as rio
from rasterio import features
from rasterio.windows import Window


class ZonalStatistics(object):

    def __init__(self, raster_path, shp_path) -> None:
        self.raster_path = raster_path
        self.gdf = gpd.read_file(shp_path)
        self._init_params(self.raster_path)

    def _init_params(self, raster_path):
        # Cache the georeferencing; the context manager closes the file.
        with rio.open(raster_path) as src:
            self.transform = src.transform
            self.crs = src.crs
            self.shape = src.shape

    def _geometry_mask(self, src, geometries, all_touched=False):
        # Accept an open dataset or a path; only shape and transform are used.
        if isinstance(src, rio.DatasetReader):
            out_shape, transform = src.shape, src.transform
        elif isinstance(src, str):
            with rio.open(src) as ds:
                out_shape, transform = ds.shape, ds.transform
        else:
            raise ValueError('src must be an open dataset or a file path')
        if not isinstance(geometries, (tuple, list)):
            raise ValueError('geometries must be a tuple or a list')
        # invert=True marks the pixels inside the geometries as True.
        geometry_mask = features.geometry_mask(
            geometries=geometries,
            out_shape=out_shape,
            transform=transform,
            all_touched=all_touched,
            invert=True)
        return geometry_mask

    def _valid_range(self, mask):
        # First and last row/column that contain at least one True pixel.
        col_index = np.where(np.any(mask, axis=0))[0]
        row_index = np.where(np.any(mask, axis=1))[0]
        min_col, max_col = int(col_index.min()), int(col_index.max())
        min_row, max_row = int(row_index.min()), int(row_index.max())
        return (min_row, max_row), (min_col, max_col)

    def _read_from_geometry(self, geometries, all_touched=False):
        with rio.open(self.raster_path) as src:
            mask = self._geometry_mask(src, geometries, all_touched)
            if not mask.any():
                # The geometries do not cover any pixel of the raster.
                return None
            (min_row, max_row), (min_col, max_col) = self._valid_range(mask)
            # Read only the window of band 1 that bounds the geometries.
            window = Window.from_slices(rows=(min_row, max_row+1),
                                        cols=(min_col, max_col+1))
            geom_array = src.read(1, window=window)
            nodata = src.nodata
        # Mask pixels outside the geometries, nodata pixels and NaNs.
        geom_mask = ~mask[min_row:max_row+1, min_col:max_col+1]
        nodata_mask = (geom_array == nodata)
        nan_mask = np.isnan(geom_array)
        geom_array = np.ma.masked_array(geom_array,
                                        geom_mask | nodata_mask | nan_mask)
        return geom_array

    def _statistics(self, geom_array, statistics_type):
        if statistics_type == 'min':
            return np.ma.min(geom_array)
        elif statistics_type == 'max':
            return np.ma.max(geom_array)
        elif statistics_type == 'median':
            return np.ma.median(geom_array)
        elif statistics_type == 'sum':
            return np.ma.sum(geom_array)
        elif statistics_type == 'std':
            return np.ma.std(geom_array)
        elif statistics_type == 'range':
            return np.ma.max(geom_array)-np.ma.min(geom_array)
        elif statistics_type == 'mean':
            return np.ma.mean(geom_array)
        else:
            raise ValueError('unknown statistics_type: %r' % statistics_type)

    def zonal_statistics(self, zone_field, statistics_type=None,
                         func=None, all_touched=False):
        # Validate once, before looping over the features.
        if func is None and statistics_type is None:
            raise ValueError('either statistics_type or func is required')
        # Iterate with the real index labels so that .loc writes to the
        # right row even if the index is not 0..n-1.
        for i, geom in zip(self.gdf.index, self.gdf.geometry):
            geom_array = self._read_from_geometry([geom], all_touched)
            if geom_array is None:
                continue
            if func is None:
                value = self._statistics(geom_array, statistics_type)
            else:
                value = func(geom_array)
            # A fully masked zone has no valid pixel; leave its field empty.
            if isinstance(value, type(np.ma.masked)):
                continue
            self.gdf.loc[i, zone_field] = value
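
The cropping and masking steps performed by _valid_range and _read_from_geometry can be tried out on plain NumPy arrays, without opening a raster. In the sketch below, the arrays `inside` and `band` and the nodata value -9999 are made-up inputs chosen for illustration; the logic is meant to mirror the class code rather than replace it.

```python
import numpy as np

# Made-up inputs: True marks pixels inside the zone; -9999 is nodata.
inside = np.zeros((5, 6), dtype=bool)
inside[1:4, 2:5] = True
band = np.arange(30, dtype=float).reshape(5, 6)
band[2, 3] = -9999.0
nodata = -9999.0

# Bounding rows and columns of the zone, as in _valid_range.
rows = np.where(inside.any(axis=1))[0]
cols = np.where(inside.any(axis=0))[0]
row_slice = slice(rows.min(), rows.max() + 1)
col_slice = slice(cols.min(), cols.max() + 1)

# Crop to the bounding window, then mask pixels outside the zone,
# nodata pixels and NaNs, as in _read_from_geometry.
window = band[row_slice, col_slice]
invalid = ~inside[row_slice, col_slice] | (window == nodata) | np.isnan(window)
zone_pixels = np.ma.masked_array(window, mask=invalid)

print(zone_pixels.shape, zone_pixels.count(), zone_pixels.max())
```

Reading only the bounding window keeps memory use proportional to the size of each zone instead of the whole raster, which is presumably why the class crops before computing statistics.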