Geospatial Analysis 2 – A critical step in optimizing geospatial analysis: A closer look at data cleaning and preprocessing

Foreword

When performing geospatial analysis, data quality is a key factor in ensuring accuracy and reliability. Data cleaning and preprocessing are fundamental steps in ensuring that geospatial datasets are suitable for analysis. This article will delve into the importance of data cleaning in geospatial analysis and introduce the basic process of performing data cleaning in Python.

1. The importance and basic process of data cleaning in geospatial analysis

Importance:

Geospatial data is collected from multiple sources and comes in different formats; it may contain missing values, outliers, and inconsistent coordinate systems. If these issues are not properly handled, they will seriously affect subsequent analysis. Data cleaning ensures the consistency of the dataset, eliminates potentially misleading factors, and improves the credibility of the analysis.

The basic process of data cleaning in Python:

In Python, libraries such as Pandas and NumPy provide rich tools for performing various data cleaning tasks. The basic process includes data loading, missing value detection and filling, outlier identification and processing, and projection and coordinate conversion.
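That basic process can be sketched as a minimal pandas pipeline; the column names and thresholds here are hypothetical, chosen only to illustrate the steps:

```python
import pandas as pd
import numpy as np

# A tiny, hypothetical dataset exhibiting the typical problems
df = pd.DataFrame({
    'latitude': [37.77, np.nan, 37.80, 95.0],     # 95.0 is out of range
    'longitude': [-122.42, -122.40, np.nan, -122.41],
    'population': [1200, np.nan, 3400, 5600],
})

# 1. Detect missing values
print(df.isnull().sum())

# 2. Fill them: interpolation for coordinates, the mean for attributes
df['latitude'] = df['latitude'].interpolate()
df['longitude'] = df['longitude'].interpolate()
df['population'] = df['population'].fillna(df['population'].mean())

# 3. Drop rows whose coordinates fall outside valid bounds
df = df[df['latitude'].between(-90, 90) & df['longitude'].between(-180, 180)]
```

The remaining step, projection and coordinate conversion, is covered in section 5.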

2. Data loading

Data loading is the first step in data cleaning and is usually performed with the Pandas library. Pandas provides functions such as read_csv and read_excel that make it easy to load data in various formats.

import pandas as pd

# Load the geospatial data
geo_data = pd.read_csv('geo_data.csv')

3. Handling missing values

In geospatial data analysis, missing values can result from a variety of causes, such as sensor failure, incomplete data transfer, or human error during data collection. The following are some missing-data situations you may encounter and how to handle them.

First, let's simulate a dataset containing missing values:

import pandas as pd
import numpy as np

# Build a synthetic geospatial dataset
np.random.seed(12)
num_samples = 100

geo_data = pd.DataFrame({
    'latitude': np.random.uniform(35, 40, num_samples),
    'longitude': np.random.uniform(-120, -80, num_samples),
    'address': [f'Location_{i}' for i in range(num_samples)],
    'population': np.random.randint(1000, 10000, num_samples),
    # regional attribute, populated up front so that only rows 90-95 go missing below
    'attribute': np.random.uniform(0, 1, num_samples),
    'timestamp': pd.date_range(start='2022-01-01', periods=num_samples, freq='D')
})

# Introduce some missing data
# Missing coordinate information
geo_data.loc[10:20, 'latitude'] = np.nan
geo_data.loc[30:40, 'longitude'] = np.nan

# Missing attribute information
geo_data.loc[50:60, 'population'] = np.nan

# Missing time information
geo_data.loc[70:80, 'timestamp'] = pd.NaT

# Missing regional data
geo_data.loc[90:95, 'attribute'] = np.nan

# Missing spatial relationship data (these two columns are created here as all-NaN)
geo_data.loc[5:15, 'target_lat'] = np.nan
geo_data.loc[25:35, 'target_lon'] = np.nan

# Inspect the missing data
print("Missing data summary:\n", geo_data.isnull().sum())

3.1 Missing coordinate information:

Latitude and longitude information may be missing from geospatial data due to GPS signal outages. Interpolation is generally used to fill in missing coordinate values to preserve the integrity of the geospatial data in analysis.

# Fill missing coordinate values by interpolation
geo_data['latitude'] = geo_data['latitude'].interpolate()
geo_data['longitude'] = geo_data['longitude'].interpolate()

3.2 Missing attribute information:

Some geospatial data are missing demographic information, possibly due to missed surveys or data not being updated in a timely manner. Missing attribute information is often filled with the column mean to maintain overall data consistency.

# Fill missing demographic data with the column mean
geo_data['population'] = geo_data['population'].fillna(geo_data['population'].mean())

3.3 Missing time information:

Timestamp information in geospatial data is missing, possibly due to data collection equipment failure or recording errors. Missing timestamps can be filled using the previous non-missing value to maintain time series continuity.

# Fill missing timestamps with the previous non-missing value
geo_data['timestamp'] = geo_data['timestamp'].ffill()

3.4 Missing regional data:

A lack of data in certain geographic areas may leave parts of the geospatial extent uncovered in the analysis. You can use spatial interpolation methods, such as kriging, to fill in the missing regional data.

Kriging is an interpolation method used to estimate values at unsampled geographic locations. Its basic principle is to predict the value at unknown points from the spatial relationships among known points. The method is commonly used in geospatial data analysis, geology, earth science, and environmental science.

Specifically, kriging is based on the concept of spatial autocorrelation, assuming a degree of spatial correlation between values at nearby geographic locations. The main idea is to estimate values at unknown locations by modeling the spatial covariance (or variogram) function, taking the variability of the geospatial data into account during interpolation.

The result of interpolation takes into account not only the distance between spatial points, but also the spatial correlation between them. This makes kriging interpolation more accurate than simple interpolation methods, especially when processing geospatial data.

Kriging interpolation methods are mainly divided into the following types:

  1. Simple Kriging: assumes the mean of the field is known and constant across the study area.

  2. Ordinary Kriging: assumes the mean is constant but unknown, estimating it implicitly while modeling the spatial variability; this is the most commonly used variant.

  3. Universal Kriging: extends ordinary kriging with a trend (drift) term, making it more suitable for data with a systematic spatial trend.

The application of kriging interpolation usually requires appropriate selection and adjustment of spatial covariance functions and model parameters. The advantage of this method is that it can provide an estimate of the estimation error while fully utilizing the characteristics of the spatial data during the interpolation process.
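To make the mechanics concrete, here is a minimal NumPy sketch of the ordinary kriging system; the linear variogram and the toy points are illustrative assumptions, and in practice the variogram model is fitted to the data (e.g. with a library such as pykrige):

```python
import numpy as np

def ordinary_kriging(points, values, target, variogram=lambda h: h):
    """Ordinary kriging estimate at `target` from known (points, values).

    points: (n, 2) array of known coordinates
    values: (n,) array of known values
    variogram: variogram model gamma(h); a linear model is the default
    """
    n = len(points)
    # Pairwise distances between the known points
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # Kriging system: variogram matrix bordered by the unbiasedness constraint
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = variogram(d)
    A[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = variogram(np.linalg.norm(points - target, axis=1))
    # Solve for the weights; the last unknown is the Lagrange multiplier
    w = np.linalg.solve(A, b)
    return float(w[:n] @ values)

# The midpoint of two known points receives equal weights:
print(ordinary_kriging(np.array([[0.0, 0.0], [2.0, 0.0]]),
                       np.array([0.0, 10.0]),
                       np.array([1.0, 0.0])))  # → 5.0
```

The same bordered system can be extended to compute the kriging variance, which is what provides the estimate of the estimation error mentioned above.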

Below is an example. Note that SciPy does not implement kriging itself; its griddata function performs deterministic (here cubic) interpolation and serves as a lightweight stand-in, while true kriging requires a dedicated library such as pykrige:

from scipy.interpolate import griddata

# Fill the missing 'attribute' rows from the known rows
# (cubic interpolation as a stand-in for kriging; points outside the
# convex hull of the known points would remain NaN)
known = geo_data['attribute'].notna()
geo_data.loc[~known, 'attribute'] = griddata(
    geo_data.loc[known, ['longitude', 'latitude']].values,
    geo_data.loc[known, 'attribute'].values,
    geo_data.loc[~known, ['longitude', 'latitude']].values,
    method='cubic')

3.5 Missing multi-source data merging:

When integrating multiple geospatial data sources, data for certain geographic points may be missing from one of the sources. You can use an appropriate merge method, such as a left join or inner join, and decide how to handle the resulting missing values based on your specific business needs.

# Merge with Pandas (on the shared column 'common_column')
merged_data = pd.merge(geo_data1, geo_data2, on='common_column', how='left')

3.6 Missing spatial relationship data:

Some geospatial data lack spatial relationship information relative to other locations, such as distances. You can compute the geodesic distance between each point and a reference location and use it to fill in the missing spatial relationship data.

from geopy.distance import geodesic

# target_lat and target_lon are the coordinates of the reference location
target_lat, target_lon = 40.7128, -74.0060

# Compute the geodesic distance (in km) from each row to the reference point
geo_data['distance'] = geo_data.apply(lambda row: geodesic((row['latitude'], row['longitude']), (target_lat, target_lon)).km, axis=1)
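If geopy is unavailable, a great-circle (haversine) approximation can be computed with the standard library alone; it treats the Earth as a sphere, so it is slightly less accurate than geopy's geodesic, which uses an ellipsoid:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    R = 6371.0  # mean Earth radius in km
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))

# Distance from a sample point to the reference location used above
print(haversine_km(37.5, -100.0, 40.7128, -74.0060))
```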

4. Outlier handling

In geospatial data analysis, outliers can adversely affect the results. The following are some outlier situations you may encounter and how to deal with them accordingly.

4.1 Abnormal coordinate values:

Geographic coordinate values that deviate from the plausible geographic range may be caused by measurement errors or equipment failure. Statistical methods (such as the Z-score) can be used to identify outliers, after which you can decide whether to delete or correct them.

from scipy import stats

# Compute Z-scores for the coordinate columns
z_scores = stats.zscore(geo_data[['latitude', 'longitude']])

# Define a threshold
threshold = 3

# Flag outliers
outliers = (abs(z_scores) > threshold).any(axis=1)

# Handle outliers (here: drop them)
geo_data_no_outliers = geo_data[~outliers]

4.2 Abnormal demographic data:

Demographic data may deviate abnormally from normal ranges, possibly due to recording errors or measurement bias. You can use domain knowledge and business rules to define reasonable ranges, then identify and handle the outliers.

# Define a reasonable range for the demographic data
population_lower_limit = 1000
population_upper_limit = 10000

# Flag outliers
population_outliers = (geo_data['population'] < population_lower_limit) | (geo_data['population'] > population_upper_limit)

# Handle outliers (here: drop them)
geo_data_no_population_outliers = geo_data[~population_outliers]

4.3 Abnormal distance data:

Distances between two points in geospatial data may deviate from the actual geographic distance, possibly due to incorrect coordinates or data-transfer issues. You can use geoinformatics knowledge to define reasonable distance ranges, then identify and handle the outliers.

# Define a reasonable distance range (based on business needs and geoinformatics knowledge)
distance_lower_limit = 0
distance_upper_limit = 1000

# Flag outliers
distance_outliers = (geo_data['distance'] < distance_lower_limit) | (geo_data['distance'] > distance_upper_limit)

# Handle outliers (here: drop them)
geo_data_no_distance_outliers = geo_data[~distance_outliers]

4.4 Geospatial boundary anomalies:

Some points in the geospatial data fall outside valid geographic boundaries, possibly due to data-acquisition errors or coordinate-conversion issues. Geographic knowledge can be used to define reasonable boundaries and exclude the points beyond them.

# Define valid geographic bounds
valid_boundary = {'min_longitude': -180, 'max_longitude': 180,
                  'min_latitude': -90, 'max_latitude': 90}

# Flag points that fall outside the bounds
boundary_outliers = (
    (geo_data['longitude'] < valid_boundary['min_longitude']) | 
    (geo_data['longitude'] > valid_boundary['max_longitude']) | 
    (geo_data['latitude'] < valid_boundary['min_latitude']) | 
    (geo_data['latitude'] > valid_boundary['max_latitude'])
)

# Handle out-of-boundary points (here: drop them)
geo_data_within_boundary = geo_data[~boundary_outliers]

4.5 Abnormal geographical area data:

In geospatial data, certain areas have significantly abnormal attribute values, possibly due to measurement errors or unusual events. Spatial statistical methods, such as Local Outlier Factor (LOF), can be used to identify abnormal geographic area data.

from sklearn.neighbors import LocalOutlierFactor

# Use the Local Outlier Factor to flag anomalous geographic regions
# (the feature columns must be free of NaN values at this point)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
geo_data['is_outlier'] = lof.fit_predict(geo_data[['latitude', 'longitude', 'attribute']])
geo_data_no_region_outliers = geo_data[geo_data['is_outlier'] != -1].drop('is_outlier', axis=1)

5. Projection and coordinate conversion

In geospatial data analysis, map projection and coordinate system are two key concepts. Different data sets may use different coordinate systems and projections, which may lead to biases in the analysis process. Below we will introduce these two concepts in detail and show how to use Python libraries for coordinate transformation to ensure the consistency and comparability of geospatial data.

5.1 The concept of map projection and coordinate system:

  • Map projection: The Earth is a three-dimensional spheroid, while maps are usually two-dimensional representations on a flat surface. A map projection is a method of mapping points on the Earth's surface onto a plane. Different projections have different properties, such as preserving distances (equidistant) or angles (conformal), and suit different application scenarios.

  • Coordinate System: A coordinate system is a set of rules used to define a geospatial location. The latitude and longitude coordinate system is the most common, where longitude represents a position in the east-west direction and latitude represents a position in the north-south direction. In addition to latitude and longitude coordinate systems, there are various local coordinate systems and projected coordinate systems, such as UTM (Universal Transverse Mercator projection).
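To make the idea of a projection concrete, the spherical Web Mercator mapping used by web maps (EPSG:3857) can be written out directly. This is only a sketch of the underlying formulas; production code should rely on a library such as pyproj:

```python
from math import radians, log, tan, pi

R = 6378137.0  # WGS 84 semi-major axis; EPSG:3857 treats the Earth as a sphere

def to_web_mercator(lon, lat):
    """Project WGS 84 lon/lat (degrees) onto Web Mercator x/y (metres)."""
    x = R * radians(lon)
    y = R * log(tan(pi / 4 + radians(lat) / 2))
    return x, y

print(to_web_mercator(-122.4194, 37.7749))  # San Francisco
```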

5.2 Importance of coordinate transformation:

When you perform analyses using geospatial datasets with different coordinate systems or map projections, coordinate transformations are often required to ensure data consistency and comparability. A coordinate transformation maps a point from one coordinate system or projection to another.

5.3 Use the Python library for coordinate conversion:

In Python, several powerful libraries perform geospatial coordinate transformations, the most commonly used being pyproj. Here is a simple example showing how to convert latitude and longitude coordinates to Web Mercator projected coordinates:

from pyproj import Transformer

# Define the transformation: WGS 84 (EPSG:4326, lat/lon) to Web Mercator (EPSG:3857)
transformer = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)

# Transform the coordinates
longitude, latitude = -122.4194, 37.7749  # San Francisco
x, y = transformer.transform(longitude, latitude)

print(f"Original coordinates: longitude {longitude}, latitude {latitude}")
print(f"Transformed coordinates: x {x}, y {y}")

In this example, we define a source coordinate system (WGS 84, latitude and longitude) and a target coordinate system (the Web Mercator projected coordinate system), then use the pyproj library to perform the coordinate transformation. This is important to ensure consistency and comparability of geospatial data during analysis.

5.4 Geocoding and reverse geocoding:

In addition to coordinate transformation, geocoding and reverse geocoding are also commonly used techniques in geospatial data processing.

  • Geocoding: The process of converting an address or place name into geographic coordinates (latitude and longitude). This is useful for extracting location information from non-spatial data.
from geopy.geocoders import Nominatim

# Create a geocoder
geolocator = Nominatim(user_agent="geo_coder")

# Geocode an address
location = geolocator.geocode("Statue of Liberty")
print(f"Coordinates of the Statue of Liberty: {location.latitude}, {location.longitude}")
  • Reverse geocoding: The process of converting geographic coordinates into addresses or place names. This is useful for converting coordinate information into a human-readable description of a location.
# Reverse geocoding
address = geolocator.reverse((40.689247, -74.044502))
print(f"Location at (40.689247, -74.044502): {address.address}")

These two processes are very important for processing location information in geospatial data, and they can help you better understand and utilize geospatial data.

5.5 Practical applications of coordinate transformation:

In practical applications, you may encounter situations where you need to convert a geospatial dataset from one coordinate system or map projection to another. For example, you might have data containing the latitude and longitude coordinates of cities and want to display those cities on a web map, which requires converting the coordinates to Web Mercator. In that case, you can use pyproj to perform batch coordinate conversion.

from pyproj import Transformer
import pandas as pd

# Example data: city latitude/longitude coordinates
cities_data = pd.DataFrame({
    'City': ['New York', 'Paris', 'Tokyo'],
    'Latitude': [40.7128, 48.8566, 35.6895],
    'Longitude': [-74.0060, 2.3522, 139.6917]
})

# Define the transformation: WGS 84 (EPSG:4326) to Web Mercator (EPSG:3857)
transformer = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)

# Batch-transform the coordinates
cities_data['x'], cities_data['y'] = transformer.transform(
    cities_data['Longitude'].values, cities_data['Latitude'].values)

print(cities_data[['City', 'x', 'y']])

Through this example, you can see how to batch convert the latitude and longitude coordinates of a set of cities into Web Mercator projected coordinates for visualization on a web map.

Afterword

Data cleaning and preprocessing are the cornerstones of geospatial analysis. While this article delves into some of the key steps, it is only a beginning. To go further, participate in real projects, read the relevant literature, and join community discussions. Only by continuously improving your skills can you meet the challenges posed by the many kinds of geospatial data analysis. I hope this article gives readers a solid understanding of data cleaning and preprocessing, and that they can flexibly apply what they have learned in practice.

Origin blog.csdn.net/qq_41780234/article/details/135364428