[Python] Data mining, analysis, and cleaning: a summary of outlier handling methods


Link to this article: https://blog.csdn.net/weixin_47058355/article/details/129949060?spm=1001.2014.3001.5501

Foreword

Handling outliers improves the accuracy and reliability of data analysis. Outliers distort summary statistics such as the mean and variance, which can lead to wrong conclusions or predictions. They can also degrade model fitting, weakening the model's ability to explain the data.
For data analysis tasks we therefore usually deal with outliers to keep the data as clean and accurate as possible. Common approaches include removing outliers, replacing them with another value, and treating them as missing values. Which approach to use depends on the data type and the requirements of the task.
This article uses the Titanic data set, which is available on Kaggle.

1. Identify outliers

1.1 Detecting outliers with a box plot

A box plot is a chart that shows the distribution of data and is an effective way to spot outliers. It is built from five summary values: the minimum, lower quartile (Q1), median (Q2), upper quartile (Q3), and maximum.
In a box plot, the upper and lower edges of the box mark Q3 and Q1, and the line inside the box marks the median. Two line segments called "whiskers" extend from the top and bottom of the box, typically to the largest and smallest values in the data set that are not outliers.
Outliers are drawn as separate points beyond the whiskers, so they are easy to pick out: any point that falls outside the whiskers is treated as an outlier. This is how a box plot detects outliers.
IQR, the interquartile range, is the distance between the upper quartile (Q3) and the lower quartile (Q1). When a box plot is used for outlier detection, the outlier thresholds are usually derived from the IQR.

Specifically, the threshold for outliers can be calculated using the following formula:

Upper bound: Q3 + 1.5 * IQR
Lower bound: Q1 - 1.5 * IQR

A data point is considered an outlier if it is less than the lower bound or greater than the upper bound.

For example, take the data set [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22]. With linearly interpolated quartiles (the default behavior of the pandas quantile method), Q1 = 7, Q2 = 12, and Q3 = 17, so IQR = 10 (that is, 17 - 7). By the formulas above, the lower bound is 7 - 15 = -8 and the upper bound is 17 + 15 = 32. Values below -8 or above 32 would be considered outliers; this data set contains none.
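The arithmetic in this example can be checked with a short sketch. It uses only the standard library and assumes linearly interpolated quartiles (the "inclusive" method of statistics.quantiles); quartile values depend on the interpolation convention, so other conventions can give slightly different numbers. The helper name iqr_bounds is hypothetical and introduced only for illustration.

```python
from statistics import quantiles

def iqr_bounds(values, scale=1.5):
    """Return (Q1, Q2, Q3, lower bound, upper bound) for a list of numbers.

    Quartiles are linearly interpolated ("inclusive" method).
    """
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1, q2, q3, q1 - scale * iqr, q3 + scale * iqr

sample = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22]
q1, q2, q3, lower, upper = iqr_bounds(sample)
print(q1, q2, q3)    # 7.0 12.0 17.0
print(lower, upper)  # -8.0 32.0

# Values outside [lower, upper] are flagged as outliers
outliers = [v for v in sample if v < lower or v > upper]
print(outliers)      # []
```

Every value lies inside the bounds, so the rule flags nothing in this evenly spaced sample.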

Because the IQR rule relies only on quartiles, it can identify outliers without assuming that the data follows a particular distribution shape.

# Reusable helper: cap outliers at the box plot (IQR) bounds
def outliers_proc(data, col_name, scale=1.5):
    """
    data: pandas DataFrame
    col_name: name of the column to process
    scale: multiple of the IQR used to set the bounds
    """
    data_col = data[col_name]
    Q1 = data_col.quantile(0.25)  # 0.25 quantile
    Q3 = data_col.quantile(0.75)  # 0.75 quantile
    IQR = Q3 - Q1
    # clip() returns a new Series, so the input DataFrame is not modified in place
    return data_col.clip(lower=Q1 - scale * IQR, upper=Q3 + scale * IQR)

# `data` is assumed to be the Titanic DataFrame, already loaded with pandas
data['Fare'] = outliers_proc(data, 'Fare')
print(data['Fare'].max())
data['Fare']
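The function above caps the outliers directly. If the goal is only to identify them first, a boolean mask is one option. The following is a minimal sketch, not part of the original code: the helper name iqr_outlier_mask is hypothetical and the fare values are made up for illustration.

```python
import pandas as pd

def iqr_outlier_mask(series, scale=1.5):
    """Return a boolean Series that is True where a value lies outside the IQR bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - scale * iqr) | (series > q3 + scale * iqr)

# Made-up fares, used only to illustrate the mask
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 512.33])
mask = iqr_outlier_mask(fares)
print(fares[mask])  # only the 512.33 entry is flagged
```

Keeping detection separate from treatment makes it easy to inspect the flagged rows before deciding how to handle them.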


The data can also be visualized as a box plot:

import matplotlib.pyplot as plt

# Draw the box plot
fig, ax = plt.subplots()
ax.boxplot(data['Fare'])

# Add a title and an axis label
ax.set_title('Box plot of Fare')
ax.set_ylabel('Fare')

# Show the figure
plt.show()


1.2 3σ principle

The 3σ rule for the normal distribution states that, for a normally distributed random variable, about 68% of observations fall within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three. The rule can be used to judge whether a data point is an outlier: a value more than three standard deviations from the mean is treated as an outlier.
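The three percentages follow from the normal distribution itself: the probability of falling within k standard deviations of the mean is erf(k / √2). The sketch below checks this with the standard library only; the function name coverage_within is hypothetical.

```python
from math import erf, sqrt

def coverage_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(coverage_within(k) * 100, 2))
# 1 68.27
# 2 95.45
# 3 99.73
```

These figures hold only for normally distributed data; on a heavily skewed column, the share of values beyond three standard deviations can be quite different.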

def find_anomalies(random_data):
    """Cap values that lie more than three standard deviations from the mean."""
    random_data_std = random_data.std()
    random_data_mean = random_data.mean()
    anomaly_cut_off = random_data_std * 3

    lower_limit = random_data_mean - anomaly_cut_off
    upper_limit = random_data_mean + anomaly_cut_off

    # clip() returns a new Series instead of modifying the input in place
    return random_data.clip(lower=lower_limit, upper=upper_limit)

# Assign the result back so the DataFrame column is actually updated
data['Fare'] = find_anomalies(data['Fare'])
print(data['Fare'].max())
data['Fare']


1.3 Box-Cox transformation

The transformation proposed by Box and Cox in 1964 helps a linear regression model come closer to satisfying linearity, independence, homoscedasticity, and normality without losing information. Real data rarely meets all four assumptions, and many statistical methods (those based on the Pearson correlation coefficient, for example) assume that the data is approximately normally distributed. A Box-Cox transformation can therefore be used to reshape the distribution of the data.
Note that every value in the column must be strictly greater than 0. If any value is zero or negative, SciPy raises the error "Data must be positive".
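For a fixed parameter λ, the Box-Cox transform of a positive value x is (x^λ - 1) / λ when λ ≠ 0 and ln(x) when λ = 0. The sketch below applies that formula to a single value to show why non-positive input is rejected. It is an illustration of the formula only, not SciPy's implementation, and the function name boxcox_value is hypothetical.

```python
from math import log

def boxcox_value(x, lmbda):
    """Box-Cox transform of one positive value for a fixed lambda."""
    if x <= 0:
        # ln(x) and fractional powers are undefined for non-positive x
        raise ValueError("Data must be positive.")
    if lmbda == 0:
        return log(x)
    return (x ** lmbda - 1) / lmbda

print(boxcox_value(4, 0.5))  # 2.0
print(boxcox_value(4, 1))    # 3.0  (lambda = 1 only shifts the data by 1)
print(boxcox_value(1, 0))    # 0.0  (lambda = 0 is the natural log)
```

When no λ is supplied, scipy.stats.boxcox estimates it from the data by maximum likelihood and returns it together with the transformed values.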

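from scipy.stats import boxcox

# Shift by 1 so that zero fares become strictly positive
data['Fare'] = data['Fare'] + 1
# With no lmbda argument, boxcox returns the transformed array and the fitted lambda
boxcox_transformed_data, fitted_lambda = boxcox(data['Fare'])
boxcox_transformed_data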


2. Outlier processing

2.1 Censoring (capping) method

With the censoring method, values above the upper outlier threshold are set to the upper threshold, and values below the lower outlier threshold are set to the lower threshold. The box plot code in section 1.1 is an example of this method.

2.2 Single-value substitution

Every value outside the outlier thresholds is replaced by a single value, such as the maximum, minimum, mean, or mode.
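As one possible illustration, the sketch below replaces IQR outliers with the median of the remaining values. The choice of the median, the helper name replace_outliers_with_median, and the sample values are all assumptions made for the example.

```python
import pandas as pd

def replace_outliers_with_median(series, scale=1.5):
    """Replace values outside the IQR bounds with the median of the other values."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (series < q1 - scale * iqr) | (series > q3 + scale * iqr)
    # Compute the replacement from the inliers so the outliers do not influence it
    median_of_inliers = series[~is_outlier].median()
    return series.mask(is_outlier, median_of_inliers)

# Made-up values, with 300.0 as the obvious outlier
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 300.0])
cleaned = replace_outliers_with_median(values)
print(cleaned.tolist())  # [10.0, 12.0, 11.0, 13.0, 12.0, 12.0]
```

Swapping median() for mean() or mode() gives the other variants mentioned above.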

2.3 Replace with missing values

Every value outside the outlier thresholds is replaced with a missing value, and the missing values are then filled in with a missing-value imputation method.
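A minimal sketch of this two-step approach, assuming the IQR rule for detection and mean imputation for filling. Both choices are illustrative, and the helper name outliers_to_nan and the sample values are hypothetical.

```python
import pandas as pd

def outliers_to_nan(series, scale=1.5):
    """Return a copy of the series with values outside the IQR bounds set to NaN."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (series < q1 - scale * iqr) | (series > q3 + scale * iqr)
    # where() keeps values where the condition is True and puts NaN elsewhere
    return series.where(~is_outlier)

# Made-up values, with 300.0 as the obvious outlier
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 300.0])
with_gaps = outliers_to_nan(values)

# Step 2: fill the gaps with any missing-value method; here, the mean of the rest
filled = with_gaps.fillna(with_gaps.mean())
print(with_gaps.isna().sum())  # 1
print(filled.tolist())
```

Because the outliers are now ordinary missing values, any imputation technique already used elsewhere in the pipeline can be applied to them.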

Summary

Identification methods and handling methods can be combined freely. For example, use the box plot rule to detect outliers, replace the values beyond the thresholds with missing values, and then fill in those missing values. If this article was useful to you, feel free to like it and leave a comment. Discussion in the comment area is welcome.
