Handling Outliers: Detailed Tutorial and Case Studies

In everyday data processing and machine learning work, it is common to find values in the data that deviate from the norm or fall outside expectations. These values are called outliers. Outliers can distort analysis and modeling results, so handling them is an important step in data preprocessing. This article explains how to deal with outliers, covering both how to identify them and how to handle them.

1. Identification of outliers

First, we need to be able to identify outliers in the data. Generally, there are two ways to do this:

  • **Statistical methods:** Use a box plot, the Z-score, or the IQR (interquartile range) method to detect outliers.

  • **Business understanding:** Sometimes domain knowledge and expertise are enough to determine that certain values are outliers.

Taking the Z-score method as an example, here is sample Python code:

import numpy as np
import scipy.stats as stats

# Create the data
data = np.array([2, 8, 7, 1, 6, 8, 8, 7, 9, 5, 100])

# Compute the z-score of each value
z_scores = stats.zscore(data)

# Treat values whose absolute z-score exceeds 2 as outliers
outliers = data[np.abs(z_scores) > 2]
print(outliers)
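
The IQR method mentioned in the list above can be sketched in the same way. This is a minimal illustration on the same sample data, not a definitive implementation. The 1.5 multiplier is the conventional choice rather than something the article prescribes, and the names `lower_fence`, `upper_fence` and `iqr_outliers` are chosen here for illustration.

```python
import numpy as np

# Same sample data as in the Z-score example
data = np.array([2, 8, 7, 1, 6, 8, 8, 7, 9, 5, 100])

# First and third quartiles, and the interquartile range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (assumed, conventional multiplier)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
iqr_outliers = data[(data < lower_fence) | (data > upper_fence)]
print(iqr_outliers)
```

On this sample the fences are 1.75 and 11.75, so the IQR method flags both 1 and 100, while the Z-score rule with a threshold of 2 flags only 100. Different detection rules can disagree, which is one reason to check flagged values against business understanding.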

2. Handling of outliers

After identifying the outliers in the data, we can choose the following ways to deal with them:

  • **Delete:** The simplest approach is to remove the row (or, in some cases, the column) that contains the outlier.

  • **Replace:** Substitute the outlier with a representative value such as the median or the mean.

  • **Keep:** In some cases outliers carry important information, and we may choose to keep them.

Taking replacement as an example, here is sample Python code:

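# Compute the median of the data (reuses data and z_scores from the Z-score example)
median = np.median(data)

# Replace the outliers with the median
# (data is an integer array, so the median is cast to int on assignment)
data[np.abs(z_scores) > 2] = median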
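
Deletion, the first option in the list above, can be sketched with a boolean mask. This is a minimal sketch that assumes the same sample data and the same threshold of 2. The z-scores are computed directly with NumPy, which matches `scipy.stats.zscore` with its default `ddof=0`. The name `cleaned` is chosen here for illustration.

```python
import numpy as np

# Same sample data as in the Z-score example
data = np.array([2, 8, 7, 1, 6, 8, 8, 7, 9, 5, 100])

# Z-scores using the population standard deviation (ddof=0)
z_scores = (data - data.mean()) / data.std()

# Keep only the observations whose absolute z-score is within the threshold
cleaned = data[np.abs(z_scores) <= 2]
print(cleaned)
```

Indexing with a mask returns a new array, so the original `data` is left unchanged and the removed values can still be inspected later.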

3. Impact on data analysis and models

Outliers can strongly affect analysis results and models. The mean and variance, for example, are sensitive to extreme values, so a single outlier can shift both considerably. Models such as ordinary least-squares linear regression, which minimize squared error, can likewise have their fit and predictions pulled toward outliers.
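
The effect on summary statistics can be seen on the sample data used earlier. The sketch below compares the mean, median and standard deviation with and without the value 100. The helper `summarize` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

# Sample data from the Z-score example, with and without the extreme value
data = np.array([2, 8, 7, 1, 6, 8, 8, 7, 9, 5, 100])
trimmed = data[data != 100]

def summarize(values):
    """Return the mean, median and population standard deviation of values."""
    return {
        "mean": float(np.mean(values)),
        "median": float(np.median(values)),
        "std": float(np.std(values)),
    }

with_outlier = summarize(data)
without_outlier = summarize(trimmed)

for label, stats_ in [("with outlier", with_outlier), ("without outlier", without_outlier)]:
    print(label, {name: round(value, 2) for name, value in stats_.items()})
```

On this sample the mean moves from 6.1 to roughly 14.6 and the standard deviation from about 2.5 to about 27.1, while the median stays at 7. That is why the median is often preferred as a replacement value.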

Handling outliers involves a trade-off. If they heavily distort the analysis or the model, they should be treated; if they carry important information, it may be better to keep them.

3.1 Handling outliers in continuous values

For continuous values, we can usually use the following methods to deal with outliers:

  • **Capping and flooring:** For continuous variables such as age or income, we can set an upper and a lower limit and replace values outside this range with the nearest limit.
  • **Data transformation:** Transforming the data, for example with a log transformation, can reduce the influence of outliers on the results.

Taking capping and flooring as an example, here is sample Python code:

# Replace values in data greater than 95 with 95
data[data > 95] = 95
# Replace values in data less than 5 with 5
data[data < 5] = 5
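
The data transformation option from the list above can be sketched as follows. This is a minimal illustration that assumes non-negative values. It uses `np.log1p`, which computes log(1 + x) and is therefore defined at zero. The names `log_data`, `ratio_before` and `ratio_after` are chosen here for illustration.

```python
import numpy as np

# Original sample data, including the extreme value 100
data = np.array([2, 8, 7, 1, 6, 8, 8, 7, 9, 5, 100])

# Log transformation: compresses large values much more than small ones
log_data = np.log1p(data)

# How far the maximum sits from the median, before and after the transformation
ratio_before = data.max() / np.median(data)
ratio_after = log_data.max() / np.median(log_data)
print(round(ratio_before, 2), round(ratio_after, 2))

# The transformation is reversible with np.expm1
restored = np.expm1(log_data)
```

Unlike capping, the transformation keeps the ordering of the values and can be inverted, so no information is discarded.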

3.2 Handling outliers in categorical values

For categorical values, we can consider the following treatments:

  • **Merge categories:** Categories that occur only rarely can be merged into a single "Other" category.
  • **Delete:** If a category has very few observations, we may choose to delete those observations.

Taking category merging as an example, here is sample Python code:

import pandas as pd

# Suppose we have a pandas Series object s
s = pd.Series(['a', 'b', 'c', 'a', 'b', 'a', 'd', 'd', 'e', 'e', 'e'])

# Merge the categories that occur fewer than 3 times into 'other'
s = s.where(s.map(s.value_counts()) >= 3, 'other')
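
The deletion option for categorical values can be sketched in a similar way. This is a minimal sketch that assumes the same sample Series and the same threshold of 3 occurrences. The names `counts`, `frequent` and `filtered` are chosen here for illustration.

```python
import pandas as pd

# Same sample Series as in the merging example, before any merging
s = pd.Series(['a', 'b', 'c', 'a', 'b', 'a', 'd', 'd', 'e', 'e', 'e'])

# Count how often each category occurs
counts = s.value_counts()

# Categories that occur at least 3 times
frequent = counts[counts >= 3].index

# Drop the observations that belong to rare categories
filtered = s[s.isin(frequent)]
print(filtered.tolist())
```

Here only the observations in categories 'a' and 'e' remain. Deleting removes 5 of the 11 observations in this sample, so merging into an "other" category is the gentler option when data is scarce.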

Conclusion

Dealing with outliers is an integral part of data preprocessing. Handling them correctly helps us build more robust models and obtain more accurate analysis results. I hope this article helps you handle outliers with more confidence and process your data better.

Origin blog.csdn.net/a871923942/article/details/131418906