Three ways to quickly find outliers

This article introduces three Python methods for finding outliers in a data set.

Outliers are data points that differ significantly from the other points in a data set. They may lie much farther from the center of the data than the rest, or take unusual values. Outliers can be caused by data collection errors, measurement errors, unusual events, or other unknown factors.

The presence of outliers can have important consequences for data analysis and statistical modeling, as they can cause models to be inaccurate or produce misleading results.

Let’s first create some demo data:

 import pandas as pd
 import matplotlib.pyplot as plt
 
 name = ['John', 'Victor', 'Carlos', 'Leo', 'Kevin', 'Silva', 'Johnson', 'Lewis', 'George', 'Daniel', 'Harry', 'Jordan', 'James']
 salary = [4000, 1000, 2000, 100000, 3500, 6000, 1500, 3000, 2500, 3600, 2100, 1700, 1600]
 
 df = pd.DataFrame({'Name': name, 'Salary': salary})
 
 plt.boxplot(df['Salary'])
 plt.show()

In the box plot, the single point far above the box (the salary of 100000) is an outlier. Below we introduce three methods to find it quickly.

Interquartile range method

First find the first and third quartiles, usually denoted Q1 and Q3. Then subtract Q1 from Q3 to get the interquartile range (IQR).

Calculate the lower bound as Q1 minus 1.5 times the IQR and the upper bound as Q3 plus 1.5 times the IQR. Values outside these bounds are outliers:

 q1 = df['Salary'].quantile(0.25)
 q3 = df['Salary'].quantile(0.75)
 iqr = q3 - q1
 
 lower_bound = q1 - 1.5 * iqr
 upper_bound = q3 + 1.5 * iqr
 
 outliers = df[(df['Salary'] < lower_bound) | (df['Salary'] > upper_bound)]

A multiplier of 1.5 is the commonly used standard, but it can be adjusted to the situation: a smaller multiplier is stricter and flags more points, while a larger one is looser and flags fewer.
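To illustrate how the multiplier changes the result, here is a minimal sketch that wraps the IQR rule in a helper. The function name `iqr_outliers` and its parameter `k` are illustrative assumptions, not part of pandas.

```python
import pandas as pd

salary = pd.Series(
    [4000, 1000, 2000, 100000, 3500, 6000, 1500, 3000, 2500, 3600, 2100, 1700, 1600],
    name='Salary',
)

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return values[(values < q1 - k * iqr) | (values > q3 + k * iqr)]

default_outliers = iqr_outliers(salary)        # the usual 1.5 multiplier
strict_outliers = iqr_outliers(salary, k=0.5)  # a smaller multiplier flags more points

print(default_outliers.tolist())
print(strict_outliers.tolist())
```

On this demo data, the default multiplier flags only the salary of 100000, while `k=0.5` also flags 6000.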

Standard deviation method

The standard deviation method uses the standard deviation of the data to decide whether a data point deviates too far from the mean. The upper and lower bounds are the mean plus or minus three times the standard deviation.

The method is as follows:

Calculate the mean and standard deviation: First, calculate the mean and standard deviation of the data. The mean represents the center of the data, and the standard deviation measures the spread of the data.

Determine the threshold: Define a threshold as a multiple of the standard deviation (usually 2 or 3 times). This threshold determines which data points are considered outliers.

Identify outliers: Calculate the absolute difference between each data point and the mean, then compare this difference to the threshold. If the difference exceeds the threshold, the data point is considered an outlier.

 mean = df['Salary'].mean()
 std = df['Salary'].std()  # pandas computes the sample standard deviation (ddof=1) by default
 
 upper_bound = mean + 3 * std
 lower_bound = mean - 3 * std
 
 outliers = df[(df['Salary'] < lower_bound) | (df['Salary'] > upper_bound)]

The advantage of the standard deviation method is that it is simple to understand and compute. But you need to pay attention to the following points:

Typically, 2 or 3 times the standard deviation is used as the threshold, but this value may need to be adjusted depending on the situation.
This method works well for approximately normally distributed data, but may misjudge points in skewed data.
The standard deviation method may not be suitable for small samples, because the standard deviation is not stable enough there and is itself inflated by the outliers.
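The small-sample caveat can be made concrete. For n values and the sample standard deviation (ddof=1, the pandas default), no point can lie more than (n - 1) / sqrt(n) standard deviations from the mean, so a threshold of 3 can never flag anything when n is 10 or less. The sketch below checks this with made-up numbers; `max_abs_deviation` is a hypothetical helper name.

```python
import math
import pandas as pd

def max_abs_deviation(n):
    """Upper limit of |x - mean| / std for n values, using the sample std (ddof=1)."""
    return (n - 1) / math.sqrt(n)

# Ten made-up values with one obviously extreme entry
small = pd.Series([1, 2, 3, 2, 1, 2, 3, 2, 1, 1_000_000])
deviations = (small - small.mean()).abs() / small.std()

print(round(max_abs_deviation(10), 3))  # limit for n = 10
print(round(max_abs_deviation(13), 3))  # limit for n = 13, the size of the demo data
print(bool((deviations > 3).any()))     # whether the 3x rule flags anything here
```

The limit is about 2.846 for n = 10 and about 3.328 for n = 13, so the demo data is only just large enough for the 3x rule to catch the 100000 salary.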

Z-score method

The Z-score method measures how far a data point deviates from the mean of the data set and expresses this deviation in a standardized way. For each data point, calculate the difference between it and the mean, then divide this difference by the standard deviation to get the Z-score. If the Z-score is greater than 3.0 or less than -3.0, the value can be classified as an outlier.

We can use the function provided by scipy to perform the calculation directly:

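 from scipy import stats
 
 # stats.zscore uses the population standard deviation (ddof=0) by default
 df['Salary_zscore'] = stats.zscore(df['Salary'])
 # Rows with |z| > 3 are the outliers; filtered_df keeps the remaining rows
 outliers = df[df['Salary_zscore'].abs() > 3]
 filtered_df = df[(df['Salary_zscore'] <= 3) & (df['Salary_zscore'] >= -3)]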

The Z-score method is the standard deviation method expressed on a standardized scale, so it shares the same limitations: it works best for approximately normally distributed data and is less reliable for skewed data or small samples. Its advantage is that it provides a standardized metric, making it easier to compare outliers between different data sets.
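One detail worth checking: `scipy.stats.zscore` defaults to `ddof=0` (population standard deviation), while pandas `Series.std()` defaults to `ddof=1` (sample standard deviation), so the two computations use slightly different scales. A minimal sketch on the demo salaries, with illustrative variable names:

```python
import numpy as np
import pandas as pd
from scipy import stats

salary = pd.Series(
    [4000, 1000, 2000, 100000, 3500, 6000, 1500, 3000, 2500, 3600, 2100, 1700, 1600]
)

z_population = np.asarray(stats.zscore(salary))      # scipy default, ddof=0
z_sample = np.asarray(stats.zscore(salary, ddof=1))  # same scale as pandas .std()
z_manual = ((salary - salary.mean()) / salary.std()).to_numpy()

print(np.allclose(z_sample, z_manual))  # ddof=1 reproduces the pandas-based calculation
print(z_population[3], z_sample[3])     # Z-scores of the 100000 salary
```

On this data both versions put the 100000 salary above 3, but on other data a point close to the threshold could be flagged by one version and not the other.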

Summary

The above are statistical methods that can quickly find outliers. There are also machine learning methods, such as:

DBSCAN (Density-Based Spatial Clustering of Applications with Noise): DBSCAN is a density-based clustering algorithm that can also be used to detect outliers. It groups points in dense regions into clusters and labels points in low-density regions as noise, which can be treated as outliers.

LOF (Local Outlier Factor): LOF detects outliers relative to their local neighborhood. It compares the local density around each data point with the local densities of its neighbors; points whose density is much lower than that of their neighbors are considered outliers.

Isolation Forest: Isolation Forest is a tree-ensemble outlier detection method, similar in spirit to random forest. It builds trees from random splits and identifies outliers as the points that become isolated after only a few splits. Because it relies on random subsampling, it is well suited to high-dimensional data and large data sets.
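As a rough sketch of how these three methods could be applied to the demo salaries, the example below uses scikit-learn, which the rest of this article does not use. The parameter values (`eps=2500`, `min_samples=3`, `n_neighbors=5`, `random_state=0`) are assumptions picked for this tiny data set, not recommended defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

salary = np.array(
    [4000, 1000, 2000, 100000, 3500, 6000, 1500, 3000, 2500, 3600, 2100, 1700, 1600]
)
X = salary.reshape(-1, 1)  # scikit-learn expects shape (n_samples, n_features)

# DBSCAN: points that belong to no dense cluster get the label -1 (noise)
dbscan_labels = DBSCAN(eps=2500, min_samples=3).fit_predict(X)

# LOF: fit_predict returns -1 for outliers and 1 for inliers
lof = LocalOutlierFactor(n_neighbors=5)
lof_labels = lof.fit_predict(X)

# Isolation Forest: fit_predict also returns -1 for outliers and 1 for inliers
iso = IsolationForest(random_state=0)
iso_labels = iso.fit_predict(X)

print(salary[dbscan_labels == -1])
print(salary[lof_labels == -1])
print(salary[iso_labels == -1])
```

All three flag the 100000 salary. Depending on the parameters, LOF and Isolation Forest may flag additional points as well, so the settings need tuning for real data.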

However, these methods are usually slower than the simple statistical checks above. If you have strict speed requirements, choose carefully.

https://avoid.overfit.cn/post/2f9d9254f3a146bcb116f680906ec66a