[Python] How to get outliers outside the box plot?

In a box plot, outliers are usually defined as values that lie more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile. So, to get the outliers outside the box plot, you can proceed as follows:

  • First, calculate the quartiles of the dataset and the interquartile range, IQR = Q3 - Q1. The upper bound is the third quartile (Q3) plus 1.5 times the IQR, and the lower bound is the first quartile (Q1) minus 1.5 times the IQR.
  • Values in the dataset that are greater than the upper bound or less than the lower bound are then considered outliers.
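
The two steps above can be sketched with the Python standard library alone. This is only an illustration under a few assumptions: the function name `iqr_outliers`, the parameter `k`, and the sample list are made up for this example, and `method="inclusive"` is chosen so that the quartiles are interpolated linearly between data points, which matches the default behaviour of pandas and NumPy.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (lower_bound, upper_bound, outliers) using the k * IQR rule."""
    # n=4 yields the three quartile cut points; the middle one (the median) is not needed
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    # Keep every value that falls outside [lower_bound, upper_bound]
    outliers = [v for v in values if v < lower_bound or v > upper_bound]
    return lower_bound, upper_bound, outliers


# Hypothetical sample, chosen only for illustration
sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]
lower, upper, found = iqr_outliers(sample)
print(lower, upper, found)  # -2.0 14.0 [30]
```

Here Q1 = 4 and Q3 = 8, so the IQR is 4 and the bounds are -2 and 14; only 30 falls outside them.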

Python or other statistical analysis software can be used to automatically detect and obtain outliers outside the box plot.

In Python, you can use NumPy or pandas to calculate the quartiles, the IQR, and the bounds of the box plot, and then filter the outliers with a boolean condition. For example, the following code uses pandas to get the values that fall outside the box plot's bounds:

import pandas as pd

# Create the dataset
data = pd.DataFrame({
    'values': [1, 2, 3, 4, 5, 10, 20, 30, 40, 500]})

# Compute the quartiles, the IQR, and the bounds
Q1 = data['values'].quantile(0.25)
Q3 = data['values'].quantile(0.75)
IQR = Q3 - Q1
upper_bound = Q3 + 1.5 * IQR
lower_bound = Q1 - 1.5 * IQR

# Select the rows whose value lies outside the bounds
outliers = data[(data['values'] < lower_bound) | (data['values'] > upper_bound)]
print(outliers)

First we look at the distribution of the data:

[Image: distribution of the example data]
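
One way to look at the distribution yourself is to draw a box plot with matplotlib, which also hands back the points it plots beyond the whiskers. The sketch below is not the code behind the figure above; it assumes matplotlib is installed and uses the non-interactive `Agg` backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

values = [1, 2, 3, 4, 5, 10, 20, 30, 40, 500]

fig, ax = plt.subplots()
# whis=1.5 is matplotlib's default; it is spelled out here to mirror the 1.5 * IQR rule
result = ax.boxplot(values, whis=1.5)
ax.set_ylabel("values")
# fig.savefig("boxplot.png")  # uncomment to write the figure to a (hypothetical) file

# result["fliers"] holds one Line2D per box; its y data are the points beyond the whiskers
fliers = result["fliers"][0].get_ydata()
print(list(fliers))
```

For this dataset the only flier should be 500, the same value that the manual calculation selects.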

In the code above, we first create a dataset of 10 numbers and use pandas to calculate the quartiles, the IQR, and the bounds. With the default linear interpolation, Q1 = 3.25 and Q3 = 27.5, so IQR = 24.25 and the bounds are -33.125 and 63.875. We then select the rows that fall outside these bounds with a boolean condition and print them. In this example, 500 is the only value considered an outlier.
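
The same calculation can also be done with NumPy instead of pandas. The following is a minimal sketch that assumes the same ten numbers; the variable name `mask` is introduced only for this example, and `np.percentile` is used with its default linear interpolation.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 10, 20, 30, 40, 500])

# Compute the first and third quartiles in a single call
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Boolean mask that is True for points outside [lower_bound, upper_bound]
mask = (values < lower_bound) | (values > upper_bound)
outliers = values[mask]
print(outliers)  # [500]
```

Working on a plain array avoids the DataFrame overhead when the data is a single column of numbers; the pandas version is more convenient when you want to keep the row index of each outlier.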

Origin blog.csdn.net/wzk4869/article/details/129815442