Python Outlier Handling: Box Plot and the 3σ Principle (Machine Learning)


✌ View data outliers

1. ✌ Box plot

✌ Lower quartile Q1:

Q1 is the value at the 1/4 position of the sorted data. The position of Qi = i(n+1)/4, where i = 1, 2, 3 and n is the number of items in the sequence.
For example, for 100 data points, Q1 position = (100 + 1) / 4 = 25.25
Q1 = 0.75 × (25th value) + 0.25 × (26th value)

✌ Median Q2:

Q2 is the value in the middle of the sorted data.
Q2 position = 2 × (100 + 1) / 4 = 50.5
Q2 = 0.5 × (50th value) + 0.5 × (51st value)

✌ Upper quartile Q3:

Q3 is the value at the 3/4 position of the sorted data.
Q3 position = 3 × (100 + 1) / 4 = 75.75
Q3 = 0.25 × (75th value) + 0.75 × (76th value)
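
The position rule i(n+1)/4 can be sketched in plain Python. The helper name `quartile_by_position` is hypothetical (it is not from the original post), and the sketch assumes linear interpolation between the two neighbouring sorted values, weighted by the fractional part of the position.

```python
def quartile_by_position(values, i):
    """Return Q_i (i = 1, 2, 3) using the position rule i * (n + 1) / 4."""
    data = sorted(values)
    n = len(data)
    pos = i * (n + 1) / 4        # 1-based position, may be fractional
    lower = int(pos)             # 1-based index of the item at or below pos
    frac = pos - lower           # weight given to the next item
    if lower < 1:                # position falls before the first item
        return float(data[0])
    if lower >= n:               # position falls at or after the last item
        return float(data[-1])
    # Linear interpolation between the two neighbouring items
    return (1 - frac) * data[lower - 1] + frac * data[lower]


data = list(range(1, 101))       # the 100 values 1, 2, ..., 100
q1 = quartile_by_position(data, 1)   # position 25.25
q2 = quartile_by_position(data, 2)   # position 50.5
q3 = quartile_by_position(data, 3)   # position 75.75
print(q1, q2, q3)
```

Note that the default rule in `np.percentile` and `DataFrame.quantile` places the positions differently, so their quartiles can differ slightly from this rule on small samples.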

✌ Interquartile range IQR:

IQR = Q3-Q1

✌ Lower limit:

The lower limit is the smallest value still treated as normal; anything below it is considered an outlier.
Lower limit = Q1 - 1.5 × IQR

✌ Upper limit:

The upper limit is the largest value still treated as normal; anything above it is considered an outlier.
Upper limit = Q3 + 1.5 × IQR
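
As a minimal sketch of the two limits, the snippet below uses the standard-library `statistics.quantiles`, whose default "exclusive" method follows the same i(n+1)/4 position rule. The function name `iqr_limits` and the sample values are made up for illustration.

```python
import statistics


def iqr_limits(values, k=1.5):
    """Return (lower limit, upper limit) = (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # default method: exclusive
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


sample = [10, 12, 13, 14, 15, 16, 18, 100]   # 100 is an obvious outlier
lower, upper = iqr_limits(sample)

# Keep only the values outside the non-anomalous range
outliers = [v for v in sample if v < lower or v > upper]
print(lower, upper, outliers)
```

The factor 1.5 is the conventional choice; a larger `k` flags fewer points.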

2. ✌ 3σ principle

In a normal distribution, σ is the standard deviation, μ is the mean, and x = μ is the axis of symmetry of the density curve. The 3σ principle states that:
the probability of a value falling in (μ-σ, μ+σ) is about 0.6827;
the probability of a value falling in (μ-2σ, μ+2σ) is about 0.9545;
the probability of a value falling in (μ-3σ, μ+3σ) is about 0.9973.
Almost all values are therefore concentrated in the interval (μ-3σ, μ+3σ), and the probability of falling outside it is less than 0.3%.
We can use this property to treat values outside (μ-3σ, μ+3σ) as outliers and eliminate them.
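
These probabilities can be checked with the standard library alone: for a normal variable, P(|X - μ| < kσ) = erf(k / √2). The helper name `prob_within` is hypothetical.

```python
import math


def prob_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The rule only holds approximately for data that is itself roughly normal; on heavily skewed data the 3σ interval can flag too many or too few points.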

3. ✌ Code test

3.1 ✌ Import libraries

import pandas as pd
import numpy as np

3.2 ✌ Create data

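# 100,000 rows x 10 columns of random integers in [10, 100)
x = np.random.randint(10, 100, (100000, 10))
x = pd.DataFrame(x)
# Random row and column positions for the injected outliers
rows = np.random.randint(1, 100000, 100)
cols = np.random.randint(0, 10, 100)
# Note: iloc with two lists sets every (row, column) combination, not paired cells
x.iloc[rows, cols] = 1000
# Count the injected outliers in each column
(x == 1000).sum()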

3.3 ✌ Box plot

x.boxplot()
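
The plot shows the outliers visually; to count them, the box plot limits can be applied column by column. This is a sketch on a small made-up frame (`df`, the seed, and the planted rows are assumptions, not the data above), using the default `DataFrame.quantile` rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(10, 100, size=(1000, 3)))
df.iloc[[5, 50, 500], 1] = 1000      # plant three outliers in column 1

q1 = df.quantile(0.25)               # per-column lower quartile
q3 = df.quantile(0.75)               # per-column upper quartile
iqr = q3 - q1

# True where a value lies below the lower limit or above the upper limit
outside = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)
counts = outside.sum()               # number of flagged values per column
print(counts)
```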

3.4 ✌ 3σ principle

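# Flag values whose z-score is larger than 3 in absolute value
a = pd.DataFrame()
for i in x.columns:
    z = (x[i] - x[i].mean()) / x[i].std()   # z-score of column i
    a[i] = abs(z) > 3
# Number of flagged values in each column
a.sum()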
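
Counting is only the first step; to actually eliminate the outliers, the same z-score test can be used as a row filter. The sketch below runs on a small made-up frame (`df`, the seed, and the planted rows are assumptions), and dropping whole rows is one possible policy among several (clipping or replacing with the median are alternatives).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(10, 100, size=(1000, 3)))
df.iloc[[5, 50, 500], 1] = 1000      # plant three outliers in column 1

# Column-wise z-scores for the whole frame at once
z = (df - df.mean()) / df.std()

# A row is dropped if any of its values lies more than 3σ from the column mean
mask = (z.abs() > 3).any(axis=1)
cleaned = df[~mask]
print(mask.sum(), len(cleaned))
```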

Origin blog.csdn.net/m0_47256162/article/details/113790444