✌ View data outliers
1. ✌ Box plot
✌ Lower quartile Q1:
Q1 sits at the 1/4 position of the sorted data. The position of Qi is i(n+1)/4, where i = 1, 2, 3 and n is the number of items in the sequence.
For example, with 100 data points, Q1 position = (100+1)/4 = 25.25
Q1 = 0.75 × (25th value) + 0.25 × (26th value)
✌ Median Q2:
Q2 sits in the middle of the sorted data.
Q2 position = 2 × (100+1)/4 = 50.5
Q2 = 0.5 × (50th value) + 0.5 × (51st value)
✌ Upper quartile Q3:
Q3 sits at the 3/4 position of the sorted data.
Q3 position = 3 × (100+1)/4 = 75.75
Q3 = 0.25 × (75th value) + 0.75 × (76th value)
✌ Interquartile range IQR:
IQR = Q3-Q1
✌ Lower limit:
The lower limit is the smallest value still considered normal; anything below it is treated as an outlier.
Lower limit = Q1 - 1.5 × IQR
✌ Upper limit:
The upper limit is the largest value still considered normal; anything above it is treated as an outlier.
Upper limit = Q3 + 1.5 × IQR
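The quartile and limit formulas above can be sketched in plain Python. This is only an illustration: the helper name `quartile`, the sample data (the integers 1 to 99 plus one extreme value of 1000), and the assumption that the sample has at least 3 items are all made up for this example.

```python
def quartile(sorted_values, i):
    """Return Q_i (i = 1, 2, 3) using the position rule i * (n + 1) / 4.

    Assumes sorted_values is sorted ascending and has at least 3 items.
    """
    n = len(sorted_values)
    pos = i * (n + 1) / 4            # 1-based position, may be fractional
    lower = int(pos)                 # 1-based index of the value just below pos
    frac = pos - lower               # fractional part of the position
    upper = min(lower + 1, n)        # do not step past the last item
    lo_val = sorted_values[lower - 1]
    hi_val = sorted_values[upper - 1]
    # linear interpolation between the two neighbouring values
    return (1 - frac) * lo_val + frac * hi_val

# hypothetical sample: 1..99 plus one extreme value, 100 items in total
data = sorted(list(range(1, 100)) + [1000])

q1 = quartile(data, 1)
q2 = quartile(data, 2)
q3 = quartile(data, 3)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# values outside [lower_limit, upper_limit] are treated as outliers
outliers = [v for v in data if v < lower_limit or v > upper_limit]
print(q1, q2, q3, iqr, lower_limit, upper_limit, outliers)
```

With this sample the Q1 position is 25.25, so Q1 is interpolated between the 25th and 26th values, and only the value 1000 falls outside the limits.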
2. ✌ 3σ principle
In a normal distribution, σ is the standard deviation, μ is the mean, and x = μ is the axis of symmetry of the density curve.
The 3σ principle states that:
the probability that a value falls in (μ-σ, μ+σ) is about 0.6827;
the probability that a value falls in (μ-2σ, μ+2σ) is about 0.9545;
the probability that a value falls in (μ-3σ, μ+3σ) is about 0.9973.
So almost all values lie in the interval (μ-3σ, μ+3σ); the chance of falling outside it is less than 0.3%.
We can use this property to flag values more than 3σ away from the mean as outliers and remove them.
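As a quick check of these probabilities, the share of a normal distribution within k standard deviations of the mean can be computed with the error function from the standard library. The helper name `prob_within` is made up for this sketch, and figures read from printed tables may differ from its output in the last digit.

```python
import math

def prob_within(k):
    """P(mu - k*sigma < X < mu + k*sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    inside = prob_within(k)
    # print the probability inside the interval and the probability outside it
    print(k, round(inside, 4), round(1 - inside, 4))
```

For k = 3 the probability of falling outside the interval is about 0.0027, which is where the "less than 0.3%" figure comes from.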
3. ✌ Code test
3.1 ✌ Import libraries
import pandas as pd
import numpy as np
3.2 ✌ Create data
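# 100000 rows x 10 columns of random integers in [10, 100)
x=np.random.randint(10,100,(100000,10))
# pick 100 random (row, column) positions for the injected outliers
rows=np.random.randint(0,100000,100)
cols=np.random.randint(0,10,100)
# index the NumPy array so each (row, col) pair is a single cell;
# x.iloc[rows,cols]=1000 on a DataFrame would fill every row/column combination
x[rows,cols]=1000
x=pd.DataFrame(x)
# injected outliers per column (repeated positions can make the total slightly below 100)
(x==1000).sum()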
3.3 ✌ Box plot
x.boxplot()
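The box plot only shows the outliers visually. To count them, the same Q1 - 1.5 × IQR and Q3 + 1.5 × IQR limits can be applied numerically. The sketch below uses a small stand-in DataFrame named `df` with a fixed seed and three hand-placed outliers, all of which are assumptions for illustration. Note that `DataFrame.quantile` interpolates with a slightly different position rule than i(n+1)/4, so its quartiles can differ a little from a hand calculation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)                        # fixed seed, for reproducibility only
df = pd.DataFrame(rng.integers(10, 100, (1000, 3)))   # hypothetical stand-in for x

# place one outlier in each column, one cell at a time
df.iat[5, 0] = 1000
df.iat[50, 1] = 1000
df.iat[500, 2] = 1000

q1 = df.quantile(0.25)            # per-column lower quartile
q3 = df.quantile(0.75)            # per-column upper quartile
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# the comparison aligns the per-column limits with the DataFrame columns
is_outlier = (df < lower) | (df > upper)
counts = is_outlier.sum()         # number of flagged values per column
print(counts)
```

The same pattern should carry over to the larger `x` used in this section by replacing `df` with `x`.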
3.4 ✌ 3σ principle
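a=pd.DataFrame()
for i in x.columns:
    # z-score of every value in column i
    z=(x[i]-x[i].mean())/x[i].std()
    # True where the value lies more than 3 standard deviations from the mean
    a[i]=abs(z)>3
# number of flagged outliers per column
a.sum()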