Data Analysis - Missing Values

The data we obtain, especially in large volumes, is likely to contain missing values, anomalies and other problems. Data processing is therefore an important and necessary part of data analysis: it reduces the anomalies that show up during analysis as far as possible and leads to more accurate conclusions. That is why processing the data before analyzing it is so necessary.

Prerequisites: familiarity with basic pandas operations
Tools: Python

Platform: Jupyter Notebook

Missing values

Analyzing missing values

The data Python reads mostly comes from CSV or Excel files. When a cell in Excel is empty, pandas shows it as NaN after reading, which means the value is missing.
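As a small illustration of the point above, the sketch below reads CSV text with one empty cell and shows that pandas reports it as NaN. The CSV content and the names csv_text and df_csv are made up for this example; an in-memory string is used instead of a real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content: the "score" cell for bob is empty
csv_text = "name,score\nalice,90\nbob,\ncarol,75\n"

# read_csv parses the empty cell as NaN
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)

# The empty cell is flagged as missing
print(df_csv['score'].isnull())
```

The same behavior is expected when reading from a file path, since only the source of the text differs.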

Methods for detecting missing values: isnull, notnull

  • isnull: True means the value is missing, False means it is not missing
  • notnull: True means the value is not missing, False means it is missing

First, import the Python packages needed for data analysis:

import numpy as np
import pandas as pd
__author__='莫叹'

Create a tabular two-dimensional structure (a DataFrame) named df:

# Create a tabular two-dimensional structure (DataFrame)
df=pd.DataFrame({'a':[34,6,20,np.nan,56],
               'b':['juejin','number','one','good',np.nan]})

The resulting df is as follows:

      a       b
0  34.0  juejin
1   6.0  number
2  20.0     one
3   NaN    good
4  56.0     NaN


Check whether the data contains missing values, and filter out the rows that are not missing:

# Check every cell of df: True where the value is not missing
print(df.notnull(),'\n')
# Check column a by indexing
print(df['a'].notnull(),'\n')
# Keep only the rows where column a is not missing
print(df[df['a'].notnull()])

Output is as follows:

       a      b
0   True   True
1   True   True
2   True   True
3  False   True
4   True  False 

0     True
1     True
2     True
3    False
4     True
Name: a, dtype: bool 

      a       b
0  34.0  juejin
1   6.0  number
2  20.0     one
4  56.0     NaN
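Building on the same df, a common follow-up is to count how many values are missing in each column and to pull out the rows that contain any missing value. The sketch below is one way to do this with isnull; the variable names missing_per_column and rows_with_missing are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [34, 6, 20, np.nan, 56],
                   'b': ['juejin', 'number', 'one', 'good', np.nan]})

# True counts as 1, so summing the Boolean table counts missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column, '\n')

# Rows in which at least one column is missing
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```

Since isnull is simply the opposite of notnull, either one can be used for this kind of check.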

Deleting missing values

Filtering with the Boolean sequence returned by notnull, as shown above, is one way to delete missing values.

How missing data is deleted depends on the specific circumstances and the business situation: sometimes all missing data must be removed, sometimes only part of it, and sometimes only the missing data in a specified place.

Method for deleting missing values: dropna(axis)

  • The default is axis = 0, which deletes rows; axis = 1 deletes columns (axis = 1 is rarely chosen, because it removes the data of an entire variable)
  • Passing thresh = n keeps only the rows that have at least n non-NaN values

# Create a tabular two-dimensional structure (DataFrame)
df2=pd.DataFrame([[1,2,3],['juejin',np.nan,np.nan],['a','b',np.nan],[np.nan,np.nan,np.nan],['d','j','h']],
                 columns=list('ABC'))
print(df2,'\n')
# Drop every row that contains a missing value
print(df2.dropna(),'\n')
# Drop only some rows: keep rows with at least n non-NaN values (here, at least one)
print(df2.dropna(thresh=1),'\n')
# Drop rows with a missing value in one column (column A here; same as the Boolean filtering above)
print(df2[df2['A'].notnull()])

Output is as follows:

        A    B    C
0       1    2    3
1  juejin  NaN  NaN
2       a    b  NaN
3     NaN  NaN  NaN
4       d    j    h 

   A  B  C
0  1  2  3
4  d  j  h 

        A    B    C
0       1    2    3
1  juejin  NaN  NaN
2       a    b  NaN
4       d    j    h 

        A    B    C
0       1    2    3
1  juejin  NaN  NaN
2       a    b  NaN
4       d    j    h
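Besides axis and thresh, dropna also accepts how and subset, which cover the "remove only part of the missing data" cases described above. The sketch below reuses the df2 data; the result variable names are chosen for this illustration.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame([[1, 2, 3], ['juejin', np.nan, np.nan], ['a', 'b', np.nan],
                    [np.nan, np.nan, np.nan], ['d', 'j', 'h']],
                   columns=list('ABC'))

# Drop only the rows in which every value is missing
all_nan_dropped = df2.dropna(how='all')

# Drop rows that are missing a value in column A (other columns are ignored)
a_required = df2.dropna(subset=['A'])

# Keep rows that have at least two non-NaN values
at_least_two = df2.dropna(thresh=2)

# axis=1 drops columns instead; every column here has a NaN, so none survive
cols_dropped = df2.dropna(axis=1)

print(all_nan_dropped, '\n')
print(a_required, '\n')
print(at_least_two, '\n')
print(cols_dropped.shape)
```

The last call shows why axis = 1 is rarely what you want: a single fully empty row is enough to wipe out every column.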

Filling / replacing missing values

Filling method: fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

  • Parameter value: the value used to fill
  • Parameter method: pad / ffill → fill with the preceding value; backfill / bfill → fill with the following value (recent pandas versions deprecate this parameter in favor of the ffill() and bfill() methods)

Replacement method: replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad', axis=None)

  • Parameter to_replace: the value to be replaced
  • Parameter value: the replacement value

Usage example:

import copy
df3=pd.DataFrame([[1,2,3],['juejin',np.nan,np.nan],['a','b',np.nan],['k',np.nan,np.nan],['d','j','h']],
                 columns=list('ABC'))
df4=copy.deepcopy(df3)
print(df3,'\n')
# Fill every missing value with 0
print(df3.fillna(0),'\n')
# Forward fill (same as method='pad'): fill each missing value in column B with the value before it
df3['B']=df3['B'].ffill()
print(df3,'\n')
# Replace missing values using replace
print(df4,'\n')
df4.replace(np.nan,'juejin',inplace = True)
print('Missing values replaced with juejin\n',df4)

        A    B    C
0       1    2    3
1  juejin  NaN  NaN
2       a    b  NaN
3       k  NaN  NaN
4       d    j    h 

        A  B  C
0       1  2  3
1  juejin  0  0
2       a  b  0
3       k  0  0
4       d  j  h 

        A  B    C
0       1  2    3
1  juejin  2  NaN
2       a  b  NaN
3       k  b  NaN
4       d  j    h 

        A    B    C
0       1    2    3
1  juejin  NaN  NaN
2       a    b  NaN
3       k  NaN  NaN
4       d    j    h 

Missing values replaced with juejin
         A       B       C
0       1       2       3
1  juejin  juejin  juejin
2       a       b  juejin
3       k  juejin  juejin
4       d       j       h
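Two further patterns are often useful alongside the example above: filling each column with its own value by passing a dict to fillna, and using replace in the other direction to turn a placeholder into NaN. The sketch below assumes a made-up placeholder of -999 and made-up names (df5, filled, readings, cleaned); none of these come from the original data.

```python
import numpy as np
import pandas as pd

df5 = pd.DataFrame({'B': [2, np.nan, 'b', np.nan, 'j'],
                    'C': [3, np.nan, np.nan, np.nan, 'h']})

# A dict gives each column its own fill value
filled = df5.fillna({'B': 'unknown', 'C': 0})
print(filled, '\n')

# Hypothetical sensor readings where -999 stands for "no measurement"
readings = pd.Series([1, -999, 3])

# replace can convert the placeholder into a real missing value
cleaned = readings.replace(-999, np.nan)
print(cleaned)
```

Once placeholders are converted to NaN, isnull, dropna and fillna all treat them like any other missing value.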

Interpolation of missing values

Filling missing values was covered above, but in real data processing missing values are not simply all filled with one arbitrary value. Instead, each group of missing values is filled by targeted interpolation.

Some common, representative ways to interpolate missing values:

  • Median / mode / mean interpolation
  • Nearest-value interpolation
  • Lagrange interpolation

Median / mode / mean interpolation


# Create a one-dimensional array (Series)
s1=pd.Series([6,4,2,5,4,3,3,7,np.nan,3,9,np.nan,1])
print(s1,'\n')
med=s1.median()   # median
mod=s1.mode()[0]  # mode (mode() returns a Series, so take its first value)
avg=s1.mean()     # mean
print('Median, mode and mean: %.2f, %.2f, %.2f'%(med,mod,avg))
# Using the mean as an example
s1.fillna(avg)

0     6.0
1     4.0
2     2.0
3     5.0
4     4.0
5     3.0
6     3.0
7     7.0
8     NaN
9     3.0
10    9.0
11    NaN
12    1.0
dtype: float64 

Median, mode and mean: 4.00, 3.00, 4.27

0     6.000000
1     4.000000
2     2.000000
3     5.000000
4     4.000000
5     3.000000
6     3.000000
7     7.000000
8     4.272727
9     3.000000
10    9.000000
11    4.272727
12    1.000000
dtype: float64
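The same idea carries over to a DataFrame, where each column is usually filled with its own statistic. The sketch below is a minimal illustration on made-up data; the scores table and its column names are assumptions for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing value per column
scores = pd.DataFrame({'math': [90.0, np.nan, 70.0, 80.0],
                       'art': [60.0, 75.0, np.nan, 75.0]})

# mean() and median() return one value per column;
# fillna matches them to columns by name
filled_mean = scores.fillna(scores.mean())
filled_median = scores.fillna(scores.median())

# mode() can return several values, so the first one is taken here
art_mode = scores['art'].mode()[0]
filled_mode = scores['art'].fillna(art_mode)

print(filled_mean, '\n')
print(filled_median, '\n')
print(filled_mode)
```

Which statistic fits best depends on the data: the median is less sensitive to outliers than the mean, and the mode also works for non-numeric columns.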

Nearest-value interpolation

This is the method parameter of fillna described in the section on filling missing values: a missing value is filled with the value immediately before it or immediately after it. See the df3 example above.

fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

Parameter method: pad / ffill → fill with the preceding value; backfill / bfill → fill with the following value
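To make the two directions concrete, the sketch below applies forward and backward filling to a small made-up Series, using the ffill() and bfill() methods that correspond to method='ffill' and method='bfill'. It also shows the limit parameter from the signature above, which caps how many consecutive missing values are filled.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Fill each missing value with the nearest value before it
forward = s.ffill()

# Fill each missing value with the nearest value after it;
# the last entry stays NaN because nothing follows it
backward = s.bfill()

# Fill at most one consecutive missing value in each gap
limited = s.ffill(limit=1)

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```

A leading NaN with ffill, or a trailing NaN with bfill, is left untouched, so the two are sometimes chained when every gap has to be closed.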

Lagrange polynomial

Many practical problems use a function to express an intrinsic relationship or law, and many of these functions can only be learned through experiment and observation. For example, when a physical quantity is observed in practice and its values are recorded at several different points, Lagrange interpolation finds a polynomial that takes exactly the observed value at each observation point. Such a polynomial is called a Lagrange (interpolation) polynomial. Mathematically, Lagrange interpolation gives a polynomial function that passes exactly through a set of known points in the two-dimensional plane.

Due to space limitations, only a rough outline of how the Lagrange polynomial is calculated is given here.

From mathematics we know that, given n points in a plane, we can find a polynomial of degree n-1 that passes through them:

$$y = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1}$$

When the n coordinate points (x1, y1), (x2, y2), ..., (xn, yn) are known, substituting them into the formula above gives a system of n equations:

$$
\begin{aligned}
y_1 &= a_0 + a_1 x_1 + a_2 x_1^2 + \cdots + a_{n-1} x_1^{n-1} \\
y_2 &= a_0 + a_1 x_2 + a_2 x_2^2 + \cdots + a_{n-1} x_2^{n-1} \\
&\;\;\vdots \\
y_n &= a_0 + a_1 x_n + a_2 x_n^2 + \cdots + a_{n-1} x_n^{n-1}
\end{aligned}
$$

Solving this system gives the values of the parameters a0, a1, ..., an-1. Once these are known, the function relating y to x is known, so passing in a value of x gives the corresponding missing value of y (an approximation). The process described above is called Lagrange interpolation.
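The procedure just described, substituting the known points and solving for the coefficients, can be sketched directly with NumPy. This is only an illustration of the idea on the four sample points (3,6), (7,9), (8,5), (9,8), not how a library necessarily computes it; the names V, a and poly are chosen here for the example.

```python
import numpy as np

# Four known points, so the polynomial has degree 3
x = np.array([3.0, 7.0, 8.0, 9.0])
y = np.array([6.0, 9.0, 5.0, 8.0])

# Row i of V is [1, x_i, x_i^2, x_i^3], one equation per known point
V = np.vander(x, N=len(x), increasing=True)

# Solve V @ a = y for the coefficients a[0], ..., a[3]
a = np.linalg.solve(V, y)

def poly(t):
    """Evaluate a[0] + a[1]*t + a[2]*t^2 + a[3]*t^3."""
    return sum(a_k * t**k for k, a_k in enumerate(a))

print(a)           # coefficients, lowest power first
print(poly(12.0))  # approximately 103.5
```

Solving the system this way becomes numerically fragile as the number of points grows, which is one reason to prefer a ready-made routine for anything beyond a handful of points.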

Python has a very convenient library for Lagrange interpolation. How to use it is shown directly in the code example below.

We arbitrarily choose a set of points (3,6), (7,9), (8,5), (9,8), use Lagrange interpolation to compute the function that passes through them, and then enter the x value to be interpolated to obtain the corresponding y value.

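# Import the Lagrange interpolation function and the plotting package
from scipy.interpolate import lagrange
import matplotlib.pyplot as plt
%matplotlib inline
# Create an arbitrary two-dimensional structure with a missing value
s2=pd.DataFrame({'x':[3,7,12,8,9],'y':[6,9,np.nan,5,8]})
# x values of the known points
x=[3,7,8,9]
# y values of the known points
y=[6,9,5,8]
# Scatter plot of these points
plt.scatter(x,y)
# Compute the interpolating polynomial
print(lagrange(x,y))
# Take x=12 and compute the interpolated value
print('Interpolated value at x=12: %.2f' % lagrange(x,y)(12))

The generated function (the 3 and 2 on the upper line are the exponents in x³ and x²), the interpolated value for the missing point, and the scatter plot are as follows:

        3        2
0.7417 x - 14.3 x + 85.16 x - 140.8

Interpolated value at x=12: 103.50

[Figure: scatter plot of the four known points]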

Therefore, at x = 12 the missing value can be filled with the interpolated value 103.50.
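To close the loop, the interpolated value can be written back into the table instead of being typed in by hand. The sketch below is one possible way to do that for the s2 data: it fits the polynomial on the rows where y is known and fills the rows where y is missing. The names known, missing and poly are chosen for this illustration.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import lagrange

s2 = pd.DataFrame({'x': [3, 7, 12, 8, 9], 'y': [6, 9, np.nan, 5, 8]})

# Fit the polynomial on the rows where y is known
known = s2[s2['y'].notnull()]
poly = lagrange(known['x'].to_numpy(), known['y'].to_numpy())

# Evaluate the polynomial at the x values whose y is missing
missing = s2['y'].isnull()
s2.loc[missing, 'y'] = poly(s2.loc[missing, 'x'].to_numpy())

print(s2)
```

Note that x = 12 lies outside the range of the known points (3 to 9), so this is really an extrapolation, which helps explain why 103.50 is so far from the observed y values. In practice it is worth checking whether such a value is plausible before accepting it.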









Reproduced from: https://juejin.im/post/5d060839e51d455a694f951e


Origin blog.csdn.net/weixin_33751566/article/details/93183904