Collected data, especially in large volumes, often contains missing values, anomalies, and other problems. Data processing is therefore an important and necessary part of data analysis: cleaning the data before analyzing it reduces anomalies and leads to more accurate conclusions.
Prerequisites: familiarity with basic pandas operations
Tools: Python
Platform: Jupyter Notebook
Missing values
Detecting missing values
Python usually reads data from CSV or Excel files. When a cell in the file is empty, pandas displays it as NaN, which marks a missing value.
Methods for detecting missing values: isnull, notnull
- isnull: True means the value is missing, False means it is not missing
- notnull: True means the value is not missing, False means it is missing
First, import the Python packages needed for data analysis:
import numpy as np
import pandas as pd
__author__='莫叹'
Create a table-like two-dimensional structure (a DataFrame) named df:
# Create a table-like two-dimensional DataFrame
df=pd.DataFrame({'a':[34,6,20,np.nan,56],
                 'b':['juejin','number','one','good',np.nan]})
The resulting table df has a numeric column a and a string column b, each containing one missing value (NaN).
Check whether the data contains missing values, and keep only the rows without them:
# Check every cell of df: True means the value is not missing
print(df.notnull(),'\n')
# Check column a, selected by its label
print(df['a'].notnull(),'\n')
# Keep only the rows where column a is not missing
print(df[df['a'].notnull()])
Output is as follows:
       a      b
0   True   True
1   True   True
2   True   True
3  False   True
4   True  False

0     True
1     True
2     True
3    False
4     True
Name: a, dtype: bool

      a       b
0  34.0  juejin
1   6.0  number
2  20.0     one
4  56.0     NaN
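Besides looking at the Boolean table cell by cell, it is often handy to summarize it. The following is a minimal sketch that rebuilds the same df and counts the missing values; the variable names missing_per_column and rows_with_missing are illustrative and do not come from the original article.

```python
import numpy as np
import pandas as pd

# Same sample table as in the article
df = pd.DataFrame({'a': [34, 6, 20, np.nan, 56],
                   'b': ['juejin', 'number', 'one', 'good', np.nan]})

# isnull() returns a Boolean table; summing it counts the True cells per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# any(axis=1) flags the rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```

Here each column reports one missing value, and the flagged rows are the ones with index 3 and 4.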
Deleting missing values
Filtering with the Boolean sequence returned by notnull, as shown above, is one way to delete missing values.
How to delete missing values depends on the specific situation and the business context: sometimes every row with missing data must be removed, sometimes only some of them, and sometimes only the rows where a specified column is missing.
Method for deleting missing values: dropna(axis)
- The default axis=0 deletes rows that contain missing values; axis=1 deletes columns instead (axis=1 is rarely chosen, because it removes an entire variable)
- Passing thresh=n keeps only the rows that have at least n non-NaN values
# Create a table-like two-dimensional DataFrame
df2=pd.DataFrame([[1,2,3],['juejin',np.nan,np.nan],['a','b',np.nan],[np.nan,np.nan,np.nan],['d','j','h']],
                 columns=list('ABC'))
print(df2,'\n')
# Delete every row that contains a missing value
print(df2.dropna(),'\n')
# Delete only some of those rows: keep rows with at least n non-NaN values (here, at least one)
print(df2.dropna(thresh=1),'\n')
# Delete the rows where column A is missing (same as the Boolean filtering shown earlier)
print(df2[df2['A'].notnull()])
Output is as follows:
        A    B    C
0       1    2    3
1  juejin  NaN  NaN
2       a    b  NaN
3     NaN  NaN  NaN
4       d    j    h

   A  B  C
0  1  2  3
4  d  j  h

        A    B    C
0       1    2    3
1  juejin  NaN  NaN
2       a    b  NaN
4       d    j    h

        A    B    C
0       1    2    3
1  juejin  NaN  NaN
2       a    b  NaN
4       d    j    h
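The article mentions axis=1 without demonstrating it. The sketch below tries it on the same df2, together with two other dropna parameters, how and subset, which the article does not cover. The result variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame([[1, 2, 3], ['juejin', np.nan, np.nan], ['a', 'b', np.nan],
                    [np.nan, np.nan, np.nan], ['d', 'j', 'h']],
                   columns=list('ABC'))

# how='all' drops only the rows in which every value is missing (row 3 here)
all_missing_dropped = df2.dropna(how='all')
print(all_missing_dropped, '\n')

# subset restricts the check to the listed columns
b_present = df2.dropna(subset=['B'])
print(b_present, '\n')

# axis=1 drops columns instead of rows; every column of df2 contains a NaN,
# so no columns remain
no_columns_left = df2.dropna(axis=1)
print(no_columns_left.shape)
```

The last call illustrates why axis=1 is rarely the right choice: one missing cell is enough to lose a whole variable.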
Filling / replacing missing values
Fill method: fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
- Parameter value: the fill value
- Parameter method: pad / ffill → fill with the preceding value, backfill / bfill → fill with the following value (recent pandas releases deprecate this argument in favor of the ffill() and bfill() methods)
Replace method: replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad', axis=None)
- Parameter to_replace: the value to be replaced
- Parameter value: the replacement value
Usage example:
import copy
df3=pd.DataFrame([[1,2,3],['juejin',np.nan,np.nan],['a','b',np.nan],['k',np.nan,np.nan],['d','j','h']],
                 columns=list('ABC'))
df4=copy.deepcopy(df3)
print(df3,'\n')
# Fill every missing value with 0
print(df3.fillna(0),'\n')
# Forward fill (equivalent to the older method='pad'): each missing value in column B takes the value before it
df3['B']=df3['B'].ffill()
print(df3,'\n')
# Replace missing values with replace
print(df4,'\n')
df4.replace(np.nan,'juejin',inplace=True)
print('Missing values replaced with juejin\n',df4)
Output is as follows:
        A    B    C
0       1    2    3
1  juejin  NaN  NaN
2       a    b  NaN
3       k  NaN  NaN
4       d    j    h

        A  B  C
0       1  2  3
1  juejin  0  0
2       a  b  0
3       k  0  0
4       d  j  h

        A  B    C
0       1  2    3
1  juejin  2  NaN
2       a  b  NaN
3       k  b  NaN
4       d  j    h

        A    B    C
0       1    2    3
1  juejin  NaN  NaN
2       a    b  NaN
3       k  NaN  NaN
4       d    j    h

Missing values replaced with juejin
        A       B       C
0       1       2       3
1  juejin  juejin  juejin
2       a       b  juejin
3       k  juejin  juejin
4       d       j       h
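The example above uses a single fill value and a forward fill. As a small sketch of two other options, the code below fills each column with its own value by passing a dict to fillna, and fills backward with bfill while capping the number of filled cells with limit. The fill values 'missing' and 0 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame([[1, 2, 3], ['juejin', np.nan, np.nan], ['a', 'b', np.nan],
                    ['k', np.nan, np.nan], ['d', 'j', 'h']],
                   columns=list('ABC'))

# A dict maps column names to fill values, so each column gets its own filler
filled = df3.fillna({'B': 'missing', 'C': 0})
print(filled, '\n')

# bfill() copies the next valid value backward; limit=1 fills at most one
# consecutive NaN, so only the cell right before 'h' in column C is filled
back_filled = df3['C'].bfill(limit=1)
print(back_filled)
```

Neither call modifies df3 itself; both return new objects, which is usually easier to reason about than inplace=True.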
Interpolating missing values
Filling missing values was covered above, but in real data processing a missing value is usually not filled with one arbitrary value for the whole table. Instead, each missing value is interpolated in a way suited to its context.
Some common, representative interpolation methods:
- Median / mode / mean imputation
- Nearest-value interpolation
- Lagrange interpolation
# Create a one-dimensional Series
s1=pd.Series([6,4,2,5,4,3,3,7,np.nan,3,9,np.nan,1])
print(s1,'\n')
med=s1.median()   # median
mod=s1.mode()[0]  # mode; mode() returns a Series, so take its first element
avg=s1.mean()     # mean
print('Median, mode, mean: %.2f, %.2f, %.2f'%(med,mod,avg))
# Fill with the mean as an example
print(s1.fillna(avg))
Output is as follows:
0     6.0
1     4.0
2     2.0
3     5.0
4     4.0
5     3.0
6     3.0
7     7.0
8     NaN
9     3.0
10    9.0
11    NaN
12    1.0
dtype: float64

Median, mode, mean: 4.00, 3.00, 4.27
0     6.000000
1     4.000000
2     2.000000
3     5.000000
4     4.000000
5     3.000000
6     3.000000
7     7.000000
8     4.272727
9     3.000000
10    9.000000
11    4.272727
12    1.000000
dtype: float64
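The article fills with the mean only. For comparison, here is a minimal sketch that fills the same series with the median and with the mode; the names filled_median and filled_mode are illustrative.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([6, 4, 2, 5, 4, 3, 3, 7, np.nan, 3, 9, np.nan, 1])

# mode() returns a Series because several values can tie for most frequent;
# taking the first element picks the smallest of the tied values
mode_value = s1.mode()[0]

filled_median = s1.fillna(s1.median())
filled_mode = s1.fillna(mode_value)

# Show only the two positions that were missing in s1
print(filled_median[[8, 11]])
print(filled_mode[[8, 11]])
```

The median and the mode are less sensitive to extreme values than the mean, which can matter when the data is skewed.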
Nearest-value interpolation
This was already covered in the section on filling missing values: the method parameter of fillna fills each missing position with the value just before it or just after it. See the df3 example above.
fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
Parameter method: pad / ffill → fill with the preceding value, backfill / bfill → fill with the following value
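As a short sketch of nearest-value filling on a numeric series, the code below applies both directions to the s1 series from the previous section. It calls the ffill() and bfill() methods directly, which behave like the pad and backfill options of the method parameter.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([6, 4, 2, 5, 4, 3, 3, 7, np.nan, 3, 9, np.nan, 1])

# Forward fill: each NaN takes the last valid value before it
forward = s1.ffill()

# Backward fill: each NaN takes the next valid value after it
backward = s1.bfill()

# Compare the two results at the positions that were missing
print(pd.DataFrame({'original': s1, 'ffill': forward, 'bfill': backward}).loc[[8, 11]])
```

Which direction is appropriate depends on the data, for example whether an earlier or a later observation is the better stand-in for the missing one.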
Lagrange polynomial
Many practical problems use a function to describe some intrinsic relationship or law, and many of these functions can only be learned through experiment and observation. For example, when a physical quantity is observed in practice and its value is recorded at several different points, Lagrange interpolation finds a polynomial that takes exactly the observed value at each observation point. Such a polynomial is called the Lagrange (interpolation) polynomial. Mathematically, Lagrange interpolation gives a polynomial function that passes exactly through a number of known points in the two-dimensional plane.
Due to space limitations, the calculation of the Lagrange polynomial is only outlined here.
From mathematics we know that, given n points in the plane with distinct x values, we can find a polynomial of degree n-1 that passes through them:
$$y = a_0 + a_1 x + a_2 x^2 + \dots + a_{n-1} x^{n-1}$$
Once the coordinates of the n points (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) are known, substituting them into the equation above gives a system of n linear equations:
$$y_i = a_0 + a_1 x_i + a_2 x_i^2 + \dots + a_{n-1} x_i^{n-1}, \quad i = 1, \dots, n$$
Solving this system yields the values of the parameters a_0, a_1, ..., a_{n-1}. With the parameters known, the equation becomes a function relating y and x: pass in a value of x and it returns the corresponding missing value of y (an approximation). The calculation described above is called Lagrange interpolation.
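To make the system of equations concrete, here is a minimal NumPy sketch that builds it for four sample points and solves for the coefficients. This is not how the article computes the polynomial; it only illustrates the substitution step described above. The points and the helper name poly are chosen for illustration.

```python
import numpy as np

# Four known points (x_i, y_i), so the polynomial has degree 3
x = np.array([3.0, 7.0, 8.0, 9.0])
y = np.array([6.0, 9.0, 5.0, 8.0])

# Row i of V is [1, x_i, x_i^2, x_i^3], i.e. the coefficients of
# a_0..a_3 in the i-th equation
V = np.vander(x, N=len(x), increasing=True)

# Solve V @ a = y for the parameters a_0..a_3
coeffs = np.linalg.solve(V, y)
print(coeffs)

def poly(t):
    """Evaluate a_0 + a_1*t + a_2*t^2 + a_3*t^3 (illustrative helper)."""
    return sum(a * t**k for k, a in enumerate(coeffs))

# The polynomial reproduces the known points and can be evaluated elsewhere
print([round(poly(t), 6) for t in x])
print(round(poly(12.0), 2))
```

Because the x values are distinct, the matrix V is invertible and the solution is unique, so any correct method of building the interpolating polynomial arrives at the same coefficients.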
Python has a very convenient library for Lagrange interpolation (scipy.interpolate); the code example below uses it directly.
We arbitrarily pick a set of points (3,6), (7,9), (8,5), (9,8) and compute the Lagrange polynomial that passes through them. Passing in the x value to interpolate then gives the corresponding y value.
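# Import the Lagrange interpolation function and the plotting package
from scipy.interpolate import lagrange
import matplotlib.pyplot as plt
%matplotlib inline
# Create a two-dimensional DataFrame with a missing value
s2=pd.DataFrame({'x':[3,7,12,8,9],'y':[6,9,np.nan,5,8]})
# x values of the known points
x=[3,7,8,9]
# y values of the known points
y=[6,9,5,8]
# Draw a scatter plot of these points
plt.scatter(x,y)
# Compute and print the interpolating polynomial
print(lagrange(x,y))
# Compute the interpolated value at x=12
print('Interpolated value at x=12: %.2f' % lagrange(x,y)(12))
The generated function (the 3 and 2 on the upper line are exponents, i.e. x³ and x²) and the interpolated value for the missing entry are printed as follows; the notebook also displays the scatter plot of the four points:
        3         2
0.7417 x - 14.3 x + 85.16 x - 140.8
Interpolated value at x=12: 103.50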
Therefore, for x = 12 the missing value can be filled with the interpolated value 103.50.
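The example computes the value by hand from hard-coded lists. As a sketch of how the result could be written back into the table, the code below derives the known points from s2 itself and fills the missing y. The variable names known, missing, and poly are illustrative. The columns are converted with to_numpy() because lagrange indexes its inputs by position, which a filtered pandas Series does not guarantee.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import lagrange

s2 = pd.DataFrame({'x': [3, 7, 12, 8, 9], 'y': [6, 9, np.nan, 5, 8]})

# Rows with a known y serve as the interpolation points
known = s2[s2['y'].notnull()]
poly = lagrange(known['x'].to_numpy(), known['y'].to_numpy())

# Evaluate the polynomial at the x of every row whose y is missing
missing = s2['y'].isnull()
s2.loc[missing, 'y'] = poly(s2.loc[missing, 'x'].to_numpy())
print(s2)
```

Note that x = 12 lies outside the range of the known x values (3 to 9), so this is an extrapolation, which is why the result is far from the observed y values of 5 to 9. Lagrange polynomials tend to behave much better for points inside the known range.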
Reproduced from: https://juejin.im/post/5d060839e51d455a694f951e