Python learning road: common pandas preprocessing operations

Python is widely used in data analysis largely thanks to the pandas library. Previous posts introduced the pandas data structures and their read/write operations; today we cover the preprocessing operations most commonly used in data analysis, namely:

(1) Handling missing values: dropna(), fillna()

(2) Handling duplicates: drop_duplicates()

(3) Discretization: cut(), qcut()

(4) Grouping and aggregation: groupby()

(5) Pivot tables: pivot_table()

(6) Sorting: sort_values()
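drop_duplicates() is listed above but not demonstrated in the sections that follow, so here is a minimal sketch on a made-up DataFrame (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})
# Drop fully duplicated rows, keeping the first occurrence
print(df.drop_duplicates())
# Deduplicate on column "a" only, keeping the last occurrence
print(df.drop_duplicates(subset=["a"], keep="last"))
```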

(1) Missing values

Missing values usually arise from non-standard registration or data entry, abnormal values, or fields simply left blank. In Python we typically use isnull(), info(), describe(), etc. to locate them. Common handling methods are: deletion (when missing values are rare enough that removal does not distort the data), filling with the mean (more sensitive to outliers), the median (less sensitive to outliers), or the mode (for categorical, character-type columns), forward/backward fill, interpolation, and so on. Of course, the best approach is to fill with the value closest to the truth.
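Forward fill, backward fill, and interpolation are mentioned above but not shown in the examples below; a minimal sketch on a toy Series:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0])
print(s.ffill())          # forward fill: propagate the last valid value -> 1, 1, 1, 4
print(s.bfill())          # backward fill: propagate the next valid value -> 1, 4, 4, 4
print(s.interpolate())    # linear interpolation between valid values -> 1, 2, 3, 4
```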

Common dropna parameters:

how: how to drop; "any" drops a row/column containing any null value, "all" drops it only when every value is null

axis: axis=0 operates on rows, axis=1 operates on columns

thresh: keep a row only if it has at least this many non-null values

# info() shows a summary of the data
import pandas as pd
data = pd.read_table(r"D:\迅雷下载\示例txt.txt",engine = "python",nrows= 10,index_col = 0)
print(data.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 1 to 9
Data columns (total 5 columns):
性别      7 non-null object
年龄      7 non-null float64
省内省外    8 non-null float64
消费金额    7 non-null float64
贷款与否    8 non-null float64
dtypes: float64(4), object(1)
memory usage: 432.0+ bytes
None

We can see the data set has nine rows in total, while each column has only 7 or 8 non-null values, which means every column contains at least one null value.

print(data.dropna(how = "all",axis =0))  # how="all", axis=0: drop a row only when every value in it is null
性别    年龄  省内省外   消费金额  贷款与否
用户id                              
1       男  60.0   1.0  311.0   0.0
2     NaN  25.0   1.0  220.0   1.0
3       男  47.0   1.0  246.0   0.0
4       女  52.0   0.0    NaN   0.0
5       女  21.0   0.0  916.0   0.0
6       男  37.0   0.0  980.0   1.0
7       男  34.0   0.0  482.0   1.0
8       男   NaN   0.0  267.0   0.0

print(data.dropna(how = "any",axis = 0)) # how="any", axis=0: drop a row when it contains any null value
性别    年龄  省内省外   消费金额  贷款与否
用户id                            
1     男  60.0   1.0  311.0   0.0
3     男  47.0   1.0  246.0   0.0
5     女  21.0   0.0  916.0   0.0
6     男  37.0   0.0  980.0   1.0
7     男  34.0   0.0  482.0   1.0

print(data.dropna(how = "any",axis = 1)) # how="any", axis=1: drop a column when it contains any null value
Empty DataFrame
Columns: []
Index: [1, 2, 3, 4, 5, 6, 7, 8, 9]

data.dropna(how = "all",axis = 0,inplace = True)       # inplace=True replaces the original data with the dropna result
print(data)
性别    年龄  省内省外   消费金额  贷款与否
用户id                              
1       男  60.0   1.0  311.0   0.0
2     NaN  25.0   1.0  220.0   1.0
3       男  47.0   1.0  246.0   0.0
4       女  52.0   0.0    NaN   0.0
5       女  21.0   0.0  916.0   0.0
6       男  37.0   0.0  980.0   1.0
7       男  34.0   0.0  482.0   1.0
8       男   NaN   0.0  267.0   0.0

# thresh: keep a row only when it has at least this many non-null values
print(data.dropna(thresh = 3))
性别    年龄  省内省外   消费金额  贷款与否
用户id                              
1       男  60.0   1.0  311.0   0.0
2     NaN  25.0   1.0  220.0   1.0
3       男  47.0   1.0  246.0   0.0
4       女  52.0   0.0    NaN   0.0
5       女  21.0   0.0  916.0   0.0
6       男  37.0   0.0  980.0   1.0
7       男  34.0   0.0  482.0   1.0
8       男   NaN   0.0  267.0   0.0

# Fill with a fixed value
print(data.fillna(0))
性别    年龄  省内省外   消费金额  贷款与否
用户id                            
1     男  60.0   1.0  311.0   0.0
2     0  25.0   1.0  220.0   1.0
3     男  47.0   1.0  246.0   0.0
4     女  52.0   0.0    0.0   0.0
5     女  21.0   0.0  916.0   0.0
6     男  37.0   0.0  980.0   1.0
7     男  34.0   0.0  482.0   1.0
8     男   0.0   0.0  267.0   0.0

# Fill several columns at once with a dict
import numpy as np
data.fillna({"年龄":np.median(data["年龄"][data["年龄"].notnull()]),"消费金额":np.mean(data["消费金额"][data["消费金额"].notnull()])},inplace = True)
print(data)
性别    年龄  省内省外        消费金额  贷款与否
用户id                                   
1       男  60.0   1.0  311.000000   0.0
2     NaN  25.0   1.0  220.000000   1.0
3       男  47.0   1.0  246.000000   0.0
4       女  52.0   0.0  488.857143   0.0
5       女  21.0   0.0  916.000000   0.0
6       男  37.0   0.0  980.000000   1.0
7       男  34.0   0.0  482.000000   1.0
8       男  37.0   0.0  267.000000   0.0

# Find the mode with value_counts()
print(data["性别"].value_counts())
男    5
女    2
Name: 性别, dtype: int64

# Find the mode with describe()
print(data.describe(include = object))     # include=object: summarize only string (object) columns
性别
count   7
unique  2
top     男
freq    5

# Fill null values with replace()
data["性别"].replace(np.nan,"男",inplace = True)
print(data)
性别    年龄  省内省外        消费金额  贷款与否
用户id                                 
1     男  60.0   1.0  311.000000   0.0
2     男  25.0   1.0  220.000000   1.0
3     男  47.0   1.0  246.000000   0.0
4     女  52.0   0.0  488.857143   0.0
5     女  21.0   0.0  916.000000   0.0
6     男  37.0   0.0  980.000000   1.0
7     男  34.0   0.0  482.000000   1.0
8     男  37.0   0.0  267.000000   0.0

# Lagrange interpolation (row index is the default RangeIndex, starting at 0)
from scipy.interpolate import lagrange
def p_col(a, b, k=5):         # a is the series, b is the null position, k is how many neighbors on each side (default 5)
    x = a[list(range(b-k, b)) + list(range(b+1, b+k+1))]    # take k values before and after position b
    x = x[x.notnull()]       # keep only non-null values
    return lagrange(x.index, list(x))(b)
df = pd.read_table(r"D:\迅雷下载\示例txt.txt",engine = "python")
for j in range(len(df)):
    if df["消费金额"].isnull()[j]:     # find the positions that need interpolation
        df.loc[j, "消费金额"] = p_col(df["消费金额"], j)    # .loc avoids chained-assignment warnings
print(df["消费金额"])
0     311.000000
1     220.000000
2     246.000000
3     529.257143
4     916.000000
5     980.000000
6     482.000000
7     267.000000
8    2264.257143
Name: 消费金额, dtype: float64

# A custom function that fills columns of different dtypes differently
def insert_data(x):
    for i in x.columns:
        if x[i].dtype == "object":
            x[i] = x[i].fillna(x[i].value_counts().idxmax())
        elif x[i].dtype == "float64":
            x[i] = x[i].fillna(x[i][x[i].notnull()].mean())
        elif x[i].dtype == "int64":
            x[i] = x[i].fillna(x[i][x[i].notnull()].median())
insert_data(data)
print(data)
性别    年龄  省内省外        消费金额  贷款与否
用户id                                 
1     男  60.0   1.0  311.000000   0.0
2     男  25.0   1.0  220.000000   1.0
3     男  47.0   1.0  246.000000   0.0
4     女  52.0   0.0  488.857143   0.0
5     女  21.0   0.0  916.000000   0.0
6     男  37.0   0.0  980.000000   1.0
7     男  34.0   0.0  482.000000   1.0
8     男  37.0   0.0  267.000000   0.0

(2) Discretization

Discretization maps continuous attribute values into intervals, then represents each interval with a distinct character or numeric label. Common methods are the equal-width method, the equal-frequency method, and clustering-based methods; clustering is not covered here.

# Use cut() to bin into custom intervals
data["年龄分层_bins"] = pd.cut(data["年龄"],bins = [0,25,35,45,55,65])
print(data["年龄分层_bins"].value_counts())
(45, 55]    2
(35, 45]    2
(0, 25]     2
(55, 65]    1
(25, 35]    1
Name: 年龄分层_bins, dtype: int64

# Use cut() to bin into equal-width intervals
data["年龄分层_num"] = pd.cut(data["年龄"],4)
print(data["年龄分层_num"].value_counts())
(30.75, 40.5]      3
(50.25, 60.0]      2
(20.961, 30.75]    2
(40.5, 50.25]      1
Name: 年龄分层_num, dtype: int64

# Use qcut() to bin into equal-frequency intervals, i.e. each interval holds the same number of samples
data["年龄分层_qcut1"] = pd.qcut(data["年龄"],4)
print(data["年龄分层_qcut1"].value_counts())
(31.75, 37.0]      3
(48.25, 60.0]      2
(20.999, 31.75]    2
(37.0, 48.25]      1
Name: 年龄分层_qcut1, dtype: int64

Here we used qcut() for equal-frequency binning, so in theory every interval holds the same number of samples. But because ages repeat (the median fill produced two values of 37, which must fall into the same quantile interval), the counts come out uneven, and one interval ends up with three samples.
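A related caveat: when repeated values make quantile edges coincide exactly, qcut() raises a ValueError unless duplicates="drop" is passed; a minimal sketch:

```python
import pandas as pd

s = pd.Series([1, 1, 1, 1, 2, 3])
# The 25% and 50% quantiles both equal 1 here; duplicates="drop"
# merges the colliding bin edges instead of raising an error.
binned = pd.qcut(s, 4, duplicates="drop")
print(binned.value_counts())
```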

(3) Grouping and aggregation

Grouping is usually used together with aggregation functions. Let's look at grouping:

# Group by sex and sum the consumption amount
print(data.groupby("性别")["消费金额"].sum())
性别
女    1404.857143
男    2506.000000
Name: 消费金额, dtype: float64

# aggregate() can apply several aggregation functions at once
print(data.groupby("性别")[["年龄","消费金额"]].aggregate({"年龄": "mean", "消费金额": "sum"}))
年龄         消费金额
性别                   
女   36.5  1404.857143
男   40.0  2506.000000

# Besides grouping by an existing column, you can group by a custom binning
age_cut = pd.cut(data["年龄"],bins = [0,23,35,45,55,65])
print(data["消费金额"].groupby(age_cut).sum())
年龄
(0, 23]      916.000000
(23, 35]     702.000000
(35, 45]    1247.000000
(45, 55]     734.857143
(55, 65]     311.000000
Name: 消费金额, dtype: float64

(4) Sorting and pivot tables

This operation needs little introduction: it is the pivot table we all know from Excel. Common parameters are:

index: row index

columns: column index

values: the values to aggregate

aggfunc: aggregation function

fill_value: value used to fill empty cells

dropna: whether to drop columns whose entries are all null, True/False

margins: add row/column totals, True/False

margins_name: name of the totals row/column
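fill_value and margins do not appear in the examples that follow; a minimal sketch on a made-up DataFrame (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"sex": ["M", "M", "F"],
                   "region": ["in", "out", "in"],
                   "amount": [100, 200, 300]})
pt = pd.pivot_table(df, index="sex", columns="region",
                    values="amount", aggfunc="sum",
                    fill_value=0,        # fill empty cells with 0 instead of NaN
                    margins=True,        # add an "All" row/column of totals
                    margins_name="All")
print(pt)
```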

print(pd.pivot_table(data,index = "性别",values = "消费金额",aggfunc = ["mean","sum"]))
         mean       sum
       消费金额     消费金额
性别                         
女   702.428571  1404.857143
男   417.666667  2506.000000

print(pd.pivot_table(data,index = "年龄分层_num",columns = "性别",values = "消费金额",aggfunc = ["mean","sum","count"]))
                        mean                     sum         count     
性别               女           男           女       男     女    男
年龄分层_num                                                              
(20.961, 30.75]  916.000000  220.000000  916.000000   220.0   1.0  1.0
(30.75, 40.5]           NaN  576.333333         NaN  1729.0   NaN  3.0
(40.5, 50.25]           NaN  246.000000         NaN   246.0   NaN  1.0
(50.25, 60.0]    488.857143  311.000000  488.857143   311.0   1.0  1.0

# Sort by a single column
print(data.iloc[:,:6].sort_values(by = "消费金额",ascending =True))  # ascending=True sorts in ascending order
   性别    年龄  省内省外  消费金额  贷款与否 年龄分层_bins
用户id                                           
2     男  25.0   1.0  220.000000   1.0   (0, 25]
3     男  47.0   1.0  246.000000   0.0  (45, 55]
8     男  37.0   0.0  267.000000   0.0  (35, 45]
1     男  60.0   1.0  311.000000   0.0  (55, 65]
7     男  34.0   0.0  482.000000   1.0  (25, 35]
4     女  52.0   0.0  488.857143   0.0  (45, 55]
5     女  21.0   0.0  916.000000   0.0   (0, 25]
6     男  37.0   0.0  980.000000   1.0  (35, 45]

# Sort by multiple columns
print(data.iloc[:,:6].sort_values(by = ["省内省外","贷款与否","消费金额"],ascending =[True,False,True]))
  性别    年龄  省内省外    消费金额  贷款与否 年龄分层_bins
用户id                                           
7     男  34.0   0.0  482.000000   1.0  (25, 35]
6     男  37.0   0.0  980.000000   1.0  (35, 45]
8     男  37.0   0.0  267.000000   0.0  (35, 45]
4     女  52.0   0.0  488.857143   0.0  (45, 55]
5     女  21.0   0.0  916.000000   0.0   (0, 25]
2     男  25.0   1.0  220.000000   1.0   (0, 25]
3     男  47.0   1.0  246.000000   0.0  (45, 55]
1     男  60.0   1.0  311.000000   0.0  (55, 65]

Origin blog.csdn.net/d345389812/article/details/88873061