Pandas - Data Wrangling


1 Data Wrangling: Combining Datasets

1.1 Inner, outer, left, and right joins with merge

  • merge performs an "inner" join by default: only the intersection of the keys is kept, and rows without a match are dropped
  • If no join column is specified, merge uses the overlapping column names as the join keys; it is usually better to specify the column explicitly with on
import pandas as pd
from pandas import Series,DataFrame
import numpy as np
from numpy import nan as NA
df1 = DataFrame({'key':['b','b','a','c','a','a','b'],
               'data1':range(7)})
df2 = DataFrame({'key':list('abd'),
               'data2':range(3)})
pd.merge(df1,df2)
       data1 key  data2
    0      0   b      1
    1      1   b      1
    2      6   b      1
    3      2   a      0
    4      4   a      0
    5      5   a      0
pd.merge(df1,df2,on='key')  # how='inner' is the default
       data1 key  data2
    0      0   b      1
    1      1   b      1
    2      6   b      1
    3      2   a      0
    4      4   a      0
    5      5   a      0
pd.merge(df1,df2,on='key',how='outer')
       data1 key  data2
    0    0.0   b    1.0
    1    1.0   b    1.0
    2    6.0   b    1.0
    3    2.0   a    0.0
    4    4.0   a    0.0
    5    5.0   a    0.0
    6    3.0   c    NaN
    7    NaN   d    2.0
  • how='left': keep every row of the left dataset; values from the right dataset are filled in only where the key matches, otherwise NaN
  • how='right': keep every row of the right dataset; values from the left dataset are filled in only where the key matches, otherwise NaN
pd.merge(df1,df2,on='key',how='left')
       data1 key  data2
    0      0   b    1.0
    1      1   b    1.0
    2      2   a    0.0
    3      3   c    NaN
    4      4   a    0.0
    5      5   a    0.0
    6      6   b    1.0
pd.merge(df1,df2,on='key',how='right')
       data1 key  data2
    0    0.0   b      1
    1    1.0   b      1
    2    6.0   b      1
    3    2.0   a      0
    4    4.0   a      0
    5    5.0   a      0
    6    NaN   d      2

If the two objects use different column names for the join key, the names can be specified separately:

df3 = DataFrame({'lkey':['b','b','a','c','a','a','b'],
               'data1':range(7)})
df4 = DataFrame({'rkey':list('abd'),
               'data2':range(3)})
pd.merge(df3,df4,left_on='lkey',right_on='rkey')
       data1 lkey  data2 rkey
    0      0    b      1    b
    1      1    b      1    b
    2      6    b      1    b
    3      2    a      0    a
    4      4    a      0    a
    5      5    a      0    a

Summary of the merge method

'''
pd.merge(df1, df2, on/left_on, right_on, how='')
on:
    the column to join on; if omitted, the overlapping column names are used as the keys
left_on, right_on:
    use these to name the join columns when the two frames share no column name
how='':
    (1) inner    inner join   keep the intersection of the keys (the default)
    (2) outer    outer join   keep the union of the keys
    (3) left     left join    use df1 as the primary "table"
    (4) right    right join   use df2 as the primary "table"

'''
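
For a quick side-by-side look at the four join types, here is a minimal sketch on two small frames (left_df and right_df are hypothetical names invented for this example; key order may differ slightly across pandas versions):

left_df = DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right_df = DataFrame({'key': ['b', 'c', 'd'], 'rval': [4, 5, 6]})
for how in ['inner', 'outer', 'left', 'right']:
    # show which key values survive under each join type
    print(how, '->', list(pd.merge(left_df, right_df, on='key', how=how)['key']))
    inner -> ['b', 'c']
    outer -> ['a', 'b', 'c', 'd']
    left -> ['a', 'b', 'c']
    right -> ['b', 'c', 'd']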

1.2 Concatenating Series

1.2.1 concat()

  • concat joins along axis=0 (vertically) by default, stacking the Series one after another
  • With axis=1 the Series are joined horizontally from left to right; the row index becomes the union of all the input indexes and the result is a DataFrame
  • keys can be passed when concatenating to label each piece of the result
s1 = Series([0,1],index=["a","b"])
s2 = Series([2,3,4],index=list("cde"))
s3 = Series([5,6],index=list("fg"))
pd.concat([s1,s2,s3])  # axis=0 is the default
    a    0
    b    1
    c    2
    d    3
    e    4
    f    5
    g    6
    dtype: int64
# join the Series horizontally
pd.concat([s1,s2,s3],axis=1)
         0    1    2
    a  0.0  NaN  NaN
    b  1.0  NaN  NaN
    c  NaN  2.0  NaN
    d  NaN  3.0  NaN
    e  NaN  4.0  NaN
    f  NaN  NaN  5.0
    g  NaN  NaN  6.0
  • The join='inner' argument of concat keeps only the intersection of the indexes
s4 = pd.concat([s1*5,s3])
s4
    a    0
    b    5
    f    5
    g    6
    dtype: int64
pd.concat([s1,s4],axis=1)
         0  1
    a  0.0  0
    b  1.0  5
    f  NaN  5
    g  NaN  6
# join='inner' keeps the intersection
pd.concat([s1,s4],axis=1,join='inner')
       0  1
    a  0  0
    b  1  5
  • Passing a list as keys creates a hierarchical index along the concatenation axis (with axis=1 the keys become column headers instead, as shown in the sketch after this example)
# create a hierarchical index
result = pd.concat([s1,s2,s3],keys=['one','two','three'])
result
    one    a    0
           b    1
    two    c    2
           d    3
           e    4
    three  f    5
           g    6
    dtype: int64
result.unstack()
             a    b    c    d    e    f    g
    one    0.0  1.0  NaN  NaN  NaN  NaN  NaN
    two    NaN  NaN  2.0  3.0  4.0  NaN  NaN
    three  NaN  NaN  NaN  NaN  NaN  5.0  6.0

1.2.2 Combining overlapping data: combine_first & append

a = Series([NA,2.5,NA,3.5,4.5,NA],index=list("fedcba"))
b = Series(np.arange(len(a)),dtype=np.float64,index=list("fedcba"))
print(a,b)
b.iloc[-1] = np.nan   # set the last element (label 'a') to NaN
b[:-2].combine_first(a[2:])
  a:
    f    NaN
    e    2.5
    d    NaN
    c    3.5
    b    4.5
    a    NaN
  b:
    f    0.0
    e    1.0
    d    2.0
    c    3.0
    b    4.0
    a    5.0


    a    NaN
    b    4.5
    c    3.0
    d    2.0
    e    1.0
    f    0.0
    dtype: float64
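
combine_first patches the missing values in the calling object with the corresponding values from the argument. For fully aligned indexes it behaves roughly like the following NumPy expression (a minimal sketch using a and b as they stand after the NaN assignment above):

Series(np.where(pd.isnull(a), b, a), index=a.index)
    f    0.0
    e    2.5
    d    2.0
    c    3.5
    b    4.5
    a    NaN
    dtype: float64

append, used next, simply concatenates the two Series end to end, keeping any duplicate index labels.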
a.append(b)
    f    NaN
    e    2.5
    d    NaN
    c    3.5
    b    4.5
    a    NaN
    f    0.0
    e    1.0
    d    2.0
    c    3.0
    b    4.0
    a    NaN
    dtype: float64
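
Note that Series.append has since been deprecated (and removed in pandas 2.0); pd.concat gives the same result:

pd.concat([a, b])   # equivalent to a.append(b)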

2 Data Wrangling: Reshaping and Axis Selection

2.1 Hierarchical indexing

Hierarchical indexing is an important feature of pandas: it lets you have multiple (two or more) index levels on an axis. Put abstractly, it lets you work with higher-dimensional data in a lower-dimensional form.

# a hierarchically indexed Series
data =Series(np.random.randn(10),
             index=[list("aaabbbccdd"),[1,2,3,1,2,3,1,2,2,3]])
data
    a  1   -0.467731
       2   -1.288590
       3   -2.002361
    b  1   -0.075169
       2   -0.666990
       3    0.725769
    c  1    0.583614
       2    0.866867
    d  2    0.424541
       3    0.888412
    dtype: float64
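
With a hierarchical index, partial indexing selects subsets concisely. A minimal sketch on the data Series above (the values returned will differ because the data are random):

data['b']        # all rows under outer label 'b'
data['b':'c']    # slice on the outer level
data[:, 2]       # all rows whose inner label is 2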

2.2 Reshaping with a hierarchical index

By default, unstack operates on the innermost level (so does stack). To operate on a different level, pass its level number or name.

# pivot the inner index level into columns
data.unstack()
              1         2         3
    a -0.467731 -1.288590 -2.002361
    b -0.075169 -0.666990  0.725769
    c  0.583614  0.866867       NaN
    d       NaN  0.424541  0.888412
data.unstack(0)
              a         b         c         d
    1 -0.467731 -0.075169  0.583614       NaN
    2 -1.288590 -0.666990  0.866867  0.424541
    3 -2.002361  0.725769       NaN  0.888412
data = DataFrame(np.arange(6).reshape(2,3),
                 index=pd.Index(["Ohio","Colorado"],name="state"),
                 columns=pd.Index(["one","two","three"],name="numbers"))
data
    numbers   one  two  three
    state
    Ohio        0    1      2
    Colorado    3    4      5
# pivot the columns into rows
result = data.stack()
result
    state     numbers
    Ohio      one        0
              two        1
              three      2
    Colorado  one        3
              two        4
              three      5
    dtype: int32
# index into the outer level
result['Ohio']
    numbers
    one      0
    two      1
    three    2
    dtype: int32
# pivot a row level back into columns
result.unstack('state')
    state    Ohio  Colorado
    numbers
    one         0         3
    two         1         4
    three       2         5
df = DataFrame({'left':result,'right':result+5},
              columns=pd.Index(['left','right'],
                              name='side'))
df
    side              left  right
    state    numbers
    Ohio     one         0      5
             two         1      6
             three       2      7
    Colorado one         3      8
             two         4      9
             three       5     10
df.index
    MultiIndex(levels=[['Ohio', 'Colorado'], ['one', 'two', 'three']],
               labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]],
               names=['state', 'numbers'])
temp = df.unstack('state')
temp
    side     left          right
    state    Ohio Colorado  Ohio Colorado
    numbers
    one         0        3     5        8
    two         1        4     6        9
    three       2        5     7       10
temp.left
    state    Ohio  Colorado
    numbers
    one         0         3
    two         1         4
    three       2         5
temp.stack('side')  # 'side' becomes the new innermost row index level
    state          Colorado  Ohio
    numbers side
    one     left          3     0
            right         8     5
    two     left          4     1
            right         9     6
    three   left          5     2
            right        10     7
  • If not all of the level values are found in each group, unstacking may introduce missing data
  • stack filters out missing data by default, so the round trip is reversible (pass dropna=False to keep the NaNs)
s1 = Series([0,1,2,3],index=list('abcd'))
s2 = Series([4,5,6],index=list('cde'))
data2 = pd.concat([s1,s2],keys=['one','two'])
data2.unstack()
           a    b    c    d    e
    one  0.0  1.0  2.0  3.0  NaN
    two  NaN  NaN  4.0  5.0  6.0
data2.unstack().stack()
    one  a    0.0
         b    1.0
         c    2.0
         d    3.0
    two  c    4.0
         d    5.0
         e    6.0
    dtype: float64
data2.unstack().stack(dropna=False)
    one  a    0.0
         b    1.0
         c    2.0
         d    3.0
         e    NaN
    two  a    NaN
         b    NaN
         c    4.0
         d    5.0
         e    6.0
    dtype: float64

3 Data Wrangling: Data Transformation

3.1 Removing duplicates

data = DataFrame({'k1':['one']*3+['two']*4,
                'k2':[1,1,2,3,3,4,4]})
data
        k1  k2
    0  one   1
    1  one   1
    2  one   2
    3  two   3
    4  two   3
    5  two   4
    6  two   4

3.1.1 The duplicated() method

The DataFrame duplicated method returns a boolean Series indicating whether each row is a duplicate of one seen earlier.

data.duplicated()
    0    False
    1     True
    2    False
    3    False
    4     True
    5    False
    6     True
    dtype: bool

3.1.2 The drop_duplicates() method

The drop_duplicates method returns a DataFrame with the duplicate rows removed.

data.drop_duplicates()
        k1  k2
    0  one   1
    2  one   2
    3  two   3
    5  two   4
data['v1']=range(7)
data
        k1  k2  v1
    0  one   1   0
    1  one   1   1
    2  one   2   2
    3  two   3   3
    4  two   3   4
    5  two   4   5
    6  two   4   6
data.drop_duplicates(['k1'])
        k1  k2  v1
    0  one   1   0
    3  two   3   3
# drop_duplicates keeps the first occurrence by default; control this with keep='first'/'last'
data.drop_duplicates(['k1','k2'],keep='last')
        k1  k2  v1
    1  one   1   1
    2  one   2   2
    4  two   3   4
    6  two   4   6

3.2 Transforming data with functions and mappings

foods = DataFrame({"food":["bacon","pulled pork","bacon","Pastrami",
                           "corned beef","Bacon","pastrami","honey ham",
                           "nova lox"],
                   "ounces":[4,3,12,6,7.5,8,3,5,6]})
meat_to_animal = {"bacon":"pig",
                  "pulled pork":"pig",
                  "pastrami":"cow",
                 "corned beef":"cow",
                 "honey ham":"pig",
                 "nova lox":"salmon"}
foods
              food  ounces
    0        bacon     4.0
    1  pulled pork     3.0
    2        bacon    12.0
    3     Pastrami     6.0
    4  corned beef     7.5
    5        Bacon     8.0
    6     pastrami     3.0
    7    honey ham     5.0
    8     nova lox     6.0

1. First write a mapping.
2. Then apply it with map (lower-casing the food names first so the capitalized entries match the mapping keys):

foods['animal'] = foods['food'].map(str.lower).map(meat_to_animal)
foods
              food  ounces  animal
    0        bacon     4.0     pig
    1  pulled pork     3.0     pig
    2        bacon    12.0     pig
    3     Pastrami     6.0     cow
    4  corned beef     7.5     cow
    5        Bacon     8.0     pig
    6     pastrami     3.0     cow
    7    honey ham     5.0     pig
    8     nova lox     6.0  salmon

Alternatively, pass a single function that does all of the work:

foods['food'].map(lambda x : meat_to_animal[x.lower()])
    0       pig
    1       pig
    2       pig
    3       cow
    4       cow
    5       pig
    6       cow
    7       pig
    8    salmon
    Name: food, dtype: object
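
The vectorized string methods under .str give an equivalent, arguably more idiomatic, spelling of the lower-casing step (a minimal sketch, producing the same animal column as above):

foods['food'].str.lower().map(meat_to_animal)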

3.3 Splitting data into bins

3.3.1 Binning data by intervals: pd.cut()

ages = [20,22,25,27,21,23,37,31,61,45,41,32]
bins = [18,25,35,60,100]
cats = pd.cut(ages,bins)
cats
    [(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
    Length: 12
    Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
# which bin each sample falls into, numbered from zero
cats.codes
    array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
pd.value_counts(cats)
    (18, 25]     5
    (35, 60]     3
    (25, 35]     3
    (60, 100]    1
    dtype: int64
# bin the data and attach a label to each bin
group_names = ['Youth','YoungAdult','middleAged','senior']
new_cats = pd.cut(ages,bins,labels=group_names)
new_cats
    [Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, senior, middleAged, middleAged, YoungAdult]
    Length: 12
    Categories (4, object): [Youth < YoungAdult < middleAged < senior]
pd.value_counts(new_cats)
    Youth         5
    middleAged    3
    YoungAdult    3
    senior        1
    dtype: int64
import matplotlib.pyplot as plt
plt.hist(new_cats,histtype='bar',density=False)
plt.show()

[Figure: histogram of the age-group counts]
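
Since new_cats is categorical, a bar chart of its value counts is often the clearer plot (a minimal sketch):

pd.value_counts(new_cats).plot(kind='bar')
plt.show()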

data = np.random.rand(10)
pd.cut(data,4,precision=2)  # precision: how many decimal digits to keep for the bin edges
    [(0.04, 0.19], (0.04, 0.19], (0.34, 0.48], (0.48, 0.63], (0.48, 0.63], (0.19, 0.34], (0.48, 0.63], (0.19, 0.34], (0.04, 0.19], (0.04, 0.19]]
    Categories (4, interval[float64]): [(0.04, 0.19] < (0.19, 0.34] < (0.34, 0.48] < (0.48, 0.63]]
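
Two other commonly used pd.cut options, shown here as a minimal sketch reusing the ages and bins defined above: right=False makes the intervals closed on the left instead of the right, and labels=False returns just the integer bin codes.

pd.cut(ages, bins, right=False)    # intervals such as [18, 25), [25, 35), ...
pd.cut(ages, bins, labels=False)   # an array of bin codes, like cats.codes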

3.3.2 Binning data by quantiles: pd.qcut()

data = np.random.rand(10)
cat = pd.qcut(data,4)  # cut into quartiles
cat
    [(0.675, 0.856], (0.37, 0.675], (0.675, 0.856], (0.856, 0.971], (0.103, 0.37], (0.103, 0.37], (0.103, 0.37], (0.37, 0.675], (0.856, 0.971], (0.856, 0.971]]
    Categories (4, interval[float64]): [(0.103, 0.37] < (0.37, 0.675] < (0.675, 0.856] < (0.856, 0.971]]
pd.value_counts(cat,sort=True)
    (0.856, 0.971]    3
    (0.103, 0.37]     3
    (0.675, 0.856]    2
    (0.37, 0.675]     2
    dtype: int64
condition = np.where((data>0.856)&(data<=0.971))
condition
    (array([3, 8, 9], dtype=int64),)
data[condition]
    array([ 0.90659059,  0.93586907,  0.97081758])
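
The same selection can be written directly with a boolean mask, without np.where:

data[(data > 0.856) & (data <= 0.971)]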
# as with cut, custom quantiles can be supplied (numbers between 0 and 1, endpoints included)
new_cat = pd.qcut(data,[0,0.1,0.5,0.9,1.0])
new_cat
    [(0.675, 0.939], (0.31, 0.675], (0.675, 0.939], (0.675, 0.939], (0.31, 0.675], (0.103, 0.31], (0.31, 0.675], (0.31, 0.675], (0.675, 0.939], (0.939, 0.971]]
    Categories (4, interval[float64]): [(0.103, 0.31] < (0.31, 0.675] < (0.675, 0.939] < (0.939, 0.971]]
pd.value_counts(new_cat)
    (0.675, 0.939]    4
    (0.31, 0.675]     4
    (0.939, 0.971]    1
    (0.103, 0.31]     1
    dtype: int64

3.4 Detecting and filtering outliers

Filtering or transforming outliers is largely a matter of array operations.

data =DataFrame(np.random.randn(1000,4))
data.describe()
                     0            1            2            3
    count  1000.000000  1000.000000  1000.000000  1000.000000
    mean     -0.020880     0.001643    -0.019453    -0.026122
    std       0.980023     1.025441     0.995088     0.960486
    min      -3.108915    -3.645860    -3.481593    -3.194414
    25%      -0.697479    -0.697678    -0.694020    -0.700987
    50%      -0.005279     0.031774    -0.014728    -0.038483
    75%       0.618116     0.690065     0.651287     0.649747
    max       2.859053     3.189940     3.525865     3.023720
# find the values in column 3 whose absolute value exceeds 3
col = data[3]
col[np.abs(col)>3]
    97     3.927528
    305   -3.399312
    400   -3.745356
    Name: 3, dtype: float64
# to select the rows containing any value whose absolute value exceeds 3, use any
data[(np.abs(data)>3).any(axis=1)]
                0         1         2         3
    46  -0.658090 -0.207434  3.525865  0.283070
    67   0.599947 -3.645860  0.255475 -0.549574
    289 -1.559625  0.336788 -3.333767 -1.240685
    371 -1.116332 -3.018842 -0.298748  0.406954
    396 -3.108915  1.117755 -0.152780 -0.340173
    526  1.188742 -3.183867  1.050471 -1.042736
    573 -2.214074 -3.140963 -1.509976 -0.389818
    738 -0.088202  1.090038 -0.848098 -3.194414
    768  0.474358  0.003349 -0.011807  3.023720
    797  2.368010  0.452649 -3.481593  0.789944
    966  0.164293  3.082067 -0.516982  0.251909
    994 -0.843849  3.189940  0.070978  0.516982
# get the sign (+1/-1) of each value
np.sign(data).head(10)
         0    1    2    3
    0  1.0 -1.0  1.0 -1.0
    1 -1.0  1.0 -1.0 -1.0
    2 -1.0  1.0 -1.0  1.0
    3  1.0  1.0 -1.0  1.0
    4 -1.0 -1.0 -1.0  1.0
    5 -1.0  1.0 -1.0 -1.0
    6  1.0  1.0 -1.0  1.0
    7  1.0  1.0  1.0  1.0
    8 -1.0  1.0  1.0  1.0
    9  1.0 -1.0  1.0 -1.0
# cap the outliers: values outside [-3, 3] are set to -3 or 3 according to their sign
data[np.abs(data)>3] = np.sign(data)*3
data.describe()
                     0            1            2            3
    count  1000.000000  1000.000000  1000.000000  1000.000000
    mean     -0.020772     0.002361    -0.019163    -0.025951
    std       0.979685     1.021487     0.990725     0.959788
    min      -3.000000    -3.000000    -3.000000    -3.000000
    25%      -0.697479    -0.697678    -0.694020    -0.700987
    50%      -0.005279     0.031774    -0.014728    -0.038483
    75%       0.618116     0.690065     0.651287     0.649747
    max       2.859053     3.000000     3.000000     3.000000
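
The same capping can be written more compactly with clip, which bounds all values to the given limits (a minimal sketch; run on the original, uncapped frame it gives the same result as the assignment above):

data.clip(-3, 3).describe()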

Reprinted from blog.csdn.net/qq_27171347/article/details/81488383