pandas (4): Concatenation and merging in pandas

Joining operations in pandas

pandas joining operations come in two flavors:

  • Concatenation: pd.concat, DataFrame.append
  • Merging: pd.merge, DataFrame.join

0. Review: concatenation in NumPy

In [1]:
import numpy as np
In [2]:
nd1 = np.array([1,2,3])
nd2 = np.array([-1,-2,-3,-4])
In [3]:
np.concatenate([nd1,nd2],axis=0)
Out[3]:
array([ 1,  2,  3, -1, -2, -3, -4])
In [4]:
nd3 = np.array([[1,2,3],[4,5,6]])
nd3
Out[4]:
array([[1, 2, 3],
       [4, 5, 6]])
In [5]:
np.concatenate([nd1,nd3],axis=1) # arrays with different numbers of dimensions cannot be concatenated
---------------------------------------------------------------------------
AxisError                                 Traceback (most recent call last)
<ipython-input-5-8f0014705afb> in <module>()
----> 1 np.concatenate([nd1,nd3],axis=1) # arrays with different numbers of dimensions cannot be concatenated

AxisError: axis 1 is out of bounds for array of dimension 1

In [6]:
nd4 = np.random.randint(0,10,size=(3,3))
nd4
Out[6]:
array([[4, 6, 1],
       [9, 3, 7],
       [9, 6, 3]])
In [7]:
np.concatenate([nd3,nd4],axis=0)
Out[7]:
array([[1, 2, 3],
       [4, 5, 6],
       [4, 6, 1],
       [9, 3, 7],
       [9, 6, 3]])
In [8]:
nd3 + nd4
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-abc835f3e1d9> in <module>()
----> 1 nd3 + nd4

ValueError: operands could not be broadcast together with shapes (2,3) (3,3) 

In [9]:
nd1 + nd3 # the shapes differ, but broadcasting makes this work
Out[9]:
array([[2, 4, 6],
       [5, 7, 9]])

============================================

Exercise 12:

  1. Generate two 3×3 matrices and concatenate them along each of the two axes. (A sketch follows below.)

============================================

In [ ]:
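A possible solution for Exercise 12 (a sketch; any two 3×3 matrices will do):

a = np.random.randint(0, 10, size=(3, 3))
b = np.random.randint(0, 10, size=(3, 3))
np.concatenate([a, b], axis=0)   # stack vertically: result has shape (6, 3)
np.concatenate([a, b], axis=1)   # stack horizontally: result has shape (3, 6)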

To make the explanations easier, we first define a helper function that generates a DataFrame:

In [10]:
import pandas as pd
from pandas import DataFrame,Series
In [11]:
# define a function that builds a DataFrame whose values are derived from the row and column labels
def make_df(cols,inds):
    data = {c:[c+str(i) for i in inds] for c in cols}
    return DataFrame(data,index=inds)
    
    
In [12]:
df1 = make_df(list("abc"),[1,2,4])
df1
Out[12]:
  a b c
1 a1 b1 c1
2 a2 b2 c2
4 a4 b4 c4
In [13]:
df2 = make_df(list("abc"),[4,5,6])
df2
Out[13]:
  a b c
4 a4 b4 c4
5 a5 b5 c5
6 a6 b6 c6

1. Concatenation with pd.concat()

pandas provides pd.concat, which works much like np.concatenate but takes a few extra parameters:

pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          copy=True)

1) Simple concatenation

Like np.concatenate, it grows along the rows by default (axis=0).

In [14]:
pd.concat([df1,df2])
Out[14]:
  a b c
1 a1 b1 c1
2 a2 b2 c2
4 a4 b4 c4
4 a4 b4 c4
5 a5 b5 c5
6 a6 b6 c6
In [15]:
pd.concat([df1,df2],axis=1)
Out[15]:
  a b c a b c
1 a1 b1 c1 NaN NaN NaN
2 a2 b2 c2 NaN NaN NaN
4 a4 b4 c4 a4 b4 c4
5 NaN NaN NaN a5 b5 c5
6 NaN NaN NaN a6 b6 c6

The concatenation direction can be changed with the axis parameter.

In [ ]:

Note that the index may contain duplicate labels after concatenation.

You can also pass ignore_index=True to discard the original index and renumber the rows:

In [16]:
pd.concat([df1,df2],axis=0,ignore_index=True)
Out[16]:
  a b c
0 a1 b1 c1
1 a2 b2 c2
2 a4 b4 c4
3 a4 b4 c4
4 a5 b5 c5
5 a6 b6 c6

Or build a hierarchical (multi-level) index with keys:

concat([x,y],keys=['x','y'])

In [17]:
pd.concat([df1,df2],keys=["教学","品保"])
Out[17]:
    a b c
教学 1 a1 b1 c1
2 a2 b2 c2
4 a4 b4 c4
品保 4 a4 b4 c4
5 a5 b5 c5
6 a6 b6 c6

============================================

Exercise 13:

  1. Think of some real-world use cases for concatenation.

  2. Using yesterday's material, build a mid-term grade table ddd for 张三 and 李四.

  3. Suppose a new exam subject, 计算机, is added. How would you implement that?

  4. How would you add the grades of a new student, 王老五? (A sketch covering items 2–4 follows below.)

============================================

In [ ]:
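One possible sketch for items 2–4 of Exercise 13; the table ddd, its subjects and the random scores are made up for illustration:

ddd = DataFrame(np.random.randint(60, 100, size=(2, 3)),
                index=["张三", "李四"], columns=["语文", "数学", "英语"])
ddd["计算机"] = np.random.randint(60, 100, size=2)            # a new subject is a new column
new_row = DataFrame(np.random.randint(60, 100, size=(1, 4)),
                    index=["王老五"], columns=ddd.columns)
ddd = pd.concat([ddd, new_row])                               # a new student is a new row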

2) Mismatched concatenation

"Mismatched" means the indexes along the concatenation dimension differ: the column indexes differ when concatenating vertically, or the row indexes differ when concatenating horizontally.

In [18]:
df1
Out[18]:
  a b c
1 a1 b1 c1
2 a2 b2 c2
4 a4 b4 c4
In [19]:
df3 = make_df(list("abcd"),[1,2,3,4])
df3
Out[19]:
  a b c d
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
4 a4 b4 c4 d4
In [20]:
pd.concat([df1,df3],axis=0)
Out[20]:
  a b c d
1 a1 b1 c1 NaN
2 a2 b2 c2 NaN
4 a4 b4 c4 NaN
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
4 a4 b4 c4 d4
In [21]:
pd.concat([df1,df3],axis=1)
Out[21]:
  a b c a b c d
1 a1 b1 c1 a1 b1 c1 d1
2 a2 b2 c2 a2 b2 c2 d2
3 NaN NaN NaN a3 b3 c3 d3
4 a4 b4 c4 a4 b4 c4 d4

There are 3 ways to join:

  • Outer join: fill with NaN (the default)
In [22]:
pd.concat([df1,df3],axis=0,join="outer")
# 1. With an outer join, concatenating rows (axis=0) aligns on the column labels,
#    while concatenating columns (axis=1) aligns on the row labels.
# 2. Wherever a label is missing from one of the inputs, the result is filled with NaN.
Out[22]:
  a b c d
1 a1 b1 c1 NaN
2 a2 b2 c2 NaN
4 a4 b4 c4 NaN
1 a1 b1 c1 d1
2 a2 b2 c2 d2
3 a3 b3 c3 d3
4 a4 b4 c4 d4
  • Inner join: keep only the labels that match
In [23]:
pd.concat([df1,df3],axis=0,join="inner")
# an inner join drops the labels that the inputs do not share
Out[23]:
  a b c
1 a1 b1 c1
2 a2 b2 c2
4 a4 b4 c4
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
4 a4 b4 c4
In [24]:
df4 = make_df(list("aecd"),[1,2,3,4])
df4
Out[24]:
  a c d e
1 a1 c1 d1 e1
2 a2 c2 d2 e2
3 a3 c3 d3 e3
4 a4 c4 d4 e4
In [25]:
pd.concat([df3,df4],axis=0,join="inner")
Out[25]:
  a c d
1 a1 c1 d1
2 a2 c2 d2
3 a3 c3 d3
4 a4 c4 d4
1 a1 c1 d1
2 a2 c2 d2
3 a3 c3 d3
4 a4 c4 d4
In [ ]:
  • Join on a specified axis with join_axes
In [26]:
df5 = make_df(list("abcf"),[2,3,4,8])
df5
Out[26]:
  a b c f
2 a2 b2 c2 f2
3 a3 b3 c3 f3
4 a4 b4 c4 f4
8 a8 b8 c8 f8
In [27]:
pd.concat([df1,df4,df5],axis=1,join_axes=[df1.index])
Out[27]:
  a b c a c d e a b c f
1 a1 b1 c1 a1 c1 d1 e1 NaN NaN NaN NaN
2 a2 b2 c2 a2 c2 d2 e2 a2 b2 c2 f2
4 a4 b4 c4 a4 c4 d4 e4 a4 b4 c4 f4
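Note: the join_axes argument was removed from pd.concat in pandas 1.0. In newer versions, a sketch of the equivalent is to concatenate and then reindex on the desired axis:

pd.concat([df1, df4, df5], axis=1).reindex(df1.index)   # same rows as join_axes=[df1.index]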
In [28]:
df1
Out[28]:
  a b c
1 a1 b1 c1
2 a2 b2 c2
4 a4 b4 c4
In [29]:
df1.append(df4)
Out[29]:
  a b c d e
1 a1 b1 c1 NaN NaN
2 a2 b2 c2 NaN NaN
4 a4 b4 c4 NaN NaN
1 a1 NaN c1 d1 e1
2 a2 NaN c2 d2 e2
3 a3 NaN c3 d3 e3
4 a4 NaN c4 d4 e4
In [30]:
df1.append(df4,axis=1) # append has no axis parameter; it can only concatenate along the rows
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-30-344f316a2fe3> in <module>()
----> 1 df1.append(df4,axis=1) # append has no axis parameter; it can only concatenate along the rows

TypeError: append() got an unexpected keyword argument 'axis'
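Note: besides lacking an axis parameter, DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. In current versions, pd.concat is the replacement; a sketch:

pd.concat([df1, df4])   # equivalent to df1.append(df4) in older pandas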

In [ ]:

============================================

Exercise 14:

Suppose the final-exam grade table ddd2 has no entry for 张三, only for 李四, 王老五 and 赵小六. Concatenate it with ddd using several different methods.

============================================

3) Adding rows with append()

Because appending at the end is such a common form of concatenation, there is a dedicated function, append, for adding rows to the end of a DataFrame (demonstrated above).

In [ ]:

2. Merging

merge differs from concat in that merge combines two tables based on a shared column (or index), rather than simply stacking them.

When merging with pd.merge(), the column whose name appears in both DataFrames is automatically used as the key.

Note that the rows do not need to appear in the same order in the two tables.

1) One-to-one merge

If the key column contains exactly the same set of values in both tables, the merge is one-to-one.

In [31]:
df1 = DataFrame({
    "name":["Jack Ma","Gates Bill","MrsWang","Xiaoming"],
    "age":[50,60,48,18],
    "sex":["男","男","女","男"],
    "weight":[60,68,80,50]
})
df1
Out[31]:
  age name sex weight
0 50 Jack Ma 男 60
1 60 Gates Bill 男 68
2 48 MrsWang 女 80
3 18 Xiaoming 男 50
In [32]:
df2 = DataFrame({
    "name":["Jack Ma","Gates Bill","MrsWang","Xiaoming"],
    "home":["杭州","美国","隔壁","河南"],
    "work":["teacher","seller","boss","student"],  
})
df2
Out[32]:
  home name work
0 杭州 Jack Ma teacher
1 美国 Gates Bill seller
2 隔壁 MrsWang boss
3 河南 Xiaoming student
In [33]:
pd.concat([df1,df2],axis=1)
Out[33]:
  age name sex weight home name work
0 50 Jack Ma 男 60 杭州 Jack Ma teacher
1 60 Gates Bill 男 68 美国 Gates Bill seller
2 48 MrsWang 女 80 隔壁 MrsWang boss
3 18 Xiaoming 男 50 河南 Xiaoming student
In [34]:
df1.merge(df2)
Out[34]:
  age name sex weight home work
0 50 Jack Ma 男 60 杭州 teacher
1 60 Gates Bill 男 68 美国 seller
2 48 MrsWang 女 80 隔壁 boss
3 18 Xiaoming 男 50 河南 student
In [35]:
pd.merge(df1,df2)
Out[35]:
  age name sex weight home work
0 50 Jack Ma 男 60 杭州 teacher
1 60 Gates Bill 男 68 美国 seller
2 48 MrsWang 女 80 隔壁 boss
3 18 Xiaoming 男 50 河南 student

2) Many-to-one merge

If some key values in one table correspond to several rows in the other table, the merge is many-to-one (one-to-many).

During the merge, the rows on the "one" side are first replicated to match the "many" side, and then the tables are combined.

In [36]:
df3 = DataFrame({
    "name":["Jack Ma","Gates Bill","MrsWang","Xiaoming","Xiaoming","Jack Ma"],
    "home":["杭州","美国","隔壁","河南","山东","安徽"],
    "work":["teacher","seller","boss","student","studet","boss"],  
})
df3
Out[36]:
  home name work
0 杭州 Jack Ma teacher
1 美国 Gates Bill seller
2 隔壁 MrsWang boss
3 河南 Xiaoming student
4 山东 Xiaoming studet
5 安徽 Jack Ma boss
In [37]:
df1.merge(df3)
Out[37]:
  age name sex weight home work
0 50 Jack Ma 男 60 杭州 teacher
1 50 Jack Ma 男 60 安徽 boss
2 60 Gates Bill 男 68 美国 seller
3 48 MrsWang 女 80 隔壁 boss
4 18 Xiaoming 男 50 河南 student
5 18 Xiaoming 男 50 山东 studet
In [ ]:

3) Many-to-many merge

If a key value appears several times in table 1 and the same value also appears several times in table 2, the merge is many-to-many.

Each matching row of table 1 is paired with every matching row of table 2.

In [38]:
df4 = DataFrame({
    "name":["Jack Ma","Gates Bill","MrsWang","Xiaoming","Jack Ma","Jack Ma","Xiaoming"],
    "age":[50,60,48,18,20,30,40],
    "sex":["男","男","女","男","M","M","F"],
    "weight":[60,68,80,50,40,50,30]
})
df4
Out[38]:
  age name sex weight
0 50 Jack Ma 男 60
1 60 Gates Bill 男 68
2 48 MrsWang 女 80
3 18 Xiaoming 男 50
4 20 Jack Ma M 40
5 30 Jack Ma M 50
6 40 Xiaoming F 30
In [39]:
df3
Out[39]:
  home name work
0 杭州 Jack Ma teacher
1 美国 Gates Bill seller
2 隔壁 MrsWang boss
3 河南 Xiaoming student
4 山东 Xiaoming studet
5 安徽 Jack Ma boss
In [40]:
df4.merge(df3)
Out[40]:
  age name sex weight home work
0 50 Jack Ma 男 60 杭州 teacher
1 50 Jack Ma 男 60 安徽 boss
2 20 Jack Ma M 40 杭州 teacher
3 20 Jack Ma M 40 安徽 boss
4 30 Jack Ma M 50 杭州 teacher
5 30 Jack Ma M 50 安徽 boss
6 60 Gates Bill 男 68 美国 seller
7 48 MrsWang 女 80 隔壁 boss
8 18 Xiaoming 男 50 河南 student
9 18 Xiaoming 男 50 山东 studet
10 40 Xiaoming F 30 河南 student
11 40 Xiaoming F 30 山东 studet
In [ ]:

4) Normalizing the key

  • Use on= to state explicitly which column is the key; needed when the two tables share more than one column name.
In [41]:
df1
Out[41]:
  age name sex weight
0 50 Jack Ma 男 60
1 60 Gates Bill 男 68
2 48 MrsWang 女 80
3 18 Xiaoming 男 50
In [42]:
df5 = DataFrame({
    "name":["Jack Ma","Gates Bill","MrsWang","Xiaoming","Xiaoming","Jack Ma"],
    "home":["杭州","美国","隔壁","河南","山东","安徽"],
    "work":["teacher","seller","boss","student","studet","boss"], 
    "age":[15,67,89,12,34,56]
})
df5
Out[42]:
  age home name work
0 15 杭州 Jack Ma teacher
1 67 美国 Gates Bill seller
2 89 隔壁 MrsWang boss
3 12 河南 Xiaoming student
4 34 山东 Xiaoming studet
5 56 安徽 Jack Ma boss
In [43]:
# when the two tables share several column names, specify which one to merge on
df1.merge(df5,on="name",suffixes=["_实际","_假的"])
Out[43]:
  age_实际 name sex weight age_假的 home work
0 50 Jack Ma 男 60 15 杭州 teacher
1 50 Jack Ma 男 60 56 安徽 boss
2 60 Gates Bill 男 68 67 美国 seller
3 48 MrsWang 女 80 89 隔壁 boss
4 18 Xiaoming 男 50 12 河南 student
5 18 Xiaoming 男 50 34 山东 studet
In [ ]:
  • Use left_on and right_on to name the key column on each side; needed when the key columns in the two tables have different names.
In [44]:
df6 = DataFrame({
    "姓名":["Jack Ma","Gates Bill","MrsWang","Xiaoming","Xiaoming","Jack Ma"],
    "home":["杭州","美国","隔壁","河南","山东","安徽"],
    "work":["teacher","seller","boss","student","studet","boss"],  
})
df6
Out[44]:
  home work 姓名
0 杭州 teacher Jack Ma
1 美国 seller Gates Bill
2 隔壁 boss MrsWang
3 河南 student Xiaoming
4 山东 studet Xiaoming
5 安徽 boss Jack Ma
In [45]:
df1
Out[45]:
  age name sex weight
0 50 Jack Ma 男 60
1 60 Gates Bill 男 68
2 48 MrsWang 女 80
3 18 Xiaoming 男 50
In [46]:
df1.merge(df6,left_on="name",right_on="姓名")
# when the two tables have no column name in common, pick one column from the left table
# and one from the right table and merge on those
# note that both key columns are kept in the result; they are not combined into one
Out[46]:
  age name sex weight home work 姓名
0 50 Jack Ma 男 60 杭州 teacher Jack Ma
1 50 Jack Ma 男 60 安徽 boss Jack Ma
2 60 Gates Bill 男 68 美国 seller Gates Bill
3 48 MrsWang 女 80 隔壁 boss MrsWang
4 18 Xiaoming 男 50 河南 student Xiaoming
5 18 Xiaoming 男 50 山东 studet Xiaoming
In [47]:
df1.merge(df6,left_on="age",right_on="home") # the two columns share no values, so the result is empty
Out[47]:
In [48]:
df1.merge(df6,right_index=True,left_index=True) # merge on the index instead of a column
Out[48]:
  age name sex weight home work 姓名
0 50 Jack Ma 男 60 杭州 teacher Jack Ma
1 60 Gates Bill 男 68 美国 seller Gates Bill
2 48 MrsWang 女 80 隔壁 boss MrsWang
3 18 Xiaoming 男 50 河南 student Xiaoming

============================================

Exercise 16:

  1. Suppose there are two grade tables: ddd for 张三, 李四 and 王老五, and ddd4 for 张三 and 赵小六. How do you merge them?

  2. What if 张三's name was mistyped as 张十三 in ddd4? (See the sketch after this list.)

  3. Practice the many-to-one and many-to-many cases on your own.

  4. Look into left_index and right_index on your own.

============================================
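A possible sketch for item 2 of Exercise 16; the table ddd4 and the column holding the names are assumptions made up for illustration:

ddd4 = ddd4.replace("张十三", "张三")   # fix the misspelled name first, then merge as usual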

5) Inner merge and outer merge

  • Inner merge: keep only the keys that appear in both tables (the default)
In [49]:
df7 = DataFrame({
    "name":["Jack Ma","Gates Bill","MrsWang","Xiaoming","Xiaowang","MaYun"],
    "home":["杭州","美国","隔壁","河南","山东","安徽"],
    "work":["teacher","seller","boss","student","studet","boss"],  
})
df7
Out[49]:
  home name work
0 杭州 Jack Ma teacher
1 美国 Gates Bill seller
2 隔壁 MrsWang boss
3 河南 Xiaoming student
4 山东 Xiaowang studet
5 安徽 MaYun boss
In [50]:
df1.merge(df7,how="inner")
Out[50]:
  age name sex weight home work
0 50 Jack Ma 男 60 杭州 teacher
1 60 Gates Bill 男 68 美国 seller
2 48 MrsWang 女 80 隔壁 boss
3 18 Xiaoming 男 50 河南 student
  • Outer merge, how='outer': fill the gaps with NaN
In [51]:
df1.merge(df7,how="outer")
Out[51]:
  age name sex weight home work
0 50.0 Jack Ma 男 60.0 杭州 teacher
1 60.0 Gates Bill 男 68.0 美国 seller
2 48.0 MrsWang 女 80.0 隔壁 boss
3 18.0 Xiaoming 男 50.0 河南 student
4 NaN Xiaowang NaN NaN 山东 studet
5 NaN MaYun NaN NaN 安徽 boss
  • Left merge and right merge: how='left', how='right'
In [52]:
df1["name"][0] = "Peater"
df1
d:\Anaconda\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
Out[52]:
  age name sex weight
0 50 Peater 男 60
1 60 Gates Bill 男 68
2 48 MrsWang 女 80
3 18 Xiaoming 男 50
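The SettingWithCopyWarning above comes from chained indexing (df1["name"][0] = ...). A sketch of the same assignment written with .loc, which avoids the warning:

df1.loc[0, "name"] = "Peater"   # label-based assignment in a single step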
In [53]:
df1.merge(df7,how="right")
# left merge: the left table is the reference; keys the left has but the right lacks get NaN,
# and keys only the right has are dropped
# right merge: the opposite
Out[53]:
  age name sex weight home work
0 60.0 Gates Bill 男 68.0 美国 seller
1 48.0 MrsWang 女 80.0 隔壁 boss
2 18.0 Xiaoming 男 50.0 河南 student
3 NaN Jack Ma NaN NaN 杭州 teacher
4 NaN Xiaowang NaN NaN 山东 studet
5 NaN MaYun NaN NaN 安徽 boss
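For comparison, a left merge keeps every row of the left table (a sketch, not executed above):

df1.merge(df7, how="left")   # keeps Peater with NaN for home/work; Jack Ma, Xiaowang and MaYun are dropped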

============================================

Exercise 17:

  1. If you only have 张三's and 赵小六's grades for the three subjects 语文, 数学 and 英语, how do you merge?

  2. Considering realistic scenarios, merge ddd and ddd4 in several different ways.

============================================

6) Resolving column conflicts

When columns conflict, i.e. the two tables share more than one column name, use on= to pick the column that serves as the key, and use suffixes to rename the conflicting columns.

You can supply your own suffixes with suffixes=. (A short recap follows below.)
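A short recap using the tables defined above (df1 and df5 both contain an age column); the suffix strings here are just examples:

df1.merge(df5, on="name", suffixes=["_left", "_right"])   # age_left comes from df1, age_right from df5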

============================================

Exercise 18:

Suppose two different students are both named 李四, and ddd5 and ddd6 are both grade tables for 张三 and 李四. How do you merge them?

============================================

Homework

3. Case study: US state population data

First load the files and take a look at a sample of the data.

In [54]:
abbr = pd.read_csv("./usapop/state-abbrevs.csv")
abbr.head()
Out[54]:
  state abbreviation
0 Alabama AL
1 Alaska AK
2 Arizona AZ
3 Arkansas AR
4 California CA
In [55]:
areas = pd.read_csv("./usapop/state-areas.csv")
areas.head()
Out[55]:
  state area (sq. mi)
0 Alabama 52423
1 Alaska 656425
2 Arizona 114006
3 Arkansas 53182
4 California 163707
In [56]:
pop = pd.read_csv("./usapop/state-population.csv")
pop.head()
Out[56]:
  state/region ages year population
0 AL under18 2012 1117489.0
1 AL total 2012 4817528.0
2 AL under18 2010 1130966.0
3 AL total 2010 4785570.0
4 AL under18 2011 1125763.0

Merge the pop and abbr DataFrames, joining pop's state/region column against abbr's abbreviation column.

To keep all of the information, use an outer merge.

In [57]:
pop2 = pop.merge(abbr,left_on="state/region",right_on="abbreviation",how="outer")
# use an outer merge (a left merge would also work here)
pop2
Out[57]:
  state/region ages year population state abbreviation
0 AL under18 2012 1117489.0 Alabama AL
1 AL total 2012 4817528.0 Alabama AL
2 AL under18 2010 1130966.0 Alabama AL
3 AL total 2010 4785570.0 Alabama AL
4 AL under18 2011 1125763.0 Alabama AL
5 AL total 2011 4801627.0 Alabama AL
6 AL total 2009 4757938.0 Alabama AL
7 AL under18 2009 1134192.0 Alabama AL
8 AL under18 2013 1111481.0 Alabama AL
9 AL total 2013 4833722.0 Alabama AL
10 AL total 2007 4672840.0 Alabama AL
11 AL under18 2007 1132296.0 Alabama AL
12 AL total 2008 4718206.0 Alabama AL
13 AL under18 2008 1134927.0 Alabama AL
14 AL total 2005 4569805.0 Alabama AL
15 AL under18 2005 1117229.0 Alabama AL
16 AL total 2006 4628981.0 Alabama AL
17 AL under18 2006 1126798.0 Alabama AL
18 AL total 2004 4530729.0 Alabama AL
19 AL under18 2004 1113662.0 Alabama AL
20 AL total 2003 4503491.0 Alabama AL
21 AL under18 2003 1113083.0 Alabama AL
22 AL total 2001 4467634.0 Alabama AL
23 AL under18 2001 1120409.0 Alabama AL
24 AL total 2002 4480089.0 Alabama AL
25 AL under18 2002 1116590.0 Alabama AL
26 AL under18 1999 1121287.0 Alabama AL
27 AL total 1999 4430141.0 Alabama AL
28 AL total 2000 4452173.0 Alabama AL
29 AL under18 2000 1122273.0 Alabama AL
... ... ... ... ... ... ...
2514 USA under18 1999 71946051.0 NaN NaN
2515 USA total 2000 282162411.0 NaN NaN
2516 USA under18 2000 72376189.0 NaN NaN
2517 USA total 1999 279040181.0 NaN NaN
2518 USA total 2001 284968955.0 NaN NaN
2519 USA under18 2001 72671175.0 NaN NaN
2520 USA total 2002 287625193.0 NaN NaN
2521 USA under18 2002 72936457.0 NaN NaN
2522 USA total 2003 290107933.0 NaN NaN
2523 USA under18 2003 73100758.0 NaN NaN
2524 USA total 2004 292805298.0 NaN NaN
2525 USA under18 2004 73297735.0 NaN NaN
2526 USA total 2005 295516599.0 NaN NaN
2527 USA under18 2005 73523669.0 NaN NaN
2528 USA total 2006 298379912.0 NaN NaN
2529 USA under18 2006 73757714.0 NaN NaN
2530 USA total 2007 301231207.0 NaN NaN
2531 USA under18 2007 74019405.0 NaN NaN
2532 USA total 2008 304093966.0 NaN NaN
2533 USA under18 2008 74104602.0 NaN NaN
2534 USA under18 2013 73585872.0 NaN NaN
2535 USA total 2013 316128839.0 NaN NaN
2536 USA total 2009 306771529.0 NaN NaN
2537 USA under18 2009 74134167.0 NaN NaN
2538 USA under18 2010 74119556.0 NaN NaN
2539 USA total 2010 309326295.0 NaN NaN
2540 USA under18 2011 73902222.0 NaN NaN
2541 USA total 2011 311582564.0 NaN NaN
2542 USA under18 2012 73708179.0 NaN NaN
2543 USA total 2012 313873685.0 NaN NaN

2544 rows × 6 columns

Drop the abbreviation column (axis=1).

In [58]:
pop2.drop("abbreviation",axis=1,inplace=True)
# axis defaults to 0, which would drop rows; axis=1 drops a column
# inplace controls whether the table is modified in place; it defaults to False
In [59]:
pop2.head()
Out[59]:
  state/region ages year population state
0 AL under18 2012 1117489.0 Alabama
1 AL total 2012 4817528.0 Alabama
2 AL under18 2010 1130966.0 Alabama
3 AL total 2010 4785570.0 Alabama
4 AL under18 2011 1125763.0 Alabama

Check for missing data.

.isnull().any() reports True for a column as soon as it contains a single missing value; with axis=1 it reports True per row instead, which is what we use here to build a row mask.

In [60]:
cond = pop2.isnull().any(axis=1)
cond
Out[60]:
0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
2514     True
2515     True
2516     True
2517     True
2518     True
2519     True
2520     True
2521     True
2522     True
2523     True
2524     True
2525     True
2526     True
2527     True
2528     True
2529     True
2530     True
2531     True
2532     True
2533     True
2534     True
2535     True
2536     True
2537     True
2538     True
2539     True
2540     True
2541     True
2542     True
2543     True
Length: 2544, dtype: bool
In [61]:
pop2[cond]
Out[61]:
  state/region ages year population state
2448 PR under18 1990 NaN NaN
2449 PR total 1990 NaN NaN
2450 PR total 1991 NaN NaN
2451 PR under18 1991 NaN NaN
2452 PR total 1993 NaN NaN
2453 PR under18 1993 NaN NaN
2454 PR under18 1992 NaN NaN
2455 PR total 1992 NaN NaN
2456 PR under18 1994 NaN NaN
2457 PR total 1994 NaN NaN
2458 PR total 1995 NaN NaN
2459 PR under18 1995 NaN NaN
2460 PR under18 1996 NaN NaN
2461 PR total 1996 NaN NaN
2462 PR under18 1998 NaN NaN
2463 PR total 1998 NaN NaN
2464 PR total 1997 NaN NaN
2465 PR under18 1997 NaN NaN
2466 PR total 1999 NaN NaN
2467 PR under18 1999 NaN NaN
2468 PR total 2000 3810605.0 NaN
2469 PR under18 2000 1089063.0 NaN
2470 PR total 2001 3818774.0 NaN
2471 PR under18 2001 1077566.0 NaN
2472 PR total 2002 3823701.0 NaN
2473 PR under18 2002 1065051.0 NaN
2474 PR total 2004 3826878.0 NaN
2475 PR under18 2004 1035919.0 NaN
2476 PR total 2003 3826095.0 NaN
2477 PR under18 2003 1050615.0 NaN
... ... ... ... ... ...
2514 USA under18 1999 71946051.0 NaN
2515 USA total 2000 282162411.0 NaN
2516 USA under18 2000 72376189.0 NaN
2517 USA total 1999 279040181.0 NaN
2518 USA total 2001 284968955.0 NaN
2519 USA under18 2001 72671175.0 NaN
2520 USA total 2002 287625193.0 NaN
2521 USA under18 2002 72936457.0 NaN
2522 USA total 2003 290107933.0 NaN
2523 USA under18 2003 73100758.0 NaN
2524 USA total 2004 292805298.0 NaN
2525 USA under18 2004 73297735.0 NaN
2526 USA total 2005 295516599.0 NaN
2527 USA under18 2005 73523669.0 NaN
2528 USA total 2006 298379912.0 NaN
2529 USA under18 2006 73757714.0 NaN
2530 USA total 2007 301231207.0 NaN
2531 USA under18 2007 74019405.0 NaN
2532 USA total 2008 304093966.0 NaN
2533 USA under18 2008 74104602.0 NaN
2534 USA under18 2013 73585872.0 NaN
2535 USA total 2013 316128839.0 NaN
2536 USA total 2009 306771529.0 NaN
2537 USA under18 2009 74134167.0 NaN
2538 USA under18 2010 74119556.0 NaN
2539 USA total 2010 309326295.0 NaN
2540 USA under18 2011 73902222.0 NaN
2541 USA total 2011 311582564.0 NaN
2542 USA under18 2012 73708179.0 NaN
2543 USA total 2012 313873685.0 NaN

96 rows × 5 columns

In [ ]:

Inspect the missing data.

Use a boolean mask to display only the rows where the value is missing (True).

In [62]:
cond_state = pop2["state"].isnull()

Find which state/region codes leave state as NaN; unique() lists the distinct values.

In [63]:
pop2[cond_state]["state/region"].unique()
Out[63]:
array(['PR', 'USA'], dtype=object)

Fill in the correct state value for those state/region codes, which removes every NaN from the state column.

Remember this technique for cleaning up missing (NaN) data!

In [64]:
cond_pr = pop2["state/region"] == "PR"
cond_usa = pop2["state/region"] == "USA"
In [65]:
pop2["state"][cond_pr] = "Puerto Rico"
pop2["state"][cond_usa] = "United State"
d:\Anaconda\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
d:\Anaconda\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
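These SettingWithCopyWarning messages again come from chained indexing. A sketch of the same assignments written with .loc (the style recommended in the summary at the end), which avoids the warning:

pop2.loc[cond_pr, "state"] = "Puerto Rico"
pop2.loc[cond_usa, "state"] = "United State"   # spelled as in the rest of this walkthrough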
In [66]:
pop2.head()
Out[66]:
  state/region ages year population state
0 AL under18 2012 1117489.0 Alabama
1 AL total 2012 4817528.0 Alabama
2 AL under18 2010 1130966.0 Alabama
3 AL total 2010 4785570.0 Alabama
4 AL under18 2011 1125763.0 Alabama
In [67]:
pop2.isnull().any()
Out[67]:
state/region    False
ages            False
year            False
population       True
state           False
dtype: bool

Merge in the state-area data areas, again using an outer merge.

Think about why an outer merge is used here.

In [68]:
areas.head()
Out[68]:
  state area (sq. mi)
0 Alabama 52423
1 Alaska 656425
2 Arizona 114006
3 Arkansas 53182
4 California 163707
In [69]:
pop3 = pop2.merge(areas,how="outer")
pop3.head()
Out[69]:
  state/region ages year population state area (sq. mi)
0 AL under18 2012 1117489.0 Alabama 52423.0
1 AL total 2012 4817528.0 Alabama 52423.0
2 AL under18 2010 1130966.0 Alabama 52423.0
3 AL total 2010 4785570.0 Alabama 52423.0
4 AL under18 2011 1125763.0 Alabama 52423.0
In [ ]:

Keep looking for columns with missing data.

In [70]:
pop3.isnull().any()
Out[70]:
state/region     False
ages             False
year             False
population        True
state            False
area (sq. mi)     True
dtype: bool
In [71]:
cond_area = pop3["area (sq. mi)"].isnull()
In [72]:
pop3[cond_area]
Out[72]:
  state/region ages year population state area (sq. mi)
2496 USA under18 1990 64218512.0 United State NaN
2497 USA total 1990 249622814.0 United State NaN
2498 USA total 1991 252980942.0 United State NaN
2499 USA under18 1991 65313018.0 United State NaN
2500 USA under18 1992 66509177.0 United State NaN
2501 USA total 1992 256514231.0 United State NaN
2502 USA total 1993 259918595.0 United State NaN
2503 USA under18 1993 67594938.0 United State NaN
2504 USA under18 1994 68640936.0 United State NaN
2505 USA total 1994 263125826.0 United State NaN
2506 USA under18 1995 69473140.0 United State NaN
2507 USA under18 1996 70233512.0 United State NaN
2508 USA total 1995 266278403.0 United State NaN
2509 USA total 1996 269394291.0 United State NaN
2510 USA total 1997 272646932.0 United State NaN
2511 USA under18 1997 70920738.0 United State NaN
2512 USA under18 1998 71431406.0 United State NaN
2513 USA total 1998 275854116.0 United State NaN
2514 USA under18 1999 71946051.0 United State NaN
2515 USA total 2000 282162411.0 United State NaN
2516 USA under18 2000 72376189.0 United State NaN
2517 USA total 1999 279040181.0 United State NaN
2518 USA total 2001 284968955.0 United State NaN
2519 USA under18 2001 72671175.0 United State NaN
2520 USA total 2002 287625193.0 United State NaN
2521 USA under18 2002 72936457.0 United State NaN
2522 USA total 2003 290107933.0 United State NaN
2523 USA under18 2003 73100758.0 United State NaN
2524 USA total 2004 292805298.0 United State NaN
2525 USA under18 2004 73297735.0 United State NaN
2526 USA total 2005 295516599.0 United State NaN
2527 USA under18 2005 73523669.0 United State NaN
2528 USA total 2006 298379912.0 United State NaN
2529 USA under18 2006 73757714.0 United State NaN
2530 USA total 2007 301231207.0 United State NaN
2531 USA under18 2007 74019405.0 United State NaN
2532 USA total 2008 304093966.0 United State NaN
2533 USA under18 2008 74104602.0 United State NaN
2534 USA under18 2013 73585872.0 United State NaN
2535 USA total 2013 316128839.0 United State NaN
2536 USA total 2009 306771529.0 United State NaN
2537 USA under18 2009 74134167.0 United State NaN
2538 USA under18 2010 74119556.0 United State NaN
2539 USA total 2010 309326295.0 United State NaN
2540 USA under18 2011 73902222.0 United State NaN
2541 USA total 2011 311582564.0 United State NaN
2542 USA under18 2012 73708179.0 United State NaN
2543 USA total 2012 313873685.0 United State NaN

The area (sq. mi) column has missing values; the rows above show that it is the USA ("United State") entries that lack an area.

In [73]:
# the USA area is missing, so compute it by summing the areas of all the states
usa_area = areas["area (sq. mi)"].sum()
In [74]:
# write the US total area back into pop3
pop3["area (sq. mi)"][cond_area] = usa_area
d:\Anaconda\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
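Again, the same assignment written with .loc avoids the warning (a sketch):

pop3.loc[cond_area, "area (sq. mi)"] = usa_area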
In [75]:
pop3
Out[75]:
  state/region ages year population state area (sq. mi)
0 AL under18 2012 1117489.0 Alabama 52423.0
1 AL total 2012 4817528.0 Alabama 52423.0
2 AL under18 2010 1130966.0 Alabama 52423.0
3 AL total 2010 4785570.0 Alabama 52423.0
4 AL under18 2011 1125763.0 Alabama 52423.0
5 AL total 2011 4801627.0 Alabama 52423.0
6 AL total 2009 4757938.0 Alabama 52423.0
7 AL under18 2009 1134192.0 Alabama 52423.0
8 AL under18 2013 1111481.0 Alabama 52423.0
9 AL total 2013 4833722.0 Alabama 52423.0
10 AL total 2007 4672840.0 Alabama 52423.0
11 AL under18 2007 1132296.0 Alabama 52423.0
12 AL total 2008 4718206.0 Alabama 52423.0
13 AL under18 2008 1134927.0 Alabama 52423.0
14 AL total 2005 4569805.0 Alabama 52423.0
15 AL under18 2005 1117229.0 Alabama 52423.0
16 AL total 2006 4628981.0 Alabama 52423.0
17 AL under18 2006 1126798.0 Alabama 52423.0
18 AL total 2004 4530729.0 Alabama 52423.0
19 AL under18 2004 1113662.0 Alabama 52423.0
20 AL total 2003 4503491.0 Alabama 52423.0
21 AL under18 2003 1113083.0 Alabama 52423.0
22 AL total 2001 4467634.0 Alabama 52423.0
23 AL under18 2001 1120409.0 Alabama 52423.0
24 AL total 2002 4480089.0 Alabama 52423.0
25 AL under18 2002 1116590.0 Alabama 52423.0
26 AL under18 1999 1121287.0 Alabama 52423.0
27 AL total 1999 4430141.0 Alabama 52423.0
28 AL total 2000 4452173.0 Alabama 52423.0
29 AL under18 2000 1122273.0 Alabama 52423.0
... ... ... ... ... ... ...
2514 USA under18 1999 71946051.0 United State 3790399.0
2515 USA total 2000 282162411.0 United State 3790399.0
2516 USA under18 2000 72376189.0 United State 3790399.0
2517 USA total 1999 279040181.0 United State 3790399.0
2518 USA total 2001 284968955.0 United State 3790399.0
2519 USA under18 2001 72671175.0 United State 3790399.0
2520 USA total 2002 287625193.0 United State 3790399.0
2521 USA under18 2002 72936457.0 United State 3790399.0
2522 USA total 2003 290107933.0 United State 3790399.0
2523 USA under18 2003 73100758.0 United State 3790399.0
2524 USA total 2004 292805298.0 United State 3790399.0
2525 USA under18 2004 73297735.0 United State 3790399.0
2526 USA total 2005 295516599.0 United State 3790399.0
2527 USA under18 2005 73523669.0 United State 3790399.0
2528 USA total 2006 298379912.0 United State 3790399.0
2529 USA under18 2006 73757714.0 United State 3790399.0
2530 USA total 2007 301231207.0 United State 3790399.0
2531 USA under18 2007 74019405.0 United State 3790399.0
2532 USA total 2008 304093966.0 United State 3790399.0
2533 USA under18 2008 74104602.0 United State 3790399.0
2534 USA under18 2013 73585872.0 United State 3790399.0
2535 USA total 2013 316128839.0 United State 3790399.0
2536 USA total 2009 306771529.0 United State 3790399.0
2537 USA under18 2009 74134167.0 United State 3790399.0
2538 USA under18 2010 74119556.0 United State 3790399.0
2539 USA total 2010 309326295.0 United State 3790399.0
2540 USA under18 2011 73902222.0 United State 3790399.0
2541 USA total 2011 311582564.0 United State 3790399.0
2542 USA under18 2012 73708179.0 United State 3790399.0
2543 USA total 2012 313873685.0 United State 3790399.0

2544 rows × 6 columns

Drop the rows that still contain missing data. First check which columns have NaN:

In [76]:
pop3.isnull().any()
Out[76]:
state/region     False
ages             False
year             False
population        True
state            False
area (sq. mi)    False
dtype: bool

Then drop those rows and verify that nothing is missing any more.

In [77]:
pop3.dropna(inplace=True)
In [78]:
pop3.isnull().any()
Out[78]:
state/region     False
ages             False
year             False
population       False
state            False
area (sq. mi)    False
dtype: bool

Find the 2010 total-population data with df.query(<query string>).

In [79]:
pop3.head()
Out[79]:
  state/region ages year population state area (sq. mi)
0 AL under18 2012 1117489.0 Alabama 52423.0
1 AL total 2012 4817528.0 Alabama 52423.0
2 AL under18 2010 1130966.0 Alabama 52423.0
3 AL total 2010 4785570.0 Alabama 52423.0
4 AL under18 2011 1125763.0 Alabama 52423.0
In [80]:
pop3.query("year==2010 & ages=='total' & state=='United State'")
Out[80]:
  state/region ages year population state area (sq. mi)
2539 USA total 2010 309326295.0 United State 3790399.0
In [81]:
pop_2010 = pop3.query("year==2010 & ages=='total'")
pop_2010
Out[81]:
  state/region ages year population state area (sq. mi)
3 AL total 2010 4785570.0 Alabama 52423.0
91 AK total 2010 713868.0 Alaska 656425.0
101 AZ total 2010 6408790.0 Arizona 114006.0
189 AR total 2010 2922280.0 Arkansas 53182.0
197 CA total 2010 37333601.0 California 163707.0
283 CO total 2010 5048196.0 Colorado 104100.0
293 CT total 2010 3579210.0 Connecticut 5544.0
379 DE total 2010 899711.0 Delaware 1954.0
389 DC total 2010 605125.0 District of Columbia 68.0
475 FL total 2010 18846054.0 Florida 65758.0
485 GA total 2010 9713248.0 Georgia 59441.0
570 HI total 2010 1363731.0 Hawaii 10932.0
581 ID total 2010 1570718.0 Idaho 83574.0
666 IL total 2010 12839695.0 Illinois 57918.0
677 IN total 2010 6489965.0 Indiana 36420.0
762 IA total 2010 3050314.0 Iowa 56276.0
773 KS total 2010 2858910.0 Kansas 82282.0
858 KY total 2010 4347698.0 Kentucky 40411.0
869 LA total 2010 4545392.0 Louisiana 51843.0
954 ME total 2010 1327366.0 Maine 35387.0
965 MD total 2010 5787193.0 Maryland 12407.0
1050 MA total 2010 6563263.0 Massachusetts 10555.0
1061 MI total 2010 9876149.0 Michigan 96810.0
1146 MN total 2010 5310337.0 Minnesota 86943.0
1157 MS total 2010 2970047.0 Mississippi 48434.0
1242 MO total 2010 5996063.0 Missouri 69709.0
1253 MT total 2010 990527.0 Montana 147046.0
1338 NE total 2010 1829838.0 Nebraska 77358.0
1349 NV total 2010 2703230.0 Nevada 110567.0
1434 NH total 2010 1316614.0 New Hampshire 9351.0
1445 NJ total 2010 8802707.0 New Jersey 8722.0
1530 NM total 2010 2064982.0 New Mexico 121593.0
1541 NY total 2010 19398228.0 New York 54475.0
1626 NC total 2010 9559533.0 North Carolina 53821.0
1637 ND total 2010 674344.0 North Dakota 70704.0
1722 OH total 2010 11545435.0 Ohio 44828.0
1733 OK total 2010 3759263.0 Oklahoma 69903.0
1818 OR total 2010 3837208.0 Oregon 98386.0
1829 PA total 2010 12710472.0 Pennsylvania 46058.0
1914 RI total 2010 1052669.0 Rhode Island 1545.0
1925 SC total 2010 4636361.0 South Carolina 32007.0
2010 SD total 2010 816211.0 South Dakota 77121.0
2021 TN total 2010 6356683.0 Tennessee 42146.0
2106 TX total 2010 25245178.0 Texas 268601.0
2117 UT total 2010 2774424.0 Utah 84904.0
2202 VT total 2010 625793.0 Vermont 9615.0
2213 VA total 2010 8024417.0 Virginia 42769.0
2298 WA total 2010 6742256.0 Washington 71303.0
2309 WV total 2010 1854146.0 West Virginia 24231.0
2394 WI total 2010 5689060.0 Wisconsin 65503.0
2405 WY total 2010 564222.0 Wyoming 97818.0
2490 PR total 2010 3721208.0 Puerto Rico 3515.0
2539 USA total 2010 309326295.0 United State 3790399.0

Process the query result by making the state column the new row index with set_index.

In [82]:
pop_2010.set_index("state",inplace=True)
In [83]:
pop_2010.shape
Out[83]:
(53, 5)

Compute the population density. Note that this is Series / Series, and the result is still a Series.

In [84]:
pop_dense = pop_2010["population"] / pop_2010["area (sq. mi)"]
pop_dense
Out[84]:
state
Alabama                   91.287603
Alaska                     1.087509
Arizona                   56.214497
Arkansas                  54.948667
California               228.051342
Colorado                  48.493718
Connecticut              645.600649
Delaware                 460.445752
District of Columbia    8898.897059
Florida                  286.597129
Georgia                  163.409902
Hawaii                   124.746707
Idaho                     18.794338
Illinois                 221.687472
Indiana                  178.197831
Iowa                      54.202751
Kansas                    34.745266
Kentucky                 107.586994
Louisiana                 87.676099
Maine                     37.509990
Maryland                 466.445797
Massachusetts            621.815538
Michigan                 102.015794
Minnesota                 61.078373
Mississippi               61.321530
Missouri                  86.015622
Montana                    6.736171
Nebraska                  23.654153
Nevada                    24.448796
New Hampshire            140.799273
New Jersey              1009.253268
New Mexico                16.982737
New York                 356.094135
North Carolina           177.617157
North Dakota               9.537565
Ohio                     257.549634
Oklahoma                  53.778278
Oregon                    39.001565
Pennsylvania             275.966651
Rhode Island             681.339159
South Carolina           144.854594
South Dakota              10.583512
Tennessee                150.825298
Texas                     93.987655
Utah                      32.677188
Vermont                   65.085075
Virginia                 187.622273
Washington                94.557817
West Virginia             76.519582
Wisconsin                 86.851900
Wyoming                    5.768079
Puerto Rico             1058.665149
United State              81.607845
dtype: float64

Sort with sort_values() and find the five most densely populated states.

In [85]:
pop_dense.sort_values(inplace=True)
pop_dense
Out[85]:
state
Alaska                     1.087509
Wyoming                    5.768079
Montana                    6.736171
North Dakota               9.537565
South Dakota              10.583512
New Mexico                16.982737
Idaho                     18.794338
Nebraska                  23.654153
Nevada                    24.448796
Utah                      32.677188
Kansas                    34.745266
Maine                     37.509990
Oregon                    39.001565
Colorado                  48.493718
Oklahoma                  53.778278
Iowa                      54.202751
Arkansas                  54.948667
Arizona                   56.214497
Minnesota                 61.078373
Mississippi               61.321530
Vermont                   65.085075
West Virginia             76.519582
United State              81.607845
Missouri                  86.015622
Wisconsin                 86.851900
Louisiana                 87.676099
Alabama                   91.287603
Texas                     93.987655
Washington                94.557817
Michigan                 102.015794
Kentucky                 107.586994
Hawaii                   124.746707
New Hampshire            140.799273
South Carolina           144.854594
Tennessee                150.825298
Georgia                  163.409902
North Carolina           177.617157
Indiana                  178.197831
Virginia                 187.622273
Illinois                 221.687472
California               228.051342
Ohio                     257.549634
Pennsylvania             275.966651
Florida                  286.597129
New York                 356.094135
Delaware                 460.445752
Maryland                 466.445797
Massachusetts            621.815538
Connecticut              645.600649
Rhode Island             681.339159
New Jersey              1009.253268
Puerto Rico             1058.665149
District of Columbia    8898.897059
dtype: float64
In [86]:
pop_dense.tail()
Out[86]:
state
Connecticut              645.600649
Rhode Island             681.339159
New Jersey              1009.253268
Puerto Rico             1058.665149
District of Columbia    8898.897059
dtype: float64

Find the five least densely populated states.

In [87]:
pop_dense.head()
Out[87]:
state
Alaska           1.087509
Wyoming          5.768079
Montana          6.736171
North Dakota     9.537565
South Dakota    10.583512
dtype: float64
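As an alternative to sorting the whole Series, nlargest and nsmallest return just the top or bottom entries (a sketch):

pop_dense.nlargest(5)    # the five highest densities
pop_dense.nsmallest(5)   # the five lowest densities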

Key takeaways:

  • Use .loc consistently for label-based indexing and assignment
  • Use .isnull().any() to find the columns that contain NaN
  • Use .unique() to see which keys in a column need attention
  • Generally use outer or left merges, for one reason: better to end up with NaN in a column than to throw away information from the other columns

Review: Series/DataFrame arithmetic versus ndarray arithmetic

  • Series and DataFrame do not broadcast: where an index label has no counterpart, the result is NaN; alternatively, use add() with fill_value to fill in the missing values (see the sketch below)
  • ndarray does broadcast, repeating existing values so that the shapes match
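A minimal illustration of these two points, using made-up Series:

s1 = Series([1, 2, 3], index=list("abc"))
s2 = Series([10, 20], index=list("bc"))
s1 + s2                    # label 'a' has no counterpart in s2, so the result is NaN there
s1.add(s2, fill_value=0)   # treat the missing value as 0 instead: 'a' stays 1.0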


Reprinted from blog.csdn.net/qq_29784441/article/details/80861438