Python pandas 入门

数据结构Series

构造和初始化Series

import pandas as pd
import numpy as np
Series是一个一维的数据结构
s = pd.Series([7, 'Beijing', 2.17, -1232, 'Happy birthday!'])
s
0                  7
1            Beijing
2               2.17
3              -1232
4    Happy birthday!
dtype: object

pandas会默认用0到n来作为Series的index,但是我们也可以自己指定index。

s = pd.Series([7, 'Beijing', 2.17, -1232, 'Happy birthday!'],
             index = ['A', 'B', 'C', 'D', 'E'])
type(s)
pandas.core.series.Series

还可以用dictionary来构造一个Series,因为Series本来就是key value pairs。

cities = {'Beijing': 55000, 'Shanghai': 60000, 'Shenzhen': 50000, 'Hangzhou':20000, \
         'Guangzhou': 25000, 'Suzhou': None}
apts = pd.Series(cities)
print(apts)
print(type(apts))
Beijing      55000.0
Guangzhou    25000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     50000.0
Suzhou           NaN
dtype: float64
<class 'pandas.core.series.Series'>

选择数据

前面定义的index就是用来选择数据的

apts["Hangzhou"]
20000.0
apts[["Hangzhou", "Beijing", "Shenzhen"]]
# type(apts[["Hangzhou", "Beijing", "Shenzhen"]])
Hangzhou    20000.0
Beijing     55000.0
Shenzhen    50000.0
dtype: float64

boolean indexing在pandas当中也可以使用。

下面我再详细展示一下这个boolean indexing是如何工作的

apts[apts < 50000]
Guangzhou    25000.0
Hangzhou     20000.0
dtype: float64
less_than_50000 = apts < 50000
print(less_than_50000)
Beijing      False
Guangzhou     True
Hangzhou      True
Shanghai     False
Shenzhen     False
Suzhou       False
dtype: bool
apts[less_than_50000]
Guangzhou    25000.0
Hangzhou     20000.0
dtype: float64

Series元素赋值

Series的元素可以被赋值

print("Old value:", apts['Shenzhen'])
Old value: 50000.0
apts['Shenzhen'] = 55000
apts['Shenzhen']
55000.0

前面的boolean indexing在赋值的时候也可以用

print(apts[apts < 50000])
Guangzhou    25000.0
Hangzhou     20000.0
dtype: float64
apts[apts <= 50000] = 40000
apts
Beijing      55000.0
Guangzhou    40000.0
Hangzhou     40000.0
Shanghai     60000.0
Shenzhen     55000.0
Suzhou           NaN
dtype: float64

数学运算

下面我们来讲一些基本的数学运算。

apts / 2
Beijing      27500.0
Guangzhou    20000.0
Hangzhou     20000.0
Shanghai     30000.0
Shenzhen     27500.0
Suzhou           NaN
dtype: float64
apts * 2
Beijing      110000.0
Guangzhou     80000.0
Hangzhou      80000.0
Shanghai     120000.0
Shenzhen     110000.0
Suzhou            NaN
dtype: float64
np.square(apts)
Beijing      3.025000e+09
Guangzhou    1.600000e+09
Hangzhou     1.600000e+09
Shanghai     3.600000e+09
Shenzhen     3.025000e+09
Suzhou                NaN
dtype: float64

square也可以写成 **

apts ** 2
Beijing      3.025000e+09
Guangzhou    1.600000e+09
Hangzhou     1.600000e+09
Shanghai     3.600000e+09
Shenzhen     3.025000e+09
Suzhou                NaN
dtype: float64

我们再定义一个新的Series做加法

cars = pd.Series({'Beijing': 300000, 'Shanghai': 400000, 'Shenzhen': 300000, \
                      'Tianjin': 200000, 'Guangzhou': 200000, 'Chongqing': 150000})
cars
Beijing      300000
Chongqing    150000
Guangzhou    200000
Shanghai     400000
Shenzhen     300000
Tianjin      200000
dtype: int64
print(cars + apts * 100)
Beijing      5800000.0
Chongqing          NaN
Guangzhou    4200000.0
Hangzhou           NaN
Shanghai     6400000.0
Shenzhen     5800000.0
Suzhou             NaN
Tianjin            NaN
dtype: float64

数据缺失

'Hangzhou' in apts
True
'Hangzhou' in cars
False
apts.notnull()
Beijing       True
Guangzhou     True
Hangzhou      True
Shanghai      True
Shenzhen      True
Suzhou       False
dtype: bool
apts.isnull()
Beijing      False
Guangzhou    False
Hangzhou     False
Shanghai     False
Shenzhen     False
Suzhou        True
dtype: bool
apts[apts.isnull()]
Suzhou   NaN
dtype: float64
apts[apts.notnull()]
Beijing      55000.0
Guangzhou    40000.0
Hangzhou     40000.0
Shanghai     60000.0
Shenzhen     55000.0
dtype: float64
apts[apts.isnull() == False]
Beijing      55000.0
Guangzhou    40000.0
Hangzhou     40000.0
Shanghai     60000.0
Shenzhen     55000.0
dtype: float64

数据结构Dataframe

一个Dataframe就是一张表格,Series表示的是一维数组,Dataframe则是一个二维数组,可以类比成一张excel的spreadsheet。也可以把Dataframe当做一组Series的集合。

创建一个DataFrame

dataframe可以由一个dictionary构造得到。

data = {'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Hangzhou', 'Chongqing'],
       'year': [2016,2017,2016,2017,2016, 2016],
       'population': [2100, 2300, 1000, 700, 500, 500]}
pd.DataFrame(data)
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
city population year
0 Beijing 2100 2016
1 Shanghai 2300 2017
2 Guangzhou 1000 2016
3 Shenzhen 700 2017
4 Hangzhou 500 2016
5 Chongqing 500 2016

columns的名字和顺序可以指定

pd.DataFrame(data, columns = ['year', 'city', 'population'])
year city population
0 2016 Beijing 2100
1 2017 Shanghai 2300
2 2016 Guangzhou 1000
3 2017 Shenzhen 700
4 2016 Hangzhou 500
5 2016 Chongqing 500
pd.DataFrame(data, columns = ['year', 'city', 'population', 'debt'])
year city population debt
0 2016 Beijing 2100 NaN
1 2017 Shanghai 2300 NaN
2 2016 Guangzhou 1000 NaN
3 2017 Shenzhen 700 NaN
4 2016 Hangzhou 500 NaN
5 2016 Chongqing 500 NaN
frame2 = pd.DataFrame(data, columns = ['year', 'city', 'population', 'debt'],
            index=['one', 'two', 'three', 'four', 'five', 'six'])
print(frame2)
       year       city  population debt
one    2016    Beijing        2100  NaN
two    2017   Shanghai        2300  NaN
three  2016  Guangzhou        1000  NaN
four   2017   Shenzhen         700  NaN
five   2016   Hangzhou         500  NaN
six    2016  Chongqing         500  NaN

从DataFrame里选择数据

frame2['city']
one        Beijing
two       Shanghai
three    Guangzhou
four      Shenzhen
five      Hangzhou
six      Chongqing
Name: city, dtype: object
type(frame2['city'])
pandas.core.series.Series
print(frame2.city)
one        Beijing
two       Shanghai
three    Guangzhou
four      Shenzhen
five      Hangzhou
six      Chongqing
Name: city, dtype: object
frame2.ix['three']  #拿一行
year               2016
city          Guangzhou
population         1000
debt                NaN
Name: three, dtype: object
type(frame2.ix['three'])
pandas.core.series.Series
print(frame2.ix[2])   
year               2016
city          Guangzhou
population         1000
debt                NaN
Name: three, dtype: object

下面这种方法默认用来选列而不是选行

DataFrame元素赋值

frame2["population"]["one"] = 2200   #修改数据警告
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
frame2
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
year city population debt
one 2016 Beijing 2200 NaN
two 2017 Shanghai 2300 NaN
three 2016 Guangzhou 1000 NaN
four 2017 Shenzhen 700 NaN
five 2016 Hangzhou 500 NaN
six 2016 Chongqing 500 NaN

可以给一整列赋值

frame2['debt'] = 100000000
frame2
year city population debt
one 2016 Beijing 2200 100000000
two 2017 Shanghai 2300 100000000
three 2016 Guangzhou 1000 100000000
four 2017 Shenzhen 700 100000000
five 2016 Hangzhou 500 100000000
six 2016 Chongqing 500 100000000
frame2.ix['six'] = 0    #ix函数拿一行
frame2
year city population debt
one 2016 Beijing 2200 100000000
two 2017 Shanghai 2300 100000000
three 2016 Guangzhou 1000 100000000
four 2017 Shenzhen 700 100000000
five 2016 Hangzhou 500 100000000
six 0 0 0 0
frame2 = pd.DataFrame(data, \
                     columns = ['year', 'city', 'population', 'debt'],
                     index = ['one', 'two', 'three', 'four', 'five', 'six'])
print(frame2)
       year       city  population debt
one    2016    Beijing        2100  NaN
two    2017   Shanghai        2300  NaN
three  2016  Guangzhou        1000  NaN
four   2017   Shenzhen         700  NaN
five   2016   Hangzhou         500  NaN
six    2016  Chongqing         500  NaN
frame2.debt = np.arange(6)
frame2
year city population debt
one 2016 Beijing 2100 0
two 2017 Shanghai 2300 1
three 2016 Guangzhou 1000 2
four 2017 Shenzhen 700 3
five 2016 Hangzhou 500 4
six 2016 Chongqing 500 5

还可以用Series来指定需要修改的index以及相对应的value,没有指定的默认用NaN.

val  = pd.Series([200, 300, 500], index=['two', 'three', 'five'])
val
frame2['debt'] = val
print(frame2)
       year       city  population   debt
one    2016    Beijing        2100    NaN
two    2017   Shanghai        2300  200.0
three  2016  Guangzhou        1000  300.0
four   2017   Shenzhen         700    NaN
five   2016   Hangzhou         500  500.0
six    2016  Chongqing         500    NaN
frame2['western'] = (frame2.city == 'Chongqing')
print(frame2)
       year       city  population   debt western
one    2016    Beijing        2100    NaN   False
two    2017   Shanghai        2300  200.0   False
three  2016  Guangzhou        1000  300.0   False
four   2017   Shenzhen         700    NaN   False
five   2016   Hangzhou         500  500.0   False
six    2016  Chongqing         500    NaN    True
frame2.columns
Index(['year', 'city', 'population', 'debt', 'western'], dtype='object')
frame2.index
Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')

一个DataFrame就和一个numpy 2d array一样,可以被转置

frame2.T
one two three four five six
year 2016 2017 2016 2017 2016 2016
city Beijing Shanghai Guangzhou Shenzhen Hangzhou Chongqing
population 2100 2300 1000 700 500 500
debt NaN 200 300 NaN 500 NaN
western False False False False False True
pop = {'Beijing': {2016: 2100, 2017:2200},
      'Shanghai': {2015:2400, 2016:2500, 2017:2600}}
frame3 = pd.DataFrame(pop)
print(frame3)
      Beijing  Shanghai
2015      NaN      2400
2016   2100.0      2500
2017   2200.0      2600
print(frame3.T)
            2015    2016    2017
Beijing      NaN  2100.0  2200.0
Shanghai  2400.0  2500.0  2600.0

指定index的顺序,以及使用切片初始化数据

frame3['Beijing'][1:3]
2016    2100.0
2017    2200.0
Name: Beijing, dtype: float64
frame3['Shanghai'][:-1]
2015    2400
2016    2500
Name: Shanghai, dtype: int64
pdata = {'Beijing': frame3['Beijing'][:-1], 'Shanghai':frame3['Shanghai'][:-1]}
print(pd.DataFrame(pdata))
      Beijing  Shanghai
2015      NaN      2400
2016   2100.0      2500

我们还可以指定index的名字和列的名字

frame3.index.name = 'year'
frame3.columns.name = 'city'
frame3
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
city Beijing Shanghai
year
2015 NaN 2400
2016 2100.0 2500
2017 2200.0 2600
print(frame3.values)

type(frame3.values)
[[  nan 2400.]
 [2100. 2500.]
 [2200. 2600.]]

numpy.ndarray

Index

index object

obj = pd.Series(range(3), index=['a', 'b', 'c'])
print(obj)
index = obj.index
print(index)
print(index[1:])
a    0
b    1
c    2
dtype: int64
Index(['a', 'b', 'c'], dtype='object')
Index(['b', 'c'], dtype='object')

index的值是不能被更改的

index[1] = 'd'
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-21-a452e55ce13b> in <module>()
----> 1 index[1] = 'd'


C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   1722 
   1723     def __setitem__(self, key, value):
-> 1724         raise TypeError("Index does not support mutable operations")
   1725 
   1726     def __getitem__(self, key):


TypeError: Index does not support mutable operations
index = pd.Index(np.arange(3))
# index
obj2 = pd.Series([2,5,7], index=index)
print(obj2)
print(obj2.index is index)
0    2
1    5
2    7
dtype: int64
True
pop = {'Beijing': {2016: 2100, 2017:2200},
      'Shanghai': {2015:2400, 2016:2500, 2017:2600}}
frame3 = pd.DataFrame(pop)
print('Shanghai' in frame3.columns)
print(2015 in frame3.columns)
True
False

针对index进行索引和切片

obj = pd.Series(np.arange(4), index=['a','b','c','d'])
print(obj)
a    0
b    1
c    2
d    3
dtype: int32
obj['b']
obj[['b', 'a']]
b    1
a    0
dtype: int32

默认的数字index依旧可以使用

obj[[0, 2]]
a    0
c    2
dtype: int32

下面介绍如何对Series进行切片

obj[1:3]    #[1:3)
b    1
c    2
dtype: int32
obj['b':'d'] = 5    #['b':'d']
print(obj)
a    0
b    5
c    5
d    5
dtype: int32

对DataFrame进行Indexing与Series基本相同

frame = pd.DataFrame(np.arange(9).reshape(3,3),
                    index = ['a', 'c', 'd'],
                    columns = ['Hangzhou', 'Shenzhen', 'Nanjing'])
print(frame)
   Hangzhou  Shenzhen  Nanjing
a         0         1        2
c         3         4        5
d         6         7        8
frame['Hangzhou']
a    0
c    3
d    6
Name: Hangzhou, dtype: int32
frame[:2]
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Hangzhou Shenzhen Nanjing
a 0 1 2
c 3 4 5
frame.ix['a']
Hangzhou    0
Shenzhen    1
Nanjing     2
Name: a, dtype: int32
frame.ix['a':'d']
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Hangzhou Shenzhen Nanjing
a 0 1 2
c 3 4 5
d 6 7 8
frame['a':'d']
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Hangzhou Shenzhen Nanjing
a 0 1 2
c 3 4 5
d 6 7 8
frame['Hangzhou':'Najing']  
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Hangzhou Shenzhen Nanjing
frame.ix[['a','d'], ['Shenzhen', 'Nanjing']] #索引
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Shenzhen Nanjing
a 1 2
d 7 8
frame.ix[:, 'Shenzhen':'Nanjing']  #切片
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Shenzhen Nanjing
a 1 2
c 4 5
d 7 8
frame.ix[:'c', 'Hangzhou']
a    0
c    3
Name: Hangzhou, dtype: int32

DataFrame也可以用condition selection

frame[frame.Hangzhou > 1]
Hangzhou Shenzhen Nanjing
c 3 4 5
d 6 7 8
frame[frame < 5] = 0
print(frame)
   Hangzhou  Shenzhen  Nanjing
a         0         0        0
c         0         0        5
d         6         7        8

reindex

把一个Series或者DataFrame按照新的index顺序进行重排


obj = pd.Series([4.5, 7.2, -5.3, 3.2], index=['d', 'b', 'a', 'c'])
print(obj)
d    4.5
b    7.2
a   -5.3
c    3.2
dtype: float64
obj.reindex(['a', 'b', 'c', 'd', 'e'])
a   -5.3
b    7.2
c    3.2
d    4.5
e    NaN
dtype: float64
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value = 0) #默认填充
a   -5.3
b    7.2
c    3.2
d    4.5
e    0.0
dtype: float64
obj3 = pd.Series(['blue', 'purple', 'yellow'], index = [0,2,4])
print(obj3)
0      blue
2    purple
4    yellow
dtype: object
obj3.reindex(range(6), method='ffill')  #按照前面的进行填充
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
obj3.reindex(range(6), method='bfill')  #按照下一个填充
0      blue
1    purple
2    purple
3    yellow
4    yellow
5       NaN
dtype: object

既然我们可以对Series进行reindex,相应地,我们也可以用同样的方法对DataFrame进行reindex。

frame = pd.DataFrame(np.arange(9).reshape(3,3), 
                    index = ['a', 'c', 'd'],
                    columns = ['Hangzhou', 'Shenzhen', 'Nanjing'])
print(frame)
   Hangzhou  Shenzhen  Nanjing
a         0         1        2
c         3         4        5
d         6         7        8
frame.reindex(['a' , 'b', 'c', 'd'])
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Hangzhou Shenzhen Nanjing
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0

在reindex的同时,我们还可以重新指定columns

frame.reindex(columns = ['Shenzhen', 'Hangzhou', 'Chongqing'])
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Shenzhen Hangzhou Chongqing
a 1 0 NaN
c 4 3 NaN
d 7 6 NaN
frame.reindex(index =['a', 'b', 'c', 'd'],

              columns = ['Shenzhen', 'Hangzhou', 'Chongqing'])
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Shenzhen Hangzhou Chongqing
a 1.0 0.0 NaN
b NaN NaN NaN
c 4.0 3.0 NaN
d 7.0 6.0 NaN

下面介绍如何用drop来删除Series和DataFrame中的index

print(obj3)
print('*************')
obj4 = obj3.drop(2)
print(obj4)
0      blue
2    purple
4    yellow
dtype: object
*************
0      blue
4    yellow
dtype: object
obj3.drop([2, 4])
0    blue
dtype: object

删除DataFrame中的内容

frame
Hangzhou Shenzhen Nanjing
a 0 1 2
c 3 4 5
d 6 7 8
frame.drop(['a', 'c'])
Hangzhou Shenzhen Nanjing
d 6 7 8

drop不仅仅可以删除行,还可以删除列

frame.drop('Shenzhen', axis=1)  #0行,1按列删
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Hangzhou Nanjing
a 0 2
c 3 5
d 6 8
frame.drop(['Shenzhen', 'Hangzhou'], axis=1)
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Nanjing
a 2
c 5
d 8

hierarchical index

Series的hierarchical indexing

data = pd.Series(np.random.randn(10), 
                 index=[['a','a','a','b','b','c','c','c','d','d'], \   #大索引在前,小索引在后
                        [1,2,3,1,2,1,2,3,1,2]])
data
a  1   -0.535940
   2    0.190444
   3    0.137448
b  1    2.442986
   2    0.211423
c  1    1.439100
   2   -1.690219
   3    1.338686
d  1    1.274225
   2   -0.238439
dtype: float64
data.index
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 2, 2, 2, 3, 3], [0, 1, 2, 0, 1, 0, 1, 2, 0, 1]])
print(data["b"])
1    2.442986
2    0.211423
dtype: float64
data['b':'d']
b  1    2.442986
   2    0.211423
c  1    1.439100
   2   -1.690219
   3    1.338686
d  1    1.274225
   2   -0.238439
dtype: float64
data[1:4]   #所有的,按0,1,2...顺序切片出来
a  2    0.190444
   3    0.137448
b  1    2.442986
dtype: float64

unstack和stack可以帮助我们在hierarchical indexing和DataFrame之间进行切换。

data.unstack()  #变成DataFrame
1 2 3
a -0.535940 0.190444 0.137448
b 2.442986 0.211423 NaN
c 1.439100 -1.690219 1.338686
d 1.274225 -0.238439 NaN
type(data.unstack())
pandas.core.frame.DataFrame
data.unstack().stack()
a  1   -0.535940
   2    0.190444
   3    0.137448
b  1    2.442986
   2    0.211423
c  1    1.439100
   2   -1.690219
   3    1.338686
d  1    1.274225
   2   -0.238439
dtype: float64

DataFrame的hierarchical indexing

frame = pd.DataFrame(np.arange(12).reshape((4,3)),
                    index = [['a','a','b','b'], [1,2,1,2]],
                    columns = [['Beijing', 'Beijing', 'Shanghai'], ['apts', 'cars', 'apts']])
print(frame)  #行两层,列两层的DataFrame
    Beijing      Shanghai
       apts cars     apts
a 1       0    1        2
  2       3    4        5
b 1       6    7        8
  2       9   10       11
frame.index.names = ['alpha', 'number']
frame.columns.names = ['city', 'type']
print(frame)
city         Beijing      Shanghai
type            apts cars     apts
alpha number                      
a     1            0    1        2
      2            3    4        5
b     1            6    7        8
      2            9   10       11
frame.ix['a', 1]
#print(type(frame.ix['a', 1]))
city      type
Beijing   apts    0
          cars    1
Shanghai  apts    2
Name: (a, 1), dtype: int32
frame.ix['a', 2]['Beijing']['apts']
3

关于Merge, Join和Concatenate

concatenate

df1 = pd.DataFrame({'apts': [55000, 60000],
                   'cars': [200000, 300000],},
                  index = ['Shanghai', 'Beijing'])
df1
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars
Shanghai 55000 200000
Beijing 60000 300000
df2 = pd.DataFrame({'apts': [25000, 20000],
                   'cars': [150000, 120000],},
                  index = ['Hangzhou', 'Najing'])
print(df2)
           apts    cars
Hangzhou  25000  150000
Najing    20000  120000
df3 = pd.DataFrame({'apts': [30000, 10000],
                   'cars': [180000, 100000],},
                  index = ['Guangzhou', 'Chongqing'])
print(df3)
            apts    cars
Guangzhou  30000  180000
Chongqing  10000  100000
result = pd.concat([df1, df2, df3])  #上下连接起来
result 
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars
Shanghai 55000 200000
Beijing 60000 300000
Hangzhou 25000 150000
Najing 20000 120000
Guangzhou 30000 180000
Chongqing 10000 100000

在concatenate的时候可以指定keys,这样可以给每一个部分加上一个Key。

以下的例子就构造了一个hierarchical index。

frames = [df1, df2, df3]
frames
[           apts    cars
 Shanghai  55000  200000
 Beijing   60000  300000,            apts    cars
 Hangzhou  25000  150000
 Najing    20000  120000,             apts    cars
 Guangzhou  30000  180000
 Chongqing  10000  100000]
result2 = pd.concat([df1, df2, df3], keys=['x', 'y', 'z'])    #指定key
result2
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars
x Shanghai 55000 200000
Beijing 60000 300000
y Hangzhou 25000 150000
Najing 20000 120000
z Guangzhou 30000 180000
Chongqing 10000 100000
result2.ix['y']
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars
Hangzhou 25000 150000
Najing 20000 120000
result2.ix[3:6]
apts cars
y Najing 20000 120000
z Guangzhou 30000 180000
Chongqing 10000 100000
df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000]},
                  index = ['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin'])
df4
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
salaries
Suzhou 10000
Beijing 30000
Shanghai 30000
Guangzhou 20000
Tianjin 15000
result3 = pd.concat([result, df4], axis=1)    #按列concat,左右合拼{链接}起来
result3
apts cars salaries
Beijing 60000.0 300000.0 30000.0
Chongqing 10000.0 100000.0 NaN
Guangzhou 30000.0 180000.0 20000.0
Hangzhou 25000.0 150000.0 NaN
Najing 20000.0 120000.0 NaN
Shanghai 55000.0 200000.0 30000.0
Suzhou NaN NaN 10000.0
Tianjin NaN NaN 15000.0

复习一下前面讲过的stack

result3.stack()
Beijing    apts         60000.0
           cars        300000.0
           salaries     30000.0
Chongqing  apts         10000.0
           cars        100000.0
Guangzhou  apts         30000.0
           cars        180000.0
           salaries     20000.0
Hangzhou   apts         25000.0
           cars        150000.0
Najing     apts         20000.0
           cars        120000.0
Shanghai   apts         55000.0
           cars        200000.0
           salaries     30000.0
Suzhou     salaries     10000.0
Tianjin    salaries     15000.0
dtype: float64

用inner可以去掉NaN的数据,只留下匹配的数据

result3 = pd.concat([result, df4], axis=1, join='inner') #只留下匹配的 
result3
apts cars salaries
Beijing 60000 300000 30000
Shanghai 55000 200000 30000
Guangzhou 30000 180000 20000

用append来做concatenation

df1
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars
Shanghai 55000 200000
Beijing 60000 300000
df1.append(df2)
apts cars
Shanghai 55000 200000
Beijing 60000 300000
Hangzhou 25000 150000
Najing 20000 120000
df1.append(df4)
apts cars salaries
Shanghai 55000.0 200000.0 NaN
Beijing 60000.0 300000.0 NaN
Suzhou NaN NaN 10000.0
Beijing NaN NaN 30000.0
Shanghai NaN NaN 30000.0
Guangzhou NaN NaN 20000.0
Tianjin NaN NaN 15000.0

Series和DataFrame还可以被一起concatenate,这时候Series会先被转成DataFrame然后做Join,因为Series本来就是一个只有一维的DataFrame对吧。

s1 = pd.Series([60, 50], index=['Shanghai', 'Beijing'], name='meal')
print(s1)
Shanghai    60
Beijing     50
Name: meal, dtype: int64
df1
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars
Shanghai 55000 200000
Beijing 60000 300000
pd.concat([df1, s1], axis=1)
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars meal
Shanghai 55000 200000 60
Beijing 60000 300000 50

下面讲如何append一个row到DataFrame里。

s2 = pd.Series([18000, 120000], index=['apts', 'cars'], name='Xiamen')
s2
apts     18000
cars    120000
Name: Xiamen, dtype: int64
df1.append(s2)
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars
Shanghai 55000 200000
Beijing 60000 300000
Xiamen 18000 120000
pd.concat([df1,s2],axis=0)   #Series和DataFrame还可以被一起concatenate,只能用append
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars 0
Shanghai 55000.0 200000.0 NaN
Beijing 60000.0 300000.0 NaN
apts NaN NaN 18000.0
cars NaN NaN 120000.0

Merge(Join)


df1 = pd.DataFrame({'apts': [55000, 60000, 58000],
                   'cars': [200000, 300000,250000],
                  'cities': ['Shanghai', 'Beijing','Shenzhen']})
df1
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars cities
0 55000 200000 Shanghai
1 60000 300000 Beijing
2 58000 250000 Shenzhen
df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000],
                  'cities': ['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin']})
df4
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
cities salaries
0 Suzhou 10000
1 Beijing 30000
2 Shanghai 30000
3 Guangzhou 20000
4 Tianjin 15000
result = pd.merge(df1, df4, on='cities')   #按cities这列拼在一起
result
apts cars cities salaries
0 55000 200000 Shanghai 30000
1 60000 300000 Beijing 30000
result2 = pd.merge(df1, df4, on='cities', how='outer')   #没匹配上的也要   
                                #https://blog.csdn.net/weixin_37226516/article/details/64137043                                                                            
result2 
apts cars cities salaries
0 55000.0 200000.0 Shanghai 30000.0
1 60000.0 300000.0 Beijing 30000.0
2 58000.0 250000.0 Shenzhen NaN
3 NaN NaN Suzhou 10000.0
4 NaN NaN Guangzhou 20000.0
5 NaN NaN Tianjin 15000.0

join on index

df1 = pd.DataFrame({'apts': [55000, 60000, 58000],
                   'cars': [200000, 300000,250000]},
                  index=['Shanghai', 'Beijing','Shenzhen'])
df1
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars
Shanghai 55000 200000
Beijing 60000 300000
Shenzhen 58000 250000
df4 = pd.DataFrame({'salaries': [10000, 30000, 30000, 20000, 15000]},
                  index=['Suzhou', 'Beijing', 'Shanghai', 'Guangzhou', 'Tianjin'])
df4
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
salaries
Suzhou 10000
Beijing 30000
Shanghai 30000
Guangzhou 20000
Tianjin 15000
df1.join(df4)   #join只保留前者的数据 默认left join
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars salaries
Shanghai 55000 200000 30000.0
Beijing 60000 300000 30000.0
Shenzhen 58000 250000 NaN
df4.join(df1)
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
salaries apts cars
Suzhou 10000 NaN NaN
Beijing 30000 60000.0 300000.0
Shanghai 30000 55000.0 200000.0
Guangzhou 20000 NaN NaN
Tianjin 15000 NaN NaN
df1.join(df4, how='outer')   #保留全部数据
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars salaries
Beijing 60000.0 300000.0 30000.0
Guangzhou NaN NaN 20000.0
Shanghai 55000.0 200000.0 30000.0
Shenzhen 58000.0 250000.0 NaN
Suzhou NaN NaN 10000.0
Tianjin NaN NaN 15000.0

也可以用merge来写

pd.merge(df1, df4,left_index=True,right_index=True,how='outer')   #join的条件是left_index
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
apts cars salaries
Beijing 60000.0 300000.0 30000.0
Guangzhou NaN NaN 20000.0
Shanghai 55000.0 200000.0 30000.0
Shenzhen 58000.0 250000.0 NaN
Suzhou NaN NaN 10000.0
Tianjin NaN NaN 15000.0

猜你喜欢

转载自blog.csdn.net/douhh_sisy/article/details/80560869