Pandas Cookbook -- 09 Merging Pandas Objects and Databases

Merging Pandas Objects and Databases

Based on SeanCheney's translation on Jianshu; I have adjusted the formatting and restructured the contents to make it easier to read and to look things up later.

import pandas as pd
import numpy as np

Inserting into a DataFrame

Read the names dataset

names = pd.read_csv('data/names.csv')
names
Name Age
0 Cornelia 70
1 Abbas 69
2 Penelope 4
3 Niko 2

Assign a new row directly with loc

new_data_list = ['Aria', 1]
names.loc[4] = new_data_list
names
Name Age
0 Cornelia 70
1 Abbas 69
2 Penelope 4
3 Niko 2
4 Aria 1

A new row can also be assigned with a dictionary

names.loc[len(names)] = {'Name':'Zayd', 'Age':2}

A dictionary or Series can list the columns in any order

names.loc[len(names)] = pd.Series({'Age':32, 'Name':'Dean'})

Adding new rows with append

1. Append a dictionary directly

names.append({'Name':'Aria2222', 'Age':12222},ignore_index=True)
Name Age
0 Cornelia 70
1 Abbas 69
2 Penelope 4
3 Niko 2
4 Aria 1
5 Zayd 2
6 Dean 32
7 Aria2222 12222
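Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; the idiomatic replacement is pd.concat. A minimal sketch, using a small in-memory stand-in for the names table above:

```python
import pandas as pd

# Stand-in for the small names table above (hypothetical values).
names = pd.DataFrame({'Name': ['Cornelia', 'Abbas'], 'Age': [70, 69]})

# DataFrame.append was removed in pandas 2.0. The replacement:
# wrap the new row in a one-row DataFrame and concatenate.
new_row = pd.DataFrame([{'Name': 'Aria2222', 'Age': 12222}])
result = pd.concat([names, new_row], ignore_index=True)
```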

2. append can also concatenate a Series onto a DataFrame

Create a Series object

s = pd.Series({'Name': 'Zach', 'Age': 3}, name=len(names))
names.append(s)
Name Age
0 Cornelia 70
1 Abbas 69
2 Penelope 4
3 Niko 2
4 Aria 1
5 Zayd 2
6 Dean 32
7 Zach 3

append can add several rows at once: just pass the objects in a list

s1 = pd.Series({'Name': 'Zach', 'Age': 3}, name=len(names))
s2 = pd.Series({'Name': 'Zayd', 'Age': 2}, name='USA')
names.append([s1, s2])
Name Age
0 Cornelia 70
1 Abbas 69
2 Penelope 4
3 Niko 2
4 Aria 1
5 Zayd 2
6 Dean 32
7 Zach 3
USA Zayd 2

Adding a new column with insert

DataFrame.insert(loc, column, value, allow_duplicates=False)

Insert column into DataFrame at specified location.
Raises a ValueError if column is already contained in the DataFrame, unless allow_duplicates is set to True.



  • loc : int
    • Insertion index. Must verify 0 <= loc <= len(columns)
  • column : string, number, or hashable object
    • label of the inserted column
  • value : int, Series, or array-like
  • allow_duplicates : bool, optional
names.insert(0,'id',np.random.randint(len(names)))
names
id Name Age
0 6 Cornelia 70
1 6 Abbas 69
2 6 Penelope 4
3 6 Niko 2
4 6 Aria 1
5 6 Zayd 2
6 6 Dean 32
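A caveat on the call above: np.random.randint(len(names)) returns a single scalar, which pandas broadcasts to every row (hence the repeated id 6). To give each row its own value, pass an array of length len(names); a sketch with a small stand-in frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Cornelia', 'Abbas', 'Penelope'],
                   'Age': [70, 69, 4]})

# Generate one random id per row (size=len(df)), not a single scalar.
rng = np.random.default_rng(0)
df.insert(0, 'id', rng.integers(0, 100, size=len(df)))
```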

Merging DataFrames

Read the stocks_2016, stocks_2017, and stocks_2018 datasets, using Symbol as the index

years = 2016, 2017, 2018
stock_tables = [pd.read_csv('data/stocks_{}.csv'.format(year), index_col='Symbol') for year in years]
stocks_2016, stocks_2017, stocks_2018 = stock_tables

names = ['prices', 'transactions']
food_tables = [pd.read_csv('data/food_{}.csv'.format(name)) for name in names]

concat

Put the two DataFrames in a list and concatenate them with pandas' concat function

s_list = [stocks_2016, stocks_2017]
pd.concat(s_list)
Shares Low High
Symbol
AAPL 80 95 110
TSLA 50 80 130
WMT 40 55 70
AAPL 50 120 140
GE 100 30 40
IBM 87 75 95
SLB 20 55 85
TXN 500 15 23
TSLA 100 100 300

concat is the only function that can concatenate DataFrames vertically

The keys parameter names each DataFrame; the labels appear at the outermost level of the row index, producing a MultiIndex. The names parameter renames each index level

pd.concat(s_list, keys=['2016', '2017'], names=['Year', 'Symbol'])
Shares Low High
Year Symbol
2016 AAPL 80 95 110
TSLA 50 80 130
WMT 40 55 70
2017 AAPL 50 120 140
GE 100 30 40
IBM 87 75 95
SLB 20 55 85
TXN 500 15 23
TSLA 100 100 300

Concatenation can also be horizontal: set the axis parameter to 'columns' (or 1)

pd.concat(s_list, keys=['2016', '2017'], axis='columns', names=['Year', None],sort=True)
Year 2016 2017
Shares Low High Shares Low High
AAPL 80.0 95.0 110.0 50.0 120.0 140.0
GE NaN NaN NaN 100.0 30.0 40.0
IBM NaN NaN NaN 87.0 75.0 95.0
SLB NaN NaN NaN 20.0 55.0 85.0
TSLA 50.0 80.0 130.0 100.0 100.0 300.0
TXN NaN NaN NaN 500.0 15.0 23.0
WMT 40.0 55.0 70.0 NaN NaN NaN

By default, concat uses an outer join and keeps every row of every DataFrame. Setting the join parameter switches to an inner join:

pd.concat(s_list, join='inner', keys=['2016', '2017'], axis='columns', names=['Year', None])
Year 2016 2017
Shares Low High Shares Low High
Symbol
AAPL 80 95 110 50 120 140
TSLA 50 80 130 100 100 300

Tips for concat

Name the DataFrames (here via a dict), then concatenate

pd.concat(dict(zip(years,stock_tables)), axis='columns',sort=True)
2016 2017 2018
Shares Low High Shares Low High Shares Low High
AAPL 80.0 95.0 110.0 50.0 120.0 140.0 40.0 135.0 170.0
AMZN NaN NaN NaN NaN NaN NaN 8.0 900.0 1125.0
GE NaN NaN NaN 100.0 30.0 40.0 NaN NaN NaN
IBM NaN NaN NaN 87.0 75.0 95.0 NaN NaN NaN
SLB NaN NaN NaN 20.0 55.0 85.0 NaN NaN NaN
TSLA 50.0 80.0 130.0 100.0 100.0 300.0 50.0 220.0 400.0
TXN NaN NaN NaN 500.0 15.0 23.0 NaN NaN NaN
WMT 40.0 55.0 70.0 NaN NaN NaN NaN NaN NaN

append is a much-simplified version of concat; internally, append just calls concat. The second pd.concat example in this section can also be written as:

stocks_2016.append(stocks_2017)
Shares Low High
Symbol
AAPL 80 95 110
TSLA 50 80 130
WMT 40 55 70
AAPL 50 120 140
GE 100 30 40
IBM 87 75 95
SLB 20 55 85
TXN 500 15 23
TSLA 100 100 300

join

Use join to connect DataFrames; when column names overlap, set lsuffix or rsuffix to distinguish them

stocks_2016.join(stocks_2017, lsuffix='_2016', rsuffix='_2017', how='outer')
Shares_2016 Low_2016 High_2016 Shares_2017 Low_2017 High_2017
Symbol
AAPL 80.0 95.0 110.0 50.0 120.0 140.0
GE NaN NaN NaN 100.0 30.0 40.0
IBM NaN NaN NaN 87.0 75.0 95.0
SLB NaN NaN NaN 20.0 55.0 85.0
TSLA 50.0 80.0 130.0 100.0 100.0 300.0
TXN NaN NaN NaN 500.0 15.0 23.0
WMT 40.0 55.0 70.0 NaN NaN NaN

merge

Inspect the two small datasets, food_prices and food_transactions

food_prices, food_transactions = food_tables
food_prices
item store price Date
0 pear A 0.99 2017
1 pear B 1.99 2017
2 peach A 2.99 2017
3 peach B 3.49 2017
4 banana A 0.39 2017
5 banana B 0.49 2017
6 steak A 5.99 2017
7 steak B 6.99 2017
8 steak B 4.99 2015
food_transactions
custid item store quantity
0 1 pear A 5
1 1 banana A 10
2 2 steak B 3
3 2 pear B 1
4 2 peach B 2
5 2 steak B 1
6 2 coconut B 4

Merge food_transactions with food_prices on the keys item and store

food_transactions.merge(food_prices, on=['item', 'store'])
custid item store quantity price Date
0 1 pear A 5 0.99 2017
1 1 banana A 10 0.39 2017
2 2 steak B 3 6.99 2017
3 2 steak B 3 4.99 2015
4 2 steak B 1 6.99 2017
5 2 steak B 1 4.99 2015
6 2 pear B 1 1.99 2017
7 2 peach B 2 3.49 2017

Because steak at store B appears twice in each table, the merge forms a Cartesian product, producing four steak-B rows in the result.
Because coconut has no matching price, it is missing from the result entirely.
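Both effects can be made visible at once with merge's indicator parameter, which adds a _merge column recording where each row came from. A minimal sketch with toy stand-ins for the two tables:

```python
import pandas as pd

transactions = pd.DataFrame({'item': ['pear', 'coconut'],
                             'store': ['A', 'B'],
                             'quantity': [5, 4]})
prices = pd.DataFrame({'item': ['pear', 'steak'],
                       'store': ['A', 'B'],
                       'price': [0.99, 6.99]})

# how='left' keeps unmatched transactions; indicator=True labels each
# row 'both' (matched) or 'left_only' (no price found).
merged = transactions.merge(prices, on=['item', 'store'],
                            how='left', indicator=True)
```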

Next, merge in only the 2017 prices

food_transactions.merge(food_prices.query('Date == 2017'), how='left')
custid item store quantity price Date
0 1 pear A 5 0.99 2017.0
1 1 banana A 10 0.39 2017.0
2 2 steak B 3 6.99 2017.0
3 2 pear B 1 1.99 2017.0
4 2 peach B 2 3.49 2017.0
5 2 steak B 1 6.99 2017.0
6 2 coconut B 4 NaN NaN

Differences between concat, join, and merge

concat:

  1. A pandas function
  2. Concatenates two or more pandas objects vertically or horizontally
  3. Aligns only on the index
  4. Raises an error when duplicate values appear in the index
  5. Defaults to an outer join (inner is also available)
  6. The only function that can concatenate DataFrames vertically

join:

  1. A DataFrame method
  2. Joins two or more pandas objects horizontally only
  3. Aligns the calling DataFrame's columns or index with the other objects' index (never their columns)
  4. Handles duplicate index values by forming a Cartesian product
  5. Defaults to a left join (inner, outer, and right are also available)

merge:

  1. A DataFrame method
  2. Joins exactly two DataFrames horizontally
  3. Aligns on the columns or index of the calling DataFrame and the columns or index of the other DataFrame
  4. Handles duplicate join keys by forming a Cartesian product
  5. Defaults to an inner join (left, outer, and right are also available)
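The differing defaults can be verified on a pair of tiny frames with partially overlapping indexes (a sketch; the frames are made up):

```python
import pandas as pd

left = pd.DataFrame({'v1': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'v2': [3, 4]}, index=['b', 'c'])

# concat defaults to an outer join on the index: keeps a, b, and c.
outer = pd.concat([left, right], axis='columns')

# join defaults to a left join on the index: keeps a and b.
left_joined = left.join(right)

# merge defaults to an inner join: keeps only the shared label b.
inner = left.merge(right, left_index=True, right_index=True)
```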

Connecting to a SQL database

Create a SQLAlchemy engine

Before reading the chinook database, create a SQLAlchemy engine

from sqlalchemy import create_engine
engine = create_engine('sqlite:///data/chinook.db')

The read_sql_table function

read_sql_table reads a whole table: the first argument is the table name, the second is the engine; it returns a DataFrame

tracks = pd.read_sql_table('tracks', engine)
tracks.iloc[:5,:5]
TrackId Name AlbumId MediaTypeId GenreId
0 1 For Those About To Rock (We Salute You) 1 1 1
1 2 Balls to the Wall 2 2 1
2 3 Fast As a Shark 3 2 1
3 4 Restless and Wild 3 2 1
4 5 Princess of the Dawn 3 2 1
genres = pd.read_sql_table('genres', engine)
genres.head()
GenreId Name
0 1 Rock
1 2 Jazz
2 3 Metal
3 4 Alternative & Punk
4 5 Rock And Roll

Find the average track length for each genre

genre_track = genres.merge(
    tracks[['GenreId', 'Milliseconds']], 
    on='GenreId', 
    how='left').drop('GenreId', axis='columns')
genre_track.head()
Name Milliseconds
0 Rock 343719
1 Rock 342562
2 Rock 230619
3 Rock 252051
4 Rock 375418

Average the Milliseconds per genre and convert the result to the timedelta data type

genre_time = genre_track.groupby('Name')['Milliseconds'].mean()
pd.to_timedelta(genre_time, unit='ms').dt.floor('s').sort_values()[:10]
Name
Rock And Roll        00:02:14
Opera                00:02:54
Hip Hop/Rap          00:02:58
Easy Listening       00:03:09
Bossa Nova           00:03:39
R&B/Soul             00:03:40
World                00:03:44
Pop                  00:03:49
Latin                00:03:52
Alternative & Punk   00:03:54
Name: Milliseconds, dtype: timedelta64[ns]

Find each customer's total spending

cust = pd.read_sql_table('customers', engine, columns=['CustomerId', 'FirstName', 'LastName'])
invoice = pd.read_sql_table('invoices', engine, columns=['InvoiceId','CustomerId'])
ii = pd.read_sql_table('invoice_items', engine, columns=['InvoiceId', 'UnitPrice', 'Quantity'])
cust_inv = cust.merge(invoice, on='CustomerId').merge(ii, on='InvoiceId')
cust_inv.head()
CustomerId FirstName LastName InvoiceId UnitPrice Quantity
0 1 Luís Gonçalves 98 1.99 1
1 1 Luís Gonçalves 98 1.99 1
2 1 Luís Gonçalves 121 0.99 1
3 1 Luís Gonçalves 121 0.99 1
4 1 Luís Gonçalves 121 0.99 1

Now multiply quantity by unit price to get each customer's total spending

total = cust_inv['Quantity'] * cust_inv['UnitPrice']
cols = ['CustomerId', 'FirstName', 'LastName']
cust_inv.assign(Total = total).groupby(cols)['Total'].sum().sort_values(ascending=False).head()
CustomerId  FirstName  LastName  
6           Helena     Holý          49.62
26          Richard    Cunningham    47.62
57          Luis       Rojas         46.62
46          Hugh       O'Reilly      45.62
45          Ladislav   Kovács        45.62
Name: Total, dtype: float64

The read_sql_query function

pd.read_sql_query('select * from tracks limit 5', engine).iloc[:,:5]
TrackId Name AlbumId MediaTypeId GenreId
0 1 For Those About To Rock (We Salute You) 1 1 1
1 2 Balls to the Wall 2 2 1
2 3 Fast As a Shark 3 2 1
3 4 Restless and Wild 3 2 1
4 5 Princess of the Dawn 3 2 1

Longer SQL strings can also be passed to read_sql_query

sql_string1 = '''
          select 
              Name, 
              time(avg(Milliseconds) / 1000, 'unixepoch') as avg_time
          from (
                  select 
                      g.Name, 
                      t.Milliseconds
                  from 
                      genres as g 
                  join
                      tracks as t
                      on 
                          g.genreid == t.genreid
              )
          group by 
              Name
          order by 
              avg_time
          '''
pd.read_sql_query(sql_string1, engine)[:10]
Name avg_time
0 Rock And Roll 00:02:14
1 Opera 00:02:54
2 Hip Hop/Rap 00:02:58
3 Easy Listening 00:03:09
4 Bossa Nova 00:03:39
5 R&B/Soul 00:03:40
6 World 00:03:44
7 Pop 00:03:49
8 Latin 00:03:52
9 Alternative & Punk 00:03:54
sql_string2 = '''
          select 
                c.customerid, 
                c.FirstName, 
                c.LastName, 
                sum(ii.quantity *  ii.unitprice) as Total
          from
                customers as c
          join
                invoices as i
                     on c.customerid = i.customerid
          join
                invoice_items as ii
                     on i.invoiceid = ii.invoiceid
          group by
                c.customerid, c.FirstName, c.LastName
          order by
                Total desc
          '''
pd.read_sql_query(sql_string2, engine)[:10]
CustomerId FirstName LastName Total
0 6 Helena Holý 49.62
1 26 Richard Cunningham 47.62
2 57 Luis Rojas 46.62
3 45 Ladislav Kovács 45.62
4 46 Hugh O'Reilly 45.62
5 37 Fynn Zimmermann 43.62
6 24 Frank Ralston 43.62
7 28 Julia Barnett 43.62
8 25 Victor Stevens 42.62
9 7 Astrid Gruber 42.62
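When part of a query comes from user input, pass the values through read_sql_query's params argument instead of formatting them into the SQL string, so the driver quotes them safely. A self-contained sketch against an in-memory SQLite database (the miniature tracks table here is made up, not chinook):

```python
import sqlite3

import pandas as pd

# Build a throwaway in-memory database with a tiny tracks table.
con = sqlite3.connect(':memory:')
con.execute('create table tracks (TrackId int, Name text, Milliseconds int)')
con.executemany('insert into tracks values (?, ?, ?)',
                [(1, 'Song A', 200000), (2, 'Song B', 350000)])

# The ? placeholder is filled in from params by the SQLite driver.
long_tracks = pd.read_sql_query(
    'select Name from tracks where Milliseconds > ?', con, params=(300000,))
```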


Reposted from www.cnblogs.com/shiyushiyu/p/9818378.html