This post shows how to use pandas to work with data the same way you would with SQL.
Read test data
import pandas as pd
import numpy as np
url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'
tips = pd.read_csv(url) # read the data
tips.head()
The first 5 rows of the test data (the standard tips dataset) are as follows:
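   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4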
SELECT (select statement)
SQL statement:
SELECT total_bill, tip, smoker, time FROM tips LIMIT 5;
Python statement:
tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
UPDATE (update statement)
SQL statement:
UPDATE tips SET tip = tip*2 WHERE tip < 2;
Python statement:
tips.loc[tips['tip'] < 2, 'tip'] *= 2
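The same update can also be written as an explicit assignment; a minimal sketch, equivalent to the *= form above, relying on index alignment to touch only the matching rows:
# only rows where tip < 2 receive the doubled values; the rest are untouched
tips.loc[tips['tip'] < 2, 'tip'] = tips['tip'] * 2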
DELETE (delete statement)
SQL statement:
DELETE FROM tips WHERE tip > 9;
Python statement:
tips = tips.loc[tips['tip'] <= 9]
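Note that this "delete" works by keeping the rows you want. If you prefer a deletion-flavored idiom, a minimal sketch using drop():
# drop the index labels of the offending rows instead of re-selecting
tips = tips.drop(tips[tips['tip'] > 9].index)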
WHERE (condition)
SQL statement:
SELECT * FROM tips WHERE time = 'Dinner' LIMIT 5;
Python statement:
tips[tips['time'] == 'Dinner'].head(5)
AND & OR
SQL statement:
SELECT * FROM tips WHERE time = 'Dinner' AND tip > 5.00;
Python statement:
# in pandas, & means AND and | means OR
tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]
SQL statement:
SELECT * FROM tips WHERE size >= 5 OR total_bill > 45;
Python statement:
# select rows where size >= 5 or total_bill > 45
tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]
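The same filters can also be written with query(), which accepts SQL-like boolean keywords (and/or) instead of & and |; a minimal sketch:
# WHERE time = 'Dinner' AND tip > 5.00
tips.query("time == 'Dinner' and tip > 5.00")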
GROUP BY (grouping and aggregation)
pandas has a similarly named groupby() method for performing the SQL GROUP BY operation. groupby() typically refers to a process in which we split a data set into groups, apply a function to each group (usually an aggregation), and then combine the results.
A common SQL operation is getting the count of records in each group across a data set. For example, a query that gets us the number of tips by sex:
SQL statement:
SELECT sex, count(*) FROM tips GROUP BY sex;
/*
Female 87
Male 157
*/
Python statement:
# COUNT in SQL is not the same as pandas' count(); size() is what we want here
tips.groupby('sex').size()
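To shape the output like the SQL result, with the group key as a regular column, a minimal sketch:
# turn the grouped sizes into a two-column frame, like SELECT sex, COUNT(*)
tips.groupby('sex').size().reset_index(name='count')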
Python statement:
tips.groupby('sex').count()
Python statement:
# count a single column
tips.groupby('sex')['total_bill'].count()
SQL statement:
SELECT day, AVG(tip), COUNT(*) FROM tips GROUP BY day;
/*
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thur 2.771452 62
*/
Multiple functions can be applied at once. For example, suppose we want to see how tips differ by day of the week; agg() lets you pass a dictionary to your grouped DataFrame, indicating which functions to apply to specific columns.
Python statement:
tips.groupby('day').agg({'tip':np.mean, 'day':np.size})
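On pandas 0.25 and later, named aggregation gives the output columns SQL-style aliases directly; a minimal sketch (the column names avg_tip and n are my own choice):
# 'size' counts rows per group, so any input column works for the count
tips.groupby('day').agg(avg_tip=('tip', 'mean'), n=('tip', 'size'))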
Grouping by multiple columns
SQL statement:
SELECT smoker, day, COUNT(*), AVG(tip) FROM tips GROUP BY smoker, day;
/*
smoker day
No Fri 4 2.812500
Sat 45 3.102889
Sun 57 3.167895
Thur 45 2.673778
Yes Fri 15 2.714000
Sat 42 2.875476
Sun 19 3.516842
Thur 17 3.030000
*/
Python statement:
tips.groupby(['smoker','day']).agg({'tip':[np.size,np.mean]})
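The result has the group keys in a row MultiIndex and the aggregations in a column MultiIndex; to flatten it into a SQL-like table, a minimal sketch:
grouped = tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})
grouped.columns = ['count', 'avg_tip'] # flatten the two-level column index
grouped.reset_index() # move smoker and day back into regular columns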
Checking for missing values: notnull() and isnull()
Create a new test data set:
df = pd.DataFrame({'col2': ['A', 'B', np.NaN, 'C', 'D'],
                   'col1': ['F', np.NaN, 'G', 'H', 'I']})
SQL statement:
SELECT * FROM df WHERE col2 IS NULL;
Python statement:
# select the rows (observations) where col2 is null
df[df['col2'].isnull()]
SQL statement:
SELECT * FROM df WHERE col1 IS NOT NULL;
Python statement:
# select the rows (observations) where col1 is not null
df[df['col1'].notnull()]
JOIN
You can use join() or merge() to perform a JOIN. By default, join() joins DataFrames on their indexes. Each method has parameters that let you specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or the columns to join on (column names or indexes).
df1 = pd.DataFrame({'key':['A','B','C','D'], 'value':np.random.randn(4)})
df2 = pd.DataFrame({'key':['B','D','D','E'], 'value':np.random.randn(4)})
INNER JOIN
SQL statement:
SELECT * FROM df1 INNER JOIN df2 ON df1.key = df2.key;
Python statement:
pd.merge(df1, df2, on='key')
indexed_df2 = df2.set_index('key')
pd.merge(df1, indexed_df2, left_on='key', right_index=True)
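As noted above, join() matches on indexes by default, so the same inner join can be written with join() after moving key into the index; a minimal sketch (the suffixes are required because both frames have a value column):
df1.set_index('key').join(df2.set_index('key'), how='inner', lsuffix='_df1', rsuffix='_df2')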
LEFT OUTER JOIN
SQL statement:
-- show all records from df1
SELECT * FROM df1 LEFT OUTER JOIN df2 ON df1.key = df2.key;
Python statement:
pd.merge(df1, df2, on='key', how='left')
RIGHT OUTER JOIN
SQL statement:
-- show all records from df2
SELECT * FROM df1 RIGHT OUTER JOIN df2 ON df1.key = df2.key;
Python statement:
pd.merge(df1, df2, on='key', how='right')
FULL JOIN
SQL statement:
-- show all records from both tables
SELECT * FROM df1 FULL OUTER JOIN df2 ON df1.key = df2.key;
Python statement:
pd.merge(df1, df2, on='key', how='outer')
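merge() can also report where each row came from via indicator=True, which has no direct SQL equivalent but is handy for checking an outer join; a minimal sketch:
# adds a _merge column with values left_only, right_only, or both
pd.merge(df1, df2, on='key', how='outer', indicator=True)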
UNION
New data sets:
df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
'rank': range(1, 4)})
df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
'rank': [1, 4, 5]})
SQL statement:
SELECT city, rank FROM df1
UNION ALL
SELECT city, rank FROM df2;
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Chicago 1
Boston 4
Los Angeles 5
*/
Python statement:
pd.concat([df1,df2])
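By default concat() keeps each frame's original row labels, so index values repeat across the two frames; pass ignore_index=True to renumber. A minimal sketch:
# renumber the rows 0..n-1 instead of keeping the original indexes
pd.concat([df1, df2], ignore_index=True)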
SQL's UNION is similar to UNION ALL, but UNION removes duplicate rows.
SQL statement:
SELECT city, rank FROM df1
UNION
SELECT city, rank FROM df2;
-- notice that there is only one Chicago record this time
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Boston 4
Los Angeles 5
*/
In pandas, you can use concat() in combination with drop_duplicates().
Python statement:
pd.concat([df1, df2]).drop_duplicates()