This post shows how to use pandas to work with data the same way you would with SQL.
Read test data
import pandas as pd
import numpy as np
url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'
tips = pd.read_csv(url) # read the data
tips.head()
The first 5 rows of the test data (the standard tips dataset) are as follows:
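   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4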
SELECT (select statement)
SQL statement:
SELECT total_bill, tip, smoker, time FROM tips LIMIT 5;
Python statement:
tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
UPDATE (update statement)
SQL statement:
UPDATE tips SET tip = tip*2 WHERE tip < 2;
Python statement:
tips.loc[tips['tip'] < 2, 'tip'] *= 2
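The same update can also be written as an explicit assignment; a minimal sketch, equivalent to the *= form above, relying on index alignment to touch only the matching rows:
# only rows where tip < 2 receive the doubled values; the rest are untouched
tips.loc[tips['tip'] < 2, 'tip'] = tips['tip'] * 2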
DELETE (delete statement)
SQL statement:
DELETE FROM tips WHERE tip > 9;
Python statement:
tips = tips.loc[tips['tip'] <= 9]
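Note that this "delete" works by keeping the rows you want. If you prefer a deletion-flavored idiom, a minimal sketch using drop():
# drop the index labels of the offending rows instead of re-selecting
tips = tips.drop(tips[tips['tip'] > 9].index)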
WHERE (condition)
SQL statement:
SELECT * FROM tips WHERE time = 'Dinner' LIMIT 5;
Python statement:
tips[tips['time'] == 'Dinner'].head(5)
AND & OR
SQL statement:
SELECT * FROM tips WHERE time = 'Dinner' AND tip > 5.00;
Python statement:
# in pandas, & means AND and | means OR
tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]
SQL statement:
SELECT * FROM tips WHERE size >= 5 OR total_bill > 45;
Python statement:
# select rows where size >= 5 or total_bill > 45
tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]
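The same filters can also be written with query(), which accepts SQL-like boolean keywords (and/or) instead of & and |; a minimal sketch:
# WHERE time = 'Dinner' AND tip > 5.00
tips.query("time == 'Dinner' and tip > 5.00")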
GROUP BY (grouping and aggregation)
pandas has a similarly named groupby() method for performing the SQL GROUP BY operation. groupby() typically refers to a process in which we split a data set into groups, apply a function to each group (usually an aggregation), and then combine the results.
A common SQL operation is getting the count of records in each group across a data set. For example, a query that gets us the number of tips by sex:
SQL statement:
SELECT sex, count(*) FROM tips GROUP BY sex;
/*
Female 87
Male 157
*/
Python statement:
# COUNT in SQL is not the same as pandas' count(); size() is what we want here
tips.groupby('sex').size()
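To shape the output like the SQL result, with the group key as a regular column, a minimal sketch:
# turn the grouped sizes into a two-column frame, like SELECT sex, COUNT(*)
tips.groupby('sex').size().reset_index(name='count')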
Python statement:
tips.groupby('sex').count()
Python statement:
# count a single column
tips.groupby('sex')['total_bill'].count()
SQL statement:
SELECT day, AVG(tip), COUNT(*) FROM tips GROUP BY day;
/*
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thur 2.771452 62
*/
Multiple functions can be applied at once. For example, suppose we want to see how tips differ by day of the week; agg() lets you pass a dictionary to your grouped DataFrame, indicating which functions to apply to specific columns.
Python statement:
tips.groupby('day').agg({'tip':np.mean, 'day':np.size})
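On pandas 0.25 and later, named aggregation gives the output columns SQL-style aliases directly; a minimal sketch (the column names avg_tip and n are my own choice):
# 'size' counts rows per group, so any input column works for the count
tips.groupby('day').agg(avg_tip=('tip', 'mean'), n=('tip', 'size'))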
Grouping by multiple columns
SQL statement:
SELECT smoker, day, COUNT(*), AVG(tip) FROM tips GROUP BY smoker, day;
/*
smoker day
No Fri 4 2.812500
Sat 45 3.102889
Sun 57 3.167895
Thur 45 2.673778
Yes Fri 15 2.714000
Sat 42 2.875476
Sun 19 3.516842
Thur 17 3.030000
*/
Python statement:
tips.groupby(['smoker','day']).agg({'tip':[np.size,np.mean]})
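The result has the group keys in a row MultiIndex and the aggregations in a column MultiIndex; to flatten it into a SQL-like table, a minimal sketch:
grouped = tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})
grouped.columns = ['count', 'avg_tip'] # flatten the two-level column index
grouped.reset_index() # move smoker and day back into regular columns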
Checking for missing values: notnull() and isnull()
Create a new test data set:
df = pd.DataFrame({'col2': ['A', 'B', np.NaN, 'C', 'D'],
                   'col1': ['F', np.NaN, 'G', 'H', 'I']})
SQL statement:
SELECT * FROM df WHERE col2 IS NULL;
Python statement:
# select the rows (observations) where col2 is null
df[df['col2'].isnull()]
SQL statement:
SELECT * FROM df WHERE col1 IS NOT NULL;
Python statement:
# select the rows (observations) where col1 is not null
df[df['col1'].notnull()]
JOIN
You can use join() or merge() to perform a JOIN. By default, join() joins DataFrames on their indexes. Each method has parameters that let you specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or the columns to join on (column names or indexes).
df1 = pd.DataFrame({'key':['A','B','C','D'], 'value':np.random.randn(4)})
df2 = pd.DataFrame({'key':['B','D','D','E'], 'value':np.random.randn(4)})
INNER JOIN
SQL statement:
SELECT * FROM df1 INNER JOIN df2 ON df1.key = df2.key;
Python statement:
pd.merge(df1, df2, on='key')
indexed_df2 = df2.set_index('key')
pd.merge(df1, indexed_df2, left_on='key', right_index=True)
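As noted above, join() matches on indexes by default, so the same inner join can be written with join() after moving key into the index; a minimal sketch (the suffixes are required because both frames have a value column):
df1.set_index('key').join(df2.set_index('key'), how='inner', lsuffix='_df1', rsuffix='_df2')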
LEFT OUTER JOIN
SQL statement:
-- show all records from df1
SELECT * FROM df1 LEFT OUTER JOIN df2 ON df1.key = df2.key;
Python statement:
pd.merge(df1, df2, on='key', how='left')
RIGHT OUTER JOIN
SQL statement:
-- show all records from df2
SELECT * FROM df1 RIGHT OUTER JOIN df2 ON df1.key = df2.key;
Python statement:
pd.merge(df1, df2, on='key', how='right')
FULL JOIN
SQL statement:
-- show all records from both tables
SELECT * FROM df1 FULL OUTER JOIN df2 ON df1.key = df2.key;
Python statement:
pd.merge(df1, df2, on='key', how='outer')
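merge() can also report where each row came from via indicator=True, which has no direct SQL equivalent but is handy for checking an outer join; a minimal sketch:
# adds a _merge column with values left_only, right_only, or both
pd.merge(df1, df2, on='key', how='outer', indicator=True)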
UNION
New data sets:
df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
'rank': range(1, 4)})
df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
'rank': [1, 4, 5]})
SQL statement:
SELECT city, rank FROM df1
UNION ALL
SELECT city, rank FROM df2;
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Chicago 1
Boston 4
Los Angeles 5
*/
Python statement:
pd.concat([df1,df2])
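By default concat() keeps each frame's original row labels, so index values repeat across the two frames; pass ignore_index=True to renumber. A minimal sketch:
# renumber the rows 0..n-1 instead of keeping the original indexes
pd.concat([df1, df2], ignore_index=True)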
SQL's UNION is similar to UNION ALL, but UNION removes duplicate rows.
SQL statement:
SELECT city, rank FROM df1
UNION
SELECT city, rank FROM df2;
-- notice that there is only one Chicago record this time
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Boston 4
Los Angeles 5
*/
In pandas, you can use concat() in combination with drop_duplicates().
Python statement:
pd.concat([df1, df2]).drop_duplicates()