"True" pandas, "fake" SQL

This post shows how to use pandas to manipulate data the same way you would with SQL.

Read the test data

import pandas as pd
import numpy as np

url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'

tips = pd.read_csv(url)  # read the data
tips.head()

The first 5 rows of the test data look like this:

SELECT (select statement)

SQL statement:

SELECT total_bill, tip, smoker, time FROM tips LIMIT 5;

Python statement:

tips[['total_bill', 'tip', 'smoker', 'time']].head(5)


UPDATE (update statement)

SQL statement:

UPDATE tips SET tip = tip*2 WHERE tip < 2;


Python statement:

tips.loc[tips['tip'] < 2, 'tip'] *= 2
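The effect of this in-place update can be checked on a toy frame (the values below are made up for illustration; only the pattern matters):

```python
import pandas as pd

# stand-in for tips with made-up values
df = pd.DataFrame({'tip': [1.5, 3.0, 0.5]})

# equivalent of UPDATE tips SET tip = tip*2 WHERE tip < 2
df.loc[df['tip'] < 2, 'tip'] *= 2

print(df['tip'].tolist())  # [3.0, 3.0, 1.0]
```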

DELETE (delete statement)

SQL statement:

DELETE FROM tips WHERE tip > 9;

Python statement:

tips = tips.loc[tips['tip'] <= 9]

WHERE (condition)

SQL statement:

SELECT * FROM tips WHERE time = 'Dinner' LIMIT 5;

Python statement:

tips[tips['time'] == 'Dinner'].head(5)


AND & OR

SQL statement:

SELECT * FROM tips WHERE time = 'Dinner' AND tip >5.00;

Python statement:

# in pandas, "&" means AND and "|" means OR
tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]

SQL statement:

SELECT * FROM tips WHERE size >= 5 OR total_bill > 45;

Python statement:

# select rows where size >= 5 or total_bill > 45
tips[(tips['size'] >=5 ) | (tips['total_bill'] > 45)]
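One pitfall worth noting: `&` and `|` bind tighter than comparison operators in Python, so the parentheses around each condition are required. A minimal sketch (made-up values):

```python
import pandas as pd

df = pd.DataFrame({'size': [2, 5, 6], 'total_bill': [10.0, 50.0, 20.0]})

# parentheses are mandatory: without them, `5 | df['total_bill']`
# would be evaluated first and raise or give the wrong answer
out = df[(df['size'] >= 5) | (df['total_bill'] > 45)]
print(out.index.tolist())  # [1, 2]
```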


GROUP BY (grouping and aggregation)

pandas has a similarly named groupby() method for executing SQL's GROUP BY operation. groupby() generally means splitting the data set into groups, applying some function (typically an aggregation) to each group, and then combining the results.
A common SQL operation is getting the record count for each group across the entire data set. For example, here is a query that gets the number of tips by sex:

SQL statement:

SELECT sex, count(*) FROM tips GROUP BY sex;
/*
Female     87
Male      157
*/

Python statement:

# SQL's COUNT is not the same as pandas' count(); size() achieves our goal here
tips.groupby('sex').size()

Python statement:

tips.groupby('sex').count()



Python statement:

# count a single column
tips.groupby('sex')['total_bill'].count()

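The difference between size() and count() shows up as soon as a column contains missing values: size() counts all rows in the group, while count() skips NaN. A small demonstration (made-up data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'sex': ['Female', 'Female', 'Male'],
                   'tip': [1.0, np.nan, 2.0]})

print(df.groupby('sex').size().tolist())          # [2, 1] -- NaN row counted
print(df.groupby('sex')['tip'].count().tolist())  # [1, 1] -- NaN row skipped
```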


SQL statement:

SELECT day, AVG(tip), COUNT(*) FROM tips GROUP BY day;
/*
Fri   2.734737   19
Sat   2.993103   87
Sun   3.255132   76
Thur  2.771452   62
*/

Multiple functions can be applied at once. For example, suppose we want to see how tips differ across the days of the week; agg() lets you pass a dictionary to your grouped DataFrame, indicating which functions to apply to which columns.

Python statement:

tips.groupby('day').agg({'tip': ['mean', 'size']})
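Note that aggregating the grouping column itself through the dictionary (as older tutorials do with `{'day': np.size}`) no longer works in recent pandas. Named aggregation (pandas 0.25+) is a cleaner way to get the same AVG/COUNT pair, sketched here on a toy frame with made-up values:

```python
import pandas as pd

# toy stand-in for tips (made-up values)
df = pd.DataFrame({'day': ['Fri', 'Fri', 'Sat'],
                   'tip': [2.0, 4.0, 3.0]})

# named aggregation: output column names are explicit
out = df.groupby('day').agg(avg_tip=('tip', 'mean'),
                            records=('tip', 'size'))
print(out)
```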


By grouping multiple columns

SQL statement:

SELECT smoker, day, COUNT(*), AVG(tip) FROM tips GROUP BY smoker, day;
/*
smoker day
No     Fri      4  2.812500
       Sat     45  3.102889
       Sun     57  3.167895
       Thur    45  2.673778
Yes    Fri     15  2.714000
       Sat     42  2.875476
       Sun     19  3.516842
       Thur    17  3.030000
*/

Python statement:

tips.groupby(['smoker', 'day']).agg({'tip': ['size', 'mean']})



Checking for missing values: notnull() and isnull()

Set up a new test data set:

df = pd.DataFrame({'col2': ['A', 'B', np.nan, 'C', 'D'],
                   'col1': ['F', np.nan, 'G', 'H', 'I']})

SQL statement:

SELECT * FROM df WHERE col2 IS NULL;

Python statement:

# select the rows (observations) where col2 is null
df[df['col2'].isnull()]

SQL statement:

SELECT * FROM df WHERE col1 IS NOT NULL;

Python statement:

# select the rows (observations) where col1 is not null
df[df['col1'].notnull()]
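Both checks can be verified end to end; the sketch below re-creates the same df and prints the positional row indexes each mask selects:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'col2': ['A', 'B', np.nan, 'C', 'D'],
                   'col1': ['F', np.nan, 'G', 'H', 'I']})

# WHERE col2 IS NULL
print(df[df['col2'].isnull()].index.tolist())   # [2]

# WHERE col1 IS NOT NULL
print(df[df['col1'].notnull()].index.tolist())  # [0, 2, 3, 4]
```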



JOIN

You can use join() or merge() to perform a JOIN. By default, join() joins DataFrames on their indexes. Each method has parameters that let you specify the type of join to perform (LEFT, RIGHT, INNER, FULL) and the columns to join on (column names or indexes).

df1 = pd.DataFrame({'key':['A','B','C','D'], 'value':np.random.randn(4)})
df2 = pd.DataFrame({'key':['B','D','D','E'], 'value':np.random.randn(4)})


INNER JOIN

SQL statement:

SELECT * FROM df1 INNER JOIN df2 ON df1.key = df2.key;

Python statement:

pd.merge(df1,df2, on = 'key')


indexed_df2 = df2.set_index('key')
pd.merge(df1, indexed_df2, left_on='key',right_index=True)
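Both forms should return the same matched rows; a quick sanity check (the `value` columns are random, so only the keys are compared):

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'], 'value': np.random.randn(4)})

inner = pd.merge(df1, df2, on='key')  # how='inner' is the default
via_index = pd.merge(df1, df2.set_index('key'),
                     left_on='key', right_index=True)

# 'B' matches once, 'D' matches the two 'D' rows on the right
print(inner['key'].tolist())      # ['B', 'D', 'D']
print(via_index['key'].tolist())  # ['B', 'D', 'D']
```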


LEFT OUTER JOIN

SQL statement:

-- show all records from df1
SELECT * FROM df1 LEFT OUTER JOIN df2 ON df1.key = df2.key;

Python statement:

pd.merge(df1, df2, on = 'key', how='left')


RIGHT OUTER JOIN

SQL statement:

-- show all records from df2
SELECT * FROM df1 RIGHT OUTER JOIN df2 ON df1.key = df2.key;

Python statement:

pd.merge(df1, df2, on = 'key', how='right')


FULL JOIN

SQL statement:

-- show all records from both tables
SELECT * FROM df1 FULL OUTER JOIN df2 ON df1.key = df2.key;

Python statement:

pd.merge(df1, df2 , on = 'key', how = 'outer')



UNION

New data sets:

df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
                    'rank': range(1, 4)})
df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
                    'rank': [1, 4, 5]})

SQL statement:

SELECT city, rank FROM df1 
UNION ALL 
SELECT city, rank FROM df2;
/*
         city  rank
      Chicago     1
San Francisco     2
New York City     3
      Chicago     1
       Boston     4
  Los Angeles     5
*/

Python statement:

pd.concat([df1,df2])

SQL's UNION is similar to UNION ALL, but UNION removes duplicate rows.

SELECT city, rank FROM df1
UNION
SELECT city, rank FROM df2;
-- notice that there is only one Chicago record this time
/*
         city  rank
      Chicago     1
San Francisco     2
New York City     3
       Boston     4
  Los Angeles     5
*/

In pandas, you can use a combination of concat() and drop_duplicates().

pd.concat([df1, df2]).drop_duplicates()
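Putting the two together: with the data sets above, UNION ALL vs UNION comes down to row counts of 6 and 5, since the duplicate Chicago/1 row is dropped once:

```python
import pandas as pd

df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
                    'rank': range(1, 4)})
df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
                    'rank': [1, 4, 5]})

print(len(pd.concat([df1, df2])))                    # 6 -- UNION ALL
print(len(pd.concat([df1, df2]).drop_duplicates()))  # 5 -- UNION
```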


Origin www.cnblogs.com/selfcs/p/11345476.html