Pandas | 28. Comparison with SQL

Since many potential Pandas users have some knowledge of SQL, this article provides examples of how to perform various SQL operations using Pandas.

File: tips.csv -

total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4

import pandas as pd

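url = 'tips.csv'
# The header row has one field fewer than the data rows,
# so read_csv uses the first field of each row as the index.
tips = pd.read_csv(url)
print(tips.head())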

Output:

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

SELECT

In SQL, selection is done with a comma-separated list of columns (or * to select all columns), for example -

SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;

In Pandas, columns are selected by passing a list of column names to the DataFrame -

tips[['total_bill', 'tip', 'smoker', 'time']].head(5)

Complete program -

import pandas as pd

url = 'tips.csv'
tips = pd.read_csv(url)
rs = tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
print(rs)

Output:

   total_bill   tip smoker    time
0       16.99  1.01     No  Dinner
1       10.34  1.66     No  Dinner
2       21.01  3.50     No  Dinner
3       23.68  3.31     No  Dinner
4       24.59  3.61     No  Dinner

Calling the DataFrame without a list of column names displays all columns (like * in SQL).
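The difference between selecting named columns and selecting everything can be sketched as follows. This is a minimal example that builds the five sample rows inline instead of reading tips.csv, so it runs without the file; the variable names subset and everything are only illustrative.

```python
import pandas as pd

# The five sample rows shown above, built inline (assumption: same values as tips.csv).
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "smoker": ["No"] * 5,
    "day": ["Sun"] * 5,
    "time": ["Dinner"] * 5,
    "size": [2, 3, 3, 2, 4],
})

# SELECT total_bill, tip FROM tips;
subset = tips[["total_bill", "tip"]]

# SELECT * FROM tips;  (no column list, so every column is kept)
everything = tips

print(list(subset.columns))
print(list(everything.columns))
```

Passing a list of names returns a new DataFrame containing only those columns, in the order given.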

WHERE condition

SELECT * FROM tips WHERE time = 'Dinner' LIMIT 5;

DataFrames can be filtered in several ways; the most intuitive is boolean indexing.

tips[tips['time'] == 'Dinner'].head(5)

Complete program -

import pandas as pd

url = 'tips.csv'
tips = pd.read_csv(url)
rs = tips[tips['time'] == 'Dinner'].head(5)
print(rs)

Output:

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

The statement above passes a Series of True/False values to the DataFrame, which returns all rows where the value is True.
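To make the intermediate True/False Series visible, the sketch below builds the mask as a separate step. Every sample row has time equal to 'Dinner', so the example filters on the sex column instead to show rows actually being dropped; the data is built inline and the names mask and females are illustrative.

```python
import pandas as pd

# A subset of the sample columns, built inline (assumption: same values as tips.csv).
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "time": ["Dinner"] * 5,
})

# Step 1: the comparison yields one boolean per row.
mask = tips["sex"] == "Female"
print(mask.tolist())

# Step 2: indexing with the mask keeps only the rows marked True.
# SELECT * FROM tips WHERE sex = 'Female';
females = tips[mask]
print(females)
```

The original row labels (here 0 and 4) are preserved in the filtered result.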

Grouping with GROUP BY

This operation gets the number of records in each group across the whole dataset. For example, a query that counts the records for each sex (that is, grouped by gender) -

SELECT sex, count(*)
FROM tips
GROUP BY sex;

In Pandas, the equivalent statement would be -

tips.groupby('sex').size()

Complete program -

import pandas as pd

url = 'tips.csv'

tips = pd.read_csv(url)
rs = tips.groupby('sex').size()
print(rs)

Output:

sex
Female    2
Male      3
dtype: int64
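A point worth illustrating is why size() is used here rather than count(): size() counts rows per group, while count() counts non-null values per column. The sketch below shows both on the five sample rows, built inline; with no missing values the two results agree, which may not hold on data that contains nulls.

```python
import pandas as pd

# A subset of the sample columns, built inline (assumption: same values as tips.csv).
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
})

# SELECT sex, count(*) FROM tips GROUP BY sex;
rows_per_sex = tips.groupby("sex").size()

# count() tallies non-null values, so it is applied to a single column here.
bills_per_sex = tips.groupby("sex")["total_bill"].count()

print(rows_per_sex)
print(bills_per_sex)
```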

Top N rows

SQL (MySQL database) uses LIMIT to return the first n rows -

SELECT * FROM tips
LIMIT 5;

In Pandas, the equivalent statement would be -

tips.head(5)

Let's take a look at the complete program

import pandas as pd

url = 'tips.csv'
tips = pd.read_csv(url)
rs = tips[['smoker', 'day', 'time']].head(5)
print(rs)

Output:

  smoker  day    time
0     No  Sun  Dinner
1     No  Sun  Dinner
2     No  Sun  Dinner
3     No  Sun  Dinner
4     No  Sun  Dinner
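The argument to head() plays the role of the number after LIMIT. As a small sketch on inline sample data (the names first_three and default_rows are illustrative), a different row count and the default can be compared like this:

```python
import pandas as pd

# A subset of the sample columns, built inline (assumption: same values as tips.csv).
tips = pd.DataFrame({
    "smoker": ["No"] * 5,
    "day": ["Sun"] * 5,
    "time": ["Dinner"] * 5,
})

# SELECT * FROM tips LIMIT 3;
first_three = tips.head(3)

# With no argument, head() returns the first 5 rows.
default_rows = tips.head()

print(first_three)
print(len(default_rows))
```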

These are a few of the basic operations for comparison, all covered in the previous chapters on the Pandas library.