Since many potential Pandas users have some knowledge of SQL, this article provides examples of how to perform various SQL operations with Pandas.
File: tips.csv -
total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
import pandas as pd

url = 'tips.csv'
tips = pd.read_csv(url)
print(tips.head())
Output:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Select (SELECT)
In SQL, selection is done with a comma-separated list of the columns you want (or * to select all columns), for example -
SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;
In Pandas, columns are selected by passing a list of column names to the DataFrame -
tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
Complete program -
import pandas as pd

url = 'tips.csv'
tips = pd.read_csv(url)
rs = tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
print(rs)
Output:
total_bill tip smoker time
0 16.99 1.01 No Dinner
1 10.34 1.66 No Dinner
2 21.01 3.50 No Dinner
3 23.68 3.31 No Dinner
4 24.59 3.61 No Dinner
Calling the DataFrame without a list of column names displays all columns (similar to * in SQL).
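The difference between selecting a column list and selecting everything can be sketched as follows. This is a minimal illustration, assuming the five sample rows shown earlier are inlined as text (via io.StringIO) instead of being read from tips.csv; the names CSV_TEXT, subset and everything are hypothetical.

```python
import io
import pandas as pd

# Assumption: the five sample rows are inlined so the snippet runs
# without tips.csv on disk. The leading row index is omitted here.
CSV_TEXT = """total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.50,Male,No,Sun,Dinner,3
23.68,3.31,Male,No,Sun,Dinner,2
24.59,3.61,Female,No,Sun,Dinner,4
"""
tips = pd.read_csv(io.StringIO(CSV_TEXT))

# SELECT total_bill, tip FROM tips;
subset = tips[['total_bill', 'tip']]

# SELECT * FROM tips;  (no column list, so every column is kept)
everything = tips

print(list(subset.columns))
print(list(everything.columns))
```

Note the double brackets: the outer pair indexes the DataFrame, the inner pair is the Python list of column names.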
WHERE condition
SELECT * FROM tips WHERE time = 'Dinner' LIMIT 5;
A DataFrame can be filtered in several ways; the most intuitive is boolean indexing.
tips[tips['time'] == 'Dinner'].head(5)
Complete program -
import pandas as pd

url = 'tips.csv'
tips = pd.read_csv(url)
rs = tips[tips['time'] == 'Dinner'].head(5)
print(rs)
Output:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
The statement above passes a Series of True/False objects to the DataFrame, which returns all rows where the value is True.
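To make the two steps visible, the boolean Series can be built first and applied afterwards. The sketch below assumes the same five sample rows, inlined as text; because every sample row has time equal to 'Dinner', a second mask on the sex column is added purely for illustration so that some rows are actually filtered out. The variable names are hypothetical.

```python
import io
import pandas as pd

# Assumption: sample rows inlined so the snippet is self-contained.
CSV_TEXT = """total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.50,Male,No,Sun,Dinner,3
23.68,3.31,Male,No,Sun,Dinner,2
24.59,3.61,Female,No,Sun,Dinner,4
"""
tips = pd.read_csv(io.StringIO(CSV_TEXT))

# Step 1: the comparison produces a boolean Series, one value per row.
is_dinner = tips['time'] == 'Dinner'
print(is_dinner)

# Step 2: indexing with that Series keeps only the rows marked True.
dinner_rows = tips[is_dinner]

# Illustrative second mask: WHERE sex = 'Female'
is_female = tips['sex'] == 'Female'
female_rows = tips[is_female]
print(female_rows)
```

With this sample, the dinner mask is True for every row, while the sex mask keeps only the rows at index 0 and 4.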
Grouping with GroupBy
This operation fetches the number of records in each group across the entire dataset. For example, a query that counts the records for each sex (that is, grouped by gender) -
SELECT sex, count(*)
FROM tips
GROUP BY sex;
The equivalent statement in Pandas would be -
tips.groupby('sex').size()
Complete program -
import pandas as pd

url = 'tips.csv'
tips = pd.read_csv(url)
rs = tips.groupby('sex').size()
print(rs)
Output:
sex
Female 2
Male 3
dtype: int64
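The result of size() is a Series indexed by the group key rather than a two-column table like the SQL result. One possible way to get a table-shaped result is to call reset_index on that Series; the sketch below assumes the five sample rows inlined as text, and the column name 'count' is a choice made here, not something the original specifies.

```python
import io
import pandas as pd

# Assumption: sample rows inlined so the snippet is self-contained.
CSV_TEXT = """total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.50,Male,No,Sun,Dinner,3
23.68,3.31,Male,No,Sun,Dinner,2
24.59,3.61,Female,No,Sun,Dinner,4
"""
tips = pd.read_csv(io.StringIO(CSV_TEXT))

# SELECT sex, count(*) FROM tips GROUP BY sex;
counts = tips.groupby('sex').size()   # Series indexed by 'sex'

# Turn the group key back into an ordinary column, naming the counts.
table = counts.reset_index(name='count')
print(table)
```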
Top N rows
SQL (MySQL database) uses LIMIT to return the first n rows -
SELECT * FROM tips
LIMIT 5 ;
The equivalent statement in Pandas would be -
tips.head(5)
Let's take a look at the complete program
import pandas as pd

url = 'tips.csv'
tips = pd.read_csv(url)
rs = tips[['smoker', 'day', 'time']].head(5)
print(rs)
Output:
smoker day time
0 No Sun Dinner
1 No Sun Dinner
2 No Sun Dinner
3 No Sun Dinner
4 No Sun Dinner
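The argument to head() plays the role of the number after LIMIT, and when it is omitted pandas uses 5. A small sketch, again assuming the five sample rows inlined as text and hypothetical variable names:

```python
import io
import pandas as pd

# Assumption: sample rows inlined so the snippet is self-contained.
CSV_TEXT = """total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.50,Male,No,Sun,Dinner,3
23.68,3.31,Male,No,Sun,Dinner,2
24.59,3.61,Female,No,Sun,Dinner,4
"""
tips = pd.read_csv(io.StringIO(CSV_TEXT))

# SELECT * FROM tips LIMIT 3;
first_three = tips.head(3)

# head() with no argument returns the first 5 rows.
default_head = tips.head()

print(first_three)
print(len(default_head))
```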
These are a few basic operations for comparison, which were covered in the earlier chapters on the Pandas library.