Accomplishing the same goal in several different ways is one of the small pleasures of programming in Python. It reminds me of Kong Yiji in Lu Xun's story, who made a study of the four ways to write the "hui" in "fennel beans". I would not dare compare myself to Kong Yiji, but here are a few Python fennel beans for fellow coders.
SELECT *
FROM table_name
WHERE column_name = value
The SQL statement above selects the records whose field matches a given value. So how do we select rows from a DataFrame based on column values?
Source of test data for this article: https://raw.github.com/pandas-dev/pandas/master/pandas/tests/io/data/csv/tips.csv
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.read_csv('tips.csv')
>>> df.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Fennel Bean 1: [] Indexing
The most intuitive way is to put a boolean condition inside []. Use == for equality, != for inequality, and >, <, >= and <= for comparisons. To combine conditions, use & for AND, | for OR and ~ for NOT, and wrap each condition in parentheses, because these operators bind more tightly than the comparisons. To test whether a value belongs to a set of values, use isin. Examples:
>>> # Build a boolean list by hand: True marks the rows to keep
>>> mask = [False] * len(df)
>>> mask[1] = True
>>> mask[3] = True
>>> df[mask]
total_bill tip sex smoker day time size
1 10.34 1.66 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
>>> # Select the rows where sex is Male
>>> df[df['sex'] == 'Male'].head()
total_bill tip sex smoker day time size
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
5 25.29 4.71 Male No Sun Dinner 4
6 8.77 2.00 Male No Sun Dinner 2
>>> # Select the rows where the tip is over 2 or sex is Female
>>> df[(df['tip']>2) | (df['sex']=='Female')].head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
5 25.29 4.71 Male No Sun Dinner 4
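The paragraph at the top of this section also mentions ~ and isin, which the [] examples do not exercise. A minimal sketch follows, assuming a small hand-made frame: df_small and its five rows are made up for illustration and are not taken from tips.csv.

```python
import pandas as pd

# Hypothetical miniature of the tips data, made up for illustration.
df_small = pd.DataFrame({
    'tip': [1.01, 3.50, 5.17, 6.70, 2.00],
    'sex': ['Female', 'Male', 'Female', 'Male', 'Male'],
    'day': ['Sun', 'Sat', 'Thur', 'Thur', 'Fri'],
})

# isin tests membership in a set of values; ~ negates the resulting mask.
weekend = df_small['day'].isin(['Sat', 'Sun'])
weekday_rows = df_small[~weekend]

# The comparison needs its own parentheses, because & binds more
# tightly than > does.
big_weekday_tips = df_small[~weekend & (df_small['tip'] > 5)]

print(weekday_rows)
print(big_weekday_tips)
```

Leaving out the parentheses around the comparison usually ends in a confusing error rather than a wrong result, so it is a cheap habit to adopt.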
Fennel Bean 2: Label Indexing
The label-based indexer loc also accepts a boolean array, or a boolean Series that shares the DataFrame's row labels, so it can filter rows too.
>>> # Select the rows where sex is Male
>>> df.loc[df['sex'] == 'Male'].head()
total_bill tip sex smoker day time size
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
5 25.29 4.71 Male No Sun Dinner 4
6 8.77 2.00 Male No Sun Dinner 2
>>> # Select the rows where the tip is over 2 or sex is Female
>>> df.loc[(df['tip']>2) | (df['sex']=='Female')].head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
5 25.29 4.71 Male No Sun Dinner 4
>>> # Select the rows that are not on a weekend and where the tip is over 5
>>> df.loc[~df['day'].isin(['Sun', 'Sat']) & (df['tip']>5)]
total_bill tip sex smoker day time size
85 34.83 5.17 Female No Thur Lunch 4
88 24.71 5.85 Male No Thur Lunch 2
141 34.30 6.70 Male No Thur Lunch 6
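One practical difference from plain [] is that loc can take a column selection as a second argument, so rows and columns are filtered in one step. A rough sketch, again on a made-up frame (df_small and its values are illustrative assumptions, not rows from tips.csv):

```python
import pandas as pd

# Hypothetical data for illustration only.
df_small = pd.DataFrame({
    'total_bill': [16.99, 21.01, 34.83, 34.30],
    'tip': [1.01, 3.50, 5.17, 6.70],
    'sex': ['Female', 'Male', 'Female', 'Male'],
})

# First argument: boolean mask choosing the rows.
# Second argument: list of column labels to keep.
male_tips = df_small.loc[df_small['sex'] == 'Male', ['total_bill', 'tip']]

print(male_tips)
```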
Fennel Bean 3: Positional Indexing
The position-based indexer iloc accepts a boolean array as well, but not a boolean Series: passing df['sex'] == 'Male' directly raises an error. That is why the mask is converted to a plain list first.
>>> mask = list(df['sex'] == 'Male')
>>> df.iloc[mask].head()
total_bill tip sex smoker day time size
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
5 25.29 4.71 Male No Sun Dinner 4
6 8.77 2.00 Male No Sun Dinner 2
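The reason for the list(...) conversion can be checked directly. The sketch below assumes a tiny made-up frame (df_small is hypothetical); the exact exception type raised for a boolean Series may vary between pandas versions, so both likely candidates are caught.

```python
import pandas as pd

# Hypothetical data for illustration only.
df_small = pd.DataFrame({
    'tip': [1.01, 3.50, 5.17],
    'sex': ['Female', 'Male', 'Male'],
})

mask = df_small['sex'] == 'Male'  # a boolean Series, not a plain array

try:
    df_small.iloc[mask]
    series_accepted = True
except (NotImplementedError, ValueError):
    # iloc works by position, so it refuses a mask that carries labels.
    series_accepted = False

# Stripping the labels with to_numpy() (or list(mask)) makes it work.
males = df_small.iloc[mask.to_numpy()]

print(series_accepted)
print(males)
```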
Fennel Bean 4: Calling a Function
All three indexers above ([], loc and iloc) also accept a callable: a function that receives the DataFrame and returns something the indexer accepts, such as a boolean mask.
>>> df[lambda df: df['tip']>5].head()
total_bill tip sex smoker day time size
23 39.42 7.58 Male No Sat Dinner 4
44 30.40 5.60 Male No Sun Dinner 4
47 32.40 6.00 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
59 48.27 6.73 Male No Sat Dinner 4
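A callable is most useful in a method chain, where the intermediate DataFrame has no name to build a mask from. The following is a sketch under assumptions: df_small, the tip_pct column and the 0.15 threshold are all made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only.
df_small = pd.DataFrame({
    'total_bill': [16.99, 21.01, 34.83, 10.00],
    'tip': [1.01, 3.50, 5.17, 2.50],
})

generous = (
    df_small
    # Add a derived column; d is the frame at this point in the chain.
    .assign(tip_pct=lambda d: d['tip'] / d['total_bill'])
    # The lambda sees the new column, which df_small itself does not have.
    .loc[lambda d: d['tip_pct'] > 0.15]
)

print(generous)
```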
Fennel Bean 5: query
Anyone familiar with SQL will probably like this one: query takes the condition as a string expression.
>>> # Select the rows where the tip is over 2 or sex is Female
>>> df.query('tip>2 | sex=="Female"').head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
5 25.29 4.71 Male No Sun Dinner 4
>>> # Select the rows that are not on a weekend and where the tip is over 5
>>> # @ refers to a variable in the current environment
>>> weekend = ['Sun', 'Sat']
>>> df.query('day not in @weekend & tip>5')
total_bill tip sex smoker day time size
85 34.83 5.17 Female No Thur Lunch 4
88 24.71 5.85 Male No Thur Lunch 2
141 34.30 6.70 Male No Thur Lunch 6
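To make the correspondence with the earlier beans concrete, the sketch below runs the same filter once through query and once through a boolean mask and compares the results. The frame df_small and the names weekend and min_tip are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only.
df_small = pd.DataFrame({
    'tip': [1.01, 3.50, 5.17, 6.70, 5.60],
    'day': ['Sun', 'Sat', 'Thur', 'Thur', 'Sun'],
})

weekend = ['Sat', 'Sun']
min_tip = 5

# Inside the query string, @name looks up a Python variable.
by_query = df_small.query('day not in @weekend and tip > @min_tip')

# The same condition written as a boolean mask.
by_mask = df_small[~df_small['day'].isin(weekend) & (df_small['tip'] > min_tip)]

print(by_query.equals(by_mask))
```

Which form reads better is largely a matter of taste; the string form tends to win when the condition is long, the mask form when it has to be built up programmatically.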
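Fennel Bean 6: where
where keeps the shape of the DataFrame and replaces every row that fails the condition with NaN; a following dropna then removes those rows. Note that the integer column size comes back as float (4.0 instead of 4), because NaN cannot be stored in an integer column.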
>>> df.where(df.tip>5).dropna().head()
total_bill tip sex smoker day time size
23 39.42 7.58 Male No Sat Dinner 4.0
44 30.40 5.60 Male No Sun Dinner 4.0
47 32.40 6.00 Male No Sun Dinner 4.0
52 34.81 5.20 Female No Sun Dinner 4.0
59 48.27 6.73 Male No Sat Dinner 4.0
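This bean has two side effects worth seeing side by side with a plain boolean mask: the dtype change, and the fact that dropna also discards rows that already contained a missing value before where was applied. A small sketch, assuming a made-up frame (df_small and its note column are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data; row 1 passes the filter but has a missing note.
df_small = pd.DataFrame({
    'tip': [1.01, 5.60, 6.70],
    'note': ['a', np.nan, 'c'],
    'size': [2, 4, 6],
})

cond = df_small['tip'] > 5

via_where = df_small.where(cond).dropna()  # NaN out, then drop
via_mask = df_small[cond]                  # plain boolean filter

print(via_where)
print(via_mask)
print(via_where['size'].dtype, via_mask['size'].dtype)
```

On this data the mask keeps rows 1 and 2, while where followed by dropna keeps only row 2, so the two approaches are interchangeable only when the data has no missing values of its own.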
There are many more ways to select rows. The ones shown here, all for a DataFrame with a single-level index, are only meant as a starting point. For detailed documentation, see: https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html