[Python Fennel Bean Series] How to select DataFrame rows by column values in pandas

In Python there is often more than one way to accomplish the same goal, which can be fun to explore. It reminds me of Kong Yiji, the character in Lu Xun's story who took pride in knowing the four ways to write the character for "fennel" in "fennel beans". I would not dare compare myself to Kong Yiji, but here are a few Python "fennel beans" for fellow coders.

SELECT *
FROM table_name
WHERE column_name = value

The SQL statement above selects the records in a database table whose field matches a given value. So how do you select rows by column value in a pandas DataFrame?

Source of test data for this article: https://raw.github.com/pandas-dev/pandas/master/pandas/tests/io/data/csv/tips.csv

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.read_csv('tips.csv')
>>> df.head()
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

Fennel Bean 1: [] Index

The most intuitive way is to filter rows with a boolean condition inside []. Use == for equal, != for not equal, and >, <, >= and <= for the other comparisons. To combine several conditions, use & for AND, | for OR, and ~ for NOT, and wrap each condition in parentheses. To test whether a value belongs to a set of values, use isin. [] also accepts a plain boolean list with one entry per row, which is what the first example below does:

>>> mask = [False] * len(df)  # one flag per row (tips.csv has 244 rows)
>>> mask[1] = True
>>> mask[3] = True
>>> df[mask]
   total_bill   tip   sex smoker  day    time  size
1       10.34  1.66  Male     No  Sun  Dinner     3
3       23.68  3.31  Male     No  Sun  Dinner     2

>>> # Select the rows where sex is Male
>>> df[df['sex'] == 'Male'].head()
   total_bill   tip   sex smoker  day    time  size
1       10.34  1.66  Male     No  Sun  Dinner     3
2       21.01  3.50  Male     No  Sun  Dinner     3
3       23.68  3.31  Male     No  Sun  Dinner     2
5       25.29  4.71  Male     No  Sun  Dinner     4
6        8.77  2.00  Male     No  Sun  Dinner     2

>>> # Select the rows where the tip is over 2 or the sex is Female
>>> df[(df['tip']>2) | (df['sex']=='Female')].head()
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
5       25.29  4.71    Male     No  Sun  Dinner     4
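
The paragraph above mentions ~ and isin, but the [] examples do not use them. The following is a minimal sketch of both, assuming a small hand-typed frame (the name tips and its values are illustrative stand-ins, not the real tips.csv):

```python
import pandas as pd

# Small hand-made frame standing in for tips.csv (illustrative values only).
tips = pd.DataFrame({
    "tip": [1.01, 3.50, 5.85, 6.70, 2.00],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "day": ["Sun", "Sun", "Thur", "Thur", "Sat"],
})

# isin tests membership in a set of values, much like SQL's IN (...).
weekend_rows = tips[tips["day"].isin(["Sat", "Sun"])]

# ~ negates a mask. The comparison is wrapped in parentheses because
# & binds more tightly than > in Python.
weekday_big_tips = tips[~tips["day"].isin(["Sat", "Sun"]) & (tips["tip"] > 5)]

print(weekend_rows)
print(weekday_big_tips)
```

Leaving out the parentheses around tips["tip"] > 5 would make Python evaluate the & first, which usually ends in an error rather than the intended filter.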

Fennel Bean 2: Label Index

The label indexer .loc also accepts a boolean array (or a boolean Series) as its row selector, so it can filter rows in the same way.

>>> # Select the rows where sex is Male
>>> df.loc[df['sex'] == 'Male'].head()
   total_bill   tip   sex smoker  day    time  size
1       10.34  1.66  Male     No  Sun  Dinner     3
2       21.01  3.50  Male     No  Sun  Dinner     3
3       23.68  3.31  Male     No  Sun  Dinner     2
5       25.29  4.71  Male     No  Sun  Dinner     4
6        8.77  2.00  Male     No  Sun  Dinner     2

>>> # Select the rows where the tip is over 2 or the sex is Female
>>> df.loc[(df['tip']>2) | (df['sex']=='Female')].head()
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
5       25.29  4.71    Male     No  Sun  Dinner     4

>>> # Select the rows that are not on a weekend and where the tip is over 5
>>> df.loc[~df['day'].isin(['Sun', 'Sat']) & (df['tip']>5)]
     total_bill   tip     sex smoker   day   time  size
85        34.83  5.17  Female     No  Thur  Lunch     4
88        24.71  5.85    Male     No  Thur  Lunch     2
141       34.30  6.70    Male     No  Thur  Lunch     6
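
One thing .loc offers beyond plain [] is that the same call can take a column selector after the row mask. The sketch below assumes a few hand-typed rows in a frame named tips, and the column name generous is made up for illustration:

```python
import pandas as pd

# A few hand-typed rows standing in for tips.csv.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 34.83, 24.71],
    "tip": [1.01, 1.66, 5.17, 5.85],
    "day": ["Sun", "Sun", "Thur", "Thur"],
})

mask = tips["tip"] > 5

# Row filter and column selection in a single .loc call.
subset = tips.loc[mask, ["total_bill", "tip"]]

# The same (rows, column) form can be used to assign to the filtered rows.
tips["generous"] = False          # hypothetical helper column
tips.loc[mask, "generous"] = True

print(subset)
print(tips)
```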

Fennel Bean 3: Position Index

The position indexer .iloc accepts a boolean array as well, but not a boolean Series, so the mask is converted to a list first.

>>> mask = list(df['sex'] == 'Male')
>>> df.iloc[mask].head()
   total_bill   tip   sex smoker  day    time  size
1       10.34  1.66  Male     No  Sun  Dinner     3
2       21.01  3.50  Male     No  Sun  Dinner     3
3       23.68  3.31  Male     No  Sun  Dinner     2
5       25.29  4.71  Male     No  Sun  Dinner     4
6        8.77  2.00  Male     No  Sun  Dinner     2
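
The list(...) call above is there for a reason. As far as I know, .iloc rejects a boolean Series because the Series carries row labels, while .iloc works purely by position. A minimal sketch, assuming a three-row hand-typed frame; the exact exception type may vary between pandas versions, so both likely candidates are caught:

```python
import pandas as pd

tips = pd.DataFrame({
    "tip": [1.01, 1.66, 3.50],
    "sex": ["Female", "Male", "Male"],
})

mask = tips["sex"] == "Male"   # boolean Series that still carries the row labels

# Passing the Series straight to .iloc is expected to fail.
try:
    tips.iloc[mask]
    series_mask_accepted = True
except (NotImplementedError, ValueError):
    series_mask_accepted = False

# Converting to a NumPy array drops the labels, which .iloc accepts.
males = tips.iloc[mask.to_numpy()]

print(series_mask_accepted)
print(males)
```

mask.to_numpy() and list(mask) give the same rows here; the array form avoids building a Python list for large frames.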

Fennel Bean 4: Calling a Function

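All three indexers above also accept a function (a callable). pandas calls it with the DataFrame and then indexes with whatever it returns, so the return value must be valid input for that indexer.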

>>> df[lambda df: df['tip']>5].head()
    total_bill   tip     sex smoker  day    time  size
23       39.42  7.58    Male     No  Sat  Dinner     4
44       30.40  5.60    Male     No  Sun  Dinner     4
47       32.40  6.00    Male     No  Sun  Dinner     4
52       34.81  5.20  Female     No  Sun  Dinner     4
59       48.27  6.73    Male     No  Sat  Dinner     4
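
A callable is most useful in a method chain, where the intermediate DataFrame has no name to refer to. The sketch below assumes three hand-typed rows, and the tip_pct column is a made-up helper for illustration:

```python
import pandas as pd

tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 24.71],
    "tip": [1.01, 1.66, 5.85],
})

generous = (
    tips
    # Add a derived column; the lambda receives the frame being built.
    .assign(tip_pct=lambda d: d["tip"] / d["total_bill"])
    # Filter on the new column; the lambda receives the result of assign().
    .loc[lambda d: d["tip_pct"] > 0.15]
)

print(generous)
```

Without the callable, the chain would have to be split so that the result of assign could be stored in a variable before filtering.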

Fennel Bean 5: query

Those who are familiar with SQL will probably like query, which takes the filter condition as a string.

>>> # Select the rows where the tip is over 2 or the sex is Female
>>> df.query('tip>2 | sex=="Female"').head()
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
5       25.29  4.71    Male     No  Sun  Dinner     4

>>> # Select the rows that are not on a weekend and where the tip is over 5
>>> # @ references a variable in the current environment
>>> weekend = ['Sun', 'Sat']
>>> df.query('day not in @weekend & tip>5')
     total_bill   tip     sex smoker   day   time  size
85        34.83  5.17  Female     No  Thur  Lunch     4
88        24.71  5.85    Male     No  Thur  Lunch     2
141       34.30  6.70    Male     No  Thur  Lunch     6
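
To make the link between query and the boolean-mask style explicit, here is a minimal sketch that runs the same filter both ways and compares the results. The frame and the variable names weekend and min_tip are assumptions for illustration:

```python
import pandas as pd

tips = pd.DataFrame({
    "tip": [1.01, 5.17, 5.85, 7.58],
    "sex": ["Female", "Female", "Male", "Male"],
    "day": ["Sun", "Thur", "Thur", "Sat"],
})

weekend = ["Sat", "Sun"]
min_tip = 5

# Inside the query string, column names are bare and @name reads a Python variable.
# The words and / or / not can be used in place of & / | / ~.
by_query = tips.query("day not in @weekend and tip > @min_tip")

# The equivalent boolean-mask form.
by_mask = tips[~tips["day"].isin(weekend) & (tips["tip"] > min_tip)]

print(by_query)
print(by_query.equals(by_mask))
```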

Fennel Bean 6: where

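where replaces the rows that fail the condition with NaN, and a following dropna then removes them. Note that integer columns such as size are converted to float along the way (4 becomes 4.0 below), and that dropna would also remove any row that already contained a NaN.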

>>> df.where(df.tip>5).dropna().head()
    total_bill   tip     sex smoker  day    time  size
23       39.42  7.58    Male     No  Sat  Dinner   4.0
44       30.40  5.60    Male     No  Sun  Dinner   4.0
47       32.40  6.00    Male     No  Sun  Dinner   4.0
52       34.81  5.20  Female     No  Sun  Dinner   4.0
59       48.27  6.73    Male     No  Sat  Dinner   4.0
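
The where-plus-dropna approach is not always equivalent to a boolean mask. The sketch below assumes a hand-typed frame with one deliberately missing value, and shows the two differences I am aware of: a changed dtype and a lost row.

```python
import numpy as np
import pandas as pd

tips = pd.DataFrame({
    "tip": [1.01, 5.60, 6.00],
    "sex": ["Female", "Male", np.nan],   # last row has a missing value on purpose
    "size": [2, 4, 4],
})

cond = tips["tip"] > 5

via_where = tips.where(cond).dropna()   # NaN-out failing rows, then drop every row with a NaN
via_mask = tips[cond]                   # plain boolean indexing

print(via_where)   # the row with the missing sex is gone and size is float
print(via_mask)    # both matching rows are kept and size stays integer
```

If the data may contain missing values, or the integer dtypes matter, boolean indexing or query is likely the safer choice.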

There are many more ways to select rows; this article only covers a few of them for a DataFrame with a single-level index, offered as a starting point. For detailed documentation, see: https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html

Origin blog.csdn.net/mouse2018/article/details/114686961