pandas: Conditionally extract rows of pandas.DataFrame with query()
Use the query() method to extract rows of a pandas.DataFrame based on conditions on its column values. It is convenient because conditions can be written succinctly as strings using comparison operators and string methods, and multiple conditions can be combined.
Table of contents
- Use comparison operators to specify conditions
- Use the in operator for conditional specification (equivalent to isin())
- Specify conditions with string methods
- When there are missing values NaN or None
- Specify conditions on the index
- Specify multiple conditions
- Enclose column names containing spaces or dots with "`"
- Update the original object with the inplace parameter
For specifying conditions with Boolean indexing, refer to the following article.
The sample code in this article uses pandas 2.0.3. Note that behavior may vary by version.
import pandas as pd
print(pd.__version__)
# 2.0.3
df = pd.read_csv('data/sample_pandas_normal.csv')
print(df)
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
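If the CSV file is not at hand, the same table can be built inline. This is a minimal sketch that assumes the values printed above are the complete contents of the file; the dictionary literal is a stand-in for data/sample_pandas_normal.csv, not part of the original sample.

```python
import pandas as pd

# Inline copy of the sample data shown above (assumed to match the CSV file).
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})

print(df.shape)
# (6, 4)
```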
Use comparison operators to specify conditions
With Boolean indexing, rows can be extracted using comparison operators as follows.
print(df[df['age'] < 25])
# name age state point
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
The same condition can be specified as a string with the query() method.
print(df.query('age < 25'))
# name age state point
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
Prefix a variable name with @ to use that variable in a condition string.
val = 25
print(df.query('age < @val'))
# name age state point
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
Ranges can be specified by chaining two comparison operators, just as in Python's chained comparisons.
print(df.query('30 <= age < 50'))
# name age state point
# 1 Bob 42 CA 92
# 5 Frank 30 NY 57
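For comparison, the chained range condition can be sketched against two Boolean-indexing equivalents. The inline DataFrame is an assumed stand-in for the CSV file, and the inclusive='left' argument of between() assumes pandas 1.3 or later.

```python
import pandas as pd

# Inline stand-in for the sample CSV (assumption).
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})

# Chained comparison inside the query string.
by_query = df.query('30 <= age < 50')

# Boolean indexing needs two masks joined with &.
by_mask = df[(df['age'] >= 30) & (df['age'] < 50)]

# between() with a closed left bound and an open right bound.
by_between = df[df['age'].between(30, 50, inclusive='left')]

print(by_query.index.tolist())
# [1, 5]
```

All three select the same rows; the query() form is the shortest because the column name is written only once.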
You can also compare columns with each other, including after calculations with arithmetic operators.
print(df.query('age < point / 3'))
# name age state point
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
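The column-to-column comparison can be checked against Boolean indexing, as in the following sketch. The inline DataFrame is an assumed stand-in for the CSV file.

```python
import pandas as pd

# Inline stand-in for the sample CSV (assumption).
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})

# Column names are used directly inside the query string.
by_query = df.query('age < point / 3')

# The Boolean-indexing form repeats the DataFrame name for each column.
by_mask = df[df['age'] < df['point'] / 3]

print(by_query.equals(by_mask))
# True
```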
== and != test for equality and inequality. Note that string values inside the condition string must be quoted.
Double quotes " can be used inside a string enclosed in single quotes ', and single quotes ' can be used inside a string enclosed in double quotes ". The same quote character as the enclosing one can also be used if it is escaped with a backslash \.
print(df.query('state == "CA"'))
# name age state point
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
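The three quoting styles described above can be sketched as follows; they should all produce the same result. The inline DataFrame is an assumed stand-in for the CSV file.

```python
import pandas as pd

# Inline stand-in for the sample CSV (assumption).
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})

# Double quotes inside a single-quoted condition string.
q1 = df.query('state == "CA"')

# Single quotes inside a double-quoted condition string.
q2 = df.query("state == 'CA'")

# Same quote character inside and outside, escaped with a backslash.
q3 = df.query('state == \'CA\'')

print(q1.index.tolist())
# [1, 2, 4]
```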
When using variables, there is no need to worry about quotes.
s = 'CA'
print(df.query('state != @s'))
# name age state point
# 0 Alice 24 NY 64
# 3 Dave 68 TX 70
# 5 Frank 30 NY 57
Use the in operator for conditional specification (equivalent to isin())
isin() is a method that returns a pandas.Series of bool (True, False) indicating whether each element of the column (pandas.Series) is contained in the list passed as an argument. It can be used to extract rows where the column value matches any of several specific values.
print(df[df['state'].isin(['NY', 'TX'])])
# name age state point
# 0 Alice 24 NY 64
# 3 Dave 68 TX 70
# 5 Frank 30 NY 57
The equivalent can be written with the in operator in the query() method.
print(df.query('state in ["NY", "TX"]'))
# name age state point
# 0 Alice 24 NY 64
# 3 Dave 68 TX 70
# 5 Frank 30 NY 57
As a special usage, == with a list is handled in the same way as in.
print(df.query('state == ["NY", "TX"]'))
# name age state point
# 0 Alice 24 NY 64
# 3 Dave 68 TX 70
# 5 Frank 30 NY 57
You can also use list variables.
l = ['NY', 'TX']
print(df.query('state in @l'))
# name age state point
# 0 Alice 24 NY 64
# 3 Dave 68 TX 70
# 5 Frank 30 NY 57
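The negated form, which is not covered above, can be sketched with the not in operator that query() also accepts. It corresponds to negating isin() with ~ in Boolean indexing. The inline DataFrame is an assumed stand-in for the CSV file.

```python
import pandas as pd

# Inline stand-in for the sample CSV (assumption).
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})

l = ['NY', 'TX']

# Rows whose state is NOT in the list.
by_query = df.query('state not in @l')

# Boolean-indexing equivalent: negate the isin() mask with ~.
by_mask = df[~df['state'].isin(l)]

print(by_query.index.tolist())
# [1, 2, 4]
```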
Specify conditions with string methods
Exact string matches can be specified with == or in as shown above; for partial matches, use the string methods str.xxx().
The main methods are:
- str.contains(): contains a specific string
- str.endswith(): ends with a specific string
- str.startswith(): starts with a specific string
- str.match(): matches a regular expression pattern
They can also be used in query(), although the result is not much more compact than Boolean indexing.
print(df.query('name.str.endswith("e")'))
# name age state point
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
print(df.query('name.str.contains("li")'))
# name age state point
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
String methods can also be applied to a column whose dtype is not a string by first converting it to str with astype(). This can be written inside query() as well.
print(df.query('age.astype("str").str.endswith("8")'))
# name age state point
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
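The remaining methods from the list, str.startswith() and str.match(), can be sketched in the same way. The inline DataFrame is an assumed stand-in for the CSV file, and engine='python' is passed only as a conservative assumption so that the example does not depend on whether numexpr is installed; the examples above work without it on pandas 2.0.3.

```python
import pandas as pd

# Inline stand-in for the sample CSV (assumption).
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})

# Names that start with "C".
starts_c = df.query('name.str.startswith("C")', engine='python')
print(starts_c.index.tolist())
# [2]

# str.match() anchors the regular expression at the start of the string,
# so this selects names whose first letter is A, B, or C.
matched = df.query('name.str.match("[A-C]")', engine='python')
print(matched.index.tolist())
# [0, 1, 2]
```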
When there are missing values NaN or None
Note that if you use string methods as conditions for columns with missing values NaN or None, you will get an error.
df.at[0, 'name'] = None
print(df)
# name age state point
# 0 None 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
# print(df.query('name.str.endswith("e")'))
# ValueError: unknown type object
Many string methods accept the na parameter, which specifies the value to use in the result for elements that are None or the missing value NaN. Specify na=True to extract rows with missing values, or na=False to exclude them.
print(df[df['name'].str.endswith('e', na=False)])
# name age state point
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
The argument can be specified in the same way inside query().
print(df.query('name.str.endswith("e", na=False)'))
# name age state point
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
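The effect of the two na settings can be compared side by side in the following sketch. The inline DataFrame is an assumed stand-in for the CSV file, and engine='python' in the query() call is a conservative assumption rather than a requirement stated in this article.

```python
import pandas as pd

# Inline stand-in for the sample CSV (assumption).
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})
df.at[0, 'name'] = None  # introduce a missing value

# na=True treats the missing value as a match, so row 0 is kept.
with_na = df[df['name'].str.endswith('e', na=True)]
print(with_na.index.tolist())
# [0, 2, 3]

# na=False treats the missing value as a non-match, so row 0 is dropped.
without_na = df[df['name'].str.endswith('e', na=False)]
print(without_na.index.tolist())
# [2, 3]

# The same keyword argument inside a query string.
with_na_query = df.query('name.str.endswith("e", na=True)', engine='python')
```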
Specify conditions on the index
Conditions on the index (row names) can be specified using index.
df = pd.read_csv('data/sample_pandas_normal.csv')
print(df.query('index % 2 == 0'))
# name age state point
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
If the index has a name, either that name or index can be used.
df_name = df.set_index('name')
print(df_name)
# age state point
# name
# Alice 24 NY 64
# Bob 42 CA 92
# Charlie 18 CA 70
# Dave 68 TX 70
# Ellen 24 CA 88
# Frank 30 NY 57
print(df_name.query('name.str.endswith("e")'))
# age state point
# name
# Alice 24 NY 64
# Charlie 18 CA 70
# Dave 68 TX 70
print(df_name.query('index.str.endswith("e")'))
# age state point
# name
# Alice 24 NY 64
# Charlie 18 CA 70
# Dave 68 TX 70
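As a cross-check, the index conditions can be sketched next to their Boolean-indexing equivalents, along with one query that mixes an index condition with a column condition. The inline DataFrame is an assumed stand-in for the CSV file.

```python
import pandas as pd

# Inline stand-in for the sample CSV (assumption).
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})

# Even-numbered rows via query() and via Boolean indexing on df.index.
by_query = df.query('index % 2 == 0')
by_mask = df[df.index % 2 == 0]
print(by_query.equals(by_mask))
# True

# Index and column conditions can appear in the same condition string.
combined = df.query('index >= 3 and age > 25')
print(combined.index.tolist())
# [3, 5]

# With a named string index, the Boolean-indexing form uses df.index.str.
df_name = df.set_index('name')
ends_e = df_name[df_name.index.str.endswith('e')]
print(ends_e.index.tolist())
# ['Alice', 'Charlie', 'Dave']
```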
Specify multiple conditions
With Boolean indexing, multiple conditions are written as follows.
print(df[(df['age'] < 25) & (df['point'] > 65)])
# name age state point
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
With the query() method, the same conditions can be written as follows. Parentheses around each condition are not required, and AND can be written as either & or and.
print(df.query('age < 25 & point > 65'))
# name age state point
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
print(df.query('age < 25 and point > 65'))
# name age state point
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
For OR, either | or or can be used.
print(df.query('age < 20 | point > 80'))
# name age state point
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
print(df.query('age < 20 or point > 80'))
# name age state point
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
For NOT (negation), use not.
print(df.query('not age < 25 and not point > 65'))
# name age state point
# 5 Frank 30 NY 57
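The negated condition can also be written with a single not around a parenthesized or, which is equivalent by De Morgan's laws. The sketch below compares both forms with the Boolean-indexing version that uses ~. The inline DataFrame is an assumed stand-in for the CSV file.

```python
import pandas as pd

# Inline stand-in for the sample CSV (assumption).
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})

# Two separate negations joined with and.
q1 = df.query('not age < 25 and not point > 65')

# One negation applied to the whole or expression.
q2 = df.query('not (age < 25 or point > 65)')

# Boolean indexing uses ~ for negation and needs parentheses around each mask.
by_mask = df[~(df['age'] < 25) & ~(df['point'] > 65)]

print(q1.index.tolist())
# [5]
```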
The same applies to three or more conditions, but the result depends on the order of evaluation. For example, & takes precedence over |, so it is safer to explicitly enclose the part to be evaluated first in parentheses.
print(df.query('age == 24 | point > 80 & state == "CA"'))
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 4 Ellen 24 CA 88
print(df.query('(age == 24 | point > 80) & state == "CA"'))
# name age state point
# 1 Bob 42 CA 92
# 4 Ellen 24 CA 88
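To make the precedence concrete, the following sketch checks that the unparenthesized condition groups the & part first, and that moving the parentheses changes the result. The inline DataFrame is an assumed stand-in for the CSV file.

```python
import pandas as pd

# Inline stand-in for the sample CSV (assumption).
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})

# No parentheses: the & part is evaluated first.
implicit = df.query('age == 24 | point > 80 & state == "CA"')

# The same grouping written explicitly.
explicit = df.query('age == 24 | (point > 80 & state == "CA")')

# A different grouping gives a different result.
grouped = df.query('(age == 24 | point > 80) & state == "CA"')

print(implicit.index.tolist())
# [0, 1, 4]
print(grouped.index.tolist())
# [1, 4]
```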
Enclose column names containing spaces or dots with "`"
Be careful with column names when using the query() method. For example, change the column names as follows.
df.columns = ['0name', 'age.year', 'state name', 3]
print(df)
# 0name age.year state name 3
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
Using a column name that is not a valid Python variable name results in an error. For example, column names that start with a number or that contain . or spaces all cause errors.
# print(df.query('0name.str.endswith("e")'))
# SyntaxError: invalid syntax
# print(df.query('age.year < 25'))
# UndefinedVariableError: name 'age' is not defined
# print(df.query('state name == "CA"'))
# SyntaxError: invalid syntax
Such column names must be enclosed in "`".
print(df.query('`0name`.str.endswith("e")'))
# 0name age.year state name 3
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
print(df.query('`age.year` < 25'))
# 0name age.year state name 3
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
print(df.query('`state name` == "CA"'))
# 0name age.year state name 3
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
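Backtick-quoted column names can be combined with other conditions and with @ variables in the usual way. The following is a minimal sketch; the inline DataFrame is an assumed stand-in that uses the renamed columns directly (with the last column kept as the string 'point'), and it assumes a pandas version that supports backtick quoting of names containing dots, as 2.0.3 does.

```python
import pandas as pd

# Inline stand-in with the renamed columns (assumption).
df = pd.DataFrame({
    '0name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age.year': [24, 42, 18, 68, 24, 30],
    'state name': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})

s = 'CA'

# Two backtick-quoted columns, one compared with a variable via @.
res = df.query('`state name` == @s and `age.year` < 25')

print(res.index.tolist())
# [2, 4]
```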
For numeric column names, an error occurs even if they are enclosed in "`". Boolean indexing handles such columns without problems.
# print(df.query('3 > 75'))
# KeyError: False
# print(df.query('`3` > 75'))
# UndefinedVariableError: name 'BACKTICK_QUOTED_STRING_3' is not defined
print(df[df[3] > 75])
# 0name age.year state name 3
# 1 Bob 42 CA 92
# 4 Ellen 24 CA 88
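If query() is still preferred for a DataFrame with a numeric column name, one possible workaround is to rename the column to a valid identifier first. This is a sketch of that idea rather than something covered in the article; the inline DataFrame is an assumed stand-in and the new name 'point' is chosen only for illustration.

```python
import pandas as pd

# Inline stand-in with an integer column name 3 (assumption).
df = pd.DataFrame({
    '0name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age.year': [24, 42, 18, 68, 24, 30],
    'state name': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    3: [64, 92, 70, 70, 88, 57],
})

# rename() returns a new DataFrame; the original column names are unchanged.
renamed = df.rename(columns={3: 'point'})
res = renamed.query('point > 75')

print(res.index.tolist())
# [1, 4]
```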
Update the original object with the inplace parameter
In the examples so far, query() returns a new pandas.DataFrame containing the extracted rows, and the original object is left unchanged. Setting inplace=True modifies the original object itself.
df = pd.read_csv('data/sample_pandas_normal.csv')
df.query('age > 25', inplace=True)
print(df)
# name age state point
# 1 Bob 42 CA 92
# 3 Dave 68 TX 70
# 5 Frank 30 NY 57
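Two details of inplace=True are easy to miss: the call returns None, and the surviving rows keep their original index labels. The sketch below shows both, along with reassignment as an alternative. The inline DataFrame is an assumed stand-in for the CSV file.

```python
import pandas as pd

# Inline stand-in for the sample CSV (assumption).
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})
df_other = df.copy()

# With inplace=True the method returns None and df itself is filtered.
result = df.query('age > 25', inplace=True)
print(result)
# None
print(df.index.tolist())
# [1, 3, 5]

# Alternative: reassign the returned DataFrame, then renumber the index.
df_other = df_other.query('age > 25').reset_index(drop=True)
print(df_other.index.tolist())
# [0, 1, 2]
```

Because the inplace call returns None, it cannot be chained with further methods; reassignment is the form to use when chaining is needed.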