62_Pandas conditionally extract rows of pandas.DataFrame

Use the query() method to extract rows of a pandas.DataFrame based on conditions on its column values. It is convenient because conditions can be written succinctly as a string, using comparison operators, string methods, and combinations of multiple conditions.

Table of contents

  • Use comparison operators to specify conditions
  • Use the in operator for conditional specification (equivalent to isin())
  • Specify conditions with string methods
    • When there are missing values NaN or None
  • index condition
  • Specify multiple conditions
  • Enclose column names containing spaces or dots with "`"
  • Update the original object with the inplace parameter

For details on specifying conditions with Boolean indexing, refer to the separate article on that topic.

The sample code in this article uses pandas 2.0.3. Note that behavior may vary by version.

import pandas as pd

print(pd.__version__)
# 2.0.3

df = pd.read_csv('data/sample_pandas_normal.csv')
print(df)
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57

Use comparison operators to specify conditions

In pandas, you can use comparison operators to extract rows like so:

print(df[df['age'] < 25])
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

Similar conditions can be specified via strings using the query() method.

print(df.query('age < 25'))
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

Prefix a variable name with @ to use that variable in a condition string.

val = 25
print(df.query('age < @val'))
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

Ranges can be specified by chaining two comparison operators, just as in ordinary Python comparisons.

print(df.query('30 <= age < 50'))
#     name  age state  point
# 1    Bob   42    CA     92
# 5  Frank   30    NY     57
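The chained form can be checked against the equivalent pair of Boolean-indexing conditions. The following is a minimal sketch; the inline DataFrame is an assumption standing in for data/sample_pandas_normal.csv, and lower and upper are hypothetical variable names.

```python
import pandas as pd

# Inline copy of the sample data (assumption: same contents as the CSV)
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})

lower, upper = 30, 50  # hypothetical bounds

# Chained comparison in query(), with both bounds taken from variables
by_query = df.query('@lower <= age < @upper')

# Equivalent Boolean indexing: two conditions joined with &
by_mask = df[(df['age'] >= lower) & (df['age'] < upper)]

print(by_query.index.tolist())
# [1, 5]
print(by_query.equals(by_mask))
# True
```

The query() form keeps the range readable as a single expression, while the Boolean-indexing form needs parentheses around each condition.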

You can also compare columns with each other, including with the results of calculations using arithmetic operators.

print(df.query('age < point / 3'))
#       name  age state  point
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

== and != test for equality and inequality. Note that string literals inside the condition string must be quoted.

Double quotes " can be used inside a string enclosed in single quotes ', and single quotes ' can be used inside a string enclosed in double quotes ". The same quote character as the enclosing one can also be used if it is escaped with a backslash \.

print(df.query('state == "CA"'))
#       name  age state  point
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88
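The three quoting styles described above can be compared side by side. This is a minimal sketch; the inline DataFrame is an assumption standing in for data/sample_pandas_normal.csv, and the variable names are hypothetical.

```python
import pandas as pd

# Inline copy of the sample data (assumption: same contents as the CSV)
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})

# Double quotes inside a single-quoted condition string
by_double = df.query('state == "CA"')

# Single quotes inside a double-quoted condition string
by_single = df.query("state == 'CA'")

# Same quote character inside and outside, escaped with a backslash
by_escaped = df.query('state == \'CA\'')

print(by_double.index.tolist())
# [1, 2, 4]
print(by_double.equals(by_single) and by_double.equals(by_escaped))
# True
```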

There is no need to worry about quotes when using variables.

s = 'CA'
print(df.query('state != @s'))
#     name  age state  point
# 0  Alice   24    NY     64
# 3   Dave   68    TX     70
# 5  Frank   30    NY     57

Use the in operator for conditional specification (equivalent to isin())

isin() is a method that returns a pandas.Series of bool values (True, False) indicating whether each element of the column (pandas.Series) is contained in the list passed as an argument. It can be used to extract rows whose column value matches any of several specific values.

print(df[df['state'].isin(['NY', 'TX'])])
#     name  age state  point
# 0  Alice   24    NY     64
# 3   Dave   68    TX     70
# 5  Frank   30    NY     57

The equivalent can be done with the in operator in the query() method.

print(df.query('state in ["NY", "TX"]'))
#     name  age state  point
# 0  Alice   24    NY     64
# 3   Dave   68    TX     70
# 5  Frank   30    NY     57

As a special case, == with a list is handled in the same way as in.

print(df.query('state == ["NY", "TX"]'))
#     name  age state  point
# 0  Alice   24    NY     64
# 3   Dave   68    TX     70
# 5  Frank   30    NY     57

You can also use list variables.

l = ['NY', 'TX']
print(df.query('state in @l'))
#     name  age state  point
# 0  Alice   24    NY     64
# 3   Dave   68    TX     70
# 5  Frank   30    NY     57
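The complement can be written with not in, which corresponds to negating isin() with ~ in Boolean indexing. This is a minimal sketch; the inline DataFrame is an assumption standing in for data/sample_pandas_normal.csv, and excluded is a hypothetical variable name.

```python
import pandas as pd

# Inline copy of the sample data (assumption: same contents as the CSV)
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})

excluded = ['NY', 'TX']  # hypothetical list of states to leave out

# Rows whose state is NOT in the list, via query()
by_query = df.query('state not in @excluded')

# Equivalent Boolean indexing: negate isin() with ~
by_mask = df[~df['state'].isin(excluded)]

print(by_query.index.tolist())
# [1, 2, 4]
print(by_query.equals(by_mask))
# True
```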

Specify conditions with string methods

Exact string matches can be specified with == or in as shown above, while partial matches can be specified with the string methods str.xxx().

The main methods are as follows:

  • str.contains(): contains a specific string
  • str.endswith(): ends with a specific string
  • str.startswith(): starts with a specific string
  • str.match(): matches a regular expression pattern

They can also be used in query(), although the result is not necessarily more compact than Boolean indexing.

print(df.query('name.str.endswith("e")'))
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70

print(df.query('name.str.contains("li")'))
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70
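The remaining two methods from the list, str.startswith() and str.match(), can be used in the same way. This is a minimal sketch; the inline DataFrame is an assumption standing in for data/sample_pandas_normal.csv, and the pattern [A-C] is only an illustrative choice. Note that str.match() tests the pattern from the start of each string.

```python
import pandas as pd

# Inline copy of the sample data (assumption: same contents as the CSV)
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})

# Names that start with "A"
starts_a = df.query('name.str.startswith("A")')
print(starts_a['name'].tolist())
# ['Alice']

# Names whose first character is A, B, or C (regular expression)
first_abc = df.query('name.str.match("[A-C]")')
print(first_abc['name'].tolist())
# ['Alice', 'Bob', 'Charlie']
```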

String methods can also be applied to a column whose dtype is not string by converting it to str with astype(). This too can be written in query().

print(df.query('age.astype("str").str.endswith("8")'))
#       name  age state  point
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70

When there are missing values NaN or None

Note that using string methods as a condition on a column that contains missing values NaN or None raises an error.

df.at[0, 'name'] = None
print(df)
#       name  age state  point
# 0     None   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57

# print(df.query('name.str.endswith("e")'))
# ValueError: unknown type object

Many string methods accept the parameter na, which specifies the value to use in the result for None or the missing value NaN. Specify True to extract rows with missing values, or False to exclude them.

print(df[df['name'].str.endswith('e', na=False)])
#       name  age state  point
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70

The argument can be specified in the same way in query().

print(df.query('name.str.endswith("e", na=False)'))
#       name  age state  point
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
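For comparison, na=True keeps the rows with missing values in the result. This is a minimal sketch; the inline DataFrame is an assumption standing in for the sample data after the first name has been replaced with None.

```python
import pandas as pd

# Sample data with a missing name in row 0 (assumption: mirrors the example above)
df = pd.DataFrame({
    'name': [None, 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})

# na=True: the missing value is treated as a match, so row 0 is extracted
with_missing = df.query('name.str.endswith("e", na=True)')
print(with_missing.index.tolist())
# [0, 2, 3]

# na=False: the missing value is treated as a non-match, so row 0 is dropped
without_missing = df.query('name.str.endswith("e", na=False)')
print(without_missing.index.tolist())
# [2, 3]
```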

index condition

Conditions on the index (row names) can be specified using index.

df = pd.read_csv('data/sample_pandas_normal.csv')

print(df.query('index % 2 == 0'))
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88
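Conditions on index can be mixed with conditions on columns in the same string. This is a minimal sketch; the inline DataFrame is an assumption standing in for data/sample_pandas_normal.csv.

```python
import pandas as pd

# Inline copy of the sample data (assumption: same contents as the CSV)
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})

# Rows from index 2 onward whose age is under 30
late_and_young = df.query('index >= 2 and age < 30')
print(late_and_young.index.tolist())
# [2, 4]

# Rows whose index label is in a given list
picked = df.query('index in [0, 5]')
print(picked['name'].tolist())
# ['Alice', 'Frank']
```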

If the index has a name, either that name or index can be used.

df_name = df.set_index('name')
print(df_name)
#          age state  point
# name                     
# Alice     24    NY     64
# Bob       42    CA     92
# Charlie   18    CA     70
# Dave      68    TX     70
# Ellen     24    CA     88
# Frank     30    NY     57

print(df_name.query('name.str.endswith("e")'))
#          age state  point
# name                     
# Alice     24    NY     64
# Charlie   18    CA     70
# Dave      68    TX     70

print(df_name.query('index.str.endswith("e")'))
#          age state  point
# name                     
# Alice     24    NY     64
# Charlie   18    CA     70
# Dave      68    TX     70

Specify multiple conditions

With Boolean indexing, multiple conditions are written as follows.

print(df[(df['age'] < 25) & (df['point'] > 65)])
#       name  age state  point
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

With the query() method, the same conditions can be written as follows. Parentheses around each condition are not required, and AND can be written as either & or and.

print(df.query('age < 25 & point > 65'))
#       name  age state  point
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

print(df.query('age < 25 and point > 65'))
#       name  age state  point
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

For OR, either | or or can be used.

print(df.query('age < 20 | point > 80'))
#       name  age state  point
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

print(df.query('age < 20 or point > 80'))
#       name  age state  point
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

For NOT (negation), use not.

print(df.query('not age < 25 and not point > 65'))
#     name  age state  point
# 5  Frank   30    NY     57
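The negated conditions above can be rewritten in a few equivalent ways, which may be easier to read. This is a minimal sketch; the inline DataFrame is an assumption standing in for data/sample_pandas_normal.csv, and the variable names are hypothetical.

```python
import pandas as pd

# Inline copy of the sample data (assumption: same contents as the CSV)
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})

# Form used in the article: negate each condition separately
separate = df.query('not age < 25 and not point > 65')

# De Morgan's law: negate the OR of both conditions at once
grouped = df.query('not (age < 25 or point > 65)')

# Flip the comparison operators instead of using not
flipped = df.query('age >= 25 and point <= 65')

print(separate['name'].tolist())
# ['Frank']
print(separate.equals(grouped) and separate.equals(flipped))
# True
```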

The same applies to three or more conditions, but the result depends on the order of evaluation. For example, & takes precedence over |, so it is safer to explicitly enclose the part to be evaluated first in parentheses.

print(df.query('age == 24 | point > 80 & state == "CA"'))
#     name  age state  point
# 0  Alice   24    NY     64
# 1    Bob   42    CA     92
# 4  Ellen   24    CA     88

print(df.query('(age == 24 | point > 80) & state == "CA"'))
#     name  age state  point
# 1    Bob   42    CA     92
# 4  Ellen   24    CA     88
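To confirm how the unparenthesized condition is grouped, it can be compared with an explicitly parenthesized query and with the corresponding Boolean indexing. This is a minimal sketch; the inline DataFrame is an assumption standing in for data/sample_pandas_normal.csv.

```python
import pandas as pd

# Inline copy of the sample data (assumption: same contents as the CSV)
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})

# No parentheses: the AND part is evaluated before the OR part
implicit = df.query('age == 24 | point > 80 & state == "CA"')

# Same grouping written out explicitly
explicit = df.query('age == 24 | (point > 80 & state == "CA")')

# Boolean indexing with the same grouping
by_mask = df[(df['age'] == 24) | ((df['point'] > 80) & (df['state'] == 'CA'))]

print(implicit.index.tolist())
# [0, 1, 4]
print(implicit.equals(explicit) and implicit.equals(by_mask))
# True
```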

Enclose column names containing spaces or dots with "`"

Be careful with column names when using the query() method. For example, change the column names as follows.

df.columns = ['0name', 'age.year', 'state name', 3]
print(df)
#      0name  age.year state name   3
# 0    Alice        24         NY  64
# 1      Bob        42         CA  92
# 2  Charlie        18         CA  70
# 3     Dave        68         TX  70
# 4    Ellen        24         CA  88
# 5    Frank        30         NY  57

Using a column name that is not a valid Python variable name results in an error. For example, column names that start with a digit, or that contain . or a space, all raise errors.

# print(df.query('0name.str.endswith("e")'))
# SyntaxError: invalid syntax

# print(df.query('age.year < 25'))
# UndefinedVariableError: name 'age' is not defined

# print(df.query('state name == "CA"'))
# SyntaxError: invalid syntax

Such column names must be enclosed in "`".

print(df.query('`0name`.str.endswith("e")'))
#      0name  age.year state name   3
# 0    Alice        24         NY  64
# 2  Charlie        18         CA  70
# 3     Dave        68         TX  70

print(df.query('`age.year` < 25'))
#      0name  age.year state name   3
# 0    Alice        24         NY  64
# 2  Charlie        18         CA  70
# 4    Ellen        24         CA  88

print(df.query('`state name` == "CA"'))
#      0name  age.year state name   3
# 1      Bob        42         CA  92
# 2  Charlie        18         CA  70
# 4    Ellen        24         CA  88
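Backtick-quoted names can be combined with multiple conditions and with @ variables like any other column name. This is a minimal sketch; the inline DataFrame is an assumption standing in for the renamed sample data (the last column is kept as point here), and max_age is a hypothetical variable name.

```python
import pandas as pd

# Sample data with column names that are not valid Python identifiers
df = pd.DataFrame({
    '0name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age.year': [24, 42, 18, 68, 24, 30],
    'state name': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})

max_age = 25  # hypothetical threshold

# Two backtick-quoted columns in one condition, plus an @ variable
young_ca = df.query('`age.year` < @max_age and `state name` == "CA"')
print(young_ca['0name'].tolist())
# ['Charlie', 'Ellen']
```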

For numeric column names, an error occurs even when the name is enclosed in "`". Specifying the condition with Boolean indexing causes no problem.

# print(df.query('3 > 75'))
# KeyError: False

# print(df.query('`3` > 75'))
# UndefinedVariableError: name 'BACKTICK_QUOTED_STRING_3' is not defined

print(df[df[3] > 75])
#    0name  age.year state name   3
# 1    Bob        42         CA  92
# 4  Ellen        24         CA  88
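If query() is still preferred, one possible workaround is to rename the numeric column to a string name first. This is a minimal sketch rather than a recommended pattern; the inline DataFrame and the replacement name point are assumptions.

```python
import pandas as pd

# Sample data whose last column name is the integer 3
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    3: [64, 92, 70, 70, 88, 57],
})

# rename() returns a new DataFrame, so the original column name 3 is untouched
renamed = df.rename(columns={3: 'point'})
by_query = renamed.query('point > 75')

# Boolean indexing on the original numeric column name, for comparison
by_mask = df[df[3] > 75]

print(by_query.index.tolist())
# [1, 4]
print(by_query.index.equals(by_mask.index))
# True
```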

Update the original object with the inplace parameter

In the examples so far, query() returns a new pandas.DataFrame containing the extracted rows and the original object is left unchanged. Specifying inplace=True modifies the original object itself.

df = pd.read_csv('data/sample_pandas_normal.csv')

df.query('age > 25', inplace=True)
print(df)
#     name  age state  point
# 1    Bob   42    CA     92
# 3   Dave   68    TX     70
# 5  Frank   30    NY     57
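With inplace=True the method returns None, so its result should not be assigned to a variable. The sketch below contrasts this with plain reassignment; the inline DataFrame is an assumption standing in for data/sample_pandas_normal.csv, and the variable names are hypothetical.

```python
import pandas as pd

# Inline copy of the sample data (assumption: same contents as the CSV)
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})

# inplace=True modifies df_inplace and returns None
df_inplace = df.copy()
result = df_inplace.query('age > 25', inplace=True)
print(result)
# None
print(df_inplace.index.tolist())
# [1, 3, 5]

# Without inplace, a new DataFrame is returned and df keeps all six rows
filtered = df.query('age > 25')
print(len(df), len(filtered))
# 6 3
```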

Origin blog.csdn.net/qq_18351157/article/details/131755497