53 Conditional replacement value (where, mask) in Pandas

This article explains how to assign values based on conditions in pandas. No if statement is used, but the same conditional branching (if ... then ..., or if ... then ... else ...) can be expressed.

For replacing specific values, or for replacing or deleting the missing value NaN, see the separate articles on those topics.

Take the pandas.DataFrame below as an example.

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [-20, -10, 0, 10, 20],
                   'B': [1, 2, 3, 4, 5],
                   'C': ['a', 'b', 'b', 'b', 'a']})

print(df)
#     A  B  C
# 0 -20  1  a
# 1 -10  2  b
# 2   0  3  b
# 3  10  4  b
# 4  20  5  a

The following topics are covered.

  • Boolean index reference with loc, iloc
  • where() method of pandas.DataFrame, Series
    • False elements can change, True elements remain unchanged
  • mask() method of pandas.DataFrame, Series
    • True elements can change, False elements remain unchanged
  • NumPy where() function
    • Both True and False elements can be changed

Boolean index reference with loc, iloc

A scalar value can be replaced conditionally by writing as follows.

df.loc[df['A'] < 0, 'A'] = -100
df.loc[~(df['A'] < 0), 'A'] = 100
print(df)
#      A  B  C
# 0 -100  1  a
# 1 -100  2  b
# 2  100  3  b
# 3  100  4  b
# 4  100  5  a

If you perform a comparison operation on a pandas.DataFrame or on one of its columns (a pandas.Series), you get a pandas.DataFrame or pandas.Series of bool type.

The examples here work on a column of a pandas.DataFrame (a pandas.Series). ~ is the negation operator.

print(df['A'] < 0)
# 0     True
# 1     True
# 2    False
# 3    False
# 4    False
# Name: A, dtype: bool

print(~(df['A'] < 0))
# 0    False
# 1    False
# 2     True
# 3     True
# 4     True
# Name: A, dtype: bool

Using a bool-type pandas.Series as the row specification of loc selects only the rows that are True. loc takes row and column names, while iloc takes row and column positions; iloc generally expects the boolean mask as a list or NumPy array rather than as a pandas.Series.

print(df.loc[df['A'] < 0, 'A'])
# 0   -100
# 1   -100
# Name: A, dtype: int64
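As a rough sketch of the difference between loc and iloc with a boolean condition: in the pandas versions I am aware of, iloc does not take a boolean pandas.Series directly, so the mask is converted with to_numpy() first. The frame demo and the variable names are made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({'A': [-20, -10, 0, 10, 20],
                     'B': [1, 2, 3, 4, 5]})

cond = demo['A'] < 0  # boolean Series aligned on the index

# loc: boolean Series for the rows, column name for the column
by_label = demo.loc[cond, 'A']

# iloc: boolean NumPy array for the rows, integer position for the column
by_position = demo.iloc[cond.to_numpy(), 0]

print(by_label.tolist())     # [-20, -10]
print(by_position.tolist())  # [-20, -10]
```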

Selections made with loc and iloc can be used not only to get values, but also to assign them. For the rows where the bool-type pandas.Series is True (the rows that satisfy the condition), the elements of the specified column are changed to the scalar value on the right-hand side.

df.loc[df['A'] < 0, 'A'] = -10
print(df)
#      A  B  C
# 0  -10  1  a
# 1  -10  2  b
# 2  100  3  b
# 3  100  4  b
# 4  100  5  a

A pandas.Series, list, or array can also be specified instead of a scalar value. The values in the corresponding rows are used for the replacement.

df.loc[~(df['A'] < 0), 'A'] = df['B']
print(df)
#     A  B  C
# 0 -10  1  a
# 1 -10  2  b
# 2   3  3  b
# 3   4  4  b
# 4   5  5  a
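The two kinds of right-hand side behave slightly differently, which the following sketch tries to show: a pandas.Series is matched to the selected rows by index label, while a plain array is matched by position and so needs exactly as many elements as there are selected rows. The names demo, by_series and by_array are illustrative only.

```python
import pandas as pd

demo = pd.DataFrame({'A': [-20, -10, 0, 10, 20],
                     'B': [1, 2, 3, 4, 5]})
cond = demo['A'] >= 0

# Series on the right: matched by index label, so passing the whole
# column is fine; only the entries for rows 2, 3, 4 are used.
by_series = demo.copy()
by_series.loc[cond, 'A'] = by_series['B']

# Array on the right: matched by position, so its length must equal
# the number of selected rows (3 here).
by_array = demo.copy()
by_array.loc[cond, 'A'] = by_array.loc[cond, 'B'].to_numpy()

print(by_series['A'].tolist())  # [-20, -10, 3, 4, 5]
print(by_array['A'].tolist())   # [-20, -10, 3, 4, 5]
```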

In the examples so far we've assigned values to elements of existing columns, but specifying a new column name adds a new column and assigns values to the rows that satisfy the condition.

df.loc[df['B'] % 2 == 0, 'D'] = 'even'
df.loc[df['B'] % 2 != 0, 'D'] = 'odd'
print(df)
#     A  B  C     D
# 0 -10  1  a   odd
# 1 -10  2  b  even
# 2   3  3  b   odd
# 3   4  4  b  even
# 4   5  5  a   odd

Multiple conditions can also be combined with and / or logic. Use & and | (not the and / or keywords), and enclose each condition in parentheses.

When adding a new column, elements that do not satisfy the condition become the missing value NaN. Note that the dtype of a numeric column containing NaN becomes float.

df.loc[~(df['A'] < 0) & (df['C'] == 'b'), 'E'] = df['B'] * 2
print(df)
#     A  B  C     D    E
# 0 -10  1  a   odd  NaN
# 1 -10  2  b  even  NaN
# 2   3  3  b   odd  6.0
# 3   4  4  b  even  8.0
# 4   5  5  a   odd  NaN
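A small sketch of combining conditions, assuming a frame shaped like the one above (demo, both and either are illustrative names). The parentheses matter because &, | and ~ bind more tightly than comparison operators such as < and ==.

```python
import pandas as pd

demo = pd.DataFrame({'A': [-10, -10, 3, 4, 5],
                     'B': [1, 2, 3, 4, 5],
                     'C': ['a', 'b', 'b', 'b', 'a']})

# & (and), | (or) and ~ (not) work element by element.
both = (demo['A'] >= 0) & (demo['C'] == 'b')
either = (demo['A'] < 0) | (demo['C'] == 'a')

# Rows that do not satisfy the condition are left as NaN in the new column.
demo.loc[both, 'E'] = demo['B'] * 2

print(both.tolist())    # [False, False, True, True, False]
print(either.tolist())  # [True, True, False, False, True]
print(demo['E'].dtype)  # float64
```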

Taking the value from one of two columns depending on a condition can be written as follows.

df.loc[~(df['A'] < 0), 'A'] = 10
print(df)
#     A  B  C     D    E
# 0 -10  1  a   odd  NaN
# 1 -10  2  b  even  NaN
# 2  10  3  b   odd  6.0
# 3  10  4  b  even  8.0
# 4  10  5  a   odd  NaN

df.loc[df['C'] == 'a', 'F'] = df['A']
df.loc[df['C'] == 'b', 'F'] = df['B']
print(df)
#     A  B  C     D    E     F
# 0 -10  1  a   odd  NaN -10.0
# 1 -10  2  b  even  NaN   2.0
# 2  10  3  b   odd  6.0   3.0
# 3  10  4  b  even  8.0   4.0
# 4  10  5  a   odd  NaN  10.0

Multiple columns can also be specified in a list using loc and iloc.

df.loc[df['C'] == 'a', ['E', 'F']] = 100
print(df)
#     A  B  C     D      E    F
# 0 -10  1  a   odd  100.0  100
# 1 -10  2  b  even    NaN    2
# 2  10  3  b   odd    6.0    3
# 3  10  4  b  even    8.0    4
# 4  10  5  a   odd  100.0  100

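Comparing a whole pandas.DataFrame yields a pandas.DataFrame of bool type. However, it cannot be used for assignment in the same way as in the previous examples; on a frame with mixed dtypes the assignment raises an error. (The output below comes from an older pandas version; recent versions may raise a TypeError as soon as string columns are compared with a number.)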

print(df < 0)
#        A      B     C     D      E      F
# 0   True  False  True  True  False  False
# 1   True  False  True  True  False  False
# 2  False  False  True  True  False  False
# 3  False  False  True  True  False  False
# 4  False  False  True  True  False  False

print(df[df < 0])
#       A   B  C     D   E   F
# 0 -10.0 NaN  a   odd NaN NaN
# 1 -10.0 NaN  b  even NaN NaN
# 2   NaN NaN  b   odd NaN NaN
# 3   NaN NaN  b  even NaN NaN
# 4   NaN NaN  a   odd NaN NaN

# df[df < 0] = 0
# TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value

If you want to apply conditions to the entire pandas.DataFrame, use the where() method or the mask() method described below.
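If you would rather stay with loc, one possible workaround is to apply the same pattern to each numeric column in turn. This is only a sketch under the assumption that the string columns should be left alone; demo and num_cols are illustrative names.

```python
import pandas as pd

demo = pd.DataFrame({'A': [-20, -10, 0, 10, 20],
                     'B': [1, 2, 3, 4, 5],
                     'C': ['a', 'b', 'b', 'b', 'a']})

# Only numeric columns can be meaningfully compared with 0.
num_cols = demo.select_dtypes(include='number').columns

for col in num_cols:
    # Same loc pattern as before, applied one column at a time.
    demo.loc[demo[col] < 0, col] = 0

print(demo['A'].tolist())  # [0, 0, 0, 10, 20]
print(demo['C'].tolist())  # ['a', 'b', 'b', 'b', 'a']
```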

where() method of pandas.DataFrame, Series

pandas.DataFrame and pandas.Series both have a where() method.

If you pass a pandas.Series or an array of bool values as the first argument, elements where it is True keep the value of the calling object, and elements where it is False become NaN.

df = pd.DataFrame({'A': [-20, -10, 0, 10, 20],
                   'B': [1, 2, 3, 4, 5],
                   'C': ['a', 'b', 'b', 'b', 'a']})
print(df)
#     A  B  C
# 0 -20  1  a
# 1 -10  2  b
# 2   0  3  b
# 3  10  4  b
# 4  20  5  a

print(df['A'].where(df['C'] == 'a'))
# 0   -20.0
# 1     NaN
# 2     NaN
# 3     NaN
# 4    20.0
# Name: A, dtype: float64

If a scalar value, pandas.Series, or array is specified as the second argument, that value is used instead of NaN for the False elements. Unlike NumPy's where() function, you cannot specify a value for the True elements (the original value is kept).

print(df['A'].where(df['C'] == 'a', 100))
# 0    -20
# 1    100
# 2    100
# 3    100
# 4     20
# Name: A, dtype: int64

print(df['A'].where(df['C'] == 'a', df['B']))
# 0   -20
# 1     2
# 2     3
# 3     4
# 4    20
# Name: A, dtype: int64
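One way to read where(cond, other) is as an element-by-element "keep the original if cond is True, otherwise take other". The sketch below compares it with a plain Python list comprehension; s, cond and other are made-up names and values.

```python
import pandas as pd

s = pd.Series([-20, -10, 0, 10, 20])
cond = pd.Series([True, False, False, False, True])
other = pd.Series([1, 2, 3, 4, 5])

result = s.where(cond, other)

# Plain-Python equivalent: keep s where cond is True, else take other.
expected = [a if keep else b
            for a, keep, b in zip(s.tolist(), cond.tolist(), other.tolist())]

print(result.tolist())  # [-20, 2, 3, 4, 20]
print(expected)         # [-20, 2, 3, 4, 20]
```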

It can also be added as a new column.

df['D'] = df['A'].where(df['C'] == 'a', df['B'])
print(df)
#     A  B  C   D
# 0 -20  1  a -20
# 1 -10  2  b   2
# 2   0  3  b   3
# 3  10  4  b   4
# 4  20  5  a  20

The parameter inplace=True modifies the original object.

df['D'].where((df['D'] % 2 == 0) & (df['A'] < 0), df['D'] * 100, inplace=True)
print(df)
#     A  B  C     D
# 0 -20  1  a   -20
# 1 -10  2  b     2
# 2   0  3  b   300
# 3  10  4  b   400
# 4  20  5  a  2000
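As far as I know, in some pandas versions and settings (for example with Copy-on-Write enabled), calling where() with inplace=True on a selected column may not update the parent pandas.DataFrame. Assigning the result back to the column is a sketch of an alternative that does not depend on that behavior; demo and keep are illustrative names.

```python
import pandas as pd

demo = pd.DataFrame({'A': [-20, -10, 0, 10, 20],
                     'D': [-20, 2, 3, 4, 20]})

# Keep D where it is even and A is negative; multiply it by 100 elsewhere.
keep = (demo['D'] % 2 == 0) & (demo['A'] < 0)

# Assign the new Series back instead of using inplace=True.
demo['D'] = demo['D'].where(keep, demo['D'] * 100)

print(demo['D'].tolist())  # [-20, 2, 300, 400, 2000]
```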

pandas.DataFrame also has a where() method. As the condition in the first argument, specify a pandas.DataFrame or a two-dimensional array of bool values with the same shape as the caller.

print(df < 0)
#        A      B     C      D
# 0   True  False  True   True
# 1   True  False  True  False
# 2  False  False  True  False
# 3  False  False  True  False
# 4  False  False  True  False

print(df.where(df < 0))
#       A   B  C     D
# 0 -20.0 NaN  a -20.0
# 1 -10.0 NaN  b   NaN
# 2   NaN NaN  b   NaN
# 3   NaN NaN  b   NaN
# 4   NaN NaN  a   NaN

print(df.where(df < 0, df * 2))
#     A   B  C     D
# 0 -20   2  a   -20
# 1 -10   4  b     4
# 2   0   6  b   600
# 3  20   8  b   800
# 4  40  10  a  4000

print(df.where(df < 0, 100))
#      A    B  C    D
# 0  -20  100  a  -20
# 1  -10  100  b  100
# 2  100  100  b  100
# 3  100  100  b  100
# 4  100  100  a  100

mask() method of pandas.DataFrame, Series

pandas.DataFrame and pandas.Series both have a mask() method.

The mask() method is the opposite of the where() method: elements for which the condition in the first argument is False keep the value of the calling object, and elements for which it is True become NaN or the value specified in the second argument. Otherwise it is used in the same way as where().

df = pd.DataFrame({'A': [-20, -10, 0, 10, 20],
                   'B': [1, 2, 3, 4, 5],
                   'C': ['a', 'b', 'b', 'b', 'a']})
print(df)
#     A  B  C
# 0 -20  1  a
# 1 -10  2  b
# 2   0  3  b
# 3  10  4  b
# 4  20  5  a

print(df['C'].mask(df['C'] == 'a'))
# 0    NaN
# 1      b
# 2      b
# 3      b
# 4    NaN
# Name: C, dtype: object

print(df['C'].mask(df['C'] == 'a', 100))
# 0    100
# 1      b
# 2      b
# 3      b
# 4    100
# Name: C, dtype: object

df['D'] = df['A'].mask(df['C'] == 'a', df['B'])
print(df)
#     A  B  C   D
# 0 -20  1  a   1
# 1 -10  2  b -10
# 2   0  3  b   0
# 3  10  4  b  10
# 4  20  5  a   5

df['D'].mask(df['D'] % 2 != 0, df['D'] * 10, inplace=True)
print(df)
#     A  B  C   D
# 0 -20  1  a  10
# 1 -10  2  b -10
# 2   0  3  b   0
# 3  10  4  b  10
# 4  20  5  a  50
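The relationship between the two methods can be checked directly: mask(cond, value) should give the same result as where(~cond, value). A minimal sketch with made-up values:

```python
import pandas as pd

s = pd.Series([-20, -10, 0, 10, 20])
cond = s < 0

via_mask = s.mask(cond, 0)     # replace where cond is True
via_where = s.where(~cond, 0)  # keep where the negated cond is True

print(via_mask.tolist())           # [0, 0, 0, 10, 20]
print(via_mask.equals(via_where))  # True
```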

mask() may feel more intuitive than where(), because the second argument is assigned to the elements that satisfy the condition in the first argument (where it is True). pandas.DataFrame also has a mask() method.

print(df.mask(df < 0, -100))
#      A  B     C    D
# 0 -100  1  -100   10
# 1 -100  2  -100 -100
# 2    0  3  -100    0
# 3   10  4  -100   10
# 4   20  5  -100   50

If you want to apply a method only to numeric columns of objects containing numbers and strings, as in this example, you can use select_dtypes() as follows.

print(df.select_dtypes(include='number').mask(df < 0, -100))
#      A  B    D
# 0 -100  1   10
# 1 -100  2 -100
# 2    0  3    0
# 3   10  4   10
# 4   20  5   50

It is also possible to concatenate non-numeric columns after processing only numeric columns.

df_mask = df.select_dtypes(include='number').mask(df < 0, -100)
df_mask = pd.concat([df_mask, df.select_dtypes(exclude='number')], axis=1)
print(df_mask.sort_index(axis=1))
#      A  B  C    D
# 0 -100  1  a   10
# 1 -100  2  b -100
# 2    0  3  b    0
# 3   10  4  b   10
# 4   20  5  a   50
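Another option, sketched here under the assumption that the original column order should be kept, is to compute the mask on the numeric columns only and assign the result back to those same columns. This also avoids comparing string columns with a number. The names demo, num_cols, numeric and result are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({'A': [-20, -10, 0, 10, 20],
                     'B': [1, 2, 3, 4, 5],
                     'C': ['a', 'b', 'b', 'b', 'a']})

num_cols = demo.select_dtypes(include='number').columns
numeric = demo[num_cols]

# Work on a copy so the original frame stays untouched.
result = demo.copy()
result[num_cols] = numeric.mask(numeric < 0, -100)

print(result.columns.tolist())  # ['A', 'B', 'C']
print(result['A'].tolist())     # [-100, -100, 0, 10, 20]
```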

NumPy where() function

NumPy's where() function can also be used to assign values based on conditions.

With the pandas where() and mask() methods, the second argument only sets the value used where the condition is False (where()) or True (mask()); for the other case the value of the calling object is used as is. They therefore cannot choose between two different values depending on the condition, that is, specify one value for True and another for False.

In the NumPy where() function, the first argument is the condition, the second argument is the value assigned to elements where the condition is True, and the third argument is the value assigned to elements where the condition is False. Scalar values and arrays can be specified for the second and third arguments, and they are broadcast as needed.

numpy.where() returns a NumPy array (ndarray).

A one-dimensional numpy.ndarray can be assigned as a column of a pandas.DataFrame.

df = pd.DataFrame({'A': [-20, -10, 0, 10, 20],
                   'B': [1, 2, 3, 4, 5],
                   'C': ['a', 'b', 'b', 'b', 'a']})
print(df)
#     A  B  C
# 0 -20  1  a
# 1 -10  2  b
# 2   0  3  b
# 3  10  4  b
# 4  20  5  a

print(np.where(df['B'] % 2 == 0, 'even', 'odd'))
# ['odd' 'even' 'odd' 'even' 'odd']

print(np.where(df['C'] == 'a', df['A'], df['B']))
# [-20   2   3   4  20]

df['D'] = np.where(df['B'] % 2 == 0, 'even', 'odd')
print(df)
#     A  B  C     D
# 0 -20  1  a   odd
# 1 -10  2  b  even
# 2   0  3  b   odd
# 3  10  4  b  even
# 4  20  5  a   odd

df['E'] = np.where(df['C'] == 'a', df['A'], df['B'])
print(df)
#     A  B  C     D   E
# 0 -20  1  a   odd -20
# 1 -10  2  b  even   2
# 2   0  3  b   odd   3
# 3  10  4  b  even   4
# 4  20  5  a   odd  20
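Because numpy.where() returns a plain ndarray, the index of the original object is not carried along. The sketch below uses a frame with a non-default index (an assumption made for illustration) and wraps the result in a pandas.Series when the labels are needed.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'A': [-20, 0, 20], 'B': [1, 2, 3]},
                    index=['x', 'y', 'z'])

# Two scalars: both are broadcast to the shape of the condition.
arr = np.where(demo['A'] < 0, 'neg', 'non-neg')

# An array for True and a scalar for False can be mixed.
mixed = np.where(demo['A'] < 0, demo['B'], 0)

# Re-attach the index explicitly if the labels matter.
labels = pd.Series(arr, index=demo.index, name='sign')

print(type(arr).__name__)  # ndarray
print(labels['x'])         # neg
print(mixed.tolist())      # [1, 0, 0]
```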

If a pandas.DataFrame is specified as the condition, numpy.where() returns a two-dimensional numpy.ndarray. You can build a pandas.DataFrame from it using the index and columns of the original pandas.DataFrame.

print(np.where(df < 0, df, 100))
# [[-20 100 'a' 'odd' -20]
#  [-10 100 'b' 'even' 100]
#  [100 100 'b' 'odd' 100]
#  [100 100 'b' 'even' 100]
#  [100 100 'a' 'odd' 100]]

df_np_where = pd.DataFrame(np.where(df < 0, df, 100),
                           index=df.index, columns=df.columns)

print(df_np_where)
#      A    B  C     D    E
# 0  -20  100  a   odd  -20
# 1  -10  100  b  even  100
# 2  100  100  b   odd  100
# 3  100  100  b  even  100
# 4  100  100  a   odd  100
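When numbers and strings are mixed, the array returned by numpy.where() has to hold both, so the rebuilt frame ends up with generic object columns. If only the numeric columns matter, one possible approach (a sketch, with illustrative names) is to select them first so that the result stays numeric.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'A': [-20, -10, 0, 10, 20],
                     'B': [1, 2, 3, 4, 5],
                     'C': ['a', 'b', 'b', 'b', 'a']})

# Restrict the operation to numeric columns.
numeric = demo.select_dtypes(include='number')

replaced = pd.DataFrame(np.where(numeric < 0, numeric, 100),
                        index=numeric.index, columns=numeric.columns)

print(replaced['A'].tolist())  # [-20, -10, 100, 100, 100]
print(replaced['B'].tolist())  # [100, 100, 100, 100, 100]
```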

Origin blog.csdn.net/qq_18351157/article/details/127938064