53 Conditional replacement of values (where, mask) in pandas
This article explains how to assign values based on conditions in pandas. No if statement is used, but the same kind of conditional branching (if ... then ..., or if ... then ... else ...) can be expressed.
For replacing specific values, and for replacing or deleting the missing value NaN, see the separate articles on those topics.
Take the pandas.DataFrame below as an example.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [-20, -10, 0, 10, 20],
'B': [1, 2, 3, 4, 5],
'C': ['a', 'b', 'b', 'b', 'a']})
print(df)
# A B C
# 0 -20 1 a
# 1 -10 2 b
# 2 0 3 b
# 3 10 4 b
# 4 20 5 a
The following contents are explained.
- Boolean index reference with loc, iloc
- where() method of pandas.DataFrame, Series
  - False elements can be changed, True elements remain unchanged
- mask() method of pandas.DataFrame, Series
  - True elements can be changed, False elements remain unchanged
- NumPy where() function
  - Both True and False elements can be changed
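As a rough orientation before the details, the three approaches can be compared on one small Series. This is only a minimal sketch: the Series s and the names cond, kept_negative, replaced_negative, and labels are made up for illustration and are not used elsewhere in this article.

```python
import numpy as np
import pandas as pd

s = pd.Series([-20, -10, 0, 10, 20])
cond = s < 0

# where(): keep values where cond is True, replace the rest
kept_negative = s.where(cond, 0)

# mask(): replace values where cond is True, keep the rest
replaced_negative = s.mask(cond, 0)

# np.where(): choose a value for both branches; returns an ndarray
labels = np.where(cond, 'negative', 'non-negative')

print(kept_negative.tolist())      # [-20, -10, 0, 0, 0]
print(replaced_negative.tolist())  # [0, 0, 0, 10, 20]
print(labels.tolist())
# ['negative', 'negative', 'non-negative', 'non-negative', 'non-negative']
```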
Boolean index reference with loc, iloc
A scalar value can be replaced conditionally by writing as follows.
df.loc[df['A'] < 0, 'A'] = -100
df.loc[~(df['A'] < 0), 'A'] = 100
print(df)
# A B C
# 0 -100 1 a
# 1 -100 2 b
# 2 100 3 b
# 3 100 4 b
# 4 100 5 a
Applying a comparison operator to a pandas.DataFrame or to one of its columns (a pandas.Series) returns a pandas.DataFrame or pandas.Series of bool type.
The example below uses a column of the pandas.DataFrame (a pandas.Series). ~ is the negation operator.
print(df['A'] < 0)
# 0 True
# 1 True
# 2 False
# 3 False
# 4 False
# Name: A, dtype: bool
print(~(df['A'] < 0))
# 0 False
# 1 False
# 2 True
# 3 True
# 4 True
# Name: A, dtype: bool
Using a bool type pandas.Series as the row specification of loc selects only the rows that are True. loc specifies rows and columns by name, and iloc specifies them by number. Note that iloc does not accept a bool pandas.Series directly; pass a bool list or NumPy array instead.
print(df.loc[df['A'] < 0, 'A'])
# 0 -100
# 1 -100
# Name: A, dtype: int64
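The difference between loc and iloc with a boolean condition can be sketched as follows. The DataFrame demo and the names cond, by_label, and by_position are illustrative; the sketch assumes the condition is converted with to_numpy() before it is passed to iloc.

```python
import pandas as pd

demo = pd.DataFrame({'A': [-20, -10, 0, 10, 20],
                     'B': [1, 2, 3, 4, 5]})

cond = demo['A'] < 0

# loc accepts the boolean Series directly, with a column name
by_label = demo.loc[cond, 'A']

# iloc needs a plain boolean array, with an integer column position
by_position = demo.iloc[cond.to_numpy(), 0]

print(by_label.tolist())     # [-20, -10]
print(by_position.tolist())  # [-20, -10]
```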
References with loc and iloc can be used not only to get values, but also to assign values. The rows where bool type pandas.Series is True (rows that satisfy the condition) and the specified column elements are changed to the scalar value on the right.
df.loc[df['A'] < 0, 'A'] = -10
print(df)
# A B C
# 0 -10 1 a
# 1 -10 2 b
# 2 100 3 b
# 3 100 4 b
# 4 100 5 a
A pandas.Series, list, or array can also be specified instead of a scalar value. The corresponding row values are replaced.
df.loc[~(df['A'] < 0), 'A'] = df['B']
print(df)
# A B C
# 0 -10 1 a
# 1 -10 2 b
# 2 3 3 b
# 3 4 4 b
# 4 5 5 a
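The example above assigns a pandas.Series. A small sketch of how a Series and an array behave differently on the right-hand side may help; demo and cond are illustrative names, and the values are made up.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'A': [-20, -10, 0, 10, 20],
                     'B': [1, 2, 3, 4, 5]})
cond = demo['A'] < 0

# A Series on the right-hand side is aligned on the index,
# so only the rows selected by cond receive values.
demo.loc[cond, 'A'] = demo['B'] * 1000

# A list or ndarray is assigned by position, so its length must
# equal the number of selected rows (three rows here).
demo.loc[~cond, 'A'] = np.array([7, 8, 9])

print(demo['A'].tolist())  # [1000, 2000, 7, 8, 9]
```

With a list or array whose length does not match the number of selected rows, the assignment fails, so a Series is usually the more forgiving choice.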
In the examples so far we've assigned values to elements of existing columns, but specifying a new column name adds a new column and allows us to assign values to rows that satisfy the condition.
df.loc[df['B'] % 2 == 0, 'D'] = 'even'
df.loc[df['B'] % 2 != 0, 'D'] = 'odd'
print(df)
# A B C D
# 0 -10 1 a odd
# 1 -10 2 b even
# 2 3 3 b odd
# 3 4 4 b even
# 4 5 5 a odd
Multiple conditions can also be combined with AND and OR. Use & and | instead of and and or, and enclose each condition in parentheses.
When a new column is added, elements that do not satisfy the condition become the missing value NaN. Note that the dtype of a numeric column containing NaN is float.
df.loc[~(df['A'] < 0) & (df['C'] == 'b'), 'E'] = df['B'] * 2
print(df)
# A B C D E
# 0 -10 1 a odd NaN
# 1 -10 2 b even NaN
# 2 3 3 b odd 6.0
# 3 4 4 b even 8.0
# 4 5 5 a odd NaN
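The example above uses only &. A short sketch with both operators is shown here; demo, both, either, and the column name flag are illustrative. The parentheses matter because & and | bind more tightly than comparison operators such as > and ==, and the Python keywords and and or raise a ValueError when applied to a Series.

```python
import pandas as pd

demo = pd.DataFrame({'A': [-20, -10, 0, 10, 20],
                     'C': ['a', 'b', 'b', 'b', 'a']})

# AND: both conditions must hold
both = (demo['A'] > 0) & (demo['C'] == 'b')

# OR: at least one condition must hold
either = (demo['A'] < 0) | (demo['C'] == 'a')

demo.loc[both, 'flag'] = 'and'
demo.loc[either, 'flag'] = 'or'

print(both.tolist())    # [False, False, False, True, False]
print(either.tolist())  # [True, True, False, False, True]
print(demo)             # row 2 matches neither condition and stays NaN
```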
Selecting a value from one of two columns, depending on the value of another column, can be written as follows.
df.loc[~(df['A'] < 0), 'A'] = 10
print(df)
# A B C D E
# 0 -10 1 a odd NaN
# 1 -10 2 b even NaN
# 2 10 3 b odd 6.0
# 3 10 4 b even 8.0
# 4 10 5 a odd NaN
df.loc[df['C'] == 'a', 'F'] = df['A']
df.loc[df['C'] == 'b', 'F'] = df['B']
print(df)
# A B C D E F
# 0 -10 1 a odd NaN -10.0
# 1 -10 2 b even NaN 2.0
# 2 10 3 b odd 6.0 3.0
# 3 10 4 b even 8.0 4.0
# 4 10 5 a odd NaN 10.0
Multiple columns can also be specified in a list using loc and iloc.
df.loc[df['C'] == 'a', ['E', 'F']] = 100
print(df)
# A B C D E F
# 0 -10 1 a odd 100.0 100
# 1 -10 2 b even NaN 2
# 2 10 3 b odd 6.0 3
# 3 10 4 b even 8.0 4
# 4 10 5 a odd 100.0 100
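Comparing an entire pandas.DataFrame returns a pandas.DataFrame of bool type. It can be used to select values, but because the columns here have mixed types, assigning a value as in the previous examples results in an error. (The output below treats the string columns as True; depending on the Python and pandas versions, comparing string columns with a number may raise a TypeError instead.)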
print(df < 0)
# A B C D E F
# 0 True False True True False False
# 1 True False True True False False
# 2 False False True True False False
# 3 False False True True False False
# 4 False False True True False False
print(df[df < 0])
# A B C D E F
# 0 -10.0 NaN a odd NaN NaN
# 1 -10.0 NaN b even NaN NaN
# 2 NaN NaN b odd NaN NaN
# 3 NaN NaN b even NaN NaN
# 4 NaN NaN a odd NaN NaN
# df[df < 0] = 0
# TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value
If you want to apply conditions to the entire pandas.DataFrame, use the where() method or the mask() method described below.
where() method of pandas.DataFrame, Series
Both pandas.DataFrame and pandas.Series have a where() method.
If a pandas.Series or an array of bool values is specified as the first argument, elements where it is True keep the value of the calling object, and elements where it is False become NaN.
df = pd.DataFrame({
'A': [-20, -10, 0, 10, 20],
'B': [1, 2, 3, 4, 5],
'C': ['a', 'b', 'b', 'b', 'a']})
print(df)
# A B C
# 0 -20 1 a
# 1 -10 2 b
# 2 0 3 b
# 3 10 4 b
# 4 20 5 a
print(df['A'].where(df['C'] == 'a'))
# 0 -20.0
# 1 NaN
# 2 NaN
# 3 NaN
# 4 20.0
# Name: A, dtype: float64
If a scalar value, pandas.Series, or array is specified as the second argument, that value is used instead of NaN for the False elements. Unlike NumPy's where() function, a value for the True elements cannot be specified (the original values are kept).
print(df['A'].where(df['C'] == 'a', 100))
# 0 -20
# 1 100
# 2 100
# 3 100
# 4 20
# Name: A, dtype: int64
print(df['A'].where(df['C'] == 'a', df['B']))
# 0 -20
# 1 2
# 2 3
# 3 4
# 4 20
# Name: A, dtype: int64
It can also be added as a new column.
df['D'] = df['A'].where(df['C'] == 'a', df['B'])
print(df)
# A B C D
# 0 -20 1 a -20
# 1 -10 2 b 2
# 2 0 3 b 3
# 3 10 4 b 4
# 4 20 5 a 20
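Specifying inplace=True modifies the original object. Note that the example below calls where() on a column selected with df['D']; whether the change reaches df depends on the pandas version (with Copy-on-Write enabled it does not), so assigning the result back to the column is more robust.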
df['D'].where((df['D'] % 2 == 0) & (df['A'] < 0), df['D'] * 100, inplace=True)
print(df)
# A B C D
# 0 -20 1 a -20
# 1 -10 2 b 2
# 2 0 3 b 300
# 3 10 4 b 400
# 4 20 5 a 2000
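As a version-independent alternative to inplace=True, the result can be assigned back to the column. The following is a minimal sketch under that assumption; demo and cond are illustrative names, and the column values are copied from the example above.

```python
import pandas as pd

demo = pd.DataFrame({'A': [-20, -10, 0, 10, 20],
                     'D': [-20, 2, 3, 4, 20]})

# Keep D where it is even and A is negative; multiply the rest by 100
cond = (demo['D'] % 2 == 0) & (demo['A'] < 0)

# Assign the result back to the column instead of using inplace=True
demo['D'] = demo['D'].where(cond, demo['D'] * 100)

print(demo['D'].tolist())  # [-20, 2, 300, 400, 2000]
```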
pandas.DataFrame also has a where() method. As the condition in the first argument, specify a pandas.DataFrame or a two-dimensional array of bool values with the same shape as the calling object.
print(df < 0)
# A B C D
# 0 True False True True
# 1 True False True False
# 2 False False True False
# 3 False False True False
# 4 False False True False
print(df.where(df < 0))
# A B C D
# 0 -20.0 NaN a -20.0
# 1 -10.0 NaN b NaN
# 2 NaN NaN b NaN
# 3 NaN NaN b NaN
# 4 NaN NaN a NaN
print(df.where(df < 0, df * 2))
# A B C D
# 0 -20 2 a -20
# 1 -10 4 b 4
# 2 0 6 b 600
# 3 20 8 b 800
# 4 40 10 a 4000
print(df.where(df < 0, 100))
# A B C D
# 0 -20 100 a -20
# 1 -10 100 b 100
# 2 100 100 b 100
# 3 100 100 b 100
# 4 100 100 a 100
mask() method of pandas.DataFrame, Series
Both pandas.DataFrame and pandas.Series have a mask() method.
The mask() method is the opposite of the where() method: elements where the condition in the first parameter is False keep the value of the calling object, and elements where it is True become NaN or the value specified in the second parameter. Otherwise it is used in the same way as where().
df = pd.DataFrame({
'A': [-20, -10, 0, 10, 20],
'B': [1, 2, 3, 4, 5],
'C': ['a', 'b', 'b', 'b', 'a']})
print(df)
# A B C
# 0 -20 1 a
# 1 -10 2 b
# 2 0 3 b
# 3 10 4 b
# 4 20 5 a
print(df['C'].mask(df['C'] == 'a'))
# 0 NaN
# 1 b
# 2 b
# 3 b
# 4 NaN
# Name: C, dtype: object
print(df['C'].mask(df['C'] == 'a', 100))
# 0 100
# 1 b
# 2 b
# 3 b
# 4 100
# Name: C, dtype: object
df['D'] = df['A'].mask(df['C'] == 'a', df['B'])
print(df)
# A B C D
# 0 -20 1 a 1
# 1 -10 2 b -10
# 2 0 3 b 0
# 3 10 4 b 10
# 4 20 5 a 5
df['D'].mask(df['D'] % 2 != 0, df['D'] * 10, inplace=True)
print(df)
# A B C D
# 0 -20 1 a 10
# 1 -10 2 b -10
# 2 0 3 b 0
# 3 10 4 b 10
# 4 20 5 a 50
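The relationship between the two methods can be checked directly: calling mask() with a condition gives the same result as calling where() with the negated condition. A minimal sketch, with s, cond, via_mask, and via_where as illustrative names:

```python
import pandas as pd

s = pd.Series([-20, -10, 0, 10, 20])
cond = s < 0

# Replace where cond is True ...
via_mask = s.mask(cond, 0)

# ... which is the same as keeping where cond is False
via_where = s.where(~cond, 0)

print(via_mask.equals(via_where))  # True
print(via_mask.tolist())           # [0, 0, 0, 10, 20]
```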
mask() can seem more intuitive than where(), because the second parameter is assigned to the elements that satisfy the condition in the first parameter (the True elements). pandas.DataFrame also has a mask() method.
print(df.mask(df < 0, -100))
# A B C D
# 0 -100 1 -100 10
# 1 -100 2 -100 -100
# 2 0 3 -100 0
# 3 10 4 -100 10
# 4 20 5 -100 50
If you want to apply a method only to numeric columns of objects containing numbers and strings, as in this example, you can use select_dtypes() as follows.
print(df.select_dtypes(include='number').mask(df < 0, -100))
# A B D
# 0 -100 1 10
# 1 -100 2 -100
# 2 0 3 0
# 3 10 4 10
# 4 20 5 50
It is also possible to concatenate non-numeric columns after processing only numeric columns.
df_mask = df.select_dtypes(include='number').mask(df < 0, -100)
df_mask = pd.concat([df_mask, df.select_dtypes(exclude='number')], axis=1)
print(df_mask.sort_index(axis=1))
# A B C D
# 0 -100 1 a 10
# 1 -100 2 b -100
# 2 0 3 b 0
# 3 10 4 b 10
# 4 20 5 a 50
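Another way to process only the numeric columns, sketched here as an alternative to concatenating, is to write the result back into those columns. This keeps the original column order without sorting. The names demo and num_cols are illustrative, and the condition is built from the numeric columns only so that no strings are compared with numbers.

```python
import pandas as pd

demo = pd.DataFrame({'A': [-20, -10, 0, 10, 20],
                     'B': [1, 2, 3, 4, 5],
                     'C': ['a', 'b', 'b', 'b', 'a']})

# Labels of the numeric columns
num_cols = demo.select_dtypes(include='number').columns

# Build the condition from the numeric columns only, then write back
demo[num_cols] = demo[num_cols].mask(demo[num_cols] < 0, -100)

print(demo['A'].tolist())  # [-100, -100, 0, 10, 20]
print(list(demo.columns))  # ['A', 'B', 'C']
```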
NumPy where() function
NumPy's where() function can also be used to assign values based on conditions.
In the pandas where() and mask() methods, the second parameter specifies only the value assigned to the False elements (where()) or the True elements (mask()); the other elements keep the value of the calling object. Therefore, they cannot select between two values according to a condition, that is, specify different values for True and for False.
In the NumPy where() function, the first argument is the condition, the second argument is the value to assign to the elements where the condition is True, and the third argument is the value to assign to the elements where the condition is False. Scalar values and arrays can be specified for the second and third arguments, and are assigned via broadcast.
numpy.where() returns a NumPy array ndarray.
A one-dimensional numpy.ndarray can be specified as a column of a pandas.DataFrame.
df = pd.DataFrame({
'A': [-20, -10, 0, 10, 20],
'B': [1, 2, 3, 4, 5],
'C': ['a', 'b', 'b', 'b', 'a']})
print(df)
# A B C
# 0 -20 1 a
# 1 -10 2 b
# 2 0 3 b
# 3 10 4 b
# 4 20 5 a
print(np.where(df['B'] % 2 == 0, 'even', 'odd'))
# ['odd' 'even' 'odd' 'even' 'odd']
print(np.where(df['C'] == 'a', df['A'], df['B']))
# [-20 2 3 4 20]
df['D'] = np.where(df['B'] % 2 == 0, 'even', 'odd')
print(df)
# A B C D
# 0 -20 1 a odd
# 1 -10 2 b even
# 2 0 3 b odd
# 3 10 4 b even
# 4 20 5 a odd
df['E'] = np.where(df['C'] == 'a', df['A'], df['B'])
print(df)
# A B C D E
# 0 -20 1 a odd -20
# 1 -10 2 b even 2
# 2 0 3 b odd 3
# 3 10 4 b even 4
# 4 20 5 a odd 20
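Because the third argument can itself be an array, calls to np.where() can be nested to express an if ... elif ... else branch. This is a sketch of one possible way to do it, not the only one; demo and the column name sign are illustrative.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'A': [-20, -10, 0, 10, 20]})

# if A < 0: 'negative' / elif A == 0: 'zero' / else: 'positive'
demo['sign'] = np.where(demo['A'] < 0, 'negative',
                        np.where(demo['A'] == 0, 'zero', 'positive'))

print(demo['sign'].tolist())
# ['negative', 'negative', 'zero', 'positive', 'positive']
```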
If a pandas.DataFrame is specified as the condition, a two-dimensional numpy.ndarray is returned. A pandas.DataFrame can be created from it using the index and columns of the original pandas.DataFrame.
print(np.where(df < 0, df, 100))
# [[-20 100 'a' 'odd' -20]
# [-10 100 'b' 'even' 100]
# [100 100 'b' 'odd' 100]
# [100 100 'b' 'even' 100]
# [100 100 'a' 'odd' 100]]
df_np_where = pd.DataFrame(np.where(df < 0, df, 100),
index=df.index, columns=df.columns)
print(df_np_where)
# A B C D E
# 0 -20 100 a odd -20
# 1 -10 100 b even 100
# 2 100 100 b odd 100
# 3 100 100 b even 100
# 4 100 100 a odd 100
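Since np.where() returns a single ndarray, mixing numeric and string columns produces an array of object dtype. If only the numeric columns matter, one option is to restrict the operation to them so that the result keeps an integer dtype. This is a minimal sketch under that assumption; demo, num, and result are illustrative names.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'A': [-20, -10, 0, 10, 20],
                     'B': [1, 2, 3, 4, 5],
                     'C': ['a', 'b', 'b', 'b', 'a']})

# Work on the numeric columns only
num = demo.select_dtypes(include='number')

# Keep negative values, replace everything else with 100
result = pd.DataFrame(np.where(num < 0, num, 100),
                      index=num.index, columns=num.columns)

print(result['A'].tolist())  # [-20, -10, 100, 100, 100]
print(result['B'].tolist())  # [100, 100, 100, 100, 100]
print(result.dtypes)         # integer dtype for both columns
```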