Chained indexing in pandas and SettingWithCopyWarning explained

1. SettingWithCopyWarning problem

SettingWithCopyWarning is a classic stumbling block in pandas, and one of the few real pitfalls in the library. Consider the following example.

import pandas as pd


def t1():
    data = {
        'name': ['a', 'b', 'c', 'd', 'e', 'f'],
        'num': [1, 2, 3, 4, 5, 6],
        'ss': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
    }
    df = pd.DataFrame(data)
    print(df, "\n")
    df[df['name'] == 'a']['num'] = 10
    print(df)

The intent of the code above is to set num to 10 in the row of df whose name is 'a'. Here is the result of running it:

  name  num   ss
0    a    1  0.1
1    b    2  0.2
2    c    3  0.3
3    d    4  0.4
4    e    5  0.5
5    f    6  0.6 

/Users/wanglei/wanglei/code/python/finance-trade/p2/DfCopyCode.py:19: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df['name'] == 'a']['num'] = 10
  name  num   ss
0    a    1  0.1
1    b    2  0.2
2    c    3  0.3
3    d    4  0.4
4    e    5  0.5
5    f    6  0.6

First, the code triggers a SettingWithCopyWarning. Note that this is a warning, not an error.
Second, the output shows that the assignment did not take effect: df is unchanged.

What is the cause?
To understand SettingWithCopyWarning, you first need to know that some pandas operations return a view of the data, while others return a copy.

[Figure: a view versus a copy]
In the figure from Reference 1, the view df2 (left) is just a subset of the original data df1, while the copy (right) creates a new object df2.

This matters when we try to modify the data:
[Figure: modifying a view versus modifying a copy]
Depending on our requirements, we may want to modify the original df1 (left), or only df2 (right). The warning tells us that the code may not do what we intend: the object being modified may not be the one we meant to modify.
(The figures and text in this part come from Reference 1.)
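As a minimal sketch of why the first example leaves df untouched, the chained assignment can be split into its two steps by hand. The variable name tmp is made up for illustration, and the warnings filter is only there because some pandas versions emit a warning on the second step.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"], "num": [1, 2, 3]})

with warnings.catch_warnings():
    # Silence the warning that some pandas versions emit on step 2.
    warnings.simplefilter("ignore")
    tmp = df[df["name"] == "a"]  # step 1: get; a boolean mask yields a new object
    tmp["num"] = 10              # step 2: set on tmp, not on df

original_nums = df["num"].tolist()
tmp_nums = tmp["num"].tolist()
print(original_nums)  # [1, 2, 3]
print(tmp_nums)       # [10]
```

In the one-line form, tmp is an unnamed temporary that is discarded right after the assignment, so the change is lost.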

How do we fix this? The warning message itself gives the answer: use loc.

def t2():
    data = {
        'name': ['a', 'b', 'c', 'd', 'e', 'f'],
        'num': [1, 2, 3, 4, 5, 6],
        'ss': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
    }
    df = pd.DataFrame(data)
    df.loc[df['name'] == 'a', 'num'] = 10
    print(df)

The output of the code above is:

  name  num   ss
0    a   10  0.1
1    b    2  0.2
2    c    3  0.3
3    d    4  0.4
4    e    5  0.5
5    f    6  0.6

This achieves our intended purpose.

Why is loc guaranteed to do what we expect? Because df.loc[...] = value is a single indexing operation applied directly to df, with no intermediate object in between. Reference 1 explains this clearly and in detail.
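One way to see why loc behaves differently: Python turns df.loc[rows, col] = value into a single __setitem__ call on the indexer of df, passing both selectors together as one tuple. The sketch below spells that call out by hand purely for illustration; in real code you would write the normal bracket form.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"], "num": [1, 2, 3]})
mask = df["name"] == "a"

# Equivalent to: df.loc[mask, "num"] = 10
# One call, made on the indexer of df itself, with (rows, column) as the key.
df.loc.__setitem__((mask, "num"), 10)

loc_result = df["num"].tolist()
print(loc_result)  # [10, 2, 3]
```

Because there is only one call and it targets df, there is no temporary object that could silently absorb the assignment.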

2. Cross-line SettingWithCopyWarning problem

Let's look at another example:

def t3():
    data = {
        'name': ['a', 'b', 'c', 'd', 'e', 'f'],
        'num': [1, 2, 3, 4, 5, 6],
        'ss': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
    }
    df = pd.DataFrame(data)
    subdf = df.loc[df.num <= 3]
    print(subdf)

    subdf.loc[subdf['name'] == 'a', 'num'] = 10
    print(subdf)

The output of the code above is:

  name  num   ss
0    a    1  0.1
1    b    2  0.2
2    c    3  0.3
/Users/wanglei/anaconda3/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py:965: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
  name  num   ss
0    a   10  0.1
1    b    2  0.2
2    c    3  0.3

We can see that the assignment succeeded and subdf was changed, and that loc was used for the assignment. So why is there still a SettingWithCopyWarning?
Chained indexing can happen within one line of code or across two lines. Because subdf was created as the result of a get operation (df.loc[df.num <= 3]), it may or may not be a copy of the original DataFrame, and we cannot be sure unless we inspect it. Assigning through subdf is therefore still chained indexing: one indexing operation on df, followed by another on its result.
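One possible way to inspect such a subset, sketched here as an assumption rather than an official pandas recipe, is to ask NumPy whether the underlying arrays overlap in memory. np.shares_memory is a NumPy function; the variable name shares is made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": list("abcdef"), "num": [1, 2, 3, 4, 5, 6]})
subdf = df.loc[df["num"] <= 3]

# Compare the buffers behind the "num" column of each frame.
shares = bool(np.shares_memory(df["num"].to_numpy(), subdf["num"].to_numpy()))
print(shares)  # False: this boolean-mask selection holds its own data
```

The result applies only to this particular selection. Other kinds of indexing can give a different answer, which is exactly why pandas cannot tell in general whether an assignment through subdf was meant to reach df.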

The fix in this case is to explicitly tell pandas to make a copy when creating the new DataFrame (from Reference 1):

def t4():
    data = {
        'name': ['a', 'b', 'c', 'd', 'e', 'f'],
        'num': [1, 2, 3, 4, 5, 6],
        'ss': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
    }
    df = pd.DataFrame(data)
    subdf = df.loc[df.num <= 3].copy()
    print(subdf)

    subdf.loc[subdf['name'] == 'a', 'num'] = 10
    print(subdf)

The output of the code is:

  name  num   ss
0    a    1  0.1
1    b    2  0.2
2    c    3  0.3
  name  num   ss
0    a   10  0.1
1    b    2  0.2
2    c    3  0.3

Here, the copy method explicitly tells pandas that subdf is a new, independent copy, so the warning no longer appears.
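The two possible intents discussed earlier (change only the subset, or change the original) can be written out side by side. This is a small sketch using a trimmed version of the same sample data; the variable names holding the results are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": list("abcdef"), "num": [1, 2, 3, 4, 5, 6]})

# Intent A: change only the subset. Copy first, then assign with loc.
subdf = df.loc[df["num"] <= 3].copy()
subdf.loc[subdf["name"] == "a", "num"] = 10

sub_nums = subdf["num"].tolist()
nums_after_subset_edit = df["num"].tolist()
print(sub_nums)                # [10, 2, 3]
print(nums_after_subset_edit)  # [1, 2, 3, 4, 5, 6]

# Intent B: change the original. Put both conditions in one loc call on df.
df.loc[(df["num"] <= 3) & (df["name"] == "a"), "num"] = 10

nums_after_original_edit = df["num"].tolist()
print(nums_after_original_edit)  # [10, 2, 3, 4, 5, 6]
```

With the explicit copy, the edit to subdf never reaches df. When the original is the target, a single loc call on df avoids the intermediate subset altogether.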

Reference 1 explains SettingWithCopyWarning in detail, and reading it carefully is strongly recommended.

References:
1. https://zhuanlan.zhihu.com/p/41202576

Origin blog.csdn.net/bitcarmanlee/article/details/131210540