A Complete Guide to Handling Missing Data in pandas

This article introduces several common methods for handling missing values in data.

1. Missing value type


In pandas, missing data is displayed as NaN. There are three ways to represent missing values: np.nan, None, and pd.NA.

1. np.nan

A missing value has one peculiar property (a common pitfall): it is not equal to any value, not even itself. Comparing nan with any value, including another nan, returns False.

np.nan == np.nan
>> False

Because of this property, after a dataset is read in, the default missing value is np.nan regardless of a column's data type.

Since nan in NumPy is a floating-point type, an integer column that contains nan is converted to float; a character column cannot be converted to float, so it is promoted to the object dtype ('O').

type(np.nan)
>> float
pd.Series([1,2,3]).dtype
>> dtype('int64')
pd.Series([1,np.nan,3]).dtype
>> dtype('float64')

Beginners doing data processing are often confused when they encounter the object dtype: a column that is obviously character data seems to change type after import. In fact, this is caused by missing values.
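A quick sketch of this behavior (the sample series is illustrative): a string column that contains a NaN still reports the object dtype, and the NaN stored inside it is a plain Python float.

```python
import numpy as np
import pandas as pd

# A string column with a missing value: the dtype stays object,
# and the missing element itself is a float (np.nan).
s = pd.Series(['x', 'y', np.nan])
print(s.dtype)      # object
print(type(s[2]))   # <class 'float'>
```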

In addition, there is a dedicated missing value for time series, represented by NaT. It is a pandas built-in type that can be regarded as the time-series version of np.nan, and like nan it is not equal to itself.

s_time = pd.Series([pd.Timestamp('20220101')]*3)
s_time
>> 0 2022-01-01
   1 2022-01-01
   2 2022-01-01
   dtype:datetime64[ns]
-----------------
s_time[2] = pd.NaT
s_time
>> 0 2022-01-01
   1 2022-01-01
   2 NaT
   dtype:datetime64[ns]

2. None

Another representation is None, which behaves slightly better than nan: at least it is equal to itself.

None == None
>> True

When passed into a numeric column, None is automatically converted to np.nan.

type(pd.Series([1,None])[1])
>> numpy.float64

Only in an object column does None remain unchanged. So unless you deliberately create a None, it will basically never appear on its own in pandas, and you will rarely see one.

type(pd.Series([1,None],dtype='O')[1])
>> NoneType

3. The pd.NA scalar

In pandas 1.0 and later, a scalar pd.NA dedicated to representing missing values was introduced; it can represent missing integers, missing booleans, and missing strings. This feature is currently experimental.

The developers noticed that using a different missing-value representation for each data type is messy, and pd.NA exists to unify them. The goal of pd.NA is to provide a missing-value indicator that can be used consistently across data types (instead of np.nan, None, or NaT depending on the case).

s_new = pd.Series([1, 2], dtype="Int64")
s_new
>> 0   1
   1   2
   dtype: Int64
-----------------
s_new[1] = pd.NA
s_new
>> 0    1
   1  <NA>
   dtype: Int64

The same applies to the object, boolean, and character types: assigning pd.NA does not change the original dtype, which solves the annoyance of the dtype changing at every turn.
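A sketch of this dtype preservation, using the nullable "boolean" and "string" extension dtypes (assumed available, pandas >= 1.0):

```python
import pandas as pd

# Nullable boolean column: inserting pd.NA keeps the boolean dtype
# instead of upcasting to object.
s_bool = pd.Series([True, False], dtype="boolean")
s_bool[0] = pd.NA
print(s_bool.dtype)   # boolean

# Nullable string column: likewise, the string dtype is preserved.
s_str = pd.Series(["a", "b"], dtype="string")
s_str[1] = pd.NA
print(s_str.dtype)    # string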

The following are examples of common arithmetic and comparison operations involving pd.NA:

##### Arithmetic operations
# addition
pd.NA + 1
>> <NA>
-----------
# multiplication
"a" * pd.NA
>> <NA>
-----------
# the following two expressions both return 1
pd.NA ** 0
>> 1
-----------
1 ** pd.NA
>> 1

##### Comparison and other operations
pd.NA == pd.NA
>> <NA>
-----------
pd.NA < 2.5
>> <NA>
-----------
np.log(pd.NA)
>> <NA>
-----------
np.add(pd.NA, 1)
>> <NA>

2. Detecting missing values

Having covered the forms missing values take, we need to know how to detect them. For a dataframe, the main methods are isnull() and isna(); both directly return True/False boolean values. They can be applied to the whole dataframe or to a single column.

df = pd.DataFrame({
      'A':['a1','a1','a2','a3'],
      'B':['b1',None,'b2','b3'],
      'C':[1,2,3,4],
      'D':[5,None,9,10]})
# treat infinities as missing values
pd.options.mode.use_inf_as_na = True

1. Detecting missing values in the whole dataframe

df.isnull()
>> A B C D
0 False False False False
1 False True False True
2 False False False False
3 False False False False

2. Detecting missing values in a single column

df['C'].isnull()
>> 0    False
   1    False
   2    False
   3    False
Name: C, dtype: bool

To select non-missing values, use notna(); usage is the same, and the result is the opposite.
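As a minimal sketch, applying notna() to column B of the same example dataframe:

```python
import numpy as np
import pandas as pd

# Same example dataframe as in the article.
df = pd.DataFrame({'A': ['a1', 'a1', 'a2', 'a3'],
                   'B': ['b1', None, 'b2', 'b3'],
                   'C': [1, 2, 3, 4],
                   'D': [5, np.nan, 9, 10]})
# notna() mirrors isnull(): True where a value is present.
print(df['B'].notna())
# 0     True
# 1    False
# 2     True
# 3     True
# Name: B, dtype: bool
```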

3. Missing value statistics

1. Missing columns

Generally, we compute missing-value statistics per column to see how many values each column is missing; if a column's missing rate is too high, it can be deleted or interpolated. The isnull() result above can be followed directly by .sum(). The axis parameter defaults to 0: 0 gives per-column counts, 1 gives per-row counts.

## per-column missing counts
df.isnull().sum(axis=0)

2. Missing rows

But in many cases we also need to check missing values per row. For example, a row may have hardly any values at all; if such a sample enters the model, it will cause great interference. Therefore, the missing rates of both rows and columns are usually checked.

The operation is simple: just set axis=1 in sum().

## per-row missing counts
df.isnull().sum(axis=1)

3. Missing rate

Sometimes we want not just the count of missing values but also their proportion, i.e. the missing rate. You might think of dividing the counts above by the total number of rows, but there is a small trick that does it in one step.

## missing rate
df.isnull().sum(axis=0)/df.shape[0]

## missing rate (in one step)
df.isnull().mean()

4. Missing value screening

Screening is done together with loc. Filtering rows and columns with missing values works as follows:

# select rows that contain missing values
df.loc[df.isnull().any(axis=1)]
>> A B C D
1 a1 None 2 NaN
-----------------
# select columns that contain missing values
df.loc[:,df.isnull().any()]
>> B D
0 b1 5.0
1 None NaN
2 b2 9.0
3 b3 10.0

To find rows and columns with no missing values, negate the expression with ~:

df.loc[~(df.isnull().any(axis=1))]
>> A B C D
0 a1 b1 1 5.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0

The examples above use any to filter out anything with at least one missing value. You can also use all to check whether every value is missing, for rows as well as columns. If an entire column or row is missing, that variable or sample loses its analytical meaning and can be considered for deletion.
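A sketch of the all() variant in the same loc-filtering style (the dataframe here is a made-up example with one fully missing row and column):

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 and column C are entirely missing.
df = pd.DataFrame({'A': ['a1', np.nan, 'a2'],
                   'B': [1, np.nan, 3],
                   'C': [np.nan, np.nan, np.nan]})
# rows where every value is missing (deletion candidates)
print(df.loc[df.isnull().all(axis=1)])
# columns where every value is missing
print(df.loc[:, df.isnull().all()])
```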

5. Missing value filling

Generally, there are two ways to deal with missing values: delete them directly, or keep and fill them. The filling method, fillna, is introduced first.

# fill all missing values in the dataframe with 0
df.fillna(0)
>> A B C D
0 a1 b1 1 5.0
1 a1 0 2 0.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0
--------------
# fill missing values in column D with -999
df.D.fillna('-999')
>> 0       5
   1    -999
   2       9
   3      10
Name: D, dtype: object

The method is very simple, but you need to pay attention to some parameters when using it.

  • inplace: set fillna(0, inplace=True) to make the fill take effect in place, modifying the original dataframe.

  • method: set the method parameter for forward or backward filling: pad/ffill fill forward, bfill/backfill fill backward, e.g. df.fillna(method='ffill'), which can also be abbreviated as df.ffill().

df.ffill()
>> A B C D
0 a1 b1 1 5.0
1 a1 b1 2 5.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0

The missing values are filled from the previous value in each column (row 1 of column B and row 1 of column D).

Besides filling from neighboring values, you can also fill with the mean of the whole column: for example, the mean of the other non-missing values in column D is 8, which is used to fill its missing value.

df.D.fillna(df.D.mean())
>> 0     5.0
   1     8.0
   2     9.0
   3    10.0
Name: D, dtype: float64

6. Missing value deletion

Deleting missing values is not one-size-fits-all: whether to delete everything or only the rows/columns with a high missing rate depends on your tolerance. Real data inevitably has missing values; this cannot be avoided, and in some cases the missingness itself carries meaning, so judge each situation on its own.

1. Delete all directly

# drop every row that has any missing value
df.dropna()
>> A B C D
0 a1 b1 1 5.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0

2. Row-wise deletion

# drop rows with missing values (axis=0, the default)
df.dropna(axis=0)
>> A B C D
0 a1 b1 1 5.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0

3. Column-wise deletion

# drop columns with missing values
df.dropna(axis=1)
>> A C
0 a1 1
1 a1 2
2 a2 3
3 a3 4
-------------
# drop based on missing values in the specified columns; column C has no missing values, so nothing changes
df.dropna(subset=['C'])
>> A B C D
0 a1 b1 1 5.0
1 a1 None 2 NaN
2 a2 b2 3 9.0
3 a3 b3 4 10.0

4. Delete by missing rate

This can be done with the filtering approach, e.g. deleting columns whose missing rate is greater than 0.1 (that is, keeping those below 0.1).

df.loc[:,df.isnull().mean(axis=0) < 0.1]
>> A C
0 a1 1
1 a1 2
2 a2 3
3 a3 4
-------------
# drop rows with a missing rate greater than 0.1
df.loc[df.isnull().mean(axis=1) < 0.1]
>> A B C D
0 a1 b1 1 5.0
2 a2 b2 3 9.0
3 a3 b3 4 10.0

7. Missing values in computations

If missing values are left unprocessed, what logic do computations involving them follow?

Let's look at how missing values behave under various operations.

1. Addition

df
>>A B C D
0 a1 b1 1 5.0
1 a1 None 2 NaN
2 a2 b2 3 9.0
3 a3 b3 4 10.0
---------------
# sum all columns
df.sum()
>> A    a1a1a2a3
   C          10
   D          24

As you can see, summation ignores missing values.
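Like cumsum below, sum() also accepts skipna; a minimal sketch using the values of column D:

```python
import numpy as np
import pandas as pd

# sum() skips NaN by default; skipna=False propagates it instead.
s = pd.Series([5, np.nan, 9, 10])
print(s.sum())               # 24.0
print(s.sum(skipna=False))   # nan
```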

2. Cumulative sum

# cumulative sum of column D
df.D.cumsum()
>> 0     5.0
   1     NaN
   2    14.0
   3    24.0
Name: D, dtype: float64
---------------
df.D.cumsum(skipna=False)
>> 0    5.0
   1    NaN
   2    NaN
   3    NaN
Name: D, dtype: float64

cumsum ignores NA by default, but the NaN positions remain in the result; skipna=False can be used to stop skipping, so that once a missing value is hit, every subsequent result is also missing.

3. Count

# count non-missing values per column
df.count()
>> A    4
   B    3
   C    4
   D    3
dtype: int64

Missing values are not included in the count.
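count() also works per row with axis=1, giving the number of non-missing values in each sample; a sketch using the article's example dataframe:

```python
import numpy as np
import pandas as pd

# Same example dataframe as in the article.
df = pd.DataFrame({'A': ['a1', 'a1', 'a2', 'a3'],
                   'B': ['b1', None, 'b2', 'b3'],
                   'C': [1, 2, 3, 4],
                   'D': [5, np.nan, 9, 10]})
# per-row counts: row 1 is missing B and D, so it counts only 2 values
print(df.count(axis=1))
# 0    4
# 1    2
# 2    4
# 3    4
```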

4. Aggregation and grouping

df.groupby('B').sum()
>> C D
B  
b1 1 5.0
b2 3 9.0
b3 4 10.0
---------------
df.groupby('B',dropna=False).sum()
>> C D
B  
b1 1 5.0
b2 3 9.0
b3 4 10.0
NaN 2 0.0

Missing values are ignored by default during grouping. To include them as their own group, set dropna=False. The same parameter appears elsewhere, e.g. in value_counts, since sometimes you need to see the count of missing values too.
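A small sketch of the value_counts case mentioned above, using the values of column B:

```python
import pandas as pd

# value_counts() drops NaN by default; dropna=False counts it as well.
s = pd.Series(['b1', None, 'b2', 'b2'])
print(s.value_counts())              # b2: 2, b1: 1
print(s.value_counts(dropna=False))  # b2: 2, b1: 1, NaN: 1
```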

The above covers the common operations on missing values, from understanding their three representations to detecting, counting, processing, and computing with them.

Origin blog.csdn.net/qq_34160248/article/details/131623153