This article summarizes the dropna() and fillna() functions that pandas provides for dealing with null values (missing values) in Python data, and walks through examples to show how the two functions are used.

Introduction: When processing data with Python, we often run into incomplete data. When there are null values, we may want to delete the row or column where the null value is located, or set the null value to a certain value. In both cases you can use the dropna and fillna functions to handle the null values.

Contents

(1) Usage of the dropna() function

(2) Usage of the fillna() function


(1) Usage of the dropna() function

dropna(axis,how,thresh,subset,inplace)

Parameter Description:

axis: This parameter defaults to 0. With axis=0, rows that contain the null value NaN are deleted; with axis=1, columns that contain a null value are deleted.

how: The default value of this parameter is 'any', which deletes a row or column as soon as it contains at least one null value; whether rows or columns are deleted depends on whether you set the axis parameter to 0 or 1. With how='all', a row or column is deleted only when every value in it is NaN. If a row or column is not entirely null, it is not deleted.

thresh: This parameter is an integer x giving the minimum number of non-null values a row or column must have in order to be kept. For example, with x=2, every row or column with at least 2 non-null values is kept and the rest are deleted; whether this applies to rows or columns depends on the axis parameter. Recent pandas versions do not allow thresh and how to be passed together.

subset: This parameter restricts the null check to the listed labels on the other axis. With axis=0 (deleting rows), subset is a list of column labels, and a row is deleted only if it has a null value in one of those columns. With axis=1 (deleting columns), subset is a list of row labels, and a column is deleted only if it has a null value in one of those rows.

inplace: This parameter defaults to False, which means the null values are processed on a copy of the original data and the result is returned, so the original data is not affected in any way. If inplace is set to True, the processing is done on the original data, which is modified directly.

With the parameters covered, let's look at an example to see how each of them affects the function.

Example:

First create a 4x3 matrix, then set some null values at the specified positions, and then process the data containing null values.

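import numpy as np
import pandas as pd

# 4x3 frame holding 0..11; float dtype so the columns can store NaN directly
dataSet = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3), index=[0, 1, 2, 3], columns=['a', 'b', 'c'])
dataSet.iloc[1, [1]] = np.nan      # row 1, column b
dataSet.iloc[2, [1, 2]] = np.nan   # row 2, columns b and c
print(dataSet)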

In the created data, the row labels are 0, 1, 2, and 3 and the column labels are a, b, and c. I chose these labels myself; they are not part of the matrix data. Column b is NaN in rows 1 and 2, and column c is NaN in row 2.

Note: Any parameter we do not pass keeps its default value; passing a parameter replaces the default with the value we want, which changes what the function does. Each parameter was explained above. Next, let's look at how the results differ as the parameters change.

Call the function dataSet.dropna() with all parameters at their default values: the rows where the null values are located, labelled 1 and 2, are deleted, leaving rows 0 and 3.
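Here is a minimal sketch of the default call. It rebuilds the sample data from above; creating it with a float dtype is my own assumption, so that NaN can be stored without any dtype conversion.

```python
import numpy as np
import pandas as pd

# Sample data from the article (float dtype is an assumption of this sketch).
dataSet = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3),
                       index=[0, 1, 2, 3], columns=['a', 'b', 'c'])
dataSet.iloc[1, 1] = np.nan        # row 1, column b
dataSet.iloc[2, [1, 2]] = np.nan   # row 2, columns b and c

# Defaults: axis=0 and how='any', so every row holding a NaN is dropped.
rows_kept = dataSet.dropna()
print(rows_kept.index.tolist())    # [0, 3]
```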

Call the function dataSet.dropna(axis=1) with the other parameters at their defaults: the columns where the null value NaN is located, b and c, are deleted, leaving only column a.
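The same call as a small sketch, again on a float-typed rebuild of the sample data (the float dtype is my assumption):

```python
import numpy as np
import pandas as pd

# Sample data from the article (float dtype is an assumption of this sketch).
dataSet = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3),
                       index=[0, 1, 2, 3], columns=['a', 'b', 'c'])
dataSet.iloc[1, 1] = np.nan
dataSet.iloc[2, [1, 2]] = np.nan

# axis=1 drops columns instead of rows; b and c both contain NaN.
cols_kept = dataSet.dropna(axis=1)
print(cols_kept.columns.tolist())  # ['a']
```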

Call the function dataSet.dropna(axis=1,how='all'): this deletes only the columns that are entirely NaN. None of the columns in our data is entirely NaN, so the result is identical to the original data.
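To make the effect of how='all' visible, the sketch below also adds a column d that is entirely NaN. That column is hypothetical and not part of the original data; the float dtype is likewise my assumption.

```python
import numpy as np
import pandas as pd

# Sample data from the article (float dtype is an assumption of this sketch).
dataSet = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3),
                       index=[0, 1, 2, 3], columns=['a', 'b', 'c'])
dataSet.iloc[1, 1] = np.nan
dataSet.iloc[2, [1, 2]] = np.nan

# No column is entirely NaN, so nothing is dropped.
unchanged = dataSet.dropna(axis=1, how='all')
print(unchanged.columns.tolist())   # ['a', 'b', 'c']

# Hypothetical extra column 'd' that is NaN everywhere: only it is dropped.
with_empty = dataSet.copy()
with_empty['d'] = np.nan
cleaned = with_empty.dropna(axis=1, how='all')
print(cleaned.columns.tolist())     # ['a', 'b', 'c']
```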

Call the function dataSet.dropna(axis=1,thresh=3): a column is kept only if it has at least 3 non-null values, otherwise it is deleted. Column b has only 2 non-null values in the original data, which is less than 3, so column b is deleted. Column a (4 non-null values) and column c (3 non-null values) are kept.
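A short sketch of the thresh rule, counting the non-null values per column first so the comparison with 3 is explicit. The float dtype is my assumption.

```python
import numpy as np
import pandas as pd

# Sample data from the article (float dtype is an assumption of this sketch).
dataSet = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3),
                       index=[0, 1, 2, 3], columns=['a', 'b', 'c'])
dataSet.iloc[1, 1] = np.nan
dataSet.iloc[2, [1, 2]] = np.nan

# Non-null values per column, in the order a, b, c.
non_null_counts = dataSet.notna().sum().tolist()
print(non_null_counts)              # [4, 2, 3]

# Keep only the columns with at least 3 non-null values.
kept = dataSet.dropna(axis=1, thresh=3)
print(kept.columns.tolist())        # ['a', 'c']
```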

Call the function dataSet.dropna(axis=0, subset=['b']): with axis=0, subset lists the columns to check, so a row is deleted only if it has a null value in column b, and rows without a null value in b are left alone. Here rows 1 and 2 are deleted. Note that subset always names labels on the other axis: with axis=1 it must list row labels, so dataSet.dropna(axis=1, subset=['b']) fails with a KeyError, because 'b' is not a row label.
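The sketch below shows how subset pairs with axis: column labels when rows are being dropped, row labels when columns are being dropped. The subset=['c'] and subset=[1] calls are extra illustrations of mine, and the float dtype is an assumption.

```python
import numpy as np
import pandas as pd

# Sample data from the article (float dtype is an assumption of this sketch).
dataSet = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3),
                       index=[0, 1, 2, 3], columns=['a', 'b', 'c'])
dataSet.iloc[1, 1] = np.nan
dataSet.iloc[2, [1, 2]] = np.nan

# Dropping rows (axis=0): subset names the columns to inspect.
by_b = dataSet.dropna(subset=['b'])
by_c = dataSet.dropna(subset=['c'])
print(by_b.index.tolist())          # [0, 3]
print(by_c.index.tolist())          # [0, 1, 3]

# Dropping columns (axis=1): subset names the rows to inspect.
by_row1 = dataSet.dropna(axis=1, subset=[1])
print(by_row1.columns.tolist())     # ['a', 'c']

# A column label is not a valid subset when axis=1.
try:
    dataSet.dropna(axis=1, subset=['b'])
    raised = False
except KeyError:
    raised = True
print(raised)                       # True
```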

Call the function dataSet.dropna(inplace=True): when inplace is True, the operation is applied to the original data and no copy is made. All the calls above used the default inplace=False, so if you print dataSet after any of them you will see that it has not changed. With inplace=True the original data itself is changed, and dataSet now contains only rows 0 and 3.
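A small sketch comparing the two modes by checking the shape of dataSet before and after; the float dtype is my assumption.

```python
import numpy as np
import pandas as pd

# Sample data from the article (float dtype is an assumption of this sketch).
dataSet = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3),
                       index=[0, 1, 2, 3], columns=['a', 'b', 'c'])
dataSet.iloc[1, 1] = np.nan
dataSet.iloc[2, [1, 2]] = np.nan

# Default inplace=False: a new frame is returned, dataSet stays as it was.
copy_result = dataSet.dropna()
shape_before = dataSet.shape
print(shape_before)                 # (4, 3)

# inplace=True: dataSet itself loses the rows that contain NaN.
dataSet.dropna(inplace=True)
shape_after = dataSet.shape
print(shape_after)                  # (2, 3)
```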

Summary: That covers the usage of the dropna function. Try changing the parameters yourself and see the effect; hands-on practice will make it stick.

(2) Usage of the fillna() function

fillna(value,method,axis,inplace,limit)

These parameters look very similar to those of the dropna function. The meaning of inplace is the same as in dropna, so it is not explained again here. value is what the null values are replaced with: either a single value or a dictionary that maps column labels to values.

axis: The axis along which values are filled when a method is used. With axis=0, the default, values are propagated down each column; with axis=1, they are propagated across each row. Note the difference from dropna, where axis=0 acts on rows.

method: This parameter is the filling method. With 'ffill', the previous non-null value is copied into the null position; with 'bfill', the next non-null value is copied into it. If you do not need this parameter, simply leave it out. pandas 2.1 deprecated this parameter in favour of the dedicated ffill() and bfill() methods, which fill in the same way.

limit: This parameter limits the number of null values to be filled. For example, if a column has two null values and limit=1, only one of them is filled and the other is left as NaN.
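To see what axis changes, here is a sketch that forward-fills the sample data in both directions. It uses DataFrame.ffill(), which I am swapping in for fillna(method='ffill') because the method parameter is deprecated in recent pandas; the float dtype is also my assumption.

```python
import numpy as np
import pandas as pd

# Sample data from the article (float dtype is an assumption of this sketch).
dataSet = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3),
                       index=[0, 1, 2, 3], columns=['a', 'b', 'c'])
dataSet.iloc[1, 1] = np.nan
dataSet.iloc[2, [1, 2]] = np.nan

# axis=0: each NaN takes the value above it in the same column.
down = dataSet.ffill(axis=0)
print(down['b'].tolist())           # [1.0, 1.0, 1.0, 10.0]

# axis=1: each NaN takes the value to its left in the same row.
across = dataSet.ffill(axis=1)
print(across.loc[1].tolist())       # [3.0, 3.0, 5.0]
print(across.loc[2].tolist())       # [6.0, 6.0, 6.0]
```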

Example: We reuse the data matrix created earlier. As before, you need to import the numpy and pandas libraries; see the program above.

Call the function dataSet.fillna(100): with no other parameters, a single numerical value replaces every null value NaN in the data.
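A minimal sketch of the scalar fill, on a float-typed rebuild of the sample data (the float dtype is my assumption):

```python
import numpy as np
import pandas as pd

# Sample data from the article (float dtype is an assumption of this sketch).
dataSet = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3),
                       index=[0, 1, 2, 3], columns=['a', 'b', 'c'])
dataSet.iloc[1, 1] = np.nan
dataSet.iloc[2, [1, 2]] = np.nan

# Every NaN, in every column, becomes 100.
filled = dataSet.fillna(100)
print(filled['b'].tolist())         # [1.0, 100.0, 100.0, 10.0]
print(filled['c'].tolist())         # [2.0, 5.0, 100.0, 11.0]
```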

You can also pass a dictionary that maps column labels to fill values, so that the null values in each listed column are replaced with the value chosen for that column.
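As a sketch of the dictionary form: the fill values 0 and 100 below are arbitrary choices of mine for illustration, and the float dtype is an assumption.

```python
import numpy as np
import pandas as pd

# Sample data from the article (float dtype is an assumption of this sketch).
dataSet = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3),
                       index=[0, 1, 2, 3], columns=['a', 'b', 'c'])
dataSet.iloc[1, 1] = np.nan
dataSet.iloc[2, [1, 2]] = np.nan

# Keys are column labels; each column gets its own replacement value.
filled = dataSet.fillna({'b': 0, 'c': 100})
print(filled['b'].tolist())         # [1.0, 0.0, 0.0, 10.0]
print(filled['c'].tolist())         # [2.0, 5.0, 100.0, 11.0]
```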

Call the function dataSet.fillna(method='ffill'): axis is not passed here, so it defaults to 0 and the fill runs down each column, where every null value NaN is replaced by the nearest non-null value above it in the same column. Here the previous non-null values are 1.0 in column b and 5.0 in column c, so the nulls in b become 1.0 and the null in c becomes 5.0.
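The sketch below reproduces this forward fill with DataFrame.ffill(). I am using it in place of fillna(method='ffill') because the method parameter is deprecated in recent pandas; the float dtype is also my assumption.

```python
import numpy as np
import pandas as pd

# Sample data from the article (float dtype is an assumption of this sketch).
dataSet = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3),
                       index=[0, 1, 2, 3], columns=['a', 'b', 'c'])
dataSet.iloc[1, 1] = np.nan
dataSet.iloc[2, [1, 2]] = np.nan

# Forward fill down each column (axis=0 by default).
forward = dataSet.ffill()
print(forward['b'].tolist())        # [1.0, 1.0, 1.0, 10.0]
print(forward['c'].tolist())        # [2.0, 5.0, 5.0, 11.0]
```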

Call the function dataSet.fillna(method='bfill', limit=1): the method here is 'bfill' and axis is again left at its default of 0, so the fill runs up each column, where a null value NaN is replaced by the nearest non-null value below it in the same column. limit=1 means at most one consecutive null value is filled, and the rest are left alone. In column b only row 2 is filled (with 10.0) and row 1 stays NaN; in column c, row 2 is filled with 11.0.
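Here is the limited backward fill as a sketch, written with DataFrame.bfill(limit=1) rather than fillna(method='bfill', limit=1) because the method parameter is deprecated in recent pandas. The float dtype is my assumption.

```python
import numpy as np
import pandas as pd

# Sample data from the article (float dtype is an assumption of this sketch).
dataSet = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3),
                       index=[0, 1, 2, 3], columns=['a', 'b', 'c'])
dataSet.iloc[1, 1] = np.nan
dataSet.iloc[2, [1, 2]] = np.nan

# Backward fill up each column, filling at most one consecutive NaN.
backward = dataSet.bfill(limit=1)
print(backward['b'].tolist())       # [1.0, nan, 10.0, 10.0]
print(backward['c'].tolist())       # [2.0, 5.0, 11.0, 11.0]
```

Column b has two NaN values in a row, so the limit leaves the one farther from the next valid value unfilled.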

We will not demonstrate the inplace parameter here. It works the same way as in the dropna function above: by default a modified copy of the data is returned and the original data is not changed.
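For completeness, a brief sketch of inplace with fillna, counting the nulls left in dataSet before and after. The fill value 0 is an arbitrary choice of mine, and the float dtype is an assumption.

```python
import numpy as np
import pandas as pd

# Sample data from the article (float dtype is an assumption of this sketch).
dataSet = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3),
                       index=[0, 1, 2, 3], columns=['a', 'b', 'c'])
dataSet.iloc[1, 1] = np.nan
dataSet.iloc[2, [1, 2]] = np.nan

# Default inplace=False: the filled frame is returned, dataSet keeps its NaN.
returned_copy = dataSet.fillna(0)
nulls_before = int(dataSet.isna().sum().sum())
print(nulls_before)                 # 3

# inplace=True: dataSet itself is filled.
dataSet.fillna(0, inplace=True)
nulls_after = int(dataSet.isna().sum().sum())
print(nulls_after)                  # 0
```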

Summary: That covers the usage of these two functions. When you need to deal with null values while processing data, they are very useful for getting the data you want without null values.

Every word of this was typed out by hand, so if it helped, please leave a like as encouragement, and please credit the source when reposting.

If you have any questions, feel free to get in touch.

 


Origin blog.csdn.net/BaoITcore/article/details/123847927