Explaining Data Thoroughly (Part 3): Data Cleaning

I. Introduction

In the previous two articles, we learned that "data" is a large system (as shown in the figure below); we used the example of a vegetable market to explain what a data source is, and the example of grocery shopping to explain the steps of data collection. Today, Xiao Chen will mainly explain how to pick and wash the vegetables after we have "bought" them, that is, the process of data cleaning.
[Figure: overview of the data system]

II. Data cleaning (picking and washing the vegetables)

To understand how data cleaning works step by step, we first need to clarify the concept of data cleaning.

1. Basic concepts and importance of data cleaning

Data cleaning: the process of re-checking and verifying data with the aim of removing duplicate information, correcting existing errors, and ensuring data consistency.

The above is the definition of data cleaning from Baidu Baike. In my own understanding, data cleaning is the process of turning "dirty data" into "high-quality, usable data".

In short, data cleaning is a crucial step in data preprocessing, and the quality of the cleaned data largely determines the accuracy of the subsequent data analysis results.

2. Objects and methods of data cleaning

Having established the importance of data cleaning, let's further identify the objects that need to be cleaned.

I personally divide the objects of data cleaning into two rough categories, which are introduced one by one below.

1) Avoidable dirty data

Avoidable dirty data, as the name implies, is dirty data that can be corrected into valid data through manual modification, or avoided altogether.

This kind of dirty data is very common in practice, for example errors caused by inconsistent naming, spelling mistakes, input errors, null values, and so on.

For example, the figure below shows data from a used-car platform (file name: car-data.xlsx). You can see obvious dirty data in the "4-year resale" column (the values in this column should be Arabic numerals, in units of ten thousand); this is avoidable dirty data caused by input errors.

[Figure: used-car data with dirty values in the "4-year resale" column]

Now that we know the types of such dirty data, how can we detect and correct these "correctable" data errors promptly after we get the data? Here we take Excel and Python as examples, using the used-car data set above.

In Excel, "avoidable" dirty data can be inspected with the filter function: select the "4-year resale" column and apply a filter, and you can detect 2 NaN (null) values and 2 incorrectly entered values.

[Figure: Excel filter applied to the "4-year resale" column]

In Python, you can use data.describe() to view the basic statistics of the target column:

[Figure: output of data.describe()]
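For reference, here is a minimal Python sketch of this step, assuming pandas and the file and column names taken from the screenshots (car-data.xlsx and "4-year resale"):

import pandas as pd

# Load the used-car data set (file name from the article; the column name
# "4-year resale" is an assumption based on the screenshots).
data = pd.read_excel("car-data.xlsx")

# describe() summarizes each column; a "numeric" column that comes back as
# dtype object is a hint that it contains typos or stray text.
print(data.describe(include="all"))
print(data["4-year resale"].unique())  # inspect the raw values directly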

After viewing the corresponding information, if you find that English letter case is inconsistent, you can use data['car-data'].str.upper(); if extra spaces were entered, use data['car-data'].str.strip().
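As a minimal sketch of this cleanup (the column name 'car-data' simply follows the snippet above; substitute the actual text column you are cleaning):

# Strip stray whitespace, then unify letter case in a text column.
data['car-data'] = data['car-data'].str.strip()
data['car-data'] = data['car-data'].str.upper()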

2) Unavoidable dirty data

Unavoidable dirty data mainly takes the form of outliers, duplicate values, null values, etc.; handling this kind of dirty data requires some statistical knowledge for detection and filling. Here are some specific examples.

Outliers:

A commonly used detection method is the 3σ rule: assuming a set of measured data contains only random errors, compute the standard deviation and determine an interval at a certain probability; any value falling outside this interval is considered not a random error but a gross error, and the record containing it should be removed. Generally, this interval is the mean plus or minus three standard deviations, hence the name "3σ rule".

As shown below, the "vehicle width" column of the used-car data needs to be checked for outliers:
[Figure: checking the "vehicle width" column for outliers]
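A minimal sketch of the 3σ rule in pandas (the column name "vehicle width" follows the example above and is an assumption about the actual data set):

# 3σ rule: flag values more than three standard deviations from the mean.
col = data["vehicle width"]
mean, std = col.mean(), col.std()
lower, upper = mean - 3 * std, mean + 3 * std
outliers = data[(col < lower) | (col > upper)]
print(outliers)  # inspect the flagged rows before deciding to drop them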

Duplicate values:

As shown below, after getting the data we first need to check whether there are duplicate records; if there are, drop_duplicates() can be used in Python to delete them, so as to avoid double counting, which would reduce the accuracy of the data.

[Figure: sample records containing duplicates]

As shown above, the 5th and 9th records are identical except for the id field, and we need to remove this kind of duplicate according to the characteristics of the data. Observing the data, first_name and last_name together uniquely identify a record, so we can use the code below to eliminate duplicates based on these columns.

df.drop_duplicates(['first_name', 'last_name'], inplace=True)
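Putting the check and the deletion together, a minimal sketch (the handling of the id column is an assumption about the data layout shown above):

# Count rows that are fully duplicated once the id column is ignored,
# since id differs even for otherwise identical records.
print(df.drop(columns=["id"]).duplicated().sum())

# Then drop duplicates using first_name + last_name as the unique key.
df = df.drop_duplicates(["first_name", "last_name"])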

Null values:

As for null values, Python provides many ways to detect and return them, which are introduced one by one below.

data.isnull() and data.notnull() return True or False for each value, so we can see where the corresponding indicator is null, and use the sum() function to count the total number of null values.
[Figure: output of data.isnull() and the null-value counts]
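A minimal sketch of counting null values per column (the column name is again assumed from the earlier screenshots):

# isnull() gives a True/False mask; summing it counts the missing values.
print(data.isnull().sum())                    # null count for every column
print(data["4-year resale"].isnull().sum())   # null count for one column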

Faced with the various kinds of null values above, what should we do? Delete a single row? Delete several? Fill with the mean or the median?

In fact, all of the above operations are common ways of dealing with null values; what we need to master is using the right method in the right scenario. Here are some common null-value processing scenarios~

Scenario 1: more than half or even all of the values in a dimension are null. From the perspective of indicator validity, consider whether to delete the corresponding indicator.

Command: data.dropna(how='all') deletes rows whose values are all null; adding axis=1, i.e. data.dropna(axis=1, how='all'), deletes columns (indicators) that are entirely null.
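A minimal sketch, assuming pandas; the thresh parameter is one way to express the "more than half null" criterion mentioned above:

# Drop columns (indicators) whose values are all null.
data = data.dropna(axis=1, how="all")

# Or keep only columns that have at least half of their values present.
data = data.dropna(axis=1, thresh=len(data) // 2)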

Scenario 2: the dimension has some null values (but not many) and the overall sample size is large. Because there are plenty of samples, you can simply filter out the rows containing NaN values and keep only the complete ones (code as follows; rows involving NaN values are removed).

df.dropna(axis=0, how='any')  # drop all rows that have any NaN values

Scenario 3: the dimension has some null values (but not many) and the total number of samples is limited, so rows with NaN values cannot be discarded as in scenario 2; instead, statistical methods should be used to choose appropriate values to fill the NaNs.

Code: data.fillna() (as the figure shows, in this example the empty values are filled with the mean).
[Figure: filling null values with the column mean using fillna()]
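A minimal sketch of mean filling, again assuming the column name from the screenshots and that the column has already been cleaned to a numeric type:

# Fill missing values with the column mean; the median works the same way
# and is more robust to outliers.
data["4-year resale"] = data["4-year resale"].fillna(data["4-year resale"].mean())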
