Data Cleaning in Python: The Ultimate Guide (2020)

Author | Lianne & Justin

Translator | land from

Produced | AI technology base camp (ID: rgznai100)

In general, before fitting any machine learning or statistical model, we always have to clean the data. No model produces meaningful results from messy data.

Data cleansing (or data cleaning) is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It means identifying the parts of the data that are incomplete, incorrect, inaccurate, or irrelevant to the project, and then replacing, modifying, or deleting that dirty data.

That's a long definition, but the idea is simple and easy to understand.

To keep things simple, we've put together a complete, step-by-step guide in Python, where you will learn how to find and clean the following kinds of data:

  • Missing data;

  • Irregular data (outliers);

  • Unnecessary data - repetitive data and duplicates;

  • Inconsistent data - capitalization, formats, addresses.

In this article, we will use the Russian housing dataset provided on Kaggle (

https://www.kaggle.com/c/sberbank-russian-housing-market/overview/description), where the goal is to predict housing prices in Russia. We won't clean the whole dataset; we'll just use samples from it.

Before the cleaning work begins, let's take a quick look at the data.
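A minimal sketch of this first look (the file name is an assumption; use whatever name your Kaggle download has):

```python
import pandas as pd

# load the Sberbank housing data (file name is an assumption)
df = pd.read_csv('sberbank.csv')

print(df.shape)   # number of rows and columns
print(df.dtypes)  # numeric vs. categorical (object) features
```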

From the results, we learn that this dataset has 30,471 rows and 292 columns, and we can identify whether each feature is a numeric or a categorical variable. All of this is useful information.

Now we can run through the checklist of "dirty" data types and fix them one by one.

Let's get started right away.

Missing data

Dealing with missing data is the most common and one of the trickiest situations in data cleaning. While many models can tolerate various data problems, most models do not accept missing data.

How to find the missing data?

We'll introduce three techniques for learning more about the missing data in a dataset.

1. Missing data heatmap

When the number of features is small, we can visualize the missing data with a heatmap, as in the sketch below.
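A minimal sketch using seaborn, assuming df is the DataFrame loaded above:

```python
import seaborn as sns

cols = df.columns[:30]            # first 30 features
colours = ['#000099', '#ffff00']  # blue = present, yellow = missing
sns.heatmap(df[cols].isnull(), cmap=sns.color_palette(colours))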

The figure below shows the missing-data pattern for the first 30 features. The horizontal axis shows the feature names; the vertical axis shows the observation/row numbers; yellow indicates missing data, while blue indicates present data.

For example, the feature life_sq has missing values across many rows, while the feature floor has only a few missing values, around row 7000.

Missing data heatmap

2. Missing data percentage list

When a dataset has many features, we can instead list the percentage of missing values for each feature.
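A minimal sketch, assuming df from above:

```python
# percentage of missing values for each feature
for col in df.columns:
    pct_missing = df[col].isnull().mean()
    print('{} - {:.0%}'.format(col, pct_missing))
```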

This produces a list like the one below, showing the percentage of missing values for each feature.

Specifically, we see that the feature life_sq is missing 21% of its values, while floor is missing only 1%. This list is a useful summary that complements the heatmap visualization.

Missing data percentage list - first 30 features

3. Missing data histogram

When we have many features, a missing data histogram is another useful technique.

To learn more about the pattern of missing values among the observations, we can visualize it as a histogram.
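One way to sketch this, assuming df from above (the *_ismissing and num_missing column names are our own):

```python
# indicator columns: True where the value is missing
for col in list(df.columns):
    missing = df[col].isnull()
    if missing.sum() > 0:
        df['{}_ismissing'.format(col)] = missing

# number of missing values per row, plotted as a bar chart
ismissing_cols = [c for c in df.columns if c.endswith('_ismissing')]
df['num_missing'] = df[ismissing_cols].sum(axis=1)
df['num_missing'].value_counts().sort_index().plot.bar()
```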

This produces a histogram showing, for each of the 30,471 observations, how many values are missing.

For example, over 6,000 observations have no missing values at all, while close to 4,000 observations have one missing value each.

Missing data histogram

What should we do?

There is no one-size-fits-all solution for handling missing data. We have to study the specific feature and dataset to decide the best way to handle it.

Below, we introduce the four most common methods for handling missing data. If the situation is more complicated, more sophisticated methods may be needed, such as modeling the missing data.

1. Drop the observation

In statistics, this method is called listwise deletion: we delete an entire observation as long as it contains a single missing value.

We should only do this when we are sure the missing data carries no useful information. Otherwise, we should consider other approaches.

Of course, we can also use other criteria for dropping observations.

For example, from the missing data histogram, we noticed that only a small number of observations have more than 35 features missing. We can create a new dataset, df_less_missing_rows, by removing those observations, as below.
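A sketch, reusing the num_missing count built for the histogram above:

```python
# drop rows with more than 35 missing features
ind_missing = df[df['num_missing'] > 35].index
df_less_missing_rows = df.drop(ind_missing, axis=0)
```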

2. Drop the feature

Similar to the previous approach, we should only do this when we are confident the feature provides no useful information.

For example, from the missing data percentage list, we notice that hospital_beds_raion has as much as 47% of its values missing. So we may decide to drop the entire feature.
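For example:

```python
# drop the feature with 47% of its values missing
df_less_hos_beds_raion = df.drop(['hospital_beds_raion'], axis=1)
```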

3. Impute the missing data

When the feature is a numeric variable, we can impute the missing data: we replace the missing values with the mean or median of the existing values of the same feature.

When the feature is a categorical variable, we can fill in the missing data with the mode (the most frequent value).

Using life_sq as an example, we can replace its missing values with the median of the feature.
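A sketch of median imputation for one feature:

```python
# impute life_sq with its median
med = df['life_sq'].median()
df['life_sq'] = df['life_sq'].fillna(med)
```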

Moreover, we can apply the same imputation strategy to all of the numeric features at once.
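A sketch for all numeric features in one pass:

```python
# median imputation for every numeric feature with missing values
numeric_cols = df.select_dtypes(include='number').columns
for col in numeric_cols:
    if df[col].isnull().sum() > 0:
        df[col] = df[col].fillna(df[col].median())
```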

Luckily, our dataset has no missing values in its categorical features. But we can still apply a one-shot mode imputation to all of the categorical features.
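A sketch of that one-shot mode imputation:

```python
# mode imputation for every non-numeric feature with missing values
non_numeric_cols = df.select_dtypes(exclude='number').columns
for col in non_numeric_cols:
    if df[col].isnull().sum() > 0:
        df[col] = df[col].fillna(df[col].mode().iloc[0])
```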

4. Replace the missing data

For categorical features, we can add a new category of value such as "_MISSING_". For numeric features, we can replace the missing values with a special value such as -999, as sketched below.
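A sketch (sub_area and life_sq stand in for any categorical and numeric feature):

```python
# flag missingness explicitly instead of hiding it
df['sub_area'] = df['sub_area'].fillna('_MISSING_')  # categorical
df['life_sq'] = df['life_sq'].fillna(-999)           # numeric
```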

This way, we still keep the missingness itself as potentially valuable information.

Irregular data (outliers)

Outliers are data points that are significantly different from the other observations. They could be genuine outliers or erroneous values.

How to find irregular data?

Depending on whether the feature is numeric or categorical, we can use different techniques to study its distribution and detect outliers.

1. Histogram and box plot

When the feature is numeric, we can use a histogram or a box plot to detect outliers.

The sketch below draws a histogram of the feature life_sq.
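```python
# histogram of life_sq with fine-grained bins
df['life_sq'].hist(bins=100)
```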

The data looks highly skewed, which suggests the presence of outliers.

Histogram

To study the feature more closely, let's make a box plot.
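```python
# box plot of life_sq
df.boxplot(column=['life_sq'])
```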

In the box plot, we can see an outlier with a value over 7000.

Box plot

2. Descriptive statistics

Also, for numeric features, the outliers may be too distinct for a box plot to visualize. Instead, we can look at their descriptive statistics.
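```python
# summary statistics of life_sq
df['life_sq'].describe()
```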

For example, for the feature life_sq, we can see that the maximum value is 7478, while the 75% quantile is only 43. The value 7478 is clearly an outlier.

3. Bar chart

For categorical features, we can use a bar chart to learn about the categories of the feature and their distribution, as sketched below.
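```python
# frequency of each category of ecology
df['ecology'].value_counts().plot.bar()
```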

For example, the feature ecology has a reasonable distribution. But if there were a category with only a single value, called "other", that would certainly be an outlier.

Bar chart

4. Other techniques

Many other techniques can also be used to spot outliers, such as scatter plots, z-scores, and clustering; we won't cover them all here.

What should we do?

While outliers are not hard to detect, we have to determine the right way to handle them. It depends heavily on the dataset and the goal of the project.

The methods for handling outliers are somewhat similar to those for missing data: we either drop them, adjust them, or keep them. For possible solutions, refer back to the missing data section of this article.

Unnecessary data

After all the effort spent on missing data and outliers, let's look at unnecessary data, which is more straightforward.

All data fed into the model should serve the goal of the project. Unnecessary data is data that adds no practical value. Depending on the situation, there are three main types of unnecessary data.

1. Uninformative / repetitive data

Sometimes a feature is uninformative because too many of its rows share the same value.

How to find uninformative or repetitive data?

We can create a list of the features that have a high percentage of identical values.

For example, the sketch below lists the features for which more than 95% of the rows share the same value.
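A sketch, assuming df from above; the 95% threshold is a judgment call:

```python
num_rows = len(df)
low_information_cols = []

for col in df.columns:
    counts = df[col].value_counts(dropna=False)
    top_pct = counts.iloc[0] / num_rows  # share of the most frequent value
    if top_pct > 0.95:
        low_information_cols.append(col)
        print('{0}: {1:.1%} of rows are {2!r}'.format(col, top_pct, counts.index[0]))
```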

We can then study these variables one by one to see whether they carry valuable information; we won't show the details here.

What should we do?

We need to understand the reason behind the repeated values. When they genuinely carry no useful information, we can drop them.

2. Irrelevant data

Again, data needs to provide useful information for the project. If a feature has nothing to do with the problem we are trying to solve, it is irrelevant.

How to find irrelevant data?

We need to skim through the features to identify the irrelevant ones.

For example, a feature recording the weather in Toronto provides no useful information for predicting Russian housing prices.

What should we do?

When these features do not serve the project's goal, we can remove them.

3. Duplicate data

Duplicate data means that multiple copies of the same observation exist.

There are two main types of duplicate data.

(1) Duplicates based on all features

How to find duplicates based on all features?

This kind of duplication happens when all the feature values of two observations are identical, and it is easy to find.

We first remove the unique identifier id from the dataset, then create a dataset called df_dedupped by dropping the duplicates. By comparing the two datasets (df vs. df_dedupped) we can find out how many rows are duplicated, as below.
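A sketch, assuming id is the dataset's unique identifier column:

```python
# drop the identifier, then drop exact duplicate rows
df_dedupped = df.drop('id', axis=1).drop_duplicates()

print(df.shape)           # before
print(df_dedupped.shape)  # after: fewer rows
```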

It turns out that 10 rows are exact duplicate observations.

What should we do?

We should delete the duplicate data.

(2) Duplicates based on key features

How to find duplicates based on key features?

Sometimes it is better to remove duplicates based on a set of features that act as unique identifiers.

For example, the chance of two transactions with the same living area, the same price, and the same build year happening at the same time is close to zero.

We can set up a group of key features as unique identifiers for a transaction: timestamp, full_sq, life_sq, floor, build_year, num_room, price_doc. Then we check for duplicates based on these keys, as sketched below.
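A sketch; filling NaNs first lets rows with missing key values still be compared:

```python
key = ['timestamp', 'full_sq', 'life_sq', 'floor',
       'build_year', 'num_room', 'price_doc']

# count rows that repeat an earlier row on all key features
dups = df.fillna(-999).duplicated(subset=key, keep='first')
print(dups.sum())
```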

Based on this set of key features, there are 16 duplicates.

What should we do?

We can drop the duplicates based on the key features, as below.
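A sketch:

```python
# keep the first occurrence of each key combination
df_dedupped2 = df.drop_duplicates(subset=key)

print(df.shape)            # before
print(df_dedupped2.shape)  # after: 16 fewer rows
```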

We drop the 16 duplicates and call the new dataset df_dedupped2.

Inconsistent data

It is also crucial for the dataset to follow specific standards to fit a model. We need to explore the data in different ways to spot inconsistent data. Much of the time this relies on careful observation and experience; there is no fixed code that will find and fix all inconsistent data.

Below we cover four types of inconsistent data.

1. Inconsistent capitalization

Inconsistent use of upper and lower case in categorical values is a common mistake, and it can cause problems because data analysis in Python is case-sensitive.

How to find inconsistent capitalization?

Let's look at the feature sub_area, as below.
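```python
# list every area name and how often it occurs
df['sub_area'].value_counts(dropna=False)
```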

It stores the names of different areas and looks quite standardized.

But sometimes there is inconsistent capitalization within the same feature. For example, "Poselenie Sosenskoe" and "pOseleNie sosenskeo" could refer to the same area.

What should we do?

To prevent this, we can lowercase (or uppercase) all the letters, as below.
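```python
# normalize the case in a new column
df['sub_area_lower'] = df['sub_area'].str.lower()
```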

2. Inconsistent formats

Another standardization we need to enforce is the data format. One example is converting a feature from string format to DateTime format.

How to find inconsistent data formats?

The feature timestamp is stored as a string, although it represents a date.

What should we do?

We can convert it with the code below and extract the date or time values. After that, it becomes easier to analyze transaction volume grouped by year or month.
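A sketch; the '%Y-%m-%d' format is an assumption about how the dates are written:

```python
# parse the string timestamps and pull out date parts
df['timestamp_dt'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d')
df['year'] = df['timestamp_dt'].dt.year
df['month'] = df['timestamp_dt'].dt.month
df['weekday'] = df['timestamp_dt'].dt.weekday
```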

3. Inconsistent categorical values

Inconsistent categorical values are the last type of inconsistent data we'll discuss. A categorical feature takes a limited number of values. Sometimes, due to typos or other reasons, unexpected values appear.

How to find inconsistent categorical values?

We need to observe a feature carefully to find mismatched values; let's use an example to illustrate.

Since this problem doesn't exist in our real estate dataset, we create a new dataset below. In it, the value of the city feature was mistyped as "torontoo" and "tronto", but both refer to the correct value "toronto".
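A toy dataset for illustration (df_city_ex and its values are made up):

```python
# a small example with misspelled city names
df_city_ex = pd.DataFrame(data={'city': ['torontoo', 'toronto', 'tronto',
                                         'vancouver', 'vancover', 'vancouvr',
                                         'montreal', 'calgary']})
```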

A simple way to identify them is fuzzy matching using edit distance. It measures how many letters (the distance) we would need to change in the spelling of one value to match another value.

We know the categories should only be the four values "toronto", "vancouver", "montreal", and "calgary". We calculate the edit distance between all the values and the words "toronto" (and "vancouver"). The values that are likely typos have a small distance to their correct word, since they differ by only a couple of letters.
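One way to compute it is NLTK's edit_distance; any edit-distance implementation would do:

```python
from nltk.metrics import edit_distance

# edit distance from each value to the two reference spellings
df_city_ex['distance_toronto'] = df_city_ex['city'].map(
    lambda x: edit_distance(x, 'toronto'))
df_city_ex['distance_vancouver'] = df_city_ex['city'].map(
    lambda x: edit_distance(x, 'vancouver'))
```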

What should we do?

We can set a criterion to convert these typos to the correct values. For example, the code below sets all values within 2 letters' distance of "toronto" to "toronto".
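A sketch, using the distances computed above:

```python
# collapse anything within 2 edits of 'toronto' onto the correct spelling
msk = df_city_ex['distance_toronto'] <= 2
df_city_ex.loc[msk, 'city'] = 'toronto'
```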

4. Inconsistent addresses

Address features can be a real headache, because people entering data into a database often don't follow a standard format.

How to find inconsistent addresses?

We can find messy address data by eyeballing it. Even when we can't spot any problems, we can still run code to standardize the addresses.

Our dataset has no address feature, for privacy reasons, so we create a new dataset, df_add_ex, with an address feature, as below.
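A toy example (the addresses are made up):

```python
# a small example of messy addresses
df_add_ex = pd.DataFrame(['123 MAIN St Apartment 15',
                          '123 Main Street Apt 12   ',
                          '543 FirSt Av',
                          '  876 FIRst Ave.'], columns=['address'])
```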

As we can see, the address data is quite messy.

What should we do?

We run the code below to lowercase the letters, remove whitespace, delete periods, and standardize the wording.
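A sketch of one possible standardization:

```python
# lowercase, trim whitespace, drop periods, unify common abbreviations
df_add_ex['address_std'] = (
    df_add_ex['address']
    .str.lower()
    .str.strip()
    .str.replace(r'\.', '', regex=True)
    .str.replace(r'\bstreet\b', 'st', regex=True)
    .str.replace(r'\bapartment\b', 'apt', regex=True)
    .str.replace(r'\bav\b', 'ave', regex=True)
)
print(df_add_ex)
```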

It looks much better now.

We made it! It was a long journey, but we've cleared away all the "dirty" data that would have blocked model fitting.

Original link:

https://towardsdatascience.com/data-cleaning-in-python-the-ultimate-guide-2020-c63b88bf0a0d

This article was translated by CSDN. Please credit the source when reprinting.

【End】

"Force plan [the second quarter] - learning ability Challenge" started!
From now until March 21, must flow to support the original author, the exclusive [more] medal waiting for you to challenge

Recommended Reading 

barley Ali into the electronic business platform, Jay concert tickets will not be good Qiangdian?

30 million lines of data, Python analysis of two decades of his career Bryant | Force program

recent face more than a dozen data analysts, these problems can be avoided!

talk about novel coronavirus, Bitcoin, Apple ...... • Warren Buffett respondents in verse 18, worth a visit!

BZip2Codec compression, Map end compression control, Reduce end compression control ...... knowledge are compressed in this point in the Hadoop integration!

Heard the code a little older, a little older how to do it Bug?

You look at every point, I seriously as a favorite

Released 1805 original articles · won praise 40000 + · Views 16,310,000 +

Guess you like

Origin blog.csdn.net/csdnnews/article/details/104765648