The Road to Big Data - (1) Data Cleaning in Algorithmic Modeling

Author: Mochou

Source: Hang Seng LIGHT Cloud Community

In today's big data landscape, data processing takes up a huge share of the work. Like tomatoes turned into scrambled eggs with tomato, the ingredients must go through seasoning, washing, and preparation before the dish can be released to production, or rather, served at the table.

Here I will briefly share my understanding of data cleaning. It plays a very important role: nobody is willing to take a bite of scrambled eggs made from messy tomatoes.

Uncleaned data generally has several problems that make it unfit for analysis: duplicates, errors, null values, abnormal values, and so on. Erroneous data is a problem at the business source, for example a record whose gender is clearly male but is stored as female. We cannot fix that downstream; it can only be corrected at the source, just as when a customer wants tomatoes from Henan but the kitchen only has tomatoes from Shandong, the chef cannot solve it and can only ask the purchaser to restock. So we clean and revise only the other three problems. It must be stressed that all cleaning has to be grounded in the actual business. Take duplicates: maybe the business needs them, and washing them away will cause problems.


1. Duplicates

If the actual business does not need duplicate values, they can simply be deleted. In the database, for example, use union instead of union all when integrating and merging tables. Where union cannot do the job, partition by the primary key, sort, and keep the first row:

row_number() over (partition by .. order by..desc) as..

If your database does not even support row_number... well, good luck.

Other languages have similar deduplication functions; in Python, for example, pandas offers drop_duplicates().
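A minimal pandas sketch of both approaches, using a hypothetical key column id, could look like this:

import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2], "name": ["a", "a", "b"]})

# Drop rows that are complete duplicates; keep="first" keeps the first occurrence
deduped = df.drop_duplicates()

# Or deduplicate on the key column only, mirroring the row_number() idea above
deduped_by_id = df.drop_duplicates(subset="id", keep="first")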

2. Missing values

Missing values are "empty" values, but it is important to distinguish two kinds of "empty": one is a true null, where the object itself is absent (NULL), and the other is an empty string, that is, xxx = ''. Both need handling, so the checks are also two: xxx is null, and length(trim(xxx)) = 0.
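A rough pandas equivalent of those two checks, assuming a DataFrame df with a hypothetical string column col1, might be:

import pandas as pd

df = pd.DataFrame({"col1": ["a", None, "   ", "b"]})

# Case 1: true nulls, i.e. xxx is null
is_null = df["col1"].isnull()

# Case 2: blank after trimming, i.e. length(trim(xxx)) = 0
is_blank = df["col1"].fillna("").str.strip().eq("")

print(df[is_null | is_blank])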

Null values are generally handled by filling them in, and how to fill depends on the actual business. Generally speaking:

  • When there are relatively few nulls, fill with a continuous value such as the mean or median;
  • When there are many nulls, say more than 50%, consider filling with the mode;
  • When nulls make up the vast majority, the original values are of little use; instead, create an indicator dummy variable marking whether the value is missing and use it in subsequent modeling.

I have already covered how to handle this in the database; here is how to do it in Python:

# Proportion of null values in each column
# df is the DataFrame; col.size is the number of rows in the column
df.apply(lambda col: sum(col.isnull()) / col.size)

# Fill with the mean, using fillna from the pandas package
df.col1.fillna(df.col1.mean())
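The list above also mentions filling with the mode and creating an indicator dummy. A hedged sketch of both, still using the hypothetical column col1, could be:

# Fill with the mode; mode() can return several values, so take the first
df.col1.fillna(df.col1.mode()[0])

# When most values are missing, add a missing-indicator dummy instead of filling
df['col1_missing'] = df.col1.isnull().astype(int)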
3. Noise values

Noise values are values that differ markedly from the rest of the data; some call them outliers, for example a few ages above 150. Noise values can seriously distort model results, making conclusions untrue or biased, so they must be handled. Commonly used methods are the capping method and the binning method for a single variable, and clustering for multiple variables.

  • Capping method

We have all learned about the normal distribution: the probability of a value falling outside three standard deviations above or below the mean is only about 0.3%. So these peripheral values can be replaced with the boundary values at three standard deviations above and below the mean; this is the capping method.


In the database this can be done with case when; in Python you can write a function:

def cap(x, quantile=[0.01, 0.99]):
    '''Cap outliers (capping method).
    Args:
        x: a pandas Series holding a continuous variable
        quantile: lower and upper quantile bounds, here 0.01 and 0.99
    '''
    # Compute the quantiles: Q01 and Q99 are the 1st and 99th percentiles
    Q01, Q99 = x.quantile(quantile).values.tolist()

    # Replace outliers with the specified quantile values
    if Q01 > x.min():
        x = x.copy()
        x.loc[x < Q01] = Q01
    if Q99 < x.max():
        x = x.copy()
        x.loc[x > Q99] = Q99
    return x
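The function above caps at the 1st and 99th percentiles. A sketch of the three-standard-deviation variant described earlier, assuming x is a numeric pandas Series (this variant is my own illustration, not from the original), might be:

def cap_3sigma(x):
    # Clip values beyond mean ± 3 standard deviations (illustrative sketch)
    mean, std = x.mean(), x.std()
    return x.clip(lower=mean - 3 * std, upper=mean + 3 * std)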
  • Binning method

The binning method smooths ordered data by examining its "nearest neighbors": the sorted values are distributed into a number of bins, a representative value is taken for each bin (for example the maximum, mean, or median), and a standard is then applied to those representative values to judge whether each bin is good or bad; bad bins get special treatment. Binning comes in two forms: equal-depth binning, where each bin holds the same number of samples, and equal-width binning, where each bin spans the same range of values.

For example, take the numbers 1 2 66 8 9 2 1 4 6. First sort them: 1 1 2 2 4 6 8 9 66, then divide them into three bins. Bin A: 1 1 2; Bin B: 2 4 6; Bin C: 8 9 66.

Taking the mean of each bin, A is about 1.3, B is 4, and C is about 27.7. C's value is clearly much larger than the mean and median of the whole group, so C is a bad bin and its data deserves focused treatment.
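A small Python sketch that reproduces this equal-depth split (using numpy's array_split, which is my choice of tool rather than the article's) might be:

import numpy as np
import pandas as pd

x = pd.Series([1, 2, 66, 8, 9, 2, 1, 4, 6])

# Sort, then split into 3 bins with equal sample counts (equal-depth binning)
bins = np.array_split(np.sort(x.values), 3)

# Mean of each bin; a bin whose mean sits far from the others is suspect
for name, b in zip("ABC", bins):
    print(name, b, round(b.mean(), 1))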

  • Clustering

All of the above handle a single variable; multivariate outlier handling requires clustering methods.

The idea is that normal values share similar characteristics. The good tomatoes mentioned earlier, for example, are ruddy in color, sweet-and-sour in taste, with intact skins, while bad tomatoes have traits that differ from the "others", such as a strange taste. So we can partition the data objects into several groups: objects in the same group are highly similar, while objects in different groups differ markedly. Cluster analysis can surface outliers through these groups, and such outliers are often abnormal data.
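As an illustrative sketch only (the article does not name a specific algorithm), density-based clustering with scikit-learn's DBSCAN can flag multivariate outliers, because points that cannot be assigned to any cluster are labelled -1:

import numpy as np
from sklearn.cluster import DBSCAN

# Two tight clusters plus one far-away point, the would-be bad tomato
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [8.0, 8.1], [8.2, 7.9], [7.9, 8.0],
              [50.0, 50.0]])

labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)

# Points labelled -1 belong to no cluster and are outlier candidates
print(X[labels == -1])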
