The Art of Data Cleaning in Machine Learning

Overview

Data cleaning is the first step in most data processing work, and many newcomers feel lost here, unsure of what should be cleaned and what shouldn't. Take a look at this article; perhaps it will offer a little inspiration.

Losing data is as bad as Obama dropping the mic

Data cleaning should be the first step in your data science (DS) or machine learning (ML) workflow. Without clean data, it will be much harder to see the parts that actually matter during exploration. And once you finally start training your ML models, they will be unnecessarily harder to train. The main point: if you want to make the most of your data, it should be clean.

In the context of data science and machine learning, data cleaning means filtering and modifying your data so that it is easier to explore, understand, and model. You filter out the parts you don't want or need so that you don't have to look at or process them, and you modify the parts you do need but that aren't in the format you require so that you can use them properly.

Here we'll look at a few of the things we usually end up wanting to clean in our data, along with the pandas code you can use to do it!

Missing data

Large datasets are rarely completely intact. By complete we mean that every data point has values for all feature variables. Usually some values are missing, and when loaded with pandas' pd.read_csv() these values are marked as NaN or None. There are plenty of practical reasons why data goes missing: whoever collected it may simply have forgotten, or collection of that feature variable may only have started halfway through the process.

Missing data needs to be handled before the dataset is used. For example, suppose you're exploring the data and discover some key insight from a certain feature variable, say "variable F". But then you find that 95% of variable F's values are NaNs; you can't draw any solid conclusions about the dataset from a variable that properly represents only 5% of it! And once you start training your ML model, your program may treat the NaNs as 0 or infinity, throwing off your training!
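Before deciding how to handle missing values, it helps to measure how much is actually missing per feature. Here is a minimal sketch of that check (the file name is just a placeholder):

import pandas as pd

df = pd.read_csv('data.csv')  # placeholder file name
# Fraction of missing values in each feature variable
missing_fraction = df.isnull().mean()
print(missing_fraction.sort_values(ascending=False))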

Here are some ways to handle missing data in pandas (a short combined sketch follows the list):

Check for NaNs:  pd.isnull(object) detects missing values; it catches both NaN and None.

Remove missing data:  df.dropna(axis=0, how='any') returns a data frame with every data point containing a NaN removed.

Replace missing data:  df.replace(to_replace=None, value=None). This is useful when you know what the value of the feature variable should be.

Drop a feature:  df.drop('feature_variable_name', axis=1). If you find that a feature variable has >90% NaNs in the dataset, it makes sense to drop the whole feature from your data.
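Putting those calls together, here is a minimal combined sketch on a toy data frame (the column names and the 90% threshold are just for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, np.nan, np.nan],
                   'F': [0.5, 0.7, np.nan]})

print(pd.isnull(df))  # True wherever a value is NaN or None

# Drop feature variables that are almost entirely NaN (column 'B' here)
df = df.drop(columns=df.columns[df.isnull().mean() > 0.9])

df_filled = df.replace(to_replace=np.nan, value=0.0)  # replace NaNs with a known value
df_dropped = df.dropna(axis=0, how='any')             # or drop rows containing any NaN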


Dropping those bad features like Obama drops the mic

Outliers

Outliers in a dataset are a mixed blessing. On the one hand, they may contain crucial information precisely because they differ so much from the main group. On the other hand, they distort our view of the main group, since we have to look way past it just to see the outliers! On the ML side, training with outliers included can help your model generalize better, but it can also drag the model away from the main group where most of your data lives.

In general, the usual advice is to consider both sides: study your data with and without the outliers. If you decide your ML model does need them, pick a method robust enough to handle them. If you find that the outliers really are there but contribute nothing to the big picture or to modeling the data, then it's best simply to remove them, as shown in the figure below.

If you do want to filter out those outliers, you can use the following method:

import numpy as np

# Get the 98th and 2nd percentiles as the limits of our outliers
upper_limit = np.percentile(train_df['logerror'].values, 98)
lower_limit = np.percentile(train_df['logerror'].values, 2)
# Clip the outliers in the dataframe back to those limits
train_df.loc[train_df['logerror'] > upper_limit, 'logerror'] = upper_limit
train_df.loc[train_df['logerror'] < lower_limit, 'logerror'] = lower_limit
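As a side note, pandas' built-in clip should achieve the same clamping in one line (a sketch over the same assumed train_df):

# Clamp values outside [lower_limit, upper_limit] back to the limits
train_df['logerror'] = train_df['logerror'].clip(lower=lower_limit, upper=upper_limit)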


Plot including outliers (left) and histogram with outliers removed (right)

Bad data and duplicates

Bad data is any data point or value that shouldn't be there at all, or that is simply wrong. For example, suppose one of your feature variables is called "gender", where most of the values are "male" or "female". But as you browse the dataset, you notice a few data points whose gender value is 67.3! Obviously 67.3 means nothing for this variable. Worse, if you convert the "gender" feature variable to categorical floats, male = 0.0 and female = 1.0, you end up with an extra float: 67.3 = 2.0!
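A quick way to surface values like that 67.3 is to count how often each distinct value appears; rare, nonsensical entries stand out immediately. A minimal sketch, assuming the same gender column:

# Count occurrences of each distinct value in the column;
# bad values such as 67.3 typically show up with tiny counts
print(pd_dataframe['gender'].value_counts())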

Duplicates are simply repeated data points in a dataset. Too many of them will skew the training of your ML model. As we saw earlier, duplicates can simply be deleted from the data (see the one-liner below).
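A minimal sketch of that deletion, using pandas' drop_duplicates:

# Drop exact duplicate rows, keeping the first occurrence of each
pd_dataframe = pd_dataframe.drop_duplicates()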

Bad data can be handled by deleting it or by some smart replacement. For example, we might inspect the data points whose gender is 67.3 and discover that the correct value for all of them should have been "female". We can then simply convert every 67.3 to "female". The nice thing about this is that we effectively win these data points back for ML training instead of throwing them away. You can do the conversion in pandas like this:

value_map = {'male': 'male', 'female': 'female', '67.3': 'female'}
pd_dataframe['gender'] = pd_dataframe['gender'].map(value_map)




Watch out for duplicate Loki data

Irrelevant features

Not all features are created equal. Some you may not need at all! For example, you might be looking at a dataset of books purchased from Amazon over the past year in which one of the feature variables, called "font-type", records the font used in each book. That is quite irrelevant to predicting a book's sales! You can safely drop the feature, like this:

df = df.drop('feature_variable_name', axis=1)

Doing this makes data exploration easier, because you have fewer things to look at. It also helps make training the ML model easier and faster, since you aren't processing as much data. If you're not sure whether a variable is important, you can always wait until you start exploring the dataset to decide. Computing the correlation matrix between the feature variables and the target output can help determine how important each variable is.
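For numeric features, such a correlation check might look like the following sketch (the target column name 'sales' is hypothetical, and numeric_only assumes a recent pandas):

# Correlation of every numeric feature with the target column
correlations = df.corr(numeric_only=True)['sales'].sort_values(ascending=False)
print(correlations)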


When your feature variable is of no use...

Standardization

All the data within each feature variable should be in the same standardized format. It will make the technical side of data exploration and modeling much easier. For example, let's take the "gender" variable again, with values "male" or "female". If the data was collected by humans, you may end up with many different values you never expected:

  • male, female (this one is fine)

  • MALE, FEMALE (caps lock was on)

  • Male, Female (some people capitalize)

  • Make, Femall (typos!)

If we converted this feature variable straight to categorical floats, we would get far more values than the 0 and 1 we want! We would actually get something like this:

{
    'male': 0,
    'female': 1,
    'MALE': 2,
    'FEMALE': 3,
    'Male': 4,
    'Female': 5,
    'Make': 6,
    'Femall': 7
}

There are two ways to handle this situation. If it's something simple, like fixing capitalization, just do this:

# Make the whole string lower case
s.lower()
# Make the first letter capitalised
s.capitalize()
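To apply the same fix to a whole pandas column at once rather than one string at a time, the vectorized .str accessor should do it (a sketch):

# Lower-case every value in the gender column in one shot
pd_dataframe['gender'] = pd_dataframe['gender'].str.lower()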

If there are spelling errors, you'll want to use the mapping function we saw earlier:


value_map = {'male': 'male', 'female': 'female', 'Make': 'male', 'Femall': 'female'}
pd_dataframe['gender'] = pd_dataframe['gender'].map(value_map)

Original English article: https://towardsdatascience.com/the-art-of-cleaning-your-data-b713dbd49726

