Feature Engineering Series: Feature Preprocessing (Part 2)

This article was originally written by a member of the data tea-break group and is published on this public account with authorization.

About the author: JunLiang, a data mining practitioner who loves to get to the bottom of things, exploring machine-learning-related content. Looking forward to exchanging ideas with you~

0x00 Foreword

Data preprocessing consists of three parts: data exploration, data cleaning, and feature preprocessing. "Feature Engineering Series: Feature Preprocessing (Part 1)" covered feature scaling and feature binning along with the related processing methods; this chapter continues the feature preprocessing topic with statistical transformations and categorical feature encoding.

0x01 Statistical Transformation

A skewed data distribution has many negative effects. We can use feature engineering techniques, applying statistical or mathematical transformations, to mitigate the impact of a skewed distribution: spreading out values in originally dense intervals as much as possible, and aggregating values in originally sparse intervals.

These transformation functions all belong to the power transform family. They are typically used to create monotonic data transformations, and their main role is to help stabilize the variance, keep the distribution close to normal, and make the data independent of the mean of the distribution.

1. Log transformation

1) Definition

The log transformation is typically used to create monotonic data transformations. Its main role is to help stabilize the variance, keep the distribution close to normal, and make the data independent of the mean of the distribution.

The log transformation belongs to the power transform family. Mathematically, it is expressed as:

y = log_b(x)

The natural logarithm uses b = e, where e = 2.71828... is commonly called Euler's number. You can also use b = 10, the base commonly used in the decimal system.

The log transformation is useful when applied to skewed distributions, because it tends to stretch the range of the variable's values that fall at lower magnitudes and to compress or reduce the range of values at higher magnitudes, which brings the skewed distribution closer to a normal distribution.
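As a quick illustration of that stretch-and-compress behavior (a toy example, not from the original article), base-10 logs turn multiplicative gaps into equal additive steps:

import numpy as np

# Values spanning four orders of magnitude...
values = np.array([1, 10, 100, 1000, 10000])
# ...map onto evenly spaced points: gaps of 9, 90, 900, 9000 all become 1
print(np.log10(values))  # [0. 1. 2. 3. 4.]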

2) Purpose

For some continuous features whose variance is unstable and whose values are heavy-tailed, we need to apply a log transformation to adjust the variance of the whole data distribution; this belongs to the family of variance-stabilizing transformations. For example, in word frequency statistics, some prepositions occur far more often than other words, so the word frequency feature contains a few wildly out-of-scale values, which widens the variance of the whole distribution. In such cases a log transformation is worth considering. It is especially common in fields such as time series analysis, where the goal is to stabilize the variance and focus on the target's fluctuations.
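A toy sketch of this variance-stabilizing effect (the counts are invented for the example):

import numpy as np

# Word counts where one preposition dwarfs everything else
counts = np.array([3, 5, 8, 12, 2000], dtype=float)
print(counts.var())            # ~635,000: dominated by the one huge count
print(np.log1p(counts).var())  # ~5: far more stable after the log transform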

3) Effect of the transformation

4) Implementation code

import numpy as np

# Log-transform income, shifting by 1 so zero incomes stay defined
fcc_survey_df['Income_log'] = np.log(1 + fcc_survey_df['Income'])

2. Box-Cox transformation

1) Definition

The Box-Cox transformation is another popular function from the power transform family. It has a prerequisite: the numeric values to be transformed must be positive (just as the log transformation requires). If negative values occur, shifting the values by a constant offset helps.
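A minimal sketch of that constant-offset trick (toy data; making the minimum land exactly on 1 is an assumption, any shift into positive territory works):

import numpy as np

x = np.array([-3.0, 0.0, 2.0, 7.0])
# Shift so the smallest value becomes exactly 1, satisfying positivity
x_shifted = x - x.min() + 1
print(x_shifted)  # [ 1.  4.  6. 11.]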

The Box-Cox transformation function:

y = (x^λ − 1) / λ,  when λ ≠ 0
y = ln(x),          when λ = 0

The transformed output y is a function of the input x and the transformation parameter λ; when λ = 0, the transformation is simply the natural log transformation we mentioned above. The optimal value of λ is usually determined by maximum likelihood or maximum log-likelihood estimation.
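A quick sanity-check sketch of the λ = 0 case using scipy (toy values):

import numpy as np
import scipy.stats as spstats

x = np.array([1.0, 10.0, 100.0])
# With lmbda=0, Box-Cox reduces to the natural logarithm
print(spstats.boxcox(x, lmbda=0))  # [0.         2.30258509 4.60517019]
print(np.log(x))                   # identical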

2) Purpose

The Box-Cox transformation is a generalized power transformation method proposed by Box and Cox in 1964. It is a data transformation commonly used in statistical modeling, for situations where the continuous response variable does not follow a normal distribution. After a Box-Cox transformation, the unobservable error and the correlation with the predictor variables can be reduced to some extent. The main feature of the Box-Cox transformation is that it introduces a parameter and estimates that parameter from the data itself, which in turn determines the form the transformation should take. The Box-Cox transformation can markedly improve the normality, symmetry, and homoscedasticity of the data, and it is effective for many real-world datasets.

3) Effect of the transformation

4) Implementation code

import numpy as np
import scipy.stats as spstats

# Remove missing (NaN) values from the distribution
income = np.array(fcc_survey_df['Income'])
income_clean = income[~np.isnan(income)]

# Estimate the optimal lambda value by maximum likelihood
income_boxcox, opt_lambda = spstats.boxcox(income_clean)
print('Optimal lambda value:', opt_lambda)

# Apply the Box-Cox transformation with the optimal lambda
fcc_survey_df['Income_boxcox_lambda_opt'] = spstats.boxcox(
    fcc_survey_df['Income'], lmbda=opt_lambda)

0x02 Categorical Feature Encoding

In statistics, a categorical feature is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property.

1. Label encoding (LabelEncoder)

1) Definition

LabelEncoder encodes discrete numbers or text into label values ranging from 0 to n_classes - 1.

2) Advantages and disadvantages

Advantages: compared with one-hot encoding, LabelEncoder uses less memory and supports encoding text features.

Disadvantages: it implies an assumption that an ordinal relationship exists between the different categories. Concretely, LabelEncoder sorts all the unique values in the categorical column and derives the mapping from original inputs to integers from that ordering. For this reason label encoding is not widely used; it is generally only suitable for tree models.
For example, given [dog, cat, dog, mouse, cat], we convert it to [1, 2, 1, 3, 2]. This produces an odd artifact: the average of dog and mouse is cat.

3) Implementation code

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])

print('Classes: {}'.format(list(le.classes_)))
# Output -> Classes: ['amsterdam', 'paris', 'tokyo']

print('Transformed labels: {}'.format(le.transform(["tokyo", "tokyo", "paris"])))
# Output -> Transformed labels: [2 2 1]

print('Inverse-transformed labels: {}'.format(list(le.inverse_transform([2, 2, 1]))))
# Output -> Inverse-transformed labels: ['tokyo', 'tokyo', 'paris']

2. One-hot encoding (OneHotEncoder)

1) Definition

OneHotEncoder expands the dimensionality of categorical data. The simplest way to understand it is by analogy to a bitmap: set up an all-zero array whose length equals the number of categories, with each position corresponding to one category; if a position holds a 1, the vector represents that category.

OneHotEncoder can only binarize numeric variables; it cannot directly encode string-valued categorical variables.

2) Why use one-hot encoding

One-hot encoding is used because most algorithms compute with metrics in a vector space, and we want the values of a variable without an ordinal relationship to carry no ordinal information and to be equidistant from the origin. One-hot encoding extends the values of a discrete feature into Euclidean space, where each value of the discrete feature corresponds to a point. Encoding discrete features this way makes distance calculations between features more reasonable.

Why map feature vectors to Euclidean space?
Discrete features are mapped to Euclidean space via one-hot encoding because, in machine learning algorithms such as regression, classification, and clustering, computing distances or similarities between features is very important, and the distance and similarity measures we commonly use are defined in Euclidean space.

3) Example

Suppose we have a color feature with three values: red, yellow, and blue.

When applying machine learning algorithms we generally need to vectorize or digitize the data, so you might set red = 1, yellow = 2, blue = 3. That actually amounts to label encoding: assigning each category a label. However, it means the machine may learn "red < yellow < blue", which is not what we intend; we only want the machine to distinguish the categories, with no notion of magnitude.

So label encoding is not enough here and a further transformation is needed. Since there are three color states, we use 3 bits: red: 1 0 0, yellow: 0 1 0, blue: 0 0 1. This way the distance between any two vectors is √2; all values are equidistant in the vector space, so no ordinal bias is introduced, and algorithms based on vector-space metrics are essentially unaffected.
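A small sketch verifying the equal-distance claim (toy vectors):

import numpy as np

# One-hot vectors for the three colors
red, yellow, blue = np.eye(3)
# Every pairwise Euclidean distance is sqrt(2) ≈ 1.4142
print(np.linalg.norm(red - yellow))
print(np.linalg.norm(red - blue))
print(np.linalg.norm(yellow - blue))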

4) Advantages and disadvantages

Advantages: one-hot encoding solves the problem that classifiers do not handle categorical attribute data well, and to some extent it also serves to expand the feature set. Its values are only 0 and 1, with the different categories stored along orthogonal dimensions.

Disadvantages: when there are many categories, the feature space becomes very large. In that case, PCA can generally be used to reduce the dimensionality, and the one-hot encoding + PCA combination is very useful in practice.
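A minimal sketch of that one-hot + PCA combination (toy data; the number of retained components is an assumption):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

X = np.array([[0, 3], [1, 0], [2, 1], [1, 2], [0, 0]])
# One-hot expands the two columns into 3 + 4 = 7 binary columns
onehot = OneHotEncoder().fit_transform(X).toarray()
# PCA then projects them back down to a compact representation
X_reduced = PCA(n_components=3).fit_transform(onehot)
print(onehot.shape, '->', X_reduced.shape)  # (5, 7) -> (5, 3)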

5) Implementation code

Implementation with sklearn
Note: when a feature is of string type, first convert it to a continuous numeric variable with LabelEncoder(), then binarize it with OneHotEncoder().

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  # fit learns the encoding
enc.transform([[0, 1, 3]]).toarray()  # apply the encoding
# Output: array([[ 1., 0., 0., 1., 0., 0., 0., 0., 1.]])
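A sketch of the two-step chain from the note above for string features (toy data):

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = ['red', 'yellow', 'blue', 'red']
# Step 1: strings -> integers (sorted classes: blue=0, red=1, yellow=2)
int_labels = LabelEncoder().fit_transform(colors)  # [1 2 0 1]
# Step 2: integers -> one-hot columns
onehot = OneHotEncoder().fit_transform(int_labels.reshape(-1, 1))
print(onehot.toarray())
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]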

Implementation with pandas

import pandas as pd
import numpy as np

sex_list = ['MALE', 'FEMALE', np.nan, 'FEMALE']
df = pd.DataFrame({'SEX': sex_list})
display(df)
# Output:
#       SEX
# 0    MALE
# 1  FEMALE
# 2     NaN
# 3  FEMALE

df = pd.get_dummies(df['SEX'], prefix='IS_SEX')
display(df)
# Output (the NaN row becomes all zeros):
#    IS_SEX_FEMALE  IS_SEX_MALE
# 0              0            1
# 1              1            0
# 2              0            0
# 3              1            0

3. Label binarization (LabelBinarizer)

1) Definition

Its functionality is the same as OneHotEncoder's, but while OneHotEncoder can only binarize numeric variables and cannot directly encode string-valued categorical variables, LabelBinarizer can binarize string variables directly.

2) Implementation code

from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
lb.fit([1, 2, 6, 4, 2])

print(lb.classes_)
# Output: array([1, 2, 4, 6])

print(lb.transform([1, 6]))
# Output: array([[1, 0, 0, 0],
#                [0, 0, 0, 1]])

print(lb.fit_transform(['yes', 'no', 'no', 'yes']))
# Output: array([[1],
#                [0],
#                [0],
#                [1]])

4. Multi-label binarization (MultiLabelBinarizer)

1) Definition

Used for label encoding; it generates a 0/1 matrix of size (n_examples × n_classes), where each sample may correspond to multiple labels.

2) When to use it

  • A feature contains multiple text tokens per value: a user-interest feature (e.g., the value "fitness movies music") is well suited to multi-label binarization, because each user can have several interests at the same time (see the sketch after this list).

  • Encoding multi-class label values: movie genre labels (e.g., [action, horror] and [romance, comedy]) need multi-label binarization first; the binarized values are then used as the label values of the training data.
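A sketch for the user-interest case (toy strings invented for the example): split each space-separated value into tokens, then binarize.

from sklearn.preprocessing import MultiLabelBinarizer

interests = ['fitness movies music', 'movies', 'music fitness']
tokenized = [s.split() for s in interests]  # one token list per user
mlb = MultiLabelBinarizer()
print(mlb.fit_transform(tokenized))
# [[1 1 1]
#  [0 1 0]
#  [1 0 1]]
print(mlb.classes_)  # ['fitness' 'movies' 'music']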

3) Implementation code

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
print(mlb.fit_transform([(1, 2), (3,)]))
# Output: array([[1, 1, 0],
#                [0, 0, 1]])

print(mlb.classes_)
# Output: array([1, 2, 3])

print(mlb.fit_transform([{'sci-fi', 'thriller'}, {'comedy'}]))
# Output: array([[0, 1, 1],
#                [1, 0, 0]])

print(list(mlb.classes_))
# Output: ['comedy', 'sci-fi', 'thriller']

5. Mean encoding

1) Definition

Mean encoding is a supervised encoding scheme for high-cardinality categorical features. It can be used when a categorical feature column contains a very large number of distinct categories (such as home addresses, which easily number in the tens of thousands).

Mean encoding uses the target variable to be predicted, under a Bayesian framework, to determine in a supervised way the encoding best suited to the categorical feature. In Kaggle competitions it is also a common way to improve scores.

For the details of the algorithm, see: Mean encoding: data preprocessing/feature engineering for high-cardinality categorical features.

2) Why use mean encoding

If a feature is categorical and has very many possible values (high cardinality), mean encoding is an efficient encoding scheme. In practice, this kind of feature engineering can greatly improve model performance.

Because a categorical feature indicates that a data point belongs to a particular category, categorical feature values are numerically represented as discrete integers, usually from 0 to n. Examples: petal color (red, yellow, blue), gender (male, female), address, or whether a feature column has a missing value (such NA indicator columns often provide useful extra information).

In general, for categorical features we only need sklearn's OneHotEncoder or LabelEncoder; such simple preprocessing satisfies most data mining algorithms. The cardinality of a categorical feature is the number of distinct values it can take. In the face of high-cardinality categorical features, these preprocessing methods often fail to give satisfactory results.
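To make the idea concrete, here is a minimal smoothed target-mean sketch in pandas. It is a simplification, not the MeanEncoder implementation from the referenced article; the column names and the smoothing weight m are assumptions.

import pandas as pd

df = pd.DataFrame({'city': ['a', 'a', 'b', 'b', 'b', 'c'],
                   'y':    [1,   0,   1,   1,   0,   1]})
prior = df['y'].mean()  # global target mean, acting as the Bayesian prior
stats = df.groupby('city')['y'].agg(['mean', 'count'])
m = 2.0  # smoothing weight (assumed): how strongly to pull toward the prior
smooth = (stats['count'] * stats['mean'] + m * prior) / (stats['count'] + m)
df['city_mean_enc'] = df['city'].map(smooth)
print(df)

In practice the encoding is usually computed on out-of-fold splits to avoid target leakage, which is part of what the MeanEncoder class below handles.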

3) Advantages

Compared with one-hot encoding, it saves memory, reduces the algorithm's computation time, and effectively improves model performance.

4) Implementation code

MeanEncodeFeature = ['item_city_id', 'item_brand_id']  # declare the features to mean-encode
ME = MeanEncoder(MeanEncodeFeature)  # instantiate the mean-encoder class
trans_train = ME.fit_transform(X, y)  # fit and transform the training set's X and y
test_trans = ME.transform(X_test)  # encode the test set

For the MeanEncoder source code, see: Mean encoding: data preprocessing/feature engineering for high-cardinality categorical features.

0xFF Summary

  1. Feature preprocessing is an important step of the data preprocessing pipeline; it is a standard treatment of the data, and nearly every data processing workflow involves it.

  2. Because tree models (Random Forest, GBDT, xgboost, etc.) are insensitive to the numeric magnitude of features, feature scaling and statistical transformations can be skipped;

    likewise, because tree models do not rely on sample distances to learn, categorical feature encoding can also be skipped (but string features cannot be fed in directly, so at least label encoding is required).

  3. For models that learn from sample distances (such as linear regression, SVM, and deep learning):

    • numeric features need feature scaling;

    • for features with long-tailed distributions, a statistical transformation can help the model optimize better;

    • for linear models, feature binning can improve the model's expressive power.

  4. Binning numeric features makes the model highly robust to abnormal data and more stable.

    Note that after binning, feature encoding is still required.

Corrections are welcome if you spot any errors~

References

[1] Data preprocessing in sklearn. http://d0evi1.com/sklearn/preprocessing/
[2] Normalization and standardization. https://ssjcoding.github.io/2019/03/27/normalization-and-standardization/
[3] Preprocessing Data: OneHotEncoder & LabelEncoder for categorical features. https://medium.com/ai%E5%8F%8D%E6%96%97%E5%9F%8E/preprocessing-data-onehotencoder-labelencoder-%E5%AF%A6%E4%BD%9C-968936124d59
[4] Mean encoding: data preprocessing/feature engineering for high-cardinality categorical features. https://zhuanlan.zhihu.com/p/26308272
[5] Feature engineering: binning. https://blog.csdn.net/Pylady/article/details/78882220
[6] https://www.leiphone.com/news/201801/T9JlyTOAMxFZvWly.html
