Data preprocessing - One-Hot Encoding and label encoding (LabelEncoder)

I. The origin of the problem

In many machine learning tasks, features are not always continuous values; they may also be categorical.

There are two cases when encoding discrete features:

1. The values of the discrete feature have no ordinal relationship, for example color: [red, blue]; use one-hot encoding.

2. The values of the discrete feature do have an ordinal relationship, for example size: [X, XL, XXL]; use a value mapping such as {X: 1, XL: 2, XXL: 3}.

Pandas makes it easy to encode such discrete features; the example below maps the ordinal size feature and the class label to integers:

import pandas as pd
df = pd.DataFrame([
            ['green', 'M', 10.1, 'class1'], 
            ['red', 'L', 13.5, 'class2'], 
            ['blue', 'XL', 15.3, 'class1']])
 
df.columns = ['color', 'size', 'prize', 'class label']
 
size_mapping = {
           'XL': 3,
           'L': 2,
           'M': 1}
df['size'] = df['size'].map(size_mapping)
 
class_mapping = {label:idx for idx,label in enumerate(set(df['class label']))}
df['class label'] = df['class label'].map(class_mapping)
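
Instead of building class_mapping by hand, the same label encoding can be done with sklearn's LabelEncoder (the label encoding mentioned in the title). A minimal sketch, assuming the DataFrame df defined above, applied in place of the manual mapping step:

from sklearn.preprocessing import LabelEncoder

# equivalent to the manual class_mapping above: assign an integer to each class label
# (the exact integers assigned may differ from the dictionary version)
class_le = LabelEncoder()
df['class label'] = class_le.fit_transform(df['class label'].values)

# the original string labels can be recovered with inverse_transform
print(class_le.inverse_transform(df['class label'].values))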

II. One-hot encoding

To handle unordered categorical features, one possible solution is one-hot encoding (One-Hot Encoding), also known as one-of-N encoding. It uses an N-bit status register to encode N states: each bit is independent of the others, and at any time only one bit is active.

  • Natural state encoding: 000, 001, 010, 011, 100, 101

    One-hot encoding: 000001, 000010, 000100, 001000, 010000, 100000

In other words, for a feature with m possible values, one-hot encoding turns it into m binary features (for example, a feature with values good/medium/poor becomes 100, 010, 001). These binary features are mutually exclusive, and only one is active at a time, so the data becomes sparse.

The benefits of doing so are:

  • It solves the problem that classifiers do not handle categorical attribute data well
  • To a certain extent it also plays a role in expanding the features

Implementation method one: the pandas get_dummies method

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False)
This method converts categorical variables into new dummy/indicator variables.

  Common parameters


data : array-like, Series, or DataFrame
The input data
prefix : string, list of strings, or dict of strings, default None
Prefix for the column names created by get_dummies
columns : list-like, default None
The column names to be converted to dummy variables
dummy_na : bool, default False
Add a column to indicate missing values; if False, missing values are ignored
drop_first : bool, default False
Keep k-1 out of k category levels by removing the first level

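For reference, a minimal sketch of get_dummies applied to the example DataFrame from section I (the exact column order may vary by pandas version):

import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']],
                  columns=['color', 'size', 'prize', 'class label'])

# one-hot encode only the 'color' column; the other columns pass through unchanged
dummies = pd.get_dummies(df, columns=['color'], prefix='color')
print(dummies.columns.tolist())
# expected: ['size', 'prize', 'class label', 'color_blue', 'color_green', 'color_red']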

Implementation method two: sklearn

from sklearn import preprocessing
enc = preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])    # fit learns the encoding
enc.transform([[0, 1, 3]]).toarray()    # apply the encoding

Output: array([[ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])

The fitted data is a 4 x 3 matrix: four samples with three feature dimensions.

Looking at the fitted data matrix:

0 0 3
1 1 0
0 2 1
1 0 2

The first column (first feature) takes two values, 0 and 1, so its one-hot codes are 10 and 01.

The second column (second feature) takes three values, 0, 1 and 2, so its codes are 100, 010 and 001.

The third column (third feature) takes four values, 0, 1, 2 and 3, so its codes are 1000, 0100, 0010 and 0001.

Now look at the sample to be encoded, [0, 1, 3]: the 0 in the first feature is encoded as 10, the 1 in the second feature as 010, and the 3 in the third feature as 0001. Concatenating them gives 10 010 0001, i.e. [1, 0, 0, 1, 0, 0, 0, 0, 1].
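
In recent sklearn versions the categories learned for each input column can be inspected directly (a small check, assuming a version that exposes the categories_ attribute):

# one array of distinct values per input column: 2 + 3 + 4 = 9 one-hot dimensions in total
print(enc.categories_)
# [array([0, 1]), array([0, 1, 2]), array([0, 1, 2, 3])]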

III. Why use one-hot encoding?

As mentioned above, one-hot encoding (creating dummy variables) is used because most algorithms compute with vectors in a metric space; the encoding makes the values of a non-ordinal variable carry no partial order and be equidistant from the origin. One-hot encoding extends the values of a discrete feature into Euclidean space: each value of the feature corresponds to a point in that space. This makes distance calculations between features more reasonable. After one-hot encoding, each dimension of the encoded feature can in fact be treated as a continuous feature, so it can be normalized just like any continuous feature, for example to [-1, 1] or to zero mean and unit variance.
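
To illustrate the "equidistant" point with a small, purely illustrative sketch: with integer label encoding (red=0, green=1, blue=2), red appears twice as far from blue as from green, an ordering that does not really exist; after one-hot encoding, every pair of colors is the same distance apart.

import numpy as np

# integer (label) encoding imposes an artificial order on the colors
red, green, blue = 0, 1, 2
print(abs(red - green), abs(red - blue))    # 1 2 -> red looks "closer" to green

# one-hot encoding: every pair of categories is equidistant in Euclidean space
red_oh, green_oh, blue_oh = np.eye(3)
print(np.linalg.norm(red_oh - green_oh))    # 1.414...
print(np.linalg.norm(red_oh - blue_oh))     # 1.414...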

Why map feature vectors to Euclidean space?

  • Discrete features are mapped to Euclidean space via one-hot encoding because, in machine learning algorithms such as regression, classification and clustering, the calculation of distance or similarity between features is very important, and the distance and similarity measures we commonly use are computed in Euclidean space; cosine similarity, for example, is based on Euclidean space.

IV. Advantages and disadvantages of one-hot encoding

Advantages: one-hot encoding solves the problem that classifiers cannot easily handle categorical attributes, and to a certain extent it also expands the feature space. Its values are only 0 and 1, and the different categories are stored in separate dimensions.
Cons: when the number of categories is large, the feature space becomes very large. In this case, PCA is generally used to reduce the dimensionality, and the combination of one-hot encoding + PCA is indeed useful in practice (see the sketch below).
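
A minimal sketch of the one-hot + PCA combination mentioned above (the data and the number of components are purely illustrative):

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA

X = np.array([[0, 0, 3],
              [1, 1, 0],
              [0, 2, 1],
              [1, 0, 2]])

# one-hot expands the 3 columns into 2 + 3 + 4 = 9 binary columns
X_onehot = OneHotEncoder().fit_transform(X).toarray()

# PCA then compresses the sparse one-hot representation back to fewer dimensions
X_reduced = PCA(n_components=3).fit_transform(X_onehot)
print(X_onehot.shape, X_reduced.shape)    # (4, 9) (4, 3)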

V. When should one-hot encoding (not) be used?

Use it: one-hot encoding is used to handle the discrete values of categorical data.
Don't use it: the purpose of one-hot encoding a discrete feature is to make distance calculations more reasonable; if distances on the discrete feature can already be computed reasonably without one-hot encoding, there is no need for it. Some tree-based algorithms do not measure in a vector space when handling variables; a value is just a category symbol with no partial order, so one-hot encoding is not needed. Tree models have little need for one-hot encoding: for a decision tree, one-hot essentially just increases the depth of the tree (see the sketch below).
  In general, if the number of categories is not large, one-hot encoding is recommended as the first choice.
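
A sketch of the tree-model case mentioned above (toy data; names and values are illustrative): a decision tree can split directly on integer category codes, so one-hot encoding is not required.

from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# toy data: one categorical feature and a binary target
colors = ['red', 'green', 'blue', 'red', 'blue', 'green']
y = [0, 1, 1, 0, 1, 1]

# label-encode the category to integers; no one-hot needed for a tree model
X = LabelEncoder().fit_transform(colors).reshape(-1, 1)

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y)
print(tree.predict(X))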

VI. When is normalization (not) needed?

Needed: models based on parameters or based on distances require feature normalization (see the sketch below).
Not needed: tree-based methods do not require feature normalization, for example random forests, bagging and boosting.
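
For the first case, a minimal sketch of zero-mean, unit-variance normalization with sklearn (the data is illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10.1, 1.0],
              [13.5, 2.0],
              [15.3, 3.0]])

# scale each column to zero mean and unit variance before a distance- or parameter-based model
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0), X_std.std(axis=0))    # approximately [0, 0] and [1, 1]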

VII. Why one-hot encoding can solve the problem of discrete categorical values

First, one-hot encoding is a method that uses an N-bit status register to encode N states.
  e.g. high, medium and low cannot be separated with a single code; after one-hot encoding they become three separate, mutually independent events.
This is similar to an SVM: features that are originally linearly inseparable can become separable after being projected into a higher-dimensional space.
  GBDT does not perform well on high-dimensional sparse matrices, and even on low-dimensional sparse matrices it is not necessarily better than an SVM.

VIII. Tree models have little need for one-hot encoding

For a decision tree, one-hot encoding essentially just increases the depth of the tree.
  A tree model dynamically generates a mechanism similar to One-Hot + Feature Crossing during training:
    1. One or several features are ultimately converted into a leaf node as the encoding; one-hot encoding can be understood as independent events (as in the high/medium/low example above).
    2. A decision tree has no concept of feature magnitude, only of which part of the feature's distribution a value falls into.
  One-hot encoding can help make the problem linearly separable, but it is not necessarily better than label encoding.
  A disadvantage of reducing the dimensionality of one-hot features:

Features that could be crossed before dimensionality reduction may no longer be crossable after it.


Original article: blog.csdn.net/lgy54321/article/details/94412313