One-Hot Encoding

Copyright notice: this is the blogger's original article and may not be reproduced without the blogger's permission. https://blog.csdn.net/u013317445/article/details/84983569

One-hot encoding: the standard approach for categorical features

Categorical feature: for example, the color of flowers: yellow, red, green.


One-hot encoding: an encoding scheme that uses as many bits as there are states (category values); exactly one bit is 1 and all the others are 0.
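As a minimal sketch of that definition, each category value can be mapped to a vector with a single 1. The helper name `one_hot` and the category order are illustrative assumptions, not part of any library:

```python
def one_hot(value, categories):
    """Return a list with one bit per category; only the matching bit is 1."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


# Hypothetical category order for the flower-color example.
colors = ["yellow", "red", "green"]

for color in colors:
    print(color, one_hot(color, colors))
```

Three category values give three bits per sample, which is why the encoded frame gets wider as the number of categories grows.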

Pandas offers a convenient function called get_dummies to get one-hot encodings. Call it like this:

one_hot_encoded_data = pd.get_dummies(data)
help(pd.get_dummies)
Help on function get_dummies in module pandas.core.reshape.reshape:

get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
    Convert categorical variable into dummy/indicator variables
    
    Parameters
    ----------
    data : array-like, Series, or DataFrame
    prefix : string, list of strings, or dict of strings, default None
        String to append DataFrame column names.
        Pass a list with length equal to the number of columns
        when calling get_dummies on a DataFrame. Alternatively, `prefix`
        can be a dictionary mapping column names to prefixes.
    prefix_sep : string, default '_'
        If appending prefix, separator/delimiter to use. Or pass a
        list or dictionary as with `prefix.`
    dummy_na : bool, default False
        Add a column to indicate NaNs, if False NaNs are ignored.
    columns : list-like, default None
        Column names in the DataFrame to be encoded.
        If `columns` is None then all the columns with
        `object` or `category` dtype will be converted.
    sparse : bool, default False
        Whether the dummy columns should be sparse or not.  Returns
        SparseDataFrame if `data` is a Series or if all columns are included.
        Otherwise returns a DataFrame with some SparseBlocks.
    drop_first : bool, default False
        Whether to get k-1 dummies out of k categorical levels by removing the
        first level.
    
        .. versionadded:: 0.18.0
    
    dtype : dtype, default np.uint8
        Data type for new columns. Only a single dtype is allowed.
    
        .. versionadded:: 0.23.0
    
    Returns
    -------
    dummies : DataFrame or SparseDataFrame
    
    Examples
    --------
    >>> import pandas as pd
    >>> s = pd.Series(list('abca'))
    
    >>> pd.get_dummies(s)
       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0
    
    >>> s1 = ['a', 'b', np.nan]
    
    >>> pd.get_dummies(s1)
       a  b
    0  1  0
    1  0  1
    2  0  0
    
    >>> pd.get_dummies(s1, dummy_na=True)
       a  b  NaN
    0  1  0    0
    1  0  1    0
    2  0  0    1
    
    >>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
    ...                    'C': [1, 2, 3]})
    
    >>> pd.get_dummies(df, prefix=['col1', 'col2'])
       C  col1_a  col1_b  col2_a  col2_b  col2_c
    0  1       1       0       0       1       0
    1  2       0       1       1       0       0
    2  3       1       0       0       0       1
    
    >>> pd.get_dummies(pd.Series(list('abcaa')))
       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0
    4  1  0  0
    
    >>> pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
       b  c
    0  0  0
    1  1  0
    2  0  1
    3  0  0
    4  0  0
    
    >>> pd.get_dummies(pd.Series(list('abc')), dtype=float)
         a    b    c
    0  1.0  0.0  0.0
    1  0.0  1.0  0.0
    2  0.0  0.0  1.0
    
    See Also
    --------
    Series.str.get_dummies
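
The help text above comes from an older pandas release, where the dummy columns default to np.uint8. From pandas 2.0 on the default is bool, so the same calls print True/False. A small sketch with made-up data that passes dtype=int to get the 0/1 form in either case:

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({"color": ["yellow", "red", "green", "red"],
                   "size": [1.0, 2.5, 0.7, 1.2]})

# Encode only the listed column; numeric columns pass through unchanged.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)

# drop_first=True keeps k-1 indicator columns for a feature with k values.
reduced = pd.get_dummies(df, columns=["color"], dtype=int, drop_first=True)
print(list(reduced.columns))
```

The untouched columns come first in the result, followed by the indicator columns in sorted order of the category values.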

align:

final_train_predictors, final_test_predictors= one_hot_encoded_training_data_predictors.align(one_hot_encoded_test_data_predictors, join='left',axis=1, fill_value=0)
#axis=1:columns
#join='left' : keep exactly the columns from our training data
#fill_value=0: fill positions that have no value after alignment with 0; the default fill is NaN
#align
help(one_hot_encoded_X.align)
Help on method align in module pandas.core.frame:

align(self, other, join='outer', axis=None, level=None, copy=True, fill_value=None, method=None, limit=None, fill_axis=0, broadcast_axis=None) method of pandas.core.frame.DataFrame instance
    Align two objects on their axes with the
    specified join method for each axis Index
    
    Parameters
    ----------
    other : DataFrame or Series
    join : {'outer', 'inner', 'left', 'right'}, default 'outer'
    axis : allowed axis of the other object, default None
        Align on index (0), columns (1), or both (None)
    level : int or level name, default None
        Broadcast across a level, matching Index values on the
        passed MultiIndex level
    copy : boolean, default True
        Always returns new objects. If copy=False and no reindexing is
        required then original objects are returned.
    fill_value : scalar, default np.NaN
        Value to use for missing values. Defaults to NaN, but can be any
        "compatible" value
    method : str, default None
    limit : int, default None
    fill_axis : {0 or 'index', 1 or 'columns'}, default 0
        Filling axis, method and limit
    broadcast_axis : {0 or 'index', 1 or 'columns'}, default None
        Broadcast values along this axis, if aligning two objects of
        different dimensions
    
    Returns
    -------
    (left, right) : (DataFrame, type of other)
        Aligned objects
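
To make the effect of join='left', axis=1 and fill_value=0 concrete, here is a small sketch on made-up frames (the column names are illustrative only):

```python
import pandas as pd

# Made-up one-hot frames: the test frame lacks "color_red" and has an extra column.
train = pd.DataFrame({"size": [1.0, 2.0], "color_green": [1, 0], "color_red": [0, 1]})
test = pd.DataFrame({"size": [3.0], "color_green": [1], "color_blue": [0]})

# join='left' keeps exactly the columns of `train`, in the same order;
# axis=1 aligns on columns; fill_value=0 fills the columns `test` did not have.
aligned_train, aligned_test = train.align(test, join="left", axis=1, fill_value=0)

print(list(aligned_test.columns))
print(aligned_test)
```

The extra test column is dropped and the missing one is added with 0, so both frames end up with the same columns in the same order.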

Example: watermelon dataset 3.0

# -*- coding: utf-8 -*-
import pandas as pd
watermelon_data= pd.read_csv(r'G:\kaggle\watermelon_3.csv')
watermelon_data
编号 色泽 根蒂 敲声 纹理 脐部 触感 密度 含糖率 好瓜
0 1 青绿 蜷缩 浊响 清晰 凹陷 硬滑 0.697 0.460
1 2 乌黑 蜷缩 沉闷 清晰 凹陷 硬滑 0.774 0.376
2 3 乌黑 蜷缩 浊响 清晰 凹陷 硬滑 0.634 0.264
3 4 青绿 蜷缩 沉闷 清晰 凹陷 硬滑 0.608 0.318
4 5 浅白 蜷缩 浊响 清晰 凹陷 硬滑 0.556 0.215
5 6 青绿 稍蜷 浊响 清晰 稍凹 软粘 0.403 0.237
6 7 乌黑 稍蜷 浊响 稍糊 稍凹 软粘 0.481 0.149
7 8 乌黑 稍蜷 浊响 清晰 稍凹 硬滑 0.437 0.211
8 9 乌黑 稍蜷 沉闷 稍糊 稍凹 硬滑 0.666 0.091
9 10 青绿 硬挺 清脆 清晰 平坦 软粘 0.243 0.267
10 11 浅白 硬挺 清脆 模糊 平坦 硬滑 0.245 0.057
11 12 浅白 蜷缩 浊响 模糊 平坦 软粘 0.343 0.099
12 13 青绿 稍蜷 浊响 稍糊 凹陷 硬滑 0.639 0.161
13 14 浅白 稍蜷 沉闷 稍糊 凹陷 硬滑 0.657 0.198
14 15 乌黑 稍蜷 浊响 清晰 稍凹 软粘 0.360 0.370
15 16 浅白 蜷缩 浊响 模糊 平坦 硬滑 0.593 0.042
16 17 青绿 蜷缩 沉闷 稍糊 稍凹 硬滑 0.719 0.103

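Problem encountered: the Chinese text read from the file came out garbled.
Solution: open the CSV file in Notepad, choose Save As, and select UTF-8 in the encoding option that appears.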
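
An alternative to re-saving the file is to tell pandas which encoding the file uses, through the encoding parameter of read_csv. The sketch below assumes the original file was saved as GBK (common on Simplified Chinese Windows, but only an assumption here) and writes a tiny temporary file so that it is self-contained:

```python
import os
import tempfile

import pandas as pd

# Write a tiny GBK-encoded CSV to a temporary file (stand-in for the real data).
path = os.path.join(tempfile.mkdtemp(), "sample_gbk.csv")
with open(path, "w", encoding="gbk") as f:
    f.write("编号,色泽\n1,青绿\n2,乌黑\n")

# Passing the file's actual encoding avoids garbled text or a decode error.
sample = pd.read_csv(path, encoding="gbk")
print(sample)
```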

watermelon_data.dtypes
编号       int64
色泽      object
根蒂      object
敲声      object
纹理      object
脐部      object
触感      object
密度     float64
含糖率    float64
好瓜      object
dtype: object
#number of distinct values in 色泽
watermelon_data['色泽'].nunique()
3
#check whether the values in 色泽 are of a non-numeric type
watermelon_data['色泽'].dtype=="object"
True
watermelon_data['含糖率'].dtype=="float"
True
#target
y=watermelon_data['好瓜']
#X
X=watermelon_data.drop(['编号','好瓜'], axis=1)
X

#----------or------------
#features=...
#X= watermelon_data[features]
色泽 根蒂 敲声 纹理 脐部 触感 密度 含糖率
0 青绿 蜷缩 浊响 清晰 凹陷 硬滑 0.697 0.460
1 乌黑 蜷缩 沉闷 清晰 凹陷 硬滑 0.774 0.376
2 乌黑 蜷缩 浊响 清晰 凹陷 硬滑 0.634 0.264
3 青绿 蜷缩 沉闷 清晰 凹陷 硬滑 0.608 0.318
4 浅白 蜷缩 浊响 清晰 凹陷 硬滑 0.556 0.215
5 青绿 稍蜷 浊响 清晰 稍凹 软粘 0.403 0.237
6 乌黑 稍蜷 浊响 稍糊 稍凹 软粘 0.481 0.149
7 乌黑 稍蜷 浊响 清晰 稍凹 硬滑 0.437 0.211
8 乌黑 稍蜷 沉闷 稍糊 稍凹 硬滑 0.666 0.091
9 青绿 硬挺 清脆 清晰 平坦 软粘 0.243 0.267
10 浅白 硬挺 清脆 模糊 平坦 硬滑 0.245 0.057
11 浅白 蜷缩 浊响 模糊 平坦 软粘 0.343 0.099
12 青绿 稍蜷 浊响 稍糊 凹陷 硬滑 0.639 0.161
13 浅白 稍蜷 沉闷 稍糊 凹陷 硬滑 0.657 0.198
14 乌黑 稍蜷 浊响 清晰 稍凹 软粘 0.360 0.370
15 浅白 蜷缩 浊响 模糊 平坦 硬滑 0.593 0.042
16 青绿 蜷缩 沉闷 稍糊 稍凹 硬滑 0.719 0.103
#handle categorical features: one-hot encoding
one_hot_encoded_X= pd.get_dummies(X)
one_hot_encoded_X
密度 含糖率 色泽_乌黑 色泽_浅白 色泽_青绿 根蒂_硬挺 根蒂_稍蜷 根蒂_蜷缩 敲声_沉闷 敲声_浊响 敲声_清脆 纹理_模糊 纹理_清晰 纹理_稍糊 脐部_凹陷 脐部_平坦 脐部_稍凹 触感_硬滑 触感_软粘
0 0.697 0.460 0 0 1 0 0 1 0 1 0 0 1 0 1 0 0 1 0
1 0.774 0.376 1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0
2 0.634 0.264 1 0 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0
3 0.608 0.318 0 0 1 0 0 1 1 0 0 0 1 0 1 0 0 1 0
4 0.556 0.215 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0
5 0.403 0.237 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 0 1
6 0.481 0.149 1 0 0 0 1 0 0 1 0 0 0 1 0 0 1 0 1
7 0.437 0.211 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 1 0
8 0.666 0.091 1 0 0 0 1 0 1 0 0 0 0 1 0 0 1 1 0
9 0.243 0.267 0 0 1 1 0 0 0 0 1 0 1 0 0 1 0 0 1
10 0.245 0.057 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0 1 0
11 0.343 0.099 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 1
12 0.639 0.161 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0
13 0.657 0.198 0 1 0 0 1 0 1 0 0 0 0 1 1 0 0 1 0
14 0.360 0.370 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1
15 0.593 0.042 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 1 0
16 0.719 0.103 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 1 0
from sklearn.tree import DecisionTreeClassifier
#model
model= DecisionTreeClassifier()
model
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
#fit
clf= model.fit(one_hot_encoded_X, y)
clf
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
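#predict [青绿,蜷缩,沉闷,清晰,凹陷,硬滑,0.608,0.300]
#caution: one_hot_encoded_X has 密度 and 含糖率 as its first two columns, but this hand-typed vector puts them last,
#so the values do not line up with the training columns (its 敲声 bits 0,1,0 would also mean 浊响, not 沉闷);
#the output below is from the call exactly as written
clf.predict([[0,0,1,0,0,1,0,1,0,0,1,0,1,0,0,1,0,0.608,0.300]])
array(['\xe5\x90\xa6'], dtype=object)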
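
Typing the encoded vector by hand is error-prone, because every value has to sit in the same position as the matching column of one_hot_encoded_X. A sketch of a less fragile route, using a made-up two-feature frame in place of the watermelon data: encode the new sample with get_dummies, then reindex it to the training columns so that indicator columns the sample lacks become 0.

```python
import pandas as pd

# Made-up training features standing in for X.
X_small = pd.DataFrame({"色泽": ["青绿", "乌黑", "浅白"],
                        "密度": [0.697, 0.774, 0.556]})
train_encoded = pd.get_dummies(X_small, dtype=int)

# One new sample, written with the original (unencoded) feature values.
new_sample = pd.DataFrame({"色泽": ["青绿"], "密度": [0.608]})
new_encoded = pd.get_dummies(new_sample, dtype=int)

# Reorder to the training columns; indicator columns the sample lacks become 0.
new_row = new_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(new_row)
# With the real data, a row built this way could be passed to clf.predict.
```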

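How can the Chinese labels be displayed in readable form?

What if there are several files (a test dataset, or other data to run predictions on)? scikit-learn is sensitive to the order of the columns, so if the training and test sets are not aligned, the results will be meaningless.
This can happen if a categorical feature has a different number of values in the training data than in the test data.
How can we make sure the test data is encoded the same way as the training data?
Suppose we have:
test data: watermelon_3_test.csv (CSV file encoding: UTF-8)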

import pandas as pd
watermelon_test_data= pd.read_csv(r'G:\kaggle\watermelon_3_test.csv')
watermelon_test_data= watermelon_test_data.drop(['编号'], axis=1)
watermelon_test_data
#note: 纹理 has only 2 values in the test file but 3 in the training data, so the one-hot encoded columns will not match
色泽 根蒂 敲声 纹理 脐部 触感 密度 含糖率
0 乌黑 蜷缩 浊响 清晰 凹陷 硬滑 0.697 0.460
1 乌黑 蜷缩 沉闷 清晰 凹陷 硬滑 0.774 0.376
2 乌黑 蜷缩 浊响 清晰 凹陷 硬滑 0.611 0.264
3 青绿 蜷缩 沉闷 清晰 凹陷 硬滑 0.608 0.318
4 青绿 稍蜷 浊响 稍糊 稍凹 硬滑 0.639 0.172
5 浅白 稍蜷 沉闷 稍糊 凹陷 硬滑 0.657 0.198
6 乌黑 稍蜷 浊响 清晰 稍凹 软粘 0.360 0.370
one_hot_encoded_watermelon_test_data= pd.get_dummies(watermelon_test_data)
one_hot_encoded_watermelon_test_data
密度 含糖率 色泽_乌黑 色泽_浅白 色泽_青绿 根蒂_稍蜷 根蒂_蜷缩 敲声_沉闷 敲声_浊响 纹理_清晰 纹理_稍糊 脐部_凹陷 脐部_稍凹 触感_硬滑 触感_软粘
0 0.697 0.460 1 0 0 0 1 0 1 1 0 1 0 1 0
1 0.774 0.376 1 0 0 0 1 1 0 1 0 1 0 1 0
2 0.611 0.264 1 0 0 0 1 0 1 1 0 1 0 1 0
3 0.608 0.318 0 0 1 0 1 1 0 1 0 1 0 1 0
4 0.639 0.172 0 0 1 1 0 0 1 0 1 0 1 1 0
5 0.657 0.198 0 1 0 1 0 1 0 0 1 1 0 1 0
6 0.360 0.370 1 0 0 1 0 0 1 1 0 0 1 0 1
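
One way to see the mismatch directly is to compare the column indexes of the two encoded frames. A sketch with made-up frames (with the real data, the frames would be one_hot_encoded_X and one_hot_encoded_watermelon_test_data):

```python
import pandas as pd

# Made-up stand-ins: the test data lacks one of the three 纹理 values.
train_raw = pd.DataFrame({"纹理": ["清晰", "稍糊", "模糊"]})
test_raw = pd.DataFrame({"纹理": ["清晰", "稍糊"]})

train_enc = pd.get_dummies(train_raw, dtype=int)
test_enc = pd.get_dummies(test_raw, dtype=int)

# Columns the model was trained on that the test encoding lacks.
missing_in_test = train_enc.columns.difference(test_enc.columns)
print(list(missing_in_test))
```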

How do we keep the encoding of one_hot_encoded_watermelon_test_data (from the test CSV) aligned with that of the training set one_hot_encoded_X?

final_train, final_test= one_hot_encoded_X.align(one_hot_encoded_watermelon_test_data, join='left',axis=1, fill_value=0)
#axis=1:columns
#join='left' : keep exactly the columns from our training data
#fill_value=0: fill positions that have no value after alignment with 0; the default fill is NaN
final_train
密度 含糖率 色泽_乌黑 色泽_浅白 色泽_青绿 根蒂_硬挺 根蒂_稍蜷 根蒂_蜷缩 敲声_沉闷 敲声_浊响 敲声_清脆 纹理_模糊 纹理_清晰 纹理_稍糊 脐部_凹陷 脐部_平坦 脐部_稍凹 触感_硬滑 触感_软粘
0 0.697 0.460 0 0 1 0 0 1 0 1 0 0 1 0 1 0 0 1 0
1 0.774 0.376 1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0
2 0.634 0.264 1 0 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0
3 0.608 0.318 0 0 1 0 0 1 1 0 0 0 1 0 1 0 0 1 0
4 0.556 0.215 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0
5 0.403 0.237 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 0 1
6 0.481 0.149 1 0 0 0 1 0 0 1 0 0 0 1 0 0 1 0 1
7 0.437 0.211 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 1 0
8 0.666 0.091 1 0 0 0 1 0 1 0 0 0 0 1 0 0 1 1 0
9 0.243 0.267 0 0 1 1 0 0 0 0 1 0 1 0 0 1 0 0 1
10 0.245 0.057 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0 1 0
11 0.343 0.099 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 1
12 0.639 0.161 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0
13 0.657 0.198 0 1 0 0 1 0 1 0 0 0 0 1 1 0 0 1 0
14 0.360 0.370 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1
15 0.593 0.042 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 1 0
16 0.719 0.103 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 1 0
final_test
密度 含糖率 色泽_乌黑 色泽_浅白 色泽_青绿 根蒂_硬挺 根蒂_稍蜷 根蒂_蜷缩 敲声_沉闷 敲声_浊响 敲声_清脆 纹理_模糊 纹理_清晰 纹理_稍糊 脐部_凹陷 脐部_平坦 脐部_稍凹 触感_硬滑 触感_软粘
0 0.697 0.460 1 0 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0
1 0.774 0.376 1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0
2 0.611 0.264 1 0 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0
3 0.608 0.318 0 0 1 0 0 1 1 0 0 0 1 0 1 0 0 1 0
4 0.639 0.172 0 0 1 0 1 0 0 1 0 0 0 1 0 0 1 1 0
5 0.657 0.198 0 1 0 0 1 0 1 0 0 0 0 1 1 0 0 1 0
6 0.360 0.370 1 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1
clf.predict(final_test)
array(['\xe6\x98\xaf', '\xe6\x98\xaf', '\xe6\x98\xaf', '\xe6\x98\xaf',
       '\xe5\x90\xa6', '\xe5\x90\xa6', '\xe5\x90\xa6'], dtype=object)
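
The escaped strings in the output above are UTF-8 bytes, which is how Python 2 shows the repr of byte strings: '\xe6\x98\xaf' is 是 and '\xe5\x90\xa6' is 否. A minimal sketch of decoding them (in Python 3, where strings are Unicode by default, labels read from a UTF-8 file would normally print as readable text without this step):

```python
# The two distinct byte strings copied from the prediction output above.
raw_labels = [b"\xe6\x98\xaf", b"\xe5\x90\xa6"]

# Decode each UTF-8 byte string into readable text.
decoded = [label.decode("utf-8") for label in raw_labels]
print(decoded)
```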
