Data preprocessing (machine learning, sklearn): the difference between one-hot encoding (OneHotEncoder) and label encoding (LabelEncoder)


✌ One-hot encoding and LabelEncoder label encoding

1. ✌ Introduction

In feature engineering we sometimes need one of two encoders: OneHotEncoder or LabelEncoder.
Both exist to handle categorical features whose values are not numeric.
Take gender, with the values male and female. These two values cannot be fed into a model directly, so they have to be encoded as numbers.
For example:

feature   code
male      1
female    0
female    0
male      1
female    0
male      1

LabelEncoder converts the categories into the integer codes 0 and 1. With three categories, the codes become 0, 1, and 2.
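A minimal sketch of this mapping on a toy gender column (the list values are illustrative, not taken from the dataset used later). It relies on the fact that LabelEncoder sorts the unique labels and uses each label's position as its code:

```python
from sklearn.preprocessing import LabelEncoder

# Toy gender column (illustrative values only)
sex = ["male", "female", "female", "male", "female", "male"]

encoder = LabelEncoder()
codes = encoder.fit_transform(sex)

# classes_ holds the sorted unique labels; a label's index is its integer code
print(list(encoder.classes_))  # ['female', 'male']
print(codes.tolist())          # [1, 0, 0, 1, 0, 1]

# inverse_transform maps the codes back to the original labels
restored = encoder.inverse_transform(codes).tolist()
print(restored)
```

Because the codes follow sorted order, female becomes 0 and male becomes 1 here, which happens to match the table.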

OneHotEncoder, by contrast, converts the feature into a matrix with one column per category:

feature   Sex_male   Sex_female
male      1          0
female    0          1
female    0          1
male      1          0
female    0          1
male      1          0
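The same toy column can be passed through OneHotEncoder as a rough illustration. The data is again made up for the example. Note that sklearn orders the output columns by sorted category, so the female column comes first here, unlike the table:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects 2-D input: one row per sample, one column per feature
sex = np.array(["male", "female", "female", "male", "female", "male"]).reshape(-1, 1)

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() makes it dense for display
onehot = encoder.fit_transform(sex).toarray()

# categories_ lists the sorted categories per input feature: female, then male
print(encoder.categories_)
print(onehot)
```

Each row contains exactly one 1, in the column of that sample's category.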

Both methods produce a numeric encoding, so what is the difference between them?

  • With LabelEncoder, the feature stays one-dimensional, but the categories become integer codes such as 0, 1, 2, and 3.
  • OneHotEncoder generates one indicator column per category, and the resulting vectors are linearly independent.
    Suppose red, blue, and green are encoded as 0, 1, and 2. The codes introduce mathematical relationships: green is greater than red, and the mean of green and red equals blue. The categories are mutually independent, and no such relationship existed before the transformation.
    OneHotEncoder avoids this problem because each category gets its own vector. The cost is that with many categories the feature dimension grows sharply, which wastes resources, lengthens training time, and makes the matrix very sparse. In that case it can sometimes be combined with PCA to reduce the dimensionality.
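The artificial relationship can be made concrete by comparing distances between categories. The sketch below hard-codes the mapping assumed in the text (red=0, blue=1, green=2) rather than calling LabelEncoder, which would assign codes in alphabetical order instead:

```python
import numpy as np

# Integer codes as assumed in the text above (not LabelEncoder's sorted order)
label_codes = {"red": 0, "blue": 1, "green": 2}

# One-hot vectors: one indicator position per category
onehot = {
    "red": np.array([1, 0, 0]),
    "blue": np.array([0, 1, 0]),
    "green": np.array([0, 0, 1]),
}

pairs = [("red", "blue"), ("blue", "green"), ("red", "green")]

# With integer codes, red and green are twice as far apart as red and blue
label_dist = {p: abs(label_codes[p[0]] - label_codes[p[1]]) for p in pairs}

# With one-hot vectors, every pair of categories is equally far apart
onehot_dist = {p: float(np.linalg.norm(onehot[p[0]] - onehot[p[1]])) for p in pairs}

for p in pairs:
    print(p, label_dist[p], round(onehot_dist[p], 3))
```

A distance-based or linear model sees the unequal gaps in the integer codes as real structure, while the one-hot form gives it no ordering to exploit.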

2. ✌ Code test

2.1 ✌ Import related libraries

import numpy as np
import pandas as pd
# Import the SVC model
from sklearn.svm import SVC
# Import the scoring metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
# Encoders
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# Cross-validation
from sklearn.model_selection import cross_val_score

2.2 ✌ Read data

data=pd.read_csv('Narrativedata.csv',index_col=0)
data

2.3 ✌ View missing values

data.isnull().sum()

2.4 ✌ Use median to fill in age

# Assign the result back; inplace=True on a selected column may not update data in recent pandas
data['Age']=data['Age'].fillna(data['Age'].median())
data.isnull().sum()
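To see what the median fill does, here is a small sketch on a made-up Age column (the toy DataFrame is hypothetical and stands in for the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical Age column with two missing values
toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0, np.nan]})

# median() skips NaN by default, so it is computed from 22, 26 and 38
median_age = toy["Age"].median()

# Replace the missing entries with the median and assign the column back
toy["Age"] = toy["Age"].fillna(median_age)

print(median_age)               # 26.0
print(toy["Age"].isnull().sum())  # 0
```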

2.5 ✌ Drop the rows with missing Embarked values

data.dropna(inplace=True)
data.isnull().sum()

2.6 ✌ View the category of each feature

display(np.unique(data['Sex']))
display(np.unique(data['Embarked']))
display(np.unique(data['Survived']))
x=data.drop(columns=['Survived'])
y=data['Survived']

2.7 ✌ Encode the label with LabelEncoder

from sklearn.preprocessing import LabelEncoder
y=LabelEncoder().fit_transform(y)
y

2.8 ✌ Use pandas' dummy variable processing

y=data['Survived']

y=pd.get_dummies(y)
y

2.9 ✌ Dummy variable processing for features

x=pd.get_dummies(x.drop(columns=['Age']))
x

2.10 ✌ One-hot encoding of features

from sklearn.preprocessing import OneHotEncoder

x=data.drop(columns=['Survived','Age'])
x=OneHotEncoder().fit_transform(x).toarray()
pd.DataFrame(x)

2.11 ✌ Model test

2.11.1 ✌ One-hot encoding

x=data.drop(columns=['Age','Survived'])
y=data['Survived']
x=pd.get_dummies(x)
x['Age']=data['Age']
y=LabelEncoder().fit_transform(y)
# Model test: cross-validate an SVC with each kernel
for kernel in ["linear","poly","rbf","sigmoid"]:
    clf = SVC(kernel = kernel
                ,gamma="auto"
                ,degree = 1
                ,cache_size = 5000
                )
    score=cross_val_score(clf,x,y,cv=5,scoring='accuracy').mean()
    print('{:10s}:{}'.format(kernel,score))

2.11.2 ✌ LabelEncoder encoding

x=data.drop(columns=['Age','Survived'])
y=data['Survived']
df=pd.DataFrame(index=x.index)
# Label-encode each column and collect the results into one feature matrix
for i in x.columns:
    df[i]=LabelEncoder().fit_transform(x[i])
y=LabelEncoder().fit_transform(y)
for kernel in ["linear","poly","rbf","sigmoid"]:
    clf = SVC(kernel = kernel
                ,gamma="auto"
                ,degree = 1
                ,cache_size = 5000
                )
    score=cross_val_score(clf,df,y,cv=5,scoring='accuracy').mean()
    print('{:10s}:{}'.format(kernel,score))  
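The two tests repeat the same kernel loop, so it can be wrapped in a helper and applied to both encodings. This is only a sketch: evaluate_kernels is a hypothetical name, and the toy data is synthetic, so its scores say nothing about the real dataset.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

def evaluate_kernels(features, target, kernels=("linear", "poly", "rbf", "sigmoid"), cv=5):
    """Return the mean cross-validated accuracy of an SVC for each kernel."""
    scores = {}
    for kernel in kernels:
        clf = SVC(kernel=kernel, gamma="auto", degree=1)
        scores[kernel] = cross_val_score(clf, features, target, cv=cv, scoring="accuracy").mean()
    return scores

# Synthetic stand-in for the categorical features (60 rows, balanced classes)
toy = pd.DataFrame({
    "Sex": ["male", "female"] * 30,
    "Embarked": ["S", "C", "Q"] * 20,
})
target = (toy["Sex"] == "female").astype(int).to_numpy()

# One-hot version: one indicator column per category
onehot_features = pd.get_dummies(toy, dtype=float)
# Label-encoded version: one integer column per original feature
label_features = toy.apply(lambda col: LabelEncoder().fit_transform(col))

onehot_scores = evaluate_kernels(onehot_features, target)
label_scores = evaluate_kernels(label_features, target)

for kernel in onehot_scores:
    print('{:10s}: one-hot={:.3f}  label={:.3f}'.format(kernel, onehot_scores[kernel], label_scores[kernel]))
```

Running both encodings through the same function makes it easier to keep the feature set identical when comparing them.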

Origin blog.csdn.net/m0_47256162/article/details/113788166