Article Directory
- ✌ One-hot encoding and LabelEncoder label encoding
- 1. ✌ Introduction
- 2. ✌ Code test
  - 2.1 ✌ Import related libraries
  - 2.2 ✌ Read the data
  - 2.3 ✌ View missing values
  - 2.4 ✌ Fill Age with the median
  - 2.5 ✌ Drop rows with missing Embarked
  - 2.6 ✌ View the categories of each feature
  - 2.7 ✌ Encode the label with LabelEncoder
  - 2.8 ✌ Dummy-variable processing with pandas
  - 2.9 ✌ Dummy-variable processing for features
  - 2.10 ✌ One-hot encoding of features
  - 2.11 ✌ Model test
✌ One-hot encoding and LabelEncoder label encoding
1. ✌ Introduction
In feature engineering we sometimes need two encoders, OneHotEncoder and LabelEncoder, to handle non-numeric categorical features.
Take gender as an example: the feature takes the values male and female. These strings cannot be fed into a model directly, so they must be encoded as numbers.
For example:

feature | coding |
---|---|
male | 1 |
female | 0 |
female | 0 |
male | 1 |
female | 0 |
male | 1 |
LabelEncoder converts the values into the integer classes 0 and 1 (with three categories it would produce 0, 1, 2).
OneHotEncoder instead converts the feature into a matrix:
feature | Sex_male | Sex_female |
---|---|---|
male | 1 | 0 |
female | 0 | 1 |
female | 0 | 1 |
male | 1 | 0 |
female | 0 | 1 |
male | 1 | 0 |
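The two tables above can be reproduced on a toy Sex column (a minimal sketch; the six values are just the illustrative rows from the tables):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

sex = pd.Series(['male', 'female', 'female', 'male', 'female', 'male'], name='Sex')

# LabelEncoder: a single column of integer codes (classes sorted alphabetically,
# so female -> 0 and male -> 1, matching the first table)
codes = LabelEncoder().fit_transform(sex)
print(codes)  # [1 0 0 1 0 1]

# one-hot via pandas dummies: one 0/1 column per category, as in the second table
dummies = pd.get_dummies(sex, prefix='Sex')
print(dummies.columns.tolist())  # ['Sex_female', 'Sex_male']
```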
So the question is: both methods produce an encoding, so what is the difference?
- With LabelEncoder the feature stays one-dimensional, but it produces integer codes such as 0, 1, 2, 3.
- OneHotEncoder generates mutually linearly independent vectors.
If red, blue, and green are label-encoded as 0, 1, and 2, a spurious mathematical relationship is created: green becomes "greater than" red, and the mean of green and red equals blue. These categories are mutually independent, and no such relationship existed before the transformation.
OneHotEncoder avoids this problem, because each category becomes its own linearly independent vector with no implied ordering. However, when there are many categories the feature dimension grows sharply, wasting resources, lengthening training time, and producing very sparse matrices; in that case it can be combined with PCA for dimensionality reduction.
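Both points can be seen on a small color example (a sketch; the `colors` array is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([['red'], ['blue'], ['green'], ['red']])

onehot = OneHotEncoder().fit_transform(colors).toarray()
print(onehot.shape)  # (4, 3): one extra column per category

# each category maps to a distinct standard basis vector, so the category
# vectors are linearly independent and carry no artificial ordering
print(onehot)
```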
2. ✌ Code test
2.1 ✌ Import related libraries
```python
import numpy as np
import pandas as pd
# import the SVC model
from sklearn.svm import SVC
# scoring metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
# encoders
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# cross-validation
from sklearn.model_selection import cross_val_score
```
2.2 ✌ Read the data
```python
data = pd.read_csv('Narrativedata.csv', index_col=0)
data
```
2.3 ✌ View missing values
```python
data.isnull().sum()
```
2.4 ✌ Fill Age with the median
```python
data['Age'].fillna(data['Age'].median(), inplace=True)
data.isnull().sum()
```
2.5 ✌ Drop rows with missing Embarked
```python
data.dropna(inplace=True)
data.isnull().sum()
```
2.6 ✌ View the categories of each feature
```python
display(np.unique(data['Sex']))
display(np.unique(data['Embarked']))
display(np.unique(data['Survived']))
x = data.drop(columns=['Survived'])
y = data['Survived']
```
2.7 ✌ Encode the label with LabelEncoder
```python
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(y)
y
```
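If you later need to map the integer codes back to the original labels, the fitted encoder keeps the mapping. A minimal sketch with made-up label values (the real Survived column is not shown here):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(['No', 'Yes', 'Unknown', 'No'])

# classes_ holds the original label for each integer code, in sorted order
print(le.classes_)                    # ['No' 'Unknown' 'Yes']
print(encoded)                        # [0 2 1 0]
print(le.inverse_transform(encoded))  # back to the original labels
```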
2.8 ✌ Dummy-variable processing with pandas
```python
y = data['Survived']
y = pd.get_dummies(y)
y
```
2.9 ✌ Dummy-variable processing for features
```python
x = pd.get_dummies(x.drop(columns=['Age']))
x
```
2.10 ✌ One-hot encoding of features
```python
from sklearn.preprocessing import OneHotEncoder
x = data.drop(columns=['Survived', 'Age'])
x = OneHotEncoder().fit_transform(x).toarray()
pd.DataFrame(x)
```
2.11 ✌ Model test
2.11.1 ✌ One-hot encoding
```python
x = data.drop(columns=['Age', 'Survived'])
y = data['Survived']
x = pd.get_dummies(x)
x['Age'] = data['Age']
y = LabelEncoder().fit_transform(y)
# model test
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel,
              gamma="auto",
              degree=1,
              cache_size=5000)
    score = cross_val_score(clf, x, y, cv=5, scoring='accuracy').mean()
    print('{:10s}:{}'.format(kernel, score))
```
2.11.2 ✌ LabelEncoder encoding
```python
x = data.drop(columns=['Age', 'Survived'])
y = data['Survived']
df = pd.DataFrame()
# concatenate the label-encoded columns into one feature matrix
for i in x.columns:
    df = pd.concat([df, pd.DataFrame(LabelEncoder().fit_transform(x[i]))], axis=1)
y = LabelEncoder().fit_transform(y)
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel,
              gamma="auto",
              degree=1,
              cache_size=5000)
    score = cross_val_score(clf, df, y, cv=5, scoring='accuracy').mean()
    print('{:10s}:{}'.format(kernel, score))
```
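The column-by-column LabelEncoder loop above can be replaced with a single OrdinalEncoder call, which is the estimator scikit-learn provides for integer-encoding features (LabelEncoder is intended for the target). A sketch on made-up data:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

x = pd.DataFrame({'Sex': ['male', 'female', 'male'],
                  'Embarked': ['S', 'C', 'Q']})

# encodes every column at once; each column gets the same integer codes
# that a per-column LabelEncoder would produce (categories sorted)
df = pd.DataFrame(OrdinalEncoder().fit_transform(x), columns=x.columns)
print(df)
```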