Processing categorical data with scikit-learn

Some observations are better described by a quality than by a quantity. Such qualitative information usually records an attribute of an observation as a category, for example gender, color, or car brand. Not all categorical data is alike, however. Categories with no inherent order are called nominal; categories with an inherent order are called ordinal.
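As a small illustration of the nominal/ordinal distinction, the sketch below uses pandas' Categorical type, which the original article does not use; the example values ("red", "low", and so on) are made up for demonstration.

```python
import pandas as pd

# Nominal: the categories carry no order
color = pd.Categorical(["red", "blue", "red"], ordered=False)

# Ordinal: the order is declared explicitly through `categories`
size = pd.Categorical(["low", "high", "medium"],
                      categories=["low", "medium", "high"],
                      ordered=True)

print(color.ordered)           # whether an order is defined for color
print(size.ordered)            # whether an order is defined for size
print(size.min(), size.max())  # min/max are only meaningful for ordered categories
print(size.codes)              # integer codes follow the declared order
```

Declaring the order up front keeps comparisons and sorting consistent with the meaning of the labels rather than with their alphabetical order.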

1. Encoding nominal categorical features

Problem description: A nominal categorical feature, with no inherent order, needs to be encoded.

  • Use sklearn's LabelBinarizer to one-hot encode the feature:
import numpy as np
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

# Create the feature
feature = np.array([["Texas"],
                    ["California"],
                    ["Texas"],
                    ["Delaware"],
                    ["Texas"]])
# Create the one-hot encoder
one_hot = LabelBinarizer()
# One-hot encode the feature
one_hot.fit_transform(feature)

—>

array([[0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [0, 0, 1]])
# Reverse the one-hot encoding
one_hot.inverse_transform(one_hot.transform(feature))

—>

array(['Texas', 'California', 'Texas', 'Delaware', 'Texas'], dtype='<U10')
  • Use pandas to perform one-hot encoding on features:
import pandas as pd
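# Create dummy variables (dtype=int keeps 0/1 output; pandas >= 2.0 returns booleans by default)
pd.get_dummies(feature[:,0], dtype=int)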

—>

   California  Delaware  Texas
0           0         0      1
1           1         0      0
2           0         0      1
3           0         1      0
4           0         0      1

scikit-learn can also handle the case where each observation belongs to multiple classes, using MultiLabelBinarizer().
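A minimal sketch of the multi-class case follows. The feature values are hypothetical and chosen only to show the shape of the input, where each observation is a tuple of classes.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical multi-class feature: each observation lists several classes
multiclass_feature = [("Texas", "Florida"),
                      ("California", "Alabama"),
                      ("Texas", "Florida"),
                      ("Delaware", "Florida"),
                      ("Texas", "Alabama")]

# Create the multi-label one-hot encoder and encode the feature
one_hot_multiclass = MultiLabelBinarizer()
encoded = one_hot_multiclass.fit_transform(multiclass_feature)

# classes_ gives the column order of the encoded matrix
print(one_hot_multiclass.classes_)
print(encoded)
```

Each row of the result has a 1 in every column whose class appears in that observation, so a row can contain more than one 1.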

In addition, after one-hot encoding it is often recommended to drop one of the one-hot columns from the resulting matrix to avoid linear dependence among the features.
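One way to drop a column, sketched here with pandas' drop_first option (an assumption about tooling; the original article does not say how the column should be removed):

```python
import pandas as pd

# Same state labels as in the one-hot example, as a pandas Series
feature = pd.Series(["Texas", "California", "Texas", "Delaware", "Texas"])

# drop_first=True removes the column of the first category (in sorted order),
# leaving k-1 columns for k categories
dummies = pd.get_dummies(feature, drop_first=True, dtype=int)
print(dummies)
```

The dropped category ("California" here) is then represented implicitly by a row of all zeros.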

2. Encoding ordinal categorical features

Problem description: There is an ordinal categorical feature (for example, high, medium, low) that needs to be encoded.

Solution: Use the replace method of a pandas Series (a DataFrame column) to convert the string labels into their numeric equivalents.

import pandas as pd
# Create the feature
dataframe = pd.DataFrame({"score": ["Low", "Low", "Medium", "Medium", "High"]})
# Create the mapper
scale_mapper = {"Low": 1,
                "Medium": 2,
                "High": 3}
# Replace the feature values using the mapper
dataframe['score'].replace(scale_mapper)

—>

0    1
1    1
2    2
3    2
4    3

When encoding a feature for machine learning, an ordinal category must be converted into a numeric value that preserves its order. The most common approach is to create a dictionary that maps each category's string label to a number and then apply that mapping to the feature.
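As an alternative to a hand-written dictionary, scikit-learn's OrdinalEncoder can preserve the order when the categories are listed explicitly. This is a sketch of a swapped-in technique, not the article's method; note that its codes start at 0 rather than 1 and are returned as floats.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Same feature as in the replace() example
dataframe = pd.DataFrame({"score": ["Low", "Low", "Medium", "Medium", "High"]})

# Listing the categories fixes their order; without it the encoder
# would sort the labels alphabetically (High, Low, Medium)
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
encoded = encoder.fit_transform(dataframe[["score"]])

print(encoded.ravel())
```

Either approach assumes the gaps between adjacent categories are equal; if that is not a reasonable assumption for the data, the numeric values in a manual mapping can be adjusted instead.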

3. Encoding dictionaries of features

Use DictVectorizer to convert a dictionary into a feature matrix:

from sklearn.feature_extraction import DictVectorizer
# Create a list of dictionaries
data_dict = [{"Red": 2, "Blue": 4},
             {"Red": 4, "Blue": 3},
             {"Red": 1, "Yello": 2},
             {"Red": 2, "Yello": 2}]
# Create the dictionary vectorizer
dictvectorizer = DictVectorizer(sparse=False)
# Convert the dictionaries into a feature matrix
features = dictvectorizer.fit_transform(data_dict)
# View the feature matrix
features

—>

array([[4., 2., 0.],
       [3., 4., 0.],
       [0., 1., 2.],
       [0., 2., 2.]])

By default, DictVectorizer outputs a sparse matrix that stores only the non-zero elements. If the matrix is very large, this helps save memory. You can force DictVectorizer to output a dense matrix by specifying sparse=False.
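To see the default behaviour, the sketch below vectorizes the same dictionaries without sparse=False and inspects the result using SciPy's sparse helpers.

```python
from scipy import sparse
from sklearn.feature_extraction import DictVectorizer

# Same dictionaries as in the example above
data_dict = [{"Red": 2, "Blue": 4},
             {"Red": 4, "Blue": 3},
             {"Red": 1, "Yello": 2},
             {"Red": 2, "Yello": 2}]

# With the default sparse=True, fit_transform returns a SciPy sparse matrix
sparse_features = DictVectorizer().fit_transform(data_dict)

print(sparse.issparse(sparse_features))  # is the result a sparse matrix?
print(sparse_features.nnz)               # number of explicitly stored entries
dense = sparse_features.toarray()        # convert to a dense array when needed
print(dense.shape)
```

Only the entries that appear in the dictionaries are stored; the zeros for absent keys exist only once the matrix is converted to a dense array.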

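Use the get_feature_names_out method to get the names of the generated features (it replaces get_feature_names, which was removed in scikit-learn 1.2):

# Get the feature names as a list
feature_names = list(dictvectorizer.get_feature_names_out())
feature_names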

—>

['Blue', 'Red', 'Yello']
import pandas as pd
# Create a DataFrame from the features
pd.DataFrame(features, columns=feature_names)

—>

   Blue  Red  Yello
0   4.0  2.0    0.0
1   3.0  4.0    0.0
2   0.0  1.0    2.0
3   0.0  2.0    2.0

4. Imputing missing categorical values

Problem description: A categorical feature contains missing values, which need to be filled in with predicted values.

Solution: The ideal solution is to train a machine learning classifier, commonly a KNN classifier, to predict the missing values:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
# Create a feature matrix whose first column is the categorical feature
X = np.array([[0, 2.10, 1.45],
              [1, 1.18, 1.33],
              [0, 1.22, 1.27],
              [1, -0.21, -1.19]])
# Create a feature matrix with missing values in the categorical feature
X_with_nan = np.array([[np.nan, 0.87, 1.31],
                       [np.nan, -0.67, -0.22]])
# Train the KNN classifier
clf = KNeighborsClassifier(3, weights='distance')
trained_model = clf.fit(X[:,1:], X[:,0])
# Predict the class of the missing values
imputed_values = trained_model.predict(X_with_nan[:,1:])
# Join the predicted classes with their other features
X_with_imputed = np.hstack((imputed_values.reshape(-1,1), X_with_nan[:,1:]))
# Stack the two feature matrices
np.vstack((X_with_imputed, X))

—>

array([[ 0.  ,  0.87,  1.31],
       [ 1.  , -0.67, -0.22],
       [ 0.  ,  2.1 ,  1.45],
       [ 1.  ,  1.18,  1.33],
       [ 0.  ,  1.22,  1.27],
       [ 1.  , -0.21, -1.19]])

Another solution is to fill in the missing values with the most frequent value of the feature:

from sklearn.impute import SimpleImputer
# Stack the two feature matrices
X_complete = np.vstack((X_with_nan,X))
imputer = SimpleImputer(missing_values=np.nan,strategy='most_frequent')
imputer.fit_transform(X_complete)

—>

array([[ 0.  ,  0.87,  1.31],
       [ 0.  , -0.67, -0.22],
       [ 0.  ,  2.1 ,  1.45],
       [ 1.  ,  1.18,  1.33],
       [ 0.  ,  1.22,  1.27],
       [ 1.  , -0.21, -1.19]])

When a categorical feature has missing values, the best solution is to use a machine learning algorithm to predict them. This is done by treating the feature with missing values as the target vector and the other features as the feature matrix. A commonly used algorithm is KNN, which assigns the majority class among the k nearest observations (weighted by distance in the example above) as the fill value.
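The procedure described above can be wrapped in a small helper. The function name knn_impute_column and its parameters are hypothetical, introduced only for this sketch; it assumes a numeric matrix in which a single categorical column contains NaN and the remaining columns are complete.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_impute_column(X, column=0, n_neighbors=3):
    """Fill NaNs in one categorical column, using the other columns as predictors."""
    X = np.asarray(X, dtype=float).copy()
    missing = np.isnan(X[:, column])
    if not missing.any():
        return X
    # The other columns form the feature matrix; the chosen column is the target
    predictors = np.delete(X, column, axis=1)
    clf = KNeighborsClassifier(n_neighbors, weights="distance")
    clf.fit(predictors[~missing], X[~missing, column])
    # Predict a class for every row whose target value is missing
    X[missing, column] = clf.predict(predictors[missing])
    return X

# Same observations as in the example above, in a single matrix
X = np.array([[0, 2.10, 1.45],
              [1, 1.18, 1.33],
              [np.nan, 0.87, 1.31],
              [0, 1.22, 1.27],
              [1, -0.21, -1.19],
              [np.nan, -0.67, -0.22]])

X_filled = knn_impute_column(X)
print(X_filled)
```

Because the helper fits only on rows where the target is known, it keeps the row order of the input instead of stacking the imputed rows on top.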

Alternatively, missing values can be filled with the feature's most frequent category. Although this is usually less accurate than KNN, it scales more easily to large data sets.

5. Handling imbalanced classes

Origin blog.csdn.net/weixin_44127327/article/details/108783830