Machine Learning Notes 7 - Supervised Learning

Normalizing Numerical Features

In addition to applying transformations to highly skewed features (such as 'capital-gain' or 'capital-loss' mentioned above), it is usually good practice to apply some form of scaling to the numerical features. Scaling does not change the shape of each feature's distribution; however, normalization ensures that each feature is treated equally when a supervised learner is applied. Note that once scaling is applied, the observations no longer carry their original units, as the following example shows.

Run the code cell below to normalize each numerical feature. We will use sklearn.preprocessing.MinMaxScaler to accomplish this task.

from sklearn.preprocessing import MinMaxScaler

# Initialize a scaler, then apply it to the features
scaler = MinMaxScaler()
numerical = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
features_raw[numerical] = scaler.fit_transform(data[numerical])

# Show an example of a record with scaling applied
display(features_raw.head(n = 1))
age workclass education_level education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
0 0.30137 State-gov Bachelors 0.8 Never-married Adm-clerical Not-in-family White Male 0.02174 0.0 0.397959 United-States
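The scaled values above can be checked by hand: MinMaxScaler maps each value x to (x - min) / (max - min) per column. A minimal sketch, assuming the first record's age is 39 and the 'age' column spans 17 to 90 (the bounds of the UCI Census data):

```python
# Min-max scaling: x' = (x - min) / (max - min).
# Assumed values: age = 39 for the first record, column range [17, 90].
age_min, age_max = 17, 90
scaled_age = (39 - age_min) / (age_max - age_min)
print(round(scaled_age, 5))  # matches the 0.30137 shown in the output above
```

The same formula applied to 'hours-per-week' and the capital columns yields the other scaled entries in the displayed row.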

Exercise: Data Preprocessing

From the data exploration table above, we can see that several attributes in each record are non-numeric. Learning algorithms typically expect numeric input, which requires the non-numeric features (called categorical variables) to be converted. A popular way to convert categorical variables is the one-hot encoding scheme. One-hot encoding creates a "dummy" variable for each possible category of each non-numeric feature. For example, suppose someFeature has three possible values: A, B, or C. We then encode this feature into someFeature_A, someFeature_B, and someFeature_C.

someFeature                                someFeature_A  someFeature_B  someFeature_C
B           ----> One-hot encoding ---->   0              1              0
C                                          0              0              1
A                                          1              0              0
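The diagram above can be reproduced with pandas directly. A minimal sketch on a hypothetical toy frame with the single column someFeature:

```python
import pandas as pd

# Toy frame mirroring the diagram: one categorical column with values B, C, A.
df = pd.DataFrame({'someFeature': ['B', 'C', 'A']})

# get_dummies creates one 0/1 column per observed category.
encoded = pd.get_dummies(df)
print(list(encoded.columns))
# ['someFeature_A', 'someFeature_B', 'someFeature_C']
```

Each row of `encoded` has exactly one 1, in the column matching that row's original category.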

In addition, as with the non-numeric features, we need to convert the non-numeric target label 'income' to numeric values to ensure the learning algorithm works properly. Since there are only two possible classes for this label ("<=50K" and ">50K"), we do not need one-hot encoding; we can encode the two classes directly as 0 and 1. In the code cell below, you will implement the following:

  • Use pandas.get_dummies() to apply one-hot encoding to the 'features_raw' data.
  • Convert the target label 'income_raw' to numeric entries.
    • Convert "<=50K" to 0 and ">50K" to 1.
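The two steps above can be sketched as follows. This is a sketch on a small stand-in frame; in the notebook, 'features_raw' and 'income_raw' come from the earlier cells, and the stand-in values here are illustrative only:

```python
import pandas as pd

# Stand-in for the notebook's 'features_raw' and 'income_raw'.
features_raw = pd.DataFrame({
    'age': [0.30137, 0.45205],          # already min-max scaled
    'workclass': ['State-gov', 'Private'],
})
income_raw = pd.Series(['<=50K', '>50K'])

# Step 1: one-hot encode the categorical columns; numeric columns pass through.
features = pd.get_dummies(features_raw)

# Step 2: encode the binary label directly, "<=50K" -> 0 and ">50K" -> 1.
income = income_raw.map({'<=50K': 0, '>50K': 1})

print(list(features.columns))  # 'age' kept, 'workclass' expanded into dummies
print(income.tolist())         # [0, 1]
```

Because the label has only two classes, a plain mapping to 0/1 is sufficient; one-hot encoding the label would add a redundant column.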
