Normalizing Numerical Features
In addition to transforming highly skewed features, it is usually good practice to apply some form of scaling to the numerical features. Scaling does not change the shape of a feature's distribution (such as 'capital-gain' or 'capital-loss' mentioned above); however, normalization ensures that each feature is treated equally by a supervised learner. Note that once scaling is applied, the observations no longer carry their original meaning, as the following example shows.

Run the code cell below to normalize each numerical feature. We will use sklearn.preprocessing.MinMaxScaler to accomplish this task.
from sklearn.preprocessing import MinMaxScaler

# Initialize a scaler, then apply it to the features
scaler = MinMaxScaler()
numerical = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
features_raw[numerical] = scaler.fit_transform(data[numerical])

# Show an example of a record with scaling applied
display(features_raw.head(n = 1))
| | age | workclass | education_level | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.30137 | State-gov | Bachelors | 0.8 | Never-married | Adm-clerical | Not-in-family | White | Male | 0.02174 | 0.0 | 0.397959 | United-States |
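Min-max scaling maps each feature onto [0, 1] via (x - min) / (max - min). A minimal sketch with a toy 'age' column (the values here are illustrative, not from the census data) confirms that MinMaxScaler matches the manual formula:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy 'age' column: minimum 20, maximum 70
ages = np.array([[20.0], [45.0], [70.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(ages)

# Manual formula: (x - min) / (max - min)
manual = (ages - ages.min()) / (ages.max() - ages.min())

print(scaled.ravel())               # [0.  0.5 1. ]
print(np.allclose(scaled, manual))  # True
```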
Exercise: Data Preprocessing
From the table in the data exploration above, we can see that each record has several non-numeric attributes. Learning algorithms typically expect numeric input, which requires the non-numeric features (called categorical variables) to be converted. A popular way to convert categorical variables is the one-hot encoding scheme, which creates a "dummy" variable for each possible category of each non-numeric feature. For example, suppose someFeature has three possible values: A, B, or C. We encode this feature into someFeature_A, someFeature_B and someFeature_C.
| Feature X | | Feature X_A | Feature X_B | Feature X_C |
|---|---|---|---|---|
| B | | 0 | 1 | 0 |
| C | ----> One-hot encoding ----> | 0 | 0 | 1 |
| A | | 1 | 0 | 0 |
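The expansion in the table above is exactly what pandas.get_dummies() produces. A small sketch with a hypothetical someFeature column:

```python
import pandas as pd

# Toy column with three categories, mirroring the table above
df = pd.DataFrame({'someFeature': ['B', 'C', 'A']})

# One dummy column per category, named someFeature_<category>
encoded = pd.get_dummies(df['someFeature'], prefix='someFeature')
print(encoded.astype(int))
```

Each row contains a single 1 in the column matching its original category and 0 everywhere else.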
In addition, as with the non-numeric features, we need to convert the non-numeric target label 'income' to numeric values so that the learning algorithm works properly. Since this label has only two possible classes ("<=50K" and ">50K"), we don't need one-hot encoding; we can encode them directly as the two classes 0 and 1. In the code cell below you will implement the following:

- Use pandas.get_dummies() to apply one-hot encoding to the 'features_raw' data.
- Convert the target label 'income_raw' to numeric entries:
  - Convert "<=50K" to 0; convert ">50K" to 1.
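One possible sketch of these two steps, shown on a toy stand-in frame since the census data is not loaded here (the names features_raw and income_raw mirror the exercise's variables):

```python
import pandas as pd

# Toy stand-ins for the notebook's features_raw and income_raw
features_raw = pd.DataFrame({
    'age': [0.30, 0.45],
    'workclass': ['State-gov', 'Private'],
})
income_raw = pd.Series(['<=50K', '>50K'])

# One-hot encode every categorical column; numeric columns pass through
features = pd.get_dummies(features_raw)

# Encode the target: '<=50K' -> 0, '>50K' -> 1
income = income_raw.map({'<=50K': 0, '>50K': 1})

print(features.columns.tolist())
print(income.tolist())  # [0, 1]
```

get_dummies on a whole DataFrame expands only the non-numeric columns, naming the dummies column_category (e.g. workclass_Private), which is why a single call handles all categorical features at once.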