Getting Started with Udacity Machine Learning - Feature Scaling

Chris weight + height = 146.1

Cameron weight + height = 180.9

Sarah weight + height = 120.2

Judging by these combined values alone, Chris is closer to Sarah and should wear size S. But the two features live on very different scales (height is a single-digit number, weight is in the hundreds), so weight dominates the sum. Feature scaling fixes this by rescaling each feature so their ranges become comparable, usually to [0, 1] (inclusive): x' = (x - x_min) / (x_max - x_min).

One advantage of min/max feature scaling is that the output range is predictable and stable. The disadvantage is sensitivity to outliers: a single extreme max or min stretches the denominator and squeezes all the other scaled values together.
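The outlier problem can be sketched with hypothetical numbers: adding one extreme value to the weight list compresses every other scaled value toward 0.

```python
# Sketch (hypothetical numbers): one extreme outlier squeezes all other
# min/max-scaled values toward 0.
weights = [115, 140, 175]
weights_with_outlier = weights + [1000]  # hypothetical data-entry error

def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale(weights))               # [0.0, 0.4166..., 1.0]
print(min_max_scale(weights_with_outlier))  # normal points squeezed below 0.07
```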


Chris's weight of 140 is scaled to (140 - 115) / (175 - 115) ≈ 0.417.
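The shirt-size example above can be sketched in code. The individual heights and weights are assumed values consistent with the sums given earlier: Chris (6.1 ft, 140 lb), Cameron (5.9 ft, 175 lb), Sarah (5.2 ft, 115 lb).

```python
# Sketch: why scaling changes the answer for Chris's shirt size.
# Heights/weights are assumed from the sums above.
import math

chris, cameron, sarah = (6.1, 140.0), (5.9, 175.0), (5.2, 115.0)

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Raw features: weight dominates, Chris looks closer to Sarah (size S)
print(dist(chris, cameron), dist(chris, sarah))

def scale(v, lo, hi):
    return (v - lo) / (hi - lo)

def scaled(p):
    return (scale(p[0], 5.2, 6.1), scale(p[1], 115.0, 175.0))

# Scaled features: Chris is actually closer to Cameron (size L)
print(dist(scaled(chris), scaled(cameron)), dist(scaled(chris), scaled(sarah)))
```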

Write code to calculate feature scaling values

def featureScaling(arr):
    lo, hi = min(arr), max(arr)
    # All values identical: map everything to the midpoint 0.5
    if lo == hi:
        return [0.5] * len(arr)
    return [float(x - lo) / (hi - lo) for x in arr]

# tests of your feature scaler--line below is input data
data = [115, 140, 175]
print(featureScaling(data))  # [0.0, 0.4166..., 1.0]

In fact, sklearn provides this out of the box:

>>> from sklearn.preprocessing import MinMaxScaler
>>> import numpy
>>> weights = numpy.array([[115.], [140.], [175.]])  # requires floating-point values
>>> scaler = MinMaxScaler()
>>> rescaled_weight = scaler.fit_transform(weights)
>>> rescaled_weight
array([[0.        ],
       [0.41666667],
       [1.        ]])
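A fitted MinMaxScaler remembers the training min/max, so new values can be rescaled consistently with scaler.transform:

```python
# Sketch: rescale a new data point with the min/max learned from training data.
import numpy
from sklearn.preprocessing import MinMaxScaler

weights = numpy.array([[115.], [140.], [175.]])
scaler = MinMaxScaler()
scaler.fit(weights)

new_weight = numpy.array([[160.]])
print(scaler.transform(new_weight))  # (160 - 115) / (175 - 115) = 0.75
```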

Algorithms not affected by feature scaling: linear regression, decision trees

Affected algorithms: SVM with RBF kernel, K-means clustering

SVM (with an RBF kernel) and k-means both compute distances, where one dimension trades off against another: if a feature's scale is doubled, its contribution to the distance also doubles, so rescaling changes which points count as close. K-means behaves the same way.

Decision trees: a tree produces a series of axis-parallel (horizontal and vertical) split lines, and each split considers only one dimension at a time, so there is no trade-off between dimensions. If a feature is scaled (e.g., that axis is shrunk by half), the positions of the split lines move proportionally, but the order and structure of the splits stay the same, so the resulting partition is unchanged.
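This invariance can be sketched with a tiny synthetic dataset (the numbers are made up): doubling one feature moves the split thresholds proportionally, and the predictions do not change.

```python
# Sketch: scaling a feature does not change decision tree predictions;
# the splits are re-learned at proportionally shifted thresholds.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1., 10.], [2., 20.], [3., 15.], [4., 30.]])
y = np.array([5., 9., 11., 17.])

tree = DecisionTreeRegressor(random_state=0).fit(X, y)

X_scaled = X.copy()
X_scaled[:, 0] *= 2.0  # double the first feature
tree_scaled = DecisionTreeRegressor(random_state=0).fit(X_scaled, y)

print(tree.predict(X))
print(tree_scaled.predict(X_scaled))  # identical
```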

Linear regression: each feature has its own coefficient, which always appears multiplied by that feature, so rescaling feature A does not affect the coefficient of feature B. If a feature is doubled, its coefficient shrinks by half and the predictions stay the same.
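The coefficient-absorbs-the-scale argument can be checked directly on made-up data: doubling a feature halves its fitted coefficient, leaving predictions unchanged.

```python
# Sketch (synthetic data): doubling a feature halves its coefficient
# in linear regression, so predictions are unchanged.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1., 10.], [2., 20.], [3., 15.], [4., 30.]])
y = np.array([5., 9., 11., 17.])

reg = LinearRegression().fit(X, y)

X_scaled = X.copy()
X_scaled[:, 0] *= 2.0  # double the first feature
reg_scaled = LinearRegression().fit(X_scaled, y)

print(reg.coef_[0], reg_scaled.coef_[0] * 2)        # same coefficient, up to the scale factor
print(reg.predict(X), reg_scaled.predict(X_scaled))  # identical predictions
```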


Feature Scaling Mini Project

1. Review the last part of the k-means clustering mini-project and identify what kind of scaling is used: MinMaxScaler.

2. What is the scaled value of a "salary" feature with an original value of $200,000, and of an "exercised_stock_options" feature with an original value of $1 million? Note that the two calculations below disagree slightly.

Direct formula calculation:

"exercised_stock_options" scaled value for $1 million: 0.0290205889347

"salary" scaled value for $200,000: 0.17962406631

MinMaxScaler calculation:

"exercised_stock_options" scaled value for $1 million: [ 0.02911624]
"salary" scaled value for $200,000: [ 0.18005349]
stocklist = []
salarylist = []
for item in data_dict:
    stock = data_dict[item]['exercised_stock_options']
    salary = data_dict[item]['salary']
    if stock != 'NaN':
        stocklist.append(stock)
    if salary != 'NaN':
        salarylist.append(salary)

from sklearn.preprocessing import MinMaxScaler
import numpy as np

salarylist = np.array(salarylist, dtype=float).reshape(-1, 1)
stocklist = np.array(stocklist, dtype=float).reshape(-1, 1)

# Direct calculation of the min/max scaling formula
print('1 million "exercised_stock_options" feature scaling value:',
      (1000000.0 - np.min(stocklist)) / (np.max(stocklist) - np.min(stocklist)))
print('200,000 "salary" feature scaling value:',
      (200000.0 - np.min(salarylist)) / (np.max(salarylist) - np.min(salarylist)))

min_max_scaler = MinMaxScaler()
min_max_scaler.fit_transform(stocklist)
# Original value * scale_. Note: this drops the min_ offset
# (transform computes x * scale_ + min_), hence the small discrepancy above.
print(1000000.0 * min_max_scaler.scale_)
min_max_scaler.fit_transform(salarylist)
print(200000.0 * min_max_scaler.scale_)

3. What if we wanted to cluster based on "from_messages" (the number of emails sent from a particular email account) and "salary"? In this case, is feature scaling unnecessary, or is it important?

Important. The number of emails sent is typically in the hundreds or low thousands, while salaries are usually at least a thousand times larger, so without scaling, salary would completely dominate the clustering.
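With hypothetical values of the right orders of magnitude, it is easy to see how lopsided the distance becomes without scaling:

```python
# Sketch (hypothetical values): without scaling, salary's contribution
# swamps from_messages in a Euclidean distance.
import math

p = (300, 250000.0)   # (from_messages, salary)
q = (2000, 260000.0)

diff_messages = abs(p[0] - q[0])  # 1700
diff_salary = abs(p[1] - q[1])    # 10000
distance = math.hypot(diff_messages, diff_salary)
salary_share = diff_salary ** 2 / distance ** 2
print(salary_share)  # salary accounts for ~97% of the squared distance
```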



