KNN: Distance and Data Normalization (Feature Scaling)


Q: Why normalize the data?
A: When features are measured on different scales (dimensions), the feature with the largest numeric range dominates the distance, which is not necessarily the feature we actually care about.

Sample     Tumor size (cm)    Discovery time (days)
Sample 1   1                  200
Sample 2   5                  100

Euclidean distance: $\sqrt{(1-5)^{2}+(200-100)^{2}}$. This value is dominated by the discovery time: the tumor sizes differ by a factor of 5, but because the two features use different units, the distance mainly measures the difference in discovery days.
If we convert the time unit from days to years:

Sample     Tumor size (cm)    Discovery time (years)
Sample 1   1                  200 days ≈ 0.55 years
Sample 2   5                  100 days ≈ 0.27 years

At this point the Euclidean distance is instead dominated by the tumor size.
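The dominance effect described above can be checked numerically. This is a minimal sketch (not from the original post) computing the Euclidean distance for the two samples with time measured in days, then in years:

```python
import math

# Two samples: (tumor size in cm, discovery time)
s1_days = (1, 200)
s2_days = (5, 100)

# Euclidean distance with time in days: dominated by the time feature
d_days = math.sqrt((s1_days[0] - s2_days[0]) ** 2 + (s1_days[1] - s2_days[1]) ** 2)

# Same samples with time converted to years: dominated by tumor size
s1_years = (1, 200 / 365)
s2_years = (5, 100 / 365)
d_years = math.sqrt((s1_years[0] - s2_years[0]) ** 2 + (s1_years[1] - s2_years[1]) ** 2)

print(d_days)   # ≈ 100.08 — almost entirely the 100-day gap
print(d_years)  # ≈ 4.01  — almost entirely the 4-cm gap
```

Merely changing the unit flips which feature controls the distance, which is exactly why a principled normalization is needed.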

So we normalize the source data. Data normalization maps all features onto the same scale. The simplest method is min-max normalization (normalization), which maps all data into the range [0, 1].
Min-max normalization formula: $x_{scale}=\frac{x-x_{min}}{x_{max}-x_{min}}$

Formula explanation: for each feature, find its maximum and minimum values. Subtracting the minimum from every value maps the data into the range $[0, x_{max}-x_{min}]$; dividing by that range expresses each value as a proportion of the whole range, so everything is mapped into [0, 1].
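The formula above can be applied per column with numpy broadcasting. A minimal sketch (the array shape and seed are my own choices, not from the original):

```python
import numpy as np

np.random.seed(0)
# 50 samples, 2 features, random integers in [0, 100)
X = np.random.randint(0, 100, (50, 2)).astype(float)

# Min-max normalization per feature (column): map each column into [0, 1]
# (assumes each column has at least two distinct values, so max != min)
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_scaled.min(axis=0))  # [0. 0.]
print(X_scaled.max(axis=0))  # [1. 1.]
```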

The most value normalization [normalizition] applies to the distribution of a clear border situation;
for example: Grading [0-100], pixels [0-255],
Cons: greatly influenced by outlier, such as no clear boundaries month income [0-100 Wan], most people are 10000, only one is 1 million, are mapped between 0-1 words, the monthly income of 1 million is 1, most of the monthly income of 10,000 gathered in near 0.01. Data mapping at this time is not good enough, a corresponding improvement is to use a mean and variance normalization [Standardization]

Mean-variance normalization (standardization): rescale all data to a distribution with mean 0 and variance 1. This does not guarantee that every value lies between 0 and 1, but the result has mean 0 and variance 1.
It applies to data sets without clear boundaries or with extreme values.

So in general, mean-variance normalization (standardization) is used.
Mean-variance normalization formula: $x_{scale}=\frac{x-x_{mean}}{S}$, where $S$ is the standard deviation of the feature.

Formula explanation: for each feature, subtract the feature's mean, then divide by the feature's standard deviation.
For the full code implementation, refer to the original video; here only one code implementation is demonstrated:

import numpy as np

X2 = np.random.randint(0, 100, (50, 2))  # 50 samples, 2 features, values in [0, 100)
X2 = np.array(X2, dtype=float)           # convert to float so division keeps decimals
X2[:,0] = (X2[:,0] - np.mean(X2[:,0])) / np.std(X2[:,0])
X2[:,1] = (X2[:,1] - np.mean(X2[:,1])) / np.std(X2[:,1])
# X2[:,0] - np.mean(X2[:,0]) is the first feature column minus its mean (a vector)
# np.std(X2[:,0]) is the standard deviation of the first feature column
# vector / scalar = vector  ==>  the first feature column is standardized

# plot the standardized data
import matplotlib.pyplot as plt
plt.scatter(X2[:,0], X2[:,1])
plt.show()
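We can sanity-check the standardization result. This self-contained sketch repeats the computation (seeding numpy for reproducibility, which the original does not do) and verifies that each column ends up with mean 0 and standard deviation 1:

```python
import numpy as np

np.random.seed(0)
X2 = np.random.randint(0, 100, (50, 2)).astype(float)
X2[:, 0] = (X2[:, 0] - np.mean(X2[:, 0])) / np.std(X2[:, 0])
X2[:, 1] = (X2[:, 1] - np.mean(X2[:, 1])) / np.std(X2[:, 1])

# After standardization each column has mean ≈ 0 and std ≈ 1
print(np.mean(X2, axis=0))  # ≈ [0. 0.] (tiny floating-point residue)
print(np.std(X2, axis=0))   # ≈ [1. 1.]
```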
Origin: blog.csdn.net/qq_22038327/article/details/103059115