In general, two distance metrics are commonly used: Euclidean distance and cosine similarity.
Euclidean distance is sensitive to the scale (units) of each feature, so the data are usually standardized before it is used; the larger the distance, the greater the difference between individuals.
Cosine similarity is a similarity measure that is not affected by scale; its value lies in the interval [-1, 1], and the larger the cosine value, the more similar the two individuals.
1. Euclidean distance (also called the Euclidean metric)
The distance between two n-dimensional points x and y is written as
d(x, y) = sqrt((x1-y1)^2 + (x2-y2)^2 + ... + (xn-yn)^2)
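The formula above translates directly into code; a minimal sketch (the function name is mine, not from the text):

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # Sum the squared per-component differences, then take the square root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance([0, 0], [3, 4]))  # → 5.0
```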
Improved method 1:
Standardized Euclidean distance: when the components follow inconsistent distributions, each component is first standardized so that all components have equal mean and variance.
Standardized value = (value before standardization - component mean) / component standard deviation
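Applying that z-score transform to each component before taking the Euclidean distance gives the standardized Euclidean distance. A sketch, assuming the per-component means and standard deviations are supplied by the caller (the signature is my own choice):

```python
import math

def standardized_euclidean(x, y, means, stds):
    """Euclidean distance after z-scoring each component with the
    given per-component means and standard deviations."""
    return math.sqrt(sum(
        ((xi - m) / s - (yi - m) / s) ** 2
        for xi, yi, m, s in zip(x, y, means, stds)
    ))
```

Note that the mean cancels in each difference, so the distance effectively weights each squared difference by 1/std^2.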
Improved method 2:
Mahalanobis distance: the distance between a point and a distribution; it takes the correlations between features into account and is scale-invariant.
For a multivariate vector x with mean μ and covariance matrix Σ, the Mahalanobis distance is sqrt((x-μ)^T Σ^(-1) (x-μ))
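A minimal sketch of the formula using NumPy (the function name is mine; with an identity covariance matrix it reduces to the ordinary Euclidean distance):

```python
import numpy as np

def mahalanobis(x, mu, sigma):
    """Mahalanobis distance of point x from a distribution with
    mean mu and covariance matrix sigma."""
    delta = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    # sqrt((x - mu)^T Sigma^{-1} (x - mu))
    return float(np.sqrt(delta @ np.linalg.inv(sigma) @ delta))

print(mahalanobis([3, 4], [0, 0], np.eye(2)))  # identity covariance → 5.0
```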
2. Cosine similarity
The cosine of the angle between two vectors is used as a measure of the difference between two individuals in vector space.
By the law of cosines, for a triangle with sides a, b, c:
cos(θ) = (a^2 + b^2 - c^2) / (2ab)
or, in terms of the dot product of two vectors a and b:
cos(θ) = (a · b) / (||a|| * ||b||)
or, for two 2-D vectors (x1, y1) and (x2, y2):
cos(θ) = (x1*x2 + y1*y2) / (sqrt(x1^2 + y1^2) * sqrt(x2^2 + y2^2))
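The dot-product form generalizes to any number of dimensions; a minimal sketch (the function name is mine, not from the text):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # cos(theta) = (a · b) / (||a|| * ||b||)
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [0, 1]))  # orthogonal vectors → 0.0
```

Parallel vectors give a value of 1 regardless of their lengths, which is what makes the measure scale-invariant.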