Measurement Methods in Data Mining

  In data mining, whether for classification, clustering, anomaly detection, or correlation analysis, everything is built on a measure of similarity or dissimilarity between data. Distance is generally used as the metric of similarity or dissimilarity between data objects; commonly used methods include the Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance, Hamming distance, cosine distance, Mahalanobis distance, Jaccard coefficient, correlation coefficient, and information entropy.

Euclidean distance

  The Euclidean distance between two sample points $X$ and $Y$ in $n$-dimensional space is defined as follows:
$$ d(X, Y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2} $$
The normalized (standardized) Euclidean distance formula is as follows:
$$ d(X, Y) = \sqrt{\sum_{k=1}^{n} \left( \dfrac{x_k - y_k}{s_k} \right)^2} $$
where $s_k$ is the variance of the data in dimension $k$. The standardized Euclidean distance accounts for the fact that the scale and distribution of each component of the data differ; it is equivalent to normalizing each dimension of the data first. The plain Euclidean distance metric is suitable for data objects whose attributes have the same scale and the same distribution.
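As a rough illustration, here is a minimal Python sketch (assuming NumPy is available) of the two formulas above; the function names and the toy vectors are hypothetical.

```python
import numpy as np

def euclidean(x, y):
    """Plain Euclidean distance between two equal-length vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

def standardized_euclidean(x, y, s):
    """Standardized Euclidean distance; s holds the per-dimension spread s_k."""
    x, y, s = (np.asarray(a, dtype=float) for a in (x, y, s))
    return np.sqrt(np.sum(((x - y) / s) ** 2))

x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]
s = [1.0, 2.0, 0.5]                     # hypothetical per-dimension spread values
print(euclidean(x, y))                  # 5.0
print(standardized_euclidean(x, y, s))  # ~3.61
```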

Manhattan distance

  The Manhattan distance is also referred to as the city block distance and is calculated as follows:
$$ d(X, Y) = \sum_{k=1}^{n} \left| x_k - y_k \right| $$

Chebyshev distance

  The Chebyshev distance is calculated as follows:
$$ d(X, Y) = \lim_{r \rightarrow \infty} \left( \sum_{k=1}^{n} \left| x_k - y_k \right|^r \right)^{\frac{1}{r}} = \max_k \left( \left| x_k - y_k \right| \right) $$
The two expressions above (the limit form and the max form) are equivalent.

Minkowski distance

$$ d(X, Y) = \left( \sum_{k=1}^{n} \left| x_k - y_k \right|^r \right)^{\frac{1}{r}} $$
where $r$ is a variable parameter; depending on the value of $r$, the Minkowski distance can represent a whole family of distances:
  $r = 1$: the Manhattan distance
  $r = 2$: the Euclidean distance
  $r \rightarrow \infty$: the Chebyshev distance
The Minkowski distance, which includes the Euclidean, Manhattan, and Chebyshev distances, assumes that every dimension of the attribute data has the same scale and the same distribution (expectation, variance); it is therefore suitable for measuring independent and identically distributed (i.i.d.) data objects.
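As a rough sketch (assuming NumPy), the following hypothetical function shows how the single Minkowski formula recovers the Manhattan, Euclidean, and Chebyshev distances for $r = 1$, $r = 2$, and $r \rightarrow \infty$:

```python
import numpy as np

def minkowski(x, y, r):
    """Minkowski distance of order r between two equal-length vectors."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(r):
        return diff.max()                # Chebyshev distance (the r -> infinity limit)
    return np.sum(diff ** r) ** (1.0 / r)

x, y = [0.0, 3.0, 4.0], [5.0, 1.0, 2.0]
print(minkowski(x, y, 1))        # Manhattan: 5 + 2 + 2 = 9
print(minkowski(x, y, 2))        # Euclidean: sqrt(25 + 4 + 4) ≈ 5.74
print(minkowski(x, y, np.inf))   # Chebyshev: max(5, 2, 2) = 5
```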

Hamming distance

  The Hamming distance between two strings of equal length is the number of positions at which the corresponding characters differ, i.e., the minimum number of character substitutions required to transform one string into the other. For example:
$$ \text{Hamming}(1001001, 0101101) = 3 $$
The Hamming distance is used in information coding.
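A minimal sketch of the Hamming distance between two equal-length strings, reproducing the example above (the helper name is hypothetical):

```python
def hamming(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming("1001001", "0101101"))  # 3
```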

Cosine distance

  The cosine similarity is defined as follows:
$$ \cos(X, Y) = \dfrac{X \cdot Y}{\Vert X \Vert \, \Vert Y \Vert} = \dfrac{\sum_{k=1}^{n} x_k y_k}{\sqrt{\sum_{k=1}^{n} x_k^2} \sqrt{\sum_{k=1}^{n} y_k^2}} $$
The cosine similarity is in fact the cosine of the angle between the vectors $X$ and $Y$, and can be used to measure the difference in direction between the two vectors. If the cosine similarity is $1$, the angle between $X$ and $Y$ is $0°$, and the two vectors can be considered identical apart from their magnitudes; if the cosine similarity is $0$, the angle between $X$ and $Y$ is $90°$, and the two vectors are considered completely different. When computing the cosine distance, the vectors are in effect normalized to length $1$, so the magnitudes of the two data objects are not taken into account.
Cosine similarity is commonly used to measure the similarity between documents. A document can be represented as a vector in which each component records the frequency with which a particular word or term appears in the document. Although a document collection has a large number of attributes, each document vector is sparse, having relatively few non-zero attribute values.
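A small sketch (assuming NumPy) of cosine similarity applied to two hypothetical term-frequency vectors:

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors (a similarity, not a distance)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Hypothetical term-frequency vectors of two short documents
doc1 = [3, 0, 1, 0, 2]
doc2 = [1, 0, 0, 0, 1]
print(cosine_similarity(doc1, doc2))      # ≈ 0.94
print(1 - cosine_similarity(doc1, doc2))  # the corresponding cosine distance
```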

Mahalanobis distance

  Mahalanobis distance is calculated as follows:
$$ \text{mahalanobis}(X, Y) = (X - Y) \Sigma^{-1} (X - Y)^T $$
where $\Sigma^{-1}$ is the inverse of the covariance matrix of the data.
Most of the distance metrics above assume that the samples are independent and identically distributed and that there is no correlation between data attributes. The Mahalanobis distance takes the correlation between attributes into account, removing the interference caused by correlated attributes, and it is independent of the scale of measurement. If the covariance matrix is a diagonal matrix, the Mahalanobis distance reduces to the standardized Euclidean distance; if the covariance matrix is the identity matrix, i.e., the sample vectors are i.i.d., it reduces to the Euclidean distance.
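A rough sketch (assuming NumPy) of the formula above, with the covariance matrix estimated from a small made-up sample; note that many libraries (e.g. SciPy) define the Mahalanobis distance as the square root of this quantity.

```python
import numpy as np

def mahalanobis(x, y, cov_inv):
    """(X - Y) Sigma^{-1} (X - Y)^T, as in the formula above.

    The conventional Mahalanobis distance is the square root of this value.
    """
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(d @ cov_inv @ d)

# Hypothetical data set: rows are samples, columns are (correlated) attributes
data = np.array([[1.0, 2.0], [2.0, 3.5], [3.0, 5.0], [4.0, 7.5], [5.0, 9.0]])
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))     # inverse covariance matrix

print(mahalanobis(data[0], data[3], cov_inv))
print(np.sqrt(mahalanobis(data[0], data[3], cov_inv)))  # square-root (conventional) form
```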

Jaccard coefficient

  The Jaccard coefficient of two sets $A$ and $B$ is defined as the proportion of the elements of their intersection within their union, i.e.
$$ J(A, B) = \dfrac{\left| A \cap B \right|}{\left| A \cup B \right|} $$
For two data objects $X$ and $Y$, each described by $n$ binary attributes, we have
$$ J = \dfrac{f_{11}}{f_{01} + f_{10} + f_{11}} $$
where $f_{11}$ is the number of attributes for which $X$ takes $1$ and $Y$ takes $1$, $f_{01}$ is the number of attributes for which $X$ takes $0$ and $Y$ takes $1$, and $f_{10}$ is the number of attributes for which $X$ takes $1$ and $Y$ takes $0$.
The Jaccard coefficient is suitable for objects that contain only asymmetric binary attributes.
The generalized Jaccard coefficient is defined as follows:
$$ EJ(X, Y) = \dfrac{X \cdot Y}{\Vert X \Vert^2 + \Vert Y \Vert^2 - X \cdot Y} $$
The generalized Jaccard coefficient, also known as the Tanimoto coefficient, can be used for document data, and it reduces to the Jaccard coefficient in the case of binary attributes.
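A short sketch (assuming NumPy) of the binary Jaccard coefficient and the generalized (Tanimoto) form on made-up vectors:

```python
import numpy as np

def jaccard_binary(x, y):
    """Jaccard coefficient for binary (0/1) attribute vectors."""
    x, y = np.asarray(x), np.asarray(y)
    f11 = np.sum((x == 1) & (y == 1))   # attributes where both take 1
    f10 = np.sum((x == 1) & (y == 0))   # attributes where X takes 1 and Y takes 0
    f01 = np.sum((x == 0) & (y == 1))   # attributes where X takes 0 and Y takes 1
    return f11 / (f01 + f10 + f11)

def tanimoto(x, y):
    """Generalized Jaccard (Tanimoto) coefficient for real-valued vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xy = np.dot(x, y)
    return xy / (np.dot(x, x) + np.dot(y, y) - xy)

print(jaccard_binary([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))  # f11=2, f01=1, f10=1 -> 0.5
print(tanimoto([3.0, 0.0, 1.0], [1.0, 2.0, 0.0]))        # 3 / (10 + 5 - 3) = 0.25
```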

The correlation coefficient

  The correlation coefficient measures the linear relationship between the attributes of two data objects and is calculated as
$$ \rho_{XY} = \dfrac{s_{XY}}{s_x s_y} $$
$$ s_{XY} = \dfrac{1}{n-1} \sum_{k=1}^{n} (x_k - \overline{X})(y_k - \overline{Y}) $$
$$ s_x = \sqrt{\dfrac{1}{n-1} \sum_{k=1}^{n} (x_k - \overline{X})^2} $$
$$ s_y = \sqrt{\dfrac{1}{n-1} \sum_{k=1}^{n} (y_k - \overline{Y})^2} $$
$$ \overline{X} = \dfrac{1}{n} \sum_{k=1}^{n} x_k, \quad \overline{Y} = \dfrac{1}{n} \sum_{k=1}^{n} y_k $$
The correlation coefficient measures the degree of correlation between data objects and takes values in $[-1, 1]$. The larger the absolute value of the correlation coefficient, the stronger the correlation between $X$ and $Y$. When $X$ and $Y$ are linearly correlated, the correlation coefficient takes the value $1$ (positive linear correlation) or $-1$ (negative linear correlation); when it is $0$, they are linearly uncorrelated. In linear regression, a straight line is fitted to the sample points, and the correlation coefficient can be used to measure how linear the relationship is.
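A brief sketch (assuming NumPy) that evaluates the formulas above on made-up data and compares the result with NumPy's built-in np.corrcoef:

```python
import numpy as np

def correlation(x, y):
    """Pearson correlation coefficient rho_XY = s_XY / (s_x * s_y)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
    s_x = np.sqrt(np.sum((x - x.mean()) ** 2) / (n - 1))
    s_y = np.sqrt(np.sum((y - y.mean()) ** 2) / (n - 1))
    return s_xy / (s_x * s_y)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]    # roughly 2x, so nearly perfect positive correlation
print(correlation(x, y))         # close to 1
print(np.corrcoef(x, y)[0, 1])   # NumPy's result, for comparison
```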

Entropy

  Information entropy describes a kind of distance among the samples within a whole system; it measures how disordered or dispersed a distribution is. The more dispersed the sample distribution (or the more uniform the distribution), the larger the information entropy; the more ordered the distribution (or the more concentrated the distribution), the smaller the information entropy. Given a sample set $X$, its information entropy is defined by the following equation:
$$ \text{Entropy}(X) = \sum_{i=1}^{n} -p_i \log_2(p_i) $$
where $n$ is the number of classes in the sample set and $p_i$ is the probability that an element of class $i$ appears. When the $n$ classes of $X$ occur with equal probability, the information entropy attains its maximum value $\log_2(n)$. When $X$ has only one class, the information entropy attains its minimum value $0$. Information entropy measures the uncertainty of a classification; in decision-tree classification, the information gain, computed from the entropy before and after splitting a subtree, can be used as the criterion for the optimal split.
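A minimal sketch of the entropy formula above (assuming NumPy), applied to hypothetical class-probability distributions:

```python
import numpy as np

def entropy(p):
    """Information entropy (in bits) of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                 # treat 0 * log2(0) as 0
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # maximum for 4 classes: log2(4) = 2.0
print(entropy([1.0]))                     # a single class: minimum entropy 0.0
print(entropy([0.5, 0.25, 0.25]))         # 1.5 bits
```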
