Distances commonly used in machine learning

 

1 Basic properties of the distance formula

2 Common distance formulas

2.1 Euclidean Distance:

Euclidean distance is the most intuitive distance measure. The distance between two points in space that we learn about in elementary school, junior high school and high school is generally the Euclidean distance.

For example:

X=[[1,1],[2,2],[3,3],[4,4]];
The pairwise distances, in the order (1,2), (1,3), (1,4), (2,3), (2,4), (3,4), are:
d = 1.4142    2.8284    4.2426    1.4142    2.8284    1.4142
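
The values above can be reproduced with a short Python sketch. The helper names `euclidean` and `pairwise` are illustrative and not from the original article; only the standard library is assumed.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise(points, dist):
    """Distances for every unordered pair, in the order (1,2), (1,3), ..."""
    return [dist(a, b) for a, b in combinations(points, 2)]

X = [[1, 1], [2, 2], [3, 3], [4, 4]]
d = pairwise(X, euclidean)
print([round(v, 4) for v in d])
# [1.4142, 2.8284, 4.2426, 1.4142, 2.8284, 1.4142]
```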

 

2.2  Manhattan Distance:

When driving from one intersection to another in Manhattan, the driving distance is clearly not the straight-line distance between the two points. This actual driving distance is the "Manhattan distance", also called the "city block distance".

For example:

X=[[1,1],[2,2],[3,3],[4,4]];
The pairwise distances, in the order (1,2), (1,3), (1,4), (2,3), (2,4), (3,4), are:
d =   2     4     6     2     4     2
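
A minimal Python sketch of the same calculation, summing the absolute coordinate differences. The function name `manhattan` is illustrative and not from the original article.

```python
from itertools import combinations

def manhattan(a, b):
    """Sum of absolute coordinate differences (city block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

X = [[1, 1], [2, 2], [3, 3], [4, 4]]
# Every unordered pair, in the order (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
d = [manhattan(a, b) for a, b in combinations(X, 2)]
print(d)
# [2, 4, 6, 2, 4, 2]
```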

2.3 Chebyshev Distance:

In chess, the king can move horizontally, vertically, or diagonally, so in one step it can reach any of the 8 adjacent squares. How many steps does the king need to move from square (x1, y1) to square (x2, y2)? The answer is max(|x2 - x1|, |y2 - y1|), and this distance is called the Chebyshev distance.

For example:

X=[[1,1],[2,2],[3,3],[4,4]];
The pairwise distances, in the order (1,2), (1,3), (1,4), (2,3), (2,4), (3,4), are:
d =   1     2     3     1     2     1
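
A minimal Python sketch of the same calculation, taking the largest absolute coordinate difference. The function name `chebyshev` is illustrative and not from the original article.

```python
from itertools import combinations

def chebyshev(a, b):
    """Largest absolute coordinate difference (number of king moves)."""
    return max(abs(x - y) for x, y in zip(a, b))

X = [[1, 1], [2, 2], [3, 3], [4, 4]]
# Every unordered pair, in the order (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
d = [chebyshev(a, b) for a, b in combinations(X, 2)]
print(d)
# [1, 2, 3, 1, 2, 1]
```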

2.4 Minkowski Distance:

The Minkowski distance is not a single distance but a family of distances: one general formula that covers several distance measures.

The Minkowski distance between two n-dimensional variables a(x11,x12,...,x1n) and b(x21,x22,...,x2n) is defined as:

$$d_{12} = \sqrt[p]{\sum_{k=1}^{n} \left| x_{1k} - x_{2k} \right|^{p}}$$

Where p is a variable parameter:

  • When p=1, it is the Manhattan distance;

  • When p=2, it is the Euclidean distance;

  • When p→∞, it is the Chebyshev distance.

Depending on the value of p, the Minkowski distance represents a particular kind of distance.
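
The three special cases can be checked numerically. The sketch below is illustrative only: the function names and the two sample points are assumptions made for this example and do not come from the original article. A large finite p (here 50) stands in for the limit p→∞.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def chebyshev(a, b):
    """Largest absolute coordinate difference, the limit as p grows."""
    return max(abs(x - y) for x, y in zip(a, b))

a, b = [1, 1], [4, 5]        # hypothetical points: differences are 3 and 4

print(minkowski(a, b, 1))    # 7.0, the Manhattan distance
print(minkowski(a, b, 2))    # 5.0, the Euclidean distance
print(minkowski(a, b, 50))   # slightly above 4, approaching Chebyshev
print(chebyshev(a, b))       # 4
```

As p grows, the largest coordinate difference dominates the sum, which is why the result approaches the Chebyshev distance.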

Summary:

1 The Minkowski distances, including the Manhattan, Euclidean and Chebyshev distances, have obvious shortcomings:

e.g. For two-dimensional samples (height [unit: cm], weight [unit: kg]), consider three samples: a(180,50), b(190,50), c(180,60).

The Minkowski distance between a and b (whether Manhattan, Euclidean or Chebyshev) is equal to the Minkowski distance between a and c. But in reality, 10 cm of height is not equivalent to 10 kg of weight.

2 Disadvantages of the Minkowski distance:

(1) It treats the scales of the individual components, i.e. their "units", as if they were the same;

(2) It does not consider that the distributions of the components (expectation, variance, etc.) may differ.
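
The height/weight example can be verified with a short sketch. The function names are illustrative; the second half, which re-expresses height in metres, is an added assumption used only to show how the result depends on the chosen unit.

```python
def minkowski(u, v, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(u, v)) ** (1.0 / p)

def chebyshev(u, v):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(u, v))

# Samples from the text: (height in cm, weight in kg)
a, b, c = (180, 50), (190, 50), (180, 60)

print(minkowski(a, b, 1), minkowski(a, c, 1))   # 10.0 10.0  (Manhattan)
print(minkowski(a, b, 2), minkowski(a, c, 2))   # 10.0 10.0  (Euclidean)
print(chebyshev(a, b), chebyshev(a, c))         # 10 10      (Chebyshev)

# Assumption for illustration: the same samples with height in metres.
a_m, b_m, c_m = (1.8, 50), (1.9, 50), (1.8, 60)
print(round(minkowski(a_m, b_m, 2), 4))         # 0.1
print(minkowski(a_m, c_m, 2))                   # 10.0
```

Changing only the unit of one component changes which sample looks closer to a, which is the scale problem described in point (1).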

 

3 Distance calculation for "continuous attributes" and "discrete attributes"

Attributes are commonly divided into "continuous attributes" and "discrete attributes" (categorical attributes). The former have infinitely many possible values in their domain, and the latter have a finite number of values in their domain.

  • If there is an ordinal relationship between the attribute values, they can be converted into continuous values. For example, the height values "high", "medium" and "short" can be converted into {1, 0.5, 0}.
    • Minkowski distance can be used for ordered attributes.
  • If there is no ordinal relationship between the attribute values, they are usually converted into vectors. For example, the gender values "male" and "female" can be converted into {(1,0), (0,1)}.
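
The two conversions can be sketched in Python as follows. The dictionary and function names are illustrative assumptions; the encoded values are the ones given in the text.

```python
# Ordered attribute: map each level to a number that preserves the order.
ORDINAL = {"high": 1.0, "medium": 0.5, "short": 0.0}

# Unordered attribute: map each category to its own unit vector.
ONE_HOT = {"male": (1, 0), "female": (0, 1)}

def minkowski(u, v, p=2):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(u, v)) ** (1.0 / p)

# After the ordinal conversion, a Minkowski distance reflects the ordering:
# "high" is farther from "short" than from "medium".
d_high_short = minkowski([ORDINAL["high"]], [ORDINAL["short"]], p=1)
d_high_medium = minkowski([ORDINAL["high"]], [ORDINAL["medium"]], p=1)
print(d_high_short, d_high_medium)   # 1.0 0.5

# The vector form implies no ordering between categories.
print(ONE_HOT["male"], ONE_HOT["female"])   # (1, 0) (0, 1)
```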

 


Origin blog.csdn.net/qq_39197555/article/details/114992655