Machine Learning Metrics - the Correlation Coefficient

      Machine learning is an important and popular branch of AI. Both supervised and unsupervised learning rely on various "metrics" to quantify how similar or how different sample data are, and a good metric can significantly improve the accuracy of a classification or prediction algorithm. This series describes the metrics used in machine learning. They fall into three main groups: distances, whose objects of study are points in a linear space; similarities, whose objects of study are vectors in a linear space; and correlation coefficients, whose objects of study are distributions of data. This article covers correlation coefficients.

1 Pearson correlation coefficient - the most common correlation coefficient

      In statistics, the Pearson correlation coefficient is used to measure the degree of linear correlation between two variables X and Y; its value lies between -1 and 1. It is widely used in the natural sciences as a measure of the linear correlation between two variables. It was developed by Karl Pearson from a related but slightly different idea proposed by Francis Galton in the 1880s.
      For a population (a collection of things sharing some common property) with random variables (X, Y), the population Pearson correlation coefficient is defined as

\[{\rho _{X,Y}}{\rm{ = }}\frac{{{\mathop{\rm cov}} \left( {X,Y} \right)}}{{{\sigma _X}{\sigma _Y}}}{\rm{ = }}\frac{{E\left( {\left( {X - {\mu _X}} \right)\left( {Y - {\mu _Y}} \right)} \right)}}{{{\sigma _X}{\sigma _Y}}}\]

      where cov(X, Y) is the covariance between the random variables X and Y,
      σX is the standard deviation of the random variable X,
      σY is the standard deviation of the random variable Y,
      μX is the mean of the random variable X,
      μY is the mean of the random variable Y.

      Similarly, for a given sample {(x1, y1), (x2, y2), ..., (xn, yn)}, the sample Pearson correlation coefficient is defined as

\[{r_{x,y}}{\rm{ = }}\frac{{\sum\limits_{i = 1}^n {\left( {{x_i} - \bar x} \right)\left( {{y_i} - \bar y} \right)} }}{{\sqrt {\sum\limits_{i = 1}^n {{{\left( {{x_i} - \bar x} \right)}^2}} } \sqrt {\sum\limits_{i = 1}^n {{{\left( {{y_i} - \bar y} \right)}^2}} } }} = \frac{{n\sum\limits_{i = 1}^n {{x_i}{y_i}} - \sum\limits_{i = 1}^n {{x_i}} \sum\limits_{i = 1}^n {{y_i}} }}{{\sqrt {n\sum\limits_{i = 1}^n {x_i^2} - {{\left( {\sum\limits_{i = 1}^n {{x_i}} } \right)}^2}} \sqrt {n\sum\limits_{i = 1}^n {y_i^2} - {{\left( {\sum\limits_{i = 1}^n {{y_i}} } \right)}^2}} }}\]

      where n is the number of samples,
      xi, yi are the values of the i-th sample,
      x̄ is the mean of all the xi,
      ȳ is the mean of all the yi.
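As a sanity check, the sample formula above can be implemented directly in plain Python (a minimal sketch; the function name pearson_r is just an illustrative choice):

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: product of the square roots of the sums of squared deviations
    den = math.sqrt(sum((xi - mean_x) ** 2 for xi in x)) * \
          math.sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return num / den

# Perfectly linear data with positive slope gives r = 1
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
print(round(pearson_r(x, y), 6))  # 1.0
```

Flipping the direction of the relationship (e.g. y decreasing as x increases) drives r toward -1, while unrelated data gives r near 0.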


Figure 1: Scatter plots with different correlation coefficient values (ρ)

Figure 2: Correlation coefficients of several sets of points

2 Phi correlation coefficient - correlation between two binary variables

      In statistics, the "Phi correlation coefficient" (Phi coefficient, denoted φ) is a tool for measuring the correlation between two binary variables, invented by Karl Pearson [1]. Pearson also invented the closely related Pearson's chi-squared test (commonly known as the chi-square test), as well as the Pearson correlation coefficient for measuring the degree of correlation between two continuous variables. In the field of machine learning, the Phi coefficient is also known as the Matthews correlation coefficient.

      First arrange the two binary variables in a 2 × 2 contingency table, taking care of where 0 and 1 are placed: if the 0/1 positions are swapped for only X or only Y, the sign of the computed Phi coefficient is reversed. The basic idea of the Phi coefficient is: if most observations of the two binary variables fall in the "main diagonal" cells of the 2 × 2 contingency table, i.e. most observations are one of the two combinations (X, Y) = (1, 1) or (0, 0), the two variables are positively correlated. Conversely, if most observations fall in the "off-diagonal" cells, i.e. most observations are (X, Y) = (0, 1) or (1, 0), the two variables are negatively correlated.

          Y = 1    Y = 0    total
X = 1     n11      n10      a1
X = 0     n01      n00      a2
total     b1       b2       n

      where n11, n10, n01, n00 are the non-negative counts of observations falling in each cell, and they add up to n, the total number of observations. From the table above, the Phi correlation coefficient of X and Y can be computed as follows:

\[\phi {\rm{ = }}\frac{{{n_{11}}{n_{00}} - {n_{10}}{n_{01}}}}{{\sqrt {{a_1}{a_2}{b_1}{b_2}} }}\]

      A simple example: a researcher wants to study the correlation between sex and handedness. The null hypothesis is that sex and handedness are uncorrelated. The observed objects are randomly sampled individuals, each with two binary variables (sex X and dominant hand Y). X has two possible values (male M = 1 / female F = 0), and Y has two possible values (right-handed = 1 / left-handed = 0). The correlation between the two observed binary variables can be measured with the Phi coefficient. Suppose a simple random sample of 100 individuals yields the following 2 × 2 contingency table:

             M = 1    F = 0    total
Right = 1    43       44       87
Left = 0     7        6        13
total        50       50       100

      The computed Phi coefficient is φ = (43 × 6 - 44 × 7) / √(87 × 13 × 50 × 50) ≈ -0.0297. Assuming it passes a significance test, then with the 1/0 coding of the variables in this example it represents a slight negative correlation between being male and being right-handed: the proportion of right-handed men is slightly lower than the proportion of right-handed women; or, equivalently, the proportion of left-handed men is slightly higher than the proportion of left-handed women.
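The example can be reproduced with a short Python sketch (the function name phi_coefficient is an illustrative choice):

```python
import math

def phi_coefficient(n11, n10, n01, n00):
    """Phi coefficient from the four cells of a 2x2 contingency table."""
    a1, a2 = n11 + n10, n01 + n00   # row totals
    b1, b2 = n11 + n01, n10 + n00   # column totals
    return (n11 * n00 - n10 * n01) / math.sqrt(a1 * a2 * b1 * b2)

# Sex vs. handedness table from the example above
print(round(phi_coefficient(43, 44, 7, 6), 4))  # -0.0297
```

Note that swapping the 0/1 coding of one variable (e.g. exchanging the X = 1 and X = 0 rows, so the cells become 7, 6, 43, 44) flips the sign of φ without changing its magnitude, as described above.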


Source: www.cnblogs.com/Kalafinaian/p/10994010.html