Machine Learning Basics: Correlation Analysis

Correlation Analysis Definition

Correlation analysis is the mathematical analysis of two or more variables to determine how closely they are related. It follows from this definition that the purpose of correlation analysis is to measure the degree of correlation between variables, that its object is two or more variables, and that its method is mainly mathematical statistics.
Correlation analysis is widely used in big data analysis across many fields. Typical tasks include identifying whether variables are positively or negatively correlated, measuring how strong the relationship is (for example, complete or incomplete correlation), and modeling the relationship between variables in order to make predictions. Common methods include chart correlation analysis, covariance analysis, correlation coefficient analysis, and regression analysis.

1. Chart correlation analysis

Raw data sets are usually large, and the range over which the values change is hard to judge by inspection. It is therefore difficult to see the trend of a single variable, or the connection between several variables, by looking at the numbers alone. Chart correlation analysis addresses this by drawing charts that reveal the trends in the data and the connections between variables. Its main advantage is that it is easy to carry out, which makes it one of the most widely used methods today. Familiar examples include stock trend charts and weather charts.

2. Covariance analysis

Before introducing covariance analysis, you first need to understand variance. Variance measures the degree of dispersion of a variable or a set of data. Its calculation formula is as follows:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

where n represents the number of samples and x̄ represents the mean of the samples.
Covariance analysis builds on variance. Covariance measures how two variables vary together. Its calculation formula is as follows:
$$\operatorname{cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

Here x̄ and ȳ are the means of the two samples and n is the number of samples; both samples must contain the same number of observations. Generally speaking, when the two variables tend to change in the same direction, the covariance is positive and the variables are said to be positively correlated. When they tend to change in opposite directions, the covariance is negative and the variables are negatively correlated. When the two variables are independent of each other and there is no correlation, the covariance should be 0.
The formula above handles only two variables. To analyze the correlation among more than two variables, use the covariance matrix. The matrix formula is as follows:
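The sign behaviour described above can be checked with a few lines of plain Python. This is a minimal sketch, assuming the sample (n - 1) normalization; the function names and the toy data are illustrative and not part of the original article.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sample_variance(xs):
    """Variance with the n - 1 denominator (an assumption of this sketch)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def sample_covariance(xs, ys):
    """Covariance of two samples that must have the same length."""
    if len(xs) != len(ys):
        raise ValueError("both samples must have the same number of observations")
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


# Hypothetical toy data: 'rising' moves with 'base', 'falling' moves against it.
base = [1.0, 2.0, 3.0, 4.0, 5.0]
rising = [2.0, 4.0, 6.0, 8.0, 10.0]
falling = [10.0, 8.0, 6.0, 4.0, 2.0]

print(sample_variance(base))             # 2.5
print(sample_covariance(base, rising))   # 5.0  (same trend, positive)
print(sample_covariance(base, falling))  # -5.0 (opposite trend, negative)
```

Dividing by n instead of n - 1 would give the population versions; the signs, which are what this section relies on, are the same either way.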
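$$C = \begin{pmatrix} \operatorname{cov}(x,x) & \operatorname{cov}(x,y) & \operatorname{cov}(x,z) \\ \operatorname{cov}(y,x) & \operatorname{cov}(y,y) & \operatorname{cov}(y,z) \\ \operatorname{cov}(z,x) & \operatorname{cov}(z,y) & \operatorname{cov}(z,z) \end{pmatrix}$$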

where x, y and z represent three different variables respectively.
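A covariance matrix for three variables can be built by computing every pairwise covariance. The sketch below is one way to do it in plain Python, again assuming the n - 1 normalization; `covariance_matrix` and the sample values are hypothetical names and data chosen for illustration.

```python
def covariance_matrix(columns):
    """Return the k x k matrix of pairwise covariances for k equal-length columns."""
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all variables must have the same number of observations")
    means = [sum(col) / n for col in columns]

    def cov(i, j):
        # n - 1 denominator is an assumption of this sketch
        return sum(
            (a - means[i]) * (b - means[j])
            for a, b in zip(columns[i], columns[j])
        ) / (n - 1)

    k = len(columns)
    return [[cov(i, j) for j in range(k)] for i in range(k)]


# Hypothetical data: y rises with x, z falls as x rises.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
z = [4.0, 3.0, 2.0, 1.0]

for row in covariance_matrix([x, y, z]):
    print(row)
```

The diagonal entries are the variances of x, y and z, and the matrix is symmetric because cov(x, y) equals cov(y, x).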
Covariance only indicates the direction of the relationship between variables: a positive value means positive correlation and a negative value means negative correlation. Because its magnitude depends on the units and scale of the variables, it cannot express how strong the correlation is.

3. Correlation coefficient analysis

As noted for covariance analysis, covariance cannot express the degree of correlation between variables, but correlation coefficient analysis can. The correlation coefficient expresses how closely variables are correlated as a single number, and its calculation formula is as follows:
$$\rho_{xy} = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y}$$

Where cov(x,y) is the covariance between variable x and variable y, σ_x represents the standard deviation of variable x, σ_y represents the standard deviation of variable y, and the formula for calculating the standard deviation is as follows:
$$\sigma_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

The correlation coefficient ρ_xy lies between -1 and 1. A value of 1 means the two variables are completely positively correlated, and a value of -1 means they are completely negatively correlated. A value of 0 means there is no linear correlation between the two variables, and the closer the result is to 0, the weaker the correlation.
The calculation above is the basic form of correlation coefficient analysis. Three correlation coefficients are in common use: the Pearson Linear Correlation Coefficient (PLCC), the Spearman Rank-order Correlation Coefficient (SRCC), and the Kendall Rank-order Correlation Coefficient (KRCC).
The Pearson linear correlation coefficient describes the linear correlation between two variables. Its result is interpreted in the same way as the correlation coefficient above: it lies between -1 and 1, and the greater its absolute value, the stronger the correlation between the variables. Its calculation formula is as follows:
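To see why the correlation coefficient can express strength where covariance cannot, the sketch below divides the covariance by the two standard deviations and then rescales one variable. It assumes the n - 1 normalization for both covariance and standard deviation (the factor cancels in the ratio); the function name and the data are illustrative only.

```python
import math


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    sigma_x = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sigma_y = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return cov / (sigma_x * sigma_y)


# Hypothetical measurements of the same quantity in two different units.
length_m = [1.0, 2.0, 3.0, 4.0, 5.0]
length_cm = [v * 100 for v in length_m]
response = [2.0, 4.0, 5.0, 4.0, 5.0]

# Changing the unit multiplies the covariance by 100 but leaves the
# correlation coefficient unchanged, so it is comparable across data sets.
print(correlation(length_m, response))
print(correlation(length_cm, response))
```

Both calls print the same value (up to floating-point rounding), which always stays within the interval from -1 to 1.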
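$$\mathrm{PLCC} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$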

The Spearman rank correlation coefficient measures the dependence between two variables. It evaluates how well the relationship between two variables can be described by a monotonic function. A result of 1 or -1 means the two variables are completely monotonically correlated. Computing the Spearman rank correlation coefficient of two variables is equivalent to computing the Pearson linear correlation coefficient of their ranks. Its calculation formula is as follows:
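$$\mathrm{SRCC} = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
where d_i is the difference between the ranks of x_i and y_i and n is the number of samples. This form assumes there are no tied ranks.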
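The equivalence between the Spearman coefficient and the Pearson coefficient of the ranks can be sketched directly. The code below is a plain-Python illustration, not a reference implementation: tied values are given the average of their rank positions (a common convention, assumed here), and all function names and data are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson linear correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman(xs, ys):
    """Spearman coefficient computed as Pearson on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# A monotonic but non-linear relationship: y = x ** 3.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [v ** 3 for v in x]

print(pearson(x, y))   # below 1: the relationship is not linear
print(spearman(x, y))  # 1 up to rounding: the relationship is perfectly monotonic
```

The example shows the practical difference between the two coefficients: Pearson penalizes the curvature, while Spearman only looks at the ordering.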

The biggest difference between the Kendall rank correlation coefficient and the previous two is that it is used for correlation analysis of ordered categorical (ordinal) variables, and its calculation counts the consistent and inconsistent element pairs between the two variables. Its calculation formula is as follows:
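$$\mathrm{KRCC} = \frac{C - D}{\frac{1}{2}n(n-1)}$$
Here n is the number of elements in each variable, so n(n-1)/2 is the total number of element pairs.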

Where C represents the number of consistent (concordant) element pairs and D represents the number of inconsistent (discordant) element pairs.
Variables x and y can each be regarded as a set of elements, with i-th elements x_i and y_i and j-th elements x_j and y_j. If x_i>x_j and y_i>y_j, or x_i<x_j and y_i<y_j, the pair is consistent. If x_i>x_j and y_i<y_j, or x_i<x_j and y_i>y_j, the pair is inconsistent. If x_i=x_j or y_i=y_j, the pair is neither consistent nor inconsistent.
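The pair-counting rule above translates almost literally into code. The following sketch counts consistent and inconsistent pairs and then divides their difference by the total number of pairs, n(n-1)/2; that denominator is an assumption (it corresponds to the variant usually called tau-a), and the function name and data are illustrative.

```python
from itertools import combinations


def kendall_counts(xs, ys):
    """Return (C, D, coefficient) for two equal-length sequences."""
    concordant = 0
    discordant = 0
    for (x_i, y_i), (x_j, y_j) in combinations(zip(xs, ys), 2):
        product = (x_i - x_j) * (y_i - y_j)
        if product > 0:
            concordant += 1   # both differences have the same sign
        elif product < 0:
            discordant += 1   # the differences have opposite signs
        # product == 0: a tie, counted as neither
    n = len(xs)
    total_pairs = n * (n - 1) // 2
    # Dividing by all pairs is an assumption of this sketch (tau-a).
    return concordant, discordant, (concordant - discordant) / total_pairs


# Hypothetical rankings of four items by two judges.
judge_a = [1, 2, 3, 4]
judge_b = [1, 3, 2, 4]

c, d, tau = kendall_counts(judge_a, judge_b)
print(c, d, tau)  # 5 concordant pairs, 1 discordant pair
```

Only the pair formed by the second and third items is ordered differently by the two judges, so D is 1 and the remaining five pairs are concordant.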

4. Regression Analysis

Regression analysis is a statistical method for expressing the relationship between two or more variables in terms of independent variables and a dependent variable. The relationship between two variables is usually expressed by a unary linear regression equation, and the relationship among more than two variables by a multiple linear regression equation. The unary linear regression equation is as follows:
$$y = b_0 + b_1 x$$

Where x is the independent variable, y is the dependent variable, b_0 is the intercept of the equation, and b_1 is its slope. The intercept and slope are calculated by substituting the observed values of the independent variable and the dependent variable into the formula. Similarly, the multiple linear regression equation is as follows, where there are two or more independent variables and each has its own slope to be calculated.
$$y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$
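For the unary case, the intercept and slope can be estimated from data with ordinary least squares. The article does not say which fitting method it has in mind, so least squares is an assumption here, and `fit_line` and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b_0, b_1) for y = b_0 + b_1 * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    b1 = sxy / sxx       # slope
    b0 = my - b1 * mx    # intercept: the line passes through (mean x, mean y)
    return b0, b1


# Hypothetical points lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

b0, b1 = fit_line(xs, ys)
print(b0, b1)         # 1.0 2.0
print(b0 + b1 * 6.0)  # prediction for x = 6: 13.0
```

The slope is the ratio of the covariance of x and y to the variance of x, which ties regression back to the covariance analysis in section 2. The multiple-variable case needs a linear solve and is usually left to a numerical library.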

Origin blog.csdn.net/weixin_42051846/article/details/129440842