Regression Algorithms: Linear Regression and Logistic Regression

Regression, also known as multiple regression analysis, refers to a statistical method that studies the relationship between a set of random variables (Y1, Y2, ..., Yi) and another set of variables (X1, X2, ..., Xk).
Usually Y1, Y2, ..., Yi are the dependent variables and X1, X2, ..., Xk are the independent variables.
A regression is a mathematical model.
Classification:
A univariate (simple) linear regression model consists of one independent variable and one
dependent variable; the model is Y = a + bX + ε (X is the independent variable, Y is the dependent variable, and ε is the random error).
It is usually assumed that the random error has mean 0 and variance σ^2 (σ^2 > 0, and σ^2 does not depend on the value of X).
If the random error is further assumed to follow a normal distribution, the model is called a normal linear model.
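The model above can be fitted by ordinary least squares. A minimal sketch in Python, using synthetic data (the particular numbers are assumptions for illustration):

```python
import numpy as np

# Simple linear regression Y = a + b*X + ε, fitted by least squares.
rng = np.random.default_rng(0)
X = np.linspace(0.0, 10.0, 50)
Y = 2.0 + 3.0 * X + rng.normal(0.0, 1.0, size=X.size)  # true a=2, b=3

# Closed-form least-squares estimates:
#   b = cov(X, Y) / var(X),  a = mean(Y) - b * mean(X)
b = np.cov(X, Y, bias=True)[0, 1] / np.var(X)
a = Y.mean() - b * X.mean()
```

With 50 points and noise of standard deviation 1, the estimates land close to the true values a = 2 and b = 3.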
Generally, if there are k independent variables and one dependent variable, the value of the dependent variable splits into two parts:
one part is the effect of the independent variables, expressed as a function of them whose form is known but contains unknown parameters;
the other part is determined by other factors not included in the model and by random influences, i.e. the random error.
Linear regression model: the function is a linear function of the unknown parameters.
Nonlinear regression model: the function is a nonlinear function of the unknown parameters.
Multivariate regression: the number of dependent variables is greater than 1.
Multiple regression: the number of independent variables is greater than 1.
Content:
When several independent variables affect one dependent variable, judge whether each independent variable has a significant influence, keep the significant variables in the model, and eliminate the insignificant ones.
Stepwise regression, forward regression, and backward regression are usually used to determine the quantitative relationship between certain variables
from a set of data; that is, to establish a mathematical model and estimate its unknown parameters. The least squares method is usually used.
The main types of regression are: linear regression, curve regression, binary logistic regression, and multinomial logistic regression.

Multiple Linear Regression:
There are two or more independent variables.
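With two or more independent variables, the coefficients can still be estimated by least squares, now via a design matrix. A minimal sketch with synthetic data (the true coefficients are assumptions for illustration):

```python
import numpy as np

# Multiple linear regression: y = 1 + 2*x1 - 0.5*x2 + ε.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0.0, 0.1, size=100)

A = np.column_stack([np.ones(100), X])        # design matrix with intercept
beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # [intercept, b1, b2]
```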
Stepwise regression:
1) Forward introduction method: start from a single-variable regression and add variables one at a time until the criterion reaches its optimum.
2) Backward elimination method: start from the regression equation containing all variables and delete variables one at a time until the criterion reaches its optimum.
3) Stepwise screening method: a combination of the two methods above.
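The forward introduction method (1) can be sketched as a greedy loop that, at each step, adds the variable that most reduces the residual sum of squares. This is a simplified illustration on synthetic data; real stepwise procedures usually stop on F-tests, AIC, or similar criteria rather than this naive relative-improvement rule:

```python
import numpy as np

def forward_select(X, y, min_improvement=1e-3):
    """Greedy forward selection by residual sum of squares (RSS)."""
    n, k = X.shape
    selected, remaining = [], list(range(k))

    def rss(cols):
        # Fit an OLS model with an intercept plus the given columns.
        A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r = y - A @ beta
        return r @ r

    best = rss([])
    while remaining:
        new_rss, c = min((rss(selected + [c]), c) for c in remaining)
        if best - new_rss < min_improvement * best:
            break  # improvement too small: stop adding variables
        selected.append(c)
        remaining.remove(c)
        best = new_rss
    return selected

# Synthetic data: only variables 0 and 3 actually drive y.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(0.0, 0.1, size=200)
selected = forward_select(X, y)  # variables 0 and 3 should be picked first
```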

Logistic Regression ——–> http://blog.chinaunix.net/xmlrpc.php?r=blog/article&uid=9162199&id=4223505
Logistic regression belongs to the family of generalized linear models.
Basic principles:
(1) Find a suitable prediction function (called the hypothesis in Andrew Ng's public course), generally written as the h function. This function is the classification function we are looking for; it predicts the judgment result for the input data. This step is critical: it requires some understanding or analysis of the data, in order to know or guess the "approximate" form of the prediction function, for example a linear or nonlinear function.
(2) Construct a cost function (loss function) that represents the deviation between the predicted output (h) and the category of the training data (y); it can be the difference (h − y) or take some other form. Considering the "loss" over all training data, the costs are summed or averaged and recorded as the J(θ) function, which represents the deviation of the predictions from the actual categories over the whole training set.
(3) Obviously, the smaller the value of J(θ), the more accurate the prediction function (that is, the more accurate the h function), so this step finds the minimum of J(θ). There are different ways to minimize a function; some Logistic Regression implementations use gradient descent.
The prediction function generally uses the sigmoid function (softmax in the multi-class case); the loss
function is derived from, and solved by, maximum likelihood estimation.
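The three steps above can be sketched directly: a sigmoid hypothesis h, the cross-entropy cost J(θ) whose gradient is computed from (h − y), and plain gradient descent. The data, learning rate, and step count are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Binary logistic regression trained by gradient descent."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])   # add intercept column
    theta = np.zeros(k + 1)
    for _ in range(steps):
        h = sigmoid(A @ theta)             # step (1): hypothesis h
        grad = A.T @ (h - y) / n           # step (2): gradient of J(theta)
        theta -= lr * grad                 # step (3): gradient descent
    return theta

# Synthetic binary data with boundary x1 + x2 = 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

theta = fit_logistic(X, y)
p = sigmoid(np.column_stack([np.ones(300), X]) @ theta)  # probabilities in (0, 1)
accuracy = ((p > 0.5) == (y == 1)).mean()
```

For the multi-class case the sigmoid would be replaced by softmax, with one weight vector per class.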
Advantages:
1) The prediction result is a probability between 0 and 1;
2) It can be applied to both continuous and categorical independent variables;
3) It is easy to use and interpret;
4) It is simple to implement and easy to understand, with low computational cost, fast speed, and low storage requirements.
Disadvantages:
1) It is sensitive to multicollinearity among the independent variables. For example, when two highly correlated independent variables are put into the model at the same time, the sign of the regression coefficient of the weaker variable may not meet expectations and can even be reversed. Factor analysis or variable clustering can be used to select representative independent variables and reduce the correlation among the candidates.
2) The prediction curve is "S"-shaped, so the conversion from log(odds) to probability is nonlinear: at both ends, changes in the log(odds) value produce very small changes in probability (the marginal effect and slope are too small), while in the middle the probability changes greatly and is very sensitive. Over many intervals, the influence of variable changes on the target probability cannot be distinguished, and a threshold cannot be determined.
3) It is prone to underfitting, and the classification accuracy may not be high.
