A Concise Tutorial on Regression Analysis

To understand the motivation behind regression, let us consider the following simple example. The scatterplot below shows the number of US college graduates receiving master's degrees from 2001 to 2012.

[Figure: scatterplot of US master's degree recipients, 2001–2012]

Now, with this data available, suppose someone asked you: how many college graduates received a master's degree in 2018? It can be seen that the number of graduates with a master's degree increases almost linearly with the year, so with a simple visual analysis we can roughly estimate the number to be somewhere between 2 million and 2.1 million. Let's look at the actual numbers. The graph below plots the same variable from 2001 to 2018; as can be seen, our prediction is roughly in line with the actual values.

[Figure: US master's degree recipients, 2001–2018]

Since this is a relatively simple problem (fitting a line to data), our brains can easily do this. This process of fitting a function to a set of data points is called regression analysis.

1. What is regression analysis?

Regression analysis is the process of estimating the relationship between dependent and independent variables. In short, this means fitting a function from a chosen family of functions to sampled data under some error function. Regression analysis is one of the most fundamental tools used in machine learning for prediction. Using regression, you fit a function to the available data and try to predict the outcome for future or held-out data points. This function fitting serves two purposes.

  • Missing data within the data range can be estimated (interpolation)
  • Future data outside the data range can be estimated (extrapolation)

Some real-world examples of regression analysis include predicting house prices based on house characteristics, predicting the impact of SAT/GRE scores on college admissions, predicting sales based on input parameters, predicting weather, etc.

Let us consider the previous example of college graduates.

  • Interpolation: Suppose we have access to some sparse data where we know the number of college graduates every 4 years, as shown in the scatterplot below.
    [Figure: scatterplot with graduate counts available only every 4 years]

We want to estimate the number of college graduates for all missing years in between. We can do this by fitting a line to the limited available data points. This process is called interpolation.

[Figure: line fitted to the sparse points, interpolating the missing years]

  • Extrapolation: Suppose we only have limited data available from 2001 to 2012, and we want to predict the number of college graduates from 2013 to 2018.
    [Figure: scatterplot of graduate counts for 2001–2012 only]

It can be seen that the number of university graduates with a master's degree increases almost linearly with the year, so it makes sense to fit a line to the dataset. Fitting a line to these 12 points and then testing its predictions on the next 6 points shows that they are very close to the actual values.
[Figure: line fitted to the 2001–2012 points and extrapolated to 2013–2018]
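As a minimal sketch of this fit-and-extrapolate workflow (using synthetic, roughly linear numbers in place of the actual graduation statistics), one can fit a first-degree polynomial to the 2001–2012 points and evaluate it on 2013–2018:

```python
import numpy as np

# Synthetic, roughly linear yearly counts in millions (illustrative only, not the real statistics).
rng = np.random.default_rng(0)
years_train = np.arange(2001, 2013)                      # the 12 known years
grads_train = 1.45 + 0.04 * (years_train - 2001) + rng.normal(0, 0.01, years_train.size)

# Fit a line (degree-1 polynomial); centering the years keeps the fit well conditioned.
slope, intercept = np.polyfit(years_train - 2000, grads_train, deg=1)

# Extrapolate to the next 6 years.
years_test = np.arange(2013, 2019)
predictions = intercept + slope * (years_test - 2000)
print(dict(zip(years_test.tolist(), predictions.round(3).tolist())))
```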

Defining regression analysis in mathematical language: given data points, estimate the parameters of a function from a chosen family under the constraints of a chosen loss function:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{P} l\big(f_\beta(x_i),\, y_i\big)$$

where $(x_i, y_i)$, $i = 1, \ldots, P$ are the data points, $f_\beta$ is the function parameterized by $\beta$, and $l$ is the loss function.

2. Types of regression analysis

Now let's talk about the different ways of doing regression. Depending on the family of functions $f_\beta$ and the loss function $l$ used, we can classify regression into the following categories.

  • Linear regression
  • Polynomial regression
  • Ridge regression
  • LASSO regression
  • ElasticNet regression
  • Bayesian regression
  • Logistic regression

3. Linear regression

In linear regression, the goal is to fit a hyperplane (a line, for 2D data points) by minimizing the sum of squared errors over the data points.

Mathematically, linear regression solves the following problem:

$$\hat{\beta}_0, \hat{\beta}_1 = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^{P} \big(y_i - (\beta_0 + \beta_1 x_i)\big)^2$$

Therefore, we need to find 2 parameters, denoted β_0 and β_1, that parameterize the linear function f(·). An example of linear regression is the interpolation fit shown above, where P = 5; the figure also shows the fitted linear function with β_0 = -90.798 and β_1 = 0.046.
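The least-squares problem above has a closed-form solution. A small sketch on synthetic data (the original graduate counts are not reproduced here), solving the normal equations directly and checking against scikit-learn's LinearRegression:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2.0 + 0.5 * x + noise (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(0, 0.2, x.size)

# Closed-form least-squares solution via the normal equations.
X = np.column_stack([np.ones_like(x), x])    # design matrix [1, x]
beta = np.linalg.solve(X.T @ X, X.T @ y)     # [beta_0, beta_1]
print("normal equations:", beta)

# The same fit with scikit-learn, for comparison.
model = LinearRegression().fit(x.reshape(-1, 1), y)
print("sklearn:", model.intercept_, model.coef_[0])
```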

4. Polynomial regression

Linear regression assumes that the relationship between the dependent variable (y) and the independent variable (x) is linear. It cannot fit data points when the relationship between them is not linear. Polynomial regression extends the fitting capabilities of linear regression by fitting a polynomial of degree m to the data points. The richer the functions considered, the better (in general) its ability to fit. Mathematically, polynomial regression solves the following problem.
$$\hat{\beta}_0, \ldots, \hat{\beta}_m = \arg\min_{\beta_0, \ldots, \beta_m} \sum_{i=1}^{P} \Big(y_i - \sum_{j=0}^{m} \beta_j x_i^{\,j}\Big)^2$$

So we need to find (m+1) parameters, denoted β_0, ..., β_m. It can be seen that linear regression is a special case of polynomial regression with degree m = 1.

Consider the following set of data points plotted as a scatterplot. If we use linear regression, the resulting fit clearly fails to capture the data points. But if we use polynomial regression of degree 6, we get a much better fit, as shown below:

[Left] Data scatterplot - [Middle] Linear regression - [Right] Degree 6 polynomial regression

Since the data points do not have a linear relationship between the dependent and independent variables, linear regression cannot estimate a good fit function. Polynomial regression, on the other hand, is capable of capturing non-linear relationships.
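A hedged sketch of degree-6 polynomial regression on synthetic non-linear data (the points from the figure above are not available here), using scikit-learn's PolynomialFeatures pipeline:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic non-linear data: a sine curve plus noise (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 40).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0, 0.1, x.shape[0])

# Degree-6 polynomial regression: expand x into [1, x, ..., x^6], then fit linearly.
poly_model = make_pipeline(PolynomialFeatures(degree=6), LinearRegression())
poly_model.fit(x, y)

# Compare with a plain linear fit on the same data.
linear_model = LinearRegression().fit(x, y)
print("degree-6 polynomial R^2:", poly_model.score(x, y))
print("plain linear R^2       :", linear_model.score(x, y))
```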

5. Ridge regression

Ridge regression addresses the problem of overfitting in regression analysis. To understand this, consider the same example as above. When a polynomial of degree 25 is fitted to the 10 training points, it fits the red data points perfectly (middle panel below), but doing so hurts the fit elsewhere (note the spike between the last two data points), as can be seen in the figure below. Ridge regression attempts to solve this problem: it tries to minimize the generalization error by deliberately sacrificing some of the fit on the training points.

[Left] Scatter plot of data - [Middle] Polynomial regression of degree 25 - [Right] Ridge regression of degree 25 polynomial

Mathematically, ridge regression solves the following problem by modifying the loss function:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{P} \big(y_i - f_\beta(x_i)\big)^2 + \alpha \lVert \beta \rVert_2^2$$

The function f(x) can be a linear or a polynomial function. In the absence of ridge regularization, the learned weights tend to be quite large when the function overfits the data points. Ridge regression avoids overfitting by adding a scaled L2 norm of the weights (β) to the loss function, which constrains the norm of the learned weights.

Therefore, training a model is a trade-off between fitting the training points as closely as possible (which tends to produce weights with a large norm) and keeping the weight norm small. A scaling constant α > 0 controls this trade-off. Smaller α values result in larger weight norms and overfitting to the training data points; larger values of α result in a poorer fit to the training points but much smaller weight norms. A careful choice of α yields the best trade-off.
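A small sketch of this trade-off, assuming a high-degree polynomial fit on a few noisy points; scikit-learn's Ridge uses the parameter alpha for the scaling constant, and the features are standardized so the penalty treats the polynomial powers comparably:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# A few noisy training points (illustrative only).
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10).reshape(-1, 1)
y_train = np.sin(2 * np.pi * x_train).ravel() + rng.normal(0, 0.1, 10)

def poly_model(regressor, degree=25):
    # Expand to degree-25 polynomial features and scale them before the regressor.
    return make_pipeline(PolynomialFeatures(degree), StandardScaler(), regressor)

unregularized = poly_model(LinearRegression()).fit(x_train, y_train)
ridge = poly_model(Ridge(alpha=1e-3)).fit(x_train, y_train)

# The ridge weights should have a much smaller norm than the unregularized ones.
print("||beta|| without ridge:", np.linalg.norm(unregularized[-1].coef_))
print("||beta|| with ridge   :", np.linalg.norm(ridge[-1].coef_))
```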

6. LASSO regression

LASSO regression is similar to Ridge regression in that they are both used as regularizers to prevent overfitting to the training data points. But LASSO has an added benefit. It enforces sparsity of learned weights.

Ridge regression forces the learned weights to have a smaller norm, resulting in a set of weights with a reduced overall norm; most, if not all, of the weights will be non-zero. LASSO, on the other hand, tries to find a set of weights in which most entries are driven to zero. This results in a sparse set of weights whose implementation is more efficient than that of a dense weight vector while maintaining similar accuracy in fitting the data points.

The diagram below visualizes this idea using the same example as above: the data points are fit with both ridge and LASSO regression, and the corresponding fits and weights (sorted in ascending order) are plotted. It can be seen that most of the weights in the LASSO regression are indeed close to zero.
[Figure: ridge vs. LASSO fits and their weights sorted in ascending order]

Mathematically, LASSO regression solves the following problem by modifying the loss function:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{P} \big(y_i - f_\beta(x_i)\big)^2 + \alpha \lVert \beta \rVert_1$$

The difference between LASSO and ridge regression is that LASSO uses the L1 norm of the weights instead of the L2 norm. The L1 norm in the loss function tends to increase the sparsity of the learned weights. See the L1 regularization section of this post for more details on how to enforce sparsity.

The constant α > 0 controls the trade-off between goodness of fit and the sparsity of the learned weights. Larger α values result in poorer fits but a sparser set of learned weights; smaller α values result in a tight fit to the training data points (possibly leading to overfitting) but a less sparse set of weights.
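A sketch of the sparsity effect, assuming many candidate features of which only a few actually matter; with the same penalty strength, scikit-learn's Lasso zeroes out most coefficients while Ridge keeps essentially all of them non-zero:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 50 features, only 3 of which actually influence y (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
true_beta = np.zeros(50)
true_beta[[3, 17, 42]] = [2.0, -1.5, 0.8]
y = X @ true_beta + rng.normal(0, 0.1, 100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print("non-zero LASSO weights:", np.count_nonzero(lasso.coef_))   # only a handful
print("non-zero ridge weights:", np.count_nonzero(ridge.coef_))   # essentially all 50
```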

7. ElasticNet regression

ElasticNet regression is a combination of ridge regression and LASSO regression: the penalty term contains both the L1 and L2 norms of the weights, each with its own scaling constant. It is often used to address limitations of LASSO regression, such as the fact that the LASSO objective is not strictly convex, so the solution may not be unique when features are highly correlated; adding the quadratic L2 penalty makes the problem strictly convex.

Mathematically, ElasticNet regression solves the following problem by modifying the loss function:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{P} \big(y_i - f_\beta(x_i)\big)^2 + \alpha_1 \lVert \beta \rVert_1 + \alpha_2 \lVert \beta \rVert_2^2$$
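A minimal sketch with scikit-learn's ElasticNet, where alpha sets the overall penalty strength and l1_ratio splits it between the L1 and L2 terms (the highly correlated synthetic features are an assumption made to illustrate a setting where ElasticNet is typically preferred over LASSO):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Synthetic data: 10 nearly identical (highly correlated) copies of one underlying feature.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([base + rng.normal(0, 0.01, (100, 1)) for _ in range(10)])
y = 3.0 * base.ravel() + rng.normal(0, 0.1, 100)

# l1_ratio=0.5 gives an even mix of the L1 (sparsity) and L2 (grouping) penalties.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("ElasticNet coefficients:", np.round(enet.coef_, 3))
```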

8. Bayesian regression

For the regressions discussed above (the frequentist approach), the goal is to find a single deterministic set of weight values (β) that explains the data. In Bayesian regression, instead of finding one value for each weight, we try to infer the distribution of the weights, starting from an assumed prior.

So we start with an initial distribution over the weights and, as data becomes available, use Bayes' theorem to relate the prior distribution to the posterior distribution through the likelihood and the evidence, pushing the distribution in the right direction.
$$p(\beta \mid D) = \frac{p(D \mid \beta)\, p(\beta)}{p(D)}$$

where $p(\beta)$ is the prior, $p(D \mid \beta)$ the likelihood, $p(D)$ the evidence, and $p(\beta \mid D)$ the posterior over the weights given the data $D$.

When we have infinitely many data points, the posterior distribution of the weights collapses to an impulse (a Dirac delta) at the ordinary least squares solution, i.e., its variance approaches zero.

Finding a distribution over the weights rather than a single set of deterministic values serves two purposes:

  • It naturally guards against overfitting and thus acts as a regularizer
  • It provides confidence information and a range for each weight, which is more informative than returning a single value.

Let's formulate this problem mathematically and give its solution:

Assume a linear model with Gaussian observation noise, so that each observation satisfies

$$y_i = \beta^\top x_i + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2),$$

i.e. the likelihood of the data is $p(y \mid X, \beta) = \mathcal{N}(X\beta,\, \sigma^2 I)$.

Let's place a Gaussian prior on the weights with mean μ and covariance Σ, namely:

$$p(\beta) = \mathcal{N}(\beta \mid \mu, \Sigma)$$

Based on available data D, we update this distribution. For the problem at hand, the posterior will be a Gaussian distribution with the following parameters:

$$\Sigma_{\text{post}} = \Big(\Sigma^{-1} + \frac{1}{\sigma^2} X^\top X\Big)^{-1}, \qquad \mu_{\text{post}} = \Sigma_{\text{post}} \Big(\Sigma^{-1}\mu + \frac{1}{\sigma^2} X^\top y\Big)$$

A detailed mathematical explanation can be found here.
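A small numpy sketch of this closed-form update, assuming the noise variance σ² is known and (for illustration) a zero-mean, identity-covariance Gaussian prior:

```python
import numpy as np

def gaussian_posterior(X, y, sigma2=0.1, prior_mu=None, prior_cov=None):
    # Posterior mean and covariance of the weights for Bayesian linear regression
    # with known noise variance sigma2 and a Gaussian prior N(prior_mu, prior_cov).
    n_features = X.shape[1]
    if prior_mu is None:
        prior_mu = np.zeros(n_features)
    if prior_cov is None:
        prior_cov = np.eye(n_features)
    prior_prec = np.linalg.inv(prior_cov)
    post_cov = np.linalg.inv(prior_prec + X.T @ X / sigma2)
    post_mu = post_cov @ (prior_prec @ prior_mu + X.T @ y / sigma2)
    return post_mu, post_cov

# Example: noisy line y = 1 + 2x, with a bias column in the design matrix.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(0, np.sqrt(0.1), 30)

mu_post, cov_post = gaussian_posterior(X, y)
print("posterior mean:", mu_post)               # should be close to [1, 2]
print("posterior variances:", np.diag(cov_post))
```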

Let's try to understand this intuitively by looking at sequential Bayesian linear regression, where the distribution of the weights is updated one data point at a time, as shown below:

Bayesian regression pushes the posterior distribution in the right direction based on the input data (x, y)

As each data point is included, the distribution of weights becomes closer to the actual underlying distribution.

The animation below plots the raw data, the predicted interquartile range, the marginal posterior distributions of the weights, and the joint distribution of the weights at each time step as one new data point at a time is considered. It can be seen that as more points are included, the interquartile range narrows (green shaded area), the marginal distributions concentrate around the two weight parameters with variance approaching zero, and the joint distribution converges to the actual weights.

[Animation: data, predicted interquartile range, marginal posteriors, and joint posterior of the weights after each update]

9. Logistic regression

Logistic regression comes in handy in classification tasks where the output needs to be the conditional probability of the class given the input. Mathematically, logistic regression solves the following problem:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{P} -\Big[y_i \log \sigma\big(f_\beta(x_i)\big) + (1 - y_i)\log\Big(1 - \sigma\big(f_\beta(x_i)\big)\Big)\Big], \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

Consider the following example, where the data points belong to one of two classes: {0 (red), 1 (yellow)}, as shown in the scatterplot below.
[Left] Scatterplot of data points - [Right] Logistic regression trained on data points plotted in blue

Logistic regression applies the sigmoid function to the output of a linear or polynomial function, mapping it from (-∞, ∞) to (0, 1). The test data is then classified into one of the two classes using a threshold (usually 0.5).
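A brief sketch on synthetic two-class data (standing in for the red/yellow points above), using scikit-learn's LogisticRegression; predict_proba returns the conditional class probabilities and predict applies the default 0.5 threshold:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic 1-D data: class 0 clustered near -2, class 1 near +2 (illustrative only).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)]).reshape(-1, 1)
y = np.concatenate([np.zeros(50), np.ones(50)])

clf = LogisticRegression().fit(x, y)

x_new = np.array([[-3.0], [0.1], [2.5]])
print("P(class 1 | x):", clf.predict_proba(x_new)[:, 1].round(3))  # sigmoid output in (0, 1)
print("predicted class:", clf.predict(x_new))                      # thresholded at 0.5
```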

It may seem that logistic regression is a classification algorithm rather than a regression method, but that is not quite the case. You can find more information on this in Adrian's post.

10. Conclusion

In this article, we looked at various regression analysis methods, the motivation behind each, and how to use them. The table and cheat sheet below summarize the different approaches discussed above.

[Table and cheat sheet summarizing the regression methods discussed above]


Original link: Regression Analysis Concise Tutorial—BimAnt

Source: blog.csdn.net/shebao3333/article/details/132015350