AIC, BIC, MDL

https://blog.csdn.net/xianlingmao/article/details/7891277

https://blog.csdn.net/lfdanding/article/details/50732762

Refer to the article http://blog.csdn.net/lynnucas/article/details/47947943 (reposted from http://blog.csdn.net/jteng/article/details/40823675).

For model selection here we consider only the number of model parameters; the choice of model structure is not involved.

Many parameter estimation problems use the likelihood function as the objective function. With enough training data, the fit of the model can be improved continually, but only at the cost of increasing model complexity, which brings on a problem very common in machine learning: overfitting. Model selection therefore seeks the best balance between the complexity of a model and its ability to describe the data set (i.e., the likelihood function).

Many information criteria have been proposed that avoid the overfitting problem by adding a penalty term for model complexity. Here we introduce two commonly used model selection methods: the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC).

AIC is a measure of the goodness of fit of a statistical model. It was proposed by the Japanese statistician Hirotugu Akaike in 1974. Grounded in the concept of entropy, it provides a standard for weighing the complexity of the estimated model against how well the model fits the data.

Typically, AIC is defined as:

AIC = -2 ln(L) + 2k

where k is the number of model parameters and L is the maximized value of the likelihood function. When choosing the best model from a set of candidate models, the model with the smallest AIC is usually chosen.

When two models differ substantially, the difference shows up mainly in the likelihood term. When the likelihoods are not significantly different, the penalty term 2k comes into play, so the model with fewer parameters is the better choice.

Generally speaking, as model complexity increases (k grows), the likelihood L also grows, which tends to make the AIC smaller; but when k becomes too large, the growth of the likelihood slows down, the penalty dominates, and the AIC rises, while an overly complex model easily leads to overfitting. The goal is to select the model with the smallest AIC: AIC rewards goodness of fit (through the maximized likelihood) while introducing a penalty that keeps the number of model parameters as small as possible, which helps reduce the risk of overfitting.
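For concreteness, here is a minimal Python sketch (not from the original article; the data and the two candidate models are made up) that computes AIC for two Gaussian models of the same sample:

```python
import numpy as np
from scipy import stats

# Hypothetical data: 200 draws from a normal distribution.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=200)

def aic(log_likelihood, k):
    """AIC = -2 ln(L) + 2k; smaller is better."""
    return -2.0 * log_likelihood + 2 * k

# Model 1: Gaussian with free mean and variance (k = 2).
mu, sigma = data.mean(), data.std()
logL1 = stats.norm.logpdf(data, mu, sigma).sum()

# Model 2: Gaussian with the mean fixed at 0, free variance (k = 1).
sigma0 = np.sqrt((data ** 2).mean())
logL2 = stats.norm.logpdf(data, 0.0, sigma0).sum()

print("AIC model 1:", aic(logL1, 2))
print("AIC model 2:", aic(logL2, 1))  # choose the model with the smaller AIC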

BIC (Bayesian Information Criterion) is similar to AIC and is likewise used for model selection; it was proposed by Schwarz in 1978. When training a model, increasing the number of parameters, i.e., the complexity of the model, increases the likelihood, but it can also lead to overfitting. Both AIC and BIC address this by introducing a penalty term related to the number of model parameters; BIC's penalty is larger than AIC's because it also takes the number of samples into account (k ln(n) > 2k whenever n > e² ≈ 7.4). When the number of samples is large, this effectively prevents the model from becoming excessively complex just to squeeze out a little more accuracy.

BIC = -2 ln(L) + k ln(n)

where k is the number of model parameters, n is the number of samples, and L is the likelihood function. The k ln(n) penalty term can effectively avoid the curse of dimensionality when the dimensionality is high but the training sample is relatively small.
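Compared with AIC, only the penalty term changes; a minimal sketch with made-up log-likelihood values:

```python
import math

def bic(log_likelihood, k, n):
    """BIC = -2 ln(L) + k ln(n); the penalty grows with the sample size n."""
    return -2.0 * log_likelihood + k * math.log(n)

# Hypothetical comparison on n = 200 points: since ln(200) ≈ 5.3 > 2,
# each extra parameter costs more under BIC than under AIC.
print(bic(-350.2, 2, 200))  # richer model (k = 2), made-up log-likelihood
print(bic(-352.0, 1, 200))  # simpler model (k = 1), made-up log-likelihood
```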

Often, when modeling a set of data, especially with classification and regression models, we have many candidate variables, and different combinations of variables give different models. For example, with 5 variables there are 2^5 = 32 variable combinations, so 32 different models can be trained. But which model is better? Currently, the following criteria are commonly used:

AIC = -2 ln(L) + 2k (Akaike information criterion)

BIC = -2 ln(L) + ln(n)*k (Bayesian information criterion)

HQ = -2 ln(L) + 2*ln(ln(n))*k (Hannan-Quinn criterion)

where L is the maximized likelihood under the model, n is the number of data points, and k is the number of variables in the model.
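Transcribed directly into code (the log-likelihoods and parameter counts in the demo calls are made up):

```python
import math

def aic(logL, k):
    return -2 * logL + 2 * k

def bic(logL, k, n):
    return -2 * logL + math.log(n) * k

def hq(logL, k, n):
    return -2 * logL + 2 * math.log(math.log(n)) * k

# Hypothetical example: model A (logL = -1210, k = 3) vs. model B (logL = -1205, k = 5), n = 500.
# Note that the criteria need not agree on which model is best.
print("AIC:", aic(-1210.0, 3), aic(-1205.0, 5))
print("BIC:", bic(-1210.0, 3, 500), bic(-1205.0, 5, 500))
print("HQ: ", hq(-1210.0, 3, 500), hq(-1205.0, 5, 500))
```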

Note that these criteria only describe the information loss of a given model relative to the "true model" (since we do not know what the true model looks like, every trained model is only an approximation of it). They therefore say nothing about the absolute quality of a model. For three models A, B, and C, the criteria may tell us that B is the best of the three, but they cannot guarantee that B characterizes the data well: it is quite possible that all three models are poor, and B is merely the best apple in a barrel of rotten ones.

These criteria are elegant in theory, but applying them to model selection is still difficult in practice. We said above that 5 variables yield 32 combinations; what about 10 variables? That is 2^10 = 1024 models, and checking every one of them against the AIC, BIC, and HQ criteria one by one quickly becomes an unmanageable amount of work.
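For a small number of variables the brute-force search is still feasible. Below is a sketch, assuming ordinary least squares with Gaussian errors (so that -2 ln(L) reduces to n ln(RSS/n) up to a constant shared by all models; the data are made up):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))                     # hypothetical predictors
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=n)

def criteria(rss, k, n):
    # For Gaussian OLS, -2 ln(L) = n ln(RSS/n) + const, the const common to all models.
    neg2logL = n * np.log(rss / n)
    return {"AIC": neg2logL + 2 * k,
            "BIC": neg2logL + np.log(n) * k,
            "HQ":  neg2logL + 2 * np.log(np.log(n)) * k}

best = {}
for size in range(p + 1):                       # enumerate all 2^5 = 32 subsets
    for subset in itertools.combinations(range(p), size):
        Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
        beta, res, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = res[0] if res.size else float(np.sum((y - Xs @ beta) ** 2))
        for name, value in criteria(rss, Xs.shape[1], n).items():
            if name not in best or value < best[name][0]:
                best[name] = (value, subset)

for name, (value, subset) in best.items():
    print(name, "-> best subset:", subset, "score:", round(value, 1))
```

With 10 variables the same loop would already fit 2^10 = 1024 models, which illustrates the workload problem described above.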
