Machine Learning Notes 06---Maximum Likelihood Estimation

    A common strategy for estimating a class-conditional probability is to assume that it follows a probability distribution of a certain form, and then estimate the parameters of that distribution from the training samples. Specifically, denote the class-conditional probability of class c as P(x|c). Assuming P(x|c) has a fixed form that is uniquely determined by a parameter vector θc, our task is to estimate θc from the training set D. To make this dependence explicit, we write P(x|c) as P(x|θc).

    In fact, training a probability model is essentially a parameter estimation process. Two schools of thought in statistics offer different solutions to parameter estimation: the frequentist school holds that although the parameters are unknown, they are fixed values that exist objectively, so the parameter values can be determined by optimizing some criterion such as the likelihood function; the Bayesian school holds that the parameters are unobserved random variables that themselves follow a distribution, so one can assume a prior distribution over the parameters and then compute their posterior distribution from the observed data. This article introduces maximum likelihood estimation (MLE), also known as the maximum likelihood method, which originates from the frequentist school and is a classic method for estimating the parameters of a probability distribution from data samples.

    Let Dc denote the set of class-c samples in the training set D, and assume these samples are independent and identically distributed. Then the likelihood of the parameter θc with respect to the data set Dc is:

    P(Dc | θc) = ∏_{x∈Dc} P(x | θc)
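
To make the product form concrete, here is a minimal sketch (assuming, purely for illustration, a Bernoulli model for a single binary attribute; the function name and toy data are hypothetical) that evaluates the likelihood of a candidate θc as the product of per-sample probabilities:

```python
import numpy as np

def bernoulli_likelihood(samples, theta):
    """Likelihood P(Dc | theta) of a Bernoulli parameter theta,
    computed as the product of per-sample probabilities."""
    per_sample = np.where(samples == 1, theta, 1.0 - theta)
    return np.prod(per_sample)

# Toy class-c sample set Dc: five observations of one binary attribute.
D_c = np.array([1, 0, 1, 1, 0])
print(bernoulli_likelihood(D_c, 0.6))  # 0.6^3 * 0.4^2 = 0.03456
print(bernoulli_likelihood(D_c, 0.2))  # a worse fit, hence a smaller likelihood
```
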
 The maximum likelihood estimation of θc is to find the parameter value θ'c that maximizes the likelihood P(Dc|θc). Intuitively, maximum likelihood estimation searches among all possible values of θc for the one that makes the observed data most probable.

    However, the product in the formula above can easily cause numerical underflow, so the log-likelihood is usually used instead:

    LL(θc) = log P(Dc | θc) = Σ_{x∈Dc} log P(x | θc)
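
The underflow problem is easy to reproduce. The sketch below (sample size, seed, and parameter value are arbitrary illustrations) compares the raw product with the sum of logs:

```python
import numpy as np

rng = np.random.default_rng(0)
# 10,000 i.i.d. draws from a Bernoulli(0.7) "class-conditional" model.
samples = (rng.random(10_000) < 0.7).astype(int)
theta = 0.7

per_sample = np.where(samples == 1, theta, 1.0 - theta)

likelihood = np.prod(per_sample)             # product of 10,000 values below 1
log_likelihood = np.sum(np.log(per_sample))  # same information, summed in log space

print(likelihood)       # underflows to 0.0 in float64
print(log_likelihood)   # a finite negative number (roughly -6.1e3 here)
```
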
 At this point, the maximum likelihood estimate θ'c of the parameter θc is:

    θ'c = argmax_{θc} LL(θc)
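
As a sketch of this argmax in practice (again assuming a Bernoulli model; the grid search is just one simple way to maximize LL(θc), not the only one), the value found numerically coincides with the sample frequency, which is the closed-form MLE for this model:

```python
import numpy as np

def log_likelihood(samples, theta):
    """LL(theta) = sum of log P(x | theta) over x in Dc, for a Bernoulli model."""
    return np.sum(np.where(samples == 1, np.log(theta), np.log(1.0 - theta)))

rng = np.random.default_rng(0)
D_c = (rng.random(1_000) < 0.7).astype(int)

# Crude argmax over a grid of candidate parameter values.
grid = np.linspace(0.01, 0.99, 99)
theta_hat = grid[np.argmax([log_likelihood(D_c, t) for t in grid])]

print(theta_hat)   # close to the sample frequency ...
print(D_c.mean())  # ... which is the closed-form MLE in the Bernoulli case
```
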
    For example, for continuous attributes, assuming the probability density function p(x|c) ~ N(μc, σc²), the maximum likelihood estimates of the parameters μc and σc² are:

    μ'c = (1/|Dc|) Σ_{x∈Dc} x

    σ'c² = (1/|Dc|) Σ_{x∈Dc} (x − μ'c)(x − μ'c)^T
 That is to say, the mean of the normal distribution obtained by maximum likelihood is the sample mean, and the variance is the mean of (x − μ'c)(x − μ'c)^T, which is an intuitive result. For discrete attributes, class-conditional probabilities can be estimated in a similar way.
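
A quick numeric check of these two formulas (the toy 2-D Gaussian data below is an illustrative assumption, not from the source) using NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy class-c samples drawn from a 2-D normal distribution.
D_c = rng.multivariate_normal(mean=[1.0, -2.0],
                              cov=[[1.0, 0.3], [0.3, 0.5]],
                              size=5_000)

mu_hat = D_c.mean(axis=0)                      # sample mean
centered = D_c - mu_hat
sigma_hat = centered.T @ centered / len(D_c)   # mean of (x - mu')(x - mu')^T

print(mu_hat)      # close to [1.0, -2.0]
print(sigma_hat)   # close to [[1.0, 0.3], [0.3, 0.5]]

# The MLE divides by |Dc|; np.cov divides by |Dc| - 1 unless bias=True.
print(np.cov(D_c.T, bias=True))  # matches sigma_hat
```

Note that the maximum likelihood variance divides by |Dc| rather than |Dc| − 1, so it is the biased estimator.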

    It is worth noting that although this parametric approach makes class-conditional probability estimation relatively simple, the accuracy of the estimate depends heavily on whether the assumed probability distribution matches the underlying true data distribution. In practical applications, making an assumption that approximates the true underlying distribution well often requires some empirical knowledge about the application task itself; otherwise, assuming a distribution form based only on guesswork can easily produce misleading results.

Reference: Zhou Zhihua, "Machine Learning".

Source: blog.csdn.net/m0_64007201/article/details/127586880