Bayesian HPO basic process

First, let's set the HPO question aside and look at the following example. Suppose we know the expression of a function f(x) and the domain of its argument x, and we want to find the minimum value of f(x) over that domain. How would you go about solving for this minimum? Whether we approach the problem from pure mathematical theory or from machine learning, there are several popular ideas:

  • 1 Differentiate f(x), set the first derivative to 0, and solve for the minimum

Precondition: f(x) is differentiable and the resulting equation can be solved directly

  • 2 Iterate toward the minimum of f(x) with an optimization method such as gradient descent

Precondition: f(x) is differentiable and the function itself is convex

  • 3 Plug every x in the domain into f(x), compute all possible results, and take the smallest

Precondition: f(x) is relatively simple, the dimension of x is low, and the amount of computation is tolerable

When we know the expression of f(x), the methods above can often work, but each of them has its own preconditions. Now suppose f(x) is a smooth, well-behaved function, but it is extremely complex and non-differentiable, so none of the three methods above applies, yet we still want its minimum. What can we do? Because the function is so complicated, evaluating it at any single x takes a long time, so plugging in every x in the domain is out of the question. We can, however, randomly sample a few observation points and use them to get a sense of the overall trend of the function. So we randomly pick 4 points from the domain of x, plug them into f(x), and obtain the following results:

[Figure: the 4 randomly sampled observation points of f(x)]
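To make this setup concrete, here is a minimal sketch of the sampling step, assuming a made-up toy objective in place of the complex f(x) described above (the function body, domain bounds, and random seed are all illustrative choices, not part of the original example):

```python
import numpy as np

# Toy stand-in for an "expensive" objective: smooth but awkward to reason about.
def f(x):
    return np.sin(3 * x) + 0.5 * np.cos(7 * x) + 0.1 * x**2

rng = np.random.default_rng(42)
x_obs = rng.uniform(-3, 3, size=4)   # 4 points sampled at random from the domain
y_obs = f(x_obs)                     # the only expensive evaluations we allow ourselves

for xi, yi in zip(x_obs, y_obs):
    print(f"x = {xi:+.3f}  ->  f(x) = {yi:+.3f}")
```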

Ok, now with these 4 observations, can you tell me where the minimum of f(x) is? Where do you think the minimum might be? Most people would be inclined to say that the minimum lies very close to the smallest of the 4 observed f(x) values, but others think differently. With 4 observations in hand, and knowing that the function is fairly smooth and well-behaved, we might guess that its overall shape looks something like this:

[Figure: one possible guess of the function's overall shape through the 4 observation points]

Once we have a guess about the overall shape of the function, that guessed curve has a minimum somewhere. At the same time, different people may guess different overall shapes, and the minima corresponding to different guesses will differ as well.

[Figures: two other guesses of the function's shape, each with a different minimum]

Now, suppose we invite tens of thousands of people to guess, each person drawing one curve, as shown in the figure below. It is not hard to see that near the observation points everyone's guessed function values differ very little, while far away from the observation points the guesses disagree wildly. That is only natural: the behaviour of the function between observation points is completely unknown, and the farther we are from an observation point, the less sure we are about the true function value, so the range of guessed values becomes very wide.
[Figure: tens of thousands of guessed curves overlaid on the observation points]
Now we average all the guesses, and use a colored band to mark the range of potential function values around each mean, giving an averaged curve of everyone's guesses. The band essentially covers the upper and lower bounds of the values people guessed, and the larger the gap between those bounds at a given x, the less certain people are about the function's value at that position. The gap between the bounds therefore measures our confidence around each point: the wider the band, the lower the confidence.
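In code, the role of the "tens of thousands of guessers" is usually played by a probabilistic model. Below is a small sketch using a Gaussian process from scikit-learn, under the same toy-objective assumption as before: the posterior mean stands in for the averaged guess, and the posterior standard deviation for the width of the colored band.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy stand-in for the expensive objective (illustrative only).
def f(x):
    return np.sin(3 * x) + 0.5 * np.cos(7 * x) + 0.1 * x**2

rng = np.random.default_rng(42)
x_obs = rng.uniform(-3, 3, size=4).reshape(-1, 1)
y_obs = f(x_obs).ravel()

# The GP posterior plays the role of the crowd of guessers.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
gp.fit(x_obs, y_obs)

x_grid = np.linspace(-3, 3, 200).reshape(-1, 1)
mean, std = gp.predict(x_grid, return_std=True)   # averaged guess and its uncertainty

# Confidence is high (std small) near the observations, low far from them.
print("most uncertain location: x =", x_grid[np.argmax(std), 0])
```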

Near the observation points the confidence is always high, and far from them it is always low. So if we add an actual observation at a place where the confidence is low, we can quickly bring everyone's guesses into agreement. The figure below gives an example: once we take an actual observation inside an interval with very low confidence, the guesses around that interval immediately become concentrated, and the confidence there rises sharply.

When the confidence over the entire function is high, we can say that we have obtained a curve f* that is highly similar to the true f(x), and we can treat the minimum of f* as the minimum of the true f(x). Naturally, the more accurate the estimate, the closer f* is to f(x), and the closer the minimum of f* is to the true minimum of f(x). How, then, do we make f* closer to f(x)? From the confidence-raising process we just walked through, the answer is obvious: the more observation points we have, the closer our estimated curve gets to the real f(x). However, because the computational budget is limited, we have to choose each observation point very carefully. So, how should we choose observation points so that they help us estimate the minimum of f(x) as well as possible?

There are many ways to do this; the simplest is to use the frequency of the minimum. Since different people guess different overall shapes for the function, the minima under different guesses also differ. Based on each person's guessed curve, we evenly divide the domain on the x-axis into 100 small intervals, and whenever a guessed minimum falls into an interval, we add one count to that interval (exactly the same process as drawing a histogram for a discrete variable). After tens of thousands of people have guessed, we can draw a frequency plot over the intervals on the x-axis: the higher the frequency of an interval, the more people placed the minimum there, and vice versa. The frequency reflects, to some extent, the probability that the minimum occurs there: the higher-frequency intervals are the ones where the true minimum of the function is more likely to lie.

When the intervals on the x-axis are divided finely enough, the frequency plot turns into a probability density curve, and the point at which this curve attains its maximum is where the minimum of f(x) is most probable. So clearly we should take the point corresponding to the maximum of this curve as the next observation point. From the plot, the most likely interval for the minimum is around x = 0.7. Without any new observation, the most reliable minimum we currently have on f(x) is the point at x = 0.6; but if we take a new observation at x = 0.7, we are quite likely to find a value of f smaller than the one at x = 0.6. We therefore decide to make the next observation at x = 0.7.
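A rough sketch of this "frequency of the minimum" idea, again under the toy-objective assumption: each sample drawn from the surrogate's posterior plays the role of one person's guessed curve, and the histogram of where those samples attain their minima tells us where to observe next.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy stand-in for the expensive objective (illustrative only).
def f(x):
    return np.sin(3 * x) + 0.5 * np.cos(7 * x) + 0.1 * x**2

rng = np.random.default_rng(0)
x_obs = rng.uniform(-3, 3, size=4).reshape(-1, 1)
y_obs = f(x_obs).ravel()

gp = GaussianProcessRegressor(kernel=RBF(1.0), normalize_y=True).fit(x_obs, y_obs)

x_grid = np.linspace(-3, 3, 100).reshape(-1, 1)                  # 100 small intervals
samples = gp.sample_y(x_grid, n_samples=10000, random_state=0)   # 10,000 "guessed" curves
argmins = np.argmin(samples, axis=0)                             # where each guess is minimal

counts = np.bincount(argmins, minlength=len(x_grid))             # the frequency plot
x_next = x_grid[np.argmax(counts), 0]                            # most probable minimum location
print("next observation point:", round(float(x_next), 3))
```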

After taking the observation at x = 0.7, we have 5 known observations. Now we again let tens of thousands of people guess the overall function shape based on these 5 points, compute the interval where the current minimum occurs most frequently, take a new observation point there, and evaluate f(x). When the allowed number of evaluations has been used up (for example, 500), the whole estimation process stops.
Did you notice? Throughout this process we are continuously refining our estimate of the objective function f(x). We never evaluate f(x) over its entire domain, and we never pin down the one curve that must be the true distribution of f(x); but as we observe more and more points, our estimate of the function becomes more and more accurate, and so we are more and more likely to find the true minimum of f(x). This optimization process is Bayesian optimization.


In the mathematical process of Bayesian optimization, we mainly perform the following steps:

  • 1 Define the f(x) to be estimated and the domain of its argument x

  • 2 Take a limited number n of values of x and compute the f(x) corresponding to each of them (solve for the observations)

  • 3 Based on the limited observations, estimate the overall function (the assumptions made here are called prior knowledge in Bayesian optimization) and obtain the target value (maximum or minimum) of the estimate f*

  • 4 Define a certain rule to determine the next observation point that needs to be calculated

Then keep looping through steps 2-4 until the target value on the hypothesized distribution meets our standard, or all computational resources are used up (for example, a maximum of m observations, or at most t minutes of runtime). A minimal sketch of this loop is given below.
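Putting steps 1-4 together, here is a minimal, illustrative SMBO loop. It assumes a Gaussian-process surrogate and reuses the frequency-of-the-minimum rule from earlier as the acquisition rule; the toy objective and the budgets (n = 4 initial points, m = 20 evaluations in total) are arbitrary choices for demonstration, not fixed rules.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Step 1: define the (toy) objective and the domain of x.
def f(x):
    return np.sin(3 * x) + 0.5 * np.cos(7 * x) + 0.1 * x**2

rng = np.random.default_rng(1)
x_grid = np.linspace(-3, 3, 200)

# Step 2: take n = 4 initial observations at random.
x_obs = list(rng.uniform(-3, 3, size=4))
y_obs = [f(x) for x in x_obs]

for _ in range(16):                                   # up to m = 20 total observations
    # Step 3: fit the probabilistic surrogate to the observations so far.
    gp = GaussianProcessRegressor(kernel=RBF(1.0), alpha=1e-6, normalize_y=True)
    gp.fit(np.array(x_obs).reshape(-1, 1), np.array(y_obs))

    # Step 4: acquisition rule -- most frequent argmin over posterior samples.
    samples = gp.sample_y(x_grid.reshape(-1, 1), n_samples=500, random_state=0)
    counts = np.bincount(np.argmin(samples, axis=0), minlength=len(x_grid))
    x_next = x_grid[np.argmax(counts)]

    # Back to step 2: evaluate the expensive objective at the chosen point.
    x_obs.append(x_next)
    y_obs.append(f(x_next))

print("best x found:", round(float(x_obs[int(np.argmin(y_obs))]), 3),
      "with f(x) =", round(float(min(y_obs)), 3))
```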


The above process is also called Sequential Model-Based Optimization (SMBO), and it is the most classic form of Bayesian optimization. In actual use, and especially during hyperparameter optimization, the following details deserve attention:

  • When Bayesian optimization is not used for HPO, f(x) can in general be a complete black-box function (that is, we only know the correspondence between x and f(x), know nothing about the internal workings of the function, and cannot write down its expression), so Bayesian optimization is also regarded as a classic method for estimating black-box functions. In the HPO process, however, f(x) is usually a cross-validation result or a loss-function value, and we often know the expression of the loss function very well even though we do not understand its internal behaviour, so the f(x) in HPO is not a black-box function in the strict sense.

  • In HPO, the argument x is the hyperparameter space. In the two-dimensional pictures above, x is one-dimensional, but in actual optimization the hyperparameter space is usually a high-dimensional and extremely complex space.

  • The initial number of observations n and the final maximum number of observations m that can be obtained are hyperparameters of Bayesian optimization, and the maximum number of observations m also determines the number of iterations of the entire Bayesian optimization.

  • In step 3, the tool that estimates the function's distribution from limited observations is called the probabilistic surrogate model; in an actual computation we obviously cannot invite tens of thousands of people to connect our observation points. These surrogate models come with certain built-in assumptions, and based on a handful of observation points they can estimate the distribution of the objective's estimate f* (including the value of f* at each point and the confidence at that point). In practice, probabilistic surrogate models are usually powerful algorithms, the most common being Gaussian processes, Gaussian mixture models, and so on. Gaussian processes are the usual choice in traditional mathematical derivations, but the most popular optimization libraries today mostly default to the TPE (Tree-structured Parzen Estimator) approach based on Gaussian mixture models (a minimal library usage sketch follows after this list).

  • The rule used in step 4 to determine the next observation point is called the acquisition function. The acquisition function measures how much each candidate observation point would contribute to fitting f*, and we select the point with the greatest contribution for the next observation, so we usually focus on the point where the acquisition function is largest. The most common acquisition functions are probability of improvement (PI, similar to the frequency we computed above), expected improvement (EI), the upper confidence bound (UCB), and information entropy. PI, UCB, and EI are shown in the animation above; expected improvement is the default in most of these optimization libraries.
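As a concrete example of what this looks like with an actual library, here is a minimal sketch using hyperopt, whose default algorithm is TPE. The dataset, the model, and the search-space bounds are all placeholder choices; the point is only that f(x) here is the cross-validation loss of a set of hyperparameters.

```python
from hyperopt import fmin, tpe, hp, STATUS_OK
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def objective(params):
    # f(x): hyperparameters go in, the cross-validation loss comes out.
    model = RandomForestClassifier(
        n_estimators=int(params["n_estimators"]),
        max_depth=int(params["max_depth"]),
        random_state=0,
    )
    acc = cross_val_score(model, X, y, cv=5).mean()
    return {"loss": 1.0 - acc, "status": STATUS_OK}

# The hyperparameter space (the domain of x); bounds are illustrative.
space = {
    "n_estimators": hp.quniform("n_estimators", 50, 300, 25),
    "max_depth": hp.quniform("max_depth", 2, 12, 1),
}

# TPE chooses each new observation point; max_evals is the budget m.
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=30)
print(best)
```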

When using Bayesian optimization in HPO, we often see an image like the one below, which shows all the basic elements of Bayesian optimization. Our goal is to make f*, under the guidance of the acquisition function, as close as possible to f(x).


Origin: blog.csdn.net/weixin_44820355/article/details/125993058