Logistic Regression - Advanced Optimization

Abstract: This is the transcript of Lesson 51, "Advanced Optimization", from Chapter 7, "Logistic Regression", of Andrew Ng's Machine Learning course. I took it down while working through the video and lightly edited it to make it more concise and readable, for later reference. I'm sharing it here in the hope that it helps others with their studies; if you find any mistakes, corrections are welcome, and I thank you sincerely in advance!

In the last video, we talked about gradient descent for minimizing the cost function J(\theta ) for logistic regression. In this video, I'd like to tell you about some advanced optimization algorithms and some advanced optimization concepts. Using some of these ideas, we'll be able to get logistic regression to run much more quickly than is possible with gradient descent. And this will also let the algorithms scale much better to very large machine learning problems, such as problems with a very large number of features.

Here's an alternative view of what gradient descent is doing. We have some cost function J, and we want to minimize it. So what we need to do is write code that can take as input the parameters \theta and compute two things: J(\theta ) and the partial derivative terms \frac{\partial }{\partial \theta _{j}}J(\theta ) for j = 0, 1, up to n. Given code that can do these two things, what gradient descent does is repeatedly perform the following update. So given the code that we wrote to compute these partial derivatives, gradient descent plugs them in here and uses them to update the parameters. So another way of thinking about gradient descent is that we supply code to compute J(\theta ) and these derivatives, and these get plugged into gradient descent, which then tries to minimize the function for us. For gradient descent, I guess technically you don't actually need code to compute the cost function J(\theta ); you only need code to compute the derivative terms. But if you think of your code as also monitoring convergence or some such, we'll just think of ourselves as providing code to compute both the cost function and the derivative terms.
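For reference, the update being referred to is the standard gradient descent rule from the earlier videos:

Repeat \{ \theta _{j} := \theta _{j} - \alpha \frac{\partial }{\partial \theta _{j}}J(\theta ) \}, simultaneously updating all j = 0, 1, \ldots, n.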

So, having written code to compute these two things, one algorithm we can use is gradient descent. But gradient descent isn't the only algorithm we can use, and there are other algorithms, more advanced, more sophisticated ones, that, if we only provide them a way to compute these two things, will use different approaches to optimize the cost function for us. Conjugate gradient, BFGS and L-BFGS are examples of more sophisticated optimization algorithms that need a way to compute J(\theta ), and a way to compute the derivatives, and can then use more sophisticated strategies than gradient descent to minimize the cost function. The details of exactly what these three algorithms do are well beyond the scope of this course; in fact, you often end up spending many days, or a small number of weeks, studying these algorithms if you take a class in advanced numerical computing. But let me just tell you about some of their properties.

These three algorithms have a number of advantages. One is that, with any of these algorithms, you usually do not need to manually pick the learning rate \alpha. So one way to think of these algorithms is that, given a way to compute the cost function and the derivatives, they have a clever inner loop, called a line search algorithm, that automatically tries out different values for the learning rate \alpha and automatically picks a good one, so that it can even pick a different learning rate for every iteration; you then don't need to choose it yourself. These algorithms actually do more sophisticated things than just pick a good learning rate, and so they often end up converging much faster than gradient descent, but a detailed discussion of exactly what they do is beyond the scope of this course. In fact, I have actually used these algorithms for a long time, maybe over a decade, quite frequently, and it was only a few years ago that I actually figured out for myself the details of what conjugate gradient, BFGS and L-BFGS do. So it is actually entirely possible to use these algorithms successfully and apply them to lots of different learning problems without actually understanding the inner loop of what these algorithms do.

If these algorithms have a disadvantage, I'd say that the main disadvantage is that they're quite a lot more complex than gradient descent. And in particular, you should not implement these algorithms, conjugate gradient, BFGS, L-BFGS, yourself, unless you're an expert in numerical computing. Instead, just as I wouldn't recommend that you write your own code to compute square roots of numbers or to compute inverses of matrices, for these algorithms what I recommend you do is just use a software library. So to take a square root, what all of us do is use some function that someone else has written to compute the square roots of our numbers. And fortunately, Octave, and the closely related language MATLAB, which we'll be using, have a pretty reasonable library implementing some of these advanced optimization algorithms. And so if you just use the built-in library, you know, you get pretty good results. I should say that there is a difference between good and bad implementations of these algorithms.
And so if you're using a different language for your machine learning application, if you're using C, C++, Java, and so on, you might want to try out a couple of different libraries to make sure that you find a good library for implementing these algorithms. Because there is a difference in performance between a good implementation of conjugate gradient or L-BFGS and a less good implementation of conjugate gradient or L-BFGS.

So now let's explain how to use these algorithms. I'm going to do so with an example. Let's say that you have a problem with two parameters, \theta =\begin{bmatrix} \theta _{1}\\ \theta _{2} \end{bmatrix}, and let's say your cost function is J(\theta )=(\theta _{1}-5)^{2}+(\theta _{2}-5)^{2}. So with this cost function, if you want to minimize J(\theta ) as a function of \theta, the values that minimize it are going to be \theta _{1}=5, \theta _{2}=5. Now, again, I know some of you know more calculus than others, but the derivatives of the cost function J turn out to be these two expressions: \frac{\partial }{\partial \theta _{1}}J(\theta )=2(\theta _{1}-5) and \frac{\partial }{\partial \theta _{2}}J(\theta )=2(\theta _{2}-5). I've done the calculus. So suppose you want to apply one of the advanced optimization algorithms to minimize this cost function J. Say we didn't know the minimum was at 5, 5, but we want to find the minimum numerically, using something like gradient descent, or perhaps something more advanced than gradient descent. What you would do is implement an Octave function like this: a costFunction(theta) function that returns two values. The first, jVal, is how we compute the cost function J, and so this says jVal equals theta(1) minus five squared plus theta(2) minus five squared; it's just computing this cost function over here. The second value that this function returns is gradient. So gradient is going to be a 2×1 vector, and the two elements of the gradient vector correspond to the two partial derivative terms over here. Having implemented this cost function, you can then call the advanced optimization function called fminunc, which stands for function minimization unconstrained in Octave, and the way you call it is as follows. You set a few options; options is a data structure that stores the options you want. Setting 'GradObj' to 'on' sets the gradient objective parameter to on; it just means you are indeed going to provide a gradient to this algorithm. I'm going to set the maximum number of iterations to, let's say, one hundred. We're going to give it an initial guess for theta, which is a 2×1 vector. And then this command calls fminunc; the @ symbol represents a pointer to the cost function that we just defined up there. And if you call this, it will use one of the more advanced optimization algorithms. If you want, you can think of it as being just like gradient descent, but automatically choosing the learning rate \alpha for you so you don't have to do so yourself. It will then attempt to use the advanced optimization algorithms, like gradient descent on steroids, to try to find the optimal value of \theta for you. Let me actually show you what this looks like in Octave.
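For reference, here is a minimal sketch in Octave of the code being described here and demonstrated in the next paragraph (the function definition would live in its own file, costFunction.m):

    function [jVal, gradient] = costFunction(theta)
      % Cost: J(theta) = (theta_1 - 5)^2 + (theta_2 - 5)^2
      jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
      % The two partial derivatives of J with respect to theta_1 and theta_2
      gradient = zeros(2, 1);
      gradient(1) = 2 * (theta(1) - 5);
      gradient(2) = 2 * (theta(2) - 5);
    end

    options = optimset('GradObj', 'on', 'MaxIter', 100);
    initialTheta = zeros(2, 1);
    [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);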

So I've written this costFunction(theta) function exactly as we had it on the previous slide. It computes jVal, which is the cost function, and it computes the gradient, with the two elements being the partial derivatives of the cost function with respect to the two parameters, theta(1) and theta(2). Now let's switch to my Octave window, and I'm going to type in those commands I had just now. So, options = optimset(...); this is the notation for setting the options for my optimization algorithm: 100 iterations, and I'm going to provide the gradient to my algorithm. Then initialTheta = zeros(2,1); that's my initial guess for theta. And now I run [optTheta, functionVal, exitFlag] = fminunc, with a pointer to the cost function, my initial guess, and the options, like so. And if I hit enter, this will run the optimization algorithm, and it returns pretty quickly. This funny formatting is just because my command line wrapped around. But what this says is that, numerically (think of it as gradient descent on steroids), it found the optimal value for theta to be \theta _{1}=5, \theta _{2}=5, exactly as we were hoping for. The function value at the optimum is essentially 10^{-30}, so that's essentially 0, which is also what we were hoping for. And the exitFlag is 1, which shows the convergence status; if you want, you can do help fminunc to read the documentation for how to interpret the exit flag. The exit flag lets you verify whether or not this algorithm has converged.
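For reference, the Octave session in the video looks roughly like this; the output values are paraphrased from the video rather than reproduced exactly:

    octave:1> options = optimset('GradObj', 'on', 'MaxIter', 100);
    octave:2> initialTheta = zeros(2, 1);
    octave:3> [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options)
    optTheta =
       5.0000
       5.0000
    functionVal = 1.5777e-30
    exitFlag = 1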

So that's how you run these algorithms in Octave. I should mention, by the way, that for the Octave implementation, your parameter vector \theta must be in \mathbb{R}^{d} for d greater than or equal to 2. So if \theta is just a real number, that is, not at least a two-dimensional vector, then fminunc may not work. If you have a one-dimensional function you need to optimize, you can look in the Octave documentation for fminunc for additional details. So, that's how we optimize our trial example of this simple quadratic cost function. How do we apply this to logistic regression?


In logistic regression, we have a parameter vector \theta, and I'm going to use a mix of Octave notation and math notation, but I hope this explanation will be clear. Our parameter vector \theta comprises the parameters \theta _{0} through \theta _{n}. Because Octave indexes vectors starting from 1, \theta _{0} is actually written theta(1) in Octave, \theta _{1} is written theta(2), and \theta _{n} is written theta(n+1). So what we need to do is write a cost function that captures the cost function for logistic regression. Concretely, the cost function needs to return jVal, so you need some code to compute J(\theta ), and we also need to compute the gradient: gradient(1) is going to be some code to compute the partial derivative with respect to \theta _{0}, gradient(2) the partial derivative with respect to \theta _{1}, and so on. Once again, this is gradient(1), gradient(2), and so on, rather than gradient(0), gradient(1), because Octave indexes its vectors starting from 1 rather than from 0. But the main concept I hope you take away from this slide is that what you need to do is write a function that returns the cost function value and the gradient, as sketched below. And so in order to apply this to logistic regression, or even to linear regression if you want to use these optimization algorithms there, what you need to do is plug in the appropriate code to compute these things over here.
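To make this concrete, here is a sketch of what such a cost function might look like. This is my own illustration rather than code shown in the video, and it assumes X is the m×(n+1) design matrix (first column all ones) and y is the m×1 vector of 0/1 labels:

    function [jVal, gradient] = costFunction(theta, X, y)
      % Hypothesis: h = g(X * theta), with g the sigmoid function
      m = length(y);
      h = 1 ./ (1 + exp(-X * theta));
      % Logistic regression cost J(theta)
      jVal = (1 / m) * sum(-y .* log(h) - (1 - y) .* log(1 - h));
      % Vector of partial derivatives: gradient(j+1) = dJ/d(theta_j)
      gradient = (1 / m) * (X' * (h - y));
    end

Since fminunc expects a function of theta alone, you would wrap this in an anonymous function, for example fminunc(@(t) costFunction(t, X, y), initialTheta, options).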

So, now you know how to use these advanced optimization algorithms. Because for these algorithms you're using a sophisticated optimization library, the code is just a little bit more opaque, and so maybe a little bit harder to debug. But because these algorithms often run much faster than gradient descent, quite typically, whenever I have a large machine learning problem, I will use one of these algorithms instead of gradient descent. And with these ideas, hopefully, you'll be able to get logistic regression, and also linear regression, to work on much larger problems. So, that's it for advanced optimization concepts. In the next and final video on logistic regression, I want to tell you how to take the logistic regression algorithm that you already know about and make it work also on multi-class classification problems.

<end>


Reposted from blog.csdn.net/edward_wang1/article/details/105060693