Coursera - Andrew Ng (Wu Enda) Machine Learning Notes (Weeks 1-3)

Machine Learning Notes

       These notes mainly record the main content of the lectures, along with some code implementations, following the course schedule. Because I watch a section first and organize it afterwards, the content overlaps to some extent. As for the code, at first I only wrote up my homework; now I think it is worth reading the instructor's code carefully and learning it as a practical skill, so when sharing I also include the instructor's code.

Corrections and discussion are both welcome.

Week 1

  First of all, this course assumes a fairly modest mathematical background and leans toward application (in Andrew Ng's words, learning to use the tools like a master carpenter). It is an introductory course, focusing mainly on supervised and unsupervised learning without going into the theory behind the algorithms.

      The difference between supervised and unsupervised learning (as I understand it) is whether the true y for each example in the dataset is given during training. Supervised learning tells the model what output each training input should produce; unsupervised learning instead looks for the structure and regularities behind the input data. Typical supervised learning problems are regression and classification.

      Let's familiarize ourselves with some basic concepts through the relatively simple case of linear regression with a single variable, organized as follows:

Figure 1-1

      After introducing the above concepts, the course introduces the cost function (Cost Function), which measures the deviation between predicted and actual values. In linear regression the cost function is:

Figure 1-2: J(θ0, θ1) = (1/(2m)) · Σ_i (h_θ(x^(i)) - y^(i))^2

      Here, when there is only one independent variable, the hypothesis is h_θ(x) = θ0 + θ1·x. As mentioned above, this cost function is specific to linear regression; different models use cost functions of different forms, which is related to the gradient descent method (Gradient Descent) discussed next.

      Note that the independent variables of the cost function and of the hypothesis are not the same: in the hypothesis the independent variable is x, while in the cost function the independent variables are the parameters θ0 and θ1. Only once θ is fixed can we compute the predictions and then the corresponding cost; the unknowns we ultimately solve for are the parameters θ. This relationship must be kept clear.

      Now let's see how θ can be determined. Our ultimate goal is to make the difference between predicted and actual values as small as possible, and that difference is exactly what the cost function measures. So the objective is: minimize J(θ0, θ1) over θ0 and θ1.

      There are two methods. One is to take the derivatives of the cost function directly, set them to zero, and solve for the extremum; the other is gradient descent (Gradient Descent), which approaches the minimum through iteration.

      Although I don't understand the theory behind gradient descent, its general idea is clear from the intuitive picture (take the walking-downhill example Andrew Ng gave in class). Note, however, that when the function is complicated, gradient descent cannot guarantee finding the global optimum; it may converge to a local optimum, i.e. the result depends on the initial point. This also answers the earlier question about why the cost function takes different forms: because gradient descent cannot guarantee a global optimum, an ill-chosen cost function (one with many local minima) can yield a model that predicts poorly. (Convex optimization enters here.) In the linear model, the cost function is bowl-shaped, so the global optimum will definitely be found.

      Now let's look at the iterative update of the parameters in the linear regression model:

Figure 1-3: θ_j := θ_j - α · ∂J(θ0, θ1)/∂θ_j

      That is, written out for each parameter:

 

Figure 1-4: θ0 := θ0 - α·(1/m)·Σ_i (h_θ(x^(i)) - y^(i));  θ1 := θ1 - α·(1/m)·Σ_i (h_θ(x^(i)) - y^(i))·x^(i)  (both updated simultaneously)

      Here := denotes assignment. The feature x0 is fixed at 1, so the formula above also covers the bias term θ0. α denotes the learning rate, a parameter we have to choose; how to choose it belongs to Week 2's content, so I defer it there.
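A minimal sketch of this update rule (the course homework is in Octave; this is an equivalent Python/NumPy version with toy data and names of my own choosing):

```python
import numpy as np

def gradient_descent_1d(x, y, alpha=0.1, iters=1000):
    """Univariate linear regression h(x) = theta0 + theta1*x,
    with both parameters updated simultaneously each step."""
    theta0, theta1 = 0.0, 0.0
    m = len(x)
    for _ in range(iters):
        h = theta0 + theta1 * x                       # current predictions
        temp0 = theta0 - alpha * (1 / m) * np.sum(h - y)
        temp1 = theta1 - alpha * (1 / m) * np.sum((h - y) * x)
        theta0, theta1 = temp0, temp1                 # synchronous update
    return theta0, theta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x                                     # data lies exactly on y = 1 + 2x
t0, t1 = gradient_descent_1d(x, y)
```

With enough iterations the parameters approach the true line y = 1 + 2x.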

Week 2

  This week still covers linear regression: some techniques to note in practical applications, plus the related programming assignment.

  Let's start with feature scaling. In practice we often regress on features x with very different value ranges. When the ranges differ greatly (take the two-feature case as an example), the contours of the cost function (I call them that for convenience) form ellipses whose major and minor axes differ greatly (very flat). Such a shape increases the number of iterations needed, while circular contours need the fewest iterations (I don't understand the underlying principle very well).

Accordingly, if we scale the features before regressing, the computation becomes more efficient. Commonly used scaling methods are:

Figure 2-1: x_j := (x_j - μ_j) / s_j, where s_j is the standard deviation (or the range)

  Because the formulas from probability and statistics are hard for me to remember, what I use is the sample standard deviation; the instructor mentions that the range (max - min) also works.
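A sketch of this scaling in Python/NumPy (the homework itself is in Octave; the values below are made up in the spirit of the house-price exercise):

```python
import numpy as np

def feature_normalize(X):
    """Scale each column to zero mean and unit variance
    (mean normalization; the range max - min could be used
    in place of the standard deviation)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Columns: house size, number of bedrooms (very different ranges)
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])
X_norm, mu, sigma = feature_normalize(X)
```

After scaling, both columns have mean 0 and standard deviation 1, so the cost contours are closer to circles.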

  Next, the choice of the learning rate. You can plot the cost function against the number of iterations to see whether the learning rate is appropriate: with a suitable rate, the cost decreases steadily as the iterations proceed.

If the learning rate is too small, convergence is slow and many iterations are needed; if it is too large, the cost function may fail to converge. The iteration-versus-cost plot just described tells you which situation you are in.
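To see both situations numerically, here is a Python/NumPy sketch (toy data of my own) comparing a suitable learning rate with one that is too large:

```python
import numpy as np

def cost_history(x, y, alpha, iters=50):
    """Run univariate gradient descent and record J(theta) each iteration."""
    theta0 = theta1 = 0.0
    costs = []
    for _ in range(iters):
        h = theta0 + theta1 * x
        theta0, theta1 = (theta0 - alpha * np.mean(h - y),
                          theta1 - alpha * np.mean((h - y) * x))
        h = theta0 + theta1 * x
        costs.append(np.mean((h - y) ** 2) / 2)       # J after the update
    return costs

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
good = cost_history(x, y, alpha=0.1)   # cost keeps decreasing
bad = cost_history(x, y, alpha=1.0)    # too large: cost blows up
```

Plotting `good` and `bad` against the iteration index gives exactly the diagnostic curves described above.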

  The next topic is polynomial regression. (A question that nagged me before starting: will the regression run into a multicollinearity problem? I have nearly forgotten the applied statistics I studied a year ago, so I don't dare say hello to Mr. Hu Ping; well, even if I had studied well, I wouldn't say hello to Mr. Hu Ping.) We discussed linear regression before, but nonlinear problems also arise in practice; we can add power terms to improve the model's applicability, and still use the cost function and update rule described above to compute the parameters.
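A quick Python/NumPy sketch of the idea (my own toy data; for brevity I solve the least-squares problem directly instead of iterating):

```python
import numpy as np

# Fit y = 2 + 3*x^2 by adding a squared feature to an
# otherwise ordinary linear regression.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x ** 2

# Design matrix with columns 1, x, x^2: the model stays linear
# in the parameters even though it is nonlinear in x.
X = np.column_stack([np.ones_like(x), x, x ** 2])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)         # least-squares fit
```

The recovered parameters are approximately (2, 0, 3), matching the generating quadratic.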

      Because the linear algebra involved and the normal equation are relatively simple, or because I can't explain them clearly (e.g. the pseudoinverse), I skip them here, and below I share my thoughts on writing the code.
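For reference anyway, a minimal sketch of the normal equation in Python/NumPy (np.linalg.pinv plays the role of Octave's pinv; the data is a toy example of my own):

```python
import numpy as np

def normal_equation(X, y):
    """theta = pinv(X'X) X' y. Using the pseudoinverse keeps this
    well-defined even when X'X is singular (e.g. redundant features)."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])               # first column is the added x0 = 1
y = np.array([1.0, 3.0, 5.0])            # exactly y = 1 + 2x
theta = normal_equation(X, y)
```

No learning rate and no iteration: the parameters come out in one step.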

      Because Word is not convenient for typing code, I paste the code and comments in directly.

      Read in the data file and slice it to separate the independent and dependent variables; m denotes the number of samples. Then plot the data.

 

Figure 2-2

      The code below appends a column vector of all 1s to the original matrix X (according to my personal understanding, I call this creating the feature), creates an initial θ of all 0s (note the difference between [ ] and ( ) when creating them), and then calls the cost function.

Figure 2-3

  Inside the cost function I used a loop statement, which may be the clumsy way to write it; it is more convenient to use matrix operations to form a column vector and then sum its elements.

Figure 2-4
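The matrix-operation version just mentioned might look like this in Python/NumPy (a sketch; the homework itself is Octave):

```python
import numpy as np

def compute_cost(X, y, theta):
    """Vectorized cost J(theta) = (1/2m) * sum((X@theta - y)^2),
    replacing the per-example loop with one matrix product."""
    m = len(y)
    err = X @ theta - y
    return (err @ err) / (2 * m)

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])               # x0 = 1 column already added
y = np.array([1.0, 2.0, 3.0])
J_zero = compute_cost(X, y, np.zeros(2))          # cost at theta = 0
J_fit = compute_cost(X, y, np.array([0.0, 1.0]))  # cost at a perfect fit
```

At θ = (0, 1) the line passes through every point, so the cost is exactly zero.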

  With the cost function defined, the gradient descent function is called to run the iterations from the given initial parameter values.

 

Figure 2-5

  This is the gradient descent function I defined. Here I sum the elements of a column vector directly instead of writing a loop. Note that the parameter updates are synchronous.

 

Figure 2-6
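An equivalent vectorized version in Python/NumPy (a sketch with toy data; in the homework this is Octave):

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, iters):
    """Vectorized batch gradient descent. The whole theta vector is
    updated in one expression, so the update is automatically
    synchronous across parameters."""
    m = len(y)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m   # all partial derivatives at once
        theta = theta - alpha * grad
    return theta

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])                 # x0 = 1 column added
y = np.array([1.0, 3.0, 5.0, 7.0])         # exactly y = 1 + 2x
theta = gradient_descent(X, y, np.zeros(2), alpha=0.1, iters=1500)
```

The result approaches the same (1, 2) the normal equation would give.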

      Draw on the original figure and add a title.

 

Figure 2-7

      Draw the image of the cost function. linspace is similar to NumPy's np.linspace: it generates a specified number of evenly spaced values over a specified interval.

Figure 2-8

      The code for drawing the 3-D surface of the cost function is nothing special. Because the surf function is used, the cost matrix needs to be transposed first; then label the axes and draw the contour plot.

 

Figure 2-9

      Since the following multivariate exercise is similar to this one, I won't post its code.

Week 3

  This week is mainly about logistic regression and the classification problem.

      The content is introduced through a case of judging the nature of a tumor (a categorical variable). Unlike before, where the dependent variable of linear regression was continuous, this time the dependent variable is categorical.

      If we use a linear model for this regression, it will produce invalid predictions (e.g. y > 1), and the model will be very sensitive to outliers (I'm not sure how to understand this point).

      To fix this, we use the sigmoid function, which compresses the output into [0, 1], namely

Figure 3-1: g(z) = 1 / (1 + e^(-z))

      Its function image is:

 

Figure 3-2

      This gives us the logistic hypothesis:

Figure 3-3: h_θ(x) = g(θᵀx) = 1 / (1 + e^(-θᵀx))

      The output of the hypothesis can be read as the probability that the label is 1 given the parameters θ and the data x. If we take 0.5 as the threshold, then whenever h_θ(x) ≥ 0.5 we predict y = 1, otherwise y = 0. Since h_θ(x) ≥ 0.5 exactly when θᵀx ≥ 0, setting θᵀx = 0 gives the decision boundary.

      Knowing the hypothesis, the next step is to define the cost function. If we kept the linear-regression definition, i.e. the mean squared error, the cost function would be non-convex, shaped like this:

Figure 3-4

      Because of this, we need a new, convex cost function. The one used is as follows (personally I believe this definition is the cross entropy):

 Figure 3-5: J(θ) = -(1/m) Σ_i [ y^(i)·log h_θ(x^(i)) + (1 - y^(i))·log(1 - h_θ(x^(i))) ]
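A small numerical illustration of this cost (a Python/NumPy sketch of my own; the course uses Octave). Note how a confident wrong prediction is punished far more heavily than an unsure one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """Cross-entropy cost for logistic regression:
    J = -(1/m) * sum(y*log(h) + (1-y)*log(1-h)), h = sigmoid(X @ theta)."""
    m = len(y)
    h = sigmoid(X @ theta)
    return (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m

# One positive example (y = 1) with a single feature x = 1 (no bias, for brevity):
X = np.array([[1.0]])
y = np.array([1.0])
unsure = logistic_cost(np.array([0.0]), X, y)    # h = 0.5  -> J = log(2)
wrong = logistic_cost(np.array([-5.0]), X, y)    # h ~ 0.007 -> J is large
```

The cost at h = 0.5 is log 2 ≈ 0.69, while a confidently wrong h ≈ 0.007 costs more than 5; this steep penalty is what makes the function convex and well-behaved for gradient descent.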

      Knowing the cost function, the update rule follows:

 

Figure 3-6: θ_j := θ_j - α·(1/m)·Σ_i (h_θ(x^(i)) - y^(i))·x_j^(i)

 

Figure 3-7

      Octave offers some advanced optimization routines; we only need to write out the cost and its partial derivatives.

      The usage method in Octave is:

 

Figure 3-8

 

Figure 3-9
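As a rough Python analogue of the Octave usage shown above, scipy.optimize.minimize accepts a function returning both the cost and its gradient (jac=True), much like fminunc with the gradient option. This is only an illustrative sketch with a toy quadratic cost of my own:

```python
import numpy as np
from scipy.optimize import minimize

def cost_function(theta):
    """Return both J and its gradient, like an Octave costFunction
    handed to fminunc (here a simple quadratic for illustration)."""
    c = np.array([1.0, 3.0])
    J = 0.5 * np.sum((theta - c) ** 2)
    grad = theta - c
    return J, grad

# jac=True tells minimize that cost_function returns (J, grad).
res = minimize(cost_function, x0=np.zeros(2), jac=True, method='BFGS')
```

The optimizer picks the step sizes itself, so no learning rate needs to be chosen by hand.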

      We will come across this usage again in later exercises. The classification discussed above is binary; what should we do when facing a multi-class problem? The instructor gave the one-vs-all method: reduce the problem to "this class or not" (1 versus 0), compute the probability, do the same for every other class in turn, and take the class with the largest probability as the prediction.
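A small sketch of the one-vs-all prediction step in Python/NumPy (the parameters below are hand-picked toy values, not learned; since the sigmoid is monotonic, taking the argmax of the linear scores picks the same class as the argmax of the probabilities):

```python
import numpy as np

def predict_one_vs_all(all_theta, X):
    """Each row of all_theta is one binary 'this class or not'
    classifier; predict the class whose score (and hence whose
    probability) is largest."""
    scores = X @ all_theta.T             # one column of scores per class
    return np.argmax(scores, axis=1)

# Hand-picked toy parameters for 3 classes over features (bias, x):
all_theta = np.array([[ 4.0, -2.0],      # class 0 fires for small x
                      [-1.0,  1.0],      # class 1 fires for middling x
                      [-6.0,  2.5]])     # class 2 fires for large x
X = np.array([[1.0, 0.0],
              [1.0, 2.0],
              [1.0, 5.0]])               # three examples, x0 = 1 added
pred = predict_one_vs_all(all_theta, X)
```

For x = 0, 2, 5 the classifiers with the highest scores are classes 0, 1, 2 respectively.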

      The next topic is regularization. When we have many features, the model becomes especially prone to overfitting: it performs well on the training set, but its predictions on the test set are poor.

      There are two ways to deal with overfitting: one is to delete some features, reducing their number; the other is regularization (my understanding: impose a penalty on the parameters so that they cannot grow too large). Here is an example of regularization in linear regression:

 

Figure 3-10

      Note that the regularization sum runs from j = 1, i.e. θ0 is not penalized. When there are many parameters (features), we may not know which ones to regularize; the practical approach is to regularize all of them (except θ0).

      The above is the regularized cost function for linear regression. For logistic regression, the regularized cost function is:

Figure 3-11: J(θ) = -(1/m) Σ_i [ y^(i)·log h_θ(x^(i)) + (1 - y^(i))·log(1 - h_θ(x^(i))) ] + (λ/(2m)) Σ_{j≥1} θ_j^2

      Then its iteration formula is:

Figure 3-12: θ_j := θ_j - α·[ (1/m)·Σ_i (h_θ(x^(i)) - y^(i))·x_j^(i) + (λ/m)·θ_j ]  (for j ≥ 1)

      Note that θ0 is not included in the penalty. Below I share the code from my homework.

      First, augment X with the column of 1s, generate the initial θ, and call the cost function.

 

Figure 3-13

      Before writing the cost function, write the sigmoid function first. I wrote it using elementwise matrix operations, so a matrix or vector input simply has the sigmoid applied to every element.

 

Figure 3-14
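A Python/NumPy version of the same idea (elementwise, so any array shape works unchanged):

```python
import numpy as np

def sigmoid(z):
    """Elementwise sigmoid: works on scalars, vectors, and matrices
    alike, because NumPy broadcasts the operations over every element."""
    return 1.0 / (1.0 + np.exp(-z))

g = sigmoid(np.array([[-100.0, 0.0],
                      [   0.0, 100.0]]))   # matrix input, matrix output
```

Large negative inputs map near 0, large positive inputs near 1, and 0 maps to exactly 0.5.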

      Write the cost function and define the partial-derivative (gradient) part.

 

Figure 3-15
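For comparison, a Python/NumPy sketch of what this function computes (the actual homework is Octave; function and variable names here are my own, and the data is a toy example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_function_reg(theta, X, y, lam):
    """Regularized logistic-regression cost and gradient.
    theta[0] (the bias term) is deliberately not penalized."""
    m = len(y)
    h = sigmoid(X @ theta)
    J = (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m
    J += lam / (2 * m) * np.sum(theta[1:] ** 2)      # skip theta_0
    grad = X.T @ (h - y) / m
    grad[1:] += lam / m * theta[1:]                  # skip theta_0
    return J, grad

X = np.array([[1.0,  0.5],
              [1.0, -0.5],
              [1.0,  1.5],
              [1.0, -1.5]])                          # x0 = 1 column added
y = np.array([1.0, 0.0, 1.0, 0.0])
J, grad = cost_function_reg(np.zeros(2), X, y, lam=1.0)
```

At θ = 0 every prediction is 0.5, so the cost is log 2 and the penalty contributes nothing; returning (J, grad) as a pair mirrors the [J, grad] return used with the advanced optimizers.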

      After defining the loss function and partial derivative, use the advanced optimization mentioned earlier to iterate and calculate the parameters.

 Figure 3-16

  Then call the function that draws the decision boundary.

Figure 3-17

  In the plotDecisionBoundary function, first draw a scatter plot of the data, then branch on the number of features. When there are at most 3 features (counting the added x0 column), the decision boundary is drawn directly as a straight line.

Figure 3-18
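A sketch of the line computed in that branch (Python/NumPy; the θ values are made up for illustration): on the boundary the argument of the sigmoid is zero, so x2 can be solved in terms of x1:

```python
import numpy as np

theta = np.array([-6.0, 1.0, 2.0])   # hypothetical fitted parameters (theta0, theta1, theta2)
x1 = np.linspace(0.0, 10.0, 5)       # evenly spaced points, like Octave's linspace

# The decision boundary is theta0 + theta1*x1 + theta2*x2 = 0;
# solving for x2 gives the straight line that gets plotted:
x2 = -(theta[0] + theta[1] * x1) / theta[2]

score = theta[0] + theta[1] * x1 + theta[2] * x2   # should be 0 on the boundary
```

Every (x1, x2) pair produced this way has θᵀx = 0, i.e. a predicted probability of exactly 0.5.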

  When the number of features is greater than 3, contour lines are drawn instead.

Figure 3-19

     Here mapFeature is the function used in the second exercise to add (map) the extra features.
