Wu Enda (Andrew Ng) Machine Learning notes supplement (weeks 3-8 and 11)

Andrew Ng's Machine Learning course has already been summarized in detail and thoughtfully by others, so for ease of lookup I only write down an outline, points worth attention, and occasional ideas here; weeks 1 and 2 are not supplemented.

Portal:
Week 3
Week 4
Week 5
Week 6
Week 7 (I wrote it but didn't save it, FML)
Week 8
Week 9 (I wrote the first half but didn't save it, so I'm not rewriting it; if you want it, go read other people's notes)
Week 10 (I watched it while traveling and don't remember much; just look at the slides)
Week 11

week 3

Classification and Representation – Classification
To attempt classification, one method is to use linear regression and map all predictions greater than 0.5 as a 1 and all less than 0.5 as a 0. However, this method doesn't work well because classification is not actually a linear function.
In other words, a linear model is a poor fit for binary (0/1) classification: the 0/1 target is not a linear function of the features, and the fitted line (and its 0.5 threshold) is easily distorted by the data.


Logistic Regression Model – Simplified Cost Function and Gradient Descent
The logistic regression cost function J is different from linear regression's J:
insert image description here
The main reason is that h is the nonlinear sigmoid, so the squared-error cost built from it is not a convex function of θ (the lecture doesn't prove this, it simply asserts non-convexity):
insert image description here
Running gradient descent on that directly could get stuck in a local optimum, so J is given a different form. Taking the derivative of this new J with respect to θ, the update turns out to have the same form as linear regression's:
insert image description here
Of course, only the expressions look the same, because both J and h differ from the linear case.
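
For reference, the hypothesis, the re-formed cost, and its gradient (whose form matches linear regression's) are:

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1+e^{-\theta^T x}}$$

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]$$

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$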


Logistic Regression Model – Advanced Optimization
Octave has some built-in optimization functions (fminunc is the one used in the lecture) that are very convenient:
insert image description here
Jotting it down so I don't forget.
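
The lecture does this in Octave with fminunc; a rough Python analogue is sketched below (scipy.optimize.minimize is assumed as the stand-in, and the cost function returns both the cost and the gradient, like the costFunction passed to fminunc):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    # Unregularized logistic regression cost J(theta) and its gradient.
    m = len(y)
    h = sigmoid(X @ theta)
    J = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m
    return J, grad

# Toy data; X already contains the bias column x0 = 1.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 1.0, 0.0, 1.0])

# jac=True tells minimize that cost_and_grad returns (cost, gradient).
res = minimize(cost_and_grad, x0=np.zeros(X.shape[1]),
               args=(X, y), jac=True, method="BFGS")
print(res.x)  # the learned theta
```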


Multiclass Classification – One-vs-all
Here we talk about the multi-class classification problem and the one-vs-all method:
insert image description here
y goes from a simple 0/1 label to classes 0~n, and the hypothesis changes accordingly: the superscript 0~n on h corresponds to y = 0~n. After watching this I still didn't know how to write the code, so working it out on paper: θ should not be a single vector but a matrix, for example

$$\Theta = \begin{bmatrix} \theta_0^{(0)} & \theta_1^{(0)} & \theta_2^{(0)} \\ \theta_0^{(1)} & \theta_1^{(1)} & \theta_2^{(1)} \end{bmatrix}$$

This Θ is 2×3. The superscript is the same superscript as on h, i.e. which class's hypothesis (the probability of that y) the row belongs to; the subscript indexes the parameters in $\theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2$. Using matrix multiplication, this is simply $h_\theta(x) = g(\Theta x)$ (with x set up as a 3×1 column vector and g the sigmoid applied element-wise); the only difference is that there are now multiple hypotheses $h_\theta^{(0)}, \dots, h_\theta^{(n)}$, one per class.

To predict, plug x into each h, see which one is largest, and choose the y that that h represents.
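
A minimal sketch of that prediction step (names are illustrative; Theta is assumed to hold one row of parameters per class, and g is the sigmoid):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(Theta, x):
    # Theta: (num_classes, n+1), one classifier per row.
    # x: (n+1,) feature vector including x0 = 1.
    probs = sigmoid(Theta @ x)       # one h (probability) per class
    return int(np.argmax(probs))     # the class whose h is largest

Theta = np.array([[ 1.0, -2.0,  0.5],   # parameters of the classifier for class 0
                  [-1.0,  2.0, -0.5]])  # parameters of the classifier for class 1
x = np.array([1.0, 0.3, 1.2])
print(predict_one_vs_all(Theta, x))
```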


Solving The Problem of Overfitting – Regularized Linear Regression
insert image description here
Here $J(\theta)$ has a minus sign... in fact the negative sign already appeared in the earlier formula (the in-video exercise was done wrong):
insert image description here
insert image description here
Regularization then simply adds the term in the magenta box onto the end.
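
Written out, the two regularized costs from this week (the added regularization term is what the magenta box highlights; j starts at 1, so $\theta_0$ is not penalized, and the leading minus sign belongs to the logistic one):

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right] \quad \text{(linear regression)}$$

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 \quad \text{(logistic regression)}$$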


week 4

Neural Networks – Model Representation I
“theta” parameters are sometimes called “weights”; earlier in the course they were simply called “parameters”.

Neural Networks – Model Representation II
insert image description here
Note the indices: in $\Theta^{(2)}_{10}, \Theta^{(2)}_{11}, \Theta^{(2)}_{12}, \Theta^{(2)}_{13}$, the superscript (2) means these are the parameters of layer (2), feeding output unit 1 of layer (2)+1 = (3), and the second subscript x says which input of layer (2) the weight multiplies (in the figure the superscript (2) is written a bit like (1)). Clearly the subscript pair "1x" names the output first and then the input, which I'm not used to; I'd find input-then-output more natural. However, the dimension (rows by columns) of Θ is 1 × max(x), where 1 is the number of nodes in the output layer and max(x) is the number of nodes in the input layer, which matches exactly, and that is the reason for writing the output index before the input index.
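
Concretely, those four weights feed unit 1 of layer 3:

$$a^{(3)}_1 = g\left(\Theta^{(2)}_{10}a^{(2)}_0 + \Theta^{(2)}_{11}a^{(2)}_1 + \Theta^{(2)}_{12}a^{(2)}_2 + \Theta^{(2)}_{13}a^{(2)}_3\right)$$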


Application – Examples and Intuitions II
Here we want the neural network to compute the formula x1 XNOR x2. XNOR is NOT XOR, the negation of XOR, i.e. the output is 1 when the inputs are the same: 1 and 1 give 1, and 0 and 0 give 1. It can therefore be written as ( x1 AND x2 ) OR ( (NOT x1) AND (NOT x2) ).
The three figures in the first row below implement the pieces of that formula: a1 represents the AND function, a2 represents the (NOT x1) AND (NOT x2) function, and $h_\theta$ represents the OR function. Then, in the left picture of the second row, the neural network combines these three sets of weights, marked in the corresponding colours, which is pretty neat :)
insert image description here
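
A small sketch of the combined network, using the weight values shown in the lecture figures (bias weight first; the weights are large enough that the sigmoid saturates to roughly 0 or 1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def xnor_net(x1, x2):
    x = np.array([1.0, x1, x2])                  # input with bias unit
    a1 = sigmoid(np.array([-30, 20, 20]) @ x)    # x1 AND x2
    a2 = sigmoid(np.array([10, -20, -20]) @ x)   # (NOT x1) AND (NOT x2)
    a = np.array([1.0, a1, a2])                  # hidden layer with bias unit
    h = sigmoid(np.array([-10, 20, 20]) @ a)     # a1 OR a2
    return round(h)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xnor_net(x1, x2))          # 1 when the inputs match, else 0
```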


week 5

Cost Function
In the regularization part, after the square brackets, we must account for multiple theta matrices. The number of columns in our current theta matrix is equal to the number of nodes in our current layer (including the bias unit). The number of rows in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit). As before with logistic regression, we square every term.
In other words, $\Theta^{(l)}$ is an a×b matrix, where a is the number of nodes in the next layer and b is the number of nodes in the current layer plus the bias unit (that extra 1); the bias column itself is not included in the regularization sum.


Backpropagation Algorithm
insert image description here
The derivation of the δ terms in BP here follows the "zero-basics deep learning" tutorial series;
note that $\delta^{(1)}$ is not computed by back-propagation (the input layer has no error term).
insert image description here
I didn't understand how this Δ (capital delta) accumulator gets updated.
In the end I felt I still hadn't fully got it, so I went to read someone else's derivation:
Derivation and intuitive diagram of Backpropagation algorithm
Quote a sentence:

The purpose of the BP algorithm is to provide gradient values to the optimization routine (gradient descent or other advanced optimization methods), i.e. to use BP to compute the partial derivative of the cost function with respect to each parameter, $\frac{\partial}{\partial \Theta^{(l)}_{ij}} J(\Theta)$, with the accumulated values stored in the matrices $\Delta^{(l)}$.
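
For reference, the per-example update from the lectures (this is exactly the "how is Δ updated" question above), with the regularized gradient written consistently with the $\frac{\lambda}{2m}$ term in J:

$$\delta^{(L)} = a^{(L)} - y, \qquad \delta^{(l)} = \left(\Theta^{(l)}\right)^T\delta^{(l+1)} \odot g'\left(z^{(l)}\right), \quad l = L-1,\dots,2$$

$$\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}\left(a^{(l)}\right)^T, \qquad \frac{\partial}{\partial\Theta^{(l)}_{ij}}J(\Theta) = D^{(l)}_{ij} = \begin{cases}\frac{1}{m}\Delta^{(l)}_{ij} + \frac{\lambda}{m}\Theta^{(l)}_{ij} & j\neq 0\\ \frac{1}{m}\Delta^{(l)}_{ij} & j = 0\end{cases}$$

where $g'(z^{(l)}) = a^{(l)} \odot (1-a^{(l)})$ and $\odot$ is the element-wise product (written .* in the lectures). The recursion stops at layer 2, which is why $\delta^{(1)}$ never gets a BP update.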


week 6

insert image description here
The d in this section refers to the degree of the polynomial, i.e. its highest power (not counting $\theta_0$).


This week is about using a few methods to choose λ and d.
insert image description here
I didn't understand the quiz in this section very well, so I looked at other people's notes: insert image description here
My take: θ (together with its degree d) is the one that was selected on the cross-validation set, so $J_{cv}$ for that model is an optimistic estimate and will generally look better than $J_{test}$ (??). It's a bit muddled, but the point is that θ and d were chosen to minimize the cv error, so $J_{cv}$ tends to come out lower than $J_{test}$.
insert image description here
insert image description here
Exercise: I made a silly mistake here. The last option, "try a smaller set of features", addresses high variance, i.e. it means "use fewer features".

To sum up a few questions:

  1. How to split the data to get $E_{train}$ and $E_{cv}$ (see the sketch after this list):
    insert image description here
    That is (let the highest polynomial degree be m): on the training set, fit the optimal $\theta^{(1)}, \dots, \theta^{(m)}$ for polynomial degrees d = 1~m; then use the cv-set data to pick, among those m candidates, the θ with the smallest error, written $\theta^{(d)}$, where d is the degree with the smallest error; finally put that θ into the test set to evaluate it:
    insert image description here

  2. Why split off $E_{cv}$ and $E_{test}$ at all: mainly to find out where the algorithm falls short. With $E_{train}$ and $E_{cv}$ you can tell whether the problem is high bias (underfitting) or high variance (overfitting):
    as a function of d (the highest polynomial degree):
    insert image description here
    as a function of N (the number of samples):
    insert image description here
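
A minimal sketch of that procedure on toy data (plain polynomial least squares stands in for the course's regularized gradient descent; everything here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data split into train / cross-validation / test (60 / 20 / 20).
x = rng.uniform(-3, 3, 100)
y = 0.5 * x**2 - x + rng.normal(0, 0.5, 100)
x_tr, y_tr = x[:60], y[:60]
x_cv, y_cv = x[60:80], y[60:80]
x_te, y_te = x[80:], y[80:]

def sq_error(theta, xs, ys):
    # Average squared error, with the course's 1/2 factor.
    return np.mean((np.polyval(theta, xs) - ys) ** 2) / 2

# Fit theta^(1) .. theta^(m) on the training set, one per degree d.
m = 8
thetas = [np.polyfit(x_tr, y_tr, d) for d in range(1, m + 1)]

# Pick the degree whose theta has the smallest cross-validation error.
cv_errors = [sq_error(t, x_cv, y_cv) for t in thetas]
best = int(np.argmin(cv_errors))
print("chosen degree d =", best + 1)

# Report the generalization error of that model on the test set.
print("test error:", sq_error(thetas[best], x_te, y_te))
```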


Error Analysis
This section talks about error analysis:
insert image description here
First, build a quick and simple algorithm (even if it doesn't work very well, that's fine), then plot learning curves to judge whether more data or more features are needed, and so on. Then start error analysis: manually look through the misclassified examples, pick out what they have in common, and use that to decide whether to add new features or adjust existing ones. This exposes the strengths and weaknesses of your algorithm and points you toward concrete improvements; with a little effort the direction becomes clear, instead of optimizing blindly with no direction.
insert image description here
For example, for spam classification, look at the misclassified emails:
(i) what type of email is being misclassified;
(ii) what common features the misclassified emails share.

insert image description here
Of course, for something like whether to stem words, the best way is to compare the error rate with and without stemming, so you know whether it actually helps.
Finally, these error rates should be checked on the cv set, not on the test set.
To sum up: the first version of the algorithm should be simple. After identifying improvements with the methods above, try the various new ideas on the cv set. Don't design a very complicated algorithm from the start; it will be painful to modify and the direction may be off.
insert image description here


Suppose y = 1 means the patient has cancer, and we know y = 1 is very rare. Then plain accuracy tells us little about whether our algorithm has really improved (this is the skewed classes problem). For example, if our algorithm is 1% wrong but in reality only 0.5% of patients have y = 1, then the trivial rule that always predicts y = 0 is only 0.5% wrong, i.e. it looks "better" than our algorithm. So when we make a change to raise the accuracy, we cannot tell whether the change is any better than simply hard-coding the result y = 0:
insert image description here
For skewed data we need to use Precision and Recall; only when both rates are high is it a good algorithm. If you just set y to 0, the recall is 0, which exposes that that algorithm is no good.
insert image description here
precision: of the patients we said have cancer, how many really have cancer
recall: of the patients who really have cancer, how many we correctly diagnosed
Precision and recall trade off against each other; pushing one up tends to push the other down, so both cannot be arbitrarily high. If you want fewer healthy people to be misdiagnosed and put through cancer treatment, precision needs to be higher; if you want fewer actual cancer patients to be missed, recall needs to be higher (if this is confusing, look back at the definitions of the two rates in the figure above). To measure both with a single number, use F (or $F_1$) $= \frac{2PR}{P+R}$, which is high only when both rates are reasonably high.
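
A small sketch of how the two rates and $F_1$ come out of 0/1 predictions (illustrative code; y = 1 is the rare positive class):

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    # Counts for the positive class (y = 1).
    tp = np.sum((y_pred == 1) & (y_true == 1))   # predicted cancer, really cancer
    fp = np.sum((y_pred == 1) & (y_true == 0))   # predicted cancer, actually healthy
    fn = np.sum((y_pred == 0) & (y_true == 1))   # missed a real cancer case
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 0, 1, 0, 1])
print(precision_recall_f1(y_true, y_pred))  # roughly (0.67, 0.67, 0.67)
```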

Finally, on large data: if the two conditions below are satisfied (the features carry enough information, and the learning algorithm has many parameters, i.e. low bias), then $J_{train}$ will be small, and with a huge training set $J_{train} \approx J_{test}$, so $J_{test}$ will also be small. In other words, lots of data really does help the algorithm.
insert image description here
insert image description here
insert image description here


week 7

I wrote it but didn't save it


week 8

No full summary for this week; just jotting down a few points.
* K-Means Algorithm
For the K-means algorithm in unsupervised learning, the convention is that $x^{(i)}$ is n-dimensional, dropping the $x_0 = 1$ convention.
insert image description here
Each μ is an n-dimensional vector; apparently writing it as $\mu \in \mathbb{R}^n$ this way means a column vector.
insert image description here
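
A minimal sketch of the algorithm's two alternating steps, cluster assignment and moving the centroids (illustrative code; the centroids are initialized by picking K random examples):

```python
import numpy as np

def k_means(X, K, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]   # K initial centroids, each in R^n
    for _ in range(iters):
        # Cluster assignment step: c[i] = index of the centroid closest to x(i).
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = np.argmin(dists, axis=1)
        # Move centroid step: mu[k] = mean of the points assigned to cluster k.
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)
    return c, mu

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(20, 2)), rng.normal(size=(20, 2)) + 5])
c, mu = k_means(X, K=2)
print(mu)  # roughly (0, 0) and (5, 5)
```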


PCA part (Principal Component Analysis)

Motivation I: Data Compression
Data compression: if some features are highly correlated, they can be compressed.
For example, two features, centimetres and feet, really measure the same thing, so one can be expressed linearly in terms of the other (with a little error). Two features can then be represented by a single line, i.e. by just one of the two features: $\mathbb{R}^2$ is reduced to $\mathbb{R}^1$.
insert image description here
In three dimensions, the data is similarly projected down onto a two-dimensional plane.

Principal Component Analysis Algorithm
No proof is given here; it's enough to know that the goal is to compress high-dimensional data to low-dimensional data. u refers to the compressed direction vectors and z to the values of x along those directions (x here being the whole data matrix). The first step is to mean-normalize, and then to do feature scaling. The svd function conveniently gives us the u vectors. The blue X matrix on the right is arranged one example per row, so in practice the formula in the blue box is the one to use. Also, if K-means is combined with PCA, remember there is no $x_0$ convention. The Sigma here is the covariance matrix; be careful to distinguish it from the summation symbol (the definition of covariance can be found in a linear algebra textbook).

Reconstruction from Compressed Representation
Reconstruction means recovering the original data, i.e. getting x back from z; note that it is only an approximation.

Choosing the Number of Principal Components
How do we choose k in PCA? There is a criterion: if the ratio of the average squared projection error to the total variation in the data is at most 0.01, we can say "I chose the parameter k such that 99% of the variance is retained". "Usually you can greatly reduce the dimensionality of the data and still retain most of the variance, because in most real-world data many feature variables are highly correlated." The quantity in the numerator, which PCA keeps small, is the average squared projection error:

insert image description here
insert image description here
insert image description here


insert image description here

$$\frac{1}{m}\sum_{i=1}^{m}\left\|x^{(i)} - x_{approx}^{(i)}\right\|^2$$
insert image description here
Then there is an easier method:
insert image description here
On the left, K is brute-forced starting from 1, rerunning PCA each time until the inequality is satisfied. On the right, svd is called just once and the S matrix it returns is used to check the inequality in the lower-right corner for each k. The first approach reruns the whole PCA for every k while the second only reuses svd's output, so the latter is faster and is the recommended one.
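
A minimal numpy sketch of that second method (illustrative names; X is assumed to be already mean-normalized), using the singular values from a single svd call to pick k:

```python
import numpy as np

def choose_k(X, variance_to_retain=0.99):
    # X: (m, n) data matrix, already mean-normalized.
    m = X.shape[0]
    Sigma = X.T @ X / m                  # covariance matrix, n x n
    U, S, _ = np.linalg.svd(Sigma)       # one svd call; S holds the singular values
    retained = np.cumsum(S) / np.sum(S)  # variance retained for k = 1..n
    k = int(np.searchsorted(retained, variance_to_retain)) + 1
    return k, U[:, :k]                   # k and the matching U_reduce

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = 2.0 * X[:, 0]                  # make one feature exactly redundant
X = X - X.mean(axis=0)

k, U_reduce = choose_k(X)
print(k)                                 # prints 4: one direction is redundant
Z = X @ U_reduce                         # the compressed representation z
```

The cumulative-sum check here is the same thing as the slide's inequality $\frac{\sum_{i=1}^{k}S_{ii}}{\sum_{i=1}^{n}S_{ii}} \ge 0.99$.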


Advice for Applying PCA – three benefits of PCA
insert image description here
However, PCA is not recommended as a fix for overfitting; that should be done with regularization. Regularization makes use of the labels y, whereas PCA is unsupervised and throws information away without looking at y.
insert image description here
Although PCA is very good, don't reach for it by default. Train on the original data first; only if that really doesn't work, e.g. you need PCA to speed things up or save memory, bring PCA in, for the reasons above.
insert image description here
Note the PCA workflow in the figure above:
run PCA on the training data only, then map the data in the cv or test set to z using the same u (the mapping learned from the training set), and run the hypothesis on that.
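
A short sketch of that workflow (names are illustrative; the point is that the mean, the scale, and U_reduce are all estimated on the training set only and then reused unchanged on the cv/test data):

```python
import numpy as np

def fit_pca_params(X_train, variance_to_retain=0.99):
    # Everything is estimated on the training set only.
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8
    Xn = (X_train - mu) / sigma
    Cov = Xn.T @ Xn / Xn.shape[0]
    U, S, _ = np.linalg.svd(Cov)
    retained = np.cumsum(S) / np.sum(S)
    k = int(np.searchsorted(retained, variance_to_retain)) + 1
    return mu, sigma, U[:, :k]

def apply_pca(X, mu, sigma, U_reduce):
    # cv/test data are mapped with the training-set parameters, never refit.
    return ((X - mu) / sigma) @ U_reduce

rng = np.random.default_rng(1)
latent = rng.normal(size=(100, 2))                   # 2-D structure hidden in 6-D data
X_full = latent @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(100, 6))
X_train, X_cv = X_full[:80], X_full[80:]

mu, sigma, U_reduce = fit_pca_params(X_train)
Z_train = apply_pca(X_train, mu, sigma, U_reduce)
Z_cv = apply_pca(X_cv, mu, sigma, U_reduce)
print(Z_train.shape, Z_cv.shape)                     # both mapped to the same k columns
```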


week 11

Getting Lots of Data and Artificial Data
Sometimes we can't get that much data, but we can make some artificially:
the first method is to create new data from scratch;
the second method is to turn a smaller training set into a larger one via transformations. For example, with letter data, you can take letters in different fonts and paste them onto different backgrounds; many websites offer different fonts for free.
insert image description here
Or to change the shape of the letters:
insert image description here
It is worth noting that adding purely random, meaningless noise contributes nothing positive to training.
insert image description here
In terms of speech recognition, you can also add a little noise to the original sound for training:
insert image description here
Before deciding to expand the data, first think about these two questions:
insert image description here

  1. Before expanding the data, make sure your classifier is **low bias, high variance** (plot the learning curves to check); if it is high bias, first add appropriate features. Make sure the classifier really can improve after training on a lot more data, rather than spending time collecting it only to find the effect is unimpressive.
  2. Always ask how much time it would take to get 10× the data. If one label takes 10 s, we can estimate how much data each person can produce in a day or a week, or whether to hire people to do it. It usually turns out that it doesn't take a lot of time.

Ceiling Analysis: What Part of the Pipeline to Work on Next
(Ceiling analysis; I actually think "upper-bound analysis" would be a better name. The pipeline is simply the workflow.)

**If you are going to build something (machine learning or otherwise), the most precious resource is time, so plan it.** Don't spend weeks or months on a piece only to discover it barely helps the system's performance.

This is where ceiling analysis helps. Suppose the pipeline has the three parts below and we have to decide how many people to assign to each part for optimization. The initial overall accuracy is 72%. If we make the Text detection stage 100% correct (by substituting ground truth), the overall accuracy becomes 89%; keeping that, we then also make Character segmentation 100% correct, and the overall accuracy becomes 90%, and so on.
insert image description here
Therefore, we should give priority to improving the Text detection stage, while the Character segmentation stage is probably not worth much manpower, since perfecting it barely moves the overall number.
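
A tiny sketch of the bookkeeping (the accuracy numbers are the ones from the lecture example above, with everything-perfect taken as 100%):

```python
# Overall accuracy as each pipeline stage, in order, is replaced by ground truth.
stages = ["Text detection", "Character segmentation", "Character recognition"]
accuracy = [0.72, 0.89, 0.90, 1.00]   # baseline, then after perfecting each stage in turn

for stage, before, after in zip(stages, accuracy, accuracy[1:]):
    # The gain is the ceiling on what improving this stage alone could buy.
    print(f"{stage}: at most +{(after - before) * 100:.0f}% overall accuracy")
```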

There is also a face recognition system, which has the following tasks:
insert image description here
After ceiling analysis it turned out that perfecting the remove-background step improved the overall system by only 0.1%, meaning that part should not have been a priority. Yet two engineers spent 18 months optimizing exactly that piece, and even published a paper on it, only to find in the end that the improvement to the overall system was not significant. If the ceiling analysis had been done 18 months earlier, they would not have gone in the wrong direction and burned that precious time. That price is a bit too high; choosing a good direction really matters. It reminds me of my high-school teacher's line: "with the wrong direction, the faster you go, the faster you fail".
insert image description here
Final summary: don't trust your intuition too much about where to spend time and where to skip. Split the problem into modules (a pipeline), then do a ceiling analysis and, much like gradient descent, put your effort where it yields the most, so you save precious time and work more efficiently.


Final chapter: Summary and Thank You!!!

Summing up what we've learned: don't just know these tools, use them flexibly to build a powerful machine learning system.
insert image description here
At the end Ng says "you can confidently consider yourself an expert in machine learning" (which feels a bit undeserved, 233), and that these tools should be used to make people's lives better.
insert image description here
insert image description here
insert image description here
insert image description here
Thank you too, Ng ~ scattering flowers for the ending


Origin blog.csdn.net/Only_Wolfy/article/details/89931957