Notes on the problems I ran into while doing the homework.
Course: Andrew Ng's (Wu Enda) Machine Learning on Coursera.
PS: I did not set a language on the code snippets here, and I was too lazy to check each one. When no language is set, the code shows up in magenta (I learned from Andrew Ng's course that this red is called magenta, haha), which still looks good. I browse with Google Chrome and installed the "Dark Reader" extension, which is perfect for protecting my eyes; facing a white screen was making me go blind.
Added later: it turns out the code is magenta only while editing and white when reading.
Ex1
Bits and pieces of notes:
- An n x 1 or 1 x n array is called a vector, denoted $\mathbb{R}^n$; an n x m array is called a matrix, denoted $\mathbb{R}^{n \times m}$.
- I only saw the submission password once I clicked into the programming assignment. After watching the video I wanted to try submitting, but I couldn't find the password at first...
1. warmUpExercise
A = eye(5);
The video already gives the answer. A good start: we got our first "Nice Work!"
One thing to watch: end the statement with a semicolon unless you are testing, otherwise a lot of output gets printed. The extra output does not seem to affect the grading, though; only the returned value has to be right.
2. Computing Cost (for One Variable)
prediction = X*theta;
cost = (prediction - y).^2;
J = 1/(2*m)*sum(cost);
Note the "." in .^2: it makes the power an element-by-element operation.
3. Gradient Descent (for One Variable)
This can be written directly from the formula.
delta = X'*(X*theta - y); % X' does not need parentheses
theta = theta - alpha/m*delta;
Pay attention to the matrix dimensions: delta = X'*(X*theta - y); cannot be written as (X*theta - y)*X';. In the test case X' is 2x20 and (X*theta - y) is 20x1, while delta (the derivative with respect to theta, so the same size as theta) is 2x1, so X' has to come first. This kind of mistake is hard to spot when deriving the formula by hand. To check the dimensions of the variables, print them with size() in the command window.
This issue is mentioned in the reading material under "Submitting Programming Assignments" in week 2. At the time I didn't know what it was about. After finishing my code I clicked on it by accident and realized it was describing exactly this.
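A minimal sketch of that dimension check, assuming made-up toy data with 20 examples (the values of X, y and alpha below are only for illustration and are not from the assignment):

```octave
% Toy data: m = 20 examples, one feature plus the intercept column.
m = 20;
X = [ones(m, 1), (1:m)'];   % 20 x 2
y = 2 * (1:m)' + 1;         % 20 x 1
theta = zeros(2, 1);        % 2 x 1
alpha = 0.001;

err = X * theta - y;        % 20 x 1
delta = X' * err;           % (2 x 20) * (20 x 1) = 2 x 1, same size as theta
disp(size(delta));

% err * X' would be (20 x 1) * (2 x 20): the inner dimensions 1 and 2
% differ, so Octave reports nonconformant arguments instead.
theta = theta - alpha / m * delta;
```

Printing size() of each factor before multiplying is usually the quickest way to see which order is valid.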
4. Feature Normalization
Feature normalization, where normalizing means (x - mean) / standard deviation.
I first wrote (x - mean) / (max - min) and kept getting wrong answers. I'll post that attempt anyway, otherwise it was written in vain:
mu = mean(X_norm')
mu = mu' * ones(1,3);
sigma = std(X_norm');
X_norm = (X_norm-mu)./(max(X_norm')'*ones(1,3)-min(X_norm')'*ones(1,3))
The main point: each row of X is an example and each column is a feature. mean, std, and max already work column by column on a matrix, so no transpose is needed; transposing, as I did above, computes the statistics per example instead of per feature. You can't see the difference on a single row vector, but you can on a matrix.
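A minimal sketch of the (x - mean) / standard deviation version, assuming a small made-up matrix X. It relies on automatic broadcasting, which recent Octave versions support; on older versions bsxfun would be needed instead:

```octave
% Toy design matrix: each row is an example, each column a feature.
X = [2104 3; 1600 3; 2400 4; 1416 2];

mu = mean(X);      % 1 x n row vector: mean of each column (feature)
sigma = std(X);    % 1 x n row vector: standard deviation of each column

% Broadcasting subtracts mu from every row and divides column-wise.
X_norm = (X - mu) ./ sigma;
```

After this, every column of X_norm has mean 0 and standard deviation 1 (up to rounding).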
5. Computing Cost (for Multiple Variables)
This is the same as Computing Cost (for One Variable); just paste it.
6. Gradient Descent (for Multiple Variables)
This is the same as Gradient Descent (for One Variable); just paste it. That is the charm of matrix operations.
7. Normal Equations
theta = (pinv(X' * X))*(X') * y
I first wrote -1 to invert the matrix... I can't remember what X-1 was supposed to stand for; it is neither the inverse nor the transpose, hence the pinv above.
Finally, a screenshot to commemorate finishing ex1~
Ex2
1. Sigmoid Function
g = ones(size(z,1),size(z,2))./(1+exp(-z));
It is easy to forget the element-wise division ( ./ ). At first I thought a plain 1 could not be used as the numerator because it is 1x1, but a scalar is expanded to match the other operand, just like the 1 that is added to exp(-z) in the denominator, so g = 1./(1+exp(-z)); gives the same result.
2. Logistic Regression Cost
Haha, I finally had to look at the submit file. Parts 2 and 3 both call costFunction, but one checks J and the other checks grad. That is how I found that the J I had written for part 3 was wrong: with the sigmoid inside, the squared-error cost used for linear regression is no longer convex, so the cost has to take a different form (the log form below).
It is easy once you look up the formula, and the matrix products take care of the sums:
prediction = ones(size(X*theta,1),size(X*theta,2))./(1+exp(-X*theta));
cost = (prediction - y).^2;
%J = 1/(2*m)*sum(cost); Wrong!!!
J = 1/m*(-y'*log(prediction)-(1-y)'*log(1-prediction));
grad = X'*(prediction - y)/m;
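For reference, the formulas that the J and grad lines above implement, written out in LaTeX as I understand them from the course material (h is the variable prediction in the code):

```latex
h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
  \left[ y^{(i)} \log h_\theta(x^{(i)})
       + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]

\frac{\partial J}{\partial \theta_j}
  = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
```

In matrix form the sums over i become the products y'*log(h) and X'*(h - y), which is why no loop is needed.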
3. Logistic Regression Gradient - costFunction
I meant to solve part 2, but what passed was part 3 (???). This was my first attempt:
prediction = ones(size(X*theta,1),size(X*theta,2))./(1+exp(-X*theta));
cost = (prediction - y).^2
J = 1/(2*m)*sum(cost)
grad = X'*(prediction - y)/m
My reasoning at the time was that logistic and linear regression share the same formulas for the cost J and its gradient, so I only changed the prediction and pasted the earlier code. That holds for the gradient, but not for J.
4. Predict
I suddenly found that functions in the same folder can call each other.
%X(1,:)
%theta
for i=1:m
if sigmoid(X(i,:)*theta)>=0.5
p(i) = 1;
endif
endfor
The thing to watch is that the dimensions of X and theta match. The loop index i walks through the rows (examples) of X; each row is multiplied by theta, and p(i) is the result for the i-th row. Hmm, I used a for loop here because I couldn't think of how to update p(i) for each row without one.
I thought of it right after I finished writing (haha, I'm so smart): in Python there is something like a = (X > 0) (the syntax may be slightly off; it is used in ReLU). It tests every element against 0 and stores the results in a, which is also a matrix. I tried the same thing in the command window:
Copying the pattern worked fine, so I kept imitating:
predict = sigmoid(X*theta)
p = predict>=0.5
That's it: no for loop, just operate on the matrix directly.
5. &6. Regularized Logistic Regression Cost & Regularized Logistic Regression Gradient
Both are tested through one file, so I wrote them together.
h = sigmoid(X*theta)
J = 1/m*(-y'*log(h)-(1-y)'*log(1-h))+lambda/2/m*(sum(theta.^2)-theta(1)*theta(1))
grad = X'*(h-y)/m + lambda/m*theta
grad(1) = grad(1) - lambda/m*theta(1)
Reference for J:
As mentioned above for the matrix computation:
The Σ can be omitted because y is m x 1, y' is 1 x m, and h is m x 1, so the product y'*log(h) already sums over the examples. Just print them with size() to see their dimensions.
What I submitted at first:
J = 1/m*(-y'*log(h)-(1-y)'*log(1-h))+lambda/2/m*sum(theta.^2)
A hint then pops up saying that θ(1) must not be regularized, so I changed it. Note that the index of the first element of theta is 1, not 0; I got another error from that.
Then the reference for grad, from the course page "Regularized Linear Regression":
Looking back at the code written yesterday
and following the same pattern:
grad = X'*(h-y)/m + lambda/m*theta
grad(1) = grad(1) - lambda/m*theta(1)
Note that the index is 1. grad(1) must not be regularized, so I add the term to every element and then subtract it again for grad(1).
I can't think of a more concise method, but at least J and grad need no for loop and make flexible use of matrix multiplication.
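For reference, the regularized cost and gradient that the code above implements, as I understand them from the course material. The math uses 0-based indices, so θ_0 here is theta(1) in Octave:

```latex
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
  \left[ y^{(i)} \log h_\theta(x^{(i)})
       + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]
  + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2

\frac{\partial J}{\partial \theta_0}
  = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}

\frac{\partial J}{\partial \theta_j}
  = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
  + \frac{\lambda}{m} \theta_j \qquad (j \ge 1)
```

The regularization sum starts at j = 1, which is why the code subtracts theta(1)*theta(1) from sum(theta.^2) and removes the penalty from grad(1).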
Finished. A last screenshot to commemorate:
Ex3
1. Regularized Logistic Regression
h = sigmoid(X*theta);
J = 1/m*(-y'*log(h)-(1-y)'*log(1-h))+lambda/2/m*(sum(theta.^2)-theta(1)*theta(1)) ;
grad = X'*(h-y)/m + lambda/m*theta;
grad(1) = grad(1) - lambda/m*theta(1);
Just paste from ex2 and it's done...
2. One-vs-All Classifier Training
I got stuck here for 40 minutes. The problem statement is rather long, and I didn't know how to use fmincg (what has to come will come):
initial_theta = zeros(n + 1, 1);
options = optimset('GradObj','on','MaxIter',50);
for c = 1:num_labels
%initial_theta = zeros(n + 1, 1); not added here, but it still passed
k = fmincg(@(t)(lrCostFunction(t,X,(y==c),lambda)),initial_theta,options)
k';
all_theta(c,:) = k';
all_theta;
endfor
The "Iteration" that gets printed is how many steps fmincg has run, and "Cost" is the value of the loss function. At first I thought it was an error message...
The idea is to run fmincg num_labels times. The c-th run returns an optimal theta, which goes into the c-th row of all_theta. The returned theta is a column vector, so it has to be transposed to fit into a row of all_theta.
To sum up, row c of all_theta stores the best theta for the case "y equals c", where c runs from 1 to 4 (because num_labels = 4 here).
To make the transpose convenient, I stored the column vector theta in k, transposed k, and assigned it to a row of all_theta, as in the code above. Without the intermediate variable k:
initial_theta = zeros(n + 1, 1);
options = optimset('GradObj','on','MaxIter',50);
for c = 1:num_labels
%initial_theta = zeros(n + 1, 1); not added here, but it still passed
all_theta(c,:) = fmincg(@(t)(lrCostFunction(t,X,(y==c),lambda)),initial_theta,options)'; % note the transpose " ' " at the end
endfor
Because I had not thought about the transpose, what I wrote at the very beginning was:
for c = 1:num_labels
all_theta(c) = fmincg(@(t)(lrCostFunction(t,X,(y==c),lambda)),all_theta(c),options)
endfor
It is full of mistakes. First, all_theta(c) is not the c-th row but the c-th element of the matrix in column-major order. Second, the initial guess should be initial_theta, not all_theta(c). If you are careful you will also notice that initial_theta is not reset on each pass through the loop, but that doesn't matter: fmincg never modifies it, so it stays all zeros.
But wait, is initial_theta really needed? Every row of all_theta is also zero at the beginning, so we can drop initial_theta as well. All that is needed is to take the row of all_theta and transpose it to use as the initial guess:
options = optimset('GradObj','on','MaxIter',50);
for c = 1:num_labels
all_theta(c,:) = fmincg(@(t)(lrCostFunction(t,X,(y==c),lambda)),all_theta(c,:)',options)';
endfor
Great, now the code is much leaner than the correct code submitted in the first place, and it is also correct ヾ(≧∇≦*)ヾ
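A small sketch of the indexing difference mentioned above, on a made-up 2 x 3 matrix (the names A and k are only for illustration):

```octave
A = [1 2 3; 4 5 6];    % 2 x 3

second = A(2);         % linear index, column-major order: 4, not the second row
third = A(3);          % 2 (first row, second column)
row2 = A(2, :);        % the whole second row: [4 5 6]

% Storing a column vector in a row: the transpose makes the shapes
% match explicitly (1 x 3 into 1 x 3).
k = [7; 8; 9];
A(1, :) = k';
```

So all_theta(c) picks a single element, while all_theta(c,:) picks the whole row that fmincg's result should go into.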
3. One-vs-All Classifier Prediction
The old method: use size() to print the dimensions of all_theta and X. X is m x 3 and all_theta is num_labels x 3 (4 x 3). The max function can be called in several ways:
max(X): for a vector, the largest element; for a matrix, the largest element of each column
max(X, [], 1): the largest element in each column of X
max(X, [], 2): the largest element in each row of X
[a, b] = max(X): a is the largest element in each column of X, and b is the row index of that element
max(X, Y): the element-wise larger of X and Y
What we want, after multiplying theta with X, is to find for each example (row) which of the 4 classes has the highest probability and store that class in the corresponding row of p. Since [a, b] = max(...) gives the index of the largest element in each column, we build a matrix with 4 rows (one per class) whose columns hold the scores of each example. max then returns, for each column, the index of the most likely class. That matrix is 4 x m, so it is all_theta * X'. The result is a row vector of indices, which we transpose to get p:
t = all_theta*X';
[a,b] = max(t);
p = b';
By the way, I did not apply the sigmoid here: it is monotonically increasing, so the class with the largest θx also has the largest sigmoid(θx), and there is no need to add it.
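A quick sketch to check that claim, assuming a made-up 3 x 4 score matrix (3 examples, 4 classes). The anonymous sigmoid is defined here only to keep the snippet self-contained:

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Toy scores: one row per example, one column per class.
scores = [ 0.2 -1.0  3.0  0.5;
           2.5  0.1 -0.3  1.0;
          -2.0 -0.5 -1.5 -3.0];

[~, p_raw] = max(scores, [], 2);           % index of the largest score in each row
[~, p_sig] = max(sigmoid(scores), [], 2);  % same indices after the sigmoid

disp(isequal(p_raw, p_sig));
```

With max(..., [], 2) the maximum is taken along each row, so the examples can stay in rows and no transpose is needed.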
4. The Neural Network Prediction Function
It was a bit messy when I wrote it, and I wasn't sure what the final output matrix should look like. I glanced at other people's code and found that I had forgotten to add the column of ones to the hidden layer. I remembered it for the input layer and forgot it for the hidden layer:
X = [ones(m, 1) X];
z2 = sigmoid(X*Theta1');
size(z2);
a2 = [ones(m,1) z2];
a3 = a2*Theta2';
t = a3';
[a,b] = max(t);
p = b';
The input layer matrix is m x 2 -> m x 3 (after adding the ones column)
The hidden layer input is (m x 3) * (3 x 4) = m x 4
The hidden layer output is m x 4 -> m x 5 (after adding the ones column)
The output layer is (m x 5) * (5 x 4) = m x 4
The final m x 4 (or 4 x m) matrix means there are m inputs, each with 4 outputs (probabilities).
After reading what others wrote, I learned the neat trick [a,b] = max(A, [], 2). It means the same thing, but takes the maximum along each row instead of along each column:
X = [ones(m, 1) X];
z2 = sigmoid(X*Theta1');
size(z2);
a2 = [ones(m,1) z2];
a3 = sigmoid(a2*Theta2');
[a,b] = max(a3,[],2);
p = b;
Sure enough, it pays to read other people's code and learn from their strengths.
The sigmoid at the end can be added or left out, because it is monotonically increasing, but with it the output has the meaning of a "probability", so I kept it.
I stumbled along, but after finishing it turned out not to be too hard (the truth of all things: nothing is hard once you understand it). The usual commemorative screenshot:
Ex4
Feedforward and Cost Function & Regularized Cost Function
I hadn't written any code for a long time because of exam revision, so I was rusty and got stuck for a long time.
Reference formula:
[hx,dummy] = predict(Theta1,Theta2,X);
Y = zeros(m,num_labels);
for i=1:m
Y(i,y(i))=1;
endfor
t1 = log(hx).*Y + log(1-hx).*(1-Y);
sum_xita = sum(sum(Theta1(:,2:end).^2)) + sum(sum(Theta2(:,2:end).^2));
J = -1/m*sum(sum(t1)) + lambda/2/m*sum_xita;
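% Plain matrix multiplication does not work here: it would multiply log(hx_k) with the y_(k+1) of another class, which is wrong
% The right operation is the element-wise product, each entry with its own counterpart; I missed this and was stuck for a long time
% When squaring theta, the first column must be left out; slicing handles this easily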
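For reference, the regularized cost that the code above computes, written out in LaTeX as I understand it from the exercise. K is num_labels, h is hx, and the column index k of each Θ starts at 0 for the bias column, which is excluded from the penalty:

```latex
J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K}
  \left[ y_k^{(i)} \log \left( h_\Theta(x^{(i)}) \right)_k
       + (1 - y_k^{(i)}) \log \left( 1 - \left( h_\Theta(x^{(i)}) \right)_k \right) \right]
  + \frac{\lambda}{2m} \left[
      \sum_{j} \sum_{k \ge 1} \left( \Theta^{(1)}_{j,k} \right)^2
    + \sum_{j} \sum_{k \ge 1} \left( \Theta^{(2)}_{j,k} \right)^2 \right]
```

Each term pairs one output with its own label, which is why the code uses the element-wise product .* with the one-hot matrix Y rather than a matrix product.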
predict.m was changed so that it also returns hx:
Sigmoid Gradient
Written according to the formula:
g = sigmoid(z).*(1-sigmoid(z));
I took a look at the derivation of the sigmoid derivative... a bit difficult.
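A short version of that derivation, using only the chain rule (g is the sigmoid):

```latex
g(z) = \frac{1}{1 + e^{-z}} = \left( 1 + e^{-z} \right)^{-1}

g'(z) = -\left( 1 + e^{-z} \right)^{-2} \cdot \left( -e^{-z} \right)
      = \frac{e^{-z}}{\left( 1 + e^{-z} \right)^{2}}
      = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}

1 - g(z) = \frac{e^{-z}}{1 + e^{-z}}
\quad \Longrightarrow \quad
g'(z) = g(z) \left( 1 - g(z) \right)
```

This is exactly the expression sigmoid(z).*(1-sigmoid(z)) in the code.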
Neural Network Gradient (Backpropagation) & Regularized Gradient
My own attempt fell apart, so I referred to other people's code
and continued adding to nnCostFunction:
X = [ones(m,1) X];
for i=1:m
a1 = X(i,:); %1x3
z2 = a1*Theta1'; %1x4
a2 = [1 sigmoid(z2)];
z3 = a2*Theta2';
a3 = sigmoid(z3);
deta3 = a3-Y(i,:); % 1 x 4
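deta3 = deta3'; % 4 x 1
deta2 = Theta2(:,2:end)'*deta3.*sigmoidGradient(z2'); % 4 x 1, bias column of Theta2 dropped
% printed while debugging to check the dimensions:
% size(z2)
% size(deta2)
% size(Theta2(:,2:end)')
% size(deta3)
% size(a2')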
Theta2_grad = Theta2_grad + deta3*a2; % 4 x 5
Theta1_grad = Theta1_grad + deta2*a1; % 4 x 3
endfor
Theta1(:,1) = 0; % zero the bias column so that it is not regularized
Theta2(:,1) = 0;
Theta1_grad = (Theta1_grad + lambda*Theta1)/m;
Theta2_grad = (Theta2_grad + lambda*Theta2)/m;
Knowledge reference:
[Description of the detailed process]
Here are the pitfalls I stepped into this afternoon:
- I didn't know what a and z each stood for. I struggled for a long time without rereading my earlier notes, and only then found that the pdf explains it
- The input layer (the first layer) also needs x0 = 1 added, hehe
- For the lowercase delta (δ), the formula uses $\Theta^T \delta$. I thought the order didn't matter and wrote $\delta^T \Theta$, which was wildly wrong
- In the final formula of the pdf, D divides both Δ and λΘ by m
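For reference, the backpropagation steps for this three-layer network as I understand them from the course, in LaTeX. Here the activations a are column vectors, ⊙ is the element-wise product, and the tilde on Θ is my own notation for Theta2 without its bias column (Theta2(:,2:end) in the code):

```latex
\delta^{(3)} = a^{(3)} - y

\delta^{(2)} = \left( \tilde{\Theta}^{(2)} \right)^T \delta^{(3)} \odot g'(z^{(2)})

\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} \left( a^{(l)} \right)^T

D^{(l)}_{ij} = \frac{1}{m} \Delta^{(l)}_{ij} \quad (j = 0), \qquad
D^{(l)}_{ij} = \frac{1}{m} \left( \Delta^{(l)}_{ij} + \lambda \Theta^{(l)}_{ij} \right) \quad (j \ge 1)
```

The order in the second line matters: the transposed Θ multiplies δ from the left, so that the result has one entry per hidden unit.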
From ten in the morning to five in the afternoon I stepped into all kinds of pitfalls. A screenshot to commemorate it:
I stumbled through ex4, so now I have time to look at the forum questions.
Ex5
Regularized Linear Regression Cost Function & Regularized Linear Regression Gradient
reference picture:
I racked my brains and wrote the commented-out part. Later I read other people's code and found there is no need to raise x to the power i. After a while I realized that the superscript on x denotes the i-th example, not a power of x.
Next comes the gradient, grad,
which here needs both the regression term and the regularization term.
Looking back at my Ex2 notes (the watermark is a bit hard to remove), under Regularized Linear Regression,
the part framed in blue is grad, the gradient of J with respect to θ.
Then there is the complete code:
h = zeros(m,1);
h = X*theta;
J = 1/2/m*(sum((h-y).^2))+lambda/2/m*sum(theta(2:end).^2)
grad = X'*(h-y)/m + lambda/m*theta
grad(1) = grad(1) - lambda/m*theta(1)
Remember that the first θ does not require regularization.
Learning Curve
The requirement is to plot the cross-validation error and the training error as the number of samples i increases. The usual picture looks like this (here N is i):
Then I hit a very obvious problem: I can't plot with Octave. When I just run a test it draws a figure, but when I submit it iterates many times and the resulting figure is a mess.
It's a little annoying. The tutorials online are all for MATLAB, I'm a bit pressed for time, and I don't want to learn MATLAB right now, so I'll look into it in the summer vacation after the exams.
Code (I didn't understand the task at first; after seeing the figures others had drawn I knew what to do: two error curves as N increases):
for i = 1:m
theta = trainLinearReg(X(1:i,:),y(1:i,:),lambda);
error_val(i) = linearRegCostFunction(Xval,yval,theta,0);
%theta = trainLinearReg(X,y,lambda)
error_train(i) = linearRegCostFunction(X(1:i,:),y(1:i,:),theta,0);
endfor
Polynomial Feature Mapping
This maps each X(i) to [X(i) X(i).^2 X(i).^3 ... X(i).^p]. Code:
for i=1:size(X,1)
for j=1:p
X_poly(i,j) = X(i).^j;
endfor
endfor
Reading other people's posts, I found that the matrix operations can be used more fully:
X
for i=1:p
X_poly(:,i) = X.^i;
endfor
X_poly
Suddenly I noticed that the printout was quite different from what I had in mind. I hadn't read the question carefully and just nested two for loops after looking at the formula, but it still came out right.
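One step further, as a sketch under the assumption that the Octave version supports automatic broadcasting: the remaining loop can be replaced by raising the column vector to a row of exponents. The toy values of X and p are made up:

```octave
X = [1; 2; 3];    % column vector of m = 3 examples
p = 4;

% Loop version: column i holds X raised to the power i.
X_loop = zeros(numel(X), p);
for i = 1:p
  X_loop(:, i) = X .^ i;
endfor

% Broadcasting version: (m x 1) .^ (1 x p) expands to m x p.
X_poly = X .^ (1:p);
```

Both versions give the same matrix; the second row, for example, is [2 4 8 16].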
A blog post on plotting in MATLAB: https://blog.csdn.net/weixin_40807247/article/details/81359042
Final practice:
PS: I don't think I did very well this time. I was impatient when reading the questions and failed. After thinking for a while I went to look at other people's code. That's not bad in itself, but I need to read the questions properly.
Ex6
Gaussian Kernel
The Gaussian kernel formula:
How ||u|| is computed (for a 2-dimensional u):
Replace u with x1 - x2. Correct code:
t = x1-x2;
sim = exp(-sum(t.^2)/(2*sigma^2))
Wrong code:
sim = exp(-(x1-x2).^2/(2*sigma^2))
Obviously I hadn't understood the formula: without the sum, this returns a vector instead of a single number.
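For reference, the kernel formula as I understand it from the exercise pdf, with the norm written out so the need for sum() is visible:

```latex
K_{\text{gaussian}}(x_1, x_2)
  = \exp\left( -\frac{\lVert x_1 - x_2 \rVert^2}{2\sigma^2} \right)
  = \exp\left( -\frac{\sum_{k=1}^{n} \left( x_{1,k} - x_{2,k} \right)^2}{2\sigma^2} \right)

\lVert u \rVert^2 = \sum_{k=1}^{n} u_k^2
```

The sum over k sits inside the exponential, which corresponds to sum(t.^2) in the correct code.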
Parameters (C, sigma) for Dataset 3
I was dumbfounded, and then I remembered my friend saying there was a document to download:
Life sucks. The download contains the scripts and a pdf introducing the exercise (maybe other data as well); some of the things we were told to download at the beginning are not included.
However, Su You had said the background introduction was useless, so I had not opened the "here" file for a long time and had got by fine.
This time I really couldn't manage the homework without it. After reading the downloaded files, I found they contain a lot of hints, including how to use functions that were never covered in class.
The task is to find the best C and sigma (σ) and evaluate them with svmPredict. To use svmPredict you first need svmTrain, whose fourth argument is a function handle (the @ thing), so you also have to go back and read svmTrain's code. There you can see that x1 and x2 do not need to be defined beforehand, because svmTrain calls the kernel directly as (1,0), which also matches what was taught in class. The pdf also suggests values to try for C and sigma, so a brute-force search with a nested loop is enough:
vec_C = [0.01;0.03;0.1;0.3;1;3;10;30];
vec_sigma = [0.01;0.03;0.1;0.3;1;3;10;30];
%x1 = [1 2 1 5 9 8]; x2 = [0 4 -1 7 6 5]; I saw some people define x1 and x2
% how x1 and x2 are defined here does not matter, because svmTrain calls kernelFunction as (1,0)
vec_errors = 10000000;
for i=1:length(vec_C)
for j=1:length(vec_sigma)
model= svmTrain(X, y, vec_C(i), @(x1, x2) gaussianKernel(x1, x2, vec_sigma(j)));
pred = svmPredict(model,Xval);
err = mean(double(pred~=yval)); % named err so the built-in error() is not shadowed
if (err < vec_errors)
a = i;
b = j;
vec_errors = err;
endif
endfor
endfor
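C = vec_C(a); % remember to assign the results
sigma = vec_sigma(b);
vec_errors;
model = svmTrain(X,y,C,@(x1, x2) gaussianKernel(x1, x2, sigma)); % retrain with the best pair, not the last (i,j) of the loop
visualizeBoundary(X,y,model); % seen in ex6.m: plots the non-linear boundary (visualizeBoundaryLinear is for the linear kernel)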
Email Preprocessing
Brute-force comparison is enough:
for i=1:length(vocabList)
if (strcmp(str,vocabList(i,1)))
word_indices = [word_indices;i];
break;
endif
endfor
emailFeatures
Brute force again:
for i = 1:length(word_indices)
x(word_indices(i)) = 1;
endfor
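The same loop can be written as a single indexed assignment. A minimal sketch with a made-up vocabulary size n and made-up word_indices (the real exercise uses a much larger vocabulary):

```octave
n = 10;                        % toy vocabulary size
word_indices = [2; 5; 5; 9];   % repeated indices are harmless

x = zeros(n, 1);
x(word_indices) = 1;           % set every listed position in one assignment
```

A word that appears twice still only sets its entry to 1, which matches the behavior of the loop.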
I feel this one was relatively simple, maybe because I could read the hints in the pdf, haha, and didn't have to figure everything out myself:
Ex7
Find Closest Centroids (k-Means)
Note that we subtract a row of centroids from a row of X (each centroid μ is a point in the same space as the rows of X), so the dimensions match,
rather than X(i) - centroids(j).
m = size(X,1);
for i=1:m
min_dist = Inf; % named min_dist so the built-in min() is not shadowed
for j=1:K
t = X(i,:) - centroids(j,:);
s = sum(t.^2); % squared distance to centroid j
if (s<min_dist)
min_dist = s;
idx(i) = j;
endif
endfor
endfor
Compute Centroid Means (k-Means)
This can be computed with matrix operations. I knew there was a find function but wasn't familiar with it, so I used for loops at first. After reading other people's code I made the following change:
% my own version:
##for i=1:K
## cnt = 0;
## for j=1:m
## if (idx(j)==i)
## centroids(i,:) += X(j,:);
## cnt++;
## endif
## endfor
## centroids(i,:) /= cnt;
##endfor
% changed after seeing that others used find:
for i=1:K
t = find(idx==i);
centroids(i,:) = sum(X(t,:))/length(t);
endfor
K-means is used for image compression
So far, run the script ex7:
it feels good and interesting.
I saw something about image compression in the pdf, which is to set K colors (such as 16 colors), and then run K-means.
First, write kMeansInitCentroids, which contains:
% randidx holds a random permutation of the row indices of X; X(randidx(1:K),:) then picks the first K rows after shuffling
randidx = randperm(size(X,1)); % Randomly reorder the indices of examples
centroids = X(randidx(1:K),:); % Take the first K examples as centroids
A quick test of how randperm is used:
Storing a permutation in one vector and then using it to index the rows of another matrix is a bit convoluted to write in C, but it works perfectly here (I suspect Python can do the same). The code is short, yet the interpreter still understands what you mean, which is very convenient.
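A small sketch of that randperm pattern on a made-up 5 x 2 matrix. The result is random, so only its properties are fixed, not its values:

```octave
X = [10 11; 20 21; 30 31; 40 41; 50 51];   % 5 examples
K = 2;

randidx = randperm(size(X, 1));    % a random ordering of the indices 1..5
centroids = X(randidx(1:K), :);    % the rows at the first K shuffled indices
```

Because randidx is a permutation, the K chosen rows are always distinct rows of X.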
Then run this script in the command window:
When K = 16:
change the script, K = 8:
K = 1:
K = 1024:
PCA
Start with a picture; the code is just a couple of lines to fill in:
sigma = X'*X/m;
[U,S,V] = svd(sigma); % third output named V so that X is not overwritten
Project Data (PCA)
U = U(:,1:K);
Z = X*U; % too lazy to work it out on paper, so I fiddled with the dimensions until they matched
Recover Data (PCA)
X_rec = Z*U(:,1:K)'; % see projectData
One more programming exercise done!!!
Ex8
Estimate Gaussian Parameters
mu = mean(X)';
t = mu'.*ones(m,n);
sigma2 = sum( (X-t).^2 )'/m;
t expands mu to the size of X, because I didn't want to use a for loop.
Then I read other people's code:
X
mu = mean(X)
size(mu)
size(X)
sigma2=sum((X - mu) .^ 2)/m;
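Hmm, it runs and gives the right result...
Discovery: the minus operator broadcasts just like element-wise multiplication, so the row vector mu is subtracted from every row of X. I am still not familiar enough with the language.
Select Threshold
Just type it in. Note that pval < epsilon is predicted as 1 (an anomaly):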
pval;
yval;
fp = sum((pval<epsilon)&(yval==0));
tp = sum((pval<epsilon)&(yval==1));
fn = sum((pval>=epsilon)&(yval==1));
prec = tp/(tp+fp);
rec = tp/(tp+fn);
F1 = 2*prec*rec/(prec+rec);
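For reference, the definitions behind those lines (tp, fp, fn are the true positives, false positives and false negatives counted above):

```latex
\text{prec} = \frac{tp}{tp + fp}, \qquad
\text{rec} = \frac{tp}{tp + fn}, \qquad
F_1 = \frac{2 \cdot \text{prec} \cdot \text{rec}}{\text{prec} + \text{rec}}
```

A "positive" here is an example flagged as an anomaly, that is, one with pval < epsilon.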
Collaborative Filtering Cost & Collaborative Filtering Gradient
First, the cost and gradient without regularization:
% without regularization
% version 1.0
J = sum(sum((X*Theta'-Y).^2.*R))/2;
for j=1:size(R,2)
for i=1:size(R,1)
if (R(i,j)==1)
[i,j];
X_grad(i,:) += (X(i,:)*Theta(j,:)'-Y(i,j))*Theta(j,:);
Theta_grad(j,:) += (X(i,:)*Theta(j,:)'-Y(i,j))*X(i,:);
endif
endfor
endfor
Reference for J:
Because Y is an n_m x n_u matrix (movies x users), X and Theta have to be multiplied in the order $X \Theta^T$.
R then marks whether the entry in row i, column j is valid (1 is valid, 0 is invalid), so it is applied as an element-wise product.
Finally, because J is a single number, two sums are needed: the first sums each column into a vector and the second sums that vector.
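For reference, the unregularized cost as I understand it from the exercise pdf, next to the matrix form used in the code (⊙ is the element-wise product and the square is element-wise too):

```latex
J = \frac{1}{2} \sum_{(i,j):\, r(i,j) = 1}
    \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right)^2
  = \frac{1}{2} \sum_{i} \sum_{j}
    \left[ R \odot \left( X \Theta^T - Y \right)^{2} \right]_{ij}
```

Multiplying by R zeroes out the entries with r(i,j) = 0, so summing over all i and j is the same as summing only over the rated pairs.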
Reference for the gradient:
I glanced at the hint in the accompanying pdf, which said to use a for loop, and then I couldn't write it...
In the end I had to think carefully about what the matrices mean. (X and X_grad have the same layout, since one is the gradient of the other; the same goes for Theta.)
X matrix: each row is a movie (num_movies) (x1, x2 in the figure below), each column is a feature (num_features) (degree of romance and action).
Theta matrix: each row is a user (num_users) (Alice, Bob, Carol, Dave), each column is a feature (num_features) (how much the user likes romance and action).
The diagram makes this clearer. (Note: in the diagram, Theta has one column per user and one row per feature; its first row, 0, is there to multiply with an x0 that does not appear and can be ignored.)
Now look back at the definition:
Given this layout and the formula above, X_grad is updated one row at a time. First take the product of row i of X and the transpose of row j of Theta (note: this is a number, not a matrix), subtract Y(i,j), and multiply the result by row j of Theta. Remember to accumulate: for the i-th row (the i-th movie) there are ratings from several users, and each pass of the double loop handles only the entry in row i, column j, so the results must be added up.
Theta_grad works the same way. R(i,j) means "the i-th movie was rated by the j-th user", so Theta_grad is updated in row j, whereas X_grad is updated in row i.
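For reference, the two unregularized gradients described above, as I understand them from the exercise pdf (k indexes the features):

```latex
\frac{\partial J}{\partial x_k^{(i)}}
  = \sum_{j:\, r(i,j) = 1}
    \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right) \theta_k^{(j)}

\frac{\partial J}{\partial \theta_k^{(j)}}
  = \sum_{i:\, r(i,j) = 1}
    \left( (\theta^{(j)})^T x^{(i)} - y^{(i,j)} \right) x_k^{(i)}
```

The first sum runs over the users j who rated movie i, the second over the movies i rated by user j, which is what the accumulation in the loops does.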
Then I suddenly found that the next page of the pdf has hints on how to find the set j : r(i,j) = 1:
Following that pattern, a new version of the code:
% without regularization
% version 2.0
J = sum(sum((X*Theta'-Y).^2.*R))/2;
for i=1:size(R,1)
idx = find(R(i,:)==1); % i.e. find the matching j
X_grad(i,:) += (X(i,:)*Theta(idx,:)'-Y(i,idx))*Theta(idx,:);
endfor
for j=1:size(R,2)
idx = find(R(:,j)==1); % i.e. find the matching i
Theta_grad(j,:) += (X(idx,:)*Theta(j,:)'-Y(idx,j))'*X(idx,:);
endfor
Note that in the final Theta_grad formula the result of the subtraction ( ... - Y) has to be transposed, otherwise an error is reported:
After writing this, run ex8_cofi to get a figure:
I don't know what it shows...
Regularized Cost & Regularized Gradient
Now add regularization; only a small modification of the previous version is needed.
Reference:
% with regularization
J = sum(sum((X*Theta'-Y).^2.*R))/2 + lambda/2*sum(sum(Theta.^2))+ lambda/2*sum(sum(X.^2));
for i=1:size(R,1)
idx = find(R(i,:)==1); % i.e. find the matching j
X_grad(i,:) += (X(i,:)*Theta(idx,:)'-Y(i,idx))*Theta(idx,:)+lambda*X(i,:);
endfor
for j=1:size(R,2)
idx = find(R(:,j)==1); % i.e. find the matching i
Theta_grad(j,:) += (X(idx,:)*Theta(j,:)'-Y(idx,j))'*X(idx,:)+lambda*Theta(j,:);
endfor
The first submission timed out, I don't know why... Submitting again worked:
Finally I ran ex8_cofi. It feels weird that the predicted scores are all 5.0, but all the parts passed again:
The last programming assignment is done!!!