Spam filtering (naive Bayes classifier with the multinomial event model)

Derivation

For the derivation, see https://www.cnblogs.com/qpswwww/p/9308786.html
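
For reference, the parameters that nb_train.m has to estimate are the Laplace-smoothed maximum-likelihood estimates of the multinomial event model. The notation is an assumption made for this summary and does not come from the post itself: m training emails, n_i words in email i, x_j^{(i)} the j-th word of email i, and a vocabulary V.

\[\phi_{k|y=1}=\frac{\sum_{i=1}^m\sum_{j=1}^{n_i}1\{x_j^{(i)}=k\wedge y^{(i)}=1\}+1}{\sum_{i=1}^m 1\{y^{(i)}=1\}\,n_i+|V|}\]

\[\phi_{k|y=0}=\frac{\sum_{i=1}^m\sum_{j=1}^{n_i}1\{x_j^{(i)}=k\wedge y^{(i)}=0\}+1}{\sum_{i=1}^m 1\{y^{(i)}=0\}\,n_i+|V|}\]

\[\phi_y=\frac{\sum_{i=1}^m 1\{y^{(i)}=1\}}{m}\]

The numerator counts how often token k occurs in emails of the class, and the denominator is the total number of words in that class plus the vocabulary size, so each set of estimates sums to 1 over k.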

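Note that a small trick is used here for numerical stability: the posterior is rewritten in terms of sums of logarithms, so that the product of many small probabilities does not underflow.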

\[p(y=1|x)=\frac {(\prod_{i=1}^n\phi_{x_i|y=1})\phi_{y}}{(\prod_{i=1}^n\phi_{x_i|y=1})\phi_y+(\prod_{i=1}^n\phi_{x_i|y=0})(1-\phi_y)}\]

\[=\frac {1}{1+\frac{(\prod_{i=1}^n\phi_{x_i|y=0})(1-\phi_y)}{(\prod_{i=1}^n\phi_{x_i|y=1})\phi_{y}}}\]

\[=\frac {1}{1+\exp(\log\frac{(\prod_{i=1}^n\phi_{x_i|y=0})(1-\phi_y)}{(\prod_{i=1}^n\phi_{x_i|y=1})\phi_{y}})}\]

\[=\frac {1}{1+\exp(\sum_{i=1}^n\log\phi_{x_i|y=0}+\log(1-\phi_y)-\sum_{i=1}^n \log \phi_{x_i|y=1}-\log \phi_{y})}\]
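
To see why the log form matters, here is a minimal MATLAB/Octave sketch that evaluates the first and the last line of the derivation on a made-up email. All numbers and the names `log_phi_y1`, `log_phi_y0`, `p_naive` and `p_stable` are assumptions chosen for illustration, not values from the data set.

```matlab
% Toy email (assumed): 1003 words that are twice as likely under spam and
% 1000 words that are twice as likely under non-spam.
nSpamLike = 1003;
nHamLike = 1000;
log_phi_y1 = [log(2e-4) * ones(nSpamLike, 1); log(1e-4) * ones(nHamLike, 1)];
log_phi_y0 = [log(1e-4) * ones(nSpamLike, 1); log(2e-4) * ones(nHamLike, 1)];
phi_y = 0.5;

% Direct evaluation of the first line: both products underflow to 0,
% so the posterior becomes 0/0.
a = prod(exp(log_phi_y1)) * phi_y;
b = prod(exp(log_phi_y0)) * (1 - phi_y);
p_naive = a / (a + b);          % NaN

% Log-space evaluation of the last line of the derivation.
z = sum(log_phi_y0) + log(1 - phi_y) - sum(log_phi_y1) - log(phi_y);
p_stable = 1 / (1 + exp(z));    % approximately 8/9

fprintf('naive: %g, log-space: %.4f\n', p_naive, p_stable);
```

If the exponent is very large, `exp` overflows to `Inf`, but `1/(1+Inf)` evaluates to 0, which is still the correct limiting value.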

Code

nb_train.m


[spmatrix, tokenlist, trainCategory] = readMatrix('MATRIX.TRAIN');

trainMatrix = full(spmatrix);
numTrainDocs = size(trainMatrix, 1);
numTokens = size(trainMatrix, 2);

% trainMatrix is now a (numTrainDocs x numTokens) matrix.
% Each row represents a unique document (email).
% The j-th column of the row $i$ represents the number of times the j-th
% token appeared in email $i$. 

% tokenlist is a long string containing the list of all tokens (words).
% These tokens are easily known by position in the file TOKENS_LIST

% trainCategory is a (1 x numTrainDocs) vector containing the true 
% classifications for the documents just read in. The i-th entry gives the 
% correct class for the i-th email (which corresponds to the i-th row in 
% the document word matrix).

% Spam documents are indicated as class 1, and non-spam as class 0.
% Note that for the SVM, you would want to convert these to +1 and -1.


% YOUR CODE HERE
n=size(trainMatrix,2);            % vocabulary size |V|
m=length(trainCategory);          % number of training emails
phi_y=sum(trainCategory)/m;       % prior p(y=1)
phi_y1=zeros(n,1);                % per-token counts in spam emails
phi_y0=zeros(n,1);                % per-token counts in non-spam emails
for i=1:m
    if(trainCategory(i)==1)
        for j=1:n
            phi_y1(j)=phi_y1(j)+trainMatrix(i,j);
        end
    else
        for j=1:n
            phi_y0(j)=phi_y0(j)+trainMatrix(i,j);
        end
    end
end

% Laplace smoothing: the denominator is the total number of words in all
% emails of the class plus the vocabulary size n.
sum1=sum(phi_y1);
sum0=sum(phi_y0);
for i=1:n
    phi_y1(i)=(phi_y1(i)+1)/(sum1+n);
    phi_y0(i)=(phi_y0(i)+1)/(sum0+n);
end
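
The Laplace-smoothed multinomial estimates can also be computed without explicit loops. The following is a minimal MATLAB/Octave sketch on a made-up 4 x 3 count matrix; the data and the names `X`, `y`, `count1`, `count0`, `phi1`, `phi0` and `prior` are assumptions for illustration and are not part of the assignment files.

```matlab
% Toy data (assumed): 4 emails over a vocabulary of 3 tokens.
X = [2 0 1;    % spam
     1 1 0;    % spam
     0 3 1;    % non-spam
     0 1 2];   % non-spam
y = [1 1 0 0];

V = size(X, 2);                  % vocabulary size
isSpam = (y(:) == 1);            % logical column vector, one entry per email

% Per-token counts in each class, as V x 1 column vectors.
count1 = sum(X(isSpam, :), 1)';
count0 = sum(X(~isSpam, :), 1)';

% Laplace smoothing: add 1 to every count and V to the class word total.
phi1 = (count1 + 1) / (sum(count1) + V);
phi0 = (count0 + 1) / (sum(count0) + V);
prior = mean(isSpam);            % estimate of p(y = 1)

disp([phi1 phi0]);
```

A quick sanity check for any implementation is that each parameter vector sums to 1, since it is a distribution over the vocabulary.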

nb_test.m



[spmatrix, tokenlist, category] = readMatrix('MATRIX.TEST');

testMatrix = full(spmatrix);
numTestDocs = size(testMatrix, 1);
numTokens = size(testMatrix, 2);

% Assume nb_train.m has just been executed, and all the parameters computed/needed
% by your classifier are in memory through that execution. You can also assume 
% that the columns in the test set are arranged in exactly the same way as for the
% training set (i.e., the j-th column represents the same token in the test data 
% matrix as in the original training data matrix).

% Write code below to classify each document in the test set (ie, each row
% in the current document word matrix) as 1 for SPAM and 0 for NON-SPAM.

% Construct the (numTestDocs x 1) vector 'output' such that the i-th entry 
% of this vector is the predicted class (1/0) for the i-th  email (i-th row 
% in testMatrix) in the test set.
output = zeros(numTestDocs, 1);

%---------------
% YOUR CODE HERE

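n=size(testMatrix,2);
m=size(testMatrix,1);

for t=1:m
    % log_a = sum_i log phi_{x_i|y=1}, log_b = sum_i log phi_{x_i|y=0},
    % accumulated over the words of email t
    log_a=0;
    log_b=0;
    for i=1:n
        if(testMatrix(t,i)==0)
            continue;   % token i does not occur in this email
        end
        log_a=log_a+testMatrix(t,i)*log(phi_y1(i));
        log_b=log_b+testMatrix(t,i)*log(phi_y0(i));
    end
    % posterior p(y=1|x), evaluated in log space to avoid underflow
    p=1/(1+exp(log_b+log(1-phi_y)-log_a-log(phi_y)));
    if(p>=0.5)
        output(t)=1;
    else
        output(t)=0;
    end
end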
%---------------


% Compute the error on the test set
y = full(category);
y = y(:);
error = sum(y ~= output) / numTestDocs;

%Print out the classification error on the test set
fprintf(1, 'Test error: %1.4f\n', error);
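
The classification loop can also be written as two matrix-vector products that compare the unnormalised log-posteriors of the two classes. This is a minimal MATLAB/Octave sketch under assumed toy parameters; the names `phi1`, `phi0`, `prior`, `Xtest`, `score1`, `score0` and `pred` are hypothetical and chosen only for this example.

```matlab
% Toy parameters (assumed), laid out like phi_y1, phi_y0 and phi_y.
phi1 = [0.5; 0.25; 0.25];    % p(token | spam)
phi0 = [0.1; 0.5; 0.4];      % p(token | non-spam)
prior = 0.5;                 % p(spam)

% Two toy test emails as rows of token counts.
Xtest = [3 0 1;
         0 2 2];

% Unnormalised log-posterior of each class, for all rows at once.
score1 = Xtest * log(phi1) + log(prior);
score0 = Xtest * log(phi0) + log(1 - prior);

% p(y=1|x) >= 0.5 holds exactly when score1 >= score0, so exp is not needed.
pred = double(score1 >= score0);

disp(pred');
```

Comparing the two scores directly avoids evaluating `exp` at all, which is why this form is a common choice when only the predicted label is needed and not the probability itself.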

Program output (from the author's run)

Test error: 0.0525

CS229 Machine Learning assignment code: Problem Set 2
