Natural language processing - course project

1. Problem description: Compare the performance of three classifiers on the movie-review rating task, with a training set of 1500 documents and a test set of 500. Naive Bayes has three commonly used models: Gaussian, multinomial and Bernoulli; KNN requires choosing k (preferably by cross-validation); SVM requires selecting the kernel function. Requirements: briefly describe the principle of each model, explain the meaning of each tuned parameter, and summarize the performance of the three classifiers on the movie rating task.

2. Review how to use ROC curve and AUC to evaluate a binary classifier.

 


1.1 Principle:

  Naive Bayes model: assuming the features are conditionally independent given the class, compute the posterior probability of each class and predict the class with the highest posterior probability.
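
  Concretely, for a feature vector x = (x1, ..., xn) the decision rule is: predict the class c that maximizes P(c) · P(x1 | c) · P(x2 | c) · ... · P(xn | c). The Gaussian, multinomial and Bernoulli variants differ only in how the per-feature likelihood P(xi | c) is modeled (continuous values, word counts, or binary word presence, respectively).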

  KNN model: if most of the k nearest neighbors of a test sample in feature space belong to a certain category, the test sample is assigned to that category and is assumed to share the characteristics of that category. The procedure is: fix the number of neighbors k, usually an odd number; compute the distance between the sample to be classified and all labeled training samples using a predetermined distance measure (typically Euclidean distance) and keep the k nearest ones; count the categories among these k neighbors, and the category with the largest count is the predicted class. When the class distribution is unbalanced - one class has many samples and the others few - the majority class is likely to dominate the k neighbors of an unknown sample. This can be improved by weighting the votes by distance: neighbors far from the sample get small weights and neighbors close to it get large weights, as in the sketch below.
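
  A minimal sketch of distance-weighted KNN in sklearn (X_train and y_train are placeholder names for a feature matrix and its labels, not variables from this report):

from sklearn.neighbors import KNeighborsClassifier

# weights='distance' gives closer neighbors larger votes, which mitigates the
# majority-class bias described above (the default, weights='uniform',
# counts every neighbor equally).
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
# knn.fit(X_train, y_train)
# pred = knn.predict(X_test)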

  SVM: the Support Vector Machine is an algorithm that can be used for both classification and regression. By choosing a kernel function and its parameters, the data can be mapped from the original low-dimensional space into a higher-dimensional feature space, turning a problem that is linearly inseparable in the original space into one that is linearly separable in the higher-dimensional space. SVM then finds an optimal separating hyperplane (the decision boundary with the maximum margin), which is why its classification performance is competitive with, and often better than, many other machine learning classification methods.

  Linear kernel function: mainly used for linearly separable data. The feature space has the same dimension as the input space, the kernel has few parameters and is fast to compute. For linearly separable data its classification performance is good, so a common strategy is to try the linear kernel first, see how it performs, and switch to another kernel only if the results are unsatisfactory.

  The polynomial kernel function can map the low-dimensional input space to a high-dimensional feature space, but it has many parameters. When the polynomial degree is high, the entries of the kernel matrix tend toward infinity or toward zero, and the computation becomes numerically difficult and expensive.

  The Gaussian radial basis function is a kernel with strong locality that can map a sample into a higher-dimensional space. It performs well and has fewer parameters than the polynomial kernel, so in most cases, when it is unclear which kernel to use, the Gaussian kernel is the default choice.

  With the sigmoid kernel function, the support vector machine behaves like a multi-layer neural network.
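
  For reference, in sklearn's parametrization (where gamma, coef0 and degree correspond to γ, r and d): linear K(x, z) = x·z; polynomial K(x, z) = (γ x·z + r)^d; RBF K(x, z) = exp(-γ ||x - z||²); sigmoid K(x, z) = tanh(γ x·z + r).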

  When choosing the kernel function, Andrew Ng gave several rules of thumb in his course:

  • If the number of features is about as large as the number of samples, choose logistic regression or an SVM with a linear kernel;
  • If the number of features is small and the number of samples is moderate, choose an SVM with a Gaussian kernel;
  • If the number of features is small and the number of samples is very large, manually construct additional features to reduce the problem to the first case.

 

1.2 Comparing the performance of the three algorithms on the movie rating category

1.2.1 Comparison of the effects of four different kernel functions of SVM:

  

The results show that, among the four SVM kernel functions, the linear kernel performs best and the poly kernel performs worst.

1.2.2 The optimal k value of KNN

  Cross-validation is required; k-fold cross-validation is used here. The k-fold algorithm splits the data into k parts and runs k rounds, with a different part serving as the validation data in each round. k = 10 folds is the usual choice (see the KNN code in section 3.2).

For the training set ([:1500, :]), k is searched over the range 1 to 39.

The results show that the cross-validation accuracy is best at k = 22.

1.2.3 Naive Bayes

The results show that the MultinomialNB model performs best, reaching an accuracy of 0.83.

  

2. Review how to use ROC curve and AUC to evaluate a binary classifier.

The ROC (Receiver Operating Characteristic) curve and AUC are often used to evaluate the quality of a binary classifier.

AUC is one of the main offline evaluation metrics for classification models, especially binary classifiers. Compared with metrics such as precision, recall and F1, AUC has a unique advantage: it does not depend on the specific scores, only on how the samples are ranked, which makes it especially suitable for evaluating ranking problems, such as recommendation ranking. There are two interpretations of AUC: the traditional "area under the curve" interpretation, and an interpretation in terms of ranking ability. For example, an AUC of 0.7 can be roughly understood as: given a randomly chosen positive sample and a randomly chosen negative sample, in 70% of cases the model scores the positive sample higher than the negative one. Under this interpretation, only the relative ordering of positive and negative samples matters, and the absolute scores are irrelevant.

The abscissa (x-axis) of the ROC curve is the false positive rate (FPR), and the ordinate (y-axis) is the true positive rate (TPR), i.e. recall.
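
In terms of the confusion matrix, FPR = FP / (FP + TN), the fraction of negative samples wrongly predicted as positive, and TPR = TP / (TP + FN), the fraction of positive samples correctly predicted (recall).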


The ROC curve has four special points and a line:

The first point, (0,1), i.e. FPR=0, TPR=1, means FN (false negatives) = 0 and FP (false positives) = 0. This is a perfect classifier: it classifies all samples correctly.

The second point, (1,0), i.e. FPR=1, TPR=0, can be similarly analyzed to find that this is the worst classifier because it successfully avoids all correct answers.

The third point, (0,0), i.e. FPR = TPR = 0, means FP (false positives) = TP (true positives) = 0: the classifier predicts every sample as negative.

At the fourth point, (1,1), the classifier predicts every sample as positive.

Points on the diagonal y = x correspond to a classifier that uses a random guessing strategy: its TPR always equals its FPR, so positive samples are ranked no better than chance. For example, (0.5, 0.5) means the classifier randomly labels half of the samples as positive and half as negative. After the above analysis, we can conclude that the closer the ROC curve is to the upper-left corner, the better the performance of the classifier.

AUC (Area Under Curve) is defined as the area under the ROC curve. Obviously, this area is never greater than 1. Since the ROC curve generally lies above the line y = x, the AUC value typically ranges between 0.5 and 1. AUC is used as an evaluation criterion because in many cases the ROC curve itself does not clearly indicate which classifier is better, whereas as a single number, the classifier with the larger AUC is the better one.
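
A minimal sketch of computing the ROC curve and AUC with sklearn (y_test and scores are placeholder names for the true binary labels (0/1) and the positive-class scores of some fitted classifier, e.g. from predict_proba):

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# y_test: true binary labels; scores: predicted probability (or decision score)
# of the positive class -- both assumed to come from a fitted classifier
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

plt.plot(fpr, tpr, label="AUC = {:.3f}".format(auc))
plt.plot([0, 1], [0, 1], "--")  # the y = x random-guess baseline
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.legend()
plt.show()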


 

3. Compare the three classifiers on the tf and tf-idf bag-of-words models, respectively.

3.1 Results on the tf bag-of-words model:

import collections
import numpy as np

# Build a term-frequency (tf) feature matrix: features[n, m] is the count of
# word_features[m] in document n.
features = np.zeros([len(documents), len(word_features)], dtype=float)
for n in range(len(documents)):
    document_words = documents[n][0]
    pdf = collections.Counter(document_words)
    for m in range(len(word_features)):
        if word_features[m] in document_words:
            features[n, m] = pdf[word_features[m]]

SVM:

KNN:

Naive Bayes:

 

3.2 On the TF-IDF bag-of-words model:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(min_df=100, stop_words='english')  # converts the texts into a term-count matrix; element a[i][j] is the count of word j in document i
transformer = TfidfTransformer()  # computes the tf-idf weight of each word
tfidf = transformer.fit_transform(vectorizer.fit_transform(documents_words))  # the vectorizer builds the count matrix, the transformer turns it into tf-idf
word = vectorizer.get_feature_names()  # all words in the bag-of-words vocabulary
features = tfidf.toarray()  # dense tf-idf matrix; element a[i][j] is the tf-idf weight of word j in document i
print(features.shape)

SVM :

from sklearn.svm import SVC

svmmodels = []
svmmodels.append(("linear", SVC(kernel='linear')))
svmmodels.append(("poly", SVC(kernel='poly')))
svmmodels.append(("sigmoid", SVC(kernel='sigmoid')))
svmmodels.append(("rbf", SVC(kernel='rbf')))
for name, model in svmmodels:
    model.fit(train_set1, target_train)
    pred = model.predict(test_set1)
    print("{0} accuracy: {1}".format(name, sum([1 for n in range(len(target_test)) if pred[n] == target_test[n]]) / len(target_test)))

from sklearn.svm import SVC

# refit the best-performing kernel (linear) with probability estimates enabled
svclf = SVC(kernel='linear', probability=True)
svclf.fit(train_set1, target_train)
pred_svc = svclf.predict(test_set1)
print('SVM=', sum(pred_svc == target_test) / len(target_test))
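
Since probability=True was set, svclf.predict_proba is available; assuming the labels are binary, the positive-class column can serve as the score for the ROC / AUC evaluation reviewed in section 2 (a small follow-up sketch):

# column 1 holds the probability of svclf.classes_[1] (the positive class when the labels are binary)
scores_svc = svclf.predict_proba(test_set1)[:, 1]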

 

KNN

# KNN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# 10-fold cross-validation over k = 1..39 to select the number of neighbors
result = []
for k in range(1, 40):
    kfold = KFold(n_splits=10)
    knnclf = KNeighborsClassifier(n_neighbors=k)  # the sklearn default is n_neighbors=5
    cv_result = cross_val_score(knnclf, train_set1, target_train, cv=kfold)
    print("k = {0}; cross_val_score: {1}\n".format(k, cv_result.mean()))
    result.append(cv_result.mean())
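
One way to read the best k off the result list (a small follow-up sketch; the +1 accounts for the loop starting at k = 1):

import numpy as np
best_k = int(np.argmax(result)) + 1
print("best k =", best_k, "mean CV accuracy =", max(result))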

Naive Bayes:

 

# Naive Bayes
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB

nbmodels = []
nbmodels.append(("GaussianNB", GaussianNB()))
nbmodels.append(("MultinomialNB", MultinomialNB()))
nbmodels.append(("BernoulliNB", BernoulliNB()))

for name, model in nbmodels:
    model.fit(train_set1, target_train)
    pred = model.predict(test_set1)
    print("The prediction accuracy of the {0} model: {1}".format(name, sum([1 for n in range(len(target_test)) if pred[n] == target_test[n]]) / len(target_test)))

 

 3.3 Summary of results:

 Comparing the results shows that although the accuracy of KNN on the TF-IDF model is noticeably higher than on the TF model, it is still weaker than SVM and Naive Bayes; the Naive Bayes classifier performs similarly on the two models, and even degrades slightly on TF-IDF; the results obtained by the SVM classifier are relatively stable.

 

4. In the movie rating classifier, compare the classification performance of different distance functions in KNN (Hamming distance, cosine distance, Euclidean distance).
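
A minimal sketch of how the three distances could be compared, reusing train_set1 / test_set1 and the k found in section 1.2.2 (algorithm='brute' is needed for the cosine metric; note that Hamming distance is normally applied to binary presence features rather than tf-idf weights):

from sklearn.neighbors import KNeighborsClassifier

for metric in ("hamming", "cosine", "euclidean"):
    knn = KNeighborsClassifier(n_neighbors=22, metric=metric, algorithm="brute")
    knn.fit(train_set1, target_train)
    pred = knn.predict(test_set1)
    acc = sum([1 for n in range(len(target_test)) if pred[n] == target_test[n]]) / len(target_test)
    print("{0} accuracy: {1}".format(metric, acc))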

 
