Machine learning classification example - SVM (modified)/Decision Tree/Naive Bayes

20180427-28 Notes, 30 Summary

It's already May: the graduation project is wrapping up and it's time to prepare the thesis. At the meeting the day before yesterday, my advisor said he hopes I will work on semantic analysis in the future. During the meeting we also discussed a senior labmate's work on knowledge reasoning over knowledge graphs, which is also very interesting, though probably more complicated; I'd like to learn more about it if I have time. To be honest, I still don't know exactly how to present my results. . .
Probably with charts. I'll make one first, show it to my advisor, and then spend two weeks writing the thesis. The defense is on the 24th/25th, so in June I can go out and have fun, hahaha.

1. Work

The work falls into three parts: modifying the SVM code (simplifying it), drawing charts, and modifying the DT and Naive Bayes code.

1. Modify the code of the SVM part

Since we need to predict scores for three labels and print all three at once, we have to train three separate models and feed the text to be predicted into each of the three models.

The code becomes:


    x = vec[:, :-3]     # feature columns
    y = vec[:, -3]      # label 1
    y2 = vec[:, -2]     # label 2
    y3 = vec[:, -1]     # label 3

    # read the (already vectorized) text to be predicted
    f = open("examout.txt", "r")
    newl = f.read()
    newl = list(map(str, newl.strip().split(',')))
    newv = np.array(newl)
    new_test_x = newv[:]

    # Model 1
    train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2)
    clf1 = SVC(kernel='linear', C=0.4)
    clf1.fit(train_x, train_y)
    pred_y = clf1.predict(test_x)
    new_pred_y1 = clf1.predict(new_test_x.reshape(1, -1))
    npy1 = int(new_pred_y1[0])

    # Model 2
    train_x2, test_x2, train_y2, test_y2 = train_test_split(x, y2, test_size=0.2)
    clf2 = SVC(kernel='linear', C=0.4)
    clf2.fit(train_x2, train_y2)
    pred_y2 = clf2.predict(test_x2)
    new_pred_y2 = clf2.predict(new_test_x.reshape(1, -1))
    npy2 = int(new_pred_y2[0])

    # Model 3
    train_x3, test_x3, train_y3, test_y3 = train_test_split(x, y3, test_size=0.2)
    clf3 = SVC(kernel='linear', C=0.4)
    clf3.fit(train_x3, train_y3)
    pred_y3 = clf3.predict(test_x3)
    new_pred_y3 = clf3.predict(new_test_x.reshape(1, -1))
    npy3 = int(new_pred_y3[0])

This results in 3 models.

As in the previous code, we can get each model's precision, recall, and F1 score through the classification_report(test_y, pred_y) function:

                 precision    recall  f1-score   support

              1       0.88      0.93      0.90       187
              2       0.00      0.00      0.00        11
              3       0.15      0.18      0.17        11
              4       0.00      0.00      0.00         7
              5       0.00      0.00      0.00         2

    avg / total       0.76      0.81      0.78       218

Each of the three models has its own (precision, recall, f1-score) triple. We want to extract the values from the last row ('avg / total') of each report and finally plot them together in one chart.

So the question is: how do we get the values out of classification_report()? There are two ways:

  • (1) Output content to text, then read from text
  • (2) Read the print output stream
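(As an aside: if your scikit-learn is 0.20 or newer, there is a third way that needs neither workaround, since classification_report accepts output_dict=True and returns a nested dict you can index directly. A minimal sketch with made-up toy labels, just to show the shape of the result; note that newer versions name the averaged row 'weighted avg' rather than 'avg / total'.)

```python
from sklearn.metrics import classification_report

# Toy ground-truth and predictions, for illustration only
test_y = [1, 1, 2, 2, 3]
pred_y = [1, 1, 2, 3, 3]

# output_dict=True (scikit-learn >= 0.20) returns a nested dict
report = classification_report(test_y, pred_y, output_dict=True)

# The averaged row is under 'weighted avg' (the old 'avg / total' row)
avg = report['weighted avg']
fs = [avg['precision'], avg['recall'], avg['f1-score']]
print(fs)
```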

I tried both methods. The first one is more familiar, so I wrote it first:

    ff = open("crout.txt", "w")
    ff.write(classification_report(test_y, pred_y))
    ff.close()
    ff2 = open("crout.txt", "r")
    cr = []
    for line in ff2.readlines():
        cr.append(list(map(str, line.strip().split(','))))
    # take the whole last row: ['avg / total       0.89      0.92      0.90       218']
    ss = str(cr[8])

    # use a regular expression to pick out the decimals
    fs1 = []
    for i in range(3):
        s = re.findall(r"\d+(\.\d+)?", ss)[i]
        s = '0' + s       # findall returns only the group, e.g. '.89'
        s = float(s)
        fs1.append(s)
    print(fs1)
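A side note on why the leading '0' is needed: when the pattern contains a capture group, re.findall returns the group rather than the whole match, so each hit is just the fractional part like '.76' (and an empty string for integers such as the support counts). A quick stdlib sketch with a sample report row:

```python
import re

line = 'avg / total       0.76      0.81      0.78       218'

# findall with a capture group returns only the group text
groups = re.findall(r"\d+(\.\d+)?", line)
print(groups)   # ['.76', '.81', '.78', '']

# Prepending '0' turns '.76' into a parseable float
values = [float('0' + g) for g in groups[:3]]
print(values)   # [0.76, 0.81, 0.78]
```

Using a pattern without a group, such as r"\d+\.\d+", would return the full decimals directly and skip the '0'-prepending step.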

Obviously, with three models this approach needs three text files, and writing them out and reading them back feels cumbersome and clunky... So instead, let's capture the print output stream to get the values. I searched Baidu and found the answer on Baidu Knows: how to get the value that a Python print statement writes to the console.

Simplify the original code and wrap it in a function instead of writing it in the main body of the program:

    class TextArea(object):
        def __init__(self):
            self.buffer = []
        def write(self, *args, **kwargs):
            self.buffer.append(args)

    def mf(L=[]):
        # l is the captured report text (set in the main body)
        matches = re.findall(r"\d+(\.\d+)?", l)
        for i in range(90):
            s = float('0' + matches[i])   # findall returns only the group, e.g. '.76'
            # By inspection, among the decimals the regex captures, the three
            # models' P/R/F values sit at positions 26-28, 56-58 and 86-88,
            # so append just those to L
            if 26 <= i <= 28 or 56 <= i <= 58 or 86 <= i <= 88:
                L.append(s)
        return L

Then, in the main body of the program:

    stdout = sys.stdout
    sys.stdout = TextArea()          # redirect print output into our buffer
    print(classification_report(test_y, pred_y))
    print(classification_report(test_y2, pred_y2))
    print(classification_report(test_y3, pred_y3))

    text_area, sys.stdout = sys.stdout, stdout   # restore stdout

    l = str(text_area.buffer)
    L = []
    L = mf(L)
    # result: L = [0.76, 0.81, 0.78, 0.91, 0.94, 0.92, 0.73, 0.75, 0.74]
    fs1 = []
    fs2 = []
    fs3 = []
    for i in range(0, 9, 3):
        fs1.append(L[i])
        fs2.append(L[i + 1])
        fs3.append(L[i + 2])
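For reference, the standard library offers a tidier way to capture printed output than swapping sys.stdout by hand: contextlib.redirect_stdout with an io.StringIO buffer. A sketch of the same idea (the report row below is a stand-in for the real classification_report output):

```python
import io
import re
from contextlib import redirect_stdout

buf = io.StringIO()
with redirect_stdout(buf):
    # stands in for print(classification_report(test_y, pred_y))
    print('avg / total       0.76      0.81      0.78       218')

captured = buf.getvalue()
# Grab only the decimals; \d+\.\d+ has no group, so no '0'-prepending needed
scores = [float(s) for s in re.findall(r"\d+\.\d+", captured)]
print(scores)   # [0.76, 0.81, 0.78]
```

The context manager restores sys.stdout automatically, even if an exception is raised inside the with block.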

2. Draw the graph

Now we have each model's precision, recall, and f1-score (stored in fs1, fs2, and fs3 respectively), as well as the three predicted scores for the input text: npy1, npy2, and npy3.

(1) Prediction chart

    x = np.arange(3)
    data = [npy1, npy2, npy3]

    # labels: democratic system / democratic freedom / democratic supervision
    labels = ['民主制度', '民主自由', '民主监督']
    plt.ylim(ymax=5.5, ymin=0)
    plt.ylabel("评分")   # score
    plt.title("预测")    # prediction
    plt.bar(x, data, alpha=0.9, tick_label=labels)
    plt.show()

Result:

This is much more intuitive. Last time I could only give a single overall score of "5"; now I get all three scores.

(2) Comparing the models' precision, recall, and f1-score values

    total_width, n = 0.6, 3
    x = np.arange(n)
    width = total_width / n
    x = x - (total_width - width) / 2
    plt.ylim(ymax=1.3, ymin=0)
    plt.bar(x, fs1, alpha=0.8, width=width, label='Precision')
    plt.bar(x + width, fs2, alpha=0.8, width=width, label='Recall', tick_label=labels)
    plt.bar(x + 2 * width, fs3, alpha=0.8, width=width, label='F1-score')
    plt.legend()
    plt.show()
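The offset arithmetic above just centers each group of three bars on its tick: with total_width = 0.6 split across n = 3 bars, each bar is 0.2 wide and the first bar starts 0.2 to the left of the group's center. A quick pure-Python check of the computed positions, mirroring the NumPy arithmetic:

```python
total_width, n = 0.6, 3
width = total_width / n             # 0.2: width of one bar
offset = (total_width - width) / 2  # 0.2: shift so the group is centered

# positions of the three bars in the group whose tick sits at 0
positions = [0 - offset + k * width for k in range(n)]
print(positions)   # approximately [-0.2, 0.0, 0.2]
```

The three positions are symmetric around the tick, which is why the group looks centered under its label.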

Got the following result:

3. Modify the code of DT and NBY

(1) Decision Tree

The decision tree code is exactly the same as the SVM code except for the classifier name, so let's just look at the results directly:

Hmm...accuracy is pretty good.

(2) Naive Bayes

Naive Bayes uses the same code as the SVM, but prediction throws an error at this line:

    new_pred_y1 = clf.predict(new_test_x.reshape(1,-1))

Error:

TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('<U32') dtype('<U32') dtype('<U32')

What's going on? It turns out that new_test_x needs to be converted to float64 (see: python TypeError: ufunc 'subtract' did not contain a loop with signature matching types dtype('S32')). Just add a conversion to the original code:

    new_test_x = new_test_x.astype('float64')
    new_pred_y1 = clf.predict(new_test_x.reshape(1,-1))
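The root cause is that the file is read as text, so np.array produces a string array (a Unicode dtype like '<U32'), and the classifier's arithmetic then tries to subtract strings. A minimal NumPy sketch of the problem and the fix, using toy values rather than the real feature vector:

```python
import numpy as np

# Values read from a text file arrive as strings
newl = '0.5,1.0,2.5'.strip().split(',')
newv = np.array(newl)
print(newv.dtype)        # a Unicode string dtype such as <U3

# Arithmetic on a string array fails; converting first makes it numeric
new_test_x = newv.astype('float64')
print(new_test_x - 1.0)  # now subtraction works
```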

So let's take a look at the results:


The Naive Bayes results are worse. . . The 'democratic system' label has the lowest precision and recall. However, the given text still scores highest on 'democratic system', so the prediction itself is not bad.

2. Summary and reflection

3. The next task

  • Manually pick some test samples (not drawn from the training set) and measure the accuracy when a scoring error of ±1 is allowed. The posts I tested before came from the training set; the accuracy reached 95%, which is high but not meaningful.
  • Learn thesis writing.
