Machine Learning Classification Example - Data Testing

20180502 study record

This morning I mainly revised the foreign-language translation and adjusted the formatting of the formulas. In the afternoon I mainly tested the data. The labels are scores from 1 to 5. Previously, SVM and the decision tree reached accuracies above 70%, while the naive Bayes accuracy was very low. However, those figures required the prediction to match the label exactly, which keeps the accuracy relatively low, so this time I test the accuracy allowing an error of ±1.
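
To make the evaluation criterion concrete, here is a minimal sketch (with made-up labels, not data from the experiment) of the difference between exact-match accuracy and accuracy with a ±1 tolerance:

import numpy as np

# hypothetical ground-truth scores (1~5) and predictions for 5 samples
true_y = np.array([3, 5, 1, 4, 2])
pred_y = np.array([4, 5, 3, 4, 1])

exact_acc = np.mean(pred_y == true_y)                 # prediction must equal the label
tolerant_acc = np.mean(np.abs(pred_y - true_y) <= 1)  # off-by-one counts as correct

print(exact_acc, tolerant_acc)   # 0.4 0.8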

1. Work (for SVM)

Open MLdata and append the following code at the bottom.

1 Randomly select the test set

import random

# draw 350 random indices (duplicates are possible) from the 1090 samples
Lr=[]
for i in range(350):
    Lr.append(random.randint(0, 1089))
# deduplicate and sort the indices
Lr=list(set(Lr))
Lr.sort()
# if more than 300 distinct indices were drawn, drop the surplus from the end
# so that exactly 300 remain
delta=len(Lr)-300
for i in range(delta):
    Lr.pop()

The random numbers drawn may repeat, so I draw 350 of them in the hope of ending up with more than 300 distinct values. set() removes the duplicates, and the list is then trimmed down to exactly 300 elements.
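
As an aside, the same 300 unique indices could be drawn in one step with random.sample, which never returns duplicates; a minimal sketch of that alternative (not the code used here):

import random

# 300 distinct indices in 0~1089, no deduplication or trimming needed
Lr = sorted(random.sample(range(1090), 300))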

file2=open('test2.txt','w')
for j in Lr:
    vec=L3[j]
    # str(vec) gives "[a, b, ...]"; stripping the brackets leaves a comma-separated line
    i = str(vec).strip('[').strip(']')
    file2.write(i+'\n')
file2.close()

As usual, write it out to a file.

2 Generate a training set

Ll=list(range(0,1090))
di=[v for v in Ll if v not in Lr]

Generate the indices 0~1089, then take the difference between the complete set and the test set; the result is the training set, di.
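
The list comprehension checks every element of Ll against Lr; with 1090 indices that is quick enough, but the same difference could also be taken with sets. A sketch of an equivalent alternative:

# training indices = all indices minus the test indices
di = sorted(set(range(1090)) - set(Lr))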

file3=open('test3.txt','w')
for j in di:
    vec=L3[j]
    i = str(vec).strip('[').strip(']')
    file3.write(i+'\n')
file3.close()

Same as before, write it to a file.

3 Prediction test (the code is fairly simple: predict on the 300 selected test samples)


import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split


with open("test3.txt","r") as file:
    result=[]
    for line in file.readlines():
        result.append(list(map(str,line.strip().split(','))))
    vec = np.array(result)
    x = vec[:,:-3]      # feature columns
    y = vec[:,-3]       # first label  (democratic system)
    y2=vec[:,-2]        # second label (democratic freedom)
    y3=vec[:,-1]        # third label  (democratic supervision)
    
    f = open("test2.txt","r")
    newl=[]
    for line in f.readlines():
        newl.append(list(map(str,line.strip().split(','))))
    newv = np.array(newl)
    new_test_x = newv[:,:-3]
    new_test_y1=newv[:,-3]
    new_test_y2=newv[:,-2]
    new_test_y3=newv[:,-1]

###############################################################################
# Model training
###############################################################################
  
#    Model 1
    train_x,test_x,train_y,test_y = train_test_split(x,y,test_size=0.3)
    clf1 = SVC(kernel='linear',C=0.4)
    clf1.fit(train_x,train_y)
    pred_y = clf1.predict(test_x)
    new_pred_y1 = clf1.predict(new_test_x)
###############################################################################
#   Model 2
    train_x2,test_x2,train_y2,test_y2 = train_test_split(x,y2,test_size=0.2)
    clf2= SVC(kernel='linear',C=0.4)
    clf2.fit(train_x2,train_y2)
    pred_y2 = clf2.predict(test_x2)
    new_pred_y2 = clf2.predict(new_test_x)
###############################################################################    
#   Model 3
    train_x3,test_x3,train_y3,test_y3 = train_test_split(x,y3,test_size=0.3)
    clf3= SVC(kernel='linear',C=0.4)
    clf3.fit(train_x3,train_y3)
    pred_y3 = clf3.predict(test_x3)    
    new_pred_y3 = clf3.predict(new_test_x)
###############################################################################    
# Prediction analysis
        
    testnum=len(new_test_y1)    # number of samples written to test2.txt (300 here)
    count1=0
    count2=0
    count3=0
    for i in range(testnum):
        py1=int(new_pred_y1[i])
        py2=int(new_pred_y2[i])
        py3=int(new_pred_y3[i])

        cy1=int(new_test_y1[i])
        cy2=int(new_test_y2[i])
        cy3=int(new_test_y3[i])

        if abs(py1-cy1)<2:
            count1=count1+1
        if abs(py2-cy2)<2:
            count2=count2+1
        if abs(py3-cy3)<2:
            count3=count3+1            
#    
    p1=count1/testnum
    p2=count2/testnum
    p3=count3/testnum
    print(p1,p2,p3)

Finally, three results are obtained, "0.8566666666666667 0.96 0.89", representing the accuracy of the three labels (subtopics) with an error tolerance of ±1:

  • Democratic system: 0.86
  • Democratic freedom: 0.96
  • Democratic supervision: 0.89
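
For reference, the counting loop in the script could be wrapped into a small reusable helper; a sketch, assuming arrays of integer scores (or their string form) like the prediction and label columns produced above:

import numpy as np

def tolerant_accuracy(pred, true, tol=1):
    # fraction of predictions whose score is within `tol` of the true score
    pred = np.asarray(pred, dtype=int)
    true = np.asarray(true, dtype=int)
    return np.mean(np.abs(pred - true) <= tol)

# e.g. p1 = tolerant_accuracy(new_pred_y1, new_test_y1)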

2. Summary and reflection

May Day has just passed, and after three days off I am still in a daze. It is time to write the thesis properly, but one thing still worries me: my initial corpus was handed to me directly by senior students, already cleaned and formatted. Strictly speaking I should have written my own crawler to collect it from the Internet, but I have not done that work and realistically have no time to do it now. Still, it is necessary to understand how it works, even if it is too late to write the code; hopefully the question will not come up at the graduation defense. In short, spend the next two days learning web crawlers first.

3. The next task

  • Learn web crawling
  • Learn thesis writing.
