Machine Learning Classification Examples (sklearn)

Machine Learning Classification Example - SVM

20180423-20180426 study notes

On the 25th I went to the first Digital China Exhibition and did not study (I wanted to be lazy). Since it was the last day, there seemed to be fewer things on display, because it closed to the public on the 24th... But the experience area was open. I mainly tried VR; the other exhibits were either boring or incomprehensible. When Ma Yun and Ma Huateng came, I could not watch because of a meeting, so I lost the chance to ask for their autographs, alas.

1. Work

The main work was learning some classifier theory and then obtaining results through practice. The results are not perfect; how they fall short is discussed below.

1.1 Learning Theory:

My knowledge in this area was limited to what I understood from the undergraduate "Artificial Intelligence" and "Data Mining" courses, which was enough for exams but genuinely troublesome to put to use. Although it is not necessary to understand every formula and every number, it is still necessary to understand the model's input, output, purpose, and core idea.
The top-voted answer on Zhihu, by "Jianzhi", vividly compares the various parts of the SVM model:

Later, boring adults called these balls "data", the stick a "classifier", the maximum-gap trick "optimization", slapping the table "kernelling", and the sheet of paper a "hyperplane".

His description gives a general sense of how SVM operates.

The second highly-voted answer introduces the whole model in further detail. Starting from an "apples vs. bananas" classification problem, its explanation of the SVM principle and of the various parameters is also well written.

Zhihu link: What does support vector machine (SVM) mean?
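The analogy above can be made concrete with a tiny toy example (my own illustration, not from the answers): the "balls" are data points, the "stick" is the separating line that SVC finds with the maximum margin, and the points that pin down the margin are the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Two clearly separated groups of points ("balls" of two colors)
X = np.array([[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear')  # a straight "stick" as the classifier
clf.fit(X, y)

print(clf.support_vectors_)                   # the points pinning down the margin
print(clf.predict([[0.5, 0.5], [4.5, 4.5]]))  # -> [0 1]
```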

1.2 Learning examples:

After reading the theory I was still confused. This project is based on scikit-learn v0.19.1, so I went to the official documentation to check how to set parameters, call the functions, and format inputs and outputs.

Official documentation: Support Vector Machines; Chinese documentation: Support Vector Machines, SVM (the English is understandable without the Chinese version, but some parameters are still hard to grasp)

I found a blog that is very well written: The use of support vector machine SVM in Python (with examples).
It explains the meaning of each parameter in detail. Following the blogger step by step, you can classify correctly and draw the figure, and it looks pretty:

But I am working with multi-dimensional features, so this "two-feature" plot does not suit my case. (The teacher asked for a chart to express the classification results, but I have not yet figured out how to present it.)
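One possible workaround for visualizing more than two features (an idea of mine, not something from the blog) is to project the feature matrix to 2-D with PCA and color the scatter by predicted label. A sketch with random stand-in data shaped like test.txt:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(60, 200)           # 60 samples, 200 features (as in test.txt)
y = rng.randint(1, 6, size=60)  # labels on a 1-5 relevance scale

clf = SVC(kernel='linear', C=0.4).fit(X, y)

# Project the 200-dimensional features down to 2-D for plotting only
X2 = PCA(n_components=2).fit_transform(X)

plt.scatter(X2[:, 0], X2[:, 1], c=clf.predict(X), cmap='viridis')
plt.colorbar(label='predicted label')
plt.savefig('svm_pca_scatter.png')
```

The PCA axes have no direct meaning for the classifier; this only shows how the predicted labels cluster, not the actual decision boundary.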

1.3 Start coding:

Read the file "test.txt", which stores a series of features and three label values. My earlier post "non-democracy related post processing" shows the format of test.txt: 200 features and 3 label values per line.

(1) Import the required packages

import numpy as np
from sklearn.svm import SVC
# train_test_split lives in sklearn.model_selection since 0.18;
# sklearn.cross_validation is deprecated and was later removed
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

(2) Read the file and store it

with open("test.txt","r") as file:
    ty = -3  # which label column to take; -1 would take the last column
    result = []
    for line in file.readlines():
        result.append(list(map(str, line.strip().split(','))))

    vec = np.array(result)
    x = vec[:, :-3]  # all columns except the last three, i.e. all feature columns
    y = vec[:, ty]   # the label column

(3) Divide the test set and the training set

    train_x,test_x,train_y,test_y = train_test_split(x,y,test_size=0.2)

(4) Model training and prediction

    clf = SVC(kernel='linear',C=0.4)
    clf.fit(train_x,train_y)
    
    pred_y = clf.predict(test_x)
    print(classification_report(test_y,pred_y))

Among them, the kernel parameter is covered in the reference documentation. The constant C is a value obtained by experience; there is no specific formula, though there are some guidelines for setting it. The Zhihu answerer "Gu Lingfeng" introduced C like this:

In principle C can be any number greater than 0, chosen as needed.
The larger C is, the more attention the optimization pays to the total error, and the stronger the demand to reduce that error, even at the cost of a smaller margin.

  • As C tends to infinity, misclassified samples are no longer allowed; this is the hard-margin SVM problem.
  • As C tends to 0, we no longer care whether the classification is correct and only require the margin to be as large as possible; then no meaningful solution exists and the algorithm does not converge.

Zhihu link: On understanding the constant C in SVM?
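The trade-off described above can be observed empirically. A small sketch (toy data from make_classification, not the notes' test.txt) comparing cross-validated accuracy for a few values of C:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 200 samples, 20 features, 2 classes
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

for C in [0.01, 0.4, 10, 1000]:
    # Large C: few training errors tolerated (towards hard margin);
    # small C: wide margin, more tolerance for misclassified points
    scores = cross_val_score(SVC(kernel='linear', C=C), X, y, cv=5)
    print("C=%-6g mean accuracy=%.3f" % (C, scores.mean()))
```

On real data the best C is usually found by exactly this kind of grid search rather than by formula.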

(5) Analysis of training results

For the function "classification_report(test_y, pred_y)" and the specific meaning of precision, recall, and f1-score, see the two blogs for details.

Simply put: precision is the fraction of samples predicted as a class that truly belong to it, recall is the fraction of a class's true samples that the classifier found, and f1-score is the harmonic mean of the two.
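The relationship among the three metrics can be checked by hand against classification_report. A tiny example with made-up labels (not the notes' data):

```python
from sklearn.metrics import (classification_report, precision_score,
                             recall_score, f1_score)

test_y = [1, 1, 1, 0, 0, 0]
pred_y = [1, 1, 0, 0, 0, 1]

# For class 1: TP=2, FP=1, FN=1
p = precision_score(test_y, pred_y)  # TP / (TP + FP) = 2/3
r = recall_score(test_y, pred_y)     # TP / (TP + FN) = 2/3
f = f1_score(test_y, pred_y)         # 2*p*r / (p + r)
print(p, r, f)
print(classification_report(test_y, pred_y))
```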

(6) Test text analysis

  • The task: given a piece of text, use jieba to extract keywords, selecting a fairly large number of them to try to cover the full text. (Whether keywords or plain word segmentation should be used here is debatable.)
  • Then compare them against the keyword library to obtain the text's feature vector.

The code is relatively simple, similar to the previous one:

from openpyxl import load_workbook
import jieba.analyse

# Read the keyword library from the first column of sta.xlsx
wr = load_workbook('sta.xlsx')
osheet = wr.active
orow = osheet.max_row
print(orow)

# Keep the first num keywords as the feature dictionary
num = 200
testL = []
for tempc, cell in enumerate(osheet["A"]):
    if tempc >= num:
        break
    testL.append(cell.value)

with open("example.txt","r") as f:
    content = f.read()
    print(content)
    # Extract up to 1000 keywords with jieba, trying to cover the full text
    keywords = jieba.analyse.extract_tags(content, topK=1000)
    print(keywords)
    print(testL)

    # Build a 0/1 feature vector: 1 if a library keyword appears
    # among the text's extracted keywords, 0 otherwise
    L2 = [1 if g in keywords else 0 for g in testL]
    print(L2)

    with open('examout.txt','w') as out:
        # Write as comma-separated values without the brackets
        out.write(str(L2).strip('[').strip(']'))

A piece of text in example.txt thus becomes the feature vector stored in examout.txt.

(7) Text Feature Test

    # Read back the feature vector written in step (6)
    f = open("examout.txt","r")
    newl = f.read()
    f.close()
    newl = list(map(str, newl.strip().split(',')))
    newv = np.array(newl)
    new_test_x = newv[:]
    print(new_test_x)
    # predict() expects a 2-D array, hence reshape(1, -1)
    new_pred_y = clf.predict(new_test_x.reshape(1, -1))
    print(new_pred_y)

Output result:

Note that we are testing on a single label column. Above,

ty=-3
y = vec[:,ty]

means that only the third-to-last column, the first subtopic "Democracy", is used for training. The text is classified and receives a score of 5 under the "Democracy" label, which suggests it is very relevant.

  • My rating scale is 1-5, with 1 meaning irrelevant and 5 meaning very relevant.

  • Note that "clf.predict(new_test_x.reshape(1,-1))" must receive a two-dimensional array. The feature vector has only one dimension, so reshape(1,-1) is required for conversion; otherwise an error is raised.

A note: only a small amount of text has been tested here; testing a large number of texts will take time.
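Batch testing could reuse the file-reading step above as a small helper. A sketch (the helper name and demo file are my own illustration; in practice it would be called in a loop over many feature files, feeding each vector to clf.predict):

```python
import numpy as np

def load_feature_vector(path):
    """Read one comma-separated 0/1 feature line, as written to examout.txt."""
    with open(path, "r") as f:
        return np.array(f.read().strip().split(','), dtype=float)

# Demo: write a tiny fake feature file and read it back
with open("demo_feats.txt", "w") as f:
    f.write("1, 0, 1, 0")

vec = load_feature_vector("demo_feats.txt")
print(vec.reshape(1, -1))  # shape (1, 4), ready to pass to clf.predict
```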

2. Summary and reflection

  • I have implemented SVM and have some understanding of it, but I still do not understand its internal principles, because I have not had time to read the related papers.
  • I now have a basic grasp of scikit-learn and can read text and tables proficiently.
  • SVM and DecisionTree perform acceptably, but Naive Bayes classifies relatively poorly... (All three classifiers are done; due to space only SVM is written up here, and the rest will follow in a couple of days.)

3. The next task

How to display the results? What to show? In what form?

Because we want to compare the accuracy of the three classifiers, I will first learn how to draw charts in Python. Tomorrow I will show the seniors the work done so far.
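The planned comparison chart could look like the sketch below. The accuracy numbers here are placeholders I made up; the real values would come from classification_report (or accuracy_score) for each classifier:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

names = ['SVM', 'DecisionTree', 'NaiveBayes']
accuracy = [0.82, 0.78, 0.65]  # placeholder numbers, not real results

plt.bar(names, accuracy)
plt.ylabel('accuracy')
plt.title('Classifier comparison (placeholder data)')
plt.savefig('classifier_comparison.png')
```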
