"Machine Learning Practical Notes--Part 1 Classification Algorithm: KNN Algorithm 1"

    In supervised learning, we need only input a given set of samples, and the machine can infer the likely values of the specified target variable. Supervised learning generally uses two types of target variables: nominal and numeric. Nominal target variables take values from a finite set of categories, while numeric target variables can take values from an infinite range of numbers.

Classification algorithms: Chapter 2: the k-nearest neighbor algorithm, which classifies using distance measurements;

                Chapter 3: decision trees;

                Chapter 4: discusses building classifiers with probability theory;

                Chapter 5: logistic regression; finds the optimal parameters for correctly classifying the original data, using optimization algorithms commonly applied when searching for the best parameters;

                Chapter 6: support vector machines;

                Chapter 7: meta-algorithms: AdaBoost.


The basics of using the NumPy function library:

    >>> random.rand(4,4)

    constructs a 4*4 array of random values (this assumes from numpy import *).

    Tip: the NumPy library provides two different data types, matrix and array, both of which can be used to process numeric elements arranged in rows and columns. Although they look similar, performing the same mathematical operations on the two types may produce different results.

    Use the mat() function to convert an array into a matrix; the .I operator computes the matrix inverse.
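    A quick interactive sketch of the difference (assuming from numpy import * as above; the variable names here are just illustrative):

    >>> randArr = random.rand(4,4)    # a 4*4 array of random values
    >>> randMat = mat(randArr)        # convert the array to a matrix
    >>> randArr * randArr             # arrays: element-wise multiplication
    >>> randMat * randMat             # matrices: true matrix multiplication
    >>> invMat = randMat.I            # .I computes the matrix inverse
    >>> randMat * invMat              # approximately the 4*4 identity matrix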


    Below we introduce the first classification algorithm in this book: the k-nearest neighbor algorithm (KNN).

    Simply put, the k-nearest neighbor algorithm classifies by measuring the distances between different feature values.

    Advantages: high accuracy, insensitive to outliers, no assumptions about the input data;

    Disadvantages: high computational complexity and high space complexity;

    Applicable to: numeric and nominal values.

    Working principle: each entry in the training sample set has a label, that is, we know which class each entry belongs to. After new data without a label is input, each feature of the new data is compared with the features of the entries in the sample set, and the algorithm extracts the class labels of the entries (the nearest neighbors) whose features are most similar. In general, we take only the top k most similar entries in the dataset, which is where the k in k-nearest neighbors comes from (k is usually no greater than 20). Finally, we choose the class that occurs most often among those k most similar entries as the class of the new data.

    The general flow of the k-nearest neighbor algorithm:

(1) Collect data: any method may be used;

(2) Prepare data: distance calculations require numeric values, preferably in a structured data format;

(3) Analyze data: any method may be used;

(4) Train the algorithm: this step does not apply to the k-nearest neighbor algorithm;

(5) Test the algorithm: calculate the error rate;

(6) Use the algorithm: first input the sample data and structured output results, then run the k-nearest neighbor algorithm to determine which class the input data belongs to, and finally perform subsequent processing on the computed classification result.


1. Preparation: import data using Python

Create a KNN.py file:

from numpy import *
import operator
"""
NumPy is the scientific computing package; the k-nearest neighbor algorithm
uses the sorting functions provided by the operator module.
"""
def createDataset():
    # note the nested [] inside (): each inner list is one stored sample
    group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A','A','B','B']
    return group,labels

For convenience we define the createDataset() function, which creates the dataset and its labels. Save the KNN.py file, then import KNN in a new module.
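A quick check in the interactive interpreter (assuming KNN.py is on the Python path; the exact output formatting may vary with the NumPy version):

>>> import KNN
>>> group,labels = KNN.createDataset()
>>> group
array([[ 1. ,  1.1],
       [ 1. ,  1. ],
       [ 0. ,  0. ],
       [ 0. ,  0.1]])
>>> labels
['A', 'A', 'B', 'B']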


Here there are four data entries, each with two known attributes or feature values. Each row of the group matrix contains a different data entry, and the labels vector contains a number of elements equal to the number of rows of the group matrix. The plot below shows the four data points with their labels.
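The original shows this as a figure; here is a minimal matplotlib sketch that reproduces it (matplotlib is an assumed extra dependency, not part of the original listing):

import matplotlib.pyplot as plt
import KNN

group, labels = KNN.createDataset()
plt.scatter(group[:,0], group[:,1])
# annotate each point with its class label
for (x, y), label in zip(group, labels):
    plt.annotate(label, (x, y))
plt.show()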



2. Parse data from text

The function in Listing 2-1 runs the KNN algorithm, using it to assign each set of input data to a class.

First we give the pseudocode for the KNN algorithm, followed by the actual Python code.

For each point in the dataset with an unknown category attribute, do the following in turn:

(1) Calculate the distance between each point in the known-category dataset and the current point;

(2) Sort by increasing distance;

(3) Select the k points nearest to the current point;

(4) Determine the frequency of occurrence of each category among those k points;

(5) Return the most frequent category among the k points as the predicted classification of the current point.

The function classify0() is shown in Listing 2-1:

import KNN
from numpy import *
import operator

group,labels = KNN.createDataset()

def classify0(inX, dataSet, labels, k):
    """
    inX: the input vector to classify
    dataSet: the training set
    labels: the label vector (it has as many elements as dataSet has rows)
    k: the number of nearest neighbors to select
    """
    # dataSetSize is the number of rows in the training set
    dataSetSize = dataSet.shape[0]

    # tile() repeats the vector inX dataSetSize times by row (and once by
    # column) so that it has the same shape as dataSet, then the
    # element-wise difference diffMat is computed
    diffMat = tile(inX, (dataSetSize,1)) - dataSet
    # square the differences
    sqDiffMat = diffMat**2
    # axis=1: sum the values of each row vector
    sqDistances = sqDiffMat.sum(axis=1)
    # take the square root to obtain the distances
    distances = sqDistances**0.5

    # argsort() sorts the elements of distances from small to large and
    # returns their corresponding indices
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        # sortedDistIndicies[i] is the index of the i-th smallest distance;
        # the label of that training vector is stored in voteIlabel
        voteIlabel = labels[sortedDistIndicies[i]]
        # update the count for this label in the dictionary
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1

    """
    sorted(iterable, key=None, reverse=False)
    Parameter explanation:

    (1) iterable is the list or iterable to be sorted.

    (2) key is a function that specifies which part of each element to sort
        by. For example, if students is a list of tuples and we want to sort
        by the third field, the code can be written as follows:
            students = [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
            sorted(students, key=lambda student: student[2])
        The lambda passed as key extracts the third field of each element
        (student[2]), so sorted() orders the elements of students by that
        field. The same thing can be written with operator.itemgetter:
            sorted(students, key=operator.itemgetter(2))
        sorted() can also perform multi-level sorting; for example, to sort
        by the second field and then the third:
            sorted(students, key=operator.itemgetter(1,2))

    (In Python 2, sorted() also accepted a cmp argument; it was removed in
    Python 3.)
    """
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # note: classCount.items() is the Python 3 spelling; the original book
    # uses the Python 2 classCount.iteritems()
    return sortedClassCount[0][0]



Using the previously created training set, we can call the classification function. For example:
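>>> classify0([0,0], group, labels, 3)
'B'

The point [0,0] is predicted to be class 'B': its three nearest neighbors are [0,0], [0,0.1], and [1.0,1.0], and two of those three are labeled 'B'.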

Listing 2-1 uses the Euclidean distance formula to calculate the distance between two vector points xA and xB:

    d = sqrt((xA0 - xB0)² + (xA1 - xB1)²)
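For instance, the distance between the point [0,0] and the training point [1.0,1.1] is sqrt((0-1.0)² + (0-1.1)²) = sqrt(2.21) ≈ 1.49.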



After calculating all the distances, the data are sorted from smallest to largest, and the main category among the first k elements with the smallest distances is determined.
