In supervised learning, we provide a set of labeled samples, and the machine infers the likely value of a specified target variable for new inputs. Supervised learning generally uses two types of target variables: nominal and numeric. A nominal target variable takes values from a finite set of categories, while a numeric target variable can take values from an infinite (continuous) range.
Classification algorithms covered:
Chapter 2: k-nearest neighbor algorithm, which classifies using a distance metric;
Chapter 3: Decision trees;
Chapter 4: Building classifiers using probability theory;
Chapter 5: Logistic regression, which searches for the best-fit parameters to correctly classify the data, using optimization algorithms commonly applied in the search for optimal parameters;
Chapter 6: Support vector machines;
Chapter 7: Meta-algorithms: AdaBoost.
The basics of using the NumPy function library:
>>> random.rand(4,4)
constructs a 4×4 array of random values.
Tip: the NumPy library provides two different data types, matrix and array, both of which can be used to process numeric elements arranged in rows and columns. Although they look similar, performing the same mathematical operation on the two types may produce different results.
Use the mat() function to convert an array into a matrix; the .I attribute of a matrix computes its inverse.
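A minimal sketch putting these two operations together (the matrix values here are mine, chosen to be invertible):

```python
from numpy import mat, eye, allclose

m = mat([[1.0, 2.0], [3.0, 4.0]])   # a small invertible matrix
mInv = m.I                           # .I computes the matrix inverse
# multiplying a matrix by its inverse gives (approximately) the identity
print(allclose(m * mInv, eye(2)))    # → True
```

Note that with the matrix type, `*` means matrix multiplication, whereas with plain arrays `*` is element-wise; this is one of the "same operation, different result" cases mentioned above.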
Below we introduce the first classification algorithm in this book: the k-nearest neighbor algorithm (kNN).
Simply put, the k-nearest neighbor algorithm classifies samples by measuring the distances between their feature values.
Advantages: high accuracy, insensitive to outliers, no assumptions about the input data;
Disadvantages: high computational complexity and high space complexity;
Works with: numeric and nominal values.
Working principle: every record in the training sample set has a label, i.e., we know which class each record belongs to. After inputting new, unlabeled data, we compare each feature of the new data with the corresponding features of the records in the sample set, and the algorithm extracts the class labels of the most similar records (the nearest neighbors). In general, we look only at the top k most similar records in the dataset, which is where the k in k-nearest neighbors comes from (k is usually no greater than 20). Finally, we take the class that appears most often among these k most similar records as the class of the new data.
The general flow of the k-nearest neighbor algorithm:
(1) Collect data: collect data by any method;
(2) Prepare data: distance calculations require numeric values, preferably in a structured data format;
(3) Analysis of data: any method can be used;
(4) Train the algorithm: this step does not apply to the k-nearest neighbor algorithm;
(5) Test algorithm: calculate the error rate;
(6) Use the algorithm: first input the sample data and structured output results, then run the k-nearest neighbor algorithm to determine which class the input data belongs to, and finally perform follow-up processing on the computed classification result.
1. Preparation: import data using Python
Create a KNN.py file:
from numpy import *
import operator

"""
numpy is the scientific computing package; the k-nearest neighbor
algorithm uses functions from the operator module to perform sorting.
"""

def createDataset():
    # each inner [] inside array() is one sample's feature values
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels
For convenience we use the createDataset() function to create the dataset and labels. Save the KNN.py file, then import KNN in a new module.
Here are four records, each with two known attributes or feature values. Each row of the group matrix is one record, and the labels vector contains a number of elements equal to the number of rows of group. The plot below shows the four data points with their labels.
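A sketch of how such a plot can be produced (assuming matplotlib is available; this plotting code is mine, not from the book):

```python
import matplotlib.pyplot as plt
from numpy import array

group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']

# scatter the four points and annotate each with its class label
plt.scatter(group[:, 0], group[:, 1])
for (x, y), label in zip(group, labels):
    plt.annotate(label, (x, y))
plt.xlabel('feature 0')
plt.ylabel('feature 1')
plt.show()
```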
2. Parse data from text
The function in Listing 2-1 runs the kNN algorithm on each set of input data, i.e., it assigns each input to a class.
First we give the pseudocode of kNN, then the actual Python code.
For each point in the dataset with an unknown class, do the following in turn:
(1) calculate the distance between the current point and every point in the known-class dataset;
(2) sort the distances in increasing order;
(3) select the k points closest to the current point;
(4) count the frequency of each class among those k points;
(5) return the most frequent class among the k points as the predicted class of the current point.
The function classify0() is shown in Listing 2-1:
import KNN
from numpy import *
import operator

group, labels = KNN.createDataset()

def classify0(inX, dataSet, labels, k):
    """
    inX: the input vector to classify
    dataSet: the training set
    labels: the label vector (its number of elements equals the number of rows of dataSet)
    k: the number of neighbors to use
    """
    # dataSetSize is the number of rows of dataSet
    dataSetSize = dataSet.shape[0]
    # tile() repeats inX dataSetSize times by row (and once by column) so it has
    # the same shape as dataSet; then compute the difference diffMat
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    # square the differences
    sqDiffMat = diffMat**2
    # axis=1: sum the values of each row vector
    sqDistances = sqDiffMat.sum(axis=1)
    # take the square root to get the Euclidean distances
    distances = sqDistances**0.5
    # argsort() sorts the elements of distances from small to large and returns their indices
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        # sortedDistIndicies[i] is the index of the i-th smallest distance;
        # labels[sortedDistIndicies[i]] is the label of the vector at that position
        voteIlabel = labels[sortedDistIndicies[i]]
        # update the count of the corresponding label in the dictionary
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    """
    A note on sorted(): in Python 2 the signature was sorted(iterable[, cmp[, key[, reverse]]]);
    the cmp parameter was removed in Python 3.
    (1) iterable is the list or iterable to be sorted;
    (2) key is a function that extracts, from each element, the value to sort by.
    For example, if students is a list of tuples, each with three fields, we can
    sort by the third field:
        students = [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
        sorted(students, key=lambda student: student[2])
    The lambda passed as key extracts the third field of each element (student[2]),
    so the list is sorted by that field.
    The same can be done with operator.itemgetter; to sort by the third field:
        sorted(students, key=operator.itemgetter(2))
    sorted() can also perform multi-level sorting; for example, to sort by the
    second field and then the third:
        sorted(students, key=operator.itemgetter(1, 2))
    """
    # in Python 3 this is written classCount.items(), which differs from the original book
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
Calling the classification function with the previously created training set gives the classification results shown above.
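For example, a self-contained sketch (repeating the definitions above so it runs on its own) classifying the point [0, 0] with k = 3:

```python
import operator
from numpy import array, tile

def createDataset():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDistances = (diffMat**2).sum(axis=1)
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

group, labels = createDataset()
# the three nearest neighbors of [0, 0] are labeled B, B, A, so B wins the vote
print(classify0([0, 0], group, labels, 3))  # → B
```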
Listing 2-1 uses the Euclidean distance formula to compute the distance between two vector points xA and xB: d = sqrt((xA0 - xB0)^2 + (xA1 - xB1)^2).
After computing all the distances, the data are sorted from smallest to largest, and the majority class among the first k elements (those with the smallest distances) determines the classification.
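As a worked example of the distance formula, the distance between the sample points [1.0, 1.1] and [0, 0] from createDataset() is:

```python
from math import sqrt

xA = (1.0, 1.1)
xB = (0.0, 0.0)
# Euclidean distance: square the per-coordinate differences, sum, take the root
d = sqrt((xA[0] - xB[0])**2 + (xA[1] - xB[1])**2)
print(round(d, 3))  # → 1.487
```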