Machine Learning Algorithms: KNN and K-Means

1. Machine Learning Algorithms

Categories of machine learning

<1> Supervised learning
Supervised learning learns a model from labeled input data (the training data) and uses that model to make predictions for new inputs. The input and output variables may be continuous or discrete.
Regression: both the input and output variables are continuous.
Classification: the output variable takes values in a finite discrete set.
Tagging (sequence labeling): both the input and output are sequences of variables.

Algorithms: classification (categorical output), regression (numeric output)

<2> Unsupervised learning
In supervised learning the training data carries category labels; in unsupervised learning the training data has no category labels.

Algorithms: Clustering, Dimensionality Reduction
Clustering methods fall into five categories: partition-based, hierarchical, density-based, graph-based, and model-based.

<3> Semi-supervised learning
The input data contains both labeled data and unlabeled data.

<4> Reinforcement learning
The essence of reinforcement learning is to solve decision-making problems, that is, to learn to make decisions automatically.
Algorithms: Markov decision processes, dynamic programming
Traditional machine learning is mainly used for data mining and analysis.

2. KNN algorithm

2.1 Core idea

Select the k nearest neighbors of the input data point from the training set, and assign the point to the category that appears most frequently among those k neighbors (the majority voting rule). KNN classifies by measuring distances between feature vectors and is a supervised-learning classification method.

2.2 Ideas

If most of the k samples most similar to a given sample in feature space (that is, its nearest neighbors) belong to a certain category, the sample is assigned to that category.
Note: k is usually an integer no greater than 20; the selected neighbors are samples that have already been correctly labeled; the classification decision depends only on the categories of the nearest one or few samples.

2.3 Implementation

import numpy as np
import matplotlib.pyplot as plt
import operator


# Build the training data and its class labels
def makeData():
    data = np.array([
        [1.0, 1.0],
        [1.0, 1.2],
        [0, 0],
        [0, 0.3]
    ])
    labels = ['A', 'A', 'B', 'B']
    return data, labels


# Classify one input point with the k-nearest-neighbor voting rule
def classify(inX, dataset, labels, k):
    dataSize = dataset.shape[0]
    diffMat = np.tile(inX, (dataSize, 1)) - dataset  # broadcast inX against every training sample
    # Euclidean distance to every training sample
    sqDiffMat = diffMat**2
    sqDis = sqDiffMat.sum(axis=1)
    distance = sqDis**0.5

    sortedDis = distance.argsort()
    labelCount = {}  # maps each candidate class to its number of votes
    for i in range(k):
        votellabel = labels[sortedDis[i]]  # class of the i-th nearest sample
        labelCount[votellabel] = labelCount.get(votellabel, 0) + 1  # count votes per class among the k nearest samples
    sortedlabelCount = sorted(labelCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedlabelCount[0][0]


# Plot the data points, colored by class
def showData(group, labels):
    labels = np.array(labels)
    index_a = np.where(labels == 'A')
    index_b = np.where(labels == 'B')
    plt.scatter(group[index_a][:, 0], group[index_a][:, 1], c="red")
    plt.scatter(group[index_b][:, 0], group[index_b][:, 1], c="blue")
    plt.show()


if __name__ == '__main__':
    dataset, labels = makeData()
    inX = [0.2, 0.3]
    className = classify(inX, dataset, labels, 3)
    print("The point belongs to class: %s" % className)
    dataset = np.vstack((dataset, inX))  # append the new point
    labels.append(className)  # append its predicted class
    showData(dataset, labels)
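
For comparison, the same prediction can be reproduced with scikit-learn's KNeighborsClassifier. This is a minimal sketch assuming scikit-learn is installed; it is not part of the original implementation.

# Cross-check of the hand-written classifier using scikit-learn
# (assumes scikit-learn is available; not part of the original code)
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 1.2], [0, 0], [0, 0.3]])
y = ['A', 'A', 'B', 'B']

clf = KNeighborsClassifier(n_neighbors=3)  # same k=3 as above
clf.fit(X, y)
print(clf.predict([[0.2, 0.3]]))  # expected to print ['B'], matching classify()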

3. K-Means algorithm

3.1 Principle

The k-means algorithm uses the distance between data as the standard for measuring the similarity of data objects. The calculation method of the distance between data has a significant impact on the final clustering effect. Commonly used distance calculation methods: cosine distance, Euclidean distance, Manhattan distance, etc.
Euclidean distance formula:
$\operatorname{dist}(x_i, x_j) = \sqrt{\sum_{d=1}^{D} (x_{i,d} - x_{j,d})^2}$
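
As a quick illustration (a minimal NumPy sketch, not from the original post), the three distance measures mentioned above can be computed as follows:

import numpy as np

x_i = np.array([1.0, 2.0, 3.0])
x_j = np.array([2.0, 0.0, 4.0])

euclidean = np.sqrt(np.sum((x_i - x_j) ** 2))   # same as np.linalg.norm(x_i - x_j)
manhattan = np.sum(np.abs(x_i - x_j))
cosine_dist = 1 - x_i @ x_j / (np.linalg.norm(x_i) * np.linalg.norm(x_j))

print(euclidean, manhattan, cosine_dist)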

3.2 Process

The k-means algorithm is a partition-based clustering algorithm in unsupervised learning.
Purpose: to cluster the data according to the features implied by the data itself, without knowing the categories of the data in advance.
[Figure: k-means iterations, panels (a)-(f); data samples drawn as dots, cluster centers as crosses]
Process: as shown in the figure above, each data sample is drawn as a dot and the center of each cluster as a cross:
(a) At the start, the raw data is unlabeled; all points look the same and are drawn in green.
(b) Assuming the data set can be divided into two clusters, set K=2 and randomly pick two points as the initial cluster centers.
(c-f) Two iterations of clustering are shown. First assign: each data sample is assigned to the cluster of its nearest center. Then update: the center of each cluster is recomputed as the mean of the coordinates of all points in that cluster. This "assign-update-assign-update" loop repeats until the cluster centers no longer move. (The figure and description are from Andrew Ng's machine learning open course.)

3.3 Implementation

import numpy as np
import matplotlib.pyplot as plt


# Load the data set
def loadData(fileName):
    data = np.loadtxt(fileName, delimiter='\t')  # the text file is tab-delimited
    return data


# Pick k random samples as the initial centroids
def randCent(dataSet, k):
    m, n = dataSet.shape
    centids = np.zeros((k, n))
    for i in range(k):
        index = int(np.random.uniform(0, m))
        centids[i, :] = dataSet[index, :]
    return centids


# Euclidean distance
def distEclud(x1, x2):
    return np.sqrt(np.sum((x1-x2)**2))


# K-means clustering
def KMeans(dataSet, k):
    m = np.shape(dataSet)[0]
    clusterAssment = np.mat(np.zeros((m, 2)))  # column 0: index of the assigned centroid; column 1: distance to that centroid
    clusterChange = True

    # Initialize the centroids
    centrids = randCent(dataSet, k)
    while clusterChange:
        clusterChange = False
        # Iterate over all samples (rows)
        for i in range(m):
            minDist = float('inf')
            minIndex = -1
            # Compare against every centroid
            for j in range(k):
                # Euclidean distance to centroid j
                distance = distEclud(centrids[j, :], dataSet[i, :])
                if distance < minDist:
                    minDist = distance
                    minIndex = j
            # Update the centroid assignment of this sample
            if clusterAssment[i, 0] != minIndex:
                clusterChange = True
                clusterAssment[i, :] = minIndex, minDist
        for j in range(k):
            # Collect all points assigned to cluster j
            points = dataSet[np.nonzero(clusterAssment[:, 0] == j)[0]]
            # New centroid: column-wise mean of those points
            centrids[j, :] = np.mean(points, axis=0)
    print("Clustering finished")
    return centrids, clusterAssment


# Plot the samples colored by cluster, and the centroids as diamonds
def show(dataSet, k, centrids, clusterAssment):
    m, n = dataSet.shape
    mark = ['or', 'ob']
    if k > len(mark):
        print("k is too large")
        return 1
    # Plot each sample with the marker of its assigned cluster
    for i in range(m):
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])
    mark = ['Dr', 'Db']
    for i in range(k):
        plt.plot(centrids[i, 0], centrids[i, 1], mark[i])
    plt.show()


if __name__ == '__main__':
    dataSet = loadData("test.txt")
    k = 2
    centerios, clus = KMeans(dataSet, k)
    show(dataSet, k, centerios, clus)
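
The demo expects a tab-delimited file of 2-D points named test.txt, which the post does not provide. A minimal sketch to generate such a file, assuming two Gaussian blobs, could look like this (the blob positions are arbitrary assumptions):

import numpy as np

# Hypothetical helper to create the tab-delimited "test.txt" expected by loadData()
np.random.seed(0)
blob1 = np.random.randn(50, 2) + [0, 0]   # cluster around the origin
blob2 = np.random.randn(50, 2) + [5, 5]   # cluster around (5, 5)
points = np.vstack((blob1, blob2))
np.savetxt("test.txt", points, delimiter='\t')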

3.4 Summary

1 Initial point selection: the initial cluster centers should be as far apart as possible. Ways to choose a class center include:
(1) The mean of all data points in the class
(2) Randomly selecting k data points as the class centers
(3) Selecting the k points that are farthest from each other as the class centers, and so on

2 Iterative process:
(1) Compute the distance from every sample to each center point
(2) Compare each sample's distances to the k center points and assign the sample to the nearest one
(3) Recompute the center point of each of the k clusters from the samples assigned to it

3 Iteration termination conditions: (1) the specified number of iterations is reached; (2) the centroids no longer change, i.e., the algorithm has converged

4 k value selection: refer to https://www.cnblogs.com/xingnie/articles/10334412.html

5 k-means convergence: refer to https://www.cnblogs.com/zlslch/p/6965209.html

6 Complexity: O(n), i.e., linear in the number of samples (per iteration, for fixed k and dimensionality)

7 Initial point optimization methods (a sketch of option (2) follows this list):
(1) Run the algorithm multiple times with different center points and use the loss function to evaluate which run works best
(2) Select K sample points that are as far apart as possible as the center points
(3) For high-dimensional sparse vectors (such as text), K pairwise-orthogonal feature vectors can be selected as the initial center points
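
A minimal sketch of option (2), farthest-point initialization, assuming a dataSet array laid out as in section 3.3 (this is an illustration, not the original author's code):

import numpy as np

def farthestPointInit(dataSet, k):
    # Start from one random sample, then repeatedly add the sample
    # that is farthest from its nearest already-chosen center.
    m = dataSet.shape[0]
    centers = [dataSet[np.random.randint(m)]]
    while len(centers) < k:
        # For each sample, the distance to its nearest chosen center
        dists = np.array([min(np.sqrt(np.sum((x - c) ** 2)) for c in centers)
                          for x in dataSet])
        centers.append(dataSet[np.argmax(dists)])
    return np.array(centers)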
