Unsupervised Learning: K-means Clustering (Key Knowledge Points)

Foreword

In practical work we often face the following problem: a large amount of feature data is fed to the machine, and we expect it to discover, through learning, common features or structure in the data, or relationships between data items.
For example, a video website may group users according to their viewing behavior in order to build different recommendation strategies, or look for a relationship between whether videos play smoothly and whether users unsubscribe. This kind of problem is called an "unsupervised learning" problem; unlike supervised learning, it does not aim to predict a particular output.
Compared with supervised learning, the input data of unsupervised learning carries no label information, so the algorithm model has to mine the inherent structure and patterns of the data itself.
Unsupervised learning mainly includes two kinds of methods: data clustering and feature-variable association. Clustering algorithms usually find the best partition of the data through repeated iterations, while feature-variable association uses various correlation analysis methods to find relationships between variables.

K-means clustering

Scenario Description
Classical machine learning algorithms such as support vector machines, logistic regression, and decision trees are mainly used for classification problems: given some samples with known categories, a classifier is trained so that it can classify samples whose categories are unknown. Clustering is different. Without knowing any sample labels in advance, it divides the samples into several categories based on the internal relationships in the data, so that samples within the same category are highly similar while samples in different categories are not.
[Figure 1: example of a clustering result; different colors represent different categories.]

Classification problems fall under the umbrella of supervised learning, while clustering is unsupervised learning. K-means clustering (KMeans Clustering) is the most basic and most commonly used clustering algorithm. Its basic idea is to iteratively search for a partition of the data into K clusters (Cluster) that minimizes the cost function of the clustering result. In particular, the cost function can be defined as the sum of squared errors between each sample and the center point of the cluster it belongs to:
$$J(c, \mu) = \sum_{i=1}^{M} \left\lVert x_i - \mu_{c_i} \right\rVert^2$$
where $x_i$ represents the i-th sample, $c_i$ is the cluster to which $x_i$ belongs, $\mu_{c_i}$ represents the center point of that cluster, and $M$ is the total number of samples.
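
To make the definition concrete, here is a minimal NumPy sketch of this cost function. The array names `X`, `labels`, and `centers` are illustrative and not from the original post.

```python
import numpy as np

def kmeans_cost(X, labels, centers):
    """Sum of squared distances between each sample and its cluster center.

    X:       (M, d) array of samples
    labels:  (M,) integer array; labels[i] is the cluster index c_i of sample x_i
    centers: (K, d) array of cluster center points mu_1..mu_K
    """
    # For each sample, subtract the center of the cluster it belongs to,
    # then sum the squared Euclidean distances over all samples.
    diffs = X - centers[labels]
    return np.sum(diffs ** 2)
```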

Algorithm Description

The core goal of K-means clustering is to divide a given data set into K clusters and to output the cluster center corresponding to each data point. The specific steps of the algorithm are as follows (a code sketch follows the list):
(1) Randomly select K points as the initial cluster centers μ1, ..., μK.
(2) Assign each sample xi to the cluster whose center is closest to it, i.e. the one that minimizes ||xi − μk||².
(3) Recompute each cluster center μk as the mean of all samples currently assigned to cluster k.
(4) Repeat steps (2) and (3) until the assignments no longer change, the cost function J stops decreasing, or a maximum number of iterations is reached.
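
The steps above translate almost directly into code. The following is a minimal NumPy sketch of the procedure; the function name and parameters (`n_iters`, `seed`) are illustrative assumptions, not part of the original post.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means sketch: X is an (M, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step (1): pick k distinct samples as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step (2): assign each sample to the nearest center (squared distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step (3): move each center to the mean of the samples assigned to it.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step (4): stop when the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```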

Process

Figure 2 is a schematic diagram of the iterative process of the K-means algorithm. First, some sample points are given in two-dimensional space (see Figure 2(a)); intuitively, they can be divided into two categories. Next, two center points are initialized (the brown and yellow crosses in Figure 2(b)), and each sample is assigned to a cluster according to its distance to the center points (shown with different colors in Figure 2(c)). Then the new center positions are computed as the mean of all points in each cluster (see Figure 2(d)). Figure 2(e) and Figure 2(f) show the results of a further round of iteration. After two rounds of iteration, the algorithm has essentially converged.
[Figure 2: schematic of the K-means iterative process, panels (a) through (f).]
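
In practice this iteration does not need to be written by hand; scikit-learn's `KMeans` runs the same procedure. Below is a small sketch on synthetic two-dimensional points roughly like Figure 2(a), assuming scikit-learn is installed; the data and variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of 2-D points, roughly like Figure 2(a).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.5, size=(100, 2)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # final center points
print(km.labels_[:10])       # cluster index assigned to the first 10 samples
print(km.inertia_)           # value of the cost function J at convergence
```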

Advantages and disadvantages

The main disadvantages of the K-means algorithm are as follows (common mitigations are sketched after the list).
(1) The number of clusters K must be specified manually in advance, and this value may not match the real distribution of the data.
(2) K-means can only converge to a local optimum, and the result is strongly affected by the choice of initial center points.
(3) It is easily affected by noise and outliers.
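
These weaknesses are usually mitigated rather than eliminated: running the algorithm several times from different initial centers (scikit-learn's `n_init`), using the k-means++ initialization, and comparing the cost J for several candidate K values (the "elbow" heuristic). The loop below is an illustrative sketch of that comparison, not part of the original post.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data, same pattern as the snippet above.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.5, size=(100, 2)),
])

# k-means++ initialization plus several restarts (n_init) reduces the risk of a
# poor local optimum; printing the cost J (inertia_) for each candidate K lets
# us look for the "elbow" where the curve flattens out.
for k in range(1, 8):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)
```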

Origin: blog.csdn.net/ALiLiLiYa/article/details/131775925