SIGAI Machine Learning, Lesson 25: Clustering Algorithms (Part 2)

This lesson teaches the basic concepts of clustering algorithms, the classification of clustering algorithms, hierarchical clustering, the k-means algorithm, the EM algorithm, the DBSCAN algorithm, the OPTICS algorithm, the mean shift algorithm, spectral clustering, and practical applications.

Course Outline:

Introduction to density-based clustering algorithms
The core idea of DBSCAN
Definitions of basic concepts
Algorithm flow
Implementation details
Experiments
The core idea of the OPTICS algorithm
Definitions of basic concepts
Algorithm flow
Generating clustering results from the ordering
Experiments
The core idea of the mean shift algorithm
Kernel density estimation
Algorithm flow
The core idea of spectral clustering
Definitions of basic concepts
Algorithm flow
Algorithm evaluation
Applications
Summary of clustering algorithms

This lesson focuses on density-based clustering algorithms: the DBSCAN algorithm, the OPTICS algorithm, and the mean shift algorithm. It then discusses spectral clustering, followed by the evaluation and application of clustering algorithms, and finally gives a summary covering both lessons.

Introduction to density-based clustering algorithms:

Density-based clustering looks at how densely the sample points are distributed in each part of the space: a region where samples are densely packed is treated as one cluster, while samples scattered in places that are not densely populated are treated as noise points. Clustering therefore depends on the density of samples around each sample point: a comparatively dense region may form a cluster, while a sparse one will not.

One advantage of this family of algorithms is that they can identify clusters of irregular shape. They have no notion of a cluster center and never define a center point, unlike k-means or the Gaussian mixture model, which must compute a center vector. As a result, algorithms like k-means and Gaussian mixture models tend to prefer round, spherical, or ellipsoidal cluster distributions, whereas density-based algorithms can find clusters of any shape, as long as the region is connected: S-shaped bends, band-like distributions, curled shapes, all of these can be handled. Another advantage is that the number of clusters does not need to be specified manually, unlike the k-means algorithm, where the value of k must be set in advance.

The core step of these algorithms is to compute a density value at each sample point; clusters are then constructed according to these density values. The general practice is to start from a relatively dense point, which we call a core point, take it as a seed, and repeatedly expand outward until reaching places where samples are sparse; the region enclosed in this way is one cluster.

These algorithms define the density at a sample point by the number of neighbors within a neighborhood of that point in the data space. They can find clusters of irregular shape in the space, and the number of clusters does not have to be specified.
The core of the algorithm is to compute the density values and to define clusters according to these densities.
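As a concrete illustration of the density computation step, here is a minimal sketch; the parameter name eps, standing in for the neighborhood radius ε, is my own choice, not from the lesson. It counts, for every sample, how many samples fall within its ε-neighborhood.

    import numpy as np

    def neighborhood_counts(X, eps):
        # Pairwise Euclidean distances between all n samples (an n x n matrix).
        diff = X[:, None, :] - X[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        # Density at each point: the number of samples within radius eps.
        return (dist <= eps).sum(axis=1)

These per-point counts are the density values used throughout the rest of this lesson.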

The core idea of DBSCAN:

The classic DBSCAN algorithm was the first density-based clustering algorithm, published in:

Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD), pp. 226-231, 1996.
DBSCAN is the simplest density-based clustering algorithm. Its core idea was stated earlier: first compute the density of the sample distribution at all points in space, then look for a seed among the samples. If this point is a core point, i.e. its neighborhood is dense, expand outward from it repeatedly until the boundary is reached; this gives one cluster. Once a cluster is finished, find another seed point and continue with the next cluster, until all samples have been processed, at which point the algorithm ends.

The algorithm is very robust to noise, because it can naturally identify noise of all kinds: all noise sample points are weeded out rather than being absorbed into a cluster. This is an advantage that algorithms such as the Gaussian mixture model with the EM algorithm and k-means cannot match.

The algorithm defines a cluster as a densely populated region of samples. Starting from a seed sample, it grows through the dense region until it reaches the border, that is, the place beyond which samples become very sparse; everywhere it walks, the distribution is dense. When this process stops, we have obtained one cluster.

Definitions of basic concepts:

The algorithm relies on a few basic concepts, which are defined below.

Core point:

A sample point is a core point if the number of sample points in its neighborhood (all samples within a given radius ε) is at least a specified threshold M.

[Figure: core points (red) and non-core points (blue) in a set of samples]

As shown in the figure, if M is set to 5, the red point near the middle is a core point, while the blue points on the boundary of the dense region are not core points.

Denote the set of all core points by Xc, where c is shorthand for "core".

Denote the set of all non-core points by Xnc.

If a point is not a core point but there is a core point in its neighborhood, the point is called a border point. Both core points and border points are what we target during clustering: every cluster contains core points as well as border points.

If a point is neither a core point nor a border point, it is called a noise point.
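Putting these three definitions together, the following sketch labels every sample as a core point, border point, or noise point; the names eps and M again stand in for ε and the threshold, and the function name is my own.

    import numpy as np

    def classify_points(X, eps, M):
        # Boolean n x n matrix: neighbors[i, j] is True if x_j lies in the
        # eps-neighborhood of x_i.
        diff = X[:, None, :] - X[None, :, :]
        neighbors = np.sqrt((diff ** 2).sum(axis=-1)) <= eps
        # Core points (Xc): at least M samples in the neighborhood.
        is_core = neighbors.sum(axis=1) >= M
        # Border points: not core, but with at least one core point as a neighbor.
        is_border = ~is_core & (neighbors & is_core[None, :]).any(axis=1)
        # Noise points: neither core nor border.
        is_noise = ~is_core & ~is_border
        return is_core, is_border, is_noise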

If x is a core point and y lies in its neighborhood, then y is said to be directly density-reachable from x.

For a sequence of samples x1, ..., xn, if xi+1 is directly density-reachable from xi for every i, then xn is said to be density-reachable from x1. Here all the points before xn must be core points, while xn itself need not be a core point.

For points x, y, z: if y and z are both density-reachable from x, then y and z are said to be density-connected.

With density-reachability and density-connectedness in place, a cluster can be defined:

Suppose C is a subset of the whole sample set X. The set C is called a cluster if it satisfies: for any two samples x and y in X, if x ∈ C and y is density-reachable from x, then y ∈ C; and if x ∈ C and y ∈ C, then x and y are density-connected.
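Stated compactly in standard notation, these are the two conditions, called maximality and connectivity in the paper cited above:

$$\forall x, y \in X:\quad x \in C \;\wedge\; y \text{ density-reachable from } x \;\Rightarrow\; y \in C \qquad \text{(maximality)}$$

$$\forall x, y \in C:\quad x \text{ and } y \text{ are density-connected} \qquad \text{(connectivity)}$$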
Let us review density-reachability and density-connectedness:

[Figure: density-reachability between p and q, and density-connectedness through an intermediate point o]

Density-reachability is not a symmetric concept: that q is density-reachable from p does not imply that p is density-reachable from q; if the last point of the chain is not a core point, symmetry clearly fails. Density-connectedness goes through an intermediate stepping-stone o: if p is density-reachable from o and q is also density-reachable from o, then we say p and q are density-connected. Density-connectedness is a symmetric concept: if p is density-connected to q, then q is density-connected to p.

Algorithm flow:
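The flow follows directly from the definitions above. Below is a minimal, self-contained sketch of it (function and variable names are my own, not from the lesson): take an unassigned core point as a seed, grow the cluster by repeatedly absorbing the neighborhoods of its core points, then move on to the next seed; whatever remains unassigned at the end is noise.

    import numpy as np

    def dbscan(X, eps, M):
        n = len(X)
        # eps-neighborhood adjacency and core-point mask, as defined above.
        diff = X[:, None, :] - X[None, :, :]
        neighbors = np.sqrt((diff ** 2).sum(axis=-1)) <= eps
        is_core = neighbors.sum(axis=1) >= M
        labels = np.full(n, -1)   # -1 marks a point as noise until it joins a cluster
        cluster = 0
        for i in range(n):
            if labels[i] != -1 or not is_core[i]:
                continue          # only an unassigned core point can seed a new cluster
            labels[i] = cluster
            queue = [i]
            while queue:          # expand the cluster outward from the seed
                p = queue.pop()
                if not is_core[p]:
                    continue      # border points join a cluster but never expand it
                for q in np.flatnonzero(neighbors[p]):
                    if labels[q] == -1:
                        labels[q] = cluster
                        queue.append(q)
            cluster += 1
        return labels             # cluster index per sample, -1 for noise points

On data such as two interleaved S-shaped bands, this procedure recovers the curved clusters and leaves stray points labeled -1, which is exactly the irregular-shape and noise-robustness behavior described earlier in this lesson.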
