sklearn KMeans clustering algorithm (summary)

Fundamentals

K-Means is a representative unsupervised learning algorithm: there is no label Y. Its main purpose is to group samples into clusters, and the grouping is based on the distance between samples. For example, to divide the samples into K clusters, the steps are:

  1. Randomly select K points as the initial centroids.
  2. Calculate the distance from each sample to each of the K centroids and assign the sample to the nearest one, which forms K clusters.
  3. Recalculate each of the K centroids as the mean of the samples in its cluster.
  4. Repeat steps 2 and 3.
  5. When the centroid positions no longer change, the clustering is complete.
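The five steps above can be sketched in plain NumPy. This is only a toy illustration, not how sklearn implements KMeans; the function name kmeans_sketch, its parameters, and the tiny X_demo array are made up for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Toy K-Means for illustration: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each sample to its nearest centroid (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned samples
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        # Step 4: otherwise repeat steps 2 and 3 with the new centroids.
        centroids = new_centroids
    return centroids, labels

# Two obvious groups: points near (0, 0) and points near (10, 10).
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers, labels = kmeans_sketch(X_demo, k=2)
print(centers)
print(labels)
```

The cluster numbering (which group is 0 and which is 1) depends on the random initial centroids, so only the grouping itself is meaningful.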

Distance

K-Means is built on distance calculations. Three distance measures are commonly used, where x is a sample with n features and \mu is a centroid:

  • Euclidean distance

    \[ d(x,\mu)=\sqrt{\sum_{i=1}^n(x_i-\mu_i)^2} \]

  • Manhattan distance

    \[ d(x,\mu)=\sum_{i=1}^n|x_i-\mu_i| \]

  • Cosine distance, defined as 1-\cos\theta, where

    \[ \cos\theta=\frac{\sum_{i=1}^n x_i\mu_i}{\sqrt{\sum_{i=1}^n x_i^2}\,\sqrt{\sum_{i=1}^n \mu_i^2}} \]
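To make the three formulas concrete, here is a small NumPy sketch that evaluates each of them for one sample and one centroid. The function names and the example vectors are chosen only for this illustration.

```python
import numpy as np

def euclidean(x, mu):
    # Square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - mu) ** 2))

def manhattan(x, mu):
    # Sum of absolute coordinate differences
    return np.sum(np.abs(x - mu))

def cosine_similarity(x, mu):
    # cos(theta) between the two vectors
    return np.dot(x, mu) / (np.linalg.norm(x) * np.linalg.norm(mu))

x = np.array([1.0, 2.0, 3.0])
mu = np.array([4.0, 6.0, 3.0])

print(euclidean(x, mu))               # 5.0
print(manhattan(x, mu))               # 7.0
print(1 - cosine_similarity(x, mu))   # cosine distance
```

Note that sklearn's KMeans itself uses the Euclidean distance; the other two measures are shown here only to illustrate the formulas.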

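Inertia

The sum of squared distances from each sample in a cluster to that cluster's centroid is called the inertia of the cluster. Adding up the inertia of all clusters gives the total inertia; the smaller it is, the more similar the samples within each cluster are. (However, the larger K is, the smaller the inertia becomes, so chasing an ever larger K brings no benefit in practice.)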
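As a rough sketch, the total inertia can be computed by hand from the samples, their cluster labels, and the centroids. The helper name inertia and the four sample points below are invented for this example.

```python
import numpy as np

def inertia(X, labels, centroids):
    # Sum of squared distances from each sample to its assigned centroid
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))

X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# K = 2: one centroid per pair of nearby points
two = inertia(X_demo, np.array([0, 0, 1, 1]),
              np.array([[0.0, 1.0], [10.0, 1.0]]))

# K = 1: a single centroid at the mean of all four points
one = inertia(X_demo, np.array([0, 0, 0, 0]), np.array([[5.0, 1.0]]))

print(two)  # 4.0
print(one)  # 104.0
```

Going from one cluster to two drops the inertia from 104.0 to 4.0 on this toy data, which matches the remark that a larger K gives a smaller inertia.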

Code

Generate simulated data:

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
X, y = make_blobs(n_samples=500, # 500 samples
                 n_features=2, # 2 features per sample
                 centers=4, # 4 centers
                 random_state=1 # fix the random seed
                 )

Plot the data, colored by the true label:

color = ['red', 'pink','orange','gray']
fig, axi1=plt.subplots(1)
for i in range(4):
    axi1.scatter(X[y==i, 0], X[y==i,1],
               marker='o',
               s=8,
               c=color[i]
               )
plt.show()

Cluster the data with the KMeans class:

from sklearn.cluster import KMeans
n_clusters=3
cluster = KMeans(n_clusters=n_clusters,random_state=0).fit(X)

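You can also call fit first and then predict. This is meant for large amounts of data: fit on part of the data and predict the rest, although the resulting labels may not be as accurate as fitting on all of the data.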
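A minimal sketch of the fit-then-predict approach, compared with fitting on all of the data. The subset size of 200 and the variable names are arbitrary choices for this example, and n_init=10 is set explicitly only to keep the behavior the same across sklearn versions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)

# Fit on the first 200 samples only, then label all 500 samples.
cluster_small = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X[:200])
y_pred_small = cluster_small.predict(X)

# Fit on everything for comparison; fit_predict returns the labels directly.
y_pred_full = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)

print(y_pred_small.shape, y_pred_full.shape)
```

The cluster numbers from the two models are not guaranteed to line up (cluster 0 in one may be cluster 2 in the other), so compare the groupings rather than the raw label values.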

At this point you can inspect the fitted model's attributes: the centroids and the inertia.

centroid=cluster.cluster_centers_
centroid # view the centroids

View inertia:

inertia=cluster.inertia_
inertia
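To see how inertia behaves as K changes, one option is to refit the model for several values of K and print inertia_ each time. This is a sketch on the same kind of simulated data; the range of K values is an arbitrary choice, and n_init=10 is set explicitly only for consistency across sklearn versions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, n_features=2, centers=4, random_state=1)

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

The printed values should shrink as K grows, which is why a small inertia alone does not tell you that a given K is a good choice.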

Plot the clusters and their centroids:

y_pred = cluster.labels_ # cluster label assigned to each sample
color=['red','pink','orange','gray']
fig, axi1=plt.subplots(1)
for i in range(n_clusters):
    axi1.scatter(X[y_pred==i, 0], X[y_pred==i, 1],
               marker='o',
               s=8,
               c=color[i])
axi1.scatter(centroid[:,0],centroid[:,1],marker='x',s=100,c='black')
plt.show()

Origin www.cnblogs.com/heenhui2016/p/10988892.html