Sklearn - Clustering



from sklearn import datasets 
from sklearn.preprocessing import StandardScaler 
from sklearn.cluster import KMeans 

# Load the iris data set
iris = datasets.load_iris() 
iris_features = iris.data 
iris_target = iris.target 


Using the K-Means clustering algorithm

# Standardize the features
scaler = StandardScaler() 
features_std = scaler.fit_transform(iris_features) 

# Create a K-Means object
cluster = KMeans(n_clusters=3, random_state=0) 

model = cluster.fit(features_std) 
# View the predicted cluster labels
model.labels_ 
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2,
       0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,
       2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2,
       2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0], dtype=int32)
# View the true classes
iris.target 
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
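
The cluster labels are arbitrary integers, so they cannot be compared with the true classes directly; a permutation-invariant score such as the adjusted Rand index quantifies the agreement (a minimal sketch using sklearn.metrics):

from sklearn.metrics import adjusted_rand_score

# 1.0 means the clustering matches the true classes exactly;
# values near 0.0 mean no better than chance
adjusted_rand_score(iris_target, model.labels_)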

# The model was fit on standardized features, so a new observation
# should be on the same standardized scale
new_ob = [[0.8, 0.8, 0.8, 0.8]]
model.predict(new_ob) 

array([2], dtype=int32)
# View the cluster centers
model.cluster_centers_  
array([[-0.05021989, -0.88337647,  0.34773781,  0.2815273 ],
       [-1.01457897,  0.85326268, -1.30498732, -1.25489349],
       [ 1.13597027,  0.08842168,  0.99615451,  1.01752612]])
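
The centers are reported in standardized units; mapping them back through the fitted scaler gives the centers in the original feature units (a small sketch reusing the scaler from above):

# Undo the standardization to interpret the centers
scaler.inverse_transform(model.cluster_centers_)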

Accelerated K-Means clustering with MiniBatchKMeans

batch_size controls the number of randomly selected observations in each batch; the more observations in a batch, the more computationally expensive training will be.

from sklearn.cluster import MiniBatchKMeans 

cluster = MiniBatchKMeans(n_clusters=3, random_state=0, batch_size=100)
model = cluster.fit(features_std) 
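
MiniBatchKMeans also supports incremental training via partial_fit, which is useful when observations arrive in chunks or the data does not fit in memory; a minimal sketch (the chunk size of 50 is illustrative):

cluster = MiniBatchKMeans(n_clusters=3, random_state=0)
for start in range(0, len(features_std), 50):
    cluster.partial_fit(features_std[start:start + 50])
# Labels for the full data set come from predict
cluster.predict(features_std)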

Using the MeanShift clustering algorithm

KMeans has limitations: you must set the number of clusters K in advance, and the algorithm assumes the clusters are convex (roughly spherical); MeanShift has neither limitation.

MeanShift parameters

  • bandwidth : the radius of the kernel used when shifting observations toward denser regions;
  • cluster_all=False : discard orphan observations that fall outside every kernel instead of forcing them into the nearest cluster; orphans are labeled -1.

from sklearn.cluster import MeanShift 
cluster = MeanShift(n_jobs=-1)
model = cluster.fit(features_std) 
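
The bandwidth can be set explicitly or estimated from the data; below is a small sketch using sklearn's estimate_bandwidth helper (the quantile value is illustrative) together with cluster_all=False:

from sklearn.cluster import estimate_bandwidth

bandwidth = estimate_bandwidth(features_std, quantile=0.2)
cluster = MeanShift(bandwidth=bandwidth, cluster_all=False, n_jobs=-1)
model = cluster.fit(features_std)
# Orphan observations, if any, are labeled -1
model.labels_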


Using the DBSCAN clustering algorithm

The main parameters

  • eps : the maximum distance between two observations for them to be considered neighbors;
  • min_samples : the minimum number of observations within eps of an observation for it to be considered a core observation;
  • metric : the distance metric used for eps.

from sklearn.cluster import DBSCAN

cluster = DBSCAN(n_jobs=-1) 
model = cluster.fit(features_std) 
model.labels_ 
array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1, -1,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1, -1,
        0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  1,
        1,  1,  1,  1,  1, -1, -1,  1, -1, -1,  1, -1,  1,  1,  1,  1,  1,
       -1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
       -1,  1, -1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1, -1,  1, -1,  1,
        1,  1,  1, -1, -1, -1, -1, -1,  1,  1,  1,  1, -1,  1,  1, -1, -1,
       -1,  1,  1, -1,  1,  1, -1,  1,  1,  1, -1, -1, -1,  1,  1,  1, -1,
       -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1])
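
Observations labeled -1 are outliers (noise) that DBSCAN did not assign to any cluster. The fit above used the defaults, eps=0.5 and min_samples=5; making them explicit and counting the labels shows how many observations end up as noise (a minimal sketch):

import numpy as np

cluster = DBSCAN(eps=0.5, min_samples=5, metric='euclidean', n_jobs=-1)
model = cluster.fit(features_std)
# Counts per label; the entry for -1 is the number of outliers
np.unique(model.labels_, return_counts=True)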

Using the hierarchical agglomerative clustering algorithm AgglomerativeClustering

  • AgglomerativeClustering is a powerful and flexible hierarchical clustering algorithm;
  • In agglomerative clustering, every observation starts as its own cluster;
    clusters that meet certain criteria are then merged, and this merging is repeated so the clusters keep growing until a stopping point is reached;
  • In sklearn, AgglomerativeClustering uses the linkage parameter to choose the merging strategy, minimizing one of the following:
    • ward, the variance of the merged clusters
    • average, the average distance between observations in the two clusters
    • complete, the maximum distance between observations in the two clusters
  • Other parameters
    • affinity, the distance metric used by linkage, such as minkowski or euclidean (renamed metric in recent scikit-learn releases)
    • n_clusters, the number of clusters the algorithm tries to find; merging stops once n_clusters clusters remain.

from sklearn.cluster import AgglomerativeClustering 

cluster = AgglomerativeClustering(n_clusters=3)
model = cluster.fit(features_std) 
model.labels_ 
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 0, 2, 0,
       2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 2, 0, 0, 2,
       2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
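
linkage='ward' (the default) only supports euclidean distance; the other linkage strategies accept additional metrics. A small sketch using average linkage with manhattan distance:

cluster = AgglomerativeClustering(n_clusters=3,
                                  linkage='average',
                                  affinity='manhattan')  # named metric in scikit-learn >= 1.2
model = cluster.fit(features_std)
model.labels_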


Origin: blog.csdn.net/lovechris00/article/details/129904702