Article directory
- Sklearn - 2.3. Clustering
https://scikit-learn.org/stable/modules/clustering.html
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
# Load the iris dataset
iris = datasets.load_iris()
iris_features = iris.data
iris_target = iris.target
Using the K-Means clustering algorithm
# Standardize the features
scaler = StandardScaler()
features_std = scaler.fit_transform(iris_features)
# Create a K-Means object
cluster = KMeans(n_clusters=3, random_state=0)
model = cluster.fit(features_std)
# View the predicted cluster labels
model.labels_
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2,
0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,
2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2,
2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0], dtype=int32)
# The true classes
iris.target
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
# Predict the cluster of a new observation
new_ob = [[0.8, 0.8, 0.8, 0.8]]
model.predict(new_ob)
array([2], dtype=int32)
# View the cluster centers
model.cluster_centers_
array([[-0.05021989, -0.88337647, 0.34773781, 0.2815273 ],
[-1.01457897, 0.85326268, -1.30498732, -1.25489349],
[ 1.13597027, 0.08842168, 0.99615451, 1.01752612]])
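Since the cluster labels are arbitrary integers, comparing them directly to iris.target is misleading; an internal measure such as the silhouette score is a simpler sanity check. The following sketch (not part of the original notes) evaluates the K-Means fit above; the explicit n_init=10 is an assumption added to keep the behavior stable across scikit-learn versions.

```python
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Standardize the iris features, as in the notes above
iris = datasets.load_iris()
features_std = StandardScaler().fit_transform(iris.data)

# Fit K-Means with 3 clusters
model = KMeans(n_clusters=3, random_state=0, n_init=10).fit(features_std)

# Silhouette score ranges from -1 to 1; higher means denser,
# better-separated clusters
score = silhouette_score(features_std, model.labels_)
print(round(score, 3))
```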
Accelerated K-Means clustering with MiniBatchKMeans
batch_size controls the number of randomly selected observations in each batch; the more observations per batch, the more computationally expensive training becomes.
from sklearn.cluster import MiniBatchKMeans
cluster = MiniBatchKMeans(n_clusters=3, random_state=0, batch_size=100)
model = cluster.fit(features_std)
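As a sketch of the trade-off, both estimators can be fit on the same standardized iris features and their labels compared; the batch_size and n_init values below are illustrative assumptions, and the label numbering may differ between the two models.

```python
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, MiniBatchKMeans

features_std = StandardScaler().fit_transform(datasets.load_iris().data)

# Full-batch K-Means versus its mini-batch approximation
full = KMeans(n_clusters=3, random_state=0, n_init=10).fit(features_std)
mini = MiniBatchKMeans(n_clusters=3, random_state=0,
                       batch_size=100, n_init=10).fit(features_std)

# Both models partition the data into 3 clusters
print(sorted(set(full.labels_)), sorted(set(mini.labels_)))
```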
Using the MeanShift clustering algorithm
Limitations of KMeans: you must set the number of clusters K in advance and assume the shape of the clusters; MeanShift has neither limitation.
MeanShift parameters
- bandwidth : the radius of the kernel used when shifting observations toward regions of higher density;
- cluster_all=False : discard orphan observations (those far from any kernel) instead of assigning them to the nearest cluster; orphans receive the label -1.
from sklearn.cluster import MeanShift
cluster = MeanShift(n_jobs=-1)
model = cluster.fit(features_std)
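When bandwidth is not set explicitly, scikit-learn estimates it internally; it can also be computed up front with estimate_bandwidth, which makes the choice visible. A sketch (the quantile value is an illustrative assumption):

```python
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import MeanShift, estimate_bandwidth

features_std = StandardScaler().fit_transform(datasets.load_iris().data)

# Estimate a kernel bandwidth from the data itself
bw = estimate_bandwidth(features_std, quantile=0.3)

# MeanShift decides the number of clusters on its own
model = MeanShift(bandwidth=bw, n_jobs=-1).fit(features_std)
print(len(model.cluster_centers_))
```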
Using the DBSCAN clustering algorithm
The main parameters
- eps : the maximum distance between two observations for them to be considered neighbors;
- min_samples : the minimum number of neighbors within eps for an observation to count as a core point;
- metric : distance metric;
from sklearn.cluster import DBSCAN
cluster = DBSCAN(n_jobs=-1)
model = cluster.fit(features_std)
model.labels_
array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1,
0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, -1, -1, 1, -1, -1, 1, -1, 1, 1, 1, 1, 1,
-1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
-1, 1, -1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, -1, 1, -1, 1,
1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, 1, -1, 1, 1, -1, -1,
-1, 1, 1, -1, 1, 1, -1, 1, 1, 1, -1, -1, -1, 1, 1, 1, -1,
-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1])
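In the output above, the label -1 marks noise points: observations with too few neighbors within eps. The sketch below counts clusters and noise for two eps values (the specific values are illustrative assumptions) to show how sensitive DBSCAN is to this parameter.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

features_std = StandardScaler().fit_transform(datasets.load_iris().data)

for eps in (0.5, 1.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit(features_std).labels_
    # -1 is not a cluster; it marks noise points
    n_noise = int(np.sum(labels == -1))
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```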
Using the hierarchical agglomerative clustering algorithm AgglomerativeClustering
- Agglomerative is a powerful and flexible hierarchical clustering algorithm;
- In agglomerative clustering, every observation starts out as its own cluster;
- Clusters that meet certain conditions are then merged, and the merging repeats, growing the clusters, until a stopping point is reached;
- In sklearn, AgglomerativeClustering uses the linkage parameter to choose the merging strategy, minimizing one of the following values:
- ward, the variance of the merged clusters
- average, the average distance of observations between two clusters
- complete, the maximum distance between observations between two clusters
- Other parameters
- affinity, determines which distance metric linkage uses, such as minkowski or euclidean (ward only supports euclidean)
- n_clusters, the number of clusters the algorithm tries to find; merging is considered finished once n_clusters clusters remain.
from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=3)
model = cluster.fit(features_std)
model.labels_
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,
1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 0, 2, 0,
2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 2, 0, 0, 2,
2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
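The three linkage strategies listed above can be compared directly; this sketch fits AgglomerativeClustering with each of them on the standardized iris features (the loop is illustrative, not from the original notes).

```python
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

features_std = StandardScaler().fit_transform(datasets.load_iris().data)

# Fit the same data with each merging strategy
for linkage in ("ward", "average", "complete"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    model = model.fit(features_std)
    print(linkage, sorted(set(model.labels_)))
```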