[Study Notes] [Machine Learning] 8. Clustering Algorithms (Clustering Algorithms: K-means, K-means++; Feature Dimensionality Reduction: Pearson Correlation Coefficient, Spearman Correlation Coefficient, PCA Principal Component Analysis)

  1. video link
  2. Dataset download address: no download required

1. Introduction to Clustering Algorithms

learning target:

  • Master the clustering algorithm implementation process
  • Know the principle of K-means algorithm
  • Know the evaluation model in the clustering algorithm
  • Explain the advantages and disadvantages of K-means
  • Understand how algorithms are optimized in clustering
  • Know the implementation process of feature dimensionality reduction
  • Applying K-means to achieve clustering tasks

1.1 Understanding Clustering Algorithms

insert image description here

Different clustering criteria lead to different clustering results.

1.2 Application of clustering algorithm in reality

  • User portrait, advertisement recommendation, Data Segmentation, search engine traffic recommendation, malicious traffic identification
  • Business push, news clustering, screening and sorting based on location information
  • Image segmentation, dimensionality reduction, recognition; Outlier detection; Abnormal consumption of credit cards; Discovery of gene fragments with the same function

insert image description here

1.3 The concept of clustering algorithm

Clustering algorithm : It is a typical unsupervised learning algorithm, which is mainly used to automatically classify similar samples into one category.

In the clustering algorithm, according to the similarity between samples, the samples are divided into different categories, and different similarity calculation methods are used to obtain different clustering results . The commonly used similarity calculation method is the Euclidean distance method.

1.4 The biggest difference between clustering algorithm and classification algorithm

Clustering algorithms are unsupervised learning algorithms, while classification algorithms are supervised learning algorithms.


Summary :

  • Classification of clustering algorithms [understand]
    • rough clustering
    • fine clustering
  • The definition of clustering [understand]
    • A Typical Unsupervised Learning Algorithm
    • It is mainly used to automatically group similar samples into one category
    • Calculate the similarity between samples and samples, generally using Euclidean distance

2. Preliminary use of clustering algorithm API

learning target:

  • Know the use of clustering algorithm API

2.1 API introduction

sklearn.cluster.KMeans(n_clusters=8)

sklearn.cluster.KMeans is a class in the scikit-learn library that implements the K-Means clustering algorithm. A short usage sketch follows the parameter and method list below.

  • Main parameters :
    • n_clusters: int type, the default value is 8. The number of clusters to form and the number of centroids to generate.
    • init: {'k-means++', 'random'}, callable object or array of shape (n_clusters, n_features), default value is 'k-means++'. initialization method.
    • n_init: 'auto' or int type, the default value is 10. Number of times the k-means algorithm is run with different centroid seeds. The final result is the best output of the n_init consecutive runs in terms of inertia.
    • max_iter: int type, the default value is 300. The maximum number of iterations for a single run of the k-means algorithm.
    • tol: float type, the default value is 1e-4. Relative tolerance on the Frobenius norm of the difference in cluster centers between two successive iterations, used to declare convergence.
    • verbose: int type, the default value is 0. Verbose mode.
    • random_state: int, RandomState instance or None, the default value is None. Determines the random number generation for centroid initialization.
    • copy_x: bool type, the default value is True. When precomputing distances, it is more numerically accurate to center the data first.
    • algorithm: {"lloyd", "elkan", "auto", "full"}, default value is "lloyd". The K-means algorithm used.
  • return value :
    • Instantiating the class returns a KMeans object whose methods (e.g. fit, predict, etc.) can be used to cluster the data.
  • method :
    • fit(X[, y, sample_weight]): Calculate K-Means clustering.
    • fit_predict(X[, y, sample_weight]): Calculate the cluster center and predict the cluster index to which each sample belongs.
    • fit_transform(X[, y, sample_weight]): Calculate cluster centers and convert X to cluster distance.
    • get_params([deep]): Get the parameters of this estimator.
    • predict(X): Predict the closest cluster (center) each sample in X belongs to.
    • score(X[, y, sample_weight]): Scores the KMeans model given the data X.
    • set_params(**params): Sets the parameters of this estimator.
    • transform(X): Convert X to cluster distance space.
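A minimal usage sketch of the class described above (the array X below is made-up toy data for illustration; only the documented fit, predict, cluster_centers_ and inertia_ members are used):

import numpy as np
from sklearn.cluster import KMeans

# toy 2-D data with two obvious groups (made-up values, for illustration only)
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.1]])

estimator = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
estimator.fit(X)                        # learn the cluster centers

print(estimator.cluster_centers_)       # coordinates of the 2 centroids
print(estimator.predict([[1.0, 1.0]]))  # index of the nearest centroid for a new point
print(estimator.inertia_)               # within-cluster sum of squared distances (SSE)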

2.2 Case

Randomly create different two-dimensional data sets as training sets, and combine them with the k-means algorithm to cluster them. You can try to cluster different numbers of clusters and observe the clustering effect:

insert image description here

Different values of the n_clusters parameter give different clustering results:

insert image description here

2.2.1 Process Analysis

  1. import tool library
  2. Create a dataset and display it
  3. Apply K-means
  4. Show results

2.2.2 Code implementation

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score


# 1. Create the dataset
# X holds the sample features and y the cluster label; 1000 samples, 2 features each, 4 clusters in total
# Cluster centers at [-1, -1], [0, 0], [1, 1], [2, 2]; cluster standard deviations [0.4, 0.2, 0.2, 0.2]
X, y = make_blobs(n_samples=1000, n_features=2, centers=[[-1, -1], [0, 0], [1, 1], [2, 2]],
                  cluster_std=[0.4, 0.2, 0.2, 0.2], random_state=9)

# Visualize the dataset
plt.figure(dpi=300)
plt.scatter(X[:, 0], X[:, 1], marker='o')
plt.show()

insert image description here

# 2. Cluster with K-means and evaluate with the CH (Calinski-Harabasz) score
y_pred = KMeans(n_clusters=2, random_state=9).fit_predict(X)

plt.figure(dpi=300)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()

# Evaluate the clustering with the CH score
print(calinski_harabasz_score(X=X, labels=y_pred))

insert image description here


Try n_clusters=2/3/4 separately and compare the clustering results:

fig, axes = plt.subplots(1, 3, figsize=(20, 5), dpi=300)
n_clusters_ls = [2, 3, 4]
for idx, val in enumerate(n_clusters_ls):
    # 2. Cluster with K-means and evaluate with the CH score
    y_pred = KMeans(n_clusters=val, random_state=9).fit_predict(X)

    axes[idx].scatter(X[:, 0], X[:, 1], c=y_pred)
    axes[idx].set_title(f"n_clusters={val}")

    # Evaluate the clustering with the CH score
    print(f"CH score for n_clusters={val}:", calinski_harabasz_score(X=X, labels=y_pred))
plt.savefig("./不同n_clusters的聚类结果.png")
plt.show()
CH score for n_clusters=2: 3116.1706763322227
CH score for n_clusters=3: 2931.625030199556
CH score for n_clusters=4: 5924.050613480169

insert image description here

The higher the CH score, the better


Summary :

  • API: sklearn.cluster.KMeans(n_clusters=8)【know】
    • parameter:
      • n_clusters: the number of cluster centers to start with
    • method:
      • estimator.fit_predict(x): Calculate the cluster center and predict which class each sample belongs to.
        • It is equivalent to calling fit(x) first and then predict(x)

3. Clustering algorithm implementation process

learning target:

  • Master the implementation steps of K-means clustering

  • k-means actually contains two layers of content:
    • K: Number of initial center points (number of planned clusters)
    • means: each cluster center is taken as the mean of the data points assigned to it

3.1 Steps of K-means clustering

  1. Randomly set K points in the feature space as the initial cluster centers
  2. For each other point, the distance to K centers is calculated, and the unknown point selects the nearest cluster center point as the label category
  3. Then, based on the points assigned to each marked cluster center, recalculate the new center point (mean value) of each cluster
  4. If the calculated new center point is the same as the original center point (the center of mass no longer moves), then end; otherwise, repeat the second step process
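The four steps above can be written directly in NumPy. The sketch below is my own minimal illustration (the function name is made up, Euclidean distance is assumed, and empty clusters are not handled):

import numpy as np

def simple_kmeans(X, k, max_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick K samples as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centers no longer move, otherwise repeat from Step 2
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels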

The implementation process is explained by the following figure:

insert image description here

Dynamic graph demo:

insert image description here

The dots are the samples, and the × are the centers of the clusters

3.2 Case exercises

insert image description here

Step 1 : Randomly set K points in the feature space as the initial cluster centers (in this case, P1 and P2 are set)

insert image description here

Step 2 : For every other point, calculate the distance to the K centers; each unlabeled point takes the nearest cluster center as its label category.

insert image description here

insert image description here

Step 3 : Based on the points assigned to each marked cluster center, recalculate the new center point (mean value) of each cluster

insert image description here

Step 4 : If the newly calculated center points are the same as the original ones (the centroids no longer move), then stop; otherwise repeat from the second step [here the centers have moved, so a new round of iteration is needed]

insert image description here

Note : When the result no longer changes between iterations, the algorithm is considered to have converged and the clustering is complete. K-Means is guaranteed to terminate; it cannot keep re-selecting centroids forever.


Summary :

  • K-means clustering implementation process [master]
  • Determine the constant K in advance; K is the final number of cluster categories
  • Randomly select the initial point as the centroid, and classify the sample points into the most similar class by calculating the similarity between each sample and the centroid (here, the Euclidean distance)
  • Then, recalculate the centroid of each class (that is, the class center), repeat this process until the centroid does not change, and finally determine the category to which each sample belongs and the centroid of each class.
  • Notice:
    • Since the similarity between all samples and each centroid is calculated every time, the convergence speed of the K-Means algorithm is relatively slow on large-scale data sets.

4. Model Evaluation

learning target:

  • Know the implementation principles of SSE, "elbow" method, SC coefficient and CH coefficient in model evaluation

4.1 The sum of squares due to error (SSE)

Example: (the data -0.2, 0.4, -0.8, 1.3, -0.7 in the figure below are the differences between the actual values and the predicted values)

insert image description here

Application in k-means:

$\mathrm{SSE} = \sum^k_{i=1}\sum_{p \in C_i}|p-m_i|^2$

insert image description here

The contents of each part of the formula:

insert image description here

In the figure above: $k = 2$

  • SSE measures how loose the clustering is (e.g. in the figure, SSE (left) < SSE (right))
  • As the clustering iterates, the value of SSE will become smaller and smaller until it stabilizes at the end:
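In scikit-learn the SSE of a fitted model is exposed as the inertia_ attribute; the manual sum below (my own illustration on generated data) should match it:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=9)
km = KMeans(n_clusters=3, random_state=9).fit(X)

# SSE = sum over clusters of the squared distances from each point to its own centroid
sse = sum(((X[km.labels_ == i] - c) ** 2).sum()
          for i, c in enumerate(km.cluster_centers_))
print(sse, km.inertia_)   # the two values agree up to floating-point error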

Note : In K-means the initial centroids are chosen at random. If the chosen centroids happen to start out close to each other, the result can be poor, so the SSE that K-means converges to is a local optimum rather than the global optimum. As shown below:

insert image description here

If the initial centroids are poorly chosen, SSE will only reach a mediocre local optimum

insert image description here

4.2 "Elbow method" (Elbow method) - K value determination

insert image description here

  1. For a data set of n points, iteratively compute the clustering for k from 1 to n; after each clustering, compute the sum of squared distances from each point to the center of the cluster it belongs to;
  2. This sum of squares gradually decreases, reaching 0 when k == n, because then every point is the center of its own cluster;
  3. During this decrease there is an inflection point, the "elbow": the k at which the rate of decrease suddenly slows down is considered the best k value.

The elbow criterion also works for deciding when to stop training: real data is usually noisy, and we stop adding clusters when adding another cluster no longer brings much reward. See the sketch below.
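A common way to apply the elbow rule is to plot the SSE (inertia_) against k and look for the bend. A minimal sketch (the dataset and the range of k are arbitrary choices for illustration):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=4, random_state=9)

ks = range(1, 10)
sse = [KMeans(n_clusters=k, random_state=9).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, "o-")
plt.xlabel("k")
plt.ylabel("SSE")
plt.show()   # the "elbow" of the curve suggests a reasonable k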

4.3 Silhouette Coefficient (SC)

Combines the degree of cohesion (Cohesion) and degree of separation (Separation) of clustering to evaluate the effect of clustering:

insert image description here

Purpose : the internal distance $a$ is minimized and the external distance $b$ is maximized.

$S \in [-1, 1]$; the larger $S$ is (close to 1), the better; the smaller it is (close to -1), the worse.

$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} = \begin{cases} 1 - \frac{a(i)}{b(i)}, & a(i) < b(i)\\ 0, & a(i) = b(i)\\ \frac{b(i)}{a(i)} - 1, & a(i) > b(i) \end{cases}$

Compute the average distance $a(i)$ from sample $i$ to the other samples in the same cluster. The smaller $a(i)$ is, the lower the dissimilarity of sample $i$ within its cluster, and the more sample $i$ belongs in that cluster.

Compute the average distance $b(i)_j$ from sample $i$ to all samples of another cluster $C_j$, called the dissimilarity between sample $i$ and cluster $C_j$. The between-cluster dissimilarity of sample $i$ is defined as $b(i) = \min\{b(i)_1, b(i)_2, \dots, b(i)_k\}$: the larger $b(i)$ is, the less sample $i$ belongs to any other cluster.

The average silhouette coefficient is obtained by computing the silhouette coefficient of every sample and taking the mean. Its value range is $[-1, 1]$; the larger the coefficient, the better the clustering effect.

The closer the samples within a cluster are to each other, and the farther apart the samples of different clusters are, the better.
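scikit-learn exposes the average silhouette coefficient as sklearn.metrics.silhouette_score. A small sketch comparing several values of k (the data is generated only for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=1)

for k in (2, 3, 4, 5, 6):
    labels = KMeans(n_clusters=k, random_state=1).fit_predict(X)
    # average silhouette coefficient over all samples; closer to 1 is better
    print(k, silhouette_score(X, labels))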

case :

The following figure shows the data distribution of 500 samples containing 2 features, and we measure the effect of the SC coefficient on it:

insert image description here

n_clusters = 2 The average silhouette_score is : 0.7049787496083262
n_clusters = 3 The average silhouette_score is : 0.5882004012129721
n_clusters = 4 The average silhouette_score is : 0.6505186632729437
n_clusters = 5 The average silhouette_score is : 0.56376469026194
n_clusters = 6 The average silhouette_score is : 0.4504666294372765

When n_clusters are 2, 3, 4, 5, and 6 respectively, the SC coefficient is as follows, which is a metric between [-1,1]:

After each clustering, each sample will get a silhouette coefficient. When it is 1, it means that the point is far away from the surrounding clusters, and the result is very good; when it is 0, it means that the point may be between two clusters. On the boundary; when the value is negative, it implies that the point may be misclassified.

From the average SC coefficient results, $K$ = 3, 5 and 6 are not good choices; what about 2 and 4?

$K = 2$ case:

insert image description here

$K = 4$ case:

insert image description here

  • When n_clusters = 2, the width of the 0th cluster is much wider than that of the 1st cluster;
  • When n_clusters = 4, the cluster widths do not differ much, so $K = 4$ is chosen as the final number of clusters.

4.4 CH Coefficient (Calinski-Harabasz Index)

Calinski-Harabasz:

The smaller the within-category covariance and the larger the between-category covariance, the higher the Calinski-Harabasz score $s$ will be; the higher $s$ is, the better the clustering effect.

$s(k) = \frac{tr(B_k)}{tr(W_k)} \cdot \frac{m-k}{k-1}$

where:

  • $tr$ is the trace of a matrix; $B_k$ is the between-category covariance matrix and $W_k$ is the within-category covariance matrix
  • $m$ is the number of samples in the training set
  • $k$ is the number of categories.

Trace: defined as the sum of the entries on the main diagonal. For a matrix $\begin{pmatrix} a_{11} & a_{12} & \dots & a_{1n}\\ a_{21} & a_{22} & \dots & a_{2n}\\ \dots & \dots & \dots & \dots\\ a_{n1} & a_{n2} & \dots & a_{nn} \end{pmatrix}$, the trace is defined as $tr = a_{11} + a_{22} + \dots + a_{nn}$.

Intuition for using the trace of the matrix: the diagonal of a matrix can represent the similarity of an object.

In machine learning, the main purpose is to extract the essential characteristics of the data. Any matrix, once computed, can be simplified: obtaining just its trace captures the most important feature of that block of data, so a lot of irrelevant data can be discarded, which simplifies the data and improves processing speed.

The goal that CH needs to achieve: use as few categories as possible to cluster as many samples as possible, and at the same time obtain a better clustering effect .


Summary :

  • sse [know]
    • The smaller the value of the sum of squared errors, the better
  • Elbow method【know】
    • The k at which the rate of decrease suddenly slows down is considered the best k value
  • SC coefficient 【know】
    • The value is [-1,1], the larger the value, the better
  • CH coefficient [know]
    • The higher the score s, the better the clustering effect
    • The goal that CH needs to achieve: use as few categories as possible to cluster as many samples as possible, and at the same time obtain a better clustering effect.

5. Algorithm optimization

learning target:

  • Know the advantages and disadvantages of the k-means algorithm
  • Know the optimization principles of canopy, K-means++, dichotomous K-means, and K-medoids
  • Understand the optimization principles of kernel K-means, ISODATA, Mini-batch K-means

Summary of k-means algorithm :

  • Advantages :
    1. The principle is simple (near the center point), easy to implement
    2. The clustering effect is above average (depending on the choice of K)
    3. Space complexity $O(N)$, time complexity $O(I \cdot K \cdot N)$

$N$ is the number of sample points, $K$ is the number of cluster centers, $I$ is the number of iterations

  • Disadvantages :
    1. Sensitive to outliers and noise (central point is easy to shift)
    2. It is difficult to find clusters with large differences in size and perform incremental calculations
    3. The result is not necessarily the global optimum, only a local optimum (related to the number K and the selection of the initial values)

5.1 Canopy algorithm with initial clustering

The Canopy algorithm is a "coarse" clustering algorithm that is faster but less accurate . Different from traditional clustering algorithms (such as K-Means), the biggest feature of Canopy clustering is that it does not need to specify the k value (that is, the number of clusters) in advance, so it has great practical application value .

5.1.1 Canopy algorithm with initial clustering implementation process

The steps of the Canopy algorithm are as follows:

  1. Given a sample list $L = x_1, x_2, \dots, x_m$ and initial distance thresholds $T1$ and $T2$ with $T1 > T2$ ($T1$ and $T2$ can be chosen by yourself);
  2. Take any point $P$ from list $L$ and compute the distance from $P$ to all existing canopy centers (if there is no center yet, point $P$ becomes a new canopy); take the smallest such distance $D(P, a_j)$;
  3. If the distance $D$ is less than $T1$, the point belongs to that canopy and is added to its member list;
  4. If the distance $D$ is less than $T2$, the point not only belongs to the canopy but is also very close to its center, so $P$ is deleted from list $L$.

The Canopy algorithm can be used for "coarse" clustering to obtain the value of k and roughly K initial centroids; K-means is then used for further "fine" clustering. This Canopy + K-means combination usually gives a good clustering result.
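A minimal sketch of the Canopy pass described above (my own plain-NumPy illustration; T1 and T2 are chosen by hand and Euclidean distance is assumed):

import numpy as np

def canopy(X, t1, t2):
    """Coarse clustering: returns a list of (center, member indices). Requires t1 > t2."""
    remaining = list(range(len(X)))
    canopies = []
    while remaining:
        idx = remaining.pop(0)          # take any point P as a new canopy center
        center, members = X[idx], [idx]
        for j in remaining[:]:
            d = np.linalg.norm(X[j] - center)
            if d < t1:                  # within T1: the point joins this canopy
                members.append(j)
            if d < t2:                  # within T2: also remove it from the candidate list
                remaining.remove(j)
        canopies.append((center, members))
    return canopies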

insert image description here

insert image description here

5.1.2 Advantages and disadvantages of Canopy algorithm

Advantages :

  1. K-means is sensitive to noise and interference. Canopy helps because canopies with a small NumPoint (few member points) can simply be discarded, which improves robustness.
  2. The centerPoint of each canopy selected by Canopy is a more accurate choice for the K initial centers.
  3. K-means only needs to be run within each canopy, which reduces the number of similarity calculations.

In the Canopy algorithm, NumPoint refers to the number of points contained in Canopy. And centerPoint refers to the Canopy center point, that is, the center point of each Canopy selected by Canopy. These center points can be used as the initial centroids of the K-means algorithm, thereby improving the clustering effect of the K-means algorithm.

Disadvantages :

  1. Choosing $T1$ and $T2$ for the algorithm is itself a problem, and it may still fall into a local optimum
  2. The canopies in the final result may overlap with each other, but no object will be left belonging to no canopy at all.

5.2 K-means++ algorithm

The K-means++ algorithm is an algorithm for selecting initial values ​​(or "seeds") for the K-means clustering algorithm. It is an approximation algorithm for the NP-hard K-means problem, and it is a way of avoiding the weaker clustering sometimes found with the standard K-means algorithm.

The K-means++ algorithm is only improved in the way of initializing the cluster center, and other places are the same as the K-means clustering algorithm . K-means++’s method of initializing cluster centers can be summed up in one sentence: select k cluster centers one by one, and the farther the sample point from other cluster centers is, the more likely it is to be selected as the next cluster center. In this way, the situation that the initial cluster centers are in the same cluster can be avoided to the greatest extent, thereby improving the clustering effect.

$P = \frac{D(x)^2}{\sum_{x\in X}D(x)^2}$

where:

  • $P$ represents the probability that each sample point is selected as the next cluster center
  • $D(x)$ represents the shortest distance between sample point $x$ and the currently existing cluster centers (the farther a sample point is from the existing centers, the more likely it is to be selected as the next center)
  • In the K-means++ algorithm, specifically, suppose there are currently $m$ cluster centers $c_1, c_2, \dots, c_m$; then $D(x)$ can be computed by the following formula:

$D(x) = \min_{1 \le i \le m} \text{dist}(x, c_i)$

where $\text{dist}(x, c_i)$ is the distance between sample point $x$ and cluster center $c_i$. The distance measure can be chosen according to the problem, e.g. Euclidean distance or Manhattan distance.


In order to facilitate subsequent calculations, we denote $\sum_{x\in X}D(x)^2$ as $A$.

insert image description here

Suppose we first choose point 2 as a centroid; we can then compute $P$ for the other points, and select the next centroid according to these probabilities $P$, because the purpose of K-means++ is to make the selected centroids as spread out as possible.

As shown in the figure below, if the first centroid is chosen at the center of the circle, the next point is most likely to be selected from the region labelled $P(A)$ (the regions are divided by color).

insert image description here


The K-means++ algorithm flow is as follows :

  1. Randomly (uniformly) select one sample point from the dataset $\mathcal{X}$ as the first initial cluster center $c_1$;
  2. Compute the shortest distance $D(x)$ between each sample and the currently existing cluster centers; then compute the probability $P(x)$ that each sample point is selected as the next cluster center, and finally take the sample point corresponding to the largest probability value as the next cluster center.
  3. Repeat step ② until k cluster centers are selected.
  4. Clustering was performed using the standard K-means algorithm.

The flow of the standard K-means algorithm is as follows :

  1. Assign each sample point to the cluster where its nearest cluster center is located.
  2. Calculate the mean of each cluster and use it as the new cluster center.
  3. Repeat steps ① and ② until the cluster center does not change or reaches the maximum number of iterations.

The K-means++ algorithm can effectively avoid the weak clusters sometimes found by the standard K-means algorithm by improving the selection method of the initial cluster center, thereby improving the clustering effect.
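A sketch of the seeding step (my own NumPy illustration; it draws the next center with probability proportional to $D(x)^2$, which is the standard probabilistic variant of the selection rule described above):

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # 1. pick the first center uniformly at random
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # 2. D(x)^2: squared shortest distance from each sample to the chosen centers
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # P(x) = D(x)^2 / sum(D(x)^2); sample the next center from this distribution
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)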

5.3 Dichotomous K-means Algorithm

The binary (bisecting) K-means algorithm is an improvement on the K-means algorithm. Its main idea is to start with all points in one cluster and keep splitting clusters in two until k clusters are obtained.

The process of the binary K-means algorithm is as follows:

  1. Treat all points as a cluster;
  2. For each cluster, do the following:
    • Calculate the total error (SSE);
    • Perform K-Means clustering on a given cluster (k=2);
    • Compute the total error (SSE) after splitting the cluster in two;
  3. Choose the split that makes the total SSE smallest, i.e. split the cluster with the larger error;
  4. Repeat steps ② and ③ until the number of clusters specified by the user is reached.

The binary K-means algorithm can effectively solve the problem that the K-means algorithm converges to a local minimum and improve the clustering effect.
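A compact sketch of the splitting loop (my own illustration built on sklearn's KMeans; clusters are assumed to stay large enough to be split in two). Recent scikit-learn versions also ship a ready-made sklearn.cluster.BisectingKMeans estimator.

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    clusters = [np.arange(len(X))]   # start with one cluster holding all points
    while len(clusters) < k:
        # pick the cluster with the largest SSE and split it in two
        sse = [KMeans(n_clusters=1, random_state=seed).fit(X[idx]).inertia_
               for idx in clusters]
        worst = clusters.pop(int(np.argmax(sse)))
        labels = KMeans(n_clusters=2, random_state=seed).fit_predict(X[worst])
        clusters.append(worst[labels == 0])
        clusters.append(worst[labels == 1])
    return clusters                  # list of index arrays, one per cluster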

insert image description here

insert image description here

insert image description here

A principle implicit in dichotomous K-means :

The clustering sum of squared errors measures clustering quality: the smaller it is, the closer the data points are to their centroids and the better the clustering. Therefore the cluster with the largest sum of squared errors is the one to split again, because a large SSE indicates that this cluster is poorly clustered and probably contains several clusters that are being treated as one, so it should be divided first.

The binary K-means algorithm speeds up K-means because it performs fewer similarity calculations and is less affected by initialization: there is no global random selection of k points, and each step keeps the split with the smallest error.

5.4 K-medoids (K-center clustering algorithm)

There is a difference between K-medoids and K-means, the difference lies in the selection of the center point.

  • In K-means, the center point is taken as the average of all data points in the current cluster, so it is very sensitive to abnormal points.
  • In K-medoids, the point with the smallest sum of distances from the current cluster to all other points (in the current cluster) is taken as the center point.

medoids: /ˈmɛdɔɪdz/, i.e. center points


The K-medoids algorithm (also known as the K-center clustering algorithm) is a classic partitioning clustering technique that divides a data set of n objects into k clusters, where the value of k (the number of clusters) is known before the algorithm runs (meaning the programmer must specify k before executing K-medoids).

Unlike the K-means algorithm, the K-medoids algorithm chooses actual data points as centers (medoids or exemplars), thus allowing better interpretation of cluster centers than K-means. Furthermore, K-medoids can be used with arbitrary dissimilarity measures, whereas K-means usually requires Euclidean distance for efficient solutions. Because K-medoids minimizes the sum of pairwise dissimilarities rather than the sum of squared Euclidean distances, it is more resistant to noise and outliers than K-means.

The medoid of a cluster is defined as the object whose average dissimilarity with all objects in the cluster is the smallest, that is, it is the most central point in the cluster.


insert image description here

Algorithm flow :

  1. Randomly select k objects as initial medoids.
  2. Assign each remaining object to the cluster represented by its nearest medoid.
  3. For each cluster, calculate the criterion function value corresponding to each member object, and select the object corresponding to the minimum criterion function value as the new medoid.
  4. Repeat steps ② and ③ until all medoid objects no longer change or reach the maximum number of iterations.

Among them, the criterion function is defined as the sum of distances between a certain member object and other member objects in a class.


Compared with the K-means algorithm, the K-medoids algorithm is more robust to noise .

Example: When there are only a few sample points in a cluster, such as (1, 1) (1, 2) (2, 1) (1000, 1000). where (1000, 1000) is the noise.

According to K-means, the centroid will be roughly in the middle of (1, 1) (1000, 1000). This is obviously not what we want.

K-medoids avoids this situation: it selects one of the sample points (1,1), (1,2), (2,1), (1000,1000) so as to minimize the cluster's absolute error, and the calculation shows that it will certainly pick one of the first three points.

K-medoids is only practical for small samples: on large data sets it is too slow. Moreover, when there are many samples, a few noise points do not pull the K-means centroid as far as one might expect, so K-means is applied far more often than K-medoids.
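The toy example above can be checked directly. A minimal sketch (my own) that picks, as the medoid, the member point with the smallest total distance to the other members:

import numpy as np

pts = np.array([[1, 1], [1, 2], [2, 1], [1000, 1000]])   # (1000, 1000) is the noise point

# the mean (K-means style centroid) is dragged toward the outlier
print(pts.mean(axis=0))                                  # roughly [251, 251]

# the medoid: the actual sample with the smallest sum of distances to the others
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
medoid = pts[dist.sum(axis=1).argmin()]
print(medoid)                                            # one of the first three points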


simply put:

For small data sets, k-medoids can be used, and the effect is generally better than K-means. But for large data sets, the K-means algorithm is still used.

5.5 Kernel K-means algorithm (understand)

Kernel K-means algorithm is a kind of K-means algorithm based on kernel method, which can deal with non-linearly separable data. It maps the data to a high-dimensional space, making the data that was originally linearly inseparable in the low-dimensional space become linearly separable in the high-dimensional space, thereby improving the clustering effect .

The process of the Kernel K-means algorithm is similar to the standard K-means algorithm, but the kernel function is used to calculate the similarity between sample points when calculating the distance. This can effectively solve the problem that the standard K-means algorithm cannot handle nonlinearly separable data.

Kernel k-means is actually a process of projecting each sample into a high-dimensional space, and then clustering the processed data using the common k-means algorithm idea.

insert image description here

5.6 ISODATA algorithm (understand)

ISODATA algorithm (Iterative Self-Organizing Data Analysis Techniques Algorithm, iterative self-organizing data analysis technique algorithm) is an improved k-means algorithm. It introduces criteria for category evaluation in the clustering process, automatically merges or splits certain categories according to the criteria, and breaks through the limitation of the number of categories to a certain extent.

The algorithm can dynamically adjust the number of cluster centers according to the actual situation of the samples contained in each class during the clustering process. If the degree of dispersion of samples in a class is large (measured by variance) and the number of samples is large, it will be split; if two categories are relatively close (measured by the distance between cluster centers), then They perform merge operations.

The ISODATA algorithm is a repetitive self-organizing data analysis technique. It calculates the uniformly distributed class mean value in the data space, and then uses the minimum distance technique to iteratively aggregate the remaining pixels. The mean value is recalculated for each iteration, and according to the obtained new mean value, Classify the pixels again.


Features :

  • The number of categories changes with the clustering process
  • The number of categories will be merged and split:
    • Merge : When the number of samples in a certain class of clustering results is too small, or the distance between two classes is too close
    • Splitting : When the intra-class variance of a certain class in the clustering result is too large, the class is split

5.7 Mini Batch K-Means algorithm (understand)

The Mini Batch K-Means algorithm is an optimized variant of K-Means suited to clustering big data. It uses small, randomly drawn subsets (mini-batches) of the data to reduce computation time while still optimizing the same objective function. Training on these random subsets greatly reduces the computation and convergence time compared with standard K-Means, and the results produced by mini-batch K-Means are generally only slightly worse than those of the standard algorithm.

The Mini Batch K-Means algorithm can reduce the convergence time of the K-Means algorithm, and the result is only slightly worse than the standard K-Means algorithm.

Usually when the sample size is greater than 10,000 for clustering, you need to consider using the Mini Batch K-Means algorithm.

The iterative step of the algorithm has two steps:

  1. Randomly select some data from the data set to form a mini-batch, and assign them to the nearest centroid
  2. update centroid

Compared with K-means, the data is updated on each small sample set. For each mini-batch, an updated centroid is obtained by computing the mean, and the data in the mini-batch is assigned to the centroid. As the number of iterations increases, the changes in these centroids are gradually reduced until the centroid is stable or reaches the specified number of iterations, and the calculation is stopped.
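scikit-learn ships this variant as sklearn.cluster.MiniBatchKMeans. A small sketch comparing it with the standard estimator (the data size and batch size are arbitrary illustration values):

from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=100_000, centers=5, random_state=0)

full = KMeans(n_clusters=5, random_state=0).fit_predict(X)
mini = MiniBatchKMeans(n_clusters=5, batch_size=1024, random_state=0).fit_predict(X)

# the mini-batch result is usually only slightly worse, but it trains much faster
print(calinski_harabasz_score(X, full), calinski_harabasz_score(X, mini))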


Summary :

  • Summary of advantages and disadvantages of k-means algorithm 【Know】
    • Advantages :
      1. The principle is simple (near the center point), easy to implement
      2. The clustering effect is medium to high (depending on the choice of K)
      3. Space complexity $O(N)$, time complexity $O(I \cdot K \cdot N)$
    • Disadvantages :
      1. Sensitive to outliers and noise (central point is easy to shift)
      2. It is difficult to find clusters with large differences in size and perform incremental calculations
      3. The result is not necessarily the global optimum, but only a local optimum (related to the number of K and the selection of initial values)
  • Optimization method [know]
    • Canopy + K-means: coarse clustering with Canopy first, then K-means
    • K-means++: the farther a point is from the existing centroids, the more likely it becomes the next centroid
    • Binary (bisecting) K-means: split the cluster with the largest SSE
    • K-medoids: differs from K-means in how the center point is chosen
    • Kernel K-means: map the samples to a high-dimensional space first
    • ISODATA: dynamic clustering, the value of K can change
    • Mini-batch K-means: cluster large data sets in small batches

6. Feature Engineering - Feature Dimensionality Reduction

learning target:

  • Know the definition of dimensionality reduction
  • Know the dimensionality reduction process through low-variance filtering
  • The process of knowing the correlation coefficient to achieve dimensionality reduction
  • Know the implementation process of principal component analysis

6.1 Dimensionality reduction

6.1.1 Definition

In the field of machine learning and statistics, dimensionality reduction refers to the process of reducing the number of random variables to obtain a set of "uncorrelated" main variables under certain limited conditions. Dimensionality reduction can be further subdivided into two major methods, variable selection and feature extraction .

Simply put, dimensionality reduction is to use a certain mapping method to map data points in the original high-dimensional space to a low-dimensional space. The purpose of doing this is to reduce the error caused by redundant information and noise information, and improve the accuracy in the application. Common data dimensionality reduction methods include: PCA, LDA, MDS, ISOMAP, SNE, T-SNE, AutoEncoder, etc.


In short, dimensionality reduction refers to the process of reducing the number of random variables (features) under certain limited conditions to obtain a set of "uncorrelated" main variables.

The key is: reduce the number of relevant variables

insert image description here


Q : For example, for a data set, there are features such as ABCDEFG. If the correlation of the three features of BCD is high, is it possible to select a feature from BCD (choose C here) to replace the other two similar features (B and D), that is, the characteristics of the data set have changed from the original ABCDEFG to ACEFG. Is this understanding correct?
A : This understanding is correct. In this example, if the three BCD features are highly correlated, they probably contain a lot of duplicated information. In this case, we can use a variable selection method to select one feature (such as C) from BCD and delete the other two features (B and D) to reduce redundant information in the dataset. In this way, the features in the data set are changed from the original ABCDEFG to ACEFG.

Of course, this is only one method of dimensionality reduction. Another common dimensionality reduction method is feature extraction, which generates new, fewer features by applying some transformation to the original features. For example, in the example you provided, we can use methods such as PCA to convert the three features of BCD into a new feature H, and then replace the original three features of BCD with H. In this way, the features in the data set are changed from the original ABCDEFG to AEFGH.

In summary, dimensionality reduction aims to reduce redundant information and noise information in the dataset to improve the accuracy of the model in application.


Related features:

  • Correlation between relative humidity and rainfall
  • etc.

Precisely because training learns from the features, a problem with a feature itself, or strong correlation between features, has a large impact on how the algorithm learns and predicts.

6.1.2 Two ways of dimensionality reduction

  1. feature selection
  2. Principal component analysis (which can be understood as a form of feature extraction)

6.2 Feature Selection

6.2.1 Definition

The data often contains redundant or irrelevant variables (also called features, attributes, indicators, etc.); feature selection aims to find the main features among the original ones.

insert image description here

6.2.2 Method

  • Filter (filter type) : Mainly explore the characteristics of the feature itself, the relationship between the feature and the feature and the target value.
    • Variance Selection Method: Low Variance Feature Filtering
    • correlation coefficient
  • Embedded : The algorithm automatically selects features (associations between features and target values)
    • Decision tree: information entropy, information gain
    • Regularization: L1, L2
    • deep learning:
      • convolution

6.2.3 Low Variance Feature Filtering

Delete features with low variance (the meaning of variance was discussed earlier). This method judges features from the angle of how large their variance is.

  • The variance of the feature is small: the value of most samples of a feature is relatively similar
  • Large feature variance: the values ​​of many samples of a feature are different

Variance is a measure used to measure the degree of dispersion of a set of data. It represents the difference between each variable (observation) and the population mean. The larger the variance, the greater the fluctuation of the data; the smaller the variance, the smaller the fluctuation of the data .

6.2.3.1 API

sklearn.feature_selection.VarianceThreshold(threshold=0.0)
  • Role : sklearn.feature_selection.VarianceThreshold is a feature selector that removes all low-variance features. This feature selection algorithm looks only at the features (X), not at the desired output (y), so it can be used for unsupervised learning.
  • Parameters :
    • threshold: float, the default value is 0. Features with training set variance below this threshold will be removed. By default, all features with non-zero variance are kept, that is, features with the same value in all samples are removed (if they are slightly different, they are kept).
  • properties :
    • variances_: Array, shape (n_features,). The variance of each feature.
    • n_features_in_:int. The number of features seen during fit.
    • feature_names_in_: ndarray with shape (n_features_in_,). Feature names seen during fit. Defined only if X has feature names that are all strings.
  • method :
    • fit(X, y=None): learn empirical variance from X.
    • fit_transform(X, y=None, **fit_params): fit the data, then transform it.
    • get_feature_names_out(input_features=None): Masks the feature name based on the selected feature.
    • get_params(deep=True): Get the parameters of this estimator.
    • get_support(indices=False): Get the mask or integer index of the selected features.
    • inverse_transform(X): Inverts the conversion operation.
    • set_params(**params): Sets the parameters of this estimator.
    • transform(X): Reduces X to the selected feature.

6.2.3.2 Data calculation

We filter among some stock index features; the 'index', 'date' and 'return' columns are excluded (their types do not match and they are not the indicators we need).

There are a total of the following characteristics:

'pe_ratio', 'pb_ratio', 'market_cap', 'return_on_asset_net_profit', 'du_return_on_equity', 'ev', 'earnings_per_share', 'revenue', 'total_expense'

insert image description here

  • analyze:
    1. Initialize VarianceThreshold, specify the threshold variance
    2. Call the fit_transform method
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import VarianceThreshold

data = pd.read_csv("./data/factor_returns.csv")

# 1. Instantiate a transformer
transfer = VarianceThreshold(threshold=1)

# 2. Call the fit_transform method
data = transfer.fit_transform(data.iloc[:, 1:10])
print("Shape:", data.shape)
print("Result after removing low-variance features:\r\n", data)
Shape: (2318, 8)

Result after removing low-variance features:
 [[ 5.95720000e+00  1.18180000e+00  8.52525509e+10 ...  1.21144486e+12
   2.07014010e+10  1.08825400e+10]
 [ 7.02890000e+00  1.58800000e+00  8.41133582e+10 ...  3.00252062e+11
   2.93083692e+10  2.37834769e+10]
 [-2.62746100e+02  7.00030000e+00  5.17045520e+08 ...  7.70517753e+08
   1.16798290e+07  1.20300800e+07]
 ...
 [ 3.95523000e+01  4.00520000e+00  1.70243430e+10 ...  2.42081699e+10
   1.78908166e+10  1.74929478e+10]
 [ 5.25408000e+01  2.46460000e+00  3.28790988e+10 ...  3.88380258e+10
   6.46539204e+09  6.00900728e+09]
 [ 1.42203000e+01  1.41030000e+00  5.91108572e+10 ...  2.02066110e+11
   4.50987171e+10  4.13284212e+10]]

6.2.4 Correlation coefficient

Main implementation methods:

  1. Pearson correlation coefficient
  2. Spearman correlation coefficient

6.2.4.1 Pearson Correlation Coefficient

1. Function

The Pearson correlation coefficient is a statistical indicator used to reflect how closely two variables are correlated. Its value range is $[-1, 1]$.

  • When the coefficient is close to 1, it means that the two variables are positively correlated, that is, when one variable increases, the other also increases;
  • When the coefficient is close to -1, it means that the two variables are negatively correlated, that is, when one variable increases, the other will decrease;
  • When the coefficient is close to 0, it means that there is no linear relationship between the two variables.

2. Formula calculation case (understand, no need to memorize)

$r = \frac{n\sum{xy} - \sum{x}\sum{y}}{\sqrt{n\sum{x^2} -(\sum{x})^2}\sqrt{n\sum{y^2} - (\sum{y})^2}}$

where:

  • $r$ : Pearson's correlation coefficient, used to measure the linear relationship between two variables.
  • $n$ : number of samples.
  • $x$, $y$ : the values of the two variables.
  • $\sum{xy}$ : the sum of the products of $x$ and $y$ over all samples.
  • $\sum{x}$, $\sum{y}$ : the sums of $x$ and of $y$ over all samples.
  • $\sum{x^2}$, $\sum{y^2}$ : the sums of $x^2$ and of $y^2$ over all samples.
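The raw-sum formula can be checked numerically against a library implementation. A small sketch (the arrays below are arbitrary illustration values) comparing it with numpy's corrcoef:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
n = len(x)

# Pearson r from the raw-sum formula above
r = (n * (x * y).sum() - x.sum() * y.sum()) / (
    np.sqrt(n * (x ** 2).sum() - x.sum() ** 2) *
    np.sqrt(n * (y ** 2).sum() - y.sum() ** 2))

print(r, np.corrcoef(x, y)[0, 1])   # the two values agree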

Example : suppose we examine annual advertising investment and average monthly sales.

insert image description here

So how to calculate the Pearson correlation coefficient between them?

insert image description here

Final calculation:

$\frac{10\times 16679.09 - 346.2 \times 422.5}{\sqrt{10 \times 14304.52 - 346.2^2}\ \sqrt{10 \times 19687.81 - 422.5^2}} = 0.9942$

So we finally came to the conclusion that there is a high positive correlation between advertising investment and monthly average sales.

The value range of the Pearson correlation coefficient is $[-1, 1]$.

3. Features

The correlation coefficient lies in $[-1, 1]$, i.e. $-1 \le r \le 1$. Its properties are as follows:

  • When $r > 0$, the two variables are positively correlated; when $r < 0$, they are negatively correlated
  • When $|r| = 1$, the two variables are perfectly correlated; when $r = 0$, there is no linear correlation between them
  • When $0 < |r| < 1$, the two variables are correlated to some degree. The closer $|r|$ is to 1, the stronger the linear relationship between them; the closer $|r|$ is to 0, the weaker the linear correlation. Three levels are commonly distinguished:
    • $|r| < 0.4$: low correlation
    • $0.4 \le |r| < 0.7$: significant correlation
    • $0.7 \le |r| < 1$: high linear correlation

4. APIs

from scipy.stats import pearsonr
  • Function : scipy.stats.pearsonr is a function for computing the Pearson correlation coefficient, which measures the linear relationship between two variables. It also returns a p-value for testing non-correlation.
  • Parameters :
    • x: (N,) array_like, input array.
    • y: (N,) array_like, input array.
  • return value :
    • r: float, Pearson correlation coefficient, the value range is [-1, 1].
    • p-value: float, two-tailed p-value.

5. Case

from scipy.stats import pearsonr


x1 = [12.5, 15.3, 23.2, 26.4, 33.5, 34.4, 39.4, 45.2, 55.4, 60.9]
x2 = [21.2, 23.9, 32.9, 34.1, 42.5, 43.2, 49.0, 52.8, 59.4, 63.5]

r, p_value = pearsonr(x1, x2)
print("r:", r)
print("p-value:", p_value)
r: 0.9941983762371884
p-value: 4.922089955456964e-09

According to the results, the Pearson correlation coefficient between x1 and x2 is 0.9941983762371884, which indicates a very strong positive correlation between the two variables: when x1 increases, x2 also increases.

The two-tailed p-value 4.922089955456964e-09 is very close to 0. Typically, if the p-value is less than the significance level (for example, 0.05), we can reject the null hypothesis (that there is no correlation between the two variables) and conclude that the correlation is significant. Here the p-value is very small, so we can conclude that there is a significant correlation between x1 and x2.

6.2.4.2 Spearman's rank correlation coefficient (Rank IC)

1. Function :

Spearman's rank correlation coefficient (Spearman's rank correlation coefficient, referred to as rank correlation coefficient or rank correlation coefficient) is a non-parametric index to measure the correlation between two variables. It uses a monotonic function to evaluate the correlation of two statistical variables. When there are no repeated values ​​in the data and when the two variables are completely monotonically correlated, the Spearman correlation coefficient is +1 or −1. A positive Spearman correlation coefficient reflects a monotonically increasing trend between two variables X and Y. A negative Spearman correlation coefficient reflects a monotonically decreasing trend between two variables X and Y.

2. Formula (understand, no need to memorize) :

$\mathrm{RankIC} = 1 - \frac{6\sum{d_i^2}}{n(n^2 - 1)}$

where:

  • RankIC: the Spearman rank correlation coefficient, used to measure the monotonic relationship between two variables.
  • $d_i$: the rank difference of the $i$-th observation, i.e. the difference between its ranks under the two variables.
  • $n$: number of samples.

Q : What is the rank difference?
A : In the calculation of Spearman's rank correlation coefficient, the rank difference is the difference between the ranks of an observation under the two variables. For example, with two variables x and y, after ranking the values of each variable, the rank difference of an observation is its rank in x minus its rank in y. These rank differences are then used to compute the Spearman rank correlation coefficient, which measures the monotonic relationship between the two variables.

For example, suppose there are two sets of data:

x = [1, 2, 3, 4]
y = [2, 3, 1, 4]

First, we need to sort each set of data and assign a rank to each:

x_sorted = [1, 2, 3, 4]
x_ranks = [1, 2, 3, 4]

y_sorted = [1, 2, 3, 4]
y_ranks = [2, 3, 1, 4]

We can then compute the rank difference for each observation:

rank_differences = [-1, -1, 2, 0]

Finally, we can use these rank differences to calculate the Spearman rank correlation coefficient.
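Plugging these rank differences into the formula above gives the coefficient for this toy example:

$\rho = 1 - \frac{6\sum d_i^2}{n(n^2-1)} = 1 - \frac{6 \times (1 + 1 + 4 + 0)}{4 \times (4^2 - 1)} = 1 - \frac{36}{60} = 0.4$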


Example :

insert image description here

3. Features :

  • The Spearman correlation coefficient indicates the direction of correlation between X (independent variable) and Y (dependent variable). If Y tends to increase as X increases, the Spearman correlation is positive
  • Like the Pearson correlation coefficient, its value still lies in $[-1, 1]$

The Spearman correlation coefficient is more widely used than the Pearson correlation coefficient

4. API :

from scipy.stats import spearmanr
  • Role : scipy.stats.spearmanr is a function for computing the Spearman rank correlation coefficient, which measures the monotonic relationship between two variables. It also returns a p-value for testing non-correlation.
  • Parameters :
    • a: (N,) array_like, input array.
    • b: (N,) array_like, input array, optional.
    • axis: int or None, optional. If axis=0 (the default), each column represents a variable and rows contain observations. If axis=1, the relationship is transposed: each row represents a variable, while columns contain observations. If axis=None, both arrays will be expanded.
    • nan_policy: {'propagate', 'raise', 'omit'}, optional. Defines what to do when the input contains nan. Available options are (default 'propagate'): 'propagate': return nan; 'raise': throw an error; 'omit': ignore nan values ​​for calculation.
    • alternative: {'two-sided', 'less', 'greater'}, optional. Define the alternative hypothesis. Defaults to 'two-sided'. Available options are: 'two-sided': the correlation is not zero; 'less': the correlation is negative (less than zero); 'greater': the correlation is positive (greater than zero).
  • return value :
    • correlation: float or ndarray (2-D square). Spearman correlation matrix or correlation coefficient (if only 2 variables are given as arguments). The correlation matrix is ​​square and has a length equal to the total number of variables (columns or rows) in which a and b are combined.
    • pvalue: float or ndarray (2-D square). Two-tailed p-value.

5. Case :

from scipy.stats import spearmanr


x1 = [12.5, 15.3, 23.2, 26.4, 33.5, 34.4, 39.4, 45.2, 55.4, 60.9]
x2 = [21.2, 23.9, 32.9, 34.1, 42.5, 43.2, 49.0, 52.8, 59.4, 63.5]

r, p_value = spearmanr(x1, x2)
print("r:", r)
print("p-value:", p_value)
r: 0.9999999999999999
p-value: 6.646897422032013e-64

According to the results, the Spearman rank correlation coefficient between x1 and x2 is 0.9999999999999999, which indicates a very strong positive correlation between the two variables: when x1 increases, x2 also increases.

The two-tailed p-value 6.646897422032013e-64 is very close to 0. Typically, if the p-value is less than the significance level (for example, 0.05), we can reject the null hypothesis (that there is no correlation between the two variables) and conclude that the correlation is significant. Here the p-value is very small, so we can conclude that there is a significant correlation between x1 and x2.

6.3 Principal Component Analysis (PCA)

6.3.1 What is Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a commonly used data dimensionality reduction technique. It transforms a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components through a linear transformation. Principal component analysis can reduce the dimensionality of a data set by retaining low-dimensional principal components and ignoring high-dimensional principal components, while retaining the features that contribute the most to variance in the data set .

Definition : The process of transforming high-dimensional data into low-dimensional data, during which the original data may be discarded and new variables created.

Function : data dimensionality compression, reduce the dimensionality (complexity) of the original data as much as possible, and lose a small amount of information.

Application : regression analysis or cluster analysis.

The term "information" was introduced in the decision tree notes.

So how to better understand this process? Let's look at a picture.

insert image description here

If we want to grasp the overall appearance of a teapot, the fourth picture is clearly the one from which a complete teapot can be recognized most easily. Although we can no longer see every detail, that does not prevent us from capturing the main factors.

6.3.2 API

sklearn.decomposition.PCA(n_components=None)
  • Role : sklearn.decomposition.PCA is the class in the scikit-learn library that implements principal component analysis (PCA). Through a linear transformation it converts a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components. PCA reduces the dimensionality of a dataset by keeping the lower-order principal components and ignoring the higher-order ones, while retaining the features that contribute most to the variance of the dataset.

  • Parameters :

    • n_components: int, float, None or str, the default value is None. The number of components to keep.
      • Decimal (a float between 0 and 1): the percentage of information (explained variance) to retain
      • Integer: the number of features (dimensions) to reduce to
      • If n_components is not set, all components are kept: n_components == min(n_samples, n_features).
      • If n_components == 'mle' and svd_solver == 'full', use Minka's MLE to guess the dimensionality. Using n_components == 'mle' will interpret svd_solver == 'auto' as svd_solver == 'full'.
      • If 0 < n_components < 1 and svd_solver == 'full', the number of components is chosen such that the amount of variance to be explained is greater than the percentage specified by n_components.
      • If svd_solver == 'arpack', the number of components must be strictly less than the minimum of n_features and n_samples. Therefore, the result in the None case is: n_components == min(n_samples, n_features) - 1.
    • copy: bool, default is True. If False, the data passed to fit will be overwritten, and running fit(X).transform(X) will not give the expected result, and fit_transform(X) should be used instead.
    • whiten: bool, default is False. When True (default False), the components_ vectors are multiplied by the square root of n_samples and divided by the singular values to ensure that the outputs are uncorrelated and have unit component-wise variance. Whitening removes some information from the transformed signal (the relative variance proportions of the components), but can sometimes improve the prediction accuracy of downstream estimators by making their data obey hardwired assumptions.
    • svd_solver: {'auto', 'full', 'arpack', 'randomized'}, default value is 'auto'.
      • If auto: choose the solver's default strategy based on X.shape and n_components:
        • If the input data is larger than 500x500 and the number of components to extract is less than 80% of the smallest dimension of the data, the more efficient 'randomized' method is used.
        • Otherwise the exact full SVD is computed and optionally truncated afterwards.
      • If full: Runs exact full SVD calling the standard LAPACK solver via scipy.linalg.svd and selecting components via postprocessing.
    • tol: float, default value is 0.0. Convergence parameter for svd_solver == 'arpack'.
    • iterated_power: int or 'auto', default is 'auto'. Number of iterations for the power method computed when svd_solver == 'randomized'.
    • random_state: int, RandomState instance or None, the default value is None. Controls the seed of the random number generator; used when svd_solver == 'arpack' or 'randomized'.
  • method :

    • fit(X[, y]): Fit the model with X.
    • fit_transform(X[, y]): Fit the model and perform the transformation.
    • get_covariance(): Computes data covariance.
    • get_params([deep]): Get the parameters of this estimator.
    • get_precision(): Compute the data precision matrix.
    • inverse_transform(X): Convert the data back to the original space.
    • score(X[, y]): Return the average log-likelihood of all samples.
    • score_samples(X): Returns an array of sample log-likelihoods.
    • set_params(**params): Sets the parameters of this estimator.
    • transform(X): Reduce the dimensionality of the data.
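
A minimal sketch of how a few of these methods fit together, using the same toy data as the next subsection (explained_variance_ratio_ is a fitted attribute, not listed above, that reports each component's share of the variance):

from sklearn.decomposition import PCA
import numpy as np


X = np.array([[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]], dtype=float)

pca = PCA(n_components=2)
X_low = pca.fit_transform(X)           # fit the model and reduce the dimensionality
X_back = pca.inverse_transform(X_low)  # map the reduced data back to the original space

print(pca.explained_variance_ratio_)   # share of variance explained by each component
print(np.allclose(X, X_back))          # True: 3 centered samples have rank <= 2, so 2 components lose nothing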

6.3.3 Data Calculation

Let's start with a simple calculation:

from sklearn.decomposition import PCA


data = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]

# 1. 实例化PCA,小数:保留多少信息
transfer = PCA(n_components=0.9)

# 2. 调用fit_transform方法
data_PCA = transfer.fit_transform(data)

print(f"保留90%的信息后,降维的结果为:\r\n{
      
      data_PCA}")
保留90%的信息后,降维的结果为:
[[ 1.28620952e-15  3.82970843e+00]
 [ 5.74456265e+00 -1.91485422e+00]
 [-5.74456265e+00 -1.91485422e+00]]
from sklearn.decomposition import PCA


data = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]

# 1. 实例化PCA,小数:保留多少信息
transfer = PCA(n_components=2)

# 2. 调用fit_transform方法
data_PCA = transfer.fit_transform(data)

print(f"降维到2维后的结果为:\r\n{
      
      data_PCA}")
降维到2维后的结果为:
[[ 1.28620952e-15  3.82970843e+00]
 [ 5.74456265e+00 -1.91485422e+00]
 [-5.74456265e+00 -1.91485422e+00]]

  • The definition of dimensionality reduction [understand]
    • It reduces the number of features (columns): choosing which columns to keep and which to delete
    • The goal is to get a set of "uncorrelated" primary variables
  • Two ways of dimensionality reduction【Understand】
    • feature selection
    • Principal component analysis (which can be understood as a form of feature extraction)
  • feature selection [know]
    • Definition: Eliminate redundant variables in the data
    • method:
      • Filter: mainly examines the characteristics of the features themselves, the relationships between features, and between features and the target value
        • Variance selection method: low variance feature filtering (low variance means low discrimination)
        • correlation coefficient
      • Embedded: Algorithms automatically select features (associations between features and target values)
        • Decision tree: information entropy, information gain
        • Regularization: L1, L2
  • Low variance feature filtering 【know】
    • Eliminate a column with a relatively small variance
    • API:sklearn.feature_selection.VarianceThreshold(threshold=0.0)
      • Remove all low variance features
      • Note: The parameter threshold must specify the value
  • Correlation coefficient【master】
    • Main implementation methods:
      • Pearson correlation coefficient
      • Spearman correlation coefficient
    • Pearson correlation coefficient
      • Calculated from the actual values themselves
      • relatively complex
      • API:from scipy.stats import pearsonr
        • The closer the return value is to 1, the stronger the correlation
        • The closer the return value is to 0, the weaker the correlation
    • Spearman correlation coefficient
      • Calculated from rank (grade) differences
      • Simpler than the previous one (Pearson correlation coefficient)
      • API:from scipy.stats import spearmanr
        • The closer the return value is to 1, the stronger the correlation
        • The closer the return value is to 0, the weaker the correlation
  • PCA【know】
    • Definition: High-dimensional data is converted to low-dimensional data, and then new variables are generated
    • API:sklearn.decomposition.PCA(n_components=None)
      • n_components
        • Decimals: how many percent of information to keep
        • Integer: indicates the number of dimensions to reduce to

7. Case: Exploring the segmentation of users' preferences for item categories

learning target:

  • Applying PCA and K-means to realize the user's preference segmentation for item categories

insert image description here

7.1 Requirements

Whether you shop from a carefully planned list or let your mood guide your shopping, our unique food habits define who we are. Instacart is a grocery ordering and delivery app designed to make it easy to fill your fridge and pantry with your personal favorites and staples when you need them. After selecting products through the Instacart app, a personal shopper reviews your order and makes in-store purchases and delivery for you.

Instacart's data science team plays an important role in delivering a delightful shopping experience. Currently, they use transactional data to develop models that predict which products a user will buy again, try for the first time, or add to a cart next during a session. Recently, Instacart open-sourced this data - see their blog post on 3 million Instacart orders.

In this competition, Instacart challenges the Kaggle community to use this anonymized customer order data to predict which previously purchased products a user will include in their next order. Not only are they looking for the best model, Instacart is also looking for machine learning engineers to grow their team.

The winner of this competition will receive a cash prize and the opportunity to fast-track the recruitment process. For more information on exciting opportunities at Instacart, check out their careers page or contact their recruitment team directly at [email protected].

Dataset link : Instacart Market Basket Analysis


Data are as follows:

  • order_products__prior.csv: order and product information
    • Fields: order_id, product_id, add_to_cart_order, reordered
  • products.csv: product information
    • Fields: product_id, product_name, aisle_id, department_id
  • orders.csv: user's order information
    • Fields: order_id, user_id, eval_set, order_number, …
  • aisles.csv: the specific item category (aisle) to which a product belongs
    • Fields: aisle_id, aisle

7.2 Analysis

  1. retrieve data
  2. Basic Data Processing
    1. merge table
    2. Crosstab Merge
    3. data interception
  3. Feature Engineering: PCA
  4. machine learning (k-means)
  5. model evaluation
    1. sklearn.metrics.silhouette_score(X, labels)
      1. Compute the average silhouette coefficient for all samples
      2. X: the feature matrix (feature values of the samples)
      3. labels: the cluster labels assigned by the clustering algorithm

7.3 Code implementation

7.3.0 Import library

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

7.3.1 Get data

# 1. 读取数据
order_product = pd.read_csv("./data/instacart-market-basket-analysis/order_products__prior.csv")
products = pd.read_csv("./data/instacart-market-basket-analysis/products.csv")
orders = pd.read_csv("./data/instacart-market-basket-analysis/orders.csv")
aisles = pd.read_csv("./data/instacart-market-basket-analysis/aisles.csv")
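
Before merging, it can help to glance at the size of each table (a quick sanity check; the exact shapes depend on the dataset version downloaded):

# Print the shape of each loaded table
for name, df in [("order_product", order_product), ("products", products),
                 ("orders", orders), ("aisles", aisles)]:
    print(name, df.shape)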

7.3.2 Basic data processing

7.3.2.1 Merge tables

# 2. 数据基本处理
## 2.1 合并表格
# on:标签或列表。要连接的列或索引级别名称。这些必须在两个 DataFrame 中都能找到。如果 on 为 None 并且未在索引上合并,则默认为两个 DataFrame 中列的交集。
table_1 = pd.merge(order_product, products, on="product_id")
table_2 = pd.merge(table_1, orders, on="order_id")
table = pd.merge(table_2, aisles, on="aisle_id")

7.3.2.2 Crosstab Merge

Cross tabulation (crosstab) is a commonly used summary table for frequency-distribution statistics; its main value lies in revealing the relationship between variables. It can compute simple cross-tabulations of two (or more) factors. By default it computes a frequency table of the factors, unless an array of values and an aggregation function are passed.

For example, we can use the pd.crosstab function to compute a cross-tabulation between two categorical variables. The result shows how many times each value of one variable occurs in combination with each value of the other variable.

Here is a simple example of how to use the pd.crosstab function to calculate a crosstab:

import pandas as pd

# 创建示例数据
data = {'性别': ['男', '女', '男', '女', '男', '女', '男', '男'],
        '喜欢的颜色': ['红', '红', '蓝', '绿', '蓝', '蓝', '红', '绿'],
        '数量': [1, 2, 3, 4, 5, 6, 7, 8]}

df = pd.DataFrame(data)

# 计算交叉表
ct = pd.crosstab(df['性别'], df['喜欢的颜色'])

print(ct)
df.head()

insert image description here

ct.head()

insert image description here

This code will output the following result:

喜欢的颜色  绿  红  蓝
性别                
女         1   1   1
男         1   2   2

In this example, we used the pd.crosstab function to compute the cross-tabulation of the 性别 and 喜欢的颜色 columns. The result shows how many times each value of 性别 occurs in combination with each value of 喜欢的颜色.


Recommended Video : Pandas_Pivot and Crosstabs

Pivot Table is a tool for summarizing and analyzing data. It aggregates data based on one or more keys, producing a new DataFrame. The levels in the pivot table are stored in MultiIndex objects (hierarchical indexes) on the index and columns of the resulting DataFrame. A pivot table groups the data in a DataFrame by one or more keys and aggregates it; the aggregation type is determined by the aggfunc parameter. It can be seen as an advanced interface on top of groupby.

Cross tabulation (crosstab) is a commonly used summary table for frequency-distribution statistics; its main value lies in revealing the relationship between variables. By default it computes a frequency table of the factors, unless an array of values and an aggregation function are passed. A crosstab counts, for the groups in one column, how many rows fall into each group of another column (a special pivot table for counting group frequencies).

In short, a pivot table is a general tool for grouped statistics, while a crosstab is a special pivot table that is more convenient when you only need to count group frequencies.
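
To make the "a crosstab is a special pivot table" point concrete, here is a sketch (reusing the toy df from the example above) in which pivot_table with a counting aggregation reproduces the same frequency table as pd.crosstab:

# Counting rows per (性别, 喜欢的颜色) group with pivot_table gives the same result as pd.crosstab
pt = pd.pivot_table(df, index='性别', columns='喜欢的颜色',
                    values='数量', aggfunc='count', fill_value=0)
print(pt)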


## 2.2 交叉表合并
table = pd.crosstab(table["user_id"], table["aisle_id"])
table.head()

insert image description here

Here we use the crosstab merge because we want to see the relationship between "user_id" and the product category "aisle_id".

7.3.2.3 Data interception

## 2.3 数据截取
table_clip = table[:1000]
table_clip.head()

insert image description here

7.3.3 Feature Engineering: PCA Principal Component Analysis

# 3. 特征工程:PCA主成分分析
transfer = PCA(n_components=0.9)  # 保留90%的信息
data = transfer.fit_transform(table_clip)
data
array([[-2.27452872e+01, -7.32942365e-01, -2.48945893e+00, ...,
        -4.78491473e+00, -3.10742945e+00, -2.45192316e+00],
       [ 5.28638801e+00, -3.00176267e+01, -1.11226906e+00, ...,
         9.24145693e+00, -3.11309382e+00,  2.20144174e+00],
       [-6.52593099e+00, -3.87333123e+00, -9.23859508e+00, ...,
        -1.33929081e+00,  1.25062993e+00,  6.12717485e-01],
       ...,
       [ 1.31226615e+01, -2.77296885e+01, -4.62403246e+00, ...,
         7.40793534e+00,  1.03829352e+00, -1.39058393e+01],
       [ 1.64905900e+02, -8.54916188e+01,  1.90577481e-02, ...,
        -5.62014943e+00, -1.38488891e+01, -7.11424774e+00],
       [-1.60244724e+00,  1.82037661e+00,  8.55756408e+00, ...,
         3.69860152e+00,  2.82248188e+00, -3.79491023e+00]])
print("降维前特征数量为:", table_clip.shape)
print("降维后特征数量为:", data.shape)
降维前特征数量为: (1000, 134)
降维后特征数量为: (1000, 22)
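
As a quick check (a sketch using the explained_variance_ratio_ attribute of the fitted PCA object), the 22 retained components together should explain at least 90% of the variance, since n_components=0.9 selects the smallest number of components reaching that share:

# Cumulative share of variance explained by the retained components
print("Cumulative explained variance ratio:", transfer.explained_variance_ratio_.sum())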

7.3.4 Machine Learning: K-means Clustering Algorithm

import os
os.environ["OMP_NUM_THREADS"] = "4"  # limit OpenMP threads; avoids scikit-learn's KMeans memory-leak warning on Windows with MKL

# 4. 机器学习:K-means聚类
estimator = KMeans(n_clusters=8, random_state=22)  # 分为8类
pred = estimator.fit_predict(data)
pred
array([0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 3, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 7, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 7, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 7, 1, 0,
       1, 6, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 7,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 2, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 3, 7, 1,
       1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 7, 0,
       0, 0, 0, 1, 0, 7, 0, 1, 0, 0, 6, 4, 0, 0, 0, 7, 0, 1, 0, 0, 1, 1,
       1, 1, 3, 0, 0, 1, 7, 0, 1, 0, 0, 7, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       7, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 3,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 3, 7, 0, 0, 1, 0, 0,
       0, 0, 4, 5, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       7, 0, 0, 0, 4, 0, 0, 1, 0, 0, 0, 0, 7, 1, 3, 0, 0, 0, 3, 0, 0, 0,
       0, 1, 0, 7, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 4, 0,
       0, 0, 0, 0, 1, 0, 7, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 7, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 2, 1, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 4, 0, 0, 0, 1, 0, 1, 1, 7, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2,
       0, 7, 1, 7, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 7, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 7,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 7, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 4, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 7, 1, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 7, 0, 0, 0, 1, 7, 0, 0, 3, 1, 1, 1, 1, 0, 3, 0, 1, 3, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 3, 0, 0, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 7, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 7, 0, 0, 0, 0, 1, 7, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 3, 0, 0,
       0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 7, 0, 1, 1, 0, 0, 1, 0, 2, 1, 0, 0, 0, 7, 0, 7, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 0, 7, 0, 4, 0, 0, 0, 1, 0, 0, 0, 0, 7, 7,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 2, 0, 7, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 3, 0, 0, 0, 7, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 7, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0, 0, 1, 1, 0, 3, 0, 0, 0, 3,
       1, 0, 0, 1, 7, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 2, 0])
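
Before computing an overall score, it can be informative to see how many users fall into each of the 8 clusters (a small sketch; np.unique simply counts the predicted labels):

# Count how many samples were assigned to each cluster
labels, counts = np.unique(pred, return_counts=True)
for label, count in zip(labels, counts):
    print(f"cluster {label}: {count} users")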

7.3.5 Model Evaluation

Silhouette Coefficient is an index for evaluating how good a clustering is. For each sample it is computed from the mean intra-cluster distance $a$ (the average distance to the other samples in the same cluster) and the mean nearest-cluster distance $b$ (the average distance to the samples of the nearest cluster the sample does not belong to). The silhouette coefficient of a sample is $(b - a) / \max(a, b)$. Note that the silhouette coefficient is only defined when the number of labels satisfies 2 <= n_labels <= n_samples - 1. The silhouette_score function returns the average silhouette coefficient over all samples.

The value range of the silhouette coefficient is [-1, 1]:

  • When the coefficient is 1, it means that the clustering effect is very good;
  • When the coefficient is -1, it means that the clustering effect is very poor;
  • When the coefficient is close to 0, there is overlap between the clusters.

Negative values usually indicate that a sample has been assigned to the wrong cluster, because a different cluster is more similar to it.

In the scikit-learn library, you can use the sklearn.metrics.silhouette_score function to calculate the silhouette coefficient. The function takes a data matrix X and a label array labels, and returns the average silhouette coefficient over all samples.

# 5. 模型评估
score = silhouette_score(data, pred)
score
0.46400567259894415
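
To see which individual users are poorly assigned (negative silhouette values), a per-sample breakdown can be computed with silhouette_samples, the per-sample counterpart of silhouette_score in sklearn.metrics (a sketch):

from sklearn.metrics import silhouette_samples

# Per-sample silhouette values; negative ones suggest a questionable cluster assignment
sample_scores = silhouette_samples(data, pred)
print("Samples with a negative silhouette value:", (sample_scores < 0).sum())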

Let's take a look at the effect of using different amounts of data:

clip_num = [10, 50, 100, 500, 1000, 1500, 5000, 10000]

for clip_n in clip_num:
    ## 2.3 数据截取
    table_clip = table[:clip_n]

    # 3. 特征工程:PCA主成分分析
    transfer = PCA(n_components=0.9)  # 保留90%的信息
    data = transfer.fit_transform(table_clip)

    # 4. 机器学习:K-means聚类
    estimator = KMeans(n_clusters=8, random_state=22)  # 分为8类
    pred = estimator.fit_predict(data)

    # 5. 模型评估
    score = silhouette_score(data, pred)
    print(f"[数据量: {
      
      clip_n}] 分数为:{
      
      score*100:.4f}%")
[数据量: 10] 分数为:13.2716%
[数据量: 50] 分数为:33.6469%
[数据量: 100] 分数为:31.4929%
[数据量: 500] 分数为:44.9804%
[数据量: 1000] 分数为:46.4006%
[数据量: 1500] 分数为:38.5747%
[数据量: 5000] 分数为:38.0150%
[数据量: 10000] 分数为:37.5044%

It can be seen that the effect is not very good, which is probably caused by the small number of features we use.
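
Another follow-up worth trying (a sketch, not part of the original analysis) is to vary n_clusters on the 1000-user slice and compare silhouette scores, since the choice of 8 clusters above was somewhat arbitrary:

# Compare several cluster counts on the 1000-user slice
table_clip = table[:1000]
data = PCA(n_components=0.9).fit_transform(table_clip)

for k in [2, 3, 4, 5, 6, 8, 10]:
    pred = KMeans(n_clusters=k, random_state=22).fit_predict(data)
    print(f"n_clusters={k}, silhouette={silhouette_score(data, pred):.4f}")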
