- Dataset download address: no download required
1. Introduction to Clustering Algorithms
Learning objectives:
- Master the implementation process of clustering algorithms
- Know the principle of the K-means algorithm
- Know the evaluation metrics used for clustering models
- Explain the advantages and disadvantages of K-means
- Understand how clustering algorithms can be optimized
- Know the implementation process of feature dimensionality reduction
- Apply K-means to clustering tasks
1.1 Understanding Clustering Algorithms
Different clustering criteria produce different clustering results.
1.2 Application of clustering algorithm in reality
- User profiling, advertisement recommendation, data segmentation, search engine traffic recommendation, malicious traffic identification
- Business push, news clustering, screening and sorting based on location information
- Image segmentation, dimensionality reduction, and recognition; outlier detection; detection of abnormal credit card spending; discovery of gene fragments with the same function
1.3 The concept of clustering algorithm
Clustering algorithm: a typical unsupervised learning algorithm, mainly used to automatically group similar samples into the same category.
A clustering algorithm divides samples into categories according to the similarity between them, and different similarity measures produce different clustering results. The most commonly used similarity measure is Euclidean distance.
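As a small illustration of how the similarity measure changes the outcome, the sketch below compares Euclidean and Manhattan distance on three made-up points. The points and the helper name `manhattan` are assumptions chosen for illustration only.

```python
import math

def manhattan(p, q):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

query = (0, 0)
candidates = {"a": (3, 3), "b": (0, 5)}

# Find the candidate nearest to the query under each measure.
nearest_euclidean = min(candidates, key=lambda k: math.dist(query, candidates[k]))
nearest_manhattan = min(candidates, key=lambda k: manhattan(query, candidates[k]))

print(nearest_euclidean, nearest_manhattan)
```

Under Euclidean distance point `a` is nearer (about 4.24 versus 5), while under Manhattan distance point `b` is nearer (5 versus 6), so the same query would be grouped differently.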
1.4 The biggest difference between clustering algorithm and classification algorithm
Clustering algorithms are unsupervised learning algorithms, while classification algorithms are supervised learning algorithms.
Summary:
- Classification of clustering algorithms [understand]
  - Coarse clustering
  - Fine clustering
- The definition of clustering [understand]
  - A typical unsupervised learning algorithm
  - Mainly used to automatically group similar samples into one category
  - The similarity between samples is usually computed with Euclidean distance
2. Preliminary use of clustering algorithm API
Learning objectives:
- Know how to use the clustering algorithm API
2.1 API introduction
sklearn.cluster.KMeans(n_clusters=8)
sklearn.cluster.KMeans is a class in the scikit-learn library that implements the K-means clustering algorithm.
- Main parameters:
  - n_clusters: int, default 8. The number of clusters to form, which is also the number of centroids to generate.
  - init: {'k-means++', 'random'}, a callable, or an array of shape (n_clusters, n_features); default 'k-means++'. The initialization method.
  - n_init: 'auto' or int. The number of times the k-means algorithm is run with different centroid seeds; the final result is the run with the lowest inertia. The default was 10 in older releases and is 'auto' from scikit-learn 1.4 onward.
  - max_iter: int, default 300. The maximum number of iterations for a single run of the k-means algorithm.
  - tol: float, default 1e-4. Relative tolerance on the Frobenius norm of the difference in cluster centers between two successive iterations, used to declare convergence.
  - verbose: int, default 0. Verbosity mode.
  - random_state: int, RandomState instance, or None; default None. Determines the random number generation for centroid initialization.
  - copy_x: bool, default True. When precomputing distances it is more numerically accurate to center the data first; with copy_x=True the original data is not modified.
  - algorithm: {"lloyd", "elkan"}, default "lloyd". The K-means algorithm to use. Older releases also accepted "auto" and "full", which were deprecated in scikit-learn 1.1.
- Return value:
  - The constructor returns a KMeans object whose methods (e.g. fit, predict) can be used to cluster the data.
- Methods:
  - fit(X[, y, sample_weight]): Compute K-means clustering.
  - fit_predict(X[, y, sample_weight]): Compute the cluster centers and predict the cluster index of each sample.
  - fit_transform(X[, y, sample_weight]): Compute the cluster centers and transform X to cluster-distance space.
  - get_params([deep]): Get the parameters of this estimator.
  - predict(X): Predict the nearest cluster center, and therefore the cluster, for each sample.
  - score(X[, y, sample_weight]): Score the KMeans model on the data X.
  - set_params(**params): Set the parameters of this estimator.
  - transform(X): Transform X to cluster-distance space.
2.2 Case
Randomly generate a two-dimensional data set as the training set and cluster it with the K-means algorithm. Try different numbers of clusters and observe the effect.
Different values of the n_clusters parameter give different clustering results:
2.2.1 Process Analysis
- import tool library
- Create a dataset and display it
- Apply K-means
- Show results
2.2.2 Code implementation
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# 1. Create the data set
# X holds the sample features and y the true cluster labels: 1000 samples, 2 features each, 4 clusters
# Cluster centers are [-1, -1], [0, 0], [1, 1], [2, 2]; cluster standard deviations are [0.4, 0.2, 0.2, 0.2]
X, y = make_blobs(n_samples=1000, n_features=2, centers=[[-1, -1], [0, 0], [1, 1], [2, 2]],
                  cluster_std=[0.4, 0.2, 0.2, 0.2], random_state=9)
# Visualize the data set
plt.figure(dpi=300)
plt.scatter(X[:, 0], X[:, 1], marker='o')
plt.show()

# 2. Cluster with K-means
y_pred = KMeans(n_clusters=2, random_state=9).fit_predict(X)
plt.figure(dpi=300)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()
# Evaluate the clustering with the Calinski-Harabasz (CH) score
print(calinski_harabasz_score(X=X, labels=y_pred))
Try n_clusters = 2, 3, and 4 in turn and compare the clustering results:
fig, axes = plt.subplots(1, 3, figsize=(20, 5), dpi=300)
n_clusters_ls = [2, 3, 4]
for idx, val in enumerate(n_clusters_ls):
    # Cluster with K-means
    y_pred = KMeans(n_clusters=val, random_state=9).fit_predict(X)
    axes[idx].scatter(X[:, 0], X[:, 1], c=y_pred)
    axes[idx].set_title(f"n_clusters={val}")
    # Evaluate the clustering with the CH score
    print(f"CH score for n_clusters={val}:", calinski_harabasz_score(X=X, labels=y_pred))
plt.savefig("./clustering_results_by_n_clusters.png")
plt.show()
Output:
CH score for n_clusters=2: 3116.1706763322227
CH score for n_clusters=3: 2931.625030199556
CH score for n_clusters=4: 5924.050613480169
The higher the CH score, the better. Here n_clusters=4, the number of clusters the data was generated with, scores highest.
Summary:
- API: sklearn.cluster.KMeans(n_clusters=8) [know]
  - Parameter:
    - n_clusters: the number of initial cluster centers
  - Method:
    - estimator.fit_predict(x): compute the cluster centers and predict which cluster each sample belongs to
    - It is equivalent to calling fit(x) first and then predict(x)
3. Clustering algorithm implementation process
Learning objectives:
- Master the implementation steps of K-means clustering
- The name K-means has two parts:
  - K: the number of initial center points (the planned number of clusters)
  - means: each cluster center is the mean of the data points assigned to it
3.1 Steps of K-means clustering
- Randomly choose K points in the feature space as the initial cluster centers
- For every other point, compute the distance to each of the K centers and label the point with its nearest cluster center
- Once every point is labeled, recompute the center of each cluster as the mean of its points
- If the new centers are the same as the previous ones (the centroids no longer move), stop; otherwise, repeat from the second step
The process is illustrated by a figure and an animated demo, in which the dots are the samples and the × marks are the cluster centers.
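The four steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the scikit-learn implementation; the function names `assign`, `update`, and `kmeans`, the sample points, and the choice to keep an empty cluster's old center are all assumptions made for the example.

```python
import math
import random

def assign(points, centers):
    """Step 2: label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def update(points, labels, centers):
    """Step 3: move each center to the mean of the points assigned to it."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:                 # empty cluster: keep the old center
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centers

def kmeans(points, centers, max_iter=100):
    """Repeat steps 2 and 3 until the centers stop moving (step 4)."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:      # centroids no longer move
            break
        centers = new_centers
    return centers, assign(points, centers)

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
initial = random.Random(0).sample(points, 2)   # step 1: K = 2 random points
centers, labels = kmeans(points, initial)
print(centers, labels)
```

Passing the initial centers in explicitly keeps the random choice of step 1 separate from the deterministic loop of steps 2 to 4, which makes the loop easy to check by hand.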
3.2 Case exercises
Step 1: Randomly choose K points in the feature space as the initial cluster centers (in this case, P1 and P2).
Step 2: For every other point, compute the distance to each of the K centers and label the point with its nearest cluster center.
Step 3: Once every point is labeled, recompute the center of each cluster as the mean of its points.
Step 4: If the new centers are the same as the previous ones (the centroids no longer move), stop; otherwise, repeat from Step 2. [In this example the check shows that the steps above must be repeated in a new round of iteration.]
Note: When an iteration leaves the result unchanged, the algorithm has converged and the clustering is complete. K-means is guaranteed to stop: no iteration increases the SSE and there are only finitely many ways to partition the samples, so it cannot keep selecting new centroids forever.
Summary:
- K-means clustering implementation process [master]
  - Fix the constant K in advance; K is the final number of cluster categories
  - Randomly select initial points as centroids, compute the similarity (here, the Euclidean distance) between each sample and each centroid, and assign each sample to the most similar class
  - Recompute the centroid of each class (the class center) and repeat until the centroids no longer change; this determines the category of each sample and the centroid of each class
  - Note:
    - Because the similarity between every sample and every centroid is computed in each iteration, K-means converges relatively slowly on large data sets.
4. Model Evaluation
Learning objectives:
- Know the principles behind SSE, the "elbow" method, the SC coefficient, and the CH coefficient in model evaluation
4.1 The sum of squares due to error (SSE)
Example: the values -0.2, 0.4, -0.8, 1.3, -0.7 in the figure are the differences between the actual and predicted values, so SSE = 0.04 + 0.16 + 0.64 + 1.69 + 0.49 = 3.02.
Application in k-means:
$$\mathrm{SSE} = \sum_{i=1}^{k}\sum_{p \in C_i}|p-m_i|^2$$
The parts of the formula are:
- $k$ is the number of clusters (in the figure, $k=2$)
- $C_i$ is the $i$-th cluster and $p$ is a sample in $C_i$
- $m_i$ is the centroid of $C_i$
- SSE measures how loose the clusters are (e.g. SSE (left) < SSE (right) in the figure)
- As the clustering iterates, SSE becomes smaller and smaller until it stabilizes.
Note: The initial centroids in K-means are chosen at random. When two initial centroids happen to be close together, the result can be poor, so the final SSE is a local optimum rather than the global optimum. If the initial centroids are badly chosen, SSE only reaches a mediocre local optimum.
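To make the local-optimum point concrete, the sketch below computes SSE for two different clusterings of the same four points. The data and the helper name `sse` are assumptions chosen so that the arithmetic is easy to follow.

```python
import math

def sse(points, labels, centers):
    """Sum over all points of the squared distance to the assigned center."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))

points = [(0, 0), (0, 2), (10, 0), (10, 2)]

# Two candidate clusterings of the same data with k = 2.
tight = sse(points, [0, 0, 1, 1], [(0, 1), (10, 1)])   # left / right split
loose = sse(points, [0, 1, 0, 1], [(5, 0), (5, 2)])    # bottom / top split
print(tight, loose)
```

Both clusterings are stable under the K-means updates: each center is the mean of its points and each point is nearest to its own center. Their SSE values nevertheless differ (4 versus 100), so a run that starts near the second configuration stays in the worse local optimum.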
4.2 The elbow method: determining K
- For a data set of $n$ points, run the clustering for $k$ from 1 to $n$ and, after each run, compute the sum of squared distances from each point to the center of its cluster;
- The sum of squares decreases as $k$ grows and reaches 0 at $k = n$, because each point is then the center of its own cluster.
- Along the way the curve has an inflection point, the "elbow": the $k$ at which the rate of decline suddenly slows is taken as the best $k$.
The same criterion tells us when to stop adding clusters: real data is usually noisy, so we stop once adding another cluster no longer brings a worthwhile reduction in the sum of squares.
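The elbow is usually read off a plot, but one possible way to locate it numerically is to look for the largest second difference of the SSE curve. The sketch below does this; the function name `elbow_k`, the second-difference heuristic, and the SSE values are all assumptions invented for illustration, and the heuristic expects consecutive values of k.

```python
def elbow_k(sse_by_k):
    """Pick the k where the SSE curve bends most sharply.

    sse_by_k maps k -> SSE. The bend at k is the second difference:
    (drop from k-1 to k) minus (drop from k to k+1).
    """
    ks = sorted(sse_by_k)
    best_k, best_bend = None, float("-inf")
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        bend = (sse_by_k[prev] - sse_by_k[k]) - (sse_by_k[k] - sse_by_k[nxt])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Hypothetical SSE values, invented only to show the shape of the curve.
sse_by_k = {1: 1000.0, 2: 600.0, 3: 150.0, 4: 120.0, 5: 100.0, 6: 90.0}
print(elbow_k(sse_by_k))
```

With these made-up values the drop from k=2 to k=3 is large (450) and the drop from k=3 to k=4 is small (30), so the bend is sharpest at k=3.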
4.3 Silhouette Coefficient (SC)
The silhouette coefficient combines the cohesion and the separation of a clustering to evaluate its quality.
Purpose: minimize the intra-cluster distance $a$ and maximize the inter-cluster distance $b$.
$S \in [-1, 1]$; the larger $S$ is, the better (close to 1), and the smaller, the worse (close to -1).
$$
\begin{aligned}
s(i) &= \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}\\
&= \begin{cases} 1 - \frac{a(i)}{b(i)}, & a(i) < b(i)\\ 0, & a(i) = b(i)\\ \frac{b(i)}{a(i)} - 1, & a(i) > b(i) \end{cases}
\end{aligned}
$$
Compute $a(i)$, the average distance from sample $i$ to the other samples in its own cluster. The smaller $a(i)$ is, the less dissimilar sample $i$ is to its cluster, and the more it belongs in that cluster.
Compute $b(i)_j$, the average distance from sample $i$ to all samples of another cluster $C_j$, called the dissimilarity between sample $i$ and cluster $C_j$. The inter-cluster dissimilarity of sample $i$ is the minimum over the other clusters: $b(i) = \min\{b(i)_1, b(i)_2, \dots, b(i)_k\}$. The larger $b(i)$ is, the less sample $i$ belongs to any other cluster.
The average silhouette coefficient is the mean of the silhouette coefficients of all samples. It lies in $[-1, 1]$, and the larger it is, the better the clustering.
The coefficient is large when samples within a cluster are close together and samples in different clusters are far apart.
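The definition of $s(i)$ can be checked on a tiny data set with the plain-Python sketch below. The function name `silhouette_samples`, the sample points, and the convention that a single-member cluster scores 0 are assumptions for illustration; the sketch also assumes at least two clusters.

```python
import math

def silhouette_samples(points, labels):
    """Return s(i) = (b - a) / max(a, b) for every sample."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:              # assumed convention: singleton scores 0
            scores.append(0.0)
            continue
        # a(i): mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return scores

points = [(0, 0), (0, 1), (5, 0), (5, 1)]
good = silhouette_samples(points, [0, 0, 1, 1])   # left pair vs right pair
bad = silhouette_samples(points, [0, 1, 0, 1])    # each cluster spans both sides
print(sum(good) / len(good), sum(bad) / len(bad))
```

The sensible labeling gives an average coefficient of about 0.80, while the labeling that mixes the two sides gives a negative average, matching the interpretation that negative values point to misassigned samples.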
Case:
The data set has 500 samples with 2 features; we measure the SC coefficient on it.
For n_clusters = 2, 3, 4, 5, and 6, the average SC coefficient, a metric in [-1, 1], is:
n_clusters = 2 The average silhouette_score is : 0.7049787496083262
n_clusters = 3 The average silhouette_score is : 0.5882004012129721
n_clusters = 4 The average silhouette_score is : 0.6505186632729437
n_clusters = 5 The average silhouette_score is : 0.56376469026194
n_clusters = 6 The average silhouette_score is : 0.4504666294372765
After each clustering, every sample gets its own silhouette coefficient. A value of 1 means the point is far from the neighboring clusters, which is very good; a value of 0 means the point may lie on the boundary between two clusters; a negative value suggests the point may have been misclassified.
Judging by the average SC coefficient, $K$ = 3, 5, and 6 are poor choices. What about 2 and 4? Compare the silhouette plots for $K=2$ and $K=4$:
- When n_clusters = 2, cluster 0 is much wider in the silhouette plot than cluster 1;
- When n_clusters = 4, the cluster widths are similar, so $K=4$ is chosen as the final number of clusters.
4.4 CH Coefficient (Calinski-Harabasz Index)
Calinski-Harabasz:
The smaller the covariance of the data within each category and the larger the covariance between categories, the higher the Calinski-Harabasz score $s$; the higher $s$ is, the better the clustering.
$$s(k) = \frac{tr(B_k)}{tr(W_k)} \cdot \frac{m-k}{k-1}$$
where:
- $tr$ is the trace of a matrix, $B_k$ is the covariance matrix between categories, and $W_k$ is the covariance matrix of the within-category data
- $m$ is the number of samples in the training set
- $k$ is the number of categories.
The trace is the sum of the entries on the main diagonal. For a matrix
$$\begin{pmatrix} a_{11} & a_{12} & \dots & a_{1n}\\ a_{21} & a_{22} & \dots & a_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ a_{n1} & a_{n2} & \dots & a_{nn} \end{pmatrix}$$
the trace is $tr = a_{11} + a_{22} + \dots + a_{nn}$.
Why the trace is used: the diagonal of a covariance matrix holds the variance of each feature, so the trace adds up the variance over all features (it also equals the sum of the eigenvalues) and condenses the whole matrix into a single measure of spread. $tr(W_k)$ therefore measures how tightly the samples sit around their own cluster centers, and $tr(B_k)$ how far the cluster centers are spread apart, without having to work with the full matrices.
The goal of CH: cluster as many samples as possible with as few categories as possible while still obtaining a good clustering.
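The formula for $s(k)$ can be evaluated directly once the two traces are written as sums of squared distances. The sketch below assumes that $B_k$ and $W_k$ are taken as scatter (dispersion) matrices, so that $tr(B_k)$ is the size-weighted squared distance of each cluster center from the overall mean and $tr(W_k)$ is the within-cluster SSE. The function names and data are illustrative.

```python
import math

def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def calinski_harabasz(points, labels):
    """s(k) = tr(B_k) / tr(W_k) * (m - k) / (k - 1)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    m, k = len(points), len(clusters)
    overall = mean_point(points)

    tr_b = 0.0   # between-cluster scatter
    tr_w = 0.0   # within-cluster scatter
    for members in clusters.values():
        center = mean_point(members)
        tr_b += len(members) * math.dist(center, overall) ** 2
        tr_w += sum(math.dist(p, center) ** 2 for p in members)
    return (tr_b / tr_w) * (m - k) / (k - 1)

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = calinski_harabasz(points, [0, 0, 1, 1])   # compact, well separated
bad = calinski_harabasz(points, [0, 1, 0, 1])    # stretched, overlapping in x
print(good, bad)
```

For the compact labeling the traces are 100 and 4, giving a score of 50; for the stretched labeling they are 4 and 100, giving 0.08. The sketch assumes at least two clusters and a non-zero within-cluster scatter.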
Summary:
- SSE [know]
  - The smaller the sum of squared errors, the better
- Elbow method [know]
  - The $k$ at which the rate of decline suddenly slows is taken as the best $k$
- SC coefficient [know]
  - The value lies in [-1, 1]; the larger, the better
- CH coefficient [know]
  - The higher the score $s$, the better the clustering
  - The goal of CH: cluster as many samples as possible with as few categories as possible while still obtaining a good clustering.
5. Algorithm optimization
Learning objectives:
- Know the advantages and disadvantages of the k-means algorithm
- Know the optimization principles of Canopy, K-means++, bisecting (dichotomous) K-means, and K-medoids
- Understand the optimization principles of kernel K-means, ISODATA, and Mini Batch K-means
Summary of the k-means algorithm:
- Advantages:
  - The principle is simple (assign each point to the nearest center) and easy to implement
  - The clustering quality is above average (depending on the choice of $K$)
  - Space complexity $O(N)$, time complexity $O(I \cdot K \cdot N)$, where $N$ is the number of sample points, $K$ is the number of cluster centers, and $I$ is the number of iterations
- Disadvantages:
  - Sensitive to outliers and noise (the center point shifts easily)
  - Has difficulty finding clusters that differ greatly in size, and does not lend itself to incremental computation
  - The result is not guaranteed to be the global optimum, only a local one (it depends on $K$ and on the choice of initial values)
5.1 Canopy algorithm for initial clustering
The Canopy algorithm is a "coarse" clustering algorithm: fast, but less accurate. Unlike traditional clustering algorithms such as K-means, Canopy does not need the value of k (the number of clusters) to be specified in advance, which makes it very useful in practice.
5.1.1 Implementation process of the Canopy algorithm
The steps of the Canopy algorithm are as follows:
- Given a sample list $L = x_1, x_2, \dots, x_m$ and two distance thresholds $T_1$ and $T_2$ with $T_1 > T_2$ (both can be chosen by the user);
- Take any point $P$ from $L$ and compute its distance to every existing cluster center (if there is no cluster center yet, $P$ becomes a new cluster), then take the smallest of these distances, $D(P, a_j)$;
- If $D$ is less than $T_1$, the point belongs to that cluster and is added to its member list.
- If $D$ is less than $T_2$, the point not only belongs to the cluster but is also very close to its center, so $P$ is deleted from $L$.
- If $D$ is greater than $T_1$, $P$ forms a new cluster.
- Repeat until $L$ is empty or no longer changes.
Canopy can be used for "coarse" clustering to obtain the value of k and roughly k initial centroids, after which K-means performs a further "fine" clustering. This Canopy + K-means combination clusters well.
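A minimal sketch of canopy clustering is shown below. It follows a common formulation in which each new canopy center is simply taken from the remaining list, which differs slightly in bookkeeping from the steps listed above; the function name `canopy`, the thresholds, and the data are assumptions for illustration.

```python
import math

def canopy(points, t1, t2):
    """Coarse clustering with a loose threshold t1 and a tight threshold t2."""
    assert t1 > t2
    remaining = list(points)
    canopies = []                      # list of (center, members)
    while remaining:
        center = remaining.pop(0)      # take any point as a new canopy center
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(p, center)
            if d < t1:                 # loosely bound: joins this canopy
                members.append(p)
            if d >= t2:                # not tightly bound: stays a candidate
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies

points = [(0, 0), (0, 1), (1, 0), (3, 0), (10, 10), (10, 11)]
result = canopy(points, t1=4.0, t2=2.0)
for center, members in result:
    print(center, members)
```

With these thresholds the point (3, 0) lies between $T_2$ and $T_1$ of the first center, so it joins the first canopy and later also starts its own. The number of canopies (three here) can then serve as k, and the canopy centers as initial centroids for K-means.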
5.1.2 Advantages and disadvantages of Canopy algorithm
Advantages:
- K-means is sensitive to noise and interference. With Canopy, clusters that contain few points (a small NumPoint) can be removed directly, which improves robustness to noise.
- Taking the centerPoint of each Canopy as the K initial centers is more accurate than choosing them at random.
- K-means only needs to be run within each Canopy, which reduces the number of similarity computations.
In the Canopy algorithm, NumPoint is the number of points a Canopy contains, and centerPoint is the center point of a Canopy. These center points can serve as the initial centroids of K-means and so improve its clustering.
Disadvantages:
- The thresholds $T_1$ and $T_2$ are hard to determine, and the result may still be a local optimum.
- The clusters produced by Canopy may overlap, but no object is left without a cluster.
5.2 K-means++ algorithm
The K-means++ algorithm selects the initial values (or "seeds") for the K-means clustering algorithm. It is an approximation algorithm for the NP-hard K-means problem and a way of avoiding the poor clusterings that standard K-means sometimes finds.
K-means++ changes only how the cluster centers are initialized; everything else is the same as in K-means. The initialization can be summed up in one sentence: select the k cluster centers one at a time, and the farther a sample point is from the centers already chosen, the more likely it is to be selected as the next center. This largely avoids placing several initial centers in the same cluster and so improves the clustering.
$$P = \frac{D(x)^2}{\sum_{x\in X}D(x)^2}$$
where:
- $P$ is the probability that a sample point is selected as the next cluster center
- $D(x)$ is the shortest distance between sample point $x$ and the cluster centers chosen so far (the farther a point is from the existing centers, the more likely it is to be selected next). Specifically, if the current cluster centers are $c_1, c_2, \dots, c_m$, then $D(x)$ is:
$$D(x) = \min_{i=1}^m \text{dist}(x, c_i)$$
where $\text{dist}(x, c_i)$ is the distance between sample point $x$ and cluster center $c_i$. The distance can be measured in different ways depending on the problem, for example Euclidean distance or Manhattan distance.
To simplify the calculations that follow, we write $A$ for $\sum_{x\in X}D(x)^2$.
Suppose we first choose point 2 as the centroid. We can then compute the probability $P$ for every other point and select the next centroid according to these probabilities, because the purpose of K-means++ is to spread the selected centroids as far apart as possible.
As the figure shows, if the first centroid is at the center of the circle, the next centroid is most likely to be chosen from the region labeled $P(A)$ (regions are distinguished by color).
The K-means++ algorithm flow is as follows:
- Select one sample point from the data set $\mathcal{X}$ uniformly at random as the first initial cluster center $c_1$.
- Compute $D(x)$, the shortest distance between each sample and the existing cluster centers, then the probability $P(x)$ that each sample is selected as the next center, and draw the next center at random according to these probabilities (for example by roulette-wheel selection), so that points with a larger $P(x)$ are more likely to be chosen.
- Repeat step 2 until k cluster centers have been selected.
- Run the standard K-means algorithm from these centers.
The flow of the standard K-means algorithm is as follows:
- Assign each sample point to the cluster of its nearest cluster center.
- Compute the mean of each cluster and use it as the new cluster center.
- Repeat steps 1 and 2 until the cluster centers no longer change or the maximum number of iterations is reached.
By improving how the initial cluster centers are chosen, K-means++ avoids the poor clusterings that standard K-means sometimes finds and so improves the clustering.
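The seeding step can be sketched as below, assuming the next center is drawn at random with probability proportional to $D(x)^2$, which is consistent with "the farther away, the more likely to be selected". The function name `kmeans_pp_init`, the data, and the seed are illustrative assumptions.

```python
import math
import random

def kmeans_pp_init(points, k, rng):
    """Choose k initial centers by D(x)^2-weighted sampling."""
    centers = [rng.choice(points)]                 # first center: uniform
    while len(centers) < k:
        # D(x)^2: squared distance to the nearest center chosen so far
        d2 = [min(math.dist(p, c) for c in centers) ** 2 for p in points]
        # P(x) = D(x)^2 / sum of D(x)^2; choices() normalizes the weights
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = kmeans_pp_init(points, k=2, rng=random.Random(0))
print(centers)
```

If the first center happens to be (0, 0), the weights $D(x)^2$ are 0, 1, 1, 200, 221, 221, so the second center comes from the far group with probability 642/644, about 99.7%. Points already chosen have weight 0 and cannot be selected twice. The sketch assumes the data contains at least k distinct points.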
5.3 Bisecting (dichotomous) K-means algorithm
Bisecting K-means is an improvement on K-means. Its main idea is to start with all samples in a single cluster and keep splitting clusters in two until there are k clusters.
The process of bisecting K-means is as follows:
- Treat all points as one cluster;
- For each cluster, do the following:
  - Compute the total error (SSE);
  - Run K-means with k=2 on the cluster;
  - Compute the total error (SSE) after splitting the cluster in two;
- Split the cluster whose split gives the smallest total SSE (in practice, this is usually a cluster with a large error);
- Repeat steps 2 and 3 until the number of clusters specified by the user is reached.
Bisecting K-means reduces the tendency of K-means to converge to a poor local minimum and so improves the clustering.
A principle implicit in bisecting K-means:
The sum of squared errors measures clustering quality: the smaller it is, the closer the data points are to their centroids and the better the clustering. The cluster with the largest sum of squared errors is therefore the one to split, because a large error means the cluster is poorly fitted and is more likely to contain several true clusters that have been treated as one.
Bisecting K-means can also run faster than K-means, because it performs fewer similarity computations, and it is less sensitive to initialization: every step splits a single cluster in two and keeps the split with the smallest error, rather than depending on k randomly chosen starting points.
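The procedure can be sketched in plain Python as follows. This is an illustrative outline, not a library implementation: the inner 2-means still starts from two randomly sampled points, and the function names, data, and seed are assumptions.

```python
import math
import random

def mean_point(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def sse(pts):
    center = mean_point(pts)
    return sum(math.dist(p, center) ** 2 for p in pts)

def two_means(pts, rng, max_iter=50):
    """Plain K-means with k = 2; returns the two resulting groups."""
    centers = rng.sample(pts, 2)
    for _ in range(max_iter):
        groups = ([], [])
        for p in pts:
            nearer = 0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            groups[nearer].append(p)
        if not groups[0] or not groups[1]:
            break
        new_centers = [mean_point(g) for g in groups]
        if new_centers == centers:
            break
        centers = new_centers
    return groups

def bisecting_kmeans(points, k, rng):
    clusters = [list(points)]                     # start with one cluster
    while len(clusters) < k:
        best = None                               # (total SSE, index, split)
        for i, cluster in enumerate(clusters):
            if len(cluster) < 2:
                continue
            left, right = two_means(cluster, rng)
            if not left or not right:
                continue
            rest = sum(sse(c) for j, c in enumerate(clusters) if j != i)
            total = rest + sse(left) + sse(right)
            if best is None or total < best[0]:
                best = (total, i, (left, right))
        if best is None:                          # nothing left to split
            break
        _, i, (left, right) = best
        clusters[i:i + 1] = [left, right]         # replace one cluster by two
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0), (5, 20), (5, 21), (6, 20)]
clusters = bisecting_kmeans(points, k=3, rng=random.Random(0))
print(clusters)
```

Each round tries a 2-way split of every cluster and keeps only the split that leaves the smallest total SSE, so the number of clusters grows by exactly one per round.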
5.4 K-medoids (K-center clustering algorithm)
K-medoids differs from K-means in how the center point is chosen.
- In K-means, the center is the mean of all data points in the current cluster, so it is very sensitive to outliers.
- In K-medoids, the center is the point in the current cluster whose sum of distances to all other points in the cluster is smallest.
medoids (/ˈmɛdɔɪdz/): center points
The K-medoids algorithm (also known as the K-center clustering algorithm) is a classic partitioning clustering technique that divides a data set of n objects into k clusters, where k (the number of clusters) must be known before execution (that is, the programmer must specify k before running K-medoids).
Unlike the K-means algorithm, the K-medoids algorithm chooses actual data points as centers (medoids or exemplars), thus allowing better interpretation of cluster centers than K-means. Furthermore, K-medoids can be used with arbitrary dissimilarity measures, whereas K-means usually requires Euclidean distance for efficient solutions. Because K-medoids minimizes the sum of pairwise dissimilarities rather than the sum of squared Euclidean distances, it is more resistant to noise and outliers than K-means.
The medoid of a cluster is defined as the object whose average dissimilarity with all objects in the cluster is the smallest, that is, it is the most central point in the cluster.
Algorithm flow:
- Randomly select k objects as the initial medoids.
- Assign each remaining object to the cluster represented by its nearest medoid.
- For each cluster, compute the criterion function value of every member object and select the object with the smallest value as the new medoid.
- Repeat steps 2 and 3 until the medoids no longer change or the maximum number of iterations is reached.
The criterion function is the sum of distances between a member object and the other member objects of its class.
Compared with K-means, K-medoids is more robust to noise.
Example: suppose a cluster contains only a few sample points, such as (1, 1), (1, 2), (2, 1), and (1000, 1000), where (1000, 1000) is noise.
With K-means, the centroid is the mean of the four points, (251, 251), which is far from the three genuine points because it is pulled toward the noise point. This is obviously not what we want.
K-medoids avoids this. It selects one of the sample points (1, 1), (1, 2), (2, 1), (1000, 1000) as the center so as to minimize the absolute error of the cluster, and the calculation shows that it must pick one of the first three points.
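The example can be verified with a few lines of Python that compute both kinds of center for the four points. The helper name `total_distance` is an assumption for illustration.

```python
import math

points = [(1, 1), (1, 2), (2, 1), (1000, 1000)]

# K-means center: the coordinate-wise mean of the cluster.
mean = tuple(sum(c) / len(points) for c in zip(*points))

# K-medoids center: the member with the smallest total distance to the others.
def total_distance(candidate):
    return sum(math.dist(candidate, p) for p in points)

medoid = min(points, key=total_distance)
print(mean, medoid)
```

The mean is (251.0, 251.0), which is not close to any real sample, while the medoid is one of the three genuine points. The noise point itself has by far the largest total distance (about 4237, against about 1415 for each of the others), so it is never chosen.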
K-medoids is practical only for small data sets: with many samples it is too slow. With many samples, a few noise points also affect the K-means centroid much less than one might expect, so K-means is used far more widely than K-medoids.
Simply put:
For small data sets K-medoids can be used, and it generally performs better than K-means; for large data sets K-means is still the usual choice.
5.5 Kernel K-means algorithm (understand)
Kernel K-means is a K-means variant based on the kernel method that can handle data that is not linearly separable. It maps the data into a high-dimensional space, so that data that is not linearly separable in the original low-dimensional space becomes linearly separable there, which improves the clustering.
Its process is similar to standard K-means, except that distances between sample points are computed through a kernel function. This addresses the inability of standard K-means to handle nonlinearly separable data.
In other words, kernel K-means projects each sample into a high-dimensional space and then clusters the projected data with the usual K-means procedure.
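In practice the high-dimensional mapping is never computed explicitly: the squared distance from a sample to a cluster mean in feature space can be written purely in terms of kernel values. The sketch below shows this identity; the function names, the RBF kernel with `gamma=0.5`, and the data are assumptions for illustration.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    return math.exp(-gamma * math.dist(x, y) ** 2)

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def kernel_distance_sq(x, cluster, kernel):
    """Squared feature-space distance from x to the mean of `cluster`.

    ||phi(x) - mu||^2 = K(x, x) - 2/n * sum_y K(x, y) + 1/n^2 * sum_{y,z} K(y, z)
    """
    n = len(cluster)
    cross = sum(kernel(x, y) for y in cluster)
    within = sum(kernel(y, z) for y in cluster for z in cluster)
    return kernel(x, x) - 2.0 * cross / n + within / (n * n)

cluster = [(0, 0), (2, 0)]
x = (1, 1)
linear_d2 = kernel_distance_sq(x, cluster, linear_kernel)
rbf_d2 = kernel_distance_sq(x, cluster, rbf_kernel)
print(linear_d2, rbf_d2)
```

With the linear kernel the result equals the ordinary squared Euclidean distance from (1, 1) to the cluster mean (1, 0), which is 1, so standard K-means is the special case of a linear kernel. Swapping in a nonlinear kernel changes the geometry without changing the rest of the algorithm.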
5.6 ISODATA algorithm (understand)
The ISODATA algorithm (Iterative Self-Organizing Data Analysis Techniques Algorithm) is an improved k-means algorithm. It introduces criteria for evaluating categories during clustering and automatically merges or splits categories according to them, which to some extent removes the restriction of a fixed number of categories.
The algorithm adjusts the number of cluster centers dynamically according to the samples in each class. If the samples in a class are widely dispersed (measured by variance) and numerous, the class is split; if two classes are close together (measured by the distance between their cluster centers), they are merged.
ISODATA is an iterative self-organizing data analysis technique. It computes class means evenly distributed in the data space and then iteratively clusters the remaining pixels by minimum distance. The means are recalculated in every iteration, and the pixels are reclassified against the new means.
Features:
- The number of categories changes during the clustering process
- Categories can be merged and split:
  - Merge: when a class in the clustering result has too few samples, or when two classes are too close together
  - Split: when the intra-class variance of a class in the clustering result is too large
5.7 Mini Batch K-Means algorithm (understand)
The Mini Batch K-Means algorithm is an optimization of the K-Means algorithm that is suited to clustering big data. It uses small batches (subsets) of the data to reduce computation time while still attempting to optimize the same objective function. A mini-batch here means a subset of the data that is randomly drawn in each training iteration. Training on these random subsets greatly reduces the computation and the time K-Means needs to converge, and the results of Mini Batch K-Means are generally only slightly worse than those of the standard algorithm.
The Mini Batch K-Means algorithm can reduce the convergence time of the K-Means algorithm, and the result is only slightly worse than the standard K-Means algorithm.
Usually when the sample size is greater than 10,000 for clustering, you need to consider using the Mini Batch K-Means algorithm.
The iterative step of the algorithm has two steps:
- Randomly select some data from the data set to form a mini-batch, and assign them to the nearest centroid
- update centroid
Compared with K-means, the data is updated on each small sample set. For each mini-batch, an updated centroid is obtained by computing the mean, and the data in the mini-batch is assigned to the centroid. As the number of iterations increases, the changes in these centroids are gradually reduced until the centroid is stable or reaches the specified number of iterations, and the calculation is stopped.
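The comparison described above can be tried with a minimal sketch like the following. The synthetic data, the three cluster centers, and `batch_size=1024` are illustrative assumptions, not values from the original text.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic data (assumption): 20,000 points around three well-separated centers
X, _ = make_blobs(
    n_samples=20000,
    centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=1.0,
    random_state=0,
)

# Standard K-Means: every iteration uses all samples
full = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Mini Batch K-Means: every iteration updates the centroids from a random batch
mini = MiniBatchKMeans(n_clusters=3, batch_size=1024, n_init=10, random_state=0).fit(X)

# inertia_ is the within-cluster sum of squared distances (SSE); lower is better
print("K-Means inertia:           ", full.inertia_)
print("Mini Batch K-Means inertia:", mini.inertia_)
```

On data like this the two inertia values are usually very close, which is the "only slightly worse" trade-off mentioned above.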
Summary :
- Summary of advantages and disadvantages of k-means algorithm 【Know】
- Advantages :
- The principle is simple (near the center point), easy to implement
- The clustering effect is medium to high (depending on the choice of K)
- Space complexity $O(N)$, time complexity $O(IKN)$
- Disadvantages :
- Sensitive to outliers and noise (central point is easy to shift)
- It is difficult to find clusters that differ greatly in size, and difficult to perform incremental computation
- The result is not necessarily the global optimum, but only a local optimum (related to the number of K and the selection of initial values)
- Optimization method [know]
Optimization method | Idea |
---|---|
Canopy + k-means | Canopy coarse clustering followed by K-means |
K-means++ | The farther a point is from the existing centroids, the more likely it is to become a new centroid |
Bisecting (dichotomous) K-means | Split the cluster with the largest SSE |
k-medoids | Selects the center point differently from K-means (an actual sample rather than the mean) |
kernel K-means | Map to a high-dimensional space |
ISODATA | Dynamic clustering; the value of K can change |
Mini-batch K-means | Cluster large datasets in batches |
6. Feature Engineering - Feature Dimensionality Reduction
learning target:
- Know the definition of dimensionality reduction
- Know the dimensionality reduction process through low-variance filtering
- Know the process of dimensionality reduction using correlation coefficients
- Know the implementation process of principal component analysis
6.1 Dimensionality reduction
6.1.1 Definition
In the field of machine learning and statistics, dimensionality reduction refers to the process of reducing the number of random variables to obtain a set of "uncorrelated" main variables under certain limited conditions. Dimensionality reduction can be further subdivided into two major methods, variable selection and feature extraction .
Simply put, dimensionality reduction is to use a certain mapping method to map data points in the original high-dimensional space to a low-dimensional space. The purpose of doing this is to reduce the error caused by redundant information and noise information, and improve the accuracy in the application. Common data dimensionality reduction methods include: PCA, LDA, MDS, ISOMAP, SNE, T-SNE, AutoEncoder, etc.
In short, dimensionality reduction refers to the process of reducing the number of random variables (features) under certain limited conditions to obtain a set of "uncorrelated" main variables.
The key is: reduce the number of correlated variables
Q : Suppose a dataset has the features A, B, C, D, E, F, G. If the three features B, C and D are highly correlated, can we select one of them (say C) to stand in for the other two (B and D), so that the features of the dataset change from ABCDEFG to ACEFG? Is this understanding correct?
A : This understanding is correct. In this example, if the three BCD features are highly correlated, they probably contain a lot of duplicated information. In this case, we can use a variable selection method to select one feature (such as C) from BCD and delete the other two features (B and D) to reduce redundant information in the dataset. In this way, the features in the data set are changed from the original ABCDEFG to ACEFG.
Of course, this is only one method of dimensionality reduction. Another common method is feature extraction, which generates new, fewer features by applying some transformation to the original features. In the example above, we can use a method such as PCA to convert the three features B, C and D into a single new feature H and then replace B, C and D with H. In this way, the features of the dataset change from ABCDEFG to AEFGH.
In summary, dimensionality reduction aims to reduce redundant information and noise information in the dataset to improve the accuracy of the model in application.
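The two routes in the ABCDEFG example can be sketched as follows. The data is synthetic and the way B, C and D are made correlated (a shared `base` signal plus small noise) is an assumption chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)  # shared signal that makes B, C and D highly correlated

df = pd.DataFrame({
    "A": rng.normal(size=n),
    "B": base + 0.05 * rng.normal(size=n),
    "C": base + 0.05 * rng.normal(size=n),
    "D": base + 0.05 * rng.normal(size=n),
    "E": rng.normal(size=n),
    "F": rng.normal(size=n),
    "G": rng.normal(size=n),
})

# Variable selection: keep C and drop B and D  ->  A, C, E, F, G
selected = df.drop(columns=["B", "D"])

# Feature extraction: replace B, C, D with one principal component H  ->  A, E, F, G, H
extracted = df.drop(columns=["B", "C", "D"])
extracted["H"] = PCA(n_components=1).fit_transform(df[["B", "C", "D"]])[:, 0]

print(list(selected.columns))
print(list(extracted.columns))
```

Selection keeps an original column unchanged, while extraction creates a new column that is a combination of the originals.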
Related features:
- Correlation between relative humidity and rainfall
- etc.
Models learn from features during training. If the features themselves are problematic, or if the features are strongly correlated with one another, learning and prediction are noticeably affected.
6.1.2 Two ways of dimensionality reduction
- feature selection
- Principal component analysis (can be understood as a form of feature extraction)
6.2 Feature Selection
6.2.1 Definition
When the data contains redundant or irrelevant variables (also called features, attributes, or indicators), feature selection aims to find the main features among the original features.
6.2.2 Method
- Filter : mainly explores the characteristics of each feature itself, the relationships between features, and the relationships between features and the target value.
  - Variance selection method: low-variance feature filtering
  - Correlation coefficient
- Embedded : the algorithm automatically selects features (associations between features and target values)
  - Decision tree: information entropy, information gain
  - Regularization: L1, L2
  - Deep learning:
    - convolution
    - …
6.2.3 Low Variance Feature Filtering
This method deletes features with low variance (the meaning of variance was covered earlier). The reasoning follows from the size of the variance:
- The variance of the feature is small: the value of most samples of a feature is relatively similar
- Large feature variance: the values of many samples of a feature are different
Variance is a measure used to measure the degree of dispersion of a set of data. It represents the difference between each variable (observation) and the population mean. The larger the variance, the greater the fluctuation of the data; the smaller the variance, the smaller the fluctuation of the data .
6.2.3.1 API
sklearn.feature_selection.VarianceThreshold(threshold=0.0)
- Role : `sklearn.feature_selection.VarianceThreshold` is a feature selector that removes all low-variance features. It looks only at the features (X), not at the desired output (y), so it can be used for unsupervised learning.
- Parameters :
  - `threshold` : float, default 0. Features whose training-set variance is below this threshold are removed. By default, all features with non-zero variance are kept; that is, only features that have the same value in every sample are removed (features that vary even slightly are kept).
- Attributes :
  - `variances_` : array of shape (n_features,). The variance of each feature.
  - `n_features_in_` : int. The number of features seen during fit.
  - `feature_names_in_` : ndarray of shape (n_features_in_,). Feature names seen during fit. Defined only if X has feature names that are all strings.
- Methods :
  - `fit(X, y=None)` : learn the empirical variances from X.
  - `fit_transform(X, y=None, **fit_params)` : fit the data, then transform it.
  - `get_feature_names_out(input_features=None)` : mask the feature names according to the selected features.
  - `get_params(deep=True)` : get the parameters of this estimator.
  - `get_support(indices=False)` : get the mask or integer index of the selected features.
  - `inverse_transform(X)` : reverse the transformation.
  - `set_params(**params)` : set the parameters of this estimator.
  - `transform(X)` : reduce X to the selected features.
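To make the attributes and methods listed above concrete, here is a small sketch on a hand-made array. The array and the threshold of 0.1 are assumptions picked so that the variances are easy to check by hand.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data (assumption): column 0 is constant, column 1 barely varies, column 2 varies a lot
X = np.array([
    [0, 2.0, 10],
    [0, 2.1, 20],
    [0, 1.9, 30],
    [0, 2.0, 40],
])

selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)

print(selector.variances_)     # per-feature variance learned during fit
print(selector.get_support())  # boolean mask of the features that were kept
print(X_reduced.shape)         # only the high-variance column remains
```

The variances here are 0, 0.005 and 125, so with a threshold of 0.1 only the third column survives.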
6.2.3.2 Data calculation
We filter the indicator features of some stocks. The `'index'`, `'date'` and `'return'` columns are excluded (their types do not match, and they are not indicators we need).
The features are as follows:
'pe_ratio', 'pb_ratio', 'market_cap', 'return_on_asset_net_profit', 'du_return_on_equity', 'ev', 'earnings_per_share', 'revenue', 'total_expense'
- Analysis:
  - Initialize `VarianceThreshold`, specifying the variance threshold
  - Call the `fit_transform` method
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import VarianceThreshold
data = pd.read_csv("./data/factor_returns.csv")
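# 1. Instantiate the transformer
transfer = VarianceThreshold(threshold=1)
# 2. Call fit_transform on the nine indicator columns (column positions 1 to 9)
data = transfer.fit_transform(data.iloc[:, 1:10])
print("Shape:", data.shape)
print("Result after removing low-variance features:\n", data)
Shape: (2318, 8)
Result after removing low-variance features: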
[[ 5.95720000e+00 1.18180000e+00 8.52525509e+10 ... 1.21144486e+12
2.07014010e+10 1.08825400e+10]
[ 7.02890000e+00 1.58800000e+00 8.41133582e+10 ... 3.00252062e+11
2.93083692e+10 2.37834769e+10]
[-2.62746100e+02 7.00030000e+00 5.17045520e+08 ... 7.70517753e+08
1.16798290e+07 1.20300800e+07]
...
[ 3.95523000e+01 4.00520000e+00 1.70243430e+10 ... 2.42081699e+10
1.78908166e+10 1.74929478e+10]
[ 5.25408000e+01 2.46460000e+00 3.28790988e+10 ... 3.88380258e+10
6.46539204e+09 6.00900728e+09]
[ 1.42203000e+01 1.41030000e+00 5.91108572e+10 ... 2.02066110e+11
4.50987171e+10 4.13284212e+10]]
6.2.4 Correlation coefficient
Main implementation methods:
- Pearson correlation coefficient
- Spearman correlation coefficient
6.2.4.1 Pearson Correlation Coefficient
1. Function
The Pearson correlation coefficient is a statistical indicator that reflects how closely two variables are correlated. Its value range is $[-1, 1]$:
- When the coefficient is close to 1, it means that the two variables are positively correlated, that is, when one variable increases, the other also increases;
- When the coefficient is close to -1, it means that the two variables are negatively correlated, that is, when one variable increases, the other will decrease;
- When the coefficient is close to 0, it means that there is no linear relationship between the two variables.
2. Formula calculation case (understand, no need to memorize)
$$
r = \frac{n\sum{xy} - \sum{x}\sum{y}}{\sqrt{n\sum{x^2} - (\sum{x})^2}\sqrt{n\sum{y^2} - (\sum{y})^2}}
$$
where:
- $r$ : Pearson's correlation coefficient, used to measure the linear relationship between two variables.
- $n$ : number of samples.
- $x$ and $y$ : the values of the two variables.
- $\sum xy$ : the sum of the products of $x$ and $y$ over all samples.
- $\sum x$ and $\sum y$ : the sums of $x$ and of $y$ over all samples.
- $\sum x^2$ and $\sum y^2$ : the sums of $x^2$ and of $y^2$ over all samples.
Example : For example, we calculate the annual advertising investment and monthly average sales.
So how to calculate the Pearson correlation coefficient between them?
Final calculation:
$$
\frac{10\times 16679.09 - 346.2 \times 422.5}{\sqrt{10 \times 14304.52 - 346.2^2} \sqrt{10 \times 19687.81 - 422.5^2}} = 0.9942
$$
So we finally came to the conclusion that there is a high positive correlation between advertising investment and monthly average sales.
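The same calculation can be reproduced term by term with NumPy. The data table for this example is not shown in the text; the values below are an assumption, taken from the `x1` and `x2` lists in the SciPy case of this section, whose sums (346.2 and 422.5) match the numbers in the final calculation.

```python
import numpy as np

# Assumed data: advertising investment (x) and monthly average sales (y)
x = np.array([12.5, 15.3, 23.2, 26.4, 33.5, 34.4, 39.4, 45.2, 55.4, 60.9])
y = np.array([21.2, 23.9, 32.9, 34.1, 42.5, 43.2, 49.0, 52.8, 59.4, 63.5])
n = len(x)

# Numerator: n * sum(xy) - sum(x) * sum(y)
numerator = n * np.sum(x * y) - np.sum(x) * np.sum(y)

# Denominator: sqrt(n * sum(x^2) - sum(x)^2) * sqrt(n * sum(y^2) - sum(y)^2)
denominator = (
    np.sqrt(n * np.sum(x ** 2) - np.sum(x) ** 2)
    * np.sqrt(n * np.sum(y ** 2) - np.sum(y) ** 2)
)

r = numerator / denominator
print(round(r, 4))
```

Comparing `r` with `np.corrcoef(x, y)[0, 1]` is a quick way to confirm that the hand-written formula agrees with the library.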
3. Features
The value of the correlation coefficient lies in $[-1, 1]$, that is, $-1 \le r \le 1$. Its properties are as follows:
- When $r > 0$, the two variables are positively correlated; when $r < 0$, they are negatively correlated
- When $|r| = 1$, the two variables are perfectly correlated; when $r = 0$, there is no linear correlation between them
- When $0 < |r| < 1$, there is some degree of correlation between the two variables. The closer $|r|$ is to 1, the stronger the linear relationship; the closer $|r|$ is to 0, the weaker the linear relationship. Generally, three levels are distinguished:
  - $|r| < 0.4$ is low correlation
  - $0.4 \le |r| < 0.7$ is significant correlation
  - $0.7 \le |r| < 1$ is high linear correlation
4. APIs
from scipy.stats import pearsonr
- Function : `scipy.stats.pearsonr` calculates the Pearson correlation coefficient, which measures the linear relationship between two variables. It also returns a p-value for testing non-correlation.
- Parameters :
  - `x` : (N,) array_like, input array.
  - `y` : (N,) array_like, input array.
- Return value :
  - `r` : float, Pearson correlation coefficient, in the range [-1, 1].
  - `p-value` : float, two-tailed p-value.
5. Case
from scipy.stats import pearsonr
x1 = [12.5, 15.3, 23.2, 26.4, 33.5, 34.4, 39.4, 45.2, 55.4, 60.9]
x2 = [21.2, 23.9, 32.9, 34.1, 42.5, 43.2, 49.0, 52.8, 59.4, 63.5]
r, p_value = pearsonr(x1, x2)
print("r:", r)
print("p-value:", p_value)
r: 0.9941983762371884
p-value: 4.922089955456964e-09
According to the results, the Pearson correlation coefficient between `x1` and `x2` is 0.9941983762371884, which shows a very strong positive correlation between the two variables: when `x1` increases, `x2` also increases.
The two-tailed p-value 4.922089955456964e-09 is very close to 0. Typically, if the p-value is less than the significance level (for example, 0.05), we reject the null hypothesis (that there is no correlation between the two variables) and conclude that the correlation is significant. Here the p-value is very small, so we can conclude that there is a significant correlation between `x1` and `x2`.
6.2.4.2 Spearman's rank correlation coefficient (Rank IC)
1. Function :
Spearman's rank correlation coefficient (rank correlation coefficient for short) is a non-parametric index that measures the correlation between two variables. It evaluates how well the relationship between two statistical variables can be described by a monotonic function. When there are no repeated values in the data and the two variables are perfectly monotonically related, the Spearman correlation coefficient is +1 or −1. A positive coefficient reflects a monotonically increasing trend between the two variables X and Y; a negative coefficient reflects a monotonically decreasing trend.
2. Formula (understand, no need to memorize) :
$$
\mathrm{RankIC} = 1 - \frac{6\sum{d_i^2}}{n(n^2 - 1)}
$$
where:
- `RankIC` : Spearman rank correlation coefficient, used to measure the monotonic relationship between two variables.
- `d_i` : the rank difference of the `i`-th observation, that is, the difference between its ranks in the two variables.
- `n` : number of samples.
Q : What is the rank difference?
A : The rank difference is the difference between the two ranks of an observation when calculating Spearman's rank correlation coefficient. For example, if you have two variables `x` and `y` and you rank the values of each variable, the rank difference of an observation is the difference between its rank in `x` and its rank in `y`. The rank differences are used to calculate the Spearman rank correlation coefficient, which measures the monotonic relationship between two variables.
For example, suppose there are two sets of data:
x = [1, 2, 3, 4]
y = [2, 3, 1, 4]
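First, we need to sort each set of data and assign each value its rank (rank 1 for the smallest value):
x_sorted = [1, 2, 3, 4]
x_ranks = [1, 2, 3, 4]
y_sorted = [1, 2, 3, 4]
y_ranks = [2, 3, 1, 4]
We can then compute the rank difference for each observation (rank in x minus rank in y):
rank_differences = [-1, -1, 2, 0]
Finally, we can use these rank differences to calculate the Spearman rank correlation coefficient: the sum of the squared differences is 1 + 1 + 4 + 0 = 6, so with n = 4 the coefficient is 1 − (6 × 6) / (4 × (4² − 1)) = 0.4.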
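As a rough check of the rank-difference procedure, the sketch below ranks the same `x` and `y` with `scipy.stats.rankdata`, applies the rank-difference formula, and compares the result with `scipy.stats.spearmanr`. The formula is valid here because the data has no tied values.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

x = [1, 2, 3, 4]
y = [2, 3, 1, 4]

# Rank each variable (rank 1 = smallest value)
x_ranks = rankdata(x)
y_ranks = rankdata(y)

# Rank difference of each observation
d = x_ranks - y_ranks

# Rank-difference formula: 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
n = len(x)
rank_ic = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

print("ranks of x:", x_ranks)
print("ranks of y:", y_ranks)
print("by formula:", rank_ic)
print("by scipy:  ", spearmanr(x, y)[0])
```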
Example :
3. Features :
- The Spearman correlation coefficient indicates the direction of correlation between X (independent variable) and Y (dependent variable). If Y tends to increase as X increases, the Spearman correlation is positive
- As with the Pearson correlation coefficient, the value lies in $[-1, 1]$
The Spearman correlation coefficient is more widely used than the Pearson correlation coefficient
4. API :
from scipy.stats import spearmanr
- Role : `scipy.stats.spearmanr` calculates the Spearman rank correlation coefficient, which measures the monotonic relationship between two variables. It also returns a p-value for testing non-correlation.
- Parameters :
  - `a` : (N,) array_like, input array.
  - `b` : (N,) array_like, input array, optional.
  - `axis` : int or None, optional. If axis=0 (the default), each column represents a variable and the rows contain observations. If axis=1, the relationship is transposed: each row represents a variable and the columns contain observations. If axis=None, both arrays are flattened.
  - `nan_policy` : {'propagate', 'raise', 'omit'}, optional. Defines what to do when the input contains nan (default 'propagate'). 'propagate': return nan; 'raise': raise an error; 'omit': ignore nan values in the calculation.
  - `alternative` : {'two-sided', 'less', 'greater'}, optional. Defines the alternative hypothesis (default 'two-sided'). 'two-sided': the correlation is not zero; 'less': the correlation is negative (less than zero); 'greater': the correlation is positive (greater than zero).
- Return value :
  - `correlation` : float or ndarray (2-D square). The Spearman correlation coefficient (if only 2 variables are given) or the correlation matrix. The correlation matrix is square, with side length equal to the total number of variables (columns or rows) in a and b combined.
  - `pvalue` : float or ndarray (2-D square). Two-tailed p-value.
5. Case :
from scipy.stats import spearmanr
x1 = [12.5, 15.3, 23.2, 26.4, 33.5, 34.4, 39.4, 45.2, 55.4, 60.9]
x2 = [21.2, 23.9, 32.9, 34.1, 42.5, 43.2, 49.0, 52.8, 59.4, 63.5]
r, p_value = spearmanr(x1, x2)
print("r:", r)
print("p-value:", p_value)
r: 0.9999999999999999
p-value: 6.646897422032013e-64
According to the results, the Spearman rank correlation coefficient between `x1` and `x2` is 0.9999999999999999, which shows a very strong positive correlation between the two variables: when `x1` increases, `x2` also increases.
The two-tailed p-value 6.646897422032013e-64 is very close to 0. Typically, if the p-value is less than the significance level (for example, 0.05), we reject the null hypothesis (that there is no correlation between the two variables) and conclude that the correlation is significant. Here the p-value is very small, so we can conclude that there is a significant correlation between `x1` and `x2`.
6.3 Principal Component Analysis (PCA)
6.3.1 What is Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a commonly used dimensionality reduction technique. Through a linear transformation, it converts a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components. PCA reduces the dimensionality of a dataset by keeping the lower-order (leading) principal components and ignoring the higher-order ones, so that the features contributing most to the variance of the dataset are retained.
Definition : The process of transforming high-dimensional data into low-dimensional data, during which original variables may be discarded and new variables created.
Function : data dimensionality compression, reduce the dimensionality (complexity) of the original data as much as possible, and lose a small amount of information.
Application : regression analysis or cluster analysis.
The notion of "information" used here was introduced in the decision tree section.
So how can we better understand this process? Let's look at a picture.
If we want to see a teapot as a whole, the fourth picture clearly gives the most complete view of its shape. Although we cannot see every detail, that does not stop us from capturing the main features.
6.3.2 API
sklearn.decomposition.PCA(n_components=None)
- Role : `sklearn.decomposition.PCA` is the class that implements principal component analysis (PCA) in scikit-learn. Through a linear transformation it converts a set of possibly correlated variables into a set of linearly uncorrelated variables, called principal components. PCA reduces the dimensionality of a dataset by keeping the lower-order (leading) principal components and ignoring the higher-order ones, so that the features contributing most to the variance are retained.
- Parameters :
  - `n_components` : int, float, None or str, default None. The number of components to keep.
    - Decimal: the proportion of information (variance) to retain
    - Integer: the number of features to reduce to
    - If n_components is not set, all components are kept: n_components == min(n_samples, n_features).
    - If n_components == 'mle' and svd_solver == 'full', Minka's MLE is used to guess the dimensionality. Using n_components == 'mle' interprets svd_solver == 'auto' as svd_solver == 'full'.
    - If 0 < n_components < 1 and svd_solver == 'full', the number of components is chosen so that the explained variance is greater than the proportion specified by n_components.
    - If svd_solver == 'arpack', the number of components must be strictly less than the minimum of n_features and n_samples. Therefore, in the None case: n_components == min(n_samples, n_features) - 1.
  - `copy` : bool, default True. If False, the data passed to fit is overwritten, running fit(X).transform(X) will not give the expected result, and fit_transform(X) should be used instead.
  - `whiten` : bool, default False. When True, the components_ vectors are multiplied by the square root of n_samples and divided by the singular values, so that the outputs are uncorrelated and have unit component-wise variance. Whitening removes some information from the transformed signal (the relative variance scales of the components), but can sometimes improve the prediction accuracy of downstream estimators by making their data respect some hard-wired assumptions.
  - `svd_solver` : {'auto', 'full', 'arpack', 'randomized'}, default 'auto'.
    - auto: the solver is chosen by a default strategy based on X.shape and n_components:
      - If the input data is larger than 500x500 and the number of components to extract is less than 80% of the smallest dimension of the data, the more efficient 'randomized' method is used.
      - Otherwise the exact full SVD is computed and optionally truncated afterwards.
    - full: runs an exact full SVD by calling the standard LAPACK solver via scipy.linalg.svd and selects the components in postprocessing.
  - `tol` : float, default 0.0. Convergence tolerance for svd_solver == 'arpack'.
  - `iterated_power` : int or 'auto', default 'auto'. Number of iterations of the power method used when svd_solver == 'randomized'.
  - `random_state` : int, RandomState instance or None, default None. Seeds the random number generator; used when svd_solver is 'arpack' or 'randomized'.
- Methods :
  - `fit(X[, y])` : fit the model.
  - `fit_transform(X[, y])` : fit the model and apply the transformation.
  - `get_covariance()` : compute the data covariance.
  - `get_params([deep])` : get the parameters of this estimator.
  - `get_precision()` : compute the precision matrix.
  - `inverse_transform(X)` : convert the data back to the original space.
  - `score(X[, y])` : return the mean log-likelihood.
  - `score_samples(X)` : return an array of per-sample log-likelihoods.
  - `set_params(**params)` : set the parameters of this estimator.
  - `transform(X)` : reduce the dimensionality of the data.
6.3.3 Data Calculation
First take a simple data calculation:
from sklearn.decomposition import PCA
data = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]
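# 1. Instantiate PCA; a decimal gives the proportion of information to keep
transfer = PCA(n_components=0.9)
# 2. Call the fit_transform method
data_PCA = transfer.fit_transform(data)
print(f"Result after keeping 90% of the information:\n{data_PCA}")
Result after keeping 90% of the information: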
[[ 1.28620952e-15 3.82970843e+00]
[ 5.74456265e+00 -1.91485422e+00]
[-5.74456265e+00 -1.91485422e+00]]
from sklearn.decomposition import PCA
data = [[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]]
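# 1. Instantiate PCA; an integer gives the number of dimensions to reduce to
transfer = PCA(n_components=2)
# 2. Call the fit_transform method
data_PCA = transfer.fit_transform(data)
print(f"Result after reducing to 2 dimensions:\n{data_PCA}")
Result after reducing to 2 dimensions: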
[[ 1.28620952e-15 3.82970843e+00]
[ 5.74456265e+00 -1.91485422e+00]
[-5.74456265e+00 -1.91485422e+00]]
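Both calls return the same two columns. One way to see why is to inspect `explained_variance_ratio_` on the same data; the sketch below is an illustration (the variable names `ratios`, `cumulative` and `k` are introduced here), not part of the original walkthrough.

```python
import numpy as np
from sklearn.decomposition import PCA

data = np.array([[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]])

# Keep every component so that the full variance breakdown is visible
pca = PCA().fit(data)
ratios = pca.explained_variance_ratio_  # share of the variance explained by each component
cumulative = np.cumsum(ratios)          # running total of the explained variance

# Smallest number of components whose cumulative share exceeds 90%
k = int(np.searchsorted(cumulative, 0.9) + 1)

print("explained variance ratio:", ratios)
print("cumulative:", cumulative)
print("components needed for 90%:", k)
```

The first component alone explains about 75% of the variance, so a second one is needed to pass 90%, which is why `n_components=0.9` and `n_components=2` give the same result on this data.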
Summary :
- The definition of dimensionality reduction [understand]
  - Change the feature values by choosing which columns to keep and which to delete
  - The goal is to get a set of "uncorrelated" main variables
- Two ways of dimensionality reduction [understand]
  - Feature selection
  - Principal component analysis (can be understood as a form of feature extraction)
- Feature selection [know]
  - Definition: eliminate redundant variables in the data
  - Methods:
    - Filter: mainly explores the characteristics of each feature itself, the relationships between features, and the relationships between features and the target value
      - Variance selection method: low-variance feature filtering (low variance means low discrimination)
      - Correlation coefficient
    - Embedded: the algorithm automatically selects features (associations between features and target values)
      - Decision tree: information entropy, information gain
      - Regularization: L1, L2
- Low-variance feature filtering [know]
  - Eliminate columns with relatively small variance
  - API: `sklearn.feature_selection.VarianceThreshold(threshold=0.0)`
    - Removes all low-variance features
    - Note: a value should be specified for the `threshold` parameter
- Correlation coefficient [master]
  - Main implementation methods:
    - Pearson correlation coefficient
    - Spearman correlation coefficient
  - Pearson correlation coefficient
    - Calculated from the actual values
    - Relatively complex
    - API: `from scipy.stats import pearsonr`
    - The closer the absolute value of the return value is to 1, the stronger the correlation
    - The closer the return value is to 0, the weaker the correlation
  - Spearman correlation coefficient
    - Calculated from rank differences
    - Simpler than the Pearson correlation coefficient
    - API: `from scipy.stats import spearmanr`
    - The closer the absolute value of the return value is to 1, the stronger the correlation
    - The closer the return value is to 0, the weaker the correlation
- PCA [know]
  - Definition: high-dimensional data is converted to low-dimensional data, and new variables are generated in the process
  - API: `sklearn.decomposition.PCA(n_components=None)`
    - `n_components` :
      - Decimal: the proportion of information to keep
      - Integer: the number of dimensions to reduce to
7. Case: Exploring the segmentation of users' preferences for item categories
learning target:
- Applying PCA and K-means to realize the user's preference segmentation for item categories
7.1 Requirements
Whether you shop from a carefully planned list or let your mood guide your shopping, our unique food habits define who we are. Instacart is a grocery ordering and delivery app designed to make it easy to fill your fridge and pantry with your personal favorites and staples when you need them. After selecting products through the Instacart app, a personal shopper reviews your order and makes in-store purchases and delivery for you.
Instacart's data science team plays an important role in delivering a delightful shopping experience. Currently, they use transactional data to develop models that predict which products a user will buy again, try for the first time, or add to a cart next during a session. Recently, Instacart open-sourced this data - see their blog post on 3 million Instacart orders.
In this competition, Instacart challenges the Kaggle community to use this anonymized customer order data to predict which previously purchased products a user will include in their next order. Not only are they looking for the best model, Instacart is also looking for machine learning engineers to grow their team.
The winner of this competition will receive a cash prize and the opportunity to fast-track the recruitment process. For more information on exciting opportunities at Instacart, check out their careers page or contact their recruitment team directly at [email protected].
Dataset link : Instacart Market Basket Analysis
Data are as follows:
- `order_products__prior.csv` : order and product information
  - Fields: `order_id`, `product_id`, `add_to_cart_order`, `reordered`
- `products.csv` : product information
  - Fields: `product_id`, `product_name`, `aisle_id`, `department_id`
- `orders.csv` : user's order information
  - Fields: `order_id`, `user_id`, `eval_set`, `order_number`, …
- `aisles.csv` : the specific item category to which each product belongs
  - Fields: `aisle_id`, `aisle`
7.2 Analysis
- Get the data
- Basic data processing
  - Merge tables
  - Crosstab merge
  - Data truncation
- Feature engineering: PCA
- Machine learning (k-means)
- Model evaluation
  - `sklearn.metrics.silhouette_score(X, labels)`
    - Computes the mean silhouette coefficient over all samples
    - `X` : feature values
    - `labels` : the cluster labels assigned by the clustering
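The evaluation step can be sketched on synthetic data before it is applied to the real Instacart table. The blob centers and the choice of three clusters below are assumptions made only for this illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in (assumption) for the reduced feature matrix
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [-6, 6]],
    cluster_std=1.0,
    random_state=0,
)

# Cluster, then score the labels produced by the clustering
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)

# The silhouette coefficient lies in [-1, 1]; values near 1 indicate compact, well-separated clusters
print("silhouette score:", score)
```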
7.3 Code implementation
7.3.0 Import library
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
7.3.1 Get data
# 1. Read the data
order_product = pd.read_csv("./data/instacart-market-basket-analysis/order_products__prior.csv")
products = pd.read_csv("./data/instacart-market-basket-analysis/products.csv")
orders = pd.read_csv("./data/instacart-market-basket-analysis/orders.csv")
aisles = pd.read_csv("./data/instacart-market-basket-analysis/aisles.csv")
7.3.2 Basic data processing
7.3.2.1 Merge tables
# 2. Basic data processing
## 2.1 Merge tables
# on: a label or list of labels naming the column(s) or index level(s) to join on; they must exist in both DataFrames. If on is None and the merge is not on indexes, it defaults to the intersection of the columns in both DataFrames.
table_1 = pd.merge(order_product, products, on="product_id")
table_2 = pd.merge(table_1, orders, on="order_id")
table = pd.merge(table_2, aisles, on="aisle_id")
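To make the effect of merging on a shared key concrete, here is a minimal sketch on two tiny tables. The names order_product_demo and products_demo and all values are made up for illustration; they only mimic the column names of the real files.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values)
order_product_demo = pd.DataFrame({
    "order_id": [1, 1, 2],
    "product_id": [10, 11, 10],
})
products_demo = pd.DataFrame({
    "product_id": [10, 11, 12],
    "aisle_id": [100, 101, 100],
})

# pd.merge defaults to an inner join: product 12 never appears in an order,
# so it is absent from the merged result
merged_demo = pd.merge(order_product_demo, products_demo, on="product_id")
print(merged_demo)
```

Each order line picks up the aisle_id of its product, which is the same thing the three merges above do at full scale.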
7.3.2.2 Crosstab Merge
A cross tabulation (crosstab) is a commonly used summary table for frequency distributions. Its main value lies in showing how two (or more) categorical variables relate to each other. By default, pd.crosstab computes a frequency table of the factors, unless an array of values and an aggregation function are passed.
For example, we can use the pd.crosstab function to compute a crosstabulation of two categorical variables. The result shows how many times each value of one variable occurs in combination with each value of the other.
Here is a simple example of how to use pd.crosstab:
import pandas as pd
# Create sample data
data = {
    'gender': ['male', 'female', 'male', 'female', 'male', 'female', 'male', 'male'],
    'favorite_color': ['red', 'red', 'blue', 'green', 'blue', 'blue', 'red', 'green'],
    'quantity': [1, 2, 3, 4, 5, 6, 7, 8]}
df = pd.DataFrame(data)
# Compute the crosstab
ct = pd.crosstab(df['gender'], df['favorite_color'])
print(ct)
df.head()
ct.head()
This code will output the following result:
favorite_color  blue  green  red
gender
female             1      1    1
male               2      1    2
In this example, we computed the crosstabulation of the gender and favorite_color columns using the pd.crosstab function. The result shows how many times each value in the gender column occurs in combination with each value in the favorite_color column.
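The text mentions that a crosstab stops counting frequencies once an array of values and an aggregation function are passed. A minimal sketch of that variant follows; the data frame df_demo and its values are illustrative, and the example sums a quantity column instead of counting rows.

```python
import pandas as pd

# Illustrative data: two categorical columns plus a numeric one
df_demo = pd.DataFrame({
    "gender": ["male", "female", "male", "female", "male", "female", "male", "male"],
    "favorite_color": ["red", "red", "blue", "green", "blue", "blue", "red", "green"],
    "quantity": [1, 2, 3, 4, 5, 6, 7, 8],
})

# With values= and aggfunc=, each cell holds the sum of quantity
# for that (gender, favorite_color) pair instead of a row count
ct_sum = pd.crosstab(
    df_demo["gender"],
    df_demo["favorite_color"],
    values=df_demo["quantity"],
    aggfunc="sum",
)
print(ct_sum)
```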
Recommended Video : Pandas_Pivot and Crosstabs
A pivot table is a tool for summarizing and analyzing data. It aggregates a DataFrame by one or more grouping keys and returns a new DataFrame; the keys become the index and the columns of the result (a MultiIndex when several keys are used). The type of aggregation is set by the aggfunc parameter, so a pivot table is essentially a higher-level interface to groupby.
A crosstab counts how often the values of one column occur together with the values of another. In other words, it is a special pivot table for group frequencies.
In short, a pivot table is a general tool for grouped statistics, and a crosstab is a special pivot table that is more convenient when you only need to count group frequencies.
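The relationship between the two can be checked on a small example. The sketch below uses made-up user_id and aisle_id values and a helper column named count (an assumption added only so that pivot_table has something to sum); both calls should produce the same user-by-aisle table.

```python
import pandas as pd

# Hypothetical purchase records: one row per purchased product
df_demo = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "aisle_id": [100, 100, 101, 101, 102],
    "count": [1, 1, 1, 1, 1],  # helper column, each row counts once
})

# crosstab: counts co-occurrences of user_id and aisle_id directly
ct_demo = pd.crosstab(df_demo["user_id"], df_demo["aisle_id"])

# pivot_table: the same counts via explicit grouping and a sum aggregation
pt_demo = df_demo.pivot_table(
    index="user_id", columns="aisle_id", values="count",
    aggfunc="sum", fill_value=0,
)

print(ct_demo)
print((ct_demo.values == pt_demo.values).all())
```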
## 2.2 Crosstab
table = pd.crosstab(table["user_id"], table["aisle_id"])
table.head()
Here we use a crosstab because we want to see the relationship between "user_id" and the product category "aisle_id": each row is a user, and each column counts how many products from that aisle the user has bought.
7.3.2.3 Data interception
## 2.3 Data interception
table_clip = table[:1000]
table_clip.head()
7.3.3 Feature Engineering: PCA Principal Component Analysis
# 3. Feature engineering: PCA (principal component analysis)
transfer = PCA(n_components=0.9)  # keep enough components to explain 90% of the variance
data = transfer.fit_transform(table_clip)
data
array([[-2.27452872e+01, -7.32942365e-01, -2.48945893e+00, ...,
-4.78491473e+00, -3.10742945e+00, -2.45192316e+00],
[ 5.28638801e+00, -3.00176267e+01, -1.11226906e+00, ...,
9.24145693e+00, -3.11309382e+00, 2.20144174e+00],
[-6.52593099e+00, -3.87333123e+00, -9.23859508e+00, ...,
-1.33929081e+00, 1.25062993e+00, 6.12717485e-01],
...,
[ 1.31226615e+01, -2.77296885e+01, -4.62403246e+00, ...,
7.40793534e+00, 1.03829352e+00, -1.39058393e+01],
[ 1.64905900e+02, -8.54916188e+01, 1.90577481e-02, ...,
-5.62014943e+00, -1.38488891e+01, -7.11424774e+00],
[-1.60244724e+00, 1.82037661e+00, 8.55756408e+00, ...,
3.69860152e+00, 2.82248188e+00, -3.79491023e+00]])
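print("Shape before dimensionality reduction:", table_clip.shape)
print("Shape after dimensionality reduction:", data.shape)
Shape before dimensionality reduction: (1000, 134)
Shape after dimensionality reduction: (1000, 22)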
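When n_components is a float between 0 and 1, scikit-learn keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below shows how this could be inspected; it runs on synthetic data (X_demo is generated here and is not the Instacart table), so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 10 features driven by 3 latent factors plus small noise
latent = rng.normal(size=(200, 3))
X_demo = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

pca_demo = PCA(n_components=0.9)
X_reduced = pca_demo.fit_transform(X_demo)

# explained_variance_ratio_ holds the share of variance captured by each kept component
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print("components kept:", pca_demo.n_components_)
print("cumulative explained variance:", cumulative)
```

Running the same two print statements on transfer after fitting would show how the 22 retained components add up to the 90% threshold.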
7.3.4 Machine Learning: K-means Clustering Algorithm
import os
os.environ["OMP_NUM_THREADS"] = "4"
# 4. Machine learning: K-means clustering
estimator = KMeans(n_clusters=8, random_state=22)  # split into 8 clusters
pred = estimator.fit_predict(data)
pred
array([0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 3, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 7, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 7, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 7, 1, 0,
1, 6, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 7,
0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 0, 2, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 3, 7, 1,
1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 7, 0,
0, 0, 0, 1, 0, 7, 0, 1, 0, 0, 6, 4, 0, 0, 0, 7, 0, 1, 0, 0, 1, 1,
1, 1, 3, 0, 0, 1, 7, 0, 1, 0, 0, 7, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
7, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 3,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 3, 7, 0, 0, 1, 0, 0,
0, 0, 4, 5, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
7, 0, 0, 0, 4, 0, 0, 1, 0, 0, 0, 0, 7, 1, 3, 0, 0, 0, 3, 0, 0, 0,
0, 1, 0, 7, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 4, 0,
0, 0, 0, 0, 1, 0, 7, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1,
0, 7, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 2, 1, 0, 0, 1, 0, 0, 0, 0, 1,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 4, 0, 0, 0, 1, 0, 1, 1, 7, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2,
0, 7, 1, 7, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 7, 0, 0, 0,
0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
0, 1, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 7,
1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 7, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 4, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 7, 1, 0, 0, 1,
0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 7, 0, 0, 0, 1, 7, 0, 0, 3, 1, 1, 1, 1, 0, 3, 0, 1, 3, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 0, 3, 0, 0, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 7, 1, 1, 0, 1, 0, 0, 0,
1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 7, 0, 0, 0, 0, 1, 7, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 3, 0, 0,
0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
0, 0, 7, 0, 1, 1, 0, 0, 1, 0, 2, 1, 0, 0, 0, 7, 0, 7, 1, 0, 1, 0,
0, 1, 0, 0, 1, 1, 0, 1, 0, 7, 0, 4, 0, 0, 0, 1, 0, 0, 0, 0, 7, 7,
1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 2, 0, 7, 0, 1, 0, 1, 0, 0,
0, 0, 0, 0, 3, 0, 0, 0, 7, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 1, 7, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0, 0, 1, 1, 0, 3, 0, 0, 0, 3,
1, 0, 0, 1, 7, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 1, 2, 0])
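A long label array like the one above is hard to read by eye. One simple way to summarize it is to count the samples per cluster with np.bincount. The sketch below uses a short made-up array, labels_demo, standing in for pred.

```python
import numpy as np

# Hypothetical label array standing in for `pred`
labels_demo = np.array([0, 1, 0, 0, 7, 0, 1, 3, 0, 0])

# Count how many samples fall into each of the 8 clusters;
# minlength=8 keeps empty clusters in the result as zeros
sizes = np.bincount(labels_demo, minlength=8)
for cluster_id, size in enumerate(sizes):
    print(f"cluster {cluster_id}: {size} samples ({size / labels_demo.size:.0%})")
```

Applied to pred, this would show directly how many of the 1000 users end up in each cluster.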
7.3.5 Model Evaluation
The silhouette coefficient is a metric for evaluating the quality of a clustering. It is computed per sample from two quantities: a, the mean distance from the sample to the other samples in its own cluster, and b, the mean distance from the sample to the samples of the nearest cluster it does not belong to. The silhouette coefficient of a sample is $(b - a) / \max(a, b)$. Note that the silhouette coefficient is only defined when the number of labels satisfies 2 <= n_labels <= n_samples - 1. The silhouette_score function returns the average silhouette coefficient over all samples.
The value range of the silhouette coefficient is [-1, 1]:
- A coefficient close to 1 means the sample is well matched to its own cluster and far from the others, so the clustering is good;
- A coefficient close to -1 means the clustering is very poor;
- A coefficient close to 0 means the clusters overlap.
Negative values usually indicate that a sample was assigned to the wrong cluster, because a different cluster is more similar to it.
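The formula can be verified by hand on a tiny example. The sketch below uses four made-up one-dimensional points in two clusters, computes a and b manually for the first point, and compares the result with scikit-learn's silhouette_samples, which returns the per-sample coefficients.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Four 1-D points in two clusters (hypothetical toy data)
X_demo = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_demo = np.array([0, 0, 1, 1])

# Manual computation for the first sample (x = 0, cluster 0)
a = abs(0.0 - 1.0)                           # mean distance to the other points in its own cluster
b = (abs(0.0 - 10.0) + abs(0.0 - 11.0)) / 2  # mean distance to the points of the nearest other cluster
s_manual = (b - a) / max(a, b)

s_sklearn = silhouette_samples(X_demo, labels_demo)
mean_score = silhouette_score(X_demo, labels_demo)
print(s_manual, s_sklearn[0])
print(mean_score)
```

Here a = 1 and b = 10.5, so the coefficient of the first point is 9.5 / 10.5, and silhouette_score is simply the mean of the four per-sample values.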
In the scikit-learn library, you can use the sklearn.metrics.silhouette_score function to calculate the silhouette coefficient. The function requires a data matrix X and a label array labels, and returns the average silhouette coefficient over all samples.
# 5. Model evaluation
score = silhouette_score(data, pred)
score
0.46400567259894415
Let's take a look at the effect of intercepting different amounts of data:
clip_num = [10, 50, 100, 500, 1000, 1500, 5000, 10000]
for clip_n in clip_num:
    ## 2.3 Data interception
    table_clip = table[:clip_n]
    # 3. Feature engineering: PCA
    transfer = PCA(n_components=0.9)  # keep 90% of the variance
    data = transfer.fit_transform(table_clip)
    # 4. Machine learning: K-means clustering
    estimator = KMeans(n_clusters=8, random_state=22)  # split into 8 clusters
    pred = estimator.fit_predict(data)
    # 5. Model evaluation
    score = silhouette_score(data, pred)
    print(f"[Data size: {clip_n}] score: {score*100:.4f}%")
[Data size: 10] score: 13.2716%
[Data size: 50] score: 33.6469%
[Data size: 100] score: 31.4929%
[Data size: 500] score: 44.9804%
[Data size: 1000] score: 46.4006%
[Data size: 1500] score: 38.5747%
[Data size: 5000] score: 38.0150%
[Data size: 10000] score: 37.5044%
The scores are not very good: the best is about 0.46 (at 1,000 rows), and the score settles near 0.38 for the larger subsets. This is probably caused by the small number of features we use.
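The experiments above keep n_clusters fixed at 8. Another thing one could vary, as an assumption on top of the text rather than something it tests, is the number of clusters itself, using the same silhouette score for comparison. The sketch below does this on synthetic blobs from make_blobs (not the Instacart data), so the result only illustrates the procedure.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 generated groups (illustrative only)
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=22)

scores = {}
for k in range(2, 9):
    # Fit K-means for this k and score the resulting labels
    labels = KMeans(n_clusters=k, n_init=10, random_state=22).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)
    print(f"k={k}: silhouette={scores[k]:.4f}")

# Pick the k with the highest average silhouette coefficient
best_k = max(scores, key=scores.get)
print("best k:", best_k)
```

The same loop could be run with data from the PCA step in place of X_demo to see whether a different number of clusters separates the users better.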