DBSCAN clustering method combined with PCA dimensionality reduction (with Python code)

 

Table of contents

Introduction:

1. PCA dimensionality reduction:

(1) Concept explanation:

(2) Implementation steps:

(3) Pros and cons: 

2. DBSCAN clustering:

(1) Concept explanation:

(2) Algorithm principle:

(3) Pros and cons:

Code:

0. Data preparation:

1. PCA dimensionality reduction:

2. DBSCAN clustering:

3. Code summary:

Results:

1. Dimensionality reduction effect:

2. Clustering effect:

Closing remarks:


 

Introduction:

1. PCA dimensionality reduction:

(1) Concept explanation:

PCA, short for Principal Component Analysis, is a dimensionality reduction method. It extracts the principal components of the features, compressing high-dimensional data into a low-dimensional space while retaining the main information.

The low-dimensional data obtained after PCA is simply the projection of the original high-dimensional feature data onto a lower-dimensional "plane" (any space of lower dimension can be treated as a plane here; for example, three-dimensional space plays that role relative to four-dimensional space). Although the reduced data reflects most of the information in the original high-dimensional data, it cannot reflect all of it, so it should be applied with the actual situation in mind.

        (2) Implementation steps:

        PCA is implemented in six main steps:

        1. Standardization (center the original data by removing the mean; if the features are on very different scales, also divide by the standard deviation).

        Specifically:

        Enorm = E − Emean

        where E is the original data matrix, Emean is the matrix of column means, and Enorm is the standardized matrix.

 

        2. Covariance (compute the covariance matrix of the standardized data).

        Specifically:

        Cov = (1 / (m − 1)) · Enormᵀ · Enorm

        where Cov is the covariance matrix, m is the number of samples, and Enorm is the standardized matrix.

 

        3. Eigendecomposition (compute the eigenvalues and eigenvectors of the covariance matrix).

        Specifically:

        Suppose a real number λ and an n-dimensional column vector X (where n is the number of columns of the original matrix E) satisfy:

        Cov · X = λ · X

        Then λ is an eigenvalue of the covariance matrix Cov, and X is the corresponding eigenvector.

 

        4. Top-K eigenvalues (keep the K largest eigenvalues, where K is the target dimension after reduction).

        Specifically:

        Sort the eigenvalues in descending order and retain the K largest ones for the following steps.

 

        5. K eigenvectors (find the eigenvectors corresponding to these K eigenvalues).

        Specifically:

        Obtain the eigenvector corresponding to each retained eigenvalue from the equation in step 3.

 

        6. Dimensionality reduction (multiply the standardized data by the K eigenvectors to obtain the reduced data).

        Specifically:

        Epca = Enorm · [X1, X2, ..., Xk]

        where Epca is the final reduced-dimension matrix, Enorm is the standardized matrix, and X1, X2, ..., Xk are the eigenvectors corresponding to the K largest eigenvalues.
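        As a quick cross-check of these six steps (this uses scikit-learn's built-in PCA class, not the manual implementation given later in this post; the toy data and variable names are only illustrative):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))         # 100 toy samples with 5 features
pca = PCA(n_components=2)             # K = 2, the target dimension after reduction
X_low = pca.fit_transform(X)          # centering, eigendecomposition and projection in one call
print(X_low.shape)                    # (100, 2)
print(pca.explained_variance_ratio_)  # fraction of the total variance kept by each component

        The explained_variance_ratio_ attribute is a handy way to see how much information the retained components actually preserve.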

        (3) Pros and cons: 

        Advantages:

        1. The principal components obtained by PCA are mutually orthogonal, which removes correlations between the original features.

        2. The computation is not complicated, so PCA is simple and easy to implement.

        3. While retaining most of the important information, it reduces dimensionality and simplifies downstream computation.

        Disadvantages:

        1. The principal components are linear combinations of the original features and are hard to interpret.

        2. PCA keeps the directions along which the data has the largest variance on the new axes, so low-variance directions are likely to be discarded even though they may carry important information.

2. DBSCAN clustering:

        (1) Concept explanation:

        Density clustering, also known as density-based clustering, assumes that the cluster structure can be determined by how tightly the samples are distributed. Such algorithms examine the connectivity between samples from the density point of view and grow clusters by repeatedly adding connectable samples until the final clustering result is obtained.

        DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is one such algorithm; it characterizes the tightness of the sample distribution through a pair of "neighborhood" parameters (ε, MinPts).

        (2) Algorithm principle:

        Given a data set D = {x1, x2, ..., xm}, define the following concepts:

        1. ε-neighborhood: for a sample xj ∈ D, its ε-neighborhood contains every sample of D whose distance from xj is at most ε.

        2. Core object: xj is a core object if its ε-neighborhood contains at least MinPts samples.

        3. Directly density-reachable: xj is directly density-reachable from xi if xi is a core object and xj lies in the ε-neighborhood of xi.

        4. Density-reachable: xj is density-reachable from xi if there is a chain of samples p1 = xi, p2, ..., pn = xj in which each pk+1 is directly density-reachable from pk.

        5. Density-connected: xi and xj are density-connected if there exists a sample xk from which both xi and xj are density-reachable.

        6. Cluster: a maximal set of mutually density-connected samples; samples that belong to no cluster are treated as noise.

        With these concepts in place, the algorithm works as follows: first find all core objects; then repeatedly pick an unprocessed core object at random and grow a new cluster around it by collecting every sample that is density-reachable from it; any sample that ends up in no cluster is labelled as noise. The full Python implementation is given in the code section below.
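        For reference, scikit-learn also ships a DBSCAN implementation that follows this procedure; a minimal sketch (the eps and min_samples values and the toy data are only illustrative):

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                        # toy 2-D data
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))                                   # cluster ids; -1 marks noise points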

        (3) Pros and cons:

        Advantages:

        1. It can identify clusters of arbitrary shape.

        2. It treats regions of sufficient density as clusters, so it can discover clusters even in spatial data that contains noise.

        3. The number of clusters does not need to be specified in advance; the algorithm discovers it on its own.

        Disadvantages:

        1. The minimum number of points (MinPts) and the radius (ε) still need to be specified (although, compared with other clustering algorithms, this leaves considerable freedom).

        2. These two parameters strongly affect the result and usually need several rounds of tuning; a small parameter-sweep sketch is given below.
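        A minimal sketch of such a sweep, using scikit-learn's DBSCAN on synthetic data (the grid values and make_blobs parameters are arbitrary choices of mine):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
for eps in (0.3, 0.5, 0.8):
    for min_pts in (5, 10, 30):
        labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # ignore the noise label
        n_noise = int(np.sum(labels == -1))
        print(f"eps={eps}, MinPts={min_pts}: {n_clusters} clusters, {n_noise} noise points")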

Code:

0. Data preparation:

        Here we use the iris dataset shipped with the sklearn library (sklearn.datasets.load_iris) as the test data. It contains 150 samples; each sample has four features (sepal length, sepal width, petal length, petal width) and a class label (0, 1, 2 for Iris setosa, Iris versicolor and Iris virginica).

        First, install the sklearn library with pip. Note that the command is pip install scikit-learn, not pip install sklearn, just as OpenCV is imported with import cv2 but installed with pip install opencv-python.

pip install scikit-learn

        Then, load the data set: x is the iris feature set and y is the iris label set (both are numpy.ndarray arrays).

from sklearn.datasets import load_iris
x = load_iris().data
y = load_iris().target
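
A quick sanity check of what was just loaded (the shapes and label values below are facts of the iris dataset):

import numpy as np
from sklearn.datasets import load_iris

x = load_iris().data      # feature matrix
y = load_iris().target    # class labels
print(type(x), x.shape)   # <class 'numpy.ndarray'> (150, 4)
print(type(y), y.shape)   # <class 'numpy.ndarray'> (150,)
print(np.unique(y))       # [0 1 2]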

1. PCA dimensionality reduction:

import numpy as np

def PCA_DimRed(dataMat, topNfeat):  # PCA_DimRed -- PCA dimensionality reduction
    meanVals = np.mean(dataMat, axis=0)
    meanRemoved = dataMat - meanVals  # standardization (mean removal)
    covMat = np.cov(meanRemoved, rowvar=False)
    eigVals, eigVets = np.linalg.eig(np.mat(covMat))  # eigenvalues and eigenvectors of the covariance matrix
    eigValInd = np.argsort(eigVals)  # sort the eigenvalues in ascending order and return their indices
    eigValInd = eigValInd[:-(topNfeat + 1):-1]  # keep the indices of the K largest eigenvalues
    redEigVects = eigVets[:, eigValInd]  # the corresponding eigenvectors
    lowDDatMat = meanRemoved * redEigVects  # project the data into the new low-dimensional space
    # reconMat = (lowDDatMat * redEigVects.T) + meanVals  # reconstruct the original data
    return lowDDatMat
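
A quick usage example (assuming the PCA_DimRed function above is in scope; it returns a NumPy matrix, so np.asarray is used to inspect the shape):

import numpy as np
from sklearn.datasets import load_iris

x = load_iris().data
low = PCA_DimRed(x, 2)        # reduce the 4 iris features to 2 principal components
print(np.asarray(low).shape)  # (150, 2)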

2. DBSCAN clustering:

import numpy as np
import random
import copy

def find_neighbor(data, pos, eps):  # find every sample within distance eps of sample `pos` (also defined in the code summary below)
    temp = np.sum((data - data[pos]) ** 2, axis=1) ** 0.5  # Euclidean distance to every sample
    N = np.argwhere(temp <= eps).flatten().tolist()
    return set(N)

def DBSCAN_cluster(mat, eps, min_Pts):  # DBSCAN clustering: no need to specify the number of clusters, and it handles clusters of many shapes
    k = -1
    neighbor_list = []  # neighborhood of each sample
    omega_list = []  # set of core objects
    gama = set([x for x in range(len(mat))])  # initially mark all points as unvisited
    cluster = [-1 for _ in range(len(mat))]  # cluster labels (-1 = noise)
    for i in range(len(mat)):
        neighbor_list.append(find_neighbor(mat, i, eps))
        if len(neighbor_list[-1]) >= min_Pts:
            omega_list.append(i)  # the sample is a core object
    omega_list = set(omega_list)  # convert to a set for convenience
    while len(omega_list) > 0:
        gama_old = copy.deepcopy(gama)
        j = random.choice(list(omega_list))  # randomly pick a core object
        k = k + 1
        Q = list()
        Q.append(j)
        gama.remove(j)
        while len(Q) > 0:
            q = Q[0]
            Q.remove(q)
            if len(neighbor_list[q]) >= min_Pts:
                delta = neighbor_list[q] & gama  # unvisited neighbours of q
                deltalist = list(delta)
                for i in range(len(delta)):
                    Q.append(deltalist[i])
                    gama = gama - delta
        Ck = gama_old - gama  # all samples visited while growing this cluster
        Cklist = list(Ck)
        for i in range(len(Ck)):
            cluster[Cklist[i]] = k
        omega_list = omega_list - Ck
    return cluster
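
A quick sanity check on synthetic blob data (the make_blobs parameters are arbitrary choices of mine; it assumes the find_neighbor and DBSCAN_cluster functions above are in scope):

import numpy as np
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
labels = DBSCAN_cluster(data, eps=0.5, min_Pts=5)
print(set(labels))  # cluster ids starting at 0; -1 would mark noise points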

3. Code summary:

from sklearn.datasets import load_iris
import numpy as np
import random
import copy
import matplotlib.pyplot as plt

def PCA_DimRed(dataMat, topNfeat):  # PCA_DimRed -- PCA dimensionality reduction
    meanVals = np.mean(dataMat, axis=0)
    meanRemoved = dataMat - meanVals  # standardization (mean removal)
    covMat = np.cov(meanRemoved, rowvar=False)
    eigVals, eigVets = np.linalg.eig(np.mat(covMat))  # eigenvalues and eigenvectors of the covariance matrix
    eigValInd = np.argsort(eigVals)  # sort the eigenvalues in ascending order and return their indices
    eigValInd = eigValInd[:-(topNfeat + 1):-1]  # keep the indices of the K largest eigenvalues
    redEigVects = eigVets[:, eigValInd]  # the corresponding eigenvectors
    lowDDatMat = meanRemoved * redEigVects  # project the data into the new low-dimensional space
    # reconMat = (lowDDatMat * redEigVects.T) + meanVals  # reconstruct the original data
    return lowDDatMat

def find_neighbor(data, pos, eps):  # find every sample within distance eps of sample `pos`
    temp = np.sum((data - data[pos]) ** 2, axis=1) ** 0.5  # Euclidean distance to every sample
    N = np.argwhere(temp <= eps).flatten().tolist()
    return set(N)

def DBSCAN_cluster(data, eps, min_Pts):  # DBSCAN clustering: no need to specify the number of clusters, and it handles clusters of many shapes; K-means would not separate the elongated (strip-shaped) clusters in this experiment well
    k = -1
    neighbor_list = []  # neighborhood of each sample
    omega_list = []  # set of core objects
    gama = set([x for x in range(len(data))])  # initially mark all points as unvisited
    cluster = [-1 for _ in range(len(data))]  # cluster labels (-1 = noise)
    for i in range(len(data)):
        neighbor_list.append(find_neighbor(data, i, eps))
        if len(neighbor_list[-1]) >= min_Pts:
            omega_list.append(i)  # the sample is a core object
    omega_list = set(omega_list)  # convert to a set for convenience
    while len(omega_list) > 0:
        gama_old = copy.deepcopy(gama)
        j = random.choice(list(omega_list))  # randomly pick a core object
        k = k + 1
        Q = list()
        Q.append(j)
        gama.remove(j)
        while len(Q) > 0:
            q = Q[0]
            Q.remove(q)
            if len(neighbor_list[q]) >= min_Pts:
                delta = neighbor_list[q] & gama  # unvisited neighbours of q
                deltalist = list(delta)
                for i in range(len(delta)):
                    Q.append(deltalist[i])
                    gama = gama - delta
        Ck = gama_old - gama  # all samples visited while growing this cluster
        Cklist = list(Ck)
        for i in range(len(Ck)):
            cluster[Cklist[i]] = k
        omega_list = omega_list - Ck
    return cluster

if __name__ == "__main__":
    # 1. prepare the data
    x = load_iris().data
    y = load_iris().target

    # 2. PCA dimensionality reduction
    pro_data = PCA_DimRed(x, 2)

    # 3. DBSCAN clustering (the reduced data must be converted to a plain array here to work with find_neighbor)
    pro_array = np.array(pro_data)
    thecluster = DBSCAN_cluster(pro_array, eps=0.8, min_Pts=30)

    # 4. show the dimensionality reduction result
    print("Iris feature set before dimensionality reduction:")
    print(x)
    print("Iris feature set after dimensionality reduction:")
    print(pro_data)

    # 5. show the clustering result
    plt.figure()
    plt.scatter(pro_array[:, 0], pro_array[:, 1], c=thecluster)
    plt.show()
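
A small side note (this is an addition of mine, not part of the original script): DBSCAN_cluster picks core objects with random.choice, so the numbering of the clusters can differ between runs; seeding Python's random module before the call makes a run reproducible. Here pro_array and DBSCAN_cluster are the variables from the script above:

import random

random.seed(0)  # fix which core object is picked first, so the cluster ids are reproducible
thecluster = DBSCAN_cluster(pro_array, eps=0.8, min_Pts=30)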

Realize the effect:

1. Dimensionality reduction effect:

The feature set of the iris dataset before dimensionality reduction is a 150 × 4 array:

[console screenshot of the original 150 × 4 feature array]

The feature set after dimensionality reduction is a 150 × 2 array:

[console screenshot of the reduced 150 × 2 feature array]

2. Clustering effect:

[scatter plot of the PCA-reduced iris samples, coloured by DBSCAN cluster label]

It can be seen that DBSCAN cannot accurately cluster the iris samples into the three species from the PCA-reduced feature set: the samples of Iris versicolor and Iris virginica are close to each other and effectively lie in the same density region, which makes this experiment inaccurate.

However, it can also be seen that Iris setosa is well separated from the other two species, which shows that the method still works for clustering problems where the different classes are clearly separated.
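If you want to quantify this observation (an extra check of mine, not part of the original post), the adjusted Rand index from scikit-learn compares the DBSCAN labels with the true species labels; y and thecluster are the variables from the script above:

from sklearn.metrics import adjusted_rand_score

# y = true species labels, thecluster = DBSCAN result from the script above
print(adjusted_rand_score(y, thecluster))  # 1.0 means perfect agreement, values near 0 mean random labelling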

Closing remarks:

This article introduced the basic principles of two machine learning operations, PCA dimensionality reduction and DBSCAN clustering, and showed how to combine them for practical data processing.

DBSCAN on top of PCA may not be the best fit for the iris dataset in the sklearn library, but the combination can handle high-dimensional data and clusters of various shapes, so it remains a fairly complete clustering pipeline with broad application scenarios.

I hope everyone will actively apply this method and find more uses for it. Thank you all!

Reference book:

Zhou Zhihua. Machine Learning [M]. Beijing: Tsinghua University Press, 2016.

Reference articles:

Six common clustering algorithms: http://t.csdn.cn/Urhn9

Two implementations of Python PCA (Principal Component Analysis) dimensionality reduction: http://t.csdn.cn/NlAeU

Python implementation of DBSCAN clustering algorithm: http://t.csdn.cn/lkFhF

PCA dimensionality reduction principle operation steps and advantages and disadvantages: http://t.csdn.cn/QiEJM

 

Well, that is all the content. I hope you will follow, like, and bookmark; it is a great help to me. Thank you all!


Alright, this is Mr. Kamen Black, wishing you and your families good health. See you next time! yo-yo~~


 

 
