Introduction:
1. PCA dimensionality reduction:
(1) Concept explanation:
PCA (Principal Component Analysis) is a dimensionality reduction method. It extracts the principal components of the features, compressing high-dimensional data into a low-dimensional space while retaining the main features.
The low-dimensional data produced by PCA is the projection of the original high-dimensional data onto a lower-dimensional subspace (loosely, a "plane": a three-dimensional space plays this role relative to a four-dimensional one). The reduced data reflects most, but not all, of the information in the original data, so judge whether the loss is acceptable for the task at hand.
(2) Implementation steps:
PCA is mainly implemented through 6 steps:
1. Standardization (center the original data by removing the mean; if the features are on different scales, also divide each feature by its standard deviation)
Specifically:
$$E_{norm} = E - E_{mean}$$
where $E$ is the original matrix, $E_{mean}$ is the mean matrix (each row holds the per-feature means), and $E_{norm}$ is the standardized matrix.
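2. Covariance (calculate the covariance matrix of the standardized data set)
Specifically:
$$Cov = \frac{1}{m-1} E_{norm}^{T} E_{norm}$$
where $Cov$ is the covariance matrix, $m$ is the number of samples, and $E_{norm}$ is the standardized matrix. (Some texts divide by $m$ instead of $m-1$; the eigenvectors are the same either way.)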
3. Eigenvalues (calculate the eigenvalues and eigenvectors of the covariance matrix)
Specifically:
Suppose a real number $\lambda$ and a nonzero n-dimensional column vector $X$ (n is the number of columns of the original matrix $E$) satisfy
$$Cov \, X = \lambda X$$
Then $\lambda$ is an eigenvalue of $Cov$ and $X$ is the corresponding eigenvector, where $Cov$ is the covariance matrix.
4. K features (keep the K largest eigenvalues, where K is the dimension we want after reduction)
Specifically:
Sort the eigenvalues in descending order and keep the first K for the following steps.
5. K vectors (find the eigenvectors corresponding to these K eigenvalues)
Specifically:
Obtain the eigenvector corresponding to each retained eigenvalue from the formula in step 3.
6. Dimensionality reduction (multiply the standardized data set by the matrix whose columns are the K eigenvectors to obtain the reduced data)
Specifically:
$$E_{pca} = E_{norm} \, [X_1, X_2, X_3, \ldots, X_k]$$
where $E_{pca}$ is the final PCA-reduced matrix, $E_{norm}$ is the standardized matrix, and $X_1, X_2, X_3, \ldots, X_k$ are the eigenvectors corresponding to the K largest eigenvalues.
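The six steps can be traced on a small matrix. The following is only a minimal sketch: the toy values, the variable names, and the use of np.linalg.eigh (which suits symmetric matrices) are assumptions made for illustration, and the covariance uses the 1/(m-1) convention.

```python
import numpy as np

# Toy data: 6 samples (rows), 3 features (columns); values are made up.
E = np.array([
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 1.9],
    [2.2, 2.9, 0.4],
    [1.9, 2.2, 0.8],
    [3.1, 3.0, 0.1],
    [2.3, 2.7, 0.6],
])
K = 2  # target dimension

# Step 1: remove the per-feature mean.
E_norm = E - E.mean(axis=0)

# Step 2: covariance matrix (n x n), using the 1/(m-1) convention.
m = E.shape[0]
cov = E_norm.T @ E_norm / (m - 1)

# Step 3: eigenvalues and eigenvectors of the symmetric covariance matrix.
eig_vals, eig_vecs = np.linalg.eigh(cov)

# Steps 4 and 5: indices of the K largest eigenvalues and their eigenvectors.
top = np.argsort(eig_vals)[::-1][:K]
X_k = eig_vecs[:, top]

# Step 6: project the centered data onto the K eigenvectors.
E_pca = E_norm @ X_k

print(E_pca.shape)  # (6, 2)
```

A useful sanity check is that the variance of each column of E_pca equals the eigenvalue of the eigenvector it was projected onto.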
(3) Pros and cons:
Advantages:
1. The principal components are mutually orthogonal, which removes the linear correlation between the original features.
2. The computation is straightforward, so PCA is easy to implement.
3. It reduces dimensionality and simplifies subsequent computation while retaining most of the information.
Disadvantages:
1. The principal components are linear combinations of the original features, so their meaning is often vague and hard to interpret.
2. PCA selects as principal components the directions along which the data has the largest variance, so features with small variance are more likely to be dropped, and important information may be lost with them.
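The first advantage and the second disadvantage can both be checked numerically. The sketch below uses synthetic data (an assumption, not the article's data set): it shows that the covariances between principal components vanish, and it prints the share of variance carried by each component, which is what gets discarded when low-variance components are dropped.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
# Synthetic data: two strongly correlated features plus one low-variance feature.
a = rng.normal(size=200)
data = np.column_stack([
    a,
    2.0 * a + 0.1 * rng.normal(size=200),
    0.05 * rng.normal(size=200),
])

centered = data - data.mean(axis=0)
eig_vals, eig_vecs = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eig_vals)[::-1]           # largest eigenvalue first
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

scores = centered @ eig_vecs                 # all principal components
score_cov = np.cov(scores, rowvar=False)

# Advantage 1: off-diagonal covariances of the components are numerically zero.
off_diag = score_cov - np.diag(np.diag(score_cov))
print("max |off-diagonal covariance|:", np.abs(off_diag).max())

# Disadvantage 2: share of total variance per component; dropping the
# trailing components discards whatever information lives in them.
ratio = eig_vals / eig_vals.sum()
print("explained variance ratio:", np.round(ratio, 4))
```

If a low-variance direction happens to carry the signal that matters for the task, a cut based on this ratio will remove it, which is why the ratio alone should not decide K.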
2. DBSCAN clustering:
(1) Concept explanation:
Density clustering, also known as density-based clustering, assumes that the cluster structure can be determined by how tightly the samples are distributed. A density clustering algorithm typically examines the connectivity between samples in terms of sample density, and keeps expanding clusters through connectable samples to obtain the final result.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is one such algorithm. It describes the tightness of the sample distribution with a pair of "neighborhood" parameters (ε, MinPts).
(2) Algorithm principle:
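Given a data set D = {x1, x2, ..., xm}, define the following concepts:
ε-neighborhood: for a sample xj in D, the set of samples in D whose distance from xj is at most ε.
Core object: a sample whose ε-neighborhood contains at least MinPts samples.
Directly density-reachable: xj is directly density-reachable from xi if xi is a core object and xj lies in the ε-neighborhood of xi.
Density-reachable: xj is density-reachable from xi if there is a chain of samples p1 = xi, p2, ..., pn = xj in which each p(t+1) is directly density-reachable from p(t).
Density-connected: xi and xj are density-connected if both are density-reachable from some sample xk.
A cluster is a maximal set of density-connected samples; samples that belong to no cluster are noise.
With these concepts in place, the algorithm proceeds as follows:
1. Compute the ε-neighborhood of every sample and collect all core objects.
2. Pick an unprocessed core object at random and start a new cluster from it.
3. Keep adding the unvisited samples in the ε-neighborhoods of the cluster's core objects until the cluster stops growing.
4. Remove the cluster's core objects from the set of core objects and repeat from step 2 until none remain; samples never assigned to a cluster are noise.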
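To make the neighborhood parameters (ε, MinPts) concrete, the sketch below computes the ε-neighborhood of each point in a tiny hand-made data set and flags the core objects, meaning points whose neighborhood holds at least MinPts samples. The points, the helper name eps_neighborhood, and the parameter values are assumptions chosen for illustration; a point counts as a member of its own neighborhood.

```python
import math

# Six 2-D points chosen by hand for illustration (not the article's data).
D = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0), (1.5, 1.5)]
eps, min_pts = 1.1, 3

def eps_neighborhood(points, i, eps):
    """Indices of all points within distance eps of points[i], including i."""
    return {j for j, p in enumerate(points) if math.dist(points[i], p) <= eps}

neighborhoods = [eps_neighborhood(D, i, eps) for i in range(len(D))]
core = {i for i, nb in enumerate(neighborhoods) if len(nb) >= min_pts}

for i, nb in enumerate(neighborhoods):
    print(i, sorted(nb), "core" if i in core else "not core")
```

Here points 0 to 3 are core objects, point 5 is not a core object but lies in the neighborhood of core object 3 (so it would join that cluster as a border point), and point 4 has only itself as a neighbor and would end up as noise.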
(3) Pros and cons:
Advantages:
1. It can identify clusters of arbitrary shape.
2. It treats sufficiently dense regions as clusters and tolerates noise: samples in sparse regions are labeled as noise instead of being forced into a cluster.
3. The number of clusters does not need to be specified; the algorithm discovers it on its own.
Disadvantages:
1. The minimum number of points (MinPts) and the radius (ε) must be specified. (Even so, this leaves more freedom than clustering algorithms that require the number of clusters.)
2. The result is very sensitive to MinPts and ε, so these parameters usually have to be tuned over several runs.
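One common way to narrow down ε, which is not part of this article's method, is the k-distance heuristic: sort every sample's distance to its k-th nearest neighbor and look for the point where the curve bends sharply upward. The sketch below is only an illustration under assumptions: the helper k_distances, the synthetic data, and the choice k=4 are all made up here.

```python
import numpy as np

def k_distances(data, k):
    """Sorted distance from each sample to its k-th nearest neighbor (self excluded)."""
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))   # full pairwise distance matrix
    dist.sort(axis=1)                          # column 0 is the distance to itself (0)
    return np.sort(dist[:, k])

rng = np.random.default_rng(0)
# Synthetic data: two tight blobs plus a few scattered noise points.
blob_a = rng.normal(loc=(0, 0), scale=0.2, size=(50, 2))
blob_b = rng.normal(loc=(5, 5), scale=0.2, size=(50, 2))
noise = rng.uniform(low=-3, high=8, size=(10, 2))
data = np.vstack([blob_a, blob_b, noise])

kd = k_distances(data, k=4)
# Small values come from samples inside dense regions, large ones from noise.
print("median k-distance:", round(float(np.median(kd)), 3))
print("largest k-distance:", round(float(kd[-1]), 3))
```

A value of ε slightly above the flat part of the sorted curve keeps the dense samples connected while leaving the scattered ones as noise; the final choice still needs checking against the data.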
Code:
0. Data preparation:
Here we use the iris dataset from the sklearn library (sklearn.datasets.load_iris) as test data. It contains 150 samples, each with four features (sepal length, sepal width, petal length, petal width) and a class label (0, 1, 2 for Iris setosa, Iris versicolor, and Iris virginica).
First, install the library with pip. Note that the package name is scikit-learn, not sklearn, just as OpenCV is imported as cv2 but installed as opencv-python.
pip install scikit-learn
Then load the data set: x holds the iris features and y holds the iris labels (both are arrays of type numpy.ndarray).
from sklearn.datasets import load_iris
x = load_iris().data
y = load_iris().target
1. PCA dimensionality reduction:
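import numpy as np

def PCA_DimRed(dataMat, topNfeat):  # PCA dimensionality reduction
    dataMat = np.asarray(dataMat, dtype=float)
    meanVals = np.mean(dataMat, axis=0)
    meanRemoved = dataMat - meanVals  # standardize (remove the mean)
    covMat = np.cov(meanRemoved, rowvar=False)  # covariance matrix of the features
    # Eigenvalues and eigenvectors; eigh is used because covMat is symmetric
    # (np.mat is avoided because it is no longer available in NumPy 2.x)
    eigVals, eigVets = np.linalg.eigh(covMat)
    eigValInd = np.argsort(eigVals)  # indices that sort the eigenvalues in ascending order
    eigValInd = eigValInd[:-(topNfeat + 1):-1]  # keep the K largest eigenvalues
    redEigVects = eigVets[:, eigValInd]  # the corresponding eigenvectors
    lowDDatMat = meanRemoved @ redEigVects  # project the data into the low-dimensional space
    # reconMat = (lowDDatMat @ redEigVects.T) + meanVals  # reconstruct the original data
    return lowDDatMat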
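The commented-out reconstruction line hints at how to measure what the reduction loses. The sketch below is a self-contained illustration, assuming a hypothetical helper pca_project that also returns the eigenvectors and the mean, and using synthetic data that mostly varies along two directions.

```python
import numpy as np

def pca_project(data, k):
    """Hypothetical helper: returns (low-dim data, eigenvectors, mean) for k components."""
    mean = data.mean(axis=0)
    centered = data - mean
    eig_vals, eig_vecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    top = np.argsort(eig_vals)[::-1][:k]
    vecs = eig_vecs[:, top]
    return centered @ vecs, vecs, mean

rng = np.random.default_rng(1)
# Synthetic 4-feature data driven by two latent directions plus small noise.
latent = rng.normal(size=(100, 2))
mixing = np.array([[1.0, 0.5, 0.0, 2.0],
                   [0.0, 1.0, 1.5, 0.5]])
data = latent @ mixing + 0.01 * rng.normal(size=(100, 4))

low, vecs, mean = pca_project(data, k=2)
recon = low @ vecs.T + mean            # map back to the original 4-D space

# Relative reconstruction error: near 0 means little information was lost.
rel_err = np.linalg.norm(data - recon) / np.linalg.norm(data - mean)
print(low.shape, recon.shape, round(float(rel_err), 4))
```

With all four components kept the reconstruction is exact up to rounding, so the error comes only from the components that were dropped.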
2. DBSCAN clustering:
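import numpy as np
import random
import copy

def find_neighbor(data, pos, eps):  # find the eps-neighborhood of sample pos
    temp = np.sum((data - data[pos]) ** 2, axis=1) ** 0.5  # Euclidean distances to sample pos
    N = np.argwhere(temp <= eps).flatten().tolist()
    return set(N)

def DBSCAN_cluster(mat, eps, min_Pts):
    # DBSCAN clustering: the number of clusters need not be specified,
    # and clusters of many different shapes can be handled
    k = -1
    neighbor_list = []  # neighborhood of every sample
    omega_list = []  # set of core objects
    gama = set([x for x in range(len(mat))])  # initially mark every sample as unvisited
    cluster = [-1 for _ in range(len(mat))]  # cluster label per sample (-1 means noise)
    for i in range(len(mat)):
        neighbor_list.append(find_neighbor(mat, i, eps))
        if len(neighbor_list[-1]) >= min_Pts:
            omega_list.append(i)  # add the sample to the set of core objects
    omega_list = set(omega_list)  # convert to a set for easier set operations
    while len(omega_list) > 0:
        gama_old = copy.deepcopy(gama)
        j = random.choice(list(omega_list))  # pick a core object at random
        k = k + 1
        Q = list()
        Q.append(j)
        gama.remove(j)
        while len(Q) > 0:
            q = Q.pop(0)
            if len(neighbor_list[q]) >= min_Pts:  # only core objects expand the cluster
                delta = neighbor_list[q] & gama  # unvisited samples in the neighborhood
                Q.extend(delta)
                gama = gama - delta
        Ck = gama_old - gama  # samples visited in this round form cluster k
        for i in Ck:
            cluster[i] = k
        omega_list = omega_list - Ck
    return cluster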
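A quick way to see what the returned labels mean is to run the same idea on a few hand-picked points. The sketch below is a condensed restatement under assumptions, not the article's function: the name dbscan, the deterministic seed order (instead of random.choice), and the nine toy points are all chosen here for illustration.

```python
import numpy as np
from collections import deque

def dbscan(data, eps, min_pts):
    """Condensed DBSCAN sketch; returns one label per sample, -1 means noise."""
    n = len(data)
    dist = np.sqrt(((data[:, None, :] - data[None, :, :]) ** 2).sum(axis=2))
    neighbors = [set(np.flatnonzero(dist[i] <= eps).tolist()) for i in range(n)]
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    labels = [-1] * n
    unvisited = set(range(n))
    k = -1
    for seed in sorted(core):              # fixed order keeps the output repeatable
        if seed not in unvisited:
            continue                       # already absorbed into an earlier cluster
        k += 1
        queue = deque([seed])
        unvisited.discard(seed)
        labels[seed] = k
        while queue:
            q = queue.popleft()
            if q in core:                  # only core objects expand the cluster
                reachable = neighbors[q] & unvisited
                for p in reachable:
                    labels[p] = k
                queue.extend(reachable)
                unvisited -= reachable
    return labels

points = np.array([[0.0, 0.0], [0.0, 0.3], [0.3, 0.0], [0.3, 0.3],
                   [5.0, 5.0], [5.0, 5.3], [5.3, 5.0], [5.3, 5.3],
                   [2.5, 9.0]])            # the last point is isolated
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The two squares become clusters 0 and 1 without the number of clusters being given, and the isolated point keeps the label -1, which is how noise shows up in the output.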
3. Code summary:
from sklearn.datasets import load_iris
import numpy as np
import random
import copy
import matplotlib.pyplot as plt
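
def PCA_DimRed(dataMat, topNfeat):  # PCA dimensionality reduction
    dataMat = np.asarray(dataMat, dtype=float)
    meanVals = np.mean(dataMat, axis=0)
    meanRemoved = dataMat - meanVals  # standardize (remove the mean)
    covMat = np.cov(meanRemoved, rowvar=False)  # covariance matrix of the features
    # Eigenvalues and eigenvectors; eigh is used because covMat is symmetric
    # (np.mat is avoided because it is no longer available in NumPy 2.x)
    eigVals, eigVets = np.linalg.eigh(covMat)
    eigValInd = np.argsort(eigVals)  # indices that sort the eigenvalues in ascending order
    eigValInd = eigValInd[:-(topNfeat + 1):-1]  # keep the K largest eigenvalues
    redEigVects = eigVets[:, eigValInd]  # the corresponding eigenvectors
    lowDDatMat = meanRemoved @ redEigVects  # project the data into the low-dimensional space
    # reconMat = (lowDDatMat @ redEigVects.T) + meanVals  # reconstruct the original data
    return lowDDatMat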

def find_neighbor(data, pos, eps):  # find the eps-neighborhood of sample pos
    temp = np.sum((data - data[pos]) ** 2, axis=1) ** 0.5  # Euclidean distances to sample pos
    N = np.argwhere(temp <= eps).flatten().tolist()
    return set(N)

def DBSCAN_cluster(data, eps, min_Pts):
    # DBSCAN clustering: the number of clusters need not be specified, and
    # clusters of many different shapes can be handled. K-means would not
    # give good results on the strip-shaped clusters in this experiment.
    k = -1
    neighbor_list = []  # neighborhood of every sample
    omega_list = []  # set of core objects
    gama = set([x for x in range(len(data))])  # initially mark every sample as unvisited
    cluster = [-1 for _ in range(len(data))]  # cluster label per sample (-1 means noise)
    for i in range(len(data)):
        neighbor_list.append(find_neighbor(data, i, eps))
        if len(neighbor_list[-1]) >= min_Pts:
            omega_list.append(i)  # add the sample to the set of core objects
    omega_list = set(omega_list)  # convert to a set for easier set operations
    while len(omega_list) > 0:
        gama_old = copy.deepcopy(gama)
        j = random.choice(list(omega_list))  # pick a core object at random
        k = k + 1
        Q = list()
        Q.append(j)
        gama.remove(j)
        while len(Q) > 0:
            q = Q.pop(0)
            if len(neighbor_list[q]) >= min_Pts:  # only core objects expand the cluster
                delta = neighbor_list[q] & gama  # unvisited samples in the neighborhood
                Q.extend(delta)
                gama = gama - delta
        Ck = gama_old - gama  # samples visited in this round form cluster k
        for i in Ck:
            cluster[i] = k
        omega_list = omega_list - Ck
    return cluster

if __name__ == "__main__":
    # 1. Prepare the data
    iris = load_iris()
    x = iris.data
    y = iris.target
    # 2. PCA dimensionality reduction
    pro_data = PCA_DimRed(x, 2)
    # 3. DBSCAN clustering (the data must be an array so that find_neighbor works)
    pro_array = np.array(pro_data)
    thecluster = DBSCAN_cluster(pro_array, eps=0.8, min_Pts=30)
    # 4. Show the dimensionality reduction result
    print("Iris feature set before dimensionality reduction:")
    print(x)
    print("Iris feature set after dimensionality reduction:")
    print(pro_data)
    # 5. Show the clustering result
    plt.figure()
    plt.scatter(pro_array[:, 0], pro_array[:, 1], c=thecluster)
    plt.show()
Results:
1. Dimensionality reduction effect:
The feature set of the iris data set before dimensionality reduction:
The feature set of the iris data set after dimensionality reduction:
2. Clustering effect:
It can be seen that DBSCAN cannot accurately separate all three iris classes from the PCA-reduced features. The samples of Iris versicolor and Iris virginica lie close together and effectively occupy the same dense region, so they tend to be merged into one cluster.
However, Iris setosa is clearly separated from the other two classes, which indicates that the method still works well when there is a large gap between samples of different classes.
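The visual impression can be backed by a count of how the true classes are distributed over the clusters. The sketch below is a minimal illustration with a hypothetical helper contingency and made-up label lists; the numbers are not the article's actual output.

```python
from collections import Counter

def contingency(true_labels, cluster_labels):
    """Count how many samples of each true class fall into each cluster."""
    return Counter(zip(true_labels, cluster_labels))

# Made-up labels for illustration only: class 0 sits alone in cluster 0,
# while classes 1 and 2 share cluster 1 and one sample is noise (-1).
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, -1]

table = contingency(y_true, y_pred)
for (cls, clu), count in sorted(table.items()):
    print(f"class {cls} -> cluster {clu}: {count}")
```

With the variables from the code summary, the same call would be contingency(y, thecluster); a class that lands almost entirely in one cluster of its own is well separated, while two classes sharing a cluster have been merged.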
Closing remarks:
This article introduced the basic principles of two machine learning techniques, PCA dimensionality reduction and DBSCAN clustering, and showed how to combine them to process real data.
DBSCAN clustering on PCA-reduced data may not suit the iris data set in the sklearn library very well, but the approach can handle high-dimensional data and clusters of various shapes, so it is a fairly complete clustering workflow with a broad range of applications.
I hope you will try this method on your own data and find more uses for it. Thank you, everyone!
Reference books:
Zhou Zhihua. Machine Learning [M]. Beijing: Tsinghua University Press, 2016.01
Reference article:
Six common clustering algorithms: http://t.csdn.cn/Urhn9
Two implementations of Python PCA (Principal Component Analysis) dimensionality reduction: http://t.csdn.cn/NlAeU
Python implementation of DBSCAN clustering algorithm: http://t.csdn.cn/lkFhF
PCA dimensionality reduction principle operation steps and advantages and disadvantages: http://t.csdn.cn/QiEJM
Alright, this is Mr. Kamen Black. Best wishes to you and your family, and see you next time!