Unsupervised learning
Compared with the supervised learning algorithms covered earlier, which learn from both features and target values, unsupervised learning has no target values, only features. The model must learn from these features alone and then group the samples accordingly.
Unsupervised learning classification
There are two main categories:
Clustering: K-means, mean-shift clustering, density-based clustering (DBSCAN), expectation-maximization (EM) clustering with Gaussian mixture models (GMM), agglomerative hierarchical clustering, and graph community detection (Graph Community Detection)
Dimensionality reduction: principal component analysis (PCA)
A few examples of clustering
1. An advertising platform wants to divide the American population into groups by demographic characteristics and buying habits, so that advertisers can reach their target customers with relevant ads.
2. A home-rental platform wants to group its listings into neighborhoods so that users can browse them more easily.
What do these examples have in common? They are unsupervised learning: learning from unlabeled data in order to discover groups.
Today, I will mainly summarize the principle of the K-means algorithm.
K-means algorithm principle
K is a hyperparameter: the number of clusters we want to form. If the requirements already specify how many groups there should be, then K is simply that number; if K is not known in advance, it can be chosen with the model-selection and hyperparameter-tuning techniques covered in machine learning introductory research (7)-Model selection and tuning.
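When K is not fixed in advance, one common heuristic is the elbow method: run K-means for several candidate values of K and watch where the inertia (sum of squared distances to the nearest center) stops dropping sharply. A minimal sketch using scikit-learn's `KMeans` on made-up synthetic data (the three group locations are arbitrary choices for the demo):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three well-separated synthetic groups of 2-D points
X = np.vstack([rng.normal(loc, 0.3, size=(20, 2))
               for loc in ([0, 0], [4, 4], [0, 4])])

# inertia_ drops sharply until k reaches the true number of groups,
# then flattens out -- the "elbow" in this curve suggests a good K
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
    print(k, round(km.inertia_, 1))
```

Here the drop from K=2 to K=3 is large and the curve flattens afterwards, pointing at K=3.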
The following describes the principle of K-means with pictures and text.
Suppose we have a set of points that should be divided into two categories. A simple example illustrates the process:
In this case the value of K is 2.
1. Randomly pick 2 points and mark them with different colors, pink and purple:
2. Compute the distance from every other point to these two random points, and give each point the color of whichever random point is closer, turning the figure above into the figure below:
3. The two colors now divide the area into two parts. Find the center of each part, where the center is the mean of all points in that part, giving the following figure:
4. As the right-hand figure above shows, the computed centers (indicated by arrows) and the initial random points (drawn slightly larger) are not the same points, so the computed centers are used as the new "random" points, and steps 2 and 3 are repeated until the computed centers coincide with (or come very close to) the previous ones.
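The four steps above can be sketched in plain NumPy (a minimal illustration, not sklearn's implementation; the function name and the two-blob data are made up for the demo):

```python
import numpy as np

def kmeans(points, k=2, max_iter=100, seed=0):
    """Plain K-means: pick k random points as initial centers, then
    alternate assignment and center updates until the centers stop moving."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # step 2: assign each point to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: move each center to the mean of its assigned points
        new_centers = np.array([points[labels == j].mean(axis=0)
                                for j in range(k)])
        # step 4: stop once the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# two well-separated blobs of points
pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
centers, labels = kmeans(pts, k=2)
```

On this data the two recovered centers converge to the means of the two blobs regardless of which points are picked initially.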
API in sklearn
An API is provided in sklearn to implement the K-means algorithm.
sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10,
                       max_iter=300, tol=1e-4, precompute_distances='auto',
                       verbose=0, random_state=None, copy_x=True,
                       n_jobs=None, algorithm='auto')
The parameters are as follows:
Field | Meaning |
--- | --- |
n_clusters | The hyperparameter K: the number of clusters, i.e. the number of categories. Default 8. |
init | How the initial cluster centers are obtained. Default 'k-means++', which speeds up convergence during iteration. Also accepts 'random' (pick centers at random) or an ndarray of shape (n_clusters, n_features) giving the initial centroids. |
n_init | Number of times the algorithm is run with different initial cluster centers. Default 10. |
max_iter | Maximum number of iterations. Default 300. |
tol | Convergence tolerance for the iterations. |
precompute_distances | Whether to precompute distances (faster, but uses more memory). Default 'auto': do not precompute when n_samples * n_clusters > 12 million (precomputing would take roughly 100 MB). True: always precompute. False: never precompute. |
verbose | Verbosity level. |
random_state | Random seed used for center initialization. |
copy_x | Whether to copy the input data. Default True: the original data is not modified. |
n_jobs | Number of processes used for the computation. |
algorithm | Which K-means implementation to use: 'full' (the classic EM-style algorithm), 'elkan' (a variant that uses the triangle inequality, faster on dense data), or 'auto' (choose automatically). |
The fitted estimator exposes these attributes:
Attribute | Meaning |
--- | --- |
cluster_centers_ | Array of shape [n_clusters, n_features]: the coordinates of the cluster centers. |
labels_ | The cluster index assigned to each point. |
inertia_ | Sum of squared distances from each point to the center of its cluster. |
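A short usage sketch of this API on a toy dataset (the six points are made up; two obvious groups):

```python
import numpy as np
from sklearn.cluster import KMeans

# two obvious groups of 2-D points
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(X)

print(km.labels_)           # cluster index of each sample
print(km.cluster_centers_)  # [n_clusters, n_features] center coordinates
print(km.inertia_)          # sum of squared distances to the nearest center
```

The two centers land at (1, 2) and (10, 2), the means of the two groups.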
Example
See machine learning introductory research (17)-Instacart Market user classification for a worked example.
Dimensionality reduction
Dimensionality reduction is the process of reducing the number of random variables (features) under certain constraints to obtain a set of uncorrelated principal variables. Here it means reducing the number of features: the sample data is a two-dimensional array whose rows are samples and whose columns are features, so dimensionality reduction means reducing the number of columns of that array.
The usual way to reduce the number of features is to remove correlated features. So what are correlated features?
Correlated features are features that carry similar information. For example, when a region's relative humidity and rainfall are both used as features for a prediction, they are correlated: the information they carry largely overlaps.
When there are many correlated features, there is a lot of redundant information, which affects the final prediction result.
There are two main approaches to dimensionality reduction:
1) Feature selection
Pick out the main features from the existing ones.
2) Principal component analysis (PCA)
PCA
PCA combines possibly correlated high-dimensional variables into linearly uncorrelated low-dimensional variables called principal components. The new low-dimensional dataset preserves as much of the variance of the original data as possible.
PCA is the process of transforming high-dimensional data into low-dimensional data; along the way some of the original data may be discarded and new variables created. In other words, it compresses the dimensionality of the data at the cost of losing a small amount of information, reducing the dimensionality and complexity of the original data as much as possible.
The whole PCA computation boils down to a matrix operation that yields the principal components.
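That matrix operation can be sketched with a plain SVD in NumPy (an illustrative implementation, not sklearn's; the signs of the components may differ from sklearn's output, and the function name is chosen for the demo):

```python
import numpy as np

def pca(X, n_components):
    """PCA via SVD: center the data, take the top right singular vectors
    as the principal directions, and project the data onto them."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)            # remove the per-feature mean
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]             # directions of maximum variance
    return X_centered @ components.T           # low-dimensional coordinates

X = [[2, 8, 5, 1], [9, 7, 4, 2], [5, 8, 2, 7]]
print(pca(X, 2))
```

The projected columns come out ordered by decreasing variance, since the singular values are sorted in descending order.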
API in sklearn
The API in sklearn is as follows:
sklearn.decomposition.PCA(n_components=None, copy=True, whiten=False,
                          svd_solver='auto', tol=0.0, iterated_power='auto',
                          random_state=None)
The parameters are as follows:
Parameter | Meaning |
--- | --- |
n_components | A float keeps that fraction of the variance; an integer reduces the data to that many features. |
copy | Whether to copy the input data. Default True: the original data is not modified. |
whiten | Whether to normalize the data after the reduction. Default False; normalization is usually not needed. |
svd_solver | The SVD method to use: 'auto', 'full', 'arpack', or 'randomized'. 'auto' chooses among the other three. 'full' runs a traditional full SVD, implemented via scipy. 'arpack' uses scipy's sparse SVD directly; its use cases are similar to 'randomized', the difference being that 'randomized' uses scikit-learn's own SVD implementation. 'randomized' suits large, high-dimensional data where the requested number of components is relatively small. |
The fitted model exposes these attributes:
Attribute | Meaning |
--- | --- |
explained_variance_ | The variance of each principal component after the reduction; the larger the variance, the more important the component. |
explained_variance_ratio_ | The fraction of the total variance carried by each principal component; the larger the fraction, the more important the component. |
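A small sketch of the fractional n_components mode together with explained_variance_ratio_ (the data is synthetic; feature 4 is deliberately constructed to nearly duplicate feature 0, so one dimension is redundant):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 4] = X[:, 0] + 0.01 * rng.normal(size=100)  # feature 4 ~ feature 0

# keep just enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_new = pca.fit_transform(X)
print(X_new.shape[1])                  # number of components kept
print(pca.explained_variance_ratio_)   # variance fraction per component
```

Because one of the five features is redundant, fewer than five components suffice to reach the 95% threshold.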
A simple example
The original data is a 3x4 two-dimensional array; after dimensionality reduction it becomes a 3x3 array.
from sklearn.decomposition import PCA

def pca():
    # 1) Instantiate a PCA instance
    data = [[2, 8, 5, 1], [9, 7, 4, 2], [5, 8, 2, 7]]
    print("Original data:")
    print(data)
    pca = PCA(n_components=3)
    # 2) Call fit_transform
    data_new = pca.fit_transform(data)
    print("Data after dimensionality reduction:")
    print(data_new)
    return None
After running:
Original data:
[[2, 8, 5, 1], [9, 7, 4, 2], [5, 8, 2, 7]]
Data after dimensionality reduction:
[[ 4.26857026e+00 -4.73024706e-01 2.57350359e-16]
[-1.68373209e+00 3.59761365e+00 2.57350359e-16]
[-2.58483817e+00 -3.12458895e+00 2.57350359e-16]]
Note that the third column is on the order of 1e-16, i.e. effectively zero: after centering, 3 samples span at most 2 dimensions, so only 2 meaningful components exist.