Introductory Research on Machine Learning (16)—K-means

Table of Contents

 

Unsupervised learning

Unsupervised learning classification

A few examples in clustering

K-means algorithm principle

API in sklearn

Example

Dimensionality reduction

PCA

API in sklearn

Simple example


Unsupervised learning

The supervised learning algorithms covered earlier all learn from data that has both features and target values. Unsupervised learning, in contrast, has no target values, only features: the model must be trained from these features alone and then used to group the data or predict categories.

Unsupervised learning classification

There are two main categories:

Clustering: K-means, mean-shift clustering, density-based clustering (DBSCAN), expectation-maximization (EM) clustering with Gaussian mixture models (GMM), agglomerative hierarchical clustering, graph community detection

Dimensionality reduction: PCA

A few examples in clustering

1. An advertising platform needs to divide the US population into different groups according to similar demographic characteristics and buying habits, so that advertisers can reach their target customers through relevant advertisements.

2. A home-rental platform needs to group its housing listings into different communities so that users can browse those listings more easily.

How can the examples above be summed up? They are unsupervised learning: learning from unlabeled data in order to find patterns and group it.

Today, I will mainly summarize the principle of the K-means algorithm.

K-means algorithm principle

K is a hyperparameter: the number of categories we want to divide the data into. If the requirements already specify how many categories are needed, then K is simply that number; if K is not known in advance, it can be chosen with model selection and hyperparameter tuning, as described in Introductory Research on Machine Learning (7) - Model Selection and Tuning.
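
When K is not fixed in advance, one common heuristic (not part of the series entry referenced above, added here only as an illustration) is the "elbow" method: run K-means for several values of K and compare the resulting inertia_, the within-cluster sum of squared distances described later in this post. A minimal sketch, using toy data generated with make_blobs:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data assumed only for illustration; in practice use your own feature matrix.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Run K-means for several candidate values of K and record the inertia
# (the sum of squared distances of samples to their nearest cluster center).
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, model.inertia_)

# The K where the inertia stops dropping sharply (the "elbow") is a common choice.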

The following mainly describes the principle of K-means through pictures and texts:

Suppose there are several points that should be divided into two categories; the process is explained with a few simple steps:

At this time, the value of K is 2

1. Randomly pick 2 points and mark them with different colors, pink and purple:

2. Compute the distance from every other point to these two random points and compare the two distances. Each point is marked with the color of whichever random point it is closer to, and the figure above becomes the figure below:

3. Based on the two colors, the area is divided into two parts, and the center point of each part is found. The center point is simply the mean of all the points in that part, giving the following figure:

4. As the right-hand figure above shows, the computed center points (indicated by the arrows) and the initial random points (drawn slightly larger) are not the same, so the computed center points are used as the new random points, and steps 2 and 3 are repeated until the computed center points coincide with, or are very close to, the previous ones. A minimal code sketch of this loop is given below.
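
To make the four steps above concrete, here is a minimal NumPy sketch of the same loop: pick random initial centers, assign each point to its nearest center, recompute the centers as the means of their points, and repeat until the centers stop moving. It is only a simplified illustration (it does not handle empty clusters), not sklearn's implementation.

import numpy as np

def simple_kmeans(points, k=2, max_iter=300, tol=1e-4, seed=0):
    # points: array of shape (n_samples, n_features)
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k points as the initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of the points assigned to it.
        new_centers = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        # Step 4: stop when the centers barely move; otherwise repeat steps 2 and 3.
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return centers, labels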

API in sklearn

An API is provided in sklearn to implement the K-means algorithm.

sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10,
                 max_iter=300, tol=1e-4, precompute_distances='auto',
                 verbose=0, random_state=None, copy_x=True,
                 n_jobs=None, algorithm='auto')

The parameters are as follows:

n_clusters: hyperparameter K, the number of clusters, i.e. the number of categories. Default 8.

init: how the initial cluster centers are obtained. Default 'k-means++', which speeds up convergence during the iterations. Other options:

    'random': choose the initial centers at random

    ndarray of shape (n_clusters, n_features): the initial centroids are given directly

n_init: number of times the algorithm is run with different initial cluster centers. Default 10.

max_iter: maximum number of iterations.

tol: convergence tolerance for the iterations.

precompute_distances: whether to precompute distances (faster, but uses more memory).

    Default 'auto': distances are not precomputed if number of samples x number of clusters > 12 million (roughly 100 MB of overhead).

    True: always precompute distances

    False: never precompute distances

verbose: verbosity mode.

random_state: random state used to initialize the centers.

copy_x: whether to copy the original data. Default True: the original data is not modified.

n_jobs: number of processes used for the computation.

algorithm: which K-means implementation to use.

    'auto': chooses between 'full' and 'elkan' automatically

    'full': the classical EM-style implementation

    'elkan': a variant that uses the triangle inequality to speed up the computation (suited to dense data)

Returned parameters:

cluster_centers_: array of shape [n_clusters, n_features], the coordinates of the cluster centers
labels_: the cluster assigned to each point
inertia_: the sum of squared distances from each point to the centroid of its cluster
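
A short usage sketch of this API on some made-up 2-D data, showing the three returned attributes listed above:

import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D data, assumed only for illustration.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)
km.fit(X)

print(km.cluster_centers_)  # coordinates of the 2 cluster centers
print(km.labels_)           # cluster index assigned to each sample
print(km.inertia_)          # sum of squared distances to the nearest cluster center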

Example

See the example provided in Introductory Research on Machine Learning (17) - Instacart Market user classification.

Dimensionality reduction

Dimensionality reduction refers to the process of reducing the number of random variables (features) under certain constraints to obtain a set of uncorrelated principal variables. So dimensionality reduction here means reducing the number of features. The sample data we are given is a two-dimensional array in which the rows represent the samples and the columns represent the features, so dimensionality reduction means reducing the number of columns in this array.

Usually, the number of features is reduced by reducing the number of correlated features. So what are correlated features?

Correlated features are features that are similar to one another. For example, when a region's relative humidity and rainfall are both used as features for prediction, relative humidity and rainfall are correlated features: the information they carry is largely the same.

When there are many correlated features, there is a lot of redundant information, which affects the final prediction result.
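
As a quick illustration of how such redundancy can be checked, the sketch below computes the Pearson correlation between two made-up feature columns (the humidity and rainfall values are invented for this example); a coefficient close to 1 or -1 means the two features carry largely the same information:

import numpy as np

# Invented sample values for two features: relative humidity and rainfall.
humidity = np.array([0.62, 0.70, 0.55, 0.80, 0.66])
rainfall = np.array([12.0, 15.1, 9.8, 18.3, 13.2])

# Pearson correlation coefficient between the two features.
corr = np.corrcoef(humidity, rainfall)[0, 1]
print(corr)  # a value close to 1 indicates strongly correlated (redundant) features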

There are two main approaches to dimensionality reduction:

1) Feature selection

Find the main features among the original features.

2) Principal component analysis (PCA)

PCA

PCA combines high-dimensional variables that may be correlated into linearly uncorrelated low-dimensional variables, called principal components. The new low-dimensional dataset retains as much of the original data's variation as possible.

PCA is the process of transforming high-dimensional data into low-dimensional data. During this process some of the original data may be discarded and new variables created. In other words, the dimensionality of the data is compressed at the cost of losing a small amount of information, reducing the dimensionality and complexity of the original data as much as possible.

The whole PCA computation obtains the principal component analysis result through a matrix operation.
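
To make that matrix operation concrete, here is a minimal NumPy sketch of the classical PCA computation: center the data, take the singular value decomposition, and project onto the leading directions. It is a simplified illustration, not sklearn's exact implementation (sign conventions and numerical details may differ):

import numpy as np

def simple_pca(X, n_components):
    # Center each feature (column) around zero.
    X_centered = X - X.mean(axis=0)
    # Singular value decomposition of the centered data matrix.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Project the data onto the top principal directions (rows of Vt).
    return X_centered @ Vt[:n_components].T

X = np.array([[2, 8, 5, 1], [9, 7, 4, 2], [5, 8, 2, 7]], dtype=float)
print(simple_pca(X, n_components=3))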

API in sklearn

The API in sklearn is as follows:

sklearn.decomposition.PCA(n_components=None, copy=True, whiten=False,
                 svd_solver='auto', tol=0.0, iterated_power='auto',
                 random_state=None)

The parameters are as follows:

n_components:

    float: keep this fraction of the variance (i.e. what percentage of the information to retain)

    int: reduce the number of features to this value

copy: whether to copy the original data. Default True: the original data is not modified.

whiten: whether to normalize (whiten) the data after the reduction. Default False; normalization is usually not needed.

svd_solver: which singular value decomposition (SVD) method to use. There are four values: 'auto', 'full', 'arpack', 'randomized'.

    'auto': chooses among the three methods below

    'full': the traditional SVD, implemented with scipy

    'arpack': uses scipy's sparse SVD directly; its use cases are similar to 'randomized', the difference being that 'randomized' uses scikit-learn's own SVD implementation

    'randomized': suitable when the data is large and high-dimensional and the proportion of principal components is relatively low

Returned parameters:

explained_variance_: the variance of each principal component after the reduction; the larger the variance, the more important the principal component
explained_variance_ratio_: the proportion of the total variance accounted for by each principal component; the larger the proportion, the more important the principal component

Simple example

The original data is a 3x4 two-dimensional array; after dimensionality reduction it becomes a 3x3 two-dimensional array.

from sklearn.decomposition import PCA

def pca():
    data = [[2,8,5,1],[9,7,4,2],[5,8,2,7]]
    print("Original data:")
    print(data)
    # 1) Instantiate a PCA instance
    pca = PCA(n_components=3)
    # 2) Call fit_transform
    data_new = pca.fit_transform(data)
    print("Data after dimensionality reduction:")
    print(data_new)
    return None

After running:

Original data:
[[2, 8, 5, 1], [9, 7, 4, 2], [5, 8, 2, 7]]
Data after dimensionality reduction:
[[ 4.26857026e+00 -4.73024706e-01  2.57350359e-16]
 [-1.68373209e+00  3.59761365e+00  2.57350359e-16]
 [-2.58483817e+00 -3.12458895e+00  2.57350359e-16]]
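
As a follow-up, n_components can also be passed as a fraction, in which case PCA keeps just enough components to explain that share of the variance. The sketch below reuses the same toy data and also prints the two returned attributes described earlier (the 0.95 threshold is an arbitrary choice for illustration):

from sklearn.decomposition import PCA

data = [[2, 8, 5, 1], [9, 7, 4, 2], [5, 8, 2, 7]]

# Keep enough components to explain at least 95% of the variance.
pca = PCA(n_components=0.95)
data_new = pca.fit_transform(data)

print(data_new.shape)                 # (number of samples, number of kept components)
print(pca.explained_variance_)        # variance of each kept principal component
print(pca.explained_variance_ratio_)  # share of the total variance explained by each one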

 
