Classic Machine Learning Algorithms: K-Means

I. Introduction

K-Means is an unsupervised learning algorithm that solves clustering problems. K stands for the number of classes and Means refers to their centers, so the essence of the algorithm is to find the center point of each of the K classes. Once the center points are found, the clustering is done.

/* Please respect the author's work. If you reproduce this article, please cite the original link: */

/* https://www.cnblogs.com/jpcflyer/p/11117012.html */

First, let's think about a scenario: suppose there are 20 Asian football teams and you want to divide them into three tiers according to their results. How would you do it?

 

II. How K-Means works

You probably have your own opinion about the level of Asian football teams. Which teams are first-tier? You might say Iran or South Korea. Second-tier? You might say China. Third-tier? You might say Vietnam.

These divisions rely on our experience. Iran, China and Vietnam can then be seen as typical representatives of the three tiers, that is, the center point of each of our classes.

So, back to the question: how do we determine the center points of the K classes? We can start with a random assignment. Once the center points are fixed, every other team is assigned to the class of the center point it is closest to.

That is the core idea of K-Means, simple and direct. You might ask: what if the initial choice is wrong, say China as the first-tier center, Iran as the second-tier center and South Korea as the third-tier center? Don't worry. K-Means is self-correcting: the center points are adjusted over repeated iterations. The center points are not fixed during the iterative process, but an initial value is needed, and the algorithm generally sets the initial center points at random.

Let me summarize how K-Means works:

1. Select K points as the initial class centers. They are generally chosen at random from the data set.

2. Assign each point to the nearest class center, which forms K classes, and then recalculate the center point of each class.

3. Repeat step 2 until the classes no longer change. You can also set a maximum number of iterations, so that the algorithm stops once that number is reached even if the class centers are still changing.
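The three steps above can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions, not the sklearn implementation: the function name `kmeans`, its parameters and the toy points are all made up for the example.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-Means sketch (hypothetical helper, not a library API)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points at random as the initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2 (assign): distance from every point to every center,
        # then give each point the index of its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 3 (stop): finish as soon as no point changes class.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2 (update): move each center to the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a class is empty
                centers[j] = members.mean(axis=0)
    return labels, centers

# Toy data: two obvious groups of three points each.
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

The class indices themselves (0 or 1) depend on the random initial centers; only the grouping is meaningful.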

 

III. How to cluster the Asian teams

A machine needs data to determine the class centers, so I compiled the rankings of Asian teams from 2015 to 2019 in the following table.

Let me explain the data.

The 2019 FIFA world ranking and the 2015 Asian Cup ranking are actual rankings. For the 2018 World Cup, many teams did not reach the finals, and only teams that did have an actual ranking. A team that reached the final 12 of the Asian qualifiers is assigned a ranking of 40; a team that did not reach the final 12 is assigned a ranking of 50.

The first thing to do with these rankings is to normalize the data. I first normalize the values to the [0, 1] range, which gives the following table of values:
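As a rough sketch of this normalization step, each column can be rescaled with (x - min) / (max - min). The rankings below are copied from the run output at the end of this article; the variable names and the English team names are assumptions made for the example.

```python
import numpy as np

teams = ["China", "Japan", "South Korea", "Iran", "Saudi Arabia",
         "Iraq", "Qatar", "UAE", "Uzbekistan", "Thailand",
         "Vietnam", "Oman", "Bahrain", "North Korea", "Indonesia",
         "Australia", "Syria", "Jordan", "Kuwait", "Palestine"]

# Columns: 2019 world ranking, 2018 World Cup, 2015 Asian Cup.
rankings = np.array([
    [73, 40, 7], [60, 15, 5], [61, 19, 2], [34, 18, 6], [67, 26, 10],
    [91, 40, 4], [101, 40, 13], [81, 40, 6], [88, 40, 8], [122, 40, 17],
    [102, 50, 17], [87, 50, 12], [116, 50, 11], [110, 50, 14], [164, 50, 17],
    [40, 30, 1], [76, 40, 17], [118, 50, 9], [160, 50, 15], [96, 50, 16],
], dtype=float)

# Min-max normalization, column by column.
col_min = rankings.min(axis=0)
col_max = rankings.max(axis=0)
normalized = (rankings - col_min) / (col_max - col_min)

for name, row in zip(teams[:3], normalized[:3]):
    print(name, np.round(row, 6))
```

With these numbers China maps to roughly (0.3, 0.714, 0.375): its 2019 ranking of 73 sits 30% of the way between the best (34) and worst (164) values in that column.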

Suppose we randomly select China, Japan and South Korea as the center points of the three classes. We then need to look at the distance from each team to these center points.

There are many ways to calculate distance, which I covered when discussing the KNN algorithm. Euclidean distance is the most common, so I use it here: compute the distance from each team to China, Japan and South Korea, and assign the team to the nearest of the three. We will see that most teams end up clustered with China. For example, the Euclidean distance between China and China is 0, and between China and Japan it is 0.732003. With China, Japan and South Korea as the three center points, the Euclidean distances are shown in the following table:
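The China to Japan figure can be reproduced by hand. The sketch below assumes the column minima and maxima (34 to 164, 15 to 50, 1 to 17) read off the run output at the end of this article; the helper names `normalize` and `euclidean` are hypothetical.

```python
import math

# Column minima and maxima taken from the 20-team data in this article.
col_min = [34, 15, 1]
col_max = [164, 50, 17]

def normalize(row):
    """Min-max scale one team's three rankings to [0, 1]."""
    return [(v - lo) / (hi - lo) for v, lo, hi in zip(row, col_min, col_max)]

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

china = normalize([73, 40, 7])   # -> [0.3, 0.714..., 0.375]
japan = normalize([60, 15, 5])   # -> [0.2, 0.0, 0.25]

print(round(euclidean(china, china), 6))  # 0.0
print(round(euclidean(china, japan), 6))  # 0.732003
```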

Then we recalculate the center points of the three classes. How? The simplest way is to take the average of the members of each class. Each team is then reassigned according to its distance to the new center points, and the center points are updated again based on the new assignment. I won't expand the calculation here. Iterating (repeating the two steps above: computing the center points and reassigning the teams) until the assignment no longer changes gives the following result:

So the first tier is Japan, South Korea, Iran, Saudi Arabia and Australia; the second tier is China, Iraq, the United Arab Emirates and Uzbekistan; the third tier is Qatar, Thailand, Vietnam, Oman, Bahrain, North Korea, Indonesia, Syria, Jordan, Kuwait and Palestine.
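To make "take the average" concrete, here is a small sketch that computes the center of the first tier listed above in normalized coordinates. It assumes the same column minima and maxima as the rest of the data (34 to 164, 15 to 50, 1 to 17); the variable names are illustrative only.

```python
import numpy as np

col_min = np.array([34.0, 15.0, 1.0])
col_max = np.array([164.0, 50.0, 17.0])

# Raw rankings of the first-tier teams
# (2019 world ranking, 2018 World Cup, 2015 Asian Cup).
first_tier = np.array([
    [60, 15, 5],   # Japan
    [61, 19, 2],   # South Korea
    [34, 18, 6],   # Iran
    [67, 26, 10],  # Saudi Arabia
    [40, 30, 1],   # Australia
], dtype=float)

# Normalize, then average each column: that mean point is the class center.
normalized = (first_tier - col_min) / (col_max - col_min)
center = normalized.mean(axis=0)
print(np.round(center, 6))
```

The center is generally not one of the teams themselves; it is an artificial point in the middle of the class.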

 

IV. How to use K-Means in sklearn

sklearn is a machine learning library for Python. By function, it covers classification, clustering, regression, dimensionality reduction, model selection and preprocessing. Here we use its clustering module, so we need to import it as follows:

from sklearn.cluster import KMeans

K-Means is only one of the clustering methods in sklearn.cluster. Including K-Means, sklearn.cluster provides nine clustering methods in total, such as Mean-shift, DBSCAN and spectral clustering. Their principles differ from those of K-Means and are not covered here.

Let's look at how to create a K-Means object:

KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')

This is the constructor signature of the sklearn version this article was written against; newer releases have removed the precompute_distances and n_jobs parameters. The main parameters are:

n_clusters: the value of K. You generally need to try several values of K to get a good clustering, and then keep the one that clusters best;

max_iter: the maximum number of iterations. If the clustering is slow to converge, setting a maximum number of iterations lets us get results in time; otherwise the program may run for a very long time;

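n_init: the number of times the algorithm is run with different initial center points; the default is 10. Whether the program converges quickly depends heavily on the choice of center points, so it is worth spending more time on that choice in exchange for faster convergence overall. Because the initial center points are generated randomly each time, the results vary in quality, so the algorithm is run n_init times and the best result is kept. When K is large, you can increase n_init accordingly;

algorithm: the K-Means implementation to use, one of "auto", "full" or "elkan". "full" is the classic K-Means algorithm, and "auto" chooses between "full" and "elkan" automatically based on the characteristics of the data. We generally keep the default, "auto". (Newer sklearn releases call the classic algorithm "lloyd" and no longer accept "auto" or "full".)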
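One common way to "try several values of K" is to compare the fitted model's inertia_ (the sum of squared distances from each point to its class center) across a range of K. The sketch below is an assumption-laden illustration: it uses synthetic data from make_blobs as a stand-in, and passes n_init and max_iter explicitly so that it does not depend on version-specific defaults.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (hypothetical): 200 points around 3 centers.
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=0)

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, max_iter=300, random_state=0)
    model.fit(X)
    # Lower inertia means tighter classes; n_iter_ is the number of
    # iterations the best of the n_init runs needed.
    inertias[k] = model.inertia_
    print(k, round(model.inertia_, 2), "iterations:", model.n_iter_)
```

Inertia keeps shrinking as K grows, so the usual heuristic is to look for the K after which the improvement flattens out rather than for the smallest value.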

 

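Once the K-Means object is created, you can use its methods. The most common ones are fit and predict. You can call fit and predict separately, or combine them with fit_predict. fit(data) runs K-Means clustering on data, and predict(data) computes the nearest class for each sample in data.

Now let's run the clustering of the 20 Asian teams from start to finish.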
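The relationship between fit, predict and fit_predict can be seen on a tiny example. This is a sketch on made-up points; n_init and random_state are set explicitly so that both models behave identically.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four made-up points forming two obvious pairs.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.9, 1.0], [1.0, 0.9]])

# Two-step usage: fit learns the centers, predict assigns samples to them.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(X)
labels_two_step = km.predict(X)

# One-step usage: fit_predict does both on the same data.
labels_one_step = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# predict also works on samples that were not part of the training data.
new_point_label = km.predict(np.array([[0.95, 0.95]]))

print(labels_two_step, labels_one_step, new_point_label)
```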

# coding: utf-8
from sklearn.cluster import KMeans
from sklearn import preprocessing
import pandas as pd

# Load the data. The column names must match the CSV header exactly:
# "2019 年国际排名" = 2019 world ranking, "2018 世界杯" = 2018 World Cup,
# "2015 亚洲杯" = 2015 Asian Cup.
data = pd.read_csv('data.csv', encoding='gbk')
train_x = data[["2019 年国际排名", "2018 世界杯", "2015 亚洲杯"]]

kmeans = KMeans(n_clusters=3)

# Normalize to the [0, 1] range
min_max_scaler = preprocessing.MinMaxScaler()
train_x = min_max_scaler.fit_transform(train_x)

# Run K-Means
kmeans.fit(train_x)
predict_y = kmeans.predict(train_x)

# Append the cluster labels to the original data as a new column, "聚类" (cluster)
result = pd.concat((data, pd.DataFrame(predict_y)), axis=1)
result.rename({0: u'聚类'}, axis=1, inplace=True)
print(result)

Output (column headers and country names translated):

Country        2019 world ranking  2018 World Cup  2015 Asian Cup  Cluster
China                          73              40               7        2
Japan                          60              15               5        0
South Korea                    61              19               2        0
Iran                           34              18               6        0
Saudi Arabia                   67              26              10        0
Iraq                           91              40               4        2
Qatar                         101              40              13        1
UAE                            81              40               6        2
Uzbekistan                     88              40               8        2
Thailand                      122              40              17        1
Vietnam                       102              50              17        1
Oman                           87              50              12        1
Bahrain                       116              50              11        1
North Korea                   110              50              14        1
Indonesia                     164              50              17        1
Australia                      40              30               1        0
Syria                          76              40              17        1
Jordan                        118              50               9        1
Kuwait                        160              50              15        1
Palestine                      96              50              16        1
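The cluster numbers 0, 1 and 2 carry no order of their own; which number ends up meaning "first tier" can change from run to run. One possible way to turn them into tiers, sketched below, is to convert the class centers back to ranking units and sort them. The sketch inlines the rankings from the output above in place of data.csv; the English team names, the variable names and the choice to sort by the 2019 ranking are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

teams = ["China", "Japan", "South Korea", "Iran", "Saudi Arabia",
         "Iraq", "Qatar", "UAE", "Uzbekistan", "Thailand",
         "Vietnam", "Oman", "Bahrain", "North Korea", "Indonesia",
         "Australia", "Syria", "Jordan", "Kuwait", "Palestine"]

# Columns: 2019 world ranking, 2018 World Cup, 2015 Asian Cup.
rankings = np.array([
    [73, 40, 7], [60, 15, 5], [61, 19, 2], [34, 18, 6], [67, 26, 10],
    [91, 40, 4], [101, 40, 13], [81, 40, 6], [88, 40, 8], [122, 40, 17],
    [102, 50, 17], [87, 50, 12], [116, 50, 11], [110, 50, 14], [164, 50, 17],
    [40, 30, 1], [76, 40, 17], [118, 50, 9], [160, 50, 15], [96, 50, 16],
], dtype=float)

scaler = MinMaxScaler()
X = scaler.fit_transform(rankings)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Map the centers from [0, 1] space back to ranking units.
centers = scaler.inverse_transform(kmeans.cluster_centers_)

# Order the clusters by their mean 2019 ranking (lower is stronger).
order = np.argsort(centers[:, 0])
for tier, cluster_id in enumerate(order, start=1):
    members = [t for t, lab in zip(teams, labels) if lab == cluster_id]
    print(f"Tier {tier}: center={np.round(centers[cluster_id], 1)} teams={members}")
```

Because K-Means starts from random center points, the exact grouping may differ slightly from the output above depending on random_state and the sklearn version.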

Search for and follow the WeChat official account "程序员姜小白" for more updates.

 
