KNN and KMeans Analysis

KNN and KMeans are commonly used machine learning algorithms, but they are often confused with each other. This article introduces the two algorithms in detail: what each one means, how it works, what it is used for, and how they differ.

1. KNN

KNN is a supervised learning algorithm for classification (it requires labeled training data). The simplest, brute-force approach is to compute the distance between the point to be predicted and every training point, sort those distances, take the K nearest points, and assign the point to whichever category is most common among them.

(1) Process

Do the following in turn for each point in the dataset with unknown class attributes:

a. Calculate the distance between the current point and every point in the known-category dataset (e.g., Manhattan or Euclidean distance)

b. Sort in ascending order of distance

c. Select the k points closest to the current point. (The value of k is typically chosen by cross-validation: split the sample data into training and validation sets at some ratio, such as 6:4; start from a small value of K and gradually increase it, computing the error on the validation set each time, and finally pick the K that performs best.)

d. Count the frequency of each category among these K points

e. Return the most frequent category among these k points as the predicted class of the current point
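The steps above can be sketched as a brute-force implementation in Python (a minimal sketch using NumPy; the function name and toy data are illustrative, not from the original article):

```python
from collections import Counter

import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # a. Distance from the query point to every training point (Euclidean here)
    dists = np.linalg.norm(train_X - query, axis=1)
    # b./c. Sort in ascending order and keep the indices of the k closest points
    nearest = np.argsort(dists)[:k]
    # d./e. Count the labels of those k points and return the most frequent one
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: class 0 near the origin, class 1 near (5, 5)
train_X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
train_y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(train_X, train_y, np.array([0.5, 0.5])))  # -> 0
print(knn_predict(train_X, train_y, np.array([5.5, 5.5])))  # -> 1
```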

2. KMeans

KMeans is a clustering (unsupervised) algorithm; the number of categories K is specified manually. The machine then repeatedly performs the cycle of selecting centroids, computing distances and assigning points, and selecting new centroids, until the assignments no longer change, at which point the final clustering result is obtained.

(1) Process

a. Randomly select K centroids (the value of K is how many clusters you want)

b. Calculate the distance from each sample to every centroid and assign each sample to its nearest centroid, dividing the data into K clusters

c. Recompute the centroid of each cluster after the assignment

d. Calculate the distance from each sample to the new centroids again and reassign each sample to its nearest centroid

e. Check whether the new cluster assignments match the old ones. If they match, clustering has converged; if not, repeat steps b–d until they do.
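Steps a–e above are essentially Lloyd's algorithm, which can be sketched in Python as follows (a minimal NumPy sketch; the function name, random seed, and toy data are illustrative):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Lloyd's algorithm: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # a. Randomly pick k distinct samples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # b. Assign each sample to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # c./d. Recompute each cluster's centroid as the mean of its members
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # e. Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated groups
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
centroids, labels = kmeans(X, k=2)
print(labels)  # the first three points share one label, the last three the other
```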

3. KMeans++

KMeans has two drawbacks: the number of cluster centers K must be given in advance, yet in practice this value is very hard to estimate; and the initial cluster centers must be chosen manually, and different initial centers can lead to completely different clustering results. (KMeans++ improves the choice of initial centers, addressing the second drawback.)

(1) Process

The basic idea of the k-means++ algorithm for selecting initial seeds is that the initial cluster centers should be as far apart from one another as possible.

a. Start with a given set of points D

b. Randomly select a point from the point set D as the initial center point

c. For each point, calculate the distance Si to its nearest already-chosen center point

d. Sum Si to get Sum

e. Draw a random value Random (0 < Random < Sum)

f. Loop over the point set D, performing Random -= Si for each point; when Random < 0, the current point i becomes the next center point

g. Repeat steps c–f until all K center points have been chosen

h. Run the K-means algorithm with these initial centers
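The seeding steps a–g describe distance-weighted random sampling of centers, which can be sketched in Python as follows. (A minimal sketch following the steps above, which weight by the plain distance Si; the standard k-means++ paper weights by the squared distance instead. The function name and toy data are illustrative.)

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """Pick k initial centers that tend to be far apart (steps a-g)."""
    rng = np.random.default_rng(seed)
    # b. Choose the first center uniformly at random from the point set
    centers = [X[rng.integers(len(X))]]
    while len(centers) < k:
        # c. Distance Si from each point to its nearest already-chosen center
        dists = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        # d./e. Sum the distances and draw a random threshold in (0, Sum)
        acc = rng.random() * dists.sum()
        # f. Subtract each Si from the threshold; the point that drives it
        #    below zero is the next center (far points are picked more often)
        for i, d in enumerate(dists):
            acc -= d
            if acc < 0:
                centers.append(X[i])
                break
    return np.array(centers)

# Toy data: two well-separated groups
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
centers = kmeanspp_init(X, k=2)
print(centers)  # two distinct points, likely one from each group
```

The selected centers can then be passed to a standard K-means routine (step h) in place of purely random initialization.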

 
