[Clustering algorithm] MeanShift algorithm


0. Preface

MeanShift, like DBSCAN, is a density-based clustering algorithm, and both are easy to grasp intuitively:
DBSCAN: gradually recruit a "downline" from the people around you
MeanShift: gradually settle on the "big brother" with the most "appeal"

Most people could come up with these two simple ideas, but it takes complete mathematical backing behind them to truly become the "inventor".

Intuition is important sometimes.


1. Main text

1.1 Concept

The Mean Shift algorithm was first proposed by Fukunaga and Hostetler in 1975. In 2002, Comaniciu and Meer improved it and gave a more complete theoretical analysis and proof, applying it to object tracking and salient-region detection. It has since been widely used in data mining, image processing, and social network analysis.

1.2 Basic process

  1. Randomly pick a sample from the (unlabeled) data as the initial center (the "big brother").
  2. Find the other sample points (the "little brothers") within a given radius (the bandwidth) of the current center, record them as the set M, and assign them to cluster c (give it a domineering name, say the Axe Gang). At the same time, increase each of these samples' visit count for this cluster by 1 (the big brother recruits little brothers, the gang gets its name, and each little brother's loyalty is recorded).
  3. Compute the mean shift vector from the center to the points in M and move the center: center = center + shift (the "big brother" is re-chosen according to "appeal").
  4. Repeat steps 2 and 3 until the shift is smaller than a given threshold, i.e. the process converges. Record the center at this point: we have found a cluster center and marked the samples around it (the final big brother, the gang name, and each little brother's loyalty are settled).
  5. Repeat steps 1 through 4. If the distance between two cluster centers is smaller than a threshold, merge them (the gangs merge).
  6. Finally, assign the sample points that are not centers: a point may have been visited by several clusters, and it joins the one it was most "loyal" to, i.e. the cluster that visited it most often (a little brother may be courted by several gangs and joins the one he is most sincere about). A minimal code sketch of this procedure follows the list.
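The whole procedure can be prototyped in a few dozen lines. The following is only a minimal sketch, not scikit-learn's implementation: it uses a flat kernel (every neighbour within the bandwidth weighs the same), starts a candidate center at every sample instead of random seeding as in step 1, and assigns samples to their nearest final center rather than by visit count.

import numpy as np

def mean_shift_flat(X, bandwidth, tol=1e-3, max_iter=300):
    """Toy flat-kernel MeanShift: move every candidate center to the mean
    of its neighbours until convergence, then merge nearby centers."""
    centers = X.copy()
    for _ in range(max_iter):
        new_centers = np.empty_like(centers)
        for i, c in enumerate(centers):
            # the set M: points within the bandwidth of the current center
            neighbours = X[np.linalg.norm(X - c, axis=1) <= bandwidth]
            new_centers[i] = neighbours.mean(axis=0) if len(neighbours) else c
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:  # step 4: converged
            break
    # step 5: merge centers that ended up closer than the bandwidth
    merged = []
    for c in centers:
        if all(np.linalg.norm(c - m) >= bandwidth for m in merged):
            merged.append(c)
    merged = np.asarray(merged)
    # step 6 (simplified): assign every sample to its nearest surviving center
    labels = np.argmin(np.linalg.norm(X[:, None] - merged[None, :], axis=2), axis=1)
    return merged, labels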

Mean shift vector (subtract the center from each neighbouring sample, sum, and divide by the number of neighbours):

$$M_h(x) = \frac{1}{k} \sum_{x_i \in S_h(x)} (x_i - x)$$

where $S_h(x)$ is the set of the $k$ samples that lie within the bandwidth $h$ of the current center $x$.

Center update:

$$x \leftarrow x + M_h(x)$$
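As a quick numeric check of these two formulas, a hypothetical toy example with three neighbours found inside the bandwidth:

import numpy as np

center = np.array([0.0, 0.0])
# the set M: neighbours found within the bandwidth of the center
M = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# mean shift vector: average of (x_i - center) over the k neighbours
shift = (M - center).mean(axis=0)   # -> [0.667, 0.667]

# center update: center = center + shift
center = center + shift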

1.3 Kernel function

When computing the shift vector above, every point around the center contributes equally. In reality, the closer a point is to the center, the larger its influence should be.

In practice, a Gaussian kernel is usually used to estimate the density around each point, and the direction and step of each move are then derived from these density values. Specifically, for a data point $x_i$, the density of the points in its neighbourhood can be estimated as

$$\hat{f}(x_i) = \frac{1}{n h^d} \sum_{j=1}^{n} K\!\left(\frac{x_i - x_j}{h}\right)$$

where $K$ is the Gaussian kernel, $h$ is the bandwidth parameter, $n$ is the size of the dataset, and $d$ is the dimensionality of the data points.

The Gaussian kernel guarantees that points farther from the center receive smaller density weights and closer points receive larger ones.

After the density of every point has been computed, the shift direction and step follow from the kernel-weighted mean of the neighbourhood:

$$m(x) = \frac{\sum_{i=1}^{n} x_i \, g\!\left(\left\lVert \frac{x - x_i}{h} \right\rVert^2\right)}{\sum_{i=1}^{n} g\!\left(\left\lVert \frac{x - x_i}{h} \right\rVert^2\right)} - x$$

where $g$ is the profile of the kernel (for the Gaussian kernel, $g$ is again Gaussian), and the center is then moved to $x + m(x)$.
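A single weighted update step can be written directly from this formula. The sketch below assumes the Gaussian profile applied to squared distances, $g(u) = \exp(-u / (2h^2))$; the function name is only illustrative:

import numpy as np

def gaussian_mean_shift_step(center, X, h):
    """One kernel-weighted mean-shift step: closer points get exponentially
    larger weights, so they pull the center harder."""
    d2 = np.sum((X - center) ** 2, axis=1)   # squared distances ||x - x_i||^2
    w = np.exp(-d2 / (2 * h ** 2))           # Gaussian weights g(.)
    new_center = (w[:, None] * X).sum(axis=0) / w.sum()
    return new_center                        # equals x + m(x)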

1.4 Demo

scikit-learn's estimate_bandwidth function can estimate a suitable bandwidth directly from the data, which is then passed to MeanShift:

from sklearn.cluster import MeanShift, estimate_bandwidth
import numpy as np

# load the data
data = np.loadtxt('data.txt')

# estimate the bandwidth value
bandwidth = estimate_bandwidth(data)

# cluster using the estimated bandwidth
ms = MeanShift(bandwidth=bandwidth)
ms.fit(data)
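Since data.txt is not provided here, a self-contained variant on synthetic data might look like the sketch below; make_blobs is only used to fabricate example points, and quantile and bin_seeding are optional knobs rather than requirements of the algorithm:

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# fabricate three well-separated clusters as stand-in data
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# a smaller quantile gives a smaller bandwidth and hence more clusters
bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=200)

ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)

print("estimated clusters:", len(ms.cluster_centers_))
print("first ten labels:", ms.labels_[:10])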

