[Wu Enda (Andrew Ng) Machine Learning Course Notes] Week Three: Unsupervised Learning

Unsupervised Learning Definition

In clustering problems, we are given a training set {x(1), ..., x(m)} and want to divide the data into a few cohesive "clusters". Each x(i) is, as usual, a real-valued vector; but no labels y(i) are given, so this is an unsupervised learning problem.

K-means algorithm

K-means is the most commonly used clustering algorithm. It is based on Euclidean distance and assumes that the closer two points are to each other, the more similar they are.

The most common way to choose the number of clusters K is manual selection.

Priest-Villager Model

K-means has a well-known illustration, the priest-villager model:

Four priests go to preach in the suburbs. At the beginning, each priest randomly picks a mission site and announces it to all the villagers, so every villager attends lectures at the mission site closest to his home.
After the first lectures, everyone feels the distance is too far. So each priest collects the addresses of all the villagers in his class, moves to the center of those addresses, and updates the location of his mission site on the poster.
A priest cannot get closer to everyone each time he moves: some villagers find that after priest A moves, it is better to attend priest B's lectures, so each villager again goes to the mission site closest to him.
The priests update their locations every week, the villagers choose their mission sites accordingly, and eventually everything stabilizes.

We can see that the priests' goal is to minimize the sum of the distances from each villager to his nearest mission site.

Algorithm principle

The inner loop of the algorithm repeats two steps:
(i) "assign" each training example x(i) to the nearest cluster centroid µj;
(ii) move each cluster centroid µj to the mean of the points assigned to it.

[Figure: k-means on a toy dataset]
Training examples are shown as dots and cluster centroids as crosses. (a) The original dataset. (b) Random initial cluster centroids (in this case, not chosen equal to two training examples). (c-f) Two iterations of k-means. In each iteration, we assign every training example to the nearest cluster centroid (shown by "painting" each training example the same color as its assigned centroid); we then move each cluster centroid to the mean of the points assigned to it. (Best viewed in color.)
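
A minimal numpy sketch of these two steps (my own illustration, not the course's code; the function name kmeans and the random initialization scheme are assumptions):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Basic k-means. X: (m, n) data matrix; k: number of clusters."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct training examples as centroids
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step (i): assign each example to the nearest centroid (Euclidean)
        c = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        # Step (ii): move each centroid to the mean of its assigned points
        # (handling of empty clusters is omitted in this sketch)
        mu = np.array([X[c == j].mean(axis=0) for j in range(k)])
    return c, mu
```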

Shortcomings

The distortion function J is non-convex, so coordinate descent on J is not guaranteed to converge to the global minimum. In other words, k-means can be susceptible to local optima. In practice k-means usually works well and comes up with good clusters despite this. But if you are worried about getting stuck in bad local minima, one common thing to do is run k-means many times (with different random initial values for µj) and then, out of all the clusterings found, pick the one with the lowest distortion J(c, µ).
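
A sketch of this multiple-restart strategy, reusing the kmeans sketch above (distortion and kmeans_restarts are illustrative helper names, not from the notes):

```python
def distortion(X, c, mu):
    """J(c, mu): total squared distance from each point to its centroid."""
    return ((X - mu[c]) ** 2).sum()

def kmeans_restarts(X, k, n_runs=10):
    """Run k-means with different random initializations; keep the best."""
    runs = [kmeans(X, k, seed=s) for s in range(n_runs)]
    return min(runs, key=lambda run: distortion(X, run[0], run[1]))
```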

Algorithm process
[Figure: k-means algorithm pseudocode]

https://zhuanlan.zhihu.com/p/78798251
Zhou Zhihua "Machine Learning"
https://wuxian.blog.csdn.net/article/details/80107795

Gaussian Mixture Algorithm (*not yet understood)

GMM model

A Gaussian Mixture Model (GMM) is a linear combination of several Gaussian distribution functions; in theory, a GMM can fit any type of distribution. It is usually used when the data in a single set comes from several different distributions (either distributions of the same type with different parameters, or distributions of different types, such as a normal distribution and a Bernoulli distribution). The mixture density is written out below.
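
Written out (this is the standard GMM density, stated here for reference rather than taken from the notes):

$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1$$

where $\pi_k$ are the mixing weights and $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the k-th Gaussian component.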
Teacher Li Hang "Statistical Learning Methods"

EM Algorithm (Expectation-Maximization)

The EM algorithm is an iterative algorithm with two main steps. Applied to our problem, in the E-step it tries to "guess" the values of the z(i); in the M-step, it updates the parameters of our model based on those guesses. Since in the M-step we pretend that the guesses from the first step were correct, the maximization becomes easy.
E-step: for each $i, j$, set

$$w_j^{(i)} := p\big(z^{(i)} = j \mid x^{(i)};\ \phi, \mu, \Sigma\big)$$

M-step: update the parameters

$$\phi_j := \frac{1}{m}\sum_{i=1}^{m} w_j^{(i)}, \qquad \mu_j := \frac{\sum_{i=1}^{m} w_j^{(i)} x^{(i)}}{\sum_{i=1}^{m} w_j^{(i)}}, \qquad \Sigma_j := \frac{\sum_{i=1}^{m} w_j^{(i)} \big(x^{(i)} - \mu_j\big)\big(x^{(i)} - \mu_j\big)^T}{\sum_{i=1}^{m} w_j^{(i)}}$$
In the E-step, we compute the posterior probability of z(i) given x(i), using the current settings of our parameters.
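
A minimal numpy/scipy sketch of one EM iteration for a GMM, following the two steps above (illustrative only; em_step and the array layout are my own choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, phi, mu, sigma):
    """One EM iteration for a Gaussian mixture.
    X: (m, n) data; phi: (K,) mixing weights;
    mu: (K, n) means; sigma: (K, n, n) covariance matrices."""
    m, K = X.shape[0], phi.shape[0]
    # E-step: responsibilities w[i, j] = p(z = j | x_i; current parameters)
    w = np.zeros((m, K))
    for j in range(K):
        w[:, j] = phi[j] * multivariate_normal.pdf(X, mu[j], sigma[j])
    w /= w.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters, pretending the guesses are correct
    for j in range(K):
        wj = w[:, j]
        phi[j] = wj.mean()
        mu[j] = wj @ X / wj.sum()
        diff = X - mu[j]
        sigma[j] = (wj[:, None] * diff).T @ diff / wj.sum()
    return phi, mu, sigma
```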

Teacher Li Hang "Statistical Learning Methods"
https://blog.csdn.net/xmu_jupiter/article/details/50889023

Similarities and differences between the two algorithms

GMM:

  • First compute each sub-model's responsibility for every data point
  • Update each sub-model's parameters based on the responsibilities
  • Iterate

K-means:

  • First compute each data point's distance to the K centers, and assign it to the class of the nearest center
  • Update the positions of the centers according to the assignment from the previous step (the positions of the centers can be regarded as model parameters)
  • Iterate

We can see that GMM and K-means have a lot in common. The responsibilities of the Gaussian components in GMM play the role of the distance computation in K-means; estimating the Gaussian components' parameters from the responsibilities corresponds to updating the centers' positions in K-means; and both reach an optimum through repeated iteration. The difference: the GMM model gives, for each observation, the probability that it was generated by each Gaussian component, whereas K-means directly assigns each observation to a single category, as the small sketch below shows.
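
In code terms, the contrast is between a soft and a hard assignment (a toy illustration; the numbers are made up):

```python
import numpy as np

# Soft assignment (GMM): each point gets a probability for every component
resp = np.array([0.7, 0.2, 0.1])  # e.g. responsibilities of one data point

# Hard assignment (k-means): each point gets exactly one cluster index
label = int(np.argmax(resp))      # -> 0
```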

https://blog.csdn.net/xmu_jupiter/article/details/50889023

Clustering

Clustering is a typical unsupervised learning method: by learning from unlabeled training samples, it reveals the inherent structure and regularities of the data and provides a basis for further analysis. Other common unsupervised learning tasks include density estimation and anomaly detection.

Clustering attempts to divide the samples in a data set into several (usually disjoint) subsets, each of which is called a "cluster". Through such a division, each cluster may correspond to some underlying concept. These concepts are unknown to the clustering algorithm in advance; the clustering process can only form the cluster structure automatically, and the semantics of each cluster must be interpreted and named by the user.

https://zhuanlan.zhihu.com/p/70756804

Gaussian function

The normal distribution is a Gaussian probability distribution. It reflects the central limit theorem, which states that when a random sample is large enough, sample averages tend toward the expected value, and values farther from the expected value occur less frequently.

The Gaussian function is widely used in statistics to describe the normal distribution, in signal processing to define Gaussian filters, and in image processing, where the two-dimensional Gaussian kernel is commonly used for Gaussian blur. In mathematics it is mainly used to solve the heat equation and the diffusion equation, and to define the Weierstrass transform.
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
μ is the expectation (mean), which determines the axis of symmetry of the normal distribution.
σ² is the variance, which determines how "fat" or "thin" the curve is: the larger the variance, the shorter and wider the curve.
Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$, where $\bar{x}$ is the average.
Standard deviation: the square root of the variance.
For any normal distribution, the probability density integrates to 1 over $(-\infty, +\infty)$.
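
A quick numerical sanity check of that last statement (a small sketch using scipy; the parameter values are arbitrary):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma = 1.5, 2.0  # arbitrary example parameters
total, _ = quad(lambda x: norm.pdf(x, loc=mu, scale=sigma), -np.inf, np.inf)
print(total)  # ~1.0, for any choice of mu and sigma
```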

1D Gaussian function

$$f(x) = a \exp\left(-\frac{(x-b)^2}{2c^2}\right)$$
for arbitrary real constants a, b, and c. The function is named after the mathematician Carl Friedrich Gauss. The graph of a one-dimensional Gaussian is the characteristic symmetric "bell curve": a is the height of the curve's peak, b is the position of the center of the peak, and c (the standard deviation) controls the width of the bell.
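
Evaluating this form directly (a tiny sketch; the function name gauss1d is mine, and the parameter names follow the text):

```python
import numpy as np

def gauss1d(x, a, b, c):
    """General 1D Gaussian: a = peak height, b = peak center, c = width."""
    return a * np.exp(-(x - b) ** 2 / (2 * c ** 2))

print(gauss1d(0.0, a=1.0, b=0.0, c=1.0))  # 1.0 at the peak
print(gauss1d(1.0, a=1.0, b=0.0, c=1.0))  # exp(-0.5) ≈ 0.6065
```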

2D Gaussian function
$$f(x, y) = A \exp\left(-\left(\frac{(x-x_0)^2}{2\sigma_x^2} + \frac{(y-y_0)^2}{2\sigma_y^2}\right)\right)$$
A is the amplitude, (x0, y0) are the coordinates of the center of the peak, and σx, σy are the spreads in the x and y directions. The figure below shows the case A = 1, x0 = 0, y0 = 0, σx = σy = 1.
[Figure: 2D Gaussian surface with A = 1, x0 = y0 = 0, σx = σy = 1]
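
The same evaluation in two dimensions (a small sketch; gauss2d is an illustrative name, using the example values from above):

```python
import numpy as np

def gauss2d(x, y, A=1.0, x0=0.0, y0=0.0, sx=1.0, sy=1.0):
    """2D Gaussian: A * exp(-((x-x0)^2 / (2 sx^2) + (y-y0)^2 / (2 sy^2)))."""
    return A * np.exp(-((x - x0) ** 2 / (2 * sx ** 2)
                        + (y - y0) ** 2 / (2 * sy ** 2)))

# The example values from the text: A = 1, x0 = y0 = 0, sigma_x = sigma_y = 1
print(gauss2d(0.0, 0.0))  # 1.0 at the center
print(gauss2d(1.0, 1.0))  # exp(-1) ≈ 0.3679
```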

https://blog.csdn.net/qinglongzhan/article/details/82348153

PCA Algorithm (Principal Components Analysis)

Principal Components Analysis (PCA) also tries to identify the subspace in which the data approximately lies. PCA, however, does this more directly, and requires only an eigenvector computation (easily done with the eig function in Matlab); it does not need to resort to EM.

PCA (Principal Component Analysis) is the most widely used data dimensionality-reduction algorithm (an unsupervised machine learning method).

Its main purpose is "dimensionality reduction": by extracting the principal components that exhibit the largest individual differences, it reveals features that are easier for humans to understand. It can also be used to reduce the number of variables in regression analysis and cluster analysis.

Why Do Principal Component Analysis

In many scenarios it is necessary to observe multivariate data, which increases the workload of data collection to some extent. Worse, there may be correlations among the variables, which increases the complexity of the analysis.

If each indicator is analyzed separately, the results are often isolated and do not make full use of the information in the data; yet blindly discarding indicators loses a lot of useful information and leads to wrong conclusions.

Therefore, we need a reasonable method that reduces the number of indicators to analyze while minimizing the loss of information contained in the original indicators, so that a comprehensive analysis of the collected data is still possible. Since the variables are correlated to some degree, we can consider turning the closely related variables into as few new variables as possible, such that these new variables are pairwise uncorrelated; then fewer comprehensive indicators can represent the various kinds of information present in the original variables. Principal component analysis and factor analysis both belong to this family of dimensionality-reduction algorithms.

Steps

Steps (1-2) zero out the mean of the data.
They may be omitted for data already known to have zero mean (for example, time series corresponding to speech or other acoustic signals).

Steps (3-4) rescale each coordinate to have unit variance, which ensures that different attributes are all treated on the same "scale". For instance, if x1 was a car's maximum speed in mph (taking values in the high tens or low hundreds) and x2 was the number of seats (taking values around 2-4), then this renormalization rescales the different attributes to make them more comparable.
If we know in advance that the different attributes are all on the same scale, steps (3-4) can be omitted.

[Figure: PCA algorithm steps]
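
A minimal numpy sketch following these steps (illustrative; the helper name pca and the choice of eigh are mine, not the course's code):

```python
import numpy as np

def pca(X, k):
    """Minimal PCA. X: (m, n) data matrix; k: number of components kept."""
    # Steps (1-2): zero out the mean of the data
    X = X - X.mean(axis=0)
    # Steps (3-4): rescale each coordinate to unit variance
    X = X / X.std(axis=0)
    # Empirical covariance matrix (n x n)
    cov = X.T @ X / X.shape[0]
    # Principal components = top-k eigenvectors of the covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: cov is symmetric
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    # Project the data onto the k-dimensional principal subspace
    return X @ top
```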
Reducing two dimensions to one
[Figure: projecting two-dimensional data onto one dimension]
Note the difference from the linear regression loss function: PCA minimizes the perpendicular (projection) distance from each point to the subspace, whereas linear regression minimizes the vertical error between the prediction and y.

https://blog.csdn.net/weixin_43312354/article/details/10565330

Multidimensional Vector Dimensionality Reduction

Principle (linear algebra): stack the m n-dimensional vectors into a matrix and multiply it by its transpose to obtain a square matrix (m × m or n × n, depending on whether the vectors are rows or columns); the eigenvectors of this matrix give the projection directions. See the toy shape check below.
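
A toy check of the matrix shapes involved (illustrative only):

```python
import numpy as np

X = np.random.randn(5, 3)  # m = 5 vectors, each n = 3 dimensional (as rows)
print((X @ X.T).shape)     # (5, 5): pairwise inner products of the vectors
print((X.T @ X).shape)     # (3, 3): covariance-style matrix used by PCA
```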


Source: https://blog.csdn.net/mossfan/article/details/125216628