Optimization: Modeling, Algorithms and Theory (Optimization Modeling - 2)

3.10 K-means clustering

Cluster analysis is a fundamental problem in statistics with important applications in machine learning, data mining, pattern recognition and image analysis. Clustering is different from classification: in clustering problems we only know the data points themselves, not the label of each data point. The task of cluster analysis is to group unlabeled data points according to some measure of similarity, and thereby learn the inherent category structure from the data points themselves.

Given $n$ data points $a_1,a_2,\cdots,a_n$ in $p$-dimensional space, and assuming that the similarity between two data points can be measured by their Euclidean distance, our goal is to group similar points into the same class while separating dissimilar points. For simplicity we assume that the number of classes is known, denoted by $k$, and that each data point belongs to exactly one class. The clustering problem is then to find $k$ disjoint non-empty sets $S_1,S_2,\cdots,S_k$ such that

$$\{a_1,a_2,\cdots,a_n\}=S_1\cup S_2\cup\cdots\cup S_k$$
and the points within each class are close to one another. To describe "points in the same class are close enough" mathematically, we define the within-group sum of squared distances as

$$W(S_1,S_2,\cdots,S_k)=\sum_{i=1}^k\sum_{a\in S_i}||a-c_i||^2 \tag{3.10.1}$$

where $c_i$ is the center of the data points in the $i$-th class. Note that the problem assumes every class is non-empty.
With the clustering criterion defined, we can now propose an optimization model: we look for a partition that minimizes the within-group sum of squared distances, that is,
$$\begin{aligned}\min_{S_1,S_2,\cdots,S_k}\ &\sum_{i=1}^k\sum_{a\in S_i}||a-c_i||^2\\ \mathrm{s.t.}\quad &\{a_1,a_2,\cdots,a_n\}=S_1\cup S_2\cup\cdots\cup S_k,\\ &S_i\cap S_j=\varnothing,\ \forall\, i\neq j\end{aligned}\tag{3.10.2}$$
The independent variable of problem (3.10.2) is a partition of the data point set, which looks difficult to handle, so we would like to rewrite the problem in a more familiar form. Two matrix formulations of the problem are given next; both are equivalent to (3.10.2).

1. Equivalent expression 1 of K-means clustering

In the original clustering problem, the within-group sum of squared distances is defined by (3.10.1): we sum the squared distances from the points in $S_i$ to their center $c_i$. In fact it is not necessary to take the center $c_i$ as the reference point; we may choose other points $h_i$ as references for measuring the within-group distances (in fact, the optimization will force $h_i$ to be the center in the end). The within-group sum of squared distances is thus generalized to

$$W(S_1,S_2,\cdots,S_k,H)=\sum_{i=1}^k\sum_{a\in S_i}||a-h_i||^2$$

where $H\in R^{k\times p}$ collects the $k$ reference points (each of dimension $p$) and its $i$-th row is $h_i^T$. To encode the partition $S_1,S_2,\cdots,S_k$, a natural idea is to use a vector $\phi_i\in R^k$ to indicate the class of $a_i$:

$$(\phi_i)_j=\begin{cases}1,& a_i\in S_j\\ 0,& a_i\notin S_j\end{cases}$$
The clustering problem is equivalently described as
$$\begin{aligned}\min_{\Phi,H}\ &||A-\Phi H||_F^2\\ \mathrm{s.t.}\quad &\Phi\in R^{n\times k},\ \text{each row of }\Phi\text{ has exactly one entry equal to }1\text{ and the rest }0,\\ &H\in R^{k\times p}\end{aligned}\tag{3.10.3}$$

Here $A\in R^{n\times p}$ is the data matrix whose $i$-th row is $a_i^T$, and the $i$-th row of $\Phi$ is $\phi_i^T$.

Next we explain why (3.10.3) is equivalent to the original problem (3.10.2). It suffices to show that the optimal reference points $H$ are exactly the centers of the classes. Given a partition, the within-group sum of squared distances of the $i$-th class is

$$\sum_{a\in S_i}||a-h_i||^2.$$

By the properties of quadratic functions, this is minimized when $h_i=\frac{1}{n_i}\sum_{a\in S_i}a$ with $n_i=|S_i|$, so at the optimum $h_i$ is exactly the center of the $i$-th class.

There are two reasons for introducing problem (3.10.3):
(1) its form is simple: the hard-to-handle "partition" variable is turned into a matrix variable;
(2) it can be viewed as a matrix factorization problem, which makes it easier to design algorithms.
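Problem (3.10.3) also suggests an alternating scheme: with $\Phi$ fixed, the optimal $H$ consists of the class centers; with $H$ fixed, the optimal $\Phi$ assigns each point to its nearest reference point. Below is a minimal sketch of this alternation in Python/NumPy (my own illustration, not code from the book; the name `kmeans` and all parameter defaults are assumptions), which is exactly the classical K-means (Lloyd) iteration.

```python
import numpy as np

def kmeans(A, k, iters=100, seed=0):
    """Alternating minimization of ||A - Phi @ H||_F^2 over assignments and centers.

    A: (n, p) data matrix whose i-th row is a_i. Returns labels (n,) and centers H (k, p).
    """
    rng = np.random.default_rng(seed)
    n, p = A.shape
    H = A[rng.choice(n, size=k, replace=False)].astype(float)  # start from k random points
    for _ in range(iters):
        # Phi-step: assign each a_i to its nearest reference point h_j
        dist2 = ((A[:, None, :] - H[None, :, :]) ** 2).sum(axis=2)  # (n, k) squared distances
        labels = dist2.argmin(axis=1)
        # H-step: each h_j becomes the mean of the points currently assigned to it
        for j in range(k):
            if np.any(labels == j):          # keep the old center if a class becomes empty
                H[j] = A[labels == j].mean(axis=0)
    return labels, H
```

Each step does not increase the objective $||A-\Phi H||_F^2$, but the iteration can stop at a local minimum, so in practice it is usually restarted from several random initializations and the best result is kept.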

2. Equivalent expression 2 of K-means clustering

The second equivalent formulation of K-means clustering exploits properties of column-orthogonal matrices and is more concise than problem (3.10.3). First, for $1\le t\le k$, define $I_{S_t}$ to be the vector in $n$-dimensional space whose components take the values 0 or 1, with

$$I_{S_t}(i)=\begin{cases}1,& a_i\in S_t\\ 0,& a_i\notin S_t\end{cases}$$

It can be shown that the sum of squared distances from the points of the $t$-th class $S_t$ to their center can be written as $\frac{1}{2n_t}\mathrm{Tr}(DI_{S_t}I_{S_t}^T)$, where $n_t=|S_t|$ and $D\in R^{n\times n}$ has entries $D_{ij}=||a_i-a_j||^2$. In other words, the sum of squared distances from the points of $S_t$ to their center is determined by the pairwise squared distances between all points in $S_t$. Problem (3.10.2) can therefore be rewritten as
$$\begin{aligned}\min_{S_1,S_2,\cdots,S_k}\ &\frac{1}{2}\mathrm{Tr}(DX)\\ \mathrm{s.t.}\quad &X=\sum_{t=1}^k\frac{1}{n_t}I_{S_t}I_{S_t}^T,\\ &S_1\cup S_2\cup\cdots\cup S_k=\{a_1,a_2,\cdots,a_n\},\\ &S_i\cap S_j=\varnothing,\ \forall\, i\neq j\end{aligned}\tag{3.10.4}$$
It can be shown that $X$ is positive semidefinite. Factoring $X=YY^T$ with $Y\in R^{n\times k}$, we further obtain the following matrix optimization problem (here $\mathbf{1}$ denotes the $n$-dimensional vector of all ones):

$$\min_{Y\in R^{n\times k}}\ \frac{1}{2}\mathrm{Tr}(Y^TDY)\quad \mathrm{s.t.}\quad YY^T\mathbf{1}=\mathbf{1},\ \ Y^TY=I_k,\ \ Y\ge 0\tag{3.10.5}$$
Given a solution $Y$ of (3.10.5), $YY^T$ corresponds to a solution of (3.10.4). (To be honest, I don't fully understand this part; applying K-means directly is quite simple.)
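The key identity behind this formulation is that the within-group sum of squares equals $\frac{1}{2n_t}\mathrm{Tr}(DI_{S_t}I_{S_t}^T)$, which can be checked numerically. The sketch below (my own, assuming NumPy; the variable names are arbitrary) builds $D$ and $X=\sum_t\frac{1}{n_t}I_{S_t}I_{S_t}^T$ for a random partition and compares $\frac{1}{2}\mathrm{Tr}(DX)$ with the within-group sum of squared distances from (3.10.1).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 20, 3, 4
A = rng.standard_normal((n, p))          # data points a_1, ..., a_n as rows
labels = rng.integers(0, k, size=n)      # an arbitrary partition S_1, ..., S_k

# Pairwise squared distances D_{ij} = ||a_i - a_j||^2
D = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)

# X = sum_t (1/n_t) I_{S_t} I_{S_t}^T
X = np.zeros((n, n))
for t in range(k):
    ind = (labels == t).astype(float)    # indicator vector I_{S_t}
    if ind.sum() > 0:
        X += np.outer(ind, ind) / ind.sum()

# Within-group sum of squared distances to the class centers, as in (3.10.1)
W = sum(((A[labels == t] - A[labels == t].mean(axis=0)) ** 2).sum()
        for t in range(k) if np.any(labels == t))

print(np.isclose(0.5 * np.trace(D @ X), W))   # expected: True
```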

3.11 Total variation model in image processing

This section probably requires some background in image processing to follow. Anyway, I'm currently rather confused by it.

Let us briefly introduce the image processing model based on total variation (TV). For a function $u(x,y)$ defined on a region $\Omega\subset R^2$, its total variation is

$$||u||_{TV}=\int_{\Omega}||Du||\,dx \tag{3.11.1}$$

where the gradient operator $D$ satisfies

$$Du=\left(\frac{\partial u}{\partial x},\frac{\partial u}{\partial y}\right).$$
Here $||Du||$ can be taken to be the $l_1$ norm, i.e.

$$||Du||_1=\left|\frac{\partial u}{\partial x}\right|+\left|\frac{\partial u}{\partial y}\right|,$$

in which case the corresponding total variation is called anisotropic. If the $l_2$ norm is used, i.e.

$$||Du||_2=\sqrt{\left(\frac{\partial u}{\partial x}\right)^2+\left(\frac{\partial u}{\partial y}\right)^2},$$

the corresponding total variation is called isotropic.

Let $b(x,y)$ be the observed noisy image and let $A$ be a linear operator. Under the classic Rudin-Osher-Fatemi (ROF) model, the image denoising and deblurring problems can be written as

$$\min_u\ ||Au-b||_{L^2}^2+\lambda||u||_{TV} \tag{3.11.2}$$

Here the $L^2$ norm of a function $f$ with domain $\Omega$ is defined as

$$||f||_{L^2}=\left(\int_{\Omega}f^2\,dx\right)^{\frac{1}{2}}.$$
If $A$ is the identity operator or a blur operator, the model above corresponds to image denoising or image deblurring, respectively. The first term of the objective function is the data fidelity term: the reconstructed image must be consistent with the collected observations. The second term is the regularization term, which encourages the gradients of the reconstructed image to be sparse, i.e. the reconstructed image should be close to a piecewise constant function.

The discrete form of the continuous model (3.11.2) is given below. For simplicity, assume the region is $\Omega=[0,1]\times[0,1]$ and discretize it into an $n\times n$ grid, so that the grid point $(\frac{i}{n},\frac{j}{n})$ corresponds to the index $(i,j)$. We represent the image $u$ by a matrix $U\in R^{n\times n}$ whose entry $u_{i,j}$ corresponds to the index $(i,j)$. Discretizing the gradient operator $D$ with forward differences gives $(DU)_{i,j}=((D_1U)_{i,j},(D_2U)_{i,j})$, where

$$(D_1U)_{i,j}=\begin{cases}u_{i+1,j}-u_{i,j},& i<n\\ 0,& i=n\end{cases}$$

$$(D_2U)_{i,j}=\begin{cases}u_{i,j+1}-u_{i,j},& j<n\\ 0,& j=n\end{cases}$$

Here the Neumann boundary condition $\frac{\partial u}{\partial n}=0$ is used at the points with $i=n$ or $j=n$, and $DU\in R^{n\times n\times 2}$. The discrete total variation can then be defined as

$$||U||_{TV}=\sum_{1\le i,j\le n}||(DU)_{i,j}|| \tag{3.11.3}$$

where $||\cdot||$ can be the $l_1$ or $l_2$ norm.
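As a concrete reading of (3.11.3), here is a small sketch (my own, assuming NumPy; `discrete_tv` is a made-up name) that forms the forward differences $D_1U$ and $D_2U$ with zeros in the last row/column and sums either the $l_2$ or the $l_1$ norm of $(DU)_{i,j}$ over all pixels.

```python
import numpy as np

def discrete_tv(U, iso=True):
    """Discrete total variation (3.11.3) of an n-by-n image U, using forward differences."""
    D1 = np.zeros(U.shape)
    D2 = np.zeros(U.shape)
    D1[:-1, :] = U[1:, :] - U[:-1, :]   # (D_1 U)_{i,j} = u_{i+1,j} - u_{i,j}, zero at i = n
    D2[:, :-1] = U[:, 1:] - U[:, :-1]   # (D_2 U)_{i,j} = u_{i,j+1} - u_{i,j}, zero at j = n
    if iso:
        return np.sqrt(D1**2 + D2**2).sum()      # isotropic: l_2 norm of each gradient
    return (np.abs(D1) + np.abs(D2)).sum()       # anisotropic: l_1 norm
```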
For any $U,V\in R^{n\times n\times 2}$, define the inner product

$$\langle U,V\rangle=\sum_{1\le i,j\le n,\,1\le k\le 2}u_{i,j,k}v_{i,j,k}.$$

Then, by definition, the discrete divergence operator $G$ must satisfy

$$\langle U,GV\rangle=-\langle DU,V\rangle,\quad \forall\, U\in R^{n\times n},\ V\in R^{n\times n\times 2},$$

where on the left-hand side the usual matrix inner product on $R^{n\times n}$ is used.
(I’m not sure what the divergence operator does here)
Writing $w_{ij}=(w_{i,j,1},w_{i,j,2})^T$ and $W=(w_{ij})_{i,j=1}^n\in R^{n\times n\times 2}$, we have

$$(GW)_{ij}=\Delta_{i,j,1}+\Delta_{i,j,2} \tag{3.11.4}$$

where

$$\Delta_{i,j,1}=\begin{cases}w_{i,j,1}-w_{i-1,j,1},& 1<i<n\\ w_{i,j,1},& i=1\\ -w_{i-1,j,1},& i=n\end{cases}$$

$$\Delta_{i,j,2}=\begin{cases}w_{i,j,2}-w_{i,j-1,2},& 1<j<n\\ w_{i,j,2},& j=1\\ -w_{i,j-1,2},& j=n\end{cases}$$
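To make sense of the divergence operator, one can implement $G$ as in (3.11.4) and verify the defining adjoint relation $\langle U,GW\rangle=-\langle DU,W\rangle$ numerically. The sketch below (my own, assuming NumPy; `grad` and `div` are made-up names) does this for random inputs.

```python
import numpy as np

def grad(U):
    """Forward-difference gradient DU, returned as an (n, n, 2) array (Neumann boundary)."""
    G = np.zeros(U.shape + (2,))
    G[:-1, :, 0] = U[1:, :] - U[:-1, :]
    G[:, :-1, 1] = U[:, 1:] - U[:, :-1]
    return G

def div(W):
    """Discrete divergence (3.11.4); by construction <U, div(W)> = -<grad(U), W>."""
    d1 = np.zeros(W.shape[:2])
    d2 = np.zeros(W.shape[:2])
    d1[0, :] = W[0, :, 0]                          # i = 1
    d1[1:-1, :] = W[1:-1, :, 0] - W[:-2, :, 0]     # 1 < i < n
    d1[-1, :] = -W[-2, :, 0]                       # i = n
    d2[:, 0] = W[:, 0, 1]                          # j = 1
    d2[:, 1:-1] = W[:, 1:-1, 1] - W[:, :-2, 1]     # 1 < j < n
    d2[:, -1] = -W[:, -2, 1]                       # j = n
    return d1 + d2

rng = np.random.default_rng(0)
U = rng.standard_normal((8, 8))
W = rng.standard_normal((8, 8, 2))
print(np.isclose((U * div(W)).sum(), -(grad(U) * W).sum()))   # expected: True
```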
After an appropriate discretization we obtain a discrete linear operator $A$ and an observed image $B$ (the notation of the continuous case is reused here, but the meaning of $A$ is now different), which yields the discrete problem

$$\min_{U\in R^{n\times n}}\ ||AU-B||_F^2+\lambda||U||_{TV} \tag{3.11.5}$$

In practice, besides the ROF model we also consider one of its variants, the TV-$L^1$ model, whose discrete form is

$$\min_{U\in R^{n\times n}}\ ||AU-B||_1+\lambda||U||_{TV} \tag{3.11.6}$$
One benefit of this model is that it handles non-Gaussian noise, such as salt-and-pepper noise, better.
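As a very rough illustration of how the discrete ROF problem (3.11.5) could be used for denoising (with $A$ the identity), the sketch below (my own; not an algorithm from the book) replaces the nonsmooth isotropic TV by the smoothed version $\sum_{i,j}\sqrt{||(DU)_{i,j}||^2+\varepsilon^2}$ and runs plain gradient descent, reusing the `grad` and `div` helpers from the previous sketch; the values of `lam`, `eps` and `step` are arbitrary assumptions. Dedicated solvers (proximal, primal-dual or ADMM type methods) are the proper tools for the nonsmooth problem.

```python
import numpy as np
# grad(U) and div(W) are the forward-difference gradient and divergence from the sketch above.

def denoise_rof_smoothed(B, lam=0.1, eps=0.1, step=0.05, iters=300):
    """Gradient descent on ||U - B||_F^2 + lam * sum_ij sqrt(||(DU)_ij||^2 + eps^2)."""
    U = B.astype(float).copy()
    for _ in range(iters):
        G = grad(U)
        norm = np.sqrt((G ** 2).sum(axis=2, keepdims=True) + eps ** 2)
        # gradient of the smoothed TV term is -div(G / norm); of the fidelity term, 2 (U - B)
        U -= step * (2.0 * (U - B) - lam * div(G / norm))
    return U
```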

3.12 Wavelet model

The wavelet model in the book is rather abstract, and many variables are used without their meaning being explained, so it is skipped here. I'll take a closer look if I need it in the future.
