[Operational Research Optimization] Meta-heuristic Algorithm Detailed Explanation: Simulated Annealing (SA) + Case Explanation & Code Practice


1. Introduction

Simulated Annealing (SA) is one of the simplest and most famous meta-heuristic algorithms. It is often used to solve complex black-box global optimization problems and is widely applied in real life.

The main advantage of SA is its simplicity. SA is based on an analogy with the physical annealing of materials, and it avoids the main disadvantage of Monte-Carlo methods (possible trapping in local minima) thanks to the Metropolis acceptance criterion.

When the evaluation of the objective function comes from a complex simulation process that manipulates a high-dimensional state space and requires a large amount of memory, population-based algorithms are often not applicable; simulated annealing is an effective algorithm for solving such problems.


2. Basic knowledge

In the early 1980s, three IBM researchers, Kirkpatrick et al., introduced the concept of annealing in combinatorial optimization.

These concepts are based on strong analogies to physical annealing of materials. This process involves bringing the solid into a low-energy state after raising its temperature. It can be summarized in the following two steps ( see Figure 1.1 ):

  • bring the solid to a very high temperature until its structure "melts"
  • cool the solid according to a very specific cooling scheme so as to reach a solid state of minimal energy

In the liquid state, the particles are randomly distributed. It can be shown that the minimum-energy state is reached provided the initial temperature is high enough and the cooling is slow enough. If this is not the case, the solid ends up in a metastable state with non-minimal energy; this is what happens during hardening, i.e., the sudden cooling of a solid.

[Figure 1.1: the two steps of physical annealing (image in the original post)]

Before describing the simulated annealing algorithm for optimization, we need to understand the principle of the local search optimization algorithm, of which simulated annealing is an extension.

2.1 Local search (or Monte Carlo) algorithm

These algorithms improve the objective function by exploring the neighborhood of the current point in the solution space.

In the following definitions we consider an instance $(S, f)$ of a combinatorial optimization problem, where $S$ is the set of feasible solutions and $f$ is the objective function to be minimized.

Definition 1: Let $\mathscr{N}$ be a mapping that, for every solution $i \in S$, defines a subset $S_i \subset S$ of solutions that are "close" to $i$. The subset $S_i$ is called the neighborhood of solution $i$.

In the following definitions, $\mathscr{N}$ denotes a neighborhood structure associated with $(S, f)$.

Definition 2: A generation mechanism is a means of selecting a solution $j$ in the neighborhood $S_i$ of a given solution $i$.

The local search algorithm is an iterative algorithm that starts the search from a randomly drawn feasible point in the state space.

The generative mechanism is then continuously applied by exploring the neighborhood of the current solution in order to find a better solution.

If a better solution is found, it becomes the current solution. When an improved solution cannot be found, the algorithm ends and the current solution is considered an approximate solution to the optimization problem.

For minimization problems, the algorithm can be summarized by the following pseudocode:


In tabular form, the steps are:

| Step | Operation |
| --- | --- |
| 1 | Generate an initial solution $i$ |
| 2 | Randomly select a solution $j$ in the neighborhood $S_i$ of the current solution $i$ (generate a new solution) |
| 3 | If the objective value of the new solution $j$ is smaller than that of the old solution $i$, make $j$ the current solution |
| 4 | If no solution $j$ in the neighborhood of $i$ has a smaller objective value than $i$, the algorithm terminates |
| 5 | Return to step 2 |
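
As a rough illustration of these steps, here is a minimal Java sketch of the local search loop. The generic type, the `randomNeighbor` operator, and the `maxStagnation` parameter are illustrative assumptions: the table only says to stop when no improving neighbor can be found, which is approximated here by stopping after a fixed number of consecutive non-improving draws.

```java
import java.util.function.ToDoubleFunction;
import java.util.function.UnaryOperator;

public class LocalSearch {
    /** Steps 1-5 of the table above, for a minimization problem. */
    public static <S> S minimize(S initial, ToDoubleFunction<S> f,
                                 UnaryOperator<S> randomNeighbor, int maxStagnation) {
        S current = initial;                                  // step 1: initial solution i
        double fCurrent = f.applyAsDouble(current);
        int stagnation = 0;
        while (stagnation < maxStagnation) {
            S candidate = randomNeighbor.apply(current);      // step 2: draw j in the neighborhood of i
            double fCandidate = f.applyAsDouble(candidate);
            if (fCandidate < fCurrent) {                      // step 3: accept j only if it improves f
                current = candidate;
                fCurrent = fCandidate;
                stagnation = 0;
            } else {
                stagnation++;                                 // step 4 (approximated): give up after
            }                                                 // maxStagnation non-improving draws
        }                                                     // step 5: loop back
        return current;
    }
}
```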

Definition 3: A solution $i^* \in S$ is called a local optimum with respect to $\mathscr{N}$ if $f(i^*) \le f(j)$ for all $j \in S_{i^*}$.

Definition 4: If every local optimum $i^* \in S$ with respect to $\mathscr{N}$ is also a global optimum of $(S, f)$, the neighborhood structure $\mathscr{N}$ is said to be exact.

Therefore, by definition, a local search algorithm is guaranteed to converge to a global optimum only when the neighborhood structure is exact.

This notion of an exact neighborhood is mainly theoretical, because in practice it usually amounts to an exhaustive enumeration of the search space.

Thus, if the current solution "falls" in a subdomain where the objective function is convex, the algorithm remains stuck in that subdomain unless the neighborhood structure associated with the new solution generation mechanism can reach points outside the subdomain.

To avoid getting stuck in local minima, it is necessary to define a procedure that may accept a temporary reduction in the quality of the current solution; this is the main principle of simulated annealing.

Before describing this algorithm, it is necessary to introduce the Metropolis algorithm, which is a basic component of SA.

2.2 Metropolis Algorithm

In 1953, Metropolis and his co-workers developed an algorithm for simulating the physical annealing process. Their aim was to reproduce the evolution of the physical structure of a material during annealing.

The algorithm is based on a Monte Carlo technique that involves generating a sequence of states of a solid as follows.

Starting from an initial state $i$ with energy $E_i$, a new state $j$ with energy $E_j$ is generated by modifying the position of one particle.

If the energy difference $E_i - E_j > 0$ (the new state has lower energy), then state $j$ becomes the new current state.

If the energy difference $E_i - E_j \le 0$, then state $j$ becomes the current state with a probability given by:

$$\Pr\{\text{current state} = j\} = e^{\frac{E_i - E_j}{k_b \cdot T}}$$

where $T$ is the temperature of the solid and $k_b$ is the Boltzmann constant ($k_b = 1.38 \times 10^{-23}\ \mathrm{J/K}$).

This criterion for accepting poorer solutions with a certain probability is called the Metropolis criterion .
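
A minimal Java sketch of this acceptance test is given below; the class and method names are illustrative, and the Boltzmann constant is only relevant to the physical analogy (in the optimization analogue of Section 2.3 it is dropped and $T$ is replaced by the control parameter $c$).

```java
import java.util.Random;

public class MetropolisCriterion {
    private static final double KB = 1.380649e-23;   // Boltzmann constant, in J/K
    private static final Random RNG = new Random();

    /** Returns true if the move from the current state (energy E_i) to the new state
     *  (energy E_j) is accepted at temperature T, following the Metropolis criterion. */
    static boolean accept(double currentEnergy, double newEnergy, double temperature) {
        double delta = currentEnergy - newEnergy;     // E_i - E_j
        if (delta > 0) {
            return true;                              // the new state has lower energy
        }
        return RNG.nextDouble() < Math.exp(delta / (KB * temperature));
    }
}
```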

If the cooling proceeds slowly enough, the solid reaches a state of thermal equilibrium at each given temperature $T$.

In the Metropolis algorithm, this balance is achieved by generating a large number of transitions at each temperature.

This thermal equilibrium is characterized by the Boltzmann distribution, which gives the probability that the solid, at temperature $T$, is in a state $i$ of energy $E_i$:

$$\Pr\{X = i\} = \frac{1}{Z(T)} e^{-\frac{E_i}{k_b \cdot T}}$$

where $X$ is a random variable describing the current state of the solid and $Z(T)$ is a normalization coefficient (the partition function), defined as:

$$Z(T) = \sum_{j \in S} e^{-\frac{E_j}{k_b \cdot T}}$$
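
For a small, explicitly enumerable set of states, these two formulas can be evaluated directly; the sketch below is only illustrative (the energies and the product $k_b T$ are passed in as plain numbers).

```java
public class BoltzmannDistribution {
    /** Returns Pr{X = i} = exp(-E_i / (k_b*T)) / Z(T) for each state i. */
    static double[] probabilities(double[] energies, double kbT) {
        double[] p = new double[energies.length];
        double z = 0.0;                               // partition function Z(T)
        for (int j = 0; j < energies.length; j++) {
            p[j] = Math.exp(-energies[j] / kbT);
            z += p[j];
        }
        for (int i = 0; i < p.length; i++) {
            p[i] /= z;                                // normalize by Z(T)
        }
        return p;
    }
}
```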

2.3 Simulated annealing algorithm

In the SA algorithm, the Metropolis algorithm is used to generate a sequence of solutions in the state space $S$.

To do this, use the following equivalence condition to draw an analogy between a multi-particle system and our optimization problem:

  • State space points (solutions) represent possible states of the solid
  • The function to be minimized represents the energy of the solid

A control parameter $c$, which plays the role of the temperature, is then introduced. This parameter is expressed in the same units as the objective function.

Assume also that we have given each point in the state space a neighborhood and a mechanism to generate a solution in that neighborhood. Then, we define the acceptance criterion as:

Definition 5: Let $(S, f)$ be an instance of a combinatorial minimization problem and let $i$ and $j$ be two points of the state space. The criterion for accepting solution $j$ from the current solution $i$ is given by the following probability:

$$\Pr\{\text{accept } j\} = \begin{cases} 1 & \text{if } f(j) < f(i) \\ e^{\frac{f(i)-f(j)}{c}} & \text{otherwise} \end{cases}$$

By analogy, the generation principle of neighbors corresponds to the perturbation mechanism of the Metropolis algorithm, while the acceptance principle represents the Metropolis criterion.

Definition 6 : A transition means replacing the current solution with an adjacent solution. This operation is performed in two phases: generate and accept

Next, let $c_k$ denote the value of the temperature parameter and $L_k$ the number of transitions generated at iteration $k$. The principle of SA can be summarized as follows:

[Pseudocode of the simulated annealing algorithm (image in the original post)]

One of the main features of the simulated annealing algorithm is its ability to accept poor transitions of the objective function.

At the beginning of the process, the temperature $c_k$ takes high values, so that transitions which strongly degrade the objective are accepted, allowing a thorough exploration of the state space.

As $c_k$ decreases, only transitions that improve the objective, or that degrade it only slightly, are accepted.

Finally, when $c_k$ tends to zero, no degradation of the objective is accepted, and the SA algorithm then behaves like the Monte Carlo (local search) algorithm described above.
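
The principle described above can be turned into a short Java sketch. The geometric cooling rule, the parameter names, and the fixed chain length are illustrative assumptions (the actual choices of $c_0$, $L_k$ and the decrement are discussed in Section 4); the acceptance test is the one of Definition 5, and `neighbor` plays the role of the generation mechanism of Definition 2.

```java
import java.util.Random;
import java.util.function.ToDoubleFunction;
import java.util.function.UnaryOperator;

public class SimulatedAnnealing {
    private static final Random RNG = new Random();

    /** Minimizes f starting from `initial`, using a geometric cooling scheme. */
    public static <S> S minimize(S initial, ToDoubleFunction<S> f, UnaryOperator<S> neighbor,
                                 double c0, double alpha, int chainLength, double cMin) {
        S current = initial, best = initial;
        double fCur = f.applyAsDouble(current), fBest = fCur;
        for (double c = c0; c > cMin; c *= alpha) {       // outer loop: decrease the temperature c_k
            for (int t = 0; t < chainLength; t++) {       // inner loop: L_k proposed transitions
                S cand = neighbor.apply(current);
                double fCand = f.applyAsDouble(cand);
                double delta = fCand - fCur;              // f(j) - f(i)
                if (delta < 0 || RNG.nextDouble() < Math.exp(-delta / c)) {
                    current = cand;                       // transition accepted (Definition 5)
                    fCur = fCand;
                    if (fCur < fBest) { best = current; fBest = fCur; }
                }
            }
        }
        return best;
    }
}
```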


3. Principle

This section discusses two fundamental theoretical properties of SA: statistical equilibrium and asymptotic convergence .

3.1 Statistical Equilibrium

Based on the ergodicity assumption, a particle system can be considered as a collection with observable statistical properties, and many useful quantities can be derived from equilibrium statistical systems: average energy, energy distribution, entropy.

Furthermore, if this set of particles is stationary, which is the case when a statistical equilibrium is reached, the probability density associated with the state in the equilibrium phase depends on the energy of the system.

In fact, during the equilibrium phase, the probability that the system is in a given state $i$ of energy $E_i$ is given by the Boltzmann law. Transposed to optimization, this yields the following result:

Theorem 1: For a fixed control parameter $c$, and after a sufficient number of transitions using the acceptance probability

$$P_c\{\text{accept } j \mid S_i\} = \begin{cases} 1 & \text{if } f(j) < f(i) \\ e^{\frac{f(i)-f(j)}{c}} & \text{otherwise} \end{cases}$$

the simulated annealing algorithm finds a given solution $i \in S$ with probability

$$P_c\{X = i\} = q_i(c) = \frac{1}{N_0(c)} e^{-\frac{f(i)}{c}}$$

where $X$ is a random variable representing the current state of the annealing algorithm and $N_0(c)$ is the normalization coefficient:

$$N_0(c) = \sum_{j \in S} e^{-\frac{f(j)}{c}}$$

Definition 7: Let $A$ and $B$ be two sets with $B \subset A$. The characteristic function $\kappa_{(B)}$ of $B$ is defined by:

$$\kappa_{(B)}(a) = \begin{cases} 1 & \text{if } a \in B \\ 0 & \text{otherwise} \end{cases}$$

Corollary 1: For any given solution $i$, we have

$$\lim_{c \to 0^{+}} P_c\{X=i\} = \lim_{c \to 0^{+}} q_i(c) = q_i^{*} = \frac{1}{|S_{opt}|}\, \kappa_{(S_{opt})}(i)$$

where $S_{opt}$ denotes the set of global optima.

This result guarantees the asymptotic convergence of the simulated annealing algorithm toward an element of the set of global optima, provided the stationary distribution $q_i(c),\ i \in S$ is reached for each value of $c$.

For a discrete state space, such a distribution is discrete, and the probability of reaching a particular point $x_i$ of the state space with objective value $y_i$ can be computed:

$$q_i(c) = \frac{e^{-\frac{y_i^c}{c}}}{\sum_{j \in S} e^{-\frac{y_j^c}{c}}}$$

For any positive value of $c$, the expected value at equilibrium of the objective function $f$ is denoted $\langle f \rangle_c$, and the associated variance is denoted $\sigma_c^2$.

At very high temperature $c$, the SA algorithm moves randomly through the state space.

Through the mapping $y_i^c = f(x_i)$, each point $x_i$ produced by the process is associated with an objective value $y_i$.

If the process is observed for a long time, the distribution of the objective values $y_i^c\ (i = 1, 2, \ldots, N)$ generated by the SA procedure can be established.

This distribution depends on the temperature $c$ and is denoted $q(c)$.

For large values of $c$, this distribution coincides with the distribution of the objective itself. Figure 1.2 gives an example of such a distribution.

[Figure 1.2 (image in the original post)]
The figure shows a 1D objective function; the circles represent samples generated by the SA algorithm at some high temperature $c_1$.

The horizontal dashed line shows the mean of the distribution, $\langle f \rangle_{c_1}$, and on the left the associated distribution $q(c_1)$ is drawn as a dotted curve.

For a lower temperature $c_2$, some transitions of the SA process are no longer accepted, so the associated distribution $q(c_2)$ shifts toward lower values with a lower expectation (squares on the objective function on the right, solid curve on the left).

Definition 8 : The entropy at equilibrium is

$$H_c = -\sum_{i \in S} q_i(c) \ln\left(q_i(c)\right)$$

Corollary 2 : One has

$$\frac{\partial\langle f\rangle_c}{\partial c}=\frac{\sigma_c^2}{c^2}, \qquad \frac{\partial H_c}{\partial c}=\frac{\sigma_c^2}{c^3}.$$

These last two expressions play an important role in statistical mechanics. We also derive the following expressions:

Corollary 3:

$$\begin{array}{ll} \lim_{c \to \infty}\langle f\rangle_c=\langle f\rangle_{\infty}=\frac{1}{|S|}\sum_{i \in S} f(i) & \lim_{c \to 0}\langle f\rangle_c=\langle f\rangle_0=f_{Opt} \\ \lim_{c \to \infty}\sigma_c^2=\sigma_{\infty}^2=\frac{1}{|S|}\sum_{i \in S}\left(f(i)-\langle f\rangle_{\infty}\right)^2 & \lim_{c \to 0}\sigma_c^2=\sigma_0^2=0 \\ \lim_{c \to \infty} H_c=H_{\infty}=\ln(|S|) & \lim_{c \to 0} H_c=H_0=\ln\left(\left|S_{Opt}\right|\right) \end{array}$$

where $f_{Opt}$ denotes the optimal value of $f$. The last formula expresses the third law of thermodynamics: assuming there is only one minimum-energy state, we get $H_0 = \ln(1) = 0$.

In physics, entropy measures the degree of disorder associated with a system: high entropy values ​​indicate disordered structure, while low values ​​reflect organization.

In the context of optimization, entropy can be seen as a measure of the degree of optimality reached. Over successive SA iterations, the expectation of the objective value and the entropy decrease monotonically and converge to $f_{Opt}$ and $\ln(|S_{Opt}|)$, respectively.

The derivative of the distribution $q_i(c)$ with respect to the temperature $c$ is given by:

$$\frac{\partial q_i(c)}{\partial c}=\frac{q_i(c)}{c^2}\left[\langle f\rangle_c-f(i)\right]$$

Since $\langle f\rangle_c \le \langle f\rangle_{\infty}$, three cases can occur during simulated annealing. More precisely, the following can be shown:

Corollary 4: Let $(S, f)$ be an instance of a combinatorial optimization problem with $S_{Opt} \neq S$, and let $q_i(c)$ be the stationary distribution associated with the annealing process. Then we have:

[Case distinction of Corollary 4 (formula shown as an image in the original post)]
This corollary shows that, as $c$ decreases, the probability of finding an optimal solution increases monotonically. Furthermore, for every non-optimal solution $i$ there exists a positive value $\bar{c}_i$ such that, for $c < \bar{c}_i$, the probability of finding that solution decreases as $c$ decreases.

Definition 9 : The acceptance rate associated with the simulated annealing algorithm is defined as follows

$$\chi(c)=\frac{\text{number of accepted transitions}}{\text{number of proposed transitions}}$$

As a general rule, for high values of $c$, all transitions are accepted and $\chi(c)$ is close to 1.

Then, as $c$ decreases, $\chi(c)$ decreases slowly until it reaches 0, meaning that no transitions are accepted any more.

By observing the evolution of $\langle f\rangle_c$ and $\sigma_c^2$ as functions of $c$, we notice that there exists a transition threshold (denoted $c_t$) which delimits two distinct regions of the equilibrium distribution. The threshold is the value $c_t$ such that:

$$\langle f\rangle_{c_t} \approx \frac{1}{2}\left(\langle f\rangle_{\infty}+f_{Opt}\right)$$

and

$$\sigma_c^2 \approx \sigma_{\infty}^2 \ \text{ if } c \geq c_t, \qquad \sigma_c^2<\sigma_{\infty}^2 \ \text{ if } c<c_t$$

Therefore, for any given value of $c$, the search space $S$ can be divided into two regions:

  1. Region $R_1$: as $c$ decreases, $\sigma_c^2$ remains roughly constant (close to $\sigma_{\infty}^2$)
  2. Region $R_2$: as $c$ decreases, $\sigma_c^2$ decreases

When $c$ is close to the value $c_t$, the acceptance rate is about 0.5 (i.e., $\chi(c_t) \approx 0.5$). Furthermore, it can be shown that:

  • In $R_1$, for larger values of $c$, $\langle f\rangle_c$ is linear in $c^{-1}$ and $\sigma_c^2$ is roughly constant
  • In $R_2$, for smaller values of $c$, $\langle f\rangle_c$ is proportional to $c$ and $\sigma_c^2$ is proportional to $c^2$

One can derive the following approximate models for $\langle f\rangle_c$ and $\sigma_c^2$:

[Approximate models for $\langle f\rangle_c$ and $\sigma_c^2$ (formulas shown as an image in the original post)]
where, roughly speaking, $\gamma$ is a first-order approximation of $\langle f\rangle_c$. Finally, let us introduce the specific heat, denoted $H(c)$, which is given by:

$$H(c)=\frac{d\langle f\rangle_c}{dc}=\frac{\langle f^2\rangle_c-\langle f\rangle_c^2}{k_b c^2}$$

A large value of $H(c)$ indicates that the material is starting to solidify; in this case, the rate of temperature decrease must be reduced.

3.2 Asymptotic Convergence

The simulated annealing algorithm converges in probability toward the global optimum, provided the cooling schedule is infinitely long and its decrement steps are infinitely small.

This decay scheme is purely theoretical, and one will try to approach this ideal in practice while remaining within a reasonable execution time.

Definition 10: A Markov chain is a sequence of states in which the probability of reaching a given state depends only on the previous state. Let $X(k)$ be the state reached at the $k$-th iteration. Then, for each pair of states $(i, j)$, the transition probability at iteration $k$ is given by $P_{ij}(k) = \Pr\{X(k) = j \mid X(k-1) = i\}$. The matrix $[P_{ij}(k)]$ is called the transition matrix. In the context of simulated annealing, a transition of the Markov chain corresponds to a move in the state space (generation + acceptance).

Definition 11 : The transition probability of the SA algorithm is given by:

$$\forall i, j \in S: \quad P_{ij}(k)=P_{ij}(c_k)= \begin{cases}G_{ij}(c_k)\, A_{ij}(c_k) & \text{if } i \neq j \\ 1-\sum_{l \neq i} P_{il}(c_k) & \text{if } i=j\end{cases}$$

where $G_{ij}(c_k)$ is the probability of generating state $j$ from state $i$;

$A_{ij}(c_k)$ is the probability of accepting the state $j$ generated from state $i$.

For all $i, j \in S$, $A_{ij}(c_k)$ is given by:

$$A_{ij}(c_k)=e^{-\frac{(f(j)-f(i))^{+}}{c_k}}, \quad \text{with } a^{+}= \begin{cases}a & \text{if } a>0 \\ 0 & \text{otherwise}\end{cases}$$

Theorem 2 : Assume that the following conditions are met:

$$\forall i, j \in S \quad \exists p \geq 1 \quad \exists l_0, l_1, \ldots, l_p \in S \quad \text{with } l_0=i,\ l_p=j,\ \text{and } G_{l_k, l_{k+1}}>0,\ k=0,1,\ldots,p-1.$$

Then the Markov chain has a stationary distribution, denoted $q(c)$, which is the distribution of the solutions visited by the SA algorithm at temperature $c$; its components are given by:

$$q_i(c)=\frac{1}{N_0(c)} e^{-\frac{f(i)}{c}}, \quad \forall i \in S$$

where $N_0(c)$ is the normalization coefficient. Moreover,

$$\lim_{c \to 0^{+}} \lim_{k \to \infty} \Pr\{X_k^c \in S_{Opt}\} = 1$$

where $X_k^c$ denotes the solution obtained at the $k$-th iteration for temperature $c$. This result shows that the simulated annealing algorithm converges asymptotically to one of the optimal solutions.

Theorem 3: Assume that the generation and acceptance probabilities satisfy the following assumptions:

[Assumptions $A_1$, $A_2$, $A_3$ on the generation and acceptance probabilities (shown as an image in the original post)]

Then, at any iteration $k$, there exists a stationary distribution $q(c_k)$ whose components are given by:

$$q_i(c_k)=\frac{A_{i_{Opt} i}(c_k)}{\sum_{j \in S} A_{i_{Opt} j}(c_k)} \quad \forall i \in S \text{ and } i_{Opt} \in S_{Opt}$$

Furthermore, for any $i_{Opt} \in S_{Opt}$, we have:

$$\lim_{c_k \to 0^{+}} q_i(c_k)=\frac{1}{\left|S_{Opt}\right|} \kappa_{(S_{Opt})}(i)$$

In fact, apart from the exponential form, it is difficult to find acceptance distributions satisfying assumptions $A_1$, $A_2$, and $A_3$.

The theoretical results given above cannot be applied directly to practical SA algorithms, because they assume an infinite number of iterations for each value of $c_k$, with $c_k$ decreasing continuously toward 0.

When the number of iterations per temperature step is finite, SA can be modeled by an inhomogeneous Markov chain, for which similar results can be established.

The simulated annealing algorithm converges to an optimal solution of the optimization problem, but it reaches this optimum only after an infinite number of transitions. Approximating the asymptotic behavior requires a number of iterations of the order of the cardinality of the state space, which is not realistic for NP-hard problems.

Therefore, annealing should be seen as a mechanism for approaching the global solution of combinatorial optimization problems; it can be complemented by a local search method that reaches the optimum exactly.

In other words, simulated annealing moves the search into the right domain of attraction, and the local method then finds, within this domain, the local optimum that corresponds to the global optimum of the problem.


4. Practical issues

This section surveys the following practical issues of interest to readers who wish to implement an SA algorithm for a specific problem: finite-time approximation, polynomial-time cooling, Markov chain length, stopping criteria, and simulation-based evaluation.

4.1 Finite-Time Approximation

In practice, the convergence conditions are approximated by choosing, at each iteration $k$, a relatively small decrement of the parameter $c_k$ and a sufficiently large number of transitions $L_k$ at that temperature.

Intuitively, the larger the decrement, the longer the chain must be to re-establish quasi-equilibrium (defined below). A trade-off is therefore needed between large decrements and the chain length $L_k$.

A finite-time implementation of the simulated annealing algorithm is obtained by generating homogeneous Markov chains of finite length for a finite decreasing sequence of values of the control parameter $c$:

Definition 12 : The cooling process is defined as follows

  1. A finite sequence of values of the control parameter $c$, that is:
    • an initial value $c_0$
    • a decrement function for the parameter $c$
    • a final value of $c$
  2. A finite number of transitions for each value of the control parameter, i.e., a finite length of the associated Markov chain

Definition 13: Let $\varepsilon$ be a sufficiently small positive number, $k$ a given iteration, $L_k$ the length of the $k$-th Markov chain, and $c_k$ the value of the control parameter. We say that quasi-equilibrium is reached if the probability distribution of the solutions after $L_k$ iterations of the Markov chain (denoted $a(L_k, c_k)$) is sufficiently close to the stationary distribution $q(c_k)$:

$$q_i(c_k)=\frac{1}{N_0(c_k)} e^{-\frac{f(i)}{c_k}} \quad \forall i \in S, \qquad N_0(c_k)=\sum_{j \in S} e^{-\frac{f(j)}{c_k}}$$

That is:

$$\left\|a(L_k, c_k)-q(c_k)\right\|<\varepsilon$$

Cooling processes based on the quasi-equilibrium principle rely on the following observation: when the parameter $c_k$ tends to $\infty$, the stationary distribution is the uniform distribution over the set $S$ of feasible solutions:

$$\lim_{c_k \to \infty} q(c_k)=\frac{1}{|S|} \mathbf{1}$$

where $\mathbf{1}$ is the vector of dimension $|S|$ whose components are all equal to 1.

Therefore, for a sufficiently large $c_k$, each point of the search space is visited with equal probability, and quasi-equilibrium is reached immediately, whatever the value of $L_k$.

The cooling process then consists of determining the values $(L_k, c_k)$ that lead to quasi-equilibrium at the end of each Markov chain.

There are many possible cooling processes, but the two most common are the geometric process proposed by Kirkpatrick and the polynomial-time cooling proposed by Aarts and Van Laarhoven.

4.2 Geometric Cooling

  • Initial temperature $c_0$: a preheating step is performed to find a value of $c_0$ large enough that almost all transitions are accepted during the first iterations. To find such a value, start from a small value of $c_0$ and multiply it repeatedly by a factor greater than 1 until the acceptance rate $\chi(c_0)$ is close to 1 (a code sketch follows this list).
  • Decrement of the control parameter $c$: $c_{k+1} = \alpha c_k$, where $\alpha$ is generally taken between 0.8 and 0.99.
  • Stopping criterion: the algorithm terminates when the current solution has not changed over a sufficiently large number of consecutive iterations.
  • Length of the chains: in theory, each chain should reach quasi-equilibrium. This requires a sufficient number of accepted transitions, which usually depends on the problem. Since the number of accepted transitions relative to the number of proposed transitions $L_k$ decreases over time, a bound must be imposed on the latter.
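
As announced in the first bullet, here is a sketch of the preheating step. It assumes that the objective differences $f(j) - f(i)$ of a batch of proposed transitions have been collected beforehand; the helper names and the suggested constants (e.g., a growth factor of 1.5 and a target rate of 0.95) are illustrative choices, not prescriptions from the text.

```java
import java.util.List;

public class GeometricCoolingSetup {
    /** Expected fraction of the sampled transitions accepted at temperature c. */
    static double acceptanceRate(List<Double> deltas, double c) {
        double accepted = 0.0;
        for (double d : deltas) {
            accepted += (d < 0) ? 1.0 : Math.exp(-d / c);   // acceptance probability of each move
        }
        return accepted / deltas.size();
    }

    /** Preheating: grow c0 by a factor > 1 until chi(c0) is close to 1. */
    static double preheat(List<Double> deltas, double c0, double growthFactor, double targetRate) {
        double c = c0;
        while (acceptanceRate(deltas, c) < targetRate) {
            c *= growthFactor;                              // e.g. growthFactor = 1.5, targetRate = 0.95
        }
        return c;                                           // use this value as the initial temperature
    }
}
```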

4.3 Polynomial-Time Cooling

Let's explain how to set the initial value of the temperature parameter, and how to iteratively reduce it.

4.3.1 Initial temperature

Let $m_1$ be the number of proposed transitions that strictly improve the objective value, and $m_2$ the number of remaining (deteriorating) proposed transitions. Furthermore, let $\bar{\Delta}_f^{(+)}$ be the average cost increase over all deteriorating transitions. Then the acceptance rate can be approximated by:

$$\chi(c) \simeq \frac{m_1+m_2\, e^{-\frac{\bar{\Delta}_f^{(+)}}{c}}}{m_1+m_2}$$

Solving this relation for $c$ yields equation (1.2):

$$c=\frac{\bar{\Delta}_f^{(+)}}{\ln\left(\dfrac{m_2}{m_2\, \chi(c)-m_1\,(1-\chi(c))}\right)} \qquad (1.2)$$

The suggested procedure for setting the initial temperature $c_0$ is as follows:

Initially, $c_0$ is set to 0. Then a sequence of $m_0$ transitions is generated, and the values of $m_1$ and $m_2$ are computed.

Then the value of $c_0$ is computed from equation (1.2), where the value of the acceptance rate $\chi(c)$ is set by the user. This final value of $c_0$ is used as the initial temperature of the cooling process.
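
A sketch of this rule, assuming equation (1.2) as reconstructed above, is given below; the inputs $m_1$, $m_2$ and $\bar{\Delta}_f^{(+)}$ come from the trial run of $m_0$ transitions, and the chosen acceptance rate must exceed $m_1/(m_1+m_2)$ for the logarithm to be defined.

```java
public class InitialTemperature {
    /** c_0 from equation (1.2): m1 improving moves, m2 deteriorating moves,
     *  meanPositiveDelta = average cost increase of the deteriorating moves,
     *  targetChi = acceptance rate chosen by the user (e.g. 0.9). */
    static double compute(int m1, int m2, double meanPositiveDelta, double targetChi) {
        double ratio = m2 / (m2 * targetChi - m1 * (1.0 - targetChi));   // argument of the log in (1.2)
        return meanPositiveDelta / Math.log(ratio);
    }
}
```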

4.3.2 Decrement of the control parameter

The quasi-equilibrium condition is replaced by:

$$\forall k \geq 0: \quad \|q(k)-q(k+1)\|<\varepsilon$$

Therefore, the stationary distributions for two consecutive values $c_k$ and $c_{k+1}$ of the control parameter are expected to be close to each other. This can be quantified by the following condition:

$$\forall i \in S: \quad \frac{1}{1+\delta}<\frac{q_i(c_k)}{q_i(c_{k+1})}<1+\delta \qquad (1.3)$$

where $\delta$ is a small positive number chosen a priori. The following theorem provides a condition under which equation (1.3) is satisfied.

Theorem 4: Let $q(c_k)$ be the stationary distribution of the Markov chain associated with the simulated annealing process at iteration $k$, and let $c_k$ and $c_{k+1}$ be two consecutive values of the control parameter with $c_{k+1} < c_k$. Then condition (1.3) is satisfied if:

[Condition (1.4), shown as an image in the original post]
Condition (1.4) can be rewritten as:
[Condition (1.5), shown as an image in the original post]
It can be shown that the latter condition (1.5) can be approximated as:

$$c_{k+1}=\frac{c_k}{1+\dfrac{c_k \ln(1+\delta)}{3\,\sigma_{c_k}}}$$

where $\sigma_{c_k}$ is the standard deviation of $q(c_k)$ at temperature $c_k$.

The temperature parameter $c$ is thus decremented according to the user-defined parameter $\delta$: large values of $\delta$ lead to large decreases of $c$, while small values of $\delta$ lead to small decreases of $c$.
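
A sketch of this decrement step, assuming the classical Aarts / Van Laarhoven rule reconstructed above, could look as follows; the fallback when the measured standard deviation is (numerically) zero is an illustrative choice.

```java
public class PolynomialCooling {
    /** c_{k+1} = c_k / (1 + c_k*ln(1+delta) / (3*sigma_{c_k})), where sigmaCk is the standard
     *  deviation of the objective values observed along the Markov chain at temperature c_k. */
    static double nextTemperature(double cK, double delta, double sigmaCk) {
        if (sigmaCk <= 0.0) {
            return 0.95 * cK;                 // fallback: small geometric step when the variance vanishes
        }
        return cK / (1.0 + cK * Math.log(1.0 + delta) / (3.0 * sigmaCk));
    }
}
```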

4.3.3 The length of the Markov chain

During SA cooling, the length of the Markov chain must allow a significant fraction of the neighborhood $S_i$ of a given solution $i \in S$ to be visited. The following theorem quantifies this fraction:

Theorem 5: Let $S$ be a set of cardinality $|S|$. Then the average number of elements of $S$ visited during a random walk of $N$ iterations is given by:

$$|S|\left(1-e^{-\frac{N}{|S|}}\right)$$

Therefore, if no transition is accepted and $N = |S_i|$, the fraction of solutions visited in the neighborhood $S_i$ of solution $i$ is $1-e^{-1} \simeq 2/3$.

Thus $L_k = |S_i|$ is a good choice for the number of iterations of the inner loop at iteration $k$ (i.e., at temperature $c_k$), where $|S_i|$ obviously depends on the problem and must be specified by the user.
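
For example, in the TSP case study linked at the end of this post, a 2-opt style exchange move would give a neighborhood of size about $n(n-1)/2$ for a tour over $n$ cities, so the rule $L_k = |S_i|$ translates into the following sketch (the 2-opt neighborhood is an assumption made for illustration, not something prescribed here):

```java
public class ChainLength {
    /** L_k = |S_i| for a 2-opt exchange neighborhood over an n-city tour. */
    static int forTwoOptTsp(int n) {
        return n * (n - 1) / 2;
    }
}
```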

4.3.4 Stopping Criteria

The stopping criterion is based on the quantity

$$\Delta\langle f\rangle_{c_k}=\langle f\rangle_{c_k}-f_{Opt}$$

The execution of the algorithm should terminate when $\Delta\langle f\rangle_{c_k}$ becomes "sufficiently small" relative to $\langle f\rangle_{c_0}$.

For a sufficiently high value of $c_0$, we have $\langle f\rangle_{c_0} \simeq \langle f\rangle_{\infty}$.

Moreover, for $c_k \ll 1$:

$$\Delta\langle f\rangle_{c_k} \simeq c_k \frac{\partial\langle f\rangle_{c_k}}{\partial c_k}$$

The termination of the algorithm is then determined by the condition:

$$\frac{c_k}{\langle f\rangle_{\infty}} \frac{\partial\langle f\rangle_{c_k}}{\partial c_k}<\varepsilon_S \quad \text{for } c_k \ll 1$$

where $\varepsilon_S$ is a small tolerance set by the user.
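
A sketch of this test is shown below. The derivative of $\langle f\rangle_c$ is estimated by a finite difference between the last two temperature steps, and $\langle f\rangle_{\infty}$ is approximated by the average objective value observed during the first (hottest) chain; both estimates are illustrative choices.

```java
public class StoppingCriterion {
    /** Stop when (c_k / <f>_inf) * d<f>_c/dc < epsilonS, with the derivative
     *  estimated from the mean objective values of the last two chains. */
    static boolean shouldStop(double cK, double cPrev,
                              double meanFAtCk, double meanFAtCPrev,
                              double meanFInfinity, double epsilonS) {
        double slope = (meanFAtCPrev - meanFAtCk) / (cPrev - cK);   // finite-difference d<f>_c / dc
        return (cK / meanFInfinity) * slope < epsilonS;
    }
}
```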

4.3.5 Summary

Therefore, the cooling process in polynomial time is parameterized by:

  • The initial acceptance rate $\chi(c_0)$
  • The distance between consecutive stationary distributions, controlled by the parameter $\delta$
  • The stopping criterion, controlled by the parameter $\varepsilon_S$

The number of iterations of this cooling process is finite and can be characterized by the following theorem:

Theorem 6 : Let the decrement function be given by:

[Decrement function of Theorem 6, shown as an image in the original post]

Let $K$ be the first integer for which the stopping criterion is satisfied. Then $K = O(\ln(|S|))$.

Therefore, if $\ln(|S|)$ is polynomial in the problem size (as is the case for many combinatorial optimization problems), this type of cooling leads to a polynomial execution time for the algorithm.

Each problem has an optimal annealing scheme and it is up to the user to decide which one is best for his application.

When there is no prior information about the optimal annealing scheme (which is usually the case), one should rely on the standard geometric scheme.

For the standard geometric scheme, the parameter $c_k$ evolves as $c_{k+1} = \alpha_k c_k$, and the parameters $\alpha_k$ and $L_k$ are tuned empirically on some representative instances of the problem class of interest.

This geometric approach is not optimal for all problems, but it has the advantage of being robust and ensures convergence to an approximate solution, even if it takes more time to converge than the optimal annealing scheme.

4.4 Simulation-Based Evaluation

In many optimization applications, an objective function is evaluated as a result of a computer simulation process requiring a simulated environment.

In this case, the optimization algorithm controls the vector $X$ of decision variables, which the simulation process uses to compute the performance (quality) $y$ of these decisions, as shown in Figure 1.3.

[Figure 1.3: interaction between the optimization algorithm and the simulation process (image in the original post)]

In this case, population-based algorithms may not be suitable for solving such problems, mainly when the simulated environment requires a large amount of storage space, which is often the case in today's real-life complex systems.

In fact, in the case of population-based approaches, the simulated environment must be replicated for each individual in the solution population, which can require a large amount of memory.

To avoid this shortcoming, consider using only one simulation environment, which is used each time a point in the population must be evaluated.

The first individual is considered first: the simulation environment is initialized for it and the corresponding simulation is run. The resulting performance is then passed to the optimization algorithm. Next, the second individual is evaluated, but the simulated environment must first be cleared of the events of the first simulation. The simulation is then run for the second individual, and so on, until the last individual of the population has been evaluated.

In this case, storage space is no longer an issue, but the evaluation time may be too long and the whole process is too slow because the simulation environment needs to be reset every time the evaluation is performed.

In the standard simulated annealing algorithm, each proposed transition requires a copy of the current state-space point: the point $X_j$ is generated from the current point $X_i$ via a copy in computer memory. For high-dimensional state spaces, such copying can be inefficient and can significantly degrade the performance of simulated annealing.

In this case, it is more efficient to use a return operator, which removes the effect of the generation. Let $G$ be the generation operator that transforms a point $X_i$ into $X_j$:

$$X_i \xrightarrow{\;G\;} X_j$$

The return operator is the inverse $G^{-1}$ of the generation operator.

Typically, such a generation modifies only one component of the current solution.

In this case, the vector $X_i$ can be modified in place rather than copied. Depending on the value obtained when evaluating the new point, two cases arise (a code sketch follows this list):

  1. The new solution is accepted; in this case, only the current objective function value is updated
  2. Otherwise, the return operator $G^{-1}$ is applied to the new point in order to restore the previous solution, again without any copy in memory
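
The sketch below illustrates this scheme with a simple swap move on an `int[]` solution; the move type is an assumption made for the example, not something specified in the text. Because swapping the same two positions twice undoes the move, the return operator $G^{-1}$ needs no copy of the solution.

```java
import java.util.Random;

public class InPlaceMove {
    private final Random rng = new Random();
    private int lastA, lastB;              // indices touched by the last generation G

    /** Generation operator G: modify the current solution x in place. */
    void apply(int[] x) {
        lastA = rng.nextInt(x.length);
        lastB = rng.nextInt(x.length);
        int tmp = x[lastA]; x[lastA] = x[lastB]; x[lastB] = tmp;
    }

    /** Return operator G^{-1}: undo the last move, again without copying x. */
    void revert(int[] x) {
        int tmp = x[lastA]; x[lastA] = x[lastB]; x[lastB] = tmp;
    }
}
```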

This process is summarized in Figure 1.4.
[Figure 1.4 (image in the original post)]

The return operator must be used with care, as it can easily introduce undesired distortions in the algorithm's search of the state space. For example, if some secondary evaluation variables are used and modified to compute the overall evaluation, these variables must also be restored to their initial values, so the return operator must ensure the consistency of the state space.


5. Flowchart

[Flowchart of the simulated annealing algorithm (image in the original post)]


6. Summary

  • This blog provides a detailed introduction to simulated annealing (SA), a meta-heuristic algorithm for global optimization.
  • The main advantage of SA is simplicity.
  • SA is based on an analogy to physical annealing of materials, avoiding the disadvantages of the Monte-Carlo method (possible trapping in local minima) due to the effective Metropolis acceptance criterion.
  • Population-based algorithms are not suitable when evaluating the objective function requires a complex simulation that manipulates a high-dimensional state space and a large amount of memory; simulated annealing is a good choice for such problems.

7. Case explanation & code practice

[Operational Research Optimization] SA simulated annealing algorithm to solve TSP problem + Java code implementation


Origin blog.csdn.net/weixin_51545953/article/details/130659729