Reinforcement Learning-Based Joint Cooperation Clustering and Content Caching paper reading notes

I. Introduction

Cell-free massive multiple-input multiple-output (CF-mMIMO) [1] has recently been proposed as a promising beyond-fifth-generation (B5G) wireless communication technology. In a CF-mMIMO network, a large number of access points (APs), each equipped with a single antenna and distributed over a large area, use the same time-frequency resources to jointly serve the users. This scenario is similar to operating ubiquitous small-cell base stations (BSs), but without cell boundary constraints (i.e., it is cell-free). By combining the advantages of distributed MIMO and massive MIMO, CF-mMIMO enjoys favorable propagation properties and higher energy efficiency (EE) than traditional co-located massive MIMO [1]-[3]. It requires only simple signal processing, such as local conjugate beamforming based on partial channel state information (CSI), to achieve interference mitigation [3].


In CF-mMIMO, without cell boundaries, users can freely connect to one or more APs. The task of determining the user-AP associations is called cooperation cluster formation. Dynamic cooperation clustering for CF-mMIMO systems has been studied in [3]-[6]. Ngo et al. [3] proposed a user-centric AP selection scheme to reduce the backhaul power consumption and improve the energy efficiency of the network. Riera-Palou et al. [4] used the K-means clustering algorithm to develop the optimal connectivity pattern between APs and users to minimize pilot contamination. Le et al. [5] developed a learning scheme based on K-means clustering whose goal is to maximize the spectral efficiency. Björnson and Sanguinetti [6] proposed a cooperation clustering framework that scales as the number of users increases.


Content caching is a technique that stores popular content files at the edge of the network to improve network throughput and reduce service delays. Caching policies and user association have been studied in various cache-enabled networks [7]-[13]. Wei et al. [7] considered a joint user scheduling and caching strategy to minimize transmission delays in cellular networks using deep reinforcement learning (DRL) algorithms. However, only the centralized caching capability of the macro BS is considered, and each user can only be associated with one BS. Li et al. [8] considered that all small BSs and macro BSs can cache files, aiming to find the optimal caching strategy through deep Q-learning to maximize the overall energy efficiency. However, only random user-BS associations are considered. Sadeghi et al. [9] used Q-learning to optimize caching strategies for individual cells instead of the entire network. Yang et al. [10] studied content caching in heterogeneous networks composed of BSs, relays, and device-to-device (D2D) communication devices. However, the requesting user is only served by the nearest node with the maximum received power. Jing et al. [11] considered caching and user association in cache-enabled small-cell networks and solved the problem by decomposing it into two sub-problems to optimize the effective capacity. However, under the cell restriction, each user can only be associated with one small-cell BS. Chan and Chien [12] introduced the concept of joint caching and user association design in hierarchical heterogeneous networks, where each user is only allowed to associate with a single-antenna BS. Lin et al. [13] studied caching strategies in coordinated multipoint (CoMP)-enabled cellular networks, where edge users are allowed to connect to nearby BSs but are still limited by the cellular structure.


In this work, we consider the joint cooperation clustering and content caching problem in CF-mMIMO networks, which is free of cell-structure constraints and has not been examined previously. The goal is to determine the content caching at the APs and the user-AP associations such that the network EE, defined as the achievable network sum rate divided by the network power consumption, is maximized.

Note that this problem involves trade-offs. User-AP association based on channel quality can increase the overall network rate, but may produce lower content hit rates, thus requiring additional power consumption for retrieving content from the backbone/backhaul. In contrast, user-AP association based on AP cache status can reduce the power consumption related to content retrieval, but may not provide good network rates. This motivates the joint design considered in this paper. The main contributions of the paper are summarized as follows:

  • We propose a deep deterministic policy gradient (DDPG)-based method to solve the complicated joint cooperation clustering and content caching problem in CF-mMIMO networks, for which the global optimum is hard to find in large networks. The DDPG algorithm is based on the actor-critic network architecture, where the actor is trained by the reward (the network EE) to output good policy values, i.e., the cooperation clustering and content caching strategies.

  • The proposed method demonstrates good performance, approaching the optimal exhaustive search approach in small networks and, thanks to the judicious joint design, outperforming benchmarks such as SNR-based clustering and popularity-based caching in large networks.

The rest of the paper is organized as follows. Sec. II describes the CF-mMIMO system model and the problem formulation. Sec. III presents the proposed DRL-based approach. Sec. IV provides the simulation results and discussions. Finally, Sec. V concludes the paper.




II. SYSTEM MODEL

A. Signal Model

[Fig. 1: an example topology of the considered CF-mMIMO network]


We consider a downlink cell-free massive MIMO network with $M$ APs and $K$ users (user equipments, UEs). Let $\mathcal{M}$ and $\mathcal{K}$ denote the sets of all APs and all UEs, respectively. Fig. 1 depicts an example topology. Each AP and each UE is equipped with a single antenna. Every AP is connected to the geographically nearest central processing unit (CPU) via a physical fiber link. Under time-division duplex (TDD) operation, a selected subset of APs jointly serves a selected subset of UEs using the same time-frequency resources [3]. Let the channel between the $m$th AP and the $k$th UE be $g_{mk}=\left(d_{mk}/d_{0}\right)^{-\alpha} h_{mk}$, where $d_{mk}$ is the distance between the $m$th AP and the $k$th UE, $d_{0}=\min_{m,k} d_{mk}$ is the reference distance, $\alpha$ is the path-loss exponent ($\alpha \geq 2$), and $h_{mk}\sim\mathcal{CN}(0,1)$ represents the small-scale fading. Let $\mathcal{S}_k$ be the set of APs serving the $k$th UE and $\mathcal{C}_m$ be the set of UEs served by the $m$th AP. We assume that every UE is served by no more than $L$ ($L<M$) APs (i.e., $|\mathcal{S}_k|\leq L,\ \forall k$), but not all APs necessarily serve UEs. Accordingly, the set of all serving (active) APs and the set of all served UEs are $\mathcal{S}=\bigcup_{k=1}^{K}\mathcal{S}_{k}$ and $\mathcal{C}=\bigcup_{m=1}^{M}\mathcal{C}_{m}$, respectively. Let $q_k$ be the symbol sent by the serving APs to the $k$th UE, where $\mathbb{E}\left[|q_{k}|^{2}\right]=1$, $\mathbb{E}\left[q_{k}\right]=0,\ \forall k$, and $\mathbb{E}\left[q_{k} q_{l}^{*}\right]=0,\ \forall k\neq l$ (i.e., the symbols intended for different UEs are uncorrelated). Then, the transmitted signal of the $m$th AP using conjugate beamforming is

$$x_{m}=\sum_{k \in \mathcal{C}_{m}} \sqrt{\rho_{m k}}\, \widehat{g}_{m k}^{*} q_{k}\tag{1}$$

where $\widehat{g}_{mk}$ is the estimate of the channel $g_{mk}$ and $\rho_{mk}$ is the power coefficient used by the $m$th AP for the $k$th UE. The received signal at the $k$th UE is

$$\begin{aligned} r_{k} &=\sum_{m \in \mathcal{S}} g_{m k} x_{m}+w_{k} \\ &=\sum_{m \in \mathcal{S}_{k}} g_{m k} x_{m}+\sum_{m \in \mathcal{S}_{k}^{c}} g_{m k} x_{m}+w_{k} \\ &=\sum_{m \in \mathcal{S}_{k}} \sum_{k^{\prime} \in \mathcal{C}_{m}} \sqrt{\rho_{m k^{\prime}}} g_{m k} g_{m k^{\prime}}^{*} q_{k^{\prime}}+\sum_{m \in \mathcal{S}_{k}^{c}} g_{m k} x_{m}+w_{k} \\ &=\underbrace{\sum_{m \in \mathcal{S}_{k}} \sqrt{\rho_{m k}}\left|g_{m k}\right|^{2} q_{k}}_{\text{useful signal}}+\underbrace{\sum_{m \in \mathcal{S}} \sum_{k^{\prime} \in \mathcal{C}_{m}, k^{\prime} \neq k} \sqrt{\rho_{m k^{\prime}}} g_{m k} g_{m k^{\prime}}^{*} q_{k^{\prime}}+w_{k}}_{\text{interference plus noise}} \end{aligned}\tag{2}$$

where $w_k$ is the additive noise at the $k$th UE with power $\sigma_w^2$. The achievable rate of the $k$th UE is then

$$R_{k}=\log _{2}\left(1+\frac{\left|\sum_{m \in \mathcal{S}_{k}} \sqrt{\rho_{m k}}\left|g_{m k}\right|^{2}\right|^{2}}{\sum_{k^{\prime} \in \mathcal{C}, k^{\prime} \neq k}\left|\sum_{m \in \mathcal{S}} \sqrt{\rho_{m k^{\prime}}} g_{m k} g_{m k^{\prime}}^{*}\right|^{2}+\sigma_{w}^{2}}\right).\tag{3}$$

(Reader's note: there appears to be a problem with formula (3) as given in the paper.)

The maximum achievable sum rate of the network is:

$$R_{\mathrm{sum}}=\sum_{k \in \mathcal{C}} R_{k}.\tag{4}$$
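To make (3) and (4) concrete, here is a small NumPy sketch (my own, not from the paper) that computes the per-UE rates and the network sum rate from a channel matrix, the serving sets, and the power coefficients; all names are assumptions, and treating $\rho_{mk}=0$ whenever the $m$th AP does not serve the $k$th UE is my reading of (3).

```python
import numpy as np

def sum_rate(G, rho, serving_sets, noise_power):
    """Network sum rate per (3)-(4).

    G            : complex (M, K) channel matrix, G[m, k] = g_mk.
    rho          : (M, K) power coefficients; rho[m, k] = 0 if AP m does not serve UE k.
    serving_sets : list of K sets, serving_sets[k] = S_k (indices of APs serving UE k).
    noise_power  : sigma_w^2.
    """
    M, K = G.shape
    served = [k for k in range(K) if serving_sets[k]]  # UEs in C
    active_aps = sorted(set().union(*[serving_sets[k] for k in served])) if served else []
    R = np.zeros(K)
    for k in served:
        # Useful signal: coherent combination over the serving APs S_k.
        sig = sum(np.sqrt(rho[m, k]) * abs(G[m, k]) ** 2 for m in serving_sets[k])
        # Interference: all active APs transmitting to the other served UEs.
        interf = 0.0
        for kp in served:
            if kp == k:
                continue
            term = sum(np.sqrt(rho[m, kp]) * G[m, k] * np.conj(G[m, kp]) for m in active_aps)
            interf += abs(term) ** 2
        R[k] = np.log2(1 + abs(sig) ** 2 / (interf + noise_power))
    return R.sum(), R
```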

B. Caching Model
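(The caching-model subsection is not reproduced in these notes. For the subsections below, it suffices to know that $\mathcal{F}$ denotes the library of content files, $\mathcal{F}_m \subseteq \mathcal{F}$ is the set of files cached at the $m$th AP with cache-size limit $|\mathcal{F}_m| \leq N$, and $f_k$ denotes the content file requested by the $k$th UE.)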

C. Power Consumption Model

The total network power consumption consists of three parts [14]:

  • i) the transmit power of all serving APs, including the power required to deliver content from the AP caches directly to the UEs,
  • ii) the power required by the APs to retrieve missing content files from the CPU, and
  • iii) the power required by the CPU to download the missing content from the backbone.

For i), the sum of the transmit powers of all serving APs is given by $\sum_{m \in \mathcal{S}} P_{m}$.

For ii), we assume that the power required by an AP to retrieve missing content files is proportional to the number of distinct missing files. When multiple UEs associated with an AP request the same content file, the AP only needs to retrieve the missing file from the CPU once. Let $\mathcal{F}_{m}^{\mathrm{miss}}=\left(\bigcup_{k \in \mathcal{C}_{m}}\left\{f_{k}\right\}\right) \cap \mathcal{F}_{m}^{c}$ be the set of content files that are requested by the UEs served by the $m$th AP but are not cached at the $m$th AP, and let $P_{\text{backhaul}}$ be the power required for an AP to request one missing content file from the CPU, which includes the backhaul power (the power needed to fetch the content over the AP-CPU link) and the circuit power. Then, the total power consumption for retrieving content from the CPUs is given by $P_{\text{backhaul}} \cdot \sum_{m \in \mathcal{S}}\left|\mathcal{F}_{m}^{\text{miss}}\right|$.

For iii), we assume that if two UEs associated with different APs request the same content that is not cached at their respective serving APs, the CPU only needs to download this content file from the backbone once. Let $P_{\text{backbone}}$ be the power required for the CPU to download one content file from the backbone. Then, the power consumption of the CPU for downloading content from the backbone is $P_{\text{backbone}} \cdot\left|\bigcup_{m \in \mathcal{S}} \mathcal{F}_{m}^{\text{miss}}\right|$.

Note that this assumption can be implemented in practice by caching content at the CPU, which is equipped with a cache whose storage is large enough to hold the missing content but not large enough to hold the entire library.

Summing up i)-iii), the total power consumption of the network is

$$P_{\text{total}}=\sum_{m \in \mathcal{S}} P_{m}+P_{\text{backhaul}} \cdot \sum_{m \in \mathcal{S}}\left|\mathcal{F}_{m}^{\text{miss}}\right|+P_{\text{backbone}} \cdot\left|\bigcup_{m \in \mathcal{S}} \mathcal{F}_{m}^{\text{miss}}\right|\tag{5}$$
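A minimal sketch of (5), assuming the per-AP requested files and caches are available as Python sets (the data structures and names below are my assumptions):

```python
def total_power(P_m, P_backhaul, P_backbone, requests, caches):
    """Total network power per (5).

    P_m        : dict {AP index m: transmit power} for the serving (active) APs.
    requests   : dict {AP index m: set of files requested by the UEs in C_m}.
    caches     : dict {AP index m: set of files cached at AP m (F_m)}.
    """
    active = P_m.keys()
    # Distinct missing files per AP: requested by its UEs but not in its cache.
    missing = {m: requests.get(m, set()) - caches.get(m, set()) for m in active}
    transmit = sum(P_m.values())                                   # term i)
    backhaul = P_backhaul * sum(len(missing[m]) for m in active)   # term ii)
    backbone = P_backbone * len(set().union(*missing.values()))    # term iii), union over APs
    return transmit + backhaul + backbone
```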



D. Problem Formulation

Our design goal is to jointly design the cooperation clusters, $\mathcal{S}_{1}, \mathcal{S}_{2}, \ldots, \mathcal{S}_{K}$, and the content caches, $\mathcal{F}_{1}, \mathcal{F}_{2}, \ldots, \mathcal{F}_{M}$, such that the network energy efficiency (EE), defined as $R_{\text{sum}}/P_{\text{total}}$, is maximized. Mathematically, the design problem is

$$\begin{aligned} \max_{\substack{\mathcal{S}_{1}, \mathcal{S}_{2}, \ldots, \mathcal{S}_{K} \\ \mathcal{F}_{1}, \mathcal{F}_{2}, \ldots, \mathcal{F}_{M}}}\ \ & \frac{R_{\text{sum}}}{P_{\text{total}}} \\ \text{s.t.}\ \ & \left|\mathcal{S}_{k}\right| \leq L, \ \forall k \in \mathcal{C}, \\ & \left|\mathcal{F}_{m}\right| \leq N, \ \forall m \in \mathcal{S}. \end{aligned}\tag{6}$$

Note that there are design trade-offs in solving problem (6). Designing $\mathcal{S}_{1}, \mathcal{S}_{2}, \ldots, \mathcal{S}_{K}$ purely based on channel quality may tend to associate the $k$th UE with the subset of APs $\mathcal{S}_k$ having the best channel conditions, as this may increase $R_k$ and thereby $R_{\text{sum}}$.

In contrast, designing purely based on $\mathcal{F}_{1}, \mathcal{F}_{2}, \ldots, \mathcal{F}_{M}$ may tend to associate the $k$th UE with the subset of APs $\mathcal{S}_k$ whose cache status best matches the user requests, as this may increase the hit events and thus reduce $P_{\text{total}}$.

This motivates the joint design. Furthermore, the problem becomes intractable for large networks with many APs and UEs. Therefore, we develop a cooperation clustering and content caching strategy based on reinforcement learning (RL), which is discussed in the next section.
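To see why exhaustive search quickly becomes infeasible, the following sketch counts the feasible clustering/caching combinations of problem (6); the network sizes used here are illustrative assumptions, not the values of Table 1.

```python
from math import comb

def num_combinations(M, K, F, L, N):
    """Size of the feasible set of problem (6): each UE picks at most L of M APs,
    each AP caches at most N of F files (empty selections included)."""
    per_ue = sum(comb(M, l) for l in range(L + 1))
    per_ap = sum(comb(F, n) for n in range(N + 1))
    return per_ue ** K * per_ap ** M

# Illustrative (assumed) sizes: even a modest network explodes combinatorially.
print(num_combinations(M=4, K=2, F=5, L=2, N=2))    # small enough for brute force
print(num_combinations(M=16, K=8, F=20, L=2, N=4))  # already astronomically large
```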




III. REINFORCEMENT LEARNING METHOD

In this section, we elaborate on how the joint clustering and caching problem is solved by the DDPG algorithm. We first define the three basic elements (action, state, and reward) of the RL problem under consideration. Since the dynamic evolutions of the clustering and caching strategies are particularly important when formulating the RL problem, the relevant parameters are indexed by time in the following.


A. Action, Observation State, and Reward Function

In the $t$th time slot, the agent's action $a_t$ involves both clustering and caching. Let the indicators $a_{mk,t}^{cl}\in\{0,1\}$ and $a_{mf,t}^{ca}\in\{0,1\}$ represent the association status between the $m$th AP and the $k$th UE and the caching status of the $f$th file at the $m$th AP, respectively, where "1" indicates an association or a cached file and "0" indicates the opposite.

Then, the action of the agent, $a_t$, can be defined as $a_{t}=\left\{a_{t}^{cl}, a_{t}^{ca}\right\}$, where the sets $a_{t}^{cl}=\left\{a_{mk,t}^{cl}: m \in \mathcal{M}, k \in \mathcal{K}\right\}$ and $a_{t}^{ca}=\left\{a_{mf,t}^{ca}: m \in \mathcal{M}, f \in \mathcal{F}\right\}$ contain the indicators of the overall clustering and caching results in the $t$th time slot, respectively. Note that the action $a_t$ uniquely determines the sets $\mathcal{S}_{k}$, $\mathcal{C}_{m}$, and $\mathcal{F}_{m}$, i.e., $\mathcal{S}_{k}=\left\{m: a_{mk,t}^{cl}=1, m \in \mathcal{M}\right\}$, $\mathcal{C}_{m}=\left\{k: a_{mk,t}^{cl}=1, k \in \mathcal{K}\right\}$, and $\mathcal{F}_{m}=\left\{f: a_{mf,t}^{ca}=1, f \in \mathcal{F}\right\}$.
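A small sketch (my own notation) of how the binary indicators in $a_t$ translate into the sets $\mathcal{S}_k$, $\mathcal{C}_m$, and $\mathcal{F}_m$:

```python
import numpy as np

def action_to_sets(a_cl, a_ca):
    """a_cl: (M, K) 0/1 association indicators, a_ca: (M, F) 0/1 caching indicators."""
    M, K = a_cl.shape
    S = [set(np.flatnonzero(a_cl[:, k])) for k in range(K)]   # S_k: APs serving UE k
    C = [set(np.flatnonzero(a_cl[m, :])) for m in range(M)]   # C_m: UEs served by AP m
    Fm = [set(np.flatnonzero(a_ca[m, :])) for m in range(M)]  # F_m: files cached at AP m
    return S, C, Fm
```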


The state in the considered RL problem should be the set of information collectable at the CPU that can be used to compute the reward function. In this work, the observation state in the $t$th time slot is defined as the collection of the channel gains $G_t=\left\{g_{mk,t}: m \in \mathcal{M}, k \in \mathcal{K}\right\}$, the clustering and caching actions of the previous time slot, and the file request history of each UE. The user request statistics $e_t$ are defined as $e_{t}=\left\{e_{kf,t}: k \in \mathcal{K}, f \in \mathcal{F}\right\}$, where $e_{kf,t}=\sum_{t^{\prime}=1}^{t} \mathbf{1}_{f_{k,t^{\prime}}=f}$ is the number of times the $k$th UE has requested the $f$th file up to time $t$. The observation state is expressed as

$$s_{t} \triangleq\left\{G_{t}, e_{t}, a_{t-1}\right\}\tag{7}$$
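The following sketch shows how the request statistics $e_{kf,t}$ and the observation state (7) could be assembled at the CPU; flattening the state into a single real-valued vector (and using channel magnitudes) is my implementation assumption, not something the paper specifies.

```python
import numpy as np

def update_request_stats(e_prev, current_requests):
    """e_kf,t: cumulative count of how often UE k has requested file f up to slot t.

    e_prev           : (K, F) counts up to slot t-1 (all zeros at t = 1).
    current_requests : list of length K, current_requests[k] = f_k,t (requested file index).
    """
    e = e_prev.copy()
    for k, f in enumerate(current_requests):
        e[k, f] += 1
    return e

def observation_state(G, e, a_prev):
    """s_t = {G_t, e_t, a_{t-1}} flattened into one real-valued vector for the DNNs.
    Using channel magnitudes |g_mk| instead of complex gains is an assumption here."""
    return np.concatenate([np.abs(G).ravel(), e.ravel(), a_prev.ravel()]).astype(np.float32)
```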

Following the problem objective in (6), the reward function in the $t$th time slot is defined as

$$r\left(s_{t}, a_{t}\right) \triangleq \frac{R_{\mathrm{sum}, t}}{P_{\mathrm{total}, t}}\tag{8}$$

where $R_{\mathrm{sum},t}$ and $P_{\mathrm{total},t}$ are given by (4) and (5), respectively, with the appended subscript $t$ emphasizing the dynamic behaviors. Note that the sum rate $R_{\mathrm{sum},t}$ depends on the channel conditions $G_{t}$ and the clustering result $a_{t}^{cl}$, while the total power $P_{\mathrm{total},t}$ depends on the caching result $a_{t}^{ca}$.


B. Deep Deterministic Policy Gradient Approach

Since the potentially large numbers of UEs and APs in a CF-mMIMO network lead to a considerable action space, a conventional deep Q-network (DQN) would require a long training time to learn the joint clustering and caching strategies and thus cannot effectively capture the system dynamics. In this work, the deep deterministic policy gradient (DDPG) algorithm is adopted to achieve fast and stable learning [15]. Essentially, the DDPG network follows the actor-critic approach and combines an additional target-network with the original evaluation-network to improve the convergence speed and stability.


The purpose of the actor part is to produce an action in each time slot using a deterministic policy $\mu\left(s_{t} \mid \theta^{\mu}\right)$, which is learned by a deep neural network (DNN) with weights $\theta^{\mu}$. The weights $\theta^{\mu}$ are updated to find the best deterministic policy $\mu\left(s_{t} \mid \theta^{\mu}\right)$ based on the action-value function, i.e., the expected long-term reward, defined as

$$Q\left(s_{t}, a_{t}\right) \triangleq \mathbb{E}\left[R_{t} \mid s_{t}, a_{t}\right]\tag{9}$$

where $R_t$ is the cumulative discounted future reward $R_t \triangleq \sum_{i=t}^{\infty} \gamma^{i-t} r\left(s_{i}, a_{i}\right)$ and $\gamma\in[0,1]$ is the discount factor. Typically, $\theta^{\mu}$ is updated by gradient ascent as

$$\theta^{\mu} \leftarrow \theta^{\mu}+\left.\alpha^{\mu} \nabla_{\theta} J(\theta)\right|_{\theta=\theta^{\mu}}\tag{10}$$

where $\alpha^{\mu}$ is the learning rate and

$$J(\theta) \triangleq \mathbb{E}_{s_{t}}\left[Q\left(s_{t}, \mu\left(s_{t} \mid \theta\right)\right)\right]\tag{11}$$

is the objective, with the expectation taken with respect to $s_{t}$. Note that the action-value function in (9) admits the recursive relation

$$Q\left(s_{t}, a_{t}\right)=r\left(s_{t}, a_{t}\right)+\gamma \mathbb{E}\left[Q\left(s_{t+1}, a_{t+1}\right)\right]\tag{12}$$

The critic part is responsible for providing $Q\left(s_{t}, \mu\left(s_{t} \mid \theta^{\mu}\right)\right)$ in the above formula. More specifically, the critic evaluates the action-value function $Q\left(s_{t}, \mu\left(s_{t} \mid \theta^{\mu}\right) \mid \theta^{Q}\right)$ using a separate DNN with weights $\theta^{Q}$. Typically, the weights $\theta^{Q}$ are updated using

$$\theta^{Q} \leftarrow \theta^{Q}-\left.\alpha^{Q} \nabla_{\theta} L(\theta)\right|_{\theta=\theta^{Q}}\tag{13}$$

where $\alpha^{Q}$ is the learning rate and

$$L(\theta)=\mathbb{E}\left[\left(Q\left(s_{t}, a_{t} \mid \theta\right)-y_{t}\right)^{2}\right]\tag{14}$$

is the mean-squared Bellman error loss function, where the target

$$y_{t}=r\left(s_{t}, a_{t}\right)+\gamma Q^{\prime}\left(s_{t+1}, \mu^{\prime}\left(s_{t+1} \mid \theta^{\mu^{\prime}}\right) \mid \theta^{Q^{\prime}}\right)\tag{15}$$

is adapted from the recursive relation in (12). The superscript $\prime$ in (15) indicates that the action-value function producing $y_t$ is provided by a separate target-network. In contrast, $Q\left(s_{t}, a_{t} \mid \theta\right)$ in (14) is provided by the original evaluation-network. The combination of the evaluation-network and the target-network stabilizes the DDPG algorithm [15].


Note that, in practice, the expectations in (11) and (14) can be approximated by sample means, since the exact probability distribution of $s_t$ is difficult to know; the random samples are drawn from a replay buffer $\mathcal{D}$ that stores the tuples $b_{i} \triangleq\left(s_{i}, a_{i}, r_{i}, s_{i+1}\right)$, $i=t-D+1, \ldots, t$. The replay buffer $\mathcal{D}$ has a finite size $D$, caching the $D$ most recent actions and the corresponding states and rewards.


Finally, soft updates are performed to further stabilize the target critic-network, $\theta^{Q^{\prime}} \leftarrow \tau \theta^{Q^{\prime}}+(1-\tau) \theta^{Q}$, and the target actor-network, $\theta^{\mu^{\prime}} \leftarrow \tau \theta^{\mu^{\prime}}+(1-\tau) \theta^{\mu}$, where $\tau \ll 1$. The complete DDPG procedure is summarized in Algorithm 1.


[Algorithm 1: the complete DDPG-based joint cooperation clustering and content caching procedure]
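Below is a minimal PyTorch sketch of the DDPG updates (10)-(15) described above. The hidden-layer sizes (400 and 300 neurons, tanh activations) follow Sec. IV; everything else — the sigmoid output, how the continuous actor output would be projected onto feasible binary clustering/caching actions, the omitted exploration noise, and all hyperparameter values — is my own assumption.

```python
import copy
import random
from collections import deque

import torch
import torch.nn as nn

class Actor(nn.Module):
    """mu(s | theta_mu): maps the state to per-(AP, UE) and per-(AP, file) scores in [0, 1]."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.Tanh(),
            nn.Linear(400, 300), nn.Tanh(),
            nn.Linear(300, action_dim), nn.Sigmoid(),  # projection to binary actions happens outside
        )
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Q(s, a | theta_Q)."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400), nn.Tanh(),
            nn.Linear(400, 300), nn.Tanh(),
            nn.Linear(300, 1),
        )
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def soft_update(target, source, tau=0.005):
    """theta' <- tau * theta + (1 - tau) * theta', with tau << 1 (value assumed)."""
    for tp, p in zip(target.parameters(), source.parameters()):
        tp.data.copy_(tau * p.data + (1.0 - tau) * tp.data)

def ddpg_step(actor, critic, actor_t, critic_t, opt_a, opt_q, replay, batch=64, gamma=0.99):
    """One critic update per (13)-(15) and one actor update per (10)-(11).
    Call only once len(replay) >= batch; replay holds (s, a, r, s_next) tensor tuples."""
    s, a, r, s_next = map(torch.stack, zip(*random.sample(replay, batch)))
    with torch.no_grad():
        y = r.unsqueeze(-1) + gamma * critic_t(s_next, actor_t(s_next))   # target y_t, eq. (15)
    q_loss = nn.functional.mse_loss(critic(s, a), y)                      # loss L(theta), eq. (14)
    opt_q.zero_grad(); q_loss.backward(); opt_q.step()

    a_loss = -critic(s, actor(s)).mean()        # descent on -J(theta), i.e. ascent per (10)-(11)
    opt_a.zero_grad(); a_loss.backward(); opt_a.step()

    soft_update(critic_t, critic)               # target critic-network soft update
    soft_update(actor_t, actor)                 # target actor-network soft update

# Setup sketch: target networks start as copies of the evaluation networks.
state_dim, action_dim = 128, 48                 # illustrative sizes only
actor, critic = Actor(state_dim, action_dim), Critic(state_dim, action_dim)
actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_q = torch.optim.Adam(critic.parameters(), lr=1e-3)
replay = deque(maxlen=10000)                    # replay buffer D
```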




IV. SIMULATION RESULTS

A. Simulation Settings

We consider two scenarios, as shown in Table 1, where the UEs and APs are uniformly distributed over a $1\text{ km}^2$ area and one of the APs is anchored at the reference coordinate $(0,0)$. During the evaluation phase, the UEs and APs are fixed. A content preference vector is randomly generated for each UE, and the user requests are generated from the content preference vectors following a Zipf distribution with factor $\beta=1$ [9], [10], [16]. During the evaluation phase, the user requests remain fixed. The path-loss exponent is $\alpha=2$, and the small-scale fading coefficients $h_{mk}\sim\mathcal{CN}(0,1)$ follow the time-varying model [17]

$$h_{m k}(t+1)=\sqrt{1-\epsilon^{2}} \times h_{m k}(t)+\epsilon \times n_{m k}(t)\tag{16}$$

where $n_{mk}(t)\sim\mathcal{CN}(0,1)$ and $\epsilon\in[0,1]$ is the channel variation coefficient, set to $\epsilon=0.01$ in all simulations.
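A short sketch of the time-varying channel model (16), with assumed array shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

def evolve_channels(h, eps=0.01):
    """One step of (16): h_mk(t+1) = sqrt(1 - eps^2) * h_mk(t) + eps * n_mk(t)."""
    n = (rng.standard_normal(h.shape) + 1j * rng.standard_normal(h.shape)) / np.sqrt(2)  # CN(0, 1)
    return np.sqrt(1 - eps ** 2) * h + eps * n

M, K = 4, 2                                   # illustrative sizes
h = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)
for t in range(100):
    h = evolve_channels(h, eps=0.01)          # unit variance is preserved over time
```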

We set the transmit power of all APs to $P_{m}=10\mathrm{~mW}$ and $P_{\text{backhaul}}=P_{\text{backbone}}=500\mathrm{~mW}$ [8]. The thermal noise power at each UE is given by [1]

$$\sigma_{w}^{2}=\text{Bandwidth} \times K_{B} \times T_{0} \times \text{noise figure (W)}\tag{17}$$

where the Bandwidth is set to $20\text{ MHz}$, $K_{B}=1.381\times10^{-23}$ J/K is the Boltzmann constant, $T_0=300\text{ K}$, and the noise figure is $9\text{ dB}$, which results in $\sigma_{w}^{2}=7.457\times10^{-13}\mathrm{~W}$. Both the actor and critic networks use DNNs with two hidden layers of 400 and 300 neurons, respectively, and tanh activation functions.

We compare the proposed RL method with three fixed-strategy benchmarks, referred to as BM1, BM2, and BM3, and with the optimal brute-force (BF) algorithm, which is applicable only to the small scenario (i.e., Scenario 1):

    1. BM1: SNR-based clustering policy (the $k$th UE connects to the $l\leq L$ APs with the highest $\left|g_{mk}\right|^{2}$ among all APs) and local-popularity-based caching policy (the $N$ files most popular among the UEs connected to the $m$th AP are cached at the $m$th AP); a sketch of this baseline is given after the list.
    1. BM2: SNR-based clustering policy (same as BM1) and global-popularity-based caching policy (the $N$ files most popular among all UEs are cached at all APs).
    1. BM3: cache-based clustering policy (a UE connects to the $l\leq L$ APs whose caches in the previous time slot best match the UE's file request) and global-popularity-based caching policy (same as BM2).
    1. BF: exhaustive search over all clustering/caching combinations, selecting the combination that yields the highest EE.
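As referenced in the list above, here is a sketch of the BM1 baseline (SNR-based clustering plus local-popularity caching); the array shapes and names are assumptions. BM2 follows by replacing the local popularity with the request counts summed over all UEs.

```python
import numpy as np

def bm1(G, request_counts, l, N):
    """BM1: each UE connects to the l APs with the largest |g_mk|^2; each AP caches
    the N files most popular among its own associated UEs.

    G              : complex (M, K) channel matrix.
    request_counts : (K, F) cumulative request counts e_kf (used as popularity).
    """
    M, K = G.shape
    a_cl = np.zeros((M, K), dtype=int)
    for k in range(K):
        best = np.argsort(-np.abs(G[:, k]) ** 2)[:l]       # strongest l APs for UE k
        a_cl[best, k] = 1
    a_ca = np.zeros((M, request_counts.shape[1]), dtype=int)
    for m in range(M):
        ues = np.flatnonzero(a_cl[m, :])                    # C_m
        if ues.size == 0:
            continue
        local_pop = request_counts[ues, :].sum(axis=0)      # local popularity at AP m
        a_ca[m, np.argsort(-local_pop)[:N]] = 1             # cache the N most popular files
    return a_cl, a_ca
```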

[Fig. 2: simulation results for Scenario 1]


B. Results and Discussion

Fig. 2 shows the results for Scenario 1. It can be seen that BF, the proposed RL, and BM1 ($l=1$) exhibit comparable EE performance, significantly higher than all other schemes. This difference stems from a nearly two-orders-of-magnitude difference in power consumption, caused by the different numbers of missing files and hence different values of the second and/or third terms of (5) across the schemes (Fig. 2(c)).

The proposed RL outperforms all benchmarks, and unlike these benchmarks, it does not require user preference information .

BM1 outperforms BM2 in terms of EE because, although they use the same clustering strategy and thus achieve the same sum rate, BM1's local-popularity-based caching strategy results in fewer missing files and thus lower power consumption. More specifically, in BM1, different APs can cache different files according to the file popularity within their own serving sets $\mathcal{C}_m$, whereas in BM2, all APs cache the same files based on their popularity in the entire network.

BM2 outperforms BM3 in terms of EE because, in this case, the advantage of BM2 in achieving a higher sum rate through SNR-based clustering outweighs the advantage of BM3 in achieving lower network power consumption through cache-based clustering. The fluctuations in the BM3 sum rate are due to its mechanism of randomly selecting $l$ APs among those whose caches best match the UE's file requests. Comparing BM1 ($l=1$) and BM1 ($l=2$), the EE performance of BM1 ($l=2$) degrades because it becomes increasingly difficult for an AP to match its cached files with the file requests of all associated UEs, and the resulting larger number of missing files and higher power consumption dominate the performance.

Snapshots of BF, RL, and BM1 ($l=1$) at $t=5$ are shown in Fig. 3. BF finds the strategy with the optimal EE. RL and BM1 find similar strategies, where RL prescribes an additional UE2-AP4 association, which results in slight increases in the sum rate and EE. RL may converge to a local optimum, or the optimal action may not be explored during the exploration process of the training phase, which explains the observed gap between RL and BF.




V. CONCLUSION

In this paper, we propose a DDPG-based approach to solve the joint cooperation clustering and content caching problem in cache-enabled CF-mMIMO networks. The DDPG framework follows the actor-critic network architecture to enhance the convergence speed and stability. The actor produces an action corresponding to the clustering and caching policies in each time slot, and the critic evaluates the expected cumulative reward corresponding to the action. The proposed method demonstrates verifiably near-optimal performance in small networks and outperforms the fixed-strategy benchmarks in practical, potentially large networks.
