RL+RA literature: Multi-Agent Deep Reinforcement Learning for Enhancement of Distributed Resource Allocation

O. Urmonov, H. Aliev, and H. Kim, ‘Multi-Agent Deep Reinforcement Learning for Enhancement of Distributed Resource Allocation in Vehicular Network’, IEEE Systems Journal, vol. 17, no. 1, pp. 491–502, Mar. 2023, doi: 10.1109/JSYST.2022.3197880.

Brief description:

This paper proposes a resource allocation algorithm based on multi-agent deep reinforcement learning (MARL) for the distributed radio resource management problem in 5G vehicular networks. Each vehicle acts as an individual agent that chooses a unique combination of transport block (TB) and transmit power to broadcast its periodic packets.

The contributions are summarized below.

1) To address non-stationarity in the multi-agent environment, training is centralized in the critic DNN and execution is decentralized in the actor DNN.

2) An effective method is proposed to reduce hidden terminal interference and provide highly reliable one-hop V2V broadcasting.

3) An LSTM is used to exploit temporal information, and training parameters are shared to speed up the overall training process.

4) A new reward function is introduced to reflect the impact of half-duplex and merging collision scenarios on V2V link capacity and packet reception ratio (PRR).

5) An efficient multi-agent deep learning method that accounts for the mobility of all agents is applied to avoid merging collisions during TB allocation.

C-V2X resource allocation mode:

In a C-V2X network, a frequency band is divided into several sub-channels, and each sub-channel represents a group of radio resource blocks (RBs) in a 1 ms subframe, as shown in Fig. 1. In the Fig. 1 example, a 10 MHz band is divided into 4 sub-channels and a full frame is divided into 20 subframes of 1 ms each, giving a set of 80 RBs. Each device selects a transport block (TB), which is a set of RBs, to transmit its data. The size of the TB varies with the length of the data in the packet. The first two RBs of the TB are dedicated to the control channel, which broadcasts the Sidelink Control Information (SCI).
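The grid arithmetic of the Fig. 1 example can be sketched as follows. This is only an illustration: indexing each RB by a (subframe, sub-channel) pair, the helper name `make_tb`, and the assumption that a TB occupies contiguous sub-channels within a single subframe are all hypothetical choices, not details taken from the paper.

```python
from itertools import product

N_SUBCHANNELS = 4   # Fig. 1 example: 10 MHz band split into 4 sub-channels
N_SUBFRAMES = 20    # Fig. 1 example: 20 subframes of 1 ms per frame

# Assumption: every RB is addressed as (subframe, sub-channel).
resource_grid = list(product(range(N_SUBFRAMES), range(N_SUBCHANNELS)))


def make_tb(subframe, first_subchannel, n_rbs):
    """Hypothetical TB made of n_rbs contiguous RBs inside one subframe.

    The first two RBs carry the SCI (as stated in the text); the rest carry data.
    """
    if n_rbs < 2:
        raise ValueError("a TB needs at least the two SCI RBs")
    if first_subchannel + n_rbs > N_SUBCHANNELS:
        raise ValueError("TB does not fit into the subframe")
    rbs = [(subframe, first_subchannel + i) for i in range(n_rbs)]
    return {"sci": rbs[:2], "data": rbs[2:]}


tb = make_tb(subframe=5, first_subchannel=0, n_rbs=3)
```

With these values the grid holds 4 x 20 = 80 RBs, matching the count in the example.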

In C-V2X Mode 1 the base station allocates resources; in Mode 2 the vehicles allocate resources themselves (Mode 2 is subdivided into four sub-modes).

1) The C-V2X standard defines medium access control modes for "in coverage" (Mode 1) and "out of coverage" (Mode 2) communication scenarios [3], [4]. In Mode 1, resource allocation is managed by the cellular network infrastructure, while in Mode 2, each vehicle chooses its own resources autonomously, as shown in Figure 2(b).

In Mode 2, a vehicle can perform long-term (window-level) or short-term (symbol-level) channel sensing [3] to find available TBs. Mode 2 is further divided into four sub-modes.

2) Mode 2 (a): Each vehicle performs window-level sensing to select an available TB.

3) Mode 2 (b): Vehicle A assists vehicle B in selecting a suitable TB.

4) Mode 2 (c): Each vehicle uses symbol-level sensing and a pre-configured sidelink grant pattern [3] to occupy an available TB.

5) Mode 2 (d): Vehicle A performs TB selection on behalf of vehicle B.

In a network of N vehicles, each vehicle receives periodic broadcast packets from its one-hop neighbors, so it can easily detect all occupied TBs within its one-hop range. Beyond the one-hop range, however, a vehicle may fail to detect a busy TB. The standard protocols [2], [3] therefore cannot fully solve the hidden terminal interference problem, because the interference range is much larger than the channel sensing range.
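A minimal sketch of the hidden terminal situation, assuming vehicles placed on a one-dimensional road and a single fixed sensing range. The positions, the 300 m range, and the function name `hidden_pairs` are hypothetical values chosen for illustration only.

```python
from itertools import combinations


def hidden_pairs(positions, sensing_range):
    """Return (tx1, tx2, rx) triples where tx1 and tx2 are both one-hop
    neighbors of rx but are too far apart to sense each other."""
    ids = sorted(positions)
    found = []
    for rx in ids:
        neighbors = [v for v in ids
                     if v != rx and abs(positions[v] - positions[rx]) <= sensing_range]
        for a, b in combinations(neighbors, 2):
            if abs(positions[a] - positions[b]) > sensing_range:
                found.append((a, b, rx))
    return found


# Hypothetical layout: A and C are 500 m apart, B sits between them.
positions_m = {"A": 0.0, "B": 250.0, "C": 500.0}
triples = hidden_pairs(positions_m, sensing_range=300.0)
```

Here A and C cannot sense each other, so both may pick the same TB and collide at B, which is the case that sensing within one hop cannot prevent.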

The C-V2X standard [4][5] provides a semi-persistent scheduling (SPS) scheme that lets vehicles select resources after long-term channel sensing. Each vehicle (as a transmitter) reserves its TB for a number of transmissions given by a predefined reselection counter (RC) and keeps sending its data in this TB, so that other vehicles can reliably infer that the TB is occupied. The transmitter also continuously senses all sub-channels to detect ongoing transmissions in other TBs. To identify free TBs, the transmitter uses a Selection Window (SW); reference [4] describes the purpose of the SW in detail. When the transmitter broadcasts 10/20/50 packets per second, the SW length can be 100/50/20 ms. Within the SW, the transmitter finds candidate TBs and adds them to a list L1. A TB is considered idle if its RSSI is below a predefined threshold. Once L1 contains at least 20% of the TBs, a second list L2 is built from the TBs of L1 with the lowest RSSI. Finally, the transmitter randomly draws a TB from L2 and uses it for the next RC cycle, after which the vehicle repeats the process to pick a new TB. SPS can effectively eliminate concurrent transmissions, i.e. access collisions, within one-hop distance, but it cannot solve the hidden node interference problem.
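The SPS selection steps described above can be sketched roughly as follows. This is an illustrative sketch, not the standard's exact procedure: the function names are hypothetical, the 20% figure is read as a fraction of all TBs in the SW, and raising the threshold in 3 dB steps when L1 is too small is an assumption that does not appear in the text above.

```python
import math
import random


def selection_window_ms(packets_per_second):
    """SW length equal to the packet interval: 10/20/50 pkt/s -> 100/50/20 ms."""
    return 1000 // packets_per_second


def sps_select_tb(rssi_by_tb, threshold_dbm, min_fraction=0.2, step_db=3.0, rng=random):
    """Pick a TB from the selection window using the L1/L2 lists.

    rssi_by_tb maps a TB identifier to its measured RSSI in dBm.
    Returns (chosen_tb, l1, l2).
    """
    if not rssi_by_tb:
        raise ValueError("empty selection window")
    n_required = max(1, math.ceil(min_fraction * len(rssi_by_tb)))
    while True:
        # L1: TBs regarded as idle because their RSSI is below the threshold.
        l1 = [tb for tb, rssi in rssi_by_tb.items() if rssi < threshold_dbm]
        if len(l1) >= n_required:
            break
        threshold_dbm += step_db  # assumption: relax the threshold and retry
    # L2: the TBs of L1 with the lowest RSSI.
    l2 = sorted(l1, key=rssi_by_tb.get)[:n_required]
    return rng.choice(l2), l1, l2


# Hypothetical measurements: TB i has an RSSI of (-100 + i) dBm.
rssi = {i: -100.0 + i for i in range(10)}
chosen, l1, l2 = sps_select_tb(rssi, threshold_dbm=-95.0, rng=random.Random(0))
```

The final random draw from L2 is what lowers the chance that two neighbors sensing the same idle TBs pick the same one.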

In recent 3GPP standard proposals [21][22], a new channel access technique based on short-term sensing, Listen Before Talk (LBT), is described as an alternative to SPS. In LBT, each vehicle performs an explicit channel assessment and possibly a random backoff before accessing the channel, so the channel sensing period may last only a few symbol times. However, the random backoff in LBT mode may seriously increase the total end-to-end delay. In particular, when the system is heavily loaded or the network is congested, the latency exceeds its budget. In addition, the hidden node problem may still degrade network performance in LBT mode. The authors therefore argue that an extensive performance analysis of LBT mode should be carried out before it is considered as a standard channel access scheme.

To maintain extremely low latency (e.g., 3 ms end-to-end latency) and high reliability (e.g., 99.999% PRR), fast retransmissions and instant access to pre-configured resources are needed. According to [22], [23], this can be achieved with the grant-free channel access mode, which relies on a pre-configured pool of two-dimensional time/frequency repetition patterns (TFRPs). In this mode, the vehicle is pre-configured with a TFRP pool and then autonomously selects a random TFRP from the pool to transmit its data. To alleviate the half-duplex constraint, the TFRP pool should be organized so that any two different TFRPs do not collide in at least one time slot. The grant-free channel access mode therefore uses more than one TB and requires repeated transmissions of the same payload to avoid the hidden node and half-duplex problems. However, this may lead to inefficient bandwidth utilization and network congestion.
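One way to read the TFRP pool condition is sketched below. The representation of a TFRP as a set of (time slot, sub-channel) pairs and the function names are hypothetical. The check also uses a slightly stricter interpretation than the sentence above: each of two patterns must have at least one time slot in which the other is silent, so that each vehicle can receive the other despite half-duplex operation.

```python
from itertools import combinations


def time_slots(pattern):
    """Time slots in which a pattern transmits."""
    return {slot for slot, _subchannel in pattern}


def can_hear_each_other(p, q):
    """True if each pattern transmits in some slot where the other is silent."""
    return bool(time_slots(p) - time_slots(q)) and bool(time_slots(q) - time_slots(p))


def pool_is_valid(pool):
    """Check the pairwise condition over the whole TFRP pool."""
    return all(can_hear_each_other(p, q) for p, q in combinations(pool, 2))


# Hypothetical pool: each TFRP repeats the payload in two of three time slots.
good_pool = [
    {(0, 0), (2, 1)},
    {(1, 0), (2, 0)},
    {(0, 1), (1, 1)},
]
# Both patterns transmit in slots 0 and 1, so neither vehicle ever hears the other.
bad_pool = [
    {(0, 0), (1, 0)},
    {(0, 1), (1, 1)},
]
```

In `good_pool` every pair overlaps in at most one slot, so each repetition pattern leaves a slot in which the other vehicle can be received.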

The model in the paper mainly targets the hidden terminal problem, so it is not covered here.
