ARiADNE: An Attention-Based Deep Reinforcement Learning Method for Autonomous Exploration


ARiADNE: A Reinforcement Learning Approach Using Attention-based Deep Networks for Exploration

Reference: Cao Y, Hou T, Wang Y, et al. ARiADNE: A reinforcement learning approach using attention-based deep networks for exploration. arXiv preprint arXiv:2301.11575, 2023.
Published at the 2023 IEEE International Conference on Robotics and Automation (ICRA 2023).

Autonomous Robotic Exploration (ARE)

ARE and Traditional Frontier-Based Methods

Autonomous robot exploration (ARE)

Goal: The goal of ARE is to plan the shortest path that completes exploration of an environment, where small noise/errors in the final map are tolerable.

Key challenges: The core of ARE is to plan a non-myopic path that balances the trade-off between exploiting the environment (i.e., refining the map in already explored areas) and exploring new areas (usually more distant ones), and, most importantly, to do so while only partially aware of the environment. Such exploration paths are usually planned incrementally online, as the partial map is updated with new measurements along the way.

Traditional frontier-based methods: These methods generate multiple candidate paths, each covering one or more frontiers (i.e., boundaries between explored free regions and unexplored regions), and greedily select the path with the greatest payoff, usually defined as a combination of utility (i.e., the number of frontiers observable along the path) and cost (i.e., path length).

Limitations: This myopic choice of frontiers does not guarantee long-term optimality. Since the environment is only partially known, a previously optimal path tends to become suboptimal as more of the environment is revealed, or worse, leads to redundant movement (e.g., backtracking because of a missed shortcut between two rooms that were not previously known to be connected).

Non-Myopic Paths

Spatial non-myopia: planners must reason about the current partial map to balance the trade-off between exploration and exploitation.

Temporal non-myopia: planners must estimate the impact of current decisions on the future (e.g., predict how the partial map may change as a result of a given planning decision).

Advantages of the Deep Reinforcement Learning Approach

Capturing dependencies in the partial map:

  The deep reinforcement learning (DRL) based ARE method, supported by the attention mechanism, allows the agent to reason about the dependencies between different regions of the partial map at different spatial scales, allowing it to efficiently sequence spatially non-myopic decisions and optimize long-horizon paths.

Estimating unexplored areas:

  By learning state values, the critic network implicitly gives the robot the ability to estimate areas that may yet be discovered, which further helps it make decisions that benefit long-term efficiency, thus addressing temporal non-myopia.

Method example:

  Autonomous exploration is first formulated as a sequential decision problem on a collision-free graph covering the known traversable region, where one node is the robot's current position. Then, an attention-based neural network selects a neighboring node of the robot's current position as the robot's next viewpoint.

Problem Definition

The environment $\mathcal{E}$ is represented by an $x \times y$ 2D occupancy grid map.

The grid map $P$ consists of an unknown area $P_u$ and a known area $P_k$; the known area is further divided into a free area $P_f$ and an obstacle area $P_o$, so that:

$$P_f \subseteq P_k, \quad P_o \subseteq P_k, \quad P_{u} \cup P_{k}=P, \quad P_{f} \cup P_{o}=P_{k}$$

In the initial state the map is completely unknown: $P = P_u$.

During exploration, sensor measurements (with sensor range $d_s$) reclassify unknown regions as free or occupied.

  The robot seeks the shortest collision-free trajectory $\psi^{*}$:

$$\psi^{*}=\underset{\psi \in \Psi}{\operatorname{argmin}}\; \mathrm{C}(\psi), \quad \text{s.t. } P_{k}=P_{g}$$

  where the cost function $\mathrm{C}: \psi \rightarrow \mathbb{R}^{+}$ maps a trajectory to its length, and $P_g$ denotes the ground-truth map.

  Although the ground truth is not available in real-world deployments, it is known in benchmark environments and can be used to evaluate the performance of the planner under test. In practice, most works consider exploration complete when the known free area is fully enclosed by occupied cells, and treat this closed set as $P_k = P_g$.

Method

Modeling Exploration as a Reinforcement Learning Problem

Problem definition:

  Since updates to the free area are driven by the robot's movement, online ARE is inherently a sequential decision-making problem. Following previous work on informative path planning [20], the robot trajectory $\psi$ is treated as a sequence $\psi = (\psi_0, \psi_1, \ldots)$, $\psi_i \in P_f$.

Sequential decision problems:

  At each decision step $t$, candidate viewpoints $V_t=\{v_0,v_1,\ldots\}$, $\forall v_i = (x_i, y_i) \in P_f$, are first uniformly distributed over the current free area $P_f$. Then, to find collision-free paths between viewpoints, each viewpoint is connected to its k nearest neighbors by straight lines, and edges that collide with occupied or unknown regions are removed. This yields a collision-free graph $G_t = (V_t, E_t)$, where $V_t$ is the set of nodes (viewpoints) uniformly distributed over the free area and $E_t$ is the set of traversable edges. Finally, the robot chooses a node adjacent to its current position as the next viewpoint. Since the next decision is made upon reaching the last selected viewpoint, the trajectory is a sequence of waypoints with $\psi_i \in V$.
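  Below is a minimal sketch of this graph-construction step. The helper names (`build_graph`, `is_collision_free`), the occupancy encoding, and the use of a KD-tree for the nearest-neighbor query are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

FREE, OCCUPIED, UNKNOWN = 0, 1, -1  # assumed occupancy encoding

def is_collision_free(grid, p, q, step=1.0):
    """True if the straight segment p -> q crosses only known free cells."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    n = max(int(np.ceil(np.linalg.norm(q - p) / step)), 1)
    for t in np.linspace(0.0, 1.0, n + 1):
        x, y = (p + t * (q - p)).astype(int)
        if grid[y, x] != FREE:  # occupied or unknown cells block the edge
            return False
    return True

def build_graph(grid, candidates, k=20):
    """Keep candidate points lying in free space and connect each to its k nearest neighbors."""
    V = [tuple(c) for c in candidates if grid[int(c[1]), int(c[0])] == FREE]
    tree = cKDTree(V)
    E = set()
    for i, v in enumerate(V):
        _, idx = tree.query(v, k=min(k + 1, len(V)))  # +1 because the query returns v itself
        for j in np.atleast_1d(idx):
            j = int(j)
            if j != i and is_collision_free(grid, V[i], V[j]):
                E.add((i, j))
    return V, E  # V_t (viewpoints) and E_t (traversable edges)
```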

Observation:

  The agent's observation is $o_t=(G_t^{\prime},\psi_t)$, where $G^\prime_t=(V_t^{\prime},E_t)$ is an augmented version of the collision-free graph $G_t$ and $\psi_t$ is the robot's current position. $G_t^{\prime}$ shares the edge set $E_t$ with $G_t$, but each augmented node $v_i^{\prime}$ contains, in addition to the coordinates $v_i=(x_i,y_i)$, a binary attribute $b_i$ indicating whether the node has already been visited by the agent, together with the utility $u_i$, so that $v_i^{\prime}=(x_i,y_i,u_i,b_i)$.

  Effect of $b_i$: the authors show experimentally that adding the binary visited bit gives the agent a better awareness of its previous movement.

  Role of $u_i$: the utility $u_i$ is the number of frontiers observable from node $v_i$; a frontier is observable if it lies within sensor range of the node and the line between them is collision-free. It is defined as

$$u_{i}=\left|F_{o, i}\right|, \quad \forall f_{j} \in F_{o, i}: \left\|f_{j}-v_{i}\right\| \leq d_{s}, \; L\left(v_{i}, f_{j}\right) \cap\left(P-P_{f}\right)=\emptyset$$

  where $F_{o,i}$ is the set of frontiers observable from $v_i$, $d_s$ is the sensor range, and $L(v_i,f_j)$ is the line segment from node $v_i$ to frontier $f_j$. Node coordinates and utilities are normalized before being fed into the neural network.

action:

  At each decision step $t$, given the agent's observation $o_t$, the attention-based neural network outputs a stochastic policy for selecting one of the neighboring nodes as the next viewpoint to visit, defined as:

$$\pi_{\theta}\left(a_{t} \mid o_{t}\right)=\pi_{\theta}\left(\psi_{t+1}=v_{i},\left(\psi_{t}, v_{i}\right) \in E_{t} \mid o_{t}\right)$$

  where $\theta$ denotes the weights of the neural network. The robot moves in a straight line to the selected viewpoint, updating the partial map with data collected along the way.

Reward:

$r_{o}=\left|F_{o, \psi_{t+1}}\right|$: the number of frontiers observed from the new viewpoint.

$r_{c}=-\mathrm{C}\left(\psi_{t}, \psi_{t+1}\right)$: a penalty equal to the distance between the previous viewpoint and the new one.

$r_{f}=\left\{\begin{array}{ll}20, & P_{k}=P_{g} \\ 0, & \text{otherwise}\end{array}\right.$: a terminal bonus awarded at the end of an episode once the exploration task is complete.

Total reward: $r_{t}\left(o_{t}, a_{t}\right)=a \cdot r_{o}+b \cdot r_{c}+r_{f}$, where $a$ and $b$ are scalar weights (reference values $a=1/50$, $b=1/64$).
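A minimal sketch of the resulting per-step reward; the completion flag and the function name are assumptions, while the coefficients are the reference values quoted above.

```python
def step_reward(num_observed_frontiers, travel_distance, exploration_done,
                a=1 / 50, b=1 / 64):
    """r_t = a * r_o + b * r_c + r_f for one decision step."""
    r_o = num_observed_frontiers            # |F_{o, psi_{t+1}}|
    r_c = -travel_distance                  # -C(psi_t, psi_{t+1})
    r_f = 20.0 if exploration_done else 0.0 # terminal bonus when P_k = P_g
    return a * r_o + b * r_c + r_f
```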

Policy Network


  The attention-based neural network, composed of an encoder and a decoder, outputs the policy $\pi_\theta$. The encoder extracts salient features from the partial map, in particular the dependencies between nodes of the augmented graph $G^{\prime}$; based on these features and the robot's current position, the decoder outputs a policy over the neighboring nodes that determines the next viewpoint to visit.

Use of Pointer Network

  Policy-based RL agents usually have a fixed action space. The decoder is instead inspired by pointer networks, which allow the action space to depend on the number of neighboring nodes given as input. This lets the network adapt naturally to collision-free graphs, where nodes can have an arbitrary number of neighbors.

Attention Layer

  The attention module is the basic building block of the network. The input of an attention layer consists of a query vector $h^q$ and key-value vectors $h^{k,v}$; its output $h_i^{\prime}$ is a weighted sum of the value vectors, with weights depending on the similarity between query and key.
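  A minimal single-head attention layer sketch in PyTorch matching this description. The single-head design and the optional mask argument are simplifying assumptions; the network in the paper may use multi-head attention.

```python
import torch
import torch.nn as nn

class AttentionLayer(nn.Module):
    """Single-head attention: output_i = sum_j w_ij * value_j, with w_ij from query/key similarity."""
    def __init__(self, d_model):
        super().__init__()
        self.wq = nn.Linear(d_model, d_model, bias=False)
        self.wk = nn.Linear(d_model, d_model, bias=False)
        self.wv = nn.Linear(d_model, d_model, bias=False)
        self.scale = d_model ** -0.5

    def forward(self, h_q, h_kv, mask=None):
        q, k, v = self.wq(h_q), self.wk(h_kv), self.wv(h_kv)
        scores = (q @ k.transpose(-2, -1)) * self.scale  # similarity between query and key
        if mask is not None:
            scores = scores.masked_fill(mask.bool(), float("-inf"))  # force w_ij = 0 on masked pairs
        w = torch.softmax(scores, dim=-1)
        return w @ v, w  # weighted sum of values, plus the attention weights
```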

Encoder

  A mask $M$ over the edge set is computed, with elements

$$m_{ij}=\left\{\begin{array}{ll}0, & \left(v_{i}, v_{j}\right) \in E_{t} \\ 1, & \left(v_{i}, v_{j}\right) \notin E_{t}\end{array}\right.$$

  The node features are then passed through a stack of attention layers (6 layers in practice) with $h^q=h^{k,v}=h^n$, where the input of each layer is the output of the previous one. By setting $w_{ij}=0$ for every pair $(i,j)$ with $m_{ij}=1$, the edge mask restricts each node to attend only to the features of its adjacent nodes. Although attention in each layer is limited to neighbors, the stacked structure still lets nodes aggregate features from non-adjacent nodes across multiple hops. The authors empirically find this structure better than graph transformers for learning pathfinding in maps cluttered with obstacles. The encoder output is referred to as the enhanced node features $\hat{h}^e$, since each updated node feature $\hat{h}_i^n$ encodes the dependencies of $v_i^{\prime}$ on the other nodes.
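  A sketch of this stacked, edge-masked encoder built on the `AttentionLayer` above; only the masking and stacking follow the description, while the residual connection and layer normalization are assumptions.

```python
class GraphEncoder(nn.Module):
    """Stack of attention layers with h^q = h^{k,v} = h^n and an edge mask M."""
    def __init__(self, d_model, n_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(AttentionLayer(d_model) for _ in range(n_layers))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_layers))

    def forward(self, h_n, edge_mask):
        # edge_mask[i, j] = 1 where (v_i, v_j) is NOT in E_t, so the attention weight is forced to 0
        h = h_n
        for layer, norm in zip(self.layers, self.norms):
            out, _ = layer(h, h, mask=edge_mask)
            h = norm(h + out)  # residual connection (assumption)
        return h               # enhanced node features \hat{h}^e
```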

Decoder

  The decoder takes as input the enhanced node features $\hat{h}^e$ and the robot's current position $\psi_t$.

  The robot's current position corresponds to a node $v_c=\psi_t$. First, the current node feature $h^c$ and the neighbor features $h^{nb}$, where $\forall \hat{h}_{i}^{nb},\left(v_{c}, v_{i}\right) \in E_{t}$, are selected from the enhanced node features. The current node feature and the enhanced node features are passed through an attention layer with $h^q=h^c$, $h^{k,v}=\hat{h}^e$; its output is concatenated with $h^c$ and projected to a $d$-dimensional vector, termed the enhanced current node feature $\hat{h}^c$. The enhanced current node feature and the neighbor features are then passed to the pointer layer, an attention layer that directly outputs its attention weights $w$, with $h^q=\hat{h}^c$, $h^{k,v}=h^{nb}$. Finally, the output of the pointer layer is used as the robot's policy, i.e. $\pi_{\theta}(a_t \mid o_t)=w_i$.
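  A sketch of this decoder built on the same `AttentionLayer`; the indexing convention and the projection size are assumptions.

```python
class PointerDecoder(nn.Module):
    """Produces a policy over the neighbors of the current node from the enhanced features."""
    def __init__(self, d_model):
        super().__init__()
        self.cross_attn = AttentionLayer(d_model)   # h^q = h^c, h^{k,v} = \hat{h}^e
        self.project = nn.Linear(2 * d_model, d_model)
        self.pointer = AttentionLayer(d_model)      # h^q = \hat{h}^c, h^{k,v} = h^{nb}

    def forward(self, h_e, current_idx, neighbor_idx):
        h_c = h_e[current_idx].unsqueeze(0)              # current node feature, shape (1, d)
        h_nb = h_e[neighbor_idx]                         # neighbor features, shape (n_nb, d)
        ctx, _ = self.cross_attn(h_c, h_e)               # attend over all enhanced node features
        h_c_hat = self.project(torch.cat([ctx, h_c], dim=-1))  # enhanced current node feature
        _, w = self.pointer(h_c_hat, h_nb)               # pointer layer: weights are the policy
        return w.squeeze(0)                              # pi_theta(a_t | o_t) over the neighbors
```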

Policy Network Training Method

  Soft Actor-Critic (SAC)

Critic Network

Training purpose: predict the state-action value.

State-action value: since the state-action value approximates the long-term return, it implicitly predicts the potential future reward (i.e., the areas that may yet be discovered). This property helps the robot make non-myopic decisions over the sequence. In practice, a critic network is trained to approximate the soft state-action value $Q_{\phi}\left(o_{t}, a_{t}\right)$, where $\phi$ denotes the weights of the critic network.

Network structure: the critic network has almost the same architecture as the policy network, except that it has no pointer layer at the end; instead, the enhanced current node feature and the neighbor features are concatenated directly and projected to the state-action values.

Training

Soft Actor-Critic

  The goal of SAC is to learn a policy that maximizes reward while making its entropy as high as possible:
$$\pi^{*}=\underset{\pi}{\operatorname{argmax}}\, \mathbb{E}_{\left(o_{t}, a_{t}\right)}\left[\sum_{t=0}^{T} \gamma^{t}\left(r_{t}+\alpha \mathcal{H}\left(\pi\left(\cdot \mid o_{t}\right)\right)\right)\right]$$

  where $\pi^{*}$ is the optimal policy, $T$ is the number of decision steps, $\gamma$ is the discount factor, and $\alpha$ is the temperature parameter that weights the entropy term against the return.

Soft state value:
$$V(o_t)=\mathbb{E}_{a_{t}}\left[Q\left(o_{t}, a_{t}\right)\right]-\alpha \log \left(\pi\left(a_{t} \mid o_{t}\right)\right)$$

Critic loss function:
$$J_{Q}(\phi)=\mathbb{E}_{o_{t}}\left[\frac{1}{2}\left(Q_{\phi}\left(o_{t}, a_{t}\right)-\left(r_{t}+\gamma \mathbb{E}_{o_{t+1}}\left[V\left(o_{t+1}\right)\right]\right)\right)^{2}\right]$$

Policy loss function:
$$J_{\pi}(\theta)=\mathbb{E}_{\left(o_{t}, a_{t}\right)}\left[\alpha \log \left(\pi_{\theta}\left(a_{t} \mid o_{t}\right)\right)-Q_{\phi}\left(o_{t}, a_{t}\right)\right]$$

Automatic temperature ($\alpha$) loss:
$$J(\alpha)=\mathbb{E}_{a_{t}}\left[-\alpha\left(\log \pi_{t}\left(a_{t} \mid o_{t}\right)+\overline{\mathcal{H}}\right)\right]$$

  where $\overline{\mathcal{H}}$ denotes the target entropy. In practice, two target critic networks are used for training.
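A sketch of these SAC losses for a discrete policy over neighboring nodes, under the common double-critic setup; the tensor shapes and the handling of the varying action set are simplified assumptions, not the authors' training code.

```python
def sac_losses(q1, q2, q1_targ_next, q2_targ_next,
               probs, log_probs, next_probs, next_log_probs,
               action, reward, done, gamma, log_alpha, target_entropy):
    """Q/prob tensors have shape [B, A]; action, reward, done have shape [B]."""
    alpha = log_alpha.exp().detach()

    # Soft state value of the next observation: V(o') = E_a[Q(o', a) - alpha * log pi(a|o')]
    with torch.no_grad():
        q_next = torch.min(q1_targ_next, q2_targ_next)
        v_next = (next_probs * (q_next - alpha * next_log_probs)).sum(dim=-1)
        td_target = reward + gamma * (1.0 - done) * v_next

    # Critic loss J_Q: squared error between Q(o_t, a_t) and the soft TD target
    q1_a = q1.gather(1, action.unsqueeze(-1)).squeeze(-1)
    q2_a = q2.gather(1, action.unsqueeze(-1)).squeeze(-1)
    critic_loss = 0.5 * ((q1_a - td_target) ** 2 + (q2_a - td_target) ** 2).mean()

    # Policy loss J_pi: E[alpha * log pi - Q], expectation taken over the discrete policy
    q_min = torch.min(q1, q2).detach()
    policy_loss = (probs * (alpha * log_probs - q_min)).sum(dim=-1).mean()

    # Temperature loss J(alpha): drives the policy entropy toward the target entropy
    entropy_gap = (probs * log_probs).sum(dim=-1) + target_entropy
    alpha_loss = -(log_alpha.exp() * entropy_gap.detach()).mean()

    return critic_loss, policy_loss, alpha_loss
```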

Training Details

Base setup

  Training was performed in the same environments as the reference below, generated by a random dungeon generator. Each environment is a 640 × 480 grid map, and the sensor range is $d_s = 80$.

F. Chen, S. Bai, T. Shan, and B. Englot, "Self-learning exploration and mapping for mobile robots via deep reinforcement learning," in AIAA SciTech 2019 Forum, 2019, p. 0396.

Collision-free graph construction process

  To construct the collision-free graph, 900 points are uniformly distributed over the entire environment, and the points lying in the known free area are used as candidate viewpoints $V$. Each viewpoint is connected to its $k = 20$ nearest neighbors; an edge is kept in the edge set $E$ only if it is collision-free.

Task Completion Judgment Criteria

  We consider the exploration task complete once more than 99% of the ground truth is detected ($|P_k|/|P_g| > 0.99$).

Parameters

  During training, the maximum episode length is set to 128 decision steps, the discount factor to $\gamma = 1$, the batch size to 256, and the episode buffer size to 10,000.
  Training starts after the buffer has collected more than 2,000 steps of data. The target entropy is set to $0.01 \cdot \log(k)$. One training iteration is performed after each completed episode. For both the policy network and the critic network we use the Adam optimizer with a learning rate of $10^{-5}$. The target critic network is updated every 256 training steps.
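  For reference, the hyperparameters listed above collected into a single dictionary (a convenience sketch, not the authors' configuration file):

```python
TRAIN_CONFIG = {
    "max_episode_steps": 128,           # maximum decision steps per episode
    "gamma": 1.0,                       # discount factor
    "batch_size": 256,
    "buffer_size": 10_000,              # episode buffer size
    "warmup_steps": 2_000,              # training starts after this many collected steps
    "target_entropy": "0.01 * log(k)",  # k = number of nearest neighbors
    "learning_rate": 1e-5,              # Adam, for both policy and critic networks
    "target_critic_update_every": 256,  # training steps between target-network updates
}
```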

Training time

  Models are trained on a workstation equipped with an i9-10980XE CPU and an NVIDIA GeForce RTX 3090 GPU. We train our models using Ray, a distributed framework for machine learning, to parallelize and speed up data collection (32 parallel instances in practice). Training takes around 24 hours to complete.

Code reference: https://github.com/marmotlab/ARiADNE


Source: blog.csdn.net/qq_38853759/article/details/132259943