Paper Reading: Exploiting Proximity-Aware Tasks for Embodied Social Navigation

Paper information

Title : Exploiting Proximity-Aware Tasks for Embodied Social Navigation
Authors : Enrico Cancelli, Tommaso Campari, et al.
Source : arXiv
Time : 2023

Abstract

Learning how to navigate among humans in closed and space-constrained indoor environments is a key ability required for embodied agents to be integrated into our society.

In this paper, we propose an end-to-end architecture that infuses the ability to infer commonsense social behaviors into reinforcement learning navigation policies, using two proximity-aware tasks called risk estimation and proximity compass.

To this end, our tasks exploit the notion of immediate and future dangers of collision.

Furthermore, we propose an evaluation protocol specifically designed for social navigation tasks in simulated environments. This is done to capture the fine-grained features and characteristics of policies by analyzing the smallest unit of human-machine spatial interaction, called an "encounter".

Introduction

As shown in Figure 1, the agent should be able to dynamically and interactively navigate the environment by avoiding static objects and moving people.
[Figure 1: the agent navigating among static objects and moving people]
In the PointGoal navigation task (which requires the agent to reach a specific location in the environment), an agent trained without any sensor/actuation noise for billions of steps achieves near-perfect performance [42]. Even in the presence of noise, other methods [45, 27] achieve impressive results.

Another related task is object-goal navigation, where an agent needs to find and navigate to a specific object given an object category. This task requires both semantic and navigation capabilities and has led to the development of modular approaches based on semantic graphs [9, 6, 32] and end-to-end trained models using reinforcement learning (RL) [46, 33].

However, all of the previously mentioned tasks assume navigation in a largely static environment.

The dynamic element introduced by sentient, moving humans in the scene forces us to rethink the way our current models are designed. A good navigation policy must not only be effective (i.e., able to achieve its goal) and efficient (i.e., able to achieve it via a near-optimal path), but also safe (i.e., reach the destination without harming others). This social element is captured by the social navigation task (SocialNav) [44, 29], where an agent has to solve PointGoal navigation in a simulated indoor environment populated by moving humans.

To solve this task, Yokoyama et al. [47] introduced a simple but very effective model, which won first place in the iGibson 2021 SocialNav Challenge. However, this method does not explicitly encode any social behavior in its navigation strategy.

We define a fine-grained evaluation protocol for the SocialNav task to better analyze agent performance during human-agent interactions. Our evaluation protocol is inspired by similar attempts in robotics [30], which consist of collecting statistics (via questionnaires) on specific types of interactions between humans and robots.

We propose an automatic evaluation based on recognizing and characterizing encounters between humans and agents. To this end, we extract short subsequences in which interaction with humans emerges as the dominant factor influencing navigation, and we establish a set of rules for classifying each encounter according to the type of human-agent spatial relationship over time. Finally, we also introduce an episodic dataset based on HM3D [31] for embodied social navigation, to evaluate agents in diverse environments.

In summary, the contributions of this work are threefold:
(1) A novel architecture for embodied social navigation based on proximity-aware tasks; we demonstrate the effectiveness of the model on two public datasets.
(2) A novel encounter-based evaluation protocol for analyzing social navigation models.
(3) A set of episodes based on the HM3D dataset (referred to as HM3D-S) for evaluating embodied social navigation.

Related Work

Embodied Navigation

There is growing interest in studying embodied indoor navigation [14]. This is largely due to large-scale datasets of 3D indoor environments [8, 36, 31] and to simulators that allow researchers to simulate navigation within these 3D environments [35, 36, 21].

To study embodied AI, many tasks have been proposed [1], including: PointGoal navigation [42], ObjectGoal navigation [4], embodied question answering [41], and vision and language navigation (VLN) [2, 22].

These tasks involve agents operating in static, single-agent environments. To address them, both modular approaches [7, 10, 9, 32], which utilize SLAM, path planning, and exploration strategies, and end-to-end RL training strategies [40, 27, 45, 6, 46], which do not rely on any explicit map, have been proposed.

Socially-Aware Navigation

Work on collision-free multi-agent navigation [5, 39, 13, 23] and navigation in dynamic environments [3] has been extended to navigation in the presence of humans [19, 16, 12, 24, 11].

Chen et al. [12] employ collision avoidance algorithms such as CADRL [13] and introduce common-sense social rules to reduce uncertainty while minimizing the risk of the freezing robot problem [38].

Other works [16, 11] have attempted to model human-robot interactions using techniques such as spatio-temporal graphs [24]. These methods are usually tested in minimalist simulated environments that provide full knowledge of the scene with simple obstacles, and they often assume cooperation among mobile agents.

Egocentric Visual SocialNav

A growing body of work has begun to investigate SocialNav from egocentric visual data.

Large-scale datasets of egocentric socially compliant demonstrations [20] and of trajectories of agents navigating social environments [34, 25] enable training social navigation agents via imitation learning.

As useful as large datasets are, it is still valuable to train agents in simulation, where corner cases that are hard to collect in the real world can be observed. Through simulation, agents are typically trained using imitation learning [37], end-to-end reinforcement learning [47], or a combination of planners and RL [29].

Evaluation of SocialNav

Previous work on SocialNav [17] is often evaluated using overall success rate, path efficiency, collision avoidance, or similarity to human paths.

Success rate alone is limited: since the agent is dealing with humans, it is preferable to navigate safely and avoid collisions, even if that means a lower success rate.

In the iGibson challenge, SocialNav is also evaluated using STL (Success weighted by Time Length) and PSC (Personal Space Compliance), the percentage of time steps in which the agent stays more than a threshold distance (0.5 m) away from pedestrians.
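As a concrete reference, PSC reduces to a simple per-step check. A minimal sketch, assuming we log the per-step minimum agent-pedestrian distance (the 0.5 m threshold follows the challenge definition above):

```python
def psc(min_dists, threshold=0.5):
    """Personal Space Compliance: percentage of time steps in which the
    agent stays more than `threshold` meters from every pedestrian.
    `min_dists` holds the per-step minimum agent-pedestrian distance."""
    return 100.0 * sum(d > threshold for d in min_dists) / len(min_dists)
```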

In contrast to these metrics, we propose a fine-grained evaluation based on the type of encounter the agent has with the pedestrian.

Background: the SocialNav Task

In Embodied Social Navigation [44, 29, 47], as in PointGoal Navigation, the agent's goal is to reach a goal location, but a collision with a human constitutes a failure and immediately terminates the episode.

An episode $e$ is represented by the agent trajectory $\alpha$, the tuple of human trajectories $(p^i)$, and the goal $g$. The agent trajectory $\alpha$ is the sequence of positions and rotations of the agent from the beginning to the end of the episode. Formally, $\alpha = \{\alpha_t\}_{t \in [0, t_{end}]}$, where $\alpha_t \in SE(2)$ is the agent's 2D translation and rotation relative to the origin at time $t$.

Similarly, the trajectory of the $i$-th human in the same episode is the sequence of that human's positions and rotations. Formally, $p^i = \{p^i_t\}_{t \in [0, t_{end}]}\ \forall i \in P$, with $p^i_t \in SE(2)$. In our simulation, each person's movement is constrained by a start point and an end point, and the person moves back and forth between these two points along the shortest path.

The goal $g \in G$ is specified by a 2D position in world coordinates. At every time step, the agent must provide an action $(lin\_vel, ang\_vel) \in [-1, +1]^2$ specifying normalized linear forward velocity and normalized clockwise angular velocity (where $+1$ is the maximum forward/clockwise velocity and $-1$ the maximum backward/counterclockwise velocity). The stop action is triggered automatically when the agent is within 0.2 m of the goal point. The agent has 500 actions (or steps) to reach the goal location. If it collides with a human, the episode ends immediately.
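To make the episode rules concrete, here is a minimal, hypothetical rollout loop under these termination conditions. `env` and `policy` stand in for a simulator wrapper and a trained policy; the helper names (`collided_with_human`, `distance_to_goal`, etc.) are assumptions, not the paper's API:

```python
import numpy as np

MAX_STEPS = 500        # action budget per episode
SUCCESS_RADIUS = 0.2   # meters; stop is triggered automatically

def run_episode(env, policy):
    """Roll out one SocialNav episode under the termination rules above."""
    obs = env.reset()
    for _ in range(MAX_STEPS):
        lin_vel, ang_vel = np.clip(policy.act(obs), -1.0, 1.0)
        obs = env.step(lin_vel, ang_vel)
        if env.collided_with_human():          # collision ends the episode
            return "failure: human collision"
        if env.distance_to_goal() < SUCCESS_RADIUS:
            return "success"
    return "failure: step budget exhausted"
```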

Method

Figure 2 shows the outline of our framework.

It consists of two main modules: (i) proximity feature extraction, and (ii) the policy architecture. The proximity feature extraction module refines the proximity information obtained from the simulator into features that each describe some aspect of social interactions (ground-truth proximity features).

The policy architecture extracts embeddings from the RGB-D and GPS+Compass sensors, which serve as input to our proximity-aware tasks. The tasks refine this embedding and create $n$ task embeddings (one for each task), which are then fused together via state attention. Actions are sampled from the state-attention output.

[Figure 2: outline of the framework]

Policy Architecture

Our policy network consists of the following modules:
i) two encoders (a visual backbone and a position encoder) that create embeddings from the RGB-D and GPS+Compass sensors;
ii) a recurrent state encoder that accumulates these embeddings over time;
iii) a state attention module that fuses the outputs of these units via an attention mechanism to produce the actions the robot has to perform.

We use a CNN (the visual backbone) $f(\cdot)$ to encode each RGB-D frame $x_t$ into a visual embedding $\phi^v_t = f(x_t)$. A linear layer $g(\cdot)$ encodes the agent's position and rotation $\alpha_t$, producing the embedding $\phi^p_t = g(\alpha_t)$. The outputs of these two encoders are then concatenated into the final embedding $\phi^f_t = \phi^v_t \oplus \phi^p_t$.

To accumulate embeddings over time, we follow the PointNav design of Ye et al. [45] and implement our state encoder as a stack of parallel recurrent units. At each time step, each unit receives the input $\phi^f_t$ and outputs its internal state, called a belief.

The key idea behind having multiple beliefs is that each recurrent unit can focus on a specific aspect of navigation.
The final decision about which action the robot should take is sampled by weighting each belief according to the situation. All beliefs $B$ are therefore fused by a state attention module, which computes the mean $\vec{\mu_t}$ and standard deviation $\vec{\sigma_t}$ of the normal distribution from which we sample actions.

Formally, we define $\{RU^{(i)}\}_{\forall i \in B}$ as the set of recurrent units; the belief encoding $h_t$ is defined as:

$h^{(i)}_t = RU^{(i)}(\phi^f_t, h^{(i)}_{t-1}) \quad \forall i \in B$
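As an illustration, here is a minimal PyTorch sketch of the stacked recurrent units and the state-attention fusion. The dimensions follow the implementation details reported below (512-d visual embedding, 32-d position embedding, 512-d GRUs); the attention scoring and the action head are assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class MultiBeliefEncoder(nn.Module):
    """Stack of parallel recurrent units (one belief each) plus a
    state-attention fusion producing the action distribution."""
    def __init__(self, input_dim=544, hidden=512, num_beliefs=2):
        super().__init__()
        # one recurrent unit RU^(i) per belief, all reading phi_t^f
        self.units = nn.ModuleList(
            [nn.GRUCell(input_dim, hidden) for _ in range(num_beliefs)])
        # situation-dependent attention weights over beliefs (assumed form)
        self.scorer = nn.Linear(input_dim, num_beliefs)
        # head producing mean and log-std for (lin_vel, ang_vel)
        self.action_head = nn.Linear(hidden, 4)

    def forward(self, phi_f, beliefs):
        # phi_f: (B, input_dim); beliefs: list of (B, hidden) hidden states
        beliefs = [ru(phi_f, h) for ru, h in zip(self.units, beliefs)]
        w = torch.softmax(self.scorer(phi_f), dim=-1)             # (B, n)
        fused = sum(w[:, i:i + 1] * b for i, b in enumerate(beliefs))
        mu, log_sigma = self.action_head(fused).chunk(2, dim=-1)
        return mu, log_sigma.exp(), beliefs
```

Actions would then be sampled from `Normal(mu, sigma)` at every step, matching the Gaussian policy described above.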

Proximity-Aware Tasks

With multiple beliefs, we can inject different signals into the embeddings, for example, the social dynamics occurring in a given episode.
To this end, during training, we condition each belief with a unique auxiliary loss, which is jointly optimized with action and value losses during the optimization step of the policy network.

This is done by processing each belief, together with a specific type of proximity feature, through a regressor network (see Figure 3) that computes our proximity-aware task predictions. Each auxiliary task is responsible for predicting the proximity features over the time horizon $[t, t+k]$, given the corresponding belief $h^{(i)}_t$ and the sequence of actions $\{a_j\}_{j \in [t, t+k]}$, where $k$ is the number of future frames to predict. Formally, for a given proximity-feature sequence $\{s_j\}_{j \in [t, t+k]}$, the task aims to optimize the following auxiliary loss:
[Equation: regression loss between the predicted and ground-truth proximity features $\{s_j\}_{j \in [t, t+k]}$]

We design two types of proximity tasks, corresponding to two social features:
(i) risk estimation
(ii) proximity compass

The advantage of our design is that it can be easily extended to other, potentially more complex, social tasks and is also compatible with the general self-supervised tasks used in Ye et al. [45] (e.g., CPC|A [18] or ID [28, 46]). To exploit different proximity features, we extract the relative position of each person from the simulator. We refer to this data as proximity information:
[Equation: definition of the proximity information $SI_t$, the set of relative positions of all people with respect to the agent at time $t$]

Risk Estimation

Risk estimation is a proximity sensing task designed to handle short-range social interactions to inform agents of impending collision hazards.

Given $SI_t$, we define the risk value as a scalar representing how close the agent is to the nearest person, within a maximum distance $D_r$. The value ranges from 0 (the nearest person is farther than $D_r$) to 1 (the agent is colliding with the person). Formally:
[Equation: $risk_t$, the risk value defined from $SI_t$ and $D_r$]
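A minimal sketch of such a feature, assuming a clipped-linear ramp between the two endpoints described above (0 beyond $D_r$, 1 at contact); the exact functional form, the Euclidean distance, and the default $D_r$ are assumptions for illustration:

```python
import numpy as np

def risk(agent_xy, people_xy, D_r=1.0):
    """Scalar risk in [0, 1]: 0 when the nearest person is farther than
    D_r, rising to 1 as the agent reaches a collision."""
    if len(people_xy) == 0:
        return 0.0
    d_min = min(float(np.linalg.norm(np.asarray(p) - np.asarray(agent_xy)))
                for p in people_xy)
    return (D_r - min(d_min, D_r)) / D_r
```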

Proximity Compass

The proximity compass is a proximity-aware task that models the longer-range part of social dynamics. This feature not only captures social interactions within a larger radius $D_c > D_r$ around the agent, but also captures faint indications of the direction from which a person might be coming. Just as humans can guess a person's whereabouts from previous observations, partial knowledge of the environment, and the person's trajectory, we want our agents to acquire similar knowledge during training.

This information is encoded by the proximity compass. In the compass, north represents the direction the agent is facing, and the plane is divided into a finite number of non-overlapping sectors. For each person $i \in P$, $\theta_{a \to i}$ denotes the angle between the segment connecting the agent to the person and the compass's north. These angles assign each person to a particular sector. We then compute a risk value for each sector from the people that fall within it. The entire compass is represented as a vector by unrolling the sequence of sectors clockwise from north. Formally, if we have $k$ equal sectors, the vector $comp_t \in \mathbb{R}^k$ is defined as:
[Equation: the compass vector $comp_t$, one risk value per sector]
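A sketch of the compass feature under the same assumptions as the risk sketch above; aggregating people within a sector by taking the maximum risk, the angle convention, and the default $D_c$ are all assumptions for illustration:

```python
import numpy as np

def proximity_compass(agent_xy, agent_heading, people_xy, D_c=2.0, k=8):
    """Compass vector comp_t in R^k. Sector 0 starts at the agent's
    heading ("north") and sectors proceed clockwise; each entry holds a
    risk value (radius D_c) for the people falling in that sector."""
    comp = np.zeros(k)
    for p in people_xy:
        delta = np.asarray(p, dtype=float) - np.asarray(agent_xy, dtype=float)
        d = float(np.linalg.norm(delta))
        # clockwise angle of the agent->person segment w.r.t. the heading
        theta = (agent_heading - np.arctan2(delta[1], delta[0])) % (2 * np.pi)
        sector = min(int(theta / (2 * np.pi / k)), k - 1)
        comp[sector] = max(comp[sector], (D_c - min(d, D_c)) / D_c)
    return comp
```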

Implementation details

The visual backbone used in our experiments is a modified version of ResNet-18 for RGB-D frames. We use the same implementation as [47], which produces an output embedding of size 512. For the agent's position encoder, we use a linear layer with an output size of 32. Our recurrent state encoder is implemented as a stack of single-layer GRUs, each with 512 hidden dimensions. The regressors for both proximity tasks are implemented as single-layer GRUs with a hidden size of 512. The action sequence $\{a_i\}_{i \in [t, t+k]}$ is passed through an embedding layer and then fed as input to the GRU, whose initial hidden state is initialized to $h^{(i)}_t$; $k$ is set to 4. See Figure 3 for the complete scheme.
[Figure 3: structure of the proximity-task regressors]
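Following these details, a minimal PyTorch sketch of one proximity-task regressor might look as follows; the way the continuous actions are embedded (a linear layer here) and the output head are assumptions:

```python
import torch
import torch.nn as nn

class ProximityRegressor(nn.Module):
    """Regressor for one proximity-aware task: a single-layer GRU
    (hidden size 512) initialized with the belief h_t^(i), consuming the
    embedded action sequence a_t..a_{t+k} and predicting a proximity
    feature per step (feat_dim=1 for risk, one value per sector for the
    compass)."""
    def __init__(self, action_dim=2, embed_dim=32, hidden=512, feat_dim=1):
        super().__init__()
        self.action_embed = nn.Linear(action_dim, embed_dim)  # assumed form
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, feat_dim)

    def forward(self, actions, belief):
        # actions: (B, k+1, action_dim); belief: (B, hidden)
        emb = self.action_embed(actions)
        out, _ = self.gru(emb, belief.unsqueeze(0).contiguous())
        return self.head(out)  # (B, k+1, feat_dim) predicted features
```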
Each model is trained with the DD-PPO [42] algorithm on 3 Nvidia A5000 GPUs, with 8 parallel environments per machine. Our reward function is as follows:
[Equation: reward function used for training]

An evaluation protocol for SocialNav

We introduce an evaluation protocol to analyze the performance of navigation agents on the SocialNav task. We classify encounters between an agent and other pedestrians in the environment into four categories (frontal, intersection, blind corner, follower) and propose a set of metrics to analyze different types of encounters.

Encounter
We formally define an encounter between the agent and a specific pedestrian $i$ in episode $e$ as a pair of subsequences of the trajectories $\alpha$ and $p^i$ over a time range $[t_1, t_2] \subseteq [0, t_{end}]$ satisfying the following constraints:
• Time constraint: the time range $[t_1, t_2]$ is longer than a threshold $T_{min}$;
• Spatial constraint: the geodesic distance between the agent and person $i$ is less than a threshold $D_{max}$ for all $t \in [t_1, t_2]$;
• Heading constraint: during the first $T_{front}$ time steps, person $i$ is located in front of the agent; that is, given the agent's heading angle $\theta^a_t$, the angle $\theta^{a \to i}_t$ of the segment connecting the agent and person $i$, and a threshold $\theta_{max}$, we have $|\theta^a_t - \theta^{a \to i}_t| \le \theta_{max}$ for all $t \in [t_1, t_1 + T_{front}]$.

Inclusion rules
To characterize encounters between the agent and humans, we design a set of heuristics called inclusion rules (IRs).
Each IR is defined in terms of the following quantities:
• $\Delta^i_t, \Delta^a_t$: approximations of the general direction of the trajectories of person $i$ and the agent over the time range $[t_1, t]$, where $t \in [t_1, t_2]$ and $t_1, t_2$ are the start and end time steps of the encounter;
• $intersect$: a binary value, 1 if the paths of the robot and the pedestrian intersect, and 0 otherwise;
• $blind(t)$: a binary, time-dependent value indicating whether the agent can see the person at time step $t$;
• $d\_diff(t)$: the difference between the geodesic distance and the Euclidean distance between the agent and the person.

Encounter classification
We define four classes of encounters (inspired by Pirk et al. [30]), and their respective inclusion rules (see Figure 4):

• Frontal approach: the robot and the human come from opposite directions along approximately parallel trajectories. The agent should therefore deviate slightly to avoid a head-on collision. $IR: (\neg blind(t)\ \forall t \in [t_1, t_1 + T_{view}]) \land \pi - \Delta_{slack} \le |\Delta^i_{t_2} - \Delta^a_{t_2}| \le \pi + \Delta_{slack}$, where $\Delta_{slack}$ is a slack value (in radians) around the angle $\pi$, and $T_{view}$ is the initial number of time steps during which the agent must see the person.

• Intersection: the trajectories of the robot and the human intersect at approximately 90°. In this case, the agent might want to stop and give way to the human, or reduce its linear velocity and deviate slightly. $IR: (\neg blind(t)\ \forall t \in [t_1, t_1 + T_{view}]) \land \frac{\pi}{2} - \Delta_{slack} \le |\Delta^i_{t_2} - \Delta^a_{t_2}| \le \frac{\pi}{2} + \Delta_{slack} \land intersect$

• Blind corner: the agent approaches a person from an initially occluded position, such as around a corner or through a narrow doorway. With such limited visibility, the agent should proceed with caution to avoid a collision. $IR: (blind(t)\ \forall t \in [t_1, t_1 + T_{blind}]) \land d\_diff(t_1) \le 0.5$, where $T_{blind}$ is the initial number of time steps during which the agent cannot see the person.

• Person following: the person and the agent travel in the same direction. The agent must maintain a safe distance from the person and a relatively low linear velocity. $IR: (\neg blind(t)\ \forall t \in [t_1, t_1 + T_{blind}]) \land |\Delta^i_{t_2} - \Delta^a_{t_2}| \le \Delta_{slack}$
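A compact sketch of how these rules could be applied to a detected encounter; the quantities mirror the definitions above, while the rule ordering and the boolean handling of $blind(t)$ are assumptions:

```python
import numpy as np

def classify_encounter(delta_i, delta_a, intersect, blind,
                       d_diff_t1, t1, T_view, T_blind, slack):
    """Apply the inclusion rules to one encounter. `delta_i`/`delta_a`
    are the trajectory directions at t2 (radians), `blind` a boolean
    sequence over time steps; returns a class label or None."""
    gap = abs(delta_i - delta_a)
    seen = not any(blind[t1:t1 + T_view])     # ¬blind(t) on [t1, t1+T_view]
    hidden = all(blind[t1:t1 + T_blind])      # blind(t) on [t1, t1+T_blind]
    if seen and abs(gap - np.pi) <= slack:
        return "frontal_approach"
    if seen and abs(gap - np.pi / 2) <= slack and intersect:
        return "intersection"
    if hidden and d_diff_t1 <= 0.5:
        return "blind_corner"
    if not any(blind[t1:t1 + T_blind]) and gap <= slack:
        return "person_following"
    return None
```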

Metrics
For each encounter class, we evaluate:
• Encounter Survival Rate (ESR): the percentage of encounters (within a particular class) that end without a human collision (e.g., if the agent collides with a human in 20% of blind-corner encounters, the blind-corner ESR is 80%);
• Average Linear Velocity (ALV): the average linear velocity of the agent during an encounter;
• Average Distance (AD): the average distance of the agent from the humans during an encounter.
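These metrics reduce to simple aggregation over logged encounters. A sketch, assuming each encounter record carries a collision flag and per-step velocities and distances (the field names are made up for illustration):

```python
from statistics import mean

def encounter_metrics(encounters):
    """ESR (%), ALV and AD for one encounter class. Each record is a
    dict with `collided` (bool), `lin_vels` and `dists` (per-step)."""
    esr = 100.0 * sum(not e["collided"] for e in encounters) / len(encounters)
    alv = mean(v for e in encounters for v in e["lin_vels"])
    ad = mean(d for e in encounters for d in e["dists"])
    return esr, alv, ad
```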

Experiment

[Tables and figures with quantitative experimental results]

Conclusion

We introduce an embodied social navigation model based on two proximity-aware tasks.

Our experiments show that exploiting social signals, alone or in combination with self-supervised auxiliary tasks, is an effective strategy in complex and crowded scenes.

Using only proximity-aware tasks, our model avoids collisions in most encounters.

Furthermore, by combining proximity-aware and self-supervised auxiliary tasks [45], it can prevent human collisions with high confidence in almost all encounter categories, even in the more collision-prone ones.

In our future work, we will focus on simulating more natural human behavior, and we will conduct experiments on sim2real domain transfer.
