Paper Reading: Retrospectives on the Embodied AI Workshop

Paper information

Title : Retrospectives on the Embodied AI Workshop
Authors : Matt Deitke, Dhruv Batra, Yonatan Bisk, et al.
Source : arXiv
Paper Address : https://arxiv.org/pdf/2210.06849
Time : 2022

Abstract

Our analysis focuses on 13 challenges presented at the CVPR Embodied AI Workshop. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language.


Introduction

The challenges presented at the workshop focused on navigation, rearrangement, and embodied vision-and-language, benchmarking progress on specific visual and linguistic capabilities.

  1. Navigation challenges: including Habitat PointNav [1] and ObjectNav [17], interactive and social navigation using iGibson [210], RoboTHOR ObjectNav [51], MultiON [198], RVSU Semantic SLAM [82], and audio-visual navigation using SoundSpaces [38];
  2. Rearrangement Challenge: including AI2-THOR Rearrangement [200], TDW-Transport [67] and RVSU Scene Change Detection [82];
  3. Embodied vision-and-language challenges: including RxR-Habitat [102], ALFRED [177], and TEACh [133].

We discuss the setting of each challenge and its state-of-the-art performance, analyze common approaches among the winning entries, and finally discuss promising future directions for the field.

Challenge Details

Navigation Challenges

At a high level, a navigation task consists of an agent operating in a simulated 3D environment (such as a home) with the goal of moving to a specified target. For each task, the agent has access to an egocentric camera and observes the environment from a first-person perspective. The agent must learn to navigate the environment from visual observations.

The challenges mainly differ in how the goal is encoded (e.g., ObjectGoal, PointGoal, AudioGoal), how the agent interacts with the environment (e.g., static navigation, interactive navigation, social navigation), the training and evaluation scenarios (e.g., 3D scans, video-game environments, the real world), the observation space (e.g., RGB or RGB-D, and whether localization information is provided), and the action space (e.g., discrete high-level actions or continuous joint motions).
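To keep these axes straight, here is a small illustrative schema (my own, not taken from the paper) of the dimensions along which a navigation challenge can be specified; all names and enum values are placeholders.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List

class GoalType(Enum):
    POINT_GOAL = auto()   # relative coordinate, as in PointNav
    OBJECT_GOAL = auto()  # object category, as in ObjectNav
    AUDIO_GOAL = auto()   # sound-emitting source, as in audio-visual navigation

class InteractionMode(Enum):
    STATIC = auto()       # the scene does not change
    INTERACTIVE = auto()  # the agent may push or move objects
    SOCIAL = auto()       # moving pedestrians share the space

@dataclass
class NavigationChallengeSpec:
    """Illustrative schema for the axes along which the navigation challenges differ."""
    goal_type: GoalType
    interaction: InteractionMode
    observations: List[str]          # e.g. ["rgb", "depth", "gps_compass"]
    discrete_actions: bool = True    # discrete high-level vs. continuous joint actions
    environment: str = "3D scan"     # e.g. "3D scan", "game engine", "real world"

# Example: the PointNav-v1 setting described in the next subsection.
pointnav_v1 = NavigationChallengeSpec(
    goal_type=GoalType.POINT_GOAL,
    interaction=InteractionMode.STATIC,
    observations=["rgb", "depth", "gps_compass"],
)
```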

PointNav

In PointNav, the agent's goal is to navigate to a goal coordinate in a new environment, specified relative to its starting position (e.g., navigate 5 m north and 3 m west of its starting pose), without access to a pre-built map of the environment. The agent has access to egocentric sensory input (RGB images, depth images, or both) and an egomotion sensor (sometimes called a GPS+Compass sensor) for localization.

The robot's action space consists of: move forward 0.25 m, rotate right 30°, rotate left 30°, and Done.

A "Done" command is considered successful if the agent is within 0.2 meters of the target and the maximum number of steps is within 500 steps. The agent is evaluated using the success rate (SR) and the "success weighted by path length" (SPL) [9] metric, which measures the success and efficiency of the paths taken by the agent. For training and evaluation, challenge participants to use the training and validation splits of the Gibson 3D dataset [223] In
In 2019, AI Habitat held its first challenge on PointNav. The winning entry [31] achieved a test SPL of 0.948 on the RGB-D track and 0.805 on the RGB track using a combination of classical and learning-based methods. Based on the findings of Kadian et al. [92], in 2020 and 2021 the PointNav challenge was modified to emphasize increased realism and sim2real predictability (the ability to predict real-robot performance from simulated performance). Specifically, the modified challenge (PointNav-v2) introduced (1) no GPS+Compass sensor, (2) noisy actuation and sensing, (3) collision dynamics and "sliding", and (4) constraints on the robot embodiment (dimensions, camera resolution, and camera height) to better match the LoCoBot robot. These changes proved far more challenging, with the 2020 winning submission [149] achieving an SPL of 0.21 and an SR of 0.28. In 2021, a major breakthrough brought a 3x performance improvement over the 2020 winner: the winning entry achieved an SPL of 0.74 and an SR of 0.96 [1]. Since an agent with a perfect GPS+Compass sensor in the PointNav-v2 setting can achieve at most 0.76 SPL and 0.99 SR, the PointNav-v2 challenge is considered solved and has been discontinued for the next few years.

Interactive and Social PointNav

In interactive and social navigation, an agent needs to reach a PointGoal in a dynamic environment containing movable objects (furniture, clutter, etc.) or dynamic agents (pedestrians). Despite the notable success of robotic navigation in static, structured environments such as warehouses, navigation in dynamic environments such as homes and offices remains a challenging research problem. In 2020 and 2021, the Stanford Vision and Learning Lab, in partnership with Robotics@Google, hosted the Interactive and Social (Dynamic) Navigation Challenges. These challenges use the iGibson simulation environment [105, 175] and many scenes reconstructed from real indoor spaces, as shown in Fig. 4. The 2020 challenge also featured a Sim2Real component in which participants trained their policies in iGibson simulation and deployed them in the real world.


  1. In interactive navigation, we challenge the notion that a navigation agent should avoid collisions at all costs. Our point is just the opposite: in real-world environments full of clutter, such as homes, agents must interact with and push objects aside in order to navigate meaningfully. Note that all objects in the scene are assigned realistic physical weights and can be interacted with.

    Just like in the real world, some objects are light enough for the robot to move, while others are not. In addition to the original furniture in each scene, other objects such as shoes and toys from the Google Scanned Objects dataset [54] are added to simulate real-world clutter. The agent's performance is evaluated using a novel Interactive Navigation Score (INS) [210] that measures both the success of the navigation and the amount of disturbance the agent caused to the scene along the way (see the metric sketch after this list).

  2. In social navigation, agents navigate among walking humans in a home environment. Humans in the scene move towards randomly sampled locations, and their 2D trajectories are simulated with the Optimal Reciprocal Collision Avoidance (ORCA) [18] model integrated in iGibson [105, 140, 175].

    The agent must avoid colliding with pedestrians or approaching them closer than a threshold (distance < 0.3 m); doing so terminates the episode. It should also maintain a comfortable distance from pedestrians (distance < 0.5 m); violating this is penalized but does not terminate the episode. The Social Navigation Score (SNS), the average of STL (Success weighted by Time Length) and PSC (Personal Space Compliance), is used to assess the agent's performance (see the metric sketch after this list).
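As a rough sketch of how these two scores combine their components, based only on the descriptions above: the exact weighting of INS in the official evaluation may differ, and STL is assumed here to be defined analogously to SPL with time in place of path length.

```python
from typing import Sequence

def ins(path_efficiency: float, effort_efficiency: float) -> float:
    """Interactive Navigation Score [210]: combines how efficiently the agent reached
    the goal with how little it disturbed the scene. An equal-weight average is
    assumed here; the official weighting may differ."""
    return 0.5 * (path_efficiency + effort_efficiency)

def stl(success: bool, optimal_time_s: float, agent_time_s: float) -> float:
    """Success weighted by Time Length (assumed analogous to SPL, with time replacing path length)."""
    return float(success) * optimal_time_s / max(agent_time_s, optimal_time_s)

def psc(per_step_pedestrian_distances_m: Sequence[Sequence[float]]) -> float:
    """Personal Space Compliance: fraction of timesteps at which the agent kept
    at least 0.5 m from every pedestrian."""
    compliant = sum(1 for step in per_step_pedestrian_distances_m
                    if all(d >= 0.5 for d in step))
    return compliant / len(per_step_pedestrian_distances_m)

def sns(stl_value: float, psc_value: float) -> float:
    """Social Navigation Score: the average of STL and PSC, as described above."""
    return 0.5 * (stl_value + psc_value)
```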

One of the difficulties in the social navigation track is modeling the trajectories of the human agents, including their reactions to and interactions with the robot. Often the robot must negotiate shared space to reach its goal or intrude on the desired personal-space threshold, and the simulated humans can behave erratically because of the limitations of the behavioral model and the constraints of the space. For future releases, we will emphasize high-fidelity navigation simulation with human-like behavior.

For the Sim2Real component of the 2020 challenge, significant performance degradation was observed during Sim2Real transfer, due to reality gaps in vision-sensor readings, dynamics (e.g., motor actuation), and 3D modeling (e.g., soft carpets).

ObjectNav

In ObjectNav, an agent is tasked with navigating to an instance of a given target object category (e.g., navigate to a bed) given egocentric sensory input. The sensory input can be an RGB image, a depth image, or both. At each time step, the agent must issue one of the following actions: move forward, rotate right, rotate left, look up, look down, and Done. The move-forward action moves the agent 0.25 m, and the rotate and look actions are performed in 30° increments.

An episode is considered successful if:
(1) the target object is visible in the camera view,
(2) the agent is within 1 m of the target object, and
(3) the agent issues the "Done" action.
The agent's starting position is a random location in the scene.
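Expressed as a single predicate, the success criterion looks as follows (a minimal sketch; in practice the visibility and distance checks come from the simulator):

```python
def objectnav_success(target_visible: bool, dist_to_target_m: float,
                      called_done: bool, success_radius_m: float = 1.0) -> bool:
    """ObjectNav success: the target is visible in the camera view, the agent is
    within 1 m of it, and the agent has issued the "Done" action."""
    return target_visible and dist_to_target_m <= success_radius_m and called_done
```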

Two ObjectNav challenges were hosted: the RoboTHOR ObjectNav Challenge [51] and the Habitat ObjectNav Challenge [166, 214]. Both challenges used the action and observation spaces described above, along with a simulated LoCoBot robotic agent.

Multi-ObjectNav

In Multi-ObjectNav (MultiON) [198], the agent is initialized at a random starting position in the environment and asked to navigate to an ordered sequence of objects placed within realistic 3D scenes (Fig. 6a, 6b). The agent must navigate to each target object in the given order and call the Found action to signal the object's discovery. This task is a generalized variant of ObjectNav in which the agent must navigate to a sequence of objects rather than a single object. MultiON explicitly tests the agent's ability to localize and navigate to previously observed target objects, and is thus a suitable testbed for evaluating memory-based embodied AI architectures.

The agent is equipped with an RGB-D camera and a (noise-free) GPS+Compass sensor. The GPS+Compass sensor provides the agent's current position and orientation relative to its initial position and orientation in the episode; it does not provide a map of the environment. The action space consists of move forward 0.25 m, turn left 30°, turn right 30°, and Found.

The MultiON dataset is created by synthetically adding objects to Habitat-Matterport 3D (HM3D) [152] scenes. These objects are either cylinders or natural-looking (realistic) objects. As shown in Figure 6a, the cylinder objects have the same height and radius but different colors. However, such objects look out of place in the interiors of Matterport houses, and detecting the same shape in different colors may be easy for the agent to learn. This led us to also incorporate realistic objects that occur naturally in houses (Fig. 6b).

These objects vary in size and shape, presenting a more demanding detection challenge. The training split has 800 HM3D scenes and 8M episodes, the validation split has 30 unseen scenes and 1,050 episodes, and the test split has 70 unseen scenes and 1,050 episodes. Episodes are generated by sampling random navigable points as start and goal locations such that the two lie on the same floor and a navigable path exists between them. Next, five target objects are randomly sampled from the set of cylinders or realistic objects and inserted between the start and the goal, maintaining a minimum pairwise geodesic distance between them to avoid confusion. Also, to make the episodes more realistic and challenging, three distractor objects (non-targets) are inserted into each episode; their presence encourages the agent to distinguish the target objects from other objects in the environment.
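The episode-generation procedure described above can be sketched as follows. The two callables stand in for simulator queries (hypothetical placeholders, not a real Habitat API), the 2.0 m minimum separation is an invented value, and the real generator additionally constrains the start and goal to share a floor, be connected by a navigable path, and have the targets placed between them.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]

def generate_multion_episode(
    sample_navigable_point: Callable[[], Point],          # hypothetical simulator query
    geodesic_distance: Callable[[Point, Point], float],   # hypothetical simulator query
    object_pool: Sequence[str],
    num_targets: int = 5,
    num_distractors: int = 3,
    min_pairwise_geodesic_m: float = 2.0,                 # placeholder value
) -> dict:
    """Illustrative MultiON episode generator following the description above."""
    # Sample start and goal locations (the real generator also checks that they
    # lie on the same floor and that a navigable path connects them).
    start, goal = sample_navigable_point(), sample_navigable_point()
    # Choose five target objects and place them with a minimum pairwise
    # geodesic separation so they are not easily confused with one another.
    targets = random.sample(list(object_pool), num_targets)
    positions: List[Point] = []
    while len(positions) < num_targets:
        candidate = sample_navigable_point()
        if all(geodesic_distance(candidate, p) >= min_pairwise_geodesic_m for p in positions):
            positions.append(candidate)
    # Insert three distractor objects that are not part of the target sequence.
    distractors = random.sample([o for o in object_pool if o not in targets], num_distractors)
    return {"start": start, "goal": goal,
            "targets": list(zip(targets, positions)),
            "distractors": distractors}
```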

An episode is considered successful if the agent comes within 1 m of each target in the specified order and issues a Found action at each target object. In addition to the standard evaluation metrics used in ObjectNav, such as success rate (SR) and Success weighted by Path Length (SPL) [9], Progress and Progress weighted by Path Length (PPL) are used to measure agent performance. The challenge leaderboard is ranked by the PPL metric.
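A minimal sketch of the two MultiON-specific metrics, assuming Progress is the fraction of the ordered targets the agent reached and flagged, and PPL weights it by path efficiency in the same way SPL weights success:

```python
def progress(num_targets_found_in_order: int, num_targets: int = 5) -> float:
    """Fraction of the ordered target sequence the agent reached and flagged with Found."""
    return num_targets_found_in_order / num_targets

def ppl(progress_value: float, shortest_path_m: float, agent_path_m: float) -> float:
    """Progress weighted by Path Length, assumed to be defined analogously to SPL [9]."""
    return progress_value * shortest_path_m / max(agent_path_m, shortest_path_m)
```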

Navigating to Identify All Objects in a Scene

The RVSU Semantic SLAM challenge requires participants to explore a simulated environment and map all objects of interest within it. The challenge asks the robotic agent "what object is where?" within the scene. The agent traverses the scene, creates an axis-aligned 3D cuboid semantic map of the objects it contains, and is evaluated on the accuracy of that map. Semantic understanding of objects helps a robot interpret properties of its environment, such as how to interact with an object or what type of room it is in. Such semantic understanding is often framed as a semantic Simultaneous Localization and Mapping (SLAM) problem.

Semantic SLAM has been extensively studied using static datasets such as KITTI [68], SUN RGB-D [181], and SceneNet [115]. However, these static datasets ignore the active capabilities of robots and forgo the search for the actions that best explore and help understand the environment. To address this limitation, the RVSU Semantic SLAM challenge [82] helps bridge the gap between passive and active semantic SLAM systems by providing a framework and simulation environment for reproducible, quantitative comparison of passive and active methods.
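To make the map representation concrete, here is a sketch of an axis-aligned cuboid object map together with a naive 3D-IoU matching check. The official RVSU benchmark uses its own object-map evaluation, so the threshold and matching rule below are purely illustrative.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class CuboidDetection:
    """One entry of the semantic map: a class label plus an axis-aligned 3D box."""
    label: str
    min_corner: Tuple[float, float, float]  # (x, y, z) in metres, world frame
    max_corner: Tuple[float, float, float]

def iou_3d(a: CuboidDetection, b: CuboidDetection) -> float:
    """Intersection-over-union of two axis-aligned cuboids."""
    inter = vol_a = vol_b = 1.0
    for i in range(3):
        lo = max(a.min_corner[i], b.min_corner[i])
        hi = min(a.max_corner[i], b.max_corner[i])
        inter *= max(0.0, hi - lo)
        vol_a *= a.max_corner[i] - a.min_corner[i]
        vol_b *= b.max_corner[i] - b.min_corner[i]
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0

def matches(pred: CuboidDetection, gt: CuboidDetection, iou_threshold: float = 0.25) -> bool:
    """A predicted cuboid counts as mapping a ground-truth object if the labels agree
    and the boxes overlap sufficiently (threshold chosen arbitrarily here)."""
    return pred.label == gt.label and iou_3d(pred, gt) >= iou_threshold
```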


Audio-Visual Navigation

While current navigation models tightly couple vision and movement, they are deaf to the world around them. Motivated by this, the audio-visual navigation task [38, 66] was introduced, in which an embodied agent must navigate to a sound-emitting object in an unknown, unmapped environment using egocentric visual and auditory perception (Fig. 8). Audio-visual navigation has applications in assistive and mobile robotics, such as search-and-rescue operations and home-assistant robots. Together with the task, the SoundSpaces platform was released, the first audio-visual simulator of its kind, in which embodied agents can move through a simulated environment while both seeing and hearing it.
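As a schematic of the per-step observation such an agent receives (my own illustration, not the SoundSpaces API):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class AudioVisualObservation:
    """Schematic per-step observation for audio-visual navigation (illustrative only)."""
    rgb: np.ndarray             # (H, W, 3) egocentric color image
    depth: np.ndarray           # (H, W, 1) egocentric depth image
    binaural_audio: np.ndarray  # (2, T) left/right waveform rendered at the agent's pose;
                                # typically converted to a spectrogram before being fed to the policy
```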

Other

The remaining challenges are beyond the scope of this reading for now and may be covered later if needed.
