[AAAI'18] SOMA: A Framework for Understanding Change in Everyday Environments Using Semantic Object Maps

Summary

Understanding the changes associated with the dynamics of people and objects in everyday environments is a challenging problem, and it is also a key requirement for automated mobile service robots in many applications.
In this paper, we propose a novel semantic mapping framework that can map the locations of objects, areas of interest, and the movements of people over time. Our goals for this framework are twofold: (1) we want to allow robots to reason about their environment in terms of semantics, space, and time, and (2) we want to enable researchers to investigate and study dynamic environments in the context of long-term robot scenarios.
The experimental results demonstrate the effectiveness of the framework, which has been deployed in real environments for several months.

 

Introduction

Everyday environments change constantly through the activities of people. For example, when setting a table for dinner, a person takes plates, cups, knives, and forks out of the kitchen cabinets and drawers and then puts them on the dining table.
However, due to its limited perceptual abilities and egocentric viewpoint, a robot can only partially observe such events. While observing this kind of scene, a mobile robot may only perceive the movement of a person through space and the appearance and disappearance of objects at different places and times. Since the robot only perceives snapshots of potential events, it needs to reason about what it saw, where, and when, in order to infer what to do next.

What did I see? Where? When? Answering these questions is essential in many robotic tasks, such as finding objects and monitoring human activities (Galindo et al. 2008; Kostavelis and Gasteratos 2015).
Semantic environment maps can link semantic information about the world (such as objects and people) to a spatiotemporal representation and thereby provide answers to these questions.
For this reason, they are an important resource in many robot tasks. They allow autonomous robots to interpret (or ground) high-level task instructions; to plan and explain how tasks can be accomplished in a given environment; and to communicate their observations to humans.
However, constructing, maintaining, and using such maps in dynamic everyday environments poses several challenges: (i) observations from sensor data must be interpreted at a semantic level; (ii) interpretations need to be integrated into a consistent map; (iii) the map needs to be updated continually to reflect the dynamics of the environment (over long periods of time); and (iv) queries on the semantic map need to provide task-relevant information under semantic and/or spatiotemporal constraints.

In the past, many semantic mapping approaches in computer vision and robotics have addressed challenges (i) and (ii) by interpreting and integrating data from various sensors; these approaches assumed that the environment is static, so their focus was on large structures (such as rooms, walls, and furniture).

In this work, we address challenges (ii), (iii), and (iv). For challenge (i), we use and adapt state-of-the-art robot perception methods, which provide intermediate semantic interpretations of sensor data.
Our work focuses on the dynamic aspects of everyday environments, including the changing locations of objects, areas of interest that may change over time, and the movements of people.
To this end, we studied how objects, regions, and the movements of people can be indexed by time and space so that map updates and queries can be handled efficiently.

To meet these challenges, we designed, developed, and evaluated SOMA, a framework for constructing, maintaining, and querying semantic object maps. In our work, semantic object maps model semantic and spatial information about objects, regions, and agent trajectories over time. Therefore, they can provide answers to the questions: what, where, and when?

The proposed framework allows autonomous robots to automatically construct and update (or modify) maps based on observations using state-of-the-art perception methods, and to query them from within the robot's control program.
At the same time, knowledge engineers and researchers can edit and query the maps to provide domain knowledge and to study long-term research problems. It is important to note that we designed SOMA with these two different user groups in mind: autonomous robots and researchers. While robots can add their own observations and query the maps to make decisions, researchers can model and/or analyze the environment and extract the spatiotemporal data collected by the autonomous system. The latter enables researchers to build and learn novel models of (long-term) dynamics in everyday environments.

In this work, we focus on mapping objects, regions, and people in long-term, dynamic settings.
Table 1 summarizes the high-level concepts used in SOMA. From an abstract point of view, our approach is similar to other works in that it uses similar concepts to represent entities in the environment.
Concepts such as objects, regions, and trajectories are natural and commonsensical.
However, our approach differs greatly in the way observations, interpretations, and semantic concepts are stored, indexed, linked, and queried over time.
The main contributions of this work are as follows:
• An open source semantic mapping framework (SOMA) designed specifically for long-term dynamic scenarios;
• A multi-layer knowledge representation architecture that links observations, interpretations, and semantic concepts using spatiotemporal indexes;
• An adaptable mechanism for grounding objects in sensor data and a set of (extensible) interfaces for updating semantic object maps over time;
• A query interface for retrieving and processing semantic object maps using semantic and spatiotemporal constraints;
• Long-term case studies of SOMA and semantic object maps in real-world environments.

 

 

2. Related work

Researchers have proposed a standardized way of representing and evaluating semantic maps.
They define semantic mapping as an incremental process that maps relevant information about the world (that is, spatial information, temporal events, agents, and actions) to a formal description supported by a reasoning engine.
Our work follows a similar approach: we incrementally map spatiotemporal information about objects, people, and regions, and we query this information using standardized database queries and specialized reasoning mechanisms.

Several semantic mapping methods mainly focus on the interpretation and integration of data from various sensors including laser rangefinders, stereo and monocular cameras, and RGB-D cameras.

Most of these methods assume that the environment is static, so the focus is on the mapping of static large structures such as rooms, walls, and furniture.
Our work differs from these methods in two ways. First, we do not develop new methods for interpreting sensor data, but build on and adapt state-of-the-art robot perception methods (Aldoma et al. 2012; Wohlkinger et al. 2012). Second, we focus on mapping, updating, and querying semantic maps in dynamic environments.

Some semantic mapping approaches explore this topic from a different angle.
They focus on the design of ontologies and connect these ontologies with representations of the underlying environment (Zender et al. 2008; Tenorth et al. 2010).
For example, the work of (Pronobis and Jensfelt 2012) shows how different sensor modalities and ontological reasoning can be integrated to infer semantic room categories.
Representing environment maps using semantic web technologies also allows robots to exchange information with other platforms through the cloud (Riazuelo et al. 2015).
We believe that these kinds of approaches are complementary to our work, because the semantic categories in our framework can be integrated with and linked to existing ontologies. For example, in (Young et al. 2017a), object hypotheses are linked to structured semantic knowledge bases such as DBpedia and WordNet (Fellbaum 1998).

(Elfring et al., 2013) proposed a framework for grounding objects in sensor data probabilistically. In general, our framework does not make any strong assumptions about how the objects observed by the robot are grounded.
Instead, how objects are grounded has to be specified or learned by the user within the interpretation layer of the framework (see Section 3.3).

(Bastianelli et al. 2013) proposed an online, interactive approach to semantic map construction, which was later evaluated in (Gemignani et al. 2016). Similarly, our work supports the crowd-sourced labeling of found objects (see Section 3.5). However, our approach is offline and designed to work asynchronously.

Our work is similar to (Mason and Marthi 2012) and (Herrero, Castano, and Mozos 2015), both of which address the semantic querying of maps in dynamic environments.
(Herrero, Castano, and Mozos 2015) proposed an approach based on a relational database that stores semantic information about objects, rooms, and the relationships needed for mobile robot navigation.
Our approach is similar in that it also considers objects and regions in space (not just rooms).
However, in our approach the relationships between objects and regions do not have to be modeled explicitly; they can be inferred using spatial reasoning.
(Mason and Marthi 2012) focus on the semantic querying of objects and on change detection. In their work, objects are represented as geometrically distinct occupied areas on a plane, and their positions are described in a global reference frame.
In contrast, our work can distinguish between unknown objects, classified objects, and known object instances. Through spatial indexing, we can relate objects to local, global, and robot-centric reference frames. In addition, we can associate objects with regions and with human trajectories.

The approach most similar to ours is the semantic mapping framework of (Deeken, Wiemann, and Hertzberg 2018). Their framework aims to maintain and analyze the spatial data of multi-modal environment models. It uses a spatial database to store metric data and links it to semantic descriptions through semantic annotations. Spatial and semantic data can be queried from the framework in order to enrich the metric map with topological and semantic information. This design and functionality are very similar to our approach.
However, our approach goes beyond spatial and semantic information in that it also captures temporal information about objects, regions, and people. It therefore enables robots and users to reason not only about static configurations but also about temporally extended events, such as daily activities.

 

3. SOMA framework

Figure 1 provides a conceptual overview of the designed framework. The framework consists of two parts: (1) SOMA core and (2) a set of SOMA extensions (or tools).
Overall, the core consists of four layers: three horizontal layers and one vertical interface layer. The three horizontal layers are interconnected and manage information at different levels of abstraction: from observations (i.e., raw sensor data) and their interpretations to semantic concepts. These three layers are responsible for representation in SOMA.
The vertical interface layer provides access to all three layers. A set of extensions (or tools) use this layer to visualize, edit, query, and extend the semantic object maps, which allows knowledge engineers to extend and analyze them. Similarly, robot and user applications can access and manage the maps through the interface layer.

Now let us consider the process of storing new information in SOMA. First, the robot's observations are stored in the form of raw sensor data and are spatio-temporally indexed in the observation layer. Second, the interpretation layer analyzes these observations using perception methods (such as segmentation, object recognition, object classification, and people tracking), merges the results, and generates a consistent description at the semantic level. Finally, observations, interpretations, and semantic descriptions are linked together so that the robot can query them at the various levels using spatial, temporal, and/or semantic constraints.
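To make this flow concrete, here is a minimal sketch of how one incoming view could pass through the three layers, assuming a MongoDB backend. The collection names, helper callables (`interpret`, `classify`), and document fields are illustrative assumptions for this sketch, not the actual SOMA API.

```python
import uuid
from datetime import datetime, timezone


def ingest_view(db, raw_view, interpret, classify):
    """Push one view through the three SOMA core layers (illustrative only).

    db        -- a pymongo database handle
    raw_view  -- dict with sensor data and the robot pose (assumed keys below)
    interpret -- callable returning object proposals for a view (assumed)
    classify  -- callable returning a semantic label for a proposal (assumed)
    """
    # 1) Observation layer: raw sensor data, spatio-temporally indexed.
    view_id = str(uuid.uuid4())
    db.observations.insert_one({
        "_id": view_id,
        "timestamp": datetime.now(timezone.utc),
        "robot_pose": raw_view["robot_pose"],
        "sensor_data": raw_view["sensor_data"],
    })

    # 2) Interpretation layer: application-specific processing, e.g. segmentation.
    for proposal in interpret(raw_view):
        proposal_id = str(uuid.uuid4())
        db.interpretations.insert_one({
            "_id": proposal_id,
            "view_id": view_id,                  # link back to the observation
            "cloud_indices": proposal["indices"],
        })

        # 3) Semantic layer: consistent high-level description, linked by IDs.
        db.semantic_objects.insert_one({
            "soma_id": str(uuid.uuid4()),
            "type": classify(proposal),
            "position": proposal["centroid"],    # [x, y] in the map frame
            "timestamp": datetime.now(timezone.utc),
            "interpretation_ids": [proposal_id],
        })

# Usage (requires `import pymongo` and your own pipeline callables):
# client = pymongo.MongoClient("localhost", 27017)
# ingest_view(client["somadb"], view, my_segmenter, my_classifier)
```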

3.2 Observation layer

The role of the observation layer is to store unprocessed raw sensor data from the robot, as well as any metadata that may be useful when its system interprets and processes the data.
To this end, the observation layer stores input from robot sensors during the learning task.
All other layers of SOMA also access this stored data.
The views we store contain the data from a single robot perception action, and a series of views is collected as an episode.
For our object learning task, a single view stores a point cloud, RGB and depth images, the current pose of the robot, and the relevant coordinate transforms.
The series of views is selected by a view planning algorithm for the specific learning task.
Episodes and views may also have metadata tags attached, which allows multiple different perception pipelines (perhaps all using different criteria to trigger, control, and interpret the data from learning tasks) to use the same database.
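As a concrete illustration, a stored view and its enclosing episode might look like the documents below; the exact field names are assumptions made for this sketch rather than SOMA's actual schema.

```python
from datetime import datetime, timezone

# A single view: one perception action of the robot (field names are illustrative).
view = {
    "view_id": "view_0042",
    "episode_id": "episode_0007",
    "timestamp": datetime.now(timezone.utc),
    "robot_pose": {"x": 3.2, "y": 1.7, "theta": 0.5},   # pose in the map frame
    "point_cloud": b"...",        # serialized point cloud blob
    "rgb_image": b"...",          # serialized image data
    "depth_image": b"...",
    "transforms": [],             # coordinate transforms recorded at capture time
    "meta": {"pipeline": "object_learning_v1"},   # tags for different pipelines
}

# An episode: a series of views selected by a view planner for one learning task.
episode = {
    "episode_id": "episode_0007",
    "task": "object_learning",
    "view_ids": ["view_0041", "view_0042", "view_0043"],
    "meta": {"waypoint": "WayPoint12"},
}
```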

 

One of our design goals was to provide a way of storing raw robot perception data that allows us to completely regenerate the SOMA databases built on top of it, re-running all necessary processing steps in the process. This can be achieved from a copy of the robot's observation layer alone, because the stored raw observations can be played back as if they were live data.
This is a key feature for evaluating different perception algorithms and pipelines on or off the robot. It also improves fault tolerance: for example, if the robot runs for a period of time with an undetected error in the segmentation or object recognition stage, we can correct the error, regenerate the database entirely from the observation layer, and re-process the data with the corrected system without any loss of data.
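A minimal sketch of such a regeneration step, assuming the observation layer is a MongoDB collection called `observations` with a `timestamp` field, and with `interpret` standing in for whichever (corrected) perception pipeline is being run; none of these names are taken from the actual implementation.

```python
import pymongo


def regenerate(db, interpret):
    """Rebuild the higher-level collections by replaying stored observations."""
    # Discard previously derived data; the raw observations stay untouched.
    db.interpretations.delete_many({})
    db.semantic_objects.delete_many({})

    # Replay views in the order they were originally recorded.
    for view in db.observations.find().sort("timestamp", pymongo.ASCENDING):
        for result in interpret(view):          # corrected or alternative pipeline
            db.interpretations.insert_one(result)


client = pymongo.MongoClient("localhost", 27017)
# regenerate(client["somadb"], my_fixed_pipeline)   # hypothetical pipeline
```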

 

3.3 Interpretation layer

The interpretation layer takes input from the observation layer and mainly contains application-specific methods for processing data.
The observation layer can be seen as a wrapper around the robot's sensors, while the interpretation layer can be regarded as the part of the system that performs application-specific processing of the data.
In object learning, the first interpretation step is to apply a segmentation algorithm, such as depth-based segmentation or a region proposal network, in order to extract object proposals for further processing.
SOMA provides a way to structure the output of such segmentation algorithms through an object structure similar to a scene graph.
This provides tools for storing data about a single segmented object and its relationship to views and episodes, and it allows developers to organize observations of objects as additional views are captured.
The exact choice of algorithms for scene segmentation, for tracking objects between views, and for other steps is left to developers as part of the design of their own dedicated interpretation layer.
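For illustration, a scene-graph-like record for one segmented object could collect its per-view observations as in the sketch below; the `SegmentedObject` and `Observation` classes and their fields are assumptions made for this sketch, not SOMA's own data structures.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Observation:
    """One observation of a segmented object within a single view."""
    view_id: str                 # link back to the observation layer
    cloud_indices: List[int]     # points of the view's cloud belonging to the object
    image_bbox: Tuple[int, int, int, int]   # (x, y, w, h) in the view's RGB image


@dataclass
class SegmentedObject:
    """A segmented object tracked across the views of one episode."""
    segment_id: str
    episode_id: str
    observations: List[Observation] = field(default_factory=list)

    def add_observation(self, obs: Observation) -> None:
        # Called by the developer's own tracker whenever the object is
        # re-detected in an additional view of the same episode.
        self.observations.append(obs)
```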

Once the interpretation pipeline has processed and filtered the raw sensor output provided by the observation layer, high-level SOMA objects can be constructed from the processed data.
An example is shown in Figure 2. High-level objects represent the results of this processing and can record the output of object recognition algorithms over a series of views of the object, a merged 3D model constructed from multiple views, and metadata.
These high-level objects are linked back to the low-level observations they are composed of, so developers can move back and forth between the fully merged object and its components as needed.
These objects can then be used in other applications built on top of SOMA: they can be shown to end users in a labeling application, sent via e-mail or posted online, visualized on a website, used in an application that searches for a missing cup, passed on for further processing, or whatever else the developer might want.

The interpretation layer seems to be more or less just this; is there anything else to it?

3.4 Semantic layer

The semantic layer stores high-level knowledge extracted from robot observations (see Table 1).
This high-level knowledge may consist of recognized object instances received from various recognition/detection pipelines, or of objects tracked by a segmentation/tracking pipeline.
Each high-level data instance stores spatiotemporal information, so that the evolution of knowledge about each object instance can be maintained and retrieved.
In addition, each high-level SOMA object is linked to the other SOMA layers through its SOMA ID, so that all knowledge about the object can be accessed within the framework.

Furthermore, the semantic layer can store additional information about objects, such as 3D models, camera images, and any kind of metadata, in order to build a complete knowledge base. The stored high-level information helps users understand the semantics of different environments and allows robots to perform high-level reasoning for tasks such as finding and/or grasping objects.
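A high-level object record in the semantic layer might then look like the following sketch, where the SOMA ID ties the entry back to the lower layers and a timestamped history captures how knowledge about the instance develops over time; all field names here are illustrative assumptions.

```python
from datetime import datetime, timezone

semantic_object = {
    "soma_id": "soma_obj_0133",          # links this entry to the other layers
    "type": "cup",                       # label from a recognition pipeline
    "instance_of": None,                 # known instance ID, if resolved
    "position": [8.9, 2.1],              # latest observed position, map frame
    "timestamp": datetime(2017, 5, 3, 16, 40, tzinfo=timezone.utc),
    "segment_ids": ["seg_0420"],         # interpretation-layer segments
    "model_3d": b"...",                  # merged 3D model built from the views
    "images": [b"..."],                  # camera images of the object
    # Timestamped history: where and when this instance was observed.
    "history": [
        {"timestamp": datetime(2017, 5, 3, 10, 15, tzinfo=timezone.utc),
         "position": [3.2, 1.7], "region": "meeting_room"},
        {"timestamp": datetime(2017, 5, 3, 16, 40, tzinfo=timezone.utc),
         "position": [8.9, 2.1], "region": "kitchen"},
    ],
}
```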

 

3.5 Interface layer and other extensions

The interface layer serves as the backbone for exchanging data between the different SOMA layers and users. Through it, robots and users can insert, delete, update, and query data using the SOMA extensions and other applications (Figure 3).

SOMAEdit allows users to create virtual scenes without any perceptual data. Using this editor, users can add, delete, or move objects and regions on top of the metric map.

SOMAQuery allows users to query the map using semantic, spatial, and/or temporal constraints.
A query may ask for all objects of a certain type ("select all cups").
Such queries can be further restricted by spatiotemporal constraints ("select all cups in the meeting room on Monday between 10:00 and 12:00"). Spatial constraints can be used to determine whether spatial entities are close to another entity, lie within an entity (region), or intersect another entity. Temporal constraints can be expressed using time points or time intervals. For discovering temporal patterns and periodic processes, the time of day, the day of the week, and the day of the month are particularly important.
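As an illustration of how such a constrained query could be expressed against a MongoDB-backed store (the collection layout and field names are assumed for this sketch, not SOMA's actual query API), the example below retrieves all cups observed inside a meeting-room polygon within a given time window.

```python
from datetime import datetime, timezone
import pymongo

client = pymongo.MongoClient("localhost", 27017)
objects = client["somadb"]["semantic_objects"]

# Planar "2d" index over map coordinates enables $geoWithin polygon queries.
objects.create_index([("position", pymongo.GEO2D)])

meeting_room = [[5.0, 2.0], [9.0, 2.0], [9.0, 6.0], [5.0, 6.0]]  # map-frame polygon
monday_start = datetime(2017, 5, 1, 10, 0, tzinfo=timezone.utc)
monday_end = datetime(2017, 5, 1, 12, 0, tzinfo=timezone.utc)

cups = objects.find({
    "type": "cup",                                            # semantic constraint
    "position": {"$geoWithin": {"$polygon": meeting_room}},   # spatial constraint
    "timestamp": {"$gte": monday_start, "$lt": monday_end},   # temporal constraint
})
for cup in cups:
    print(cup["soma_id"], cup["position"], cup["timestamp"])
```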

 

The article is very clear, but it does not seem to address whether observed objects belong to the same instance. Also, object recognition and segmentation still rely entirely on image algorithms.

 

4. Implementation

We have implemented SOMA based on ROS and MongoDB. The overall structure of the implementation is shown in Figure 4.
ROS is used as the backbone of the entire SOMA framework because it is the most commonly used platform in the robotics research community.
The various SOMA layers and components are developed as ROS nodes, so each of them can communicate with any other ROS component.
The data structures used to store SOMA objects are themselves ROS messages composed of primitive ROS types. This provides a common interface between systems, as long as they are built on the ROS stack.
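To illustrate the idea of message-based storage without claiming this is SOMA's actual message definition, the sketch below builds a document whose fields mirror primitive ROS types (string ID, string type, float array position, timestamp, frame ID) and stores it with pymongo; in the real system, conversion between ROS messages and database documents would be handled by whatever ROS-MongoDB bridge the framework uses.

```python
from datetime import datetime, timezone
import pymongo

# Document mirroring a hypothetical SOMA object message built from primitive
# ROS types (string id, string type, float64[] position, time stamp, frame_id).
soma_msg_doc = {
    "soma_id": "soma_obj_0134",
    "type": "monitor",
    "position": [4.1, 7.3],                   # float64[] in the map frame
    "timestamp": datetime.now(timezone.utc),  # a ROS time would map to this
    "frame_id": "map",                        # as in a std_msgs/Header
}

client = pymongo.MongoClient("localhost", 27017)
client["somadb"]["semantic_objects"].insert_one(soma_msg_doc)
```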

 

5. Evaluation

Our work is motivated by the European project STRANDS (Hawes et al. 2016), in which we investigate spatiotemporal representations and activities in long-term scenarios. In this project, we are interested in providing services to humans in everyday environments. The tasks performed by our robots include object search, object discovery, activity recognition, and motion analysis. In this context, we evaluated this work through a series of robot deployments in real office environments over periods of several months.

SOMA has been deployed several times at two different locations in the UK: the Transport Systems Catapult (TSC) and a facility belonging to Group 4 Security (G4S). We report the details of these deployments for the three main entities represented in SOMA: objects, regions, and trajectories.

As mentioned above, SOMA has been used both as a base technology for developing more advanced and capable robot perception systems and as a key component of many user-facing robot applications.
Figure 5 shows a small sample of objects learned by the robot using the perception pipeline of (Young et al. 2017b) at the Transport Systems Catapult (TSC) deployment site. In particular, 2D images of the objects are passed to a convolutional neural network (CNN) for recognition (as in Young et al. 2017b; 2017a) and to end users at the site for live labeling.
SOMA has also served as a data collection platform: we have used it to store and distribute scenes for later offline annotation by human annotators.
SOMA's performance in this regard depends on the perception pipelines that feed information into it. Overall, the system stored 141 and 341 scenes during the first and second TSC deployments, respectively. The design of SOMA means that these scenes can later be re-processed offline; if necessary, different perception pipelines, algorithms, or filters can be used to extract different objects or different kinds of information from them. Table 2 compares the object learning pipelines used in the third year (Y3) and the fourth year (Y4) at the TSC site.

In the first TSC deployment (Y3), SOMA was used to provide reports of objects found on predefined surfaces at the site (Figure 6). As the robot performed its regular object learning tasks, it generated reports and presented them in a web-based blog interface accessible to end users.

In further experimental work, we designated two surfaces at the TSC site as "learning tables", where office workers could place objects for the robot to learn. The robot then visited these tables twice a day and tried to learn and identify any objects it found, using a CNN (with 1000 possible categories) trained on a large image database, and posted tweets about the objects as it attempted to identify them.
Internally, this was implemented using SOMA's notification functionality, which signals when new objects are inserted into the system and then triggers the recognition and publishing processes by passing the 2D images of the newly inserted objects to them.

In the last long-term deployment at the TSC site (Y4), we used a CNN-based object detection pipeline.
The pipeline can detect 20 object classes, including people, chairs, and monitors, and can use the registered depth information to extract partial 3D views of the objects.
In this way, object positions can be determined both relative to the robot and in the global metric map.
Figure 7 shows an example of a detected chair and the extracted partial 3D view. The detected objects are then stored as high-level SOMA objects with spatiotemporal information. Table 3 shows detailed statistics about the objects detected with this pipeline during the deployment. The results show that, since the robot operates in an office environment, the most frequently detected objects are chairs, people, and monitors (Table 3).

We also used the SOMAQuery interface to analyze the temporal aspects of the Y4 deployment in terms of high-level SOMA object perception.
Table 4 shows the daily object perception statistics over the whole deployment with respect to the time of day. As the table shows, the robot was active mostly on Wednesdays and Thursdays, and it was never used for object perception on weekends.
It can also be observed that the robot was most active in the afternoons and was rarely used outside working hours (after 17:00). Over the entire Y4 deployment, the robot perceived a total of 930 high-level SOMA objects.
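This kind of temporal breakdown can be reproduced with a simple aggregation over the stored object timestamps; the collection and field names below are the same illustrative ones used in the earlier sketches, not SOMA's actual schema.

```python
import pymongo

client = pymongo.MongoClient("localhost", 27017)
objects = client["somadb"]["semantic_objects"]

# Count perceived objects per (day of week, hour of day).
# $dayOfWeek returns 1 = Sunday ... 7 = Saturday.
pipeline = [
    {"$group": {
        "_id": {"dow": {"$dayOfWeek": "$timestamp"},
                "hour": {"$hour": "$timestamp"}},
        "count": {"$sum": 1},
    }},
    {"$sort": {"_id.dow": 1, "_id.hour": 1}},
]
for row in objects.aggregate(pipeline):
    print(row["_id"]["dow"], row["_id"]["hour"], row["count"])
```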

 

 

 

This group is really strong, and SOMA is quite solid. However, the paper does not seem to address those few issues: the movement, appearance, and disappearance of instances in dynamic scenes; that is, its perception does not relate observations to one another over time.

 
