Two Championships at CVPR 2023 | Exploration and Application of Visual Segmentation Technology in Meituan Street View Understanding

Visual segmentation technology plays an important role in street view understanding, but it also faces many challenges. After long-term exploration, the Meituan Street View Understanding Team has built a segmentation technology system that balances accuracy and efficiency, and has achieved notable results in application. Along the way, the related technologies won 2 championships and 1 third place in CVPR 2023 competitions. This article introduces in detail the exploration and application of segmentation technology in street view understanding, and we hope it offers some help or inspiration to readers engaged in related research and development.

  • 1 Problem Background

  • 2 Research Status

    • 2.1 Segmentation Technology System

    • 2.2 Current Status of Street View Segmentation

  • 3 Core Technologies

    • 3.1 High-Precision Segmentation in Complex Scenes

    • 3.2 Efficient Iteration of Segmentation Models

    • 3.3 Towards Unified and Open Street View Segmentation

  • 4 CVPR 2023 Technical Achievements

    • 4.1 Publication of Papers

    • 4.2 Double-Track Championships in the Adverse-Weather Street View Understanding Competition

    • 4.3 Third Place in the Video Panoptic Segmentation Track of the Video Understanding Competition

  • 5 Business Applications

  • 6 Summary and Outlook

1 Problem Background

Street view data is collected by a variety of devices, with video images captured by cameras and point clouds captured by radar/LiDAR as the main carriers. Among these, camera-captured video images are low-cost and easy to obtain, making them the most common type of street view data; the street view data discussed in this article is mainly video image data. As an important information carrier for indoor and outdoor scenes, street view video image data is a key research object for many computer vision tasks.

To extract useful street view information from video image data, multiple computer vision technologies are combined in a complementary way to achieve a deep and comprehensive understanding of street scenes such as traffic roads and indoor and outdoor spaces; this process is often referred to as street view understanding. The development of street view understanding plays a very important role in the evolution of computer vision, and it also underpins many downstream tasks (such as scene reconstruction, autonomous driving, and robot navigation).

In general, street view understanding integrates many computer vision technologies. According to how the recognition results are represented, it can be divided into four levels: point level, line level, surface level, and volume level, plus the logical relationships between elements within and across these levels. Specifically:

  1. Point-level extraction technology analyzes information related to "points", mainly extracting coordinates and feature descriptors. It includes general feature points, semantic keypoints, and other point-level extraction techniques; the processing objects are various point elements, which represent information such as the location and characteristics of elements.

  2. Line-level extraction technology analyzes information related to "lines", mainly extracting lines such as lane lines, horizon lines, and various curves/straight lines; the processing objects are various line elements, which represent information such as the position, direction, and topology of elements.

  3. Surface-level extraction technology analyzes information related to "surfaces", mainly extracting regions. Because street view video images are formed by perspective projection, all information is displayed on a two-dimensional plane, which is divided into different regions according to semantics and instances; these regions represent the two-dimensional position, outline, and semantics of elements. Capabilities at this level include extraction techniques such as semantic segmentation and instance segmentation.

  4. Volume-level technology analyzes information related to "volumes", mainly extracting 3D structure. It includes depth estimation, explicit/implicit visual 3D reconstruction, and other techniques, which represent the 3D structural information of scenes and elements.

  5. Logical-relationship extraction technology builds on the elements and scene information extracted by the above techniques; through temporal information fusion and logical reasoning, it extracts logical relationships between elements at different levels or at the same level, including point matching relationships, line topological relationships, multi-frame element tracking, positional relationships, and so on.

In real-world scenarios, the recognition results of point-level, line-level, and surface-level extraction technologies are shown in Figure 1 below:


Figure 1 Example of street view analysis results

In street view understanding, video and image segmentation technologies are key to "surface-level" extraction and "logical relationship" extraction, providing a pixel-level representation of two-dimensional street view information. Because real scenes are complex, street view segmentation faces many difficulties.

First, a prominent difficulty in street view segmentation is the large variation in the shape and size of elements, as shown in the first column of Figure 2 (image examples come from the dataset [1]). Due to the diversity of targets in real scenes and the limitations of video imaging, targets in the collected data often appear deformed or incomplete. In addition, because of perspective imaging, the same target can appear at very different sizes depending on its distance from the camera. These two problems require a street view segmentation algorithm that can robustly and accurately segment complex targets.

Second, another difficulty is the interference caused by harsh natural conditions, as shown in the second and third columns of Figure 2 (examples come from the dataset [2]). Since severe weather and extreme lighting often occur in real scenes, targets in the collected data are frequently affected by low visibility, occlusion, or blurring. This requires a street view segmentation algorithm that can discover and recognize difficult targets.

In addition, street view understanding requires segmentation techniques that handle different data forms (video and images) and different result representations. How can we build segmentation technology that iterates efficiently? How can we ensure that different segmentation algorithms cooperate and complement each other in performance, while keeping multiple algorithms running under limited computing resources and maintenance costs? How can segmentation tasks be combined with other tasks to become more unified and open? These are also problems that street view segmentation urgently needs to solve.


Figure 2 Examples of street view images

To address these problems, the Meituan Street View Understanding Team has explored segmentation technology extensively, constructing a series of high-precision segmentation algorithms for real, complex scenes that achieve precise segmentation of complex targets and discovery and recognition of difficult targets. At the same time, the team has put efficient segmentation technology into practice, enabling efficient iteration and application of segmentation models, and has explored unified and open street view visual segmentation. The resulting technologies have brought clear gains to street view understanding; the related algorithms were accepted as Workshop papers at CVPR 2023 and won 2 championships and 1 third place in international competitions.

2 Research Status

2.1 Segmentation Technology System

Segmentation, one of the three basic tasks of computer vision (alongside classification and detection), represents the position and contour of targets at pixel granularity. After computer vision entered the deep learning era, the segmentation task was further subdivided into various subtasks according to application scenario. By data form, these fall into two categories, image segmentation and video segmentation, as shown in Figure 3 below (images from [3][15] and other references):


Figure 3 Examples of different segmentation tasks

The processing object of image segmentation is a single image. According to the representation of the output, semantic segmentation, instance segmentation, and panoptic segmentation developed in turn. Semantic segmentation assigns each pixel in the image to a semantic category; representative works include FCN[4], U-Net[5], DeepLab[6], OCRNet[7], and SegFormer[8]. Instance segmentation aims to segment each instance in the image and identify its semantic category; representative works include Mask R-CNN[9], YOLACT[10], and SOLO[11]. Panoptic segmentation assigns all pixels in the image to semantic categories while also distinguishing different instances; representative works include EfficientPS[12], Panoptic-DeepLab[13], and UPSNet[14]. Beyond these, other image segmentation techniques are needed in many scenarios, such as image matting and salient object segmentation.
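To make the three output formats concrete, here is a minimal sketch (with hypothetical shapes and class ids, not tied to any particular model) of how their results are typically represented:

```python
import numpy as np

# Minimal sketch of the three image-segmentation output formats,
# for a hypothetical H x W image with classes {0: road, 1: car, ...}.
H, W = 4, 6

# Semantic segmentation: one class id per pixel; instances are not separated.
semantic_map = np.zeros((H, W), dtype=np.int64)      # e.g. all "road"
semantic_map[1:3, 2:5] = 1                           # a region of "car" pixels

# Instance segmentation: one binary mask + class id + score per instance.
instances = [
    {"mask": semantic_map == 1, "class_id": 1, "score": 0.9},
]

# Panoptic segmentation: every pixel gets (class id, instance id);
# "stuff" classes (road, sky) share instance id 0, "things" are numbered.
panoptic_map = np.stack([semantic_map, np.zeros((H, W), dtype=np.int64)])
panoptic_map[1][instances[0]["mask"]] = 1            # car instance #1
```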

The processing object of video segmentation is a video sequence with temporal relationships; it attends not only to spatial information within a single frame but also to temporal information across frames. According to the forms of input and output, video segmentation divides into video object segmentation, video semantic segmentation, video instance segmentation, video panoptic segmentation, and so on; video object segmentation is further divided into subtasks such as automatic, semi-automatic, interactive, and language-guided video object segmentation. Automatic video object segmentation uses only the video frames as input to segment the region where the object lies; representative works include OSVOS[16] and MATNet[17]. Semi-automatic video object segmentation takes the target to be segmented as an additional input beyond the video sequence; representative works include SiamMask[18], STM[19], and AOT[20]. Interactive video object segmentation segments video foreground objects under user interaction; representative works include MiVOS[21]. Language-guided video object segmentation specifies the target to be segmented through input text; representative works include CMSANet[22].

Beyond the above groupings, segmentation tasks can also be classified by learning paradigm or degree of supervision. According to how restrictive the training annotations are, segmentation can be divided into fully supervised, unsupervised, semi-supervised, and weakly supervised segmentation. When a segmentation model actually runs, the computing resources of the device are often limited; for example, some mobile devices provide very little compute, yet practical demands still require a degree of real-time performance. This requires the segmentation model to be efficient in its architecture design, and some works focus on this, such as BiSeNet[23] and STDCNet[24]. In addition, semantic categories in the real world are complex and diverse, and the data distribution across categories is uneven; research directions such as open-set segmentation and long-tail segmentation have therefore emerged.

2.2 Current Status of Street View Segmentation

Many methods have been proposed to address the problems in street view segmentation.

To achieve accurate segmentation of complex targets, PSPNet [25] proposes a pyramid pooling module that uses contextual information at different scales to segment targets of different sizes. OCRNet [7] introduces an object-contextual representation module to enhance features with contextual information when predicting each pixel's semantic category. SegFormer [8] proposes a hierarchical Transformer encoder that outputs multi-scale features, together with a multilayer perceptron decoder that aggregates information from different layers to obtain segmentation results for different targets. There are also video-based methods such as TMANet [26], which enhances the current frame using features from adjacent frames. Among existing methods, image-based approaches mainly start from contextual feature enhancement, but because a single frame from a single view cannot fully describe the scene, they struggle to segment complex targets accurately. Video-based methods mainly enhance the current frame using features from surrounding frames, but because aligning and fusing multi-frame features is difficult, precise segmentation of complex targets remains an open problem.
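As an illustration of the context-enhancement idea, the following is a minimal PyTorch sketch of a PSPNet-style pyramid pooling module [25]; the bin sizes and channel widths are illustrative choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Sketch of a PSPNet-style pyramid pooling module (our reading of [25];
    bin sizes and channel widths are illustrative, not the exact config)."""
    def __init__(self, in_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        out_ch = in_ch // len(bins)
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),               # pool features to a b x b grid
                nn.Conv2d(in_ch, out_ch, 1, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for b in bins
        )

    def forward(self, x):
        h, w = x.shape[2:]
        # Upsample each pooled context map back to the input resolution
        ctx = [F.interpolate(s(x), size=(h, w), mode="bilinear", align_corners=False)
               for s in self.stages]
        return torch.cat([x, *ctx], dim=1)             # concat local + multi-scale context

feat = torch.randn(1, 256, 32, 64)
out = PyramidPooling(256)(feat)                        # -> (1, 512, 32, 64)
```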

To discover and recognize difficult targets, hard example mining techniques are generally used to enhance the discriminative power of features. To reduce the computational overhead of identifying hard samples, existing work has mainly explored two directions: exact search within each batch, and approximate search across the entire dataset. For example, OHEM [27] automatically selects hard samples within a batch based on loss feedback, making training more effective and efficient while reducing heavy heuristic search and hyperparameters. UHEM [28] automatically obtains a large number of hard negative samples by analyzing the outputs of trained detectors on video sequences. SCHEM [29] tracks class signatures of feature embeddings online at small additional cost during training, and uses these signatures to identify hard negative samples. These methods strengthen learning on hard samples by optimizing the training strategy, thereby discovering and recognizing difficult targets. However, because they achieve target discovery only indirectly, by constraining model training, they cannot completely solve the problem of discovering and recognizing difficult targets.
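For concreteness, here is a minimal sketch of pixel-level online hard example mining in the spirit of OHEM [27], applied to a segmentation loss; the keep ratio and shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def ohem_cross_entropy(logits, labels, keep_ratio=0.25, ignore_index=255):
    """Sketch of pixel-level online hard example mining in the spirit of
    OHEM [27]: back-propagate only the hardest fraction of pixels."""
    # Per-pixel loss without reduction: shape (N, H, W)
    loss = F.cross_entropy(logits, labels, ignore_index=ignore_index,
                           reduction="none").flatten()
    valid = loss[labels.flatten() != ignore_index]
    k = max(1, int(valid.numel() * keep_ratio))
    hard, _ = torch.topk(valid, k)          # largest losses = hardest pixels
    return hard.mean()

logits = torch.randn(2, 19, 64, 64)         # e.g. 19 Cityscapes classes
labels = torch.randint(0, 19, (2, 64, 64))
print(ohem_cross_entropy(logits, labels))
```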

Researchers have also made many efforts toward efficient iteration of segmentation models. To improve annotation efficiency, ScribbleSup[30], FocalClick[31], MiVOS[21], and others speed up pixel-level annotation of image or video targets through interactive segmentation. Because of long-tail distributions, CANet[32], PADing[33], and others reduce dependence on data from sparse categories through few-shot and zero-shot learning, and some works use resampling, class-balance loss functions, and similar techniques to alleviate the long-tail problem during training (a sketch of one such loss appears below). Model structure design also needs to focus on efficiency: models such as BiSeNet[23] and STDCNet[24] obtain better real-time performance through multi-branch network structures, while ShuffleNet[34] and MobileNet[35] use specialized operators and modules to reduce the computation and parameter count of the model.
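As a concrete example of the class-balance losses mentioned above, here is a sketch of the widely used "effective number of samples" weighting applied to cross-entropy; the beta value and per-class pixel counts are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def class_balanced_weights(pixel_counts, beta=0.9999):
    """Sketch of 'effective number of samples' class weights, one common
    class-balance recipe (beta and the normalization are illustrative)."""
    counts = torch.as_tensor(pixel_counts, dtype=torch.float32)
    effective = 1.0 - torch.pow(beta, counts)
    weights = (1.0 - beta) / effective                 # rare classes get larger weights
    return weights / weights.sum() * len(counts)       # normalize to mean ~1

# Hypothetical per-class pixel counts from a training set
weights = class_balanced_weights([1e8, 5e6, 2e4])
logits = torch.randn(2, 3, 16, 16)
labels = torch.randint(0, 3, (2, 16, 16))
loss = F.cross_entropy(logits, labels, weight=weights)
```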

With the development of segmentation technology, many different subtasks have been derived, such as semantic, instance, and panoptic segmentation, which differ in annotation form, output representation, and model structure. Finding a unified solution across segmentation algorithms and making full use of different forms of segmentation annotation are important research issues. Methods such as MaskFormer[36] and OneFormer[37] propose a general segmentation model structure that unifies semantic, instance, and panoptic segmentation and extends easily from image to video segmentation. The recently proposed Segment Anything Model (SAM) [38] is a zero-shot foundation segmentation model that can "segment anything"; many downstream segmentation applications can be built on SAM, such as semantic segmentation, instance segmentation, and video object segmentation.

3 Core Technologies

This section addresses the problems in street view segmentation and introduces the corresponding solutions from three aspects: high-precision segmentation in real, complex scenes; efficient iteration of segmentation models; and the move toward unified and open street view segmentation.

3.1 High-Precision Segmentation in Complex Scenes

3.1.1 Precise Segmentation of Complex Objects Based on Spatiotemporal Alignment

The challenge in accurately segmenting complex targets is that the predictions for some regions of a complex target often carry high uncertainty, making segmentation inaccurate or simply wrong. In general, fusing prediction information from surrounding frames can increase the certainty of the current frame's segmentation of an object and thus improve its accuracy. To this end, the Street View Understanding Team proposed the Motion-State Alignment Framework (MSAF) for precise segmentation of complex targets based on spatio-temporal alignment, as shown in Figure 4 below:


Figure 4 Framework for accurate segmentation of complex targets based on spatio-temporal alignment

MSAF rethinks the information gain brought by the video signal: the semantic information contained in video can be divided into dynamic semantics and static semantics. Dynamic semantics lives in the temporal relationships across consecutive frames and can strengthen the position information and feature description of target regions; static semantics lives in a single frame at a given moment and can effectively recover the detailed information of target regions. Together, dynamic and static semantics provide a segmentation model with complementary sources of certainty.

MSAF first extracts features from several adjacent frames of the video and enhances the current frame's dynamic and static semantic features through a dynamic feature alignment mechanism and a static feature alignment mechanism. It then extracts target region descriptors from the dynamic semantic features and target pixel descriptors from the static semantic features, computes the feature distance between each pixel descriptor and the region descriptors, and assigns each pixel an accurate region category, achieving precise segmentation of complex targets. A minimal sketch of this descriptor-matching step follows.
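The sketch below is our simplified reading of that final matching step, not the released MSAF implementation; the shapes and the cosine-similarity choice are assumptions:

```python
import torch
import torch.nn.functional as F

def assign_pixels_to_regions(pixel_feats, region_descs):
    """Sketch of the descriptor-matching step described above (simplified,
    not the released MSAF code): each pixel takes the category of its
    nearest region descriptor.
    pixel_feats:  (C, H, W) static-semantic pixel descriptors
    region_descs: (K, C)    dynamic-semantic region descriptors
    """
    C, H, W = pixel_feats.shape
    p = F.normalize(pixel_feats.reshape(C, -1), dim=0)   # (C, H*W), unit length
    r = F.normalize(region_descs, dim=1)                 # (K, C)
    sim = r @ p                                          # (K, H*W) cosine similarity
    return sim.argmax(dim=0).reshape(H, W)               # per-pixel region/category id

seg = assign_pixels_to_regions(torch.randn(64, 32, 32), torch.randn(19, 64))
```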

Compared with mainstream image- and video-level segmentation methods, this spatio-temporal alignment approach achieves leading accuracy on datasets such as Cityscapes[1] and CamVid[39], with faster inference.

3.1.2 Difficult Target Discovery Based on Automatic Sample Mining

To solve the problem of discovering and recognizing difficult targets, the Street View Understanding Team proposed a difficult target discovery framework based on sample perception, excavation, and purification (Perceive, Excavate and Purify, PEP), as shown in Figure 5 below:


Figure 5 Difficult target discovery framework based on automatic sample mining

First, features at different scales are extracted with a feature pyramid backbone; these features are then fed into three branches: an instance-aware branch, an instance description branch, and a feature learning branch.

The instance-aware branch classifies each pixel of the backbone features to make a preliminary judgment on whether an instance exists at that location. The instance description branch learns original feature descriptors for different instances and uses a sample-excavation sub-network to mine difficult targets, representing them as additional instance descriptors. An instance-association sub-network is further introduced to increase the feature similarity within the same instance and reduce it across different instances, purifying the targets and further improving segmentation performance. Finally, the original and mined instance descriptors are convolved with the common features from the feature learning branch to obtain the segmentation result for each target, as sketched below.
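That last step can be sketched as follows, in the spirit of dynamic-kernel methods such as SOLO [11]; this is a simplified illustration, not the exact PEP implementation:

```python
import torch
import torch.nn.functional as F

def masks_from_descriptors(common_feats, inst_descs):
    """Sketch of the final step described above (simplified, in the spirit of
    dynamic-kernel methods such as SOLO [11]): each instance descriptor acts
    as a 1x1 conv kernel over the shared feature map.
    common_feats: (1, C, H, W) features from the feature-learning branch
    inst_descs:   (M, C)       original + mined instance descriptors
    """
    kernels = inst_descs.unsqueeze(-1).unsqueeze(-1)   # (M, C, 1, 1)
    logits = F.conv2d(common_feats, kernels)           # (1, M, H, W)
    return logits.sigmoid() > 0.5                      # one binary mask per instance

masks = masks_from_descriptors(torch.randn(1, 64, 32, 32), torch.randn(5, 64))
```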

Compared with mainstream segmentation methods, this automatic-sample-mining approach to difficult target discovery achieves leading accuracy on datasets such as COCO [40].

3.2 Efficient Iteration of Segmentation Models

To adapt to the varied real scenarios in street view understanding and to meet new business needs, street view segmentation models must be iterated continuously, so an efficient iteration method is needed. After long-term exploration, the Street View Understanding Team built an efficient data-model closed loop for segmentation tasks, which accumulates a large amount of high-quality labeled segmentation data at limited cost, continuously improves segmentation performance, and completes model iteration efficiently to satisfy the customized needs of real business scenarios. The overall flow of the data-model closed loop is shown in Figure 6 below:


Figure 6 Data-model closed-loop process

In actual street view understanding business scenarios, a large amount of unlabeled data can be obtained through data backflow. These unlabeled data can be run through many street view understanding models to obtain rich and diverse label attributes, making it possible to build a systematic label tree for street view understanding that covers various complex scenarios with a rich hierarchical structure. When new business requirements arrive, this label tree allows a large amount of highly relevant data to be obtained promptly and efficiently. In addition, through active learning over the large pool of unlabeled data, the model can screen and mine the data with higher uncertainty, which is most valuable for model iteration; a sketch of one common uncertainty criterion follows.
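One common criterion for such active-learning mining is prediction entropy; the sketch below is a generic recipe under assumed model and data interfaces, not the team's actual selection logic:

```python
import torch

@torch.no_grad()
def uncertainty_score(model, image):
    """Sketch of entropy-based active learning for data mining (one common
    recipe; the actual selection criteria used here are not specified)."""
    probs = model(image).softmax(dim=1)                        # (1, K, H, W)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(1)    # (1, H, W)
    return entropy.mean().item()     # higher = more uncertain = more valuable

# Hypothetical usage over an unlabeled pool:
# scores = [(uncertainty_score(model, img), path) for path, img in unlabeled_pool]
# to_label = sorted(scores, reverse=True)[:budget]   # send top-k to annotation
```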

For this high-value data, semi-automatic annotation, in which the model and annotators collaborate efficiently, yields data with better-guaranteed label quality, and pseudo-labeling with existing models yields a large amount of additional pseudo-labeled data; model iteration is then completed through supervised or semi-supervised training. The resulting model, with better performance and richer capabilities, not only empowers business scenarios more effectively but also assists every link in the data-model closed loop. In this way, the data-model closed loop keeps iterating and circulating.

When actually deploying segmentation models, the recognition accuracy required by the business must be balanced against the model's computing resources. To this end, a segmentation model family spanning lightweight, middleweight, and heavyweight models was constructed. With its small parameter count and high throughput, the lightweight model is often used for on-device deployment with limited compute or server-side deployment with very high call volume; the middleweight model serves scenarios requiring higher precision at medium call volume; and the heavyweight model, relying on its very high capacity and recognition accuracy, helps lighter models improve through model distillation, pseudo-label generation, and pre-annotation, and drives the data-model closed loop, so that its strengths reach actual business scenarios. A sketch of the distillation step follows.
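Distillation from a heavyweight to a lightweight model can be sketched with the standard logit-distillation loss below; the temperature and mixing weight are illustrative, not the team's settings:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Sketch of logit distillation from a heavyweight to a lightweight
    segmenter (standard KD recipe; T and alpha are illustrative)."""
    # Soft targets: KL between temperature-softened per-pixel distributions
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: the usual supervised cross-entropy
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

s = torch.randn(2, 19, 64, 64)                # lightweight student logits
t = torch.randn(2, 19, 64, 64)                # heavyweight teacher logits
y = torch.randint(0, 19, (2, 64, 64))
print(distill_loss(s, t, y))
```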

In addition, model quantization, high-performance deployment, and similar methods further improve execution efficiency and reduce computing costs, enabling efficient application of segmentation models.
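As one illustration of a typical deployment path (an assumption about tooling, not a description of the actual pipeline), a trained PyTorch segmenter can be exported to ONNX and then served by an optimized runtime such as ONNX Runtime or TensorRT, optionally with INT8 quantization:

```python
import torch
import torch.nn as nn

# Sketch of a typical deployment step (illustrative tooling assumption):
# export a trained segmentation network to ONNX for an optimized runtime.
model = nn.Sequential(                        # stand-in for a real segmenter
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 19, 1),                     # 19-class logits
).eval()

dummy = torch.randn(1, 3, 512, 1024)          # typical street-view input size
torch.onnx.export(
    model, dummy, "segmenter.onnx",
    input_names=["image"], output_names=["logits"],
    opset_version=13,
    dynamic_axes={"image": {0: "batch"}, "logits": {0: "batch"}},
)
```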

3.3 Towards Unified and Open Street View Segmentation

Recently, with the popularity of natural language processing and multimodal large models such as ChatGPT[41] and Stable Diffusion[42], attention to large and unified models keeps rising, and unified large vision models are gaining momentum: the Segment Anything Model[38], UniAD[43], and others have shown the potential of general, unified foundation models in vision. The Street View Understanding Team is likewise exploring unified large models for segmentation.

In image segmentation, the team explores a unified model structure covering semantic segmentation, instance segmentation, panoptic segmentation, edge detection, and other tasks, fully exploiting all kinds of segmentation annotations through multi-task training so that different tasks reinforce one another. In video segmentation, the team is similarly exploring a unified model structure covering video semantic, instance, panoptic, and object segmentation, making full use of existing video segmentation annotations and image annotations when video annotation is hard to obtain. Transferring knowledge learned in image segmentation to video segmentation, and vice versa, is also very important. A minimal sketch of the unified mask-classification formulation behind such models appears below.
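The mask-classification formulation popularized by MaskFormer [36] gives a feel for how one structure can cover several segmentation tasks; the sketch below shows only simplified semantic-map inference (real models add query matching, losses, and a no-object class):

```python
import torch

def semantic_from_mask_classification(mask_logits, class_logits):
    """Sketch of the mask-classification formulation popularized by
    MaskFormer [36] (simplified): N query masks + N class distributions
    jointly cover semantic / instance / panoptic outputs.
    mask_logits:  (N, H, W) one soft mask per query
    class_logits: (N, K)    one class distribution per query
    """
    masks = mask_logits.sigmoid()                      # (N, H, W)
    cls = class_logits.softmax(dim=-1)                 # (N, K)
    # Per-pixel class scores: sum over queries of mask prob * class prob
    scores = torch.einsum("nhw,nk->khw", masks, cls)
    return scores.argmax(dim=0)                        # (H, W) semantic map

seg = semantic_from_mask_classification(torch.randn(100, 32, 32),
                                        torch.randn(100, 19))
```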

The integration of segmentation with other visual tasks is also an important direction, and it matters to the overall street view understanding technology stack. For example, fusing segmentation with classification and detection can not only reduce resource consumption and improve overall system throughput in compute-limited scenarios, but also fully exploit the supervision signals of different visual tasks.

Beyond unifying visual tasks, cross-modal research related to segmentation also has great potential, such as open-set segmentation combined with the text modality and text-guided referring segmentation. These directions not only extend segmentation to a more open real-world environment, but also improve human interaction with segmentation models through the bridge of text, fulfilling customized segmentation requirements more quickly and accurately. Higher-level semantic reasoning built on segmentation is also of great value: with fine-grained scene understanding and semantic parsing capabilities such as segmentation as a foundation, supplemented by large language models with prior knowledge and strong logical reasoning, enormous application value can be generated in street view understanding scenarios in the future.

In short, a more unified and open street view visual segmentation model is an important direction now and for the future, and the Street View Understanding Team will continue to practice, accumulate, and explore it.

4 CVPR 2023 Technical Achievements

Based on its accumulated segmentation technology in street view understanding, the Street View Understanding Team published 2 Workshop papers at CVPR 2023 and won 2 championships and 1 third place in the related competitions. The corresponding technical achievements have also been filed as a number of national patents.

4.1 Publication of Papers

The spatio-temporal alignment method for precise segmentation of complex targets and the automatic-sample-mining method for difficult target discovery were consolidated into two academic papers, "Motion-state Alignment for Video Semantic Segmentation" [44] and "Perceive, Excavate and Purify: A Novel Object Mining Framework for Instance Segmentation" [45], both accepted as CVPR 2023 Workshop papers (8 pages each).

4.2 Double-Track Championships in the Adverse-Weather Street View Understanding Competition

In autonomous driving scenarios, severe weather (fog, rain, snow, low light, night, overexposure, shadows, etc.) greatly interferes with the perception system. To ensure that self-driving cars run smoothly in such conditions, the perception system must be able to handle extreme weather. Although computer vision performance in street view understanding is advancing rapidly, existing evaluation benchmarks mainly focus on clear conditions (good weather, favorable lighting), and even the current best-performing algorithms suffer significant performance degradation under adverse conditions. To this end, the ACDC Challenge proposes an evaluation benchmark specifically for adverse conditions, to promote research on robust vision algorithms under adverse weather and lighting.

In this competition, the Street View Understanding Team won the championship in both the semantic segmentation and panoptic segmentation tracks.

4.3 Third Place in the Video Panoptic Segmentation Track of the Video Understanding Competition

Pixel-level scene understanding is one of the fundamental problems in computer vision; it aims to identify the semantic category and exact location of each object in a given image. Since the real world is dynamic rather than static, video-oriented panoptic segmentation is reasonable and practical for real-world applications. The PVUW Challenge therefore proposes a practical and challenging large-scale video panoptic segmentation dataset of natural scenes and holds a competition to promote algorithm research on video panoptic segmentation.

In this competition, the Street View Understanding Team won third place in the video panoptic segmentation track.

5 Business Applications

Segmentation technology in Meituan's street view understanding is widely used in many business scenarios, including maps, autonomous delivery, and in-store scene capture.

Maps are essential infrastructure for Meituan's local life services, and automatic production of map data is an important link in the map business. This link mainly extracts and processes various traffic elements from imagery and other data, and segmentation plays an important role in it. First, segmentation is applied to low-quality image filtering: it extracts regions such as roads, vehicles, and lens occlusions, identifying low-quality images (for example, congested roads or blocked lenses) and preventing them from affecting map data production. Segmentation is also applied to traffic element extraction, effectively extracting the location, outline, semantic, and instance information of all kinds of lane lines (single white solid lines, double yellow dashed lines, four-line markings, etc.) and various physical separators (fences, water-filled barriers, concrete piers, etc.) for subsequent element production.

Segmentation is also used for road structure extraction, analyzing the layout of main and auxiliary roads, intersections, and so on, while simultaneously supporting location extraction for traffic elements. In addition, segmentation is used in satellite image recognition, for example to automatically extract buildings for rendering on the map front end, as shown in Figure 7 below:


Figure 7 Front-end rendering of the Wangjing area of Beijing in the Meituan map

The autonomous delivery business serves Meituan's core businesses such as food delivery and errand-running, improving delivery efficiency and user experience. High-definition (HD) maps are core infrastructure for autonomous driving, and to keep their cost low and freshness high, segmentation of video image data plays an important role in HD map production. Segmentation of elements such as lane lines, ground arrows, and traffic signs provides key information for traffic element extraction, and segmentation of facilities such as poles and bus stops provides key information for traffic facility extraction, improving the automation rate and accuracy of element extraction in HD map production.

In addition, an efficient semantic segmentation model provides semantic segmentation over 70+ categories for the HD map localization layer, supporting its production, while high-precision semantic segmentation supports semi-supervised learning of the perception model. Video image segmentation also provides element extraction and tracking capabilities for building visual HD maps, improving the success rate, automation rate, and accuracy of map construction.

The in-store scene capture business covers a wide range of offline indoor scenes; its purpose is to provide vision-based indoor mapping and rendering solutions, delivering understanding and rendering of the geometry, semantics, and objects of offline scenes, with segmentation playing an important role. In this business, segmentation improves reconstruction accuracy and rendering success rate, and effectively supports store layout estimation, business-area calculation, and automatic counting of key facilities.

Beyond these, segmentation also plays an important role in applications such as intelligent annotation and data generation, and continues to empower other technologies in street view understanding.

6 Summary and Outlook

Segmentation technology plays an important role in street view understanding, but it also faces many challenges. To meet these challenges, the Meituan Street View Understanding Team has built a segmentation technology system that balances accuracy and efficiency, achieving notable results in business applications.

As artificial intelligence advances, segmentation in street view understanding will become more accurate, more versatile, and more intelligent. Combining multi-source data with automatic iteration of advanced models, segmentation will grow ever more accurate; combining language models, large vision models, and other advanced technologies, segmentation will gradually realize "segmenting anything" in the open world; and drawing on the successful experience of large language models, segmentation will also move toward modeling and reasoning about higher-level semantic relations.

In the future, the Meituan Street View Understanding Team will continue to promote the application and evolution of visual technology in street view understanding, and provide more efficient and convenient technical support for application scenarios such as scene reconstruction, autonomous driving, and robot navigation.

7 Authors

Jinming, Wangwang, Yiting, Xingyue, Junfeng, et al., all from the Visual Intelligence Department of Meituan's Basic R&D Platform.

8 References

[1] Cordts, Marius and Omran, Mohamed and Ramos, Sebastian and Rehfeld, Timo and Enzweiler, Markus and Benenson, Rodrigo and Franke, Uwe and Roth, Stefan and Schiele, Bernt. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.

[2] Christos Sakaridis, Dengxin Dai, and Luc Van Gool. ACDC: The Adverse Conditions Dataset with Correspondences for Semantic Driving Scene Understanding. In ICCV, 2021.

[3] Kirillov, Alexander and He, Kaiming and Girshick, Ross and Rother, Carsten and Dollár, Piotr. Panoptic segmentation. In CVPR, 2019.

[4] Long, Jonathan, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.

[5] Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.

[6] Chen, Liang-Chieh and Papandreou, George and Kokkinos, Iasonas and Murphy, Kevin and Yuille, Alan L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2017.

[7] Yuan, Yuhui, Xilin Chen, and Jingdong Wang. Object-contextual representations for semantic segmentation. In ECCV, 2020.

[8] Xie, Enze and Wang, Wenhai and Yu, Zhiding and Anandkumar, Anima and Alvarez, Jose M and Luo, Ping. SegFormer: Simple and efficient design for semantic segmentation with transformers. In NeurIPS, 2021.

[9] He, Kaiming and Gkioxari, Georgia and Dollár, Piotr and Girshick, Ross. Mask R-CNN. In ICCV, 2017.

[10] Bolya, Daniel and Zhou, Chong and Xiao, Fanyi and Lee, Yong Jae. YOLACT: Real-time instance segmentation. In ICCV, 2019.

[11] Wang, Xinlong and Kong, Tao and Shen, Chunhua and Jiang, Yuning and Li, Lei. SOLO: Segmenting objects by locations. In ECCV, 2020.

[12] Mohan, Rohit and Valada, Abhinav. Efficientps: Efficient panoptic segmentation. IJCV, 2021.

[13] Cheng, Bowen and Collins, Maxwell D and Zhu, Yukun and Liu, Ting and Huang, Thomas S and Adam, Hartwig and Chen, Liang-Chieh. Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In CVPR, 2020.

[14] Xiong, Yuwen and Liao, Renjie and Zhao, Hengshuang and Hu, Rui and Bai, Min and Yumer, Ersin and Urtasun, Raquel. Upsnet: A unified panoptic segmentation network. In CVPR, 2019.

[15] Zhou, Tianfei and Porikli, Fatih and Crandall, David J and Van Gool, Luc and Wang, Wenguan. A survey on deep learning technique for video segmentation. TPAMI, 2022.

[16] Caelles, Sergi and Maninis, Kevis-Kokitsi and Pont-Tuset, Jordi and Leal-Taixé, Laura and Cremers, Daniel and Van Gool, Luc. One-shot video object segmentation. In CVPR, 2017.

[17] Zhou, Tianfei and Li, Jianwu and Wang, Shunzhou and Tao, Ran and Shen, Jianbing. Matnet: Motion-attentive transition network for zero-shot video object segmentation. TIP, 2020.

[18] Wang, Qiang and Zhang, Li and Bertinetto, Luca and Hu, Weiming and Torr, Philip HS. Fast online object tracking and segmentation: A unifying approach. In CVPR, 2019.

[19] Oh, Seoung Wug and Lee, Joon-Young and Xu, Ning and Kim, Seon Joo. Video object segmentation using space-time memory networks. In ICCV, 2019.

[20] Yang, Zongxin, Yunchao Wei, and Yi Yang. Associating objects with transformers for video object segmentation. In NeurIPS, 2021.

[21] Cheng, Ho Kei, Yu-Wing Tai, and Chi-Keung Tang. Modular interactive video object segmentation: Interaction-to-mask, propagation and difference-aware fusion. In CVPR, 2021.

[22] Ye, Linwei and Rochan, Mrigank and Liu, Zhi and Zhang, Xiaoqin and Wang, Yang. Referring segmentation in images and videos with cross-modal self-attention network. TPAMI, 2021.

[23] Yu, Changqian and Wang, Jingbo and Peng, Chao and Gao, Changxin and Yu, Gang and Sang, Nong. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In ECCV, 2018.

[24] Fan, Mingyuan and Lai, Shenqi and Huang, Junshi and Wei, Xiaoming and Chai, Zhenhua and Luo, Junfeng and Wei, Xiaolin. Rethinking bisenet for real-time semantic segmentation. In CVPR, 2021.

[25] Zhao, Hengshuang and Shi, Jianping and Qi, Xiaojuan and Wang, Xiaogang and Jia, Jiaya. Pyramid scene parsing network. In CVPR, 2017.

[26] Hao Wang, Weining Wang, and Jing Liu. Temporal memory attention for video semantic segmentation. In ICIP, 2021.

[27] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016.

[28] SouYoung Jin, Aruni RoyChowdhury, Huaizu Jiang, Ashish Singh, Aditya Prasad, Deep Chakraborty, and Erik Learned-Miller. Unsupervised hard example mining from videos for improved object detection. In ECCV, 2018.

[29] Yumin Suh, Bohyung Han, Wonsik Kim, and Kyoung Mu Lee. Stochastic class-based hard example mining for deep metric learning. In CVPR, 2019.

[30] Lin, Di and Dai, Jifeng and Jia, Jiaya and He, Kaiming and Sun, Jian. Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. In CVPR, 2016.

[31] Chen, Xi and Zhao, Zhiyan and Zhang, Yilei and Duan, Manni and Qi, Donglian and Zhao, Hengshuang. Focalclick: Towards practical interactive image segmentation. In CVPR, 2022.

[32] Zhang, Chi and Lin, Guosheng and Liu, Fayao and Yao, Rui and Shen, Chunhua. Canet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In CVPR, 2019.

[33] He, Shuting, Henghui Ding, and Wei Jiang. Primitive generation and semantic-related alignment for universal zero-shot segmentation. In CVPR, 2023.

[34] Zhang, Xiangyu and Zhou, Xinyu and Lin, Mengxiao and Sun, Jian. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In CVPR, 2018.

[35] Howard, Andrew G and Zhu, Menglong and Chen, Bo and Kalenichenko, Dmitry and Wang, Weijun and Weyand, Tobias and Andreetto, Marco and Adam, Hartwig. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv, 2017.

[36] Cheng, Bowen, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. In NeurIPS, 2021.

[37] Jain, Jitesh and Li, Jiachen and Chiu, Mang Tik and Hassani, Ali and Orlov, Nikita and Shi, Humphrey. Oneformer: One transformer to rule universal image segmentation. In CVPR, 2023.

[38] Kirillov, Alexander and Mintun, Eric and Ravi, Nikhila and Mao, Hanzi and Rolland, Chloe and Gustafson, Laura and Xiao, Tete and Whitehead, Spencer and Berg, Alexander C and Lo, Wan-Yen and others. Segment anything. arXiv, 2023.

[39] Gabriel J Brostow, Jamie Shotton, Julien Fauqueur, and Roberto Cipolla. Segmentation and recognition using structure from motion point clouds. In ECCV, 2008.

[40] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.

[41] OpenAI. ChatGPT. https://openai.com/blog/chatgpt, 2022.

[42] Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Björn. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.

[43] Hu, Yihan and Yang, Jiazhi and Chen, Li and Li, Keyu and Sima, Chonghao and Zhu, Xizhou and Chai, Siqi and Du, Senyao and Lin, Tianwei and Wang, Wenhai and others. Planning-oriented autonomous driving. In CVPR, 2023.

[44] Su, Jinming and Yin, Ruihong and Zhang, Shuaibin and Luo, Junfeng. Motion-state Alignment for Video Semantic Segmentation. In CVPRW, 2023.

[45] Su, Jinming and Yin, Ruihong and Chen, Xingyue and Luo, Junfeng. Perceive, Excavate and Purify: A Novel Object Mining Framework for Instance Segmentation. In CVPRW, 2023.

----------  END  ----------

 Meituan Scientific Research Cooperation 

Meituan's scientific research cooperation program is committed to building a bridge and platform between Meituan's technical teams and universities, research institutions, and think tanks. Relying on Meituan's rich business scenarios, data resources, and real industrial problems, it pursues open innovation and pooled strengths, focusing on robotics, artificial intelligence, big data, the Internet of Things, autonomous driving, operations optimization, and other fields, to jointly explore cutting-edge technologies and industry-wide macro issues, promote industry-university-research cooperation, exchange, and technology transfer, and cultivate outstanding talent. Looking to the future, we hope to work with more teachers and students from universities and research institutes. Teachers and students are welcome to email: [email protected].

 recommended reading 

  |  Meituan Visual GPU Inference Service Deployment Architecture Optimization Practice

  |  The quantitative deployment of the general target detection open source framework YOLOv6 in Meituan

  |  The target detection open source framework YOLOv6 has been fully upgraded, and the faster and more accurate version 2.0 is coming
