101 CV models open-sourced in one batch: an in-depth look at visual AI in the ModelScope community

Author: Xie Xuansong, Open Visual Intelligence Team, Alibaba DAMO Academy

On November 3, at the 2022 Yunqi (Apsara) Conference, Alibaba DAMO Academy and the CCF Open Source Development Committee jointly launched ModelScope, an AI model community aiming to lower the barrier to AI application.

AI models are relatively complex; applying them to industry scenarios often requires retraining, which keeps AI in the hands of a small number of algorithm engineers and prevents it from becoming widespread. The newly launched ModelScope community implements the new concept of Model as a Service (MaaS) and provides many pre-trained base models that can be put to use quickly with only light tuning for a specific scenario.

DAMO Academy took the lead by contributing more than 300 validated, high-quality AI models to the ModelScope community, more than a third of which are Chinese-language models. They are fully open source and open for use, turning models into directly usable services. The first batch of open-source models covers the main directions of AI, including vision, speech, natural language processing, and multimodality, actively explores new fields such as AI for Science, and spans more than 60 mainstream tasks. The models have been screened and verified by experts and include more than 150 SOTA (state-of-the-art) models and more than 10 large models.

In this article, Xie Xuansong, head of open visual intelligence at Alibaba DAMO Academy, analyzes in depth the first batch of 101 open-source visual AI models in the ModelScope community.


Computer vision is the cornerstone of artificial intelligence and also its most widely applied branch. From the face recognition that unlocks our phones in daily life to cutting-edge autonomous driving, visual AI has already proven its value. As a visual AI researcher, I believe its potential is still far from fully realized: even with all of our researchers' efforts, we can cover only a small number of industries and scenarios, far short of what society as a whole needs.

Therefore, on the ModelScope AI model community, we decided to fully open source the visual AI models developed by DAMO Academy. The first batch comprises 101 models, most of which are SOTA or have been battle-tested in production. We hope this enables more developers to use visual AI, and we look forward to AI becoming one of the driving forces of human progress.

ModelScope community address: modelscope.cn

1. Overview: Human-Centered Visual AI

Over the years, DAMO Academy, as Alibaba's basic research institution and talent hub, has developed a range of excellent visual AI capabilities across Alibaba's vast business scenarios, distributed throughout the pipeline:

 

These visual AI technologies cover almost the entire spectrum from understanding to generation. Because there are so many visual tasks, we need a reasonable classification scheme; tasks can be divided along several dimensions, such as modality, object, function, and scene:

The first batch of major visual task models released in the ModelScope community includes both academically innovative SOTA technologies and battle-tested production models. Along the "function/task" dimension, they cover the common categories of perception and understanding, enhancement, and editing and generation:

Although visual technology may seem complicated, it has one core: the study of "objects", and "people" have always been the most important object. Human-centered visual AI is also the earliest, most deeply developed, and most widely applied technology. Let us start with a photo of a person. AI first needs to understand the photo/image: who is in it, what action is being performed, and whether the subject can be extracted from the background. Then we can go further: how good is the photo's quality, can it be improved, can the people in it be made more beautiful, or even turned into cartoon characters, digital humans, and so on.

The above 7 "human" related processes basically cover the categories of "understanding", "enhancement", and "editing" in visual tasks. Features, benefits, examples, and applications of the center's vision technology.

2. Perception and understanding models

2.1 Cut out portraits from photos

Model name: BSHM portrait matting

Experience link:

https://www.modelscope.cn/models/damo/cv_unet_image-matting/

Cutting a portrait out of a photo and removing the background is a very common need and one of the basic operations in Photoshop, but doing it manually is time-consuming and laborious, and the results are often unsatisfactory. The portrait matting model provided on ModelScope is fully automatic and end-to-end, and can achieve fine segmentation down to the level of individual hairs.

We have also innovated on the technical side. Unlike other models trained on large amounts of finely labeled data, our model achieves fine matting using coarsely labeled data, combining low data requirements with high precision.

Specifically, the model framework is divided into three parts: a coarse mask estimation network (MPN), a quality unification network (QUN), and a precise alpha matte estimation network (MRN). We decompose the complex problem into coarse segmentation (MPN) followed by fine segmentation (MRN). Coarsely labeled segmentation data is easy to obtain in large quantities in academia, but the quality inconsistency between coarse and fine labels creates a large gap, so we designed the quality unification network (QUN). MPN estimates coarse semantic information (a rough mask) and is trained jointly on coarsely and finely labeled data. QUN standardizes the quality of the coarse masks output by MPN. MRN takes the original image and the quality-unified coarse mask from QUN as input, estimates an accurate alpha matte, and is trained on precisely labeled data.
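For developers who want to try the model directly, a minimal usage sketch with the ModelScope Python SDK might look like the following; the exact task constant (`Tasks.portrait_matting`) and result key are assumptions based on common ModelScope conventions, so please check the model card for the authoritative snippet.

```python
# Minimal sketch: calling the portrait matting model through the ModelScope SDK.
# The task constant and output key below are assumptions; see the model card.
import cv2
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

matting = pipeline(Tasks.portrait_matting, model='damo/cv_unet_image-matting')
result = matting('person.jpg')             # local path or URL to a portrait photo
cv2.imwrite('person_matted.png', result['output_img'])  # image with the background removed
```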

Of course, matting and segmentation needs are very diverse, so we have also released a series of models supporting non-portrait matting and video matting. Developers can use them out of the box, for example to help designers cut out images with one click and greatly improve design efficiency, or to freely replace backgrounds for effects such as virtual meeting backgrounds, ID photos, and "time travel" scenes. These models are already widely used in Alibaba's own products (such as DingTalk video conferencing) and by cloud customers.

2.2 Human body keypoints and action recognition

Model name: HRNet 2D human body keypoints

Experience link:

https://www.modelscope.cn/models/damo/cv_hrnetv2w32_body-2d-keypoints_image/

 

This task uses a top-down human keypoint detection framework: 15 human body keypoints can be obtained from an image through fast end-to-end inference. The keypoint model is based on an improved HRNet backbone that makes full use of multi-resolution features to better handle everyday human poses, and it achieves higher accuracy on the COCO dataset's AP and AR50 metrics. We have also optimized it for sports and fitness scenarios, reaching SOTA detection accuracy in yoga, fitness, and other scenes featuring heavy occlusion, uncommon poses, and lying-down postures.

To serve various scenarios better, we continue to optimize it:

  • The large model for general scenarios achieves SOTA performance on benchmark metrics;

  • The small model for mobile deployment has a small memory footprint, runs fast and stably, and reaches 25-30 FPS on entry-level phones;

  • In-depth optimization for fitness counting and scoring scenarios such as yoga, rope skipping, sit-ups, push-ups, and high knee lifts, covering heavy occlusion, uncommon poses, and lying postures, improving the algorithm's accuracy and robustness.

This model has been widely used in AI fitness and physical-test scenarios, such as Alisports Ledongli, DingTalk Sports, and fitness mirrors. It can also serve as a basis for 3D keypoint detection, 3D human body reconstruction, and other applications.
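As a rough sketch of how this keypoint model could be called through the ModelScope SDK (the task constant `Tasks.body_2d_keypoints` and the structure of the returned result are assumptions; consult the model card for the exact snippet):

```python
# Sketch: 2D body keypoint estimation via the ModelScope pipeline API.
# Task constant and result layout are assumptions; check the model card.
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

keypoints = pipeline(
    Tasks.body_2d_keypoints,
    model='damo/cv_hrnetv2w32_body-2d-keypoints_image')
result = keypoints('athlete.jpg')   # image containing one or more people
print(result)                       # expected: per-person keypoints, scores, and boxes
```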

2.3 Summary

The two "people"-related models above both belong to the perception and understanding category. One must first understand the world before transforming it, and perception and understanding is the most fundamental and most widely used category of vision models. It can be further divided into three sub-categories: recognition, detection, and segmentation:

  • Recognition/classification is the most basic and classic task in vision (image, video, etc.) and also the most fundamental way living creatures understand the world through their eyes. Simply put, it determines whether a set of image data contains a specific object, image feature, or motion state, i.e., what the objects and content in an image or video are. Beyond that, it may also extract finer-grained information or descriptive labels for non-physical concepts.

  • Object detection finds the objects of interest in visual content and determines their positions and sizes; it is one of the core problems in machine vision. Typically, the localized targets are also classified and recognized at the same time.

  • Segmentation is another core vision task. Going a step further than recognition and detection, it answers the question of which object or scene each pixel belongs to, dividing the image into specific, distinct regions and extracting the targets of interest.

The ModelScope community has opened up a wealth of perception and understanding models for AI developers to try:

2.4 Easter egg: the first release of DAMO-YOLO

Model name: DAMO-YOLO high-performance general detection model-S

Experience link:

https://www.modelscope.cn/models/damo/cv_tinynas_object-detection_damoyolo/summary

 

Generic object detection is one of the fundamental problems in computer vision, with a very wide range of applications. DAMO-YOLO is a new object detection framework launched by Alibaba that balances speed and accuracy: its results surpass current YOLO-series methods while inference is faster. DAMO-YOLO also provides efficient training strategies and convenient, easy-to-use deployment tools to help developers quickly solve practical problems in industrial deployment.

DAMO-YOLO introduces TinyNAS technology, which lets users customize low-cost detection models according to their hardware's compute budget, improving hardware utilization and achieving higher accuracy. It also optimizes other key factors in the detector, such as neck and head structure design, and label assignment and data augmentation during training. Thanks to this series of optimizations, DAMO-YOLO significantly improves accuracy under strict latency constraints and becomes the new SOTA among YOLO-style frameworks.
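A minimal sketch of running DAMO-YOLO through the ModelScope SDK follows; the task constant (`Tasks.image_object_detection`) and result keys are assumptions based on typical ModelScope detection outputs, so verify them against the model card.

```python
# Sketch: generic object detection with DAMO-YOLO via ModelScope.
# Task constant and result keys are assumptions; check the model card.
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

detector = pipeline(
    Tasks.image_object_detection,
    model='damo/cv_tinynas_object-detection_damoyolo')
result = detector('street.jpg')
# Expected result keys: 'scores', 'labels', 'boxes' (one entry per detected object)
for score, label, box in zip(result['scores'], result['labels'], result['boxes']):
    print(f'{label}: {score:.2f} at {box}')
```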

3. Low-level vision models

3.1 Photo denoising and deblurring

Model name: NAFNet image denoising

Experience address:

https://www.modelscope.cn/models/damo/cv_nafnet_image-denoise_sidd/

 

Due to the shooting environment, equipment, or operator error, image quality is sometimes poor. How can we remove noise from and deblur such images? This model generalizes well across image restoration tasks, reaching current SOTA on both image denoising and image deblurring. Its key innovation is replacing the activation function with a simple multiplication operation, which speeds up processing without hurting performance.

The model's full name is the NAFNet (Nonlinear Activation Free Network) denoising model. It demonstrates that common nonlinear activation functions (Sigmoid, ReLU, GELU, Softmax, etc.) are not necessary: they can either be removed or replaced by a multiplication operation. This is an important innovation in CNN architecture design.
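To make the "multiplication instead of activation" idea concrete, here is a minimal PyTorch sketch of the gating operation described in the NAFNet paper (often called SimpleGate): the feature map is split in half along the channel dimension and the two halves are multiplied element-wise. The block and tensor shapes below are illustrative, not taken from this model's release.

```python
# Sketch of NAFNet's activation-free gating: split channels, then multiply.
import torch
import torch.nn as nn

class SimpleGate(nn.Module):
    """Replaces a nonlinear activation with an element-wise product of two channel halves."""
    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)  # split along the channel dimension
        return x1 * x2              # the multiplication plays the role of the "activation"

# The gate halves the channel count, so the preceding conv doubles it.
block = nn.Sequential(
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # 32 -> 64 channels
    SimpleGate(),                                 # 64 -> 32 channels
)
out = block(torch.randn(1, 32, 128, 128))
print(out.shape)  # torch.Size([1, 32, 128, 128])
```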

This model can serve as a preprocessing step for many applications, such as smartphone image denoising and motion-blur removal.

3.2 Photo Restoration and Enhancement

Model name: GPEN portrait enhancement model

Experience address:

https://www.modelscope.cn/models/damo/cv_gpen_image-portrait-enhancement/

Beyond denoising, photos often have higher quality requirements (resolution, detail texture, color, etc.), so we have also open-sourced a dedicated portrait enhancement model. It restores and enhances each detected face in the input image, applies RealESRNet for 2x super-resolution on the non-portrait regions, and finally returns the fully restored image. The model robustly handles most complex real-world degradations and can repair severely damaged portraits.

Technically, the GPEN portrait enhancement model embeds a pre-trained StyleGAN2 network as the decoder of the full model and then finetunes it to perform restoration, achieving industry-leading results on many metrics. In the future we will add pre-trained models supporting larger face resolutions such as 1024 and 2048, and keep iterating on quality. In terms of applications, the model can restore old family photos or historical photos of public figures, repair low-quality night shots from phones, restore portraits in old videos, and more.
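A minimal usage sketch with the ModelScope SDK might look like this; the task constant (`Tasks.image_portrait_enhancement`) and output key are assumptions, so confirm them against the model card.

```python
# Sketch: portrait restoration/enhancement with GPEN via ModelScope.
# Task constant and output key are assumptions; check the model card.
import cv2
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

enhancer = pipeline(
    Tasks.image_portrait_enhancement,
    model='damo/cv_gpen_image-portrait-enhancement')
result = enhancer('old_family_photo.jpg')
cv2.imwrite('restored.png', result['output_img'])  # restored full image
```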

3.3 Summary

Low-level vision focuses on image quality. Any living creature (humans included) is sensitive to the detail, shape, color, and smoothness conveyed by light and shadow, so pursuing high-quality images is only natural, and this is exactly where visual AI can help.

By task, this category can be divided into sharpness (resolution/detail, noise/scratches, frame rate), color (brightness, color cast, etc.), and blemish correction (skin optimization, watermark/subtitle removal), as shown in the following table:

4. Editing and generation models

4.1 Become more beautiful

Model name: ABPN portrait skin retouching

Experience address:

https://www.modelscope.cn/models/damo/cv_unet_skin-retouching/

People have a strong, persistent demand for more attractive photos and portraits, whether it is removing spots and blemishes, adjusting color, or reshaping body proportions. This time we have open-sourced professional-grade portrait skin retouching, liquify (reshaping), and other models for everyone to use.

The model proposes a novel adaptive blending module (ABM) that uses adaptive blending layers to retouch images precisely and locally. On top of ABM, we further build a blending layer pyramid, enabling fast retouching of ultra-high-definition images. Compared with existing retouching methods, ABPN greatly improves both accuracy and speed. The ABPN portrait skin retouching model is a specific application of ABPN to the skin retouching task.
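A minimal sketch of calling the skin retouching model through the ModelScope SDK; the task constant (`Tasks.skin_retouching`) and output key are assumptions, so check the model card for the authoritative example.

```python
# Sketch: portrait skin retouching with ABPN via ModelScope.
# Task constant and output key are assumptions; check the model card.
import cv2
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

retoucher = pipeline(Tasks.skin_retouching, model='damo/cv_unet_skin-retouching')
result = retoucher('portrait_4k.jpg')      # high-resolution input portrait
cv2.imwrite('portrait_retouched.png', result['output_img'])
```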

For example: 

Going a step further, we can also make some interesting attempts on clothing, such as wrinkle removal:

And even body reshaping (slimming):

https://www.modelscope.cn/models/damo/cv_flow-based-body-reshaping_damo/

In terms of effect, it has the following characteristics:

  • Local editing. Only the target area is edited; non-target areas are left untouched.

  • Precise retouching. The model fully considers the target's texture features and global context to edit precisely, removing blemishes while preserving the skin's natural texture.

  • Ultra-high-resolution capability. The blending layer pyramid design allows the model to handle ultra-high-resolution images (4K-6K).

The model is highly practical: it can be applied in professional retouching settings such as photo studios and advertising to improve productivity, and in live-streaming and entertainment scenarios to improve the skin appearance of portraits.

4.2 Turn into a cartoon

Model name: DCT-Net portrait cartoon model

Experience address:

https://www.modelscope.cn/models/damo/cv_unet_person-image-cartoon_compound-models/

Portrait cartoonization is a highly engaging, interactive feature with a variety of styles to choose from. The portrait cartoonization model open-sourced on ModelScope is built on DCT-Net (Domain-Calibrated Translation), a brand-new domain-calibrated image translation network. Following the core idea of "global feature calibration first, then local texture translation", it can train a lightweight, stable style converter from roughly a hundred style samples, achieving high-fidelity, robust, and easily extensible high-quality portrait style transfer.
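A minimal usage sketch with the ModelScope SDK is shown below; the task constant (`Tasks.image_portrait_stylization`) and output key are assumptions, so confirm them against the model card.

```python
# Sketch: portrait cartoonization with DCT-Net via ModelScope.
# Task constant and output key are assumptions; check the model card.
import cv2
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

cartoonizer = pipeline(
    Tasks.image_portrait_stylization,
    model='damo/cv_unet_person-image-cartoon_compound-models')
result = cartoonizer('selfie.jpg')
cv2.imwrite('selfie_cartoon.png', result['output_img'])  # stylized output image
```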

For example: 

From the effect point of view:

  • DCT-Net preserves content with high fidelity, effectively retaining the subject's identity, accessories, body parts, background, and other details from the original image;

  • DCT-Net is robust in complex scenes and easily handles facial occlusion, rare poses, and so on;

  • DCT-Net is easy to extend in both processing scope and style adaptation: trained on head data, it can be extended to refined style transfer of full-body or full-frame images, and the model generalizes across styles such as Japanese anime, 3D, and hand-drawn.

In the future we will open-source a whole series of cartoonization capabilities: beyond image conversion, follow-ups will include video and 3D cartoonization. Here is a preview of some of the effects:

4.3 Summary

Models of this type modify image content: editing or processing the source image (adding, removing, or changing content), or directly generating new visual content or converting style to obtain a new image (based on, yet different from, the source). All of these belong to the editing and generation category, which can be understood as the process of producing image B from image A.

5. Industry Scenario Models

As mentioned at the beginning, visual AI creates value across a wide range of scenarios. Beyond the "human"-related visual AI technologies above, we are also open-sourcing many production-proven models from the internet, manufacturing, interactive entertainment, media, security, healthcare, and other industries. These models can be used immediately, or further improved through finetuning or self-training tools, and applied by developers and customers to specific scenarios. Here is an example:

Model name: smoke and flame detection (integration in progress)

Model function: detects flames and smoke both outdoors and indoors, in forests, urban roads, parks, bedrooms, office areas, kitchens, smoking areas, and more. The algorithm has been refined for nearly two years and deployed in multiple customer scenarios, with overall stable results.

Technically, the model introduces a correlation block to improve multi-frame detection accuracy, and its purpose-designed data augmentation improves recognition sensitivity while effectively controlling false positives. In practice, it works in a variety of indoor and outdoor scenes and requires only simple equipment such as a phone camera or a surveillance camera.
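As a rough illustration of how such an industry model would typically be called once it is listed on ModelScope, here is a sketch using a detection pipeline; both the task constant and the model id below are placeholders/assumptions rather than confirmed identifiers, so look up the real ones in the community.

```python
# Sketch only: calling an industry detection model through ModelScope.
# The task constant is an assumption and the model id is a PLACEHOLDER,
# not a confirmed identifier; check the ModelScope community for the real one.
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

smoke_detector = pipeline(
    Tasks.domain_specific_object_detection,              # assumed task name
    model='damo/<smoke-and-flame-detection-model-id>')   # placeholder model id
result = smoke_detector('kitchen_camera_frame.jpg')
print(result)  # expected: boxes, labels (e.g. 'smoke' / 'flame'), and confidence scores
```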

6. Conclusion: The Open Future of Visual AI

The analysis above shows that visual AI has extremely broad application potential and society's needs are extremely diverse, yet the reality is that the supply of visual AI capabilities remains very limited.

Before ModelScope, DAMO Academy had already opened up visual AI services in the form of APIs, providing AI developers with a one-stop online visual service on the public cloud: the Visual Intelligence Open Platform (vision.aliyun.com). It has opened more than 200 APIs covering basic vision, industry vision, and the human-centered vision technologies described above.

Moving from the Visual Intelligence Open Platform to the ModelScope community marks an even bigger step in DAMO Academy's opening-up of visual AI. From open APIs to open SDKs and open SOTA models, from public cloud to device-cloud collaboration, from platform to community, we hope to meet every industry's needs for visual AI and foster its ecosystem.
