Another milestone in NLP: several papers from Alibaba Cloud's machine learning platform PAI accepted to ACL 2023

Recently, several papers led by Alibaba Cloud's machine learning platform PAI were accepted to the ACL 2023 Industry Track. ACL is the leading international conference on natural language processing, covering research on NLP technology across a wide range of application scenarios. The conference has driven core innovations in areas such as pre-trained language models, text mining, dialogue systems, and machine translation, and carries great influence in both academia and industry.

The papers were developed jointly by the machine learning platform PAI team and Alibaba's International Trade Division, the joint training program between Alibaba Cloud and South China University of Technology, and Professor Xiao Yanghua's team at Fudan University. Their acceptance indicates that the natural language processing and multimodal algorithms, as well as the algorithm framework capabilities, developed by Alibaba Cloud's machine learning platform PAI have reached an advanced level in the global industry and have been recognized by international scholars, demonstrating the international competitiveness of China's innovation in artificial intelligence.

Paper overviews

FashionKLIP: an image-text model for e-commerce scenarios enhanced with a multi-modal conceptual knowledge graph

Image-text retrieval, as a popular cross-modal task, has strong practical value in a wide range of industrial applications. The flourishing of vision-language pre-trained (VLP) models has greatly improved representation learning across modalities, leading to significant performance gains. However, data in the e-commerce domain has its own characteristics: 1) most texts in general scenarios are complete sentences, while descriptions or queries in e-commerce scenarios usually consist of multiple descriptive phrases about a product's material or style details; 2) images in the general domain usually have complex backgrounds, whereas product images mainly show a single prominent product without many background objects. This paper therefore proposes FashionKLIP, a knowledge-enhanced VLP model for e-commerce. It consists of two parts: a data-driven construction strategy that builds a multi-modal e-commerce conceptual knowledge graph (FashionMMKG) from a large-scale e-commerce image-text corpus; and a training strategy that integrates this knowledge into learning by aligning the representations of image-text pairs across the two modalities, and further obtains concept-level alignment by matching text representations against the visual prototype representations of fashion concepts in FashionMMKG.

[Figure: overview of the FashionKLIP framework]
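As a rough illustration of this training strategy, the following PyTorch-style sketch combines a standard image-text contrastive loss with a concept-alignment term that matches text embeddings against visual prototype embeddings of fashion concepts. The tensor layout, the prototype lookup, and the loss weighting here are illustrative assumptions, not the exact implementation of FashionKLIP.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Standard symmetric InfoNCE over a batch of paired image/text embeddings.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def concept_alignment_loss(txt_emb, concept_prototypes, temperature=0.07):
    # Illustrative concept-alignment term: pull each text embedding toward the
    # visual prototype of its matched fashion concept. The prototypes are assumed
    # to come from FashionMMKG and to be pre-matched per batch item; the matching
    # step itself is omitted here.
    txt_emb = F.normalize(txt_emb, dim=-1)
    prototypes = F.normalize(concept_prototypes, dim=-1)
    logits = txt_emb @ prototypes.t() / temperature
    targets = torch.arange(txt_emb.size(0), device=txt_emb.device)
    return F.cross_entropy(logits, targets)

def fashionklip_like_loss(img_emb, txt_emb, concept_prototypes, alpha=0.5):
    # alpha is a hypothetical weight balancing the two objectives.
    return (clip_style_contrastive_loss(img_emb, txt_emb) +
            alpha * concept_alignment_loss(txt_emb, concept_prototypes))
```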

To verify the practicality of FashionKLIP, we applied it to the product search platform of Alibaba's International Trade Division and evaluated it in a zero-shot setting on the image-to-product and text-to-product retrieval sub-tasks, comparing against the CLIP baseline. The experimental results further demonstrate the practical value and efficiency of FashionKLIP.
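For reference, zero-shot retrieval of this kind is typically reported with Recall@K computed over pre-extracted embeddings; the sketch below shows one common way to compute it and is not taken from the paper's evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def recall_at_k(text_emb, image_emb, ks=(1, 5, 10)):
    # text_emb[i] is assumed to be paired with image_emb[i] as the ground truth.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    sims = text_emb @ image_emb.t()                    # [N_text, N_image] cosine similarities
    ranks = sims.argsort(dim=-1, descending=True)      # image indices sorted by similarity
    gt = torch.arange(text_emb.size(0), device=text_emb.device).unsqueeze(1)
    gt_rank = (ranks == gt).float().argmax(dim=-1)     # rank position of the ground-truth image
    return {f"R@{k}": (gt_rank < k).float().mean().item() for k in ks}
```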

ConaCLIP: a dual-encoder model distillation algorithm for lightweight text-image retrieval

Text-image retrieval aims to retrieve the most relevant images from a large image collection given a text query. With the rapid development of information interaction and social scenarios, this task is considered a key component of cross-modal applications and is required by many real-world scenarios, such as e-commerce platforms and websites. However, existing models such as CLIP remain impractical on edge devices with limited computing resources, or in dynamic indexing scenarios such as private photo or message collections. To address this, we start from a large-scale pre-trained dual-stream encoder model and focus on the distillation process in the small model's pre-training stage, obtaining a series of lightweight models that are smaller, faster, and more effective. Unlike existing work, our method introduces a fully-connected knowledge interaction graph for distillation in the pre-training stage. In addition to intra-modal teacher-student interactive learning, it also includes intra-modal student-student, inter-modal teacher-student, and inter-modal student-student interactive learning, as shown in the figure below.

[Figure: fully-connected knowledge interaction graph used for ConaCLIP distillation]

This fully connected graph built for the student network can be regarded as an integration of multi-view and multi-task learning schemes, which strengthens the robustness and effectiveness required of the pre-trained model. We also examine in detail the effect of various supervision strategies for each type of learning process. We applied the proposed technique to an end-to-end cross-modal retrieval scenario on an e-commerce platform; the results show that it significantly reduces the model's storage footprint and improves its computational efficiency while largely preserving retrieval performance.
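As a rough illustration of how the pairwise interactions in such a graph could be combined into a single distillation objective, the sketch below uses similarity-distribution matching between teacher and student encoders. The specific pairings, loss forms, and weights are assumptions made for illustration rather than the paper's actual supervision strategies.

```python
import torch
import torch.nn.functional as F

def sim_matrix(a, b, temperature=0.05):
    # Batch similarity logits between two sets of L2-normalized embeddings.
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    return a @ b.t() / temperature

def kl_distill(student_logits, teacher_logits):
    # Match the student's similarity distribution to the teacher's soft targets.
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")

def interaction_graph_loss(t_img, t_txt, s_img, s_txt):
    """t_* are frozen teacher embeddings, s_* are student embeddings for one batch.
    Each term corresponds to one edge of a fully connected interaction graph; the
    choice of KL on similarity matrices (and MSE for intra-modal feature matching)
    is one illustrative supervision strategy among several possible ones."""
    teacher_ref = sim_matrix(t_img, t_txt)  # teacher cross-modal similarities as reference
    losses = {
        # inter-modal student-student: the usual cross-modal alignment of the small model
        "s_img__s_txt": kl_distill(sim_matrix(s_img, s_txt), teacher_ref),
        # inter-modal teacher-student edges
        "t_img__s_txt": kl_distill(sim_matrix(t_img, s_txt), teacher_ref),
        "s_img__t_txt": kl_distill(sim_matrix(s_img, t_txt), teacher_ref),
        # intra-modal teacher-student edges (feature-level matching)
        "t_img__s_img": F.mse_loss(F.normalize(s_img, dim=-1), F.normalize(t_img, dim=-1)),
        "t_txt__s_txt": F.mse_loss(F.normalize(s_txt, dim=-1), F.normalize(t_txt, dim=-1)),
    }
    return sum(losses.values()), losses
```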

Rapid Diffusion: a diffusion model and toolchain for domain-specific Chinese text-to-image generation with fast inference

Text-to-image synthesis (TIS) refers to generating, from a given text instruction, an image that matches the described content. However, because pre-trained language models lack domain-specific entity knowledge and are limited by the inference speed of diffusion models, the text-to-image generation models currently popular in the open-source community struggle to support applications in specific industrial domains. The core issue is that diffusion-based methods rely on a pre-trained text encoder to encode the input text, which then serves as the conditional input to the diffusion model's UNet. Trained on image-text data collected from the Internet, current pre-trained text encoders lack the ability to understand specific entity concepts and struggle to capture the entity knowledge that is crucial for generating realistic pictures of entity objects. At the same time, the inference speed and computational cost of diffusion models are important considerations: the heavy computation of the iterative reverse-diffusion denoising process has always been the bottleneck of diffusion-model inference.

We propose a new framework for training and deploying text-to-image diffusion models; the model architecture is shown in the figure below. To improve the understanding of specific entities, we inject rich entity knowledge into CLIP's text encoder, using knowledge graphs for knowledge enhancement. Unlike open-source Stable Diffusion, which directly uses a large-scale layered diffusion model, we attach an ESRGAN-based network after the image diffusion module to raise the resolution of the generated image while avoiding parameter explosion and excessive latency. For online deployment, we design an efficient inference pipeline based on a FlashAttention-optimized neural architecture; the intermediate representation (IR) of the generation model's computation graph is further processed by the end-to-end AI compiler BladeDISC to speed up inference.

[Figure: Rapid Diffusion model architecture]
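To make the serving flow concrete, here is a minimal sketch of the inference pipeline described above: knowledge-enhanced text encoding, iterative reverse-diffusion denoising, latent decoding, and ESRGAN-based upscaling. All component names and interfaces in this sketch are hypothetical placeholders, not the released toolchain's API.

```python
import torch

@torch.no_grad()
def generate_image(prompt: str, text_encoder, unet, scheduler, vae_decoder, upscaler,
                   num_steps: int = 25, latent_shape=(1, 4, 64, 64)):
    """Hypothetical end-to-end inference flow; component interfaces are placeholders.

    text_encoder : knowledge-enhanced CLIP text encoder (entity knowledge injected)
    unet         : conditional denoising UNet of the diffusion model
    scheduler    : provides the timestep schedule and the denoising update rule
    vae_decoder  : maps latents back to a low-resolution RGB image
    upscaler     : ESRGAN-style super-resolution network
    """
    cond = text_encoder(prompt)                   # text condition carrying entity knowledge
    latents = torch.randn(latent_shape)           # start from Gaussian noise
    for t in scheduler.timesteps(num_steps):      # iterative reverse-diffusion denoising
        noise_pred = unet(latents, t, cond)       # predict noise under the text condition
        latents = scheduler.step(noise_pred, t, latents)
    low_res = vae_decoder(latents)                # decode latents to a low-resolution image
    return upscaler(low_res)                      # ESRGAN upscaling to the final resolution
```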

Our experiments demonstrate that the knowledge-enhanced model for domain-specific scenarios better understands domain knowledge and generates more realistic and diverse images. For inference speed, the end-to-end AI compiler BladeDISC and FlashAttention are used to accelerate the model. We have also integrated this technology with Alibaba Cloud's machine learning platform PAI to demonstrate its practical value: users can train, fine-tune, and run inference with their own models on their own tasks and data with one click.

Open-source release of the algorithms

To better serve the open-source community, the source code of the three algorithms above will be contributed to EasyNLP, a natural language processing algorithm framework, and NLP practitioners and researchers are welcome to use it. EasyNLP is an easy-to-use and comprehensive Chinese NLP algorithm framework developed by the Alibaba Cloud machine learning platform PAI team on top of PyTorch. It supports commonly used Chinese pre-trained models and technologies for putting large models into production, and provides a one-stop NLP development experience from training to deployment. Given the growing demand for cross-modal understanding, EasyNLP will also support various cross-modal models, especially those for the Chinese domain, and release them to the open-source community, in the hope of serving more NLP and multimodal algorithm developers and researchers and of working with the community to advance NLP and multimodal technology and its real-world adoption.

GitHub address: https://github.com/alibaba/EasyNLP

Paper information

Paper name: FashionKLIP: Enhancing E-Commerce Image-Text Retrieval with Fashion Multi-Modal Conceptual Knowledge Graph

Paper authors: Wang Xiaodan, Wang Chengyu, Li Lei, Li Zhixu, Chen Ben, Jin Linbo, Huang Jun, Xiao Yanghua, Gao Ming

Paper PDF link: https://aclanthology.org/2023.acl-industry.16.pdf

Paper name: ConaCLIP: Exploring Distillation of Fully-Connected Knowledge Interaction Graph for Lightweight Text-Image Retrieval
Paper authors: Wang Jiapeng, Wang Chengyu, Wang Xiaodan, Huang Jun, Jin Lianwen

Paper PDF link: https://aclanthology.org/2023.acl-industry.8.pdf

Paper name: Rapid Diffusion: Building Domain-Specific Text-to-Image Synthesizers with Fast Inference Speed
Paper authors: Liu Bingyan, Lin Weifeng, Duan Zhongjie, Wang Chengyu, Wu Ziheng, Zhang Zipeng, Jia Kui, Jin Lianwen, Chen Cen, Huang Jun

Paper PDF link: https://aclanthology.org/2023.acl-industry.28.pdf

