Dialogue with Machines: DAMO Academy Takes On the New Generation of Human-Machine Dialogue Technology

Authors: Huang Fei, Sun Jian, Li Yongbin, Zhang Ji, Dai Yinpei, Yu Haiyang, Geng Ruiying, Gao Xing, Yan Ming

1. Overview of human-machine dialogue

When you hear the term human-machine dialogue, you may be unsure what it covers, but you are certainly familiar with its applications and experiences. One class of scenarios is consumer-grade hardware interaction: mobile phone voice assistants such as Siri, smart speakers, and in-vehicle dialogue robots. This voice-based form of dialogue makes human-computer interaction more convenient and fast. The other class is service scenarios: when a user calls customer service on a weekend or in the evening, the first answer is very likely to come from a dialogue robot. Robots of this type are mainly used in customer service and similar interaction-heavy service scenarios.

To put it simply, human-machine dialogue refers to an intelligent system that enables machines to understand human natural language and interact with people accordingly. Since the early days of artificial intelligence research, people have been committed to developing highly intelligent human-computer dialogue systems. In the usual sense, a human-machine dialogue system comprises five subsystems in its technical framework, as shown in the figure below:

Human-machine dialogue can be categorized along several dimensions. By the openness of the domain being discussed, it divides into open-domain dialogue and vertical-domain dialogue; by whether the dialogue has an explicit goal, it divides into chit-chat (no specific goal) and goal-oriented dialogue; by function, it is generally divided into three types: task-oriented dialogue, intelligent question answering, and chit-chat. Referring to iResearch's definitions, from the product dimension we divide dialogue interaction products into two types: consumer-grade hardware interactive products and conversational AI products.

In terms of domestic market size, the output value of AI voice assistant algorithms for consumer-grade hardware was about 3.4 billion yuan in 2021; the conversational AI market reached 4.5 billion yuan in 2021, driving an ecosystem worth 12.6 billion yuan. Both categories are in a stage of relatively high-speed growth. Why has human-machine dialogue made significant progress in the past few years? The author believes there are three reasons. First, C-end consumers have a hard demand for fast, convenient access to information and services anytime, anywhere. Second, labor cost pressure and the customer-centric service philosophy push B-end enterprises to build smarter and more efficient ways of connecting with and serving customers, namely customer contact center solutions with customer service robots at the core. Third, the new generation of technology based on pre-trained large models plus fine-tuning has significantly improved the generalization ability of dialogue robots, enhanced transferability across scenarios, and reduced the cost of building robots.

Relying on the natural language processing and voice interaction capabilities of DAMO Academy, we have accumulated experience in FAQ knowledge retrieval Q&A, task-flow Q&A, complex reasoning Q&A over knowledge graphs, table retrieval Q&A, and MRC-based document understanding Q&A, and we continue to innovate and upgrade in core capabilities, full-link operation tools, intelligent assistance, and insight analysis. This article will walk you through the following topics:

  • Key technical challenges behind conversational AI products

  • DAMO Academy's new generation of human-machine dialogue technology platform

  • Typical application scenarios and customers of intelligent customer service

  • Thoughts on the future development direction and development path of human-computer dialogue

2. Key technical challenges faced by conversational AI

To allow machines to understand human language and communicate freely with people, at least the following key challenges must be faced:

  • High cost of knowledge construction: for machines to understand what people say, they must, like humans, possess a large amount of knowledge in advance, and that knowledge needs to be structured. The structured knowledge here mainly falls into two types: dialogue process knowledge and knowledge graphs centered on specific goals. Building a reasonably complete dialogue logic flow (dynamic knowledge) for a given scenario is estimated to take one to two person-weeks, and building the schema and knowledge graph (static knowledge) for a given scenario takes roughly another two person-weeks, so the cost of knowledge construction is very high;

  • Long optimization cycle from cold start to meeting launch standards: a robot needs multiple rounds of optimization and polishing before it is ready to go online. Each round involves collecting real dialogue data, labeling the data, training the model, debugging it, testing the dialogue effect, and analyzing the causes behind problems before starting the next round, which is estimated to take two to three weeks; similarly, iteratively optimizing knowledge graph Q&A quality also takes about two weeks;

  • Poor dialogue experience when migrating from a mature scenario to a new few-sample scenario: in a mature scenario, the dialogue robot has plenty of real dialogue data to process and exploit, so the dialogue experience can be optimized continuously. After migrating to a new scenario with only a few samples, however, the quality of the dialogue experience drops significantly.

  • Human languages span a great diversity of language families, languages, and dialects, and data for many low-resource languages is scarce; in multilingual regions such as Southeast Asia and South Asia, code-mixing is very common; some languages have multiple writing systems, and transliteration between them is not standardized. Robots face many challenges in crossing language barriers, penetrating local cultures, and supporting authentic local language.

  • Humans perceive the world in a multimodal way, involving information in different modalities such as images, text, speech, and video. Robots need to understand information from multiple modalities at the same time, which raises the questions of how to represent the semantics of each modality efficiently and accurately, how to align cross-modal information to bridge the cross-modal semantic gap, and how to perform deep modality fusion on top of the aligned multimodal information.

In response to the above key challenges, the Intelligent Dialogue and Service Team of DAMO Academy has mainly worked along the following lines over the past year:

  • At the knowledge level, focus on building semi-automatic construction capabilities for structured knowledge to reduce the cost of knowledge construction, and further extend to making full use of multimodal knowledge such as images and videos;

  • At the dialogue model level, focus on building pre-trained dialogue models that incorporate knowledge, thereby shortening the optimization cycle from cold start to meeting launch standards, and further extend the pre-trained models from single modality to multiple modalities and from single language to multiple languages;

  • At the dialogue engine level, focus on expanding and enhancing the core capabilities of the dialogue engine, including the multi-capability dialogue engine, multilingual Q&A, multimodal Q&A, and few-shot learning techniques.

3. DAMO Academy's new generation of human-machine dialogue technology system

Based on the above ideas, we designed a new generation of human-machine dialogue technology system whose core consists of three layers: the knowledge layer, the pre-trained dialogue and Q&A model layer, and the engine layer. The pre-trained dialogue and Q&A model layer includes the pre-trained dialogue model, the pre-trained graph Q&A model (KGBert), and the pre-trained table Q&A model (TableBert); the engine layer includes the Dialog Studio multi-turn dialogue engine, the KBQA graph Q&A engine, the TableQA table Q&A engine, the FAQ multilingual Q&A engine, and the VQA visual Q&A engine.

3.1 Knowledge layer: scalable knowledge graph construction

Knowledge comes from data, and the data sources mainly fall into two categories: dialogue logs and enterprise documents. Accordingly, knowledge construction also proceeds along two lines. The first is dialogue flow construction based on dialogue logs, which upgrades intent discovery from traditional manual configuration to automatic mining, upgrades labeling from manual annotation to automatic mining plus semi-automatic labeling, and makes dialogue flow construction semi-automatic. Moving from manual to semi-automatic construction greatly reduces the cost of building process knowledge. The second is document-based knowledge graph construction. Documents carry a certain amount of structural information; once structured, that information makes question answering more accurate and, from the perspective of multi-turn interaction, makes the dialogue flow more smoothly.

Focusing on document-based knowledge graph construction, we designed a scalable construction solution that consists of three layers: a pre-trained document model (DocBert), an enterprise document annotation platform, and information extraction (see the figure below). Information extraction is further subdivided into three steps: document structure recognition, coarse-grained triplet extraction, and fine-grained triplet extraction.

3.1.1 DocBert

We designed DocBert, a pre-trained document model for semi-structured long documents. Its main design idea is to divide document representation into three levels: physical structure, logical structure, and semantic structure, and to construct self-supervised learning tasks from text semantics, layout information, and visual features, so that the model can better understand both the semantics and the structure of documents. The pre-training tasks are as follows:

1) Layout-Aware MLM: a joint semantic-and-layout modeling task in which the masking step takes text position, font size, and similar information into account, yielding a layout-aware semantic understanding objective (a minimal sketch of this masking step follows the list);

2) Text-Image Alignment: to align text and images we adopt the same approach as LayoutLM, that is, by reconstructing the masked text regions in the document image, the model learns the alignment relationships among the text, layout, and image modalities;

3) Title Permutation: construct a chapter-title reconstruction task in a self-supervised manner to strengthen the model's understanding of the hierarchical table-of-contents structure of documents;

4) Sparse Transformer Layers: replace the standard Transformer layers with Sparse-Attention-based Transformer layers to strengthen the model's ability to process long documents.
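To make the Layout-Aware MLM objective concrete, here is a minimal, hypothetical sketch of how a masked training example might be built: the token text is hidden while its layout features (bounding box, font size) stay visible, so the model has to recover the word from both context and layout cues. The field names and masking probability are illustrative assumptions, not DocBert's actual data schema.

```python
import random

# Hypothetical token records: text plus the layout features Layout-Aware MLM
# conditions on. Field names (bbox, font_size) are illustrative only.
tokens = [
    {"text": "Annual",  "bbox": (40, 60, 95, 78),    "font_size": 18},
    {"text": "Report",  "bbox": (100, 60, 160, 78),  "font_size": 18},
    {"text": "Revenue", "bbox": (40, 120, 110, 132), "font_size": 10},
    {"text": "grew",    "bbox": (115, 120, 150, 132), "font_size": 10},
]

MASK, MASK_PROB = "[MASK]", 0.15

def make_layout_aware_mlm_example(tokens, mask_prob=MASK_PROB):
    """Mask a fraction of tokens but keep their layout features visible,
    so the model must predict the word from context plus position/format cues."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append({**tok, "text": MASK})   # text hidden, layout kept
            labels.append(tok["text"])             # prediction target
        else:
            inputs.append(tok)
            labels.append(None)                    # ignored by the MLM loss
    return inputs, labels

inputs, labels = make_layout_aware_mlm_example(tokens)
```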

3.1.2 Coarse-grained triplet extraction

Coarse-grained triplet extraction from a document essentially takes as input the ordered sequence of the document's physical components, identifies the type of each component (title, body text, and so on), builds a document tree from that information, and finally derives all coarse-grained triples of the document with a few simple rules. The core of the process is document tree generation. The overall pipeline is shown in the following figure:

Document tree generation based on logical structure extraction faces two important challenges: long documents and hierarchies of variable depth. On the one hand, a long document may contain hundreds of pages and thousands of physical components, making computation heavy; on the other hand, variable-depth hierarchy means that the depth of the tree differs across documents, with some only 3 levels deep and others as many as 10. Based on this, we propose a three-stage framework for document structure extraction:

  • The first step detects the titles in the physical component sequence. We extract the text and formatting information of each component, use DocBert to extract features, and then perform binary classification on each component to decide whether it is a title or not. Since this step is relatively simple, a sequence-labeling model can already achieve high accuracy;

  • The second step generates a title hierarchy tree from the extracted title sequence. Concretely, starting from an empty tree as the initial state, we take the titles from the sequence one by one and insert each into the tree; the possible insertion positions for the current title are the child positions of the nodes on the rightmost branch of the tree (a sketch of this insertion step follows the list);

  • In the last step, once the title hierarchy tree has been generated, the remaining components are inserted under the corresponding tree nodes according to their positions in the sequence.
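The second step above is essentially an incremental tree-building procedure. The sketch below shows the rightmost-branch insertion logic with a toy heuristic standing in for the learned parent scorer; the class and function names are our own illustrations, not the actual implementation.

```python
class Node:
    def __init__(self, title):
        self.title = title
        self.children = []

def rightmost_branch(root):
    """Candidate parents: nodes on the path from the root through each last child."""
    path, node = [root], root
    while node.children:
        node = node.children[-1]
        path.append(node)
    return path

def insert_title(root, title, choose_parent):
    """Insert a detected title under one node of the rightmost branch.
    `choose_parent` stands in for the learned scorer over candidate parents."""
    parent = choose_parent(rightmost_branch(root), title)
    parent.children.append(Node(title))

def toy_choose_parent(candidates, title):
    # Heuristic stand-in: top-level headings ("1 ...") attach to the root,
    # dotted sub-headings ("1.1 ...") attach to the deepest candidate.
    return candidates[0] if "." not in title.split()[0] else candidates[-1]

root = Node("<document>")
for t in ["1 Introduction", "1.1 Background", "2 Method"]:
    insert_title(root, t, toy_choose_parent)
```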

We applied DocBert to the downstream coarse-grained triplet extraction task. On test sets from the government affairs, insurance, banking, and electric power industries, its triplet extraction is generally 3%~7% better than traditional pre-trained models, and on few-sample datasets it achieves improvements of more than 10%. On LIE, a public dataset we built ourselves, it also surpasses recent pre-trained models such as LayoutLMv2 and achieves very good results.

3.1.3 Fine-grained triplet extraction

For fine-grained triplet extraction from text, we designed the following fine-grained information extraction tasks:

ClosedIE performs fine-grained triple knowledge extraction under a given graph schema, that is, given entity and relation types. On the model side, we studied techniques such as bilinear 3D tensor sparsification, Rotationary span-length modeling, and the loss-function Power Trick. Experiments on self-built government affairs, electric power, medical, and common-sense business datasets show that our model improves by 1-3 points over the Biaffine baseline. For details, see our Q&A technology system based on semi-structured knowledge.
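For readers unfamiliar with the Biaffine baseline mentioned above, the sketch below shows the core idea of biaffine pair scoring: every (head, tail) token pair receives a score per relation label via a bilinear 3D tensor. Dimensions and the module name are illustrative assumptions, not the production model.

```python
import torch
import torch.nn as nn

class BiaffineScorer(nn.Module):
    """Minimal biaffine scorer of the kind used as the ClosedIE baseline."""
    def __init__(self, hidden=256, num_labels=8):
        super().__init__()
        self.head_mlp = nn.Linear(hidden, hidden)
        self.tail_mlp = nn.Linear(hidden, hidden)
        # Bilinear 3D tensor: one (hidden+1) x (hidden+1) matrix per relation label.
        self.U = nn.Parameter(torch.randn(num_labels, hidden + 1, hidden + 1) * 0.01)

    def forward(self, enc):                      # enc: [batch, seq, hidden]
        ones = enc.new_ones(*enc.shape[:2], 1)   # bias column
        h = torch.cat([torch.relu(self.head_mlp(enc)), ones], dim=-1)
        t = torch.cat([torch.relu(self.tail_mlp(enc)), ones], dim=-1)
        # scores[b, l, i, j]: label-l score for head token i and tail token j
        return torch.einsum("bih,lhk,bjk->blij", h, self.U, t)

scores = BiaffineScorer()(torch.randn(2, 16, 256))   # -> [2, 8, 16, 16]
```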

Unlike classic ClosedIE, OpenIE extracts triple knowledge from documents without a given schema. The current SOTA OpenIE model, MacroIE, models the knowledge in text as maximal-clique structures at word granularity and achieves the best results on the Chinese SAOKE and English OIE4 datasets. However, modeling knowledge as maximal cliques hurts robustness and generalization and is prone to missing or wrong edges. We therefore relax the maximal-clique constraint and instead model a directed acyclic graph structure, proposing a new model, DragonIE. It has clear advantages in handling complex cases such as overlapping spans and discontinuous spans, and it greatly reduces model complexity. On the Chinese public dataset SAOKE and the English public dataset OIE4, our self-developed DragonIE reduces the number of tags by 80% and memory usage by 50% compared with the current SOTA, while improving the result by 1 point.

3.2 Semi-supervised pre-training opens up a new paradigm for incorporating knowledge into dialogue models

Based on the characteristics of dialogue, we designed a dedicated pre-trained dialogue model. A Pre-trained Conversation Model (PCM) models the selection or generation of the most suitable response given the dialogue history. Compared with generic pre-trained language model tasks, this is more specific and needs to jointly consider dialogue history, dialogue goals, dialogue policy, speaker roles, dialogue turns, and so on.

3.2.1 Why integrate knowledge?

The essence of pre-training is to implicitly store the information contained in the training data into model parameters in a form the model can use. Many studies have shown that pre-trained models such as BERT can learn a fair amount of linguistic knowledge (syntax, grammar) from large-scale text, and even a certain amount of world knowledge and common sense. However, how to better learn and use human experiential knowledge in pre-trained models remains an open problem.

Here we roughly divide human experiential knowledge into three categories. The first is factual knowledge, such as manually constructed knowledge tables, knowledge graphs, and structured documents (including text structure and graphical information). The second is mathematical and logical knowledge, including mathematical formulas, axioms and theorems, and symbolic computation; this type is not discussed in this article. The third is annotation knowledge, that is, the knowledge contained in labeled data. This type is very common and task-specific, for example text classification or sentiment analysis: when labeling, humans must reason about unlabeled data according to the specific task and assign it a label in a pre-defined space of high-level semantic categories. Augmenting pre-trained models with human experiential knowledge should therefore bring significant improvements on related downstream tasks.

3.2.2 Dialogue strategy knowledge

Dialogue policy is an important module in the dialogue process and is generally characterized by dialog acts (DA): given the dialogue history of both parties, the policy must select the correct dialog act to guide response generation. Common pre-trained dialogue models such as Meena and DialoGPT implicitly encode this act-selection process into the model parameters, which makes it neither interpretable nor controllable. Because policy is high-level semantics, it is hard to learn well with self-supervision alone. Starting from dialogue policy modeling, we therefore propose a semi-supervised approach to achieve better pre-training and to inject the policy knowledge contained in labeled data into the pre-trained dialogue model. The figure below shows the dialog act taxonomy we compiled and defined:

3.2.3 Inject dialogue policy knowledge into pre-training

We designed a semi-supervised pre-training method to address dialogue policy modeling: the dialog act prediction task is cast as a semi-supervised learning task, and on this basis we built the pre-trained dialogue model SPACE. This model is also an integral part of Alibaba's deep language model family.

Specifically, SPACE adopts an encoder-decoder architecture. Its pre-training objective includes not only the traditional self-supervised losses for dialogue understanding and generation, but also a semi-supervised loss for dialogue policy modeling. The complete framework is shown below:

Semi-supervised dialogue pre-training framework

First, for understanding ability, we use response selection as a pre-training objective: given the dialogue context and a candidate response, the model performs binary classification at [CLS] to decide whether the response is correct. Many PCM works have shown that response selection training is crucial for dialogue understanding, so we keep this objective. For generation ability, we use the standard response generation objective, that is, generating the correct reply given the dialogue context. For the policy part, we use consistency regularization, a highly effective semi-supervised learning technique, to model dialog acts. Theory shows that under the low-density assumption (that the classification boundary lies in a low-density region), if the classification results remain consistent after perturbing the same sample (that is, the output distributions or predictions stay close), then semi-supervised learning based on consistency regularization is guaranteed to find the correct decision boundary. Finally, the model is pre-trained by jointly optimizing the understanding, policy, and generation objectives.
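As a rough illustration of the consistency-regularization idea used for the policy loss, the sketch below pulls together the dialog-act distributions from two dropout-perturbed forward passes with a symmetric KL term. It is a simplified stand-in under our own assumptions, not the exact SPACE objective.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_1, logits_2):
    """Symmetric KL between two predicted dialog-act distributions obtained by
    running the same dialogue context twice under different dropout noise."""
    log_p, log_q = F.log_softmax(logits_1, dim=-1), F.log_softmax(logits_2, dim=-1)
    kl_pq = F.kl_div(log_q, log_p.exp(), reduction="batchmean")  # KL(p || q)
    kl_qp = F.kl_div(log_p, log_q.exp(), reduction="batchmean")  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)

# On an unlabeled batch: two forward passes of the act classifier (stand-ins here),
# whose distributions are regularized to stay consistent.
logits_a = torch.randn(4, 20)                   # pass 1 over 20 hypothetical dialog acts
logits_b = logits_a + 0.1 * torch.randn(4, 20)  # pass 2, different dropout noise
loss = consistency_loss(logits_a, logits_b)
```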

3.2.4 Semi-supervised pre-training brings significant improvement

We verified the effect on three international dialogue datasets (Stanford's In-Car dataset, MultiWOZ 2.0, and MultiWOZ 2.1). As shown in the figure below, after semi-supervised pre-training that incorporates policy knowledge, our GALAXY model greatly surpasses the previous SOTA models on these benchmarks, improving the end-to-end overall score by 2.5, 5.3, and 5.5 points on In-Car, MultiWOZ 2.0, and MultiWOZ 2.1, respectively.

3.3 Multimodal pre-training brings a new experience of dialogue question answering

For different visual feature representations, weighing the respective pros and cons of each, we developed a series of in-house multimodal pre-trained models that achieve SOTA results on multiple public multimodal tasks.

  • Region: In real image-text data, some image-text pairs are easy to align semantically across the two modalities, while others require higher-level semantic alignment. Existing pre-training frameworks based on Region features come in two kinds:

    1) directly concatenating the feature-level image representation and the text representation as the input of a single-stream Transformer, which suits simple image-text pairs;

    2) using a two-stream Transformer to align image and text representations in a higher-level semantic space.

    Based on this, we propose SemVLP, a multimodal model that fuses the single-stream and dual-stream designs, and introduce a new cross-modal fusion mechanism, soft cross-modal attention, which combines hard cross-modal attention and partial cross-modal attention and can align text and images at different semantic granularities. Experiments on multiple visual-language understanding tasks show that SemVLP achieves clear improvements over both traditional single-stream and dual-stream models.

  • Grid: To address the long online latency of Region features and make better use of Grid features, we explored two fusion approaches.

    1) E2E-VLP: unify end-to-end multimodal pre-training within the Transformer framework while supporting both NLU and NLG tasks; in the pre-training stage, add visual tasks (object detection, image captioning) to better fuse image and text semantics; in the fine-tuning stage, drop the time-consuming detection module and train end to end directly on ResNet feature maps, matching the two-stage method on multimodal NLU/NLG tasks while running 3 times faster. For details, see the E2E-VLP paper.

    2) Grid-VLP: use the Faster R-CNN encoder of a pre-trained object detector as the visual encoder; in the pre-training stage, a Random Grid Sampling mechanism improves model robustness, and the model surpasses Region-based multimodal models on datasets such as VQA, NLVR2, and GQA. For details, see the Grid-VLP paper. Among these, E2E-VLP has been accepted by ACL 2021.

E2E-VLP and Grid-VLP

Main conclusions: Grid-based models can match the effectiveness of Region-based models while supporting end-to-end training and prediction; their faster inference makes them more suitable for practical business applications.

  • Patch & Fusion: ViT has made great progress on visual tasks and has recently become a hot topic in multimodal research. We experimented with Patch features extracted from pre-trained object detectors and from CLIP pre-trained on image-text pairs, and we are exploring VILT-style low-level fusion of image and text. In addition, to combine the advantages of the various image-text features, we proposed Fusion-VLP, which adaptively fuses the three types of visual features (Region, Grid, Patch) with text features through Learning to Attend, achieving single-model SOTA on multimodal visual question answering (VQA) under a comparable amount of pre-training data. See the related papers for details.

Main conclusions: Patch features pre-trained on detection data overfit relatively easily; Patch features pre-trained on larger amounts of unlabeled image-text data achieve better results but tend to lose boundary information, and for now fall below the Region-based and Grid-based methods. Patch-based models can unify images and text within the Transformer framework and are a current research hotspot; fusing the three types of features captures semantic information at different granularities in the image more effectively, and the features complement each other.

  • Learning to Attend: Most existing multimodal pre-training frameworks use single-stream or dual-stream interaction. In the single-stream framework, image-text interaction still uses the conventional self-attention mechanism. Empirically, the lower layers of the model should lean toward modeling the representations within each modality, while the upper layers should lean toward modeling the interactions between modalities. We therefore propose a new Learning to Attend multimodal pre-training framework: each layer uses two learnable self-attention weights to dynamically control inter-modal and intra-modal interaction, and the framework can adaptively fuse the multiple classes of visual features described above (Region, Grid, Patch) with text features.

On top of the original Transformer mechanism, we split the self-attention matrix into two parts: an intra-modal attention matrix and an inter-modal attention matrix. We then introduce two learnable weights, ε1 and ε2, for the intra-modal and inter-modal matrices respectively. In the self-attention computation of each Transformer layer, the learnable weights are multiplied with the corresponding attention matrices to obtain a new attention weight matrix; in this way, the model can adaptively learn and adjust the intra-modal and inter-modal attention weights.
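The sketch below shows one single-head way to realize this split: the raw attention logits are partitioned into intra-modal and inter-modal parts by a modality mask and rescaled by two learnable weights. Whether the rescaling happens before or after softmax, and all dimensions and names, are simplifying assumptions of this illustration rather than the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnToAttendSelfAttention(nn.Module):
    """Single-head sketch: split attention into intra-/inter-modal parts and
    rescale each with a learnable weight (epsilon_1, epsilon_2)."""
    def __init__(self, dim=64):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.eps_intra = nn.Parameter(torch.ones(1))
        self.eps_inter = nn.Parameter(torch.ones(1))
        self.dim = dim

    def forward(self, x, modality):   # x: [seq, dim]; modality: [seq], 0=text, 1=image
        att = self.q(x) @ self.k(x).T / self.dim ** 0.5          # raw attention logits
        same = (modality[:, None] == modality[None, :]).float()  # intra-modal mask
        att = self.eps_intra * att * same + self.eps_inter * att * (1 - same)
        return F.softmax(att, dim=-1) @ self.v(x)

x = torch.randn(10, 64)
modality = torch.tensor([0] * 6 + [1] * 4)   # 6 text tokens followed by 4 visual tokens
out = LearnToAttendSelfAttention()(x, modality)
```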

Main conclusion: We validated the Learning to Attend image-text fusion framework with multiple feature types. Both the Region and Fusion settings improve over the original Transformer, indicating that the new framework can, to a certain extent, adaptively fuse visual and textual features.

  • Structure: In multimodal data, beyond purely visual elements, some images contain rich text, and current visual features cannot represent the OCR text inside an image. To address this, we propose the structured pre-trained model StructuralLM. Built on the language model StructBERT, it makes full use of the 2D position information in image documents, proposes a box-shared coordinate representation, and introduces a box position prediction pre-training task to help the model perceive the relationships between words at different positions in the image. Compared with the previous SOTA, this method improves by nearly 10 points on the classic form understanding dataset FUNSD and the document question answering dataset DocVQA. See the StructuralLM paper for details; the paper has been accepted by ACL 2021.

StructuralLM

Main conclusion: Adding the StructuralLM model on top of the diverse visual representation models brings an absolute improvement of 1.2 pt on the VQA test set, showing that our model learns the rich in-image text and its spatial position representation well.
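To illustrate the box-shared coordinate representation described above, here is a tiny, hypothetical sketch: every word token inside the same layout cell is assigned that cell's bounding box, so the model sees cell-level rather than word-level 2D positions. The data layout is an assumption made for illustration only.

```python
def cell_level_positions(cells):
    """Assign each word the bounding box of the cell it belongs to,
    mimicking the box-shared coordinate idea at a toy scale."""
    features = []
    for words, box in cells:
        for w in words:
            features.append({"token": w, "bbox": box})  # all words in a cell share one box
    return features

# Two cells from a scanned form, each with its own bounding box.
cells = [(["Invoice", "No."], (30, 40, 120, 60)),
         (["2021-0342"],      (130, 40, 230, 60))]
print(cell_level_positions(cells))
```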

At present, the state-of-the-art methods on public, authoritative multimodal tasks are almost all based on multimodal pre-training: pre-training on massive unlabeled multimodal data brings significant improvements over models without pre-training. Our multimodal pre-training technology is not limited to VQA; it is also widely used in multimodal classification, retrieval, generation, and other tasks, and has taken first place on the SemEval 2021 multimodal classification and DocVQA structured leaderboards.

3.4 Engine layer: DAMO Academy's TableQA tops four leaderboards and is applied at scale

At the engine level, the DAMO Academy human-machine dialogue platform mainly includes the Dialog Studio dialogue engine for process knowledge, the TableQA engine for table knowledge, and the graph Q&A engine for knowledge graphs. Due to space limitations, we mainly introduce the TableQA engine here.

Because tabular data has a clear structure, is easy to maintain, and is friendly to both human and machine understanding, tables and SQL databases are the most common form of structured knowledge storage across industries. TableQA converts natural language directly into SQL queries, allowing users to interact with table knowledge in natural language and thereby extending the capabilities of dialogue robots (an illustrative input/output pair follows). We have made a series of explorations around TableQA, successively taken first place on four major dataset leaderboards, and open-sourced the first Chinese pre-trained table model, making TableQA one of the core engines of the new generation of human-machine dialogue technology.
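To make the task concrete, here is an illustrative input/output pair for a text-to-SQL engine of this kind; the question, table schema, and SQL are made up for demonstration and are not drawn from any production system.

```python
# Hypothetical TableQA-style example: a natural-language question over a table
# schema is mapped to an executable SQL query.
example = {
    "question": "What was the average order amount in March 2021?",
    "table": {
        "name": "orders",
        "columns": ["order_id", "order_date", "amount", "city"],
    },
    "predicted_sql": (
        "SELECT AVG(amount) FROM orders "
        "WHERE order_date BETWEEN '2021-03-01' AND '2021-03-31'"
    ),
}
```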

The simplest case for the TableQA engine is a single-turn question over a single table, and most industry work stops there. Building on single-turn Q&A, the team extended the capability from single-turn to multi-turn and from single-table to multi-table.

3.4.1 From single round to multiple rounds

For multi-turn table Q&A, the difficulties mainly include two aspects:

  • How to effectively model and utilize multiple rounds of dialogue history to understand user questions;

  • The semantic link problem between utterance and table Schema;

To address semantic linking in multi-turn scenarios, at AAAI 2021 we proposed R²SQL (Hybrid Relation Network for Cross-Domain Context-Dependent Text-to-SQL Parsing), a framework based on a dynamic context schema graph that effectively captures the complex semantic linking relations between natural language and the table schema in multi-turn settings.

The framework includes two modules: 1) a fused relation graph, and 2) a dynamic memory forgetting mechanism. As shown in the figure below, the fused relation graph contains both the implicit relations obtained from the attention mechanism and the explicit relations obtained from semantic linking, so that the model and the prior rules each contribute their strengths to multi-turn semantic understanding. Moreover, as a multi-turn conversation progresses, the user's topic shifts from turn to turn, so we further propose a dynamic memory forgetting mechanism that updates the weight of each relation, yielding a dynamic context schema graph suited to multi-turn Q&A (a toy version of this decay is sketched below).
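A toy version of the forgetting idea is sketched below: the weight of each schema-linking relation decays with the number of turns since it was last activated. The decay form and factor are our own illustrative assumptions, not R²SQL's actual mechanism.

```python
def decay_relation_weights(weights, turns_since_active, gamma=0.8):
    """Down-weight schema-linking relations that have not been mentioned recently."""
    return {rel: w * (gamma ** turns_since_active[rel]) for rel, w in weights.items()}

weights = {("question:price", "column:amount"): 1.0,
           ("question:city", "column:city"): 1.0}
turns_since_active = {("question:price", "column:amount"): 0,   # mentioned this turn
                      ("question:city", "column:city"): 2}      # last mentioned two turns ago
print(decay_relation_weights(weights, turns_since_active))
```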

We ran experiments on SParC and CoSQL, the industry's authoritative multi-turn table Q&A datasets. Compared with EditSQL, turn-level accuracy increases by 7.9% on SParC (47.9% -> 55.8%) and by 6.0% on CoSQL (40.8% -> 46.8%). This work was published at AAAI 2021.

3.4.2 Question answering from single-table to multi-table

Real-world scenarios often involve multiple tables, which requires joint queries over several tables and brings two kinds of complexity to SQL parsing: 1) more SQL keywords, including advanced ones such as JOIN and UNION; 2) SQL statements nested within one another.

Compared with single-table single-turn parsing, the multi-table single-turn task mainly has the following difficulties:

  • SQL level: How to design a decoder with grammatical constraints for complex SQL statements;

  • Table level: how to use the relationship between multiple tables in the database;

  • Linking level: the semantic linking between the question and the schema is more complicated in multi-table scenarios.

Some previous work focused on modeling the schema internally, converting the tables, columns, and foreign keys in the schema into a graph and feeding it into the network for learning; other work focused on modeling the semantic links in multi-table tasks. We are the first to focus on the importance of the syntactic structure of the natural-language question for the text-to-SQL task, and we use syntactic relations to model the internal structure of the question. Under the syntax-based distance measure, the relation between id and date is shortened, so the correct SQL can be generated. Based on this motivation, we propose S²SQL (Injecting Syntax to Question-Schema Interaction Graph Encoder for Text-to-SQL Parsers), which jointly models the syntactic structure inside the natural-language question, the internal structure of the schema, and the semantic interaction structure between question and schema, yielding a complete question-schema interaction graph with stronger representation power.

In terms of results, S²SQL achieved the best result on the Spider dataset, 2.8 percentage points higher than the previous best (Microsoft's RAT-SQL) (64.3 -> 67.1). The R²SQL model topped the respective leaderboards in July and August 2020.

The SDSQL and S²SQL models took first place on the WikiSQL and Spider leaderboards in March 2021 and September 2021, respectively.

3.5 FAQ

Basically every customer needs FAQ Q&A support, so the FAQ Q&A engine is the most widely used engine. Around the goals of improving business customization efficiency, improving the Q&A experience, and reducing FAQ operation and deployment costs, we have carried out extensive exploration and practice: Q&A model libraries and code frameworks, dialogue pre-training and few-shot understanding, multi-turn understanding and clarification guidance, understanding of long colloquial sentences, FAQ generation from multi-source heterogeneous content, model distillation, and high-performance deployment. Here we focus on our application of few-shot classification to FAQ Q&A.

Compared with traditional intent classification, few-shot classification based on meta-learning can automatically generalize to unseen categories given only a small number of samples; compared with sentence-pair matching, it can model FAQ knowledge more completely and resolve the ambiguity of a single knowledge title. Most existing models follow the classic prototypical network design, which is representation-level matching. We propose MGIMN (Multi-grained Interactive Matching Network), a model based on interactive matching: it first computes instance-wise matching feature vectors, then aggregates them into a matching feature vector per class, and finally produces a matching confidence for each class (a simplified sketch follows). Multi-granularity interactive matching between sentences is performed from the global, class, sentence-pair, and single-sentence perspectives, which strengthens discriminative lexical attention during matching.
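The simplified sketch below captures the instance-then-class aggregation step, with plain cosine similarity standing in for MGIMN's multi-granularity interaction features; names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def few_shot_class_scores(query_vec, support_vecs, support_labels, num_classes):
    """Compare the query against every support instance, then aggregate the
    instance-level matching scores into one confidence per class."""
    sims = F.cosine_similarity(query_vec.unsqueeze(0), support_vecs, dim=-1)  # [num_support]
    class_scores = torch.full((num_classes,), float("-inf"))
    for c in range(num_classes):
        mask = support_labels == c
        if mask.any():
            class_scores[c] = sims[mask].mean()   # per-class aggregation
    return F.softmax(class_scores, dim=-1)        # matching confidence per class

query = torch.randn(128)
support = torch.randn(10, 128)                    # 5 classes x 2 shots
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
probs = few_shot_class_scores(query, support, labels, num_classes=5)
```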

When landing in real application scenarios, we found that training only on a pre-trained language model plus a small amount of target-task data could not meet our online quality bar. Few-shot classification is not as popular or well known in industry as sentence-pair matching, mainly because the training set must contain a large number of categories, otherwise meta-overfitting occurs easily, and public datasets are not large enough to support rapid development of this technology. Fortunately, after years of accumulation, the platform has amassed millions of intent categories and tens of millions of knowledge titles. On this basis we improved the meta-task sampling strategy (probability sampling, dynamic NK sampling, hard-sample sampling, multi-domain/multi-language sampling) and, together with inference acceleration, achieved significant improvements in many practical application scenarios. For example, without using any Yunxiaomi data, the few-shot classification model's out-of-the-box FAQ Q&A quality exceeds the SOTA of the sentence-pair matching model, which has millions of labeled matching examples in the target domain.

3.6 Multilingual Q&A

Globalization is one of Alibaba's three major strategies. As international business expands, multilingual Q&A faces difficulties and technical challenges quite different from the monolingual case. Most newly onboarded languages are low-resource, so using transfer from high-resource languages to improve low-resource languages is a challenge; different languages have complex and divergent grammar and morphology, for example Arabic has complex word formation, and its rich inflection plus a large vocabulary degrades model quality; Southeast Asia (Indonesian, Malay, etc.) and South Asia (Pakistan, etc.) mix many cultures, which brings code-mixing; and with many languages and many businesses, each needing rapid iteration, online maintenance costs are high.

The team grew and refined these capabilities through business practice, gradually building a language-independent Q&A dialogue technology stack, including language-independent preprocessing, language-independent sentence representation, a language-independent dialogue pre-training model, language-independent data augmentation, and language-independent operation tools. Here we focus on language-independent sentence representation:

  • Language-independent sentence representation: add pre-training tasks such as parallel bag-of-words prediction, dialogue-adapted contrastive training, and autoencoding MLM to erase language barriers, and adapt to the Q&A domain to strengthen language-independent sentence vector representations. This reduces the dependence on labeled data in the target language and enables rapid business cold start for new languages. The same language can also differ across regions in word order, language codes, and vocabulary, so we strengthen sentence representations for mixed languages with methods such as normalization, romanized transliteration, data augmentation, and adversarial attacks (a sketch of the contrastive objective on parallel sentences follows).
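As a rough sketch of contrastive training on parallel sentences, the snippet below treats a sentence and its translation as a positive pair and the rest of the batch as negatives (an InfoNCE-style loss). This is only one of the objectives listed above, written under our own simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def parallel_contrastive_loss(src_emb, tgt_emb, temperature=0.05):
    """Pull a sentence and its translation together; push apart other pairs in the batch."""
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.T / temperature       # [batch, batch] similarity matrix
    labels = torch.arange(src.size(0))       # diagonal entries are the true pairs
    return F.cross_entropy(logits, labels)

# Stand-in encoder outputs for 8 parallel sentence pairs, 256-dim each.
loss = parallel_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```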

3.7 Multimodal VQA Question Answering

The NLP team of DAMO Academy systematically designed an AI visual-text reasoning system and made a series of innovations, including diverse visual feature representation, multimodal pre-trained models, adaptive cross-modal semantic fusion and alignment technology, and knowledge-driven multi-skill AI integration, bringing AI to a new level of reading and understanding images.

Specifically, to tackle the challenges of multimodal tasks, the language technology laboratory and vision laboratory of DAMO Academy, building on the engineering foundation of the Alibaba Cloud PAI platform and the EFLOPS framework, systematically designed the AI visual-text reasoning system and integrated numerous algorithmic innovations, including:

  1. Diversified visual feature representation, which describes the local and global semantic information of the picture from all aspects, and uses Region, Grid, Patch and other visual feature representations to more accurately understand single-modality;

  2. Multimodal pre-training on massive image-text data and multi-granularity visual features for better multimodal information fusion and semantic mapping, with the newly proposed pre-trained models SemVLP [3], Grid-VLP [4], E2E-VLP [5], and Fusion-VLP;

  3. Research and develop adaptive cross-modal semantic fusion and alignment technology, and add Learning to Attend mechanism to the multi-modal pre-training model for efficient and deep fusion of cross-modal information;

  4. Structured pre-training on the rich text inside images, used to better fuse images and OCR text, with the StructuralLM [6] pre-trained model proposed for multimodal fusion of image, OCR, and text;

  5. Knowledge-driven multi-skill AI integration using Mixture of Experts (MoE): knowledge mining discovers AI skills automatically, and MoE technology automatically matches and composes AI skill experts.

Readers interested in the overall technical details can read our paper "Achieving Human Parity on Visual Question Answering"; within it, E2E-VLP [5] and StructuralLM [6] have been accepted by ACL 2021, a top international conference.

In June 2021, Alibaba DAMO Academy won the VQA Challenge 2021 among 55 participating teams, leading the runner-up by 1 percentage point and the previous year's champion by 3.4 percentage points.

Two months later, DAMO Academy made another key breakthrough on the VQA leaderboard, setting a new global record with an accuracy of 81.26% and surpassing the human baseline of 80.83% for the first time.

This is the first time AI has surpassed the human level since the VQA benchmark began, a major breakthrough. Following AI surpassing humans in visual recognition in 2015 and in text understanding in 2018, it is a major advance in multimodal technology involving the higher-level cognition of visual-text understanding. The progress was included as a key technological breakthrough in the MIT Technology Review "2021 Artificial Intelligence Innovation Research Institute Report".

4. Application customers and scenarios

4.1 New Retail Smart Customer Service 

4.1.1 DianXiaomi

⍟ 4.1.1.1 VQA

In Dianxiaomi, when a buyer asks a question, the system recognizes the buyer's intent and then finds the corresponding merchant-configured answer in the knowledge base to reply. In this process, answers must be configured manually by the merchant, which makes startup costly. Targeting this pain point, we proposed answering questions with the image-and-text content of the product detail page, which not only reduces merchants' answer configuration and startup costs but also encourages buyers to purchase and increases the inquiry-to-purchase conversion rate.

We therefore developed, based on image-text pre-training and related technologies, a Q&A capability over product detail page images: given the buyer's question, the system finds the most suitable image on the product detail page, highlights the specific answer region, and replies to the buyer with it.

At present, Dianxiaomi supports this capability across the whole industry, and the resolution rate and conversion rate of merchants who enable it have improved significantly. It improves the user experience while greatly reducing merchants' knowledge maintenance costs.

⍟ 4.1.1.2 Video Q&A

Live-stream selling has become a new business model, and more and more merchants run live broadcasts to introduce their products. Live videos contain rich product explanations, product details, try-on effects, and so on, and algorithms can automatically cut these broadcasts into clips. Replying to users with video clips not only saves merchants the cost of editing videos but also answers questions more vividly and concretely, improving the user experience.

Based on this idea, we combined text understanding and video understanding to build Q&A over live video. The core work is structural understanding of live video; here is a brief outline of the overall approach. First, obtain the complete live video segment corresponding to a product, and then perform structural understanding in two ways. The first is based on text understanding: perform intent recognition and named entity recognition on the video's ASR transcript; Dianxiaomi already has a complete intent taxonomy and entity category system, so the intent and slot values of each text segment can be identified. The second is based on video understanding: first use a video-text pre-training model to mine coarse-grained video clips that meet the requirements, and then locate a finer time interval with Video Grounding. Through these two methods, structured video clips can be mined to serve as the merchant's video answers. For multimodal video understanding we have made some innovations and accumulated experience: we explored an effective recipe for Video-Text Retrieval tasks based on strong image-text pre-trained models, and proposed a multiple-instance learning idea that achieves fine-grained, clip-level Video Grounding using only video-level supervision.

⍟ 4.1.1.3 Questions and Answers for Product Reviews

Applying buyer reviews to merchant customer-service Q&A is very challenging. Besides content risk control through fine-grained sentiment analysis, detection of time-sensitive statements, filtering of low-information content, and detection of uncertain statements, it is also necessary to perform conflict detection and integration of multi-source heterogeneous content against the merchant's own content (customer-service FAQ knowledge, product detail images, product attributes, merchant live-stream content, etc.) to ensure that review-derived content is usable.

  • Product Q&A in Smart Live Room

With the rise of live-stream e-commerce, it is essential for the virtual hosts in smart live-stream rooms to handle interactive Q&A, helping them answer users' pre-sales product questions efficiently and improve conversion. Because merchants carry a large number of products, we cannot expect them to configure FAQs one by one. Instead, without relying on merchant configuration, we provide out-of-the-box product Q&A over ready-made multi-source heterogeneous and multimodal content tied to the product, including product reviews, product detail pages, and expert articles, creating a multimodal Q&A experience that combines the virtual host's voice broadcast, subtitle or on-screen text display, and display-board images. Compared with traditional online intelligent customer service, Q&A in the live-stream room also has new characteristics: the host answers one-to-many, the system must judge the right moment to answer while product content is being broadcast, and answers must be delivered in a colloquial style. All of this brings new challenges and opportunities to live-room Q&A.

4.1.2 Intelligent Customer Service for Alibaba Group

The DeepQA technology system supports intelligent services for dozens of the Group's BUs, covering both online and hotline channels, and FAQ Q&A carries the majority of that traffic. As single-turn Q&A continued to improve, the bottleneck of the consumer Q&A experience gradually shifted to handling vague questions. Multi-turn understanding and clarification guidance for FAQ Q&A now supports many domains such as new retail, e-commerce platforms, and local life, as well as colloquial hotline scenarios, dynamic quick-phrase prediction, image-based Q&A, FAQ knowledge classification and matching, and recommendation when no answer is found. For newly onboarded small customers, multi-turn Q&A capabilities can be enabled quickly at low cost, further improving the consumer service experience. For example, Hotline Xiaomi provides multi-turn semantic modeling, multi-turn question rewriting, clarification questions, clarification confirmation, multi-turn dialogue state management, and multi-turn FAQ Q&A, effectively improving the system's answer rate.

4.1.3 Intelligent Customer Service for Overseas Customers

Through the multilingual Q&A technology system, domestic Chinese-language intelligent service capabilities have been extended worldwide, supporting 22 languages including English, Russian, Spanish, French, Japanese, Arabic, Korean, Polish, Portuguese, Thai, Hindi, and Vietnamese, bringing users of Alibaba's international businesses such as Lazada, AliExpress, and Daraz into the era of intelligent services. Based on the multilingual algorithm platform, a new low-resource language can be added within two weeks, and the overall resolution rate is already comparable to that of Chinese.

Beyond rapidly adding languages, it is also necessary to go deep into local culture and support authentic local language understanding. The system currently supports mixed-language styles that match local habits in Malaysia, Thailand, Pakistan, and elsewhere. Because input methods for some Middle Eastern and South Asian languages are imperfect, local users often type in romanized spelling when communicating online, so the system must simultaneously understand a mixture of native-script Urdu, English, and romanized Urdu.

4.2 Intelligent customer service on the cloud

The new-generation human-machine dialogue technology system of DAMO Academy serves and fully supports Alibaba Cloud's intelligent customer service business, including government city brain (government service portals, 12345 hotline robots, etc.), finance (banking, insurance, securities, etc.), transportation (highway ETC, ports, etc.), energy (grid, gas, water, heating, etc.), healthcare (medical insurance, health care, chronic disease management, etc.), and telecom operators (phone bills, data plans, etc.). To date, Alibaba Cloud Intelligent Customer Service has provided conversational AI services to more than a thousand companies and institutions at home and abroad, and has accumulated mature solutions and customer cases in nearly 20 industries such as manufacturing, retail, finance, transportation, communications, and government affairs.

In the "China AI Cloud Service Market Research Report" released by IDC every six months, Alibaba Cloud's intelligent customer service has been ranked first in China's conversational AI cloud service market share since 2019. In October 2021, IDC, an international authoritative research institution, released the "IDC MarketScape Global Conversational AI Platform Vendor Evaluation Report". Low-cost knowledge construction, low-code visual operation, self-training semantic model and other product technical advantages, as well as accumulated field experience and applications in rich scenarios, were selected for the first time in the IDC Global Marketscape report and won the Major Players position.

4.2.1 Government affairs industry

In the field of government affairs, the typical business is the 12345 hotline, which covers a wide range of scenarios, including social security inquiries, ETC, household registration management, entry-exit administration, housing security, and a full-voice portal for the housing provident fund.

4.2.2 Banking Industry

In the digital economy era, the traditional model of relying solely on human agents has gradually become unable to meet financial institutions' customer service needs. Empowering human agents with intelligent customer service both raises the agents' personal value and lowers turnover, and improves service efficiency and user experience, which is an important demand of financial institutions. With DAMO Academy's conversational AI, speech recognition, and AIC technology, we built coaching robots covering intelligent assistance and intelligent training and constructed an AI capability platform, with customers covering many leading banks.

4.2.3 Energy Industry

Energy and infrastructure: build a full-link service platform spanning new installation, billing, fault reporting, maintenance, complaints, and other scenarios; channels cover the hotline, WeChat, Alipay, and others; the solution is reused for gas, water, heating, and electricity utilities across cities.

4.3 Social Responsibility: Epidemic Outbound-Call Platform

When the epidemic first broke out, the DAMO Academy team took action, hoping to help the government by building an epidemic outbound-call platform. The platform was built within five days and then promoted across the country. As of March 31, 2020, it had been used in 27 provinces and had helped governments make more than 10 million outbound calls, with a completion rate of over 90%. It won first prize in People's Daily Online's "People's Fight Against Epidemic" awards.

5. Future Prospects for the New Generation of Human-Machine Dialogue

Over the past two years, the Intelligent Dialogue and Service Team of DAMO Academy has made great progress in technology and business. So at what stage is human-machine dialogue capability now, and in which direction will it develop?

For this purpose, drawing on the 5-level autonomous driving standard, we define a 5-level standard for human-machine dialogue capability along three dimensions: 1) how restricted or open the scenario is; 2) the languages and modalities the dialogue involves; 3) whether the dialogue capability is pre-defined or capable of continuous learning and evolution. The 5-level standard is defined as follows:

  • L1: restricted scenario, single language, single modality, pre-defined dialogue

  • L2: semi-open scenario, single language, single modality, pre-defined dialogue

  • L3: semi-open scenario, multi-language, multi-modality, pre-defined dialogue

  • L4: semi-open scenario, multi-language, multi-modality, lifelong-learning dialogue robot

  • L5: fully open scenario, multi-language, multi-modality, lifelong-learning dialogue robot

By this standard, human-machine dialogue in the industry is basically between L1 and L2. Over the next three years, human-machine dialogue will gradually expand from restricted to semi-open scenarios and from single modality to multimodality integrating voice, language, vision, and emotion, and dialogue capability will move from pre-defined to lifelong learning, evolving toward L3~L4.

Reaching L5, where machines can communicate freely with humans across language gaps and modality restrictions in completely open scenarios, will still require persistent research and exploration by the Intelligent Dialogue and Service Team of DAMO Academy.

We sincerely invite talents interested in human-computer dialogue, knowledge graphs, intelligent question answering, multi-modal human-computer interaction, and human-computer dialogue in virtual-space scenarios to join us.

Papers related to DAMO Academy's Intelligent Dialogue and Service

1. Yinpei Dai, Hangyu Li, Yongbin Li, Jian Sun, Fei Huang, Luo Si and Xiaodan Zhu. Preview, Attend and Review: Schema-Aware Curriculum Learning for Multi-Domain Dialogue State Tracking. ACL-IJCNLP 2021.

2. Che Liu, Rui Wang, Jinghua Liu, Jian Sun, Fei Huang, Luo Si. DialogueCSE: Dialogue-based Contrastive Learning of Sentence Embeddings. EMNLP 2021.

3. Wanwei He, Yinpei Dai, Yinhe Zheng, Yuchuan Wu, Zheng Cao, Dermot Liu, Peng Jiang, Min Yang, Fei Huang, Luo Si, Jian Sun, Yongbin Li. GALAXY: A Generative Pre-trained Model for Task-Oriented Dialog with Semi-Supervised Learning and Explicit Policy Injection. AAAI 2022.

4. Binyuan Hui, Ruiying Geng, Qiyu Ren, Binhua Li, Yongbin Li, Jian Sun, Fei Huang, Luo Si, Pengfei Zhu, Xiaodan Zhu. Dynamic Hybrid Relation Network for Cross-Domain Context-Dependent Semantic Parsing. AAAI 2021.

5. Guanglin Niu, Yang Li, Chengguang Tang, Ruiying Geng, Jian Dai, Qiao Liu, Hao Wang, Jian Sun, Fei Huang and Luo Si. Relational Learning with Gated and Attentive Neighbor Aggregator for Few-Shot Knowledge Graph Completion. SIGIR 2021.

6. Ruiying Geng, Binghua Li, Yongbin Li, Jian Sun, Xiaodan Zhu. Dynamic Memory Induction Networks for Few-Shot Text Classification. ACL 2020, Seattle, USA.

7. Yinpei Dai, Hangyu Li, Chengguang Tang, Yongbin Li, Jian Sun, Xiaodan Zhu. Learning Low-Resource End-To-End Goal-Oriented Dialog for Fast and Reliable System Deployment. ACL 2020, Seattle, USA.

8. Jinghan Zhang, Yuxiao Ye, Yue Zhang, Likun Qiu, Jian Sun. Multi-Point Semantic Representation for Intent Classification. AAAI 2020, New York City, NY, USA.

9. Yinpei Dai, Huihua Yu, Yixuan Jiang, Chengguang Tang, Yongbin Li, Jian Sun. A Survey on Dialog Management: Recent Advances and Challenges. arXiv:2005.02233.

10. Haitao Mi, Qiyu Ren, Yinpei Dai, Yifan He, Jian Sun, Yongbin Li, Jing Zheng, Peng Xu. Towards Generalized Models for Beyond Domain API Task-oriented Dialogue. AAAI 2021 DSTC9 Workshop.

11. Yajing Sun, Yong Shan, Chengguang Tang, Yue Hu, Yinpei Dai, Jing Yu, Jian Sun, Fei Huang, Luo Si. Unsupervised Learning of Deterministic Dialogue Structure with Edge-Enhanced Graph Auto-Encoder. AAAI 2021.

12. Bin Fu, Yunqi Qiu, Chengguang Tang, Yang Li, Haiyang Yu, Jian Sun. A Survey on Complex Question Answering over Knowledge Base: Recent Advances and Challenges. arXiv:2007.13069.

13. Ming Yan, Haiyang Xu, Chenliang Li, Junfeng Tian, Bin Bi, Wei Wang, Weihua Chen, Xianzhe Xu, Fan Wang, Zheng Cao, Zhicheng Zhang, Qiyu Zhang, Ji Zhang, Songfang Huang, Fei Huang, Luo Si, Rong Jin. Achieving Human Parity on Visual Question Answering. arXiv:2111.08896, https://arxiv.org/abs/2111.08896.

14. Feng-Lin Li, Zhongzhou Zhao, Qin Lu, Xuming Lin, Hehong Chen, Bo Chen, Liming Pu, Jiashuo Zhang, Fu Sun, Xikai Liu, Liqun Xie, Qi Huang, Ji Zhang, Haiqing Chen. AliMe Avatar: Multi-modal Content Production and Presentation for Live-streaming E-commerce. SIGIR 2021 Industrial Track.

15. Guohai Xu, Yan Shao, Chenliang Li, Feng-Lin Li, Bing Bi, Ji Zhang, Haiqing Chen. AliMe DA: A Data Augmentation Framework for Question Answering in Cold-start Scenarios. SIGIR 2021 Industrial Track.

16. Qianglong Chen, Feng Ji, Xiangji Zeng, Feng-Lin Li, Ji Zhang, Haiqing Chen, Yin Zhang. KACE: Generating Knowledge Aware Contrastive Explanations for Natural Language Inference. ACL 2021.

17. Feng-Lin Li, Hehong Chen, Guohai Xu, Tian Qiu, Feng Ji, Ji Zhang, Haiqing Chen. AliMe KG: Domain Knowledge Graph Construction and Application in E-commerce. CIKM 2020, Applied Research Track.

18. Haiyang Xu, Ming Yan, Chenliang Li, Bin Bi, Songfang Huang, Wenming Xiao, Fei Huang. E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning. ACL 2021, https://aclanthology.org/2021.acl-long.42.pdf.

19. Chenliang Li, Bin Bi, Ming Yan, Wei Wang, Songfang Huang, Fei Huang, Luo Si. StructuralLM: Structural Pre-training for Form Understanding. ACL 2021, https://aclanthology.org/2021.acl-long.493/.

20. Chenliang Li, Ming Yan, Haiyang Xu, Fuli Luo, Wei Wang, Bin Bi, Songfang Huang. SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels. arXiv:2103.07829, https://arxiv.org/abs/2103.07829.

21. Ming Yan, Haiyang Xu, Chenliang Li, Bin Bi, Junfeng Tian, Min Gui, Wei Wang. Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training. arXiv:2108.09479, https://arxiv.org/abs/2108.09479.

22. Y Cui, Z Yu, C Wang, Z Zhao, J Zhang, M Wang, J Yu. ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration. ACM MM 2021.

23. Xuming Lin, Shaobo Cui, Zhongzhou Zhao, Wei Zhou, Ji Zhang and Haiqing Chen. GGP: A Graph-based Grouping Planner for Explicit Control of Long Text Generation. CIKM 2021.

24. Guohai Xu, Hehong Chen, Feng-Lin Li, Fu Sun, Yunzhou Shi, ZhiXiong Zeng, Wei Zhou, Zhongzhou Zhao, Ji Zhang. AliMe MKG: A Multi-modal Knowledge Graph for Live-streaming E-commerce. CIKM 2021 Demo.

25. Fu Sun, Feng-Lin Li, Ruize Wang, Qianglong Chen, Xingyi Cheng, Ji Zhang. K-AID: Enhancing Pre-trained Language Models with Domain Knowledge for Question Answering. CIKM 2021 Applied Track.

26. Fangkai Jiao, Yangyang Guo, Yilin Niu, Feng Ji, Feng-Lin Li, Liqiang Nie. REPT: Bridging Language Models and Machine Reading Comprehension via Retrieval-Based Pre-training. ACL 2021 Findings.

27. Shaobo Cui, Xintong Bao, Xinxing Zu, Yangyang Guo, Zhongzhou Zhao, Ji Zhang, Haiqing Chen. OneStop QAMaker: Extract Question-Answer Pairs from Text in a One-Stop Approach. WWW 2021.

28. Yangyang Guo, Liqiang Nie, Zhiyong Cheng, Feng Ji, Ji Zhang, Alberto Del Bimbo. AdaVQA: Overcoming Language Priors with Adapted Margin Cosine Loss. IJCAI 2021.

29. Zhenxin Fu, Shaobo Cui, Feng Ji, Ji Zhang, Haiqing Chen, Dongyan Zhao, Rui Yan. Query-to-Session Matching: Do NOT Forget History and Future during Response Selection for Multi-Turn Dialogue Systems. CIKM 2020.

30. Runqi Yang, Jianhai Zhang, Xing Gao, Feng Ji and Haiqing Chen. Simple and Effective Text Matching with Richer Alignment Features. ACL 2019, Long Paper.

31. Ming Yan, Jiangnan Xia, Chen Wu, Bin Bi, Zhongzhou Zhao, Ji Zhang, Luo Si, Rui Wang, Wei Wang and Haiqing Chen. A Deep Cascade Model for Multi-Document Reading Comprehension. AAAI 2019.

32. Feng-Lin Li, Minghui Qiu, Haiqing Chen, Xiongwei Wang, Xing Gao, Jun Huang, Juwei Ren, Zhongzhou Zhao, Weipeng Zhao, Lei Wang, Guwei Jin and Wei Chu. AliMe Assist: An Intelligent Assistant for Creating an Innovative E-Commerce Experience. CIKM 2017 Demo (Best Demo Award).

33. Minghui Qiu, Fenglin Li, Siyu Wang, Xing Gao, Yan Chen, Weipeng Zhao, Haiqing Chen, Jun Huang and Wei Chu. AliMe Chat: A Sequence to Sequence and Rerank based Chatbot Engine. ACL 2017, Short Paper.
