Multi-turn Dialogue (3): Spoken Language Understanding Progress and Frontiers

This blog post is based on a survey by Harbin Institute of Technology published at IJCAI: A Survey on Spoken Language Understanding: Recent Advances and New Frontiers.

Paper link
github link

Spoken Language Understanding (SLU), which aims to extract the semantic frame of user queries, is a core component of task-oriented dialogue systems. The survey provides: (1) a new taxonomy: a fresh perspective on the SLU field, distinguishing single models from joint models, implicit from explicit joint modeling within joint models, and non-pre-training from pre-training paradigms; (2) new frontiers: emerging areas of complex SLU and their corresponding challenges; (3) rich open-source resources: related papers, baseline projects, and leaderboards collected at Awesome-SLU-Survey.

1. Introduction

Spoken Language Understanding (SLU), which aims to capture the semantics of user queries, is a core component of task-oriented dialogue systems. It usually consists of two tasks: intent detection and slot filling. The input is an utterance, and the output consists of an intent class label and a sequence of slot labels.

Spoken Language Understanding Example

Intent detection can be framed as a sentence classification problem (e.g., with CNNs or RNNs), and slot filling as a sequence labeling task (e.g., with CRFs, RNNs, or LSTMs). Traditional approaches treat slot filling and intent detection as two independent tasks, ignoring the knowledge shared between them. Intuitively, however, the two tasks are not independent but highly correlated. To this end, leading models in the literature employ joint models to exploit the shared knowledge between the two tasks.
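To make the input and output concrete, here is a tiny hypothetical SNIPS-style example in Python; the utterance, intent name, and slot labels are illustrative rather than copied from the actual dataset:

```python
# A hypothetical SLU example: one utterance, one intent label,
# and one BIO slot label per token (illustrative, not copied from ATIS/SNIPS).
example = {
    "tokens": ["play", "some", "jazz", "music", "by", "miles", "davis"],
    "intent": "PlayMusic",
    "slots":  ["O", "O", "B-genre", "O", "O", "B-artist", "I-artist"],
}
assert len(example["tokens"]) == len(example["slots"])
```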

  • vanilla multi-task:
    A joint model of intent determination and slot filling for spoken language understanding
  • slot-gated:
    Slot-gated modeling for joint slot filling and intent prediction
    A self-attentive model with gate mechanism for spoken language understanding
  • stack-propagation:
    A stack-propagation framework with token-level intent detection for spoken language understanding
  • bi-directional interaction:
    A novel bi-directional interrelated model for joint intent detection and slot filling
    A co-interactive transformer for joint slot filling and intent detection

At present, intent accuracy and slot F1 exceed 97% and 98% on ATIS, and exceed 97% and 99% on SNIPS. But has the SLU task really been solved? On closer inspection, mainstream work still targets a simple setting: single domain and single turn, which falls far short of the requirements of many complex applications.

2. Background

1. Definition
  • Intent Detection
    Given an input utterance X = (x_1, ..., x_n), intent detection can be seen as a sentence classification task that decides the intent label o^I:
    o^I = ID(X).
  • Slot Filling
    Slot filling can be seen as a sequence labeling task that generates a slot sequence o^S = (o^S_1, ..., o^S_n):
    o^S = SF(X).
  • The Joint Model
    A joint model predicts the slot sequence and the intent at the same time, with the advantage of capturing the knowledge shared across the related tasks:
    (o^I, o^S) = JM(X).
2. Dataset

The most widely used datasets are ATIS and SNIPS.

  • ATIS
    The ATIS spoken language systems pilot corpus.
    The ATIS dataset contains audio recordings of people making flight reservations. There are 4478 utterances for training, 500 utterances for validation and 893 utterances for testing. The ATIS training data contains 120 slot labels and 21 intent types.
  • SNIPS
    Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces.
    SNIPS is a custom intent engine. There are 13084 utterances for training, 700 utterances for validation and 700 utterances for testing. There are 72 slot labels and 7 intent types in total.
3. Evaluation Metrics

The most widely used evaluation metrics for SLU are the slot F1 score, intent accuracy, and overall accuracy.

  • F1 Scores:
    Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition.
    The F1 score is used to evaluate the performance of slot filling; it is the harmonic mean of precision and recall. A slot prediction is considered correct only when it exactly matches the gold span and label.
  • Intent Accuracy:
    Intent accuracy is used to evaluate the performance of intent detection, computing the proportion of sentences whose intent is correctly predicted.
  • Overall Accuracy:
    Slot-gated modeling for joint slot filling and intent prediction.
    Overall accuracy is the proportion of sentences in which both the intent and all slots are correctly predicted, so it takes intent detection and slot filling into account simultaneously (a sketch computing all three metrics follows this list).
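For illustration, the sketch below computes all three metrics from predicted and gold labels. The function names and span-extraction details are our own simplification; published results typically use the conlleval-style evaluation script for slot F1.

```python
def slot_spans(labels):
    """Extract (start, end, slot_type) spans from a BIO label sequence."""
    spans, start, typ = [], None, None
    for i, lab in enumerate(labels + ["O"]):           # sentinel flushes the last span
        continues = lab.startswith("I-") and typ == lab[2:]
        if not continues:                              # current span (if any) ends here
            if typ is not None:
                spans.append((start, i, typ))
                start, typ = None, None
            if lab.startswith("B-"):                   # a new span begins
                start, typ = i, lab[2:]
    return set(spans)

def evaluate(pred_intents, gold_intents, pred_slots, gold_slots):
    """Slot F1 (exact span match), intent accuracy, and overall (sentence) accuracy."""
    tp = fp = fn = 0
    intent_hits = overall_hits = 0
    for pi, gi, ps, gs in zip(pred_intents, gold_intents, pred_slots, gold_slots):
        p_spans, g_spans = slot_spans(ps), slot_spans(gs)
        tp += len(p_spans & g_spans)
        fp += len(p_spans - g_spans)
        fn += len(g_spans - p_spans)
        intent_hits += int(pi == gi)
        overall_hits += int(pi == gi and ps == gs)     # both tasks fully correct
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    n = len(gold_intents)
    return {"slot_f1": f1, "intent_acc": intent_hits / n, "overall_acc": overall_hits / n}
```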

3. Taxonomy

SLU Taxonomy

1. Single Model

In the single-model approach, a separate model is trained for each task: one for intent detection and one for slot filling. Because the two models are trained separately, there is no interaction between intent detection and slot filling, so the shared knowledge between the two tasks cannot be exploited (a minimal sketch of two such independent models follows the list below).

  • Intent Detection:
    Gradient-based learning applied to document recognition: Convolutional neural network (CNN) is used to extract 5-gram features, and max-pooling is applied to obtain word representations.
    Recurrent neural network and lstm models for lexical utterance classification: Successfully applied RNN and long short-term memory network (LSTM) to ID task, showing that sequential features are beneficial for intent detection task.
  • Slot Filling:
    Recurrent neural networks for language understanding: RNN language models (RNN-LMs) are used to predict slot labels instead of words; future words, named entities, syntactic features, and part-of-speech information are also explored.
    Spoken language understanding using long short-term memory neural networks: An LSTM framework is proposed for the slot filling task.
    Using recurrent neural networks for slot filling in spoken language understanding: Viterbi decoding and a recurrent CRF are applied to alleviate the label bias problem.
    Recurrent conditional random field for language understanding: R-CRF is proposed to address label bias.
    Recurrent neural network structured output prediction for spoken language understanding: A sampling approach is proposed to model slot label dependencies, feeding the sampled output labels (gold or predicted) back into the sequence state.
    Bi-directional recurrent neural network with ranking loss for spoken language understanding: A ranking loss is used with a bi-directional RNN for slot filling, further improving performance on the ATIS dataset.
    Leveraging sentence-level information with encoder LSTM for semantic slot filling: Sentence-level information from the encoder is leveraged to improve slot filling.
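As referenced above, here is a minimal PyTorch-style sketch of the single-model setting: two independent networks with no shared parameters or interaction. The layer sizes and names are our own assumptions, not from any cited paper.

```python
import torch
import torch.nn as nn

class IntentClassifier(nn.Module):
    """Independent intent detector: encode the utterance, classify the pooled state."""
    def __init__(self, vocab_size, num_intents, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_intents)

    def forward(self, tokens):                       # tokens: (batch, seq_len)
        h, _ = self.lstm(self.emb(tokens))           # (batch, seq_len, 2*hidden)
        return self.out(h.mean(dim=1))               # sentence-level intent logits

class SlotTagger(nn.Module):
    """Independent slot filler: one label per token (BIO scheme)."""
    def __init__(self, vocab_size, num_slots, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_slots)

    def forward(self, tokens):
        h, _ = self.lstm(self.emb(tokens))
        return self.out(h)                           # per-token slot logits

# The two models are trained with separate losses and never exchange information.
```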
2. Joint Model

model performance

Considering the close correlation between intent detection and slot filling, most works in the literature employ joint models to exploit the knowledge shared across the two tasks. Existing joint work falls mainly into two categories: implicit joint modeling and explicit joint modeling.

  • Implicit Joint Modeling:
    Implicit joint modeling means the model only adopts a shared encoder to capture shared features, without any explicit interaction. While this directly incorporates shared knowledge, the lack of explicit interaction modeling leads to lower interpretability and weaker performance.
    A joint model of intent determination and slot filling for spoken language understanding: A shared RNN (Joint ID and SF) is introduced to learn the correlation between intents and slots.
    Attention-based recurrent neural network models for joint intent detection and slot filling: A shared encoder-decoder framework with an attention mechanism (Attention BiRNN) is introduced for joint intent detection and slot filling.
    Joint online spoken language understanding and language modeling with recurrent neural networks: Slot filling, intent detection, and language modeling are jointly trained with a shared RNN (Joint SLU-LM), aiming to improve online prediction.
    Multi-domain joint semantic frame parsing using bi-directional rnn-lstm: A shared bi-directional RNN-LSTM architecture (Joint Seq) is proposed for joint modeling.
  • Explicit Joint Modeling:
    In recent years, more and more work explicitly models the interaction between intent detection and slot filling with a dedicated interaction module, which has the advantage of explicitly controlling the interaction process. Existing explicit joint modeling methods can be divided into two types: single-flow interaction and bidirectional-flow interaction.
    • Single Flow Interaction:
      Current single-flow interaction research mainly considers a single flow of information from intents to slots (a minimal sketch of this kind of intent-to-slot gating appears after this list).
      Slot-gated modeling for joint slot filling and intent prediction: A slot-gated joint model (Slot-Gated) is proposed that allows slot filling to be conditioned on the learned intent.
      A self-attentive model with gate mechanism for spoken language understanding: A self-attentive model (Self-Atten. Model) is proposed, which uses an intent-enhanced gating mechanism to guide slot filling.
      A stack-propagation framework with token-level intent detection for spoken language understanding: A Stack-Propagation model is proposed, which directly uses intent detection results to guide slot filling and adopts token-level intent detection to mitigate error propagation, further improving performance.
    • Bidirectional Flow Interaction:
      Bidirectional flow interaction means the model considers the cross-impact between intent detection and slot filling in both directions. Whereas simple multi-task frameworks only implicitly consider the connection between the two tasks by sharing latent representations, explicit joint modeling lets the model fully capture the knowledge shared across tasks, improving the performance of both. Moreover, explicitly controlling the knowledge transfer between the two tasks improves interpretability, making it easier to analyze the effect between SF and ID.
      A bi-model based RNN semantic frame parsing model for intent detection and slot filling: A Bi-Model architecture is proposed to consider the cross-influence between SF and ID using two correlated bidirectional LSTMs.
      A novel bi-directional interrelated model for joint intent detection and slot filling: Considering both the SF-to-ID and ID-to-SF directions, an SF-ID network is proposed that provides a bidirectional association mechanism for the two tasks.
      Joint slot filling and intent detection via capsule neural networks: A dynamic-routing capsule network (Capsule-NLU) is introduced to model the hierarchical and interrelated relations between the two tasks.
      CM-net: A novel collaborative memory network for spoken language understanding: A collaborative memory network (CM-Net) is proposed for jointly modeling SF and ID.
      Graph lstm with context-gated mechanism for spoken language understanding: Graph LSTM is introduced into SLU and achieves good performance.
      A co-interactive transformer for joint slot filling and intent detection: A co-interactive Transformer is proposed to consider cross-influence by establishing bidirectional connections between the two related tasks.
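The following PyTorch-style sketch illustrates the shared-encoder-plus-single-flow idea in its simplest form: one encoder feeds both tasks, and a gate derived from the predicted intent modulates the slot features. It is our simplified illustration in the spirit of slot-gated models, not the exact architecture of any cited paper; all layer names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class JointSLU(nn.Module):
    """Shared encoder for both tasks; the predicted intent gates the slot features."""
    def __init__(self, vocab_size, num_intents, num_slots, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.intent_head = nn.Linear(2 * hidden, num_intents)
        self.intent_proj = nn.Linear(num_intents, 2 * hidden)   # intent -> gate space
        self.slot_head = nn.Linear(2 * hidden, num_slots)

    def forward(self, tokens):                        # tokens: (batch, seq_len)
        h, _ = self.encoder(self.emb(tokens))         # shared features (B, T, 2H)
        intent_logits = self.intent_head(h.mean(dim=1))           # utterance-level intent
        # Single-flow interaction: a gate derived from the intent prediction
        # modulates the token features used for slot filling.
        gate = torch.sigmoid(self.intent_proj(intent_logits))     # (B, 2H)
        slot_logits = self.slot_head(h * gate.unsqueeze(1))       # (B, T, num_slots)
        return intent_logits, slot_logits
```

A bidirectional-flow variant would additionally feed slot features back into intent prediction; training typically sums a sentence-level cross-entropy loss for the intent and a token-level cross-entropy (or CRF) loss for the slots.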
3. Pre-trained Paradigm

Recently, pretrained language models (PLMs) have achieved impressive results on various NLP tasks; in SLU, a shared BERT is typically used as the encoder to extract contextual representations. In BERT-based models, each utterance starts with [CLS] and ends with [SEP], where [CLS] is a special token representing the entire sequence and [SEP] is a special token separating non-consecutive token sequences. The representation of the [CLS] token is used for intent detection, and the other token representations are used for slot filling. Pre-trained models provide rich semantic features and help improve SLU performance.
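A minimal sketch of this [CLS]/token split, assuming the HuggingFace transformers and PyTorch libraries are available (and access to the bert-base-uncased checkpoint); the classifier heads and label counts are illustrative:

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class BertSLU(nn.Module):
    """[CLS] representation -> intent; per-token representations -> slots."""
    def __init__(self, num_intents, num_slots, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        self.intent_head = nn.Linear(hidden, num_intents)
        self.slot_head = nn.Linear(hidden, num_slots)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                     # [CLS] token
        intent_logits = self.intent_head(cls)
        slot_logits = self.slot_head(out.last_hidden_state)  # includes [CLS]/[SEP] positions
        return intent_logits, slot_logits

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
batch = tokenizer(["book a flight to boston"], return_tensors="pt")
model = BertSLU(num_intents=21, num_slots=120)
intent_logits, slot_logits = model(batch["input_ids"], batch["attention_mask"])
```

Because BERT uses WordPiece tokenization, slot labels are in practice usually aligned to the first sub-token of each word.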

Bert for joint intent classification and slot filling: BERT is used to extract shared contextual embeddings for intent detection and slot filling, and the model achieves significant improvements compared to other non-pretrained models.
A stack-propagation framework with token-level intent detection for spoken language understanding: The self-attentive encoder is replaced with a pre-trained BERT encoder (Stack-Propagation + BERT), further improving the model's performance.
A co-interactive transformer for joint slot filling and intent detection: BERT (Co-Interactive transformer BERT) is also explored for SLU, achieving state-of-the-art performance.

4. New Frontiers and Challenges

1. Contextual SLU

Naturally, completing a task often requires multiple back-and-forth exchanges between the user and the system, which calls for contextual SLU. Unlike single-turn SLU, contextual SLU faces a unique ambiguity challenge: users and the system may refer to entities introduced in previous dialogue turns, so the model must incorporate contextual information to resolve the ambiguity.

End-to-end memory networks with knowledge carryover for multi-turn spoken language understanding: A memory network is proposed to integrate dialogue history, and the model outperforms context-free counterparts.
Sequential dialogue context modeling for spoken language understanding: A Sequential Dialogue Encoder Network is proposed that encodes the dialogue history in chronological order.
How time matters: Learning time-decay attention for contextual spoken language understanding in dialogues: Various time-decay attention functions are designed and studied on top of an end-to-end contextual language understanding model.
Cosda-ml: Multi-lingual code-switching data augmentation for zero-shot cross-lingual nlp: An adaptive fusion layer is proposed to dynamically consider different and related context information to guide slot filling, achieving fine-grained transfer of context information.
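As a rough sketch of contextual information integration (our own simplification rather than any specific cited architecture): each history turn is encoded into a vector, the current utterance attends over those vectors, and the resulting context vector is fused back into the representation used for intent detection and slot filling.

```python
import torch
import torch.nn as nn

class HistoryAttention(nn.Module):
    """Attend over encoded dialogue-history turns with the current utterance as query."""
    def __init__(self, hidden=256):
        super().__init__()
        self.query = nn.Linear(hidden, hidden)
        self.fuse = nn.Linear(2 * hidden, hidden)

    def forward(self, current_utt, history_turns):
        # current_utt: (batch, hidden); history_turns: (batch, num_turns, hidden)
        q = self.query(current_utt).unsqueeze(1)                 # (B, 1, H)
        scores = torch.bmm(q, history_turns.transpose(1, 2))     # (B, 1, T)
        weights = torch.softmax(scores, dim=-1)                  # relevance of each turn
        context = torch.bmm(weights, history_turns).squeeze(1)   # (B, H)
        # Fuse context with the current utterance before intent/slot prediction.
        return torch.tanh(self.fuse(torch.cat([current_utt, context], dim=-1)))
```

A time-decay attention variant would additionally bias the attention weights toward more recent turns.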

The main challenges are: 1) Contextual Information Integration: Correctly distinguishing the relevance of different dialogue histories to the current utterance and effectively integrating contextual information into contextual SLU is a core challenge. 2) Long-Distance Obstacle: Since some conversations have very long histories, how to effectively model long-distance dialogue history and filter irrelevant noise is an interesting research topic.

2. Multi-Intent SLU

Multi-intent SLU means that the system can handle utterances containing multiple intents and their corresponding slots. It is shown that 52% of the examples in the Amazon internal dataset are multi-intent, which shows that the multi-intent setting is more practical in real-world scenarios.

Joint multiple intent detection and slot labeling for goal-oriented dialog: A multi-task framework is explored that jointly performs multiple intent classification and slot filling.
AGIF: An adaptive graph-interactive framework for joint multiple intent detection and slot filling: An adaptive graph-interactive framework is proposed for modeling the interaction between multiple intents and slots at each token.
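A minimal sketch of the intent side of multi-intent SLU, treating it as multi-label classification with an independent sigmoid per intent (the threshold and all names are our assumptions):

```python
import torch
import torch.nn as nn

class MultiIntentHead(nn.Module):
    """Multi-label intent detection: one independent sigmoid per intent."""
    def __init__(self, hidden, num_intents, threshold=0.5):
        super().__init__()
        self.out = nn.Linear(hidden, num_intents)
        self.threshold = threshold

    def forward(self, utterance_repr):                 # (batch, hidden)
        logits = self.out(utterance_repr)
        probs = torch.sigmoid(logits)                  # each intent scored independently
        predicted = probs > self.threshold             # boolean mask of active intents
        return logits, predicted

# Training uses nn.BCEWithLogitsLoss against a multi-hot intent vector;
# the predicted intents can then be used to guide the slot decoder (as in AGIF-style models).
```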

Main challenges: 1) Interaction between Multiple Intents and Slots: Unlike single-intent SLU, how to effectively use multi-intent information to guide slot prediction is a unique challenge of multi-intent SLU. 2) Lack of Data: There is currently no human-annotated dataset for multi-intent SLU, which may be another reason for the slow progress.

3. Chinese SLU

Chinese SLU means that SLU models trained on Chinese data are applied directly to the Chinese-speaking community. Compared with English SLU, Chinese SLU faces a unique challenge because it usually requires word segmentation.

CM-net: A novel collaborative memory network for spoken language understanding: A new Chinese SLU corpus (CAIS) is contributed to the research community, together with a character-based joint model. One drawback of character-based SLU models, however, is that they do not fully exploit explicit word-sequence information, which can be useful.
Injecting word information with multi-level word adapter for chinese spoken language understanding: A multi-level word adapter is proposed to effectively inject word information into sentence-level intent detection and token-level slot filling.
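As a rough illustration of word information integration, here is a generic gated fusion of character-level and word-level features; it is a simplification, not the multi-level word adapter itself, and the alignment of word features to characters is assumed to be done beforehand:

```python
import torch
import torch.nn as nn

class CharWordFusion(nn.Module):
    """Gate-based fusion of character-level and word-level features for Chinese SLU."""
    def __init__(self, hidden=256):
        super().__init__()
        self.gate = nn.Linear(2 * hidden, hidden)

    def forward(self, char_feats, word_feats):
        # char_feats: (batch, num_chars, hidden)
        # word_feats: (batch, num_chars, hidden), word features aligned to each character
        g = torch.sigmoid(self.gate(torch.cat([char_feats, word_feats], dim=-1)))
        return g * char_feats + (1 - g) * word_feats   # per-position convex combination
```

The alignment step itself (mapping each character to the segmented word that covers it) is exactly where multiple segmentation criteria become an issue.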

Main challenges: 1) Word Information Integration: How to effectively incorporate word information to guide Chinese SLU is a unique challenge. 2) Multiple Word Segmentation Criteria: Because multiple word segmentation criteria exist, effectively combining the different segmentations for Chinese SLU is not easy.

4. Cross Domain SLU

Although existing SLU models achieve good performance in single-domain settings, they rely on large amounts of annotated data, which limits their usefulness in new and extended domains. In practice, it is infeasible to collect rich labeled datasets for every new domain, so a cross-domain setting is considered. Knowledge transfer methods in this area fall into two categories: implicit domain knowledge transfer and explicit domain knowledge transfer with parameter sharing.

Implicit domain knowledge transfer means that the model is simply trained with multi-domain datasets to capture domain features. This approach can implicitly extract shared features, but cannot effectively capture domain-specific features.
Multi-domain joint semantic frame parsing using bi-directional rnn-lstm: A single LSTM model on mixed multi-domain datasets is proposed, which can implicitly learn domain-shared features.
Onenet: Joint domain, intent, slot prediction for spoken language understanding: One network is adopted to jointly model slot filling, intent detection, and domain classification, implicitly learning domain-shared and task-shared information.

Explicit domain knowledge transfer means that the model adopts a shared-private framework: a shared module captures domain-shared features, while each domain has its own private module. The advantage is that shared knowledge is explicitly separated from private knowledge.
Domain attention with an ensemble of experts: An attention mechanism is used to learn a weighted combination of feedback from expert models trained on different domains.
Multi-domain adversarial learning for slot filling in spoken language understanding: A shared LSTM is used to capture domain-shared knowledge and private LSTMs to extract domain-specific features, which are combined for multi-domain slot filling.
Multi-domain spoken language understanding using domain- and task-aware parameterization: A model with separate domain-specific and task-specific parameters is proposed, which can capture both task-aware and domain-aware features for multi-domain SLU.
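A minimal sketch of the shared-private idea (our own simplification): a shared encoder captures domain-general features, each domain keeps its own private encoder, and the two feature streams are concatenated before the task heads.

```python
import torch
import torch.nn as nn

class SharedPrivateEncoder(nn.Module):
    """Shared encoder for all domains plus one private encoder per domain."""
    def __init__(self, domains, emb_dim=100, hidden=128):
        super().__init__()
        self.shared = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.private = nn.ModuleDict({
            d: nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
            for d in domains
        })

    def forward(self, embedded, domain):
        # embedded: (batch, seq_len, emb_dim); domain: a key such as "flight"
        shared_h, _ = self.shared(embedded)            # domain-shared features
        private_h, _ = self.private[domain](embedded)  # domain-specific features
        return torch.cat([shared_h, private_h], dim=-1)

# Intent and slot heads are applied on top of the concatenated features;
# adversarial training is sometimes added so the shared part stays domain-invariant.
```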

Main challenges: 1) Domain Knowledge Transfer: It is not trivial to transfer knowledge from the source domain to the target domain; in addition, how to perform this transfer at a fine-grained level, i.e., sentence-level intent detection and token-level slot filling, is also difficult. 2) Zero-shot Setting: When there is no training data in the target domain, how to transfer knowledge from source-domain data to the target domain is a challenge.

5. Cross-Lingual SLU

Cross-lingual SLU means that an SLU system trained on English can be directly applied to other low-resource languages; it has attracted increasing attention.

Main challenges: 1) Language Knowledge Transfer: It is not trivial to transfer knowledge from the source language to target languages; in addition, how to perform this transfer at a fine-grained level, i.e., sentence-level intent detection and token-level slot filling, is also difficult. 2) Zero-shot Setting: When there is no training data in the target language, how to transfer knowledge from source-language data to the target language is a challenge.

6. Low-resource SLU

Significant progress in SLU has relied heavily on large amounts of labeled training data, which breaks down in low-resource settings where little or no data is available. We discuss trends and progress in low-resource SLU, including few-shot SLU, zero-shot SLU, and unsupervised SLU.

  • Few-shot SLU:
    In some cases a slot or intent has only a handful of instances, which leaves traditional supervised SLU models ineffective. Few-shot SLU is attractive here because it can quickly adapt to new applications with only a few examples.
    Few-shot slot tagging with collapsed dependency transfer and label-enhanced task-adaptive projection network: A few-shot CRF model with a collapsed dependency transfer mechanism is proposed for few-shot slot tagging.
    Few-shot learning for multi-label intent detection: Begins exploring few-shot multi-intent detection.
  • Zero-shot SLU:
    With rapidly changing applications, a brand-new application may have no target training data at all. Many zero-shot methods address this by discovering commonalities between slots (a sketch of description-based zero-shot slot filling is given after this list).
    Towards zero-shot frame semantic parsing for domain scaling: A method using slot descriptions is proposed; the descriptions carry slot information, allowing concepts to be acquired and transferred across applications and enhancing the zero-shot ability of the model.
    Coach: A coarse-to-fine approach for cross-domain slot filling: A similar architecture is used to train the model's awareness of slot descriptions.
    Robust zero-shot cross-domain slot filling with example values: The problem of confusing overlapping slot patterns is addressed by adding slot example values and descriptions during training.

  • Unsupervised SLU:
    In recent years, unsupervised methods have been proposed to automatically extract slot-value pairs, a promising direction for freeing models from heavy manual annotation.
    Dialogue state induction using neural latent variable models: A new task, dialogue state induction, is proposed for automatically recognizing dialogue-state slot-value pairs.
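As referenced in the Zero-shot SLU item above, here is a rough sketch of description-based zero-shot slot filling: token representations are scored against encoded slot descriptions, so a new slot only needs a description vector rather than a retrained output layer. The encoders, projections, and similarity function are our own assumptions, and BIO handling is omitted for brevity.

```python
import torch
import torch.nn as nn

class DescriptionSlotFiller(nn.Module):
    """Score each token against encoded slot descriptions instead of a fixed label layer."""
    def __init__(self, hidden=256):
        super().__init__()
        self.token_proj = nn.Linear(hidden, hidden)
        self.desc_proj = nn.Linear(hidden, hidden)

    def forward(self, token_reprs, slot_desc_reprs):
        # token_reprs: (batch, seq_len, hidden) from any utterance encoder
        # slot_desc_reprs: (num_slots, hidden), one vector per slot description
        t = self.token_proj(token_reprs)                       # (B, T, H)
        d = self.desc_proj(slot_desc_reprs)                    # (S, H)
        scores = torch.einsum("bth,sh->bts", t, d)             # similarity per token/slot
        return scores.argmax(dim=-1)                           # predicted slot id per token

# New slots only require a new description vector; no output layer is retrained,
# which is what gives the approach its zero-shot capability.
```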

Main challenges: 1) Interaction in the Low-resource Setting: How to make full use of the connection between intents and slots in low-resource settings is still an open question. 2) Lack of Benchmarks: There is a lack of public benchmarks for low-resource settings, which may hinder progress.

Original post: blog.csdn.net/Bat_Reality/article/details/128698162