Task 4: Contextual Word Embeddings

Contextual Word Embeddings

What we most want to learn, summarized as follows:


  • Transformers
  • BERT
  • Question Answering (QA)
  • Text generation and summarization

Pre-trained word vectors: the early years – Collobert, Weston et al. 2011 results


Pre-trained word vectors: the current era (2014–)


  • We can start with random word vectors and train them on the task we are interested in.
  • But in many cases, using pre-trained word vectors helps, because they are trained on far more data and so cover far more words (a minimal sketch of the two initializations follows this list).
  • Chen and Manning (2014) dependency parsing:
    • Random initialization: uniform(-0.01, 0.01) vs. pre-training:
      • PTB (C&W): +0.7%
      • CTB (word2vec): +1.7%
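Below is a minimal PyTorch sketch of the two initialization options compared above; the sizes and `pretrained_matrix` are hypothetical placeholders (in practice the matrix would be loaded from a GloVe or word2vec file).

```python
import numpy as np
import torch
import torch.nn as nn

vocab_size, dim = 10000, 100

# Option 1: random initialization, uniform(-0.01, 0.01)
random_emb = nn.Embedding(vocab_size, dim)
nn.init.uniform_(random_emb.weight, -0.01, 0.01)

# Option 2: start from pre-trained vectors and keep fine-tuning them on the task
pretrained_matrix = np.random.randn(vocab_size, dim).astype("float32")  # placeholder values
pretrained_emb = nn.Embedding.from_pretrained(torch.from_numpy(pretrained_matrix), freeze=False)
```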

Tips for unknown words with word vectors


  • The simplest and most common solution:

  • Training time: the vocabulary is {words occurring, say, ≥ 5 times} ∪ {<UNK>}

  • Map all rarer words (< 5 occurrences) to <UNK>, and train a word vector for it

  • Run time: when an out-of-vocabulary (OOV) word appears, use <UNK>

  • Problems:

    • There is no way to distinguish different UNK words, either by their identity or their meaning
  • Solutions:

  1. Use the char-level models we just learned about to build vectors!
    • Especially in applications such as question answering
      • Where matching on word identity is important (even for words outside the word vector vocabulary)

  2. Try these tips (from Dhingra, Liu, Salakhutdinov and Cohen 2017):
    a. At test time, if the <UNK> word appears in your unsupervised word embeddings, use that vector.
    b. Otherwise, just assign it a random vector and add it to your vocabulary.

  • a. definitely helps a lot; b. may help a little
  • Another thing you can try (a minimal sketch follows this list):
    • Collapse things into word classes (e.g., unknown numbers, capitalized things, etc.), each with its own <UNK-class>
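A minimal plain-Python sketch of the basic recipe and of tips (a)/(b) above; the function names and the 100-dimensional random vector are hypothetical choices.

```python
from collections import Counter
import numpy as np

def build_vocab(train_tokens, min_count=5):
    # Training time: vocab = {words occurring >= min_count times} ∪ {<UNK>}
    counts = Counter(train_tokens)
    return {w for w, c in counts.items() if c >= min_count} | {"<UNK>"}

def embed_key(word, vocab, word_vectors, rng):
    """Decide which vector key to use for `word` at test time.

    word_vectors: dict of unsupervised pre-trained vectors; rng: e.g. np.random.default_rng().
    """
    if word in vocab:
        return word
    if word in word_vectors:                               # tip a: reuse the unsupervised embedding
        vocab.add(word)
        return word
    word_vectors[word] = 0.01 * rng.standard_normal(100)   # tip b: assign a random vector
    vocab.add(word)
    return word
```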

1. Representations of a word


  • Up to now, we have basically had one representation of words:

    • The word vectors learned at the start of the course
      • Word2vec, GloVe, fastText
  • These have two problems:

    • They always give the same representation for a word type, regardless of the context in which the word token appears
      • We might want fine-grained word sense disambiguation
    • We have just one representation for a word, but words have different aspects, including semantics, syntactic behavior, and register/connotation

Have we always had a solution to this problem?


  • In a neural language model (NLM), we immediately push the word vectors through LSTM layers (perhaps trained only on our corpus)
  • Those LSTM layers are trained to predict the next word
  • But these language models produce context-specific word representations at every position! (See the sketch below.)
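A minimal PyTorch sketch (with made-up sizes) of the point above: an LSTM LM trained to predict the next word yields a different hidden vector at each token position, i.e., a context-specific representation.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 10000, 100, 256

embed = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
next_word = nn.Linear(hidden_dim, vocab_size)     # the LM objective: predict the next word

token_ids = torch.randint(0, vocab_size, (1, 8))  # one toy sentence of 8 tokens
hidden_states, _ = lstm(embed(token_ids))         # shape (1, 8, hidden_dim)
logits = next_word(hidden_states)
# hidden_states[:, t] is a context-specific representation of the t-th token in this sentence.
```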

2. Peters et al. (2017): TagLM – "Pre-ELMo"


https://arxiv.org/pdf/1705.00108.pdf

  • Idea: we want the meaning of a word in context, but we usually only train a task RNN on small amounts of task-labeled data (e.g., for NER)
  • Why not do semi-supervised learning and train an NLM (not just word vectors) on a large unlabeled corpus, then use its representations in the task model? (A minimal sketch follows.)
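A minimal sketch of that semi-supervised idea under assumed shapes (not the exact TagLM architecture): the pre-trained LM's hidden states are frozen and concatenated with the task model's own word embeddings inside an NER tagger.

```python
import torch
import torch.nn as nn

emb_dim, lm_dim, hidden_dim, num_tags = 100, 512, 128, 9

word_emb = nn.Embedding(10000, emb_dim)  # trained with the small labeled task data
task_bilstm = nn.LSTM(emb_dim + lm_dim, hidden_dim, bidirectional=True, batch_first=True)
tag_scores = nn.Linear(2 * hidden_dim, num_tags)

def tag(token_ids, lm_states):
    """lm_states: (batch, seq, lm_dim) hidden states from an NLM pre-trained on unlabeled text."""
    x = torch.cat([word_emb(token_ids), lm_states.detach()], dim=-1)  # keep the LM features frozen
    h, _ = task_bilstm(x)
    return tag_scores(h)
```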

Tag LM


Named Entity Recognition (NER)


A very important subtask of NLP: For example, to find and classify names in text:

  • Independent MP Andrew Wilkie's decision to withdraw his support for the minority Labor government sounded dramatic, but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.

Peters et al. (2017): TagLM – "Pre-ELMo"


The language model was trained on 800 million training words of the "Billion Word Benchmark"
Language model observations

  • An LM trained only on the supervised (task) data does not help
  • Using a bidirectional LM rather than a forward-only LM helps by about 0.2
  • A huge LM design (perplexity 30) helps by about 0.3 over a smaller model (perplexity 48)
    Task-specific BiLSTM observations
  • Using only the LM embeddings to predict is not great: 88.17 F1
  • Well below just using a BiLSTM tagger on the labeled data

Also in the air: McCann et al. (2017)


https://arxiv.org/pdf/1708.00107.pdf

  • Also had the idea of using a trained sequence model to provide context to other NLP models
  • Idea: machine translation is meant to preserve meaning, so maybe that's a good objective?
  • Use the 2-layer biLSTM that is the encoder of a seq2seq + attention NMT system as the context provider
  • The resulting CoVe vectors do outperform GloVe vectors on various tasks
  • However, the results are not as strong as the simpler NLM training described on the other slides, so the approach seems to have been abandoned
  • Maybe NMT is more difficult than language modeling?
  • Maybe one day this idea will come back?

Peters et al. (2018): ELMo: Embeddings from Language Models


Deep contextualized word representations. NAACL 2018. https://arxiv.org/abs/1802.05365

  • A breakthrough version of word token vectors, i.e., contextual word vectors
  • Learn word token vectors using long contexts, not context windows (here, the whole sentence, which could be longer)
  • Learn a deep bidirectional NLM (biLM) and use all of its layers in prediction

Peters et al. (2018): ELMo: Embeddings from Language Models


  • Train a bidirectional LM
  • Aim for a performant LM, but not one that is too large:
  • Use 2 biLSTM layers
  • Use a character CNN to build the initial word representations (only)
    • 2048 char n-gram filters and 2 highway layers, 512-dim projection
  • Use 4096-dim hidden/cell LSTM states with a 512-dim
    projection to the next input
  • Use residual connections
  • Tie the parameters of the token input and output (softmax), and tie them between the forward and backward LMs

Peters et al. (2018): ELMo: Embeddings from Language Models


  • ELMo learns a task-specific combination of the biLM layer representations
  • This is an innovation over just using the top layer of the LSTM stack:

    ELMo_k^task = γ^task · Σ_j s_j^task · h_{k,j}^LM

  • γ^task scales the overall usefulness of ELMo to the task
  • s^task are the softmax-normalized mixture model weights (a minimal sketch follows)
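A minimal PyTorch sketch of this mixture; the layer count and shapes are assumptions (e.g., the char-CNN layer plus 2 biLSTM layers), with softmax-normalized weights s over the frozen biLM layers and the task-specific scale γ.

```python
import torch
import torch.nn as nn

class ELMoMixture(nn.Module):
    def __init__(self, num_layers=3):                    # assumed: char-CNN layer + 2 biLSTM layers
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))   # mixture logits -> softmax-normalized weights
        self.gamma = nn.Parameter(torch.ones(1))         # gamma^task: overall usefulness to the task

    def forward(self, layer_states):
        # layer_states: (num_layers, batch, seq, dim), frozen representations from the biLM
        weights = torch.softmax(self.s, dim=0)
        mixed = (weights.view(-1, 1, 1, 1) * layer_states).sum(dim=0)
        return self.gamma * mixed
```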

Peters et al. (2018): ELMo: Use with tasks


  • First run the biLM to get the representations of each word
  • Then let the (whatever) end-task model use them
  • Freeze the ELMo weights for the supervised model
  • Concatenate the ELMo representations into the task-specific model
    • Details depend on the task
  • Concatenating into the intermediate layer, as in TagLM, is typical
  • Can also provide the ELMo representations again when producing outputs, for example in a question answering system

ELMo used in a sequence tagger


CoNLL 2003 named entity recognition (en news testb)


ELMo results: applicable to all tasks


ELMo: layer weights


  • The two biLSTM NLM layers have differentiated uses/meanings
    • The lower layer is better for lower-level syntax, etc.
      • Part-of-speech tagging, syntactic dependencies, NER
    • The higher layer is better for higher-level semantics
      • Sentiment, semantic role labeling, question answering, SNLI
  • This seems interesting, but it would be even more interesting to see
    how it plays out with networks of more than two layers

Also around: ULMfit


Howard and Ruder (2018) Universal Language Model Fine-tuning for Text Classification. https://arxiv.org/pdf/1801.06146.pdf

  • The same general idea of transferring NLM knowledge
  • Here applied to text classification

ULMfit


1. Train the LM on a big general-domain corpus (using a biLM)
2. Fine-tune the LM on the target task data
3. Fine-tune it as a classifier for the target task

ULMfit highlights


  • Use a reasonably sized "1 GPU" language model, not a really huge one
  • Take a lot of care in LM fine-tuning:
    • Different per-layer learning rates
    • A slanted triangular learning rate (STLR) schedule
  • When learning the classifier, gradually unfreeze layers and keep using STLR
  • Classify using the concatenation [h_T, maxpool(h), meanpool(h)] (a sketch of the STLR schedule follows)
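A minimal sketch of the slanted triangular learning rate, following the formula in Howard and Ruder (2018); the defaults below are the paper's suggested values, but treat them as assumptions for your own setup.

```python
def stlr(step, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular LR at `step` (1-indexed): short linear warm-up, long linear decay."""
    cut = int(total_steps * cut_frac)
    if step < cut:
        p = step / cut                                       # warm-up phase
    else:
        p = 1 - (step - cut) / (cut * (1 / cut_frac - 1))    # decay phase
    return lr_max * (1 + p * (ratio - 1)) / ratio

# Example: the peak learning rate is reached 10% of the way through training, then decays.
rates = [stlr(t, total_steps=1000) for t in range(1, 1001)]
```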

ULMfit performance

  • Text classifier error rate

ULMfit transfer learning


Let us scale up!


GPT-2 language model (selected) output


System prompt (human-written); model completion (machine-written, 10 tries).
In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
The scientist named the population, after their distinctive horn, Ovid's Unicorn. These four-horned, silver-white unicorns were previously unknown to science.
Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.
Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.
Pérez and the others then ventured further into the valley. ...

Transformer model


All of these models are Transformer-architecture models... so maybe we had better understand Transformers?

4. The power of the transformer


  • We want parallelization, but RNNs are inherently sequential

  • Despite GRUs and LSTMs, RNNs still need an attention mechanism to deal with long-range dependencies – the path length between states grows with the sequence
  • But if attention gives us access to any state... maybe we don't need the RNN and can just use attention?

Transformer overview


Attention Is All You Need. 2017. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin. https://arxiv.org/pdf/1706.03762.pdf

  • A non-recurrent sequence-to-sequence encoder-decoder model
  • Task: machine translation with a parallel corpus
  • Predict each translated word
  • The final cost/error function is the standard cross-entropy error on top of a softmax classifier

Transformer basics


  • Want to learn about transformers on your own?
  • Key recommended resource:
    • The Annotated Transformer by Sasha Rush: http://nlp.seas.harvard.edu/2018/04/03/attention.html
    • A Jupyter notebook using PyTorch that explains everything!
  • Now: let's define the basic building blocks of
    transformer networks: first, a new attention layer!

Dot-product attention (extending our previous definition)


  • Input: a query q and a set of key-value (k-v) pairs, mapped to an output

  • The query, keys, values, and output are all vectors
  • The output is a weighted sum of the values, where
  • The weight of each value is computed by an inner product of the query and the corresponding key
  • Queries and keys have the same dimensionality d_k; values have dimensionality d_v

Dot-product attention – matrix notation


  • When we have multiple queries, we stack them in a matrix Q, and the attention becomes:

    A(Q, K, V) = softmax(Q K^T) V

  • The softmax is applied row-wise

Scaled dot-product attention


  • Problem: as d_k gets large, the variance of q^T k grows, some values inside the softmax get large, the softmax becomes very peaked, and hence its gradient gets smaller
  • Solution: scale by the square root of the query/key dimensionality (a minimal sketch follows):

    A(Q, K, V) = softmax(Q K^T / √d_k) V
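A minimal PyTorch sketch of scaled dot-product attention as defined above; the shapes are made up, and the optional boolean mask argument anticipates the decoder use described later.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)      # (..., n_queries, n_keys)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))  # block disallowed positions
    weights = torch.softmax(scores, dim=-1)                # softmax applied row-wise
    return weights @ V

# Self-attention: the stack of word vectors serves as Q, K and V all at once.
X = torch.randn(2, 10, 64)                                 # (batch, seq_len, d_model)
out = scaled_dot_product_attention(X, X, X)
```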

Self-attention in the encoder


  • The input word vectors are the queries, keys and values
  • In other words: the word vectors themselves select each other
  • The word vector stack = Q = K = V
  • We will see in the decoder why we separate them in the definition

Multi-headed attention


  • Problem with simple self-attention:
    • It is the only way for words to interact with one another
  • Solution: multi-head attention
  • First map Q, K, V into h = 8 many lower-dimensional spaces
    via W matrices
  • Then apply attention, then concatenate the outputs and pass them through a fully connected layer (a minimal sketch follows)
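A minimal, self-contained sketch of multi-head attention under assumed sizes (d_model = 512, h = 8 heads): project Q, K, V into h lower-dimensional spaces, attend in each, concatenate the heads, and pass the result through a final linear layer.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_head = h, d_model // h
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # fully connected layer after concatenation

    def _split(self, x):
        b, n, _ = x.shape
        return x.view(b, n, self.h, self.d_head).transpose(1, 2)      # (b, h, n, d_head)

    def forward(self, Q, K, V):
        q, k, v = self._split(self.W_q(Q)), self._split(self.W_k(K)), self._split(self.W_v(V))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)     # scaled attention per head
        heads = torch.softmax(scores, dim=-1) @ v                     # (b, h, n, d_head)
        b, _, n, _ = heads.shape
        concat = heads.transpose(1, 2).contiguous().view(b, n, self.h * self.d_head)
        return self.W_o(concat)
```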

Complete transformer block


Each block has two "sublayers"

  • Multi-headed attention
  • 2-layer feedforward NNet (with ReLU)

Each of these two steps also has:
a residual (short-circuit) connection and LayerNorm:
LayerNorm(x + Sublayer(x))
LayerNorm changes the input to have mean 0 and variance 1,
per layer and per training point (and adds two more learnable parameters)
A minimal sketch of a full block appears after the reference below.

Layer normalization by Ba, Kiros and Hinton, https://arxiv.org/pdf/1607.06450.pdf
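A minimal sketch of one such block under assumed sizes, using PyTorch's built-in nn.MultiheadAttention for the first sublayer; each sublayer is wrapped as LayerNorm(x + Sublayer(x)), with dropout applied before the residual addition.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, h=8, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, h, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)       # sublayer 1: multi-head self-attention
        x = self.norm1(x + self.drop(attn_out))     # residual (short-circuit) connection + LayerNorm
        x = self.norm2(x + self.drop(self.ff(x)))   # sublayer 2: 2-layer feed-forward net with ReLU
        return x
```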

Encoder input


  • The actual word representations are byte-pair encodings
    • As in the last lecture
  • A positional encoding is also added, so the same word at different positions has a different overall representation (a minimal sketch follows)
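A minimal sketch of the sinusoidal positional encoding from the Transformer paper, added to the token embeddings so that the same word at different positions gets a different input representation (the shapes below are made up).

```python
import math
import torch

def positional_encoding(max_len, d_model):
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)            # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

token_embeddings = torch.randn(1, 20, 512)            # (batch, seq_len, d_model), placeholder
x = token_embeddings + positional_encoding(20, 512)   # position now affects the representation
```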

Complete encoder


  • For the encoder, in each block, we use the same Q, K and V
    from the previous layer
  • The block is repeated 6 times
    • (in a vertical stack)

Attention visualization in layer 5


  • Words start to pay attention to other words in sensible ways

Attention visualization: implicit anaphora resolution


In layer 5: attentions isolated for just the word "it", for attention heads 5 and 6. Note that the attentions for this word are very sharp.

Transformer decoder


  • There are two sub-layer changes in the decoder
  • Masked decoder self-attention over the previously generated outputs:

  • Encoder-decoder attention, where the queries come from the previous decoder layer and the keys and values come from the encoder's outputs

The blocks are also repeated 6 times (a minimal sketch of the masking follows)
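A minimal sketch of the masking used in the decoder's self-attention: position i may only attend to positions ≤ i, so the model cannot look at outputs it has not generated yet (the shapes are made up).

```python
import torch

def causal_mask(seq_len):
    # True where attention is allowed, False where it must be blocked
    return torch.tril(torch.ones(seq_len, seq_len)).bool()

seq_len = 5
scores = torch.randn(seq_len, seq_len)                            # unnormalized attention scores
scores = scores.masked_fill(~causal_mask(seq_len), float("-inf"))
weights = torch.softmax(scores, dim=-1)                           # each row sums to 1 over allowed keys
```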

Tips and tricks for transformers


Detailed information (written and/or future lectures):

  • Byte-pair encoding
  • Checkpoint averaging
  • ADAM optimizer with a varying learning rate (a minimal sketch of the schedule follows this list)
  • Dropout during training at every layer, just before adding residuals
  • Label smoothing
  • Auto-regressive decoding with beam search and length
    penalties
  • The use of transformers is spreading, but they are hard to optimize: unlike LSTMs, they usually don't work straight out of the box,
    and they don't yet play well with other building blocks on tasks.
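A minimal sketch of the learning-rate schedule used with Adam in the Transformer paper: a linear warm-up followed by inverse-square-root decay (warmup_steps = 4000 is the paper's value; treat the usage comment as an assumption).

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Usage idea (PyTorch): set the optimizer's base lr to 1.0 and scale it with LambdaLR:
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda s: transformer_lr(s + 1))
```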

MT experimental results


Analyze experimental results


5. BERT: Devlin, Chang, Lee, Toutanova (2018)


BERT (Bidirectional Encoder Representations from Transformers):
Pre-training of Deep Bidirectional Transformers for Language Understanding

Based on a slide by Jacob Devlin

  • Problem: The language model only uses left or right context, but language understanding is bidirectional.
  • Why are LMs unidirectional?
  • Reason 1: Directionality is needed to generate a
    well-formed probability distribution.
    • We don't care about this.
  • Reason 2: Words could "see themselves" in a bidirectional encoder.

  • Solution: mask out k% of the input words, and then predict the masked words (a minimal sketch follows below)
  • They always use k = 15%

the man went to the [MASK] to buy a [MASK] of milk

  • Too little masking: too expensive to train
  • Too much masking: not enough context
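A minimal sketch of the masked-LM objective described above: hide roughly k = 15% of the input tokens and compute the loss only at those positions (the token and mask ids are hypothetical).

```python
import torch

def mask_tokens(token_ids, mask_id, k=0.15):
    labels = token_ids.clone()
    chosen = torch.rand(token_ids.shape) < k    # pick ~15% of positions to predict
    labels[~chosen] = -100                      # -100 is ignored by nn.CrossEntropyLoss
    inputs = token_ids.clone()
    inputs[chosen] = mask_id                    # replace the chosen tokens with [MASK]
    return inputs, labels

token_ids = torch.randint(5, 1000, (1, 12))     # a toy sentence of 12 token ids
inputs, labels = mask_tokens(token_ids, mask_id=4)
# loss = nn.CrossEntropyLoss(ignore_index=-100)(logits.view(-1, vocab_size), labels.view(-1))
```

(The actual BERT recipe additionally replaces some of the chosen tokens with random words or leaves them unchanged, which this sketch omits.)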

BERT complication: next sentence prediction


  • To learn relationships between sentences, predict whether sentence B is the actual sentence that follows sentence A, or a random sentence

BERT sentence pair encoding


Token embeddings are word-piece embeddings.
Learned segment embeddings indicate which sentence each token belongs to.
Positional embeddings are the same as in the other Transformer architectures.
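A minimal sketch of the sum of the three embeddings just described, with assumed sizes; the segment ids distinguish sentence A from sentence B.

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 30000, 512, 768

token_emb = nn.Embedding(vocab_size, d_model)     # word-piece token embeddings
segment_emb = nn.Embedding(2, d_model)            # segment (sentence A vs. B) embeddings
position_emb = nn.Embedding(max_len, d_model)     # position embeddings

token_ids = torch.randint(0, vocab_size, (1, 10))
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1, 1, 1]])   # first 4 tokens: A, rest: B
positions = torch.arange(10).unsqueeze(0)

x = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)
```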

BERT model structure and training


  • Transformer encoder (same as before)
  • Self-attention ⇒ no locality bias
  • Long-distance context has an "equal opportunity"
  • A single multiplication per layer ⇒ efficiency on GPU/TPU
  • Train on Wikipedia + BookCorpus
  • Train 2 model sizes:
  • BERT-Base: 12 layers, 768 hidden dimensions, 12 attention heads
  • BERT-Large: 24 layers, 1024 hidden dimensions, 16 attention heads
  • Trained on 4x4 or 8x8 TPU slices for 4 days

BERT model fine-tuning


  • Fine-tune for each task; simply learn a classifier built on top (a minimal sketch follows)
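A minimal sketch of that fine-tuning setup: a pre-trained encoder (the `encoder` module and its call signature are assumptions) plus a freshly initialized classifier on top of the [CLS] position, all trained end to end on the task data.

```python
import torch.nn as nn

class SentenceClassifier(nn.Module):
    def __init__(self, encoder, d_model=768, num_labels=2):
        super().__init__()
        self.encoder = encoder                            # pre-trained BERT-style encoder
        self.classifier = nn.Linear(d_model, num_labels)  # the only newly learned layer

    def forward(self, token_ids, segment_ids):
        hidden = self.encoder(token_ids, segment_ids)     # (batch, seq, d_model), assumed interface
        cls_vector = hidden[:, 0]                         # representation at the [CLS] position
        return self.classifier(cls_vector)
```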

BERT's results on the GLUE task


  • The GLUE benchmark test is dominated by natural language inference tasks, but also has sentence similarity and sentiment
  • MultiNLI
  • Premise: Hills and mountains are especially sanctified in Jainism.
    Hypothesis: Jainism hates nature.
    Label: Contradiction
  • CoLA
  • Sentence: The wagon rumbled down the road. Label: Acceptable
  • Sentence: The car honked down the road. Label: Unacceptable

CoNLL 2003 named entity recognition (en news testb)


BERT results on SQuAD 1.1


SQuAD 2.0 ranking, 2019-02-07


Effect of pre-training tasks


Size matters


  • Going from 110M to 340M parameters helps a lot
  • The improvements have not yet plateaued



