Course Introduction
What this course mainly teaches
- An understanding of the effective modern methods for deep learning
- A big picture understanding of human languages and the difficulties in understanding and producing them
- An understanding of and ability to build systems (in PyTorch) for some of the major problems in NLP:
- Word meaning, dependency parsing, machine translation, question answering
What's different in the Winter 2019 offering compared to previous years
- New material: character models, transformers, safety/fairness, multitask learning
- Assignments: five one-week assignments instead of three two-week ones (worth 6% + 4 × 12%)
- The assignments cover new material: NMT with attention, ConvNets, subword modeling
- Uses PyTorch instead of TensorFlow
Overview of the five assignments
- HW1 is hopefully an easy on-ramp: an IPython Notebook
- HW2 is pure Python (numpy) but expects you to do (multivariate) calculus so you really understand the basics
- HW3 introduces PyTorch
- HW4 and HW5 use PyTorch on a GPU (Microsoft Azure)
- Libraries like PyTorch and TensorFlow (and Chainer, MXNet, CNTK, Keras, etc.) are becoming the standard tools of DL
- For the final project (FP), you either:
- Do the default project, which is SQuAD question answering
- Open-ended but an easier start; a good choice for most
- Propose a custom final project, which we approve
- You will receive feedback from a mentor (TA/prof/postdoc/PhD)
- Can work in teams of 1–3; can use any language
Lecture 1
How to represent the meaning of a word
- Use a thesaurus that records synonym sets and hypernyms (i.e., "is a" relationships)
- WordNet
- one-hot vector
- word2vec
- Skip-Gram model
- CBOW
WordNet
- A useful resource, but it misses nuance (synonyms are only interchangeable in some contexts)
- Misses new meanings of words; impossible to keep up to date
- Subjective
- Requires human labor to create and adapt
- Can't be used to compute accurate word similarity (see the NLTK sketch below)
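As a quick illustration, here is how one might browse WordNet's synonym sets and hypernym chains with NLTK. This is only a sketch; it assumes NLTK is installed locally, and the corpus download happens on first run.

```python
# A minimal sketch using NLTK's WordNet interface; assumes nltk is
# installed and downloads the wordnet corpus on first run.
import nltk
nltk.download('wordnet', quiet=True)
from nltk.corpus import wordnet as wn

# Synonym sets for "good", grouped by part of speech and sense
for synset in wn.synsets('good'):
    print(synset.name(), [lemma.name() for lemma in synset.lemmas()])

# Hypernym ("is a") chain for "panda"
panda = wn.synset('panda.n.01')
print(list(panda.closure(lambda s: s.hypernyms())))
```

The flat synonym lists printed here are exactly where the nuance problem shows up: many listed synonyms are only appropriate in some contexts.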
one-hot
- Any two one-hot vectors are orthogonal, so they encode no similarity between words (see the sketch below)
- Attempted solution: fall back on WordNet's synonym lists to get similarity, but this fails in practice, partly because the lists are incomplete
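A minimal sketch of the orthogonality problem, using a hypothetical three-word vocabulary:

```python
# One-hot vectors for a toy vocabulary: every pair of distinct words
# has dot product 0, so the encoding carries no similarity signal.
import numpy as np

vocab = ['motel', 'hotel', 'cat']   # hypothetical toy vocabulary
one_hot = np.eye(len(vocab))        # one row per word

motel, hotel = one_hot[0], one_hot[1]
print(motel @ hotel)                # 0.0: "motel" and "hotel" look unrelated
```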
Use the context to represent the word
- Distributional semantics: A word’s meaning is given by the words that frequently appear close-by
- When a word w appears in a text, its context is the set of words that appear nearby, within a fixed-size window
- Use the many contexts of w to build up a representation of w (made concrete in the sketch below)
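To make the window idea concrete, here is a toy sketch; the sentence and window size are made up for illustration:

```python
# Fixed-size window contexts for one word in a tiny corpus.
corpus = "the quick brown fox jumps over the lazy dog".split()
window = 2                          # context = 2 words on each side

for t, word in enumerate(corpus):
    if word == 'fox':
        left = corpus[max(0, t - window):t]
        right = corpus[t + 1:t + 1 + window]
        print(word, '->', left + right)   # ['quick', 'brown', 'jumps', 'over']
```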
Word2vec (Mikolov et al. 2013)
- o denotes the context ("outside") words, c the center word
- Use the similarity of the word vectors for c and o to compute the probability of o given c (or vice versa)
- Keep adjusting the word vectors to maximize this probability
- The objective is to maximize the likelihood of the observed context words, i.e. to minimize the cost function:

$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta), \qquad J(\theta) = -\frac{1}{T} \log L(\theta)$$
- Compute $P(o \mid c)$ using two vectors to represent each word w:
- $v_w$ when w is the center word
- $u_w$ when w is a context word
- For a center word c and a context word o:

$$P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}$$

- This is the softmax function, which turns arbitrary scores into a probability distribution: "max" because it amplifies the largest scores, "soft" because it still assigns some probability to smaller ones (a numeric sketch follows below)
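A numeric sketch of this softmax, with random toy embeddings; all sizes and indices here are made up:

```python
# P(o|c) as a softmax over dot products; U holds the context vectors
# u_w (one per row), V holds the center vectors v_w.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 4
U = rng.normal(size=(vocab_size, dim))
V = rng.normal(size=(vocab_size, dim))

c, o = 2, 3                                     # hypothetical word indices
scores = U @ V[c]                               # u_w . v_c for every word w
probs = np.exp(scores) / np.exp(scores).sum()   # softmax
print(probs[o], probs.sum())                    # P(o|c); probabilities sum to 1
```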
Training the model
- $\theta$ represents all the model parameters, i.e. all the word vectors (two vectors for each word), and training adjusts them to minimize the cost
How to compute the gradients of the word vectors
- Prerequisite: basic multivariate calculus
- The chain rule
For one context word in a window, the gradient of the objective with respect to the center vector $v_c$:
Differentiating the numerator is easy: the log and the exp cancel, leaving the same form as before.
The denominator needs the chain rule twice, with one intermediate step that swaps the order of differentiation and summation.
The final result is

$$\frac{\partial}{\partial v_c} \log P(o \mid c) = u_o - \sum_{x \in V} P(x \mid c)\, u_x$$

That is, the gradient for the current center word equals the word vector of the observed context word o minus the expectation, i.e. the probability-weighted average (probability × word vector), of all the context word vectors; this is checked numerically in the sketch below.
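The result can be verified numerically. Here is a small sketch under the same toy assumptions as before (random vectors, made-up sizes), comparing the analytic gradient against central finite differences:

```python
# Verify d/dv_c log P(o|c) = u_o - sum_x P(x|c) u_x on toy data.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 4
U = rng.normal(size=(vocab_size, dim))   # context vectors u_w
v_c = rng.normal(size=dim)               # center vector v_c
o = 3                                    # observed context word index

def log_prob(v):
    scores = U @ v
    return scores[o] - np.log(np.exp(scores).sum())

probs = np.exp(U @ v_c) / np.exp(U @ v_c).sum()
analytic = U[o] - probs @ U              # u_o minus expected context vector

eps = 1e-6                               # central finite differences
numeric = np.array([
    (log_prob(v_c + eps * e) - log_prob(v_c - eps * e)) / (2 * eps)
    for e in np.eye(dim)
])
print(np.allclose(analytic, numeric))    # True
```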
- How to compute all the gradients
- Within a window, we compute the gradient for each center vector v in turn, and we must also compute the gradients for the context vectors u
- In one window, that means gradients for the parameters $v_{w_t}$ and $u_{w_{t-m}}, \ldots, u_{w_{t+m}}$
Why two vectors per word?
Easier optimization; simply average the two at the end.
Using a single vector per word also works.
Two model variants
- Skip-gram (SG): predict the context words given the center word (sketched in PyTorch below)
- Continuous Bag of Words (CBOW): predict the center word given the context words
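A minimal skip-gram sketch in PyTorch with a full softmax and no negative sampling, matching the variant these notes derive; the toy sizes and indices are assumptions:

```python
# Skip-gram with a full-softmax loss: two embedding tables (v_w and
# u_w), scores are dot products, cross-entropy = -log softmax P(o|c).
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.center = nn.Embedding(vocab_size, dim)    # v_w
        self.context = nn.Embedding(vocab_size, dim)   # u_w

    def forward(self, center_ids):
        v_c = self.center(center_ids)          # (batch, dim)
        return v_c @ self.context.weight.T     # (batch, vocab) logits

model = SkipGram(vocab_size=10, dim=8)
loss_fn = nn.CrossEntropyLoss()                # -log P(o|c)

centers = torch.tensor([1, 4])                 # hypothetical center ids
contexts = torch.tensor([2, 0])                # observed context ids
loss = loss_fn(model(centers), contexts)
loss.backward()
print(loss.item())
```

CBOW would instead average the context embeddings and score them against every center vector.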
Optimization: gradient descent
Gradient descent is an algorithm to minimize the cost function $J(\theta)$ by repeatedly stepping in the direction of the negative gradient: $\theta^{\text{new}} = \theta^{\text{old}} - \alpha \nabla_\theta J(\theta)$.
Stochastic gradient descent (SGD): pick a single window at random and update the word vectors using only that window's gradient (see the sketch below).
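Continuing the skip-gram sketch above (this reuses `model`, `loss_fn`, `centers`, and `contexts` from there, treating that batch as one sampled window), a single SGD step might look like this:

```python
# One SGD update: compute the loss on a single sampled window, then
# step against the gradient: theta <- theta - alpha * grad J(theta).
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.05)   # alpha = 0.05
optimizer.zero_grad()
loss = loss_fn(model(centers), contexts)   # loss on one (toy) window
loss.backward()                            # gradients for this window only
optimizer.step()                           # parameter update
```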
Summary
This lecture covered how to represent words. The central idea is to represent a word by its contexts, and that is exactly the idea behind word2vec. Word2vec comes in two variants: skip-gram predicts the context words from the center word (the variant derived in these notes, without negative sampling), while CBOW predicts the center word from the surrounding context words.