2019 CS224N Lecture 1: Introduction and Word Vectors

Course Introduction

What this course mainly teaches

  • An understanding of the effective modern methods for deep learning
  • A big picture understanding of human languages and the difficulties in understanding and producing them
  • An understanding of and ability to build systems (in PyTorch) for some of the major problems in NLP:
    • Word meaning, dependency parsing, machine translation, question answering

What's different in the 2019 Winter offering compared to previous years

  • New material: character models, transformers, safety/fairness, multitask learning
  • Assignments: five one-week assignments instead of three two-week assignments (worth 6% + 4 × 12%)
  • Assignments cover new material: NMT with attention, ConvNets, subword modeling
  • PyTorch is used instead of TensorFlow

Overview of the five assignments

  • HW1 is hopefully an easy on-ramp – an IPython Notebook
  • HW2 is pure Python (numpy) but expects you to do (multivariate) calculus so you really understand the basics
  • HW3 introduces PyTorch
  • HW4 and HW5 use PyTorch on a GPU (Microsoft Azure)
    • Libraries like PyTorch, TensorFlow (and Chainer, MXNet, CNTK, Keras, etc.) are becoming the standard tools of DL
  • For FP, you either
    • Do the default project, which is SQuAD question answering
      • Open-ended but an easier start; a good choice for most
    • Propose a custom final project, which we approve
      • You will receive feedback from a mentor (TA/prof/postdoc/PhD)
    • Can work in teams of 1–3; can use any language

Lecture 1

How to represent the meaning of a word

  • A thesaurus listing synonym sets and hypernyms (i.e., "is a" relationships)
    • WordNet
  • one-hot vector
  • word2vec
    • Skip-Gram model
    • CBOW

WordNet

  • A good resource, but it misses nuance (a small lookup sketch follows this list)
  • Misses new meanings of words; impossible to keep up to date
  • Subjective
  • Requires human labor to create and adapt
  • Can't be used to compute accurate word similarity
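For reference, WordNet can be queried through NLTK. A minimal lookup sketch, assuming nltk is installed and its wordnet corpus has been downloaded:

```python
# Minimal WordNet lookup via NLTK
# (assumes: pip install nltk, then nltk.download('wordnet') once)
from nltk.corpus import wordnet as wn

# Synonym sets for "good", grouped by part of speech
for synset in wn.synsets("good"):
    print(synset.pos(), [lemma.name() for lemma in synset.lemmas()])

# Hypernym ("is a") chain for "panda"
panda = wn.synset("panda.n.01")
print(list(panda.closure(lambda s: s.hypernyms())))
```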

one-hot

  • Any two one-hot vectors are orthogonal, so they encode no similarity (see the sketch after this list)
  • Attempted fix: rely on WordNet's synonym sets to get similarity, but this fails in practice, due to incompleteness and similar issues
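As a quick illustration of the orthogonality problem, here is a toy sketch with a made-up five-word vocabulary (words and vocabulary are purely for illustration):

```python
import numpy as np

vocab = ["hotel", "motel", "cat", "dog", "the"]   # toy vocabulary
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word in the toy vocabulary."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

# "hotel" and "motel" are closely related words, but their one-hot
# vectors are orthogonal: the dot product is 0, so no similarity signal.
print(one_hot("hotel") @ one_hot("motel"))   # 0.0
```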

Use the context to represent a word

  • Distributional semantics: A word’s meaning is given by the words that frequently appear close-by
  • When a word w appears in a text, its context is the set of words that appear around it within a fixed-size window
  • Use the many contexts of w to build up a representation of w (see the sketch below)
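As a concrete picture of "the contexts of w", here is a toy sketch that extracts fixed-size windows from an example sentence (the sentence and window size are just for illustration):

```python
def context_windows(tokens, window=2):
    """Yield (center_word, context_words) pairs using a fixed-size window."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

sentence = "government debt problems turning into banking crises".split()
for center, context in context_windows(sentence, window=2):
    print(center, "->", context)
```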

Word2vec (Mikolovet al. 2013)

  • o denotes a context word, c denotes the center word
  • Use the word vectors of c and o to compute $p(o \mid c)$ or $p(c \mid o)$
  • Keep adjusting the word vectors to maximize this probability
  • The objective: maximize the probability of correctly predicting the context words, which is equivalent to minimizing the cost function below
    Likelihood: $L(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\ j \ne 0} p(w_{t+j} \mid w_t; \theta)$
    Cost: $J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\ j \ne 0} \log p(w_{t+j} \mid w_t; \theta)$
  • Compute $p(w_{t+j} \mid w_t; \theta)$
    • Use two vectors to represent each word
      • $v_w$: used when w is the center word
      • $u_w$: used when w is a context word
    • Compute $p(o \mid c) = \dfrac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$
    • This is an instance of the softmax function, $\mathrm{softmax}(x)_i = \dfrac{\exp(x_i)}{\sum_j \exp(x_j)}$, which maps arbitrary scores to a probability distribution: "max" because it amplifies the probability of the largest $x_i$, "soft" because it still assigns some probability to smaller values (a small numpy sketch follows this list)
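A minimal numpy sketch of this computation, using random toy vectors (here `U` holds the context vectors $u_w$ as rows and `v_c` is the center word's vector; the sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                      # toy vocabulary size and embedding dimension
U = rng.normal(size=(V, d))       # context ("outside") vectors u_w, one row per word
v_c = rng.normal(size=d)          # center vector v_c for the current center word

scores = U @ v_c                  # u_w^T v_c for every word w in the vocabulary
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over scores -> p(w | c)

o = 3                             # index of an observed context word
print(probs[o])                   # p(o | c)
```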

Training the model

  • $\theta$ represents all the model parameters, i.e. all of the word vectors (two vectors per word); with a vocabulary of $V$ words and $d$-dimensional vectors, $\theta \in \mathbb{R}^{2dV}$

How to compute the gradients of the word vectors

  • A basic fact used below: $\frac{\partial}{\partial v}\, u^\top v = u$
  • The chain rule

Gradient of the cost for one window with respect to the center vector $v_c$

$$\frac{\partial}{\partial v_c} \log p(o \mid c) = \frac{\partial}{\partial v_c} \log \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)} = \frac{\partial}{\partial v_c}\, u_o^\top v_c \;-\; \frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w^\top v_c)$$
The numerator part is easy to differentiate: log and exp cancel, leaving the form described above,
$$\frac{\partial}{\partial v_c}\, u_o^\top v_c = u_o$$
The denominator part needs the chain rule twice, with one intermediate step swapping the order of differentiation and summation:
$$\frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w^\top v_c) = \frac{\sum_{x=1}^{V} \exp(u_x^\top v_c)\, u_x}{\sum_{w=1}^{V} \exp(u_w^\top v_c)} = \sum_{x=1}^{V} p(x \mid c)\, u_x$$
The final result is
$$\frac{\partial}{\partial v_c} \log p(o \mid c) = u_o - \sum_{x=1}^{V} p(x \mid c)\, u_x$$
The gradient for the current center word is the word vector of the observed context word o minus the expectation of all context word vectors, i.e. their weighted average (probability × word vector).
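As a quick numerical sanity check of this result, here is a toy sketch (random made-up vectors) comparing the analytic gradient of $-\log p(o \mid c)$ with a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 8, 3
U = rng.normal(size=(V, d))       # context vectors u_w (one row per word)
v_c = rng.normal(size=d)          # center vector
o = 2                             # index of the observed context word

def neg_log_prob(v):
    """-log p(o | c) with the naive softmax over all words."""
    scores = U @ v
    probs = np.exp(scores) / np.exp(scores).sum()
    return -np.log(probs[o])

# Analytic gradient of -log p(o|c) w.r.t. v_c:  sum_x p(x|c) u_x  -  u_o
probs = np.exp(U @ v_c) / np.exp(U @ v_c).sum()
analytic = U.T @ probs - U[o]

# Central finite-difference estimate, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (neg_log_prob(v_c + eps * e) - neg_log_prob(v_c - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```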

  • How to compute all the gradients
    • Within one window, we need to compute the gradient for each center vector v in turn, and also the gradients for the context vectors u
    • So the parameters updated in one window are the center vector $v_c$ and every context vector $u_o$ that appears in the window

Why use two vectors per word?

It makes optimization easier; at the end, just average the two vectors for each word
Using a single vector per word also works

The two model variants

  • Skip-gram (SG): predict the context words given the center word
  • Continuous Bag of Words (CBOW): predict the center word given the context words

Optimization: gradient descent

Gradient descent is an algorithm to minimize $J(\theta)$
Idea: from the current value of $\theta$, take a small step in the direction of the negative gradient, and repeat:
$$\theta^{\text{new}} = \theta^{\text{old}} - \alpha \nabla_\theta J(\theta) \qquad (\alpha = \text{learning rate / step size})$$
Stochastic gradient descent (SGD): repeatedly sample a single window at random and update the word vectors using only that window
$$\theta^{\text{new}} = \theta^{\text{old}} - \alpha \nabla_\theta J_t(\theta)$$
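A toy, self-contained sketch of this update rule applied to the skip-gram loss (made-up (center, context) training pairs, naive full softmax, no negative sampling):

```python
import numpy as np

rng = np.random.default_rng(2)
V, d, alpha = 8, 3, 0.05                  # toy vocab size, dimension, learning rate
U = rng.normal(scale=0.1, size=(V, d))    # context vectors u_w
W = rng.normal(scale=0.1, size=(V, d))    # center vectors v_w
pairs = [(0, 1), (0, 2), (3, 4), (3, 5)]  # made-up (center, context) training pairs

for step in range(200):
    c, o = pairs[rng.integers(len(pairs))]          # sample one training example
    v_c = W[c]
    probs = np.exp(U @ v_c) / np.exp(U @ v_c).sum() # p(w | c) for all w

    grad_v = U.T @ probs - U[o]                     # d(-log p(o|c)) / d v_c
    grad_U = np.outer(probs, v_c)                   # d(-log p(o|c)) / d u_w, all rows
    grad_U[o] -= v_c

    W[c] -= alpha * grad_v                          # theta_new = theta_old - alpha * grad
    U -= alpha * grad_U

    if step % 50 == 0:
        print(step, -np.log(probs[o]))              # loss on this example should shrink
```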

Summary

This lecture covered how to represent words. The central idea is to represent a word by its contexts, which is exactly the idea behind word2vec. Word2vec has two model variants: skip-gram predicts the context words from the center word (this post follows skip-gram, but without negative sampling), while CBOW predicts the center word from the surrounding words.

The screenshots in this post are from the Stanford CS224N course; thanks to the course staff.

