Stanford CS224N - Word Vector Representations: Word2Vec

Word vector: a model of word meaning (vectors that encode the meanings of words)
Using a taxonomic resource to handle word meaning. WordNet: taxonomy information about words
from nltk.corpus import wordnet
Synonyms and polysemy; but these taxonomic descriptions miss many nuances and are incomplete. People use words more flexibly.
It is hard to give a precise definition of word similarity, and you can't capture many nuances in WordNet.
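A quick look at WordNet's synsets shows how one word maps to a fixed inventory of discrete senses (a minimal sketch, assuming NLTK and its WordNet corpus are installed):

```python
# Assumes: pip install nltk, then nltk.download('wordnet') has been run.
from nltk.corpus import wordnet

# Each synset is one sense of "good"; the fixed set of senses and synonyms
# illustrates how coarse and incomplete this kind of resource can be.
for synset in wordnet.synsets("good"):
    print(synset.name(), synset.lemma_names())
```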

We can build a very big vector with a 1 in the slot for each word: the "one-hot" representation (a localist representation). The dot product of any two different one-hot vectors is 0.
But this doesn't give any inherent notion of relationships between words. We need to encode similarity.
We want a representation where you can read off the similarity between words directly from the dot product!!!
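A tiny illustration (hypothetical three-word vocabulary) of why one-hot vectors carry no similarity information:

```python
import numpy as np

vocab = ["hotel", "motel", "cat"]
one_hot = np.eye(len(vocab))      # each row is one word's one-hot vector

# Every pair of distinct words has dot product 0, so "hotel" looks no more
# similar to "motel" than to "cat".
print(one_hot[0] @ one_hot[1])    # 0.0
print(one_hot[0] @ one_hot[2])    # 0.0
```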

 

 

Use distributional similarity:
You get a lot of value from looking at the contexts in which a word appears, and counting the words that co-occur with it.
“You shall know a word by the company it keeps”
The numbers in a word vector (a distributed representation, built from distributional similarity) should let it predict the other words in the text surrounding the target word (words can predict each other).
It feels like a probability distribution: a center word plus the surrounding context words.
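A minimal sketch of the distributional idea on a hypothetical toy corpus: characterize a word by counting the company it keeps.

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
target, window = "sat", 2

# Count the words appearing within `window` positions of each occurrence of `target`.
context_counts = Counter()
for i, w in enumerate(corpus):
    if w == target:
        context_counts.update(corpus[max(0, i - window):i] + corpus[i + 1:i + 1 + window])

print(context_counts)   # the words that keep company with "sat"
```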

What is word2vec?
A general recipe for learning neural word embeddings: predict between a center word and the words that appear in its context.
p(context | w_t) = the probability of predicting the words around the center word.
Loss function: J = 1 - p(w_{-t} | w_t), where w_t is the center word and w_{-t} are the other words surrounding it.
Change the representation of words to minimize our loss!!
Using deep learning: you set this goal and say nothing else; the rest depends on the learning, and the output vectors are powerful!

powerful word vectors!

  1. Having distributed representations of words
  2. Being able to predict other words in context!

How does that happen?
word2vec: predict between every word and its context words to learn word vectors.
Two algorithms:
1. Skip-grams (SG): for each estimation step, take one word as the center word, then try to predict the words in its context within some window size.
2. Continuous Bag of Words (CBOW): predict the center word from the surrounding context.
The model defines a probability distribution: given a center word, the probability that a word appears in its context. We choose the vector representations so as to maximize this probability.
For a given word, there is one and only one such probability distribution.

For each word t = 1, 2, ..., T, predict the surrounding words within a window of radius m.

  1. Get a big, long sequence of words to learn from (enough words in context).
  2. Go through each position in the text, with a window of size 2m around it.
  3. Define a probability distribution over the context words in that window.
  4. Set the parameters of our model (that is, the word representation vectors) to maximize this probability (see the sketch after this list).
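A minimal sketch of that procedure on a hypothetical toy corpus; the variable names are illustrative, not from the lecture code:

```python
corpus = "the quick brown fox jumps over the lazy dog".split()
m = 2   # window radius

# Walk over each position t, take the word there as the center word, and pair it
# with every word within the window of radius m.
pairs = []
for t, center in enumerate(corpus):
    for j in range(-m, m + 1):
        if j != 0 and 0 <= t + j < len(corpus):
            pairs.append((center, corpus[t + j]))

print(pairs[:6])

# The word vectors are then chosen to maximize the probability of each context word
# given its center word, i.e. to minimize
#   J(theta) = -(1/T) * sum_t sum_{-m <= j <= m, j != 0} log p(w_{t+j} | w_t)
```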

Lecture 2: around the 27-minute mark.

 


See iPad note.

One vector representation per word?
Two can be better: one vector v for the word as a center word, and one vector u for the word as a context word.
We don't pay attention to distance or position within the window.

 


For each position in the context, multiply the center vector by a matrix that stores the representations of the context words.
The dot product u_o^T v_c (v_c multiplied by each row u_o) gives a similarity score; there is only one resulting score vector, and the repeated copies in the diagram just show the same v_c being transformed once per context position.
Apply a softmax to predict the context word, i.e. the word most likely to appear within the window.
W is the weight matrix of all words as center words (the word embedding matrix); W' is the weight matrix of all words as context words.
A different prediction from the actual context word gives the loss.
W and W' (v and u):
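A minimal sketch of that softmax, with toy dimensions and random parameters standing in for the trained W and W' matrices:

```python
import numpy as np

vocab_size, dim = 5, 3
rng = np.random.default_rng(0)
V = rng.normal(size=(vocab_size, dim))   # center-word vectors v (rows of W)
U = rng.normal(size=(vocab_size, dim))   # context-word vectors u (rows of W')

c = 2                                    # index of the center word
scores = U @ V[c]                        # u_w^T v_c for every word w in the vocabulary
probs = np.exp(scores) / np.exp(scores).sum()   # softmax: p(o | c) for each candidate o
print(probs)
```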

 

 


Train the model: compute vector representations
Put the u and v vectors of every word together to construct one long parameter vector (2d numbers per word).
This parameter vector is the thing we are going to optimize.
Gradient!
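A minimal sketch of one such gradient step (toy dimensions, random parameters): for the loss -log p(o | c), the gradient with respect to the center vector v_c is the expected context vector minus the observed one.

```python
import numpy as np

vocab_size, dim, lr = 5, 3, 0.1
rng = np.random.default_rng(1)
V = rng.normal(size=(vocab_size, dim))   # center-word vectors
U = rng.normal(size=(vocab_size, dim))   # context-word vectors

c, o = 2, 4                              # center word index, observed context word index
scores = U @ V[c]
probs = np.exp(scores) / np.exp(scores).sum()

grad_vc = probs @ U - U[o]               # d/dv_c of -log p(o | c)
V[c] -= lr * grad_vc                     # one gradient descent step on v_c
```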

 

PAPER: A Simple but Tough-to-Beat Baseline for Sentence Embeddings
How do we encode the meaning of sentences (not just words, as above)?
Sentence Embedding:

  1. Compute sentence similarity through inner dot product
  2. Sentence classification task, like sentiment analysis

Compose word representations into sentence representations:

  1. Bag of words (BoW): the average of the word vectors
  2. CNN, RNN

This paper: a very simple unsupervised method
Weighted BoW + removing a special direction.
Step 1:
Compute a weighted average of the word vector representations, where each word gets a separate weight:
weight = a / (a + f(w)), where f(w) is the word's frequency and a is a smoothing parameter.
Step 2:
Compute the first principal component of the sentence embeddings and remove each sentence's projection onto it.
This combines the sentence embedding with word frequency and with how related the words are.
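A minimal sketch of the two steps, assuming precomputed `word_vecs` (word -> vector) and `word_freq` (word -> relative frequency) lookups; these names and the helper function are illustrative, not from the paper's code:

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_freq, a=1e-3):
    # Step 1: weighted average of word vectors, weight = a / (a + f(w)).
    emb = np.stack([
        np.mean([a / (a + word_freq[w]) * word_vecs[w] for w in sent], axis=0)
        for sent in sentences
    ])
    # Step 2: remove the projection onto the first principal component (common direction).
    _, _, vt = np.linalg.svd(emb, full_matrices=False)
    u = vt[0]
    return emb - np.outer(emb @ u, u)
```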

 


Gradient descent (GD): the slope keeps decreasing as you approach the minimum; the learning rate α must be small enough to avoid jumping over the minimum.
Taking the gradient over a corpus of 40 billion words cannot be done: not practical!!
Instead: Stochastic Gradient Descent (SGD).
Pick just one center word and its surrounding words, and from that single window position compute the gradient for all the parameters (using that as an estimate of the gradient).
It is not necessarily a good estimate of the direction, but it is orders of magnitude faster than full GD.
neural networks love noise!
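A minimal sketch of the SGD loop; `loss_gradient` is a hypothetical function returning the gradient of the skip-gram loss at a single window position:

```python
import numpy as np

def sgd(theta, windows, loss_gradient, lr=0.025, epochs=1):
    """theta holds all word vectors flattened into one parameter vector."""
    for _ in range(epochs):
        np.random.shuffle(windows)
        for window in windows:                          # one window position at a time
            theta -= lr * loss_gradient(theta, window)  # noisy but cheap gradient step
    return theta
```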

Reposted from www.cnblogs.com/alexfz/p/10222830.html