[cs224n] Lecture 2 | Word Vector Representations: word2vec




1. How do we represent the meaning of a word?

The first and arguably most important common denominator across all NLP tasks is how we represent words as input to any of our models. Much of the earlier NLP work that we will not cover treats words as atomic symbols. To perform well on most NLP tasks we first need to have some notion of similarity and difference between words. With word vectors, we can quite easily encode this ability in the vectors themselves (using distance measures such as Jaccard, Cosine, Euclidean, etc).
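As a concrete illustration (not from the lecture), here is a minimal sketch of how cosine similarity between two word vectors could be computed; the three-dimensional vectors below are made up purely for the example.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means orthogonal."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 3-dimensional "word vectors", invented purely for illustration.
v_hotel = np.array([0.9, 0.1, 0.3])
v_motel = np.array([0.8, 0.2, 0.25])
v_cat   = np.array([0.1, 0.9, 0.7])

print(cosine_similarity(v_hotel, v_motel))  # high: related words end up close
print(cosine_similarity(v_hotel, v_cat))    # lower: unrelated words are farther apart
```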

1.1 How do we have usable meaning in a computer?

1.2 Problem with this discrete representation

one-hot vector

So let’s dive into our first word vector and arguably the most simple, the one-hot vector: Represent every word as an $\mathbb{R}^{|V| \times 1}$ vector with all 0s and one 1 at the index of that word in the sorted English language. In this notation, $|V|$ is the size of our vocabulary. Word vectors in this type of encoding would appear as the following:

$w^{aardvark} = [1, 0, 0, \ldots, 0]^{\top}$, $w^{a} = [0, 1, 0, \ldots, 0]^{\top}$, $\ldots$, $w^{zebra} = [0, 0, 0, \ldots, 1]^{\top}$

We represent each word as a completely independent entity. As we previously discussed, this word representation does not directly give us any notion of similarity. For instance,

$(w^{hotel})^{\top} w^{motel} = (w^{hotel})^{\top} w^{cat} = 0.$
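A quick sketch of why this fails: with a toy three-word vocabulary (a stand-in for the full sorted vocabulary), every pair of distinct one-hot vectors has dot product zero, so hotel looks no more similar to motel than to cat.

```python
import numpy as np

# Tiny vocabulary standing in for the full sorted English vocabulary.
vocab = ["cat", "hotel", "motel"]
V = len(vocab)

def one_hot(word):
    """|V|-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(V)
    vec[vocab.index(word)] = 1.0
    return vec

w_hotel, w_motel, w_cat = one_hot("hotel"), one_hot("motel"), one_hot("cat")

# Distinct one-hot vectors are always orthogonal, so the encoding carries
# no notion of similarity at all.
print(np.dot(w_hotel, w_motel))  # 0.0
print(np.dot(w_hotel, w_cat))    # 0.0
```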

So maybe we can try to reduce the size of this space from $\mathbb{R}^{|V|}$ to something smaller and thus find a subspace that encodes the relationships between words.

1.3 Distributional similarity based representations

1.4 Word meaning is defined in terms of vectors


2. word2vec

There are an estimated 13 million tokens for the English language but are they all completely unrelated? Feline to cat, hotel to motel? I think not. Thus, we want to encode word tokens each into some vector that represents a point in some sort of "word" space. This is paramount for a number of reasons but the most intuitive reason is that perhaps there actually exists some N-dimensional space (such that $N \ll 13$ million) that is sufficient to encode all semantics of our language. Each dimension would encode some meaning that we transfer using speech. For instance, semantic dimensions might indicate tense (past vs. present vs. future), count (singular vs. plural), and gender (masculine vs. feminine).

2.1 Basic idea of learning neural network word embeddings

2.2 Directly learning low-dimensional word vectors

Main idea of word2vec

Skip-gram prediction


Skip-gram

Training Data
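To make the training data concrete, here is a minimal sketch of how (center, context) pairs can be extracted for skip-gram; the sentence and window radius m = 2 are illustrative, not from the lecture.

```python
# Illustrative sentence and window radius; real training uses a large corpus.
sentence = "the quick brown fox jumps over the lazy dog".split()
m = 2

def skipgram_pairs(tokens, window):
    """Yield (center, context) pairs: each word is used to predict its neighbours."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

for center, context in skipgram_pairs(sentence, m):
    print(center, "->", context)
```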

Skip-gram Neural Network Architecture 

 Behavior of the output neuron
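A sketch of the forward pass through this architecture (the vocabulary size, embedding dimension, and random initialization below are illustrative): the hidden layer is simply the row of the input matrix selected by the center word's one-hot vector, and each output neuron computes the dot product of its output vector with that row before the softmax.

```python
import numpy as np

rng = np.random.default_rng(0)

V, N = 10_000, 300                            # illustrative vocabulary and embedding sizes
W_in  = rng.normal(scale=0.01, size=(V, N))   # center ("input") vectors v_w, one per row
W_out = rng.normal(scale=0.01, size=(V, N))   # context ("output") vectors u_w, one per row

def forward(center_index):
    """Score every vocabulary word as a context for the given center word."""
    v_c = W_in[center_index]       # hidden layer: the row picked out by the one-hot input
    scores = W_out @ v_c           # each output neuron computes u_w . v_c
    scores -= scores.max()         # shift for numerical stability (softmax is unchanged)
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs                   # p(context word | center word) for every word

probs = forward(center_index=42)
print(probs.shape, round(probs.sum(), 6))  # (10000,) 1.0
```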


Details of word2vec

The objective function - details

dot products

softmax
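Written out, the quantities these headings refer to are the standard skip-gram objective (the average negative log-likelihood of the context words within a window of radius $m$ around each center word) and the softmax over dot products of outside vectors $u$ with the center vector $v$:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log p(w_{t+j} \mid w_t)$$

$$p(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w=1}^{V} \exp(u_w^{\top} v_c)}$$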

To train the model: Compute all vector gradients


3. CS224N Research Highlight


4. Objective function gradient

The gradient of $\log p(o \mid c)$ with respect to the center vector $v_c$ works out to the observed outside vector minus the model's expected outside vector:

$\frac{\partial}{\partial v_c} \log p(o \mid c) = u_o - \sum_{x=1}^{V} p(x \mid c)\, u_x$
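As a sanity check (a sketch with made-up random values, not lecture code), the analytic gradient above can be compared against a numerical gradient of $\log p(o \mid c)$ with respect to $v_c$:

```python
import numpy as np

rng = np.random.default_rng(1)
V, N = 8, 5                      # tiny vocabulary and embedding size, just for the check
U = rng.normal(size=(V, N))      # outside vectors u_w, one row per word
v_c = rng.normal(size=N)         # center word vector
o = 3                            # index of the observed outside (context) word

def log_prob(v):
    """log p(o | c) under the softmax over dot products."""
    scores = U @ v
    scores -= scores.max()       # stable; cancels out of the log-probability exactly
    return scores[o] - np.log(np.exp(scores).sum())

# Analytic gradient: u_o - sum_x p(x | c) u_x
scores = U @ v_c
probs = np.exp(scores - scores.max())
probs /= probs.sum()
analytic = U[o] - probs @ U

# Numerical gradient via central differences
eps = 1e-6
numeric = np.array([
    (log_prob(v_c + eps * np.eye(N)[i]) - log_prob(v_c - eps * np.eye(N)[i])) / (2 * eps)
    for i in range(N)
])

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```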
