CS 224N Summary

CS 224N course page: Stanford CS 224N | Natural Language Processing with Deep Learning

Lecture 1

PPT URL: PowerPoint Presentation (stanford.edu)

This lecture mainly covers what NLP studies, how we represent the meaning of words, and the basic principles of the Word2Vec method.

Here we briefly introduce the basic principles of the Word2Vec method: the idea is that a word's meaning is captured by the words that frequently appear near it, so words that share similar contexts should have similar representation vectors. Within a sliding window we distinguish the center word from its context words, as shown in the figure below.

(Figure: the center word and its context words within a fixed-size window.)
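To make the windowing concrete, here is a minimal Python sketch (not the course's code) that extracts (center, context) pairs with a sliding window; the example sentence and the window size of 2 are illustrative assumptions.

```python
# Minimal sketch: extracting (center, context) pairs with a sliding window.
# The sentence and window size are illustrative, not from the lecture.
def window_pairs(tokens, window=2):
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue  # skip the center word itself and out-of-range positions
            pairs.append((center, tokens[t + j]))
    return pairs

print(window_pairs("problems turning into banking crises".split(), window=2))
# [('problems', 'turning'), ('problems', 'into'), ('turning', 'problems'), ...]
```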

The loss function can be written as:

  • $J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \leq j \leq m \\ j \neq 0}} \log P\left(w_{t+j} \mid w_t ; \theta\right)$

    • The objective uses the logarithm of the likelihood for convenience of calculation, so that products become sums
  • Why use two vectors per word? (Mentioned in the Lecture 2 slides, but not in detail)

    • For the convenience of mathematical calculations
      • softmax: $P(o \mid c) = \frac{\exp\left(u_o^T v_c\right)}{\sum_{w \in V} \exp\left(u_w^T v_c\right)}$
      • Notice that the denominator involves the term $\sum_{w \in V} u_w^T v_c$; differentiating it with respect to $v_c$ gives $\sum_{w \in V} u_w$. If we did not use two sets of vectors, the term would instead be $\sum_{w \in V} v_w^T v_c$. Since $w$ may equal $c$, we can write it as $\sum_{w \in V, w \neq c} v_w^T v_c + v_c^T v_c$; differentiating with respect to $v_c$ then gives $\sum_{w \in V, w \neq c} v_w + 2 v_c$, which is not as clean as the result obtained with two sets of vectors.
    • In the end, the two vectors are very similar but not identical; we take the average of the two as the final word vector (see the sketch after this list)
    • Specific derivation: 01 Introduction and Word Vectors - The Sun Also Rises
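As a small illustration of the softmax above and of keeping two vectors per word, here is a minimal NumPy sketch; the vocabulary size, dimension, and word indices are illustrative assumptions, not values from the lecture.

```python
import numpy as np

# Minimal sketch of the naive softmax P(o|c) with two vector sets per word:
# V holds center-word vectors, U holds context-word vectors.
rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
V = rng.normal(size=(vocab_size, dim))   # center-word ("input") vectors
U = rng.normal(size=(vocab_size, dim))   # context-word ("output") vectors

def naive_softmax_prob(o, c):
    """P(o | c) = exp(u_o^T v_c) / sum_w exp(u_w^T v_c)."""
    scores = U @ V[c]                    # u_w^T v_c for every w in the vocabulary
    scores -= scores.max()               # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

print(naive_softmax_prob(o=3, c=7))      # probability of context word 3 given center word 7
word_vectors = (U + V) / 2               # final representation: average of the two sets
```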

Lecture 2

PPT URL: cs224n-2023-lecture02-wordvecs2.pdf (stanford.edu)

Bag-of-words model: the model does not consider word order, so the prediction for a word is the same regardless of its position

  • Two variants of Word2Vec:
    • Skip-gram: given the center word, predict the context words (the formulation shown above)
    • Continuous Bag of Words (CBOW): predict the center word from the context words (see the sketch after this list)
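Since the skip-gram direction is already sketched above, here is a minimal sketch of the CBOW direction, where the context-word vectors are averaged to score the center word; sizes and indices are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of CBOW: average the context-word vectors, then score every
# possible center word against that average with a softmax.
rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
V = rng.normal(size=(vocab_size, dim))   # "input" vectors (here: context words)
U = rng.normal(size=(vocab_size, dim))   # "output" vectors (here: center words)

def cbow_probs(context_ids):
    h = V[context_ids].mean(axis=0)      # average of the context-word vectors
    scores = U @ h
    scores -= scores.max()               # numerical stability
    return np.exp(scores) / np.exp(scores).sum()

print(cbow_probs([1, 2, 4, 5]))          # distribution over possible center words
```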

Methods that can be used for parameter updates:

  • Gradient Descent (GD): compute the gradient over all samples, then update
  • Stochastic Gradient Descent (SGD): update using a single sample at a time
  • Mini-batch Gradient Descent (MBGD): update using a batch of samples at a time, a compromise between the two above (see the sketch after this list)
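A minimal sketch of these update schemes on a toy least-squares problem; the data, learning rate, and batch size are illustrative assumptions.

```python
import numpy as np

# Toy least-squares objective: recover the weight vector [1.0, -2.0, 0.5].
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def grad(w, Xb, yb):
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)    # gradient of mean squared error

w = np.zeros(3)
lr, batch_size = 0.1, 16
for epoch in range(50):
    # GD would use the full data each step: w -= lr * grad(w, X, y)
    # SGD is the special case batch_size == 1.
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):   # mini-batch gradient descent
        b = idx[start:start + batch_size]
        w -= lr * grad(w, X[b], y[b])
print(w)  # should approach [1.0, -2.0, 0.5]
```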

Loss functions available for training:

  • Naive softmax (simple, but the computation is heavy when there are many classes)
  • Optimized variants such as hierarchical softmax
  • Negative sampling

Above, we used the naive softmax loss function:

  • $J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \leq j \leq m \\ j \neq 0}} \log P\left(w_{t+j} \mid w_t ; \theta\right)$

    • $P(w_{t+j} \mid w_t ; \theta) = P(o \mid c)$; the denominator of this term is very expensive to compute, so standard word2vec does not use this form but uses negative sampling instead

    • The core idea of negative sampling: train a binary logistic regression to distinguish a true pair (the center word and a word in its context window) from noise pairs (the center word and a randomly sampled word)

Negative sampling loss function:

  • Maximize $J_t(\theta) = \log \sigma\left(u_o^T v_c\right) + \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)}\left[\log \sigma\left(-u_j^T v_c\right)\right]$
    • The first log term maximizes the probability that the true pair of words co-occurs, and the second term minimizes the probability assigned to the noise words
    • k represents the number of negative samples sampled
  • This can be written as the minimization objective $J_{\text{neg-sample}}\left(\boldsymbol{u}_o, \boldsymbol{v}_c, U\right) = -\log \sigma\left(\boldsymbol{u}_o^T \boldsymbol{v}_c\right) - \sum_{k \in \{K \text{ sampled indices}\}} \log \sigma\left(-\boldsymbol{u}_k^T \boldsymbol{v}_c\right)$ (see the sketch after this list)
    • Negative samples are drawn from the probability distribution $P(w) = U(w)^{3/4} / Z$, where $U(w)$ is the unigram distribution
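A minimal sketch of this negative-sampling loss for a single (center, context) pair, including sampling from the 3/4-power unigram distribution; the vocabulary size, dimension, toy counts, and k are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the negative-sampling loss for one (center, context) pair.
rng = np.random.default_rng(0)
vocab_size, dim, k = 20, 8, 5
V = rng.normal(scale=0.1, size=(vocab_size, dim))   # center-word vectors v
U = rng.normal(scale=0.1, size=(vocab_size, dim))   # context-word vectors u

counts = rng.integers(1, 100, size=vocab_size)      # toy unigram counts
P = counts ** 0.75
P = P / P.sum()                                      # P(w) = U(w)^{3/4} / Z

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sample_loss(o, c):
    neg = rng.choice(vocab_size, size=k, p=P)        # k noise words ~ P(w)
    loss = -np.log(sigmoid(U[o] @ V[c]))             # true pair: push score up
    loss -= np.log(sigmoid(-(U[neg] @ V[c]))).sum()  # noise pairs: push scores down
    return loss

print(neg_sample_loss(o=3, c=7))
```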

The above mainly introduced the Word2Vec model, which is a prediction model (direct prediction) based on a local context window. For learning word vectors, there is another family of models based on count-based global matrix factorization.

  • The advantage of direct prediction is that it can capture patterns more complex than simple co-occurrence counts; the disadvantage is that it does not make full use of global statistical information
  • The advantage of count-based models is that they train quickly and make effective use of statistical information; the disadvantage is that they are biased toward high-frequency words and mainly capture word relatedness

Count-based model: construct a word co-occurrence matrix in which each row is a word and each column is a context. This matrix easily becomes too large, and we want the word-vector dimension to stay small, so we apply dimensionality-reduction methods to learn a low-dimensional representation of each word, as sketched below.
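A minimal sketch of this idea on a tiny toy corpus, using truncated SVD for the dimensionality reduction; the corpus, window size, and output dimension are illustrative assumptions.

```python
import numpy as np

# Build a word-word co-occurrence matrix from a tiny corpus, then reduce it
# with truncated SVD to get low-dimensional word vectors.
corpus = ["i like deep learning", "i like nlp", "i enjoy flying"]
vocab = sorted({w for sent in corpus for w in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    tokens = sent.split()
    for t, w in enumerate(tokens):
        for j in range(max(0, t - window), min(len(tokens), t + window + 1)):
            if j != t:
                X[idx[w], idx[tokens[j]]] += 1       # co-occurrence count

# Truncated SVD: keep the top-d singular directions as word vectors.
U, S, Vt = np.linalg.svd(X)
d = 2
word_vectors = U[:, :d] * S[:d]
print(dict(zip(vocab, word_vectors.round(2))))
```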

  • A typical method is SVD

  • GloVe (Global Vectors): a method that combines the local context window with the (count-based) global co-occurrence matrix

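For reference, the GloVe objective in its standard published form (the weighting function $f$ and the bias terms belong to that formulation and are not detailed above) ties the dot product of word vectors to log co-occurrence counts:

```latex
% Standard GloVe objective: f(X_{ij}) down-weights very frequent co-occurrences.
J = \sum_{i, j = 1}^{|V|} f\!\left(X_{ij}\right)
    \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2}
```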


Origin blog.csdn.net/qq_52852138/article/details/130675489