Word Representation and Word Embeddings · Amy Huang

Word representation means using a set of numbers to represent text. There are many ways to do this; the two main approaches today are distributional semantics and word embeddings, and word embeddings have been the more common approach in recent years. [1]

Why Word Representation?

Text is important because it carries meaning: what we read from text is the meaning behind it, and it is that meaning we want to analyze further. Computational linguistics therefore asks: is there a way to represent text with a set of numbers that captures its meaning? With such a set of numbers, we can put text into mathematical models for analysis.

The Evolution of Word Representation

A simple way to turn text into numbers is one-hot encoding / representation (known in statistics as dummy variables): build a vector whose length equals the number of words in the vocabulary, let each position of the vector correspond to one word in the vocabulary, and represent each word by a vector with a 1 at its own position and 0 everywhere else. From the one-hot representation shown below, we can observe two points:

  1. The word vectors depend on the order of the vocabulary; changing the order gives each word a different vector, so the vocabulary must be fixed first.
  2. The vectors do not reflect the relationships between words. For example, the Euclidean distance between the car vector and the bike vector is $\sqrt{2}$, and the distance between the car vector and the sun vector is also $\sqrt{2}$, even though car should be much closer to bike in meaning (both are vehicles); see the sketch below.
Issues: difficult to compute the similarity
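
A minimal sketch of this problem, assuming a hypothetical three-word vocabulary:

```python
import numpy as np

# Hypothetical toy vocabulary; each word gets a one-hot vector of length |V|.
vocab = ["car", "bike", "sun"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

def euclidean(u, v):
    return np.linalg.norm(u - v)

# Every pair of distinct one-hot vectors is exactly sqrt(2) apart,
# so the representation says nothing about which words are related.
print(euclidean(one_hot["car"], one_hot["bike"]))  # 1.414...
print(euclidean(one_hot["car"], one_hot["sun"]))   # 1.414...
```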

So our goal becomes clearer: find a set of numbers that can represent text and that also reflects the relationships between words.

Goal: a word representation that captures the relationships between words

The meaning of a word is often inferred from its context, so we look for representations of words that reflect the relationships between words through their contexts.

Idea: words with similar meanings often have similar neighbors

We can build a table that counts how often words appear before and after one another, known as a window-based co-occurrence matrix, as shown below:

Window-based Co-occurrence Matrix

Source: NTU-ADLxMLDS word representation slides, y.v.chen

The number of times love appears before or after I is 2, and the number of times enjoy appears before or after I is 1; the love vector and the enjoy vector both have the same length as the vocabulary. Treating each column as the vector representing that word, the distances between word vectors now differ according to how similar their contexts are, which means this representation does reflect the relationships between words (a small sketch follows the issue list below). But this representation has some drawbacks:

  1. When there are many words, the matrix is large and the vector dimension is high.
  2. The matrix contains many zeros (it is sparse), which makes it hard to use in model analysis.
Issues:
* matrix size increases with vocabulary
* high dimensional
* sparsity -> poor robustness
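
A minimal sketch that builds a window-based co-occurrence matrix from a hypothetical toy corpus with window size 1:

```python
import numpy as np

# Hypothetical toy corpus; count how often each pair of words
# appears within the window around each position.
corpus = [
    ["i", "love", "nlp"],
    ["i", "love", "deep", "learning"],
    ["i", "enjoy", "flying"],
]
window = 1

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}
C = np.zeros((len(vocab), len(vocab)), dtype=int)

for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                C[index[w], index[sent[j]]] += 1

print(vocab)
print(C)  # each row/column is a word vector of length |V|
```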

So we want to reduce the dimensionality of the vectors obtained from the window-based co-occurrence matrix. The first dimension-reduction method that comes to mind is PCA (Principal Component Analysis), which is based on SVD (Singular Value Decomposition); applying SVD in NLP is called Latent Semantic Analysis (LSA). Briefly, let $C$ be the matrix whose columns are the word vectors and take its SVD, $C = U \Sigma V^{T}$, where $\Sigma$ is a diagonal matrix of singular values. Keep the first $k$ singular values and set the rest to $0$ to obtain $\Sigma_{k}$; then $C_{k} = U \Sigma_{k} V^{T}$ approximates $C$ and serves as the new latent semantic space (a small sketch appears after the issue list below). The disadvantages of this method are:

  1. It requires a large amount of computation. The computational complexity is $O(mn^{2})$ for an $n \times m$ matrix when $n < m$.
  2. Adding new vocabulary is difficult: whenever a new word appears, the SVD must be recomputed and every word's vector updated.
Issues:
* computationally expensive
* difficult to add new words
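
A minimal sketch of LSA-style dimensionality reduction via truncated SVD, using a small random stand-in for a real co-occurrence matrix:

```python
import numpy as np

# Small random stand-in for a |V| x |V| co-occurrence matrix C.
rng = np.random.default_rng(0)
C = rng.poisson(1.0, size=(8, 8)).astype(float)

k = 2                                        # number of singular values kept
U, s, Vt = np.linalg.svd(C)
C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation of C
print(np.linalg.norm(C - C_k))               # reconstruction error

# Low-dimensional word vectors: one k-dimensional row per word.
word_vectors = U[:, :k] * s[:k]
print(word_vectors.shape)                    # (8, 2)
```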

This leads to a further question: is there a way to learn a low-dimensional vector for each word directly?

Idea: directly learn low-dimensional word vectors

Representing words as vectors of real numbers is called word embeddings. Conceptually, word embedding projects each word from its original one-hot vector space onto a lower-dimensional (continuous) vector space. In recent years, the commonly used word embedding models have been word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014).

Benefits of Word Embeddings

Given an unlabeled training corpus, we obtain a vector representing each word. These vectors carry semantic information, which means that

  1. Cosine similarity between word vectors can measure semantic similarity (see the sketch after this list)
  2. Word vectors are useful, semantically meaningful features in many NLP tasks
  3. They can be placed in neural networks and updated during training
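
A minimal sketch of point 1, with made-up embedding values:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical 4-dimensional embeddings, just to illustrate the measure.
car  = np.array([0.8, 0.1, 0.7, 0.2])
bike = np.array([0.7, 0.2, 0.6, 0.1])
sun  = np.array([0.1, 0.9, 0.0, 0.8])

print(cosine_similarity(car, bike))  # close to 1: semantically similar
print(cosine_similarity(car, sun))   # much smaller: less related
```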

How do we find a word's word embedding?

Word Embedding Models - Word2Vec & GloVe

Word2Vec

Word2Vec is a neural-network-based method for generating word vectors. It has two main architectures: skip-gram and Continuous Bag of Words (CBOW). The idea of skip-gram is, given a word, to use a single-hidden-layer neural network to predict the word's context (also called its neighbors); CBOW instead predicts a word from its context (neighbors). The hidden layer is what we want as the word representation, i.e. the word's word embedding.

word2vec model

Taking the skip-gram architecture in the figure above as an example, $x_{k}$ is the one-hot vector of a word, and $y_{1j}, \ldots, y_{Cj}$ are the predicted context words, where $C$ is the length of the context; the size of $C$ is chosen according to how far before and after a word we consider the surrounding text to matter. The hidden layer consists of nodes $h_{i}$ and has dimension $N$ ($\ll V$), and $h = W^{T} x$ is the word's word embedding [3].

Word2Vec Skip-Gram

The Word2Vec skip-gram approach takes a word as input and predicts the words around it (within a given window length); the objective is to maximize the probability of the surrounding words given that word,

that is, maximize likelihood

which is equivalent to minimizing the corresponding cost/loss function; a sketch of both forms is given below.
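
As a sketch, the standard skip-gram objective from the word2vec papers (with corpus length $T$, window size $m$, input vectors $v$, output vectors $v'$, and vocabulary size $V$) is to maximize the average log-likelihood

$$\frac{1}{T}\sum_{t=1}^{T} \sum_{-m \le j \le m,\, j \ne 0} \log P(w_{t+j} \mid w_{t}), \qquad P(w_{O} \mid w_{I}) = \frac{\exp\left({v'_{w_{O}}}^{T} v_{w_{I}}\right)}{\sum_{w=1}^{V} \exp\left({v'_{w}}^{T} v_{w_{I}}\right)},$$

and the equivalent loss function is simply the negative of this average log-likelihood.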

Here, the word vectors are realized in the hidden layer of this neural network: the word embedding matrix (the lookup table mapping each word to its vector) is simply the hidden-layer weight matrix.
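
A minimal sketch with toy sizes and random weights (everything here is hypothetical), showing that multiplying a one-hot vector by $W$ simply selects one row of $W$:

```python
import numpy as np

V, N = 5, 3                        # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))        # hidden-layer weight matrix = embedding matrix

x = np.zeros(V)
x[2] = 1.0                         # one-hot vector for the word with index 2

h = W.T @ x                        # h = W^T x, the hidden-layer activation
assert np.allclose(h, W[2])        # identical to simply looking up row 2 of W
print(h)
```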

The bottleneck of the word2vec approach is that the number of neurons in the output layer (i.e. the number of output vectors) equals the total vocabulary size. When the vocabulary or the corpus is very large, this creates a heavy computational burden, which motivated methods such as hierarchical softmax and negative sampling that limit the number of parameters updated at each step.

large vocabularies or large training corpora → expensive computations

⇒ limit the number of output vectors that must be updated per training instance → hierarchical softmax, sampling

Hierarchical Softmax

Idea: compute the probability of leaf nodes using the paths

For details, see: 類神經網路 – Hierarchical Probabilistic Neural Network Language Model (Hierarchical Softmax)
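
A minimal sketch of the idea, assuming a hypothetical three-word vocabulary arranged in a small binary tree (the paths, vectors, and helper names here are made up for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical binary tree: each word (leaf) is reached by a path of
# (inner_node_index, direction) steps, where +1 means one branch and -1 the other.
paths = {
    "love":  [(0, +1), (1, -1)],
    "enjoy": [(0, +1), (1, +1)],
    "sun":   [(0, -1)],
}

def leaf_probability(word, h, inner_vectors):
    """P(word | context vector h) = product of sigmoid decisions along the path."""
    p = 1.0
    for node, direction in paths[word]:
        p *= sigmoid(direction * np.dot(inner_vectors[node], h))
    return p

rng = np.random.default_rng(0)
N = 4                                     # embedding dimension
inner_vectors = rng.normal(size=(2, N))   # one vector per inner node
h = rng.normal(size=N)                    # hidden-layer (context) vector

# The leaf probabilities sum to 1 because each inner node's two branch
# probabilities sum to 1, so only the nodes on one path need updating.
print({w: round(leaf_probability(w, h, inner_vectors), 4) for w in paths})
print(sum(leaf_probability(w, h, inner_vectors) for w in paths))
```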

Negative Sampling (NEG)

Idea: only update a sample of output vectors

For details, see: Mikolov et al., "Distributed representations of words and phrases and their compositionality," in NIPS, 2013

Negative sampling updates only a subset of the output vectors, so the loss function can be rewritten as

NEG objective function
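
As a sketch, in the notation of Mikolov et al. (input vector $v_{w_{I}}$, output vectors $v'_{w}$, sigmoid $\sigma$), each $\log P(w_{O} \mid w_{I})$ term in the skip-gram objective is replaced by

$$\log \sigma\left({v'_{w_{O}}}^{T} v_{w_{I}}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_{i} \sim P_{n}(w)}\left[\log \sigma\left(-{v'_{w_{i}}}^{T} v_{w_{I}}\right)\right],$$

so only the target word and the $k$ sampled negative words have their output vectors updated.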

Mikolov et al. describe it this way: the task is to distinguish the target word $w_{O}$ from draws from the noise distribution $P_{n}(w)$ using logistic regression, where there are $k$ negative samples for each data sample.

What is a good $P_{n}(w)$ ?

Mikolov et al. report: "We investigated a number of choices for $P_{n}(w)$ and found that the unigram distribution $U(w)$ raised to the 3/4rd power (i.e., $U(w)^{3/4}/Z$) outperformed significantly the unigram and the uniform distributions."

In other words, there is not yet a principled way to choose $P_{n}(w)$, but this empirically chosen function produces results that outperform the other distributions that were tried.

Idea: less frequent words sampled more often

Empirical setting: unigram model raised to the power of 3/4
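
A minimal sketch of this empirical sampling distribution, using a hypothetical toy corpus:

```python
import numpy as np
from collections import Counter

# Hypothetical toy corpus; in practice the counts come from the training corpus.
corpus = "i love nlp i love deep learning i enjoy nlp".split()
counts = Counter(corpus)
words = list(counts)

freqs = np.array([counts[w] for w in words], dtype=float)
probs = freqs ** 0.75            # unigram distribution raised to the 3/4 power
probs /= probs.sum()             # normalize (the Z in U(w)^{3/4} / Z)

rng = np.random.default_rng(0)
negative_samples = rng.choice(words, size=5, p=probs)
print(dict(zip(words, probs.round(3))))
print(negative_samples)          # rarer words are sampled relatively more often
```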

GloVe

Another word embedding model that has been widely used in recent years is GloVe, proposed by Pennington et al. For details, see: Pennington et al., "GloVe: Global Vectors for Word Representation," in EMNLP, 2014

The idea behind GloVe is that ratios of word co-occurrences encode information about word meaning. Let $P_{ij}$ be the probability that $w_{j}$ appears in the context of $w_{i}$.

Here $X_{ij}$ is the number of times $w_{j}$ appears in the context of $w_{i}$, and $X_{i} = \sum_{k} X_{ik}$ is the total number of words appearing in the context of $w_{i}$, so that $P_{ij} = X_{ij}/X_{i}$.

How closely $w_{i}$ and $w_{j}$ are related can be represented by the ratio of the probabilities with which they each co-occur with a probe word $w_{k}$.

$\frac{P_{ik}}{P_{jk}}$ is called the ratio of co-occurrence probabilities.

Idea: ratio of co-occurrence probability can encode meaning

Now let $F(x) = \exp(x)$.

We can add a bias term $b_{i}$ to absorb the part that depends only on $i$ (this is what lets $w_{i}$'s equation be independent of $k$), and a second bias term $b_{k}$ to keep the expression symmetric between $i$ and $k$, obtaining the relation sketched below.
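
Written out in the notation above, these two steps are (a sketch following the derivation in the GloVe paper, using $P_{ik} = X_{ik}/X_{i}$):

$$w_{i}^{T} w_{k} = \log P_{ik} = \log X_{ik} - \log X_{i}$$

$$w_{i}^{T} w_{k} + b_{i} + b_{k} = \log X_{ik}$$

Here $-\log X_{i}$ depends only on $i$, so it can be folded into the bias $b_{i}$, and $b_{k}$ is added so that the equation treats the two words symmetrically.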

Treating this as a regression problem, we can solve it by least squares: with loss function $\sum_{i,k=1}^{V} (w_{i}^{T} w_{k} + b_{i} + b_{k} - \log{X_{ik}})^{2}$ we can find $b_{i}, b_{k}, w_{i}, w_{k}$ (in practice this is done with gradient-based optimization rather than a closed-form solution).

There are still a few problems. First, the log function is undefined at 0. Second, in the least-squares loss the gap between each pair $(w_{i}, w_{k})$ and $\log{X_{ik}}$ is treated with equal importance; a pair $(w_{i}, w_{k})$ that co-occurs frequently does not get any extra weight on its loss term.

So we make one more adjustment and give each pair $(w_{i}, w_{k})$ a weight $f(X_{ik})$; the loss function can then be written as
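
the weighted objective sketched here, following the GloVe paper:

$$J = \sum_{i,k=1}^{V} f(X_{ik})\left(w_{i}^{T} w_{k} + b_{i} + b_{k} - \log X_{ik}\right)^{2},$$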

where the weight is $f(x) = (x/x_{max})^{\alpha}$ if $x < x_{max}$ and $f(x) = 1$ otherwise, with $x_{max}$ and $\alpha$ constants.

Conveniently, Pennington et al.'s experiments found that the model performs best with $x_{max} = 100$ and $\alpha = 3/4$, the same exponent as the empirical choice Mikolov et al. made for negative sampling.

The advantages of GloVe are fast training, scalability, good performance even with a small corpus, and small vectors.

Implementation

Gensim: a Word2Vec Library

Even without understanding the theory above, you can use Gensim directly in Python to easily obtain word vectors and then feed them into your models.
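
A minimal sketch with a hypothetical toy corpus; the parameter names assume Gensim 4.x (`sg=1` selects skip-gram, `sg=0` would be CBOW):

```python
from gensim.models import Word2Vec

# Hypothetical toy corpus: a list of tokenized sentences.
sentences = [
    ["i", "love", "nlp"],
    ["i", "love", "deep", "learning"],
    ["i", "enjoy", "nlp"],
]

# vector_size is the embedding dimension; window is the context size.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(model.wv["love"])                      # the 50-dimensional word vector for "love"
print(model.wv.similarity("love", "enjoy"))  # cosine similarity between two words
print(model.wv.most_similar("nlp", topn=2))  # nearest neighbours by cosine similarity
```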

Reference

  1. Quora - What’s the difference between word vectors, word representations and vector embeddings?

    There are many ways to represent words in NLP / Computational Linguistics. Two prominent approaches use vectors as their representations. These are, largely speaking:

    • Distributional Semantics: represent a word with a very high-dimensional sparse vector, where each dimension reflects a context in which the word occurred in the corpus. For example, a context could be another word that appeared in proximity.

    • Word Embeddings: represent a word with a low-dimensional vector (e.g. 100 dimensions). The dimensions are usually latent, and often obtained using the information as in the distributional semantics approach (e.g. LSA, word2vec).

  2. NTU-ADLxMLDS word embedding lecture slides, 陳縕儂 (Yun-Nung Chen)
  3. NTHU-ML Word2Vec lecture slides, 吳尚鴻: the weight matrix $W$ encodes a one-hot vector $x$ into a low-dimensional dense vector $h$. Note that the weights are shared across words to ensure that each word has a single embedding; this is called weight tying. Also, word2vec is an unsupervised learning task, as it does not require explicit labels. (A neural network can be used for both supervised and unsupervised learning tasks.)

  4. Word2Vec Skip-Gram Visualization

  5. Implementing Word2Vec CBOW with TensorFlow (使用TensorFlow實作Word2Vec CBOW)

  6. Papers related to Word2Vec
  7. Proposed improvements to GloVe: Simpler GloVe
