Study Notes: Deep Learning (5) - Related Concepts of Word Vectors

Study time: 2022.04.21

Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods for effective communication between humans and computers in natural language. Natural language processing is a discipline that integrates linguistics, computer science, and mathematics.

The two core tasks of NLP are natural language understanding (NLU) and natural language generation (NLG). Common applications of NLP include:

  • Sequence tagging: such as Named Entity Recognition (NER), semantic tagging, part-of-speech tagging, word segmentation, etc.;
  • Classification tasks: such as text classification, sentiment analysis, etc.;
  • Sentence pair relationship judgment: such as natural language inference, question answering (QA), text semantic similarity, etc.;
  • Generative tasks: such as machine translation, text summarization, opinion extraction, poem writing, image description generation, etc.;
  • Others: public opinion monitoring, speech recognition, Chinese OCR.

4. Related Concepts of Word Vectors

BERT is a widely used NLP model. Before formally learning BERT, I would like to have a basic understanding of the related concepts of pre-training.

4.1 Concept Definition

4.1.1 Word Vectors

In NLP tasks, the first consideration is how words are represented in a computer. A word vector is a set of numerical values used to represent a word or character.

  • discrete representation

    • Traditional rule-based or statistics-based natural language processing methods treat each word as an atomic symbol; this is the one-hot representation (one-hot encoding). Each word is represented as a long vector whose dimension equals the vocabulary size; exactly one dimension has the value 1 and all others are 0, and that dimension identifies the current word;
    • This is equivalent to assigning an id to each word, so the representation cannot express relationships between words, and it is too high-dimensional and too sparse.
  • distributed representation

    • Distributed representation represents a word as a fixed-length continuous dense vector, also known as a word vector (i.e., a set of values representing a word or character). It includes matrix-based, cluster-based, and neural-network-based distributed representations, among others.
    • In this way, words can have similarity relationships: there is a notion of "distance" between words, which is very helpful for many natural language processing tasks; such vectors can also carry more information, and each dimension can have a specific meaning (see the sketch after this list).
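
A minimal sketch contrasting the two representations (the toy vocabulary, the 4-dimensional embedding size, and the random values are assumptions for illustration only; real dense vectors would be learned by a model):

```python
import numpy as np

vocab = ["king", "queen", "apple"]                    # toy vocabulary
word_to_id = {w: i for i, w in enumerate(vocab)}

# One-hot (discrete): dimension = vocabulary size, a single 1, no notion of similarity
def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[word_to_id[word]] = 1.0
    return vec

# Distributed (dense): fixed low dimension; here random, in practice learned
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 4))   # 4-dimensional dense vectors

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(one_hot("king") @ one_hot("queen"))             # always 0: one-hot vectors of different words are orthogonal
print(cosine(embedding_matrix[word_to_id["king"]],
             embedding_matrix[word_to_id["queen"]]))  # a graded similarity score
```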

4.1.2 Word Embedding

The word vectors trained by various language models are also known as word embeddings.

Embedding can be understood as a dictionary that maps integer indices (codes corresponding to specific words) into dense vectors. It takes integers as input, looks up those integers in an internal dictionary, and returns the associated vector.

Personal understanding: traditional representations build word vectors from corpus statistics, while word embedding represents all words with vectors of one uniform dimension, mapping each word to a low-dimensional continuous vector.
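
Following the "dictionary lookup" view above, here is a minimal sketch of an embedding table (the vocabulary size, the dimension 128, and the random initialization are assumptions; a trained language model would learn the table instead):

```python
import numpy as np

vocab_size, embedding_dim = 10000, 128
rng = np.random.default_rng(42)
embedding_table = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))

def embed(token_ids):
    """Map integer token ids to dense vectors by table lookup."""
    return embedding_table[token_ids]        # shape: (len(token_ids), embedding_dim)

sentence_ids = np.array([12, 7, 430, 7])     # hypothetical token ids for a 4-token sentence
print(embed(sentence_ids).shape)             # (4, 128)
```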

4.1.3 Pre-training

A network structure is designed for the language modeling task, and then a large (in principle unlimited) amount of unlabeled natural language text is used to train it, encoding what is learned into the network's parameters. When the results are satisfactory, the parameters of the trained model are saved, so that the next time a similar task is performed, training can start from these parameters and achieve better results. This process is pre-training.

Basic idea: Use as much training data as possible to extract as many common features as possible, so that the learning burden of the model for specific tasks is lighter.

Principle: reusability of low-level features. Word vectors obtained by pre-training in this way are the result of training on a very large sample, so they are highly general and can be used directly for various tasks.

After the pre-trained model is trained, there are two ways to use it when connecting to downstream tasks:

(1) Frozen / Feature-based

The parameters loaded into the shallow layers do not change while training on the new task; the remaining higher-level parameters are still randomly initialized.

This refers to using the word vectors produced by the pre-trained language model as features (that is, the network parameters of the Word Embedding layer are fixed) and feeding them into the downstream target task.

(2) Fine-tuning

Fine-tuning means using the parameters of the pre-trained model as the initial parameters of the underlying network, while still letting them change during training on the new task (the Word Embedding parameters of this layer are also updated as training proceeds on the new training set).

The purpose is to better adjust the parameters to make them more suitable for the current downstream task. At this time, you are using a Pre-trained model, and the process is Fine-tuning.

In short, pre-training is the process of representing characters and words as vectors, and the pre-trained model is a vector model trained on a large corpus.
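
To make the two usage modes concrete, here is a minimal PyTorch-style sketch (the tensor shapes and the random stand-in for pre-trained weights are assumptions for illustration):

```python
import torch
import torch.nn as nn

pretrained = torch.randn(10000, 128)         # stand-in for pre-trained embedding weights

# (1) Frozen / feature-based: the loaded embedding weights are never updated
frozen_emb = nn.Embedding.from_pretrained(pretrained, freeze=True)

# (2) Fine-tuning: the same weights initialize the layer but keep receiving gradients
finetuned_emb = nn.Embedding.from_pretrained(pretrained, freeze=False)

classifier = nn.Linear(128, 2)               # randomly initialized downstream head

print(frozen_emb.weight.requires_grad)       # False
print(finetuned_emb.weight.requires_grad)    # True
```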

4.1.4 Tokenization (Word Segmentation)

The first step in working with text is to break it into tokens. A token is the smallest unit that carries basic semantics. The process of splitting text into tokens is called tokenization, and the model or tool that performs it is called a tokenizer.

There are three granularities of tokenization: word granularity (Word-Level), character granularity (Char-Level) and subword granularity (SubWord-Level).

SubWord has three classic algorithms:

  • Byte-Pair Encoding (BPE): a simple form of data compression in which the most common pair of consecutive bytes is replaced with a byte that does not occur in the data; a replacement table is needed to reconstruct the original data later. Applied to tokenization, the most frequent adjacent symbol pairs are iteratively merged into subwords (see the sketch after this list);
  • WordPiece: A variant of BPE that generates new subwords based on probability instead of the next highest frequency byte pair.
  • Unigram Language Model: ULM is another subword segmentation algorithm that outputs multiple candidate segmentations with their probabilities. It introduces the assumption that subword occurrences are independent, so the probability of a subword sequence is the product of the probabilities of its subwords.
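
To get a feel for the BPE merging mentioned above, here is a minimal sketch of one round of its core loop: count adjacent symbol pairs and merge the most frequent one (the toy word counts are made up; real implementations add end-of-word markers and repeat until a target vocabulary size is reached):

```python
from collections import Counter

# Each word is a tuple of symbols, mapped to its frequency in the corpus
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w", "e", "s", "t"): 6}

def most_frequent_pair(corpus):
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])   # fuse the two symbols into one subword
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

pair = most_frequent_pair(corpus)   # ('w', 'e') in this toy corpus
corpus = merge_pair(corpus, pair)
print(pair, corpus)
```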

4.2 Word vector representation method

First, it should be clear that pre-training is the product of word vector representation methods developing to a certain stage. To understand pre-trained models from a broader perspective, I first want to sort out the word vector representation methods as a whole; this section therefore briefly introduces the techniques for generating word vectors.

4.2.1 Based on traditional statistical methods

(1) Bag of Words model

In layman's terms, all the words in the sentences of the corpus are thrown into a bag, and a word-set/word-frequency matrix is constructed, ignoring order, ignoring importance, and treating words as independent.

It is a common technique for information retrieval, data mining, keyword extraction, sentence similarity, and so on, but it cannot express the different importance of words, cannot capture similar words or polysemy, cannot describe word order, and suffers from the curse of dimensionality.
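
A minimal sketch of the word-set/word-frequency idea, using naive whitespace tokenization (the three toy sentences are just for illustration):

```python
from collections import Counter

corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]

vocab = sorted({word for sent in corpus for word in sent.split()})

def bow_vector(sentence):
    """One row of the bag-of-words matrix: a count per vocabulary word, order ignored."""
    counts = Counter(sentence.split())
    return [counts.get(w, 0) for w in vocab]

print(vocab)
for sent in corpus:
    print(sent, "->", bow_vector(sent))
```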

(2)TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) differentiates the importance of words and introduces corpus-level information (a so-called prior). It is a commonly used weighting technique for information retrieval, data mining, keyword extraction, sentence similarity, and so on. However, words with the same frequency get identical weights, leading to representation conflicts; it depends heavily on the corpus; it is biased toward words with low frequency in the text; and it cannot describe word order.
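
A minimal sketch of the TF-IDF weighting itself, under one common definition (libraries such as scikit-learn apply extra smoothing, so their numbers differ slightly):

```python
import math
from collections import Counter

docs = [d.split() for d in ["I like deep learning", "I like NLP", "I enjoy flying"]]
N = len(docs)

def tf_idf(word, doc):
    tf = Counter(doc)[word] / len(doc)             # term frequency within this document
    df = sum(1 for d in docs if word in d)         # number of documents containing the word
    return tf * math.log(N / df)                   # tf * inverse document frequency

print(tf_idf("like", docs[0]))   # common across documents -> lower weight
print(tf_idf("deep", docs[0]))   # rare across documents   -> higher weight
```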

4.2.2 Matrix-based statistical methods

Source of this part: What is a word vector , a brief history of natural language representation .

(1) Co-occurrence matrix

It counts how often words co-occur within a window of a pre-specified size, and uses the counts of co-occurring words around a word as that word's vector. Specifically, word vectors are defined by constructing a co-occurrence matrix from a large amount of corpus text.

e.g., given the corpus "I like deep learning.", "I like NLP.", "I enjoy flying.", the co-occurrence matrix is as follows:

(Figure: co-occurrence matrix built from the three example sentences)

Word vectors taken from the co-occurrence matrix alleviate, to some extent, the problem that the similarity between one-hot vectors is always 0, but they still suffer from data sparsity (the matrix is sparse) and the curse of dimensionality.
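
A minimal sketch of building such a co-occurrence matrix for the three example sentences, with a symmetric window of size 1 and naive whitespace tokenization (window size and tokenization are simplifying assumptions):

```python
import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
tokens = [sent.split() for sent in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1    # count co-occurrences inside the window

print(vocab)
print(X)
```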

(2) Singular value decomposition SVD

Since the discrete word vector obtained based on the co-occurrence matrix has the problem of high dimensionality and sparsity, one solution is to reduce the dimension of the original word vector to obtain a dense continuous word vector.

e.g., performing SVD on the co-occurrence matrix above yields the orthogonal matrix U; normalizing U gives the matrix below:

(Figure: normalized matrix U obtained from the SVD of the co-occurrence matrix)

SVD yields a dense word-word matrix with many good properties: semantically similar words are close in the vector space, and it can even reflect linear relationships between words to a certain extent.

Supplement: SVD singular value decomposition algorithm :

The purpose of eigenvalue decomposition and singular value decomposition is the same, that is, to extract the most important features of a matrix.

Eigenvalue decomposition is a great way to extract matrix features, but it only works on square matrices. In the real world, most of the matrices we see are not square matrices. For example, there are N students, and each student has M subjects, so an N*M matrix formed in this way cannot be a square matrix. Singular value decomposition is a decomposition method that can be applied to any matrix.

Disadvantages: high computational cost (time complexity is cubic in the matrix dimension, space complexity is quadratic); memory cannot be shared; it is difficult to parallelize; and the transformed data can be hard to interpret.
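
A minimal sketch of the dimensionality-reduction step, reusing the matrix X and vocabulary from the co-occurrence sketch above (the target dimension 2 is an arbitrary choice for illustration):

```python
import numpy as np

# X and vocab come from the co-occurrence sketch above
U, S, Vt = np.linalg.svd(X.astype(float))

k = 2                                   # keep only the top-k singular directions
word_vectors = U[:, :k].copy()          # each row is now a dense 2-dimensional word vector

# normalize rows so that cosine similarity reduces to a dot product
word_vectors /= np.linalg.norm(word_vectors, axis=1, keepdims=True)

for word, vec in zip(vocab, word_vectors):
    print(f"{word:10s} {vec}")
```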

(3) Topic Model

Topic models include: Non-negative Matrix Factorization (NMF), Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), and the Correlated Topic Model (CTM).

In general, first build an $N \times N$ square matrix (N is the size of the dictionary), whose elements are typically filled with TF-IDF values, word frequencies, or BM25 scores from the training corpus. The underlying assumption is "word-topic, topic-sentence", so the topic model is used for dimensionality reduction, and the resulting weight matrix W gives the word vector for each word.

BM25 (BM = best matching) is an optimized version of TF-IDF. For details, see: BM25 Algorithm Introduction.

Topic models capture latent semantics between words; the algorithms are concise, clear, and interpretable; they rest on rigorous assumptions and proofs; they use global features; and they are integrated into common Python toolkits. However, they rely heavily on the corpus, cannot describe word order, are computationally expensive and suffer from the curse of dimensionality, and the number of topics is hard to determine.
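
As a small illustration of the "dimensionality reduction over a TF-IDF-filled matrix" idea, here is an LSA-style sketch using scikit-learn's TruncatedSVD (the toy corpus and the choice of 2 components are assumptions; note it factorizes a document-term matrix rather than the word-word matrix described above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]

tfidf = TfidfVectorizer().fit_transform(corpus)    # documents x terms, TF-IDF weighted
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topics = lsa.fit_transform(tfidf)              # documents in a 2-dimensional latent space

print(doc_topics.shape)                            # (3, 2)
print(lsa.components_.shape)                       # (2, n_terms): per-term weights, usable as term vectors
```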

(4) Similarity and co-occurrence matrix: GloVe

Born after word2vec, GloVe is a traditional statistical learning method that does not use neural networks, yet its performance is similar to word2vec, without an obvious improvement.

The GloVe algorithm starts from ratios of co-occurrence probabilities: roughly, the relation between the word vectors of A and B should reflect $P_{ac} / P_{bc}$. Here $X$ is the co-occurrence matrix counted over the whole corpus; $P_{ac}$ is the number of times words A and C appear together within a window divided by the number of times word A appears in the corpus, and similarly $P_{bc}$ is the number of times words B and C appear together within a window divided by the number of times word B appears. This is simplified to: the dot product of the word vectors of A and B should approximate the logarithm of their co-occurrence count; adding bias terms gives GloVe, whose loss function is $J = \sum_{i,j=1}^{V} f(X_{ij})\,(w_i^T\bar{w}_j + b_i + \bar{b}_j - \log X_{ij})^2$ (detailed in the GloVe subsection below).

The algorithm is simple and intuitive, uses purely statistical methods, combines global and local information about words, and partially captures global relationships between word vectors. However, it relies heavily on the corpus, cannot describe word order, cannot handle polysemy, and is not integrated into common Python toolkits, which makes training inconvenient and its use less common.

4.2.3 Statistical Language Model

Statistical language models treat language (a sequence of words) as a random event and assign it a probability describing how likely it is to belong to a given language. That is, a statistical language model describes the probability distribution over grammatical units such as words, sentences, and even entire documents.
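
Concretely, for a word sequence $w_1, w_2, \dots, w_n$, the model assigns a probability via the chain rule:

$P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})$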

(1) N-gram Model

The idea of the N-gram model is: based on the Markov assumption, the probability that the word $w_i$ appears at position $i$ depends only on the $(n-1)$ historical words that precede it.

However, this model has some defects: the parameter space is large and sparse (because the training corpus is limited, a larger N cannot be pursued, and the larger N is, the heavier the computation; trigram models are currently the most used); its generalization ability is weak and it cannot express similarity between words, since each word is a separate, unrelated category; it is purely statistical and requires exact matches; it cannot capture long-range dependencies; and N-gram language models still suffer from the OOV (Out Of Vocabulary) problem (see: N-gram and NNLM Language Models - OwnLu).
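
A minimal count-based bigram (N = 2) sketch of this idea, with a toy corpus and no smoothing (real models need smoothing to cope with unseen pairs and the OOV problem):

```python
from collections import Counter

corpus = ["<s> I like deep learning </s>", "<s> I like NLP </s>", "<s> I enjoy flying </s>"]
tokens = [s.split() for s in corpus]

unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))

def p(word, prev):
    """Markov assumption: P(word | full history) is approximated by P(word | previous word)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p("like", "I"))      # 2/3: "I" is followed by "like" in two of the three sentences
print(p("deep", "like"))   # 1/2
```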

(2) Maximum entropy model MaxEnt

Model details: maximum entropy model . The principle of maximum entropy states that among all possible probabilistic models, the model with the largest entropy is the best model.

Model all that is known and assume nothing about what is unknown. In other words, given a set of facts (features + outputs), choose a model that is consistent with all of the facts and is otherwise as uniform as possible.
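
In formula form (the standard formulation), the maximum entropy model chooses, among all conditional distributions consistent with the feature constraints, the one with the largest conditional entropy: $\max_{p} H(p) = -\sum_{x,y}\tilde{P}(x)\,p(y \mid x)\log p(y \mid x)$, subject to $E_{p}[f_i] = E_{\tilde{P}}[f_i]$ for every feature function $f_i$; the solution takes the log-linear form $p(y \mid x) = \frac{1}{Z(x)}\exp\big(\sum_i \lambda_i f_i(x, y)\big)$.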

(3) Probabilistic Graphical Models

Probabilistic graphical models use graph-based methods to represent probability distributions (or probability densities) on top of probability models; they are a general approach to representing and processing uncertain knowledge.

  1. Hidden Markov Model HMM

    Hidden Markov Model (HMM): the simplest dynamic Bayesian network, a well-known directed graph model, and a generative model; it describes a Markov process with unknown (hidden) parameters. Mainly used for time series analysis, it is widely applied in speech recognition, natural language processing, and other fields (a minimal Viterbi decoding sketch appears after this list).

    HMM has two assumptions:

    • Homogeneous Markov hypothesis: it assumes that the hidden state at any time is only related to the hidden state of the previous time, and has nothing to do with the hidden state and the observed state at any other time;
    • Observation independence assumption: It assumes that the observed state at any time is only related to the hidden state at the current moment, and has nothing to do with other observed states and hidden states.

    Also, maximum entropy can be combined with HMM to give MEMM (Maximum Entropy Markov Model). MEMM is a directed graph model and a discriminative model; it drops HMM's observation independence assumption, takes dependencies between adjacent states into account, and conditions on the entire observation sequence, so it has stronger expressive power. However, MEMM introduces the label bias problem: because of local normalization, MEMMs tend to favor states with fewer outgoing transitions.

    The model introduction can be seen in detail: Probabilistic graphical model in NLP .

    Hidden Markov Principle: An Introduction to NLP Hardcore - Hidden Markov Model HMM .

  2. Conditional random field model CRF

    Conditional Random Field (CRF): a model of the conditional probability distribution of one set of output random variables given another set of input random variables, characterized by the assumption that the output random variables form a Markov random field. A conditional random field is a discriminative model.

    Concept understanding:

    • A random field is a whole composed of several positions. When a value is randomly assigned to each position according to a certain distribution, the whole is called a random field. Or take part-of-speech tagging as an example: Suppose we have a ten-word sentence that needs to be tagged with part-of-speech. The part of speech of each of these ten words can be selected from the set of parts of speech we know (noun, verb...). This forms a random field when we have chosen parts of speech for each word.
    • Markov Random Field (MRF) is a special case of a random field. It is a typical Markov network and a well-known undirected graph model; each node in the graph represents a variable (or a set of variables), and edges between nodes represent dependencies between variables. It assumes that the assignment at any position in the random field is related only to the assignments of its adjacent positions and is independent of non-adjacent positions. Continuing the ten-word part-of-speech tagging example: if we assume that the part of speech of each word depends only on the parts of speech of its adjacent words, this random field specializes into a Markov random field. For example, the part of speech of the third word is related only to those of the second and fourth words, besides its own position.
    • CRF is a special case of Markov random field . It assumes that there are only two variables, X and Y, in Markov random field. X is generally given, and Y is generally our output under the condition of given X. In this way, the Markov random field is specialized into a conditional random field. In our ten-word sentence part-of-speech tagging example, X is the word and Y is the part-of-speech. So if we assume it is a Markov random field, then it is also a CRF.

    The CRF model is an undirected graph model and a discriminative model; it solves the problem of labeling bias and removes two unreasonable assumptions in HMM. Of course, the model is correspondingly complicated.

    CRF not only removes HMM's observation independence assumption but also solves MEMM's label bias problem. MEMM easily falls into local optima because it only performs local normalization, whereas CRF computes a global probability and normalizes globally, taking the global distribution of the data into account rather than just local normalization; this resolves MEMM's label bias and makes sequence-labeling decoding yield a globally optimal solution.

    The specific introduction of the model can be found in: Probabilistic Graphical Model of NLP .

    Principles of Conditional Random Fields: NLP Hardcore Primer - Conditional Random Fields CRF .
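
As a concrete companion to the HMM discussion in item 1 above, here is a minimal Viterbi decoding sketch that finds the most probable hidden state sequence for a toy two-state tagging problem (all probabilities are made-up numbers for illustration):

```python
import numpy as np

states = ["Noun", "Verb"]
observations = ["dogs", "run"]                 # toy sentence, one column of emit_p per word

start_p = np.array([0.6, 0.4])                 # P(first hidden state)
trans_p = np.array([[0.3, 0.7],                # P(next state | current state)
                    [0.8, 0.2]])
emit_p = np.array([[0.7, 0.1],                 # P(word | state), columns follow `observations`
                   [0.2, 0.6]])

n_states, T = len(states), len(observations)
delta = np.zeros((T, n_states))                # best path probability ending in each state
backptr = np.zeros((T, n_states), dtype=int)

delta[0] = start_p * emit_p[:, 0]
for t in range(1, T):
    for s in range(n_states):
        scores = delta[t - 1] * trans_p[:, s] * emit_p[s, t]
        backptr[t, s] = np.argmax(scores)
        delta[t, s] = np.max(scores)

# backtrack the most probable state sequence
best_last = int(np.argmax(delta[-1]))
path = [best_last]
for t in range(T - 1, 0, -1):
    path.append(int(backptr[t, path[-1]]))
path.reverse()

print([states[s] for s in path])               # ['Noun', 'Verb'] with these numbers
```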

(4) GloVe

The main source of this part: GloVe detailed explanation , understanding GloVe model .

GloVe's full name is Global Vectors for Word Representation; it is a word representation tool based on counts and global statistics. The core idea of the model is to train word vectors so that they encode the information contained in the co-occurrence matrix.

The implementation of GloVe is divided into the following three steps:

  • Construct a co-occurrence matrix $X$ from the corpus; each element $X_{ij}$ represents the number of times word $j$ appears in the context window of word $i$. Normally each such co-occurrence would add $1$ to the count, but GloVe does not do this: based on the distance $d$ between the two words within the window, it proposes a decaying weighting function $decay = \frac{1}{d}$ to compute the contribution, so the farther apart two words are, the less they add to the total count;

  • To relate the word vectors (Word Vectors) to the co-occurrence matrix (Co-occurrence Matrix), the authors of the paper propose the following approximation between the two: $w_i^T\bar{w}_j + b_i + \bar{b}_j = \log(X_{ij})$, where $w_i$ and $\bar{w}_j$ are the word vectors we ultimately want to solve for, and $b_i$ and $\bar{b}_j$ are the bias terms of the two word vectors, respectively.

  • With the above formula, we can construct the loss function: $J = \sum^V_{i,j=1} f(X_{ij})\,(w^T_i\bar{w}_j + b_i + \bar{b}_j - \log(X_{ij}))^2$, which is just a squared error with an added weighting function $f(X_{ij})$.

    • Why add this weighting function? In any corpus there are bound to be many word pairs that co-occur frequently (frequent co-occurrences), and we want:

      • The weight of these words is greater than those words that rarely occur together (rare co-occurrences), so the function should be non-decreasing;
      • But we don't want this weight to be too large (overweighted), and it should not increase after it reaches a certain level;
      • If two words never appear together, i.e. $X_{ij} = 0$, they should not participate in the loss computation at all, so $f(x)$ must satisfy $f(0) = 0$.
    • Many functions satisfy these conditions. The authors use a piecewise function of the following form (in all experiments in the paper, $\alpha = 0.75$ and $x_{max} = 100$); a small numerical sketch of this weighting follows the list below:

      • $f(x) = \begin{cases}(\frac{x}{x_{max}})^\alpha & \text{if } x < x_{max} \\ 1 & \text{otherwise}\end{cases}$
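
A minimal numerical sketch of the objective above: the weighting function $f(x)$ with $\alpha = 0.75$ and $x_{max} = 100$, and one evaluation of the loss on a tiny random co-occurrence matrix (the matrix, vectors, and biases are random stand-ins; training would minimize this loss by gradient descent):

```python
import numpy as np

alpha, x_max = 0.75, 100.0

def f(x):
    """GloVe weighting: grows as (x / x_max)^alpha, is capped at 1, and satisfies f(0) = 0."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

rng = np.random.default_rng(0)
V, dim = 5, 3                                          # toy vocabulary size and vector dimension
X = rng.integers(0, 50, size=(V, V)).astype(float)     # stand-in co-occurrence counts
W, W_bar = rng.normal(size=(V, dim)), rng.normal(size=(V, dim))
b, b_bar = rng.normal(size=V), rng.normal(size=V)

loss = 0.0
for i in range(V):
    for j in range(V):
        if X[i, j] > 0:                                # pairs that never co-occur contribute nothing
            diff = W[i] @ W_bar[j] + b[i] + b_bar[j] - np.log(X[i, j])
            loss += f(X[i, j]) * diff ** 2

print(loss)
```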

GloVe combines the advantages of both LSA and Word2Vec (introduced later). LSA is also a word representation tool based on a co-occurrence matrix; it uses SVD-based matrix factorization to reduce the dimensionality of large matrices, but SVD is computationally expensive and weights all words equally, shortcomings that GloVe overcomes. Word2Vec is computed from a local sliding window and exploits local context features, with the drawback that it does not make full use of the whole corpus. GloVe therefore merges these two kinds of information.

Overall, GloVe is easier to parallelize, so it is faster on larger training data. However, Word2Vec may perform slightly better and is more widely used.

4.2.4 Language Model Based on Deep Learning

At present, the language model technology based on deep learning has completely surpassed the traditional language model.


Origin blog.csdn.net/Morganfs/article/details/124335570