Artificial Intelligence (AI) Full-Stack System (7)

Chapter 1 How Neural Networks are Implemented

Neural networks can process not only images but also text. Because image processing is more intuitive and easier to understand, the earlier chapters mostly used images as the running example.

7. Word vectors


  • Image processing feels more concrete because the basic element of an image is the pixel, and a pixel is already a number that can be processed directly. The basic element of text is the word, so before we can process text we must first decide how to represent words.

1. One-hot encoding


  • The simplest representation is called "one-hot" encoding.

2. One-hot encoding example

  • We illustrate the one-hot encoding method with an example. Suppose there is a sentence: I am studying at Tsinghua University and living in the beautiful Tsinghua Garden. We use the words that appear in this sentence to form a vocabulary list with a total of 8 words:
  • {I, at, Tsinghua University, study, live, beautiful, Tsinghua Garden, in}
  • The one-hot encoding method represents a word with a vector whose length equals the size of the vocabulary. Exactly one position of the vector is 1 and all other positions are 0. Which position is 1? That depends on where the word sits in the vocabulary: if the word is at the n-th position, then the n-th component of the vector is 1. This is where the name "one-hot encoding" comes from (a short code sketch follows this list).
  • For example, if the word "Tsinghua University" is in the third position of the vocabulary, the word can be expressed as:
    • "Tsinghua University" = [0, 0, 1, 0, 0, 0, 0, 0]
  • Similarly, "Tsinghua Garden" and "Beautiful" can be expressed as:
    • "Tsinghua Garden" = [0, 0, 0, 0, 0, 0, 1, 0]
    • "Beautiful" = [0, 0, 0, 0, 0, 1, 0, 0]
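
To make the example concrete, here is a minimal Python sketch of these one-hot vectors. The word order in `vocab` is an assumption made purely for illustration ("Tsinghua University" is placed at index 2, i.e. the 3rd position, to match the text).

```python
import numpy as np

# Vocabulary from the example sentence; the word order is an illustrative assumption.
vocab = ["I", "at", "Tsinghua University", "study", "live",
         "beautiful", "Tsinghua Garden", "in"]

def one_hot(word, vocab):
    """Return a vector as long as the vocabulary, with a single 1 at the word's index."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[vocab.index(word)] = 1.0
    return vec

print(one_hot("Tsinghua University", vocab))  # [0. 0. 1. 0. 0. 0. 0. 0.]
print(one_hot("Tsinghua Garden", vocab))      # [0. 0. 0. 0. 0. 0. 1. 0.]
print(one_hot("beautiful", vocab))            # [0. 0. 0. 0. 0. 1. 0. 0.]
```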

3. Characteristics of one-hot encoding

  • The advantage of this representation is its simplicity: a vocabulary is prepared in advance, and once the vocabulary is fixed, the representation of every word is fixed. But it has many shortcomings. Real text requires a vocabulary of at least 100,000 common words, so every word must be represented by a vector of length 100,000. It is also impossible to compute the similarity of two words from these vectors. In natural language processing, the Euclidean distance is often used to measure how similar (or synonymous) two words are: the smaller the distance, the more similar the words. Under one-hot encoding, however, every word has a 1 in exactly one position, and two different words necessarily have their 1 in different positions, so the Euclidean distance between any two distinct words is $\sqrt{2}$. For example, the Euclidean distance between "Tsinghua University" and "Tsinghua Garden" is:
    $\begin{Vmatrix} \text{"Tsinghua University"} - \text{"Tsinghua Garden"} \end{Vmatrix}_2$
    $= \sqrt{(0-0)^2+(0-0)^2+(1-0)^2+(0-0)^2+(0-0)^2+(0-0)^2+(0-1)^2+(0-0)^2}$
    $= \sqrt{2}$
  • The Euclidean distance between "Beautiful" and "Tsinghua Garden" is:
    $\begin{Vmatrix} \text{"Beautiful"} - \text{"Tsinghua Garden"} \end{Vmatrix}_2$
    $= \sqrt{(0-0)^2+(0-0)^2+(0-0)^2+(0-0)^2+(0-0)^2+(1-0)^2+(0-1)^2+(0-0)^2}$
    $= \sqrt{2}$
  • From a semantic point of view, it would be more reasonable for the distance between "Tsinghua University" and "Tsinghua Garden" to be smaller than the distance between "beautiful" and "Tsinghua Garden", but one-hot encoding cannot express this; the sketch below checks that all pairwise distances come out identical.
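
Continuing the earlier sketch (the `vocab` list and `one_hot` helper are repeated so the snippet runs on its own), this checks that every pair of distinct one-hot vectors is exactly $\sqrt{2}$ apart, so the encoding carries no similarity information:

```python
import itertools
import numpy as np

vocab = ["I", "at", "Tsinghua University", "study", "live",
         "beautiful", "Tsinghua Garden", "in"]

def one_hot(word, vocab):
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

# Every pair of distinct one-hot vectors is the same distance apart.
for w1, w2 in itertools.combinations(["Tsinghua University", "Tsinghua Garden", "beautiful"], 2):
    d = np.linalg.norm(one_hot(w1, vocab) - one_hot(w2, vocab))
    print(f"||{w1} - {w2}||_2 = {d:.4f}")  # always 1.4142, i.e. sqrt(2)
```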

4. Distributed representation of words

  • In order to overcome the shortcomings of one-hot encoding, researchers proposed a "dense" vector representation. A word is still represented by a vector, but instead of a single position being 1 and the rest 0, every component of the vector takes a specific value, and these values jointly represent the word. Because every component contributes to the representation, the vector no longer needs to be as long as the vocabulary; a length of a few hundred is usually enough. Moreover, the distance between vectors can now be used to measure the semantic similarity of two words (a small lookup sketch follows at the end of this section).


  • This dense representation is generally obtained through training.
  • Let’s start with the neural network language model.
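
As a minimal sketch of what a dense representation looks like in practice: each word maps to a short real-valued vector, and distances between those vectors become meaningful. The embedding matrix here is random, standing in for one obtained by training, and the length `m = 8` is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 8                                    # embedding length (a few hundred in practice)
vocab = ["I", "at", "Tsinghua University", "study", "live",
         "beautiful", "Tsinghua Garden", "in"]

# One dense row per word; in a real model these rows are learned, not random.
C = rng.normal(size=(len(vocab), m)).astype(np.float32)

def embed(word):
    """Look up the dense vector for a word."""
    return C[vocab.index(word)]

v1, v2 = embed("Tsinghua University"), embed("Tsinghua Garden")
print(np.linalg.norm(v1 - v2))           # a real-valued distance, no longer fixed at sqrt(2)
```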

5. Language model


6. Neural network language model

  • Simply put, a language model predicts, given the first n-1 words of a sentence, the probability of each candidate for the n-th word. For example, given that the first four words are "Tsinghua University", "computer", "science", and "and", what might the fifth word be? The fifth word is most likely "technology", because the sentence probably means "Tsinghua University Computer Science and Technology". The fifth word could also plausibly be "engineering", because "Tsinghua University Computer Science and Engineering" also reads smoothly. But "Tsinghua University Computer Science and cabbage", although grammatically unobjectionable, is very unlikely, because "computer science" and "cabbage" are rarely juxtaposed, so the probability that the fifth word is "cabbage" is very small. A language model evaluates how much a sentence sounds like natural human speech: a natural-sounding sentence gets a high probability, an unnatural one a low probability, possibly even 0. When the language model is implemented with a neural network, it is called a neural network language model. A toy illustration of this idea follows below.

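A toy illustration of the idea, with made-up probabilities that are not taken from any trained model:

```python
# Hypothetical next-word probabilities for one fixed context, for illustration only.
context = ("Tsinghua University", "computer", "science", "and")
next_word_prob = {"technology": 0.55, "engineering": 0.30, "cabbage": 1e-6}

# A language model scores every candidate continuation of the context.
for word, p in next_word_prob.items():
    print(f"p({word} | {' '.join(context)}) = {p}")
```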

  • The first n-1 words mentioned here do not have to be counted from the beginning of a sentence; they can start from any position. What matters is simply the n-1 words immediately before the current word, wherever that word is. If fewer than n-1 words precede it, we use as many as there are. For example, if the current word is at position t, the n-1 words before it are $w_{t-n+1} w_{t-n+2} \cdots w_{t-2} w_{t-1}$. These n-1 words are called the "context" of $w_t$ and are written $context(w_t)$, where $w_i$ denotes a word and n is called the window size, meaning that only the n words inside the window are considered (see the small sketch below).
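
A minimal sketch of extracting $context(w_t)$ from a tokenized sentence, assuming a window size n (the tokens here are illustrative):

```python
def context(tokens, t, n):
    """Return the up to (n-1) words immediately before position t."""
    start = max(0, t - (n - 1))
    return tokens[start:t]

tokens = ["Tsinghua University", "computer", "science", "and", "technology"]
print(context(tokens, 4, 5))  # ['Tsinghua University', 'computer', 'science', 'and']
print(context(tokens, 1, 5))  # fewer than n-1 words available: ['Tsinghua University']
```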

  • The figure below shows a schematic diagram of the most common neural network language model implemented using a fully connected neural network.
    (Figure: schematic of a fully connected neural network language model)

  • The language model shown in the figure is a fully connected neural network. Unlike an ordinary fully connected network, its input layer is divided into n-1 groups of m inputs each, for a total of (n-1)m inputs. Each group of m values forms a vector corresponding to one word in the context of $w_t$; this vector is denoted $C(w_{t-l})\ (l=1,2,...,n-1)$. All the $C(w_{t-l})$ are concatenated into one vector of length (n-1)m, written $x=[x_1,x_2,...,x_{(n-1)m}]$. If the grouping is ignored, the input layer is the same as in an ordinary fully connected neural network, with x as the input.

  • Why group the input?

    • The vector formed by each group of inputs corresponds to one word of the current context. When the context changes, the vectors of the words that make up the new context must be looked up in a table and placed at the corresponding positions of the input layer. For this reason, a vocabulary must be fixed before the neural network language model is built. This vocabulary is usually large, covering all possible words, typically hundreds of thousands of them. Each word corresponds to a vector of length m, and a fixed association between the word and its vector is maintained so that the vector can easily be retrieved when needed.
  • How the vector of length m is obtained is not important for the moment; for now it is enough to know that each word corresponds to one vector. We will come back to how this vector is obtained later.

  • Next, look at the hidden layer in the figure. There is nothing special about it: it is an ordinary hidden layer with H neurons, each connected to all neurons of the input layer. The weight $u_{h,j}$ denotes the connection weight from the j-th input of the input layer to the h-th neuron of the hidden layer. Each hidden neuron is followed by a hyperbolic tangent (tanh) activation function, whose value is the neuron's output. The outputs of all hidden neurons form the vector $z = [z_1, z_2, ..., z_H]$, and the output of the h-th neuron is:
    $z_h = \tanh(u_{h,1}x_1 + u_{h,2}x_2 + \cdots + u_{h,(n-1)m}x_{(n-1)m} + p_h)$

  • where $p_h$ is the bias of the h-th neuron.

  • The output layer has as many neurons as there are words in the vocabulary: one neuron per word. The outputs pass through a softmax activation function, and the output value of each neuron represents the probability that, in the current context, the next word is the word corresponding to that neuron. For example, suppose the 3rd neuron of the output layer corresponds to the word "technology" and the 5th corresponds to "engineering". When the context is "Tsinghua University computer science and", the output of the 3rd neuron is the probability that the next word is "technology", and the output of the 5th neuron is the probability that the next word is "engineering".

  • The hidden layer is also fully connected to the output layer: each output neuron is connected to every hidden neuron, and the weight $v_{k,h}$ denotes the connection weight from the h-th hidden neuron to the k-th output neuron. To obtain probabilities at the output layer, a softmax activation function is applied at the end. Let the pre-activation outputs of all output neurons form the vector $y = [y_1, y_2, \cdots, y_K]$, where K is the size of the vocabulary; then the pre-activation output of the k-th neuron is:
    $y_k = v_{k,1}z_1 + v_{k,2}z_2 + \cdots + v_{k,H}z_H + q_k$

  • where $q_k$ is the bias of the k-th neuron in the output layer.

  • After adding the softmax activation function, the output of the k-th neuron in the output layer is:
    $p(w=k \mid context(w)) = \frac{e^{y_k}}{\sum_{i=1}^{K} e^{y_i}}$

  • It represents the probability that the word w corresponding to the k-th output neuron appears after the current context. A numpy sketch of the full forward pass follows below.
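
Putting the pieces together, here is a minimal numpy sketch of one forward pass through this architecture. The sizes and random parameters are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
K, m, H, n = 1000, 50, 128, 5          # vocab size, embedding length, hidden size, window

C = rng.normal(0, 0.1, size=(K, m))            # word-vector table, one row per word
U = rng.normal(0, 0.1, size=(H, (n - 1) * m))  # input -> hidden weights u_{h,j}
p = np.zeros(H)                                # hidden biases p_h
V = rng.normal(0, 0.1, size=(K, H))            # hidden -> output weights v_{k,h}
q = np.zeros(K)                                # output biases q_k

def forward(context_ids):
    """context_ids: indices of the n-1 context words; returns p(w=k | context) for all k."""
    x = np.concatenate([C[i] for i in context_ids])   # length (n-1)m input vector
    z = np.tanh(U @ x + p)                            # hidden layer
    y = V @ z + q                                     # pre-softmax scores y_k
    e = np.exp(y - y.max())                           # softmax (shifted for numerical stability)
    return e / e.sum()

probs = forward([3, 17, 42, 99])    # arbitrary context word indices
print(probs.shape, probs.sum())     # (1000,) 1.0
```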

  • How to determine which neuron in the output layer corresponds to which word?

    • This is fixed by hand in advance. It does not matter which neuron corresponds to which word, as long as each neuron corresponds to a unique word.
  • So how do we train this neural network language model?

7. How to train a neural network language model?

7.1 Training samples
  • To train the model we need training samples. For a language model, a sample is a string of n consecutive words: the first n-1 words are the context and the n-th word plays the role of the label. We can collect a large amount of text to form a training corpus, and every string of n consecutive words in the corpus constitutes one training sample. For example, suppose the corpus contains the sentence "Department of Computer Science and Technology, Tsinghua University" and the window size is 5; then every run of 5 consecutive words obtained by sliding through the sentence is a training sample (a sliding-window sketch follows below).

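A minimal sketch of turning a tokenized corpus into (context, target) training pairs with a sliding window. The naive whitespace split used here is purely for illustration.

```python
def make_samples(tokens, n):
    """Slide a window of n words over the corpus: the first n-1 words are the context, the n-th is the target."""
    samples = []
    for t in range(n - 1, len(tokens)):
        samples.append((tokens[t - n + 1:t], tokens[t]))
    return samples

corpus = "Tsinghua University computer science and technology department".split()
for ctx, target in make_samples(corpus, 5):
    print(ctx, "->", target)
# ['Tsinghua', 'University', 'computer', 'science'] -> and
# ['University', 'computer', 'science', 'and'] -> technology
# ['computer', 'science', 'and', 'technology'] -> department
```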

7.2 Loss function
  • After obtaining training samples, we also need to define a loss function. Let us start with an example. Suppose the corpus contains only three sentences: "computer science", "computer science", and "computer engineering", and the window size is 2. From this corpus we want to estimate two probabilities, p(science | computer) and p(engineering | computer): the probability that the next word is "science", or "engineering", when the previous word is "computer". What values are reasonable for these two probabilities? The three sentences can be regarded as three samples. We assume the three samples occur independently, so their joint probability is the product of their individual probabilities, that is:
    $p(\text{"computer science"}, \text{"computer science"}, \text{"computer engineering"})$
    $= p(\text{science} \mid \text{computer}) \cdot p(\text{science} \mid \text{computer}) \cdot p(\text{engineering} \mid \text{computer})$
    $= p(\text{science} \mid \text{computer})^2 \cdot p(\text{engineering} \mid \text{computer})$


  • Since in this example only "science" and "engineering" can follow "computer", their probabilities must sum to 1, that is:
    $p(\text{science} \mid \text{computer}) + p(\text{engineering} \mid \text{computer}) = 1$
  • So:
    $p(\text{"computer science"}, \text{"computer science"}, \text{"computer engineering"})$
    $= p(\text{science} \mid \text{computer})^2 \, (1 - p(\text{science} \mid \text{computer}))$
  • For different probability values, $p(\text{"computer science"}, \text{"computer science"}, \text{"computer engineering"})$ takes different values. For example, when $p(\text{science} \mid \text{computer}) = 0.5$:
    $p(\text{"computer science"}, \text{"computer science"}, \text{"computer engineering"}) = 0.5^2 \cdot (1-0.5) = 0.125$
  • And when $p(\text{science} \mid \text{computer}) = 0.6$:
    $p(\text{"computer science"}, \text{"computer science"}, \text{"computer engineering"}) = 0.6^2 \cdot (1-0.6) = 0.144$
7.3 Reasonable probability
  • What values of these probabilities are reasonable?
    • At the moment we only have the three sentences provided by the corpus, so we can only estimate from them. Since these three samples did occur together, we should accept that fact and maximize their joint probability; the estimation principle is therefore to choose the value of p(science | computer) that makes the joint probability as large as possible.
    • To find the maximum, set the derivative of the joint probability with respect to p(science | computer) to 0 and solve.

The derivative of $p(\text{"computer science"}, \text{"computer science"}, \text{"computer engineering"})$ with respect to $p(\text{science} \mid \text{computer})$ is
$\frac{d}{dp}\left[p^2(1-p)\right] = 2p - 3p^2, \quad \text{where } p = p(\text{science} \mid \text{computer})$
Setting $2p - 3p^2 = p(2 - 3p) = 0$ and noting that $p = 0$ cannot maximize the product, we have
$2 - 3\,p(\text{science} \mid \text{computer}) = 0$

  • So:
    $p(\text{science} \mid \text{computer}) = \frac{2}{3}$
  • Since:
    $p(\text{science} \mid \text{computer}) + p(\text{engineering} \mid \text{computer}) = 1$
  • So:
    $p(\text{engineering} \mid \text{computer}) = \frac{1}{3}$
  • Is this result consistent with our intuition? It is: "science" follows "computer" in two of the three sentences, so an estimate of 2/3 is exactly what counting would suggest. A numerical check follows below.
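
As a quick numerical check of this maximum likelihood result, the sketch below evaluates the joint probability $p^2(1-p)$ over a grid of candidate values and confirms the maximum sits at $p \approx 2/3$:

```python
import numpy as np

p = np.linspace(0.0, 1.0, 1001)   # candidate values of p(science | computer)
joint = p**2 * (1 - p)            # joint probability of the three sentences

best = p[np.argmax(joint)]
print(best, joint.max())          # ~0.667 and ~0.148 (= 4/27), i.e. the maximum is at p = 2/3
```
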
7.4 Maximum likelihood estimation
  • Estimating probabilities by maximizing the joint probability is called maximum likelihood estimation. In general, however, the probability values are not estimated directly: the joint probability distribution is usually a function with parameters, and it is these parameters that the maximum likelihood method estimates. In the neural network language model the probability is represented by the neural network, so the parameters of the network are what we estimate. Following the model introduced earlier, take any word w in the corpus and assume the window size is n. The position of w in the corpus determines its context context(w), namely the n-1 words preceding w. Feeding context(w) into the model produces an output value at the position k of the output layer that corresponds to w, representing the probability that, given this context, the next word is w. According to maximum likelihood estimation, we want the product of these probabilities over all words of the corpus to be as large as possible, that is:
    $\max_{\theta} \prod_{w \in C} p(w=k \mid context(w), \theta)$


  • Here $\theta$ denotes all parameters of the neural network, C denotes the corpus, and the symbol $\Pi$ denotes the product of these factors. The expression $\prod_{w \in C} p(w=k \mid context(w), \theta)$ is called the likelihood function. Our goal in training the neural network language model is therefore to determine the parameters $\theta$ that maximize the likelihood function on the given training set.

  • Neural networks are usually trained with the BP algorithm by minimizing a loss function, whereas here we need to maximize, so a transformation is required to turn the maximization problem into a minimization problem. For convenience of computation we first take the logarithm of the likelihood function, which turns the product into a sum.

$\max_{\theta} \prod_{w \in C} p(w=k \mid context(w), \theta)$

  • Taking the logarithm, this becomes:
    $\max_{\theta} \log \prod_{w \in C} p(w=k \mid context(w), \theta)$
    $= \max_{\theta} \sum_{w \in C} \log p(w=k \mid context(w), \theta)$
  • If we put a minus sign in front of the above expression, the maximization becomes a minimization problem:
    $\min_{\theta} \left( - \sum_{w \in C} \log p(w=k \mid context(w), \theta) \right)$


  • In this way, we can use the following formula as the loss function, and then use the BP algorithm to solve it.
    $L(\theta) = - \sum_{w \in C} \log p(w=k \mid context(w), \theta)$
  • The quantity $- \sum_{w \in C} \log p(w=k \mid context(w), \theta)$ is called the negative log-likelihood function.
  • With this loss function, training the neural network language model is no different from training an ordinary fully connected neural network; a compact training sketch follows.
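
Putting the loss together with the architecture, here is a hedged PyTorch sketch of one training step. The layer sizes, optimizer choice, and random data are illustrative assumptions; `nn.CrossEntropyLoss` applied to the pre-softmax scores is the negative log-likelihood described above.

```python
import torch
import torch.nn as nn

K, m, H, n = 1000, 50, 128, 5                      # vocab, embedding, hidden, window sizes

class NNLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.C = nn.Embedding(K, m)                # word-vector table C(w)
        self.hidden = nn.Linear((n - 1) * m, H)    # weights u_{h,j} and biases p_h
        self.out = nn.Linear(H, K)                 # weights v_{k,h} and biases q_k

    def forward(self, context_ids):                # context_ids: (batch, n-1) word indices
        x = self.C(context_ids).flatten(1)         # concatenate the n-1 word vectors
        z = torch.tanh(self.hidden(x))             # hidden layer
        return self.out(z)                         # pre-softmax scores y_k

model = NNLM()
loss_fn = nn.CrossEntropyLoss()                    # negative log-likelihood over the softmax
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# One illustrative step on random (context, next-word) pairs standing in for real samples.
contexts = torch.randint(0, K, (32, n - 1))
targets = torch.randint(0, K, (32,))
loss = loss_fn(model(contexts), targets)
opt.zero_grad()
loss.backward()                                    # BP algorithm
opt.step()
print(loss.item())
```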


Origin blog.csdn.net/sgsgkxkx/article/details/133323843