Understanding Natural Language Processing in One Article: How Word Representation Technology Has Evolved (from the bool model to BERT)


I. Background

    Natural language processing is about making computers understand human language. Whether computers so far truly understand human language is unknown; my own view is that they do not yet understand it, and simply look up the response with the highest probability. So what does the field of Natural Language Processing (NLP) include? Text classification (e.g. spam filtering, sentiment analysis), machine translation, summarization, syntactic parsing, word segmentation, POS tagging, named entity recognition (NER), speech recognition, and so on are all problems NLP tries to solve. Whether solving these problems means the computer truly understands the meaning of human language is also unknown, and is not what this article discusses. The basic unit of language is the word, so how does a computer represent a word, and what representation technology can let the computer grasp a word's meaning? This blog post discusses that in detail, from the bool model and the vector space model to the various word embeddings (word2vec, ELMo, GPT, BERT).

II. The Primitive Era

    Before deep learning, there was no standard way to represent a word; how it was represented depended on the task to be solved.

    1. Bool model

    Here are two sentences, for which we want to compute text similarity.

    I like Leslie

    You like Andy Lau

    The Boolean model is simple and crude: a dimension is 1 if the corresponding word appears, and 0 if it does not, as shown below:

   

    Then we can compute the cosine similarity of the two vectors.

    In the bool model, each feature can only take the two values 0 and 1, so it cannot reflect how important a term is within the text.
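    As a minimal sketch of this (the toy vocabulary and the tokenization are assumptions for illustration), the bool vectors and their cosine similarity can be computed like this:

    import java.util.*;

    public class BoolModelDemo {
        // 1 if the vocabulary word appears in the document, 0 otherwise
        static double[] toBoolVector(List<String> doc, List<String> vocab) {
            double[] v = new double[vocab.size()];
            for (int i = 0; i < vocab.size(); i++) {
                v[i] = doc.contains(vocab.get(i)) ? 1.0 : 0.0;
            }
            return v;
        }

        // cosine similarity = dot(a, b) / (|a| * |b|)
        static double cosine(double[] a, double[] b) {
            double dot = 0, na = 0, nb = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                na += a[i] * a[i];
                nb += b[i] * b[i];
            }
            return dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        public static void main(String[] args) {
            List<String> vocab = Arrays.asList("I", "you", "like", "Leslie", "Andy Lau");
            double[] d1 = toBoolVector(Arrays.asList("I", "like", "Leslie"), vocab);
            double[] d2 = toBoolVector(Arrays.asList("you", "like", "Andy Lau"), vocab);
            System.out.println("cosine = " + cosine(d1, d2));
        }
    }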

    2. VSM (vector space model)

    The bool model can actually be seen as a special case of the VSM: in the VSM, the value in each dimension is simply filled in according to some weighting rule. The VSM looks like this:

    

    Here t represents a term and d represents a document, so a document D can be represented as an N-dimensional vector D = {t1, t2, t3, ..., tN}. How is the weight w filled in? The usual practice is TF*IDF, where TF is the term frequency and IDF is the inverse document frequency, computed with the following formulas:

    TF(t) = (number of times the feature word t appears in the document) / (total number of words in the document)

    IDF(t) = log(N / (n + 1)), where N is the total number of documents in the collection and n is the number of documents that contain the feature word t

    Of course TF*IDF has its own flaws: it ignores both the intra-class and the inter-class distribution of terms. There are some improvements, such as TF*IDF*IG, where IG stands for information gain.
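    A minimal sketch of the two formulas above (the toy corpus and the method names are assumptions for illustration, not from the original post):

    import java.util.*;

    public class TfIdfDemo {
        // TF(t) = (occurrences of t in the document) / (total words in the document)
        static double tf(String term, List<String> doc) {
            long count = doc.stream().filter(term::equals).count();
            return (double) count / doc.size();
        }

        // IDF(t) = log(N / (n + 1)), N = total documents, n = documents containing t
        static double idf(String term, List<List<String>> corpus) {
            long n = corpus.stream().filter(d -> d.contains(term)).count();
            return Math.log((double) corpus.size() / (n + 1));
        }

        static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
            return tf(term, doc) * idf(term, corpus);
        }

        public static void main(String[] args) {
            List<List<String>> corpus = Arrays.asList(
                    Arrays.asList("I", "like", "Leslie"),
                    Arrays.asList("you", "like", "Andy", "Lau"));
            System.out.println("TF*IDF = " + tfIdf("Leslie", corpus.get(0), corpus));
        }
    }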

    These word/document representations are very mechanical: they capture neither the contextual relationships between words nor the similarity between words.

III. The Deep Learning Era

    First we must mention the language model. A language model estimates the probability that a sentence occurs; the larger the probability, the more reasonable the sentence.

    P(w1,w2,w3,...,wn) = P(w1) * P(w2|w1) * P(w3|w1,w2) ... P(wn|w1,w2,...,wn-1)

    The formula above usually cannot be estimated directly, so a Markov assumption is made: each word depends only on the single word immediately before it. The formula then reduces to:

    P(w1,w2,w3,...,wn) = P(w1) * P(w2|w1) * P(w3|w2) ... P(wn|wn-1)

    Of course, we can also assume that each word depends on the previous N words, which gives the commonly mentioned N-gram model. The language model plays a big role in ELMo and GPT.
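    As a minimal sketch of the bigram case under the Markov assumption (the maximum-likelihood counting and the class name are assumptions for illustration):

    import java.util.*;

    public class BigramLanguageModel {
        private final Map<String, Integer> unigramCounts = new HashMap<>();
        private final Map<String, Integer> bigramCounts = new HashMap<>();

        // Count unigrams and bigrams from tokenized sentences.
        void train(List<List<String>> sentences) {
            for (List<String> s : sentences) {
                for (int i = 0; i < s.size(); i++) {
                    unigramCounts.merge(s.get(i), 1, Integer::sum);
                    if (i > 0) {
                        bigramCounts.merge(s.get(i - 1) + " " + s.get(i), 1, Integer::sum);
                    }
                }
            }
        }

        // P(w1,...,wn) ~ P(w2|w1) * P(w3|w2) * ... * P(wn|wn-1), each factor estimated from counts.
        double probability(List<String> sentence) {
            double p = 1.0;
            for (int i = 1; i < sentence.size(); i++) {
                int pair = bigramCounts.getOrDefault(sentence.get(i - 1) + " " + sentence.get(i), 0);
                int prev = unigramCounts.getOrDefault(sentence.get(i - 1), 0);
                p *= (prev == 0) ? 0.0 : (double) pair / prev;
            }
            return p;
        }
    }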

    1. word2vec

    word2vec is in fact a neural network with a single hidden layer. The idea is very simple; see the figure below.

    

    In the figure above, Li Lei and Han Meimei are both followed by the words "in the classroom". When either Li Lei or Han Meimei is fed into the neural network, we want the network to output "in the classroom" with as high a probability as possible; as the network adjusts its weights, the two different words get mapped into the same space, and we can then say that Li Lei and Han Meimei are related. This is the idea behind word2vec. word2vec comes in two forms, CBOW and skip-gram: CBOW predicts a word from its context, and skip-gram predicts the context from a word, as shown below. In my own experience CBOW works slightly better.

    How is this implemented in code? You could implement the single-hidden-layer neural network yourself, with a softmax output layer, cross-entropy loss, and gradient descent. In fact there is no need for that much trouble: DL4J already provides a complete solution, and a few lines of code will do, as follows:

  // Assumed setup (not shown in the original post): iter is a SentenceIterator over the training corpus
  // (e.g. new BasicLineIterator("corpus.txt")) and t is a TokenizerFactory (e.g. new DefaultTokenizerFactory()).
  Word2Vec vec = new Word2Vec.Builder()
                .minWordFrequency(5)    // ignore words that appear fewer than 5 times
                .iterations(1)          // training iterations per minibatch
                .layerSize(100)         // dimensionality of the word vectors
                .seed(42)               // random seed for reproducibility
                .windowSize(5)          // context window size
                .iterate(iter)          // sentence iterator over the corpus
                .tokenizerFactory(t)    // tokenizer used to split sentences into words
                .build();

   vec.fit();
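   After training, the learned vectors can be queried, for example like this (the query words here are placeholders, not from the original post):

   Collection<String> nearest = vec.wordsNearest("classroom", 10);   // the 10 words closest in the vector space
   double sim = vec.similarity("teacher", "student");                // cosine similarity between two word vectors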

    2. ELMo

    ELMo takes its name from the initials of Embeddings from Language Models. Paper: https://arxiv.org/abs/1802.05365

    The embeddings come from a language model. Before talking about ELMo, let us first ask what is wrong with word2vec. word2vec can certainly express the semantic relationships between words, but it is completely static: all of a word's information is compressed into a single vector of fixed dimension. For polysemous words its expressiveness is therefore rather limited. Consider the following example.

    In one classical sentence, the Chinese character "信" is used as a verb, meaning "to stretch forth / uphold";

    In another, "信" is used as a noun, meaning "credit / trustworthiness".

    If "信" is compressed into a single 100-dimensional vector, it is hard to distinguish these two meanings. What is needed is a contextualized word embedding that encodes a word according to its context, and this is where ELMo comes in.

    The structure of ELMo is very simple: a bidirectional LSTM is trained as a language model, as shown below (the figure is taken from a National Taiwan University lecture ppt).

    

    The training process is very simple: the forward direction reads a word and predicts the next word, the backward direction reads a word and predicts the previous word, and training continues until convergence. The blue and orange vectors in the red box in the middle are the embedding vectors, and we finally take out the vector we want. Of course, this bi-LSTM can be stacked into many layers, and each layer yields its own embedding vector.

    

    So how are these vectors used in practice? It depends on the downstream task: for example, the embedding vectors from the layers can be summed, averaged, or combined with a weighted sum, and the weights can be trained together with the downstream task.
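    A minimal sketch of that weighted combination (softmax-normalized layer weights plus a task-specific scale, in the spirit of the ELMo paper; the method itself is an illustrative assumption, not ELMo's actual code):

    // Combine the per-layer embeddings of one token: sum_l softmax(w)_l * h_l, scaled by gamma.
    static double[] combineLayers(double[][] layerEmbeddings, double[] layerWeights, double gamma) {
        // softmax over the layer weights
        double[] w = new double[layerWeights.length];
        double sum = 0;
        for (int l = 0; l < w.length; l++) {
            w[l] = Math.exp(layerWeights[l]);
            sum += w[l];
        }
        // weighted sum of the layer vectors
        int dim = layerEmbeddings[0].length;
        double[] out = new double[dim];
        for (int l = 0; l < layerEmbeddings.length; l++) {
            for (int d = 0; d < dim; d++) {
                out[d] += (w[l] / sum) * layerEmbeddings[l][d];
            }
        }
        for (int d = 0; d < dim; d++) {
            out[d] *= gamma;   // task-specific scale, trained together with the downstream task
        }
        return out;
    }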

    3. GPT

    ELMo achieves dynamic encoding of words, but it uses LSTMs, and LSTMs cannot remember long-range information and are not well suited to parallel computation. GPT changes this with self-attention, and all of this is of course thanks to Google's paper "Attention Is All You Need": https://arxiv.org/pdf/1706.03762.pdf

    How does GPT work? In fact, it trains a language model with self-attention; see the figure below:

    

    Each word attends to the words before it and predicts the next word. For example, the model reads the start token BOS, attends over it, and predicts "the tide"; then it reads BOS and "the tide", attends over both, and predicts "recedes"; and so on until the end. Trained over a large corpus, this yields a very powerful language model that encodes words dynamically. When using it, the parameters of these attention layers can be frozen and downstream tasks trained on top: for a sentiment classification problem, for example, you can stack a few fully connected layers on top of the attention layers, fix the attention-layer parameters, train only the fully connected layers, and classify with a softmax or sigmoid.
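    A minimal sketch of the masked (causal) self-attention at the heart of this (a single head with no learned projection matrices; an illustrative assumption rather than GPT's actual code):

    // x: the sequence of token vectors (seqLen x dim), used here as queries, keys and values.
    // Returns attended outputs with a causal mask: position i only attends to positions j <= i.
    static double[][] causalSelfAttention(double[][] x) {
        int n = x.length, dim = x[0].length;
        double[][] out = new double[n][dim];
        for (int i = 0; i < n; i++) {
            // scores[j] = (x_i . x_j) / sqrt(dim) for j <= i
            double[] scores = new double[i + 1];
            double max = Double.NEGATIVE_INFINITY;
            for (int j = 0; j <= i; j++) {
                double dot = 0;
                for (int d = 0; d < dim; d++) dot += x[i][d] * x[j][d];
                scores[j] = dot / Math.sqrt(dim);
                max = Math.max(max, scores[j]);
            }
            // softmax over the allowed (earlier) positions
            double sum = 0;
            for (int j = 0; j <= i; j++) { scores[j] = Math.exp(scores[j] - max); sum += scores[j]; }
            // weighted sum of the value vectors
            for (int j = 0; j <= i; j++)
                for (int d = 0; d < dim; d++)
                    out[i][d] += (scores[j] / sum) * x[j][d];
        }
        return out;
    }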

    4. BERT (Bidirectional Encoder Representations from Transformers)

    GPT has one flaw: its encoding depends only on the preceding context and makes no use of the words that follow. BERT solves this problem nicely. BERT is in fact the encoder part of the Transformer, as shown in the figure below.

    

    BERT is trained in two ways, Masked LM and Next Sentence Prediction. Masked LM randomly masks some words and asks BERT to guess which words were covered. Next Sentence Prediction asks BERT to infer whether two sentences follow one another.
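    A minimal sketch of how Masked LM inputs could be prepared (the 15% masking rate comes from the BERT paper; the token strings and the simple all-[MASK] replacement are illustrative assumptions, not BERT's exact procedure):

    // Randomly replace ~15% of the tokens with [MASK]; the model is trained to predict the originals.
    static List<String> maskTokens(List<String> tokens, Map<Integer, String> labels, Random rng) {
        List<String> masked = new ArrayList<>(tokens);
        for (int i = 0; i < tokens.size(); i++) {
            if (rng.nextDouble() < 0.15) {
                labels.put(i, tokens.get(i));   // remember the original word as the prediction target
                masked.set(i, "[MASK]");
            }
        }
        return masked;
    }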

    BERT encodes a word with its full context taken into account, so it expresses both semantics and contextual relationships well, and it is far ahead in many benchmark competitions.

IV. Summary

    Natural language processing has moved from the original Boolean model and the vector space model to word2vec, then ELMo, then GPT, and then BERT; all along the way the technology has been replaced again and again. For now BERT is still the leading word embedding method, and for most natural language processing tasks it should be the first pre-training approach we try. Perhaps before long new techniques will appear and set new records; we will see. But even now, whether machines really understand human language is a question that remains to be demonstrated. The road ahead is long and far; we will keep searching high and low.


