13. Deep Learning and Natural Language Processing -- HanLP

Notes reproduced in the GitHub project: https://github.com/NLP-LOVE/Introduction-NLP

13. Deep Learning and Natural Language Processing

13.1 Limitations of Traditional Methods

Previous chapters have introduced traditional machine learning models such as hidden Markov models, perceptrons, conditional random fields, Naive Bayes, and support vector machines. To apply these models to NLP, we also needed feature extraction methods such as feature templates, TF-IDF, and bag-of-words vectors. The performance of these methods is limited in the following ways:

  1. Sparse data

    First, traditional machine learning methods are not good at handling data sparsity, a problem that is especially prominent in natural language processing. Language is a discrete symbol system: every character and every word is a discrete random variable. We usually turn text into a vector representation with one-hot encoding, i.e. a binary vector in which exactly one element is 1 and all other elements are 0. For example:

    Feature "motherland": ["China", "United States", "France"] (N = 3)

    China => 100

    United States => 010

    France => 001

    The "motherland" feature above has only three values; what if it had hundreds of thousands? The vectors would then be almost entirely zeros, which is exactly what data sparsity means.
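
    To make this concrete, here is a minimal one-hot encoding sketch; the three-word vocabulary and the use of NumPy are illustrative assumptions, not part of HanLP:

    import numpy as np

    # Toy vocabulary for the "motherland" feature above (illustrative only).
    vocab = ["中国", "美国", "法国"]
    index = {word: i for i, word in enumerate(vocab)}

    def one_hot(word):
        # Binary vector with a single 1 at the word's index and 0 everywhere else.
        v = np.zeros(len(vocab), dtype=np.int8)
        v[index[word]] = 1
        return v

    print(one_hot("中国"))  # [1 0 0]
    # With a realistic vocabulary of hundreds of thousands of words, each vector
    # would hold a single 1 among hundreds of thousands of 0s: data sparsity.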

  2. Feature template

    Language is highly complex. In Chinese, radicals make up characters, characters make up words, words make up phrases, phrases make up sentences, sentences make up paragraphs, and paragraphs make up articles. As the granularity grows, the meaning accumulates layer by layer and the expression becomes more and more complex.

    This characteristic gives feature templates the same data-sparsity problem: a particular word may be common, yet a particular combination of two words is much rarer, and a combination of three words rarer still. Many features appear only once in the training set, and a feature that appears only once is statistically meaningless.

  3. Error propagation

    Real-world projects often combine several natural language processing modules into a pipeline. In sentiment analysis, for example, the text is first segmented into words, then tagged with parts of speech, some words are filtered out according to their part of speech, and the result is finally fed into a Naive Bayes or support vector machine classifier for prediction.

    This pipelined approach suffers severely from error propagation: an error produced by an earlier module is passed as input to the next module, where it causes even larger errors, making the whole system fragile.

13.2 Deep Learning and Its Advantages

In order to solve the problems that traditional machine learning faces in natural language processing, such as data sparsity, hand-crafted feature templates, and error propagation, people turned their attention to another trend in machine learning research: deep learning.

  1. Deep learning

    Deep learning (DL) belongs to the category of representation learning: a learning paradigm that uses models with a certain "depth" to automatically learn vector representations of things. At present, the models used in deep learning are mainly neural networks with several layers or more. If, in traditional machine learning, the vector representation of things is a sparse binary vector extracted with hand-crafted feature templates, then in deep learning the feature template is replaced by a multilayer perceptron (MLP). Once the problem is expressed as a vector, the classifier that follows can be a single-layer perceptron or another familiar model, and at that point deep learning and traditional methods do the same thing. So deep learning is not mysterious; its essence is extracting vector representations with an MLP.

    The principles of deep learning have already been described in detail in an earlier blog post; see:

    http://mantchs.com/2019/08/04/DL/Neural%20Network/

  2. Dense vectors solve data sparsity

    The output of a neural network's hidden layer is a feature vector h of the sample x. Since we are free to choose the size of the hidden layer, we can also control the dimensionality of h. Even if the input layer is a one-hot vector whose dimensionality equals the vocabulary size, up to hundreds of thousands, the resulting hidden-layer feature vector can still be kept small, for example 100 dimensions.

    Such a 100-dimensional vector is an abstract representation of a word or of any other sample, containing highly condensed information. Because these low-dimensional vectors live in the same space, we can easily train a classifier to learn the similarity between words, between documents, or between pictures, and even to learn the similarity between a picture and a document. All of this is brought by representation learning and is difficult to achieve with traditional machine learning methods.
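
    The following minimal sketch, with made-up sizes and random weights standing in for a trained network, shows how a hidden layer can project a vocabulary-sized one-hot vector down to a 100-dimensional dense vector whose similarity to other vectors is easy to compute:

    import numpy as np

    rng = np.random.default_rng(42)
    vocab_size, hidden_size = 100_000, 100                      # illustrative sizes

    W = rng.normal(scale=0.01, size=(hidden_size, vocab_size))  # hidden-layer weights

    x = np.zeros(vocab_size)
    x[12345] = 1                                    # one-hot vector of some word

    h = np.tanh(W @ x)                              # dense 100-dimensional representation
    print(h.shape)                                  # (100,)

    # Dense vectors in a common space make similarity a simple cosine:
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))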

  3. Multilayer networks automatically extract feature representations

    The layers of a neural network are usually fully connected (fully connected layers), so the connections do not have to be designed for each specific problem. The weight matrices of the hidden layers are adjusted automatically by gradient descent on the loss function, so the MLP learns the hidden-layer feature representation on its own.

    This process requires no human intervention at all; in other words, deep learning in theory does away with feature templates.

  4. End-to-end design

    Since the "language of communication" between the layers of a neural network, and between different neural networks, is vectors, deep learning engineers can easily combine several neural networks into an end-to-end design. For example, in the sentiment analysis case mentioned earlier, one of the simplest solutions is to feed only the one-hot vectors of every character in a document into a neural network to obtain a feature vector of the whole document, and then feed that feature vector into a multinomial logistic regression classifier to classify the sentiment polarity of the document.

    The whole process needs neither Chinese word segmentation nor stop-word filtering. Because the neural network reads the whole article character by character, imitating the way a human reads, it has access to all of the input.
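
    Here is a minimal sketch of that end-to-end idea, with made-up sizes and random weights standing in for a trained network: character representations are averaged into a document vector, which a multinomial logistic regression layer then classifies.

    import numpy as np

    rng = np.random.default_rng(0)
    n_chars, hidden, n_classes = 5000, 100, 3        # character vocabulary, feature size, polarities

    W_embed = rng.normal(scale=0.01, size=(hidden, n_chars))   # character representations
    W_out = rng.normal(scale=0.01, size=(n_classes, hidden))   # output layer

    doc_char_ids = [17, 402, 4031, 99, 250]          # a document as a sequence of character ids

    # Encode: average the representations of all characters in the document.
    h = np.mean([W_embed[:, c] for c in doc_char_ids], axis=0)

    # Classify: softmax over sentiment polarities.
    scores = W_out @ h
    p = np.exp(scores - scores.max())
    p /= p.sum()
    print(p.argmax())                                # predicted polarity id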

13.3 word2vec

As the bridge connecting traditional machine learning and deep learning, word vectors have always been the first stop for getting started with deep learning. There are many methods for training word vectors; word2vec is one of the most famous, and there are also fastText, GloVe, BERT, the recently very popular XLNet, and so on.

  1. The principles of word2vec have already been described in detail in my blog; see:

    http://mantchs.com/2019/08/22/NLP/Word%20Embeddings/

  2. Training word vectors

    Having understood the basic principles of word vectors, this section shows how to invoke the word-vector module implemented in HanLP. The module accepts training data as plain text with words separated by spaces; here the MSR corpus is used as an example. The training code is as follows (the corpus is downloaded automatically):

    from pyhanlp import *
    import zipfile
    import os
    from pyhanlp.static import download, remove_file, HANLP_DATA_PATH
    
    def test_data_path():
        """
        Get the test data path, located at $root/data/test; the root directory is specified by the configuration file.
        :return:
        """
        data_path = os.path.join(HANLP_DATA_PATH, 'test')
        if not os.path.isdir(data_path):
            os.mkdir(data_path)
        return data_path
    
    
    
    ## Check whether the corpus exists; download it automatically if it does not
    def ensure_data(data_name, data_url):
        root_path = test_data_path()
        dest_path = os.path.join(root_path, data_name)
        if os.path.exists(dest_path):
            return dest_path
    
        if data_url.endswith('.zip'):
            dest_path += '.zip'
        download(data_url, dest_path)
        if data_url.endswith('.zip'):
            with zipfile.ZipFile(dest_path, "r") as archive:
                archive.extractall(root_path)
            remove_file(dest_path)
            dest_path = dest_path[:-len('.zip')]
        return dest_path
    
    
    sighan05 = ensure_data('icwb2-data', 'http://sighan.cs.uchicago.edu/bakeoff2005/data/icwb2-data.zip')
    msr_train = os.path.join(sighan05, 'training', 'msr_training.utf8')
    ## ===============================================
    ## word2vec starts here
    
    
    IOUtil = JClass('com.hankcs.hanlp.corpus.io.IOUtil')
    DocVectorModel = JClass('com.hankcs.hanlp.mining.word2vec.DocVectorModel')
    Word2VecTrainer = JClass('com.hankcs.hanlp.mining.word2vec.Word2VecTrainer')
    WordVectorModel = JClass('com.hankcs.hanlp.mining.word2vec.WordVectorModel')
    
    # Demonstrate training and applying word vectors
    TRAIN_FILE_NAME = msr_train
    MODEL_FILE_NAME = os.path.join(test_data_path(), "word2vec.txt")
    
    def train_or_load_model():
        if not IOUtil.isFileExisted(MODEL_FILE_NAME):
            if not IOUtil.isFileExisted(TRAIN_FILE_NAME):
                raise RuntimeError("语料不存在,请阅读文档了解语料获取与格式:https://github.com/hankcs/HanLP/wiki/word2vec")
            trainerBuilder = Word2VecTrainer()
            return trainerBuilder.train(TRAIN_FILE_NAME, MODEL_FILE_NAME)
        return load_model()
    
    
    def load_model():
        return WordVectorModel(MODEL_FILE_NAME)
    
    
    wordVectorModel = train_or_load_model()  # train or load the word2vec model
  3. Word semantic similarity

    Once you have word vectors, the most basic application is to find the N words whose meanings are most similar to a given word.

    # Print the words semantically most similar to a given word, with cosine similarity
    def print_nearest(word, model):
        print(
            "\n                                                Word     "
            "Cosine\n------------------------------------------------------------------------")
        for entry in model.nearest(word):
            print("%50s\t\t%f" % (entry.getKey(), entry.getValue()))
    
    print_nearest("上海", wordVectorModel)
    print_nearest("美丽", wordVectorModel)
    print_nearest("购买", wordVectorModel)
    print(wordVectorModel.similarity("上海", "广州"))

    The results are as follows:

                                                    Word     Cosine
    ------------------------------------------------------------------------
                                                    广州       0.616240
                                                    天津       0.564681
                                                    西安       0.500929
                                                    抚顺       0.456107
                                                    深圳       0.454190
                                                    浙江       0.446069
                                                    杭州       0.434974
                                                    江苏       0.429291
                                                    广东       0.407300
                                                    南京       0.404509
    
                                                    Word     Cosine
    ------------------------------------------------------------------------
                                                    装点       0.652887
                                                    迷人       0.648911
                                                    恬静       0.634712
                                                    绚丽       0.634530
                                                    憧憬       0.616118
                                                    葱翠       0.612149
                                                    宁静       0.599068
                                                    清新       0.592581
                                                    纯真       0.589360
                                                    景色       0.585169
    
                                                    Word     Cosine
    ------------------------------------------------------------------------
                                                     购       0.521070
                                                    购得       0.500480
                                                    选购       0.483097
                                                    购置       0.480335
                                                    采购       0.469803
                                                    出售       0.469185
                                                   低收入       0.461131
                                                  分期付款       0.458573
                                                    代销       0.456689
                                                    高价       0.456320
    0.6162400245666504

    The Cosine column is the cosine similarity between the two words, a value between -1 and 1.

  4. Word analogy

    Subtracting one word vector from another produces a new vector. Taking the dot product of this difference vector with a third word's vector tells us how similar that word is to the difference between the two words. A common English example is king - man + woman = queen; that is, some dimensions of the word vectors may encode information related to royalty, while others may encode gender.

    
    # param A: the word to add
    # param B: the word to subtract
    # param C: the word to add
    # return: the words semantically closest to (A - B + C), with their similarities
    print(wordVectorModel.analogy("日本", "自民党", "共和党"))

    The results are as follows:

    [美国=0.71801066, 德米雷尔=0.6803682, 美国国会=0.65392816, 布什=0.6503047, 华尔街日报=0.62903535, 国务卿=0.6280117, 舆论界=0.6277531, 白宫=0.6175594, 驳斥=0.6155998, 最惠国待遇=0.6062231]
  5. Short text similarity

    By averaging the word vectors of all the words in a short text, we can represent the short text as a dense vector, and then measure the similarity between any two short texts.

    # Document vectors
    docVectorModel = DocVectorModel(wordVectorModel)
    documents = ["山东苹果丰收",
                 "农民在江苏种水稻",
                 "奥运会女排夺冠",
                 "世界锦标赛胜出",
                 "中国足球失败", ]
    print(docVectorModel.similarity("山东苹果丰收", "农民在江苏种水稻"))
    print(docVectorModel.similarity("山东苹果丰收", "世界锦标赛胜出"))
    print(docVectorModel.similarity(documents[0], documents[1]))
    print(docVectorModel.similarity(documents[0], documents[4]))

    The results are as follows:

    0.6743720769882202
    0.018603254109621048
    0.6743720769882202
    -0.11777809262275696

    Similarly, we can call the nearest interface to query the documents most similar to a given keyword:

    def print_nearest_document(document, documents, model):
        print_header(document)
        for entry in model.nearest(document):
            print("%50s\t\t%f" % (documents[entry.getKey()], entry.getValue()))
    
    
    def print_header(query):
        print(
            "\n%50s          Cosine\n------------------------------------------------------------------------" % (query))
    
    
    for i, d in enumerate(documents):
        docVectorModel.addDocument(i, documents[i])
    
    print_nearest_document("体育", documents, docVectorModel)
    print_nearest_document("农业", documents, docVectorModel)
    print_nearest_document("我要看比赛", documents, docVectorModel)
    print_nearest_document("要不做饭吧", documents, docVectorModel)

    The results are as follows:

                                                   体育          Cosine
    ------------------------------------------------------------------------
                                               世界锦标赛胜出       0.256444
                                               奥运会女排夺冠       0.206812
                                                中国足球失败       0.165934
                                                山东苹果丰收       -0.037693
                                              农民在江苏种水稻       -0.047260
    
                                                    农业          Cosine
    ------------------------------------------------------------------------
                                              农民在江苏种水稻       0.393115
                                                山东苹果丰收       0.259620
                                                中国足球失败       -0.008700
                                               世界锦标赛胜出       -0.063113
                                               奥运会女排夺冠       -0.137968
    
                                                 我要看比赛          Cosine
    ------------------------------------------------------------------------
                                               奥运会女排夺冠       0.531833
                                               世界锦标赛胜出       0.357246
                                                中国足球失败       0.268507
                                                山东苹果丰收       0.000207
                                              农民在江苏种水稻       -0.022467
    
                                                 要不做饭吧          Cosine
    ------------------------------------------------------------------------
                                              农民在江苏种水稻       0.232754
                                                山东苹果丰收       0.199197
                                               奥运会女排夺冠       -0.166378
                                               世界锦标赛胜出       -0.179484
                                                中国足球失败       -0.229308

13.4 High-performance Dependency Parser Based on Neural Networks

  1. Arc-Standard transition system

    Different from the Arc-Eager system introduced earlier, this parser is based on the Arc-Standard transition system for dependency grammar, whose actions are as follows:

    Action      Condition               Explanation
    Shift       queue β is non-empty    Push the first word i of the queue onto the stack
    LeftArc                             Set the head of the second word on the stack, i, to the top word j (i becomes a child of j)
    RightArc                            Set the head of the top word on the stack, j, to the second word i (j becomes a child of i)

    The logic of the two transition systems differs: Arc-Eager builds right subtrees top-down, while Arc-Standard requires right subtrees to be built bottom-up. Although both have O(n) complexity, Arc-Standard is more popular, perhaps because it is simpler (fewer transition actions).
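
    The following minimal sketch illustrates the three Arc-Standard transitions on a (stack, buffer) configuration; words are represented by integer indices and arcs as (head, dependent, label) triples. It only illustrates the transition logic, not HanLP's implementation:

    def shift(stack, buffer, arcs):
        stack.append(buffer.pop(0))          # push the first word of the buffer onto the stack

    def left_arc(stack, buffer, arcs, label):
        j = stack[-1]                        # the top word j dominates ...
        i = stack.pop(-2)                    # ... the second word i, which leaves the stack
        arcs.append((j, i, label))

    def right_arc(stack, buffer, arcs, label):
        i = stack[-2]                        # the second word i dominates ...
        j = stack.pop(-1)                    # ... the top word j, which leaves the stack
        arcs.append((i, j, label))

    stack, buffer, arcs = [0], [1, 2, 3], []  # 0 stands for the virtual root
    shift(stack, buffer, arcs)
    shift(stack, buffer, arcs)
    left_arc(stack, buffer, arcs, "att")      # word 1 becomes a child of word 2
    print(stack, buffer, arcs)                # [0, 2] [3] [(2, 1, 'att')]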

  2. Feature Extraction

    Although in theory a neural network can extract features automatically, the pioneering paper behind this parser still did not break away from feature templates. All features fall into three categories:

    • Word features.
    • Part-of-speech features.
    • Dependency-label features of the subtrees that have already been determined.

    Next, the parser extracts these three categories of features from the current state, denoted w, t and l. Unlike traditional methods, each feature is assigned an embedding vector, yielding three dense vectors x^w, x^t and x^l. These three vectors are concatenated and fed into a neural network with one hidden layer, activated with a cube function, which yields the hidden-layer feature vector:

    \[ h = \left( W_{1} \left( x^{w} \oplus x^{t} \oplus x^{l} \right) \right)^{3} \]

    Next, for k kinds of dependency labels, Arc-Standard has a total of 2k + 1 possible transition actions. Feeding the feature vector h into a multinomial logistic regression classifier (which can be viewed as the output layer of the neural network) gives the probability distribution over transitions:

    \[ p = \mathrm{softmax}\left( W_{2} h \right) \]

    Finally, the transition with the highest probability in p is selected and executed. Training uses the softmax cross-entropy loss function and stochastic gradient descent.
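
    As a sketch of the two formulas above, the following NumPy code computes the transition distribution for one state; the dimensions and the random embeddings and weights are illustrative assumptions, not the parser's real hyperparameters:

    import numpy as np

    rng = np.random.default_rng(0)
    d_w, d_t, d_l, d_h, k = 50, 20, 20, 200, 40     # feature-embedding sizes, hidden size, label count
    n_actions = 2 * k + 1                           # Shift plus k LeftArc and k RightArc actions

    x_w = rng.normal(size=d_w)                      # concatenated word-feature embeddings
    x_t = rng.normal(size=d_t)                      # part-of-speech-feature embeddings
    x_l = rng.normal(size=d_l)                      # dependency-label-feature embeddings

    W1 = rng.normal(size=(d_h, d_w + d_t + d_l))
    W2 = rng.normal(size=(n_actions, d_h))

    x = np.concatenate([x_w, x_t, x_l])             # x^w ⊕ x^t ⊕ x^l
    h = (W1 @ x) ** 3                               # cube activation of the hidden layer
    scores = W2 @ h
    p = np.exp(scores - scores.max())
    p /= p.sum()                                    # softmax over the 2k + 1 transitions
    print(p.argmax())                               # index of the transition to execute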

  3. Implementation code

    from pyhanlp import *
    
    CoNLLSentence = JClass('com.hankcs.hanlp.corpus.dependency.CoNll.CoNLLSentence')
    CoNLLWord = JClass('com.hankcs.hanlp.corpus.dependency.CoNll.CoNLLWord')
    IDependencyParser = JClass('com.hankcs.hanlp.dependency.IDependencyParser')
    NeuralNetworkDependencyParser = JClass('com.hankcs.hanlp.dependency.nnparser.NeuralNetworkDependencyParser')
    
    
    parser = NeuralNetworkDependencyParser()
    sentence = parser.parse("徐先生还具体帮助他确定了把画雄鹰、松鼠和麻雀作为主攻目标。")
    print(sentence)
    for word in sentence.iterator():  # use dir() to inspect the methods available on sentence
        print("%s --(%s)--> %s" % (word.LEMMA, word.DEPREL, word.HEAD.LEMMA))
    print()
    
    # You can also get the word array directly and traverse it in any order, forwards or backwards
    word_array = sentence.getWordArray()
    for word in word_array:
        print("%s --(%s)--> %s" % (word.LEMMA, word.DEPREL, word.HEAD.LEMMA))
    print()
    
    # You can also walk a subtree directly, from one of its nodes all the way up to the virtual root
    CoNLLWord = JClass("com.hankcs.hanlp.corpus.dependency.CoNll.CoNLLWord")
    head = word_array[12]
    while head.HEAD:
        head = head.HEAD
        if (head == CoNLLWord.ROOT):
            print(head.LEMMA)
        else:
            print("%s --(%s)--> " % (head.LEMMA, head.DEPREL))

    For the definitions of more dependency relations, see Chinese Dependency Treebank 1.0.

13.5 Conclusion

  1. Natural language processing is a fast-changing discipline, especially in the deep learning era. In academia, even the most advanced research can be surpassed within a couple of months. This series offers readers only entry-level basics.

  2. There are two common neural network feature extractors: the recurrent neural network (RNN) for sequential data and the convolutional neural network (CNN) for spatial data. Of the two, RNNs are used more widely in natural language processing. RNNs can handle variable-length input, which makes them directly applicable to text. In particular the LSTM, a member of the RNN family, can memorize on the order of 200 words, creating the conditions for modeling long-distance dependencies between words in a sentence. The drawback of RNNs is that they are hard to parallelize. If the task is to capture n-grams in the text, CNNs do even better, and they have a natural advantage in parallelization. Considering that documents are generally long, many document classification models are built with CNNs; for NLP tasks performed at sentence granularity on relatively short sentences (Chinese word segmentation, part-of-speech tagging, parsing, named entity recognition, and so on), RNNs are more often used.

  3. For pre-trained word embeddings, word2vec is already a thing of the past. fastText, obtained by Facebook by introducing subword (morphological) information into the skip-gram model, can build a word vector for any word, without requiring the word to have appeared in the training corpus. However, neither word2vec nor fastText can solve the problem of polysemy, because disambiguating a polysemous word requires the context of the sentence it appears in; this has given rise to a series of context-aware word representations.

    Among them, ELMo, proposed by the University of Washington, is a bidirectional LSTM language model trained on large-scale plain text. ELMo introduces contextual information into word embeddings by predicting the current word from the preceding text. Researchers at Zalando Research then applied this approach at the character level to obtain contextual string embeddings, whose tagger achieved state-of-the-art accuracy. Google's BERT model instead models the left and right context at the same time through an efficient bidirectional Transformer network and has achieved remarkable results on many NLP tasks.

  4. Other NLP tasks once considered difficult, such as automatic document summarization and question answering, become much simpler in the deep learning era. Many question answering tasks reduce to measuring the similarity between the question text and the candidate answers, which is exactly what neural networks with attention mechanisms are good at. Document summarization involves text generation, which happens to be what RNN language models are good at. In machine translation, Google long ago replaced phrase-based machine translation with machine translation based on neural networks. At present, the academic trend is to use Transformers and attention mechanisms to extract features.

In short, the future of natural language processing is grand and broad. As a stepping stone on this long road, this introductory series hopes to give readers some of the necessary entry-level concepts. As for the practice that follows, the road is long; let us encourage one another along the way.

13.6 GitHub

Notes on "Introduction to Natural Language Processing" by HanLP author He Han (何晗):

https://github.com/NLP-LOVE/Introduction-NLP

Table of Contents


Chapter 1: First Steps
Chapter 2: Dictionary-based Word Segmentation
Chapter 3: Bigram Language Models and Chinese Word Segmentation
Chapter 4: Hidden Markov Models and Sequence Labeling
Chapter 5: Perceptron Classification and Sequence Labeling
Chapter 6: Conditional Random Fields and Sequence Labeling
Chapter 7: Part-of-Speech Tagging
Chapter 8: Named Entity Recognition
Chapter 9: Information Extraction
Chapter 10: Text Clustering
Chapter 11: Text Classification
Chapter 12: Dependency Parsing
Chapter 13: Deep Learning and Natural Language Processing


Origin: www.cnblogs.com/mantch/p/12333696.html