"Introduction to Natural Language Processing" Reading Notes

Chapter One: Getting Started

1.1 Natural language and programming language

  1. Vocabulary
  2. Structure
  3. Ambiguity
  4. Fault tolerance
  5. Variability
  6. Simplicity

1.2 Levels of Natural Language Processing

  • Text: the input may come from speech (via speech recognition), images (via optical character recognition, OCR), or plain text directly.
  • Lexical analysis:
    • Chinese word segmentation (dividing the character sequence into a sequence of meaningful words),
    • Part-of-speech tagging (determining the category of each word, a shallow level of disambiguation),
    • Named entity recognition (recognizing longer proper nouns)
  • Information extraction: extracting useful information based on the words and their parts of speech
  • Text classification and text clustering:
    • Text classification: assigning a document to a category, e.g., judging the sentiment of a paragraph or whether an email is spam,
    • Text clustering: grouping similar documents together, or excluding duplicate documents,
  • Syntactic analysis: analyzing the subject-predicate-object structure of a sentence to describe its meaning, used in question answering systems and search engines.
  • Semantic analysis and text analysis: compared with syntactic analysis, semantic analysis focuses on meaning rather than grammar. It includes:
    • Word sense disambiguation (determining the meaning of a word in context, beyond its part of speech),
    • Semantic role labeling (labeling the relationships between the predicate and the other components of the sentence),
    • Semantic dependency analysis (analyzing the semantic relationships between the words of a sentence)
  • Other tasks :
    • Automatic question and answer
    • Automatic summarization
    • Machine translation
  • Information Retrieval (IR) is an independent discipline, distinct from natural language processing: the goal of IR is to find information, while the goal of NLP is to understand language.

1.3 Schools of Natural Language Processing

1.4 Machine learning

Machine learning: a method that lets a computer improve its ability at a task through experience data, without being directly programmed. The algorithm that lets the machine learn is called a meta-algorithm (the machine learning algorithm itself), and what the machine learns is called a model.

model:

A model is a mathematical abstraction of a real problem. It consists of a hypothesis function and a set of parameters, and can be written, for example, as:

f(X) = W \cdot X + b

The independent variable X is a feature vector used to represent the characteristics of an object.

feature:

A feature is the numerical value extracted from some characteristic of an object; converting characteristics into numerical form is called feature extraction.

Feature extraction of names:

"Shen Yanbing" feature extraction
Feature number Characteristic condition Eigenvalues
1 Does it contain "Goose" 1
2 Does it contain "ice" 1

The name Shen Yanbing can thus be expressed as the two-dimensional feature vector X = [1, 1].

Suppose the model for judging gender from a name is:

f(X) = W \cdot X + b

Expanding, f(X) = W \cdot X + b = w_1 x_1 + w_2 x_2 + \cdots + b; since each x_i is 0 or 1, f(X) is just the sum of the weights of the features present in the name, plus b. W = [w_1, w_2, \ldots] holds the feature weights: for example, "Na" (娜) is characteristic of female names and "Lei" (磊) of male names.

You do not need to hand-write a feature for every character. Instead, define a set of feature templates: for example, a template over a name variable that, applied to every name in the corpus, automatically instantiates all of the name's features. W is then generated from the data rather than from manual analysis and statistics. Such a template is called a feature template.

Designing feature templates is called feature engineering. The more features, the more parameters, and the more complex the model; the model's complexity should match the data set. A small sketch of template-based extraction follows.
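Here is a minimal sketch of template-based feature extraction (the extract_features helper and the two-character template are illustrative assumptions, not HanLP's API):

def extract_features(name, feature_chars):
    # One indicator feature per candidate character:
    # 1 if the name contains that character, else 0.
    return [1 if ch in name else 0 for ch in feature_chars]

# Instantiating a template with two characters of interest
feature_chars = ['雁', '冰']
print(extract_features('沈雁冰', feature_chars))  # [1, 1]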

data set:

The worked examples from which the meta-algorithm learns are called samples; a collection of samples is called a data set. In the field of natural language processing, data sets are called corpora. There are many kinds of data sets.

Commonly used data sets:

  1. MNIST (handwritten digit recognition)
  2. ImageNet (image recognition)
  3. TREC (information retrieval)
  4. SQuAD (question answering)
  5. Europarl (machine translation)

Supervised learning:

When the exercises come with standard answers, the learning algorithm is called supervised learning: after the machine finishes a problem, the algorithm corrects the model's errors.

For gender recognition, each name in the data set is represented as a pair (X_i, Y_i), where Y_i \in \{+1, -1\}.

In supervised learning, if the correct answer is male but the function outputs female, the algorithm adjusts the model's parameters.

In f(X) = w_1 \cdot x_1 + w_2 \cdot x_2 + \cdots + w_n \cdot x_n + b, where x_i \in \{0, 1\}, raising f(X) to a non-negative value means increasing the weights w_i of the features with x_i = 1. After many rounds of such corrections, the weights of characters common in male names grow while the weights of characters common in female names shrink; these priorities are computed automatically. In natural language processing, the analogous statistic is the frequency of each word.

This iterative learning over a labeled (answered) data set is called training, and the data set used is called the training set. The result of training is a set of parameters (feature weights), that is, a trained model. With this model we can predict the gender of any name.
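To make the correction step concrete, here is a minimal perceptron-style training sketch (an assumed illustration of the idea, not the book's exact algorithm); the feature vector is built the same way as in the earlier sketch:

def train(samples, feature_chars, epochs=10):
    # samples: (name, y) pairs with y = +1 for male, -1 for female
    w = [0.0] * len(feature_chars)
    b = 0.0
    for _ in range(epochs):
        for name, y in samples:
            x = [1 if ch in name else 0 for ch in feature_chars]
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # wrong (or undecided) answer: correct the model
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b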

Unsupervised learning:

That is, the algorithm is given the questions but not the answers. Unsupervised learning is generally used for clustering and dimensionality reduction, neither of which requires annotated data.

Clustering: grouping similar documents together, or excluding duplicate documents, based on the similarity of samples and the desired granularity of the clusters.

Dimensionality reduction: the process of mapping sample points from a high-dimensional space to a low-dimensional one. High-dimensional data abounds in machine learning; in gender recognition, for instance, features based on commonly used Chinese characters number over 2,000. With n features, each sample corresponds to a point in (n + 1)-dimensional space, the extra dimension holding the dependent variable of the hypothesis function. To visualize the sample points, they must be reduced to two or three dimensions.

A dimensionality reduction algorithm should lose as little information as possible; equivalently, the variance of the samples along each dimension of the low-dimensional space should be as large as possible.
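A minimal variance-preserving sketch of this idea, in the style of PCA (an assumed illustration; the chapter does not prescribe a specific algorithm):

import numpy as np

def reduce_to_2d(X):
    # Center the samples, then project onto the two directions
    # of maximum variance (top eigenvectors of the covariance matrix).
    X = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]
    return X @ top2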

Other types of machine learning algorithms:

Semi-supervised learning: when multiple different models make the same prediction on an unlabeled sample, take that consistent result as a new training sample to expand the training set.

Reinforcement learning : While predicting, plan the next decision based on environmental feedback.

1.5 Corpus:

Data sets in the field of natural language processing

Chinese word segmentation corpus:

A collection of sentences that have been manually segmented into words.

Part-of-speech tagging corpus:

After segmentation, each word in the sentence is assigned a part of speech, according to a part-of-speech tag set.

Named entity recognition corpus

The entity names and entity categories that the corpus creator cares about are manually annotated in the text.

Syntactic analysis corpus:

The most commonly used Chinese syntactic analysis corpus is CTB (the Chinese Treebank), which annotates the grammatical relations between the segmented words.

Text classification corpus:

A corpus of articles manually annotated with the category each article belongs to.

Corpus construction is the process of building a corpus: formulate the annotation standard, train the annotators, then label by hand.

1.6 Open source tools:

  1. Python interface: https://github.com/hankcs/pyhanlp
  2. Java interface: https://github.com/hankcs/HanLP

Chapter Two: Dictionary-Based Word Segmentation

Chinese word segmentation: the process of splitting a piece of text into a sequence of words. Approaches to Chinese word segmentation fall into two categories:

  1. Based on dictionary rules
  2. Based on machine learning

2.1 Dictionary word segmentation:

Definition of a word:

Linguistically, a word is the smallest unit that can be used independently while carrying meaning.

The nature of the word:

Zipf's Law: a word's frequency is inversely proportional to its frequency rank; this holds in Chinese as well.

(Figure: frequency statistics of the 30 most commonly used words.)
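A quick way to check Zipf's law on any tokenized corpus (a sketch; the words input is assumed to be a list of tokens):

from collections import Counter

def zipf_table(words, top=30):
    # Rank words by frequency; under Zipf's law,
    # frequency falls off roughly as 1/rank.
    counts = Counter(words).most_common(top)
    return [(rank, w, c) for rank, (w, c) in enumerate(counts, 1)]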

2.2 Dictionaries

HanLP dictionary:

Dictionary loading:

Python:

from pyhanlp import *


def load_dictionary():
    """
    加载HanLP中的mini词库
    :return: 一个set形式的词库
    """
    IOUtil = JClass('com.hankcs.hanlp.corpus.io.IOUtil')
    path = HanLP.Config.CoreDictionaryPath.replace('.txt', '.mini.txt')
    dic = IOUtil.loadDictionary([path])
    return set(dic.keySet())


if __name__ == '__main__':
    dic = load_dictionary()
    print(len(dic))
    print(list(dic)[0])
85584
倒贴

Segmentation algorithm:

Given a dictionary, segmentation reduces to querying the dictionary under some word-selection rule. The common rules are forward longest matching, backward longest matching, and bidirectional longest matching, all of which are built on top of full segmentation.

Full segmentation:

Find all dictionary words contained in a piece of text. The naive full-segmentation algorithm is simple: enumerate every contiguous substring of the text and query whether it is in the dictionary.

from tests.book.ch02.utility import load_dictionary


def fully_segment(text, dic):
    word_list = []
    for i in range(len(text)):                  # i scans every start index in the text
        for j in range(i + 1, len(text) + 1):   # j scans every end index in [i + 1, len(text)]
            word = text[i:j]                    # the substring over the interval [i, j)
            if word in dic:                     # if it is in the dictionary, treat it as a word
                word_list.append(word)
    return word_list


if __name__ == '__main__':
    dic = load_dictionary()

    print(fully_segment('商品和服务', dic))

Longest forward match:

Full segmentation needs refining: we want a meaningful sequence of words, not a list of every dictionary word in the text. Observe that the longer a word, the more meaning it expresses, so define: the longer the word, the higher its priority. Concretely, output the longest dictionary word starting at each scan position, scanning from front to back; this is forward longest matching.

from tests.book.ch02.utility import load_dictionary


def forward_segment(text, dic):
    word_list = []
    i = 0
    while i < len(text):
        longest_word = text[i]                      # the single character at the scan position
        for j in range(i + 1, len(text) + 1):       # every possible end position
            word = text[i:j]                        # the substring from the scan position to j
            if word in dic:                         # it is in the dictionary
                if len(word) > len(longest_word):   # and it is longer
                    longest_word = word             # so it takes priority in the output
        word_list.append(longest_word)              # emit the longest word
        i += len(longest_word)                      # advance the forward scan
    return word_list


if __name__ == '__main__':
    dic = load_dictionary()

    print(forward_segment('就读北京大学', dic))
    print(forward_segment('研究生命起源', dic))
['就读', '北京大学']
['研究生', '命', '起源']

The second result is wrong: under the longest-match rule, '研究生' (graduate student) takes priority over '研究' (research).

Longest reverse match:


from tests.book.ch02.utility import load_dictionary


def backward_segment(text, dic):
    word_list = []
    i = len(text) - 1
    while i >= 0:                                   # the scan position marks the end point
        longest_word = text[i]                      # the single character at the scan position
        for j in range(0, i):                       # try every start position in [0, i)
            word = text[j: i + 1]                   # the candidate word spanning [j, i]
            if word in dic:
                if len(word) > len(longest_word):   # longer words take priority
                    longest_word = word
                    break
        word_list.insert(0, longest_word)           # scanning backwards: earlier finds sit later in the sentence
        i -= len(longest_word)
    return word_list


if __name__ == '__main__':
    dic = load_dictionary()
    print(backward_segment('项目的研究', dic))
    print(backward_segment('研究生命起源', dic))
['项', '目的', '研究']
['研究', '生命', '起源']

Bidirectional longest matching:

Run forward and backward longest matching simultaneously. If the two results have different word counts, return the one with fewer words; if the counts are equal, return the one with fewer single-character words; and if those are equal too, prefer the backward result. These heuristic rules derive from a linguistic property of Chinese: single-character words are far rarer than multi-character words.

from tests.book.ch02.backward_segment import backward_segment
from tests.book.ch02.forward_segment import forward_segment
from tests.book.ch02.utility import load_dictionary


def count_single_char(word_list: list):  # count the single-character words in a result
    return sum(1 for word in word_list if len(word) == 1)


def bidirectional_segment(text, dic):
    f = forward_segment(text, dic)
    b = backward_segment(text, dic)
    if len(f) < len(b):                                  # fewer words means higher priority
        return f
    elif len(f) > len(b):
        return b
    else:
        if count_single_char(f) < count_single_char(b):  # fewer single-character words means higher priority
            return f
        else:
            return b                                     # on a complete tie, the backward result wins


if __name__ == '__main__':
    dic = load_dictionary()

    print(bidirectional_segment('研究生命起源', dic))

In practice, however, bidirectional matching does not consistently outperform backward matching alone.

Speed evaluation:

The core concern of dictionary-based segmentation is speed rather than accuracy.
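For the Python implementations above, a rough timing harness in the same spirit as the Java evaluateSpeed method below (a sketch; throughput depends heavily on machine and interpreter):

import time

def evaluate_speed(segment, text, dic, pressure=10000):
    # Run the segmenter repeatedly and report throughput
    start = time.time()
    for _ in range(pressure):
        segment(text, dic)
    elapsed = time.time() - start
    print('%.0f characters/second' % (len(text) * pressure / elapsed))

For example: evaluate_speed(forward_segment, '江西鄱阳湖干枯,中国最大淡水湖变成大草原', dic).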

Java implementation (including the evaluateSpeed benchmark):

package com.hankcs.book.ch02;

import com.hankcs.hanlp.corpus.io.IOUtil;
import com.hankcs.hanlp.dictionary.CoreDictionary;

import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * "Introduction to Natural Language Processing" 2.3 Segmentation Algorithms
 */
public class NaiveDictionaryBasedSegmentation
{
    public static void main(String[] args) throws IOException
    {
        // Load the dictionary
        TreeMap<String, CoreDictionary.Attribute> dictionary =
            IOUtil.loadDictionary("data/dictionary/CoreNatureDictionary.mini.txt");
        System.out.printf("词典大小:%d个词条\n", dictionary.size());
        System.out.println(dictionary.keySet().iterator().next());
        // Full segmentation
        System.out.println(segmentFully("就读北京大学", dictionary));
        // Forward longest matching
        System.out.println(segmentForwardLongest("就读北京大学", dictionary));
        System.out.println(segmentForwardLongest("研究生命起源", dictionary));
        System.out.println(segmentForwardLongest("项目的研究", dictionary));
        // Backward longest matching
        System.out.println(segmentBackwardLongest("研究生命起源", dictionary));
        System.out.println(segmentBackwardLongest("项目的研究", dictionary));
        // Bidirectional longest matching
        String[] text = new String[]{
            "项目的研究",
            "商品和服务",
            "研究生命起源",
            "当下雨天地面积水",
            "结婚的和尚未结婚的",
            "欢迎新老师生前来就餐",
        };
        for (int i = 0; i < text.length; i++)
        {
            System.out.printf("| %d | %s | %s | %s | %s |\n", i + 1, text[i],
                              segmentForwardLongest(text[i], dictionary),
                              segmentBackwardLongest(text[i], dictionary),
                              segmentBidirectional(text[i], dictionary)
            );
        }

        evaluateSpeed(dictionary);
    }

    /**
     * Benchmark segmentation speed
     *
     * @param dictionary the dictionary
     */
    public static void evaluateSpeed(Map<String, CoreDictionary.Attribute> dictionary)
    {
        String text = "江西鄱阳湖干枯,中国最大淡水湖变成大草原";
        long start;
        double costTime;
        final int pressure = 10000;

        System.out.println("正向最长");
        start = System.currentTimeMillis();
        for (int i = 0; i < pressure; ++i)
        {
            segmentForwardLongest(text, dictionary);
        }
        costTime = (System.currentTimeMillis() - start) / (double) 1000;
        System.out.printf("%.2f万字/秒\n", text.length() * pressure / 10000 / costTime);

        System.out.println("逆向最长");
        start = System.currentTimeMillis();
        for (int i = 0; i < pressure; ++i)
        {
            segmentBackwardLongest(text, dictionary);
        }
        costTime = (System.currentTimeMillis() - start) / (double) 1000;
        System.out.printf("%.2f万字/秒\n", text.length() * pressure / 10000 / costTime);

        System.out.println("双向最长");
        start = System.currentTimeMillis();
        for (int i = 0; i < pressure; ++i)
        {
            segmentBidirectional(text, dictionary);
        }
        costTime = (System.currentTimeMillis() - start) / (double) 1000;
        System.out.printf("%.2f万字/秒\n", text.length() * pressure / 10000 / costTime);
    }

    /**
     * Chinese word segmentation by full segmentation
     *
     * @param text       the text to segment
     * @param dictionary the dictionary
     * @return the list of words
     */
    public static List<String> segmentFully(String text, Map<String, CoreDictionary.Attribute> dictionary)
    {
        List<String> wordList = new LinkedList<String>();
        for (int i = 0; i < text.length(); ++i)
        {
            for (int j = i + 1; j <= text.length(); ++j)
            {
                String word = text.substring(i, j);
                if (dictionary.containsKey(word))
                {
                    wordList.add(word);
                }
            }
        }
        return wordList;
    }

    /**
     * Chinese word segmentation by forward longest matching
     *
     * @param text       the text to segment
     * @param dictionary the dictionary
     * @return the list of words
     */
    public static List<String> segmentForwardLongest(String text, Map<String, CoreDictionary.Attribute> dictionary)
    {
        List<String> wordList = new LinkedList<String>();
        for (int i = 0; i < text.length(); )
        {
            String longestWord = text.substring(i, i + 1);
            for (int j = i + 1; j <= text.length(); ++j)
            {
                String word = text.substring(i, j);
                if (dictionary.containsKey(word))
                {
                    if (word.length() > longestWord.length())
                    {
                        longestWord = word;
                    }
                }
            }
            wordList.add(longestWord);
            i += longestWord.length();
        }
        return wordList;
    }

    /**
     * Chinese word segmentation by backward longest matching
     *
     * @param text       the text to segment
     * @param dictionary the dictionary
     * @return the list of words
     */
    public static List<String> segmentBackwardLongest(String text, Map<String, CoreDictionary.Attribute> dictionary)
    {
        List<String> wordList = new LinkedList<String>();
        for (int i = text.length() - 1; i >= 0; )
        {
            String longestWord = text.substring(i, i + 1);
            for (int j = 0; j <= i; ++j)
            {
                String word = text.substring(j, i + 1);
                if (dictionary.containsKey(word))
                {
                    if (word.length() > longestWord.length())
                    {
                        longestWord = word;
                        break;
                    }
                }
            }
            wordList.add(0, longestWord);
            i -= longestWord.length();
        }
        return wordList;
    }

    /**
     * Count the single-character words in a segmentation result
     *
     * @param wordList the segmentation result
     * @return the number of single-character words
     */
    public static int countSingleChar(List<String> wordList)
    {
        int size = 0;
        for (String word : wordList)
        {
            if (word.length() == 1)
                ++size;
        }
        return size;
    }

    /**
     * Chinese word segmentation by bidirectional longest matching
     *
     * @param text       the text to segment
     * @param dictionary the dictionary
     * @return the list of words
     */
    public static List<String> segmentBidirectional(String text, Map<String, CoreDictionary.Attribute> dictionary)
    {
        List<String> forwardLongest = segmentForwardLongest(text, dictionary);
        List<String> backwardLongest = segmentBackwardLongest(text, dictionary);
        if (forwardLongest.size() < backwardLongest.size())
            return forwardLongest;
        else if (forwardLongest.size() > backwardLongest.size())
            return backwardLongest;
        else
        {
            if (countSingleChar(forwardLongest) < countSingleChar(backwardLongest))
                return forwardLongest;
            else
                return backwardLongest;
        }
    }

}

Dictionary tree

One bottleneck of the matching algorithms above is determining whether a set contains a given string. With an ordered set (TreeMap) the complexity is O(log n), where n is the size of the dictionary; with a hash table (HashMap, Python's dict) lookups are faster, but memory is traded for time.
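For reference, a minimal ordered-collection membership test with O(log n) comparisons (a sketch using Python's bisect module, standing in for a TreeMap):

import bisect

def in_sorted(words_sorted, w):
    # Binary search over a sorted word list: O(log n) comparisons
    i = bisect.bisect_left(words_sorted, w)
    return i < len(words_sorted) and words_sorted[i] == w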

String sets are commonly stored in a dictionary tree (trie, also called a prefix tree).

A trie treats each word as a path from the root to some node, and marks the nodes where words end. If following a string leads to a marked node, the string is in the set; otherwise it is not.

A trie can implement not only a set but also a map: the value corresponding to each key is attached to the key's end node. When the dictionary size is n, the worst-case complexity of a query is still O(log n).

Yet in practice a trie is faster than binary search, because prefix matching proceeds incrementally: the algorithm never needs to re-compare the common prefixes along a branch.

Node implementation of the dictionary tree

By convention, a value of None indicates that a node does not correspond to a word (consequently, keys whose value is None cannot be inserted). The node implementation:

class Node(object):
    def __init__(self, value) -> None:
        self._children = {}  # maps each character to the child node it leads to
        self._value = value  # the value held by this node (None if it ends no word)

    def _add_child(self, char, value, overwrite=False):
        # Add (or update) the child node reached via char
        child = self._children.get(char)
        if child is None:
            child = Node(value)
            self._children[char] = child
        elif overwrite:
            child._value = value
        return child

When adding a node, the child for the given character char is looked up first; if it does not exist, a new child is inserted directly. If it exists, the overwrite flag decides whether its value is replaced.

Implementation of Addition, Deletion, Modification and Checking of Dictionary Tree

class Trie(Node):
    def __init__(self) -> None:
        super().__init__(None)

    def __contains__(self, key):
        return self[key] is not None

    def __getitem__(self, key):
        state = self
        for char in key:
            state = state._children.get(char)
            if state is None:
                return None
        return state._value

    def __setitem__(self, key, value):
        state = self
        for i, char in enumerate(key):
            if i < len(key) - 1:
                state = state._add_child(char, None, False)
            else:
                state = state._add_child(char, value, True)

Deletion, modification, and lookup are at heart the same operation: a query. Deleting a key merely sets its end node's value to None; modifying sets it to some other value.
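A short usage sketch of the Trie class above (the keys and values are arbitrary examples):

trie = Trie()
trie['自然'] = 'nature'
trie['自然语言'] = 'language'
print('自然' in trie)    # True
print(trie['自然语言'])  # language
trie['自然'] = None      # "delete": reset the end node's value to None
print('自然' in trie)    # False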

In terms of deterministic finite automata (DFA), each node is a state, and the state represents the prefix that has been matched so far.

Moving from a parent node to a child node is a state transition. To transition on a character, we ask the parent for the child mapped to that character: if such an edge exists, the transition succeeds; otherwise the match fails. After all transitions complete, we ask whether the final state is an accepting (word-ending) state.
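Restating the lookup in these DFA terms (a sketch; transition and lookup are hypothetical helpers over the Node/Trie classes above):

def transition(state, char):
    # One DFA step: follow the edge labeled char, or fail with None
    return state._children.get(char)

def lookup(trie, key):
    state = trie
    for char in key:
        state = transition(state, char)
        if state is None:   # a transition failed: the key is absent
            return None
    return state._value     # non-None only if we end in an accepting state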

Java implementation: a trie that hashes the first character and binary-searches the rest

A hash function converts an object into an integer, with the guarantee that identical objects get identical hash values. If, additionally, distinct objects always get distinct hash values, the hash is called perfect.

Python's built-in dict uses the string hash function, which on a 64-bit system returns a 64-bit integer. But the total number of Unicode characters is only 136,690, far smaller than 2^{64}, so two adjacent characters may have wildly different hash values. Java's hashing is friendlier: Java uses UTF-16, so each character maps to a distinct 16-bit integer, meaning every character's hash is an integer in [0, 65535].

Concretely, create an array of size 65536 and store each child node at the index given by its character's hash value, so that every state transition is a direct array access.
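A minimal sketch of this direct-addressing idea (a hypothetical class; assumes characters fall within the 16-bit range, as with Java's UTF-16 code units):

class HashedRootNode:
    def __init__(self):
        self._children = [None] * 65536    # one slot per 16-bit code unit

    def add_child(self, char, node):
        self._children[ord(char)] = node   # hash(char) is just its code unit

    def get_child(self, char):
        return self._children[ord(char)]   # a state transition is one array access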

However, this approach inflates memory: if every node allocated such an array, the cost would multiply with the length l of the words in the dictionary. That is why only the first character is hashed, while the remaining levels fall back to binary search.

Source: blog.csdn.net/sanhewuyang/article/details/105205829