A Survey of Basic Tasks in Natural Language Processing



1. Multilingual word segmentation

​ In natural language processing, tokenization (word segmentation) refers to dividing the continuous character sequence of a natural-language text into meaningful units (tokens). It is a fundamental, and very important, step in text preprocessing.

​ In English, words are usually separated by spaces, so segmenting English text is relatively easy to implement. In languages without explicit word boundaries, such as Chinese, segmentation becomes much harder: it must take factors such as sentence structure, context, and part of speech into account, and many algorithms have been proposed to solve the Chinese word segmentation problem.

​ Word segmentation matters because it is the basis of sentence understanding and semantic analysis: it splits a piece of text into meaningful units, laying the foundation for all subsequent text processing. It is required in applications such as speech recognition, machine translation, information retrieval, text classification, and sentiment analysis.

Example: "Please enter text"

{
  "result": [
    {
      "id": "0",
      "word": "请",
      "tags": [
        "基本词-中文"
      ]
    },
    {
      "id": "1",
      "word": "输入",
      "tags": [
        "基本词-中文",
        "产品类型修饰词"
      ]
    },
    {
      "id": "2",
      "word": "文本",
      "tags": [
        "基本词-中文",
        "产品类型修饰词"
      ]
    }
  ],
  "success": true
}
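As a minimal sketch, the same segmentation can be reproduced locally with the open-source jieba library (jieba is an assumption here; the JSON above comes from a hosted API, not from jieba):

# Chinese word segmentation with jieba (pip install jieba).
import jieba

tokens = jieba.lcut("请输入文本")  # precise mode; returns a list of tokens
print(tokens)  # expected: ['请', '输入', '文本']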

2. Part-of-speech tagging

​ This capability assigns a part of speech to each word in a natural-language text, such as noun, verb, or preposition.

Example: "Please enter text"

pos: part of speech; word: the word itself

{
  "result": [
    {
      "pos": "VV",
      "word": "请"
    },
    {
      "pos": "VV",
      "word": "输入"
    },
    {
      "pos": "NN",
      "word": "文本"
    }
  ],
  "success": true
}
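As a minimal sketch, jieba's posseg module can tag parts of speech locally (an assumption; note that jieba uses its own tag set, e.g. 'v' for verbs and 'n' for nouns, rather than the Penn-style VV/NN shown above):

# Part-of-speech tagging with jieba.posseg (pip install jieba).
import jieba.posseg as pseg

for pair in pseg.cut("请输入文本"):
    print(pair.word, pair.flag)  # each token and its jieba POS tag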

3. Named entity recognition

​ Chinese named entity recognition is a natural language processing technology used to identify entities with specific meanings in Chinese text, such as names of people, places, and organizations. Such systems are usually trained with machine learning algorithms on annotated corpora, and are applied in fields such as information extraction, knowledge graph construction, and intelligent question answering. Common Chinese named entity recognition tools include Stanford NER, LTP, THULAC, etc.

Example: "Please enter text"

{
  "result": [
    {
      "synonym": "",
      "weight": "0.100000",
      "tag": "普通词",
      "word": "请"
    },
    {
      "synonym": "",
      "weight": "0.100000",
      "tag": "普通词",
      "word": "输入"
    },
    {
      "synonym": "",
      "weight": "1.000000",
      "tag": "品类",
      "word": "文本"
    }
  ],
  "success": true
}
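As a minimal sketch, spaCy's Chinese pipeline can perform the same task (spaCy is an assumption, not the API shown above; the model must first be installed with python -m spacy download zh_core_web_sm):

# Named entity recognition with spaCy's Chinese model.
import spacy

nlp = spacy.load("zh_core_web_sm")
doc = nlp("小明在北京为百度工作")  # hypothetical input sentence
for ent in doc.ents:
    print(ent.text, ent.label_)  # entity span and its type (person, place, organization, ...)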

4. Center Word Extraction

​ In natural language processing (NLP), center word extraction computes a relevance score for each word, measuring how strongly the word relates to the sentence, and then identifies and extracts the sentence's center word(s).

These keywords or phrases can be used in tasks such as text classification, information retrieval, and text summarization. Common center word extraction algorithms include frequency-based methods such as TF-IDF, graph-based methods such as TextRank, and topic models such as LDA. Deep learning also provides newer approaches, such as the pre-trained models Word2Vec and BERT.
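As a minimal sketch, TF-IDF-based keyword extraction is available in jieba.analyse (an assumption; any of the methods above could be substituted):

# TF-IDF keyword extraction with jieba.analyse (pip install jieba).
import jieba.analyse

text = "自然语言处理是人工智能领域的一个重要方向"  # hypothetical input
# topK keywords ranked by TF-IDF weight; withWeight=True also returns the scores
for word, score in jieba.analyse.extract_tags(text, topK=5, withWeight=True):
    print(word, score)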

5. Dependency syntax analysis

​ By analyzing the dependency relations between the words in a sentence, dependency parsing captures their syntactic structure (such as subject-verb, verb-object, attribute-head, and verb-complement relations) and represents it as a tree.

Baidu's Chinese dependency parsing tool DDParser is open source. Common approaches to dependency parsing include:
  • Rule-based methods: linguistic rules are written by hand to capture the relations between words.
  • Statistical learning methods: dependency patterns are learned automatically from large amounts of labeled sentence data using machine learning algorithms.
  • Hybrid methods: combine rules and statistical learning, leveraging expert knowledge while taking full advantage of large corpora.
  • Neural network methods: with deep learning, the input is each word in the sentence together with its context, and the output is the dependency relation between each word and the others.

Example: "Xiao Ming loves China"

{
  "result": [
    {
      "head": 2,
      "pos": "NR",
      "id": 1,
      "label": "SBV",
      "word": "小明"
    },
    {
      "head": 0,
      "pos": "VV",
      "id": 2,
      "label": "ROOT",
      "word": "爱"
    },
    {
      "head": 2,
      "pos": "NR",
      "id": 3,
      "label": "VOB",
      "word": "中国"
    }
  ],
  "success": true
}
  • id: the word's index in the sentence, starting from 1
  • word: the word itself
  • pos: the part of speech of the current word
  • head: the id of this word's head (governing) word; 0 marks the root
  • label: the type of dependency relation; for example, SBV is a subject-verb relation and VOB a verb-object relation
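As a minimal sketch, the open-source DDParser mentioned above can produce such an analysis (pip install ddparser; the exact output layout noted below follows DDParser's documented usage and is an assumption):

# Dependency parsing with Baidu's DDParser.
from ddparser import DDParser

ddp = DDParser()
results = ddp.parse("小明爱中国")
# each result is a dict of parallel lists: 'word', 'head', and 'deprel',
# matching the word/head/label fields explained above
print(results)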

6. Text error correction

​ Accurately identifies typos and their positions in the input text, and suggests corrections for the content.

  • Rule-based methods: detect and correct errors through predefined rules, such as a spell checker.
  • Statistics-based methods: use language models and probabilistic algorithms, such as n-gram models and edit-distance algorithms, to identify and repair errors (a minimal edit-distance sketch follows the example below).
  • Deep-learning-based methods: train automatic correction models with neural networks, such as seq2seq models or BERT.
  • Ensemble methods: combine multiple methods, exploiting their respective strengths to improve the accuracy and robustness of text error correction.

Example: "我今天吃苹果,明天吃香姣" ("I eat apples today and bananas tomorrow", where 姣 is a typo for 蕉)

{
  "result": {
    "edits": [
      {
        "confidence": 0.8385,
        "pos": 11,
        "src": "姣",
        "tgt": "蕉",
        "type": "SpellingError"
      }
    ],
    "source": "我今天吃苹果,明天吃香姣",
    "target": "我今天吃苹果,明天吃香蕉"
  },
  "success": true
}
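As a minimal sketch of the edit-distance building block mentioned in the list above (standard library only; a real corrector would combine this with a language model and a candidate dictionary):

# Levenshtein edit distance via dynamic programming.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a to every prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(edit_distance("香姣", "香蕉"))  # 1: a single substitution fixes the typo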

7. Text summarization

​ Text summarization is the process of compressing a long text into a few key points. Commonly used text summarization methods include:

  1. Extractive summarization: extract the most important sentences or phrases from the original text to form the summary; no language understanding or generation is involved (see the sketch after this list).
  2. Generative (abstractive) summarization: automatically produce new, more concise text by semantically understanding the original and generating from it.
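As a minimal extractive sketch (an assumption: period-separated sentences scored by word frequency; production systems are far more elaborate):

# Frequency-based extractive summarization.
from collections import Counter

def extractive_summary(text: str, n_sentences: int = 1) -> str:
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = Counter(w.lower() for s in sentences for w in s.split())
    # score each sentence by average word frequency, keep the top n
    top = sorted(sentences,
                 key=lambda s: sum(freq[w.lower()] for w in s.split()) / len(s.split()),
                 reverse=True)[:n_sentences]
    return ". ".join(top) + "."

text = ("Word segmentation splits text into tokens. Segmentation of Chinese text "
        "is hard. Many segmentation algorithms exist for Chinese text")
print(extractive_summary(text, 1))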

8. Text Similarity

​ Computes the similarity between different texts and outputs a score between 0 and 1; the higher the score, the more similar the texts. It is recommended not to use the score for direct judgments, but rather as a feature, bucketed by range.

Commonly used text similarity methods include:

  • Cosine similarity: represents the similarity of two texts by the cosine of the angle between their vectors.

  • Edit distance: measures how similar two strings are by the minimum number of edit operations needed to turn one into the other.

  • Jaccard similarity: expresses the similarity of two sets as the ratio of the size of their intersection to the size of their union.

    Jaccard similarity is a common measure of text similarity that works on the sets of words the texts contain. It follows the Jaccard coefficient for sets in mathematics: the number of words the two texts share, divided by the total number of distinct words across both.

    If the word sets of texts A and B are $A$ and $B$ respectively, the Jaccard similarity is:
    $$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
    where $\cap$ denotes intersection, $\cup$ denotes union, and $|\cdot|$ denotes the number of elements in a set.

    For example, suppose there are two sentences: "I like to eat apples" and "He also likes to eat apples". Their Jaccard similarity can be calculated as follows (a code sketch follows this list):

    • The set of words in sentence 1: A = {I, like, eat, apple}

    • The set of words in sentence 2: B = {he, also, like, eat, apple}

    • The intersection of A and B is {like, eat, apple}, with 3 elements

    • The union of A and B is {I, he, like, eat, apple, also}, with 6 elements

    • Therefore, the Jaccard similarity between sentence 1 and sentence 2 is
      $J(A, B) = \frac{3}{6} = 0.5$

  • Word vector models: map the words of a text into a vector space and compute text similarity from the distance or cosine similarity between vectors.

  • Pre-trained language models: use a pre-trained neural network to produce representations of the texts and compute their similarity from those representations.
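As a minimal sketch of the Jaccard calculation worked out above:

# Jaccard similarity between two word sets.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

s1 = {"I", "like", "eat", "apple"}           # words of sentence 1
s2 = {"he", "also", "like", "eat", "apple"}  # words of sentence 2
print(jaccard(s1, s2))  # 3 shared words / 6 distinct words = 0.5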

Sentence A: a red dress

Sentence B: a blue outfit

{
  "result": [
    {
      "score": "0.16925383",
      "flag": true
    }
  ],
  "success": true
}

9. Sentiment Analysis

​ For natural-language text containing subjective statements, this capability automatically judges whether the sentiment of the text is positive or negative and returns the corresponding result.
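As a minimal sketch, the open-source SnowNLP library offers this ability for Chinese text (SnowNLP is an assumption; the capability described above may be backed by any model):

# Sentiment scoring with SnowNLP (pip install snownlp).
from snownlp import SnowNLP

for text in ["这部电影太好看了", "这个产品质量很差"]:
    score = SnowNLP(text).sentiments  # probability that the text is positive, in [0, 1]
    print(text, round(score, 3))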

10. Text Classification

​ Maps a piece of user-input text to a specific category. Taking news as an example, a text can be assigned to categories such as Hong Kong/Macao/Taiwan, real estate, military, society, finance, entertainment, automobiles, international, education, health, food, current politics, rule of law, tourism, sports, digital technology frontiers, wellness, animation, and anti-corruption.
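As a minimal sketch of text classification with scikit-learn (the tiny two-category training set below is hypothetical):

# TF-IDF features plus a linear classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["the team won the match", "stocks fell sharply today",
               "the striker scored twice", "the central bank raised rates"]
train_labels = ["sports", "finance", "sports", "finance"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)
print(clf.predict(["the striker scored a goal"]))  # expected: ['sports']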

11. Word vector

Two good introductions (in Chinese):

Natural Language Processing 3: Word Vectors - Zhihu (zhihu.com)

[Plain-language NLP] What is a word vector - Zhihu (zhihu.com)
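As a minimal sketch of training word vectors with gensim's Word2Vec (gensim >= 4 API; the three-sentence segmented corpus below is hypothetical and far too small for useful vectors):

# Training Word2Vec on a toy segmented corpus (pip install gensim).
from gensim.models import Word2Vec

corpus = [["我", "喜欢", "吃", "苹果"],
          ["他", "也", "喜欢", "吃", "苹果"],
          ["我", "喜欢", "吃", "香蕉"]]
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=50)
print(model.wv["苹果"][:5])                 # first 5 dimensions of the vector for "苹果"
print(model.wv.similarity("苹果", "香蕉"))  # cosine similarity between two words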


Source: blog.csdn.net/henghuizan2771/article/details/130345462