Artificial Intelligence - Natural Language Processing (NLP): The basic pipeline of an NLP project [word segmentation, cleaning, normalization, feature extraction, modeling]


One, Word Segmentation

1. Word segmentation tools

  • Jieba word segmentation: https://github.com/fxsjy/jieba
  • SnowNLP: https://github.com/isnowfy/snownlp
  • LTP: http://www.ltp-cloud.com/
  • HanLP: https://github.com/hankcs/HanLP/
  • PKUseg

2. Word segmentation algorithms

The word segmentation algorithms described below all rely on an existing dictionary to split a sentence into words.

2.1 Forward-max matching

Disadvantage: the greedy match does not take semantics into account.
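
A minimal sketch of forward maximum matching may help here. It assumes a window of at most `max_len` characters and falls back to a single character when nothing in the dictionary matches. The function name, the `max_len` default and the toy dictionary are illustrative assumptions, not taken from any of the tools listed above.

```python
def forward_max_match(sentence, dictionary, max_len=5):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest window first and shrink it until a dictionary word fits.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dict = {"我们", "经常", "有", "有意见", "意见", "分歧"}
print(forward_max_match("我们经常有意见分歧", toy_dict))
# ['我们', '经常', '有意见', '分歧']
```

The result depends entirely on the dictionary and on `max_len`: a larger window finds longer words but costs more lookups per position.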

2.2 Backward-max matching

In about 90% of cases, forward maximum matching and backward maximum matching produce the same segmentation.

Disadvantage: the greedy match does not take semantics into account.
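
The backward variant can be sketched the same way, scanning from the end of the sentence toward the start. The function name, `max_len` and the toy dictionary are again assumptions made for illustration. The example sentence is one where the direction matters: a left-to-right greedy pass over the same dictionary would pick 研究生 first and leave 命 stranded.

```python
def backward_max_match(sentence, dictionary, max_len=5):
    """Greedy right-to-left segmentation using the longest dictionary match."""
    words = []
    end = len(sentence)
    while end > 0:
        # Try the longest window ending at `end`, shrinking from the left.
        for size in range(min(max_len, end), 0, -1):
            candidate = sentence[end - size:end]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    # Words were collected from right to left, so restore reading order.
    words.reverse()
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dict = {"研究", "研究生", "生命", "起源"}
print(backward_max_match("研究生命的起源", toy_dict))
# ['研究', '生命', '的', '起源']
```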

2.3 Maximum matching & consideration of semantics

Every candidate segmentation that the dictionary allows is scored with a language model. The higher the probability, the better the segmentation.

  • Step 1: Generate all word segmentation results
  • Step 2: Choose the best word segmentation result through the language model

Disadvantage: the time complexity is high, because the number of candidate segmentations can grow exponentially with the length of the sentence.
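
The two steps can be sketched as follows, assuming a unigram language model in which the probability of a segmentation is the product of its word probabilities. The probabilities in `toy_probs` are made-up numbers, and all function names are illustrative.

```python
import math


def all_segmentations(sentence, dictionary):
    """Step 1: return every way to split `sentence` into dictionary words."""
    if not sentence:
        return [[]]
    results = []
    for end in range(1, len(sentence) + 1):
        word = sentence[:end]
        if word in dictionary:
            for rest in all_segmentations(sentence[end:], dictionary):
                results.append([word] + rest)
    return results


def unigram_log_prob(words, word_probs):
    """Log probability of a segmentation under a unigram model."""
    return sum(math.log(word_probs[w]) for w in words)


def best_segmentation(sentence, word_probs):
    """Step 2: pick the candidate with the highest language-model score."""
    candidates = all_segmentations(sentence, word_probs)
    return max(candidates,
               key=lambda ws: unigram_log_prob(ws, word_probs),
               default=None)  # None when no segmentation exists


# Made-up unigram probabilities, for illustration only.
toy_probs = {"经常": 0.1, "经": 0.05, "有": 0.1, "有意见": 0.1, "意见": 0.2,
             "分歧": 0.2, "见": 0.05, "意": 0.05, "见分歧": 0.05, "分": 0.1}
print(best_segmentation("经常有意见分歧", toy_probs))
# ['经常', '有意见', '分歧']
```

Log probabilities are summed instead of multiplying raw probabilities, which avoids numerical underflow on long sentences.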

2.4 Viterbi Algorithm

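The Viterbi algorithm is essentially dynamic programming.

Treat the positions between characters as nodes and each dictionary word as an edge weighted by its negative log probability. The best segmentation is then the shortest path from the start of the sentence to the end, so candidate generation and scoring happen in a single pass.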
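
A sketch of this shortest-path view, assuming a unigram model with edge cost `-log p(word)`. The function name and the probabilities in `toy_probs` are made up for illustration.

```python
import math


def viterbi_segment(sentence, word_probs):
    """Segment `sentence` by finding the cheapest path through word edges."""
    n = len(sentence)
    # best_cost[i]: lowest total cost of segmenting sentence[:i].
    best_cost = [math.inf] * (n + 1)
    best_cost[0] = 0.0
    # back[i]: start index of the last word on the best path ending at i.
    back = [0] * (n + 1)

    for end in range(1, n + 1):
        for start in range(end):
            word = sentence[start:end]
            if word in word_probs and best_cost[start] < math.inf:
                cost = best_cost[start] - math.log(word_probs[word])
                if cost < best_cost[end]:
                    best_cost[end] = cost
                    back[end] = start

    if best_cost[n] == math.inf:
        return None  # the dictionary cannot cover the sentence

    # Follow the back pointers from the end of the sentence to the start.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(sentence[start:end])
        end = start
    words.reverse()
    return words


# Made-up unigram probabilities, for illustration only.
toy_probs = {"经常": 0.1, "经": 0.05, "有": 0.1, "有意见": 0.1, "意见": 0.2,
             "分歧": 0.2, "见": 0.05, "意": 0.05, "见分歧": 0.05, "分": 0.1}
print(viterbi_segment("经常有意见分歧", toy_probs))
# ['经常', '有意见', '分歧']
```

Each position keeps only its best incoming path, so the number of substrings examined is quadratic in the sentence length rather than exponential.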

Two, Spell Correction

1. Edit distance


Edit distance, computed with dynamic programming, measures how similar two strings are. It has many applications, one of which is spell correction. Given two strings str1 and str2, the edit distance is the minimum cost of converting str1 into str2.

For example:

Input: str1 = "geek", str2 = "gesek"
Output: 1
Insert 's' to convert str1 into str2.

Input: str1 = "cat", str2 = "cut"
Output: 1
Replace 'a' with 'u' to get str2.

Input: str1 = "sunday", str2 = "saturday"
Output: 3
Insert 'a' and 't', then replace 'n' with 'r'.

We assume three operations: (1) insert a character, (2) replace a character, (3) delete a character. Each operation costs 1.

# Dynamic programming solution
def edit_dist(str1, str2):

    # m and n are the lengths of str1 and str2
    m, n = len(str1), len(str2)

    # 2-D table storing the answers to the sub-problems:
    # dp[i][j] is the edit distance between str1[:i] and str2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    # Fill the table bottom-up
    for i in range(m + 1):
        for j in range(n + 1):

            # If the first string is empty, the cost is j (j insertions)
            if i == 0:
                dp[i][j] = j

            # Likewise, if the second string is empty, the cost is i (i deletions)
            elif j == 0:
                dp[i][j] = i

            # If the last characters are equal, no cost is incurred
            elif str1[i-1] == str2[j-1]:
                dp[i][j] = dp[i-1][j-1]

            # Otherwise consider all three operations and take the cheapest
            else:
                dp[i][j] = 1 + min(dp[i][j-1],        # Insert
                                   dp[i-1][j],        # Remove
                                   dp[i-1][j-1])      # Replace

    return dp[m][n]
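
To connect edit distance back to spell correction, here is one possible sketch: rank the words of a vocabulary by their distance to the misspelled input and keep those within a threshold. The `suggest` helper, the `max_distance` threshold and the toy vocabulary are assumptions for illustration. The block carries its own compact `edit_distance` (two rolling rows instead of a full table) so that it runs on its own.

```python
def edit_distance(a, b):
    """Edit distance with unit costs, keeping only two rows of the table."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # converting a[:i] to the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1])
            else:
                curr.append(1 + min(curr[j - 1],   # insert
                                    prev[j],       # delete
                                    prev[j - 1]))  # replace
        prev = curr
    return prev[-1]


def suggest(word, vocabulary, max_distance=2):
    """Return vocabulary words within `max_distance`, closest first."""
    scored = sorted((edit_distance(word, v), v) for v in vocabulary)
    return [v for d, v in scored if d <= max_distance]


# Hypothetical toy vocabulary, for illustration only.
vocabulary = ["apple", "apply", "ample", "banana"]
print(suggest("appl", vocabulary))
# ['apple', 'apply', 'ample']
```

Scanning the whole vocabulary for every query is slow for large dictionaries. A practical system would also weigh candidates by how likely each word is, not only by distance.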


Three, Stop Words Removal

In NLP applications, stop words and low-frequency words are usually filtered out first.

This is similar to feature selection.

Mature tools exist, for example the NLTK stop word list.
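
A minimal sketch of both filters, using only the standard library. The tiny `STOP_WORDS` set and the `min_count` threshold are assumptions for illustration; in practice the set could be replaced by NLTK's list, `nltk.corpus.stopwords.words("english")`, after downloading the `stopwords` corpus.

```python
from collections import Counter

# Tiny hand-written stop word set, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def filter_tokens(documents, stop_words=STOP_WORDS, min_count=2):
    """Drop stop words and words seen fewer than `min_count` times overall."""
    counts = Counter(tok for doc in documents for tok in doc)
    return [
        [tok for tok in doc
         if tok not in stop_words and counts[tok] >= min_count]
        for doc in documents
    ]


docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "cat", "ate", "a", "fish"],
]
print(filter_tokens(docs))
# [['cat'], ['cat']]
```

Frequencies are counted over the whole corpus, so a word that is rare in one document but common overall is kept.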


Four, Stemming (word normalization)

Mature tools exist, for example the Porter stemmer:
https://tartarus.org/martin/PorterStemmer/java.txt
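
To give a feel for what a stemmer does, here is a deliberately naive suffix stripper. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem. The suffix list and `min_stem_len` are assumptions chosen for illustration. For real use, a tested implementation such as NLTK's `nltk.stem.PorterStemmer` is the safer choice.

```python
# Suffixes are checked in this order; longer ones come first.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word


for w in ["jumping", "jumped", "jumps", "boxes", "went", "is"]:
    print(w, "->", naive_stem(w))
# jumping -> jump
# jumped -> jump
# jumps -> jump
# boxes -> box
# went -> went
# is -> is
```

Irregular forms such as "went" are left untouched, which is one reason rule-based stemming is only an approximation of word normalization.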

Five, Sentence Similarity

1. Euclidean distance
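
For two sentence vectors u and v, the Euclidean distance is the square root of the sum of squared component differences; a smaller distance means the sentences are more similar. The sketch below assumes sentences are represented as word-count vectors over a shared vocabulary. That representation and the helper names are illustrative assumptions.

```python
import math
from collections import Counter


def to_count_vectors(sent_a, sent_b):
    """Turn two whitespace-tokenized sentences into aligned count vectors."""
    a, b = Counter(sent_a.split()), Counter(sent_b.split())
    vocab = sorted(set(a) | set(b))
    return [a[w] for w in vocab], [b[w] for w in vocab]


def euclidean_distance(u, v):
    """Square root of the sum of squared component differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


u, v = to_count_vectors("we like nlp", "we like nlp we like nlp")
print(u, v)                      # [1, 1, 1] [2, 2, 2]
print(euclidean_distance(u, v))  # sqrt(3), about 1.732
```

The second sentence only repeats the first, yet the distance is not zero: Euclidean distance is sensitive to vector length, not just direction.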

2. Cosine similarity
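
Cosine similarity is the dot product of two vectors divided by the product of their norms, so it depends only on the angle between them; a larger value means the sentences are more similar. A minimal sketch follows, with example vectors chosen for illustration; returning 0.0 for a zero vector is a convention assumed here.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_u * norm_v)


# Same direction, different length: similarity is (up to rounding) 1.
print(cosine_similarity([1, 1, 1], [2, 2, 2]))
# No shared non-zero component: similarity is 0.
print(cosine_similarity([1, 0], [0, 1]))
```

Because scaling a vector does not change the angle, a sentence and a repeated copy of it are treated as identical under this measure.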






Origin blog.csdn.net/u013250861/article/details/113622118