1. Word Segmentation
1. Word segmentation tool
- Jieba word segmentation: https://github.com/fxsjy/jieba
- SnowNLP: https://github.com/isnowfy/snownlp
- LTP: http://www.ltp-cloud.com/
- HanLP: https://github.com/hankcs/HanLP/
- PKUseg
2. Word segmentation algorithm
All of the following segmentation algorithms rely on an existing dictionary to split a sentence into words
2.1 Forward-max matching
Disadvantage: cannot take semantics into account
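A minimal sketch of forward maximum matching: scan left to right, and at each position greedily take the longest dictionary word. The dictionary and maximum word length below are made up for illustration.

```python
def forward_max_match(sentence, vocab, max_len=5):
    """Greedy forward maximum matching: at each position, take the
    longest dictionary word that matches, scanning left to right."""
    words = []
    i = 0
    while i < len(sentence):
        # try the longest window first, shrink until a match (or a single char)
        for j in range(min(i + max_len, len(sentence)), i, -1):
            if sentence[i:j] in vocab or j == i + 1:
                words.append(sentence[i:j])
                i = j
                break
    return words

# toy dictionary for illustration only
vocab = {"我们", "经常", "有", "意见", "分歧", "有意见"}
print(forward_max_match("我们经常有意见分歧", vocab))
# -> ['我们', '经常', '有意见', '分歧']
```

Note that the greedy choice of "有意见" over "有 / 意见" is exactly the kind of semantic mistake the text mentions: the longest match is not always the right one.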
2.2 Backward-max matching
Forward and backward maximum matching produce the same segmentation for roughly 90% of sentences.
Disadvantage: cannot take semantics into account
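Backward maximum matching is the mirror image: scan right to left, taking the longest dictionary word that ends at the current position. A sketch under the same toy-dictionary assumption as above:

```python
def backward_max_match(sentence, vocab, max_len=5):
    """Backward maximum matching: scan right to left, taking the
    longest dictionary word that ends at the current position."""
    words = []
    i = len(sentence)
    while i > 0:
        # smallest j = longest candidate; fall back to a single char
        for j in range(max(i - max_len, 0), i):
            if sentence[j:i] in vocab or j == i - 1:
                words.append(sentence[j:i])
                i = j
                break
    return list(reversed(words))

# toy dictionary for illustration only
vocab = {"我们", "经常", "有", "意见", "分歧", "有意见"}
print(backward_max_match("我们经常有意见分歧", vocab))
# -> ['我们', '经常', '有意见', '分歧']
```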
2.3 Maximum matching & consideration of semantics
Generate every segmentation the dictionary allows, then score each candidate with a "language model"; the segmentation with the highest probability is taken as the best one
- Step 1: Generate all word segmentation results
- Step 2: Choose the best word segmentation result through the language model
Disadvantage: high time complexity, because the number of candidate segmentations grows exponentially with sentence length
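The two steps above can be sketched as follows, using a toy unigram language model (the word probabilities are made up; a real system would estimate them from a large corpus):

```python
import math

# hypothetical unigram probabilities for illustration only
word_prob = {"我们": 0.08, "经常": 0.05, "有": 0.10,
             "意见": 0.04, "分歧": 0.03, "有意见": 0.005}

def all_segmentations(sentence):
    """Step 1: enumerate every segmentation whose pieces are all in
    the dictionary (exponential in the worst case)."""
    if not sentence:
        return [[]]
    results = []
    for i in range(1, len(sentence) + 1):
        word = sentence[:i]
        if word in word_prob:
            for rest in all_segmentations(sentence[i:]):
                results.append([word] + rest)
    return results

def best_segmentation(sentence):
    """Step 2: score each candidate with the unigram LM (sum of
    log-probabilities) and keep the highest-scoring one."""
    candidates = all_segmentations(sentence)
    return max(candidates,
               key=lambda ws: sum(math.log(word_prob[w]) for w in ws))

print(best_segmentation("我们经常有意见分歧"))
# -> ['我们', '经常', '有意见', '分歧']
```

The enumeration in step 1 is what makes this approach expensive, which is exactly the disadvantage noted above.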
2.4 Viterbi Algorithm
The Viterbi algorithm is essentially dynamic programming (Dynamic Programming)
Segmentation is cast as a shortest-path problem: each dictionary word is an edge weighted by -log P(word), and the best segmentation is the minimum-cost path from the start of the sentence to the end
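A sketch of the shortest-path formulation, reusing the toy unigram probabilities from before (made up for illustration). Instead of enumerating all segmentations, one DP pass fills a table of best costs:

```python
import math

# hypothetical unigram probabilities; edge weight = -log P(word)
word_prob = {"我们": 0.08, "经常": 0.05, "有": 0.10,
             "意见": 0.04, "分歧": 0.03, "有意见": 0.005}

def viterbi_segment(sentence, max_len=5):
    """dist[i] = cost of the cheapest segmentation of sentence[:i];
    every dictionary word spanning j..i is an edge of weight
    -log P(word). Runs in O(n * max_len) table fills."""
    n = len(sentence)
    dist = [math.inf] * (n + 1)
    back = [0] * (n + 1)  # back[i]: start of the last word on the best path to i
    dist[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            w = sentence[j:i]
            if w in word_prob and dist[j] - math.log(word_prob[w]) < dist[i]:
                dist[i] = dist[j] - math.log(word_prob[w])
                back[i] = j
    # walk the back-pointers to recover the path
    words, i = [], n
    while i > 0:
        words.append(sentence[back[i]:i])
        i = back[i]
    return list(reversed(words))

print(viterbi_segment("我们经常有意见分歧"))
# -> ['我们', '经常', '有意见', '分歧']
```

This returns the same answer as the exhaustive search in 2.3 but avoids its exponential blow-up.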
2. Spell Correction
1. Edit distance
Edit distance, computed with dynamic programming, measures the similarity of two strings. It has many applications, one of which is spell correction. Given two strings str1 and str2, the edit distance is the minimum cost of converting str1 into str2.
For example:
Input: str1 = "geek", str2 = "gesek"
Output: 1
Insert 's' into str1 to obtain str2
Input: str1 = "cat", str2 = "cut"
Output: 1
Replace 'a' with 'u' to obtain str2
Input: str1 = "sunday", str2 = "saturday"
Output: 3
Insert 'a' and 't', and replace 'n' with 'r'
We allow three operations, each with cost 1: (1) insert a character, (2) replace a character, (3) delete a character.
# Dynamic-programming solution
def edit_dist(str1, str2):
    # m, n are the lengths of str1 and str2
    m, n = len(str1), len(str2)
    # 2-D table for sub-problem answers:
    # dp[i][j] = edit distance between str1[:i] and str2[:j]
    dp = [[0 for _ in range(n + 1)] for _ in range(m + 1)]
    # fill the table bottom-up
    for i in range(m + 1):
        for j in range(n + 1):
            # if str1 is empty, the cost is j (j insertions)
            if i == 0:
                dp[i][j] = j
            # likewise, if str2 is empty, the cost is i (i deletions)
            elif j == 0:
                dp[i][j] = i
            # if the last characters match, they add no cost
            elif str1[i-1] == str2[j-1]:
                dp[i][j] = dp[i-1][j-1]
            # otherwise take the cheapest of the three operations
            else:
                dp[i][j] = 1 + min(dp[i][j-1],    # insert
                                   dp[i-1][j],    # delete
                                   dp[i-1][j-1])  # replace
    return dp[m][n]
3. Stop Words Removal
In NLP applications, stop words and very low-frequency words are usually filtered out first
The process is similar to feature selection
Mature resources exist, such as NLTK's stop-word lists
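A minimal sketch of both filters together. The stop-word set here is a tiny hand-written stand-in; in practice you would use a full list such as the one shipped with NLTK (`nltk.corpus.stopwords`):

```python
from collections import Counter

# tiny illustrative stop-word list; real lists are much longer
stop_words = {"the", "a", "an", "is", "to", "of", "and", "in"}

def filter_tokens(docs, min_freq=1):
    """Drop stop words, plus any word appearing fewer than
    min_freq times across the whole corpus."""
    counts = Counter(w for doc in docs for w in doc)
    return [[w for w in doc
             if w not in stop_words and counts[w] >= min_freq]
            for doc in docs]

docs = [["the", "movie", "is", "great"],
        ["the", "movie", "is", "boring", "and", "long"]]
print(filter_tokens(docs))
# -> [['movie', 'great'], ['movie', 'boring', 'long']]
```

Raising `min_freq` additionally prunes rare words, which acts like a crude feature-selection step.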
4. Stemming (word normalization)
Mature tools exist, e.g. the Porter stemmer:
https://tartarus.org/martin/PorterStemmer/java.txt
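To give the flavor of suffix stripping, here is a toy stemmer. It is not the Porter algorithm (which applies ordered rule phases with measure conditions); the suffix list and length guard below are invented for illustration:

```python
def naive_stem(word):
    """Toy suffix-stripping stemmer: remove the first matching suffix,
    but only if a stem of at least 3 characters remains."""
    # longer suffixes first so "es" is tried before "s", etc.
    for suffix in ("ization", "ational", "fulness", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([naive_stem(w) for w in ["playing", "cats", "boxes", "walked"]])
# -> ['play', 'cat', 'box', 'walk']
```

For real applications, use a tested implementation such as the Porter stemmer linked above rather than hand-rolled rules like these.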
5. Sentence Similarity
1. Euclidean distance
2. Cosine similarity
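Both metrics can be computed over simple bag-of-words count vectors. A sketch, where the helper `bow_vectors` and the example sentences are invented for illustration:

```python
import math
from collections import Counter

def bow_vectors(s1, s2):
    """Represent two tokenized sentences as bag-of-words count
    vectors over their shared vocabulary."""
    vocab = sorted(set(s1) | set(s2))
    c1, c2 = Counter(s1), Counter(s2)
    return [c1[w] for w in vocab], [c2[w] for w in vocab]

def euclidean(v1, v2):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def cosine(v1, v2):
    """Cosine similarity: 1 means same direction, 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm

v1, v2 = bow_vectors(["i", "like", "nlp"], ["i", "like", "deep", "learning"])
print(euclidean(v1, v2), cosine(v1, v2))  # approximately 1.732 and 0.577
```

Note the difference in behavior: Euclidean distance is sensitive to vector length (and hence sentence length), while cosine similarity only compares direction.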
Reference material:
- Chinese word segmentation engine, Java implementation: forward maximum, reverse maximum, and bidirectional maximum matching
- Spelling Correction and the Noisy Channel