Stop word expansion based on pointwise mutual information

Table of contents

1. The cause of the problem

2. Algorithm thinking

3. Expansion steps

    1. Training corpus

    2. Seed stop words

    3. Segmentation of corpus

    4. Count the co-occurrence words of stop words

    5. Calculate PMI

    6. Save the result

4. Results analysis

1. The cause of the problem

I recently took part in developing CSDN's question-and-answer module. One task was to expand the stop word list, which is used to improve question quality: a question title should describe the problem as fully as possible and avoid meaningless words such as "Xiaobai" (newbie), "Big brother", and "help". Such words contribute nothing to the question, so for now we call them function words.

At present the list of function words is provided by the operations team and contains only a few dozen entries, so it needs to be expanded. Here we follow the expansion scheme used for the SO-PMI sentiment dictionary.

2. Algorithm thinking

Its core ideas mainly include:

1. Use the co-occurrence of words to expand candidate stop words;

2. Use pointwise mutual information (PMI) to measure how strongly two words are associated. The larger the PMI value, the stronger the association.

     PMI(x, y) = log( p(x, y) / (p(x) p(y)) )
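
As a small worked illustration of the formula, the sketch below estimates PMI from raw counts. The function name pmi, its arguments, and the toy numbers are assumptions made for this example and are not taken from the original implementation. The natural logarithm is used here; any base works as long as it is used consistently.

```python
import math

def pmi(pair_count, x_count, y_count, total):
    """PMI(x, y) = log(p(x, y) / (p(x) * p(y))), probabilities estimated from counts."""
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log(p_xy / (p_x * p_y))

# Hypothetical counts out of 1000 observations.
# x and y co-occur exactly as often as chance predicts: PMI is close to 0.
independent = pmi(pair_count=10, x_count=100, y_count=100, total=1000)
# x and y co-occur five times more often than chance predicts: PMI is positive.
associated = pmi(pair_count=50, x_count=100, y_count=100, total=1000)
```

A PMI near zero means the two words appear together about as often as independent words would, so only clearly positive values suggest a real association.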

3. Expansion steps

1. Training corpus

The corpus is a tagged question-and-answer dataset, with up to 10,000 records taken per tag. Each record contains a title, content, and answer; only the question title is used here. A sample record:

{
  "question_id": 1674,
  "question_title": "What are the advantages and disadvantages of using ftp and http for the image upload and download functions of mobile social applications?",
  "question_created_at": 1363039735,
  "question_content": "A popular social application on mobile phones, you can view friends' photo albums, and you can also take photos and upload them and share them with friends. What are the advantages and disadvantages of using ftp and http technologies to upload and download respectively?",
  "tag_id": 2,
  "tag_name": "http",
  "tags": [
    {"tag_id": 502, "tag_name": "社交"},
    {"tag_id": 503, "tag_name": "Picture"},
    {"tag_id": 504, "tag_name": "ftp"},
    {"tag_id": 2, "tag_name": "http"}
  ]
}

2. Seed stop words

The stop word list invalid_stopwords.txt, collected by the operations team, is used as the seed stop words.

3. Segmentation of corpus

Use the jieba word segmenter: first add the seed stop words to the user-defined dictionary, then segment the corpus (currently the question_title field) and remove punctuation marks based on the part-of-speech tags.
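
The segmentation step could look roughly like the sketch below. The helper names build_jieba_tagger and segment_titles are hypothetical, and the tag "n" given to seed words is an assumption: as far as we know, jieba's part-of-speech tagger labels words it has no tag for as "x", the same tag it gives punctuation, so seed words are registered with an explicit tag to keep them from being filtered out. The tagger is passed in as a parameter so that the filtering logic does not depend on jieba being installed.

```python
def build_jieba_tagger(seed_words):
    """Register the seed stop words with jieba and return its POS-tagging cut function."""
    # Imported lazily: jieba is a third-party package (pip install jieba).
    import jieba
    import jieba.posseg as pseg
    for word in seed_words:
        # Keep each seed word as one token. The tag "n" is an assumption (see lead-in).
        jieba.add_word(word, tag="n")
    return pseg.cut

def segment_titles(titles, pos_cut):
    """Segment each title into a word list, dropping tokens tagged "x" (punctuation)."""
    segmented = []
    for title in titles:
        words = [w for w, flag in pos_cut(title) if flag != "x" and w.strip()]
        segmented.append(words)
    return segmented

# Intended usage, with titles taken from the question_title field:
# segmented = segment_titles(titles, build_jieba_tagger(seed_words))
```

The result is one list of words per title, which is the input the co-occurrence counting step works on.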

4. Count the co-occurrence words of stop words

Traverse each word in the corpus and take the window_size words before and after it. Words that co-occur with a seed stop word inside this window become candidate stop words, and the frequency of each co-occurring word pair is counted.

Count the word frequency of candidate stop words.
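
A minimal sketch of this counting step, assuming the corpus has already been segmented into one word list per title. The function name count_cooccurrence, the default window size, and the choice to exclude seed words from the candidates are assumptions for illustration only.

```python
from collections import Counter

def count_cooccurrence(segmented_titles, seed_words, window_size=5):
    """Return (word_freq, pair_freq) where pair_freq is keyed by (seed, candidate)."""
    seeds = set(seed_words)
    word_freq = Counter()
    pair_freq = Counter()
    for words in segmented_titles:
        word_freq.update(words)
        for i, word in enumerate(words):
            if word not in seeds:
                continue
            # Up to window_size words on each side of the seed stop word.
            start = max(0, i - window_size)
            window = words[start:i] + words[i + 1:i + 1 + window_size]
            for neighbor in window:
                if neighbor not in seeds:
                    pair_freq[(word, neighbor)] += 1
    return word_freq, pair_freq
```

Both counters are needed later: pair_freq supplies the joint counts and word_freq supplies the individual counts used in the PMI formula.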

5. Calculate PMI

For each candidate stop word, compute its PMI with every seed stop word and sum these values; the sum is the final PMI score of the candidate.
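
The scoring step can be sketched as follows. How the probabilities are estimated is not stated in this article, so the sketch makes two assumptions: word probabilities are word counts divided by the total word count, and pair probabilities are pair counts divided by the total pair count. Seed and candidate pairs that were never observed are skipped, because the logarithm of zero is undefined. The function name score_candidates is hypothetical.

```python
import math
from collections import defaultdict

def score_candidates(word_freq, pair_freq):
    """Sum PMI(seed, candidate) over all observed seeds for each candidate."""
    total_words = sum(word_freq.values())
    total_pairs = sum(pair_freq.values())
    scores = defaultdict(float)
    for (seed, candidate), count in pair_freq.items():
        p_pair = count / total_pairs
        p_seed = word_freq[seed] / total_words
        p_candidate = word_freq[candidate] / total_words
        scores[candidate] += math.log(p_pair / (p_seed * p_candidate))
    return dict(scores)

# Hypothetical counts: "ftp" co-occurs once with each of two seed words.
example_scores = score_candidates(
    {"help": 2, "please": 2, "ftp": 4},
    {("help", "ftp"): 1, ("please", "ftp"): 1},
)
```

Because the score is a sum over seeds, a candidate that appears near many different seed stop words accumulates a higher score than one that appears near only a single seed.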

6. Save the result

Sort the candidate words by PMI score in descending order and save them to the dictionary file.
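
A short sketch of the save step. The tab-separated "word, score" line format and the function name save_scores are assumptions; the actual format of the dictionary file is not given in this article.

```python
def save_scores(scores, path):
    """Write candidates to path, highest PMI score first, one "word<TAB>score" per line."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    with open(path, "w", encoding="utf-8") as f:
        for word, score in ranked:
            f.write(f"{word}\t{score:.4f}\n")
    return ranked
```

Keeping the score next to each word makes the later manual review easier, since a reviewer can stop reading once the scores drop below a useful level.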

4. Results analysis

Because the score is based only on word co-occurrence and word frequency, the algorithm does find some new function words, but it also returns many words that are not function words, so the results have to be filtered manually. The selected words are as follows:

The results show that, because the algorithm has no semantic information, the expanded words must be proofread by hand before they can be used to help identify invalid titles. Combining the approach with word2vec might give a better expansion.

Origin blog.csdn.net/zxm2015/article/details/118275216