User comment tag extraction

Original link: https://blog.csdn.net/shijing_0214/article/details/71036808

I happened to see a question on Zhihu: how does Taobao's review summarization work? After looking into it, it seemed fairly easy to implement, so I built a simple tag-extraction feature for user reviews. It was purely out of interest, so the work is not very thorough: for example, the word vectors were trained on less than 3 MB of review corpus, and the dictionaries were put together from words found casually online. See CommentsMining for the code.
First, let's look at what review tag extraction does, as shown in the figure:


[Figure: review tags displayed in a box above the list of reviews]

From the review corpus shown below, we want to automatically extract the consensus review tags shown in the rectangular box above and display them, so that users can quickly grasp the product's characteristics. The result of a simple tag-extraction implementation for one product is as follows:

[Figure: tags extracted for one product]

The extracted tags look reasonable.

Here I mainly combine syntactic parsing, word2vec, DBSCAN, and dictionaries to implement this feature. The specific steps are as follows:
1. Corpus collection
Use a crawler to collect some review corpora for women's shirts from an e-commerce site. The data consists of a training set and a test set. The test set contains only the reviews of one particular shirt and is used for the final tag extraction; the training set contains the reviews of many shirts and is used for word2vec training. The processed training set looks like this:


[Figure: sample of the processed training set]

2. Training review word vectors
Using the training set obtained above, first segment the text with Stanford's Chinese word segmenter and remove stop words (see the tutorial here). Then hand the segmented text to word2vec to train word vectors on the review corpus. The corpus used here is less than 3 MB; to get better word vectors, consider enlarging it. The word vectors here are 50-dimensional.
3. Dependency parsing
Use the Stanford NLP toolkit with the Chinese model files to parse the test corpus. Since the Stanford word segmenter was already used in the previous step, it is convenient to download Stanford CoreNLP together with the Chinese model files and use them directly. The parsing results look like this:

[Figure: dependency parse of a sample review]

4. Formulating extraction rules
Based on the dependency parses of the review corpus from the previous step, summarize extraction rules for review tags, for example:
nsubj + advmod
nsubj + advmod + advmod
advmod + advmod
advmod + amod
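One way to apply rules like these: represent each parsed sentence as (relation, head, dependent) triples and scan for the patterns. A minimal sketch of the nsubj + advmod rule; the triples are hand-written stand-ins for real parser output, not the post's actual rule engine:

```python
# Match the nsubj + advmod pattern against dependency triples of the
# form (relation, head, dependent), as produced by a dependency parser.
def match_nsubj_advmod(triples):
    """Return candidate tags built from advmod dependents attached to a
    head that also has an nsubj, e.g. 质量/很/好 -> '很好'."""
    candidates = []
    heads_with_subj = {head for rel, head, dep in triples if rel == "nsubj"}
    for rel, head, dep in triples:
        if rel == "advmod" and head in heads_with_subj:
            candidates.append(dep + head)
    return candidates

# Hand-written parse of "质量很好" ("the quality is very good")
triples = [("nsubj", "好", "质量"), ("advmod", "好", "很")]
print(match_nsubj_advmod(triples))  # ['很好']
```

The other patterns (advmod + amod, etc.) would be additional small matchers over the same triples.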

5. Obtaining candidate tags
Apply the extraction rules together with a sentiment dictionary to obtain candidate tags.

6. Candidate tag deduplication
The simhash algorithm can be used to deduplicate the candidate tag set.
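A self-contained simhash sketch for this step, assuming character bigrams as features (a simple choice for short Chinese tags; the post does not specify its feature set):

```python
# Minimal simhash: hash each tag's character bigrams into a 64-bit
# fingerprint; tags within a small Hamming distance count as duplicates.
import hashlib

def simhash(text, bits=64):
    v = [0] * bits
    feats = [text[i:i + 2] for i in range(len(text) - 1)] or [text]
    for f in feats:
        h = int(hashlib.md5(f.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

def dedup(tags, threshold=3):
    kept, fps = [], []
    for t in tags:
        fp = simhash(t)
        if all(hamming(fp, k) > threshold for k in fps):
            kept.append(t)
            fps.append(fp)
    return kept

tags = ["质量很好", "质量很好", "颜色很深"]
print(dedup(tags))  # the exact duplicate is dropped, distinct tags survive
```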

7. Candidate tag clustering
Use DBSCAN together with the word2vec vectors to cluster candidate tags, grouping semantically similar tags into the same cluster to achieve semantic deduplication. DBSCAN gathers related tags together without introducing much noise.
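A pure-Python DBSCAN sketch over cosine distance. In the post the inputs would be the tags' word2vec vectors; the 2-D vectors here are toy stand-ins so the example stays self-contained:

```python
# DBSCAN over cosine distance: cluster ids start at 0, -1 marks noise.
import math

def cos_dist(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def dbscan(vecs, eps=0.05, min_pts=2):
    n = len(vecs)
    labels = [None] * n  # None = unvisited, -1 = noise
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if cos_dist(vecs[i], vecs[j]) <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1  # not a core point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:          # noise reachable from a core point
                labels[j] = cluster      # becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = [k for k in range(n) if cos_dist(vecs[j], vecs[k]) <= eps]
            if len(nbrs) >= min_pts:     # j is also a core point: expand
                queue.extend(k for k in nbrs if labels[k] is None)
    return labels

# Toy tag vectors: two nearly parallel vectors plus one orthogonal outlier
vecs = [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0)]
print(dbscan(vecs))  # [0, 0, -1]: first two cluster together, third is noise
```

`eps` and `min_pts` would need tuning against the real 50-dimensional vectors.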

8. Obtaining the target tags
For each cluster, compute its center and return the tag closest to that center. Before returning, this tag can be filtered, e.g. trimming a phrase like "the color is dark" down to "dark color".
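Picking a cluster's representative tag can be sketched as follows; the tags and vectors are illustrative, not the post's data:

```python
# For one cluster: average the member vectors to get the center, then
# return the tag whose vector is closest to that center.
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def representative(tags, vecs):
    center = [sum(dim) / len(vecs) for dim in zip(*vecs)]
    best = min(range(len(tags)), key=lambda i: euclid(vecs[i], center))
    return tags[best]

cluster_tags = ["质量很好", "质量不错", "质量挺好"]
cluster_vecs = [(0.9, 0.1), (0.8, 0.3), (0.85, 0.2)]
print(representative(cluster_tags, cluster_vecs))  # 质量挺好
```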

9. Summary
The above is a simple implementation of user review tag extraction. To do it more thoroughly, I think the following points could be refined:
1. Training corpus size: for convenience of implementation, only a little over 2 MB of corpus was crawled, which is too small; the resulting word vectors will not be very accurate, and good word vectors are very important for clustering.
2. Construction of the stop-word list and sentiment dictionary: a good dictionary should be built from the corpus itself. The stop-word list and sentiment dictionary here were patched together from lists found online, so the results will not be very good.
3. Extraction rules: the rules designed here are relatively simple, so the candidate tags contain a lot of noise, and this noise has a fairly large impact on the quality of the extracted tags.

References
1. Li Piji. Tag Extraction and Ranking in User Reviews, 2012.


Origin: blog.csdn.net/stay_foolish12/article/details/112788172