sklearn LDA topic analysis and user recommendation

LDA topic analysis is mainly used for text classification and prediction, and its output can also serve as a basis for recommendation.

This article contains no mathematical formulas; it only gives the workflow of an LDA analysis.

Steps:

1. Start with a set of text articles on different topics (the topics themselves are unknown). The text has already been word-segmented and had its stop words removed, for example in the following format:

You come here to eat rice meal feel very good to eat

2. Read the data into memory, with each article stored as a single string, so that the articles form a one-dimensional array of strings.

3. Create the word-vector container for the text array (think of it as an object that puts the data into a form suitable for training).

4. Once the container is created, fit it on the text array to obtain the training word vectors (usually word counts).

5. Given one or more test texts, process them the same way as the training text. Do not fit a new word-vector container; instead, transform the test array with the container already fitted on the training text to obtain the corresponding test word vectors.

6. Decide the number of topics, the number of iterations, and the other parameters, and create the LDA object. Beginners are advised to change only the number of topics, so as not to get confused by too many parameters. The number of topics is generally chosen with reference to the size and dispersion of the training data, and usually does not exceed the number of training documents. You can also train several LDA models and pick the number of topics that gives the best goodness of fit.

7. Train the LDA object on the training word vectors to obtain a trained model.

8. Pass the test word vectors as the argument to the LDA transform function to obtain the result. The result is a two-dimensional array: the number of rows equals the number of test documents, the number of columns equals the number of topics, and each element is the probability that the test document belongs to that topic.
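Steps 2 to 8 can be sketched roughly as follows. This is a minimal illustration, not the author's code: the tiny corpus is made up, and using `CountVectorizer` as the word-vector container and `n_components=2` as the number of topics are assumptions made for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Step 2: hypothetical toy data; each string is one already-segmented article.
train_texts = [
    "rice meal eat restaurant taste good",
    "noodle soup eat restaurant delicious",
    "football match team goal score",
    "basketball team player score game",
]
test_texts = ["eat rice restaurant taste", "team goal match player"]

# Steps 3-4: create the vectorizer and fit it on the training text (word counts).
vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_texts)

# Step 5: reuse the fitted vectorizer on the test text; do not fit it again.
test_counts = vectorizer.transform(test_texts)

# Steps 6-7: create the LDA object (n_components is the number of topics) and train it.
lda = LatentDirichletAllocation(n_components=2, max_iter=50, random_state=0)
lda.fit(train_counts)

# Step 8: one row per test document, one column per topic.
doc_topic = lda.transform(test_counts)
print(doc_topic.shape)  # (2, 2)
print(doc_topic)        # each row is a topic distribution for one test document
```

In older scikit-learn releases the number of topics was passed as `n_topics`; current releases use `n_components`.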

At this point, determining the topics of the text is essentially done, but if you want to make recommendations, read on.

9. For each row of the result, if any value is greater than 0.6, the document is considered to match that topic. If the whole training set comes from the same user, we first find all the topics on the training set that the user is interested in. Then, as long as a row of the test result matches one of those topics, we consider that the user may be interested in that item. Computing the accuracy from this is straightforward.
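The rule in step 9 could be written along these lines. This is a sketch under assumptions: the probability arrays and the ground-truth labels are invented for illustration, and `matched_topics` is a hypothetical helper, not part of scikit-learn. Only the 0.6 threshold comes from the text.

```python
import numpy as np

THRESHOLD = 0.6  # threshold taken from the text

def matched_topics(doc_topic, threshold=THRESHOLD):
    """For each row, return the set of topic indices with probability above the threshold."""
    return [{int(i) for i in np.flatnonzero(row > threshold)} for row in doc_topic]

# Hypothetical document-topic probabilities (rows: documents, columns: topics).
train_doc_topic = np.array([[0.90, 0.05, 0.05],
                            [0.10, 0.80, 0.10],
                            [0.70, 0.20, 0.10]])
test_doc_topic = np.array([[0.85, 0.10, 0.05],
                           [0.10, 0.10, 0.80],
                           [0.40, 0.30, 0.30]])

# Topics the user is interested in: every topic matched by some training document.
user_topics = set().union(*matched_topics(train_doc_topic))  # {0, 1}

# A test document is recommended if it matches at least one of those topics.
predicted = [bool(topics & user_topics) for topics in matched_topics(test_doc_topic)]
# [True, False, False]

# Hypothetical ground truth: whether the user was actually interested in each test item.
actual = [True, False, True]
accuracy = float(np.mean([p == a for p, a in zip(predicted, actual)]))
print(accuracy)  # 2 of 3 predictions agree with the labels
```

Note that a row with no value above the threshold (the last test row here) matches no topic and is therefore never recommended.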

When I tried this, one data set had 700 items with a high degree of clustering; with a 2:8 split between test and training data, the accuracy was about 0.97. Another data set had 4000 items with a higher degree of dispersion, and the accuracy was only 0.9. In principle you can change the number of topics to improve accuracy, but when I raised the number of topics the accuracy actually dropped to 0.87, so that is what I will look into next.
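One way to compare several topic counts is to train one LDA model per candidate and compare a goodness-of-fit score on held-out text. The sketch below uses scikit-learn's `perplexity` method (lower is better); the text does not say which measure the author used, so this choice, the toy corpus, and the candidate values are all assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy data; each string is one already-segmented article.
train_texts = [
    "rice meal eat restaurant taste good",
    "noodle soup eat restaurant delicious",
    "football match team goal score",
    "basketball team player score game",
    "movie actor film director scene",
    "film cinema actor story scene",
]
held_out_texts = ["eat noodle restaurant good", "team player goal game", "actor film story"]

vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_texts)
held_out_counts = vectorizer.transform(held_out_texts)

# Train one model per candidate number of topics and record held-out perplexity.
perplexities = {}
for n_topics in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(train_counts)
    perplexities[n_topics] = lda.perplexity(held_out_counts)

# Lower perplexity indicates a better fit on the held-out text.
best_n_topics = min(perplexities, key=perplexities.get)
print(perplexities)
print(best_n_topics)
```

Perplexity measures how well the model fits the text, not recommendation accuracy, so the topic count it favors may still need checking against the accuracy calculation described above.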

Anyone who wants the code can leave a comment. This article uses Python 3.6 + scikit-learn.


Origin blog.csdn.net/weixin_40631132/article/details/89742753