Shuyun: An Intelligent Recommendation System Based on Book Reviews (Reprinted)

Foreword

The computer design competition is coming up, and I have teamed up with Dongyu and Chengyi to enter the artificial intelligence track.
The main work is to build a web system whose core functions are book collection and book recommendation: the system recommends books to a user based on the book reviews (this is the emphasis) of the books that user has collected.

Innovation

  • Collaborative filtering algorithm based on book tags
  • Natural language processing based on the word2vec method
  • Tag extraction (a proper term for this has not been settled yet)

Approach

  1. Data collection
  2. Text preprocessing of the data
  3. Train the word2vec model
  4. Iteratively obtain tags with the word2vec model
  5. Apply collaborative filtering to the tags to produce recommendations
  6. Web system

Data acquisition

The data comes mainly from a Python crawler written by Dongyu, with Douban Reading as the source. The current crawling efficiency is relatively low, and we are doing our best to find a more effective solution.
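
The crawler itself is not shown in this post. As a rough sketch of the idea, the snippet below pulls the short-review pages of a single book with requests and extracts the review text with BeautifulSoup; the URL pattern and the span.short selector are assumptions about Douban's current page layout and may need adjusting.

```python
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0"}  # Douban tends to reject requests without a UA


def fetch_reviews(book_id, pages=3, delay=2.0):
    """Fetch short reviews for one book from Douban Reading.

    The URL pattern and the 'short' class name are assumptions about the
    current page structure and may need to be adjusted.
    """
    reviews = []
    for page in range(pages):
        url = (f"https://book.douban.com/subject/{book_id}/comments/"
               f"?start={page * 20}&limit=20&status=P&sort=new_score")
        resp = requests.get(url, headers=HEADERS, timeout=10)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        for span in soup.select("span.short"):
            reviews.append(span.get_text(strip=True))
        time.sleep(delay)  # crawl politely; requesting too fast gets the IP blocked
    return reviews
```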

Data text preprocessing

  • Remove HTML tags and line breaks
  • Remove stop words
  • Word segmentation
  • Save as plain text

The details are recorded in another blog post: [Shuyun Notes-0] Text Preprocessing
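
A minimal sketch of these steps, assuming the reviews are Chinese text, jieba is used for word segmentation, and a plain-text stop word list (stopwords.txt, one word per line) is available; these file names are placeholders.

```python
import re

import jieba


def load_stopwords(path="stopwords.txt"):
    # One stop word per line; the file name is an assumption
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def preprocess(raw_text, stopwords):
    # Remove HTML tags and line breaks
    text = re.sub(r"<[^>]+>", "", raw_text)
    text = text.replace("\n", "").replace("\r", "")
    # Word segmentation with jieba, then stop word removal
    words = jieba.lcut(text)
    return [w for w in words if w.strip() and w not in stopwords]


def save_corpus(reviews, stopwords, path="corpus.txt"):
    # Save as plain text: one review per line, words separated by spaces
    with open(path, "w", encoding="utf-8") as f:
        for review in reviews:
            f.write(" ".join(preprocess(review, stopwords)) + "\n")
```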

Train the word2vec model

We mainly use word2vec from Python's gensim package; the model is trained with all the book reviews of each book as its corpus.
Later we may consider training a single model on the combined reviews of a whole category of books.

Reference articles:
  • Gensim-based Word2Vec Practice
  • Deep learning with word2vec

The following parameter explanations are taken from the blog post "word2vec word vector training and the use of gensim":

  • sg=1 selects the skip-gram algorithm, which is more sensitive to low-frequency words; the default sg=0 selects the CBOW algorithm.
  • size is the dimensionality of the output word vectors. If it is too small, words will collide in the mapping and hurt the results; if it is too large, it consumes memory and slows down training. Typical values are between 100 and 200.
  • window is the maximum distance between the current word and the target word within a sentence; window=3 means looking at 3-b words before the target word and b words after it (b is chosen at random between 0 and 3).
  • min_count filters out rare words: words with a frequency lower than min_count are ignored. The default value is 5.
  • negative and sample can be fine-tuned based on the training results; sample is the threshold at which high-frequency words are randomly downsampled, with a default of 1e-3.
  • hs=1 means hierarchical softmax is used; with the default hs=0 and negative non-zero, negative sampling is used instead.
  • workers controls the number of parallel training threads; this parameter only takes effect when Cython is installed, otherwise only a single core is used.

The details will be recorded in another blog post (placeholder, to be written later): [Shuyun Notes-1] Training the word2vec model
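
A minimal training sketch using these parameters, assuming the preprocessed corpus is a text file with one review per line and space-separated words; note that in gensim 4.x the size parameter described above is called vector_size.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# corpus.txt: one review per line, already segmented and space-separated
sentences = LineSentence("corpus.txt")

model = Word2Vec(
    sentences,
    sg=1,             # skip-gram, more sensitive to low-frequency words
    vector_size=150,  # dimension of the word vectors ("size" in gensim < 4.0)
    window=5,         # maximum distance between the current and target word
    min_count=5,      # ignore words that appear fewer than 5 times
    sample=1e-3,      # downsampling threshold for high-frequency words
    hs=0,             # hs=0 together with negative > 0 selects negative sampling
    negative=5,
    workers=4,        # parallel training threads (needs Cython for a real speedup)
)

model.save("shuyun_word2vec.model")
```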

Iteratively obtain tags using the word2vec model

Take the 5 words with the highest frequency in the model, query the 5 closest words for each, and keep iterating until about 100 words have been collected to form the tag set (these numbers are only for experimental reference and will be tuned in detail later).

The details will be recorded in another blog post (placeholder, to be written later): [Shuyun Notes-2] Iteratively obtaining tags with the word2vec model
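
A rough sketch of this iteration, assuming the model trained in the previous step; the seed size (5), neighbourhood size (5), and target of 100 tags are the experimental values mentioned above, and the function name is a placeholder.

```python
from collections import deque

from gensim.models import Word2Vec


def extract_tags(model_path="shuyun_word2vec.model",
                 seed_count=5, neighbours=5, target=100):
    model = Word2Vec.load(model_path)
    wv = model.wv

    # gensim sorts the vocabulary by descending frequency by default,
    # so the first entries are the highest-frequency words
    seeds = wv.index_to_key[:seed_count]

    tags = list(seeds)
    seen = set(seeds)
    queue = deque(seeds)

    # Breadth-first expansion: for each word, pull in its closest
    # neighbours until roughly `target` tags have been collected
    while queue and len(tags) < target:
        word = queue.popleft()
        for similar, _score in wv.most_similar(word, topn=neighbours):
            if similar not in seen:
                seen.add(similar)
                tags.append(similar)
                queue.append(similar)
    return tags[:target]
```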

Collaborative filtering on tags for recommendation

We have not looked into this part yet; the approach will be worked out after our next meeting.
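
The concrete algorithm has not been decided, so nothing here reflects the team's actual design. Purely as an illustration of how tag information could drive a recommendation, the sketch below represents each book by its tag set and ranks uncollected books by Jaccard similarity to the tags of the user's collection (this simple form is closer to content-based filtering than true collaborative filtering; all names are hypothetical).

```python
def jaccard(a, b):
    """Jaccard similarity between two tag sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend(user_books, book_tags, top_n=10):
    """Rank uncollected books by tag similarity to the user's collection.

    user_books: set of book ids the user has collected
    book_tags:  dict mapping book id -> set of tags extracted for that book
    """
    user_tags = set().union(*(book_tags[b] for b in user_books if b in book_tags))
    scores = {
        book: jaccard(user_tags, tags)
        for book, tags in book_tags.items()
        if book not in user_books
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```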

Web system

Build the web application from the front end to the back end.
