Recommendation system learning summary

Overall architecture

  • The user's history/data (likes, reads, forwards, comments, favorites) are transferred to the database (the offline layer). Kafka can be used as the message transport in between: Kafka is a message queue through which the data can be batched before being stored in MongoDB. When the data volume is small, Kafka may be omitted. (A minimal Kafka-to-MongoDB sketch follows this list.)
  • The data platform reads the data from MongoDB and stores it. The data platform also holds content portraits, user portraits, user information, and other data; at this point it has all the data that can be collected.
  • These data are first used for model training (common choices include collaborative filtering, NMF, FM, and YouTube's DNN), which produces a model file. The model file is then consumed by offline computation, which also uses user portraits and user information (how much depends on the model's complexity: collaborative filtering only needs user IDs, while YouTube's DNN needs all of them).
  • After that the candidates can be ranked, the ranked result saved in Redis (for fast reads), and then pushed to the user (or, in the simplest setup, pushed without ranking). [This is the simplest recommendation system.]
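
As a rough illustration of the Kafka step above, here is a minimal sketch (the topic name, server address, and collection names are assumptions, not from the source) that consumes user behavior events and writes them to MongoDB in batches:

from kafka import KafkaConsumer   # kafka-python
import pymongo
import json

# Consume user behavior events (like/read/forward/comment/favorite)
consumer = KafkaConsumer(
    "user_events",                                # assumed topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v),
)
events = pymongo.MongoClient()["recommendation"]["user_events"]

batch = []
for msg in consumer:
    batch.append(msg.value)                       # e.g. {"user_id": "1", "action": "like"}
    if len(batch) >= 100:                         # store in batches, as described above
        events.insert_many(batch)
        batch.clear()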

The above is what the offline layer does. The offline layer generally performs recall, and by itself it constitutes the simplest recommendation system.
Characteristics and functions of the offline layer: 1. It produces the recall data, generating the recall set; 2. It does not need to be real-time and can run every few hours or days (the data volume is relatively small, perhaps one item every few hours or a few items a day, so real-time recall is unnecessary: it would only increase compute cost for little benefit).

Recall collection data form:

{
    {user_id: "1"}: [{'a': 0.001}, {'b': 0.005}, {'c': 0.00001}]
}
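
A minimal sketch of loading such a recall set into Redis as a sorted set (the "recall:<user_id>" key layout is an assumption), so that the scores double as the ranking order:

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# One user's recall set, in the form shown above
recall = {"a": 0.001, "b": 0.005, "c": 0.00001}
r.zadd("recall:1", recall)

print(r.zrevrange("recall:1", 0, -1, withscores=True))  # highest score first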

However, using only offline computation leads to severe homogenization of the ranked results. For example, if there is a lot of domestic news, the ranking may consist entirely of domestic news; or the top ten are domestic news, followed by three movies and two variety shows, with items of the same type stacked together. The user experience is poor. Adding an Online Service therefore lets the different types of content be pushed separately instead of stacked together. The Online Service stage can also integrate real-time user data, user portraits, content portraits, etc. for a real-time computation, but this is generally not necessary.

But at this point we have not used the user's real-time data, only the data produced by the ranking layer, which leads to a problem: if the user's behavior or interests suddenly change, this cannot be captured. Therefore, a nearline (approximately online) layer is added between the Online Service and the offline computation to collect user behavior (a GBDT+LR model can be used; a sketch follows). The nearline layer can generally finish its computation within 3-5 minutes. Its biggest difference from the offline and online layers is that it receives user feedback collected in real time.
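
A rough GBDT+LR sketch (scikit-learn, toy data; the feature layout here is an illustrative assumption, not the source's setup): the leaf indices produced by the GBDT become one-hot features for a logistic-regression click model, which is the classic combination this layer refers to:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X = np.random.rand(1000, 10)                     # toy behavior features
y = np.random.randint(0, 2, 1000)                # toy click labels

gbdt = GradientBoostingClassifier(n_estimators=50).fit(X, y)
leaves = gbdt.apply(X)[:, :, 0]                  # leaf index per tree per sample
enc = OneHotEncoder(handle_unknown="ignore").fit(leaves)
lr = LogisticRegression(max_iter=1000).fit(enc.transform(leaves), y)

def ctr_score(x):
    # Click probability for new samples x (shape: n x 10)
    return lr.predict_proba(enc.transform(gbdt.apply(x)[:, :, 0]))[:, 1]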

Pushed content becomes narrower and narrower: items the user has shown interest in keep being pushed, and everything else never appears. Solution: mix recall sources, partly from user history and partly from content similarity, together with heat-based and time-based recall.


The data center can be understood as databases, data files, and a distributed file system.
The nearline layer also needs to rank; its ranking model must likewise be trained offline in advance in order to obtain the content features.
The offline layer does recall; the nearline and online layers do ranking.
Business logic is generally handled in the Online Service.


The scheduler is used to store scheduled (timed) tasks.
Labels are closely related to portraits (covered below).

Crawler basics and Scrapy

What is a crawler

Incremental updates
Scheduled tasks: a Python timer occupies a thread for as long as it runs; the Linux server itself provides timers (cron).

The principle of crawlers

Crawl web pages, store the data, pre-process it, and provide the related services.
Cookies are used to save login information; they are saved locally on the client, not on the server. Sessions are stored on the server.

What is the Scrapy framework

Create a Scrapy project

(recommendation) D:\Code\PycharmProjects\scrapy_project>scrapy startproject sina
(recommendation) D:\Code\PycharmProjects\scrapy_project>cd sina
(recommendation) D:\Code\PycharmProjects\scrapy_project\sina>scrapy genspider sina2 sina.com.cn
Created spider 'sina2' using template 'basic' in module:
  sina.spiders.sina2

BeautifulSoup framework

Hands-on: use a crawler to crawl website content

Crontab

*	minute (0-59)
*	hour (0-23)
*	day of the month (1-31)
*	month (1-12)
*	day of the week (0-6)

Original scheduled task:
0 1 * * * cd /home/docker-volume/scrapy_project/sina;scrapy crawl sina2 -a page=5 -a flag=1 >> /home/data/sheduler.log

New scheduled task (using a shell script) /home/data/cron.sh:
#! /bin/sh
export PATH=$PATH:/usr/local/bin

echo $PATH
cd /home/docker-volume/scrapy_project/sina
/usr/local/bin/scrapy crawl sina2 -a page=5 -a flag=1 >> /home/data/sheduler.log 2>&1 &

# The absolute path to the scrapy command (/usr/local/bin/scrapy) should be necessary here, since cron runs with a minimal PATH


Content portrait

The operation process of the profiling system


Content portrait construction

What is a content portrait

A collection of features that can be used to describe a piece of content:

  • Keywords
  • Topic
  • Entity words
  • Category
  • The length of a movie
  • The author of an article

Handling content heat (popularity)

Heat can be handled manually: when a hot topic gradually stops receiving attention, its heat value is manually reduced.
This is a kind of rule-based recall (heat-based and time-based recall). It follows Newton's law of cooling: an object whose temperature is higher than its surroundings transfers heat to the surrounding medium and gradually cools down. A small decay sketch follows.
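A minimal sketch of Newton's-cooling-style heat decay (the cooling coefficient is an assumed constant, not from the source):

import math

def decayed_heat(initial_heat, hours_elapsed, cooling_coeff=0.192):
    # Exponential decay of a content heat score. With cooling_coeff = 0.192,
    # the heat halves roughly every ln(2) / 0.192 ~= 3.6 hours.
    return initial_heat * math.exp(-cooling_coeff * hours_elapsed)

print(decayed_heat(100, 0), decayed_heat(100, 3.6), decayed_heat(100, 24))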

How to update the popularity of content

Heat in the recommendation system:
Heat attenuation:
1. Newton's law of cooling
2. A custom heat-attenuation formula

  • When heat needs to be increased:
    • 1. The user acts on the article (like, comment, favorite)
      a. When a user request arrives, operate directly on the database to modify the heat
      b. When a user request arrives, record the numbers of likes and favorites in MongoDB, then compute the heat change from those counts
      There are several ways to update the heat (a Redis-first sketch follows this list):
      a. Operate on the data directly in MongoDB (low efficiency, not suitable for frequent operations, affects QPS)
      b. Operate on the data through Kafka (Kafka acts as an intermediate pipeline that receives the data and passes it to the corresponding program for batch processing; MongoDB is then not updated strictly in real time, but near-real-time is possible (30 seconds, 1 minute, 5 minutes))
      c. Operate on the data in Redis first, then periodically flush it to the MongoDB database
    • 2. Operational designation (1. Temporarily raise the heat value to the maximum. 2. Set up an operations-designated pool)
    • 3. Formula calculation
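
A minimal sketch of option c (the key names, action weights, and collection layout are assumptions): each user action is an O(1) Redis increment, and a periodic job flushes the accumulated deltas to MongoDB in one batch:

import redis
import pymongo

r = redis.Redis(host="localhost", port=6379, db=0)
news = pymongo.MongoClient()["recommendation"]["news"]

ACTION_WEIGHTS = {"like": 2, "comment": 3, "favorite": 5}

def record_action(content_id, action):
    # Called on every user request; touches only Redis, never MongoDB
    r.zincrby("heat_delta", ACTION_WEIGHTS.get(action, 1), content_id)

def flush_heat_to_mongo():
    # Run every 1-5 minutes (e.g. from cron) to batch-apply the deltas
    for content_id, delta in r.zrange("heat_delta", 0, -1, withscores=True):
        news.update_one({"_id": content_id.decode()}, {"$inc": {"heat": delta}})
    r.delete("heat_delta")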

What to do if Redis goes down: 1. Use clustering and sharded storage; 2. Use Pika to persist the data to disk.
In addition to the operations-designated pool, 2 more pools can be added:
1. Hot pool (hot_pool)
2. Latest pool (last_pool)

rec_pool = rec_list - rec_list_for_bussiness - hot_pool - last_pool

rec_list = rec_list_for_bussiness[:2] + hot_pool[:3] + last_pool[:2] + recommendation_system[:8]
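
A small sketch of the mixing rule above (the [:8] slice and the deduplication order are assumptions): 2 operations items, 3 hot items, 2 latest items, then 8 personalized items, with duplicates across pools dropped:

def build_rec_list(rec_list_for_bussiness, hot_pool, last_pool, recommendation_system):
    mixed = (rec_list_for_bussiness[:2] + hot_pool[:3]
             + last_pool[:2] + recommendation_system[:8])
    seen, result = set(), []
    for item in mixed:
        if item not in seen:       # drop items that appear in several pools
            seen.add(item)
            result.append(item)
    return result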

Basic technical points of content profiling

  1. Keyword extraction: the most basic source of labels for item portraits, and a data basis for other text analysis; e.g. TF-IDF, TextRank.
  2. Entity recognition: people, places, organizations, works, films and TV dramas, historical and hot events, etc.; most commonly a dictionary-based method combined with a CRF model. Analogous to word segmentation plus part-of-speech tagging, entity recognition labels each segmented word as one of a defined set of named-entity classes.
  3. Content classification: classify text according to a taxonomy, expressing coarse-grained structured information; e.g. SVM, FastText.
  4. Text clustering: when no taxonomy has been developed, dividing text into multiple clusters without supervision is also very common. Even though a cluster ID is not a human-readable label, the cluster number is also a common component of user portraits.
  5. Topic model: learn topic vectors from a large corpus of existing text, then predict the probability distribution of a new text over the topics; very practical, and in fact also a kind of clustering. A topic vector is not in label form, yet it too is a common component of user portraits. The LDA topic model gives the topics of an article (a small LDA sketch follows this list).
  6. Word embedding: from words up to whole texts, an embedding representation can be learned. Embeddings dig out the semantic information beneath the literal surface and express it in a limited number of dimensions, producing dense word vectors.
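
A small gensim LDA sketch matching point 5 (toy, pre-segmented documents; gensim is assumed to be installed): learn topics from a corpus, then infer a new text's topic distribution:

from gensim import corpora, models

docs = [["经济", "股市", "上涨"], ["电影", "导演", "上映"]]   # toy segmented docs
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary)

new_doc = dictionary.doc2bow(["股市", "下跌"])
print(lda.get_document_topics(new_doc))   # probability distribution over topics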

Detailed construction of content portraits

Keyword extraction: TF-IDF

Term frequency, stop-word removal, inverse document frequency

Term Frequency (TF):

$$\mathrm{TF}_{\omega, D_1} = \frac{\operatorname{count}(\omega)}{\left|D_{1}\right|}$$

That is: term frequency (TF) = (number of times the word appears in the article) / (total number of words in the article).

Stop-word removal:

Words such as "的" ("of"), "是" ("is"), etc. Segmentation generally uses the jieba function (the jieba segmentation library), which has three modes: precise (the default), search-engine, and full mode. Search-engine and full mode differ little.

Inverse Document Frequency (IDF):

Although some words are high-frequency in this article, they do not represent the characteristics of this article, because such words are high-frequency in other articles too. What we need are words that are both frequent here and representative of this article (TF alone considers only the importance of a word within a single article). The IDF is the logarithm of the ratio of the total number of documents $n$ to the number of documents containing the word $\omega$, $\operatorname{docs}(\omega, D)$:

$$\mathrm{IDF}_{\omega}=\log _{2}\left(\frac{n}{\operatorname{docs}(\omega, D)}\right)$$

In words: inverse document frequency (IDF) = log(total number of documents in the corpus / (number of documents containing the word + 1)).

That is, the more common a word is, the smaller its inverse document frequency (and the lower its importance); conversely, the larger the IDF, the less likely the word is to appear in other documents, which marks it as a keyword of this article.

Then, the TF-IDF value of a word in a document is proportional to the number of times the word appears in that document, and inversely proportional to the number of times the word appears in other documents:

$$\mathrm{TF\text{-}IDF} = \mathrm{TF} \times \mathrm{IDF}$$

Therefore, the TF-IDF algorithm sorts the words by TF-IDF value in descending order and takes the top N as the final keywords.
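
A minimal TF-IDF sketch following the formulas above (the +1 in the IDF denominator matches the text; documents are assumed to be pre-segmented word lists):

import math
from collections import Counter

def tf_idf_keywords(doc, corpus, top_n=5):
    # Return the top_n (word, score) pairs for one document
    tf = Counter(doc)
    n = len(corpus)
    scores = {}
    for word, count in tf.items():
        docs_with_word = sum(1 for d in corpus if word in d)
        idf = math.log(n / (docs_with_word + 1))
        scores[word] = (count / len(doc)) * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

In practice, jieba also ships a ready-made TF-IDF extractor, jieba.analyse.extract_tags.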

TextRank

Build process

Use TF-IDF to extract keywords from the news data in MySQL, store the resulting features in MongoDB, then transfer the content from MongoDB to Redis, sorted by time, for the cold start.

Simple engineering

Use Flask
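
A minimal Flask sketch (the route and the Redis key layout are assumptions) that serves the recall list written to Redis by the pipeline above:

from flask import Flask, jsonify
import redis

app = Flask(__name__)
r = redis.Redis(host="localhost", port=6379, db=0)

@app.route("/rec/<user_id>")
def recommend(user_id):
    # Items were stored per user as a sorted set, highest score first
    items = r.zrevrange(f"recall:{user_id}", 0, 9)
    return jsonify([i.decode() for i in items])

if __name__ == "__main__":
    app.run(port=5000)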
