Content recommendation

User Profile (user portrait)

Content-based recommendation is inseparable from the user profile. In a recommendation system, the user profile is not meant for sales and marketing staff to look at (a tag cloud drawn from a few statistical attributes); it is meant for the machine to read.

Before a recommendation system computes match scores, it must first quantify both users and items; the result of quantifying the user is the user profile. The user profile is not the goal of the recommendation system, but a key step in building one.

Besides the final match scoring (ranking), the user profile is also used in the earlier recall stage (candidate generation).

 

Two key factors in building a user profile: dimensions and quantification.

  1. The actual meaning of each dimension should be understandable; the number of dimensions is not fixed, and neither are the specific dimensions (they must be designed according to the actual situation).

  2. Do not quantify each dimension subjectively (leave that to the machine); instead, use the quality of the recommendation results as the objective and optimize the user profile in reverse.

 

Methods for constructing a user profile:

  1. Checking the records ("checking the household register"). Use the raw data directly as the content of the user profile: registration information (demographic data), purchase history, reading history, and the like. Apart from cleaning, the data is not abstracted or summarized at all; there is no technical depth here, but it is useful in scenarios such as user cold start.

  2. Piling up data. Do statistical work on the accumulated historical data. Interest tags, for example: mine tags from the historical behavior data, then compute statistics along the tag dimensions and use the statistical results as the quantification (see the sketch after this list).

  3. Black box. For example, latent factors, embedding vectors, or user reading interests constructed by latent-semantic or matrix-factorization models. These usually cannot be explained or read directly, but the role they actually play in the recommendation system is very large.
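As a concrete illustration of the second method, here is a minimal sketch that builds an interest-tag profile by counting the tags of items a user has interacted with; the data shapes and all names are hypothetical:

```python
from collections import Counter

def build_tag_profile(user_events, item_tags, top_n=50):
    """Toy 'piling up data' profile: count tags of items the user interacted with.

    user_events: list of item ids the user clicked/read (hypothetical shape).
    item_tags:   dict mapping item id -> list of tag strings.
    Returns the user's top tags with normalized weights.
    """
    counts = Counter()
    for item_id in user_events:
        counts.update(item_tags.get(item_id, []))
    total = sum(counts.values()) or 1
    # Normalize counts into weights so profiles are comparable across users.
    return {tag: n / total for tag, n in counts.most_common(top_n)}

profile = build_tag_profile(
    user_events=["a1", "a2", "a1"],
    item_tags={"a1": ["nba", "sports"], "a2": ["tech"]},
)
print(profile)  # {'nba': 0.4, 'sports': 0.4, 'tech': 0.2}
```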

 

 

Mining user profile data from text

On the internet, text is the most common form in which product information is expressed: it is plentiful, fast to process, and small to store, so text-mining algorithms are commonly used in building user profiles.

Examples on the user side:

  1. Registered name and personal signature

  2. Published posts, comments, and blog entries

  3. Chat records (with questions of security and privacy here...)

Examples on the item side:

  1. Article title and description

  2. The body text of the article itself (for news and the like)

  3. Other basic textual attributes of the item

 

To build a user profile from text information, do the following two things:

1. Structure the unstructured text, filtering it to keep the key information.

2. According to the user's behavior data, pass the structured results of the items on to the user, merging them with the user's own structured information.

 

First, structuring the text

From the text information on the item side, mature NLP algorithms can extract the following categories of information:

  1. Keyword extraction: the most basic source of tags; TF-IDF and TextRank are commonly used.

  2. Entity recognition: people, places and locations, works, historical events, and hot topics; a common method is a dictionary combined with a CRF model.

  3. Content classification: classify the text according to a taxonomy and use the category information to express structure at a coarser granularity.

  4. Text clustering: without anyone crafting a taxonomy, unsupervised methods divide the texts into multiple clusters. A cluster number is not a label, but it is likewise a common component of user profiles.

  5. Topic models: learn topic vectors from a large amount of existing text, then predict a new text's probability distribution over each topic.

  6. Embedding: express a word or text as a vector of limited dimension, mining the semantic information beneath the literal surface (digging one level down).

 

The common text-structuring algorithms mentioned above, in turn:

1. TF-IDF

Term frequency - inverse document frequency. The core idea: a word that appears repeatedly within one text is more important, while a word that appears in all texts is less important. These two points are quantified as the two indicators TF and IDF.

TF = (number of times the word appears in the document) / (total number of words in the document), or TF = (number of times the word appears in the document) / (number of occurrences of the most frequent word in the document)

IDF = log(total number of documents in the corpus / (number of documents containing the word + 1))

Multiplying the two values gives each word a weight, and keywords are then filtered by weight: 1. take the top-K words, though if the total number of words is less than K this is clearly unreasonable; 2. compute the average of all the words' weights and take the words whose weight exceeds the average as keywords.
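A minimal sketch of both formulas and both filtering strategies, assuming the corpus is already tokenized (function and variable names are my own):

```python
import math
from collections import Counter

def tfidf_keywords(doc_tokens, corpus, k=10):
    """Minimal TF-IDF keyword extraction for one document against a corpus."""
    tf_counts = Counter(doc_tokens)
    max_tf = max(tf_counts.values())
    n_docs = len(corpus)
    weights = {}
    for word, n in tf_counts.items():
        tf = n / max_tf                    # the 'most frequent word' TF variant
        df = sum(1 for d in corpus if word in d)
        idf = math.log(n_docs / (df + 1))  # +1 smoothing, as in the IDF formula
        weights[word] = tf * idf
    # Strategy 1: take the top-K words by weight.
    top_k = sorted(weights, key=weights.get, reverse=True)[:k]
    # Strategy 2: keep words whose weight exceeds the average weight.
    avg = sum(weights.values()) / len(weights)
    above_avg = [w for w, v in weights.items() if v > avg]
    return top_k, above_avg
```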

 

2. TextRank

  i. Set a window of fixed width over the text, say K words; count the co-occurrence relations between words inside the window, and treat the result as an undirected graph.

  ii. Initialize the importance of every word to 1.

  iii. Each node distributes its own weight evenly among the other nodes it is connected to (its co-occurring words).

  iv. Each node takes the sum of the weights passed to it by all other nodes as its new weight.

  v. Iterate steps iii and iv until all node weights converge.

Words that support one another through co-occurrence relations will emerge as the keywords.
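A minimal sketch of this iteration (all names are illustrative; following standard TextRank practice it adds a PageRank-style damping factor, which the plain steps above omit, to keep the iteration well behaved):

```python
from collections import defaultdict
from itertools import combinations

def textrank_keywords(tokens, window=5, damping=0.85, iters=50):
    """Minimal TextRank: build a co-occurrence graph from a sliding window,
    then iterate the weight-passing update until it (approximately) converges."""
    # i. undirected co-occurrence graph from a sliding window
    neighbors = defaultdict(set)
    for start in range(len(tokens) - window + 1):
        for a, b in combinations(set(tokens[start:start + window]), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    # ii. initialize every word's importance to 1
    rank = {w: 1.0 for w in neighbors}
    # iii./iv. each node shares its weight equally with its neighbors;
    #          the new weight is the sum of the shares it receives
    for _ in range(iters):
        rank = {
            w: (1 - damping)
               + damping * sum(rank[u] / len(neighbors[u]) for u in neighbors[w])
            for w in rank
        }
    return sorted(rank, key=rank.get, reverse=True)
```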

 

3. Content Category

A graphic-and-text feed app needs to classify content automatically into different channels. This yields structured information at the coarsest granularity, and it is also used to explore a user's interests during user cold start.

The classic algorithm for short-text classification is the SVM; the most commonly used open-source tool is Facebook's open-source FastText.
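A minimal usage sketch with the fastText Python bindings; the training-file path and hyperparameters are illustrative. FastText expects one example per line, with the category given as a `__label__` prefix:

```python
import fasttext

# train.txt: one example per line, e.g. "__label__sports lakers win the finals"
model = fasttext.train_supervised(input="train.txt", epoch=10, wordNgrams=2)

labels, probs = model.predict("new nba trade rumors today", k=3)
print(labels, probs)  # top-3 channel predictions with their probabilities
```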

 

4. Entity Recognition

NER (Named-Entity Recognition) decides, for each segmented word, whether it belongs to a predefined set of named entities. In NLP it is treated as a sequence labeling problem.

Commonly used algorithms for sequence labeling are HMMs and CRFs; the recommendation system mainly wants to dig out the desired structured results.

There is also a practical non-model approach: the dictionary method. Prepare dictionaries of the various entities in advance and store them in a trie (dictionary tree); look each segmented word up in the dictionary, and any word that is found is taken to be a predefined entity.

Among industrial-grade tools, spaCy offers higher efficiency.
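A minimal spaCy sketch (the small English model is an assumption and must be downloaded separately):

```python
import spacy

# requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new store in Shanghai next June.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple ORG, Shanghai GPE, next June DATE
```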

 

5. Clustering

Clustering is likewise unsupervised. Topic models, with LDA as the representative, can grasp topics more accurately and give a soft-clustering effect (one text can belong to more than one cluster).

If no business expert has specially formulated a taxonomy, LDA is very helpful. The number of topics K has to be set, and its value can be determined experimentally: compute the average pairwise similarity between the K topics and choose a K that makes it lower. In addition, once a text's distribution over the topics is obtained, the few topics with the highest probabilities can be kept as the topics of that text.

Open-source tools for LDA: Gensim, PLDA.
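A minimal Gensim sketch, assuming the corpus is already tokenized (the toy documents and K = 2 are illustrative):

```python
from gensim import corpora, models

docs = [["nba", "player", "trade"], ["stock", "market", "rally"],
        ["nba", "finals", "score"]]              # toy tokenized corpus
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)

# distribution of a new text over the topics; keep the top few as its topics
new_doc = dictionary.doc2bow(["nba", "trade", "rumor"])
print(sorted(lda.get_document_topics(new_doc), key=lambda t: -t[1]))
```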

 

6. Embedding

Familiar enough that little needs saying. The main idea is to map sparse high-dimensional data to dense vectors; the lookup table consulted before computation makes forward and backward propagation very fast and helps improve the downstream models.
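For example, a minimal word2vec sketch with Gensim (4.x API; the toy sentences and dimension are illustrative), showing the embedding-table lookup:

```python
from gensim.models import Word2Vec

sentences = [["user", "likes", "nba"], ["user", "reads", "tech", "news"]]
model = Word2Vec(sentences, vector_size=32, window=5, min_count=1)

vec = model.wv["nba"]   # dense vector looked up from the embedding table
print(vec.shape)        # (32,)
print(model.wv.most_similar("nba", topn=2))
```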

 

Second, label selection

How do we transfer the items' structured information to the user?

Treat the user's behavior toward items, consuming or not consuming, as a classification problem. The user has labeled a certain amount of data through actual actions; picking out the features the user is actually interested in then becomes a feature-selection problem.

The two most common methods are the chi-square test (CHI) and information gain (IG).

The basic idea:

  1. Treat the structured content of an item as a document.

  2. Treat the user's behavior toward the item as the document's category.

  3. All the items a user has seen form a text collection.

  4. Run a feature-selection algorithm on this text collection to pick out what each user cares about.

 

1. The chi-square test

The chi-square test is itself a feature-selection method. It is supervised (TF-IDF and TextRank are unsupervised) and needs labeled category information. Why is that needed?

In a text classification task, keywords should be selected in service of the classification task, not just because they intuitively look important. What the chi-square test essentially checks is whether the hypothesis "word W and category C are mutually independent" holds; the farther the data deviates from this hypothesis, the more strongly the word is related to category C, and the more clearly that word is a keyword.

To compute the chi-square value of a word Wi and a category Cj, we need to count four values:

  1. A: the number of texts of category Cj that contain the word Wi

  2. B: the number of texts not of category Cj that contain the word Wi

  3. C: the number of texts of category Cj that do not contain Wi

  4. D: the number of texts not of category Cj that do not contain Wi

 

Then compute the chi-square value of every word against every category:

χ²(Wi, Cj) = N(AD - BC)² / [(A+B)(C+D)(A+C)(B+D)], where N = A + B + C + D is the total number of texts.

Some explanations:

  1. The value is computed for every word and category pair; as long as a word helps for one of the categories, it should be kept.

  2. Because only the relative sizes of the chi-square values are compared, N need not participate in the calculation: it is the same for every word, namely the total number of texts.

  3. The larger the chi-square value, the farther the data deviates from the hypothesis that "the word and the category are mutually independent", and the closer it comes to the alternative hypothesis that "the word and the category are not mutually independent".
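A minimal sketch of the computation from the four counts (the toy counts are illustrative):

```python
def chi_square(A, B, C, D):
    """Chi-square statistic for word W vs. category C from the four counts.
    N is included for completeness; for ranking words it can be dropped."""
    N = A + B + C + D
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    return N * (A * D - B * C) ** 2 / denom if denom else 0.0

# toy counts for the word "nba" vs. the category "clicked":
# A: clicked texts containing "nba", B: non-clicked texts containing it,
# C: clicked texts without it,       D: non-clicked texts without it
print(chi_square(A=30, B=5, C=10, D=55))
```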

 

2. Information gain

Information gain (IG) is another supervised keyword-selection method, and it also needs labeled information. It differs from the chi-square test in that the chi-square test screens a separate set of labels for each behavior category, whereas information gain screens globally and uniformly.

How to understand information entropy: all the texts are labeled with categories; pick a text at random, which category does it belong to? If the number of texts in each category is the same, it is certainly hardest to guess; but if the number of texts in some category C far exceeds every other category, it is easy to guess right. The difference between the two situations is a difference in information entropy: large in the former, small in the latter.

Think one step further: from this pile of texts, pick out the ones containing a word W and again look at the category of a random text; the same two situations can still arise. Consider this case: if over the whole corpus we are in situation 1, but among the texts containing the word W we are in situation 2, then the word W is very useful!

This is the information gain ideas:

  1. Compute the information entropy of the categories over the global text collection.

  2. Compute the conditional entropy for each word (the entropy after we know whether a text contains the word: compute the information entropy of the texts containing the word and of those not containing it separately, then take the weighted average by each part's share of the texts).

  3. Subtracting the two gives each word's information gain.

The ID3 decision-tree algorithm uses information gain as its criterion for choosing split points.
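A minimal sketch of the three steps, with documents represented as sets of words (the toy data is illustrative; in this perfectly separating case the gain equals the full entropy):

```python
import math
from collections import Counter

def entropy(labels):
    """Information entropy of a list of category labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, word):
    """IG(word) = H(categories) - H(categories | word present / absent)."""
    with_w = [y for d, y in zip(docs, labels) if word in d]
    without_w = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    cond = 0.0
    for part in (with_w, without_w):
        if part:  # weighted average by each part's share of the texts
            cond += len(part) / n * entropy(part)
    return entropy(labels) - cond

docs = [{"nba", "score"}, {"nba", "trade"}, {"stock", "rally"}, {"tech", "ipo"}]
labels = ["click", "click", "skip", "skip"]
print(information_gain(docs, labels, "nba"))  # 1.0: "nba" splits the labels perfectly
```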

 

Both the chi-square test and information gain are carried out in batch during the offline stage, so the user profile can be updated every day. (What about new users? That is the MAB, multi-armed bandit, problem.)

 

 

Content recommendation beyond labels

(This is not to say that more labels are always more helpful.) In a content-based recommendation system, labels are only a small part. Content-based recommendation is in fact an information retrieval system packaged as a recommendation system; still, it is the foundation on which more complex recommendation systems are built, and it also helps solve the cold-start problem (for new items).

Content data is relatively easy to obtain, and it is easy to mine information from it that is useful to the recommendation system (especially text data).

Where the effort is worth spending: supplementing data from the same sources to add analysis dimensions; cleaning the data to eliminate redundancy, junk, and sensitive content; mining the data in depth; and computing more reasonable interest-related properties between users and items.

1. The framework of content-based recommendation:

 

In content-based recommendation, the most important part is not the algorithm but the mining and analysis of content. The deeper the content analysis, the more precisely user groups are captured, the higher the conversion rate of the recommendations, the more the users' goodwill rises, and the more feedback comes back.

 

The output of content analysis is twofold:

  A structured content library (combined with user feedback, it is used to learn user profiles);

  Content analysis models (classification, topic models, entity recognition, embedding; when a new item has to be recommended in real time, these models must analyze its content in real time, extract the structured information, and then match it against user profiles).

 

2. Content-based recommendation algorithms

i. The simplest method: compute the similarity between the sparse vector on the content side and the sparse vector of the user profile, and rank candidate items by that similarity (strong interpretability; see the sketch after this list).

ii. Make better use of the structured information in the content: different fields carry different importance. The widely used open-source search engine Lucene already implements the BM25F algorithm for this kind of relevance computation.

iii. Machine-learning methods (which take the recommendation objective into account): CTR prediction models. Each sample consists of two parts. One part is the features, including the user-side profile content, the item-side structured content, and the context information recorded in the logs (time, location, device, ...). The other part is the user's behavior, used as the label (two classes: feedback or no feedback). Train a binary classifier, commonly LR, GBDT, or a hybrid of the two, and rank items by the predicted probability of the user acting.
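A minimal sketch of method i, assuming both the user profile and the items are stored as sparse tag-to-weight dictionaries (all names and weights are illustrative):

```python
def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as tag->weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sum(w * w for w in u.values()) ** 0.5
    nv = sum(w * w for w in v.values()) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

user_profile = {"nba": 0.6, "tech": 0.2, "finance": 0.2}
items = {
    "a1": {"nba": 0.8, "sports": 0.2},
    "a2": {"finance": 0.7, "stock": 0.3},
}
ranked = sorted(items, key=lambda i: cosine(user_profile, items[i]), reverse=True)
print(ranked)  # items ordered by similarity to the profile: ['a1', 'a2']
```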

 
