[Recommended] Basic architecture algorithms, day 4: core details of a microblogging recommendation engine

Introduction

Microblogging is a social application that a great many people use. Anyone who browses a microblog every day performs a handful of operations: posting original content, forwarding, replying, reading, following, @-mentioning, and so on. The first four act on short posts, while following and @-mentioning act on relationships between users. Following someone means you become his fan and he becomes your friend; @-mentioning someone means you want him to see your post.
Weibo is regarded as "self-media", that is, a way for ordinary people to share "news" about themselves. Reports of people using self-media to gain influence and income have recently become common. How is a person's influence on a microblog calculated? Which algorithms act as the invisible hand that manages us there? And how does each of our actions affect those algorithms?
Intuitively, a microblog is a simplified microcosm of human society, and some features of the microblogging network may help us discover laws that govern real social networks. Thanks to the explosive growth of social networking, "social computing", and social network analysis in particular, has become the new darling of data mining. Here we briefly introduce a number of algorithms for analyzing the microblogging network; some of them may also apply to other social applications.

 

Label propagation

Weibo has a vast number of users, and different people have different interests. Mining each user's interests helps target advertising and content recommendation more accurately. To capture those interests, users can be given labels: each label stands for one interest, and a user can have one or more labels. To obtain the final user labels, we start with the first assumption:
Among each user's friends (or fans), the majority share the same interests as that user.
This leads to the first algorithm presented here, the label propagation algorithm. In this algorithm, each user takes the one or more labels that occur most often among his friends or fans. The labels of friends and fans can also be considered together, in which case the two groups can be given different weights when they are combined. The label propagation algorithm proceeds as follows:
1) give a subset of users initial labels;
2) for each user, count the labels of his friends and fans, and give the user the one or more labels that occur most often;
3) repeat step 2 until the users' labels no longer change significantly.
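The three steps above can be sketched as follows. This is a minimal illustration rather than a production algorithm: the function name `propagate_labels`, the `neighbors` mapping, and the rule that seed users keep their initial labels are assumptions made for the example, and each user receives only the single most frequent label.

```python
from collections import Counter


def propagate_labels(neighbors, seed_labels, max_rounds=20):
    """Spread labels over a network.

    neighbors:   user -> list of that user's friends and fans (hypothetical input)
    seed_labels: user -> initial label for the users labelled in step 1
    """
    labels = dict(seed_labels)
    for _ in range(max_rounds):
        updated = dict(labels)
        for user, linked in neighbors.items():
            if user in seed_labels:
                continue  # assumption: initial labels stay fixed
            # Step 2: count the labels currently held by friends and fans.
            counts = Counter(labels[n] for n in linked if n in labels)
            if counts:
                # Most frequent label; ties are broken by label name.
                updated[user] = min(counts, key=lambda lab: (-counts[lab], lab))
        if updated == labels:
            break  # Step 3: stop once nothing changes any more.
        labels = updated
    return labels


network = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
result = propagate_labels(network, {"a": "sports", "e": "music"})
```

To return several labels per user, as the text allows, the `min` call could be replaced by `counts.most_common(k)`.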


 

User similarity calculation

The label propagation algorithm is simple to implement. Its disadvantage is that the results become poor when the assumption does not hold. For example, out of social courtesy we usually follow our relatives and acquaintances, and these people do not necessarily share our labels. The solution is to calculate the similarity between users and use it to weigh how much a friend's or fan's labels contribute to the user's labels. This gives the second assumption:
The more similar a friend or fan is to the user, the more likely that person's labels are also the user's labels.
So how do we measure the similarity between users? This requires looking at the posts the user has published, both forwarded and original. Because we care about the similarity between users rather than between individual posts, in practice all of a user's posts are gathered together for the calculation. One option is to use the bag-of-words model to represent those posts as a term vector and then compute the similarity directly with cosine similarity or a similar measure. This method is too simple, however, and rarely achieves good results, so we present here a similarity calculation based on LDA (Latent Dirichlet Allocation).
LDA still uses the bag-of-words model to represent text, but it adds a topic layer in the middle, forming a three-layer "document - topic - word" probability model: each document is seen as a probability distribution over topics, and each topic is seen as a probability distribution over words. In the LDA model, a document is generated as follows:
1) for each document:
2) draw a topic from the document's topic distribution;
3) draw a word from that topic's word distribution;
4) repeat steps 2 and 3 until every word in the document has been generated.
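The generative steps can be illustrated with a small sampler. This sketch covers only the sampling loop; the full LDA model additionally draws the topic and word distributions from Dirichlet priors, which is omitted here. The function name `generate_document` and the toy distributions are assumptions made for the example.

```python
import random


def generate_document(topic_dist, topic_word_dists, length, rng):
    """Generate `length` words by repeating steps 2 and 3.

    topic_dist:       topic -> probability for this document
    topic_word_dists: topic -> {word -> probability}
    rng:              a random.Random instance, passed in for reproducibility
    """
    topics = list(topic_dist)
    topic_weights = [topic_dist[t] for t in topics]
    words = []
    for _ in range(length):
        # Step 2: draw a topic from the document's topic distribution.
        topic = rng.choices(topics, weights=topic_weights)[0]
        # Step 3: draw a word from that topic's word distribution.
        word_dist = topic_word_dists[topic]
        vocab = list(word_dist)
        word = rng.choices(vocab, weights=[word_dist[w] for w in vocab])[0]
        words.append(word)
    return words


toy_topics = {
    "sports": {"match": 0.6, "goal": 0.4},
    "tech": {"phone": 0.5, "code": 0.5},
}
doc = generate_document({"sports": 0.8, "tech": 0.2}, toy_topics, 10, random.Random(0))
```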
Parameter estimation for the LDA model is beyond the scope of this article. All you need to know here is that LDA yields a topic distribution over each user's posts. Cosine similarity, KL distance, or a similar measure is then applied to these topic distributions, and the result is taken as the similarity between users. This similarity is then used to weight label propagation.
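As a rough sketch of this step, the two measures can be computed on topic distributions represented as plain lists of probabilities. The function names and the sample distributions are assumptions for illustration; the topic distributions themselves would come from a trained LDA model, which is not shown.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two topic vectors of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q). It is a distance-like value: 0 means identical distributions.

    `eps` guards against division by zero when q has an empty topic.
    """
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0)


user_a = [0.7, 0.2, 0.1]  # hypothetical topic distributions
user_b = [0.6, 0.3, 0.1]
user_c = [0.1, 0.1, 0.8]
sim_ab = cosine_similarity(user_a, user_b)
sim_ac = cosine_similarity(user_a, user_c)
```

Note that KL divergence is not symmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ; a symmetric variant can be obtained by averaging the two.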

 

The time factor and network factors

What are the disadvantages of the algorithm above?
A user's interests change over time, so gathering all of a user's posts every time the similarity is calculated is not very reasonable. Instead, we can select the N posts closest to the current time; for example, for each user, the 50 most recent posts are gathered together and used to train LDA. N should be neither too large nor too small. If it is too large, it does not reflect changes in the user's interests in time; if it is too small, the randomness of the user's posts easily causes interest drift. For the best results, N need not be fixed: for example, each user's posts can be treated as a time series, and N can be adapted to each user accordingly.
So far, the algorithm has not considered the network information formed by replying, forwarding, and @-mentioning. Take forwarding as an example: if a user frequently forwards a particular friend's posts, the similarity between the user and that friend should be higher than with other friends. This can be seen as assumption three:
The more frequently a user forwards a friend's posts, the greater the similarity between the interests of the user and that friend.
Similarly, we obtain assumption four:
The more frequently a user @-mentions a friend in posts, the greater the similarity between the interests of the user and that friend.
This gives additional factors for the similarity calculation. There are many ways to add a new factor to an existing similarity measure; for example, the forwarding frequency and the @ frequency can be quantified and added to the similarity as weights.
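One possible way to fold these frequencies into an existing similarity is sketched below. The formula, the function name `adjusted_similarity`, and the coefficients `alpha` and `beta` are all assumptions chosen for illustration; the text does not prescribe a specific weighting.

```python
def adjusted_similarity(topic_sim, forwards_to_friend, total_forwards,
                        mentions_of_friend, total_mentions,
                        alpha=0.2, beta=0.1):
    """Boost a topic-based similarity by interaction frequency.

    forwards_to_friend / total_forwards: share of the user's forwards that
    came from this friend; mentions are treated the same way.
    alpha and beta are hypothetical tuning weights.
    """
    forward_rate = forwards_to_friend / total_forwards if total_forwards else 0.0
    mention_rate = mentions_of_friend / total_mentions if total_mentions else 0.0
    return topic_sim * (1.0 + alpha * forward_rate + beta * mention_rate)


quiet_friend = adjusted_similarity(0.5, 0, 40, 0, 10)
close_friend = adjusted_similarity(0.5, 20, 40, 5, 10)
```

A multiplicative boost keeps the ordering produced by the topic similarity when two friends interact equally often, and only separates them when interaction frequencies differ.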


 

Community discovery

A microblogging community is a group of people who are closely linked in the microblogging relationship network: links within a community are close, and links between communities are relatively sparse. "Closely linked" has two meanings here. The first is that the interests of people within the community are highly similar. The second is that the relationships between people within the community are close, for example by requiring that any two users within the community be no more than two degrees apart, where a two-degree association is a friend of a friend.
Interest similarity was described above; relationship similarity has to be calculated from the follow relationships between users. Following is a one-way link, so the relationships among all microblog users can be represented as a huge directed graph. Relationship similarity can be measured simply, for example as the inverse of the length of the shortest path between two users. This measure is imprecise, however. The six degrees of separation theory holds in the real world, and relationships in the microblogging network and other online social networks are often even closer. This simple measure can therefore take at most six discrete similarity values, which is obviously not precise enough.
To achieve better results, we use not only the shortest path as an explicit measure but also some implicit measures. We first state two assumptions, assumption five and assumption six:
  • The more friends two users have in common, the higher the relationship similarity between the two users.
  • The more fans two users have in common, the higher the relationship similarity between the two users.
Both assumptions can be quantified by borrowing the Jaccard similarity, which is the size of the intersection divided by the size of the union. For assumption five, the resulting index is called the co-pointing similarity: the number of friends the two users have in common divided by the number of all friends of the two users. The index for assumption six is called the co-pointed similarity and is calculated in the same way, using fans instead of friends. In a sense, these two similarities measure not only the relationship but also, to some extent, how similar the users' interests are: intuitively, the more friends two users follow in common, the more similar their interests. The two similarities also share a technical name: similarity based on structural context.
Once the shortest-path similarity, the co-pointing similarity, and the co-pointed similarity have been obtained, they can be fused with a weighting function to give the final similarity. A clustering algorithm such as K-Means or DBSCAN can then be applied to obtain the final community clusters. Alternatively, the similarity-weighted label propagation algorithm can be used, treating people with the same label as one community.
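The common-friend and common-fan indices and their fusion with the shortest-path measure might look like the sketch below. The function names, the example weights `(0.4, 0.3, 0.3)`, and the choice of a plain weighted sum as the fusing function are assumptions for illustration only.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relationship_similarity(path_length, friends_u, friends_v, fans_u, fans_v,
                            weights=(0.4, 0.3, 0.3)):
    """Fuse three relationship measures with a weighted sum.

    path_length: shortest-path length between the two users (0 if unreachable)
    weights:     hypothetical weights for path, common friends, common fans
    """
    path_sim = 1.0 / path_length if path_length else 0.0
    common_friend_sim = jaccard(friends_u, friends_v)  # assumption five
    common_fan_sim = jaccard(fans_u, fans_v)           # assumption six
    w_path, w_friends, w_fans = weights
    return w_path * path_sim + w_friends * common_friend_sim + w_fans * common_fan_sim


score = relationship_similarity(
    2,
    friends_u={"a", "b", "c"}, friends_v={"b", "c", "d"},
    fans_u={"x"}, fans_v={"x"},
)
```

The resulting pairwise similarities can then be handed to a clustering step; how they are converted to distances depends on the clustering algorithm chosen.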
 

Influence calculation

In community discovery, the microblogging relationship network was used to improve the accuracy of the similarity calculation. But the relationship network can do much more, and influence calculation is one of its more important applications.
For influence calculation we borrow from web page ranking algorithms. The best known of these is undoubtedly PageRank, invented by Google founders Larry Page and Sergey Brin, which became famous along with Google's commercial success. The algorithm ranks pages according to the links between them, and its core assumption is that a page pointed to by high-quality pages must itself be of high quality.
Following the idea of PageRank, we can state an assumption about influence on the microblog, called assumption seven:
A user followed by high-influence users must also have high influence.
Treat each user as a web page in PageRank and each follow relationship as a link between pages. We then obtain, in the manner of PageRank, an algorithm that calculates influence on the microblogging follow network:
1) give every user the same influence weight;
2) divide each user's influence weight equally among the users he follows;
3) for each user, set his influence weight to the sum of the weights his fans assigned to him;
4) iterate steps 2 and 3 until the weights no longer change significantly.
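A minimal sketch of these four steps follows. It implements the plain iteration as listed, without the damping factor used in the full PageRank algorithm, so it can oscillate on some graphs. The function name `influence_scores`, the `follows` mapping, and the handling of users who follow no one are assumptions made for the example.

```python
def influence_scores(follows, max_rounds=100, tol=1e-9):
    """follows: user -> set of users that this user follows (hypothetical input)."""
    users = set(follows) | {v for followed in follows.values() for v in followed}
    # Step 1: every user starts with the same weight.
    score = {u: 1.0 / len(users) for u in users}
    for _ in range(max_rounds):
        new_score = {u: 0.0 for u in users}
        for u in users:
            followed = follows.get(u, ())
            if followed:
                # Step 2: split the weight equally among followed users.
                share = score[u] / len(followed)
                for v in followed:
                    new_score[v] += share  # Step 3: sum of shares from fans.
            else:
                # Assumption: a user who follows no one spreads weight evenly.
                for v in users:
                    new_score[v] += score[u] / len(users)
        change = max(abs(new_score[u] - score[u]) for u in users)
        score = new_score
        if change < tol:
            break  # Step 4: weights no longer change significantly.
    return score


graph = {"a": {"b", "c"}, "b": {"c"}, "c": {"a", "b"}, "d": {"c"}}
scores = influence_scores(graph)
```

In this toy graph user "c" has the most fans and ends up with the highest weight, while "d", who has no fans, ends up with none.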
Besides PageRank, web page ranking algorithms include HITS, HillTop, and others that are based on network relationships; these can also be borrowed for influence calculation.
What are the disadvantages of the algorithm above?
If influence is based only on the relationship network, people with many fans are bound to have high influence. This lets some users achieve very high influence simply by buying zombie fans. Such an algorithm clearly cannot cope with the real situation, because too much information is left unused.
Besides the relationship network, a user's influence also depends heavily on his personal attributes, such as how active he is and the quality of his posts. Activity can be measured by how frequently the user posts, and post quality by the number of times the posts are forwarded and replied to. Measuring these values and combining them with the result of the algorithm above gives a more precise influence result.
The reply, forwarding, and @ relationships between users can each form a network as well, with corresponding assumptions, namely assumptions eight, nine, and ten:
The more influential the users who reply to a post, the more influential the post, which in turn raises the influence of the post's author.
The more influential the users who forward a post, the more influential the post, which in turn raises the influence of the post's original author.
The more influential a user is, the more influential the users he @-mentions in his posts tend to be.
This gives three more networks: the forwarding network, the reply network, and the @ network. Applying the PageRank idea to each yields three more influence results. Fusing them with the influence result from the relationship network gives the final influence result. The fusion here can simply be a weighted sum of the results; more complex fusion methods are outside the scope of this article.
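The weighted-sum fusion can be sketched as below. The function name `fuse_influence`, the network names, and the weight values are assumptions for illustration; each per-network result is taken to be a mapping from user to influence score.

```python
def fuse_influence(results, weights):
    """Combine per-network influence scores with a normalized weighted sum.

    results: network name -> {user -> influence score}
    weights: network name -> hypothetical importance of that network
    """
    total_weight = sum(weights[name] for name in results)
    users = set().union(*results.values())
    fused = {}
    for user in users:
        weighted = sum(weights[name] * scores.get(user, 0.0)
                       for name, scores in results.items())
        fused[user] = weighted / total_weight
    return fused


per_network = {
    "follow": {"a": 0.6, "b": 0.4},
    "forward": {"a": 0.2, "b": 0.8},
}
final = fuse_influence(per_network, {"follow": 0.5, "forward": 0.5})
```

A user missing from one network simply contributes a score of 0 for that network, which is one possible convention among several.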
 

Topic factors and domain factors

Once we have a way to calculate influence, what can be done with it?
We can analyze influence within a current hot topic to find out who the opinion leaders on that topic are. To do this, find the posts related to the hot topic, and from them the users who took part in it. How do we find the posts related to a hot topic? Posts with a topic hashtag need no further discussion. For posts without one, the LDA algorithm introduced above can be used: just as it finds a user's topic distribution from all of the user's posts, it can find the topic distribution of a single post. Because posts are limited to 140 characters and are relatively short, a post generally does not contain many topics, so the topic with the highest probability in the post's topic distribution is taken as its topic.
After finding the posts and users that correspond to the topic, run the influence calculation algorithm to obtain the users with the greatest influence on that topic. This is also one way to monitor public opinion and social hot spots.
Using the results of the label propagation algorithm, running the influence calculation over the users who share a label gives an influence ranking under that label, that is, an influence ranking within a domain. For example, Kai-fu Lee's influence across all domains may not be the highest, but in the IT domain his influence is definitely among the highest.
 

Spam user identification

The influence calculation section mentioned the need to keep zombie users from interfering with the results. If such users can be identified and excluded from the influence calculation, the results improve and the amount of computation is reduced.
As with influence calculation, identifying spam users requires considering both user attributes and link relationships.
Spam users differ from normal users in several statistical features, for example:
Spam users generally post on a regular schedule, and entropy can be used to measure this. Entropy is a measure of randomness: the greater the randomness, the larger the entropy. In practice, divide time into slices of some granularity, compute the probability of posting in each time slice, and calculate the entropy from those probabilities. The smaller the entropy, the more regular the user's posting times and the more likely the user is a spam user.
Some spam users like to @-mention others maliciously in their posts, so the proportion of a spam user's posts that contain @ is higher than for normal users.
Some spam users add a large number of URLs to their posts to promote advertisements; this can be measured by the proportion of posts that contain a URL. Other users, in order to trick people into clicking a URL, write post content that does not match the page the URL points to. In that case the consistency between the post and the URL's content has to be judged. A simple approach is to use the bag-of-words model to represent the post and the page behind the URL as term vectors, and then check how frequently the words of the post appear in that page.
For users who post advertisements, their posts can also be run through a text classifier to decide whether each post is an advertisement; if a considerable proportion of a user's posts are advertisements, the user may be a spam user.
Spam users generally follow other users indiscriminately, so the ratio of their number of fans to their number of friends differs from that of normal users. Normal users usually add friends through existing friend relationships, so their follow relationships form triangles: if A sees that his friend B follows C and then follows C as well, the three follows (A follows B, B follows C, A follows C) form a follow triangle. Because a spam user follows people at random, the proportion of such triangles among his follows differs from that of normal users.
Spam users differ from normal users in more ways than these, but this article will not go through them all. Identifying spam users is essentially a binary classification problem: once these features have been obtained, they can be fed into a machine learning classification model, such as logistic regression (LR), a decision tree, or naive Bayes, to classify users.
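Three of the features described above could be computed roughly as follows. All function names are hypothetical, hourly time slices are an assumption, and the regular-expression tokenizer is a stand-in: Chinese posts would need a proper word segmenter. The entropy function returns 0 when every post falls in the same slice and grows as posts spread more evenly over the slices.

```python
import math
import re
from collections import Counter


def posting_time_entropy(post_slices):
    """Shannon entropy (in bits) of the time slices in which a user posted."""
    counts = Counter(post_slices)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def url_content_consistency(post_text, page_text):
    """Share of the post's words that also appear in the linked page."""
    post_words = re.findall(r"\w+", post_text.lower())
    page_words = set(re.findall(r"\w+", page_text.lower()))
    if not post_words:
        return 0.0
    return sum(w in page_words for w in post_words) / len(post_words)


def follow_triangle_ratio(user, follows):
    """Share of ordered friend pairs (B, C) of `user` where B also follows C."""
    friends = follows.get(user, set())
    possible = len(friends) * (len(friends) - 1)
    if possible == 0:
        return 0.0
    closed = sum(1 for b in friends for c in friends
                 if b != c and c in follows.get(b, set()))
    return closed / possible


features = [
    posting_time_entropy([9, 9, 9, 9]),
    url_content_consistency("cheap phone sale", "Big sale on every phone today"),
    follow_triangle_ratio("a", {"a": {"b", "c", "d"}, "b": {"c"}}),
]
```

A list such as `features` is the kind of input that a classifier like logistic regression would take, one row per user.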
So far the link information has not been used. In general, spam users follow normal users, but normal users do not follow spam users. This is assumption eleven:
Normal users tend not to follow spam users.
This lets us use the PageRank algorithm once again, this time to calculate the probability that a user is a spam user. Note that the algorithm is initialized with the results of the classifier above: the probability is set to 1 for spam users and to 0 for normal users. During the PageRank calculation, a simple summation cannot be used, because if a user follows several spam users, for example, the summed probability may exceed 1. The probabilities therefore have to be updated with some normalization method or an exponential-family function.
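One way to keep the propagated value inside [0, 1] is a noisy-OR style update, sketched below. This is a substitute technique chosen for illustration, not the update rule of the original article; the function name, the `strength` parameter, and the direction of propagation (following spam users raises a user's own spam probability) are assumptions.

```python
def propagate_spam_probability(follows, initial, strength=0.5, rounds=10):
    """Propagate spam probability along follow links with a noisy-OR update.

    follows:  user -> set of users that this user follows
    initial:  user -> classifier output (1.0 for spam, 0.0 for normal)
    strength: hypothetical weight of a single followed spam user
    """
    prob = dict(initial)
    for _ in range(rounds):
        updated = {}
        for user, base in initial.items():
            # Probability that the user is NOT spam: the classifier's view
            # times the chance that no followed account "implicates" him.
            not_spam = 1.0 - base
            for followed in follows.get(user, ()):
                not_spam *= 1.0 - strength * prob.get(followed, 0.0)
            updated[user] = 1.0 - not_spam  # a product of terms in [0, 1]
        prob = updated
    return prob


links = {"u": {"s1", "s2"}, "n": {"m"}}
start = {"u": 0.0, "s1": 1.0, "s2": 1.0, "n": 0.0, "m": 0.0}
spam_prob = propagate_spam_probability(links, start)
```

With a plain sum, user "u" above would receive a value of 2 from the two spam accounts he follows; the product form keeps the result a valid probability.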
 
 

Epilogue

This article briefly introduced algorithms for common problems in microblogging; in practice the algorithms are more complex than presented here. Nor does this article cover every topic: friend recommendation and hot-spot tracking, for example, are not addressed. But as the old saying goes, "from one spot you can see the whole leopard"; we hope this introduction helps you better understand social networking applications such as microblogging.
Throughout the text we highlighted a series of assumptions, all of which appear consistent with our intuition. Many effective algorithms can be derived from them. So sometimes, as long as you are willing to look, the algorithm is right beside you.
 
 
Author: Zhang Yu stone .

=> For more articles, see "China's Internet business development Architecture Guide":

https://blog.csdn.net/Ture010Love/article/details/104381157

Origin blog.csdn.net/Ture010Love/article/details/104445160