The Recommendation Algorithms Behind Toutiao and Douyin, Explained in Detail

 

This talk covers five topics: an overview of Toutiao's content recommendation system, content analysis, user tagging, evaluation and analysis, and the principles behind content security.

 

 

I. System Overview

 

Described formally, a recommendation system is actually a function that fits user satisfaction with content. This function takes input variables from three dimensions.

 

 

The first dimension is content. Toutiao is now a comprehensive content platform covering articles, videos, UGC short videos, Q&A, and micro-headlines. Each content type has many characteristics of its own, and we must consider how to extract features from each type for recommendation.

 

The second dimension is user features, including a variety of interest tags, as well as occupation, age, gender, and so on, plus many implicit user interests captured by models.

 

The third dimension is environmental features. This is characteristic of recommendation in the mobile-internet era: users move between scenarios anytime, anywhere — at work, commuting, traveling — and their information preferences shift accordingly.

 

Combining these three dimensions, the model makes a prediction: is it appropriate to recommend this content to this user in this scenario?

 

This raises a further question: how do we account for goals that cannot be directly measured?

 

 

In a recommendation model, click-through rate, reading time, likes, comments, and shares are all quantifiable goals: they can be fitted directly by the model, and we can judge how well we are doing by observing the lift online.

 

But a large-scale recommendation system serving many users cannot be fully evaluated by metrics alone; introducing considerations beyond the data metrics is also very important.

 

For example, frequency control for ads and special content types. Q&A cards are a special form of content: the goal of recommending them is not only to get users to browse, but also to attract users to answer and contribute content to the community. How this content coexists with ordinary articles, and how its frequency is controlled, must be carefully considered.

 

In addition, out of consideration for the content ecosystem and social responsibility, the platform needs to suppress vulgar content, clickbait, and low-quality content; to pin, boost, and force-insert important news; and to down-rank content from low-quality accounts. These goals cannot be achieved by the algorithm itself and require further editorial intervention on the content.

 

Next I will briefly describe how these goals are achieved on the basis of the formulation above.

 

 

The aforementioned equation, y = F(Xi, Xu, Xc), is a classic supervised learning problem. There are many ways to implement it: the traditional collaborative filtering model, the supervised Logistic Regression model, deep-learning-based models, Factorization Machines, GBDT, and so on.
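The simplest instance of this function is a logistic regression over the concatenated content, user, and context features. The sketch below is purely illustrative — the feature names and weights are made-up toy values, not anything from Toutiao's actual model:

```python
import math

def predict_satisfaction(x_content, x_user, x_context, weights, bias=0.0):
    """Score one (content, user, context) triple with a logistic model:
    y = sigmoid(w . [Xi, Xu, Xc] + b), the simplest form of y = F(Xi, Xu, Xc).
    """
    x = x_content + x_user + x_context          # concatenate the three views
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy features: 2 content dims, 2 user dims, 1 context dim.
score = predict_satisfaction(
    x_content=[0.8, 0.1],   # e.g. topic relevance, freshness
    x_user=[0.5, 0.9],      # e.g. category affinity, activity level
    x_context=[1.0],        # e.g. is_commuting
    weights=[1.2, 0.4, 0.7, 1.5, 0.3],
)
print(round(score, 3))  # -> 0.953
```

In production such a score would be computed for every candidate and used to rank them; the heavier models mentioned above (FM, GBDT, DNN) replace the linear scoring function but keep the same input/output contract.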

 

An excellent industrial recommendation system needs a very flexible algorithm experimentation platform that supports many algorithm combinations, including adjustments to the model architecture, because no single model architecture fits all recommendation scenarios.

 

Combining LR with DNN is very popular now, and a few years ago Facebook combined the GBDT and LR algorithms. Several Toutiao products use the same powerful set of recommendation algorithms, but the model architecture is adjusted according to each scenario.

 

 

Next, let's look at typical recommendation features. Four classes of features play particularly important roles in recommendation.

 

The first class is relevance features, which evaluate whether the content's attributes match the user. Explicit matching includes keyword matching, category matching, source matching, and topic matching. There are also implicit matches — for example, in an FM-style model, a match can be derived from the distance between the user vector and the content vector.

 

The second class is environmental features, including location and time. These serve not only as bias features but are also used to build matching features.

 

The third class is heat (popularity) features, including global heat, category heat, topic heat, keyword heat, and so on. Content heat information is very effective in a recommendation system, especially when cold-starting new users.

 

The fourth class is collaborative features, which to some extent help solve the so-called narrowing effect of recommendation algorithms.

 

Collaborative features do not consider only a single user's history. Instead, they analyze similarity between different users through their behavior — similar clicks, similar interest words, similar topics, even similar vectors — thereby extending the model's ability to explore.
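A minimal way to illustrate behavior-based user similarity is cosine similarity over per-topic click counts. The user names and topics below are invented toy data; real collaborative features would use far richer behavior vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length behavior vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similar_users(target, others, k=2):
    """Rank other users by click-vector similarity to the target user."""
    ranked = sorted(others.items(), key=lambda kv: cosine(target, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

clicks = {  # hypothetical per-topic click counts: [sports, tech, finance]
    "u1": [5, 0, 1],
    "u2": [4, 1, 0],
    "u3": [0, 6, 2],
}
print(similar_users([6, 0, 1], clicks, k=1))  # -> ['u1']
```

Content that similar users engaged with can then be recalled for the target user even if it matches none of that user's own explicit interest tags — which is exactly how collaborative signals widen exploration.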

 

 

For model training, most of Toutiao's recommendation products use real-time training. Real-time training saves resources and provides fast feedback, which is very important for information-feed products. The signals from user behavior need to be captured quickly and fed back into the model for the next refresh's recommendations.

 

We currently process sample data online with a Storm cluster, covering clicks, impressions, favorites, shares, and other action types.

 

The model parameter server is a high-performance system developed in-house, because the scale of Toutiao's data grows too fast for comparable open-source systems to meet our stability and performance requirements. Our self-developed system includes many targeted low-level optimizations, provides better operations and maintenance tooling, and is better adapted to our existing business scenarios.

 

Currently, Toutiao's recommendation models are among the larger ones in the industry, containing tens of billions of raw features and billions of vector features.

 

The overall training process: the online servers log real-time features; impression and action data are written into Kafka queues; a Storm cluster consumes the Kafka data; training samples are constructed by joining in the labels returned from the client; the model parameters are then updated online according to the latest samples; and finally the online model is refreshed.

 

The main delay in this process is the user's action feedback delay, since a user does not necessarily view an article immediately after it is recommended. Setting that part aside, the whole system runs almost in real time.
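The online-learning step at the end of that pipeline can be sketched as a streaming logistic regression that takes one SGD step per labeled impression. This is a single-machine stand-in under stated assumptions — the Kafka/Storm plumbing and the parameter server are out of scope, and the learning rate and feature names are illustrative:

```python
import math
from collections import defaultdict

class OnlineLR:
    """Streaming logistic regression: one SGD step per labeled sample."""

    def __init__(self, lr=0.1):
        self.w = defaultdict(float)   # sparse weights, keyed by feature name
        self.lr = lr

    def predict(self, features):      # features: {name: value}
        z = sum(self.w[f] * v for f, v in features.items())
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, clicked):   # clicked: 1 or 0
        err = self.predict(features) - clicked
        for f, v in features.items():
            self.w[f] -= self.lr * err * v  # gradient step on log-loss

model = OnlineLR()
# Toy sample stream: sports impressions get clicked, finance ones do not.
stream = [({"topic:sports": 1.0}, 1), ({"topic:finance": 1.0}, 0)] * 100
for feats, label in stream:           # consume the sample stream in order
    model.update(feats, label)
print(model.predict({"topic:sports": 1.0}) > 0.85)  # -> True
```

The point of the sketch is the shape of the loop: the model is usable for serving at every moment, and each new sample nudges the weights immediately rather than waiting for a nightly batch job.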

 

 

However, Toutiao's content pool is now extremely large: counting short videos, it is at the level of millions of items. It is impossible for the recommendation model to score every piece of content for every request.

 

So a recall stage with designed strategies is needed: each time, it screens thousands of candidates out of the massive content library. The most important requirement for recall strategies is extreme performance — generally, the timeout cannot exceed 50 milliseconds.

 

 

There are many kinds of recall strategies; we mainly use an inverted-index approach. An inverted index is maintained offline, where the keys can be categories, topics, entities, sources, and so on.

 

The postings are ordered by heat, freshness, user actions, and the like. Online recall can then quickly cut into the inverted index from the user's interest tags, efficiently and reliably screening a small candidate set out of the huge content library.
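The inverted-index recall described above can be sketched in a few lines. The corpus, tags, and heat scores below are hypothetical toy data; a production index would hold pre-sorted postings and far more keys:

```python
from collections import defaultdict

# Offline: build inverted indexes keyed by tag (category, topic, entity...).
docs = {
    1: {"tags": ["football", "sports"], "heat": 90},
    2: {"tags": ["tennis", "sports"],   "heat": 70},
    3: {"tags": ["stocks", "finance"],  "heat": 80},
}
index = defaultdict(list)
for doc_id, doc in docs.items():
    for tag in doc["tags"]:
        index[tag].append(doc_id)

def recall(user_tags, limit=100):
    """Online: union the postings for the user's interest tags,
    then order the merged candidates by heat before truncating."""
    candidates = set()
    for tag in user_tags:
        candidates.update(index.get(tag, []))
    ranked = sorted(candidates, key=lambda d: docs[d]["heat"], reverse=True)
    return ranked[:limit]

print(recall(["sports"], limit=2))  # -> [1, 2]
```

Because lookups are dictionary reads plus a bounded sort over a small candidate set, this stage can stay within a tight latency budget, leaving the expensive model scoring for the few thousand survivors.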

 

 

II. Content Analysis

 

Content analysis includes text analysis, image analysis, and video analysis. Toutiao started as a graphic-and-text information product, so today we mainly discuss text analysis. One very important use of text analysis in a recommendation system is user interest modeling.

 

Without content and text tags, there is no way to derive user interest tags. For example, only by knowing that an article's tag is "internet" can we know that a user who reads it should carry an "internet" interest tag; the same applies to other keywords.

 

 

On the other hand, text tags can directly feed recommendation features: for example, Meizu-related content can be recommended to users who follow Meizu — that is user-tag matching.

 

If recommendation on the main feed is unsatisfactory for a period of time, recommendation narrowing occurs, and users find that going into a specific channel to read (such as technology, sports, entertainment, or military) and then returning to the main feed makes the recommendations better.

 

Because the whole model is shared, a sub-channel explores a smaller space and satisfies user needs more easily. It is relatively difficult to substantially improve recommendation accuracy through a single channel's feedback alone, so doing the sub-channels well is very important — and that in turn requires good content analysis.

 

 

The figure above is an actual text case from Toutiao. You can see that this article carries text features such as classification, keywords, topics, and entity words.

 

Of course, a recommendation system is not unworkable without text features. Amazon's earliest recommendation system, and even systems from the Walmart era — including Netflix's video recommendations — did collaborative filtering directly, without text features.

 

But for news products, where most of what users consume is same-day content, cold-starting new content without text features is very difficult; collaborative features cannot solve the cold-start problem for new articles.

 

 

The text features extracted by Toutiao's recommendation system mainly fall into the following categories. The first is explicit semantic tags, where articles are explicitly marked with semantic tags.

 

This class of tags forms a taxonomy defined by people: each tag has a clear meaning, and the tag system is predefined.

 

There are also implicit semantic features, mainly topic features and keyword features. Topic features describe a probability distribution over words and carry no explicitly defined meaning, while keyword features are based on a unified feature representation without an explicitly defined vocabulary.

 

 

Text similarity features are also very important. At Toutiao, one of the biggest pieces of user feedback has always been: why do you keep recommending repeated content? The difficulty of this problem is that everyone defines "repeated" differently.

 

For example, one person reads an article about Real Madrid and Barcelona, saw similar content yesterday, and considers today's article about the two teams a repeat.

 

But a die-hard fan, especially a Barcelona fan, cannot wait to read every report. Solving this problem requires judging similarity between articles by topic, lead paragraph, body text, and so on, and building online strategies based on these features.

 

Similarly, there are spatiotemporal features: analyzing the location relevance and timeliness of content. For example, pushing news of Wuhan's license-plate restrictions to users in Beijing probably makes no sense.

 

Finally, quality-related features must also be considered: judging whether content is vulgar, pornographic, a paid placement ("soft" article), or chicken-soup-for-the-soul filler.

 

 

The figure above shows Toutiao's semantic tag features and their usage scenarios. They differ in level and in requirements.

 

 

The goal of classification is full coverage: we hope every piece of content and every video receives a classification. The entity system, by contrast, demands precision: for the same name or the same content, it must distinguish exactly which person or thing is being referred to, but it does not need to cover everything.

 

The concept system handles semantics that are more precise than categories but more abstract than entities. This was our initial division of labor; in practice we later found that classification and concepts can share the same technology, and we eventually unified them under one technical architecture.

 

 

At present, implicit semantic features can already serve recommendation very effectively, while semantic tags require continuous labeling: new terms and new concepts keep emerging, and the tagging must keep iterating. The difficulty and resource cost of doing this well is much higher than for implicit semantic features. So why are semantic tags needed at all?

 

There are product needs: for example, content category channels require an explicitly defined, easily understood tag system. Semantic tagging is also the touchstone for checking the strength of a company's NLP technology.

 

 

The online classification in Toutiao's recommendation system uses a typical hierarchical text classification algorithm.

 

At the top is Root; the first layer below it contains categories such as technology, sports, finance, and entertainment; below that, sports subdivides into football, basketball, table tennis, tennis, track and field, swimming, and so on; football subdivides into international football and Chinese football; and Chinese football further subdivides into Chinese League One, the Chinese Super League, the national team, and so on. Compared with a single flat classifier, hierarchical text classification can better handle the problem of data skew.
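The routing idea can be sketched as descending a tree of per-node classifiers until a leaf is reached. The per-node classifiers below are stubbed with keyword-overlap rules purely for illustration — in the real system each node would hold a trained model (SVM, CNN, RNN, etc.), and the category tree here is a tiny made-up fragment:

```python
# Hypothetical category tree: node -> children ([] marks a leaf).
TREE = {
    "root":       ["sports", "technology"],
    "sports":     ["football", "basketball"],
    "football":   [],
    "basketball": [],
    "technology": [],
}

KEYWORDS = {  # toy stand-in for each node's trained classifier
    "sports":     {"match", "team", "goal"},
    "technology": {"chip", "software"},
    "football":   {"goal"},
    "basketball": {"dunk"},
}

def classify(tokens, node="root"):
    """Descend the tree, picking the best-scoring child at each level."""
    children = TREE[node]
    if not children:
        return node                     # reached a leaf category
    best = max(children, key=lambda c: len(KEYWORDS[c] & tokens))
    return classify(tokens, best)

print(classify({"team", "goal", "match"}))  # -> football
```

Because each node only discriminates among its own children, every classifier sees a more balanced training distribution than one flat model over hundreds of leaf categories would — which is the data-skew benefit the text describes.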

 

There are some exceptions: if you want to improve recall, you can see some skip-level connections in the figure. The architecture is general, but the classifier at each node can be heterogeneous depending on the difficulty of the problem: some problems are solved well with SVMs, some combine CNNs, and some use RNNs for further processing.

 

 

The figure above is a case of the entity-word recognition algorithm. Candidates are selected based on word segmentation and part-of-speech tagging; phrases may need to be spliced together against the knowledge base, since some entities are combinations of several words, and we must determine which combination of words maps to an entity description.

 

If the candidate maps to multiple entities, disambiguation is performed using word vectors, topic distributions, word frequency within the article itself, and other signals, and a relevance model makes the final calculation.

 

III. User Tagging

 

Content analysis and user tagging are the two cornerstones of the recommendation system. Content analysis involves more machine learning; by comparison, user tagging carries more engineering challenges.

 

 

Toutiao's common user tags include the categories and topics a user is interested in, keywords, sources, interest-based user clusters, and various vertical interest features (cars, sports teams, stocks, and so on), as well as gender, age, location, and other information.

 

Gender information is obtained through third-party social-account login. Age is usually predicted by a model, estimated from signals such as device model and reading-time distribution.

 

Resident location comes from location information the user has authorized us to access, with traditional clustering methods applied to the location reports to obtain resident points.
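A crude stand-in for that "traditional clustering" is to snap each location report to a grid cell and take the centroid of the densest cell. The cell size, method, and coordinates below are illustrative assumptions, not Toutiao's actual approach:

```python
from collections import Counter

def resident_point(points, cell=0.01):
    """Estimate a resident location from raw (lat, lon) reports by
    grid-binning and returning the centroid of the most popular cell."""
    key = lambda lat, lon: (round(lat / cell), round(lon / cell))
    cells = Counter(key(lat, lon) for lat, lon in points)
    top_cell, _ = cells.most_common(1)[0]
    members = [(lat, lon) for lat, lon in points if key(lat, lon) == top_cell]
    n = len(members)
    return (sum(p[0] for p in members) / n, sum(p[1] for p in members) / n)

# Mostly reports near one point, plus one far-away outlier trip.
pts = [(39.900, 116.400), (39.901, 116.401), (39.902, 116.399),
       (31.230, 121.470)]
lat, lon = resident_point(pts)
print(round(lat, 3), round(lon, 3))  # -> 39.901 116.4
```

Density-based methods like DBSCAN would handle irregular report patterns better; the grid version just makes the idea concrete: the outlier report is ignored and the dominant cluster wins.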

 

Combining resident points with other information, we can infer the user's workplace, business-trip locations, and travel destinations. These user tags are very helpful for recommendation.

 

 

Of course, the simplest user tags come from the tags of content the user has browsed. But this involves some data-processing strategies.

 

These mainly include:

 

1. Noise filtering: filter out clicks with very short dwell time to suppress clickbait.

 

2. Hot-item penalty: down-weight a user's actions on extremely popular articles (such as the PG One news a while back). In theory, the more widely a piece of content spreads, the lower its confidence as a signal of personal interest.

 

3. Time decay: user interests shift, so the strategy favors newer user behavior. As user actions accumulate, old feature weights decay over time, while features from new actions contribute greater weight.

 

4. Impression penalty: if an article recommended to the user is not clicked, the weights of the related features (category, keywords, source) are penalized.

 

At the same time, the global context must also be considered — whether related content is being pushed heavily overall — along with related negative signals such as closing and disliking articles.
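The time-decay and impression-penalty strategies above can be sketched as one update rule over a user's tag-weight profile. All constants here are illustrative assumptions, not production values:

```python
def update_tag_weights(weights, events, decay=0.9, show_penalty=0.2,
                       click_gain=1.0):
    """Apply one batch of (tag, was_clicked) events to a tag profile:
    first decay all old weights toward zero (time decay), then add
    weight on clicks and subtract a small penalty for impressions
    that were shown but not clicked (impression penalty)."""
    new = {tag: w * decay for tag, w in weights.items()}   # time decay
    for tag, clicked in events:
        if clicked:
            new[tag] = new.get(tag, 0.0) + click_gain
        else:
            new[tag] = new.get(tag, 0.0) - show_penalty    # impression penalty
    return new

profile = {"nba": 2.0, "stocks": 1.0}
profile = update_tag_weights(profile, [("nba", True), ("stocks", False)])
print(profile)  # nba boosted (2.8), stocks decayed and penalized (0.7)
```

Run repeatedly, the multiplicative decay makes stale interests fade geometrically while fresh clicks dominate — exactly the "new behavior contributes more weight" property the strategy describes.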

 

 

User tag mining is in general relatively simple; the main challenges are the engineering ones just mentioned. The first version of Toutiao's user tagging was a batch computing framework, and the process was fairly simple: every day, pull the previous day's active users' action data over the past two months and batch-compute the results on a Hadoop cluster.

 

 

The problem was that with rapid user growth, the interest models and other batch processing tasks kept multiplying, and the amount of computation involved grew too large.

 

By 2014, the Hadoop batch task of updating tags for millions of users was barely finishing within a day. The cluster's compute constraints easily affected other jobs, the pressure of concentrated writes to the distributed storage system began to grow, and the latency of user interest tag updates kept rising.

 

 

Facing these challenges, at the end of 2014 Toutiao launched a streaming user-tagging system built on a Storm cluster. After switching to streaming, tags are updated as soon as there are new user actions; the CPU cost per update is relatively small, saving 80% of CPU time and greatly reducing the cost of computing resources.

 

At the same time, just a few dozen machines can support daily interest-model updates for tens of millions of users, and feature updates are very fast — essentially near real-time. This system has been in use ever since it launched.

 

 

Of course, we also found that not all user tags need streaming computation. Information like gender, age, and resident location does not require real-time recomputation and still retains daily batch updates.

 

IV. Evaluation and Analysis

 

The above describes the overall architecture of the recommendation system. So how do we evaluate whether the recommendation effect is good?

 

I think one saying is very wise: "What you cannot measure, you cannot optimize." The same is true of recommendation systems.

 

 

In fact, many factors affect recommendation results: changes to the candidate set, improvements to or additions of recall modules, improvements to the model architecture, tuning of recommendation feature and algorithm parameters, and so on — too many to enumerate.

 

The significance of evaluation is that many optimizations may ultimately have a negative effect; an improvement does not go online just because the optimization seemed promising.

 

 

Comprehensively evaluating a recommendation system requires a complete evaluation framework, a powerful experimentation platform, and easy-to-use empirical analysis tools.

 

"Complete" means it cannot rest on a single metric: click-through rate alone, or dwell time alone, is not a comprehensive evaluation.

 

In many companies the algorithms do poorly not because the engineers lack ability, but because they need a powerful experimentation platform, along with convenient analysis tools that can intelligently analyze the statistical confidence of the metrics.

 

 

A good evaluation framework needs to be established following several principles. First, consider both short-term and long-term metrics. When I was responsible for the e-commerce direction at my previous company, I observed many strategy adjustments that felt fresh to users in the short term but brought no long-term benefit.

 

Second, consider both user metrics and ecosystem metrics. The platform must create value for content creators and let them create with dignity, and it also has an obligation to satisfy users — the two must be balanced.

 

Advertisers' interests must also be considered; this is a multi-party game and balancing process.

 

Also, pay attention to interactions between experiments. Strict traffic isolation is hard to achieve, so watch out for external effects.

 

 

The advantage of a powerful experimentation platform is most direct when many experiments run concurrently: the platform allocates traffic automatically without human coordination, and traffic is reclaimed immediately when an experiment ends, improving management efficiency.

 

This helps the company reduce analysis costs and speed up algorithm iteration, so that optimization of the whole system moves forward quickly.

 

 

This is the basic principle of Toutiao's A/B test experimentation system. First, users are bucketed in advance; then online traffic is allocated among the experiments, the buckets are tagged to experimental groups, and data is distributed to the groups offline.

 

For example, to open an experiment on 10% of traffic with two experimental groups of 5% each: one 5% group is the baseline, running the same strategy as the live production system, and the other runs the new strategy.
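Deterministic user bucketing of this kind is usually done by hashing the user ID with an experiment-specific salt. The salt, bucket count, and split below are illustrative assumptions, not Toutiao's actual configuration:

```python
import hashlib

def assign_bucket(user_id, salt="exp_feed_v1"):  # salt is hypothetical
    """Deterministically map a user to one of 100 traffic buckets."""
    h = hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest()
    return int(h, 16) % 100

def experiment_group(user_id):
    """10% of traffic in the experiment: buckets 0-4 baseline,
    buckets 5-9 new strategy, everything else stays on production."""
    b = assign_bucket(user_id)
    if b < 5:
        return "baseline"
    if b < 10:
        return "new_strategy"
    return "online"

groups = {experiment_group(f"user{i}") for i in range(1000)}
print(sorted(groups))  # all three groups appear with 1000 users
```

Using a cryptographic hash rather than Python's built-in `hash()` keeps the assignment stable across processes and restarts, and changing the salt per experiment re-randomizes users so consecutive experiments do not inherit each other's splits.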

 

 

During the experiment, user actions are collected essentially in quasi-real time and can be viewed hourly. But because hourly data fluctuates, results are usually viewed at the daily level. After collection, actions go through log processing, distribution statistics, and writes to the database — all very convenient.

 

 

In this system, an engineer only needs to set the traffic size, the experiment duration, special filter conditions, and the experimental group IDs. The system can then automatically generate comparative experiment data, confidence levels for the metrics, experiment-result summaries, and optimization suggestions.
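One standard way such a platform produces the "confidence" number for a metric like CTR is a two-proportion z-test between control and treatment. The talk does not specify which statistics Toutiao actually uses, so this is a generic sketch with made-up counts:

```python
import math

def ctr_z_test(clicks_a, shows_a, clicks_b, shows_b):
    """Two-proportion z-test on CTR between control (a) and treatment (b).
    Returns the z statistic and the two-sided p-value."""
    p_a, p_b = clicks_a / shows_a, clicks_b / shows_b
    p = (clicks_a + clicks_b) / (shows_a + shows_b)        # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / shows_a + 1 / shows_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF via erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

z, p = ctr_z_test(clicks_a=500, shows_a=10_000, clicks_b=600, shows_b=10_000)
print(round(z, 2), p < 0.05)  # a 5.0% -> 6.0% CTR lift is significant here
```

A platform would run this (or a stronger test) per metric per group and only flag a result as a win when the p-value clears a preset threshold — which is why small experiments often need to run for days before the confidence column settles.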

 

 

Of course, an experimentation platform alone is not enough. An online experiment platform can only infer changes in user experience from changes in data metrics, but data metrics and user experience are not the same thing, and many goals cannot be fully quantified.

 

Many improvements still have to be analyzed manually, and major improvements require manual evaluation and secondary confirmation.

 

V. Content Security

 

Finally, let me introduce some of Toutiao's measures in content security. Toutiao is now the largest platform for content creation and distribution in its market, and it must pay ever more attention to its social responsibility and its responsibility as an industry leader. If even 1% of recommended content is problematic, the impact is enormous.

 

 

Toutiao's content now comes mainly from two sources: first, PGC platforms with mature content-production capability; second, UGC user content such as Q&A, user reviews, and micro-headlines. Both go through a unified review mechanism. PGC content, whose volume is relatively small, is reviewed for risk directly, and if no problem is found it is recommended at large scale.

 

UGC content first needs to be filtered by a risk model, and questionable content enters a secondary risk review. Only after approval is the content actually recommended. If it then receives a certain volume of comments or negative-feedback reports, it returns to the review pipeline and is taken down directly if problems are found.

 

The whole mechanism is relatively sound; as the industry leader in content security, Toutiao has always held itself to the highest standards.

 

 

The shared content-recognition technology mainly consists of a porn-detection model, an abuse model, and a vulgarity model. Toutiao's vulgarity model is trained with deep learning algorithms on a very large sample library, analyzing images and text simultaneously.

 

These models emphasize recall, even at some sacrifice of precision. The abuse model's sample library likewise exceeds a million samples, with recall up to 95%+ and precision of 80%+. For users who frequently post offensive or improper comments, we have penalty mechanisms.

 

 

Identifying generally low-quality content involves a great many cases — fake news, plagiarized manuscripts, title-body mismatch, clickbait, poor content quality, and so on — that are very difficult for machines to understand, requiring a large volume of feedback samples and comparison with other information.

 

Currently the precision and recall of the low-quality model are not especially high, so it must be combined with manual review by raising the threshold. The final recall currently reaches 95%, but there is in fact still a great deal of work to do in this area.

 
