Interpretation of Twitter Recommendation Algorithm
Twitter recently open sourced its most valuable asset: the recommendation algorithm!
Every day, people post more than 500 million tweets on Twitter, and Twitter delivers more than 150 billion tweets to users. From this volume, the recommendation algorithm surfaces only a small number of relevant, engaging tweets for each user. As on other UGC platforms such as Douyin, a good recommendation algorithm is key to Twitter's success. This article walks through how Twitter recommends content.
Recommendation Algorithm Composition
The recommendation algorithm consists of many parts: it is a collection of different models, features, and services. The following diagram shows how these components work together:
All components together attempt to answer two important questions:
- How likely are you to interact with another user in the future?
- What communities are on Twitter and what are the top tweets in them?
This is what a community looks like...
Twitter currently has 145,000 communities, some with millions of members, and they are updated every three weeks.
Demystifying the Recommendation Algorithm
The recommendation algorithm consists of three stages, chained in series as a pipeline:
- Candidate tweet collection
- Tweet ranking
- Tweet filtering
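The three stages can be pictured as functions chained one after another. The following is a minimal sketch of that idea, not Twitter's implementation: the `Tweet` fields and the function names `collect_candidates`, `rank`, `filter_tweets`, and `build_timeline` are all hypothetical, and each stage is reduced to a toy rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tweet:
    tweet_id: int
    author: str
    score: float = 0.0  # filled in by the ranking stage


def collect_candidates(pool, limit):
    """Stage 1: keep at most `limit` tweets from the pool (toy selection)."""
    return pool[:limit]


def rank(candidates, score_fn):
    """Stage 2: attach a score to each tweet and sort best-first."""
    scored = [Tweet(t.tweet_id, t.author, score_fn(t)) for t in candidates]
    return sorted(scored, key=lambda t: t.score, reverse=True)


def filter_tweets(ranked, blocked_authors):
    """Stage 3: drop tweets the user should not see."""
    return [t for t in ranked if t.author not in blocked_authors]


def build_timeline(pool, score_fn, blocked_authors, limit=1500):
    """Run the three stages in series."""
    candidates = collect_candidates(pool, limit)
    ranked = rank(candidates, score_fn)
    return filter_tweets(ranked, blocked_authors)
```

Because each stage only consumes the output of the previous one, any stage can be swapped out without touching the others.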
1. Candidate tweet collection
First, the 1,500 candidate tweets most relevant to the user are extracted from a pool of hundreds of millions of tweets.
Candidate tweets come from two main sources: people you follow and people you don't follow. Tweets come from both sources in a 50-50 split.
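To make the 50-50 split concrete, here is a small sketch that fills a candidate set of 1,500 tweets half from each source. The function name `mix_sources` and the fallback rule (one source fills the gap when the other runs short) are my own assumptions for illustration; the article does not say how Twitter handles a short source.

```python
def mix_sources(in_network, out_of_network, total=1500):
    """Take up to half of `total` from each source, in-network first.

    Assumes both lists are already sorted best-first. If one source runs
    short, the other fills the remaining slots (an assumed fallback rule).
    """
    half = total // 2
    picked_in = in_network[:half]
    picked_out = out_of_network[:total - len(picked_in)]
    # Top up from in-network if out-of-network was short.
    spare = total - len(picked_in) - len(picked_out)
    picked_in += in_network[half:half + spare]
    return picked_in + picked_out
```

With two sufficiently long sources, the result contains 750 tweets from people you follow followed by 750 from people you don't.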
Candidate sourcing relies on two graph processing technologies, Real Graph and GraphJet, together with an embedding technique called SimClusters, which is built on a custom matrix factorization algorithm.
Briefly, the components in the Candidate Tweet Collection System attempt to answer these questions:
- How likely is engagement between two users?
- How can we tell if a tweet is relevant to you if you don't follow the author?
- What Tweets have the people I follow interacted with recently?
- Who likes tweets similar to mine and what else have they liked lately?
- Which Tweets and users are similar to my interests?
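The question "who likes tweets similar to mine, and what else have they liked lately?" is essentially a walk over the user-tweet like graph. The sketch below illustrates that walk on a plain dictionary. It is an assumption-laden toy, not GraphJet: the function name `recommend_from_similar_likers`, the data layout, and the voting rule are all hypothetical.

```python
from collections import Counter


def recommend_from_similar_likers(likes, user, top_n=3):
    """Suggest tweets liked by users who share likes with `user`.

    `likes` maps each user to the set of tweet ids they liked.
    """
    mine = likes.get(user, set())
    votes = Counter()
    for other, theirs in likes.items():
        if other == user:
            continue
        overlap = len(mine & theirs)
        if overlap == 0:
            continue
        # Each tweet the user has not liked yet gets one vote per shared like.
        for tweet_id in theirs - mine:
            votes[tweet_id] += overlap
    # Sort by votes (descending), then by tweet id for a stable order.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tweet_id for tweet_id, _ in ranked[:top_n]]
```

Users who share more likes with you contribute more votes, so their other likes rise to the top of the candidate list.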
2. Tweet ranking
This step uses a neural network called Heavy Ranker to score the relevance of each candidate tweet. The network has about 48M parameters, and the system considers thousands of features when scoring each tweet.
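To give a feel for what "a neural network scores each tweet" means, here is a toy forward pass through a network with a single hidden layer. This is only a sketch under my own assumptions: the real Heavy Ranker is far larger, and the layer sizes, weights, and the function names `score_tweet` and `parameter_count` are hypothetical.

```python
import math


def relu(x):
    return max(0.0, x)


def score_tweet(features, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer network; returns a score in (0, 1)."""
    hidden = [
        relu(sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    logit = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))


def parameter_count(n_features, n_hidden):
    """Weights plus biases for the two layers of this toy network."""
    return n_features * n_hidden + n_hidden + n_hidden + 1
```

The parameter count grows with the product of the layer widths, which is how a model fed thousands of features reaches tens of millions of parameters.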
The main feature groups fed into the Heavy Ranker model are described below.
Aggregate features
Aggregate features make up most of the model's features. They are generated by rolling aggregations that maintain feature values within a specific range over a specific time window. Twitter computes long-term (50-day) and short-term ("real-time": up to 3 days, typically 30 minutes) aggregations.
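A rolling aggregation can be sketched as a counter that forgets events once they fall out of the time window. The class below is a simplified illustration only: the name `RollingCount` is hypothetical, and a plain count stands in for whatever aggregation Twitter actually computes.

```python
from collections import deque


class RollingCount:
    """Counts events that fall inside a sliding time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def add(self, timestamp):
        """Record one event; timestamps must arrive in non-decreasing order."""
        self._timestamps.append(timestamp)

    def value(self, now):
        """Return the number of events in the window (now - window, now]."""
        cutoff = now - self.window_seconds
        # Evict events that are too old to count.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)
```

A 30-minute "real-time" aggregate would use `RollingCount(30 * 60)`, while a 50-day aggregate would use `RollingCount(50 * 24 * 60 * 60)`.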
The list of aggregation features is as follows:
author_aggregate
author_topic_aggregate
list_aggregate
user_aggregate
user_author_aggregate
user_engager_aggregate
user_inferred_topic_aggregate
user_media_annotation_aggregate
user_mention_aggregate
user_request_context_aggregate
user_topic_aggregate
topic_aggregate
tweet_aggregate
Non-aggregate features
Twitter also has a number of independent features for capturing information about the user, tweet, author, and tweet context.
two_hop
realgraph
authors.realgraph
recap.tweetfeature, recap.searchfeature, etc.
tweetsource
in_reply_to_tweet
timelines.earlybird
realtime_interaction_graph
user_tweet.recommendations
other
Embedding features
TwHIN is a large graph embedding trained on Twitter data. The model uses three 200-dimensional embeddings from the TwHIN algorithm.
TwHIN Follow Embeddings
TwHIN Engagement Embeddings
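An embedding is just a fixed-length vector, and a common way to compare two of them is cosine similarity. The sketch below shows this for 200-dimensional vectors. Note that the article only says the embeddings are inputs to the ranker, so the comparison step, the random vectors, and the names `cosine_similarity`, `user_vec`, and `author_vec` are assumptions made for illustration.

```python
import math
import random

EMBEDDING_DIM = 200  # dimension stated in the article


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Random stand-ins for a user embedding and an author embedding.
rng = random.Random(0)
user_vec = [rng.gauss(0.0, 1.0) for _ in range(EMBEDDING_DIM)]
author_vec = [rng.gauss(0.0, 1.0) for _ in range(EMBEDDING_DIM)]
similarity = cosine_similarity(user_vec, author_vec)
```

A value close to 1 means the two vectors point in nearly the same direction, and a value close to 0 means they are unrelated.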
⚠ Note that due to user settings or other constraints, not all features are available for each request, so the "For You" ranking may vary somewhat depending on which features are present.
3. Tweet filtering
This stage creates a balanced and diverse feed by filtering candidate tweets based on various factors, such as:
- Frozen accounts
- Duplicate tweets
- Author diversity (avoiding too many tweets from the same author)
- Edited tweets
- NSFW content, and more
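Several of these filters can be expressed as simple checks applied in one pass over the ranked list. The following sketch is a toy version under my own assumptions: the dictionary keys, the `blocked_authors` set, the per-author cap of two, and the function name `filter_feed` are hypothetical and do not come from Twitter's code.

```python
def filter_feed(ranked_tweets, blocked_authors, max_per_author=2, allow_nsfw=False):
    """Apply simple visibility, duplicate, and author-diversity filters.

    Each tweet is a dict with keys "id", "author", "text", and "nsfw".
    The input order (best-first) is preserved in the output.
    """
    seen_texts = set()
    per_author = {}
    kept = []
    for tweet in ranked_tweets:
        if tweet["author"] in blocked_authors:
            continue  # account the user should not see
        if tweet["nsfw"] and not allow_nsfw:
            continue  # sensitive content
        if tweet["text"] in seen_texts:
            continue  # duplicate content
        if per_author.get(tweet["author"], 0) >= max_per_author:
            continue  # too many tweets from the same author
        seen_texts.add(tweet["text"])
        per_author[tweet["author"]] = per_author.get(tweet["author"], 0) + 1
        kept.append(tweet)
    return kept
```

Since filtering runs after ranking, each rule keeps the highest-scoring tweet that passes and discards the lower-scoring ones.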
Serving tweets with the Home Mixer service
After the above three stages are completed, the selected tweets can be pushed to the user.
Twitter has a service called Home Mixer that builds the For You timeline. Home Mixer is written mainly in Scala and connects all the recommendation stages together. It is also responsible for mixing tweets with other non-tweet content, such as ads, follow recommendations, and login prompts.
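The mixing step can be sketched as interleaving a list of tweets with a list of other items. The code below uses Python for brevity rather than Scala, and it is purely illustrative: the function name `mix_timeline` and the fixed "one extra item every N tweets" rule are assumptions, not how Home Mixer actually places ads or prompts.

```python
def mix_timeline(tweets, extras, every=4):
    """Insert one extra item (ad, follow suggestion, ...) after every `every` tweets."""
    if every < 1:
        raise ValueError("every must be at least 1")
    extras_iter = iter(extras)
    timeline = []
    for index, tweet in enumerate(tweets, start=1):
        timeline.append(tweet)
        if index % every == 0:
            extra = next(extras_iter, None)
            if extra is not None:  # stop inserting once extras run out
                timeline.append(extra)
    return timeline
```

For example, `mix_timeline(["t1", "t2", "t3", "t4", "t5"], ["ad1", "ad2"], every=2)` yields `["t1", "t2", "ad1", "t3", "t4", "ad2", "t5"]`.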
The entire pipeline we discussed above runs about 5 billion times per day, with an average completion time of less than 1.5 seconds.
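A quick back-of-the-envelope calculation puts these numbers in perspective. The sketch below assumes the load is spread evenly over the day, which real traffic certainly is not, and uses the 1.5-second figure as an upper bound on average latency.

```python
SECONDS_PER_DAY = 24 * 60 * 60
runs_per_day = 5_000_000_000

# Average request rate if the load were perfectly uniform (an assumption).
runs_per_second = runs_per_day / SECONDS_PER_DAY

# Little's law: average in-flight requests = arrival rate * average latency.
# Latency is "less than 1.5 seconds", so this is an upper bound.
in_flight_upper_bound = runs_per_second * 1.5

print(f"{runs_per_second:,.0f} runs per second")
print(f"at most about {in_flight_upper_bound:,.0f} requests in flight on average")
```

That works out to roughly 58,000 pipeline runs per second, with fewer than about 87,000 requests being processed at any given moment.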
Summary
Although this article does not go into the technical details of the algorithm, Twitter has open sourced the code on GitHub. In later articles, I will explore the implementation details module by module. It is great that Twitter is willing to open source its most valuable core algorithm. As Elon Musk said, he really is trying to liberate the blue bird and make it more transparent to users.