[Recommended] Architecture Fundamentals, Day 6: The Instagram Recommendation Algorithm

Every month, more than half of the Instagram community visits the Explore page to discover new photos, videos, and Stories relevant to their interests. Recommending the most relevant content out of hundreds of millions of options, in real time and at scale, poses multiple challenges for Instagram's engineers and calls for novel engineering solutions.

Instagram addressed these challenges by creating a series of custom query languages, lightweight modeling techniques, and tools that enable high-velocity experimentation. These systems support the scale of Explore while boosting developer efficiency. Collectively, these solutions form an AI system based on a highly efficient three-part ranking funnel that extracts 65 billion features and makes 90 million model predictions every second.

In this article, we share the first detailed overview of the key elements that make Explore work, and of how we provide personalized content for people on Instagram.

Developing the building blocks of Explore

Before we could build a recommendation engine to handle the sheer volume of photos and videos uploaded to Instagram every day, we developed foundational tools to address three important needs: the ability to run rapid experiments at scale, a stronger signal on the breadth of people's interests, and a computationally efficient way to ensure that our recommendations are both high quality and fresh. These custom techniques were key to achieving our goals.

 

Rapid iteration with IGQL: a new domain-specific language

Building the best recommendation algorithms and techniques is an ongoing area of research in the machine learning community, and the process of choosing the right system can vary widely depending on the task. For example, one algorithm may be effective at identifying long-term interests, while another may perform better at recommending content based on recent activity. Our engineering team iterates on different algorithms, so we needed a way to try out new ideas efficiently and to apply the promising ones to a large-scale system easily, without worrying too much about the impact on computational resources such as CPU and memory usage.

To solve this problem, we created and shipped IGQL, a domain-specific language optimized for retrieving candidates in recommender systems. Its execution is optimized in C++, which helps minimize both latency and compute resources. It is also extensible and easy to use when testing new research ideas. IGQL is both statically validated and high-level: engineers can write recommendation algorithms in a Python-like way and have them execute quickly and efficiently in C++.

user
.let(seed_id=user_id)
.liked(max_num_to_retrieve=30)
.account_nn(embedding_config=default)
.posted_media(max_media_per_account=10)
.filter(non_recommendable_model_threshold=0.2)
.rank(ranking_model=default)
.diversify_by(seed_id, method=round_robin)

As the example above shows, IGQL is highly readable even for engineers who have not worked extensively with the language. It helps compose multiple recommendation stages and algorithms in a principled way. For example, we can optimize the ensemble of candidate generators by using a combiner rule in the query to output a weighted blend of several sub-query results. By adjusting the weights, we can find the combination that produces the best user experience.
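The weighted blending of sub-query results can be sketched in plain Python. This is not IGQL, whose combiner syntax is not shown in this article: the function `blend_sources`, the source names, and the weights are all hypothetical and serve only to illustrate the idea of drawing from each candidate source in proportion to a tunable weight.

```python
import random

def blend_sources(sources, weights, k, seed=0):
    """Draw up to k unique candidates. Each draw picks a source with
    probability proportional to its weight, then takes that source's
    next unused candidate."""
    rng = random.Random(seed)
    queues = {name: list(items) for name, items in sources.items()}
    seen, blended = set(), []
    while len(blended) < k:
        live = [name for name in queues if queues[name]]
        if not live:
            break  # every source is exhausted
        name = rng.choices(live, weights=[weights[n] for n in live])[0]
        candidate = queues[name].pop(0)
        if candidate not in seen:  # skip media already contributed by another source
            seen.add(candidate)
            blended.append(candidate)
    return blended

# Hypothetical sub-query outputs (media IDs) and blend weights
sources = {
    "liked_accounts": ["m1", "m2", "m3", "m4"],
    "saved_accounts": ["m3", "m5", "m6"],
}
weights = {"liked_accounts": 0.7, "saved_accounts": 0.3}
blended = blend_sources(sources, weights, k=5)
```

Changing the values in `weights` shifts how much of the blended list each source contributes, which is the knob the paragraph above describes tuning.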

IGQL makes it simple to perform tasks that are common in complex recommendation systems, such as building nested trees of combiner rules. It lets engineers focus on the machine learning and business logic behind recommendations rather than on logistics such as fetching the right quantity of candidates for each query. It also provides a high degree of code reuse. For example, applying a ranker is as simple as adding a one-line rule to an IGQL query, and it is trivial to add it in multiple places, such as ranking accounts and ranking the media posted by those accounts.

 

Account embeddings for personalized ranking

People publicly share billions of pieces of high-quality media on Instagram that are eligible for Explore. Maintaining a clear and ever-evolving catalog of the wide variety of community interests on Explore is challenging: topics range from calligraphy to model trains. As a result, content-based models have difficulty grasping such a variety of interest-based communities.

Because Instagram has a large number of accounts focused on a specific interest (such as cats or cars), we created a retrieval pipeline that focuses on account-level information rather than media-level information. By building account embeddings, we can more efficiently identify which accounts are topically similar to each other. We infer account embeddings using ig2vec, a word2vec-like embedding framework. Typically, a word2vec embedding framework learns a representation of a word from its context across the sentences of a training corpus. ig2vec treats the account IDs that a user interacts with (for example, accounts whose media the person liked) as a sequence of words in a sentence.

By applying the same techniques as word2vec, we can predict the accounts a person is likely to interact with in a given session within the Instagram app. If a person interacts with a sequence of accounts in the same session, those accounts are more likely to be topically coherent than a random sequence drawn from the diverse range of Instagram accounts. This helps us identify topically similar accounts.
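To make the "sessions as sentences" idea concrete, here is a minimal sketch of how skip-gram training pairs could be generated from interaction sessions, in the style of standard word2vec. The training details of ig2vec are not described in this article, so the window size, the function name `skipgram_pairs`, and the account IDs are assumptions for illustration.

```python
def skipgram_pairs(session, window=2):
    """Return (target, context) account-ID pairs for accounts that appear
    within `window` positions of each other in one session."""
    pairs = []
    for i, target in enumerate(session):
        lo = max(0, i - window)
        hi = min(len(session), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, session[j]))
    return pairs

# Hypothetical sessions: each list is the sequence of accounts one person
# interacted with during a single app session.
sessions = [
    ["cat_account_a", "cat_account_b", "cat_account_c"],
    ["car_account_x", "car_account_y"],
]
pairs = [p for s in sessions for p in skipgram_pairs(s, window=1)]
```

Pairs only ever form within a session, so accounts that co-occur in sessions are pushed toward nearby embeddings during training, while accounts from unrelated sessions are not.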

We define a distance metric between two accounts, the same one used in embedding training, which is usually cosine distance or dot product. Based on this, we do a KNN (k-nearest neighbors) lookup to find accounts that are topically similar to any account in the embedding. Our embedding version covers millions of accounts, and we use FAISS, Facebook's state-of-the-art nearest neighbor retrieval engine, as the supporting retrieval infrastructure.
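The lookup itself can be illustrated with a brute-force cosine-similarity search. This is only a toy sketch: the three-dimensional vectors and account names are made up, and an exhaustive scan like this would not scale to millions of accounts, which is why an engine such as FAISS is used in production.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_accounts(query_id, embeddings, k):
    """Return the k accounts whose embeddings are most similar to query_id's."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vec), account_id)
        for account_id, vec in embeddings.items()
        if account_id != query_id
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [account_id for _, account_id in scored[:k]]

# Hypothetical account embeddings
embeddings = {
    "cats_daily": [0.9, 0.1, 0.0],
    "kitten_pics": [0.8, 0.2, 0.1],
    "classic_cars": [0.1, 0.9, 0.2],
    "model_trains": [0.0, 0.3, 0.9],
}
neighbors = nearest_accounts("cats_daily", embeddings, k=2)
# neighbors == ["kitten_pics", "classic_cars"]
```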

For each version of the embedding, we train a classifier to predict a set of accounts' topics based solely on the embedding. By comparing the predicted topics with human-labeled topics, we can assess how well the embeddings capture topical similarity.

Retrieving accounts that are similar to those a particular person has previously expressed interest in helps us narrow down, simply and effectively, to a smaller personalized ranking inventory for each person. As a result, we are able to use state-of-the-art, computationally intensive machine learning models to serve every Instagram community member.

 

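Preselecting relevant candidates using model distillation

After we use ig2vec to identify the most relevant accounts based on individual interests, we need a way to rank these accounts so that the result is fresh and interesting for everyone. This requires predicting the most relevant media for each person every time they scroll the Explore page.

For instance, evaluating even just 500 media pieces through a deep neural network for every scrolling action requires a large amount of resources. Yet the more posts we evaluate for each user, the higher the chance of finding the best, most personalized media from their inventory.

To maximize the number of media items evaluated for each ranking request, we introduced a ranking distillation model that helps us preselect candidates before using more complex ranking models. Our approach is to train a super-lightweight model that learns from our main ranking models and approximates them as closely as possible. We record the input candidates, with their features, along with the outputs of the more complex ranking models. The distillation model is then trained on this recorded data, with a limited set of features and a simpler neural network architecture, to replicate the results. Its objective function is to optimize NDCG ranking loss (NDCG is a measure of ranking quality) over the main ranking model's output. We use the top-ranked posts from the distillation model as the candidates for the later-stage ranking models.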
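As a rough illustration of what "NDCG over the main ranking model's output" measures, the sketch below scores a student (distilled) model's ordering using the teacher's scores as relevance labels. It uses the common linear-gain form of NDCG as an assumption. It is an evaluation metric only: how the loss is made differentiable for training is not covered in this article, and the score values are made up.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg_against_teacher(teacher_scores, student_scores):
    """NDCG of the student's ordering, using teacher scores as relevance."""
    order = sorted(range(len(student_scores)),
                   key=lambda i: student_scores[i], reverse=True)
    achieved = dcg([teacher_scores[i] for i in order])
    ideal = dcg(sorted(teacher_scores, reverse=True))
    return achieved / ideal if ideal > 0 else 0.0

teacher = [0.9, 0.1, 0.5, 0.3]               # hypothetical main-model outputs
agreeing_student = [2.0, -1.0, 1.0, 0.0]     # same ordering as the teacher
disagreeing_student = [-1.0, 2.0, 0.0, 1.0]  # reversed ordering
perfect = ndcg_against_teacher(teacher, agreeing_student)
worse = ndcg_against_teacher(teacher, disagreeing_student)
```

Note that the student's raw scores need not match the teacher's: only the induced ordering matters, which is why a much smaller model can still score well on this objective.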

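Setting up the distillation model to mimic the main models minimizes the need to tune multiple parameters and maintain multiple models in different ranking stages. With this technique, we can efficiently evaluate a bigger set of media to find the most relevant media on every ranking request while keeping computational resources under control.

How we built Explore

After creating the key building blocks necessary to experiment easily, identify people's interests effectively, and produce efficient and relevant predictions, we had to combine these systems in production. Using IGQL, account embeddings, and our distillation technique, we split the Explore recommendation system into two main stages: the candidate generation stage (also known as the sourcing stage) and the ranking stage.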

 

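Candidate generation

First, we use accounts that people have interacted with before on Instagram (for example, by liking or saving media from an account) to identify which other accounts people might be interested in. We call these seed accounts. Seed accounts are usually only a fraction of the accounts on Instagram that share similar or the same interests. Then, we use account embedding techniques to identify accounts similar to the seed accounts. Finally, based on these accounts, we can find the media that these accounts posted or engaged with.

The figure above shows a typical source for Instagram Explore recommendations. People can engage with accounts and media on Instagram in many different ways (for example, following, liking, commenting, saving, and sharing). There are also different media types (for example, photos, videos, Stories, and Live), which means we can construct a variety of sources using a similar scheme. With IGQL, the process becomes very simple: different candidate sources are just represented as different IGQL subqueries.

With different types of sources, we are able to find tens of thousands of eligible candidates for the average person. We want to make sure the content we recommend is both safe and appropriate for a global community of many ages. Using a variety of signals, we filter out content we identify as not eligible to be recommended before we build out the eligible inventory for each person. In addition to blocking likely policy-violating content and misinformation, we use machine learning systems that help detect and filter content such as spam.

Then, for every ranking request, we identify thousands of eligible media for the average person, sample 500 candidates from the eligible inventory, and send the candidates downstream to the ranking stage.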
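The filter-then-sample step can be sketched as follows. The article does not say how the 500 candidates are sampled, so uniform random sampling is an assumption here, as are the function name `select_candidates`, the field `non_recommendable_score`, and the synthetic inventory. The 0.2 threshold simply echoes the `non_recommendable_model_threshold=0.2` value in the IGQL example earlier.

```python
import random

def select_candidates(candidates, is_eligible, sample_size=500, seed=0):
    """Drop ineligible candidates, then sample a fixed-size batch for ranking."""
    eligible = [c for c in candidates if is_eligible(c)]
    if len(eligible) <= sample_size:
        return eligible
    return random.Random(seed).sample(eligible, sample_size)

# Hypothetical inventory: each candidate carries a score from an assumed
# "non-recommendable" classifier, where lower means safer to recommend.
rng = random.Random(42)
inventory = [{"media_id": i, "non_recommendable_score": rng.random()}
             for i in range(5000)]
batch = select_candidates(
    inventory, lambda c: c["non_recommendable_score"] < 0.2)
```

Filtering before sampling guarantees that every candidate sent to the ranking stage has already passed the eligibility checks.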

 

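Ranking candidates

With 500 candidates available for ranking, we use a three-stage ranking infrastructure to help balance the trade-offs between ranking relevance and computational efficiency. The three ranking stages are as follows:

Stage 1: a distillation model that mimics the combination of the other two stages, with minimal features; it picks the 150 highest-quality and most relevant candidates out of 500.

Stage 2: a lightweight neural network model with the full set of dense features; it picks the 50 highest-quality and most relevant candidates.

Stage 3: a deep neural network model with the full set of dense and sparse features; it picks the 25 highest-quality and most relevant candidates (for the first page of the Explore grid).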
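The three stages above can be sketched as a simple funnel in which each stage keeps the top-scoring candidates for the next, more expensive stage. The stage sizes (150, 50, 25) come from the text; the three scoring functions are stand-ins, not real models, chosen so that cheaper stages see a coarser view of a single made-up `quality` value.

```python
def top_k(candidates, score_fn, k):
    """Keep the k candidates with the highest score under score_fn."""
    return sorted(candidates, key=score_fn, reverse=True)[:k]

def ranking_funnel(candidates, distilled, lightweight, deep):
    stage1 = top_k(candidates, distilled, 150)  # cheapest model, fewest features
    stage2 = top_k(stage1, lightweight, 50)     # full dense features
    stage3 = top_k(stage2, deep, 25)            # full dense and sparse features
    return stage1, stage2, stage3

# Hypothetical candidates with a hidden "quality"; (i * 37) % 500 is a
# permutation of 0..499, so every quality value appears exactly once.
candidates = [{"media_id": i, "quality": (i * 37) % 500} for i in range(500)]
distilled = lambda c: c["quality"] // 10   # coarse approximation
lightweight = lambda c: c["quality"] // 2  # finer approximation
deep = lambda c: c["quality"]              # most precise scorer
s1, s2, s3 = ranking_funnel(candidates, distilled, lightweight, deep)
```

The expensive scorer runs on only 50 candidates instead of 500, yet in this toy setup the final 25 are still the true top 25, because the cheaper stages approximate the final ordering well enough not to discard them.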

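This animation depicts the three-part ranking infrastructure, which we use to balance the trade-offs between ranking relevance and computational efficiency.

If the first-stage distillation model mimics the ranking of the other two stages, how do we decide the most relevant content in the next two stages? We predict the individual actions that people take on each piece of media, whether they are positive actions such as liking and saving, or negative actions such as "See Fewer Posts Like This". We use a multi-task multi-label (MTML) neural network to predict these events. The shared multilayer perceptron (MLP) allows us to capture the common signals from different actions.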
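The shape of such a network can be illustrated with a minimal forward pass: one shared hidden layer feeds a separate sigmoid head per action. This is a structural sketch only. The layer sizes, the task names, the input features, and the random untrained weights are all assumptions, and the real model's architecture is not specified beyond a shared MLP.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases):
    """One fully connected layer; weights[j] is the weight row for output j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def random_layer(rng, n_in, n_out):
    """Untrained layer with random weights and zero biases (illustration only)."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def mtml_forward(features, shared, heads):
    hidden = relu(dense(features, *shared))         # representation shared by all tasks
    return {task: sigmoid(dense(hidden, *head)[0])  # one probability per task
            for task, head in heads.items()}

rng = random.Random(0)
shared = random_layer(rng, n_in=4, n_out=8)
heads = {task: random_layer(rng, n_in=8, n_out=1)
         for task in ("like", "save", "negative_action")}
predictions = mtml_forward([0.2, 0.7, 0.0, 1.0], shared, heads)
```

Because every head reads the same hidden vector, signal learned from a frequent action such as liking can help the prediction of rarer actions such as saving.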

 

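Model architecture

We combine predictions of different events using an arithmetic formula, called the value model, to capture the prominence of different signals when deciding whether content is relevant. We use a weighted sum of predictions, such as [w_like * P(Like) + w_save * P(Save) - w_negative_action * P(Negative Action)]. If, for instance, we think that a person saving a post on Explore is more important than liking a post, the weight for the save action should be higher.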
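The formula above translates directly into code. The weights and predicted probabilities below are made-up numbers chosen to show how a higher save weight changes which post wins.

```python
def value_score(predictions, weights):
    """Weighted sum of predicted action probabilities (the value model)."""
    return (weights["like"] * predictions["like"]
            + weights["save"] * predictions["save"]
            - weights["negative_action"] * predictions["negative_action"])

# Hypothetical weights: saving counts twice as much as liking
weights = {"like": 1.0, "save": 2.0, "negative_action": 5.0}
post_a = {"like": 0.30, "save": 0.05, "negative_action": 0.01}
post_b = {"like": 0.10, "save": 0.20, "negative_action": 0.01}
score_a = value_score(post_a, weights)  # 0.30 + 0.10 - 0.05 = 0.35
score_b = value_score(post_b, weights)  # 0.10 + 0.40 - 0.05 = 0.45
```

Post B is less likely to be liked but more likely to be saved, and with these weights it outranks post A.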

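We also want Explore to be a place where people can discover a rich balance of both new interests and existing ones. We add a simple heuristic rule to the value model to boost the diversity of content: we down-rank posts from the same author or the same seed account by adding a penalty factor, so that people do not see multiple posts from the same author or the same seed account on Explore. This penalty increases the further down the ranked batch you go and the more posts from the same author you encounter.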
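One way to realize such a penalty is a greedy re-rank in which every already-selected post from the same author discounts the remaining posts by that author. The exact penalty used on Explore is not given in this article, so the multiplicative factor of 0.5 is an assumption, scores are assumed positive, and for brevity the sketch penalizes by author only, not by seed account.

```python
from collections import Counter

def diversify(posts, penalty=0.5):
    """Greedy re-rank: each already-selected post by the same author
    multiplies a candidate's score by `penalty` once more."""
    remaining = list(posts)
    chosen_per_author = Counter()
    ranked = []
    while remaining:
        best = max(
            remaining,
            key=lambda p: p["score"] * penalty ** chosen_per_author[p["author"]],
        )
        remaining.remove(best)
        chosen_per_author[best["author"]] += 1
        ranked.append(best)
    return ranked

# Hypothetical value-model scores
posts = [
    {"id": "a1", "author": "alice", "score": 0.9},
    {"id": "a2", "author": "alice", "score": 0.8},
    {"id": "a3", "author": "alice", "score": 0.7},
    {"id": "b1", "author": "bob", "score": 0.6},
    {"id": "c1", "author": "carol", "score": 0.5},
]
order = [p["id"] for p in diversify(posts)]
# order == ["a1", "b1", "c1", "a2", "a3"]
```

Without the penalty, the three posts by the same author would occupy the top three slots. With it, the second and third are pushed below posts from other authors, and the discount compounds with each additional post.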

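We rank the most relevant content in descending order of each candidate's final value model score. As the system evolves, our offline replay tool, along with Bayesian optimization tools, helps us tune the value model efficiently and frequently.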

 

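An ongoing machine learning challenge

One of the most exciting parts of building Explore is the ongoing challenge of finding new and interesting ways to help our community discover the most interesting and relevant content on Instagram. We are continuously evolving Explore, whether by adding media formats like Stories or by adding entry points to new types of content, such as shopping posts and IGTV videos.

The scale of both the Instagram community and its inventory requires a culture of high-velocity experimentation and developer efficiency in order to reliably recommend the best of Instagram for each person's individual interests. Our custom tools and systems have given us a strong foundation for the continuous learning and iteration that are essential to building and scaling Instagram Explore.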

 

About the authors:

Ivan Medvedev, software engineer; Haotian Wu, engineering manager; Taylor Gordon, research scientist. Reposted from AI前线 (AI Frontline).

 



Origin blog.csdn.net/Ture010Love/article/details/104546841