How Yimi Education built an efficient and stable recommendation platform | What SOFAStack users are saying


This article comes from Yimi Education, a SOFAArk user and contributor. It describes how adopting the SOFAArk component greatly improved the development efficiency and stability of Yimi's internal recommendation system. Thanks to Yimi Education for supporting SOFAStack; more users are welcome to contribute and join us.

SOFAArk is a lightweight, Java-based class isolation container, open-sourced by Ant Financial. It mainly provides class isolation and the ability to merge multiple applications (modules) into a single deployment.

| Preface

Most readers are familiar with personalized recommendation. Put simply, a recommendation model computes suitable items based on each person's preferences; the items can be videos, products, articles, movies, and so on. After years of Internet development, personalized recommendation is everywhere. Whether in e-commerce, education, games, finance, or other industries, a recommendation system contributes significantly to business growth. Yimi Education is an Internet education platform whose recommendation business has grown rapidly in recent years, and its technical team has kept improving its capabilities along the way. Rapid business growth requires an efficient, highly stable recommendation system to support the recommendation scenarios.

This article summarizes our hands-on experience building efficiency and stability into our internal recommendation platform, and describes how Yimi Education transformed and optimized its recommendation system. Throughout the process we analyzed the company's architecture, selected technologies, and settled on a transformation plan. We finally chose SOFAArk, the component development framework from the SOFAStack open source community, which greatly improved the development efficiency and stability of our recommendation system. We hope it serves as a reference for technical teams facing the same problems.

| Background

A complete personalized recommendation flow usually consists of steps such as recall, filtering, and ranking. The steps are few, but the logic behind them is extensive: A/B testing, user profiles, item profiles, offline data, online data, the model system, field completion, and more. Personalized recommendation also depends heavily on scenario customization, with different scenarios requiring different processing logic.
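As a rough illustration of the recall, filter, and rank steps described above, here is a minimal Java sketch. The `Item` class, the hard-coded candidates, and all method names are hypothetical and chosen for illustration only; they are not taken from Yimi's system.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class RecommendPipeline {
    /** Hypothetical candidate item: an id plus a model score. */
    static final class Item {
        final String id;
        final double score;
        Item(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    // Recall: fetch a broad candidate set (hard-coded here; a real system
    // would query indexes, offline data, or models).
    static List<Item> recall(String userId) {
        return List.of(new Item("course-a", 0.42),
                       new Item("course-b", 0.91),
                       new Item("course-c", 0.67));
    }

    // Filter: drop items the user has already seen.
    static List<Item> filter(List<Item> candidates, Set<String> seen) {
        return candidates.stream()
                .filter(item -> !seen.contains(item.id))
                .collect(Collectors.toList());
    }

    // Rank: order by score, highest first, and keep the top n.
    static List<Item> rank(List<Item> candidates, int n) {
        return candidates.stream()
                .sorted(Comparator.comparingDouble((Item item) -> item.score).reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    // The full flow: recall, then filter, then rank.
    static List<String> recommend(String userId, Set<String> seen, int n) {
        return rank(filter(recall(userId), seen), n).stream()
                .map(item -> item.id)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The user has already seen course-b, so it is filtered out.
        System.out.println(recommend("user-1", Set.of("course-b"), 2));
        // prints [course-c, course-a]
    }
}
```

Even in this toy form, each step has its own inputs and rules; once every scenario customizes each step, it is easy to see how a single application accumulates a large amount of logic.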

As you can imagine, when all of this processing logic lives inside one system, the application becomes bloated and complex. As the business keeps iterating, the system gradually becomes hard to maintain, and both development efficiency and system stability face serious challenges. Unfortunately, as Yimi's business grew rapidly, our internal recommendation platform became "overweight" in exactly this way. Iteration efficiency, performance, and stability all hit bottlenecks, for example:

1. Slow releases: each member of the algorithm team follows one line of business, so frequent business iterations mean frequent releases of the application. Because the system is so large and complex, each release is very slow, which lowers engineers' efficiency;

2. Bloated system: all modules, including storage, algorithms, and business logic, are maintained together, and almost every iteration only adds code without removing any, which lowers maintainability;

3. Overwrite risk: multiple teams maintain one shared code base, so branches easily conflict and merges risk overwriting other people's changes, which lowers the efficiency of teamwork;

4. Inconsistent versions: business teams use different versions of the same jar packages, so every jar upgrade causes many problems, and each team spends a lot of development time resolving dependency conflicts.

Against this background, Yimi's recommendation platform had to slim down the application and upgrade the system in order to improve development efficiency and platform stability. In the actual transformation, however, we found that the two goals conflict. To improve stability we have to add rules and controls, such as process controls over testing, gray release, and publishing, which inevitably slows business iteration. Conversely, to improve efficiency we have to give up some process, which brings potential stability risks. Still, people need dreams to drive them: every engineer hopes for a framework or solution that solves many common problems at once, cuts cost, and raises efficiency, so that engineers are no longer worn out and are freed to do more innovative and challenging work.

| Research

Efficiency and stability are not necessarily an either-or choice. Before upgrading the recommendation platform, we sorted out the main factors that affect business efficiency and system stability inside Yimi.


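Influencing factors

Development efficiency                              System stability
Business complexity + development complexity        Business changes: code changes + data changes
Business iteration process + development process    Non-business changes: configuration changes + code changes
Rollout of business changes + service changes       Traffic changes
Stability process                                   Hardware failures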

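Regarding development efficiency, the factors above show that, apart from the development work itself, which depends on the conveniences the platform provides and on each developer's technical skill, most of the rest comes down to process control. These controls exist first to ensure that business iterations are correct, and second to protect the stability of online services as iterations roll out. A process that is too simple cannot control these points, while an overly complex one seriously slows business iteration, so we need to find a balance. For example: how do we reduce the complexity of business development? How do we make the platform more convenient? How do we simplify business iteration and maintenance processes without hurting stability?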

Regarding stability, here are several similar incidents we ran into inside Yimi:

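  • A performance optimization of the recommendation service went live. Functional testing found no problems, but the change had not been load tested, so service capacity dropped at peak hours and eventually the whole service became unavailable. The upstream service, lacking proper service governance, was affected too and also became unavailable;

  • A data source or RPC that the recommendation service depends on suddenly went from a 10 ms response time to 100 ms, which exhausted the recommendation service's main thread pool and eventually made the service unavailable;

  • Upstream load testing, a traffic promotion, or crawlers caused a traffic surge, and the recommendation service, lacking proper rate limiting, was overwhelmed and became unavailable;

  • The recommendation system relies on an RPC service provided by a business system for filtering. A change to that RPC service slowed its responses, and because the recommendation service did not distinguish strong dependencies from weak ones, the whole service timed out;

  • One business line, on a tight schedule with too short a test cycle, caused exceptions in other business lines after going live.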

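Combining these incidents with the stability factors summarized above, we can see that, apart from hardware failures, which are beyond our control, nearly all of them are caused by change. So how can we improve stability without being hurt by change? As mentioned above, the most important and most effective measure is controlling the change process through standardized testing, gray release, and publishing procedures. Beyond that, technical measures also help, such as performance optimization, service governance, business isolation, distinguishing strong and weak dependencies, multi-datacenter disaster recovery, and capacity expansion.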
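One of the technical measures listed above, distinguishing strong from weak dependencies, can be sketched as follows. This is a minimal illustration using only the JDK's `CompletableFuture` (Java 9 or later); the `callWeak` helper, the `slowBlocklist` method, and the timeout values are hypothetical and not part of Yimi's code or any SOFAStack API.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class WeakDependencyDemo {
    // Calls a weak (optional) dependency with a time budget. On timeout or
    // error the caller gets the fallback instead of failing the request.
    // Note: the underlying task is not cancelled; it keeps running on the
    // common pool and its late result is discarded.
    static <T> T callWeak(Supplier<T> remoteCall, long timeoutMillis, T fallback) {
        return CompletableFuture.supplyAsync(remoteCall)
                .completeOnTimeout(fallback, timeoutMillis, TimeUnit.MILLISECONDS)
                .exceptionally(ex -> fallback)
                .join();
    }

    // Hypothetical stand-in for a filtering RPC that has become slow.
    static List<String> slowBlocklist() {
        try {
            Thread.sleep(500);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return List.of("course-b");
    }

    public static void main(String[] args) {
        // The blocklist is treated as a weak dependency: if it cannot answer
        // within 50 ms, recommendation proceeds with an empty blocklist.
        List<String> blocked = callWeak(WeakDependencyDemo::slowBlocklist, 50, List.<String>of());
        System.out.println("blocked items: " + blocked);
    }
}
```

A strong dependency would instead propagate the failure, because the request cannot be answered correctly without it. Deciding which category each call belongs to is a product decision as much as a technical one.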

Based on the above analysis of development efficiency and stability, we initially set the following transformation goals:

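  • Scenario modularization

    • Slim down the system and split it into modules to improve maintainability

    • Reuse modules to improve development efficiency

  • Isolation between modules at development time

    • Each module is iterated independently, solving the code conflicts of the previous unified iteration

    • Each module is tested independently, improving testing efficiency

  • Isolation between modules at runtime

    • Class isolation between modules at runtime, solving package conflicts between modules

    • Clear service boundaries between modules, giving a degree of fault isolation

  • Dynamically pluggable modules

    • Dynamic upgrades, with release and rollback in seconds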
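To make the last goal more concrete, the sketch below shows one simple way to think about pluggable modules with fast rollback: keep the previously active implementation next to the current one so that rolling back is a single swap. This is plain Java for illustration only; the `PluggableModules` class and its methods are hypothetical and do not represent SOFAArk's actual mechanism.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class PluggableModules {
    /** One slot per module name: the active handler plus the one it replaced. */
    private static final class Slot {
        final Function<String, String> active;
        final Function<String, String> previous;
        Slot(Function<String, String> active, Function<String, String> previous) {
            this.active = active;
            this.previous = previous;
        }
    }

    private final Map<String, Slot> slots = new HashMap<>();

    /** Installs or upgrades a module; the replaced handler is kept for rollback. */
    public synchronized void install(String name, Function<String, String> handler) {
        Slot old = slots.get(name);
        slots.put(name, new Slot(handler, old == null ? null : old.active));
    }

    /** Restores the previous handler; returns false if there is nothing to roll back to. */
    public synchronized boolean rollback(String name) {
        Slot current = slots.get(name);
        if (current == null || current.previous == null) {
            return false;
        }
        slots.put(name, new Slot(current.previous, null));
        return true;
    }

    /** Routes a request to whichever handler is currently active. */
    public synchronized String invoke(String name, String input) {
        Slot slot = slots.get(name);
        if (slot == null) {
            throw new IllegalArgumentException("unknown module: " + name);
        }
        return slot.active.apply(input);
    }

    public static void main(String[] args) {
        PluggableModules host = new PluggableModules();
        host.install("homepage", in -> "v1:" + in);
        host.install("homepage", in -> "v2:" + in);   // upgrade without restarting the host
        System.out.println(host.invoke("homepage", "user-1")); // v2:user-1
        host.rollback("homepage");                    // roll back to v1
        System.out.println(host.invoke("homepage", "user-1")); // v1:user-1
    }
}
```

A real container also has to load and unload the module's classes and resources, which is where class isolation comes in; this sketch covers only the routing side.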

| Transformation

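To meet these goals, we initially identified three options:

1) Implement dynamic loading with ServiceLoader and a custom SPI;

2) Implement it with a custom Classloader;

3) Look for open source software support.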

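Considering resource and time costs, we chose to look for open source support. Ant Financial's open-sourcing of its distributed architecture caught our attention, and after a technical evaluation we decided to use SOFAArk, the component development framework open-sourced by the SOFAStack community.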

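SOFAArk defines a relatively simple class loading model, a special packaging format, a unified programming interface, an event mechanism, an easily extensible plugin mechanism, and more, and so provides a fairly standardized approach to plugin-based and component-based development. For more details, see the official documentation: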

  • SOFA JVM service:

    https://www.sofastack.tech/sofa-boot/docs/sofa-ark-ark-jvm

  • SOFAArk official documentation:

    https://www.sofastack.tech/sofa-boot/docs/sofa-ark-readme

  • SOFAArk source code:

    https://github.com/sofastack/sofa-ark

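With the combination of SOFAArk and SOFABoot, we split the application into a host application, a data module, and business modules:

  • Host application: maintains the state of the whole container;

  • Data module: handles data communication, including basic services such as Redis, DB, and RPC;

  • Business modules: only need to call the data module to implement business logic; the final data is exchanged with the outside world through the host application.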
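The division of responsibilities above can be sketched in plain Java as follows. All type names here (`UserProfileStore`, `InMemoryProfileStore`, `HomepageScenario`) are hypothetical, and an in-memory map stands in for Redis, DB, or RPC access. In an actual SOFAArk deployment the modules would be packaged separately and wired through the container rather than constructed by hand as they are here.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ModuleLayoutSketch {
    /** Contract published by the data module; business modules see only this interface. */
    interface UserProfileStore {
        List<String> interestsOf(String userId);
    }

    /** Data module: owns all storage access (an in-memory map stands in for Redis/DB/RPC). */
    static class InMemoryProfileStore implements UserProfileStore {
        private final Map<String, List<String>> data =
                Map.of("user-1", List.of("math", "english"));

        @Override
        public List<String> interestsOf(String userId) {
            return data.getOrDefault(userId, List.of());
        }
    }

    /** Business module: scenario logic only, with no knowledge of storage details. */
    static class HomepageScenario {
        private final UserProfileStore store;

        HomepageScenario(UserProfileStore store) {
            this.store = store;
        }

        List<String> recommend(String userId) {
            return store.interestsOf(userId).stream()
                    .map(topic -> topic + "-course")
                    .collect(Collectors.toList());
        }
    }

    /** Host application: wires the modules together and is the only entry point for callers. */
    static List<String> handleRequest(String userId) {
        UserProfileStore store = new InMemoryProfileStore();      // data module
        HomepageScenario scenario = new HomepageScenario(store);  // business module
        return scenario.recommend(userId);
    }

    public static void main(String[] args) {
        System.out.println(handleRequest("user-1")); // [math-course, english-course]
    }
}
```

Because the business module depends only on the interface, the storage behind it can be replaced or migrated without touching scenario code.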


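We established a modular development, testing, and release process that lets modules be developed on top of the business middle platform, and we defined a set of standards for splitting and developing modules: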

  • The lower-level a module is, the more stable and the more reusable it should be;

  • Do not let stable modules depend on unstable ones; keep dependencies to a minimum;

  • Make modules more reusable and self-contained;

  • Avoid coupling between business modules as far as possible.

Before the data module transformation, each business team accessed data in its own way, the storage used by the algorithm team was complex, and the data volume was very large. We often ran into scaling up, scaling down, data migration, and similar problems, which made algorithm development and operations painful. After modularization, we funneled all data-layer access through a single exit: all underlying storage is controlled by the data module. For an upgrade, a capacity change, or a migration, we only need to release a new version of the data module, and the business modules need no changes at all. This saves algorithm developers a great deal of cost and frees up most of their resources, and keeping one unified data module stable is also comparatively simple.

| Future planning

The benefits of modularization, platformization, and automation are obvious and well understood. Standardized, quickly extensible scenarios greatly speed up business iteration, greatly cut the cost of onboarding new businesses, and lower the cost of innovating in new scenarios, which lets the system keep pace with fast-growing business. But have we thought about the problems that platformization brings? Here are a few:

  • Problems in the platform are amplified;

  • Once the system becomes a platform, most of its details are understood only by the platform's developers. This magnifies human-factor problems on the one hand, and challenges users' habits on the other;

  • Putting DevOps into practice is difficult.

So a full platform transformation is clearly not easy. We will keep iterating in the future, for example by moving the operation and maintenance of the whole system onto the platform as well, to build a complete recommendation middle-platform service.

| Summary and Acknowledgments

At present, most of our recommendation scenarios have been split into modules and have been running stably in production for several months. We would like to thank Ant Financial for open-sourcing SOFAStack, and we are also very grateful to Shanshi, the maintainer of the open source SOFAArk project. Throughout our adoption, the SOFAStack team provided efficient and professional support.

Want to use it too? SOFAArk:

https://github.com/sofastack/sofa-ark


Origin blog.csdn.net/SOFAStack/article/details/91921595