Asynchronous Transformation and Governance Practice for a BFF-Layer Aggregation Query Service

First of all, I would like to thank Mr. Wang Xiao for his article [Summary of Common Solutions for Interface Optimization]. It so happens that I have recently been optimizing the governance of the aggregation query service at the BFF layer of our wealth-management business, so, focusing on the serial and parallel chapters of that article, I would like to share some hands-on experience. It mainly covers the move from synchronous to asynchronous calls, the problems that full asynchronization brings with it, and the thinking and improvements around governance.

I hope that sharing these experiences will inspire and help you in your own work. If you have any questions or suggestions, please feel free to raise them. Thank you for your attention and support!



1. Problem background


Different wealth-management products (funds, brokerage products, insurance, bank wealth management, etc.) are recommended through different distribution channels, and each channel or user group sees different products and feature data. To let channels integrate quickly, the BFF layer aggregates and distributes all of this data, which means the BFF aggregation depends on a large number of underlying atomic services. The main problem is therefore how to guarantee TP99 and availability in a scenario that relies on so many upstream interfaces.

Case:

Take the typical product-recommendation interface as an example. It relies on the local product-pool cache, the algorithmic recommendation service, the basic product-information service, the position query service, the crowd-label service, the coupon configuration service, the available-coupon service, and other data services ServN… Most of the upstream atomic interfaces have only limited support for single batch queries, so in the extreme case one call to the recommendation interface returns 1 to n products, and if each product needs 10 dynamic attributes bound to it, at least (1~n)*10 io calls are initiated.

The process and its problems before the transformation:

Process:

Problems:

  • First, the logical flow is tightly coupled, and many upstream and downstream services are synchronously and strongly dependent on one another;

  • Second, the call chain is long, so when one upstream service becomes unstable it easily causes the whole chain to fail.

The process after the transformation and the goals achieved:

Process:

Goals:

  • The transformation goal is very clear: rework the existing logic and raise the proportion of weak dependencies as much as possible. First, this makes asynchronous pre-loading easier; second, weak dependencies can be degraded and removed, which lays the foundation for downgrade operations and reduces whole-chain failures caused by jitter in any single dependency.

New problems after the initial transformation [the focus of this article]:

  • Logically, the decoupling itself is relatively simple. It is little more than pre-computing parameters or loading data redundantly, and is not discussed further here;

  • Technically, the asynchronous logic in the early stage of the transformation was mostly implemented by annotating methods with @Async("tpXXX"), which is also the fastest way to get it done (a minimal sketch of this style follows the list below), but it brings the following problems, mostly related to governance:

  1. As the project and the team keep iterating, @Async annotations end up scattered everywhere;
  2. When people are unfamiliar with other modules, nobody can tell whether different thread pools may be shared, so most of them declare new ones, and thread-pool resources proliferate;
  3. Some unreasonable call patterns lead to deeply nested @Async calls or annotations that have no effect;
  4. The downgrade mechanism has a lot of repetitive code, and all kinds of downgrade switches have to be declared manually again and again;
  5. There is no unified request-level caching mechanism, even though jsf already provides a certain degree of support;
  6. Context propagation across thread pools is a problem;
  7. There is no unified monitoring and alerting on thread-pool status, so the state of each pool cannot be observed while the system is actually running, and thread-pool parameters end up being set by guesswork each time.


2. The overall transformation path


Entry point:

Since most projects already encapsulate io calls in a separate layer, such as com.xx.package.xxx.client, this layer is used as the entry point for the focused transformation and governance work.

Final goals:

The implementation and its usage should be simple, friendly to transforming old code, and keep the transformation cost as low as possible:

  1. Abstract an io-call template that unifies the encapsulation conventions of the io call layer, standardizes the declaration of the enhanced attributes an io call needs, and provides default configurations for things such as thread-pool assignment, timeouts, caching, circuit breaking and degradation;
  2. Optimize @Async usage: all asynchronous io operations are pulled down into the io call layer, the callback mechanism is implemented in the template layer, and old code only needs to inherit the template to get asynchronous callbacks;
  3. Implement request-level caching, with r2m supported by default;
  4. Support request-level circuit breaking and degradation, so that the service can govern itself to some degree when upstream services fail;
  5. Manage thread pools centrally, with support for automatically passing MDC parameters along with the context;
  6. Automatic, visual monitoring and alerting of thread-pool status;
  7. Support dynamic settings through the configuration center.

Implementation: 

1. io call abstract template

The main function of the template is standardization and enhancement. Two templates are currently provided: the default template and the cache template. The core idea is to declare most of the behaviors involved in an io operation, such as the thread-pool group and the request group the current service belongs to, and then to enhance the implementation according to the declared attributes.

The main purpose is to provide default declarations at the code level; in day-to-day practice most of them can be configured in code during development.
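
To illustrate the idea, the following is a minimal sketch of what such an abstract template might look like. It is not the project's actual code: the class name, attributes and defaults are assumptions, and the real template also covers caching, fusing and configuration-center overrides.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;

/**
 * Minimal sketch of an io-call abstract template. The attribute set and defaults are
 * assumptions; the real framework routes execution through a delegate / HystrixCommand.
 */
public abstract class AbstractIoCallTemplate<REQ, RES> {

    /** Thread-pool (bulkhead) group this io call runs in; subclasses may override. */
    protected String threadPoolGroup() {
        return "defaultIoPool";
    }

    /** Request group used for circuit-breaker statistics. */
    protected String requestGroup() {
        return getClass().getSimpleName();
    }

    /** Per-call timeout in milliseconds. */
    protected long timeoutMillis() {
        return 500;
    }

    /** The only thing a concrete io wrapper must implement: the actual call. */
    protected abstract RES doCall(REQ request);

    /** Declared downgrade logic, invoked automatically when the call fails. */
    protected RES fallback(REQ request, Throwable cause) {
        return null;
    }

    /**
     * Entry point for callers. Simplified stand-in: the real framework submits through a
     * delegate bound to threadPoolGroup(); here we just run asynchronously and apply the
     * declared fallback on failure.
     */
    public Future<RES> asyncCall(REQ request) {
        return CompletableFuture
                .supplyAsync(() -> doCall(request))
                .exceptionally(t -> fallback(request, t));
    }
}
```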

2. Delegation

This delegate is the bridge for the whole execution flow. After an io wrapper inherits the abstract template, the template creates a delegate (proxy) instance that is mainly used to enhance the io wrapper, for example handling pre-invocation and post-invocation logic and automatically invoking the declared downgrade method when the call fails.

It can be understood as: the template focuses on request behavior, while the delegate focuses on object behavior and composes the enhancements.
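
A rough sketch of such a delegate, reusing the hypothetical AbstractIoCallTemplate from the previous sketch (the hook names are assumptions, and the same package is assumed so the protected methods are visible):

```java
/**
 * Rough sketch of the delegate: it composes pre/post handling and automatic fallback
 * around the template's doCall. The real implementation builds a HystrixCommand
 * instead of calling directly.
 */
public class IoCallDelegate<REQ, RES> {

    private final AbstractIoCallTemplate<REQ, RES> template;

    public IoCallDelegate(AbstractIoCallTemplate<REQ, RES> template) {
        this.template = template;
    }

    public RES execute(REQ request) {
        before(request);                            // e.g. request-level cache lookup, logging
        try {
            RES result = template.doCall(request);  // the actual io call
            after(request, result);                 // e.g. cache write-back, metrics reporting
            return result;
        } catch (Exception e) {
            return template.fallback(request, e);   // declared downgrade method, called automatically
        }
    }

    private void before(REQ request) { /* enhancement hook */ }

    private void after(REQ request, RES result) { /* enhancement hook */ }
}
```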

3. Executor selection

Given the goals above and the wish to reduce the cost of building everything ourselves, we surveyed existing frameworks such as Hystrix, Sentinel and Resilience4j. Since the main requirement is a bulkhead pattern at the thread-pool level, and Hystrix's level of integration is better than Resilience4j's for our purposes, we finally chose Hystrix as the default integration, with Resilience4j as the alternative. This gives us dynamic thread-pool creation and management, circuit breaking and degradation, half-open retries and the other mechanisms, implemented around a HystrixCommand wrapper:
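
The project's actual wrapper is not reproduced here; the sketch below shows one way the template attributes could be mapped onto a HystrixCommand (the class name IoHystrixCommand and the mapping choices are assumptions, and the template sketch's package is assumed to be shared). A caller obtains the asynchronous result with new IoHystrixCommand<>(client, request).queue().

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandKey;
import com.netflix.hystrix.HystrixCommandProperties;
import com.netflix.hystrix.HystrixThreadPoolKey;

/**
 * Sketch: map the template's declared attributes onto a HystrixCommand so that the
 * thread-pool bulkhead, timeout, circuit breaker and fallback come from Hystrix.
 */
public class IoHystrixCommand<REQ, RES> extends HystrixCommand<RES> {

    private final AbstractIoCallTemplate<REQ, RES> template;
    private final REQ request;

    public IoHystrixCommand(AbstractIoCallTemplate<REQ, RES> template, REQ request) {
        super(Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey(template.requestGroup()))
                .andCommandKey(HystrixCommandKey.Factory.asKey(template.getClass().getSimpleName()))
                // Commands sharing the same thread-pool key share one pool (bulkhead isolation).
                .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey(template.threadPoolGroup()))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withExecutionTimeoutInMilliseconds((int) template.timeoutMillis())));
        this.template = template;
        this.request = request;
    }

    @Override
    protected RES run() {
        return template.doCall(request);          // the actual io call, on the Hystrix pool
    }

    @Override
    protected RES getFallback() {
        return template.fallback(request, getExecutionException()); // declared downgrade
    }
}
```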

4. Adapting Hystrix to concrete dynamic configuration

  1. Extend concrete.PropertiesNotifier and register a HystrixPropertiesNotifier listener that caches every configuration-center key starting with hystrix;

  2. Implement HystrixDynamicProperties and register ConcreteHystrixDynamicProperties to replace the default implementation. In the end every Hystrix configuration item is supported; see the Hystrix documentation for the specific usage.
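
The concrete-backed implementation is internal and not shown here. As a rough stand-in for the effect it achieves, the sketch below writes pushed keys into the default Archaius-backed property source that Hystrix reads; this is not the article's approach, but it approximates the same dynamic behavior. onConfigChanged is a hypothetical config-center callback.

```java
import com.netflix.config.ConfigurationManager;

/**
 * Sketch only: forward config-center changes into the property source Hystrix reads,
 * so commands pick the new values up dynamically.
 */
public class HystrixConfigSync {

    /** Hypothetical callback invoked by a config-center listener on every change. */
    public void onConfigChanged(String key, String value) {
        if (key != null && key.startsWith("hystrix.")) {
            // Standard Hystrix keys work unchanged, e.g.
            // hystrix.command.<CommandKey>.execution.isolation.thread.timeoutInMilliseconds
            // hystrix.threadpool.<ThreadPoolKey>.coreSize
            ConfigurationManager.getConfigInstance().setProperty(key, value);
        }
    }
}
```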

5. Thread-pool context propagation for Hystrix

Hystrix already provides the extension point: it is mainly a matter of overriding HystrixConcurrencyStrategy#wrapCallable and stashing the main thread's context before the task is submitted, so that it can be carried into the worker thread.
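
A minimal sketch of that extension point, here propagating the MDC (log context); the caveat about preserving any already-registered strategy is noted in the comments:

```java
import com.netflix.hystrix.strategy.HystrixPlugins;
import com.netflix.hystrix.strategy.concurrency.HystrixConcurrencyStrategy;
import org.slf4j.MDC;

import java.util.Map;
import java.util.concurrent.Callable;

/** Sketch: capture the caller's MDC, restore it in the Hystrix worker thread, clean up after. */
public class MdcHystrixConcurrencyStrategy extends HystrixConcurrencyStrategy {

    @Override
    public <T> Callable<T> wrapCallable(Callable<T> callable) {
        // Runs in the caller thread: snapshot the context to carry over.
        final Map<String, String> parentMdc = MDC.getCopyOfContextMap();
        return () -> {
            // Runs in the Hystrix thread pool: restore, execute, then clean up.
            Map<String, String> previous = MDC.getCopyOfContextMap();
            if (parentMdc != null) {
                MDC.setContextMap(parentMdc);
            }
            try {
                return callable.call();
            } finally {
                if (previous != null) {
                    MDC.setContextMap(previous);
                } else {
                    MDC.clear();
                }
            }
        };
    }

    public static void register() {
        // Note: if other Hystrix plugins/strategies are already registered, they must be
        // read out and re-registered as well; omitted here for brevity.
        HystrixPlugins.getInstance().registerConcurrencyStrategy(new MdcHystrixConcurrencyStrategy());
    }
}
```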

6. Multi-dimensional visual monitoring and alerting for Hystrix, JSF and Spring registered thread pools

This mainly relies on the following three custom components. A status-monitoring handler is registered, and a separate thread is started that periodically (every second) collects all instances implementing the data-reporting template and pushes the status data through the specified channel; PFinder is used for reporting by default:

  • ThreadPoolMonitorHandler defines a thread-status monitoring handler that runs the reporting process periodically;

  • ThreadPoolEndpointMetrics defines the data template to be reported, including the application instance, the thread type (spring, jsf, hystrix…), the thread group of that type, and the core thread-pool parameters;

  • AbstractThreadPoolMetricsPublisher defines the channel (Micrometer, PFinder, UMP…) the monitoring handler depends on when reporting.

For example, the Hystrix status collection ultimately enables status monitoring across dimensions such as data center, group, instance, thread-pool type and name:
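
The actual collector is not reproduced; the following is a simplified sketch of the same idea using Hystrix's built-in HystrixThreadPoolMetrics, with the publisher interface standing in for AbstractThreadPoolMetricsPublisher (the PFinder channel itself is internal and omitted):

```java
import com.netflix.hystrix.HystrixThreadPoolMetrics;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch: once per second, read every Hystrix thread pool's metrics and hand them to a publisher. */
public class HystrixThreadPoolStatusCollector {

    /** Stand-in for the reporting channel (PFinder, Micrometer, UMP, ...). */
    public interface MetricsPublisher {
        void publish(String poolName, int active, int poolSize, int queueSize, long completed);
    }

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final MetricsPublisher publisher;

    public HystrixThreadPoolStatusCollector(MetricsPublisher publisher) {
        this.publisher = publisher;
    }

    public void start() {
        scheduler.scheduleAtFixedRate(() -> {
            // Hystrix keeps one metrics instance per thread-pool key.
            for (HystrixThreadPoolMetrics metrics : HystrixThreadPoolMetrics.getInstances()) {
                publisher.publish(
                        metrics.getThreadPoolKey().name(),
                        metrics.getCurrentActiveCount().intValue(),
                        metrics.getCurrentPoolSize().intValue(),
                        metrics.getCurrentQueueSize().intValue(),
                        metrics.getCurrentCompletedTaskCount().longValue());
            }
        }, 1, 1, TimeUnit.SECONDS);
    }
}
```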

  

Actual effect in PFinder: the status can be viewed, and alerted on, by any combination of dimensions.



7. A unified await-Future utility class

Since most calls produce asynchronous results in list form, such as List<Future<T>> or Map<String, Future<T>>, and Hystrix does not currently support returning CompletableFuture, a utility class is provided to make awaiting them uniform and convenient:
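
The project's utility is not shown; here is a minimal sketch of the list-awaiting part under an overall time budget (the skip-on-failure behavior is an assumption about how degradation is handled):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Sketch: await a list of futures within one overall deadline, collecting what completed in time. */
public final class FutureAwaitUtil {

    private FutureAwaitUtil() {
    }

    public static <T> List<T> awaitAll(List<Future<T>> futures, long timeoutMillis) {
        List<T> results = new ArrayList<>(futures.size());
        long deadline = System.currentTimeMillis() + timeoutMillis;
        for (Future<T> future : futures) {
            long remaining = Math.max(0, deadline - System.currentTimeMillis());
            try {
                results.add(future.get(remaining, TimeUnit.MILLISECONDS));
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt(); // restore the interrupt flag and stop waiting
                break;
            } catch (Exception e) {
                // Timed-out or failed entries are skipped here; the io layer's fallback
                // (or the caller) decides how to degrade.
            }
        }
        return results;
    }
}
```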

8. Other small features

  1. Besides SGM traceId support, a custom traceId implementation is built in. It mainly works around the fact that printing the SGM traceId in child threads requires manually adding the monitored method in the console, and it also provides a trace id for environments without SGM, which makes log tracing easier;

  2. For JSF calls, a JSF filter propagates the request id across applications between upstream and downstream;

  3. A JSF filter is added by default to print logs, with dynamic logging switches for both provider and consumer, so JSF logging can be turned on and off online at any time and logger.isDebugEnabled() no longer has to be repeated in the client layer;

  4. The proxy layer automatically reports the io call method, fallback and other information to UMP, making monitoring and alerting easier.

Daily usage examples:

1. The simplest io call wrapper

Simply adding the inheritance is enough to support asynchronous callbacks; if the thread-pool group is not overridden, the default group is used.
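
A sketch of what such a wrapper could look like, built on the hypothetical AbstractIoCallTemplate from the earlier sketch (ProductBaseInfoClient and its types are invented names):

```java
/** Sketch: the simplest wrapper only implements the actual call; async support is inherited. */
public class ProductBaseInfoClient extends AbstractIoCallTemplate<String, ProductBaseInfo> {

    @Override
    protected ProductBaseInfo doCall(String skuId) {
        // The only code the wrapper really has to provide: the synchronous upstream call.
        return queryFromUpstream(skuId);
    }

    private ProductBaseInfo queryFromUpstream(String skuId) {
        return new ProductBaseInfo(); // placeholder for the real RPC
    }

    // Usage: Future<ProductBaseInfo> f = new ProductBaseInfoClient().asyncCall("sku-1");
}

class ProductBaseInfo { }
```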

2. An io call wrapper that supports request-level circuit breaking

The circuit-breaker level supported by default is the service level. An old service only needs to have its original request-parameter class implement the FallbackRequest interface, which prevents one particular parameter from tripping the circuit breaker for the whole interface.
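
The FallbackRequest interface is internal, so its exact shape is unknown; the sketch below only illustrates the idea that the request itself supplies the key used to partition circuit-breaker statistics, and every name in it is an assumption:

```java
/** Assumed contract: the request contributes the key used for breaker statistics. */
interface FallbackRequest {
    String circuitKey();
}

class CouponQueryRequest implements FallbackRequest {

    private final String activityId;

    CouponQueryRequest(String activityId) {
        this.activityId = activityId;
    }

    @Override
    public String circuitKey() {
        // Statistics are partitioned per activityId, so one pathological activity only
        // trips its own breaker instead of the whole coupon-query interface.
        return "couponQuery-" + activityId;
    }
}
```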

3. An io call wrapper with request-level caching, interface-level circuit breaking and degradation, and an independent thread pool

4. Upper-layer calls and the actual effect

  1. A product list is turned directly into asynchronous attribute-binding tasks;

  2. The utility class is used to await the List<Future<T>>;

  3. Thread-pool management, circuit breaking, degradation and caching enhancements all happen without the upper layer being aware of them, and based on the visual thread-pool status in PFinder, the thread pools, timeouts and circuit-breaker parameters can be adjusted in real time through concrete;

  4. For example, if an interface frequently times out at 500 ms, configuration alone can short-circuit it and return the degraded result, or lower the timeout to 100 ms so the circuit breaker trips quickly. By default the breaker opens when at least 20 requests arrive within 10 s and 50% of them fail, and a half-open retry is attempted every 5 s.
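
Putting the earlier sketches together, the upper-layer call pattern and the configuration-only tuning could look roughly like this (all class names are the hypothetical ones introduced above; the listed keys are standard Hystrix configuration items):

```java
import java.util.List;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

/** Sketch: the upper layer only maps the list to futures and awaits them. */
public class RecommendService {

    private final ProductBaseInfoClient productClient = new ProductBaseInfoClient();

    public List<ProductBaseInfo> bindBaseInfo(List<String> skuIds) {
        // 1. Turn the product list into asynchronous attribute-binding tasks.
        List<Future<ProductBaseInfo>> futures = skuIds.stream()
                .map(productClient::asyncCall)
                .collect(Collectors.toList());

        // 2. Await the List<Future<T>> with an overall 200 ms budget; slow or failed
        //    entries are degraded instead of failing the whole interface.
        return FutureAwaitUtil.awaitAll(futures, 200);
    }
}

// 3./4. Tuning then happens purely in configuration, using standard Hystrix keys, e.g.:
//   hystrix.command.ProductBaseInfoClient.execution.isolation.thread.timeoutInMilliseconds=100
//   hystrix.command.ProductBaseInfoClient.circuitBreaker.forceOpen=true   (short-circuit and degrade)
//   hystrix.command.ProductBaseInfoClient.circuitBreaker.requestVolumeThreshold=20
//   hystrix.command.ProductBaseInfoClient.circuitBreaker.errorThresholdPercentage=50
//   hystrix.command.ProductBaseInfoClient.circuitBreaker.sleepWindowInMilliseconds=5000
```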



3. Final thoughts


This article is mainly about how to rely on the capabilities of existing frameworks and environments to implement the relevant governance standards systematically at the code level.

Finally, I will again quote the ending of Mr. Wang Xiao's article to close:

Thinking about how interface performance problems come into being: I believe the efficiency problems of many interfaces are not formed overnight. During requirement iterations, in order to ship features quickly, code is simply piled on top of existing code, and that is what produces the interface performance problems described above. Changing the way we think, looking at problems one level higher and developing requirements from the perspective of the interface designer avoids many of these problems, and it is also an effective way to reduce costs and increase efficiency.

That is all; let us keep encouraging each other!

-end-

This article is shared from the WeChat public account JD Cloud Developers (JDT_Developers).
