[Reprint] Two Years of Flink Migration: From Standalone to on YARN, with a Fivefold Increase in Processing Capacity


https://segmentfault.com/a/1190000020209179

 


I. Background and Pain Points

Until the first half of 2017, TalkingData's two products, Game Analytics and App Analytics, ran their streaming pipelines on a self-developed framework, td-etl-framework. The framework reduced the complexity of developing streaming tasks: a new task only had to implement a changer chain. It also supported horizontal scaling with acceptable performance, and for a time it satisfied our business needs.

However, between the end of 2016 and the first half of 2017, we found that the framework had the following important limitations:

  1. Performance risk: the App Analytics-etl-adaptor and Game Analytics-etl-adaptor modules repeatedly hit serious performance problems (Full GC) during holidays, which delayed metric computation;
  2. Weak fault tolerance: the framework depends on offsets saved in Kafka or ZK and can only guarantee at-least-once; exactly-once requires relying on additional services and storage, and an abnormal restart can lead to miscounted results;
  3. Insufficient expressiveness: the framework cannot fully express a DAG; for complex requirements, several streaming services built on the framework have to be combined and depend on one another to solve the problem.

Both TalkingData products mainly provide data analysis services to mobile apps and games. As business volume has kept growing in recent years, we needed a streaming engine with stronger performance and richer functionality to upgrade our streaming services. We started evaluating candidates at the end of 2016, mainly choosing among Flink, Heron, and Spark Streaming.

In the end, we chose Flink, based on the following considerations:

  1. Flink has sound fault tolerance and supports exactly-once semantics;
  2. Flink comes with a rich set of built-in streaming operators, custom operators are easy to write, and stream splits and joins can be expressed directly through the API, so a complete DAG can be expressed;
  3. Flink manages memory on its own rather than relying entirely on the JVM, which to some extent avoids the Full GC problems that some services hit with the current etl-framework;
  4. Flink's window mechanism can address GA metric problems such as the distribution of game session length and play counts within a single day;
  5. Flink's design philosophy was the most advanced among streaming frameworks at the time: it treats batch as a special case of streaming and ultimately unifies batch and stream processing;

II. Evolution Path

2.1 Standalone cluster (1.1.3 -> 1.1.5 -> 1.3.2)

We started with a standalone cluster deployment. Beginning in the first half of 2017, we gradually migrated some low-traffic Game Analytics etl-jobs to Flink. By April, the etl-jobs that receive data from every SDK version of that product had been fully migrated to Flink and consolidated into one job, forming the following data flow and stream graph:

Figure 1. Data flow graph after Game Analytics-etl-adaptor was migrated to Flink

Figure 2. Stream graph of Game Analytics-etl

In the data flow diagram above, the flink-job calls etl-service through Dubbo, so that the logic for accessing external storage is abstracted into etl-service. The flink-job does not need to consider the complexity of the storage access logic or keep its own cache, which both turns that logic into a shared service and reduces the job's own GC pressure.

In addition, we built our own monitoring service: Flink 1.1.3 exposed very few monitoring metrics, and because its Kafka connector used the Kafka 0.8 low-level API, consumer offsets were not committed to ZK or Kafka. We therefore needed our own monitor to track each Flink job's instantaneous consumption rate, backlog, and other consumer metrics, and to hook into the company's OWL monitoring and alerting system.

By this time, Flink's standalone cluster had taken over all of Game Analytics' traffic, processing about 1 billion messages per day with a total daily throughput of roughly 12 TB. By the summer, the average daily log volume had risen to 1.8 billion messages, daily throughput was about 20 TB, and peak TPS reached 30,000.

Along the way, we ran into problems such as uneven consumption across Flink jobs and unbalanced job deployment on the standalone cluster, which caused backlogs in production, as well as the cluster restarting for no apparent reason and jobs failing to come back up after an automatic restart. (We describe these problems and their typical solutions in detail in Section III.)

After making it through that summer, we felt Flink had stood the test, so we started migrating the App Analytics etl-jobs to Flink as well, forming the data flow diagram below:

Figure 3. Data flow graph after the App Analytics-etl-adaptor that processes the standard SDK was migrated to Flink

Figure 4. Stream graph of the App Analytics-etl Flink job

Starting in March 2017, a large number of users began migrating to the unified JSON SDK, and the peak traffic of the new SDK's Kafka topic climbed from 8K/s to 30K/s by the end of the year. At that point, the Flink standalone cluster ran a total of four jobs for the two products, with an average daily throughput of 35 TB.

We then ran into two very serious problems:

1) Jobs in the same standalone cluster compete for resources, and in standalone cluster mode the task slots can only isolate the TaskManager's heap memory at the task level. On top of that, the way Flink deploys jobs in a standalone cluster, mentioned earlier, already leads to uneven resource distribution, so heavy traffic on the App Analytics line would sometimes cause backlogs on the Game Analytics line;

2) The parallelism of our source operators equals the number of partitions of the consumed Kafka topic, while the parallelism of the intermediate ETL operators is usually much larger than the number of Kafka partitions. As a result, the final job graph cannot be fully chained into a single operator chain, and data transfer between operators has to go through the allocation and release of Flink's network buffers. In the 1.1.x versions, under heavy data volume this allocation and release could easily deadlock, so that even though Flink clearly had plenty of messages to process, most threads sat in the waiting state, causing large business delays.

These problems forced us to split the two products' jobs into two standalone clusters and to carry out a major Flink version upgrade, from 1.1.3 (via an interim step on 1.1.5) to 1.3.2. The upgrade to 1.3.2 was completed in Q1 2018; version 1.3.2 introduced incremental checkpointing and brought huge improvements in performance and stability over 1.1.x. After the upgrade, the Flink cluster was basically stable; although problems such as uneven consumption remained, we could generally cope with business growth by adding machines.

2.2 Flink on yarn (1.7.1)

Because the standalone cluster's resource isolation is not great and it also suffers from problems such as unbalanced job deployment, and because Flink on yarn had already become very mature in the community, in Q4 2018 we started planning to migrate our Flink standalone clusters to Flink on yarn. Since recent Flink releases have also improved batch processing considerably, we additionally planned to gradually replace our current batch engine with Flink.

Figure 5. Flink on yarn cluster plan

As shown in Figure 5, the future Flink on yarn cluster will handle both stream and batch computation. Cluster users will build, optimize, and submit stream/batch jobs through a build service; after submission, jobs are dispatched to different yarn queues according to the submitter's business team and the traffic volume of the customers they serve. In addition, the cluster needs a complete monitoring system that collects users' submission records, the traffic and load of each queue, the runtime metrics of every job, and so on, and feeds them into the company's OWL.

Starting in Q1 2019, we migrated part of App Analytics' stream jobs to Flink on yarn 1.7, and before the end of Q2 2019 we finished migrating all of App Analytics' stream jobs that process the unified JSON SDK. The current Flink on yarn cluster handles a peak of 300K messages per second and an average of about 5 billion log entries, roughly 60 TB, per day. After moving to yarn, performance improved thanks to the version upgrade, and resource isolation between jobs is indeed better than on the standalone cluster. After the migration we adopted a Prometheus + Grafana monitoring setup, which makes monitoring more convenient and intuitive.

We will later migrate the Game Analytics Flink jobs and the log-export jobs to this on-yarn cluster as well, which we expect to save about a quarter of our machine resources.

III. Key Problems and Their Solutions

In the course of our Flink practice we stepped into quite a few pitfalls; here we pick a few important ones and explain them briefly.

1. Use static variables sparingly and release resources properly when a job is cancelled

When implementing a Flink operator's function, we can usually extend AbstractRichFunction, which already provides the lifecycle methods open()/close(), so the resources the operator depends on should be initialized and released by overriding these methods. When initializing resources such as a Spring context or Dubbo config, we should as far as possible hold them in singleton objects and initialize them only once per TaskManager; likewise, in the close() method we should release them only once per TaskManager.
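
As an illustration only, here is a minimal sketch of that pattern: a reference-counted holder that initializes a shared resource once per TaskManager (i.e. per JVM) and releases it when the last subtask closes. EtlServiceClient is a hypothetical stand-in for a Spring context or Dubbo reference, not our actual implementation.

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// Minimal sketch: share one heavyweight resource across all parallel subtasks in the
// same TaskManager, and release it exactly once when the last subtask closes.
public class SharedClientMapper extends RichMapFunction<String, String> {

    // Guards the JVM-wide singleton with a reference count.
    private static final class Holder {
        private static EtlServiceClient client;
        private static int refCount = 0;

        static synchronized EtlServiceClient acquire(String zkConnects) {
            if (refCount++ == 0) {
                client = new EtlServiceClient(zkConnects);  // initialized once per TaskManager
            }
            return client;
        }

        static synchronized void release() {
            if (--refCount == 0 && client != null) {
                client.close();                             // released once, on the last close()
                client = null;
            }
        }
    }

    private transient EtlServiceClient client;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        this.client = Holder.acquire("zk1:2181");           // address is illustrative
    }

    @Override
    public String map(String value) throws Exception {
        return client.enrich(value);
    }

    @Override
    public void close() throws Exception {
        Holder.release();   // also runs on job cancel, so nothing lingers in a static field
        super.close();
    }

    // Hypothetical client type, defined only to keep the sketch self-contained.
    private static final class EtlServiceClient {
        EtlServiceClient(String zkConnects) { /* connect ... */ }
        String enrich(String value) { return value; }
        void close() { /* disconnect ... */ }
    }
}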

Static variables should be used with caution; otherwise it is easy for a job cancel to leave the corresponding resources unreleased, which then causes trouble when the job is restarted. To avoid initializing through static variables, we can use org.apache.flink.configuration.Configuration (1.3) or org.apache.flink.api.java.utils.ParameterTool (1.7) to hold our resource configuration, and then store these settings (at job submission time) and retrieve them (at job run time) through the ExecutionEnvironment.

Sample code:

Flink 1.3
Setting and registering the configuration

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Configuration parameters = new Configuration();
parameters.setString("zkConnects", zkConnects);
parameters.setBoolean("debug", debug);
env.getConfig().setGlobalJobParameters(parameters);

Retrieving the configuration (in the operator's open method)

@Override
public void open(Configuration parameters) throws Exception {
    super.open(parameters);
    ExecutionConfig.GlobalJobParameters globalParams = getRuntimeContext().getExecutionConfig().getGlobalJobParameters();
    Configuration globConf = (Configuration) globalParams;
    debug = globConf.getBoolean("debug", false);
    String zks = globConf.getString("zkConnects", "");
    // .. do more ..
}

Flink 1.7
Setting and registering the configuration

ParameterTool parameters = ParameterTool.fromArgs(args);

// set up the execution environment
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.getConfig().setGlobalJobParameters(parameters);

Retrieving the configuration

public static final class Tokenizer extends RichFlatMapFunction<String, Tuple2<String, Integer>> {

    @Override
    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
        ParameterTool parameters =
                (ParameterTool) getRuntimeContext().getExecutionConfig().getGlobalJobParameters();
        parameters.getRequired("input");
        // .. do more ..
    }
}

2. NetworkBuffer and operator chain

As mentioned earlier, when the upstream and downstream tasks (more precisely, their subtasks) of a Flink job are located on different TaskManager nodes (that is, the upstream and downstream operators are not chained together and the corresponding subtasks run on different TaskManagers), passing data between the operators requires allocating and releasing network buffers and transferring the data over network I/O.

Briefly, the process works as follows: the results produced by the upstream operator are serialized by a RecordWriter, which then requests a Buffer from the BufferPool and writes the serialized result into it; the Buffer is then added to a ResultSubPartition of the ResultPartition. Buffers in the ResultSubPartition are shipped via Netty to an InputChannel of the downstream operator's InputGate; likewise, before a Buffer can enter the InputChannel, a Buffer has to be requested from the BufferPool of the TaskManager hosting the downstream operator. The RecordReader then reads the Buffer and deserializes the data in it. The BufferPool is finite, and when it is empty the thread running the RecordWriter/RecordReader will wait for a while when requesting a Buffer; for the underlying details see [1], [2].

A brief illustration follows:

Figure 6. Flink's network stack, where RP = ResultPartition, RS = ResultSubPartition, IG = InputGate, and IC = InputChannel.

When using Flink 1.1.x and 1.3.x, if the number of network buffers is not configured generously enough and data throughput grows, we see the following symptoms:

Figure 7. The upstream operator blocked in requestBuffer(), waiting to obtain a network buffer

Figure 8. The downstream operator blocked waiting for new input data

Figure 9. The downstream operator blocked waiting for new input data

Most of the time of our worker threads (the threads running RecordWriter and RecordReader) is spent requesting Buffers from the BufferPool. The CPU utilization then fluctuates violently and the job's consumption rate drops; in 1.1.x the threads could even block for a long time, triggering backpressure across the whole job and causing serious business delays.

In that case we need to use the parallelism of the upstream and downstream operators to work out how many buffers the ResultPartitions and InputGates require, and configure a sufficiently large taskmanager.network.numberOfBuffers.
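
As a rough illustration only: the pre-1.5 Flink documentation suggested a rule of thumb of slots-per-TM² × #TMs × 4 for taskmanager.network.numberOfBuffers. The sketch below simply evaluates that formula for a made-up cluster size; the real value should be derived from your own job graph and parallelism.

// Rough sizing sketch, assuming the old rule of thumb (slots^2 * #TMs * 4) from the
// pre-1.5 Flink docs. The cluster numbers below are invented for illustration.
public class NetworkBufferSizing {
    public static void main(String[] args) {
        int taskManagers = 8;        // TaskManager processes in the cluster
        int slotsPerTm = 8;          // task slots per TaskManager
        int buffersPerChannel = 4;   // rule-of-thumb factor

        long numberOfBuffers = (long) slotsPerTm * slotsPerTm * taskManagers * buffersPerChannel;
        long segmentSize = 32 * 1024; // default buffer (memory segment) size: 32 KB

        System.out.println("taskmanager.network.numberOfBuffers = " + numberOfBuffers);   // 2048
        System.out.println("network memory per TaskManager = "
                + numberOfBuffers * segmentSize / (1024 * 1024) + " MB");                 // 64 MB
    }
}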

Figure 10. The effect of different network buffer settings on CPU utilization

Once a sufficient number of network buffers is configured, the CPU jitter decreases and the job's consumption rate improves.

Starting with Flink 1.5, the network stack introduced credit-based flow control [2], which avoids, to the greatest extent possible, blocking on Buffer requests from the BufferPool; our preliminary tests show that the 1.7 network stack indeed performs better than 1.3.

But this is still not the optimal situation, because passing data between upstream and downstream operators via network buffers inevitably involves serialization/deserialization, and propagating the credit information adds some latency and overhead. This whole process can be avoided by chaining the upstream and downstream operators into a single operator chain.

Therefore, when building the execution graph of a streaming job, we should chain as many operators together as possible. If Kafka resources allow, we can increase the number of Kafka partitions so that the source operator and its downstream operators get chained together; but we should not blindly increase the partitions of a Kafka topic either, and should weigh this against the business volume and machine resources. For more details on operator chaining and task slot tuning, see [4].
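
A minimal sketch of the chaining idea, assuming a Flink 1.7-era setup with the flink-connector-kafka dependency; the topic name, partition count, and the trivial map logic are invented placeholders. Keeping the ETL operator at the same parallelism as the source lets Flink chain them, so records are handed over by method call instead of through network buffers.

import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class ChainingSketch {
    public static void main(String[] args) throws Exception {
        final int kafkaPartitions = 32;  // assumed partition count of the topic

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties kafkaProps = new Properties();
        kafkaProps.setProperty("bootstrap.servers", "kafka:9092");
        kafkaProps.setProperty("group.id", "etl-demo");

        DataStream<String> raw = env
                .addSource(new FlinkKafkaConsumer<>("sdk-topic", new SimpleStringSchema(), kafkaProps))
                .setParallelism(kafkaPartitions);        // one source subtask per partition

        // Same parallelism + forward connection -> Flink chains this map with the source.
        raw.map(new MapFunction<String, String>() {
                @Override
                public String map(String value) {
                    return value.trim();                 // stand-in for the real ETL logic
                }
            })
            .setParallelism(kafkaPartitions)
            .print()
            .setParallelism(1);                          // different parallelism: only this edge goes over the network

        env.execute("chaining-sketch");
    }
}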

3. Recommendations on choosing serializers in Flink

As we saw in the previous section, data transfer between Flink tasks on different nodes has to go through serialization/deserialization, so serialization is another important factor in Flink's performance. Flink has its own type system, i.e. its own type description classes (TypeInformation). Flink wants to know as much as possible about the types of the data flowing in and out of operators and to describe them with TypeInformation, mainly for two reasons:

  1. The more type information Flink has, the better serialization strategy it can choose and the more efficiently it can use memory;
  2. TypeInformation encapsulates its own serializer, obtainable via createSerializer(), so users no longer need to worry about how to use a serialization framework (for example, how to register their custom types with it), although such customization and registration can improve performance.

Overall, Flink recommends that the data passed between operators be of POJO types. For POJOs, Flink uses its own PojoSerializer by default; for data types that Flink cannot describe or infer on its own, Flink treats them as GenericType and serializes them with Kryo. Flink handles POJOs more efficiently, and POJO types also make stream operations such as grouping/joining/aggregating easier, because data fields in the stream can be accessed directly, e.g. dataSet.keyBy("username").
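
For illustration, a minimal sketch of a type that meets Flink's POJO rules (public class, public no-arg constructor, accessible fields) and is then keyed by field name; the UserEvent class and its fields are invented for the example.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PojoKeyBySketch {

    // Satisfies the POJO rules, so Flink uses PojoSerializer instead of falling back to Kryo.
    public static class UserEvent {
        public String username;
        public long duration;

        public UserEvent() {}

        public UserEvent(String username, long duration) {
            this.username = username;
            this.duration = duration;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<UserEvent> events = env.fromElements(
                new UserEvent("alice", 30L),
                new UserEvent("bob", 45L));

        // Because UserEvent is a recognized POJO, we can key and aggregate directly by field name.
        events.keyBy("username")
              .sum("duration")
              .print();

        env.execute("pojo-keyby-sketch");
    }
}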

Beyond this, we can optimize further:
1) Explicitly call the returns method to give Flink a type hint:
dataStream.flatMap(new MyOperator()).returns(MyClass.class)
The returns method eventually calls TypeExtractor.createTypeInfo(typeClass) to build the TypeInformation for our custom type. When building the TypeInformation, if our type satisfies the POJO rules or the rules of other basic Flink types, createTypeInfo will as far as possible "translate" it into a type Flink knows well, such as a POJO type or another basic type, so that Flink can apply its more efficient serialization.

//org.apache.flink.api.java.typeutils.PojoTypeInfo

@Override
@PublicEvolving
@SuppressWarnings("unchecked") public TypeSerializer<T> createSerializer(ExecutionConfig config) { if (config.isForceKryoEnabled()) { return new KryoSerializer<>(getTypeClass(), config); } if (config.isForceAvroEnabled()) { return AvroUtils.getAvroUtils().createAvroSerializer(getTypeClass()); } return createPojoSerializer(config); } 

For types that Flink cannot "translate", it returns a GenericTypeInfo and serializes them with Kryo:

//org.apache.flink.api.java.typeutils.TypeExtractor

@SuppressWarnings({ "unchecked", "rawtypes" })
private <OUT, IN1, IN2> TypeInformation<OUT> privateGetForClass(
        Class<OUT> clazz, ArrayList<Type> typeHierarchy,
        ParameterizedType parameterizedType, TypeInformation<IN1> in1Type, TypeInformation<IN2> in2Type) {

    checkNotNull(clazz);

    // Tries to convert clazz into PrimitiveArrayTypeInfo, BasicArrayTypeInfo, ObjectArrayTypeInfo,
    // BasicTypeInfo, PojoTypeInfo, etc. (the concrete source is omitted here)
    // ...

    // If none of the above succeeds, return a generic type
    return new GenericTypeInfo<OUT>(clazz);
}

2) Register subtypes: register our data classes, their subclasses, and the types of their fields via the registerType(clazz) method of the StreamExecutionEnvironment or ExecutionEnvironment instance. The more Flink knows about the types, the better the performance (a minimal sketch follows after this list);

3) For further optimization, Flink also allows users to register their own custom serializers and to create the TypeInformation for their types by hand; see the Flink website for details: [3];
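
A minimal sketch of point 2): the event hierarchy below is hypothetical and only serves to show where registerType fits; point 3) is only hinted at in a comment rather than implemented.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TypeRegistrationSketch {

    // Hypothetical event hierarchy used only for illustration.
    public static class BaseEvent { public String id; public BaseEvent() {} }
    public static class GameEvent extends BaseEvent { public long sessionLength; public GameEvent() {} }
    public static class AppEvent  extends BaseEvent { public String pageName;     public AppEvent()  {} }

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Register the data class and its subclasses so Flink can analyze them up front
        // (PojoSerializer where possible) instead of writing full class info at runtime.
        env.registerType(BaseEvent.class);
        env.registerType(GameEvent.class);
        env.registerType(AppEvent.class);

        // For fully custom serialization (point 3), a hand-written serializer can be
        // registered instead, e.g. via env.getConfig().registerTypeWithKryoSerializer(...);
        // see the Flink docs referenced as [3].
    }
}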

In our own practice, we initially passed JsonNode between operators for extensibility, but the performance fell short of expectations, so we replaced JsonNode with types that follow the POJO rules and immediately gained more than 30% in performance on Flink 1.1.x. After we added Flink type hints and called env.getConfig().enableForceAvro(), performance improved further. We kept using these techniques up to the 1.3.x versions.

When upgrading to 1.7.x, the env.getConfig().enableForceAvro() setting caused our code to throw exceptions when validating null fields, so we dropped that setting and tried Kryo serialization instead, registering all subclasses of our types with Flink's ExecutionEnvironment. So far the performance looks acceptable and is better than using Avro on the old version. But the real best practice can only be established by comparing and load-testing KryoSerializer, AvroUtils.getAvroUtils().createAvroSerializer, and PojoSerializer; everyone should pick the serializer that fits their own business scenario and data types.

4. Job deployment and resource isolation/sharing in standalone mode

Based on our experience, when a Flink standalone cluster deploys a concrete job, there is a certain amount of randomness. For example, if the cluster has two 8-core machines for TaskManagers, one TaskManager instance per machine with 8 TaskSlots each, and our job has a parallelism of 12, the following situation can occur:

All slots of the first TaskManager are fully occupied, while the second TaskManager uses only half of its resources! The resources are badly unbalanced; as the traffic handled by the job grows, the tasks on TM1 are bound to consume more slowly than the tasks on TM2. Suppose business growth then forces us to raise the job's parallelism to 24 and to add two more powerful machines (12 cores), on each of which we deploy a TaskManager with 12 slots. After the expansion, the TaskSlot usage of the cluster may look like this:

The newly added, better-provisioned machines do not take on more tasks, the old machines still carry a heavy burden, and the resources remain essentially unbalanced!

Besides the imbalance caused by the job deployment strategy in standalone cluster mode, there is also the problem of poor resource isolation. Because we usually deploy more than one job in a cluster, and those jobs share the JVM on every machine, they naturally compete for resources. Initially we tried to solve these problems as follows:

  1. Use finer-grained TaskManagers, i.e. deploy multiple instances per machine, each holding fewer slots;
  2. Isolate large business jobs onto separate clusters.

These workarounds increased the number of instances and clusters and thus the maintenance cost, so we decided to migrate to yarn. So far, Flink on yarn's resource allocation and isolation is indeed somewhat better than standalone mode.

IV. Summary and Outlook

Flink was merely a spark in 2016, yet in only two short years it has grown into the hottest stream-processing platform today and shows every sign of unifying batch and stream processing. After two years of practice, Flink has proven that it can handle the stream-processing needs of TalkingData's App Analytics and Game Analytics products. Next, we will migrate more complex business logic and our batch processing to Flink, unify the cluster deployment and technology stack, and finally realize the Flink on yarn cluster plan in Figure 5, supporting a larger business volume at lower cost.

References:
[1] https://cwiki.apache.org/conf...
[2] https://flink.apache.org/2019...
[3] https://ci.apache.org/project...
[4] https://mp.weixin.qq.com/s/XR...

About the author:
Xiao Qiang: senior engineer at TalkingData, technical lead for the TalkingData App Analytics and Game Analytics statistical analysis products. He graduated from the Beijing University of Aeronautics and Astronautics and mainly works on building big data platforms, with some research into distributed computing, storage, and stream processing.


Origin www.cnblogs.com/jinanxiaolaohu/p/11876627.html