Flink in Practice (2): Flink Cluster Resource Planning

Foreword

  This article is translated from "How to size your Apache Flink cluster", Robert Metzger's cluster-planning talk at Flink Forward Berlin 2017. The original mainly considers network resources; I have supplemented the parts it omits based on my own experience, and additions in the comments are very welcome.

  This is not a literal translation. The original is here: https://www.ververica.com/blog/how-to-size-your-apache-flink-cluster-general-guidelines

  Where appropriate, the original English wording is kept alongside the translation. If anything is expressed poorly, please leave a comment.


  1. Key parameters and resources

  To estimate the resources a Flink cluster requires, we first need to derive the cluster's minimum resource requirement (baseline) from the metrics of the Flink task.

  1.1 Metrics:

    1) the number of records per second and the size of each record;

    2) the number of distinct keys and the state size per key;

    3) the state update rate and the state access pattern.

  You also need to consider the SLA (Service Level Agreement): for example, how much downtime you are willing to accept, and the maximum acceptable latency or minimum throughput. Such SLAs affect the size of the Flink cluster.

  1.2 Resources

    When planning a Flink cluster we need to consider the cluster's resources. What exactly do "resources" refer to here? Generally the following:

    1) Network capacity. When considering network capacity, also take into account other services that may share the network, such as Kafka and HDFS;

    2) Disk bandwidth. This matters when fault tolerance is disk-based, e.g. RocksDB or HDFS; Kafka may also need to be considered here, since it persists its data to disk;

    3) The number of nodes, and the CPU and memory each node can provide.


  2. An example

  An example Flink topology is as follows:

   In this example, messages are consumed from Kafka, keyed by user id (keyBy on userId), aggregated by a window operator (a sliding window of size 5 min with a 1 min slide), and the processed messages are written back to Kafka.

  2.1 Task metrics

  The average size of a record consumed from Kafka is 2KB, the throughput is 1,000,000 records/s, and there are 500 million distinct userIds (5 × 10^8). The key metrics of the task are as follows:

   2.2 Hardware

  1) 5 nodes, each running one TaskManager; 2) 10 Gigabit Ethernet; 3) disks attached over the network (in this example the cluster is deployed in the cloud; physical machines need additional consideration). In addition, Kafka runs as a separate cluster. Figure 2 is as follows:

  Each node has 16 cores. For simplicity, this article does not consider CPU and memory requirements. In real production, memory should be sized according to the task logic and the fault-tolerance approach. In this example state is stored via RocksDB, so the memory requirement is relatively small.

   2.3 Single-node resource requirements

    To simplify the analysis, we consider the resource requirements of a single node; the overall demand of the cluster can then roughly be obtained by multiplying by the number of nodes. In this example every operator has the same parallelism and there are no special scheduling constraints, so every node runs all operators of the task stream, i.e. each node has a Kafka source, window, and Kafka sink operator, as shown in Figure 3:

  To make the resource computation easier to follow, the figure above shows keyBy as a separate operator, but in practice keyBy is an attribute of the connection between the Kafka source operator and the window operator. Below we analyze the network resource requirements in Figure 3 from top to bottom.

  2.3.1  Kafka Source

  To compute how much data a single Kafka source receives, we first compute the aggregate rate read from Kafka, as follows:

  1) 1,000,000 records per second, each 2KB in size, gives a total data rate of:

    2KB×1,000,000/s=2GB/s

  2) each node in the Flink cluster therefore receives data at:

    2GB/s÷5=400MB/s
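These back-of-the-envelope numbers can be sanity-checked with a few lines of Python (a sketch; the record size, throughput, and node count are the example's figures, using decimal units as the article's round numbers do):

```python
record_kb = 2                  # average record size consumed from Kafka, in KB
records_per_sec = 1_000_000    # aggregate throughput of the job
nodes = 5                      # TaskManager nodes in the cluster

total_mb_per_s = record_kb * records_per_sec / 1000  # aggregate rate read from Kafka
per_node_mb_per_s = total_mb_per_s / nodes           # share handled by one Kafka source

print(total_mb_per_s)     # 2000.0 -> 2 GB/s in total
print(per_node_mb_per_s)  # 400.0  -> 400 MB/s per node
```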

  2.3.2 The shuffle (keyBy)

  After the keyBy, all data with the same userId will be on one node, but Kafka may be partitioned according to a different partitioning scheme. So of the 400MB/s a node reads from Kafka, only 400MB/s ÷ 5 = 80MB/s is on average destined for that node itself; the remaining 320MB/s must be obtained through the shuffle.
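The shuffle share can be sketched the same way (assuming, as the example implicitly does, that userIds are uniformly distributed, so on average 1/5 of each node's input already belongs to it):

```python
nodes = 5
kafka_in_mb_per_s = 400  # MB/s each node reads from its Kafka partitions

stays_local = kafka_in_mb_per_s / nodes        # data whose key hashes to this very node
shuffle_out = kafka_in_mb_per_s - stays_local  # sent to the other four nodes
shuffle_in = shuffle_out                       # received from them, by symmetry

print(stays_local)  # 80.0 MB/s
print(shuffle_out)  # 320.0 MB/s
```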

  2.3.3 Window emit and Kafka sink

    How much data will the window emit, and how much will reach the Kafka sink? The analysis is as follows:

    The window operator aggregates 4 longs per key (userId) and emits once per minute, so each minute the window emits, for every key, 2 int fields (userId, window_ts) and 4 long fields. The total per key is:

    (2 x 4 bytes) + (4 x 8 bytes) = 40 bytes per key

  With 5 nodes, the data volume per node per minute is:

    500,000,000 keys x 40 bytes ÷ 5 = 4GB

  which per second is 4GB/min ÷ 60 = 67MB/s. Since every node runs a Kafka sink and no additional repartitioning is needed, the data flowing from Flink to Kafka is 67MB/s. In practice the operator does not send data at a constant 67MB/s; rather, each minute it makes full use of the available bandwidth for a few seconds.
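The window-emit numbers can be reproduced as follows (a sketch using the example's figures: 2 ints + 4 longs per key, 500 million keys, 5 nodes, one emission per minute):

```python
ints, longs = 2, 4
bytes_per_key = ints * 4 + longs * 8   # 40 bytes emitted per key per minute
keys = 500_000_000
nodes = 5

per_node_gb_per_min = bytes_per_key * keys / nodes / 1_000_000_000
sink_mb_per_s = per_node_gb_per_min * 1000 / 60  # averaged over the minute

print(bytes_per_key)         # 40 bytes per key
print(per_node_gb_per_min)   # 4.0 GB per node per minute
print(round(sink_mb_per_s))  # ~67 MB/s into the Kafka sink
```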

  The total data flows of a single node are summarized as follows:

  • Data in: 720MB/s (400 + 320) per machine
  • Data out: 387MB/s (320 + 67) per machine

  The whole process can be summarized as follows:

    2.3.4  State access and checkpointing

    So far we have only considered the data Flink processes. In fact, we also need to account for the network resources consumed by state storage and by the checkpointing process.

    1) Network bandwidth consumed by state access

    To figure out the state size of the window operator, we need to look at the problem from a different angle. Flink's window is 5 minutes long with a 1-minute slide, and Flink implements this "sliding window" by maintaining five windows. As mentioned in section 2.3.3, Flink maintains 40 bytes of data per key per window. For every arriving event, Flink reads the existing data (40 bytes) from state to update the aggregate, then writes the updated data back to state (on disk), as shown in the figure below:

   This means each node generates 40MB/s of network traffic, computed as:

  40 bytes of state x 5 windows x 200,000 msg/s per machine = 40MB/s
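That state-access figure checks out per direction, i.e. 40MB/s of reads and 40MB/s of writes (a sketch; 200,000 msg/s is one node's share of the 1,000,000 msg/s total):

```python
state_bytes = 40      # bytes read (or written) per window update
windows = 5           # five overlapping panes of the 5-minute sliding window
msgs_per_s = 200_000  # per-node message rate (1,000,000 / 5 nodes)

state_mb_per_s = state_bytes * windows * msgs_per_s / 1_000_000

print(state_mb_per_s)  # 40.0 MB/s in each direction (read and write)
```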

  As mentioned at the beginning, the disks are attached over the network, so the network traffic generated by state reads must also be counted. The overall network picture of a single node is then as follows:

   2) Checkpointing

    The state access described above is triggered by every incoming event. In addition, to be able to recover after a task failure we usually enable checkpointing. In this example checkpoints are triggered periodically, once per minute, and each checkpoint copies the current state over the network to the storage system. The data copied per node per checkpoint is:

  40 bytes of state x 5 windows x 100,000,000 keys = 20GB

  which per second is 20GB ÷ 60 = 333MB/s. Of course, checkpoint data is likewise not sent to the storage system at a steady rate, but at the maximum rate available. Also, since Flink 1.3, RocksDB supports incremental checkpoints; this example does not consider that feature. The overall network consumption of the whole task on a single node is as follows:
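And the checkpoint traffic (a sketch; 100 million keys is one node's share of the 500 million, and a full snapshot is taken every minute):

```python
state_bytes = 40
windows = 5
keys_per_node = 100_000_000  # 500M keys / 5 nodes
interval_s = 60              # checkpoint once per minute

checkpoint_gb = state_bytes * windows * keys_per_node / 1_000_000_000
avg_mb_per_s = checkpoint_gb * 1000 / interval_s

print(checkpoint_gb)        # 20.0 GB per node per checkpoint
print(round(avg_mb_per_s))  # ~333 MB/s averaged over the interval
```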

   The overall network consumption of the cluster is:

    (760 + 760)×5 + (40×2)×5 + (400+67)×5 = 10335 MB/s

  where (760 + 760)×5 is the inbound and outbound traffic of the five nodes, (40×2)×5 is the storage-side cost of the five nodes' state reads and writes, and (400+67)×5 is the cost on the Kafka side of reading and writing (Kafka persists its data to disk).
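Putting the cluster-wide total together (a sketch; the per-node 760MB/s in and out already include state access and checkpoints, while the extra terms count the storage-side and Kafka-side legs of that traffic):

```python
nodes = 5
per_node_in = 400 + 320 + 40        # Kafka + shuffle in + state reads
per_node_out = 320 + 67 + 40 + 333  # shuffle out + sink + state writes + checkpoints

flink_side = (per_node_in + per_node_out) * nodes  # traffic at the Flink nodes
disk_side = (40 * 2) * nodes                       # storage-side state reads/writes
kafka_side = (400 + 67) * nodes                    # Kafka-side reads and writes

total = flink_side + disk_side + kafka_side
print(per_node_in, per_node_out)  # 760 760
print(total)                      # 10335 MB/s cluster-wide
```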

  This is just over half of the available network capacity of the hardware setup above, as shown in the figure below.

   2.3.5 Summary

    In this example, each node's inbound and outbound data rate is 760MB/s, only about 60% of the node's capacity (1250MB/s per node). The remaining ~40% is left to absorb bursts, such as network overhead, data replay during recovery from a checkpoint, or excessive data shuffling between nodes caused by data skew.
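The headroom claim is easy to verify (a sketch; 1250MB/s is the usable rate of one 10 GigE link in the example):

```python
per_node_mb_per_s = 760  # in (and out) per node, including state and checkpoints
link_mb_per_s = 1250     # ~10 Gbit/s Ethernet

utilization = per_node_mb_per_s / link_mb_per_s
print(round(utilization * 100))  # ~61% used, leaving roughly 40% headroom for bursts
```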


 3. Other recommendations

  1) Number of CPUs: the Flink documentation recommends provisioning CPUs in proportion to the number of slots, which in turn relates to the parallelism of the task; in other words, choose the task parallelism with the number of CPUs in mind;

  2) Request as much memory as you can; a minimum can be found by testing on a test cluster and then scaling it up roughly proportionally for the production cluster;

  3) Consider I/O: it is best to keep data disks separate from log disks;

  4) Other points, for example it is best to run the JobManager on a node separate from the TaskManagers.

  


Origin www.cnblogs.com/love-yh/p/11939023.html