Flink Stream Batch Integrated Computing (7): Flink Optimization

Table of contents

Configure memory

Set parallelism

Usage scenario

Specific settings

Additional notes

Configure process parameters

Usage scenario

Specific configuration

Configure Netty network communication

Usage scenario

Specific configuration

Configure memory

Flink computes in memory, so running short of memory during a job severely hurts execution efficiency. Monitor GC (garbage collection) and evaluate how much memory is used and how much remains to decide whether memory has become the performance bottleneck, then tune accordingly.
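As a rough illustration of the kind of figures worth watching, the following sketch reads GC counts and heap usage of the current JVM through the standard java.lang.management API. The class name GcSnapshot and its methods are hypothetical helpers for this example and are not part of Flink; for a real job the GC log of the TaskManager container remains the primary source.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of Flink: reads GC and heap figures of the current JVM.
final class GcSnapshot {

    /** Sum of collection counts over all collectors; collectors reporting -1 (undefined) are skipped. */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Used heap divided by committed heap. */
    static double heapUsedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return (double) heap.getUsed() / heap.getCommitted();
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.printf("total collections: %d, heap used: %.1f%%%n",
                totalGcCount(), heapUsedFraction() * 100);
    }
}
```

A collection count that climbs quickly while the used fraction stays high after each collection would suggest that memory is the bottleneck.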

Monitor the GC log of the node process in its YARN container. If Full GC occurs frequently, GC needs to be tuned.

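conf/flink-conf.yaml (the whole value must stay on one line; the CMS flags apply to JDK 8, since CMS was removed in JDK 14)

env.java.opts: -server -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -XX:+HeapDumpOnOutOfMemoryError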

Adjust the ratio of the old generation to the new generation by adding the -XX:NewRatio parameter to the env.java.opts configuration item.

For example, "-XX:NewRatio=2" sets the ratio of the old generation to the new generation to 2:1, so the new generation occupies 1/3 of the heap and the old generation occupies 2/3.
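The arithmetic behind NewRatio can be checked with a few lines of Java. This is only a sketch: the class NewRatioSplit and the 3072 MB heap are assumptions made for the example, and a real JVM rounds generation sizes to its own alignment, so actual values may differ slightly.

```java
// Hypothetical helper illustrating how -XX:NewRatio divides the heap.
final class NewRatioSplit {

    /** New-generation size when old:new = newRatio:1, i.e. heap / (newRatio + 1). */
    static long youngMb(long heapMb, int newRatio) {
        return heapMb / (newRatio + 1);
    }

    /** Old-generation size is whatever the new generation does not take. */
    static long oldMb(long heapMb, int newRatio) {
        return heapMb - youngMb(heapMb, newRatio);
    }

    public static void main(String[] args) {
        long heapMb = 3072; // assumed heap size for illustration
        for (int ratio = 1; ratio <= 3; ratio++) {
            System.out.printf("NewRatio=%d: new=%d MB, old=%d MB%n",
                    ratio, youngMb(heapMb, ratio), oldMb(heapMb, ratio));
        }
    }
}
```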

Set parallelism

Usage scenario

Parallelism determines how many parallel tasks an operator runs as, and therefore how many partitions its data is split into. Tune it so that the number of tasks and the amount of data each task handles match the processing capacity of the machines.

Check CPU and memory usage. If tasks and data are concentrated on a few nodes instead of being spread evenly, increase the parallelism so that the load is distributed more evenly and the computing capacity of the cluster is fully used.

Specific settings

Task parallelism can be specified at the following four levels, listed in descending order of priority. Adjust it according to the available memory and CPU, the data, and the application logic.

  • Operator level

The parallelism of an individual operator, data source, or sink can be specified by calling its setParallelism() method. For example:

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> text = [...]
DataStream<Tuple2<String, Integer>> wordCounts = text
    .flatMap(new LineSplitter())
    .keyBy(0)
    .timeWindow(Time.seconds(5))
    // setParallelism(5) applies only to the operator it is called on, here the windowed sum
    .sum(1).setParallelism(5);
wordCounts.print();
env.execute("Word Count Example");

  • Execution environment level

Flink programs run in an execution environment, which defines a default parallelism for all the operators, data sources, and data sinks it executes.

The default parallelism of the execution environment can be specified by calling the setParallelism() method. For example:

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(3);
DataStream<String> text = [...]
DataStream<Tuple2<String, Integer>> wordCounts = [...]
wordCounts.print();
env.execute("Word Count Example");

  • Client level

Parallelism can be set when the client submits the job to Flink. For the CLI client, specify it with the "-p" parameter. For example:

./bin/flink run -p 10 ../examples/*WordCount-java*.jar

  • System level

At the system level, you can specify the default parallelism of all execution environments by modifying the "parallelism.default" configuration option in the "flink-conf.yaml" file in the Flink client conf directory.
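The precedence among the four levels above can be pictured with a small sketch. The class ParallelismPrecedence and its effective() method are hypothetical and only model the ordering described here; they are not Flink API.

```java
import java.util.OptionalInt;

// Hypothetical model of the four-level precedence; not part of Flink.
final class ParallelismPrecedence {

    /**
     * Returns the first value that is set, checking operator level, then
     * execution environment level, then client level, and finally falling
     * back to the system default (parallelism.default).
     */
    static int effective(OptionalInt operator, OptionalInt environment,
                         OptionalInt client, int systemDefault) {
        if (operator.isPresent()) {
            return operator.getAsInt();
        }
        if (environment.isPresent()) {
            return environment.getAsInt();
        }
        if (client.isPresent()) {
            return client.getAsInt();
        }
        return systemDefault;
    }

    public static void main(String[] args) {
        // Operator set to 5, environment to 3, client "-p 10", system default 1.
        System.out.println(effective(OptionalInt.of(5), OptionalInt.of(3), OptionalInt.of(10), 1));
        // Nothing set in code, only "-p 10" on the command line.
        System.out.println(effective(OptionalInt.empty(), OptionalInt.empty(), OptionalInt.of(10), 1));
    }
}
```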

Additional notes

When developing Flink applications, optimize the partitioning and grouping operations on DataStream.

  • If partitioning causes data skew, consider changing the partitioning scheme.
  • Avoid non-parallel operations. Some DataStream operations, such as windowAll, are not parallelized.
  • Avoid using String as the keyBy key type where possible.
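To judge whether a choice of key is likely to cause skew, it can help to count how a sample of keys would spread over the parallel tasks. The sketch below is a simplification under a stated assumption: it assigns a key to a partition with hashCode modulo parallelism, which is not Flink's actual key-group assignment, and the class KeySkewCheck is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; the partition function is a simplification, not Flink's key-group logic.
final class KeySkewCheck {

    /** Counts how many records land on each partition under hashCode modulo parallelism. */
    static long[] recordsPerPartition(List<String> keys, int parallelism) {
        long[] counts = new long[parallelism];
        for (String key : keys) {
            counts[Math.floorMod(key.hashCode(), parallelism)]++;
        }
        return counts;
    }

    /** Largest partition divided by the mean partition size; 1.0 means perfectly even. */
    static double skewFactor(long[] counts) {
        long max = 0;
        long total = 0;
        for (long c : counts) {
            max = Math.max(max, c);
            total += c;
        }
        return total == 0 ? 1.0 : (double) max * counts.length / total;
    }

    public static void main(String[] args) {
        // Sample in which one hot key dominates the stream.
        List<String> keys = Arrays.asList("hot", "hot", "hot", "hot", "hot", "hot", "a", "b", "c");
        long[] counts = recordsPerPartition(keys, 4);
        System.out.println(Arrays.toString(counts) + " skew=" + skewFactor(counts));
    }
}
```

A skew factor well above 1.0 on representative data would be a hint to reconsider the key or the partitioning scheme.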

Configure process parameters

Usage scenario

In Flink on YARN mode there are two kinds of processes, JobManager and TaskManager. Both play a central role in scheduling and running tasks.

The parameter configuration of JobManager and TaskManager therefore has a large influence on how Flink applications execute. The following settings can be used to tune the performance of a Flink cluster.

specific configuration

  • Configure JobManager Memory

JobManager is responsible for task scheduling and for message communication between the TaskManagers and the ResourceManager (RM). As the number of tasks and their parallelism grow, the JobManager memory needs to grow accordingly.

Set the JobManager memory according to the actual number of tasks.

        With the yarn-session command, add the "-jm MEM" parameter to set the memory.
        With the yarn-cluster command (flink run -m yarn-cluster), add the "-yjm MEM" parameter to set the memory.

  • Configure the number of TaskManagers

Each core of a TaskManager can run one task at a time, so adding TaskManagers increases the number of tasks that can run concurrently. When resources allow, increase the number of TaskManagers to improve throughput.

  • Number of allocated TaskManager slots

The cores of a TaskManager can run several tasks at the same time, which also increases task concurrency. However, all cores share the memory of the TaskManager, so the number of slots must be balanced against the available memory.

        With the yarn-session command, add the "-s NUM" parameter to set the number of slots.
        With the yarn-cluster command, add the "-ys NUM" parameter to set the number of slots.

  • Configure TaskManager memory

TaskManager memory is used mainly for task execution and communication. A very large task may need more resources, in which case the memory can be increased accordingly.

        With the yarn-session command, add the "-tm MEM" parameter to set the memory.
        With the yarn-cluster command, add the "-ytm MEM" parameter to set the memory.
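The trade-off between the number of TaskManagers, slots, and memory can be estimated with simple arithmetic. The sketch below assumes that a job needs as many slots as its parallelism and that TaskManager memory is shared evenly among slots; both are simplifications, and the class SlotPlan and the sample numbers are hypothetical.

```java
// Hypothetical sizing helper; the even memory split per slot is a simplifying assumption.
final class SlotPlan {

    /** Smallest number of TaskManagers whose slots cover the requested parallelism. */
    static int taskManagersFor(int parallelism, int slotsPerTaskManager) {
        // Integer ceiling division.
        return (parallelism + slotsPerTaskManager - 1) / slotsPerTaskManager;
    }

    /** Rough share of TaskManager memory available to each slot. */
    static int memoryPerSlotMb(int taskManagerMemoryMb, int slotsPerTaskManager) {
        return taskManagerMemoryMb / slotsPerTaskManager;
    }

    public static void main(String[] args) {
        int parallelism = 10;            // e.g. submitted with "-p 10"
        int slots = 4;                   // e.g. "-s 4" or "-ys 4"
        int taskManagerMemoryMb = 4096;  // e.g. "-tm 4096" or "-ytm 4096"
        System.out.printf("TaskManagers needed: %d, memory per slot: %d MB%n",
                taskManagersFor(parallelism, slots),
                memoryPerSlotMb(taskManagerMemoryMb, slots));
    }
}
```

Raising the slot count lowers the number of TaskManagers needed but also shrinks the memory share of each slot, which is the balance described above.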

Configure Netty network communication

Usage scenario

Flink communication relies mainly on Netty, so the Netty settings matter a great deal while Flink applications run. The quality of network communication directly determines the speed of data exchange and the efficiency of task execution.

Specific configuration

The following options can be changed in the "conf/flink-conf.yaml" file on the client. The defaults are already close to optimal, so change them with care to avoid degrading performance.

  • taskmanager.network.netty.num-arenas:

The number of Netty arenas. The default is the value of taskmanager.numberOfTaskSlots.

  • taskmanager.network.netty.server.numThreads and taskmanager.network.netty.client.numThreads:

The number of Netty server threads and client threads. The default for both is the value of taskmanager.numberOfTaskSlots.

  • taskmanager.network.netty.client.connectTimeoutSec:

The connection timeout of the TaskManager client. The default is 120 seconds.

  • taskmanager.network.netty.sendReceiveBufferSize:

The size of Netty's send and receive buffers. The default is the system buffer size (cat /proc/sys/net/ipv4/tcp_[rw]mem), which is generally 4 MB.

  • taskmanager.network.netty.transport:

The Netty transport type, either nio or epoll. The default is nio.
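For the sendReceiveBufferSize item above, the system buffer files each hold three byte values on one line: minimum, default, and maximum. A minimal sketch for reading such a line follows; the class TcpMemLine is hypothetical and the sample line is made up for illustration, so check the values on the actual host.

```java
// Hypothetical helper that parses one line of tcp_rmem or tcp_wmem.
final class TcpMemLine {

    /** Parses "min default max" (bytes, whitespace separated) into a three-element array. */
    static long[] parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected 3 fields: " + line);
        }
        return new long[] {
            Long.parseLong(parts[0]), Long.parseLong(parts[1]), Long.parseLong(parts[2])
        };
    }

    public static void main(String[] args) {
        // Sample content; on a real host read /proc/sys/net/ipv4/tcp_rmem instead.
        long[] values = parse("4096\t87380\t4194304");
        System.out.printf("min=%d default=%d max=%d bytes (max = %d MB)%n",
                values[0], values[1], values[2], values[2] / (1024 * 1024));
    }
}
```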

Origin blog.csdn.net/victory0508/article/details/131436357