After replacing a disk in the Hadoop big data cluster, the HDFS balancer is slow (solved)

The symptom, as seen in the balancer output: only 4 blocks are being moved at a time.
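
For context, a balancer run is typically started like this (a sketch; the threshold value is an illustrative assumption, not from the original post):

hdfs balancer -threshold 10

The -threshold option sets how far, in percentage points of used space, a DataNode may deviate from the cluster average before it is considered over- or under-utilized.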

Adjustment parameters:

Increase the relevant configuration parameters and observe the rebalance speed.

Modify the configuration file hdfs-site.xml:
dfs.datanode.balance.max.concurrent.moves=100
dfs.balancer.max-size-to-move=21474836480
dfs.balancer.moverThreads=1300
dfs.balancer.getBlocks.size=4294967296
dfs.datanode.balance.bandwidthPerSec=20971520

The above parameters only take effect after HDFS is restarted.
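
For reference, a minimal sketch of how these entries look in hdfs-site.xml (property names and values taken verbatim from the list above):

<property>
  <name>dfs.datanode.balance.max.concurrent.moves</name>
  <value>100</value>
</property>
<property>
  <name>dfs.balancer.max-size-to-move</name>
  <value>21474836480</value>  <!-- 20 GiB -->
</property>
<property>
  <name>dfs.balancer.moverThreads</name>
  <value>1300</value>
</property>
<property>
  <name>dfs.balancer.getBlocks.size</name>
  <value>4294967296</value>  <!-- 4 GiB -->
</property>
<property>
  <name>dfs.datanode.balance.bandwidthPerSec</name>
  <value>20971520</value>  <!-- 20 MiB/s -->
</property>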
Parameter description:
dfs.datanode.balance.max.concurrent.moves: This parameter defines the maximum number of block moves a DataNode will perform concurrently for the balancer. It limits how many move tasks can run on a node at the same time.

dfs.balancer.max-size-to-move: This parameter defines the maximum amount of data, in bytes, that the balancer will schedule to move between a source and destination DataNode in a single iteration.

dfs.balancer.moverThreads: This parameter specifies the number of threads used by the data balancer. It determines the number of threads performing data movement tasks concurrently.

dfs.balancer.getBlocks.size: This parameter defines the total size, in bytes, of the blocks fetched from the NameNode in each getBlocks call. When the balancer needs to know how blocks are distributed on a DataNode, it requests them in batches of this size.

dfs.datanode.balance.bandwidthPerSec: This parameter defines the maximum bandwidth, in bytes per second, that each DataNode may use for balancing. It limits the network usage of the balancer as it moves data around the cluster.
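
Worth noting: the DataNode bandwidth limit can also be raised at runtime without a restart, using dfsadmin (the value here matches the hdfs-site.xml setting above):

hdfs dfsadmin -setBalancerBandwidth 20971520

This takes effect on all registered DataNodes until their next restart, after which the configured value applies again.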

Command to check the current value of a parameter:

hdfs getconf -confKey dfs.datanode.balance.max.concurrent.moves
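
To check all of the keys tuned above in one pass, a small shell loop works (a sketch; assumes the hdfs client is on the PATH):

for key in dfs.datanode.balance.max.concurrent.moves \
           dfs.balancer.max-size-to-move \
           dfs.balancer.moverThreads \
           dfs.balancer.getBlocks.size \
           dfs.datanode.balance.bandwidthPerSec; do
  printf '%s = ' "$key"
  hdfs getconf -confKey "$key"
done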

NameNode-side parameters that also need adjusting:
dfs.namenode.replication.max-streams: This parameter specifies the maximum number of replication streams a DataNode may run concurrently for ordinary (non-highest-priority) block re-replication. It limits how many block transfers can occur in parallel on a node during the replication process. Its default value is 2.

dfs.namenode.replication.max-streams-hard-limit: This is a hard limit on the number of concurrent replication streams per DataNode. It is stricter than the parameter above in that it applies even to highest-priority replication work and is never exceeded. It must be at least as large as dfs.namenode.replication.max-streams; its default value is 4.

dfs.namenode.replication.work.multiplier.per.iteration: This parameter controls how much replication work the NameNode schedules in each iteration. The total number of block transfers scheduled per iteration is the number of live DataNodes multiplied by this value. For example, with 100 live DataNodes and a multiplier of 2, each iteration schedules up to 200 block transfers. The default value is 2.

dfs.replication.considerLoad: This parameter controls whether DataNode load is taken into account when choosing targets for block replication. The default value is true, meaning replication tasks are assigned with node load in mind. To ignore node load when replicating blocks, set this parameter to false.
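
A minimal hdfs-site.xml sketch for these NameNode-side parameters (the values below are illustrative assumptions, not from the original post; a NameNode restart is required for them to take effect):

<property>
  <name>dfs.namenode.replication.max-streams</name>
  <value>64</value>  <!-- illustrative value -->
</property>
<property>
  <name>dfs.namenode.replication.max-streams-hard-limit</name>
  <value>128</value>  <!-- illustrative value; must be >= max-streams -->
</property>
<property>
  <name>dfs.namenode.replication.work.multiplier.per.iteration</name>
  <value>10</value>  <!-- illustrative value -->
</property>
<property>
  <name>dfs.replication.considerLoad</name>
  <value>false</value>  <!-- ignore DataNode load when placing replicas -->
</property>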


After the adjustments, the balancer runs much faster.


Reprinted from: blog.csdn.net/qq_43688472/article/details/132564377