Hadoop cluster load balancing

    A healthy Hadoop cluster should have its data spread evenly across the nodes, not a situation where one node's disks are full while another node sits below 10% usage. This article briefly introduces how HDFS stores data and how to keep that data evenly distributed across the nodes.

    How HDFS places replicas when local data is uploaded:

        First replica: the cluster first checks whether the uploading host is itself a DataNode. If it is, and it has enough free space, the data is stored on that DataNode (this is the key point). If the uploading host is not a DataNode, a node that is not too heavily loaded and has sufficient disk space is chosen to store the data.

        Second replica: if racks are configured and specified in the configuration file, the second replica is stored on a node with plenty of free disk space in a different rack. If no racks are configured, a random node with sufficient disk space is chosen.

        Third replica: placed following the same mechanism as the second replica (a quick way to check where blocks actually landed is shown below).
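
    If you want to verify replica placement for a particular file, hdfs fsck can list the DataNodes holding each of its blocks. A minimal example follows; the path /user/zcb/test.txt is just a placeholder:

# list each block of the file and the DataNodes that hold it (the path is only an example)
hdfs fsck /user/zcb/test.txt -files -blocks -locations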

    At this point the problem becomes obvious. In a production cluster, to make better use of the hosts, the node used for uploading data is usually also a DataNode. So for day-to-day data ingestion, the disks of that uploading DataNode fill up much faster than the rest. And even if the uploading host is not a DataNode, on a heavily used Hadoop cluster, large imports and data-cleaning jobs will still gradually leave the nodes unevenly loaded.
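    A quick way to see whether this is already happening is the standard dfsadmin report, which prints the capacity and DFS Used% of every DataNode:

# show configured capacity, DFS used and DFS Used% for each DataNode
hdfs dfsadmin -report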

    So the next question is: even if my cluster's data is unbalanced, what is the actual impact?

    As everyone knows, a Hadoop cluster uses MapReduce (MR) for computation. If the cluster data is unbalanced, MR loses data locality: a map task scheduled on host A may need data that only lives on host B, so the data has to be read across the network, which inevitably slows the job down to some extent. For nodes whose disk usage reaches 100%, it can even cause tasks to fail.

    To prevent cluster data from becoming unbalanced and cluster resources from being under-used, Hadoop provides a data-balancing mechanism (hereafter: the balancer). After a balancer task is started, it reads the disk usage of every node, computes each node's deviation from the configured average space utilization, and copies data from nodes that are well above the average to nodes that are below it, deleting the original blocks once the copies complete. Because this copying and transfer consumes disk I/O, its rate is deliberately kept low so the cluster can keep providing normal service. The rate can be set in the hdfs-site.xml configuration file. If the cluster's data does not change much, the default value is fine; if the balancing speed cannot meet your needs, modify the following parameter in the configuration file:

<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>1048576</value>   <!-- 1048576 bytes/s = 1 MB/s -->
</property>
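
    If you would rather not edit hdfs-site.xml and restart the DataNodes, the balancer bandwidth can also be raised at runtime with dfsadmin; the 10 MB/s value below is only an example, and the setting is not persisted across DataNode restarts:

# temporarily allow each DataNode to use up to 10 MB/s for balancing traffic
hdfs dfsadmin -setBalancerBandwidth 10485760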

    Running the data balancer:

    Run the hadoop balancer -threshold 10 command on any node of the cluster to balance the cluster's data. The parameter 10 means the balancer keeps moving data until the disk space utilization of each node differs from the cluster average by no more than 10%.
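    On a large cluster the balancer can run for a long time, so it is usually started in the background and left to write a log. A minimal sketch (the log path is just an example):

# run the balancer in the background and capture its progress output
nohup hadoop balancer -threshold 10 > /home/zcb/log/balancer.log 2>&1 &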

    Setting up a scheduled task:

    Now I know how to balance the data, but I cannot just stare at the cluster every day: it may be balanced today, and new data tomorrow will unbalance it again. So I use a scheduled task, which you can schedule according to how your cluster's data changes. For example, if our company's cluster is idle at weekends, I can set the time to Friday night; or if a large amount of data is imported at the beginning of each month, I can schedule the balancing right after that import finishes. Below is a simple crontab entry that balances the data at 10 pm every Friday:

00 22 * * 5 hadoop balancer -threshold 5 >>/home/zcb/log/balancer_`date +"\%Y\%m\%d"`.log 2>&1
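
    If you prefer, the crontab line can call a small wrapper script instead, which makes it easier to avoid starting a second balancer while a previous run is still going. The script below is a hypothetical sketch (balancer_cron.sh and its paths are not from the original setup):

#!/bin/bash
# balancer_cron.sh - hypothetical wrapper for the scheduled balancer run
LOG=/home/zcb/log/balancer_$(date +"%Y%m%d").log

# skip this run if a balancer process is still active from last time
if pgrep -f "org.apache.hadoop.hdfs.server.balancer.Balancer" > /dev/null; then
    echo "balancer already running, skipping this run" >> "$LOG"
    exit 0
fi

hadoop balancer -threshold 5 >> "$LOG" 2>&1

    The crontab entry then simply becomes: 00 22 * * 5 /home/zcb/bin/balancer_cron.sh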


Typed up entirely by hand; likes and comments are appreciated, thank you!


