Big data cluster tuning

HDFS

Scenario 1:
When a DataNode starts up (powers on), it reports its local data blocks to the NameNode for registration. The number of NameNode handler threads matters here: if it is set too small, DataNodes will time out or be rejected when connecting to the NameNode; if it is set too large, memory overhead grows, wasting resources and potentially causing an out-of-memory error.
Parameter: dfs.namenode.handler.count
The NameNode maintains a worker thread pool that handles concurrent heartbeats from DataNodes and concurrent metadata operations from clients. This parameter sets the size of that pool; the default value is 10. A common rule of thumb is to set it to 20 times the natural logarithm of the cluster size, that is, 20 × ln(N), where N is the number of nodes in the cluster.
# Configuration file: hdfs-site.xml
<property>
    <name>dfs.namenode.handler.count</name>
    <value>10</value>
</property>
The value can be computed with Python. The following example uses an 8-node cluster:
[root@fan102 ~]# python -c 'import math; print(int(math.log(8) * 20))'
41
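The same rule of thumb can be wrapped in a small Python 3 helper. This is a minimal sketch: the function name and the floor at the default of 10 are illustrative assumptions, not official Hadoop guidance.

```python
import math

def recommended_handler_count(cluster_size):
    """Rule-of-thumb NameNode handler pool size: 20 * ln(N),
    never dropping below the Hadoop default of 10."""
    if cluster_size < 1:
        raise ValueError("cluster size must be >= 1")
    return max(10, int(20 * math.log(cluster_size)))

# An 8-node cluster yields the same value as the shell one-liner above.
print(recommended_handler_count(8))   # 41
```

For very small clusters ln(N) is near zero, which is why clamping to the default of 10 is a sensible floor.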

 

YARN

Scenario 1: A cluster of 7 machines ingesting hundreds of millions of records per day, with the pipeline: data source -> Flume -> Kafka -> HDFS -> Hive

Problem: HiveSQL is used mainly for data statistics. There is no data skew, small files have been merged, JVM reuse is enabled, IO is not a bottleneck, and memory utilization is below 50%. Yet queries still run very slowly, and at peak data volumes the entire cluster goes down. Given this situation, what can be optimized?

Analysis: memory is under-utilized. This is usually caused by two YARN settings: the maximum amount of memory a single task may request, and the amount of memory available on a single Hadoop node. Adjusting these two parameters in yarn-site.xml improves the system's memory utilization.

Parameter 1: yarn.nodemanager.resource.memory-mb

Specifies the total amount of physical memory that YARN may use on this node; the default is 8192 (MB). A common rule is to configure 100 GB on a 128 GB server and 50 GB on a 64 GB server. Our servers have 188 GB, so I set this value to 120 GB. Note that YARN does not automatically detect the node's total physical memory, so if the node has less than 8 GB you must lower this value yourself.
Parameter 2: yarn.scheduler.maximum-allocation-mb
Specifies the maximum amount of physical memory a single task may request; the default is 8192 (MB). This value should be sized according to the task's data volume; as a rule of thumb, a 128 MB data split corresponds to 1 GB of memory.
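Putting the two settings together, a yarn-site.xml fragment for the 188 GB servers described above might look like the following sketch. The 120 GB node limit comes from the text; the 8 GB per-task maximum is an illustrative assumption that should be sized against your largest data splits using the 128 MB -> 1 GB rule.

```xml
<!-- Configuration file: yarn-site.xml -->
<property>
    <!-- Total physical memory YARN may use on this node (120 GB = 122880 MB) -->
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>122880</value>
</property>
<property>
    <!-- Maximum physical memory a single task may request (8 GB, assumed) -->
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
</property>
```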

 

Origin blog.csdn.net/shenyuye/article/details/108827842