HDFS
Parameters: dfs.namenode.handler.count
The NameNode maintains a pool of worker threads that handles concurrent heartbeats from DataNodes and concurrent metadata operations from clients. This parameter sets the size of that pool; the default is 10. A common rule of thumb is to set it to 20 times the natural logarithm of the cluster size, i.e. 20 × ln(N), where N is the number of nodes in the cluster.
# Configuration file: hdfs-site.xml
<property>
    <name>dfs.namenode.handler.count</name>
    <value>10</value>
</property>
[root@fan102 ~]# python -c 'import math; print(int(math.log(8) * 20))'
41
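The same 20 × ln(N) rule of thumb can be tabulated for several cluster sizes. A minimal sketch, assuming we never go below the default of 10 (the function name is my own, not a Hadoop API):

```python
import math

def recommended_handler_count(cluster_size):
    """Rule-of-thumb dfs.namenode.handler.count: 20 * ln(N),
    floored at the default value of 10."""
    return max(10, int(math.log(cluster_size) * 20))

# Print suggested values for a few cluster sizes
for n in (3, 8, 20, 100):
    print(n, recommended_handler_count(n))
```

For the 8-node cluster above this reproduces the value 41 computed in the shell session.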
YARN
Scenario 1: 7 machines in total, hundreds of millions of records per day. Pipeline: data source -> Flume -> Kafka -> HDFS -> Hive
Problem: HiveSQL is used mainly for data statistics. There is no data skew, small files have already been merged, JVM reuse is enabled, I/O is not a bottleneck, and memory usage stays below 50%. Yet jobs still run very slowly, and at peak data volume the whole cluster goes down. Given this situation, what can be optimized?
Analysis: memory is under-utilized. This is usually governed by two YARN settings: the maximum memory a single task may request, and the memory available on a single Hadoop node. Adjusting these two parameters in yarn-site.xml can improve system memory utilization.
Parameter 1: yarn.nodemanager.resource.memory-mb
Specifies the total physical memory that YARN may use on this node; the default is 8192 (MB). A common practice is to configure about 100 GB on a 128 GB server and about 50 GB on a 64 GB server. Our servers have 188 GB, so I set this value to 120 GB. Note that YARN does not detect the node's physical memory automatically: if the node has less than 8 GB of RAM, you must lower this value yourself.
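The sizing examples above (128 GB -> ~100 GB, 64 GB -> ~50 GB) amount to reserving headroom for the OS, the DataNode, and other daemons. A minimal sketch; the ~22% reserve fraction is my own assumption fitted to those examples, not a value YARN defines:

```python
def nodemanager_memory_mb(total_ram_gb, reserve_fraction=0.22):
    """Suggested yarn.nodemanager.resource.memory-mb: total RAM minus
    headroom for the OS and other daemons.
    The 22% reserve fraction is a rule of thumb, not a YARN default."""
    return int(total_ram_gb * (1 - reserve_fraction)) * 1024

# A 128 GB server lands near the ~100 GB figure quoted in the text
print(nodemanager_memory_mb(128))
```

Note the text deliberately sets the 188 GB server lower (120 GB) than this fraction would suggest; the headroom you leave depends on what else runs on the node.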
Parameter 2: yarn.scheduler.maximum-allocation-mb
Specifies the maximum physical memory a single task may request; the default is 8192 (MB). This value should be sized to the task's input data: a common rule of thumb is 1 GB of memory per 128 MB data block.
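The 1 GB-per-128 MB rule above can be sketched as a small helper (the names are my own; the 128 MB figure is the block size assumed in the text):

```python
BLOCK_MB = 128           # HDFS block size assumed in the text
MEM_PER_BLOCK_MB = 1024  # 1 GB of memory per 128 MB block (rule of thumb)

def max_allocation_mb(input_size_mb):
    """Suggested yarn.scheduler.maximum-allocation-mb for a task whose
    input is input_size_mb, using the 1 GB per 128 MB block rule."""
    blocks = -(-input_size_mb // BLOCK_MB)  # ceiling division
    return blocks * MEM_PER_BLOCK_MB

# A 1 GB input spans 8 blocks, matching the 8192 MB default
print(max_allocation_mb(1024))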