The principle of cluster parameter configuration is simple: a parameter set explicitly in a configuration file overrides its default; anything left unset takes its default value. The commonly used Hadoop configuration files are core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml. They configure Hadoop's two main components, HDFS and YARN: HDFS is responsible for storage, while YARN is the resource management framework responsible for computation. Some companies deploy separate compute and storage nodes, and others co-locate them, as their requirements dictate. The commonly used parameters of each file are summarized below.
1. The core-site.xml file is Hadoop's core configuration file. It holds cluster-wide settings such as the default filesystem, the ZooKeeper quorum, temporary directories, and I/O and security options, and it is read by every Hadoop daemon and client, not only the NameNode.
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://HadoopHhy</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>zk1:2015,zk2:2015,zk3:2015</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/bigdata/hadoop/tmp</value>
<final>true</final>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
<final>true</final>
</property>
<property>
<name>fs.trash.interval</name>
<value>1440</value>
</property>
<property>
<name>hadoop.security.authorization</name>
<value>true</value>
</property>
</configuration>
Parameter configuration and explanation:
- fs.defaultFS: the default filesystem URI. With HDFS HA, the authority is the logical nameservice ID rather than a single host, and it must match dfs.nameservices in hdfs-site.xml (note that the example values here, HadoopHhy and test, do not match and would need to be unified in a real cluster).
- ha.zookeeper.quorum: the ZooKeeper quorum used by the ZKFailoverController for automatic NameNode failover.
- hadoop.tmp.dir: the base directory for Hadoop's temporary files; many other paths default to locations under it.
- io.file.buffer.size: the buffer size, in bytes, used for I/O operations (131072 bytes = 128 KB).
- fs.trash.interval: how long, in minutes, deleted files are kept in the trash before being purged (1440 minutes = 24 hours; 0 disables the trash).
- hadoop.security.authorization: enables service-level authorization checks.
A property marked <final>true</final> cannot be overridden by later configuration, such as per-job settings.
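All four *-site.xml files share the same <configuration>/<property> format. As an illustration of how that format works (a sketch using only the Python standard library; no Hadoop installation is assumed, and the sample string merely mirrors the listing above):

```python
# Parse Hadoop's *-site.xml property format with the standard library.
import xml.etree.ElementTree as ET

SAMPLE = """
<configuration>
  <property>
    <name>fs.trash.interval</name>
    <value>1440</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
    <final>true</final>
  </property>
</configuration>
"""

def parse_site_xml(text):
    """Return {name: (value, is_final)} for every <property> element."""
    props = {}
    for prop in ET.fromstring(text).iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        # <final>true</final> forbids later overrides (e.g. per-job configs)
        is_final = prop.findtext("final", default="false") == "true"
        props[name] = (value, is_final)
    return props

if __name__ == "__main__":
    props = parse_site_xml(SAMPLE)
    print(props["fs.trash.interval"])    # ('1440', False)
    print(props["io.file.buffer.size"])  # ('131072', True)
```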
2. The hdfs-site.xml file is the core configuration file of HDFS. It configures the HDFS-related properties of the NameNode and DataNode and takes effect on both kinds of node.
<configuration>
<property>
<name>dfs.nameservices</name>
<value>test</value>
</property>
<property>
<name>dfs.ha.namenodes.test</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.test.nn1</name>
<value>host1:9000</value>
</property>
<property>
<name>dfs.namenode.rpc-address.test.nn2</name>
<value>host2:9000</value>
</property>
<property>
<name>dfs.namenode.http-address.test.nn1</name>
<value>host1:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.test.nn2</name>
<value>host2:50070</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://host1:8485;host2:8485;host3:8485/test</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.test</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/bin/true)</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/data/journal</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/data/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/data/datanode</value>
<final>true</final>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.namenode.acls.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.image.transfer.bandwidthPerSec</name>
<value>314572800</value>
</property>
<property>
<name>dfs.image.transfer.timeout</name>
<value>120000</value>
</property>
<property>
<name>dfs.namenode.checkpoint.txns</name>
<value>5000000</value>
</property>
<property>
<name>dfs.namenode.edits.dir</name>
<value>/data/editlog</value>
</property>
<property>
<name>dfs.hosts.exclude</name>
<value>/etc/hadoop/hosts-exclude</value>
</property>
<property>
<name>dfs.datanode.balance.bandwidthPerSec</name>
<value>20971520</value>
</property>
<property>
<name>dfs.namenode.accesstime.precision</name>
<value>0</value>
</property>
<property>
<name>dfs.namenode.decommission.interval</name>
<value>30</value>
</property>
</configuration>
Parameter configuration and explanation:
- dfs.nameservices: the logical name of the HA nameservice; clients address the cluster by this ID.
- dfs.ha.namenodes.test: the IDs of the NameNodes within the nameservice (here nn1 and nn2).
- dfs.namenode.rpc-address.* / dfs.namenode.http-address.*: the RPC and web UI address of each NameNode.
- dfs.namenode.shared.edits.dir: the JournalNode quorum URI through which the two NameNodes share the edit log.
- dfs.client.failover.proxy.provider.test: the class clients use to locate the active NameNode.
- dfs.ha.fencing.methods: the fencing method run during failover; shell(/bin/true) performs no real fencing, which is acceptable only because the Quorum Journal Manager already permits a single writer.
- dfs.journalnode.edits.dir: the local directory in which each JournalNode stores edits.
- dfs.ha.automatic-failover.enabled: enables automatic failover via the ZKFailoverController.
- dfs.block.size: the HDFS block size in bytes (134217728 bytes = 128 MB).
- dfs.replication: the number of replicas kept for each block.
- dfs.name.dir / dfs.data.dir: where the NameNode stores metadata and the DataNode stores blocks. Note that dfs.block.size, dfs.name.dir, and dfs.data.dir are deprecated names; the current equivalents are dfs.blocksize, dfs.namenode.name.dir, and dfs.datanode.data.dir.
- dfs.permissions.enabled / dfs.namenode.acls.enabled: enable permission checking and POSIX-style ACLs.
- dfs.image.transfer.bandwidthPerSec / dfs.image.transfer.timeout: the bandwidth cap (314572800 bytes = 300 MB/s) and timeout (120000 ms = 120 s) for fsimage transfers between NameNodes.
- dfs.namenode.checkpoint.txns: trigger a checkpoint every 5,000,000 transactions.
- dfs.namenode.edits.dir: where the NameNode writes its edit log.
- dfs.hosts.exclude: the file listing DataNodes to be decommissioned.
- dfs.datanode.balance.bandwidthPerSec: the bandwidth each DataNode may use for balancing (20971520 bytes = 20 MB/s).
- dfs.namenode.accesstime.precision: the precision of file access-time updates in milliseconds; 0 disables access-time updates entirely.
- dfs.namenode.decommission.interval: how often, in seconds, the NameNode checks decommission progress.
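Several of the values above are raw bytes or milliseconds. A quick sketch (plain Python arithmetic, no Hadoop dependency) converting them to readable units:

```python
# Convert the raw byte/millisecond values from hdfs-site.xml above
# into human-readable units.

def mib(n_bytes):
    """Bytes to mebibytes."""
    return n_bytes / (1024 * 1024)

assert mib(134217728) == 128.0   # dfs.block.size: 128 MB blocks
assert mib(314572800) == 300.0   # dfs.image.transfer.bandwidthPerSec: 300 MB/s
assert mib(20971520) == 20.0     # dfs.datanode.balance.bandwidthPerSec: 20 MB/s
assert 120000 / 1000 == 120      # dfs.image.transfer.timeout: 120 seconds
print("all conversions check out")
```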
3. The yarn-site.xml file
This file is the core configuration file of the YARN resource management framework; all YARN settings are made in this file.
<configuration>
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>xxx</value>
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>host1</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>host2</value>
</property>
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>host1:2015,host2:2015,host3:2015</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/data/yarn</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/data/logs</value>
</property>
<property>
<name>yarn.resourcemanager.address.rm1</name>
<value>host1:port</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address.rm1</name>
<value>host1:port</value>
<final>true</final>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm1</name>
<value>host1:8088</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address.rm1</name>
<value>host1:port</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address.rm1</name>
<value>host1:port</value>
</property>
<property>
<name>yarn.resourcemanager.ha.admin.address.rm1</name>
<value>host1:port</value>
</property>
<property>
<name>yarn.resourcemanager.address.rm2</name>
<value>host2:port</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address.rm2</name>
<value>host2:port</value>
<final>true</final>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm2</name>
<value>host2:8088</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address.rm2</name>
<value>host2:port</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address.rm2</name>
<value>host2:port</value>
</property>
<property>
<name>yarn.resourcemanager.ha.admin.address.rm2</name>
<value>host2:port</value>
</property>
<property>
<name>yarn.client.failover-proxy-provider</name>
<value>org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider</value>
</property>
<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>false</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>172800</value>
</property>
<property>
<name>yarn.log-aggregation.retain-check-interval-seconds</name>
<value>21600</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>24576</value>
</property>
<property>
<name>yarn.log.server.url</name>
<value>http://host2:port/jobhistory/logs</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>25600</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>6</value>
</property>
<property>
<name>yarn.resourcemanager.nodemanager-connect-retries</name>
<value>10</value>
</property>
</configuration>
Parameter configuration and explanation:
- yarn.resourcemanager.ha.enabled / yarn.resourcemanager.cluster-id / yarn.resourcemanager.ha.rm-ids: enable ResourceManager HA, name the cluster, and list the ResourceManager IDs (rm1 and rm2).
- yarn.resourcemanager.hostname.rm1 / rm2: the host running each ResourceManager.
- yarn.resourcemanager.zk-address: the ZooKeeper quorum used for leader election and state storage.
- yarn.nodemanager.aux-services and the matching *.class properties: register the auxiliary shuffle services on every NodeManager (here the MapReduce ShuffleHandler and the Spark external shuffle service).
- yarn.nodemanager.local-dirs / yarn.nodemanager.log-dirs: the local working and log directories on each NodeManager.
- yarn.resourcemanager.*.address.rm1 / rm2: the per-ResourceManager addresses for client, scheduler, web UI, NodeManager tracker, and admin traffic.
- yarn.client.failover-proxy-provider: the class clients use to locate the active ResourceManager.
- yarn.resourcemanager.recovery.enabled / yarn.resourcemanager.store.class: persist ResourceManager state in ZooKeeper so running applications survive a failover.
- yarn.log-aggregation-enable and the retain settings: whether NodeManager logs are aggregated to HDFS, how long aggregated logs are kept (172800 s = 48 hours), and how often expired logs are checked for deletion (21600 s = 6 hours).
- yarn.scheduler.minimum-allocation-mb / maximum-allocation-mb: the smallest (2048 MB) and largest (24576 MB) container the scheduler will grant.
- yarn.log.server.url: where logs of finished containers are served (the JobHistory server).
- yarn.nodemanager.resource.memory-mb / cpu-vcores: the memory (25600 MB) and virtual cores (6) each NodeManager offers to containers.
- yarn.resourcemanager.nodemanager-connect-retries: the number of retries when connecting to a NodeManager.
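The scheduler settings above are related: a container request is rounded up to a multiple of yarn.scheduler.minimum-allocation-mb, capped at yarn.scheduler.maximum-allocation-mb, and the maximum should not exceed yarn.nodemanager.resource.memory-mb. A small sketch of how the values in this file interact (plain Python arithmetic; no YARN API is assumed):

```python
# Sanity-check the relationships between the YARN memory settings above.
import math

minimum_mb = 2048    # yarn.scheduler.minimum-allocation-mb
maximum_mb = 24576   # yarn.scheduler.maximum-allocation-mb
node_mb    = 25600   # yarn.nodemanager.resource.memory-mb

# The maximum container must fit on a single NodeManager.
assert minimum_mb <= maximum_mb <= node_mb

def granted(request_mb):
    """Memory YARN actually grants for a container request (in MB)."""
    rounded = math.ceil(request_mb / minimum_mb) * minimum_mb
    return min(rounded, maximum_mb)

print(granted(1000))   # 2048  -> rounded up to one minimum allocation
print(granted(3000))   # 4096  -> rounded up to two minimum allocations
print(granted(30000))  # 24576 -> capped at the maximum allocation
print(node_mb // minimum_mb)  # 12 minimum-sized containers fit per node
```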
4. The mapred-site.xml file
This is the MapReduce configuration file. Its typical settings choose the execution framework and configure the JobHistory server, which records the complete information of finished MapReduce jobs in an HDFS directory.
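The file's contents are not listed here; as an illustrative sketch only, a minimal mapred-site.xml that runs MapReduce on YARN and enables JobHistory might look as follows (the host name and HDFS directories are placeholders, not values from this cluster):

```xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>host1:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>host1:19888</value>
</property>
<property>
<name>mapreduce.jobhistory.intermediate-done-dir</name>
<value>/mr-history/tmp</value>
</property>
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>/mr-history/done</value>
</property>
</configuration>
```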