1. Background
The role of HDFS
NameNode: runs only on the master node; it stores filesystem metadata (file names, attributes, and the directory tree).
DataNode: runs only on core nodes; it stores the actual file block data.
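A quick way to confirm which role a node plays is to list the running Hadoop JVMs with jps (this assumes a JDK is installed on the node; the exact process list varies by EMR release):

```shell
# List Java processes on this node: a master node should show NameNode,
# while a core node should show DataNode.
jps
```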
2. Memory configuration
- HADOOP-HDFS memory configuration
- Confirm memory parameter values
The memory parameters are configured in the following file:
/etc/hadoop/conf/hadoop-env.sh
You can also view them directly with the following command; the unit is MB and the default is 1000 MB:
cat /etc/hadoop/conf/hadoop-env.sh | grep -1n HADOOP_HEAPSIZE
HADOOP_HEAPSIZE sets the default heap size for each of the following daemons:
- NameNode
- SecondaryNameNode
- DataNode
- JobTracker
- TaskTracker
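As a sketch, the corresponding lines in hadoop-env.sh look roughly like this (the values are illustrative, not the EMR defaults for any particular instance type):

```shell
# /etc/hadoop/conf/hadoop-env.sh (illustrative values)
export HADOOP_HEAPSIZE=1000              # default daemon heap, in MB
export HADOOP_NAMENODE_HEAPSIZE=4096     # NameNode-specific heap, in MB (EMR)
```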
- NameNode memory configuration
- In AWS EMR, the NameNode heap is configured with [HADOOP_NAMENODE_HEAPSIZE], in MB:
cat /etc/hadoop/conf/hadoop-env.sh | grep HADOOP_NAMENODE_HEAPSIZE
The default memory parameters for each instance type in EMR are listed in the official documentation: [Hadoop daemon configuration settings - Amazon EMR].
- Another configuration method
Other articles configure the NameNode heap via [HADOOP_NAMENODE_OPTS]:
HADOOP_NAMENODE_OPTS=-Xmx4g
This variable holds the JVM options passed to the process at startup, so besides memory it can carry other startup parameters, such as enabling the audit log:
HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS -Dhdfs.audit.logger=INFO,RFAAUDIT"
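Because the variable references itself, each assignment appends new flags to whatever was already set. A minimal sketch of how the expansion behaves:

```shell
# Appending flags to HADOOP_NAMENODE_OPTS preserves the earlier ones
HADOOP_NAMENODE_OPTS="-Xmx4g"
HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS -Dhdfs.audit.logger=INFO,RFAAUDIT"
echo "$HADOOP_NAMENODE_OPTS"
# prints: -Xmx4g -Dhdfs.audit.logger=INFO,RFAAUDIT
```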
3. Audit log configuration
- Configuration file path
The switch that enables the audit log and its related parameters are in the following file:
/etc/hadoop/conf/log4j.properties
- Related parameters
You can use the following command to view the relevant configuration
cat /etc/hadoop/conf/log4j.properties | grep hdfs.audit
Where:
hdfs.audit.logger: how audit events are logged. The default, NullAppender, discards them.
hdfs.audit.log.maxfilesize: the maximum size of each log file; once exceeded, the log rolls over to a new file.
hdfs.audit.log.maxbackupindex: the maximum number of rolled log files to keep; beyond this, the oldest file is overwritten.
The log path is
/mnt/var/log/hadoop-hdfs/hdfs-audit.log
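Each audit entry is a single line of key=value pairs. As an illustration (the sample line below is fabricated, not taken from a real cluster), individual fields can be pulled out with grep:

```shell
# Fabricated sample entry in the key=value layout the HDFS audit log uses
line='2024-01-01 12:00:00,000 INFO FSNamesystem.audit: allowed=true ugi=hadoop (auth:SIMPLE) ip=/10.0.0.1 cmd=listStatus src=/user/hadoop dst=null perm=null'
# Extract the operation name
echo "$line" | grep -o 'cmd=[^ ]*'
# prints: cmd=listStatus
```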
- Configuration parameters
To enable the audit log, change the [hdfs.audit.logger] parameter to the following:
hdfs.audit.logger=INFO,RFAAUDIT
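Put together, the audit-related lines in log4j.properties look roughly like this (the size and count values are illustrative; RFAAUDIT is the rolling-file appender defined elsewhere in the same file):

```properties
# /etc/hadoop/conf/log4j.properties (illustrative values)
hdfs.audit.logger=INFO,RFAAUDIT
hdfs.audit.log.maxfilesize=256MB
hdfs.audit.log.maxbackupindex=20
```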
4. Modifying parameters in JSON format
In EMR, the best way to keep parameters consistent across an entire instance group is to supply a JSON configuration and let EMR reconfigure every instance in the group.
The following is the JSON format given in the official documentation:
[
{
"classification": "hadoop-env",
"properties": {},
"configurations": [
{
"classification": "export",
"properties": {
"HADOOP_NAMENODE_OPTS": "\"$HADOOP_NAMENODE_OPTS -Dhdfs.audit.logger=INFO,RFAAUDIT\"",
"HADOOP_NAMENODE_HEAPSIZE": "xx"
},
"configurations": []
}
]
},
{
"classification": "hadoop-log4j",
"properties": {
"hdfs.audit.log.maxbackupindex": "xx",
"hdfs.audit.log.maxfilesize": "xxMB"
},
"configurations": []
}
]
Where:
- "HADOOP_NAMENODE_OPTS": "\"$HADOOP_NAMENODE_OPTS -Dhdfs.audit.logger=INFO,RFAAUDIT\"" enables the audit log.
- "HADOOP_NAMENODE_HEAPSIZE": "xx" — replace xx with the heap size (in MB) you want to configure.
- "hdfs.audit.log.maxbackupindex": "xx" — replace xx with the number of audit log files to keep.
- "hdfs.audit.log.maxfilesize": "xxMB" — replace xx with the audit log rotation size.
Note that when you modify the cluster configuration this way, EMR automatically restarts the entire instance group after the JSON is applied, which makes the service unavailable for a period of time.
If that downtime would seriously affect the business, this method should only be used at cluster creation time.
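Before submitting the reconfiguration, it is worth validating the JSON locally. A minimal sketch (the file name emr-config.json is an assumption for this example):

```shell
# Write an abbreviated version of the configuration and check that it parses
cat > emr-config.json <<'EOF'
[
  {"classification": "hadoop-env", "properties": {},
   "configurations": [
     {"classification": "export",
      "properties": {"HADOOP_NAMENODE_HEAPSIZE": "4096"},
      "configurations": []}
   ]}
]
EOF
python3 -m json.tool emr-config.json > /dev/null && echo "JSON OK"
# prints: JSON OK
```

The validated file can then be supplied to the reconfiguration request, e.g. via `aws emr modify-instance-groups`; check the current EMR documentation for the exact invocation on your release.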
5. Restart the service
If the configuration was changed manually, you also need to restart the service by hand and then verify that the change took effect.
Refer to the official document "Restart a service in Amazon EMR".
The exact commands differ by EMR release; for recent releases:
systemctl stop hadoop-hdfs-namenode.service
systemctl start hadoop-hdfs-namenode.service
※ Note!
If the node runs the active NameNode, fail over to the standby first.
Forcing a manual failover incorrectly can lead to split-brain (both NameNodes become active).
A safe way to switch active and standby is to restart the zkfc service on the active node:
systemctl restart hadoop-hdfs-zkfc.service
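After restarting zkfc, you can confirm which NameNode is active with hdfs haadmin (nn1, nn2, and mycluster below are illustrative IDs; substitute the ones from your dfs.nameservices and dfs.ha.namenodes.* settings):

```shell
# List the configured NameNode IDs for the nameservice
hdfs getconf -confKey dfs.ha.namenodes.mycluster
# Query each NameNode's HA state; expect one "active" and one "standby"
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
```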