HDFS service configuration parameters in AWS EMR

1. Background

The role of HDFS

NameNode: runs only on the master node and is responsible for storing metadata (file names and attribute information).

DataNode: runs only on core nodes and is responsible for storing the file data itself.
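For example, you can see this division of roles on a running cluster with the following command, run on the master node; it prints the NameNode's view of every DataNode along with its capacity and usage:

hdfs dfsadmin -report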

2. Memory configuration

  1. HADOOP-HDFS memory configuration

    1. Confirm memory parameter values

The memory parameters are configured in the following file:

/etc/hadoop/conf/hadoop-env.sh 

You can also view it directly with the following command; the unit is MB and the default is 1000 MB:

cat /etc/hadoop/conf/hadoop-env.sh | grep -1n HADOOP_HEAPSIZE

The value set here (HADOOP_HEAPSIZE) is the default heap allocated to each of the daemon processes below; a sketch for checking what a running daemon actually received follows the list.

  • NameNode
  • SecondaryNameNode
  • DataNode
  • JobTracker
  • TaskTracker
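
A minimal sketch for checking the heap a running daemon actually received (assumes the daemons run as the hdfs user and that jps from the JDK is on the PATH):

# List HDFS daemon JVMs with their startup flags and extract the -Xmx value
sudo -u hdfs jps -v | grep -E 'NameNode|DataNode' | grep -o '\-Xmx[^ ]*'
# Without jps: Hadoop normally tags each daemon's command line with -Dproc_<name>
ps aux | grep '[p]roc_namenode' | grep -o '\-Xmx[^ ]*'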

  2. NameNode memory configuration

  • In AWS EMR, the NameNode heap is configured via [HADOOP_NAMENODE_HEAPSIZE]; the unit is MB:
cat /etc/hadoop/conf/hadoop-env.sh | grep HADOOP_NAMENODE_HEAPSIZE

The default memory settings that EMR configures for each instance type are listed in the official documentation:

[Hadoop daemon configuration settings - Amazon EMR]

  • Another configuration method

In other articles you may see the heap configured through [HADOOP_NAMENODE_OPTS] instead:

HADOOP_NAMENODE_OPTS=-Xmx4g

This variable holds the JVM options passed to the process when it starts.
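
For example, both forms could appear in hadoop-env.sh as follows (the sizes are placeholders, not EMR defaults); when both set a heap, the JVM honors the last -Xmx on its command line, which is typically the one coming from HADOOP_NAMENODE_OPTS:

# Heap via the EMR-style size variable (value in MB)
export HADOOP_NAMENODE_HEAPSIZE=4096
# Equivalent heap via raw JVM flags, appended to any existing options
export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS -Xmx4g"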

In addition to memory, other startup parameters can be configured, such as enabling audit logs:

HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS -Dhdfs.audit.logger=INFO,RFAAUDIT"

3. Audit log configuration

  1. Configuration file path

Audit logging is enabled, and its related parameters are set, in the following file:

/etc/hadoop/conf/log4j.properties

  • Related parameters

You can use the following command to view the relevant configuration

cat /etc/hadoop/conf/log4j.properties | grep hdfs.audit
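
In a stock Apache Hadoop log4j.properties, the matching section looks roughly like this (the defaults shown are upstream Hadoop's; your EMR release may differ):

hdfs.audit.logger=INFO,NullAppender
hdfs.audit.log.maxfilesize=256MB
hdfs.audit.log.maxbackupindex=20
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender
log4j.appender.RFAAUDIT.File=${hadoop.log.dir}/hdfs-audit.log
log4j.appender.RFAAUDIT.MaxFileSize=${hdfs.audit.log.maxfilesize}
log4j.appender.RFAAUDIT.MaxBackupIndex=${hdfs.audit.log.maxbackupindex}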

Where:

hdfs.audit.logger: how audit events are logged. The default, NullAppender, discards them, so nothing is recorded.

hdfs.audit.log.maxfilesize: the maximum size of each log file. Once it is exceeded, the log rolls over to a new file.

hdfs.audit.log.maxbackupindex: the maximum number of rolled log files to keep. Once it is exceeded, the oldest log file is removed.

The log path is

/mnt/var/log/hadoop-hdfs/hdfs-audit.log

  • Configuration parameters

To enable the audit log, change [hdfs.audit.logger] to the following:

hdfs.audit.logger=INFO,RFAAUDIT
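
Once enabled, each HDFS operation is recorded as one line in the audit log. A representative line (all values illustrative) looks like:

2022-08-12 10:15:03,117 INFO FSNamesystem.audit: allowed=true ugi=hadoop (auth:SIMPLE) ip=/10.0.1.23 cmd=getfileinfo src=/user/hadoop/data dst=null perm=null proto=rpc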

4. Modifying parameters in JSON format

In EMR, if you want an entire instance group to keep consistent parameters, the best approach is to supply the parameters as JSON and let EMR reconfigure every instance in the group.

The following is the officially documented JSON format:

[
    {
        "classification": "hadoop-env",
        "properties": {},
        "configurations": [
            {
                "classification": "export",
                "properties": {
                    "HADOOP_NAMENODE_OPTS": "\"$HADOOP_NAMENODE_OPTS -Dhdfs.audit.logger=INFO,RFAAUDIT\"",
                    "HADOOP_NAMENODE_HEAPSIZE": "xx"
                },
                "configurations": []
            }
        ]
    },
    {
        "classification": "hadoop-log4j",
        "properties": {
            "hdfs.audit.log.maxbackupindex": "xx",
            "hdfs.audit.log.maxfilesize": "xxMB"
        },
        "configurations": []
    }
]

Where:

  • "HADOOP_NAMENODE_OPTS": "\"$HADOOP_NAMENODE_OPTS -Dhdfs.audit.logger=INFO,RFAAUDIT\"" This parameter will enable audit logs
  • "HADOOP_NAMENODE_HEAPSIZE":"xx" In the xx part, enter the memory parameter value you want to configure.
  • "hdfs.audit.log.maxbackupindex": "xx" In the xx part, enter the number of audit logs you want to configure.
  • "hdfs.audit.log.maxfilesize": In the "xxMB" part of xx, enter the audit log rorate size you want to configure.

If you modify the cluster configuration through this method, EMR will automatically restart the entire instance group after the JSON is applied, which makes the service unavailable for a period of time.

This can have a significant impact on production workloads, so unless the cluster has only just been created, avoid configuring it this way.
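
If you can instead apply the settings when the cluster is created, pass the same JSON file to create-cluster; the following is only a sketch, with every name and size a placeholder:

aws emr create-cluster \
    --name "hdfs-tuned-cluster" \
    --release-label emr-5.36.0 \
    --applications Name=Hadoop \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --configurations file://configurations.json \
    --use-default-roles \
    --ec2-attributes KeyName=my-key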

5. Restart the service

If the parameters were configured manually, you also need to restart the service manually and then verify that they took effect.

Refer to the official document "How do I restart a service in Amazon EMR?":

Restart a service in Amazon EMR

Use different commands depending on the EMR release version; on releases that use systemd:

sudo systemctl stop hadoop-hdfs-namenode.service
sudo systemctl start hadoop-hdfs-namenode.service
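
After the restart, you can check that the service is healthy and the new settings took effect; a minimal sketch (the jps check assumes a JDK is installed on the node):

# Confirm the daemon came back up
sudo systemctl status hadoop-hdfs-namenode.service
# Confirm the new heap size was picked up
sudo -u hdfs jps -v | grep NameNode | grep -o '\-Xmx[^ ]*'
# Confirm audit entries are being written (if you enabled the audit log)
tail -n 5 /mnt/var/log/hadoop-hdfs/hdfs-audit.log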

※ Notice!

If the node runs the active NameNode, you need to fail over to the standby first.

Forcing the switch manually may lead to a split-brain state (both NameNodes active).

Instead, you can trigger an active/standby switch by restarting the ZKFC service on the active node:

sudo systemctl restart hadoop-hdfs-zkfc.service
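
To confirm which NameNode is active before and after the switch, you can query the HA state; nn1 and nn2 below are placeholder NameNode IDs, read yours from dfs.ha.namenodes.<nameservice> in hdfs-site.xml:

# Show the HA state (active/standby) of each NameNode
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2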

Origin blog.csdn.net/dwe147/article/details/126300478