Big Data Technology with Hadoop (Optimization & New Features)

Chapter 1 HDFS—Troubleshooting

1.1 Cluster Safe Mode

1) Safe mode: the file system only accepts read requests and rejects write requests such as deletion and modification

2) Scenarios in which the NameNode enters safe mode

  • The NameNode is in safe mode while it loads the fsimage and replays the edit logs

  • The NameNode stays in safe mode while it waits for DataNodes to register and report their blocks

3) Conditions for exiting safe mode

  • dfs.namenode.safemode.min.datanodes: the minimum number of available DataNodes; the default is 0.

  • dfs.namenode.safemode.threshold-pct: the fraction of blocks that have reached the minimum replica count, relative to the total number of blocks in the system; the default is 0.999f.

  • dfs.namenode.safemode.extension: the stabilization time after the thresholds are met; the default is 30000 milliseconds, i.e. 30 seconds.

4) Basic syntax

While the cluster is in safe mode, it cannot perform critical (write) operations. After cluster startup completes, it exits safe mode automatically.

bin/hdfs dfsadmin -safemode get       # check the safe mode status
bin/hdfs dfsadmin -safemode enter     # enter safe mode
bin/hdfs dfsadmin -safemode leave     # leave safe mode
bin/hdfs dfsadmin -safemode wait      # wait until safe mode ends
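
Programmatically, a client can mirror hdfs dfsadmin -safemode wait before writing. The sketch below is a minimal example, assuming the NameNode address hdfs://hadoop102:8020 and a placeholder local file path; it polls the safe-mode state through the DistributedFileSystem API and uploads a file once writes are accepted again.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

public class SafeModeWait {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://hadoop102:8020");
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        // SAFEMODE_GET only queries the state and does not change it;
        // SAFEMODE_ENTER / SAFEMODE_LEAVE correspond to the enter/leave commands above.
        while (dfs.setSafeMode(SafeModeAction.SAFEMODE_GET)) {
            System.out.println("NameNode is still in safe mode, waiting...");
            Thread.sleep(5000);
        }

        // Safe mode is off, so write requests are accepted again.
        dfs.copyFromLocalFile(new Path("/home/atguigu/hello.txt"),  // placeholder local file
                              new Path("/"));
        dfs.close();
    }
}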

1.2 NameNode troubleshooting

1) Requirement

The NameNode process has crashed and its stored data is lost. How can the NameNode be restored?

2) Fault simulation

(1) Kill the NameNode process with kill -9

[atguigu@hadoop102 current]$ kill -9 <NameNode process ID>

(2) Delete the data stored by the NameNode (/opt/module/hadoop-3.1.3/data/dfs/name)

[atguigu@hadoop102 hadoop-3.1.3]$ rm -rf /opt/module/hadoop-3.1.3/data/dfs/name/*

3) Problem solving

(1) Copy the data from the SecondaryNameNode to the original NameNode data storage directory

[atguigu@hadoop102 dfs]$ scp -r atguigu@hadoop104:/opt/module/hadoop-3.1.3/data/dfs/namesecondary/* ./name/

(2) Restart the NameNode

[atguigu@hadoop102 hadoop-3.1.3]$ hdfs --daemon start namenode

(3) Upload a file to the cluster
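
Step (3) can also be done from a Java client to confirm that the recovered NameNode accepts write requests again. A minimal sketch, assuming the NameNode address hdfs://hadoop102:8020 and a placeholder local file path:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadAfterRecovery {
    public static void main(String[] args) throws Exception {
        // connect as the atguigu user used throughout this tutorial
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://hadoop102:8020"), new Configuration(), "atguigu");

        // upload a local file (placeholder path) to the root of HDFS
        fs.copyFromLocalFile(new Path("/home/atguigu/word.txt"), new Path("/"));

        // list the root directory; the uploaded file should appear here
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}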

1.3 Summary

1.3.1 Metadata loss

  1. Delete all the metadata in the NameNode (edits, edits_inprogress, fsimage).

  2. After stopping and restarting the NameNode, it is found that it cannot start.

  3. Emergency recovery (this will not happen in a production environment, because HA is used there): copy the edits and fsimage from the 2NN to the NN.

  4. Only part of the data can be recovered, and the cluster will also stay in safe mode.

If you do not care about that part of the data and just want the cluster back to normal:

hdfs dfsadmin -safemode forceExit

1.3.2 Block loss

  1. When we delete two blocks from the three nodes and then restart the NameNode, it enters safe mode.

  2. To operate normally, exit safe mode.

  3. Restarting the NameNode will still enter safe mode, unless the metadata of the missing blocks is deleted (by command or on the web page); after that the NameNode no longer stays in safe mode.


Chapter 2 HDFS—Multiple Directories

2.1 DataNode multi-directory configuration

1) A DataNode can be configured with multiple directories, and the data stored in each directory is different (the directories are not replicas of one another)

2) The specific configuration is as follows

Add the following content to the hdfs-site.xml file.

<property>
    <name>dfs.datanode.data.dir</name>
    <value>file://${hadoop.tmp.dir}/dfs/data,
           file://${hadoop.tmp.dir}/dfs/data2</value>
</property>

3) View the results

[atguigu@hadoop102 dfs]$ ll

total 12
drwx------. 3 atguigu atguigu 4096 Apr  4 14:22 data
drwx------. 3 atguigu atguigu 4096 Apr  4 14:22 data2
drwxrwxr-x. 3 atguigu atguigu 4096 Dec 11 08:03 name1
drwxrwxr-x. 3 atguigu atguigu 4096 Dec 11 08:03 name2

4) Upload a file to the cluster, then look at the contents of the two folders again; they differ (one contains the data, the other does not)

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -put wcinput/word.txt /

2.2 Cluster data balance: balancing data among disks

In a production environment, disks often need to be added because of insufficient space. When a newly mounted disk has no data, you can run the disk data balancing command (a new feature of Hadoop 3.x).

(1) Generate a balancing plan (our machine has only one disk, so no plan will be generated)

hdfs diskbalancer -plan hadoop102

(2) Execute the balancing plan

hdfs diskbalancer -execute hadoop102.plan.json

(3) View the execution status of the current balance task

hdfs diskbalancer -query hadoop102

(4) Cancel the balancing task

hdfs diskbalancer -cancel hadoop102.plan.json

Chapter 3 HDFS—Cluster Expansion and Reduction

3.1 Commissioning a new server

1) Requirement

With the growth of the company's business, the amount of data is getting larger and larger, and the capacity of the original data nodes can no longer meet the needs of storing data. It is necessary to dynamically add new data nodes on the basis of the original cluster.

2) Environment preparation

(1) Clone a new host, hadoop105, from the hadoop100 host

(2) Modify the IP address and host name

[root@hadoop105 ~]# vim /etc/sysconfig/network-scripts/ifcfg-ens33
[root@hadoop105 ~]# vim /etc/hostname

(3) Copy the /opt/module directory and /etc/profile.d/my_env.sh of hadoop102 to hadoop105

[atguigu@hadoop102 opt]$ scp -r module/* atguigu@hadoop105:/opt/module/

[atguigu@hadoop102 opt]$ sudo scp /etc/profile.d/my_env.sh root@hadoop105:/etc/profile.d/my_env.sh

[atguigu@hadoop105 hadoop-3.1.3]$ source /etc/profile

(4) Delete the historical Hadoop data and log directories on hadoop105

[atguigu@hadoop105 hadoop-3.1.3]$ rm -rf data/ logs/

(5) Configure passwordless SSH login from hadoop102 and hadoop103 to hadoop105

[atguigu@hadoop102 .ssh]$ ssh-copy-id hadoop105
[atguigu@hadoop103 .ssh]$ ssh-copy-id hadoop105

3) Specific steps for commissioning the new node

Start the DataNode directly and it will register with the cluster

[atguigu@hadoop105 hadoop-3.1.3]$ hdfs --daemon start datanode
[atguigu@hadoop105 hadoop-3.1.3]$ yarn --daemon start nodemanager

3.2 Data balance between servers

1) Enterprise experience:

In enterprise development, if jobs are frequently submitted on hadoop102 and hadoop104 and the replication factor is 2, then because of data locality hadoop102 and hadoop104 will hold too much data, while hadoop103 will store only a small amount.

Another situation is that the data volume of the new server is relatively small, and the cluster balance command needs to be executed.

2) Enable the data balance command

[atguigu@hadoop105 hadoop-3.1.3]$ sbin/start-balancer.sh -threshold 10

The parameter 10 means that no node's disk space utilization may differ from the cluster's average utilization by more than 10 percentage points; adjust it according to the actual situation.

3) Stop the data balance command

[atguigu@hadoop105 hadoop-3.1.3]$ sbin/stop-balancer.sh

Note: Since HDFS needs to start a separate RebalanceServer to perform the Rebalance operation, try not to execute start-balancer.sh on the NameNode, but find a relatively idle machine.

3.3 Add whitelist

Whitelist: hosts whose addresses are in the whitelist are allowed to store data.

In the enterprise: Configure a whitelist to prevent malicious access attacks by hackers.

The steps to configure the whitelist are as follows:

1) Create whitelist and blacklist files respectively in the /opt/module/hadoop-3.1.3/etc/hadoop directory of the NameNode node

(1) Create a whitelist

[atguigu@hadoop102 hadoop]$ vim whitelist

# Add the following host names to the whitelist (assuming the nodes currently working in the cluster are hadoop102 and hadoop103):
hadoop102
hadoop103

(2) Create a blacklist

[atguigu@hadoop102 hadoop]$ touch blacklist

Just keep it empty.

2) Add the dfs.hosts and dfs.hosts.exclude configuration parameters to the hdfs-site.xml configuration file

<!-- whitelist -->
<property>
     <name>dfs.hosts</name>
     <value>/opt/module/hadoop-3.1.3/etc/hadoop/whitelist</value>
</property>
 
<!-- blacklist -->
<property>
     <name>dfs.hosts.exclude</name>
     <value>/opt/module/hadoop-3.1.3/etc/hadoop/blacklist</value>
</property>

3) Distribute the configuration files whitelist and hdfs-site.xml

[atguigu@hadoop104 hadoop]$ xsync hdfs-site.xml whitelist

4) The first time a whitelist is added, the cluster must be restarted; after that, you only need to refresh the NameNode

[atguigu@hadoop102 hadoop-3.1.3]$ myhadoop.sh stop
[atguigu@hadoop102 hadoop-3.1.3]$ myhadoop.sh start

5) View the DN on a web browser, http://hadoop102:9870/dfshealth.html#tab-datanode

6) Modify the whitelist a second time, adding hadoop104 and hadoop105

[atguigu@hadoop102 hadoop]$ vim whitelist

Modify it to the following content:
hadoop102
hadoop103
hadoop104
hadoop105

7) Refresh the NameNode

[atguigu@hadoop102 hadoop-3.1.3]$ hdfs dfsadmin -refreshNodes

Refresh nodes successful

8) View the DN on a web browser, http://hadoop102:9870/dfshealth.html#tab-datanode
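
Besides the web page, the DataNodes currently accepted by the NameNode can also be listed from a Java client. A small sketch, assuming the NameNode address hdfs://hadoop102:8020; it reports roughly the same information as hdfs dfsadmin -report:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ListDataNodes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://hadoop102:8020");
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        // each entry is a DataNode that has registered with the NameNode
        for (DatanodeInfo dn : dfs.getDataNodeStats()) {
            System.out.println(dn.getHostName() + " -> " + dn.getAdminState());
        }
        dfs.close();
    }
}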

3.4 Decommissioning servers with the blacklist

Blacklist: hosts whose addresses are in the blacklist are not allowed to store data.

In the enterprise: configure the blacklist to decommission the server.

The blacklist configuration steps are as follows:

1) Edit the blacklist file in the /opt/module/hadoop-3.1.3/etc/hadoop directory

[atguigu@hadoop102 hadoop]$ vim blacklist

Add the following host name (the node to be decommissioned):
hadoop105

Note: If the whitelist was not configured, you need to add the dfs.hosts.exclude configuration parameter to the hdfs-site.xml configuration file

<!-- blacklist -->
<property>
     <name>dfs.hosts.exclude</name>
     <value>/opt/module/hadoop-3.1.3/etc/hadoop/blacklist</value>
</property>

2) Distribute the configuration files blacklist and hdfs-site.xml

[atguigu@hadoop104 hadoop]$ xsync hdfs-site.xml blacklist

3) The first time a blacklist is added, the cluster must be restarted; after that, you only need to refresh the NameNode

[atguigu@hadoop102 hadoop-3.1.3]$ hdfs dfsadmin -refreshNodes

Refresh nodes successful

4) Check the web UI: the status of the decommissioning node is "Decommission In Progress", which indicates that the DataNode is copying its blocks to other nodes

5) Wait until the status of the decommissioning node becomes "Decommissioned" (all blocks have been copied), then stop the DataNode and the NodeManager on that node. Note: if the replication factor is 3 and the number of nodes still in service is less than or equal to 3, decommissioning cannot succeed; you need to lower the replication factor before decommissioning (a sketch of lowering replication follows at the end of this section)

[atguigu@hadoop105 hadoop-3.1.3]$ hdfs --daemon stop datanode
stopping datanode

[atguigu@hadoop105 hadoop-3.1.3]$ yarn --daemon stop nodemanager
stopping nodemanager

6) If the data is unbalanced, you can use commands to rebalance the cluster

[atguigu@hadoop102 hadoop-3.1.3]$ sbin/start-balancer.sh -threshold 10
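
The sketch below illustrates the note in step 5): lowering the replication factor of existing files before decommissioning, equivalent to hdfs dfs -setrep. It is a minimal example with assumed values for the NameNode address, the target path and the new replication factor; it only changes existing files, while newly written files still follow dfs.replication.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class LowerReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://hadoop102:8020");
        FileSystem fs = FileSystem.get(conf);

        short newReplication = 2;  // example value: fewer replicas than live DataNodes
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/"), true);
        while (it.hasNext()) {
            Path file = it.next().getPath();
            fs.setReplication(file, newReplication);  // excess replicas are removed asynchronously
        }
        fs.close();
    }
}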

Chapter 4 Hadoop Enterprise Optimization

4.1 MapReduce optimization method

The MapReduce optimization method mainly considers six aspects: data input, the Map phase, the Reduce phase, IO transmission, data skew, and commonly used tuning parameters.

  1. Data input

  (1) Merge small files: merge small files before running MR tasks. A large number of small files generates a large number of MapTasks, and loading each task takes time, which slows the MR job down.

  (2) Use CombineTextInputFormat as the input format to handle scenarios with many small files on the input side.

  2. Map stage

  (1) Reduce the number of spills: by adjusting the mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent parameters, raise the memory limit that triggers a spill, reduce the number of spills, and thus reduce disk IO.

  (2) Reduce the number of merges: by adjusting the mapreduce.task.io.sort.factor parameter, increase the number of files merged per round and reduce the number of merge passes, thereby shortening the MR processing time.

  (3) After the Map phase, without affecting the business logic, apply a Combiner first to reduce IO.

  3. Reduce stage

  (1) Set the numbers of Map and Reduce tasks reasonably: neither too few nor too many. Too few makes tasks wait and prolongs processing time; too many causes resource contention between Map and Reduce tasks, leading to timeouts and other errors.

  (2) Let Map and Reduce coexist: adjust the mapreduce.job.reduce.slowstart.completedmaps parameter so that Reduce starts running once Map has progressed to a certain point, reducing Reduce's waiting time.

  (3) Avoid using Reduce where possible: Reduce generates a lot of network traffic when it is used to join data sets.

  (4) Set the Reduce-side buffer reasonably: by default, when the data reaches a threshold, the data in the buffer is written to disk, and Reduce then reads all the data back from disk. In other words, the buffer and Reduce are not directly connected, and multiple write-to-disk / read-from-disk round trips happen in between. This can be changed with a parameter so that part of the data in the buffer is fed directly to Reduce, reducing IO overhead: mapreduce.reduce.input.buffer.percent, default 0.0. When the value is greater than 0, the specified proportion of memory is reserved for reading buffer data and feeding it directly to Reduce. Memory is then needed for the buffer, for reading data, and for the Reduce computation itself, so adjust the value according to the job's runtime behavior.

  4. IO transmission

  (1) Use data compression to reduce network IO time. Install the Snappy and LZO compression codecs.

  (2) Use SequenceFile binary files.

  5. Data skew

  (1) Data skew phenomena:

  • Data frequency skew: the amount of data in some partitions is much larger than in others;

  • Data size skew: some records are much larger than the average.

  (2) Ways to reduce data skew:

  • Sampling and range partitioning: partition boundary values can be preset based on a result set obtained by sampling the original data.

  • Custom partitioning: partition based on background knowledge of the output keys. For example, if the Map output keys are words from a book and some of them are highly specialized vocabulary, a custom partitioner can send those specialized words to a fixed subset of Reduce instances and the rest to the remaining instances.

  • Combiner: using a Combiner can greatly reduce data skew. The purpose of the Combiner is to aggregate and condense data wherever possible.

  • Use Map Join and avoid Reduce Join where possible.

  6. Commonly used tuning parameters

1) Resource-related parameters

(1) The following parameters take effect when configured in the user's own MR application (mapred-default.xml); a driver-side sketch showing how such keys can be set per job appears at the end of this section.

mapreduce.map.memory.mb: the upper resource limit (in MB) that a single MapTask can use; default 1024. If a MapTask actually uses more resources than this, it is forcibly killed.

mapreduce.reduce.memory.mb: the upper resource limit (in MB) that a single ReduceTask can use; default 1024. If a ReduceTask actually uses more resources than this, it is forcibly killed.

mapreduce.map.cpu.vcores: the maximum number of CPU cores each MapTask can use; default 1.

mapreduce.reduce.cpu.vcores: the maximum number of CPU cores each ReduceTask can use; default 1.

mapreduce.reduce.shuffle.parallelcopies: the number of parallel copiers each Reduce uses to fetch data from the Maps; default 5.

mapreduce.reduce.shuffle.merge.percent: the percentage of buffer usage at which data starts to be written to disk; default 0.66.

mapreduce.reduce.shuffle.input.buffer.percent: the shuffle buffer size as a percentage of the Reduce task's available memory; default 0.7.

mapreduce.reduce.input.buffer.percent: the proportion of memory used to hold buffer data for direct use by Reduce; default 0.0.

(2) The following parameters must be configured in the server's configuration files before YARN starts in order to take effect (yarn-default.xml)

yarn.scheduler.minimum-allocation-mb: the minimum memory allocated to an application Container; default 1024.

yarn.scheduler.maximum-allocation-mb: the maximum memory allocated to an application Container; default 8192.

yarn.scheduler.minimum-allocation-vcores: the minimum number of CPU cores a Container can request; default 1.

yarn.scheduler.maximum-allocation-vcores: the maximum number of CPU cores a Container can request; default 32.

yarn.nodemanager.resource.memory-mb: the maximum physical memory that can be allocated to Containers on a node; default 8192.

(3) Key parameters for Shuffle performance tuning, which should be configured before YARN starts (mapred-default.xml)

mapreduce.map.maxattempts: the maximum number of retries for each Map Task. Once the retry count exceeds this value, the Map Task is considered failed; default 4.

mapreduce.reduce.maxattempts: the maximum number of retries for each Reduce Task. Once the retry count exceeds this value, the Reduce Task is considered failed; default 4.

mapreduce.task.timeout: the Task timeout, a parameter that often needs to be set. It means: if a Task makes no progress within a certain time (it neither reads new data nor produces output), the Task is considered blocked, possibly stuck forever. To prevent a user program from blocking forever without exiting, this timeout (in milliseconds) is enforced; the default is 600000 (10 minutes). If your program takes a long time to process each input record (for example, it accesses a database or pulls data over the network), increase this parameter. A typical error message when the value is too small is: "AttemptID:attempt_14267829456721_123456_m_000224_0 Timed out after 300 secs Container killed by the ApplicationMaster.".
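
Because the parameters above are ordinary Hadoop configuration keys, they can also be set per job in the driver rather than only in the cluster-wide XML files. The following is a minimal pass-through job sketch with example values only; the identity Mapper/Reducer and the argument-based input/output paths are placeholders, not part of the original material.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TunedJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Map side: larger sort buffer and higher spill threshold -> fewer spills.
        conf.set("mapreduce.task.io.sort.mb", "200");
        conf.set("mapreduce.map.sort.spill.percent", "0.90");
        // Merge more spill files per round -> fewer merge passes.
        conf.set("mapreduce.task.io.sort.factor", "20");
        // Let reducers start shuffling after 70% of the maps have finished.
        conf.set("mapreduce.job.reduce.slowstart.completedmaps", "0.70");
        // Container memory upper limits for this job's tasks (example values).
        conf.set("mapreduce.map.memory.mb", "2048");
        conf.set("mapreduce.reduce.memory.mb", "2048");

        Job job = Job.getInstance(conf, "tuned pass-through job");
        job.setJarByClass(TunedJobDriver.class);
        job.setMapperClass(Mapper.class);    // identity mapper, keeps the sketch minimal
        job.setReducerClass(Reducer.class);  // identity reducer
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}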

4.2 Hadoop small file optimization

4.2.1 Drawbacks of small files in Hadoop

Every file on HDFS requires corresponding metadata on the NameNode, and each piece of metadata is roughly 150 bytes. When there are many small files, a large amount of metadata is created: on the one hand it consumes a lot of NameNode memory, and on the other hand the sheer number of metadata entries slows down address lookups.
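
A rough worked example: 100 million small files correspond to about 100,000,000 × 150 B ≈ 15 GB of NameNode heap for the file metadata alone, before counting the associated block and directory objects.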

With too many small files, an MR computation generates too many input splits and therefore launches too many MapTasks. Each MapTask processes very little data, so its processing time is shorter than its startup time, wasting resources.

4.2.2 Solutions for small files in Hadoop

1) Directions for small file optimization:

(1) At data-collection time, merge small files or small batches of data into large files before uploading them to HDFS.

(2) Before business processing, run a MapReduce program on HDFS to merge the small files.

(3) During MapReduce processing, use CombineTextInputFormat to improve efficiency.

(4) Enable uber mode to achieve JVM reuse.

2) Hadoop Archive

An efficient file-archiving tool that places small files into HDFS blocks; it can pack many small files into a single HAR file, thereby reducing the NameNode's memory usage.

3) CombineTextInputFormat

CombineTextInputFormat generates a single split or a small number of splits from multiple small files during the splitting process (see the driver sketch after the uber-mode configuration below).

4) Enable uber mode to achieve JVM reuse.

By default, each Task starts its own JVM to run in. If the amount of data a Task computes is very small, we can let multiple Tasks of the same Job run in a single JVM instead of starting a JVM for every Task.

To enable uber mode, add the following configuration to mapred-site.xml:

<!-- enable uber mode -->
<property>
    <name>mapreduce.job.ubertask.enable</name>
    <value>true</value>
</property>

<!-- maximum number of MapTasks in uber mode; may only be adjusted downward -->
<property>
    <name>mapreduce.job.ubertask.maxmaps</name>
    <value>9</value>
</property>
<!-- maximum number of ReduceTasks in uber mode; may only be adjusted downward -->
<property>
    <name>mapreduce.job.ubertask.maxreduces</name>
    <value>1</value>
</property>
<!-- maximum amount of input data in uber mode; defaults to the value of dfs.blocksize and may only be adjusted downward -->
<property>
    <name>mapreduce.job.ubertask.maxbytes</name>
    <value></value>
</property>
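
The driver sketch below ties together points 3) and 4) above: CombineTextInputFormat packs many small files into a few splits, and the same mapreduce.job.ubertask.enable key shown in the XML is set programmatically. The split size, paths and identity Mapper/Reducer are example values, not part of the original material.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFileJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.job.ubertask.enable", true);  // same key as the XML above

        Job job = Job.getInstance(conf, "small-file job");
        job.setJarByClass(SmallFileJobDriver.class);
        job.setMapperClass(Mapper.class);    // identity mapper/reducer keep the sketch minimal
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Pack small files into splits of at most 4 MB each instead of one split per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 4 * 1024 * 1024);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}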

Chapter 5 Hadoop Extensions

5.1 Copying data between clusters

1) Use scp to copy files between two remote hosts

scp -r hello.txt root@hadoop103:/user/atguigu/hello.txt              # push
scp -r root@hadoop103:/user/atguigu/hello.txt  hello.txt             # pull
scp -r root@hadoop103:/user/atguigu/hello.txt root@hadoop104:/user/atguigu   # copy between two remote hosts via the local host; usable when SSH is not configured between the two remote hosts
2) Use the distcp command to recursively copy data between two Hadoop clusters

hadoop distcp [file path in cluster 1] [file path in cluster 2]

[atguigu@hadoop102 hadoop-3.1.3]$ 
hadoop distcp hdfs://hadoop102:8020/user/atguigu/hello.txt hdfs://hadoop105:8020/user/atguigu/hello.txt

5.2 Small file archiving

1) Hands-on example

(1) The YARN processes need to be started

[atguigu@hadoop102 hadoop-3.1.3]$ start-yarn.sh

(2) Archive the files

Archive all files in the /user/atguigu/input directory into an archive file named input.har and store the archived file under the /user/atguigu/output path.

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop archive -archiveName input.har -p /user/atguigu/input /user/atguigu/output

(3) View the archive

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -ls har:///user/atguigu/output/input.har

(4) Un-archive the files

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -cp har:///user/atguigu/output/input.har/*    /user/atguigu
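
Step (3) can also be done from a Java client: a HAR archive can be read through the normal FileSystem API simply by using a har:// path, without un-archiving it first. A minimal sketch, assuming the same archive path as above and that the har:/// form is resolved against the configured default file system:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadHar {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://hadoop102:8020");

        Path har = new Path("har:///user/atguigu/output/input.har");
        FileSystem harFs = har.getFileSystem(conf);  // a HarFileSystem handles har:// paths

        // list the entries stored inside the archive, like "hadoop fs -ls har:///..."
        for (FileStatus status : harFs.listStatus(har)) {
            System.out.println(status.getPath());
        }
    }
}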

5.3 Trash

Enabling the trash feature allows deleted files to be restored before they expire, which helps prevent accidental deletion and provides a form of backup.

1) Trash parameter settings and working mechanism

2) Enable the trash

Modify core-site.xml and set the trash retention time to 1 minute.

<property>
    <name>fs.trash.interval</name>
    <value>1</value>
</property>

3) View the trash

The path of the trash directory in the HDFS cluster: /user/atguigu/.Trash/….

4) Files deleted programmatically do not go through the trash; moveToTrash() must be called for them to enter the trash

Configuration conf = new Configuration();

// set the HDFS address
conf.set("fs.defaultFS","hdfs://hadoop102:8020");

// the local client cannot read the cluster's configuration, so the trash settings must be set manually here
conf.set("fs.trash.interval","1");
conf.set("fs.trash.checkpoint.interval","1");

// create a Trash object
Trash trash = new Trash(conf);

// move /input/wc.txt on HDFS to the trash
trash.moveToTrash(new Path("/input/wc.txt"));

5) Files deleted directly through the web page also do not go through the trash.

6) Only files deleted on the command line with the hadoop fs -rm command go through the trash.

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -rm -r /user/atguigu/input

7) Restore data from the trash

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -mv /user/atguigu/.Trash/Current/user/atguigu/input    /user/atguigu/input
