Designing MapReduce programs
By task division: map-only jobs, which have only map tasks, versus jobs that have both map and reduce tasks
Data ETL process
map phase: the splitting stage, where a large task is split into slices
- Data filtering
- Data completion
  For example: deriving the province from an IP address
- Field formatting
On field formatting
time:
dd/MM/yyyy:HH:mm:ss
timestamp:
1522233123 (seconds), i.e. 1522233123000 (milliseconds)
Standard format: yyyy-MM-dd HH:mm:ss
Domain name processing:
Extract the domain name from a URL address
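A sketch of the field-formatting and domain-name steps above, using java.text.SimpleDateFormat and java.net.URL; the sample timestamp and URL are invented for illustration:

```java
import java.net.URL;
import java.text.SimpleDateFormat;

public class FieldFormatDemo {
    public static void main(String[] args) throws Exception {
        // Reformat a log timestamp from dd/MM/yyyy:HH:mm:ss
        // into the standard yyyy-MM-dd HH:mm:ss layout.
        SimpleDateFormat logFormat = new SimpleDateFormat("dd/MM/yyyy:HH:mm:ss");
        SimpleDateFormat stdFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        String formatted = stdFormat.format(logFormat.parse("28/03/2018:10:32:03"));
        System.out.println(formatted); // 2018-03-28 10:32:03

        // Extract the domain name from a URL address.
        String domain = new URL("http://blog.csdn.net/some/page").getHost();
        System.out.println(domain); // blog.csdn.net
    }
}
```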
reduce stage: merge processing (the outputs of the map tasks are merged)
Sqoop is essentially a MapReduce program, but one with only the map task (Sqoop = SQL + Hadoop)
SecondaryNameNode
What does the NameNode do?
Manages the slave nodes (DataNodes)
Handles requests sent by clients
Manages the metadata
Where is the metadata stored???
Stored in HDFS?? That would be an endless loop
To speed up access, the metadata is kept in memory on the machine running the NameNode;
memory reads and writes are fast, so requests can be served faster.
But data held only in memory is not safe: what happens when the machine shuts down?
In conclusion:
(1) Stored in memory, and also persisted on disk in a file; the file is named fsimage
(2) /opt/modules/hadoop-2.7.3/data/tmpData/dfs/name/current
NameNode startup:
Loads the contents of fsimage into memory; subsequent changes in HDFS (file uploads, modifications, deletions) must then be synchronized somehow
SecondaryNameNode function
Assists the NameNode in synchronizing the metadata on local disk:
fsimage(old) + edits = fsimage(new)
edits: this file records the HDFS modification operations; it is very important and must not be lost
The metadata can be reconstructed by parsing the edits log
During the merge, the result is first written to a temporary file (fs.temp); once the merge completes, it is renamed to fsimage and fs.temp is deleted
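The fsimage + edits = fsimage(new) relationship can be illustrated with a toy plain-Java model; the paths and operations below are hypothetical, and the real fsimage and edits are binary files in the NameNode's name directory:

```java
import java.util.*;

// Toy model of fsimage(old) + edits = fsimage(new): fsimage is a snapshot
// of the namespace, and the edits log is replayed on top of it.
public class CheckpointSketch {
    public static void main(String[] args) {
        // fsimage(old): the namespace as of the last checkpoint.
        Set<String> fsimage = new TreeSet<>(Arrays.asList("/a.txt", "/b.txt"));

        // edits: modification operations recorded since that checkpoint.
        String[][] edits = {{"ADD", "/c.txt"}, {"DELETE", "/a.txt"}};

        // Replay the edits to produce fsimage(new).
        for (String[] op : edits) {
            if (op[0].equals("ADD"))    fsimage.add(op[1]);
            if (op[0].equals("DELETE")) fsimage.remove(op[1]);
        }
        System.out.println(fsimage); // [/b.txt, /c.txt]
    }
}
```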
shuffle process
Functions
Partitioning:
Decides which reducer the current key is handed to for processing
Default: the key's hash value modulo the number of reduce tasks
HashPartitioner
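A minimal sketch of that default rule, mirroring the logic of Hadoop's HashPartitioner without the Hadoop dependency:

```java
// Default partitioning rule: the reducer index is the key's hash value,
// with the sign bit masked off, modulo the number of reduce tasks.
public class PartitionSketch {
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // The same key always lands on the same reducer,
        // and the result is always a valid reducer index.
        int p1 = getPartition("hive", 3);
        int p2 = getPartition("hive", 3);
        System.out.println(p1 == p2);          // true
        System.out.println(p1 >= 0 && p1 < 3); // true
    }
}
```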
Grouping:
Key-value pairs with the same key are merged together
Sorting:
Within each group, key-value pairs are sorted by key in lexicographic (dictionary) order
Implementation process
Map-side shuffle
Diagram: https://blog.csdn.net/h1025372645/article/details/94757137
spill (spilling to disk):
The output of each map task goes into a circular memory buffer (100 MB)
Partition: each key-value pair is tagged with its partition
Sort: data within the same partition is sorted
When the circular buffer reaches the 80% threshold, spilling begins: the partitioned, sorted data is written to disk, each spill becoming one file, so many small files are eventually generated
merge: the many small spill files are merged (file1, file2, ...)
Sort: data within the same partition is sorted again
The result is a single final file (end_file) per map task
When a map task finishes, it notifies the ApplicationMaster, and the ApplicationMaster tells the reduce tasks to pull the data
Reduce-side shuffle
Diagram: https://blog.csdn.net/h1025372645/article/details/94757137
Each reduce task starts multiple threads that pull its own partition's data over the network from every machine (map task 1, map task 2, ...)
merge: the data pulled from each map task is merged and sorted
Sort: all the data belonging to this partition is sorted as a whole
reduce1:
Grouping: values with the same key are merged, e.g.
hive, <1,1,1,1>
spark,<1,1,1,1>
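The grouping step can be simulated in plain Java; the sorted key-value pairs below are hypothetical map output:

```java
import java.util.*;

// After sorting, consecutive identical keys are handed to one reduce()
// call together with all of their values.
public class GroupingSketch {
    public static void main(String[] args) {
        // Sorted map output, as it arrives at a reducer (hypothetical data).
        String[][] pairs = {{"hive","1"},{"hive","1"},{"spark","1"},{"spark","1"}};
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String[] kv : pairs) {
            groups.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        System.out.println(groups); // {hive=[1, 1], spark=[1, 1]}
    }
}
```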
Optimizing the shuffle process
Combiner merging
A merge is performed early, in the map phase; generally speaking this is equivalent to running the reduce operation in advance.
Benefit: it reduces the load on the reducers,
and the merging in the map phase runs in parallel (distributed).
job.setCombinerClass(WordCountReducer.class);
Note: not every program is suitable for a combiner; test it.
The result with a combiner must match the result without it; a performance optimization must not change the output.
The operation must be associative: A + (B + C) = (A + B) + C
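A plain-Java check of the associativity requirement: a sum (as in word count) gives the same answer when combined early, but an average does not, so an averaging reducer is not a valid combiner:

```java
// Why a combiner must preserve the result: combining partial sums is safe,
// combining partial averages is not.
public class CombinerSafety {
    public static void main(String[] args) {
        // Sum: combining in any grouping gives the same result.
        double sumAll = 1 + 2 + 3 + 4;            // 10.0
        double sumCombined = (1 + 2) + (3 + 4);   // 10.0
        System.out.println(sumAll == sumCombined); // true

        // Average: averaging partial averages gives the wrong answer
        // when the partial groups have different sizes.
        double avgAll = (1 + 2 + 3 + 4) / 4.0;               // 2.5
        double avgCombined = ((1 + 2 + 3) / 3.0 + 4) / 2.0;  // 3.0
        System.out.println(avgAll == avgCombined); // false
    }
}
```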
Compression
Greatly reduces disk I/O and network I/O
MapReduce can apply compression in several places:
The input itself can be a compressed file
In the map-side shuffle, after the spills are merged into one large file, that file can be compressed, so the reducers pull already-compressed data
The reduce output
Common compression formats in Hadoop
Check which compression codecs the local native library supports:
bin/hadoop checknative
To change the compression libraries, just replace the native package
Commonly used compression formats:
snappy, lzo, lz4
Configuring compression in Hadoop
Input:
Usually not configured
Intermediate map output:
mapreduce.map.output.compress=true
mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.Lz4Codec
Finding the available codecs: locate the DefaultCodec class, then look at its package; the other compression codec classes live in the same package
Reduce output:
mapreduce.output.fileoutputformat.compress=true
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.Lz4Codec
Configuration methods:
Method 1: on the Configuration object in the main method
Configuration configuration = new Configuration();
configuration.set("mapreduce.output.fileoutputformat.compress","true");
configuration.set("mapreduce.output.fileoutputformat.compress.codec",
"org.apache.hadoop.io.compress.Lz4Codec");
Method 2: in the configuration file
A global change; it takes effect for all MapReduce jobs
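For Method 2, the same mapreduce.* properties shown above would go into mapred-site.xml; a sketch of that config fragment:

```xml
<!-- mapred-site.xml: enables compression for all MapReduce jobs -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.Lz4Codec</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.Lz4Codec</value>
</property>
```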
Method 3: pass custom configuration when launching the job
bin/yarn jar xxx.jar -Dxx=yy -Daa=bb MainClass input_path output_path
Checking whether the configuration took effect
- Method 1:
port 8088 -> history -> Configuration -> check the corresponding parameters
Before vs. after configuration (screenshots in the original post)
- Method 2: check the job counters
"Map output materialized bytes" (this value shrinks when map-output compression is enabled)