Design MapReduce programs

By division of tasks, there are two kinds of jobs: map-only jobs, and jobs with both map and reduce tasks.

Data ETL process

map phase: the splitting step; one large task is split into slices. Typical map-side ETL work:

      - Data filtering

      - Data completion

        For example: deriving the province from an IP address

      - Field formatting

Example of formatting a field:

time:

      dd/MM/yyyy:HH:mm:ss

timestamp:

       1522233.123 seconds

       = 1522233123 milliseconds

Standard format: yyyy-MM-dd HH:mm:ss

Domain name processing:

      extracting the domain name from a URL address
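Below is a minimal sketch of such a map-only ETL mapper. It assumes tab-separated log lines with the time in the second field; the class name EtlMapper and the field layout are hypothetical:

    import java.io.IOException;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class EtlMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        private final SimpleDateFormat rawFormat = new SimpleDateFormat("dd/MM/yyyy:HH:mm:ss");
        private final SimpleDateFormat stdFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        private final Text outLine = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t"); // assumed tab-separated input
            if (fields.length < 2) {
                return;                                     // data filtering: drop bad records
            }
            try {
                // field formatting: dd/MM/yyyy:HH:mm:ss -> yyyy-MM-dd HH:mm:ss
                fields[1] = stdFormat.format(rawFormat.parse(fields[1]));
            } catch (ParseException e) {
                return;                                     // drop unparseable time values
            }
            outLine.set(String.join("\t", fields));
            context.write(outLine, NullWritable.get());     // map-only: no reduce phase
        }
    }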

reduce phase: the merging step (merge)

 

Map-only example: Sqoop is essentially a MapReduce program that runs only map tasks (Sqoop = SQL + Hadoop).

 

SecondaryNameNode

What does the NameNode do?

Manages the slave nodes (DataNodes)

Handles requests sent by clients

Manages the metadata

Where is the metadata stored?

In HDFS itself? That would be a circular dependency.

To speed up access, the metadata is kept in memory on the machine where the NameNode runs;

memory read/write speed is high, so requests are served faster.

But memory alone is not safe: what happens when the machine shuts down?

Conclusion:

(1) The metadata is kept in memory and also persisted on disk in a file named fsimage

(2) /opt/modules/hadoop-2.7.3/data/tmpData/dfs/name/current

 

NameNode startup

Loads the contents of fsimage into memory; subsequent changes to HDFS (file uploads, modifications, deletions) must then somehow be kept in sync.

SecondaryNameNode function

Assists the NameNode by synchronizing the metadata on disk:

fsimage(old) + edits = fsimage(new)

edits: this file records every HDFS modification operation; it is critical and must not be lost.

The metadata can be reconstructed by replaying the information in the edits log.

During the merge, results are first written to an fs.temp file; once the merge completes, fs.temp is renamed to fsimage.

 

shuffle process

Functions it implements:

Partitioning:

decides which reducer the current key is handed to for processing

Default: the key's hash value modulo the number of reduce tasks (HashPartitioner; a sketch of its logic follows this list)

Grouping:

values with the same key are merged together

Sorting:

within each group, key-value pairs are sorted by key in dictionary (lexicographic) order
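The default HashPartitioner amounts to the following (this mirrors the class Hadoop ships; the sign bit of the hash is masked off so the partition index is never negative):

    import org.apache.hadoop.mapreduce.Partitioner;

    public class HashPartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            // non-negative hash of the key, modulo the number of reduce tasks
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }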

Implementation process

Map-side shuffle

(Diagram: https://blog.csdn.net/h1025372645/article/details/94757137)

spill (spill-to-disk):

Each map task writes its output into an in-memory ring buffer (100 MB by default).

Partition: each key-value pair is tagged with the partition it belongs to.

Sort: data belonging to the same partition is sorted within that partition.

When the ring buffer reaches the 80% threshold, spilling begins: the partitioned, sorted data is written to disk as a file. Over time this produces many small files.
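Both numbers are tunable; they correspond to the standard Hadoop properties (shown with their defaults):

mapreduce.task.io.sort.mb=100
mapreduce.map.sort.spill.percent=0.80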

merge: the many small spill files are merged (e.g. file1 + file2 -> end_file).

Sort: within each partition the data is sorted again during the merge, producing one final partitioned, sorted file (end_file).

When the map task finishes, it notifies the ApplicationMaster, and the ApplicationMaster tells the reduce tasks to pull the data.

Reduce-side shuffle

(Diagram: https://blog.csdn.net/h1025372645/article/details/94757137)

Each reduce task starts multiple threads that pull its own partition's data over the network from the machine of every map task (map task1, map task2, ...).

merge: the data pulled from the individual map tasks is merged and sorted.

Sort: all data belonging to this partition is sorted as a whole.

Group: values with the same key are merged, e.g. in reduce1:

hive, <1,1,1,1>

spark, <1,1,1,1>
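After grouping, each call to reduce() receives one key together with the iterable of all its values, e.g. ("hive", <1,1,1,1>). A minimal word-count style reducer as a sketch (the same WordCountReducer name is reused by the combiner setting further below):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {  // iterate all grouped values for this key
                sum += v.get();
            }
            total.set(sum);
            context.write(key, total);      // e.g. ("hive", 4)
        }
    }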

 

 

Optimizing the shuffle process

 

Combiner merging

Performs an early merge during the map phase; generally speaking this is equivalent to executing the reduce operation ahead of time.

Benefit: it lowers the load on the reduce phase,

and the merging done in the map phase runs in parallel (distributed).

        job.setCombinerClass(WordCountReducer.class);

Note: not every program is suited to a combiner; test it.

The results before and after setting the combiner must be identical; a performance optimization must never change the output. In other words, the operation must be associative:

A + (B + C) = (A + B) + C
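A minimal sketch of how the combiner is wired into a job, reusing the WordCountReducer sketched earlier; WordCountDriver and WordCountMapper are hypothetical names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "wordcount");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);    // hypothetical mapper emitting (word, 1)
            job.setCombinerClass(WordCountReducer.class); // map-side pre-merge
            job.setReducerClass(WordCountReducer.class);  // reusable because summing is associative
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }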

Compression

Greatly reduces disk I/O and network I/O.

MapReduce can apply compression in several places:

the input itself can be a compressed file;

in the map-side shuffle, once the spills are merged into one large file, that file can be compressed, so the reducers fetch already-compressed data;

the reduce output can be compressed.

Common compression formats in Hadoop

Check which codecs the local native libraries support:

bin/hadoop checknative

To change the compression libraries, just replace the native package.

Commonly used compression formats:

snappy, lzo, lz4

 

Configuring compression in Hadoop

Input:

generally not configured

Intermediate map output:

mapreduce.map.output.compress=true
mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.Lz4Codec

To find the available codec classes: locate DefaultCodec, open the package it lives in, and the other compression codec classes are in the same package.

Reduce output:

mapreduce.output.fileoutputformat.compress=true
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.Lz4Codec

How to configure:

Method 1: on the Configuration object in the main method

        Configuration configuration = new Configuration();

        // compress the job's final output with LZ4
        configuration.set("mapreduce.output.fileoutputformat.compress", "true");
        configuration.set("mapreduce.output.fileoutputformat.compress.codec",
                "org.apache.hadoop.io.compress.Lz4Codec");

 

Method 2: in the configuration file (e.g. mapred-site.xml)

A global change; it takes effect for all MapReduce jobs.

Method 3: pass custom configuration when launching the job:

bin/yarn jar xxx.jar MainClass -DXX=yy -Daa=bb input_path output_path

Checking whether the configuration took effect

  • Method 1:

port 8088 -> history -> Configuration -> look up the corresponding parameter (compare the value before and after configuring)

  • Method 2: check the counters, e.g.

  Map output materialized bytes

Origin: blog.csdn.net/qq_35315363/article/details/94762783