[Big Data] MapReduce

MapReduce

MapReduce is a programming framework for distributed computing programs, and it is the core framework for developing "Hadoop-based data analysis applications".

MapReduce core functions

It integrates the user's hand-written business logic code with its own built-in default components into a complete distributed computing program (that is, it assembles the code into a program suited to distributed execution) and runs that program on a Hadoop cluster.
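As a concrete sketch of that integration, here is a minimal WordCount driver (the class names WordCountDriver, WordCountMapper, and WordCountReducer are my own; the Mapper and Reducer are sketched later in this post): the user supplies only the map/reduce logic, while Hadoop's default components handle input splitting, shuffling, and output.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: wires the user-written business logic to Hadoop's built-in
// default components so the whole thing runs as one distributed
// program on the cluster.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // user business logic
        job.setReducerClass(WordCountReducer.class); // user business logic
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}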

Advantages of MapReduce:

  • MapReduce is easy to program: implementing a few interfaces is enough to produce a complete distributed program, which can then be spread across a large number of cheap PC machines to run. In other words, writing a distributed program feels much like writing a simple serial program.
  • MapReduce scales well: when computing resources run short, computing power can be expanded simply by adding machines, and reliability is improved through the replica mechanism.
  • High fault tolerance: if one of the machines goes down, its computing tasks are transferred to another node so the job does not fail, and this process is automatic.
  • Suitable for offline processing of massive data at the PB level and above, with concurrent operation across clusters of thousands of machines.

Disadvantages of MapReduce:

  • MapReduce cannot return results in milliseconds or seconds the way MySQL can.
  • Not good at streaming computation: the data set MapReduce takes as input must be static.
  • Not good at DAG computation: when jobs are chained, each stage writes its output to disk, which causes heavy disk IO.

The core programming ideas of MapReduce:

map task ——> splits the input: each line, e.g. "hello word", is broken into key-value pairs and handed over to the reduce stage for further processing.

reduce task ——> counts (aggregates) the map results.


The input to map is also a key-value pair (key1, value1): key1 is the byte offset of the line, and value1 is the content of that line.

Offset:

hello word
hello nihao

Byte offsets of the characters in the first line:

h e l l o   w o r d
0 1 2 3 4 5 6 7 8 9

map(0, "hello word")
map(12, "hello nihao")

// every byte counts toward the offset: the space at position 5 is one
// byte, and a two-byte \r\n line terminator pushes the second line to
// offset 12
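A minimal map-side sketch of this, using the standard Hadoop Mapper API (the class name WordCountMapper is my own choice): key1 is the LongWritable byte offset, value1 is the line, and every word on it is emitted with a count of 1.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map input:  (byte offset of the line, line contents)
// Map output: (word, 1) for every word on the line
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // e.g. map(0, "hello word") emits ("hello", 1) and ("word", 1)
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}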

The map output value (value2) is a single element; the reduce input value (value3) is an iterator over a collection, i.e. the collection of all the value2s that share the same key.

reduce(hello,(1,1,1,1)) // the map output becomes the reduce input; all values with the same key are grouped together into one collection
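A matching reduce-side sketch (the class name WordCountReducer is my own choice): the shuffle has already grouped all the 1s for each word, so the reduce method only has to sum the iterator.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce input:  (word, iterator over all the 1s emitted for it)
// Reduce output: (word, total count)
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();   // e.g. ("hello", (1,1,1,1)) -> 4
        }
        total.set(sum);
        context.write(key, total);
    }
}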

Reasons for using Hadoop's own data types:

student {
    string name;
    string age;
    string sex;
}
// wrapped by plain Java and sent over the network ——————————> 100 KB
// Hadoop types already implement compact serialized transmission ——————————> 10 KB
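A hedged sketch of the Hadoop-side version, assuming the standard Writable interface (the class name StudentWritable is my own): only the raw field bytes go over the wire, with none of the class metadata that java.io.Serializable drags along.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class StudentWritable implements Writable {
    private String name;
    private String age;
    private String sex;

    // Hadoop instantiates Writables by reflection, so a no-arg
    // constructor is required.
    public StudentWritable() {}

    public StudentWritable(String name, String age, String sex) {
        this.name = name;
        this.age = age;
        this.sex = sex;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);   // fields are written in a fixed order...
        out.writeUTF(age);
        out.writeUTF(sex);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();  // ...and must be read back in the same order
        age = in.readUTF();
        sex = in.readUTF();
    }
}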


Serialization

Serialization converts objects in memory into byte sequences (or another data transfer protocol) so they can be stored on disk (persisted) or transmitted over the network; deserialization is the reverse.

Features of Hadoop serialization:

  • Compact: uses storage space efficiently; in the example above, only (name, age, sex) is transmitted.
  • Fast: reading and writing data adds little extra overhead.
  • Extensible: can be upgraded along with the communication protocol.
  • Interoperable: supports interaction with multiple languages, such as R, Scala, and C++.
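A small round-trip sketch using the hypothetical StudentWritable above (the field values and printed size are illustrative): serialization turns the in-memory object into a byte sequence, and deserialization rebuilds it.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

public class WritableRoundTrip {
    public static void main(String[] args) throws Exception {
        StudentWritable original = new StudentWritable("tom", "18", "male");

        // serialization: object in memory -> byte sequence
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        original.write(new DataOutputStream(buffer));
        System.out.println("serialized size: " + buffer.size() + " bytes");

        // deserialization: byte sequence -> object in memory
        StudentWritable restored = new StudentWritable();
        restored.readFields(new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray())));
    }
}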

 

MapReduce detailed workflow:

[diagram in the original post: the full MapReduce workflow, numbered steps 1-16]

 

Shuffle mechanism:

The data processing that happens after the map() method and before the reduce() method is called shuffle.

Ring buffer: a circular in-memory buffer with no fixed head or tail (100 MB by default); it spills to disk when it reaches 80% capacity.

The shuffle process covers only steps 7 through 16 of the workflow diagram.

  1. maptask collects the kv pairs output by our map() method and puts them into the memory buffer.
  2. The memory buffer keeps spilling over to local disk files; several spill files may be produced.
  3. Multiple spill files are merged into one large spill file.
  4. During both the spill and the merge, the partitioner must be called to partition the data and sort it by key (see the Partitioner sketch after this list).
  5. According to its own partition number, each reducetask goes to the maptask machines and fetches the corresponding partition of result data.
  6. A reducetask fetches result files belonging to the same partition from different maptasks, then merges these files (merge and sort).
  7. After merging into one large file, the shuffle process is over; the logical processing of the reducetask then begins (take one group of key-value pairs out of the file and call the user-defined reduce() method).
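The partitioning called out in step 4 is the customizable hook. A minimal sketch (this is the same logic as Hadoop's default HashPartitioner; the class name WordPartitioner is my own):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reducetask receives each (key, value). Each
// reducetask later fetches exactly the partition carrying its own
// number (steps 4-5 above).
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // mask off the sign bit so the result is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be registered on the job with job.setPartitionerClass(WordPartitioner.class), paired with job.setNumReduceTasks(n).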

Shuffle summary:

The buffer size affects the execution efficiency of MapReduce: in principle, the larger the buffer, the fewer the disk IOs, and the faster the execution.
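A hedged example of turning that knob (the property names are the standard Hadoop 2.x/3.x ones; the value 200 is purely illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuning {
    public static Job tunedJob() throws Exception {
        Configuration conf = new Configuration();
        // a bigger ring buffer means fewer spills, hence fewer disk IOs
        conf.set("mapreduce.task.io.sort.mb", "200");         // default: 100 (MB)
        conf.set("mapreduce.map.sort.spill.percent", "0.80"); // default: 0.80
        return Job.getInstance(conf, "tuned job");
    }
}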



Origin: blog.csdn.net/Qmilumilu/article/details/104650793