What exactly is MapReduce?

Hadoop has two basic components to understand. Having looked at what HDFS is last time, this chapter focuses on the basic principles of MapReduce.

MapReduce is a programming model for parallel computation over large data sets (larger than 1 TB). MapReduce divides the computation into two parts: Map (mapping) and Reduce (reduction).

When you submit a computing job to the MapReduce framework, it first splits the job into several map tasks and assigns them to different nodes for execution. Each map task processes a portion of the input data. When a map task completes, it generates intermediate files, and these intermediate files become the input data of the reduce tasks. The main goal of a reduce task is to gather the outputs of the preceding map tasks, summarize them, and write the result.

 

MapReduce architecture:

Master-slave structure: there is exactly one master node, the JobTracker, and many slave nodes, the TaskTrackers.

The JobTracker is responsible for: receiving computing jobs submitted by clients; handing the computing tasks to TaskTrackers for execution; and monitoring the TaskTrackers' execution.

TaskTrackers are responsible for: executing the computing tasks assigned by the JobTracker.

 

MapReduce is a distributed computing model proposed by Google, used mainly in the search field to solve the computing problems of massive data.

MR consists of two phases, Map and Reduce. Users only need to implement the map() and reduce() functions to get distributed computing, which keeps things very simple.

The parameters of these two functions are key-value pairs representing the function's input, as the sketch below shows.
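
A minimal sketch of what those two functions look like in Hadoop's Java API (assuming the newer org.apache.hadoop.mapreduce package; the class name here is made up for illustration):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // The four type parameters are <input key, input value, output key, output value>.
    public class PassThroughMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // key = byte offset of the line, value = the line itself (see TextInputFormat below).
            context.write(key, value); // emit output key-value pairs through the context
        }
    }

reduce() has the analogous shape, reduce(key, values, context), where values is an Iterable holding every value that shares the key; the WordCountApp example below implements both.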

MapReduce implementation process:

(figure: MapReduce implementation process)

MapReduce principle:

(figure: MapReduce principle)

Steps:

1. Map task processing

1.1 Read the contents of the input file and parse them into key-value pairs: each line of the input file is parsed into one key-value pair, and the map function is called once for each pair.

1.2 Write your own logic to process the input key-value pairs and convert them into new key-value pairs for output.

1.3 Partition the output key-value pairs (configured as in the sketch after this list).

1.4 Within each partition, sort the data by key and group it: values with the same key go into one set.

1.5 (Optional) Reduce the grouped data locally, i.e. run a combiner.

2. Reduce task processing

2.1 Copy the outputs of the multiple map tasks, according to their partitions, across the network to the different reduce nodes.

2.2 Merge and sort the outputs of the multiple map tasks. Write your own reduce-function logic to process the input key-value pairs and convert them into new key-value pairs for output.

2.3 Save the reduce output to a file.
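
Steps 1.3 and 1.5 are pluggable on the job object. A minimal sketch, assuming the newer org.apache.hadoop.mapreduce API (the class and job names are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

    public class PartitionConfigDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "partition-demo");
            // Step 1.3: the partitioner decides which reduce task receives each key.
            job.setPartitionerClass(HashPartitioner.class);
            // Step 1.5: an optional combiner does a local reduce on each map's output,
            // e.g. job.setCombinerClass(SumReducer.class) with the reducer shown below.
        }
    }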

Example: implementing WordCountApp.
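
A minimal, runnable sketch of WordCountApp against the newer org.apache.hadoop.mapreduce API (the whitespace-splitting regex and the command-line paths are assumptions):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountApp {

        // <k1,v1> = <line offset, line text>  ->  <k2,v2> = <word, 1>
        public static class WordMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String w : value.toString().split("\\s+")) { // split the line into words
                    if (!w.isEmpty()) {
                        word.set(w);
                        context.write(word, ONE); // emit <word, 1>
                    }
                }
            }
        }

        // <k2,{v2}> = <word, {1,1,...}>  ->  <k3,v3> = <word, total count>
        public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                    throws IOException, InterruptedException {
                long sum = 0;
                for (LongWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new LongWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountApp.class);
            job.setMapperClass(WordMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Here <k1,v1> is <line offset, line text>, <k2,v2> is <word, 1>, and <k3,v3> is <word, total count>, matching the key-value pair table below.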

 

map and reduce key-value pair formats:

    function     input key-value pair     output key-value pair
    map()        <k1,v1>                  <k2,v2>
    reduce()     <k2,{v2}>                <k3,v3>

JobTracker

Responsible for receiving jobs submitted by users, and for initiating and tracking task execution.

JobSubmissionProtocol is the interface through which JobClient communicates with the JobTracker.

InterTrackerProtocol is the interface through which TaskTracker communicates with the JobTracker.

 

TaskTracker

Responsible for executing tasks.

 

JobClient

JobClient is the primary interface through which the user interacts with the JobTracker.

It is responsible for submitting jobs, initiating and tracking task execution, querying task status, and accessing task logs.
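
A minimal sketch of job submission through JobClient, using the classic org.apache.hadoop.mapred API (the class name and argument handling are illustrative; the unspecified mapper and reducer fall back to the defaults listed in the next section):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SubmitDemo {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(SubmitDemo.class);
            conf.setJobName("demo");
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            // JobClient packages the job and submits it to the JobTracker
            // (speaking JobSubmissionProtocol), then tracks it to completion.
            JobClient.runJob(conf);
        }
    }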


MapReduce driver default settings:

    setting                default value
    InputFormat            TextInputFormat
    MapperClass            IdentityMapper
    MapOutputKeyClass      LongWritable
    MapOutputValueClass    Text
    PartitionerClass       HashPartitioner
    ReducerClass           IdentityReducer
    OutputKeyClass         LongWritable
    OutputValueClass       Text
    OutputFormatClass      TextOutputFormat
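
As a sketch, here is a driver that sets each of these defaults explicitly on the classic org.apache.hadoop.mapred.JobConf (IdentityMapper, IdentityReducer and HashPartitioner live in org.apache.hadoop.mapred.lib); omitting any of these lines leaves the value shown in the table:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;
    import org.apache.hadoop.mapred.lib.HashPartitioner;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class DefaultsDemo {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            conf.setInputFormat(TextInputFormat.class);       // InputFormat
            conf.setMapperClass(IdentityMapper.class);        // MapperClass
            conf.setMapOutputKeyClass(LongWritable.class);    // MapOutputKeyClass
            conf.setMapOutputValueClass(Text.class);          // MapOutputValueClass
            conf.setPartitionerClass(HashPartitioner.class);  // PartitionerClass
            conf.setReducerClass(IdentityReducer.class);      // ReducerClass
            conf.setOutputKeyClass(LongWritable.class);       // OutputKeyClass
            conf.setOutputValueClass(Text.class);             // OutputValueClass
            conf.setOutputFormat(TextOutputFormat.class);     // OutputFormatClass
        }
    }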


Concept of serialization:

Serialization refers to converting structured objects into a byte stream.

Deserialization is the reverse process of serialization, i.e. turning a byte stream back into structured objects.

Java serialization (java.io.Serializable)


Features of the Hadoop serialization format:

Compact: makes efficient use of storage space.

Fast: low overhead when reading and writing data.

Scalable: can transparently read data written in an older format.

Interoperable: supports interaction between multiple languages.

 

Hadoop serialization format: Writable

 

Serialization plays two roles in a distributed environment: inter-process communication and permanent storage.

Hadoop uses it for communication between nodes, as in the Writable sketch below.
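
A minimal sketch of a custom Writable (the class and field names are assumptions):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    public class TrafficWritable implements Writable {
        private long upBytes;
        private long downBytes;

        public TrafficWritable() {} // no-arg constructor required for deserialization

        @Override
        public void write(DataOutput out) throws IOException { // serialization
            out.writeLong(upBytes);
            out.writeLong(downBytes);
        }

        @Override
        public void readFields(DataInput in) throws IOException { // deserialization
            upBytes = in.readLong();   // read fields back in the order they were written
            downBytes = in.readLong();
        }
    }

The framework calls write() when sending the object between nodes or spilling it to disk, and readFields() on the receiving side; the fields must be read back in exactly the order they were written.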


MapReduce input processing classes:

 

FileInputFormat: the base class of every InputFormat implementation whose data source is a file. FileInputFormat takes all of the job's files as input and implements the method that computes splits over the input files. The method for obtaining the records is implemented by its subclasses, such as TextInputFormat.

InputFormat is responsible for handling the input portion of MR.

InputFormat has three roles (see the sketch after this list):

Validating that the job's input is well-formed.

Cutting the input files into InputSplits.

Providing a RecordReader implementation that reads the records out of an InputSplit for the Mapper to process.
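
Those three roles line up with the shape of the new-API org.apache.hadoop.mapreduce.InputFormat class, reproduced here as a sketch:

    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public abstract class InputFormat<K, V> {
        // Roles 1 and 2: validate the job's input and cut it into logical InputSplits.
        public abstract List<InputSplit> getSplits(JobContext context)
                throws IOException, InterruptedException;

        // Role 3: provide the RecordReader that turns an InputSplit into key-value records.
        public abstract RecordReader<K, V> createRecordReader(InputSplit split,
                TaskAttemptContext context) throws IOException, InterruptedException;
    }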

 

FileInputSplit:

◆ Before MapReduce runs, the original data is divided into several splits, and each split becomes the input of one map task. During execution, a split is decomposed into records (key-value pairs), and the map processes each record in turn.

◆ FileInputFormat only divides files that are larger than an HDFS block, so each resulting split is either a part of a file or an entire file.

◆ If a file is smaller than a block, it is not divided; this is one reason Hadoop processes a few large files far more efficiently than many small files.

◆ When Hadoop deals with many small files (files smaller than the HDFS block size), FileInputFormat does not divide them, so every small file becomes its own split and is assigned its own map task, which lowers efficiency.

For example: a 1 GB file is divided into 16 splits of 64 MB and assigned 16 map tasks, while 10,000 files of 100 KB each are processed by 10,000 map tasks.
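
The split size defaults to the HDFS block size (64 MB in the example above). A sketch of nudging it with the new-API FileInputFormat, assuming a Hadoop 2.x style driver:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "split-size-demo");
            // Split size is max(minSize, min(maxSize, blockSize)), so capping
            // maxSize at 32 MB turns the 1 GB file above into 32 splits.
            FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
        }
    }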

 

TextInputFormat:

◆ TextInputFormat is the default processing class; it handles ordinary text files.

◆ It turns each line of the file into a record: the byte offset at which the line starts in the file becomes the key, and the content of the line becomes the value.

◆ By default, a newline (\n) or carriage return marks the end of a line.

◆ TextInputFormat inherits from FileInputFormat.
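
For example, given a small two-line input file, TextInputFormat hands the map function pairs like these (the second key, 12, is the byte length of "hello world" plus its trailing newline):

    input file:            calls to map():
      hello world            map(0,  "hello world")
      hello hadoop           map(12, "hello hadoop")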

InputFormat class hierarchy:

(figure: InputFormat class hierarchy)
