MapReduce Experiment (1) Principle

Official website

http://hadoop.apache.org/

Three components of Hadoop

HDFS: Distributed Storage System

https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html

MapReduce: Distributed Computing System

http://hadoop.apache.org/docs/r2.8.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

YARN: Resource Scheduling System for Hadoop

http://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html

This reminds me of a project I once did on laser measurement of track leveling for China Railway: the database for a 50 km stretch of track was about 400 GB, and even finding enough space to copy it was a problem. With a distributed storage and computing platform, this kind of work can now be carried out very conveniently.

Mapper

A mapper maps input key/value pairs into a set of intermediate key/value pairs.

  • Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.
  • The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the job's InputFormat.
  • In general, the Mapper implementation is passed to the job via the Job.setMapperClass(Class) method (a minimal sketch follows this list). The framework then calls map(WritableComparable, Writable, Context) for each key/value pair in the InputSplit for that task. Applications can override the cleanup(Context) method to perform any required cleanup.
  • The output pair does not need to be of the same type as the input pair. A given input pair may map to zero or many output pairs. Output pairs are written by calling context.write(WritableComparable, Writable).

Applications can use counters to report their statistics.

  • All intermediate values associated with a given output key are subsequently grouped by the framework and passed to the Reducer(s) to determine the final output. Users can control the grouping by specifying a Comparator via Job.setGroupingComparatorClass(Class).
  • The Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of reduce tasks for the job. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.
  • Users can optionally specify a combiner, via Job.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps cut down the amount of data transferred from the Mapper to the Reducer.
  • The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value) format. Applications can control if, and how, the intermediate outputs are compressed, and which CompressionCodec is used, via the Configuration.
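
To make the Mapper contract concrete, here is a minimal word-count Mapper sketch against the org.apache.hadoop.mapreduce API. The class name TokenizerMapper and the tokenization logic are illustrative, not taken from the text above.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Illustrative word-count Mapper: emits an intermediate (word, 1) pair per token.
    public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                // Intermediate output is written via context.write(WritableComparable, Writable).
                context.write(word, ONE);
            }
        }
    }

It would be wired into a job with job.setMapperClass(TokenizerMapper.class); a combiner could be attached with job.setCombinerClass(Class) as described above.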

Reducer

  • A Reducer reduces a set of intermediate values which share a key to a smaller set of values.
  • The number of reduce tasks for the job is set by the user via Job.setNumReduceTasks(int).
  • In general, Reducer implementations are passed to the job via the Job.setReducerClass(Class) method (a matching sketch follows this list) and can override it to initialize themselves. The framework then calls reduce(WritableComparable, Iterable<Writable>, Context) for each <key, (list of values)> pair in the grouped inputs. Applications can override cleanup(Context) to perform any required cleanup.
  • The Reducer has 3 primary phases: shuffle, sort, and reduce.
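
A matching Reducer sketch for the same word-count example: after the shuffle and sort phases it receives each word together with the list of counts grouped under it, and sums them. Again, the class name IntSumReducer is illustrative.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Illustrative word-count Reducer: receives <word, [1, 1, ...]> and emits <word, total>.
    public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

It is registered with job.setReducerClass(IntSumReducer.class), and the number of reduce tasks is set with job.setNumReduceTasks(int).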

Shuffle

  • Input to the Reducer is the sorted output of the Mappers. In this phase the framework fetches the relevant partition of the output of all the Mappers, via HTTP.

Partitioner

  • The Partitioner partitions the key space.
  • The Partitioner controls the partitioning of the keys of the intermediate map outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. This therefore controls which of the reduce tasks the intermediate key (and hence the record) is sent to for reduction.
  • HashPartitioner is the default Partitioner (a simplified sketch follows).
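
As a sketch of what the default behaviour looks like, the following Partitioner hashes the key modulo the number of reduce tasks, which is essentially what HashPartitioner does; the class name WordHashPartitioner is made up for illustration.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Illustrative partitioner: all records with the same key land in the same partition,
    // and therefore in the same reduce task.
    public class WordHashPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // numPartitions equals the number of reduce tasks for the job.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

A custom implementation would be registered with job.setPartitionerClass(WordHashPartitioner.class).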

Counter

  • Counters are tools for MapReduce applications to report their statistics.
  • Mapper and Reducer implementations can use counters to report statistics (a usage sketch follows below).
  • The Hadoop MapReduce framework comes with a library of generally useful Mappers, Reducers, and Partitioners.
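
Below is a hedged sketch of reporting a custom counter from inside a Mapper; the group and counter names ("WordCount", "EMPTY_LINES") and the class name CountingMapper are invented for illustration.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Illustrative Mapper that also reports how many empty input lines it saw.
    public class CountingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString().trim();
            if (line.isEmpty()) {
                // Counters are aggregated by the framework and reported with the job statistics.
                context.getCounter("WordCount", "EMPTY_LINES").increment(1);
                return;
            }
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                context.write(new Text(itr.nextToken()), ONE);
            }
        }
    }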

In fact, MapReduce embodies the divide-and-conquer approach to program design: a complex task is split into several simple tasks that are handled separately. Beyond that, there is the scheduling problem: deciding which tasks are handled by which Mappers is a key consideration. The fundamental principle of MapReduce is locality of data processing: whichever machine holds a given piece of data is responsible for processing that piece, and the significance of this is that it reduces the burden on network communication. Finally, here is a classic diagram to round things off; after all, a chart is often more convincing than words.

[Figure: the classic MapReduce diagram referenced above]

If that 400 GB database were divided into 400 tasks, each task would process about 1 GB of data, and the theoretical speed would be 400 times that of the original.

For details, please refer to Google's MapReduce paper:

https://wenku.baidu.com/view/1aa777fd04a1b0717fd5dd4a.html

How MapReduce Works

Let us understand this with an example -

Assume the following input data to a MapReduce program; the task is to count the number of occurrences of each word in this data:

Welcome to Hadoop Class

Hadoop is good

Hadoop is bad

The final output of the MapReduce task is:

bad        1
Class      1
good       1
Hadoop     3
is         2
to         1
Welcome    1

These data go through the following stages:

Input split:

The input to a MapReduce job is divided into fixed-size chunks called input splits. Each input split is consumed by a single map task.

Mapping

This is the very first stage in the execution of a map-reduce program. In this stage, the data in each split is passed to a mapping function to produce output values. In our case, the job of the mapping stage is to count the number of occurrences of each word in the input split (input splits were described above) and prepare a list of the form <word, frequency of occurrence>.
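
For the sample input above, the mapping stage would emit one pair per word, for example <Welcome, 1>, <to, 1>, <Hadoop, 1>, <Class, 1> for the first line, and <Hadoop, 1>, <is, 1>, <good, 1> for the second.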

Shuffling (rearranging)

This stage consumes the output of the mapping stage. Its task is to consolidate the related records from the mapping stage output. In our example, the same words are grouped together along with their respective frequencies.
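
Continuing the example, the three <Hadoop, 1> pairs emitted by the mappers are grouped into <Hadoop, [1, 1, 1]>, the two <is, 1> pairs into <is, [1, 1]>, and each of the remaining words keeps a single-element list.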

Reducing

In this stage, the output values from the shuffling stage are aggregated. This stage combines the values from the shuffling stage and returns a single output value. In short, this stage summarizes the complete dataset.

In our case, this stage aggregates the values from the shuffling stage, that is, it calculates the total number of occurrences of each word.
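
Putting the stages together, a minimal driver for this word-count job might look like the sketch below. It assumes the TokenizerMapper and IntSumReducer classes sketched earlier and takes the input and output paths from the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Illustrative word-count driver: wires the splitting/mapping, shuffling and reducing stages.
    public class WordCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);

            job.setMapperClass(TokenizerMapper.class);   // mapping stage
            job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
            job.setReducerClass(IntSumReducer.class);    // reducing stage
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }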

How does MapReduce organize work?

Hadoop divides work into tasks. There are two types of tasks:

  1. Map tasks (splitting and mapping)
  2. Reduce tasks (shuffling, reducing)

The complete execution flow (executing Map and Reduce tasks) is controlled by two types of entities, called

  1. JobTracker: acts like a master (responsible for complete execution of the submitted job)
  2. Multiple TaskTrackers: act like slaves, each of them carrying out part of the work

For every job submitted for execution in the system, there is one JobTracker that resides on the NameNode and multiple TaskTrackers that reside on the DataNodes.

  • A job is divided into multiple tasks, which are then run on multiple data nodes in the cluster.
  • It is the responsibility of the JobTracker to coordinate this activity by scheduling tasks to run on different data nodes.
  • Execution of an individual task is then handled by the TaskTracker, which resides on every data node and executes its part of the job.
  • The TaskTracker's responsibility is to send progress reports to the JobTracker.
  • In addition, the TaskTracker periodically sends a "heartbeat" signal to the JobTracker so as to notify it of the current state of the system.
  • This allows the JobTracker to track the overall progress of each job. In the case of a task failure, the JobTracker can reschedule it on a different TaskTracker.
