MapReduce distributed computing system

MapReduce is a programming model for parallel computation over large data sets (greater than 1 TB). Its core concepts, "Map" and "Reduce", are borrowed from functional programming languages, together with features borrowed from vector programming languages. It makes it easy for programmers with no experience in distributed or parallel programming to run their own programs on a distributed system. In current software implementations, the programmer specifies a Map function, which maps a set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function, which merges all of the mapped values that share the same key.
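
To make the Map and Reduce roles concrete, here is a minimal single-process sketch of the model in Python. It only illustrates the idea and is not Hadoop code: the names map_fn, reduce_fn and run_mapreduce are made up for this example, and the "shuffle" is just an in-memory dictionary.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: turn one input record into intermediate (word, 1) pairs."""
    for word in line.split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: merge all values that share the same key."""
    return word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle step: group every intermediate value under its key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Each key is reduced independently of the others, which is what
    # lets a real system run the Reduce calls in parallel.
    return dict(reduce_fn(k, v) for k, v in groups.items())


# Input records are (line number, line text) pairs.
lines = [(0, "hello world"), (1, "hello mapreduce")]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
print(word_counts)
```

In a real cluster the Map calls and the Reduce calls run on different machines; the grouping by key in the middle is the part the framework does for you.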

Run the wordcount program

Run cd /opt/module/hadoop-2.7.3/share/hadoop/mapreduce to enter the directory that contains the wordcount example.
Run touch in.txt to create the file in.txt as the input file.
(If in.txt is empty, run vi in.txt and enter the text whose word frequencies are to be counted.)
The output directory /output must not already exist; the program creates it automatically when it runs.
Run wordcount:
hadoop jar hadoop-mapreduce-examples-2.7.3.jar wordcount /adir/in.txt /output/
After a successful run, go to the /output directory and open the file part-r-00000 to view the word counts.
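
If you would rather inspect the result with a script than read it by eye, a small Python sketch like the one below may help. It assumes the usual wordcount output layout of one "word<TAB>count" pair per line; the function name parse_counts and the sample lines are made up for illustration and are not real output from the run above.

```python
def parse_counts(lines):
    """Parse 'word<TAB>count' lines into a dict of word -> count."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so the count is always the final field.
        word, count = line.rsplit("\t", 1)
        counts[word] = int(count)
    return counts


# Made-up sample lines standing in for the contents of part-r-00000.
sample = ["hadoop\t2\n", "hello\t3\n", "world\t1\n"]
counts = parse_counts(sample)

# Most frequent words first.
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(top)
```

To use it on the real file, pass an open file object instead of the sample list, for example parse_counts(open("part-r-00000")), once the file is available on the local filesystem.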

MapReduce provides the following main functions:

1) Data partitioning and compute task scheduling:

The system automatically divides the large volume of data to be processed by a job (Job) into data blocks. Each data block corresponds to one compute task (Task), and the system automatically schedules compute nodes to process the corresponding data blocks. The job and task scheduling function is mainly responsible for allocating and scheduling compute nodes (Map nodes or Reduce nodes); it also monitors the execution state of these nodes and controls the synchronization of Map node execution.
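
The partitioning and scheduling idea can be sketched in a few lines of Python. This is a simplified illustration, not how Hadoop's scheduler is implemented: the function names, the node names and the round-robin policy are all assumptions made for the example.

```python
def split_into_blocks(total_size, block_size):
    """Return (offset, length) pairs that together cover the whole input."""
    return [
        (offset, min(block_size, total_size - offset))
        for offset in range(0, total_size, block_size)
    ]


def schedule(blocks, nodes):
    """Create one task per block and hand tasks out round-robin.

    Round-robin is an assumption for illustration; a real scheduler
    also considers free slots and where the data is stored.
    """
    assignment = {node: [] for node in nodes}
    for task_id, block in enumerate(blocks):
        node = nodes[task_id % len(nodes)]
        assignment[node].append((task_id, block))
    return assignment


# A 300-unit input with 128-unit blocks yields three tasks.
blocks = split_into_blocks(total_size=300, block_size=128)
plan = schedule(blocks, ["node-a", "node-b"])
print(blocks)
print(plan)
```

Note that the last block is shorter than the others; each block still becomes exactly one task.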

2) Data/code co-location:

To reduce data communication, a basic principle is localized data processing: a compute node processes, as far as possible, the data stored on its own local disk, which amounts to migrating the code to the data. When localized processing is not possible, the system looks for another available node and transfers the data to it over the network (migrating the data to the code), preferring an available node in the same rack as the data in order to reduce communication latency.
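
The preference order described above (same node first, then same rack, then anywhere) can be sketched as follows. The function pick_node, the node and rack names, and the returned labels are hypothetical and only serve to illustrate the rule.

```python
def pick_node(replica_nodes, free_nodes, rack_of):
    """Choose a node for a task, preferring locality.

    replica_nodes: nodes that store a copy of the task's data block
    free_nodes:    nodes that can currently accept a task
    rack_of:       mapping from node name to rack name
    """
    # 1. Best case: a free node already holds the data (code moves to data).
    for node in free_nodes:
        if node in replica_nodes:
            return node, "node-local"
    # 2. Next best: a free node in the same rack as a copy of the data.
    replica_racks = {rack_of[n] for n in replica_nodes}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    # 3. Otherwise any free node; the data must cross racks.
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, "none"


rack_of = {"n1": "rack1", "n2": "rack1", "n3": "rack2"}
print(pick_node({"n1"}, ["n3", "n1"], rack_of))
print(pick_node({"n1"}, ["n3", "n2"], rack_of))
print(pick_node({"n1"}, ["n3"], rack_of))
```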

3) System optimization:

To reduce data communication overhead, intermediate results are merged before they enter the Reduce nodes. The data processed by one Reduce node may come from multiple Map nodes, so to avoid data dependencies during the Reduce stage, the intermediate results output by the Map nodes must be partitioned with an appropriate strategy that guarantees related data is sent to the same Reduce node. In addition, the system performs some computational performance optimizations, such as running multiple backup copies of the slowest compute tasks and taking the result of whichever copy finishes first.
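
The first two optimizations, merging intermediate results on the Map side and partitioning them by key, can be sketched as below. The names combine and partition are made up for this example, and zlib.crc32 is used only as a convenient stable hash for the sketch; it is not the hash function Hadoop itself uses.

```python
import zlib
from collections import Counter


def combine(pairs):
    """Pre-sum (word, count) pairs on the Map side before sending them."""
    combined = Counter()
    for word, count in pairs:
        combined[word] += count
    return list(combined.items())


def partition(key, num_reducers):
    """Map a key to a reducer index; equal keys always get the same index.

    crc32 is a stand-in stable hash chosen for this sketch.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


map_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
combined = combine(map_output)
print(combined)  # fewer pairs to send over the network

# Route each combined pair to its reducer.
for word, count in combined:
    print(word, count, "-> reducer", partition(word, 4))
```

Because the reducer index depends only on the key, every Map node sends its pairs for a given word to the same Reduce node, so no Reduce node needs data held by another.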

4) Error detection and recovery:

In a large-scale MapReduce cluster built from low-end commodity servers, node hardware failures (host, disk, memory, and so on) and software errors are the norm. MapReduce therefore needs to be able to detect and isolate failed nodes and to schedule new nodes to take over their compute tasks. The system also maintains the reliability of data storage, using multiple redundant backup copies to improve it, and detects and recovers from data errors in a timely manner.
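
One common way to detect failed nodes is a heartbeat timeout, and the sketch below assumes that approach. Everything in it (the function names, the timeout value, the "least loaded node" policy) is an assumption for illustration rather than a description of Hadoop's actual implementation.

```python
def find_failed(last_heartbeat, now, timeout):
    """Return the nodes whose last heartbeat is older than the timeout."""
    return sorted(
        node for node, seen in last_heartbeat.items() if now - seen > timeout
    )


def reassign(assignment, failed, healthy):
    """Move the tasks of failed nodes onto the least loaded healthy nodes."""
    new_assignment = {
        node: list(tasks)
        for node, tasks in assignment.items()
        if node not in failed
    }
    for node in healthy:
        new_assignment.setdefault(node, [])
    orphaned = [task for node in failed for task in assignment.get(node, [])]
    for task in orphaned:
        # Assumed policy: pick the healthy node with the fewest tasks.
        target = min(healthy, key=lambda n: len(new_assignment[n]))
        new_assignment[target].append(task)
    return new_assignment


last_heartbeat = {"n1": 100, "n2": 95, "n3": 60}  # timestamps in seconds
assignment = {"n1": ["t1"], "n2": ["t2", "t3"], "n3": ["t4", "t5"]}

failed = find_failed(last_heartbeat, now=100, timeout=30)
healthy = [n for n in assignment if n not in failed]
recovered = reassign(assignment, failed, healthy)
print(failed)
print(recovered)
```

The failed node is dropped from the plan entirely, which corresponds to isolating it, and none of its tasks are lost.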

 

Origin www.cnblogs.com/yo123/p/10927008.html