The basic principle of MapReduce, the difference between MR and MPP

Overview of MapReduce

  • MapReduce (MR) is essentially a programming model for data processing: MapReduce handles the computation over massive data, while HDFS (the Hadoop Distributed File System) handles its storage.
  • Hadoop MapReduce is a programming framework. In a Hadoop environment, MapReduce programs written in various languages can be run to build applications that process large amounts of data on large clusters of commodity hardware. Much as the JRE provides a runtime for Java programs, this framework provides the environment under which such applications are developed and executed.
  • MapReduce programs are inherently parallel: their computing power comes from running the computation on many machines at once.
  • MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. It divides a task into small parts and assigns them to different machines, each of which processes its part independently; once all the parts have been processed and analyzed, the outputs are collected in one place to produce the result data set for the given problem.
  • The basic unit of information in MapReduce is the key-value pair. All data, structured or unstructured, must be converted into key-value pairs before being passed through the MapReduce model.
  • The MapReduce model has two distinct functions: the map function and the reduce function.
  • The execution of MapReduce is mainly divided into a map phase and a reduction phase (the latter comprising the shuffle phase and the reducer phase).
  • The order of operations is always: Map -> Shuffle -> Reduce
    • Map stage: the Map stage is the key first step in the MapReduce framework. The mapper gives structure to unstructured data and processes one key-value pair at a time; one input pair can produce any number of output pairs. The map function processes the data and emits several small blocks of intermediate data (see the word-count sketch after this list).
    • Reduction phase: the shuffle phase and the reducer phase together are called the reduction phase. The reducer takes the mapper's output as its input and produces the final output specified by the programmer; this output is saved to HDFS. The reducer fetches all key-value pairs coming from the mappers, gathers all the values associated with each single key, and emits any number of output key-value pairs.
    • MapReduce is a sequential computation: for the reducers to work correctly, the mappers must finish executing first, otherwise the reduce stage will not run.
  • In a Hadoop cluster, the compute nodes are generally also the storage nodes; that is, the MapReduce framework and HDFS (Hadoop Distributed File System) run on the same set of nodes. This configuration lets the framework efficiently schedule jobs on the nodes where the data already resides, so the cluster's aggregate bandwidth is used effectively and resources are not wasted.
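
To make the two functions concrete, here is a minimal word-count sketch against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce); the class names are illustrative, and in practice each class would live in its own source file:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map function: gives structure to raw text by emitting a <word, 1> pair per token.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE); // one input line may produce any number of outputs
    }
  }
}

// Reduce function: the shuffle has already grouped all values of a key together.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result); // final <word, count> pairs end up in HDFS
  }
}
```

The shuffle between these two classes guarantees that every occurrence of a given word reaches the same reducer, which is what makes the local sums correct.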

How MapReduce works

A MapReduce job usually splits the input data set into independent blocks, which are processed by the map tasks in a completely parallel manner. The framework sorts the map outputs and then feeds them into the reduce tasks. Typically, both the input and the output of a job are stored in the file system. The framework is responsible for scheduling tasks, monitoring them, and re-executing any that fail.

As mentioned above, the MapReduce framework works exclusively on key-value pairs of the form <key, value>.
The framework treats the input of a job as a set of <key, value> pairs and finally produces a set of <key, value> pairs as the result; the concrete key and value types depend on the specific problem.

The key and value classes must be serializable by the framework, so they have to implement the Writable interface. In addition, the key classes must implement the WritableComparable interface so that the framework can sort them.
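
As a sketch of what that means in code, a hypothetical composite key (the field names are invented for illustration) only has to implement the three methods below; write/readFields come from Writable, and compareTo is what makes it sortable:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key, sortable by (year, temperature).
public class YearTempKey implements WritableComparable<YearTempKey> {
  private int year;
  private int temperature;

  public YearTempKey() { } // the framework needs a no-argument constructor

  @Override
  public void write(DataOutput out) throws IOException { // serialization
    out.writeInt(year);
    out.writeInt(temperature);
  }

  @Override
  public void readFields(DataInput in) throws IOException { // deserialization
    year = in.readInt();
    temperature = in.readInt();
  }

  @Override
  public int compareTo(YearTempKey other) { // lets the framework sort keys during the shuffle
    int cmp = Integer.compare(year, other.year);
    return cmp != 0 ? cmp : Integer.compare(temperature, other.temperature);
  }
}
```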

A MapReduce job goes through the following process from input to output:
(raw input) <k1, v1> -> Map -> <k2, v2> -> Combine -> <k2, v2> -> Reduce -> <k3, v3> (output result)
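
A minimal driver that wires this pipeline together might look as follows (it reuses the word-count classes sketched earlier; using the reducer as the combiner is an assumption that is only safe because summing is commutative and associative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCountMapper.class);    // <k1,v1> -> <k2,v2>
    job.setCombinerClass(WordCountReducer.class); // local pre-aggregation of <k2,v2>
    job.setReducerClass(WordCountReducer.class);  // <k2,v2> -> <k3,v3>

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Both the input and the output of the job live in the (distributed) file system.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1); // submit, monitor, and wait
  }
}
```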

ResourceManager

The MapReduce framework consists of a single ResourceManager on the master node (Master), a NodeManager on each slave node (Slave), and an MRAppMaster for each application.

Once a program is written and packaged, Hadoop's job client can submit the job (usually a jar package or executable) and its configuration to the ResourceManager. The ResourceManager distributes the job code and configuration to the slave nodes (Slave), after which it is responsible for scheduling and monitoring the job, and it also reports status and diagnostic information back to the job client.
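
As an illustration of that client-side view, a job can also be submitted asynchronously and the status the ResourceManager reports back can be polled (a sketch using the standard Job API; error handling is omitted and the 5-second interval is arbitrary):

```java
import org.apache.hadoop.mapreduce.Job;

public final class JobMonitor {
  // Submits without blocking, then polls the status the ResourceManager reports back.
  static boolean submitAndWatch(Job job) throws Exception {
    job.submit(); // asynchronous submission
    while (!job.isComplete()) {
      System.out.printf("map %.0f%%  reduce %.0f%%%n",
          job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(5000);
    }
    return job.isSuccessful();
  }
}
```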

  • Client Service: application submission and termination, and output of information (status of applications, queues, the cluster, etc.).
  • Administration Service: management of queues, nodes, and client permissions.
  • ApplicationMasterService: registers and terminates ApplicationMasters, receives their resource requests and release requests, and forwards them asynchronously to the Scheduler; processing is single-threaded.
  • ApplicationMaster Liveliness Monitor: receives heartbeat messages from ApplicationMasters. If an ApplicationMaster sends no heartbeat within a certain period, it is considered failed and its resources are reclaimed; the ResourceManager then assigns a new ApplicationMaster to run the application (2 attempts by default).
  • Resource Tracker Service: registers nodes and receives the heartbeat messages of each registered node.
  • NodeManagers Liveliness Monitor: monitors the heartbeat of each node. If no heartbeat is received for a long time, the node is considered dead, all Containers on it are marked invalid, and no further tasks are scheduled to run on it.
  • ApplicationManager: manages applications, including recording and managing completed applications.
  • ApplicationMaster Launcher: after an application is submitted, interacts with the NodeManager to allocate a Container and launch the ApplicationMaster in it; also responsible for its termination or destruction.
  • YarnScheduler: resource scheduling and allocation; FIFO (with priority), Fair, and Capacity policies are available (see the configuration sketch after this list).
  • ContainerAllocationExpirer: manages Containers that have been allocated but not yet launched, reclaiming them after a certain period.
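
Two of the behaviors above are plain configuration. The property names below come from the stock YARN distribution; note that in a real cluster they belong in yarn-site.xml on the ResourceManager side, and a client-side Configuration object is used here only to show the keys and their usual defaults:

```java
import org.apache.hadoop.conf.Configuration;

public final class RmConfigSketch {
  static Configuration build() {
    Configuration conf = new Configuration();
    // Upper bound on ApplicationMaster launch attempts (2 is the stock default,
    // matching the "2 attempts" behavior described above).
    conf.setInt("yarn.resourcemanager.am.max-attempts", 2);
    // Which YarnScheduler implementation the ResourceManager loads; the stock
    // choices are the FIFO, Fair, and Capacity schedulers.
    conf.set("yarn.resourcemanager.scheduler.class",
        "org.apache.hadoop.yarn.server.resourcemanager.scheduler"
            + ".capacity.CapacityScheduler");
    return conf;
  }
}
```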

The difference between MPP and MapReduce

The difference between MPP and MapReduce shows up mainly in how they compute. Both are technologies for parallel processing, but they adopt different parallelization strategies.

Understanding MPP
MPP (Massively Parallel Processing) is a parallel computing model that distributes a workload across many processors. It is often used in traditional relational database management systems to improve processing performance and throughput. An MPP system usually consists of hundreds or thousands of nodes (a node being a group of processors plus memory); each node runs its own database instance and has its own independent memory, and the nodes communicate with each other over a high-speed network. The MPP system shards the data tables (by splitting their rows), and the shards are distributed across the nodes.

The MPP system parallelizes work by dividing the database into several parts and defining operations that can run in parallel, with each operation running on one node. Because the data is stored in shards, each part of an MPP computing task is generally bound to a fixed node, as sketched below.
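
A purely conceptual sketch of that static binding (not any particular MPP product's code): the distribution key of a row hashes to one node once and for all, so the compute over that shard is pinned there too.

```java
// Conceptual sketch only: an MPP row is permanently owned by one node.
final class MppShardingSketch {
  static int nodeFor(String distributionKey, int numNodes) {
    // A stable hash means the same key always lands on the same node,
    // so the computation over that shard is bound to that node as well.
    return Math.floorMod(distributionKey.hashCode(), numNodes);
  }

  public static void main(String[] args) {
    System.out.println(nodeFor("customer_42", 8)); // same node every time
  }
}
```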

MapReduce is a parallel computing model built on "Map" and "Reduce" operations, used mainly for distributed processing of massive data. In general, MapReduce divides a large data set into several small blocks and assigns each block to a different compute node. Each node independently runs the "Map" operation on its blocks to produce intermediate data; pieces of intermediate data that share the same key are then sent to the same node for the "Reduce" operation, and the results are finally combined into the final output.
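
That "same intermediate key goes to the same node" step is exactly what Hadoop's default partitioner decides during the shuffle; its logic amounts to the class below (mirroring org.apache.hadoop.mapreduce.lib.partition.HashPartitioner). The contrast with MPP is that this routing is computed per job at shuffle time, whereas MPP's data-to-node binding is fixed up front.

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Same logic as Hadoop's default HashPartitioner: every record with the same
// intermediate key is routed to the same reduce task.
public class DefaultStylePartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```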

Therefore, MPP and MapReduce adopt different strategies for parallel computing and are applied in different fields. MPP is mainly used for large-scale parallel processing in traditional relational databases and suits relatively simple computing scenarios; MapReduce is better suited to distributed computing, analysis, and processing of massive data in larger and more complex scenarios. Each technique has its pros and cons, and each effectively enables parallel processing in its own setting.
