Introduction to MapReduce and its advantages and disadvantages

1. What is MapReduce?

MapReduce is a programming model for large-scale data processing, used for parallel operations on large-scale data sets. The core function of Mapreduce is to integrate the business logic code written by the user and its own default components into a complete distributed computing program, which runs concurrently on a Hadoop cluster. And in a reliable, fault-tolerant way to process terabytes of massive data sets in parallel

2. Why use MapReduce?

Large amounts of data are processed on a single machine due to hardware resource limitations, which is not competent. Once the stand-alone version of the program is extended to a cluster for distributed operation, it will greatly increase the complexity of the program and the difficulty of development. After introducing the mapreduce framework, developers can integrate most of them. Work is concentrated on the development of business logic, and the complexity of distributed computing is handed over to the framework to handle.
Issues considered in mapreduce distributed schemes

  • Should the arithmetic logic divide first and then close?
  • How does the program allocate computing tasks (slices)?
  • How to start the two-stage program? How to coordinate?
  • Monitoring during the running of the entire program? Fault tolerance? Retry?
    Distributed solutions need to consider many issues, but we can encapsulate the public functions in the distributed program into a framework, allowing developers to focus on business logic. And mapreduce is such a general framework for distributed programs.

3. Advantages of MapReduce

  • The model is simple and easy to program.
    It simply implements some interfaces to complete a distributed program, which can be distributed to a large number of cheap PC machines to run. In other words, writing a distributed program is exactly the same as writing a simple serial program. It is because of this feature that Mapreduce programming has become very popular.
  • Good scalability.
    When your computing resources are not satisfied, you can expand its computing power simply by adding machines.
  • Flexible
    structured and unstructured data
  • Parallel processing The
    programming model naturally supports parallel processing, which is suitable for offline processing of PB-level and large amounts of data
  • Strong Fault Tolerance The
    original intention of Mapreduce is to enable programs to be deployed on cheap PC machines, which requires it to be highly fault-tolerant. For example, if a machine is hung up, it can transfer the above computing tasks to another node to run, so that the task will not fail, and this process does not require human involvement, but is completely completed internally by hadoop.

4. Disadvantages of MapReduce

  • Not good at real-time calculation.
    Mapreduce cannot achieve millisecond or second return results like Mysql.

  • Not good at streaming
    computing The input data of streaming computing is dynamic, while the input data set of Mapreduce is static and cannot change its flow. This is because the design characteristics of Mapreduce itself determine that the data source must be static.

  • Not good at DAG (directed graph) calculation of
    multiple application dependencies, the input of the latter application is the output of the previous application, in this case, Mapreduce is not impossible to do, but each Mapreduce after use The output of the job will be written to the disk, which will cause a lot of disk IO, resulting in very low performance.

    Reference: Advantages and disadvantages of Mapreduce

Guess you like

Origin blog.csdn.net/ytangdigl/article/details/109223914