MapReduce—Basic Introduction

Original author: How can natural science tyrants, everything is paid off

Original address: Introduction to MapReduce

Table of Contents

Scenario

Background of MapReduce

Functions of MapReduce

Summary


Scenario

For example, suppose there are a large number of text files, such as order records and page-click event logs. The volume is huge, and a single machine struggles to process it. How do we handle computation over such massive data?

As a toy warm-up, consider the sum 1 + 5 + 7 + 3 + 4 + 9 + 3 + 5 + 6. The same pattern needed for massive data already applies here: split the numbers into chunks, sum each chunk independently, then add the partial sums together.
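A minimal single-machine sketch of that split-then-combine idea (the chunk size of 3 and the class name are arbitrary choices for illustration, not part of any framework):

```java
import java.util.Arrays;

// Toy illustration of the map/reduce idea on one machine:
// split the numbers into chunks, sum each chunk independently ("map"),
// then combine the partial sums ("reduce").
public class PartialSums {
    public static void main(String[] args) {
        int[] numbers = {1, 5, 7, 3, 4, 9, 3, 5, 6};
        int chunkSize = 3;  // arbitrary example value

        // "Map": each chunk produces a partial sum; on a cluster, each chunk
        // would be summed on the node that stores that block of data.
        int numChunks = (numbers.length + chunkSize - 1) / chunkSize;
        int[] partialSums = new int[numChunks];
        for (int c = 0; c < numChunks; c++) {
            int from = c * chunkSize;
            int to = Math.min(from + chunkSize, numbers.length);
            for (int i = from; i < to; i++) {
                partialSums[c] += numbers[i];
            }
        }

        // "Reduce": combine the partial results into the final answer.
        int total = Arrays.stream(partialSums).sum();
        System.out.println("Partial sums: " + Arrays.toString(partialSums));
        System.out.println("Total: " + total);  // 43
    }
}
```

On a cluster, only the small partial sums would travel over the network, not the raw numbers.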

Background of MapReduce

Suppose you are asked to count the total number of times each URL appears in a log, and you write a single-machine program. The logic is nothing more than this: read the file one line at a time, cut out the URL field, and put it into a HashMap. When you see a new URL, put it in with a count of 1; when you see it again, just add 1. This single-machine logic is very easy to implement, but when the amount of data gets large, do you think the single-machine version can still cope?
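A minimal sketch of that single-machine logic (purely for illustration, it assumes the URL is the second tab-separated field of each log line and the log file path is the first command-line argument):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Single-machine URL counter as described above: read line by line,
// cut out the URL, and accumulate counts in a HashMap.
public class LocalUrlCount {
    public static void main(String[] args) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");
                if (fields.length < 2) {
                    continue;  // skip malformed lines
                }
                String url = fields[1];
                counts.merge(url, 1L, Long::sum);  // first time: 1, otherwise +1
            }
        }
        counts.forEach((url, count) -> System.out.println(url + "\t" + count));
    }
}
```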

First of all, a 2 TB file may not even fit on one computer, and what if there is more data: thousands of files adding up to dozens of terabytes certainly cannot be stored on a single machine. So where do they go? HDFS. HDFS can hold a great deal: with 100 nodes and 8 TB of disk per node, the cluster has 800 TB of raw capacity. Even storing 3 replicas of every file (so that a 2 TB file consumes about 6 TB of raw space), it can still easily hold more than 100 TB of data. But once the data is on HDFS, a new problem appears: each file is split into blocks, and the blocks are scattered across many machines. If you now run your statistics with the original logic, will there be a problem?

Any one node only stores some of the blocks of the file, so if you run your statistics on that machine, what you get will always be a partial result. Suppose you then write a special client program: it runs on one machine, reads the data bit by bit over the network, accumulates statistics as it reads, and once the whole file has been read the result comes out. The problem is that your program has become a single-machine program again. Its memory may not be enough, because it has to hold intermediate counts while it reads; and it is slow, because a single process has to keep pulling all of the data across the network. So writing a dedicated client to do the statistics is clearly not appropriate either.

Should you instead distribute your program to every DataNode in the cluster and do the statistics there? That is, move the computation to the data rather than moving the data to the computation: wherever the data is, that is where my computation logic runs. But there is a problem here too, because the computation has now become distributed and each node's result is only a partial result. At this point several new problems appear:

How do you distribute your code to many machines and get it running? Who does that for you? If you do it yourself, do you carry a USB drive around, copy your jar package to the machines one by one, and start each one by hand? By the time you start the last one, the first one has already finished. Doing this manually is clearly not workable. So as soon as you turn a simple piece of logic into this kind of distributed job, you find that many problems arise:

How is the code distributed, how is the startup environment configured, and how are the processes launched? This requires a sizable system of its own, something like a resource-distribution and Java program-launching system. Could you write such a system yourself? You could, but it would take a great deal of time and a great deal of code, and that area is probably not what your everyday Java work has made you good at, so the cost is very high.

Next, the log data is stored on HDFS, but that does not mean every DataNode holds a piece of it. The cluster is large; when the file was written it may have landed on only 30 of the nodes, so those 30 nodes have your blocks and the others do not. The code and its computation logic are best placed on exactly those 30 nodes. It would still run on other machines, but then the data has to come over the network, which is less efficient. So there is a question of strategy: which machines should the code be dispatched to? That strategy needs its own algorithm, and to support your simple logic you would have to develop yet another system, again a large amount of work.

Then consider another issue. Suppose you have handled both of the tasks above and your code really is running successfully on those 30 machines. If one of them goes down mid-run, the partial result it was computing is gone, and a summary produced without it no longer makes sense. So you also have to monitor the running state of your program the whole time and know which nodes are healthy and which are not, which is itself a complicated problem. Suppose you solve that as well; there is still another problem:

So far your logic only produces intermediate, per-node results, so they still have to be aggregated. You could ship all 30 nodes' partial results to one machine to merge, but then the load on that machine becomes very high. If instead you aggregate across several machines, the logic gets more complicated: each of the 30 machines holds counts for some set of URLs, and all of the data for a given URL must be sent to the same aggregation node. With more than one aggregation node, which URL is routed to which node? That strategy quickly becomes complicated, so you would also have to build an intermediate data-shuffling and scheduling system. Very troublesome.
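A common answer to "which URL goes to which aggregation node" is hash partitioning: hash the URL and take the result modulo the number of nodes, so the same URL always lands on the same node. A minimal sketch (the node count of 3 and the class name are arbitrary examples):

```java
// Hash partitioning: every record with the same URL is routed to the same
// aggregation node, and URLs are spread roughly evenly across nodes.
public class UrlPartitioner {
    private final int numNodes;  // number of aggregation (reduce) nodes

    public UrlPartitioner(int numNodes) {
        this.numNodes = numNodes;
    }

    public int nodeFor(String url) {
        // Mask off the sign bit so the result is never negative.
        return (url.hashCode() & Integer.MAX_VALUE) % numNodes;
    }

    public static void main(String[] args) {
        UrlPartitioner p = new UrlPartitioner(3);
        for (String url : new String[]{"/index.html", "/order", "/index.html"}) {
            System.out.println(url + " -> node " + p.nodeFor(url));
        }
        // "/index.html" always maps to the same node, so its counts can be
        // merged there without any cross-node coordination.
    }
}
```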

At this point we discover that even a very simple task, once it has to become a distributed program, forces you to face many other problems, problems that have nothing to do with our logic and are often far more complicated than the logic itself. Solving them is not what we are good at; most ordinary programmers have not reached that level, and it is unreasonable to require every programmer to have those skills just to count the total number of times each URL appears in a text file, which is a very simple thing. That is why MapReduce is such good news for ordinary programmers like us.

In other words, when we face massive data processing the logic may be very simple, but turning that logic into distributed code makes it very complicated, and the complicated parts are not what we care about; what we care about is the logic. So someone encapsulated all of the things we are not good at, that must be solved anyway, and that have little to do with our logic. Now, instead of dealing with all of that ourselves, we just write the logic, and the MapReduce framework together with YARN does the computation: these two frameworks encapsulate everything we just discussed. That is the background from which MapReduce was born.

Functions of MapReduce

In summary, MapReduce provides the following main functions:

  • Data partitioning and computing task scheduling:

The system automatically divides the big data handled by a job (Job) into many data blocks, each data block corresponding to a computing task (Task), and automatically schedules compute nodes to process the corresponding blocks. Job and task scheduling is responsible for allocating and scheduling compute nodes (Map nodes or Reduce nodes), for monitoring their execution status, and for controlling the synchronization of Map node execution.

  • Data/code locality:

To reduce data communication, a basic principle is localized data processing: a compute node processes, as far as possible, the data stored on its own local disk, which realizes the migration of code to data. When such localized processing is not possible, other available nodes are found and the data is transferred to them over the network (migration of data to code); even then, the framework tries to pick available nodes in the same rack as the data to reduce communication delay.

  • System optimization:

To reduce data communication overhead, intermediate results are merged locally before they are sent to the Reduce nodes. The data processed by one Reduce node may come from many Map nodes; so that there are no data dependencies across Reduce nodes in the Reduce phase, the intermediate results output by the Map nodes must be partitioned with an appropriate strategy, ensuring that all related data is sent to the same Reduce node. The system also applies performance optimizations such as speculative execution: running multiple backup copies of the slowest tasks and taking whichever finishes first. (A sketch of this partitioning idea in Hadoop API terms follows after this list.)

  • Error detection and recovery:

In a large-scale MapReduce cluster built from low-end commodity servers, hardware failures (host, disk, memory, etc.) and software errors are the norm, so MapReduce must be able to detect and isolate failed nodes and schedule new nodes to take over their computing tasks. At the same time, the system maintains the reliability of data storage: it uses multi-replica redundant storage to improve reliability and can detect and recover erroneous data in time.
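As mentioned under "System optimization", partitioning is what guarantees that all records with the same key reach the same Reduce node. In Hadoop this is the Partitioner; the default HashPartitioner already does hash routing, so a custom one (the class name and key/value types below are only illustrative) is needed only when a different strategy is wanted:

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: records with the same URL key always go to the
// same Reduce task, so each URL's counts can be summed in one place.
public class UrlHashPartitioner extends Partitioner<Text, LongWritable> {
    @Override
    public int getPartition(Text url, LongWritable count, int numReduceTasks) {
        // Same formula as Hadoop's default HashPartitioner.
        return (url.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```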

Summary

MapReduce makes it convenient to extend our simple computation logic to distributed computation in massive-data scenarios. For us programmers a MapReduce program is very simple, because the framework encapsulates all of that machinery; we only need to write the business logic, and business logic is exactly what we are good at writing. Most of it is processing text and strings, perhaps querying a database for something, and outputting a result; the logic itself does not require you to know many distributed details.

What you do have to do is follow the framework's programming model. You cannot each write in your own style, because then MapReduce could not run your code for you in a distributed way, so you have to comply with its specification. What counts as complying? Whatever your logic is, its implementation must be divided into these two steps:

  1. Map
  2. Reduce

For example, counting the total number of occurrences of each URL in our log files looks like this when expressed as a map step and a reduce step:

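Here is a minimal sketch of that example written against the Hadoop MapReduce Java API, in the classic WordCount style. The log format (URL as the second tab-separated field), the class names, and the command-line input/output paths are assumptions for illustration:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// URL-count job: the Map step emits (url, 1) per log line, the Reduce step
// sums the counts for each URL.
public class UrlCount {

    public static class UrlMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text url = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            if (fields.length >= 2) {
                url.set(fields[1]);      // assume the URL is the second field
                context.write(url, ONE); // emit (url, 1)
            }
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        private final LongWritable total = new LongWritable();

        @Override
        protected void reduce(Text url, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) {
                sum += c.get();          // all counts for one URL arrive together
            }
            total.set(sum);
            context.write(url, total);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "url count");
        job.setJarByClass(UrlCount.class);
        job.setMapperClass(UrlMapper.class);
        job.setCombinerClass(SumReducer.class); // local merge before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Run it like any Hadoop job, for example `hadoop jar urlcount.jar UrlCount /logs/input /logs/output`. The framework's shuffle groups all pairs with the same URL onto one Reduce task, and the combiner line simply applies the same summing locally on each Map node before the shuffle.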



Origin blog.csdn.net/sanmi8276/article/details/113061798