Summary of "Distributed Technology Principles and Algorithm Analysis", Part 3: Distributed Computing Technology

In a two-level scheduling architecture, the second level of scheduling is done by a framework, usually a computing framework such as Hadoop or Spark;
building on these computing frameworks, programmers can carry out computations of different types and sizes.

The essence of distributed computing is that, in a distributed environment, multiple processes cooperate to complete a complex task;
each process performs its own duties and, once its work is done, hands over to other processes for the remaining work;
work items with no dependencies between them can be executed in parallel by different processes.

1 MapReduce

Core idea: divide and conquer; the JDK's Fork-Join framework is built on the same idea.

Steps:

1 Decompose the original problem (Map): split the original problem into several smaller sub-problems that are independent of each other and have the same form as the original problem;

2 Solve the sub-problems: if a sub-problem is small and easy enough, solve it directly; otherwise, solve each sub-problem recursively;

3 Merge the solutions (Reduce): combine the solutions of the sub-problems into the solution of the original problem.
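
The three steps can be sketched in a few lines of Python. This is only a minimal illustration of divide and conquer, not MapReduce itself or Fork-Join; the function name `solve` and the `threshold` parameter are made up for the example.

```python
def solve(numbers, threshold=4):
    """Sum a list by divide and conquer (hypothetical helper, for illustration)."""
    # Step 2: a sub-problem that is small enough is solved directly.
    if len(numbers) <= threshold:
        return sum(numbers)
    # Step 1 (Map): split into smaller, independent sub-problems of the same form.
    mid = len(numbers) // 2
    left = solve(numbers[:mid], threshold)
    right = solve(numbers[mid:], threshold)
    # Step 3 (Reduce): merge the sub-solutions into the solution of the original problem.
    return left + right
```

Because the two halves do not depend on each other, they could also be handed to different workers, which is what a framework does on a larger scale.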

MapReduce mainly includes the following three components:
Master (MRAppMaster): responsible for assigning tasks and coordinating their execution; it assigns the map() function to Mappers and the reduce() function to Reducers;
Mapper worker: responsible for the Map function, that is, for executing the sub-tasks;
Reducer worker: responsible for the Reduce function, that is, for summarizing the results of the sub-tasks.
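
To make the three roles concrete, here is a small word-count sketch. It is an assumption-laden toy, not Hadoop's API: `map_words`, `reduce_counts` and `run_master` are hypothetical names, a thread pool stands in for the worker processes, and the grouping step between map and reduce is added so the reducers have something to summarize.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(chunk):
    """Mapper: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_counts(item):
    """Reducer: sum all the counts collected for one word."""
    word, counts = item
    return word, sum(counts)


def run_master(chunks, workers=3):
    """Master: assign chunks to mappers, group by key, assign groups to reducers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_words, chunks))
        # Group the intermediate pairs by word before reducing.
        groups = defaultdict(list)
        for pairs in mapped:
            for word, count in pairs:
                groups[word].append(count)
        return dict(pool.map(reduce_counts, groups.items()))
```

Calling `run_master(["a b a", "b c"])` counts the words across both chunks; each chunk is mapped independently, so the map calls can run in parallel.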

[Figure: MapReduce workflow chart]
The processes of a MapReduce task exit once the whole task has finished; this is a short-task mode;
starting and stopping task processes is time-consuming, so MapReduce is not suitable for real-time processing: it first collects data and caches it, and starts processing only when the cache is full. A drawback of this batch computation is therefore the long delay between collecting the data and obtaining the result.
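
The delay introduced by batching can be seen in a tiny sketch. The class `BatchBuffer` and its `capacity` and `process` parameters are hypothetical; the point is only that a record is not processed until the buffer it sits in is full.

```python
class BatchBuffer:
    """Collect records and process them only when the buffer is full (illustrative)."""

    def __init__(self, capacity, process):
        self.capacity = capacity    # how many records make up one batch
        self.process = process      # function applied to a full batch
        self.items = []             # records still waiting to be processed
        self.results = []           # one result per completed batch

    def collect(self, item):
        self.items.append(item)
        # Nothing is computed until the cache is full.
        if len(self.items) >= self.capacity:
            self.results.append(self.process(self.items))
            self.items = []
```

With `capacity=3` and `process=sum`, the fourth record collected stays in `items` with no result until two more records arrive.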

2 Stream

The main task here is real-time processing of streaming data with demanding latency requirements; it generally needs a resident service process that waits for data to arrive at any time, so that low latency can be guaranteed;
in the distributed field, this mode of computing over streaming data is called Stream.

Characteristics of stream data (for example, the data generated by live audio and video streaming):

Data arrives continuously and quickly;

The data volume is large (TB, PB);

Real-time requirements are high: the value of the data drops significantly as time passes;

The order of the data cannot be guaranteed, which means the system cannot control the order in which data elements are processed.

Data is processed as soon as it arrives: once a record has been processed, it is serialized into a buffer and immediately sent over the network to the next node, which continues the processing;
stream computing does not store any data; the data keeps flowing.
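
A rough single-machine sketch of this record-at-a-time flow uses chained Python generators. Serialization and the network hop are left out, and the `node` function and its `trace` argument are hypothetical; the trace only exists to show that each record passes through every node before the next record is read.

```python
def node(name, func, stream, trace):
    """One processing node: handle each record on arrival and pass it on at once."""
    for record in stream:
        result = func(record)
        trace.append((name, result))   # record which node handled what, in order
        yield result                   # hand the record to the next node immediately


def run_stream(records):
    """Chain two nodes and return the outputs together with the processing trace."""
    trace = []
    parsed = node("parse", int, iter(records), trace)
    squared = node("square", lambda value: value * value, parsed, trace)
    return list(squared), trace
```

In the trace, the entries for the first record appear before any entry for the second one, in contrast to a batch job where a whole stage finishes before the next stage starts.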

[Figure: steps of stream computing]

To process streaming data in time, a stream computing framework must be low-latency, scalable, and highly reliable.

3 Actor

Although the MapReduce and Stream computing modes treat data differently, both take a specific type of data (static data and dynamic data, respectively) as their dimension of computation.

Actor and the pipeline instead take the computation or processing procedure as their dimension.

Actor is a parallel computing model for distributed systems;
the model has its own set of rules, covering both the computing logic inside an Actor and the communication between multiple Actors;
in the Actor model, each Actor corresponds to one component of the system and is the basic unit of computation.

The Actor model resembles conventional object-oriented programming (OOP), in which an object receives a method invocation request (similar to a message) and then executes the method;
however, OOP encapsulates data inside objects where it cannot be accessed directly from outside, so when several external objects invoke an object's methods synchronously, deadlocks and race conditions can occur, and the high-concurrency demands of distributed systems cannot be met;
the Actor model communicates through asynchronous messages (queues), which overcomes these limitations of OOP and makes it suitable for highly concurrent distributed systems.

The three elements of the Actor model are state, behavior, and message: Actor model = (state + behavior) + message

State: the information held by the Actor component itself, equivalent to the attributes of an OOP object;
an Actor's state is affected by the Actor's own behavior and can be changed only by the Actor itself.

Behavior: the computation an Actor performs, corresponding to the member functions of an OOP object;
Actors cannot directly call each other's computation logic. An Actor triggers its computing behavior only when it receives a message.

Message (Mail): Actors communicate with each other by passing messages as mail; each Actor has its own mailbox (MailBox) for receiving messages from other Actors, which is why messages in the Actor model are also called mail;
in general, an Actor reads and processes the messages in its mailbox in arrival order (FIFO).
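
The three elements map onto a small Python class. This is a hand-rolled sketch using a thread and `queue.Queue`, not Akka or any real actor library; `CounterActor` and its message strings are invented for the example.

```python
import queue
import threading


class CounterActor:
    """A hypothetical actor: private state, one behavior, one FIFO mailbox."""

    def __init__(self):
        self._count = 0                  # state: changed only by this actor
        self._mailbox = queue.Queue()    # mailbox: messages are read in FIFO order
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message):
        """Asynchronous send: enqueue the message and return immediately."""
        self._mailbox.put(message)

    def _run(self):
        # Behavior: triggered only by messages, which are handled one at a time.
        while True:
            message = self._mailbox.get()
            if message == "stop":
                break
            if message == "increment":
                self._count += 1

    def stop(self):
        """Ask the actor to finish, wait for it, and return its final state."""
        self.send("stop")
        self._thread.join()
        return self._count
```

Senders never touch `_count` directly; they only call `send`, so the counter needs no explicit lock even when several threads send messages at once.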

Working principle: in the figure, Actor2 processes the messages it receives through a queue.
[Figure: working principle of the Actor model]
Advantages:

A higher level of abstraction than OOP: Actors communicate asynchronously, and multiple Actors run independently without interfering with each other, which solves the race problems found in OOP.

Non-blocking: by introducing message passing, the Actor model avoids blocking.

No locks needed: an Actor reads messages only from its MailBox and handles one message at a time, which is naturally mutually exclusive, so no extra locking code is needed.

High concurrency: each Actor processes only the messages in its local MailBox, so multiple Actors can work in parallel, which improves the parallel processing capability of the whole distributed system.

Easy to scale: each Actor can create more Actors to reduce the workload of a single Actor;
when the local Actors cannot keep up, Actors can be started on remote nodes and messages forwarded to them.

Disadvantages:

Actors lack inheritance and layering, so code reusability is low.

Actors can dynamically create other Actors, so the behavior of the whole Actor system keeps changing, which makes it hard to implement.

As the number of Actors grows, so does the system overhead.

Not suitable for systems with strict requirements on message processing order;
because messages are asynchronous, the execution order of individual messages cannot be determined;
improvement: the ordering problem can be solved by blocking Actors, but this seriously hurts the task processing efficiency of the Actor model.

Application: Akka

4 Pipeline

A large task is split into multiple steps, and different steps can be carried out by different processes, so that different tasks can be executed in parallel, which improves system efficiency.

Application: pipeline processing in machine learning
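
A pipeline of this kind can be sketched with one worker per step, connected by queues. The text speaks of processes; threads are used here only to keep the example short, and `stage`, `run_pipeline` and the `_DONE` sentinel are hypothetical names.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the task stream


def stage(func, inbox, outbox):
    """Run one pipeline step: read from inbox, apply func, write to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)   # pass the end marker downstream
            return
        outbox.put(func(item))


def run_pipeline(tasks, steps):
    """Push tasks through the steps; each step runs in its own thread."""
    queues = [queue.Queue() for _ in range(len(steps) + 1)]
    threads = [
        threading.Thread(target=stage, args=(func, queues[i], queues[i + 1]), daemon=True)
        for i, func in enumerate(steps)
    ]
    for thread in threads:
        thread.start()
    for task in tasks:
        queues[0].put(task)
    queues[0].put(_DONE)
    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results
```

While the second step is still working on the first task, the first step can already start on the second task; this overlap is where the efficiency gain comes from.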

Both the MapReduce mode and the pipeline mode split a large task into multiple sub-tasks; they differ in the granularity of the split and in the relationship between sub-tasks:

MapReduce splits at task granularity: a large task is divided into several smaller tasks, each of which performs the complete, identical set of steps, and these identical tasks can run in parallel, so it can be described as a task-parallel computing mode;
the pipeline mode splits at step granularity: a task is divided into several steps, each with different logic, and multiple tasks of the same type are executed in parallel by overlapping their steps, so it can be described as a data-parallel computing mode.

MapReduce sub-tasks can be executed independently without interfering with each other; once the sub-tasks have run, their results are merged to give the result of the whole task, so no dependencies are needed between sub-tasks;
in the pipeline mode, sub-tasks depend on each other: the output of one sub-task is the input of the next.

The MapReduce mode suits task-parallel scenarios, while the pipeline mode suits scenarios where tasks of the same type are processed in a data-parallel fashion.


Origin blog.csdn.net/qq_41594698/article/details/105244689