Introduction to Parallel and Distributed Computing (3) Basic Principles of Parallel Programming


Life is short, so stay a little salty!!!

Section 3 Basic design principles of parallel programs

Designing a parallel program is generally approached from two angles:

  • From a practical engineering perspective, we usually improve an existing serial program rather than create a parallel program from scratch. This is the so-called principle of incremental parallelization.
  • From a parallel-design perspective, we generally follow Foster's design methodology to obtain parallel programs that are logically rigorous, clearly structured, and efficient.

Let us introduce them one by one.

3.1 Incremental parallelization

General parallel program design process

  1. Study the serial program
  2. Look for serial performance bottlenecks and opportunities for parallelization
  3. Try to keep all the processors busy! (a minimal example follows below)
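
As a minimal sketch of this workflow (the array, its size N, and the function names are made up for illustration), suppose profiling a serial program shows that a summation loop dominates the run time; the incremental change is to parallelize just that loop, here with a single OpenMP pragma:

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000   /* illustrative problem size */

/* Steps 1-2: profiling the original serial program showed that this
 * summation loop is the (hypothetical) performance bottleneck. */
static double sum_serial(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Step 3: the incremental change -- the same loop with one OpenMP pragma,
 * so every available processor core contributes to the sum. */
static double sum_parallel(const double *a, int n) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}

int main(void) {
    static double a[N];
    for (int i = 0; i < N; ++i) a[i] = 1.0;
    printf("serial   = %f\n", sum_serial(a, N));
    printf("parallel = %f\n", sum_parallel(a, N));
    return 0;
}
```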

3.2 Foster's design methodology (the four-step method)

Partitioning, communication, agglomeration, and mapping. (Aside: the next part of Prof. Luo's slides is copied straight from the textbook.)
(Figure: Foster's four-step design methodology)

3.2.1 Partitioning

Main idea: choose an appropriate way to divide the computation and the data into small pieces.

  • First direction: maximize data parallelism (so-called domain decomposition)
    • Divide the data into small pieces
    • Decide how to associate the computation with the data (usually starting from the largest or most frequently accessed data structure)
  • Second direction: maximize task parallelism
    • Divide the computation into small pieces
    • Decide how to associate the data with the computations
  • Third direction: exploit pipeline parallelism
    • For example, make the most of every cycle by keeping every stage busy
    • Reduce the initiation interval (II) as much as possible

Data partitioning

  1. Decide how data should be distributed among processors
  2. Determine the computing task of each processor

Example: find the maximum value in an array. The array is divided evenly among n-1 processors; each processor finds the maximum of its own block and sends it to the remaining (last) processor, which then reports the overall maximum. A rough sketch follows below.
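
Below is a rough MPI sketch of this example (a slight variant in which every process, including the last one, computes a local maximum; the array contents and size N are illustrative):

```c
#include <stdio.h>
#include <mpi.h>

#define N 1024   /* hypothetical total array size, assumed divisible by the process count */

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Data partitioning: each process owns one contiguous block of the array. */
    int chunk = N / size;
    double local[N];
    for (int i = 0; i < chunk; ++i)
        local[i] = rank * chunk + i;          /* stand-in for real data */

    /* Each process computes the maximum of its own block ... */
    double local_max = local[0];
    for (int i = 1; i < chunk; ++i)
        if (local[i] > local_max) local_max = local[i];

    /* ... and the partial results are combined on the last process. */
    double global_max;
    MPI_Reduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX,
               size - 1, MPI_COMM_WORLD);

    if (rank == size - 1)
        printf("global maximum = %f\n", global_max);

    MPI_Finalize();
    return 0;
}
```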

Task partitioning

  1. Assign computing tasks to different processors
  2. Decide which data elements should be read and written by which processors

Example: a GUI event handler (a sketch follows below)
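
A minimal task-parallel sketch in this spirit, assuming POSIX threads: one thread plays the role of the event handler while another does background work. Both thread bodies and all names are placeholders, not a real GUI framework:

```c
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>

/* "Event handler" task: pretends to poll for and handle UI events. */
static void *event_loop(void *arg) {
    (void)arg;
    for (int i = 0; i < 3; ++i) {
        printf("event thread: handled event %d\n", i);
        usleep(1000);
    }
    return NULL;
}

/* Background computation task: pretends to render or compute something. */
static void *background_work(void *arg) {
    (void)arg;
    long sum = 0;
    for (long i = 0; i < 1000000; ++i)
        sum += i;
    printf("worker thread: result %ld\n", sum);
    return NULL;
}

int main(void) {
    pthread_t ui, worker;
    /* Task partitioning: different computations assigned to different threads. */
    pthread_create(&ui, NULL, event_loop, NULL);
    pthread_create(&worker, NULL, background_work, NULL);
    pthread_join(ui, NULL);
    pthread_join(worker, NULL);
    return 0;
}
```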

Pipelining (assembly line)

  • Assembly-line parallelism
    • Each "worker" does only one thing
    • Picture each worker finishing their own processing step, passing the piece to the next worker, then taking the next piece from the previous worker, and repeating

Example: 3D rendering

(Figure: the four-stage rendering pipeline of this example.) In this example, N + 3 steps are required to process N data sets: each stage can work on only one data set at a time, so the data sets must enter the pipeline one after another (a toy simulation follows below).
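
The following toy simulation (stage count S and data-set count N are illustrative) simply prints which data set occupies which stage at each step, showing that the last of N data sets leaves an S-stage pipeline after N + S - 1 steps, i.e. N + 3 steps here:

```c
#include <stdio.h>

#define S 4   /* pipeline stages */
#define N 5   /* data sets to process */

int main(void) {
    for (int step = 0; step < N + S - 1; ++step) {
        printf("step %2d:", step + 1);
        for (int stage = 0; stage < S; ++stage) {
            int item = step - stage;           /* data set occupying this stage */
            if (item >= 0 && item < N)
                printf("  stage%d->set%d", stage + 1, item + 1);
        }
        printf("\n");
    }
    printf("total steps = %d (N + S - 1)\n", N + S - 1);
    return 0;
}
```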

Note that a pipeline can only increase throughput; it does nothing to reduce the communication delay between threads or processors (which is obvious).

  • Initiation Interval (II) constraints
    • In the example above, when the four stages process two data sets through the pipeline, the second data set must wait one extra step before it can enter

    • Obviously, total throughput is maximized when II = 1, but this is not always achievable. (Figure: an example where II = 1 fails.)

      • In that example, using an interval of 1 produces a wrong result: r * v2 and r + v1 would execute at the same time, and if the addition happens first the value of r is corrupted (see the sketch after this list)
        • After the third stage adds v1[i-1] to r, the fourth stage starts multiplying r by v2[i-1]; but at that same moment the third stage has already come back and added v1[i] to r
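
As a sketch of the kind of loop-carried dependence involved (the exact computation from the slide is not reproduced; this loop is only assumed to have the same shape), note that iteration i's addition needs the value of r produced by iteration i-1's multiplication, so the two stages cannot be issued only one step apart:

```c
#include <stdio.h>

int main(void) {
    double v1[4] = {1, 2, 3, 4};
    double v2[4] = {2, 2, 2, 2};
    double r = 0.0;

    for (int i = 0; i < 4; ++i) {
        r = r + v1[i];   /* "add" stage */
        r = r * v2[i];   /* "multiply" stage; its result feeds the next iteration's add */
    }
    printf("r = %f\n", r);
    return 0;
}
```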

Foster checklist

After completing this step, check the partitioning against the following list (the later checklists are used in the same way):

  • The number of primitive tasks is at least an order of magnitude greater than the number of processors
  • Redundant computation and redundant data-structure storage are minimized
  • The primitive tasks are roughly the same size
  • The number of tasks is an increasing function of the problem size

3.2.2 Communication

  • Local communication: channels between a task and a small number of other tasks
  • Global communication: channels involving many or all tasks

Generally speaking, it is not very useful to plan out the communication channels between tasks in detail at the algorithm design stage.

Foster checklist

  • Communication operations are balanced among the tasks
  • Each task communicates with only a small number of neighbors
  • Tasks can perform their communications concurrently
  • Tasks can perform their computations concurrently

3.2.3 Agglomeration

Agglomeration is the process of combining tasks into larger tasks in order to improve performance or simplify programming.

In message-passing (MP) programs, there is generally one agglomerated task per processor.

Goals

  • Reduce communication overhead

    • Agglomerate primitive tasks that communicate with each other, so that the communication between them is eliminated entirely

    • Combine groups of sending tasks with groups of receiving tasks: sending a few long messages takes less time than sending many short messages of the same total length (see the sketch after this list)


  • Maintain the scalability of the parallel design

    • For example, for a problem on an 8×128×256 array, it is short-sighted to agglomerate the second and third dimensions, because the parallel algorithm then cannot be ported to a system with more than 8 CPUs.
  • Reduce software engineering costs

    • If we are parallelizing a serial program, agglomeration lets us reuse more of the existing serial code, saving development time and cost
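
A rough MPI sketch of the message-combining point above, with illustrative message counts and sizes (run with at least two processes): rank 0 first sends N one-element messages, then the same data as one N-element message, and reports the elapsed time of each approach:

```c
#include <stdio.h>
#include <mpi.h>

#define N 10000   /* illustrative number of elements */

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double buf[N];
    for (int i = 0; i < N; ++i) buf[i] = i;

    if (rank == 0) {
        double t0 = MPI_Wtime();
        for (int i = 0; i < N; ++i)                 /* N short messages */
            MPI_Send(&buf[i], 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        MPI_Send(buf, N, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);  /* one long message */
        double t2 = MPI_Wtime();
        printf("many short: %.6f s, one long: %.6f s\n", t1 - t0, t2 - t1);
    } else if (rank == 1) {
        for (int i = 0; i < N; ++i)
            MPI_Recv(&buf[i], 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        MPI_Recv(buf, N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```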

Foster checklist

  • Agglomeration has increased the locality of the parallel algorithm
  • Replicated computations take less time than the communications they replace
  • The amount of replicated data is small enough that the algorithm remains scalable
  • The agglomerated tasks have similar computational and communication costs
  • The number of tasks is an increasing function of the problem size
  • The number of tasks is as small as possible, yet at least as large as the number of processors in the target computer
  • The benefit of agglomeration is weighed reasonably against the cost of modifying the existing serial code

3.2.4 Mapping

Mapping is the process of assigning tasks to processors. For a centralized multiprocessor system the operating system handles this automatically, so here we assume the target system is a distributed-memory parallel computer.

Goals

  • Maximize processor utilization
    • Processor utilization: the average percentage of execution time spent solving the problem
  • Minimize communication between processors
    • Try to map tasks connected by channels to the same processor
  • You can't have both, so a reasonable compromise must be found (but finding the optimal mapping is an NP-hard problem, so in practice this means going by feel...)

Mapping strategies when the number of tasks is fixed

The mapping strategy should fit the situation at hand, balancing processor utilization against communication cost.

  • For structured communication patterns
    • If each task does a fixed amount of computation:
      • Group tasks to reduce communication
      • Create one agglomerated task per processor
    • If the amount of computation varies from task to task:
      • Map tasks to processors cyclically
  • For unstructured communication
    • Use a static load-balancing algorithm

When the number of tasks changes dynamically

  • If there is frequent communication between tasks:
    • Use a dynamic load-balancing algorithm
  • If there are many small tasks with short execution times:
    • Use a run-time task-scheduling algorithm

Common task scheduling algorithms when the number of tasks changes dynamically

  • Centralized algorithm
    • There is a manager processor: an idle worker processor requests a task from the manager, the manager responds with one, and after the worker finishes it returns the result and requests another task (see the sketch after this list)
  • Distributed algorithm
    • Push: a processor with too many pending tasks hands some of them to neighboring processors
    • Pull: a processor with no task requests one from a neighboring processor
  • Hybrid algorithms combining the two
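
A sketch of the centralized manager/worker scheme in MPI (the tags, task count, and the "work" itself are illustrative): rank 0 hands out tasks on request until none remain, then tells each worker to stop.

```c
#include <stdio.h>
#include <mpi.h>

#define NTASKS   20   /* illustrative number of tasks */
#define TAG_WORK 1
#define TAG_STOP 2

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       /* manager */
        int next = 0, done = 0;
        MPI_Status st;
        while (done < size - 1) {
            int result;
            /* a worker reports a result (or a dummy value on first contact) ... */
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (next < NTASKS) {           /* ... and is handed the next task */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                ++next;
            } else {                       /* no tasks left: tell it to stop */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                ++done;
            }
        }
    } else {                               /* worker */
        int task = -1, result = 0;
        MPI_Status st;
        MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            result = task * task;          /* stand-in for real work */
            MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```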

Summary

(Figure: summary decision tree for choosing a mapping strategy)

Foster checklist

  • Have both designs been considered: one task per processor and multiple tasks per processor?
  • Have static and dynamic allocation of tasks to processors both been evaluated?
    • If dynamic: is the manager a performance bottleneck?
    • If static: is the ratio of tasks to processors at least 10:1?


Source: blog.csdn.net/Kaiser_syndrom/article/details/105185317