Parallel Algorithm Performance Metrics - Speedup Performance Models (2-2)

I. Basic concepts and speedup

1. Basic concepts

1) Processor-time product
The product of the number of processors and the processing time, used to measure how well processor resources are utilized during a run.
If a program runs on P processors for time Tp, then the maximum amount of work the P processors can complete within the time interval Tp is Tp * P.
The effective work actually completed by the processors can be expressed as the integral of the parallelism curve over time.
The efficiency is the ratio of the effective workload to the maximum workload.
2) Degree of parallelism (DOP)
The degree of parallelism is the number of processors used to execute a program within a given time interval.
3) Parallelism profile
The parallelism profile is the plot of the degree of parallelism versus time during the execution of a given program.
The product of the degree of parallelism and the corresponding time interval is the work completed, i.e., the processor workload.
A parallelism profile is shown in the figure below.
[Figure: parallelism profile]
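As a rough illustration (my own sketch, not from the original post; the profile values are made up), the workload and average parallelism can be computed from a discretized parallelism profile:

```python
# Sketch: workload and efficiency from a discretized parallelism profile.
# The (duration, DOP) pairs below are hypothetical sample data.
profile = [(2.0, 1), (3.0, 4), (4.0, 8), (1.0, 2)]  # (time interval, degree of parallelism)

total_time = sum(t for t, _ in profile)
work = sum(t * dop for t, dop in profile)        # integral of DOP over time = workload
max_dop = max(dop for _, dop in profile)
max_work = total_time * max_dop                  # the most that max_dop processors could do
print("workload:", work)
print("average parallelism:", round(work / total_time, 2))
print("efficiency:", round(work / max_work, 2))  # effective workload / maximum workload
```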

2. Speedup

1) Absolute speedup
The best serial algorithm is compared with the parallel algorithm. (Here "best" is not absolute: sometimes the best means the fastest algorithm, sometimes the one that gives the best-quality solution, and sometimes a combination of the two, so it depends on the specific problem.)
Definition 1 (machine-dependent): the ratio of the run time of the best serial algorithm on one machine to the run time of the parallel algorithm on N machines.
Definition 2 (machine-independent): the ratio of the execution time of the best serial algorithm on the fastest sequential machine to the run time of the parallel algorithm on the parallel machine.
S = Ts / Tp
(The numerator Ts is the time on the serial machine; the denominator Tp is the time on the parallel machine.)
2) Relative speedup
The ratio of the run time of an algorithm on a single node to its run time on a parallel system composed of multiple identical nodes.
This definition focuses on describing the scalability of the algorithm and of the parallel computer itself.
S = T1 / TN
From this expression we can see that as the number of processors N increases, the speedup S also increases. If the increase is linear, it is called linear speedup; if S grows faster than linearly, it is called superlinear speedup; if the growth rate of S gradually decreases, it is called pathological speedup.
Linear speedup: small intermediate overhead, little communication, weakly coupled computation.
Superlinear speedup: may occur when an application requires a large amount of memory.
Pathological speedup: the speedup may even decrease when the amount of computation is too small.
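As a quick illustration (my own snippet, not from the post; the timings are invented), relative speedup is simply the single-node time divided by the N-node time:

```python
# Sketch: relative speedup and efficiency from measured run times (hypothetical values).
t1 = 120.0                          # run time on a single node, in seconds
tn = {2: 61.0, 4: 32.0, 8: 17.5}    # run times on N identical nodes

for n, t in sorted(tn.items()):
    s = t1 / t                      # relative speedup S = T1 / TN
    print(f"N={n}: speedup={s:.2f}, efficiency={s / n:.2f}")
```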

II. Speedup performance models (three kinds)

1. Fixed-load speedup model - Amdahl's Law

In many real-time applications, the computational load is often fixed. On a parallel machine, this load can be distributed across multiple processors for parallel execution; the resulting speedup is called the fixed-load speedup. The load of a problem can be expressed as:
W = Ws + Wp
where Ws is the serial (non-parallelizable) part of the load and Wp is the parallelizable part.
With n processors, the speedup can be expressed as:

Sn = (Ws + Wp) / (Ws + Wp/n)
Let the serial fraction α be the proportion of the total load taken by the serial part, that is:
α = Ws / (Ws + Wp)
Substituting, we obtain Amdahl's Law:
Sn = n / (1 + (n - 1)·α)
No matter how many processors are used, the best speedup that can be achieved is:
S∞ = 1/α (the limit of Sn as n → ∞)
The efficiency En can be expressed as:
En = Sn / n = 1 / (1 + (n - 1)·α)
The larger the number of processors n, the lower the efficiency En.
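A minimal sketch of the formulas above (my own code, not from the post):

```python
# Amdahl's Law: fixed-load speedup and efficiency for a serial fraction alpha.
def amdahl_speedup(n: int, alpha: float) -> float:
    """Sn = n / (1 + (n - 1) * alpha)."""
    return n / (1 + (n - 1) * alpha)

def amdahl_efficiency(n: int, alpha: float) -> float:
    """En = Sn / n = 1 / (1 + (n - 1) * alpha)."""
    return amdahl_speedup(n, alpha) / n

alpha = 0.05                     # example serial fraction
for n in (4, 16, 64, 256, 1024):
    print(f"n={n}: Sn={amdahl_speedup(n, alpha):.2f}, En={amdahl_efficiency(n, alpha):.3f}")
# No matter how large n becomes, Sn never exceeds 1/alpha (= 20 here),
# and En falls toward zero as n grows.
```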
Amdahl's Law tells us: the overall performance improvement obtained by speeding up one component of a system in some way is limited by the fraction of the total execution time (or the frequency of use) that this component accounts for.
Two factors determine the speedup:
1) The fraction of the total task execution time that can be improved, i.e., (time taken by the improvable part) / (total execution time of the task before improvement), denoted Fe; it is always less than 1.
2) The factor by which the improved part runs faster than before, i.e., (execution time of that part before improvement) / (execution time of that part after improvement), denoted Se.
S = 1 / ((1 - Fe) + Fe/Se)
Example 1:
Suppose one component of a system is sped up by a factor of 10, but that component originally accounted for only 40% of the running time. How much is the overall system performance improved?
Solution: Fe = 0.4, Se = 10,
S = 1 / ((1 - 0.4) + 0.4/10) = 1 / 0.64 ≈ 1.56
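The same calculation written out as a quick check (my own snippet):

```python
# Example 1 check: generalized Amdahl formula S = 1 / ((1 - Fe) + Fe / Se).
Fe, Se = 0.4, 10
S = 1 / ((1 - Fe) + Fe / Se)
print(round(S, 2))  # ~1.56: the whole system runs about 1.56 times faster
```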
Amdahl's Law is also a fixed-size speedup model: the problem size does not change as the number of processors changes. With the problem size fixed, we look at the minimum running time that can be achieved using parallel techniques.
Under the fixed-load speedup model, the load and the execution time vary with the number of processors n in the system as shown below:
[Figure: load and execution time under the fixed-load speedup model]
(No matter how many processors there are, the serial and parallel loads are fixed. As the number of processors increases, the time spent on the parallel part decreases, so the total time decreases as processors are added.)
When the number of processors is n = 1024, the speedup Sn varies with α as follows:
[Table: Sn at n = 1024 for different values of α]
Plotted as a graph:
[Figure: Sn at n = 1024 versus α]
The impact of different values of α on the speedup can be compared:
[Figure: speedup curves for different values of α]
(The top red curve is α = 0, the red curve below it is α = 0.01, the blue curve is α = 0.1, and the blue curve below that is α = 0.9.)
When α = 0 the ideal speedup is obtained; as α increases, the speedup drops sharply.
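The numbers behind these curves can be reproduced directly from Amdahl's formula (my own snippet):

```python
# Fixed-load speedup at n = 1024 for the alpha values mentioned above.
n = 1024
for alpha in (0.0, 0.01, 0.1, 0.9):
    s = n / (1 + (n - 1) * alpha)   # Amdahl's Law
    print(f"alpha={alpha}: Sn={s:.1f}")
# alpha=0 gives the ideal 1024; at alpha=0.01 the speedup already drops to about 91,
# and at alpha=0.9 it is barely above 1.
```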
**Conclusion:** the speedup drops sharply as α rises. Because of the serial part Ws, the speedup cannot be raised simply by increasing the number of processors in the system. Over the past two decades this property left people with a very pessimistic impression of parallel processing.
Impact: two viewpoints:
1) It discouraged manufacturers from building massively parallel computers. (Of course, reality turned out otherwise; the number of processors has kept increasing.)
2) It motivated parallelizing compilers that decrease the value of α and thereby improve system performance.
Possible application range of the fixed-load speedup model:
time-critical applications with strict timing requirements.

2. Fixed-time speedup model - Gustafson's Law

Many application areas emphasize accuracy rather than running time. In 1988, Gustafson proposed the fixed-time speedup model: when the size of the machine is scaled up, the size of the problem is scaled up with it, yielding a more accurate solution while the running time stays the same.
For example, structural analysis with the finite element method and weather forecasting with computational fluid dynamics, which solve PDEs (systems of partial differential equations), both require higher accuracy.
A coarse grid requires less computation, while a fine grid requires more computation but gives higher accuracy. In a weather-forecast simulation solving a four-dimensional PDE, if the grid spacing in each spatial direction (X, Y, Z) is reduced to one tenth of its original value and the number of time steps is increased by the same factor, the number of grid points increases by a factor of 10^4, so the workload increases by at least a factor of 10,000.

Background of the model:
The fixed-load model has a drawback: in Amdahl's Law, α depends on the efficiency of the parallel compiler, so it cannot describe the inherent characteristics of the system itself.
The speedup formula is:
Sn' = (Ws' + Wp') / (Ws' + Wp'/n) = (Ws + n·Wp) / (Ws + Wp)
where Wp' = n·Wp, and Ws + Wp = Ws' + Wp'/n is the fixed-time condition: Ws' + Wp'/n is the average load (execution time) after the load has been scaled up along with the number of processors, and it must equal the average load (execution time) Ws + Wp when the load is not scaled, i.e., Ws + Wp = Ws' + Wp'/n. At the same time, the serial part of the load does not change, i.e., Ws = Ws'.
Under the fixed-time speedup model, the load and the execution time vary with the number of processors n in the system as shown below:
[Figure: load and execution time under the fixed-time speedup model]
The problem size is increased so that all processors stay busy; as the problem is scaled up to match the available computing power, the serial part of the program is no longer the bottleneck.
When the number of processors is n = 1024, the speedup Sn' varies with α as follows:
[Table: Sn' at n = 1024 for different values of α]
[Figure: fixed-time speedup versus α]
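A small sketch of the fixed-time speedup, Sn' = n - α·(n - 1), compared with the fixed-load value at n = 1024 (my own code; the α values chosen here are just examples):

```python
# Fixed-time (Gustafson) speedup versus fixed-load (Amdahl) speedup.
def gustafson_speedup(n: int, alpha: float) -> float:
    """Sn' = (Ws + n*Wp) / (Ws + Wp) = n - alpha * (n - 1)."""
    return n - alpha * (n - 1)

n = 1024
for alpha in (0.01, 0.1, 0.9):
    fixed_time = gustafson_speedup(n, alpha)
    fixed_load = n / (1 + (n - 1) * alpha)    # Amdahl's Law, for comparison
    print(f"alpha={alpha}: fixed-time={fixed_time:.1f}, fixed-load={fixed_load:.1f}")
# With the workload scaled to keep the run time fixed, the speedup stays close to n
# even for fairly large alpha, instead of collapsing toward 1/alpha.
```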

3. Memory-bounded speedup model

Proposed by Sun and Ni in 1993.
Large-scale scientific computing and engineering design require large amounts of memory; many application problems are memory-bound rather than CPU-bound or I/O-bound.
For example, this situation is common in distributed-memory systems: the total memory capacity grows linearly with the number of nodes, and many nodes are combined to solve a single large problem.
Basic idea: solve as large a problem as the limited memory allows. This likewise requires scaling up the workload in order to obtain higher speedup, higher accuracy, and better resource utilization.

The speedup can be expressed as:
Sn* = (Ws + G(n)·Wp) / (Ws + G(n)·Wp/n)
where:
the workload executed sequentially on a single processor is independent of the problem size and of the system size, i.e., the serial workload Ws stays fixed;
G(n) reflects the factor by which the parallel workload increases when the memory capacity is increased n times.
Discussion:
1. G(n) = 1: the fixed-load case;
2. G(n) = n: an n-fold increase in memory increases the load n times, which is the fixed-time case;
3. G(n) > n: the computational load increases faster than the memory, giving an even higher speedup.
Comparing the three speedups for the same number of processors, we have:
Sn* ≥ Sn' ≥ Sn
(Sun-Ni model ≥ Gustafson model ≥ Amdahl model)
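A sketch comparing the three models for the same serial fraction and machine size (my own code; the normalization Ws + Wp = 1 is an assumption for illustration):

```python
# Memory-bounded (Sun-Ni) speedup: Sn* = (Ws + G(n)*Wp) / (Ws + G(n)*Wp / n).
# G(n) = 1 reduces to the fixed-load (Amdahl) case, G(n) = n to the fixed-time
# (Gustafson) case, and G(n) > n gives an even higher speedup.
def memory_bounded_speedup(n: int, alpha: float, g: float) -> float:
    ws, wp = alpha, 1.0 - alpha          # normalize the load so that Ws + Wp = 1
    return (ws + g * wp) / (ws + g * wp / n)

n, alpha = 1024, 0.1
for label, g in (("G(n)=1", 1.0), ("G(n)=n", float(n)), ("G(n)=n^1.5", n ** 1.5)):
    print(f"{label}: S={memory_bounded_speedup(n, alpha, g):.1f}")
# The printed speedups increase down the list, matching Sn* >= Sn' >= Sn.
```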

Under the memory-bounded speedup model, the load and the execution time vary with the number of processors n in the system as shown below:
[Figure: load and execution time under the memory-bounded speedup model]
Example: matrix multiplication A * B = C, where A, B, and C are n * n matrices. Computing each element of C requires n multiplications and n additions, so the total amount of computation is (n + n)·n^2 = 2n^3. The storage required is 3n^2 (two operand matrices and one result matrix). Now build a multicomputer system out of n machines, so the storage capacity is expanded n times; the dimension of the matrices (originally n) can then be increased, say to N. What speedup is obtained? The storage capacity becomes n·M = n·3n^2 = 3n^3; the storage required for dimension N is 3N^2, and the amount of computation becomes 2N^3. Therefore:
3N^2 = 3n^3, so N = n^1.5
G(n) = 2N^3 / 2n^3 = n^1.5
Sn* = (Ws + n^1.5·Wp) / (Ws + n^0.5·Wp)
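A quick numeric check of this derivation (my own snippet; the base dimension is arbitrary):

```python
# Matrix multiplication under memory-bounded scaling: memory ~ 3*n^2, work ~ 2*n^3.
# Scaling memory by n allows a dimension N with 3*N^2 = n * 3*n^2, i.e. N = n**1.5.
n = 64                                   # hypothetical base dimension / number of machines
N = n ** 1.5
G = (2 * N ** 3) / (2 * n ** 3)          # workload growth factor
print(f"N = {N:.0f}, G(n) = {G:.0f}, n^1.5 = {n ** 1.5:.0f}")  # G(n) equals n^1.5
```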

4. Applications of the parallel computing models

As the machine size increases, the workload grows according to the patterns shown below:
[Figure: workload growth patterns versus machine size]
In the figure above, using the formula given by the memory-bounded speedup model:
the curve θ corresponds to G(n) = n^1.5 (closest to the y-axis);
the curve γ corresponds to G(n) = n (red);
the curve β corresponds to G(n) = 0.5n (blue);
the curve α corresponds to G(n) = 1 (closest to the x-axis).
From the speedup formula, for a given program with Ws/Wp = 0.4, the efficiency is:
[Formula/figure: efficiency for Ws/Wp = 0.4]
The corresponding curves of efficiency versus the number of processors are shown below:
[Figure: efficiency versus number of processors]
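As a rough illustration only (my own code, assuming the fixed-load case with Ws/Wp = 0.4 and no extra overhead term, which the original figure may treat differently), the efficiency falls quickly as n grows:

```python
# Rough illustration: efficiency under a fixed workload with Ws/Wp = 0.4
# (no additional overhead term is modelled here).
ws, wp = 0.4, 1.0
for n in (1, 2, 4, 16, 64, 256, 1024):
    speedup = (ws + wp) / (ws + wp / n)
    print(f"n={n}: E={speedup / n:.3f}")
# Unless the problem size grows with the machine, efficiency quickly becomes very low.
```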
Conclusions:
1. If the workload (problem size) stays unchanged, the efficiency E declines rapidly as the machine size increases, because the overhead h grows faster than the machine size. To keep the efficiency at a given level, the machine size and the problem size can be scaled up together.
2. If the workload follows the exponential growth pattern, then maintaining constant efficiency or good speedup requires the problem size to grow explosively, which would exceed the memory or I/O limits; the problem size is only allowed to grow within the limits of the computer's memory.
The applications of the parallel computing models are shown below:
[Figures: application ranges of the speedup models]

Origin blog.csdn.net/qq_44762986/article/details/104757661