Gang Scheduling Performance Benefits for Fine-Grain Synchronization

Topic

  • What is gang scheduling? Group scheduling: either all threads of a group execute together, or none of them do.
  • What is fine-grain synchronization? Synchronization in which threads interact very frequently, e.g., every 10-100 instructions of computation.

Summary

  • Gang scheduling: a set of threads is scheduled to execute simultaneously on a set of processors.
  • It allows the threads to interact efficiently by using busy waiting, without the risk of waiting for a thread that is not currently running.
  • Otherwise, synchronizing by blocking causes context switches and a great deal of overhead.
  • The authors develop a model to evaluate the performance of different combinations of synchronization mechanisms and scheduling policies.
  • Gang scheduling is more efficient for applications with fine-grain synchronization requirements.

INTRODUCTION

  • On a multiprogrammed multiprocessor, threads run on "virtual processors"; the mapping of virtual to physical processors is invisible to the running program and is managed entirely by system software through mapping and scheduling.
  • Thread interaction in parallel applications takes the form of thread synchronization.
  • Explicit synchronization: e.g., barrier synchronization and mutual-exclusion synchronization.
  • Implicit synchronization: e.g., data dependencies between producers and consumers.

  • In the case of fine-grain interactions, it is best for the threads to execute simultaneously on distinct processors and to coordinate their activities with busy waiting.
  • With coarse-grained synchronization, if the variance of execution times is large, it is best to synchronize by blocking: the processor is released and can be used by other threads.

Related work

This paper is the first to analyze the performance of gang scheduling, and the interplay between synchronization and scheduling.

  • This allows us to identify the situations in which gang scheduling
    should be used, namely when the application is based on fine-grain synchronization.

  • It is also the first to report
    experiments based on a real implementation of gang scheduling on a multiprocessor.

  • Gang scheduling may cause an effect reminiscent of fragmentation, if the gang sizes do not fit the number of available processors.

  • In this paper we
    show that for fine-grain computations gang scheduling
    can more than double the processing capability.

  • Gang scheduling refers to the simultaneous scheduling of a group of threads to a group of processors using a one-to-one mapping.

Question: what exactly happens during the synchronization step? If this is pinned down, many of the later questions resolve themselves.
Assumption 1: during synchronization, all participating threads must be executing at the same time, i.e., all of them are on processors simultaneously so that they can interact.
Assumption 2: during synchronization, the threads need not all be executing at the same time; a thread only needs to post a message to a message queue or a shared-memory location. Under this assumption, with blocking and uncoordinated scheduling, the last thread to finish its computation can carry out the synchronization itself, ending the round and proceeding directly to the next one; the other threads, when reawakened, complete any unfinished synchronization work and continue with their next round.
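To make Assumption 2 concrete, here is a minimal sketch (my own illustration, not the paper's implementation) of a counting barrier kept in shared state: the last thread to arrive completes the synchronization itself and proceeds immediately, while earlier arrivals block and simply continue with the next round once reawakened.

```python
import threading

class CountingBarrier:
    """Barrier in the spirit of Assumption 2: the state lives in shared
    memory, so participants need not execute simultaneously. The last
    arrival finishes the synchronization and continues at once."""
    def __init__(self, n):
        self.n = n
        self.count = 0
        self.generation = 0
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            gen = self.generation
            self.count += 1
            if self.count == self.n:       # last arrival: complete the
                self.count = 0             # synchronization itself,
                self.generation += 1
                self.cond.notify_all()     # wake the blocked threads,
                return                     # and proceed immediately
            while gen == self.generation:  # earlier arrivals block; when
                self.cond.wait()           # reawakened, the round is done

def worker(barrier, iterations):
    for _ in range(iterations):
        pass              # the computation part (t_p) would go here
        barrier.wait()    # the synchronization part

if __name__ == "__main__":
    n = 4
    b = CountingBarrier(n)
    ts = [threading.Thread(target=worker, args=(b, 3)) for _ in range(n)]
    for t in ts: t.start()
    for t in ts: t.join()
    print("all rounds complete")
```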

2 Synchronization And Scheduling

  • busy waiting: its performance depends heavily on the system's scheduling policy

  • blocking

  • two-phase: first busy-wait for a bounded period, then block (see the sketch after the list of combinations below). When used with gang scheduling in fine-grained situations, essentially only the busy-waiting phase is exercised; with other scheduling policies it almost always ends up blocking, and then costs slightly more than plain blocking.

  • gang scheduling: all interacting threads run simultaneously, with the number of threads in a gang limited (at most one per processor)

  • uncoordinated scheduling: the threads on each processor are scheduled independently

The combination of gang scheduling and blocking is pointless: even if a blocking thread context-switches and gives up its processor, the processor cannot be handed to a thread of another gang.
So there are three combinations:

  • busy waiting with gang scheduling
  • busy waiting with uncoordinated scheduling
  • blocking with uncoordinated scheduling
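As a concrete illustration of the three waiting mechanisms, a schematic sketch (the function names are my own, and a real runtime would use hardware atomics and scheduler hooks rather than Python's threading primitives):

```python
import threading, time

flag = threading.Event()   # the condition a thread is waiting for

def busy_wait():
    # Spin on the condition: cheap if the setter is running simultaneously
    # (as under gang scheduling), wasteful if it is not scheduled right now.
    while not flag.is_set():
        pass

def blocking_wait():
    # Give up the processor until the condition holds; this costs a
    # context switch but frees the CPU for threads of other applications.
    flag.wait()

def two_phase_wait(spin_limit=1e-4):
    # Spin for a bounded time, then fall back to blocking. Under gang
    # scheduling with fine grain, the first phase almost always succeeds.
    deadline = time.perf_counter() + spin_limit
    while time.perf_counter() < deadline:
        if flag.is_set():
            return
    flag.wait()

if __name__ == "__main__":
    threading.Timer(0.01, flag.set).start()  # someone sets the flag later
    two_phase_wait()
    print("condition reached")
```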

2.1 Model and Assumptions

[Figure: the iterative thread execution model]

  • Threads perform iterative computations
  • In iteration $j$, thread $i$ computes for time $t_p^{ij}$ and then performs a synchronization operation
  • The synchronization itself also consumes some time; since every thread must execute it, this overhead is folded directly into $t_p^{ij}$
  • Each iteration of a thread therefore consists of a processing time and a waiting time

Average processing time:

$t_p = \frac{1}{nk}\sum_{i=1}^n\sum_{j=1}^k t_p^{ij}$

The longest processing time in iteration $j$:

$t_p^{max,j} = \max_{1\leq i\leq n} t_p^{ij}$

The waiting time of thread $i$ at iteration $j$ is:

$t_w^{ij} = t_p^{max,j} - t_p^{ij}$

Average waiting time:

$t_w = \frac{1}{nk}\sum_{i=1}^n \sum_{j=1}^k t_w^{ij}$

Note: $t_p + t_w = \frac{1}{k}\sum_{j=1}^k t_p^{max,j}$
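A quick numeric sanity check of these definitions (my own sketch with arbitrary random values):

```python
import random

n, k = 4, 1000                                    # threads, iterations
tp = [[random.uniform(1.0, 2.0) for _ in range(k)] for _ in range(n)]

t_p = sum(map(sum, tp)) / (n * k)                 # average processing time
tp_max = [max(tp[i][j] for i in range(n)) for j in range(k)]
t_w = sum(tp_max[j] - tp[i][j]
          for i in range(n) for j in range(k)) / (n * k)

# The note above: t_p + t_w equals the mean of the per-iteration maxima.
assert abs((t_p + t_w) - sum(tp_max) / k) < 1e-9
print(t_p, t_w)
```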

Since the overhead of synchronization may depend on the number of processors, the value of $t_p$ may also need to be adjusted accordingly.

System Characterization

In a multiprogrammed multiprocessor, multiple threads are mapped to each processor, and a time-slicing policy gives all threads equal service.
Assumptions:

  • The number of threads on each processor is denoted by $l$
  • We focus on a single gang of $n$ threads, mapped to $n$ distinct processors; the remaining $n(l-1)$ threads are not modeled in detail

For the blocking strategy, performance depends strongly on the behavior of the other gangs' threads. Two extreme cases are considered:

  1. All threads behave the same, performing iterative computations
  2. The remaining $l-1$ gangs of threads are independently scheduled, purely computational threads

The scheduling time slice is denoted $\tau_q$.
The overhead of a context switch is denoted $\tau_{cs}$.
The overhead of blocking is $\alpha$ times the context-switch overhead, where $\alpha > 1$.

In the coarse-grained case:
$t_p$ can be much larger than $\tau_q$; many time slices may elapse before the synchronization step is reached.

In the fine-grained case:
$t_p$ may be much smaller than $\tau_q$; a single time slice $\tau_q$ may execute $10^4 - 10^5$ instructions, while an interaction occurs every 10-100 instructions.

It is assumed that $k(t_p + t_w) \gg \tau_q$; in the fine-grained case this means $k$ must be large enough.
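Plugging in the paper's orders of magnitude (representative values I chose within the stated ranges, not measurements):

```python
tau_q_instr = 10_000   # instructions per time slice (low end of 10^4 - 10^5)
grain_instr = 100      # instructions between interactions (high end of 10-100)

# Even in the least fine-grained combination, a single time slice spans
# on the order of a hundred iterations, so k(t_p + t_w) >> tau_q holds.
print(tau_q_instr // grain_instr)   # -> 100
```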

2.2 Performance Derivation

Busy Waiting with Gang Scheduling

The time to perform k iterations is:

$T = \left(k(t_p + t_w) + \frac{k(t_p + t_w)}{\tau_q}\tau_{cs}\right) l$

The factor $l$ appears because each processor hosts $l$ threads under time slicing and must give the other threads equal service.

The average time required for each iteration is:

$t = \left(1 + \frac{\tau_{cs}}{\tau_q}\right)(t_p + t_w)\, l$

Since $t_p + t_w = \frac{1}{k}\sum_{j=1}^k t_p^{max,j}$, the execution rate is dictated by the slowest thread in each iteration.
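As a sketch, the per-iteration formula can be evaluated directly (t_gang is my own helper name; parameter values are arbitrary):

```python
def t_gang(t_p, t_w, tau_q, tau_cs, l):
    """Average time per iteration: busy waiting with gang scheduling."""
    return (1 + tau_cs / tau_q) * (t_p + t_w) * l

# Fine grain: t_p + t_w is well below the quantum tau_q.
print(t_gang(t_p=0.3, t_w=0.1, tau_q=100.0, tau_cs=1.0, l=3))
# -> (1 + 0.01) * 0.4 * 3 = 1.212
```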

Busy Waiting with Uncoordinated Scheduling

The behavior of busy waiting under uncoordinated scheduling depends on the granularity.

Coarse-grained case:
In the coarse-grained case, busy waiting with uncoordinated scheduling behaves almost the same as with gang scheduling.
The threads of a gang need not execute at the same time, but since $t_p + t_w$ is much larger than $\tau_q$, each iteration spans several time slices anyway.
Computing $t$ gives the same result as before:

$t = \left(1 + \frac{\tau_{cs}}{\tau_q}\right)(t_p + t_w)\, l$

Question: the synchronization operation cannot complete until all threads have passed through it, right? So is $t_w$ computed differently from before?

In the fine-grained case:
First note that, even without intervention, there is some probability that all $n$ threads happen to be scheduled at the same time; when this happens, many iterations are completed.

If the scheduling cycle has period $\Lambda$ and each of the $n$ threads runs for a segment of length $\lambda$ per cycle, the expected time per cycle during which all $n$ segments overlap is $\frac{\lambda^n}{\Lambda^{n-1}}$.

In the fine-grained case:

  • $\lambda = \tau_q$
  • $\Lambda = l(\tau_q + \tau_{cs})$ is the scheduling cycle
  • The thread overlap time in each scheduling cycle is $\frac{\tau_q^n}{l^{n-1}(\tau_q + \tau_{cs})^{n-1}}$
  • The number of iterations completed in each scheduling cycle is $\frac{\tau_q^n}{l^{n-1}(\tau_q + \tau_{cs})^{n-1}(t_p + t_w)}$

The number of scheduling cycles required to execute $k$ iterations is $m = \frac{k\, l^{n-1}(\tau_q + \tau_{cs})^{n-1}(t_p + t_w)}{\tau_q^n}$.

The total execution time is:

$T = m\Lambda = \frac{k\, l^n (\tau_q + \tau_{cs})^n (t_p + t_w)}{\tau_q^n}$

The average time per iteration (Eq. (8), referenced below) is:

$t = \left(1 + \frac{\tau_{cs}}{\tau_q}\right)^n l^n (t_p + t_w)$

If we assume $\tau_q \gg \tau_{cs}$, then uncoordinated scheduling is slower than gang scheduling by a factor of $l^{n-1}$.

In fact, it is assumed that each scheduling cycle completes at least one iteration, so:
$t = \min\left(\text{Eq. (8)},\ (\tau_q + \tau_{cs})\, l\right)$

Question: why should each scheduling cycle complete at least one iteration? The paper presents it as an assumption, without further explanation.
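Running the same numbers through the uncoordinated estimate, including the one-iteration-per-cycle cap (my own sketch, continuing the example above):

```python
def t_uncoordinated(t_p, t_w, tau_q, tau_cs, l, n):
    """Per-iteration time: busy waiting with uncoordinated scheduling,
    fine grain, capped at one iteration per scheduling cycle."""
    t_overlap = (1 + tau_cs / tau_q) ** n * l ** n * (t_p + t_w)  # Eq. (8)
    cycle = (tau_q + tau_cs) * l       # at least one iteration per cycle
    return min(t_overlap, cycle)

print(t_uncoordinated(t_p=0.3, t_w=0.1, tau_q=100.0, tau_cs=1.0, l=3, n=4))
# -> ~33.7, roughly l**(n-1) = 27 times the gang-scheduled 1.212
```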

Blocking Mechanism

Blocking frees up the CPU, so it has an altruistic character.
Consider two cases:

  1. All threads behave the same: they perform iterative computation and synchronization, blocking when they wait
  2. Competing threads are independently scheduled and purely computational, so they never block

All threads behave the same, coarse-grained
Since the threads are independently scheduled, at any moment only a fraction $\frac{t_p}{t_p + t_w}$ of them are active.
In the coarse-grained case, most context switches are due to time-slice expiration rather than blocking, so the extra overhead caused by blocking is negligible.

So the total time to run is:

$T = k\, t_p \left(1 + \frac{\tau_{cs}}{\tau_q}\right) l$

In effect, the waiting time is saved.
The time required for each iteration is:

$t = \left(1 + \frac{\tau_{cs}}{\tau_q}\right) t_p\, l$

Other threads behave differently, coarse-grained

At any moment the number of active threads per processor is essentially $l$, because the competitors never block, so the gang gains nothing from blocking:

$t \approx \left(1 + \frac{\tau_{cs}}{\tau_q}\right)(t_p + t_w)\, l$

All threads behave the same, fine-grained

Threads block frequently, and the effective scheduling quantum shrinks.
The last thread to finish an iteration can proceed directly to the next one without blocking; of the $n$ threads, only $n-1$ block.
The cost of blocking is $\alpha$ times a thread context switch.

So the time to run k iterations is

[equation image: total time for $k$ iterations under blocking, fine-grained]

The time required for each iteration is

[equation image: the per-iteration time]

Compared to busy wait:

  • In the coarse-grained case, the per-iteration time decreases
  • In the fine-grained case, the per-iteration time increases
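In the identical-threads, coarse-grained case, the two per-iteration formulas above differ only in the factor $(t_p + t_w)$ versus $t_p$, so the advantage of blocking is exactly the relative waiting time. A small numeric check (my own sketch):

```python
def t_busy(t_p, t_w, tau_q, tau_cs, l):
    return (1 + tau_cs / tau_q) * (t_p + t_w) * l   # busy waiting

def t_block_coarse(t_p, tau_q, tau_cs, l):
    return (1 + tau_cs / tau_q) * t_p * l           # waiting time saved

t_p, t_w = 500.0, 250.0            # coarse grain with high variance
bw = t_busy(t_p, t_w, tau_q=100.0, tau_cs=1.0, l=3)
blk = t_block_coarse(t_p, tau_q=100.0, tau_cs=1.0, l=3)
print(bw / blk)                    # -> 1 + t_w / t_p = 1.5
```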

Other threads behave differently, fine-grained

In each scheduling cycle, the gang completes only one iteration and then blocks, while the competing threads still execute their full $\tau_q$ time slices; this fully embodies the altruistic character of blocking.

$t \approx l(\tau_q + \tau_{cs})$ (one iteration per scheduling cycle)

At this point, round-robin rotation is unfair to the blocked threads.

3 Implementation And Experiments

Experimental Results

The test program runs several gangs; each gang has exactly one thread on each processor.

  • LOAD: the number of gangs competing for the processors
  • GRAIN: the granularity, i.e., the number of instructions in each iteration excluding the synchronization itself; if granularities differ, the smallest is used
  • VAR: the difference between the maximum and minimum instruction counts of the activities in each iteration. The instruction counts in the experiment follow a uniform distribution from GRAIN to GRAIN+VAR: the expected longest activity is GRAIN + (n/(n+1))·VAR, and the expected shortest is GRAIN + (1/(n+1))·VAR

In the experiment, each GRAIN unit (one instruction) takes 0.00141 ms, and one interaction takes 0.137 ms.

[equation image: $t_p$ as a function of GRAIN and the interaction time]

The waiting time $t_w$ is computed from VAR: with $n$ activities uniform on [GRAIN, GRAIN+VAR], $t_w = \left(\frac{n}{n+1} - \frac{1}{2}\right)\text{VAR} = \frac{n-1}{2(n+1)}\,\text{VAR}$ (in instructions).
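A quick simulation check of this $t_w$ expression (my own sketch; GRAIN and VAR are in instructions, converted at 0.00141 ms per instruction):

```python
import random

n, GRAIN, VAR = 8, 100, 1000
MS_PER_INSTR = 0.00141

# Analytic: E[max of n uniforms] minus the mean, per the formula above.
t_w_analytic = (n / (n + 1) - 0.5) * VAR * MS_PER_INSTR

# Simulated average waiting time over many iterations.
trials = 20_000
total = 0.0
for _ in range(trials):
    xs = [GRAIN + random.random() * VAR for _ in range(n)]
    m = max(xs)
    total += sum(m - x for x in xs) / n
t_w_sim = total / trials * MS_PER_INSTR

print(t_w_analytic, t_w_sim)   # the two agree closely
```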

The number of iterations was 30000 or 50000 in most experiments

To ensure that execution starts out uncoordinated, a helper thread is also launched at startup.

Experiment 1

[Figure: Experiment 1 results]

This experiment shows the additional overhead caused by blocking; when LOAD is 1, blocking achieves no gain.

Experiment 2

[Figure: Experiment 2 results]

With LOAD = 3, the figure shows the performance of busy waiting with gang scheduling versus blocking with uncoordinated scheduling.

At the smallest granularity, gang scheduling can achieve twice the blocking performance.

[Figure: Experiment 2, additional results]

Experiment 3

[Figure: Experiment 3 results]

Experiment 3 demonstrates the altruism of the blocking mechanism.

In one set of experiments, all threads behave the same.
In the other set, only one gang of threads blocks.

Experiment 4

[Figure: Experiment 4 results]

This experiment shows that busy waiting without gang scheduling is indeed not a viable option

4 Summary and discussion

  1. busy waiting with gang scheduling works best for fine-grained jobs
  2. Blocking releases system resources and reduces running time for coarse-grained jobs. For fine-grained jobs it depends on the competing load: if the competitors do not block first, the execution efficiency of the blocked threads suffers (altruism).
  3. Busy waiting with uncoordinated scheduling wastes resources, especially on fine-grained jobs; even on coarse-grained jobs a great deal of time is spent busy waiting.
  4. Two-phase blocking limits the waste but does not improve performance unless combined with gang scheduling.

When to use busy waiting with gang scheduling

The per-iteration execution time of busy waiting with gang scheduling is denoted $t_{BW}$:

$t_{BW} = \left(1 + \frac{\tau_{cs}}{\tau_q}\right)(t_p + t_w)\, l$

The per-iteration execution time using blocking is denoted $t_{BLK}$:

[equation image: $t_{BLK}$ for the identical-threads case]

This of course assumes that all threads behave the same.

Under what conditions is $\frac{t_{BW}}{t_{BLK}} \leq \sigma < 1$?

Substituting into the above formulas gives the boundary condition:

[equation image]

[Figure: the gray region in which busy waiting with gang scheduling dominates]

In the gray area, BW is at least $\frac{1}{\sigma}$ times faster than BLK.

Similarly, if $\frac{t_{BLK}}{t_{BW}} \leq \sigma < 1$, blocking is dominant.

[Figure: the region in which blocking dominates]

When all threads behave differently, Eq. (13) should be used as $t_{BLK}$:

$t_{BLK} = (\tau_q + \tau_{cs})\, l$ (one iteration per scheduling cycle)

Substituting this in gives:

$t_p + t_w \leq \sigma\, \tau_q$

In addition, busy waiting makes better use of caches, because the frequent context switching caused by blocking reduces the cache's effectiveness.

Hardware support for synchronization is also required: if the interaction itself takes too long, the application is easily pushed past the fine-grain boundary.

Origin blog.csdn.net/greatcoder/article/details/128028106