Gang Scheduling Performance Benefits for Fine-Grain Synchronization
Topic
- What is gang scheduling? Group scheduling: either all threads of a group execute simultaneously, or none of them execute.
- What is fine-grain synchronization? Synchronization that occurs very frequently, e.g. every few tens of instructions.
Summary
- Gang scheduling: a set of threads are scheduled to execute simultaneously on a set of processors.
- This allows the threads to interact efficiently by using busy waiting, without the risk of waiting for a thread that is not currently running.
- Otherwise, using blocking to achieve synchronization causes context switching and a lot of overhead.
- The authors develop a model to evaluate the performance of different combinations of synchronization mechanisms and scheduling policies; gang scheduling is more efficient under fine-grained synchronization requirements.
INTRODUCTION
- With multiprogrammed multiprocessors, programs run on "virtual processors"; the mapping of virtual to physical processors is invisible during program execution and is managed entirely by system software, which performs the mapping and scheduling.
- Thread interaction in parallel applications takes the form of thread synchronization.
- Explicit synchronization: barrier synchronization, mutual-exclusion synchronization.
- Implicit synchronization: data dependencies between producers and consumers, etc.
- In the case of fine-grain interactions, it is best for the threads to execute simultaneously on distinct processors and to coordinate their activities with busy waiting.
- In the coarse-grained case, if the variance of execution time is large, it is best to synchronize by blocking: the processor is released and can be used by other threads.
Related work
- This paper is the first to analyze the performance of gang scheduling, and the interplay between synchronization and scheduling.
- This makes it possible to identify the situations in which gang scheduling should be used, namely when the application is based on fine-grain synchronization.
- It is also the first to report experiments based on a real implementation of gang scheduling on a multiprocessor.
- Gang scheduling may cause an effect reminiscent of fragmentation if the gang sizes do not fit the number of available processors.
- The paper shows that for fine-grain computations gang scheduling can more than double the processing capability.
- Gang scheduling refers to the simultaneous scheduling of a group of threads to a group of processors using a one-to-one mapping.
Question: what exactly happens during the synchronization step? If this is pinned down, many of the later questions probably resolve themselves.
Assumption 1: all threads participating in a synchronization must be executing at the same time, i.e. they must all be on processors simultaneously in order to interact.
Assumption 2: the threads need not all be executing at the same time. To synchronize, a thread only has to post a message to a message queue or to shared memory. Under this assumption, with blocking and uncoordinated scheduling, the last thread to finish its computation can complete the synchronization and proceed directly to the next round; the other threads, when reawakened, finish their part of the synchronization and continue with the next round.
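Assumption 2 can be sketched as a counter-based barrier in which the last arriver never blocks. This is an illustrative sketch under that assumption, not code from the paper; the class name `CounterBarrier` is invented here.

```python
import threading

class CounterBarrier:
    """Sketch of Assumption 2: threads post their arrival to shared state,
    so they need not all be running at once.  The last arriver completes
    the synchronization and returns immediately; earlier arrivers block
    and finish whenever they are next scheduled."""

    def __init__(self, n):
        self.n = n
        self.count = 0
        self.generation = 0
        self.cond = threading.Condition()

    def wait(self):
        """Returns True for the last arriver (which never blocks)."""
        with self.cond:
            gen = self.generation
            self.count += 1
            if self.count == self.n:
                # Last thread: release the gang and go straight on.
                self.count = 0
                self.generation += 1
                self.cond.notify_all()
                return True
            while gen == self.generation:
                self.cond.wait()  # earlier arrivers give up the processor
            return False
```

Exactly one thread per round takes the no-blocking fast path, matching the note above that the last thread can proceed directly to the next iteration.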
2 Synchronization And Scheduling
- Busy waiting: its performance depends heavily on the system's scheduling policy.
- Blocking: the waiting thread gives up the processor.
- Two-phase: first a period of busy waiting, then blocking. Combined with gang scheduling in the fine-grained case, essentially only the first (busy-waiting) phase is used; with other scheduling policies it essentially always ends up blocking, and then costs more than plain blocking because the spin is wasted.
- Gang scheduling: all interacting threads run concurrently, but the number of threads in a gang is bounded (by the number of processors, given the one-to-one mapping).
- Uncoordinated scheduling: the threads on each processor are scheduled independently.
The combination of gang scheduling and blocking is pointless: even if a blocking thread switches context and gives up its processor, the processor cannot be switched to a thread of another gang.
So there are three combinations:
- busy waiting with gang scheduling
- busy waiting with uncoordinated scheduling
- blocking with uncoordinated scheduling
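The two-phase mechanism described above can be sketched as a spin-then-block wait. This is a minimal illustration, not the paper's implementation; the function name and the use of `threading.Event` are assumptions made here.

```python
import threading
import time

def two_phase_wait(flag: threading.Event, spin_time: float) -> str:
    """Two-phase waiting sketch: busy-wait for up to spin_time seconds,
    then fall back to blocking.  Under gang scheduling with fine-grain
    synchronization the flag is almost always set during the spin phase;
    under uncoordinated scheduling the wait usually reaches the blocking
    phase and pays its cost on top of the wasted spin."""
    deadline = time.monotonic() + spin_time
    while time.monotonic() < deadline:  # phase 1: busy waiting
        if flag.is_set():
            return "spin"
    flag.wait()  # phase 2: block until signalled
    return "block"
```

The return value records which phase completed the wait, mirroring the note that with gang scheduling essentially only phase 1 is used.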
2.1 Model and Assumptions
- Threads perform iterative computations.
- In iteration $j$, thread $i$ computes for time $t_p^{ij}$ and then performs a synchronization operation.
- The synchronization itself also consumes some time; since every thread must perform it, this overhead is simply folded into $t_p^{ij}$.
- In each iteration, every thread thus has a processing time and a waiting time.
Average processing time:
$$t_p = \frac{1}{nk}\sum_{i=1}^n\sum_{j=1}^k t_p^{ij}$$
The longest processing time in iteration $j$:
$$t_p^{\max j} = \max_{1\leq i\leq n} t_p^{ij}$$
The waiting time of thread $i$ in iteration $j$ is
$$t_w^{ij} = t_p^{\max j} - t_p^{ij}$$
Average waiting time:
$$t_w = \frac{1}{nk}\sum_{i=1}^n \sum_{j=1}^k t_w^{ij}$$
Note: $t_p + t_w = \frac{1}{k}\sum_{j=1}^k t_p^{\max j}$
Since the overhead of synchronization may depend on the number of processors, the value of $t_p$ may also need to be adjusted accordingly.
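These definitions can be checked numerically. The helper below is illustrative (not from the paper); it takes the matrix $t_p^{ij}$ and verifies the identity noted above.

```python
def waiting_times(tp):
    """Compute the model's averages from tp[i][j], the processing time of
    thread i in iteration j (synchronization overhead folded in).
    Returns (t_p, t_w) and checks the identity
    t_p + t_w = (1/k) * sum_j max_i tp[i][j]."""
    n, k = len(tp), len(tp[0])
    t_p = sum(sum(row) for row in tp) / (n * k)
    maxes = [max(tp[i][j] for i in range(n)) for j in range(k)]
    # t_w^{ij} = t_p^{max j} - t_p^{ij}
    t_w = sum(maxes[j] - tp[i][j] for i in range(n) for j in range(k)) / (n * k)
    assert abs((t_p + t_w) - sum(maxes) / k) < 1e-9
    return t_p, t_w
```

For example, with two threads and two iterations, tp = [[1, 2], [3, 2]] gives $t_p = 2$ and $t_w = 0.5$, and indeed $t_p + t_w = (3 + 2)/2$.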
System Characterization
In a multiprogrammed multiprocessor, several threads are mapped to each processor, and a time-slicing policy gives all threads equal service.
Assumptions:
- the number of threads on each processor is denoted by $l$
- we only follow a single gang of $n$ threads, mapped to $n$ distinct processors; the remaining $n(l-1)$ threads are not modeled in detail
For the blocking strategy, performance depends strongly on the behavior of the threads of the other gangs, because blocking only pays off if the freed processor can do useful work. Two extreme cases are considered:
- all threads in the system behave the same, performing iterative computation with synchronization
- the remaining $l-1$ gangs consist of independently scheduled, purely computational threads
Notation:
- the scheduling time quantum is denoted $\tau_q$
- the overhead of a context switch is denoted $\tau_{cs}$
- the overhead of blocking is $\alpha$ times the context-switch overhead, with $\alpha > 1$
In the coarse-grained case:
$t_p$ can be much larger than $\tau_q$: many time slices may elapse before the synchronization step is reached.
In the fine-grained case:
$t_p$ may be much smaller than $\tau_q$: a time slice of $\tau_q$ may execute $10^4 - 10^5$ instructions, with an interaction performed every 10-100 instructions.
In either case it is assumed that
$$k(t_p + t_w) \gg \tau_q$$
so in the fine-grained case $k$ needs to be large enough.
2.2 Performance Derivation
Busy Waiting with Gang Scheduling
The time to perform $k$ iterations is:
$$T = \left(k(t_p + t_w) + \frac{k(t_p+t_w)}{\tau_q}\tau_{cs}\right) l$$
The factor $l$ appears because each processor has $l$ threads mapped to it, and the time-slicing policy gives the other threads the same amount of service.
The average time per iteration is:
$$t = \left(1 + \frac{\tau_{cs}}{\tau_q}\right)(t_p + t_w)\, l$$
Because $t_p + t_w = \frac{1}{k}\sum_{j=1}^k t_p^{\max j}$, the execution rate is dictated by the slowest thread in each iteration.
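The per-iteration expression can be written as a one-line helper (illustrative, names invented here):

```python
def t_gang(t_p, t_w, tau_q, tau_cs, l):
    """Average time per iteration for busy waiting with gang scheduling:
    t = (1 + tau_cs/tau_q) * (t_p + t_w) * l.
    The factor l accounts for the l threads time-sliced on each processor."""
    return (1 + tau_cs / tau_q) * (t_p + t_w) * l
```

For instance, with $t_p = 10$, $t_w = 2$, $\tau_q = 1000$, $\tau_{cs} = 10$ and $l = 3$, the per-iteration time is $1.01 \cdot 12 \cdot 3 = 36.36$.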
Busy Waiting with Uncoordinated Scheduling
The behavior of busy waiting with uncoordinated scheduling depends on the granularity.
Coarse-grained case:
Here busy waiting with uncoordinated scheduling behaves almost the same as with gang scheduling. The threads of a gang need not execute at the same time, and since $t_p + t_w$ is much larger than $\tau_q$, each iteration spans several scheduling quanta anyway.
The computation of $t$ is the same as before:
$$t = \left(1 + \frac{\tau_{cs}}{\tau_q}\right)(t_p + t_w)\, l$$
Question: the synchronization cannot complete until all threads have been scheduled, right? So is the computation of $t_w$ really the same as before?
In the fine-grained case:
First, note that even without coordination there is some probability that all $n$ threads happen to be scheduled at the same time; when this occurs, many iterations get done.
When the scheduling cycle has length $\Lambda$, the expected time during which $n$ randomly placed segments of length $\lambda$ all overlap is $\frac{\lambda^n}{\Lambda^{n-1}}$.
In the fine-grained case:
- $\lambda = \tau_q$
- $\Lambda = l(\tau_q + \tau_{cs})$ is the length of a scheduling cycle
- the thread overlap time per scheduling cycle is $\frac{\tau_q^n}{l^{n-1}(\tau_q + \tau_{cs})^{n-1}}$
- the number of iterations completed per scheduling cycle is $\frac{\tau_q^n}{l^{n-1}(\tau_q + \tau_{cs})^{n-1}(t_p + t_w)}$
The number of scheduling cycles required to execute $k$ iterations is $m = \frac{k\, l^{n-1}(\tau_q + \tau_{cs})^{n-1}(t_p + t_w)}{\tau_q^n}$
The total execution time is:
$$T = m\Lambda = \frac{k\, l^{n}(\tau_q + \tau_{cs})^{n}(t_p + t_w)}{\tau_q^n}$$
The average time per iteration is:
$$t = \frac{l^{n}(\tau_q + \tau_{cs})^{n}(t_p + t_w)}{\tau_q^n}$$
If we assume $\tau_q \gg \tau_{cs}$, then uncoordinated scheduling is slower than gang scheduling by a factor of $l^{n-1}$.
In fact, it is assumed that each scheduling cycle completes at least one iteration, so
$$t = \min\left(\text{Eq. (8)},\ (\tau_q + \tau_{cs})\, l\right)$$
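The overlap expression follows from a point-coverage argument: a fixed point is covered by one randomly phased interval of length $\lambda$ on a circular cycle of length $\Lambda$ with probability $\lambda/\Lambda$, hence by all $n$ independent intervals with probability $(\lambda/\Lambda)^n$, giving expected overlap $\Lambda(\lambda/\Lambda)^n = \lambda^n/\Lambda^{n-1}$. Below is a small Monte Carlo check of that probability plus the resulting per-iteration time with its cap; both helpers are sketches written for these notes, not the paper's code.

```python
import random

def overlap_fraction(n, lam, Lam, trials=200_000, seed=1):
    """Estimate the probability that a random point on a circular cycle
    of length Lam is simultaneously covered by n intervals of length lam,
    each placed with an independent uniform phase.  The analytic value is
    (lam/Lam)**n, so the expected overlap length is lam**n / Lam**(n-1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = rng.uniform(0, Lam)
        # x lies in [o, o+lam) mod Lam  iff  (x - o) mod Lam < lam
        if all((x - rng.uniform(0, Lam)) % Lam < lam for _ in range(n)):
            hits += 1
    return hits / trials

def t_uncoord_fine(t_p, t_w, tau_q, tau_cs, l, n):
    """Average per-iteration time for busy waiting with uncoordinated
    scheduling, fine grain: iterations complete only while all n threads
    overlap, capped by the assumption that every scheduling cycle of
    length l*(tau_q + tau_cs) completes at least one iteration."""
    t_overlap = (l * (tau_q + tau_cs)) ** n * (t_p + t_w) / tau_q ** n
    return min(t_overlap, (tau_q + tau_cs) * l)
```

With $n = 2$, $\lambda = 1$, $\Lambda = 4$ the estimate should be close to $(1/4)^2 = 0.0625$; and for larger $n$ or coarser work the cap $(\tau_q + \tau_{cs})l$ takes over.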
Question: why should each scheduling cycle complete at least one iteration? The paper states it as an assumption, without explanation.
Blocking Mechanism
Blocking frees up the processor, so there is a kind of altruism to it: the released time benefits whoever runs next.
Consider two cases:
- all threads behave the same: they perform iterative computation with synchronization, and block while waiting
- the competing threads are independently scheduled and purely computational, so they never block
All threads behave the same, coarse-grained
Since the threads are independently scheduled, at any moment only a fraction $\frac{t_p}{t_p + t_w}$ of the threads are active.
In the coarse-grained case, most context switches are due to time-slice expiration rather than to blocking, so the additional overhead caused by blocking is negligible.
The total running time is then the busy-waiting expression with the waiting time removed; in effect, the waiting time is saved.
The time required per iteration is:
$$t = \left(1 + \frac{\tau_{cs}}{\tau_q}\right) t_p\, l$$
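For the coarse-grained all-same case, the "waiting time is saved" reading gives $t = (1 + \tau_{cs}/\tau_q)\,t_p\,l$, i.e. the busy-waiting expression with the $t_w$ term dropped. This helper encodes that reconstruction from the notes (it is not the paper's numbered equation):

```python
def t_block_coarse(t_p, tau_q, tau_cs, l):
    """Per-iteration time for blocking, all threads alike, coarse grain:
    the busy-waiting expression with the waiting time t_w saved, i.e.
    t = (1 + tau_cs/tau_q) * t_p * l.  Reconstruction, not the paper's
    numbered equation."""
    return (1 + tau_cs / tau_q) * t_p * l
```

With $t_p = 10$, $\tau_q = 1000$, $\tau_{cs} = 10$, $l = 3$ this gives $30.3$, versus $36.36$ for busy waiting with $t_w = 2$ under the same parameters.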
Other threads behave differently, coarse-grained
At any moment the number of active threads per processor is essentially $l$, since the competing threads never block, so nothing is saved for the gang itself.
All threads behave the same, fine-grained
Threads block frequently, so the effective scheduling quantum shrinks.
The last thread to arrive at the synchronization can proceed directly to the next iteration without blocking; of the $n$ threads, only $n-1$ block in each iteration.
The cost of each block is $\alpha$ times a thread context switch.
Each iteration therefore pays the blocking overhead on top of the processing time, which determines the time to run $k$ iterations and the time required per iteration.
Compared to busy waiting:
- in the coarse-grained case, the per-iteration time is reduced
- in the fine-grained case, the per-iteration time increases
Other threads behave differently, fine-grained
In each scheduling cycle the gang can complete only one iteration and then blocks, while the competing threads still execute their full $\tau_q$ time slices; this fully exhibits the altruistic character of blocking.
In this situation, round-robin rotation is unfair to the blocked threads.
3 Implementation And Experiments
Experimental Results
The workload consists of several gangs; each gang has exactly one thread on each processor.
- LOAD: the number of gangs competing for the processors
- GRAIN: the granularity, i.e. the number of instructions in each iteration excluding the synchronization itself. If there are several granularities, the smallest is used.
- VAR: the spread between the maximum and minimum number of instructions across the activities of an iteration. The instruction counts in the experiments are uniformly distributed between GRAIN and GRAIN+VAR; the expected longest activity is GRAIN+(n/(n+1))VAR and the expected shortest is GRAIN+(1/(n+1))VAR.
In the experiment, each unit of GRAIN takes 0.00141ms, and the interaction time is 0.137ms
$t_w$ follows from this distribution: the expected longest activity minus the mean, i.e. $t_w = \left(\frac{n}{n+1} - \frac{1}{2}\right)\mathrm{VAR}$ in instruction units.
The number of iterations was 30000 or 50000 in most experiments
To ensure that execution starts without coordination, a startup helper thread is also added at the beginning.
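The claim about the longest activity can be checked by simulation; this sampler is written for these notes, not part of the paper's experimental code.

```python
import random

def mean_longest(n, grain, var, trials=100_000, seed=7):
    """Check the stated distribution: with n activities drawn uniformly
    from [GRAIN, GRAIN+VAR], the expected longest activity is
    GRAIN + (n/(n+1))*VAR (and the expected shortest is
    GRAIN + (1/(n+1))*VAR)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.uniform(grain, grain + var) for _ in range(n))
    return total / trials
```

For example, with n = 4, GRAIN = 100 and VAR = 60 the expected longest activity is 100 + (4/5)·60 = 148 instructions.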
Experiment 1
This experiment shows the extra overhead caused by blocking; with LOAD = 1, blocking cannot achieve any gain.
Experiment 2
LOAD = 3. This experiment shows the performance of busy waiting with gang scheduling versus blocking with uncoordinated scheduling.
At the smallest granularity, gang scheduling achieves twice the performance of blocking.
Experiment 3
Experiment 3 shows the altruism of the blocking mechanism.
In one set of runs, all threads behave the same.
In the other set, only one gang of threads blocks.
Experiment 4
This experiment shows that busy waiting without gang scheduling is indeed not a viable option
4 Summary and discussion
- busy waiting with gang scheduling works best for fine-grained jobs
- Blocking releases system resources and reduces running time for coarse-grained jobs. For fine-grained jobs its effect depends on the competing threads: if the competitors do not block themselves, the execution efficiency of the blocking threads suffers (the altruism again).
- Busy waiting with uncoordinated scheduling is a waste of resources, especially for fine-grained jobs; for coarse-grained jobs a great deal of time can still be spent busy waiting.
- Two-phase blocking limits the waste but does not improve performance unless combined with gang scheduling.
When to use busy waiting with gang scheduling
Denote the per-iteration execution time of busy waiting with gang scheduling by $t_{BW}$, and that of blocking by $t_{BLK}$.
Assume first that all threads behave the same.
Under what conditions is $\frac{t_{BW}}{t_{BLK}} \leq \sigma < 1$?
Substituting the expressions derived above gives the condition.
In the gray region, BW is at least $\frac{1}{\sigma}$ times faster than BLK.
Similarly, if $\frac{t_{BLK}}{t_{BW}} \leq \sigma < 1$, blocking is dominant.
When the other threads behave differently, Eq. (13) should be used for $t_{BLK}$; substituting it gives the corresponding condition.
In addition, busy waiting is friendlier to caches, because the frequent context switching caused by blocking reduces cache effectiveness.
Hardware support for the synchronization primitives is also required: if each interaction takes too long, the computation is easily pushed beyond the fine-grain boundary.