The Application Slowdown Model

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory (2015)

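This paper proposes the Application Slowdown Model (ASM), a new model for estimating how much an application slows down when it runs alongside other applications on a multicore system. The model accounts for contention at both the shared cache and main memory, and it estimates slowdown from the aggregate behavior of an application's requests. Experiments show that ASM achieves a lower estimation error than previous models while barely disturbing the other running applications. The paper also uses ASM to improve the management of shared resources.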
Application slowdown
In a multicore, multi-program setting, the slowdown of an application is measured by the following ratio:

$slowdown=\frac{shared\ execution\ time}{alone\ execution\ time}$

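shared execution time: the time the application takes when it runs together with other applications.

alone execution time: the time the application takes when it runs alone on the same system. This is much harder to measure, especially when it has to be done at run time without disrupting the other applications.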
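Previous slowdown estimation models:
- STFM (stall-time fair memory): estimates slowdown as the ratio between an application's memory stall time when it shares main memory and its stall time when it has main memory to itself. STFM tracks, for each request the application issues, the extra stall delay caused by interference from other applications. Because many requests are serviced in parallel, this delay is very hard to measure.
- MISE (memory-interference induced slowdown estimation): builds on the observation that an application's performance is correlated with the rate at which its memory requests are served, and uses the ratio of these service rates to estimate slowdown. To measure the request service rate of an application running alone, MISE periodically gives that application's requests the highest priority for memory access.
- FST (fairness via source throttling) and PTCA (per-thread cycle accounting): estimate the slowdown caused by interference in both shared cache capacity and main memory bandwidth. Both quantify main memory interference with an approach similar to STFM's. To quantify shared cache interference, both first identify the contention misses, i.e. accesses that miss in the shared cache but would have hit if the application had the cache to itself, and then count the extra cycles spent serving those misses.
- FST and PTCA differ in how they identify the misses caused by interference. FST uses a pollution filter that records, for each application, the cache lines evicted by other applications. An access that misses in the cache but hits in the filter is counted as a contention miss. PTCA gives each application an auxiliary tag store that emulates a cache used by that application alone. An access that misses in the shared cache but hits in the tag store is counted as a contention miss.
The main problem with STFM and MISE is that they do not account for interference at the shared cache. FST and PTCA share STFM's difficulty in measuring main memory interference, and in addition the hardware they need to identify contention misses is too expensive.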
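
The auxiliary-tag-store idea described for PTCA can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not from the paper: the cache is fully associative with LRU replacement, the tag store has the same capacity as the shared cache, and the names `LRUTags` and `count_contention_misses` are hypothetical.

```python
from collections import OrderedDict


class LRUTags:
    """Fully associative LRU tag array (assumed for brevity; real caches are set-associative)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tags = OrderedDict()

    def access(self, tag):
        """Return True on a hit; on a miss, insert the tag and evict the LRU entry if full."""
        if tag in self.tags:
            self.tags.move_to_end(tag)
            return True
        if len(self.tags) >= self.capacity:
            self.tags.popitem(last=False)
        self.tags[tag] = None
        return False


def count_contention_misses(trace, capacity, app):
    """Count accesses by `app` that miss in the shared cache but hit in its own tag store."""
    shared = LRUTags(capacity)   # tags of the cache shared by all applications
    private = LRUTags(capacity)  # auxiliary tag store: what `app` would hold if it ran alone
    contention = 0
    for owner, addr in trace:
        shared_hit = shared.access((owner, addr))
        if owner == app:
            alone_hit = private.access(addr)
            if alone_hit and not shared_hit:
                contention += 1
    return contention


# Application "A" reuses two lines; application "B" streams through the cache in between.
trace = [("A", 0), ("A", 1), ("B", 10), ("B", 11), ("A", 0), ("A", 1)]
print(count_contention_misses(trace, capacity=2, app="A"))  # prints 2
```

Both reuses by "A" miss in the shared cache only because "B" evicted its lines, so they are counted as contention misses. Keeping a full tag copy per application is exactly the hardware cost the summary criticizes.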
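Analogous to the observation behind MISE, the paper makes the following observation: "The performance of each application is proportional to the rate at which it accesses the shared cache". The experimental observation is shown in the figure. The experimental platform is an Intel Core-i5 processor with a 6MB shared cache.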

$performance \propto cache\ access\ rate\ (CAR)$

$Slowdown=\frac{performance_{alone}}{performance_{shared}}=\frac{CAR_{alone}}{CAR_{shared}}$

Of the two rates, $CAR_{shared}$ is the easier one to measure.
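Measuring CAR(alone)
- Minimizing main memory interference

  Implementation: the memory controller periodically gives each application's memory requests the highest priority for a short period of time (an epoch), similar to what MISE does.

  Result: 1) most of the negative impact of main memory contention on the CAR(alone) measurement is eliminated

  2) ASM obtains an accurate estimate of the cache miss service time
- Quantifying shared cache interference

  Implementation: 1) add an auxiliary tag store for each application to identify contention misses

  2) use the tag store to count the contention misses, then combine that count with the average cache miss service time and the average cache hit service time to estimate the time spent serving contention misses, which quantifies the shared cache interference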
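Because each execution phase of an application has different characteristics, ASM divides execution into quanta of Q cycles each. At the end of every quantum it measures CAR for the shared and alone cases and reports the application's slowdown.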

$CAR_{shared}=\frac{\#Shared\ Cache\ Accesses}{Q}$

$CAR_{alone}=\frac{\#Requests\ during\ application's\ epochs}{Time\ to\ serve\ requests\ when\ run\ alone}=\frac{epoch\ hits+epoch\ misses}{(epoch\ count*E)-epoch\ excess\ cycles}$

$epoch\ excess\ cycles=(\#Contention\ Misses)*(avg\ miss\ time-avg\ hit\ time)$

$\#Contention\ Misses=(epoch\ ATS\ hits)-(epoch\ hits)$

$avg\ miss\ time=\frac{epoch\ miss\ time}{epoch\ misses}$

$avg\ hit\ time=\frac{epoch\ hit\ time}{epoch\ hits}$

The terms are:
- epoch count * E: the time the application actually runs while its requests have the highest priority, where E is the length of one epoch
- epoch excess cycles: the time the application spends serving contention misses
- avg miss time: the average cache miss service time
- epoch count: the number of epochs assigned to the application
- epoch hits: the application's total number of cache hits during its assigned epochs
- epoch ATS hits: the number of hits in the application's auxiliary tag store (ATS) during its assigned epochs
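
As a worked illustration, the formulas above can be evaluated directly from per-application counters. The sketch below assumes the counters are already collected by hardware; the `EpochCounters` container, the function names, and all numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass


@dataclass
class EpochCounters:
    """Counters gathered during the epochs in which the application had highest priority."""
    epoch_count: int      # number of epochs assigned to the application
    epoch_hits: int       # shared cache hits during those epochs
    epoch_misses: int     # shared cache misses during those epochs
    epoch_ats_hits: int   # auxiliary tag store hits during those epochs
    epoch_hit_time: int   # total cycles spent serving hits
    epoch_miss_time: int  # total cycles spent serving misses


def car_alone(c, epoch_length):
    """CAR_alone = (hits + misses) / (epoch_count * E - excess cycles)."""
    avg_hit = c.epoch_hit_time / c.epoch_hits if c.epoch_hits else 0.0
    avg_miss = c.epoch_miss_time / c.epoch_misses if c.epoch_misses else 0.0
    contention_misses = c.epoch_ats_hits - c.epoch_hits
    excess_cycles = contention_misses * (avg_miss - avg_hit)
    return (c.epoch_hits + c.epoch_misses) / (c.epoch_count * epoch_length - excess_cycles)


def slowdown(c, epoch_length, shared_accesses, quantum):
    """Slowdown = CAR_alone / CAR_shared, with CAR_shared = accesses / Q."""
    return car_alone(c, epoch_length) / (shared_accesses / quantum)


# Made-up numbers: 10 epochs of 10,000 cycles, 20-cycle hits, 200-cycle misses.
counters = EpochCounters(epoch_count=10, epoch_hits=600, epoch_misses=400,
                         epoch_ats_hits=700, epoch_hit_time=12_000, epoch_miss_time=80_000)
print(car_alone(counters, epoch_length=10_000))
print(slowdown(counters, epoch_length=10_000, shared_accesses=30_000, quantum=5_000_000))
```

With these numbers there are 100 contention misses costing 180 extra cycles each, so 18,000 of the 100,000 prioritized cycles are removed before dividing, giving a CAR(alone) of 1000/82000 and a slowdown of roughly 2.03.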
Queueing delay at the memory controller: even though the application's requests have the highest priority during its epoch, if the application issues very few requests, the memory controller serves other applications' requests in the meantime. When the application then needs to issue a memory request, that request has to wait in the request queue. This delay therefore has to be taken into account when computing CAR(alone).

$avg\ queueing\ delay=\frac{\#queueing\ cycles}{epoch\ misses}$

$CAR_{alone}=\frac{epoch\ hits+epoch\ misses}{(epoch\ count*E)-epoch\ excess\ cycles-(epoch\ ATS\ misses*avg\ queueing\ delay)}$

queueing cycles: the number of clock cycles that the application's memory requests spend waiting, during its epochs, while the memory controller (MC) serves other applications' requests.
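
The queueing correction only changes the denominator of CAR(alone). A minimal sketch of the corrected formula follows; the function name, the flat parameter list, and the numbers are illustrative assumptions, not part of the paper.

```python
def car_alone_with_queueing(epoch_hits, epoch_misses, epoch_count, epoch_length,
                            excess_cycles, epoch_ats_misses, queueing_cycles):
    """CAR_alone with the queueing-delay term subtracted from the denominator."""
    # Average number of cycles a miss waits behind other applications' requests.
    avg_queueing_delay = queueing_cycles / epoch_misses if epoch_misses else 0.0
    denominator = (epoch_count * epoch_length
                   - excess_cycles
                   - epoch_ats_misses * avg_queueing_delay)
    return (epoch_hits + epoch_misses) / denominator


# Made-up numbers: 4,000 queueing cycles over 400 misses gives a 10-cycle average delay.
rate = car_alone_with_queueing(epoch_hits=600, epoch_misses=400, epoch_count=10,
                               epoch_length=10_000, excess_cycles=18_000,
                               epoch_ats_misses=300, queueing_cycles=4_000)
print(rate)
```

The delay is charged once per auxiliary-tag-store miss, i.e. per access that would still have gone to memory had the application run alone, so here 300 × 10 = 3,000 further cycles are removed and the rate becomes 1000/79000.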
To reduce the hardware overhead of the per-application auxiliary tag store, the final implementation uses sampling. With sampling, epoch ATS hits and epoch ATS misses are computed differently. Here ats hit fraction is the fraction of accesses to the sampled tag store that hit.

$epoch\ ATS\ hits=(ats\ hit\ fraction)*(epoch\ accesses)$

$epoch\ ATS\ misses=(ats\ miss\ fraction)*(epoch\ accesses)$

$epoch\ accesses=epoch\ hits+epoch\ misses$

$ats\ hit\ fraction=\frac{ats\ hits}{ats\ hits+ats\ misses}$
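
The scaling from sampled tag-store counts to whole-cache estimates can be sketched as below. The function name and the numbers are hypothetical, and the miss fraction is assumed to be the complement of the hit fraction, which the summary does not state explicitly.

```python
def scale_sampled_ats(ats_hits, ats_misses, epoch_hits, epoch_misses):
    """Estimate epoch ATS hits and misses from a sampled auxiliary tag store."""
    sampled_accesses = ats_hits + ats_misses
    ats_hit_fraction = ats_hits / sampled_accesses
    # Assumption: ats miss fraction = ats misses / (ats hits + ats misses).
    ats_miss_fraction = ats_misses / sampled_accesses
    epoch_accesses = epoch_hits + epoch_misses
    return ats_hit_fraction * epoch_accesses, ats_miss_fraction * epoch_accesses


# Made-up numbers: the sampled sets saw 35 hits and 15 misses out of 1,000 total accesses.
est_hits, est_misses = scale_sampled_ats(ats_hits=35, ats_misses=15,
                                         epoch_hits=600, epoch_misses=400)
print(est_hits, est_misses)
```

A 70% hit fraction in the sampled sets is extrapolated to about 700 ATS hits and 300 ATS misses for the epoch, which can then be plugged into the contention-miss and queueing formulas.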
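Results: on most SPEC CPU2006 workloads, ASM's slowdown estimation error is lower than that of FST and PTCA. The error of all three models grows as the number of CPU cores increases, whereas the size of the shared cache has little effect on it.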
Applications of ASM's slowdown estimates
- Cache Partitioning

  Main idea: using the slowdown estimates, give more cache to the application whose slowdown decreases the most when its cache allocation grows

- Memory Bandwidth Partitioning

  Main idea: the larger an application's slowdown, the more bandwidth it should receive. The bandwidth share given to application $A_i$ is:

  $share(A_i)=\frac{slowdown(A_i)}{\sum_k{slowdown(A_k)}}$
- Job Migration and Admission Control
- Fair Pricing in Cloud Systems
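
The bandwidth-partitioning rule in the list above is a simple normalization of the slowdown estimates. A minimal sketch, with a hypothetical function name and made-up slowdown values:

```python
def bandwidth_shares(slowdowns):
    """Give each application a bandwidth share proportional to its estimated slowdown."""
    total = sum(slowdowns.values())
    return {app: s / total for app, s in slowdowns.items()}


# Made-up slowdown estimates for three co-running applications.
shares = bandwidth_shares({"A": 3.0, "B": 1.5, "C": 1.5})
print(shares)  # {'A': 0.5, 'B': 0.25, 'C': 0.25}
```

The most slowed-down application receives the largest share, and the shares always sum to 1.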
