Real-Time Rendering, Chapter 18: Pipeline Optimization

“We should forget about small efficiencies, say about 97% of the time: Premature optimization is the root of all evil.”
—Donald Knuth

Throughout this volume, algorithms have been presented within a context of quality, memory, and performance trade-offs. In this chapter we will discuss performance problems and opportunities that are not associated with particular algorithms. Bottleneck detection and optimization are the focus, starting with making small, localized changes, and ending with techniques for structuring an application as a whole to take advantage of multiprocessing capabilities.

As we saw in Chapter 2, the process of rendering an image is based on a pipelined architecture with four conceptual stages: application, geometry processing, rasterization, and pixel processing. There is always one stage that is the bottleneck, the slowest process in the pipeline. This implies that this bottleneck stage sets the limit for the throughput, i.e., the total rendering performance, and so is a prime candidate for optimization.

Optimizing the performance of the rendering pipeline resembles the procedure of optimizing a pipelined processor (CPU) [715] in that it consists mainly of two steps. First, the bottleneck of the pipeline is located. Second, that stage is optimized in some way; and after that, step one is repeated if the performance goals have not been met. Note that the bottleneck may or may not be located at the same place after the optimization step. It is a good idea to put only enough effort into optimizing the bottleneck stage so that the bottleneck moves to another stage. Several other stages may have to be optimized before this stage becomes the bottleneck again. For this reason, effort should not be wasted on over-optimizing a stage.
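
The locate-then-optimize loop can be sketched as a toy model. This is only an illustration under stated assumptions: the stage timings are made up, and the hypothetical `speedup` parameter pretends that every optimization pass trims a fixed 20% from the current bottleneck, which real optimizations will not do so predictably.

```python
def optimize_pipeline(stage_ms, target_ms, speedup=0.8, max_steps=100):
    """Toy model of the two-step loop: locate the bottleneck, optimize it, repeat.

    stage_ms  -- dict of stage name -> time per frame in ms (hypothetical numbers)
    target_ms -- frame-time goal; stop once the slowest stage meets it
    speedup   -- assumed factor applied by one optimization pass (illustrative)
    """
    stage_ms = dict(stage_ms)   # work on a copy
    history = []                # which stage was optimized at each step
    for _ in range(max_steps):
        # Step 1: locate the bottleneck, i.e., the slowest stage.
        bottleneck = max(stage_ms, key=stage_ms.get)
        if stage_ms[bottleneck] <= target_ms:
            break               # performance goal met, stop optimizing
        # Step 2: optimize that stage, then go back to step 1.
        stage_ms[bottleneck] *= speedup
        history.append(bottleneck)
    return stage_ms, history


stages = {"application": 20.0, "geometry": 30.0, "rasterization": 10.0, "pixel": 40.0}
final, history = optimize_pipeline(stages, target_ms=25.0)
print(history)   # the bottleneck moves between stages as work proceeds
```

Note how the loop re-locates the bottleneck after every pass rather than continuing to polish the same stage, which is the point of the advice above.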

The location of the bottleneck may change within a frame, or even within a draw call. At one moment the geometry stage may be the bottleneck because many tiny triangles are rendered. Later in the frame, pixel processing could be the bottleneck because a heavyweight procedural shader is evaluated at each pixel. Within a pixel shader, execution may stall because the texture queue is full, or take more time as a particular loop or branch is reached. So, when we talk about, say, the application stage being the bottleneck, we mean it is the bottleneck most of the time during that frame. There is rarely only one bottleneck.
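
One way to make "the bottleneck most of the time during that frame" concrete is to weight each limiting stage by how long it was limiting. The sketch below assumes a hypothetical sample format, a list of `(duration_ms, loads)` pairs where `loads` maps a stage name to its relative utilization during that interval; real profilers expose this information in their own ways.

```python
from collections import defaultdict


def dominant_bottleneck(intervals):
    """Return the stage that limits the frame for the most time.

    intervals -- list of (duration_ms, {stage: utilization}) samples taken
                 within one frame (hypothetical format, for illustration)
    """
    time_as_bottleneck = defaultdict(float)
    for duration_ms, loads in intervals:
        limiting = max(loads, key=loads.get)      # busiest stage in this interval
        time_as_bottleneck[limiting] += duration_ms
    dominant = max(time_as_bottleneck, key=time_as_bottleneck.get)
    return dominant, dict(time_as_bottleneck)


frame = [
    (4.0, {"geometry": 0.90, "pixel": 0.30}),   # many tiny triangles
    (10.0, {"geometry": 0.20, "pixel": 0.95}),  # heavyweight procedural shader
    (2.0, {"geometry": 0.80, "pixel": 0.60}),
]
stage, breakdown = dominant_bottleneck(frame)
print(stage, breakdown)
```

The breakdown also shows the second point of the paragraph: more than one stage is limiting at some point in the frame.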

Another way to capitalize on the pipelined construction is to recognize that when the slowest stage cannot be optimized further, the other stages can be made to work just as much as the slowest stage. This will not change performance, since the speed of the slowest stage will not be altered, but the extra processing can be used to improve image quality [1824]. For example, say that the bottleneck is in the application stage, which takes 50 milliseconds (ms) to produce a frame, while the others each take 25 ms. This means that without changing the speed of the rendering pipeline (50 ms equals 20 frames per second), the geometry and the rasterizer stages could also do their work in 50 ms. For example, we could use a more sophisticated lighting model or increase the level of realism with shadows and reflections, assuming that this does not increase the workload on the application stage.
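
The arithmetic in this example can be checked with a few lines of code. The function name `frame_budget` and the dictionary layout are our own; the timings are the ones from the text, with the three non-application stages each assumed to take 25 ms.

```python
def frame_budget(stage_ms):
    """Return the bottleneck stage, the frame rate it implies, and per-stage slack.

    Slack is the extra time a stage could spend without slowing the pipeline,
    since throughput is set by the slowest stage.
    """
    bottleneck = max(stage_ms, key=stage_ms.get)
    frame_ms = stage_ms[bottleneck]
    fps = 1000.0 / frame_ms
    slack = {stage: frame_ms - ms for stage, ms in stage_ms.items()}
    return bottleneck, fps, slack


stages = {"application": 50.0, "geometry": 25.0, "rasterization": 25.0, "pixel": 25.0}
bottleneck, fps, slack = frame_budget(stages)
print(bottleneck, fps, slack)
# application is the bottleneck at 20 frames per second; every other stage
# has 25 ms of slack that could be spent on image quality instead.
```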

Compute shaders also change the way we think about bottlenecks and unused resources. For example, if a shadow map is being rendered, vertex and pixel shaders are simple and the GPU computational resources might be underutilized if fixed-function stages such as the rasterizer or the pixel merger become the bottleneck. Overlapping such draws with asynchronous compute shaders can keep the shader units busy when these conditions arise [1884]. Task-based multiprocessing is discussed in the final section of this chapter.

Pipeline optimization is a process in which we first maximize the rendering speed, then allow the stages that are not bottlenecks to consume as much time as the bottleneck. That said, it is not always a straightforward process, as GPUs and drivers can have their own peculiarities and fast paths. When reading this chapter, the dictum

KNOW YOUR ARCHITECTURE

should always be in the back of your mind, since optimization techniques vary greatly for different architectures. That said, be wary of optimizing based on a specific GPU’s implementation of a feature, as hardware can and will change over time [530]. A related dictum is, simply,

MEASURE, MEASURE, MEASURE.

18.1 Profiling and Debugging Tools

Profiling and debugging tools can be invaluable in finding performance problems in your code. Capabilities vary and can include:

• Frame capture and visualization. Usually step-by-step frame replay is available, with the state and resources in use displayed.
• Profiling of time spent across the CPU and GPU, including time spent calling the graphics API.
• Shader debugging, and possibly hot editing to see the effects of changing code.
• Use of debug markers set in the application, to help identify areas of code.

Profiling and debugging tools vary with the operating system, the graphics API, and often the GPU vendor. There are tools for most combinations, and that’s why the gods created Google. That said, we will mention a few package names specifically for interactive graphics to get you started on your quest:

• RenderDoc is a high-quality Windows debugger for DirectX, OpenGL, and Vulkan, originally developed by Crytek and now open source.
• GPU PerfStudio is AMD’s suite of tools for their graphics hardware offerings, working on Windows and Linux. One notable tool provided is a static shader analyzer that gives performance estimates without needing to run the application. AMD’s Radeon GPU Profiler is a separate, related tool.
• NVIDIA Nsight is a performance and debugging system with a wide range of features. It integrates with Visual Studio on Windows and Eclipse on Mac OS and Linux.
• Microsoft’s PIX has long been used by Xbox developers and has been brought back for DirectX 12 on Windows. Visual Studio’s Graphics Diagnostics can be used with earlier versions of DirectX.
• GPUView from Microsoft uses Event Tracing for Windows (ETW), an efficient event logging system. GPUView is one of several programs that are consumers of ETW sessions. It focuses on the interaction between CPU and GPU, showing which is the bottleneck [783].
• Graphics Performance Analyzers (GPA) is a suite from Intel, not specific to their graphics chips, that focuses on performance and frame analysis.
• Xcode on OSX provides Instruments, which has several tools for timing, performance, networking, memory leaks, and more. Worth mentioning are OpenGL ES Analysis, which detects performance and correctness problems and proposes solutions, and Metal System Trace, which provides tracing information from the application, driver, and GPU.

These are the major tools that have existed for a few years. That said, sometimes no tool will do the job. Timer query calls are built into most APIs to help profile a GPU’s performance. Some vendors provide libraries to access GPU counters and thread traces as well.
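
When no tool fits, a small hand-rolled timer with named markers is often enough to get a first measurement. The sketch below is a CPU-side analogue only: it times application code with `time.perf_counter` and does not issue GPU timer queries, which need the API-specific calls of whichever graphics API is in use. The class name `FrameTimer` and the marker names are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class FrameTimer:
    """Accumulates wall-clock time per named marker (CPU side only)."""

    def __init__(self):
        self.totals_ms = defaultdict(float)

    @contextmanager
    def marker(self, name):
        # Time everything executed inside the `with` block under this name.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals_ms[name] += (time.perf_counter() - start) * 1000.0

    def slowest(self):
        # The marker with the largest accumulated time.
        return max(self.totals_ms, key=self.totals_ms.get)


timer = FrameTimer()
with timer.marker("update_scene"):
    sum(i * i for i in range(10_000))   # stand-in for application work
with timer.marker("submit_draws"):
    time.sleep(0.005)                   # stand-in for time spent in API calls
print(timer.slowest(), dict(timer.totals_ms))
```

Because the markers accumulate, the same name can wrap code that runs many times per frame, and the totals then show where the application stage spends its time.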
