18.4 Optimization

Once a bottleneck has been located, we want to optimize that stage to boost the performance. In this section we present optimization techniques for the application, geometry, rasterization, and pixel processing stages.

18.4.1 Application Stage

The application stage is optimized by making the code faster and the memory accesses of the program faster or fewer. Here we touch upon some of the key elements of code optimization that apply to CPUs in general.

For code optimization, it is crucial to locate the place in the code where most of the time is spent. A good code profiler is critical in finding these code hot spots. Optimization efforts are then made in these places. Such locations in the program are often inner loops, pieces of code that are executed many times each frame.

The basic rule of optimization is to try a variety of tactics: Reexamine algorithms, assumptions, and code syntax, trying variants where possible. CPU architecture and compiler performance often limit the user’s ability to form an intuition about how to write the fastest code, so question your assumptions and keep an open mind.

One of the first steps is to experiment with the optimization flags for the compiler. There are usually a number of different flags to try. Make few, if any, assumptions about what optimization options to use. For example, setting the compiler to use more aggressive loop optimizations could result in slower code. Also, if possible, try different compilers, as these are optimized in different ways, and some are markedly superior. Your profiler can tell you what effect any change has.

Memory Issues

Years ago the number of arithmetic instructions was the key measure of an algorithm’s efficiency; now the key is memory access patterns. Processor speed has increased much more rapidly than the data transfer rate for DRAM, which is limited by the pin count. Between 1980 and 2005, CPU performance doubled about every two years, and DRAM performance doubled about every six [1060]. This problem is known as the Von Neumann bottleneck or the memory wall. Data-oriented design focuses on cache coherency as a means of optimization.

On modern GPUs, what matters is the distance traveled by data. Speed and power costs are proportional to this distance. Cache access patterns can make up to an order-of-magnitude performance difference [1206]. A cache is a small fast-memory area that exists because there is usually much coherence in a program, which the cache can exploit. That is, nearby locations in memory tend to be accessed one after another (spatial locality), and code is often accessed sequentially. Also, memory locations tend to be accessed repeatedly (temporal locality), which the cache also exploits [389]. Processor caches are fast to access, second only to registers for speed. Many fast algorithms work to access data as locally (and as little) as possible.
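
As a concrete illustration of spatial locality, the following C++ sketch (ours, not from the original text) sums the same two-dimensional array twice. The row-major loop walks memory sequentially and stays in cache; the column-major loop strides by a full row per access and misses far more often, even though both do identical arithmetic.

const int N = 1024;
static float grid[N][N];

float sumRowMajor() {
    float sum = 0.0f;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += grid[i][j]; // sequential access: cache- and prefetch-friendly
    return sum;
}

float sumColumnMajor() {
    float sum = 0.0f;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += grid[i][j]; // strided access: a new cache line per element
    return sum;
}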

Registers and local caches form one end of the memory hierarchy, which extends next to dynamic random access memory (DRAM), then to storage on SSDs and hard disks. At the top are small amounts of fast, expensive memory, at the bottom are large amounts of slow and inexpensive storage. Between each level of the hierarchy the speed drops by some noticeable factor. See Figure 18.1. For example, processor registers are usually accessed in one clock cycle, while L1 cache memory is accessed in a few cycles. Each change in level has an increase in latency in this way. As discussed in Section 3.10, sometimes latency can be hidden by the architecture, but it is always a factor that must be kept in mind.

Figure 18.1. The memory hierarchy. Speed and cost decrease as we descend the pyramid. 

Bad memory access patterns are difficult to detect directly in a profiler. Good patterns need to be built into the design from the start [1060]. Below is a list of pointers to keep in mind while programming.

• Data that is accessed sequentially in the code should also be stored sequentially in memory. For example, when rendering a triangle mesh, store texture coordinate #0, normal #0, color #0, vertex #0, texture coordinate #1, and normal #1 sequentially in memory if they are accessed in that order. This can also be important on the GPU, as with the post-transform vertex cache (Section 16.4.4). Also see Section 16.4.5 for why storing separate streams of data can be beneficial.

• Avoid pointer indirection, jumps, and function calls (in critical parts of the code), as these may significantly decrease CPU performance. You get pointer indirection when you follow a pointer to another pointer, and so on. Modern CPUs try to speculatively execute instructions (branch prediction) and fetch memory (cache prefetching) to keep all their functional units busy running code. These techniques are highly effective when the code flow is consistent in a loop, but fail with branching data structures such as binary trees, linked lists, and graphs; use arrays instead, where possible. McVoy and Staelin [1194] show a code example that follows a linked list through pointers. This causes cache misses for data both before and after, and their example stalls the CPU more than 100 times longer than it takes to follow the pointer (if the cache could provide the address of the pointer). Smits [1668] notes how flattening a pointer-based tree into a list with skip pointers considerably improves hierarchy traversal. Using a van Emde Boas layout is another way to help avoid cache misses—see Section 19.1.4. High-branching trees are often preferable to binary trees because they reduce the tree depth and so reduce the amount of indirection.
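
To make the contrast concrete, here is a minimal sketch (ours, with hypothetical types) of the same summation over a linked list and over an array. Each list hop is a dependent load that may stall on memory; the array walk is sequential and prefetch-friendly.

struct Node { float value; Node* next; };

float sumList(const Node* n) {
    float sum = 0.0f;
    for (; n != nullptr; n = n->next) // each hop chases a pointer
        sum += n->value;
    return sum;
}

float sumArray(const float* values, int count) {
    float sum = 0.0f;
    for (int i = 0; i < count; i++)   // contiguous memory
        sum += values[i];
    return sum;
}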

• Aligning frequently used data structures to multiples of the cache line size can significantly improve overall performance. For example, 64-byte cache lines are common on Intel and AMD processors [1206]. Compiler options can help, but it is wise to design your data structures with alignment, called padding, in mind. Tools such as VTune and CodeAnalyst for Windows and Linux, Instruments for the Mac, and the open-source Valgrind for Linux can help identify caching bottlenecks. Alignment can also affect GPU shader performance [331].
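
For example, a structure can be padded out to exactly one 64-byte cache line with standard C++ alignas, so that an array of them never straddles line boundaries. The Particle type here is hypothetical:

struct alignas(64) Particle {
    float position[3];
    float velocity[3];
    float age;
    // 28 bytes of payload; the compiler pads the struct to 64 bytes
};
static_assert(sizeof(Particle) == 64, "one cache line per particle");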

• Try different organizations of data structures. For example, Hecker [698] shows how a surprisingly large amount of time was saved by testing a variety of matrix structures for a simple matrix multiplier. An array of structures,

struct Vertex { float x, y, z; };
Vertex myvertices[1000];

or a structure of arrays,

struct VertexChunk { float x[1000], y[1000], z[1000]; };
VertexChunk myvertices;

may work better for a given architecture. This second structure is better for using SIMD commands, but as the number of vertices goes up, the chance of a cache miss increases. As the array size increases, a hybrid scheme,

struct Vertex4 { float x[4], y[4], z[4]; };
Vertex4 myvertices[250];

may be the best choice.

• It is often better to allocate a large pool of memory at start-up for objects of the same size, and then use your own allocation and free routines for handling the memory of that pool [113, 736]. Libraries such as Boost provide pool allocation. A set of contiguous records is more likely to be cache coherent than those created by separate allocations. That said, for languages with garbage collection, such as C# and Java, pools can actually reduce performance.

While not directly related to memory access patterns, it is worthwhile to avoid allocating or freeing memory within the rendering loop. Use pools and allocate scratch space once, and have stacks, arrays, and other structures only grow (using a variable or flags to note which elements should be treated as deleted).
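
A minimal fixed-size pool in this spirit might look as follows. This is an illustrative sketch, not a production allocator: one block is allocated up front, a free list is threaded through the unused slots, and allocate/free never touch the heap afterward.

#include <cstddef>
#include <new>
#include <vector>

template <typename T>
class Pool {
public:
    explicit Pool(size_t capacity) : storage(capacity), freeHead(nullptr) {
        for (size_t i = 0; i < capacity; i++) { // thread the free list
            storage[i].next = freeHead;
            freeHead = &storage[i];
        }
    }
    T* allocate() { // returns nullptr when the pool is exhausted
        if (freeHead == nullptr) return nullptr;
        Slot* s = freeHead;
        freeHead = s->next;
        return new (&s->object) T(); // placement-new into the slot
    }
    void free(T* object) {
        object->~T();
        Slot* s = reinterpret_cast<Slot*>(object);
        s->next = freeHead;
        freeHead = s;
    }
private:
    union Slot { // a slot holds either a live T or a free-list link
        T object;
        Slot* next;
        Slot() : next(nullptr) {}
        ~Slot() {}
    };
    std::vector<Slot> storage;
    Slot* freeHead;
};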

18.4.2 API Calls

Throughout this book we have given advice based on general trends in hardware. For example, indexed vertex buffer objects are usually the fastest way to provide the accelerator with geometric data (Section 16.4.5). This section is about how to best call the graphics API itself. Most graphics APIs have similar architectures, and there are well-established ways of using them efficiently.

Understanding object buffer allocation and storage is basic to efficient rendering [1679]. For a desktop system with a CPU and a separate, discrete GPU, each normally has its own memory. The graphics driver is usually in control of where objects reside, but it can be given hints of where best to store them. A common classification is static versus dynamic buffers. If the buffer’s data are changing each frame, using a dynamic buffer, which requires no permanent storage space on the GPU, is preferable. Consoles, laptops with low-power integrated GPUs, and mobile devices usually have unified memory, where the GPU and CPU share the same physical memory. Even in these setups, allocating a resource in the right pool matters. Correctly tagging a resource as CPU-only or GPU-only can still yield benefits. In general, if a memory area has to be accessed by both chips, when one writes to it the other has to invalidate its caches—an expensive operation—to be sure not to get stale data.

If an object is not deforming, or the deformations can be carried out entirely by shader programs (e.g., skinning), then it is profitable to store the data for the object in GPU memory. The unchanging nature of this object can be signaled by storing it as a static buffer. In this way, it does not have to be sent across the bus for every frame rendered, thus avoiding any bottleneck at this stage of the pipeline. The internal memory bandwidth on a GPU is normally much higher than the bus between CPU and GPU.
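
In OpenGL, for example, this intent is expressed through the usage hint passed to glBufferData. A hedged fragment, assuming an existing GL context; staticVerts and frameVerts are hypothetical vertex arrays:

GLuint staticVBO, dynamicVBO;
glGenBuffers(1, &staticVBO);
glBindBuffer(GL_ARRAY_BUFFER, staticVBO);
glBufferData(GL_ARRAY_BUFFER, sizeof(staticVerts), staticVerts,
             GL_STATIC_DRAW);  // upload once, draw many times

glGenBuffers(1, &dynamicVBO);
glBindBuffer(GL_ARRAY_BUFFER, dynamicVBO);
glBufferData(GL_ARRAY_BUFFER, sizeof(frameVerts), nullptr,
             GL_DYNAMIC_DRAW); // respecified every frame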

State Changes

Calling the API has several costs associated with it. On the application side, more calls mean more application time spent, regardless of what the calls actually do. This cost can be minimal or noticeable, and a null driver can help identify it. Query functions that depend on values from the GPU can potentially halve the frame rate due to stalls from synchronization with the CPU [1167]. Here we will delve into optimizing a common graphics operation, preparing the pipeline to draw a mesh. This operation may involve changing the state, e.g., setting the shaders and their uniforms, attaching textures, changing the blend state or the color buffer used, and so on.

A major way for the application to improve performance is to minimize state changes by grouping objects with a similar rendering state. Because the GPU is an extremely complex state machine, perhaps the most complex in computer science, changing the state can be expensive. While a little of the cost can involve the GPU, most of the expense is from the driver’s execution on the CPU. If the GPU maps well to the API, the state change cost tends to be predictable, though still significant. If the GPU has a tight power constraint or limited silicon footprint, such as with some mobile devices, or has a hardware bug to work around, the driver may have to perform heroics that cause unexpectedly high costs. State change costs are mostly on the CPU side, in the driver.

One concrete example is how the PowerVR architecture supports blending. In older APIs blending is specified using a fixed-function type of interface. PowerVR’s blending is programmable, which means that their driver has to patch the current blend state into the pixel shader [699]. In this case a more advanced design does not map well to the API and so incurs a significant setup cost in the driver. While throughout this chapter we note that hardware architecture and the software running it can affect the importance of various optimizations, this is particularly true for state change costs. Even the specific GPU type and driver release may have an effect. While reading, please imagine the phrase “your mileage may vary” stamped in large red letters over every page of this section.

Everitt and McDonald [451] note that different types of state changes vary considerably in cost, and give some rough idea as to how many times a second a few could be performed on an NVIDIA OpenGL driver. Here is their order, from most expensive to least, as of 2014:

• Render target (framebuffer object), ∼60k/sec.
• Shader program, ∼300k/sec.
• Blend mode (ROP), such as for transparency.
• Texture bindings, ∼1.5M/sec.
• Vertex format.
• Uniform buffer object (UBO) bindings.
• Vertex bindings.
• Uniform updates, ∼10M/sec.

This approximate cost order is borne out by others [488, 511, 741]. One even more expensive change is switching between the GPU’s rendering mode and its compute shader mode [1971]. Avoiding state changes can be achieved by sorting the objects to be displayed by grouping them by shader, then by textures used, and so on down the cost order. Sorting by state is sometimes called batching.
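
One common implementation is to pack the most expensive state into the most significant bits of a per-draw sort key, so that a single sort produces the desired grouping. A sketch, with hypothetical state identifiers:

#include <algorithm>
#include <cstdint>
#include <vector>

struct DrawItem {
    uint16_t shaderID;       // most expensive change: highest bits
    uint16_t textureID;
    uint16_t vertexFormatID; // cheapest change: lowest bits
    uint64_t SortKey() const {
        return (uint64_t(shaderID) << 32) |
               (uint64_t(textureID) << 16) |
                uint64_t(vertexFormatID);
    }
};

void SortForBatching(std::vector<DrawItem>& items) {
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return a.SortKey() < b.SortKey();
              });
}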

Another strategy is to restructure the objects’ data so that more sharing occurs. A common way to minimize texture binding changes is to put several texture images into one large texture or, better yet, a texture array. If the API supports it, bindless textures are another option to avoid state changes (Section 6.2.5). Changing the shader program is usually relatively expensive compared to updating uniforms, so variations within a class of materials may be better represented by a single shader that uses “if” statements. You might also be able to make larger batches by sharing a shader [1609]. Making shaders more complex can also lower performance on the GPU, however. Measuring to see what is effective is the only foolproof way to know.

Making fewer, more effective calls to the graphics API can yield some additional savings. For example, often several uniforms can be defined and set as a group, so binding a single uniform buffer object is considerably more efficient [944]. In DirectX these are called constant buffers. Using these properly saves both time per function and time spent error-checking inside each individual API call [331, 613].
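
As a hedged OpenGL fragment of this idea, per-frame uniforms can be filled and bound in one update. The block layout and the BuildFrameConstants helper are hypothetical, and the struct must match the shader's uniform block layout:

struct FrameConstants { // mirrors a std140 uniform block in the shader
    float viewProj[16];
    float cameraPos[4];
    float timeAndPad[4]; // padded to 16 bytes for std140 rules
};

GLuint ubo;
glGenBuffers(1, &ubo);
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferData(GL_UNIFORM_BUFFER, sizeof(FrameConstants), nullptr,
             GL_DYNAMIC_DRAW);

// Once per frame: one update and one bind replace many glUniform* calls.
FrameConstants fc = BuildFrameConstants(); // hypothetical helper
glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(fc), &fc);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);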

Modern drivers often defer setting state until the first draw call encountered. If redundant API calls are made before then, the driver will filter these out, thus avoiding the need to perform a state change. Often a dirty flag is used to note that a state change is needed, so going back to a base state after each draw call may become costly. For example, you may want to assume state X is off by default when you are about to draw an object. One way to achieve this is “Enable(X); Draw(M1); Disable(X);” then “Enable(X); Draw(M2); Disable(X);” thus restoring the state after each draw. However, it is also likely to waste significant time setting the state again between the two draw calls, even though no actual state change occurs between them.
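
A simple application-side guard is to cache the last state set and skip the call when nothing changes; whether this filtering is worth doing in the application depends on the driver, as discussed next. A minimal sketch, using OpenGL calls for concreteness:

struct StateCache {
    bool blendEnabled = false;
    void SetBlend(bool enable) {
        if (enable == blendEnabled) return; // skip the redundant API call
        blendEnabled = enable;
        if (enable) glEnable(GL_BLEND);
        else        glDisable(GL_BLEND);
    }
};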

Usually the application has higher-level knowledge of when a state change is needed. For example, changing from a “replace” blending mode for opaque surfaces to an “over” mode for transparent ones normally needs to be done once during the frame. Issuing the blend mode before rendering each object can easily be avoided. Galeano [511] shows how ignoring such filtering and issuing unneeded state calls would have cost their WebGL application up to nearly 2 ms/frame. However, if the driver already does such redundancy filtering efficiently, performing this same testing per call in the application can be a waste. How much effort to spend filtering out API calls primarily depends on the underlying driver [443, 488, 741].

Consolidating and Instancing

Using the API efficiently avoids having the CPU be the bottleneck. One other concern with the API is the small batch problem. If ignored, this can be a significant factor affecting performance in modern APIs. Simply put, a few triangle-filled meshes are much more efficient to render than many small, simple ones. This is because there is a fixed-cost overhead associated with each draw call, a cost paid for processing a primitive, regardless of size.

Back in 2003, Wloka [1897] showed that drawing two (relatively small) triangles per batch was a factor of 375 away from the maximum throughput for the GPU tested. Instead of 150 million triangles per second, the rate was 0.4 million, for a 2.7 GHz CPU. For a scene consisting of many small and simple objects, each with only a few triangles, performance is entirely CPU-bound by the API; the GPU has no ability to increase it. That is, the processing time on the CPU for the draw call is greater than the amount of time the GPU takes to actually draw the mesh, so the GPU is starved.

Wloka’s rule of thumb is that “You get X batches per frame.” This is a maximum number of draw calls you can make per frame, purely due to the CPU being the limiting factor. In 2003, the breakpoint where the API was the bottleneck was about 130 triangles per object. Figure 18.2 shows how the breakpoint rose in 2006 to 510 triangles per mesh. Times have changed. Much work was done to ameliorate this draw call problem, and CPUs became faster. The recommendation back in 2003 was 300 draw calls per frame; in 2012, 16,000 draw calls per frame was one team’s ceiling [1381]. That said, even this number is not enough for some complicated scenes. With modern APIs such as DirectX 12, Vulkan, and Metal, the driver cost may itself be minimized—this is one of their major advantages [946]. However, the GPU can have its own fixed costs per mesh.

Figure 18.2. Batch performance benchmarks for an Intel Core 2 Duo 2.66 GHz CPU using an NVIDIA G80 GPU, running DirectX 10. Batches of varying size were run and timed under different conditions. The “Low” conditions are for triangles with just the position and a constant-color pixel shader; the other set of tests is for reasonable meshes and shading. “Single” is rendering a single batch many times. “Instancing” reuses the mesh data and puts the per-instance data in a separate stream. “Constants” is a DirectX 10 method where instance data are put in constant memory. As can be seen, small batches hurt all methods, but instancing gives proportionally much faster performance. At a few hundred triangles, performance levels out, as the bottleneck becomes how fast vertices are retrieved from the vertex buffer and caches. (Graph courtesy of NVIDIA Corporation.)

One way to reduce the number of draw calls is to consolidate several objects into a single mesh, which needs only one draw call to render the set. For sets of objects that use the same state and are static, at least with respect to one another, consolidation can be done once and the batch can be reused each frame [741, 1322]. Being able to consolidate meshes is another reason to consider avoiding state changes by using a common shader and texture-sharing techniques. The cost savings from consolidation are not just from avoiding API draw calls. There are also savings from the application itself handling fewer objects. However, having batches that are considerably larger than needed can make other algorithms, such as frustum culling, be less effective [1381]. One practice is to use a bounding volume hierarchy to help find and group static objects that are near each other. Another concern with consolidation is selection, since all the static objects are undifferentiated, in one mesh. A typical solution is to store an object identifier at each vertex in the mesh.

The other approach to minimize application processing and API costs is to use some form of instancing [232, 741, 1382]. Most APIs support the idea of having an object and drawing it several times in a single call. This is typically done by specifying a base model and providing a separate data structure that holds information about each specific instance desired. Beyond position and orientation, other attributes could be specified per instance, such as leaf colors or curvature due to the wind, or anything else that could be used by shader programs to affect the model. Lush jungle scenes can be created by liberal use of instancing. See Figure 18.3. Crowd scenes are a good fit for instancing, with each character appearing unique by selecting different body parts from a set of choices. Further variation can be added by random coloring and decals. Instancing can also be combined with level of detail techniques [122, 1107, 1108]. See Figure 18.4 for an example.
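
With OpenGL-style instancing, for example, the mesh is stored once and a second buffer holds the per-instance attributes, advanced once per instance rather than once per vertex. A hedged fragment; buffer setup is omitted, and treeMeshVAO, instanceVBO, indexCount, and instanceCount are hypothetical:

glBindVertexArray(treeMeshVAO);

// Attribute 3 sources per-instance data, e.g., a position and wind phase.
glBindBuffer(GL_ARRAY_BUFFER, instanceVBO);
glEnableVertexAttribArray(3);
glVertexAttribPointer(3, 4, GL_FLOAT, GL_FALSE, sizeof(float) * 4, nullptr);
glVertexAttribDivisor(3, 1); // advance once per instance, not per vertex

// One call draws every copy of the mesh.
glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT,
                        nullptr, instanceCount);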

Figure 18.3. Vegetation instancing. All objects the same color in the lower image are rendered in a single draw call [1869]. (Image from CryEngine1, courtesy of Crytek.) 

Figure 18.4. Crowd scene. Using instancing minimizes the number of draw calls needed. Level of detail techniques are also used, such as rendering impostors for distant models [1107, 1108]. (Image courtesy of Jonathan Maïm, Barbara Yersin, Mireille Clavien, and Daniel Thalmann.)

A concept that combines consolidation and instancing is called merge-instancing, where a consolidated mesh contains objects that may in turn be instanced [146, 1382]. In theory, the geometry shader can be used for instancing, as it can create duplicate data of an incoming mesh. In practice, if many instances are needed, this method can be slower than using instancing API commands. The intent of the geometry shader is to perform local, small-scale amplification of data [1827]. In addition, for some architectures, such as Mali’s tile-based renderer, the geometry shader is implemented in software. To quote Mali’s best practices guide [69], “Find a better solution to your problem. Geometry shaders are not your solution.”

18.4.3 Geometry Stage

The geometry stage is responsible for transforms, per-vertex lighting, clipping, projection, and screen mapping. Other chapters discuss ways to reduce the amount of data flowing through the pipeline. Efficient triangle mesh storage, model simplification, and vertex data compression (Chapter 16) all save both processing time and memory. Techniques such as frustum and occlusion culling (Chapter 19) avoid sending the full primitive itself down the pipeline. Adding such large-scale techniques on the CPU can entirely change performance characteristics of the application and so are worth trying early on in development. On the GPU such techniques are less common. One notable example is how the compute shader can be used to perform various types of culling [1883, 1884].

The effects of lighting can be computed per vertex, per pixel (in the pixel processing stage), or both. Lighting computations can be optimized in several ways. First, the types of light sources being used should be considered. Is lighting needed for all triangles? Sometimes a model only requires texturing, texturing with colors at the vertices, or simply colors at the vertices.

If light sources are static with respect to geometry, then the diffuse and ambient lighting can be precomputed and stored as colors at the vertices. Doing so is often referred to as “baking on” the lighting. A more elaborate form of prelighting is to precompute the diffuse global illumination in a scene (Section 11.5.1). Such illumination can be stored as colors or intensities at the vertices or as light maps.

For forward rendering systems the number of light sources influences the performance of the geometry stage. More light sources means more computation. A common way to lessen work is to disable or trim down local lighting and instead use an environment map (Section 10.5).

18.4.4 Rasterization Stage

Rasterization can be optimized in a few ways. For closed (solid) objects and for objects that will never show their backfaces (for example, the back side of a wall in a room), backface culling should be turned on (Section 19.3). This reduces the number of triangles to be rasterized by about half and so reduces the load on triangle traversal. In addition, this can be particularly beneficial when the pixel shading computation is expensive, as backfaces are then never shaded.
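
In OpenGL, backface culling takes only a few state settings; a minimal fragment (the values shown are the defaults):

glEnable(GL_CULL_FACE);
glCullFace(GL_BACK);  // discard back-facing triangles
glFrontFace(GL_CCW);  // counterclockwise winding is front-facing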

18.4.5 Pixel Processing Stage

Optimizing pixel processing is often profitable, since usually there are many more pixels to shade than vertices. There are notable exceptions. Vertices always have to be processed, even if a draw ends up not generating any visible pixels. Ineffective culling in the rendering engine might make the vertex shading cost exceed pixel shading. Too small a triangle not only causes more vertex shading evaluation than may be needed, but also can create more partially covered quads that cause additional work. More important, textured meshes that cover only a few pixels often have low thread occupancy rates. As discussed in Section 3.10, there is a large time cost in sampling a texture, which the GPU hides by switching to execute shader programs on other fragments, returning later when the texture data has been fetched. Low occupancy can result in poor latency hiding. Complex shaders that use a high number of registers can also lead to low occupancy by allowing fewer threads to be available at one time (Section 23.3). This condition is referred to as high register pressure. There are other subtleties, e.g., frequent switching to other warps may cause more cache misses. Wronski [1911, 1914] discusses various occupancy problems and solutions.

To begin, use native texture and pixel formats, i.e., use the formats that the graphics accelerator uses internally, to avoid a possible expensive transform from one format to another [278]. Two other texture-related techniques are loading only the mipmap levels needed (Section 19.10.1) and using texture compression (Section 6.2.6). As usual, smaller and fewer textures mean less memory used, which in turn means lower transfer and access times. Texture compression also can improve cache performance, since the same amount of cache memory is occupied by more pixels.

One level of detail technique is to use different pixel shader programs, depending on the distance of the object from the viewer. For example, with three flying saucer models in a scene, the closest might have an elaborate bump map for surface details that the two farther away do not need. In addition, the farthest saucer might have specular highlighting simplified or removed altogether, both to simplify computations and to reduce “fireflies,” i.e., sparkle artifacts from undersampling. Using a color per vertex on simplified models can give the additional benefit that no state change is needed due to the texture changing.

The pixel shader is invoked only if the fragment is visible at the time the triangle is rasterized. The GPU’s early-z test (Section 23.7) checks the z-depth of the fragment against the z-buffer. If not visible, the fragment is discarded without any pixel shader evaluation, saving considerable time. While the z-depth can be modified by the pixel shader, doing so means that early-z testing cannot be performed.

To understand the behavior of a program, and especially the load on the pixel processing stage, it is useful to visualize the depth complexity, which is the number of surfaces that cover a pixel. Figure 18.5 shows an example. One simple method of generating a depth complexity image is to use a call like OpenGL’s glBlendFunc(GL_ONE, GL_ONE), with z-buffering disabled. First, the image is cleared to black. All objects in the scene are then rendered with the color (1/255, 1/255, 1/255). The effect of the blend function setting is that for each primitive rendered, the values of the written pixels will increase by one intensity level. A pixel with a depth complexity of 0 is then black and a pixel of depth complexity 255 is full white, (255, 255, 255).
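
The procedure just described, as a hedged OpenGL fragment; DrawScene and the constant-color shader setup are assumed:

glClearColor(0.0f, 0.0f, 0.0f, 1.0f);
glClear(GL_COLOR_BUFFER_BIT);
glDisable(GL_DEPTH_TEST);              // count every surface, visible or not
glEnable(GL_BLEND);
glBlendFunc(GL_ONE, GL_ONE);           // additive: each write adds 1/255
SetShaderConstantColor(1.0f / 255.0f); // hypothetical helper
DrawScene();                           // hypothetical scene traversal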

Figure 18.5. The depth complexity of the scene on the left is shown on the right. (Images created using NVPerfHUD from NVIDIA Corporation.) 

The amount of pixel overdraw is related to how many surfaces actually were rendered. The number of times the pixel shader is evaluated can be found by rendering the scene again, but with the z-buffer enabled. Overdraw is the amount of effort wasted computing a shade for a surface that is then hidden by a later pixel shader invocation. An advantage of deferred rendering (Section 20.1), and ray tracing for that matter, is that shading is performed after all visibility computations are performed.

Say two triangles cover a pixel, so the depth complexity is two. If the farther triangle is drawn first, the nearer triangle overdraws it, and the amount of overdraw is one. If the nearer is drawn first, the farther triangle fails the depth test and is not drawn, so there is no overdraw. For a random set of opaque triangles covering a pixel, the average number of draws for a depth complexity of n is the harmonic series [296]:

H(n) = 1 + 1/2 + 1/3 + . . . + 1/n.

The logic behind this is that the first triangle rendered is one draw. The second triangle is either in front of or behind the first, a 50/50 chance. The third triangle can have one of three positions compared to the first two, giving one chance in three of it being frontmost. As n goes to infinity,

H(n) → ln n + γ,

where γ = 0.57721... is the Euler-Mascheroni constant. Overdraw rises rapidly when depth complexity is low, but quickly tapers off. For example, a depth complexity of 4 gives an average of 2.08 draws, 11 gives 3.02 draws, but it takes a depth complexity of 12,367 to reach an average of 10.00 draws.
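
These averages are easy to reproduce; a few lines of C++ evaluate the harmonic numbers quoted above:

#include <cstdio>

double HarmonicNumber(int n) { // expected draws for depth complexity n
    double h = 0.0;
    for (int i = 1; i <= n; i++) h += 1.0 / i;
    return h;
}

int main() {
    printf("%.2f %.2f %.2f\n", HarmonicNumber(4), HarmonicNumber(11),
           HarmonicNumber(12367)); // prints 2.08 3.02 10.00
}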

So, overdraw is not necessarily as bad as it seems, but we would still like to minimize it, without costing too much CPU time. Roughly sorting and then drawing the opaque objects in a scene in an approximate front-to-back order (near to far) is a common way to reduce overdraw [240, 443, 488, 511]. Occluded objects that are drawn later will not write to the color or z-buffers (i.e., overdraw is reduced). Also, the pixel fragment can be rejected by occlusion culling hardware before even reaching the pixel shader program (Section 23.5). Sorting can be accomplished by any number of methods. An explicit sort based on the distance along the view direction of the centroids of all opaque objects is one simple technique. If a bounding volume hierarchy or other spatial structure is already in use for frustum culling, we can choose the closer child to be traversed first, on down the hierarchy.
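
As a sketch of the explicit centroid sort, assuming a hypothetical DrawItem record whose depth along the view direction was computed during culling:

```c
#include <stdlib.h>

/* Hypothetical per-object draw record; 'depth' is the centroid's
   distance along the view direction. */
typedef struct {
    float depth;
    int   objectId;
} DrawItem;

static int compareNearToFar(const void *a, const void *b) {
    float da = ((const DrawItem *)a)->depth;
    float db = ((const DrawItem *)b)->depth;
    return (da > db) - (da < db);   /* ascending: near first */
}

/* Sort opaque draw calls front to back before submission. */
void sortOpaque(DrawItem *items, size_t count) {
    qsort(items, count, sizeof(DrawItem), compareNearToFar);
}
```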

Another technique can be useful for surfaces with complex pixel shader programs. Performing a z-prepass renders the geometry to just the z-buffer first, then the whole scene is rendered normally [643]. This eliminates all overdraw shader evaluations, but at the cost of an entire separate run through all the geometry. Pettineo [1405] writes that the primary reason his team used a depth prepass in their video game was to avoid overdraw. However, drawing in a rough front-to-back order may provide much of the same benefit without the need for this extra work. A hybrid approach is to identify and first draw just a few large, simple occluders likely to give the most benefit [1768]. As McGuire [1177] notes, a full-draw prepass did not help performance for his particular system. Measuring is the only way to know which technique, if any, is most effective for your application.
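
A minimal OpenGL sketch of a z-prepass might look as follows; drawScene() is again a stand-in for the application's draw submission, and in a real engine the prepass would use a null or very cheap pixel shader:

```c
/* Pass 1: depth only -- disable color writes, fill the z-buffer. */
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LESS);
glDepthMask(GL_TRUE);
drawScene();

/* Pass 2: normal shading -- only fragments matching the stored
   depth survive, so each pixel runs its expensive shader once. */
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthFunc(GL_EQUAL);
glDepthMask(GL_FALSE);            /* depth is already written */
drawScene();
```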

Earlier we recommended grouping by shader and texture to minimize state changes; here we talk about rendering objects sorted by distance. These two goals usually give different object draw orders and so conflict with each other. There is always some ideal draw order for a given scene and viewpoint, but this is difficult to find in advance. Hybrid schemes are possible, e.g., sorting nearby objects by depth and sorting everything else by material [1433]. A common, flexible solution [438, 488, 511, 1434, 1882] is to create a sorting key for each object that encapsulates all the relevant criteria by assigning each a set of bits. See Figure 18.6.

Figure 18.6. Example sort key for draw order. Keys are sorted from low to high. Setting the transparency bit means that the object is transparent, as transparent objects are to be rendered after all opaque objects. The object's distance from the camera is stored as an integer with low precision. For transparent objects the distance is reversed or negated, since we want objects in a back-to-front order. Shaders are each given a unique identification number, as are textures.

We can choose to favor sorting by distance, but by limiting the number of bits storing the depth, we can allow grouping by shader to become relevant for objects in a given range of distances. It is not uncommon to sort draws into even as few as two or three depth partitions. If some objects have the same depth and use the same shader, then the texture identifier is used to sort the objects, which then groups objects with the same texture together.

This is a simple example and is situational, e.g., the rendering engine may itself keep opaque and transparent objects separate, so that the transparency bit is not necessary. The number of bits for the other fields certainly varies with the maximum number of shaders and textures expected. Other fields may be added or substituted in, such as one for blend state and another for z-buffer read and write. Most important of all is the architecture. For example, some tile-based GPU renderers on mobile devices do not gain anything from sorting front to back, so state sorting is the only important element to optimize [1609]. The main idea here is that putting all attributes into a single integer key lets you perform an efficient sort, minimizing overdraw and state changes as much as possible.
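
As an illustration, a key along the lines of Figure 18.6 could be packed as below; the field widths (1 transparency bit, 4 quantized depth bits, 16-bit shader and texture identifiers) are arbitrary choices for this sketch, not the book's layout:

```c
#include <stdint.h>

/* Hypothetical sort-key layout, most significant bits first:
   [ 1 bit transparency | 4 bits quantized depth |
     16 bits shader id  | 16 bits texture id ]
   viewDepth is assumed to lie in [0, maxDepth]. */
uint64_t makeSortKey(int transparent, float viewDepth, float maxDepth,
                     uint16_t shaderId, uint16_t textureId) {
    uint64_t depthBits = (uint64_t)(15.0f * viewDepth / maxDepth) & 0xF;
    if (transparent)
        depthBits = 15 - depthBits;  /* reverse: back-to-front */
    return ((uint64_t)(transparent != 0) << 36) |
           (depthBits << 32) |
           ((uint64_t)shaderId << 16) |
           (uint64_t)textureId;
}
```

Sorting the draw list by this one integer then gives transparent-after-opaque ordering, a rough front-to-back order for opaque objects, and grouping by shader and texture within each depth slice.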

18.4.6 Framebuffer Techniques

Rendering a scene often incurs a vast number of accesses to the framebuffer and many pixel shader executions. To reduce the pressure on the cache hierarchy, a common piece of advice is to reduce the storage size of each pixel of the framebuffer. While a 16-bit floating point value per color channel provides more accuracy, an 8-bit value is half the size, which means faster accesses, assuming that the accuracy is sufficient. The chrominance is often subsampled in many image and video compression schemes, such as JPEG and MPEG. This can often be done with negligible visual effect due to the fact that the human visual system is more sensitive to luminance than to chrominance. For example, the Frostbite game engine [1877] uses this idea of chroma subsampling to reduce bandwidth costs for post-processing its 16-bits-per-channel images.

Mavridis and Papaioannou [1144] propose that the lossy YCoCg transform, described on page 197, be used to achieve a similar effect for the color buffer during rasterization. Their pixel layout is shown in Figure 18.7. Compared to RGBA, this halves the color buffer storage requirements (assuming A is not needed) and often increases performance, depending on architecture. Since each pixel has only one of the chrominance components, a reconstruction filter is needed to infer a full YCoCg per pixel before converting back to RGB for display. For a pixel missing the Co-value, for example, the average of the four closest Co-values can be used. However, this does not reconstruct edges as well as desired. Therefore, a simple edge-aware filter is used instead, which is implemented as

Figure 18.7. Left: 4 × 2 pixels, each storing four color components (RGBA). Right: an alternative representation where each pixel stores the luminance, Y, and either the first (Co) or the second (Cg) chrominance component, in a checkerboard pattern.

Co = (Σi wi Co,i) / (Σi wi),   wi = step(t − |Li − L|),

for a pixel that does not have Co, where Co,i and Li are the values to the left, right, top, and bottom of the current pixel, L is the luminance of the current pixel, and t is a threshold value for edge detection. Mavridis and Papaioannou used t = 30/255. The step(x) function is 0 if x < 0, and 1 otherwise. Hence, the filter weights wi are either 0 or 1, where they are zero if the luminance gradient, |Li − L|, is greater than t. A WebGL demo with source code is available online [1144].
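
A CPU-side sketch of this reconstruction follows (on the GPU it would be a few lines in a pixel shader); reconstructChroma is a hypothetical name, and the fall-back to a plain average when all weights are zero is a choice made for this sketch:

```c
/* Edge-aware reconstruction of a missing chrominance value from the
   four axis neighbors. co[] and lum[] hold the neighbors' chrominance
   and luminance (left, right, top, bottom); L is the current pixel's
   luminance; t is the edge-detection threshold (e.g., 30.0f/255.0f). */
float reconstructChroma(const float co[4], const float lum[4],
                        float L, float t) {
    float sum = 0.0f, wsum = 0.0f;
    for (int i = 0; i < 4; ++i) {
        /* w = step(t - |Li - L|): 1 if the luminance gradient is at
           most t (no edge toward this neighbor), 0 otherwise. */
        float diff = lum[i] - L;
        float grad = (diff < 0.0f) ? -diff : diff;
        float w = (t - grad < 0.0f) ? 0.0f : 1.0f;
        sum  += w * co[i];
        wsum += w;
    }
    /* If every neighbor lies across an edge, fall back to the average. */
    return (wsum > 0.0f) ? sum / wsum
                         : 0.25f * (co[0] + co[1] + co[2] + co[3]);
}
```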

Because of the continuing increase in display resolutions and the savings in shader execution costs, rendering with a checkerboard pattern has been used in several systems [231, 415, 836, 1885]. For virtual reality applications, Vlachos [1824] uses a checkerboard pattern for pixels around the periphery of the view, and Answer [59] reduces each 2 × 2 quad by one to three samples.

18.4.7 Merging Stage

Make sure to enable blend modes only when useful. In theory “over” compositing could be set for every triangle, opaque or transparent, since opaque surfaces using “over” will fully overwrite the value in the pixel. However, this is more costly than a simple “replace” raster operation, so tracking objects with cutout texturing and materials with transparency is worthwhile. Alternately, there are some raster operations that cost nothing extra. For example, when the z-buffer is being used, on some systems it costs no additional time to also access the stencil buffer. This is because the 8-bit stencil buffer value is stored in the same word as the 24-bit z-depth value [890].

Thinking through when various buffers need to be used or cleared is worthwhile. Since GPUs have fast clear mechanisms (Section 23.5), the recommendation is to always clear both color and depth buffers, as that increases the efficiency of memory transfers for these buffers.
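
In OpenGL this amounts to a single combined clear at the start of the frame:

```c
/* Clear color and depth together so the GPU's fast-clear path
   can handle both buffers in one operation. */
glClearColor(0.0f, 0.0f, 0.0f, 1.0f);
glClearDepth(1.0);
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
```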

You should normally avoid reading back render targets from the GPU to the CPU if you can help it. Any framebuffer access by the CPU causes the entire GPU pipeline to be flushed before the rendering is returned, losing all parallelism there [1167, 1609].
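
If a readback truly cannot be avoided, one common mitigation is to make it asynchronous with a pixel buffer object, as in the sketch below (OpenGL 2.1+ assumed; width and height are placeholder framebuffer dimensions, and the map is deferred by a frame or two so the GPU can finish the copy without stalling):

```c
int width = 1280, height = 720;   /* assumed framebuffer size */

/* Queue the copy into a PBO; glReadPixels returns without waiting. */
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, NULL,
             GL_STREAM_READ);
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);

/* ...one or two frames later... */
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
const void *pixels = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
/* use pixels ... */
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
```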

If you do find that the merging stage is your bottleneck, you may need to rethink your approach. Can you use lower-precision output targets, perhaps through compression? Is there any way to reorder your algorithm to mitigate the stress on this stage? For shadows, are there ways to cache and reuse parts where nothing has moved?

In this section we have discussed ways of using each stage well by searching for bottlenecks and tuning performance. That said, be aware of the dangers of repeatedly optimizing an algorithm when you may be better served by using an entirely different technique.
