Understanding bandwidth and synchronization in Vulkan mobile rendering

1. Mobile GPU Architecture

Immediate Mode Rendering (IMR)

[Figure: general flow of the IMR architecture]

Most earlier desktop GPUs used the IMR architecture, whose general flow is shown in the figure. Note that each pixel goes through several color/depth reads and writes before it is finally drawn to the framebuffer, and every one of those reads and writes goes directly to main memory. Frequent memory traffic therefore consumes a lot of bandwidth and generates a lot of heat. On a PC, the heat that comes with higher image quality and frame rates can be handled externally with better fans and coolers, but a phone's extremely limited space leaves no room for that kind of cooling.

Tile-Based Rendering (TBR)

To solve the heat problem caused by bandwidth, mobile GPUs generally use a tile-based architecture. Adreno, Mali, and PowerVR GPUs each add their own optimizations on top of it; for example, HSR in PowerVR's TBDR architecture can completely eliminate overdraw when rendering opaque objects. We won't discuss those vendor specifics here.

The core idea, however, is the same in all of them: give the GPU a block of on-chip memory whose defining property is that reading and writing it costs the GPU far less than a round trip to main memory.

Tile-based rendering divides the full framebuffer into tiles. Until a tile is completely drawn, the GPU reads and writes only on-chip memory; once the tile is finished, its contents are written out to main memory in one go. The figure below shows the overall drawing process.

[Figure: overall drawing process of tile-based rendering]

In this way, the per-pixel reads and writes that previously went straight to main memory now go to on-chip memory with extremely low access cost, saving bandwidth and reducing power consumption.

2. Why use Vulkan?

This is a question every mobile rendering team must consider, and answering it requires a very clear understanding of the project's requirements, the Vulkan API, and the hardware features of the target phones. Vulkan has the following distinctive advantages:

1) Its drivers are thinner. Vulkan sits closer to the hardware, and the driver does not make the kinds of guesses and heuristic decisions a GLES driver does, so used well it can significantly reduce CPU load.

2) Vulkan provides explicit synchronization and bandwidth-control commands. Used correctly, they improve GPU efficiency and reduce GPU power consumption.

3) Vulkan's record-then-submit command buffer architecture is a much better fit for multi-threaded rendering (see the sketch after this list). For example, Unity on PC uses secondary command buffers to record objects of the same scene on multiple threads in parallel, and Unreal uses Vulkan's characteristics to split the RHI thread from the render thread, balancing the load across cores and improving multi-core efficiency.

4) Because the Vulkan standard is relatively new and has received strong industry backing, support for the latest hardware features is built directly into the standard and its extensions.
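
To make point 3 concrete, here is a minimal sketch of the per-thread recording pattern. It is not taken from Unity or Unreal; the function name is made up, and `device`, `queueFamilyIndex`, and the draw recording are assumed to exist elsewhere.

```c
#include <vulkan/vulkan.h>
#include <stddef.h>

// Minimal sketch: command pools are externally synchronized, so each
// recording thread owns its own pool and records its own command buffer.
// 'device' and 'queueFamilyIndex' are assumed to be created elsewhere.
VkCommandBuffer record_on_worker_thread(VkDevice device, uint32_t queueFamilyIndex)
{
    VkCommandPoolCreateInfo poolInfo = {
        .sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO,
        .queueFamilyIndex = queueFamilyIndex,
    };
    VkCommandPool pool;
    vkCreateCommandPool(device, &poolInfo, NULL, &pool);

    VkCommandBufferAllocateInfo allocInfo = {
        .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO,
        .commandPool = pool,
        .level = VK_COMMAND_BUFFER_LEVEL_SECONDARY, // recorded in parallel,
        .commandBufferCount = 1,                    // replayed from a primary
    };
    VkCommandBuffer cmd;
    vkAllocateCommandBuffers(device, &allocInfo, &cmd);

    // ... vkBeginCommandBuffer(cmd, ...), record this thread's draws,
    // vkEndCommandBuffer(cmd). The main thread then gathers all the
    // secondaries with vkCmdExecuteCommands() and submits once.
    return cmd;
}
```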

Therefore, with sensible scheduling, Vulkan delivers the CPU-side gains of a thin driver and multi-threading. On the GPU side, given a solid understanding of the hardware architecture and each driver's characteristics, manual control of synchronization and bandwidth can effectively improve game performance. However, some projects spend enormous time and energy optimizing rendering algorithms, improving the rendering process, or adding custom rendering pipelines, and then suffer serious performance regressions, or find their optimized algorithm is a net loss, simply because they misunderstood one hardware feature or misused one API. Even the early default mobile rendering pipelines in Unity and Unreal were more or less lacking in this respect.

3. Bandwidth and power consumption in games

Let's start with a very simple example.

[Figure: custom post-processing pass in Unity, with the Vulkan render pass load action shown as LOAD]

The figure above shows a custom post-processing pass implemented with Unity's rendering CommandBuffer. When switching the render target, if we just use the default interface to set the RT and hand it down to the driver, we can see that the Vulkan render pass's load action is LOAD, i.e. the previous rendering result is preserved.

When we instead use the interface below to explicitly set the render target's load action, changing the Vulkan load action to DONT_CARE, the frame rate rises by more than one frame in a GPU-bound scenario.

[Figure: explicitly setting the render target's load action to DontCare]

This is because, on a tile-based architecture, loading the render target means copying it from main memory into on-chip memory, which costs extra time and bandwidth. Explicitly setting DONT_CARE removes that load, and when the GPU is the bottleneck, the saving shows up directly in the frame rate.
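
At the Vulkan level, what that interface changes is the render pass attachment's loadOp. A minimal sketch of the relevant fields follows; the format is just an example and the rest of the render pass setup is omitted.

```c
#include <vulkan/vulkan.h>

// Minimal sketch: declaring that the previous contents of this color
// attachment are not needed, so a tile-based GPU can skip loading the
// old contents from main memory into on-chip memory.
VkAttachmentDescription colorAttachment = {
    .format         = VK_FORMAT_R8G8B8A8_UNORM,        // example format
    .samples        = VK_SAMPLE_COUNT_1_BIT,
    .loadOp         = VK_ATTACHMENT_LOAD_OP_DONT_CARE, // not ..._LOAD
    .storeOp        = VK_ATTACHMENT_STORE_OP_STORE,    // the result is kept
    .stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE,
    .stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE,
    .initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED,       // old data discarded
    .finalLayout    = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
};
```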

Applying Tile-Based Rendering to Deferred Rendering

[Figure: deferred rendering pipeline — G-buffer pass followed by lighting pass]

In traditional deferred rendering, the G-buffer is written out to main memory after it is rendered, and the lighting stage then samples it back. But since the lighting stage only needs the G-buffer data at each pixel's own position, we can use subpasses to keep the G-buffer in on-chip memory and read it from there directly in the lighting stage, saving the double bandwidth cost of first storing to memory and then sampling it back.
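
Here is a minimal sketch of what this looks like at the render pass level, assuming three G-buffer attachments (indices 0–2) and a final color attachment (index 3); attachment creation, formats, and pipelines are omitted. The lighting fragment shader would read the G-buffer with GLSL's subpassLoad() instead of a texture sample.

```c
#include <vulkan/vulkan.h>

// Minimal sketch: lighting subpass consumes the G-buffer as input
// attachments, so each fragment reads its own pixel from on-chip memory
// instead of sampling a texture written out to main memory first.
VkAttachmentReference gbufferWrite[3] = {
    {0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL},
    {1, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL},
    {2, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL},
};
VkAttachmentReference gbufferRead[3] = {
    {0, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL},
    {1, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL},
    {2, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL},
};
VkAttachmentReference sceneColor = {3, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL};

VkSubpassDescription subpasses[2] = {
    { // subpass 0: fill the G-buffer
      .pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS,
      .colorAttachmentCount = 3,
      .pColorAttachments    = gbufferWrite },
    { // subpass 1: lighting, reads the G-buffer via subpassLoad()
      .pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS,
      .inputAttachmentCount = 3,
      .pInputAttachments    = gbufferRead,
      .colorAttachmentCount = 1,
      .pColorAttachments    = &sceneColor },
};

// A per-region dependency keeps subpass 1 ordered after subpass 0
// within each tile, rather than across the whole framebuffer.
VkSubpassDependency dep = {
    .srcSubpass      = 0,
    .dstSubpass      = 1,
    .srcStageMask    = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
    .dstStageMask    = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
    .srcAccessMask   = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
    .dstAccessMask   = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT,
    .dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT, // tile-local dependency
};
```

In a real setup the G-buffer attachments would also use storeOp DONT_CARE and transient, lazily allocated memory, so they never need to be written to main memory at all.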

The following is a deferred rendering demo; the left side uses the sampling method and the right side uses the subpass method:

[Figure: deferred rendering demo — texture sampling (left) vs. subpass (right)]

In both cases the lighting algorithm, resolution, light count, fps, and GPU frequency are all fixed; the only difference is whether the G-buffer goes through main memory. On the left, the G-buffer is stored to memory and sampled back; on the right, it never touches main memory and is written to and read from on-chip memory directly.

[Figure: bandwidth and memory frequency comparison between the two methods]

The test results show that while the shader workload is identical, the subpass solution greatly improves memory frequency and bandwidth: combined read and write bandwidth drops by 4.9 GB/s, and the memory frequency drops as well.

[Figure: power and GPU temperature comparison between the two methods]

With no difference in frame rate, GPU utilization, or frequency, the bandwidth reduction alone produced a power difference of 567 mW and lowered the average GPU temperature by 5 degrees.

[Figure: power cost of memory bandwidth]

As a rough reference, the power consumed by memory bandwidth is on the same order as the GPU's own power draw: every 1 GB/s of memory bandwidth costs roughly 120 mW, which is consistent with the earlier measurement of 567 mW for about 5 GB/s.

Deferred rendering is not the only use case: whenever a pass needs the previously written color or depth at the current pixel position, a subpass can be used. Decals, the semi-transparent rendering of some particle effects, MSAA anti-aliasing, and so on can all gain considerable bandwidth savings from well-placed subpasses.

Of course, the tile-based architecture still has some disadvantages:

For example, the tiling (binning) phase requires all vertex shading to finish first, and every varying must be written out to main memory. Compared with IMR, then, on top of the shared cost of fetching geometry data, the tiling phase adds extra latency and bandwidth consumption.

[Figure: tiling-phase geometry traffic on a tile-based GPU]

So how vertex data is organized also matters a great deal. In the custom vertex data shown in the figure below, for example, aColor is passed on to the fragment shader for custom rendering.

[Figure: custom vertex attribute aColor stored as 32-bit values]

Note that each value is an integer with a single significant digit, yet it is stored in a 32-bit format. With complex models and many vertices this adds up to considerable overhead, when a 16-bit or even 8-bit format would suffice. Whenever we know how each piece of vertex data is used in our project, we should choose the smallest format that works.
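
In Vulkan terms the fix is just a smaller attribute format. A minimal sketch, where the location, binding, and offset are made up for illustration:

```c
#include <vulkan/vulkan.h>

// Minimal sketch: the aColor attribute carries tiny integer values, so
// an 8-bit normalized format is enough.
VkVertexInputAttributeDescription aColor = {
    .location = 2,
    .binding  = 0,
    // Before: 16 bytes per vertex.
    // .format = VK_FORMAT_R32G32B32A32_SFLOAT,
    // After: 4 bytes per vertex, read as floats in [0,1] by the shader.
    .format   = VK_FORMAT_R8G8B8A8_UNORM,
    .offset   = 12, // e.g. right after a 12-byte position
};
```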

Index-Driven Vertex Shading (IDVS)

The figure below shows an optimization Arm GPUs make to reduce the bandwidth spent fetching geometry data. Strictly speaking it is not an optimization designed specifically for the TBR architecture, but its benefits are substantial.

[Figure: IDVS pipeline — position shading before attribute shading]

The idea of IDVS is to run only the position-related part of the vertex shader first; only after culling is complete are the other attributes fetched and the varying-related calculations done. For this to work, developers must store positions and the other attributes in separate buffers so that the GPU can fetch them independently.

Most mobile games on the market do what the left side shows: all vertex data lives in a single VkBuffer. The right side is the recommended layout: one VkBuffer holds only position data and another holds the remaining attributes. In some vertex-shader-heavy cases the power gap between the two exceeds 30%, so if the vertex data is complex, splitting it is well worth it.
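
A minimal sketch of the recommended right-hand layout, assuming a vertex with position, normal, and UV; strides, locations, and the buffer handles are illustrative:

```c
#include <vulkan/vulkan.h>

// Minimal sketch: positions in binding 0, everything else in binding 1,
// so an IDVS-style GPU can fetch positions alone for the position-only
// pass and touch the second buffer only for surviving vertices.
VkVertexInputBindingDescription bindings[2] = {
    {.binding = 0, .stride = sizeof(float) * 3,   // position only
     .inputRate = VK_VERTEX_INPUT_RATE_VERTEX},
    {.binding = 1, .stride = sizeof(float) * 5,   // normal(3) + uv(2), say
     .inputRate = VK_VERTEX_INPUT_RATE_VERTEX},
};
VkVertexInputAttributeDescription attrs[3] = {
    {.location = 0, .binding = 0, .format = VK_FORMAT_R32G32B32_SFLOAT, .offset = 0},
    {.location = 1, .binding = 1, .format = VK_FORMAT_R32G32B32_SFLOAT, .offset = 0},
    {.location = 2, .binding = 1, .format = VK_FORMAT_R32G32_SFLOAT,    .offset = 12},
};
// At draw time the two VkBuffers are bound together:
//   VkBuffer bufs[2] = {positionBuffer, attributeBuffer};
//   VkDeviceSize offsets[2] = {0, 0};
//   vkCmdBindVertexBuffers(cmd, 0, 2, bufs, offsets);
```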

4. Synchronization in Vulkan

A central element of Vulkan synchronization is the pipeline barrier. Any project that modifies the engine's native pipeline will inevitably have to touch its pipeline barriers, and in practice almost every project gets some of them wrong. Moreover, if a project has a long development cycle and uses an older version of Unity or Unreal, the engine's own code may contain some improper usage as well.

The Vulkan spec assigns the pipeline barrier three roles, covered by the following three points.

First, it controls execution order. The GPU does not necessarily execute commands in the order we submit them, so placing a pipeline barrier between two commands guarantees that the commands before the barrier start executing before the commands after it.

Note that only the order in which commands start is guaranteed; with parallel execution, the order in which they finish cannot be controlled, and problems arise as soon as memory writes are involved.

Second, the pipeline barrier also enforces memory dependencies between commands, which we explain in detail below.

Third, it performs image layout transitions, which are just as important; interested readers can consult the relevant documentation.

Hazards

The real situation is much more complicated, so here we simplify the read/write model into three parts: GPU core, cache, and main memory:

[Figure: simplified GPU core–cache–memory model]

When writing data, after the GPU core modifies the cache it must flush it, copying the data out to main memory. This step is called making the memory available.

When the GPU core reads data, it must first invalidate the cache and load the data from main memory into it. This step is called making the memory visible.

With this simplified model in mind, let's look at the three memory hazards: read-after-write (RAW), write-after-write (WAW), and write-after-read (WAR).

Without synchronization, in a read-after-write situation for example, the read command may well execute before the earlier write command has actually reached memory, producing rendering errors. Such situations must be prevented with synchronization commands.

How the pipeline barrier handles WAR, RAW, and WAW

Of the three hazards, write-after-read is the easiest to solve: for WAR, merely constraining execution order with a pipeline barrier is enough to guarantee correct reads. Because the read command starts first, its data has already been loaded into the cache; even in the worst case, where the subsequent write completes immediately, the write only changes main memory and cannot affect the data already sitting in the cache being read.
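
A minimal sketch of such a WAR barrier, following the pattern in the Khronos synchronization examples: an execution dependency only, with no access masks. It assumes the read happens in a fragment shader and the later write in the color attachment output stage; `cmd` is a command buffer in the recording state.

```c
#include <vulkan/vulkan.h>
#include <stddef.h>

// Minimal sketch: write-after-read needs only an execution dependency.
// Block the later color attachment write until the earlier fragment-shader
// reads have finished; no cache flush/invalidate (access masks) is needed.
void war_barrier(VkCommandBuffer cmd)
{
    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,         // src: the reads
        VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, // dst: the write
        0,                                             // no dependency flags
        0, NULL,   // no global memory barriers
        0, NULL,   // no buffer memory barriers
        0, NULL);  // no image memory barriers
}
```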

Next, let's look at the two slightly more complicated cases.

The pipeline stage masks and access masks determine which memory the pipeline barrier applies to. If the same memory is written first and then read, the read must happen only after all the data has actually reached memory.

[Figure: read-after-write synchronization under a pipeline barrier]

Under the protection of the pipeline barrier, the full synchronization sequence shown in the figure is:

First the read command is blocked, waiting for the write command to execute. srcAccessMask makes the memory available, guaranteeing the data is flushed out to main memory; dstAccessMask makes it visible, guaranteeing the freshly written data is loaded back into the cache.

Finally the read command executes, completing the memory synchronization.
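
A minimal sketch of this read-after-write barrier for the render-then-sample pattern discussed below, assuming the write is a color attachment output and the read is a texture fetch in a later fragment shader; `cmd` is a command buffer in the recording state. (In practice an image memory barrier would be used so the layout can be transitioned at the same time; a global VkMemoryBarrier keeps the sketch short.)

```c
#include <vulkan/vulkan.h>
#include <stddef.h>

// Minimal sketch: make the color write available, then visible to
// fragment-shader reads, and block only the fragment stage.
void raw_barrier(VkCommandBuffer cmd)
{
    VkMemoryBarrier barrier = {
        .sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT, // make available
        .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,            // make visible
    };
    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, // wait for the writes
        VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,         // block only fragment work
        0,
        1, &barrier,
        0, NULL,
        0, NULL);
    // Using VK_PIPELINE_STAGE_VERTEX_SHADER_BIT as dstStageMask here would
    // still render correctly but stall the pipeline earlier than necessary,
    // which is exactly the inefficiency analyzed below.
}
```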

Write-after-write works on the same principle: block the second write until the first write has completed and its data has been flushed to memory, then let the second write proceed.

[Figure: read-after-write barrier in a deferred rendering demo — dstStageMask of VERTEX (left) vs. FRAGMENT (right)]

The figure above is a typical read-after-write synchronization example: a deferred rendering demo that reads the G-buffer by sampling. It simulates the most common pattern in games, rendering to an RT first and having a later fragment shader sample the result. Besides deferred rendering, shadow maps, SSS, some real-time wind fields, footprint pre-passes, and so on all follow this pattern.

The right side is fully correct: it blocks the lighting pass's fragment shader, makes available the memory written by the color attachment output stage of the preceding G-buffer pass, then makes that same memory visible to shader reads, and finally unblocks the fragment shader so the lighting pass can read the G-buffer correctly.

The only difference on the left is that the stage made to wait is vertex instead of fragment, which makes the GPU stall the pipeline prematurely, at the vertex shading stage.

This problem is well hidden. Vulkan's validation layers report nothing, because the barrier is merely inefficient, not incorrect. And a rough check of ALU load, read/write bandwidth, and rendering time shows almost no difference in any of these metrics, so the issue is easy to overlook.

Using profiling tools

Each SoC vendor provides its own GPU profiling tools: Streamline for Mali GPUs, PVRTune for PowerVR, and Snapdragon Profiler for Qualcomm. The phone in this example has a Mali GPU, so we analyze it with Mali's tooling:

[Figure: Streamline capture — compute and bandwidth counters unchanged]

As the figure above shows, the computation and read/write bandwidth are identical; the only counters that change are External Bus Read Latency and External Bus Stall.

[Figure: External Bus Read Latency and External Bus Stall comparison]

Specifically, there is a six-fold gap in cycles stalled reading from the memory bus, because our synchronization made the wait begin as early as the vertex shading stage. But with the GPU frequency high enough and the scene not complex enough to be GPU bound, these stalls are not large enough to affect the frame rate.

Now let's artificially cap the GPU frequency to simulate a GPU-bound situation. The memory bus latency can no longer be hidden, and the problem is exposed.

[Figure: frame time comparison with GPU frequency capped]

The impact of the bus stalls on frame rate is now obvious: the frame time differs by 3 milliseconds, roughly 8 fps here, yet without in-depth testing the problem is easy to miss. As rendering pressure gradually increases and the problem finally surfaces, it may be impossible to trace it back to the original cause.

To sum up, synchronization matters a great deal. We need to fully understand the core–cache–memory model, know exactly which hazards the pipeline faces, and use pipeline barriers correctly to capture the performance benefits of manually controlled synchronization.

Other synchronization primitives

Vulkan has many other synchronization primitives. A subpass dependency behaves almost like a barrier but can only synchronize memory tied to the render pass's attachments. Semaphores synchronize between queues, and fences synchronize the GPU with the CPU. Events are rarely used in mobile games; at present, apart from one or two titles with modified engines, only the occlusion query part of the latest Unreal mobile shading renderer uses them.

5. Summary

This article first introduced mobile rendering architectures and their characteristics, then explained the advantages of the Vulkan API, analyzed the strengths and weaknesses of tile-based rendering against real test results, and finally focused on Vulkan's explicit synchronization control, using concrete scenarios and measured data to present optimizations and analyze their root causes.

I hope this article gives readers a deeper understanding of the Vulkan API and mobile rendering architectures, so that, combined with their own development scenarios and sensible use of profiling tools, they can improve power consumption, bandwidth, and synchronization in mobile rendering.

References

The PowerVR Advantage (imgtec.com)

https://docs.imgtec.com/Architecture_Guides/PowerVR_Architecture/topics/powervr_architecture_the_powervr_advantage.html

Synchronization Examples · KhronosGroup/Vulkan-Docs Wiki · GitHub

https://github.com/KhronosGroup/Vulkan-Docs/wiki/Synchronization-Examples

ARM-software/Vulkan best practice for mobile developers

https://github.com/ARM-software/vulkan_best_practice_for_mobile_developers

SaschaWillems/Vulkan: Examples and demos for the new Vulkan API

https://github.com/SaschaWillems/Vulkan

Vulkan Usage Recommendations | Samsung Developers

https://developer.samsung.com/galaxy-gamedev/resources/articles/usage.html

baldurk/renderdoc

https://github.com/baldurk/renderdoc

Snapdragon Profiler - Qualcomm Developer Network

https://developer.qualcomm.com/software/snapdragon-profiler
