Batch Rendering (Batching) and Instancing

Batch rendering can be understood simply as improving the overall efficiency of the logic and rendering threads by reducing the number of rendering commands (draw calls) the CPU submits to the GPU and reducing the number of render-state switches, so that the GPU does more work per submission. Note that this only helps when the GPU is relatively idle and the CPU spends most of its time submitting rendering commands.

The most important prerequisite for batching: the materials must be the same!
Batching saves the CPU-side preparation work that would otherwise be repeated for each draw.

After batching, once the vertex shader, pixel shader, clipping test, and stencil test have run, there is no longer any notion of textures, vertices, or indices; only isolated pixels remain, with no relationship between them. Therefore batching has almost nothing to do with the GPU and has almost no impact on it: whether the frame is submitted as one batch or many, the number of pixels that ultimately reach the screen is the same, and so is the data. With multiple batches, the pixel data simply arrives at the GPU in several submissions within one frame; the only difference is whether the pixels arrive sooner or later, which barely affects GPU performance.

1. Offline Batching

Offline batching uses tools to merge related resources before the game runs, reducing the burden that runtime batching places on the engine.
Static models and scene objects are well suited to offline batching, for example decorative scene geometry such as stones and bricks.
Offline batching methods include:

  1. Artists merge meshes in professional modeling tools such as 3ds Max or Maya.
  2. Use engine plug-ins or tools. For example, Unity plug-ins such as MeshBaker and DrawCallMinimizer can batch static objects.
  3. Write a custom offline batching tool. If third-party plug-ins cannot meet the project's needs, a dedicated offline batching tool has to be programmed.
    The open-source library meshoptimizer includes a very good mesh merger: it first groups meshes by material, removes redundant data, and then merges the meshes that share the same material.
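The core of such a merger can be sketched as follows (simplified types for illustration, not meshoptimizer's actual API): group meshes by material, concatenate the vertex buffers, and shift each mesh's indices by the number of vertices already merged.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical minimal mesh: a material id, xyz vertex triples, and indices.
struct Mesh {
    int materialId;
    std::vector<float> vertices;   // x, y, z per vertex
    std::vector<uint32_t> indices;
};

// Merge all meshes that share a material into one mesh per material,
// so each material needs only one draw call instead of one per mesh.
std::map<int, Mesh> mergeByMaterial(const std::vector<Mesh>& meshes) {
    std::map<int, Mesh> merged;
    for (const Mesh& m : meshes) {
        Mesh& dst = merged[m.materialId];
        dst.materialId = m.materialId;
        // Incoming indices must be shifted by the vertex count already merged.
        uint32_t base = static_cast<uint32_t>(dst.vertices.size() / 3);
        dst.vertices.insert(dst.vertices.end(),
                            m.vertices.begin(), m.vertices.end());
        for (uint32_t i : m.indices) dst.indices.push_back(base + i);
    }
    return merged;
}
```

This runs entirely offline; the merged buffers are what get shipped in the build.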

2. Runtime Batching

The Unity engine has two built-in batch rendering techniques: static batching and dynamic batching.

There are three runtime approaches:
1. Static batching: requires the same material, and the objects cannot be transformed (rotated, translated, etc.) at runtime.
2. Dynamic batching.
3. Instanced rendering.
Unity has two kinds of batching, dynamic and static. Static batching essentially merges meshes marked as static automatically, once, with no further changes. Dynamic batching copies several meshes' data together in real time; the merge is redone every frame.
So dynamic batching in one sentence: it trades the cost of copying data for the cost of submitting draw calls. Modern hardware and graphics APIs are much better at submitting draw calls, so on newer devices dynamic batching often turns out to be a net pessimization. Unity noticed this as well and developed the SRP Batcher, whose idea is to abandon mesh merging and rely on modern API features instead.
Finally, to summarize: static batching is a one-time mesh merge; dynamic batching is a mesh merge every frame.
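The per-frame merge can be sketched like this (hypothetical types for illustration, not Unity's internal code): every frame, each candidate mesh's vertices are transformed to world space on the CPU and appended into one shared buffer, which is then submitted in a single draw call.

```cpp
#include <array>
#include <cassert>
#include <vector>

using Vec3 = std::array<float, 3>;
using Mat4 = std::array<float, 16>;  // row-major 4x4

// Apply the affine part of a row-major 4x4 matrix to a point.
Vec3 transformPoint(const Mat4& m, const Vec3& p) {
    return { m[0]*p[0] + m[1]*p[1] + m[2]*p[2]  + m[3],
             m[4]*p[0] + m[5]*p[1] + m[6]*p[2]  + m[7],
             m[8]*p[0] + m[9]*p[1] + m[10]*p[2] + m[11] };
}

struct Batchable {
    std::vector<Vec3> vertices;  // object-space vertices
    Mat4 localToWorld;           // per-object transform
};

// Rebuilt from scratch every frame: this CPU copy/transform cost is
// exactly what dynamic batching trades for fewer draw calls.
std::vector<Vec3> buildDynamicBatch(const std::vector<Batchable>& objects) {
    std::vector<Vec3> shared;
    for (const auto& o : objects)
        for (const auto& v : o.vertices)
            shared.push_back(transformPoint(o.localToWorld, v));
    return shared;
}
```

Because the work is repeated each frame, engines cap it to small meshes; past a vertex-count threshold the copy costs more than the saved draw calls.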

The pros and cons of static batching:

Static batching trades space for time to improve rendering efficiency.

Its advantages: meshes are usually merged at build time (preprocessing), and the vertex and index data do not change at runtime, so no CPU time is spent maintaining them; multiple relatively independent objects can be rendered together, reducing draw calls; and frustum culling can still run before rendering, so the vertex shader processes fewer invisible vertices and GPU efficiency improves.

The disadvantage is that the merged meshes stay resident in memory, which may be unacceptable in some scenarios. For example, every tree in a forest shares the same mesh; with static batching, the merged mesh is roughly equivalent to the single tree mesh multiplied by the number of trees, so memory consumption can become very large.

All in all, static batching is very useful for objects that share essentially the same material, have different meshes, and remain static from beginning to end.
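The forest example can be put into rough numbers (illustrative sizes only, a back-of-envelope sketch): static batching bakes a world-space copy of every instance, while instancing stores one mesh plus a small per-instance transform.

```cpp
#include <cassert>
#include <cstddef>

// Static batching keeps one baked, world-space copy of the mesh per instance.
std::size_t staticBatchBytes(std::size_t meshBytes, std::size_t instances) {
    return meshBytes * instances;
}

// GPU instancing keeps one mesh plus a per-instance 4x4 matrix (64 bytes).
std::size_t instancingBytes(std::size_t meshBytes, std::size_t instances) {
    return meshBytes + instances * 64;
}
```

With a hypothetical 1 MB tree mesh and 1000 trees, the baked batch costs about 1 GB while instancing stays near 1 MB, which is why forests are a poor fit for static batching.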

The differences between dynamic batching and static batching:

1. Dynamic batching does not create a merged mesh that resides in memory; it neither increases memory significantly at runtime nor affects the package size at build time.
2. Dynamic batching transforms vertices into world space before drawing and then fills them into the vertex and index buffers. After static batching, the sub-meshes accept no further transform operations; only the root node produced by the manual batching can be moved, so the statically batched vertex and index buffers are never modified (the root's transform is passed in through a constant buffer).
3. Because of point 2, the main overhead of dynamic batching is the CPU cost of traversing the vertices for that spatial transform; static batching has no such operation and therefore no such overhead.
4. Dynamic batching uses a shared buffer allocated per renderer type, while static batching uses its own dedicated buffer.

GPU Instancing

GPU Instancing, usually just called instancing, has neither dynamic batching's limit on mesh size nor static batching's large memory footprint; it makes up for the shortcomings of both, though it has limitations of its own, elaborated one by one below. Unlike dynamic and static batching, GPU Instancing does not reduce draw calls by merging meshes. Instead, a single mesh is submitted once and the GPU is asked to draw it in many places. The instances drawn in these different places can differ in scale, rotation, and position, and although they share the same shader, the shader's per-instance properties can vary.
If you draw many instances of a model one call at a time, you quickly hit a performance bottleneck from too many draw calls. Telling the GPU to draw your vertex data with glDrawArrays or glDrawElements costs more than drawing the vertices themselves, because OpenGL has to do a lot of preparation before each draw (telling the GPU which buffer to read from, where to find the vertex attributes, and so on, all over the relatively slow CPU-to-GPU bus). So even when rendering the vertices themselves is fast, issuing the commands is not, and running MVP transforms for a huge number of vertices in the vertex shader adds its own pressure. If we could instead send all the per-instance data to the GPU at once and then use a single draw function to have OpenGL draw all the objects from that data, that is instancing.
Instancing is a technique that lets us draw many objects with a single render call, so the CPU-to-GPU communication that would otherwise happen once per object happens only once in total. To use instanced rendering, we simply change the glDrawArrays and glDrawElements calls to glDrawArraysInstanced and glDrawElementsInstanced respectively. These instanced versions take one extra parameter, the instance count, which sets how many instances to render. This way we send the necessary data to the GPU once, then tell it with a single call how to draw all the instances; the GPU renders them without further round-trips to the CPU.


The 500+ cubes here are all copies of the same cube; each one only needs to record a transform, a matrix that can encode scaling, translation, rotation, and so on. glTF models can likewise be instanced directly.
The water is translated using the same instancing technique. A good learning example is the LearnOpenGL instanced asteroid belt.

Draw Call Essence and Modern DC Issues

For the same total amount of mesh data, fewer draw calls means faster rendering. (With 300k vertices, is it better for the CPU to submit 3 times at 100k vertices each, or 1000 times at 300 vertices each?) In practice the GPU ends up waiting for the CPU: a modern GPU's throughput is far beyond what the CPU can feed it, not even in the same order of magnitude. To the GPU, processing 30 vertices at a time and processing 10,000 at a time makes little difference; it has long since finished and sits waiting for the CPU to transmit more (or, to put it another way:

"The computing performance of modern GPUs is actually very high, and their parallel rendering is very fast. If the CPU cannot feed the GPU enough data, the GPU's computing power cannot be fully utilized, so the bottleneck lies in the CPU, not in the GPU. The gap between CPU and GPU throughput is orders of magnitude, anywhere from 10^3 to 10^6."

)
In more detail: CPU-GPU communication works by first loading the data into video memory and setting the render state, after which the CPU issues the render command (the draw call) that kicks off the rendering pipeline. Often the GPU has already finished the commands in its command queue and is waiting for new ones while the CPU is still setting render state (commonly called the context, as in the OpenGL or DirectX context). Objects with the same material share the same lighting and render state, which is why a batch must use the same material; a mesh can be rendered under many states, such as depth testing and writing, alpha blending, and so on. In the example above, submitting 100k vertices 3 times means the CPU sets the render state only three times; submitting 300 vertices 1000 times means setting it 1000 times, leaving GPU utilization very low and wasting the CPU, because most of those 1000 state setups are identical.
One correction to the above: data and commands are not executed serially. Pushing commands into the push buffer and DMA data transfers can happen simultaneously. In OpenGL, your rendering data is not necessarily resident in video memory; when data gets scheduled depends on the driver's pipeline implementation. Moreover, modern graphics APIs can record and submit commands in parallel, so draw calls are no longer the bottleneck they once were, not to mention techniques such as indirect draw, which enable GPU-driven rendering. You just have to pay attention to the mutual exclusion between resources and states and handle the synchronization yourself.

On performance analysis: "GPU waiting for CPU" shows up as the CPU being too busy while the GPU sits idle. CPU time generally goes into organizing and issuing draw commands, including the various algorithms used to speed up rendering and reduce draw calls, all of which is overhead. In the extreme, another multi-threaded application may be eating 99% of the CPU, leaving pitifully little computing power for your renderer, so also check whether other functions in your software are hogging the CPU.

Reference materials:

- A good article to share if you want to know more
- Hardware changes in CPUs and GPUs


Origin blog.csdn.net/chenweiyu11962/article/details/121340711