A plain-language summary of the GPU rendering pipeline and shaders: one article is enough

Reprinted from:
https://blog.csdn.net/newchenxf/article/details/119803489
It’s really well written! Highly recommended


1 Introduction

To work with graphics and images, you need to understand the GPU;
to understand the GPU, the most important thing is to understand the rendering pipeline;
and the most important part of the rendering pipeline is the shader.

Therefore, this article attempts to summarize the GPU rendering process and shaders, and can serve as an introductory tutorial for graphics development.

2 Rendering pipeline

A pipeline is also known as an assembly line.

So what is an assembly line?

Pipelining means that when a task is executed repeatedly, it is subdivided into many small tasks, and these small tasks are executed in an overlapping fashion to improve overall throughput.

For example:
three porters, A, B, and C, unload goods from a truck into a store.
A hands an item to B, B hands it to C, and C carries it into the store.
A does not need to wait until C has put the first item in the store before picking up the next one; while C is still carrying the first item, A can already start on the second.

CPU instruction processing also uses pipelining: the execution of an instruction is divided into five stages: instruction fetch, decode, operand fetch, execute, and write-back. The process is then similar to the porters above.

The rendering pipeline converts the vertices, textures and other information in a three-dimensional scene into a two-dimensional image. This work is completed by CPU + GPU.
A rendering process is usually divided into three stages:
(1) Application stage
(2) Geometry stage
(3) Rasterization stage

The application stage is done by the CPU; the geometry stage and the rasterization stage are done by the GPU.

Of course, since it is a pipeline, the three stages execute asynchronously. That is, after the CPU finishes its part, it does not need to wait for the GPU to finish before starting on the next call.
In addition, the geometry stage and rasterization stage executed on the GPU are themselves subdivided into subtasks, which also use pipelining.

2.1 How do CPU and GPU work in parallel?

The answer is the command buffer. The command buffer holds a command queue: the CPU adds commands to it, and the GPU reads commands from it.
When the CPU needs to render some objects, it adds commands to the command buffer; when the GPU finishes a rendering task, it takes the next command out of the buffer and executes it.
[Image: the command buffer sitting between the CPU and the GPU]
The green "render model" commands in the figure are what we call Draw Calls.
An OpenGL example:

GLES20.glDrawArrays(GLES20.GL_TRIANGLE_STRIP, 0, 4);

The orange "change render state" commands load data, switch shaders, switch textures, and so on. OpenGL examples:

GLES20.glUseProgram(program);
GLES20.glBindTexture(GLES20.GL_TEXTURE_2D, textureId);

When generating one frame, if there are multiple object models, the Draw Calls are submitted one by one. The GPU draws them into the color buffer one at a time and blends the results. Note that although the GPU processes Draw Calls one by one, it also has an internal pipeline: there is no need to wait for object A to finish drawing before starting on object B. As soon as A completes the first small step and moves on to the second, B can start the first step, which is the vertex shader described below.

After a frame has been drawn, swapBuffer is called to submit the data in the color buffer to the screen for display.

3 CPU processing: the application stage

Main work:
(1) Load data into video memory (GPU memory)
(2) Set the render state
(3) Issue the Draw Call

3.1 Load data into video memory

The GPU generally cannot access main memory directly, so it has its own memory, called video memory (graphics card memory). The CPU loads the data needed for rendering from disk or the network into main memory, and then sends it on to video memory.
There are two main kinds of data: model data (vertex coordinates + texture coordinates) and texture images. Other data, such as normal directions, are tiny in comparison. Once the data has been loaded into video memory, the copy in main memory can be released. For example, a texture can be created from a bitmap, and the bitmap can be recycled after it has been uploaded to the GPU.

3.1.1 Example of loading model data

The following Android video-playback rendering example illustrates how to load model data.

The video is displayed inside a two-dimensional rectangle, so only 4 vertex coordinates and the corresponding 4 texture coordinates are needed. The texture is the decoded video frame.

    //vertex data
    private float[] vertexData = {
            -1f, -1f,
            1f, -1f,
            -1f, 1f,
            1f, 1f
    };
    //texture coordinate data
    private float[] fragmentData = {
            0f, 0f,
            1f, 0f,
            0f, 1f,
            1f, 1f
    };

    public void onCreate() {
        //the vertex coordinates live in JVM memory; copy them into system (native) memory
        vertexBuffer = ByteBuffer.allocateDirect(vertexData.length * 4)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer()
                .put(vertexData);
        vertexBuffer.position(0);
        //the texture coordinates live in JVM memory; copy them into system (native) memory
        fragmentBuffer = ByteBuffer.allocateDirect(fragmentData.length * 4)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer()
                .put(fragmentData);
        fragmentBuffer.position(0);

        //VBO = Vertex Buffer Object; it lets us send a large batch of vertex data to the graphics card in one go
        int[] vbos = new int[1];
        GLES20.glGenBuffers(1, vbos, 0);
        vboId = vbos[0];

        //the next four calls copy the data from system memory into video memory
        GLES20.glBindBuffer(GLES20.GL_ARRAY_BUFFER, vboId);
        GLES20.glBufferData(GLES20.GL_ARRAY_BUFFER, vertexData.length * 4 + fragmentData.length * 4, null, GLES20.GL_STATIC_DRAW);
        GLES20.glBufferSubData(GLES20.GL_ARRAY_BUFFER, 0, vertexData.length * 4, vertexBuffer);
        GLES20.glBufferSubData(GLES20.GL_ARRAY_BUFFER, vertexData.length * 4, fragmentData.length * 4, fragmentBuffer);
    }

The code is already commented, but a little more explanation is worthwhile.

vertexData is a variable defined in Java, so it lives in the Java virtual machine's memory rather than in system memory and cannot be handed to the GPU directly. Therefore, a ByteBuffer is used first to copy vertexData into system memory.

Next, a Vertex Buffer Object (VBO) is created. Its main purpose is to send a large batch of vertex data to the graphics card in one go. Then glBufferData/glBufferSubData copy the data into graphics card memory, so that the GPU can access it quickly when it starts working later.

3.1.2 Example of loading texture

Here is another example, this time of an Android app loading a texture image:

    public static int createTexture(Bitmap bitmap){
        int[] texture = new int[1];

        //generate a texture id
        GLES20.glGenTextures(1, texture, 0);
        //bind the texture
        GLES20.glBindTexture(GLES20.GL_TEXTURE_2D, texture[0]);
        //minification filter: use the color of the single texel closest to the coordinate as the pixel color
        GLES20.glTexParameterf(GLES20.GL_TEXTURE_2D, GLES20.GL_TEXTURE_MIN_FILTER, GLES20.GL_NEAREST);
        //magnification filter: take the nearest texels and compute the pixel color by weighted average (linear filtering)
        GLES20.glTexParameterf(GLES20.GL_TEXTURE_2D, GLES20.GL_TEXTURE_MAG_FILTER, GLES20.GL_LINEAR);
        //wrap mode S: clamp texture coordinates to [1/2n, 1-1/2n], so the border is never sampled
        GLES20.glTexParameterf(GLES20.GL_TEXTURE_2D, GLES20.GL_TEXTURE_WRAP_S, GLES20.GL_CLAMP_TO_EDGE);
        //wrap mode T: clamp texture coordinates to [1/2n, 1-1/2n], so the border is never sampled
        GLES20.glTexParameterf(GLES20.GL_TEXTURE_2D, GLES20.GL_TEXTURE_WRAP_T, GLES20.GL_CLAMP_TO_EDGE);

        if (bitmap != null && !bitmap.isRecycled()){
            //with the parameters above, create a 2D texture and upload the bitmap to the GPU
            GLUtils.texImage2D(GLES20.GL_TEXTURE_2D, 0, bitmap, 0);
        }
        return texture[0];
    }

First, glGenTextures generates a texture id; at this point no actual storage has been created yet.
Next, the texture parameters are set, and then texImage2D is called: it creates the texture storage and copies the bitmap into it. That's it. Once the function returns, the contents of the bitmap have been copied to the GPU, and the bitmap can be recycled if desired.

What needs to be emphasized is that, whether you are loading vertices or textures, this only has to be done once, at initialization time. It does not need to be repeated in every onDraw, otherwise the whole point is lost.

3.2 Set the render state

This means declaring how each mesh (model) in the scene should be rendered: which vertex shader/fragment shader to use, the light source attributes, the materials (textures), and so on. If you do not switch the render state when drawing different meshes, consecutive meshes are drawn with the same render state.

3.3 Issue the Draw Call

Draw Calls have already been mentioned. As you probably understand by now, a Draw Call is simply a command: the initiator is the CPU and the receiver is the GPU.
When a Draw Call is issued, the GPU performs its calculations based on the render state (material, texture, shader) and all the vertex data, and the GPU pipeline starts running.

3.4 Summary

To emphasize once more: step (1) only needs to be done once; there is no need to reload the data every time you draw. Steps (2) and (3) have to be performed on every onDraw.
Here, again using video/picture rendering as the example, is the onDraw function:

    public void onDraw(int textureId)
    {
        //set the clear color and clear the color buffer
        GLES20.glClearColor(1f, 0f, 0f, 1f);
        GLES20.glClear(GLES20.GL_COLOR_BUFFER_BIT);
        //set render state: program is compiled from a specific pair of shaders, so this picks which shaders to use
        GLES20.glUseProgram(program);
        //set render state: pick the texture
        GLES20.glBindTexture(GLES20.GL_TEXTURE_2D, textureId);
        //set render state: bind the VBO
        GLES20.glBindBuffer(GLES20.GL_ARRAY_BUFFER, vboId);

        //set render state: vertex coordinates start at offset 0 in the VBO, 2 floats per vertex, 8-byte stride
        GLES20.glEnableVertexAttribArray(vPosition);
        GLES20.glVertexAttribPointer(vPosition, 2, GLES20.GL_FLOAT, false, 8,
                0);

        //set render state: texture coordinates start at offset 8*4 in the VBO (a float is 4 bytes), 2 floats per vertex, 8-byte stride
        GLES20.glEnableVertexAttribArray(fPosition);
        GLES20.glVertexAttribPointer(fPosition, 2, GLES20.GL_FLOAT, false, 8,
                vertexData.length * 4);

        //issue the Draw Call; the GPU pipeline starts working
        GLES20.glDrawArrays(GLES20.GL_TRIANGLE_STRIP, 0, 4);

        //drawing done; restore state
        GLES20.glBindTexture(GLES20.GL_TEXTURE_2D, 0);
        GLES20.glBindBuffer(GLES20.GL_ARRAY_BUFFER, 0);
    }
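
The program used above is a shader program that has already been compiled and linked (the vertex and fragment shader sources themselves are discussed in Sections 4 and 5). For reference, here is a minimal sketch of how such a program might be created with the standard GLES20 calls; the method name is illustrative and error handling is reduced to a log message:

    public static int createProgram(String vertexSource, String fragmentSource) {
        //compile the vertex shader
        int vertexShader = GLES20.glCreateShader(GLES20.GL_VERTEX_SHADER);
        GLES20.glShaderSource(vertexShader, vertexSource);
        GLES20.glCompileShader(vertexShader);

        //compile the fragment shader
        int fragmentShader = GLES20.glCreateShader(GLES20.GL_FRAGMENT_SHADER);
        GLES20.glShaderSource(fragmentShader, fragmentSource);
        GLES20.glCompileShader(fragmentShader);

        //link the two shaders into one program object
        int program = GLES20.glCreateProgram();
        GLES20.glAttachShader(program, vertexShader);
        GLES20.glAttachShader(program, fragmentShader);
        GLES20.glLinkProgram(program);

        //check the link result
        int[] linkStatus = new int[1];
        GLES20.glGetProgramiv(program, GLES20.GL_LINK_STATUS, linkStatus, 0);
        if (linkStatus[0] != GLES20.GL_TRUE) {
            Log.e("Shader", "link failed: " + GLES20.glGetProgramInfoLog(program));
        }
        return program;
    }

The attribute handles vPosition and fPosition used in onDraw would then be looked up with GLES20.glGetAttribLocation(program, ...).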

4 GPU processing: the geometry stage

The geometry stage and the rasterization stage can be put together in one picture.
[Image: the GPU pipeline, with the geometry stage followed by the rasterization stage]
Of course, the picture is not exhaustive; there are optional steps in between, and only the most important ones are listed.
The blue parts, the vertex shader and the fragment shader, are fully programmable by developers, and they are the two steps most graphics developers need to care about.

4.1 What is vertex data?

Let's start with a basic fact: the only data the GPU understands is points, lines, and triangles.
These correspond to 1 vertex, 2 vertices, and 3 vertices, and collectively they are called primitives. Points and lines are generally used in 2D scenes; in 3D scenes, an object model is essentially built by stitching together N triangles.

The picture below is an example of a character model:
[Image: a character model shown as a wireframe]
Zoom in on the details and you will find that it is made entirely of triangles. The corners of the triangles are what we call the vertex coordinates.
[Image: a close-up of the triangle mesh]
Once two-dimensional image textures are pasted onto all the triangles, the model becomes a complete image:
[Image: the fully textured character model]
So vertex data generally contains vertex coordinates + texture coordinates.

A model file generally contains vertex coordinates, texture coordinates, the textures themselves (there can be several), and so on. For example, if you open a .obj model file exported from 3ds Max, you can see text content like the following:

[Image: the text content of an .obj file]

4.2 Vertex Shader

The vertex shader is the first step in the GPU's internal pipeline.

Its processing unit is the vertex; that is, the vertex shader is invoked once for every vertex.
Its main work: coordinate transformation and per-vertex lighting, as well as preparing data for later stages (such as the fragment shader).

Coordinate transformation means converting the vertex coordinates from model space into homogeneous clip space.

Here is a Unity shader as an example. Unity puts the vertex shader and fragment shader code into a single file, but the Unity engine parses the file, converts the two shader code segments, and passes them to the GPU.

Shader "Unlit/SimpleUnlitTexturedShader"
{
    Properties
    {
        // 我们已删除对纹理平铺/偏移的支持,
        // 因此请让它们不要显示在材质检视面板中
        [NoScaleOffset] _MainTex ("Texture", 2D) = "white" {}
    }
    SubShader
    {
        Pass
        {
            CGPROGRAM
            // 使用 "vert" 函数作为顶点着色器
            #pragma vertex vert
            // 使用 "frag" 函数作为像素(片元)着色器
            #pragma fragment frag

            // 顶点着色器输入
            struct appdata
            {
                float4 vertex : POSITION; // 顶点位置
                float2 uv : TEXCOORD0; // 纹理坐标
            };

            // 顶点着色器输出("顶点到片元")
            struct v2f
            {
                float2 uv : TEXCOORD0; // 纹理坐标
                float4 vertex : SV_POSITION; // 裁剪空间位置
            };

            // 顶点着色器
            v2f vert (appdata v)
            {
                v2f o;
                // 将位置转换为裁剪空间
                //(乘以模型*视图*投影矩阵)
                o.vertex = mul(UNITY_MATRIX_MVP, v.vertex);
                // 仅传递纹理坐标
                o.uv = v.uv;
                return o;
            }
            
            // 我们将进行采样的纹理
            sampler2D _MainTex;

            // 像素着色器;返回低精度("fixed4" 类型)
            // 颜色("SV_Target" 语义)
            fixed4 frag (v2f i) : SV_Target
            {
                // 对纹理进行采样并将其返回
                fixed4 col = tex2D(_MainTex, i.uv);
                return col;
            }
            ENDCG
        }
    }
}

The code is clearly commented.
appdata is the input to the vertex shader; it contains the vertex coordinate and the texture coordinate.
Typically, the vertex shader only processes the vertex coordinate, performing the space transformation; the texture coordinate is simply passed through to the fragment shader.

The vert function is the vertex shader code. The Unity engine translates it into GLSL code with a main function, similar to the following, and then compiles it:

attribute vec4 v_Position;
attribute vec4 f_Position;
uniform mat4 uMVPMatrix;
varying vec2 textureCoordinate;

void main()
{
    gl_Position = uMVPMatrix* v_Position;
    textureCoordinate = f_Position.xy;
}

The space-transformation code is extremely simple: just multiply the coordinate by an MVP matrix!

Next, we will expand on the concept of space transformations, starting from this MVP matrix.

Before we begin, one question worth emphasizing: how many times does the main function execute?
The answer: once per vertex, and the invocations run in parallel!

4.3 Coordinate space

What is a coordinate space? In fact, coordinate spaces are everywhere in daily life. For example, you and a friend agree to meet 100 meters from the museum gate; the location you are describing lives in a space whose origin is the museum gate. A coordinate space is therefore a relative concept.

The same is true in the gaming world.

Depending on the frame of reference, we distinguish model space, world space, observation space, clip space, and screen space.

In the 3D world a vertex coordinate has three components, (x, y, z), but to make translation/rotation/scaling and other transformations convenient, a w component is added. The resulting 4-dimensional coordinate (x, y, z, w) is called a homogeneous coordinate. See the appendix in Section 7 for details.

Next, let’s discuss these spaces one by one.

4.3.1 Model space

Suppose you create a character model in 3ds Max; it can then be used in various software.
Inside the model there is a coordinate origin, for example at the heart, and every part of the body has a coordinate value relative to that origin.
Each model object has its own independent coordinate space, so model space is also called object space or local space.

4.3.2 World space

A map in the game can be treated as a small world, with the center of the map as the coordinate origin. When the character model is placed on the map as a whole, it has a coordinate value relative to the center of the map. This is world space.

Converting coordinates from model space to world space is done with a matrix operation. That matrix is called the Model Matrix.
In practice, when a model is placed on the map it may be scaled first, then rotated and translated. Each of these three operations has a corresponding matrix; see the appendix for details.

In other words, the Model Matrix is determined by the scale, rotation, and translation of the object within the map.
The purpose of the conversion is to obtain the concrete position on the map of every vertex of the object.
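
On Android, for example, such a Model Matrix could be built with the android.opengl.Matrix utility class; the parameter values below are purely illustrative:

    float[] modelMatrix = new float[16];
    Matrix.setIdentityM(modelMatrix, 0);
    //these calls post-multiply, so a vertex is effectively scaled first, then rotated, then translated
    Matrix.translateM(modelMatrix, 0, 5f, 0f, -3f);   //where the object sits on the map
    Matrix.rotateM(modelMatrix, 0, 45f, 0f, 1f, 0f);  //rotate 45 degrees around the Y axis
    Matrix.scaleM(modelMatrix, 0, 2f, 2f, 2f);        //make it twice as big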

4.3.3 Observation space

Observation space can also be called camera space (or view space). It exists because the game map is so big that we can only ever see part of it. So a Camera is defined in the game map; the Camera itself also lives in world space. The camera is equivalent to our eyes: whatever the camera points at is the picture we see.

In 3D games, the Camera and the user character are often tied together. Wherever the character object goes, the Camera object moves. The effect is that wherever you go, you can see the scenery!

Therefore, if the Camera is taken as the coordinate origin, any object in the map also has a coordinate value relative to the Camera. The space with the Camera as the origin is the observation space.

Converting world-space coordinates to observation space is again done with a matrix. That matrix is called the View Matrix.

What parameters does this matrix depend on?

First of all, the camera itself is located in the world coordinate system. If you ignore which direction the camera is looking, the matrix is actually a very simple translation matrix. That is, if camera A's coordinates are (-3, 0, 0), then the camera is the result of translating the world origin O by (-3, 0, 0).
If the camera position is taken as the new origin, then the original world origin O is the result of translating the camera by (3, 0, 0). So, to a first approximation, the transformation between the two coordinate systems is a translation matrix.

Of course, the actual camera orientation also matters: is it looking at the scene upright or upside down? Don't know what looking upside down means? See the picture below.
[Image: a camera looking at the scene upright versus upside down]

So this matrix is determined by three factors altogether.
Here is a function that spares you the tedious calculation:

glm::mat4 CameraMatrix = glm::lookAt(
    cameraPosition, // the position of your camera, in world space
    cameraTarget,   // where you want to look at, in world space
    upVector        // probably glm::vec3(0,1,0), but (0,-1,0) would make you looking upside-down, which can be great too
);
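
On Android the equivalent is Matrix.setLookAtM; a sketch with illustrative values:

    float[] viewMatrix = new float[16];
    //eye position, the point being looked at, and the up direction, all in world space
    Matrix.setLookAtM(viewMatrix, 0,
            0f, 2f, 5f,   //eye: the camera position
            0f, 0f, 0f,   //center: the point the camera looks at
            0f, 1f, 0f);  //up vector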

4.3.4 Clipping space

Clip space is the region the camera can see, called the view frustum.
It is bounded by 6 planes, called the clip planes. There are two kinds of view frustum, corresponding to two projection methods: perspective projection and orthographic projection.

The primitives that are completely located in this space will be retained;
the primitives that are completely outside the space will be completely discarded;
the primitives that intersect with the boundaries of the space will be clipped.

The figure below is a schematic of a perspective projection. The character is completely inside the view frustum, so the small picture in the lower-right corner is the 2D picture the end user sees.
[Image: a perspective view frustum containing a character, with the resulting 2D picture in the corner]
Of course, the Near and Far values in the picture are set quite small for ease of illustration. In an actual game, Near and Far differ greatly, e.g. Near = 0.1 and Far = 1000. This is similar to the range the human eye can see: things too close to the eye cannot be seen clearly, and neither can things very far away.

So, how do we determine whether something is inside the view frustum?
The answer is to transform the vertices into clip space with a projection matrix.
This matrix is essentially determined by the camera's parameters.

The parameters are illustrated below:
[Image: the view frustum parameters FOV, Near, Far, and AspectRatio]

FOV is the field of view;
Near and Far are the distances from the camera origin to the near and far clip planes;
AspectRatio is the aspect ratio of the clip plane.

The transformation matrix from observation space to clip space is called the Projection Matrix.

The matrix can be generated with a function:

// Generates a really hard-to-read matrix, but a normal, standard 4x4 matrix nonetheless
glm::mat4 projectionMatrix = glm::perspective(
    glm::radians(FoV), // The vertical Field of View, in radians: the amount of "zoom". Think "camera lens". Usually between 90° (extra wide) and 30° (quite zoomed in)
    AspectRatio,       // Aspect Ratio. Depends on the size of your window. such as 4/3 == 800/600 == 1280/960, sounds familiar
    Near,              // Near clipping plane. Keep as big as possible, or you'll get precision issues.
    Far             // Far clipping plane. Keep as little as possible.
);

The specific formula generation principle will not be expanded here. For details, you can google it or see here: https://zhuanlan.zhihu.com/p/104768669.

Although the projection matrix has the word projection in it, it does not perform the real projection. Instead, it prepares for projection; the actual projection happens in the next step, the conversion to screen space.

After the projection matrix has been applied, in other words once we are in clip space, the w value is no longer simply 0 or 1; it takes on a special meaning. Specifically, if a vertex is inside the view frustum, its transformed coordinates must satisfy:

-w <= x <= w
-w <= y <= w
-w <= z <= w

Vertices that do not satisfy this are either culled or clipped.

For example:
a vertex on a character model's hand has observation-space coordinates
[9, 8.81, -27.31, 1].
After conversion to clip space, the values are
[11.691, 15.311, 23.69, 27.3].
Since each of x, y, z lies within [-w, w], the vertex is inside the clip space and can be displayed.

The explanation so far was based on perspective projection. So what is the difference between orthographic projection and perspective projection?
Perspective projection: the farther away something is, the smaller it appears on screen. Orthographic projection: objects of the same size appear the same size on screen whether they are near or far.

Here's a hand-drawn picture that shows the idea:
[Image: hand-drawn comparison of perspective and orthographic projection]

As you can see, perspective projection gives a 3D effect while orthographic projection does not, which makes orthographic projection better suited to 2D games.
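
On Android the two kinds of projection matrix could be built like this (a sketch; the parameter values are illustrative, and width/height stand for the viewport size):

    float[] projectionMatrix = new float[16];
    //perspective projection: vertical FOV in degrees, aspect ratio, near and far clip planes
    Matrix.perspectiveM(projectionMatrix, 0, 45f, (float) width / height, 0.1f, 1000f);

    //orthographic projection: left, right, bottom, top, near, far
    Matrix.orthoM(projectionMatrix, 0, -1f, 1f, -1f, 1f, 0.1f, 1000f);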

4.3.5 Screen space

Screen space is a two-dimensional space. This is the picture we are finally about to see, and it is also the last space we need to know about (take a deep breath, haha).

Converting vertices from clip space to screen space takes us from three dimensions down to two, which is why it is called projection.

This projection mainly does two things:

First, the homogeneous division. Don't worry, it is not complicated: it simply divides the x, y, z components by the w component. In OpenGL, the coordinates obtained in this step are called Normalized Device Coordinates (NDC).
The valid ranges of NDC's x, y, and z are all -1 to 1. Points outside this range are culled.

The figure below is a schematic of NDC. It looks like a cube whose length, width and height all lie in the range [-1, 1].
[Image: the NDC cube]
The point of NDC becomes clear in the second step.

Second, the NDC coordinates are mapped to the screen to obtain screen coordinates. In OpenGL, the pixel coordinate of the lower-left corner of the screen is (0, 0) and the upper-right corner is (pixelWidth, pixelHeight).
For example, if the normalized coordinates are (x, y) and the screen width and height are (w, h), then the screen coordinates are

Sx = x*w/2 + w/2
Sy = y*h/2 + h/2

You might ask: is the z component of NDC no longer needed?
No, no, the z component is very valuable: it goes into the depth buffer. For example, if two fragments later end up at the same screen coordinates and neither is transparent, the depth buffer decides: the one closest to the camera is displayed.

What needs to be emphasized is that the conversion from clip space to screen space is done for us by the underlying system and requires no code.

The vertex shader only needs to transform the vertex from model space to clip space.

In the fragment shader, you can get the position of the fragment in screen space.

4.4 Summary of vertex coordinate transformation

Back in Section 4.2 we mentioned an MVP matrix, and now we finally know what it is: a matrix that transforms vertices from model space to clip space!
It is the product of the projection matrix, the view matrix, and the model matrix, applied to the vertex in the order model, then view, then projection.

Let's illustrate with a picture:
[Image: the Model, View, and Projection matrices combined into the MVP matrix]

This finally matches the code in Section 4.2:

o.vertex = mul(UNITY_MATRIX_MVP, v.vertex);

The macro UNITY_MATRIX_MVP is exactly the MVP matrix in the picture.

When the model's position/rotation/scale are determined, the Model Matrix is determined. When the camera's position and orientation are determined, the View Matrix is determined. When the camera's FOV, the distances to the near and far clip planes, and so on are determined, the Projection Matrix is determined.
Once all three are determined, multiplying them together gives the MVP Matrix.

So the vertex shader is very simple: just multiply by a fixed MVP matrix and you are done! That gives us the coordinates in clip space.
Then the GPU does the clipping, the conversion to NDC coordinates, and the screen mapping for you, and the geometry stage is complete!
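
To tie this back to the OpenGL ES code: on the CPU side, the three matrices could be multiplied together and handed to the vertex shader's uMVPMatrix uniform (the name used in the translated GLSL in Section 4.2) roughly like this, a sketch that builds on the earlier matrix snippets:

    float[] viewModel = new float[16];
    float[] mvpMatrix = new float[16];
    //MVP = Projection * View * Model, applied to the vertex in the order model -> view -> projection
    Matrix.multiplyMM(viewModel, 0, viewMatrix, 0, modelMatrix, 0);
    Matrix.multiplyMM(mvpMatrix, 0, projectionMatrix, 0, viewModel, 0);

    //upload to the shader; "uMVPMatrix" must match the uniform name declared in the vertex shader
    int mvpLocation = GLES20.glGetUniformLocation(program, "uMVPMatrix");
    GLES20.glUniformMatrix4fv(mvpLocation, 1, false, mvpMatrix, 0);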

4.5 Review of vertex data for video playback

Let’s look back at Section 3.1, which is an example of video playback. The vertex coordinates are:

    private float[] vertexData = {
            -1f, -1f,
            1f, -1f,
            -1f, 1f,
            1f, 1f
    };

Why is it so simple? Because it is only used for video playback, and the picture is a rectangle. Only 4 vertices are needed; the vertex coordinates are two-dimensional, so 8 numbers describe the 4 vertices; the z value defaults to 0 and the w value defaults to 1.

If you were to write out the complete data yourself, it would be:

    private float[] vertexData = {
            -1f, -1f, 0f, 1f, 
            1f, -1f,  0f, 1f, 
            -1f, 1f,  0f, 1f, 
            1f, 1f,  0f, 1f, 
    };

So the vertex coordinates here are already the values in clip space.

In other words, video playback does not need to care about world space or observation space, because it is two-dimensional. Without those spaces, the MVP matrix is simply an identity matrix.

In addition, because the w value is 1, the NDC coordinates are the same as the clip-space coordinates.

Section 4.3.5 described the NDC coordinates as a three-dimensional cube.
If z is ignored, the NDC coordinates form a two-dimensional square.

The two-dimensional NDC is shown in the figure below:
[Image: the 2D NDC square, with x and y running from -1 to 1]
The vertex coordinates for video playback are chosen against this square: the 4 vertices sit at its corners, with values between -1 and 1.

Having covered the vertices, let's talk about the OpenGL texture coordinate system. Textures are two-dimensional, whether they are used for video playback or in a 3D world.

The coordinate system is defined as follows:
[Image: the OpenGL texture coordinate system, from (0, 0) to (1, 1)]

Back in Section 3.1, when we load the vertex data, we also include the corresponding texture coordinates:

    //texture coordinate data
    private float[] fragmentData = {
            0f, 0f,
            1f, 0f,
            0f, 1f,
            1f, 1f
    };

It can be seen from the data that there is a one-to-one correspondence between vertex coordinates and texture coordinates. For example, the lower left corner position of the vertex coordinates (-1, -1) corresponds to the lower left corner position of the texture coordinates (0, 0).

Finally, let’s take a look at the vertex shader code corresponding to Section 3.1, which is also ridiculously simple:

attribute vec4 v_Position;
attribute vec4 f_Position;

varying vec2 textureCoordinate;

void main()
{
    gl_Position = v_Position;
    textureCoordinate = f_Position.xy;
}

Note that this is different from Section 4.2: this is a shader that OpenGL can compile directly, while Section 4.2 showed the form encapsulated by Unity, which the Unity engine eventually also translates into code like the above, with a main function.

v_Position is the externally supplied vertex coordinate, and gl_Position is a built-in variable representing the final coordinate in clip space.

If the shader above had to use an MVP matrix, that would also work, but the MVP would be an identity matrix, and multiplying by the identity matrix leaves the value unchanged.
That is,

gl_Position = mvpMatrix * v_Position;

5 GPU processing: the rasterization stage

The main task of rasterization: work out which pixels are covered by each primitive, and then color those pixels.

5.1 Triangle setup and traversal

Triangle setup:

This is the first stage of the rasterization pipeline.
It computes the information needed to rasterize a triangle mesh.
Specifically, the output of the previous stage is the screen-space vertices of the triangles, i.e. the two endpoints of each edge of each triangle. But to obtain the pixel coverage of the whole triangle, we must compute the pixel coordinates along each edge, and for that we need a representation of the triangle's boundary. This process of computing the data that represents the triangle mesh is called triangle setup; its output feeds the next stage.

Triangle traversal:
The triangle traversal stage checks, for each pixel, whether it is covered by a triangle. If it is covered, a fragment is generated; this process of finding which pixels a triangle mesh covers is triangle traversal. Based on the results of the previous stage, this stage determines which pixels a triangle covers, and interpolates across the covered area using the data of the triangle's three vertices. The figure below shows a simplified picture of the triangle traversal stage.
[Image: a triangle overlaid on a pixel grid; each covered pixel produces a fragment]
From the vertex information output by the geometry stage, the pixel positions covered by the triangle are obtained. Each covered pixel generates a fragment, and the values in the fragment are interpolated from the triangle's vertex data.
In the picture above, the triangle generates a total of 8 fragments.
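
Conceptually, the coverage test can be written with edge functions; real hardware does this in fixed-function units, but the idea is the same. A minimal, illustrative sketch:

    //the sign of the edge function tells which side of the edge a->b the point (px, py) lies on
    static float edge(float ax, float ay, float bx, float by, float px, float py) {
        return (px - ax) * (by - ay) - (py - ay) * (bx - ax);
    }

    //a pixel center is covered by the triangle if it lies on the same side of all three edges;
    //t holds the three screen-space vertices as {x0, y0, x1, y1, x2, y2}
    static boolean covered(float[] t, float px, float py) {
        float e0 = edge(t[0], t[1], t[2], t[3], px, py);
        float e1 = edge(t[2], t[3], t[4], t[5], px, py);
        float e2 = edge(t[4], t[5], t[0], t[1], px, py);
        return (e0 >= 0 && e1 >= 0 && e2 >= 0) || (e0 <= 0 && e1 <= 0 && e2 <= 0);
    }

Triangle traversal then walks the pixels of the triangle's bounding box and emits a fragment (with interpolated vertex data) for every pixel whose center passes this test.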

You must be wondering: how many fragments are generated when drawing one frame?

For a simple two-dimensional case such as video playback, there are only 4 vertices, spread across the 4 corners of the screen, with no depth at all. In that case it is easy to see that the number of fragments is the screen width * height, as shown below.
[Image: two triangles covering the whole screen]
There are only 2 triangles in total, at the same depth, i.e. no overlapping triangles.

But in a 3D game this is not necessarily the case.
[Image: a 3D game scene with several overlapping models, e.g. a house behind a stone]
For example, in this game scene there are several object models, and they overlap; for instance there is a house behind the stone.

To draw one frame, several models have to be drawn, each possibly needing its own Draw Call, and each Draw Call generates some fragments.
Once all the objects are drawn, essentially every screen pixel coordinate is covered by some mesh, and at some pixel coordinates there are overlapping fragments.
For example, for the pixels in the arrowed area, both the stone's triangle mesh and the house's triangle mesh produce fragments; their x and y are the same, but their depth z differs.
So, in a 3D game scene, the number of fragments >= screen width * height.

5.2 Fragment shader

Fragment shaders are also developer-written code; their main job is to color fragments.
Let's start with the fragment shader used for video-playback rendering:

varying highp vec2 textureCoordinate;// the texture coordinate of this fragment, passed along from the vertex shader's varying variable

uniform sampler2D sTexture;          // the texture passed in from outside, i.e. the data of the whole image

void main()
{
     gl_FragColor = texture2D(sTexture, textureCoordinate);  // look up the color at textureCoordinate in the texture and assign it to gl_FragColor
}

See, easy, right?! Just sample a color at the texture coordinate and assign it to gl_FragColor as the fragment's final color!

This means the final color is completely under your control. A lot of today's video special effects are built on the fragment shader.

For example, suppose we want the video to play back in black and white:

#extension GL_OES_EGL_image_external : require
precision mediump float;
varying vec2 vCoordinate;
uniform samplerExternalOES uTexture;
void main() {
  vec4 color = texture2D(uTexture, vCoordinate);
  float gray = (color.r + color.g + color.b)/3.0;
  gl_FragColor = vec4(gray, gray, gray, 1.0);
}

Take the simple average of r, g and b and give the final color identical r, g and b components. Such a color always lies between black and white (0 is black, 255 is white, anything in between is a shade of gray). Assign it to the fragment, and a simple black-and-white filter is done.

It should be emphasized that the fragment shader also executes in parallel: it runs once for every fragment, however many there are! The GPU has very powerful parallel computing capability.

5.3 Per-fragment operations

Per-Fragment Operations is the OpenGL name; in DirectX this stage is called the output merger stage.

The main tasks of this stage:
(1) Perform the stencil test, depth test, and so on, to decide the visibility of each fragment.
(2) If a fragment passes the tests, merge its color with the color already stored in the color buffer, which is called blending.
[Image: the per-fragment operations: stencil test, depth test, blending]

5.3.1 Stencil test

This test is optional. Stencil data lives in the stencil buffer.
What is a stencil for? It's very simple: for example, no matter how the scene is drawn, I only want a circular area in the center of the screen to be visible, and nothing outside it.

A stencil example:
[Image: a stencil that keeps only a circular area in the center of the screen]
If a fragment's coordinates fall outside the visible area, the fragment is discarded directly.
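
In OpenGL ES, the stencil test would be enabled and configured roughly like this (a sketch; the reference value and masks are illustrative, and the stencil buffer is assumed to already contain 1 inside the visible area):

    GLES20.glEnable(GLES20.GL_STENCIL_TEST);
    //keep only fragments whose stencil value equals 1 (the circular area)
    GLES20.glStencilFunc(GLES20.GL_EQUAL, 1, 0xFF);
    //do not modify the stencil buffer while drawing the scene
    GLES20.glStencilOp(GLES20.GL_KEEP, GLES20.GL_KEEP, GLES20.GL_KEEP);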

5.3.2 Depth test

If enabled, the depth value of the fragment, i.e. its z value, is compared with the depth value already in the depth buffer (if any). The comparison function is configurable; typically a fragment is discarded when its depth is greater than or equal to the stored value, and kept when it is smaller. This makes sense: the greater the depth, the farther from the Camera, and something blocked by a closer object does not need to be displayed.
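
In OpenGL ES the depth test is enabled like this (remember to clear the depth buffer at the start of every frame as well):

    GLES20.glEnable(GLES20.GL_DEPTH_TEST);
    //keep a fragment if its depth is less than or equal to the stored depth
    GLES20.glDepthFunc(GLES20.GL_LEQUAL);
    //clear color and depth buffers at the start of the frame
    GLES20.glClear(GLES20.GL_COLOR_BUFFER_BIT | GLES20.GL_DEPTH_BUFFER_BIT);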

5.3.3 Blending

If the previous tests pass, the fragment enters the blending stage, which is also configurable.

Why merge at all? Because generating one frame means drawing the object models one by one, and every draw updates the color buffer. So whenever we perform a draw that is not the first one, the color buffer already holds the result of the previously drawn objects.
So, for the same screen coordinate, should we keep the previous color, take the new color, or mix the two together? That is what this stage decides.

For translucent objects, you need to enable the blending function, because you need to be able to see the colors already in the color buffer, so you need to blend them to achieve a transparent effect.

For opaque objects, developers can turn blending off: as long as the depth test passes, the fragment should generally be kept, and its color can simply overwrite the color in the color buffer.
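
For the translucent case, a typical OpenGL ES setup is:

    GLES20.glEnable(GLES20.GL_BLEND);
    //result = src.rgb * src.a + dst.rgb * (1 - src.a), the standard "over" blend for transparency
    GLES20.glBlendFunc(GLES20.GL_SRC_ALPHA, GLES20.GL_ONE_MINUS_SRC_ALPHA);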

5.3.4 Image output: double buffering

First of all, what is displayed on the screen is the data in the color buffer.

While a frame is being drawn, the color buffer is constantly being overwritten and blended, and this process must not be visible to the user. Therefore, the GPU uses a double-buffering strategy: rendering happens in the back buffer (BackBuffer), and when the frame is finished, the back buffer and the front buffer (FrontBuffer) are swapped, so the newly completed frame is shown on screen while the other buffer is used to render the next frame.
A picture makes it clear.
[Image: the front buffer being displayed while the back buffer is rendered, then swapped]
After drawing a frame, eglSwapBuffers is called to swap the buffers.

6 Performance Discussion

6.1 Where are the performance bottlenecks?

Let's state the conclusion first: the bottleneck is usually on the CPU side, in the time spent submitting commands. The more Draw Calls there are, the more likely they are to cause performance problems.

Before each Draw Call is issued, the CPU has to send a lot of things to the GPU: data, state, commands, and so on. Only after the CPU finishes this preparation and submits the command can the GPU start working. The GPU's rendering capability is very strong; rendering 100 triangle meshes and rendering 10,000 makes little difference. So the GPU's rendering speed is usually faster than the CPU's command submission speed. Often, the GPU has finished everything in the command buffer and sits idle while the CPU is still busy preparing data!

It's like having a large number of txt files, say 1,000. Copying the 1,000 files one by one to another folder is very slow; packing them into a tarball first, even without compression, and copying that over in one go is much faster.
Why? Because the copy itself is fast; the time mostly goes into allocating memory, creating metadata, setting up file context, and so on before each copy!

If there are N object models on the map, rendering one complete frame may require N Draw Calls, and before every one of them the CPU has to do a lot of preparation (changing the render state), so the CPU gets overwhelmed. Targeted performance optimization is therefore required.

6.2 How to optimize performance

Let's first get a feel for roughly how much CPU a Draw Call costs:
NVIDIA once stated at GDC that 25K batches/sec will saturate a 1 GHz CPU, i.e. 100% utilization.

Formula:

DrawCall_Num = 25K * CPU_Frame * CPU_Percentage / FPS

DrawCall_Num: the maximum number of Draw Calls supported
CPU_Frame: the CPU clock frequency (in GHz)
CPU_Percentage: the percentage of CPU time allocated to Draw Calls
FPS: the desired game frame rate

For example, with a Snapdragon 820 running at 2 GHz, allocating 10% of the CPU time to Draw Calls and targeting 60 frames per second, a frame can contain at most about 83 Draw Calls (25000 * 2 * 10% / 60 ≈ 83.3). With 20% of the CPU time, it is about 167.

Therefore, Draw Call optimization is mainly about freeing the CPU from as much of the overhead of calling the graphics API as possible. The main idea: render each object as few times as possible, and preferably render multiple objects together, which is called batching.

Which objects can be batched?
The answer: objects that use the same material.

Objects with the same material differ only in their vertex data; the textures, vertex shader, and fragment shader they use are all the same!

So we can merge the vertex data of these objects into one buffer, send it to the GPU, and issue a single Draw Call, as sketched below.
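
A minimal sketch of the idea: quadA and quadB are hypothetical vertex arrays of two objects that share the same material, stored as independent triangles (GL_TRIANGLES rather than a strip) so they can simply be concatenated:

    float[] merged = new float[quadA.length + quadB.length];
    System.arraycopy(quadA, 0, merged, 0, quadA.length);
    System.arraycopy(quadB, 0, merged, quadA.length, quadB.length);
    //upload merged into a single VBO as in Section 3.1.1, then draw everything with one Draw Call:
    //GLES20.glDrawArrays(GLES20.GL_TRIANGLES, 0, totalVertexCount);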

The Unity engine supports two kinds of batching: static batching and dynamic batching. How other game engines handle this will not be covered here.

6.2.1 Unity static batching

Static batching does not limit the number of vertices per model. It merges the models only once, at the start of the run, which requires that these objects never move.

The advantage is that the merge happens once and subsequent renders do not need to merge again, which saves work.
The disadvantage is that it may use more memory. For example, without static batching some objects can share a mesh: say the map has 100 trees sharing one model. If they are statically batched, 100 copies of the vertex data are needed.

6.2.2 Unity dynamic batching

Dynamic batching merges the model meshes again for every rendered frame.

The advantage is that the objects can move.
The disadvantage is that there are restrictions on what can be merged; for example, the number of vertices per model cannot exceed 300 (the limit varies between Unity versions).

7 Appendix

7.1 What are homogeneous coordinates

Homogeneous coordinates use N+1 dimensions to represent N-dimensional coordinates.
For example, the Cartesian coordinates (x, y, z) are represented as (x, y, z, w). A coordinate can describe either a point or a vector.

Purpose:

  1. Distinguish between vectors and points.
  2. Facilitate the three most common affine transformations: translation T, rotation R, and scaling S.

An authoritative quote:

Homogeneous coordinate representation is one of the important methods in computer graphics. It not only clearly distinguishes vectors from points, but also makes affine geometric transformations easier to perform.
-- F.S. Hill, Jr., author of "Computer Graphics (OpenGL Edition)"

Distinguishing between vectors and points means:
(1) When converting from ordinary coordinates to homogeneous coordinates:
if (x, y, z) is a point, it becomes (x, y, z, 1);
if (x, y, z) is a vector, it becomes (x, y, z, 0).

(2) When converting from homogeneous coordinates back to ordinary coordinates:
if it is (x, y, z, 1), we know it is a point and it becomes (x, y, z);
if it is (x, y, z, 0), we know it is a vector and it likewise becomes (x, y, z).

As for translation, rotation, and scaling, let's go through them one at a time.

7.2 Translation matrix

The translation matrix is the simplest transformation matrix: it is the identity matrix with the displacement increments X, Y, Z placed in the last column. For example, translating the point (10, 10, 10, 1) by 10 units along the X axis gives (20, 10, 10, 1).
Note that the original three-dimensional coordinate (10, 10, 10) can only take part in this matrix multiplication after the extra w dimension is added (an m x n matrix can only be multiplied by an n x t matrix).
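
In matrix notation (a standard result, written out here since the original images are not reproduced):

T = \begin{bmatrix} 1 & 0 & 0 & X \\ 0 & 1 & 0 & Y \\ 0 & 0 & 1 & Z \\ 0 & 0 & 0 & 1 \end{bmatrix},
\qquad
\begin{bmatrix} 1 & 0 & 0 & 10 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 10 \\ 10 \\ 10 \\ 1 \end{bmatrix}
=
\begin{bmatrix} 20 \\ 10 \\ 10 \\ 1 \end{bmatrix}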

7.3 Scaling matrix

Scaling is also very simple: the scale factors for x, y, and z sit on the diagonal of the matrix. For example, to scale a vector (a point or a direction) by a factor of 2 in every direction, each of those diagonal entries is 2.
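
In matrix notation (again a standard result), a scaling matrix with factors s_x, s_y, s_z, and the uniform scale-by-2 example:

S = \begin{bmatrix} s_x & 0 & 0 & 0 \\ 0 & s_y & 0 & 0 \\ 0 & 0 & s_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix},
\qquad
\begin{bmatrix} 2 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix}
=
\begin{bmatrix} 2x \\ 2y \\ 2z \\ w \end{bmatrix}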

7.4 Rotation matrix

This one is a little more complicated: rotation about the X, Y, and Z axes each has its own matrix, but in the end each is still just a 4x4 matrix. They will not be derived here; if you are interested, see the references at the end of this article.
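
As one standard example, the matrix for a rotation by an angle \theta about the Z axis is:

R_z(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta & 0 & 0 \\ \sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}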

7.5 Combination transformation

The operation methods of rotation, translation and scaling are introduced above. These matrices can be combined by multiplying them, for example:

TransformedVector = TranslationMatrix * RotationMatrix * ScaleMatrix *
OriginalVector;

Note that the scaling is applied first, then the rotation, and finally the translation. That is how the matrix multiplication works out, and it is also the usual way we place 3D models into a map.

Multiplying the three matrices together still gives a single matrix, the combined transformation matrix, so the expression can also be written as:

TransformedVector = TRSMatrix * OriginalVector;

References

Android OpenGL ES 1. Basic concepts
Computer composition principles – GPU
Computer Things (8) – Graphics and Image Rendering Principles: http://chuquan.me/2018/08/26/graphics-rending-principle-gpu/
opengl-tutorial, Tutorial 3: Matrices: http://www.opengl-tutorial.org/cn/beginners-tutorials/tutorial-3-matrices/
Unity Documentation, Vertex and fragment shader examples: https://docs.unity3d.com/cn/2019.4/Manual/SL-VertexFragmentShaderExamples.html
Rasterization stage settings
About Drawcall
