GPU OpenGL ES Application Performance Optimization: Basic Methods (Repost)

2. Commonly used optimization schemes

     The main task of OpenGL ES optimization is to find the bottleneck in the graphics pipeline that limits performance. Bottlenecks generally appear in the following areas:

     •  Application code, such as collision detection
     •  Data transfer between the GPU and main memory
     •  Vertex processing in the VP (Vertex Processor)
     •  Fragment processing in the FP (Fragment Processor)

     DS-5 Streamline can be used to locate performance bottlenecks. For better performance, the following specific areas can be optimized:

2.1 Textures

     High-resolution textures take up a lot of memory and are a major load on the Mali GPU. They can be optimized in the following ways:
     •  Do not use large textures unless necessary
     •  Always enable mipmapping, even though it can sometimes reduce rendering quality
     •  If possible, order the triangles so that triangles that overlap each other are rendered together
     •  Use compressed textures to reduce memory usage and transfer bandwidth. The Mali-400 MP GPU supports ETC texture compression (4 bits per pixel, no alpha channel), and the GPU hardware decompresses ETC textures directly; the drawback is reduced image quality
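
     To put the 4 bits per pixel figure in context, the sketch below compares the size of one uncompressed RGB888 texture level with its ETC equivalent. It assumes the standard ETC1 layout of 8 bytes per 4x4 pixel block and ignores mipmap levels; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for one uncompressed RGB888 level (3 bytes per pixel). */
size_t rgb888_size(size_t width, size_t height)
{
    assert(width > 0 && height > 0);
    return width * height * 3;
}

/* Bytes needed for one ETC1 level: each 4x4 block takes 8 bytes,
 * which works out to 4 bits per pixel. Partial blocks are rounded up. */
size_t etc1_size(size_t width, size_t height)
{
    assert(width > 0 && height > 0);
    size_t blocks_x = (width + 3) / 4;
    size_t blocks_y = (height + 3) / 4;
    return blocks_x * blocks_y * 8;
}
```

     For a 1024x1024 texture, rgb888_size gives 3,145,728 bytes and etc1_size gives 524,288 bytes, a 6:1 reduction in both memory and transfer bandwidth.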

2.2 Anti-aliasing

     •  The GPU supports 4x Full Scene Anti-Aliasing (FSAA) with negligible performance loss. To activate 4x FSAA, select an EGL configuration with EGL_SAMPLES=4 when creating the context and rendering surface
     •  The Mali GPU also supports 16x FSAA, but its performance drops to 1/4 of that of 4x FSAA

2.3 Draw Mode

    For large meshes, where a vertex is contained in multiple triangles, the number of times such a vertex is processed depends on the drawing function called:

     •  glDrawElements: Each vertex is only processed once, which is more efficient.
     •  glDrawArrays: Each vertex data is transferred and processed once per triangle that uses it

     Storing vertex data in the order in which it is used improves vertex cache effectiveness and reduces the amount of data transferred from RAM to the vertex cache.
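
     As a rough illustration of the difference between the two draw functions, the sketch below counts how many vertices are handled for a regular grid of quads. It assumes GL_TRIANGLES and an ideal vertex cache for the indexed case; real caches are finite, so the indexed figure is a lower bound. The function names are hypothetical.

```c
#include <assert.h>

/* Vertices handled when an n x m grid of quads is drawn as independent
 * triangles with glDrawArrays(GL_TRIANGLES): two triangles per quad,
 * three vertices per triangle, nothing shared. */
unsigned long grid_vertices_draw_arrays(unsigned long n, unsigned long m)
{
    assert(n > 0 && m > 0);
    return n * m * 2 * 3;
}

/* Unique vertices in the same grid. With glDrawElements and an ideal
 * vertex cache, each of them is processed once. */
unsigned long grid_vertices_draw_elements(unsigned long n, unsigned long m)
{
    assert(n > 0 && m > 0);
    return (n + 1) * (m + 1);
}
```

     For a 100 x 100 grid this is 60,000 vertices with glDrawArrays against 10,201 unique vertices with glDrawElements.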

2.4 Vertex Buffer Objects

     •  Vertex data stored in a vertex array is located in client memory (main memory). When glDrawArrays or glDrawElements is called, this vertex data is copied from client memory to graphics memory.

     •  Vertex Buffer Objects allow OpenGL ES 2.0 applications to allocate and cache vertex data in high-performance graphics memory and render from that memory. This avoids resending the data every time a primitive is drawn.

     •  Vertex Buffer Objects come in two types:
     1) Array buffer objects: identified by GL_ARRAY_BUFFER, used to store vertex data
     2) Element array buffer objects: identified by GL_ELEMENT_ARRAY_BUFFER, used to store primitive indices

2.5 Data Precision

    Where possible, use low-precision data and avoid floating-point and other 32-bit data types:
     •  Define vertex positions using GL_SHORT
     •  Define surface normals using GL_BYTE
     •  Define colors using GL_UNSIGNED_BYTE
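
     The sketch below shows what the smaller types can look like on the application side. The vertex layouts, the padding to 4-byte attribute boundaries, and the helper names are all assumptions made for illustration. The quantizers use a simple symmetric scale and expect inputs already normalized to [-1, 1] or [0, 1]; the exact normalized-integer conversion rule differs slightly between OpenGL ES versions, so check it for the target version.

```c
#include <assert.h>
#include <stdint.h>

/* All-float layout: 3 + 3 + 4 floats = 40 bytes per vertex. */
typedef struct { float pos[3]; float normal[3]; float color[4]; } VertexF32;

/* Packed layout: GL_SHORT position, GL_BYTE normal, GL_UNSIGNED_BYTE color.
 * The fourth position and normal components are padding so that every
 * attribute starts on a 4-byte boundary: 8 + 4 + 4 = 16 bytes per vertex. */
typedef struct { int16_t pos[4]; int8_t normal[4]; uint8_t color[4]; } VertexPacked;

/* Map a float in [-1, 1] to a signed 16-bit value, rounding to nearest. */
int16_t quantize_snorm16(float v)
{
    assert(v >= -1.0f && v <= 1.0f);
    return (int16_t)(v * 32767.0f + (v >= 0.0f ? 0.5f : -0.5f));
}

/* Map a float in [-1, 1] to a signed 8-bit value, rounding to nearest. */
int8_t quantize_snorm8(float v)
{
    assert(v >= -1.0f && v <= 1.0f);
    return (int8_t)(v * 127.0f + (v >= 0.0f ? 0.5f : -0.5f));
}

/* Map a float in [0, 1] to an unsigned 8-bit value, rounding to nearest. */
uint8_t quantize_unorm8(float v)
{
    assert(v >= 0.0f && v <= 1.0f);
    return (uint8_t)(v * 255.0f + 0.5f);
}
```

     With these layouts, each vertex shrinks from 40 bytes to 16 bytes, at the cost of reduced precision for every attribute.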

2.6 Volume of Data Processed

    Reduce the amount of data processed by the Mali GPU in the following ways:
     •  Only draw primitives that are visible in the current frame: this can be achieved by clipping or frustum culling in the application
     •  Use ETC to compress textures
     •  Sort geometry by depth: once the geometry is sorted from front to back, issue the draw calls in that depth order
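
     The depth sorting step could be sketched as follows, assuming opaque geometry and a per-object view-space depth the application already computes. The DrawItem record and the function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-object record. */
typedef struct {
    float depth;    /* distance from the camera, larger means farther */
    int   mesh_id;  /* opaque handle the application uses to draw it */
} DrawItem;

/* qsort comparator: nearer items come first. */
static int compare_front_to_back(const void *a, const void *b)
{
    float da = ((const DrawItem *)a)->depth;
    float db = ((const DrawItem *)b)->depth;
    return (da > db) - (da < db);
}

/* Order draw items so the nearest geometry is submitted first. */
void sort_front_to_back(DrawItem *items, size_t count)
{
    if (count == 0) {
        return;
    }
    assert(items != NULL);
    qsort(items, count, sizeof items[0], compare_front_to_back);
}
```

     After sorting, the application walks the array in order and issues one draw call per item.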

2.7 Render Targets

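     The following factors relate to render targets:
     • Render all textures in cause-and-effect order
       1) Render to textures before they are used
       2) Render to the back buffer last
     • Draw to only one render target at a time: make sure all calls for the current target have been completed before drawing to the next one
     • Do not modify textures in the middle of a frame: set up all the textures the current frame needs before issuing the API calls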

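2.8 Processing Pipeline

     The following factors relate to the graphics processing pipeline:

     • Use eglSwapBuffers:
       If the application displays animation, make sure each frame ends with a call to eglSwapBuffers. The application then produces the next frame, which ensures the current frame stays stably on screen while the next one is computed.

     • Avoid glReadPixels:
       Reading even a few pixels has a significant performance impact, because it stalls the processing pipeline

     • Limit the number of vertices per glDrawElements call:
       After glDrawElements is called, polygon list building starts only once the preceding operations (such as vertex shading, transformation, and lighting) have completed. To let these stages run in parallel, make sure a single glDrawElements call contains no more than 1/5 of all the vertices in the current frame. This is especially important immediately before or after a call to glDrawArrays.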
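
     A minimal sketch of the 1/5 rule, assuming the application knows roughly how many vertices it draws per frame and uses the count argument of glDrawElements as a stand-in for the number of vertices in a call. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Largest element count for one glDrawElements call: at most 1/5 of the
 * vertices drawn in the frame, rounded down to a multiple of 3 so that
 * GL_TRIANGLES batches contain only whole triangles. */
size_t max_count_per_call(size_t frame_vertex_total)
{
    size_t limit = frame_vertex_total / 5;
    return limit - (limit % 3);
}

/* Number of glDrawElements calls needed to draw `count` elements without
 * any single call exceeding the limit. */
size_t calls_needed(size_t count, size_t frame_vertex_total)
{
    size_t limit = max_count_per_call(frame_vertex_total);
    assert(limit >= 3);  /* the frame is too small for the rule to apply */
    return (count + limit - 1) / limit;
}
```

     For a frame of 30,000 vertices, the limit is 6,000 per call, so a 9,000-element mesh would be split into two calls. Splitting adds draw calls, so this has to be balanced against the per-call driver overhead.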

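2.9 Shader Programs

     • Compile shaders first: at application startup, complete all calls to the shading language compiler before you start sending vertex or texture data to the driver
     • Use custom shader programs: trim large shader programs down to what each surface needs instead of using one large all-purpose shader; small, specialized shaders usually run faster
     • Consider program size: the Offline Shader Compiler can report program size. A single GPU instruction can contain several ESSL operations
     • Loops and conditional branches: do not unroll loops by hand. Instead, put the data in arrays and use for loops where possible. if statements can also be used.
     • Avoid using too many varyings: use varyings sparingly in fragment shader programs, because passing varyings between the VP and memory, or between the FP and memory, consumes memory bandwidth
     • Avoid excessive matrix multiplication: multiplying a 4x4 matrix by a 4x1 vector takes 16 multiplications and 12 additions, which is expensive. If a vector has to be multiplied by several matrices, multiply the vector by each matrix in turn rather than multiplying all the matrices together first and then multiplying the result by the vector
     • Estimate the cost of the program: typical cost levels are shown in the table below; the Offline Shader Compiler gives a more precise cost.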
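
     The matrix multiplication advice can be checked by counting scalar multiplications. The sketch below uses plain C on the CPU as a stand-in for shader arithmetic; the types, the global counter, and the function names are hypothetical.

```c
#include <assert.h>

typedef struct { float m[4][4]; } Mat4;  /* row-major: m[row][col] */
typedef struct { float v[4]; } Vec4;

unsigned long mul_count = 0;  /* scalar multiplications performed so far */

/* 4x4 matrix times 4x1 vector: 16 multiplications. */
Vec4 mat4_mul_vec4(const Mat4 *a, const Vec4 *x)
{
    assert(a != 0 && x != 0);
    Vec4 r;
    for (int i = 0; i < 4; ++i) {
        r.v[i] = 0.0f;
        for (int k = 0; k < 4; ++k) {
            r.v[i] += a->m[i][k] * x->v[k];
            ++mul_count;
        }
    }
    return r;
}

/* 4x4 matrix times 4x4 matrix: 64 multiplications. */
Mat4 mat4_mul_mat4(const Mat4 *a, const Mat4 *b)
{
    assert(a != 0 && b != 0);
    Mat4 r;
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            r.m[i][j] = 0.0f;
            for (int k = 0; k < 4; ++k) {
                r.m[i][j] += a->m[i][k] * b->m[k][j];
                ++mul_count;
            }
        }
    }
    return r;
}
```

     Computing A * (B * x) costs 16 + 16 = 32 multiplications, while (A * B) * x costs 64 + 16 = 80, and both give the same vector. This count is for a single vector; when the same matrices apply to many vectors, the trade-off changes because the matrix product can be computed once and reused.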


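2.10 Shader Arithmetic

     • The vertex processor works on 32-bit floating-point values: the vertex shader represents integers as floats. To avoid 32-bit values, set the precision of the vertex shader's output varyings to mediump or lowp.
     • The fragment shader works on 16-bit floating-point values, made up of: a sign bit; a 5-bit exponent with a bias of 15; and a 10-bit mantissa with an implicit leading 1 bit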
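
     To make the 16-bit format concrete, the sketch below decodes such a value on the CPU. It assumes the IEEE 754 half-precision conventions, which match the field layout described above; how the hardware treats subnormals, infinities, and NaNs is not stated in this article, so that part is an assumption. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Compute 2^e for a small integer exponent without needing libm. */
static float pow2i(int e)
{
    float r = 1.0f;
    for (; e > 0; --e) r *= 2.0f;
    for (; e < 0; ++e) r *= 0.5f;
    return r;
}

/* Decode a 16-bit float: 1 sign bit, 5 exponent bits (bias 15), and
 * 10 mantissa bits with an implicit leading 1 for normal numbers.
 * Infinities and NaNs (exponent field 31) are not handled here. */
float half_to_float(uint16_t h)
{
    int sign     = (h >> 15) & 0x1;
    int exponent = (h >> 10) & 0x1F;
    int mantissa = h & 0x3FF;
    assert(exponent != 31);

    float value;
    if (exponent == 0) {
        /* Zero or subnormal: no implicit 1, fixed exponent of -14. */
        value = ((float)mantissa / 1024.0f) * pow2i(-14);
    } else {
        value = (1.0f + (float)mantissa / 1024.0f) * pow2i(exponent - 15);
    }
    return sign ? -value : value;
}
```

     Under these assumptions the largest finite value is 65504 (bit pattern 0x7BFF), and the 10-bit mantissa gives roughly three decimal digits of precision, which is worth keeping in mind for fragment shader calculations with large values.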

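2.11 Miscellaneous

     • Use point sprites:
        Use them instead of triangles or quads to represent particle entities

     • Use appropriately sized triangles:
       Avoid long, thin triangles. The FP (Fragment Processor, or Pixel Processor) always processes groups of 4 neighboring fragments, so a strip 1 pixel wide takes more time to process than a strip 2 pixels wide.

     • Optimize state changes:
       Avoid switching state back and forth; group calls that share the same state together to reduce state changes

     • Clear the whole framebuffer:
       Always call glClear to clear the entire framebuffer. If possible, clear all the buffers at the same time: the color, depth, and stencil buffers.

       void glClear(GLbitfield mask);
       Parameter:
       mask: buffer flags combined with the | operator to indicate which buffers to clear; for example, glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT) clears the color buffer and the depth buffer. The following flags can be used:
       1) GL_COLOR_BUFFER_BIT:   the color buffers currently enabled for writing
       2) GL_DEPTH_BUFFER_BIT:   the depth buffer
       3) GL_ACCUM_BUFFER_BIT:   the accumulation buffer (desktop OpenGL only; not defined in OpenGL ES)
       4) GL_STENCIL_BUFFER_BIT: the stencil buffer

     • Minimize draw calls:
       When glDrawArrays or glDrawElements is called, the GPU driver has to gather all the current OpenGL ES state, textures, and vertex attribute data; it then processes this data and generates commands the GPU hardware can execute to perform the actual draw. This can take a considerable amount of time, so issuing many calls can make it the rendering bottleneck. If several objects share the same rendering parameters but use different textures, merge the textures into one large texture and adjust the texture coordinates accordingly.

     • Avoid glFlush and glFinish:
       Do not call glFlush or glFinish unless you cannot avoid it; use eglSwapBuffers to trigger the end of a frame instead. (Note: glFlush only sends the commands to the server and does not wait for them to complete. To return only after the server has finished executing them, call glFinish, but it severely hurts performance.)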
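
     The texture-merging advice under "Minimize draw calls" comes down to remapping texture coordinates. The sketch below shows one way to do it, assuming each original texture occupies a known pixel rectangle in the merged texture. The AtlasRegion type and the function name are hypothetical, and the sketch ignores repeat wrapping and filtering bleed between neighboring regions.

```c
#include <assert.h>

/* Hypothetical description of where one original texture sits inside the
 * merged texture, in pixels, measured from the corner where u = v = 0. */
typedef struct {
    int x, y;           /* position of the region in the merged texture */
    int width, height;  /* size of the region */
} AtlasRegion;

/* Remap a coordinate in [0, 1] on the original texture to the matching
 * coordinate on the merged texture. */
void atlas_remap_uv(const AtlasRegion *r, int atlas_w, int atlas_h,
                    float u, float v, float *out_u, float *out_v)
{
    assert(r != 0 && out_u != 0 && out_v != 0);
    assert(atlas_w > 0 && atlas_h > 0);
    *out_u = ((float)r->x + u * (float)r->width)  / (float)atlas_w;
    *out_v = ((float)r->y + v * (float)r->height) / (float)atlas_h;
}
```

     For a 256x256 texture placed at (256, 0) in a 512x512 merged texture, the original coordinate (0, 0.5) maps to (0.5, 0.25). Once all objects sample from the same merged texture, they can be batched into a single draw call.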

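3. Finding and Eliminating Bottlenecks

    Basic methods:

    1) Use a professional tool (such as DS-5 Streamline)

    2) Increase or decrease the load on the suspect stage of the graphics pipeline, then observe how performance changes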
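
    The second method can be turned into a simple check, assuming the application can measure its own frame time before and after changing the load. The function names and the threshold are hypothetical tuning choices, not values taken from any tool.

```c
#include <assert.h>
#include <stdbool.h>

/* Fraction of the frame time saved after the load on one pipeline stage
 * was reduced (for example, by shrinking the viewport). */
double relative_gain(double baseline_ms, double reduced_load_ms)
{
    assert(baseline_ms > 0.0 && reduced_load_ms >= 0.0);
    return (baseline_ms - reduced_load_ms) / baseline_ms;
}

/* Treat the stage as a likely bottleneck when reducing its load saved
 * more than `threshold` of the frame time. */
bool stage_looks_limiting(double baseline_ms, double reduced_load_ms,
                          double threshold)
{
    return relative_gain(baseline_ms, reduced_load_ms) > threshold;
}
```

    For example, if halving the render target resolution takes the frame from 20 ms to 12 ms, fragment processing is a likely bottleneck; if the frame time barely moves, the limit is probably elsewhere.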

    The following approaches can be used as a reference for finding and eliminating bottlenecks:

Problem area: Application code
Solution: Reduce the amount of processing that is unrelated to OpenGL ES calls, such as input processing, game logic, collision detection, and audio processing.

Problem area: Driver overhead
Solution: Group geometry with similar state together and eliminate unnecessary state changes.

Problem area: Vertex attribute transfer
Solution: Use smaller data types for the values. Also, use a more economical triangle scheme, and in general use glDrawElements rather than glDrawArrays.

Problem area: Vertex shader processing, or Transform and Lighting in OpenGL ES 1.1
Solution: Try the following options:
1) Use glDrawElements rather than glDrawArrays.
2) For OpenGL ES 1.1, reduce the number of lights.
3) Minimize the transformations of texture coordinates. You can avoid these transformations by setting the transformation matrix to identity with the OpenGL ES 1.1 function glLoadIdentity.
4) For OpenGL ES 2.0, simplify the vertex shader program.

Problem area: Polygon list building
Solution: Use fewer graphics primitives. Also, avoid drawing significant amounts of the total geometry in any single call to glDrawElements.

Problem area: Varying data transfer
Solution: In OpenGL ES 1.1, use fewer texture coordinates. In OpenGL ES 2.0, use fewer varyings, and specify lower precision on varying variables out of the vertex shader.

Problem area: Fragment shader processing, or texture, color sum, and fog in OpenGL ES 1.1
Solution: Lower the resolution of the render target or reduce the size of the viewport. For OpenGL ES 1.1, use fewer texture stages. For OpenGL ES 2.0, simplify the fragment shader program.

Problem area: Texture bandwidth
Solution: Try the following options:
1) Use fewer texture stages.
2) Reduce the size of the textures by using a smaller data format for each pixel, a lower resolution, or texture compression.
3) Use a simpler texture filtering mode.
4) Collapse texture coordinates so that they always read from the same position in the texture.

Problem area: Transfer to display framebuffer
Solution: Try the following options:
1) Use a mode with lower pixel precision.
2) Lower the resolution of the render target.

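4. ESSL Limits

   The OpenGL ES Shading Language specification defines minimum values for the sizes of various shader resources. In the Mali GPU implementation, some of these values are larger than the minimums defined in the specification. The commonly used ones are shown in the following table: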


Origin blog.csdn.net/dongtinghong/article/details/80081404