mali opt

https://armkeil.blob.core.windows.net/developer/Arm%20Developer%20Community/PDF/Arm%20Mali%20GPU%20Best%20Practices.pdf

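The draw call culling section

1. Bounding box frustum checking

2. Portal culling

Our current project does not do enough of this. Geometry should be culled on the CPU side before it is ever submitted. I have seen portal culling + LOD used before, in LAN.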
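To make the first item concrete, here is a minimal sketch of a bounding box versus frustum test done on the CPU before a draw call is issued. The `Plane` type, the function names, and the box-shaped "frustum" helper are all hypothetical names for illustration, not something the Arm document provides. The test is conservative: it can keep a box that is just outside a frustum corner, but it never rejects a visible one.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Plane:
    # Points p with dot(normal, p) + d >= 0 are on the inside of the plane.
    normal: Vec3
    d: float


def aabb_outside_plane(box_min: Vec3, box_max: Vec3, plane: Plane) -> bool:
    # Pick the box corner that lies furthest along the plane normal.
    # If even that corner is behind the plane, the whole box is outside.
    corner = tuple(
        box_max[i] if plane.normal[i] >= 0.0 else box_min[i] for i in range(3)
    )
    distance = sum(plane.normal[i] * corner[i] for i in range(3)) + plane.d
    return distance < 0.0


def aabb_visible(box_min: Vec3, box_max: Vec3, planes: Sequence[Plane]) -> bool:
    # Conservative test: reject only if the box is fully behind some plane.
    return not any(aabb_outside_plane(box_min, box_max, p) for p in planes)


def box_shaped_frustum(half: float) -> List[Plane]:
    # Hypothetical stand-in for a real view frustum: six axis-aligned planes
    # enclosing the cube [-half, half]^3. Enough to exercise the test.
    planes = []
    for axis in range(3):
        for sign in (1.0, -1.0):
            normal = [0.0, 0.0, 0.0]
            normal[axis] = sign
            planes.append(Plane((normal[0], normal[1], normal[2]), half))
    return planes
```

In a real renderer the six planes would be extracted from the view-projection matrix, and only objects for which `aabb_visible` returns true would be turned into draw calls.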

Use GPU performance counters to verify tiler primitive culling rates. Expect around 50% of the
triangles to be culled because they are back-facing inside the frustum. Higher culling rates can
indicate that the application needs an improved draw call culling approach.

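The problem I see with many people doing optimization is that they do not understand the problem clearly and do not know which data to look at.

So the slightly better ones at least measure the frame rate, vaguely, after making a change.

A better approach is to get accurate data for the specific part of the pipeline involved, as in the passage above.

Use that data to decide whether more complex CPU culling is worth doing.

So the design work for this part is also driven by scene data.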
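As a sketch of that data-driven decision: given two primitive counts read from the GPU performance counters, compute the culled fraction and compare it against the roughly 50% the quoted passage expects from back-face culling. The function names, the argument names, and the 10% margin are assumptions for illustration; the real counter names depend on the profiling tool and the GPU.

```python
def culled_fraction(input_primitives: int, visible_primitives: int) -> float:
    # Fraction of primitives the tiler threw away (back-facing, off-screen, ...).
    if input_primitives <= 0:
        raise ValueError("input_primitives must be positive")
    return 1.0 - visible_primitives / input_primitives


def needs_better_draw_call_culling(
    input_primitives: int,
    visible_primitives: int,
    expected: float = 0.5,   # about half are back-facing, per the quoted text
    margin: float = 0.1,     # assumed tolerance, tune per project
) -> bool:
    # A culling rate well above the expected value suggests that whole objects
    # outside the frustum are reaching the GPU and should be culled on the CPU.
    fraction = culled_fraction(input_primitives, visible_primitives)
    return fraction > expected + margin
```

For example, 1000 submitted primitives with 500 surviving is the expected case, while 1000 submitted with only 200 surviving points at missing CPU-side culling.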

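This doc is really good. Earlier on it also lists the conclusion I had just reached myself: CPU overhead that is not on a hotspot should still be reduced with batching, to save power.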

======================

Sort front to back to take advantage of the early-Z test.

Draw opaque geometry front to back, then transparent geometry back to front.
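The ordering rule can be sketched as a sort over a draw list. The `Draw` record and its fields are hypothetical; `view_depth` is assumed to be the distance from the camera along the view direction.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Draw:
    name: str
    view_depth: float   # assumed: distance from the camera, larger = further
    transparent: bool


def submission_order(draws: Iterable[Draw]) -> List[Draw]:
    draws = list(draws)
    # Opaque: nearest first, so later fragments fail the early-Z test.
    opaque = sorted(
        (d for d in draws if not d.transparent), key=lambda d: d.view_depth
    )
    # Transparent: furthest first, so blending composites correctly.
    transparent = sorted(
        (d for d in draws if d.transparent),
        key=lambda d: d.view_depth,
        reverse=True,
    )
    return opaque + transparent
```

Real engines usually sort on a coarse depth key combined with state (shader, material) to limit state changes; this sketch only shows the depth part.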

FPK (Forward Pixel Kill) is available from the Mali-T620 onward.

All Mali GPUs since the Mali-T620 GPU includes the FPK optimization. FPK provides automatic
hidden surface removal of fragments that are occluded, but early-zs testing does not kill. This is due to
the use of a back-to-front render order for opaque geometry. However, do not rely on the FPK
optimization alone. An early-zs test is always more energy-efficient, consistent, and works on older Mali
GPUs that do not include hidden surface removal.

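This part took me a while to understand.

Fragments go through early-Z first. After passing it,

the ones that turn out to be occluded are killed by FPK.

Why did early-Z not remove that fragment shading? Because the geometry was drawn back to front.

So the data flow here is: early-Z test first, and FPK only for what passes it.

Whatever early-Z can reject never needs to reach FPK, so it is still better to draw front to back and get the most out of early-Z.

In short: on Android, sort opaque geometry front to back.

Also, FPK cannot remove the vertex work. Metal's HSR cannot do that either.

Given that, if a Z prepass is required anyway because of AO, it is still worth trying to reuse the prepass depth

for a depth-equal test, because the FPK cost here can be saved, which makes the reuse worthwhile.

But considering that the prepass is half resolution,

this needs to be tested.

The prepass adds draw call overhead on the CPU,

plus vertex cost and bandwidth.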

=============

OpenGL ES GPU pipelining

Poor pipelining can also make performance oscillate because of DVFS:

performance oscillations caused by the platform dynamic voltage and frequency scaling (DVFS) logic thinking that the CPU or GPU is under-utilized.

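OpenGL ES looks synchronous on the surface, but it is actually asynchronous (in the CPU-GPU interaction, inside the GPU, and between GPUs).

A pipeline drain means the GPU goes idle, which is bad.

Here is how I understand the word "drain" in pipeline drain (which is GPU starvation):

drain: to let the water run out, i.e. leave the pipeline empty

Do not wait frequently.

glReadPixels(): use it in an asynchronous, pipelined way; do not let it stall the pipeline
glMapBufferRange() with GL_MAP_UNSYNCHRONIZED_BIT: asynchronous

Do not force synchronization (all of the following cause a wait):

glFinish(): the CPU waits for the GPU
Synchronous glReadPixels(): the CPU waits for the GPU to return the pixel results
glMapBufferRange() without GL_MAP_UNSYNCHRONIZED_BIT while the buffer is still in use by a draw call: the map waits until the GPU is done with the buffer before handing it to the CPU

Do not use glMapBufferRange() with GL_MAP_INVALIDATE_RANGE_BIT or GL_MAP_INVALIDATE_BUFFER_BIT; this causes a copy.

Do not use glFlush(), because it forces render passes to split; the driver flushes where it needs to.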

-------

My understanding of this passage is a bit fuzzy.

https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/glMapBufferRange.xhtml

https://www.khronos.org/registry/OpenGL-Refpages/gl4/html/glReadPixels.xhtml

https://www.khronos.org/registry/OpenGL-Refpages/gl2.1/xhtml/glFinish.xml

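glFinish blocks the CPU until the GPU has finished.

glReadPixels() and glMapBufferRange() both look like the CPU waiting for the GPU to flush out its earlier work. That is exactly the behavior "drain" describes: everything runs out and the GPU ends up empty. During the wait the GPU is still finishing the current data and has not been given any new GPU commands.

Now... it lines up.

A flush being asynchronous is not a problem in itself, but the split also implies flushing the GPU, which is the same drain as in the two cases above. I have seen this kind of split behavior on NVIDIA cards; this document is the first place I have seen it described.

One more thought: setting render target A, drawing, and then binding render target A as an SRV also causes a flush. The document does not mention this, perhaps because there is no CPU wait involved.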

------

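This creates GPU bubbles.

Unstable performance: DVFS

The curve: CPU and GPU wait on each other and there are bubbles. One moment the CPU is busy, the next the GPU is busy; the load oscillates and neither reaches high utilization.

(Given this passage, the cases above are not necessarily all the GPU waiting; thinking it over again, this still fits the drain behavior well.)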
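A toy steady-state model makes the bubble argument easier to see. It assumes a fixed CPU cost and a fixed GPU cost per frame and ignores driver overhead, queue depth limits, and vsync; the function names are made up for this sketch.

```python
def frame_time_serialized(cpu_ms: float, gpu_ms: float) -> float:
    # CPU builds the frame, then blocks (for example on glFinish()) until the
    # GPU is done. The two processors never overlap.
    return cpu_ms + gpu_ms


def frame_time_pipelined(cpu_ms: float, gpu_ms: float) -> float:
    # CPU builds frame N+1 while the GPU renders frame N. In steady state the
    # slower of the two sets the frame time.
    return max(cpu_ms, gpu_ms)


def utilization(busy_ms: float, frame_ms: float) -> float:
    # Share of the frame during which a processor is doing useful work.
    return busy_ms / frame_ms
```

With 8 ms of CPU work and 12 ms of GPU work, the serialized case takes 20 ms per frame and the GPU is only 60% busy, while the pipelined case takes 12 ms with the GPU fully busy. A governor that sees 60% utilization may lower the clock, which is one way the oscillation described above can start.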

================================

▪ Invalidate attachments which are not needed outside of a render pass at the end of the pass before changing the framebuffer binding to the next FBO.
▪ If you know you are rendering to a sub-region of framebuffer use a scissor box to restrict the area of clearing and rendering required.

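The first point is about switching render targets: invalidate the attachments that are no longer needed before the switch, and Mali can then drop them on its own.

On OpenGL ES, invalidating attachments is what lets the driver skip the tile memory traffic (the write-back at the end of a pass, or the load at the start of one).

The second point recommends a scissor box when only part of the framebuffer is drawn. This most likely means that on Mali a scissor lets the GPU update only the affected tiles.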

==============

128bits Gbuffer

Light: B10G11R11_UFLOAT

Albedo: RGBA8_UNORM

Normal:RGB10A2_UNORM

PBR material parameters/misc: RGBA8_UNORM
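A quick check that the four color formats above really add up to the 128 bits in the heading. The per-channel bit counts are read straight off the format names; the dictionary keys are just labels.

```python
# Bits per pixel for each G-buffer attachment, taken from the format names.
GBUFFER_BITS = {
    "Light (B10G11R11_UFLOAT)": 10 + 11 + 11,
    "Albedo (RGBA8_UNORM)": 8 * 4,
    "Normal (RGB10A2_UNORM)": 10 * 3 + 2,
    "PBR/misc (RGBA8_UNORM)": 8 * 4,
}

total_bits = sum(GBUFFER_BITS.values())
print(total_bits)  # 128
```

Depth is not counted here; only the four color attachments are.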

GBuffer pass (subpass0)

Light:COLOR_ATTACHMENT_OPTIMAL

Albedo:COLOR_ATTACHMENT_OPTIMAL

Normal:COLOR_ATTACHMENT_OPTIMAL

PBR:COLOR_ATTACHMENT_OPTIMAL

Depth:DEPTH_STENCIL_ATTACHMENT_OPTIMAL

Light pass(subpass1)

Light:COLOR_ATTACHMENT_OPTIMAL

Albedo:SHADER_READ_ONLY_OPTIMAL

Normal:SHADER_READ_ONLY_OPTIMAL

PBR:SHADER_READ_ONLY_OPTIMAL

Depth:DEPTH_STENCIL_READ_ONLY

==========

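Vulkan's robustBufferAccess device feature can be used

for out-of-bounds memory checks while debugging.

Out-of-bounds memory access easily leads to a GPU hang (or device lost), and even the validation layers return no information, so it is very hard to track down.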

=========

1080p: 250K, 16 fragments per triangle
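The note does not say how the 250K figure was derived. One reading that lands close to it, offered only as an assumption: take the triangles needed to cover a 1080p screen once at 16 fragments each, then double that because about half of the submitted triangles are back-facing and culled (the 50% figure quoted earlier).

```python
def triangle_budget(
    width: int,
    height: int,
    fragments_per_triangle: float,
    backface_fraction: float = 0.5,  # assumed: half of submitted triangles culled
) -> float:
    # Triangles that survive culling, covering the screen once (no overdraw).
    visible = width * height / fragments_per_triangle
    # Scale up to the number submitted before back-face culling.
    return visible / (1.0 - backface_fraction)


print(triangle_budget(1920, 1080, 16))  # 259200.0
```

That gives about 259K submitted triangles per frame, in the same range as the 250K in the note. Overdraw and off-screen geometry would change the number.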

For instanced data, use a single vertex buffer rather than several.

=======

There seems to be a rather tricky usage: alpha test / discard without depth write.

With depth write turned off, the fragment still goes through early-Z rather than late-Z.

====

Use glUniform*() or Vulkan push constants directly instead of uniform buffers.

Register mapped uniforms are effectively free; any spilling to buffers in memory will increase load/store cache accesses to the per thread uniform fetches.

Something like 128 bytes, i.e. more than the registers can hold at once, results in system memory accesses.

=======

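Vulkan work maps onto two hardware slots:

vertex/compute hardware slot

fragment hardware slot

(This feels like Metal's three kinds of work, render/compute/blit. Encoders live on the CPU side, but you could still say the CPU side records three types of commands separately, one per slot, so it does line up.)

These two slots have to run in parallel as much as possible to get good efficiency.

Synchronization within a queue uses:

subpass dependencies
pipeline barriers
events

Between queues:

semaphores

It looks like srcStage and dstStage can be used to mark where the synchronization happens.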
