DirectX11 With Windows SDK - 29 Compute Shader: Memory Models, Thread Synchronization; Implementing Order-Independent Transparency (OIT)

Foreword

Because the blended result of transparent objects differs with draw order, transparent objects must be sorted and then rendered back to front. But even when rendering a single object (such as the waves from the previous chapter), the draw order of transparent fragments can still be wrong, and ordinary rendering cannot avoid this. To get a correct result, the fragments at each pixel location must be sorted by depth. This chapter introduces a technique available only since Direct3D 11: order-independent transparency (OIT). Although it is implemented with pixel shaders, it uses some knowledge from compute shaders.

This chapter is fairly complex. Before reading it, you should understand the following:

Chapters
11 Blend states
12 Depth/stencil states, drawing a planar mirror (depth/stencil states only)
14 Depth test
24 Render-To-Texture (RTT)
28 Compute shader: waves
In-depth understanding and usage of buffer resources (structured buffers, byte address buffers)

Learning goals:

  1. Be familiar with the memory models and thread synchronization
  2. Be familiar with order-independent transparency

DirectX11 With Windows SDK full catalog

Github project source code

Welcome to join the QQ group: 727623616, where you can discuss DX11 together and report any problems.

DirectCompute memory model

DirectCompute provides three memory models: register-based memory, device memory, and group shared memory. These memory models differ in size, speed, and access method.

Register-based memory: very fast, but the number of registers is limited, and so is the size of the resource each register points to. Registers include texture registers (t#), constant buffer registers (b#), unordered access view registers (u#), temporary registers (r# or x#), and so on. We do not use registers directly; instead, a shader object (such as a tbuffer, which lives in GPU memory and is therefore very fast to access) is associated with a register, and we use the register indirectly through that object. Since the registers are fixed once the shader is compiled, their availability depends on the registers the current shader code uses.

The following code shows how to declare register-based memory:

tbuffer tb : register(t0)
{
    float weight[256];      // can be updated from the CPU; read-only in the shader
}

Device memory: generally refers to resources created through the D3D device (textures, buffers); these resources persist as long as their reference count is nonzero. Large amounts of memory can be allocated for them, and they can be bound to registers for use as shader resources or unordered access resources. Of course, accessing such a memory object is not as fast as accessing register-based shader resources directly, because it lives in memory external to the GPU cores. Although the bandwidth to this memory is very high, there is also a relatively high latency between a request and the returned value. An unordered access view allows the same operations on device memory as on register-based memory, but frequent reads and writes to device memory hurt performance heavily. Furthermore, since every thread can read or write any location of a resource through an unordered access view, access to the resource must be synchronized manually: either use atomic operations, or design the access pattern so that multiple threads never touch the same data.

Group shared memory: while the first two memory models can be used in all programmable shader stages, group shared memory can only be used in compute shaders. Its access speed is faster than device memory but slower than registers, and it has an obvious size limit: each thread group can allocate at most 32KB of shared memory for all of its threads. You must decide how the threads interact with and use the shared memory, and access to it must be synchronized. This depends on the algorithm being implemented, but it usually involves the thread addressing described previously.

These three classes of memory provide different access speeds and available sizes, so each can be matched to the situation that suits its capabilities, which gives compute shaders much greater flexibility in how they use memory. The following table summarizes the memory models:

Memory model            Access speed   Available size   Usage
Register-based memory   Fastest        Small            Declared memory objects, global variables
Device memory           Slower         Large            Bound to the pipeline through views
Group shared memory     Fast           Small            Compute shaders only; prefix global variables with groupshared

Thread Synchronization

Since many threads run simultaneously, and threads can interact through group shared memory or through resources bound via unordered access views, memory accesses between threads must be synchronized. As in traditional multithreaded programming, when many threads may read and write the same memory location, there is a danger of read-after-write (RAW) hazards corrupting memory. How can multiple threads be synchronized efficiently without losing the performance that GPU parallelism provides? Fortunately, several different mechanisms are available for synchronizing the threads within a thread group.

Memory Barriers

This is one of the highest-level synchronization techniques. HLSL provides several intrinsic functions that synchronize memory access for all threads in a thread group. Note that they synchronize only the threads of one thread group, not the entire dispatch. The intrinsics vary along two axes: first, which class of memory is synchronized (device memory, group shared memory, or both); second, whether all threads of the thread group are additionally synchronized to the same point of execution. From these two properties come the following variants of the intrinsic functions:

Without group synchronization   With group synchronization
GroupMemoryBarrier              GroupMemoryBarrierWithGroupSync
DeviceMemoryBarrier             DeviceMemoryBarrierWithGroupSync
AllMemoryBarrier                AllMemoryBarrierWithGroupSync

These functions block a thread until the condition of the particular function is satisfied. For example, GroupMemoryBarrier() blocks a thread until all writes to group shared memory by all threads of the group have completed. This is used to ensure that when threads share data with each other through group shared memory, the desired value has a chance to be written to shared memory before other threads read it. There is an important distinction here between the shader core executing a write instruction, and that write actually being carried out by the GPU's memory system and becoming visible to other threads. The time from issuing the write to the value arriving at its destination is variable and depends on the hardware implementation. By blocking until the writes have completed, the developer can be certain that no read-after-write errors will occur.

But after all this talk, we should put it into practice. In the bitonic sort project, change the GroupMemoryBarrierWithGroupSync() on line 15 of BitonicSort_CS.hlsl to GroupMemoryBarrier(), then run the program a number of times; occasionally an incorrectly sorted result appears. From this we can judge: GroupMemoryBarrier() only blocks while some thread in the group still has a shared-memory write in flight, so a thread may be released once the vast majority of the shared data has been written, even though a few threads have not yet started writing theirs at all. That is why the error is so rarely seen.

Then there is GroupMemoryBarrierWithGroupSync(). Compared with the previous function, it also blocks the threads that reach it first until all threads in the group have reached it. Clearly, we do not want any thread to press ahead before all of the group shared memory has been loaded, which makes this the perfect synchronization method here.

The second pair of synchronization functions performs a similar operation, except that it works on device memory. This means all pending memory writes to resources accessed through unordered access views are completed before the shader program continues. This is useful when a larger amount of memory needs synchronizing: if the required size is too large for group shared memory, the data must instead live in a device-memory resource.

The third pair performs both of the previous kinds of synchronization at once, synchronizing access to group shared memory and device memory simultaneously.

Atomic Operations

Memory barriers are useful for synchronizing all threads of a thread group. However, in many cases synchronization is needed at a smaller scale, perhaps among only a few threads. In other cases, the threads to be synchronized may not be at the same point of execution (for example, when different threads in the thread group perform heterogeneous tasks). Shader Model 5 introduces a number of atomic operations that provide finer-grained synchronization between threads, so that when multiple threads access a shared resource, each access is guaranteed to happen one at a time. Once an atomic operation begins, it runs to completion:

Atomic operations
InterlockedAdd
InterlockedMin
InterlockedMax
InterlockedOr
InterlockedAnd
InterlockedXor
InterlockedCompareStore
InterlockedCompareExchange
InterlockedExchange

Atomic operations can be used on both group shared memory and memory resources. As an example of their use: if a compute shader program needs to count how many threads hold a particular data value, the total count can be initialized to zero, and each thread can call InterlockedAdd on either group shared memory (fastest access) or a resource (persists across dispatch calls). The atomic operation guarantees the total is incremented correctly, without intermediate values being overwritten by other threads.
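The counting pattern just described can be sketched on the CPU, where std::atomic plays the role of InterlockedAdd (a host-side C++ analogy rather than HLSL; the predicate "value equals 42" and the thread counts are made up for illustration):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Count how many of 256 values match a predicate, using 8 threads and one
// shared atomic counter (the CPU counterpart of HLSL's InterlockedAdd).
int CountMatches()
{
    std::vector<int> data(256);
    for (int i = 0; i < 256; ++i)
        data[i] = (i % 4 == 0) ? 42 : -1;   // every 4th value matches -> 64 total

    std::atomic<int> total{0};              // total count initialized to zero

    std::vector<std::thread> threads;
    for (int t = 0; t < 8; ++t)
    {
        threads.emplace_back([&data, &total, t] {
            for (int i = t * 32; i < (t + 1) * 32; ++i)
                if (data[i] == 42)
                    total.fetch_add(1);     // atomic increment, like InterlockedAdd
        });
    }
    for (auto& th : threads)
        th.join();

    return total.load();                    // 64, with no lost updates
}
```

Replacing fetch_add with a plain increment on a non-atomic int reintroduces exactly the lost-update hazard that the atomic intrinsics prevent.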

Each function has its own requirements and input operands, so consult the Direct3D 11 documentation to choose the appropriate one. The pixel shader stage can also use these functions, allowing synchronization across a resource (note that pixel shaders do not support group shared memory).

Order-Independent Transparency

Now let's look at how to compute transparency correctly. For a given pixel, let the current background color be \(c_0\), and let the transparent fragments to be rendered there, in descending order of depth, have colors \(c_1, c_2, ..., c_n\) and alphas \(a_1, a_2, ..., a_n\). The final pixel color is:

\[c = a_n c_n + (1 - a_n)[\cdots[a_2 c_2 + (1 - a_2)[a_1 c_1 + (1 - a_1) c_0]]\cdots]\]

In our previous rendering we could not control the draw order of the transparent fragments; with luck the result might even come out right, but as soon as the viewing angle changes, problems appear. If several transparent objects in the scene interpenetrate, then basically no matter how you change the view, the blending will not render correctly. Therefore, to achieve order-independent transparency, we must first collect the fragments, then sort them by depth, and finally compute the correct pixel color.
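To make the formula concrete, here is a minimal single-channel evaluation of it in C++, blending back to front exactly as the nesting prescribes (the colors and alphas in the usage note are made up):

```cpp
// Evaluate c = a_n*c_n + (1-a_n)*[ ... [a_1*c_1 + (1-a_1)*c_0] ... ]
// for one color channel. c[0]/a[0] belong to the deepest fragment.
float BlendBackToFront(float c0, const float* c, const float* a, int n)
{
    float result = c0;                                  // start from the background
    for (int i = 0; i < n; ++i)                         // deepest fragment first
        result = a[i] * c[i] + (1.0f - a[i]) * result;
    return result;
}
```

With background 0.0 and two half-transparent fragments 0.8 (deep) and 0.2 (near), the result is 0.5*0.2 + 0.5*(0.5*0.8 + 0.5*0.0) = 0.3; blending them in the opposite order gives 0.45 instead, which is exactly why the order matters.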

Per-Pixel Linked Lists

Direct3D 11 hardware opens the door to many new rendering algorithms. In particular, UAV writes from the pixel shader, together with buffers that support an attached atomic counter, make per-pixel linked lists possible, which in turn enable techniques such as OIT, indirect shadows, and ray tracing of dynamic scenes.

However, because HLSL only passes by value and has no pointers or references, linked lists based on pointers or references cannot be built on the GPU. Instead we use a static linked list, which can be implemented in an array: what was originally the next pointer becomes the index of the next element.

Because an array is a contiguous region of memory, we can store multiple static linked lists in a single array (as long as there is enough space). Based on this idea, we can create one list per pixel, collecting all the fragments to be rendered at the corresponding screen position.

The algorithm takes two passes:

  1. Build the static linked lists. In a pixel shader, use a head-insertion scheme to gradually form static linked lists inside one large array.
  2. Render using the static linked lists. In a pixel shader, fetch the list nodes for the current pixel, sort them, and write the blended result to the render target.

Building the Static Linked Lists

First, it must be emphasized that this step requires pixel shader 5.0, not compute shader 5.0, because this step intercepts the fragments that would otherwise be drawn to the render target and stores them in the static linked lists instead. Writing to a buffer requires pixel-shader support for unordered access views (UAVs), which only shader model 5.0 and above provides.

In addition, we need support for atomic operations, which also requires shader model 5.0.

Finally, we need to create two read/write buffers to bind as unordered access views:

  1. Fragment/link buffer: this buffer holds fragment data plus the index of the next element (the link). Since it must hold the static linked lists of every pixel, enough space has to be reserved up front (the element count is usually several times the total number of render-target pixels), so one of the algorithm's main costs is GPU memory. Furthermore, the fragment/link buffer must be a structured buffer rather than several typed buffers, because only a RWStructuredBuffer can enable the hidden counter, and this counter is an indispensable part of the static list implementation: it counts how many list nodes the buffer already stores.
  2. Start offset buffer: this buffer matches the render target in width and height; each element is the offset, within the fragment/link buffer, of the head node of the static linked list for the corresponding render-target pixel. Because head insertion is used, it usually points at the fragment most recently written for that pixel. We define -1 (0xFFFFFFFF for a uint) to mean the end of the list, so the buffer must be initialized to -1 before every use. This buffer is a RWByteAddressBuffer, because that type supports atomic operations.

The figure below shows the process of building the static linked lists in the pixel shader:

After watching this animation you should basically understand the idea; you may already have a rough code structure in mind, but for now follow along with the existing code.

First, here are the constant buffer, structures, and functions needed for this effect:

// OIT.hlsli

cbuffer CBFrame : register(b6)
{
    uint g_FrameWidth;      // frame width in pixels
    uint g_FrameHeight;     // frame height in pixels
    uint2 g_Pad2;
}

struct FragmentData
{
    uint Color;             // fragment color packed as R8G8B8A8
    float Depth;            // depth value
};

struct FLStaticNode
{
    FragmentData Data;      // fragment data
    uint Next;              // index of the next node
};

// Pack a color
uint PackColorFromFloat4(float4 color)
{
    uint4 colorUInt4 = (uint4) (color * 255.0f);
    return colorUInt4.r | (colorUInt4.g << 8) | (colorUInt4.b << 16) | (colorUInt4.a << 24);
}

// Unpack a color
float4 UnpackColorFromUInt(uint color)
{
    uint4 colorUInt4 = uint4(color, color >> 8, color >> 16, color >> 24) & (0x000000FF);
    return (float4) colorUInt4 / 255.0f;
}

A pixel color has type float4; storing it as-is in the buffer would consume a great deal of video memory. Since the back buffer we ultimately present to uses R8G8B8A8_UNORM or B8G8R8A8_UNORM, packing the color into a uint cuts this storage to 1/4 of the original.

An even more aggressive approach: if all transparent objects are known to share the same alpha (say, all 0.5), the color can be compressed to R5G6B5_UNORM and the depth to a 16-bit normalized value, so that one fragment needs only half the memory to represent; the cost, of course, is that both color and depth become lossy.
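As a sketch of that halved layout (and of the exercise at the end of the chapter), here is one possible C++ packing: depth as a 16-bit unorm in the high half, R5G6B5 color in the low half. The exact bit layout is my own choice for illustration, not something fixed by the chapter:

```cpp
#include <cstdint>

// Pack an RGB color as R5G6B5 and a [0,1] depth as a 16-bit unorm into one
// 32-bit value: depth in the high 16 bits, color in the low 16 bits.
// (This particular bit layout is an assumption for illustration.)
uint32_t PackColorDepth(float r, float g, float b, float depth)
{
    uint32_t r5  = (uint32_t)(r * 31.0f + 0.5f) & 0x1F;       // 5 bits red
    uint32_t g6  = (uint32_t)(g * 63.0f + 0.5f) & 0x3F;       // 6 bits green
    uint32_t b5  = (uint32_t)(b * 31.0f + 0.5f) & 0x1F;       // 5 bits blue
    uint32_t d16 = (uint32_t)(depth * 65535.0f + 0.5f) & 0xFFFF;
    return (d16 << 16) | (r5 << 11) | (g6 << 5) | b5;
}

// Unpack; both color and depth come back lossy.
void UnpackColorDepth(uint32_t packed, float& r, float& g, float& b, float& depth)
{
    r = ((packed >> 11) & 0x1F) / 31.0f;
    g = ((packed >> 5) & 0x3F) / 63.0f;
    b = (packed & 0x1F) / 31.0f;
    depth = (packed >> 16) / 65535.0f;
}
```

The round-trip error is bounded by half a quantization step per channel: at most 1/62 for the 5-bit channels, 1/126 for green, and about 1/131070 for depth.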

Next comes the shader that stores the pixel fragments:

#include "Basic.hlsli"
#include "OIT.hlsli"

RWStructuredBuffer<FLStaticNode> g_FLBuffer : register(u1);
RWByteAddressBuffer g_StartOffsetBuffer : register(u2);

// Build the static linked lists
// Run the depth/stencil test early to avoid creating nodes for fragments that fail the depth test
[earlydepthstencil]
void PS(VertexPosHWNormalTex pIn)
{
    // The usual lighting code is omitted; the final lit color is litColor
    // ...
    
    // Get the current fragment count and post-increment the counter
    uint pixelCount = g_FLBuffer.IncrementCounter();
    
    // Exchange values in the StartOffsetBuffer
    uint2 vPos = (uint2) pIn.PosH.xy;  
    uint startOffsetAddress = 4 * (g_FrameWidth * vPos.y + vPos.x);
    uint oldStartOffset;
    g_StartOffsetBuffer.InterlockedExchange(
        startOffsetAddress, pixelCount, oldStartOffset);
    
    // Add a new node to the fragment/link buffer
    FLStaticNode node;
    // Pack the color as R8G8B8A8
    node.Data.Color = PackColorFromFloat4(litColor);
    node.Data.Depth = pIn.PosH.z;
    node.Next = oldStartOffset;
    
    g_FLBuffer[pixelCount] = node;
}

There are many interesting new parts here that deserve careful explanation one by one.

First, the UAV registers. For now, just note that the register index cannot start from 0 here; the concrete reason will only become clear when we discuss a particular C++ API later.

Moving on to the PS: we can attach attributes to the pixel shader, like the [earlydepthstencil] above. Before drawing the transparent objects we have already drawn the opaque ones, and an opaque object occludes the transparent fragments behind it. Although the depth test normally happens after the pixel shader, we would like to reject occluded fragments before they are written into the fragment/link buffer. With the [earlydepthstencil] attribute we can move the depth/stencil test forward, to after rasterization and before the pixel shading stage; this effectively culls occluded fragments, reducing the cost while keeping the rendering correct.

Next is IncrementCounter, a method specific to RWStructuredBuffer: it returns the current counter value and then increments the counter by 1. Its inverse is DecrementCounter. It is an atomic operation: with a large number of threads accessing one counter, synchronization is required so that only one thread accesses the counter at a time, which guarantees safety.

SV_POSITION deserves mention once more. As a vertex shader output, it carries NDC coordinates before the perspective divide; as a pixel shader input, it has gone through the perspective divide and viewport transform and carries the coordinates of the corresponding pixel. For example, the pixel in column 233, row 154 (1-based) has xy coordinates (232.5, 153.5); discarding the fractional part gives exactly the index of the same position in a texture of equal width and height.
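The index arithmetic can be restated in C++ to check it: truncate the pixel-center coordinates from SV_Position, then compute the row-major byte offset into the start offset buffer (the 1280-pixel frame width below is a made-up value):

```cpp
#include <cstdint>

// Byte address of a pixel's entry in the start offset buffer: the buffer is
// row-major and each entry is one 4-byte uint, mirroring the HLSL expression
// 4 * (g_FrameWidth * vPos.y + vPos.x).
uint32_t StartOffsetAddress(float posX, float posY, uint32_t frameWidth)
{
    uint32_t x = (uint32_t)posX;   // (232.5, 153.5) truncates to (232, 153)
    uint32_t y = (uint32_t)posY;
    return 4 * (frameWidth * y + x);
}
```

Note that the result is always a multiple of 4, which is exactly what the RWByteAddressBuffer atomics below require.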

Next, the RWByteAddressBuffer method InterlockedExchange:

void InterlockedExchange(
  in  uint dest,            // destination address
  in  uint value,           // value to store
  out uint original_value   // original value read out
);

You can think of it as a function that writes into the buffer while spitting out the previously stored value. The only thing to keep in mind is that the address of every RWByteAddressBuffer atomic operation must be a multiple of 4, because its unit of reading and writing is the 32-bit uint.
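The head-insertion scheme built on IncrementCounter and InterlockedExchange can be mimicked single-threaded on the CPU. This sketch (buffer sizes and colors are made up) builds per-pixel static lists in one flat array the same way the store shader does:

```cpp
#include <cstdint>
#include <vector>

const uint32_t kEndOfList = 0xFFFFFFFF;   // the -1 sentinel from the text

struct Node
{
    uint32_t color;   // packed fragment color
    float depth;      // fragment depth
    uint32_t next;    // index of the next node, kEndOfList if none
};

struct PixelLists
{
    std::vector<Node> nodes;            // fragment/link buffer
    std::vector<uint32_t> startOffset;  // head index per pixel
    uint32_t counter = 0;               // stands in for IncrementCounter()

    PixelLists(uint32_t pixelCount, uint32_t capacity)
        : nodes(capacity), startOffset(pixelCount, kEndOfList) {}

    // Head insertion: the new node becomes the head, the old head its next.
    void Insert(uint32_t pixel, uint32_t color, float depth)
    {
        uint32_t idx = counter++;                 // IncrementCounter()
        uint32_t oldHead = startOffset[pixel];    // InterlockedExchange(...)
        startOffset[pixel] = idx;
        nodes[idx] = { color, depth, oldHead };
    }

    // Walk one pixel's list, most recently inserted fragment first.
    std::vector<uint32_t> Colors(uint32_t pixel) const
    {
        std::vector<uint32_t> out;
        for (uint32_t i = startOffset[pixel]; i != kEndOfList; i = nodes[i].next)
            out.push_back(nodes[i].color);
        return out;
    }
};
```

Traversing a pixel's list yields fragments in reverse insertion order, which is exactly why the render pass still has to sort them by depth.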

The Actual Rendering Pass

Now the fragment/link buffer and the start offset buffer must both be bound as shader resources. We also need a background texture holding the scene's opaque objects as the initial blend value, and we still have to write the result to the render target. For this we also use the TextureRender class to hold a texture with the same width and height as the back buffer, into which all opaque objects of the scene are rendered.

For the vertex shader, since we are rendering the whole window, the vertices can be passed straight through:

// OIT_Render_VS.hlsl
#include "OIT.hlsli"

// Vertex shader
float4 VS(float3 vPos : POSITION) : SV_Position
{
    return float4(vPos, 1.0f);
}

In the pixel shader we then need to sort the list for the current pixel by depth. Since device memory access is relatively slow and sorting involves frequent memory operations, sorting the list directly in the UAV would be very inefficient. It is better to copy the fragments into a temporary register array first and sort there, which in practice means declaring a global static array in the pixel shader to hold the list nodes. Since the array size is fixed, making it large is not a good choice: it affects both the cost of the sort and the memory overhead. So we limit the number of fragments to sort, which also means only the first few elements of each list need to be read; this is a reasonable compromise.

The choice of sorting algorithm also affects the final efficiency. For small sorts, insertion sort works well: it sorts in place and performs no redundant swaps on an already sorted sequence. And because this sort runs within a single thread, bitonic sort cannot be used.

The pixel shader code is as follows:

// OIT_Render_PS.hlsl
#include "OIT.hlsli"

StructuredBuffer<FLStaticNode> g_FLBuffer : register(t0);
ByteAddressBuffer g_StartOffsetBuffer : register(t1);
Texture2D g_BackGround : register(t2);

#define MAX_SORTED_PIXELS 8

static FragmentData g_SortedPixels[MAX_SORTED_PIXELS];

// Insertion sort, depth values descending
void SortPixelInPlace(int numPixels)
{
    FragmentData temp;
    for (int i = 1; i < numPixels; ++i)
    {
        for (int j = i - 1; j >= 0; --j)
        {
            if (g_SortedPixels[j].Depth < g_SortedPixels[j + 1].Depth)
            {
                temp = g_SortedPixels[j];
                g_SortedPixels[j] = g_SortedPixels[j + 1];
                g_SortedPixels[j + 1] = temp;
            }
            else
            {
                break;
            }
        }
    }
}



float4 PS(float4 posH : SV_Position) : SV_Target
{
    // Fetch the background color at the current pixel position
    float4 currColor = g_BackGround.Load(int3(posH.xy, 0));
    
    // Find the head of the linked list at the current pixel position
    uint2 vPos = (uint2) posH.xy;
    int startOffsetAddress = 4 * (g_FrameWidth * vPos.y + vPos.x);
    int numPixels = 0;
    uint offset = g_StartOffsetBuffer.Load(startOffsetAddress);
    
    FLStaticNode element;
    
    // Walk every node of the list
    while (offset != 0xFFFFFFFF)
    {
        // Fetch the fragment at the current index
        element = g_FLBuffer[offset];
        // Copy the fragment into the temporary array
        g_SortedPixels[numPixels++] = element.Data;
        // Get the next node's index, but read at most the first MAX_SORTED_PIXELS fragments
        offset = (numPixels >= MAX_SORTED_PIXELS) ?
            0xFFFFFFFF : element.Next;
    }
    
    // Sort the fetched fragments by depth, descending
    SortPixelInPlace(numPixels);
    
    // Blend with SrcAlpha-InvSrcAlpha
    for (int i = 0; i < numPixels; ++i)
    {
        // Unpack the packed color
        float4 pixelColor = UnpackColorFromUInt(g_SortedPixels[i].Color);
        // Blend
        currColor.xyz = lerp(currColor.xyz, pixelColor.xyz, pixelColor.w);
    }
    
    // Return the manually blended color
    return currColor;
}

That's it for the HLSL, but there are still many thorny problems to handle on the C++ side.

The OITRender Class

Collecting the OIT fragments is done by swapping in a different pixel shader, so this class has to piggyback on BasicEffect and is not well suited to being an independent effect. Here is the definition of the OITRender class:

class OITRender
{
public:
    template<class T>
    using ComPtr = Microsoft::WRL::ComPtr<T>;

    OITRender() = default;
    ~OITRender() = default;
    // No copying, moving allowed
    OITRender(const OITRender&) = delete;
    OITRender& operator=(const OITRender&) = delete;
    OITRender(OITRender&&) = default;
    OITRender& operator=(OITRender&&) = default;

    HRESULT InitResource(ID3D11Device* device, 
        UINT width,         // frame width
        UINT height,        // frame height
        UINT multiple = 1); // how many times the number of frame pixels to reserve for fragment storage

    // Begin collecting the fragments of transparent objects
    void BeginDefaultStore(ID3D11DeviceContext* deviceContext);
    // End collecting, restore state
    void EndStore(ID3D11DeviceContext* deviceContext);
    
    // Blend the background with the transparent fragments to complete the final rendering
    void Draw(ID3D11DeviceContext * deviceContext, ID3D11ShaderResourceView* background);

    void SetDebugObjectName(const std::string& name);

private:
    struct {
        int width;
        int height;
        int pad1;
        int pad2;
    } m_CBFrame;                                                // corresponds to the constant buffer in OIT.hlsli
private:
    ComPtr<ID3D11InputLayout> m_pInputLayout;                   // vertex input layout for drawing the screen

    ComPtr<ID3D11Buffer> m_pFLBuffer;                           // fragment/link buffer
    ComPtr<ID3D11Buffer> m_pStartOffsetBuffer;                  // start offset buffer
    ComPtr<ID3D11Buffer> m_pVertexBuffer;                       // vertex buffer for drawing the background
    ComPtr<ID3D11Buffer> m_pIndexBuffer;                        // index buffer for drawing the background
    ComPtr<ID3D11Buffer> m_pConstantBuffer;                     // constant buffer

    ComPtr<ID3D11ShaderResourceView> m_pFLBufferSRV;            // SRV of the fragment/link buffer
    ComPtr<ID3D11ShaderResourceView> m_pStartOffsetBufferSRV;   // SRV of the start offset buffer

    ComPtr<ID3D11UnorderedAccessView> m_pFLBufferUAV;           // UAV of the fragment/link buffer
    ComPtr<ID3D11UnorderedAccessView> m_pStartOffsetBufferUAV;  // UAV of the start offset buffer

    ComPtr<ID3D11VertexShader> m_pOITRenderVS;                  // vertex shader for the transparent blend pass
    ComPtr<ID3D11PixelShader> m_pOITRenderPS;                   // pixel shader for the transparent blend pass
    ComPtr<ID3D11PixelShader> m_pOITStorePS;                    // pixel shader that stores transparent fragments
    
    ComPtr<ID3D11PixelShader> m_pCachePS;                       // temporarily cached pixel shader

    UINT m_FrameWidth;                                          // frame width in pixels
    UINT m_FrameHeight;                                         // frame height in pixels
    UINT m_IndexCount;                                          // number of indices to draw
};

The initialization code is omitted here, but when calling it, take care to provide a reasonable multiple of the frame pixel count: if it is set too low, the buffer may not hold all the transparent fragments and rendering will be corrupted.

OITRender::BeginDefaultStore — Collecting Fragments Under the Default Effect

Whatever render class you write, managing render state is the most complicated part; a single mistake can ruin the result.

This method must first solve two main problems: initializing the UAVs, and binding them to the pixel shader stage.

ID3D11DeviceContext::ClearUnorderedAccessViewUint — Set a UAV's Initial Value With a Specific Value/Vector

void ClearUnorderedAccessViewUint(
  ID3D11UnorderedAccessView *pUnorderedAccessView,  // [In] the UAV to clear
  const UINT [4]            Values                  // [In] clear value/vector
);

This method works on any UAV; it clears using the raw bits of the values. For a DXGI-typed view such as R16G16_UNORM, the method takes the low 16 bits of the first two elements of Values and copies them into the x and y components of every element. For a raw-memory or structured-buffer view, only the first element of Values is copied into every 4 bytes of the buffer.
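A small C++ model of that rule for illustration (this only mimics the documented behavior, it is not driver code; placing x in the low 16 bits follows the usual little-endian DXGI packing and is my assumption):

```cpp
#include <cstdint>

// What ClearUnorderedAccessViewUint writes into one R16G16_UNORM element:
// the low 16 bits of Values[0] go to x, the low 16 bits of Values[1] to y.
uint32_t ClearR16G16(const uint32_t values[4])
{
    uint32_t x = values[0] & 0xFFFF;
    uint32_t y = values[1] & 0xFFFF;
    return x | (y << 16);
}

// What it writes into each 4-byte element of a raw/structured view:
// simply Values[0], bit for bit.
uint32_t ClearRawElement(const uint32_t values[4])
{
    return values[0];
}
```

This is why clearing the two buffers below with a single 0xFFFFFFFF value works: both are raw/structured views, so every 4-byte element receives 0xFFFFFFFF verbatim.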

ID3D11DeviceContext::OMSetRenderTargetsAndUnorderedAccessViews — Set Render Targets and UAVs at the Output Merger Stage

Since the pixel shader can use UAVs, I first spent ages looking for an ID3D11DeviceContext::PSSetUnorderedAccessViews, only to discover that UAV binding is provided by a function of the OM stage instead.

void ID3D11DeviceContext::OMSetRenderTargetsAndUnorderedAccessViews(
  UINT                      NumRTVs,                        // [In] number of render targets
  ID3D11RenderTargetView    * const *ppRenderTargetViews,   // [In] array of render target views
  ID3D11DepthStencilView    *pDepthStencilView,             // [In] depth/stencil view
  UINT                      UAVStartSlot,                   // [In] first UAV slot
  UINT                      NumUAVs,                        // [In] number of UAVs
  ID3D11UnorderedAccessView * const *ppUnorderedAccessViews,    // [In] array of unordered access views
  const UINT                *pUAVInitialCounts                  // [In] initial counter value for each UAV
);

The first three and last three parameters should pose no problem, but the one in the middle is a big trap. For the pixel shader, UAVStartSlot must equal the number of render target views already bound. Render targets and unordered access views share the same resource slots when writing, which means the UAVs must be given an offset so that they land in the slots after the render target views to be bound. This is why the u registers in the earlier HLSL code had to start from 1.

Note: RTVs, the DSV, and UAVs cannot be set independently; they must all be set at the same time.

Two RTVs bound to the same subresource (and therefore sharing the same texture), or two such UAVs, or such a UAV and RTV, will cause a conflict.

OMSetRenderTargetsAndUnorderedAccessViews runs correctly only in the following situations:

When NumRTVs != D3D11_KEEP_RENDER_TARGETS_AND_DEPTH_STENCIL and NumUAVs != D3D11_KEEP_UNORDERED_ACCESS_VIEWS, the following conditions must be satisfied:

  • NumRTVs <= 8
  • UAVStartSlot >= NumRTVs
  • UAVStartSlot + NumUAVs <= 8
  • None of the RTVs and UAVs being set may have resource conflicts
  • The DSV's texture must match the RTVs' textures (but not be the same one)

When NumRTVs == D3D11_KEEP_RENDER_TARGETS_AND_DEPTH_STENCIL, OMSetRenderTargetsAndUnorderedAccessViews binds only UAVs, and the following conditions must be satisfied:

  • UAVStartSlot + NumUAVs <= 8
  • None of the UAVs being set may have resource conflicts

It also unbinds the following:

  • All RTVs in slots >= UAVStartSlot
  • All RTVs that have a resource conflict with the UAVs being bound
  • All UAVs that conflict with currently bound resources (SO targets, CS UAVs, SRVs)

The depth/stencil buffer provided is ignored, and the currently bound depth/stencil buffer is not unbound.

When NumUAVs == D3D11_KEEP_UNORDERED_ACCESS_VIEWS, OMSetRenderTargetsAndUnorderedAccessViews binds only RTVs and the DSV, and the following conditions must be satisfied:

  • NumRTVs <= 8
  • The RTVs must not have resource conflicts with each other
  • The DSV's texture must match the RTVs' textures (but not be the same one)

It also unbinds the following:

  • All UAVs in slots < NumRTVs
  • All UAVs that have a resource conflict with the RTVs being bound
  • All RTVs that conflict with currently bound resources (SO targets, CS UAVs, SRVs)

The UAVStartSlot provided is ignored.

Now we can turn our attention back to OITRender::BeginDefaultStore:

void OITRender::BeginDefaultStore(ID3D11DeviceContext* deviceContext)
{
    deviceContext->RSSetState(RenderStates::RSNoCull.Get());
    
    UINT numClassInstances = 0;
    deviceContext->PSGetShader(m_pCachePS.GetAddressOf(), nullptr, &numClassInstances);
    deviceContext->PSSetShader(m_pOITStorePS.Get(), nullptr, 0);

    // Initialize the UAVs
    UINT magicValue[1] = { 0xFFFFFFFF };
    deviceContext->ClearUnorderedAccessViewUint(m_pFLBufferUAV.Get(), magicValue);
    deviceContext->ClearUnorderedAccessViewUint(m_pStartOffsetBufferUAV.Get(), magicValue);
    // Bind the UAVs to the pixel shader stage
    ID3D11UnorderedAccessView* pUAVs[2] = { m_pFLBufferUAV.Get(), m_pStartOffsetBufferUAV.Get() };
    UINT initCounts[2] = { 0, 0 };
    deviceContext->OMSetRenderTargetsAndUnorderedAccessViews(D3D11_KEEP_RENDER_TARGETS_AND_DEPTH_STENCIL,
        nullptr, nullptr, 1, 2, pUAVs, initCounts);

    // Disable depth writes
    deviceContext->OMSetDepthStencilState(RenderStates::DSSNoDepthWrite.Get(), 0);
    // Set the constant buffer
    deviceContext->PSSetConstantBuffers(6, 1, m_pConstantBuffer.GetAddressOf());
}

Two points in the code above deserve special attention:

  1. Since the objects are transparent, back-face culling must be disabled
  2. Since no actual drawing to the render target takes place, depth writes must be disabled

OITRender::EndStore — End Collecting

The method is as follows:

void OITRender::EndStore(ID3D11DeviceContext* deviceContext)
{
    // Restore the render state
    deviceContext->PSSetShader(m_pCachePS.Get(), nullptr, 0);
    ComPtr<ID3D11RenderTargetView> currRTV;
    ComPtr<ID3D11DepthStencilView> currDSV;
    ID3D11UnorderedAccessView* pUAVs[2] = { nullptr, nullptr };
    deviceContext->OMSetRenderTargetsAndUnorderedAccessViews(D3D11_KEEP_RENDER_TARGETS_AND_DEPTH_STENCIL,
        nullptr, nullptr, 1, 2, pUAVs, nullptr);
    m_pCachePS.Reset();
}

OITRender::Draw — Sort and Blend the Transparent Fragments and Complete the Drawing

The method is as follows; by this step things are no longer so complicated:

void OITRender::Draw(ID3D11DeviceContext* deviceContext, ID3D11ShaderResourceView* background)
{

    UINT strides[1] = { sizeof(VertexPos) };
    UINT offsets[1] = { 0 };
    deviceContext->IASetVertexBuffers(0, 1, m_pVertexBuffer.GetAddressOf(), strides, offsets);
    deviceContext->IASetIndexBuffer(m_pIndexBuffer.Get(), DXGI_FORMAT_R32_UINT, 0);

    deviceContext->IASetInputLayout(m_pInputLayout.Get());
    deviceContext->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);

    deviceContext->VSSetShader(m_pOITRenderVS.Get(), nullptr, 0);
    deviceContext->PSSetShader(m_pOITRenderPS.Get(), nullptr, 0);

    deviceContext->GSSetShader(nullptr, nullptr, 0);
    deviceContext->RSSetState(nullptr);

    ID3D11ShaderResourceView* pSRVs[3] = {
        m_pFLBufferSRV.Get(), m_pStartOffsetBufferSRV.Get(), background};
    deviceContext->PSSetShaderResources(0, 3, pSRVs);
    deviceContext->PSSetConstantBuffers(6, 1, m_pConstantBuffer.GetAddressOf());

    deviceContext->OMSetDepthStencilState(nullptr, 0);
    deviceContext->OMSetBlendState(nullptr, nullptr, 0xFFFFFFFF);

    deviceContext->DrawIndexed(m_IndexCount, 0, 0);

    // After drawing, simply unbind the resources
    pSRVs[0] = pSRVs[1] = pSRVs[2] = nullptr;
    deviceContext->PSSetShaderResources(0, 3, pSRVs);

}

Drawing the Scene

The scene now contains the terrain, the waves, and two intersecting transparent cubes. The GameApp::DrawScene method, considering only the OIT-enabled path, is as follows:

void GameApp::DrawScene()
{
    assert(m_pd3dImmediateContext);
    assert(m_pSwapChain);

    m_pd3dImmediateContext->ClearRenderTargetView(m_pRenderTargetView.Get(), reinterpret_cast<const float*>(&Colors::Silver));
    m_pd3dImmediateContext->ClearDepthStencilView(m_pDepthStencilView.Get(), D3D11_CLEAR_DEPTH | D3D11_CLEAR_STENCIL, 1.0f, 0);
    
    // Render to the temporary background
    m_pTextureRender->Begin(m_pd3dImmediateContext.Get(), reinterpret_cast<const float*>(&Colors::Silver));
    {
        // ******************
        // 1. Draw the opaque objects
        //
        m_BasicEffect.SetRenderDefault(m_pd3dImmediateContext.Get(), BasicEffect::RenderObject);
        m_BasicEffect.SetTexTransformMatrix(XMMatrixIdentity());
        m_Land.Draw(m_pd3dImmediateContext.Get(), m_BasicEffect);
    
        // ******************
        // 2. Store the fragments of the transparent objects
        //
        m_pOITRender->BeginDefaultStore(m_pd3dImmediateContext.Get());
        {
            m_RedBox.Draw(m_pd3dImmediateContext.Get(), m_BasicEffect);
            m_YellowBox.Draw(m_pd3dImmediateContext.Get(), m_BasicEffect);
            m_pGpuWavesRender->Draw(m_pd3dImmediateContext.Get(), m_BasicEffect);
        }
        m_pOITRender->EndStore(m_pd3dImmediateContext.Get());
    }
    m_pTextureRender->End(m_pd3dImmediateContext.Get());
    
    // Render to the back buffer
    m_pOITRender->Draw(m_pd3dImmediateContext.Get(), m_pTextureRender->GetOutputTexture());

    // ******************
    // Draw the Direct2D part
    //
    // ...

    HR(m_pSwapChain->Present(0, 0));
}

Demo

Below, the scene is shown rendered with OIT off and depth writes off, with OIT off but depth writes on, and with OIT on:

With OIT enabled the average frame rate is 2700, versus 4200 by default. The main factors affecting rendering performance are: the use of RTT, the complexity of the transparent pixels in the scene, the choice of sorting algorithm, and the limit on the number of sorted fragments. To keep rendering efficient, it is best to reduce the complexity of transparent objects and their number in the scene, and if necessary even avoid transparent blending altogether.

Exercises

  1. Try modifying the HLSL code to compress the color to R5G6B5_UNORM (stipulating that all transparent objects have an alpha of 0.5) and the depth to a 16-bit normalized value. The C++ code must also be adapted to match.

References

  1. The OIT sample in the DirectX SDK Samples
  2. The OIT-and-Indirect-Illumination-using-DX11-Linked-Lists presentation



Origin: www.cnblogs.com/X-Jun/p/12272328.html