Introduction to 3D Game Programming with DirectX 12 学习笔记之 --- 第十三章：计算着色器（The Compute Shader）

原文: Introduction to 3D Game Programming with DirectX 12 学习笔记之 --- 第十三章：计算着色器（The Compute Shader）

代码工程地址：

https://github.com/jiabaodan/Direct12BookReadingNotes

GPU已经被优化为处理单个地址或者连续地址（流操作）的大量内存数据；这和CPU的随机内存访问形成鲜明对比。因为顶点和像素可以独立处理，所以GPU被架构为大量的并行运算；比如NVIDIA的“Fermi”架构支持16个拥有32个CUDA cores的流多处理器（streaming multiprocessors），总共可以由512个CUDA cores。
使用GPU计算非图形的应用称之为普通目的的GPU编程（general purpose GPU (GPGPU) programming）。

学习目标

学习如何编写计算着色器程序；
对硬件如何与线程组和线程处理有一个基本的高级理解；
学习哪些D3D资源可以作为CS的输入，哪些可以作为输出；
理解线程ID变量和他们的用途；
学习共享内存，已经它们如何用来优化性能；
查找更多有关GPGPU编程的资料。

1 线程（THREADS）和线程组（THREAD GROUPS）

在GPU编程中，多个用以处理的线程会划分为一个格子的线程组，一个线程组在单个处理器上执行。所以如果你的GPU有16个多处理器，那么你至少要把你的需求划分为16个线程组，这样你所有的多处理器都可以同时计算。为了有更好的性能，你应该为每个多处理器划分2个线程组，这样就可以切换线程组（[Fung10]）。
每个线程组获取的共享内存，可以让所有线程组内的线程访问；线程不能访问其他线程组的共享内存。
一个线程组包含n个线程。硬件把这些线程划分为warps（32个线程为一个warp），然后warps被多处理器以SIMD32来处理。每个CUDA core处理一个线程并且回顾“Fermi”多处理器，有32个CUDA cores。在D3D中你可以用一个不是32的倍数的值指定一个线程组的大小，但是出于性能考虑，最好还是指定为warp大小的倍数（[Fung10]）。
对于不同的硬件，设置线程组为256看起来是一个好的开始，然后再尝试其他尺寸。

NVIDIA使用warp尺寸（32线程）；ATI使用wavefront尺寸（64线程），并且建议线程组尺寸要一直是wavefront的倍数。当然，warp和wavefront在将来的硬件中可能会改变。

在D3D，线程组有下面的函数开始：

void ID3D12GraphicsCommandList::Dispatch(
	UINT ThreadGroupCountX,
	UINT ThreadGroupCountY,
	UINT ThreadGroupCountZ);

本书只关心2维。下面的例子表示x方向有3个线程组，y方向有2个线程组，所以总共6个：
在这里插入图片描述

2 一个简单的计算着色器

下面是一个简单的计算着色器，对两个相同尺寸的纹理相加：

cbuffer cbSettings
{
	// Compute shader can access values in constant buffers.
};

// Data sources and outputs.
Texture2D gInputA;
Texture2D gInputB;
RWTexture2D<float4> gOutput;

// The number of threads in the thread group. The threads in a group can
// be arranged in a 1D, 2D, or 3D grid layout.
[numthreads(16, 16, 1)]
void CS(int3 dispatchThreadID : SV_DispatchThreadID) // Thread ID
{
	// Sum the xyth texels and store the result in the xyth texel of
	// gOutput.
	gOutput[dispatchThreadID.xy] = gInputA[dispatchThreadID.xy] + gInputB[dispatchThreadID.xy];
}

一个计算着色器包含下面的组件：

一个全局变量用来访问常量缓冲；
输入和输出资源，下节介绍；
[numthreads(X, Y, Z)]属性，指定在线程组中线程的数量；
着色器主体执行代码；
线程识别系统参数；

观察上面的代码，线程组中的线程可以有不同的线程拓扑结构，主要根据你的问题需求来选择不同的拓扑结构。尺寸最好是wavefront的倍数（因为同时也是warp的倍数），这样就可以同时兼容两种显卡。

2.1 计算PSO

为了开启计算着色器，我们使用一个特殊的“计算渲染状态描述”。它的属性要比D3D12_GRAPHICS_PIPELINE_STATE_DESC少很多，因为它并不在图形管线中，所以图形管线的各种状态它都不需要。下面是一个创建的例子：

D3D12_COMPUTE_PIPELINE_STATE_DESC wavesUpdatePSO = {};
wavesUpdatePSO.pRootSignature = mWavesRootSignature.Get();
wavesUpdatePSO.CS =
{
	reinterpret_cast<BYTE*> (mShaders["wavesUpdateCS"]->GetBufferPointer()),
		mShaders["wavesUpdateCS"]->GetBufferSize()
};

wavesUpdatePSO.Flags = D3D12_PIPELINE_STATE_FLAG_NONE;
ThrowIfFailed(md3dDevice->CreateComputePipelineState(
	&wavesUpdatePSO,
	IID_PPV_ARGS(&mPSOs["wavesUpdate"])));

根签名描述了哪些输入参数。下面是编译CS代码的例子：

mShaders["wavesUpdateCS"] = d3dUtil::CompileShader(
	L"Shaders\\WaveSim.hlsl", nullptr,
	"UpdateWavesCS", "cs_5_0");

3 输入和输出资源

CS支持2种类型的资源：缓冲和纹理。

3.1 纹理的输入

在上一章的例子中，定义了2个纹理输入：

Texture2D gInputA;
Texture2D gInputB;

它们通过创建(SRVs)来传递：

cmdList->SetComputeRootDescriptorTable(1, mSrvA);
cmdList->SetComputeRootDescriptorTable(2, mSrvB);

这个和像素着色器的绑定是一样的（SRVs是只读的）。

3.2 纹理的输出和无序访问视图（UAVs）

之前的代码中创建了一个输出资源：

RWTexture2D<float4> gOutput;

输出资源比较特殊，并有一个特殊的前缀“RW”表示可以读写（read-write）。相比之下gInputA和gInputB是只读的。并且需要指定类型和维度。比如如果我们需要输出2D的整形类型DXGI_FORMAT_R8G8_SINT，那么需要这样写：

RWTexture2D<int2> gOutput;

绑定输出资源到CS，需要新的视图类型unordered access view (UAV)，它在代码中通过描述句柄和D3D12_UNORDERED_ACCESS_VIEW_DESC描述来表示。它与SRV的创建类似，下面是创建UAV的例子：

D3D12_RESOURCE_DESC texDesc;
ZeroMemory(&texDesc, sizeof(D3D12_RESOURCE_DESC));
texDesc.Dimension = D3D12_RESOURCE_DIMENSION_TEXTURE2D;
texDesc.Alignment = 0;
texDesc.Width = mWidth;
texDesc.Height = mHeight;
texDesc.DepthOrArraySize = 1;
texDesc.MipLevels = 1;
texDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
texDesc.SampleDesc.Count = 1;
texDesc.SampleDesc.Quality = 0;
texDesc.Layout = D3D12_TEXTURE_LAYOUT_UNKNOWN;
texDesc.Flags = D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS;

ThrowIfFailed(md3dDevice->CreateCommittedResource(
	&CD3DX12_HEAP_PROPERTIES(D3D12_HEAP_TYPE_DEFAULT),
	D3D12_HEAP_FLAG_NONE,
	&texDesc,
	D3D12_RESOURCE_STATE_COMMON,
	nullptr,
	IID_PPV_ARGS(&mBlurMap0)));
	
D3D12_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
srvDesc.Shader4ComponentMapping = D3D12_DEFAULT_SHADER_4_COMPONENT_MAPPING;
srvDesc.Format = mFormat;
srvDesc.ViewDimension = D3D12_SRV_DIMENSION_TEXTURE2D;
srvDesc.Texture2D.MostDetailedMip = 0;
srvDesc.Texture2D.MipLevels = 1;

D3D12_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
uavDesc.Format = mFormat;
uavDesc.ViewDimension = D3D12_UAV_DIMENSION_TEXTURE2D;
uavDesc.Texture2D.MipSlice = 0;

md3dDevice->CreateShaderResourceView(mBlurMap0.Get(),
	&srvDesc, mBlur0CpuSrv);
	
md3dDevice->CreateUnorderedAccessView(mBlurMap0.Get(),
	nullptr, &uavDesc, mBlur0CpuUav);

如果一个纹理要绑定为UAV，它必须要通过flag值为D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS来创建。
回顾描述堆的类型D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV可以混合它们到同一个堆上。当放它们到堆上的时候，我们只需要针对分派调用（dispatch call）通过传递描述句柄到根参数上来绑定资源到流水线。下面是针对CS的根签名代码：

void BlurApp::BuildPostProcessRootSignature()
{
	CD3DX12_DESCRIPTOR_RANGE srvTable;
	srvTable.Init(D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 1, 0);
	CD3DX12_DESCRIPTOR_RANGE uavTable;
	uavTable.Init(D3D12_DESCRIPTOR_RANGE_TYPE_UAV, 1, 0);
	
	// Root parameter can be a table, root descriptor or root constants.
	CD3DX12_ROOT_PARAMETER slotRootParameter[3];
	
	// Perfomance TIP: Order from most frequent to least frequent.
	slotRootParameter[0].InitAsConstants(12, 0);
	slotRootParameter[1].InitAsDescriptorTable(1, &srvTable);
	slotRootParameter[2].InitAsDescriptorTable(1, &uavTable);
	
	// A root signature is an array of root parameters.
	CD3DX12_ROOT_SIGNATURE_DESC rootSigDesc(3,
		slotRootParameter,
		0, nullptr,
		D3D12_ROOT_SIGNATURE_FLAG_ALLOW_INPUT_ASSEMBLER_INPUT_// create a root signature with a single slot which points to a
		
	// descriptor range consisting of a single constant buffer
	ComPtr<ID3DBlob> serializedRootSig = nullptr;
	ComPtr<ID3DBlob> errorBlob = nullptr;
	HRESULT hr = D3D12SerializeRootSignature(&rootSigDesc,
		D3D_ROOT_SIGNATURE_VERSION_1,
		serializedRootSig.GetAddressOf(),
		errorBlob.GetAddressOf());
		
	if(errorBlob != nullptr)
	{
		::OutputDebugStringA((char*)errorBlob->GetBufferPointer());
	}
	
	ThrowIfFailed(hr);
	
	ThrowIfFailed(md3dDevice->CreateRootSignature(
		0,
		serializedRootSig->GetBufferPointer(),
		serializedRootSig->GetBufferSize(),
		IID_PPV_ARGS(mPostProcessRootSignature.GetAddressOf())));
}

在分派调用前，我们绑定常量和描述：

cmdList->SetComputeRootSignature(rootSig);
cmdList->SetComputeRoot32BitConstants(0, 1, &blurRadius, 0);
cmdList->SetComputeRoot32BitConstants(0, (UINT)weights.size(), weights.data(), 1);
cmdList->SetComputeRootDescriptorTable(1, mBlur0GpuSrv);
cmdList->SetComputeRootDescriptorTable(2, mBlur1GpuUav);

UINT numGroupsX = (UINT)ceilf(mWidth / 256.0f);
cmdList->Dispatch(numGroupsX, mHeight, 1);

3.3 纹理索引和采样

纹理的元素通过一个2D索引来访问，索引基于分派线程ID（3.4介绍），每一个线程具有唯一的分派ID：

[numthreads(16, 16, 1)]
void CS(int3 dispatchThreadID : SV_DispatchThreadID)
{
	// Sum the xyth texels and store the result in the xyth texel of
	// gOutput.
	gOutput[dispatchThreadID.xy] =
		gInputA[dispatchThreadID.xy] +
		gInputB[dispatchThreadID.xy];
}

假设我们分派了足够多的线程堆来覆盖到纹理，那么这个代码就将两个纹理相加，保存到gOutput。

因为CS是在GPU上执行，所以它可以访问GPU的工具，我们可以对纹理采用使用滤波器。但是有两个问题。第一：不能使用Sample方法，而是SampleLeve，多了一个mip等级的参数，因为CS不是直接用来渲染，所以不知道相机与它的距离，所以必须设置Mip等级；其中0代表最高级，小数会做线性差值；第二：做纹理采样的时候，我们使用标准纹理坐标系[0, 1]2代替整数索引，纹理尺寸(width, height)可以设置到常量缓冲变量，然后标准化纹理坐标：
在这里插入图片描述
下面的代码展示了CS使用整形索引，第二个相同的版本是使用纹理坐标和SampleLevel（假设纹理尺寸是512*512，并只用最高级mip等级）：

//
// VERSION 1: Using integer indices.
//
cbuffer cbUpdateSettings
{
	float gWaveConstant0;
	float gWaveConstant1;
	float gWaveConstant2;
	float gDisturbMag;
	int2 gDisturbIndex;
};

RWTexture2D<float> gPrevSolInput : register(u0);
RWTexture2D<float> gCurrSolInput : register(u1);
RWTexture2D<float> gOutput : register(u2);

[numthreads(16, 16, 1)]
void CS(int3 dispatchThreadID : SV_DispatchThreadID)
{
int x = dispatchThreadID.x;
int y = dispatchThreadID.y;

gNextSolOutput[int2(x,y)] =
	gWaveConstants0*gPrevSolInput[int2(x,y)].r +
	gWaveConstants1*gCurrSolInput[int2(x,y)].r +
	gWaveConstants2*(
		gCurrSolInput[int2(x,y+1)].r +
		gCurrSolInput[int2(x,y-1)].r +
		gCurrSolInput[int2(x+1,y)].r +
		gCurrSolInput[int2(x-1,y)].r);
}

//
// VERSION 2: Using SampleLevel and texture coordinates.
//
cbuffer cbUpdateSettings
{
	float gWaveConstant0;
	float gWaveConstant1;
	float gWaveConstant2;
	float gDisturbMag;
	int2 gDisturbIndex;
};

SamplerState samPoint : register(s0);
RWTexture2D<float> gPrevSolInput : register(u0);
RWTexture2D<float> gCurrSolInput : register(u1);
RWTexture2D<float> gOutput : register(u2);

[numthreads(16, 16, 1)]
void CS(int3 dispatchThreadID : SV_DispatchThreadID)
{
// Equivalently using SampleLevel() instead of operator [].
int x = dispatchThreadID.x;
int y = dispatchThreadID.y;

float2 c = float2(x,y)/512.0f;
float2 t = float2(x,y-1)/512.0;
float2 b = float2(x,y+1)/512.0;
float2 l = float2(x-1,y)/512.0;
float2 r = float2(x+1,y)/512.0;

gNextSolOutput[int2(x,y)] =
	gWaveConstants0*gPrevSolInput.SampleLevel(samPoint, c, 0.0f).r +
	gWaveConstants1*gCurrSolInput.SampleLevel(samPoint, c, 0.0f).r +
	gWaveConstants2*(
		gCurrSolInput.SampleLevel(samPoint, b, 0.0f).r +
		gCurrSolInput.SampleLevel(samPoint, t, 0.0f).r +
		gCurrSolInput.SampleLevel(samPoint, r, 0.0f).r +
		gCurrSolInput.SampleLevel(samPoint, l, 0.0f).r);
}

3.4 结构化的缓冲资源

下面的代码展示了HLSL中结构化的缓冲：

struct Data
{
	float3 v1;
	float2 v2;
};

StructuredBuffer<Data> gInputA : register(t0);
StructuredBuffer<Data> gInputB : register(t1);
RWStructuredBuffer<Data> gOutput : register(u0);

结构化的缓冲可以简单的看做是缓冲中一个结构类型元素的数组，它可以让用户在HLSL中定义。
它可以作为SRV，也可以作为UAV，创建方法类似：

struct Data
{
	XMFLOAT3 v1;
	XMFLOAT2 v2;
};

// Generate some data to fill the SRV buffers with.
std::vector<Data> dataA(NumDataElements);
std::vector<Data> dataB(NumDataElements);

for(int i = 0; i < NumDataElements; ++i)
{
	dataA[i].v1 = XMFLOAT3(i, i, i);
	dataA[i].v2 = XMFLOAT2(i, 0);
	dataB[i].v1 = XMFLOAT3(-i, i, 0.0f);
	dataB[i].v2 = XMFLOAT2(0, -i);
}
UINT64 byteSize = dataA.size()*sizeof(Data);

// Create some buffers to be used as SRVs.
mInputBufferA = d3dUtil::CreateDefaultBuffer(
	md3dDevice.Get(),
	mCommandList.Get(),
	dataA.data(),
	byteSize,
	mInputUploadBufferA);
	
mInputBufferB = d3dUtil::CreateDefaultBuffer(
	md3dDevice.Get(),
	mCommandList.Get(),
	dataB.data(),
	byteSize,
	mInputUploadBufferB);
	
// Create the buffer that will be a UAV.
ThrowIfFailed(md3dDevice->CreateCommittedResource(
	&CD3DX12_HEAP_PROPERTIES(D3D12_HEAP_TYPE_DEFAULT),
	D3D12_HEAP_FLAG_NONE,
	&CD3DX12_RESOURCE_DESC::Buffer(byteSize,
		D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS),
	D3D12_RESOURCE_STATE_UNORDERED_ACCESS,
	nullptr,
	IID_PPV_ARGS(&mOutputBuffer)));

结构化的缓冲绑定到流水线和纹理是类似的。我们创建SRV和UAV描述给他们然后以参数的方式传递到描述表类型的根参数上。不同的地方在于，我们可以定义根描述类型的根签名，所以我们可以直接绑定它们的虚拟地址到根参数上，而不通过描述堆（只适用于SRV和UAV，不能用以纹理），考虑下面的根签名描述：

// Root parameter can be a table, root descriptor or root constants.
CD3DX12_ROOT_PARAMETER slotRootParameter[3];

// Perfomance TIP: Order from most frequent to least frequent.
slotRootParameter[0].InitAsShaderResourceView(0);
slotRootParameter[1].InitAsShaderResourceView(1);
slotRootParameter[2].InitAsUnorderedAccessView(0);

// A root signature is an array of root parameters.
CD3DX12_ROOT_SIGNATURE_DESC rootSigDesc(3,
	slotRootParameter,
	0, nullptr,
	D3D12_ROOT_SIGNATURE_FLAG_NONE);

然后我们绑定我们的缓冲到分派调用：

mCommandList->SetComputeRootSignature(mRootSignature.Get());
mCommandList->SetComputeRootShaderResourceView(0,
	mInputBufferA->GetGPUVirtualAddress());
mCommandList->SetComputeRootShaderResourceView(1,
	mInputBufferB->GetGPUVirtualAddress());
mCommandList->SetComputeRootUnorderedAccessView(2,
	mOutputBuffer->GetGPUVirtualAddress());
mCommandList->Dispatch(1, 1, 1);

3.5 拷贝CS结构到系统内存

需要适用堆属性D3D12_HEAP_TYPE_READBACK创建系统内存缓冲。然后我们可以使用ID3D12GraphicsCommandList::CopyResource方法拷贝GPU资源到系统内存资源。系统资源要有相同的大小和格式。最终我们可以通过映射API映射系统内存缓冲，然后再CPU读取。
我们有一个结构化缓冲Demo叫“VecAdd”，只是相加了对应vector：

struct Data
{
	float3 v1;
	float2 v2;
};

StructuredBuffer<Data> gInputA : register(t0);
StructuredBuffer<Data> gInputB : register(t1);
RWStructuredBuffer<Data> gOutput : register(u0);

[numthreads(32, 1, 1)]
void CS(int3 dtid : SV_DispatchThreadID)
{
	gOutput[dtid.x].v1 = gInputA[dtid.x].v1 + gInputB[dtid.x].v1;
	gOutput[dtid.x].v2 = gInputA[dtid.x].v2 + gInputB[dtid.x].v2;
}

为了简化，这个结构化缓冲只包含32个元素，所以我们只分派了一个线程组（一个线程组处理32个元素）。当CS计算完成后，我们将结果拷贝到系统内存，然后保存到文件。下面的代码展示了如何拷贝到系统内存：

// Create a system memory version of the buffer to read the
// results back from.
ThrowIfFailed(md3dDevice->CreateCommittedResource(
	&CD3DX12_HEAP_PROPERTIES(D3D12_HEAP_TYPE_READBACK),
	D3D12_HEAP_FLAG_NONE,
	&CD3DX12_RESOURCE_DESC::Buffer(byteSize),
	D3D12_RESOURCE_STATE_COPY_DEST,
	nullptr,
	IID_PPV_ARGS(&mReadBackBuffer)));
	
// …
//
// Compute shader finished!
struct Data
{
	XMFLOAT3 v1;
	XMFLOAT2 v2;
};

// Schedule to copy the data to the default buffer to the readback buffer.
mCommandList->ResourceBarrier(1,
	&CD3DX12_RESOURCE_BARRIER::Transition(
		mOutputBuffer.Get(),
		D3D12_RESOURCE_STATE_COMMON,
		D3D12_RESOURCE_STATE_COPY_SOURCE));
		
mCommandList->CopyResource(mReadBackBuffer.Get(), mOutputBuffer.Get());
	
mCommandList->ResourceBarrier(1,
	&CD3DX12_RESOURCE_BARRIER::Transition(
		mOutputBuffer.Get(),
		D3D12_RESOURCE_STATE_COPY_SOURCE,
		D3D12_RESOURCE_STATE_COMMON));
	
// Done recording commands.
ThrowIfFailed(mCommandList->Close());

// Add the command list to the queue for execution.
ID3D12CommandList* cmdsLists[] = { mCommandList.Get() };
mCommandQueue->ExecuteCommandLists(_countof(cmdsLists), cmdsLists);

// Wait for the work to finish.
FlushCommandQueue();

// Map the data so we can read it on CPU.
Data* mappedData = nullptr;
ThrowIfFailed(mReadBackBuffer->Map(0, nullptr, reinterpret_cast<void**>(&mappedData)));
std::ofstream fout("results.txt");
for(int i = 0; i < NumDataElements; ++i)
{
	fout << "(" << mappedData[i].v1.x << ", " <<
	mappedData[i].v1.y << ", " <<
	mappedData[i].v1.z << ", " <<
	mappedData[i].v2.x << ", " <<
	mappedData[i].v2.y << ")" << std::endl;
}

mReadBackBuffer->Unmap(0, nullptr);

In the demo, we fill the two input buffers with
the following initial data:
std::vector<Data> dataA(NumDataElements);
std::vector<Data> dataB(NumDataElements);
for(int i = 0; i < NumDataElements; ++i)
{
	dataA[i].v1 = XMFLOAT3(i, i, i);
	dataA[i].v2 = XMFLOAT2(i, 0);
	dataB[i].v1 = XMFLOAT3(-i, i, 0.0f);
	dataB[i].v2 = XMFLOAT2(0, -i);
}

下面是写到文件中的结果：

(0, 0, 0, 0, 0)
(0, 2, 1, 1, -1)
(0, 4, 2, 2, -2)
(0, 6, 3, 3, -3)
(0, 8, 4, 4, -4)
(0, 10, 5, 5, -5)
(0, 12, 6, 6, -6)
(0, 14, 7, 7, -7)
(0, 16, 8, 8, -8)
(0, 18, 9, 9, -9)
(0, 20, 10, 10, -10)
(0, 22, 11, 11, -11)
(0, 24, 12, 12, -12)
(0, 26, 13, 13, -13)
(0, 28, 14, 14, -14)
(0, 30, 15, 15, -15)
(0, 32, 16, 16, -16)
(0, 34, 17, 17, -17)
(0, 36, 18, 18, -18)
(0, 38, 19, 19, -19)
(0, 40, 20, 20, -20)
(0, 42, 21, 21, -21)
(0, 44, 22, 22, -22)
(0, 46, 23, 23, -23)
(0, 48, 24, 24, -24)
(0, 50, 25, 25, -25)
(0, 52, 26, 26, -26)
(0, 54, 27, 27, -27)
(0, 56, 28, 28, -28)
(0, 58, 29, 29, -29)
(0, 60, 30, 30, -30)
(0, 62, 31, 31, -31)

从下图可以看出，在CPU和GPU之间拷贝内存数据是最慢的。对于图形，我们不要每帧这样做，它会kill性能。对于GPGPU编程，经常需要得到结果到CPU，所以对于GPGPU不是什么大问题（因为不会像每帧调用那么频繁）。
在这里插入图片描述

4 线程表示系统值（THREAD IDENTIFICATION SYSTEM VALUES）

在这里插入图片描述
被标识的线程T，具有线程组ID.(1, 1, 0)，具有组线程ID(1, 5, 0)，具有分派线程ID(1, 1, 0) ⊗
(8, 8, 0) + (2, 5, 0) = (10, 13, 0)；它的组索引ID是5·8 + 2 = 42。

每个线程组会被系统分配一个线程组ID，具有SV_GroupID标识；
线程组内，每一个线程具有一个唯一的ID：SV_GroupThreadID；
每一个分派调用，分派一网格的线程组。分派线程ID在一个分派调用中是唯一的，并且与所有创建的线程组相关联。令ThreadGroupSize =(X,Y,Z)为线程组尺寸，分派线程ID可以由组ID和组线程ID计算出来：

dispatchThreadID.xyz = groupID.xyz * ThreadGroupSize.xyz + groupThreadID.xyz;

它具有SV_DispatchThreadID标识，

一个线性索引版本的组线程ID可以通过D3D的SV_GroupIndex标识获得，它的计算：

groupIndex = groupThreadID.z*ThreadGroupSize.x*ThreadGroupSize.y +
	groupThreadID.y*ThreadGroupSize.x +
	groupThreadID.x;

关于所以坐标系的顺序，第一个坐标是x轴（列）；第二个坐标是y轴（行）。这个和传统的矩阵是相反的。

为什么要用这些ID呢？CS会输入和输出一些数据结构，我们可以将这些ID保存到数据结构中：

Texture2D gInputA;
Texture2D gInputB;

RWTexture2D<float4> gOutput;

[numthreads(16, 16, 1)]
void CS(int3 dispatchThreadID : SV_DispatchThreadID)
{
	// Use dispatch thread ID to index into output and input textures.
	gOutput[dispatchThreadID.xy] = gInputA[dispatchThreadID.xy] + gInputB[dispatchThreadID.xy];
}

SV_GroupThreadID对于索引本地储存内存很有用。

5 添加和消耗缓冲

假设我们有一个用下面的粒子的数据结构定义的缓冲：

struct Particle
{
	float3 Position;
	float3 Velocity;
	float3 Acceleration;
};

我们希望在CS在根据他的常量加速度和速度来更新它的位置。并且假设我们不关系它们更新的顺序以及写入输出缓冲的顺序。消耗和添加结构化缓冲对于这种情况就是一个方案，并且还提供了不需要考虑索引的便利：

struct Particle
{
	float3 Position;
	float3 Velocity;
	float3 Acceleration;
};

float TimeStep = 1.0f / 60.0f;
ConsumeStructuredBuffer<Particle> gInput;
AppendStructuredBuffer<Particle> gOutput;

[numthreads(16, 16, 1)]
void CS()
{
	// Consume a data element from the input buffer.
	Particle p = gInput.Consume();
	p.Velocity += p.Acceleration*TimeStep;
	p.Position += p.Velocity*TimeStep;
	
	// Append normalized vector to output buffer.
	gOutput.Append( p );
}

当一个数据被消耗掉，它不能再被其他线程消耗。

添加结构化缓冲并不是动态增长的：它必须始终足够大，来保存你添加的数据。

6 共享内存和同步

在CS代码中，共享内存可以这样声明：

groupshared float4 gCache[256];

数组大小可以随意，但是最大是32kb。因为它是线程堆的局部共享内存，所以它由SV_ThreadGroupID索引；所以，例如你可以让线程堆中的每个线程访问共享内存中的一个槽。
使用过多的共享内存可能导致一些性能问题（[Fung10]），假设多处理器支持32kb共享内存，而你需要20kb共享内存；那就代表只有一个线程堆能有足够的共享内存。这就限制了多处理器的并行运算，因为不能切换内存堆来防止等待时间（3.1中讨论过，每个多处理器最好有两个线程堆用以切换）。所以减少共享内存大小可以保证性能。
大部分应用的共享内存是用来保存纹理值的。比如模糊，需要取相同的像素多次。纹理采样是一个比较慢的GPU操作，因为内存带宽和内存等待时间并没有像GPU计算能力提高那么多（[Möller08]）。线程组可以通过将需要的纹理采样放到共享内存数组中，来避免多余的纹理读取，这样性能就可以提高很多。
加入我们使用下面错误的代码来实现这个策略：

Texture2D gInput;
RWTexture2D<float4> gOutput;
groupshared float4 gCache[256];

[numthreads(256, 1, 1)]
void CS(int3 groupThreadID : SV_GroupThreadID, int3 dispatchThreadID : SV_DispatchThreadID)
{
	// Each thread samples the texture and stores the
	// value in shared memory.
	gCache[groupThreadID.x] = gInput[dispatchThreadID.xy];
	
	// Do computation work: Access elements in shared memory
	// that other threads stored:
	// BAD!!! Left and right neighbor threads might not have
	// finished sampling tzZhe texture and storing it in shared memory.
	float4 left = gCache[groupThreadID.x - 1];
	float4 right = gCache[groupThreadID.x + 1];
	…
}

因为我们没有保证这个线程组中所有线程同时完成，所以导致这个错误的发生。由于相邻的线程还没有完成初始化操作，所以当前线程可能会访问相邻的未初始化的数据。为了修复这个问题，在CS继续计算前，要先等待所有线程完成纹理的加载计算。这个可以通过一个同步命令完成：

Texture2D gInput;
RWTexture2D<float4> gOutput;
groupshared float4 gCache[256];

[numthreads(256, 1, 1)]
void CS(int3 groupThreadID : SV_GroupThreadID, int3 dispatchThreadID : SV_DispatchThreadID)
{
	// Each thread samples the texture and stores the
	// value in shared memory.
	gCache[groupThreadID.x] = gInput[dispatchThreadID.xy];
	
	// Wait for all threads in group to finish.
	GroupMemoryBarrierWithGroupSync();
	
	// Safe now to read any element in the shared memory
	//and do computation work.
	float4 left = gCache[groupThreadID.x - 1];
	float4 right = gCache[groupThreadID.x + 1];
	…
}

7 模糊Demo

这节我们介绍如何实现一个基于CS的模糊Demo。我们从模糊的数学理论开始，然后介绍渲染到纹理技术，生成我们模糊的源纹理，最后实现基于CS的模糊代码。
在这里插入图片描述

7.1 模糊理论

本Demo的模糊算法描述如下：对于在ij位置的点P，计算以P为中心的m × n矩阵像素权重平均值：
在这里插入图片描述
权重总和必须为1，如果大于1图像会变亮，小于1会变暗。
有很多方法计算权重（总和为1），最常用的方法是高斯模糊：

高斯模糊是可以分离的，可以先水平1D模糊，然后再竖直模糊：

对于9x9的矩阵，我们需要81个采样。但是分离到2个1D的时候，我们只需要18个采样。尤其我们是在模糊纹理，纹理提取是很消耗性能的，所以通过分离模糊来减少纹理采样可以提高性能。

7.2 渲染到纹理

目前我们的程序只是渲染到后置缓冲，但是后置缓冲其实也是在交换链中的一张纹理：

Microsoft::WRL::ComPtr<ID3D12Resource> mSwapChainBuffer[SwapChainBufferCount];
CD3DX12_CPU_DESCRIPTOR_HANDLE rtvHeapHandle(mRtvHeap->GetCPUDescriptorHandleForHeapStart());

for (UINT i = 0; i < SwapChainBufferCount; i++)
{
	ThrowIfFailed(mSwapChain->GetBuffer(i, IID_PPV_ARGS(&mSwapChainBuffer[i])));
	
	md3dDevice->CreateRenderTargetView(
		mSwapChainBuffer[i].Get(), nullptr,
		rtvHeapHandle);
		
	rtvHeapHandle.Offset(1, mRtvDescriptorSize);
}

我们通过绑定后置缓冲的RTV到OM阶段来命令D3D渲染到后置缓冲中：

// Specify the buffers we are going to render to.
mCommandList->OMSetRenderTargets(1,
	&CurrentBackBufferView(),
	true, &DepthStencilView());

后置缓冲中的内容最终通过IDXGISwapChain::Present方法显示到屏幕上。

一个纹理如果要用以渲染目标需要使用D3D12_RESOURCE_FLAG_ALLOW_RENDER_TARGET flag来创建。

所以用一张纹理替换后置缓冲，将结果渲染到它上面，这个技术就叫做渲染到纹理（render-to-off-screen-texture 或者简化版本 render-to-texture）。渲染到纹理主要用以：

阴影映射（Shadow mapping）；
屏幕空间环境光遮蔽（Screen Space Ambient Occlusion）；
立方体贴图动态反射。（Dynamic reflections with cube maps）

我们的迷糊Demo实现方案步骤如下：

正常绘制场景到一张贴图；
使用CS模糊它；
映射模糊后的贴图到一个屏幕大小的方块几何体，然后绘制到后置缓冲。

渲染到纹理的方案是可以实现的；假设后置缓冲的格式和大小与我们纹理的一致，我们还可以先正常渲染到后置缓冲，然后使用CopyResource方法复制资源到纹理：

// Copy the input (back-buffer in this example) to BlurMap0.
cmdList->CopyResource(mBlurMap0.Get(), input);

上面的步骤需要我们先进行正常的渲染流水线，然后切换到CS进行计算，然后切换回渲染流水线。这样的切换是由开销的（[NVIDIA10]）应当尽可能避免这样的切换。

7.3 模糊实现概述

我们假设模糊是分离的，即2个1D模糊。我们需要2张纹理，，我们叫他们A和B，并且绑定SRV输入，UAV输出；那么模糊算法如下：

绑定SRV到A，作为CS的输入；
绑定UAV到B，作为CS的输出；
分派水平模糊，此时B保存的是水平模糊后的纹理；
绑定SRV到B，作为CS的输入；
绑定UAV到A，作为CS的输出；
分派竖直模糊，此时A保存的是模糊后的结果。

因为我们渲染的纹理和窗口的尺寸一致，所以在OnResize函数中需要重新创建我们的模糊纹理：

void BlurApp::OnResize()
{
	D3DApp::OnResize();
	
	// The window resized, so update the aspect ratio and
	// recompute the projection matrix.
	XMMATRIX P = XMMatrixPerspectiveFovLH(
		0.25f*MathHelper::Pi, AspectRatio(),
		1.0f, 1000.0f);
	XMStoreFloat4x4(&mProj, P);
	if(mBlurFilter != nullptr)
	{
		mBlurFilter->OnResize(mClientWidth, mClientHeight);
	}
}

void BlurFilter::OnResize(UINT newWidth, UINT newHeight)
{
	if((mWidth != newWidth) || (mHeight != newHeight))
	{
		mWidth = newWidth;
		mHeight = newHeight;
		
		// Rebuild the off-screen texture resource with new dimensions.
		BuildResources();
		
		// New resources, so we need new descriptors to that resource.
		BuildDescriptors();
	}
}

mBlur变量我们创建的BlurFilter辅助类的一个实例。该类封装了纹理A和B，SRVs和UAVs，提供了开始CS模糊运算的方法。
BlurFilter类封装了纹理资源，通过使用draw/dispatch方法来绑定资源到流水线，我们需要创建这些资源的描述。这代表我们需要在D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV描述堆中申请更多的空间。BlurFilter使用BlurFilter::BuildDescriptors函数，利用descriptor句柄在描述堆中开始定位和保存描述。原因在于当屏幕尺寸变化的时候，可以重新创建资源：

void BlurFilter::BuildDescriptors(
	CD3DX12_CPU_DESCRIPTOR_HANDLE hCpuDescriptor,
	CD3DX12_GPU_DESCRIPTOR_HANDLE hGpuDescriptor,
	UINT descriptorSize)
{
	// Save references to the descriptors.
	mBlur0CpuSrv = hCpuDescriptor;
	mBlur0CpuUav = hCpuDescriptor.Offset(1, descriptorSize);
	mBlur1CpuSrv = hCpuDescriptor.Offset(1, descriptorSize);
	mBlur1CpuUav = hCpuDescriptor.Offset(1, descriptorSize);
	mBlur0GpuSrv = hGpuDescriptor;
	mBlur0GpuUav = hGpuDescriptor.Offset(1, descriptorSize);
	mBlur1GpuSrv = hGpuDescriptor.Offset(1, descriptorSize);
	mBlur1GpuUav = hGpuDescriptor.Offset(1, descriptorSize);
	BuildDescriptors();
}

void BlurFilter::BuildDescriptors()
{
	D3D12_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
	srvDesc.Shader4ComponentMapping = D3D12_DEFAULT_SHADER_4_COMPONENT_MAPPING;
	srvDesc.Format = mFormat;
	srvDesc.ViewDimension = D3D12_SRV_DIMENSION_TEXTURE2D;
	srvDesc.Texture2D.MostDetailedMip = 0;
	srvDesc.Texture2D.MipLevels = 1;
	D3D12_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
	uavDesc.Format = mFormat;
	uavDesc.ViewDimension = D3D12_UAV_DIMENSION_TEXTURE2D;
	uavDesc.Texture2D.MipSlice = 0;

	md3dDevice->CreateShaderResourceView(mBlurMap0.Get(),
		&srvDesc, mBlur0CpuSrv);
	md3dDevice->CreateUnorderedAccessView(mBlurMap0.Get(),
		nullptr, &uavDesc, mBlur0CpuUav);
	md3dDevice->CreateShaderResourceView(mBlurMap1.Get(),
		&srvDesc, mBlur1CpuSrv);
	md3dDevice->CreateUnorderedAccessView(mBlurMap1.Get(),
		nullptr, &uavDesc, mBlur1CpuUav);
}

// In BlurApp.cpp…Offset to location in heap to
// store descriptors for BlurFilter
mBlurFilter->BuildDescriptors(
	CD3DX12_CPU_DESCRIPTOR_HANDLE(
	mCbvSrvUavDescriptorHeap->GetCPUDescriptorHandleForHeapStart(),
		3, mCbvSrvUavDescriptorSize),
	CD3DX12_GPU_DESCRIPTOR_HANDLE(
	mCbvSrvUavDescriptorHeap->GetGPUDescriptorHandleForHeapStart(),
		3, mCbvSrvUavDescriptorSize),
	mCbvSrvUavDescriptorSize);

模糊是一个很占用性能的操作，它的运算量主要与纹理的大小相关。一般情况下我们渲染到纹理的时候，可以渲染到一张比后置缓冲小的纹理上。这样可以提高渲染的纹理的速度；因为尺寸减小了，所以提高了模糊的速度；最终绘制到后置缓冲的时候，因为用了放大滤波器，又增加一层模糊效果。

假设我们的贴图是宽w，高h。在下章中的CS我们可以看到，对于水平1D模糊，我们线程组水平方向有256个线程，所以我们需要分发 w/256。如果256不能被w整除，最后的线程组将会有多余的线程。对此我们没有办法，除非线程组大小被修复。我们可以使用clamping来进行边缘检测。竖直方向和水平方向处理类似。
下面的代码支出多少线程组被分派，并且开始实际的在CS上的模糊操作：

void BlurFilter::Execute(ID3D12GraphicsCommandList* cmdList,
	ID3D12RootSignature* rootSig,
	ID3D12PipelineState* horzBlurPSO,
	ID3D12PipelineState* vertBlurPSO,
	ID3D12Resource* input,
	int blurCount)
{
	auto weights = CalcGaussWeights(2.5f);
	int blurRadius = (int)weights.size() / 2; cmdList->SetComputeRootSignature(rootSig);
	cmdList->SetComputeRoot32BitConstants(0, 1, &blurRadius, 0);
	cmdList->SetComputeRoot32BitConstants(0, (UINT)weights.size(), weights. data(), 1);
	
	cmdList->ResourceBarrier(1,
		&CD3DX12_RESOURCE_BARRIER::Transition(input,
		D3D12_RESOURCE_STATE_RENDER_TARGET,
		D3D12_RESOURCE_STATE_COPY_SOURCE));
		
	cmdList->ResourceBarrier(1,
		&CD3DX12_RESOURCE_BARRIER::Transition(mBlurMap0.
		Get(),
		D3D12_RESOURCE_STATE_COMMON,
		D3D12_RESOURCE_STATE_COPY_DEST));
		
	// Copy the input (back-buffer in this example) to BlurMap0.
	cmdList->CopyResource(mBlurMap0.Get(), input);
	cmdList->ResourceBarrier(1,
		&CD3DX12_RESOURCE_BARRIER::Transition(mBlurMap0. Get(),
		D3D12_RESOURCE_STATE_COPY_DEST,
		D3D12_RESOURCE_STATE_GENERIC_READ));
	cmdList->ResourceBarrier(1,
		&CD3DX12_RESOURCE_BARRIER::Transition(mBlurMap1.
		Get(),
		D3D12_RESOURCE_STATE_COMMON,
		D3D12_RESOURCE_STATE_UNORDERED_ACCESS));
		
	for(int i = 0; i < blurCount; ++i)
	{
		//
		// Horizontal Blur pass.
		//
		cmdList->SetPipelineState(horzBlurPSO);
		cmdList->SetComputeRootDescriptorTable(1, mBlur0GpuSrv);
		cmdList->SetComputeRootDescriptorTable(2, mBlur1GpuUav);
		
		// How many groups do we need to dispatch to cover a row of pixels, where
		// each group covers 256 pixels (the 256 is defined in the ComputeShader).
		UINT numGroupsX = (UINT)ceilf(mWidth / 256.0f);
		cmdList->Dispatch(numGroupsX, mHeight, 1);
		
		cmdList->ResourceBarrier(1,
			&CD3DX12_RESOURCE_BARRIER::Transition(
			mBlurMap0.Get(),
			D3D12_RESOURCE_STATE_GENERIC_READ,
			D3D12_RESOURCE_STATE_UNORDERED_ACCESS));
		cmdList->ResourceBarrier(1,
			&CD3DX12_RESOURCE_BARRIER::Transition(
			mBlurMap1.Get(),
			D3D12_RESOURCE_STATE_UNORDERED_ACCESS,
			D3D12_RESOURCE_STATE_GENERIC_READ));
			
		//
		// Vertical Blur pass.
		//
		cmdList->SetPipelineState(vertBlurPSO);
		cmdList->SetComputeRootDescriptorTable(1, mBlur1GpuSrv);
		cmdList->SetComputeRootDescriptorTable(2, mBlur0GpuUav);
		
		// How many groups do we need to dispatch to cover a column of pixels,
		// where each group covers 256 pixels (the 256 is defined in the
		// ComputeShader).
		UINT numGroupsY = (UINT)ceilf(mHeight / 256.0f);
		cmdList->Dispatch(mWidth, numGroupsY, 1);
		
		cmdList->ResourceBarrier(1,
			&CD3DX12_RESOURCE_BARRIER::Transition(
			mBlurMap0.Get(),
			D3D12_RESOURCE_STATE_UNORDERED_ACCESS,
			D3D12_RESOURCE_STATE_GENERIC_READ));
		cmdList->ResourceBarrier(1,
			&CD3DX12_RESOURCE_BARRIER::Transition(
			mBlurMap1.Get(),
			D3D12_RESOURCE_STATE_GENERIC_READ,
			D3D12_RESOURCE_STATE_UNORDERED_ACCESS));
	}
}

在这里插入图片描述

7.4 计算着色器编程

根据之前章节的描述，我们线程组水平方向有256个线程，每个线程模糊一个像素。一个低效的方案是直接实现每个像素的模糊，这种方案的问题在于需要针对每个纹理的像素提取多次，浪费性能；
在这里插入图片描述
我们可以通过共享内存的方式来优化这个方案。每个线程可以在共享内存中读取像素值，当所有线程读取完毕后，再完成模糊操作。如果线程组有n = 256个线程，那么需要n + 2R个像素来模糊，R是模糊的半径：
在这里插入图片描述
解决方案很简单，我们申请n + 2R个元素的共享内存，然后有2R个线程看向2个像素值。唯一棘手的是当索引共享内存的时候需要一些记录；我们不再有第i个线程组ID对应第i个元素。下图展示了当R=4时的共享内存：
在这里插入图片描述
最后一个需要讨论的问题是，最左边和最右边的组索引的时候，会出输入纹理的范围：

超出边界的值正常情况下返回的是0，但是在我们这个Demo中，0就代表了黑色。我们采用使用边界值，类似clamp函数。这个可以通过clamping索引来实现：

// Clamp out of bound samples that occur at left image borders.
int x = max(dispatchThreadID.x - gBlurRadius, 0);
gCache[groupThreadID.x] = gInput[int2(x, dispatchThreadID.y)];

// Clamp out of bound samples that occur at right image borders.
int x = min(dispatchThreadID.x + gBlurRadius, gInput.Length.x-1);
gCache[groupThreadID.x+2*gBlurRadius] = gInput[int2(x, dispatchThreadID.y)];

// Clamp out of bound samples that occur at image borders.
gCache[groupThreadID.x+gBlurRadius] = gInput[min(dispatchThreadID.xy, gInput.Length.xy- 1)];

最终完整的着色器代码如下：

//====================================================================
// Performs a separable Guassian blur with a blur
radius up to 5 pixels.
//====================================================================
cbuffer cbSettings : register(b0)
{
	// We cannot have an array entry in a constant buffer that gets mapped onto
	// root constants, so list each element. int gBlurRadius;
	// Support up to 11 blur weights.
	float w0;
	float w1;
	float w2;
	float w3;
	float w4;
	float w5;
	float w6;
	float w7;
	float w8;
	float w9;
	float w10;
};

static const int gMaxBlurRadius = 5;

Texture2D gInput : register(t0);
RWTexture2D<float4> gOutput : register(u0);

#define N 256
#define CacheSize (N + 2*gMaxBlurRadius)
groupshared float4 gCache[CacheSize];

[numthreads(N, 1, 1)]
void HorzBlurCS(int3 groupThreadID : SV_GroupThreadID,
	int3 dispatchThreadID : SV_DispatchThreadID)
{
	// Put in an array for each indexing.
	float weights[11] = { w0, w1, w2, w3, w4, w5, w6, w7, w8, w9, w10 };
	
	//
	// Fill local thread storage to reduce bandwidth. To blur
	// N pixels, we will need to load N + 2*BlurRadius pixels
	// due to the blur radius.
	//
	// This thread group runs N threads. To get the extra 2*BlurRadius
	// pixels, have 2*BlurRadius threads sample an extra pixel.
	if(groupThreadID.x < gBlurRadius)
	{
		// Clamp out of bound samples that occur at image borders.
		int x = max(dispatchThreadID.x - gBlurRadius, 0);
		gCache[groupThreadID.x] = gInput[int2(x, dispatchThreadID.y)];
	}
	
	if(groupThreadID.x >= N-gBlurRadius)
	{
		// Clamp out of bound samples that occur at image borders.
		int x = min(dispatchThreadID.x + gBlurRadius, gInput.Length.x-1);
		gCache[groupThreadID.x+2*gBlurRadius] = gInput[int2(x, dispatchThreadID.y)];
	}
	
	// Clamp out of bound samples that occur at image borders.
	gCache[groupThreadID.x+gBlurRadius] = gInput[min(dispatchThreadID.xy, gInput.Length.xy-1)];
	
	// Wait for all threads to finish.
	GroupMemoryBarrierWithGroupSync();
	
	//
	// Now blur each pixel.
	//
	float4 blurColor = float4(0, 0, 0, 0);
	for(int i = -gBlurRadius; i <= gBlurRadius; ++i)
	{
		int k = groupThreadID.x + gBlurRadius + i;
		blurColor +=
		weights[i+gBlurRadius]*gCache[k];
	}
	
	gOutput[dispatchThreadID.xy] = blurColor;
}

[numthreads(1, N, 1)]
void VertBlurCS(int3 groupThreadID : SV_GroupThreadID,
	int3 dispatchThreadID : SV_DispatchThreadID)
{
	// Put in an array for each indexing.
	float weights[11] = { w0, w1, w2, w3, w4, w5, w6, w7, w8, w9, w10 };
	
	//
	// Fill local thread storage to reduce bandwidth. To blur
	// N pixels, we will need to load N + 2*BlurRadius pixels
	// due to the blur radius.
	//
	// This thread group runs N threads. To get the extra 2*BlurRadius
	// pixels, have 2*BlurRadius threads sample an extra pixel.
	if(groupThreadID.y < gBlurRadius)
	{
		// Clamp out of bound samples that occur at image borders.
		int y = max(dispatchThreadID.y - gBlurRadius, 0);
		gCache[groupThreadID.y] = gInput[int2(dispatchThreadID.x, y)];
	}
	
	if(groupThreadID.y >= N-gBlurRadius)
	{
		// Clamp out of bound samples that occur at image borders.
		int y = min(dispatchThreadID.y + gBlurRadius, gInput.Length.y-1);
		gCache[groupThreadID.y+2*gBlurRadius] = gInput[int2(dispatchThreadID.x, y)];
	}
	
	// Clamp out of bound samples that occur at image borders.
	gCache[groupThreadID.y+gBlurRadius] = gInput[min(dispatchThreadID.xy, gInput.Length.xy-1)];
	
	// Wait for all threads to finish.
	GroupMemoryBarrierWithGroupSync();
	
	//
	// Now blur each pixel.
	//
	float4 blurColor = float4(0, 0, 0, 0);
	for(int i = -gBlurRadius; i <= gBlurRadius; ++i)
	{
		int k = groupThreadID.y + gBlurRadius + i;
		blurColor += weights[i+gBlurRadius]*gCache[k];
	}
	
	gOutput[dispatchThreadID.xy] = blurColor;
}

最后一行：

gOutput[dispatchThreadID.xy] = blurColor;

dispatchThreadID.xy它有可能是超出边界的，但是我们不需要担心这个问题，因为超出边界的写入是无效的。

8 更深入的材料

计算着色器编程是一个子学科，有几本关于使用GPU进行CS编程的书：

Programming Massively Parallel Processors: A Hands-on Approach by David B. Kirk and Wen-mei W. Hwu.
OpenCL Programming Guide by Aaftab Munshi, Benedict R. Gaster, Timothy G. Mattson, James Fung, and Dan Ginsburg.

类似CUDA和OpenCL的技术只是使用不同API访问GPU编写程序。好的CUDA和OpenCL练习也是好的DX计算机编程练习，它们都执行在相同的硬件上。本章展示了主要的Direct计算语法，所以移植到CUDA和OpenCL编程不会是什么太大的问题。
Chuck Walbourn发表了博客包含了许多Direct计算介绍的链接：
http://blogs.msdn.com/b/chuckw/archive/2010/07/14/directcompute.aspx
另外微软通道9有一些关于Direct计算编程的演讲视频：
http://channel9.msdn.com/tags/DirectCompute-Lecture-Series/
最后NVIDIA有完整的CUDA训练：
http://developer.nvidia.com/cuda-training

另外Illinois大学有有完整的CUDA编程课程，是我们强烈推荐的。学习了CUDA后，你将会对GPU硬件的工作有更好的了解，可以让你写出更优化的代码。

9 总结

ID3D12GraphicsCommandList::Dispatch结构分派一个格子的线程组。每个线程组是一个3D格子的线程[numthreads(x,y,z)]；出于性能考虑，线程总数最好是warp（Nvidea硬件 32）大小的倍数或者wavefront（ATI硬件 64）大小的倍数；
为了确保并行运算，每个多处理器应该至少分配2个线程组。最新的硬件可能有更多个多处理器，所以线程组的个数应该更好的确保为新硬件多处理器个数的倍数；
当线程组被指定到多处理器后，线程组中的线程会被分开到warps个（每个32个线程），然后多处理器对每个warp线程同时以SIMD形式执行。如果一个warp停滞了，比如在提取纹理，处理器会迅速切换到另一个潜伏的warp线程并指向指令。这个会让处理器一直都在运行。这个就是建议为什么线程组的大小是warp大小的倍数的原因，如果不这么设置，某个warp中的线程就会没有指令处理；
纹理资源可以作为CS输入资源，用过SRV；可以作为读取和写入资源（RWTexture)）作为输出资源，通过UAV。纹理元素可以通过索引或者采样（纹理坐标和采样状态 SampleLevel函数）访问；
结构化缓冲是一个包含相同类型元素的数组，类型可以让用户自己定义，比如只读：

StructuredBuffer<DataType> gInputA;

读写：

RWStructuredBuffer<DataType> gOutput;

只读可以作为输入资源通过SRV绑定进来；读写通过UAV绑定。

线程ID变量通过系统值传递到CS，它通常用来索引资源和共享内存；
消耗和添加结构化缓冲在HLSL中的定义如下：

ConsumeStructuredBuffer<DataType> gInput;
AppendStructuredBuffer<DataType> gOutput;

它们用来如果你不关心元素的处理和写入输出的顺序的时候，它可以避免索引符号。添加缓冲并不动态增长，它不需要足够大来保存添加的数据。

线程组提供共享内存，访问它跟访问硬件cache一样快，它可以用来优化或者一些算法的实现。在CS中，它的定义如下：groupshared float4 gCache[N]; 数组大小可以是任意数，但是不能超过32kb，出于性能考虑，它的大小应该不超过16kb，否则不能让2个线程组指定到用一个多处理器；
尽可能避免计算处理和显然之间的切换，因为切换操作是有性能消耗的。如果可能的话，最好在每帧先执行所有计算操作，然后执行所有渲染操作。

10 练习

本章内容因为本人暂时还都用不到，练习先不写