Introduction to Graphics: Compute Shader

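This post takes a look at the Compute Shader which, as the name suggests, is a shader used for computation.

Normally the shaders we run on the GPU exist to produce a particular rendering effect, while general computation is handled in C# on the CPU. The GPU's raw arithmetic throughput, floating-point in particular, is often more than ten times that of the CPU, whereas the CPU is far better at branching logic. So when a workload is light on logic and heavy on arithmetic, moving it to the GPU can save a lot of time. AI model training is a typical example: it runs almost entirely on the GPU. Running the convolutions for video and image recognition on a CPU will bring it to its knees, while a GPU (especially one with hardware acceleration for deep learning) handles them with room to spare.

Back to the topic: let's learn Unity's compute shader (CS for short). First, the official documentation:

Unity Compute Shader

Because the GPU is so good at parallel work, a CS can speed up certain computations for us.

PS: an ordinary unlit shader can do this kind of work too, for example: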

Shader "Compute/ComputeUnlitShader"
{
    Properties
    {
        _MainTex ("Texture", 2D) = "white" {}
        _BinaThreshold("Binaryzation Threshold",Range(0,0.01)) = 0.005
    }
    SubShader
    {
        Tags { "RenderType"="Opaque" }
        LOD 100

        Pass
        {
            CGPROGRAM
            #pragma vertex vert
            #pragma fragment frag

            #include "UnityCG.cginc"

            struct appdata
            {
                float4 vertex : POSITION;
                float2 uv : TEXCOORD0;
            };

            struct v2f
            {
                float2 uv : TEXCOORD0;
                float4 vertex : SV_POSITION;
            };

            sampler2D _MainTex;
            float4 _MainTex_ST;

            float _BinaThreshold;

            v2f vert (appdata v)
            {
                v2f o;
                o.vertex = UnityObjectToClipPos(v.vertex);
                o.uv = TRANSFORM_TEX(v.uv, _MainTex);
                return o;
            }

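            // Desaturate: convert to grayscale with the usual luma weights
            fixed4 decolor(fixed4 col)
            {
                fixed g = 0.299 * col.r + 0.587 * col.g + 0.114 * col.b;
                return fixed4(g,g,g,1);
            }

            // Edge detection: threshold the screen-space gradient of the gray value
            fixed4 edgecolor(fixed4 gcol)
            {
                // Take absolute values so that rising and falling edges both count
                // and do not cancel each other out
                fixed x = abs(ddx(gcol.r));
                fixed y = abs(ddy(gcol.r));
                fixed w = (x+y)/2;
                if(w>_BinaThreshold)
                {
                    return fixed4(1,1,1,1);
                }
                return fixed4(0,0,0,1);
            }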

            fixed4 frag (v2f i) : SV_Target
            {
                fixed4 col = tex2D(_MainTex, i.uv);
                fixed4 gcol = decolor(col);
                fixed4 ecol = edgecolor(gcol);
                return ecol;
            }
            ENDCG
        }
    }
}

The C# code:

using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.UI;

public class TestComputeUnlitShader : MonoBehaviour
{
    public RawImage sourceImg;
    public RawImage destImg;

    public Material cptUnlitMat;
    public Texture2D sourceTex;
    public Texture2D destTex;

    void Start()
    {
        sourceImg.texture = sourceTex;

        RenderTexture tempRt = new RenderTexture(sourceTex.width, sourceTex.height, 0);

        Graphics.Blit(sourceTex, tempRt, cptUnlitMat);
        destTex = RT2Tex2D(tempRt);

        destImg.texture = destTex;
    }

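    // Copies the contents of a RenderTexture into a new Texture2D on the CPU side
    private Texture2D RT2Tex2D(RenderTexture rt)
    {
        Texture2D tex = new Texture2D(rt.width, rt.height, TextureFormat.RGB24, false);
        // ReadPixels reads from the active render target, so switch to rt and restore afterwards
        RenderTexture previous = RenderTexture.active;
        RenderTexture.active = rt;
        tex.ReadPixels(new Rect(0, 0, rt.width, rt.height), 0, 0);
        tex.Apply();
        RenderTexture.active = previous;
        return tex;
    }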
}

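We store the data in a Texture2D, do the computation in the fragment function, and get the processed Texture2D back via Graphics.Blit in C#. In effect we have used the GPU to carry out one pass of data processing. The result looks like this:

The shader has binarized the image.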
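To make the per-pixel work concrete, here is a rough CPU reference for what the fragment shader computes. It is only a sketch under assumptions: BinarizeReference, Luma and Edges are hypothetical names, forward differences between neighbouring pixels stand in for ddx/ddy (which the GPU actually evaluates per 2x2 pixel quad), and the differences are taken as absolute values so that edges in both directions count.

```csharp
using System;

// Hypothetical CPU reference (not part of the original project).
public static class BinarizeReference
{
    // Same luma weights as decolor() in the shader.
    public static float Luma(float r, float g, float b)
    {
        return 0.299f * r + 0.587f * g + 0.114f * b;
    }

    // gray holds width * height values in row-major order.
    // Returns 1 where the gradient exceeds the threshold, 0 elsewhere.
    public static float[] Edges(float[] gray, int width, int height, float threshold)
    {
        var result = new float[gray.Length];
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                int i = y * width + x;
                // Assumption: forward differences approximate ddx/ddy; clamp at the border.
                int right = y * width + Math.Min(x + 1, width - 1);
                int below = Math.Min(y + 1, height - 1) * width + x;
                float dx = Math.Abs(gray[right] - gray[i]);
                float dy = Math.Abs(gray[below] - gray[i]);
                float w = (dx + dy) / 2f;
                result[i] = w > threshold ? 1f : 0f;
            }
        }
        return result;
    }
}
```

Every pixel is independent of the others here, which is exactly the kind of workload that maps well onto the GPU.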

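Next let's try a CS. Typically, when we compute with a GPU shader, the input is a Texture2D and so is the output; a texture is just data, after all.

Here is the default CS code first: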

// Each #kernel tells which function to compile; you can have many kernels
#pragma kernel CSMain

// Create a RenderTexture with enableRandomWrite flag and set it
// with cs.SetTexture
RWTexture2D<float4> Result;

[numthreads(8,8,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
    // TODO: insert actual code here!

    Result[id.xy] = float4(id.x & id.y, (id.x & 15)/15.0, (id.y & 15)/15.0, 0.0);
}

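Let's go through it line by line (although the built-in English comments already explain things fairly clearly):

#pragma kernel CSMain (declares an entry-point function, much like Main in C# or #pragma vertex vert in a regular shader, except that a CS can declare several kernels)

RWTexture2D<float4> Result; (short for read/write Texture2D, i.e. a texture that serves as both input and output; as noted earlier, shaders carry their input and output data in textures)

[numthreads(8,8,1)] (one thread group of 8*8*1 = 64 threads, an 8-column by 8-row grid of threads that processes the RGBA data of the Texture2D in parallel)

MSDN documentation on numthreads

Dispatch(5,3,2): launches a 5*3*2 three-dimensional array of thread groups, i.e. a 3D array of <3D thread arrays>

numthreads(10,8,3): makes each thread group a 10*8*3 <3D thread array>

PS: the product x*y*z of numthreads is capped by the shader model rather than by the number of stream processors on the GPU: at most 1024 threads per group for cs_5_0 (with z no greater than 64) and 768 for cs_4_x.

uint3 id : SV_DispatchThreadID (a semantic that binds the ID of the current thread. The ID has three uint components, xyz, which give the thread's 3D index within <the whole array of thread arrays>. With a huge number of threads running in parallel, this ID is what lets each thread know exactly which piece of data it is responsible for)

The 3D index SV_DispatchThreadID is computed component-wise as: SV_DispatchThreadID = SV_GroupID * (numthreads.x, numthreads.y, numthreads.z) + SV_GroupThreadID, where SV_GroupID is the index of the thread group within the Dispatch and SV_GroupThreadID is the index of the thread within its group.

Result[id.xy] = float4(id.x & id.y, (id.x & 15)/15.0, (id.y & 15)/15.0, 0.0); (id.xy is the xy index into the Texture2D's 2D colour array for the current thread, which gives us the exact pixel coordinate. There is one precondition: the width and height of the Texture2D must match Dispatch(x,y)*numthreads(x,y), so that, put simply, one thread handles one pixel)

Note: I keep saying <array of arrays> to stress how much parallelism the GPU offers. numthreads defines a <3D thread array> (or, if you prefer, one long 1D array of x*y*z threads), and Dispatch(x,y,z) on the C# side then defines a 3D array of those <3D thread arrays>. With the numbers above, the work is "cut evenly" into 5*3*2 = 30 thread groups, each of which runs 10*8*3 = 240 threads, for 7200 threads in total.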
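The index arithmetic can be checked on the CPU with a small sketch. ThreadIdMath and its methods are hypothetical names used only for illustration; they are not part of Unity or HLSL, and simply mirror how the thread ID is derived from the group ID and the thread's position inside its group.

```csharp
using System;

// Hypothetical helper (not part of Unity): mirrors how the GPU derives thread IDs.
public static class ThreadIdMath
{
    // SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID, per component.
    public static (int x, int y, int z) DispatchThreadId(
        (int x, int y, int z) groupId,
        (int x, int y, int z) groupThreadId,
        (int x, int y, int z) numThreads)
    {
        return (groupId.x * numThreads.x + groupThreadId.x,
                groupId.y * numThreads.y + groupThreadId.y,
                groupId.z * numThreads.z + groupThreadId.z);
    }

    // Number of elements in a 3D extent: groups per dispatch, or threads per group.
    public static int Count((int x, int y, int z) size)
    {
        return size.x * size.y * size.z;
    }
}
```

For Dispatch(5,3,2) with numthreads(10,8,3), Count((5,3,2)) gives 30 groups and Count((10,8,3)) gives 240 threads per group. The last thread of the last group, group (4,2,1) and in-group index (9,7,2), gets the ID (49,23,5).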

That may not feel very intuitive yet, so let's write a demo:

#pragma kernel CSMain

RWTexture2D<float4> xResult;
RWTexture2D<float4> yResult;
RWTexture2D<float4> zResult;


[numthreads(4,8,2)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
    // Convert to float for the arithmetic
    float x = id.x;
    float y = id.y;
    float z = id.z;
    // Derive a gray value from each component of id
    float xcol = x/512;
    float ycol = y/512;
    float zcol = z/2;
    // Write the x/y/z gray values to the three output textures
    xResult[id.xy] = float4(xcol,xcol,xcol,1);
    yResult[id.xy] = float4(ycol,ycol,ycol,1);
    zResult[id.xy] = float4(zcol,zcol,zcol,1);
}

The C# code:

using UnityEngine;
using UnityEngine.UI;

public class DemoCSCall : MonoBehaviour
{
    public int texWidth = 512;
    public int texHeight = 512;
    public RawImage imgx;
    public RawImage imgy;
    public RawImage imgz;
    public ComputeShader theCS;

    private RenderTexture xTex;
    private RenderTexture yTex;
    private RenderTexture zTex;

    void Start()
    {

    }

    private void Update()
    {
        if (Input.GetKeyDown(KeyCode.R))
        {
            // Create the three textures used to display x, y and z
            xTex = new RenderTexture(texWidth, texHeight, 0, RenderTextureFormat.ARGB32);
            xTex.enableRandomWrite = true;
            xTex.Create();
            yTex = new RenderTexture(texWidth, texHeight, 0, RenderTextureFormat.ARGB32);
            yTex.enableRandomWrite = true;
            yTex.Create();
            zTex = new RenderTexture(texWidth, texHeight, 0, RenderTextureFormat.ARGB32);
            zTex.enableRandomWrite = true;
            zTex.Create();
            // Look up the kernel index
            int kl = theCS.FindKernel("CSMain");
            // Bind the three read/write textures
            theCS.SetTexture(kl, "xResult", xTex);
            theCS.SetTexture(kl, "yResult", yTex);
            theCS.SetTexture(kl, "zResult", zTex);
            // Set the number of thread groups and run the CS.
            // The kernel uses numthreads(4,8,2), so dispatch
            // width/4 by height/8 by 1 groups to cover the whole texture
            theCS.Dispatch(kl, texWidth / 4, texHeight / 8, 1);
            // Show the three textures
            imgx.texture = xTex;
            imgy.texture = yTex;
            imgz.texture = zTex;
        }
    }
}

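The result looks like this:

Comparing the code above with the output for zTex shows that the order in which CS threads run is not guaranteed. The group is two threads deep in z (numthreads(4,8,2)), so the threads with id.z = 0 and id.z = 1 both write the same zResult[id.xy]. Whichever write lands last wins, and that can differ from run to run, which is why the black and gray blocks in zTex keep changing.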
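One way to see why zTex is unstable is to count how many threads target each pixel. The sketch below is a CPU-side illustration only; WriterCount and PerPixel are hypothetical names, and the loop simply enumerates every thread ID a dispatch would produce.

```csharp
using System.Collections.Generic;

// Hypothetical helper (not part of Unity or the original project).
public static class WriterCount
{
    // Counts, for every id.xy, how many threads of the dispatch share it.
    public static Dictionary<(int x, int y), int> PerPixel(
        (int x, int y, int z) groups, (int x, int y, int z) numThreads)
    {
        var writers = new Dictionary<(int x, int y), int>();
        // SV_DispatchThreadID spans 0 .. groups * numThreads - 1 on each axis.
        int sizeX = groups.x * numThreads.x;
        int sizeY = groups.y * numThreads.y;
        int sizeZ = groups.z * numThreads.z;
        for (int z = 0; z < sizeZ; z++)
        {
            for (int y = 0; y < sizeY; y++)
            {
                for (int x = 0; x < sizeX; x++)
                {
                    // The kernel writes to a 2D texture at id.xy, so id.z does not pick the pixel.
                    writers.TryGetValue((x, y), out int count);
                    writers[(x, y)] = count + 1;
                }
            }
        }
        return writers;
    }
}
```

With groups (2,1,1) and numthreads (4,8,2), every one of the 8*8 = 64 pixels ends up with two writers. Setting the z component of numthreads to 1 would leave one writer per pixel.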

Changing the arguments to Dispatch also has visible effects:

theCS.Dispatch(kl, texWidth / 8, texHeight / 8, 1);

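With half as many thread groups along the x axis of the texture, the result is:

The CS only processes "half" of the texture, because 512/8 = 64 groups of 4 threads cover just 256 columns. If we dispatch more groups than needed, nothing visibly changes: the extra threads get IDs that fall outside the texture, so they do no useful work and merely waste some hardware resources.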
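The group counts can be worked out with a small helper instead of by hand. This is a minimal sketch; DispatchMath, GroupCount and Covered are hypothetical names, not Unity API.

```csharp
using System;

// Hypothetical helper (not part of Unity).
public static class DispatchMath
{
    // Groups needed along one axis so that groups * threadsPerGroup >= size
    // (integer ceiling division).
    public static int GroupCount(int size, int threadsPerGroup)
    {
        if (size <= 0 || threadsPerGroup <= 0) throw new ArgumentOutOfRangeException();
        return (size + threadsPerGroup - 1) / threadsPerGroup;
    }

    // Texels actually covered along one axis, clamped to the texture size.
    public static int Covered(int size, int groups, int threadsPerGroup)
    {
        return Math.Min(size, groups * threadsPerGroup);
    }
}
```

For the demo, GroupCount(512, 4) is 128, while Covered(512, 512 / 8, 4) is 256, which is the "half" texture seen above. When the size is not a multiple of the group size, for example GroupCount(500, 8) = 63, the last group overshoots the texture, so the kernel would normally guard with something like if (id.x >= Width || id.y >= Height) return; (assuming Width and Height are passed in as uniforms).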

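A CS can also work on plain data such as arrays. That is only natural, since a CS is a shader made for computation; if everything had to be packed into images first, just preparing the input would be costly. Here is a CS processing ordinary data: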

#pragma kernel CSMain

RWStructuredBuffer<float2> Float2s;
RWTexture2D<float4> Result;
int Width;
int Height;

float getDistance(float2 p,float2 c)
{
    float2 d = p-c;
    return sqrt(d.x*d.x+d.y*d.y);
}

float getMaxDistance(float2 c)
{
    return sqrt(c.x*c.x+c.y*c.y);
}

float4 getTexRGBA(float2 p)
{
    float2 center = float2(Width/2,Height/2);
    float dist = getDistance(p,center);
    float mdist = getMaxDistance(center);
    float4 col = float4(dist/mdist,dist/mdist,dist/mdist,1);
    return col;
}

[numthreads(16,32,2)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
    int index = id.y*Width+id.x;
    float2 p = Float2s[index];
    Result[id.xy] = getTexRGBA(p);
}
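Because the kernel reads Float2s[id.y*Width+id.x], the CPU side has to fill the buffer in the same row-major order, and the buffer stride has to match the size of a float2. The sketch below spells out both numbers; BufferLayout and its members are hypothetical names for illustration, not Unity API.

```csharp
using System;

// Hypothetical helper (not part of Unity).
public static class BufferLayout
{
    // A float2 is two 32-bit floats, so one element is 8 bytes.
    public const int Float2Stride = sizeof(float) * 2;

    // Row-major index, matching id.y * Width + id.x in the kernel.
    public static int RowMajorIndex(int x, int y, int width)
    {
        return y * width + x;
    }

    // Total size of a width * height buffer of float2 values, in bytes.
    public static long SizeInBytes(int width, int height)
    {
        return (long)width * height * Float2Stride;
    }
}
```

For a 4*2 grid, the element at (x=1, y=0) lives at RowMajorIndex(1, 0, 4) = 1, whereas indexing by x * height + y would put it at 2, so the two layouts only happen to give the same picture when the computation is symmetric in x and y. An 8192*8192 buffer of float2 values takes 536,870,912 bytes (512 MiB), which is worth keeping in mind when uploading it.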

The C# caller:

using UnityEngine;
using UnityEngine.UI;

public class BufferCSCall : MonoBehaviour
{
    public int texWidth = 8192;
    public int texHeight = 8192;
    public ComputeShader theCS;
    public RawImage img;

    private ComputeBuffer csBuffer;
    private Vector2[] csFloats;
    private RenderTexture csTex;

    void Start()
    {
        // Suppose we have this 2D grid of Vector2 values, stored row by row
        // so that it matches the kernel's index = id.y * Width + id.x
        int bufferlen = texWidth * texHeight;
        csFloats = new Vector2[bufferlen];
        for (int y = 0; y < texHeight; y++)
        {
            for (int x = 0; x < texWidth; x++)
            {
                int index = y * texWidth + x;
                csFloats[index] = new Vector2(x, y);
            }
        }
#if UNITY_EDITOR
        Debug.LogFormat("cs start time = {0}", Time.realtimeSinceStartup);
#endif
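        // Initialize the output texture
        csTex = new RenderTexture(texWidth, texHeight, 0, RenderTextureFormat.ARGB32);
        csTex.enableRandomWrite = true;
        csTex.Create();
        // Draw the image:
        // pass the data to the GPU through the Set* functions
        int kl = theCS.FindKernel("CSMain");
        // The stride is the size of one element in bytes: a float2 is 2 * 4 = 8
        csBuffer = new ComputeBuffer(bufferlen, sizeof(float) * 2);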
        csBuffer.SetData(csFloats);
        theCS.SetBuffer(kl, "Float2s", csBuffer);
        theCS.SetTexture(kl, "Result", csTex);
        theCS.SetInt("Width", texWidth);
        theCS.SetInt("Height", texHeight);
        theCS.Dispatch(kl, texWidth / 16, texHeight / 32, 1);
        img.texture = csTex;
#if UNITY_EDITOR
        Debug.LogFormat("cs stop time = {0}", Time.realtimeSinceStartup);
#endif
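    }

    void OnDestroy()
    {
        // ComputeBuffers are not garbage collected; release them explicitly
        if (csBuffer != null) csBuffer.Release();
    }
}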

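The result looks like this:

As the log shows, it took about 0.4s to generate an 8K image whose pixels are shaded by their distance from the center. Now let's try the same thing on the CPU: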

using UnityEngine;
using UnityEngine.UI;

public class K8TexGenerator : MonoBehaviour
{
    public RawImage img;

    public int texWidth = 8192;
    public int texHeight = 8192;

    void Start()
    {
#if UNITY_EDITOR
        Debug.LogFormat("cpu start time = {0}", Time.realtimeSinceStartup);
#endif
        Texture2D tex = new Texture2D(texWidth, texHeight, TextureFormat.ARGB32, false);
        for (int x = 0; x < texWidth; x++)
        {
            for (int y = 0; y < texHeight; y++)
            {
                Vector2 p = new Vector2(x, y);
                tex.SetPixel(x, y, getTexRGBA(p));
            }
        }
        tex.Apply();
        img.texture = tex;
#if UNITY_EDITOR
        Debug.LogFormat("cpu stop time = {0}", Time.realtimeSinceStartup);
#endif
    }

    float GetDistance(Vector2 p, Vector2 c)
    {
        Vector2 d = p - c;
        return Mathf.Sqrt(d.x * d.x + d.y * d.y);
    }

    float GetMaxDistance(Vector2 c)
    {
        return Mathf.Sqrt(c.x * c.x + c.y * c.y);
    }

    Color getTexRGBA(Vector2 p)
    {
        Vector2 center = new Vector2(texWidth / 2, texHeight / 2);
        float dist = GetDistance(p, center);
        float mdist = GetMaxDistance(center);
        Color col = new Color(dist / mdist, dist / mdist, dist / mdist, 1);
        return col;
    }
}

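The result looks like this:

It took a full 22.5s to generate an image with the same algorithm, a huge gap. The comparison is rough, since Dispatch only queues work for the GPU and returns before it finishes, and the CPU version pays for one SetPixel call per pixel. Even so, a CS is clearly very powerful for data processing.

That's it for this look at CS. When I next run into a good use for one, I'll come back with a few more examples.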
