Suppose there is a C function:
void add_arrays(const float* inA,
const float* inB,
float* result,
int length)
{
for (int index = 0; index < length ; index++) {
result[index] = inA[index] + inB[index];
}
}
To do the same computation with Metal, first create a `.metal` file, then implement the function in it:
#include <metal_stdlib>
using namespace metal;
// The kernel keyword declares this as a compute function
kernel void add_arrays(device const float* inA,
device const float* inB,
device float* result,
uint index [[thread_position_in_grid]])
{
result[index] = inA[index] + inB[index];
}
Comparing the C and MSL versions of the function:

- MSL (Metal Shading Language) adds the `kernel` keyword, which declares the function as a public GPU function. Public functions are the only functions your application can see; they also cannot be called by other shader functions. A `kernel` function is a compute function (also called a compute kernel) that performs a parallel computation using a grid of threads. (See "Using a Render Pipeline to Render Primitives" for the other function keywords that declare public graphics functions.)
- The MSL function declares its three pointer parameters with the `device` keyword, indicating that the pointers live in the device address space. MSL defines several disjoint address spaces for memory, and every pointer declared in MSL must specify its address space. The device address space declares persistent memory that the GPU can read and write.
- The MSL function removes the for loop, because the function will now be called by multiple threads in the compute grid, one invocation per element.
- To replace the loop index, the function accepts a new `index` parameter annotated with another MSL keyword, `thread_position_in_grid`, written using C++ attribute syntax. This keyword declares that Metal should calculate a unique index for each thread and pass that index into this parameter. `[[ xxx ]]` is MSL's attribute syntax.
Sample code:
if let device = MTLCreateSystemDefaultDevice() {
let adder = MetalAdder(device)
adder.prepareData()
adder.sendComputeCommand()
}
An `MTLDevice` object is an abstraction of the GPU; we use it to communicate with the GPU. Call `MTLCreateSystemDefaultDevice()` to get the default `MTLDevice` object.
import Foundation
import Metal
let count: Int = 10
class MetalAdder {
private var device: MTLDevice
private var mAddFunctionPSO: MTLComputePipelineState?
private var commandQueue: MTLCommandQueue?
private var inA: [Float] = Array(repeating: 0, count: count)
private var inB: [Float] = Array(repeating: 0, count: count)
private var result: [Float] = Array(repeating: 0, count: count)
private var inABuffer: MTLBuffer?
private var inBBuffer: MTLBuffer?
private var resultBuffer: MTLBuffer?
init(_ devi: MTLDevice) {
device = devi
// Create the default library from the MTLDevice; on success this returns an MTLLibrary object.
// The functions/shaders defined in the project's .metal files live in this library.
guard let defaultLibrary = device.makeDefaultLibrary() else {
print("Failed to create the default library")
return
}
// Look up the GPU function "add_arrays" in the default library; on success this returns an MTLFunction object
guard let addFunc = defaultLibrary.makeFunction(name: "add_arrays") else {
print("Could not find the add_arrays function object")
return
}
do {
// Create a compute pipeline state object (MTLComputePipelineState) from the MTLFunction
mAddFunctionPSO = try device.makeComputePipelineState(function: addFunc)
} catch {
print("Failed to create the compute pipeline state object")
return
}
// Create a command queue; on success this returns an MTLCommandQueue object
commandQueue = device.makeCommandQueue()
guard commandQueue != nil else {
print("Failed to create the command queue")
return
}
}
func prepareData() {
for i in 0..<count {
inA[i] = Float(i)
inB[i] = Float(i)
}
// Create each buffer by copying the input data into a new region of memory.
// Shared memory (storageModeShared) is accessible to both the CPU and the GPU.
inABuffer = device.makeBuffer(bytes: inA, length: count*MemoryLayout<Float>.size, options: [.storageModeShared])
guard inABuffer != nil else {
print("Failed to create inABuffer")
return
}
inBBuffer = device.makeBuffer(bytes: inB, length: count*MemoryLayout<Float>.size, options: [.storageModeShared])
guard inBBuffer != nil else {
print("Failed to create inBBuffer")
return
}
resultBuffer = device.makeBuffer(bytes: result, length: count*MemoryLayout<Float>.size, options: [.storageModeShared])
guard resultBuffer != nil else {
print("Failed to create resultBuffer")
return
}
}
func sendComputeCommand() {
// Create a command buffer; on success this returns an MTLCommandBuffer
let commandBuffer = commandQueue?.makeCommandBuffer()
assert(commandBuffer != nil, "Failed to create commandBuffer")
// Create a compute command encoder; on success this returns an MTLComputeCommandEncoder
let computeEncoder = commandBuffer?.makeComputeCommandEncoder()
assert(computeEncoder != nil, "Failed to create computeEncoder")
assert(mAddFunctionPSO != nil, "mAddFunctionPSO is nil")
// Bind the pipeline state
computeEncoder?.setComputePipelineState(mAddFunctionPSO!)
// Bind the data buffers
computeEncoder?.setBuffer(inABuffer, offset: 0, index: 0)
computeEncoder?.setBuffer(inBBuffer, offset: 0, index: 1)
computeEncoder?.setBuffer(resultBuffer, offset: 0, index: 2)
let gridSize = MTLSizeMake(count, 1, 1)
var threadgroupSizeWidth = mAddFunctionPSO?.maxTotalThreadsPerThreadgroup ?? count
if threadgroupSizeWidth > count {
threadgroupSizeWidth = count
}
let threadgroupSize = MTLSizeMake(threadgroupSizeWidth, 1, 1)
// Decide how many threads to create and how to organize them.
// Encode the compute command using a grid with uniform threadgroup boundaries:
// computeEncoder?.dispatchThreadgroups(gridSize, threadsPerThreadgroup: threadgroupSize)
// Encode the compute command using a nonuniform threadgroup grid. This makes better use of GPU resources, but requires an A11 or later chip:
computeEncoder?.dispatchThreads(gridSize, threadsPerThreadgroup: threadgroupSize)
// When there are no more commands to add to the compute pass, end encoding to close the pass
computeEncoder?.endEncoding()
// The completion handler must be registered before the command buffer is committed
commandBuffer?.addCompletedHandler({ [self] buffer in
print("Execution finished")
// contents() returns an UnsafeMutableRawPointer
let pA = inABuffer?.contents()
let pB = inBBuffer?.contents()
let pR = resultBuffer?.contents()
for i in 0..<count {
let a = pA!.load(fromByteOffset: i*MemoryLayout<Float>.size, as: Float.self)
let b = pB!.load(fromByteOffset: i*MemoryLayout<Float>.size, as: Float.self)
let r = pR!.load(fromByteOffset: i*MemoryLayout<Float>.size, as: Float.self)
print("\(a) + \(b) = \(r), result[\(i)] = \(result[i])")
}
})
// After commit, Metal asynchronously prepares the commands and schedules the command buffer for execution on the GPU. When the GPU has executed every command in the buffer, Metal marks the command buffer as completed.
commandBuffer?.commit()
}
}
Execution finished
0.0 + 0.0 = 0.0, result[0] = 0.0
1.0 + 1.0 = 2.0, result[1] = 0.0
2.0 + 2.0 = 4.0, result[2] = 0.0
3.0 + 3.0 = 6.0, result[3] = 0.0
4.0 + 4.0 = 8.0, result[4] = 0.0
5.0 + 5.0 = 10.0, result[5] = 0.0
6.0 + 6.0 = 12.0, result[6] = 0.0
7.0 + 7.0 = 14.0, result[7] = 0.0
8.0 + 8.0 = 16.0, result[8] = 0.0
9.0 + 9.0 = 18.0, result[9] = 0.0
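Notice that `result[i]` still prints 0.0 even though the GPU computed the sums correctly: `makeBuffer(bytes:length:options:)` copies the array's bytes into new memory, so the GPU writes into the copy, never into the original Swift array. The results must be read back through the buffer's `contents()` pointer. The same copy semantics can be demonstrated without Metal; this is a minimal sketch using a plain allocation to stand in for the `MTLBuffer`:

```swift
import Foundation

var array: [Float] = [0, 0, 0]

// Copy the array's bytes into a separately allocated region
// (this is what makeBuffer(bytes:length:options:) does internally).
let byteCount = array.count * MemoryLayout<Float>.stride
let buffer = UnsafeMutableRawPointer.allocate(byteCount: byteCount,
                                              alignment: MemoryLayout<Float>.alignment)
buffer.copyMemory(from: array, byteCount: byteCount)

// "The GPU" writes into the copy...
buffer.storeBytes(of: Float(42), toByteOffset: 0, as: Float.self)

// ...but the original array is untouched; read results from the buffer instead.
let fromBuffer = buffer.load(fromByteOffset: 0, as: Float.self)
print(array[0])    // 0.0 — the array was not modified
print(fromBuffer)  // 42.0 — the write landed in the copy
```

To get the GPU's results into a Swift array, you would copy the bytes back out of `resultBuffer.contents()` after the command buffer completes.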
Summary

The process of using the GPU for data computation:

- Get `MTLDevice`, the abstraction of the GPU.
- From the `MTLDevice`, obtain the `MTLLibrary` that stores the functions, then look up the `MTLFunction` in that library.
- Use the `MTLDevice` to create an `MTLCommandQueue`, which is responsible for sending tasks to the GPU.
- Use the `MTLCommandQueue` to create an `MTLCommandBuffer` that stores the work we want to execute.
- Create an `MTLComputeCommandEncoder` for the `MTLCommandBuffer`; the encoder writes commands into the command buffer.
- Through the `MTLComputeCommandEncoder`, bind the buffers holding the function's arguments and the buffer that receives the result.
- Through the `MTLComputeCommandEncoder`, encode the thread-grid compute command.
- When encoding is done, end the encoder with `computeEncoder?.endEncoding()`, register the completion callback with `commandBuffer?.addCompletedHandler`, and commit the command buffer with `commandBuffer?.commit()` to execute its commands.
Notes

- An `MTLFunction` object is a proxy for the MSL function, but it is not GPU-executable code. It only becomes executable through a compute pipeline, which in Metal is represented by a pipeline state object (`MTLComputePipelineState`).
- A compute pipeline runs a single compute function, optionally manipulating the input data before the function runs and the output data afterwards. When you create an `MTLComputePipelineState`, the `MTLDevice` compiles the function into executable code for that specific GPU. Note that this compilation takes time, so for performance reasons consider whether to create the `MTLComputePipelineState` asynchronously.
- `computeEncoder?.dispatchThreads(gridSize, threadsPerThreadgroup: threadgroupSize)` requires an A11 or later chip; on unsupported devices, use `computeEncoder?.dispatchThreadgroups(_:threadsPerThreadgroup:)` instead (note that its first argument is the number of threadgroups per grid, not the number of threads).
- Metal abstracts memory management with resource objects (`MTLResource`). A resource is a memory allocation that the GPU can access while executing commands. Different resource types serve different purposes. The most common types are the buffer (`MTLBuffer`), a linear allocation of memory, and the texture (`MTLTexture`), which holds structured image data. Use the `MTLDevice` to create resources for the GPU.
Calculating the number of threads per threadgroup

Each thread is responsible for one pixel.

On the A11 chip, iOS 11, macOS 10.13, and later, when preparing to execute compute kernel code, you specify the size of the grid and the number of threads per threadgroup. Metal then computes the number of threadgroups and, if the grid size is not a multiple of the threadgroup size, provides nonuniform threadgroups, ensuring that no threads are underutilized. Use the `func dispatchThreads(_ threadsPerGrid: MTLSize, threadsPerThreadgroup: MTLSize)` method.

On earlier devices and OS versions, you instead specify the size and number of the threadgroups. Because the grid is composed of uniform threadgroups, it may not match the size of the data; in that case you need to add defensive code to the compute kernel to ensure it does not execute outside the data's bounds. Use the `func dispatchThreadgroups(_ threadgroupsPerGrid: MTLSize, threadsPerThreadgroup: MTLSize)` method.
You can compute the number of threads per threadgroup from two `MTLComputePipelineState` properties. One is `maxTotalThreadsPerThreadgroup`, the maximum number of threads a single threadgroup may contain. The other is `threadExecutionWidth`, the number of threads scheduled to execute in parallel on the GPU.

`maxTotalThreadsPerThreadgroup` depends on the device, the compute kernel's register usage, and its threadgroup memory usage. Once a compute pipeline state has been created, its `maxTotalThreadsPerThreadgroup` value does not change, but two pipeline states on the same device may return different values. The number of threads per threadgroup must not exceed `maxTotalThreadsPerThreadgroup`. On a device where `maxTotalThreadsPerThreadgroup` is 512 and `threadExecutionWidth` is 32, a suitable threadgroup size is 32 (the thread execution width) × 16 (total threads per threadgroup divided by the thread execution width).
The following example defines a threadgroup size based on the thread execution width and the maximum number of threads per threadgroup:
let w = pipelineState.threadExecutionWidth
let h = pipelineState.maxTotalThreadsPerThreadgroup / w
let threadsPerThreadgroup = MTLSizeMake(w, h, 1)
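The same arithmetic can be checked without a GPU by plugging in the example values from the text (512 and 32 are illustrative numbers here, not values queried from a real device; on a device they come from the `MTLComputePipelineState` properties, as in the snippet above):

```swift
import Foundation

// Hypothetical values standing in for the pipeline state properties.
let maxTotalThreadsPerThreadgroup = 512  // pipelineState.maxTotalThreadsPerThreadgroup
let threadExecutionWidth = 32            // pipelineState.threadExecutionWidth

// Same computation as the MTLSizeMake example above, using plain integers.
let w = threadExecutionWidth
let h = maxTotalThreadsPerThreadgroup / threadExecutionWidth
print("threadsPerThreadgroup = (\(w), \(h), 1)")  // (32, 16, 1)
```

This reproduces the 32 × 16 threadgroup described above: the width matches the execution width so full SIMD groups are scheduled, and the height uses up the remaining thread budget.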
Nonuniform threadgroups vs. uniform threadgroups

Using nonuniform threadgroups:
func dispatchThreads(_ threadsPerGrid: MTLSize, threadsPerThreadgroup: MTLSize)
Using uniform threadgroups:
func dispatchThreadgroups(_ threadgroupsPerGrid: MTLSize, threadsPerThreadgroup: MTLSize)
`threadgroupsPerGrid`: the number of threadgroups in each dimension of the grid.
`threadsPerThreadgroup`: the number of threads in one threadgroup, in each dimension.
// Suppose the texture data is 11x7
// and each threadgroup is 4x4
let threadsPerThreadgroup = MTLSizeMake(4, 4, 1)
// The number of threadgroups per dimension is then ceil(11/4) x ceil(7/4) = 3x2
let threadgroupsPerGrid = MTLSize(width: 3, height: 2, depth: 1)
computeCommandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
With nonuniform threadgroups, Metal can generate smaller threadgroups along the edges of the grid when executing this computation, which uses the GPU more efficiently. With uniform threadgroups, when there is less data than the grid covers, threads in the edge threadgroups fall outside the data's bounds. In that case we must handle the boundary ourselves: compare the thread's position with the texture bounds and return immediately when the position is outside the texture. With nonuniform threadgroups, Metal handles this automatically and no such check is needed.
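A quick way to see why the boundary check is needed with uniform threadgroups is to count the threads for the 11×7 example above; this is plain arithmetic, so it runs without a GPU:

```swift
import Foundation

// 11x7 texture, 4x4 threadgroups, uniform dispatch.
let (dataW, dataH) = (11, 7)
let (tgW, tgH) = (4, 4)

// Number of threadgroups per dimension, rounded up so the whole texture is covered.
let groupsX = (dataW + tgW - 1) / tgW  // ceil(11/4) = 3
let groupsY = (dataH + tgH - 1) / tgH  // ceil(7/4)  = 2

// A uniform grid launches every thread of every threadgroup.
let launchedThreads = groupsX * tgW * groupsY * tgH  // 12 * 8 = 96
let pixels = dataW * dataH                           // 77
print("threads: \(launchedThreads), pixels: \(pixels), out of bounds: \(launchedThreads - pixels)")
// threads: 96, pixels: 77, out of bounds: 19
```

Those 19 out-of-bounds threads are exactly the ones the shader's position check below must filter out; with `dispatchThreads` and nonuniform threadgroups, Metal would simply not launch them.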
The boundary check:
As an example, here is a shader that writes opaque white to every pixel of outputTexture.
kernel void simpleKernelFunction(texture2d<float, access::write>
outputTexture [[texture(0)]],
uint2 position [[thread_position_in_grid]])
{
if (position.x >= outputTexture.get_width() || position.y >= outputTexture.get_height()) {
return;
}
outputTexture.write(float4(1.0), position);
}