Use GPU computing in the browser

In this article, I use the experimental WebGPU API and share my journey with web developers interested in using the GPU for data-parallel computing.

Background

As you may already know, the graphics processing unit (GPU) is an electronic subsystem within a computer that was originally dedicated to processing graphics. However, over the past ten years it has evolved towards a more flexible architecture that lets developers implement many types of algorithms, not just render 3D graphics, while taking advantage of the GPU's unique architecture. These capabilities are referred to as GPU compute, and using a GPU as a coprocessor for general-purpose scientific computing is called general-purpose GPU (GPGPU) programming.

GPU compute has contributed significantly to the recent machine learning boom, as convolutional neural networks and other models can take advantage of this architecture to run more efficiently on GPUs. Given the current lack of GPU compute capabilities on the web platform, the W3C's "GPU for the Web" community group is designing an API to expose the modern GPU APIs available on most current devices. This API is called WebGPU.

WebGPU is a low-level API, like WebGL. As you will see, it is very powerful and quite verbose, but that's fine: what we are after is performance.

In this article, I will focus on the GPU compute side of WebGPU. To be honest, I am only scratching the surface, so that you can start playing on your own. I will dig deeper and cover WebGPU rendering (canvas, textures, and so on) in an upcoming article.

Dogfood: WebGPU is available behind an experimental flag in Chrome 78 for macOS. You can enable it at chrome://flags/#enable-unsafe-webgpu. The API is constantly changing and currently unsafe: because GPU sandboxing has not yet been implemented for the WebGPU API, GPU data belonging to other processes can be read! Do not browse the web with it enabled.

Access the GPU

In WebGPU, accessing the GPU is easy. Calling navigator.gpu.requestAdapter() returns a JavaScript promise that resolves asynchronously with a GPU adapter. Think of this adapter as the graphics card. It can either be integrated (on the same chip as the CPU) or discrete (usually a PCIe card that is more performant but uses more power).

Once you have the GPU adapter, call adapter.requestDevice() to get a promise that resolves with a GPU device you will use to do some GPU computation.

const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();

Both of these functions take options that let you be specific about the kind of adapter (power preference) and device (extensions, limits) you want. For simplicity, we will use the default options in this article.
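If you do want to be explicit, here is a minimal sketch. Note that powerPreference is only a hint the browser may ignore, and requestAdapter() resolves to null when no suitable GPU is found:

// Hint that we would rather use a discrete GPU; "low-power" favors an integrated one.
const adapter = await navigator.gpu.requestAdapter({
  powerPreference: "high-performance"
});
if (!adapter) {
  throw new Error("No WebGPU adapter available in this browser.");
}
const device = await adapter.requestDevice();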

Write buffer memory

Let's see how to write data to the GPU's memory using JavaScript. Due to the sandbox model used in modern web browsers, this process is not simple.

The example below shows how to write four bytes to buffer memory accessible from the GPU. It calls device.createBufferMapped(), which takes the size of the buffer and its usage. Even though the usage flag GPUBufferUsage.MAP_WRITE is not required for this particular call, let's be explicit that we want to write to this buffer. The call results in a GPU buffer object and its associated raw binary data buffer.

Writing the bytes should feel familiar if you have played with ArrayBuffer before: use a TypedArray and copy the values into it.

// Get a GPU buffer in a mapped state and an arrayBuffer for writing.
const [gpuBuffer, arrayBuffer] = device.createBufferMapped({
  size: 4,
  usage: GPUBufferUsage.MAP_WRITE
});

// Write bytes to buffer.
new Uint8Array(arrayBuffer).set([0, 1, 2, 3]);

At this point, the GPU buffer is mapped, meaning it is owned by the CPU and is accessible for reading and writing from JavaScript. For the GPU to be able to access it, it has to be unmapped, which is as simple as calling gpuBuffer.unmap().

The concept of mapped/unmapped is needed to prevent race conditions where the GPU and the CPU access memory at the same time.
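Concretely, once the bytes are written, handing the buffer back to the GPU is a single call:

// Unmap the buffer so the GPU can use it; JavaScript can no longer access it.
gpuBuffer.unmap();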

Read buffer memory

Now let's see how to copy one GPU buffer to another GPU buffer and read it back.

Since we are writing to the first GPU buffer and want to have it copied to a second GPU buffer, a new usage flag, GPUBufferUsage.COPY_SRC, is required. The second GPU buffer is created in an unmapped state this time with device.createBuffer(). Its usage flag is GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ, as it will be used as the destination of the first GPU buffer and read from JavaScript once the GPU copy commands have been executed.

// Get a GPU buffer in a mapped state and an arrayBuffer for writing.
const [gpuWriteBuffer, arrayBuffer] = device.createBufferMapped({
  size: 4,
  usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC
});

// Write bytes to buffer.
new Uint8Array(arrayBuffer).set([0, 1, 2, 3]);

// Unmap buffer so that it can be used later for copy.
gpuWriteBuffer.unmap();

// Get a GPU buffer for reading in an unmapped state.
const gpuReadBuffer = device.createBuffer({
  size: 4,
  usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ
});

Because the GPU is an independent coprocessor, all GPU commands are executed asynchronously. This is why a list of GPU commands is built up and sent in batches when needed. In WebGPU, the GPU command encoder returned by device.createCommandEncoder() is the JavaScript object that builds a batch of "buffered" commands that will be sent to the GPU at some point. The methods on GPUBuffer, on the other hand, are "unbuffered", meaning they execute atomically at the time they are called.

Once you have the GPU command encoder, call copyEncoder.copyBufferToBuffer() as shown below to add this command to the command queue for later execution. Finally, finish the encoding by calling copyEncoder.finish() and submit the result to the GPU device command queue. The queue is responsible for handling submissions made via device.defaultQueue.submit() with the GPU commands as arguments. This will atomically execute all the commands stored in the array, in order.

// Encode commands for copying buffer to buffer.
const copyEncoder = device.createCommandEncoder();
copyEncoder.copyBufferToBuffer(
  gpuWriteBuffer /* source buffer */,
  0 /* source offset */,
  gpuReadBuffer /* destination buffer */,
  0 /* destination offset */,
  4 /* size */
);

// Submit copy commands.
const copyCommands = copyEncoder.finish();
device.defaultQueue.submit([copyCommands]);

At this point, the GPU queue commands have been sent, but not necessarily executed. To read the second GPU buffer, call gpuReadBuffer.mapReadAsync(). It returns a promise that will resolve with an ArrayBuffer containing the same values as the first GPU buffer once all queued GPU commands have been executed.

// Read buffer.
const copyArrayBuffer = await gpuReadBuffer.mapReadAsync();
console.log(new Uint8Array(copyArrayBuffer));

You can try this example.

In short, here is what you need to remember regarding buffer memory operations:

GPU buffers have to be unmapped to be used in a device queue submission.
When mapped, GPU buffers can be read and written from JavaScript.
GPU buffers are mapped when mapReadAsync(), mapWriteAsync(), or createBufferMapped() is called.

Shader programming

A program that runs on the GPU and only performs computations (it doesn't draw triangles) is called a compute shader. Compute shaders are executed in parallel by hundreds of GPU cores (which are smaller than CPU cores) that operate together to crunch data. Their input and output are buffers in WebGPU.

To illustrate the use of compute shaders in WebGPU, we'll play with matrix multiplication, a common algorithm in machine learning.


In short, this is what we do:

  1. Create three GPU buffers (two for the matrices to multiply and one for the result matrix)
  2. Describe the input and output of the compute shader
  3. Compile the compute shader code
  4. Set up a compute pipeline
  5. Submit the encoded commands to the GPU in a batch
  6. Read the result matrix GPU buffer

GPU buffer creation
For simplicity, matrices will be represented as a list of floating point numbers. The first element is the number of rows, the second the number of columns, and the rest are the actual numbers of the matrix.

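To make the convention concrete, here is a small hypothetical helper (not part of WebGPU, just an illustration of the layout described above) that builds this representation from a plain 2D array:

// Hypothetical helper: flatten a 2D array into the
// [rows, columns, ...values] layout used throughout this article.
function encodeMatrix(rows) {
  return new Float32Array([rows.length, rows[0].length, ...rows.flat()]);
}

// encodeMatrix([[1, 2, 3, 4], [5, 6, 7, 8]])
// yields Float32Array [2, 4, 1, 2, 3, 4, 5, 6, 7, 8]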

These three GPU buffers are storage buffers, as we need to store and retrieve data in the compute shader. This explains why the GPU buffer usage flags include GPUBufferUsage.STORAGE for all of them. The result matrix usage flag also has GPUBufferUsage.COPY_SRC because, once all GPU queue commands have been executed, it will be copied to another buffer for reading.

const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();


// First Matrix

const firstMatrix = new Float32Array([
  2 /* rows */, 4 /* columns */,
  1, 2, 3, 4,
  5, 6, 7, 8
]);

const [gpuBufferFirstMatrix, arrayBufferFirstMatrix] = device.createBufferMapped({
  size: firstMatrix.byteLength,
  usage: GPUBufferUsage.STORAGE,
});
new Float32Array(arrayBufferFirstMatrix).set(firstMatrix);
gpuBufferFirstMatrix.unmap();


// Second Matrix

const secondMatrix = new Float32Array([
  4 /* rows */, 2 /* columns */,
  1, 2,
  3, 4,
  5, 6,
  7, 8
]);

const [gpuBufferSecondMatrix, arrayBufferSecondMatrix] = device.createBufferMapped({
  size: secondMatrix.byteLength,
  usage: GPUBufferUsage.STORAGE,
});
new Float32Array(arrayBufferSecondMatrix).set(secondMatrix);
gpuBufferSecondMatrix.unmap();


// Result Matrix

const resultMatrixBufferSize = Float32Array.BYTES_PER_ELEMENT * (2 + firstMatrix[0] * secondMatrix[1]);
const resultMatrixBuffer = device.createBuffer({
  size: resultMatrixBufferSize,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC
});

Bind group layout and bind group

The concepts of bind group layout and bind group are specific to WebGPU. A bind group layout defines the input/output interface expected by a shader, while a bind group represents the actual input/output data for a shader.

In the example below, the bind group layout expects two read-only storage buffers at numbered entry bindings 0 and 1, and a storage buffer at binding 2 for the compute shader. The bind group, on the other hand, defined for this bind group layout, associates GPU buffers with the entries: gpuBufferFirstMatrix with binding 0, gpuBufferSecondMatrix with binding 1, and resultMatrixBuffer with binding 2.

const bindGroupLayout = device.createBindGroupLayout({
  entries: [
    {
      binding: 0,
      visibility: GPUShaderStage.COMPUTE,
      type: "readonly-storage-buffer"
    },
    {
      binding: 1,
      visibility: GPUShaderStage.COMPUTE,
      type: "readonly-storage-buffer"
    },
    {
      binding: 2,
      visibility: GPUShaderStage.COMPUTE,
      type: "storage-buffer"
    }
  ]
});

const bindGroup = device.createBindGroup({
  layout: bindGroupLayout,
  entries: [
    {
      binding: 0,
      resource: {
        buffer: gpuBufferFirstMatrix
      }
    },
    {
      binding: 1,
      resource: {
        buffer: gpuBufferSecondMatrix
      }
    },
    {
      binding: 2,
      resource: {
        buffer: resultMatrixBuffer
      }
    }
  ]
});

Compute shader code

The compute shader code for multiplying matrices is written in GLSL, a high-level shading language used in WebGL with a syntax based on the C programming language. Without going into detail, below you will find the three storage buffers marked with the keyword buffer. The program will use firstMatrix and secondMatrix as inputs (read-only) and resultMatrix as its output.

Note that each storage buffer has a binding qualifier that corresponds to the same index defined in the bind group layout and the bind group declared above.

const computeShaderCode = `#version 450

  layout(std430, set = 0, binding = 0) readonly buffer FirstMatrix {
      vec2 size;
      float numbers[];
  } firstMatrix;

  layout(std430, set = 0, binding = 1) readonly buffer SecondMatrix {
      vec2 size;
      float numbers[];
  } secondMatrix;

  layout(std430, set = 0, binding = 2) buffer ResultMatrix {
      vec2 size;
      float numbers[];
  } resultMatrix;

  void main() {
    resultMatrix.size = vec2(firstMatrix.size.x, secondMatrix.size.y);

    ivec2 resultCell = ivec2(gl_GlobalInvocationID.x, gl_GlobalInvocationID.y);
    float result = 0.0;
    for (int i = 0; i < firstMatrix.size.y; i++) {
      int a = i + resultCell.x * int(firstMatrix.size.y);
      int b = resultCell.y + i * int(secondMatrix.size.y);
      result += firstMatrix.numbers[a] * secondMatrix.numbers[b];
    }

    int index = resultCell.y + resultCell.x * int(secondMatrix.size.y);
    resultMatrix.numbers[index] = result;
  }
`;

Pipeline setup

WebGPU in Chrome currently uses bytecode instead of raw GLSL code, which means we have to compile computeShaderCode before running the compute shader. Luckily for us, the @webgpu/glslang package allows us to compile computeShaderCode into a format that WebGPU in Chrome accepts. This bytecode format is based on a safe subset of SPIR-V.

Note that the "GPU for the Web" W3C community group has not yet settled on a shading language for WebGPU at the time of writing.

import glslangModule from 'https://unpkg.com/@webgpu/glslang/dist/web-devel/glslang.js';

A compute pipeline is the object that actually describes the compute operation we are going to perform. Create it by calling device.createComputePipeline(). It takes two arguments: the bind group layout we created earlier, and a compute stage defining the entry point of our compute shader (the main GLSL function) and the actual compute shader module compiled with glslang.compileGLSL().

const glslang = await glslangModule();

const computePipeline = device.createComputePipeline({
  layout: device.createPipelineLayout({
    bindGroupLayouts: [bindGroupLayout]
  }),
  computeStage: {
    module: device.createShaderModule({
      code: glslang.compileGLSL(computeShaderCode, "compute")
    }),
    entryPoint: "main"
  }
});

Command submission
After instantiating a bind group with our three GPU buffers and a compute pipeline with its bind group layout, it is time to use them.

Let's start a programmable compute pass encoder with commandEncoder.beginComputePass(). We will use it to encode the GPU commands that will perform the matrix multiplication. Set its pipeline with passEncoder.setPipeline(computePipeline) and its bind group at index 0 with passEncoder.setBindGroup(0, bindGroup). Index 0 corresponds to the set = 0 qualifier in the GLSL code.

Now, let's talk about how this compute shader is going to run on the GPU. Our goal is to execute this program in parallel, one invocation per cell of the result matrix. For a result matrix of size 2 by 4, for instance, we would call passEncoder.dispatch(2, 4) to encode the command of execution. The first argument "x" is the first dimension, the second one "y" is the second dimension, and the last one "z" is the third dimension, which defaults to 1 as we don't need it here. In the GPU compute world, encoding a command to execute a kernel function on a set of data is called dispatching.

In our code, "x" and "y" will be the number of rows of the first matrix and the number of columns of the second matrix respectively. With that, we can now dispatch the compute call:

passEncoder.dispatch(firstMatrix[0], secondMatrix[1])

Each compute shader invocation has access to a unique gl_GlobalInvocationID object, which is used to know which result matrix cell to compute.

const commandEncoder = device.createCommandEncoder();

const passEncoder = commandEncoder.beginComputePass();
passEncoder.setPipeline(computePipeline);
passEncoder.setBindGroup(0, bindGroup);
passEncoder.dispatch(firstMatrix[0] /* x */, secondMatrix[1] /* y */);
passEncoder.endPass();

To end the compute pass encoder, call passEncoder.endPass(). Then, create a GPU buffer to use as the destination of a copyBufferToBuffer of the result matrix buffer. Finally, finish encoding the commands with commandEncoder.finish() and submit them to the GPU device queue by calling device.defaultQueue.submit() with the GPU commands.

// Get a GPU buffer for reading in an unmapped state.
const gpuReadBuffer = device.createBuffer({
  size: resultMatrixBufferSize,
  usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ
});

// Encode commands for copying buffer to buffer.
commandEncoder.copyBufferToBuffer(
  resultMatrixBuffer /* source buffer */,
  0 /* source offset */,
  gpuReadBuffer /* destination buffer */,
  0 /* destination offset */,
  resultMatrixBufferSize /* size */
);

// Submit GPU commands.
const gpuCommands = commandEncoder.finish();
device.defaultQueue.submit([gpuCommands]);

Reading the result matrix
Reading the result matrix is as easy as calling gpuReadBuffer.mapReadAsync() and logging the ArrayBuffer returned by the resulting promise.

In our code, the result logged in the DevTools JavaScript console is "2, 2, 50, 60, 114, 140": a 2-by-2 result matrix [[50, 60], [114, 140]], preceded by its dimensions.

// Read buffer.
const arrayBuffer = await gpuReadBuffer.mapReadAsync();
console.log(new Float32Array(arrayBuffer));

Congratulations! You did it. You can take a look at an example.

One last trick

One way to make the code easier to read is to use the handy getBindGroupLayout method of the compute pipeline to infer the bind group layout from the shader module. This trick removes the need to create a custom bind group layout and to specify a pipeline layout in your compute pipeline, as you can see below.

A version of the previous sample using getBindGroupLayout is available.

 const computePipeline = device.createComputePipeline({
-  layout: device.createPipelineLayout({
-    bindGroupLayouts: [bindGroupLayout]
-  }),
   computeStage: {

-// Bind group layout and bind group
-const bindGroupLayout = device.createBindGroupLayout({
-  entries: [
-    {
-      binding: 0,
-      visibility: GPUShaderStage.COMPUTE,
-      type: "readonly-storage-buffer"
-    },
-    {
-      binding: 1,
-      visibility: GPUShaderStage.COMPUTE,
-      type: "readonly-storage-buffer"
-    },
-    {
-      binding: 2,
-      visibility: GPUShaderStage.COMPUTE,
-      type: "storage-buffer"
-    }
-  ]
-});
+// Bind group
 const bindGroup = device.createBindGroup({
-  layout: bindGroupLayout,
+  layout: computePipeline.getBindGroupLayout(0 /* index */),
   entries: [
Performance test
Figure 5. GPU and CPU benchmark test

So how does running matrix multiplication on a GPU compare to running it on a CPU? To find out, I wrote the program just described for the CPU. As the benchmark above shows, using the full power of the GPU seems like an obvious choice once the size of the matrices is greater than 256 by 256.
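The CPU program I benchmarked is not listed in this article, but a straightforward version over the same flat [rows, columns, ...values] representation might look like this sketch (an assumption, not the exact benchmarked code):

// A plain JavaScript matrix multiply over the flat representation,
// for comparison with the GPU version above.
function multiplyOnCpu(first, second) {
  const rowsA = first[0], colsA = first[1], colsB = second[1];
  const result = new Float32Array(2 + rowsA * colsB);
  result[0] = rowsA;
  result[1] = colsB;
  for (let i = 0; i < rowsA; i++) {
    for (let j = 0; j < colsB; j++) {
      let sum = 0;
      for (let k = 0; k < colsA; k++) {
        sum += first[2 + i * colsA + k] * second[2 + k * colsB + j];
      }
      result[2 + i * colsB + j] = sum;
    }
  }
  return result;
}

// multiplyOnCpu(firstMatrix, secondMatrix)
// yields Float32Array [2, 2, 50, 60, 114, 140]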

This article is only the beginning of my journey exploring WebGPU. Expect more articles soon featuring deeper dives into GPU compute and covering how rendering (canvas, textures, samplers) works in WebGPU.
