Getting Started with WebGPU

1. Introduction

WebGPU (Draft, 2023-07-17) was initiated by teams from Apple, Google, and Mozilla. It is currently in the draft stage and aims to become a W3C Recommendation.

WebGPU provides APIs for performing operations such as rendering and computation on graphics processing units (GPUs).

GPUs support rich rendering and parallel-computing applications. WebGPU exposes the hardware capabilities of GPUs to the Web through an API that maps efficiently onto native GPU APIs. WebGPU is unrelated to WebGL and does not explicitly target OpenGL ES (OpenGL for Embedded Systems).

WebGPU:

  • 1) The physical GPU hardware is represented as a GPUAdapter.
  • 2) A GPUDevice is the connection to a GPUAdapter.
  • 3) The GPUQueue of a GPUDevice is used to execute commands.
  • 4) A GPUDevice may have its own memory for high-speed access by its processing units.
  • 5) GPUBuffer and GPUTexture are physical resources backed by GPU memory.
  • 6) GPUCommandBuffer and GPURenderBundle are containers for user-recorded commands.
    The GPU executes the commands encoded in a GPUCommandBuffer and feeds the data through the pipeline.
    • 6.1) The pipeline is a mixture of fixed-function and programmable stages.
      • Programmable stages execute shaders, which are programs designed to run on GPU hardware. Shader code runs inside the compute units of the GPU.
    • 6.2) Most pipeline state is defined by GPURenderPipeline or GPUComputePipeline objects.
      State not included in the pipeline object is set during command encoding, for example via beginRenderPass() or setBlendConstant().
  • 7) GPUShaderModule contains shader code.
  • 8) GPUSampler and GPUBindGroup configure the way the GPU uses physical resources.
  • 9) In the future, multi-threading will be supported through Web Workers.
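
In script, the objects above are reached roughly as in the following sketch (a sketch only; it assumes a WebGPU-enabled browser and, for TypeScript, the @webgpu/types definitions; the function name initWebGPU is illustrative):

```ts
// Minimal initialization sketch: adapter -> device -> queue.
async function initWebGPU(): Promise<GPUDevice> {
  if (!navigator.gpu) {
    throw new Error("WebGPU is not supported in this browser.");
  }
  // 1) The adapter represents the physical GPU implementation.
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    throw new Error("No suitable GPUAdapter found.");
  }
  // 2) The device is the logical connection to the adapter; its queue
  //    (device.queue) is what executes submitted commands.
  return adapter.requestDevice();
}
```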

2. Coordinate system

Rendering operations use the following coordinate systems. (Note that WebGPU's coordinate systems match the DirectX coordinate systems in a graphics pipeline.)

  • 1) Normalized device coordinates (NDC): have 3 dimensions:

    • −1.0 ≤ x ≤ 1.0
    • −1.0 ≤ y ≤ 1.0
    • 0.0 ≤ z ≤ 1.0
    • The coordinates of the lower-left corner are (−1.0, −1.0, z).
  • 2) Clip space coordinates: have 4 dimensions (x, y, z, w).

    • 2.1) Clip space coordinates can be:
      • Used as the clip position of a vertex (that is, the position output of a vertex shader).
      • Used to define the clip volume.
    • 2.2) The relationship between clip space coordinates and normalized device coordinates is:
      • If a point p = (p.x, p.y, p.z, p.w) is in the clip volume, then its normalized device coordinates are (p.x ÷ p.w, p.y ÷ p.w, p.z ÷ p.w). (See the sketch after this list.)
  • 3) Framebuffer coordinates: used to address the pixels in the framebuffer:

    • 3.1) have 2 dimensions.
    • 3.2) Each pixel extends 1 unit in the x and y dimensions.
    • 3.3) The coordinates of the top-left corner are (0.0, 0.0).
    • 3.4) x increases to the right.
    • 3.5) y increases downward.
  • 4) Viewport coordinates: add a depth dimension z to the x and y dimensions of framebuffer coordinates.

    • Usually 0.0 ≤ z ≤ 1.0, but the range can be changed via the minDepth and maxDepth values of setViewport().
  • 5) Fragment coordinates: match viewport coordinates.

  • 6) UV coordinates: used for sampling textures, with 2 dimensions:

    • 0.0 ≤ u ≤ 1.0
    • 0.0 ≤ v ≤ 1.0
    • (0.0, 0.0) is the first texel in the texture memory address sequence.
    • (1.0, 1.0) is the last texel in the texture memory address sequence.
  • 7) Window coordinates (or present coordinates): match framebuffer coordinates; used when interacting with external displays and related interfaces.
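
To illustrate the clip-space/NDC relationship from 2.2, here is a small sketch of the perspective divide (clipToNDC is an illustrative helper, not part of the WebGPU API; the GPU performs this divide itself):

```ts
// Perspective divide: a clip-space position (x, y, z, w) inside the clip volume
// maps to the NDC position (x/w, y/w, z/w).
function clipToNDC(p: { x: number; y: number; z: number; w: number }) {
  return { x: p.x / p.w, y: p.y / p.w, z: p.z / p.w };
}

// Example: clip position (2, -2, 1, 2) -> NDC (1, -1, 0.5), i.e. the right edge,
// the bottom edge (y = -1 is the lower edge of NDC), and halfway along depth.
const ndc = clipToNDC({ x: 2, y: -2, z: 1, w: 2 });
console.log(ndc); // { x: 1, y: -1, z: 0.5 }
```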

3. WebGPU programming model

3.1 Timeline

The behavior of WebGPU is described in terms of "timelines". Every operation in the specification's algorithms happens on some timeline, which defines the order of operations and which state is available to them.
The timeline types of WebGPU are: (immutable values can be used on any timeline)

  • 1) Content timeline: associated with the execution of Web script. Includes calls to all methods of this API.
  • 2) Device timeline: associated with GPU device operations issued by the user agent. Includes:
    • Creation of adapters, devices, GPU resources, and state objects, which are typically synchronous from the user agent's point of view.
  • 3) Queue timeline: associated with the execution of operations on the GPU's compute units. Includes draw, copy, and compute jobs that actually run on the GPU.

For example, GPUDevice.createBuffer():

  • 1) The user fills in a GPUBufferDescriptor and calls createBuffer() to create a GPUBuffer. This happens on the Content timeline.
  • 2) The user agent creates the underlying buffer on the Device timeline.
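
A sketch of what that looks like in script (the descriptor values are illustrative, and `device` is assumed to have been obtained via requestDevice()):

```ts
// Content timeline: script fills a GPUBufferDescriptor and calls createBuffer().
// The GPUBuffer object is returned immediately; the user agent allocates the
// underlying GPU memory on the device timeline.
const buffer = device.createBuffer({
  size: 256,                                               // length in bytes
  usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.STORAGE, // how the buffer will be used
});
```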

3.2 Memory Model

Once a GPUDevice is obtained in the application initialization phase, the WebGPU platform can be described as the following levels:

  • 1) The user agent, which implements this specification.
  • 2) The operating system, with the native GPU API drivers for the device.
  • 3) The actual CPU and GPU hardware.

Each level has its own memory types, which the user agent needs to take into account when implementing this specification:

  • 1) Script-owned memory: for example, an ArrayBuffer created by script is usually not accessible to the GPU driver.
  • 2) The user agent may have separate processes responsible for running the content and for communicating with the GPU driver. In that case, cross-process shared memory is used to transfer data.
  • 3) Dedicated GPUs have their own high-bandwidth memory, while integrated GPUs typically share memory with the system.

To make GPU rendering and computation efficient, most physical resources are allocated in memory layouts that are efficient for the GPU. When the user needs to provide new data to the GPU, the following steps may occur. (This is the worst case; in practice, the implementation usually does not need to cross the process boundary, or it can expose driver-managed memory directly to the user's ArrayBuffer, thereby avoiding data copies.)

  • 1) The data may first cross the process boundary to reach the part of the user agent that communicates with the GPU driver.
  • 2) Then the data may need to be made visible to the driver, which sometimes requires a copy into staging memory allocated by the driver.
  • 3) Finally, the data may need to be transferred into GPU-dedicated memory, possibly converting its internal layout into one that is more efficient for the GPU.

All of the above data transfers are carried out by WebGPU's user agent.
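
In script, the two common ways of handing new data to the GPU look roughly like this (a sketch; sizes and usage flags are illustrative, and `device` is assumed to exist). device.queue.writeBuffer() lets the user agent choose the transfer path described above, while mappedAtCreation exposes writable memory to script as an ArrayBuffer:

```ts
const data = new Float32Array([0, 1, 2, 3]);

// Option 1: let the queue copy the data; the user agent handles any staging.
const uploadBuffer = device.createBuffer({
  size: data.byteLength,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});
device.queue.writeBuffer(uploadBuffer, 0, data);

// Option 2: create the buffer already mapped and write through an ArrayBuffer view.
const mappedBuffer = device.createBuffer({
  size: data.byteLength,
  usage: GPUBufferUsage.STORAGE,
  mappedAtCreation: true,
});
new Float32Array(mappedBuffer.getMappedRange()).set(data);
mappedBuffer.unmap(); // after unmapping, the contents are usable by the GPU
```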

4. Key internal objects

4.1 adapters

An adapter identifies an implementation of WebGPU on the system:

  • It is an instance of the compute/rendering functionality on the platform underlying the browser.
  • It is also an instance of the browser's implementation of WebGPU on top of that functionality.

An adapter does not uniquely identify the underlying implementation: calling requestAdapter() multiple times returns a different adapter object each time.

Each adapter object can only be used to create one device:

  • Once requestDevice() on it returns successfully, the adapter becomes invalid.
  • In addition, an adapter object may expire at any time.

This ensures that the application uses the current system state, as captured by the selected adapter, when creating a device, and it also improves robustness in a wider range of scenarios.

The adapter is exposed as GPUAdapter.
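
A sketch of how this plays out in practice: because an adapter can only vend one device and may expire, an application requests a fresh adapter whenever it needs a new device, for example after device loss (assumes a module/async context; the function name createDevice is illustrative):

```ts
async function createDevice(): Promise<GPUDevice> {
  // Each call returns a new adapter object reflecting the current system state.
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error("No WebGPU adapter available.");
  // Once requestDevice() resolves, this adapter is considered invalid.
  return adapter.requestDevice();
}

let device = await createDevice();
// If the device is lost, do not reuse the old adapter; request a new one.
device.lost.then(async (info) => {
  console.warn("GPU device lost:", info.message);
  device = await createDevice();
});
```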

4.2 Devices

A device is the logical instantiation of an adapter, through which internal objects are created. It can be shared across multiple agents (such as dedicated workers).

A device is the exclusive owner of all internal objects created from it:

  • When the device becomes invalid (is lost or destroyed), all objects created from it (directly, e.g. via createTexture(), or indirectly, e.g. via createView()) become implicitly unusable.

5. Key functions

  • 1) navigator.gpu: returns the GPU object, the entry point to WebGPU.

  • 2) gpu.requestAdapter(): Request an adapter from the user agent.

  • 3) adapter.requestDevice(): Request a device from an adapter.

  • 4) adapter.requestAdapterInfo(): get the GPUAdapterInfo of an adapter.

  • 5) device.destroy(): destroys the device, preventing further operations on it. Outstanding asynchronous operations will fail. The same device can be destroyed multiple times.

  • 6) device.createBuffer(): creates a GPUBuffer. A GPUBuffer represents a block of memory used for GPU operations. Its data is stored in a linear layout, meaning that each byte of the allocation can be addressed by its offset from the start of the GPUBuffer, subject to alignment restrictions that depend on the operation. Some GPUBuffers can be mapped, making the block of memory accessible through an ArrayBuffer.

  • 7) GPUBuffer.destroy(): destroys the GPUBuffer.

  • 8) GPUBuffer.mapAsync(mode, offset, size): mode is either read or write. Once mapped, the contents of the GPUBuffer can be accessed through an ArrayBuffer.

  • 9) GPUBuffer.getMappedRange(offset, size): Returns the content of the specified mapped range in GPUBuffer, which is an ArrayBuffer.

  • 10) GPUBuffer.unmap(): Unmap so that its content can be used by the GPU again.

  • 11) device.createBindGroupLayout(): creates a GPUBindGroupLayout; each of its entries describes the binding of a single shader resource to be exposed in a GPUBindGroup. [Corresponding to the number of WGSL source files]
    There are 3 GPUShaderStage visibility flags:

    • VERTEX: Accessible by vertex shaders.
    • FRAGMENT: Accessible to fragment shaders.
    • COMPUTE: Accessible by compute shaders.
  • 12) device.createBindGroup(): Create a GPUBindGroup.

  • 13) device.createShaderModule(): creates a GPUShaderModule.

  • 14) device.createPipelineLayout(): creates a GPUPipelineLayout.

  • 15) device.createComputePipeline(): creates a GPUComputePipeline using immediate (synchronous) pipeline creation.

  • 16) device.createCommandEncoder(): creates a GPUCommandEncoder.

  • 17) GPUCommandEncoder.beginComputePass(descriptor): Start encoding a compute pass described by the descriptor.

  • 18) dispatchWorkgroups(workgroupCountX, workgroupCountY, workgroupCountZ): dispatches work to be executed with the current GPUComputePipeline, where:

    • workgroupCountX: the X dimension of the grid of workgroups to dispatch.
    • workgroupCountY: the Y dimension of the grid of workgroups to dispatch.
    • workgroupCountZ: the Z dimension of the grid of workgroups to dispatch.

    This means that if the entry point defined by a GPUShaderModule has @workgroup_size(4, 4) and the work is dispatched by calling computePass.dispatchWorkgroups(8, 8), the entry point will be invoked 1024 times in total (see the end-to-end compute sketch after this list):

    • A 4×4 workgroup is dispatched 8 times along the X axis and 8 times along the Y axis: 4 × 4 × 8 × 8 = 1024.
  • 19) copyBufferToBuffer(source, sourceOffset, destination, destinationOffset, size): It is a command in GPUCommandEncoder, used to copy data in a GPUBuffer sub-region to a sub-region of another GPUBuffer.

  • 20) clearBuffer(buffer, offset, size): clears a sub-region of a GPUBuffer, setting its contents to 0.

  • 21) device.queue.writeBuffer(buffer, bufferOffset, data, dataOffset, size): writes the given data into a specific region of a GPUBuffer.

  • 22) device.queue.submit(commandBuffers): Schedule the execution of command buffers in the queue of the GPU. Submitted command buffers cannot be used again.

  • 23) device.queue.onSubmittedWorkDone(): returns a Promise that resolves once the queue has completed all work submitted so far.
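
Putting several of the functions above together, here is a minimal end-to-end compute sketch that doubles every element of an array on the GPU and reads the result back. It assumes `device` was obtained as in items 1)–3) and an async/module context; the WGSL code, buffer sizes, and variable names are illustrative:

```ts
// A simple WGSL kernel: each invocation doubles one element of a storage buffer.
const shaderModule = device.createShaderModule({
  code: `
    @group(0) @binding(0) var<storage, read_write> data: array<f32>;

    @compute @workgroup_size(64)
    fn main(@builtin(global_invocation_id) id: vec3<u32>) {
      if (id.x < arrayLength(&data)) {
        data[id.x] = data[id.x] * 2.0;
      }
    }
  `,
});

const input = new Float32Array(256).map((_, i) => i);

// Storage buffer the shader reads and writes; also a copy source for readback.
const storageBuffer = device.createBuffer({
  size: input.byteLength,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST | GPUBufferUsage.COPY_SRC,
});
device.queue.writeBuffer(storageBuffer, 0, input);

// Mappable buffer used only to read the results back into script.
const readbackBuffer = device.createBuffer({
  size: input.byteLength,
  usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
});

const pipeline = device.createComputePipeline({
  layout: "auto", // derive the GPUPipelineLayout from the shader
  compute: { module: shaderModule, entryPoint: "main" },
});

const bindGroup = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  entries: [{ binding: 0, resource: { buffer: storageBuffer } }],
});

const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(Math.ceil(input.length / 64)); // 256 / 64 = 4 workgroups
pass.end();
encoder.copyBufferToBuffer(storageBuffer, 0, readbackBuffer, 0, input.byteLength);
device.queue.submit([encoder.finish()]);

// Map the readback buffer and view the results from script.
await readbackBuffer.mapAsync(GPUMapMode.READ);
const result = new Float32Array(readbackBuffer.getMappedRange().slice(0));
readbackBuffer.unmap();
console.log(result); // Float32Array [0, 2, 4, 6, ...]
```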

References

[1] WebGPU, W3C Working Draft, 2023-07-17.
