TURING STREAMING MULTIPROCESSOR (SM) ARCHITECTURE

The Turing architecture features a new SM design that incorporates many of the features introduced in our Volta GV100 SM architecture.

  • Two SMs are included per TPC.
  • Each SM has a total of 64 FP32 Cores and 64 INT32 Cores.

In comparison, the Pascal GP10x GPUs have one SM per TPC and 128 FP32 Cores per SM.
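As a quick sanity check on the figures above, the FP32 core count per TPC comes out the same for both designs. The following is a minimal sketch that only multiplies the numbers quoted in the text; the function and variable names are hypothetical and chosen for illustration.

```python
def fp32_cores_per_tpc(sms_per_tpc: int, fp32_cores_per_sm: int) -> int:
    """FP32 cores in one TPC, computed from the per-SM figures quoted above."""
    return sms_per_tpc * fp32_cores_per_sm


# Turing: two SMs per TPC, 64 FP32 Cores per SM.
turing_fp32_per_tpc = fp32_cores_per_tpc(sms_per_tpc=2, fp32_cores_per_sm=64)

# Pascal GP10x: one SM per TPC, 128 FP32 Cores per SM.
pascal_fp32_per_tpc = fp32_cores_per_tpc(sms_per_tpc=1, fp32_cores_per_sm=128)

print(turing_fp32_per_tpc, pascal_fp32_per_tpc)
```

Under these figures the difference between the two designs is how the FP32 cores are grouped into SMs, not how many there are per TPC.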

The Turing SM supports concurrent execution of FP32 and INT32 operations (more details below) and independent thread scheduling similar to the Volta GV100 GPU. Each Turing SM also includes eight mixed-precision Turing Tensor Cores and one RT Core.

The Turing SM is partitioned into four processing blocks, each with:

  • 16 FP32 Cores
  • 16 INT32 Cores
  • two Tensor Cores
  • one warp scheduler
  • one dispatch unit

Each block also includes a new L0 instruction cache and a 64 KB register file.
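The per-block figures can be rolled up into per-SM totals, which should agree with the SM-level numbers stated earlier (64 FP32 Cores, 64 INT32 Cores, eight Tensor Cores). The sketch below is plain arithmetic on the values in the text; the dictionary keys and the function name are assumptions made for illustration, and the register file total is derived here rather than quoted.

```python
BLOCKS_PER_SM = 4

# Per-block resources as listed in the text.
PER_BLOCK = {
    "fp32_cores": 16,
    "int32_cores": 16,
    "tensor_cores": 2,
    "warp_schedulers": 1,
    "dispatch_units": 1,
    "register_file_kb": 64,
}


def per_sm_totals(per_block: dict, blocks: int = BLOCKS_PER_SM) -> dict:
    """Multiply every per-block resource by the number of blocks in one SM."""
    return {name: count * blocks for name, count in per_block.items()}


sm_totals = per_sm_totals(PER_BLOCK)

for name, total in sm_totals.items():
    print(f"{name}: {total}")
```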

The four processing blocks share a combined 96 KB L1 data cache/shared memory.

  • Traditional graphics workloads partition the 96 KB L1/shared memory as 64 KB of dedicated graphics shader RAM and 32 KB for texture cache and register file spill area.
  • Compute workloads can divide the 96 KB into 32 KB shared memory and 64 KB L1 cache, or 64 KB shared memory and 32 KB L1 cache.
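The two compute splits can be written down as a small lookup, which makes it easy to confirm that each one accounts for the full 96 KB. This is only a sketch of the arithmetic, not a driver or CUDA API; `COMPUTE_SPLITS` and `l1_cache_kb` are hypothetical names.

```python
L1_SHARED_TOTAL_KB = 96

# (shared_memory_kb, l1_cache_kb) pairs named in the text for compute workloads.
COMPUTE_SPLITS = [(32, 64), (64, 32)]


def l1_cache_kb(shared_kb: int) -> int:
    """Return the L1 cache size paired with the requested shared memory size."""
    for shared, l1 in COMPUTE_SPLITS:
        if shared == shared_kb:
            return l1
    raise ValueError(f"{shared_kb} KB is not one of the splits listed in the text")


# Every listed split should use the whole 96 KB.
for shared, l1 in COMPUTE_SPLITS:
    assert shared + l1 == L1_SHARED_TOTAL_KB

print(l1_cache_kb(32), l1_cache_kb(64))
```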

Turing implements a major revamping of the core execution datapaths. Modern shader workloads typically have a mix of FP arithmetic instructions such as FADD or FMAD with simpler instructions such as integer adds for addressing and fetching data, floating-point compare or min/max for processing results, etc. In previous shader architectures, the floating-point math datapath sits idle whenever one of these non-FP-math instructions runs. Turing adds a second parallel execution unit next to every CUDA core that executes these instructions in parallel with floating-point math.

Figure 5 shows that the mix of integer pipe versus floating point instructions varies, but across several modern applications, we typically see about 36 additional integer pipe instructions for every 100 floating point instructions. Moving these instructions to a separate pipe translates to an effective 36% additional throughput possible for floating point.
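One way to see where the 36% figure comes from is to count issue slots. The sketch below assumes an idealized model that is not spelled out in the text: on a single shared pipe every instruction takes one slot, while with separate pipes the integer instructions overlap completely with the floating point ones. Real workloads will have dependencies and stalls, so this is an upper bound, and the function name is hypothetical.

```python
def fp_throughput_gain(fp_instructions: int, int_instructions: int) -> float:
    """Speedup from issuing integer and FP instructions on separate pipes.

    Idealized model: one slot per instruction, perfect overlap between pipes.
    """
    # Shared pipe: integer instructions take slots away from FP math.
    serialized_slots = fp_instructions + int_instructions
    # Separate pipes: the longer of the two streams determines the time.
    concurrent_slots = max(fp_instructions, int_instructions)
    return serialized_slots / concurrent_slots


# 36 integer pipe instructions for every 100 floating point instructions.
gain = fp_throughput_gain(fp_instructions=100, int_instructions=36)
print(f"{(gain - 1) * 100:.0f}% additional floating point throughput")
```

With the 100:36 mix, the shared pipe needs 136 slots and the separate pipes need 100, which gives the 1.36x ratio behind the figure quoted above.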

Turing TU102/TU104/TU106 Streaming Multiprocessor (SM)

Reposted from blog.csdn.net/royalfizz/article/details/93301678