Lesson 1 - The GPU Programming Model

Copyright notice: this is the author's original post; do not repost without permission. https://blog.csdn.net/qq_24990189/article/details/89784823

2019/5/3

Udacity CS344 study notes — keep at it!

Contents

Unit 1:

How Do Designers Make Computers Run Faster?

Chickens or Oxen?

CPU Speed Remaining Flat

Why We Cannot Keep Increasing CPU Speed

What Kind of Processors Are We Building?

Building a Power-Efficient Processor

Core GPU Design Tenets

The GPU from the Developer's Point of View

CUDA Program Diagram

A CUDA Program

Defining the GPU Computation

What the GPU Is Good At

Squaring a Number on the CPU

GPU Code: A High-Level View

Squaring Numbers Using CUDA

Configure the Kernel Launch Parameters

What We Know So Far

Map

Summary of Lesson 1


Unit 1:

Running CUDA locally on your machines:

Windows: http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-microsoft-windows/index.html
OSX: http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-mac-os-x/index.html
Linux: http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html

How Do Designers Make Computers Run Faster?

Thanks to Kym Dylla for the hole-digging animation.

(1) Digging faster = a faster clock

Shorter time for each computation, but increased power consumption.

(2) Buying a more productive shovel = more work per clock cycle

We are at the limit of instruction-level parallelism per clock cycle.

(3) Hiring more diggers = parallel computing: more processors

Many smaller, simpler processors.

Chickens or Oxen?

Seymour Cray: would you rather plow a field with two strong oxen or 1024 chickens?

A modern GPU: thousands of ALUs, hundreds of processors, tens of thousands of concurrent threads.

This class: how to think in parallel (like the chickens).

CPU Speed Remaining Flat

We have more transistors available for computation.

Why We Cannot Keep Increasing CPU Speed

Have transistors stopped getting smaller or faster? No — the limit is heat!

What matters today: POWER!

Consequence: smaller, more efficient processors — and more of them.

What Kind of Processors Are We Building?

(major design constraint: POWER)

(1) CPU (central processing unit): complex control hardware

↑ flexibility + performance

↓ expensive in terms of power

(2) GPU (graphics processing unit): simpler control hardware

↑ more hardware for computation

↑ potentially more power efficient (ops/watt)

↓ more restrictive programming model

Building a Power-Efficient Processor

Latency vs. Throughput

Paper: Latency Lags Bandwidth

http://roc.cs.berkeley.edu/retreats/winter_04/slides/pattrsn_BWv1b.doc

Core GPU Design Tenets

(1) Lots of simple compute units. Trade simple control for more compute.

(2) Explicitly parallel programming model.

(3) Optimize for throughput, not latency.

The GPU from the Developer's Point of View

The importance of programming in parallel.

CUDA Program Diagram

For your reference, CUDA C Programming Guide

http://docs.nvidia.com/cuda/cuda-c-programming-guide/#axzz387wa8Je1

A CUDA Program

Defining the GPU Computation

What the GPU Is Good At

(1)Efficiently launching lots of threads.

(2)Running lots of threads in parallel

Squaring a Number on the CPU

CPU Code: Square each element of an array

GPU Code: A High-Level View

But how does it work if I launch 64 instances of the same program?

Calculation Time on the GPU

Squaring Numbers Using CUDA 

#include <stdio.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

// __global__ : "declaration specifier" -- marks a kernel that runs on the GPU
// (note: this kernel computes cubes, f*f*f, not squares)
__global__ void cube(float * d_out, float * d_in) {
	int idx = threadIdx.x;      // this thread's index within the block
	float f = d_in[idx];
	d_out[idx] = f * f * f;
}

int main(int argc, char ** argv) {
	const int ARRAY_SIZE = 94;
	const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

	// generate the input array on the host
	float h_in[ARRAY_SIZE];
	for (int i = 0; i < ARRAY_SIZE; i++) {
		h_in[i] = float(i);
	}
	float h_out[ARRAY_SIZE];

	// declare GPU memory pointers
	float * d_in;
	float * d_out;

	// allocate GPU memory
	cudaMalloc((void**)&d_in, ARRAY_BYTES);
	cudaMalloc((void**)&d_out, ARRAY_BYTES);

	// transfer the array to the GPU
	cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

	// launch the kernel: 1 block of ARRAY_SIZE threads
	cube<<<1, ARRAY_SIZE>>>(d_out, d_in);

	// copy back the result array to the CPU
	cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);

	// print out the resulting array
	for (int i = 0; i < ARRAY_SIZE; i++) {
		printf("%f", h_out[i]);
		printf(((i % 4) != 3) ? "\t" : "\n");
	}

	// free GPU memory
	cudaFree(d_in);
	cudaFree(d_out);

	return 0;
}

Configure the Kernel Launch Parameters

Here's a good guide to finding the proper block and grid size.

http://forums.udacity.com/questions/100027222/an-intuition-for-finding-the-index-for-a-given-grid-size-and-block-size-hw1#cs344

What We Know So Far

(1) We write a program that looks like it runs on one thread.

(2) We can launch that program on any number of threads.

(3) Each thread knows its own index within its block and grid.

Map

Summary of Lesson 1

--- Technology trends

--- Throughput vs. latency

--- GPU design goals

--- The CUDA C programming model, with an example

