2019/5/3
Udacity CS344 study notes!! Keep going!
Table of Contents
How do designers make computers run faster?
Why We Cannot Keep Increasing CPU Speed
What kind of processors are we building?
Building a Power-Efficient Processor
GPU from the point of view of the developer
Configure the Kernel Launch Parameters
Unit 1:
Running CUDA locally on your machines:
Windows: http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-microsoft-windows/index.html
OSX: http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-mac-os-x/index.html
Linux: http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html
How do designers make computers run faster?
Thanks to Kym Dylla for the hole-digging animation.
(1) Digging faster = faster clock.
Shorter time for each computation, but increased power consumption.
(2) Buying a more productive shovel = more work per clock cycle.
We are already at the limit of instruction-level parallelism per clock cycle.
(3) Hiring more diggers = parallel computing: more processors.
Many smaller, simpler processors.
Chickens or Oxen?
Seymour Cray: would you rather plow a field with two strong oxen or 1024 chickens?
A modern GPU: thousands of ALUs, hundreds of processors, tens of thousands of concurrent threads.
This class: how to think in parallel (like the chickens).
CPU Speed Remaining Flat:
We have more transistors available for computation.
Why We Cannot Keep Increasing CPU Speed
Have transistors stopped getting smaller or faster? No. The limiting factor is heat!
What matters today: POWER!
Consequence: smaller, more efficient processors, and more of them.
What kind of processors are we building?
(major design constraint: POWER)
(1) CPU - central processing unit - complex control hardware
↑ flexibility + performance
↓ expensive in terms of power
(2) GPU - graphics processing unit - simpler control hardware
↑ more hardware for computation
↑ potentially more power efficient (ops/watt)
↓ more restrictive programming model
Building a Power-Efficient Processor
Latency vs. Throughput
Latency: the time it takes to complete one task (e.g., seconds). Throughput: the number of tasks completed per unit time (e.g., jobs/hour). GPUs choose to optimize for throughput.
Paper: Latency Lags Bandwidth (David Patterson)
http://roc.cs.berkeley.edu/retreats/winter_04/slides/pattrsn_BWv1b.doc
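An illustrative comparison (the numbers are made up for this note, not taken from the paper): a car carries 2 people at 120 km/h, while a bus carries 40 people at 60 km/h. Over a 120 km route the car wins on latency (1 h vs. 2 h), but the bus wins on throughput (40 people / 2 h = 20 people/h vs. the car's 2 people/h). A GPU is the bus.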
Core GPU Design Tenets
(1) Lots of simple compute units; trade simple control for more compute.
(2) An explicitly parallel programming model.
(3) Optimize for throughput, not latency.
GPU from the point of view of the developer
the importance of programming in parallel
CUDA Program Diagram: the CPU is the "host" and the GPU is the "device". The host is in charge: it allocates device memory, copies data between host and device, and launches kernels that run on the device.
For your reference, CUDA C Programming Guide
http://docs.nvidia.com/cuda/cuda-c-programming-guide/#axzz387wa8Je1
A CUDA Program
Defining the GPU Computation
What is the GPU good at?
(1) Efficiently launching lots of threads.
(2) Running lots of threads in parallel.
Squaring a number on the CPU
CPU Code: square each element of an array
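The slide showed the serial loop itself; a minimal sketch of it (the function name and signature are my reconstruction, not necessarily the instructor's exact code):

void square_array(float * out, float * in, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = in[i] * in[i];    // one element per iteration, strictly in order
    }
}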
GPU Code: A high level view
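A minimal sketch of the kernel (mirroring the cube kernel further below, but squaring): the loop disappears, and each launched thread runs this body once:

__global__ void square(float * d_out, float * d_in) {
    int idx = threadIdx.x;       // which thread am I within the block?
    float f = d_in[idx];
    d_out[idx] = f * f;          // each thread squares exactly one element
}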
But how does it work if I launch 64 instances of the same program? Each instance is a thread running the same code, but with a different value of threadIdx.x, so each thread reads and writes a different array element.
Calculation Time on the GPU
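One standard way to measure kernel time on the GPU is with CUDA events (a sketch, not the instructor's code; it assumes d_in/d_out are already set up as in the program below):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);                  // mark start on the GPU timeline
cube<<<1, 64>>>(d_out, d_in);            // the kernel being timed
cudaEventRecord(stop);                   // mark stop
cudaEventSynchronize(stop);              // wait for the GPU to reach 'stop'
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed milliseconds between the events
printf("kernel took %f ms\n", ms);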
Squaring Numbers Using CUDA (the program below cubes each element instead of squaring)
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
//__global__ : "declaration specifier"
__global__ void cube(float * d_out, float * d_in) {
    int idx = threadIdx.x;       // each thread handles the element at its own index
    float f = d_in[idx];
    d_out[idx] = f * f * f;
}
int main(int argc, char ** argv) {
    const int ARRAY_SIZE = 64;   // one element per thread (cf. "64 instances" above)
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);
    // generate the input array on the host
    float h_in[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++) {
        h_in[i] = float(i);
    }
    float h_out[ARRAY_SIZE];
    // declare GPU memory pointers
    float * d_in;
    float * d_out;
    // allocate GPU memory
    cudaMalloc((void**)&d_in, ARRAY_BYTES);
    cudaMalloc((void**)&d_out, ARRAY_BYTES);
    // transfer the input array to the GPU
    cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);
    // launch the kernel: 1 block of ARRAY_SIZE threads
    cube<<<1, ARRAY_SIZE>>>(d_out, d_in);
    // copy back the result array to the CPU
    cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);
    // print out the resulting array, 4 values per line
    for (int i = 0; i < ARRAY_SIZE; i++) {
        printf("%f", h_out[i]);
        printf(((i % 4) != 3) ? "\t" : "\n");
    }
    // free GPU memory
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
Configure the Kernel Launch Parameters
Here's a good guide to finding the proper block and grid size.
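A kernel launch takes a grid size (number of blocks) and a block size (threads per block), each up to 3-dimensional. A minimal sketch for a 1D problem of N elements (the names kernel, N, and threadsPerBlock are mine, for illustration):

const int N = 1048576;                 // problem size
const int threadsPerBlock = 256;       // a common choice; at most 1024 on current GPUs
const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
kernel<<<blocks, threadsPerBlock>>>(d_out, d_in, N);

For 2D/3D launches use dim3, e.g. kernel<<<dim3(gx, gy), dim3(tx, ty)>>>(...).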
What do we know so far?
(1) We write a program that looks like it runs on one thread.
(2) We can launch that program on any number of threads.
(3) Each thread knows its own index within its block, and its block's index within the grid (see the sketch after this list).
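Combining the two indices gives each thread a unique global index; a minimal sketch (the kernel name and the bounds check are mine):

__global__ void kernel(float * d_out, float * d_in, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < n) {                                    // guard threads past the end
        d_out[idx] = d_in[idx] * d_in[idx];
    }
}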
Map: given a set of elements and a function, apply the function to each element independently. The cube kernel above is a map, and maps are exactly what the GPU is good at: one thread per element.
Summary of Lesson 1
---Technology trends
---Throughput vs. Latency
---GPU Design Goals
---The GPU programming model in CUDA C, with an example