Literature Notes (1)



These notes are taken from Efficient Processing of Deep Neural Networks: A Tutorial and Survey.

1 abstract & introduction

This article aims to provide an overview of DNNs, the various tools for understanding their behavior, and the techniques being explored to efficiently accelerate their computation.

2 background on deep neural networks (DNNs)

2.1 AI & DNNs


2.2 Neural networks and DNNs


2.3 Inference vs training

The weights are updated using a hill-climbing optimization process called gradient descent.
The gradient itself is efficiently computed through a process called back-propagation.
This article focuses on the efficient processing of DNN inference rather than training, since DNN inference is often performed on embedded devices.
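
A minimal sketch (illustrative, not from the paper) of one gradient-descent step for a single linear layer with a squared-error loss; here back-propagation reduces to a single outer product. All shapes and the learning rate are made up.

```python
import numpy as np

# Minimal sketch: one gradient-descent step for a single linear layer
# with a squared-error loss. Shapes and the learning rate are illustrative.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)          # input activations
t = rng.standard_normal(3)          # target output
W = rng.standard_normal((3, 4))     # weights to be learned
lr = 0.1                            # learning rate

y = W @ x                           # forward pass
loss = 0.5 * np.sum((y - t) ** 2)   # squared-error loss

# Back-propagation: the gradient of the loss w.r.t. W is the outer
# product of the output error and the input.
grad_W = np.outer(y - t, x)

W -= lr * grad_W                    # gradient-descent update
```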

2.4 applications of DNNs

2.5 Embedded vs Cloud

Why should DNN inference be performed near the sensor?
To reduce communication cost, latency, and security risk.

3 Overview of DNNs

  • feed-forward networks
  • recurrent
  • fully-connected layer (FC)
  • sparsely-connected layer

3.1 convolutional neural networks (CNNs)

CNNs are composed of multiple convolutional layers (CONV), where each layer generates a higher-level abstraction of the input data, called a feature map (fmap).
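
A minimal sketch (illustrative) of the core CONV computation: sliding a single-channel filter over an input feature map to produce an output feature map, one MAC sum per output value.

```python
import numpy as np

# Minimal sketch of one CONV layer: a 2-D sliding-window convolution that
# turns an input feature map into an output feature map. Single channel,
# unit stride, no padding; all values are illustrative.
def conv2d(ifmap, weights):
    H, W = ifmap.shape
    R, S = weights.shape
    ofmap = np.zeros((H - R + 1, W - S + 1))
    for i in range(ofmap.shape[0]):
        for j in range(ofmap.shape[1]):
            # each output value is a sum of multiply-accumulate (MAC) operations
            ofmap[i, j] = np.sum(ifmap[i:i + R, j:j + S] * weights)
    return ofmap

ifmap = np.arange(25, dtype=float).reshape(5, 5)   # 5x5 input feature map
weights = np.ones((3, 3)) / 9.0                    # 3x3 averaging filter
print(conv2d(ifmap, weights))                      # 3x3 output feature map
```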

3.2 non-linearity

A non-linear activation function is applied after each convolutional or fully connected computation.
Common non-linearity functions include sigmoid, tanh, and ReLU.
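
A minimal sketch of ReLU, one common non-linearity, applied element-wise.

```python
import numpy as np

# ReLU: keep positive values, clamp negatives to zero. Applied element-wise
# to the output of a convolution or fully connected computation.
def relu(fmap):
    return np.maximum(fmap, 0.0)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]
```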

3.3 pooling

pooling enables the network to be robust and invariant to small shifts and distortions.
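
A minimal sketch (illustrative) of 2x2 max pooling with stride 2; even feature-map dimensions are assumed.

```python
import numpy as np

# Minimal sketch of 2x2 max pooling with stride 2; the feature-map side
# lengths are assumed to be even.
def max_pool_2x2(fmap):
    H, W = fmap.shape
    return fmap.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fmap))   # 2x2 output, each value is the max of a 2x2 window
```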

3.4 normalization

  • batch normalization (a sketch follows this list)
  • local response normalization
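
A minimal sketch (illustrative, inference-time form) of batch normalization using pre-computed statistics and learned scale/shift parameters; all values are made up.

```python
import numpy as np

# Batch normalization at inference time: activations are normalized with
# pre-computed statistics, then scaled and shifted by the learned parameters
# gamma and beta. eps avoids division by zero.
def batch_norm(x, mean, var, gamma, beta, eps=1e-5):
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([0.5, 2.0, -1.0])
print(batch_norm(x, mean=0.2, var=1.5, gamma=1.0, beta=0.0))
```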

3.5 popular DNN models

  • LeNet
  • AlexNet
  • Overfeat
  • VGG-16
  • GoogLeNet: could fit into the GPU memory
  • ResNet

trends:
increasing the depth of the network tends to provide higher accuracy
filter shapes continue to vary across layers
most of the computation is placed in the convolutional layers rather than the fully connected layers

4 DNN development resources

4.1 Frameworks

Caffe, TensorFlow, Torch

4.2 Model

For the same DNN, the accuracy of these models can vary by around 1 to 2% depending on how the model was trained

4.3 popular datasets for classification

  • MNIST dataset: classifying handwritten digits
  • ImageNet dataset: classifying an object into one of 1000 classes
  • CIFAR-10 dataset: 10 mutually exclusive classes
  • ILSVRC: the ImageNet Large Scale Visual Recognition Challenge, built on the ImageNet dataset; its fine-grained classes include 120 different breeds of dogs

4.4 datasets for other tasks

object detection:
objects in the image must be both localized and classified

  • PASCAL VOC dataset: 20 classes
  • MS COCO dataset: 91 object categories

5 hardware for DNN processing

Section 5 describes the various hardware platforms used to process DNNs and the various optimizations that improve throughput and energy efficiency without impacting accuracy.

  • MAC: multiply-and-accumulate operation
  • PE: processing engine, an ALU with its own local memory
  • SIMD/SIMT: temporal architectures in which the ALUs can only fetch data from the memory hierarchy and cannot communicate data directly with one another

5.1 accelerating kernel computation on CPU and GPU platforms

CPUs and GPUs use temporal architectures such as SIMD or SIMT to perform the MACs in parallel.
All the ALUs share the same control and memory.

5.1.1 Fast Fourier Transform (FFT)

reduces the number of multiplications:
take the FFT of the filter and the input feature map -> perform the multiplication in the frequency domain -> apply an inverse FFT to the resulting product
the FFT of the filter can be precomputed and stored
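
A minimal sketch (illustrative, 1-D for brevity) of FFT-based convolution, checked against direct convolution; in practice the FFT of the filter would be precomputed.

```python
import numpy as np

# FFT-based convolution (1-D, 'full' convolution for simplicity): multiply
# in the frequency domain instead of sliding the filter, then transform back.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # input feature map
w = np.array([0.25, 0.5, 0.25])           # filter

n = len(x) + len(w) - 1                   # output length of a full convolution
X = np.fft.rfft(x, n)
W = np.fft.rfft(w, n)                     # in practice, precomputed and stored
y_fft = np.fft.irfft(X * W, n)

y_direct = np.convolve(x, w)              # direct convolution for comparison
assert np.allclose(y_fft, y_direct)
print(y_fft)
```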

5.1.2 Strassen or Winograd

Other approaches include Strassen and Winograd, which rearrange the computation so that the number of multiplications scales from O(N^3) to O(N^2.807), at the cost of reduced numerical stability, increased storage requirements, and specialized processing depending on the size of the filter.
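
A minimal sketch (illustrative) of Strassen's algorithm for square matrices whose side is a power of two; the 7 recursive multiplications instead of 8 are what give the O(N^2.807) scaling.

```python
import numpy as np

# Strassen's algorithm: split each matrix into quadrants and combine
# 7 recursive products instead of the naive 8. Purely illustrative.
def strassen(A, B, leaf=2):
    n = A.shape[0]
    if n <= leaf:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
assert np.allclose(strassen(A, B), A @ B)
```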

5.2 energy-efficient dataflow for accelerators

For DNNs, the bottleneck for processing is in the memory access.

5.2.1 the multiple levels of the memory hierarchy

helps to improve energy efficiency by providing low-cost data accesses

5.2.2 the optimized dataflow (data reuse)

minimizes access from the expensive levels of the memory hierarchy

A significant number of DRAM accesses can be saved by storing data in the local memory hierarchy and accessing it multiple times without going back to DRAM.

5.2.3 DNN dataflows in recent work, classified by their data-handling characteristics

Weight stationary (WS)

each weight is read from DRAM into the register file (RF) of each PE and stays stationary for further accesses

output stationary (OS)

It keeps the accumulation of partial sums for the same output activation value local in the RF
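
A minimal sketch (illustrative, not the paper's notation) contrasting the loop orders behind the WS and OS dataflows for a 1-D convolution; the stationary operand is the one held in the RF while the inner loop streams the other data.

```python
import numpy as np

# Illustrative loop nests for a 1-D convolution y[i] = sum_r w[r] * x[i + r],
# showing which operand stays put in the PE's register file (RF).
x = np.arange(8, dtype=float)      # input activations
w = np.array([1.0, 2.0, 3.0])      # filter weights
y_len = len(x) - len(w) + 1

# Weight stationary (WS): each weight is fetched once and reused across
# all outputs before the next weight is fetched.
y_ws = np.zeros(y_len)
for r, weight in enumerate(w):     # weight held stationary in the RF
    for i in range(y_len):
        y_ws[i] += weight * x[i + r]

# Output stationary (OS): each partial sum stays local until the output
# is fully accumulated, while weights and inputs stream past it.
y_os = np.zeros(y_len)
for i in range(y_len):             # partial sum held stationary in the RF
    for r, weight in enumerate(w):
        y_os[i] += weight * x[i + r]

assert np.allclose(y_ws, y_os)
```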

no local reuse (NLR)

In order to maximize the storage capacity and minimize the off-chip memory bandwidth, no local storage is allocated to the PEs; instead, all of that area is allocated to the global buffer to increase its capacity.

row stationary (RS)

The row stationary dataflow aims to maximize reuse and accumulation at the RF level for all types of data (weights, pixels, partial sums) for overall energy efficiency.

How can a fixed-size PE array accommodate different layer shapes?
(1) The structure is replicated several times, with different channels and filters run in each replication.
(2) The computation can be folded into several parts in order to fit it onto the fixed physical PE array.
How can a fixed design pass data in different patterns?
A custom multicast network is used to provide this flexible data delivery.

5.2.4 energy comparison of different dataflows

The WS and OS dataflows have the lowest energy consumption for accessing weights and partial sums, respectively. However, the RS dataflow has the lowest total energy consumption since it optimizes for the overall energy efficiency.

6 near-data processing

Section 6 discusses how mixed-signal circuits and new memory technologies can be used for near-data processing to address the challenging data movement that dominates throughput and energy consumption of DNNs

  • analog processing increases sensitivity to circuit non-idealities and reduces precision
  • DNNs are often trained in the digital domain -> DACs and ADCs are needed at the boundary between the analog and digital domains

6.1 DRAM

  • eDRAM (embedded DRAM)
    brings high-density memory on-chip in order to reduce the energy cost of switching off-chip capacitance, and improves the number of ports and the memory bandwidth
  • 3-D memory
    stacks the DRAM on top of the logic using through-silicon vias (TSVs)

6.2 SRAM

The multiply-and-accumulate operation can be directly integrated into the bit-cells of an SRAM array.

6.3 non-volatile resistive memories

A multiplication is performed with the resistor's conductance as the weight, the voltage as the input, and the current as the output.
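
A minimal numerical model (illustrative only) of a resistive crossbar performing multiply-and-accumulate: conductances act as weights, row voltages as inputs, and the summed column currents as outputs.

```python
import numpy as np

# Resistive crossbar model: weights are stored as conductances G at the
# crosspoints, inputs are applied as voltages V on the rows, and by Ohm's
# law each cell contributes a current G * V; the currents summed along
# each column form one accumulated output.
G = np.array([[1e-6, 3e-6],        # conductances (weights), in siemens
              [2e-6, 1e-6],        # rows = inputs, columns = outputs
              [0.5e-6, 2e-6]])
V = np.array([0.2, 0.1, 0.3])      # input voltages on the rows, in volts

I = V @ G                          # per-column current = multiply-and-accumulate
print(I)                           # e.g. I[0] = sum_i V[i] * G[i, 0]
```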

6.4 sensors

the matrix multiplication and accumulation can be integrated into the ADC
computing the gradient of the input in the sensor skips the first-layer computation of gradient-like feature maps and reduces the sensor bandwidth through compression

7 co-design of DNN models and hardware

Section 7 describes various joint algorithm and hardware optimizations that can be performed on DNNs to improve both throughput and energy while trying to minimize impact on performance accuracy.
Traditionally, the DNN models were designed to maximize accuracy without much consideration of the implementation complexity.
But co-design of the DNN model and hardware can be effective in jointly maximizing accuracy and throughput, while minimizing energy and cost.

7.1 reduce precision

Benefits of reduced precision include reduced storage cost and reduced computation requirements.
Most recent works focus on reducing the precision of the weights rather than the activations, and on reducing precision for inference rather than for training.

7.1.1 linear quantization

  • fixed-point numbers (a quantization sketch follows this list)
  • dynamic fixed point
    Because the weights and activations are centered near zero, particularly when batch normalization is used, the accumulation does not move in only one direction, so there is no significant impact on accuracy.
  • binary weights (-1, 1)
  • ternary weights (-w, 0, w)
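
A minimal sketch (illustrative, not the paper's specific scheme) of symmetric linear quantization of weights to 8-bit fixed point.

```python
import numpy as np

# Symmetric linear (uniform) quantization of weights to signed 8-bit integers.
# The scale maps the largest-magnitude weight to the integer range; this is
# an illustrative choice, not a specific scheme from the paper.
def quantize_int8(w):
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([-0.8, -0.05, 0.0, 0.3, 0.61], dtype=np.float32)
q, scale = quantize_int8(w)
print(q)                      # integer weights stored/computed in low precision
print(dequantize(q, scale))   # approximate reconstruction of the originals
```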

7.1.2 non-linear quantization

the distribution of the weights and activations is not uniform

  • log domain quantization
  • learned quantization or weight sharing: forces several weights to share a single value, reducing the number of unique weights (see the sketch after this list)
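
A minimal sketch (illustrative, not the paper's exact method) of weight sharing: cluster the weights with a tiny k-means so each weight becomes an index into a small codebook of shared values.

```python
import numpy as np

# Weight sharing via a tiny k-means: each weight is replaced by the index of
# its nearest centroid, so only k unique values (the codebook) are stored.
def weight_share(w, k=4, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = rng.choice(w, size=k, replace=False)
    for _ in range(iters):
        idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(idx == c):
                centroids[c] = w[idx == c].mean()
    return idx, centroids

w = np.array([0.11, 0.12, -0.5, -0.48, 0.9, 0.88, 0.1, -0.52])
idx, codebook = weight_share(w, k=3)
print(idx)                 # small integer indices stored per weight
print(codebook)            # the k shared weight values
print(codebook[idx])       # reconstructed (shared) weights
```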

7.2 reduce number of operations and model size

7.2.1 exploiting activation statistics

  • ReLU sets all negative activation values to 0
  • save energy and area by compressing the resulting sparse activations
  • skip reading the weights and performing the MAC for zero-valued activations (see the sketch after this list)
  • prune low-valued activations
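
A minimal sketch (illustrative, not any specific accelerator's logic) of skipping the weight read and MAC for zero-valued activations.

```python
import numpy as np

# Skip the weight read and the MAC whenever the activation is zero: after
# ReLU many activations are zero, so these operations (and weight fetches)
# can be saved without changing the result.
acts = np.array([0.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.3, 0.0])   # post-ReLU
weights = np.arange(1.0, 9.0)

total, macs = 0.0, 0
for a, w in zip(acts, weights):
    if a == 0.0:
        continue          # no weight read, no MAC
    total += a * w
    macs += 1

print(total, macs, "of", len(acts), "MACs performed")
```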

7.2.2 network pruning

A large number of weights in a network are redundant and can be removed or set to zero.
custom hardware is needed to support the efficient processing of sparse weights

  • compressed sparse row (CSR) format
  • compressed sparse column (CSC) format: more effective (see the sketch after this list)
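
A minimal sketch (illustrative, assumes scipy is available) of magnitude pruning followed by storage in compressed sparse column format; the pruning threshold is made up.

```python
import numpy as np
from scipy.sparse import csc_matrix

# Magnitude pruning: set small weights to zero, then store the sparse matrix
# in compressed sparse column (CSC) format so only the non-zero weights and
# their indices are kept.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))
W[np.abs(W) < 1.0] = 0.0            # prune weights below the threshold

W_csc = csc_matrix(W)               # data, row indices, and column pointers
x = rng.standard_normal(6)

print(W_csc.nnz, "non-zero weights out of", W.size)
print(np.allclose(W_csc @ x, W @ x))   # sparse and dense results match
```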

7.2.3 compact network architectures

before training

  • one N*N convolution can be decomposed into a 1*N and an N*1 convolution (see the sketch after this list)
  • 1*1 convolutional layers can be used to reduce the number of filter channels in the next layer
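
A minimal sketch (illustrative, uses scipy.signal) showing that a 1*N followed by an N*1 convolution reproduces a separable N*N filter with fewer weights and MACs per output.

```python
import numpy as np
from scipy.signal import convolve2d

# A separable N*N filter equals a 1*N convolution followed by an N*1
# convolution: 2N weights per output value instead of N*N.
N = 5
row = np.array([[1.0, 4.0, 6.0, 4.0, 1.0]])     # 1*N filter
col = row.T                                      # N*1 filter
full = col @ row                                 # the equivalent N*N filter

x = np.random.default_rng(0).standard_normal((12, 12))

y_full = convolve2d(x, full, mode='valid')               # N*N = 25 MACs per output
y_sep = convolve2d(convolve2d(x, row, mode='valid'),
                   col, mode='valid')                    # 1*N then N*1 = 10 MACs per output
print(np.allclose(y_full, y_sep))                        # True
```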

after training

tensor decomposition can be used to decompose filters in a trained network without impacting the accuracy.
It treats weights in a layer as a 4-D tensor and breaks it into a combination of smaller tensors.
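
A minimal 2-D analogue (illustrative; the paper's method operates on 4-D tensors) using a low-rank SVD to split a weight matrix into two smaller matrices.

```python
import numpy as np

# 2-D analogue of tensor decomposition: approximate a trained weight matrix
# by a rank-r product of two smaller matrices, reducing storage and MACs
# from m*n to r*(m + n). The rank r is an illustrative choice; real trained
# weights are far more compressible than this random example.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512))

r = 32
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * s[:r]               # 256 x r
B = Vt[:r, :]                      # r x 512

x = rng.standard_normal(512)
y_full = W @ x
y_low = A @ (B @ x)                # two smaller layers instead of one big one
print(np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full))  # relative error
```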

7.2.4 knowledge distillation

knowledge distillation transfers the knowledge learned by the complex model to the simpler model.
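
A minimal sketch (illustrative; one common formulation, not necessarily the paper's) of a distillation loss: the student is trained to match the teacher's temperature-softened class probabilities.

```python
import numpy as np

# Knowledge distillation: soften the teacher's logits with a temperature T
# and train the student to match the resulting probabilities, in addition
# to (or instead of) the hard labels. T and the logits are illustrative.
def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    p_teacher = softmax(teacher_logits / T)     # softened teacher targets
    p_student = softmax(student_logits / T)
    # cross-entropy between the softened distributions
    return -np.sum(p_teacher * np.log(p_student + 1e-12))

teacher = np.array([9.0, 3.0, 1.0, 0.5])        # logits from the complex model
student = np.array([6.0, 2.5, 1.5, 0.0])        # logits from the simpler model
print(distillation_loss(student, teacher))
```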

8 benchmarking metrics for DNN evaluation and comparison

Section 8 describes the key metrics that should be considered when comparing various DNNs designs.

metrics for DNN models

  • accuracy of the model in terms of the top-5 error on datasets
  • number of layers, filter sizes, number of filters and number of channels
  • the number of weights -> impacts the storage requirements of the model
  • the number of MACs

metrics for DNN hardware

  • the energy-efficiency of the design
  • the off-chip bandwidth
  • the area efficiency
  • the throughput

others

  • the batch size
  • the number of bits per operand
  • the energy per non-zero MAC
  • the off-chip access per non-zero MAC
  • the run time
  • the power consumption

Reposted from blog.csdn.net/tiaozhanzhe1900/article/details/82956437