
How to estimate the GPU/CPU memory consumption of a deep convolutional neural network

Torch7 can print the memory layout of every layer in a deep neural network, that is, the size of each Tensor the layer holds. For example, with a batch size of 16:

nn.SpatialConvolution(3,4,4,4,2,2,1,1)
-- cpu
{
  padW : 1
  nInputPlane : 3
  output : FloatTensor - size: 16x4x16x16
  gradInput : FloatTensor - size: 16x3x32x32
  _type : "torch.FloatTensor"
  dH : 2
  dW : 2
  nOutputPlane : 4
  padH : 1
  kH : 4
  finput : FloatTensor - size: 16x48x256
  weight : FloatTensor - size: 4x3x4x4
  gradWeight : FloatTensor - size: 4x3x4x4
  fgradInput : FloatTensor - size: 16x48x256
  kW : 4
  bias : FloatTensor - size: 4
  gradBias : FloatTensor - size: 4
}
-- gpu
{
  padW : 1
  nInputPlane : 3
  output : CudaTensor - size: 16x4x16x16
  gradInput : CudaTensor - size: 16x3x32x32
  _type : "torch.CudaTensor"
  dH : 2
  dW : 2
  nOutputPlane : 4
  padH : 1
  kH : 4
  finput : CudaTensor - size: 48x256
  weight : CudaTensor - size: 4x3x4x4
  gradWeight : CudaTensor - size: 4x3x4x4
  fgradInput : CudaTensor - size: 16x16
  kW : 4
  bias : CudaTensor - size: 4
  gradBias : CudaTensor - size: 4
}

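As the printouts show, the CPU and GPU versions are nearly identical. The big differences are finput and fgradInput: on the CPU their size depends on the batch size, on the GPU it does not. This is also why torch7 is so memory-hungry on the CPU and much better behaved on the GPU. These two variables are temporary buffers that the convolution layer allocates during computation to speed it up. How their sizes are computed is hard to find out: there is no direct explanation online, and you have to read the C source files to understand it.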

nn.SpatialConvolution
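finput = (kW*kH*nInputPlane) x (outputHeight*outputWidth) -- the CPU version adds a leading batch dimension
fgradInput = same as finput -- on the CPU; in the GPU printout above it is 16x16, which matches outputHeight x outputWidth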
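
As a quick check of the formula above, here is a minimal sketch in plain Lua (no Torch required). The helper names convOutputSize and finputSize are made up for illustration, and the 32x32 input size is read off the gradInput line of the printout. It assumes the usual convolution output-size rule, floor((in + 2*pad - k) / stride) + 1.

```lua
-- Plain Lua sketch; convOutputSize and finputSize are illustrative names,
-- not part of the Torch7 API.

-- Output size along one dimension, assuming the usual rule
-- floor((in + 2*pad - k) / stride) + 1.
local function convOutputSize(inSize, k, stride, pad)
  return math.floor((inSize + 2 * pad - k) / stride) + 1
end

-- Per-sample finput shape of nn.SpatialConvolution:
-- (kW*kH*nInputPlane) x (outputHeight*outputWidth).
local function finputSize(nInputPlane, inH, inW, kH, kW, dH, dW, padH, padW)
  local outH = convOutputSize(inH, kH, dH, padH)
  local outW = convOutputSize(inW, kW, dW, padW)
  return kW * kH * nInputPlane, outH * outW
end

-- nn.SpatialConvolution(3,4,4,4,2,2,1,1) applied to a 32x32 input
local rows, cols = finputSize(3, 32, 32, 4, 4, 2, 2, 1, 1)
print(rows, cols)  -- 48  256
```

This reproduces the 48x256 finput from the GPU printout. The CPU printout shows the same shape with the batch size 16 in front.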

nn.SpatialFullConvolution
finput=(kW*kH*nOutputPlane) x (inputHeight*inputWidth)
fgradInput = outputHeight x outputWidth -- this one is my guess; I did not see any directly related code in the source, though I may simply have missed it.
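
The same kind of sketch for nn.SpatialFullConvolution, again in plain Lua with made-up helper names. The layer configuration used here (4 input planes, 3 output planes, 4x4 kernel, stride 2, padding 1, 16x16 input) is a hypothetical example and not taken from any printout. The output size assumes the rule (in - 1)*stride - 2*pad + k + adj, and the fgradInput value inherits the guess stated above.

```lua
-- Plain Lua sketch; helper names and the layer configuration are hypothetical.

-- Output size of a full (transposed) convolution along one dimension,
-- assuming (in - 1)*stride - 2*pad + k + adj.
local function fullConvOutputSize(inSize, k, stride, pad, adj)
  return (inSize - 1) * stride - 2 * pad + k + (adj or 0)
end

-- finput shape: (kW*kH*nOutputPlane) x (inputHeight*inputWidth)
local function fullConvFinputSize(nOutputPlane, inH, inW, kH, kW)
  return kW * kH * nOutputPlane, inH * inW
end

local nOutputPlane, inH, inW = 3, 16, 16
local kH, kW, dH, dW, padH, padW = 4, 4, 2, 2, 1, 1

local rows, cols = fullConvFinputSize(nOutputPlane, inH, inW, kH, kW)
local outH = fullConvOutputSize(inH, kH, dH, padH)
local outW = fullConvOutputSize(inW, kW, dW, padW)

print(rows, cols)  -- finput: 48  256
print(outH, outW)  -- guessed fgradInput: 32  32
```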

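The sizes of the other Tensors can be worked out by comparing against Torch's printout and applying basic knowledge of convolutional neural networks.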
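
To turn the printed sizes into bytes, one option is to sum the element counts and multiply by the element width. The sketch below is plain Lua with illustrative helper names (numel, layerBytes). The sizes are copied from the two printouts above, and it assumes 4 bytes per element, since both FloatTensor and CudaTensor hold 32-bit floats.

```lua
-- Plain Lua sketch; numel and layerBytes are illustrative names.

-- Number of elements in a tensor of the given size.
local function numel(size)
  local n = 1
  for _, d in ipairs(size) do n = n * d end
  return n
end

-- Total bytes of all tensors in one layer.
local function layerBytes(tensors, bytesPerElement)
  local total = 0
  for _, size in pairs(tensors) do total = total + numel(size) end
  return total * bytesPerElement
end

-- Sizes copied from the CPU printout (batch size 16).
local cpu = {
  output = {16, 4, 16, 16}, gradInput = {16, 3, 32, 32},
  finput = {16, 48, 256},   fgradInput = {16, 48, 256},
  weight = {4, 3, 4, 4},    gradWeight = {4, 3, 4, 4},
  bias = {4},               gradBias = {4},
}

-- Sizes copied from the GPU printout.
local gpu = {
  output = {16, 4, 16, 16}, gradInput = {16, 3, 32, 32},
  finput = {48, 256},       fgradInput = {16, 16},
  weight = {4, 3, 4, 4},    gradWeight = {4, 3, 4, 4},
  bias = {4},               gradBias = {4},
}

local cpuBytes = layerBytes(cpu, 4)
local gpuBytes = layerBytes(gpu, 4)
print(cpuBytes)  -- 1836576 (about 1.75 MB)
print(gpuBytes)  -- 313888 (about 0.3 MB)
```

For this single layer the CPU estimate is almost six times the GPU estimate, and nearly all of the difference comes from finput and fgradInput.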

Actual GPU memory usage

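Each Tensor above is only a reference to a block of memory, and several Tensors may reuse the same block. For temporary buffers in particular, reuse is inevitable. The GPU memory usage we calculate is therefore an upper bound on the real value, and the gap between the two becomes quite noticeable for large networks. For example, a network estimated at 9MB actually consumed about 7MB of GPU memory, and one estimated at 69MB consumed about 30~50MB.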

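In addition, CUDA loads a lot of other things at runtime, so in torch a large amount of GPU memory is consumed when the first CudaTensor is created, for example an extra 100~200MB. After that, each newly created Tensor increases GPU memory usage by its actual size.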

Sharing finput and fgradInput across layers

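An expert on the torch7 mailing list recommended the following code, which lets all layers in the network reuse the same temporary buffers: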

https://groups.google.com/forum/#!topic/torch7/BmP_RJ-yxlU
@Thomas you could share all the temporary buffers this way:

local finput, fgradInput

model:apply(function(m) if torch.type(m) == 'nn.SpatialConvolution' or torch.type(m) == 'nn.SpatialConvolutionMM' then 
         finput = finput or m.finput
         fgradInput = fgradInput or m.fgradInput
         m.finput = finput
         m.fgradInput = fgradInput
    end
 end)

This will share the temporary buffer among all convolution layers in your network.

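The code above reportedly cannot be used in training mode. I also found code with similar functionality in the ImageNet training code provided for torch7, but it was commented out, so the claim seems quite plausible.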
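
To get a rough feel for how much the shared-buffer approach could save, the sketch below compares per-layer finput buffers with a single shared one. It is plain Lua, and the three-layer network in it is hypothetical. It also assumes that a shared buffer only needs to be as large as the largest finput any layer requests, which is an assumption about Torch's resizing behaviour and not something verified here.

```lua
-- Plain Lua sketch; the layer list is a hypothetical example network.
-- Each entry: input planes, kernel size, output height and width.
local layers = {
  {nInputPlane = 3, k = 4, outH = 16, outW = 16},
  {nInputPlane = 4, k = 4, outH = 8,  outW = 8},
  {nInputPlane = 8, k = 4, outH = 4,  outW = 4},
}

-- Element count of one layer's finput (GPU layout, no batch dimension):
-- (kW*kH*nInputPlane) x (outputHeight*outputWidth)
local function finputElements(layer)
  return layer.k * layer.k * layer.nInputPlane * layer.outH * layer.outW
end

local separate, shared = 0, 0
for _, layer in ipairs(layers) do
  local n = finputElements(layer)
  separate = separate + n        -- every layer owns its own buffer
  shared = math.max(shared, n)   -- one buffer sized for the largest layer
end

print(separate * 4)  -- bytes with separate buffers: 73728
print(shared * 4)    -- bytes with one shared buffer (assumed): 49152
```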