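Reading notes on the paper and code of "Learning Deep Representations of Fine-Grained Visual Descriptions"

I've been reading this paper lately, so I'm jotting down some notes along the way.

The paper is titled "Learning Deep Representations of Fine-Grained Visual Descriptions" (link), and the code is on GitHub (link). It builds its neural networks with the Torch framework (I find the docs a bit messy, with plenty of pitfalls to step in, though I think Caffe has even more). Torch is mainly used from Lua, which is fairly similar to Python and easy enough to pick up.

What these people did

They trained a joint image-text embedding model that, given a sentence you provide, can retrieve the images matching it. Training needs class labels (the $y_n$ in the objective below) but no hand-annotated attributes.

Training objective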

$\frac{1}{N} \sum_{n=1}^{N}\Delta(y_n, f_v(v_n)) + \Delta(y_n, f_t(t_n))$

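Visual information $v \in V$ (this is just a definition; in plain words, a single image $v$ belongs to the image set $V$), text descriptions $t \in T$, and class labels $y \in Y$. The functions to learn (the model trained later) are $f_v : V \rightarrow Y$ and $f_t : T \rightarrow Y$. Here $N$ is the number of image-text pairs in the dataset, so one image can appear with several different text descriptions.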
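For reference, here is how I understand the paper to define the two classifiers. The encoders $\theta$ (image) and $\varphi$ (text), the compatibility score $F$, and the per-class sets $T(y)$ and $V(y)$ are symbols from the paper rather than from these notes, so check them against the original before relying on them:

```latex
F(v, t) = \theta(v)^{\top} \varphi(t)

f_v(v) = \arg\max_{y \in Y} \; \mathbb{E}_{t \sim T(y)} \left[ F(v, t) \right]

f_t(t) = \arg\max_{y \in Y} \; \mathbb{E}_{v \sim V(y)} \left[ F(v, t) \right]
```

In words: an image is assigned the class whose descriptions score highest against it on average, and a description is assigned the class whose images score highest against it on average.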

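$\Delta : Y \times Y \rightarrow \mathbb{R}$ is the 0-1 loss: it is $0$ when the predicted label is correct and $1$ otherwise. Once this loss has been driven down to an acceptable level, the training goal is reached. The formula above is DS-SJE (deep symmetric structured joint embedding); if only $f_v$ is optimized, it becomes DA-SJE (deep asymmetric structured joint embedding). (Optimizing only the other one is also possible, but the authors say they haven't seen anyone do that.)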
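To make the objective concrete, here is a minimal sketch in plain Lua that evaluates it on toy data. The helper names `delta` and `ds_sje_objective` and the label lists are made up for illustration, and the predictions are given directly instead of coming from a trained $f_v$ and $f_t$:

```lua
-- 0-1 loss: 0 when the predicted label matches the true one, 1 otherwise.
local function delta(y_true, y_pred)
    return (y_true == y_pred) and 0 or 1
end

-- Hypothetical helper: average the symmetric loss over N image-text pairs.
-- img_preds[i] stands in for f_v(v_i), txt_preds[i] for f_t(t_i).
local function ds_sje_objective(labels, img_preds, txt_preds)
    local n = #labels
    local total = 0
    for i = 1, n do
        total = total + delta(labels[i], img_preds[i])
                      + delta(labels[i], txt_preds[i])
    end
    return total / n
end

-- Toy data: 4 pairs, with 1 image mistake and 2 text mistakes.
local labels    = {1, 2, 3, 1}
local img_preds = {1, 2, 1, 1}
local txt_preds = {1, 3, 3, 2}

print(ds_sje_objective(labels, img_preds, txt_preds))  -- 0.75
```

Dropping the second `delta` term inside the loop gives the asymmetric (DA-SJE) variant of the same toy computation.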

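I won't repeat the finer details here; see the links under References below.

Advantages of the model

There is no need to hand-annotate image attributes. Training directly on images and their corresponding text matches (or even beats) models trained on datasets with hand-annotated attributes. That makes the model more widely applicable, since attribute-annotated datasets are scarce, laborious to build, and inconvenient to use.

What each file and folder is for

Training and evaluation use the train* and eval* scripts under the scripts folder, respectively. Running a training script creates a cv folder under scripts, which holds the trained models (.t7 format; the model sits in the "protos" field of the checkpoint).

After training and then running an eval* script, a results folder is created automatically in the main code folder. It contains several .txt files that record the accuracy of the trained model.

The images folder of the original dataset holds images already converted to .t7 (already preprocessed; feature length 1024, tensor size 60*1024*10). The paper says the same: the image features are 1024-dimensional CNN outputs, so 1024 is the length of each feature vector fed to the image encoder. This is the image dataset.

text_c10 holds the text descriptions (what is mainly needed are the .t7 files directly in that directory; I don't know how to use the files inside its subfolders, which are what the .t7 files were generated from). This is the dataset used for char-level training.

The data in the bow_c10 folder has size 60*5726*101 (it has already been processed by a bag-of-words encoder).

The data in word_c10 has size 60*30*10. According to the paper, the word-level analysis uses 30 as the input length, so this is probably the data for DS-SJE.

The data in w2v_c10 has size 60*400*10.

Note: word2vec and BoW are two encoders that can be used to improve zero-shot accuracy (according to the paper).

To read vocab_c10.t7 from the dataset, first load a few packages: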

require 'optim'; require 'torch'; require 'nn'; require 'nngraph'; require 'lfs';

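Probably not all of these are needed; I loaded this many to be safe (and to avoid loading more later). For just reading the contents, I think optim plus a couple of the others should be enough, but I haven't tried. The file stores a number for each word, so you can think of it as a lookup table that turns words into numbers. If a smaller subset turns out not to work, just load all of the packages listed above.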
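As a rough illustration of how such a word-to-number table can be used, here is a sketch in plain Lua. The toy `vocab` table, the `encode` helper, and the choice of 0 for unknown words are all assumptions for the example; the real table would come from loading vocab_c10.t7, and I have not checked how the original code handles unknown words:

```lua
-- Toy stand-in for the table stored in vocab_c10.t7 (assumed: word -> index).
local vocab = { the = 1, bird = 2, has = 3, red = 4, wings = 5 }

-- Hypothetical helper: split a sentence on whitespace and look up each word.
-- Words missing from the table get unknown_ix.
local function encode(sentence, word_table, unknown_ix)
    local ids = {}
    for word in sentence:gmatch("%S+") do
        ids[#ids + 1] = word_table[word:lower()] or unknown_ix
    end
    return ids
end

local ids = encode("The bird has red wings", vocab, 0)
print(table.concat(ids, " "))  -- 1 2 3 4 5
```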

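Converting a Table to JSON

Same here, a package is needed:

local json = require 'cjson'
local mj = json.encode(m) -- m is a Lua table
-- write the JSON string as plain text (torch.save would write Torch's own binary format)
local f = assert(io.open('xxx.json', 'w'))
f:write(mj)
f:close()

My first version saved the string with torch.save('xxx.json', mj), and the resulting file was not standard JSON, because torch.save writes Torch's own serialization format; writing the encoded string with io.open as above gives a plain JSON file. None of this really matters: I only dumped the contents of vocab.t7 to JSON to make them easier to look at, and nothing later depends on it.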

使用原代码里面封装好了的函数提取语义/图片向量

这几个函数是在retrieval_sje_tcnn.lua文件里面找到的：

function extract_img(filename)
    local fea = torch.load(filename)[{{},{},1}]
    fea = fea:float():cuda()
    local out = protos.enc_image:forward(fea):clone()
    return out:cuda()
end

function extract_txt(filename)
    if opt.ttype == 'word' then
        return extract_txt_word(filename)
    else -- char
        return extract_txt_char(filename)
    end
end

function extract_txt_word(filename)
    -- average all text features together.
    local txt = torch.load(filename):permute(1,3,2)
    txt = txt:reshape(txt:size(1)*txt:size(2),txt:size(3)):float():cuda()
    if opt.txt_limit > 0 then
        local actual_limit = math.min(txt:size(1), opt.txt_limit)
        txt_order = torch.randperm(txt:size(1)):sub(1,actual_limit)
        local tmp = txt:clone()
        for i = 1,actual_limit do
            txt[{i,{}}]:copy(tmp[{txt_order[i],{}}])
        end
        txt = txt:narrow(1,1,actual_limit)
    end

    if (model.opt.num_repl ~= nil) then
        tmp = txt:clone()
        txt = torch.ones(txt:size(1),model.opt.num_repl*txt:size(2))
        for i = 1,txt:size(1) do
            local cur_sen = torch.squeeze(tmp[{i,{}}]):clone()
            local cur_len = cur_sen:size(1) - cur_sen:eq(1):sum()
            local txt_ix = 1
            for j = 1,cur_len do
                for k = 1,model.opt.num_repl do
                    txt[{i,txt_ix}] = cur_sen[j]
                    txt_ix = txt_ix + 1
                end
            end
        end
    end

    local txt_mat = torch.zeros(txt:size(1), txt:size(2), vocab_size+1)
    for i = 1,txt:size(1) do
        for j = 1,txt:size(2) do
            local on_ix = txt[{i, j}]
            if on_ix == 0 then
                break
            end
            txt_mat[{i, j, on_ix}] = 1
        end
    end
    txt_mat = txt_mat:float():cuda()
    local out = protos.enc_doc:forward(txt_mat):clone()
    out = torch.mean(out,1)
    return out
end

function extract_txt_char(filename)
    -- average all text features together.
    local txt = torch.load(filename):permute(1,3,2)
    txt = txt:reshape(txt:size(1)*txt:size(2),txt:size(3)):float():cuda()
    if opt.txt_limit > 0 then
        local actual_limit = math.min(txt:size(1), opt.txt_limit)
        txt_order = torch.randperm(txt:size(1)):sub(1,actual_limit)
        local tmp = txt:clone()
        for i = 1,actual_limit do
            txt[{i,{}}]:copy(tmp[{txt_order[i],{}}])
        end
        txt = txt:narrow(1,1,actual_limit)
    end
    local txt_mat = torch.zeros(txt:size(1), txt:size(2), #alphabet)
    for i = 1,txt:size(1) do
        for j = 1,txt:size(2) do
            local on_ix = txt[{i, j}]
            if on_ix == 0 then
                break
            end
            txt_mat[{i, j, on_ix}] = 1
        end
    end
    txt_mat = txt_mat:float():cuda()
    local out = protos.enc_doc:forward(txt_mat):clone()
    return torch.mean(out,1)
end
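The core of both text functions is the loop that turns a row of indices into a one-hot matrix and stops at the first 0. Here is a sketch of that step with plain Lua tables instead of tensors. The tiny alphabet, the helper names `encode_chars` and `one_hot`, and my reading of 0 as padding are assumptions for illustration only:

```lua
-- Tiny toy alphabet; the real script uses a much longer string.
local alphabet = "abc "
local dict = {}
for i = 1, #alphabet do
    dict[alphabet:sub(i, i)] = i
end

-- Hypothetical helper: map each character to its index, padding with 0
-- up to doc_length. Unknown characters also become 0 in this toy version,
-- which would cut the text short at that point in one_hot below.
local function encode_chars(text, doc_length)
    local ids = {}
    for j = 1, doc_length do
        ids[j] = dict[text:sub(j, j)] or 0
    end
    return ids
end

-- Mirrors the inner loop of extract_txt_char for a single row:
-- start from all zeros, set one 1 per position, stop at the first 0.
local function one_hot(indices, width)
    local mat = {}
    for j = 1, #indices do
        mat[j] = {}
        for k = 1, width do mat[j][k] = 0 end
    end
    for j = 1, #indices do
        local ix = indices[j]
        if ix == 0 then break end  -- remaining rows stay all-zero
        mat[j][ix] = 1
    end
    return mat
end

local ids = encode_chars("ba", 4)   -- {2, 1, 0, 0}
local mat = one_hot(ids, #alphabet)
for j = 1, #mat do
    print(table.concat(mat[j], " "))
end
```

The width of each row equals the alphabet size, which is the frame size the first convolution layer of a char-level model sees.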

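You can use the matching function, together with the directory-traversal method shown further below, to run over every file in a dataset folder (images, word_c10, or text_c10). Here is my code:

--[[ Script used to run this file
mkdir -p cub_image_vec
mkdir -p cub_word_c10
mkdir -p cub_text_c10
th extract_model.lua \
  -data_dir <path to the dataset> \
  -num_caption 10 \
  -model cv/lm_sje_cub_c10_hybrid_0.00070_1_10_trainvalids.txt.t7 \
  -ttype char
  # For now this only works at char level; word level does not run yet
]]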

require('nn')
require('nngraph')
require('cutorch')
require('cunn')
require('cudnn')
require('lfs')

local matio = require('matio')  -- package for saving data in .mat format
local model_utils = require('util.model_utils')

local alphabet = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{} "
local dict = {}
for i = 1,#alphabet do
    dict[alphabet:sub(i,i)] = i
end

-------------------------------------------------
cmd = torch.CmdLine()
cmd:option('-data_dir','data','data directory.')
cmd:option('-savefile','sje_tcnn','filename to autosave the checkpoint to. Will be inside checkpoint_dir/')
cmd:option('-checkpoint_dir', 'cv', 'output directory where checkpoints get written')
cmd:option('-symmetric',1,'symmetric sje')
cmd:option('-learning_rate',0.0001,'learning rate')
cmd:option('-testclasses', 'testclasses.txt', 'validation or test classes to be used in evaluation')
cmd:option('-ids_file', 'trainvalids.txt', 'file specifying which class labels were used for training.')
cmd:option('-model','','model to load. If blank then above options will be used.')
cmd:option('-txt_limit',0,'if 0 then use all available text. Otherwise limit the number of documents per class')
cmd:option('-num_caption',10,'number of captions per image to be used for training')
cmd:option('-outfile', 'results/roc.csv', 'output csv file with ROC curves.')
cmd:option('-ttype','char','word|char')

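opt = cmd:parse(arg)    -- parse the command-line arguments into opt; the available options are listed above
local model
-- Load the model: use opt.model if it is set, otherwise build the checkpoint path from the options above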
if opt.model ~= '' then
    model = torch.load(opt.model)
else
    model = torch.load(string.format('%s/lm_%s_%.5f_%.0f_%.0f_%s.t7', opt.checkpoint_dir, opt.savefile, opt.learning_rate, opt.symmetric, opt.num_caption, opt.ids_file))
end
-----------------------------------------------------------

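local doc_length = model.opt.doc_length     -- text length stored in the model's options
local protos = model.protos     -- the networks themselves live in model.protos
protos.enc_doc:evaluate()   -- nn call that sets train = false, which changes how some modules behave
protos.enc_image:evaluate()
--print(model.opt)  -- debugging aid

--[[
Function: extract_img
Input: file name (may be a path)
Output: a tensor of size 60*1024
]]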
function extract_img(filename)
    local fea = torch.load(filename)[{{},{},1}]
    fea = fea:float():cuda()
    local out = protos.enc_image:forward(fea):clone()
    return out:cuda()
end

function extract_txt(filename)
    if opt.ttype == 'word' then
        return extract_txt_word(filename)
    else -- char
        return extract_txt_char(filename)
    end
end


function extract_txt_word(filename)
    -- average all text features together.
    local txt = torch.load(filename):permute(1,3,2)
    txt = txt:reshape(txt:size(1)*txt:size(2),txt:size(3)):float():cuda()
    if opt.txt_limit > 0 then
        local actual_limit = math.min(txt:size(1), opt.txt_limit)
        txt_order = torch.randperm(txt:size(1)):sub(1,actual_limit)
        local tmp = txt:clone()
        for i = 1,actual_limit do
            txt[{i,{}}]:copy(tmp[{txt_order[i],{}}])
        end
        txt = txt:narrow(1,1,actual_limit)
    end

    if (model.opt.num_repl ~= nil) then
        tmp = txt:clone()
        txt = torch.ones(txt:size(1),model.opt.num_repl*txt:size(2))
        for i = 1,txt:size(1) do
            local cur_sen = torch.squeeze(tmp[{i,{}}]):clone()
            local cur_len = cur_sen:size(1) - cur_sen:eq(1):sum()
            local txt_ix = 1
            for j = 1,cur_len do
                for k = 1,model.opt.num_repl do
                    txt[{i,txt_ix}] = cur_sen[j]
                    txt_ix = txt_ix + 1
                end
            end
        end
    end

    local txt_mat = torch.zeros(txt:size(1), txt:size(2), vocab_size+1)
    for i = 1,txt:size(1) do
        for j = 1,txt:size(2) do
            local on_ix = txt[{i, j}]
            if on_ix == 0 then
                break
            end
            txt_mat[{i, j, on_ix}] = 1
        end
    end
    txt_mat = txt_mat:float():cuda()
    local out = protos.enc_doc:forward(txt_mat):clone()
    out = torch.mean(out,1)
    return out
end


function extract_txt_char(filename)
    -- average all text features together.
    local txt = torch.load(filename):permute(1,3,2)
    txt = txt:reshape(txt:size(1)*txt:size(2),txt:size(3)):float():cuda()
    if opt.txt_limit > 0 then
        local actual_limit = math.min(txt:size(1), opt.txt_limit)
        txt_order = torch.randperm(txt:size(1)):sub(1,actual_limit)
        local tmp = txt:clone()
        for i = 1,actual_limit do
            txt[{i,{}}]:copy(tmp[{txt_order[i],{}}])
        end
        txt = txt:narrow(1,1,actual_limit)
    end
    local txt_mat = torch.zeros(txt:size(1), txt:size(2), #alphabet)
    for i = 1,txt:size(1) do
        for j = 1,txt:size(2) do
            local on_ix = txt[{i, j}]
            if on_ix == 0 then
                break
            end
            txt_mat[{i, j, on_ix}] = 1
        end
    end
    txt_mat = txt_mat:float():cuda()
    local out = protos.enc_doc:forward(txt_mat):clone()
    return torch.mean(out,1)
end


--[[
    Function: files_names
    Purpose: collect the names of the files (not folders) directly under the given path
    Input: path
    Output: a table holding the names of all files directly under that path
    Note: to include files inside subfolders as well, add the recursion yourself
]]
function files_names(path)
    -- collect the names of all files under path, skipping folders
    local j = 1
    local file_name = {}
    for file in lfs.dir(path) do
        if file ~= "." and file ~= ".." then
            local f = path..'/'.. file
            local attr = lfs.attributes(f)
            if attr.mode ~= "directory" then
                file_name[j] = file
                j = j + 1
            end
        end
    end
    return file_name
end


function extract_vec(filename, mode)
    if mode == 'image' then
        return extract_img(filename)
    elseif mode == 'word' then
        return extract_txt_word(filename)
    else -- char
        return extract_txt_char(filename)
    end
end


function save_vectors(path, mode)
    -- extract the embedding vectors and save them
    local dataset_size = 200    -- number of files to process (CUB has 200 classes)
    local extracted_vectors = {}
    local vectors = {}
    local work_dir
    -- choose the dataset folder, since each kind of vector comes from a different one
    if mode == "image" then
        work_dir = string.format('%s/images', path)    -- folder holding the image features
    elseif mode == "word" then
        work_dir = string.format('%s/word_c%d', path, opt.num_caption)
        vocab_size = 0    -- global on purpose: extract_txt_word reads it
        if model.vocab == nil then
            -- the model was not trained at word level: report it and stop
            error("Model is not trained upon word level! Stopping extraction process.")
        end
        for k,v in pairs(model.vocab) do
            vocab_size = vocab_size + 1
        end
    else
        work_dir = string.format('%s/text_c%d', path, opt.num_caption)
    end
    local vec_names = files_names(work_dir)
    -- extract the vectors from the files in work_dir
    for i = 1, dataset_size do
        local work_path = work_dir..'/'..vec_names[i]
        vectors[i] = extract_vec(work_path, mode)
    end
    -- pack the vectors and their names into one table
--    extracted_vectors['vec_names'] = vec_names
--    extracted_vectors['vectors'] = vectors
    -- save the vectors
    if mode == "image" then
        for i = 1, dataset_size do
            vectors[i] = vectors[i]:float()
            matio.save(string.format('cub_image_vec/%s.mat', vec_names[i]),vectors[i])
        end
    elseif mode == "word" then
        for i = 1, dataset_size do
            vectors[i] = vectors[i]:float()
            matio.save(string.format('cub_word_c%d/%s.mat', opt.num_caption, vec_names[i]),vectors[i])
        end
    else
        for i = 1, dataset_size do
            vectors[i] = vectors[i]:float()
            matio.save(string.format('cub_text_c%d/%s.mat', opt.num_caption, vec_names[i]),vectors[i])
        end
    end

    -- return extracted_vectors
end

save_vectors(opt.data_dir, 'image')
if opt.ttype == 'char' then
    save_vectors(opt.data_dir, 'char')
else
    save_vectors(opt.data_dir, 'word')    -- only works if the model itself was trained at word level
end

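In the end I saved the extracted vectors as .mat files. You can switch to another format by editing the saving code at the bottom of the save_vectors function.

Traversing all files under a directory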

require"lfs"

function attrdir(path)
  for file in lfs.dir(path) do
    -- skip the "." and ".." entries; without this check the recursion below never ends
    if file ~= "." and file ~= ".." then
      local f = path.. '/' ..file
      local attr = lfs.attributes (f)
      if attr.mode == "directory" then
          print(f .. "  -->  " .. attr.mode)
          -- it is a directory, so recurse into it
          attrdir(f)
      else
          print(f .. "  -->  " .. attr.mode)
      end
    end
  end
end

attrdir(".")  -- traverse ".", i.e. the current directory

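The code above comes from this article; see there for more detailed usage.

Problem when extracting embeddings at word level

At first, calling the word-level extraction function failed immediately. Looking at the error, the loaded model had no vocab field, so its value could not be read (or rather, reading it gave nil). I then decided to set the model's vocab field by hand to the contents of vocab.t7 or vocab_c10.t7 in the dataset directory (which one depends on what the dataset ships with), and got the error below.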
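The traversal above returns every file it finds. If only the .t7 files are wanted, one option is to filter the collected names afterwards. This is a sketch in plain Lua; `filter_by_ext` is a hypothetical helper and the file names are made up for the example:

```lua
-- Hypothetical helper: keep only the names that end in ".<ext>".
-- ext is assumed to contain no Lua pattern magic characters.
local function filter_by_ext(names, ext)
    local kept = {}
    local pattern = "%." .. ext .. "$"
    for _, name in ipairs(names) do
        if name:match(pattern) then
            kept[#kept + 1] = name
        end
    end
    return kept
end

-- Made-up directory listing for illustration.
local names = {
    "001.Black_footed_Albatross.t7",
    "notes.txt",
    "002.Laysan_Albatross.t7",
}

local t7_files = filter_by_ext(names, "t7")
for _, name in ipairs(t7_files) do
    print(name)
end
```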

Found Environment variable CUDNN_PATH = /usr/local/cuda/lib64/libcudnn.so.7
/home/ubuntu/torch/install/bin/luajit: /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
/home/ubuntu/torch/install/share/lua/5.1/nn/THNN.lua:110: bad argument #2 to 'v' (invalid input frame size. Got: 5726, Expected: 70 at /tmp/luarocks_cunn-scm-1-6806/cunn/lib/THCUNN/generic/TemporalConvolution.cu:30)
stack traceback:
        [C]: in function 'v'
        /home/ubuntu/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'TemporalConvolution_updateOutput'
        ...u/torch/install/share/lua/5.1/nn/TemporalConvolution.lua:41: in function <...u/torch/install/share/lua/5.1/nn/TemporalConvolution.lua:40>
        [C]: in function 'xpcall'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'func'
        ...e/ubuntu/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
        ...e/ubuntu/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'forward'
        ../extract_model.lua:157: in function 'extract_vec'
        ../extract_model.lua:264: in function 'save_vectors'
        ../extract_model.lua:293: in main chunk
        [C]: in function 'dofile'
        ...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x00405d50

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
        [C]: in function 'error'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'func'
        ...e/ubuntu/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
        ...e/ubuntu/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'forward'
        ../extract_model.lua:157: in function 'extract_vec'
        ../extract_model.lua:264: in function 'save_vectors'
        ../extract_model.lua:293: in main chunk
        [C]: in function 'dofile'
        ...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x00405d50

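This problem is still unsolved. Reading the message, though, the first TemporalConvolution layer expects frames of size 70, which is exactly the length of the alphabet string in the script above, while the word-level one-hot input is 5726 wide. So the checkpoint being loaded looks like a char-level model, and swapping in a vocab does not change what its first layer expects. My guess is that a model actually trained at word level is needed.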
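A quick check on where the "Expected: 70" in the error might come from: count the characters in the alphabet string used by the extraction script. This is plain Lua with no Torch needed; the two frame sizes are copied from the error message, and linking them to the alphabet is my interpretation, not something the log states:

```lua
-- Alphabet string copied from the extraction script above.
local alphabet = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{} "

local expected_frame = 70    -- "Expected" value in the error message
local got_frame = 5726       -- "Got" value in the error message

print(#alphabet)                    -- 70
print(#alphabet == expected_frame)  -- true
print(got_frame == expected_frame)  -- false
```

If this reading is right, the loaded model's first layer was built for one-hot character frames, so it cannot accept word-level one-hot frames of width 5726.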

References

[Paper reading] Learning Deep Representations of Fine-Grained Visual Descriptions
Learning Deep Representations of Fine-Grained Visual Descriptions