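I have recently been looking into converting various datasets into the native data formats of different AI frameworks, including Caffe, MXNet, and TensorFlow. C++ is a general-purpose and powerful language, yet C++ developers face a pain point in the AI era: most current AI frameworks are implemented in C/C++ underneath, but the interfaces they expose are mostly Python, and those Python interfaces are very well wrapped. MXNet is comparatively friendly, since it ships C/C++ source such as im2rec.cc. With Caffe, and especially with TensorFlow, converting data in C++ takes some effort. This article therefore starts with dataset format conversion for TensorFlow.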
1. What does each framework's native data look like?
MXNet's native dataset format: rec
Caffe's native dataset format: LMDB
TensorFlow's native dataset format: TFRecord
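2. What is the TFRecord format?
For reading data in TensorFlow, the official site describes three methods:
1) Feeding: Python code supplies the data at every step of the running TensorFlow program.
2) Reading from files: an input pipeline reads the data from files.
3) Preloaded data: if the dataset is small, constants or variables defined in the program hold all of the data.
TFRecord is the standard format officially recommended by TensorFlow. A TFRecord data file is a binary file that stores image data and labels together. It makes better use of memory and can be copied, moved, read, and stored quickly in TensorFlow.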
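To make "binary file" a little more concrete: an uncompressed TFRecord file is a sequence of framed records, each consisting of an 8-byte little-endian length, a 4-byte masked CRC32C of that length, the payload bytes, and a 4-byte masked CRC32C of the payload. The sketch below reproduces that framing with the standard library only. The function names are hypothetical and this is not TensorFlow's RecordWriter implementation; when a writer is configured with ZLIB compression, the framed stream is additionally compressed.

```cpp
#include <cstdint>
#include <string>

// CRC32C (Castagnoli polynomial), bit-by-bit variant: slow but short.
uint32_t crc32c(const std::string& data) {
  uint32_t crc = 0xFFFFFFFFu;
  for (unsigned char byte : data) {
    crc ^= byte;
    for (int k = 0; k < 8; ++k) {
      // Shift right; XOR in the reflected polynomial when the low bit was set.
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return crc ^ 0xFFFFFFFFu;
}

// The TFRecord format stores a rotated and offset ("masked") CRC.
uint32_t masked_crc32c(const std::string& data) {
  const uint32_t crc = crc32c(data);
  return ((crc >> 15) | (crc << 17)) + 0xa282ead8u;
}

// Appends the low num_bytes bytes of value in little-endian order.
void append_little_endian(std::string* out, uint64_t value, int num_bytes) {
  for (int i = 0; i < num_bytes; ++i) {
    out->push_back(static_cast<char>((value >> (8 * i)) & 0xFF));
  }
}

// Frames one payload: length, masked CRC of length, payload, masked CRC of payload.
std::string frame_record(const std::string& payload) {
  std::string header;
  append_little_endian(&header, payload.size(), 8);
  std::string out = header;
  append_little_endian(&out, masked_crc32c(header), 4);
  out += payload;
  append_little_endian(&out, masked_crc32c(payload), 4);
  return out;
}
```

Each framed record is therefore 16 bytes longer than its payload, and in this article the payload is a serialized Example message.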
The content of each record is defined by an example.proto file:
syntax = "proto3";

message Example {
  Features features = 1;
}

message Features {
  map<string, Feature> feature = 1;
}

// Containers to hold repeated fundamental values.
message BytesList {
  repeated bytes value = 1;
}
message FloatList {
  repeated float value = 1 [packed = true];
}
message Int64List {
  repeated int64 value = 1 [packed = true];
}

message Feature {
  oneof kind {
    BytesList bytes_list = 1;
    FloatList float_list = 2;
    Int64List int64_list = 3;
  }
}
This is a protobuf 3 definition. Use the following command to generate the header example.pb.h and the source file example.pb.cc from it:
protoc -I=. --cpp_out=./ example.proto
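As a rough illustration of what the `[packed = true]` option on Int64List means on the wire: all values of the repeated field are written as base-128 varints inside a single length-delimited field, instead of one tagged field per value. The sketch below hand-encodes such a message with the standard library only. The function names are hypothetical; real code should use the classes generated by protoc rather than this encoder.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Appends value as a protobuf base-128 varint (7 bits per byte, low bits first).
void append_varint(std::string* out, uint64_t value) {
  while (value >= 0x80) {
    out->push_back(static_cast<char>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out->push_back(static_cast<char>(value));
}

// Encodes an Int64List { repeated int64 value = 1 [packed = true]; } message.
std::string encode_packed_int64_list(const std::vector<int64_t>& values) {
  std::string payload;
  for (int64_t v : values) {
    // Negative values are encoded as their 64-bit two's complement (10 bytes).
    append_varint(&payload, static_cast<uint64_t>(v));
  }
  std::string out;
  if (payload.empty()) return out;  // An empty repeated field is not written.
  out.push_back(static_cast<char>((1 << 3) | 2));  // Field 1, wire type 2.
  append_varint(&out, payload.size());
  out += payload;
  return out;
}
```

For the values 1 and 300 this produces the five bytes 0A 03 01 AC 02: one tag, one length, and three bytes of varint data.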
3. What should your own dataset look like?
VOC2007 is used here as the example for the detection task, and LFW as the example for the classification task.
For classification, build a single list for the whole dataset in the form below. It can be built the same way as Caffe's classification list, except that the file name and the label are separated by a tab (\t), not a space:
/output/oldFile/1000015_10/wKgB5Fr6WwWAJb7iAAABKohu5Nw109.png	0
/output/oldFile/1000015_10/wKgB5Fr6WwWAEbg6AAABC_mxdD8880.png	0
/output/oldFile/1000015_10/wKgB5Fr6WwWAUGTdAAAA8wVERrQ677.png	0
/output/oldFile/1000015_10/wKgB5Fr6WwWAPJ-lAAABPYAoeuY242.png	0
/output/oldFile/1000015_10/wKgB5Fr6WwWARVIWAAABCK2alGs331.png	0
/output/oldFile/1000015_10/wKgB5Fr6WwWAV3R5AAAA5573dko147.png	0
/output/oldFile/1000015_10/wKgB5Fr6WwaAUjQRAAABIkYxqoY008.png	0
...
/output/oldFile/1000015_10/wKgB5Vr6YF-AALG-AAAA-qStI_Q208.png	1
/output/oldFile/1000015_10/wKgB5Vr6YGCAe1VYAAABN5fz53Y240.png	1
/output/oldFile/1000015_10/wKgB5Vr6YGCAQo7fAAABVFasXJ4223.png	1
/output/oldFile/1000015_10/wKgB5Vr6YGCAL00yAAABJdrU4U0508.png	1
/output/oldFile/1000015_10/wKgB5Vr6YGCAFjTyAAABJVgoCrU242.png	1
/output/oldFile/1000015_10/wKgB5Vr6YGCAKmMMAAABMd1_pJg240.png	1
/output/oldFile/1000015_10/wKgB5Vr6YGCAR2FqAAABFCQ7LRY651.png	1
For the VOC2007 dataset, each line of the list holds an image path and the path of its XML annotation file, again separated by a tab (\t) rather than a space:
/home/test/data/VOC2007/JPEGImages/004379.jpg	/home/xbx/data/VOC2007/Annotations/004379.xml
/home/test/data/VOC2007/JPEGImages/001488.jpg	/home/xbx/data/VOC2007/Annotations/001488.xml
/home/test/data/VOC2007/JPEGImages/004105.jpg	/home/xbx/data/VOC2007/Annotations/004105.xml
/home/test/data/VOC2007/JPEGImages/006146.jpg	/home/xbx/data/VOC2007/Annotations/006146.xml
/home/test/data/VOC2007/JPEGImages/004295.jpg	/home/xbx/data/VOC2007/Annotations/004295.xml
/home/test/data/VOC2007/JPEGImages/001360.jpg	/home/xbx/data/VOC2007/Annotations/001360.xml
/home/test/data/VOC2007/JPEGImages/003468.jpg	/home/xbx/data/VOC2007/Annotations/003468.xml
...
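Both lists are split on the tab character when they are read back. The conversion code in this article calls a helper named param_split for that, whose source is not shown, so the function below is only a guess at its behavior under a hypothetical name: it splits a line on a delimiter string and keeps empty fields.

```cpp
#include <string>
#include <vector>

// Splits line on every occurrence of delim; empty fields are kept.
// Hypothetical stand-in for the article's param_split helper.
std::vector<std::string> split_fields(const std::string& line, const std::string& delim) {
  std::vector<std::string> fields;
  if (delim.empty()) {
    fields.push_back(line);
    return fields;
  }
  std::string::size_type start = 0;
  while (true) {
    const std::string::size_type pos = line.find(delim, start);
    if (pos == std::string::npos) {
      fields.push_back(line.substr(start));  // Last field, up to end of line.
      break;
    }
    fields.push_back(line.substr(start, pos - start));
    start = pos + delim.size();
  }
  return fields;
}
```

Splitting on a tab rather than a space means that paths containing spaces survive intact, which is presumably why the lists use \t as the separator.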
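4. What does the conversion flow look like?
With the list ready, we can walk through the conversion flow. For classification, first initialize a RecordWriter, then process the list line by line. Each line holds an image path and its label(s) and maps to one Example. OpenCV reads the image into a Mat, which is converted to a std::string (not a char*, because the image bytes may contain \0) and stored in the Example's feature map under the key image_raw. The image's width, height, and channel count and its label are stored in the same map under the keys width, height, depth, and label. Finally, each Example is serialized to a string with SerializeToString and written with writer_->WriteRecord. The detection task differs only in that it also parses the XML annotation file and stores the bbox information.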
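The remark about \0 can be checked with a few lines of standard C++. This is a small sketch with hypothetical function names: a std::string built from a pointer and an explicit length keeps every byte, whereas anything that treats the same buffer as a C string stops at the first zero byte.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Copies all bytes, including zero bytes, because the length is passed explicitly.
std::string to_byte_string(const std::vector<unsigned char>& pixels) {
  return std::string(reinterpret_cast<const char*>(pixels.data()), pixels.size());
}

// Number of bytes a C-string reader would see: everything before the first zero.
std::size_t length_seen_as_c_string(const std::vector<unsigned char>& pixels) {
  return static_cast<std::size_t>(std::find(pixels.begin(), pixels.end(), 0) - pixels.begin());
}
```

A black pixel in an 8-bit image is a run of zero bytes, so a char*-based copy would silently truncate most real images.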
The required headers are:
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <boost/foreach.hpp>
#include <boost/lexical_cast.hpp>
#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/xml_parser.hpp>
#include <fstream>
#include <iostream>
#include <map>
#include <memory>   // std::unique_ptr
#include <string>
#include <utility>  // std::pair
#include <vector>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include "tensorflow/core/lib/core/status_test_util.h"
#include "tensorflow/core/lib/core/stringpiece.h"
#include "tensorflow/core/lib/io/record_writer.h"
#include "tensorflow/core/platform/env.h"  // Env, WritableFile
#include "example.pb.h"  // generated from example.proto by protoc
#include "rng.hpp"       // shuffle helper borrowed from Caffe
using namespace tensorflow::io;
using namespace tensorflow;
The dispatch in the main function:
if ((dataset_type == "object_detect") && (label_map_file.length() > 0)) {
  // Detection task. datalist_file is the list file, label_map_file is the file
  // that maps label names to label ids, output_dir is the directory the tfrecord
  // files are written to, output_name is the tfrecord file name prefix,
  // samples_pre is the number of records stored per tfrecord file, and Shuffle
  // controls whether the list is shuffled.
  if (!detecteddata_to_tfrecords(datalist_file, label_map_file, output_dir, output_name,
                                 samples_pre, Shuffle)) {
    printf("convert wrong!!!\n");
    return false;
  }
} else if ((dataset_type == "classification") && (label_width > 0)) {
  // Classification task. The parameters are the same as above, except that
  // label_width is the number of labels per line (1 for single-label data,
  // more than 1 for multi-label data).
  if (!clsdata_to_tfrecords(datalist_file, output_dir, output_name, samples_pre, label_width,
                            Shuffle)) {
    printf("convert wrong!!!\n");
    return false;
  }
} else {
  printf(
      "dataset type is not object_detect or classification, or label_width [%zu], label_map_file "
      "[%s] is wrong!!!\n",
      label_width, label_map_file.c_str());
  return false;
}
// Optional: delete all global objects allocated by libprotobuf, which releases
// the protobuf resources used in the functions above.
google::protobuf::ShutdownProtobufLibrary();
For the classification task, the code is as follows:
bool clsdata_to_tfrecords(string datalist_file, string output_dir, string output_name,
                          int samples_pre, size_t label_width, int Shuffle) {
  std::ifstream infile(datalist_file.c_str());
  std::string line;
  std::vector<std::pair<string, std::vector<int> > > dataset;
  // Read the list file and store its contents in dataset.
  while (getline(infile, line)) {
    std::vector<string> tmp_str = param_split(line, "\t");
    std::string filename;
    std::vector<int> label_v;
    if (tmp_str.size() != (label_width + 1)) {
      std::cout << "line [" << line << "] has the wrong number of fields!!!" << std::endl;
      return false;
    }
    // Field 0 is the file name; fields 1..label_width are the labels.
    filename = tmp_str[0];
    for (size_t i = 1; i < (label_width + 1); ++i) {
      try {
        int label = boost::lexical_cast<int>(tmp_str[i]);
        label_v.push_back(label);
      } catch (boost::bad_lexical_cast& e) {
        printf("%s\n", e.what());
        return false;
      }
    }
    if (filename.size() > 0) dataset.push_back(std::make_pair(filename, label_v));
  }
  // Shuffle the dataset; this reuses the code in Caffe's rng.hpp.
  if (Shuffle) {
    printf("tensorflow task will be shuffled!!!\n");
    caffe::shuffle(dataset.begin(), dataset.end());
  }
  printf("A total of %zu images.\n", dataset.size());
  // Create the RecordWriter.
  std::unique_ptr<WritableFile> file;
  RecordWriterOptions options = RecordWriterOptions::CreateRecordWriterOptions("ZLIB");
  RecordWriter* writer_ = NULL;
  int j = 0, fidx = 0;
  size_t line_id = 0;
  for (line_id = 0; line_id < dataset.size(); ++line_id) {
    if (line_id == 0 || j >= samples_pre) {
      // On the first record, or once the current file holds samples_pre records,
      // start a new tfrecord file with a fresh RecordWriter.
      if (writer_ != NULL) {
        delete writer_;
        writer_ = NULL;
      }
      char output_file[1024];
      snprintf(output_file, sizeof(output_file), "%s/%s_%03d.tfrecord", output_dir.c_str(),
               output_name.c_str(), fidx);
      printf("create new tfrecord file: [%s] \n", output_file);
      Status s = Env::Default()->NewWritableFile((string)output_file, &file);
      if (!s.ok()) {
        printf("create write record file [%s] wrong!!!\n", output_file);
        return false;
      }
      writer_ = new RecordWriter(file.get(), options);
      j = 0;
      fidx += 1;
    }
    // Read the image and skip the entry if it cannot be loaded.
    cv::Mat image = ReadImageToCVMat(dataset[line_id].first);
    if (!image.data) {
      printf("image: [%s] could not be read\n", dataset[line_id].first.c_str());
      continue;
    }
    // Convert the Mat to a std::string.
    std::string image_b = matToBytes(image);
    int height = image.rows;
    int width = image.cols;
    int depth = image.channels();
    // Each list entry becomes one Example.
    Example example1;
    Features* features1 = example1.mutable_features();
    ::google::protobuf::Map<string, Feature>* feature1 = features1->mutable_feature();
    Feature feature_tmp;
    feature_tmp.Clear();
    if (!bytes_feature(feature_tmp, image_b)) {
      printf("image: [%s] wrong\n", dataset[line_id].first.c_str());
      continue;
    }
    (*feature1)["image_raw"] = feature_tmp;
    feature_tmp.Clear();
    if (!int64_feature(feature_tmp, height)) {
      printf("image: [%s] , height [%d] wrong\n", dataset[line_id].first.c_str(), height);
      continue;
    }
    (*feature1)["height"] = feature_tmp;
    feature_tmp.Clear();
    if (!int64_feature(feature_tmp, width)) {
      printf("image: [%s] , width [%d] wrong\n", dataset[line_id].first.c_str(), width);
      continue;
    }
    (*feature1)["width"] = feature_tmp;
    feature_tmp.Clear();
    if (!int64_feature(feature_tmp, depth)) {
      printf("image: [%s] , depth [%d] wrong\n", dataset[line_id].first.c_str(), depth);
      continue;
    }
    (*feature1)["depth"] = feature_tmp;
    // The labels are assumed to be integers already (0, 1, 2, 3, 4, 5, ...);
    // otherwise a name-to-label conversion has to be added here.
    feature_tmp.Clear();
    if (!int64_feature(feature_tmp, dataset[line_id].second)) {
      printf("image: [%s] wrong\n", dataset[line_id].first.c_str());
      continue;
    }
    (*feature1)["label"] = feature_tmp;
    // Serialize the Example to a string and write it with writer_.
    std::string str;
    example1.SerializeToString(&str);
    writer_->WriteRecord(str);
    ++j;
    if (line_id % 1000 == 0) {
      printf("Processed %zu files.\n", line_id);
    }
  }
  printf("Processed %zu files.\nfinished\n", line_id);
  if (writer_ != NULL) {
    delete writer_;
    writer_ = NULL;
  }
  return true;
}
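The file rollover in the loop above follows a simple pattern: records are written to numbered shard files named with the format string "%s/%s_%03d.tfrecord". The sketch below isolates that arithmetic using the standard library only. The function names are hypothetical, and it assumes every record is written successfully, whereas the article's loop does not count entries it skips.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Builds the shard file name used in the article: <dir>/<name>_<3-digit index>.tfrecord
std::string shard_path(const std::string& output_dir, const std::string& output_name,
                       int shard_index) {
  char buf[1024];
  std::snprintf(buf, sizeof(buf), "%s/%s_%03d.tfrecord", output_dir.c_str(),
                output_name.c_str(), shard_index);
  return std::string(buf);
}

// Number of shard files needed for total records at per_shard records per file.
std::size_t shard_count(std::size_t total, std::size_t per_shard) {
  if (total == 0 || per_shard == 0) return 0;
  return (total + per_shard - 1) / per_shard;  // Ceiling division.
}

// Index of the shard that receives the record with the given zero-based index.
std::size_t shard_of(std::size_t record_index, std::size_t per_shard) {
  return per_shard == 0 ? 0 : record_index / per_shard;
}
```

With 2500 records and 1000 records per file, this yields three files with the suffixes _000, _001, and _002.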
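The matToBytes function used in clsdata_to_tfrecords is defined as follows:
std::string matToBytes(const cv::Mat& image) {
  // Mat rows may be padded or belong to a larger image; clone() yields one
  // contiguous buffer in that case.
  cv::Mat continuous = image.isContinuous() ? image : image.clone();
  const size_t size = continuous.total() * continuous.elemSize();
  // Construct the string with an explicit length so that zero bytes are kept.
  return std::string(reinterpret_cast<const char*>(continuous.data), size);
}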
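Anyone reading the image_raw feature back has to know how these bytes are laid out. Assuming an 8-bit Mat stored contiguously, the pixels are row-major with the channels of each pixel interleaved. If the image was loaded with cv::imread in its default color mode, the channel order is B, G, R, but ReadImageToCVMat is not shown in this article, so that part is an assumption. A minimal indexing sketch with hypothetical function names:

```cpp
#include <cstddef>
#include <string>

// Byte offset of one channel of one pixel in a row-major, channel-interleaved buffer.
std::size_t pixel_offset(int row, int col, int channel, int width, int channels) {
  return (static_cast<std::size_t>(row) * width + col) * channels + channel;
}

// Reads one channel value from the byte string produced for an 8-bit image.
unsigned char pixel_at(const std::string& bytes, int row, int col, int channel, int width,
                       int channels) {
  return static_cast<unsigned char>(bytes[pixel_offset(row, col, channel, width, channels)]);
}
```

This is also why the width, height, and depth features are stored next to the raw bytes: without them the buffer cannot be reshaped into an image.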
The helpers that convert a string, an int, or a vector to a Feature are defined as follows:
// Overloads so that both int and std::vector<int> can be converted to a Feature.
bool int64_feature(Feature& feature, int value) {
  Int64List* i_list1 = feature.mutable_int64_list();
  i_list1->add_value(value);
  return true;
}
bool int64_feature(Feature& feature, const std::vector<int>& value) {
  if (value.size() < 1) {
    printf("value int64 is wrong!!!\n");
    return false;
  }
  Int64List* i_list1 = feature.mutable_int64_list();
  for (size_t i = 0; i < value.size(); ++i) i_list1->add_value(value[i]);
  return true;
}
bool float_feature(Feature& feature, const std::vector<double>& value) {
  if (value.size() < 1) {
    printf("value float is wrong!!!\n");
    return false;
  }
  FloatList* f_list1 = feature.mutable_float_list();
  // FloatList stores 32-bit floats, so each double is narrowed here.
  for (size_t i = 0; i < value.size(); ++i) f_list1->add_value(static_cast<float>(value[i]));
  return true;
}
// Converts the image bytes to a Feature.
bool bytes_feature(Feature& feature, const std::string& value) {
  BytesList* b_list1 = feature.mutable_bytes_list();
  // Passing a std::string copies value.size() bytes, so zero bytes in the image
  // are preserved; passing a char* instead would stop at the first \0.
  b_list1->add_value(value);
  return true;
}
For the detection task the overall flow is the same. The list-reading code differs slightly, and the XML annotation file has to be parsed as well, which can be done with Boost's XML parser. The code is roughly as follows:
bool ReadXMLToExapmle(const string& image_file, const string& xmlfile, const int img_height,
                      const int img_width, const std::map<string, int>& name_to_label,
                      RecordWriter* writer_) {
  using boost::property_tree::ptree;
  // Read the image.
  cv::Mat image = ReadImageToCVMat(image_file);
  if (!image.data) {
    std::cout << "Could not open or find file " << image_file << std::endl;
    return false;
  }
  // Convert the Mat to a std::string.
  std::string image_b = matToBytes(image);
  Example example1;
  Features* features1 = example1.mutable_features();
  ::google::protobuf::Map<string, Feature>* feature1 = features1->mutable_feature();
  Feature feature_tmp;
  feature_tmp.Clear();
  if (!bytes_feature(feature_tmp, image_b)) {
    printf("image: [%s] wrong\n", image_file.c_str());
    return false;
  }
  // Note: despite the key name, these are raw pixel bytes, not JPEG/PNG data.
  (*feature1)["image/encoded"] = feature_tmp;
  // Parse the annotation. read_xml is inside the try block because a missing
  // or malformed file also raises a ptree_error.
  ptree pt;
  int width = 0, height = 0, depth = 0;
  try {
    boost::property_tree::read_xml(xmlfile, pt);
    height = pt.get<int>("annotation.size.height");
    width = pt.get<int>("annotation.size.width");
    depth = pt.get<int>("annotation.size.depth");
  } catch (const boost::property_tree::ptree_error& e) {
    // Without the size the boxes cannot be validated, so give up on this
    // sample (img_height and img_width are currently unused).
    std::cout << "when parsing " << xmlfile << ":" << e.what() << std::endl;
    return false;
  }
  feature_tmp.Clear();
  if (!int64_feature(feature_tmp, height)) {
    printf("xml : [%s] 's height wrong\n", xmlfile.c_str());
    return false;
  }
  (*feature1)["image/height"] = feature_tmp;
  feature_tmp.Clear();
  if (!int64_feature(feature_tmp, width)) {
    printf("xml : [%s] 's width wrong\n", xmlfile.c_str());
    return false;
  }
  (*feature1)["image/width"] = feature_tmp;
  feature_tmp.Clear();
  if (!int64_feature(feature_tmp, depth)) {
    printf("xml : [%s] 's depth wrong\n", xmlfile.c_str());
    return false;
  }
  (*feature1)["image/depth"] = feature_tmp;
  std::vector<int> v_label;
  std::vector<int> v_difficult;
  std::vector<double> v_xmin;
  std::vector<double> v_ymin;
  std::vector<double> v_xmax;
  std::vector<double> v_ymax;
  BOOST_FOREACH (ptree::value_type& v1, pt.get_child("annotation")) {
    if (v1.first == "object") {
      bool difficult = false;
      ptree object = v1.second;
      BOOST_FOREACH (ptree::value_type& v2, object.get_child("")) {
        ptree pt2 = v2.second;
        if (v2.first == "name") {
          string name = pt2.data();
          if (name_to_label.find(name) == name_to_label.end()) {
            // The sample is skipped without being written; returning true
            // lets the caller carry on with the next one.
            std::cout << "file : [" << xmlfile << "] Unknown name: " << name << std::endl;
            return true;
          }
          int label = name_to_label.find(name)->second;
          v_label.push_back(label);
        } else if (v2.first == "difficult") {
          difficult = pt2.data() == "1";
          v_difficult.push_back(difficult);
        } else if (v2.first == "bndbox") {
          int xmin = pt2.get("xmin", 0);
          int ymin = pt2.get("ymin", 0);
          int xmax = pt2.get("xmax", 0);
          int ymax = pt2.get("ymax", 0);
          if ((xmin > width) || (ymin > height) || (xmax > width) || (ymax > height) ||
              (xmin < 0) || (ymin < 0) || (xmax < 0) || (ymax < 0)) {
            std::cout << "bounding box exceeds image boundary." << std::endl;
            return false;
          }
          v_xmin.push_back(xmin);
          v_ymin.push_back(ymin);
          v_xmax.push_back(xmax);
          v_ymax.push_back(ymax);
        }
      }
    }
  }
  feature_tmp.Clear();
  if (!int64_feature(feature_tmp, v_label)) {
    printf("xml : [%s]'s label wrong\n", xmlfile.c_str());
    return false;
  }
  (*feature1)["image/object/bbox/label"] = feature_tmp;
  feature_tmp.Clear();
  if (!int64_feature(feature_tmp, v_difficult)) {
    printf("xml : [%s]'s difficult wrong\n", xmlfile.c_str());
    return false;
  }
  (*feature1)["image/object/bbox/difficult"] = feature_tmp;
  feature_tmp.Clear();
  if (!float_feature(feature_tmp, v_xmin)) {
    printf("xml : [%s]'s v_xmin wrong\n", xmlfile.c_str());
    return false;
  }
  (*feature1)["image/object/bbox/xmin"] = feature_tmp;
  feature_tmp.Clear();
  if (!float_feature(feature_tmp, v_ymin)) {
    printf("xml : [%s]'s v_ymin wrong\n", xmlfile.c_str());
    return false;
  }
  (*feature1)["image/object/bbox/ymin"] = feature_tmp;
  feature_tmp.Clear();
  if (!float_feature(feature_tmp, v_xmax)) {
    printf("xml : [%s]'s v_xmax wrong\n", xmlfile.c_str());
    return false;
  }
  (*feature1)["image/object/bbox/xmax"] = feature_tmp;
  feature_tmp.Clear();
  if (!float_feature(feature_tmp, v_ymax)) {
    printf("xml : [%s]'s v_ymax wrong\n", xmlfile.c_str());
    return false;
  }
  // The original stored this under "image/object/bbox/xmax", which overwrote
  // xmax and never wrote ymax.
  (*feature1)["image/object/bbox/ymax"] = feature_tmp;
  // Serialize the Example and write it with the RecordWriter.
  std::string str;
  example1.SerializeToString(&str);
  writer_->WriteRecord(str);
  return true;
}
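The bounding-box check in the detection code only verifies that each coordinate lies inside the image. As a hedged sketch of a slightly stricter variant, the standard-library code below also rejects empty or inverted boxes and shows how coordinates could be normalized to [0, 1]. The struct and function names are hypothetical, and normalization is an assumption about what a downstream reader might expect, not something the code in this article does.

```cpp
// Pixel-coordinate box as read from a VOC-style <bndbox> element.
struct Box {
  int xmin, ymin, xmax, ymax;
};

// The same box with coordinates expressed as fractions of the image size.
struct NormalizedBox {
  double xmin, ymin, xmax, ymax;
};

// True when the box lies inside the image and has positive width and height.
bool is_valid_box(const Box& b, int width, int height) {
  return width > 0 && height > 0 && b.xmin >= 0 && b.ymin >= 0 && b.xmax <= width &&
         b.ymax <= height && b.xmin < b.xmax && b.ymin < b.ymax;
}

// Divides x coordinates by the width and y coordinates by the height.
// Call only after is_valid_box has returned true.
NormalizedBox normalize_box(const Box& b, int width, int height) {
  NormalizedBox n;
  n.xmin = static_cast<double>(b.xmin) / width;
  n.ymin = static_cast<double>(b.ymin) / height;
  n.xmax = static_cast<double>(b.xmax) / width;
  n.ymax = static_cast<double>(b.ymax) / height;
  return n;
}
```

Whether to store pixel or normalized coordinates depends entirely on the code that later parses the TFRecord, so the choice should be made together with the reader.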
The final Makefile is as follows (TENSORFLOW_HOME stands for the path to your TensorFlow source tree):
all:
	rm -rf example.pb*
	${PROTOBUF_HOME}/bin/protoc -I=. --cpp_out=./ example.proto
	${PROTOBUF_HOME}/bin/protoc -I=. --cpp_out=./ label.proto
	g++ -std=c++11 -o dataset_to_tfrecord dataset_to_tfrecord.cc example.pb.cc common.cpp \
		-I/usr/local/opencv2/include -L/usr/local/opencv2/lib -L. \
		-lopencv_core -lopencv_highgui -lopencv_imgproc \
		-I${TENSORFLOW_HOME} -I${TENSORFLOW_HOME}/bazel-genfiles \
		-I${PROTOBUF_HOME}/include -I/usr/local/include/eigen3 \
		-L${PROTOBUF_HOME}/lib -L${TENSORFLOW_HOME}/bazel-bin/tensorflow/ \
		-lprotobuf -ltensorflow_framework \
		-I${JSONCPP_HOME}/include -L${JSONCPP_HOME}/lib -ljson_linux-gcc-5.4.0_libmt