[Learning Kaldi] Kaldi's I/O Mechanisms

This post gives an overview of the input/output mechanisms in Kaldi. (It mainly follows the Kaldi documentation.)

The input/output interface of Kaldi classes

Classes defined in Kaldi share a common I/O interface. The standard interface looks like this:

class SomeKaldiClass {
 public:
   void Read(std::istream &is, bool binary);
   void Write(std::ostream &os, bool binary) const;
};

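Note that both member functions return void rather than the stream, so calls cannot be chained the way operations on an istream or ostream usually are. The binary argument is a flag that says whether the data being read or written is in binary or text form.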
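
To make the shape of this interface concrete, here is a minimal sketch of a class that follows it. ToyCount and RoundTrip are hypothetical names invented for this illustration and are not part of Kaldi; the binary branch simply dumps raw bytes, which is an assumption of the sketch and not Kaldi's actual binary format.

```cpp
#include <cassert>
#include <cstdint>
#include <ios>
#include <istream>
#include <ostream>
#include <sstream>

// Hypothetical toy class (not from Kaldi); it only mirrors the Read/Write shape.
class ToyCount {
 public:
  explicit ToyCount(int32_t value = 0) : value_(value) {}
  int32_t value() const { return value_; }

  // Text mode: decimal digits followed by a space.
  // Binary mode: the raw bytes of the int32 in native byte order (an
  // assumption of this sketch, not Kaldi's on-disk format).
  void Write(std::ostream &os, bool binary) const {
    if (binary) {
      os.write(reinterpret_cast<const char *>(&value_), sizeof(value_));
    } else {
      os << value_ << ' ';
    }
  }

  void Read(std::istream &is, bool binary) {
    if (binary) {
      is.read(reinterpret_cast<char *>(&value_), sizeof(value_));
    } else {
      is >> value_;
    }
  }

 private:
  int32_t value_;
};

// Writes a value to an in-memory stream and reads it back in the same mode.
int32_t RoundTrip(int32_t value, bool binary) {
  std::stringstream ss(std::ios::in | std::ios::out | std::ios::binary);
  ToyCount(value).Write(ss, binary);
  ToyCount restored;
  restored.Read(ss, binary);
  return restored.value();
}
```

Because Read and Write return void, each object is read or written with its own statement, and the same binary flag has to be passed to both sides.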

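How Kaldi objects are stored in files

As mentioned above, Kaldi code that reads an object needs to know which mode to use (binary or text). The user does not have to keep track of whether a given file is binary or text, because a file holding Kaldi objects identifies itself: a binary Kaldi file starts with the string "\0B", while a text file needs no header.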
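
The header check can be sketched roughly as follows. ConsumeBinaryHeader and StartsBinary are hypothetical helpers written for this post; Kaldi's own stream-initialization code is more careful about error handling, so treat this only as an illustration of peeking for "\0B".

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: returns true and consumes the two-byte header if the
// stream starts with "\0B"; otherwise leaves the stream position unchanged
// and returns false, meaning the content should be treated as text.
bool ConsumeBinaryHeader(std::istream &is) {
  if (is.peek() != '\0') return false;  // also covers an empty stream (EOF)
  is.get();                             // consume '\0'
  if (is.peek() != 'B') {
    is.unget();                         // put '\0' back; not a binary header
    return false;
  }
  is.get();                             // consume 'B'
  return true;
}

// Convenience wrapper for trying the check on an in-memory byte string.
bool StartsBinary(const std::string &bytes) {
  std::istringstream is(bytes);
  return ConsumeBinaryHeader(is);
}
```

The boolean returned here plays the role of the binary flag that is later passed to Read.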

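Extended filenames: rxfilenames and wxfilenames

The terms rxfilename and wxfilename do not name classes. They are descriptors that often appear in variable names, with the following meanings:

An rxfilename is a string that the Kaldi Input class interprets as an extended filename for reading.
A wxfilename is a string that the Kaldi Output class interprets as an extended filename for writing.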

{ // input.
  bool binary_in;
  Input ki(some_rxfilename, &binary_in);
  my_object.Read(ki.Stream(), binary_in);
  // you can have more than one object in a file:
  my_other_object.Read(ki.Stream(), binary_in);
}
// output.  note, "binary" is probably a command-line option.
{
  Output ko(some_wxfilename, binary);
  my_object.Write(ko.Stream(), binary);
}

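The types of rxfilename are:

"-" or "" means the standard input.
"some command |" means an input pipe command: the string before the | is handed to the shell via popen().
"/some/filename:12345" means an offset into a file: the file is opened and the read position is moved to byte 12345.
"/some/filename"... anything that matches none of the patterns above is treated as a normal filename.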

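The types of wxfilename are:

"-" or "" means the standard output.
"| some command" means an output pipe command: the string after the | is handed to the shell via popen().
"/some/filename"... anything that matches none of the patterns above is treated as a normal filename.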

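The two lists of extended-filename types can be summarized as a small classifier. This is only a sketch: XfilenameKind, ClassifyRx and ClassifyWx are names made up for this post, and Kaldi's real classification code handles more corner cases than the simple pattern checks used here.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical names for this sketch; Kaldi's own classification differs.
enum class XfilenameKind { kStandardStream, kPipe, kOffsetFile, kPlainFile };

// Classifies a string used for reading (an rxfilename).
XfilenameKind ClassifyRx(const std::string &name) {
  if (name.empty() || name == "-") return XfilenameKind::kStandardStream;
  if (name.back() == '|') return XfilenameKind::kPipe;  // "some command |"
  const std::size_t colon = name.rfind(':');
  if (colon != std::string::npos && colon + 1 < name.size()) {
    bool all_digits = true;
    for (std::size_t i = colon + 1; i < name.size(); ++i) {
      if (!std::isdigit(static_cast<unsigned char>(name[i]))) all_digits = false;
    }
    if (all_digits) return XfilenameKind::kOffsetFile;  // "/some/filename:12345"
  }
  return XfilenameKind::kPlainFile;
}

// Classifies a string used for writing (a wxfilename).
XfilenameKind ClassifyWx(const std::string &name) {
  if (name.empty() || name == "-") return XfilenameKind::kStandardStream;
  if (name.front() == '|') return XfilenameKind::kPipe;  // "| some command"
  return XfilenameKind::kPlainFile;
}
```

Note the asymmetry: the pipe symbol comes last for reading and first for writing, and only rxfilenames have an offset form.
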
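The Table concept

In Kaldi, a Table is a concept rather than an actual C++ class. A Table is a collection of objects of a known type, indexed by strings. These strings must be tokens, that is, non-empty strings containing no whitespace. Typical examples of Tables include:

A set of feature files (represented as Matrix<float>) indexed by utterance id.
A set of transcriptions (represented as std::vector<int32>) indexed by utterance id.
A set of constrained MLLR transforms (represented as Matrix<float>) indexed by speaker id.
A Table can exist on disk (or in a pipe) in one of two possible forms: a script file or an archive.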

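Accessing Tables

Tables in Kaldi can be accessed in three ways: through TableWriter, SequentialTableReader, and RandomAccessTableReader.
These are all templates, but they are templated on a Holder type rather than on the type of object stored in the Table. A Holder tells the Table code how to read and write the objects the Table contains. Holder is not an actual class or base class; it describes a family of classes whose names end in Holder, such as TokenHolder and KaldiObjectHolder.
The type of the object "held" by a Holder is the typedef Holder::T, where Holder is the name of the actual Holder class in use.
To open a Table, we must supply a string called a wspecifier or an rspecifier, which tells the Table code how the Table is stored on disk and gives it other directives. Consider the following example, which reads features, applies a linear transform to them, and writes them back to disk.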

std::string feature_rspecifier   = "scp:/tmp/my_orig_features.scp",
            transform_rspecifier = "ark:/tmp/transforms.ark",
            feature_wspecifier   = "ark,t:/tmp/new_features.ark";
// there are actually more convenient typedefs for the types below,
// e.g. BaseFloatMatrixWriter, SequentialBaseFloatMatrixReader, etc.
TableWriter<BaseFloatMatrixHolder> feature_writer(feature_wspecifier);
SequentialTableReader<BaseFloatMatrixHolder> feature_reader(feature_rspecifier);
RandomAccessTableReader<BaseFloatMatrixHolder> transform_reader(transform_rspecifier);
for(; !feature_reader.Done(); feature_reader.Next()) {
   std::string utt = feature_reader.Key();
   if(transform_reader.HasKey(utt)) {
      Matrix<BaseFloat> new_feats(feature_reader.Value());
      ApplyFmllrTransform(new_feats, transform_reader.Value(utt));
      feature_writer.Write(utt, new_feats);
   }
}

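A nice property of this setup is that code can access a Table much as it would a generic map or list. The data format and other aspects of the reading process are controlled by the rspecifier or wspecifier rather than by the calling code. In the example above, ",t" means the data is written in text form.
Ideally a Table would be accessed as a map from strings to objects. However, as long as a particular Table is not being accessed randomly, it may contain duplicate keys: for writing and for sequential access, a Table behaves more like a list of pairs.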

The Kaldi script file format

A script file is a text file in which each line looks like this:

some_string_identifier /some/filename

Another valid line is:

utt_id_01001 gunzip -c /usr/data/file_010001.wav.gz |

The general form of each line is:

<key> <rxfilename>

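For objects of Matrix type, a range can also be specified (a row range, a column range, or both, with rows first), for example: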

utt_id_01002 foo.ark:89142[0:51]
utt_id_01002 foo.ark:89142[0:51,89:100]
utt_id_01002 foo.ark:89142[,89:100]

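Kaldi processes each line of a script file as follows: it trims whitespace from the beginning and end of the line, then splits the line on the first run of whitespace into two parts. The first part is the key of the Table entry, for example utt_id_01001. The second part, after any range specifier has been stripped off, becomes the xfilename, for example gunzip -c /usr/data/file_010001.wav.gz |. Empty lines and empty xfilenames are not allowed.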
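
The line-splitting procedure just described can be sketched as follows. ScpEntry and ParseScpLine are hypothetical names chosen for this illustration, and the range handling (take the text inside a trailing [...]) is a simplifying assumption, not Kaldi's actual parser.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
struct ScpEntry {
  std::string key;        // e.g. "utt_id_01002"
  std::string xfilename;  // e.g. "foo.ark:89142"
  std::string range;      // e.g. "0:51,89:100"; empty if no range was given
};

// Splits one script-file line into key, xfilename and optional range.
// Returns false for empty lines or lines without an xfilename.
bool ParseScpLine(const std::string &line, ScpEntry *entry) {
  const char *kSpace = " \t\r\n";
  const std::size_t first = line.find_first_not_of(kSpace);
  if (first == std::string::npos) return false;  // empty line
  const std::size_t last = line.find_last_not_of(kSpace);
  const std::string trimmed = line.substr(first, last - first + 1);

  const std::size_t split = trimmed.find_first_of(" \t");
  if (split == std::string::npos) return false;  // key only, no xfilename
  ScpEntry parsed;
  parsed.key = trimmed.substr(0, split);
  // trimmed has no trailing whitespace, so a non-space character must follow.
  std::string rest = trimmed.substr(trimmed.find_first_not_of(" \t", split));

  // Strip a trailing "[...]" range specifier, if present.
  if (rest.back() == ']') {
    const std::size_t open = rest.rfind('[');
    if (open != std::string::npos) {
      parsed.range = rest.substr(open + 1, rest.size() - open - 2);
      rest = rest.substr(0, open);
    }
  }
  if (rest.empty()) return false;
  parsed.xfilename = rest;
  *entry = parsed;
  return true;
}
```

Only the first run of whitespace separates the key from the rest, which is why a pipe command containing spaces survives intact as the xfilename.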

Note: offsets are byte offsets; for example, foo.ark:8432 refers to byte 8432 of the file. The byte offset points to the start of the object. For binary data, it points to the "\0B" header.

The Kaldi archive file format

The Kaldi archive format is quite simple:

token1 [something]token2 [something]token3 [something] …

That is, each entry consists of a token, then a space character, then whatever the Holder's Write function produces.
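
As an illustration of this layout, the sketch below writes and reads a text-mode archive of integer vectors. The helper names and the choice to end each object with a newline are assumptions made for this example; in Kaldi the bytes after the space are whatever the Holder writes.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical entry type: a key token plus a vector of integers.
using Entry = std::pair<std::string, std::vector<int>>;

// Writes each entry as: token, one space, then the object (here: the
// integers separated by spaces and terminated by a newline).
void WriteTextArchive(std::ostream &os, const std::vector<Entry> &entries) {
  for (const Entry &e : entries) {
    os << e.first << ' ';
    for (int v : e.second) os << v << ' ';
    os << '\n';
  }
}

// Reads the entries back in the order they were written.
std::vector<Entry> ReadTextArchive(std::istream &is) {
  std::vector<Entry> entries;
  std::string key;
  while (is >> key) {
    std::string rest;
    std::getline(is, rest);  // the remainder of the line is the object
    std::istringstream values(rest);
    Entry e{key, {}};
    for (int v; values >> v;) e.second.push_back(v);
    entries.push_back(e);
  }
  return entries;
}
```

Reading is purely sequential here, which matches the point that an archive behaves like a list of (key, object) pairs.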

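Specifying Table formats: wspecifiers and rspecifiers

The Table classes take a string that is passed to the constructor or to the Open function. When passed to a TableWriter, this string is called a wspecifier; when passed to a RandomAccessTableReader or a SequentialTableReader, it is called an rspecifier.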

std::string rspecifier1 = "scp:data/train.scp"; // script file.
std::string rspecifier2 = "ark:-"; // archive read from stdin.
// write to a gzipped text archive.
std::string wspecifier1 = "ark,t:| gzip -c > /some/dir/foo.ark.gz";
std::string wspecifier2 = "ark,scp:data/my.ark,data/my.scp";

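Usually, an rspecifier or wspecifier consists of a comma-separated list that contains one of ark or scp together with some one-letter or two-letter options, followed by a colon, followed by an rxfilename or wxfilename respectively. The order of the items before the colon does not matter.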
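
The structure of a specifier can be sketched with a small parser. ParsedSpecifier and ParseSpecifier are hypothetical names for this post; the sketch does not validate which option letters are legal and is not how Kaldi itself classifies specifiers.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>

// Hypothetical result type for this sketch.
struct ParsedSpecifier {
  bool has_ark = false;
  bool has_scp = false;
  std::set<std::string> options;  // e.g. "t", "f", "o", "cs"
  std::string filenames;          // everything after the first colon
};

// Splits "ark,t:filename" style strings at the first colon and collects the
// comma-separated items before it. Returns false if there is no colon, an
// item is empty, or neither "ark" nor "scp" is present.
bool ParseSpecifier(const std::string &spec, ParsedSpecifier *out) {
  const std::size_t colon = spec.find(':');
  if (colon == std::string::npos) return false;
  ParsedSpecifier parsed;
  std::istringstream head(spec.substr(0, colon));
  std::string item;
  while (std::getline(head, item, ',')) {
    if (item.empty()) return false;
    if (item == "ark") {
      parsed.has_ark = true;
    } else if (item == "scp") {
      parsed.has_scp = true;
    } else {
      parsed.options.insert(item);
    }
  }
  if (!parsed.has_ark && !parsed.has_scp) return false;
  parsed.filenames = spec.substr(colon + 1);
  *out = parsed;
  return true;
}
```

Storing the options in a set reflects the rule that their order before the colon is irrelevant.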

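Writing an archive and a script file at the same time

There is a special case of wspecifier: before the colon comes "ark,scp", and after the colon come a wxfilename for the archive, then a comma, then a wxfilename for the script file.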

"ark,scp:/some/dir/foo.ark,/some/dir/foo.scp"

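This writes an archive and a script file at the same time. Each line of the script file has the form "utt_id /some/dir/foo.ark:1234", where the number is a byte offset into the archive that makes random access efficient. Note that the wxfilename given for the archive should be a normal filename; otherwise the resulting script file will not be directly readable by Kaldi.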
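
How the offsets in the script file relate to the archive can be sketched with in-memory streams. WriteArkAndScp and ReadObjectAt are hypothetical helpers for this illustration, and each object is just one line of text, which is an assumption of the sketch and not Kaldi's implementation.

```cpp
#include <cassert>
#include <ios>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical item type: a key token plus the object rendered as one line.
using KeyAndText = std::pair<std::string, std::string>;

// Writes every item to the archive stream and, for each one, adds a script
// line "<key> <ark_name>:<offset>" where offset is the byte position at which
// the object itself starts (just after the key and the space).
void WriteArkAndScp(const std::vector<KeyAndText> &items,
                    const std::string &ark_name,
                    std::ostream &ark, std::ostream &scp) {
  for (const KeyAndText &item : items) {
    ark << item.first << ' ';
    const std::streamoff offset = ark.tellp();
    ark << item.second << '\n';
    scp << item.first << ' ' << ark_name << ':' << offset << '\n';
  }
}

// Random access: seek straight to the recorded offset and read one object.
std::string ReadObjectAt(const std::string &ark_bytes, std::streamoff offset) {
  std::istringstream is(ark_bytes);
  is.seekg(offset);
  std::string object;
  std::getline(is, object);
  return object;
}
```

This also shows why the archive should be a normal file: an offset is only useful if a later reader can seek to it, which is not possible through a pipe.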

Valid options for wspecifiers

The options allowed in a wspecifier are:

“b” (binary) means write in binary mode (currently unnecessary as it’s always the default).
“t” (text) means write in text mode.
“f” (flush) means flush the stream after each write operation.
“nf” (no-flush) means don’t flush the stream after each write operation (would currently be pointless, but calling code can change the default).
“p” means permissive mode, which affects “scp:” wspecifiers where the scp file is missing some entries: the “p” option will cause it to silently not write anything for these files, and report no error.

Examples of wspecifiers that use several options:

"ark,t,f:data/my.ark"
"ark,scp,t,f:data/my.ark,|gzip -c > data/my.scp.gz"

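Valid options for rspecifiers

When reading about these options, keep in mind that when an archive is actually a pipe (which is often the case), the code reading it cannot seek within it. If a RandomAccessTableReader is reading an archive, the code may have to keep many objects in memory in case they are requested again later, or it may have to read all the way to the end of the file looking for a key that is not actually there. Some of the options listed below help avoid this.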

“o” (once) is the user’s way of asserting to the RandomAccessTableReader code that each key will be queried only once. This stops it from having to keep already-read objects in memory just in case they are needed again.
“p” (permissive) instructs the code to ignore errors and just provide what data it can; invalid data is treated as not existing. In scp files, this means that a query to HasKey() forces the load of the corresponding file, so the code can know to return false if the file is corrupt. In archives, this option stops exceptions from being raised if the archive is corrupted or truncated (it will just stop reading at that point).
“s” (sorted) instructs the code that the keys in an archive being read are in sorted string order. For RandomAccessTableReader, this means that when HasKey() is called for some key not in the archive, it can return false as soon as it encounters a “higher” key; it won’t have to read till the end.
“cs” (called-sorted) instructs the code that the calls to HasKey() and Value() will be in sorted string order. Thus, if one of these functions is called for some string, the reading code can discard the objects for lower-numbered keys. This saves memory. In effect, “cs” represents the user’s assertion that some other archive that the program may be iterating over, is itself sorted.

Examples:

"ark,o,s,cs:-"
"scp,p:data/my.scp"
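
To see why the "s" option helps, the sketch below scans archive keys in order and reports how many had to be read before the question "is this key present?" could be answered. HasKeySequential is a hypothetical function for this illustration and not the logic of RandomAccessTableReader itself.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Scans keys in archive order. If 'sorted' is true, the caller asserts that
// the keys are in sorted string order, so the scan can stop at the first key
// that compares greater than the query. num_keys_read reports the work done.
bool HasKeySequential(const std::vector<std::string> &archive_keys,
                      const std::string &query, bool sorted,
                      std::size_t *num_keys_read) {
  *num_keys_read = 0;
  for (const std::string &key : archive_keys) {
    ++*num_keys_read;
    if (key == query) return true;
    if (sorted && key > query) return false;  // query cannot appear later
  }
  return false;  // reached the end of the archive
}
```

For keys utt1, utt3, utt5, utt7 and the missing query utt2, the sorted scan stops after two keys, while the unsorted scan has to read all four.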