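The MNIST dataset used by the handwritten-digit recognition example consists of four files: train-images-idx3-ubyte, train-labels-idx1-ubyte, t10k-images-idx3-ubyte and t10k-labels-idx1-ubyte. They are, respectively, the training-set images, the training-set labels, the test-set images and the test-set labels. The training set has 60000 images and the test set has 10000. All four files are binary, so how are the images actually stored in them?
We take the two test-set files, t10k-images-idx3-ubyte and t10k-labels-idx1-ubyte, as the example. Their layouts are defined as follows.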
Layout of t10k-images-idx3-ubyte
Offset (hex) | Type | Value | Description |
0000 | uint32 | 2051 | Magic number (big-endian) |
0004 | uint32 | 10000 | Number of items in the file |
0008 | uint32 | 28 | Number of rows |
000c | uint32 | 28 | Number of columns |
0010 | uint8 | ? | Pixel gray value (0-255) |
0011 | uint8 | ? | Pixel gray value (0-255) |
... | ... | ... | ... |
Layout of t10k-labels-idx1-ubyte
Offset (hex) | Type | Value | Description |
0000 | uint32 | 2049 | Magic number (big-endian) |
0004 | uint32 | 10000 | Number of items in the file |
0008 | uint8 | ? | Label value (0-9) |
0009 | uint8 | ? | Label value (0-9) |
... | ... | ... | ... |
The two training-set files use exactly the same layouts as the test-set files.
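The two header layouts in the tables can be sketched in Python with the standard `struct` module. This is only an illustration: the function names `parse_image_header` and `parse_label_header` are made up for this example, and the header bytes are rebuilt from the table values rather than read from the real files.

```python
import struct

IMAGE_MAGIC = 2051  # 0x00000803, from the image-file table
LABEL_MAGIC = 2049  # 0x00000801, from the label-file table


def parse_image_header(data):
    """Return (count, rows, cols) from the first 16 bytes of an image file."""
    # ">IIII" = four big-endian unsigned 32-bit integers
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != IMAGE_MAGIC:
        raise ValueError(f"unexpected image magic number: {magic}")
    return count, rows, cols


def parse_label_header(data):
    """Return the item count from the first 8 bytes of a label file."""
    magic, count = struct.unpack(">II", data[:8])
    if magic != LABEL_MAGIC:
        raise ValueError(f"unexpected label magic number: {magic}")
    return count


# Headers rebuilt from the table values (not read from disk)
image_header = struct.pack(">IIII", 2051, 10000, 28, 28)
label_header = struct.pack(">II", 2049, 10000)

print(parse_image_header(image_header))  # (10000, 28, 28)
print(parse_label_header(label_header))  # 10000
```

To use it on a real file, pass the result of `open(path, "rb").read(16)` (or `read(8)` for labels) instead of the rebuilt headers.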
From the tables we can see that t10k-images-idx3-ubyte stores 10000 images, each 28×28 pixels. Every pixel takes one byte holding its gray value, where 0 is white and 255 is black.
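The layout implies some simple offset arithmetic, sketched below. It assumes the pixels of each image are stored row by row (the usual description of the MNIST format, although the table above does not say so explicitly); `pixel_offset` and `expected_file_size` are hypothetical helper names.

```python
HEADER_SIZE = 16          # four uint32 fields
ROWS, COLS = 28, 28
IMAGE_SIZE = ROWS * COLS  # 784 bytes per image, one byte per pixel


def pixel_offset(index, row, col):
    """File offset of pixel (row, col) of image number `index` (all 0-based)."""
    return HEADER_SIZE + index * IMAGE_SIZE + row * COLS + col


def expected_file_size(count):
    """Total size in bytes of an image file holding `count` images."""
    return HEADER_SIZE + count * IMAGE_SIZE


print(hex(pixel_offset(0, 0, 0)))  # 0x10, first pixel of the first image
print(hex(pixel_offset(0, 7, 6)))  # 0xda
print(hex(pixel_offset(1, 0, 0)))  # 0x320, first pixel of the second image
print(expected_file_size(10000))   # 7840016
```

Offset 0xda is where the first nonzero byte (54) shows up in the hexdump of t10k-images-idx3-ubyte, so under the row-major assumption that pixel sits in row 7, column 6 of the first image.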
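t10k-labels-idx1-ubyte stores the label of each image, a digit from 0 to 9, and images and labels correspond one to one. To look at the actual contents of the two files we need a hex viewer, since they are binary; hexdump works well. It is usually preinstalled. If it is missing, on Debian/Ubuntu it ships in the bsdextrautils package (bsdmainutils on older releases), which can be installed with
sudo apt-get install bsdextrautils
Open the file with the following command: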
hexdump -Cv t10k-images-idx3-ubyte | more
The output looks like this:
00000000 00 00 08 03 00 00 27 10 00 00 00 1c 00 00 00 1c |......'.........|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000000a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000000b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000000c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000000d0 00 00 00 00 00 00 00 00 00 00 54 b9 9f 97 3c 24 |..........T...<$|
000000e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000000f0 00 00 00 00 00 00 de fe fe fe fe f1 c6 c6 c6 c6 |................|
00000100 c6 c6 c6 c6 aa 34 00 00 00 00 00 00 00 00 00 00 |.....4..........|
00000110 00 00 43 72 48 72 a3 e3 fe e1 fe fe fe fa e5 fe |..CrHr..........|
00000120 fe 8c 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000130 00 00 00 11 42 0e 43 43 43 3b 15 ec fe 6a 00 00 |....B.CCC;...j..|
00000140 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000150 00 00 00 00 00 00 53 fd d1 12 00 00 00 00 00 00 |......S.........|
00000160 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000170 00 16 e9 ff 53 00 00 00 00 00 00 00 00 00 00 00 |....S...........|
00000180 00 00 00 00 00 00 00 00 00 00 00 00 00 81 fe ee |................|
00000190 2c 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |,...............|
000001a0 00 00 00 00 00 00 00 00 3b f9 fe 3e 00 00 00 00 |........;..>....|
000001b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000001c0 00 00 00 00 85 fe bb 05 00 00 00 00 00 00 00 00 |................|
000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 09 |................|
000001e0 cd f8 3a 00 00 00 00 00 00 00 00 00 00 00 00 00 |..:.............|
000001f0 00 00 00 00 00 00 00 00 00 00 00 7e fe b6 00 00 |...........~....|
00000200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000210 00 00 00 00 00 00 4b fb f0 39 00 00 00 00 00 00 |......K..9......|
00000220 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000230 00 13 dd fe a6 00 00 00 00 00 00 00 00 00 00 00 |................|
00000240 00 00 00 00 00 00 00 00 00 00 00 00 03 cb fe db |................|
00000250 23 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |#...............|
00000260 00 00 00 00 00 00 00 00 26 fe fe 4d 00 00 00 00 |........&..M....|
00000270 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000280 00 00 00 1f e0 fe 73 01 00 00 00 00 00 00 00 00 |......s.........|
00000290 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 85 |................|
000002a0 fe fe 34 00 00 00 00 00 00 00 00 00 00 00 00 00 |..4.............|
000002b0 00 00 00 00 00 00 00 00 00 00 3d f2 fe fe 34 00 |..........=...4.|
000002c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000002d0 00 00 00 00 00 00 79 fe fe db 28 00 00 00 00 00 |......y...(.....|
000002e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000002f0 00 00 79 fe cf 12 00 00 00 00 00 00 00 00 00 00 |..y.............|
00000300 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000310 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000320 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000330 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000340 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000350 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000360 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000370 00 00 00 00 00 00 00 00 00 00 00 00 00 00 74 7d |..............t}|
00000380 ab ff ff 96 5d 00 00 00 00 00 00 00 00 00 00 00 |....]...........|
00000390 00 00 00 00 00 00 00 00 00 a9 fd fd fd fd fd fd |................|
000003a0 da 1e 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000003b0 00 00 00 00 a9 fd fd fd d5 8e b0 fd fd 7a 00 00 |.............z..|
000003c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 34 |...............4|
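Multi-byte fields are stored in big-endian order, so we can interpret the file contents using the table above.
The first 4 bytes are 00 00 08 03, i.e. 0x00000803, which is 2051 in decimal: the magic number defined in the table.
The next 4 bytes are 00 00 27 10, i.e. 0x00002710, which is 10000 in decimal: the item count defined in the table.
After that come two 4-byte values of 00 00 00 1c, both 28 in decimal: the number of pixel rows and columns of each image.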
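The same decoding can be checked in a few lines of Python. The sketch below takes the 16 bytes of the first hexdump line and reads them as four big-endian integers; the little-endian reading is included only to show why byte order matters.

```python
# First line of the hexdump of t10k-images-idx3-ubyte
first_line = bytes.fromhex("00 00 08 03 00 00 27 10 00 00 00 1c 00 00 00 1c")

# Split into four 4-byte fields and decode each one as big-endian
fields = [int.from_bytes(first_line[i:i + 4], "big") for i in range(0, 16, 4)]
print(fields)  # [2051, 10000, 28, 28]

# Reading the magic number with the wrong byte order gives a very different value
little = int.from_bytes(first_line[0:4], "little")
print(little)  # 50855936 (0x03080000)
```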
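Everything after that is pixel data, with 28×28 = 784 bytes per image. In the dump above most bytes are 00, the white background, while the nonzero bytes are the pixels of the written stroke. Plotting these 784 values on a 28×28 grid gives the figure below.
Clearly, this is the digit 7. The rest of the file holds the remaining 9999 images, stored the same way; if you are curious, try plotting a few of them. With that, the t10k-images-idx3-ubyte file is fully laid out in front of us.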
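Instead of plotting by hand, the pixels can be printed as rough ASCII art. The sketch below is a hypothetical helper (`render_ascii` and the 128 threshold are arbitrary choices for this example) and it assumes row-by-row pixel order. The demo renders only rows 7 and 8 of the first image, copied from the hexdump above at offsets 0xd4 to 0x10b.

```python
def render_ascii(pixels, rows, cols):
    """Turn rows*cols gray values into text lines: '.' empty, '+' faint, '#' dark."""
    if len(pixels) != rows * cols:
        raise ValueError("pixel count does not match rows * cols")

    def shade(value):
        if value == 0:
            return "."
        return "#" if value >= 128 else "+"

    return [
        "".join(shade(v) for v in pixels[r * cols:(r + 1) * cols])
        for r in range(rows)
    ]


# Rows 7 and 8 of the first test image, taken from the hexdump above
row7 = bytes(6) + bytes.fromhex("54 b9 9f 97 3c 24") + bytes(16)
row8 = (bytes(6) + bytes.fromhex("de fe fe fe fe f1")
        + bytes.fromhex("c6") * 8 + bytes.fromhex("aa 34") + bytes(6))

for line in render_ascii(row7 + row8, 2, 28):
    print(line)
```

For a full image, pass the 784 bytes starting at offset 16 of the file together with `rows=28, cols=28`; the two rows here already show the top bar of the 7.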
Next, let us look at the t10k-labels-idx1-ubyte file, which stores the labels. Open it with the following command:
hexdump -Cv t10k-labels-idx1-ubyte | more
The output is shown below.
00000000 00 00 08 01 00 00 27 10 07 02 01 00 04 01 04 09 |......'.........|
00000010 05 09 00 06 09 00 01 05 09 07 03 04 09 06 06 05 |................|
00000020 04 00 07 04 00 01 03 01 03 04 07 02 07 01 02 01 |................|
00000030 01 07 04 02 03 05 01 02 04 04 06 03 05 05 06 00 |................|
00000040 04 01 09 05 07 08 09 03 07 04 06 04 03 00 07 00 |................|
00000050 02 09 01 07 03 02 09 07 07 06 02 07 08 04 07 03 |................|
00000060 06 01 03 06 09 03 01 04 01 07 06 09 06 00 05 04 |................|
00000070 09 09 02 01 09 04 08 07 03 09 07 04 04 04 09 02 |................|
00000080 05 04 07 06 07 09 00 05 08 05 06 06 05 07 08 01 |................|
00000090 00 01 06 04 06 07 03 01 07 01 08 02 00 02 09 09 |................|
000000a0 05 05 01 05 06 00 03 04 04 06 05 04 06 05 04 05 |................|
000000b0 01 04 04 07 02 03 02 07 01 08 01 08 01 08 05 00 |................|
000000c0 08 09 02 05 00 01 01 01 00 09 00 03 01 06 04 02 |................|
000000d0 03 06 01 01 01 03 09 05 02 09 04 05 09 03 09 00 |................|
000000e0 03 06 05 05 07 02 02 07 01 02 08 04 01 07 03 03 |................|
000000f0 08 08 07 09 02 02 04 01 05 09 08 07 02 03 00 04 |................|
00000100 04 02 04 01 09 05 07 07 02 08 02 06 08 05 07 07 |................|
00000110 09 01 08 01 08 00 03 00 01 09 09 04 01 08 02 01 |................|
00000120 02 09 07 05 09 02 06 04 01 05 08 02 09 02 00 04 |................|
00000130 00 00 02 08 04 07 01 02 04 00 02 07 04 03 03 00 |................|
00000140 00 03 01 09 06 05 02 05 09 02 09 03 00 04 02 00 |................|
00000150 07 01 01 02 01 05 03 03 09 07 08 06 05 06 01 03 |................|
00000160 08 01 00 05 01 03 01 05 05 06 01 08 05 01 07 09 |................|
00000170 04 06 02 02 05 00 06 05 06 03 07 02 00 08 08 05 |................|
00000180 04 01 01 04 00 03 03 07 06 01 06 02 01 09 02 08 |................|
00000190 06 01 09 05 02 05 04 04 02 08 03 08 02 04 05 00 |................|
000001a0 03 01 07 07 05 07 09 07 01 09 02 01 04 02 09 02 |................|
000001b0 00 04 09 01 04 08 01 08 04 05 09 08 08 03 07 06 |................|
000001c0 00 00 03 00 02 06 06 04 09 03 03 03 02 03 09 01 |................|
000001d0 02 06 08 00 05 06 06 06 03 08 08 02 07 05 08 09 |................|
000001e0 06 01 08 04 01 02 05 09 01 09 07 05 04 00 08 09 |................|
000001f0 09 01 00 05 02 03 07 08 09 04 00 06 03 09 05 02 |................|
00000200 01 03 01 03 06 05 07 04 02 02 06 03 02 06 05 04 |................|
00000210 08 09 07 01 03 00 03 08 03 01 09 03 04 04 06 04 |................|
00000220 02 01 08 02 05 04 08 08 04 00 00 02 03 02 07 07 |................|
00000230 00 08 07 04 04 07 09 06 09 00 09 08 00 04 06 00 |................|
00000240 06 03 05 04 08 03 03 09 03 03 03 07 08 00 08 02 |................|
00000250 01 07 00 06 05 04 03 08 00 09 06 03 08 00 09 09 |................|
00000260 06 08 06 08 05 07 08 06 00 02 04 00 02 02 03 01 |................|
00000270 09 07 05 01 00 08 04 06 02 06 07 09 03 02 09 08 |................|
00000280 02 02 09 02 07 03 05 09 01 08 00 02 00 05 02 01 |................|
00000290 03 07 06 07 01 02 05 08 00 03 07 02 04 00 09 01 |................|
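Again we read it against the table above.
The first 4 bytes are 00 00 08 01, i.e. 0x00000801, which is 2049 in decimal: the magic number defined in the table.
The next 4 bytes are 00 00 27 10, i.e. 0x00002710, which is 10000 in decimal: the item count defined in the table.
After that come the label values. The first one is 07, which matches the digit in the image we traced out of t10k-images-idx3-ubyte. The rest of the file holds the remaining 9999 labels, each from 0 to 9.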
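As a quick check, the first hexdump line of the label file can be decoded the same way: an 8-byte big-endian header followed by one byte per label. The bytes below are copied from the dump above.

```python
import struct

# First line of the hexdump of t10k-labels-idx1-ubyte
first_line = bytes.fromhex("00 00 08 01 00 00 27 10 07 02 01 00 04 01 04 09")

# Two big-endian uint32 header fields, then one uint8 per label
magic, count = struct.unpack(">II", first_line[:8])
labels = list(first_line[8:])

print(magic, count)  # 2049 10000
print(labels)        # [7, 2, 1, 0, 4, 1, 4, 9]
```

The first label, 7, belongs to the first image in t10k-images-idx3-ubyte, the second label to the second image, and so on.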
We will not go through the training-set files train-images-idx3-ubyte and train-labels-idx1-ubyte here; their layouts are exactly the same as those of the test-set files.
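Because all four files share one layout, a single reader can handle them. The sketch below is an assumption-laden illustration rather than a reference loader: `read_idx` is a made-up name, and it relies on the usual description of the IDX format in which the third byte of the magic number is the data type (0x08 for unsigned byte) and the fourth byte is the number of dimensions (3 for images, 1 for labels), which is consistent with the values 0x0803 and 0x0801 seen above. The demo runs on a tiny synthetic file, not on real MNIST data.

```python
import math
import os
import struct
import tempfile


def read_idx(path):
    """Return (dims, payload) for an unsigned-byte IDX file such as the MNIST files."""
    with open(path, "rb") as f:
        data = f.read()
    # Magic number: two zero bytes, a type code, and the number of dimensions
    zero, type_code, ndim = struct.unpack(">HBB", data[:4])
    if zero != 0 or type_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    # One big-endian uint32 per dimension, e.g. (10000, 28, 28) or (10000,)
    header_end = 4 + 4 * ndim
    dims = struct.unpack(f">{ndim}I", data[4:header_end])
    payload = data[header_end:]
    if len(payload) != math.prod(dims):
        raise ValueError("payload size does not match the header")
    return dims, payload


# Demo on a synthetic file: 2 "images" of 2x2 pixels
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo-images-idx3-ubyte")
    with open(demo_path, "wb") as f:
        f.write(struct.pack(">IIII", 2051, 2, 2, 2) + bytes(range(8)))
    dims, payload = read_idx(demo_path)

print(dims, list(payload))  # (2, 2, 2) [0, 1, 2, 3, 4, 5, 6, 7]
```

On the real files, `read_idx("train-images-idx3-ubyte")` would be expected to return dimensions of (60000, 28, 28) and `read_idx("train-labels-idx1-ubyte")` dimensions of (60000,), given the counts stated at the top of this article.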