[PyTorch] Using the ImageNet dataset and building miniImageNet

1. ImageNet download and introduction
2. miniImageNet
- 2.1 Division of miniImageNet
3. Use ImageFolder to build a dataset class

1. ImageNet download and introduction

ImageNet is a large-scale computer vision dataset started in 2007 by Stanford University and other institutions. Since its release in 2009, it has become a standard benchmark in computer vision. It now contains more than 14 million images and is one of the most commonly used datasets for image classification, detection, and localization in deep learning.

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a competition based on ImageNet, and the dataset it uses is a subset of ImageNet. The official website provides the datasets for ILSVRC2011 through ILSVRC2017, of which ILSVRC2012 is the most commonly used. It has 1000 classes with about 1000 images each: about 1.2 million images form the training set, 50,000 the validation set, and 100,000 the (unlabeled) test set.

1.1 Download address

The official ImageNet website is https://image-net.org/ . The dataset is no longer openly downloadable: you must register and verify with an educational (.edu) email address. When I downloaded it, the official site only reached about 300 KB/s, which is far too slow. Instead, you can use a torrent client such as Xunlei with the following torrent links:

Validation set
http://academictorrents.com/download/5d6d0df7ed81efd49ca99ea4737e0ae5e3a5f2e5.torrent
Training set
http://academictorrents.com/download/a306397ccf9c2ead27155983c254227c0fd938e2.torrent

I also recommend https://academictorrents.com/ , which hosts download links for most commonly used datasets.

1.2 Preliminary processing

After the download completes, you have two files: ILSVRC2012_img_train.tar and ILSVRC2012_img_val.tar. Open Git Bash in the directory containing them and run:

md5sum ILSVRC2012_img_train.tar
md5sum ILSVRC2012_img_val.tar

This prints the MD5 checksums of the two files. If the download is complete and uncorrupted, they should match those given on the official website:

Training set: 1d675b47d978889d74fa0da5fadfb00e
Validation set: 29b22e2961454d5413ddabcf34fc5622
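
If md5sum is not available (for example, outside Git Bash), the same check can be done in Python. The following is a minimal sketch using only the standard library; the helper names file_md5 and verify are my own and not part of any ImageNet tooling.

```python
import hashlib

# Expected checksums, copied from the values listed above.
EXPECTED_MD5 = {
    "ILSVRC2012_img_train.tar": "1d675b47d978889d74fa0da5fadfb00e",
    "ILSVRC2012_img_val.tar": "29b22e2961454d5413ddabcf34fc5622",
}


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read 1 MiB at a time so the 100+ GB archive never sits in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Return True if the file at `path` has the expected MD5 digest."""
    return file_md5(path) == expected.lower()
```

Calling verify("ILSVRC2012_img_val.tar", EXPECTED_MD5["ILSVRC2012_img_val.tar"]) should return True for an intact download.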

Next, extract the archives. The phase parameter selects the training set (train) or the validation set (val).

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--tar_dir', type=str)  # path to the downloaded .tar archive
args = parser.parse_args()


def untarring(phase):
    # phase is 'train' or 'val'
    if args.tar_dir is None:
        raise ValueError("tar_dir must not be None")
    print('Untarring ILSVRC2012 ' + phase + ' package')
    imagenet_dir = './ImageNet/' + phase
    # makedirs also creates the parent ./ImageNet directory if it is missing
    os.makedirs(imagenet_dir, exist_ok=True)
    os.system('tar xvf ' + str(args.tar_dir) + ' -C ' + imagenet_dir)
    return imagenet_dir

The file structure obtained after decompression is as follows:

|-ImageNet
	|-train
		|-class0.tar
		|-class1.tar
		|-...
	|-val
		|-img1.JPEG
		|-...
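
As the tree shows, the training set is still packed as one archive per class. The following is a minimal sketch of unpacking them with Python's tarfile module, assuming you want one folder per class; the function name extract_class_archives is hypothetical. The miniImageNet script later in this article instead extracts the archives into a single flat folder.

```python
import os
import tarfile


def extract_class_archives(train_dir):
    """Extract every <class>.tar in train_dir into a folder named after the class."""
    extracted = []
    for name in sorted(os.listdir(train_dir)):
        if not name.endswith(".tar"):
            continue  # skip folders and any non-archive files
        class_name = name[:-len(".tar")]
        class_dir = os.path.join(train_dir, class_name)
        os.makedirs(class_dir, exist_ok=True)
        with tarfile.open(os.path.join(train_dir, name)) as archive:
            archive.extractall(class_dir)
        extracted.append(class_name)
    return extracted
```

The function returns the list of class names it unpacked, which is handy for a quick sanity check on the class count.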

1.3 devkit introduction

In addition to the dataset, the official website also provides a development kit, ILSVRC2012_devkit_t12, which contains metadata about the dataset.

The file .\ILSVRC2012_devkit_t12\data\ILSVRC2012_validation_ground_truth.txt gives the labels of the validation samples. For example, the label of the first validation sample is 490, the second is 361, and so on.
In addition, the same directory contains a meta file (meta.mat) that records the class information of the dataset.
The synsets field in meta contains ILSVRC2012_ID, WNID, and other information. WNID (WordNet ID) is the name of each class folder and serves as the class name. ILSVRC2012_ID is the identifier of the class in the ILSVRC2012 dataset, and the validation labels use this identifier. IDs from 1 to 1000 identify the 1000 individual categories; IDs greater than 1000 denote higher-level synsets that group similar categories together, which forms a tree structure.
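
To make the relationship between the ground-truth file and the two identifiers concrete, here is a minimal sketch. It assumes the ground-truth file holds one integer ILSVRC2012_ID per line, and that you have already built a dict from ILSVRC2012_ID to WNID out of the meta file; both function names and the id_to_wnid argument are my own.

```python
def load_val_labels(ground_truth_path):
    """Read one ILSVRC2012_ID per line; entry i belongs to validation image i + 1."""
    with open(ground_truth_path) as f:
        return [int(line) for line in f if line.strip()]


def labels_to_wnids(labels, id_to_wnid):
    """Translate ILSVRC2012_IDs into WNIDs (class folder names) using a lookup dict."""
    return [id_to_wnid[label] for label in labels]
```

With these two helpers, the validation images can be sorted into class folders named by WNID, which is the layout ImageFolder expects.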

2. miniImageNet

miniImageNet is a subset of ILSVRC2012_img_train. It was first proposed by Vinyals et al. in "Matching Networks for One Shot Learning" for few-shot learning tasks. However, because the original split was not published at first, Ravi et al. defined their own split in "Optimization as a Model for Few-Shot Learning", which is now one of the commonly used miniImageNet splits.

2.1 Division of miniImageNet

In Ravi's split ( https://github.com/twitter-research/meta-learning-lstm/tree/master/data/miniImagenet ), 100 classes are randomly selected from ILSVRC2012: 64 are used as the training set, 16 as the validation set, and 20 as the test set, and each image is resized to 84×84. This article adopts the same split.

Ravi's CSV files list each filename together with its label name. Note, however, that these filenames do not correspond one-to-one with the filenames in the original ILSVRC2012 dataset: the serial number that follows the label is the position of the sample within its class after sorting (starting from 1).
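
The naming rule can be illustrated with a small sketch. It assumes filenames of the form <label><zero-padded index>.jpg; the helper name split_csv_filename and the example filename in the usage note are illustrative only.

```python
import os


def split_csv_filename(filename, label):
    """Split a name such as '<label><index>.jpg' into (label, 1-based index)."""
    stem, _ext = os.path.splitext(filename)
    if not stem.startswith(label):
        raise ValueError(f"{filename!r} does not start with label {label!r}")
    # Whatever follows the label is the position within the sorted class.
    return label, int(stem[len(label):])
```

For instance, split_csv_filename("n0153282900000005.jpg", "n01532829") yields the label and the index 5, meaning the fifth image of that class once the original files are sorted by their own numbering.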

The Python implementation for building miniImageNet is given below (from https://github.com/yaoyao-liu/mini-imagenet-tools ):

import argparse
import os
import numpy as np
import pandas as pd
import glob
import cv2
from shutil import copyfile
from tqdm import tqdm

# argument parser
parser = argparse.ArgumentParser(description='')
parser.add_argument('--tar_dir', type=str)
parser.add_argument('--phase', type=str, choices=['train', 'val'])
parser.add_argument('--imagenet_dir', type=str)
parser.add_argument('--miniImageNet_dir', type=str)
parser.add_argument('--split_filepath', type=str)
parser.add_argument('--image_resize', type=int, default=84)

args = parser.parse_args()


def untarring(phase):
    if args.tar_dir is None:
        raise ValueError("tar_dir must be not None")
    print('Untarring ILSVRC2012 ' + phase + ' package')
    imagenet_dir = './ImageNet/' + phase
    if not os.path.exists(imagenet_dir):
        os.mkdir(imagenet_dir)
    os.system('tar xvf ' + str(args.tar_dir) + ' -C ' + imagenet_dir)
    return imagenet_dir


class MiniImageNetGenerator(object):
    def __init__(self, input_args):
        self.processed_img_dir = './miniImageNet'
        self.mini_keys = None
        self.input_args = input_args
        self.imagenet_dir = input_args.imagenet_dir
        self.raw_mini_dir = './miniImageNet_raw'
        self.csv_paths = input_args.split_filepath
        if not os.path.exists(self.raw_mini_dir):
            os.mkdir(self.raw_mini_dir)
        self.image_resize = self.input_args.image_resize

    def untar_mini(self):
        self.mini_keys = ['n02110341', 'n01930112', 'n04509417', 'n04067472', 'n04515003', 'n02120079', 'n03924679',
                          'n02687172', 'n03075370', 'n07747607', 'n09246464', 'n02457408', 'n04418357', 'n03535780',
                          'n04435653', 'n03207743', 'n04251144', 'n03062245', 'n02174001', 'n07613480', 'n03998194',
                          'n02074367', 'n04146614', 'n04243546', 'n03854065', 'n03838899', 'n02871525', 'n03544143',
                          'n02108089', 'n13133613', 'n03676483', 'n03337140', 'n03272010', 'n01770081', 'n09256479',
                          'n02091244', 'n02116738', 'n04275548', 'n03773504', 'n02606052', 'n03146219', 'n04149813',
                          'n07697537', 'n02823428', 'n02089867', 'n03017168', 'n01704323', 'n01532829', 'n03047690',
                          'n03775546', 'n01843383', 'n02971356', 'n13054560', 'n02108551', 'n02101006', 'n03417042',
                          'n04612504', 'n01558993', 'n04522168', 'n02795169', 'n06794110', 'n01855672', 'n04258138',
                          'n02110063', 'n07584110', 'n02091831', 'n03584254', 'n03888605', 'n02113712', 'n03980874',
                          'n02219486', 'n02138441', 'n02165456', 'n02108915', 'n03770439', 'n01981276', 'n03220513',
                          'n02099601', 'n02747177', 'n01749939', 'n03476684', 'n02105505', 'n02950826', 'n04389033',
                          'n03347037', 'n02966193', 'n03127925', 'n03400231', 'n04296562', 'n03527444', 'n04443257',
                          'n02443484', 'n02114548', 'n04604644', 'n01910747', 'n04596742', 'n02111277', 'n03908618',
                          'n02129165', 'n02981792']

        for idx, keys in enumerate(self.mini_keys):
            print('Untarring ' + keys)
            os.system('tar xvf ' + self.imagenet_dir + '/' + keys + '.tar -C ' + self.raw_mini_dir)
        print('All the tar files are untarred')

    def process_original_files(self):
        split_lists = ['train', 'val', 'test']

        if not os.path.exists(self.processed_img_dir):
            os.makedirs(self.processed_img_dir)

        for this_split in split_lists:
            filename = os.path.join(self.csv_paths, this_split + '.csv')
            this_split_dir = self.processed_img_dir + '/' + this_split
            if not os.path.exists(this_split_dir):
                os.makedirs(this_split_dir)
            with open(filename) as csvfile:
                csv = pd.read_csv(csvfile, delimiter=',')
                images = {}
                print('Reading IDs....')

                for row in csv.values:
                    if row[1] in images.keys():
                        images[row[1]].append(row[0])
                    else:
                        images[row[1]] = [row[0]]

                print('Writing photos....')
                for cls in tqdm(images.keys()):
                    this_cls_dir = this_split_dir + '/' + cls
                    if not os.path.exists(this_cls_dir):
                        os.makedirs(this_cls_dir)
                    # find files which name matches '.../...cls...'
                    lst_files = glob.glob(self.raw_mini_dir + "/*" + cls + "*")
                    # sort file names, get index
                    lst_index = [int(i[i.rfind('_') + 1:i.rfind('.')]) for i in lst_files]
                    index_sorted = np.argsort(np.array(lst_index))
                    # get file names in miniImageNet, the name in csv indicates the file index in miniImageNet class
                    index_selected = [int(i[i.index('.') - 4:i.index('.')]) for i in images[cls]]
                    # note that names in csv begin from 1 not 0, get selected images indexes
                    selected_images = index_sorted[np.array(index_selected) - 1]
                    for i in np.arange(len(selected_images)):
                        if self.image_resize == 0:
                            copyfile(lst_files[selected_images[i]], os.path.join(this_cls_dir, images[cls][i]))
                        else:
                            im = cv2.imread(lst_files[selected_images[i]])
                            im_resized = cv2.resize(im, (self.image_resize, self.image_resize),
                                                    interpolation=cv2.INTER_AREA)
                            cv2.imwrite(os.path.join(this_cls_dir, images[cls][i]), im_resized)


if __name__ == "__main__":
    dataset_generator = MiniImageNetGenerator(args)
    dataset_generator.untar_mini()
    dataset_generator.process_original_files()

After execution, the raw, unsorted images are stored in the miniImageNet_raw folder, and the miniImageNet folder has the following structure:

|-miniImageNet
	|-train
		|-class1
			|-img1.jpg
			|-...
		|-...
	|-val
	|-test

The val and test folders have the same structure as train.

3. Use ImageFolder to build a dataset class

PyTorch provides a very convenient class ImageFolder for building image datasets.

dataset = ImageFolder(root='./miniImageNet/train')

However, in a dataset built this way, labels are assigned according to the order of the class folders; for example, samples in the first folder are labeled as class 0. If we want the labels to match the dataset's own labels, we need to override some methods.
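
The default labeling can be reproduced without torchvision. The sketch below is a plain-Python approximation of that behavior (sorted folder names are numbered from 0), not the library's actual code; the function name default_class_to_idx is my own.

```python
import os


def default_class_to_idx(root):
    """Mimic the default labeling: sorted folder names get indices 0, 1, 2, ..."""
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    return {name: idx for idx, name in enumerate(classes)}
```

Applied to ./miniImageNet/train, this gives labels 0 to 63, which have nothing to do with the ILSVRC2012_ID of each class.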

3.1 Override methods in DatasetFolder

ImageFolder is a subclass of DatasetFolder. DatasetFolder provides two methods intended to be overridden:
find_classes and make_dataset. find_classes must return a list of class names and a dict mapping each class name to its label; make_dataset builds the list of samples from the mapping returned by find_classes.

The override is as follows:

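import os

import numpy as np
import pandas as pd
from scipy.io import loadmat
from torchvision.datasets import ImageFolder

meta_dir = os.path.join(os.getcwd(), 'meta_info')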
if not os.path.exists(meta_dir):
    os.makedirs(meta_dir)

meta_info_path = os.path.join(meta_dir, "meta_info.npy")
if not os.path.exists(meta_info_path):
    meta = loadmat('./ILSVRC2012_devkit_t12/data/meta.mat')
    meta = meta.get('synsets')
    meta = meta.reshape(1860)
    meta_id = [[i[0].item(), i[1].item()] for i in meta]
    meta_info = np.array(meta_id[:1000])
    np.save(meta_info_path, meta_info, allow_pickle=True)
else:
    meta_info = np.load(meta_info_path, allow_pickle=True)


class MiniImageNetFolder(ImageFolder):
    """
    Generate the miniImageNet dataset. This is a subclass of ImageFolder <- DatasetFolder.
    Overrides find_classes() so that the labels match ILSVRC2012_ID.

    -----------------------------------------------------------------------------------------
    Parameters:
      root: root directory of image dataset, should have structure of
        root/dog/xxx.png
        root/dog/xxy.png
        root/dog/[...]/xxz.png

        root/cat/123.png
        root/cat/nsdf3.png
        root/cat/[...]/asd932_.png

      phase: dataset split, one of {"train", "val", "test"}; selects ./split_csv/miniImageNet/<phase>.csv
    """

    def __init__(self, root, phase="train", transformer=None):

        self.meta_info = meta_info
        self.phase = phase
        super(MiniImageNetFolder, self).__init__(root=root, transform=transformer)

    def find_classes(self, directory):
        dic = {}
        names = np.unique(np.array(pd.read_csv('./split_csv/miniImageNet/' + self.phase + '.csv')['label']))
        for i in self.meta_info:
            if i[1] in names:
                dic[i[1]] = int(i[0])
        return list(names), dic

Among them, meta_info stores the correspondence between all classes and labels in ILSVRC2012.
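
To see the mapping in isolation, here is a minimal stand-alone sketch of the lookup that find_classes performs, run on toy data. It assumes meta_info rows are (ILSVRC2012_ID, WNID) pairs, as built above; the function name build_class_to_idx and the toy WNIDs are hypothetical.

```python
def build_class_to_idx(names, meta_rows):
    """Map each WNID in `names` to its ILSVRC2012_ID found in `meta_rows`.

    meta_rows is an iterable of (ILSVRC2012_ID, WNID) pairs, as stored in meta_info.
    """
    wanted = set(names)
    # IDs may be stored as strings (np.array mixes int and str), hence int().
    return {wnid: int(ilsvrc_id) for ilsvrc_id, wnid in meta_rows if wnid in wanted}


# Toy rows standing in for meta_info; real rows come from meta.mat.
toy_meta = [("1", "wnid_a"), ("2", "wnid_b"), ("3", "wnid_c")]
print(build_class_to_idx(["wnid_a", "wnid_c"], toy_meta))
```

Note that the resulting labels are not contiguous, so a classifier head trained directly on them would need as many outputs as the largest ID.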

3.2 BatchSampler implements episode sampling

In few-shot learning, a commonly used method is episodic training. Training is organized into many small tasks called episodes. Each episode samples K classes and, for each class, several support samples and query samples. The support samples are used to train the model, and the query samples are used to evaluate its performance. An episode is usually described as K-way N-shot, where K is the number of classes and N is the number of support samples per class. If we want to implement 5-way 5-shot episodic training (assuming 1 query sample per class), then 5×(5+1)=30 samples need to be drawn from the training set each time.
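
The arithmetic generalizes directly. A tiny sketch, with a function name of my own choosing:

```python
def episode_size(n_way, n_shot, n_query):
    """Number of samples in one episode: n_way classes x (support + query) each."""
    return n_way * (n_shot + n_query)


print(episode_size(5, 5, 1))  # 5 x (5 + 1) = 30
```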

The sampling strategy can be implemented through BatchSampler:

import os

import numpy as np
import torch


class PrototypicalBatchSampler(object):
    """
    Adopted from
    https://github.com/orobix/Prototypical-Networks-for-Few-shot-Learning-PyTorch/blob/master/src/prototypical_batch_sampler.py

    Yield a batch of indexes at each iteration.
    Indexes are calculated by keeping in account 'classes_per_it' and 'num_samples',
    Each iteration the batch indexes will refer to  'num_support' + 'num_query' samples
    for 'classes_per_it' random classes.

    __len__ returns the number of episodes per epoch (same as 'self.iterations').

    ----------------------------------------------------------------------------------------------
    Parameters:
        labels: ndarray, all labels for current dataset
        classes_per_episode: int, number of classes in one episode
        sample_per_class: int, number of samples per class (support + query)
        iterations: int, number of episodes in one epoch
    """

    def __init__(self, labels, classes_per_episode, sample_per_class, iterations, dataset_name="miniImageNet_train"):
        """
        Initialize the PrototypicalBatchSampler object
        Args:
        - labels: an iterable containing all the labels for the current dataset;
          sample indexes will be inferred from this iterable.
        - classes_per_episode: number of random classes for each iteration
        - sample_per_class: number of samples for each iteration for each class (support + query)
        - iterations: number of iterations (episodes) per epoch
        - dataset_name: prefix of the cached index files
        """
        super(PrototypicalBatchSampler, self).__init__()
        self.labels = labels
        self.classes_per_it = classes_per_episode
        self.sample_per_class = sample_per_class
        self.iterations = iterations
        self.dataset_name = dataset_name

        # np.unique removes duplicates from the array and returns the sorted unique values
        self.classes, self.counts = np.unique(self.labels, return_counts=True)
        self.classes = torch.LongTensor(self.classes)

        # create a matrix, indexes, of dim: classes X max(elements per class)
        # fill it with nans
        # for every class c, fill the relative row with the indices samples belonging to c
        # in numel_per_class we store the number of samples for each class/row
        cache_dir = os.path.join(os.getcwd(), 'episode_idx')
        indexes_path = os.path.join(cache_dir, self.dataset_name + '_indexes.npy')
        numel_per_class_path = os.path.join(cache_dir, self.dataset_name + '_numel_per_class.npy')
        # rebuild the index if either cached file is missing
        if not os.path.exists(indexes_path) or not os.path.exists(numel_per_class_path):
            print("Create dataset indexes")
            self.idxs = range(len(self.labels))
            self.indexes = np.empty((len(self.classes), max(self.counts)), dtype=int) * np.nan
            self.indexes = torch.Tensor(self.indexes)
            self.numel_per_class = torch.zeros_like(self.classes)
            for idx, label in enumerate(self.labels):
                label_idx = np.argwhere(self.classes == label).item()
                # np.where(condition) with only a condition (no x and y) returns the coordinates of the
                # elements that satisfy it (equivalent to numpy.nonzero), as a tuple with one array per dimension.
                # Here it returns the positions in row label_idx that are still nan, so [0][0] is the first free slot.
                # Store idx (the sample's position in the labels array) in that slot of the class row.
                self.indexes[label_idx, np.where(np.isnan(self.indexes[label_idx]))[0][0]] = idx
                self.numel_per_class[label_idx] += 1
            save_path = os.path.join(os.getcwd(), 'episode_idx')
            if not os.path.exists(save_path):
                os.makedirs(save_path)
            np.save(os.path.join(save_path, self.dataset_name) + "_indexes.npy", self.indexes)
            np.save(os.path.join(save_path, self.dataset_name) + "_numel_per_class.npy", self.numel_per_class)
        else:
            print("Read Dataset indexes.")
            self.indexes = torch.tensor(np.load(indexes_path))
            self.numel_per_class = torch.tensor(np.load(numel_per_class_path))

    def __iter__(self):
        """
        yield a batch (episode) of indexes
        """
        spc = self.sample_per_class
        cpi = self.classes_per_it

        for it in range(self.iterations):
            batch_size = spc * cpi
            batch = torch.LongTensor(batch_size)
            # randomly pick cpi classes
            c_idxs = torch.randperm(len(self.classes))[:cpi]
            for i, c in enumerate(self.classes[c_idxs]):
                # slice of the batch reserved for the i-th selected class
                s = slice(i * spc, (i + 1) * spc)
                # FIXME when torch.argwhere will exists
                # find the row index (label_idx) of class c
                label_idx = torch.arange(len(self.classes)).long()[self.classes == c].item()
                # randomly pick spc samples from class label_idx
                sample_idxs = torch.randperm(self.numel_per_class[label_idx])[:spc]
                # write the indexes of these samples into the batch
                batch[s] = self.indexes[label_idx][sample_idxs]
            # randomly shuffle the batch
            batch = batch[torch.randperm(len(batch))]
            yield batch

    def __len__(self):
        """
        returns the number of iterations (episodes, batches) per epoch
        """
        return self.iterations

In this way, each training epoch draws `iterations` batches, and each batch holds the samples of one K-way N-shot episode. The total number of episodes the model is trained on is therefore epochs × iterations.
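
The sampling idea itself does not depend on PyTorch. Below is a simplified pure-Python sketch of the same strategy (pick classes at random, then pick samples within each class, then shuffle), written only to illustrate the logic; it is not the sampler above, and the names sample_episode and epoch_of_episodes are my own.

```python
import random
from collections import defaultdict


def sample_episode(labels, n_way, n_per_class, rng=random):
    """Return indices for one episode: n_way random classes, n_per_class samples each."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = rng.sample(sorted(by_class), n_way)
    batch = []
    for label in chosen:
        # Sampling without replacement, so each class needs >= n_per_class samples.
        batch.extend(rng.sample(by_class[label], n_per_class))
    rng.shuffle(batch)
    return batch


def epoch_of_episodes(labels, n_way, n_per_class, iterations, rng=random):
    """Yield `iterations` episodes, i.e. one epoch in the sense used above."""
    for _ in range(iterations):
        yield sample_episode(labels, n_way, n_per_class, rng)
```

Passing a seeded random.Random instance as rng makes the episodes reproducible, which helps when debugging a training run.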

3.3 Batch visualization

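import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from torchvision.transforms import Compose, ToTensor

trans = Compose([ToTensor()])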
dataset = MiniImageNetFolder(root='F:/processed_images/train/', phase="train", transformer=trans)
dataloader = DataLoader(dataset=dataset, batch_sampler=PrototypicalBatchSampler(dataset.targets, 5, 5, 10))


def visual_batch(dataloader, dataset_name):
    """
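    Visualize the first batch drawn from a dataloader.
    :param dataloader: DataLoader yielding (x, y), where x has shape [batch_size, 3, h, w]
        and y holds the ILSVRC2012_ID label of each image
    :param dataset_name: str, used in the name of the saved figure
    :return: None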
    """
    x, y = next(iter(dataloader))
    plt.figure(figsize=(12, 12))
    for i in range(x.shape[0]):
        plt.subplot(5, 5, i + 1)
        idx = y[i].item() - 1
        plt.title(meta_info[idx, 1])
        plt.imshow(x[i].permute(1, 2, 0))
        plt.axis('off')

    if not os.path.exists(os.path.join(os.getcwd(), 'imgs')):
        os.makedirs(os.path.join(os.getcwd(), 'imgs'))
    plt.savefig('./imgs/visual_batch_' + dataset_name + '.png')

visual_batch(dataloader, "miniImageNet_train")

The first batch is visualized as follows:
[Figure: the 25 images of the first sampled batch, each titled with its class WNID]