
Keras trains the model through model.fit_generator (to save memory)

Preface

While training a model some time ago, I found that when the training set is large and the input images are high-dimensional, it is easy to run out of memory. As a simple example, suppose we have 20,000 samples and each input image is 224x224x3, stored as float32. If we load all of the data into memory at once, we need 20000 x 224 x 224 x 3 x 32 bit / 8 ≈ 11.2 GB of memory, so loading the whole data set in one go requires a lot of memory.
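As a quick check of that figure, a few lines of Python reproduce the calculation (the numbers are simply the ones from the example above):

import numpy as np

# Rough estimate of the memory needed to hold the whole training set at once
n_samples = 20000
h, w, c = 224, 224, 3
bytes_per_value = np.dtype(np.float32).itemsize  # 4 bytes per float32 value

total_bytes = n_samples * h * w * c * bytes_per_value
print(f"{total_bytes / 1024 ** 3:.1f} GiB")  # -> 11.2 GiB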

If we use Keras's fit function directly, we have to pass in all of the training data at once. Fortunately, Keras also provides fit_generator, which reads the data in batches and saves memory. The only thing we need to do is implement a generator.

1. Introduction to the fit_generator function

fit_generator(generator,
              steps_per_epoch=None,
              epochs=1,
              verbose=1,
              callbacks=None,
              validation_data=None,
              validation_steps=None,
              class_weight=None,
              max_queue_size=10,
              workers=1,
              use_multiprocessing=False,
              shuffle=True,
              initial_epoch=0)

Parameters:

generator: A generator, or an instance of a Sequence (keras.utils.Sequence) object. This is the part we have to implement ourselves; both the plain-generator and the Sequence implementations are introduced below.

steps_per_epoch: How many times the generator has to be called in each epoch to produce data. fit_generator has no batch_size parameter; batching is controlled through steps_per_epoch, because each call to the generator yields one batch, so steps_per_epoch is normally set to (number of samples / batch_size). If our generator is a Sequence, this parameter is optional and defaults to len(generator). (A short usage sketch follows the parameter list.)

epochs: The number of epochs to train for.

verbose: 0, 1 or 2. Log display mode: 0 = silent, 1 = progress bar, 2 = one line per epoch.

callbacks: A series of callback functions called during training.

validation_data: Similar to our generator, but this one is used for validation and does not participate in training.

validation_steps: Similar to steps_per_epoch, but for the validation generator.

class_weight: Optional dictionary mapping class indices (integers) to weights (floats), used to weight the loss function (during training only). This can be used to tell the model to "pay more attention" to samples from under-represented classes. (In practice this parameter is rarely needed.)

max_queue_size: integer. The maximum size of the generator queue. The default is 10.

workers: Integer. Maximum number of processes to spin up when using process-based threading. If not specified, workers defaults to 1. If it is 0, the generator is executed on the main thread.

use_multiprocessing: Boolean. If True, use process-based threading. Defaults to False.

shuffle: Whether to shuffle the order of the batches at the beginning of each epoch. Only used with instances of Sequence (keras.utils.Sequence).

initial_epoch: The epoch at which to start training (useful for resuming a previous training run).
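As referenced above, here is a minimal sketch of a typical fit_generator call. The names model, train_gen, val_gen, num_train_samples and num_val_samples, as well as the concrete batch size and epoch count, are assumptions for illustration, not part of the original example:

import math

batch_size = 32
model.fit_generator(
    train_gen,                                                   # our generator or Sequence
    steps_per_epoch=math.ceil(num_train_samples / batch_size),   # batches per epoch
    epochs=10,
    validation_data=val_gen,
    validation_steps=math.ceil(num_val_samples / batch_size),
    max_queue_size=10,
    workers=1,
    use_multiprocessing=False,
    verbose=1,
)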

2. Generator implementation

2.1 How the generator is implemented

Sample code:

from keras.models import Sequential
from keras.layers import Dense
import numpy as np
from PIL import Image


def process_x(path):
    img = Image.open(path)
    img = img.resize((96, 96))
    img = img.convert('RGB')
    img = np.array(img)

    img = np.asarray(img, np.float32) / 255.0
    # You could also apply some data augmentation here
    return img


def generate_arrays_from_file(x_y):
    # x_y is our training set including labels: the first element of each row
    # is the image path, the remaining elements are the labels

    global count
    batch_size = 8
    while 1:
        batch_x = x_y[(count - 1) * batch_size:count * batch_size, 0]
        batch_y = x_y[(count - 1) * batch_size:count * batch_size, 1:]

        batch_x = np.array([process_x(img_path) for img_path in batch_x])
        batch_y = np.array(batch_y).astype(np.float32)
        print("count:" + str(count))
        count = count + 1
        yield batch_x, batch_y


model = Sequential()
model.add(Dense(units=1000, activation='relu', input_dim=2))
model.add(Dense(units=2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
count = 1
x_y = []  # placeholder: in practice a 2D array where each row is [image_path, label, ...]
model.fit_generator(generate_arrays_from_file(x_y), steps_per_epoch=10, epochs=2, max_queue_size=1, workers=1)

Before walking through the code above, we first need to understand how yield works.

yield keyword:

Let's first look at the usage of yield through an example:

def foo():
    print("starting...")
    while True:
        res = yield 4
        print("res:", res)


g = foo()
print(next(g))
print("----------")
print(next(g))

Output:

starting...
4
----------
res: None
4

A function that contains yield is a generator, not an ordinary function. Because the foo function contains the yield keyword, calling foo() does not execute its body; instead it returns a generator instance. When we call next() for the first time, foo starts running: it first executes the print statement, then enters the while loop. When execution reaches yield, yield acts like return: the function hands back 4 and execution is suspended. That is why our first call to next(g) produces the first two lines of output.

When we call next(g) again, execution resumes where it was suspended, i.e. with the assignment to res. Since 4 was already handed out during the previous call and nothing is sent back into the generator, the yield expression evaluates to None, so res is assigned None. The generator then executes print("res:", res), printing res: None, loops back to yield, returns 4 again, and suspends.

So the point of the yield keyword is that execution can resume from the place where the program last stopped; when we use it to build a generator, we can avoid the out-of-memory problems caused by reading all of the data at once.
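As a small aside (not part of the original example), the value of the yield expression is whatever gets sent into the generator: next(g) sends nothing, so res is None, while g.send(...) supplies a value:

def foo():
    print("starting...")
    while True:
        res = yield 4
        print("res:", res)


g = foo()
print(next(g))    # starts the generator: prints "starting...", then yields 4
print(g.send(7))  # resumes, res becomes 7, prints "res: 7", then yields 4 again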

Now back to the sample code above.

The generate_arrays_from_file function is our generator: in a loop it reads one batch of data, preprocesses it, and yields it. x_y is our training set with the image path and the label combined into one row, similar to the following form:

['data/img_4092.jpg' '0' '1' '0' '0' '0' ]

The format does not have to look exactly like this; it can be whatever suits your own data, and the preprocessing in process_x should match that format. Here, because the rows store image paths, the main job of process_x is to read the image and preprocess it. You can also add any other operations you need at this point, such as on-the-fly data augmentation.
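For illustration, here is a minimal sketch of how such an x_y array could be built. The file name labels.csv and the one-hot label columns are assumptions, not part of the original post:

import numpy as np

# Hypothetical labels.csv where each row is "image_path,label_0,label_1,..."
x_y = np.loadtxt('labels.csv', delimiter=',', dtype=str)
print(x_y[0])  # e.g. ['data/img_4092.jpg' '0' '1' '0' '0' '0']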

2.2 Using Sequence to implement the generator

Sample code:

import math

import numpy as np
from keras.utils import Sequence
from PIL import Image


class BaseSequence(Sequence):
    """
    Basic data-flow generator: each iteration returns one batch.
    A BaseSequence instance can be passed directly as the generator argument of fit_generator.
    fit_generator wraps the BaseSequence into a multiprocess data-flow generator,
    and it guarantees that no sample is taken twice within one epoch under multiprocessing.
    """

    def __init__(self, img_paths, labels, batch_size, img_size):
        # np.hstack stacks horizontally: the path column plus the label columns
        self.x_y = np.hstack((np.array(img_paths).reshape(len(img_paths), 1), np.array(labels)))
        self.batch_size = batch_size
        self.img_size = img_size

    def __len__(self):
        # math.ceil rounds up
        # Returned when len(BaseSequence) is called: the number of batches
        # that have to be read per epoch
        return math.ceil(len(self.x_y) / self.batch_size)

    def preprocess_img(self, img_path):
        img = Image.open(img_path)
        resize_scale = self.img_size[0] / max(img.size[:2])
        img = img.resize((self.img_size[0], self.img_size[0]))
        img = img.convert('RGB')
        img = np.array(img)

        # normalize the data to [0, 1]
        img = np.asarray(img, np.float32) / 255.0
        return img

    def __getitem__(self, idx):
        batch_x = self.x_y[idx * self.batch_size: (idx + 1) * self.batch_size, 0]
        batch_y = self.x_y[idx * self.batch_size: (idx + 1) * self.batch_size, 1:]
        batch_x = np.array([self.preprocess_img(img_path) for img_path in batch_x])
        batch_y = np.array(batch_y).astype(np.float32)
        print(batch_x.shape)
        return batch_x, batch_y

    # Override of the parent Sequence's on_epoch_end method, called at the end of every epoch.
    def on_epoch_end(self):
        # Re-shuffle the training data after each epoch
        np.random.shuffle(self.x_y)

In the code above, __len__ and __getitem__ are the magic methods we override. __len__ is called when len(BaseSequence) is invoked; here it returns (total number of samples / batch_size, rounded up), which is the value we would otherwise pass as steps_per_epoch to fit_generator. __getitem__ makes the object indexable and iterable, so that once the BaseSequence object is passed to fit_generator, the data can be read batch by batch by repeatedly calling the generator.

Here is an example to illustrate the role of __getitem__:

class Animal:
    def __init__(self, animal_list):
        self.animals_name = animal_list

    def __getitem__(self, index):
        return self.animals_name[index]


animals = Animal(["dog", "cat", "fish"])
for animal in animals:
    print(animal)

Output:

dog
cat
fish

Moreover, using the Sequence class guarantees that, even with multiprocessing, each sample in an epoch is used for training only once.
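To round this off, here is a minimal sketch of how BaseSequence could be plugged into fit_generator. The variables img_paths, labels and model, and the concrete batch size and worker count, are assumptions for illustration:

batch_size = 16
train_sequence = BaseSequence(img_paths, labels, batch_size, img_size=(96, 96))

model.fit_generator(
    train_sequence,
    steps_per_epoch=len(train_sequence),  # optional: defaults to len(train_sequence) for a Sequence
    epochs=10,
    workers=4,
    use_multiprocessing=True,  # safe with a Sequence: no duplicated samples within an epoch
    shuffle=True,
)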



Original article: blog.csdn.net/qingfengxd1/article/details/111032641