[Development skills] A concise guide to using generators to accelerate data reading and training in deep learning (TensorFlow, PyTorch, Keras)

Copyright notice: Copyright © Xiaosong — yansongsong.cn — reprinting welcome: https://blog.csdn.net/xiaosongshine/article/details/89213360


 

1. Problem Description

There is a well-known saying in deep learning: the data determines the upper limit of what can be achieved, and models and algorithms only ever approach that limit. This shows just how important the data is.

In deep learning development, a very important step before training and modeling is feature analysis and data reading. This article focuses on how to organize and read data; other topics will be covered in future posts.

The usual approach is to read all of the data into one array. For small datasets this is fine, but on larger datasets it causes serious problems:

  • Excessive memory usage: when training we generally use mini-batches, so there is no need to keep all of the data in memory; a slice of it is enough.
  • Long loading time: building an array that contains all the data means reading every element up front, and that reading time accumulates. None of it overlaps with training, so the total run time grows considerably.

The author ran into exactly these problems while working with a large dataset (millions of audio samples): reading everything into memory was simply impossible, and the accumulated reading time would have stretched over several days. The problems were eventually solved thanks to a powerful Python feature: the generator.

A generator provides exactly what we need: it returns the data batch by batch, and after returning one batch it resumes reading from where it left off, as the tiny example below illustrates.
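A minimal sketch of this pause-and-resume behavior (a toy counter, not the data loader itself):

def count_up_to(n):
    for i in range(n):
        yield i  # execution pauses here and resumes on the next request

gen = count_up_to(3)
print(next(gen))  # 0
print(next(gen))  # 1 -- picked up right where it left off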

2. Hands-on programming

2.1 Generate some fake data for demonstration

import numpy as np
import math

# fake data: column 0 is a "feature" (x*10), column 1 is its "label" (x)
data = np.array([[x * 10, x] for x in range(16)])

print(data)

Output

[[  0   0]
 [ 10   1]
 [ 20   2]
 [ 30   3]
 [ 40   4]
 [ 50   5]
 [ 60   6]
 [ 70   7]
 [ 80   8]
 [ 90   9]
 [100  10]
 [110  11]
 [120  12]
 [130  13]
 [140  14]
 [150  15]]

2.2 Building the generator

def xs_gen(data, batch_size):
    lists = data
    num_batch = math.ceil(len(lists) / batch_size)  # number of batches per epoch
    for i in range(num_batch):
        batch_list = lists[i * batch_size : i * batch_size + batch_size]
        np.random.shuffle(batch_list)  # shuffle within the batch (in place)
        batch_x = np.array([x for x in batch_list[:, 0]])
        batch_y = np.array([y for y in batch_list[:, 1]])
        yield batch_x, batch_y


For ease of demonstration, the generator above operates directly on the in-memory array. In real use you would typically iterate over a list of file paths and load each batch's data from those paths.
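A sketch of that path-based pattern — here load_sample is a hypothetical helper that reads one file and returns a (feature, label) pair; it is not defined in the original post:

def xs_gen_from_paths(path_list, batch_size):
    num_batch = math.ceil(len(path_list) / batch_size)
    for i in range(num_batch):
        batch_paths = path_list[i * batch_size : (i + 1) * batch_size]
        pairs = [load_sample(p) for p in batch_paths]  # hypothetical loader: reads one file from disk
        batch_x = np.array([p[0] for p in pairs])
        batch_y = np.array([p[1] for p in pairs])
        yield batch_x, batch_y

This way only one batch of files is ever held in memory at a time, and reading is spread across the training run instead of happening up front.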

2.3 Demo output

if __name__ == "__main__":

    # iterate twice to compare two passes over the data
    for x, y in xs_gen(data, 5):
        print("item", x, y)
    for x, y in xs_gen(data, 5):
        print("item", x, y)

The results are as follows

item [30 20 10  0 40] [3 2 1 0 4]
item [50 70 80 90 60] [5 7 8 9 6]
item [110 120 140 100 130] [11 12 14 10 13]
item [150] [15]
item [ 0 30 20 10 40] [0 3 2 1 4]
item [60 80 90 70 50] [6 8 9 7 5]
item [130 100 110 120 140] [13 10 11 12 14]
item [150] [15]

The batches are organized just as we intended, but there is a problem. Compare the first line with the fifth: although each batch is shuffled internally, the same five items always land in the same batch. Ideally each batch of five should be drawn randomly from the entire dataset.

How can we achieve that? We add a condition: when we are about to return the first batch, shuffle the entire list.

2.4 Improved generator function

def xs_gen_pro(data, batch_size):
    lists = data
    num_batch = math.ceil(len(lists) / batch_size)  # number of batches per epoch
    for i in range(num_batch):
        if i == 0:
            np.random.shuffle(lists)  # shuffle the whole list once per epoch
        batch_list = lists[i * batch_size : i * batch_size + batch_size]
        np.random.shuffle(batch_list)  # shuffle within the batch
        batch_x = np.array([x for x in batch_list[:, 0]])
        batch_y = np.array([y for y in batch_list[:, 1]])

        yield batch_x, batch_y

Running the demo again gives:

item [50 30 20 90 80] [5 3 2 9 8]
item [ 60   0 100 110  40] [ 6  0 10 11  4]
item [120  10 140 130 150] [12  1 14 13 15]
item [70] [7]
item [120  90  70  80 130] [12  9  7  8 13]
item [ 10 150 100   0  50] [ 1 15 10  0  5]
item [140  30  60  20 110] [14  3  6  2 11]
item [40] [4]

Now the data is fully randomized.

3. How to use generators in deep learning

3.1 Using a generator with TensorFlow or PyTorch

In TensorFlow or PyTorch, the generator can be used directly:

for epoch in range(num_epochs):
    for x, y in xs_gen_pro(data, batch_size):
        train(x, y)  # your training step (pseudocode)
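
As a rough PyTorch sketch of that loop — model, criterion, optimizer, num_epochs, and batch_size are assumed to be defined elsewhere; they are not part of the original post:

import torch

for epoch in range(num_epochs):
    for batch_x, batch_y in xs_gen_pro(data, batch_size):
        x = torch.from_numpy(batch_x).float()  # convert the NumPy batch to tensors
        y = torch.from_numpy(batch_y).long()
        optimizer.zero_grad()                  # model/criterion/optimizer: assumed defined
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()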

3.2 Using a generator with Keras

Keras needs some minor modifications: fit_generator expects a generator that can loop forever, so the batch loop is wrapped in while True, and the whole list is reshuffled at the start of every epoch.

def xs_gen_keras(data, batch_size):
    lists = data
    num_batch = math.ceil(len(lists) / batch_size)  # number of batches per epoch
    while True:  # Keras expects a generator that never runs out
        np.random.shuffle(lists)  # reshuffle the whole list every epoch
        for i in range(num_batch):
            batch_list = lists[i * batch_size : i * batch_size + batch_size]
            np.random.shuffle(batch_list)
            batch_x = np.array([x for x in batch_list[:, 0]])
            batch_y = np.array([y for y in batch_list[:, 1]])

            yield batch_x, batch_y

Training in Keras with the generator:

train_iter = xs_gen_keras(train_data, Batch_size)
val_iter = xs_gen_keras(val_data, Batch_size)

model.fit_generator(
        generator=train_iter,
        steps_per_epoch=Lens1 // Batch_size,
        epochs=10,
        validation_data=val_iter,
        validation_steps=Lens2 // Batch_size,
        )

A few notes on the parameters: val_iter is the validation generator; for simplicity I reuse the same generator function for training and validation here, but in real use you should build your own for each split. steps_per_epoch is the number of batches in one epoch; validation_steps (called nb_val_samples in older Keras versions) is defined analogously. Both are the total number of samples integer-divided by Batch_size, where Lens1 and Lens2 are the sizes of the training and validation sets. See the Keras documentation for the full parameter list.
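For instance, with hypothetical train_data and val_data arrays (names chosen here for illustration, not from the original post):

Batch_size = 16                           # assumed batch size
Lens1, Lens2 = len(train_data), len(val_data)
steps_per_epoch = Lens1 // Batch_size     # batches per training epoch
validation_steps = Lens2 // Batch_size    # batches per validation pass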

For a complete example of training a network with a generator, see my other blog post: https://blog.csdn.net/xiaosongshine/article/details/88972196
