Keras: training on large datasets that cannot be loaded into memory at once, using a generator (iterator)

Reprinted from: http://blog.csdn.net/lujiandong1/article/details/54869170

Note: this is a modified version of the official Keras demo https://github.com/fchollet/keras/blob/master/examples/imdb_cnn.py

1. A few observations. Reading data from a file during training lowers GPU utilization; if the data can be loaded into memory in full, GPU utilization is higher. Here is a comparison:

[Screenshot: GPU utilization with all data loaded into memory]

[Screenshot: GPU utilization when data is read from the file (via a queue) while training]
Conclusion: with everything loaded into memory, GPU utilization reaches 82%; when data is read from the file while training, it only reaches 48%.
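To make the comparison concrete, here is a minimal sketch of the two feeding styles, using the Keras 1.x API this post is written against (the model, arrays, and generator are placeholders assumed to exist already):

    # Sketch only: `model`, `X_train`, `y_train` and generate_arrays_from_file are placeholders.

    # (a) Everything in memory: GPU utilization around 82% in the comparison above.
    model.fit(X_train, y_train, batch_size=32, nb_epoch=10)

    # (b) Data streamed from the file through a generator while training: around 48%.
    model.fit_generator(generate_arrays_from_file('./train', batch_size=32),
                        samples_per_epoch=25024, nb_epoch=10)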


2. Keras implements training on large datasets with a generator (iterator). The basic idea is to read data from the file sequentially with the generator, so the training data must be shuffled beforehand, because the generator simply reads the next batch_size lines in order for each training step.

An example follows. Each line of the training file holds 401 comma-separated values: the first 400 are the features and the last one is the label.
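Since the generator reads the file strictly in order, the lines have to be pre-shuffled once before training. A minimal sketch of one way to do this (the file names './train_raw' and './train' are placeholders, not from the original post):

    import random

    # Shuffle the lines of the raw training file once, before training starts.
    with open('./train_raw') as f:
        lines = f.readlines()
    random.shuffle(lines)
    with open('./train', 'w') as f:
        f.writelines(lines)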


The official Keras example is as follows:

    def generate_arrays_from_file(path):
        while 1:
            f = open(path)
            for line in f:
                # create Numpy arrays of input data
                # and labels, from each line in the file
                x, y = process_line(line)
                yield (x, y)
            f.close()

    model.fit_generator(generate_arrays_from_file('/my_file.txt'),
                        samples_per_epoch=10000, nb_epoch=10)
Note: the official example is still flawed: it does not implement batch_size and can only yield one sample at a time. For the dataset above, I implemented a generator that yields batch_size samples at a time. The code is as follows:

    import numpy as np

    def process_line(line):
        tmp = [int(val) for val in line.strip().split(',')]
        x = np.array(tmp[:-1])
        y = np.array(tmp[-1:])
        return x, y

    def generate_arrays_from_file(path, batch_size):
        while 1:
            f = open(path)
            cnt = 0
            X = []
            Y = []
            for line in f:
                # create Numpy arrays of input data
                # and labels, from each line in the file
                x, y = process_line(line)
                X.append(x)
                Y.append(y)
                cnt += 1
                if cnt == batch_size:
                    cnt = 0
                    yield (np.array(X), np.array(Y))
                    X = []
                    Y = []
            f.close()
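A quick way to sanity-check the generator before training is to pull one batch out of it and inspect the shapes (a sketch; the file name './train' and the 400-feature width follow the setup above):

    gen = generate_arrays_from_file('./train', batch_size=32)
    X_batch, Y_batch = next(gen)
    print(X_batch.shape)  # expected: (32, 400)
    print(Y_batch.shape)  # expected: (32, 1)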

The code for training is as follows:

    model.fit_generator(generate_arrays_from_file('./train', batch_size=batch_size),
                        samples_per_epoch=25024, nb_epoch=nb_epoch,
                        validation_data=(X_test, y_test),
                        max_q_size=1000, verbose=1, nb_worker=1)
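Note that validation_data=(X_test, y_test) is an in-memory pair of arrays, so the test set still has to fit in memory. One possible way to build it with the same process_line helper (the file name './test' is an assumption, not from the original post):

    # Assumption: the test set is small enough to hold in memory and lives in './test',
    # with the same 400-features-plus-label format as the training data.
    X_test, y_test = [], []
    with open('./test') as f:
        for line in f:
            x, y = process_line(line)
            X_test.append(x)
            y_test.append(y)
    X_test = np.array(X_test)
    y_test = np.array(y_test)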

3. Description of samples_per_epoch:

My training file, train, has only 25000 rows and batch_size=32, so in principle samples_per_epoch should be 25000. But that triggers a warning: UserWarning: Epoch comprised more than `samples_per_epoch` samples, which might affect learning results


Note: the warning appears because the number of training samples is not an integer multiple of batch_size. You can set samples_per_epoch = ceil(train_num / batch_size) * batch_size. The resulting accuracy is 88.72% (training log screenshot omitted).
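Plugging in the numbers from this post, samples_per_epoch works out as follows (a small worked example):

    import math

    train_num = 25000
    batch_size = 32
    samples_per_epoch = int(math.ceil(train_num / float(batch_size))) * batch_size
    print(samples_per_epoch)  # ceil(25000 / 32) * 32 = 782 * 32 = 25024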


For comparison, the Keras demo itself loads all the data into memory for training (training log screenshot omitted):


The demo achieves 88.86%, so reading the data through a generator causes essentially no loss in accuracy. Just remember that the data must be shuffled first. And if everything fits into memory, load it all into memory, since training will be much faster.

