Reprinted from: http://blog.csdn.net/lujiandong1/article/details/54869170
Description: This post modifies the official Keras demo https://github.com/fchollet/keras/blob/master/examples/imdb_cnn.py
1. A few notes first. Reading data from a file during training lowers GPU utilization; if the data can be loaded entirely into memory, utilization is higher. A comparison:
GPU utilization with all data loaded into memory: (screenshot in original post)
GPU utilization when reading data through a queue while training: (screenshot in original post)
Conclusion: with everything loaded into memory, GPU utilization reaches 82%; when data is read from disk while training, it reaches only 48%.
2. Keras uses a generator to train on data too large for memory. The idea is simply to read data from the file sequentially with a generator, so your training data must be randomly shuffled beforehand, because the generator also reads batch_size lines at a time, in order, for training.
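A minimal way to do that shuffling is to permute the file's lines once before training (the function name and file paths here are my own illustration, not from the post):

```python
import random

def shuffle_file(src_path, dst_path, seed=None):
    """Read all lines of src_path, shuffle them, and write them to dst_path.

    Assumes the file fits in memory for this one-off shuffle step; for files
    that do not, shuffle a list of line offsets or use an external sort instead.
    """
    with open(src_path) as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)
    with open(dst_path, "w") as f:
        f.writelines(lines)
```

Passing a seed makes the shuffle reproducible across runs.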
An example follows. Each line of the data file holds comma-separated values: the first 400 are features and the last one is the label.
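For illustration, a line in that format can be generated like this (the 400-feature/1-label split is from the post; binary values and the helper name are my own assumption):

```python
import random

def make_sample_line(n_features=400, rng=random):
    """Build one CSV line: n_features 0/1 feature values followed by a 0/1 label."""
    features = [rng.randint(0, 1) for _ in range(n_features)]
    label = rng.randint(0, 1)
    return ",".join(str(v) for v in features + [label])
```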
The official Keras demo is as follows:

```python
def generate_arrays_from_file(path):
    while 1:
        f = open(path)
        for line in f:
            # create Numpy arrays of input data
            # and labels, from each line in the file
            x, y = process_line(line)
            yield (x, y)
        f.close()

model.fit_generator(generate_arrays_from_file('/my_file.txt'),
                    samples_per_epoch=10000, nb_epoch=10)
```
Here `process_line` parses one line into its features and label, and my modified `generate_arrays_from_file` yields a full batch of batch_size samples at a time:

```python
import numpy as np

def process_line(line):
    tmp = [int(val) for val in line.strip().split(',')]
    x = np.array(tmp[:-1])
    y = np.array(tmp[-1:])
    return x, y

def generate_arrays_from_file(path, batch_size):
    while 1:
        f = open(path)
        cnt = 0
        X = []
        Y = []
        for line in f:
            # create Numpy arrays of input data
            # and labels, from each line in the file
            x, y = process_line(line)
            X.append(x)
            Y.append(y)
            cnt += 1
            if cnt == batch_size:
                cnt = 0
                yield (np.array(X), np.array(Y))
                X = []
                Y = []
        f.close()
```
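As a quick sanity check (not part of the original post), the batched generator can be exercised on a tiny synthetic file; the generator and parser are repeated here so the snippet runs standalone:

```python
import numpy as np

def process_line(line):
    # Parse one CSV line: all values but the last are features, the last is the label.
    tmp = [int(val) for val in line.strip().split(',')]
    return np.array(tmp[:-1]), np.array(tmp[-1:])

def generate_arrays_from_file(path, batch_size):
    # Same logic as the generator above: loop over the file forever,
    # yielding batch_size (features, label) pairs at a time.
    while 1:
        with open(path) as f:
            X, Y, cnt = [], [], 0
            for line in f:
                x, y = process_line(line)
                X.append(x)
                Y.append(y)
                cnt += 1
                if cnt == batch_size:
                    yield (np.array(X), np.array(Y))
                    X, Y, cnt = [], [], 0
```

Pulling one batch from a file of 8 lines with 4 features each (via `next()`) should give arrays of shape (4, 4) and (4, 1) for batch_size=4.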
The code for training is as follows:
```python
model.fit_generator(generate_arrays_from_file('./train', batch_size=batch_size),
                    samples_per_epoch=25024, nb_epoch=nb_epoch,
                    validation_data=(X_test, y_test),
                    max_q_size=1000, verbose=1, nb_worker=1)
```
3. A note on samples_per_epoch:
My training file has only 25000 rows and batch_size=32. It stands to reason that samples_per_epoch=25000, but that produces a warning: UserWarning: Epoch comprised more than `samples_per_epoch` samples, which might affect learning results
Explanation: the warning appears because train_num/batch_size is not an integer. You can instead set samples_per_epoch = ceil(train_num/batch_size) * batch_size. The resulting accuracy is 88.72%:
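That rounding can be computed directly; with the post's 25000 rows and batch_size=32 it gives the 25024 used in the fit_generator call above (the helper name is my own):

```python
import math

def samples_per_epoch(train_num, batch_size):
    """Round train_num up to the nearest multiple of batch_size."""
    return int(math.ceil(train_num / float(batch_size))) * batch_size

# e.g. samples_per_epoch(25000, 32) rounds 25000 up to 782 * 32 = 25024
```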
By comparison, the Keras demo loads all of the data into memory for training:
The demo's result is 88.86%, so reading the data with a generator costs essentially nothing in accuracy. But the data must be shuffled first. And if everything fits in memory, load it all into memory, since training will be much faster.