The training speed is very slow in caffe CPU mode

As mentioned earlier, when training a model with caffe, the problem of only a single CPU core being used was solved with openblas, but even with multiple cores in use the speed is still painfully slow.


With my current network, training only manages about 30 iterations per minute (128 samples per iteration), i.e. 30*128 = 3840 samples/min. Why is it so slow? Is something wrong with the code, or is some setting off?


For comparison, training mnist with lenet on an 8-core CPU takes about 8 minutes to complete 10000 iterations (64 samples per iteration), i.e. 10000*64/8 = 80000 samples/min.
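For reference, this lenet/mnist run corresponds to the standard example shipped with caffe; a minimal way to reproduce it (a sketch, run from the caffe root, using the stock example scripts and solver) is:

- Bash code
# fetch mnist and convert it to lmdb
./data/mnist/get_mnist.sh
./examples/mnist/create_mnist.sh
# train the example lenet (10000 iterations, batch size 64 by default)
# for the CPU timings above, set solver_mode: CPU in the solver prototxt
./build/tools/caffe train --solver=examples/mnist/lenet_solver.prototxt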


From this comparison, the lenet training speed looks normal; posts online also suggest that training lenet takes about 10 minutes. So why is the gap so large? It should come down to the amount of computation in the network.


You can check a network's computational cost at https://dgschwend.github.io/netscope/#/editor: paste the network's deploy.prototxt on the left and press shift + enter to see the analysis.
The analysis shows that the lenet network costs 2.29M ops while our network costs 36.08M ops, which explains things: 3840*36.08 = 138547.2 and 80000*2.29 = 183200, which are reasonably close. So the training speed is roughly what should be expected.


In addition, I compared the training speed under caffe with the training under tensorflow, and found something very strange:


        tensorflow       caffe
GPU     700 iter/min     1500 iter/min
CPU     200 iter/min     20 iter/min
Is caffe's CPU mode really this slow? I still do not understand why.




Another experiment: train the lenet network on the mnist data set with both tensorflow and caffe (the network structure follows the example in caffe; a matching tensorflow lenet was written) and compare the time consumption.
The results below are for 10000 iterations of training on mnist with 64 samples per batch:


              GPU + CPU    CPU (cores unlimited)    CPU (8 cores)
tensorflow    50s          219s                     258s
caffe         30s          1000s+                   900s


This shows that caffe is indeed faster than tensorflow in GPU mode, but much slower in CPU mode. (There is also the oddity that the more cores caffe uses in CPU mode, the slower it gets; other people have reported this too, and I do not know why.)




Looking things up, it is mentioned that openblas needs to be compiled with OpenMP to be used in its multi-threaded form, and the openblas installed via apt-get may not be built with openmp by default. A previous blog post covered how to install openblas and train with multiple CPU cores.


1. First, compare the openblas installed through apt-get with an openblas compiled and installed by yourself
The openblas installed by sudo apt-get install libopenblas-dev goes into /usr/lib by default. You can check which shared libraries it depends on with the ldd command; the results are as follows:
- Bash code
ldd /usr/lib/libopenblas.so

linux-vdso.so.1 => (0x00007fffb435a000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fa84028b000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fa84006d000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fa83fca3000)
/lib64/ld-linux-x86-64.so.2 (0x0000560783ce6000)


Next, compile and install openblas manually: download the source from https://github.com/xianyi/OpenBLAS, unpack OpenBLAS-x.x.x, run make USE_OPENMP=1, and then sudo make install. This installs openblas into the /opt/OpenBLAS directory by default.
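Spelled out as commands, the build might look like this (a sketch; cloning the repo is used here, but downloading and unpacking a release archive as described above works the same way):

- Bash code
# fetch the OpenBLAS source
git clone https://github.com/xianyi/OpenBLAS
cd OpenBLAS
# build with OpenMP support so multi-threading goes through libgomp
make USE_OPENMP=1
# installs to /opt/OpenBLAS by default
sudo make install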


After installation, inspect the new library with ldd:
- Bash code
ldd /opt/OpenBLAS/lib/libopenblas.so

linux-vdso.so.1 => (0x00007fffe6988000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f3876368000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f387614a000)
libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x00007f3875e2f000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f3875c20000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3875857000)
/lib64/ld-linux-x86-64.so.2 (0x000055933fb64000)
libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007f387561a000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f3875404000)


You can see that the self-compiled openblas additionally links against libraries such as libgomp, the OpenMP runtime.


2. Compile caffe with openblas


The openblas installed through apt-get lives in /usr/lib by default, so compiling caffe against it only requires setting BLAS := open in Makefile.config and rebuilding.
To compile caffe against the self-compiled openblas instead, Makefile.config also provides BLAS_INCLUDE and BLAS_LIB settings for the include and lib directories, but setting them alone did not take effect in my case.


In the end it worked by exporting LD_LIBRARY_PATH=/opt/OpenBLAS/lib before building; running ldd build/lib/libcaffe.so afterwards shows that the .so from /opt/OpenBLAS/lib is the one being used.
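Putting the pieces together, the sequence looks roughly like this (a sketch; the Makefile.config variable names come from the stock caffe template, the paths assume the default /opt/OpenBLAS install, and what actually made the difference for me was the LD_LIBRARY_PATH export):

- Bash code
# Makefile.config (relevant lines):
#   BLAS := open
#   BLAS_INCLUDE := /opt/OpenBLAS/include
#   BLAS_LIB := /opt/OpenBLAS/lib

# make the self-compiled openblas visible when building and running caffe
export LD_LIBRARY_PATH=/opt/OpenBLAS/lib:$LD_LIBRARY_PATH
make clean && make all -j8

# verify which openblas caffe actually links against
ldd build/lib/libcaffe.so | grep -i openblas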




3. Let’s take a look at the comparison of the training speed of the two.


With the openblas installed through apt-get, the environment variable OPENBLAS_NUM_THREADS=N sets how many CPU cores are used.
With the self-compiled (OpenMP) openblas, OMP_NUM_THREADS=N sets the number of threads.
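For example, the thread count can be set per run on the command line (a sketch; solver.prototxt stands in for your actual solver file):

- Bash code
# apt-get openblas (pthreads build): thread count via OPENBLAS_NUM_THREADS
OPENBLAS_NUM_THREADS=8 ./build/tools/caffe train --solver=solver.prototxt

# self-compiled openblas built with USE_OPENMP=1: thread count via OMP_NUM_THREADS
OMP_NUM_THREADS=8 ./build/tools/caffe train --solver=solver.prototxt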


In practice, setting OMP_NUM_THREADS turned out to be basically ineffective: with 1, 4, or 8 threads the training speed stayed essentially the same.




Embarrassing; this problem remained unsolved.




Reference blog post:
"Caffe: using openblas-openmp (multi-threaded version) in CPU mode", http://blog.csdn.net/10km/article/details/52723306




With caffe using openblas+openmp and OMP_NUM_THREADS set, the training speed still would not improve, and after some research I found no good solution. Some posts online say that whether multi-threading helps depends on the network structure: some networks can be accelerated and some cannot. Others say that setting OMP_NUM_THREADS=1 is best. Opinions differ, and I do not know why.




If this route does not work, the possible alternatives are as follows:


1. The OpenMP acceleration patch: someone submitted a pull request to caffe that reportedly gives a 5-10x speedup, https://github.com/BVLC/caffe/pull/439, but caffe did not accept the PR, for the sake of code stability.


2. The Intel version of caffe, which is said to be accelerated. Possible issues: (1) whether it costs money, and whether it requires Intel's MKL library, which may be a paid library; (2) the product requires open source certification for the code it uses (our company has probably not certified Intel's caffe fork). (https://software.intel.com/en-us/articles/comparison-between-intel-optimized-caffe-and-vanilla-caffe-by-intel-vtune-amplifier)


3. Go back to tensorflow and use tensorflow's C/C++ API. Problems: (1) there are few references for training models with tensorflow's C/C++ API; (2) tensorflow's C/C++ API may not implement automatic differentiation?


4. Or switch to yet another framework, which sounds maddening.






For now, my plan is to use the MKL library with caffe and try the Intel-optimized caffe version.




The problem I ran into earlier was that setting OMP_NUM_THREADS had no positive effect: the larger the value, the slower the training, and smaller values trained faster. After some experimentation it turned out that with the Adam solver, setting OMP_NUM_THREADS has no effect, while with other solvers it does. After replacing the Adam solver and setting OMP_NUM_THREADS=8, there was about a 3x speed improvement.
- Prototxt code (solver with Adam)
base_lr: 0.001
momentum: 0.9
momentum2: 0.999
lr_policy: "fixed"
type: "Adam"

- Prototxt code (solver after replacing Adam with the default SGD)
base_lr: 0.001
momentum: 0.9
lr_policy: "step"
gamma: 1
stepsize: 5000


But the configuration above brings in learning rate scheduling and parameter fine-tuning issues, and I could not reach the same accuracy as with Adam.


So I then turned to Intel's MKL library and Intel's caffe version.


For the Intel caffe version, I mainly followed Intel's own performance comparison write-up:
Benefits of Intel® Optimized Caffe* in comparison with BVLC Caffe*
https://software.intel.com/en-us/articles/comparison-between-intel-optimized-caffe-and-vanilla-caffe-by-intel-vtune-amplifier


It shows how much performance improvement can be obtained with Intel's caffe + MKL.




First, let's take a look at how to use Intel's mkl library in native caffe.


The MKL library plays the same role as atlas and openblas: after downloading and installing it, you configure it in caffe's Makefile.config.
To download MKL you need to register on Intel's website, https://software.intel.com/en-us/mkl. After applying you normally receive an email with a registration code and other information; after creating an Intel account you can then download MKL, which arrives as a compressed package with a name like "l_mkl_2018.1.163.tgz", ready to be installed. In my case, though, the email never came even after waiting four or five days. I do not know where it went wrong; it can be tried with other colleagues' accounts.


After downloading the installation file, unpack it under Ubuntu with tar -xvzf l_mkl_2018.1.163.tgz, enter the unpacked directory, and run ./install.sh, following the prompts to complete the installation. By default MKL ends up under /opt/intel, which contains the mkl lib and include directories.
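Concretely, the install steps look like this (a sketch; the file name is the one from my download, and yours will differ with the MKL version):

- Bash code
# unpack the MKL installer and run it
tar -xvzf l_mkl_2018.1.163.tgz
cd l_mkl_2018.1.163
sudo ./install.sh        # follow the prompts; installs to /opt/intel by default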


With the MKL library installed, recompile caffe.


In the main directory of caffe, configure Makefile.config and modify the content of BLAS as follows:


- Bash code
BLAS := mkl
BLAS_INCLUDE := /opt/intel/mkl/include
BLAS_LIB := /opt/intel/mkl/lib/intel64


In addition, as with openblas, the MKL lib directory needs to be added to LD_LIBRARY_PATH when compiling caffe, so first execute:
- Bash code
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/mkl/lib/intel64


Then make clean and make all -j32 complete the build.
Running ldd build/tools/caffe afterwards shows that caffe now depends on libmkl_rt.so.
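The rebuild-and-check step, as a sketch:

- Bash code
make clean
make all -j32
# confirm that caffe now links against MKL
ldd build/tools/caffe | grep mkl      # should list libmkl_rt.so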
However, merely compiling native caffe with MKL does not noticeably improve training speed; it feels little different from OpenBLAS with multiple threads. The gain is very limited.


So the next step was to try Intel's caffe fork. First download the Intel caffe source code:


git clone https://github.com/intel/caffe


(Make sure git clone works in your environment; it is also needed later during the intel-caffe build, which downloads the mkl-dnn sources, etc.)
Then configure Makefile.config, setting BLAS to mkl in the same way as when using the MKL library above, and start make.
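A rough outline of the intel-caffe build (a sketch, assuming the same MKL paths as above; copying Makefile.config.example is the usual caffe convention):

- Bash code
git clone https://github.com/intel/caffe intel-caffe
cd intel-caffe
cp Makefile.config.example Makefile.config
# edit Makefile.config: BLAS := mkl, BLAS_INCLUDE/BLAS_LIB as in the native-caffe setup above
make all -j32        # the build fetches mkl-dnn sources, so network access is required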


If you download intel-caffe on Windows, copy it to Linux, and then unpack it there, some .sh files may lose their execute permission and the build will fail.
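One way to restore the permissions before building (a sketch, run from the intel-caffe source root):

- Bash code
# re-add execute permission to all shell scripts in the source tree
find . -name "*.sh" -exec chmod +x {} \;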




If the compilation succeeds, you can start training. The speed is indeed greatly improved: compared with native caffe it is about 7-8 times faster, which differs slightly from Intel's official figures, mainly because native caffe+mkl is not as slow as Intel claims.


But training my own network ran into problems: the accuracy would not increase, even though training mnist and cifar10 with the network structures that come with caffe worked fine.


Investigating, I first found that intel-caffe applies some network structure optimizations by default (in src/caffe/net.cpp); removing the optimizations did not help. So I rebuilt the network structure layer by layer and found that adding the third convolutional layer triggers the problem: with exactly the same network structure and exactly the same data, intel-caffe behaves differently from native caffe, and its accuracy never rises during training. I could not figure it out or solve it directly. By chance I found that a slight modification of the network structure works: changing the kernel size of the third convolution from 5 to 9 or 11 makes everything normal again.


This problem is rather strange and I do not know how to solve it properly; most people probably will not encounter it.


Anyway, in this way you can use the Intel-accelerated caffe version, and the speed is indeed much faster.
