Valgrind trial notes

Valgrind is a full-featured dynamic-analysis tool suite. Under Ubuntu it can be installed with:

sudo apt-get install valgrind

The manual (PDF) can be downloaded from the official website.

It can diagnose memory leaks:

g++ -g xxx.cpp
valgrind --tool=memcheck --leak-check=full ./a.out

It will report any leaked heap blocks (with stack traces pointing at the allocation site if the program was compiled with -g).

Cachegrind can also measure cache hit rates:

g++ xxx.cpp
valgrind --tool=cachegrind ./a.out

It will report the level-1 data cache hit rate, the instruction cache hit rate, the last-level cache hit rate, and related statistics.

Consider the following example:

#include <iostream>
#include <ctime>
using namespace std;

const size_t N = 1E3;

int main() {
        double y = 0, z = 0;
        clock_t tstart = clock();
        double *A = new double[N * N];
        for (size_t i = 0; i < N * N; i++) A[i] = i;
        double *B = new double[N * N];
        for (size_t i = 0; i < N * N; i++) B[i] = i;
        double *C = new double[N * N];
        for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
                z = 0;
                for (size_t l = 0; l < N; l++)      // strides through B by N doubles
                        z += B[l * N + j];
                y = 0;
                for (size_t k = 0; k < N; k++) {    // strides through A by N doubles
                        y += A[k * N + i] * z;
                }
                C[i * N + j] = y;
        }
        clock_t t1 = clock();
        cout << (double)(t1 - tstart) / CLOCKS_PER_SEC << " s" << endl;
        delete [] A; delete [] B; delete [] C;
        return 0;
}

In the inner loops of this code, l and k index rows, so A and B are traversed column-wise with a stride of N doubles. Memory locality is therefore poor, and this shows up in the valgrind report as a low level-1 data cache hit rate (D1 miss rate: 11.8%).

==2322== Cachegrind, a cache and branch-prediction profiler
==2322== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==2322== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==2322== Command: ./a.out
==2322== 
--2322-- warning: L3 cache found, using its data for the LL simulation.
276.428 s
==2322== 
==2322== I   refs:      31,053,190,692
==2322== I1  misses:             1,976
==2322== LLi misses:             1,928
==2322== I1  miss rate:           0.00%
==2322== LLi miss rate:           0.00%
==2322== 
==2322== D   refs:      17,025,701,244  (15,018,537,430 rd   + 2,007,163,814 wr)
==2322== D1  misses:     2,001,266,444  ( 2,000,014,098 rd   +     1,252,346 wr)
==2322== LLd misses:       125,490,381  (   125,113,840 rd   +       376,541 wr)
==2322== D1  miss rate:           11.8% (          13.3%     +           0.1%  )
==2322== LLd miss rate:            0.7% (           0.8%     +           0.0%  )
==2322== 
==2322== LL refs:        2,001,268,420  ( 2,000,016,074 rd   +     1,252,346 wr)
==2322== LL misses:        125,492,309  (   125,115,768 rd   +       376,541 wr)
==2322== LL miss rate:             0.3% (           0.3%     +           0.0%  )

In the inner loops of the following code, k and l index columns, so A and B are traversed row-wise over consecutive addresses, and memory locality is much better.

#include <iostream>
#include <ctime>
using namespace std;

const size_t N = 1E3;

int main() {
        double y = 0, z = 0;
        clock_t tstart = clock();
        double *A = new double[N * N];
        for (size_t i = 0; i < N * N; i++) A[i] = i;
        double *B = new double[N * N];
        for (size_t i = 0; i < N * N; i++) B[i] = i;
        double *C = new double[N * N];
        for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
                z = 0;
                for (size_t l = 0; l < N; l++)      // consecutive addresses in B
                        z += B[j * N + l];
                y = 0;
                for (size_t k = 0; k < N; k++) {    // consecutive addresses in A
                        y += A[i * N + k] * z;
                }
                C[i * N + j] = y;
        }
        clock_t t1 = clock();
        cout << (double)(t1 - tstart) / CLOCKS_PER_SEC << " s" << endl;
        delete [] A; delete [] B; delete [] C;
        return 0;
}

In the cachegrind report this shows up as a much higher level-1 data cache hit rate (D1 miss rate: 0.7%).

==2334== Cachegrind, a cache and branch-prediction profiler
==2334== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==2334== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==2334== Command: ./a.out
==2334== 
--2334-- warning: L3 cache found, using its data for the LL simulation.
202.343 s
==2334== 
==2334== I   refs:      31,053,190,658
==2334== I1  misses:             1,974
==2334== LLi misses:             1,926
==2334== I1  miss rate:           0.00%
==2334== LLi miss rate:           0.00%
==2334== 
==2334== D   refs:      17,025,701,233  (15,018,537,423 rd   + 2,007,163,810 wr)
==2334== D1  misses:       125,517,445  (   125,140,099 rd   +       377,346 wr)
==2334== LLd misses:       125,510,970  (   125,134,429 rd   +       376,541 wr)
==2334== D1  miss rate:            0.7% (           0.8%     +           0.0%  )
==2334== LLd miss rate:            0.7% (           0.8%     +           0.0%  )
==2334== 
==2334== LL refs:          125,519,419  (   125,142,073 rd   +       377,346 wr)
==2334== LL misses:        125,512,896  (   125,136,355 rd   +       376,541 wr)
==2334== LL miss rate:             0.3% (           0.3%     +           0.0%  )

Under valgrind, the running times of the two programs are 276.428 s and 202.343 s; without valgrind they are 24.0162 s and 7.99471 s respectively. So a roughly 11-point difference in D1 miss rate (the reports are otherwise almost identical: LL refs differ by an order of magnitude, but the number of LL misses is about the same) leads to a several-fold difference in running time. In other words, the useful arithmetic the CPU performs costs far less than the stalls caused by the extra D1 misses.

 

Regarding the computer's cache: when the CPU needs data, it searches level-1 cache -> level-2 cache -> ... -> last-level cache -> main memory, in that order. If the data is found in the level-1 cache, the search stops there and the data is used; the deeper the search has to go, the higher the time cost. When the search goes all the way to memory (a cache miss), a whole "data block" (cache line) is fetched and stored in the cache, so if the next access falls in the same block it is served cheaply. Therefore, the higher the cache hit rate, the better the program performs, which is why memory locality is so important.


Origin www.cnblogs.com/luyi07/p/12725565.html