Journey to the Great Wilderness of Min (8) ----- Where has the time gone?

HOG+SVM pedestrian detection is a classic approach, but anyone who has actually used it will find that the detector provided by OpenCV has very poor real-time performance. OpenCV does apply some optimizations, such as using the CPU to run the multi-scale detection in parallel, yet on my laptop a complete detection pass still takes 1~2 seconds. At that speed, applied to a driverless vehicle, the car would probably have hit the pedestrian before it finished detecting them... To improve the detection speed, GPU parallel computing is a very good fit and one possible solution.

 

I had been studying CUDA, HOG, and SVM for a while, and had even started on the parallelization work. However, I had overlooked a very important and critical question: in all this detection, where is the time actually spent??

 

Thanks to my teacher's guidance, I went back and rethought this problem. I may have considered it briefly before, but in engineering I don't trust casual impressions; what I need is real data!

 

Where is the time spent? In the HOG computation? In the multi-scale detection? In the SVM computation? ... Or somewhere else?

 

Only by truly answering this question can we prescribe the right medicine!

 

So I went back over the entire pedestrian detection pipeline. It is easy to see that the central function is detectMultiScale, the multi-scale pedestrian detection function!

 

So what is multi-scale pedestrian detection? My thanks to Senior Brother Qian for an explanation that was simple and clear.

The size of the detection template is usually fixed; the one used most often (well, most often by me) is the (128, 64) template, so the size of a person that fits the template is effectively bounded. The image to be detected, however, can be any size: 640*480, 1280*760, depending entirely on the resolution your hardware supports. The problem this causes is that the size of people in the image can be far from the size of the template, and such mismatches are easily missed by the algorithm! One way to solve this, and the method detectMultiScale uses, is to shrink the image to be detected into many sizes; among these there will almost always be one size at which the people in the image match the template size (in fact not always — I think it is limited by the number of pyramid levels), so that they can be detected. For example, with the default scale factor of 1.05, a 640*480 image can only be shrunk a limited number of times before it becomes smaller than the 128*64 window. The default is 64 levels. This is multi-scale detection!

 

We can look at the source code of the detectMultiScale function:

void HOGDescriptor::detectMultiScale(
    const Mat& img, vector<Rect>& foundLocations, vector<double>& foundWeights,
    double hitThreshold, Size winStride, Size padding,
    double scale0, double finalThreshold, bool useMeanshiftGrouping) const
{
    double scale = 1.;
    int levels = 0;
    //******************first part******************
    vector<double> levelScale;
    for( levels = 0; levels < nlevels; levels++ )
    {
        levelScale.push_back(scale);
        if( cvRound(img.cols/scale) < winSize.width ||
            cvRound(img.rows/scale) < winSize.height ||
            scale0 <= 1 )
            break;
        scale *= scale0;
    }
    levels = std::max(levels, 1);
    levelScale.resize(levels);
   //*********************************************
    std::vector<Rect> allCandidates;
    std::vector<double> tempScales;
    std::vector<double> tempWeights;
    std::vector<double> foundScales;
    Mutex mtx;
   //****************the second part************************
    parallel_for_(Range(0, (int)levelScale.size()),
                 HOGInvoker(this, img, hitThreshold, winStride, padding, &levelScale[0], &allCandidates, &mtx, &tempWeights, &tempScales));
   //**********************************************
   
    std::copy(tempScales.begin(), tempScales.end(), back_inserter(foundScales));
    foundLocations.clear();
    std::copy(allCandidates.begin(), allCandidates.end(), back_inserter(foundLocations));
    foundWeights.clear();
    std::copy(tempWeights.begin(), tempWeights.end(), back_inserter(foundWeights));
    //********************the third part***********************
    if ( useMeanshiftGrouping )
    {
        groupRectangles_meanshift(foundLocations, foundWeights, foundScales, finalThreshold, winSize);
    }
    else
    {
        groupRectangles(foundLocations, foundWeights, (int)finalThreshold, 0.2);
    }
}

According to the time measurements, the first and third parts consume only a very short time; more than 98% of the time is spent in the second part.
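For reference, here is a minimal sketch of how each part can be timed with OpenCV's tick counter; where exactly the probes go is up to you, and the "part under test" marker is of course illustrative:

#include <opencv2/core/core.hpp>
#include <cstdio>

// Minimal timing sketch: wrap a section of detectMultiScale with tick counts.
int64 t0 = cv::getTickCount();
// ... part under test, e.g. the parallel_for_ call in the second part ...
int64 t1 = cv::getTickCount();
std::printf("elapsed: %.3f ms\n", (t1 - t0) * 1000.0 / cv::getTickFrequency());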

Seeing parallel_for_, one naturally thinks of parallel computing! Yes, the CPU can also run threads, but compared with the GPU it is clearly not well suited to this kind of task (in terms of raw computation; many of the CPU's threads are busy running all the other processes).

Understanding parallel_for_ in detail will take time, but perhaps we don't need the details yet; let's first figure out what this function is computing in parallel!
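To get a feel for it, here is a minimal, self-contained sketch of the parallel_for_ pattern; SquareInvoker and its data are made up purely for illustration:

#include <opencv2/core/core.hpp>
#include <vector>

// Toy ParallelLoopBody: square every element of a vector in parallel.
class SquareInvoker : public cv::ParallelLoopBody
{
public:
    explicit SquareInvoker(std::vector<float>* _data) { data = _data; }
    virtual void operator()(const cv::Range& range) const
    {
        // Each worker thread receives a sub-range [start, end) of the indices.
        for (int i = range.start; i < range.end; i++)
            (*data)[i] *= (*data)[i];
    }
private:
    std::vector<float>* data;
};

// Usage:
//     std::vector<float> v(100000, 2.f);
//     cv::parallel_for_(cv::Range(0, (int)v.size()), SquareInvoker(&v));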

In effect, the call distributes the Range(0, (int)levelScale.size()) iterations across threads, each running the HOGInvoker() functor. So let us look at the class HOGInvoker. At first glance, doesn't its constructor look like a mere assignment function?

class HOGInvoker : public ParallelLoopBody
{
public:
    HOGInvoker( const HOGDescriptor* _hog, const Mat& _img,
                double _hitThreshold, Size _winStride, Size _padding,
                const double* _levelScale, std::vector<Rect> * _vec, Mutex* _mtx,
                std::vector<double>* _weights=0, std::vector<double>* _scales=0 )
    {
        hog = _hog;
        img = _img;
        hitThreshold = _hitThreshold;
        winStride = _winStride;
        padding = _padding;
        levelScale = _levelScale;
        vec = _vec;
        weights = _weights;
        scales = _scales;
        mtx = _mtx;
    }

    void operator()( const Range& range ) const
    {
        int i, i1 = range.start, i2 = range.end;
        double minScale = i1 > 0 ? levelScale[i1] : i2 > 1 ? levelScale[i1+1] : std::max(img.cols, img.rows);
        Size maxSz(cvCeil(img.cols/minScale), cvCeil(img.rows/minScale));
        Mat smallerImgBuf(maxSz, img.type());
        vector<Point> locations;
        vector<double> hitsWeights;

        for( i = i1; i < i2; i++ )
        {
            double scale = levelScale[i];
            Size sz(cvRound(img.cols/scale), cvRound(img.rows/scale));
            Mat smallerImg(sz, img.type(), smallerImgBuf.data);
            if( sz == img.size() )
                smallerImg = Mat(sz, img.type(), img.data, img.step);
            else
                resize(img, smallerImg, sz);
            hog->detect(smallerImg, locations, hitsWeights, hitThreshold, winStride, padding);
            Size scaledWinSize = Size(cvRound(hog->winSize.width*scale), cvRound(hog->winSize.height*scale));

            mtx->lock();
            for( size_t j = 0; j < locations.size(); j++ )
            {
                vec->push_back(Rect(cvRound(locations[j].x*scale),
                                    cvRound(locations[j].y*scale),
                                    scaledWinSize.width, scaledWinSize.height));
                if (scales)
                {
                    scales->push_back(scale);
                }
            }
            mtx->unlock();

            if (weights && (!hitsWeights.empty()))
            {
                mtx->lock();
                for (size_t j = 0; j < locations.size(); j++)
                {
                    weights->push_back(hitsWeights[j]);
                }
                mtx->unlock();
            }
        }
    }

    const HOGDescriptor* hog;
    Mat img;
    double hitThreshold;
    Size winStride;
    Size padding;
    const double* levelScale;
    std::vector<Rect>* vec;
    std::vector<double>* weights;
    std::vector<double>* scales;
    Mutex* mtx;
};

Anyone who has studied C++ will recognize that the operator function here is an operator overload: it overloads "()". And the most important, most time-consuming step is one I think everyone can spot at a glance; the timing confirms it: it is indeed detect!

 

Let's look at the source code of detect:

void HOGDescriptor::detect(const Mat& img,
    vector<Point>& hits, vector<double>& weights, double hitThreshold,
    Size winStride, Size padding, const vector<Point>& locations) const
{
    hits.clear();
    if( svmDetector.empty() )
        return;

    if( winStride == Size() )
        winStride = cellSize;
    Size cacheStride(gcd(winStride.width, blockStride.width),
                     gcd(winStride.height, blockStride.height));
    size_t nwindows = locations.size();
    padding.width = (int)alignSize(std::max(padding.width, 0), cacheStride.width);
    padding.height = (int)alignSize(std::max(padding.height, 0), cacheStride.height);
    Size paddedImgSize(img.cols + padding.width*2, img.rows + padding.height*2);
    //***************************first part*************************
    HOGCache cache(this, img, padding, padding, nwindows == 0, cacheStride);
    //************************************************************	
    if( !nwindows )
        nwindows = cache.windowsInImage(paddedImgSize, winStride).area();

    const HOGCache::BlockData* blockData = &cache.blockData[0];
    int nblocks = cache.nblocks.area();
    int blockHistogramSize = cache.blockHistogramSize;
    size_t dsize = getDescriptorSize();

    double rho = svmDetector.size() > dsize ? svmDetector[dsize] : 0;
    vector<float> blockHist(blockHistogramSize);
    //*******************the second part***********************
    for( size_t i = 0; i < nwindows; i++ )
    {
        //********************One for loop**********************
        Point pt0;
        if( !locations.empty() )
        {
            pt0 = locations[i];
            if( pt0.x < -padding.width || pt0.x > img.cols + padding.width - winSize.width ||
                pt0.y < -padding.height || pt0.y > img.rows + padding.height - winSize.height )
                continue;
        }
        else
        {
            pt0 = cache.getWindow(paddedImgSize, winStride, (int)i).tl() - Point(padding);
            CV_Assert(pt0.x % cacheStride.width == 0 && pt0.y % cacheStride.height == 0);
        }
        double s = rho;
        const float* svmVec = &svmDetector[0];
#ifdef HAVE_IPP
        int j;
#else
        int j, k;
#endif
        for( j = 0; j < nblocks; j++, svmVec += blockHistogramSize )
        {
            const HOGCache::BlockData& bj = blockData[j];
            Point pt = pt0 + bj.imgOffset;

            const float* vec = cache.getBlock(pt, &blockHist[0]);
#ifdef HAVE_IPP
            Ipp32f partSum;
            ippsDotProd_32f(vec,svmVec,blockHistogramSize,&partSum);
            s += (double)partSum;
#else
            for( k = 0; k <= blockHistogramSize - 4; k += 4 )
                s += vec[k]*svmVec[k] + vec[k+1]*svmVec[k+1] +
                    vec[k+2]*svmVec[k+2] + vec[k+3]*svmVec[k+3];
            for( ; k < blockHistogramSize; k++ )
                s += vec[k]*svmVec[k];
#endif
        }
        if( s >= hitThreshold )
        {
            hits.push_back(pt0);
            weights.push_back(s);
        }
        //*************************************************
    }
    //************************************************
}

 Timing the key parts above:

           First part: 20% of the time

           Second part: 80% of the time

           One pass of the for loop: about 0.1 ms

So we have finally found where the time goes! 20% of the time is spent in the first part, computing the HOG descriptors (the HOGCache construction), and 80% in the second part, processing those descriptors window by window against the SVM.

In response to these results, I promptly adjusted the parallel-computing rewrite plan:

 

Stage 1: rewrite the second part of the detect function, i.e. the loop over nwindows: give each loop iteration its own thread, opening nwindows threads in total. The reason is obvious: the loop takes a lot of time overall, yet each single iteration is very short, which means the iteration count is huge. That is exactly the shape of problem parallel processing is good at! (A sketch follows.)
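To make the idea concrete, here is a heavily simplified CUDA sketch of stage 1, one thread per detection window. It assumes the block histograms were already computed on a dense grid with stride blockStride and uploaded to d_blockHist, and that each window block's imgOffset was converted to grid cells on the host; every name and the memory layout here are my own assumptions for illustration, not OpenCV's actual GPU code:

// Hypothetical kernel: thread w scores detection window w against the linear SVM.
__global__ void scoreWindows(const float* d_blockHist, // grid of block histograms
                             const float* d_svm,       // SVM weights, nblocks*histSize floats
                             const int2* d_off,        // per-block offset inside a window, in grid cells
                             int nblocks, int histSize,
                             int gridW,                // width of the block-histogram grid
                             int winGridW,             // window positions per image row
                             int nwindows,
                             int strideX, int strideY, // window stride, in grid cells
                             float rho, float hitThreshold,
                             float* d_score, unsigned char* d_hit)
{
    int w = blockIdx.x * blockDim.x + threadIdx.x;
    if (w >= nwindows)
        return;

    int wx = (w % winGridW) * strideX;  // window origin in grid cells
    int wy = (w / winGridW) * strideY;

    float s = rho;                      // same accumulation as the CPU inner loops
    for (int j = 0; j < nblocks; j++)
    {
        const float* hist = d_blockHist
            + ((wy + d_off[j].y) * gridW + (wx + d_off[j].x)) * histSize;
        const float* svm = d_svm + j * histSize;
        for (int k = 0; k < histSize; k++)
            s += hist[k] * svm[k];
    }
    d_score[w] = s;
    d_hit[w] = (s >= hitThreshold) ? 1 : 0;
}

A launch of (nwindows + 255) / 256 blocks of 256 threads would cover every window; the score and hit arrays are then downloaded and compacted into the output vectors on the host.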

 

Stage 2: rewrite the first part of the detect function, just as the computeGradient function was rewritten in (7), so that it too is computed in parallel. (A bare-bones sketch follows.)
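In the same spirit, a bare-bones per-pixel gradient kernel might look like the following: one thread per pixel, with the gamma correction, border handling and orientation binning of the real computeGradient omitted, and all names being illustrative:

// Hypothetical kernel: one thread computes gradient magnitude and angle for one pixel.
__global__ void gradKernel(const unsigned char* img, int w, int h, size_t step,
                           float* mag, float* ang)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1)
        return;  // skip border pixels for simplicity

    // Central differences: the same [-1, 0, 1] derivative mask HOG uses.
    float dx = (float)img[y * step + (x + 1)] - (float)img[y * step + (x - 1)];
    float dy = (float)img[(y + 1) * step + x] - (float)img[(y - 1) * step + x];
    mag[y * w + x] = sqrtf(dx * dx + dy * dy);
    ang[y * w + x] = atan2f(dy, dx);
}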

 

Everything else stays the same: CPU parallelism is still used to process the images at the 64 scales in parallel!

[Figure: a simple schematic diagram of the stage-1 scheme]

 

Plan preparation:

The most important thing in a CUDA rewrite is the data exchange, so here we need to figure out what data goes in and what data comes out.

               Input data                ->  rewritten form:

                   Mat& img              ->  PtrStep<uchar>, int img_h, int img_w
                   double hitThreshold   ->  (unchanged)
                   Size winStride        ->  int winstride_w, int winstride_h
                   Size padding          ->  int padding_w, int padding_h
                   BlockData* blockData  ->  PtrStepSz<uchar3>

               Output data:

                   vector<Point>         ->  PtrStepSz<uchar3>
                   vector<double> weight ->  PtrStep<uchar>

 

               Note: 1. For Mat img I plan to pass in the grayscale image, so the PtrStep<uchar> type is used to receive it; but this type has no size members, so the image width and height have to be passed in separately!

                      2. The BlockData structure contains an int and a Point, so a three-channel image is used, with the channels holding the int and the x and y of the Point respectively;
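A possible host-side sketch of that packing follows. Note two deviations of my own: I use 32-bit integer channels rather than the uchar3 above, since histOfs and the offsets easily exceed 255, and HOGCache::BlockData is internal to OpenCV's hog.cpp, so this code would live inside a modified copy of that file:

// Hypothetical packing of the nblocks BlockData entries into one 3-channel row.
cv::Mat packed(1, nblocks, CV_32SC3);
for (int j = 0; j < nblocks; j++)
{
    const HOGCache::BlockData& bd = blockData[j];
    // channel 0: histogram offset; channels 1 and 2: block offset inside the window
    packed.at<cv::Vec3i>(0, j) = cv::Vec3i(bd.histOfs, bd.imgOffset.x, bd.imgOffset.y);
}
// then upload, e.g.: cv::gpu::GpuMat d_blockData; d_blockData.upload(packed);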

                       
 

 

 

 
