CNN-based object detection methods (RCNN, Fast-RCNN, Faster-RCNN, Mask-RCNN, YOLO, SSD): pedestrian detection, object tracking, convolutional neural networks

Original article: https://blog.csdn.net/qq_32998593/article/details/80558449

 

I. Significance
        Convolutional neural networks (CNNs), owing to their strong feature extraction capability, have been widely used in computer vision in recent years. LeNet-5, the network structure proposed by Yann LeCun in 1998, made it possible to train a convolutional neural network end to end and applied it to document recognition. LeNet-5 is the most classic CNN architecture, and subsequent convolutional neural networks all derive from this design.

        Over the past six years, thanks to the development of deep learning and convolutional networks, image-based object detection and classification have improved greatly. Object detection is an important field of computer vision research and an essential prerequisite for many high-level tasks, including scene understanding and event recognition. Object detection is also widely applied in security surveillance, autonomous driving, human-computer interaction, augmented reality, and many other areas. It is of great practical significance both to computer vision research and to industry.

        However, factors such as viewpoint, occlusion, and pose cause targets to deform, which makes object detection a challenging task. Designing efficient and accurate object detection algorithms therefore remains of great significance.

II. Research status
        Today, CNN-based object detection has surpassed traditional detection methods and become the mainstream approach. According to how the convolutional neural network is used, CNN-based object detection divides into two major categories: classification-based CNN object detection and regression-based CNN object detection.

1. Classification-based CNN object detection
        Classification-based CNN detection is also known as two-stage detection. Traditional object detection comprises the steps of preprocessing, sliding window, feature extraction, feature selection, feature classification, and post-processing, while a convolutional neural network inherently provides feature extraction, feature selection, and feature classification. A CNN can therefore be used directly as a binary classifier on each candidate region generated by a sliding window, deciding whether it contains the detection target. Such methods are referred to here as classification-based CNN object detection. Compared with the six steps of traditional detection, classification-based CNN detection has only three: sliding window to generate candidate regions (region proposals), image classification of the candidate regions, and post-processing, with the sliding-window and post-processing steps fixed. Research on such methods therefore focuses on improving the CNN's feature extraction, feature selection, and feature classification capabilities in order to raise image recognition accuracy. The typical representatives of this class are the region-proposal-based R-CNN algorithms, such as R-CNN, Fast R-CNN, and Faster R-CNN.

 

1.1 R-CNN

        R-CNN is the foundation of the region-proposal family of detection methods: it first searches for candidate regions and then classifies each candidate region. In R-CNN the candidate regions are generated by Selective Search, a heuristic search algorithm. It first over-segments the picture into many small regions with a simple segmentation algorithm, then merges them by similarity using hierarchical grouping; the regions remaining at the end are the candidate regions (region proposals) that may contain an object. The pipeline is illustrated below:

 

        For one picture, R-CNN generates about 2,000 candidate regions with Selective Search, resizes each candidate region to a fixed size (227 × 227), and feeds it to a CNN model, using AlexNet to extract image features and finally obtaining a 4096-dimensional feature vector. This feature vector is then fed to multi-class SVM classifiers to predict the probability that the object contained in the candidate region belongs to each class; one SVM classifier is trained per category, inferring from the feature vector the probability of belonging to that class. To improve localization accuracy, R-CNN finally trains a bounding-box regression model. A training sample is a pair (P, G), where P = (Px, Py, Pw, Ph) is a candidate region and G = (Gx, Gy, Gw, Gh) is the position and size of the ground-truth box. For P, the ground-truth box G with the largest IoU is selected, and the regression targets are defined as:

        tx = (Gx - Px) / Pw,    ty = (Gy - Py) / Ph,    tw = log(Gw / Pw),    th = log(Gh / Ph)

        At prediction time, the predicted offsets can be inverted with the formulas above to correct the candidate box and obtain the final box. R-CNN trains a separate regressor for each category, using a mean squared error loss function.
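As an illustration, a minimal sketch of this box parameterization in NumPy (boxes given as center coordinates plus width and height, following the (Px, Py, Pw, Ph) notation above; function names are my own):

```python
import numpy as np

def bbox_to_targets(P, G):
    """Regression targets mapping proposal P to ground-truth G.
    Boxes are (cx, cy, w, h)."""
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    return np.array([(gx - px) / pw,
                     (gy - py) / ph,
                     np.log(gw / pw),
                     np.log(gh / ph)])

def targets_to_bbox(P, t):
    """Invert the transform at prediction time to correct the proposal."""
    px, py, pw, ph = P
    tx, ty, tw, th = t
    return np.array([tx * pw + px,
                     ty * ph + py,
                     pw * np.exp(tw),
                     ph * np.exp(th)])
```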

        R-CNN is very intuitive: it converts the detection problem into a classification problem. However, R-CNN is computationally expensive because of the high complexity of the Selective Search used to extract candidate regions, and because SVMs are used for classification it is not an end-to-end trainable model. The R-CNN model resizes candidate regions to a uniform size before feature extraction and classification, and features of overlapping candidate boxes are computed repeatedly during feature extraction.

 

1.2 Fast-RCNN

        Fast-RCNN was born to solve the problem of repeated feature computation, and it cleverly places object classification and localization in the same model, making a multi-task CNN.

        Fast-RCNN first finds candidate boxes with Selective Search, then runs the CNN once over the entire image. RoI Pooling then samples the part of the feature map corresponding to each candidate box to obtain features of identical length, which pass through two fully connected layers to give the final feature. Two branches follow: one classifies this feature, and the other regresses the candidate-box offsets from it. Fast-RCNN thus integrates the classification and regression tasks into one model.

 

        This section introduces Fast-RCNN's core algorithmic module, RoI Pooling. Convolutional networks for image classification first rescale and crop the input picture to a fixed size; for example, AlexNet and ResNet zoom the picture to a 256 scale and crop it to 224 × 224 before feeding it to the network for training. For detection tasks, however, image size has an important influence on detection performance: if the input image were shrunk to 224 × 224, a target object might become undetectable because its resolution is too low. Fast-RCNN places no limit on the input image size, and the key to achieving this is the RoI Pooling layer.

        The RoI Pooling layer takes a convolutional feature map of any scale and extracts a fixed-dimension feature for each candidate box; multi-scale features can be extracted by setting RoI Pooling at different scales. The principle of RoI Pooling is simple: a grid of fixed size is laid over the region, and the maximum value within each grid cell is sampled. RoI max pooling thus converts the valid features inside any region of interest into a small fixed-size feature map of H × W (e.g., 7 × 7), where H and W are hyperparameters of the RoI Pooling layer, independent of any particular RoI. Fast-RCNN defines an RoI as a rectangular window on the feature map, specified by the four-tuple (r, c, h, w): its top-left corner (r, c) plus its height and width (h, w). RoI max pooling divides the h × w RoI window into an H × W grid of sub-windows of size roughly h/H × w/W, then outputs the maximum value of each sub-window to the corresponding grid cell. RoI Pooling is a special case of the spatial pyramid pooling layer used in SPP-Net, with only one pyramid level.

        The pooled feature map then passes through several fully connected layers to yield a new feature vector, which is used by a softmax classifier (to predict the category) and a linear regressor (to adjust the bounding-box position) for detection. In implementation these are two different fully connected layers: the first has N + 1 outputs (N is the total number of categories, plus one for background), giving the probability of each category; the second has 4N outputs, giving the regressed coordinate values (tx, ty, tw, th). As in R-CNN, four position coordinates are predicted for each category.
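A minimal NumPy sketch of RoI max pooling under these definitions (single feature channel; the grid arithmetic is a simplification of real implementations such as torchvision's roi_pool, and the function name is my own):

```python
import numpy as np

def roi_max_pool(feature, roi, H=7, W=7):
    """Max-pool one RoI of a single-channel feature map to a fixed H x W grid.

    feature: 2-D array, one channel of the conv feature map.
    roi: (r, c, h, w), top-left corner plus height and width, in feature cells.
    """
    r, c, h, w = roi
    out = np.empty((H, W))
    for i in range(H):
        # sub-window rows: floor/ceil split of h into H pieces, never empty
        r0 = (i * h) // H
        r1 = max(((i + 1) * h + H - 1) // H, r0 + 1)
        for j in range(W):
            c0 = (j * w) // W
            c1 = max(((j + 1) * w + W - 1) // W, c0 + 1)
            out[i, j] = feature[r + r0:r + r1, c + c0:c + c1].max()
    return out

# usage: pool a 13 x 9 RoI with top-left corner (2, 3) on a 32 x 32 feature map
fmap = np.random.rand(32, 32)
print(roi_max_pool(fmap, (2, 3, 13, 9)).shape)  # (7, 7)
```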

 

        Another major difference between Fast R-CNN and R-CNN is that a softmax classifier is adopted in place of the SVM classifiers, and training is a single pipeline: unlike R-CNN, the classification error and localization error are combined during training, and the smooth L1 loss is used for the localization error instead of the L2 loss of R-CNN. The entire network can therefore be trained end to end.
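For reference, a minimal sketch of the smooth L1 loss mentioned here, in its standard elementwise form as a function of the regression residual x:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss: quadratic for |x| < 1, linear beyond.
    Less sensitive to outliers than the L2 loss used by R-CNN."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

print(smooth_l1(np.array([-3.0, -0.5, 0.0, 0.5, 3.0])))
# [2.5   0.125 0.    0.125 2.5  ]
```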

        After Fast-RCNN was proposed, the framework for deep-learning-based object detection became very clear; the remaining question was whether candidate-region extraction could also be brought inside the CNN framework. Faster-RCNN starts from exactly this point and proposes the Region Proposal Network (RPN) to fold candidate-region extraction into the CNN framework.

1.3 Faster-RCNN
        The Faster-RCNN model introduces the RPN (Region Proposal Network) to generate candidate regions directly. Faster-RCNN can be seen as the combination of the RPN and the Fast-RCNN model, i.e. Faster-RCNN = RPN + Fast-RCNN.

        The RPN first uses a CNN model (generally called the feature extractor) that receives the entire image and extracts a feature map. An N × N (here 3 × 3) sliding window is then run over this feature map, and each sliding-window position is mapped to a low-dimensional feature (e.g., 256-d). This feature is fed to two sibling fully connected layers, one for classification and one for regression. Each window position is assigned k prior boxes (anchors, or default bounding boxes) of different sizes and scales, which means each position predicts k candidate regions (region proposals). The classification layer has 2k outputs, the probabilities that each candidate region is an object or background; the regression layer has 4k outputs, the coordinate values giving each candidate region's position (relative to its prior box). The two fully connected layers are shared across all sliding-window positions, so the RPN can be implemented with convolutions: first an n × n convolution producing the low-dimensional feature, then two 1 × 1 convolutions, one for classification and one for regression.
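A minimal PyTorch sketch of that convolutional head (assuming a 256-channel backbone feature map and k = 9 anchors per position; the class and layer names are my own):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=256, mid_channels=256, k=9):
        super().__init__()
        # the n x n (here 3 x 3) conv: a "sliding window" producing a
        # low-dimensional feature at every position
        self.conv = nn.Conv2d(in_channels, mid_channels, 3, padding=1)
        # two 1 x 1 convs play the role of the shared fully connected layers
        self.cls = nn.Conv2d(mid_channels, 2 * k, 1)  # object/background per anchor
        self.reg = nn.Conv2d(mid_channels, 4 * k, 1)  # box offsets per anchor

    def forward(self, fmap):
        h = torch.relu(self.conv(fmap))
        return self.cls(h), self.reg(h)

head = RPNHead()
scores, deltas = head(torch.randn(1, 256, 38, 50))
print(scores.shape, deltas.shape)  # (1, 18, 38, 50) (1, 36, 38, 50)
```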

 

        The RPN performs only binary classification, distinguishing objects from background without predicting the object's category, i.e. it is class-agnostic. Because coordinates are predicted at the same time, during training prior boxes are matched to ground-truth boxes by two rules: (1) the prior box having the highest IoU with a ground-truth box is matched, and (2) any prior box whose IoU with a ground-truth box exceeds 0.7 is matched. As long as a prior box matches some ground-truth box, it is a positive sample (an object), and that ground truth becomes its regression target. Prior boxes whose IoU with every ground-truth box is below 0.3 are treated as negative samples. The RPN can be trained separately, and a separately trained RPN model already yields many region proposals. Because the number of prior boxes is large, many of the predicted candidate regions overlap: NMS (non-maximum suppression, with the IoU threshold set to 0.7) is first applied to reduce the number of candidate regions, the survivors are sorted by confidence in descending order, and the top-N region proposals are selected to train the Fast R-CNN model. The RPN thus replaces Selective Search but is much faster, so Faster R-CNN is accelerated in both training and prediction.
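A minimal NumPy sketch of that NMS step, as a simplified stand-in for library routines such as torchvision.ops.nms (boxes as (x1, y1, x2, y2) corners; the helper name is my own):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # descending confidence
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top-scoring box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # drop boxes overlapping the kept box above the threshold
        order = order[1:][iou <= iou_thresh]
    return keep
```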

 

        Faster-RCNN is trained with the following four-step alternating process:

        Step 1: initialize the feature extraction network with a model pre-trained on ImageNet and train the RPN;

        Step 2: initialize the Fast-RCNN feature extraction network with the same ImageNet pre-trained model, take the candidate boxes generated by the RPN trained in step 1 as input, and train the Fast-RCNN network; at this point the two networks do not yet share any layer parameters;

        Step 3: use the Fast-RCNN network parameters from step 2 to initialize a new RPN, but set the learning rate of the layers shared between the RPN and Fast-RCNN (the feature extraction layers) to 0, i.e. train only the RPN-specific layers while keeping the feature extraction network fixed. After this step, the two networks share all the common convolutional layers;

        Step 4: keep the shared network layers fixed, add the Fast-RCNN-specific layers back in, and continue training to fine-tune the Fast-RCNN-specific layers. At this point the RPN and Fast-RCNN fully share network parameters, and a single network completes candidate-box extraction and object detection simultaneously.

        Faster-RCNN's main problem is its poor detection performance on small-scale targets: after the deep CNN extracts features from candidate regions, the features of small-scale targets are not effectively retained. Its feature resolution is limited, and likewise its detection speed is relatively slow.

1.4 Mask RCNN
        Mask RCNN extends the original Faster-RCNN by adding a mask-prediction branch in parallel with the existing detection branches. The network structure is relatively easy to implement and train, runs at 5 fps, can easily be applied to other tasks such as object detection, segmentation, and human keypoint detection, and outperforms existing algorithms.

        The difficulty of instance segmentation is that all targets in the picture must first be correctly detected, and each instance must also be segmented. Detection aims to classify each individual object and localize it with a bounding box; semantic segmentation aims to assign each pixel to a class without distinguishing different instances. Mask R-CNN uses a fully convolutional network (FCN) to complete the mask prediction. This requires that the training data carry pixel-level labels rather than simple bounding boxes.

 

        Faster R-CNN has two parts: the RPN proposes regions and finds target boxes, and the RoIs are then classified. Its core idea is to give a deep network an image region, extract that region's features with the deep network layers, and use the features to determine what the object is. Background counts as one category, so to decide among 20 object classes the implementation actually performs 21-way classification. Finally, the object's region is slightly adjusted once more.

        Mask representation: a mask encodes the spatial layout of the input target. Each RoI is predicted with an m × m matrix rather than a vector, which guarantees that spatial information within the RoI is not lost.

        RoI Align: RoI Pooling maps a region of the original picture to a fixed-size region of the convolutional feature map, normalizing regions of different sizes into the convolutional network's input. During this normalization the RoI and the extracted features can fail to coincide, so the authors proposed RoI Align, a layer that calibrates the extracted features against the input. It avoids rounding the boundaries or bins of each RoI: bilinear interpolation computes exact values of the input features at four fixed sampling positions in each RoI bin, and the results are then aggregated.

        Because the FCN accurately predicts a category for each pixel, every pixel of the input picture corresponds to a category in the annotation. Given an anchor box in the input image, we can precisely match it to the corresponding region of the pixel annotation. But RoI pooling acts on the feature map after convolution, and by default it rounds the anchor box to whole cells. For example, suppose an anchor box is (x, y, w, h) and feature extraction shrinks the image 16 times; that is, if the original picture is 256 × 256, the feature map is 16 × 16. The anchor box mapped onto the feature map then becomes (x/16, y/16, w/16, h/16). If any of x, y, w, h is not divisible by 16, misalignment occurs. Similarly, if the anchor box's width and height are not divisible by the pooling grid size, the rounding again causes misalignment.

        Usually such a misalignment is just a few pixels, which has little effect on classification and box prediction. But for pixel-level prediction, such dislocation can cause big problems. The RoI Align layer is similar to the RoI pooling layer, but it removes the rounding step entirely: if a computed anchor-box coordinate does not land exactly on a pixel, the value at that point is obtained by linear interpolation of the surrounding pixels.

        In the one-dimensional case, suppose we want to compute the value f(x) at a point x; we can interpolate from the integer points around x:

        f(x) ≈ (⌈x⌉ - x) · f(⌊x⌋) + (x - ⌊x⌋) · f(⌈x⌉)

        Bilinear interpolation is actually used to estimate the two-dimensional f(x, y): we first interpolate along the x-axis to obtain f(x, ⌊y⌋) and f(x, ⌊y⌋ + 1), then interpolate these two values along the y-axis to obtain f(x, y).
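A minimal NumPy sketch of this bilinear sampling, the building block of RoI Align (the function name is my own):

```python
import numpy as np

def bilinear(fmap, x, y):
    """Bilinearly interpolate a 2-D feature map at a real-valued (x, y).
    x indexes columns, y indexes rows."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, fmap.shape[1] - 1)
    y1 = min(y0 + 1, fmap.shape[0] - 1)
    dx, dy = x - x0, y - y0
    # interpolate along x at the two integer rows...
    top = (1 - dx) * fmap[y0, x0] + dx * fmap[y0, x1]
    bot = (1 - dx) * fmap[y1, x0] + dx * fmap[y1, x1]
    # ...then along y
    return (1 - dy) * top + dy * bot

fmap = np.arange(16.0).reshape(4, 4)
print(bilinear(fmap, 1.5, 2.25))  # 10.5, between rows 2-3 and columns 1-2
```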

 

        Network Architecture:

        The network is divided into three parts: the first, the backbone network, is used for feature extraction; the second, the head structure, is used for bounding-box recognition (classification and regression); and the third predicts a mask to distinguish each RoI. The backbone network is ResNet-50, a residual network 50 layers deep.

 

        Because the above methods use a dedicated candidate-region extraction method in place of exhaustive sliding-window search, the number of candidate regions is smaller, and CNN-based object detection has improved greatly in both accuracy and speed. However, these methods depend heavily on the accuracy of the candidate-region extraction method: if the scene is complex and the target is not salient, extraction may fail to capture a region near the target, causing the target to go undetected.

2. Regression-based CNN object detection
        Here the network structure is redesigned for detection so that the convolutional neural network acts as a regressor: the entire image to be detected serves as the candidate region and is fed directly into the convolutional neural network, which regresses the position information of the targets in the image. The most representative algorithms of this kind are YOLO and SSD.

2.1 YOLO algorithm
        YOLO stands for You Only Look Once: Unified, Real-Time Object Detection. The YOLO algorithm uses a single CNN model to achieve end-to-end object detection. The whole system works as shown: the input image is first resized to 448 × 448 and passed through the CNN network, and the network's predictions are then post-processed to obtain the detection results. Compared with the R-CNN algorithms, YOLO is a unified framework, it is faster, and its training is end-to-end.

 

        YOLO's CNN divides the input image into an S × S grid, and each cell is responsible for detecting targets whose center points fall within it. Each cell predicts B bounding boxes and a confidence score for each box. Confidence actually covers two aspects: how likely the bounding box is to contain a target, and how accurate the bounding box is. The former is denoted Pr(object): when the bounding box is background (contains no target), Pr(object) = 0; when it contains a target, Pr(object) = 1. The accuracy of the bounding box is characterized by the IOU (intersection over union) between the predicted box and the ground-truth box, denoted IOU(truth, pred). Confidence is therefore defined as Pr(object) * IOU(truth, pred), the product of the two factors, so the box's prediction accuracy is also reflected in it. The size and location of a bounding box are characterized by four values: (x, y, w, h), where (x, y) is the center of the box and w and h are its width and height. The predicted center coordinates (x, y) are offsets relative to the top-left corner of the cell, normalized by the cell size, while the predicted w and h are ratios of the box's width and height to the whole picture, so all four elements should theoretically lie in the range [0, 1]. Each bounding box prediction thus actually contains five elements: (x, y, w, h, c), where the first four characterize the size and location of the box and the last is the confidence.

        For the classification part, each cell also predicts C class probability values, which characterize the probability that a target whose bounding box this cell is responsible for predicting belongs to each category. In fact these probabilities are conditional on the box containing an object, that is, Pr(class | object). Note that no matter how many bounding boxes a cell predicts, it predicts only one set of class probabilities; this is a shortcoming of the YOLO algorithm, and in later improved versions such as YOLO9000 the class probabilities are bound to each bounding box. Meanwhile, we can compute each bounding box's class-specific confidence scores:

        Pr(class_i | object) * Pr(object) * IOU(truth, pred) = Pr(class_i) * IOU(truth, pred)

        The class-specific confidence of a bounding box characterizes both the probability that the object in the box belongs to each category and how well the box matches the target. Predicted boxes are usually filtered according to this class confidence.
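A minimal NumPy sketch of decoding one grid cell's predictions under this encoding (S = 7, B = 2, C = 20; the flat array layout here is my own assumption, the paper's actual tensor layout differs):

```python
import numpy as np

S, B, C = 7, 2, 20

def decode_cell(pred, row, col, img_w, img_h):
    """pred: the (B*5 + C,) vector for one grid cell.
    Returns (boxes in pixels, class-specific confidences per box)."""
    class_probs = pred[B * 5:]             # Pr(class_i | object)
    boxes, scores = [], []
    for b in range(B):
        x, y, w, h, conf = pred[b * 5:(b + 1) * 5]
        cx = (col + x) / S * img_w         # cell-relative -> image coords
        cy = (row + y) / S * img_h
        boxes.append((cx, cy, w * img_w, h * img_h))
        scores.append(conf * class_probs)  # class-specific confidence
    return np.array(boxes), np.array(scores)

pred = np.random.rand(B * 5 + C)
boxes, scores = decode_cell(pred, row=3, col=4, img_w=448, img_h=448)
print(boxes.shape, scores.shape)  # (2, 4) (2, 20)
```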

 

        YOLO's advantages: YOLO uses one CNN network to achieve detection in a single pass, and both training and prediction are end-to-end, so the algorithm is simple and fast. Second, because YOLO convolves the whole picture, it has a larger field of view when detecting a target and is less likely to mistake background for objects. In addition, YOLO generalizes well, and the model remains robust when transferred to new domains.

        YOLO's disadvantages: each cell predicts only two bounding boxes, which share a single category, so YOLO's performance on small objects is unsatisfactory. YOLO also generalizes poorly across object aspect ratios and cannot localize objects with unusual proportions. Inaccurate localization is one of YOLO's biggest problems.

2.2 SSD algorithm
        SSD stands for Single Shot MultiBox Detector. The R-CNN series splits detection into two stages, region proposal and classification; SSD unifies them into one step, making the model simpler and faster. Like YOLO, SSD completes detection in a single pass. Compared with YOLO, SSD detects directly with convolutions instead of the fully connected layers YOLO applies after the full-image features. SSD differs from Faster R-CNN in two main ways. First, for an anchor box it no longer decides whether the box contains an object of interest and then classifies the object into n classes in a second step; SSD explicitly uses a single (C + 1)-class classifier to determine which class of object the anchor corresponds to, or whether it is just background, and instead of an extra round of box refinement a single regressor directly predicts the real bounding box. Second, SSD makes predictions not only on the convolutional network's output features but also on progressively smaller feature maps produced by further convolution and pooling layers, achieving multi-scale prediction.

(1) Multi-scale feature maps for detection
        Multi-scale feature maps have different sizes. In a CNN the feature maps toward the front are relatively large, and stride-2 convolution or pooling layers are used later to progressively reduce the feature-map size; both a relatively large and a relatively small feature map are used for detection. The advantage is that relatively large feature maps detect relatively small targets, while small feature maps are responsible for detecting large targets. As shown in the figure below, the 8 × 8 feature map has more cells, but the prior box on each cell has a relatively small scale.

 

(2) Detection by convolution
        Unlike YOLO, which ends with fully connected layers, SSD extracts detection results directly by convolving the different feature maps. For a feature map of shape m × n × p, only a relatively small 3 × 3 × p convolution kernel is needed to obtain the detection values.

(3) Setting prior boxes
        In YOLO, each cell predicts multiple bounding boxes, but all are relative to the cell itself (a square), while real targets have varied shapes, so YOLO must adapt to target shapes during training. SSD borrows the anchor concept from Faster R-CNN: each cell is given prior boxes of different scales and aspect ratios, and the predicted bounding boxes take these priors as references, which reduces the training difficulty to some extent. In general, each cell has several prior boxes that differ in scale and aspect ratio; as the figure shows, each cell uses several different prior boxes, and during training each object in the picture is matched to the prior box whose shape fits it best.
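A minimal sketch of generating such prior boxes for one feature map (NumPy; the scale and aspect-ratio choices here are illustrative, not the SSD paper's exact values, and the function name is my own):

```python
import numpy as np

def prior_boxes(fmap_size, scale=0.2, ratios=(1.0, 2.0, 0.5)):
    """Centered (cx, cy, w, h) priors, all in [0, 1] image coordinates.
    One prior per aspect ratio at every cell of an fmap_size x fmap_size map."""
    boxes = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size
            for r in ratios:
                # same area for every ratio: w/h = r, w*h = scale^2
                w, h = scale * np.sqrt(r), scale / np.sqrt(r)
                boxes.append((cx, cy, w, h))
    return np.array(boxes)

priors = prior_boxes(8)  # 8 x 8 feature map -> 192 priors
print(priors.shape)      # (192, 4)
```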

 

(4) Object category prediction

        For each anchor box we need to predict whether it contains an object of interest or is just background. A 3 × 3 convolutional layer with pad = 1 is used to make the prediction, so the output has the same spatial size as the input. The number of output channels is num_anchors * (num_classes + 1), one channel for the confidence of each (anchor, class) pair. If the output is Y, then the confidences at the pixel corresponding to input position (i, j) of the n-th sample are in Y[n, :, i, j]. Specifically, for the a-th anchor box centered at (i, j), channel a * (num_classes + 1) is its background-only score, and channel a * (num_classes + 1) + 1 + b is its score for containing an object of the b-th class.

(5) Bounding-box prediction

        Because the real bounding box can be any shape, we need to predict how to transform an anchor box into the true bounding box; this transformation can be described by a vector of length 4. As above, we use a convolution, this time with num_anchors * 4 output channels. If the output is Y, then the anchor-box transformations at the pixel centered on input position (i, j) of the n-th sample are in Y[n, :, i, j]. Specifically, for the a-th anchor box, its transformation occupies channels a * 4 through a * 4 + 3.
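A PyTorch sketch of these two predictor layers under the channel layout just described (channel counts follow the text; the function names and the 256-channel input are my own assumptions):

```python
import torch
import torch.nn as nn

def class_predictor(in_channels, num_anchors, num_classes):
    # one channel per (anchor, class-or-background) pair at every position
    return nn.Conv2d(in_channels, num_anchors * (num_classes + 1), 3, padding=1)

def box_predictor(in_channels, num_anchors):
    # four transformation values per anchor at every position
    return nn.Conv2d(in_channels, num_anchors * 4, 3, padding=1)

fmap = torch.randn(1, 256, 8, 8)  # e.g. an 8 x 8 feature map
cls = class_predictor(256, num_anchors=3, num_classes=20)(fmap)
box = box_predictor(256, num_anchors=3)(fmap)
print(cls.shape, box.shape)       # (1, 63, 8, 8) (1, 12, 8, 8)
```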

 

        The objective function is the same as in common object detection methods and is divided into two parts: the class-confidence loss for the matched default boxes, and the bounding-box regression loss.

III. Open problems
        Methods that rely on candidate regions, such as the R-CNN series, use a dedicated candidate-region extraction method in place of exhaustive sliding-window search; with fewer candidate regions, CNN-based object detection gained greatly in both accuracy and speed. However, these methods depend heavily on the accuracy of candidate-region extraction: when the scene is complex and the target is not salient, extraction may fail to capture any region near the target, and the target goes undetected. Moreover, the effectiveness of convolutional neural networks is currently demonstrated mostly by experiment, and training parameters are set largely by experience and practice, lacking theoretical guidance and quantitative analysis. On the other hand, more reasonable network architectures need to be designed for object detection, combining other network structures to improve detection efficiency and achieve multi-scale, multi-category detection. RCNN, Fast-RCNN, Faster-RCNN and the like have the advantage in detection accuracy and precision but are time-consuming and slow; algorithms like SSD and YOLO are fast, but their accuracy is relatively poor.

        The main challenges are the following:

        1. Small-scale target detection. In deep-CNN-based object detection, neurons in the top layers of the network have large receptive fields, so the information preserved for small-scale targets is less complete, and detection performance on small-scale targets is poor.

        2. Computational complexity. The time cost of current object detection algorithms depends on forward propagation through the feature extraction network. A network's expressive capacity is strongly related to its depth: in general, up to a certain depth, the deeper the network, the stronger its expressive ability and the better the detection performance, but the computational overhead also grows.

In addition, deep convolutional neural networks require a large amount of manually annotated data for training, so access to training data is also essential for an object detection algorithm.

 
