Overview of Fast R-CNN for Object Detection

Fundamentals

The main steps of Fast R-CNN are:

  • Generate candidate regions with the Selective Search (SS) algorithm.
  • Extract features from the entire image with a VGG16 network.
  • Project the candidate regions from the first step onto the feature map to obtain the corresponding feature matrices.
  • Use ROI pooling to scale each feature matrix to the same size, then flatten it and pass it through fully connected layers to obtain the predictions.
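The ROI pooling step above can be sketched in plain NumPy. This is a simplified single-channel version for illustration: Fast R-CNN applies the same pooling per channel, and the even bin splitting used here is one reasonable choice rather than the exact original implementation.

```python
import numpy as np

def roi_pool(feat, roi, out_h, out_w):
    """Max-pool one region of a feature map into a fixed out_h x out_w grid.

    feat: (H, W) feature map (a single channel, for simplicity).
    roi:  (x1, y1, x2, y2) in feature-map coordinates, end-exclusive here.
    """
    x1, y1, x2, y2 = roi
    region = feat[y1:y2, x1:x2]
    h, w = region.shape
    # Split the region into a roughly even out_h x out_w grid of bins.
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output cell keeps the maximum activation in its bin.
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

feat = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 feature map
pooled = roi_pool(feat, (0, 0, 4, 6), out_h=2, out_w=2)
print(pooled.shape)  # (2, 2) regardless of the ROI's original size
```

Whatever the size of the candidate region, the output grid is fixed, which is what lets regions of different shapes feed the same fully connected layers.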

Optimizations relative to R-CNN

There are three main improvements:

  1. Instead of feeding each candidate region into the CNN one by one for feature extraction, the entire image is passed through the network once to obtain a feature map. The candidate regions from the original image are then projected onto the corresponding areas of the feature map, and predictions are made from those regions.
  2. Images no longer need to be forcibly warped to a fixed size; instead, ROI pooling scales each region's features to the same size.
  3. Instead of SVMs, a softmax classifier is used for classification.
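Improvement 1 hinges on projecting each proposal from image coordinates onto the shared feature map. A minimal sketch follows; the 1/16 scale matches the stride of VGG16's conv5 features (the last max-pool is replaced by ROI pooling), and the rounding convention is an illustrative choice:

```python
import math

def project_roi(box, spatial_scale):
    """Map a proposal (x1, y1, x2, y2) from image pixels to feature-map cells.

    spatial_scale = feature-map size / image size, e.g. 1/16 for VGG16's
    conv5 features, where four 2x2 poolings give an overall stride of 16.
    """
    x1, y1, x2, y2 = box
    # Round the top-left down and the bottom-right up so the projected
    # window still covers the whole proposal.
    return (math.floor(x1 * spatial_scale), math.floor(y1 * spatial_scale),
            math.ceil(x2 * spatial_scale), math.ceil(y2 * spatial_scale))

# A proposal on a 224x224 image lands on a 14x14 feature map:
coords = project_roi((32, 48, 160, 200), 1 / 16)
print(coords)  # → (2, 3, 10, 13)
```

Because only coordinates are transformed, every proposal shares the single forward pass through the backbone.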

Optimization significance
  1. For the first optimization point:
    an image only needs to pass through the convolutional network once, which removes a large amount of repeated computation. However, the fully connected layers must still run once for each candidate region of the feature map, so the authors accelerate them with truncated SVD.
  2. For the second optimization point:
    ROI pooling improves training speed and solves the scaling problem more cleanly, since regions are pooled to a fixed size rather than warped.
  3. For the third optimization point:
  • Folding classification into end-to-end network training means features no longer need to be cached on disk for separate SVM training, greatly reducing disk usage compared with R-CNN.
  • The fully connected layers split into two branches: one for softmax classification, the other for bounding-box regression.
  • Construct the multi-task loss
    $$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda[u \geqslant 1]\, L_{loc}(t^u, v)$$
    where $L_{cls}(p, u) = -\log p_u$ is the classification loss: $p$ is the predicted probability distribution over classes and $u$ is the true class label.
    $\lambda[u \geqslant 1]\, L_{loc}(t^u, v)$ is the localization loss: $t^u$ is the predicted offset and scaling coefficients for class $u$, and $v$ is the true offset and scaling coefficients between the candidate box and the ground-truth box, defined as in R-CNN.
    The indicator $[u \geqslant 1]$ in the coefficient distinguishes background from objects: for a background region ($u = 0$) the regression loss is skipped, while for an object ($u \geqslant 1$) it is computed; $\lambda$ balances the two terms.
    where
    $$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t_i^u - v_i)$$
    $$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
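The loss above can be sketched in NumPy. The class scores and box targets below are toy stand-ins; in the real network $p$ comes from the softmax branch over all classes and $t^u$, $v$ are 4-vectors per class:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))   # shift for numerical stability
    return e / e.sum()

def smooth_l1(x):
    x = np.asarray(x, dtype=float)
    # Quadratic near zero, linear beyond |x| = 1 (robust to outliers).
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

def multitask_loss(scores, u, t_u, v, lam=1.0):
    """L(p,u,t^u,v) = L_cls(p,u) + lam * [u >= 1] * L_loc(t^u,v)."""
    p = softmax(scores)
    l_cls = -np.log(p[u])                                  # -log p_u
    l_loc = smooth_l1(np.asarray(t_u) - np.asarray(v)).sum()
    return l_cls + lam * (u >= 1) * l_loc                  # Iverson bracket

print(smooth_l1(0.5))   # 0.125  (quadratic region)
print(smooth_l1(2.0))   # 1.5    (linear region)

scores = np.array([1.0, 3.0, 0.5])   # hypothetical scores, class 0 = background
loss_bg = multitask_loss(scores, u=0, t_u=[1, 0, 0, 0], v=[0, 0, 0, 0])
loss_fg = multitask_loss(scores, u=1, t_u=[1, 0, 0, 0], v=[0, 0, 0, 0])
# For the background label (u = 0) the regression term is switched off,
# so only loss_fg includes the smooth-L1 contribution.
```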
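The truncated-SVD trick mentioned for the first optimization point can also be sketched. The matrix sizes here are small, illustrative stand-ins for VGG16's 4096-wide fully connected layers:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512))   # stand-in FC weight matrix (out x in)

# Keep only the top-t singular values: W ≈ U_t @ diag(S_t) @ Vt_t.
t = 64
U, S, Vt = np.linalg.svd(W, full_matrices=False)
U_t, S_t, Vt_t = U[:, :t], S[:t], Vt[:t, :]

# One 256x512 layer becomes two thinner layers (512 -> t, then t -> 256),
# cutting multiply-adds per region from 256*512 to t*(512 + 256).
full_cost = 256 * 512
trunc_cost = t * (512 + 256)

x = rng.standard_normal(512)
y = U_t @ (S_t * (Vt_t @ x))          # forward pass through the two layers
print(full_cost, trunc_cost, y.shape)
```

Since the fully connected layers run once per candidate region, this per-region saving is multiplied by the number of proposals, which is why the acceleration matters at test time.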

Origin blog.csdn.net/qq_44116998/article/details/128425273