Object Detection

1. Object detection

Top 5 Applications of Computer Vision

1.1 Brief overview and terminology of object detection


  • Object recognition distinguishes what objects are in a picture: the input is an image and the output is a class label and a probability. An object detection algorithm must not only detect what objects are in the picture, but also output each object's bounding box (x, y, width, height) to locate its position.
  • Object detection means accurately finding the location of the objects in a given picture and labeling their categories.
  • The problem object detection solves is the whole pipeline of where the objects are and what they are.
  • However, this problem is not easy to solve: object sizes vary widely, object angle and pose are uncertain, objects can appear anywhere in the picture, and they can belong to many different categories.


At present, the object detection algorithms emerging in academia and industry fall into three categories:

  1. Traditional object detection algorithms: Cascade + HOG/DPM + Haar/SVM and the many improvements and optimizations of these methods;
  2. Candidate region/box + deep-learning classification: extract candidate regions and classify each region with a deep-learning method, for example:
  • R-CNN (Selective Search + CNN + SVM)
  • SPP-net (ROI Pooling)
  • Fast R-CNN (Selective Search + CNN + ROI)
  • Faster R-CNN (RPN + CNN + ROI)
  3. Regression methods based on deep learning: YOLO, SSD, and similar methods

1.2 IOU

Intersection over Union (IoU) is a standard metric for measuring the accuracy of detecting objects in a particular dataset.

IoU is a simple metric: any task whose output produces predicted bounding boxes can be evaluated with IoU.

In order for IoU to be used to measure object detection of arbitrary size and shape, we need:

  1. ground-truth bounding boxes (the manually labeled extent of the objects to be detected in the training-set images);
  2. the predicted bounding boxes produced by our algorithm.

That is, this criterion measures how well the prediction overlaps the ground truth: IoU = (area of overlap) / (area of union), so the greater the overlap, the higher the value.
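As a concrete illustration, here is a minimal sketch of computing IoU for two axis-aligned boxes given as (x1, y1, x2, y2); the function name and box format are assumptions made for this example:

def iou(box_a, box_b):
    # boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2
    inter_x1 = max(box_a[0], box_b[0])
    inter_y1 = max(box_a[1], box_b[1])
    inter_x2 = min(box_a[2], box_b[2])
    inter_y2 = min(box_a[3], box_b[3])
    # intersection area (zero if the boxes do not overlap)
    inter = max(0, inter_x2 - inter_x1) * max(0, inter_y2 - inter_y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    # union = sum of the two areas minus the intersection counted twice
    return inter / float(area_a + area_b - inter)

# e.g. iou((0, 0, 100, 100), (50, 50, 150, 150)) is roughly 0.143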

1.3 TP TN FP FN

TP, TN, FP, and FN are combinations of four letters:
T means True;
F means False;
P means Positive;
N means Negative.

T or F indicates whether the sample was classified correctly.
P or N indicates whether the sample was predicted as positive or negative.

TP (True Positives): predicted as positive samples, and the prediction is correct.
TN (True Negatives): predicted as negative samples, and the prediction is correct.
FP (False Positives): predicted as positive samples, but the prediction is wrong (they are actually negative samples).
FN (False Negatives): predicted as negative samples, but the prediction is wrong (they are actually positive samples).
The mAP calculation mainly uses the three concepts TP, FP, and FN.

1.4 Precision and recall

Precision = TP / (TP + FP)
TP is a sample that the classifier considers positive and that is indeed positive; FP is a sample that the classifier considers positive but that is actually negative. Precision is the proportion of the samples the classifier considers positive that are indeed positive.

Recall = TP / (TP + FN)
TP is a sample that the classifier considers positive and that is indeed positive; FN is a sample that the classifier considers negative but that is actually positive. Recall is the proportion of all samples that are actually positive that the classifier finds.

In short, precision measures whether what was found is right, and recall measures whether everything was found.


  • The blue boxes are the ground-truth boxes. The green and red boxes are prediction boxes: the green boxes are positive samples and the red boxes are negative samples.
  • Generally, when the IoU between a prediction box and a ground-truth box is >= 0.5, the prediction is considered a positive sample (a small sketch of this computation follows below).
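As a rough sketch (not the exact protocol of any particular benchmark), precision and recall for a single image could be computed from the IoU function above; the helper name and the 0.5 threshold are assumptions for this example:

def precision_recall(pred_boxes, gt_boxes, iou_threshold=0.5):
    # a prediction is a TP if it matches a not-yet-matched ground-truth box with IoU >= threshold
    matched = set()
    tp = 0
    for pb in pred_boxes:
        for i, gb in enumerate(gt_boxes):
            if i not in matched and iou(pb, gb) >= iou_threshold:
                matched.add(i)
                tp += 1
                break
    fp = len(pred_boxes) - tp          # predictions that matched nothing
    fn = len(gt_boxes) - len(matched)  # ground-truth boxes that were missed
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    return precision, recall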

2. Bounding-Box regression


What is bounding box regression?

  • The window is generally represented by a four-dimensional vector (x, y, w, h), which respectively represent the coordinates of the center point and the width and height of the window.
  • The red box P represents the original Proposal;
  • The green box G represents the Ground Truth of the target;

Our goal is to find a relationship that maps the input original window P to a regressed window Ĝ that is closer to the ground-truth window G.

Therefore, the purpose of bounding box regression is:
given (Px, Py, Pw, Ph), find a mapping f such that f(Px, Py, Pw, Ph) = (Ĝx, Ĝy, Ĝw, Ĝh) and (Ĝx, Ĝy, Ĝw, Ĝh) ≈ (Gx, Gy, Gw, Gh).


How is bounding box regression done?
The simple idea is: translation + scaling.

Input:
P=(Px,Py,Pw,Ph)
(Note: The training phase input also includes Ground Truth)

Output:
The required translation (dx, dy) and scaling (dw, dh), also written Δx, Δy, Sw, Sh.

With these four transformations we can map the proposal onto a window that approximates the Ground Truth (see the sketch below).
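A minimal sketch of the standard R-CNN-style way of applying these four transformations (the detection code later in this article decodes boxes in the same spirit); the function name is illustrative:

import math

def apply_box_deltas(px, py, pw, ph, dx, dy, dw, dh):
    # translate the center by an amount proportional to the proposal size
    gx = pw * dx + px
    gy = ph * dy + py
    # scale the width and height exponentially (this keeps them positive)
    gw = pw * math.exp(dw)
    gh = ph * math.exp(dh)
    return gx, gy, gw, gh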

3. Faster R-CNN

Faster RCNN can be divided into 4 main parts:

  1. Conv layers : As a CNN-based object detection method, Faster RCNN first uses a set of basic conv+relu+pooling layers to extract the image's feature maps. These feature maps are shared by the subsequent RPN layers and fully connected layers.
  2. Region Proposal Networks (RPN) : The RPN network is used to generate region proposals. It judges whether anchors are positive or negative via softmax, then uses bounding box regression to refine the anchors and obtain accurate proposals.
  3. RoI Pooling : This layer collects the input feature maps and proposals, combines this information to extract the proposal feature maps, and sends them to the subsequent fully connected layers to determine the target category.
  4. Classification : Uses the proposal feature maps to compute each proposal's category, and runs bounding box regression again to obtain the final precise position of the detection box.


3.1 Faster-RCNN:conv layer

1 Conv layers
The Conv layers include three kinds of layers: conv, pooling, and relu. There are 13 conv layers, 13 relu layers, and 4 pooling layers.

In Conv layers:

  1. All conv layers are: kernel_size=3, pad=1, stride=1
  2. All pooling layers are: kernel_size=2, pad=0, stride=2

In the Faster RCNN Conv layers, all convolutions are padded (pad=1, i.e., a one-pixel border of zeros is added), so an MxN input first becomes (M+2)x(N+2), and the 3x3 convolution then outputs MxN again. It is this setting that keeps the conv layers in the Conv layers from changing the input/output spatial size.


Similarly, the pooling layers in the Conv layers have kernel_size=2 and stride=2,
so each MxN matrix that passes through a pooling layer becomes (M/2)x(N/2) in size.

To sum up, in the entire Conv layers, the conv and relu layers do not change the input/output size; only the pooling layers halve the output width and height.

Thus a matrix of size MxN becomes (M/16)x(N/16) after the Conv layers (four poolings),
so that the feature map produced by the Conv layers can be mapped back to the original image (see the sketch below).
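A quick sketch of the size arithmetic, assuming the 3x3/pad=1/stride=1 convolutions and 2x2/stride=2 poolings described above (the input size 800 is just an example):

def conv_out(size, kernel=3, pad=1, stride=1):
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    return (size - kernel) // stride + 1

m = 800
m = conv_out(m)        # a 3x3 conv with pad=1 keeps the size: 800
for _ in range(4):     # four 2x2/stride-2 poolings
    m = pool_out(m)
print(m)               # 800 / 16 = 50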

3.2 Faster-RCNN:Region Proposal Networks(RPN)

2. Region Proposal Networks (RPN)
Classic detection methods are very time-consuming at generating detection boxes. Directly using an RPN to generate detection boxes is a huge advantage of Faster R-CNN, which greatly increases the speed of detection box generation.

  • The RPN network is actually split into 2 branches:
  1. The upper branch classifies the anchors into positive and negative via softmax;
  2. The lower branch computes the bounding box regression offsets of the anchors to obtain accurate proposals.
  • The final Proposal layer combines the positive anchors with the corresponding bounding box regression offsets to obtain proposals, while rejecting proposals that are too small or that extend beyond the image boundary.
  • In fact, by the time the network reaches the Proposal layer, it has completed the function equivalent to object localization.

3.2.1 anchors

After the RPN convolution, each point of the feature map is mapped back to a region of the original image; the center of that region is found, and 9 kinds of anchor boxes are generated around this center according to fixed rules.

The 9 rectangles use 3 scales (128, 256, 512) and 3 shapes (aspect ratios of roughly 1:1, 1:2 and 2:1); these values are not fixed and can be adjusted.

The 4 values ​​in each row represent the coordinates of the upper left and lower right corners of the rectangle.

Each point of the feature maps produced by the Conv layers is equipped with these 9 anchors as its initial detection boxes.
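A minimal sketch of how the 9 base anchors (3 scales x 3 aspect ratios around one center) could be generated; the exact values differ between implementations:

import numpy as np

def base_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for s in scales:
        for r in ratios:
            # keep the area roughly s*s while changing the aspect ratio w:h = r
            w = s * np.sqrt(r)
            h = s / np.sqrt(r)
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)   # 9 rows of (x1, y1, x2, y2)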

3.2.2 Softmax determines positive and negative

In fact, the RPN simply lays out dense candidate anchors at the scale of the original image and then uses a CNN to judge which anchors are positive anchors (containing a target) and which are negative anchors (containing no target). So it is just a binary classification.

The classification conv layer has num_output=18, so the output after this convolution is WxHx18 in size.

This corresponds to the fact that each point of the feature map has 9 anchors, and each anchor may be positive or negative; all this information is stored in a WxHx(9*2) matrix.

Why do this? The softmax classification that follows obtains the positive anchors, which amounts to an initial extraction of candidate regions for the detection targets (it is generally assumed that the targets lie in the positive anchors).

So why attach a reshape layer before and after the softmax? It is simply for the convenience of the softmax classification.

The positive/negative anchor scores are stored in caffe as a blob of shape [1, 18, H, W]. Since the softmax performs a positive/negative binary classification, the reshape layer first changes the blob to shape [1, 2, 9xH, W], i.e., it "vacates" a dimension for the softmax classification, and afterwards the blob is reshaped back to its original shape.

In summary, the RPN network uses anchors and softmax to initially extract positive anchors as candidate regions.
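The reshape trick can be illustrated with numpy; this sketch assumes a caffe-style [N, C, H, W] layout and made-up dimensions:

import numpy as np

scores = np.random.rand(1, 18, 38, 50)            # [1, 2*9, H, W] raw RPN cls output
reshaped = scores.reshape(1, 2, 9 * 38, 50)       # "vacate" a 2-way axis for softmax
probs = np.exp(reshaped) / np.exp(reshaped).sum(axis=1, keepdims=True)  # softmax over the 2 classes
probs = probs.reshape(1, 18, 38, 50)              # reshape back to the original layout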

3.2.3 bounding box regression on proposals

The regression conv layer has num_output=36, so the output after this convolution is WxHx36. This corresponds to each point of the feature map having 9 anchors, each with 4 regression transformations (dx, dy, dw, dh).

3.2.4 Proposal Layer

The Proposal Layer is responsible for synthesizing all transformations and positive anchors, calculating an accurate proposal, and sending it to the subsequent RoI Pooling Layer.

Proposal Layer has 4 inputs:

  1. positive vs negative anchors classifier results rpn_cls_prob_reshape,
  2. The transformation amount of the corresponding bbox reg rpn_bbox_pred,
  3. im_info
  4. parameter feature_stride=16

im_info: an image of arbitrary size PxQ is first resized to a fixed MxN before being fed into Faster RCNN, and im_info = [M, N, scale_factor] records this scaling.
The input image passes through the Conv layers and becomes WxH = (M/16)x(N/16) after 4 poolings; feature_stride=16 records this and is used to compute the anchor offsets.

Proposal Layers are processed in the following order:

  1. Use the regression transformations to perform bbox regression on all positive anchors.
  2. Sort the anchors by their positive softmax scores from high to low and keep the top pre_nms_topN (e.g. 6000), i.e., extract the position-corrected positive anchors.
  3. Perform NMS (non-maximum suppression) on the remaining positive anchors (a sketch is given below).
  4. Then output the proposal.
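A minimal sketch of the greedy NMS used in step 3, reusing the iou helper sketched in section 1.2; the 0.7 threshold is only an example value:

def nms(boxes, scores, iou_threshold=0.7):
    # boxes: list of (x1, y1, x2, y2); scores: matching list of confidences
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # drop every remaining box that overlaps the best box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep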

The detection in the strict sense should end here, and the subsequent part should belong to the identification.

The RPN network structure, summed up: generate anchors -> softmax classifier extracts positive anchors -> bbox regression refines the positive anchors -> Proposal Layer generates proposals

3.3 Faster-RCNN:Roi pooling

The RoI Pooling layer is responsible for collecting proposals, and calculating the proposal feature maps, which are sent to the subsequent network.

The Rol pooling layer has 2 inputs:

  1. Original feature maps
  2. Proposal boxes output by RPN (various in size)


Why is RoI Pooling needed?
For traditional CNNs (such as AlexNet and VGG), once the network has been trained, the input image size must be a fixed value, and the network output is likewise a fixed-size vector or matrix. This becomes troublesome when the input image sizes vary.

There are 2 solutions:

  1. Crop part of the image and pass it to the network (this destroys the complete structure of the image)
  2. Warp the image to the required size and pass it to the network (this destroys the image's original shape information)

Principle of RoI Pooling
RoI Pooling introduces the parameters pooled_w, pooled_h, and spatial_scale (1/16).

RoI Pooling layer forward process:

  1. Since the proposal coordinates correspond to the MxN image scale, first use the spatial_scale parameter to map them back to the (M/16)x(N/16) feature map scale;
  2. Then divide the feature map region corresponding to each proposal into a pooled_w x pooled_h grid;
  3. Perform max pooling on each grid cell.

After this processing, proposals of different sizes all produce outputs of fixed size pooled_w x pooled_h, achieving a fixed-length output (a rough sketch follows below).
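A rough numpy sketch of this forward process for a single proposal; real implementations handle rounding, batching and empty bins more carefully:

import numpy as np

def roi_pool(feature_map, roi, pooled_h=7, pooled_w=7, spatial_scale=1.0 / 16):
    # feature_map: (H, W, C); roi: (x1, y1, x2, y2) in original-image coordinates
    x1, y1, x2, y2 = [int(round(v * spatial_scale)) for v in roi]   # map back to feature-map scale
    roi_feat = feature_map[y1:y2 + 1, x1:x2 + 1, :]
    h, w, c = roi_feat.shape
    out = np.zeros((pooled_h, pooled_w, c))
    ys = np.linspace(0, h, pooled_h + 1).astype(int)   # grid cell boundaries
    xs = np.linspace(0, w, pooled_w + 1).astype(int)
    for i in range(pooled_h):
        for j in range(pooled_w):
            cell = roi_feat[ys[i]:max(ys[i + 1], ys[i] + 1), xs[j]:max(xs[j + 1], xs[j] + 1), :]
            out[i, j, :] = cell.max(axis=(0, 1))       # max pooling within each grid cell
    return out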


3.4 Faster-RCNN: Classification

The Classification part uses the obtained proposal feature maps to compute, through fully connected layers and softmax, which category each proposal belongs to (e.g. person, car, TV), and outputs the cls_prob probability vector;

at the same time, bounding box regression is used again to obtain each proposal's position offset bbox_pred, which is used to regress a more precise detection box.

After the pooled_w x pooled_h proposal feature maps are obtained from RoI Pooling, they are sent to the subsequent network, which does two things:

  1. Classifies the proposals through fully connected layers and softmax; this is the actual recognition of the category.
  2. Performs bounding box regression on the proposals again to obtain higher-precision prediction boxes.

Fully connected layer InnerProduct layers:

3.5 Network Comparison


3.6 Code example

3.6.1 Network construction

import cv2
import keras
import numpy as np
import colorsys
import pickle
import os
import nets.frcnn as frcnn
from nets.frcnn_training import get_new_img_size
from keras import backend as K
from keras.layers import Input
from keras.applications.imagenet_utils import preprocess_input
from PIL import Image,ImageFont, ImageDraw
from utils.utils import BBoxUtility
from utils.anchors import get_anchors
from utils.config import Config
import copy
import math
class FRCNN(object):
    _defaults = {
        "model_path": 'model_data/voc_weights.h5',
        "classes_path": 'model_data/voc_classes.txt',
        "confidence": 0.7,
    }

    @classmethod
    def get_defaults(cls, n):
        if n in cls._defaults:
            return cls._defaults[n]
        else:
            return "Unrecognized attribute name '" + n + "'"

    #---------------------------------------------------#
    #   Initialize Faster RCNN
    #---------------------------------------------------#
    def __init__(self, **kwargs):
        self.__dict__.update(self._defaults)
        self.class_names = self._get_class()
        self.sess = K.get_session()
        self.config = Config()
        self.generate()
        self.bbox_util = BBoxUtility()
    #---------------------------------------------------#
    #   Get all the classes
    #---------------------------------------------------#
    def _get_class(self):
        classes_path = os.path.expanduser(self.classes_path)
        with open(classes_path) as f:
            class_names = f.readlines()
        class_names = [c.strip() for c in class_names]
        return class_names

    #---------------------------------------------------#
    #   Load the model and set up drawing colors
    #---------------------------------------------------#
    def generate(self):
        model_path = os.path.expanduser(self.model_path)
        assert model_path.endswith('.h5'), 'Keras model or weights must be a .h5 file.'
        
        # Compute the total number of classes (including background)
        self.num_classes = len(self.class_names)+1

        # Load the model; if the saved file already contains the model structure, load it directly.
        # Otherwise build the model first and then load the weights.
        self.model_rpn,self.model_classifier = frcnn.get_predict_model(self.config,self.num_classes)
        self.model_rpn.load_weights(self.model_path,by_name=True)
        self.model_classifier.load_weights(self.model_path,by_name=True,skip_mismatch=True)
                
        print('{} model, anchors, and classes loaded.'.format(model_path))

        # Set a different color for each class when drawing boxes
        hsv_tuples = [(x / len(self.class_names), 1., 1.)
                      for x in range(len(self.class_names))]
        self.colors = list(map(lambda x: colorsys.hsv_to_rgb(*x), hsv_tuples))
        self.colors = list(
            map(lambda x: (int(x[0] * 255), int(x[1] * 255), int(x[2] * 255)),
                self.colors))
    
    def get_img_output_length(self, width, height):
        def get_output_length(input_length):
            # input_length += 6
            filter_sizes = [7, 3, 1, 1]
            padding = [3,1,0,0]
            stride = 2
            for i in range(4):
                # input_length = (input_length - filter_size + stride) // stride
                input_length = (input_length+2*padding[i]-filter_sizes[i]) // stride + 1
            return input_length
        return get_output_length(width), get_output_length(height) 
    
    #---------------------------------------------------#
    #   Detect objects in an image
    #---------------------------------------------------#
    def detect_image(self, image):
        image_shape = np.array(np.shape(image)[0:2])
        old_width = image_shape[1]
        old_height = image_shape[0]
        old_image = copy.deepcopy(image)
        width,height = get_new_img_size(old_width,old_height)


        image = image.resize([width,height])
        photo = np.array(image,dtype = np.float64)

        # Preprocess the image (normalization)
        photo = preprocess_input(np.expand_dims(photo,0))
        preds = self.model_rpn.predict(photo)
        # Decode the prediction results
        anchors = get_anchors(self.get_img_output_length(width,height),width,height)

        rpn_results = self.bbox_util.detection_out(preds,anchors,1,confidence_threshold=0)
        R = rpn_results[0][:, 2:]
        
        R[:,0] = np.array(np.round(R[:, 0]*width/self.config.rpn_stride),dtype=np.int32)
        R[:,1] = np.array(np.round(R[:, 1]*height/self.config.rpn_stride),dtype=np.int32)
        R[:,2] = np.array(np.round(R[:, 2]*width/self.config.rpn_stride),dtype=np.int32)
        R[:,3] = np.array(np.round(R[:, 3]*height/self.config.rpn_stride),dtype=np.int32)
        
        R[:, 2] -= R[:, 0]
        R[:, 3] -= R[:, 1]
        base_layer = preds[2]
        
        delete_line = []
        for i,r in enumerate(R):
            if r[2] < 1 or r[3] < 1:
                delete_line.append(i)
        R = np.delete(R,delete_line,axis=0)
        
        bboxes = []
        probs = []
        labels = []
        for jk in range(R.shape[0]//self.config.num_rois + 1):
            ROIs = np.expand_dims(R[self.config.num_rois*jk:self.config.num_rois*(jk+1), :], axis=0)
            
            if ROIs.shape[1] == 0:
                break

            if jk == R.shape[0]//self.config.num_rois:
                #pad R
                curr_shape = ROIs.shape
                target_shape = (curr_shape[0],self.config.num_rois,curr_shape[2])
                ROIs_padded = np.zeros(target_shape).astype(ROIs.dtype)
                ROIs_padded[:, :curr_shape[1], :] = ROIs
                ROIs_padded[0, curr_shape[1]:, :] = ROIs[0, 0, :]
                ROIs = ROIs_padded
            
            [P_cls, P_regr] = self.model_classifier.predict([base_layer,ROIs])

            for ii in range(P_cls.shape[1]):
                if np.max(P_cls[0, ii, :]) < self.confidence or np.argmax(P_cls[0, ii, :]) == (P_cls.shape[2] - 1):
                    continue

                label = np.argmax(P_cls[0, ii, :])

                (x, y, w, h) = ROIs[0, ii, :]

                cls_num = np.argmax(P_cls[0, ii, :])

                (tx, ty, tw, th) = P_regr[0, ii, 4*cls_num:4*(cls_num+1)]
                tx /= self.config.classifier_regr_std[0]
                ty /= self.config.classifier_regr_std[1]
                tw /= self.config.classifier_regr_std[2]
                th /= self.config.classifier_regr_std[3]

                cx = x + w/2.
                cy = y + h/2.
                cx1 = tx * w + cx
                cy1 = ty * h + cy
                w1 = math.exp(tw) * w
                h1 = math.exp(th) * h

                x1 = cx1 - w1/2.
                y1 = cy1 - h1/2.

                x2 = cx1 + w1/2
                y2 = cy1 + h1/2

                x1 = int(round(x1))
                y1 = int(round(y1))
                x2 = int(round(x2))
                y2 = int(round(y2))

                bboxes.append([x1,y1,x2,y2])
                probs.append(np.max(P_cls[0, ii, :]))
                labels.append(label)

        if len(bboxes)==0:
            return old_image
        
        # Keep the boxes whose score is above the confidence threshold
        labels = np.array(labels)
        probs = np.array(probs)
        boxes = np.array(bboxes,dtype=np.float32)
        boxes[:,0] = boxes[:,0]*self.config.rpn_stride/width
        boxes[:,1] = boxes[:,1]*self.config.rpn_stride/height
        boxes[:,2] = boxes[:,2]*self.config.rpn_stride/width
        boxes[:,3] = boxes[:,3]*self.config.rpn_stride/height
        results = np.array(self.bbox_util.nms_for_out(np.array(labels),np.array(probs),np.array(boxes),self.num_classes-1,0.4))
        
        top_label_indices = results[:,0]
        top_conf = results[:,1]
        boxes = results[:,2:]
        boxes[:,0] = boxes[:,0]*old_width
        boxes[:,1] = boxes[:,1]*old_height
        boxes[:,2] = boxes[:,2]*old_width
        boxes[:,3] = boxes[:,3]*old_height

        font = ImageFont.truetype(font='model_data/simhei.ttf',size=np.floor(3e-2 * np.shape(image)[1] + 0.5).astype('int32'))
        
        thickness = (np.shape(old_image)[0] + np.shape(old_image)[1]) // width
        image = old_image
        for i, c in enumerate(top_label_indices):
            predicted_class = self.class_names[int(c)]
            score = top_conf[i]

            left, top, right, bottom = boxes[i]
            top = top - 5
            left = left - 5
            bottom = bottom + 5
            right = right + 5

            top = max(0, np.floor(top + 0.5).astype('int32'))
            left = max(0, np.floor(left + 0.5).astype('int32'))
            bottom = min(np.shape(image)[0], np.floor(bottom + 0.5).astype('int32'))
            right = min(np.shape(image)[1], np.floor(right + 0.5).astype('int32'))

            # Draw the boxes
            label = '{} {:.2f}'.format(predicted_class, score)
            draw = ImageDraw.Draw(image)
            label_size = draw.textsize(label, font)
            label = label.encode('utf-8')
            print(label)
            
            if top - label_size[1] >= 0:
                text_origin = np.array([left, top - label_size[1]])
            else:
                text_origin = np.array([left, top + 1])

            for i in range(thickness):
                draw.rectangle(
                    [left + i, top + i, right - i, bottom - i],
                    outline=self.colors[int(c)])
            draw.rectangle(
                [tuple(text_origin), tuple(text_origin + label_size)],
                fill=self.colors[int(c)])
            draw.text(text_origin, str(label,'UTF-8'), fill=(0, 0, 0), font=font)
            del draw
        return image

    def close_session(self):
        self.sess.close()

3.6.2 Training script

from __future__ import division
from nets.frcnn import get_model
from nets.frcnn_training import cls_loss,smooth_l1,Generator,get_img_output_length,class_loss_cls,class_loss_regr

from utils.config import Config
from utils.utils import BBoxUtility
from utils.roi_helpers import calc_iou

from keras.utils import generic_utils
from keras.callbacks import TensorBoard, ModelCheckpoint, ReduceLROnPlateau, EarlyStopping
import keras
import numpy as np
import time 
import tensorflow as tf
from utils.anchors import get_anchors

def write_log(callback, names, logs, batch_no):
    for name, value in zip(names, logs):
        summary = tf.Summary()
        summary_value = summary.value.add()
        summary_value.simple_value = value
        summary_value.tag = name
        callback.writer.add_summary(summary, batch_no)
        callback.writer.flush()

if __name__ == "__main__":
    config = Config()
    NUM_CLASSES = 21
    EPOCH = 100
    EPOCH_LENGTH = 2000
    bbox_util = BBoxUtility(overlap_threshold=config.rpn_max_overlap,ignore_threshold=config.rpn_min_overlap)
    annotation_path = '2007_train.txt'

    model_rpn, model_classifier,model_all = get_model(config,NUM_CLASSES)
    base_net_weights = "model_data/voc_weights.h5"
    
    model_all.summary()
    model_rpn.load_weights(base_net_weights,by_name=True)
    model_classifier.load_weights(base_net_weights,by_name=True)

    with open(annotation_path) as f: 
        lines = f.readlines()
    np.random.seed(10101)
    np.random.shuffle(lines)
    np.random.seed(None)

    gen = Generator(bbox_util, lines, NUM_CLASSES, solid=True)
    rpn_train = gen.generate()
    log_dir = "logs"
    # Training settings
    logging = TensorBoard(log_dir=log_dir)
    callback = logging
    callback.set_model(model_all)
    
    model_rpn.compile(loss={
                'regression'    : smooth_l1(),
                'classification': cls_loss()
            }, optimizer=keras.optimizers.Adam(lr=1e-5)
    )
    model_classifier.compile(loss=[
        class_loss_cls, 
        class_loss_regr(NUM_CLASSES-1)
        ], 
        metrics={'dense_class_{}'.format(NUM_CLASSES): 'accuracy'},
        optimizer=keras.optimizers.Adam(lr=1e-5)
    )
    model_all.compile(optimizer='sgd', loss='mae')

    # Initialize parameters
    iter_num = 0
    train_step = 0
    losses = np.zeros((EPOCH_LENGTH, 5))
    rpn_accuracy_rpn_monitor = []
    rpn_accuracy_for_epoch = [] 
    start_time = time.time()
    # Best loss so far
    best_loss = np.Inf
    # Mapping from indices to classes
    print('Starting training')

    for i in range(EPOCH):

        if i == 20:
            model_rpn.compile(loss={
                        'regression'    : smooth_l1(),
                        'classification': cls_loss()
                    }, optimizer=keras.optimizers.Adam(lr=1e-6)
            )
            model_classifier.compile(loss=[
                class_loss_cls, 
                class_loss_regr(NUM_CLASSES-1)
                ], 
                metrics={'dense_class_{}'.format(NUM_CLASSES): 'accuracy'},
                optimizer=keras.optimizers.Adam(lr=1e-6)
            )
            print("Learning rate decrease")
        
        progbar = generic_utils.Progbar(EPOCH_LENGTH) 
        print('Epoch {}/{}'.format(i + 1, EPOCH))
        while True:
            if len(rpn_accuracy_rpn_monitor) == EPOCH_LENGTH and config.verbose:
                mean_overlapping_bboxes = float(sum(rpn_accuracy_rpn_monitor))/len(rpn_accuracy_rpn_monitor)
                rpn_accuracy_rpn_monitor = []
                print('Average number of overlapping bounding boxes from RPN = {} for {} previous iterations'.format(mean_overlapping_bboxes, EPOCH_LENGTH))
                if mean_overlapping_bboxes == 0:
                    print('RPN is not producing bounding boxes that overlap the ground truth boxes. Check RPN settings or keep training.')
            
            X, Y, boxes = next(rpn_train)
            
            loss_rpn = model_rpn.train_on_batch(X,Y)
            write_log(callback, ['rpn_cls_loss', 'rpn_reg_loss'], loss_rpn, train_step)
            P_rpn = model_rpn.predict_on_batch(X)
            height,width,_ = np.shape(X[0])
            anchors = get_anchors(get_img_output_length(width,height),width,height)
            
            # Decode the prediction results
            results = bbox_util.detection_out(P_rpn,anchors,1, confidence_threshold=0)
            
            R = results[0][:, 2:]

            X2, Y1, Y2, IouS = calc_iou(R, config, boxes[0], width, height, NUM_CLASSES)

            if X2 is None:
                rpn_accuracy_rpn_monitor.append(0)
                rpn_accuracy_for_epoch.append(0)
                continue
            
            neg_samples = np.where(Y1[0, :, -1] == 1)
            pos_samples = np.where(Y1[0, :, -1] == 0)

            if len(neg_samples) > 0:
                neg_samples = neg_samples[0]
            else:
                neg_samples = []

            if len(pos_samples) > 0:
                pos_samples = pos_samples[0]
            else:
                pos_samples = []

            rpn_accuracy_rpn_monitor.append(len(pos_samples))
            rpn_accuracy_for_epoch.append((len(pos_samples)))

            if len(neg_samples)==0:
                continue

            if len(pos_samples) < config.num_rois//2:
                selected_pos_samples = pos_samples.tolist()
            else:
                selected_pos_samples = np.random.choice(pos_samples, config.num_rois//2, replace=False).tolist()
            try:
                selected_neg_samples = np.random.choice(neg_samples, config.num_rois - len(selected_pos_samples), replace=False).tolist()
            except:
                selected_neg_samples = np.random.choice(neg_samples, config.num_rois - len(selected_pos_samples), replace=True).tolist()
            
            sel_samples = selected_pos_samples + selected_neg_samples
            loss_class = model_classifier.train_on_batch([X, X2[:, sel_samples, :]], [Y1[:, sel_samples, :], Y2[:, sel_samples, :]])

            write_log(callback, ['detection_cls_loss', 'detection_reg_loss', 'detection_acc'], loss_class, train_step)


            losses[iter_num, 0]  = loss_rpn[1]
            losses[iter_num, 1] = loss_rpn[2]
            losses[iter_num, 2] = loss_class[1]
            losses[iter_num, 3] = loss_class[2]
            losses[iter_num, 4] = loss_class[3]


            train_step += 1
            iter_num += 1
            progbar.update(iter_num, [('rpn_cls', np.mean(losses[:iter_num, 0])), ('rpn_regr', np.mean(losses[:iter_num, 1])),
                                  ('detector_cls', np.mean(losses[:iter_num, 2])), ('detector_regr', np.mean(losses[:iter_num, 3]))])

            
            if iter_num == EPOCH_LENGTH:
                loss_rpn_cls = np.mean(losses[:, 0])
                loss_rpn_regr = np.mean(losses[:, 1])
                loss_class_cls = np.mean(losses[:, 2])
                loss_class_regr = np.mean(losses[:, 3])
                class_acc = np.mean(losses[:, 4])

                mean_overlapping_bboxes = float(sum(rpn_accuracy_for_epoch)) / len(rpn_accuracy_for_epoch)
                rpn_accuracy_for_epoch = []

                if config.verbose:
                    print('Mean number of bounding boxes from RPN overlapping ground truth boxes: {}'.format(mean_overlapping_bboxes))
                    print('Classifier accuracy for bounding boxes from RPN: {}'.format(class_acc))
                    print('Loss RPN classifier: {}'.format(loss_rpn_cls))
                    print('Loss RPN regression: {}'.format(loss_rpn_regr))
                    print('Loss Detector classifier: {}'.format(loss_class_cls))
                    print('Loss Detector regression: {}'.format(loss_class_regr))
                    print('Elapsed time: {}'.format(time.time() - start_time))

                
                curr_loss = loss_rpn_cls + loss_rpn_regr + loss_class_cls + loss_class_regr
                iter_num = 0

                write_log(callback,
                        ['Elapsed_time', 'mean_overlapping_bboxes', 'mean_rpn_cls_loss', 'mean_rpn_reg_loss',
                        'mean_detection_cls_loss', 'mean_detection_reg_loss', 'mean_detection_acc', 'total_loss'],
                        [time.time() - start_time, mean_overlapping_bboxes, loss_rpn_cls, loss_rpn_regr,
                        loss_class_cls, loss_class_regr, class_acc, curr_loss], i)
                # reset the epoch timer only after the elapsed time has been logged
                start_time = time.time()
                    
                
                if config.verbose:
                    print('The best loss is {}. The current loss is {}. Saving weights'.format(best_loss,curr_loss))
                if curr_loss < best_loss:
                    best_loss = curr_loss
                model_all.save_weights(log_dir+"/epoch{:03d}-loss{:.3f}-rpn{:.3f}-roi{:.3f}".format(i,curr_loss,loss_rpn_cls+loss_rpn_regr,loss_class_cls+loss_class_regr)+".h5")
                
                break

3.6.3 Prediction script

from keras.layers import Input
from frcnn import FRCNN 
from PIL import Image

frcnn = FRCNN()

while True:
    img = input('Input image filename (e.g. img/street.jpg): ')
    try:
        image = Image.open(img)
    except:
        print('Open Error! Try again!')
        continue
    else:
        r_image = frcnn.detect_image(image)
        r_image.show()
frcnn.close_session()
    

4. One-stage and two-stage

two-stage : A two-stage algorithm first uses a network to generate proposals, e.g. selective search or an RPN network; after the emergence of the RPN, the selective-search method was basically abandoned. The RPN is attached to the backbone that extracts image features, an RPN loss (bbox regression loss + classification loss) is set up to train the RPN network, and the proposals generated by the RPN are sent to the subsequent network for more refined bbox regression and classification.

one-stage : One-stage methods pursue speed and abandon the two-stage architecture; that is, they no longer use a separate network to generate proposals, but instead sample densely on the feature map to generate a large number of prior boxes, such as YOLO's grid method. These prior boxes are not processed in two steps, and their sizes are often specified by hand.

The two-stage algorithms are mainly the R-CNN series: R-CNN, Fast R-CNN, and Faster R-CNN. The later Mask R-CNN combines the Faster R-CNN architecture, a ResNet + FPN (Feature Pyramid Network) backbone, and the segmentation approach of FCN, improving detection accuracy while also performing segmentation.

The most typical one-stage algorithm is YOLO, which is extremely fast.

5. Yolo

Pedestrian detection - Yolo3

5.1 Yolo-You Only Look Once


The YOLO algorithm uses a single CNN model to achieve end-to-end object detection:

  1. Resize the image to 448x448; the image is divided into a 7x7 grid of cells.
  2. The CNN extracts features and predicts: the convolutional part is responsible for extracting features, and the fully connected part is responsible for prediction.
  3. Filter the bboxes (via NMS).


  • The YOLO algorithm as a whole divides the input picture into SxS grid cells; here it is a 3x3 grid.
  • When the center point of a target falls into a grid cell, that cell is responsible for detecting the target, such as the person in the figure.
  • We feed this picture into the network, and the final output size is also SxSxn (n is the number of channels); the output SxS corresponds to the SxS division of the original input picture (both are 3x3).
  • If our network can detect 20 categories of targets in total, the number of output channels is n = 2*(4+1)+20 = 30. Here 2 means each grid cell predicts two boxes (as pointed out in the paper), 4 is the coordinate information of a box, 1 is the confidence of a box, and 20 is the number of target categories.
  • So the final output size of the network is SxSxn = 3x3x30.

About the predicted boxes

  • The output of the network is a tensor of S x S x (5*B+C) (S: grid size, B: number of boxes per cell, C: number of detection classes, 5: the information of each box).
  • The 5 splits into 4+1:
  • 4 is the location information of the box: the center point (x, y) and the height and width h, w of the box.
  • 1 is the confidence of each box, reflecting how accurate the box is (a quick check of this arithmetic follows below).
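A quick check of this arithmetic for the original YOLO on Pascal VOC:

S, B, C = 7, 2, 20            # grid size, boxes per cell, classes (Pascal VOC)
channels = 5 * B + C          # 5 values per box (x, y, w, h, confidence) plus class probabilities
print(S, S, channels)         # 7 7 30 -> the 7x7x30 output tensor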


In general, YOLO does not predict the exact coordinates of the center of the bounding box. It predicts:

  • the offset relative to the upper-left corner of the grid cell responsible for the target;
  • the offset normalized by the size of the feature-map cells.

For example, taking the picture above: if the predicted center offset is (0.4, 0.7), then the center's coordinates on the 13 x 13 feature map are (6.4, 6.7) (the upper-left corner of the red cell is at (6, 6)).

However, if the predicted x, y coordinates are greater than 1, such as (1.2, 0.7), the predicted center would be (7.2, 6.7), which lies in the cell to the right of the red cell. This breaks the theory behind YOLO, because if we assume the red cell is responsible for predicting the dog, the dog's center must lie in the red cell and should not be in a neighboring cell.

So, to fix this, we apply a sigmoid to the output, squashing it into the interval 0 to 1, which effectively ensures that the predicted center stays inside the grid cell making the prediction.
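A small sketch of this decoding step; tx and ty stand for the raw network outputs of the cell whose upper-left corner is (cx, cy), and the names are illustrative:

import math

def decode_center(tx, ty, cx, cy):
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    # squash the raw offsets into (0, 1) so the center stays inside the cell,
    # then add the cell's upper-left corner coordinates
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    return bx, by

# decode_center(-0.4, 0.85, 6, 6) gives roughly (6.4, 6.7)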

The confidence of each bounding box and the accuracy of the box:

The left-hand factor represents whether there is a target in the grid cell containing the box: yes = 1, no = 0.

The right-hand factor represents the accuracy of the box: it is the IoU between the two boxes (the ground truth and the predicted box), i.e., their intersection over union. The larger the value, the more the two boxes overlap and the more accurate the prediction.
confidence = Pr(Object) x IOU(pred, truth)

We can compute the class-specific confidence score of each bounding box: it expresses both the probability that the target in the box belongs to each class and how well the box fits the target.

The class probabilities predicted by each grid cell are multiplied by the confidence predicted for each bounding box to obtain that box's class-specific confidence score.
class-specific confidence score = Pr(Class_i | Object) x Pr(Object) x IOU(pred, truth) = Pr(Class_i) x IOU(pred, truth)
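In code form this is just an element-wise product; the numbers below are made up purely for illustration:

import numpy as np

class_probs = np.array([0.1, 0.7, 0.2])   # Pr(Class_i | Object) predicted by the grid cell
box_conf = 0.8                            # Pr(Object) * IOU predicted for one bbox
class_scores = class_probs * box_conf     # class-specific confidence scores for that bbox
print(class_scores)                       # [0.08 0.56 0.16]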


  • The network performs more than 20 convolutions and four max poolings. The 3x3 convolutions extract features, the 1x1 convolutions compress features, and the image is finally compressed to a size of 7x7xfilters, which is equivalent to dividing the whole image into a 7x7 grid, with each cell responsible for detecting targets in its own region.
  • The network finally uses fully connected layers to produce an output of size 7x7x30, where 7x7 is the grid, the first 20 of the 30 channels are the class predictions, and the last 10 are the two predicted boxes and their confidences (5x2).


Perform the same operation for each bbox of each grid: 7x7x2 = 98 bbox (each bbox has both corresponding class information and coordinate information)

After obtaining the class-specific confidence score of each bbox, set the threshold, filter out the boxes with low scores, and perform NMS processing on the reserved boxes to obtain the final detection result.


After sorting, the probabilities are different in the boxes of different positions:

Take the box with the highest score as bbox_max and compare it (compute the IoU) with each lower, non-zero-scored box (bbox_cur); if the IoU exceeds the threshold, bbox_cur's score is set to zero.

Then, recursively, the next non-zero score (0.2) becomes bbox_max and the IoU comparisons continue.

Finally, there are n boxes left


For each remaining bbox, take its 20x1 vector of class scores (e.g. bb3), find the index of the largest class score, and use that largest score as the box's score.


Disadvantages of Yolo:

  • YOLO does not handle objects that are very close together well (objects that are adjacent and whose center points fall into the same grid cell), nor small objects that appear in groups, because each grid cell predicts only two boxes and they belong to only one class.
  • Generalization is weak when the same class of object appears with unusual aspect ratios or other uncommon configurations in the test images.

5.2 Yolo2

  1. Yolo2 uses a new classification network as the feature-extraction part.
  2. The network uses more 3x3 convolution kernels and doubles the number of channels after each pooling operation.
  3. 1x1 convolution kernels are placed between the 3x3 kernels to compress the features.
  4. Batch normalization is used to stabilize training and accelerate convergence.
  5. A passthrough (shortcut) layer is kept to retain earlier, higher-resolution features.
  6. Compared with yolo1, yolo2 adds the prior (anchor) box mechanism; the shape of the final output conv_dec is (13, 13, 425):
  • 13x13: the whole image is divided into a 13x13 grid for prediction.
  • 425 decomposes as 5x85: yolo2 is usually used on the COCO dataset, which has 80 classes, and the remaining 5 values are x, y, w, h and the confidence; the x5 means each cell predicts 5 boxes, corresponding to the 5 prior boxes.


5.2.1 Yolo2 - using anchor boxes


5.2.2 Yolo2 – Dimension Clusters (dimension clustering)

Use k-means clustering to obtain the prior boxes:
previously the prior boxes were set by hand, while YOLO2 tries to find prior boxes that better match the sizes of the objects in the samples, which reduces the difficulty of the network fine-tuning the prior boxes to the actual positions. YOLO2's approach is to run cluster analysis on the labeled boxes in the training set, looking for box sizes that match the samples as well as possible.

The most important choice in the clustering algorithm is how to define the "distance" between two boxes. With the commonly used Euclidean distance, large boxes produce larger errors, but what we actually care about is the IoU between boxes. Therefore, YOLO2 uses the following distance when clustering:
d(box, centroid) = 1 - IOU(box, centroid)

For each choice of the cluster count k, compute the k centroid boxes and the average IoU (Avg IOU) between the labeled boxes in the sample and their closest centroid.

Obviously, the larger the number of boxes k, the larger the Avg IOU.

YOLO2 chooses k=5 as a compromise between the number of boxes and IoU: 5 clustered boxes reach an Avg IOU of 61, comparable to the 60.9 Avg IOU of 9 manually set prior boxes (a sketch of the clustering distance follows below).
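A small sketch of the clustering distance, assuming boxes are represented as (width, height) pairs and compared as if anchored at the same corner, which is a common way to implement this dimension clustering:

def wh_iou(box, centroid):
    # box and centroid are (w, h); compare them as if both were anchored at (0, 0)
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union

def kmeans_distance(box, centroid):
    return 1.0 - wh_iou(box, centroid)    # d(box, centroid) = 1 - IOU(box, centroid)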

The author finally selects 5 cluster centers as the prior boxes. For the two datasets, the widths and heights of the five prior boxes are as follows:
COCO: (0.57273, 0.677385), (1.87446, 2.06253), (3.33843, 5.47434), (7.88282, 3.52778), (9.77052, 9.16828)

VOC: (1.3221, 1.73145), (3.19275, 4.00944), (5.05587, 8.09892), (9.47112, 4.84053), (11.2364, 10.0071)

5.3 Yolo3

Compared with the earlier yolo1 and yolo2, YOLOv3 is greatly improved. The main improvements are:

  1. It uses the residual network structure (Residual blocks).
  2. It extracts multiple feature layers for object detection: a total of three feature layers, with shapes (13,13,75), (26,26,75), (52,52,75). The last dimension is 75 because the figure is based on the VOC dataset, which has 20 classes; yolo3 has 3 prior boxes per feature layer, so the final dimension is 3x25.
  3. It uses UpSampling2D for upsampling.

5.4 Code example (yolo v3)

5.4.1 Model building

# -*- coding:utf-8 -*-

import numpy as np
import tensorflow as tf
import os

class yolo:
    def __init__(self, norm_epsilon, norm_decay, anchors_path, classes_path, pre_train):
        """
        Introduction
        ------------
            初始化函数
        Parameters
        ----------
            norm_decay: 在预测时计算moving average时的衰减率
            norm_epsilon: 方差加上极小的数,防止除以0的情况
            anchors_path: yolo anchor 文件路径
            classes_path: 数据集类别对应文件
            pre_train: 是否使用预训练darknet53模型
        """
        self.norm_epsilon = norm_epsilon
        self.norm_decay = norm_decay
        self.anchors_path = anchors_path
        self.classes_path = classes_path
        self.pre_train = pre_train
        self.anchors = self._get_anchors()
        self.classes = self._get_class()

    #---------------------------------------#
    #   Get the classes and prior boxes (anchors)
    #---------------------------------------#
    def _get_class(self):
        """
        Introduction
        ------------
            获取类别名字
        Returns
        -------
            class_names: coco数据集类别对应的名字
        """
        classes_path = os.path.expanduser(self.classes_path)
        with open(classes_path) as f:
            class_names = f.readlines()
        class_names = [c.strip() for c in class_names]
        return class_names

    def _get_anchors(self):
        """
        Introduction
        ------------
            获取anchors
        """
        anchors_path = os.path.expanduser(self.anchors_path)
        with open(anchors_path) as f:
            anchors = f.readline()
        anchors = [float(x) for x in anchors.split(',')]
        return np.array(anchors).reshape(-1, 2)

    #---------------------------------------#
    #   Layer builders
    #---------------------------------------#
    # Batch normalization (followed by leaky ReLU)
    def _batch_normalization_layer(self, input_layer, name = None, training = True, norm_decay = 0.99, norm_epsilon = 1e-3):
        '''
        Introduction
        ------------
            Apply batch normalization to the feature map produced by a conv layer
        Parameters
        ----------
            input_layer: input 4-D tensor
            name: name of the batchnorm layer
            training: whether this is the training phase
            norm_decay: decay rate used when computing the moving average at prediction time
            norm_epsilon: small value added to the variance to avoid division by zero
        Returns
        -------
            bn_layer: feature map after batch normalization (and leaky ReLU)
        '''
        bn_layer = tf.layers.batch_normalization(inputs = input_layer,
            momentum = norm_decay, epsilon = norm_epsilon, center = True,
            scale = True, training = training, name = name)
        return tf.nn.leaky_relu(bn_layer, alpha = 0.1)

    # Convolution layer helper
    def _conv2d_layer(self, inputs, filters_num, kernel_size, name, use_bias = False, strides = 1):
        """
        Introduction
        ------------
            使用tf.layers.conv2d减少权重和偏置矩阵初始化过程,以及卷积后加上偏置项的操作
            经过卷积之后需要进行batch norm,最后使用leaky ReLU激活函数
            根据卷积时的步长,如果卷积的步长为2,则对图像进行降采样
            比如,输入图片的大小为416*416,卷积核大小为3,若stride为2时,(416 - 3 + 2)/ 2 + 1, 计算结果为208,相当于做了池化层处理
            因此需要对stride大于1的时候,先进行一个padding操作, 采用四周都padding一维代替'same'方式
        Parameters
        ----------
            inputs: 输入变量
            filters_num: 卷积核数量
            strides: 卷积步长
            name: 卷积层名字
            trainging: 是否为训练过程
            use_bias: 是否使用偏置项
            kernel_size: 卷积核大小
        Returns
        -------
            conv: 卷积之后的feature map
        """
        conv = tf.layers.conv2d(
            inputs = inputs, filters = filters_num,
            kernel_size = kernel_size, strides = [strides, strides], kernel_initializer = tf.glorot_uniform_initializer(),
            padding = ('SAME' if strides == 1 else 'VALID'), kernel_regularizer = tf.contrib.layers.l2_regularizer(scale = 5e-4), use_bias = use_bias, name = name)
        return conv

    # Residual convolution block:
    # first a 3x3 convolution whose output is saved as the shortcut,
    # then a 1x1 convolution and a 3x3 convolution, whose result is added to the shortcut as the final output
    def _Residual_block(self, inputs, filters_num, blocks_num, conv_index, training = True, norm_decay = 0.99, norm_epsilon = 1e-3):
        """
        Introduction
        ------------
            Darknet的残差block,类似resnet的两层卷积结构,分别采用1x1和3x3的卷积核,使用1x1是为了减少channel的维度
        Parameters
        ----------
            inputs: 输入变量
            filters_num: 卷积核数量
            trainging: 是否为训练过程
            blocks_num: block的数量
            conv_index: 为了方便加载预训练权重,统一命名序号
            weights_dict: 加载预训练模型的权重
            norm_decay: 在预测时计算moving average时的衰减率
            norm_epsilon: 方差加上极小的数,防止除以0的情况
        Returns
        -------
            inputs: 经过残差网络处理后的结果
        """
        # Pad the height and width dimensions of the input feature map
        inputs = tf.pad(inputs, paddings=[[0, 0], [1, 0], [1, 0], [0, 0]], mode='CONSTANT')
        layer = self._conv2d_layer(inputs, filters_num, kernel_size = 3, strides = 2, name = "conv2d_" + str(conv_index))
        layer = self._batch_normalization_layer(layer, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
        conv_index += 1
        for _ in range(blocks_num):
            shortcut = layer
            layer = self._conv2d_layer(layer, filters_num // 2, kernel_size = 1, strides = 1, name = "conv2d_" + str(conv_index))
            layer = self._batch_normalization_layer(layer, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
            conv_index += 1
            layer = self._conv2d_layer(layer, filters_num, kernel_size = 3, strides = 1, name = "conv2d_" + str(conv_index))
            layer = self._batch_normalization_layer(layer, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
            conv_index += 1
            layer += shortcut
        return layer, conv_index

    #---------------------------------------#
    #   Build darknet53
    #---------------------------------------#
    def _darknet53(self, inputs, conv_index, training = True, norm_decay = 0.99, norm_epsilon = 1e-3):
        """
        Introduction
        ------------
            构建yolo3使用的darknet53网络结构
        Parameters
        ----------
            inputs: 模型输入变量
            conv_index: 卷积层数序号,方便根据名字加载预训练权重
            weights_dict: 预训练权重
            training: 是否为训练
            norm_decay: 在预测时计算moving average时的衰减率
            norm_epsilon: 方差加上极小的数,防止除以0的情况
        Returns
        -------
            conv: 经过52层卷积计算之后的结果, 输入图片为416x416x3,则此时输出的结果shape为13x13x1024
            route1: 返回第26层卷积计算结果52x52x256, 供后续使用
            route2: 返回第43层卷积计算结果26x26x512, 供后续使用
            conv_index: 卷积层计数,方便在加载预训练模型时使用
        """
        with tf.variable_scope('darknet53'):
            # 416,416,3 -> 416,416,32
            conv = self._conv2d_layer(inputs, filters_num = 32, kernel_size = 3, strides = 1, name = "conv2d_" + str(conv_index))
            conv = self._batch_normalization_layer(conv, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
            conv_index += 1
            # 416,416,32 -> 208,208,64
            conv, conv_index = self._Residual_block(conv, conv_index = conv_index, filters_num = 64, blocks_num = 1, training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
            # 208,208,64 -> 104,104,128
            conv, conv_index = self._Residual_block(conv, conv_index = conv_index, filters_num = 128, blocks_num = 2, training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
            # 104,104,128 -> 52,52,256
            conv, conv_index = self._Residual_block(conv, conv_index = conv_index, filters_num = 256, blocks_num = 8, training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
            # route1 = 52,52,256
            route1 = conv
            # 52,52,256 -> 26,26,512
            conv, conv_index = self._Residual_block(conv, conv_index = conv_index, filters_num = 512, blocks_num = 8, training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
            # route2 = 26,26,512
            route2 = conv
            # 26,26,512 -> 13,13,1024
            conv, conv_index = self._Residual_block(conv, conv_index = conv_index,  filters_num = 1024, blocks_num = 4, training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
            # route3 = 13,13,1024
        return  route1, route2, conv, conv_index

    # Outputs two results:
    # the first is the result of 5 convolutions (1x1, 3x3, 1x1, 3x3, 1x1), used before the next upsampling;
    # the second is the result of 5+2 convolutions (1x1, 3x3, 1x1, 3x3, 1x1, 3x3, 1x1), used as a feature layer
    def _yolo_block(self, inputs, filters_num, out_filters, conv_index, training = True, norm_decay = 0.99, norm_epsilon = 1e-3):
        """
        Introduction
        ------------
            yolo3在Darknet53提取的特征层基础上,又加了针对3种不同比例的feature map的block,这样来提高对小物体的检测率
        Parameters
        ----------
            inputs: 输入特征
            filters_num: 卷积核数量
            out_filters: 最后输出层的卷积核数量
            conv_index: 卷积层数序号,方便根据名字加载预训练权重
            training: 是否为训练
            norm_decay: 在预测时计算moving average时的衰减率
            norm_epsilon: 方差加上极小的数,防止除以0的情况
        Returns
        -------
            route: 返回最后一层卷积的前一层结果
            conv: 返回最后一层卷积的结果
            conv_index: conv层计数
        """
        conv = self._conv2d_layer(inputs, filters_num = filters_num, kernel_size = 1, strides = 1, name = "conv2d_" + str(conv_index))
        conv = self._batch_normalization_layer(conv, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
        conv_index += 1
        conv = self._conv2d_layer(conv, filters_num = filters_num * 2, kernel_size = 3, strides = 1, name = "conv2d_" + str(conv_index))
        conv = self._batch_normalization_layer(conv, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
        conv_index += 1
        conv = self._conv2d_layer(conv, filters_num = filters_num, kernel_size = 1, strides = 1, name = "conv2d_" + str(conv_index))
        conv = self._batch_normalization_layer(conv, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
        conv_index += 1
        conv = self._conv2d_layer(conv, filters_num = filters_num * 2, kernel_size = 3, strides = 1, name = "conv2d_" + str(conv_index))
        conv = self._batch_normalization_layer(conv, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
        conv_index += 1
        conv = self._conv2d_layer(conv, filters_num = filters_num, kernel_size = 1, strides = 1, name = "conv2d_" + str(conv_index))
        conv = self._batch_normalization_layer(conv, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
        conv_index += 1
        route = conv
        conv = self._conv2d_layer(conv, filters_num = filters_num * 2, kernel_size = 3, strides = 1, name = "conv2d_" + str(conv_index))
        conv = self._batch_normalization_layer(conv, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
        conv_index += 1
        conv = self._conv2d_layer(conv, filters_num = out_filters, kernel_size = 1, strides = 1, name = "conv2d_" + str(conv_index), use_bias = True)
        conv_index += 1
        return route, conv, conv_index

    # Returns the contents of the three feature layers
    def yolo_inference(self, inputs, num_anchors, num_classes, training = True):
        """
        Introduction
        ------------
            构建yolo模型结构
        Parameters
        ----------
            inputs: 模型的输入变量
            num_anchors: 每个grid cell负责检测的anchor数量
            num_classes: 类别数量
            training: 是否为训练模式
        """
        conv_index = 1
        # route1 = 52,52,256, route2 = 26,26,512, route3 = 13,13,1024
        conv2d_26, conv2d_43, conv, conv_index = self._darknet53(inputs, conv_index, training = training, norm_decay = self.norm_decay, norm_epsilon = self.norm_epsilon)
        with tf.variable_scope('yolo'):
            #--------------------------------------#
            #   Obtain the first feature layer
            #--------------------------------------#
            # conv2d_57 = 13,13,512,conv2d_59 = 13,13,255(3x(80+5))
            conv2d_57, conv2d_59, conv_index = self._yolo_block(conv, 512, num_anchors * (num_classes + 5), conv_index = conv_index, training = training, norm_decay = self.norm_decay, norm_epsilon = self.norm_epsilon)

            #--------------------------------------#
            #   Obtain the second feature layer
            #--------------------------------------#
            conv2d_60 = self._conv2d_layer(conv2d_57, filters_num = 256, kernel_size = 1, strides = 1, name = "conv2d_" + str(conv_index))
            conv2d_60 = self._batch_normalization_layer(conv2d_60, name = "batch_normalization_" + str(conv_index),training = training, norm_decay = self.norm_decay, norm_epsilon = self.norm_epsilon)
            conv_index += 1
            # unSample_0 = 26,26,256
            unSample_0 = tf.image.resize_nearest_neighbor(conv2d_60, [2 * tf.shape(conv2d_60)[1], 2 * tf.shape(conv2d_60)[1]], name='upSample_0')
            # route0 = 26,26,768
            route0 = tf.concat([unSample_0, conv2d_43], axis = -1, name = 'route_0')
            # conv2d_65 = 52,52,256,conv2d_67 = 26,26,255
            conv2d_65, conv2d_67, conv_index = self._yolo_block(route0, 256, num_anchors * (num_classes + 5), conv_index = conv_index, training = training, norm_decay = self.norm_decay, norm_epsilon = self.norm_epsilon)

            #--------------------------------------#
            #   Obtain the third feature layer
            #--------------------------------------# 
            conv2d_68 = self._conv2d_layer(conv2d_65, filters_num = 128, kernel_size = 1, strides = 1, name = "conv2d_" + str(conv_index))
            conv2d_68 = self._batch_normalization_layer(conv2d_68, name = "batch_normalization_" + str(conv_index), training=training, norm_decay=self.norm_decay, norm_epsilon = self.norm_epsilon)
            conv_index += 1
            # unSample_1 = 52,52,128
            unSample_1 = tf.image.resize_nearest_neighbor(conv2d_68, [2 * tf.shape(conv2d_68)[1], 2 * tf.shape(conv2d_68)[1]], name='upSample_1')
            # route1= 52,52,384
            route1 = tf.concat([unSample_1, conv2d_26], axis = -1, name = 'route_1')
            # conv2d_75 = 52,52,255
            _, conv2d_75, _ = self._yolo_block(route1, 128, num_anchors * (num_classes + 5), conv_index = conv_index, training = training, norm_decay = self.norm_decay, norm_epsilon = self.norm_epsilon)

        return [conv2d_59, conv2d_67, conv2d_75]


5.4.2 Configuration file

num_parallel_calls = 4
input_shape = 416
max_boxes = 20
jitter = 0.3
hue = 0.1
sat = 1.0
cont = 0.8
bri = 0.1
norm_decay = 0.99
norm_epsilon = 1e-3
pre_train = True
num_anchors = 9
num_classes = 80
training = True
ignore_thresh = .5
learning_rate = 0.001
train_batch_size = 10
val_batch_size = 10
train_num = 2800
val_num = 5000
Epoch = 50
obj_threshold = 0.5
nms_threshold = 0.5
gpu_index = "0"
log_dir = './logs'
data_dir = './model_data'
model_dir = './test_model/model.ckpt-192192'
pre_train_yolo3 = True
yolo3_weights_path = './model_data/yolov3.weights'
darknet53_weights_path = './model_data/darknet53.weights'
anchors_path = './model_data/yolo_anchors.txt'
classes_path = './model_data/coco_classes.txt'

image_file = "./img/img.jpg"

5.4.3 detect file

import os
import config
import argparse
import numpy as np
import tensorflow as tf
from yolo_predict import yolo_predictor
from PIL import Image, ImageFont, ImageDraw
from utils import letterbox_image, load_weights

# Specify the index of the GPU to use
os.environ["CUDA_VISIBLE_DEVICES"] = config.gpu_index

def detect(image_path, model_path, yolo_weights = None):
    """
    Introduction
    ------------
        加载模型,进行预测
    Parameters
    ----------
        model_path: 模型路径,当使用yolo_weights无用
        image_path: 图片路径
    """
    #---------------------------------------#
    #   Image preprocessing
    #---------------------------------------#
    image = Image.open(image_path)
    # Resize the input image while keeping its aspect ratio, padding the remaining area
    resize_image = letterbox_image(image, (416, 416))
    image_data = np.array(resize_image, dtype = np.float32)
    # Normalize
    image_data /= 255.
    # Add a batch dimension
    image_data = np.expand_dims(image_data, axis = 0)
    #---------------------------------------#
    #   Image input
    #---------------------------------------#
    # input_image_shape is the size of the original image
    input_image_shape = tf.placeholder(dtype = tf.int32, shape = (2,))
    # the image itself
    input_image = tf.placeholder(shape = [None, 416, 416, 3], dtype = tf.float32)

    # Create the yolo_predictor object used for prediction
    predictor = yolo_predictor(config.obj_threshold, config.nms_threshold, config.classes_path, config.anchors_path)
    with tf.Session() as sess:
        #---------------------------------------#
        #   Image prediction
        #---------------------------------------#
        if yolo_weights is not None:
            with tf.variable_scope('predict'):
                boxes, scores, classes = predictor.predict(input_image, input_image_shape)
            # Load the model weights
            load_op = load_weights(tf.global_variables(scope = 'predict'), weights_file = yolo_weights)
            sess.run(load_op)

            # Run the prediction
            out_boxes, out_scores, out_classes = sess.run(
            [boxes, scores, classes],
            feed_dict={
                # image_data has already been letterbox-resized
                input_image: image_data,
                # passed in (height, width) order
                input_image_shape: [image.size[1], image.size[0]]
            })
        else:
            boxes, scores, classes = predictor.predict(input_image, input_image_shape)
            saver = tf.train.Saver()
            saver.restore(sess, model_path)
            out_boxes, out_scores, out_classes = sess.run(
            [boxes, scores, classes],
            feed_dict={
                input_image: image_data,
                input_image_shape: [image.size[1], image.size[0]]
            })

        #---------------------------------------#
        #   Draw boxes
        #---------------------------------------#
        # Print how many boxes were found
        print('Found {} boxes for {}'.format(len(out_boxes), 'img'))
        font = ImageFont.truetype(font = 'font/FiraMono-Medium.otf', size = np.floor(3e-2 * image.size[1] + 0.5).astype('int32'))
        
        # line thickness of the box outline
        thickness = (image.size[0] + image.size[1]) // 300

        for i, c in reversed(list(enumerate(out_classes))):
            # Get the predicted class name, box and score
            predicted_class = predictor.class_names[c]
            box = out_boxes[i]
            score = out_scores[i]

            # Label to print on the box
            label = '{} {:.2f}'.format(predicted_class, score)

            # Used to draw the box and the label text
            draw = ImageDraw.Draw(image)
            # textsize gives the size of the rendered label for this font
            label_size = draw.textsize(label, font)

            # Get the four edges of the box
            top, left, bottom, right = box
            top = max(0, np.floor(top + 0.5).astype('int32'))
            left = max(0, np.floor(left + 0.5).astype('int32'))
            bottom = min(image.size[1]-1, np.floor(bottom + 0.5).astype('int32'))
            right = min(image.size[0]-1, np.floor(right + 0.5).astype('int32'))
            print(label, (left, top), (right, bottom))
            print(label_size)
            
            if top - label_size[1] >= 0:
                text_origin = np.array([left, top - label_size[1]])
            else:
                text_origin = np.array([left, top + 1])

            # My kingdom for a good redistributable image drawing library.
            for i in range(thickness):
                draw.rectangle(
                    [left + i, top + i, right - i, bottom - i],
                    outline = predictor.colors[c])
            draw.rectangle(
                [tuple(text_origin), tuple(text_origin + label_size)],
                fill = predictor.colors[c])
            draw.text(text_origin, label, fill=(0, 0, 0), font=font)
            del draw
        image.show()
        image.save('./img/result1.jpg')

if __name__ == '__main__':

    # When using the pretrained yolo3 weights
    if config.pre_train_yolo3 == True:
        detect(config.image_file, config.model_dir, config.yolo3_weights_path)

    # When using a trained checkpoint
    else:
        detect(config.image_file, config.model_dir)

5.4.4 gen_anchors.py

import numpy as np
import matplotlib.pyplot as plt
from pycocotools.coco import COCO  # COCO annotation API used by load_cocoDataset below

def convert_coco_bbox(size, box):
    """
    Introduction
    ------------
        计算box的长宽和原始图像的长宽比值
    Parameters
    ----------
        size: 原始图像大小
        box: 标注box的信息
    Returns
        x, y, w, h 标注box和原始图像的比值
    """
    dw = 1. / size[0]
    dh = 1. / size[1]
    x = (box[0] + box[2]) / 2.0 - 1
    y = (box[1] + box[3]) / 2.0 - 1
    w = box[2]
    h = box[3]
    x = x * dw
    w = w * dw
    y = y * dh
    h = h * dh
    return x, y, w, h


def box_iou(boxes, clusters):
    """
    Introduction
    ------------
        计算每个box和聚类中心的距离值
    Parameters
    ----------
        boxes: 所有的box数据
        clusters: 聚类中心
    """
    box_num = boxes.shape[0]
    cluster_num = clusters.shape[0]
    box_area = boxes[:, 0] * boxes[:, 1]
    # Repeat each box area once per cluster center (9 cluster centers)
    box_area = box_area.repeat(cluster_num)
    box_area = np.reshape(box_area, [box_num, cluster_num])

    cluster_area = clusters[:, 0] * clusters[:, 1]
    cluster_area = np.tile(cluster_area, [1, box_num])
    cluster_area = np.reshape(cluster_area, [box_num, cluster_num])

    # Compute the IoU of the two rectangles, assuming every rectangle has its top-left corner at the origin; the overlap area is then just the product of the minimum width and minimum height
    boxes_width = np.reshape(boxes[:, 0].repeat(cluster_num), [box_num, cluster_num])
    clusters_width = np.reshape(np.tile(clusters[:, 0], [1, box_num]), [box_num, cluster_num])
    min_width = np.minimum(clusters_width, boxes_width)

    boxes_high = np.reshape(boxes[:, 1].repeat(cluster_num), [box_num, cluster_num])
    clusters_high = np.reshape(np.tile(clusters[:, 1], [1, box_num]), [box_num, cluster_num])
    min_high = np.minimum(clusters_high, boxes_high)

    iou = np.multiply(min_high, min_width) / (box_area + cluster_area - np.multiply(min_high, min_width))
    return iou


def avg_iou(boxes, clusters):
    """
    Introduction
    ------------
        计算所有box和聚类中心的最大iou均值作为准确率
    Parameters
    ----------
        boxes: 所有的box
        clusters: 聚类中心
    Returns
    -------
        accuracy: 准确率
    """
    return np.mean(np.max(box_iou(boxes, clusters), axis =1))


def Kmeans(boxes, cluster_num, iteration_cutoff = 25, function = np.median):
    """
    Introduction
    ------------
        根据所有box的长宽进行Kmeans聚类
    Parameters
    ----------
        boxes: 所有的box的长宽
        cluster_num: 聚类的数量
        iteration_cutoff: 当准确率不再降低多少轮停止迭代
        function: 聚类中心更新的方式
    Returns
    -------
        clusters: 聚类中心box的大小
    """
    boxes_num = boxes.shape[0]
    best_average_iou = 0
    best_avg_iou_iteration = 0
    best_clusters = []
    anchors = []
    np.random.seed()
    # Randomly pick boxes as the initial cluster centers
    clusters = boxes[np.random.choice(boxes_num, cluster_num, replace = False)]
    count = 0
    while True:
        distances = 1. - box_iou(boxes, clusters)
        boxes_iou = np.min(distances, axis=1)
        # Find which cluster center each box is closest to
        current_box_cluster = np.argmin(distances, axis=1)
        average_iou = np.mean(1. - boxes_iou)
        if average_iou > best_average_iou:
            best_average_iou = average_iou
            best_clusters = clusters
            best_avg_iou_iteration = count
        # Update the cluster centers using the given function
        for cluster in range(cluster_num):
            clusters[cluster] = function(boxes[current_box_cluster == cluster], axis=0)
        if count >= best_avg_iou_iteration + iteration_cutoff:
            break
        print("Sum of all distances (cost) = {}".format(np.sum(boxes_iou)))
        print("iter: {} Accuracy: {:.2f}%".format(count, avg_iou(boxes, clusters) * 100))
        count += 1
    for cluster in best_clusters:
        anchors.append([round(cluster[0] * 416), round(cluster[1] * 416)])
    return anchors, best_average_iou



def load_cocoDataset(annfile):
    """
    Introduction
    ------------
        读取coco数据集的标注信息
    Parameters
    ----------
        datasets: 数据集名字列表
    """
    data = []
    coco = COCO(annfile)
    cats = coco.loadCats(coco.getCatIds())
    coco.loadImgs()
    base_classes = {cat['id'] : cat['name'] for cat in cats}
    imgId_catIds = [coco.getImgIds(catIds = cat_ids) for cat_ids in base_classes.keys()]
    image_ids = [img_id for img_cat_id in imgId_catIds for img_id in img_cat_id ]
    for image_id in image_ids:
        annIds = coco.getAnnIds(imgIds = image_id)
        anns = coco.loadAnns(annIds)
        img = coco.loadImgs(image_id)[0]
        image_width = img['width']
        image_height = img['height']

        for ann in anns:
            box = ann['bbox']
            bb = convert_coco_bbox((image_width, image_height), box)
            data.append(bb[2:])
    return np.array(data)


def process(dataFile, cluster_num, iteration_cutoff = 25, function = np.median):
    """
    Introduction
    ------------
        主处理函数
    Parameters
    ----------
        dataFile: 数据集的标注文件
        cluster_num: 聚类中心数目
        iteration_cutoff: 当准确率不再降低多少轮停止迭代
        function: 聚类中心更新的方式
    """
    last_best_iou = 0
    last_anchors = []
    boxes = load_cocoDataset(dataFile)
    box_w = boxes[:1000, 0]
    box_h = boxes[:1000, 1]
    plt.scatter(box_h, box_w, c = 'r')
    anchors, _ = Kmeans(boxes, cluster_num, iteration_cutoff, function)
    anchors = np.array(anchors)
    plt.scatter(anchors[:, 0], anchors[:, 1], c = 'b')
    plt.show()
    for _ in range(100):
        anchors, best_iou = Kmeans(boxes, cluster_num, iteration_cutoff, function)
        if best_iou > last_best_iou:
            last_anchors = anchors
            last_best_iou = best_iou
            print("anchors: {}, avg iou: {}".format(last_anchors, last_best_iou))
    print("final anchors: {}, avg iou: {}".format(last_anchors, last_best_iou))



if __name__ == '__main__':
    process('./annotations/instances_train2014.json', 9)

5.4.5 utils.py

import json
import numpy as np
import tensorflow as tf
from PIL import Image
from collections import defaultdict

def load_weights(var_list, weights_file):
    """
    Introduction
    ------------
        加载预训练好的darknet53权重文件
    Parameters
    ----------
        var_list: 赋值变量名
        weights_file: 权重文件
    Returns
    -------
        assign_ops: 赋值更新操作
    """
    with open(weights_file, "rb") as fp:
        _ = np.fromfile(fp, dtype=np.int32, count=5)

        weights = np.fromfile(fp, dtype=np.float32)

    ptr = 0
    i = 0
    assign_ops = []
    while i < len(var_list) - 1:
        var1 = var_list[i]
        var2 = var_list[i + 1]
        # do something only if we process conv layer
        if 'conv2d' in var1.name.split('/')[-2]:
            # check type of next layer
            if 'batch_normalization' in var2.name.split('/')[-2]:
                # load batch norm params
                gamma, beta, mean, var = var_list[i + 1:i + 5]
                batch_norm_vars = [beta, gamma, mean, var]
                for var in batch_norm_vars:
                    shape = var.shape.as_list()
                    num_params = np.prod(shape)
                    var_weights = weights[ptr:ptr + num_params].reshape(shape)
                    ptr += num_params
                    assign_ops.append(tf.assign(var, var_weights, validate_shape=True))

                # we move the pointer by 4, because we loaded 4 variables
                i += 4
            elif 'conv2d' in var2.name.split('/')[-2]:
                # load biases
                bias = var2
                bias_shape = bias.shape.as_list()
                bias_params = np.prod(bias_shape)
                bias_weights = weights[ptr:ptr + bias_params].reshape(bias_shape)
                ptr += bias_params
                assign_ops.append(tf.assign(bias, bias_weights, validate_shape=True))

                # we loaded 1 variable
                i += 1
            # we can load weights of conv layer
            shape = var1.shape.as_list()
            num_params = np.prod(shape)

            var_weights = weights[ptr:ptr + num_params].reshape((shape[3], shape[2], shape[0], shape[1]))
            # remember to transpose to column-major
            var_weights = np.transpose(var_weights, (2, 3, 1, 0))
            ptr += num_params
            assign_ops.append(tf.assign(var1, var_weights, validate_shape=True))
            i += 1

    return assign_ops


def letterbox_image(image, size):
    """
    Introduction
    ------------
        对预测输入图像进行缩放,按照长宽比进行缩放,不足的地方进行填充
    Parameters
    ----------
        image: 输入图像
        size: 图像大小
    Returns
    -------
        boxed_image: 缩放后的图像
    """
    image_w, image_h = image.size
    w, h = size
    new_w = int(image_w * min(w*1.0/image_w, h*1.0/image_h))
    new_h = int(image_h * min(w*1.0/image_w, h*1.0/image_h))
    resized_image = image.resize((new_w,new_h), Image.BICUBIC)

    boxed_image = Image.new('RGB', size, (128, 128, 128))
    boxed_image.paste(resized_image, ((w-new_w)//2,(h-new_h)//2))
    return boxed_image


def draw_box(image, bbox):
    """
    Introduction
    ------------
        通过tensorboard把训练数据可视化
    Parameters
    ----------
        image: 训练数据图片
        bbox: 训练数据图片中标记box坐标
    """
    xmin, ymin, xmax, ymax, label = tf.split(value = bbox, num_or_size_splits = 5, axis=2)
    height = tf.cast(tf.shape(image)[1], tf.float32)
    weight = tf.cast(tf.shape(image)[2], tf.float32)
    new_bbox = tf.concat([tf.cast(ymin, tf.float32) / height, tf.cast(xmin, tf.float32) / weight, tf.cast(ymax, tf.float32) / height, tf.cast(xmax, tf.float32) / weight], 2)
    new_image = tf.image.draw_bounding_boxes(image, new_bbox)
    tf.summary.image('input', new_image)


def voc_ap(rec, prec):
    """
    --- Official matlab code VOC2012---
    mrec=[0 ; rec ; 1];
    mpre=[0 ; prec ; 0];
    for i=numel(mpre)-1:-1:1
        mpre(i)=max(mpre(i),mpre(i+1));
    end
    i=find(mrec(2:end)~=mrec(1:end-1))+1;
    ap=sum((mrec(i)-mrec(i-1)).*mpre(i));
    """
    rec.insert(0, 0.0)  # insert 0.0 at beginning of list
    rec.append(1.0)  # insert 1.0 at end of list
    mrec = rec[:]
    prec.insert(0, 0.0)  # insert 0.0 at beginning of list
    prec.append(0.0)  # insert 0.0 at end of list
    mpre = prec[:]
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])

    i_list = []
    for i in range(1, len(mrec)):
        if mrec[i] != mrec[i - 1]:
            i_list.append(i)
    ap = 0.0
    for i in i_list:
        ap += ((mrec[i] - mrec[i - 1]) * mpre[i])
    return ap, mrec, mpre
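
A quick sanity check of voc_ap with made-up recall/precision values (hypothetical numbers, only to illustrate the calling convention; note the function mutates the lists it receives):

rec = [0.25, 0.5, 0.75]     # recall values, ascending
prec = [1.0, 0.67, 0.6]     # corresponding precision values
ap, mrec, mpre = voc_ap(rec, prec)
print(ap)                   # roughly 0.57 for these values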

5.4.6 Prediction script

import os
import config
import random
import colorsys
import numpy as np
import tensorflow as tf
from model.yolo3_model import yolo


class yolo_predictor:
    def __init__(self, obj_threshold, nms_threshold, classes_file, anchors_file):
        """
        Introduction
        ------------
            初始化函数
        Parameters
        ----------
            obj_threshold: 目标检测为物体的阈值
            nms_threshold: nms阈值
        """
        self.obj_threshold = obj_threshold
        self.nms_threshold = nms_threshold
        # pre-load paths
        self.classes_path = classes_file
        self.anchors_path = anchors_file
        # read class names
        self.class_names = self._get_class()
        # read anchor boxes
        self.anchors = self._get_anchors()

        # colors used for drawing boxes
        hsv_tuples = [(x / len(self.class_names), 1., 1.)for x in range(len(self.class_names))]

        self.colors = list(map(lambda x: colorsys.hsv_to_rgb(*x), hsv_tuples))
        self.colors = list(map(lambda x: (int(x[0] * 255), int(x[1] * 255), int(x[2] * 255)), self.colors))
        random.seed(10101)
        random.shuffle(self.colors)
        random.seed(None)

    def _get_class(self):
        """
        Introduction
        ------------
            读取类别名称
        """
        classes_path = os.path.expanduser(self.classes_path)
        with open(classes_path) as f:
            class_names = f.readlines()
        class_names = [c.strip() for c in class_names]
        return class_names

    def _get_anchors(self):
        """
        Introduction
        ------------
            读取anchors数据
        """
        anchors_path = os.path.expanduser(self.anchors_path)
        with open(anchors_path) as f:
            anchors = f.readline()
            anchors = [float(x) for x in anchors.split(',')]
            anchors = np.array(anchors).reshape(-1, 2)
        return anchors
    
    #---------------------------------------#
    #   Decode the three feature layers,
    #   then sort and apply non-maximum suppression
    #---------------------------------------#
    def boxes_and_scores(self, feats, anchors, classes_num, input_shape, image_shape):
        """
        Introduction
        ------------
            将预测出的box坐标转换为对应原图的坐标,然后计算每个box的分数
        Parameters
        ----------
            feats: yolo输出的feature map
            anchors: anchor的位置
            class_num: 类别数目
            input_shape: 输入大小
            image_shape: 图片大小
        Returns
        -------
            boxes: 物体框的位置
            boxes_scores: 物体框的分数,为置信度和类别概率的乘积
        """
        # Get the decoded features
        box_xy, box_wh, box_confidence, box_class_probs = self._get_feats(feats, anchors, classes_num, input_shape)
        # Map the boxes back onto the original image
        boxes = self.correct_boxes(box_xy, box_wh, input_shape, image_shape)
        boxes = tf.reshape(boxes, [-1, 4])
        # Score = box_confidence * box_class_probs
        box_scores = box_confidence * box_class_probs
        box_scores = tf.reshape(box_scores, [-1, classes_num])
        return boxes, box_scores

    # Get the box positions on the original image
    def correct_boxes(self, box_xy, box_wh, input_shape, image_shape):
        """
        Introduction
        ------------
            计算物体框预测坐标在原图中的位置坐标
        Parameters
        ----------
            box_xy: 物体框左上角坐标
            box_wh: 物体框的宽高
            input_shape: 输入的大小
            image_shape: 图片的大小
        Returns
        -------
            boxes: 物体框的位置
        """
        box_yx = box_xy[..., ::-1]
        box_hw = box_wh[..., ::-1]
        # 416,416
        input_shape = tf.cast(input_shape, dtype = tf.float32)
        # actual image size
        image_shape = tf.cast(image_shape, dtype = tf.float32)

        new_shape = tf.round(image_shape * tf.reduce_min(input_shape / image_shape))

        offset = (input_shape - new_shape) / 2. / input_shape
        scale = input_shape / new_shape
        box_yx = (box_yx - offset) * scale
        box_hw *= scale

        box_mins = box_yx - (box_hw / 2.)
        box_maxes = box_yx + (box_hw / 2.)
        boxes = tf.concat([
            box_mins[..., 0:1],
            box_mins[..., 1:2],
            box_maxes[..., 0:1],
            box_maxes[..., 1:2]
        ], axis = -1)
        boxes *= tf.concat([image_shape, image_shape], axis = -1)
        return boxes

    # This is essentially the decoding step
    def _get_feats(self, feats, anchors, num_classes, input_shape):
        """
        Introduction
        ------------
            根据yolo最后一层的输出确定bounding box
        Parameters
        ----------
            feats: yolo模型最后一层输出
            anchors: anchors的位置
            num_classes: 类别数量
            input_shape: 输入大小
        Returns
        -------
            box_xy, box_wh, box_confidence, box_class_probs
        """
        num_anchors = len(anchors)
        anchors_tensor = tf.reshape(tf.constant(anchors, dtype=tf.float32), [1, 1, 1, num_anchors, 2])
        grid_size = tf.shape(feats)[1:3]
        predictions = tf.reshape(feats, [-1, grid_size[0], grid_size[1], num_anchors, num_classes + 5])

        # Build a grid_size x grid_size x 1 x 2 tensor holding the coordinates of each grid cell
        grid_y = tf.tile(tf.reshape(tf.range(grid_size[0]), [-1, 1, 1, 1]), [1, grid_size[1], 1, 1])
        grid_x = tf.tile(tf.reshape(tf.range(grid_size[1]), [1, -1, 1, 1]), [grid_size[0], 1, 1, 1])
        grid = tf.concat([grid_x, grid_y], axis = -1)
        grid = tf.cast(grid, tf.float32)

        # Normalize x, y: position relative to the grid
        box_xy = (tf.sigmoid(predictions[..., :2]) + grid) / tf.cast(grid_size[::-1], tf.float32)
        # Also normalize w, h
        box_wh = tf.exp(predictions[..., 2:4]) * anchors_tensor / tf.cast(input_shape[::-1], tf.float32)
        box_confidence = tf.sigmoid(predictions[..., 4:5])
        box_class_probs = tf.sigmoid(predictions[..., 5:])
        return box_xy, box_wh, box_confidence, box_class_probs
        
    def eval(self, yolo_outputs, image_shape, max_boxes = 20):
        """
        Introduction
        ------------
            根据Yolo模型的输出进行非极大值抑制,获取最后的物体检测框和物体检测类别
        Parameters
        ----------
            yolo_outputs: yolo模型输出
            image_shape: 图片的大小
            max_boxes:  最大box数量
        Returns
        -------
            boxes_: 物体框的位置
            scores_: 物体类别的概率
            classes_: 物体类别
        """
        # Each feature layer corresponds to three anchor boxes
        anchor_mask = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
        boxes = []
        box_scores = []
        # input_shape is 416x416
        # image_shape is the actual image size
        input_shape = tf.shape(yolo_outputs[0])[1 : 3] * 32
        # For each of the three feature layers, get every predicted box and its score, score = confidence x class probability
        #---------------------------------------#
        #   Decode the three feature layers
        #   to get the scores and box positions
        #---------------------------------------#
        for i in range(len(yolo_outputs)):
            _boxes, _box_scores = self.boxes_and_scores(yolo_outputs[i], self.anchors[anchor_mask[i]], len(self.class_names), input_shape, image_shape)
            boxes.append(_boxes)
            box_scores.append(_box_scores)
        # Flatten into one list for easier handling
        boxes = tf.concat(boxes, axis = 0)
        box_scores = tf.concat(box_scores, axis = 0)

        mask = box_scores >= self.obj_threshold
        max_boxes_tensor = tf.constant(max_boxes, dtype = tf.int32)
        boxes_ = []
        scores_ = []
        classes_ = []

        #---------------------------------------#
        #   1. For every class, keep the boxes and
        #      scores above self.obj_threshold
        #   2. Apply non-maximum suppression to the scores
        #---------------------------------------#
        # Process each class
        for c in range(len(self.class_names)):
            # Take all boxes of class c
            class_boxes = tf.boolean_mask(boxes, mask[:, c])
            # Take all scores of class c
            class_box_scores = tf.boolean_mask(box_scores[:, c], mask[:, c])
            # Non-maximum suppression
            nms_index = tf.image.non_max_suppression(class_boxes, class_box_scores, max_boxes_tensor, iou_threshold = self.nms_threshold)

            # Get the NMS results
            class_boxes = tf.gather(class_boxes, nms_index)
            class_box_scores = tf.gather(class_box_scores, nms_index)
            classes = tf.ones_like(class_box_scores, 'int32') * c

            boxes_.append(class_boxes)
            scores_.append(class_box_scores)
            classes_.append(classes)
        boxes_ = tf.concat(boxes_, axis = 0)
        scores_ = tf.concat(scores_, axis = 0)
        classes_ = tf.concat(classes_, axis = 0)
        return boxes_, scores_, classes_


 
    #---------------------------------------#
    #   predict runs prediction in three steps:
    #   1. build the yolo object
    #   2. get the prediction result
    #   3. post-process the prediction result
    #---------------------------------------#
    def predict(self, inputs, image_shape):
        """
        Introduction
        ------------
            构建预测模型
        Parameters
        ----------
            inputs: 处理之后的输入图片
            image_shape: 图像原始大小
        Returns
        -------
            boxes: 物体框坐标
            scores: 物体概率值
            classes: 物体类别
        """
        model = yolo(config.norm_epsilon, config.norm_decay, self.anchors_path, self.classes_path, pre_train = False)
        # yolo_inference produces the raw network predictions
        output = model.yolo_inference(inputs, config.num_anchors // 3, config.num_classes, training = False)
        boxes, scores, classes = self.eval(output, image_shape, max_boxes = 20)
        return boxes, scores, classes
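
For reference, the decoding performed in _get_feats above corresponds to the usual YOLOv3 equations (t are the raw network outputs, c the grid-cell offsets, p the anchor sizes; this is only a summary of the code, not an additional step):

$$
b_x = \frac{\sigma(t_x) + c_x}{W_{grid}}, \quad
b_y = \frac{\sigma(t_y) + c_y}{H_{grid}}, \quad
b_w = \frac{p_w\, e^{t_w}}{W_{input}}, \quad
b_h = \frac{p_h\, e^{t_h}}{H_{input}}
$$

with objectness $\sigma(t_o)$ and per-class probabilities $\sigma(t_c)$, so the box score used for thresholding and NMS in boxes_and_scores is $\sigma(t_o)\,\sigma(t_c)$.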

6. Extension: SSD

insert image description here

SSD is also a multi-feature-layer network, with 11 blocks in total; its first half is a VGG16 backbone:

  1. First, stacks of 3x3 convolutional layers and 5 stride-2 max-pooling operations extract features, forming 5 blocks. The fourth block is used to detect small objects: after many convolutions the features of large objects are still well preserved, while small-object features fade away, so small objects must be detected from a relatively early layer.
  2. Apply a dilated convolution (kernel expansion).
  3. Take the features of the seventh block (Block7).
  4. Use a 1x1 convolution followed by a 3x3 convolution with stride 2 to extract features while reducing the spatial size, obtaining the features of the eighth block (Block8).
  5. Repeat step 4 to obtain the features of blocks 9, 10 and 11; a sketch of these extra layers follows this list.
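
A minimal sketch of the extra feature layers from steps 4-5, written in the same TensorFlow 1.x style as the YOLOv3 code above (the layer names, filter counts and use of tf.layers here are assumptions for illustration, not the reference SSD implementation):

import tensorflow as tf

def ssd_extra_layers(block7):
    # block7: feature map taken after the dilated convolution (step 3)
    feats = {'block7': block7}
    net = block7
    # Steps 4-5: alternate a 1x1 convolution and a stride-2 3x3 convolution to halve
    # the spatial size each time, collecting one feature map per block (Block8-Block11)
    for i, filters in enumerate([256, 128, 128, 128], start = 8):
        net = tf.layers.conv2d(net, filters, 1, strides = 1, padding = 'same',
                               activation = tf.nn.relu, name = 'block%d_conv1x1' % i)
        net = tf.layers.conv2d(net, filters * 2, 3, strides = 2, padding = 'same',
                               activation = tf.nn.relu, name = 'block%d_conv3x3' % i)
        feats['block%d' % i] = net
    return feats

Each of these feature maps, together with the earlier VGG16 block and Block7, would then feed the SSD classification and localization heads.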
