1. Object detection
1.1 Overview and terminology of object detection
- Object recognition distinguishes what objects appear in an image: the input is an image, the output is a category label with a probability. An object detection algorithm must not only determine what objects are in the image, but also output each object's bounding box (x, y, width, height) to locate it.
- Object detection means accurately finding the locations of objects in a given image and labeling their categories.
- In short, object detection answers both where the object is and what it is.
- This problem is not easy to solve: object sizes vary widely, the angle and pose of an object are uncertain, an object can appear anywhere in the image, and there can be many categories.
Current object detection algorithms in academia and industry fall into three categories:
- Traditional object detection algorithms: Cascade + HOG/DPM + Haar/SVM, plus many improvements and optimizations of these methods;
- Region proposals + deep learning classification: extract candidate regions, then classify each region with a deep learning method, e.g.:
- R-CNN(Selective Search + CNN + SVM)
- SPP-net(ROI Pooling)
- Fast R-CNN(Selective Search + CNN + ROI)
- Faster R-CNN(RPN + CNN + ROI)
- Regression-based deep learning methods: YOLO, SSD, and similar methods
1.2 IOU
Intersection over Union (IoU) is a standard metric for measuring how accurately an object is detected on a given dataset.
IoU is a simple metric: any task whose output is a predicted range (bounding boxes) can be evaluated with IoU.
In order for IoU to be used to measure object detection of arbitrary size and shape, we need:
- ground-truth bounding boxes (the manually labeled extent of the object to be detected in the training-set images);
- The range of results produced by our algorithm.
In other words, this criterion measures the overlap between the ground truth and the prediction: the higher the overlap, the higher the value.
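A minimal IoU computation for two axis-aligned boxes in (x1, y1, x2, y2) form (the helper name `iou` is illustrative):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Intersection area is zero when the boxes do not overlap
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes give 1.0, disjoint boxes give 0.0, and partial overlap falls in between.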
1.3 TP TN FP FN
The four abbreviations TP, TN, FP, FN are built from the letters T, F, P, N:
T is True;
F is False;
P is Positive;
N is Negative.
T or F represents whether the sample is correctly classified.
P or N represents whether the sample was originally a positive sample or a negative sample.
TP (True Positives) means a sample classified as positive, and the classification is correct.
TN (True Negatives) means a sample classified as negative, and the classification is correct.
FP (False Positives) means a sample classified as positive, but the classification is wrong (the sample is in fact negative).
FN (False Negatives) means a sample classified as negative, but the classification is wrong (the sample is in fact positive).
The mAP calculation mainly uses the three concepts TP, FP, and FN.
1.4 Precision and recall
TP are the samples the classifier labels positive that really are positive; FP are the samples the classifier labels positive that are actually negative. Precision is the fraction of everything the classifier labeled positive that really is positive: TP / (TP + FP).
FN are the samples the classifier labels negative that are actually positive. Recall is the fraction of all truly positive samples that the classifier found: TP / (TP + FN).
Precision measures how many of the found results are right; recall measures how many of the right results were found.
- The blue box is the ground-truth box. The green and red boxes are prediction boxes: green boxes are positive samples, red boxes are negative samples.
- Generally, a prediction box with IoU >= 0.5 against the ground-truth box is considered a positive sample.
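Precision and recall follow directly from the counts (the helper name is illustrative):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP): of everything predicted positive, how much is right.
    Recall    = TP / (TP + FN): of everything truly positive, how much was found."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall
```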
2. Bounding-Box regression
What is bounding box regression?
- The window is generally represented by a four-dimensional vector (x, y, w, h), which respectively represent the coordinates of the center point and the width and height of the window.
- The red box P represents the original Proposal;
- The green box G represents the Ground Truth of the target;
Our goal is to find a mapping such that the original proposal window P is transformed into a regressed window Ĝ that is closer to the real window G.
Therefore, the goal of bounding-box regression is:
given (Px, Py, Pw, Ph), find a mapping f such that f(Px, Py, Pw, Ph) = (Ĝx, Ĝy, Ĝw, Ĝh), with (Ĝx, Ĝy, Ĝw, Ĝh) ≈ (Gx, Gy, Gw, Gh)
How is bounding-box regression done?
The simple idea is: translation + scaling
Input:
P=(Px,Py,Pw,Ph)
(Note: The training phase input also includes Ground Truth)
Output:
The required translation and scaling: dx, dy, dw, dh (or Δx, Δy, Sw, Sh).
With these four transformations, the proposal can be mapped to a box that approximates the Ground Truth.
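The four transformations can be sketched as follows: the translation is scaled by the proposal's size, and the scaling is predicted in log space, as in the standard R-CNN formulation (the function name is illustrative):

```python
import math

def apply_box_regression(px, py, pw, ph, dx, dy, dw, dh):
    """Map a proposal (center x, center y, width, height) to the predicted box
    using the standard R-CNN transforms: translation for the center,
    log-space scaling for the size."""
    gx = pw * dx + px          # shift the center, scaled by the box size
    gy = ph * dy + py
    gw = pw * math.exp(dw)     # exp keeps the predicted width/height positive
    gh = ph * math.exp(dh)
    return gx, gy, gw, gh
```

With all-zero transforms the proposal is returned unchanged; positive dx/dy shift the center right/down, and dw/dh grow or shrink the box.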
3. Faster R-CNN
Faster RCNN can be divided into 4 main contents:
- Conv layers : As a CNN network target detection method, Faster RCNN first uses a set of basic conv+relu+pooling layers to extract the feature maps of the image. The feature maps are shared for subsequent RPN layers and fully connected layers.
- Region Proposal Networks (RPN) : The RPN generates region proposals. It judges whether anchors are positive or negative via softmax, then uses bounding-box regression to refine the anchors into accurate proposals.
- Roi Pooling : This layer collects the input feature maps and proposals, extracts the proposal feature maps after integrating these information, and sends them to the subsequent fully connected layer to determine the target category.
- Classification : Uses the proposal feature maps to compute each proposal's category, and at the same time runs bounding-box regression again to obtain the final precise position of each detection box.
3.1 Faster-RCNN:conv layer
The Conv layers include three kinds of layers: conv, pooling, and relu. There are 13 conv layers, 13 relu layers, and 4 pooling layers.
In Conv layers:
- All conv layers are: kernel_size=3, pad=1, stride=1
- All pooling layers are: kernel_size=2, pad=0, stride=2
In the Faster RCNN Conv layers, every convolution is padded (pad=1, i.e., a one-pixel border of zeros is added), so an MxN input first becomes (M+2)x(N+2) and then returns to MxN after the 3x3 convolution. It is this setting that keeps the conv layers in Conv layers from changing the input/output matrix size.
Similarly, the pooling layer kernel_size=2, stride=2 in Conv layers.
In this way, each MxN matrix that passes through the pooling layer will become (M/2)x(N/2) in size.
To sum up, in the entire Conv layers, the conv and relu layers do not change the input and output sizes, and only the pooling layer makes the output length and width 1/2 of the input.
Thus, a matrix of size MxN becomes (M/16)x(N/16) after the Conv layers.
In this way, the feature map generated by Conv layers can correspond to the original image.
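The size bookkeeping above can be sketched with a hypothetical helper (conv and relu preserve the size, each of the 4 poolings halves it, with integer division for odd sizes):

```python
def conv_layers_output_size(m, n):
    """VGG-style Conv layers: 3x3 convs with pad=1 keep the size, and each of
    the 4 pooling layers (kernel 2, stride 2) halves it, so MxN -> (M/16)x(N/16)."""
    for _ in range(4):       # four 2x2 / stride-2 pooling layers
        m, n = m // 2, n // 2
    return m, n
```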
3.2 Faster-RCNN:Region Proposal Networks(RPN)
Classic detection methods are very time-consuming at generating candidate boxes. Generating detection boxes directly with an RPN is a major advantage of Faster R-CNN, greatly speeding up proposal generation.
- You can see that the RPN network is actually divided into 2 lines:
- The above one uses softmax to classify anchors to obtain positive and negative classifications;
- The following one is used to calculate the bounding box regression offset for anchors to obtain an accurate proposal.
- The final Proposal layer is responsible for synthesizing the positive anchors and the corresponding bounding box regression offset to obtain proposals, and at the same time rejecting proposals that are too small and beyond the boundary.
- In fact, when the entire network reaches the Proposal Layer, it completes the function equivalent to target positioning.
3.2.1 anchors
After the RPN's convolution, each point of the feature map is mapped back to a region of the original image; the center of that region is located, and 9 kinds of anchor boxes are generated at that center according to fixed rules.
The 9 rectangles cover 3 scales of area (128, 256, 512) and 3 shapes (aspect ratios of roughly 1:1, 1:2, and 2:1; the ratios are adjustable, not fixed).
The 4 values in each row represent the coordinates of the upper left and lower right corners of the rectangle.
Traversing the feature maps produced by the Conv layers, every point is equipped with these 9 anchors as initial detection boxes.
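A minimal sketch of generating the 9 anchors at one center, assuming the 3 scales and 3 ratios described above (the exact rounding in real implementations differs slightly; the helper name is illustrative):

```python
def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate the 9 base anchors (x1, y1, x2, y2) centered at (cx, cy):
    3 areas (scale**2) x 3 aspect ratios (height/width)."""
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the area ~ s*s while setting height/width = r
            w = s / r ** 0.5
            h = s * r ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

Each anchor's four values are the top-left and bottom-right corner coordinates, matching the description above.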
3.2.2 Softmax determines positive and negative
In fact, the RPN ends up laying dense candidate anchors over the scale of the original image, then uses a CNN to judge which anchors are positive (contain a target) and which are negative (do not). It is thus just a binary classification.
It can be seen that its conv has num_output=18, i.e., the output after the convolution is WxHx18 in size.
This just corresponds to the fact that each point of the feature maps has 9 anchors, and each anchor may be positive and negative. All this information is stored in a matrix of WxHx(9*2) size.
Why do this? The softmax classification that follows yields the positive anchors, which amounts to a first extraction of candidate target regions (the target is generally assumed to lie within the positive anchors).
So why put a reshape layer before and after the softmax? Simply for the convenience of softmax classification.
The positive/negative anchor scores are stored in caffe as [1, 18, H, W]. Since the softmax needs a binary positive/negative classification, the reshape layer changes this to [1, 2, 9xH, W], "vacating" a dimension of size 2 for the softmax, and then reshapes it back to the original layout afterwards.
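The reshape trick can be demonstrated with plain numpy (toy sizes, illustrative only):

```python
import numpy as np

# Toy RPN classification output: 1 image, 9 anchors x 2 classes = 18 channels.
H, W = 4, 6
scores = np.arange(1 * 18 * H * W, dtype=np.float32).reshape(1, 18, H, W)

# "Vacate" a dimension of size 2 so softmax can run over positive/negative,
# which is what the reshape layer before the softmax does in caffe.
reshaped = scores.reshape(1, 2, 9 * H, W)
assert reshaped.shape == (1, 2, 9 * H, W)

# After the softmax over axis 1, the result is reshaped back to (1, 18, H, W);
# the round trip loses no information.
restored = reshaped.reshape(1, 18, H, W)
assert np.array_equal(restored, scores)
```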
In summary, the RPN network uses anchors and softmax to initially extract positive anchors as candidate regions.
3.2.3 bounding box regression on proposals
You can see that the conv has num_output=36, i.e., the output after this convolution is WxHx36. This corresponds to the feature maps having 9 anchors at each point, with 4 regression transformations per anchor:
3.2.4 Proposal Layer
The Proposal Layer is responsible for synthesizing all transformations and positive anchors, calculating an accurate proposal, and sending it to the subsequent RoI Pooling Layer.
Proposal Layer has 4 inputs:
- the positive/negative anchor classification result rpn_cls_prob_reshape,
- the corresponding bbox regression transformations rpn_bbox_pred,
- im_info,
- the parameter feature_stride=16.
im_info: an input image of arbitrary size PxQ is first reshaped to a fixed MxN before being fed into Faster RCNN; im_info = [M, N, scale_factor] records all the information about this scaling.
The input image passes through the Conv layers and, after 4 poolings, becomes WxH = (M/16)x(N/16) in size; feature_stride=16 records this and is used to compute anchor offsets.
Proposal Layers are processed in the following order:
- Use the regression transformations to apply bbox regression to all positive anchors.
- Sort the anchors by their positive softmax scores in descending order and extract the top pre_nms_topN (e.g. 6000), i.e., the position-corrected positive anchors.
- Apply NMS (non-maximum suppression) to the remaining positive anchors.
- Output the resulting proposals.
Detection in the strict sense ends here; the subsequent parts belong to recognition.
The RPN structure, summed up: generate anchors -> softmax classifier extracts positive anchors -> bbox regression refines positive anchors -> Proposal Layer generates proposals.
3.3 Faster-RCNN:Roi pooling
The RoI Pooling layer is responsible for collecting proposals, and calculating the proposal feature maps, which are sent to the subsequent network.
The Rol pooling layer has 2 inputs:
- Original feature maps
- Proposal boxes output by RPN (various in size)
Why is RoI Pooling needed?
For traditional CNNs (such as AlexNet and VGG), once the network is trained, the input image size must be a fixed value, and the network output is likewise a fixed-size vector or matrix. If the input image size varies, this becomes troublesome.
There are 2 solutions:
- Crop a part of the image and pass it to the network (destroys the image's complete structure)
- Warp the image to the required size and pass it to the network (destroys the image's original shape information)
Principle of RoI Pooling
New parameters pooled_w, pooled_h and spatial_scale (1/16)
RoI Pooling layer forward process:
- Since the proposal coordinates correspond to the MxN scale, first use the spatial_scale parameter to map them back to the (M/16)x(N/16) feature-map scale;
- Then divide the feature-map region corresponding to each proposal into a pooled_w * pooled_h grid;
- Apply max pooling to each grid cell.
After this processing, proposals of different sizes all yield outputs of the fixed size pooled_w * pooled_h, achieving fixed-length output.
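The forward pass above can be sketched for a single-channel feature map (a simplified, illustrative implementation; the real layer works on batched multi-channel tensors with sub-cell rounding rules):

```python
import numpy as np

def roi_pooling(feature_map, roi, pooled_w=7, pooled_h=7):
    """Minimal RoI max-pooling sketch: divide the proposal's region of the
    feature map into a pooled_h x pooled_w grid and take the max of each cell.
    feature_map: (H, W) array; roi: (x1, y1, x2, y2) in feature-map coordinates."""
    x1, y1, x2, y2 = roi
    out = np.zeros((pooled_h, pooled_w), dtype=feature_map.dtype)
    for i in range(pooled_h):
        for j in range(pooled_w):
            # Integer grid boundaries; each cell is at least 1x1
            ys = y1 + (y2 - y1) * i // pooled_h
            ye = max(ys + 1, y1 + (y2 - y1) * (i + 1) // pooled_h)
            xs = x1 + (x2 - x1) * j // pooled_w
            xe = max(xs + 1, x1 + (x2 - x1) * (j + 1) // pooled_w)
            out[i, j] = feature_map[ys:ye, xs:xe].max()
    return out
```

Whatever the RoI size, the output is always pooled_h x pooled_w, which is exactly the fixed-length property the text describes.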
3.4 Faster-RCNN: Classification
The Classification part takes the proposal feature maps, computes which category each proposal belongs to (person, car, TV, etc.) through fully connected layers and softmax, and outputs the cls_prob probability vector;
at the same time, bounding-box regression is applied again to obtain each proposal's position offset bbox_pred, used to regress a more precise detection box.
After RoI Pooling produces the pooled_w * pooled_h proposal feature maps, they are sent into the subsequent network, which does two things:
- Classify the proposals through fully connected layers and softmax (this is the actual category recognition)
- Apply bounding-box regression to the proposals again to obtain higher-precision prediction boxes
Fully connected layer InnerProduct layers:
3.5 Network Comparison
3.6 Code example
3.6.1 Network construction
import cv2
import keras
import numpy as np
import colorsys
import pickle
import os
import nets.frcnn as frcnn
from nets.frcnn_training import get_new_img_size
from keras import backend as K
from keras.layers import Input
from keras.applications.imagenet_utils import preprocess_input
from PIL import Image,ImageFont, ImageDraw
from utils.utils import BBoxUtility
from utils.anchors import get_anchors
from utils.config import Config
import copy
import math
class FRCNN(object):
    _defaults = {
        "model_path": 'model_data/voc_weights.h5',
        "classes_path": 'model_data/voc_classes.txt',
        "confidence": 0.7,
    }

    @classmethod
    def get_defaults(cls, n):
        if n in cls._defaults:
            return cls._defaults[n]
        else:
            return "Unrecognized attribute name '" + n + "'"

    #---------------------------------------------------#
    #   Initialize Faster RCNN
    #---------------------------------------------------#
    def __init__(self, **kwargs):
        self.__dict__.update(self._defaults)
        self.class_names = self._get_class()
        self.sess = K.get_session()
        self.config = Config()
        self.generate()
        self.bbox_util = BBoxUtility()

    #---------------------------------------------------#
    #   Get all the classes
    #---------------------------------------------------#
    def _get_class(self):
        classes_path = os.path.expanduser(self.classes_path)
        with open(classes_path) as f:
            class_names = f.readlines()
        class_names = [c.strip() for c in class_names]
        return class_names

    #---------------------------------------------------#
    #   Build the model and load the weights
    #---------------------------------------------------#
    def generate(self):
        model_path = os.path.expanduser(self.model_path)
        assert model_path.endswith('.h5'), 'Keras model or weights must be a .h5 file.'
        # Total number of classes (+1 for background)
        self.num_classes = len(self.class_names) + 1
        # Load the model; if the saved file already includes the model structure,
        # load it directly, otherwise build the model first and then load the weights.
        self.model_rpn, self.model_classifier = frcnn.get_predict_model(self.config, self.num_classes)
        self.model_rpn.load_weights(self.model_path, by_name=True)
        self.model_classifier.load_weights(self.model_path, by_name=True, skip_mismatch=True)
        print('{} model, anchors, and classes loaded.'.format(model_path))
        # Assign a different color to each class for drawing boxes
        hsv_tuples = [(x / len(self.class_names), 1., 1.)
                      for x in range(len(self.class_names))]
        self.colors = list(map(lambda x: colorsys.hsv_to_rgb(*x), hsv_tuples))
        self.colors = list(
            map(lambda x: (int(x[0] * 255), int(x[1] * 255), int(x[2] * 255)),
                self.colors))

    def get_img_output_length(self, width, height):
        def get_output_length(input_length):
            # input_length += 6
            filter_sizes = [7, 3, 1, 1]
            padding = [3, 1, 0, 0]
            stride = 2
            for i in range(4):
                # input_length = (input_length - filter_size + stride) // stride
                input_length = (input_length + 2 * padding[i] - filter_sizes[i]) // stride + 1
            return input_length
        return get_output_length(width), get_output_length(height)

    #---------------------------------------------------#
    #   Detect objects in an image
    #---------------------------------------------------#
    def detect_image(self, image):
        image_shape = np.array(np.shape(image)[0:2])
        old_width = image_shape[1]
        old_height = image_shape[0]
        old_image = copy.deepcopy(image)
        width, height = get_new_img_size(old_width, old_height)

        image = image.resize([width, height])
        photo = np.array(image, dtype=np.float64)

        # Preprocess and normalize the image
        photo = preprocess_input(np.expand_dims(photo, 0))
        preds = self.model_rpn.predict(photo)
        # Decode the prediction results
        anchors = get_anchors(self.get_img_output_length(width, height), width, height)
        rpn_results = self.bbox_util.detection_out(preds, anchors, 1, confidence_threshold=0)
        R = rpn_results[0][:, 2:]

        R[:, 0] = np.array(np.round(R[:, 0] * width / self.config.rpn_stride), dtype=np.int32)
        R[:, 1] = np.array(np.round(R[:, 1] * height / self.config.rpn_stride), dtype=np.int32)
        R[:, 2] = np.array(np.round(R[:, 2] * width / self.config.rpn_stride), dtype=np.int32)
        R[:, 3] = np.array(np.round(R[:, 3] * height / self.config.rpn_stride), dtype=np.int32)

        R[:, 2] -= R[:, 0]
        R[:, 3] -= R[:, 1]
        base_layer = preds[2]

        delete_line = []
        for i, r in enumerate(R):
            if r[2] < 1 or r[3] < 1:
                delete_line.append(i)
        R = np.delete(R, delete_line, axis=0)

        bboxes = []
        probs = []
        labels = []
        for jk in range(R.shape[0] // self.config.num_rois + 1):
            ROIs = np.expand_dims(R[self.config.num_rois * jk:self.config.num_rois * (jk + 1), :], axis=0)

            if ROIs.shape[1] == 0:
                break

            if jk == R.shape[0] // self.config.num_rois:
                # pad R
                curr_shape = ROIs.shape
                target_shape = (curr_shape[0], self.config.num_rois, curr_shape[2])
                ROIs_padded = np.zeros(target_shape).astype(ROIs.dtype)
                ROIs_padded[:, :curr_shape[1], :] = ROIs
                ROIs_padded[0, curr_shape[1]:, :] = ROIs[0, 0, :]
                ROIs = ROIs_padded

            [P_cls, P_regr] = self.model_classifier.predict([base_layer, ROIs])

            for ii in range(P_cls.shape[1]):
                if np.max(P_cls[0, ii, :]) < self.confidence or np.argmax(P_cls[0, ii, :]) == (P_cls.shape[2] - 1):
                    continue

                label = np.argmax(P_cls[0, ii, :])

                (x, y, w, h) = ROIs[0, ii, :]

                cls_num = np.argmax(P_cls[0, ii, :])

                (tx, ty, tw, th) = P_regr[0, ii, 4 * cls_num:4 * (cls_num + 1)]
                tx /= self.config.classifier_regr_std[0]
                ty /= self.config.classifier_regr_std[1]
                tw /= self.config.classifier_regr_std[2]
                th /= self.config.classifier_regr_std[3]

                cx = x + w / 2.
                cy = y + h / 2.
                cx1 = tx * w + cx
                cy1 = ty * h + cy
                w1 = math.exp(tw) * w
                h1 = math.exp(th) * h
                x1 = cx1 - w1 / 2.
                y1 = cy1 - h1 / 2.
                x2 = cx1 + w1 / 2
                y2 = cy1 + h1 / 2
                x1 = int(round(x1))
                y1 = int(round(y1))
                x2 = int(round(x2))
                y2 = int(round(y2))

                bboxes.append([x1, y1, x2, y2])
                probs.append(np.max(P_cls[0, ii, :]))
                labels.append(label)

        if len(bboxes) == 0:
            return old_image

        # Keep only the boxes whose score exceeds the confidence threshold
        labels = np.array(labels)
        probs = np.array(probs)
        boxes = np.array(bboxes, dtype=np.float32)
        boxes[:, 0] = boxes[:, 0] * self.config.rpn_stride / width
        boxes[:, 1] = boxes[:, 1] * self.config.rpn_stride / height
        boxes[:, 2] = boxes[:, 2] * self.config.rpn_stride / width
        boxes[:, 3] = boxes[:, 3] * self.config.rpn_stride / height
        results = np.array(self.bbox_util.nms_for_out(np.array(labels), np.array(probs), np.array(boxes), self.num_classes - 1, 0.4))

        top_label_indices = results[:, 0]
        top_conf = results[:, 1]
        boxes = results[:, 2:]
        boxes[:, 0] = boxes[:, 0] * old_width
        boxes[:, 1] = boxes[:, 1] * old_height
        boxes[:, 2] = boxes[:, 2] * old_width
        boxes[:, 3] = boxes[:, 3] * old_height

        font = ImageFont.truetype(font='model_data/simhei.ttf', size=np.floor(3e-2 * np.shape(image)[1] + 0.5).astype('int32'))

        thickness = (np.shape(old_image)[0] + np.shape(old_image)[1]) // width
        image = old_image
        for i, c in enumerate(top_label_indices):
            predicted_class = self.class_names[int(c)]
            score = top_conf[i]

            left, top, right, bottom = boxes[i]
            top = top - 5
            left = left - 5
            bottom = bottom + 5
            right = right + 5

            top = max(0, np.floor(top + 0.5).astype('int32'))
            left = max(0, np.floor(left + 0.5).astype('int32'))
            bottom = min(np.shape(image)[0], np.floor(bottom + 0.5).astype('int32'))
            right = min(np.shape(image)[1], np.floor(right + 0.5).astype('int32'))

            # Draw the boxes
            label = '{} {:.2f}'.format(predicted_class, score)
            draw = ImageDraw.Draw(image)
            label_size = draw.textsize(label, font)
            label = label.encode('utf-8')
            print(label)

            if top - label_size[1] >= 0:
                text_origin = np.array([left, top - label_size[1]])
            else:
                text_origin = np.array([left, top + 1])

            for i in range(thickness):
                draw.rectangle(
                    [left + i, top + i, right - i, bottom - i],
                    outline=self.colors[int(c)])
            draw.rectangle(
                [tuple(text_origin), tuple(text_origin + label_size)],
                fill=self.colors[int(c)])
            draw.text(text_origin, str(label, 'UTF-8'), fill=(0, 0, 0), font=font)
            del draw
        return image

    def close_session(self):
        self.sess.close()
3.6.2 Training script
from __future__ import division
from nets.frcnn import get_model
from nets.frcnn_training import cls_loss,smooth_l1,Generator,get_img_output_length,class_loss_cls,class_loss_regr
from utils.config import Config
from utils.utils import BBoxUtility
from utils.roi_helpers import calc_iou
from keras.utils import generic_utils
from keras.callbacks import TensorBoard, ModelCheckpoint, ReduceLROnPlateau, EarlyStopping
import keras
import numpy as np
import time
import tensorflow as tf
from utils.anchors import get_anchors
def write_log(callback, names, logs, batch_no):
    for name, value in zip(names, logs):
        summary = tf.Summary()
        summary_value = summary.value.add()
        summary_value.simple_value = value
        summary_value.tag = name
        callback.writer.add_summary(summary, batch_no)
        callback.writer.flush()


if __name__ == "__main__":
    config = Config()
    NUM_CLASSES = 21
    EPOCH = 100
    EPOCH_LENGTH = 2000
    bbox_util = BBoxUtility(overlap_threshold=config.rpn_max_overlap, ignore_threshold=config.rpn_min_overlap)
    annotation_path = '2007_train.txt'

    model_rpn, model_classifier, model_all = get_model(config, NUM_CLASSES)
    base_net_weights = "model_data/voc_weights.h5"
    model_all.summary()
    model_rpn.load_weights(base_net_weights, by_name=True)
    model_classifier.load_weights(base_net_weights, by_name=True)

    with open(annotation_path) as f:
        lines = f.readlines()
    np.random.seed(10101)
    np.random.shuffle(lines)
    np.random.seed(None)

    gen = Generator(bbox_util, lines, NUM_CLASSES, solid=True)
    rpn_train = gen.generate()
    log_dir = "logs"

    # Training parameter setup
    logging = TensorBoard(log_dir=log_dir)
    callback = logging
    callback.set_model(model_all)

    model_rpn.compile(loss={
        'regression': smooth_l1(),
        'classification': cls_loss()
    }, optimizer=keras.optimizers.Adam(lr=1e-5))
    model_classifier.compile(loss=[
        class_loss_cls,
        class_loss_regr(NUM_CLASSES - 1)
    ], metrics={
        'dense_class_{}'.format(NUM_CLASSES): 'accuracy'
    }, optimizer=keras.optimizers.Adam(lr=1e-5))
    model_all.compile(optimizer='sgd', loss='mae')

    # Initialize parameters
    iter_num = 0
    train_step = 0
    losses = np.zeros((EPOCH_LENGTH, 5))
    rpn_accuracy_rpn_monitor = []
    rpn_accuracy_for_epoch = []
    start_time = time.time()

    # Best loss so far
    best_loss = np.Inf
    # Mapping from indices to classes
    print('Starting training')

    for i in range(EPOCH):
        if i == 20:
            # Lower the learning rate after 20 epochs
            model_rpn.compile(loss={
                'regression': smooth_l1(),
                'classification': cls_loss()
            }, optimizer=keras.optimizers.Adam(lr=1e-6))
            model_classifier.compile(loss=[
                class_loss_cls,
                class_loss_regr(NUM_CLASSES - 1)
            ], metrics={
                'dense_class_{}'.format(NUM_CLASSES): 'accuracy'
            }, optimizer=keras.optimizers.Adam(lr=1e-6))
            print("Learning rate decrease")

        progbar = generic_utils.Progbar(EPOCH_LENGTH)
        print('Epoch {}/{}'.format(i + 1, EPOCH))
        while True:
            if len(rpn_accuracy_rpn_monitor) == EPOCH_LENGTH and config.verbose:
                mean_overlapping_bboxes = float(sum(rpn_accuracy_rpn_monitor)) / len(rpn_accuracy_rpn_monitor)
                rpn_accuracy_rpn_monitor = []
                print('Average number of overlapping bounding boxes from RPN = {} for {} previous iterations'.format(mean_overlapping_bboxes, EPOCH_LENGTH))
                if mean_overlapping_bboxes == 0:
                    print('RPN is not producing bounding boxes that overlap the ground truth boxes. Check RPN settings or keep training.')

            X, Y, boxes = next(rpn_train)

            loss_rpn = model_rpn.train_on_batch(X, Y)
            write_log(callback, ['rpn_cls_loss', 'rpn_reg_loss'], loss_rpn, train_step)
            P_rpn = model_rpn.predict_on_batch(X)

            height, width, _ = np.shape(X[0])
            anchors = get_anchors(get_img_output_length(width, height), width, height)

            # Decode the prediction results
            results = bbox_util.detection_out(P_rpn, anchors, 1, confidence_threshold=0)
            R = results[0][:, 2:]

            X2, Y1, Y2, IouS = calc_iou(R, config, boxes[0], width, height, NUM_CLASSES)

            if X2 is None:
                rpn_accuracy_rpn_monitor.append(0)
                rpn_accuracy_for_epoch.append(0)
                continue

            neg_samples = np.where(Y1[0, :, -1] == 1)
            pos_samples = np.where(Y1[0, :, -1] == 0)

            if len(neg_samples) > 0:
                neg_samples = neg_samples[0]
            else:
                neg_samples = []

            if len(pos_samples) > 0:
                pos_samples = pos_samples[0]
            else:
                pos_samples = []

            rpn_accuracy_rpn_monitor.append(len(pos_samples))
            rpn_accuracy_for_epoch.append((len(pos_samples)))

            if len(neg_samples) == 0:
                continue

            if len(pos_samples) < config.num_rois // 2:
                selected_pos_samples = pos_samples.tolist()
            else:
                selected_pos_samples = np.random.choice(pos_samples, config.num_rois // 2, replace=False).tolist()
            try:
                selected_neg_samples = np.random.choice(neg_samples, config.num_rois - len(selected_pos_samples), replace=False).tolist()
            except:
                selected_neg_samples = np.random.choice(neg_samples, config.num_rois - len(selected_pos_samples), replace=True).tolist()

            sel_samples = selected_pos_samples + selected_neg_samples
            loss_class = model_classifier.train_on_batch([X, X2[:, sel_samples, :]], [Y1[:, sel_samples, :], Y2[:, sel_samples, :]])

            write_log(callback, ['detection_cls_loss', 'detection_reg_loss', 'detection_acc'], loss_class, train_step)

            losses[iter_num, 0] = loss_rpn[1]
            losses[iter_num, 1] = loss_rpn[2]
            losses[iter_num, 2] = loss_class[1]
            losses[iter_num, 3] = loss_class[2]
            losses[iter_num, 4] = loss_class[3]

            train_step += 1
            iter_num += 1
            progbar.update(iter_num, [('rpn_cls', np.mean(losses[:iter_num, 0])), ('rpn_regr', np.mean(losses[:iter_num, 1])),
                                      ('detector_cls', np.mean(losses[:iter_num, 2])), ('detector_regr', np.mean(losses[:iter_num, 3]))])

            if iter_num == EPOCH_LENGTH:
                loss_rpn_cls = np.mean(losses[:, 0])
                loss_rpn_regr = np.mean(losses[:, 1])
                loss_class_cls = np.mean(losses[:, 2])
                loss_class_regr = np.mean(losses[:, 3])
                class_acc = np.mean(losses[:, 4])

                mean_overlapping_bboxes = float(sum(rpn_accuracy_for_epoch)) / len(rpn_accuracy_for_epoch)
                rpn_accuracy_for_epoch = []

                if config.verbose:
                    print('Mean number of bounding boxes from RPN overlapping ground truth boxes: {}'.format(mean_overlapping_bboxes))
                    print('Classifier accuracy for bounding boxes from RPN: {}'.format(class_acc))
                    print('Loss RPN classifier: {}'.format(loss_rpn_cls))
                    print('Loss RPN regression: {}'.format(loss_rpn_regr))
                    print('Loss Detector classifier: {}'.format(loss_class_cls))
                    print('Loss Detector regression: {}'.format(loss_class_regr))
                    print('Elapsed time: {}'.format(time.time() - start_time))

                curr_loss = loss_rpn_cls + loss_rpn_regr + loss_class_cls + loss_class_regr
                iter_num = 0

                write_log(callback,
                          ['Elapsed_time', 'mean_overlapping_bboxes', 'mean_rpn_cls_loss', 'mean_rpn_reg_loss',
                           'mean_detection_cls_loss', 'mean_detection_reg_loss', 'mean_detection_acc', 'total_loss'],
                          [time.time() - start_time, mean_overlapping_bboxes, loss_rpn_cls, loss_rpn_regr,
                           loss_class_cls, loss_class_regr, class_acc, curr_loss], i)
                # Reset the timer only after logging the elapsed time
                start_time = time.time()

                if config.verbose:
                    print('The best loss is {}. The current loss is {}. Saving weights'.format(best_loss, curr_loss))

                if curr_loss < best_loss:
                    best_loss = curr_loss
                model_all.save_weights(log_dir + "/epoch{:03d}-loss{:.3f}-rpn{:.3f}-roi{:.3f}".format(i, curr_loss, loss_rpn_cls + loss_rpn_regr, loss_class_cls + loss_class_regr) + ".h5")

                break
3.6.3 Prediction script
from keras.layers import Input
from frcnn import FRCNN
from PIL import Image
frcnn = FRCNN()
while True:
    img = input('Input image filename: ')
    try:
        image = Image.open(img)
    except:
        print('Open Error! Try again!')
        continue
    else:
        r_image = frcnn.detect_image(image)
        r_image.show()
frcnn.close_session()
4. One-stage and two-stage
two-stage : A two-stage algorithm first uses a network to generate proposals, e.g. selective search or an RPN; since the RPN appeared, selective search has been largely abandoned. The RPN is attached to the backbone feature-extraction network and trained with an RPN loss (bbox regression loss + classification loss); the proposals it generates are then sent to the subsequent network for more refined bbox regression and classification.
one-stage : One-stage methods pursue speed and abandon the two-stage architecture: no separate network is set up to generate proposals. Instead, dense sampling is performed directly on the feature map to generate a large number of prior boxes, as in YOLO's grid method. These prior boxes are not refined in two steps, and their sizes are often specified by hand.
The two-stage algorithms are mainly the RCNN series: RCNN, Fast-RCNN, and Faster-RCNN. The later Mask-RCNN combines the Faster-RCNN architecture, ResNet and FPN (Feature Pyramid Networks) backbones, and the segmentation method of FCN, improving detection accuracy while also performing segmentation.
The most typical one-stage algorithm is YOLO, which is extremely fast.
5. Yolo
Pedestrian detection - Yolo3
5.1 Yolo-You Only Look Once
The YOLO algorithm uses a separate CNN model to achieve end-to-end target detection:
- Resize the image to 448x448 and divide it into a 7x7 grid of cells
- CNN feature extraction and prediction: the convolutional part extracts features, and the fully connected part makes predictions.
- Filter bboxes (via NMS)
- The YOLO algorithm as a whole divides the input image into an SxS grid; here, a 3x3 grid.
- The grid cell containing the center point of a target is responsible for detecting that target, such as the person in the figure.
- We feed the image into the network; the final output size is also SxSxn (n is the number of channels), and the output SxS corresponds to the input image's SxS grid (both 3x3 here).
- If the network can detect targets of 20 categories in total, the number of output channels is n = 2*(4+1)+20 = 30. Here 2 means each grid cell has two bounding boxes (as specified in the paper), 4 is the box's coordinate information, 1 is the box's confidence, and 20 is the number of target categories.
- So the final output size of the network is SxSxn = 3x3x30.
About the bounding boxes
- The output of the network is a tensor of S x S x (5*B+C) (S: grid size, B: number of boxes per cell, C: number of detection categories, 5: information per box).
- The 5 splits into 4+1:
- 4 is the location information of the box: the center point (x, y) and the height and width h, w of the box.
- 1 is the confidence of the box, which also encodes how accurately the box fits the target.
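The output-size arithmetic above can be checked with a small sketch (`yolo_v1_output_shape` is a hypothetical helper, using the example values from the text):

```python
def yolo_v1_output_shape(S, B, C):
    """Per-cell channels: B boxes x (4 coords + 1 confidence) + C class scores."""
    n = B * (4 + 1) + C
    return (S, S, n)

# The 3x3 toy grid from the text, 2 boxes per cell, 20 classes -> (3, 3, 30)
print(yolo_v1_output_shape(3, 2, 20))
# The real YOLOv1 setting: a 7x7 grid -> (7, 7, 30)
print(yolo_v1_output_shape(7, 2, 20))
```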
In general, YOLO does not predict the exact coordinates of the bounding-box center. It predicts:
- an offset relative to the upper-left corner of the grid cell responsible for the target;
- the offset normalized by the size of a feature-map cell.
For example, taking the picture above: if the predicted center offset is (0.4, 0.7), the center's coordinates on the 13 x 13 feature map are (6.4, 6.7) (the upper-left corner of the red cell is at (6, 6)).
However, if the predicted x, y values were greater than 1, say (1.2, 0.7), the predicted center would be (7.2, 6.7), which lies in the cell to the right of the red cell. This breaks the assumption behind YOLO: if the red cell is responsible for predicting the dog, the dog's center must lie in the red cell and not in a neighboring one.
So, to fix this, we apply a sigmoid to the output, squashing it into the interval 0 to 1, which effectively guarantees that the center stays inside the grid cell that makes the prediction.
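The offset decoding just described can be sketched as follows; `cell_x`, `cell_y` stand for the coordinates of the responsible cell's upper-left corner (names chosen here for illustration):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_center(tx, ty, cell_x, cell_y):
    """Squash the raw offsets into (0, 1) so the center stays inside the cell."""
    return cell_x + sigmoid(tx), cell_y + sigmoid(ty)

# A raw offset of 0.0 maps to 0.5, i.e. the middle of cell (6, 6)
cx, cy = decode_center(0.0, 0.0, 6, 6)
print(cx, cy)  # 6.5 6.5
```

Even a very large raw offset can no longer push the center into a neighboring cell.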
The confidence of each bounding box and the accuracy of the box:
The first factor indicates whether the grid cell containing the box holds a target: yes = 1, no = 0.
The second factor measures the accuracy of the box: it is the IOU of two boxes (the ground truth and the predicted box), i.e. their intersection over union. The larger the value, the more the two boxes overlap and the more accurate the prediction.
We can then compute the class-specific confidence score of each box: it expresses both the probability that the target in the box belongs to each category and how well the box fits the target.
The class probabilities predicted by each grid cell are multiplied by the confidence predicted for each bounding box to obtain the class-specific confidence score of each box.
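The score computation can be sketched like this (the values and the class index are made up purely for illustration):

```python
import numpy as np

def class_specific_scores(class_probs, box_confidences):
    """Multiply per-cell class probabilities by per-box confidence.
    class_probs: (S, S, C), box_confidences: (S, S, B)
    returns: (S, S, B, C) class-specific confidence scores."""
    return box_confidences[..., :, None] * class_probs[..., None, :]

# toy 3x3 grid, 2 boxes per cell, 20 classes
probs = np.zeros((3, 3, 20)); probs[1, 1, 14] = 0.9   # hypothetical "person" class
conf = np.zeros((3, 3, 2)); conf[1, 1, 0] = 0.8
scores = class_specific_scores(probs, conf)
print(scores.shape)  # (3, 3, 2, 20); scores[1, 1, 0, 14] = 0.8 * 0.9
```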
- The network performs more than 20 convolutions and four max-pooling operations. The 3x3 convolutions extract features and the 1x1 convolutions compress them; the image is finally compressed to a 7x7xfilters feature map, which is equivalent to dividing the whole image into a 7x7 grid, with each cell responsible for detecting targets in its own area.
- The network ends with fully connected layers producing an output of size 7x7x30, where 7x7 is the grid; of the 30 channels, the first 20 are the class predictions and the last 10 are the two predicted boxes with their confidences (5x2).
Perform the same operation for each bbox of each grid cell: 7x7x2 = 98 bboxes (each bbox carries both class information and coordinate information).
After obtaining the class-specific confidence score of each bbox, set a threshold, filter out the boxes with low scores, and run NMS on the remaining boxes to obtain the final detection result:
- After sorting, boxes at different positions have different scores.
- Take the highest-scoring box as bbox_max and compute its IOU with every lower-scoring non-zero box (bbox_cur); boxes that overlap bbox_max too much have their scores set to 0.
- Recursively, the next non-zero bbox_cur (e.g. 0.2) becomes the new bbox_max and the IOU comparison continues.
- Finally, n boxes remain.
For each remaining box, take its 20×1 vector of class scores and find the index of the largest score; that class and score become the box's predicted category and confidence.
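The suppression procedure above can be sketched as a minimal greedy NMS (a sketch of the idea, not the author's exact code):

```python
import numpy as np

def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    order = np.argsort(scores)[::-1]  # highest score first (bbox_max)
    keep = []
    while len(order) > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # drop every lower-scoring box that overlaps bbox_max too much
        order = np.array([i for i in rest
                          if iou(boxes[best], boxes[i]) <= iou_threshold])
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] - box 1 is suppressed by box 0
```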
Disadvantages of YOLO:
- YOLO performs poorly on objects that are very close to each other (when their midpoints fall into the same grid cell) and on small objects in groups, because each grid cell predicts only two boxes and only one class.
- Generalization is weak when the same type of object appears with unusual aspect ratios or in other uncommon configurations in test images.
5.2 Yolo2
- Yolo2 uses a new classification network as the feature extraction part.
- The network uses more 3 x 3 convolution kernels, doubling the number of channels after each pooling operation.
- Put the 1 x 1 convolution kernel between the 3 x 3 convolution kernels to compress the features.
- Use batch normalization to stabilize model training and accelerate convergence.
- A shortcut (passthrough) connection is kept to retain fine-grained features from an earlier layer.
- Compared with YOLOv1, YOLOv2 adds prior (anchor) boxes; the shape of the final output conv_dec is (13, 13, 425):
- 13x13: the whole image is divided into a 13x13 grid for prediction.
- 425 decomposes as 5x85. YOLOv2 is commonly used on the COCO dataset, which has 80 classes; 85 = 80 + 5, where the 5 are x, y, w, h and the confidence. The x5 means each cell predicts 5 boxes, corresponding to the 5 prior boxes.
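A quick check of the channel arithmetic:

```python
def yolov2_channels(num_priors, num_classes):
    # each prior box predicts x, y, w, h, a confidence, and the class scores
    return num_priors * (4 + 1 + num_classes)

print(yolov2_channels(5, 80))  # 425 for COCO (5 priors, 80 classes)
```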
5.2.1 Yolo2 - using anchor boxes
5.2.2 Yolo2 – Dimension Clusters (dimension clustering)
Use k-means clustering to obtain the prior boxes:
Previously the prior boxes were set manually. YOLOv2 instead derives prior boxes that better match the object sizes in the training samples, which reduces the difficulty the network has in fine-tuning the priors to the actual object locations. YOLOv2's approach is to run cluster analysis on the labeled bounding boxes in the training set to find box sizes that match the samples as closely as possible.
The key choice in a clustering algorithm is how to compute the "distance" between two boxes. With the usual Euclidean distance, large boxes produce larger errors, but what we actually care about is the IOU between boxes. Therefore YOLOv2 uses the following formula as the "distance" between two boxes when clustering:
d(box, centroid) = 1 - IOU(box, centroid)
For different values of the cluster count k, the k centroid boxes are computed, along with the average IOU (Avg IOU) between the labeled boxes in the sample and their nearest centroid.
Obviously, the larger the number of clusters k, the higher the Avg IOU.
YOLOv2 chooses k = 5 as a compromise between the number of boxes and IOU: 5 clustered priors achieve 61.0 Avg IOU, comparable to the 60.9 Avg IOU of 9 manually set priors.
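Because only the box shapes matter for clustering, the boxes are aligned at the origin and the IOU is computed from widths and heights alone. A minimal sketch of this distance:

```python
def cluster_distance(box, centroid):
    """d(box, centroid) = 1 - IOU, both boxes given as (w, h), aligned at the origin."""
    w1, h1 = box
    w2, h2 = centroid
    inter = min(w1, w2) * min(h1, h2)   # overlap of origin-aligned rectangles
    iou = inter / (w1 * h1 + w2 * h2 - inter)
    return 1.0 - iou

print(cluster_distance((2, 2), (4, 4)))  # 0.75
print(cluster_distance((3, 3), (3, 3)))  # 0.0 - identical shapes
```

This is the same computation the `box_iou` function in gen_anchors.py below performs in vectorized form.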
The author finally selects 5 cluster centers as the prior box. For the two data sets, the width and height of the five prior boxes are as follows:
COCO: (0.57273, 0.677385), (1.87446, 2.06253), (3.33843, 5.47434), (7.88282, 3.52778), (9.77052, 9.16828)
VOC: (1.3221, 1.73145), (3.19275, 4.00944), (5.05587, 8.09892), (9.47112, 4.84053), (11.2364, 10.0071)
5.3 Yolo3
Compared with YOLOv1 and YOLOv2, YOLOv3 improves greatly. The main improvements are:
- Using residual (Residual) blocks in the backbone.
- Extracting multiple feature layers for detection: three feature layers with shapes (13,13,75), (26,26,75) and (52,52,75). The last dimension is 75 because this figure is based on the VOC dataset, which has 20 classes; YOLOv3 uses 3 prior boxes per feature layer, so the final dimension is 3x(20+5) = 75.
- Adopting an UpSampling2D design to merge feature maps across scales.
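The shapes of the three output layers follow directly from the grid sizes and the per-anchor channel count (VOC's 20 classes assumed, as in the text):

```python
def yolov3_output_shapes(input_size=416, num_classes=20, anchors_per_scale=3):
    channels = anchors_per_scale * (4 + 1 + num_classes)
    # the three detection scales are the input downsampled by 32, 16 and 8
    return [(input_size // s, input_size // s, channels) for s in (32, 16, 8)]

print(yolov3_output_shapes())  # [(13, 13, 75), (26, 26, 75), (52, 52, 75)]
```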
5.4 Code example (yolo v3)
5.4.1 Model building
# -*- coding:utf-8 -*-
import numpy as np
import tensorflow as tf
import os
class yolo:
def __init__(self, norm_epsilon, norm_decay, anchors_path, classes_path, pre_train):
"""
Introduction
------------
初始化函数
Parameters
----------
norm_decay: 在预测时计算moving average时的衰减率
norm_epsilon: 方差加上极小的数,防止除以0的情况
anchors_path: yolo anchor 文件路径
classes_path: 数据集类别对应文件
pre_train: 是否使用预训练darknet53模型
"""
self.norm_epsilon = norm_epsilon
self.norm_decay = norm_decay
self.anchors_path = anchors_path
self.classes_path = classes_path
self.pre_train = pre_train
self.anchors = self._get_anchors()
self.classes = self._get_class()
#---------------------------------------#
# Get the classes and prior (anchor) boxes
#---------------------------------------#
def _get_class(self):
"""
Introduction
------------
获取类别名字
Returns
-------
class_names: coco数据集类别对应的名字
"""
classes_path = os.path.expanduser(self.classes_path)
with open(classes_path) as f:
class_names = f.readlines()
class_names = [c.strip() for c in class_names]
return class_names
def _get_anchors(self):
"""
Introduction
------------
获取anchors
"""
anchors_path = os.path.expanduser(self.anchors_path)
with open(anchors_path) as f:
anchors = f.readline()
anchors = [float(x) for x in anchors.split(',')]
return np.array(anchors).reshape(-1, 2)
#---------------------------------------#
# Layer-building helpers
#---------------------------------------#
# with l2 regularization
def _batch_normalization_layer(self, input_layer, name = None, training = True, norm_decay = 0.99, norm_epsilon = 1e-3):
'''
Introduction
------------
Apply batch normalization to the feature map extracted by a convolutional layer
Parameters
----------
input_layer: input 4-D tensor
name: name of the batchnorm layer
training: whether this is the training phase
norm_decay: decay rate for the moving average computed at inference time
norm_epsilon: small value added to the variance to avoid division by zero
Returns
-------
bn_layer: feature map after batch normalization
'''
bn_layer = tf.layers.batch_normalization(inputs = input_layer,
momentum = norm_decay, epsilon = norm_epsilon, center = True,
scale = True, training = training, name = name)
return tf.nn.leaky_relu(bn_layer, alpha = 0.1)
# This performs the convolutions
def _conv2d_layer(self, inputs, filters_num, kernel_size, name, use_bias = False, strides = 1):
"""
Introduction
------------
使用tf.layers.conv2d减少权重和偏置矩阵初始化过程,以及卷积后加上偏置项的操作
经过卷积之后需要进行batch norm,最后使用leaky ReLU激活函数
根据卷积时的步长,如果卷积的步长为2,则对图像进行降采样
比如,输入图片的大小为416*416,卷积核大小为3,若stride为2时,(416 - 3 + 2)/ 2 + 1, 计算结果为208,相当于做了池化层处理
因此需要对stride大于1的时候,先进行一个padding操作, 采用四周都padding一维代替'same'方式
Parameters
----------
inputs: 输入变量
filters_num: 卷积核数量
strides: 卷积步长
name: 卷积层名字
trainging: 是否为训练过程
use_bias: 是否使用偏置项
kernel_size: 卷积核大小
Returns
-------
conv: 卷积之后的feature map
"""
conv = tf.layers.conv2d(
inputs = inputs, filters = filters_num,
kernel_size = kernel_size, strides = [strides, strides], kernel_initializer = tf.glorot_uniform_initializer(),
padding = ('SAME' if strides == 1 else 'VALID'), kernel_regularizer = tf.contrib.layers.l2_regularizer(scale = 5e-4), use_bias = use_bias, name = name)
return conv
# This performs the residual convolutions
# A residual block first applies a 3x3 convolution and saves that layer,
# then applies a 1x1 convolution and another 3x3 convolution, and adds the saved layer to the result
def _Residual_block(self, inputs, filters_num, blocks_num, conv_index, training = True, norm_decay = 0.99, norm_epsilon = 1e-3):
"""
Introduction
------------
Darknet的残差block,类似resnet的两层卷积结构,分别采用1x1和3x3的卷积核,使用1x1是为了减少channel的维度
Parameters
----------
inputs: 输入变量
filters_num: 卷积核数量
trainging: 是否为训练过程
blocks_num: block的数量
conv_index: 为了方便加载预训练权重,统一命名序号
weights_dict: 加载预训练模型的权重
norm_decay: 在预测时计算moving average时的衰减率
norm_epsilon: 方差加上极小的数,防止除以0的情况
Returns
-------
inputs: 经过残差网络处理后的结果
"""
# pad the input feature map along its height and width
inputs = tf.pad(inputs, paddings=[[0, 0], [1, 0], [1, 0], [0, 0]], mode='CONSTANT')
layer = self._conv2d_layer(inputs, filters_num, kernel_size = 3, strides = 2, name = "conv2d_" + str(conv_index))
layer = self._batch_normalization_layer(layer, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
conv_index += 1
for _ in range(blocks_num):
shortcut = layer
layer = self._conv2d_layer(layer, filters_num // 2, kernel_size = 1, strides = 1, name = "conv2d_" + str(conv_index))
layer = self._batch_normalization_layer(layer, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
conv_index += 1
layer = self._conv2d_layer(layer, filters_num, kernel_size = 3, strides = 1, name = "conv2d_" + str(conv_index))
layer = self._batch_normalization_layer(layer, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
conv_index += 1
layer += shortcut
return layer, conv_index
#---------------------------------------#
# Build darknet53
#---------------------------------------#
def _darknet53(self, inputs, conv_index, training = True, norm_decay = 0.99, norm_epsilon = 1e-3):
"""
Introduction
------------
构建yolo3使用的darknet53网络结构
Parameters
----------
inputs: 模型输入变量
conv_index: 卷积层数序号,方便根据名字加载预训练权重
weights_dict: 预训练权重
training: 是否为训练
norm_decay: 在预测时计算moving average时的衰减率
norm_epsilon: 方差加上极小的数,防止除以0的情况
Returns
-------
conv: 经过52层卷积计算之后的结果, 输入图片为416x416x3,则此时输出的结果shape为13x13x1024
route1: 返回第26层卷积计算结果52x52x256, 供后续使用
route2: 返回第43层卷积计算结果26x26x512, 供后续使用
conv_index: 卷积层计数,方便在加载预训练模型时使用
"""
with tf.variable_scope('darknet53'):
# 416,416,3 -> 416,416,32
conv = self._conv2d_layer(inputs, filters_num = 32, kernel_size = 3, strides = 1, name = "conv2d_" + str(conv_index))
conv = self._batch_normalization_layer(conv, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
conv_index += 1
# 416,416,32 -> 208,208,64
conv, conv_index = self._Residual_block(conv, conv_index = conv_index, filters_num = 64, blocks_num = 1, training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
# 208,208,64 -> 104,104,128
conv, conv_index = self._Residual_block(conv, conv_index = conv_index, filters_num = 128, blocks_num = 2, training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
# 104,104,128 -> 52,52,256
conv, conv_index = self._Residual_block(conv, conv_index = conv_index, filters_num = 256, blocks_num = 8, training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
# route1 = 52,52,256
route1 = conv
# 52,52,256 -> 26,26,512
conv, conv_index = self._Residual_block(conv, conv_index = conv_index, filters_num = 512, blocks_num = 8, training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
# route2 = 26,26,512
route2 = conv
# 26,26,512 -> 13,13,1024
conv, conv_index = self._Residual_block(conv, conv_index = conv_index, filters_num = 1024, blocks_num = 4, training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
# route3 = 13,13,1024
return route1, route2, conv, conv_index
# Outputs two results:
# the first, after 5 convolutions (1x1, 3x3, 1x1, 3x3, 1x1), feeds the next upsampling step
# the second, after 5+2 convolutions (1x1, 3x3, 1x1, 3x3, 1x1, 3x3, 1x1), is used as an output feature layer
def _yolo_block(self, inputs, filters_num, out_filters, conv_index, training = True, norm_decay = 0.99, norm_epsilon = 1e-3):
"""
Introduction
------------
yolo3在Darknet53提取的特征层基础上,又加了针对3种不同比例的feature map的block,这样来提高对小物体的检测率
Parameters
----------
inputs: 输入特征
filters_num: 卷积核数量
out_filters: 最后输出层的卷积核数量
conv_index: 卷积层数序号,方便根据名字加载预训练权重
training: 是否为训练
norm_decay: 在预测时计算moving average时的衰减率
norm_epsilon: 方差加上极小的数,防止除以0的情况
Returns
-------
route: 返回最后一层卷积的前一层结果
conv: 返回最后一层卷积的结果
conv_index: conv层计数
"""
conv = self._conv2d_layer(inputs, filters_num = filters_num, kernel_size = 1, strides = 1, name = "conv2d_" + str(conv_index))
conv = self._batch_normalization_layer(conv, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
conv_index += 1
conv = self._conv2d_layer(conv, filters_num = filters_num * 2, kernel_size = 3, strides = 1, name = "conv2d_" + str(conv_index))
conv = self._batch_normalization_layer(conv, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
conv_index += 1
conv = self._conv2d_layer(conv, filters_num = filters_num, kernel_size = 1, strides = 1, name = "conv2d_" + str(conv_index))
conv = self._batch_normalization_layer(conv, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
conv_index += 1
conv = self._conv2d_layer(conv, filters_num = filters_num * 2, kernel_size = 3, strides = 1, name = "conv2d_" + str(conv_index))
conv = self._batch_normalization_layer(conv, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
conv_index += 1
conv = self._conv2d_layer(conv, filters_num = filters_num, kernel_size = 1, strides = 1, name = "conv2d_" + str(conv_index))
conv = self._batch_normalization_layer(conv, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
conv_index += 1
route = conv
conv = self._conv2d_layer(conv, filters_num = filters_num * 2, kernel_size = 3, strides = 1, name = "conv2d_" + str(conv_index))
conv = self._batch_normalization_layer(conv, name = "batch_normalization_" + str(conv_index), training = training, norm_decay = norm_decay, norm_epsilon = norm_epsilon)
conv_index += 1
conv = self._conv2d_layer(conv, filters_num = out_filters, kernel_size = 1, strides = 1, name = "conv2d_" + str(conv_index), use_bias = True)
conv_index += 1
return route, conv, conv_index
# Returns the contents of the three feature layers
def yolo_inference(self, inputs, num_anchors, num_classes, training = True):
"""
Introduction
------------
构建yolo模型结构
Parameters
----------
inputs: 模型的输入变量
num_anchors: 每个grid cell负责检测的anchor数量
num_classes: 类别数量
training: 是否为训练模式
"""
conv_index = 1
# route1 = 52,52,256, route2 = 26,26,512, route3 = 13,13,1024
conv2d_26, conv2d_43, conv, conv_index = self._darknet53(inputs, conv_index, training = training, norm_decay = self.norm_decay, norm_epsilon = self.norm_epsilon)
with tf.variable_scope('yolo'):
#--------------------------------------#
# Get the first feature layer
#--------------------------------------#
# conv2d_57 = 13,13,512, conv2d_59 = 13,13,255 (3x(80+5))
conv2d_57, conv2d_59, conv_index = self._yolo_block(conv, 512, num_anchors * (num_classes + 5), conv_index = conv_index, training = training, norm_decay = self.norm_decay, norm_epsilon = self.norm_epsilon)
#--------------------------------------#
# Get the second feature layer
#--------------------------------------#
conv2d_60 = self._conv2d_layer(conv2d_57, filters_num = 256, kernel_size = 1, strides = 1, name = "conv2d_" + str(conv_index))
conv2d_60 = self._batch_normalization_layer(conv2d_60, name = "batch_normalization_" + str(conv_index),training = training, norm_decay = self.norm_decay, norm_epsilon = self.norm_epsilon)
conv_index += 1
# unSample_0 = 26,26,256
unSample_0 = tf.image.resize_nearest_neighbor(conv2d_60, [2 * tf.shape(conv2d_60)[1], 2 * tf.shape(conv2d_60)[1]], name='upSample_0')
# route0 = 26,26,768
route0 = tf.concat([unSample_0, conv2d_43], axis = -1, name = 'route_0')
# conv2d_65 = 26,26,256, conv2d_67 = 26,26,255
conv2d_65, conv2d_67, conv_index = self._yolo_block(route0, 256, num_anchors * (num_classes + 5), conv_index = conv_index, training = training, norm_decay = self.norm_decay, norm_epsilon = self.norm_epsilon)
#--------------------------------------#
# Get the third feature layer
#--------------------------------------#
conv2d_68 = self._conv2d_layer(conv2d_65, filters_num = 128, kernel_size = 1, strides = 1, name = "conv2d_" + str(conv_index))
conv2d_68 = self._batch_normalization_layer(conv2d_68, name = "batch_normalization_" + str(conv_index), training=training, norm_decay=self.norm_decay, norm_epsilon = self.norm_epsilon)
conv_index += 1
# unSample_1 = 52,52,128
unSample_1 = tf.image.resize_nearest_neighbor(conv2d_68, [2 * tf.shape(conv2d_68)[1], 2 * tf.shape(conv2d_68)[1]], name='upSample_1')
# route1= 52,52,384
route1 = tf.concat([unSample_1, conv2d_26], axis = -1, name = 'route_1')
# conv2d_75 = 52,52,255
_, conv2d_75, _ = self._yolo_block(route1, 128, num_anchors * (num_classes + 5), conv_index = conv_index, training = training, norm_decay = self.norm_decay, norm_epsilon = self.norm_epsilon)
return [conv2d_59, conv2d_67, conv2d_75]
5.4.2 Configuration file
num_parallel_calls = 4
input_shape = 416
max_boxes = 20
jitter = 0.3
hue = 0.1
sat = 1.0
cont = 0.8
bri = 0.1
norm_decay = 0.99
norm_epsilon = 1e-3
pre_train = True
num_anchors = 9
num_classes = 80
training = True
ignore_thresh = .5
learning_rate = 0.001
train_batch_size = 10
val_batch_size = 10
train_num = 2800
val_num = 5000
Epoch = 50
obj_threshold = 0.5
nms_threshold = 0.5
gpu_index = "0"
log_dir = './logs'
data_dir = './model_data'
model_dir = './test_model/model.ckpt-192192'
pre_train_yolo3 = True
yolo3_weights_path = './model_data/yolov3.weights'
darknet53_weights_path = './model_data/darknet53.weights'
anchors_path = './model_data/yolo_anchors.txt'
classes_path = './model_data/coco_classes.txt'
image_file = "./img/img.jpg"
5.4.3 detect file
import os
import config
import argparse
import numpy as np
import tensorflow as tf
from yolo_predict import yolo_predictor
from PIL import Image, ImageFont, ImageDraw
from utils import letterbox_image, load_weights
# Specify the GPU index to use
os.environ["CUDA_VISIBLE_DEVICES"] = config.gpu_index
def detect(image_path, model_path, yolo_weights = None):
"""
Introduction
------------
加载模型,进行预测
Parameters
----------
model_path: 模型路径,当使用yolo_weights无用
image_path: 图片路径
"""
#---------------------------------------#
# Image preprocessing
#---------------------------------------#
image = Image.open(image_path)
# Scale the input image, preserving its aspect ratio and padding the rest
resize_image = letterbox_image(image, (416, 416))
image_data = np.array(resize_image, dtype = np.float32)
# normalize
image_data /= 255.
# add a batch dimension
image_data = np.expand_dims(image_data, axis = 0)
#---------------------------------------#
# Image input
#---------------------------------------#
# input_image_shape: size of the original image
input_image_shape = tf.placeholder(dtype = tf.int32, shape = (2,))
# the input image
input_image = tf.placeholder(shape = [None, 416, 416, 3], dtype = tf.float32)
# yolo_predictor is the object used for prediction
predictor = yolo_predictor(config.obj_threshold, config.nms_threshold, config.classes_path, config.anchors_path)
with tf.Session() as sess:
#---------------------------------------#
# Image prediction
#---------------------------------------#
if yolo_weights is not None:
with tf.variable_scope('predict'):
boxes, scores, classes = predictor.predict(input_image, input_image_shape)
# load the model
load_op = load_weights(tf.global_variables(scope = 'predict'), weights_file = yolo_weights)
sess.run(load_op)
# run the prediction
out_boxes, out_scores, out_classes = sess.run(
[boxes, scores, classes],
feed_dict={
# image_data has been resized
input_image: image_data,
# passed in as (y, x)
input_image_shape: [image.size[1], image.size[0]]
})
else:
boxes, scores, classes = predictor.predict(input_image, input_image_shape)
saver = tf.train.Saver()
saver.restore(sess, model_path)
out_boxes, out_scores, out_classes = sess.run(
[boxes, scores, classes],
feed_dict={
input_image: image_data,
input_image_shape: [image.size[1], image.size[0]]
})
#---------------------------------------#
# Draw the boxes
#---------------------------------------#
# print how many boxes were found
print('Found {} boxes for {}'.format(len(out_boxes), 'img'))
font = ImageFont.truetype(font = 'font/FiraMono-Medium.otf', size = np.floor(3e-2 * image.size[1] + 0.5).astype('int32'))
# line thickness
thickness = (image.size[0] + image.size[1]) // 300
for i, c in reversed(list(enumerate(out_classes))):
# get the predicted class name, box and score
predicted_class = predictor.class_names[c]
box = out_boxes[i]
score = out_scores[i]
# build the label to print
label = '{} {:.2f}'.format(predicted_class, score)
# used to draw the box and the label
draw = ImageDraw.Draw(image)
# textsize gives the rendered size of the label in this font
label_size = draw.textsize(label, font)
# get the four edges
top, left, bottom, right = box
top = max(0, np.floor(top + 0.5).astype('int32'))
left = max(0, np.floor(left + 0.5).astype('int32'))
bottom = min(image.size[1]-1, np.floor(bottom + 0.5).astype('int32'))
right = min(image.size[0]-1, np.floor(right + 0.5).astype('int32'))
print(label, (left, top), (right, bottom))
print(label_size)
if top - label_size[1] >= 0:
text_origin = np.array([left, top - label_size[1]])
else:
text_origin = np.array([left, top + 1])
# My kingdom for a good redistributable image drawing library.
for i in range(thickness):
draw.rectangle(
[left + i, top + i, right - i, bottom - i],
outline = predictor.colors[c])
draw.rectangle(
[tuple(text_origin), tuple(text_origin + label_size)],
fill = predictor.colors[c])
draw.text(text_origin, label, fill=(0, 0, 0), font=font)
del draw
image.show()
image.save('./img/result1.jpg')
if __name__ == '__main__':
# when using the weights shipped with yolo3
if config.pre_train_yolo3 == True:
detect(config.image_file, config.model_dir, config.yolo3_weights_path)
# when using a trained model checkpoint
else:
detect(config.image_file, config.model_dir)
5.4.4 gen_anchors.py
import numpy as np
import matplotlib.pyplot as plt
from pycocotools.coco import COCO
def convert_coco_bbox(size, box):
"""
Introduction
------------
计算box的长宽和原始图像的长宽比值
Parameters
----------
size: 原始图像大小
box: 标注box的信息
Returns
x, y, w, h 标注box和原始图像的比值
"""
dw = 1. / size[0]
dh = 1. / size[1]
x = (box[0] + box[2]) / 2.0 - 1
y = (box[1] + box[3]) / 2.0 - 1
w = box[2]
h = box[3]
x = x * dw
w = w * dw
y = y * dh
h = h * dh
return x, y, w, h
def box_iou(boxes, clusters):
"""
Introduction
------------
计算每个box和聚类中心的距离值
Parameters
----------
boxes: 所有的box数据
clusters: 聚类中心
"""
box_num = boxes.shape[0]
cluster_num = clusters.shape[0]
box_area = boxes[:, 0] * boxes[:, 1]
# repeat each box's area 9 times, corresponding to the 9 cluster centers
box_area = box_area.repeat(cluster_num)
box_area = np.reshape(box_area, [box_num, cluster_num])
cluster_area = clusters[:, 0] * clusters[:, 1]
cluster_area = np.tile(cluster_area, [1, box_num])
cluster_area = np.reshape(cluster_area, [box_num, cluster_num])
# Compute the iou of two rectangles, assuming all rectangles share their top-left corner at the origin; the overlap area is then simply the product of the minimum width and minimum height
boxes_width = np.reshape(boxes[:, 0].repeat(cluster_num), [box_num, cluster_num])
clusters_width = np.reshape(np.tile(clusters[:, 0], [1, box_num]), [box_num, cluster_num])
min_width = np.minimum(clusters_width, boxes_width)
boxes_high = np.reshape(boxes[:, 1].repeat(cluster_num), [box_num, cluster_num])
clusters_high = np.reshape(np.tile(clusters[:, 1], [1, box_num]), [box_num, cluster_num])
min_high = np.minimum(clusters_high, boxes_high)
iou = np.multiply(min_high, min_width) / (box_area + cluster_area - np.multiply(min_high, min_width))
return iou
def avg_iou(boxes, clusters):
"""
Introduction
------------
计算所有box和聚类中心的最大iou均值作为准确率
Parameters
----------
boxes: 所有的box
clusters: 聚类中心
Returns
-------
accuracy: 准确率
"""
return np.mean(np.max(box_iou(boxes, clusters), axis =1))
def Kmeans(boxes, cluster_num, iteration_cutoff = 25, function = np.median):
"""
Introduction
------------
根据所有box的长宽进行Kmeans聚类
Parameters
----------
boxes: 所有的box的长宽
cluster_num: 聚类的数量
iteration_cutoff: 当准确率不再降低多少轮停止迭代
function: 聚类中心更新的方式
Returns
-------
clusters: 聚类中心box的大小
"""
boxes_num = boxes.shape[0]
best_average_iou = 0
best_avg_iou_iteration = 0
best_clusters = []
anchors = []
np.random.seed()
# randomly choose boxes as the initial cluster centers
clusters = boxes[np.random.choice(boxes_num, cluster_num, replace = False)]
count = 0
while True:
distances = 1. - box_iou(boxes, clusters)
boxes_iou = np.min(distances, axis=1)
# find which cluster center each box is closest to
current_box_cluster = np.argmin(distances, axis=1)
average_iou = np.mean(1. - boxes_iou)
if average_iou > best_average_iou:
best_average_iou = average_iou
best_clusters = clusters
best_avg_iou_iteration = count
# update the cluster centers using `function`
for cluster in range(cluster_num):
clusters[cluster] = function(boxes[current_box_cluster == cluster], axis=0)
if count >= best_avg_iou_iteration + iteration_cutoff:
break
print("Sum of all distances (cost) = {}".format(np.sum(boxes_iou)))
print("iter: {} Accuracy: {:.2f}%".format(count, avg_iou(boxes, clusters) * 100))
count += 1
for cluster in best_clusters:
anchors.append([round(cluster[0] * 416), round(cluster[1] * 416)])
return anchors, best_average_iou
def load_cocoDataset(annfile):
"""
Introduction
------------
读取coco数据集的标注信息
Parameters
----------
datasets: 数据集名字列表
"""
data = []
coco = COCO(annfile)
cats = coco.loadCats(coco.getCatIds())
coco.loadImgs()
base_classes = {
cat['id'] : cat['name'] for cat in cats}
imgId_catIds = [coco.getImgIds(catIds = cat_ids) for cat_ids in base_classes.keys()]
image_ids = [img_id for img_cat_id in imgId_catIds for img_id in img_cat_id ]
for image_id in image_ids:
annIds = coco.getAnnIds(imgIds = image_id)
anns = coco.loadAnns(annIds)
img = coco.loadImgs(image_id)[0]
image_width = img['width']
image_height = img['height']
for ann in anns:
box = ann['bbox']
bb = convert_coco_bbox((image_width, image_height), box)
data.append(bb[2:])
return np.array(data)
def process(dataFile, cluster_num, iteration_cutoff = 25, function = np.median):
"""
Introduction
------------
主处理函数
Parameters
----------
dataFile: 数据集的标注文件
cluster_num: 聚类中心数目
iteration_cutoff: 当准确率不再降低多少轮停止迭代
function: 聚类中心更新的方式
"""
last_best_iou = 0
last_anchors = []
boxes = load_cocoDataset(dataFile)
box_w = boxes[:1000, 0]
box_h = boxes[:1000, 1]
plt.scatter(box_h, box_w, c = 'r')
anchors, _ = Kmeans(boxes, cluster_num, iteration_cutoff, function)
anchors = np.array(anchors)
plt.scatter(anchors[:, 0], anchors[:, 1], c = 'b')
plt.show()
for _ in range(100):
anchors, best_iou = Kmeans(boxes, cluster_num, iteration_cutoff, function)
if best_iou > last_best_iou:
last_anchors = anchors
last_best_iou = best_iou
print("anchors: {}, avg iou: {}".format(last_anchors, last_best_iou))
print("final anchors: {}, avg iou: {}".format(last_anchors, last_best_iou))
if __name__ == '__main__':
process('./annotations/instances_train2014.json', 9)
5.4.5 utils.py
import json
import numpy as np
import tensorflow as tf
from PIL import Image
from collections import defaultdict
def load_weights(var_list, weights_file):
"""
Introduction
------------
加载预训练好的darknet53权重文件
Parameters
----------
var_list: 赋值变量名
weights_file: 权重文件
Returns
-------
assign_ops: 赋值更新操作
"""
with open(weights_file, "rb") as fp:
_ = np.fromfile(fp, dtype=np.int32, count=5)
weights = np.fromfile(fp, dtype=np.float32)
ptr = 0
i = 0
assign_ops = []
while i < len(var_list) - 1:
var1 = var_list[i]
var2 = var_list[i + 1]
# do something only if we process conv layer
if 'conv2d' in var1.name.split('/')[-2]:
# check type of next layer
if 'batch_normalization' in var2.name.split('/')[-2]:
# load batch norm params
gamma, beta, mean, var = var_list[i + 1:i + 5]
batch_norm_vars = [beta, gamma, mean, var]
for var in batch_norm_vars:
shape = var.shape.as_list()
num_params = np.prod(shape)
var_weights = weights[ptr:ptr + num_params].reshape(shape)
ptr += num_params
assign_ops.append(tf.assign(var, var_weights, validate_shape=True))
# we move the pointer by 4, because we loaded 4 variables
i += 4
elif 'conv2d' in var2.name.split('/')[-2]:
# load biases
bias = var2
bias_shape = bias.shape.as_list()
bias_params = np.prod(bias_shape)
bias_weights = weights[ptr:ptr + bias_params].reshape(bias_shape)
ptr += bias_params
assign_ops.append(tf.assign(bias, bias_weights, validate_shape=True))
# we loaded 1 variable
i += 1
# we can load weights of conv layer
shape = var1.shape.as_list()
num_params = np.prod(shape)
var_weights = weights[ptr:ptr + num_params].reshape((shape[3], shape[2], shape[0], shape[1]))
# remember to transpose to column-major
var_weights = np.transpose(var_weights, (2, 3, 1, 0))
ptr += num_params
assign_ops.append(tf.assign(var1, var_weights, validate_shape=True))
i += 1
return assign_ops
def letterbox_image(image, size):
"""
Introduction
------------
对预测输入图像进行缩放,按照长宽比进行缩放,不足的地方进行填充
Parameters
----------
image: 输入图像
size: 图像大小
Returns
-------
boxed_image: 缩放后的图像
"""
image_w, image_h = image.size
w, h = size
new_w = int(image_w * min(w*1.0/image_w, h*1.0/image_h))
new_h = int(image_h * min(w*1.0/image_w, h*1.0/image_h))
resized_image = image.resize((new_w,new_h), Image.BICUBIC)
boxed_image = Image.new('RGB', size, (128, 128, 128))
boxed_image.paste(resized_image, ((w-new_w)//2,(h-new_h)//2))
return boxed_image
def draw_box(image, bbox):
    """
    Introduction
    ------------
    Visualize the training data through TensorBoard.
    Parameters
    ----------
    image: training images
    bbox: ground-truth box coordinates in the training images
    """
    xmin, ymin, xmax, ymax, label = tf.split(value=bbox, num_or_size_splits=5, axis=2)
    height = tf.cast(tf.shape(image)[1], tf.float32)
    width = tf.cast(tf.shape(image)[2], tf.float32)
    # tf.image.draw_bounding_boxes expects normalized (ymin, xmin, ymax, xmax)
    new_bbox = tf.concat([
        tf.cast(ymin, tf.float32) / height,
        tf.cast(xmin, tf.float32) / width,
        tf.cast(ymax, tf.float32) / height,
        tf.cast(xmax, tf.float32) / width], 2)
    new_image = tf.image.draw_bounding_boxes(image, new_bbox)
    tf.summary.image('input', new_image)
def voc_ap(rec, prec):
    """
    --- Official MATLAB code, VOC2012 ---
    mrec=[0 ; rec ; 1];
    mpre=[0 ; prec ; 0];
    for i=numel(mpre)-1:-1:1
        mpre(i)=max(mpre(i),mpre(i+1));
    end
    i=find(mrec(2:end)~=mrec(1:end-1))+1;
    ap=sum((mrec(i)-mrec(i-1)).*mpre(i));
    """
    rec.insert(0, 0.0)    # insert 0.0 at the beginning of the list
    rec.append(1.0)       # insert 1.0 at the end of the list
    mrec = rec[:]
    prec.insert(0, 0.0)   # insert 0.0 at the beginning of the list
    prec.append(0.0)      # insert 0.0 at the end of the list
    mpre = prec[:]
    # make the precision envelope monotonically non-increasing
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    # collect the indices where recall changes
    i_list = []
    for i in range(1, len(mrec)):
        if mrec[i] != mrec[i - 1]:
            i_list.append(i)
    # AP = sum of precision * recall-step over those indices
    ap = 0.0
    for i in i_list:
        ap += ((mrec[i] - mrec[i - 1]) * mpre[i])
    return ap, mrec, mpre
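To see the interpolation at work on a concrete input, here is a standalone NumPy restatement of the same computation (restated so the snippet runs on its own), fed a single precision/recall point (rec=0.5, prec=1.0):

```python
import numpy as np

# After voc_ap pads the lists, the toy input (rec=[0.5], prec=[1.0]) becomes:
mrec = np.array([0.0, 0.5, 1.0])
mpre = np.array([0.0, 1.0, 0.0])
# Make precision monotonically non-increasing (right-to-left running max).
mpre = np.maximum.accumulate(mpre[::-1])[::-1]
# Sum precision * recall-step wherever recall changes.
idx = np.where(mrec[1:] != mrec[:-1])[0] + 1
ap = float(np.sum((mrec[idx] - mrec[idx - 1]) * mpre[idx]))
print(ap)  # -> 0.5
```

Half the recall range is covered at precision 1.0 and the rest contributes nothing, so the AP is 0.5, matching what `voc_ap` returns for the same input.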
5.4.6 Prediction script
import os
import config
import random
import colorsys
import numpy as np
import tensorflow as tf
from model.yolo3_model import yolo

class yolo_predictor:
    def __init__(self, obj_threshold, nms_threshold, classes_file, anchors_file):
        """
        Introduction
        ------------
        Initialization.
        Parameters
        ----------
        obj_threshold: threshold for detecting an object
        nms_threshold: NMS threshold
        classes_file: path to the class-names file
        anchors_file: path to the anchors file
        """
        self.obj_threshold = obj_threshold
        self.nms_threshold = nms_threshold
        # file paths
        self.classes_path = classes_file
        self.anchors_path = anchors_file
        # read the class names
        self.class_names = self._get_class()
        # read the prior (anchor) boxes
        self.anchors = self._get_anchors()
        # colors for drawing the boxes
        hsv_tuples = [(x / len(self.class_names), 1., 1.) for x in range(len(self.class_names))]
        self.colors = list(map(lambda x: colorsys.hsv_to_rgb(*x), hsv_tuples))
        self.colors = list(map(lambda x: (int(x[0] * 255), int(x[1] * 255), int(x[2] * 255)), self.colors))
        random.seed(10101)
        random.shuffle(self.colors)
        random.seed(None)
    def _get_class(self):
        """
        Introduction
        ------------
        Read the class names.
        """
        classes_path = os.path.expanduser(self.classes_path)
        with open(classes_path) as f:
            class_names = f.readlines()
        class_names = [c.strip() for c in class_names]
        return class_names

    def _get_anchors(self):
        """
        Introduction
        ------------
        Read the anchor data.
        """
        anchors_path = os.path.expanduser(self.anchors_path)
        with open(anchors_path) as f:
            anchors = f.readline()
        anchors = [float(x) for x in anchors.split(',')]
        anchors = np.array(anchors).reshape(-1, 2)
        return anchors
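The anchors file holds a single comma-separated line of width/height pairs. Assuming it contains the standard YOLOv3 anchors (an assumption; your file may differ), `_get_anchors` parses it like this:

```python
import numpy as np

# One line of comma-separated values, reshaped into (N, 2) width/height pairs.
line = "10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326"
anchors = np.array([float(x) for x in line.split(',')]).reshape(-1, 2)
print(anchors.shape)  # -> (9, 2)
```

Nine anchors, three per feature layer, which is what the `anchor_mask` in `eval` below indexes into.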
    #---------------------------------------#
    #   Decode the three feature layers,
    #   then sort and apply non-max suppression
    #---------------------------------------#
    def boxes_and_scores(self, feats, anchors, classes_num, input_shape, image_shape):
        """
        Introduction
        ------------
        Convert the predicted box coordinates back to the original image,
        then compute the score of each box.
        Parameters
        ----------
        feats: feature map output by yolo
        anchors: anchor positions
        classes_num: number of classes
        input_shape: model input size
        image_shape: original image size
        Returns
        -------
        boxes: box positions
        boxes_scores: box scores, the product of confidence and class probability
        """
        # decode the features
        box_xy, box_wh, box_confidence, box_class_probs = self._get_feats(feats, anchors, classes_num, input_shape)
        # map the boxes back onto the original image
        boxes = self.correct_boxes(box_xy, box_wh, input_shape, image_shape)
        boxes = tf.reshape(boxes, [-1, 4])
        # score = box_confidence * box_class_probs
        box_scores = box_confidence * box_class_probs
        box_scores = tf.reshape(box_scores, [-1, classes_num])
        return boxes, box_scores
    # map the boxes back onto the original image
    def correct_boxes(self, box_xy, box_wh, input_shape, image_shape):
        """
        Introduction
        ------------
        Compute the positions of the predicted boxes in the original image.
        Parameters
        ----------
        box_xy: box center coordinates
        box_wh: box width and height
        input_shape: model input size
        image_shape: original image size
        Returns
        -------
        boxes: box positions
        """
        box_yx = box_xy[..., ::-1]
        box_hw = box_wh[..., ::-1]
        # 416, 416
        input_shape = tf.cast(input_shape, dtype=tf.float32)
        # actual image size
        image_shape = tf.cast(image_shape, dtype=tf.float32)
        new_shape = tf.round(image_shape * tf.reduce_min(input_shape / image_shape))
        offset = (input_shape - new_shape) / 2. / input_shape
        scale = input_shape / new_shape
        box_yx = (box_yx - offset) * scale
        box_hw *= scale
        box_mins = box_yx - (box_hw / 2.)
        box_maxes = box_yx + (box_hw / 2.)
        boxes = tf.concat([
            box_mins[..., 0:1],
            box_mins[..., 1:2],
            box_maxes[..., 0:1],
            box_maxes[..., 1:2]
        ], axis=-1)
        boxes *= tf.concat([image_shape, image_shape], axis=-1)
        return boxes
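`correct_boxes` is the inverse of the letterboxing done at input time. A NumPy check with hypothetical sizes (model input 416x416, original image 720x1280 in (h, w) order) shows the center of the padded canvas mapping back to the center of the original image:

```python
import numpy as np

input_shape = np.array([416.0, 416.0])
image_shape = np.array([720.0, 1280.0])
# Same offset/scale math as correct_boxes above.
new_shape = np.round(image_shape * np.min(input_shape / image_shape))
offset = (input_shape - new_shape) / 2.0 / input_shape
scale = input_shape / new_shape
# A box centred on the padded 416x416 canvas...
box_yx = np.array([0.5, 0.5])
mapped = (box_yx - offset) * scale * image_shape
print(mapped)  # -> [360. 640.], the centre of the 720x1280 image
```

Subtracting `offset` removes the gray padding band, multiplying by `scale` stretches the remaining range back to [0, 1], and the final multiply converts to pixel coordinates.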
    # this is really the decoding step
    def _get_feats(self, feats, anchors, num_classes, input_shape):
        """
        Introduction
        ------------
        Determine the bounding boxes from the last layer of yolo.
        Parameters
        ----------
        feats: output of the last yolo layer
        anchors: anchor positions
        num_classes: number of classes
        input_shape: model input size
        Returns
        -------
        box_xy, box_wh, box_confidence, box_class_probs
        """
        num_anchors = len(anchors)
        anchors_tensor = tf.reshape(tf.constant(anchors, dtype=tf.float32), [1, 1, 1, num_anchors, 2])
        grid_size = tf.shape(feats)[1:3]
        predictions = tf.reshape(feats, [-1, grid_size[0], grid_size[1], num_anchors, num_classes + 5])
        # build e.g. a 13x13x1x2 grid holding the coordinates of each cell
        grid_y = tf.tile(tf.reshape(tf.range(grid_size[0]), [-1, 1, 1, 1]), [1, grid_size[1], 1, 1])
        grid_x = tf.tile(tf.reshape(tf.range(grid_size[1]), [1, -1, 1, 1]), [grid_size[0], 1, 1, 1])
        grid = tf.concat([grid_x, grid_y], axis=-1)
        grid = tf.cast(grid, tf.float32)
        # normalize x, y relative to the grid
        box_xy = (tf.sigmoid(predictions[..., :2]) + grid) / tf.cast(grid_size[::-1], tf.float32)
        # normalize w, h as well
        box_wh = tf.exp(predictions[..., 2:4]) * anchors_tensor / tf.cast(input_shape[::-1], tf.float32)
        box_confidence = tf.sigmoid(predictions[..., 4:5])
        box_class_probs = tf.sigmoid(predictions[..., 5:])
        return box_xy, box_wh, box_confidence, box_class_probs
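The decode equations are `b_xy = (sigmoid(t_xy) + grid) / grid_size` and `b_wh = anchor * exp(t_wh) / input_size`. A single-cell NumPy sketch (hypothetical values: all raw outputs zero, cell (x=5, y=3) on a 13x13 grid, anchor (116, 90), 416x416 input) makes them concrete:

```python
import numpy as np

t = np.zeros(4)                   # raw network outputs t_x, t_y, t_w, t_h
grid_xy = np.array([5.0, 3.0])    # (x, y) index of the grid cell
grid_size = 13.0
anchor = np.array([116.0, 90.0])
input_size = 416.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

box_xy = (sigmoid(t[:2]) + grid_xy) / grid_size   # centre, fraction of input
box_wh = anchor * np.exp(t[2:]) / input_size      # size, fraction of input
print(box_xy * input_size, box_wh * input_size)   # centre (176, 112), size (116, 90)
```

With zero raw outputs, sigmoid gives 0.5, so the box sits exactly at the center of its cell (each cell is 416/13 = 32 pixels wide), and exp gives 1, so the box keeps the anchor's size.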
    def eval(self, yolo_outputs, image_shape, max_boxes=20):
        """
        Introduction
        ------------
        Apply non-max suppression to the yolo outputs to obtain the final
        detection boxes and classes.
        Parameters
        ----------
        yolo_outputs: yolo model outputs
        image_shape: original image size
        max_boxes: maximum number of boxes
        Returns
        -------
        boxes_: box positions
        scores_: class probabilities
        classes_: class ids
        """
        # each feature layer corresponds to three prior boxes
        anchor_mask = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
        boxes = []
        box_scores = []
        # input_shape is 416x416; image_shape is the actual image size
        input_shape = tf.shape(yolo_outputs[0])[1:3] * 32
        #---------------------------------------#
        #   Decode the three feature layers to get
        #   each box's coordinates and score,
        #   score = confidence x class probability
        #---------------------------------------#
        for i in range(len(yolo_outputs)):
            _boxes, _box_scores = self.boxes_and_scores(yolo_outputs[i], self.anchors[anchor_mask[i]], len(self.class_names), input_shape, image_shape)
            boxes.append(_boxes)
            box_scores.append(_box_scores)
        # flatten for easier handling
        boxes = tf.concat(boxes, axis=0)
        box_scores = tf.concat(box_scores, axis=0)
        mask = box_scores >= self.obj_threshold
        max_boxes_tensor = tf.constant(max_boxes, dtype=tf.int32)
        boxes_ = []
        scores_ = []
        classes_ = []
        #---------------------------------------#
        #   1. For each class, keep the boxes and
        #      scores above self.obj_threshold
        #   2. Apply non-max suppression
        #---------------------------------------#
        # handle each class separately
        for c in range(len(self.class_names)):
            # take all boxes of class c
            class_boxes = tf.boolean_mask(boxes, mask[:, c])
            # take all scores of class c
            class_box_scores = tf.boolean_mask(box_scores[:, c], mask[:, c])
            # non-max suppression
            nms_index = tf.image.non_max_suppression(class_boxes, class_box_scores, max_boxes_tensor, iou_threshold=self.nms_threshold)
            # gather the NMS results
            class_boxes = tf.gather(class_boxes, nms_index)
            class_box_scores = tf.gather(class_box_scores, nms_index)
            classes = tf.ones_like(class_box_scores, 'int32') * c
            boxes_.append(class_boxes)
            scores_.append(class_box_scores)
            classes_.append(classes)
        boxes_ = tf.concat(boxes_, axis=0)
        scores_ = tf.concat(scores_, axis=0)
        classes_ = tf.concat(classes_, axis=0)
        return boxes_, scores_, classes_
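`tf.image.non_max_suppression` does the heavy lifting above; a minimal NumPy sketch of the same greedy algorithm (not the TF implementation, just the idea) shows what happens to overlapping boxes of one class:

```python
import numpy as np

def nms(boxes, scores, iou_threshold):
    # boxes: (N, 4) as (y1, x1, y2, x2); returns kept indices, best score first.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # intersection of the best box with all remaining boxes
        yy1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        xx1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        yy2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        xx2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, yy2 - yy1) * np.maximum(0.0, xx2 - xx1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter)
        # drop every remaining box that overlaps the best one too much
        order = order[1:][iou <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores, 0.5))  # -> [0, 2]
```

Box 1 overlaps box 0 with IoU ≈ 0.68, above the 0.5 threshold, so it is suppressed; box 2 does not overlap at all and survives.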
    #---------------------------------------#
    #   predict runs in three steps:
    #   1. build the yolo object
    #   2. get the raw predictions
    #   3. post-process the predictions
    #---------------------------------------#
    def predict(self, inputs, image_shape):
        """
        Introduction
        ------------
        Build the prediction model.
        Parameters
        ----------
        inputs: preprocessed input image
        image_shape: original image size
        Returns
        -------
        boxes: box coordinates
        scores: object probabilities
        classes: object classes
        """
        model = yolo(config.norm_epsilon, config.norm_decay, self.anchors_path, self.classes_path, pre_train=False)
        # yolo_inference gets the network's raw predictions
        output = model.yolo_inference(inputs, config.num_anchors // 3, config.num_classes, training=False)
        boxes, scores, classes = self.eval(output, image_shape, max_boxes=20)
        return boxes, scores, classes
6. Extension: SSD
SSD is also a multi-feature-layer network. It has 11 blocks in total, and the first half of the structure is VGG16:
- First, features are extracted with several 3x3 convolutional layers and 5 max-pooling layers of stride 2, forming 5 blocks. The fourth block is used to detect small targets: after many convolutions the features of large targets are still well preserved, but small-target features fade away, so small targets have to be extracted from a relatively early layer.
- Next, a dilated (atrous) convolution is applied.
- The features of the seventh block (Block7) are read out.
- 1x1 and 3x3 convolutions then extract features in turn; the 3x3 convolution uses stride 2 to shrink the feature map, producing the features of the eighth block (Block8).
- Step 4 is repeated to obtain the features of blocks 9, 10, and 11.
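The stride-2 3x3 convolutions above shrink each feature map with the standard output-size formula, out = floor((in + 2p - k) / s) + 1. A quick sketch (the 19x19 input size is assumed from the common SSD300 layout, not stated in the text):

```python
def conv_out(size, kernel, stride, pad):
    # Standard convolution output-size formula.
    return (size + 2 * pad - kernel) // stride + 1

# A stride-2 3x3 convolution with padding 1 roughly halves the feature map,
# e.g. a 19x19 Block7 map becomes the 10x10 Block8 map.
print(conv_out(19, 3, 2, 1))  # -> 10
```

Repeating this halving is what gives SSD its pyramid of progressively coarser feature maps, each responsible for objects of a different scale.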