Object Detection with Deep Learning and OpenCV (Python)

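This post covers object detection with deep learning. The first part introduces Single Shot Detectors (SSDs) and MobileNets. Combined, the two enable faster, real-time object detection, especially on resource-constrained devices such as the Raspberry Pi and smartphones.

Next comes the dnn module in OpenCV, which is used to load a pre-trained object detection network. With it, an image can be passed through the deep network to obtain the bounding box (x, y)-coordinates of each object in the image. Finally, we apply a MobileNet SSD to some example images.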

Object detection with Single Shot Detectors

When it comes to deep learning-based object detection, there are three main methods:

Faster R-CNNs
You Only Look Once (YOLO)
Single Shot Detectors (SSDs)

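Faster R-CNNs are probably the best-known deep learning-based object detectors. However, the technique is hard to understand (especially for deep learning newcomers), hard to implement, and hard to train.

Moreover, even with the "Faster" variant of R-CNNs (the "R" refers to region proposals), the algorithm is still fairly slow, at roughly 7 FPS.

If speed is the priority, we can turn to YOLO, which is very fast: it reaches 40-90 FPS on a Titan X GPU, and the fastest version can reach 155 FPS. The problem with YOLO is that its accuracy leaves room for improvement.

SSDs, originally developed by Google, strike a balance between the two. The algorithm is more straightforward than Faster R-CNNs, and it is more accurate than YOLO.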

MobileNets: Efficient (deep) neural networks

20170425202801182.png

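In the figure above:
(Left) A standard convolutional layer with batchnorm and ReLU.
(Right) The convolutional layer split into depthwise and pointwise layers, followed by batchnorm and ReLU (figure and caption from Liu et al.)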

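When building an object detection network, we normally take an existing network architecture, such as VGG or ResNet, and use it inside the detection pipeline. The problem is that these architectures can be very large, on the order of 200-500MB.

Such architectures are a poor fit for resource-constrained devices because of their size and the amount of computation they require. As an alternative we can use MobileNets, from another paper by Google researchers. They are called "MobileNets" because they are designed for resource-constrained devices such as smartphones. MobileNets differ from traditional CNNs in their use of depthwise separable convolution.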

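Depthwise separable convolution splits a convolution into two stages:

1. A 3x3 depthwise convolution

2. Followed by a 1x1 pointwise convolution

This reduces the number of parameters in the network and lowers the amount of computation.

The trade-off is accuracy: MobileNets are generally not as accurate as larger networks.

But they are much more resource-efficient.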
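To make the parameter savings concrete, the sketch below counts the weights of a single layer both ways. The helper names and the 256-channel layer size are illustrative assumptions, not values taken from the MobileNets paper, and biases and batchnorm parameters are ignored.

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise conv plus a 1x1 pointwise conv (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 256 input channels, 256 output channels
standard = conv_params(3, 256, 256)
separable = depthwise_separable_params(3, 256, 256)
ratio = separable / standard

print(standard, separable, round(ratio, 3))
```

For a k x k kernel the ratio works out to 1/c_out + 1/k^2, so with 3x3 kernels the separable layer needs roughly one ninth of the weights once the channel count is large.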

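Combining MobileNets and SSDs for fast, efficient deep learning-based object detection

Combining the MobileNet architecture with the SSD framework gives a fast, efficient deep learning-based object detector. The model used here is a Caffe version of the original TensorFlow implementation, trained by chuanqi305.

The MobileNet SSD was first trained on the COCO dataset and then fine-tuned on PASCAL VOC, reaching 72.7% mAP (mean average precision). It can detect 20 object classes (plus 1 for the background class): airplanes, bicycles, birds, boats, bottles, buses, cars, cats, chairs, cows, dining tables, dogs, horses, motorbikes, people, potted plants, sheep, sofas, trains, and TV monitors.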

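Deep learning-based object detection with OpenCV

This section builds a deep learning object detector with OpenCV.

Start by creating a new file named "deep_learning_object_detection.py" and inserting the following code: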

# import the necessary packages
import numpy as np
import argparse
import cv2

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
    help="path to input image")
ap.add_argument("-p", "--prototxt", required=True,
    help="path to Caffe 'deploy' prototxt file")
ap.add_argument("-m", "--model", required=True,
    help="path to Caffe pre-trained model")
ap.add_argument("-c", "--confidence", type=float, default=0.4,
    help="minimum probability to filter weak detections")
args = vars(ap.parse_args())

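The first step is to import the packages this example needs. The dnn module ships inside cv2, provided you are using OpenCV 3.3.

Next we parse the command line arguments:

--image: path to the input image
--prototxt: path to the Caffe prototxt file
--model: path to the pre-trained model
--confidence: minimum probability threshold for filtering out weak detections; the default in the code above is 40% (0.4).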
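When experimenting in a notebook or a test, it can be handy to exercise the parser without a real command line. The sketch below is a cut-down version of the parser (only two of the four arguments) and passes an explicit list to parse_args; the image path is a placeholder, not a file from this post.

```python
import argparse

ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
    help="path to input image")
ap.add_argument("-c", "--confidence", type=float, default=0.4,
    help="minimum probability to filter weak detections")

# Passing a list bypasses sys.argv; "images/example.jpg" is a placeholder path
args = vars(ap.parse_args(["--image", "images/example.jpg"]))

# The short flags work the same way and override the default threshold
strict = vars(ap.parse_args(["-i", "images/example.jpg", "-c", "0.8"]))

print(args["confidence"], strict["confidence"])
```

Because vars() turns the parsed namespace into a dictionary, the rest of the script can read values as args["image"] and args["confidence"].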

Next, initialize the class labels and the bounding box colors:

CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
 "bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
 "dog", "horse", "motorbike", "person", "pottedplant", "sheep",
 "sofa", "train", "tvmonitor"]
COLORS = np.random.uniform(0, 255, size=(len(CLASSES), 3))

This creates a list named "CLASSES", followed by a COLORS array that holds one bounding box color per class. Next, load the model:

# load our serialized model from disk
print("[INFO] loading model...")
net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])

The code above loads the model and prints a status message.
Next, we load the query image and prepare a blob to pass through the network.

# load the input image and construct an input blob for the image
# by resizing to a fixed 300x300 pixels and then normalizing it
# (note: normalization is done via the authors of the MobileNet SSD
# implementation)

image = cv2.imread(args["image"])
(h, w) = image.shape[:2]
blob = cv2.dnn.blobFromImage(cv2.resize(image, (300, 300)), 0.007843, (300, 300), 127.5)

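As the comment block above notes, we load the image, extract its height and width, and compute a blob from a 300x300-pixel version of the image.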
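The values 0.007843 and 127.5 are the scale factor and the mean passed to cv2.dnn.blobFromImage, which subtracts the mean and then multiplies by the scale factor. The pure-Python sketch below shows the effect on individual pixel values. normalize_pixel is an illustrative helper rather than an OpenCV function, and the sketch assumes the same mean is applied to every channel.

```python
SCALE = 0.007843   # approximately 1 / 127.5
MEAN = 127.5


def normalize_pixel(value, mean=MEAN, scale=SCALE):
    """Subtract the mean, then multiply by the scale factor."""
    return (value - mean) * scale


lo = normalize_pixel(0)        # darkest 8-bit value, lands near -1
mid = normalize_pixel(127.5)   # the mean itself maps to 0
hi = normalize_pixel(255)      # brightest 8-bit value, lands near +1

print(lo, mid, hi)
```

In other words, the 8-bit range [0, 255] is mapped to roughly [-1, 1], which is the input range this particular model was trained with according to the code comment above.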

Now for the key step: passing this blob through the neural network.

# pass the blob through the network and obtain the detections and
# predictions
print("[INFO] computing object detections...")
net.setInput(blob)
detections = net.forward()

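Here we set the blob as the network input, run a forward pass, and store the result as "detections". Computing the forward pass and the associated detections can take some time depending on the model and input size, but for this example it runs quickly on most CPUs.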
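The loop that follows indexes detections[0, 0, i, 1], detections[0, 0, i, 2] and detections[0, 0, i, 3:7], which implies one row of seven values per detection. The sketch below mimics that layout with a plain list of rows so the filtering logic can be tried without a model. The rows are made-up values, and treating the first column as an image index is an assumption, since the post never reads it.

```python
# Hypothetical stand-in for detections[0, 0]: one row per detection, laid out as
# [image_id, class_id, confidence, x1, y1, x2, y2], with box corners in [0, 1].
rows = [
    [0, 7, 0.99, 0.10, 0.20, 0.50, 0.60],
    [0, 15, 0.15, 0.55, 0.10, 0.90, 0.95],
    [0, 12, 0.62, 0.05, 0.05, 0.30, 0.40],
]


def keep_confident(rows, min_confidence):
    """Return (class_id, confidence, box) for rows above the threshold."""
    kept = []
    for row in rows:
        confidence = row[2]
        if confidence > min_confidence:
            kept.append((int(row[1]), confidence, row[3:7]))
    return kept


kept = keep_confident(rows, 0.4)
for class_id, confidence, box in kept:
    print(class_id, confidence, box)
```

With a threshold of 0.4 the second row is dropped, which is the same decision the `if confidence > args["confidence"]` check makes in the real loop.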

We loop over "detections" to determine which objects are located where in the image:


# loop over the detections
for i in np.arange(0, detections.shape[2]):
    # extract the confidence (i.e., probability) associated with the
    # prediction
    confidence = detections[0, 0, i, 2]

    # filter out weak detections by ensuring the `confidence` is
    # greater than the minimum confidence
    if confidence > args["confidence"]:
        # extract the index of the class label from the `detections`,
        # then compute the (x, y)-coordinates of the bounding box for
        # the object
        idx = int(detections[0, 0, i, 1])
        box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
        (startX, startY, endX, endY) = box.astype("int")

        # display the prediction
        label = "{}: {:.2f}%".format(CLASSES[idx], confidence * 100)
        print("[INFO] {}".format(label))
        cv2.rectangle(image, (startX, startY), (endX, endY),
            COLORS[idx], 2)
        y = startY - 15 if startY - 15 > 15 else startY + 15
        cv2.putText(image, label, (startX, y),
            cv2.FONT_HERSHEY_SIMPLEX, 0.5, COLORS[idx], 2)

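We start by looping over the detections, keeping in mind that a single image can contain multiple objects. We also apply a confidence check to each detection. If the confidence is high enough (i.e., above the threshold), we print the prediction to the terminal and draw a colored bounding box and text on the image. Step by step:

Looping over the detections, we first extract the confidence value.

If the confidence is above the minimum threshold, we extract the class label index and compute the bounding box around the detected object.

We then extract the (x, y)-coordinates of the box, which are used to draw the rectangle and display the text.

Next, we build a text label containing the CLASS name and the confidence.

We print the label to the terminal and draw a colored rectangle around the object using the (x, y)-coordinates.

In general we want the label above the rectangle; if there is not enough room, it is displayed just below the top edge of the rectangle instead.

Finally, we overlay the colored label text on the image.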
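The two pieces of arithmetic in that walkthrough, scaling the normalized box to pixels and choosing the label position, can be isolated as small functions. The sketch below uses plain Python instead of NumPy; scale_box and label_y are illustrative names, and the 400x200 image size is made up.

```python
def scale_box(box, w, h):
    """Scale a normalized (x1, y1, x2, y2) box to integer pixel coordinates."""
    x1, y1, x2, y2 = box
    # int() truncates toward zero, like box.astype("int") in the loop
    return (int(x1 * w), int(y1 * h), int(x2 * w), int(y2 * h))


def label_y(start_y, offset=15):
    """Put the label above the box, or just inside it when too close to the top."""
    return start_y - offset if start_y - offset > offset else start_y + offset


# Hypothetical 400x200 image and a box covering its middle
box_px = scale_box((0.25, 0.5, 0.75, 0.875), w=400, h=200)

print(box_px, label_y(box_px[1]), label_y(20))
```

A box whose top edge is at y = 100 gets its label at y = 85, above the rectangle, while a box starting at y = 20 gets its label at y = 35, just inside the top edge, so the text is never drawn off the top of the image.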

The last step is to display the result:

# show the output image
cv2.imshow("Output", image)
cv2.imwrite("output.jpg",image)
cv2.waitKey(0)

We display the output image on screen until the user presses any key, and we also save the annotated image to disk.

Object detection results with OpenCV and deep learning

To run the code above, open a terminal and execute:

$ python deep_learning_object_detection.py \
 --prototxt MobileNetSSD_deploy.prototxt.txt \
 --model MobileNetSSD_deploy.caffemodel --image images/example_19.jpg

Result

output-19.jpg

[INFO] loading model...
[INFO] computing object detections...
[INFO] car: 99.71%
[INFO] car: 98.40%

In the image above, two cars are detected, with confidences of 99.71% and 98.40%.

Another example:

Run


$ python deep_learning_object_detection.py \
 --prototxt MobileNetSSD_deploy.prototxt.txt \
 --model MobileNetSSD_deploy.caffemodel --image images/example_18.jpg

Result

output-18.jpg

[INFO] loading model...
[INFO] computing object detections...
[INFO] chair: 99.96%
[INFO] chair: 99.78%
[INFO] chair: 58.37%
[INFO] diningtable: 99.90%
[INFO] pottedplant: 51.57%

One more example:

Run

$ python deep_learning_object_detection.py \
 --prototxt MobileNetSSD_deploy.prototxt.txt \
 --model MobileNetSSD_deploy.caffemodel --image images/example_16.jpg

Result

output-16.jpg

[INFO] loading model...
[INFO] computing object detections...
[INFO] bicycle: 69.82%
[INFO] bicycle: 66.97%
[INFO] car: 99.99%
[INFO] person: 99.98%
[INFO] person: 51.63%

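Summary

In this post we used a MobileNet + SSD detector together with the dnn module new in OpenCV 3.3 to detect objects in images. A follow-up post will cover object detection in video streams.

Reference: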
https://www.pyimagesearch.com/2017/09/11/object-detection-with-deep-learning-and-opencv/