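OpenCV in Practice (33): Where OpenCV Meets Deep Learning

0. Introduction

Deep learning is a subfield of machine learning built on neural networks, including convolutional neural networks. In areas such as speech recognition, text recognition, and image classification it can reach accuracy close to, or even beyond, human level. OpenCV ships a deep learning module as one of its core modules, and it can use both the CPU and the GPU to improve performance.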

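1. Deep learning and convolutional neural networks

Machine learning algorithms perform remarkably well on real-world problems, which has opened up new approaches for many applications. Deep learning is based on neural network theory, and its rapid progress has two main causes. First, the available computing power makes it possible to deploy large neural networks that can solve challenging problems: the earliest neural network (the perceptron) had a single layer and only a few weights to tune, whereas today's networks can have hundreds of layers and tens of millions of parameters to optimize (hence the name deep networks). Second, massive amounts of data make it possible to train such networks: to perform well, a deep network needs thousands or even millions of labeled samples, because the number of parameters to optimize is so large.
One of the most important branches of deep networks is the convolutional neural network (CNN). It is built on the convolution operation, and the parameters to learn are the values in all the filter kernels that make up the network. The filters are organized into layers: early layers extract basic shapes such as lines and corners, while later layers progressively detect more complex patterns such as eyes, mouths, and hair.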
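To make the convolution operation concrete, here is a minimal sketch in plain C++ (no OpenCV) that slides one 3x3 filter kernel over a small grayscale image. The names Image, Kernel3x3, and convolve3x3 are hypothetical helpers introduced for illustration. A real CNN layer applies many such kernels over many channels, and learns the kernel values during training instead of fixing them by hand.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper types for this sketch: a grayscale image stored
// as rows of floats, and a single 3x3 filter kernel.
using Image = std::vector<std::vector<float>>;
using Kernel3x3 = std::array<std::array<float, 3>, 3>;

// Slide a 3x3 kernel over the image without padding ("valid" mode).
// As in most deep learning frameworks, the kernel is not flipped,
// so strictly speaking this computes a cross-correlation.
Image convolve3x3(const Image& image, const Kernel3x3& kernel) {
    assert(image.size() >= 3 && image[0].size() >= 3);
    const std::size_t rows = image.size() - 2;
    const std::size_t cols = image[0].size() - 2;
    Image output(rows, std::vector<float>(cols, 0.0f));
    for (std::size_t y = 0; y < rows; ++y) {
        for (std::size_t x = 0; x < cols; ++x) {
            float sum = 0.0f;
            for (std::size_t ky = 0; ky < 3; ++ky) {
                for (std::size_t kx = 0; kx < 3; ++kx) {
                    sum += image[y + ky][x + kx] * kernel[ky][kx];
                }
            }
            output[y][x] = sum;
        }
    }
    return output;
}
```

With a kernel whose columns are (-1, 0, 1), the output is large wherever the image brightens from left to right, which is the kind of simple edge response that early CNN layers tend to learn.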
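OpenCV includes a deep neural network module, which is mainly used to import deep networks trained with other machine learning libraries (such as TensorFlow, Caffe, or Torch).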

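2. Face detection with deep learning

In this section we learn how to perform face detection in OpenCV with a pretrained deep learning model. We need to download a pretrained face detection model, import it with OpenCV, and understand how to convert an input image or video frame into the structure the model expects.
Using a deep learning model in OpenCV is straightforward: we only need to load the pretrained model files and understand their basic configuration. The rest of this section uses face detection as the example.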

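2.1 Introduction to SSD

This section uses the Single-Shot Detector (SSD) DNN algorithm to detect faces in an image. SSD predicts bounding boxes and classes simultaneously in a single pass over the image. The SSD DNN is structured as follows:

The input image has a size of 300x300
The input image passes through several convolutional layers, producing feature maps at different scales
For each feature map, a 3x3 convolution filter evaluates a set of default bounding boxes
For each default bounding box, the network predicts bounding box offsets and class probabilities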
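The last two steps can be made concrete with a small sketch of how predicted offsets are applied to one default box. It assumes the center-size encoding commonly used by SSD implementations. The names CenterBox, Offsets, CornerBox, and decodeBox, as well as the scaling constants 0.1 and 0.2, are illustrative assumptions and are not taken from the face detection model used in this article.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical types for this sketch. All coordinates are normalized to [0, 1].
struct CenterBox { float cx, cy, w, h; };           // default box: center and size
struct Offsets   { float dx, dy, dw, dh; };         // offsets predicted by the network
struct CornerBox { float xMin, yMin, xMax, yMax; }; // decoded box: two corners

// Apply predicted offsets to a default box (center-size encoding).
// varCenter and varSize are assumed scaling constants; a real model
// defines its own values in its configuration.
CornerBox decodeBox(const CenterBox& prior, const Offsets& o,
                    float varCenter = 0.1f, float varSize = 0.2f) {
    assert(prior.w > 0.0f && prior.h > 0.0f);
    // Shift the center proportionally to the default box size.
    const float cx = prior.cx + o.dx * varCenter * prior.w;
    const float cy = prior.cy + o.dy * varCenter * prior.h;
    // Scale width and height exponentially so they stay positive.
    const float w = prior.w * std::exp(o.dw * varSize);
    const float h = prior.h * std::exp(o.dh * varSize);
    return CornerBox{cx - w / 2.0f, cy - h / 2.0f, cx + w / 2.0f, cy + h / 2.0f};
}
```

With all offsets equal to zero the decoded box is the default box itself, so the network only has to learn small corrections relative to each default box.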

The model architecture is shown in the following figure:

[Figure: SSD network architecture]
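SSD is a DNN algorithm that can classify multiple classes, and a modified version of the network can be used for face detection. In OpenCV, the most important functions for defining and using a DNN model are blobFromImage, readNetFrom[type], setInput, and forward.
The blobFromImage function converts an input image into a blob. It is called as follows: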

blobFromImage(image, scaleFactor, size, mean, swapRB, crop);

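The parameters of blobFromImage have the following meaning:

image: the input image
scaleFactor: the multiplier applied to the image values
size: the spatial size of the output image
mean: the scalar with the mean values subtracted from the channels; if the image is in BGR order and swapRB is true, the values should be given in (mean-R, mean-G, mean-B) order
swapRB: flag indicating whether the first and last channels of a 3-channel image should be swapped
crop: flag indicating whether the image is cropped after resizing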
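The following sketch illustrates, in plain C++ without OpenCV, roughly what the scaleFactor, mean, and swapRB parameters do to the pixel values. It is not OpenCV's implementation: Image8U and toBlob are hypothetical names, resizing and cropping are left out, and the image is assumed to already have the target size. The output uses the planar channel-first layout (all values of channel 0, then channel 1, then channel 2) for a batch of one image.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical image type for this sketch: interleaved 8-bit pixels,
// 3 channels per pixel in B, G, R order, stored row by row.
struct Image8U {
    std::size_t width;
    std::size_t height;
    std::vector<std::uint8_t> data; // size = width * height * 3
};

// Convert an interleaved image into a planar float blob.
// The mean is subtracted first, then the result is multiplied by scaleFactor.
// With swapRB == true the output channels are R, G, B and mean is read
// as (mean-R, mean-G, mean-B); otherwise both stay in B, G, R order.
std::vector<float> toBlob(const Image8U& img, float scaleFactor,
                          const std::array<float, 3>& mean, bool swapRB) {
    assert(img.data.size() == img.width * img.height * 3);
    const std::size_t plane = img.width * img.height;
    std::vector<float> blob(3 * plane);
    for (std::size_t c = 0; c < 3; ++c) {
        // Source channel that feeds output channel c.
        const std::size_t src = swapRB ? 2 - c : c;
        for (std::size_t i = 0; i < plane; ++i) {
            const float value = static_cast<float>(img.data[i * 3 + src]);
            blob[c * plane + i] = (value - mean[c]) * scaleFactor;
        }
    }
    return blob;
}
```

For a single BGR pixel (10, 20, 30) with mean (1, 2, 3) and scaleFactor 2, the blob is (18, 36, 54) without swapping and (58, 36, 14) with swapRB enabled, which shows why the mean values must follow the channel order selected by swapRB.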

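To load a model, we can use the readNetFrom[type] importers, which cover models trained with the following machine learning libraries:

Caffe (readNetFromCaffe)
TensorFlow (readNetFromTensorflow)
PyTorch (usually exported to ONNX first and loaded with readNetFromONNX)
Keras (usually converted to a TensorFlow or ONNX model first)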

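After importing the deep learning model and creating the input blob, we pass the blob to the network with the setInput function of the Net class. Its first argument is the input blob and its second argument is the name of the input layer (which must be given if the network has several input layers). Finally, we call forward, which runs a forward pass for the input blob and returns the prediction as a cv::Mat.
In the face detection algorithm, the returned cv::Mat is a 4-dimensional blob: detection.size[2] is the number of detected objects, and detection.size[3] is the number of values stored for each detection (bounding box data and confidence). Each detection is laid out as follows:

Column 0: index of the image in the batch
Column 1: class ID of the detected object
Column 2: confidence of the detected face
Column 3: X coordinate of the top-left corner of the bounding box
Column 4: Y coordinate of the top-left corner of the bounding box
Column 5: X coordinate of the bottom-right corner of the bounding box
Column 6: Y coordinate of the bottom-right corner of the bounding box

The bounding box coordinates are normalized to the image size (values between 0 and 1), so to draw the rectangle we multiply the X coordinates by the image width and the Y coordinates by the image height. The code in section 2.2 stores columns 3 to 6 in variables named xLeftBottom, yLeftBottom, xRightTop, and yRightTop; since the image origin is the top-left corner, these hold the minimum and maximum corners of the box.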
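As a sketch of how this layout is consumed, the following plain C++ function (no OpenCV) walks over a flat array of detections with 7 values each, keeps those above a confidence threshold, and scales the normalized corners to pixel coordinates. FaceBox and parseDetections are hypothetical names used only for illustration; the OpenCV version of the same logic appears in section 2.2.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch: one box in pixel coordinates.
struct FaceBox {
    int left, top, right, bottom;
    float confidence;
};

// Parse detections stored as consecutive rows of 7 floats:
// [imageId, classId, confidence, xMin, yMin, xMax, yMax],
// where the four coordinates are normalized to the range [0, 1].
std::vector<FaceBox> parseDetections(const std::vector<float>& data,
                                     int frameWidth, int frameHeight,
                                     float threshold) {
    const std::size_t stride = 7;
    assert(data.size() % stride == 0);
    std::vector<FaceBox> faces;
    for (std::size_t i = 0; i < data.size(); i += stride) {
        const float confidence = data[i + 2];
        if (confidence <= threshold) {
            continue; // skip weak detections
        }
        // X values scale with the frame width, Y values with the height.
        faces.push_back(FaceBox{
            static_cast<int>(data[i + 3] * frameWidth),
            static_cast<int>(data[i + 4] * frameHeight),
            static_cast<int>(data[i + 5] * frameWidth),
            static_cast<int>(data[i + 6] * frameHeight),
            confidence});
    }
    return faces;
}
```

For a 200x100 frame, a detection with corners (0.25, 0.25) and (0.75, 0.5) maps to the pixel box from (50, 25) to (150, 50).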

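2.2 Face detection with SSD

(1) Download the face detector model and save it in the data folder. Two files are needed: deploy.prototxt, which defines the network structure, and res10_300x300_ssd_iter_140000.caffemodel, which contains the network weights. The code below refers to both files by name only, so either run the program from that folder or adjust the paths.

(2) To use the pretrained deep neural network (DNN), create the file face_detection.cpp and include the required headers: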

#include <opencv2/dnn.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <iostream>
#include <sstream> // std::stringstream, used to format the confidence label

using namespace cv;
using namespace std;
using namespace cv::dnn;

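(3) Declare the global variables used by the DNN algorithm. They define the network input size, the preprocessing parameters, and the names of the files to load: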

float confidenceThreshold = 0.5;
String modelConfiguration = "deploy.prototxt";
String modelBinary = "res10_300x300_ssd_iter_140000.caffemodel";
const size_t inWidth = 300;
const size_t inHeight = 300;
const double inScaleFactor = 1.0;
const Scalar meanVal(104.0, 177.0, 123.0);

(4) Create the main function and load the model into an OpenCV dnn::Net object:

int main(int argc, char **argv) {
    dnn::Net net = readNetFromCaffe(modelConfiguration, modelBinary);

(5) Call the empty() function to check whether the DNN was loaded correctly:

    if (net.empty()) {
        cerr << "Can't load network by using the following files: " << endl;
        cerr << "prototxt: " << modelConfiguration << endl;
        cerr << "caffemodel: " << modelBinary << endl;
        cerr << "Models are available here:" << endl;
        cerr << "<OPENCV_SRC_DIR>/samples/dnn/face_detector" << endl;
        cerr << "or here:" << endl;
        cerr << "https://github.com/opencv/opencv/tree/master/samples/dnn/face_detector" << endl;
        exit(-1);
    }

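(6) If the DNN was loaded correctly, we can start capturing frames. The number of command-line arguments determines whether we open the default camera or the image or video file given as an argument:

    VideoCapture cap;
    if (argc == 1) {
        // No argument: use the default camera
        cap = VideoCapture(0);
        if (!cap.isOpened()) {
            cout << "Couldn't find default camera" << endl;
            return -1;
        }
    } else {
        // Otherwise open the file passed as the first argument
        cap.open(argv[1]);
        if (!cap.isOpened()) {
            cout << "Couldn't open image or video: " << argv[1] << endl;
            return -1;
        }
    }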

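(7) If the video capture object was opened correctly, we can start the main loop that grabs each video frame:

    for (;;)
    {
        Mat frame;
        cap >> frame; // grab a new frame
        if (frame.empty()) {
            // No more frames: wait for a key press, then leave the loop
            waitKey();
            break;
        }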

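(8) Process the image with the DNN algorithm. To prepare the image for the network, we use the blobFromImage function to convert the OpenCV Mat into the blob structure the DNN expects. OpenCV also stores blobs in the cv::Mat class: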

        //! [Prepare blob]
        Mat inputBlob = blobFromImage(frame, inScaleFactor,
                                      Size(inWidth, inHeight), meanVal, false, false); //Convert Mat to batch of images

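(9) Once the video frame has been converted to a blob, we feed it to the DNN and run detection with the forward function:

        //! [Set input blob]
        net.setInput(inputBlob, "data"); // set the network input
        //! [Make forward pass]
        Mat detection = net.forward("detection_out"); // compute the output
        // View the 4D output as a 2D matrix: one row per detection, 7 columns
        Mat detectionMat(detection.size[2], detection.size[3], CV_32F, detection.ptr<float>());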

(10) Draw a rectangle around each face detected in the image and label it with its confidence:

        for (int i = 0; i < detectionMat.rows; i++)
        {
            float confidence = detectionMat.at<float>(i, 2);
            if (confidence > confidenceThreshold)
            {
                // Columns 3-6 hold normalized corners; scale them to pixels
                int xLeftBottom = static_cast<int>(detectionMat.at<float>(i, 3) * frame.cols);
                int yLeftBottom = static_cast<int>(detectionMat.at<float>(i, 4) * frame.rows);
                int xRightTop = static_cast<int>(detectionMat.at<float>(i, 5) * frame.cols);
                int yRightTop = static_cast<int>(detectionMat.at<float>(i, 6) * frame.rows);
                Rect object(xLeftBottom, yLeftBottom,
                            xRightTop - xLeftBottom,
                            yRightTop - yLeftBottom);
                rectangle(frame, object, Scalar(0, 255, 0));
                // Build the text label from the confidence value
                stringstream ss;
                ss << confidence;
                String conf(ss.str());
                String label = "Face: " + conf;
                int baseLine = 0;
                Size labelSize = getTextSize(label, FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);
                // Draw a white background box, then the label on top of it
                rectangle(frame, Rect(Point(xLeftBottom, yLeftBottom - labelSize.height),
                                      Size(labelSize.width, labelSize.height + baseLine)),
                          Scalar(255, 255, 255), FILLED);
                putText(frame, label, Point(xLeftBottom, yLeftBottom),
                        FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 0, 0));
            }
        }

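Finally, each processed frame is shown with imshow, and the loop ends when a key is pressed (see the complete listing in section 3). Running the code produces a detection result like the following:

[Figure: face detection result]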

3. Complete code

The complete face_detection.cpp is shown below:

#include <opencv2/dnn.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <iostream>
#include <sstream> // std::stringstream, used to format the confidence label

using namespace cv;
using namespace std;
using namespace cv::dnn;

float confidenceThreshold = 0.5;
String modelConfiguration = "deploy.prototxt";
String modelBinary = "res10_300x300_ssd_iter_140000.caffemodel";
const size_t inWidth = 300;
const size_t inHeight = 300;
const double inScaleFactor = 1.0;
const Scalar meanVal(104.0, 177.0, 123.0);

int main(int argc, char **argv) {
    dnn::Net net = readNetFromCaffe(modelConfiguration, modelBinary);
    if (net.empty()) {
        cerr << "Can't load network by using the following files: " << endl;
        cerr << "prototxt: " << modelConfiguration << endl;
        cerr << "caffemodel: " << modelBinary << endl;
        cerr << "Models are available here:" << endl;
        cerr << "<OPENCV_SRC_DIR>/samples/dnn/face_detector" << endl;
        cerr << "or here:" << endl;
        cerr << "https://github.com/opencv/opencv/tree/master/samples/dnn/face_detector" << endl;
        exit(-1);
    }
    VideoCapture cap;
    if (argc == 1) {
        // No argument: use the default camera
        cap = VideoCapture(0);
        if (!cap.isOpened()) {
            cout << "Couldn't find default camera" << endl;
            return -1;
        }
    } else {
        // Otherwise open the file passed as the first argument
        cap.open(argv[1]);
        if (!cap.isOpened()) {
            cout << "Couldn't open image or video: " << argv[1] << endl;
            return -1;
        }
    }
    for (;;)
    {
        Mat frame;
        cap >> frame; // grab a new frame
        if (frame.empty()) {
            // No more frames: wait for a key press, then leave the loop
            waitKey();
            break;
        }
        //! [Prepare blob]
        Mat inputBlob = blobFromImage(frame, inScaleFactor,
                                      Size(inWidth, inHeight), meanVal, false, false); // Convert Mat to batch of images

        //! [Set input blob]
        net.setInput(inputBlob, "data"); // set the network input
        //! [Make forward pass]
        Mat detection = net.forward("detection_out"); // compute the output
        // View the 4D output as a 2D matrix: one row per detection, 7 columns
        Mat detectionMat(detection.size[2], detection.size[3], CV_32F, detection.ptr<float>());
        for (int i = 0; i < detectionMat.rows; i++)
        {
            float confidence = detectionMat.at<float>(i, 2);
            if (confidence > confidenceThreshold)
            {
                // Columns 3-6 hold normalized corners; scale them to pixels
                int xLeftBottom = static_cast<int>(detectionMat.at<float>(i, 3) * frame.cols);
                int yLeftBottom = static_cast<int>(detectionMat.at<float>(i, 4) * frame.rows);
                int xRightTop = static_cast<int>(detectionMat.at<float>(i, 5) * frame.cols);
                int yRightTop = static_cast<int>(detectionMat.at<float>(i, 6) * frame.rows);
                Rect object(xLeftBottom, yLeftBottom,
                            xRightTop - xLeftBottom,
                            yRightTop - yLeftBottom);
                rectangle(frame, object, Scalar(0, 255, 0));
                // Build the text label from the confidence value
                stringstream ss;
                ss << confidence;
                String conf(ss.str());
                String label = "Face: " + conf;
                int baseLine = 0;
                Size labelSize = getTextSize(label, FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);
                // Draw a white background box, then the label on top of it
                rectangle(frame, Rect(Point(xLeftBottom, yLeftBottom - labelSize.height),
                                      Size(labelSize.width, labelSize.height + baseLine)),
                          Scalar(255, 255, 255), FILLED);
                putText(frame, label, Point(xLeftBottom, yLeftBottom),
                        FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 0, 0));
            }
        }
        imshow("detections", frame);
        if (waitKey(1) >= 0) break; // stop on any key press
    }
    return 0;
}

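Summary

In this article we saw how to build a network input blob in OpenCV with cv::dnn::blobFromImage(), how to load a pretrained Caffe model with readNetFromCaffe(), and how to run it with setInput() and forward(). We then applied a popular deep learning architecture, SSD, to a detection task (face detection) as a practical OpenCV computer vision project.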

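Series links

OpenCV in Practice (1): OpenCV and Image Processing Basics
OpenCV in Practice (2): OpenCV Core Data Structures
OpenCV in Practice (3): Image Regions of Interest
OpenCV in Practice (4): Pixel Operations
OpenCV in Practice (5): Image Arithmetic in Detail
OpenCV in Practice (6): The Strategy Design Pattern in OpenCV
OpenCV in Practice (7): OpenCV Color Space Conversion
OpenCV in Practice (8): Histograms in Detail
OpenCV in Practice (9): Detecting Image Content with Histogram Back Projection
OpenCV in Practice (10): Integral Images in Detail
OpenCV in Practice (11): Morphological Transformations in Detail
OpenCV in Practice (12): Image Filtering in Detail
OpenCV in Practice (13): High-Pass Filters and Their Applications
OpenCV in Practice (14): Extracting Lines from Images
OpenCV in Practice (15): Contour Detection in Detail
OpenCV in Practice (16): Corner Detection in Detail
OpenCV in Practice (17): FAST Feature Point Detection
OpenCV in Practice (18): Feature Matching
OpenCV in Practice (19): Feature Descriptors
OpenCV in Practice (20): Image Projective Relations
OpenCV in Practice (21): Matching Images with Random Sample Consensus
OpenCV in Practice (22): Homography and Its Applications
OpenCV in Practice (23): Camera Calibration
OpenCV in Practice (24): Camera Pose Estimation
OpenCV in Practice (25): 3D Scene Reconstruction
OpenCV in Practice (26): Processing Video Sequences
OpenCV in Practice (27): Tracking Feature Points in Video
OpenCV in Practice (28): Optical Flow Estimation
OpenCV in Practice (29): Tracking Objects in Video
OpenCV in Practice (30): Where OpenCV Meets Machine Learning
OpenCV in Practice (31): Object Detection with Cascaded Haar Features
OpenCV in Practice (32): Object Detection with SVM and Histograms of Oriented Gradients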