Deep Learning based Object Detection using YOLOv3 with OpenCV

A translation of: Deep Learning based Object Detection using YOLOv3 with OpenCV (Python / C++)

In this post, we learn how to use YOLOv3, a state-of-the-art object detector, with OpenCV.

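YOLOv3 is the latest variant of the popular object detection algorithm YOLO (You Only Look Once). The published model recognizes 80 different object classes in images and videos and, more importantly, it runs in real time with accuracy close to that of the Single Shot MultiBox Detector (SSD).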

Starting with OpenCV 3.4.2, YOLOv3 models can easily be used in an OpenCV application.

How does YOLO work?

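      We can think of object detection as a combination of object localization and object recognition.
      Traditional computer vision approaches used a sliding window to look for objects at different locations and scales. Because this is an expensive operation, the aspect ratio of the object was usually assumed to be fixed.
      Early deep learning based detectors such as R-CNN and Fast R-CNN used a method called Selective Search to narrow down the number of bounding boxes that had to be tested.
      Another approach, called Overfeat, scanned the image at multiple scales using sliding windows computed convolutionally.
      Then came Faster R-CNN, which uses a Region Proposal Network (RPN) to identify the bounding boxes that need to be tested. By clever design, the features extracted for recognizing objects are also used by the RPN to propose potential bounding boxes, which saves a lot of computation.
      YOLO, on the other hand, takes a completely different approach to object detection. It forwards the whole image through the neural network only once. SSD is another algorithm that forwards the image through the network only once, but YOLOv3 reaches comparable accuracy while running faster. On GPUs such as the M40, TitanX or 1080 Ti, YOLOv3 runs in real time or better.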
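Let us see how YOLO detects the objects in an image.
    First, it divides the image into a 13x13 grid of cells. The size of these 169 cells depends on the size of the input. For a 416x416 input, each cell is 32x32 pixels. Each cell is then responsible for predicting a number of bounding boxes in the image.
    For each bounding box, the network predicts the confidence that the box actually encloses an object, and the probability that the enclosed object belongs to a particular class.
    Most of these boxes are eliminated, either because their confidence is low or because they enclose the same object as another box with a higher confidence. The latter step is called non-maximum suppression.
    The authors of YOLOv3, Joseph Redmon and Ali Farhadi, made YOLOv3 more accurate and faster than their previous work, YOLOv2. YOLOv3 handles multiple scales better. They also improved the network by making it bigger and by moving it toward residual networks through added shortcut connections.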
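
The grid arithmetic described above can be checked with a few lines of Python. This is only a sketch: the stride of 32 is taken from the 416x416 input and 13x13 grid in the text, the helper name grid_shape is made up for illustration, and only this coarse grid is considered.

```python
# Hypothetical helper: size of the coarse YOLO grid for a given input size.
STRIDE = 32  # assumption taken from the text: 416 / 13 = 32 pixels per cell


def grid_shape(inp_width, inp_height, stride=STRIDE):
    """Return (columns, rows) of the grid for an input of the given size."""
    if inp_width % stride or inp_height % stride:
        raise ValueError("input size should be a multiple of %d" % stride)
    return inp_width // stride, inp_height // stride


# 320, 416 and 608 are the input sizes mentioned later in this post.
for size in (320, 416, 608):
    cols, rows = grid_shape(size, size)
    print("%dx%d input -> %dx%d grid, %d cells" % (size, size, cols, rows, cols * rows))
```

For the 416x416 case this reproduces the 13x13 grid and its 169 cells.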

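Why use OpenCV for YOLO?

Here are a few reasons.

Easy integration with an existing OpenCV application: if your application already uses OpenCV and you simply want to use YOLOv3, you do not have to worry about compiling and building the Darknet source code.
The OpenCV CPU version is 9x faster: the CPU implementation of OpenCV's DNN module is remarkably fast. For example, Darknet with OpenMP takes about 2 seconds on a CPU to process a single image, whereas OpenCV's implementation takes only 0.22 seconds. See the table below.
Python support: Darknet is written in C and does not officially support Python. OpenCV, in contrast, does. Python ports of Darknet do exist, though.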

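Speed test for YOLOv3 on Darknet and OpenCV

      The table below shows the performance of YOLOv3 on Darknet and on OpenCV for an input size of 416x416. As expected, the GPU version of Darknet outperforms everything else. It is also no surprise that Darknet with OpenMP beats Darknet without OpenMP, because OpenMP makes use of multiple CPU cores.
    What is surprising is that OpenCV's CPU implementation of DNN is about 9x faster than Darknet with OpenMP.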

Table 1. Speed comparison of YOLOv3 on Darknet and OpenCV
OS	Framework	CPU/GPU	Time(ms)/Frame
Linux 16.04	Darknet	12x Intel Core i7-6850K CPU @ 3.60GHz	9370
Linux 16.04	Darknet + OpenMP	12x Intel Core i7-6850K CPU @ 3.60GHz	1942
Linux 16.04	OpenCV [CPU]	12x Intel Core i7-6850K CPU @ 3.60GHz	220
Linux 16.04	Darknet	NVIDIA GeForce 1080 Ti GPU	23
macOS	Darknet	2.5 GHz Intel Core i7 CPU	7260
macOS	OpenCV [CPU]	2.5 GHz Intel Core i7 CPU	400
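
The speedup figures can be recomputed directly from Table 1. The snippet below is a small sketch that only uses the per-frame times listed in the table; the dictionary layout and the helper name speedup are made up for illustration.

```python
# Per-frame times in milliseconds, copied from Table 1.
times_ms = {
    ("Linux", "Darknet"): 9370,
    ("Linux", "Darknet + OpenMP"): 1942,
    ("Linux", "OpenCV [CPU]"): 220,
    ("macOS", "Darknet"): 7260,
    ("macOS", "OpenCV [CPU]"): 400,
}


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


linux_openmp = speedup(times_ms[("Linux", "Darknet + OpenMP")], times_ms[("Linux", "OpenCV [CPU]")])
linux_plain = speedup(times_ms[("Linux", "Darknet")], times_ms[("Linux", "OpenCV [CPU]")])
mac_plain = speedup(times_ms[("macOS", "Darknet")], times_ms[("macOS", "OpenCV [CPU]")])

print("Linux, OpenCV vs Darknet + OpenMP: %.1fx" % linux_openmp)
print("Linux, OpenCV vs Darknet: %.1fx" % linux_plain)
print("macOS, OpenCV vs Darknet: %.1fx" % mac_plain)
```

The first ratio, 1942 / 220, is a little under 9, which is where the "9x faster" figure comes from.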

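Note: we ran into problems running the DNN module with OpenCV's GPU implementation. It has only been tested with Intel GPUs, so the code falls back to the CPU if you do not have an Intel GPU.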

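Object detection using YOLOv3 in C++ / Python

Let us now see how to run object detection with YOLOv3 in OpenCV.

Step 1: Download the models

We start by running the script getModels.sh from the command line.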

sudo chmod a+x getModels.sh
./getModels.sh

// Added by the translator:
Alternative for Windows:

1. Install wget from http://gnuwin32.sourceforge.net/packages/wget.htm
2. cd into the wget installation directory and run:
wget https://pjreddie.com/media/files/yolov3.weights
wget https://github.com/pjreddie/darknet/blob/master/cfg/yolov3.cfg?raw=true -O ./yolov3.cfg
wget https://github.com/pjreddie/darknet/blob/master/data/coco.names?raw=true -O ./coco.names

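This downloads the yolov3.weights file (containing the pre-trained network's weights), the yolov3.cfg file (containing the network configuration) and the coco.names file (containing the names of the 80 object classes used in the COCO dataset).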
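
Before moving on, it can help to confirm that the three files are actually in place. The following is a minimal sketch, not part of the original code: the function name missing_model_files and the directory argument are assumptions, and only the file names come from the text.

```python
import os

# File names taken from the text above.
MODEL_FILES = ("yolov3.weights", "yolov3.cfg", "coco.names")


def missing_model_files(directory="."):
    """Return the model files that are absent or empty in the given directory."""
    missing = []
    for name in MODEL_FILES:
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing model files:", ", ".join(missing))
    else:
        print("All model files are present.")
```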

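Step 2: Initialize the parameters

    The YOLOv3 algorithm produces its detections as bounding boxes. Every predicted box has a confidence score associated with it. In the first stage, all boxes below the confidence threshold are dropped from further processing.
    The remaining boxes go through non-maximum suppression, which removes redundant overlapping boxes. Non-maximum suppression is controlled by the parameter nmsThreshold. Try changing this value and see how the number of output boxes changes.
    Next, the width (inpWidth) and height (inpHeight) of the network's input image are set. We set both to 416 so that our runs can be compared with the Darknet C code provided by the YOLOv3 authors. You can change both to 320 for faster results or to 608 for more accurate results.

Python: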

# Initialize the parameters
confThreshold = 0.5  #Confidence threshold
nmsThreshold = 0.4   #Non-maximum suppression threshold
inpWidth = 416       #Width of network's input image
inpHeight = 416      #Height of network's input image

C++:

// Initialize the parameters
float confThreshold = 0.5; // Confidence threshold
float nmsThreshold = 0.4;  // Non-maximum suppression threshold
int inpWidth = 416;        // Width of network's input image
int inpHeight = 416;       // Height of network's input image

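Step 3: Load the model and classes

The file coco.names contains all the object classes the model was trained on. We read the class names from it.
Next, we load the network, which has two parts:
1. yolov3.weights: the pre-trained weights.
2. yolov3.cfg: the configuration file.

We set the DNN backend to OpenCV and the target to CPU. You can try setting the preferable target to cv.dnn.DNN_TARGET_OPENCL to run on a GPU. Keep in mind, though, that the current OpenCV version has only been tested with Intel GPUs and automatically switches back to the CPU if you do not have one.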

Python:

import cv2 as cv

# Load names of classes
classesFile = "coco.names"
classes = None
with open(classesFile, 'rt') as f:
    classes = f.read().rstrip('\n').split('\n')

# Give the configuration and weight files for the model and load the network using them.
modelConfiguration = "yolov3.cfg"
modelWeights = "yolov3.weights"

net = cv.dnn.readNetFromDarknet(modelConfiguration, modelWeights)
net.setPreferableBackend(cv.dnn.DNN_BACKEND_OPENCV)
net.setPreferableTarget(cv.dnn.DNN_TARGET_CPU)

C++

// Load names of classes
// (classes is assumed to be a vector<string> declared at file scope, since drawPred uses it too)
string classesFile = "coco.names";
ifstream ifs(classesFile.c_str());
string line;
while (getline(ifs, line)) classes.push_back(line);
 
// Give the configuration and weight files for the model
String modelConfiguration = "yolov3.cfg";
String modelWeights = "yolov3.weights";
 
// Load the network
Net net = readNetFromDarknet(modelConfiguration, modelWeights);
net.setPreferableBackend(DNN_BACKEND_OPENCV);
net.setPreferableTarget(DNN_TARGET_CPU);

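Step 4: Read the input

In this step we read the image, video stream or webcam. In addition, we open a VideoWriter (an OpenCV class) to save the frames with the detected bounding boxes.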

Python

outputFile = "yolo_out_py.avi"
if (args.image):
    # Open the image file
    if not os.path.isfile(args.image):
        print("Input image file ", args.image, " doesn't exist")
        sys.exit(1)
    cap = cv.VideoCapture(args.image)
    outputFile = args.image[:-4]+'_yolo_out_py.jpg'
elif (args.video):
    # Open the video file
    if not os.path.isfile(args.video):
        print("Input video file ", args.video, " doesn't exist")
        sys.exit(1)
    cap = cv.VideoCapture(args.video)
    outputFile = args.video[:-4]+'_yolo_out_py.avi'
else:
    # Webcam input
    cap = cv.VideoCapture(0)
 
# Get the video writer initialized to save the output video
if (not args.image):
    vid_writer = cv.VideoWriter(outputFile, cv.VideoWriter_fourcc('M','J','P','G'), 30, (round(cap.get(cv.CAP_PROP_FRAME_WIDTH)),round(cap.get(cv.CAP_PROP_FRAME_HEIGHT))))
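
The Python snippet above relies on an args object whose construction is not shown in this post. One plausible way to build it is with argparse, as sketched below; the flag names --image and --video are assumptions chosen to match args.image and args.video, and the file name bird.jpg is purely hypothetical.

```python
import argparse


def build_parser():
    # Flag names are assumptions chosen to match args.image and args.video.
    parser = argparse.ArgumentParser(description="Object detection using YOLOv3 in OpenCV")
    parser.add_argument("--image", help="Path to an image file.")
    parser.add_argument("--video", help="Path to a video file.")
    return parser


# Simulate running the script with "--image bird.jpg" (hypothetical file name).
args = build_parser().parse_args(["--image", "bird.jpg"])
print(args.image, args.video)
```

In a real script, parse_args() would be called without arguments so that the values come from the command line. When neither flag is given, both attributes are None and the code above falls through to the webcam branch.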

C++

outputFile = "yolo_out_cpp.avi";
if (parser.has("image"))
{
    // Open the image file
    str = parser.get<String>("image");
    ifstream ifile(str);
    if (!ifile) throw("error");
    cap.open(str);
    str.replace(str.end()-4, str.end(), "_yolo_out.jpg");
    outputFile = str;
}
else if (parser.has("video"))
{
    // Open the video file
    str = parser.get<String>("video");
    ifstream ifile(str);
    if (!ifile) throw("error");
    cap.open(str);
    str.replace(str.end()-4, str.end(), "_yolo_out.avi");
    outputFile = str;
}
// Open the webcam
else cap.open(parser.get<int>("device"));
 
// Get the video writer initialized to save the output video
if (!parser.has("image")) {
   video.open(outputFile, VideoWriter::fourcc('M','J','P','G'), 28, Size(cap.get(CAP_PROP_FRAME_WIDTH), cap.get(CAP_PROP_FRAME_HEIGHT)));
}

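Step 5: Process each frame

    The input image to a neural network needs to be in a certain format called a blob.
    After a frame is read from the input image or video stream, it is passed through the blobFromImage function to convert it into an input blob for the network. In this process, the pixel values are scaled to the range 0 to 1 using a scale factor of 1/255. The image is also resized to 416x416 without cropping. Note that we do not perform mean subtraction, so we pass [0,0,0] as the mean, and we set the swapRB parameter to 1 so that the red and blue channels are swapped.
    The blob is then passed to the network, and a forward pass produces a list of predicted bounding boxes. These boxes go through a post-processing step that filters out the ones with low confidence. We go through the post-processing step in more detail later. We print the inference time for each frame at the top left. The frame with the final bounding boxes is then saved to disk, either as an image for an image input or through a VideoWriter for a video stream input.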
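
To make the blob layout concrete, here is a pure-Python sketch of the scaling, channel swap and reordering described above. It is not how blobFromImage is implemented: the function name to_blob is made up, nested lists stand in for arrays, and resizing is skipped, so the frame is assumed to already have the network's input size. The result has the 4D layout batch x channels x height x width.

```python
def to_blob(frame_bgr, scale=1 / 255.0, swap_rb=True):
    """Convert an H x W x 3 nested list (BGR, 0-255) into a 1 x 3 x H x W nested list."""
    height, width = len(frame_bgr), len(frame_bgr[0])
    # With swap_rb the planes come out in R, G, B order instead of B, G, R.
    channel_order = (2, 1, 0) if swap_rb else (0, 1, 2)
    planes = [
        [[frame_bgr[y][x][c] * scale for x in range(width)] for y in range(height)]
        for c in channel_order
    ]
    return [planes]  # the leading dimension is a batch holding one image


# A 1x2 "image": one pure-blue pixel and one pure-red pixel, in BGR order.
frame = [[[255, 0, 0], [0, 0, 255]]]
blob = to_blob(frame)

# Batch, channels, height, width
print(len(blob), len(blob[0]), len(blob[0][0]), len(blob[0][0][0]))
```

After the swap, the first plane holds the red values, so the red pixel shows up as a value close to 1 in plane 0 and the blue pixel in plane 2.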

Python:

while cv.waitKey(1) < 0:
     
    # get frame from the video
    hasFrame, frame = cap.read()
     
    # Stop the program if reached end of video
    if not hasFrame:
        print("Done processing !!!")
        print("Output file is stored as ", outputFile)
        cv.waitKey(3000)
        break
 
    # Create a 4D blob from a frame.
    blob = cv.dnn.blobFromImage(frame, 1/255, (inpWidth, inpHeight), [0,0,0], 1, crop=False)
 
    # Sets the input to the network
    net.setInput(blob)
 
    # Runs the forward pass to get output of the output layers
    outs = net.forward(getOutputsNames(net))
 
    # Remove the bounding boxes with low confidence
    postprocess(frame, outs)
 
    # Put efficiency information. The function getPerfProfile returns the
    # overall time for inference(t) and the timings for each of the layers(in layersTimes)
    t, _ = net.getPerfProfile()
    label = 'Inference time: %.2f ms' % (t * 1000.0 / cv.getTickFrequency())
    cv.putText(frame, label, (0, 15), cv.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255))
 
    # Write the frame with the detection boxes
    if (args.image):
        cv.imwrite(outputFile, frame.astype(np.uint8))
    else:
        vid_writer.write(frame.astype(np.uint8))

C++

// Process frames.
while (waitKey(1) < 0)
{
    // get frame from the video
    cap >> frame;
 
    // Stop the program if reached end of video
    if (frame.empty()) {
        cout << "Done processing !!!" << endl;
        cout << "Output file is stored as " << outputFile << endl;
        waitKey(3000);
        break;
    }
    // Create a 4D blob from a frame.
    blobFromImage(frame, blob, 1/255.0, Size(inpWidth, inpHeight), Scalar(0,0,0), true, false);
     
    //Sets the input to the network
    net.setInput(blob);
     
    // Runs the forward pass to get output of the output layers
    vector<Mat> outs;
    net.forward(outs, getOutputsNames(net));
     
    // Remove the bounding boxes with low confidence
    postprocess(frame, outs);
     
    // Put efficiency information. The function getPerfProfile returns the
    // overall time for inference(t) and the timings for each of the layers(in layersTimes)
    vector<double> layersTimes;
    double freq = getTickFrequency() / 1000;
    double t = net.getPerfProfile(layersTimes) / freq;
    string label = format("Inference time for a frame : %.2f ms", t);
    putText(frame, label, Point(0, 15), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 0, 255));
     
    // Write the frame with the detection boxes
    Mat detectedFrame;
    frame.convertTo(detectedFrame, CV_8U);
    if (parser.has("image")) imwrite(outputFile, detectedFrame);
    else video.write(detectedFrame);
     
}

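Now, let us go into the details of some of the functions used above.

Step 4a: Getting the names of the output layers
The forward function of OpenCV's Net class needs to know the ending layer up to which it should run. Since we want to run through the whole network, we need to identify its last layers. We do that with getUnconnectedOutLayers(), which returns the indices of the unconnected output layers; these are essentially the last layers of the network, and we look up their names from the indices. We then run the forward pass to get the output, as in the earlier code snippet (net.forward(getOutputsNames(net))).
Python: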

# Get the names of the output layers
def getOutputsNames(net):
    # Get the names of all the layers in the network
    layersNames = net.getLayerNames()
    # Get the names of the output layers, i.e. the layers with unconnected outputs
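    # Flattening handles both return shapes of getUnconnectedOutLayers(): Nx1 (older OpenCV) and flat (newer OpenCV)
    return [layersNames[i - 1] for i in np.array(net.getUnconnectedOutLayers()).flatten()]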

C++

// Get the names of the output layers
vector<String> getOutputsNames(const Net& net)
{
    static vector<String> names;
    if (names.empty())
    {
        //Get the indices of the output layers, i.e. the layers with unconnected outputs
        vector<int> outLayers = net.getUnconnectedOutLayers();
         
        //get the names of all the layers in the network
        vector<String> layersNames = net.getLayerNames();
         
        // Get the names of the output layers in names
        names.resize(outLayers.size());
        for (size_t i = 0; i < outLayers.size(); ++i)
            names[i] = layersNames[outLayers[i] - 1];
    }
    return names;
}
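
The subtraction of 1 in both versions reflects that the ids returned by getUnconnectedOutLayers() are offset by one relative to the list returned by getLayerNames(). Depending on the OpenCV version, the Python binding may also return the ids either as a flat sequence or as one-element rows. The sketch below illustrates that lookup on plain lists; the helper name output_layer_names and the four-layer network are made up and are not read from yolov3.cfg.

```python
def output_layer_names(layer_names, out_layer_ids):
    """Map layer ids (counted from 1) to names, accepting flat or nested id lists."""
    names = []
    for item in out_layer_ids:
        # Some versions wrap each id in a one-element sequence.
        layer_id = item[0] if isinstance(item, (list, tuple)) else item
        names.append(layer_names[layer_id - 1])
    return names


# Hypothetical four-layer network; the names are illustrative only.
layer_names = ["conv_0", "yolo_1", "conv_2", "yolo_3"]

print(output_layer_names(layer_names, [2, 4]))      # flat ids
print(output_layer_names(layer_names, [[2], [4]]))  # one-element rows
```

Both calls select the second and fourth entries of the list.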

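Step 4b: Post-processing the network's output
Each bounding box output by the network is represented by a vector of (number of classes + 5) elements.
The first four elements represent center_x, center_y, width and height. The fifth element is the confidence that the bounding box encloses an object.

The remaining elements are the confidence scores associated with each class (i.e. object type). The box is assigned to the class with the highest score.
The highest score for a box is also called its confidence. If the confidence of a box is below the given threshold, the box is dropped and not processed any further.
The boxes whose confidence passes the threshold are then subjected to non-maximum suppression, which reduces the number of overlapping boxes.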
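
As a worked example of this layout, the sketch below decodes a single made-up detection vector with three classes. The function name decode_detection and the sample numbers are assumptions for illustration; the coordinates are treated as fractions of the frame size, and the box is returned as [left, top, width, height].

```python
def decode_detection(detection, frame_width, frame_height, conf_threshold=0.5):
    """Return (class_id, confidence, [left, top, width, height]), or None if below threshold."""
    scores = detection[5:]  # everything after the first five elements is a class score
    class_id = max(range(len(scores)), key=lambda i: scores[i])
    confidence = scores[class_id]
    if confidence <= conf_threshold:
        return None
    center_x = int(detection[0] * frame_width)
    center_y = int(detection[1] * frame_height)
    width = int(detection[2] * frame_width)
    height = int(detection[3] * frame_height)
    left = int(center_x - width / 2)
    top = int(center_y - height / 2)
    return class_id, confidence, [left, top, width, height]


# Made-up detection: center_x, center_y, width, height, objectness, then three class scores.
detection = [0.5, 0.5, 0.25, 0.5, 0.9, 0.1, 0.8, 0.05]
print(decode_detection(detection, 416, 416))  # -> (1, 0.8, [156, 104, 104, 208])
```

The second class has the highest score (0.8), so the box is assigned to class 1 and kept because 0.8 exceeds the threshold of 0.5.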

Python

# Remove the bounding boxes with low confidence using non-maxima suppression
def postprocess(frame, outs):
    frameHeight = frame.shape[0]
    frameWidth = frame.shape[1]
 
    # Scan through all the bounding boxes output from the network and keep only the
    # ones with high confidence scores. Assign the box's class label as the class with the highest score.
    classIds = []
    confidences = []
    boxes = []
    for out in outs:
        for detection in out:
            scores = detection[5:]
            classId = np.argmax(scores)
            confidence = scores[classId]
            if confidence > confThreshold:
                center_x = int(detection[0] * frameWidth)
                center_y = int(detection[1] * frameHeight)
                width = int(detection[2] * frameWidth)
                height = int(detection[3] * frameHeight)
                left = int(center_x - width / 2)
                top = int(center_y - height / 2)
                classIds.append(classId)
                confidences.append(float(confidence))
                boxes.append([left, top, width, height])
 
    # Perform non maximum suppression to eliminate redundant overlapping boxes with
    # lower confidences.
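    indices = cv.dnn.NMSBoxes(boxes, confidences, confThreshold, nmsThreshold)
    # Depending on the OpenCV version, indices is either a flat array or an Nx1 array;
    # flattening handles both cases.
    for i in np.array(indices).flatten():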
        box = boxes[i]
        left = box[0]
        top = box[1]
        width = box[2]
        height = box[3]
        drawPred(classIds[i], confidences[i], left, top, left + width, top + height)

C++

// Remove the bounding boxes with low confidence using non-maxima suppression
void postprocess(Mat& frame, const vector<Mat>& outs)
{
    vector<int> classIds;
    vector<float> confidences;
    vector<Rect> boxes;
     
    for (size_t i = 0; i < outs.size(); ++i)
    {
        // Scan through all the bounding boxes output from the network and keep only the
        // ones with high confidence scores. Assign the box's class label as the class
        // with the highest score for the box.
        float* data = (float*)outs[i].data;
        for (int j = 0; j < outs[i].rows; ++j, data += outs[i].cols)
        {
            Mat scores = outs[i].row(j).colRange(5, outs[i].cols);
            Point classIdPoint;
            double confidence;
            // Get the value and location of the maximum score
            minMaxLoc(scores, 0, &confidence, 0, &classIdPoint);
            if (confidence > confThreshold)
            {
                int centerX = (int)(data[0] * frame.cols);
                int centerY = (int)(data[1] * frame.rows);
                int width = (int)(data[2] * frame.cols);
                int height = (int)(data[3] * frame.rows);
                int left = centerX - width / 2;
                int top = centerY - height / 2;
                 
                classIds.push_back(classIdPoint.x);
                confidences.push_back((float)confidence);
                boxes.push_back(Rect(left, top, width, height));
            }
        }
    }
     
    // Perform non maximum suppression to eliminate redundant overlapping boxes with
    // lower confidences
    vector<int> indices;
    NMSBoxes(boxes, confidences, confThreshold, nmsThreshold, indices);
    for (size_t i = 0; i < indices.size(); ++i)
    {
        int idx = indices[i];
        Rect box = boxes[idx];
        drawPred(classIds[idx], confidences[idx], box.x, box.y,
                 box.x + box.width, box.y + box.height, frame);
    }
}

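Non-maximum suppression is controlled by the parameter nmsThreshold. If nmsThreshold is set too low, e.g. 0.1, we might not detect overlapping objects of the same or different classes. If it is set too high, e.g. 1, we may get multiple boxes for the same object. That is why the code above uses an intermediate value of 0.4. The GIF in the original article shows the effect of varying the NMS threshold.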
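
The effect of the threshold can be reproduced on a toy example. The code below is a plain greedy non-maximum suppression written in pure Python, not OpenCV's NMSBoxes implementation; the function names iou and greedy_nms and the three boxes are made up. Boxes use the same [left, top, width, height] layout as the code above.

```python
def iou(a, b):
    """Intersection over union of two [left, top, width, height] boxes."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, score_threshold, nms_threshold):
    """Keep the best-scoring boxes, dropping any box that overlaps a kept one too much."""
    order = sorted(
        (i for i, s in enumerate(scores) if s > score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= nms_threshold for k in keep):
            keep.append(i)
    return keep


# Box 1 nearly duplicates box 0; box 2 overlaps both only partially.
boxes = [[100, 100, 100, 100], [110, 100, 100, 100], [160, 100, 100, 100]]
scores = [0.9, 0.8, 0.7]

for threshold in (0.1, 0.4, 1.0):
    print(threshold, greedy_nms(boxes, scores, 0.5, threshold))
```

With a threshold of 0.1 only box 0 survives, so the partially overlapping box 2 is lost. With 0.4 the near-duplicate box 1 is removed but box 2 is kept. With 1.0 nothing is suppressed and all three boxes remain.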

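Step 4c: Draw the predicted boxes
Finally, we draw the boxes that remain after non-maximum suppression on the input frame, together with their class label and confidence score.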

Python

# Draw the predicted bounding box
def drawPred(classId, conf, left, top, right, bottom):
    # Draw a bounding box.
    cv.rectangle(frame, (left, top), (right, bottom), (0, 0, 255))
     
    label = '%.2f' % conf
         
    # Get the label for the class name and its confidence
    if classes:
        assert(classId < len(classes))
        label = '%s:%s' % (classes[classId], label)
 
    #Display the label at the top of the bounding box
    labelSize, baseLine = cv.getTextSize(label, cv.FONT_HERSHEY_SIMPLEX, 0.5, 1)
    top = max(top, labelSize[1])
    cv.putText(frame, label, (left, top), cv.FONT_HERSHEY_SIMPLEX, 0.5, (255,255,255))

C++

// Draw the predicted bounding box
void drawPred(int classId, float conf, int left, int top, int right, int bottom, Mat& frame)
{
    //Draw a rectangle displaying the bounding box
    rectangle(frame, Point(left, top), Point(right, bottom), Scalar(0, 0, 255));
     
    //Get the label for the class name and its confidence
    string label = format("%.2f", conf);
    if (!classes.empty())
    {
        CV_Assert(classId < (int)classes.size());
        label = classes[classId] + ":" + label;
    }
     
    //Display the label at the top of the bounding box
    int baseLine;
    Size labelSize = getTextSize(label, FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);
    top = max(top, labelSize.height);
    putText(frame, label, Point(left, top), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(255,255,255));
}


References:
YOLOv3 Tech Report


Special thanks to fanyi.baidu.com

Original article: https://www.learnopencv.com/deep-learning-based-object-detection-using-yolov3-with-opencv-python-c/
