License Plate Detection and Recognition in Unconstrained Scenarios

0 Abstract

Despite the large number of commercial and academic Automatic License Plate Recognition (ALPR) methods, most existing approaches focus on a specific license plate (LP) region (e.g. Europe, the US, Brazil, Taiwan, etc.) and frequently explore datasets containing approximately frontal images. This paper presents a complete ALPR system focused on unconstrained scenarios, where the LP image may be severely distorted due to oblique viewing angles. Our main contribution is the introduction of a novel convolutional neural network (CNN) that detects and rectifies multiple distorted LPs in a single image; the rectified plates are then fed to an optical character recognition (OCR) method to obtain the final result. As an additional contribution, we manually annotated a challenging set of LP images from different regions and acquisition conditions. Our experimental results show that the proposed method, without any parameter adjustment or fine-tuning for a particular scenario, achieves recognition results similar to state-of-the-art commercial systems in traditional scenarios, and outperforms both academic and commercial methods in challenging ones.

1 Introduction

Several traffic-related applications, such as detection of stolen vehicles, toll control and parking lot access validation, rely on vehicle identification, which is performed by Automatic License Plate Recognition (ALPR) systems. Recent advances in parallel processing and deep learning (DL) have improved many computer vision tasks, such as object detection/recognition and optical character recognition (OCR), and clearly benefit ALPR systems. In fact, deep convolutional neural networks (CNNs) are the state-of-the-art machine learning technique applied to vehicle and license plate (LP) detection. Besides academic papers, some commercial ALPR systems also explore DL methods. They are usually deployed in large data centers and offered as web services, able to process from thousands to millions of images per day, and are constantly improved. Such systems include Sighthound (https://www.sighthound.com/), OpenALPR (http://www.openalpr.com/) and Amazon Rekognition (https://aws.amazon.com/rekognition/).

Despite these advances, most ALPR systems deal mainly with frontal views of vehicles and LPs, which is common in applications such as toll monitoring and parking lot validation. However, less constrained capture scenarios (e.g. a law-enforcement agent with a moving camera, or a person walking with a smartphone) may lead to oblique views in which the LP is highly distorted but still readable, as shown in Figure 1. Even state-of-the-art commercial ALPR systems do not achieve satisfactory recognition accuracy on such plates.

In this work, we present a complete ALPR system that performs well in a variety of scenarios and camera setups.

Our main contribution is the introduction of a novel network that detects LPs under a variety of camera poses and estimates their distortion, allowing a rectification step before OCR.

Another contribution is the extensive use of synthetically warped versions of real images to augment the training data, which allows the network to be trained from scratch with fewer than 200 manually labeled images.

2 Related Work

ALPR consists of finding and recognizing license plates in images. It is usually divided into four subtasks that form a sequential pipeline: vehicle detection, license plate detection, character segmentation and character recognition. For simplicity, we refer to the combination of the last two subtasks as OCR.

Many different ALPR systems, or approaches to related subtasks, have been proposed in the past. They usually rely on binary or gray-scale image analysis to find candidates (such as LPs and characters), followed by handcrafted feature extraction and classical machine learning classifiers. With the rise of DL, the state of the art has moved in a different direction, and many recent works adopt CNNs given their high accuracy for generic object detection and recognition.

ALPR is related to Scene Text Spotting (STS) and to reading digits in the wild (e.g. house numbers from Google Street View images [22]), whose goal is to find and read text/digits in natural scenes. Although ALPR can be seen as a particular case of STS, the two problems differ: in ALPR we must learn characters and digits that carry no semantic information (and without too much font variation), whereas STS focuses on text with high font variability and can exploit lexical and semantic information, e.g. [30]. Digit reading also lacks semantic information, but it handles digits only, which is simpler than ALPR because it avoids common digit/letter confusions such as B-8, D-0, 1-I, 5-S.

Since the main contribution of this paper is a novel LP detection network, we begin this section by reviewing DL-based methods for this particular subtask, along with some STS detection methods that can handle text distortion and could be used for detecting LPs at oblique views. We then review complete DL-based ALPR systems.

2.1 License Plate Detection

The success of the YOLO networks [23, 24] has inspired many recent works aiming at real-time LP detection. Hsu et al. adapted slightly modified versions of YOLO [23] and YOLOv2 [24] for LP detection, enlarging the granularity of the network output and predicting, for each detected box, the probability of belonging to one of two classes (LP and background). Their networks achieve a good compromise between precision and recall, but the extracted bounding boxes are not very accurate. Moreover, since YOLO is not good at detecting small objects, cases where the car is far from the camera still require further evaluation.

In [31], two YOLO-based networks are trained with the goal of detecting rotated LPs. The first network finds a region containing the LP and is referred to as an "attention model"; the second network obtains a rotated rectangular bounding box for the LP. Nevertheless, they consider only in-plane rotations, rather than the more complex deformations caused by oblique camera views, such as those shown in Fig. 1. Moreover, since they do not provide a complete ALPR system, it is difficult to assess how OCR methods would perform on the detected regions.

LP detectors based on sliding-window approaches or candidate filtering combined with CNNs can also be found in [3, 2, 27]. However, since they do not share computation the way object detection architectures such as YOLO, SSD [21] and Faster R-CNN [25] do, they tend to be computationally inefficient.

Although Scene Text Spotting (STS) methods focus on large font variations and lexical/semantic information, it is worth noting that some of them handle rotated/distorted text and could be used to detect LPs captured at oblique views. Jaderberg and colleagues [13] proposed a CNN-based text recognition method for natural scenes, trained on a fully synthetic dataset. Despite the good results, they rely heavily on N-grams, which does not apply to ALPR. Gupta et al. [7] explored synthetic datasets by realistically pasting text into real images, focusing on text localization. The output is a rotated bounding box around the text, which limits detections to in-plane rotations, a restriction for the oblique views common in unconstrained ALPR scenarios.

Recently, Wang et al. [29] proposed a text detection method for multiple geometric positions, called the Instance Transformation Network (ITN). It essentially consists of three CNNs: a backbone network that computes features, a transformation network that infers affine parameters assuming the presence of text, and a final network for classification and coordinate mapping, which samples features based on the affine parameters from the previous stage. Although this method can (in theory) handle off-plane rotations, in practice it cannot correctly infer the transformation that maps the text region to a rectangle, because there is no physical boundary around a text region indicating how it should be mapped to an undistorted rectangular view. In ALPR, LPs are rectangular and planar by construction, and we exploit this information to regress the transformation parameters, as described in Section 3.2.

2.2 Complete ALPR Methods

Silva and Jung [28] and Laroca et al. [17] proposed complete ALPR systems based on a series of modified YOLO networks. In [28], two distinct networks are used: one that jointly detects cars and LPs, and another that performs OCR. In [17], five CNNs are used, basically one per ALPR subtask, with two devoted to character recognition. Although both systems run in real time, they focus only on Brazilian license plates, do not rectify the plates, and handle only frontal, close-to-rectangular LPs.

Selmi et al. [27] used a series of pre-processing steps based on morphological operators, Gaussian filtering, edge detection and geometric analysis to find LP and character candidates. Then, two distinct CNNs are used: (i) one to select a single positive sample from the set of LP candidates in each image, and (ii) one to recognize the segmented characters. This method handles only a single LP per image and, according to the authors, LP distortions and poor lighting conditions degrade its performance.

Lee et al. [19] proposed a network based on Faster R-CNN [25]. A region proposal network is used to locate candidate LP regions, and the corresponding feature maps are cropped by an RoI pooling layer. These candidates are then fed to the final part of the network, which computes LP/non-LP probabilities and performs OCR with a recurrent neural network. Although promising, the evaluation presented by the authors shows that oblique LPs remain the most challenging scenario, with performance still to be improved.

Commercial systems are a good reference for the state of the art. Although they usually disclose little (or no) information about their architectures, we can still evaluate their final output by using them as black boxes. Examples are Sighthound, OpenALPR (whose Metropolis platform is an official NVIDIA partner) and Amazon Rekognition (a generic AI engine that includes a text recognition module which can be used to detect and recognize LPs).
3 Proposed Method

 

 

 3.1 Vehicle Detection

Since vehicles are among the classic objects present in many detection and recognition datasets, such as PASCAL-VOC [5], ImageNet [26] and COCO [20], we decided not to train a detector from scratch, but rather to use a known model for vehicle detection, considering a few criteria. On one hand, a high recall rate is desired, since any vehicle with a visible LP that goes undetected leads directly to a missed LP. On the other hand, fast detection is also desired, since each detected vehicle must be further processed by WPOD-NET. Based on these considerations, we chose the YOLOv2 network, given its fast execution (around 70 FPS) and good compromise between precision and recall (76.8% mAP on PASCAL-VOC). We did not make any change or refinement to YOLOv2; the network is used as a black box, merging the vehicle-related outputs (i.e. cars and buses) and ignoring the other classes.
Positive detections are then resized before being fed to WPOD-NET. Empirically, larger input images allow the detection of smaller objects, but increase the computational cost. For roughly frontal/rear views, the ratio between the LP size and the vehicle bounding box (BB) size is large. For oblique/lateral views, however, this ratio is smaller, since the vehicle BB tends to be larger and more elongated. Hence, images captured at oblique views should be resized to a larger dimension than frontal ones, so that the LP region remains recognizable.

Although a 3D pose estimation method such as [32] could be used to determine the resizing dimension, we adopt a simple and fast procedure based on the aspect ratio of the vehicle BB: a smaller dimension is used when the aspect ratio is close to one, and it increases as the aspect ratio grows. More precisely, the scaling factor $f_{sc}$ is given by:

$$f_{sc} = \frac{1}{\min(W_v, H_v)}\,\min\!\left( D_{min}\,\frac{\max(W_v, H_v)}{\min(W_v, H_v)},\; D_{max} \right) \qquad (1)$$

where $W_v$ and $H_v$ are the width and height of the vehicle BB. Note that $D_{min} \le f_{sc}\min(W_v, H_v) \le D_{max}$, so $D_{min}$ and $D_{max}$ bound the smallest dimension of the resized BB. Based on experiments, and seeking a good compromise between accuracy and running time, we chose $D_{min} = 288$ and $D_{max} = 608$.
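As a minimal sketch of this resizing rule (Eq. 1), assuming OpenCV for the actual resize; the function and variable names are ours, not the paper's:

```python
import cv2

D_MIN, D_MAX = 288, 608  # bounds for the smallest side of the resized vehicle crop

def resize_vehicle_crop(crop):
    """Resize a detected vehicle crop following Eq. (1): the scaling factor grows
    with the aspect ratio of the bounding box, clamped so that the smallest side
    stays within [D_MIN, D_MAX]."""
    h_v, w_v = crop.shape[:2]
    side = min(w_v, h_v)
    f_sc = min(D_MIN * max(w_v, h_v) / side, D_MAX) / side
    new_w, new_h = int(round(w_v * f_sc)), int(round(h_v * f_sc))
    return cv2.resize(crop, (new_w, new_h)), f_sc
```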

3.2 License Plate Detection and Rectification

License plates are intrinsically rectangular, planar objects attached to vehicles for identification purposes. To take advantage of this shape, we propose a novel CNN called the Warped Planar Object Detection Network (WPOD-NET). This network detects LPs under a variety of deformations and regresses the coefficients of an affine transformation that "unwarps" the distorted LP into a rectangular shape resembling a frontal view. Although a planar perspective projection could be learned instead of an affine transformation, the division in the perspective transformation may involve small denominators, which can lead to numerical instabilities.

WPOD-NET was developed using ideas from YOLO, SSD and Spatial Transformer Networks (STN). YOLO and SSD perform fast detection and recognition of multiple objects from a single input, but they do not consider spatial transformations, generating only a rectangular bounding box per detection. In contrast, an STN can be used to detect non-rectangular regions, but it cannot handle multiple transformations, performing only a single spatial transformation over the entire input.

The detection process using WPOD-NET is illustrated in Figure 3. Initially, the output of the vehicle detection module is resized and fed forward through WPOD-NET, producing an 8-channel feature map that encodes object/non-object probabilities and affine transformation parameters. To extract a warped LP, let us first consider a fictional square of fixed size centered at a cell (m, n). If the object probability for that cell is above a detection threshold, part of the regressed parameters is used to build an affine matrix that transforms the fictional square into an LP region. Thus, we can easily unwarp the LP into a horizontally and vertically aligned object.
Network architecture: the proposed architecture has a total of 21 convolutional layers, 14 of which are inside residual blocks. All convolutional filters have a fixed size of 3×3. ReLU activations are used throughout the network, except in the detection block. There are four max-pooling layers of size 2×2 and stride 2 that reduce the input dimensions by a factor of 16. Finally, the detection block has two parallel convolutional layers: (i) one activated by a softmax function, used to infer the object probability, and (ii) another without activation (or, equivalently, using the identity function f(x) = x as activation), used to regress the affine parameters.
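As a rough illustration of this architecture, here is a Keras sketch with 21 convolutional layers (14 of them inside residual blocks), four 2×2 max-pooling stages and the two-headed detection block. The number of filters per layer is not stated in this text, so the channel widths below are illustrative assumptions rather than the paper's.

```python
from tensorflow.keras import layers, Model

def conv_bn_relu(x, filters):
    # 3x3 convolution + batch norm + ReLU (all filters in the paper are 3x3)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def res_block(x, filters):
    # Residual block: two 3x3 convolutions with an identity skip connection
    y = conv_bn_relu(x, filters)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(layers.Add()([x, y]))

def build_wpod_like(input_shape=(None, None, 3)):
    inp = layers.Input(shape=input_shape)
    x = conv_bn_relu(inp, 16)                        # 1 plain conv
    # Four pooling stages (total stride 16): 4 plain convs + 7 residual blocks (2 convs each)
    for filters, n_res in [(16, 1), (32, 2), (64, 2), (128, 2)]:
        x = layers.MaxPool2D(2)(x)
        x = conv_bn_relu(x, filters)
        for _ in range(n_res):
            x = res_block(x, filters)
    # Detection head: two parallel 3x3 convolutions (2 convs), 21 convs in total
    probs  = layers.Conv2D(2, 3, padding="same", activation="softmax")(x)  # object / non-object
    affine = layers.Conv2D(6, 3, padding="same", activation=None)(x)       # v3..v8, no activation
    return Model(inp, layers.Concatenate()([probs, affine]))               # M x N x 8 output map
```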

 

 

Loss function:

 

For an input image of height H and width W, and a network stride of $N_s = 16$ (due to the four max-pooling layers), the output feature map has size $M \times N \times 8$, where $M = H/N_s$ and $N = W/N_s$. For each cell (m, n) of the feature map, eight values are estimated: the first two, $v_1$ and $v_2$, are the object/non-object probabilities, and the remaining six values $v_3, \ldots, v_8$ are used to build a local affine transformation $T_{mn}$:

$$T_{mn}(q) = \begin{bmatrix} \max(v_3, 0) & v_4 \\ v_5 & \max(v_6, 0) \end{bmatrix} q + \begin{bmatrix} v_7 \\ v_8 \end{bmatrix} \qquad (2)$$

where the max function is applied to $v_3$ and $v_6$ to ensure that the diagonal is positive (avoiding undesired mirroring or excessive rotations).
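To make the notation concrete, the following NumPy sketch builds $T_{mn}$ from the eight values of one output cell and applies it to the corners of the canonical unit square. The helper names (Q, affine_from_cell, warp_unit_square) are ours and are reused in later sketches.

```python
import numpy as np

# Corners of the canonical unit square q1..q4 (centered at the origin), used both
# in the loss and when decoding detections.
Q = np.array([[-0.5, -0.5],
              [ 0.5, -0.5],
              [ 0.5,  0.5],
              [-0.5,  0.5]]).T  # shape (2, 4)

def affine_from_cell(v):
    """Build T_mn (Eq. 2) from the last six channels v3..v8 of one output cell.
    `v` is the 8-vector predicted for that cell."""
    v3, v4, v5, v6, v7, v8 = v[2:]
    A = np.array([[max(v3, 0.0), v4],
                  [v5, max(v6, 0.0)]])
    b = np.array([[v7], [v8]])
    return A, b

def warp_unit_square(v):
    # Apply T_mn to the canonical square corners -> predicted LP corners
    # (still in the normalized, cell-centered coordinate system).
    A, b = affine_from_cell(v)
    return A @ Q + b  # shape (2, 4)
```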

To match the network output resolution, the annotated corner points $p_i$ are rescaled by the inverse of the network stride and re-centered with respect to each point (m, n) of the feature map. This is accomplished by applying the normalization function:

$$A_{mn}(p) = \frac{1}{\alpha}\left(\frac{p}{N_s} - \begin{bmatrix} n \\ m \end{bmatrix}\right) \qquad (3)$$

where $\alpha$ is a scaling constant representing the side of the fictional square. We set $\alpha = 7.75$, the mean of the maximum and minimum LP dimensions in the augmented training data divided by the network stride.

An intuitive reading: $A_{mn}(p)$ takes the annotated LP corner positions, divides them by the network stride $N_s$, subtracts (n, m) to re-center them, and finally divides by the average size $\alpha$; the result is a quadrilateral centered at the origin whose width and height are close to one. Since it comes from the annotation, it represents the true LP position. $T_{mn}(q)$, in turn, is the quadrilateral obtained by applying the learned affine parameters to the canonical unit square, i.e. the predicted LP position.

Assuming there is an LP at cell (m, n), the first part of the loss function considers the error between the affine-transformed version of the canonical square and the normalized annotation of the LP corners:

$$f_{affine}(m,n) = \sum_{i=1}^{4} \left\| T_{mn}(q_i) - A_{mn}(p_i) \right\|_1 \qquad (4)$$

where $q_1 = [-0.5, -0.5]^T$, $q_2 = [0.5, -0.5]^T$, $q_3 = [0.5, 0.5]^T$, $q_4 = [-0.5, 0.5]^T$ are the corners of the canonical unit square and $p_1, \ldots, p_4$ are the annotated LP corners.

The second part of the loss function handles the probability of there being or not being an object at (m, n). It is similar to the SSD confidence loss [21], being essentially the sum of two log-loss terms:

$$f_{probs}(m,n) = \mathrm{logloss}(\mathbb{I}_{obj}, v_1) + \mathrm{logloss}(1 - \mathbb{I}_{obj}, v_2) \qquad (5)$$

where $\mathbb{I}_{obj}$ is the object indicator function, equal to 1 if there is an object at cell (m, n) and 0 otherwise, and $\mathrm{logloss}(y, p) = -y\log(p)$. An object is considered to be at cell (m, n) if the IoU between its rectangular bounding box and a rectangle of the same size centered at (m, n) is greater than $\gamma_{obj}$ (empirically set to 0.3).
​ 

The final loss is the sum of the localization and classification terms, i.e. the sum of (4) and (5) over all cells:

$$\mathrm{loss} = \sum_{m=1}^{M}\sum_{n=1}^{N}\left[\, \mathbb{I}_{obj}\, f_{affine}(m,n) + f_{probs}(m,n) \,\right] \qquad (6)$$
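To make these equations concrete, here is a minimal NumPy sketch of the per-cell loss of Eqs. (3)-(6). It reuses warp_unit_square from the sketch above, assumes the annotation is given as the four LP corner points in pixels, and uses names of our own choosing.

```python
import numpy as np

NS, ALPHA = 16, 7.75
EPS = 1e-9

def normalize_corners(p, m, n):
    """A_mn(p), Eq. (3): p is a (2, 4) array of annotated LP corners in pixels."""
    return (p / NS - np.array([[n], [m]])) / ALPHA

def cell_loss(v, p_norm, obj):
    """One summand of Eq. (6) for cell (m, n), given its 8-vector `v`, the
    normalized annotation A_mn(p) in `p_norm` (2, 4), and the indicator `obj`."""
    # Affine term, Eq. (4): L1 distance between warped unit square and annotation
    pred = warp_unit_square(v)          # uses T_mn from the previous sketch
    f_affine = np.abs(pred - p_norm).sum()
    # Probability term, Eq. (5): sum of two log-losses on the softmax outputs v1, v2
    f_probs = -obj * np.log(v[0] + EPS) - (1 - obj) * np.log(v[1] + EPS)
    return obj * f_affine + f_probs
```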

Training details:

To train the proposed WPOD-NET, we created a dataset of 196 images: 105 from the Cars dataset, 40 from the SSIG dataset (training subset), and 51 from the AOLP dataset (LE subset). For each image, we manually annotated the four corners of every LP in the picture (sometimes more than one). The selected images from the Cars dataset contain mostly European LPs, but also many US LPs and LPs of other types. The images from SSIG and AOLP contain Brazilian and Taiwanese LPs, respectively. Some annotated samples are shown in Figure 5.

 

 

Given the small number of annotated images in the training set, data augmentation is crucial. We used the following augmentation methods:

Rectification: assuming the LP lies on a plane, the whole image is rectified based on the LP annotation;
Aspect ratio: the LP aspect ratio is set randomly in the interval [2, 4], to accommodate sizes from different regions;
Centering: the LP center becomes the image center;
Scaling: the LP is scaled so that its width lies between 40 and 208 pixels (set experimentally based on LP readability). This range was used to define the value of α in Eq. (3);
Rotation: a 3D rotation with randomly chosen angles is applied, to account for a wide variety of camera setups;
Mirroring: applied with 50% probability;
Translation: the LP is randomly translated away from the image center, restricted so that it stays within a 208 × 208 square;
Cropping: considering the LP center before translation, a 208 × 208 region around it is cropped;
Colorspace: color modifications are applied in the HSV colorspace;
Annotation: the positions of the four LP corners are adjusted by applying the same spatial transformations used to augment the input image.

From the set of transformations above, a wide variety of augmented images with very different visual characteristics can be obtained from a single manually labeled sample. For instance, Figure 6 shows 20 different augmented samples obtained from the same image.
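Below is a minimal OpenCV sketch of one augmentation draw in the spirit of the list above (rectification, random aspect ratio, scaling, translation, mirroring, cropping and HSV jitter). The random 3D rotation step is omitted for brevity, and the helper name and exact parameter ranges are illustrative assumptions.

```python
import cv2
import numpy as np

OUT = 208  # training crop size

def augment_once(img, corners):
    """One augmentation draw. `corners`: (4, 2) annotated LP corners, clockwise
    from top-left. Returns the 208x208 crop and the transformed corner positions."""
    # Random target geometry: aspect ratio in [2, 4], width in [40, 208] px.
    ar = np.random.uniform(2.0, 4.0)
    w = np.random.uniform(40.0, 208.0)
    h = w / ar
    # Place the LP near the crop center with a random translation.
    cx = OUT / 2.0 + np.random.uniform(-0.25, 0.25) * OUT
    cy = OUT / 2.0 + np.random.uniform(-0.25, 0.25) * OUT
    dst = np.array([[cx - w/2, cy - h/2], [cx + w/2, cy - h/2],
                    [cx + w/2, cy + h/2], [cx - w/2, cy + h/2]], np.float32)
    if np.random.rand() < 0.5:          # mirroring with 50% probability
        dst = dst[[1, 0, 3, 2]]
    # One perspective warp implements rectification + placement + cropping.
    H = cv2.getPerspectiveTransform(corners.astype(np.float32), dst)
    crop = cv2.warpPerspective(img, H, (OUT, OUT))
    # Color jitter in HSV space (saturation and value only).
    hsv = cv2.cvtColor(crop, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1:] *= np.random.uniform(0.7, 1.3, size=2)
    crop = cv2.cvtColor(np.clip(hsv, 0, 255).astype(np.uint8), cv2.COLOR_HSV2BGR)
    return crop, dst
```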

 

 3.3 OCR

Character segmentation and recognition on the rectified LP are performed using a modified YOLO network, with the architecture presented in [28]. However, the training dataset was considerably enlarged in this work by using synthetic and augmented data, in order to cope with LP characteristics of different regions of the world (Europe, the US and Brazil).

The artificially generated data consist of pasting strings of seven characters onto textured backgrounds and then applying random transformations such as rotation, translation, noise and blur. Some generated samples and a brief overview of the pipeline used to synthesize the data are shown in Figure 7. As shown in Section 4, the use of synthetic data greatly improves the generalization capability of the network, so that exactly the same network performs well on LPs from different parts of the world.
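A minimal Pillow sketch of the kind of synthetic OCR sample described above: a random seven-character string rendered over a textured background crop, followed by random translation, rotation, blur and noise. The font path, character set and parameter ranges are illustrative assumptions, not the paper's.

```python
import random
import numpy as np
from PIL import Image, ImageDraw, ImageFont, ImageFilter

CHARS = "ABCDEFGHJKLMNPQRSTUVWXYZ0123456789"  # illustrative plate alphabet

def synth_plate_sample(background, font_path="DejaVuSans-Bold.ttf"):
    """Render a random 7-character string onto a textured background (a PIL Image)
    and apply random translation, rotation, blur and noise. Returns (image, label)."""
    text = "".join(random.choices(CHARS, k=7))
    img = background.resize((240, 80)).convert("L")
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, 40)
    # Random translation of the string inside the plate area.
    draw.text((random.randint(5, 25), random.randint(5, 25)), text, fill=0, font=font)
    # Random in-plane rotation and blur.
    img = img.rotate(random.uniform(-10, 10), resample=Image.BILINEAR, fillcolor=255)
    img = img.filter(ImageFilter.GaussianBlur(random.uniform(0, 1.5)))
    # Additive Gaussian noise.
    arr = np.asarray(img, dtype=np.float32) + np.random.normal(0, 8, (80, 240))
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8)), text
```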

 

3.4 Evaluation Datasets

One of our goals is to develop a technique that performs well in a variety of unconstrained scenarios, but that should also work well in controlled ones (e.g. mostly frontal views). We therefore selected four publicly available datasets, namely OpenALPR (BR and EU), SSIG and AOLP (RP), which cover many different situations, as described in the first part of Table 1. We consider three different variables: the LP angle (frontal or oblique), the distance from the vehicle to the camera (close, intermediate and far), and the region where the picture was taken.

 

 

The most challenging of these datasets in terms of LP distortion is the AOLP Road Patrol (RP) subset, which tries to simulate a camera mounted on a patrolling vehicle or held by a person. In terms of distance from the camera to the vehicle, the SSIG dataset appears to be the most challenging: it consists of high-resolution images in which the LPs of distant vehicles may still be readable. None of these datasets presents LPs of multiple (simultaneous) vehicles in the same image.

Although these databases cover many situations, to the best of our knowledge the literature lacks a more general dataset. Hence, another contribution of this paper is the annotation of a new set of 102 images selected from the Cars dataset (named CD-HARD), covering a variety of challenging situations. The selected images mostly contain LPs with strong distortions that are still readable by humans. Some of them (crops around the LP regions) are shown in Figure 1 and were used to motivate the problem addressed in this work.
4 Experimental Results

This section presents an experimental analysis of our full ALPR system, along with comparisons to other state-of-the-art methods and commercial systems. Unfortunately, most academic ALPR papers focus on specific scenarios (e.g. a single country or region, environmental conditions, camera position, etc.). As a result, there are many scattered datasets in the literature, each evaluated by only a subset of methods. Moreover, many papers address only LP detection or character segmentation, which further limits the possibility of comparing complete ALPR pipelines. In this work, we use four independent datasets to evaluate the accuracy of the proposed method across different scenarios and regional layouts. We also present comparisons with commercial products and with papers that provide complete ALPR systems.

The proposed approach involves three networks, for which we empirically set the following acceptance thresholds: 0.5 for vehicle detection (YOLOv2) and LP detection (WPOD-NET), and 0.4 for character detection and recognition (OCR-NET). Also, it is worth noting that the characters "I" and "1" are identical for Brazilian LPs; they are therefore treated as a single class when evaluating the OpenALPR BR and SSIG datasets. No other heuristics or post-processing were applied to the results produced by the OCR module.

We evaluate the system in terms of the percentage of correctly recognized LPs, where an LP is considered correct if all of its characters are recognized correctly and no additional character is detected. It is important to note that exactly the same networks were used for all datasets: no dataset-specific training procedure was used to tune the networks to a given type of LP (e.g. European or Taiwanese). The only slight modification in the pipeline concerns the AOLP Road Patrol dataset, in which the vehicles are very close to the camera (causing the vehicle detector to fail in several cases); for this dataset, the LP detector (WPOD-NET) was applied directly to the input images.

To show the benefit of including fully synthetic data in the OCR-NET training procedure, we evaluated our system with two sets of training data: (i) real augmented data plus artificially generated data, and (ii) real augmented data only. These two versions are denoted "Ours" and "Ours (no artf.)" in Table 2, respectively. It can be observed that adding fully synthetic data improved the accuracy on all tested datasets (with a gain of about 5% on the AOLP RP dataset). In addition, to highlight the improvement brought by the rectified detection bounding boxes, we also present results using regular non-rectified bounding boxes, identified as "Ours (unrect.)" in Table 2. As expected, the results change little for the mostly frontal datasets (OpenALPR EU is even slightly better), but there is a considerable accuracy drop on the datasets with challenging oblique LPs (AOLP-RP and the proposed CD-HARD).

Table 2 also shows comparative results with competitive (commercial and academic) systems, indicating that our system achieves recognition rates comparable to commercial systems on the databases representing more controlled scenarios, where LPs are mostly frontal (OpenALPR EU and BR, and SSIG). More precisely, it is the second-best method on the two OpenALPR datasets and the best on SSIG. In the challenging scenarios (AOLP RP and the proposed CD-HARD dataset), however, our system outperforms all compared methods, with accuracy gains of more than 7% over the second-best result.

It is worth mentioning that the methods of Li et al. [18, 19], Hsu et al. [10] and Laroca et al. [17] focus on a single region or dataset; by outperforming them, we demonstrate strong generalization capability. It is also important to note that, for the most challenging datasets (AOLP-RP and CD-HARD), the full LP recognition rates are higher than those obtained by applying the OCR module directly to the annotated rectangular LP bounding boxes (79.21% for AOLP-RP and 53.85% for CD-HARD). This gain is due to the unwarping performed by WPOD-NET, which greatly helps the OCR task when the LP is strongly distorted. Figure 8 shows the rectified versions of the LPs in Figure 1 and the corresponding OCR results; the detection score of the top-right LP was below the acceptance threshold, illustrating a missed LP.

The proposed WPOD-NET was implemented using the TensorFlow framework, while the original YOLOv2 vehicle detector and OCR-NET were created and executed using the DarkNet framework. A Python wrapper was used to integrate the two frameworks. The hardware used in our experiments was an Intel Xeon processor with 12 GB of RAM and an NVIDIA Titan X GPU. With this setup, we were able to run the full ALPR system at an average of 5 FPS (considering all datasets). This time depends strongly on the number of vehicles detected in the input image; hence, increasing the vehicle detection threshold leads to higher FPS at the cost of lower recall.
5 Conclusions

In this work, we presented a complete deep-learning ALPR system for unconstrained scenarios. Our results show that the proposed method clearly outperforms existing approaches on challenging datasets containing LPs captured at strongly oblique views, while keeping good results on more controlled datasets.

The main contribution of this work is the introduction of a novel network that detects and unwarps distorted LPs by generating an affine transformation matrix for each detection cell. This step alleviates the burden on the OCR network, since it has to deal with far less distortion.

As an additional contribution, we presented a new challenging dataset for evaluating ALPR systems on images where most LPs are captured at oblique views. The annotations will be made publicly available, so that the dataset can serve as a new challenging LP benchmark.

For future work, we plan to extend our solution to detect motorcycle LPs, which pose new challenges due to differences in aspect ratio and layout. We also intend to explore the affine transformations obtained in traffic-surveillance scenarios for the problem of automatic camera calibration.
6 Code Walkthrough

6.1 Vehicle Detection

Vehicle detection is based on YOLOv2 and is not detailed further here.

6.2 License Plate Detection and Rectification

Input image: the cropped patch of a detected vehicle.

Processing steps:

First, the aspect ratio of the vehicle crop is computed and multiplied by 288. The rule (x + (x % 16)) is applied to this value, and the minimum of the result and 608 is taken as the bound dimension. The ratio between this bound dimension and the shorter side of the original crop is then computed, the crop is resized by this factor while keeping its aspect ratio, and the resized width and height are adjusted to be multiples of 16. (This procedure is rather convoluted; in particular, the (x + x % 16) step is not obvious to me.)
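Below is a minimal sketch of the resizing logic described above, written against NumPy/OpenCV; the function name is ours and only loosely follows the reference implementation.

```python
import cv2

NET_STEP = 16  # network stride

def wpod_input(vehicle_crop):
    """Compute the WPOD-NET input for one vehicle crop following the steps above."""
    h, w = vehicle_crop.shape[:2]
    ratio = max(h, w) / float(min(h, w))
    side = int(ratio * 288.0)
    bound_dim = min(side + (side % NET_STEP), 608)   # the (x + x % 16) rule, capped at 608
    # Resize so the shorter side matches bound_dim, keeping the aspect ratio.
    factor = bound_dim / float(min(h, w))
    new_w, new_h = int(w * factor), int(h * factor)
    # Round both dimensions up to multiples of the network stride.
    new_w += (NET_STEP - new_w % NET_STEP) % NET_STEP
    new_h += (NET_STEP - new_h % NET_STEP) % NET_STEP
    return cv2.resize(vehicle_crop, (new_w, new_h))
```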

Next, the image is converted to floating point, scaled to [0, 1], and expanded into a four-dimensional tensor before being fed to the model for inference, which yields an output map of size w/16 × h/16 × 8.
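In code, this step is roughly the following, assuming `resized` is the output of `wpod_input` above and `model` is a Keras model such as the one sketched in Section 3.2:

```python
x = resized.astype("float32") / 255.0   # convert to float and scale to [0, 1]
x = x[None, ...]                        # add a batch dimension: (1, h, w, 3)
y = model.predict(x)[0]                 # output feature map of shape (h/16, w/16, 8)
```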

The output is then post-processed by reading the probability values and the affine transformation parameters separately. Since the annotated LP widths lie between 40 and 208 pixels, their mean divided by the stride, (40 + 208) / 2 / 16 = 7.75, is used as the average LP size in output units. The canonical plate is a unit square centered at the origin. The affine matrix is applied to this canonical square, the result is multiplied by the average LP size, and the center of the current output cell is added; finally, dividing by the size of the output map gives the plate position relative to the (resized) input image, with all values in [0, 1].

After all above-threshold outputs of the current image have been processed in this way, non-maximum suppression (NMS) is applied to remove duplicate detections of the same plate. Finally, another warp maps each detected plate to a standard size of 240 × 80 pixels.
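A minimal NumPy/OpenCV sketch of this post-processing for a single above-threshold cell, reusing `Q` and `affine_from_cell` from the Section 3.2 sketch; the NMS step is omitted and all names are ours.

```python
import cv2
import numpy as np

ALPHA = 7.75
LP_W, LP_H = 240, 80  # standard rectified plate size

def decode_cell(v, m, n, out_h, out_w):
    """Turn one above-threshold cell output into LP corner coordinates relative
    to the resized input image (values in [0, 1])."""
    A, b = affine_from_cell(v)                 # from the Section 3.2 sketch
    pts = (A @ Q + b) * ALPHA                  # warp canonical square, scale by mean LP size
    pts += np.array([[n + 0.5], [m + 0.5]])    # re-center at the current cell
    pts /= np.array([[out_w], [out_h]])        # normalize by the output map size
    return pts                                 # (2, 4): x and y of the four corners

def rectify_plate(image, rel_pts):
    """Warp the detected plate to the 240x80 canonical view."""
    h, w = image.shape[:2]
    src = (rel_pts * np.array([[w], [h]])).T.astype(np.float32)      # corners in pixels
    dst = np.float32([[0, 0], [LP_W, 0], [LP_W, LP_H], [0, LP_H]])
    H = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(image, H, (LP_W, LP_H))
```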

This completes the license plate detection and rectification process.

A journey of a thousand miles begins with a single step.

Original article: https://blog.csdn.net/cdknight_happy/article/details/93190934

 
