I Use AI to Reply to the Car-Selling Beauty series [YOLO License Plate Recognition] (4)

Review of the Last Issue

In the last issue, we approached license plate recognition from the data augmentation angle, applying mixup, affine transforms, blurring, and so on, and raised top-1 accuracy on the CCPD test set from 0.9683 to 0.991 (up 2.3 points). In real video, however, there were still many false detections: different views of the same license plate were recognized as many different plates. Although we did not reach the final goal, the experience made the difficulty of deploying deep learning projects much clearer. Working well on a public dataset does not mean a model will work well in a real deployment; each project still needs its own specific analysis. In this issue we re-examine our license plate recognition project from the perspective of network structure.

License Plate Recognition Network Analysis

The original license plate recognition network used ShuffleNet and was split into two networks, one for blue plates and one for green plates. The blue-plate network outputs a 7*34 probability matrix, and the green-plate network outputs an 8*34 probability matrix. The underlying reason is that a plain CNN classifier can only output predictions of a fixed dimension; it cannot produce variable-length output.
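As a minimal sketch of this fixed-output design (a hypothetical head, not the author's exact code; the feature size and layer names are assumptions), each network ends in a fully connected layer reshaped to a fixed character grid:

import torch
import torch.nn as nn

class FixedPlateHead(nn.Module):
    # Illustrative fixed-length head: one separate network per plate type.
    def __init__(self, feat_dim=1024, num_chars=7, num_classes=34):
        super().__init__()
        self.num_chars = num_chars
        self.num_classes = num_classes
        self.fc = nn.Linear(feat_dim, num_chars * num_classes)

    def forward(self, feats):  # feats: (N, feat_dim) backbone features
        out = self.fc(feats)   # (N, num_chars * num_classes)
        return out.view(-1, self.num_chars, self.num_classes)

blue_head = FixedPlateHead(num_chars=7)    # blue plate: 7*34 output
green_head = FixedPlateHead(num_chars=8)   # green plate: 8*34 output
print(blue_head(torch.randn(1, 1024)).shape)   # torch.Size([1, 7, 34])

Since each head's output shape is baked in at construction time, supporting a different plate length means building and training another network, which is exactly the limitation discussed next.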

License plate recognition is in fact similar to OCR text recognition, except that the format of a license plate is much simpler. Because the length, width, and character positions of a plate are roughly fixed, we can directly use a classification network to output a probability for each character. However, this approach has problems. First, splitting into two networks means blue plates and green plates must be trained and run separately. Second, the characters on a plate are not actually evenly spaced: there is usually a gap after the first letter, but our network design assumes evenly spaced characters, which biases what each receptive field learns. We therefore need to learn how mature OCR networks and mature license plate recognition networks solve this problem. Naturally, CTC entered our field of vision.

Introduction to CTC Technology

CTC, short for Connectionist Temporal Classification, was originally proposed for speech recognition and is now also widely used in OCR. It is designed for cases where the prediction length is not fixed and the label carries no position information. Take recognizing a sentence in OCR: sentences vary in length, but the network output is fixed, so suppose the longest sentence is 9 characters (for ease of drawing); the network then outputs 9*N predictions. Consider the sentence "hello", as shown in the figure below ('-' marks positions with no character). The prediction in the first row matches expectations well: the receptive field behind each predicted character is basically correct. The prediction in the second row is much worse: many predicted characters do not line up with their receptive fields. But from the label "hello" alone, there is no way to tell which prediction is better, and therefore no way to provide a proper supervision signal for training.

CTC solves exactly this problem. First, it introduces a blank character to mean "no character here". Second, consecutive identical characters are merged into one: the predicted output h, e, l, l, l, o, o is parsed as "helo". So what if there really are two "l"s, as in "hello"? CTC stipulates that a blank, written '-', must appear between two identical characters, so h, e, l, -, l, l, o, o parses correctly as "hello".
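The merging rule is easy to express in code. Below is a minimal greedy decoder (a sketch, not from the original post) that collapses repeated characters first and then drops blanks:

def ctc_greedy_decode(tokens, blank='-'):
    # Collapse consecutive duplicates, then remove blanks
    # (the CTC many-to-one mapping from paths to strings).
    out, prev = [], None
    for t in tokens:
        if t != prev:
            out.append(t)
        prev = t
    return ''.join(c for c in out if c != blank)

print(ctc_greedy_decode('helllloo'))    # -> 'helo'
print(ctc_greedy_decode('hel-lloo'))    # -> 'hello'

Iterating a string yields characters, so the raw prediction string can be passed in directly.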

CTC Calculation Example

Here is a very simple example of how CTC works. For convenience, assume our language has only two characters, t and o. The picture to recognize is shown in the figure below, and its label is "to". Assuming the network's output length is 5, the final output dimension is 5*(2+1); the extra 1 is for the blank character that CTC requires.

For this matrix, during training, predictions such as "ttooo", "-tooo", "-t-oo", "-to--", "-too-", and so on all count as correct. In CTC, identical consecutive characters are merged into one, so "ttooo" is equivalent to "to"; '-' is a blank character and is deleted, so "-tooo" is also equivalent to "to", and likewise for the others. All of these predictions are regarded as correct. So how do we obtain a loss? In effect, the probabilities of all these paths are summed to give the total probability of a correct prediction:

  • ttooo probability p1 = 0.6*0.7*0.5*0.6*0.7
  • -tooo probability p2 = 0.2*0.7*0.5*0.6*0.7
  • -t-oo probability p3 = 0.2*0.7*0.4*0.6*0.7
  • Subsequent calculations are omitted...

The final probability of a correct prediction is then p = p1 + p2 + p3 + ......, and the probability of a wrong prediction is 1 - p. This effectively turns training into a binary classification problem: the loss can be computed with cross entropy and gradients backpropagated through the network. In general, CTC enumerates all possible predictions, treats every prediction that is equivalent to the label under the rules above as correct, and then reduces training to this two-class formulation. A brute-force sketch of the idea follows below.
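To make this concrete, here is a brute-force sketch for the "to" example above. The per-step probabilities below are chosen so the three paths already listed get exactly p1, p2, and p3; the remaining entries are filled in arbitrarily for illustration. Real CTC implementations use dynamic programming rather than enumeration, but the result is the same:

import itertools
import math

# Hypothetical per-timestep distributions over ('-', 't', 'o'), T = 5 steps.
probs = [
    {'-': 0.2, 't': 0.6, 'o': 0.2},
    {'-': 0.2, 't': 0.7, 'o': 0.1},
    {'-': 0.4, 't': 0.1, 'o': 0.5},
    {'-': 0.3, 't': 0.1, 'o': 0.6},
    {'-': 0.2, 't': 0.1, 'o': 0.7},
]

def collapse(path, blank='-'):
    # Merge consecutive duplicates, then drop blanks.
    out, prev = [], None
    for c in path:
        if c != prev:
            out.append(c)
        prev = c
    return ''.join(c for c in out if c != blank)

# Sum the probability of every length-5 path that collapses to the label.
label, p_total = 'to', 0.0
for path in itertools.product('-to', repeat=5):
    if collapse(path) == label:
        p_total += math.prod(probs[t][c] for t, c in enumerate(path))

print(p_total)             # total probability of a correct prediction
print(-math.log(p_total))  # the CTC loss is the negative log of this sum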

CTC Loss in PyTorch

CTC loss is built into PyTorch and is very convenient to use. There are two main APIs: one creates the CTC loss, the other computes it.

  • API for creating CTC loss
ctc_loss = nn.CTCLoss(blank=len(CHARS)-1, reduction='mean')
# blank: the index of the blank character. As mentioned in the example above, CTC
# introduces a blank symbol; this argument is that symbol's index. For example, if the
# predicted text contains only the 26 English letters with indices 0~25, then blank
# can be defined as 26.
  • API for computing CTC loss
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
# log_probs: the network output, with shape (T, N, C). T is the output sequence length,
# N is the batch size, and C is the number of classes per character. For example,
# predicting 26 letters + 1 blank symbol gives C = 27.
# targets: the label values, a one-dimensional tensor formed by concatenating the
# labels of all N samples.
# input_lengths: the prediction length for each image (in practice [T, T, T, ...], N entries).
# target_lengths: the label length of each image, set according to the actual labels.
  • CTC loss example

Suppose we use CTC loss to predict license plate numbers. Each character has 31 (provinces) + 36 (letters and digits) + 1 (blank) = 68 possibilities, as follows:

char_dict = {0: 'Beijing',
             1: 'Shanghai',
             2: 'Jin',
             3: 'Yu',
             # ......
             31: '0',
             32: '1',
             33: '2',
             34: '3',
             35: '4',
             # ......
             41: 'A',
             42: 'B',
             43: 'C',
             # ......
             67: '-'}

A license plate has 7 characters (blue plate) or 8 characters (green plate). The recommended output length for CTC is 2*max_len + 1, to accommodate the case of a blank between every two characters plus blanks at the ends. So the output length here is 2*8 + 1 = 17.

Assuming a batch size of 2, the shape of the network's final output log_probs is T * N * C = 17 * 2 * 68.

Assuming the labels of the two pictures in the batch are "Beijing A88888" and "Shanghai AD12345", the character indices are [0, 41, 39, 39, 39, 39, 39] and [1, 41, 44, 32, 33, 34, 35, 36]. The final targets tensor is their concatenation: [0, 41, 39, 39, 39, 39, 39, 1, 41, 44, 32, 33, 34, 35, 36].

input_lengths is [17, 17]

target_lengths is [7, 8]
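Putting the pieces together, here is a minimal runnable sketch of this example. Random numbers stand in for a real network's output; the shapes and index values follow the example above:

import torch
import torch.nn as nn

T, N, C = 17, 2, 68   # output length 2*8+1, batch size 2, 67 characters + blank

ctc_loss = nn.CTCLoss(blank=67, reduction='mean')   # '-' has index 67 above

# Stand-in for real network output: (T, N, C) log-probabilities.
log_probs = torch.randn(T, N, C).log_softmax(dim=2)

# "Beijing A88888" (7 chars) and "Shanghai AD12345" (8 chars), concatenated.
targets = torch.tensor([0, 41, 39, 39, 39, 39, 39,
                        1, 41, 44, 32, 33, 34, 35, 36])

input_lengths = torch.tensor([17, 17])   # every sample uses the full length T
target_lengths = torch.tensor([7, 8])    # actual label lengths

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())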

Modifying the License Plate Recognition Network

With CTC loss, we no longer need to train blue plates and green plates separately; both can be trained in a single network. The final structure of the network is shown in the figure below. The network's final output is a 17*68 prediction matrix.

The corresponding relationship between the prediction matrix and the original license plate is shown in the figure below
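A minimal sketch of such a head (illustrative only; the author's exact ShuffleNet configuration is not shown here, so the feature sizes are assumptions) pools the backbone features to 17 positions and classifies each position over 68 classes:

import torch
import torch.nn as nn

class CTCPlateHead(nn.Module):
    # Illustrative CTC head: backbone features -> (T=17, N, C=68) log-probs.
    def __init__(self, feat_channels=256, seq_len=17, num_classes=68):
        super().__init__()
        # Collapse the height axis, keep 17 positions along the width axis.
        self.pool = nn.AdaptiveAvgPool2d((1, seq_len))
        self.classifier = nn.Conv2d(feat_channels, num_classes, kernel_size=1)

    def forward(self, feats):              # feats: (N, C_feat, H, W)
        x = self.pool(feats)               # (N, C_feat, 1, 17)
        x = self.classifier(x)             # (N, 68, 1, 17)
        x = x.squeeze(2).permute(2, 0, 1)  # (17, N, 68), as nn.CTCLoss expects
        return x.log_softmax(dim=2)

head = CTCPlateHead()
feats = torch.randn(2, 256, 4, 18)         # fake backbone feature map
print(head(feats).shape)                   # torch.Size([17, 2, 68])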

Network Training

After switching to CTC, the accuracy reached an astonishing 0.9999 after only 6 epochs of training.

The inference results using the CTC model on several test pictures are shown in the figure below:

The results are quite good. Some recognition errors remain, but overall this shows that CTC is indeed feasible.
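For reference, here is a hedged sketch of what inference might look like (model, chars, and the preprocessed input are placeholders, not the author's exact code):

import torch

def recognize_plate(model, image_tensor, chars):
    # image_tensor: (1, 3, H, W) preprocessed plate crop.
    # chars: the 68-entry character list from the dict above, chars[67] == '-'.
    with torch.no_grad():
        log_probs = model(image_tensor)          # (17, 1, 68)
    best = log_probs.argmax(dim=2).squeeze(1)    # (17,) best class per step
    out, prev = [], None
    for idx in best.tolist():
        # Skip repeats of the previous raw index, then skip blanks.
        if idx != prev and idx != len(chars) - 1:
            out.append(chars[idx])
        prev = idx
    return ''.join(out)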

False Detections on Video

Although the still-image tests above look good, false detections still occur on real video: the same license plate, seen from different angles, is recognized as several different (but similar) plate numbers. As shown in the figure below, there are three cars in the video, yet multiple license plates are reported.

Analyzing these false detections: first, the province character is the easiest to get wrong. The CCPD dataset was collected mainly from parking lots in Anhui, so the province "Wan" dominates the data while other provinces appear far less often. Without targeted data augmentation or training strategies, the model is therefore biased toward predicting "Wan". For our purpose of identifying vehicle entities by plate, however, this is not a big problem: we can simply ignore the province character, because the probability of two cars appearing at the same time and place with plates identical in every position except the province is essentially zero.

Next, the problem of multiple similar plate readings. False detections can be eliminated by computing the similarity between two plates. Under the same assumption as before, the probability of both AT95S5 and AT9SS5 appearing at the same time and place is very small, so when two plates are judged highly similar we can treat them as one plate; a code sketch follows below. The overall processing flow is summarized in the following figure:
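As a minimal sketch of that similarity check (assuming the province character has already been stripped; the distance measure and threshold are illustrative choices, not the author's exact method):

def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein distance via one-row dynamic programming.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def same_vehicle(body_a: str, body_b: str, max_dist: int = 1) -> bool:
    # Treat two detections as one car when the plate bodies
    # (province character removed) are within max_dist edits.
    return edit_distance(body_a, body_b) <= max_dist

print(same_vehicle('AT95S5', 'AT9SS5'))   # True: likely the same car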

Achievements

With the above strategy, the vehicle counts are now fairly accurate, as shown below.


Summary

In this issue, we introduced CTC to solve the variable-length problem in license plate recognition (7 characters for blue plates, 8 for green plates), allowing a single network to recognize and count both blue-plate and green-plate vehicles. In the next issue, we will take the system outdoors and count how many new energy vehicles there are. Stay tuned!
