Computer vision - day88 reading paper: Traffic target detection and recognition based on driver's attention field of view

The method uses the driver's 3D absolute gaze-point coordinates, obtained by jointly cross-calibrating a forward-facing stereo imaging system with a non-contact 3D gaze tracker. In the detection stage, multi-scale HOG-SVM and Faster R-CNN models are combined. The recognition stage validates the generated hypothesis set with a ResNet-101 network. The approach is applied to real data collected while driving in an urban environment.


II. RELATED WORKS

A. General Object Detection

General object detection algorithms can be divided into two categories: traditional and deep learning-based.

There are two main categories of object detection methods based on deep learning: region-based methods and regression-based methods.

B. Traffic sign detection and recognition

Symbol detection methods are generally classified into color-based methods, shape-based methods, and hybrid methods.

1. Color threshold segmentation is the most commonly used method among color-based methods, which reduces the search area by ignoring non-target regions.

2. Traffic signs also have specific shapes, which can be searched by shape-based methods. Due to its robustness to illumination changes and image noise, the Hough transform is one of the most commonly used shape-based methods.

3. The hybrid method exploits both the color and the shape of the sign. The classification stage mainly uses template matching, SVM, genetic algorithms (GA), artificial neural networks (ANN), AdaBoost, and deep learning-based methods (a combined color-and-shape sketch follows this list).
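To make the color, shape, and hybrid ideas above concrete, here is a minimal OpenCV sketch (my own, not from the paper): an HSV threshold for red regions followed by a Hough circle search on the mask. The HSV ranges and Hough parameters are illustrative assumptions.

```python
import cv2
import numpy as np

def find_red_circular_signs(bgr):
    """Hybrid color+shape search: HSV red threshold, then Hough circles."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    # Red wraps around the hue axis, so two ranges are combined (values assumed).
    mask = cv2.inRange(hsv, (0, 80, 60), (10, 255, 255)) | \
           cv2.inRange(hsv, (170, 80, 60), (180, 255, 255))
    mask = cv2.medianBlur(mask, 5)  # suppress speckle before the shape search
    # Shape-based step: Hough circle transform restricted to the color mask.
    circles = cv2.HoughCircles(mask, cv2.HOUGH_GRADIENT, dp=1.5, minDist=30,
                               param1=100, param2=25, minRadius=8, maxRadius=80)
    return [] if circles is None else circles[0].tolist()  # (x, y, r) triples
```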

Convolutional Neural Networks (CNNs) are a subset of deep neural network models capable of learning robust and discriminative features from raw data. Various CNNs have been used to recognize traffic signs.

C. Vehicle detection

Many traditional vehicle detection methods include a hypothesis generation (HG) step and a hypothesis validation (HV) step. Deep learning-based methods are also widely used.

D. Pedestrian detection

Pedestrian detection methods using deep learning can be classified into one-stage techniques and two-stage techniques.

E. Traffic light detection

Color segmentation is a commonly used method to reduce the search space in traffic scene images.

III. PROPOSED METHOD

In this section, we describe our proposed traffic object detection and recognition method based on the driver's attention field of view. First, the dataset used in this study is introduced. Based on this, we describe a method for locating the driver's attentional gaze region in a forward-facing stereo imaging system. Next, in the object detection stage, the models we train and the method used to enrich our dataset are described. We then discuss the region-of-interest (RoI) integration method we use. Finally, the object recognition stage is presented. Figure 1 illustrates our proposed framework.

I mainly want to read the first and third subsections; the second is left for interested readers to dig into on their own.


Figure 1. Framework overview. Our framework detects and recognizes traffic objects within the driver's field of view. From left to right:

a) RoadLAB vehicle with forward-facing stereo vision and eye-tracking system.

b) Dataset created by RoadLAB experimental vehicle.

c) Compute the driver's field of view as an attentional gaze cone, and locate the 2D ellipse obtained by reprojecting the field of view onto the image plane.

d) We use two different model types in the detection stage of the framework; Model A consists of a multi-scale HOG-SVM followed by a CNN applied in two steps, and Model B is a Faster R-CNN. The detection results are integrated through an NMS-based algorithm.

e) In the recognition stage, we train three independent models for traffic signs, vehicles, and traffic lights.

A. The RoadLAB Dataset

An essential element of deep learning based object detection systems is the availability of a large number of sample images. In this section, we present our own object dataset, built from the RoadLAB experimental data sequences.

References cited here:
1. A probabilistic model for visual driver gaze approximation from head pose estimation
2. Portable and scalable vision-based vehicular instrumentation for the analysis of driver intentionality
3. Multi-depth cross-calibration of remote eye gaze trackers and stereoscopic scene systems

Our dataset contains 3,225 background-class sample images, and 5,172, 1,984, 1,290, and 1,875 sample images for the traffic sign, vehicle, pedestrian, and traffic light object classes, respectively. The vehicle category includes 3 classes: cars, buses, and trucks. The traffic light category is divided into four classes: red, yellow, green, and unclear. Finally, the traffic sign category includes 19 different types of traffic signs. Additionally, some traffic sign categories include multiple sign types, such as "Maximum Speed Limit", "Construction", "Stopping", etc.

B. Driver Gaze Positioning

A circle in 3D space generally projects onto the imaging plane of a stereo sensor as a 2D ellipse.


Figure 2. (Top): Depiction of the driver's attention gaze cone. (Bottom): Reprojection of the 3D attention circle into the corresponding 2D ellipse on the image plane of the forward stereoscopic scene system.
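As a small sanity check on the circle-to-ellipse claim, the sketch below (not from the paper) samples a 3D attention circle, projects the samples through an assumed pinhole intrinsic matrix, and fits the resulting 2D ellipse with OpenCV; the intrinsics and circle pose are made-up values.

```python
import cv2
import numpy as np

# Assumed pinhole intrinsics; illustrative values, not the RoadLAB calibration.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

def reproject_attention_circle(center, normal, radius, n_pts=100):
    """Sample a 3D circle (in camera coordinates) and fit its 2D image ellipse."""
    normal = normal / np.linalg.norm(normal)
    # Build an orthonormal basis (u, v) spanning the circle's plane.
    u = np.cross(normal, [0.0, 0.0, 1.0])
    if np.linalg.norm(u) < 1e-6:  # normal is parallel to the optical axis
        u = np.array([1.0, 0.0, 0.0])
    u /= np.linalg.norm(u)
    v = np.cross(normal, u)
    theta = np.linspace(0.0, 2 * np.pi, n_pts, endpoint=False)
    pts3d = center + radius * (np.outer(np.cos(theta), u) +
                               np.outer(np.sin(theta), v))
    proj = (K @ pts3d.T).T                       # pinhole projection
    pts2d = (proj[:, :2] / proj[:, 2:3]).astype(np.float32)
    return cv2.fitEllipse(pts2d)                 # ((cx, cy), (axes), angle)

ellipse = reproject_attention_circle(center=np.array([0.5, 0.2, 10.0]),
                                     normal=np.array([0.1, 0.0, 1.0]),
                                     radius=1.5)
```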

C. Object detection stage

To detect traffic objects of interest inside and outside the driver's attention field, we adopt a framework consisting of two different model types, described below:

Model A

The first model consists of two steps: a multi-scale HOG-SVM followed by a ResNet-101 network. The HOG descriptor counts occurrences of gradient orientations in an image region and then applies block normalization, which improves invariance to changes in edge contrast and shadowing. Since the regions of interest (RoIs) contain objects of different sizes, we use a multi-scale approach to address the object detection problem. We take the HOG features extracted from each sliding window of each pyramid layer as independent samples and feed them into the SVM classifier.
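A minimal sketch of the multi-scale sliding-window idea, assuming scikit-image and scikit-learn; the window size, pyramid scales, and HOG parameters are illustrative, and `svm` stands for a classifier already trained on HOG features (not the paper's actual model).

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import rescale

def multiscale_hog_detect(gray, svm, window=(64, 64), step=16,
                          scales=(1.0, 0.75, 0.5)):
    """Slide a fixed window over an image pyramid; each window's HOG vector
    is scored independently by the SVM. Boxes are returned in original
    image coordinates as (x, y, w, h, label)."""
    detections = []
    for s in scales:
        layer = rescale(gray, s, anti_aliasing=True)
        h, w = layer.shape
        for y in range(0, h - window[0] + 1, step):
            for x in range(0, w - window[1] + 1, step):
                patch = layer[y:y + window[0], x:x + window[1]]
                feat = hog(patch, orientations=9, pixels_per_cell=(8, 8),
                           cells_per_block=(2, 2), block_norm='L2-Hys')
                label = svm.predict(feat.reshape(1, -1))[0]
                if label != 'background':
                    detections.append((int(x / s), int(y / s),
                                       int(window[1] / s), int(window[0] / s),
                                       label))
    return detections
```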


Figure 4 shows an internal view of the multi-scale HOG-SVM. The RoIs retained by the HOG-SVM classifier are classified into 5 categories: background, traffic sign, vehicle, pedestrian, and traffic light.

In the second stage, we used ResNet-101 [38], a popular CNN that has been trained on more than 1 million images from the ImageNet database.
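A minimal PyTorch/torchvision sketch of this second step: load ImageNet-pretrained ResNet-101 and swap its head for the five classes used here. The training hyperparameters are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load ResNet-101 pretrained on ImageNet (more than 1M images, 1000 classes).
model = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)

# Replace the 1000-way ImageNet head with the 5 classes used in this stage:
# background, traffic sign, vehicle, pedestrian, traffic light.
model.fc = nn.Linear(model.fc.in_features, 5)

# Illustrative fine-tuning setup; these hyperparameters are assumptions.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()
```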


Figure 5 shows sample results obtained using this model.

However, in our empirical experiments, we noticed that the multi-scale HOG-SVM has difficulty localizing vehicles that occupy most of the image (Figure 6 illustrates this problem). Therefore, we also use the Faster R-CNN model to detect vehicles.

Model B

We trained a Faster R-CNN model on the dataset to localize vehicles. In our empirical experiments, we observed that Model B was able to correctly detect vehicles occupying larger image regions, or vehicles very close to the instrumented vehicle. Conversely, based on our empirical experiments as well as our survey of the literature, we found that Faster R-CNN has difficulty handling objects with low resolution or small size. Therefore, to detect objects of different sizes, we combine the results of Model A and Model B to make the best use of both models. Hypotheses generated in this phase are passed directly to the integration phase, where the detection results are merged.
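A sketch of what Model B could look like with torchvision's Faster R-CNN; the paper does not specify a backbone, so the ResNet-50-FPN variant and the label set below are assumptions.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Start from a detector pretrained on COCO, then swap in a new box head.
weights = torchvision.models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=weights)

num_classes = 4  # background + car + bus + truck (assumed label set)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
```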


Figure 7 shows the vehicle detection results obtained by Model B.

D. Data augmentation

E. Integration of detection results

After completing the detection phase on test images, to improve detection performance, we eliminate redundant detections and merge the remaining detections into a complete set of results. To this end, we use a method based on Non-Maximum Suppression (NMS). When multiple bounding boxes overlap, NMS keeps the bounding box with the highest score and eliminates other bounding boxes whose overlap ratio exceeds a preset threshold. We use the Pascal overlap score to compute the overlap ratio a0 between two bounding boxes:

$$a_0 = \frac{\operatorname{area}(B_1 \cap B_2)}{\operatorname{area}(B_1 \cup B_2)}$$
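A plain NumPy restatement of this step (my own, not the paper's code): the Pascal overlap a0 followed by greedy NMS. The 0.5 overlap threshold is an assumed value.

```python
import numpy as np

def pascal_overlap(a, b):
    """a0 = area(intersection) / area(union) for boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(boxes, scores, threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes whose overlap
    with it exceeds the threshold, and repeat on the remainder."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if pascal_overlap(boxes[best], boxes[i]) <= threshold]
    return keep
```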

F. Object recognition stage

Figure 8 shows sample results for the four types of traffic objects. More precisely, a traffic light recognizer can classify traffic light hypotheses into 5 categories, a vehicle recognizer can classify vehicle hypotheses into 4 classes, and a traffic sign recognizer can classify traffic sign hypotheses into 20 classes.

IV. Experimental results

A. Parameters

To obtain fine-tuned parameters for each classifier model, we use cross-validation experiments on our training dataset. We split the training data into a base training set and a validation set. The classifier is trained on the base training set and then evaluated on the validation set. By exploring various ranges of tuning parameters, we select the parameter settings that yield the highest validation accuracy. The classifier is then retrained on the full training set with the tuned parameters. Our model achieves 95.1% and 94.2% accuracy on the training and validation sets, respectively. Finally, we test the model on pre-separated unseen data consisting of a set of randomly selected samples.
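The procedure described here amounts to a grid search over a held-out validation split. A scikit-learn sketch with synthetic stand-in data and an illustrative parameter grid (neither is from the paper):

```python
from itertools import product
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the real training features and labels.
X_train, y_train = make_classification(n_samples=300, n_features=20,
                                       random_state=0)

# Split the training data into a base training set and a validation set.
X_base, X_val, y_base, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=0)

best_acc, best_params = 0.0, None
for C, gamma in product([0.1, 1, 10], ['scale', 0.01]):   # illustrative grid
    model = SVC(C=C, gamma=gamma).fit(X_base, y_base)     # train on base set
    acc = model.score(X_val, y_val)                       # validation accuracy
    if acc > best_acc:
        best_acc, best_params = acc, {'C': C, 'gamma': gamma}

# Retrain on the full training set with the tuned parameters.
final_model = SVC(**best_params).fit(X_train, y_train)
```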

B. Results of the object detection phase

Table 1. Description of data augmentation


Table 2 describes the detection results as F1 scores for the different traffic objects.


Figure 10 shows the performance of our detectors using receiver operating characteristic (ROC) curves, plotting the true positive rate (TPR) against the false positive rate (FPR). In the figure, class1, class2, class3, and class4 represent pedestrians, traffic signs, traffic lights, and vehicles, respectively.

C. Credibility


The confidence spectrum in Figure 11 shows the overall confidence for pedestrians, traffic signs, traffic lights, and vehicles. The vehicle class has the highest confidence and the pedestrian class the lowest.

D. Results of the object recognition stage


Figure 13 presents the confusion matrix for traffic light recognition. The results show that the overall accuracy of the model reaches 96.2%.


The results shown in Figure 14 indicate that the overall classification accuracy of the vehicle recognizer model is 94.8%. The confusion matrix shows that the model is able to recognize vehicular objects (i.e., cars, buses, and trucks) with less than 3% probability of mislabeling. The background class has the lowest accuracy at 87.3%.

V. Conclusion

We conduct a literature review of detection and recognition methods for four important traffic object categories: traffic signs, vehicles, pedestrians, and traffic lights. In general, the availability of appropriate and sufficient training data is a crucial factor in learning a discriminative model. In this work, we collected more than 10,000 object sample images from a sequence belonging to the RoadLAB project [3]. We also enrich our training data with augmentation and hard example mining (HEM) strategies. The driver's attentional visual area is located on the imaging plane of the forward-facing stereo system, and a detection and recognition framework for traffic objects inside and outside the driver's attention field of view is designed. We considered 3, 4, and 19 different types of vehicles, traffic lights, and traffic signs, respectively. In the object detection stage, a traditional model and a deep learning-based model are combined to detect objects of different scales. Finally, in the recognition stage, with the trained ResNet-101 network, our framework achieves 96.1%, 96.2%, and 94.8% accuracy in classifying traffic signs, traffic lights, and vehicles, respectively.

Origin blog.csdn.net/qq_43537420/article/details/130461592