Implementation of a driver fatigue detection system based on feature detection and parameter estimation methods


1. Related work

At the 2018 CES Asia exhibition, iFlytek demonstrated their driver fatigue detection system. It uses computer vision to extract data from the camera feed such as face orientation and position, pupil orientation, eye opening and closing, blink frequency, and pupil contraction rate, and uses these data to estimate the driver's degree of concentration in real time. I tried their system on site, and it was very responsive and accurate.

2. Overall idea

Influenced by the iFlytek demo, for this final report I also wanted to use this approach based on feature detection and parameter calculation. This method is not a complete black box (compared with training a pure CNN on raw images), so it may be better in terms of interpretability. In addition, it can detect and record other physical signs, and that data also has a lot of value worth mining.

The experiment I conducted was divided into three steps:

  • Obtain facial feature points (eyes, nose, mouth, and the face contour) through image feature point detection. This can be solved with the well-known dlib library (King, 2018), which comes with a well-trained model, shape_predictor_68_face_landmarks.dat, that yields 68 facial key points.

  • Preprocess the 68 key points to obtain the face orientation, face position, and eye opening and closing information. The main difficulty in estimating the orientation and position of the face is that three-dimensional information is needed: with 3D coordinates, the OpenCV function solvePnP can compute an object's orientation and position, but a monocular camera has no depth information, so true 3D data cannot be obtained. Estimating the face orientation therefore requires some anthropometric statistics (the average distances between the root of the nose and the other facial organs). Computing the eye opening and closing degree is comparatively simple: since the feature point data is already available, the height between the eyelids is divided by the width between the corners of the eye. To increase the number of features and reduce the loss of information, the 68 feature points are also differenced end to end (converted into displacements) and included in the features.

  • Feed the features into an LSTM network for training and prediction. The 6-dimensional face pose, the 2 eye opening and closing values, and the 68 feature point displacements from the second step (each point contributes an x and a y displacement) give 144 features per picture, which are fed into the LSTM neural network for training and prediction.

3. Detecting the position of facial feature points in pictures

Facial feature point detection uses dlib, which has two key functions: dlib.get_frontal_face_detector() and dlib.shape_predictor(predictor_path). The former is a built-in face detection algorithm that uses a HOG pyramid to find the boundary of the face region. The latter detects feature points inside a region and outputs their coordinates; it requires a pre-trained model (passed in as a file path) to work properly.

Using the pre-trained model shape_predictor_68_face_landmarks.dat, the coordinates of the 68 feature points can be obtained. Connecting them gives the effect shown in the figure (red is the HOG pyramid detection result, blue is the shape_predictor result; lines only connect feature points belonging to the same facial organ).
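For reference, a minimal sketch of this detection step might look like the following (the image file name is just a placeholder; in the actual pipeline the frames come from the dataset videos):

```python
# Sketch: detect a face and extract the 68 landmark coordinates with dlib.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()   # built-in HOG-based face detector
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = cv2.imread("frame.jpg")                  # placeholder file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

for rect in detector(gray, 1):                 # rect: bounding box of one detected face
    shape = predictor(gray, rect)              # 68 landmarks inside that box
    points = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
    # In the usual 68-point layout, points 36-41 are the right eye and 42-47 the left eye.
```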

4. Preprocess the feature points to obtain the 6-dimensional face pose and the eye opening and closing degree

When a fatigued driver dozes off, the face droops, sometimes shakes slightly, and the eyes narrow to a squint. This is very different from an awake driver, who looks forward or slightly upward and turns the head steadily. This is our intuitive understanding of fatigued driving. Feeding this information into an LSTM network for training, so that the machine learns to recognize it automatically, may achieve good results. It is therefore important to capture this information from the images.

Let’s first discuss obtaining the 6-dimensional face pose. As mentioned above, a monocular camera provides no depth information, so the face orientation cannot be estimated from the image alone (projecting 3D coordinates onto a 2D plane loses information that cannot be recovered). To approximately lift the 2D information back to 3D, some additional prior knowledge is needed, such as the average distances between facial features given by anthropometry (Wikipedia, 2018).

Here I refer to a paper (Lemaignan SG, 2016) and its code implementation (Lemaignan, 2018). The code combines OpenCV with the average distances between facial features given on Wikipedia to model the human face. I modified the author's code so that it can batch-preprocess the video screenshots in the dataset and output the 6-dimensional face pose data as an array. For the specific modifications and how to run the code, please see the source code I submitted.
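The essence of the solvePnP approach can be sketched as follows. The 3D model coordinates below are rough anthropometric values used only for illustration (not the ones from the Lemaignan code), and the simple camera matrix is an assumption for an uncalibrated webcam:

```python
# Sketch of head-pose estimation with cv2.solvePnP (illustrative values only).
import numpy as np
import cv2

# Approximate 3D positions (rough mm values) of six landmarks relative to the nose tip.
MODEL_POINTS = np.array([
    (0.0, 0.0, 0.0),            # nose tip
    (0.0, -330.0, -65.0),       # chin
    (-225.0, 170.0, -135.0),    # left eye, outer corner
    (225.0, 170.0, -135.0),     # right eye, outer corner
    (-150.0, -150.0, -125.0),   # left mouth corner
    (150.0, -150.0, -125.0),    # right mouth corner
])

def head_pose(image_points, frame_w, frame_h):
    """image_points: (6, 2) array of the matching 2D landmarks from dlib."""
    focal = frame_w                                        # crude focal-length guess
    camera_matrix = np.array([[focal, 0, frame_w / 2],
                              [0, focal, frame_h / 2],
                              [0, 0, 1]], dtype=np.float64)
    dist_coeffs = np.zeros((4, 1))                         # assume no lens distortion
    ok, rvec, tvec = cv2.solvePnP(MODEL_POINTS,
                                  np.asarray(image_points, dtype=np.float64),
                                  camera_matrix, dist_coeffs)
    return rvec.ravel(), tvec.ravel()                      # 3 rotation + 3 translation = 6-D pose
```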

This algorithm is used to batch-process all images in the dataset. After processing, an additional 6-dimensional feature is obtained for each frame, with the effect as shown in the figure.

Next, the eye opening and closing information is calculated. The index layout of the 68 feature points is shown in the figure (Rosebrock, 2017):

The opening and closing degree of the right eye can be obtained by the following formula (the same applies to the left eye):
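This is the eye aspect ratio from the cited tutorial (Rosebrock, 2017): with p1, ..., p6 denoting the six landmarks of one eye (p1 and p4 at the eye corners, the others on the upper and lower eyelids), the opening degree is

    EAR = (||p2 - p6|| + ||p3 - p5||) / (2 * ||p1 - p4||)

i.e. the average eyelid height divided by the width between the eye corners.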

The 6-dimensional head pose plus the 2-dimensional eye opening and closing information give 8 dimensions in total. Ideally, information such as pupil orientation and pupil contraction rate should also be collected, but I did not manage to do so. As mentioned before, extracting only 8 features from a picture may lose a lot of information, so I also processed the 68 coordinate points and added them as features. Since the coordinates themselves are not very meaningful but their displacements are, I differenced the coordinate points end to end and converted them into displacements. Each displacement has an x and a y component, which adds another 136 features. This completes the preprocessing. Reducing a picture to 144 face-related features blocks out many image details such as lighting, skin color, head position (related to height), or curtains fluttering in the background (although some useful information may be lost as well). More importantly, a picture becomes 144 numbers, which greatly simplifies the input. With each frame this simple, it becomes feasible to model the relationship between frames (temporal information) without an excessive amount of computation.
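A sketch of how such a 144-dimensional frame vector could be assembled (the wrap-around difference of the last point back to the first is my reading of "differenced end to end", and all names are illustrative rather than taken from the submitted code):

```python
# Sketch: assemble the 144 features of one frame (names are illustrative).
import numpy as np

def frame_features(landmarks, pose6, ear_right, ear_left):
    """landmarks: (68, 2) point coordinates; pose6: 6 head-pose values; ear_*: eye opening degrees."""
    # Difference consecutive points (last point wraps to the first): 68 (dx, dy) pairs = 136 values.
    disp = np.roll(landmarks, -1, axis=0) - landmarks
    return np.concatenate([np.asarray(pose6, dtype=float),    # 6 head-pose features
                           [ear_right, ear_left],             # 2 eye-opening features
                           disp.ravel()])                     # 136 displacement features -> 144 total
```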

5. Train the LSTM network model and classify it

I had little exposure to machine learning during my undergraduate years, so this step is mostly borrowed from examples rather than deeply understood. After some research, I found that LSTM networks can be used to process temporal information. I tried TensorFlow first, but it was still difficult to get started, so I switched to Keras and used its higher-level, easier-to-understand API.

Keras's LSTM layer expects three-dimensional input: (number of samples, number of time frames (timesteps), feature vector). The preprocessed data is exactly three-dimensional, and most of it has the shape (?, 64, 144).

I built the LSTM network largely by referring to the sample code in the Keras documentation (Keras, 2018). After the LSTM layer, a Dense layer is added, and the model is complete.
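A minimal sketch along those lines (the layer width, the sigmoid output for a binary fatigued/awake label, and the optimizer are my assumptions, not necessarily the exact model from the report):

```python
# Sketch: LSTM followed by a Dense output layer, matching the (64, 144) input shape.
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(32, input_shape=(64, 144)))         # reads the 64-frame feature sequence
model.add(Dense(1, activation="sigmoid"))          # binary output: fatigued or not
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# x_train: (num_samples, 64, 144), y_train: (num_samples,)
# model.fit(x_train, y_train, epochs=500, validation_data=(x_test, y_test))
```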

I trained for 500 epochs and reached a final accuracy of 78.12% on the test set. Because the images have been reduced to features, training is very fast. The result is quite gratifying.

