Implementing eye tracking on AIxBoard based on OpenVINO

Author: Luo Yicheng of Shantou University

This article uses a small eye-tracking AI model as an example to walk through the process of defining a custom network in PyTorch, optimizing it with the OpenVINO™ NNCF quantization tool, and deploying it to the AIxBoard™ developer kit.

This project is open source: RedWhiteLuo/HeadEyeTrack (github.com)

Development environment: Windows 11 + PyCharm. The model was trained on an Intel® Core™ i7-12700H, and the deployment platform is the AIxBoard™.

Introduction to AIxBoard™ Developer Kit

Powered by the Intel® Celeron® processor N-series, this developer kit is pre-validated with Ubuntu* Desktop and the OpenVINO™ tool suite to help achieve more in education. This combination provides students with the performance they need to develop programming skills and prototype solutions in the areas of AI, vision processing, and the Internet of Things.

Of course the first step is to unbox it~~

1. Determine the overall process

In the V1.0 version I trained the neural network directly on eye images alone, and the results were not ideal. On inspection, I found that because the head naturally turns toward the direction of gaze, the samples in the training set differed from each other very little, which led to the unsatisfactory results. In the V1.5 version I therefore switched to a composite model structure that also takes head-position information as input:

2. Model structure

Taking the existing project "lookie-lookie" as a reference, we can learn the network structure it uses by reading its source code and make modifications based on it.

Therefore, in PyTorch we can inherit nn.Module and define the following model:

import torch
from torch import nn
from torch.nn import (Sequential, BatchNorm2d, Conv2d, LeakyReLU, MaxPool2d, ELU,
                      Tanh, Flatten, Dropout, Linear, Softplus, Sigmoid)


class EyeImageModel(nn.Module):
    def __init__(self):
        super(EyeImageModel, self).__init__()
        self.model = Sequential(
            # in-> [N, 3, 32, 128]
            BatchNorm2d(3),
            Conv2d(3, 2, kernel_size=(5, 5), padding=2),
            LeakyReLU(),
            MaxPool2d(kernel_size=(2, 2), stride=(2, 2)),
            Conv2d(2, 20, kernel_size=(5, 5), padding=2),
            ELU(),
            Conv2d(20, 10, kernel_size=(5, 5), padding=2),
            Tanh(),
            Flatten(1, 3),
            Dropout(0.01),
            Linear(10240, 1024),
            Softplus(),
            Linear(1024, 2),
        )

    def forward(self, x):
        return self.model(x)


class PositionOffset(nn.Module):
    def __init__(self):
        super(PositionOffset, self).__init__()
        self.model = Sequential(
            Conv2d(1, 32, kernel_size=(2, 2), padding=0),
            Softplus(),
            Conv2d(32, 64, kernel_size=(2, 2), padding=1),
            Conv2d(64, 64, kernel_size=(2, 2), padding=0),
            ELU(),
            Conv2d(64, 128, kernel_size=(2, 2), padding=0),
            Tanh(),
            Flatten(1, 3),
            Dropout(0.01),
            Linear(128, 32),
            Sigmoid(),
            Linear(32, 2),
        )

    def forward(self, x):
        return self.model(x)


class EyeTrackModel(nn.Module):
    def __init__(self):
        super(EyeTrackModel, self).__init__()
        self.eye_img_model = EyeImageModel()
        self.position_offset = PositionOffset()

    def forward(self, x):
        eye_img_result = self.eye_img_model(x[0])
        end = torch.cat((eye_img_result, x[1]), dim=1)
        end = torch.reshape(end, (-1, 1, 3, 3))
        end = self.position_offset(end)
        return end

The composite model is made up of two small models. EyeImageModel converts the eye image into two parameters; in EyeTrackModel these are concatenated with the head-position information into an N×1×3×3 matrix, which PositionOffset processes with convolutions to produce the final result.
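As a quick sanity check of the shapes involved, the composite model can be run on dummy tensors (the batch size below is arbitrary):

model = EyeTrackModel()
dummy_img = torch.randn(4, 3, 32, 128)   # batch of spliced left+right eye images
dummy_pos = torch.randn(4, 7)            # batch of head-position parameters
out = model((dummy_img, dummy_pos))
print(out.shape)                         # torch.Size([4, 2]) -> predicted screen coordinates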

3. Obtaining the training data set

After defining the network structure, we need to obtain enough data sets. Through the Peppa_Pig_Face_Landmark project, we can easily obtain 98 key points of the face.

610265158/Peppa_Pig_Face_Landmark: A simple face detect and alignment method, which is easy and stable. (github.com)

By following the mouse position with our eyes, obtaining the image and mouse position in real time and saving them, we can quickly obtain the data set.
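One simple way to grab the cursor position at capture time is to query it with a mouse-position API; the sketch below uses pyautogui, which is an assumption and not necessarily what the project itself uses:

import pyautogui

def get_mouse_coords():
    pos = pyautogui.position()   # current cursor position in screen pixels
    return [pos.x, pos.y]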

To keep as much useful information as possible, we first crop the two eye regions and stitch them into a single image, effectively removing the area around the bridge of the nose, and then save the result. This reduces, to some extent, the distortion of the eye images caused by head rotation.

import cv2
import numpy as np


def save_img_and_coords(img, coords, annot, saved_img_index):
    img_save_path = './dataset/img/' + '%d.png' % saved_img_index
    annot_save_path = './dataset/annot/' + '%d.txt' % saved_img_index
    cv2.imwrite(img_save_path, img)
    np.savetxt(annot_save_path, np.array([*coords, *annot]))
    print("[INFO] | SAVED:", saved_img_index)

def trim_eye_img(image, face_kp):
    """
    :param image: [H W C] 格式人脸图片
    :param face_kp: 面部关键点
    :return: 拼接后的图片 [H W C] 格式
    """
    l_l, l_r, l_t, l_b = return_boundary(face_kp[60:68])
    r_l, r_r, r_t, r_b = return_boundary(face_kp[68:76])
    left_eye_img = image[int(l_t):int(l_b), int(l_l):int(l_r)]
    right_eye_img = image[int(r_t):int(r_b), int(r_l):int(r_r)]
    left_eye_img = cv2.resize(left_eye_img, (64, 32), interpolation=cv2.INTER_AREA)
    right_eye_img = cv2.resize(right_eye_img, (64, 32), interpolation=cv2.INTER_AREA)
    return np.concatenate((left_eye_img, right_eye_img), axis=1)
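The return_boundary helper used above is not shown in this excerpt; a plausible sketch, assuming it simply returns the bounding box of a group of key points, is:

import numpy as np

def return_boundary(points):
    # assumption: points is an (N, 2) array of (x, y) key points
    points = np.asarray(points)
    left, right = points[:, 0].min(), points[:, 0].max()
    top, bottom = points[:, 1].min(), points[:, 1].max()
    return left, right, top, bottom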

In this step, each saved pair is named <index>.png and <index>.txt and stored in the /img and /annot subfolders respectively.

Note that when using cv2.VideoCapture(), the default frame size is (640, 480); the eye crops obtained after FaceLandMark are then too blurry, so the camera resolution needs to be set manually:

video_capture = cv2.VideoCapture(1)
video_capture.set(cv2.CAP_PROP_FRAME_WIDTH, WIDTH)
video_capture.set(cv2.CAP_PROP_FRAME_HEIGHT, HEIGHT)

4. Simple DataLoader

Since each sample image is only 32 × 128, which is quite small, we simply load the entire dataset directly into memory.

def EpochDataLoader(path, batch_size=64):
    """
    :param path: 数据集的根路径
    :param batch_size: batch_size
    :return: epoch_img, epoch_annots, epoch_coords:
[M, batch_size, C, H, W], [M, batch_size, 7], [M, batch_size, 2]
    """
    epoch_img, epoch_annots, epoch_coords = [], [], []
    all_file_name = os.listdir(path + "img/")  # get all file name -> list
    file_num = len(all_file_name)
    batch_num = file_num // batch_size

    for i in range(batch_num):  # how many batch
        curr_batch = all_file_name[batch_size * i:batch_size * (i + 1)]
        batch_img, batch_annots, batch_coords = [], [], []
        for file_name in curr_batch:

            img = cv2.imread(str(path) + "img/" + str(file_name))  # [H, W, C] format
            img = img.transpose((2, 0, 1))
            img = img / 255  # [C, H, W] format
            data = np.loadtxt(str(path) + "annot/" + str(file_name).split(".")[0] + ".txt")
            annot_mora, coord_mora = np.array([1920, 1080, 1920, 1080, 1, 1, 1.4]), np.array([1920, 1080])
            annot, coord = data[2:]/annot_mora, data[:2]/coord_mora

            batch_img.append(img)
            batch_annots.append(annot)
            batch_coords.append(coord)

        epoch_img.append(batch_img)
        epoch_annots.append(batch_annots)
        epoch_coords.append(batch_coords)

    epoch_img = torch.from_numpy(np.array(epoch_img)).float()
    epoch_annots = torch.from_numpy(np.array(epoch_annots)).float()
    epoch_coords = torch.from_numpy(np.array(epoch_coords)).float()
    return epoch_img, epoch_annots, epoch_coords

This function returns all samples at once, grouped into batches.
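A quick usage example (the dataset path is an assumption consistent with the save paths above):

epoch_img, epoch_annots, epoch_coords = EpochDataLoader("./dataset/", batch_size=64)
print(epoch_img.shape)   # [M, 64, 3, 32, 128], where M is the number of batches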

5. Define the loss function and train it

Since the network outputs N two-dimensional coordinates, torch.nn.MSELoss() is used directly as the loss function.

def eye_track_train():
    img, annot, coord = EpochDataLoader(TRAIN_DATASET_PATH, batch_size=TRAIN_BATCH_SIZE)
    batch_num = img.size()[0]
    model = EyeTrackModel().to(device).train()
    loss = torch.nn.MSELoss()
    optim = torch.optim.SGD(model.parameters(), lr=LEARN_STEP)
    writer = SummaryWriter(LOG_SAVE_PATH)

    trained_batch_num = 0
    for epoch in range(TRAIN_EPOCH):
        for batch in range(batch_num):
            batch_img = img[batch].to(device)
            batch_annot = annot[batch].to(device)
            batch_coords = coord[batch].to(device)
            # infer and calculate loss
            outputs = model((batch_img, batch_annot))
            result_loss = loss(outputs, batch_coords)
            # reset grad and calculate grad then optim model
            optim.zero_grad()
            result_loss.backward()
            optim.step()
            # save loss and print info
            trained_batch_num += 1
            writer.add_scalar("loss", result_loss.item(), trained_batch_num)
            print("[INFO]: trained epoch num | trained batch num | loss "
                  , epoch + 1, trained_batch_num, result_loss.item())
        if epoch % 100 == 0:
            torch.save(model, "../model/ET-" + str(epoch) + ".pt")
    # save model
    torch.save(model, "../model/ET-last.pt")
    writer.close()
    print("[SUCCEED!] model saved!")

The model is saved every 100 epochs during training, and saved again when training finishes.
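Once a checkpoint exists, it can be loaded for a quick inference check (the path follows the training script above; the dummy inputs are placeholders):

model = torch.load("../model/ET-last.pt", map_location="cpu").eval()
with torch.no_grad():
    pred = model((torch.randn(1, 3, 32, 128), torch.randn(1, 7)))
print(pred)   # predicted (x, y) screen coordinates, normalized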

6. Export to ONNX model

With torch.onnx.export() we can easily export an ONNX model:

from openvino.tools import mo
from openvino.runtime import serialize


def export_onnx(model_path, if_fp16=False):
    """
    :param model_path: path of the PyTorch model to convert
    :param if_fp16: whether to compress the converted model to FP16
    :return: output IR (.xml) model path
    """
    model = torch.load(model_path, map_location=torch.device('cpu')).eval()
    print(model)
    model_path = model_path.split(".")[0]
    dummy_input_img = torch.randn(1, 3, 32, 128, device='cpu')
    dummy_input_position = torch.randn(1, 7, device='cpu')
    # the model's forward takes a single (image, position) tuple, so wrap the dummy inputs in one tuple
    torch.onnx.export(model, ((dummy_input_img, dummy_input_position),), model_path + ".onnx", export_params=True)
    model = mo.convert_model(model_path + ".onnx", compress_to_fp16=if_fp16)  # if_fp16=False, output = FP32
    serialize(model, model_path + ".xml")
    print(EyeTrackModel(), "\n[FINISHED] CONVERT DONE!")
    return model_path + ".xml"
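An example call (the checkpoint file name is an assumption):

xml_path = export_onnx("ET-last.pt", if_fp16=True)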

7. Use OpenVINO's NNCF tool for INT8 quantization

The Neural Network Compression Framework (NNCF) provides a post-training quantization API, available in Python, that is aimed at reusing the code for model training or validation that is usually available with the model in the source framework, for example PyTorch* or TensorFlow*. The API is cross-framework and currently supports models represented in PyTorch, TensorFlow 2.x, ONNX, and OpenVINO.

Post-training Quantization with NNCF (new) — OpenVINO™ documentation

From the OpenVINO official documentation we know that post-training quantization with NNCF comes in two flavors:

- Basic quantization

- Quantization with accuracy control

import nncf
from openvino.runtime import Core, serialize


def basic_quantization(input_model_path):
    # prepare required data
    data = data_source(path=DATASET_ROOT_PATH)
    nncf_calibration_dataset = nncf.Dataset(data, transform_fn)
    # set the parameter of how to quantize
    subset_size = 1000
    preset = nncf.QuantizationPreset.MIXED
    # load model
    ov_model = Core().read_model(input_model_path)
    # perform quantize
    quantized_model = nncf.quantize(ov_model, nncf_calibration_dataset, preset=preset, subset_size=subset_size)
    # save model
    output_model_path = input_model_path.split(".")[0] + "_BASIC_INT8.xml"
    serialize(quantized_model, output_model_path)


def accuracy_quantization(input_model_path, max_drop):
    # prepare required data
    calibration_source = data_source(path=DATASET_ROOT_PATH, with_annot=False)
    validation_source = data_source(path=DATASET_ROOT_PATH, with_annot=True)
    calibration_dataset = nncf.Dataset(calibration_source, transform_fn)
    validation_dataset = nncf.Dataset(validation_source, transform_fn_with_annot)
    # load model
    xml_model = Core().read_model(input_model_path)
    # perform quantize
    quantized_model = nncf.quantize_with_accuracy_control(xml_model,
                                                          calibration_dataset=calibration_dataset,
                                                          validation_dataset=validation_dataset,
                                                          validation_fn=validate,
                                                          max_drop=max_drop)
    # save model
    output_model_path = input_model_path.split(".")[0] + "_ACC_INT8.xml"
    serialize(quantized_model, output_model_path)
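These two functions can then be invoked as follows (the IR file name and accuracy-drop threshold are assumptions):

basic_quantization("ET-last.xml")
accuracy_quantization("ET-last.xml", max_drop=0.01)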


One part that needs attention is nncf.Dataset(calibration_source, transform_fn). calibration_source must return an iterable object, where each iteration yields one calibration sample [1, C, H, W], and transform_fn is a function that converts that sample into the model's input format (for example changing the number of channels or swapping H and W). Here the operation is simply to normalize the data and convert it to numpy.
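As a minimal sketch, and assuming each calibration item bundles the eye image with its head-position vector (the exact item layout is not shown in this article), transform_fn could look like this:

import numpy as np

def transform_fn(data_item):
    img, position = data_item
    img = np.float32(img) / 255.0           # normalize the image to [0, 1]
    return [img, np.float32(position)]      # inputs in the order the model expects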

8. Performance improvement after quantization

Test hardware platform: Intel® Core™ i7-12700H

This small model can achieve significant performance improvements after quantization using the OpenVINO NNCF method:

benchmark_app -m ET-last_ACC_INT8.xml -d CPU -api async
[ INFO ] Execution Devices:['CPU']
[ INFO ] Count:            226480 iterations
[ INFO ] Duration:         60006.66 ms
[ INFO ] Latency:
[ INFO ]    Median:        3.98 ms
[ INFO ]    Average:       4.18 ms
[ INFO ]    Min:           2.74 ms
[ INFO ]    Max:           38.98 ms
[ INFO ] Throughput:   3774.25 FPS

benchmark_app -m ET-last_INT8.xml -d CPU -api async
[ INFO ] Execution Devices:['CPU']
[ INFO ] Count:            513088 iterations
[ INFO ] Duration:         60002.85 ms
[ INFO ] Latency:
[ INFO ]    Median:        1.46 ms
[ INFO ]    Average:       1.76 ms
[ INFO ]    Min:           0.82 ms
[ INFO ]    Max:           61.07 ms
[ INFO ] Throughput:   8551.06 FPS

9. Deploy on AIxBoard™ Developer Kit

Since Python is already installed on the AIxBoard™, you only need to install OpenVINO.

Download the Intel® Distribution of OpenVINO™ toolkit (intel.cn)

Then execute python eye_track.py in the root directory of the project to view the network's inference results, as shown in the figures below.

Performance overview (figures): inference running on the iGPU and on the CPU.
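For reference, a minimal OpenVINO™ Runtime inference sketch for the quantized model could look like this (the model file name and device are assumptions):

from openvino.runtime import Core
import numpy as np

core = Core()
compiled = core.compile_model("ET-last_INT8.xml", device_name="GPU")   # or "CPU"
eye_img = np.random.rand(1, 3, 32, 128).astype(np.float32)             # spliced eye image
head_pos = np.random.rand(1, 7).astype(np.float32)                     # head-position vector
result = compiled([eye_img, head_pos])[compiled.output(0)]
print(result)   # predicted normalized screen coordinates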

10. Summary

OpenVINO™ provides a convenient and fast development flow: model conversion and model quantization can both be accomplished with just a few core APIs.

AIxBoard™ provides a versatile, high-performance x86 deployment platform. Its small size makes it well suited for the final deployment of a project like this.

After the self-trained PyTorch model is optimized with the OpenVINO™ model optimization tools, inference is performed with the OpenVINO™ Runtime. For a small model like the one above, this brings a huge performance improvement, while the inference workflow remains simple and clear: inference on the developer board needs only a few core functions.

Source: blog.csdn.net/gc5r8w07u/article/details/132829746