Author: Luo Yicheng of Shantou University
This article uses the training of a small eye-tracking AI model as a running example to walk through the whole process: defining a custom network in PyTorch, optimizing the model with the OpenVINO™ NNCF quantization tool, and deploying it on the AIxBoard™ developer kit.
This project is open source: RedWhiteLuo/HeadEyeTrack (github.com)
Development environment: Windows 11 + PyCharm. The model is trained on a 12700H platform and deployed on AIxBoard™.
Introduction to AIxBoard™ Developer Kit
Powered by the Intel® Celeron® processor N-series, this developer kit is pre-validated with Ubuntu* Desktop and the OpenVINO™ tool suite to help achieve more in education. This combination provides students with the performance they need to develop programming skills and prototype solutions in the areas of AI, vision processing, and the Internet of Things.
Of course the first step is to unbox it~~
1. Determine the overall process
In version V1.0, I trained the neural network directly on eye images and found that the results were not ideal. On inspection, I found that because the head tends to turn toward the direction of gaze, the eye images in the training set differed very little from one another. Version V1.5 therefore adopts a composite model structure that introduces head position information:
2. Model structure
Taking the existing "lookie-lookie" project as a reference, we can learn the network structure it uses by reading its source code, and then modify it from there.
Therefore, in PyTorch we can inherit from nn.Module and define the following model:
import torch
from torch import nn
from torch.nn import (BatchNorm2d, Conv2d, Dropout, ELU, Flatten, LeakyReLU,
                      Linear, MaxPool2d, Sequential, Sigmoid, Softplus, Tanh)


class EyeImageModel(nn.Module):
    """Converts the stitched eye image into two values."""

    def __init__(self):
        super(EyeImageModel, self).__init__()
        self.model = Sequential(
            # in-> [N, 3, 32, 128]
            BatchNorm2d(3),
            Conv2d(3, 2, kernel_size=(5, 5), padding=2),
            LeakyReLU(),
            MaxPool2d(kernel_size=(2, 2), stride=(2, 2)),
            Conv2d(2, 20, kernel_size=(5, 5), padding=2),
            ELU(),
            Conv2d(20, 10, kernel_size=(5, 5), padding=2),
            Tanh(),
            Flatten(1, 3),
            Dropout(0.01),
            Linear(10240, 1024),
            Softplus(),
            Linear(1024, 2),
        )

    def forward(self, x):
        return self.model(x)


class PositionOffset(nn.Module):
    """Combines the eye-image result with the head information, input [N, 1, 3, 3]."""

    def __init__(self):
        super(PositionOffset, self).__init__()
        self.model = Sequential(
            Conv2d(1, 32, kernel_size=(2, 2), padding=0),
            Softplus(),
            Conv2d(32, 64, kernel_size=(2, 2), padding=1),
            Conv2d(64, 64, kernel_size=(2, 2), padding=0),
            ELU(),
            Conv2d(64, 128, kernel_size=(2, 2), padding=0),
            Tanh(),
            Flatten(1, 3),
            Dropout(0.01),
            Linear(128, 32),
            Sigmoid(),
            Linear(32, 2),
        )

    def forward(self, x):
        return self.model(x)


class EyeTrackModel(nn.Module):
    def __init__(self):
        super(EyeTrackModel, self).__init__()
        self.eye_img_model = EyeImageModel()
        self.position_offset = PositionOffset()

    def forward(self, x):
        # x[0]: eye images [N, 3, 32, 128], x[1]: head information [N, 7]
        eye_img_result = self.eye_img_model(x[0])
        # 2 + 7 = 9 values per sample, reshaped into a 3 x 3 matrix
        end = torch.cat((eye_img_result, x[1]), dim=1)
        end = torch.reshape(end, (-1, 1, 3, 3))
        end = self.position_offset(end)
        return end
The composite model is made up of two small models. EyeImageModel converts the eye image into two parameters. In EyeTrackModel, these two parameters are combined with the head position information into an N*1*3*3 matrix, which is passed through the convolution layers of PositionOffset to produce the final result.
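The input sizes of the first Linear layers (10240 and 128) follow from the convolution and pooling parameters in the model definition. The short sketch below recomputes them with the standard output-size formula; the helper names are made up for this illustration and are not part of the project.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Output length of one spatial dimension after a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def eye_image_flatten_size(height=32, width=128):
    # Conv2d(3, 2, 5x5, padding=2): spatial size is preserved
    h, w = conv_out(height, 5, 2), conv_out(width, 5, 2)
    # MaxPool2d(2x2, stride=2): both dimensions are halved
    h, w = conv_out(h, 2, 0, 2), conv_out(w, 2, 0, 2)
    # Two more 5x5 convolutions with padding=2: size is preserved again
    for _ in range(2):
        h, w = conv_out(h, 5, 2), conv_out(w, 5, 2)
    return 10 * h * w  # the last convolution has 10 output channels


def position_offset_flatten_size(side=3):
    # (kernel, padding) of the four Conv2d layers in PositionOffset
    for kernel, padding in [(2, 0), (2, 1), (2, 0), (2, 0)]:
        side = conv_out(side, kernel, padding)
    return 128 * side * side  # the last convolution has 128 output channels


print(eye_image_flatten_size())        # 10240
print(position_offset_flatten_size())  # 128
```

If the input image size or any kernel or padding value is changed, the first Linear layer of the affected sub-model has to be changed to match.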
3. Obtaining the training data set
After defining the network structure, we need to obtain enough data sets. Through the Peppa_Pig_Face_Landmark project, we can easily obtain 98 key points of the face.
By following the mouse position with our eyes, obtaining the image and mouse position in real time and saving them, we can quickly obtain the data set.
To keep the proportion of useful information as high as possible, we first crop the images of the two eyes and then stitch them into a single image, which removes the bridge of the nose, and then save the result. To a certain extent, this also reduces the distortion of the eye images caused by head deflection.
import cv2
import numpy as np


def save_img_and_coords(img, coords, annot, saved_img_index):
    img_save_path = './dataset/img/' + '%d.png' % saved_img_index
    annot_save_path = './dataset/annot/' + '%d.txt' % saved_img_index
    cv2.imwrite(img_save_path, img)
    # one text file per sample: mouse coordinates first, then the head information
    np.savetxt(annot_save_path, np.array([*coords, *annot]))
    print("[INFO] | SAVED:", saved_img_index)


def trim_eye_img(image, face_kp):
    """
    :param image: face image in [H W C] format
    :param face_kp: facial key points
    :return: the stitched image in [H W C] format
    """
    # return_boundary is a helper from the project (not shown here)
    l_l, l_r, l_t, l_b = return_boundary(face_kp[60:68])
    r_l, r_r, r_t, r_b = return_boundary(face_kp[68:76])
    left_eye_img = image[int(l_t):int(l_b), int(l_l):int(l_r)]
    right_eye_img = image[int(r_t):int(r_b), int(r_l):int(r_r)]
    # cv2.resize takes (width, height): each eye becomes 64 wide and 32 high
    left_eye_img = cv2.resize(left_eye_img, (64, 32), interpolation=cv2.INTER_AREA)
    right_eye_img = cv2.resize(right_eye_img, (64, 32), interpolation=cv2.INTER_AREA)
    # stitched side by side -> [32, 128, C]
    return np.concatenate((left_eye_img, right_eye_img), axis=1)
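The return_boundary helper used in trim_eye_img is not shown in this article. Judging only from how its result is unpacked (left, right, top, bottom), a minimal version could look like the sketch below. This is an assumption for illustration; the project's actual helper may differ, for example by adding a margin around the eye.

```python
def return_boundary(points):
    """Return (left, right, top, bottom) of the box enclosing a set of (x, y) points.

    Hypothetical stand-in: the real helper in the project may behave differently.
    """
    xs = [float(p[0]) for p in points]
    ys = [float(p[1]) for p in points]
    return min(xs), max(xs), min(ys), max(ys)


# Made-up key points, for illustration only
eye_points = [(100.5, 60.0), (110.0, 55.2), (120.3, 58.1), (112.0, 64.9)]
left, right, top, bottom = return_boundary(eye_points)
print(left, right, top, bottom)  # 100.5 120.3 55.2 64.9
```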
In this step, name the saved files index.png and index.txt, and save them in the /img and /annot subfolders.
Note that when using cv2.VideoCapture(), the default size of the captured image is (640, 480). The eye images cropped after face landmark detection are then too blurry, so the camera resolution needs to be set manually:
# WIDTH and HEIGHT are the desired capture resolution in pixels (project constants, not shown here)
vide_capture = cv2.VideoCapture(1)
vide_capture.set(cv2.CAP_PROP_FRAME_WIDTH, WIDTH)
vide_capture.set(cv2.CAP_PROP_FRAME_HEIGHT, HEIGHT)
4. Simple DataLoader
Since the size of each sample image is only 32 x 128, which is relatively small, we simply load them all directly into the memory.
import os

import cv2
import numpy as np
import torch


def EpochDataLoader(path, batch_size=64):
    """
    :param path: root path of the dataset
    :param batch_size: batch_size
    :return: epoch_img, epoch_annots, epoch_coords:
             [M, batch_size, C, H, W], [M, batch_size, 7], [M, batch_size, 2]
    """
    epoch_img, epoch_annots, epoch_coords = [], [], []
    all_file_name = os.listdir(path + "img/")  # get all file name -> list
    file_num = len(all_file_name)
    batch_num = file_num // batch_size  # files that do not fill a whole batch are dropped
    for i in range(batch_num):  # how many batch
        curr_batch = all_file_name[batch_size * i:batch_size * (i + 1)]
        batch_img, batch_annots, batch_coords = [], [], []
        for file_name in curr_batch:
            img = cv2.imread(str(path) + "img/" + str(file_name))  # [H, W, C] format
            img = img.transpose((2, 0, 1))  # [C, H, W] format
            img = img / 255  # scale pixel values to [0, 1]
            data = np.loadtxt(str(path) + "annot/" + str(file_name).split(".")[0] + ".txt")
            # scale factors for the 7 head-information values and the 2 mouse coordinates
            annot_mora, coord_mora = np.array([1920, 1080, 1920, 1080, 1, 1, 1.4]), np.array([1920, 1080])
            annot, coord = data[2:]/annot_mora, data[:2]/coord_mora
            batch_img.append(img)
            batch_annots.append(annot)
            batch_coords.append(coord)
        epoch_img.append(batch_img)
        epoch_annots.append(batch_annots)
        epoch_coords.append(batch_coords)
    epoch_img = torch.from_numpy(np.array(epoch_img)).float()
    epoch_annots = torch.from_numpy(np.array(epoch_annots)).float()
    epoch_coords = torch.from_numpy(np.array(epoch_coords)).float()
    return epoch_img, epoch_annots, epoch_coords
This function can return all samples at once.
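Two details of the loader are easy to miss: each annotation file holds nine numbers (two mouse coordinates followed by seven head-information values) that are scaled by fixed constants, and samples that do not fill a complete batch are dropped. The sketch below reproduces both steps with plain Python lists; the record values and file names are made up for illustration.

```python
SCREEN = (1920, 1080)                               # scale for the mouse coordinates
ANNOT_SCALE = (1920, 1080, 1920, 1080, 1, 1, 1.4)   # scale for the head information


def split_record(values):
    """Split one 9-value record into normalized (coord, annot) lists."""
    coord = [v / s for v, s in zip(values[:2], SCREEN)]
    annot = [v / s for v, s in zip(values[2:], ANNOT_SCALE)]
    return coord, annot


def make_batches(names, batch_size):
    """Group names into full batches; the incomplete remainder is dropped."""
    batch_num = len(names) // batch_size
    return [names[i * batch_size:(i + 1) * batch_size] for i in range(batch_num)]


record = [960, 540, 480, 270, 1440, 810, 0.5, 0.25, 0.7]  # made-up sample
coord, annot = split_record(record)
print(coord)       # [0.5, 0.5]
print(len(annot))  # 7

batches = make_batches(["%d.png" % i for i in range(150)], 64)
print(len(batches), sum(len(b) for b in batches))  # 2 128
```

With 150 images and a batch size of 64, only 128 images are used in each epoch, so it can be worth choosing a batch size that divides the dataset size.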
5. Define the loss function and train it
Since the output result of the network is N two-dimensional coordinates, torch.nn.MSELoss() is used directly as the loss function.
import torch
from torch.utils.tensorboard import SummaryWriter

# TRAIN_DATASET_PATH, TRAIN_BATCH_SIZE, TRAIN_EPOCH, LEARN_STEP, LOG_SAVE_PATH
# and device are configuration values defined elsewhere in the project.


def eye_track_train():
    img, annot, coord = EpochDataLoader(TRAIN_DATASET_PATH, batch_size=TRAIN_BATCH_SIZE)
    batch_num = img.size()[0]
    model = EyeTrackModel().to(device).train()
    loss = torch.nn.MSELoss()
    optim = torch.optim.SGD(model.parameters(), lr=LEARN_STEP)
    writer = SummaryWriter(LOG_SAVE_PATH)

    trained_batch_num = 0
    for epoch in range(TRAIN_EPOCH):
        for batch in range(batch_num):
            batch_img = img[batch].to(device)
            batch_annot = annot[batch].to(device)
            batch_coords = coord[batch].to(device)
            # infer and calculate loss
            outputs = model((batch_img, batch_annot))
            result_loss = loss(outputs, batch_coords)
            # reset grad and calculate grad then optim model
            optim.zero_grad()
            result_loss.backward()
            optim.step()
            # save loss and print info
            trained_batch_num += 1
            writer.add_scalar("loss", result_loss.item(), trained_batch_num)
            print("[INFO]: trained epoch num | trained batch num | loss ",
                  epoch + 1, trained_batch_num, result_loss.item())
        # save a checkpoint every 100 epochs
        if epoch % 100 == 0:
            torch.save(model, "../model/ET-" + str(epoch) + ".pt")
    # save model
    torch.save(model, "../model/ET-last.pt")
    writer.close()
    print("[SUCCEED!] model saved!")
The model is saved every 100 epochs during training, and once more after training is completed.
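To make the loss value easier to interpret: with its default settings, torch.nn.MSELoss() averages the squared differences over every element of the batch, that is, over both coordinates of all N samples. The plain-Python sketch below computes the same quantity for a made-up batch of two samples.

```python
def mse_loss(outputs, targets):
    """Mean of the squared differences over all elements (default 'mean' reduction)."""
    diffs = [(o - t) ** 2
             for out, tgt in zip(outputs, targets)
             for o, t in zip(out, tgt)]
    return sum(diffs) / len(diffs)


# Made-up normalized predictions and labels, N = 2
outputs = [[0.5, 0.5], [0.25, 0.75]]
targets = [[0.5, 0.25], [0.5, 0.75]]
print(mse_loss(outputs, targets))  # 0.03125
```

Because the labels are divided by the screen size in the data loader, the loss is expressed in normalized screen units rather than in pixels.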
6. Export to ONNX model
With torch.onnx.export() we can easily export the ONNX model, which is then converted to the OpenVINO IR format with mo.convert_model():
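import os

import torch
from openvino.runtime import serialize
from openvino.tools import mo


def export_onnx(model_path, if_fp16=False):
    """
    :param model_path: path of the PyTorch model to convert
    :param if_fp16: whether to compress the model to FP16
    :return: path of the output XML model
    """
    # the whole model object was saved, so EyeTrackModel must be importable here
    model = torch.load(model_path, map_location=torch.device('cpu')).eval()
    print(model)
    # strip the extension; splitext also works for paths such as "../model/ET-last.pt"
    model_path = os.path.splitext(model_path)[0]
    dummy_input_img = torch.randn(1, 3, 32, 128, device='cpu')
    dummy_input_position = torch.randn(1, 7, device='cpu')
    # the list is passed to forward() as the single argument x
    torch.onnx.export(model, [dummy_input_img, dummy_input_position], model_path + ".onnx", export_params=True)
    model = mo.convert_model(model_path + ".onnx", compress_to_fp16=if_fp16)  # if_fp16=False, output = FP32
    serialize(model, model_path + ".xml")
    print(EyeTrackModel(), "\n[FINISHED] CONVERT DONE!")
    return model_path + ".xml"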
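One detail is worth checking when the output file name is derived from the model path. Splitting a path on "." and taking the first piece returns an empty string for relative paths such as "../model/ET-last.pt", which is where the training step saves the model. The sketch below compares that approach with os.path.splitext; the two function names are made up for this comparison.

```python
import os


def strip_extension_by_split(path):
    """Take everything before the first dot in the path."""
    return path.split(".")[0]


def strip_extension(path):
    """Remove only the final extension, keeping any leading '../'."""
    return os.path.splitext(path)[0]


for path in ["ET-last.pt", "../model/ET-last.pt"]:
    print(repr(strip_extension_by_split(path)), repr(strip_extension(path)))
# 'ET-last' 'ET-last'
# '' '../model/ET-last'
```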
7. Use OpenVINO's NNCF tool for INT8 quantization
Neural Network Compression Framework (NNCF) provides a new post-training quantization API available in Python that is aimed at reusing the code for model training or validation that is usually available with the model in the source framework, for example, PyTorch* or TensorFlow*. The API is cross-framework and currently supports models represented in the following frameworks: PyTorch, TensorFlow 2.x, ONNX, and OpenVINO.
Post-training Quantization with NNCF (new) — OpenVINO™ documentation
From OpenVINO's official documentation we learn that post-training quantization with NNCF is divided into two sub-modules:
- Basic quantization
- Quantization with accuracy control
import os

import nncf
from openvino.runtime import Core, serialize

# data_source, transform_fn, transform_fn_with_annot, validate and DATASET_ROOT_PATH
# are defined elsewhere in the project.


def basic_quantization(input_model_path):
    # prepare required data
    data = data_source(path=DATASET_ROOT_PATH)
    nncf_calibration_dataset = nncf.Dataset(data, transform_fn)
    # set the parameter of how to quantize
    subset_size = 1000
    preset = nncf.QuantizationPreset.MIXED
    # load model
    ov_model = Core().read_model(input_model_path)
    # perform quantize
    quantized_model = nncf.quantize(ov_model, nncf_calibration_dataset, preset=preset, subset_size=subset_size)
    # save model
    output_model_path = os.path.splitext(input_model_path)[0] + "_BASIC_INT8.xml"
    serialize(quantized_model, output_model_path)


def accuracy_quantization(input_model_path, max_drop):
    # prepare required data
    calibration_source = data_source(path=DATASET_ROOT_PATH, with_annot=False)
    validation_source = data_source(path=DATASET_ROOT_PATH, with_annot=True)
    calibration_dataset = nncf.Dataset(calibration_source, transform_fn)
    validation_dataset = nncf.Dataset(validation_source, transform_fn_with_annot)
    # load model
    xml_model = Core().read_model(input_model_path)
    # perform quantize
    quantized_model = nncf.quantize_with_accuracy_control(xml_model,
                                                          calibration_dataset=calibration_dataset,
                                                          validation_dataset=validation_dataset,
                                                          validation_fn=validate,
                                                          max_drop=max_drop)
    # save model (the original referenced an undefined xml_model_path here)
    output_model_path = os.path.splitext(input_model_path)[0] + "_ACC_INT8.xml"
    serialize(quantized_model, output_model_path)
Pay particular attention to nncf.Dataset(calibration_source, transform_fn). calibration_source must be an iterable object, and each iteration returns one training sample of shape [1, C, H, W]. transform_fn is a function that converts each such sample into the form the model expects (for example, changing the number of channels or swapping H and W). Here it normalizes the sample and converts it to numpy.
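The data_source and transform_fn helpers are not listed in this article. The sketch below only illustrates the pattern of an iterable that yields raw samples plus a function applied to each item. It uses plain Python lists instead of numpy arrays, and the sample layout (image, head information, label) is an assumption rather than the project's actual format.

```python
def data_source(samples, with_annot=False):
    """Yield one sample per iteration (hypothetical stand-in for the project's helper)."""
    for image, head_info, target in samples:
        if with_annot:
            yield image, head_info, target
        else:
            yield image, head_info


def transform_fn(item):
    """Turn one raw item into model inputs by scaling pixel values to [0, 1]."""
    image, head_info = item[0], item[1]
    scaled = [[[px / 255 for px in row] for row in channel] for channel in image]
    return scaled, head_info


# One made-up sample: a 1 x 2 x 2 "image", 7 head values, and a label
samples = [([[[0, 255], [51, 102]]], [0.1] * 7, [0.5, 0.5])]

# In the real pipeline, both objects would be handed to nncf.Dataset instead of this loop
for item in data_source(samples):
    image, head_info = transform_fn(item)
    print(image)  # [[[0.0, 1.0], [0.2, 0.4]]]
```

Keeping the transformation in a separate function means the same data source can feed both calibration (inputs only) and validation (inputs plus labels) with different transform functions.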
8. Performance improvement after quantization
Test hardware platform: 12700H
This small model can achieve significant performance improvements after quantization using the OpenVINO NNCF method:
benchmark_app -m ET-last_ACC_INT8.xml -d CPU -api async
[ INFO ] Execution Devices:['CPU']
[ INFO ] Count: 226480 iterations
[ INFO ] Duration: 60006.66 ms
[ INFO ] Latency:
[ INFO ] Median: 3.98 ms
[ INFO ] Average: 4.18 ms
[ INFO ] Min: 2.74 ms
[ INFO ] Max: 38.98 ms
[ INFO ] Throughput: 3774.25 FPS
benchmark_app -m ET-last_INT8.xml -d CPU -api async
[ INFO ] Execution Devices:['CPU']
[ INFO ] Count: 513088 iterations
[ INFO ] Duration: 60002.85 ms
[ INFO ] Latency:
[ INFO ] Median: 1.46 ms
[ INFO ] Average: 1.76 ms
[ INFO ] Min: 0.82 ms
[ INFO ] Max: 61.07 ms
[ INFO ] Throughput: 8551.06 FPS
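To put the two benchmark_app logs side by side, the sketch below takes the numbers exactly as reported above, checks that the reported throughput matches iterations divided by duration, and computes the ratios between the two runs. The dictionary layout is just a convenient structure for this example.

```python
runs = {
    "ET-last_ACC_INT8.xml": {"count": 226480, "duration_ms": 60006.66,
                             "median_ms": 3.98, "fps": 3774.25},
    "ET-last_INT8.xml": {"count": 513088, "duration_ms": 60002.85,
                         "median_ms": 1.46, "fps": 8551.06},
}


def throughput(run):
    """Iterations per second, recomputed from the iteration count and duration."""
    return run["count"] / (run["duration_ms"] / 1000)


first = runs["ET-last_ACC_INT8.xml"]
second = runs["ET-last_INT8.xml"]

# The recomputed values agree with the reported FPS to within rounding
for name, run in runs.items():
    print(name, round(throughput(run), 2), run["fps"])

print(round(second["fps"] / first["fps"], 2))              # 2.27
print(round(first["median_ms"] / second["median_ms"], 2))  # 2.73
```

On this machine, the second model delivers roughly 2.3 times the throughput of the first, with a median latency about 2.7 times lower.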
9. Deploy on AIxBoard™ Developer Kit
Since Python is already installed on AIxBoard™, you only need to install OpenVINO.
Download the Intel distribution OpenVINO tool suite (intel.cn)
Then run python eye_track.py in the root directory of the project to view the inference results of the network, as shown in the figure below.
Performance overview:
Running performance on iGPU
Performance running on CPU
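The script eye_track.py itself is not listed in this article. As a rough idea of what inference with OpenVINO™ Runtime involves, here is a minimal sketch. The function names, the input order (image first, head information second), and the conversion back to a 1920 x 1080 screen are assumptions based on the export and data-loading steps above, not the project's actual code.

```python
SCREEN_WIDTH, SCREEN_HEIGHT = 1920, 1080  # scale used for the labels during training


def to_screen_pixels(normalized_xy):
    """Convert a normalized (x, y) prediction to pixel coordinates, clamped to the screen."""
    x = min(max(int(normalized_xy[0] * SCREEN_WIDTH), 0), SCREEN_WIDTH - 1)
    y = min(max(int(normalized_xy[1] * SCREEN_HEIGHT), 0), SCREEN_HEIGHT - 1)
    return x, y


def load_model(model_path, device="CPU"):
    """Read an OpenVINO IR file and compile it for a device such as "CPU" or "GPU"."""
    # imported here so that the helper above can be used without OpenVINO installed
    from openvino.runtime import Core
    core = Core()
    return core.compile_model(core.read_model(model_path), device)


def predict(compiled_model, eye_image, head_info):
    """Run one inference.

    eye_image: float32 array of shape [1, 3, 32, 128]
    head_info: float32 array of shape [1, 7]
    Assumes the exported model takes the image first and the head information second.
    """
    result = compiled_model([eye_image, head_info])
    return to_screen_pixels(result[compiled_model.output(0)][0])


print(to_screen_pixels((0.5, 0.25)))  # (960, 270)
print(to_screen_pixels((1.2, -0.1)))  # (1919, 0)
```

The model is compiled once with load_model and then reused for every frame, so that only predict runs inside the capture loop.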
10. Summary
OpenVINO™ provides a convenient and fast development workflow in which model conversion and model quantization can be achieved through a few core APIs.
AIxBoard™ provides a highly versatile and high-performance deployment platform based on x86 architecture. It is small in size and very suitable for the final deployment of the project.
After the self-trained PyTorch model is optimized with the OpenVINO™ model optimization tools, OpenVINO™ Runtime is used for inference. For small models like the one shown above, large performance improvements can be achieved, and the inference process is simple and clear: on the development board, only a few core functions are needed to run inference with the self-trained PyTorch model.