A real-time face attribute detection system in PyTorch

Introduction

This project trains a face attribute classification model on the CelebA face attribute dataset, uses MediaPipe for face detection and onnxruntime for model inference, and ends up with a complete real-time face attribute recognition system running at 30-100 FPS on an Intel Pentium CPU. PS: I originally planned to publish this as a paid column; after all, with a little modification it is a system that could be commercialized rather than a toy example. But since my view count is about to pass 100,000, I am posting it as a commemorative piece.

Python package environment

Training environment:
python 3.9.12
torch 1.12.0+cu116
torchvision 0.13.0+cu116

Model export:
onnx 1.12.0
onnxruntime 1.14.0

Deployment environment:
python 3.10.6
onnxruntime 1.14.0
opencv-python (cv2) 4.7.0
mediapipe 0.9.0.1

Note: the exact versions are not a hard requirement; they are listed only to make debugging easier. The packages above are generally all you need for this project; install any that are missing with pip (e.g. pip install onnx onnxruntime opencv-python mediapipe).

Dataset preparation

Manually download the CelebA dataset

[Screenshot: the CelebA project page]

Go to the URL shown in the screenshot above (the URL is not pasted directly because external links tend to get the article flagged as low quality by some strange rule), click the Baidu Drive link on that page, and enter the extraction code:

[Screenshot: the shared Baidu Drive folders and extraction code]
Download everything under those folders and put it all in one place. After downloading, the structure should look like the figure below; nothing may be missing and nothing extra should be added:

[Screenshot: expected folder structure under the dataset directory]
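For reference, torchvision's CelebA loader looks for roughly the following files inside the celeba folder (names taken from the torchvision source; img_align_celeba is the extracted image archive), so your download should end up looking something like:

celeba/
    img_align_celeba/    (the 202,599 aligned face images)
    identity_CelebA.txt
    list_attr_celeba.txt
    list_bbox_celeba.txt
    list_eval_partition.txt
    list_landmarks_align_celeba.txt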

The img_align_celeba folder is what you get after extracting the archive of the same name; it contains the face images directly, with no sub-directories:

[Screenshot: face images inside img_align_celeba]

Load the dataset

First we load the dataset directly with torchvision and split it into a training set and a test set. download is set to False to indicate that the data has already been downloaded manually and does not need to be fetched automatically (which is very slow). The root path is the parent directory of the dataset: I put everything under D:/face/celeba, so root is D:/face.

from torchvision import datasets
import torchvision.transforms as transforms
train_dataset = datasets.CelebA(root="D:/face",
                                split='train',
                                transform = transforms.Compose([
                                    transforms.CenterCrop(128),
                                    transforms.ToTensor(),
                                    transforms.Normalize(mean=[0.5, 0.5, 0.5],
                                                        std=[0.5, 0.5, 0.5])]),
                                download=False)
test_dataset = datasets.CelebA(root="D:/face",
                                split='test',
                                transform = transforms.Compose([
                                    transforms.CenterCrop(128),
                                    transforms.ToTensor(),
                                    transforms.Normalize(mean=[0.5, 0.5, 0.5],
                                                        std=[0.5, 0.5, 0.5])]),
                                download=False)

Three preprocessing steps are applied to the data here. Because the same processing will later have to be reproduced without torch during inference, they are worth explaining in detail:

statement | significance
transforms.CenterCrop(128) | crop a 128x128 region from the center of the image
transforms.ToTensor() | divide the 0-255 pixel values by 255, converting them to 0-1 (and reorder to CxHxW)
transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]) | for each of the three channels, subtract a mean of 0.5 and divide by a standard deviation of 0.5
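Because exactly this pipeline has to be reproduced without torch during deployment, here is a minimal numpy sketch of the equivalent arithmetic (my own illustration, not part of the project code; it assumes an already-cropped 128x128 RGB image stored as a uint8 HxWx3 array):

import numpy as np

def normalize_like_torch(img_uint8):
    x = img_uint8.astype(np.float32) / 255.0   # ToTensor: map 0-255 to 0-1
    x = (x - 0.5) / 0.5                        # Normalize: per-channel mean 0.5, std 0.5
    x = x.transpose(2, 0, 1)[np.newaxis, ...]  # HWC -> CHW, then add a batch dimension
    return x                                   # shape (1, 3, 128, 128), values in [-1, 1]

The real deployment version of this appears later as cv2_preprocess.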

Then we wrap the datasets in torch's DataLoader class to actually read the data in batches:

import torch
train_dataloader = torch.utils.data.DataLoader(dataset=train_dataset, shuffle=True, batch_size=128,num_workers=4)
test_dataloader = torch.utils.data.DataLoader(dataset=test_dataset, shuffle=False, batch_size=128,num_workers=4)

batch_size is set to 128 and num_workers to 4 here. Both parameters affect training speed: reduce them if your machine is modest, increase them if it is powerful and you want faster training. The dataset itself is still quite large.
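If you want to make sure the loaders are wired up correctly before training, a quick sanity check on a single batch (my addition, not in the original code) might look like this:

images, targets = next(iter(train_dataloader))
print(images.shape)   # torch.Size([128, 3, 128, 128])
print(targets.shape)  # torch.Size([128, 40]) -- 40 binary attribute labels per image
print(targets.dtype)  # torch.int64; this is why the labels are cast to a float type for the loss later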

Define the model class

The model uses the SlimNet architecture from 2017, a very lightweight network. If you are interested in SlimNet, look up the paper yourself; here the code is given directly:

import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvBNReLU(nn.Sequential):
    def __init__(self, in_planes, out_planes, kernel_size=3, stride=1, groups=1):
        padding = (kernel_size - 1) // 2
        super(ConvBNReLU, self).__init__(
            nn.Conv2d(in_planes, out_planes, kernel_size, stride, padding, groups=groups, bias=False),
            nn.BatchNorm2d(out_planes),
            nn.ReLU(inplace=True)
        )


class DWSeparableConv(nn.Module):
    def __init__(self, inp, oup):
        super().__init__()
        self.dwc = ConvBNReLU(inp, inp, kernel_size=3, groups=inp)
        self.pwc = ConvBNReLU(inp, oup, kernel_size=1)

    def forward(self, x):
        x = self.dwc(x)
        x = self.pwc(x)

        return x


class SSEBlock(nn.Module):
    def __init__(self, inp, oup):
        super().__init__()
        out_channel = oup * 4
        self.pwc1 = ConvBNReLU(inp, oup, kernel_size=1)
        self.pwc2 = ConvBNReLU(oup, out_channel, kernel_size=1)
        self.dwc = DWSeparableConv(oup, out_channel)

    def forward(self, x):
        x = self.pwc1(x)
        out1 = self.pwc2(x)
        out2 = self.dwc(x)

        return torch.cat((out1, out2), 1)


class SlimModule(nn.Module):
    def __init__(self, inp, oup):
        super().__init__()
        hidden_dim = oup * 4
        out_channel = oup * 3
        self.sse1 = SSEBlock(inp, oup)
        self.sse2 = SSEBlock(hidden_dim * 2, oup)
        self.dwc = DWSeparableConv(hidden_dim * 2, out_channel)
        self.conv = ConvBNReLU(inp, hidden_dim * 2, kernel_size=1)

    def forward(self, x):
        out = self.sse1(x)
        out += self.conv(x)
        out = self.sse2(out)
        out = self.dwc(out)

        return out


class SlimNet(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.conv = ConvBNReLU(3, 96, kernel_size=7, stride=2)
        self.max_pool0 = nn.MaxPool2d(kernel_size=3, stride=2)

        self.module1 = SlimModule(96, 16)
        self.module2 = SlimModule(48, 32)
        self.module3 = SlimModule(96, 48)
        self.module4 = SlimModule(144, 64)

        self.max_pool1 = nn.MaxPool2d(kernel_size=3, stride=2)
        self.max_pool2 = nn.MaxPool2d(kernel_size=3, stride=2)
        self.max_pool3 = nn.MaxPool2d(kernel_size=3, stride=2)
        self.max_pool4 = nn.MaxPool2d(kernel_size=3, stride=2)

        self.gap = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(192, num_classes)

    def forward(self, x):
        x = self.max_pool0(self.conv(x))
        x = self.max_pool1(self.module1(x))
        x = self.max_pool2(self.module2(x))
        x = self.max_pool3(self.module3(x))
        x = self.max_pool4(self.module4(x))
        x = self.gap(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return x

device = torch.device('cuda')
model = SlimNet(num_classes=40).to(device=device)

device is set to cuda here for GPU training; if you do not have an NVIDIA graphics card, use 'cpu' instead. num_classes=40 means the final output has 40 facial attribute features. Below is the attribute list I translated (into Chinese) to make it easier to understand:

[Table: the 40 CelebA attributes and their Chinese translations]
As you can see, this is ultimately a multi-label classification task. Note, however, that the network output here is neither a 0/1 label nor a 0-1 probability: at prediction time the raw scores must be passed through a sigmoid to turn them into probabilities between 0 and 1, which are then thresholded at 0.5.
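As a small illustration of that last step (a sketch of mine, not part of the training code), turning the raw outputs into per-attribute decisions for a batch of images from the dataloader looks like this:

logits = model(images.to(device))   # shape [batch, 40], unbounded raw scores
probs = torch.sigmoid(logits)       # squashed into (0, 1)
preds = probs > 0.5                 # boolean prediction for each of the 40 attributes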

Training the network

I will not go through the training loop line by line; it is standard PyTorch boilerplate. One thing to note is the data type expected by the loss function: here the target is an integer tensor and has to be converted to a floating-point (double) tensor before the loss can be computed against the scores. The loop also keeps a manually set best_acc holding the best accuracy on the test set so far; whenever the evaluation during training beats it, the current model is saved automatically.

loss_criterion = nn.BCEWithLogitsLoss() # loss function
optimizer = torch.optim.Adam(model.parameters(), lr = 0.001) # optimizer
best_acc = 0.90325 # best accuracy on the test set so far; can be changed manually
seed = 18203861252700 # fixed starting seed
for epoch in range(50): # train for 50 epochs
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True # the lines above try to fix the random seed, but in testing they only fix the starting seed; they can be deleted
    total_train = 0 # total number of training labels (images x attributes), used to compute accuracy
    correct_train = 0 # training labels classified correctly
    running_loss = 0 # loss on the training set
    running_test_loss = 0 # loss on the test set
    total_test = 0 # total number of test labels
    correct_test = 0 # test labels classified correctly
    model.train() # training mode
    for data, target in train_dataloader:
        data = data.to(device=device)
        target = target.type(torch.DoubleTensor).to(device=device)

        score = model(data)
        loss = loss_criterion(score, target)
        running_loss += loss.item()
        optimizer.zero_grad()

        loss.backward()

        optimizer.step()
        sigmoid_logits = torch.sigmoid(score)
        predictions = sigmoid_logits > 0.5 # turn the outputs into an array of true/false
        total_train += target.size(0) * target.size(1)
        correct_train += (target.type(predictions.type()) == predictions).sum().item()
    model.eval() # evaluation mode
    with torch.no_grad():
         for batch_idx, (images,labels) in enumerate(test_dataloader):
            images, labels = images.to(device), labels.type(torch.DoubleTensor).to(device)
            logits = model.forward(images)
            test_loss = loss_criterion(logits, labels)
            running_test_loss += test_loss.item()
            sigmoid_logits = torch.sigmoid(logits)
            predictions = sigmoid_logits > 0.5
            total_test += labels.size(0) * labels.size(1)
            correct_test += (labels.int() == predictions.int()).sum().item()
    test_acc = correct_test/total_test
    if test_acc > best_acc:
        best_acc = test_acc
        torch.save(model, f"model_{test_acc*100}.pt")
    print(f"For epoch : {epoch} training loss: {running_loss/len(train_dataloader)}")
    print(f'train accuracy is {correct_train*100/total_train}%')
    print(f"For epoch : {epoch} test loss: {running_test_loss/len(test_dataloader)}")
    print(f'test accuracy is {test_acc*100}%')

After training you will find files like the following in the working directory. These are the best models found during training, which we will use for export:
[Screenshot: saved model_xx.pt checkpoint files]

Exporting the model to ONNX

It is best to write this in the same notebook as the code above. If you put it in a separate file, you need to copy over the network definition from above; otherwise torch.load will not be able to find the class definition.
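As an aside, if you prefer separate scripts, the more conventional PyTorch pattern is to save only the weights with state_dict and rebuild the model explicitly; this still requires the SlimNet class, but avoids pickle's dependence on the original module path. A rough sketch (the file name here is made up, not what this post does):

torch.save(model.state_dict(), "slimnet_weights.pt")   # during/after training

loaded = SlimNet(num_classes=40)                        # in the export script
loaded.load_state_dict(torch.load("slimnet_weights.pt", map_location="cpu"))
loaded.eval()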

import numpy as np
import onnxruntime

torch_model = torch.load("model_90.56845506462278.pt", map_location='cpu')
torch_model.eval()
x = torch.randn(1, 3, 128, 128, requires_grad=True) # random 128x128 input
torch_out = torch_model(x)
print(torch_out)
# export the model
torch.onnx.export(torch_model,               # model to export
                  x,                         # model input
                  "cpu.onnx",                # where to save the model
                  export_params=True,        # store the trained parameters
                  opset_version=10,          # ONNX opset version
                  do_constant_folding=True,  # whether to apply constant-folding optimization; makes no difference here
                  input_names = ['input'],   # input name
                  output_names = ['output'], # output name
                  )
ort_session = onnxruntime.InferenceSession("cpu.onnx", providers=['CPUExecutionProvider'])
# run a test inference to check that nothing errors out
def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(x)}
ort_outs = ort_session.run(None, ort_inputs)
print(ort_outs[0])
# check that the onnx inference result matches the torch result within tolerance
np.testing.assert_allclose(to_numpy(torch_out), ort_outs[0], rtol=1e-03, atol=1e-05)

print("Exported model has been tested with ONNXRuntime, and the result looks good!")

If everything is ok, you will find the cpu.onnx file in the directory:
[Screenshot: cpu.onnx in the working directory]

Minimal image inference example

Copy the first picture in the dataset and rename it test_face.jpg:
[Image: test_face.jpg, the first face in the dataset]
Then run inference on it:

from PIL import Image
import torchvision.transforms as transforms
import onnxruntime
import torch
import numpy as np
img = Image.open("test_face.jpg")
comp = transforms.Compose([transforms.CenterCrop(128),transforms.ToTensor(),transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),]) # the same torch preprocessing as in training
img = comp(img)
img.unsqueeze_(0)
def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()
x = to_numpy(img)
ort_session = onnxruntime.InferenceSession("cpu.onnx")
ort_inputs = {ort_session.get_inputs()[0].name: x}
ort_outs = ort_session.run(None, ort_inputs)
def sigmoid_array(x): # convert the raw scores to probabilities with a sigmoid
    return 1 / (1 + np.exp(-x))
result = sigmoid_array(ort_outs[0]) > 0.5
list_attr_cn = np.array(["早上刚刮下午长出来的一点胡子","拱形眉毛","有吸引力的","眼袋","秃头","刘海","大嘴唇"
,"大鼻子","黑发","金发","模糊的","棕发","浓眉","圆胖","双下巴","眼镜","山羊胡子","灰白发","浓妆","高高的颧骨",
"男性","嘴微微张开","胡子","眯眯眼","没有胡子","鹅蛋脸","苍白皮肤","尖鼻子","后退的发际线","红润脸颊",
"鬓角","微笑","直发","卷发","耳环","帽子","口红","项链","领带","年轻"]) # the 40 attribute names, translated into Chinese
print(list_attr_cn[result[0]])

Here is the result (the script prints the Chinese attribute names; shown here translated into English):
['arched eyebrows' 'attractive' 'brown hair' 'heavy makeup' 'high cheekbones' 'slightly open mouth' 'no beard' 'pointy nose' 'smile' 'lipstick' 'young']

With this toy example working, the whole system can be finished by adding just a few million more details.

Real-time face detection on video

The code is given directly; a long prose explanation would be less convenient than inline comments. In short, OpenCV reads the video and draws the text, and MediaPipe does the fast face detection.

import onnxruntime
import time
import numpy as np
import cv2
import mediapipe as mp
mp_face_detection = mp.solutions.face_detection
mp_drawing = mp.solutions.drawing_utils
cap = cv2.VideoCapture("test_face3.mp4") # video input; to use a webcam, pass 0 instead and change the break below to continue
ort_session = onnxruntime.InferenceSession("cpu.onnx")
list_attr_en = np.array(["5_o_Clock_Shadow","Arched_Eyebrows","Attractive","Bags_Under_Eyes","Bald",
"Bangs","Big_Lips","Big_Nose","Black_Hair","Blond_Hair","Blurry","Brown_Hair",
"Bushy_Eyebrows","Chubby","Double_Chin","Eyeglasses","Goatee","Gray_Hair",
"Heavy_Makeup","High_Cheekbones","Male","Mouth_Slightly_Open","Mustache","Narrow_Eyes",
"No_Beard","Oval_Face","Pale_Skin","Pointy_Nose","Receding_Hairline","Rosy_Cheeks",
"Sideburns","Smiling","Straight_Hair","Wavy_Hair","Wearing_Earrings","Wearing_Hat",
"Wearing_Lipstick","Wearing_Necklace","Wearing_Necktie","Young"]) # the original English attribute names
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)) # video width
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)) # video height
fps =  cap.get(cv2.CAP_PROP_FPS) # video FPS; for a live camera set the frame rate manually
out = cv2.VideoWriter('output3.avi', cv2.VideoWriter_fourcc(*"MJPG"), fps, (width,height)) # save the output video; remove if not needed
def cv2_preprocess(img): # reproduce the torch preprocessing in numpy
    img = cv2.resize(img, (128, 128), interpolation=cv2.INTER_NEAREST)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    mean = [0.5,0.5,0.5] # must be given per channel, not as a bare scalar 0.5
    std = [0.5,0.5,0.5]
    img = ((img / 255.0 - mean) / std)
    img = img.transpose((2,0,1)) # HWC -> CHW
    img = np.expand_dims(img, axis=0) # 3-D -> 4-D, add the batch dimension
    img = np.ascontiguousarray(img, dtype=np.float32) # convert to float32 and make contiguous
    return img
def sigmoid_array(x): # sigmoid implemented by hand
    return 1 / (1 + np.exp(-x))
def result_inference(input_array): # inference step
    ort_inputs = {ort_session.get_inputs()[0].name: input_array}
    ort_outs = ort_session.run(None, ort_inputs)
    possibility = sigmoid_array(ort_outs[0]) > 0.5
    result = list_attr_en[possibility[0]]
    return result
with mp_face_detection.FaceDetection(
    model_selection=1, min_detection_confidence=0.5) as face_detection:
    # face detection: model_selection=1 is the general (full-range) model, 0 the short-range model
  while cap.isOpened():
    a1 = time.time()
    success, image = cap.read()
    if not success:
      print("Ignoring empty camera frame.")
      break
    image.flags.writeable = False # reportedly speeds up the face detection inference
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    results = face_detection.process(image)
    image.flags.writeable = True
    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
    image2 = image.copy() # keep a copy, because cv2 draws directly into the original array
    if results.detections:
      for detection in results.detections:
        mp_drawing.draw_detection(image, detection)
        image_rows, image_cols, _ = image.shape
        location = detection.location_data.relative_bounding_box # face position
        start_point = mp_drawing._normalized_to_pixel_coordinates(location.xmin, location.ymin,image_cols,image_rows) # top-left corner of the face
        end_point = mp_drawing._normalized_to_pixel_coordinates(location.xmin + location.width, location.ymin + location.height,image_cols,image_rows) # bottom-right corner
        if start_point is None or end_point is None: # skip faces partly outside the frame
            continue
        x1,y1 = start_point # top-left coordinates
        x2,y2 = end_point # bottom-right coordinates
        img_infer = image2[max(y1-70, 0):y2, max(x1-50, 0):x2+50].copy() # enlarge the crop around the detected box so it resembles the CelebA training images, which improves accuracy
        img_infer = cv2_preprocess(img_infer)
        result = result_inference(img_infer)
    #     # cv2.imshow('test',img_infer)
        # if cv2.waitKey(5) & 0xFF == 27:
        #   break
        for i in range(0,len(result)):
            image = cv2.putText(image, result[i],(x1,y1+i*40), cv2.FONT_HERSHEY_SIMPLEX, 
                   1, (255,255,255), 1, cv2.LINE_AA) # draw the text, one attribute per line; for Chinese text, build OpenCV with the freetype module yourself; wrapping other libraries to render Chinese is slow and not recommended
    a2 = time.time()
    out.write(image)
    print(f'one pic time is {a2 - a1} s')
cap.release() # release the video source and finalize the output file
out.release()

In addition, MediaPipe's own drawing helper draws the facial keypoints as well. If you do not want the keypoints, simply comment out the mp_drawing.draw_detection call, as I did in my final version.
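If you would rather draw a plain bounding box than MediaPipe's keypoint overlay, a minimal replacement (my sketch, reusing the x1, y1, x2, y2 already computed inside the detection loop) is:

cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)  # green box, 2 px thick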

Final result

Inference time was measured on multiple videos: a complete frame, from reading through detection and drawing, takes roughly 0.01-0.035 s, which works out to about 30-100 frames per second.

As you can see from the screenshot below, the detected attributes are fairly accurate: the frame shows an attractive young woman with brown hair and no beard. The attribute list changes from frame to frame because the face detector crops a slightly different region each time and no face-tracking algorithm is applied.
[Screenshot: annotated frame with the detected attributes drawn next to the face]
