Face restoration and mosaic removal algorithm CodeFormer - C++ and Python model deployment

1. Face restoration algorithm

1. Introduction to the algorithm

CodeFormer is a deep-learning-based face restoration model, jointly developed by Nanyang Technological University and SenseTime Technology Joint Research Center. It takes a blurred or mosaicked image as input and generates a clearer restored image. Algorithm source code: https://github.com/sczhou/CodeFormer
This technology may find wide application in areas such as image repair, image enhancement, and privacy protection. The algorithm combines VQGAN and Transformer.
VQGAN is a generative model commonly used for image generation tasks. It uses vector quantization technology to encode images into a series of discrete vectors, and then converts these vectors into images through a decoder. This method is often able to produce high-quality images, especially when combined with neural networks such as Transformer.
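The vector quantization step described above can be illustrated with a minimal NumPy sketch. It is not CodeFormer's or VQGAN's actual implementation: the function name `quantize`, the toy codebook, and the sample features are all assumptions made for illustration. The idea shown is only the nearest-neighbour lookup that turns continuous feature vectors into discrete codebook indices.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry (hypothetical helper)."""
    # Squared Euclidean distance between every feature and every code,
    # giving an array of shape (num_features, num_codes).
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # The discrete representation is the index of the closest code.
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]

# Toy codebook with 4 entries of dimension 2 (illustrative values only).
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
features = np.array([[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]])

indices, quantized = quantize(features, codebook)
print(indices)    # discrete codes, one per feature vector
print(quantized)  # the codebook vectors a decoder would receive
```

A decoder then works only from the selected codebook vectors, which is why the quality of the learned codebook matters so much for the generated image.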
Transformer is a neural network architecture widely used in fields such as natural language processing and computer vision. It excels in sequence data processing and can also be used for image generation and processing tasks.
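As a rough illustration of the sequence processing mentioned above, the following NumPy sketch computes scaled dot-product self-attention, the core operation of a Transformer layer. It is a simplified sketch with no learned projections, multiple heads, or masking, and the function name and random inputs are assumptions rather than code from CodeFormer.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Single-head attention over a sequence of token vectors (illustrative)."""
    d_k = q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = q @ k.T / np.sqrt(d_k)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted mix of the value vectors.
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))  # 5 tokens, 8 features each (arbitrary sizes)
output, weights = scaled_dot_product_attention(tokens, tokens, tokens)
print(output.shape, weights.shape)
```

For images, the tokens are typically flattened feature-map positions, so every position can attend to every other position.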

In the fields of surveillance, security, and privacy protection, facial images are often affected by a variety of factors, including lighting, pixel limitations, focus issues, and human movement. These factors may cause images to be blurry, deformed, or contain a large amount of noise. In this case, trying to recover a clear original face image is an extremely challenging task.
Blind face restoration is an ill-posed problem, which means there are multiple possible solutions and the true original image cannot be uniquely determined from limited observation data. Therefore, in this field, it is often necessary to rely on advanced computer vision and image processing techniques, as well as deep learning models, to try to improve the quality of blurred or damaged images.
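A tiny numeric example makes the ill-posedness concrete. The sketch below uses pairwise averaging as a stand-in degradation; this degradation model and the sample values are assumptions chosen for illustration, and real degradations are far more complex. Two different sharp signals produce exactly the same low-resolution observation, so the observation alone cannot tell them apart.

```python
import numpy as np

def downsample_by_2(signal):
    """Average each pair of neighbouring samples (a toy degradation model)."""
    return signal.reshape(-1, 2).mean(axis=1)

# Two different "sharp" signals...
sharp_a = np.array([10.0, 30.0, 50.0, 70.0])
sharp_b = np.array([20.0, 20.0, 40.0, 80.0])

# ...that degrade to the same low-resolution observation.
low_a = downsample_by_2(sharp_a)
low_b = downsample_by_2(sharp_b)
print(low_a, low_b)
```

This is why restoration methods have to bring in extra information, such as learned priors about what faces look like, to pick one plausible solution.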
Some methods and techniques can be used to deal with the problem of blind face restoration, including but not limited to:
Deep learning models: Deep learning models such as convolutional neural networks (CNN) and generative adversarial networks (GAN) can be used to try to restore the original details of blurred or deformed face images.
Super-resolution technology: Super-resolution methods aim to reconstruct high-resolution images from low-resolution images, which can also be used for face image restoration.
Prior knowledge: Using prior knowledge, such as face structure, lighting model, etc., can help improve the accuracy of restoration.
Multimodal fusion: Combining data from different sensors and information sources can improve the robustness of restoration.
However, even with these techniques, fully recovering a clear original face image can still be an extremely challenging task due to the ill-posed nature of the problem, especially under extreme conditions. In practical applications, image quality and available information may need to be weighed to achieve optimal results.

2. Algorithm results

The officially published results show the algorithm's effect on several kinds of input:
[Figure: old photo repair]
[Figure: face repair]
[Figure: black-and-white face image enhancement and face restoration]
[Figure: face restoration]

2. Model deployment

To deploy the model for inference in C++, first convert it to ONNX. The ONNX model can then be run with the onnxruntime C++ library or with OpenCV's DNN module, or converted further to an ncnn model and deployed with ncnn. The original model can be downloaded from the official open-source repository.
There are two methods for model inference. One is to super-resolve the entire image directly, without checking whether it contains a face. However, this approach seems to introduce artifacts into images that were already clear, producing some inexplicable results. The other is to detect the faces first and restore only the face regions.
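One way to avoid touching the parts of an image that are already clear is to restore only a face region and paste the result back. The NumPy sketch below shows that idea under strong simplifying assumptions: the face box is given and axis-aligned, and `restore_fn` is a hypothetical placeholder for the restoration model. The official pipeline uses face detection, alignment, and an inverse affine warp instead, so this is only an outline of the control flow.

```python
import numpy as np

def restore_face_region(image, box, restore_fn):
    """Apply restore_fn to the box region only and paste the result back.

    box is (x0, y0, x1, y1) in pixel coordinates; restore_fn is a
    hypothetical stand-in for the face restoration model.
    """
    x0, y0, x1, y1 = box
    result = image.copy()  # leave the caller's image untouched
    face = image[y0:y1, x0:x1]
    restored = restore_fn(face)
    if restored.shape != face.shape:
        raise ValueError("restore_fn must return an array with the crop's shape")
    result[y0:y1, x0:x1] = restored
    return result

def fake_restore(face):
    # Placeholder "model": paints the crop white so the effect is visible.
    return np.full_like(face, 255)

image = np.zeros((8, 8, 3), dtype=np.uint8)
output = restore_face_region(image, (2, 2, 6, 6), fake_restore)
print(output[:, :, 0])
```

Everything outside the box is copied through unchanged, which is the property that protects already-clear background content.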

1. C++ uses onnxruntime to deploy the model

#include "CodeFormer.h"

CodeFormer::CodeFormer(std::string model_path)
{
	// Uncomment to enable NVIDIA CUDA acceleration.
	//OrtStatus* status = OrtSessionOptionsAppendExecutionProvider_CUDA(sessionOptions, 0);
	sessionOptions.SetGraphOptimizationLevel(ORT_ENABLE_BASIC);
	// On Windows, the session takes a wide-character model path.
	std::wstring widestr = std::wstring(model_path.begin(), model_path.end());
	ort_session = new Ort::Session(env, widestr.c_str(), sessionOptions);
	// On Linux, pass the narrow string directly instead:
	//ort_session = new Ort::Session(env, model_path.c_str(), sessionOptions);

	// Collect the names and shapes of all model inputs and outputs.
	size_t numInputNodes = ort_session->GetInputCount();
	size_t numOutputNodes = ort_session->GetOutputCount();
	Ort::AllocatorWithDefaultOptions allocator;
	for (size_t i = 0; i < numInputNodes; i++)
	{
		input_names.push_back(ort_session->GetInputName(i, allocator));
		Ort::TypeInfo input_type_info = ort_session->GetInputTypeInfo(i);
		auto input_tensor_info = input_type_info.GetTensorTypeAndShapeInfo();
		auto input_dims = input_tensor_info.GetShape();
		input_node_dims.push_back(input_dims);
	}
	for (size_t i = 0; i < numOutputNodes; i++)
	{
		output_names.push_back(ort_session->GetOutputName(i, allocator));
		Ort::TypeInfo output_type_info = ort_session->GetOutputTypeInfo(i);
		auto output_tensor_info = output_type_info.GetTensorTypeAndShapeInfo();
		auto output_dims = output_tensor_info.GetShape();
		output_node_dims.push_back(output_dims);
	}

	// The first input and output are NCHW tensors: index 2 is height, index 3 is width.
	this->inpHeight = input_node_dims[0][2];
	this->inpWidth = input_node_dims[0][3];
	this->outHeight = output_node_dims[0][2];
	this->outWidth = output_node_dims[0][3];
	// Second model input: the fidelity weight w.
	input2_tensor.push_back(0.5);
}

void CodeFormer::preprocess(cv::Mat &srcimg)
{
	cv::Mat dstimg;
	cv::cvtColor(srcimg, dstimg, cv::COLOR_BGR2RGB);
	// The 4th and 5th arguments of cv::resize are fx and fy; the interpolation flag comes last.
	cv::resize(dstimg, dstimg, cv::Size(this->inpWidth, this->inpHeight), 0, 0, cv::INTER_LINEAR);
	this->input_image_.resize(this->inpWidth * this->inpHeight * dstimg.channels());
	// Convert interleaved HWC pixels to planar CHW and normalize to [-1, 1].
	int k = 0;
	for (int c = 0; c < 3; c++)
	{
		for (int i = 0; i < this->inpHeight; i++)
		{
			for (int j = 0; j < this->inpWidth; j++)
			{
				float pix = dstimg.ptr<uchar>(i)[j * 3 + c];
				this->input_image_[k] = (pix / 255.0 - 0.5) / 0.5;
				k++;
			}
		}
	}
}

cv::Mat CodeFormer::detect(cv::Mat &srcimg)
{
	int im_h = srcimg.rows;
	int im_w = srcimg.cols;
	this->preprocess(srcimg);
	std::array<int64_t, 4> input_shape_{ 1, 3, this->inpHeight, this->inpWidth };
	std::vector<int64_t> input2_shape_ = { 1 };

	auto allocator_info = Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeCPU);
	std::vector<Ort::Value> ort_inputs;
	ort_inputs.push_back(Ort::Value::CreateTensor<float>(allocator_info, input_image_.data(), input_image_.size(), input_shape_.data(), input_shape_.size()));
	// The fidelity weight is passed as a double tensor, so input2_tensor must hold doubles.
	ort_inputs.push_back(Ort::Value::CreateTensor<double>(allocator_info, input2_tensor.data(), input2_tensor.size(), input2_shape_.data(), input2_shape_.size()));
	std::vector<Ort::Value> ort_outputs = ort_session->Run(Ort::RunOptions{ nullptr }, input_names.data(), ort_inputs.data(), ort_inputs.size(), output_names.data(), output_names.size());

	// Post-processing
	float* pred = ort_outputs[0].GetTensorMutableData<float>();
	// Experiments show that wrapping the buffer directly like this does not work,
	// because the output is planar (CHW) rather than interleaved (HWC):
	//cv::Mat mask(outHeight, outWidth, CV_32FC3, pred);
	const unsigned int channel_step = outHeight * outWidth;
	std::vector<cv::Mat> channel_mats;
	cv::Mat rmat(outHeight, outWidth, CV_32FC1, pred); // R
	cv::Mat gmat(outHeight, outWidth, CV_32FC1, pred + channel_step); // G
	cv::Mat bmat(outHeight, outWidth, CV_32FC1, pred + 2 * channel_step); // B
	channel_mats.push_back(rmat);
	channel_mats.push_back(gmat);
	channel_mats.push_back(bmat);
	cv::Mat mask;
	cv::merge(channel_mats, mask); // CV_32FC3 allocated

	// Equivalent of numpy.clip without looping over every pixel of the cv::Mat.
	mask.setTo(this->min_max[0], mask < this->min_max[0]);
	mask.setTo(this->min_max[1], mask > this->min_max[1]); // cv::threshold with THRESH_TRUNC is an alternative for the upper bound

	// Map [min, max] to [0, 255].
	mask = (mask - this->min_max[0]) / (this->min_max[1] - this->min_max[0]);
	mask *= 255.0;
	mask.convertTo(mask, CV_8UC3);
	// The channels were merged in R, G, B order; uncomment to swap them to OpenCV's BGR order.
	//cvtColor(mask, mask, cv::COLOR_BGR2RGB);
	return mask;
}


void CodeFormer::detect_video(const std::string& video_path, const std::string& output_path, unsigned int writer_fps)
{
	cv::VideoCapture video_capture(video_path);

	if (!video_capture.isOpened())
	{
		std::cout << "Can not open video: " << video_path << "\n";
		return;
	}

	cv::Size S = cv::Size((int)video_capture.get(cv::CAP_PROP_FRAME_WIDTH),
		(int)video_capture.get(cv::CAP_PROP_FRAME_HEIGHT));

	// Use the requested frame rate instead of a hard-coded 30 fps.
	cv::VideoWriter output_video(output_path, cv::VideoWriter::fourcc('m', 'p', '4', 'v'),
		(double)writer_fps, S);

	if (!output_video.isOpened())
	{
		std::cout << "Can not open writer: " << output_path << "\n";
		return;
	}

	cv::Mat cv_mat;
	while (video_capture.read(cv_mat))
	{
		cv::Mat cv_dst = detect(cv_mat);
		// detect() returns an image at the model's output size; the writer expects frames of size S.
		cv::resize(cv_dst, cv_dst, S);

		output_video << cv_dst;
	}
	video_capture.release();
	output_video.release();
}
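The pre- and post-processing performed by the C++ code can be written compactly in NumPy, which may help when checking the C++ results against a reference. This is a sketch, not part of the original project: it assumes the input image has already been resized to the model's input size, and it assumes `min_max` is (-1, 1), the value used in the `tensor2img` calls of the Python script, since the C++ header that defines `min_max` is not shown.

```python
import numpy as np

def preprocess(bgr_image):
    """uint8 HxWx3 BGR image -> float32 1x3xHxW RGB tensor in [-1, 1]."""
    rgb = bgr_image[:, :, ::-1].astype(np.float32)   # BGR -> RGB
    chw = np.transpose(rgb, (2, 0, 1))               # HWC -> CHW
    normalized = (chw / 255.0 - 0.5) / 0.5           # [0, 255] -> [-1, 1]
    return normalized[np.newaxis, ...]               # add the batch axis

def postprocess(output, min_max=(-1.0, 1.0)):
    """float 1x3xHxW RGB tensor -> uint8 HxWx3 RGB image."""
    lo, hi = min_max
    chw = np.clip(output[0], lo, hi)                 # same role as the setTo calls
    scaled = (chw - lo) / (hi - lo) * 255.0          # [lo, hi] -> [0, 255]
    hwc = np.transpose(scaled, (1, 2, 0))            # CHW -> HWC
    return np.round(hwc).astype(np.uint8)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)  # stand-in for a BGR frame
tensor = preprocess(image)
restored = postprocess(tensor)  # identity "model": only the channel order differs
print(tensor.shape, restored.shape)
```

Note that `postprocess` returns RGB, just as the C++ `detect` function does before any channel swap, so a conversion to BGR is still needed before handing the result to OpenCV functions that assume BGR.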

First, the result on the official sample:
[Figure: result on the official sample]
The super-resolution result on a fine mosaic:
[Figure: results on a fine mosaic, two images]
The super-resolution result on a coarse mosaic is not very good; the output only roughly resembles a face:
[Figure: result on a coarse mosaic]
If the image is already clear, the result after super-resolution is not ideal and is basically unusable. The ONNX model can only enhance faces:
[Figure: results on an already clear image, two images]

2. ONNX model inference in Python

import os
import cv2
import argparse
import glob
import torch
import torch.onnx
from torchvision.transforms.functional import normalize
from basicsr.utils import imwrite, img2tensor, tensor2img
from basicsr.utils.download_util import load_file_from_url
from basicsr.utils.misc import gpu_is_available, get_device
from facelib.utils.face_restoration_helper import FaceRestoreHelper
from facelib.utils.misc import is_gray
import onnxruntime as ort

from basicsr.utils.registry import ARCH_REGISTRY

pretrain_model_url = {
    'restoration': 'https://github.com/sczhou/CodeFormer/releases/download/v0.1.0/codeformer.pth',
}

if __name__ == '__main__':
    # device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    device = get_device()
    parser = argparse.ArgumentParser()

    parser.add_argument('-i', '--input_path', type=str, default='./inputs/whole_imgs', 
            help='Input image, video or folder. Default: inputs/whole_imgs')
    parser.add_argument('-o', '--output_path', type=str, default=None, 
            help='Output folder. Default: results/<input_name>_<w>')
    parser.add_argument('-w', '--fidelity_weight', type=float, default=0.5, 
            help='Balance the quality and fidelity. Default: 0.5')
    parser.add_argument('-s', '--upscale', type=int, default=2, 
            help='The final upsampling scale of the image. Default: 2')
    parser.add_argument('--has_aligned', action='store_true', help='Input are cropped and aligned faces. Default: False')
    parser.add_argument('--only_center_face', action='store_true', help='Only restore the center face. Default: False')
    parser.add_argument('--draw_box', action='store_true', help='Draw the bounding box for the detected faces. Default: False')
    # large det_model: 'YOLOv5l', 'retinaface_resnet50'
    # small det_model: 'YOLOv5n', 'retinaface_mobile0.25'
    parser.add_argument('--detection_model', type=str, default='retinaface_resnet50', 
            help='Face detector. Optional: retinaface_resnet50, retinaface_mobile0.25, YOLOv5l, YOLOv5n, dlib. \
                Default: retinaface_resnet50')
    parser.add_argument('--bg_upsampler', type=str, default='None', help='Background upsampler. Optional: realesrgan')
    parser.add_argument('--face_upsample', action='store_true', help='Face upsampler after enhancement. Default: False')
    parser.add_argument('--bg_tile', type=int, default=400, help='Tile size for background sampler. Default: 400')
    parser.add_argument('--suffix', type=str, default=None, help='Suffix of the restored faces. Default: None')
    parser.add_argument('--save_video_fps', type=float, default=None, help='Frame rate for saving video. Default: None')

    args = parser.parse_args()

    # ------------------------ input & output ------------------------
    w = args.fidelity_weight
    input_video = False
    if args.input_path.endswith(('jpg', 'jpeg', 'png', 'JPG', 'JPEG', 'PNG')): # input single img path
        input_img_list = [args.input_path]
        result_root = f'results/test_img_{w}'
    # elif args.input_path.endswith(('mp4', 'mov', 'avi', 'MP4', 'MOV', 'AVI')): # input video path
    #     from basicsr.utils.video_util import VideoReader, VideoWriter
    #     input_img_list = []
    #     vidreader = VideoReader(args.input_path)
    #     image = vidreader.get_frame()
    #     while image is not None:
    #         input_img_list.append(image)
    #         image = vidreader.get_frame()
    #     audio = vidreader.get_audio()
    #     fps = vidreader.get_fps() if args.save_video_fps is None else args.save_video_fps   
    #     video_name = os.path.basename(args.input_path)[:-4]
    #     result_root = f'results/{video_name}_{w}'
    #     input_video = True
    #     vidreader.close()
    # else: # input img folder
    #     if args.input_path.endswith('/'):  # solve when path ends with /
    #         args.input_path = args.input_path[:-1]
    #     # scan all the jpg and png images
    #     input_img_list = sorted(glob.glob(os.path.join(args.input_path, '*.[jpJP][pnPN]*[gG]')))
    #     result_root = f'results/{os.path.basename(args.input_path)}_{w}'
    else:
        raise ValueError('This script only supports a single image path (.jpg, .jpeg or .png) as --input_path.')

    if args.output_path is not None: # set output path
        result_root = args.output_path

    test_img_num = len(input_img_list)
    if test_img_num == 0:
        raise FileNotFoundError('No input image/video is found...\n' 
            '\tNote that --input_path for video should end with .mp4|.mov|.avi')

    # # ------------------ set up background upsampler ------------------
    # if args.bg_upsampler == 'realesrgan':
    #     bg_upsampler = set_realesrgan()
    # else:
    #     bg_upsampler = None

    # # ------------------ set up face upsampler ------------------
    # if args.face_upsample:
    #     if bg_upsampler is not None:
    #         face_upsampler = bg_upsampler
    #     else:
    #         face_upsampler = set_realesrgan()
    # else:
    #     face_upsampler = None

    # ------------------ set up CodeFormer restorer -------------------
    net = ARCH_REGISTRY.get('CodeFormer')(dim_embd=512, codebook_size=1024, n_head=8, n_layers=9, 
                                            connect_list=['32', '64', '128', '256']).to(device)
    
    # ckpt_path = 'weights/CodeFormer/codeformer.pth'
    ckpt_path = load_file_from_url(url=pretrain_model_url['restoration'], 
                                    model_dir='weights/CodeFormer', progress=True, file_name=None)
    # map_location lets a checkpoint saved on GPU load on a CPU-only machine
    checkpoint = torch.load(ckpt_path, map_location=device)['params_ema']
    net.load_state_dict(checkpoint)
    net.eval()

    # # ------------------ set up FaceRestoreHelper -------------------
    # # large det_model: 'YOLOv5l', 'retinaface_resnet50'
    # # small det_model: 'YOLOv5n', 'retinaface_mobile0.25'
    # if not args.has_aligned: 
    #     print(f'Face detection model: {args.detection_model}')
    # # if bg_upsampler is not None: 
    # #     print(f'Background upsampling: True, Face upsampling: {args.face_upsample}')
    # # else:
    # #     print(f'Background upsampling: False, Face upsampling: {args.face_upsample}')
    # else:
    #     raise ValueError("wtf???")

    face_helper = FaceRestoreHelper(
        args.upscale,
        face_size=512,
        crop_ratio=(1, 1),
        # det_model = args.detection_model,
        # save_ext='png',
        # use_parse=True,
        # device=device
    )

    # -------------------- start to processing ---------------------
    for i, img_path in enumerate(input_img_list):
        # # clean all the intermediate results to process the next image
        # face_helper.clean_all()
        
        if isinstance(img_path, str):
            img_name = os.path.basename(img_path)
            basename, ext = os.path.splitext(img_name)
            print(f'[{i+1}/{test_img_num}] Processing: {img_name}')
            img = cv2.imread(img_path, cv2.IMREAD_COLOR)
        # else: # for video processing
        #     basename = str(i).zfill(6)
        #     img_name = f'{video_name}_{basename}' if input_video else basename
        #     print(f'[{i+1}/{test_img_num}] Processing: {img_name}')
        #     img = img_path

        if args.has_aligned: 
            # the input faces are already cropped and aligned
            img = cv2.resize(img, (512, 512), interpolation=cv2.INTER_LINEAR)
            # face_helper.is_gray = is_gray(img, threshold=10)
            # if face_helper.is_gray:
            #     print('Grayscale input: True')
            face_helper.cropped_faces = [img]
        # else:
        #     face_helper.read_image(img)
        #     # get face landmarks for each face
        #     num_det_faces = face_helper.get_face_landmarks_5(
        #         only_center_face=args.only_center_face, resize=640, eye_dist_threshold=5)
        #     print(f'\tdetect {num_det_faces} faces')
        #     # align and warp each face
        #     face_helper.align_warp_face()
        else:
            raise ValueError('This export script only supports --has_aligned inputs (cropped and aligned faces).')

        # face restoration for each cropped face
        for idx, cropped_face in enumerate(face_helper.cropped_faces):
            # prepare data
            cropped_face_t = img2tensor(cropped_face / 255., bgr2rgb=True, float32=True)
            normalize(cropped_face_t, (0.5, 0.5, 0.5), (0.5, 0.5, 0.5), inplace=True)
            cropped_face_t = cropped_face_t.unsqueeze(0).to(device)

            try:
                with torch.no_grad():
                    # output = net(cropped_face_t, w=w, adain=True)[0]
                    # output = net(cropped_face_t)[0]
                    output = net(cropped_face_t, w)[0]
                    restored_face = tensor2img(output, rgb2bgr=True, min_max=(-1, 1))
                del output
                # torch.cuda.empty_cache()
            except Exception as error:
                print(f'\tFailed inference for CodeFormer: {error}')
                restored_face = tensor2img(cropped_face_t, rgb2bgr=True, min_max=(-1, 1))
            
            # now, export the "net" codeformer to onnx
            print("Exporting CodeFormer to ONNX...")
            torch.onnx.export(net,
                # (cropped_face_t,),
                (cropped_face_t,w),
                "codeformer.onnx", 
                # verbose=True,
                export_params=True,
                opset_version=11,
                do_constant_folding=True,
                input_names = ['x','w'],
                output_names = ['y'],
            )

            # now, try to load the onnx model and run it
            print("Loading CodeFormer ONNX...")
            ort_session = ort.InferenceSession("codeformer.onnx", providers=['CPUExecutionProvider'])
            print("Running CodeFormer ONNX...")
            ort_inputs = {
                ort_session.get_inputs()[0].name: cropped_face_t.cpu().numpy(),
                ort_session.get_inputs()[1].name: torch.tensor(w).double().cpu().numpy(),
            }
            ort_outs = ort_session.run(None, ort_inputs)
            restored_face_onnx = tensor2img(torch.from_numpy(ort_outs[0]), rgb2bgr=True, min_max=(-1, 1))
            restored_face_onnx = restored_face_onnx.astype('uint8')

            restored_face = restored_face.astype('uint8')

            print("Comparing CodeFormer outputs...")
            # see how similar the outputs are: flatten and then compute all the differences
            diff = (restored_face_onnx.astype('float32') - restored_face.astype('float32')).flatten()
            # calculate min, max, mean, and std
            min_diff = diff.min()
            max_diff = diff.max()
            mean_diff = diff.mean()
            std_diff = diff.std()
            print(f"Min diff: {min_diff}, Max diff: {max_diff}, Mean diff: {mean_diff}, Std diff: {std_diff}")

            # face_helper.add_restored_face(restored_face, cropped_face)
            face_helper.add_restored_face(restored_face_onnx, cropped_face)

        # # paste_back
        # if not args.has_aligned:
        #     # upsample the background
        #     if bg_upsampler is not None:
        #         # Now only support RealESRGAN for upsampling background
        #         bg_img = bg_upsampler.enhance(img, outscale=args.upscale)[0]
        #     else:
        #         bg_img = None
        #     face_helper.get_inverse_affine(None)
        #     # paste each restored face to the input image
        #     if args.face_upsample and face_upsampler is not None: 
        #         restored_img = face_helper.paste_faces_to_input_image(upsample_img=bg_img, draw_box=args.draw_box, face_upsampler=face_upsampler)
        #     else:
        #         restored_img = face_helper.paste_faces_to_input_image(upsample_img=bg_img, draw_box=args.draw_box)

        # save faces
        for idx, (cropped_face, restored_face) in enumerate(zip(face_helper.cropped_faces, face_helper.restored_faces)):
            # save cropped face
            if not args.has_aligned: 
                save_crop_path = os.path.join(result_root, 'cropped_faces', f'{basename}_{idx:02d}.png')
                imwrite(cropped_face, save_crop_path)
            # save restored face
            if args.has_aligned:
                save_face_name = f'{basename}.png'
            else:
                save_face_name = f'{basename}_{idx:02d}.png'
            if args.suffix is not None:
                save_face_name = f'{save_face_name[:-4]}_{args.suffix}.png'
            save_restore_path = os.path.join(result_root, 'restored_faces', save_face_name)
            imwrite(restored_face, save_restore_path)

        # # save restored img
        # if not args.has_aligned and restored_img is not None:
        #     if args.suffix is not None:
        #         basename = f'{basename}_{args.suffix}'
        #     save_restore_path = os.path.join(result_root, 'final_results', f'{basename}.png')
        #     imwrite(restored_img, save_restore_path)

    # # save enhanced video
    # if input_video:
    #     print('Video Saving...')
    #     # load images
    #     video_frames = []
    #     img_list = sorted(glob.glob(os.path.join(result_root, 'final_results', '*.[jp][pn]g')))
    #     for img_path in img_list:
    #         img = cv2.imread(img_path)
    #         video_frames.append(img)
    #     # write images to video
    #     height, width = video_frames[0].shape[:2]
    #     if args.suffix is not None:
    #         video_name = f'{video_name}_{args.suffix}.png'
    #     save_restore_path = os.path.join(result_root, f'{video_name}.mp4')
    #     vidwriter = VideoWriter(save_restore_path, height, width, fps, audio)
         
    #     for f in video_frames:
    #         vidwriter.write_frame(f)
    #     vidwriter.close()

    print(f'\nAll results are saved in {result_root}')
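The comparison between the PyTorch output and the ONNX output in the script above can be factored into a small reusable check. The sketch below is an illustration only: the helper names and the tolerance of 2 intensity levels are assumptions, not values from the original script, and the two arrays stand in for `restored_face` and `restored_face_onnx`.

```python
import numpy as np

def diff_stats(image_a, image_b):
    """Summarise per-pixel differences between two uint8 images."""
    # Cast to float first so that the subtraction cannot wrap around.
    diff = image_a.astype(np.float32) - image_b.astype(np.float32)
    return {
        "min": float(diff.min()),
        "max": float(diff.max()),
        "mean": float(diff.mean()),
        "std": float(diff.std()),
        "max_abs": float(np.abs(diff).max()),
    }

def outputs_match(image_a, image_b, max_abs_tolerance=2.0):
    """Return True when no pixel differs by more than the (assumed) tolerance."""
    return diff_stats(image_a, image_b)["max_abs"] <= max_abs_tolerance

# Stand-ins for the PyTorch and ONNX results.
reference = np.full((4, 4, 3), 100, dtype=np.uint8)
candidate = reference.copy()
candidate[0, 0, 0] = 101  # a one-level rounding difference

stats = diff_stats(candidate, reference)
print(stats)
print(outputs_match(candidate, reference))
```

Small differences of a level or two are common after export because of floating-point rounding, while large differences usually point to a preprocessing or input-type mismatch.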

Origin blog.csdn.net/matt45m/article/details/132830451