ModNet matting algorithm and camera real-time matting example

Table of contents

1. Reasons for using a green screen in video matting
   1. Reasons related to the camera's color sensing
   2. Reasons related to matting quality
   3. Economic cost
2. Background knowledge of matting
   1. Trimap
   2. What is matting
   3. Classification of matting algorithms
3. Deep Image Matting Algorithm
   1. Network structure diagram
   2. Algorithm Interpretation
      (1) Encoder-Decoder stage
      (2) Refinement stage
4. ModNet Algorithm: Trimap-Free Portrait Matting in Real Time
   1. Network structure diagram
   2. Algorithm Interpretation
5. ModNet matting practice


1. Reasons for using a green screen in video matting

1. Reasons related to the camera's color sensing

Mainstream camera sensors capture three RGB channels, so the cleanest keying is achieved when the backdrop is one of the three primary colors. In addition, most camera CMOS sensors use a Bayer array, which contains two green photosites for every red and blue one, so the green channel carries richer information and is easier to separate cleanly.

2. Reasons related to matting quality

In most videos, people's skin tones are close to the complementary color of green, giving high contrast. This makes it easier for the computer to distinguish edges and fine hair during processing, which reduces the matting workload.

3. Economic cost

A green background is bright, so the lighting can be turned down while shooting, which saves power.

2. Background knowledge of matting

Reference article: Portrait matting: algorithm overview and engineering implementation (1) - Cloud Community - HUAWEI CLOUD

1. Trimap

The most commonly used prior is the trimap (ternary map): every pixel takes one of the values {0, 128, 255}, representing the background, the unknown region, and the foreground respectively.
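
As an illustration (not from the original post), a trimap is often generated from a binary segmentation mask by eroding and dilating it; the band between the eroded and dilated masks becomes the unknown region. A minimal sketch with OpenCV, where the kernel size is an arbitrary assumption:

import cv2
import numpy as np

def make_trimap(mask, kernel_size=10):
    # mask: uint8 binary mask, 255 = foreground, 0 = background
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    fg = cv2.erode(mask, kernel)              # confident foreground after erosion
    unknown = cv2.dilate(mask, kernel) - fg   # band between dilation and erosion
    trimap = fg.copy()
    trimap[unknown > 0] = 128                 # 0 = background, 128 = unknown, 255 = foreground
    return trimap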

2. What is matting

For a picture I, the part of the portrait we are interested in is called the foreground F, and the rest is the background B. The image I can then be regarded as a weighted fusion of F and B:

I = alpha * F + (1 - alpha) * B

where the alpha matrix has the same shape as I.

The matting task is to find the appropriate weight alpha matrix.

The process of fusing the foreground image and the background image according to the above formula is exemplified as follows:

Suppose the circular region in the middle of a picture is the foreground and the rest is the background. After the two images are combined according to the formula, the pixels inside the circle all come from the foreground and the pixels outside the circle all come from the background; alpha corresponds to the per-pixel foreground probability matrix.

Once alpha has been obtained, matting a picture onto a white background only requires computing alpha * original picture + (1 - alpha) * white background picture.
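
A minimal sketch of this compositing step (the same operation appears in the full example later in this post), assuming image is a loaded color frame and matte is a predicted alpha map in [0, 1]; the file names are placeholders:

import cv2
import numpy as np

image = cv2.imread('person.jpg').astype(np.float32)  # placeholder input image
matte = np.load('matte.npy')[:, :, None]             # placeholder predicted alpha, shape (H, W) -> (H, W, 1)
white = np.full(image.shape, 255.0)                  # white background
composite = matte * image + (1 - matte) * white      # alpha * foreground + (1 - alpha) * background
cv2.imwrite('composite.png', composite.astype(np.uint8))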

Alpha takes continuous values in [0, 1] and can be understood as the probability that a pixel belongs to the foreground. This is different from portrait segmentation, where alpha can only be 0 or 1: segmentation is essentially a classification task, while matting is a regression task.

In the ground truth of the matting task, the alpha values can be seen to be distributed continuously between 0 and 1.

In the ground truth of semantic segmentation, each value is either 0 or 1.

3. Classification of matting algorithms

Currently popular matting algorithms can be roughly divided into two categories.

One is the Trimap-based approach, which requires prior information. Prior information in the broad sense includes a trimap, a rough mask, a background image without the person, pose information, and so on. The network uses the prior information together with the image information to jointly predict alpha.

The other is the Trimap-free approach, which predicts alpha from the image information alone. It is friendlier for practical applications, but its results are generally not as good as Trimap-based methods.

The current mainstream is the trimap-free algorithm.

3. Deep Image Matting Algorithm

1. Network structure diagram

2. Algorithm Interpretation

The network consists of an Encoder-Decoder stage and a Refinement stage.

(1) Encoder-Decoder stage

The input is an RGB image patch concatenated with the corresponding trimap patch, so it has 4 channels; after encoding and decoding, a single-channel raw alpha pred is output. The loss of this stage consists of two parts:

The first part is the absolute error between the predicted alpha and the ground-truth alpha. Since the L1 loss is not differentiable at 0, it is approximated with the Charbonnier loss.

The second part is the absolute error between the real RGB image and the RGB image composited from the predicted alpha with the ground-truth foreground and background. It imposes an extra constraint on the network and is likewise approximated with the Charbonnier loss.

The final loss is a weighted sum of the two parts.
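
A minimal NumPy sketch of these two losses and their weighted sum; the epsilon value and the 0.5 weighting follow the Deep Image Matting paper and are assumptions rather than something stated in this post:

import numpy as np

def charbonnier(pred, gt, eps=1e-6):
    # smooth approximation of the absolute error, differentiable at 0
    return np.sqrt((pred - gt) ** 2 + eps ** 2)

def dim_stage1_loss(alpha_pred, alpha_gt, fg, bg, image, w=0.5):
    # part 1: alpha prediction loss
    loss_alpha = charbonnier(alpha_pred, alpha_gt).mean()
    # part 2: compositional loss, compositing with the ground-truth foreground and background
    composed = alpha_pred * fg + (1 - alpha_pred) * bg
    loss_comp = charbonnier(composed, image).mean()
    # weighted sum of the two parts
    return w * loss_alpha + (1 - w) * loss_comp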

(2) Refinement stage

Its input is the concatenation of the raw alpha pred output by the Encoder-Decoder stage and the original RGB image, again 4 channels; the original RGB provides boundary detail information for the refinement. The key point is a skip connection: the raw alpha pred from the Encoder-Decoder stage is added to the refined alpha pred output by the Refinement stage, and the sum is the final prediction. In effect, the Refinement stage is a residual block that models boundary information through residual learning, analogous to the way a denoising model models noise.

The Refinement stage has only one loss: the Charbonnier loss between the refined alpha pred and the ground-truth alpha matte.
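
A minimal PyTorch-style sketch of this residual structure; the layer widths and depth are illustrative assumptions, not the exact configuration from the paper:

import torch
import torch.nn as nn

class Refinement(nn.Module):
    def __init__(self):
        super().__init__()
        # a few convolutions over the 4-channel input (RGB + raw alpha pred)
        self.body = nn.Sequential(
            nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, rgb, raw_alpha):
        x = torch.cat([rgb, raw_alpha], dim=1)      # concatenate along the channel axis
        residual = self.body(x)                     # learned boundary correction
        return (raw_alpha + residual).clamp(0, 1)   # skip connection: add, then clip to [0, 1]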

4. ModNet Algorithm: Trimap-Free Portrait Matting in Real Time

1. Network structure diagram

2. Algorithm Interpretation

The network consists of three branches: a semantic estimation branch, a detail prediction branch, and a semantic-detail fusion branch.
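
A rough sketch of how the three branches cooperate, based on the MODNet paper; the three branch callables below are placeholders standing in for the real sub-networks, not an actual MODNet API:

def modnet_forward(image, semantic_branch, detail_branch, fusion_branch):
    coarse = semantic_branch(image)           # low-resolution estimate of the foreground semantics
    detail = detail_branch(image, coarse)     # high-resolution prediction of boundary detail
    alpha = fusion_branch(coarse, detail)     # fuse semantics and detail into the final alpha matte
    return alpha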

5. ModNet matting practice

Reference article:

[Matting] MODNet: Real-time portrait matting model-onnx python deployment_onnx model download_Dudu Taicai Blog-CSDN Blog

The original author's onnx model link: https://download.csdn.net/download/qq_40035462/85046509

Code example:

import cv2
import time
from tqdm import tqdm
import numpy as np
import onnxruntime as rt


class Matting:
    def __init__(self, model_path='onnx_model/modnet.onnx', input_size=(512, 512)):
        self.model_path = model_path
        self.sess = rt.InferenceSession(self.model_path, providers=['CUDAExecutionProvider'])
        # self.sess = rt.InferenceSession(self.model_path)  # defaults to CPU execution
        self.input_name = self.sess.get_inputs()[0].name
        self.label_name = self.sess.get_outputs()[0].name
        self.input_size = input_size
        self.txt_font = cv2.FONT_HERSHEY_PLAIN

    def normalize(self, im, mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]):
        im = im.astype(np.float32, copy=False) / 255.0
        im -= mean
        im /= std
        return im

    def resize(self, im, target_size=608, interp=cv2.INTER_LINEAR):
        if isinstance(target_size, list) or isinstance(target_size, tuple):
            w = target_size[0]
            h = target_size[1]
        else:
            w = target_size
            h = target_size
        im = cv2.resize(im, (w, h), interpolation=interp)
        return im

    def preprocess(self, image, target_size=(512, 512), interp=cv2.INTER_LINEAR):
        image = self.normalize(image)
        image = self.resize(image, target_size=target_size, interp=interp)
        image = np.transpose(image, [2, 0, 1])
        image = image[None, :, :, :]
        return image

    def predict_frame(self, bgr_image):
        assert len(bgr_image.shape) == 3, "Please input a 3-channel color image."
        raw_image = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)
        h, w, c = raw_image.shape
        image = self.preprocess(raw_image, target_size=self.input_size)

        pred = self.sess.run(
            [self.label_name],
            {self.input_name: image.astype(np.float32)}
        )[0]
        pred = pred[0, 0]
        matte_np = self.resize(pred, target_size=(w, h), interp=cv2.INTER_NEAREST)
        matte_np = np.expand_dims(matte_np, axis=-1)
        return matte_np

    def predict_image(self, source_image_path, save_image_path):
        bgr_image = cv2.imread(source_image_path)
        assert len(bgr_image.shape) == 3, "Please input a 3-channel color image."
        matte_np = self.predict_frame(bgr_image)
        matting_frame = matte_np * bgr_image + (1 - matte_np) * np.full(bgr_image.shape, 255.0)
        matting_frame = matting_frame.astype('uint8')
        cv2.imwrite(save_image_path, matting_frame)

    def predict_camera(self):
        cap_video = cv2.VideoCapture(0)
        if not cap_video.isOpened():
            raise IOError("Error opening video stream or file.")
        beg = time.time()
        count = 0
        while cap_video.isOpened():
            ret, raw_frame = cap_video.read()
            if ret:
                count += 1
                matte_np = self.predict_frame(raw_frame)
                matting_frame = matte_np * raw_frame + (1 - matte_np) * np.full(raw_frame.shape, 255.0)
                matting_frame = matting_frame.astype('uint8')

                end = time.time()
                fps = round(count / (end - beg), 2)
                if count >= 50:
                    count = 0
                    beg = end

                cv2.putText(matting_frame, "fps: " + str(fps), (20, 20), self.txt_font, 2, (0, 0, 255), 1)

                cv2.imshow('Matting', matting_frame)
                if cv2.waitKey(1) & 0xFF == ord('q'):
                    break
            else:
                break
        cap_video.release()
        cv2.destroyAllWindows()

    def check_video(self, src_path, dst_path):
        cap1 = cv2.VideoCapture(src_path)
        fps1 = int(cap1.get(cv2.CAP_PROP_FPS))
        number_frames1 = cap1.get(cv2.CAP_PROP_FRAME_COUNT)
        cap2 = cv2.VideoCapture(dst_path)
        fps2 = int(cap2.get(cv2.CAP_PROP_FPS))
        number_frames2 = cap2.get(cv2.CAP_PROP_FRAME_COUNT)
        assert fps1 == fps2 and number_frames1 == number_frames2, "fps or number of frames not equal."

    def predict_video(self, video_path, save_path, threshold=2e-7):
        # use the OFD (one-frame delay) smoothing strategy
        time_beg = time.time()
        pre_t2 = None  # matte from two frames back
        pre_t1 = None  # matte from one frame back

        cap = cv2.VideoCapture(video_path)
        fps = int(cap.get(cv2.CAP_PROP_FPS))
        size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
                int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
        number_frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
        print("source video fps: {}, video resolution: {}, video frames: {}".format(fps, size, number_frames))
        videoWriter = cv2.VideoWriter(save_path, cv2.VideoWriter_fourcc('I', '4', '2', '0'), fps, size)

        ret, frame = cap.read()
        with tqdm(range(int(number_frames))) as t:
            for c in t:
                matte_np = self.predict_frame(frame)
                if pre_t2 is None:
                    pre_t2 = matte_np
                elif pre_t1 is None:
                    pre_t1 = matte_np
                    # write the first frame
                    matting_frame = pre_t2 * frame + (1 - pre_t2) * np.full(frame.shape, 255.0)
                    videoWriter.write(matting_frame.astype('uint8'))
                else:
                    # OFD: if mattes two frames apart agree but neighboring mattes differ sharply,
                    # treat the middle matte as flicker and reuse the earlier one
                    error_interval = np.mean(np.abs(pre_t2 - matte_np))
                    error_neigh = np.mean(np.abs(pre_t1 - pre_t2))
                    if error_interval < threshold < error_neigh:
                        pre_t1 = pre_t2

                    matting_frame = pre_t1 * frame + (1 - pre_t1) * np.full(frame.shape, 255.0)
                    videoWriter.write(matting_frame.astype('uint8'))
                    pre_t2 = pre_t1
                    pre_t1 = matte_np

                prev_frame = frame  # keep the last decoded frame; the final read below returns None
                ret, frame = cap.read()
            # write the last frame
            matting_frame = pre_t1 * prev_frame + (1 - pre_t1) * np.full(prev_frame.shape, 255.0)
            videoWriter.write(matting_frame.astype('uint8'))
            cap.release()
        print("video matting over, time consume: {}, fps: {}".format(time.time() - time_beg, number_frames / (time.time() - time_beg)))


if __name__ == '__main__':
    model = Matting(model_path='onnx_model/modnet.onnx', input_size=(512, 512))
    model.predict_camera()
    # model.predict_image('images\\1.jpeg', 'output\\1.png')
    # model.predict_image('images\\2.jpeg', 'output\\2.png')
    # model.predict_image('images\\3.jpeg', 'output\\3.png')
    # model.predict_image('images\\4.jpeg', 'output\\4.png')
    # model.predict_video("video\dance.avi", "output\dance_matting.avi")

The modnet.onnx file used in the code is provided as the attachment at the top of the original post.

Original article: https://blog.csdn.net/benben044/article/details/131136506