Video face replacement using PyTorch and OpenCV

The "DeepFaceLab" project has been released for a long time, and for research purposes, this article will introduce his principles and create a simplified version using Pytorch and OpenCV.

This article is divided into three parts. The first part extracts faces from the two videos and builds a standard face dataset. The second part uses the dataset, together with a neural network, to learn how to represent faces in a latent space and how to reconstruct face images from that representation. The final part uses the trained network to generate, for each frame of the destination video, a face that looks like the person in the source video but carries the expression of the person in the destination video; the original face is then replaced with this fake face and the new frames are written to a fake video.

The basic structure of the project (before the first run) looks like this

 ├── face_masking.py
 ├── main.py
 ├── face_extraction_tools.py
 ├── quick96.py
 ├── merge_frame_to_fake_video.py
 ├── data
 │ ├── data_dst.mp4
 │ ├── data_src.mp4

main.py is the main script, and the data folder contains the data_dst.mp4 and data_src.mp4 files required by the program.

Extraction and Alignment - Building Datasets

In the first part, we mainly introduce the code in the face_extraction_tools.py file.

Because the first step is to extract frames from the video, we need to build a function that saves the frames as JPEG images. This function accepts a path to the video and another path to the output folder.

 def extract_frames_from_video(video_path: Union[str, Path], output_folder: Union[str, Path], frames_to_skip: int=0) -> None:
     """
     Extract frames from a video and save them as JPEG images.
     Args:
         video_path (str | Path): the path to the input video from which the frames will be extracted
         output_folder (str | Path): the folder where the frames will be saved
         frames_to_skip (int): how many frames to skip after a frame which is saved. 0 will save all the frames.
             If, for example, this value is 2, the first frame will be saved, then frames 2 and 3 will be skipped,
             the 4th frame will be saved, and so on.
 
     Returns:
 
     """
 
     video_path = Path(video_path)
     output_folder = Path(output_folder)
 
     if not video_path.exists():
         raise ValueError(f'The path to the video file {video_path.absolute()} does not exist')
     if not output_folder.exists():
         output_folder.mkdir(parents=True)
 
     video_capture = cv2.VideoCapture(str(video_path))
 
     extract_frame_counter = 0
     saved_frame_counter = 0
     while True:
         ret, frame = video_capture.read()
         if not ret:
             break
 
         if extract_frame_counter % (frames_to_skip + 1) == 0:
             cv2.imwrite(str(output_folder / f'{saved_frame_counter:05d}.jpg'), frame, [cv2.IMWRITE_JPEG_QUALITY, 90])
             saved_frame_counter += 1
 
         extract_frame_counter += 1
 
     print(f'{saved_frame_counter} of {extract_frame_counter} frames saved')

The function first checks whether the video file exists and whether the output folder exists, creating the folder automatically if needed. It then uses OpenCV's VideoCapture class to read the video and saves it frame by frame as JPEG files in the output folder. Frames can also be skipped according to the frames_to_skip parameter.
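For example, assuming the same folder layout as the main script at the end of this article (data_src.mp4 under data, frames saved to data/src), a call to this function could look like this:

 from pathlib import Path
 import face_extraction_tools as fet

 # save every frame of the source video as numbered JPEG files under data/src
 fet.extract_frames_from_video(video_path=Path('data/data_src.mp4'),
                               output_folder=Path('data/src'),
                               frames_to_skip=0)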

Next we need a face extractor. The tool should be able to detect a face in an image, extract it, and align it. A convenient way to build such a tool is to create a FaceExtractor class with methods for detection, extraction and alignment.

For the detection part, we will use YuNet with OpenCV. YuNet is a fast and accurate CNN-based face detector that can be used through OpenCV's FaceDetectorYN class. To create such a FaceDetectorYN object, we need an ONNX file with the weights. The file can be found in the OpenCV Zoo; the current version is named "face_detection_yunet_2023mar.onnx".

Our __init__() method looks like this:

 def __init__(self, image_size):
         """
         Create a YuNet face detector to find faces in an image of size 'image_size'. The YuNet model
         will be downloaded from the OpenCV Zoo if it does not already exist locally.
         Args:
             image_size (tuple): a tuple of (width: int, height: int) of the image to be analyzed
         """
         detection_model_path = Path('models/face_detection_yunet_2023mar.onnx')
         if not detection_model_path.exists():
             detection_model_path.parent.mkdir(parents=True, exist_ok=True)
             url = "https://github.com/opencv/opencv_zoo/blob/main/models/face_detection_yunet/face_detection_yunet_2023mar.onnx"
             print('Downloading face detection model...')
             filename, headers = urlretrieve(url, filename=str(detection_model_path))
             print('Download finish!')
 
         self.detector = cv2.FaceDetectorYN.create(str(detection_model_path), "", image_size)

The function first checks whether the weights file exists and, if not, downloads it from the web. It then creates a FaceDetectorYN object with the weights file and the size of the images to analyze. The detect method simply wraps YuNet's detection call to find faces in an image:

     def detect(self, image):
         ret, faces = self.detector.detect(image)
         return ret, faces

The output of YuNet is a 2D array of size [num_faces, 15] containing the following information for each detected face (a short unpacking snippet follows the list):

  • 0-1: x, y of the upper-left corner of the bounding box
  • 2-3: width and height of the bounding box
  • 4-5: x, y of the right eye
  • 6-7: x, y of the left eye
  • 8-9: x, y of the nose tip
  • 10-11: x, y of the right corner of the mouth
  • 12-13: x, y of the left corner of the mouth
  • 14: face confidence score
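For illustration only, the fields of one detection row can be unpacked like this (these variable names are hypothetical, not part of the project code):

 # hypothetical example: unpack the fields of one YuNet detection row
 face = faces[0]
 x, y, w, h = face[0:4]                        # bounding box
 right_eye, left_eye = face[4:6], face[6:8]    # eye centers
 nose_tip = face[8:10]
 mouth_right, mouth_left = face[10:12], face[12:14]
 score = face[14]                              # detection confidence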

Now that we have the face position data, we can use it to get an aligned image of the face. The eye positions are the key information here: we want the eyes to be at the same level (same y coordinate) in the aligned image.

     @staticmethod
     def align(image, face, desired_face_width=256, left_eye_desired_coordinate=np.array((0.37, 0.37))):
         """
         Align the face so the eyes will be at the same level
         Args:
             image (np.ndarray): image with face
             face (np.ndarray):  face coordinates from the detection step
             desired_face_width (int): the final width of the aligned face image
             left_eye_desired_coordinate (np.ndarray): a length 2 array of values between
              0 and 1 where the left eye should be in the aligned image
 
         Returns:
             (np.ndarray): aligned face image
         """
         desired_face_height = desired_face_width
         right_eye_desired_coordinate = np.array((1 - left_eye_desired_coordinate[0], left_eye_desired_coordinate[1]))
 
         # get coordinate of the center of the eyes in the image
         right_eye = face[4:6]
         left_eye = face[6:8]
 
         # compute the angle of the right eye relative to the left eye
         dist_eyes_x = right_eye[0] - left_eye[0]
         dist_eyes_y = right_eye[1] - left_eye[1]
         dist_between_eyes = np.sqrt(dist_eyes_x ** 2 + dist_eyes_y ** 2)
         angles_between_eyes = np.rad2deg(np.arctan2(dist_eyes_y, dist_eyes_x) - np.pi)
         eyes_center = (left_eye + right_eye) // 2
 
         desired_dist_between_eyes = desired_face_width * (
                     right_eye_desired_coordinate[0] - left_eye_desired_coordinate[0])
         scale = desired_dist_between_eyes / dist_between_eyes
 
         M = cv2.getRotationMatrix2D(eyes_center, angles_between_eyes, scale)
 
         M[0, 2] += 0.5 * desired_face_width - eyes_center[0]
         M[1, 2] += left_eye_desired_coordinate[1] * desired_face_height - eyes_center[1]
 
         face_aligned = cv2.warpAffine(image, M, (desired_face_width, desired_face_height), flags=cv2.INTER_CUBIC)
         return face_aligned

This method receives the image and the detection data of a single face, together with the desired output width and the desired relative position of the left eye, and returns the aligned face image. We assume the output image is square and that the desired right eye position mirrors the left one: same y coordinate and an x coordinate of 1 - left_eye_x. The method computes the distance and angle between the eyes and the center point between them, builds a rotation-and-scaling matrix around that center, adds the translation needed to place the eyes at the desired coordinates, and finally warps the image.

The last method of the class is extract, which is similar to align but without the rotation: it simply crops the face from the image using the detected bounding box and also returns that bounding box.
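The exact implementation is in face_extraction_tools.py; a minimal sketch, assuming it only clamps and crops the detected bounding box, could look like this:

     def extract(self, image, face):
         """
         Sketch only: crop the face from the image using the detected bounding box,
         and return the crop together with the bounding box as integers.
         """
         x, y, w, h = face[:4].astype(int)
         x, y = max(x, 0), max(y, 0)
         face_image = image[y:y + h, x:x + w]
         return face_image, np.array([x, y, w, h])

A module-level helper then ties everything together: it iterates over the frames extracted from a video, detects a face in each one, aligns it, and saves the result to an 'aligned' subfolder.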

 def extract_and_align_face_from_image(input_dir: Union[str, Path], desired_face_width: int=256) -> None:
     """
     Extract the face from each image, align it and save it to a directory inside the input directory
     Args:
         input_dir (str|Path): path to the directory containing the images extracted from a video
         desired_face_width (int): the width of the aligned image in pixels
 
     Returns:
 
     """
 
     input_dir = Path(input_dir)
     output_dir = input_dir / 'aligned'
     if output_dir.exists():
         rmtree(output_dir)
     output_dir.mkdir()
 
 
     image = cv2.imread(str(input_dir / '00000.jpg'))
     image_height = image.shape[0]
     image_width = image.shape[1]
 
     detector = FaceExtractor((image_width, image_height))
 
     for image_path in tqdm(list(input_dir.glob('*.jpg'))):
         image = cv2.imread(str(image_path))
 
         ret, faces = detector.detect(image)
         if faces is None:
             continue
 
         face_aligned = detector.align(image, faces[0, :], desired_face_width)
         cv2.imwrite(str(output_dir / f'{image_path.name}'), face_aligned, [cv2.IMWRITE_JPEG_QUALITY, 90])

Training

For the network we will use an autoencoder. An autoencoder has two main components: an encoder and a decoder. The encoder takes the original image and finds its latent representation, and the decoder uses that latent representation to reconstruct the original image.

For our task, we train one encoder that finds a latent face representation and two decoders: one that reconstructs the source face and one that reconstructs the destination face.

After these three components are trained, we return to the original goal: creating an image of the source face but with the destination expression. In other words, we use decoder A with an image of face B.

The face latent space preserves the main features of the face, such as position, orientation and expression. The decoder takes this encoded information and learns how to construct a full face image. Since decoder A only knows how to construct faces of type A, it takes the features of image B from the encoder and constructs a type A image from them.

In this article we will use a slightly modified version of the Quick96 architecture from the original DeepFaceLab project.

Full details of the model can be found in the quick96.py file.
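To make the shared-encoder / two-decoder idea concrete, here is a heavily simplified, hypothetical sketch (these layers are stand-ins, not the real Quick96 blocks, which also include an intermediate Inter block between encoder and decoder):

 import torch
 import torch.nn as nn

 class TinyEncoder(nn.Module):
     def __init__(self):
         super().__init__()
         self.net = nn.Sequential(nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.LeakyReLU(0.1),
                                  nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.LeakyReLU(0.1))

     def forward(self, x):
         return self.net(x)  # latent representation of the face

 class TinyDecoder(nn.Module):
     def __init__(self):
         super().__init__()
         self.net = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.1),
                                  nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid())

     def forward(self, z):
         return self.net(z)  # reconstructed face image

 encoder = TinyEncoder()
 decoder_src = TinyDecoder()  # learns to rebuild the source (A) face
 decoder_dst = TinyDecoder()  # learns to rebuild the destination (B) face

 # training: each decoder reconstructs its own identity from the shared latent code
 #   decoder_src(encoder(warped_src)) is compared against target_src
 #   decoder_dst(encoder(warped_dst)) is compared against target_dst
 # inference (the swap): encode a destination face, decode it with the source decoder
 fake = decoder_src(encoder(torch.rand(1, 3, 96, 96)))  # dummy input, just to show the wiring
 print(fake.shape)  # torch.Size([1, 3, 96, 96])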

Before we can train the model, we need to process the data. To make the model robust and avoid overfitting, we apply two types of augmentation to the original face images. The first is a set of general transformations: rotation, scaling, translation in the x and y directions, and horizontal flipping. For each transformation we define a range for the parameter or a probability (for example, the range of angles for rotation), and then draw a random value from that range to apply to the image.

 random_transform_args = {
     'rotation_range': 10,
     'zoom_range': 0.05,
     'shift_range': 0.05,
     'random_flip': 0.5,
   }
 
 def random_transform(image, rotation_range, zoom_range, shift_range, random_flip):
     """
     Make a random transformation for an image, including rotation, zoom, shift and flip.
     Args:
         image (np.array): an image to be transformed
         rotation_range (float): the range of possible angles to rotate - [-rotation_range, rotation_range]
         zoom_range (float): range of possible scales - [1 - zoom_range, 1 + zoom_range]
         shift_range (float): the percent of translation for x  and y
         random_flip (float): the probability of horizontal flip
 
     Returns:
         (np.array): transformed image
     """
     h, w = image.shape[0:2]
     rotation = np.random.uniform(-rotation_range, rotation_range)
     scale = np.random.uniform(1 - zoom_range, 1 + zoom_range)
     tx = np.random.uniform(-shift_range, shift_range) * w
     ty = np.random.uniform(-shift_range, shift_range) * h
     mat = cv2.getRotationMatrix2D((w // 2, h // 2), rotation, scale)
     mat[:, 2] += (tx, ty)
     result = cv2.warpAffine(image, mat, (w, h), borderMode=cv2.BORDER_REPLICATE)
     if np.random.random() < random_flip:
         result = result[:, ::-1]
     return result

The second is a distortion created using a noisy interpolation map. This distortion forces the model to understand the key features of the face and makes it generalize better.

 def random_warp(image):
     """
     Create a distorted face image and a target undistorted image
     Args:
         image  (np.array): image to warp
 
     Returns:
         (np.array): warped version of the image
         (np.array): target image to construct from the warped version
     """
     h, w = image.shape[:2]
 
     # build the coordinate maps according to which the image will be warped
     range_ = np.linspace(h / 2 - h * 0.4, h / 2 + h * 0.4, 5)
     mapx = np.broadcast_to(range_, (5, 5))
     mapy = mapx.T
 
     # add noise to the maps to get a distortion of the face when warping the image
     mapx = mapx + np.random.normal(size=(5, 5), scale=5*h/256)
     mapy = mapy + np.random.normal(size=(5, 5), scale=5*h/256)
 
     # get interpolation map for the center of the face with size of (96, 96)
     interp_mapx = cv2.resize(mapx, (int(w / 2 * (1 + 0.25)) , int(h / 2 * (1 + 0.25))))[int(w/2 * 0.25/2):int(w / 2 * (1 + 0.25) - w/2 * 0.25/2), int(w/2 * 0.25/2):int(w / 2 * (1 + 0.25) - w/2 * 0.25/2)].astype('float32')
     interp_mapy = cv2.resize(mapy, (int(w / 2 * (1 + 0.25)) , int(h / 2 * (1 + 0.25))))[int(w/2 * 0.25/2):int(w / 2 * (1 + 0.25) - w/2 * 0.25/2), int(w/2 * 0.25/2):int(w / 2 * (1 + 0.25) - w/2 * 0.25/2)].astype('float32')
 
     # remap the face image according to the interpolation map to get warp version
     warped_image = cv2.remap(image, interp_mapx, interp_mapy, cv2.INTER_LINEAR)
 
     # create the target (undistorted) image
     # find a transformation to go from the source coordinates to the destination coordinate
     src_points = np.stack([mapx.ravel(), mapy.ravel()], axis=-1)
     dst_points = np.mgrid[0:w//2+1:w//8, 0:h//2+1:h//8].T.reshape(-1, 2)
 
     # We want to find a similarity matrix (scale rotation and translation) between the
     # source and destination points. The matrix should have the structure
     # [[a, -b, c],
     #  [b,  a, d]]
     # so we can construct unknown vector [a, b, c, d] and solve for it using least
     # squares with the source and destination x and y points.
     A = np.zeros((2 * src_points.shape[0], 2))
     A[0::2, :] = src_points  # [x, y]
     A[0::2, 1] = -A[0::2, 1] # [x, -y]
     A[1::2, :] = src_points[:, ::-1]  # [y, x]
     A = np.hstack((A, np.tile(np.eye(2), (src_points.shape[0], 1))))  # [x, -y, 1, 0] for x coordinate and [y, x, 0 ,1] for y coordinate
     b = dst_points.flatten()  # arrange as [x0, y0, x1, y1, ..., xN, yN]
 
     similarity_mat = np.linalg.lstsq(A, b, rcond=None)[0] # get the similarity matrix elements as vector [a, b, c, d]
     # construct the similarity matrix from the result vector of the least squares
     similarity_mat = np.array([[similarity_mat[0], -similarity_mat[1], similarity_mat[2]],
                                [similarity_mat[1], similarity_mat[0], similarity_mat[3]]])
     # use the similarity matrix to construct the target image using affine transformation
     target_image = cv2.warpAffine(image, similarity_mat, (w // 2, h // 2))
 
     return warped_image, target_image

This function has two parts. We first create coordinate maps for the area around the face: one map of x coordinates and one of y coordinates, where the values in the mapx and mapy variables are pixel coordinates. We then add noise to these maps so that each coordinate shifts slightly in a random direction. The noisy interpolation map is resized and cropped to cover the center of the face with a size of 96x96 pixels, and we use it to remap the image, which produces the warped image.

In the second part we create the undistorted target image, which is what the model should reconstruct from the warped image. We use the noisy coordinates as source points and define a regular grid of destination points for the target image, then use least squares to find a similarity transformation (scale, rotation and translation) that maps the source points to the destination points, and apply it to the image to obtain the target image.
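As a small, hypothetical end-to-end example of the augmentation pipeline (the actual batching is done by the get_training_data helper in quick96.py, and the file path here is only illustrative):

 import cv2

 # load one aligned face saved in the extraction step and bring it to the training size
 face = cv2.imread('data/src/aligned/00000.jpg')
 face = cv2.resize(face, (2 * 96, 2 * 96)).astype('float32') / 255.  # 192x192, normalized

 transformed = random_transform(face, **random_transform_args)
 warped, target = random_warp(transformed)
 print(warped.shape, target.shape)  # both (96, 96, 3): distorted input and undistorted target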

Then we can create a Dataset class to handle the data. The FaceData class is very simple: it takes the path to a folder containing the src and dst folders with the data created in the previous section, and returns random pairs of source and destination face images of size (2*96, 2*96) normalized to [0, 1]. What the network actually needs, however, is a transformed and warped image plus a target image for both the source and the destination faces, so we also need to implement a collate_fn.
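The real FaceData class lives in quick96.py; a minimal sketch, assuming the aligned faces are stored under src/aligned and dst/aligned inside the data folder, might look like this:

 import cv2
 import numpy as np
 from pathlib import Path
 from torch.utils.data import Dataset

 class FaceData(Dataset):
     """Sketch only: serve random pairs of normalized source / destination faces."""
     def __init__(self, data_path):
         data_path = Path(data_path)
         self.src_paths = list((data_path / 'src' / 'aligned').glob('*.jpg'))
         self.dst_paths = list((data_path / 'dst' / 'aligned').glob('*.jpg'))

     def __len__(self):
         return min(len(self.src_paths), len(self.dst_paths))

     def __getitem__(self, index):
         # the two identities are unpaired, so a random pair is enough here
         src = cv2.cvtColor(cv2.imread(str(np.random.choice(self.src_paths))), cv2.COLOR_BGR2RGB)
         dst = cv2.cvtColor(cv2.imread(str(np.random.choice(self.dst_paths))), cv2.COLOR_BGR2RGB)
         src = cv2.resize(src, (2 * 96, 2 * 96)).astype(np.float32) / 255.
         dst = cv2.resize(dst, (2 * 96, 2 * 96)).astype(np.float32) / 255.
         return src, dst

The collate_fn method of the class then turns a batch of such pairs into the four tensors the training loop expects: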

 def collate_fn(self, batch):
         """
         Collate function to arrange the data returned in a batch. The batch is a list
         of tuples containing pairs of source and destination images, which is the input of this
         function, and the function returns a tuple with four 4D tensors of the warped and target
         images for the source and destination
         Args:
             batch (list): a list of tuples contains pairs of source and destination images
                 as numpy array
 
         Returns:
             (torch.Tensor): a 4D tensor of the warped version of the source images
             (torch.Tensor): a 4D tensor of the target source images
             (torch.Tensor): a 4D tensor of the warped version of the destination images
             (torch.Tensor): a 4D tensor of the target destination images
         """
         images_src, images_dst = list(zip(*batch))  # convert list of tuples with pairs of images into tuples of source and destination images
         warp_image_src, target_image_src = get_training_data(images_src, len(images_src))
         warp_image_src = torch.tensor(warp_image_src, dtype=torch.float32).permute(0, 3, 1, 2).to(device)
         target_image_src = torch.tensor(target_image_src, dtype=torch.float32).permute(0, 3, 1, 2).to(device)
         warp_image_dst, target_image_dst = get_training_data(images_dst, len(images_dst))
         warp_image_dst = torch.tensor(warp_image_dst, dtype=torch.float32).permute(0, 3, 1, 2).to(device)
         target_image_dst = torch.tensor(target_image_dst, dtype=torch.float32).permute(0, 3, 1, 2).to(device)
 
         return warp_image_src, target_image_src, warp_image_dst, target_image_dst

When we get data from the DataLoader object, it returns a tuple containing pairs of source and destination images from the FaceData object. collate_fn takes these pairs, applies the transformations and distortions, and returns four 4D tensors: the warped source images, the target source images, the warped destination images, and the target destination images.
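A hypothetical wiring of the dataset and the collate function (the batch size is chosen only for illustration):

 from torch.utils.data import DataLoader

 # FaceData in quick96.py defines collate_fn as a method of the class
 dataset = FaceData('./data')
 loader = DataLoader(dataset, batch_size=16, shuffle=True, collate_fn=dataset.collate_fn)

 warp_src, target_src, warp_dst, target_dst = next(iter(loader))
 print(warp_src.shape)  # expected: torch.Size([16, 3, 96, 96]), already on the training device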

The loss function used for training is a combination of MSE (L2) loss and DSSIM.
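The exact implementation is in quick96.py; a minimal sketch of such a combined loss, assuming equal weighting and a simple box-filter SSIM (the project may use a Gaussian window and different weights), could be:

 import torch
 import torch.nn.functional as F

 def dssim(img1, img2, window_size=11, C1=0.01 ** 2, C2=0.03 ** 2):
     """Structural dissimilarity (1 - SSIM) / 2 using a box window; inputs are NCHW tensors in [0, 1]."""
     pad, channels = window_size // 2, img1.shape[1]
     window = torch.ones(channels, 1, window_size, window_size, device=img1.device) / window_size ** 2
     mu1 = F.conv2d(img1, window, padding=pad, groups=channels)
     mu2 = F.conv2d(img2, window, padding=pad, groups=channels)
     sigma1_sq = F.conv2d(img1 * img1, window, padding=pad, groups=channels) - mu1 ** 2
     sigma2_sq = F.conv2d(img2 * img2, window, padding=pad, groups=channels) - mu2 ** 2
     sigma12 = F.conv2d(img1 * img2, window, padding=pad, groups=channels) - mu1 * mu2
     ssim_map = ((2 * mu1 * mu2 + C1) * (2 * sigma12 + C2)) / ((mu1 ** 2 + mu2 ** 2 + C1) * (sigma1_sq + sigma2_sq + C2))
     return (1 - ssim_map.mean()) / 2

 def reconstruction_loss(pred, target):
     # equal mix of pixel-wise L2 and structural dissimilarity
     return F.mse_loss(pred, target) + dssim(pred, target)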

The training metrics and results are shown in the figure above.

Generating the Video

The last step is to create the video. This is handled by the merge_frames_to_fake_video function in merge_frame_to_fake_video.py. We also create a FaceMasking class using MediaPipe.

When a FaceMasking object is initialized, it downloads the MediaPipe face landmarks model if needed and creates the landmark detector.

 class FaceMasking:
     def __init__(self):
         landmarks_model_path = Path('models/face_landmarker.task')
         if not landmarks_model_path.exists():
             landmarks_model_path.parent.mkdir(parents=True, exist_ok=True)
             url = "https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/latest/face_landmarker.task"
             print('Downloading face landmarks model...')
             filename, headers = urlretrieve(url, filename=str(landmarks_model_path))
             print('Download finish!')
 
         base_options = python_mp.BaseOptions(model_asset_path=str(landmarks_model_path))
         options = vision.FaceLandmarkerOptions(base_options=base_options,
                                                output_face_blendshapes=False,
                                                output_facial_transformation_matrixes=False,
                                                num_faces=1)
         self.detector = vision.FaceLandmarker.create_from_options(options)

This class also has a method to get the mask from the face image:

 def get_mask(self, image):
         """
         return uint8 mask of the face in image
         Args:
             image (np.ndarray): RGB image with single face
 
         Returns:
             (np.ndarray): single channel uint8 mask of the face
         """
         im_mp = mp.Image(image_format=mp.ImageFormat.SRGB, data=image.astype(np.uint8).copy())
         detection_result = self.detector.detect(im_mp)
 
         x = np.array([landmark.x * image.shape[1] for landmark in detection_result.face_landmarks[0]], dtype=np.float32)
         y = np.array([landmark.y * image.shape[0] for landmark in detection_result.face_landmarks[0]], dtype=np.float32)
 
         hull = np.round(np.squeeze(cv2.convexHull(np.column_stack((x, y))))).astype(np.int32)
         mask = np.zeros(image.shape[:2], dtype=np.uint8)
         mask = cv2.fillConvexPoly(mask, hull, 255)
         kernel = np.ones((7, 7), np.uint8)
         mask = cv2.erode(mask, kernel)
 
         return mask

The function first converts the input image to a MediaPipe image structure and then runs the face landmark detector on it. OpenCV is then used to find the convex hull of the landmark points and to fill that hull with fillConvexPoly, which gives a binary mask. Finally, an erosion operation shrinks the mask slightly so it stays inside the face.
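For a quick check (the path is only illustrative), the mask can be applied to one aligned face like this:

 import cv2

 face_bgr = cv2.imread('data/dst/aligned/00000.jpg')
 masker = FaceMasking()
 mask = masker.get_mask(face_bgr[..., ::-1])                   # the method expects an RGB image
 masked_face = cv2.bitwise_and(face_bgr, face_bgr, mask=mask)
 cv2.imwrite('masked_face.jpg', masked_face)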

The merge_frames_to_fake_video function integrates all of the steps above: it creates a video writer, a FaceExtractor object and a FaceMasking object, builds the neural network components and loads their weights.

 def merge_frames_to_fake_video(dst_frames_path, model_name='Quick96', saved_models_dir='saved_model'):
     model_path = Path(saved_models_dir) / f'{model_name}.pth'
     dst_frames_path = Path(dst_frames_path)
     image = Image.open(next(dst_frames_path.glob('*.jpg')))
     image_size = image.size
     result_video = cv2.VideoWriter(str(dst_frames_path.parent / 'fake.mp4'), cv2.VideoWriter_fourcc(*'MJPG'), 30, image.size)
 
     face_extractor = FaceExtractor(image_size)
     face_masker = FaceMasking()
 
     encoder = Encoder().to(device)
     inter = Inter().to(device)
     decoder = Decoder().to(device)
 
     saved_model = torch.load(model_path)
     encoder.load_state_dict(saved_model['encoder'])
     inter.load_state_dict(saved_model['inter'])
     decoder.load_state_dict(saved_model['decoder_src'])
 
     model = torch.nn.Sequential(encoder, inter, decoder)

Then, for every frame of the destination video, we look for a face. If there is no face, the frame is written to the video unchanged. If there is a face, it is extracted, transformed into a valid input for the network, and a new face is generated.

We compute masks for the original face and the generated face, and use image moments on the original mask to find the center of the face. Seamless cloning is then used to replace the original face with the generated one in a realistic way (for example, it adapts the skin tone of the fake face to match the original skin). Finally the result is placed back into the original frame, which is written to the video file.

     frames_list = sorted(dst_frames_path.glob('*.jpg'))
     for ii, frame_path in enumerate(frames_list, 1):
         print(f'Working on {ii}/{len(frames_list)}')
         frame = cv2.imread(str(frame_path))
         retval, face = face_extractor.detect(frame)
         if face is None:
             result_video.write(frame)
             continue
         face_image, face = face_extractor.extract(frame, face[0])
         face_image = face_image[..., ::-1].copy()
         face_image_cropped = cv2.resize(face_image, (96, 96)) #face_image_resized[96//2:96+96//2, 96//2:96+96//2]
         face_image_cropped_torch = torch.tensor(face_image_cropped / 255., dtype=torch.float32).permute(2, 0, 1).unsqueeze(0).to(device)
         generated_face_torch = model(face_image_cropped_torch)
         generated_face = (generated_face_torch.squeeze().permute(1,2,0).detach().cpu().numpy() * 255).astype(np.uint8)
 
 
         mask_origin = face_masker.get_mask(face_image_cropped)
         mask_fake = face_masker.get_mask(generated_face)
 
         origin_moments = cv2.moments(mask_origin)
         cx = np.round(origin_moments['m10'] / origin_moments['m00']).astype(int)
         cy = np.round(origin_moments['m01'] / origin_moments['m00']).astype(int)
         try:
             output_face = cv2.seamlessClone(generated_face, face_image_cropped, mask_fake, (cx, cy), cv2.NORMAL_CLONE)
         except:
             print('Skip')
             continue
 
         fake_face_image = cv2.resize(output_face, (face_image.shape[1], face_image.shape[0]))
         fake_face_image = fake_face_image[..., ::-1] # change to BGR
         frame[face[1]:face[1]+face[3], face[0]:face[0]+face[2]] = fake_face_image
         result_video.write(frame)
 
     result_video.release()

Looking at the result for a single frame: the model is not perfect, and certain face angles, especially side views, produce a less-than-great image, but the overall effect is good.

Putting It All Together

To run the whole pipeline, we also need a main script.

 from pathlib import Path
 import face_extraction_tools as fet
 import quick96 as q96
 from merge_frame_to_fake_video import merge_frames_to_fake_video
 
 ##### user parameters #####
 # True for executing the step
 extract_and_align_src = True
 extract_and_align_dst = True
 train = True
 eval = False
 
 model_name = 'Quick96'  # use this name to save and load the model
 new_model = False  # True for creating a new model even if a model with the same name already exists
 
 ##### end of user parameters #####
 
 # the path for the videos to process
 data_root = Path('./data')
 src_video_path = data_root / 'data_src.mp4'
 dst_video_path = data_root / 'data_dst.mp4'
 
 # path to folders where the intermediate product will be saved
 src_processing_folder = data_root / 'src'
 dst_processing_folder = data_root / 'dst'
 
 # step 1: extract the frames from the videos
 if extract_and_align_src:
     fet.extract_frames_from_video(video_path=src_video_path, output_folder=src_processing_folder, frames_to_skip=0)
 if extract_and_align_dst:
     fet.extract_frames_from_video(video_path=dst_video_path, output_folder=dst_processing_folder, frames_to_skip=0)
 
 # step 2: extract and align face from frames
 if extract_and_align_src:
     fet.extract_and_align_face_from_image(input_dir=src_processing_folder, desired_face_width=256)
 if extract_and_align_dst:
     fet.extract_and_align_face_from_image(input_dir=dst_processing_folder, desired_face_width=256)
 
 # step 3: train the model
 if train:
     q96.train(str(data_root), model_name, new_model, saved_models_dir='saved_model')
 
 # step 4: create the fake video
 if eval:
     merge_frames_to_fake_video(dst_processing_folder, model_name, saved_models_dir='saved_model')

Summary

In this article we walked through the DeepFaceLab pipeline and implemented it in our own way. We first extract frames from the videos, then extract and align the faces from those frames to build a dataset. A neural network learns how to represent faces in a latent space and how to reconstruct them. Finally, we traverse the frames of the destination video, find the face and replace it. That is the complete process of this project.

This article is for study and research only. The full project can be found here:

https://avoid.overfit.cn/post/ec72d69b57464a08803c86db8720e3e9

Author: DZ


Source: blog.csdn.net/m0_46510245/article/details/132421618