AI video style conversion: Stable Diffusion + TemporalKit

Basic method

First, use the Temporal-Kit plug-in to extract the key frames from the video as images, then redraw those key-frame images in Stable Diffusion WebUI, and then feed the converted key frames back into Temporal-Kit, which automatically fills in the images between the key frames. Finally, combine all of these images into a video.

For this method, try to pick a video with a simple background, a large subject that fills the frame, and slow movements. The elements generated during redrawing will then be more stable, the subject's movements will connect smoothly, and the final result will look better.

Install TemporalKit

Method One

Install via URL inside Stable Diffusion WebUI: open the "Extensions" - "Install from URL" tab, enter the GitHub repository address https://github.com/CiaraStrawberry/TemporalKit.git, and click "Install". After it succeeds you will see a prompt to restart; restart from the "Installed" tab. As shown below:

After restarting SD, you will see the Temporal-Kit tab in the top-level tab bar.

If you don't see it, check the console for error logs. I got module-not-found errors:

ModuleNotFoundError: No module named 'moviepy'

ModuleNotFoundError: No module named 'scenedetect'

This happens because some Python packages that Temporal-Kit depends on are not installed. Just install them with pip.

source /root/stable-diffusion-webui/venv/bin/activate
pip install moviepy
pip install scenedetect

I used source xxx/activate here because my Stable Diffusion WebUI runs in a Python virtual environment. If yours does too, activate that virtual environment first (be sure to change the path to your own), and then install the packages into it; they can only be found when installed in that environment.
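
To confirm the virtual environment is actually active before installing, you can check which Python and pip are being used (the path below matches my installation; change it to yours):

source /root/stable-diffusion-webui/venv/bin/activate
which python   # should print a path inside .../stable-diffusion-webui/venv/
pip -V         # the reported location should also be inside the venv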

After installing these dependencies, restart SD and the tab should normally appear. If it still doesn't, please leave a comment describing the problem.

Method Two

If you cannot access GitHub directly, for example because the machine has no external network access, you can download the plug-in separately and put it into the extensions directory of SD WebUI.

Download address of this plug-in: https://github.com/CiaraStrawberry/TemporalKit.git

If accessing GitHub is inconvenient, you can also follow my official account: Yinghuo Walk AI (yinghuo6ai) and send the message "Video style conversion" to get a download link.

After unzipping the plug-in, place it in the extensions directory of your SD WebUI, as shown in the picture below:
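
As a concrete sketch of that step on a cloud server (the paths and archive name below are examples; adjust them to your own setup):

# upload the downloaded archive to the server, then unzip it into the extensions directory
cd /root/stable-diffusion-webui/extensions
unzip ~/TemporalKit-main.zip -d .
mv TemporalKit-main TemporalKit   # optional: give the folder a clean name
# restart the WebUI afterwards so the new tab is loaded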

Extract keyframes

Why extract keyframes? Keyframe extraction turns the moments in the video with the largest motion changes into individual images, and the next step is to redraw those images. If you skip keyframe extraction and redraw every single frame of the video instead, first, the workload is much larger, and second, each redrawn frame may differ slightly from its neighbors, which can make the picture flicker badly.
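
Incidentally, the scenedetect package installed earlier also ships a standalone command-line tool. It is not what Temporal-Kit runs for you, but trying it on your clip is a quick way to see where the big scene changes are (input.mp4 is a placeholder for your own file):

# detect content changes, list the detected scenes, and save a representative image per scene
scenedetect -i input.mp4 detect-content list-scenes save-images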

Find Temporal-Kit in the top tab bar of SD WebUI and open it. Then click "Pre-Processing" and upload the video to be processed into the video area. Mine is a clip I grabbed from Douyin (a download link for this video is provided at the end of the article). Don't click "Run" right away; there are still some settings to go through, so keep reading.

Below the video you can see the settings that control how the images are extracted:

Sides: how many video frames fit along each side of a generated grid image. A value of 2 means one image holds 4 video frames (2*2); 3 means 9 video frames (3*3); the minimum is 1, i.e. one image contains one video frame. Set this together with the Height Resolution below.

Height Resolution: the pixel height of the generated image. The suggestion is: video height * Sides. For example, my video is 1080*720, so a single video frame is 720 pixels high; with Sides set to 2 above, that gives 720*2 = 1440. The formula is not absolute, though; you could also enter 720 or 2048. This value also has to respect your graphics card: if the card is not powerful enough, do not set it too high.

frames per keyframe: how many video frames correspond to each extracted keyframe.

fps: how many frames the video contains per second. You can usually find it in the video's file details on your computer (or read it with ffprobe, as sketched after these settings).

Target Folder: the output location of the keyframe images. They are actually written to an input folder created inside this directory, and the intermediate files of all later steps also live here, so it acts as a project directory; it is therefore recommended to create a separate folder for each video you process. Note that if you run in the cloud, this must be a directory on the server.

Batch Settings: because we need to process the entire video here, check Batch Run.
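
If you are unsure of the source video's resolution or fps, ffprobe (part of FFmpeg, which you may need to install) can read them, and the suggested Height Resolution follows from the output; input.mp4 is a placeholder for your own file:

# print the width, height and frame rate of the first video stream
ffprobe -v error -select_streams v:0 -show_entries stream=width,height,r_frame_rate -of default=noprint_wrappers=1 input.mp4
# example output: width=1080  height=720  r_frame_rate=30/1
# with Sides=2 the suggested Height Resolution is 720 * 2 = 1440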

After setting the parameters, click "Run" on the right side of the page.

Once all the keyframe images have been extracted, the first one is displayed in the image area. You can also see the extracted images in the file directory; here I use AutoDL's JupyterLab as an example.

Then click "img2img" to proceed to the next step.

Convert style

After clicking "img2img" in the previous step, the page jumps to the img2img tab with the first image already loaded.

We need to select a model and write some prompts. To get the ink-painting style I added a dedicated LoRA (I will share its download link at the end of the article). Choose whichever model and LoRA suit your own needs.

My prompts are posted here for easy copying.

Prompt: (masterpiece, realistic:1.4), (extremely intricate:1.2), a man, talking <lora:watercolor_v4:1>

Negative prompt: easy_negative, (((text))), (((logo))), (beard)

Then there are the parameter settings. Adjust them to your actual situation, and tweak them again if the result is not good.

Note two points:

  • Image width and height: these are carried over from the keyframe-extraction page. If the numbers are too large, it is better to reduce them first and upscale the result afterwards with a high-resolution upscaler.
  • Denoising strength: do not set it too high, otherwise the redrawn images change too much, transitions become hard to match, and the resulting video will flicker.

ControlNet is generally needed here to constrain the image, avoid excessive changes during redrawing, and keep the picture stable. I chose the Tile model, but you can also try edge and line models such as SoftEdge, Canny, and Lineart.

Then it's a matter of luck: keep generating images until you are satisfied.

Be sure to record the seed of any image you are satisfied with; it will be used shortly for batch generation.

Switch img2img to the "Batch" tab and fill in the two directories:

  • Input directory: the directory containing the images produced by the keyframe-extraction step.
  • Output directory: the directory where the redrawn images are saved; it is the fixed value output, just fill it in.

Fill in the seed of the image you were satisfied with here. Many tutorials online mention this, but do not expect every element of every redrawn image to stay consistent: each video frame is different, and a single seed cannot stably reproduce all the elements in the picture. You can see this for yourself.

The last step is to click the "Generate" button and wait for the batch processing to complete.

When you see this sentence below the image output area, the processing is basically complete. Sometimes the WebUI progress does not update in time, so keep an eye on the console or shell output as well.

Synthesize the video

Now we enter the exciting video synthesis phase. This step requires returning to the Temporal-Kit page.

Batch transformation

Click "Batch-Warp" to enter the batch transformation page.

Fill the complete project directory into the Input Folder. Note that this is neither the output directory nor the input directory, but their parent directory.
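
Before clicking run, it can help to double-check the layout from a terminal. The path below is only an example (mine lives on an AutoDL instance); what matters is that the folder you point to contains both input and output:

ls /root/autodl-tmp/project          # the project directory to fill into "Input Folder"
ls /root/autodl-tmp/project/input    # the keyframes extracted by Pre-Processing
ls /root/autodl-tmp/project/output   # the redrawn keyframes from the img2img batch step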

Then click "read_last_settings" to load the source video and related parameters. Note that "output resolution" has to be set manually; the default is 1024, and it is recommended to change it to the source video's resolution for consistency. The other parameters can simply be left as automatically loaded.

Finally, click "run" to start video synthesis.

The principle of this synthesis is to generate the intermediate sequence frames from the keyframes and then stitch everything together into a video. You can see the intermediate images in the result directory.
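
Temporal-Kit handles this for you; purely as an illustration of the final stitching step, a rough FFmpeg equivalent would look like the line below (the frame-name pattern, fps and file names are assumptions, not what the plug-in uses internally):

# concatenate a numbered PNG sequence into an H.264 video at 30 fps
ffmpeg -framerate 30 -i result/frame_%05d.png -c:v libx264 -pix_fmt yuv420p stitched.mp4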

A 5-second video takes about 10 minutes on an A5000 graphics card on AutoDL. Once synthesis succeeds, the video is displayed on the right side of the Batch-Warp page, where it can be played directly or downloaded.

The generated video still flickers a bit and is not fully coherent; my choice of source video was also not ideal.

The synthesized video has no sound by default. You can add the original video's audio back in a video editor such as Jianying (CapCut). I can't embed the video here, so open the cloud drive link to see the result:

Alibaba Cloud Drive share link
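
Also, if you prefer the command line to an editing app, FFmpeg can copy the audio track from the source video onto the generated one (the file names are placeholders):

# take the video stream from the generated file and the audio stream from the source, without re-encoding
ffmpeg -i generated.mp4 -i source.mp4 -map 0:v:0 -map 1:a:0 -c copy -shortest with_audio.mp4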

Single-image transformation

Temporal-Kit also provides a "Temporal-Warp" tool. In my testing it can turn a single redrawn image into a video, though a relatively short one.

Combining with EBSynth to synthesize video

That approach involves quite a few steps, so it will be covered in the next article.

Download

You can download and install the models, plug-ins, and materials covered in this article yourself using the methods described above, or use the ones I have prepared: follow the official account Yinghuo Walk AI (yinghuo6ai) and reply "Video style conversion" to get the download link.
