CVPR 2022 | Kuaishou and the Chinese Academy of Sciences open-source StyTr^2: a Transformer-based image stylization method


This article is reposted from Heart of the Machine | Author: Kuaishou Y-tech

This paper proposes a Transformer-based image style transfer method. We hope it will advance frontier research on image stylization and the application of Transformers in vision, especially in image generation.


  • Paper link:

  • Code address:

Image stylization is an interesting and practical research topic: it renders a content image in the style of a reference style image. It has been studied in academia for many years and is widely used in industry, including in short-video products. For example, mobile users can try portrait stylization effects such as hand-drawn, watercolor, oil painting, and cute Q-version styles in apps including the main Kuaishou app, Kuaishou Speed Edition, Yitian Camera, and Kuaiying.

Traditional stylization methods based on texture synthesis can generate vivid stylized images, but they are computationally expensive because they model stroke appearance and the painting process. Researchers then turned to neural stylization based on convolutional neural networks (CNNs). Optimization-based methods take a content image and a style image as references and iteratively optimize the generated result. Arbitrary stylization methods follow an encoder, stylization module, decoder design and adjust the second-order statistics of the content image according to the style image in an end-to-end manner, which produces stylized results efficiently. However, because these methods have limited ability to model the relationship between content and style, they fail to give satisfactory results in many cases. To overcome this problem, some methods apply a self-attention mechanism to improve the stylization results.
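To make the statistics adjustment described above concrete, here is a minimal sketch of one well-known instance: shifting and scaling each content feature channel so that its mean and standard deviation match those of the style features (the idea behind adaptive instance normalization). The function names and the toy feature values are assumptions for illustration, and this is not the method proposed in this paper.

```python
import math

def channel_stats(channel, eps=1e-5):
    """Return (mean, std) of one feature channel given as a flat list."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    return mean, math.sqrt(var + eps)

def match_statistics(content, style):
    """Shift and scale each content channel to the style channel's mean/std."""
    out = []
    for c_chan, s_chan in zip(content, style):
        c_mean, c_std = channel_stats(c_chan)
        s_mean, s_std = channel_stats(s_chan)
        # Normalize the content channel, then re-color it with style statistics.
        out.append([(v - c_mean) / c_std * s_std + s_mean for v in c_chan])
    return out

# Toy single-channel features (hypothetical values).
content_feat = [[0.0, 1.0, 2.0, 3.0]]
style_feat = [[10.0, 10.0, 14.0, 14.0]]
stylized = match_statistics(content_feat, style_feat)
out_mean, out_std = channel_stats(stylized[0])
```

The spatial ordering of the content values is untouched; only the per-channel statistics change, which is why such modules are cheap but model the content-style relationship only coarsely.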

Current mainstream stylization methods generally use CNNs to learn style and content representations. Because the receptive field of a convolution is limited, shallow layers focus on local features, and only a deep network can capture long-range dependencies in an image by integrating local information. However, increasing network depth reduces the resolution of the image features and loses detail. The missing detail shows up in the stylized results, where it harms both the preservation of content structure and the rendering of style patterns. As shown in Figure 1(a), a CNN-based stylization algorithm discards some details during feature extraction. In addition, some studies have found that the content representation obtained by typical CNN-based stylization methods is inaccurate, which leads to content leakage: after several rounds of repeated stylization, the result retains almost none of the content structure of the original input.
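The trade-off between depth, receptive field, and resolution can be checked with a little arithmetic. The sketch below uses the standard recurrence for stacked convolutions (each layer widens the receptive field by (kernel - 1) times the accumulated stride); the layer configurations are hypothetical and do not correspond to any specific network in the article.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride). Returns the receptive field after each layer."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes

# Five 3x3 convolutions at full resolution: the receptive field grows slowly.
plain = receptive_field([(3, 1)] * 5)

# Same depth with two stride-2 layers: wider view, but features are 4x smaller per side.
strided = receptive_field([(3, 1), (3, 2), (3, 1), (3, 2), (3, 1)])
```

Without downsampling, five layers see only an 11-pixel window; adding strides widens it to 21 pixels at the cost of feature resolution, which is the loss of detail the paragraph above describes.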


Figure 1 (a) Intermediate-layer visualizations of a CNN-based stylization method; (b) intermediate-layer visualizations of our method

Following the success of the Transformer in natural language processing (NLP), Transformer-based architectures have been applied to a variety of vision tasks. Applying the Transformer to computer vision has two advantages. First, with the self-attention mechanism, a Transformer can easily learn global information about the input, so every layer has a holistic view of the input. Second, the Transformer is a relation-modeling structure, and different layers extract similar structural information (as shown in Figure 1(b)). The Transformer therefore has strong feature representation ability, avoids losing detail during feature extraction, and preserves the structure of the generated result well.

To address the biased content representation of CNN-based stylization methods, this paper proposes a novel image stylization algorithm, StyTr^2.


To use the Transformer's ability to capture long-range dependencies for image stylization, this paper designs the structure shown in Figure 2. The model has three main parts: a content Transformer encoder, a style Transformer encoder, and a Transformer decoder. The content and style Transformer encoders encode long-range information of the images in the content domain and the style domain, respectively, which effectively avoids the loss of detail. The Transformer decoder then converts the content features into the stylized result according to the style image features.
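The data flow of the three parts can be sketched with plain scaled dot-product attention. This is a simplified illustration only: it omits learned projections, multiple heads, feed-forward layers, normalization, and positional encoding, and the choice of content tokens as queries with style tokens as keys and values is an assumption about how the decoder combines the two sequences. All names and toy values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of the value vectors, one output dimension at a time.
        outputs.append([sum(w * v[i] for w, v in zip(row, values))
                        for i in range(len(values[0]))])
    return outputs, weights

def add(xs, ys):
    """Residual connection: element-wise sum of two token sequences."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]

def encode(tokens):
    """Encoder step: self-attention, so each patch sees every other patch."""
    attended, _ = attention(tokens, tokens, tokens)
    return add(tokens, attended)

def decode(content_seq, style_seq):
    """Decoder step: content tokens query the style tokens (assumed roles)."""
    attended, weights = attention(content_seq, style_seq, style_seq)
    return add(content_seq, attended), weights

content_patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy patch embeddings
style_patches = [[0.5, -0.5], [2.0, 2.0]]

stylized, weights = decode(encode(content_patches), encode(style_patches))
```

The output keeps one token per content patch, so the spatial layout of the content is preserved, while each token is updated with a mixture of style tokens weighted by similarity.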


Figure 2 Network structure

Furthermore, this paper raises two important questions about traditional positional encoding (PE). First, for image generation tasks, should image semantics be considered when computing the PE? Traditional PE is designed for logically ordered sentences, whereas image patch sequences are organized according to the semantics of the image content. Let d(., .) denote the distance between two image patches. As shown in the right part of Figure 3(a), d((0, 3), (1, 3)) (the red and green blocks) and d((0, 3), (3, 3)) (the red and cyan blocks) should be similar, because the stylization task requires patches with similar content to have similar stylization results. Second, when the input image size increases exponentially, does traditional sinusoidal positional encoding still work for vision tasks? As shown in Figure 3(a), when the image size changes, the relative distances between patches at the same semantic locations (marked by the small blue rectangles) change significantly, which does not suit the multi-scale input requirements of vision tasks.
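The second question can be illustrated numerically. The sketch below uses the standard 1-D sinusoidal encoding and compares the distance between two patches at the same relative locations along one axis (the left edge and one quarter of the way across) at two resolutions. The encoding dimension and the patch counts are assumptions chosen for illustration.

```python
import math

def sinusoidal_pe(pos, dim=8, base=10000.0):
    """Standard 1-D sinusoidal encoding of an integer position."""
    pe = []
    for i in range(0, dim, 2):
        angle = pos / (base ** (i / dim))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def pe_distance(a, b):
    """Euclidean distance between the encodings of positions a and b."""
    return math.dist(sinusoidal_pe(a), sinusoidal_pe(b))

# Same relative locations (0 and 1/4 of the width) at two resolutions.
small = pe_distance(0, 1)  # image split into 4 patches per side
large = pe_distance(0, 4)  # same image at 4x the size: 16 patches per side
```

The two distances differ substantially even though the patches cover the same semantic locations, which is the scale sensitivity the paragraph above points out.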


Figure 3 Schematic diagram of CAPE calculation

To this end, this paper proposes Content-Aware Positional Encoding (CAPE), which is scale-invariant and tied to the semantics of the image content, making it better suited to stylization tasks.
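As a rough sketch of how an encoding can be both content-dependent and scale-invariant, the example below pools a grid of patch features to a fixed n x n grid and then gives every patch the code of the grid cell at its relative location. The single-channel features, the grid size, the absence of any learned layer, and the nearest-cell lookup are all simplifying assumptions for illustration; this is not the exact formulation of CAPE.

```python
def adaptive_avg_pool(grid, n):
    """Average an H x W grid of scalars down to a fixed n x n grid (H, W >= n)."""
    h, w = len(grid), len(grid[0])
    pooled = []
    for i in range(n):
        r0, r1 = i * h // n, -(-(i + 1) * h // n)  # floor start, ceil end
        row = []
        for j in range(n):
            c0, c1 = j * w // n, -(-(j + 1) * w // n)
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled

def content_aware_pe(grid, n=2):
    """Pool to n x n, then give each patch the code of its relative grid cell."""
    h, w = len(grid), len(grid[0])
    pooled = adaptive_avg_pool(grid, n)
    return [[pooled[r * n // h][c * n // w] for c in range(w)] for r in range(h)]

def upsample2x(grid):
    """Repeat every value 2x2 to simulate the same image at twice the size."""
    return [[v for v in row for _ in range(2)] for row in grid for _ in range(2)]

small = [[0, 2, 4, 6],
         [2, 4, 6, 8],
         [1, 3, 5, 7],
         [3, 5, 7, 9]]  # toy single-channel patch features
large = upsample2x(small)

pe_small = content_aware_pe(small)
pe_large = content_aware_pe(large)
```

Because the code is derived from pooled features on a fixed grid, the same relative location receives the same code at both resolutions, and the code changes when the image content changes.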

Results

As shown in Figure 4, compared with state-of-the-art methods, StyTr^2 uses a Transformer-based network with stronger feature representation ability, captures long-range dependencies in the input images, and avoids losing content and style details. As a result, it achieves high-quality stylization that preserves content structure while showing rich style patterns.


Figure 4 Comparison of stylized results

Figure 5 shows the stylized results after round 1 and round 20 of repeated stylization. First, compare the results of the first round. The content structure in the results of the CNN-based methods is damaged to varying degrees, whereas the results of this paper still have a clear content structure. The results generated by ArtFlow keep a clear content structure, but the stylization effect is unsatisfactory (e.g., edge defects and inappropriate style patterns). Second, as the number of stylization rounds increases, the content structure produced by the CNN-based methods becomes blurred, while the content structure produced by our method remains clear.


Figure 5 Comparison of multi-round stylization results

