Knowing Things by Learning | The Attack and Defense Between AI and Black-Market Groups: A Detailed Look at Detecting Adversarial Text Images

Introduction: As OCR systems have grown more capable, more and more black-market groups now attack OCR professionally. In this contest, how can AI defend against the adversarial text images these groups produce? This article shares common algorithms for training similarity features and walks through a few representative works, hoping to offer readers some useful ideas and methods.

Author | Deng Rui, Senior Computer Vision Algorithm Engineer, NetEase Yidun

1. Background

Text is a key information carrier on the Internet. By embedding WeChat IDs, phone numbers, and similar contact details in images, black-market groups funnel traffic toward advertising and pornography. To block the spread of such spam, OCR text recognition is a core component of content security review: it automatically recognizes the text in an image, spots abnormal text, and intercepts and filters harmful content.

As OCR systems have grown more capable, more and more black-market groups now attack OCR professionally. A typical attacker destroys image features by deleting strokes, adding occlusions, applying distortions, and so on, until the model can no longer recognize the text automatically. Once they craft characters the current system fails on, they generate them in batches and spread them through mass posting. To evade tracking, they continually produce malicious variants along the way. We have also observed that the same type of adversarial attack recurs across time and across customers.

So how do we solve this problem? Among existing options, one is a stronger OCR recognition system; the other is similar-image recognition. The problem with the former is that once the text features are severely damaged, the OCR system loses the basis for recognition and struggles to recall anything. The latter is usually applied to whole-image similarity, covering the people, text, background, and everything else in the image; when the text region is small and the background keeps changing (for example, swapping in photos of different attractive people), whole-image similarity matching tends to fail.

[Figure: a recent attack case; the attack intensity keeps increasing]

After in-depth analysis and technical practice, the NetEase Yidun algorithm team has built and launched a new generation of spam image recognition services centered on text feature analysis. The service comprises modules such as text attribute analysis, salient-region feature extraction, and feature similarity comparison. It is robust to the common variant types (occlusion, compression, rotation, background replacement, etc.), and combined with Yidun's multi-factor identification capability it enables precise control of spam images. The spam images shown above vary widely, yet after iterating the new service with only a small number of samples, the machine recall rate on new variants reaches 95%+.

Combining this with large-scale unsupervised learning, the Yidun algorithm team has built several high-precision, high-recall feature analysis models that judge feature similarity accurately. This article shares common algorithms for training similarity features and introduces a few representative works.

2. Algorithm Sharing

Similarity learning is essentially a contrastive learning task: a model learns certain invariant features and compresses image information into a high-dimensional feature space. There are two main families of methods: self-supervised learning and deep metric learning. The former can exploit massive unlabeled data at zero labeling cost, but it only learns invariances to a sample's own augmentations, so it tends to reject genuinely new variants. Metric learning can learn the similarity of a whole class of objects and is widely used in image retrieval, but the object categories must be labeled manually, which is slow and expensive. The two families are introduced in turn below.
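To make the feature-space idea concrete, here is a minimal sketch (in PyTorch, assuming some already-trained encoder `model`) of how compressed features might be compared at serving time; the 0.85 threshold is purely illustrative and would be tuned on real data:

```python
import torch
import torch.nn.functional as F

def embed(model, images):
    """Encode a batch of images into L2-normalized feature vectors."""
    with torch.no_grad():
        feats = model(images)            # (B, D) embeddings
    return F.normalize(feats, dim=1)     # unit length, so cosine = dot product

def is_similar(feat_a, feat_b, threshold=0.85):
    """Flag two images as variants of each other when their cosine
    similarity exceeds a tuned threshold (0.85 is a placeholder)."""
    return (feat_a * feat_b).sum(-1) >= threshold
```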

MoCo Series

MoCo [1] is a popular self-supervised learning algorithm whose goal is to learn a generalizable pre-trained feature by contrasting augmented views of the same image.

Before MoCo, the common contrastive learning framework was the end-to-end design shown in figure (a) below. The framework is simple: two augmentations of a sample are produced, one used as the query and one as the key; both pass through the same encoder, and a contrastive loss drives learning. In practice, however, this framework usually converges slowly, and because the batch size is bounded by GPU memory, the final converged quality is relatively poor.
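As a sketch of scheme (a), the following shows the in-batch contrastive (InfoNCE-style) loss this framework typically uses; `encoder` is a stand-in for any backbone, and the positives sit on the diagonal of the similarity matrix:

```python
import torch
import torch.nn.functional as F

def end_to_end_contrastive_loss(encoder, x1, x2, tau=0.2):
    """End-to-end scheme (a): both augmented views go through the SAME
    encoder, and negatives come from the current batch only."""
    q = F.normalize(encoder(x1), dim=1)                 # queries, (B, D)
    k = F.normalize(encoder(x2), dim=1)                 # keys,    (B, D)
    logits = q @ k.t() / tau                            # (B, B) similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives: diagonal
    return F.cross_entropy(logits, labels)              # InfoNCE loss
```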

Hence the improved version in figure (b). This method maintains a large memory bank of negative samples for training. The memory bank stores features encoded in earlier steps, which can be sampled directly for contrast; meanwhile, no gradients are propagated to the samples in the bank. The memory bank improved both the convergence speed and the converged quality of contrastive learning, and this module was adopted by many later papers.
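A minimal sketch of the memory-bank idea in scheme (b), assuming random sampling for negatives (real implementations differ in how entries are indexed and refreshed); note the stored features are detached, so no gradient flows into the bank:

```python
import torch
import torch.nn.functional as F

class MemoryBank:
    """Fixed-size bank of previously computed features; sampled entries
    act as negatives and receive no gradient."""
    def __init__(self, size, dim):
        self.feats = F.normalize(torch.randn(size, dim), dim=1)

    def sample(self, n):
        """Draw n stored features to use as negatives."""
        idx = torch.randint(0, self.feats.size(0), (n,))
        return self.feats[idx]

    def update(self, indices, new_feats):
        """Overwrite bank entries for the samples just encoded."""
        self.feats[indices] = F.normalize(new_feats.detach(), dim=1)
```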

MoCo differs from the previous two methods by introducing a momentum encoder and a dynamic dictionary queue, as shown in figure (c). Key features are pushed into the dictionary queue after passing through the momentum encoder, which is a separate encoder updated by momentum so that its features stay consistent. Compared with the memory-bank approach, the features in the dynamic dictionary are more stable and consistent, so the results are better. The experiments bear this out: at the same queue size, MoCo beats the memory bank by more than 2.6%.
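Below is a condensed, illustrative skeleton of scheme (c), not the official MoCo implementation (it omits details such as shuffling BN and distributed gathering): a momentum-updated key encoder plus a FIFO queue of negatives:

```python
import copy
import torch
import torch.nn.functional as F

class MoCoSketch(torch.nn.Module):
    """Minimal MoCo-style skeleton: momentum encoder + dictionary queue."""
    def __init__(self, encoder, dim=128, queue_len=65536, m=0.999, tau=0.07):
        super().__init__()
        self.enc_q = encoder
        self.enc_k = copy.deepcopy(encoder)       # momentum (key) encoder
        for p in self.enc_k.parameters():
            p.requires_grad = False               # updated by momentum only
        self.m, self.tau = m, tau
        self.register_buffer(
            "queue", F.normalize(torch.randn(queue_len, dim), dim=1))

    @torch.no_grad()
    def _momentum_update(self):
        for pq, pk in zip(self.enc_q.parameters(), self.enc_k.parameters()):
            pk.mul_(self.m).add_(pq, alpha=1.0 - self.m)

    def forward(self, x_q, x_k):
        q = F.normalize(self.enc_q(x_q), dim=1)   # (B, D), with gradient
        self._momentum_update()                   # slowly drag key encoder
        with torch.no_grad():
            k = F.normalize(self.enc_k(x_k), dim=1)  # (B, D), no gradient
        l_pos = (q * k).sum(-1, keepdim=True)     # (B, 1) positive logits
        l_neg = q @ self.queue.t()                # (B, K) negatives from queue
        logits = torch.cat([l_pos, l_neg], dim=1) / self.tau
        labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
        loss = F.cross_entropy(logits, labels)    # positive is class 0
        # enqueue the newest keys, dequeue the oldest (FIFO)
        self.queue = torch.cat([k, self.queue], dim=0)[: self.queue.size(0)]
        return loss
```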
[Figure: three contrastive learning frameworks: (a) end-to-end, (b) memory bank, (c) MoCo]

SwAV

SwAV [2], short for Swapping Assignments between multiple Views of the same image, performs self-supervised learning by swapping the label assignments of different views. Unlike MoCo and SimCLR [3], which contrast sample pairs directly, it is a novel self-supervised approach that does not rely on pairwise comparison. Combined with the multi-crop data augmentation proposed in the paper, it achieved the best results of its time.
[Figure. Left: self-supervised learning based on pairwise comparison. Right: SwAV's learning scheme]

SwAV's learned labels come from clustering. It maintains a set of K prototype vectors, which can be viewed as centers of the data (analogous to k-means centroids). After a sample is encoded by the encoder, its distances to the K centers are computed to obtain a soft code, which serves as the sample's label. During training, each sample is turned into two augmented views; each view passes through the encoder to get its own label, then hands that label to the other view, completing a swap; the swapped labels are then learned with a KL loss.
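A small sketch of this swapped-prediction step, assuming the soft codes `q1` and `q2` have already been produced (for example by the Sinkhorn step discussed below); each view is scored against the prototypes and trained to predict the other view's code:

```python
import torch
import torch.nn.functional as F

def swav_swapped_loss(z1, z2, prototypes, q1, q2, tau=0.1):
    """Swapped prediction: view 1 predicts view 2's code and vice versa.
    z1, z2:      L2-normalized features of the two views, (B, D)
    prototypes:  (K, D) trainable prototype matrix
    q1, q2:      soft codes for each view, (B, K), rows sum to 1
    """
    p1 = (z1 @ prototypes.t()) / tau   # view-1 scores against prototypes
    p2 = (z2 @ prototypes.t()) / tau
    # cross-entropy between each view's prediction and the OTHER view's code
    loss1 = -(q2 * F.log_softmax(p1, dim=1)).sum(1).mean()
    loss2 = -(q1 * F.log_softmax(p2, dim=1)).sum(1).mean()
    return 0.5 * (loss1 + loss2)
```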

We know that the quality of the labels strongly affects what a model can learn: the higher the label quality, the better the model learns. So how does SwAV obtain high-quality prototypes? The answer is online clustering: the prototypes in SwAV are trained continuously, always representing the model's center vectors on the current dataset. Conventionally, one would find a dataset's center vectors offline, running k-means over the entire dataset; this is the idea behind DeepCluster v2 (the improved variant described in [2]). SwAV instead maintains a small queue of samples and uses the Sinkhorn algorithm to force the samples in the queue to be distributed evenly across the centers. The optimization can be stated as follows: given a feature queue Z = [z_1, z_2, ..., z_B], map it onto the prototypes C = [c_1, c_2, ..., c_K]; the mapping is a matrix Q = [q_1, q_2, ..., q_B], and the objective is to maximize the similarity between the features and the prototypes. The Sinkhorn algorithm finds an iterative approximate solution, given by the formula below. Because the feature queue Z is short, usually a few thousand entries, the whole solve is very fast and adds little training overhead.
As given in [2], the objective is

$$\max_{Q \in \mathcal{Q}} \; \operatorname{Tr}\!\left(Q^{\top} C^{\top} Z\right) + \varepsilon H(Q)$$

where $H(Q) = -\sum_{ij} Q_{ij} \log Q_{ij}$ is an entropy regularizer and $\mathcal{Q}$ constrains $Q$ so that the queue's samples are spread evenly across the $K$ prototypes.
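An illustrative implementation of the Sinkhorn normalization loop that approximately solves this objective (the iteration count and epsilon below are typical but not canonical values):

```python
import torch

@torch.no_grad()
def sinkhorn(scores, eps=0.05, n_iters=3):
    """Sinkhorn-Knopp iterations: turn a (B, K) feature-prototype score
    matrix into soft codes Q whose mass is spread evenly over the K
    prototypes, preventing all samples from collapsing onto one center."""
    Q = torch.exp(scores / eps).t()   # (K, B)
    Q /= Q.sum()                      # normalize total mass to 1
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True); Q /= K   # rows: equal prototype mass
        Q /= Q.sum(dim=0, keepdim=True); Q /= B   # cols: one unit per sample
    return (Q * B).t()                # (B, K), each row a soft code
```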

The experiments show that SwAV comes very close to supervised training and outperforms SimCLR-style pairwise-comparison methods. Moreover, in terms of label assignment, SwAV's online training performs about as well as DeepCluster v2's offline training, and both results confirm the effectiveness of this label-swapping approach.

Multi-Similarity Loss

Deep metric learning has produced a series of pair-based losses, such as the contrastive loss and the triplet loss. The Multi-Similarity Loss paper [4] proposes a unified framework, General Pair Weighting (GPW), that explains these losses by casting them all as problems of weighting pair similarities. Under this framework the authors further identify three forms of similarity: self-similarity, positive relative similarity, and negative relative similarity. The first depends only on the pair itself; the latter two are relative similarities that depend on the other samples. Multi-Similarity Loss is a loss that takes all three into account at once.

So what exactly are these three similarities? The table below summarizes them.
[Table: the similarities covered by each loss; earlier methods each cover only a subset]

The "multi-similarity" in Multi-Similarity Loss shows up not only in the loss itself but also in how training samples are mined. By iterating between mining and weighting, the method keeps selecting the most informative pairs for training, with very good results. Mining selects the positive and negative pairs with the highest information content according to two criteria; weighting then assigns each selected pair a weight according to its similarity.
The final expression of MS Loss is

$$\mathcal{L}_{MS} = \frac{1}{m}\sum_{i=1}^{m}\left\{ \frac{1}{\alpha}\log\!\Big[1+\sum_{k\in\mathcal{P}_i} e^{-\alpha\,(S_{ik}-\lambda)}\Big] + \frac{1}{\beta}\log\!\Big[1+\sum_{k\in\mathcal{N}_i} e^{\beta\,(S_{ik}-\lambda)}\Big] \right\}$$

where $\mathcal{P}_i$ and $\mathcal{N}_i$ are the mined positive and negative pairs of anchor $i$, $S_{ik}$ is the pair similarity, and $\alpha$, $\beta$, $\lambda$ are hyperparameters.
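For reference, here is a sketch of MS Loss with its pair mining, written directly from the formula above (a naive per-anchor loop for clarity; the hyperparameter values are illustrative, not mandated):

```python
import torch

def multi_similarity_loss(feats, labels, alpha=2.0, beta=50.0,
                          lam=0.5, margin=0.1):
    """Sketch of Multi-Similarity Loss with pair mining.
    feats:  (B, D) L2-normalized embeddings
    labels: (B,) integer class ids
    """
    sim = feats @ feats.t()            # cosine similarity matrix S
    B = feats.size(0)
    loss = feats.new_zeros(())
    n_valid = 0
    for i in range(B):
        pos = labels == labels[i]
        pos[i] = False                 # exclude the anchor itself
        neg = labels != labels[i]
        pos_s, neg_s = sim[i][pos], sim[i][neg]
        if pos_s.numel() == 0 or neg_s.numel() == 0:
            continue
        # mining: keep negatives harder than the easiest positive, and
        # positives harder than the hardest negative (with a margin)
        hard_neg = neg_s[neg_s + margin > pos_s.min()]
        hard_pos = pos_s[pos_s - margin < neg_s.max()]
        if hard_neg.numel() == 0 or hard_pos.numel() == 0:
            continue
        # weighting happens implicitly via the soft log-sum-exp terms
        pos_term = torch.log1p(torch.exp(-alpha * (hard_pos - lam)).sum()) / alpha
        neg_term = torch.log1p(torch.exp(beta * (hard_neg - lam)).sum()) / beta
        loss = loss + pos_term + neg_term
        n_valid += 1
    return loss / max(n_valid, 1)
```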

3. Summary

Training similarity features at scale is inseparable from self-supervised learning, which can learn a strong feature model at low cost. But relying on self-supervision alone also confines the model to the world of the "self", potentially reducing generalization, which shows up as low recall in online applications. How to solve this is another new topic. Yidun has explored an algorithm scheme that combines active labeling with metric learning, gradually guiding the model to learn latent neighbor relationships, and has achieved solid results. Going forward, Yidun will keep tracking the latest methods from industry and academia and provide more solutions for this kind of anti-spam governance.

References

[1] MoCo: Momentum Contrast for Unsupervised Visual Representation Learning
[2] SwAV: Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
[3] SimCLR: A Simple Framework for Contrastive Learning of Visual Representations
[4] MS Loss: Multi-Similarity Loss with General Pair Weighting for Deep Metric Learning

About the author: Deng Rui, Senior Computer Vision Algorithm Engineer at NetEase Yidun, is responsible for developing and deploying OCR and audio/video algorithms in the content security domain.
