GroupViT: Semantic Segmentation Emerges from Text Supervision

GroupViT's supervision signal comes from text. It does not rely on manually annotated segmentation masks. Instead, like CLIP, it trains on image-text pairs without mask labels, and the resulting model can perform simple segmentation tasks. On the visual side, the method used for this unsupervised segmentation is grouping: start from a set of cluster centers, spread outward from each center, and gradually absorb nearby similar points into a group. The resulting group is effectively a segmentation mask. This is a bottom-up approach. The authors revisit the grouping idea and propose the grouping block, so that during learning the model merges adjacent pixels step by step until they form a segmentation mask, as shown in the figure. In the early layers the group tokens segment poorly; once the model is trained, the segmentation produced by the group tokens in the deeper layers is quite good. GroupViT adds this grouping block to the ViT framework, together with learnable group tokens.
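To make the "a group is a segmentation mask" idea concrete, here is a minimal NumPy sketch. It is only an illustration under stated assumptions, not the authors' implementation: the features are random stand-ins for learned tokens, the 14 x 14 patch grid is assumed (it is one way to get 196 patches), and a plain dot product replaces the learned similarity.

```python
import numpy as np

rng = np.random.default_rng(0)

num_patches, num_groups, dim = 196, 8, 384   # sizes taken from the text
grid = 14                                    # assumption: 14 x 14 patch grid (14 * 14 = 196)

# Random stand-ins for learned features (hypothetical data, for illustration only).
patch_tokens = rng.normal(size=(num_patches, dim))
group_tokens = rng.normal(size=(num_groups, dim))

# Similarity between every patch and every group token.
similarity = patch_tokens @ group_tokens.T   # shape (196, 8)

# Each patch joins the group it is most similar to.
assignment = similarity.argmax(axis=1)       # shape (196,)

# Laid out on the patch grid, the group ids form a coarse segmentation mask.
mask = assignment.reshape(grid, grid)        # shape (14, 14)
print(mask.shape)
```

Each entry of `mask` is the id of the group that patch belongs to, which is why assigning patches to groups amounts to segmenting the image.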

Like a Transformer, the model has a hierarchical structure. The grouping block uses a mechanism similar to self-attention: it first computes a similarity matrix, then uses that matrix to assign the original image tokens to cluster centers, reducing the input from 196x384 to 64x384. Assigning a token to a cluster center is not differentiable, so Gumbel softmax is used to make it differentiable. This completes the first grouping stage, which shortens the sequence from 196+64 tokens to 64 tokens (64x384).

In a typical dataset, a single picture does not contain many categories, so the 64 cluster centers can be reduced further to 8, merging blocks that belong to similar categories. To do this, 8 new group tokens are added, giving 8x384. The picture above shows two newly added group tokens. The authors place another grouping block after the ninth Transformer layer; after three Transformer layers of learning, the 8 newly added group tokens are also well learned. This second grouping block assigns the 64 segment tokens to the 8 group tokens (a sequence of 64+8), so the image is divided into 8 blocks, each with its own feature.
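The two grouping stages described above can be sketched in NumPy as follows. This is a simplified forward pass under assumptions, not the authors' code: the tokens are random stand-ins, the helper names `gumbel_softmax_hard` and `grouping_block` are hypothetical, the similarity is a plain dot product, and each new token is just the mean of the tokens assigned to it. NumPy has no autograd, so the gradient side of Gumbel softmax is only noted in a comment.

```python
import numpy as np

def gumbel_softmax_hard(logits, rng, tau=1.0):
    """Forward pass of a hard Gumbel-softmax over the last axis."""
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))                 # Gumbel(0, 1) noise
    y = (logits + gumbel) / tau
    y = y - y.max(axis=-1, keepdims=True)        # numerical stability
    soft = np.exp(y) / np.exp(y).sum(axis=-1, keepdims=True)
    hard = np.zeros_like(soft)
    np.put_along_axis(hard, soft.argmax(axis=-1)[..., None], 1.0, axis=-1)
    # In an autograd framework one would typically return
    # hard + soft - stop_gradient(soft), so gradients flow through `soft`.
    return soft, hard

def grouping_block(tokens, group_tokens, rng):
    """Assign each input token to one group token and average per group."""
    logits = tokens @ group_tokens.T             # (num_tokens, num_groups)
    _, hard = gumbel_softmax_hard(logits, rng)   # one-hot rows
    counts = hard.sum(axis=0)[:, None]           # tokens per group
    merged = hard.T @ tokens / np.maximum(counts, 1.0)
    return merged, hard

rng = np.random.default_rng(0)
dim = 384
patches = rng.normal(size=(196, dim))            # image tokens (stand-ins)
groups_1 = rng.normal(size=(64, dim))            # first set of group tokens
groups_2 = rng.normal(size=(8, dim))             # second set of group tokens

stage1, assign1 = grouping_block(patches, groups_1, rng)   # 196+64 -> 64
stage2, assign2 = grouping_block(stage1, groups_2, rng)    # 64+8  -> 8
print(stage1.shape, stage2.shape)
```

The shapes follow the numbers in the text: 196 image tokens are merged into 64 segment tokens, which are then merged into 8. In the real model, Transformer layers sit between the two grouping blocks; they are omitted here to keep the sketch short.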

Origin blog.csdn.net/u012193416/article/details/130271025