【SOT】SiamFC Code Notes

Code source: https://github.com/huanglianghua/siamfc-pytorch
Combined with three blog posts that explain the siamfc-pytorch code.
The following are notes taken while reading the SiamFC code above.

Data preprocessing

dataset.py

Summary: given an index, __getitem__ builds item = (z, x, box_z, box_x) and, after applying the transforms, returns a pair (z, x)

Steps (see the sketch after this list):

  1. Obtain the video sequence's image paths, annotations, and meta info (optional) according to the index
  2. Filter out noisy frames to get the valid indices
  3. If there are more than 2 valid indices, randomly select two of them (at most T = 100 frames apart)
  4. The images at those two indices become z and x respectively (converted to RGB)
  5. The boxes of those two images become box_z and box_x respectively
  6. Apply the transforms to (z, x, box_z, box_x); the output is the pair (z, x)
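
A minimal sketch of __getitem__ following these steps. Here seqs is assumed to behave like a GOT-10k-style sequence list yielding (img_files, anno), and _filter plus the pair-sampling logic are simplified placeholders, not the exact repo code:

```python
import cv2
import numpy as np

class PairDataset:
    """Illustrative pair dataset: returns one transformed (z, x) pair per index."""

    def __init__(self, seqs, transforms, max_interval=100):
        self.seqs = seqs                  # per-video image paths + annotations
        self.transforms = transforms      # SiamFC transforms (see transforms.py)
        self.max_interval = max_interval  # T = 100 in step 3

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, index):
        img_files, anno = self.seqs[index % len(self.seqs)]  # step 1
        val_indices = self._filter(img_files, anno)          # step 2

        if len(val_indices) > 2:
            # step 3: pick two frames at most max_interval apart
            rand_z = np.random.choice(val_indices)
            near = val_indices[np.abs(val_indices - rand_z) <= self.max_interval]
            rand_x = np.random.choice(near)
        else:
            rand_z = rand_x = val_indices[0]

        # step 4: read the two frames and convert them to RGB
        z = cv2.cvtColor(cv2.imread(img_files[rand_z]), cv2.COLOR_BGR2RGB)
        x = cv2.cvtColor(cv2.imread(img_files[rand_x]), cv2.COLOR_BGR2RGB)
        box_z, box_x = anno[rand_z], anno[rand_x]            # step 5

        # step 6: the transforms turn (z, x, box_z, box_x) into a tensor pair (z, x)
        return self.transforms(z, x, box_z, box_x)

    def _filter(self, img_files, anno):
        # placeholder: the real code drops frames with tiny or degenerate boxes
        return np.arange(len(img_files))
```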

transform.py

  • Input: (z, x, box_z, box_x)
  • Output: (z, x)

Steps (the two pipelines are sketched in code after this list):

  1. Do the following for both z and x:
    1. Convert the box to the format [y, x, h, w]
    2. Crop a patch centered at the box center and slightly larger than the box, then resize it to 255 x 255, padding with the mean color if necessary
  2. For z:
    1. Randomly resize (scale factor 0.95 ~ 1.05)
    2. Crop a block of size 255 - 8 from the middle, padding with the mean color if the image is too small
    3. Randomly crop a block of size 255 - 2*8
    4. Crop a block of size 127 from the middle, padding with the mean color if the image is too small
    5. Convert to a tensor
  3. For x:
    1. Randomly resize (scale factor 0.95 ~ 1.05)
    2. Crop a block of size 255 - 8 from the middle, padding with the mean color if the image is too small
    3. Randomly crop a block of size 255 - 2*8
    4. Convert to a tensor
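
A self-contained numpy/OpenCV sketch of the two pipelines. The repo composes its own RandomStretch / CenterCrop / RandomCrop / ToTensor ops; the helpers below only approximate that behavior, and patch is assumed to be the ~255 x 255 crop produced in step 1.2:

```python
import cv2
import numpy as np
import torch

def _crop_with_mean_pad(img, top, left, size):
    """Crop img[top:top+size, left:left+size]; parts of the window that fall
    outside the image are filled with the mean color (illustrative helper)."""
    avg = img.mean(axis=(0, 1))
    out = np.tile(avg, (size, size, 1)).astype(img.dtype)
    y0, x0 = max(top, 0), max(left, 0)
    y1 = min(top + size, img.shape[0])
    x1 = min(left + size, img.shape[1])
    out[y0 - top:y1 - top, x0 - left:x1 - left] = img[y0:y1, x0:x1]
    return out

def center_crop(img, size):
    h, w = img.shape[:2]
    return _crop_with_mean_pad(img, (h - size) // 2, (w - size) // 2, size)

def random_crop(img, size):
    h, w = img.shape[:2]
    top = np.random.randint(min(h - size, 0), max(h - size, 0) + 1)
    left = np.random.randint(min(w - size, 0), max(w - size, 0) + 1)
    return _crop_with_mean_pad(img, top, left, size)

def random_stretch(img, max_stretch=0.05):
    scale = 1.0 + np.random.uniform(-max_stretch, max_stretch)  # 0.95 ~ 1.05
    return cv2.resize(img, None, fx=scale, fy=scale)

def transform_z(patch):                     # patch: ~255 x 255 crop around box_z
    p = random_stretch(patch)               # step 2.1
    p = center_crop(p, 255 - 8)             # step 2.2
    p = random_crop(p, 255 - 2 * 8)         # step 2.3
    p = center_crop(p, 127)                 # step 2.4
    return torch.from_numpy(p).permute(2, 0, 1).float()   # step 2.5

def transform_x(patch):                     # patch: ~255 x 255 crop around box_x
    p = random_stretch(patch)               # step 3.1
    p = center_crop(p, 255 - 8)             # step 3.2
    p = random_crop(p, 255 - 2 * 8)         # step 3.3
    return torch.from_numpy(p).permute(2, 0, 1).float()   # step 3.4
```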

The effect after processing (see the figure in the original post).

siamfc.py (training + testing)

train_over (training process)

  1. Set the network to training mode
  2. DataLoader: __getitem__ returns a pair (z, x); after batching, all z are stacked into one tensor and all x into another, rather than each (z, x) pair being kept together
  3. Loop over epochs and the dataloader, calling train_step(batch, backward=True); the loss is a binary cross entropy loss (see the sketch after this list)
  4. Save a checkpoint
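
A rough sketch of this loop. The tracker object, batch/epoch counts, and checkpoint path are illustrative; the repo wraps this logic in its tracker's train_over method:

```python
import os
import torch
from torch.utils.data import DataLoader

def train_over(tracker, dataset, epochs=50, batch_size=8, save_dir='checkpoints'):
    # step 1: training mode
    tracker.net.train()
    # step 2: the DataLoader stacks all z into one [B, 3, 127, 127] tensor and
    # all x into one [B, 3, 239, 239] tensor, not each (z, x) pair together
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True,
                        num_workers=4, pin_memory=True, drop_last=True)
    os.makedirs(save_dir, exist_ok=True)

    for epoch in range(epochs):
        # step 3: loop over batches; the loss inside train_step is a BCE-style loss
        for it, batch in enumerate(loader):
            loss = tracker.train_step(batch, backward=True)
            print(f'Epoch {epoch + 1} [{it + 1}/{len(loader)}]: loss {loss:.5f}')
        # step 4: save a checkpoint after every epoch
        torch.save(tracker.net.state_dict(),
                   os.path.join(save_dir, f'siamfc_alexnet_e{epoch + 1}.pth'))
```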

train_step

Input: (z, x)
z: torch.Size([8, 3, 127, 127])
x: torch.Size([8, 3, 239, 239])

      x            z
      |            |
  backbone     backbone
       \          /
          head
           |
       responses
    ([8, 1, 15, 15])    _create_labels
            \            /
             \          /
               loss
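
In code, the diagram corresponds roughly to the following sketch. It uses plain BCE-with-logits for brevity; the repo actually uses a balanced BCE loss, and net.backbone / net.head follow the structure shown above:

```python
import torch.nn.functional as F

def train_step(net, optimizer, batch, labels, backward=True):
    """One step: batch = (z, x), z is [8, 3, 127, 127] and x is [8, 3, 239, 239]."""
    z, x = batch[0], batch[1]
    responses = net.head(net.backbone(z), net.backbone(x))   # [8, 1, 15, 15]
    # labels come from _create_labels and have the same shape as responses
    loss = F.binary_cross_entropy_with_logits(responses, labels)
    if backward:
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()
```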

_create_labels

  • Both the exemplar image z and the search image x are centered on the target, so the labels are 1 in a small region around the center and 0 everywhere else.
  • The label map for one channel of one sample in a batch looks like the sketch below:
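
A minimal numpy sketch of such a label map, assuming a 15 x 15 response, total stride 8, and a positive radius of 16 input pixels (so the radius is 2 response cells); the repo measures distance with the block (L1) distance:

```python
import numpy as np

def create_label(size=15, total_stride=8, r_pos=16):
    """One channel of the training label: 1 inside a small central region, 0 outside."""
    r = r_pos / total_stride                    # positive radius in response-map cells
    coords = np.arange(size) - (size - 1) / 2   # cell coordinates relative to the center
    xx, yy = np.meshgrid(coords, coords)
    dist = np.abs(xx) + np.abs(yy)              # block (L1) distance to the center
    return np.where(dist <= r, 1.0, 0.0)

print(create_label())  # 15 x 15 map with a small diamond of ones in the middle
```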

track (test process)

  • Input: the sequence of image paths of the video and the target box in the first frame

init(self, img, box)

  • Takes the first frame's image and box, initializes some parameters, and computes the center and size of the search region, etc.
  • @torch.no_grad()
  • Preprocessing is similar to the training process
  • Steps (a sketch of init follows this list):
    1. Set eval mode
    2. Convert the box to the format [y, x, h, w], with [y, x] being the center point
    3. Create a 272 x 272 Hanning window and normalize it
    4. Set 3 scale factors: 1.0375 ** (-1, 0, 1)
    5. Set z_sz slightly larger than the box size (context padding), and x_sz = z_sz * 255 / 127. The side length x_sz of the search region before resizing is therefore roughly 4 times the target size, as stated in the paper: "we only search for the object within a region of approximately four times its previous size"
    6. Only for z: crop a patch centered at the box center and slightly larger than the box (size z_sz), resize it to 127 x 127, padding with the mean color if necessary
    7. Only for z: feed it to the backbone; the output is stored as self.kernel
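
A sketch of this initialization, written in the shape of the tracker's init method. The crop_and_resize helper approximates the repo's mean-padded crop; exact attribute names and constants may differ slightly:

```python
import cv2
import numpy as np
import torch

def crop_and_resize(img, center, size, out_size, border_value):
    """Crop a size x size patch centered at center = [y, x], padding with the
    mean color where it leaves the image, then resize to out_size (sketch)."""
    size = int(round(size))
    corners = np.round(np.concatenate([
        center - (size - 1) / 2,
        center - (size - 1) / 2 + size])).astype(int)
    npad = max(0, int(np.concatenate([-corners[:2],
                                      corners[2:] - img.shape[:2]]).max()))
    if npad > 0:
        img = cv2.copyMakeBorder(img, npad, npad, npad, npad,
                                 cv2.BORDER_CONSTANT,
                                 value=tuple(float(c) for c in border_value))
    corners = corners + npad
    patch = img[corners[0]:corners[2], corners[1]:corners[3]]
    return cv2.resize(patch, (out_size, out_size))

@torch.no_grad()
def init(self, img, box):
    self.net.eval()                                        # step 1
    # step 2: [left, top, w, h] -> center [y, x] and size [h, w]
    box = np.array([
        box[1] - 1 + (box[3] - 1) / 2,
        box[0] - 1 + (box[2] - 1) / 2,
        box[3], box[2]], dtype=np.float32)
    self.center, self.target_sz = box[:2], box[2:]

    # step 3: 272 x 272 Hanning window (16x upsampling of the 17 x 17 response)
    self.upscale_sz = 16 * 17
    self.hann_window = np.outer(np.hanning(self.upscale_sz),
                                np.hanning(self.upscale_sz))
    self.hann_window /= self.hann_window.sum()

    # step 4: three scale factors 1.0375 ** (-1, 0, 1)
    self.scale_factors = 1.0375 ** np.linspace(-1, 1, 3)

    # step 5: context-padded exemplar size; search size is roughly 4x the target
    context = 0.5 * (self.target_sz[0] + self.target_sz[1])
    self.z_sz = np.sqrt(np.prod(self.target_sz + context))
    self.x_sz = self.z_sz * 255 / 127

    # steps 6-7: crop the exemplar around the target, resize to 127, embed it once
    self.avg_color = np.mean(img, axis=(0, 1))
    z = crop_and_resize(img, self.center, self.z_sz,
                        out_size=127, border_value=self.avg_color)
    z = torch.from_numpy(z).to(torch.float32).permute(2, 0, 1).unsqueeze(0)
    self.kernel = self.net.backbone(z)
```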

update(self, img)

  • Takes a subsequent frame and returns the target box predicted by the SiamFC network; drawing these boxes frame by frame gives the demo effect
  • @torch.no_grad()
  • Steps (the peak-to-box arithmetic is sketched after this list):
    1. Set eval mode
    2. Only for x: crop patches centered at the current target center (set in init and updated every frame), with side length x_sz * f for each scale factor f, resize each to 255 x 255 (padding with the mean color if necessary), and stack the patches for the different scale factors into one batch
    3. Only for x: feed the batch to the backbone
    4. Only for x: feed self.kernel (the feature of z computed in init) and the features of x from the previous step to the head to get responses
    5. Upsample the responses to size [3, 272, 272]
    6. Penalize the first and third channels of responses (the two non-middle scales) by multiplying them by 0.9745, because the middle scale leaves the size unchanged while the other two shrink or enlarge it
    7. Select the channel with the largest response (its index is scale_id), normalize it, blend it with the Hanning window (weights 0.824 and 0.176 respectively), and find the peak via np.unravel_index
    8. Map the peak position in the response map back to a position in the original image img
    9. Update target_sz, z_sz, and x_sz according to the chosen scale factor; if the target grows, all three grow by the same ratio
    10. Convert the position from step 8 to the format [left, top, w, h] and return it as the box
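
The peak-to-box arithmetic in steps 7-10 can be sketched as follows. This is a hedged reconstruction: response_up = 16, total_stride = 8, instance_sz = 255, and window_influence = 0.176 follow the notes above, while scale_lr and the exact return convention are assumptions:

```python
import numpy as np

def locate_target(responses, hann_window, center, target_sz, z_sz, x_sz,
                  scale_factors, scale_id, response_up=16, total_stride=8,
                  instance_sz=255, window_influence=0.176, scale_lr=0.59):
    """Map the chosen (already upsampled) response map to a box in the image.
    scale_lr is an assumed smoothing factor for the scale update."""
    upscale_sz = responses.shape[-1]                   # 272 = 16 * 17
    response = responses[scale_id]

    # step 7: normalize, blend with the Hanning window, find the peak
    response = response - response.min()
    response = response / (response.sum() + 1e-16)
    response = (1 - window_influence) * response + window_influence * hann_window
    loc = np.unravel_index(response.argmax(), response.shape)

    # step 8: peak displacement in the response map -> in the 255 patch -> in img
    disp_in_response = np.array(loc) - (upscale_sz - 1) / 2
    disp_in_instance = disp_in_response * total_stride / response_up
    disp_in_image = disp_in_instance * x_sz * scale_factors[scale_id] / instance_sz
    center = center + disp_in_image

    # step 9: smoothed scale update; target_sz, z_sz, x_sz change by the same ratio
    scale = (1 - scale_lr) + scale_lr * scale_factors[scale_id]
    target_sz, z_sz, x_sz = target_sz * scale, z_sz * scale, x_sz * scale

    # step 10: [center_y, center_x, h, w] -> [left, top, w, h]
    box = np.array([
        center[1] + 1 - (target_sz[1] - 1) / 2,
        center[0] + 1 - (target_sz[0] - 1) / 2,
        target_sz[1], target_sz[0]])
    return box, center, target_sz, z_sz, x_sz
```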


Original post: blog.csdn.net/zylooooooooong/article/details/123192744