Code source: https://github.com/huanglianghua/siamfc-pytorch
together with three blog posts explaining the siamfc-pytorch code
The following are notes taken while reading the SiamFC code above
Data preprocessing
dataset.py
Summary: given an index, __getitem__ builds item = (z, x, box_z, box_x), and after the transforms it returns a pair (z, x)
Steps:
- Obtain the video sequence's image paths, annotations, and meta info (optional) according to the index
- Filter out noisy images to get the valid frame indices
- If there are more than 2 valid indices, randomly pick two of them whose interval does not exceed T = 100
- The images at those two indices become z and x respectively (converted to RGB format)
- The boxes of those two images become box_z and box_x
- Apply the transforms to (z, x, box_z, box_x); the output is (z, x) (sketched below)
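A minimal sketch of this __getitem__ logic, assuming a seqs list of (img_files, anno) pairs per video and treating _filter as a simplified stand-in for the repo's noisy-frame filter:

```python
import cv2
import numpy as np

class Pair:
    def __init__(self, seqs, transforms=None, max_interval=100):
        self.seqs = seqs                  # list of (img_files, anno) per video
        self.transforms = transforms
        self.max_interval = max_interval  # the T = 100 above

    def _filter(self, anno):
        # simplified stand-in: keep frames whose box has positive area
        return np.where(anno[:, 2] * anno[:, 3] > 0)[0]

    def __getitem__(self, index):
        img_files, anno = self.seqs[index % len(self.seqs)]
        val_indices = self._filter(anno)
        # sample two valid frame indices at most max_interval apart
        rand_z = np.random.choice(val_indices)
        near = val_indices[np.abs(val_indices - rand_z) <= self.max_interval]
        rand_x = np.random.choice(near)
        # read both frames and convert BGR -> RGB
        z = cv2.cvtColor(cv2.imread(img_files[rand_z]), cv2.COLOR_BGR2RGB)
        x = cv2.cvtColor(cv2.imread(img_files[rand_x]), cv2.COLOR_BGR2RGB)
        box_z, box_x = anno[rand_z], anno[rand_x]
        item = (z, x, box_z, box_x)
        if self.transforms is not None:
            item = self.transforms(*item)  # -> (z, x) tensors
        return item
```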
transform.py
- Input: (z, x, box_z, box_x)
- Output: (z, x)
Steps:
- Do the following for both z and x:
  - Convert the box format to [y, x, h, w]
  - Crop a patch centered on the box center and somewhat larger than the box, resize it to 255, and pad with the mean color if necessary
- For z only:
  - Random stretch (resize by a factor in [0.95, 1.05])
  - Center-crop a 255-8 = 247 block, padding with the mean color if the image is too small
  - Randomly crop a block of size 255-2*8 = 239
  - Center-crop a 127 block, padding with the mean color if needed
  - Convert to tensor
- For x only:
  - Random stretch (resize by a factor in [0.95, 1.05])
  - Center-crop a 255-8 = 247 block, padding with the mean color if the image is too small
  - Randomly crop a block of size 255-2*8 = 239
  - Convert to tensor
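The two pipelines correspond to a composition like the following (a sketch using the transform classes defined in the repo's transforms.py; assumes the siamfc package is importable):

```python
from siamfc.transforms import (Compose, RandomStretch, CenterCrop,
                               RandomCrop, ToTensor)

exemplar_sz, instance_sz = 127, 255
transforms_z = Compose([
    RandomStretch(),                   # random resize, factor in [0.95, 1.05]
    CenterCrop(instance_sz - 8),       # 247x247 from the middle, mean-pad
    RandomCrop(instance_sz - 2 * 8),   # random 239x239 crop
    CenterCrop(exemplar_sz),           # final 127x127 exemplar, mean-pad
    ToTensor()])
transforms_x = Compose([
    RandomStretch(),
    CenterCrop(instance_sz - 8),
    RandomCrop(instance_sz - 2 * 8),   # x stays 239x239
    ToTensor()])
```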
The effect after processing: (figure omitted: example z and x crops after the pipeline)
siamfc.py (training + testing)
train_over (training process)
- set to training mode
- Build the dataloader. __getitem__ returns a pair (z, x); after collation by the dataloader, all z are stacked together and all x are stacked together, i.e. the batch is (z_batch, x_batch), not bound (z, x) pairs stacked together
- Loop over the epochs and the dataloader, calling train_step(batch, backward=True); the loss is binary cross-entropy
- save checkpoint
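A sketch of the train_over loop, assuming a tracker object with a .net module and a train_step(batch, backward=True) method as described above (optimizer/scheduler setup and the checkpoint filename are illustrative):

```python
import torch
from torch.utils.data import DataLoader

def train_over(tracker, dataset, epochs=50, batch_size=8, save_dir='checkpoints'):
    tracker.net.train()                      # training mode
    loader = DataLoader(dataset, batch_size=batch_size,
                        shuffle=True, drop_last=True)
    for epoch in range(epochs):
        for it, batch in enumerate(loader):
            # batch = (z_batch, x_batch): default collation stacks all z
            # together and all x together, not bound (z, x) pairs
            loss = tracker.train_step(batch, backward=True)
            print(f'Epoch {epoch + 1} [{it + 1}/{len(loader)}]: loss {loss:.5f}')
        # save a checkpoint after each epoch
        torch.save(tracker.net.state_dict(),
                   f'{save_dir}/siamfc_alexnet_e{epoch + 1}.pth')
```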
train_step
Input: (z, x)
z: torch.Size([8, 3, 127, 127])
x: torch.Size([8, 3, 239, 239])
      x                  z
      |                  |
  backbone           backbone
       \                /
             head
              |
          responses          _create_labels
      ([8, 1, 15, 15])            /
               \                 /
                     loss
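A sketch of train_step under these shapes; the repo wraps the loss in a balanced BCE criterion, written here directly as binary_cross_entropy_with_logits for brevity:

```python
import torch
import torch.nn.functional as F

def train_step(self, batch, backward=True):
    self.net.train(backward)
    z = batch[0].to(self.device)            # [8, 3, 127, 127]
    x = batch[1].to(self.device)            # [8, 3, 239, 239]
    with torch.set_grad_enabled(backward):
        # shared-weight backbone on both inputs, cross-correlation head
        responses = self.net(z, x)          # [8, 1, 15, 15]
        labels = self._create_labels(responses.size())
        # binary cross-entropy between response scores and the label map
        loss = F.binary_cross_entropy_with_logits(responses, labels)
        if backward:
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
    return loss.item()
```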
_create_labels
- Both the exemplar image z and the search image x are centered on the target, so the label map is 1 near its center and 0 elsewhere.
- The label for one channel of one batch element can be inspected as in the sketch below:
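A numpy sketch of _create_labels following the notes above (the repo's defaults are r_pos=16, r_neg=0, total_stride=8; distance is block distance):

```python
import numpy as np

def create_labels(size, r_pos=16, r_neg=0, total_stride=8):
    n, c, h, w = size
    x = np.arange(w) - (w - 1) / 2
    y = np.arange(h) - (h - 1) / 2
    x, y = np.meshgrid(x, y)
    dist = np.abs(x) + np.abs(y)        # block distance from the center
    r_pos /= total_stride               # radii are given in input-image pixels
    r_neg /= total_stride
    # 1 within r_pos of the center, 0 elsewhere (0.5 band unused by default)
    labels = np.where(dist <= r_pos, 1.0,
                      np.where(dist < r_neg, 0.5, 0.0))
    # the same map tiled over the batch and channel dimensions
    return np.tile(labels.reshape(1, 1, h, w), (n, c, 1, 1))

# the label for one channel of one batch element, a [15, 15] map
print(create_labels((8, 1, 15, 15))[0, 0])
```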
track (test process)
- Input: the sequence of image paths of the video and the target's position in the first frame
init(self, img, box)
- Pass in the first frame's image and box; initialize parameters, compute the center and size of the search area, etc.
- @torch.no_grad()
- Preprocessing is similar to the training process
- Steps:
- Set eval mode
- Convert the box format to [y, x, h, w], where [y, x] is the center point
- Create a Hanning window of size 272*272 and normalize
- Set 3 scale factors: 1.0375**(-1,0,1)
- Set z_sz slightly larger than the box size, and x_sz = z_sz * 255 / 127. So the side length x_sz of the search image before resizing is about 4 times target_sz, as stated in the paper:
we only search for the object within a region of approximately four times its previous size
- For z only: crop a patch centered on the box center with size z_sz (slightly larger than the box), resize it to 127, and pad with the mean color if necessary
- For z only: feed it to the backbone; the output is cached as self.kernel
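A condensed sketch of init; ops.crop_and_resize is the repo's crop-with-mean-padding helper, and its exact keyword names are an assumption here:

```python
import numpy as np
import torch
from siamfc import ops

@torch.no_grad()
def init(self, img, box):
    self.net.eval()                                   # eval mode
    # box [left, top, w, h] -> center [y, x] and size [h, w]
    box = np.array([
        box[1] - 1 + (box[3] - 1) / 2,
        box[0] - 1 + (box[2] - 1) / 2,
        box[3], box[2]], dtype=np.float32)
    self.center, self.target_sz = box[:2], box[2:]
    # 272 = response_up (16) * response_sz (17); normalized Hanning window
    self.upscale_sz = 16 * 17
    self.hann_window = np.outer(np.hanning(self.upscale_sz),
                                np.hanning(self.upscale_sz))
    self.hann_window /= self.hann_window.sum()
    # three scale factors 1.0375 ** (-1, 0, 1)
    self.scale_factors = 1.0375 ** np.linspace(-1, 1, 3)
    # z_sz slightly larger than the target; x_sz ~ 4x the target size
    context = 0.5 * np.sum(self.target_sz)
    self.z_sz = np.sqrt(np.prod(self.target_sz + context))
    self.x_sz = self.z_sz * 255 / 127
    # crop the exemplar around the target, resize to 127, pad with mean color
    self.avg_color = np.mean(img, axis=(0, 1))
    z = ops.crop_and_resize(img, self.center, self.z_sz, out_size=127,
                            border_value=self.avg_color)
    z = torch.from_numpy(z).to(self.device).permute(2, 0, 1).unsqueeze(0).float()
    self.kernel = self.net.backbone(z)                # cached template feature
```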
update(self, img)
- Pass in each subsequent frame; the tracker returns the target's box computed by the SiamFC network, which can then be drawn on the frame for a demo effect
- @torch.no_grad()
- Steps:
- Set eval mode
- For x only: crop patches centered on the center computed in init, with size x_sz * f for each scale factor f, resize each to 255 (padding with the mean color if necessary), and stack the patches for the different scale factors into one batch
- For x only: feed the batch to the backbone
- For x only: feed self.kernel (the feature of z obtained in init) and the x features from the previous step into the head to get the responses
- Upsample the responses to a size of [3, 272, 272]
- Penalize the first and third channels of the responses (the two scales other than the middle one) by multiplying by 0.9745: the middle scale corresponds to no scale change, while the other two shrink or enlarge the target, so they get a slight penalty
- Pick the channel with the largest response (its index is scale_id), normalize it, blend it with the Hanning window (weights 0.824 and 0.176 respectively), and find the peak via np.unravel_index
- Map the peak found in the response map back to its position in the original image img (sketched after this list)
- Update target_sz, z_sz, x_sz according to the chosen scale factor; if the target grows, all three grow by the same ratio
- Convert the position obtained above into [left, top, w, h] format
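A sketch of the window blending and the peak-to-image mapping in update, with the constants from the notes above (upscale_sz 272, total_stride 8, response_up 16, instance size 255, window weight 0.176); locate_target is a hypothetical helper name:

```python
import numpy as np

def locate_target(response, center, x_sz, scale_factor,
                  hann_window, window_influence=0.176):
    # normalize the chosen scale's response map, then blend with the window
    response -= response.min()
    response /= response.sum() + 1e-16
    response = (1 - window_influence) * response + \
        window_influence * hann_window
    # peak location in the 272x272 upsampled response map
    loc = np.unravel_index(response.argmax(), response.shape)
    # displacement from the map center, mapped back to image pixels:
    # response cell -> instance pixels (stride 8, upsampling 16), then
    # instance pixels -> image pixels (search region x_sz*f vs. 255 input)
    disp_in_response = np.array(loc) - (272 - 1) / 2
    disp_in_instance = disp_in_response * 8 / 16
    disp_in_image = disp_in_instance * x_sz * scale_factor / 255
    return center + disp_in_image  # new [y, x] center in the original frame
```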