While reading papers on object detection networks, I kept running into a pair of contrasting terms: bottom-up and top-down. After looking up some material and combining it with my own understanding, here is how I see them:
Top-down: as the name implies, detection proceeds from top to bottom, from the whole object to its parts. The idea comes from pedestrian pose estimation: the pedestrian is first detected to obtain a bounding box, then the body keypoints are detected inside that box and connected into each person's pose. Applied to an object detection network, this means first obtaining an approximate boundary of the target and then refining the target's position. RepPoints is an example: it determines the target boundary through deformable convolution.
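To make the coarse-then-refine order concrete, here is a minimal toy sketch. It is not RepPoints and uses no deformable convolution: a binary foreground mask stands in for the learned refinement step, and the names `refine_box` and `top_down_detect` are my own, purely for illustration.

```python
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in inclusive pixel coordinates
Box = Tuple[int, int, int, int]


def refine_box(mask: List[List[int]], coarse: Box) -> Box:
    """Stage 2: shrink a coarse box to the tight extent of the
    foreground pixels found inside it (a stand-in for learned refinement)."""
    x0, y0, x1, y1 = coarse
    xs, ys = [], []
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            if mask[y][x]:
                xs.append(x)
                ys.append(y)
    if not xs:
        # Nothing to refine against, so keep the coarse estimate.
        return coarse
    return (min(xs), min(ys), max(xs), max(ys))


def top_down_detect(mask: List[List[int]], coarse_boxes: List[Box]) -> List[Box]:
    """Stage 1 (the coarse boxes) is assumed to be given; each box is
    then refined on its own, one target at a time."""
    return [refine_box(mask, box) for box in coarse_boxes]


# Toy image: 8 pixels wide, 6 pixels tall, one object at x=2..4, y=1..3.
mask = [[0] * 8 for _ in range(6)]
for y in range(1, 4):
    for x in range(2, 5):
        mask[y][x] = 1

coarse_boxes = [(1, 0, 6, 4)]  # hypothetical loose first-stage output
refined = top_down_detect(mask, coarse_boxes)
print(refined)
```

The point of the sketch is only the order of operations: the target as a whole is found first, and the finer localization happens afterwards inside that region.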
Bottom-up: detection proceeds from the bottom up, from the parts to the whole object. After the feature map is extracted from the image, the network first locates the corner points or edge extreme points of the targets, then decides which of these points belong to the same target, and the grouped points give that target's boundary. Examples are CornerNet (top-left and bottom-right corner points) and ExtremeNet (top, bottom, left, and right extreme points plus a center point).
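The grouping step can be sketched roughly as follows. This is loosely in the spirit of CornerNet, which pairs corners by comparing embeddings, but it is a simplification: each corner carries a single scalar embedding, and the greedy matching, the threshold `max_embed_dist`, and the function name `group_corners` are my own assumptions, not taken from the paper.

```python
from typing import List, Tuple

Corner = Tuple[float, float, float]       # (x, y, embedding)
Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)


def group_corners(
    top_lefts: List[Corner],
    bottom_rights: List[Corner],
    max_embed_dist: float = 0.5,  # hypothetical threshold
) -> List[Box]:
    """Pair each top-left corner with the unused bottom-right corner
    whose embedding is closest, if the pair is geometrically valid."""
    boxes: List[Box] = []
    used = set()
    for tx, ty, t_embed in top_lefts:
        best_j, best_dist = None, max_embed_dist
        for j, (bx, by, b_embed) in enumerate(bottom_rights):
            # A bottom-right corner must lie below and to the right.
            if j in used or bx <= tx or by <= ty:
                continue
            dist = abs(t_embed - b_embed)
            if dist < best_dist:
                best_j, best_dist = j, dist
        if best_j is not None:
            used.add(best_j)
            bx, by, _ = bottom_rights[best_j]
            boxes.append((tx, ty, bx, by))
    return boxes


# Two objects; corners of the same object have similar embeddings.
top_lefts = [(10, 10, 0.1), (50, 20, 0.9)]
bottom_rights = [(90, 80, 0.95), (40, 40, 0.05)]

boxes = group_corners(top_lefts, bottom_rights)
print(boxes)
```

Note that no box exists until the points have been grouped, which is the opposite order from the top-down case.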