[Object Detection - Paper Notes] Object Detection in Videos by Short and Long Range Object Linking

Paper title: 2018_Cite=13_Tang - Object Detection in Videos by Short and Long Range Object Linking

This paper is also called "2019_Cite=91_TPAMI_Tang - Object Detection in Videos by High Quality Object Linking"


Term Explanation

 proposal

        A candidate region, i.e. a region of the image that may contain an object

tubelet

        Refers to the approximate path of an object's movement: the collection of the box positions at each moment while the object moves

non-maximum suppression

        What is NMS?

        What is it for, and what problem does it solve? —— It aims to remove overlapping object detection boxes.

tracklet

        Usually translated as "short tracking fragment". Data association is used when tracking objects, and the whole continuous tracking process is actually composed of many tracklets.

region proposal network (RPN)

        Is the RPN a separate network? Is there a good introduction to it? It seems that the region proposal network is actually a component of the Faster R-CNN network.

Faster R-CNN

        I know that Faster R-CNN is used for object detection. How exactly does it work? Do later models keep using it, or at least imitate its design ideas?

ROI (region of interest)

        The region of interest is similar to the bbox concept: it marks the part of the image we care about, and only this part is given to the model for learning. In other words, anything outside this region is not fed to the subsequent model.

        Object detection mainly solves two problems: (1) localizing the object, i.e. finding the four corners of the bbox, and (2) deciding the category of the object inside that box. The region of interest serves the first purpose. How do you find those four corners? If you slide a window across the image, there are far too many candidate bboxes and the computation blows up. So instead you draw a region and say "this is the area I am interested in" (hence the name region of interest), and only this region is fed into the model for recognition, which greatly reduces the number of candidate bboxes.

objectness

        As the name implies, objectness measures how likely it is that this candidate region (region of interest) contains an object.

        Its purpose is to let us quickly discard image windows that do not contain any object.

        

        If an image window has high objectness, we expect it to be:

  • Unique across the entire image. (If your window looks like many other regions and has nothing distinctive, it is probably just part of the same background.)
  • Tightly bounded around the object. (For example, the sheep in the green box below has a clear border against the green grass background: the color and texture differ, one is white and the other is green. If there is no obvious boundary between the inside and the outside of the box, the pattern inside is very similar to the background, so it is most likely background.)
  • Different in appearance from its surroundings. (Same reasoning as above.)

     

   How to measure objectness?

  • Multi-scale saliency : essentially a measure of how unique the appearance of the image window is. The higher the density of unique pixels in the box compared to the whole image, the higher the value.

  • Color Contrast : the greater the color contrast between the pixels inside the proposed image window and the surrounding area, the higher the value.

  • Edge Density : defining an edge as the boundary of an object, this value measures the edges near the image window boundary (a small sketch follows after this list). An interesting algorithm for finding these edges: https://cv-tricks.com/opencv-dnn/edge-detection-hed/ .

  • Superpixel Spanning : defining a superpixel as a cluster of pixels of nearly the same color, if the value is high, all superpixels inside the box are contained entirely within its boundaries.
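As a rough illustration of the edge-density cue only (not the paper's method), here is a minimal Python sketch that scores a window by the fraction of Canny edge pixels in a thin ring just inside its border; the thresholds and ring width are arbitrary assumptions.

```python
import cv2

def edge_density(image_bgr, box, border=4, t1=100, t2=200):
    """Fraction of Canny edge pixels in a thin ring just inside the box.

    box = (x1, y1, x2, y2) in pixels; border, t1, t2 are arbitrary choices
    for this sketch, not values taken from the paper.
    """
    x1, y1, x2, y2 = box
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, t1, t2) > 0                 # boolean edge map
    window = edges[y1:y2, x1:x2]
    inner = window[border:-border, border:-border]
    ring_edges = window.sum() - inner.sum()             # edge pixels near the border
    ring_area = window.size - inner.size
    return ring_edges / max(ring_area, 1)
```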

RoI = Region of Interest

CoI = Cuboids of Interest

RoI Pooling

        RoI pooling maps boxes of different sizes onto a fixed-size w×h grid.

        For details, see the CSDN article "ROI Pooling of Deep Learning" (Running Daxiji's blog); read the summary when there is time.
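For intuition only (not from the paper), a minimal sketch using torchvision's built-in roi_pool: every RoI, whatever its size, is pooled to the same 7×7 grid. The feature-map size and the box coordinates below are made-up example values.

```python
import torch
from torchvision.ops import roi_pool

features = torch.randn(1, 256, 50, 50)             # (N, C, H, W) feature map
# Each box row is (batch_index, x1, y1, x2, y2) in feature-map coordinates.
boxes = torch.tensor([[0.,  5.,  5., 20., 30.],
                      [0., 10., 12., 45., 48.]])
pooled = roi_pool(features, boxes, output_size=(7, 7), spatial_scale=1.0)
print(pooled.shape)                                 # torch.Size([2, 256, 7, 7])
```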

response map:

        The output image formed after performing an operation on the original image can be called the response map of this operation.

What problem does this paper solve?

        You have recognized that something is an object and drawn its box, but you do not know its category. In other words, you now need to do classification and report the confidence of the classification result.

        However, this classification is sometimes unreliable. For example, in column (a) of the figure, the third and fourth images misclassify a squirrel as a monkey. Columns (b) and (c) use certain techniques so that the final classification is correct. The algorithm proposed in this paper is dedicated to correcting this kind of classification error.

        But for my graduation thesis the issue is not misclassification; the problem is that the person walking past is not recognized at all. Even though it does not directly meet my needs, adding this method to YOLOv8 could inspire the design of a distinctive structure; perhaps accuracy increases a little after the change, or at least does not drop, and that can still count as an innovation point. Improving the confidence of the classification is itself an innovation.

If the task only distinguishes people from non-people, this does not help much; if people are further divided into many categories, it becomes worth a closer look.

Background of this article

Video object detection is less effective because of the following characteristics of video:

        (1) motion blur, and

        (2) degraded image quality.

These two factors result in unstable classification for the same object across the video (for example, in some frames the object is classified as a person, while in other frames it is classified as a monkey).

To address this problem, prior academic work has made various attempts to exploit temporal contexts, mainly:

  •         optical-flow based feature propagation [38] and feature aggregation [37],
  •         object association across frames by tubelets [16, 17, 18], and
  •         track to detect [4].

Note that the algorithm proposed in this paper, like the methods above, also works by exploiting temporal contexts.

algorithm

        Model name: spatio-temporal cuboid proposal network

       (1) Link objects in the short range and the long range, i.e. associate objects across frames. (2) Aggregate the classification scores from the linked objects. In other words, exploit object association across frames in the video.

There are four stages in total: three concern short range object linking and one concerns long range object linking.

(Step 1) Cuboid proposal for a short segment

How to get short range linking?

        Generate a series of candidate spatio-temporal cuboids for a short video segment. Where does the "linking" mentioned above come in? It is reflected in the fact that objects across different frames lying in the same cuboid are regarded as the same object, so the objects in a video segment are naturally linked by the cuboid.

        The point of this step is to propose a set of cuboids (containers) that span the frames and bound the same object across those frames, like the cuboid formed by the red solid and dotted lines in the figure below.

        How the figure is drawn: look at the relative position of the green box inside the red box. From the first image to the second, the child moves slightly to the left; from the second to the third, the child moves further left to the left border of the red box and also up to its upper border. From this we can infer that, once the green boxes are drawn, the red box is the tightest box that contains all three green boxes.

The red cuboid, which bounds the movement of the object, is the target of the first step, the "cuboid proposal for a short segment" stage.

The green boxes form the tubelet; the tubelet, composed of the green object boxes in the video segment, is the target of the second step, the "object tubelet detection for a short segment" stage.

A short video segment consists of K frames, each denoted by the letter "I" for Image. The frames are taken from the video one by one, so each frame corresponds to a moment in time; the superscript runs from time t to time t+K-1, giving K frames in total.

The ground-truth boxes of an object in the segment form a set, which can be represented by the formula shown in the figure (reconstructed below).

How is the tubelet represented?

        What is a tubelet? It is the series of all ground-truth boxes in the K frames, i.e. the green ground-truth bounding boxes.

        You can replace the I above with "b with a wavy line above it", keeping the time superscript. This collection is given a name, the "curly T with a wavy line" on the left, which denotes the tubelet.

        Note that "b with a wavy line above it and a time in the upper-right corner" refers to one box. In other words, which dimensions do you need in order to fully describe a bounding box? It can be expressed by the following formula.

Each b inside the braces consists of four elements: the horizontal and vertical coordinates of the center of the box, and the width and height of the box. The time superscript in the upper-right corner of these four elements indicates the moment (frame) to which this box belongs.
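The original formula images did not survive extraction; based on the verbal description above, the notation is presumably something like the following (my reconstruction, which may differ in detail from the paper):

$$\tilde{T} = \{\tilde{b}^{\,t},\ \tilde{b}^{\,t+1},\ \ldots,\ \tilde{b}^{\,t+K-1}\}, \qquad \tilde{b}^{\,\tau} = (x^{\tau},\ y^{\tau},\ w^{\tau},\ h^{\tau}),$$

where (x, y) is the center and (w, h) the width and height of the ground-truth box at time τ, for the K frames $I^{t}, I^{t+1}, \ldots, I^{t+K-1}$.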

Then, how do we get the outermost red bounding box?

        Naturally, you overlay all the green boxes from every frame and take the extreme edges: the topmost, bottommost, leftmost and rightmost. The resulting red box is the 2D box of the cuboid proposal, written as "b with a wavy line above it" in the following formula; it is obtained by taking the tight bounding box around all the boxes in the set above.

        So you obtain the red box.
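As a tiny illustration (my own sketch, not code from the paper, and using corner coordinates rather than the center/width/height form above), the enclosing red box is just the elementwise min/max over the K per-frame green boxes:

```python
def enclosing_box(boxes):
    """boxes: list of (x1, y1, x2, y2) tuples, one green box per frame.
    Returns the tightest box containing all of them (the "red" box)."""
    x1 = min(b[0] for b in boxes)
    y1 = min(b[1] for b in boxes)
    x2 = max(b[2] for b in boxes)
    y2 = max(b[3] for b in boxes)
    return (x1, y1, x2, y2)

# e.g. three frames of a child drifting left and up (made-up coordinates)
print(enclosing_box([(40, 60, 90, 160), (35, 58, 85, 158), (30, 50, 80, 150)]))
# -> (30, 50, 90, 160)
```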

How does bounding cuboid come about?

        (Is it K (the number of frames) red boxes that finally form the bounding cuboid? Or is it that each frame has a green tubelet box, and these green boxes together form one red box?)

        I lean towards the former: the bounding cuboid is simply the collection of K copies of the red box ("b with a wavy line"), one per frame.

Although these K red boxes belong to different times (frame positions), for simplicity they are all written as the same "b with a wavy line", and the set they form is named "c with a wavy line above it".

The authors modify the region proposal network of Faster R-CNN into a cuboid proposal network: the RPN that computes 2D proposals is replaced by a network that computes cuboid proposals. The traditional RPN takes one image at a time as input, while the cuboid proposal network takes K frames at once.

Given the K input frames, the output is a set of candidate cuboids of interest, i.e. the candidate cuboid regions you care about. The later model only uses these regions of interest for subsequent recognition; cuboid regions outside them are not processed further.

These CoIs (candidate cuboids of interest) are regressed from a w×h spatial grid (what exactly is this operation? see the sketch after the "Supplement" note below). The output is a set of w·h·k candidate cuboids of interest (CoI), regressed from the w×h spatial grid.

The CoIs have the following two characteristics:

        (1) At each position of the w×h spatial grid there are k reference boxes (like the k anchor boxes per position in an RPN).

        (2) Each CoI (candidate cuboid of interest) is associated with an objectness score.

Supplement:

        Similar to a candidate region of interest on a 2D image, here the candidate is a cuboid, which we call a candidate cuboid of interest.

        There are w×h×k candidate cuboids in total: w×h is the size of the spatial grid and k is the number of reference boxes at each grid position (note that this lower-case k is not the number of frames, which is written K).
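To make the shapes concrete, here is a minimal PyTorch sketch of what an RPN-like cuboid proposal head could look like (my own guess at the structure, not the authors' code): a conv head over a w×h feature map with k reference boxes per position, emitting one objectness score and one 4-value box regression per reference box; the regressed 2D box is shared by all K frames of the segment.

```python
import torch
import torch.nn as nn

class CuboidProposalHead(nn.Module):
    """Sketch of an RPN-style head for cuboid proposals (assumed structure)."""
    def __init__(self, in_channels=256, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)
        self.objectness = nn.Conv2d(256, k, kernel_size=1)      # 1 score per reference box
        self.box_deltas = nn.Conv2d(256, 4 * k, kernel_size=1)  # 1 shared 2D box per reference box

    def forward(self, fused_features):
        # fused_features: (N, C, h, w) map fused from the K frames of the segment
        x = torch.relu(self.conv(fused_features))
        return self.objectness(x), self.box_deltas(x)

head = CuboidProposalHead()
scores, deltas = head(torch.randn(1, 256, 38, 50))   # toy w x h grid
print(scores.shape, deltas.shape)                     # (1, 9, 38, 50) (1, 36, 38, 50)
```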

(Step 2) Object tubelet detection for a short segment

The previous step produced the red cuboid proposals. This step regresses and classifies tubelets, finally yielding the green boxes (solid and dotted lines).

In this step, the 2D form of the cuboid proposal is used as the 2D box (region) proposal for each frame. On each frame, every 2D proposal is classified (what object is in it: a person, a car?) and also refined.

——We use the 2D form of the cuboid proposal as the 2D box (region) proposal for each frame in this segment, which is classified and refined for each frame separately.

Consider a frame Iτ in this segment (why single out one frame Iτ here, and what was the consideration? an open question while reading). Fast R-CNN is used to compute the classification score and to refine, i.e. fine-tune, the box. A small sketch of this per-frame step follows after the four points below.

        RoI Pooling : We start with an RoI (region of interest) pooling operation (see the RoI pooling link above if this is unclear). Its inputs are a 2D region proposal b and the feature map produced by running the CNN over the frame, also called a response map.

        Classification : The result of the RoI pooling operation is fed into a classification layer, which outputs a (C+1)-dimensional classification score vector. What does C mean here? C is the number of object categories to be recognized; the extra one is the background category.

        Regression : The result of the RoI pooling operation is also fed into a regression layer, which refines the box. For each cuboid proposal, the precise box location is regressed in every frame; all the boxes lying in the same cuboid are regarded as the same object, and their per-frame classification scores are aggregated in some way into a single score for the tubelet. —— It then regresses the precise box locations in each frame over each cuboid proposal, yielding a tubelet with a single classification score which is aggregated from the scores of the boxes in the tubelet. —— We regress the precise box locations and classification scores for each frame separately, forming a tubelet representing the linked object boxes in the short video segment. We obtain the classification score of the tubelet by aggregating the classification scores of the boxes across frames.

        Refine : The resulting K refined boxes, one for each of the K frames, form the tubelet detection result over this segment (the green boxes). In other words, the tubelet is split back into one refined bounding box per frame.
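A minimal PyTorch sketch of this per-frame Fast R-CNN-style step (an illustration under my own assumptions about feature sizes, not the authors' code; the 30 classes are borrowed from ImageNet VID, mentioned later): RoI-pool the frame's response map at the 2D proposals, then run a classification head over C+1 classes and a box-regression head.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class FrameDetectionHead(nn.Module):
    """Sketch: classify and refine 2D proposals on one frame (assumed sizes)."""
    def __init__(self, channels=256, pool=7, num_classes=30):
        super().__init__()
        self.pool = pool
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.Linear(channels * pool * pool, 1024), nn.ReLU())
        self.cls = nn.Linear(1024, num_classes + 1)   # C object classes + background
        self.reg = nn.Linear(1024, 4)                  # box refinement (dx, dy, dw, dh)

    def forward(self, response_map, proposals):
        # response_map: (1, C, H, W) feature map of this frame
        # proposals: (R, 5) rows of (batch_index, x1, y1, x2, y2), the cuboid's 2D box
        feats = roi_pool(response_map, proposals, output_size=(self.pool, self.pool))
        h = self.fc(feats)
        return self.cls(h), self.reg(h)               # (R, C+1) scores, (R, 4) deltas
```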

         Over each of these tubelets, the classification scores of all frames are aggregated in some way to get one score.

 

 There are many ways to aggregate: you can simply take the mean, or use the following formula for aggregation (quoted later in these notes as $0.5(mean+max)$).
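A one-line sketch of that aggregation rule (assuming the per-frame scores are stacked into a NumPy array; the rule itself is the $0.5(mean+max)$ form quoted in the summary later in these notes):

```python
import numpy as np

def aggregate_scores(frame_scores):
    """frame_scores: (K, C+1) array of per-frame classification scores.
    Returns one (C+1,) tubelet score: 0.5 * (mean + max) over the K frames."""
    return 0.5 * (frame_scores.mean(axis=0) + frame_scores.max(axis=0))

# toy example: 3 frames, 2 classes + background
print(aggregate_scores(np.array([[0.1, 0.7, 0.2],
                                 [0.2, 0.5, 0.3],
                                 [0.1, 0.8, 0.1]])))
```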

 

 ( I really don't understand what the above three steps do specifically )

The benefit of aggregating the classification scores as above is to boost the classification scores for positive detections, and thereby improve the overall classification quality.

(Step 3) Tubelet non-maximum suppression

        What was NMS originally designed for? To remove overlapping recognition boxes so that the final boxes overlap as little as possible. The non-maximum suppression (NMS) algorithm aims to remove overlapping object detection boxes: it sorts all detected boxes by score, selects the box with the highest score, and suppresses, i.e. simply removes, every other box whose overlap with it exceeds a predefined threshold (that removal is the "suppression" action). The whole process is then applied recursively to the remaining boxes.

        The most straightforward solution is frame-level NMS: independently on each frame, remove the overlapping 2D recognition boxes. However, doing so tends to break an entire tubelet into smaller tubelets that may even be assigned different object classes, because the fact that the boxes in a tubelet belong to the same object is not exploited. This tendency is confirmed by empirical experiments.

        The authors extend the non-maximum suppression algorithm so that it removes spatially-overlapping tubelets (box trajectories) in the short segment, which avoids the tubelet breakage caused by frame-wise NMS. For this they introduce an overlap measure for tubelets. The resulting procedure is named the tubelet NMS algorithm (T-NMS).

        The only difficulty is how to evaluate the overlap of two tubelets. The overlap is defined from the overlap of their boxes in the same frame, realized through IoU; the summary later in these notes states that the tubelet IoU is the minimum of the per-frame box IoUs. A sketch follows below.
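A minimal sketch of T-NMS as I understand it from these notes (the min-over-frames IoU is taken from the summary further below; the greedy loop mirrors standard NMS and is my assumption, not the authors' code):

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def tubelet_iou(ta, tb):
    """Tubelet overlap = minimum per-frame box IoU (per the summary in these notes).
    ta, tb: lists of boxes over the same K frames."""
    return min(box_iou(a, b) for a, b in zip(ta, tb))

def tubelet_nms(tubelets, scores, thresh=0.5):
    """Keep highest-scoring tubelets, suppressing spatially-overlapping ones."""
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if all(tubelet_iou(tubelets[i], tubelets[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```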

(Step 4) Classification score refinement via temporally-overlapping tubelets

Our data processing is like this

        Split the video into short segments of length K frames, taken with a stride of K-1, i.e. each new segment starts K-1 frames after the previous one. This ensures that two consecutive segments overlap by (at least) one frame.

Consider two tubelets that overlap in time (by the one shared frame in the middle); say the i-th tubelet comes from the (m+1)-th segment (the exact indexing in the paper is not fully clear to me).

If, on the shared frame in the middle of the two segments, the overlap between the boxes of the two tubelets reaches the threshold, i.e. the overlap is significant, then we link them.

The implementation uses a greedy algorithm, specifically a greedy tubelet linking algorithm. At the start, all tubelets of all short segments are put into a pool, recording which segment each tubelet comes from. The tubelet with the highest classification score is popped first. On the frame where two tubelets overlap in time, the IoU of their boxes is checked; if it is greater than a threshold, such as 0.4, the two tubelets are merged into one. On the overlapping frame, the box with the lower score is removed.

Equation (2) is used to update the classification score of the merged tubelet. Since two segments are merged into a longer span, the corresponding segment range must also be recorded after merging. The merged tubelet is then put back into the pool, and the step is repeated until no tubelets need to be merged.

The tubelets left in the pool form the result of video object detection. The tubelet score is assigned to every recognition box in the tubelet, and for each frame, the boxes from all tubelets that fall on that frame are taken as the final recognition boxes for that frame.
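A compact sketch of this greedy linking loop (my own paraphrase of the description above; the 0.4 threshold and the 0.5(mean+max) aggregation are taken from these notes, everything else is an assumption). It reuses box_iou() from the T-NMS sketch earlier.

```python
def link_tubelets(tubelets, iou_thresh=0.4):
    """Greedy long-range linking of short tubelets.

    Each tubelet is a dict: {"boxes": {frame_index: (x1, y1, x2, y2)},
                             "scores": {frame_index: per-frame score},
                             "score": aggregated tubelet score}.
    """
    pool = list(tubelets)
    merged = True
    while merged:
        merged = False
        pool.sort(key=lambda tb: tb["score"], reverse=True)    # highest score first
        for a in pool:
            for b in pool:
                if a is b:
                    continue
                shared = set(a["boxes"]) & set(b["boxes"])     # temporally-overlapping frames
                if len(shared) != 1:
                    continue                                   # neighbouring segments share one frame
                t = next(iter(shared))
                if box_iou(a["boxes"][t], b["boxes"][t]) > iou_thresh:
                    # merge b into a; on the shared frame keep the higher-scoring box
                    if b["scores"][t] > a["scores"][t]:
                        a["boxes"][t], a["scores"][t] = b["boxes"][t], b["scores"][t]
                    for t2 in b["boxes"]:
                        if t2 != t:
                            a["boxes"][t2], a["scores"][t2] = b["boxes"][t2], b["scores"][t2]
                    s = list(a["scores"].values())
                    a["score"] = 0.5 * (sum(s) / len(s) + max(s))   # Eq. (2): 0.5(mean+max)
                    pool[:] = [tb for tb in pool if tb is not b]
                    merged = True
                    break
            if merged:
                break
    return pool
```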

How do we get long range object linking?

        The whole video sequence is first divided into short video segments. Then, across temporally-overlapping short video segments, we link the tubelets that have significant overlap. In this way a long range object link is formed (a long-distance link? or rather a long-duration link?).

        (I now suspect that "long range" and "short range" in this paper refer not to spatial distance but to long and short time spans.)

Under what circumstances will two tubelets be linked together and merged into the same tubelet?

        (1) the two boxes sufficiently overlap in space, (2) the two boxes come from the temporally-overlapping frame, and (3) the two tubelets are neighboring tubelets.

        ——If two boxes, (2)which are from the temporally-overlapping frame of (3)two neighboring tubelets, (1)have sufficient spatial overlap, the tubelets are linked together and merged.

Experimental dataset: ImageNet VID dataset

        https://image-net.org/challenges/LSVRC/2015/index

        It is a popular large-scale video object detection benchmark with 30 categories, all of which are also in ImageNet DET. The model is trained on 3862 video snippets from the training set and evaluated on 555 snippets from the validation set. The 2015 ImageNet Large-Scale Visual Recognition Challenge (ILSVRC2015) introduced a task called Object Detection from Video (VID) together with this dataset.

Other readers' interpretations of this paper

https://github.com/ocean1100/Reading/blob/5b1b4846045105fda9b8b9ac6dc276f7b1924ba9/VideoDetection/Object_Detection_in_Videos_by_Short_and_Long_Range_Object_Linking.md

  1. Cuboid proposal for a short segment: propose cuboids

  2. Tubelet detection for a short segment

    1. Regression and classification of this cube, the score is the aggregation of all bbox scores

    2. Aggregation method: $0.5(mean+max)$

  3. Tubelet non-maximum suppression

    1. 2D NMS is prone to tubelet interruption

    2. IOU of tubelets: first calculate the IOU of the bbox of the corresponding frame, and take the smallest one

  4. Classification refinement through temporally-overlapping tubelets.

    1. M short segments (frames 1 to K, frames K to 2K-1, ...), each with its own cuboid proposals

    2. Two tubelets whose bboxes on the shared (overlapping) frame overlap strongly are regarded as the same object

    3. Fuse the scores of the two tubelets using the aggregation method from point 2 above

https://htx1998.cn/2021/03/19/3-19/

Summary

Most previous methods associate objects in adjacent frames. This paper is the first to propose associating objects in the same frame. Unlike other methods, the authors connect objects in the same frame and propagate box scores across frames instead of propagating features.

New point in the text:

  1. A cuboid proposal network (CPN, cube candidate frame network) is proposed to extract spatiotemporal candidate cuboids that constrain the motion of objects.

  2. A short tubelet (small tube) detection network is proposed to detect short tubelets in short video segments.

  3. A short tubelet concatenation algorithm is proposed to form long tubelets by concatenating short tubelets with temporal overlap.

Related Work

Feature Propagation Without Object Linking

FGFA, STSN, STMN, MANet: the features of the current frame are enhanced by aggregating features from adjacent frames. FGFA and MANet use optical flow to align the features of different frames before aggregating; STSN uses deformable convolutional networks to propagate features across space and time; STMN adopts Conv-GRU to propagate features from adjacent frames. In addition, DFF also uses feature propagation to speed up object detection: an expensive, very deep network computes the features of key frames, which are then propagated to non-key frames using optical flow computed by a shallow network. None of the above methods use object linking.

Feature Propagation Using Object Linking

The Tubelet Proposal Network first generates static object proposals, then predicts their relative displacements in subsequent frames. The features of the boxes in a tubelet are propagated to each box for classification by a CNN-LSTM network. MANet predicts, for each proposal of the current frame, the relative displacement in adjacent frames, and the features of the adjacent-frame boxes are propagated to the corresponding box of the current frame through average pooling. Unlike these methods, the authors connect objects in the same frame and propagate box scores across frames instead of propagating features. In addition, the authors' method directly generates spatio-temporal cuboid proposals for video segments, rather than generating frame-by-frame proposals like the Tubelet Proposal Network and MANet.

Score propagation using Object Linking

Object detection from video tubelets with convolutional neural networks (object linking); T-CNN: Tubelets with convolutional neural networks for object detection from videos (object linking)

The two papers above propose two object-linking methods. The first tracks the boxes detected in the current frame to its neighboring frames to enhance the original detections with higher recall; scores are also propagated to improve classification accuracy, and the linking is based on the average optical-flow vector within the box. The second uses a tracking algorithm to connect objects into long tubelets, and then employs a classifier to aggregate the detection scores within each tubelet.

The Seq-NMS method links objects by selecting boxes with sufficient spatial overlap in adjacent frames, without using motion information, and then aggregates the scores of the linked objects as the final score. The method in Detect to Track and Track to Detect simultaneously predicts the object positions in two frames and the object displacement from the previous frame to the next; the displacement is then used to link detected objects into tubelets. The detection scores within the same tubelet are reweighted in some way.

method

The task of video object detection is to infer the position and category of the objects in each frame of the video. In order to obtain high-quality object linking, the authors propose to connect objects in the same frame, which can improve classification accuracy.

Given as input video segments cut with temporal overlap, the author's method consists of 3 steps:

  1. Cuboid proposal generation for a segment. This step aims to generate a set of cuboids (containers) that bound the same object across different frames. (Not fully understood.)

  2. Short tubelet detection for a segment. For each cuboid proposal, regression and classification generate short tubelets (a sequence of bounding boxes, each locating the object in one frame). Spatially overlapping short tubelets are removed by tubelet non-maximum suppression (not fully understood). A short tubelet is a representation of objects linked across the frames of a video segment. (Not fully understood.)

  3. Short tubelet linking for the entire video. This step links tubelets across temporally overlapping segments and refines the classification scores of the linked tubelets.

Among them, the first two steps of cuboid proposal generation and short tubelet detection generate short tubelets with overlapping time, so as to ensure that the short tubelet linking step can be connected.

1.Cuboid Proposal Generation

Tubelet: a collection of ground truth boxes.

Cuboid: my understanding is that it is the same 2D bbox repeated across frames, used to bound the displacement of the object. Objects at different frames inside a cuboid are treated as the same object.

The Region Proposal Network of Faster R-CNN is modified by the authors into a Cuboid Proposal Network. The input of the RPN is a single image, while the input of the CPN is K frames. It outputs w·h·k cuboid proposals: there is a w×h spatial grid, and each position has k reference boxes. Each cuboid proposal is associated with an objectness score.

2.Short Tubelet Detection

The author uses the two-dimensional form of the cuboid proposal as the 2D box (region) proposal for each frame of the video segment, and classifies and refines each frame separately. (That is, just as the proposals generated by an RPN are a rough regression of the object position, here the 2D form of the CPN output is taken as the rough proposal for each frame.)

For a given frame of a video segment, the author uses the same method as Fast R-CNN to refine the bounding box and compute its classification score. The inputs are a two-dimensional region proposal b and the response map obtained by the CNN; an RoI pooling operation is performed, and its result is fed to the classification layer and the regression layer, which respectively produce a (C+1)-dimensional classification score vector and a refined bounding box.

Finally, K refined bounding boxes are generated for K frames, which is the Short tubelet detection result.

In order to remove redundant short tubelets, the author extends the standard non-maximum suppression method (NMS) to Tubelet-NMS (T-NMS), which removes short tubelets with overlapping positions. (Just as with a single image, where NMS is needed to avoid many overlapping redundant boxes, the author extends standard NMS to tubelets.) This strategy avoids the problem of frame-by-frame NMS, which removes 2D boxes independently on each frame and would break existing tubelets.

The author measures the spatial overlap of two tubelets from the IoU of their bounding boxes in the same frame. (This is probably because a video segment may generate multiple tubelets, so a way to measure the similarity of two tubelets is needed: roughly, if their bounding boxes on some frame do not overlap enough, the two tubelets are not the same object.)

3.Short Tubelet Linking

The author divides a video into a series of segments of length K frames with overlapping frames (stride K-1). Each segment generates tubelets, and two tubelets are linked together when, on their shared (overlapping) frame, the spatial overlap of their boxes is higher than a predefined threshold.

The author adopts a greedy linking algorithm: at the beginning, the short tubelets of all video segments are put into a pool, and the segment corresponding to each tubelet is recorded. The algorithm then selects the short tubelet with the highest classification score from the pool and computes the IoU between its box on the overlapping frame and those of other tubelets. When the IoU is greater than 0.4, the two short tubelets are merged into one long tubelet, the lower-scoring box on the overlapping frame is removed, and the new classification score is computed according to Equation (2), $\mathrm{Aggregation}(\cdot) = \frac{1}{2}\big(\mathrm{mean}(\cdot) + \mathrm{max}(\cdot)\big)$. The process is repeated until no tubelets can be merged.

Finally, the remaining tubelets in the pool form the result of video object detection. The tubelet score is assigned to each box.

Summarize

The cuboid proposal network (CPN) proposed in this paper is eye-catching. Its main idea is to divide a long video clip into short clips, and each short clip has a frame overlap. After that, multiple tubelets are connected together by similarity measurement of overlapping frames between different tubelets, and the classification scores of connected tubelets are optimized .

This method is quite different from previous aggregation methods, which mostly focused on selecting key frames and non-key frames, whereas this method splits a long video into short overlapping segments. This idea is worth borrowing in the future.

Existing problems:

  • How to choose the length of the split segment? 2

  • There are wrong classifications in picture (b), how to measure their similarity?

  • What if a tubelet's IoU with other tubelets is very low? Then the link is allowed to break.
