Overview of computer vision tasks

This post summarizes the main research directions and tasks of computer vision (CV), drawing on two earlier posts:
https://blog.csdn.net/weixin_44523062/article/details/104577628
https://blog.csdn.net/weixin_44523062/article/details/104535650
It is intended as a living overview that can be updated as CV research evolves.
Courses and books that describe the field systematically:

  • Course 1: Fei-Fei Li's CS231n (2019), which emphasizes building intuition
  • Course 2: Ye Zi, Computer Vision Deep Learning Practice (2017)
  • Book: Computer Vision: A Modern Approach, Second Edition


The four basic tasks of CV, as framed in Fei-Fei Li's CS231n: classification, localization, detection, and segmentation.

1. Image classification (CNN + FC + softmax)

  1. Task: classification; input: image → output: class label
  2. Method: use a labeled dataset → extract features → train a classifier. Representative networks: LeNet (1998), AlexNet (2012), ZFNet (2013), VGGNet (2014), GoogLeNet (2014), ResNet (2015), DenseNet (2016)
  3. Datasets: MNIST handwritten digits, CIFAR-10, ImageNet (1000 classes)
    MNIST: 60k training images, 10k test images, 10 categories, image size 1 × 28 × 28.
    CIFAR-10: 50k training images, 10k test images, 10 categories, image size 3 × 32 × 32.
    CIFAR-100: 50k training images, 10k test images, 100 categories, image size 3 × 32 × 32.
    ImageNet: 1.2M training images, 50k validation images, 1k categories. Through 2017, the ILSVRC competition based on ImageNet was held every year and was effectively the Olympics of the computer vision community.
  4. Application: the foundation of image understanding in CV, and a prerequisite for target recognition and target segmentation
  5. Evaluation: accuracy = number of correctly classified samples / total number of samples
  6. Extended topic, types of classifiers: discriminative vs. generative (source: Computer Vision: A Modern Approach); see also https://blog.csdn.net/u010358304/article/details/79748153
    From the perspective of probability distributions, each sample in a dataset has a feature vector x_i with a corresponding class label y_i.
  • Generative models: grounded in statistics and Bayes' theorem. They learn the joint probability distribution P(x, y), i.e., the probability that feature x and label y co-occur, and from it derive the conditional distribution. They can model the mechanism by which the data is generated.
    • 1. Naive Bayes 2. Gaussian mixture model 3. Hidden Markov model
  • Discriminative models: learn the conditional probability distribution P(y | x) directly, i.e., the probability of label y given feature x.
    • 1. Perceptron 2. k-nearest neighbors 3. Decision tree 4. Logistic regression 5. Maximum entropy model 6. SVM 7. Boosting (AdaBoost) 8. Conditional random field (CRF) 9. CNN
  7. Code: PyTorch ships with ResNet and VGGNet, and torchvision provides the MNIST and CIFAR datasets. These networks serve as backbones for the deeper image-understanding tasks below, and in practice one starts from a pretrained model and fine-tunes it (see the sketch below).
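A minimal fine-tuning sketch of the workflow just described, assuming torchvision's pretrained ResNet-18 and its CIFAR-10 loader; the batch size, learning rate, and single training epoch are illustrative choices, not values from the original post.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# CIFAR-10 images are 3 x 32 x 32; resize them to the 224 x 224 input expected by ImageNet-pretrained models.
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)

# Load an ImageNet-pretrained backbone and replace the final FC layer for 10 CIFAR classes.
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 10)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

criterion = nn.CrossEntropyLoss()  # classification head: FC + softmax (softmax lives inside the loss)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

model.train()
for images, labels in train_loader:  # one epoch shown for brevity
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```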

2. Target detection (localization + classification)

Localization outputs a bounding box for a specific kind of target, e.g., face detection or pedestrian detection; it is a special case of target detection.
A target detection task outputs: (1) location (bounding box), (2) category, and (3) confidence.
Traditional methods rely on hand-crafted features and exhaustive search.
Deep learning methods fall into two families: (1) two-stage methods that extract features from candidate regions and then regress bounding boxes; (2) one-stage methods that localize and detect by direct regression.

2.1 Object localization

  1. On top of image classification, we also want to know where the target is in the image, usually expressed as a bounding box. The basic idea is multi-task learning: the network has two output branches. One branch performs image classification (fully connected layer + softmax) to determine the target category; the difference from pure classification is an extra "background" class. The other branch performs regression and outputs four numbers that specify the bounding box (e.g., the center coordinates plus width and height). A minimal sketch of this two-branch design is given after this list.

  2. Human pose / face localization: the same idea can be applied to human pose estimation or face localization, where the regression branch outputs a series of keypoints (body joints or facial landmarks).

  3. Weakly supervised localization: since target localization itself is a relatively simple task, a recent research focus is localizing targets when only image-level labels are available. The basic idea is to find high-response salient regions in the convolutional feature maps and treat them as the location of the target in the image.
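A minimal sketch of the two-branch localization idea in item 1 above: a shared backbone, a classification head with an extra "background" class, and a 4-number box-regression head. The backbone choice, layer sizes, loss weight, and dummy data are all illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class ClassifyAndLocalize(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        backbone = models.resnet18(pretrained=True)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # globally pooled features
        feat_dim = backbone.fc.in_features
        self.cls_head = nn.Linear(feat_dim, num_classes + 1)  # branch 1: categories + "background"
        self.box_head = nn.Linear(feat_dim, 4)                # branch 2: e.g. center x, center y, width, height

    def forward(self, x):
        f = self.features(x).flatten(1)
        return self.cls_head(f), self.box_head(f)

# Multi-task loss: classification loss + weighted box-regression loss (the weight 1.0 is a tunable assumption).
model = ClassifyAndLocalize(num_classes=20)
cls_loss_fn, box_loss_fn = nn.CrossEntropyLoss(), nn.SmoothL1Loss()
images = torch.randn(2, 3, 224, 224)                  # dummy batch
labels, boxes = torch.tensor([3, 7]), torch.rand(2, 4)
cls_logits, box_pred = model(images)
loss = cls_loss_fn(cls_logits, labels) + 1.0 * box_loss_fn(box_pred, boxes)
```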

2.2 Target detection

  1. Task: for the full image, identify each target's location (bounding box), category label, and confidence. Faster R-CNN is a general-purpose detection framework that can detect a single specific target class or multiple classes; depending on the training data, detectors for faces, pedestrians, masks, and so on can be trained.

  2. Method:
    Traditional pipeline:
    (1) Region selection (sliding window): exhaustive sliding-window traversal is untargeted (poor scale handling, high time complexity, redundant windows)
    (2) Feature extraction (SIFT, HOG, etc.): hand-crafted features are not robust
    (3) Classifier (SVM, AdaBoost, etc.) + NMS / soft-NMS for high precision at high recall
    Classic combinations: Haar-like face features + cascade classifier, HOG pedestrian features + SVM classifier, DPM (deformable parts model) for object detection
    Deep learning methods (R-CNN, YOLO, SSD, FCN; multi-feature fusion):
    (1) Two-stage, based on region proposals:
    R-CNN, Fast R-CNN, Faster R-CNN (RPN region proposal network), R-FCN
    Candidate regions (merging similar sliding windows) give fewer windows and higher recall (using the texture, edges, color, etc. in the image)
    (a) Use Selective Search to extract proposals, then classify them with a CNN or other recognizer.
    (b) Pre-train on a recognition (classification) dataset, then fine-tune on the detection dataset.
    (c) Replace the final softmax of the CNN with an SVM, and use the 4096-dimensional CNN feature vector for bounding-box regression.
    (d) The first two steps of the pipeline (candidate-region extraction + feature extraction) are independent of the category to be detected and can be shared across categories; when detecting multiple categories at once, only the last two steps (classification + refinement) need to be duplicated, and these are simple linear operations, so they are very fast.
    (2) One-stage, based on regression:
    You Only Look Once (YOLO v1-v3)
    Single Shot MultiBox Detector (SSD)
    FPN: multi-feature fusion; deep feature maps are upsampled (deconvolution) and fused with shallow features
    RetinaNet: see https://blog.csdn.net/JNingWei/article/details/80038594

  3. Datasets: ImageNet (1000 classes), PASCAL VOC (20 classes, 2007), MS COCO (80 classes)
    PASCAL VOC contains 20 categories. Usually the union of the VOC07 and VOC12 trainval sets is used for training, and the VOC07 test set is used for testing.
    COCO is harder than VOC. It contains 80k training images, 40k validation images, and 20k unreleased test images (test-dev), across 80 categories, with an average of 7.2 targets per image. Usually the union of the 80k training images and 35k of the validation images is used for training, the remaining 5k validation images for validation, and the 20k test images for online evaluation.

  4. Application: tracking, re-identification

  5. Evaluation: mAP; a detection usually counts as correct when IoU exceeds 0.5-0.7; the F1 score combines precision and recall: F1 = 2PR / (P + R)
    mAP (mean average precision) is the standard metric for target detection. A predicted bounding box counts as correct when its IoU with the ground-truth box exceeds a threshold (usually 0.5). For each category, a precision-recall curve is drawn; average precision (AP) is the area under that curve. Averaging AP over all categories gives mAP, which lies in [0, 100%].
    Intersection over union (IoU) is the area of the intersection of the predicted and ground-truth bounding boxes divided by the area of their union, and lies in [0, 1]. IoU measures how close the predicted box is to the ground-truth box: the larger the IoU, the more the two boxes overlap.

  6. Difficulties and tricks
    Non-maximum suppression (NMS): a detector often makes multiple predictions for the same target, producing several overlapping bounding boxes. NMS keeps the prediction closest to the ground truth and suppresses the rest. For each category, NMS first sorts the predictions by their class probability from high to low; predictions with very low probability are treated as misses and suppressed. It then outputs the remaining prediction with the highest probability and suppresses any other box that overlaps it heavily (e.g., IoU greater than 0.3), repeating this step until all predictions are processed. (A numpy sketch of IoU and NMS is given after this list.)
    Online hard example mining (OHEM): another problem in target detection is class imbalance. Most regions of an image contain no target, and only a small fraction do; moreover, targets vary greatly in difficulty, with most being easy to detect and a few very hard. OHEM, similar in spirit to boosting, sorts all candidate regions by loss value and optimizes on the subset with the highest loss, so the network focuses on the harder targets in the image. To avoid selecting candidate regions that overlap heavily with one another, OHEM also applies NMS over the candidates according to their loss values.
    Regression in log space: bounding-box regression is much harder to optimize than classification. The L2 loss is sensitive to outliers: because of the squaring, outliers produce large loss values and large gradients, which easily cause gradient explosions during training, while the gradient of the L1 loss is discontinuous. Regressing in log space shrinks the dynamic range of the targets and makes training much easier. Many works also use the smooth L1 loss, and normalizing the regression targets in advance helps training as well.
    Source: https://blog.csdn.net/Fire_to_cheat_/article/details/88551011

  7. Code: YOLO and Faster R-CNN, run on ImageNet or COCO
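To make the IoU and NMS descriptions in items 5 and 6 concrete, here is a small numpy sketch. It assumes boxes are given as [x1, y1, x2, y2]; the 0.5 suppression threshold and the toy boxes are illustrative.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in [x1, y1, x2, y2] format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, suppress heavily overlapping boxes, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        if order.size == 1:
            break
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep

# Two overlapping predictions of the same object plus one separate box.
boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # -> [0, 2]
```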

3. Target segmentation (semantic, instance)

  1. Task: segmentation down to the pixel level, producing a mask of the object region. Semantic segmentation distinguishes between classes; instance segmentation must also distinguish between instances of the same class.
    Semantic segmentation: understand the role of each pixel semantically (e.g., whether it belongs to a car, a motorcycle, or another category); pixels representing the same conceptual meaning are grouped together.
    Instance segmentation: the basic idea is target detection + semantic segmentation. First use a detector to box the different instances in the image, then use semantic segmentation to label each pixel inside each bounding box.
    Beyond semantic segmentation, instance segmentation separates different instances of the same class, e.g., marking 5 cars with 5 different colors. Classification generally identifies the single object an image contains, but instance segmentation is a more complex task: there may be multiple overlapping objects and complex backgrounds, and we must not only classify these objects but also determine their boundaries, differences, and relationships.
  2. Method:
    Semantic segmentation: FCN (fully convolutional network), U-Net-style encoder-decoder networks
    Dilated convolutions, DeepLab, RefineNet (cascades, 2015)
    Instance segmentation: Mask R-CNN
  3. Datasets: MS COCO, PASCAL VOC
    PASCAL VOC 2012: 1.5k training images, 1.5k validation images, 20 categories (plus a background class).
    COCO: 83k training images, 41k validation images, 80k test images, 80 categories
  4. Application: Medical image segmentation
  5. Evaluation: IoU (per-class IoU, averaged into mIoU), mAP (a small mIoU sketch is given after this list)

  6. Distinguishing the tasks: classification, localization, detection, semantic segmentation, and instance segmentation require progressively deeper understanding of the image. Given an input image, classification determines which category the image belongs to. Localization builds on classification and further determines where the target is, usually as a bounding box; in localization there is typically one target or a fixed number of targets, whereas detection is more general and the types and number of targets are not known in advance. Semantic segmentation goes a step beyond detection: detection only needs to box each target, while semantic segmentation must decide which pixels belong to which class. Semantic segmentation, however, does not distinguish between different instances of the same category: when there are several cats in the image, it labels all of their pixels simply as "cat". Instance segmentation, in contrast, must distinguish which pixels belong to the first cat and which to the second. Finally, target tracking usually operates on video data; it is closely related to detection but additionally exploits the temporal relationships between frames.
  7. Research team: foolwood (Wang Qiang)
    SiamMask: https://zhuanlan.zhihu.com/p/58154634
  8. Code: Mask R-CNN
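A small sketch of the per-pixel IoU evaluation mentioned in item 5 of this list: build a confusion matrix over label maps, take per-class IoU, and average into mIoU. The toy label maps and the choice to ignore classes absent from both prediction and ground truth are illustrative assumptions.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Per-class IoU from a confusion matrix; mIoU is the mean over classes that appear."""
    mask = (target >= 0) & (target < num_classes)
    # Confusion matrix: rows = ground-truth class, columns = predicted class.
    cm = np.bincount(num_classes * target[mask] + pred[mask],
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    iou_per_class = tp / np.maximum(union, 1)
    return iou_per_class, iou_per_class[union > 0].mean()

# Toy 2 x 3 label maps with 3 classes.
gt   = np.array([[0, 0, 1], [2, 2, 1]])
pred = np.array([[0, 1, 1], [2, 2, 2]])
per_class, miou = mean_iou(pred.ravel(), gt.ravel(), num_classes=3)
print(per_class, miou)  # per-class IoU and their mean
```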

4. Target tracking (video)

  1. Task: panoramic (full-frame) tracking; MTSC (multi-target, single camera) and MTMC (multi-target, multi-camera).
    STSC (single-target, single camera): given a cropped pedestrian image (the probe), find its location in a panoramic video (a full-frame track in which the pedestrian occupies only a small part of the view). The panoramic video is a sequence of consecutive frames taken by a single camera.

  2. Method: generative algorithms vs. discriminative algorithms
    Generative algorithms describe the target's appearance with a generative model and search for the target by minimizing reconstruction error, e.g., principal component analysis (PCA).
    Discriminative algorithms learn to distinguish the object from the background; they are more robust and have gradually become the dominant approach to tracking (discriminative tracking is also called tracking-by-detection, and deep learning methods belong to this category; a minimal association sketch is given after this list).
    Traditional methods:
    (1) Generative: optical flow, mean shift; these focus only on the target and ignore the background
    (2) Correlation filtering: CSK and its successors, which predict the target location and are fast
    Deep learning methods:
    C-COT, ECO, MDNet, SiamFC
    Common deep network models: stacked autoencoders (SAE) and convolutional neural networks (CNN).

  3. Datasets
    OTB50, OTB100, VOT2016
    CityFlow: the first cross-camera vehicle tracking dataset (also usable for vehicle ReID)
    https://www.jiqizhixin.com/articles/2019-03-26-13 (includes an analysis of existing tracking algorithms such as Deep SORT)

  4. Application: intelligent surveillance, urban security
    Overview: https://www.cnblogs.com/liuyihai/p/8338369.html

  5. Evaluation
    Real-time performance and accuracy (to be added: how the VOT challenge defines these metrics in code)

  6. Research teams: Megvii (Kuangshi); Wang Mengmeng, Zhejiang University

  7. Code:
    foolwood (author of SiamMask), a summary of trackers: https://github.com/foolwood/benchmark_results
    YOLO v3 + tracking: https://blog.csdn.net/weixin_42035807/article/details/89496378
    KCF paper: http://www.robots.ox.ac.uk/~joao/publications/henriques_tpami2015.pdf
    KCF implementation: https://github.com/HenryZhangJianhe/KCF

  8. Code: KCF (Kernelized Correlation Filter), correlation filtering
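A minimal sketch of the tracking-by-detection idea from item 2 above: associate each frame's detections with existing tracks by greedy IoU matching. This is a simplified association step (not Deep SORT or KCF themselves); the box format, the 0.3 threshold, and the toy data are assumptions.

```python
def box_iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(tracks, detections, iou_threshold=0.3):
    """Greedily assign each detection to the unmatched track with the highest IoU."""
    matches, used = [], set()
    for d_idx, det in enumerate(detections):
        best_t, best_iou = None, iou_threshold
        for t_idx, trk in enumerate(tracks):
            if t_idx in used:
                continue
            overlap = box_iou(trk, det)
            if overlap > best_iou:
                best_t, best_iou = t_idx, overlap
        if best_t is not None:
            matches.append((best_t, d_idx))
            used.add(best_t)
    return matches  # detections left unmatched would start new tracks

tracks = [[10, 10, 50, 50], [200, 200, 240, 240]]      # last known boxes of existing tracks
detections = [[12, 11, 52, 51], [300, 300, 340, 340]]  # detector output for the current frame
print(associate(tracks, detections))  # -> [(0, 0)]
```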

5. Target re-identification (ReID: person, car)

  1. Task: a sub-task of image retrieval. Given a probe image, search a gallery for images of the same identity captured by other cameras. Unlike tracking it does not operate on full frames: the dataset consists of already-detected image crops containing the target.
  2. Method:
    Representation learning: cross-entropy classification loss, contrastive loss, attribute loss
    Metric learning: triplet loss (a minimal sketch is given after this list)
    Local alignment / part matching: PCB
    GAN-based generation
  3. Datasets
    Vehicle: CityFlow (2019), BUPT's VeRi-776, Peking University's VehicleID and PKU-VD
    Pedestrian: Market-1501, DukeMTMC-reID
  4. Application
    Tracking, clustering
  5. Evaluation: mAP
    On the difference between re-identification and target tracking:
    https://www.zhihu.com/question/283460186/answer/869165399
    Luo Hao's answer on the difference between pedestrian tracking and re-identification:
    https://www.zhihu.com/question/68584669
  6. Research team: Luo Hao, Zheng Liang
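A minimal sketch of the triplet loss used in the metric-learning approach above: embeddings of the same identity are pulled together while embeddings of different identities are pushed apart by a margin. The 0.3 margin, embedding size, and random embeddings are illustrative assumptions; in practice the embeddings come from a ReID backbone.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge on (distance to same-ID sample) - (distance to different-ID sample) + margin."""
    d_ap = F.pairwise_distance(anchor, positive)   # anchor vs. same identity
    d_an = F.pairwise_distance(anchor, negative)   # anchor vs. different identity
    return F.relu(d_ap - d_an + margin).mean()

# Toy embeddings: batch of 4 triplets, 128-dimensional features.
emb_a, emb_p, emb_n = torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128)
loss = triplet_loss(emb_a, emb_p, emb_n)

# PyTorch also provides an equivalent built-in loss:
loss_builtin = torch.nn.TripletMarginLoss(margin=0.3)(emb_a, emb_p, emb_n)
```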

6. Image captioning (RNN + attention)

  1. Task: image → text. Train on images paired with text descriptions; input: image → output: descriptive text.

  2. Method: encoder-decoder, RNN/LSTM, attention mechanism (a minimal encoder-decoder sketch is given after this list)

  3. Dataset:

  4. Application: assisting the visually impaired

  5. Evaluation: text metrics borrowed from machine translation

  6. Difficulty: lack of training data
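A minimal sketch of the encoder-decoder idea from item 2, without attention to keep it short: CNN image features are fed to an LSTM that predicts word tokens. The backbone, vocabulary size, hidden sizes, and dummy inputs are illustrative assumptions, not a prescribed architecture from the post.

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    """Encoder-decoder sketch: a CNN encodes the image, an LSTM decodes a word sequence."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        backbone = models.resnet18(pretrained=True)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # image -> feature vector
        self.img_proj = nn.Linear(backbone.fc.in_features, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.word_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.img_proj(self.encoder(images).flatten(1)).unsqueeze(1)  # (B, 1, E)
        words = self.embed(captions)                                         # (B, T, E)
        # The image feature acts as the first "token", followed by the caption words.
        out, _ = self.lstm(torch.cat([feats, words], dim=1))
        return self.word_head(out)  # logits over the vocabulary at each step

model = CaptionModel(vocab_size=10000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 10000])
```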

7. Image generation: GAN

  1. Task
  2. Method
  3. Dataset
  4. Application
  5. Evaluation standard
  6. Difficulty

8. Fine-tuning and transfer learning

9. Cross-domain adaptation

1. For segmentation
2. For re-identification
3. For tracking

10. Unsupervised learning

11. Commonly used datasets

A collection of common datasets: https://www.cnblogs.com/liuyihai/p/8338020.html

12. Computer vision tasks based on geometric attributes

The eight tasks above are all semantic-perception CV tasks. Tasks based on geometric attributes include 3D modeling, augmented reality, and binocular (stereo) vision.

13. Application summary

  • Face recognition: Snapchat and Facebook use face detection algorithms to recognize faces.
  • Image retrieval: Google Images uses content-based queries to search for related images, and the algorithm analyzes the content in the query images and returns the results based on the best matching content.
  • Games and control: a notably successful gaming product that uses stereo vision is the Microsoft Kinect.
  • Surveillance: Surveillance cameras used to monitor suspicious behavior are scattered in major public places.
  • Biometric technology: fingerprint, iris and face matching are still some common methods in the field of biometrics.
  • Smart cars: Computer vision is still the main source of information for detecting traffic signs, lights and other visual features.
  • Yunatop frictionless-payment retail: http://www.yunatop.com/

14. References

Detailed explanation of the five major technologies of computer vision: image classification, object detection, target tracking, semantic segmentation and instance segmentation

Computer Vision Overview
Fei-Fei Li's course, as explained by Zihao

15. CVPR 2019 papers, organized by topic

http://bbs.cvmart.net/topics/302/cvpr2019paper

Source: https://blog.csdn.net/weixin_44523062/article/details/104468840