National University of Science and Technology. Image Processing and Computer Vision: Final Review Questions and Summary of Knowledge Points (2)

1. Please briefly describe the computation process of the Bag of Visual Words model, and design an image classification system based on it;

(1) Feature extraction and description: use the SIFT operator to detect interest points and compute their descriptors, generating a set of keypoints and descriptors for each image in the training set.

(2) Construction of the visual dictionary: cluster all extracted SIFT descriptors (e.g., with K-means into K clusters); each cluster center is a visual word, and together the K words form the visual dictionary.

(3) Image representation: extract SIFT features from an image, quantize each descriptor to its nearest visual word, count the occurrences of each word, and represent the image as a K-dimensional histogram vector.

For image classification tasks: first extract bag-of-visual-words features, then train an appropriate classifier for recognition, e.g., KNN or an SVM.
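
A minimal sketch of the whole pipeline, assuming OpenCV's SIFT and scikit-learn are available; `train_paths` and `train_labels` are hypothetical inputs:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

K = 200  # dictionary size (assumption)

def sift_descriptors(image_paths):
    sift = cv2.SIFT_create()
    per_image = []
    for p in image_paths:
        gray = cv2.imread(p, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(gray, None)
        per_image.append(desc if desc is not None else np.zeros((0, 128), np.float32))
    return per_image

def bovw_histogram(desc, kmeans):
    # Quantize each descriptor to its nearest visual word, then count occurrences.
    words = kmeans.predict(desc) if len(desc) else np.array([], int)
    hist, _ = np.histogram(words, bins=K, range=(0, K))
    return hist / max(hist.sum(), 1)  # L1-normalize

# Steps (1)-(2): build the dictionary from all training descriptors
train_desc = sift_descriptors(train_paths)            # train_paths: assumed list of file paths
kmeans = KMeans(n_clusters=K).fit(np.vstack(train_desc))
# Step (3) + classification: K-dim histogram per image, then an SVM
X = np.array([bovw_histogram(d, kmeans) for d in train_desc])
clf = LinearSVC().fit(X, train_labels)                # train_labels: assumed label array
```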

2. Please briefly describe the shortcomings of the frame difference method for detecting moving objects, and discuss possible improvement methods; the main idea and basic method of background modeling.

Disadvantages of the frame difference method: it is strongly affected by noise. For dynamic scenes, where the scene and the camera are in complex relative motion, the plain frame difference method is no longer applicable; estimating and compensating the global motion becomes the key problem.

Improvements to the frame difference method: background modeling is one such improvement. To detect a moving object in a moving scene, the key is to estimate the scene motion, compensate for it using the estimated motion parameters, and then obtain the moving object by frame differencing.
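
A minimal sketch of the basic frame-difference step with OpenCV (the video name and threshold value are assumptions):

```python
import cv2

cap = cv2.VideoCapture("scene.mp4")   # assumed input video
ok, prev = cap.read()
prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, prev)                               # per-pixel frame difference
    _, motion = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)  # the noise-sensitive step
    prev = gray
```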

The main idea of background modeling is to exploit the temporal and spatial redundancy in an image sequence to separate moving targets from the background: first build a background model, then compare the current frame against it to distinguish foreground from background, i.e., background subtraction.

Basic methods of background modeling: the goal is a background model that can adapt to environmental changes. Statistical background models include the single Gaussian model, the Gaussian mixture model (GMM), non-parametric models, etc. The single Gaussian model assumes that the temporal distribution of each pixel's feature can be described by one Gaussian. The GMM can describe more complex backgrounds: each component carries a weight, the components are sorted in descending order of the weight-to-standard-deviation ratio, the first B components are taken as the background distribution, and the remaining ones as foreground. The classic GMM models each pixel independently and ignores image structure; it can be improved by introducing MRFs, non-parametric density estimation, or adaptive selection of the number of Gaussians.
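
OpenCV ships a Gaussian-mixture background subtractor; a minimal sketch (parameter values and the video name are assumptions):

```python
import cv2

cap = cv2.VideoCapture("scene.mp4")   # assumed input video
mog = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=True)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = mog.apply(frame)        # updates the per-pixel GMM and returns the foreground mask
```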

3. Some basic concepts of convolutional neural network, such as receptive field, dropout, activation function, pooling, etc.;

Composition: input layer, hidden layers {convolutional layers, pooling layers, fully connected layers}, output layer

Features: local connections, shared weights

Receptive field: for a single convolutional layer, the kernel size. More generally: the region of the input image that a pixel on a layer's output feature map corresponds to; the value at that point depends only on values inside its receptive field.

Dropout: during training (both forward and backward passes), neurons are randomly deactivated with a certain probability, which effectively prevents overfitting.

Activation function: introduces nonlinearity; a nonlinear function is applied to each node's output, and the resulting activation is passed to the next layer of the network.

Pooling: a down-sampling method that aggregates the features at different positions within a local block (e.g., by taking the maximum or average), reducing the size of the feature map.
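
A minimal PyTorch sketch that ties these concepts together (all layer sizes are arbitrary assumptions; the final linear layer assumes 32×32 RGB inputs and 10 classes):

```python
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # local connections, shared weights
    nn.ReLU(),                                   # activation: introduces nonlinearity
    nn.MaxPool2d(2),                             # pooling: down-samples the feature map
    nn.Conv2d(16, 32, kernel_size=3, padding=1), # stacking layers enlarges the receptive field
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Dropout(p=0.5),                           # dropout: random deactivation during training
    nn.Linear(32 * 8 * 8, 10),                   # fully connected output (32x32 input -> 8x8 maps)
)
```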

4. Please explain what Over-Fitting is, and discuss solutions to avoid it;

Overfitting: the phenomenon in which the model has a small error on the training set but a large error on the test set. It usually occurs when the model is too complex, e.g., has too many parameters.

Solutions: regularization (L1, L2), increasing the number of training samples, early stopping, Dropout
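
A sketch of two of these remedies in PyTorch, reusing the `net` from the sketch above: L2 regularization via weight decay, plus early stopping on a validation loss (`train_one_epoch` and `val_loss` are hypothetical helpers):

```python
import torch

# L2 regularization: weight_decay adds a lambda * ||w||^2 penalty to the loss
opt = torch.optim.SGD(net.parameters(), lr=0.01, weight_decay=1e-4)

best, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    train_one_epoch(net, opt)          # hypothetical training step
    v = val_loss(net)                  # hypothetical validation loss
    if v < best:
        best, bad_epochs = v, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:     # early stopping: halt when validation stops improving
            break
```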

5. Please briefly describe the role of motion information in the MPEG-1 video coding standard, and understand the roles of I frames, B frames and P frames;

Motion information is used by the motion-compensation algorithm to remove temporally redundant data between frames, thereby achieving compression.

  • I (intra-coded frame): an I frame contains a complete image and serves as the reference for encoding and decoding other frames; this is the commonly mentioned key frame.
  • P (forward-predicted frame): a P frame uses the preceding I or P frame as its reference image; encoding a P frame actually encodes the difference between the two.
  • B (bidirectionally predicted frame): a B frame uses both the preceding and following images (I or P frames) as references; encoding a B frame encodes its differences from each of them.

6. The basic concepts and main methods of target tracking; please explain the relationship and difference between target tracking and target detection;

The concept of target tracking: by processing and analyzing video data, the same moving target is associated across different frames of the image sequence in order to compute the target's motion parameters.

The concept of single-target tracking: given the target to be tracked in the first frame, determine its position in every frame of the subsequent video sequence.

The relationship between target detection and tracking:

  • Detect first, then track: usually used for multi-target tracking; detect the moving targets in each frame, then match targets across consecutive frames to associate trajectories.
  • Track while detecting: combine detection and tracking; the tracking result narrows the region that detection must process, and detection supplies observations of the target state for tracking. A feature model describing the target is built first, and after initialization in the first frame, matching search is carried out continuously in subsequent frames.

The main methods of object tracking:

Divided into two categories:

  • Generative model: Select the image patch most similar to the target appearance model from the candidate samples as the tracking result
  • Discriminative model: Model the tracking problem as a binary classification problem, that is, to judge whether each candidate sample is a background sample or a target sample

There are the following methods:

  • Feature-based matching: extract a feature of the target and search for that feature in each frame; the search is a feature-matching process.
  • Bayesian filter tracking: handles the uncertainty in multi-target tracking. Within the Bayesian framework, the multi-target tracking problem is converted into inferring the maximum a posteriori estimate of the target state. The basic principle of Bayesian filtering is to infer the posterior probability density of the system state variables from all available information.
  • Kalman filter tracking: in essence, the Kalman filter is a recursive algorithm for estimating the state of a noisy linear dynamic system, a process of repeated prediction and correction. When the state model and observation model are both linear with Gaussian noise, the linear Kalman filter is the optimal filter. The algorithm is a recursive optimal estimation: it uses a state-space description and takes the linear minimum mean-square error as the estimation criterion to optimally estimate the state variables (a minimal sketch follows this list).
  • Mean Shift: discovers the modes of a hidden probability density function in a set of data. Given an initial point x and a kernel function g(x), repeat the following steps until the stopping condition is met:
    • Compute the mean-shift vector m(x)
    • Assign m(x) to x
    • If ||m(x) − x|| < ε, end the loop
  • Mean Shift applied to object tracking (see the sketch after this list):
    • Initialize the search window, using a color histogram as the description of the target model.
    • Compute the color probability distribution (back-projection) inside the search window.
    • Run the mean-shift algorithm to obtain the new size and position of the search window.
    • In the next video frame, re-initialize the window with that size and position, perform similarity matching, and jump back to the second step, iterating until ||m(x) − x|| < ε.
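
Both filters have ready-made OpenCV building blocks. First, a minimal constant-velocity Kalman tracker as a sketch of the predict-correct loop (state is [x, y, vx, vy]; the noise levels and the sample measurement are assumptions):

```python
import cv2
import numpy as np

# Constant-velocity model: 4 state dims (x, y, vx, vy), 2 measured dims (x, y)
kf = cv2.KalmanFilter(4, 2)
kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                [0, 1, 0, 1],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], np.float32)
kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                 [0, 1, 0, 0]], np.float32)
kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-3    # assumed noise level
kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1

prediction = kf.predict()                                  # predict step
kf.correct(np.array([[120.0], [80.0]], np.float32))        # correct with an observed (x, y)
```

And a mean-shift tracking loop following the four steps above, with a hue histogram as the target model (the video name and initial window are assumptions):

```python
cap = cv2.VideoCapture("scene.mp4")
ok, frame = cap.read()
x, y, w, h = 200, 150, 60, 80                              # step 1: initialize the search window
roi_hsv = cv2.cvtColor(frame[y:y+h, x:x+w], cv2.COLOR_BGR2HSV)
hist = cv2.calcHist([roi_hsv], [0], None, [180], [0, 180]) # hue histogram = target model
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
# stopping rule: ||m(x) - x|| < eps (1.0) or 10 iterations
crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)   # step 2: color probability
    _, (x, y, w, h) = cv2.meanShift(backproj, (x, y, w, h), crit)   # steps 3-4: shift the window
```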

7. Basic concepts and knowledge of object detection (taking R-CNN as an example): the basic R-CNN pipeline, how it is trained, region proposals, IoU, NMS; classification & regression.

R-CNN: region-proposal-based object detection

YOLO: regression-based object detection

R-CNN basic process:

  • Region proposal: extract several candidate boxes from the original image by Selective Search (which uses image segmentation and hierarchical grouping)
  • Region normalization: warp all candidate boxes to a fixed size
  • Feature extraction: a CNN generates a fixed-length feature vector for each candidate region
  • Region classification: SVMs combined with NMS (non-maximum suppression: keep the highest-scoring region and suppress the others whose IoU with it exceeds a threshold) yield the region boxes, and finally a linear regressor refines their positions (a sketch of IoU and NMS follows this list).
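
A minimal sketch of IoU and greedy NMS as described above (boxes are [x1, y1, x2, y2]; the 0.5 threshold is an assumption):

```python
import numpy as np

def iou(a, b):
    # Intersection-over-Union of two boxes [x1, y1, x2, y2]
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    # Keep the highest-scoring box, drop boxes overlapping it above thresh, repeat.
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= thresh]
    return keep
```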

R-CNN training procedure:

  • Pre-training (transfer learning): pre-train the CNN on the ImageNet dataset
  • Fine-tuning: fine-tune on the PASCAL VOC dataset
  • Train the SVM classifiers: a proposed region whose IoU with a ground-truth box exceeds the threshold is a positive sample, otherwise a negative sample; positives include the labeled ground-truth boxes and proposals with IoU above the threshold. Since negatives far outnumber positives, representative negatives are selected from them (hard negative mining).

8. Please give an outlook on the development of computer vision toward 2030. From a reasonable perspective, give an example of a computer vision application that will be realized by 2030 but that today's technology has not yet reached or has not matured, and try to explain the technical method behind the example.

Today, computers can outperform humans on specific tasks by learning from billions of images. In the real world, however, it is rare to construct or find datasets with such a large number of samples; high-quality labeled data is hard to obtain in most fields, which limits the applicability of many computer vision algorithms in the corresponding scenarios.

In this context, Few-Shot Learning (FSL) was proposed to make machine learning work when the dataset size is severely limited. Few-shot methods combine prior knowledge with only a very small number of supervised samples, so that a model can quickly improve its generalization performance with very few update steps and be applied to new, related tasks. In recent years, few-shot learning has been applied in computer vision, natural language processing, human-computer interaction, knowledge graphs, and even biological computing.

Few-shot object detection is developing rapidly, but there are still few effective solutions; one of the more stable approaches combines YOLO with a model-agnostic meta-learning (MAML) algorithm.

Other difficulties remain: very fine-grained classification; detection and segmentation of very small or blurry targets; and keeping segmentation results stable under complex lighting changes. The transition from images to video also faces a smoothness problem, since subtle defects in a single image are easily magnified in video. The development of computer vision is facing a bottleneck, and deep learning alone plays a limited role in overcoming it, so new breakthroughs are needed. Progress in deep learning has greatly improved recognition accuracy, but its heavy dependence on large amounts of labeled data forces computer vision researchers to spend much time on simple yet tedious labeling work at the expense of more important problems.

9. Basic concepts and differences between classification and clustering;

Classification: supervised learning; for given samples, a classification decision function is learned from the data, and the output variable takes a finite number of discrete values representing the categories.

Clustering: unsupervised learning; for given samples, similar samples are assigned to the same cluster and dissimilar samples to different clusters according to the data distribution. The purpose of clustering is to discover the distribution structure of the data; the number of clusters is often specified in advance, but the meaning of each cluster is not known beforehand.
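
The difference in one scikit-learn sketch: the classifier sees labels, the clusterer does not (`X_train`, `y_train`, `X_test` are assumed arrays; the cluster count is an assumption):

```python
from sklearn.svm import SVC
from sklearn.cluster import KMeans

# Classification: supervised -- fit uses (X, y) with known category labels
clf = SVC().fit(X_train, y_train)
pred_labels = clf.predict(X_test)

# Clustering: unsupervised -- fit sees only X; the number of clusters is chosen in advance
km = KMeans(n_clusters=5).fit(X_train)
cluster_ids = km.labels_          # cluster indices carry no predefined meaning
```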

10. Please introduce the role and function of the classifier and feature extraction module in the classification system;

Feature extraction: converts an image into a vector representation, e.g., CNN features or a bag-of-visual-words histogram; a feature is a representation of the image.

Classifier: maps feature vectors to category labels; examples include Naive Bayes, AdaBoost, SVM, KNN, Softmax, etc.

11. The calculation process of the color histogram feature.

The color histogram is a color feature widely used in image retrieval systems. It describes the proportions of different colors in the whole image and reflects the statistical characteristics of the image's color distribution; it ignores the spatial position of each color and therefore cannot describe the objects in the image. Color histograms are especially suitable for describing images that are difficult to segment automatically.

To compute a color histogram, the color space is divided into several small intervals, each of which is called a bin of the histogram; this division is called color quantization. The histogram is then obtained by counting the number of pixels whose color falls into each bin.
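
A minimal sketch of the quantize-and-count procedure in NumPy, with OpenCV's calcHist as the equivalent one-liner (8 bins per channel and the file name are assumptions):

```python
import cv2
import numpy as np

img = cv2.imread("image.jpg")                    # assumed input, BGR

# Manual version: quantize each channel into 8 bins, count pixels per (b, g, r) bin
bins = (img // 32).reshape(-1, 3).astype(np.int64)   # 256 / 32 = 8 bins per channel
idx = bins[:, 0] * 64 + bins[:, 1] * 8 + bins[:, 2]  # flatten 8x8x8 bin index
hist = np.bincount(idx, minlength=8 * 8 * 8).astype(np.float32)
hist /= hist.sum()                               # normalize to a color distribution

# Equivalent with OpenCV
hist_cv = cv2.calcHist([img], [0, 1, 2], None, [8, 8, 8], [0, 256, 0, 256, 0, 256])
```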

12. Taking the convolutional neural network model and the color histogram as examples, please explain the difference between automatically learned feature representations and manually designed ones;

Manually designed features: SIFT and HOG are both feature extraction methods based on histograms of image gradient orientations. On relatively small datasets, traditional machine learning algorithms have advantages in speed and accuracy, because they rest on strict reasoning and their computation process is controllable.

Automatically learned features: CNNs; for large-scale data, deep neural networks achieve higher accuracy and apply to a wider range of fields.

13. Given two images, please give a calculation method for image similarity, and discuss its rationality and shortcomings.

  • Histograms describe the global distribution of colors in an image: build a histogram for each image, treat it as a vector, and compare with cosine similarity (see the sketch after this list). A histogram is too simple, though: it captures only color information, so any two images with similar color distributions are judged highly similar, which is clearly unreasonable.
  • Extract features, represent each image as a vector, and measure similarity by the cosine distance between the vectors: the closer the cosine is to 1, the closer the angle is to 0 degrees, i.e., the more similar the vectors. Cosine similarity is insensitive to the absolute magnitude of the values, so it cannot measure differences in magnitude.
  • Siamese network: two networks with shared weights take the two inputs separately; the distance or similarity between the two output vectors determines the similarity of the original inputs.
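
A sketch of the cosine-similarity comparison from the first two bullets (h1 and h2 are assumed to be normalized feature or histogram vectors, e.g., from the histogram code in question 11):

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = <u, v> / (||u|| * ||v||); closer to 1 means more similar
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

score = cosine_similarity(h1, h2)   # h1, h2: feature vectors of the two images
```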

14. For a given image (such as the picture below), introduce the various types of conceptual information it may contain. How far can image understanding technology currently go?

Vehicle detection, crowd density estimation, vehicle density estimation, scene classification, semantic segmentation, low-light enhancement, target detection, target tracking...

(example image omitted)
