National University of Science and Technology--Multimedia Analysis and Understanding--2018 Exam Questions

I took this exam in 2022. The 2018 exam questions below were collected from the Internet, and I compiled the answers myself.

1. What is multimedia? What are the application areas and challenges of multimedia analysis and understanding?

Reference answer :
(1) . Multimedia is content that combines different content forms, such as text, audio, images, animation, video, and interactive content. Alternatively, multimedia is the general term for the various information carriers processed by computers, including text, audio, graphics, video, and interactive content.

(2) . Multimedia analysis and understanding are widely used in security, education, communication, entertainment and other industries. Specifically, multimedia can be applied in fields such as image retrieval, content recommendation, visual surveillance, personalized video customization, social media, and video websites.

(3) . The challenges are as follows:

  • How to represent data in different media and different modalities; data is often massive, high-dimensional, unstructured, and has its own complexity.
  • How to understand multimedia data and address issues such as semantic gaps.
  • How to mine the interrelationships between multimedia data, that is, synergy and complementarity.
  • How to meet the diverse information needs of users and handle user preferences and personalization well.

2. For the feature representation methods of text, audio and image data, please list two typical features and analyze their advantages and disadvantages.

Reference answer :
(1) . Text

  • Term frequency (TF) representation
    Advantages: How often a word occurs in a document indicates what the document focuses on, and the counts are easy to compute and analyze.
    Disadvantages: Function words such as prepositions and copulas, which occur many times in the text, also receive high weights.
  • Latent Semantic Analysis (LSA)
    Advantages: Through dimensionality reduction, it effectively addresses the problems of synonymy and polysemy.
    Disadvantages: Word order within the document is still ignored.
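
To make the TF disadvantage concrete, here is a minimal sketch of raw term-frequency counting. The toy sentence and the helper name `term_frequency` are made up for illustration, and tokenization is a plain whitespace split, which is an assumption rather than part of the reference answer.

```python
from collections import Counter

def term_frequency(text):
    """Return the relative frequency of each lowercase whitespace token."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}

# Toy document: the function word "the" dominates the counts.
doc = "the cat sat on the mat and the dog sat on the rug"
tf = term_frequency(doc)
top_word = max(tf, key=tf.get)
print(top_word, round(tf[top_word], 3))
```

The highest-weighted term is a function word rather than a content word, which is why TF is usually combined with stop-word removal or an inverse-document-frequency weight.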

(2) . Audio

  • Zero-crossing rate
    Advantages: It can reflect the average frequency of the signal in a short time frame.
    Disadvantages: It only considers the amplitude information within the short time window and lacks frequency-domain information.
  • Mel-Frequency Cepstral Coefficients
    Advantages: Decorrelates and compresses features.
    Disadvantages: The information of all frequency bands is processed equally, and important information cannot be highlighted.
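
As a rough illustration of the zero-crossing rate reflecting the average frequency in a short frame, the sketch below computes it for two synthetic sine tones. The sample rate, frame length, and helper names are assumptions chosen for the example.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def sine_frame(freq_hz, sample_rate=8000, n_samples=400, phase=0.1):
    """One short frame of a pure tone; the small phase avoids exact zeros."""
    return [
        math.sin(2 * math.pi * freq_hz * n / sample_rate + phase)
        for n in range(n_samples)
    ]

zcr_low = zero_crossing_rate(sine_frame(100))    # 100 Hz tone
zcr_high = zero_crossing_rate(sine_frame(1000))  # 1000 Hz tone
print(zcr_low, zcr_high)
```

The higher-frequency tone crosses zero far more often in the same frame, so its rate is correspondingly larger.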

(3) . Image

  • LBP
    Advantages: To a certain extent it removes the effect of illumination changes, it has rotation invariance, and it is fast to compute.
    Disadvantages: The LBP code changes when the illumination is uneven, and LBP also loses direction information.
  • SIFT
    Advantages: It has good scale invariance and robustness.
    Disadvantages: Real-time performance is poor, sometimes only a few feature points are found, and feature points cannot be extracted accurately for objects with smooth edges.
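
The illumination robustness of LBP can be sketched with the basic 3x3 operator. This is the plain (not rotation-invariant) variant; the neighbour ordering, the toy patch, and the name `lbp_code` are assumptions for illustration.

```python
def lbp_code(patch):
    """8-bit LBP code of the centre pixel of a 3x3 patch (list of 3 rows)."""
    center = patch[1][1]
    # Neighbours visited clockwise, starting at the top-left corner.
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(order):
        if patch[r][c] >= center:
            code |= 1 << bit
    return code

patch = [[52, 60, 70],
         [40, 55, 90],
         [30, 20, 58]]
# A uniform brightness shift keeps every neighbour/centre comparison the same.
brighter = [[v + 25 for v in row] for row in patch]
print(lbp_code(patch), lbp_code(brighter))
```

Only the sign of each neighbour-minus-centre difference is kept, so a uniform brightness change leaves the code unchanged, while uneven illumination can flip individual comparisons.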

3. A typical layer in a convolutional neural network usually includes three basic operations. Please answer the basic meaning or type of each operation, its basic characteristics or advantages and disadvantages.

Reference answer :
The three basic operations of a typical layer are: convolution -> nonlinear transformation -> pooling

(1) . Convolution operation:

  • Meaning: The convolution operation is also called filtering, and the convolution kernel is also called a filter. A two-dimensional convolution is applied to the input image, and the convolution output is called a feature map.
  • Features: Usually multiple different convolution kernels are used in the same convolutional layer to learn different features of the image. When the input contains multiple channels, the convolution kernel can be regarded as 3D.

(2) . Nonlinear transformation:

  • Meaning: First, a mapping $\phi(x)$ converts points in $x$-space into points in $z$-space, where a linear hypothesis is obtained; mapping it back to the original $x$-space yields a quadratic hypothesis.
  • Features (for a sigmoid-type activation): The advantage is that it has good mathematical properties. The disadvantages are that it saturates easily and that its output is not zero-mean, both of which affect the gradient.

(3) . Pooling operation:

  • Meaning: The pooling function uses the overall statistical characteristics of the adjacent positions of a certain position to replace the output of the network at that position. Commonly used pooling functions include maximum pooling and average pooling.
  • Features: When the input is translated by a small amount, pooling helps keep the representation approximately unchanged, that is, translation invariant. It reduces the number of parameters and improves statistical efficiency. Using global pooling before the fully connected layer keeps the number of nodes in the fully connected layer constant, independent of the size of the input image.
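
The three operations can be chained in a minimal pure-Python sketch. ReLU is used as the nonlinearity and 2x2 max pooling as the pooling function; both choices, the toy image, and the edge kernel are assumptions for illustration.

```python
def conv2d_valid(image, kernel):
    """'Valid' 2D cross-correlation, the operation CNN layers call convolution."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def relu(fmap):
    """Element-wise max(0, x)."""
    return [[max(0, v) for v in row] for row in fmap]

def max_pool2x2(fmap):
    """Non-overlapping 2x2 max pooling (odd trailing rows/columns are dropped)."""
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, len(fmap[0]) - 1, 2)]
        for r in range(0, len(fmap) - 1, 2)
    ]

# Toy 5x5 image with a bright vertical stripe in columns 2 and 3.
image = [[0, 0, 1, 1, 0] for _ in range(5)]
kernel = [[-1, 1], [-1, 1]]          # responds to dark-to-bright vertical edges

fmap = conv2d_valid(image, kernel)   # 4x4 feature map
pooled = max_pool2x2(relu(fmap))     # 2x2 after ReLU and pooling
print(fmap[0], pooled)
```

The kernel responds positively at the left edge of the stripe and negatively at the right edge; ReLU discards the negative response, and pooling keeps the strongest activation in each 2x2 window.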

4. Please explain the basic research content of image semantic understanding and the meaning of each content. Please select a typical algorithm or model for any one of them and describe its specific implementation process in detail.

Reference answer :
(1) . Image semantic understanding aims to study what kind of objects, what kind of instances, and the relationship between objects exist in the image. It is expected that the machine can automatically "understand" the external environment like a human being. Essentially, it learns the mapping relationship between low-level features and high-level semantics.

(2) . The basic tasks of image semantic understanding include:

  • Image Classification: Predict a class for each image.
  • Image annotation: Predict multiple semantic labels for each image.
  • Object Detection: Predict a class and a tight bounding box for each object in an image.
  • Semantic Segmentation: Predict a semantic label for each pixel.
  • Image Description: Describe images in natural language.

(3) . A classic algorithm for object detection is YOLO. Its steps are as follows:
  a. Divide the input image into a grid, and place anchors of different sizes and aspect ratios in each grid cell.
  b. Send the image through the feature network for feature extraction.
  c. Decode the feature map, which includes predicting the anchor corrections, confidence, and class probabilities.
  d. Filter the predicted bounding boxes by confidence and apply non-maximum suppression (NMS).
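
Step d can be sketched as confidence filtering followed by greedy NMS. This is a generic version, not YOLO's exact post-processing; the thresholds, the `(x1, y1, x2, y2)` box format, and the function names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def filter_and_nms(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Return indices of kept boxes, highest score first."""
    # 1) Drop low-confidence boxes, 2) sort the rest by descending score.
    candidates = sorted(
        (i for i, s in enumerate(scores) if s >= score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept = []
    for i in candidates:
        # Keep a box only if it does not overlap a better box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in kept):
            kept.append(i)
    return kept

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.2]
kept = filter_and_nms(boxes, scores)
print(kept)
```

Box 1 heavily overlaps the higher-scoring box 0 and is suppressed, while box 3 is removed earlier by the confidence threshold.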


5. Describe in detail the basic principles of the recommendation methods based on SVD and RBM, and compare their advantages and disadvantages.

Reference answer :
(1) . SVD
The ratings of all users for all products can be expressed as a sparse matrix $R$. The SVD-based recommendation method decomposes the matrix $R$ and requires the matrix elements to be non-negative, as follows:
$$R_{U\times I}=P_{U\times K}Q_{K\times I}$$
The known entries of $R$ are then used to train $P$ and $Q$, such that the product of $P$ and $Q$ best fits the known ratings. Specifically, the predicted rating of user $u$ for product $i$ is
$$\hat{r}_{ui}=p_{u}^{T}q_i$$
Let $e_{ui}=r_{ui}-\hat{r}_{ui}$; the total squared error is
$$\mathrm{SSE}=\sum{e_{ui}^{2}}$$
$\mathrm{SSE}$ is then used as the loss to train the model.
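
A minimal sketch of training $P$ and $Q$ on the known ratings by stochastic gradient descent on the squared error follows. The toy ratings, learning rate, and epoch count are assumptions, and neither regularization nor a non-negativity constraint on the factors is applied here.

```python
import random

def train_mf(ratings, n_users, n_items, k=2, lr=0.02, epochs=500, seed=0):
    """Fit P (users x k) and Q (items x k) to (user, item, rating) triples."""
    rng = random.Random(seed)
    P = [[rng.uniform(0, 0.5) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))   # r_hat_ui
            e = r - pred                                      # e_ui
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on e_ui^2 (the factor 2 is folded into lr).
                P[u][f] += lr * e * qi
                Q[i][f] += lr * e * pu
    return P, Q

def sse(ratings, P, Q):
    """Total squared error over the known ratings."""
    return sum(
        (r - sum(p * q for p, q in zip(P[u], Q[i]))) ** 2 for u, i, r in ratings
    )

# Toy data: 3 users, 3 items, 6 known ratings.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 1), (2, 2, 5)]
P0, Q0 = train_mf(ratings, n_users=3, n_items=3, epochs=0)   # initial factors
P, Q = train_mf(ratings, n_users=3, n_items=3)
sse_before, sse_after = sse(ratings, P0, Q0), sse(ratings, P, Q)
print(sse_before, sse_after)
```

Only the observed entries contribute to the loss, which is what allows the method to work on a sparse rating matrix; an unknown rating is then predicted as the inner product of the corresponding rows of `P` and `Q`.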

(2) . RBM
The RBM-based method regards a user's rating of a product as a softmax unit: a vector of length $K$ with exactly one component equal to 1 and the rest equal to 0. Unrated items can be represented by all-zero softmax units. In this way, the ratings of a user can be represented by a matrix $V$. Given the state of the visible units, the activation probability of hidden unit $j$ is:
$$P\left(h_j = 1 \mid V\right) = \frac{1}{1 + \exp\left(-b_j - \sum_{i=1}^{M}\sum_{k=1}^{K} V_i^k W_{ij}^k\right)}$$
Similarly, given the state of the hidden units, the activation probability of a visible unit is:
$$P\left(V_i^k = 1 \mid h\right) = \frac{\exp\left(a_i^k + \sum_{j=1}^{F} W_{ij}^k h_j\right)}{\sum_{l=1}^{K}\exp\left(a_i^l + \sum_{j=1}^{F} W_{ij}^l h_j\right)}$$
Here $M$ is the number of items, $K$ the number of rating levels, and $F$ the number of hidden units.
In the training phase, the items the user has rated are fed in, and the values of the visible layer and the hidden layer are computed in turn, which completes the encoding process. The visible-layer values are then recomputed from the hidden-layer values, which completes the decoding process. Finally, the weights of the RBM are updated according to the difference between the two.
In the prediction phase, all ratings of user $u$ are fed in as softmax units, the activation probabilities of the hidden units are computed, then the probabilities of the visible units are computed, and the expectation over these probabilities is taken as the predicted rating.
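
The prediction phase can be sketched directly from the two conditional probabilities. The hand-set weights, the array layout `W[i][j][k]`, and feeding the hidden probabilities (rather than sampled states) back into the visible layer are assumptions for illustration; no training is shown.

```python
import math

def hidden_prob(V, W, b, j):
    """P(h_j = 1 | V): sigmoid of b_j + sum_i sum_k V[i][k] * W[i][j][k]."""
    total = b[j] + sum(
        V[i][k] * W[i][j][k] for i in range(len(V)) for k in range(len(V[i]))
    )
    return 1.0 / (1.0 + math.exp(-total))

def visible_probs(h, W, a, i):
    """P(V_i^k = 1 | h) for every rating level k: a softmax over k."""
    logits = [
        a[i][k] + sum(W[i][j][k] * h[j] for j in range(len(h)))
        for k in range(len(a[i]))
    ]
    m = max(logits)                        # subtract the max for stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def expected_rating(probs):
    """Predicted rating: expectation over rating levels 1..K."""
    return sum((k + 1) * p for k, p in enumerate(probs))

# Toy setup: M = 2 items, K = 3 rating levels, F = 2 hidden units.
V = [[0, 0, 1], [0, 0, 0]]                 # item 0 rated 3, item 1 unrated
W = [[[0.1, 0.0, 0.8], [-0.2, 0.1, 0.3]],  # W[i][j][k], hand-set values
     [[-0.5, 0.2, 0.9], [0.0, 0.1, 0.4]]]
a = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]     # visible biases a[i][k]
b = [0.0, 0.0]                             # hidden biases b[j]

h = [hidden_prob(V, W, b, j) for j in range(2)]
probs = visible_probs(h, W, a, 1)          # rating distribution for item 1
prediction = expected_rating(probs)
print(probs, prediction)
```

The all-zero row for the unrated item contributes nothing to the hidden activations, which matches the all-zero softmax unit representation of missing ratings.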

(3) . Comparison: The computation of SVD is simpler, but because it has a single training objective it overfits easily. RBM can prevent exploding and vanishing gradients, but computing the expectations is more complicated and learning is slow.


6. Briefly describe the basic idea of Iterative Quantization (ITQ), and compare the advantages and disadvantages of the ITQ method and the Locality Sensitive Hashing (LSH) method.

Reference answer :
(1) . The basic idea of the iterative quantization (ITQ) hashing method is to first reduce the dimensionality of the data set with PCA, and then find the rotation matrix that minimizes the quantization error; the binary code of each feature vector is obtained under this optimal rotation matrix.
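
The two stages (PCA, then alternating between codes and rotation) can be sketched with NumPy. This assumes NumPy is available; the random toy data, the iteration count, and the function name `itq` are assumptions, and the rotation update is the standard orthogonal Procrustes solution via SVD.

```python
import numpy as np

def itq(X, n_bits, n_iter=50, seed=0):
    """Return (codes in {-1, +1}, PCA projection W, rotation R, error history)."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                        # zero-centre the data
    # PCA: keep the n_bits eigenvectors with the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    W = eigvecs[:, np.argsort(eigvals)[::-1][:n_bits]]
    V = Xc @ W                                     # projected data (n, n_bits)
    # Start from a random orthogonal matrix.
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))
    errors = []
    for _ in range(n_iter):
        B = np.where(V @ R >= 0, 1.0, -1.0)        # fix R, update the codes
        U, _, Vt = np.linalg.svd(V.T @ B)          # fix B, update the rotation
        R = U @ Vt
        errors.append(float(np.linalg.norm(B - V @ R) ** 2))
    B = np.where(V @ R >= 0, 1.0, -1.0)
    return B, W, R, errors

data_rng = np.random.default_rng(1)
X = data_rng.standard_normal((200, 8)) @ data_rng.standard_normal((8, 8))
B, W, R, errors = itq(X, n_bits=4)
print(errors[0], errors[-1])
```

Each half-step can only lower the quantization error or leave it unchanged, so the error history is non-increasing and the loop can be stopped after a fixed number of iterations.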

(2) . Advantages and disadvantages of the ITQ method and the Locality Sensitive Hashing (LSH) method

  • ITQ
    • Advantages: Compared with LSH, it has one extra step: after dimensionality reduction, the rotation matrix is optimized, which reduces the quantization error.
    • Disadvantages: Because the variances of the different PCA dimensions are unbalanced, minimizing the quantization error of the rotated PCA projections requires repeatedly adjusting the rotation, that is, searching for the optimal rotation matrix and the corresponding codes, which is relatively cumbersome.
  • LSH
    • Advantages: The hash functions map the original data set into multiple subsets; the data in each subset are close to each other and each subset contains few elements. The problem of finding neighbors in a very large set is thus turned into the problem of finding neighbors in a small set, which greatly reduces the amount of computation and improves the performance of approximate retrieval.
    • Disadvantages: LSH does not guarantee that the data point closest to the query will be found.
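
The bucketing idea behind LSH can be sketched with random hyperplanes (sign random projection). The text does not name a hash family, so this choice, the toy data, and the helper names are assumptions.

```python
import random
from collections import defaultdict

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def lsh_key(vec, planes):
    """Bucket key: one bit per hyperplane, set by the side the vector falls on."""
    return tuple(
        int(sum(p * x for p, x in zip(plane, vec)) >= 0) for plane in planes
    )

def build_index(vectors, planes):
    """Group vector indices by their bucket key."""
    buckets = defaultdict(list)
    for idx, v in enumerate(vectors):
        buckets[lsh_key(v, planes)].append(idx)
    return buckets

data_rng = random.Random(1)
vectors = [[data_rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
planes = make_hyperplanes(dim=8, n_bits=6)
buckets = build_index(vectors, planes)

# Query with a scaled copy of vector 3: same direction, so same bucket.
query = [2 * x for x in vectors[3]]
candidates = buckets.get(lsh_key(query, planes), [])
print(len(candidates), "candidates out of", len(vectors))
```

Only the candidates in the query's bucket need an exact distance computation. A true nearest neighbor that falls just across one hyperplane lands in a different bucket and is missed, which is the disadvantage listed above.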

7. What are the difficulties in moving object detection? Briefly describe the advantages and disadvantages of the commonly used methods.

Reference answer :
(1) . Difficulties include: lighting changes, dynamic background, camouflaged targets, camera shake, camera out of focus, intermittent object movement, shadow effects, etc.

(2) . The current common methods include the following:

  • frame difference method
    • Advantages: The algorithm is simple, easy to implement, and fast. Because the time interval between two adjacent frames is generally short, it is not particularly sensitive to lighting changes in the scene.
    • Disadvantages: It is very sensitive to noise and the position of the detected object is inaccurate. Secondly, the detection result of the frame difference method is related to the target motion speed and the interval between two adjacent frames. A target moving too fast will be divided into two targets, and a target moving too slowly will be regarded as the background.
  • background subtraction
    • Advantages: the algorithm is relatively simple; to a certain extent, it overcomes the influence of ambient light.
    • Disadvantages: It cannot be used with moving cameras, and it is difficult to update the background image in real time.
  • statistical averaging
    • Advantages: Selecting appropriate parameters can correct the background image well, so as to obtain a more realistic background estimation image.
    • Disadvantages: For moving objects that appear frequently or stay in the scene for a long time, the model cannot extract the moving objects well. In complex scenes, false targets such as swaying branches are detected as moving targets, because the swaying changes the pixel values.
  • mixed Gaussian model
    • Advantages: It can adapt to the slow change of the background over time, and can describe some periodic disturbances in the background such as flickering of the display screen and shaking of branches.
    • Disadvantages: It cannot accurately detect and extract slow-moving targets, it is prone to false or missed detections caused by shadows and noise, and it cannot adapt to sudden changes in the scene.
  • Nonparametric Kernel Density Probability Estimation
    • Advantages: It can gradually converge to the probability density of any shape, and it also has certain adaptability to dynamic scenes.
    • Disadvantages: The amount of calculation is very large, and it is difficult to realize real-time detection of video images. The memory requirements are relatively high.
  • Moving Object Detection Based on Codebook
    • Advantages: strong robustness, high computational efficiency, fast speed, less computation, and high accuracy.
    • Disadvantages: When the training frames contain a large moving foreground, the codebook model is built very inaccurately, and tuning the update parameters is complicated, so the method cannot be widely used in practice. Because one or more codebooks must be built for every pixel of the video, training the model is time-consuming, and if the background has to be rebuilt the speed drops considerably.
  • ViBe
    • Advantages: Simple idea and easy to implement (an infinite time window is approximated with a finite number of samples). Low computational cost and high efficiency (few samples; optimized similarity matching algorithm). Sample decay strategy (the random update strategy makes the sample lifetime decay exponentially, unlike the first-in-first-out policy of other methods).
    • Disadvantages: It has problems with ghosting, stationary objects, shadowed foregrounds, and incomplete moving objects.
  • SubSense
    • Advantages: The feedback mechanism is used, which is better adapted to different scenarios and more robust to noise.
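
The speed dependence of the frame difference method from the list above can be reproduced on a toy one-row "video". The pixel values, threshold, and function name are assumptions for illustration.

```python
def frame_difference(prev, curr, thresh=25):
    """Binary motion mask: 1 where |curr - prev| exceeds thresh."""
    return [
        [1 if abs(c - p) > thresh else 0 for p, c in zip(prow, crow)]
        for prow, crow in zip(prev, curr)
    ]

# A bright 2-pixel object on a dark 1x8 background.
prev_frame = [[0, 200, 200, 0, 0, 0, 0, 0]]
fast_frame = [[0, 0, 0, 0, 0, 200, 200, 0]]   # object jumped 4 pixels
still_frame = [[0, 200, 200, 0, 0, 0, 0, 0]]  # object did not move

fast_mask = frame_difference(prev_frame, fast_frame)
still_mask = frame_difference(prev_frame, still_frame)
print(fast_mask, still_mask)
```

The fast object produces two separate blobs (its old and new positions), while the stationary object produces an empty mask and is effectively treated as background.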

Origin blog.csdn.net/weixin_44110393/article/details/128585238