Computer Vision - Theory - From Convolution to Recognition


foreword

Vue framework: Learn Vue
OJ algorithm series: magic machine from the project - detailed algorithm explanations
Linux operating system: Fenghou Qimen - linux
C++11: Tongtianlu - C++11
Common Python modules: Tongtianlu - python

The computer vision series of blogs follows two main lines: algorithm theory + hands-on OpenCV.
The theory comes from [Computer Vision (undergraduate), Lu Peng, Beijing University of Posts and Telecommunications - complete course](https://www.bilibili.com/video/BV1nz4y197Qv/?spm_id_from=333.337.search-card.all.click&vd_source=78f225f12090c7d6b26a7d94b5887157)
The practical part will use kaggle examples to explain OpenCV and neural networks.

1. Introduction:

  • Human Vision:

    image / video -> the eye gathers light -> retinal imaging -> signal transmitted to the brain -> the brain interprets what is seen

    Advantages: fast recognition, about 150 ms to decide whether a scene contains an animal

    Disadvantages: weaker at dynamic vision and multi-object semantic segmentation, prone to illusions (the Clinton face illusion, static images that appear to move, identical colors perceived as different)

    Computer Vision:

    image / video -> CCD / CMOS imaging -> computer analysis

  • Purpose of CV:

    Understanding real meaning through pixels (semantic gap)

    The information that the image gives us: (two major research directions)

    1. 3D restoration
    2. semantic information
  • History line:

    1. In the 1950s, Hubel & Wiesel found that neurons in the cat's visual cortex become active in response to simple visual patterns
    2. In the 1960s, Minsky at MIT assigned a student project to build a complete visual system with AI reasoning
    3. David Marr divided CV into three levels:
      • Computational theory: what is being computed, and why?
      • Representation and algorithm: how are the inputs, outputs, and features represented, and what algorithm processes the input?
      • Hardware implementation: how are the computations realized and accelerated in hardware, and what requirements does this place on the algorithms and representations?
  • Applications:

    Movie special effects, 3D reconstruction, face detection, iris recognition, fingerprint recognition, autonomous driving, pedestrian detection, virtual reality, weather recognition

2. Convolution:

Image Denoising:

  • Image types:
    1. 1-bit binary image: 0 is black, 1 is white
    2. Single-byte grayscale image: 255 is bright, 0 is dark
    3. Three-byte RGB image: 3 channels
  • Noise classification:
    1. Salt-and-pepper noise: random black and white dots
    2. Impulse noise: random white dots
    3. Gaussian noise: random interference that follows a Gaussian distribution; it can be seen as adding a Gaussian-distributed value to every point of the ideal image
  • Denoising principle: the weighted average of the pixel neighborhood replaces each pixel

Constant convolution:

  • Convolution kernel/filter kernel: the weight of each point

    For example, k = 1/9 · [[1,1,1],[1,1,1],[1,1,1]] with equal weights is the mean (box) filter

  • Convolution operation: flip the kernel, multiply it element-wise with the 3*3 neighborhood of pixel values around each point, sum the products, and assign the result to that point

    (I * h)(m, n) = Σ_{k,l} I(k, l) · h(m − k, n − l); the index (m − k, n − l) is the flipped counterpart of (k, l)


  • Properties of the convolution operation:

    1. Linearity: filter(f1 + f2) = filter(f1) + filter(f2)

    2. Shift invariance: shifting the image and then filtering gives the same result as filtering and then shifting

    3. Commutativity: a * b = b * a

    4. Associativity: a * (b * c) = (a * b) * c

    5. Distributivity: a * (b + c) = a * b + a * c

    6. Multiplication by a scalar: k·(a * b) = (k·a) * b = a * (k·b)

    7. Identity under the unit impulse: f * e = f, where e = [..., 0, 1, 0, ...]

  • padding:

    The points along the four borders lack neighbors and cannot be convolved directly, so padding is required.

    Five padding methods: zero padding, constant padding (e.g. fill with 1), wrap-around (circular) padding, replicate (edge-stretch) padding, and mirror (reflect) padding. Padding also prevents the image from shrinking away after several convolutions.

  • Convolution kernel showcase:

    1. Identity (original image): [[0,0,0],[0,1,0],[0,0,0]]

    2. Shift left: [[0,0,0],[0,0,1],[0,0,0]]

    3. Shift down: [[0,1,0],[0,0,0],[0,0,0]]

    4. Mean smoothing / denoising: k = 1/9 · [[1,1,1],[1,1,1],[1,1,1]]

    5. Sharpening: [[0,0,0],[0,2,0],[0,0,0]] − 1/9 · [[1,1,1],[1,1,1],[1,1,1]]

      original image I * identity kernel e − smoothed image I * mean kernel g = edge image I * (e − g)

      original image + edge image = I * e + I * (e − g) = sharpened image I * (2e − g)

  • Problem with mean convolution: it produces a ringing (blocky) blur artifact

Gaussian convolution:

  • Gaussian function: nearby points get large weights, distant points get small weights

    G_σ(x, y) = 1/(2πσ²) · exp(−(x² + y²)/(2σ²))

  • Geometric meaning:

    1. 3D effect: protruding in the middle, flat around
    2. 2D effect: middle white, surrounding black
    3. Array normalization effect: the middle value is large, and the surrounding is small
  • Gaussian convolution kernel has two parameters:

    1. Window size: 3*3 or 5*5; the larger the window, the smaller the weight of each individual point, and points far from the center get weights close to 0

    2. Variance σ: the smaller σ is, the sharper (more peaked) the kernel; the larger σ is, the flatter and smoother it is

    3. The window size and the variance of the Gaussian kernel have a fixed relationship:
      h = 3σ + 1 + 3σ = 6σ + 1,  w = 3σ + 1 + 3σ = 6σ + 1

  • Gaussian kernel convolution Pythagorean theorem:

    Convolving with a Gaussian of σ₁ and then with a Gaussian of σ₂ (the two "legs") is equivalent to one Gaussian convolution with the "hypotenuse" σ':

    σ'² = σ₁² + σ₂²

  • Gaussian kernel decomposition: the 2D Gaussian separates into a Gaussian in the x direction and a Gaussian in the y direction

    G_σ(x, y) = g_σ(x) · g_σ(y), so one 2D convolution can be done as a 1D convolution along x followed by a 1D convolution along y

  • Application of Gaussian kernel decomposition: reduce time complexity, large kernel can be simulated with two small kernels

    Convolving an m*m kernel with an n*n image: original time complexity O(n² · m²); with the two separable 1D passes it becomes O(n² · m)

  • Gaussian noise: the noise model itself has a σ. The larger the noise σ, the larger the σ (and window width) the Gaussian kernel needs to suppress it, but at the same time the more the edges get blurred

Salt and pepper denoising:

  • Median filter: it has no fixed weights. For each pixel, all pixel values inside the window are sorted (a non-linear operation) and the median is placed at the current point, duplicates included. Convolution, by contrast, is a linear multiply-and-sum operation.

  • Advantages and disadvantages of the median versus the mean:

    1. Generally more robust than the mean: it is not affected by extreme values (salt and pepper)
    2. When the window is large, the median has little to do with the current point, so the result can be even blurrier than the mean

Sharpening degree:

  • α == 1:

    Sharpening kernel: [[0,0,0],[0,2,0],[0,0,0]] − 1/9 · [[1,1,1],[1,1,1],[1,1,1]]

    original image I * identity kernel e − smoothed image I * mean kernel g = edge image I * (e − g)

    original image + edge image = sharpened image I * (2e − g)

  • α characterizes the degree of sharpening:

    sharpened image = I * ((1 + α)·e − α·g)

  • The edge-extraction kernel (e − g) is also referred to as a (difference form of the) Laplacian of Gaussian; the larger α is, the sharper the result compared with the original.

3. Edge detection:

  • Definition of edges: Edges are places where there are rapid changes in the image intensity function, and most of the semantic and shape information in the image can be encoded in the edges.

  • 4 classifications of edges: surface discontinuity, depth discontinuity, surface color discontinuity, light discontinuity

  • Signal view of an edge: white is 255 and black is 0; the places where the intensity signal changes abruptly are edges, and the first derivative can be used to detect the abrupt change

Image signal derivative:

  • Image x-direction derivative formula:

    ∂f/∂x = lim_{ε→0} [f(x + ε, y) − f(x, y)] / ε

    Set the variable that tends to 0 to 1 directly to get the derivative formula in CV:

    The horizontal derivative of a point (x, y) = - the pixel value of the point + the pixel value of the right point (-1, 1)


    The derivative in the X direction (-1,1) can detect the vertical edge

  • Derivatives in the Y direction (-1,1) or (1,-1) can detect horizontal edges

Derivation operator:

  • Typical: -1, 1

  • Prewitt operator:

    Mx = [[-1,0,1],[-1,0,1],[-1,0,1]]

    My = [[1,1,1], [0,0,0], [-1,-1,-1]]

  • Sobel operator: less sensitive to noise

    Mx = [[-1,0,1], [-2,0,2], [-1,0,1]]

    It can be split into the column vector [1,2,1]ᵀ times the row vector [-1,0,1], which is equivalent to smoothing (in the vertical direction) first and then taking the horizontal derivative

    My = [[1,2,1], [0,0,0], [-1,-2,-1]]

    It can be split into the column vector [1,0,-1]ᵀ times the row vector [1,2,1]

  • Roberts operator: Hypotenuse

    Mx = [[0, 1], [-1, 0]]

    My = [[1, 0], [0, -1]]

Image Gradient:

  • The gradient points to the direction where the intensity increases fastest, and the horizontal gradient of the vertical stripes and the vertical gradient of the horizontal stripes

  • Gradient formula:

    ∇f = (∂f/∂x, ∂f/∂y)

  • Gradient magnitude formula:

    ||∇f|| = sqrt( (∂f/∂x)² + (∂f/∂y)² )

  • Gradient direction formula:

    θ = arctan( (∂f/∂y) / (∂f/∂x) )

  • To represent edges, the image convolved with the x-derivative kernel and with the y-derivative kernel is used to compute the gradient.

Extract edges:

  • Noise removal: noise makes the image signal oscillate slightly everywhere, so computing the gradient directly makes it look as if there are edges everywhere. Denoise first with Gaussian filtering (after padding, so border values are not lost).

  • Then convolve with the derivative operator.

  • The time complexity can be reduced by using the associative law of convolution:

    1. Derivation of the Gaussian kernel (derivative operator convolution)
    2. Convolving the image with a Gaussian kernel
  • Gaussian partial derivative template: Gaussian kernel for derivation

    As σ increases, the detected edges become coarser; choose σ according to the desired level of detail

    Although the Gaussian kernel (like softmax) is all positive, the Gaussian derivative kernel contains negative values

    The Gaussian kernel's elements sum to 1, while the Gaussian derivative kernel's elements sum to 0 (so a constant region with no edge gives no response)

canny algorithm:

  • First look at the advantages and disadvantages of traditional algorithms:

    1. Gaussian blur denoising (resulting in thicker edges)

    2. Directly find the gradient modulus of each point:

      Convolve each point with (-1, 1) in the x direction and with (-1, 1)ᵀ in the y direction

      The square root of the sum of squares of the two results gives the gradient magnitude at that point

    3. Set noise recognition threshold:

      Points whose gradient value is lower than the threshold are identified as noise and deleted (set to 0)

      There are two problems with this operation:

      All points that survive the threshold are treated as edge points, which widens the edges

      Improper threshold setting leads to the deletion of many real edges, or the noise produces a lot of false edges

      The Canny algorithm uses a double threshold: first use a high threshold to determine the real edge, and then lower the threshold to increase the weak edge. And assume that all the real weak edges are connected to the strong edge, so even if the noise passes the low threshold, it will be deleted when it is not connected to the strong edge.

    4. Non-Maximum Suppression: Refinement of Edges

      Compare the gradient values of a point and its neighbors along the gradient direction, leaving only the one with the largest gradient

      If the gradient direction is not an integer and there is no neighbor in this direction, then use linear interpolation to create a neighbor

  • Canny algorithm process:

    1. Gaussian blur denoising
    2. Calculate the gradient magnitude and direction
    3. Use the gradient direction to perform non-maximum suppression and obtain thin edges
    4. Use the gradient magnitude with double thresholds and edge linking to filter out noisy edges

4. Fitting:

  • After extracting the edge, use the edge to fit some shapes, and then analyze these shapes better. For example, if you get a circle by fitting, you can know the position of the center of the circle
  • Difficulties encountered in fitting:
    1. Noise: points that should be on the line end up off the line
    2. Outliers: points that are not on the line may be mistaken for being on it
    3. Missing Information: Occlusion

Least squares method:

  • Once outliers are involved, the gap between the fitted line and the true line becomes very large

y-direction:

  • When we know which points are on the line (all points are points on the line), use the least squares method to fit the line, and find the m & b of the line y=mx+b

  • Definition of the energy function E in the y direction: the sum of squared vertical offsets of all points from the line

    E = Σᵢ (yᵢ − m·xᵢ − b)²

  • Take the derivatives of E with respect to m and b and set them to zero to find the minimum of E.

  • Disadvantages: the model y = mx + b measures error only in the y direction, so vertical lines cannot be represented or solved for,

    and it cannot cope with line-angle changes caused by camera rotation

Total least squares (all directions):

  • Find a, b, d of the line ax + by = d (with the constraint a² + b² = 1)

  • Definition of the omnidirectional (perpendicular) energy function E: the sum of squared perpendicular offsets of all points from the line

    E = Σᵢ (a·xᵢ + b·yᵢ − d)²

  • First solve for d at the extremum: setting ∂E/∂d = 0 gives d = a·x̄ + b·ȳ, i.e. the line passes through the centroid of the points

  • Then substitute d back and solve for (a, b): this amounts to minimizing ‖N·n‖², where n = (a, b)ᵀ and N is the matrix of centered points; the solution is the eigenvector of NᵀN with the smallest eigenvalue

The idea of maximum likelihood estimation:

  • Least squares fitting corresponds to maximum likelihood estimation under the assumption that each observed point is a point on the true line plus Gaussian noise.

Robust least squares:

  • If there are outliers, the robust function ρ limits how much far-away points can contribute:

    E = Σᵢ ρ(rᵢ; σ), e.g. ρ(u; σ) = u² / (σ² + u²) (a common choice of robust function)

  • If σ is too small, all points contribute about the same to the fit

    If σ is too large, it behaves like ordinary least squares: near points get large weight and far points small weight

  • Since the ρ function makes the objective non-linear, the minimum cannot be found in closed form by differentiation; iterative methods such as (stochastic) gradient descent are needed

RANSAC:

  • Many outliers, even many outliers

  • Follow these four steps:

    1. Choose two points uniformly at random
    2. Fitting a straight line based on two points
    3. All other points compute their distance to the line; with a threshold k, a point whose distance is less than k counts as on the line (an inlier), otherwise not.
    4. Do this multiple times and choose the line with the most points on the line
  • The number of samples N (how many times to repeat) is computed as follows:

    s is the number of points the model needs, p is the desired probability of drawing at least one outlier-free sample, and e is the outlier rate

    N = log(1 − p) / log(1 − (1 − e)^s)

    e is the probability that a point is not on the line (the outlier rate); 1 − e is the inlier rate

  • In the formula above, e is hard to specify by hand: if you trust the data set you can set it low, otherwise higher. When the data set is not well understood, use the adaptive RANSAC method:

Adaptive:

  • Adaptive process:

    1. N=+∞,sample_count =0

    2. While N > sample_count:

      Select a sample to fit a straight line and calculate the number of points on the line

      Set the outlier rate e = 1 − (number of inliers) / (total number of points)

      Recalculate N from e: N=log(1-p)/log(1-(1-e)^s)

      Increment sample_count by 1

  • If in some iteration the recomputed N is 150 while only 100 samples have been drawn, about 50 more iterations are run

  • If the recomputed N is 100 while 120 samples have already been drawn, the loop ends (a code sketch follows at the end of this section)

  • Advantages and disadvantages:

    1. simple and versatile
    2. Applicable to many different problems
    3. works fine in practice
    4. But many parameters need to be tuned
    5. Not well suited to low inlier ratios: too many iterations may be needed, or it may fail completely
    6. Can't always get a good initialization model based on the minimum number of samples
  • With the method of least squares:

    The line found by RANSAC is based only on two sampled points and the given distance threshold; the true line lies somewhere in that band but is probably not exactly the sampled line.

    Once all the inliers are found, refitting them with the omnidirectional (total) least squares method gives a line that is closer to all the points

Fingerprint recognition:

  • Affine transformation: after finding the special (feature) points in different images, they need to be put into correspondence

    The six parameters require three pairs of points, which give three x equations and three y equations:

    x' = a·x + b·y + c,  y' = d·x + e·y + f

  • First randomly pick three pairs of corresponding points and build the a–f matrix, then let the remaining points vote for it; finally keep the a–f matrix that satisfies the most correspondences

  • For each candidate fingerprint, construct its optimal a–f matrix against the target fingerprint, and check how many corresponding points each optimal matrix accounts for

Hough transform:

  • Applicable situation: there are many outliers (and possibly several lines to find)

Voting strategy:

  • Let each feature vote for all models compatible with it,

    It is hoped that the noisy features will not vote unanimously for any one model,

    As long as there are enough features remaining and not too much occlusion, a good model can be achieved

  • Initial voting method:

    1. Discretely select a finite number of a and b, each forming a grid
    2. The straight line y = ax + b, where ab is fixed, can only contribute one point (a,b) to the parameter space, and vote for the (a,b) grid
    3. A single point (x, y) does not fix a and b; it contributes a whole line in parameter space and votes for many cells. The intersection of the voting lines of two points is the (a, b) of the line through both points.
  • Disadvantages of initial voting: the ab parameter itself is infinite, and the vertical line needs infinite a

  • Polar voting method:

    1. θ ∈ [0°, 180°), so every line, including vertical lines, has a finite parameter pair

    2. ρ can then be computed from the point and θ:

      ρ = x·cos θ + y·sin θ

  • Gradient improvement method: use the magnitude and direction of the gradient to vote for it

Determining the line:

  • Which grid has the most votes, the straight line corresponding to this grid is the target straight line

  • For a square, the votes clearly concentrate at four bright points, representing the four straight lines

    For a circle there is clearly no single agreed point, just a band over a range of parameters

Adjust the grid to fit the noise:

  • Noise spreads the votes that would have landed on a single cell beyond that cell; one fix is to widen the cells so each covers a larger range
  • However, if the cells are too wide, several lines are merged into one; if they are too narrow, each cell receives too few votes, possibly below the threshold
  • Soft voting: to fix the too-narrow / too-few-votes problem, every vote also adds (with decreasing weight) to the neighbouring cells

Canny gradient voting:

  • When we use Canny to detect an edge point, we also know its gradient direction, which means the line the point votes to is uniquely determined


  • Modified Hough transform:

    For each edge point (x, y):

      θ = gradient direction at (x, y)

      ρ = x·cos θ + y·sin θ

      H(θ, ρ) = H(θ, ρ) + 1

Hough circle:

  • A circle has three parameters: (x − a)² + (y − b)² = r², so the voting space is three-dimensional

  • A given point (x, y) on the circle directly determines one point with r = 0 in the (a, b, r) space

    Because the gradient direction on the arc is known, the center must lie along the gradient (either towards or away from it); for each candidate radius, vote for the two possible centers:

    The point (a, b, r) with the most votes gives the center of the circle, and thus the whole circle is known

SNAKE:

  • There is occlusion, and it is not certain whether a boundary should be there; the snake (active contour) model handles this case

5. Corner points:

  • Feature points are used for: image alignment, 3D reconstruction, motion tracking, robot navigation, indexing and database retrieval, object recognition

Panorama image:

  • Step 1: Extract features Step 2: Match features Step 3: Align images
  • Step 1: Four requirements:
    1. Repeatability Despite geometric and photometric transformations, the same features could be found in several images.
    2. Distinctiveness: every feature is unique.
    3. Compactness and efficiency: preferably far fewer features than image pixels
    4. Locality: a feature occupies a relatively small area of the image and so resists clutter and occlusion (a descriptor of the whole image would change drastically under rotation and translation).
  • Step 2: match the features, using adaptive RANSAC as in the fingerprint matching above

Basic detection:

  • The corner points meet the four requirements of the first step, and in the area near the corner, the image gradient has two or more dominant directions, which are unique and repeatable

  • The basic method of corner detection:

    Moving the small window in any direction produces a large change in intensity within the window

    The appearance of the window w(x,y) changes when moving [u,v]:

    E(u,v) = Σ_{x,y} w(x,y)·[I(x+u, y+v) − I(x,y)]², where I is the image and w(x,y) is the window function (largest in the center, smaller around it)

Taylor expands:

  • Expand I(x+u, y+v) as a two-dimensional Taylor series around (0,0):

    E(u, v) ≈ [u v] · M · [u v]ᵀ

    M = Σ_{x,y} w(x,y) · [[Ix², Ix·Iy], [Ix·Iy, Iy²]]  (the second moment matrix)

  • Keeping only the first-order derivatives, the behaviour of E is described by the two eigenvalues λ₁, λ₂ of M:

    when neither λ tends to 0, the point is a corner

  • The effect of the λ values:
    both λ₁ and λ₂ small → flat region; one large and one small → edge; both large → corner

Edge and corner distinction:

  • Direct λ method:

    compute the two eigenvalues of M and compare them as above

  • det formula method:

    R = det(M) − k·trace(M)² = λ₁·λ₂ − k·(λ₁ + λ₂)²

    R large and positive → corner; R < 0 → edge; |R| small → flat region

Harris corner detection method:

  • Can solve lighting, translation, rotation

  • Steps from the lecture slides (PPT):

    1. Compute Gaussian derivatives at each pixel
    2. Compute the second moment matrix M within a Gaussian window around each pixel
    3. Compute the corner response function R
    4. Threshold R
    5. Find the local maxima of the response function (non-maximum suppression)
  • Harris is used to detect corners in the image, i.e. the pixels where two edges meet. The steps are:

    1. Image preprocessing: First, the input image is converted to a grayscale image.

    2. Calculate the Gaussian gradient at each pixel of the image: Use a gradient calculation method (such as the Sobel operator) to calculate the gradient of the image in the horizontal and vertical directions. This helps us capture edges in images.

    3. Calculate the second-moment matrix M in a Gaussian window around each pixel: for each pixel, use the computed gradients to build the structure tensor (second-moment matrix), a 2×2 matrix that describes the gradient distribution of the area around the pixel.

    4. Calculate the corner response function: evaluate the corner characteristics of each pixel by calculating the eigenvalue of the structure tensor (the eigenvalue represents the degree of gradient change). Harris corner detection uses the following corner response function:

      R = det(M) - k * trace(M)^2

      Here det(M) is the determinant of the structure tensor, trace(M) is its trace (the sum of the main-diagonal elements), and k is an empirical parameter (typically 0.04–0.06) that adjusts the sensitivity of the response function.

    5. Threshold processing: According to the calculated corner response function, threshold processing is performed on each pixel point, and the pixel points whose response function is greater than the set threshold are marked as corner point candidates.

    6. Non-maximum suppression: among the corner candidates, i.e. among all the local maxima of the response function found so far, suppress non-maxima among adjacent pixels and keep only the pixel with the largest response as the final corner.

    7. Display corners: Mark the detected corner positions on the original image according to the detected corner coordinates.

Invariant and covariant:

  • Invariance invariance: The corner position of the transformed image does not change.

    F(T(img)) = F(img)

  • Covariance covariance: If there are two transformed versions of the same image, features should be detected at corresponding positions

    F(T(img)) = T'(F(img))

Advantages and disadvantages:

  • The Harris corner detection algorithm identifies corners by calculating the gradient distribution and corner characteristics of the area around the pixel point, and it has better performance in calculation speed and detection accuracy.
  • However, Harris has no notion of scale. When the image is rescaled, the R value of a given structure changes and may no longer pass the threshold, and the detection does not adapt to the new size; for example, when a corner is enlarged so much that its edges look thick and rounded within the fixed window, the corner can no longer be extracted.

6. Blob detection:

  • Goal: Independently detecting corresponding regions in scaled versions of the same image requires a scale selection mechanism to find feature region sizes that are covariant with image transformations.

  • Gaussian second derivative zero:

    The edge is where the image changes drastically; the extremum of the first Gaussian derivative corresponds to the zero-crossing point of the second Gaussian derivative (the crossing through zero, not merely a point where the value is zero):

Laplace decay:

  • Laplacian kernel: as the variance σ increases, the response of the Laplacian to the image decays more and more severely:

Laplacian multiscale detection:

  • Multiply the convolution by σ² so that the response does not decay with scale: given an image, find the σ with the strongest response and compute the circle (blob) radius from it

  • Laplacian of Gaussian: a circular symmetric operator for 2D blob detection


  • What is the maximum response of the Laplace function to a binary circle of radius r? For maximum response, the zeros of the Laplace function must be aligned with the circle


    Therefore, the maximum response occurs at σ=r/√2.

  • That is, once the maximum response is found at some σ, the blob radius is r = √2·σ

  • step:

    1. Image preprocessing: First, convert the input image to a grayscale image
    2. Computing the Laplacian: Apply the Laplacian (usually using a discrete second derivative operation) to compute the Laplacian transform of the image. Discrete second derivative operations can be implemented by using a Laplacian template (such as a 3x3 template), which contains the weights of the center pixel and its surrounding 8 neighbor pixels. By applying the Laplacian operator to each pixel in the image, a Laplacian image of the same size as the original image can be obtained.
    3. Edge detection: In Laplacian images, edges are detected by finding extreme points of pixel values. In general, positive extreme points represent edges from light to dark, and negative extreme points represent edges from dark to light. A threshold can be set to determine which extreme points are considered edges.
    4. Strong edge enhancement: In order to enhance the display of edges, the detected edges can be enhanced. This is achieved by setting the grayscale value of edge pixels to the maximum value (usually 255), while setting the grayscale value of non-edge pixels to the minimum value (usually 0).
    5. Display Results: The enhanced edge image is displayed to visualize edges and significant changes.

    Laplacian detection is a simple and effective edge detection method, especially suitable for the detection of details and textures in images. However, since the discrete second derivative is sensitive to noise, Laplacian detection may be corrupted by noise, producing unstable marginal results. Therefore, in practical applications, it is usually combined with other image processing techniques (such as Gaussian filtering) to improve the quality and stability of edge detection.

Non-maximizing suppression:

  • A point's response is compared not only with the responses at the same position under neighbouring σ values, but also with the neighbouring positions at the same σ. In effect this selects both the circle center and the circle radius

Advantages and disadvantages:

  • The advantage is that the circles that can be found are complete
  • The disadvantage is that the larger the blob, the larger the Gaussian kernel and its first and second derivatives must be, so the amount of computation grows
  • Improvement: only make circles for harris corners, or directly use SIFT features

SIFT features:

  • Find the local maximum of the difference of Gaussian function in space and scale

  • DoG function: no second-order derivative is required; the difference of two Gaussian-blurred images is enough, and the result is close to the Laplacian kernel

  • Moreover, using the Gaussian "Pythagorean" property, the kσ-blurred image can be obtained by blurring the already σ-blurred image once more with a Gaussian of variance (kσ)² − σ²; because that extra kernel's window is small, this costs less than convolving directly with the kσ kernel.

  • The image can also be downsampled; the ratio between the blob radius R detected in the downsampled image and the radius detected in the original image is the downsampling factor.

    Downsampling here can simply keep every other pixel of the image.

  • Convolving the image with 5 different σ values gives 4 equivalent-Laplacian (DoG) images. Non-maximum suppression is performed over each group of 3 adjacent DoG images, yielding one suitable σ per group, i.e. 2 σ values in total.

    In general, if the goal is to output s suitable σ values (scales) per octave, k is chosen discretely with interval k = 2^(1/s)

Invariance and covariance of the Laplacian blob detector:

  • Laplacian (blob) responses are invariant to rotation and scaling
  • blob position and scaling, covariance, rotation and scaling
  • What about intensity changes?

Adaptive ellipse: affine covariance and scale invariance

  • Affine transformations approximate viewpoint changes for roughly planar objects and roughly orthographic cameras


  • To achieve an adaptive ellipse, you can use the harris corner detection tool:


  • Start from the circle constructed by SIFT or the Laplacian, find the two directions in which the gradients of the pixels inside the circle vary the most and the least, and adjust the radius along these directions until the gradient variation is the same in both (λ1 = λ2); the result is an ellipse adapted to the local structure

  • Vote on a grid of orientations: each cell casts a vote for its gradient direction; the orientation with the most votes is taken as the reference, and the ellipse is rotated back to this angle

  • Similarity is not judged from raw intensity (which differs, for example, between a daytime and a nighttime photo of the same scene). Instead, keep voting on gradients: each small cell votes for its own gradient directions:

  • Serialize the SIFT region according to the votes: 16 cells × 8 orientation bins each = a 128-dimensional descriptor; the similarity of two regions is obtained by comparing their 128-dimensional vectors. (Keeping a separate histogram per cell preserves the spatial order, so rearrangements of the same local patterns are not confused with each other.)

SIFT image matching system:

  • Create a Gaussian Difference Pyramid - used to simulate the distance and blurriness of the observer from the object:

    1. Perform Gaussian convolution with different variances on the image to obtain multiple images as the first layer
    2. Sampling is performed on the first layer of images at intervals, and a Gaussian convolution with a variance of 2σ is performed. The purpose of sampling every other point is to reduce the image scale
    3. Perform several iterations as above to obtain several layers of Gaussian pyramids, and each layer has several images obtained by convolution with different variances
    4. Between the same layer of the pyramid, the difference between two adjacent pictures is obtained to obtain a Gaussian difference pyramid (Difference of Gaussian, DOG)
  • Key point location determination:

    1. Thresholding: Set a threshold in DOG to filter out low-contrast feature points
    2. Finding the extremum in the Gaussian difference pyramid: compare the pixel value of the pixel point and its 8 neighbors to determine the extremum
    3. Since the differential pyramid is not continuous. Use Taylor expansion to find extreme points more accurately
    4. Discard points with low contrast (possibly noise)
    5. Edge effect removal: Calculate the gradient of the point and put it into the direction histogram. If the angle between the dominant direction of a point and the edge direction is small, it will be considered as a point on the edge and will be discarded.
  • Key point direction assignment:

    1. Construct a gradient histogram: the neighborhood is divided into 16 sub-regions, and the gradient magnitude and direction of each point are calculated
    2. Dominant direction selection: Find the dominant direction with the largest gradient magnitude from the gradient histogram
    3. Create a key point descriptor: After determining the dominant direction of the key point, divide the neighborhood around the key point into sub-regions, and calculate the gradient direction histogram in each sub-region. The feature vectors of all subregions are concatenated to form the keypoint descriptor
  • Feature point matching:

    Using descriptors of keypoints, use Euclidean distance or similarity measure between feature points to find the best matching pair

    If the similarity to the best candidate match is significantly higher than to the second-best candidate, the pair is accepted as a valid correspondence. Conversely, if the best and second-best candidates are almost equally similar, the feature point is ambiguous and does not take part in the matching decision.

7. Texture segmentation and classification:

  • The role of texture:
    1. Shape Extraction from Texture: Estimating surface orientation from image textures or shape from texture cues
    2. Segmentation/classification analysis, representing textures to group texture-consistent image regions
    3. Synthesis: generate new texture patches/images given some examples
  • Importance to perception:
    1. Texture is often an indication of material properties and can be an important appearance cue, especially when objects are similar in shape.
    2. The purpose is to distinguish shapes, boundaries, and textures.

Brick wall texture classification:

  • step:

    1. Convolve with [-1, 1] to get vertical stripes, and with the vertical (transposed) kernel [-1, 1]ᵀ to get horizontal stripes
    2. Start sliding with a small window, and count the average horizontal gradient value and the average vertical gradient value in each small window. The horizontal gradient of the vertical grain is large, and the vertical gradient of the horizontal grain is large.
    3. Perform knn or kmeans clustering on all the content in the window to classify the texture
  • Define the distance between two points:


  • Two disadvantages of the above approach:

    1. The premise of knn or k-means here: we know the approximate texture size, so we can choose an appropriate window size. But for a zebra whose stripes are large relative to the window, a too-small window sees only black (or only white) inside a stripe.
    2. It is assumed that all textures can be easily classified by horizontal and vertical filters.

Adaptive window size:

  • Gradually transition from a small frame to a large frame. When the texture features in the frame change very little, it means that the window size is appropriate.

  • Many kinds of filters can be selected to perform Knn or kmeans on the D-dimensional space. There are 48 commonly used texture filters (6 directions + 4 sizes + point/edge/bar)

  • The larger the size of the convolution kernel, the more macroscopic the content of attention, and the more abstract the obtained picture

  • From the left-black/right-white kernel used to detect vertical lines, a conclusion can be drawn: a convolution kernel responds most strongly to image patterns that look like the kernel itself

Multidimensional Gaussian:

  • Formula:

    p(x) = 1 / ((2π)^{d/2} |Σ|^{1/2}) · exp( −(x − μ)ᵀ Σ⁻¹ (x − μ) / 2 )

  • Covariance matrix:

    Σ = E[ (x − μ)(x − μ)ᵀ ]

Classify images by texture task:

  • Given a picture n*m, 48 texture convolution kernels are convolved for each point, and the final picture is expanded to n*m*48

  • Use knn to compare this picture with other pictures, and SVM can also be used when there are fewer samples


8. Segmentation:

  • Over-segmentation: the image is split too finely. Under-segmentation: content that does not belong to the target is merged into it, so the segments are too large.
  • Superpixels: pixels with similar positions and appearance are grouped together; everything below is bottom-up segmentation, starting from the pixels.
  • Characteristics of the segmentation task: it is unsupervised. Machines generally work bottom-up, while humans combine top-down and bottom-up processing, sometimes with and sometimes without supervision, to find the regularities.

Basic division method:

  • Common basis for segmentation:


  • Treating pixels with similar RGB values as the same content + k-means clustering: it cannot separate instances well and tends towards under-segmentation

  • Using RGB + pixel coordinates as features: background regions at different positions are also split into different segments, which tends towards over-segmentation

K-means advantages and disadvantages:

  • Advantages: very simple, can always converge to a local minimum of the error function

  • Shortcomings: 1. the number of clusters k must be specified manually 2. the memory overhead is large 3. it is very sensitive to initialization 4. it is sensitive to outliers 5. it can only find "spherical" clusters

    Sensitivity to initialization can be reduced in two ways: pick the initial centers one by one as dense points far from the already chosen centers, or partition the space at equal intervals

    The non-spherical classification can be found by GMM

mean shift center of gravity drift method:

  • Initialize the window at a single feature point,

    Perform a mean shift on each window until convergence, with windows close to the same "peak" or mode

  • Three core steps:

    1. randomly select a window
    2. Count the center of gravity of all points in the window, and move the window in this direction
    3. Until the movement stops, the areas that all windows walk through are grouped together
  • Clustering: all the data points in the basin of attraction of a mode form one cluster.

    Basin of attraction: the region whose trajectories all lead to the same mode


  • Mean shift algorithm is a non-parametric clustering algorithm, which is mainly used for data clustering and density estimation.

    step:

    1. Initialization: Select an initial seed point as the center point of each cluster and determine a window size.
    2. Density Estimation: For each seed point, compute the density estimate of the data points within the window. A kernel function can be used to measure the density of data points within the window, usually using a Gaussian kernel function.
    3. Translation vector calculation: Calculate the translation vector of each data point relative to the seed point. The translation vector is calculated by calculating the difference between the centroid (mean) of the data point in the window and the current seed point.
    4. Translation: Translate the seed point along the translation vector to update the position of the seed point
    5. Convergence judgment: Repeat steps 3 and 4 until the seed point converges to a local maximum (that is, the translation vector is close to zero). This means that the seed point has found the cluster center with the highest local density.
    6. Clustering: Use the converged seed point as the center point of the cluster, and assign other data points to the nearest cluster center.
    7. Repeat steps 1 to 6 until all data points have been assigned to cluster centers.

    The core idea of ​​the mean shift algorithm is to iteratively adjust the position of the seed point to move it to a high-density area until it converges to a local maximum. This finds the cluster centers in the data and assigns the data points to the corresponding cluster centers.

  • Advantages and disadvantages:

    Advantages: It does not assume that the classification must be spherical, only one parameter (window size) is required, the number of classifications does not need to be specified in advance, and the noise interference is very small

    Disadvantages: the computation is complex and heavy (although points already swept by the window can be skipped), the result depends strongly on the window size (too large a window gives too few clusters, too small a window gives too many), and when the dimension is too high the points become too sparse to estimate the centroid reliably

Normalization Cut method:

  • Graph theory: Each point in the picture is used as a vertex, and the edge weight of different vertices is the similarity of the point.

    Find the minimum cut of the network composed of all points in the picture, and complete the picture segmentation after separation.

    Split the graph into segments, remove links that cross between segments, it is easiest to break links with low affinity. Similar pixels should be in the same segment after segmentation, and dissimilar pixels should be in different segments.

  • Point similarity:

    Suppose each pixel is represented by a feature vector x, with a distance function suited to this representation; the distance between two feature vectors can then be converted into a similarity (affinity) using a generalized Gaussian kernel:

    Small σ: only nearby points count as similar; large σ: even distant points can be similar.

    w(i, j) = exp( −dist(xᵢ, xⱼ)² / (2σ²) )

  • Disadvantages of ordinary cuts: Min cuts tend to cut off very small, resulting in many independent small regions, which can be fixed by normalizing the weights of all edges of the cut.

  • The normalized-cut criterion:

    Considering a two-way partition (A, B), with w(A, B) = the sum of the weights of all edges between A and B:

    Ncut(A, B) = w(A, B)/w(A, V) + w(A, B)/w(B, V), where V is the set of all vertices

  • Mathematical derivation: minimizing Ncut leads to the generalized eigenvalue problem (D − W)y = λDy, which is used in the steps below

  • step:

    1. Input: Image (represented as a matrix of pixels), standard deviation σ1. Convert pixels to vectors: Represent each pixel in the image as a vector. For a grayscale image, you can use the gray value of the pixel as the element of the vector; for a color image, you can use the color channel value of the pixel as the element of the vector.
    2. Compute pairwise pixel distances: For each pair of pixel vectors, compute the distance between them using a distance function of choice (such as Euclidean distance or Manhattan distance). Get a distance matrix where each element represents the distance between two pixels.
    3. Map the distance to the range [0, 1]: To map the distance to the range [0, 1], you can use the formula d' = (d - min_distance) / (max_distance - min_distance), where d is the original distance, d ' is the distance after mapping, and min_distance and max_distance are the minimum and maximum distances between all pixel pairs, respectively.
    4. Computing Similarity Using Gaussian Kernel Function: Convert distance to similarity using Gaussian kernel function. To calculate the similarity between each pair of pixels, you can use the formula similarity = exp(-d'^2 / (2 * σ^2)), where d' is the distance after mapping and σ is the standard deviation.
    5. Generate similarity matrix W (adjacency matrix): Fill the calculated similarity values ​​into a similarity matrix. Each element of the matrix represents the similarity between two pixels. Note that W is symmetric. And the diagonal is 0, because the distance between the same points is 0.
    6. Build a Laplacian matrix: build it from the similarity matrix. The Laplacian can take several forms, such as a symmetric normalized or an unnormalized Laplacian. Define a diagonal matrix D whose nth diagonal element is the sum of the values in the nth row of W.
    7. Eigenvalue decomposition of the Laplacian matrix: decompose the constructed Laplacian to obtain the eigenvalues and corresponding eigenvectors, i.e. solve (D − W)y = λDy and take the y vector corresponding to the second-smallest eigenvalue.
    8. Clustering or segmentation using eigenvectors: perform clustering or segmentation operations based on specific eigenvalues ​​of eigenvectors. This can be achieved using a clustering algorithm such as spectral clustering or an eigenvector-based thresholding operation. Set the threshold value, assuming it is 1, the ones below 1 belong to one category, and the ones above 1 belong to another category.
    9. Output segmentation results: According to the results of clustering or segmentation, the pixels in the image are divided into different regions or categories. Pixels can be assigned to different segmented regions or classes based on a certain threshold of feature vectors or the results of a clustering algorithm.
    10. Optional post-processing: Depending on the need, some post-processing steps can be performed to further refine the segmentation results. For example, edge smoothing techniques can be applied to remove noise or discontinuities on segmentation boundaries.
    11. Output the final result: The final image segmentation result is used as the output of the algorithm, which can be an image that marks the region or category to which each pixel belongs.
  • Pros: a general framework that can be used with many different features and affinity formulations. Cons: high storage requirements and time complexity, and the segments it produces tend to be of roughly equal size

9. Recognition (classification + detection):

  • Various tasks:

    • Detection tasks: first find, then identify or classify
    • Segmentation task: first find, then identify or classify, then segment
    • Add semantic information based on detection
  • Design of algorithms capable of classifying images or videos: detecting and localizing objects; estimating semantic and geometric properties; classifying human activities and events

  • What makes recognition hard: 1. there are very many categories, humans can recognize roughly 10,000 to 30,000 objects 2. the pixels of the same object under different viewpoints are very different 3. lighting makes the pixels of the same object very different 4. prior knowledge of appearance is not very accurate 5. deformation 6. occlusion 7. background clutter 8. intra-class variation.

  • Three big questions:

    Representation - How to represent object classes? Dense, random, feature points, multiple feature points. Which classification scheme?

    Learning - how to learn the classifier given the training data;

    Cognitive - how to use a classifier on new data

  • Representation: direct dense segmentation into small images into bag of words? Or also consider the relative position topological relationship between small graphs.

A priori a posteriori:

  • To deal with intra-class variability, it is convenient to use probabilistic models to describe object classes. Object Models: Generative, Discriminative, and Hybrid

    insert image description here

  • Discriminant model: nearest neighbor, neural network, support vector machine, Boosting

  • Generative models: Naive Bayes (optimal likelihood function), latent Dirichlet distribution, LDA, 2D component models, 3D information models.

Recognition approaches:

  • Several recognition settings: classification (one category per image, e.g. with a support vector machine) and detection (multiple objects per image, which requires windows)
  • Detection: Sliding windows to determine whether each window contains the target object.
  • Disadvantages of sliding-window detection: 1. the target object is often not square 2. the final detector tends to judge false windows as true 3. many windows containing only part of the target object are also scored as correct (non-maximum suppression is needed)

Bag of words model:

  • Originated from texture recognition, simple textures can be seen to distinguish objects, and the number of occurrences of different textures in objects is counted.

  • Principle: extract the features from all training pictures to form a vocabulary; the features found in a new picture then vote for the vocabulary entries they resemble. The picture is assigned to the class whose features receive the most votes.

  • Vector representation of a document: with a vocabulary of, say, 10,000 common words, a document votes for these 10,000 words, forming a histogram that serves as its feature vector.

  • step:

    1. Feature Extraction: Prepare a dataset of images containing the correct labels. Cut the image into slices one by one, and each slice is a local feature of the image. Commonly used feature extraction methods include SIFT, LBP, SURF, etc., but the features at this time are too repetitive to directly build a dictionary.
    2. Generate a dictionary/bag of words (codebook): Summarize all local feature points to form a set of feature points. Then use a clustering algorithm (such as K-means) to cluster the set of feature points, and divide the feature points into different clusters. Each cluster is a word, and a simplified dictionary is built. Build a histogram.
    3. Feature representation: For each image sample, it is expressed as a feature vector according to the distribution of its local feature points. The dimension of the feature vector is the size of the dictionary, and each dimension represents the frequency or weight of the corresponding word in the image.
    4. Classifier training: Using the constructed feature vector as input, input image samples and corresponding labels into the classifier for training. Commonly used classifiers include support vector machines (SVM), random forests, neural networks, etc.
    5. Image classification: For a new image to be recognized, its local features are extracted and expressed as feature vectors. Then, input the feature vector into the trained classifier for classification. The classifier will output the probability that the image belongs to each class or directly give the predicted class.
  • Advantages: Losing a few small pictures will not affect it, and the interference caused by occlusion, translation, and rotation is very small.

    Disadvantage: The positional relationship of each small picture in the picture is not considered.

Space Pyramid:

  • The spatial pyramid algorithm (Spatial Pyramid) can effectively deal with image content of different scales and sizes.

  • Principle: a picture is represented by feature votes (histograms) collected at several pyramid levels

  • step:

    1. Feature extraction: First, feature extraction is performed on the input image. Commonly used feature extraction methods include SIFT (Scale Invariant Feature Transform), HOG (Histogram of Oriented Gradients) and LBP (Local Binary Pattern), etc.
    2. Divide the pyramid: Next, divide the image into a pyramid structure of different levels. Each pyramid level corresponds to a different scale and size. Usually, the number and size of pyramid levels are predefined, such as 2 levels, 3 levels or more.
    3. Block statistics: For each pyramid level, the image is divided into blocks of fixed size. These blocks can be square or rectangular areas. Then, the statistics of the features are computed within each block. This can include histograms, mean, variance, etc. In this way, the local features of the image at different spatial positions can be captured.
    4. Feature fusion: The feature statistics in each pyramid level are fused. A common approach is to concatenate features from different levels to form a comprehensive feature vector. This preserves information at different scales and sizes to better describe the content of the image.
    5. Classifier training: use the fused feature vectors to train a classifier, such as a support vector machine (SVM), random forest, or neural network. The classifier learns the patterns of the image categories from the provided training data and is used to classify and recognize new images.
    6. Image classification: for a new image to be classified, first perform the same feature extraction and pyramid blocking as for the training images, then feed the extracted features into the trained classifier, which outputs the class label of the image.

postscript:

  • Looking back over the development history of computer vision gives me a better understanding of what computer-science research work involves; in fact, each algorithm here comes from a paper.

Source: blog.csdn.net/buptsd/article/details/131483708