[Codecs] Detailed explanation of JPEG principle

### Date: 2018.4.22

===========================================================


JPEG stands for Joint Photographic Experts Group, the joint ISO/IEC expert group responsible for formulating still-image compression standards. The algorithm developed by this expert group is called the JPEG algorithm, and it has become a common standard: the JPEG standard. JPEG compression is lossy, but the information it discards is largely imperceptible to human vision. It exploits the human eye's insensitivity to high-frequency image content to greatly reduce the amount of data that must be stored and processed. 
  The human eye has different sensitivities to the different frequency components that make up an image; this is determined by its visual physiology. For example, the human eye contains about 180 million rod cells, which are sensitive to brightness, and about 8 million cone cells, which are sensitive to color. Since rods far outnumber cones, the eye is more sensitive to brightness than to color.
  In general, encoding an original image with JPEG proceeds in two steps:
  1. Remove visually redundant information, that is, spatial redundancy;
  2. Remove redundancy in the data itself, that is, structural (statistical) redundancy.
  1. Removing redundant visual information
  An original, unprocessed image is a plane of colors, and this plane is made up of many points in the horizontal and vertical directions. The color of each point, that is, of each pixel the computer represents, can be decomposed into red, green, and blue: the three RGB primary colors. Mixing these three colors in a certain proportion yields the actual color value.

So the image plane can in fact be understood as a three-dimensional system: in addition to horizontal X and vertical Y there is a color value Z, where Z represents the specific values of the R/G/B components in the three-primary mix. The RGB mix of each pixel may differ, and each value may be large or small, but the three R/G/B values of two adjacent points will usually be quite close.

Since this original image is composed of many independent pixels, it is entirely discrete. For example, an image of size 640×480 has 640 pixels horizontally and 480 pixels vertically.
  From the above we know that two adjacent points will often have very similar colors. The question, then, is how to record as little of this unnecessary data as possible in the final picture, that is, how to achieve compression.
  This will involve the spectral characteristics of the image signal.
  The spectrum of an image signal generally lies in the range 0–6 MHz, and an image contains components of many frequencies. Most of the spectral content, however, is low-frequency; high-frequency content appears only in the signal at image edges, which occupy a very small proportion of the image area. This is the theoretical basis for JPEG image compression.
  Therefore, when digitally processing the image, bits can be allocated according to the spectrum: more bits to the information-rich low-frequency region and fewer bits to the information-poor high-frequency region, with no appreciable damage to image quality, thereby achieving data compression.
  How is the color-space domain of the original image converted into the spectral domain? With a mathematical tool, the discrete cosine transform, i.e. the DCT (Discrete Cosine Transform).
  DCT is an invertible, discrete orthogonal transform. Although the transform itself produces no compression, the resulting frequency coefficients are very amenable to bit-rate compression. That is, the transform yields DCT coefficients, and these coefficients can then be further processed, namely quantized; after quantization, data compression is achieved.
In general, the first step encodes the image and removes redundant information: the forward DCT (FDCT) is applied, and the transformed coefficients are then quantized. Quantization uses empirically derived values to coarsen the high-frequency data to which the human visual system is insensitive, greatly reducing the amount of data to be processed. This step combines mathematical methods with empirical values. 
  2. Removing redundancy in the data itself
  Huffman coding is used to losslessly compress the final data; this is a purely mathematical processing method.
  In summary, the two steps above work as follows:
  When processing a color image, the JPEG algorithm first converts the RGB components into a luminance component and color-difference components, discarding half of the color information (halving its spatial resolution). Then DCT is used for block transform coding, high-frequency coefficients are discarded, and the remaining coefficients are quantized to further reduce the amount of data. Finally, RLE run-length encoding and Huffman encoding complete the compression task. 
  3. Detailed analysis of JPEG principle
  The following will introduce the details of these two steps in more detail.
  The main steps involved in JPEG coding are:
  1. Color model conversion
  2. DCT (Discrete Cosine Transform)
  3. Rearranging the DCT results
  4. Quantization
  5. RLE coding
  6. Canonical Huffman coding
  7. DC coding

1. Color space 
  In image processing, in order to exploit the characteristics of human vision and reduce the amount of data, a color image represented in RGB space is usually transformed into another color space.
  Three color-space transforms are in common use: YIQ, YUV and YCrCb.
  Each color space produces one luminance component signal and two chrominance component signals, and each transform uses parameters tailored to a certain type of display device. 


YUV is not an abbreviation of any English words but simply a symbol. Y represents brightness; U and V represent color difference and are the two components that make up the chrominance.
  The importance of the YUV representation is that its luminance signal (Y) and chrominance signals (U, V) are mutually independent: the black-and-white grayscale image formed by the Y component and the two monochrome images formed by the U and V signals are independent of each other. Since Y, U and V are independent, these monochrome images can be encoded separately. In addition, black-and-white TVs can receive color TV signals by taking advantage of this independence between the YUV components.
  For example: to store an RGB 8:8:8 color image, i.e. one in which the R, G and B components are each represented by an 8-bit binary number (1 byte), at a size of 640×480 pixels, the required storage is 640×480×(1+1+1) = 921,600 bytes, i.e. 900 KB, where (1+1+1) means R, G and B each occupy one byte. 


If YUV is used to represent the same color image, the Y component is still 640×480 with 8 bits per sample, while every four adjacent pixels (a 2×2 block) share one U value and one V value. The storage needed for the same image then drops to 640×480×(1+1/(2×2)+1/(2×2)) = 460,800 bytes, i.e. 450 KB. The data is compressed by half.
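The storage arithmetic above can be checked with a small Python sketch (the helper names are illustrative, not from the text):

```python
def rgb_size(width, height, bytes_per_component=1):
    # RGB 8:8:8 stores R, G and B at full resolution for every pixel.
    return width * height * 3 * bytes_per_component

def yuv420_size(width, height, bytes_per_component=1):
    # Y at full resolution; U and V each shared by a 2x2 block of pixels.
    return int(width * height * (1 + 1 / 4 + 1 / 4) * bytes_per_component)

print(rgb_size(640, 480))     # 921600 bytes (900 KB)
print(yuv420_size(640, 480))  # 460800 bytes (450 KB)
```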


Whether a color image is represented with YIQ, YUV, YCrCb or another model, since virtually all displays are driven by RGB values, the color component values must be converted back to RGB before each pixel is displayed.
For TV, after accounting for the nonlinear characteristics of the human visual system and of the TV cathode ray tube (CRT), the correspondence between RGB and YUV can be approximated by the following equations: 


namely:
  Y = 0.30R + 0.59G + 0.11B
  U = B − Y
  V = R − Y
  For computers, the color-space transform in the digital domain differs from the analog-domain transform used for TV. The components are denoted Y, Cr and Cb, and the conversion from RGB space is as follows:


As can be seen, the computed Y, Cr and Cb components contain many fractional values, i.e. floating-point numbers, which would lead to a large number of floating-point operations during JPEG encoding. Of course, with suitable optimization these floating-point operations can be rewritten as shift-and-add operations that computers can process much faster.
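As an illustration, a straightforward floating-point RGB→YCbCr conversion might look like the sketch below. The coefficients are the commonly published JFIF values, supplied here as an assumption since the article's conversion matrix image is missing:

```python
def rgb_to_ycbcr(r, g, b):
    # Floating-point JFIF-style conversion; a real encoder would typically
    # replace these multiplications with scaled-integer shift-and-add math.
    y  =  0.299  * r + 0.587  * g + 0.114  * b
    cb = -0.1687 * r - 0.3313 * g + 0.5    * b + 128
    cr =  0.5    * r - 0.4187 * g - 0.0813 * b + 128
    return y, cb, cr

print(rgb_to_ycbcr(255, 255, 255))  # white -> approximately (255.0, 128.0, 128.0)
```

Note that Cb and Cr are offset by 128 so they fit in the same unsigned 8-bit range as Y.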
  The inverse transformation relationship between RGB and YCrCb can be written in the following form:


Generally speaking, the content above mainly concerns the original image, which can be processed in the color space first to reduce the amount of captured image data.
  Please note that, in fact, the JPEG algorithm has nothing to do with color space. Color space is a problem related to image sampling, and it has no direct relationship with data compression.
  Therefore the "RGB to YUV" and "YUV to RGB" transforms are not part of the JPEG algorithm proper. The color images processed by the JPEG algorithm are separate color-component images, so it can compress data from different color spaces such as RGB, YCbCr and CMYK. 
2. Color depth
  An image is composed of many points, and the number of bits used to store each pixel is called the pixel depth. This value can differ from picture to picture, which makes the amount of data in a picture larger or smaller.
  Each pixel of a color image is represented by three components, R, G and B. If each component uses 8 bits, one pixel is represented by 3×8 = 24 bits; the pixel depth is then 24 bits, and each pixel can be one of 2^24 = 16,777,216 colors. The more bits used per pixel, the more colors it can express.
  When a pixel of a color image is represented in binary, one or more bits are often added beyond the fixed R, G and B components as attribute bits. For example, when RGB 5:5:5 represents a pixel, it uses 2 bytes (16 bits) in total, of which R, G and B each occupy 5 bits and the remaining bit is used as an attribute bit. In this case the pixel depth is 16 bits and the image depth is 15 bits.
  When a pixel is represented by 32 bits with R, G and B at 8 bits each, the remaining 8 bits are often called the alpha channel, or overlay, interrupt, or attribute bits. Their use can be illustrated with premultiplied alpha. If the four components of a pixel (A, R, G, B) are all normalized values, then (A, R, G, B) = (1, 1, 0, 0) displays red. When the pixel is (0.5, 1, 0, 0), premultiplication gives (0.5, 0.5, 0, 0): where the former displayed red at full intensity, the red is now displayed at half intensity.
  This alpha value is used here to indicate how the pixel will produce special effects.
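The premultiplied-alpha example above can be sketched in a few lines of Python (the function name is illustrative):

```python
def premultiply(a, r, g, b):
    # Premultiplied alpha: scale each color component by the alpha value.
    return (a, a * r, a * g, a * b)

print(premultiply(1.0, 1.0, 0.0, 0.0))  # (1.0, 1.0, 0.0, 0.0) -> full-intensity red
print(premultiply(0.5, 1.0, 0.0, 0.0))  # (0.5, 0.5, 0.0, 0.0) -> red at half intensity
```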
  Generally speaking, the greater an image's width, height and resolution, the more pixels it contains and the larger the image file; and the deeper the pixel depth, the more bits express the color and brightness of a single pixel and, again, the larger the image file.
  An image with only the two colors black and white is called a monochrome image. Each pixel is stored in 1 bit, with value "0" or "1". A 640×480 monochrome image therefore occupies 640×480/8 = 38,400 bytes, i.e. 37.5 KB of storage.
A grayscale image, i.e. a black-and-white image with shades, stores each pixel's value in one byte instead of one bit; the number of gray levels is then 256, and each pixel can take any value from 0 to 255. A 640×480 grayscale image occupies 300 KB of storage; this is similar to the Y component mentioned above. 
  3. Discrete cosine transform DCT
  To convert the image from the color domain to the frequency domain, the commonly used transformation methods are:


The 8×8 forward DCT used by JPEG is:

F(u, v) = (1/4) C(u) C(v) Σ(i=0..7) Σ(j=0..7) f(i, j) cos[(2i+1)uπ/16] cos[(2j+1)vπ/16]

where C(k) = 1/√2 for k = 0 and C(k) = 1 otherwise. After f(i, j) is DCT-transformed, F(0, 0) is the DC coefficient and the others are AC coefficients.
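As an illustrative sketch (not part of the original text), the forward 8×8 DCT can be written naively in Python, favoring clarity over speed:

```python
import math

def dct2d(block):
    # Naive O(n^4) forward 8x8 DCT-II, as used by JPEG.
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0
    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = 0.0
            for i in range(8):
                for j in range(8):
                    s += (block[i][j]
                          * math.cos((2 * i + 1) * u * math.pi / 16)
                          * math.cos((2 * j + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out

# A flat block of constant value 1 transforms to a single DC coefficient of 8
# (all AC coefficients are ~0), illustrating the energy-concentration property.
flat = [[1] * 8 for _ in range(8)]
coeffs = dct2d(flat)
print(round(coeffs[0][0]))  # 8
```

Real encoders use fast factored DCTs rather than this quadruple loop; the output is the same.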
  Let me illustrate with an example.
  8x8 original image: 


After shifting each sample by 128 so the values lie in the range −128 to 127:

Use discrete cosine transform and round to the nearest integer:

The figure above shows the DCT coefficient block obtained by converting the sample block from the spatial domain to the frequency domain.
DCT converts a block of original image information into a set of coefficients representing different frequency components, which has two advantages. First, the signal tends to concentrate most of its energy in a small range of the frequency domain, so the unimportant components require only a few bits to describe. Second, the frequency-domain decomposition maps onto the processing performed by the human visual system and allows the subsequent quantization to match its sensitivity.
  When u, v = 0: if the coefficient after the forward discrete cosine transform (DCT) is F(0, 0) = 1, the function reconstructed by the inverse transform (IDCT) is f(x, y) = 1/8, a constant value, so F(0, 0) is called the direct-current (DC) coefficient. When u, v ≠ 0, a nonzero coefficient F(u, v) reconstructs to a function f(x, y) that is not constant, so the coefficients F(u, v) of the forward transform are called alternating-current (AC) coefficients.
  The 64 DCT frequency coefficients correspond to the 64 pixels of the block before DCT; there are 64 values both before and after, showing that the DCT itself is just a lossless transform with no compression.
  In the DCT coefficient blocks of a typical image, the spectral energy is almost entirely concentrated in the top-left coefficients.
  The direct-current (DC) coefficient at the top-left corner of the frequency coefficient matrix output by DCT has the largest amplitude, −415 in the figure. Moving down and to the right from the DC coefficient, the farther a coefficient is from the DC component, the higher its frequency and the smaller its amplitude; the bottom-right value in the figure is 2. That is, most of the image information is concentrated in the DC coefficient and the nearby low-frequency spectrum, while the high-frequency spectrum far from the DC coefficient contains almost no image information, often only noise.
  Although DCT itself does not compress, it lays an indispensable foundation for the discarding and rounding performed in the subsequent compression steps. 
  4. Quantization
  The quantization process is actually an optimization of the DCT coefficients: it takes advantage of the human eye's insensitivity to the high-frequency parts to greatly simplify the data.
  Quantization simply divides each frequency-domain component by the constant assigned to that component, then rounds to the nearest integer.
  This is the main lossy operation in the whole process.
As a result, many high-frequency components are rounded to 0, and most of the remaining ones become small positive or negative numbers.
  The purpose of the whole quantization is to reduce the magnitude of non-"0" coefficients and to increase the number of "0" valued coefficients.
  Quantization is the biggest cause of image quality degradation.
  Because the human eye is more sensitive to luminance than to color difference, two quantization tables are used: a luminance quantization table and a chrominance quantization table.


Use this quantization matrix with the matrix of DCT coefficients obtained earlier:

For example, take −415 (the DC coefficient), divide it by the corresponding table entry, and round to the nearest integer.

In general, this process acts as a low-pass filter on the spatial-domain image: fine quantization is used for the Y component and coarse quantization for the U and V components.
  The quantization table is the key to controlling the JPEG compression ratio. This step removes some high-frequency content. Another important point is that all pictures have gradual color transitions between points, so a large amount of the image information is contained in the low frequencies; after quantization, the high-frequency band contains long runs of consecutive zeros. 
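The quantization step can be sketched as follows. The table used is the example luminance table from Annex K of the JPEG standard; real encoders typically scale it by a quality factor:

```python
# Example luminance quantization table (JPEG standard, Annex K).
LUMA_QT = [
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
]

def quantize(coeffs):
    # Divide each DCT coefficient by its table entry and round; this is the
    # lossy step that zeroes out most of the high-frequency coefficients.
    return [[round(coeffs[i][j] / LUMA_QT[i][j]) for j in range(8)]
            for i in range(8)]

# The DC coefficient -415 from the example above becomes round(-415.37/16) = -26.
demo = [[0.0] * 8 for _ in range(8)]
demo[0][0] = -415.37
print(quantize(demo)[0][0])  # -26
```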
  5. "Z"-shape (zig-zag) arrangement
  The quantized data has a notable feature: the DC component is larger than the AC components, and the AC components contain a large number of 0s. The question is how to organize the quantized data so that it can be compressed further.
This leads to the "Z" shape arrangement, as shown in the figure:

The result of the "Z"-shape scan of the previously quantized coefficients is: 
  −26, −3, 0, −3, −3, −6, 2, −4, 1, −4, 1, 1, 5, 1, 2, −1, 1, −1, 2, 0, 0, 0, 0, 0, −1, −1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
  The feature of this ordering is that it produces long runs of 0s, which is very amenable to simple, intuitive run-length encoding (RLE: Run Length Encoding).
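The zig-zag scan order itself can be sketched as follows (helper names are illustrative): traverse the anti-diagonals of the block, alternating direction, so that low-frequency coefficients come first:

```python
def zigzag_indices(n=8):
    # Build the zig-zag traversal order for an n x n block.
    order = []
    for s in range(2 * n - 1):  # s = i + j is constant along each anti-diagonal
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        # Even diagonals run up-right (i decreasing), odd ones down-left.
        order.extend(diag if s % 2 else list(reversed(diag)))
    return order

def zigzag(block):
    # Flatten a 2-D block into the 1-D zig-zag sequence.
    return [block[i][j] for i, j in zigzag_indices(len(block))]

print(zigzag_indices()[:6])  # [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
```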
  The DC coefficients of the 8×8 blocks obtained after DCT have two characteristics: their values are relatively large, and the DC values of adjacent 8×8 blocks change little. Exploiting this, the JPEG algorithm uses Differential Pulse Code Modulation (DPCM) to encode the difference (delta) of the quantized DC coefficients between adjacent blocks. That is, the similarity of adjacent image blocks is used to simplify the data once again.
  That is, the DC component −26 above is processed separately.
  For the other 63 elements, the zig-zag ("Z"-shape) scan is used so that run-length encoding can exploit the long runs of consecutive 0s. 
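A minimal, illustrative sketch of DPCM over the DC coefficients (function names are hypothetical):

```python
def dpcm_encode(dc_values):
    # Store each block's DC value as a difference from the previous block's.
    prev, diffs = 0, []  # JPEG resets the predictor to 0 at the start of a scan
    for dc in dc_values:
        diffs.append(dc - prev)
        prev = dc
    return diffs

def dpcm_decode(diffs):
    # Invert the encoding by accumulating the differences.
    out, prev = [], 0
    for d in diffs:
        prev += d
        out.append(prev)
    return out

print(dpcm_encode([-26, -24, -25, -23]))  # [-26, 2, -1, 2]
```

Because adjacent DC values are similar, the differences are small numbers that need fewer bits than the raw values.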
  6. Run-length coding
  Run-Length Encoding (RLE), also known as "run-length coding" or "stroke coding", is a lossless compression encoding.
  For example: 5555557777733322221111111
  A feature of this data is that the same value repeats many times in a row, so the string can be recorded in simplified form, such as
  (5, 6) (7, 5) (3, 3) (2, 4) (1, 7)
which is its run-length code.
  The number of bits in the run-length encoding will be far less than the number of bits in the original string.
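A minimal run-length encoder for digit strings like the one above might look like this (illustrative sketch):

```python
def rle(s):
    # Collapse consecutive repeats into (value, count) pairs.
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return [(ch, n) for ch, n in runs]

print(rle("5555557777733322221111111"))
# [('5', 6), ('7', 5), ('3', 3), ('2', 4), ('1', 7)]
```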
  For data arranged in "Z" shape, run-length coding can be used to compress the data greatly.
  Let's elaborate with a simple example:
  57, 45, 0, 0, 0, 0, 23, 0, -30, -16, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, .., 0
  can be represented as
  (0, 57); (0, 45); (4, 23); (1, -30); (0, -16); (2, 1); EOB
  That is, the first number of each pair gives the count of preceding 0s. To simplify subsequent processing, this count is stored in 4 bits, so it can only be 0–15; this is a characteristic of JPEG's run-length encoding. 
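JPEG-style AC run-length coding can be sketched as follows. This is a simplified illustration: real JPEG caps a run at 15 zeros with a special (15, 0) ZRL symbol, which this sketch omits:

```python
def ac_rle(coeffs):
    # Emit each nonzero AC coefficient as (zero_run, value); a trailing run of
    # zeros collapses into a single end-of-block marker ("EOB").
    pairs, zeros = [], 0
    for c in coeffs:
        if c == 0:
            zeros += 1
        else:
            pairs.append((zeros, c))
            zeros = 0
    if zeros:
        pairs.append("EOB")
    return pairs

ac = [57, 45, 0, 0, 0, 0, 23, 0, -30, -16, 0, 0, 1] + [0] * 50
print(ac_rle(ac))
# [(0, 57), (0, 45), (4, 23), (1, -30), (0, -16), (2, 1), 'EOB']
```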
  7. Canonical Huffman coding
  After the DC coefficients are DPCM-coded and the AC coefficients are RLE-coded as above, the resulting data can be compressed one step further with Huffman coding.
  Canonical Huffman coding is widely used: many popular compression formats rely on it, such as GZIP, zlib, PNG, JPEG, MPEG and so on.
  For the RLE result in the example above, JPEG does not store the values directly; this is mainly done to improve efficiency. 

For the example above, we get:
  57 falls in category 6; the bits actually saved are 111001, so it is encoded as (6, 111001)
  45 is encoded as (6, 101101)
  23 is (5, 10111)
  −30 is (5, 00001)
  At this point, the earlier example becomes:
  (0, 6), 111001; (0, 6), 101101; (4, 5), 10111; (1, 5), 00001; (0, 5), 01111; (2, 1), 1; (0, 0)
  In this way, the pair in parentheses packs exactly into one byte: the upper 4 bits hold the number of preceding 0s, and the lower 4 bits give the bit count of the value that follows; the encoded values can represent the range −32767..32767.
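The category and bit-pattern computation can be sketched like this (illustrative helpers, using the offset rule for negative values described above):

```python
def category(v):
    # Smallest number of bits needed to represent |v|.
    return abs(v).bit_length()

def magnitude_bits(v):
    # The bits actually stored: positive values as-is, negative values
    # offset by 2^size - 1 (the one's-complement-style pattern JPEG uses).
    size = category(v)
    if v < 0:
        v += (1 << size) - 1
    return format(v, "0{}b".format(size)) if size else ""

print(category(57), magnitude_bits(57))    # 6 111001
print(category(-30), magnitude_bits(-30))  # 5 00001
print(category(-16), magnitude_bits(-16))  # 5 01111
```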


Reference:

http://www.360doc.com/content/12/0606/19/10144181_216456484.shtml
