[From scratch] Understanding video codec technology


auxten

Co-founder of CovenantSQL


Reprinted from: https://github.com/leandromoreira/digital_video_introduction

An image can be viewed as a two-dimensional matrix. If we take color into account, we can generalize: think of the image as a three-dimensional matrix, where the extra dimension stores the color information.

If we choose the three primary colors (red, green, blue) to represent these colors, this defines three planes: the first is the red plane, the second is the green plane, and the last is the blue plane.

We call each point in this matrix a pixel (picture element). The color of a pixel is represented by the intensity (usually a numerical value) of each of the three primary colors. For example, a red pixel has green at 0 intensity, blue at 0 intensity, and red at maximum intensity. A pink pixel is a combination of the three colors: if the intensity range is 0 to 255, then red 255, green 192 and blue 203 make pink.
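
To make this concrete, here is a minimal sketch (assuming Python with numpy) of an image as a three-dimensional matrix, including the pink pixel from the example:

```python
import numpy as np

# A 2x2 RGB image as a 3D matrix: height x width x color planes (R, G, B)
image = np.zeros((2, 2, 3), dtype=np.uint8)

image[0, 0] = [255, 0, 0]      # a pure red pixel
image[0, 1] = [255, 192, 203]  # a pink pixel (red 255, green 192, blue 203)

red_plane = image[:, :, 0]     # the first "plane" holds the red intensities
print(image[0, 1], red_plane.shape)  # -> [255 192 203] (2, 2)
```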

Other Ways to Encode Color Images
There are many other models that can represent colors and thus compose images. For example, we can use an indexed palette where each color is identified by a number (as shown in the figure below), so that each pixel needs only one byte instead of the 3 usually required by the RGB model. In such a model we can use a two-dimensional matrix instead of a three-dimensional one, which saves storage space but limits the number of available colors.

Take the pictures below as an example. The first one contains all the color planes; the remaining ones are the red, green, and blue planes (shown as gray tones), respectively.

We can see that the red plane contributes most to the intensity of the final image (it is the brightest of the three planes), while the blue plane (the last image) contributes mostly only to Mario's eyes and parts of his clothes. Notice that all color planes contribute little to Mario's mustache (the darkest part).

A certain amount of storage is needed to hold each color intensity, and this size is called the color depth. If the intensity of each color (plane) takes 8 bits (values from 0 to 255), the color depth is 24 (8*3) bits, and it follows that we can use 2^24 different colors (about 16.7 million).

Great learning material: How real-world photos are made of 0's and 1's .

Another property of an image is resolution , the number of pixels in a plane. It is usually expressed as width * height, such as the  4x4  picture below.

Do It Yourself: Playing with Images and Colors
You can play with images using jupyter (python, numpy, matplotlib, etc.). You can also learn the principles of image filters (edge detection, smoothing, blurring...).

Another attribute of an image or video is the aspect ratio , which simply describes the proportional relationship between the width and height of an image or pixel.

When people say that this movie or photo is 16:9, they usually mean Display Aspect Ratio (DAR), however we can also have individual pixels of different shapes, which we call Pixel Aspect Ratio (PAR).

DVD's DAR is 4:3
Although the actual resolution of a DVD is 704x480, it still maintains a 4:3 aspect ratio because it has a PAR of 10:11 (704x10/480x11).

Now we can define a video as n consecutive frames per unit time, which can be regarded as a new dimension; n is then the frame rate, and if the unit of time is seconds it is equal to FPS (Frames Per Second).

The amount of data per second required to play a video is its bit rate (often written as bitrate).

bitrate = width * height * color depth * frames per second

For example, a video with 30 frames per second, 24 bits per pixel, and a resolution of 480x240 would require 82,944,000 bits per second or 82.944 Mbps (30x480x240x24) if we did not do any compression  .
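
As a quick sanity check, a small Python snippet can reproduce this arithmetic:

```python
def bitrate_bps(width, height, color_depth_bits, fps):
    """Uncompressed bitrate = width * height * color depth * frames per second."""
    return width * height * color_depth_bits * fps

bps = bitrate_bps(480, 240, 24, 30)
print(bps, "bits per second")   # 82944000
print(bps / 1_000_000, "Mbps")  # 82.944
```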

When the bitrate is nearly constant it is called constant bitrate ( CBR ); but it can also vary and is called variable bitrate ( VBR ).

This graphic shows a limited VBR that doesn't consume much data volume when the frame is black.

Early on, engineers came up with a technique to double the perceived frame rate of a video without consuming extra bandwidth. This technique is known as interlacing; roughly speaking, it sends one half of the screen's lines at one point in time, and at the next point in time it sends the lines that fill the other half.

Today's screen rendering mostly uses progressive scan technology . This is a method of displaying, storing, and transmitting moving images where all the lines in each frame are drawn sequentially.

Now we have an idea of how a digital image is represented, how its colors are arranged, how many bits per second it takes to show a video given its frame rate and resolution, whether that bitrate is constant (CBR) or variable (VBR), and we have seen a few other concepts such as interlacing and PAR.

Do it yourself: Check video properties
You can use ffmpeg or mediainfo to check the interpretation of most properties .

Eliminate redundancy

We realized that we couldn't do without compressing the video; a single hour-long video at 720p and 30fps would require 278GB* . Just using a lossless data compression algorithm -- such as DEFLATE (used by PKZIP, Gzip, and PNG) -- cannot sufficiently reduce the bandwidth required for video, and we need to find other ways to compress video.

*We get this number by multiplying 1280 x 720 x 24 x 30 x 3600 (width, height, bits per pixel, fps and the number of seconds in an hour).

To do this, we can exploit several kinds of redundancy:
  • Properties of our vision: we are much more sensitive to brightness than to color.
  • Repetition in time: a video contains many images that change only a little from one to the next.
  • Repetition within the image: each frame also contains many regions with the same or similar colors.

Color, brightness and our eyes

Our eyes are more sensitive to brightness than to color , you can check out the picture below to test for yourself.

If you can't see that squares A and B on the left are the same color, that's fine: our brain plays a trick on us so that we pay more attention to light and dark than to color. On the right there is a connector of the same color joining the two squares, so we (our brain) can easily tell that they are, in fact, the same color.

A brief explanation of how our eyes work
The eye is a complex organ with many parts, but the ones we are most interested in are the cones and rods. The eye has about 120 million rods and 6 million cones .
To simplify, let's map color and brightness onto these parts of the eye: rods are mostly responsible for brightness, while cones are responsible for color. There are three types of cones, each with a different pigment: S-cones (blue), M-cones (green) and L-cones (red).
Since we have many more rods (brightness) than cones, a reasonable inference is that we have a better ability to distinguish darkness from light than color.

Once we know we are more sensitive to luminance (brightness in images), we can take advantage of that.

Color Model

The principles of color images that we first looked at used  the RGB model , but there are others. There is a model that separates luma (brightness) from chroma (color), and it's called  YCbCr* .

* There are many models that do the same separation.

This color model uses  Y  to represent lightness, and two color channels: Cb (blue chroma) and Cr (red chroma). YCbCr can be converted from RGB or converted back to RGB. Using this model we can create images with full color, as shown below.

Conversion between YCbCr and RGB

One might ask, how do we represent all the colors without using green (chroma) ?

To answer this question, we will introduce the conversion from RGB to YCbCr. We will use   the coefficients from the standard BT.601 recommended by the ITU-R group *.

The first step is to calculate the luma; we will use the constants suggested by the ITU and substitute the RGB values.

Y = 0.299R + 0.587G + 0.114B

Once we have the luminance, we can split the color (blue chroma and red chroma):

Cb = 0.564(B - Y)
Cr = 0.713(R - Y)

# We can also convert back from YCbCr to RGB, and even get green back.
R = Y + 1.402Cr
B = Y + 1.772Cb
G = Y - 0.344Cb - 0.714Cr
*Organizations and standards are common in the digital video space, and they often define what is a standard, for example, what is 4K? What frame rate should we use? resolution? color model?
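
To make the formulas concrete, here is a minimal sketch in plain Python that applies the BT.601 coefficients above to one pixel and converts it back:

```python
def rgb_to_ycbcr(r, g, b):
    """BT.601 conversion (values in 0..255, analog form without offsets)."""
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.564 * (b - y)
    cr = 0.713 * (r - y)
    return y, cb, cr

def ycbcr_to_rgb(y, cb, cr):
    r = y + 1.402 * cr
    b = y + 1.772 * cb
    g = y - 0.344 * cb - 0.714 * cr
    return r, g, b

y, cb, cr = rgb_to_ycbcr(255, 192, 203)             # the pink pixel from before
print(round(y), round(cb), round(cr))               # luma and the two chroma channels
print([round(v) for v in ycbcr_to_rgb(y, cb, cr)])  # ~[255, 192, 203]
```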

Typically, displays (monitors, TVs, screens, etc.) only use the RGB model, organized in different layouts; see these close-ups:


Chroma Subsampling
Once we can separate luminance and chrominance from an image, we can selectively remove information by taking advantage of the fact that the human visual system is more sensitive to luminance than chrominance. Chroma subsampling is a technique for encoding images with lower chroma resolution than luma .


How much should we reduce the chroma resolution? There are already schemes that define how to handle the resolution and the merging (final color = Y + Cb + Cr).
These schemes are known as subsampling systems and are expressed as a three-part ratio, a:x:y, which defines the chroma resolution in relation to an a x 2 block of luma samples, where:

  • a is the horizontal sampling reference (usually 4),
  • x is the number of chroma samples for the first row (relative to the horizontal resolution of a ),
  • y is the number of chroma samples for the second row.
One exception is 4:1:0, which provides one chroma sample per 4 x 4 block of luma plane resolution.

Common schemes used in modern codecs are: 4:4:4 (no subsampling), 4:2:2, 4:1:1, 4:2:0, 4:1:0 and 3:1:1.

YCbCr 4:2:0 Merging
This is a slice of an image merged using YCbCr 4:2:0, note that we only spend 12 bits per pixel.


Below is the same image encoded with several of the main chroma subsampling schemes: the first row of images is the final YCbCr result, while the last row shows the chroma resolution. For such a small loss in quality, this is indeed a great win.


Earlier we calculated that we would need 278GB to store an hour-long video file at 720p resolution and 30fps*. If we use YCbCr 4:2:0 we can cut that size in half (139GB), but it is still far from ideal.
* We get this value by multiplying width, height, color depth and fps. Previously we needed 24 bits, now we only need 12 bits.
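
As a rough sketch (plain Python, interpreting a:x:y as described above and ignoring rounding at image borders), we can estimate the average bits per pixel of a few schemes and the resulting size of that hour of 720p video:

```python
def bits_per_pixel(a, x, y, bit_depth=8):
    """Average bits per pixel for an a:x:y chroma subsampling scheme.

    For every a x 2 block of luma samples there are (x + y) samples of
    each chroma plane (Cb and Cr)."""
    luma_bits = a * 2 * bit_depth
    chroma_bits = 2 * (x + y) * bit_depth
    return (luma_bits + chroma_bits) / (a * 2)

for scheme in [(4, 4, 4), (4, 2, 2), (4, 2, 0)]:
    bpp = bits_per_pixel(*scheme)
    gib = 1280 * 720 * bpp * 30 * 3600 / 8 / 2**30
    print(scheme, bpp, "bits/pixel ->", round(gib, 1), "GiB per hour")
# (4, 4, 4) -> 24 bpp, ~278 GiB;  (4, 2, 0) -> 12 bpp, ~139 GiB
```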


Do It Yourself: Check YCbCr Histogram
You can check YCbCr histogram with ffmpeg . This scene has more blue contribution, shown by the histogram .


Frame Types
Now we will go one step further and eliminate the temporal redundancy, but before that let's establish some basic terminology. Suppose we have a 30fps movie; these are its first 4 frames.


We can see a lot of repetition across the frames, like the blue background, which doesn't change from frame 0 to frame 3. To tackle this, we can abstractly categorize frames into three types.
I-Frame (Intra-coded, Keyframe)
An I-frame (intra-coded, keyframe) is a self-contained frame. It doesn't rely on anything else to be rendered; an I-frame looks similar to a static photo. The first frame is usually an I-frame, but we will also see I-frames inserted regularly among the other frame types.


P-frames (prediction)
P-frames take advantage of the fact that the current frame can almost always be rendered from a previous frame. For example, in the second frame the only change is that the ball has moved forward. Using only a reference to the previous frame plus the difference (the delta), we can reconstruct frame 1.


Do it yourself: Video with a single I-frame
Since P-frames use less data, why can't we encode the entire video with a single I-frame and the rest of the P-frames ?
After encoding this video, start watching it, and fast-forward to the end of the video , you will notice that it takes a while to actually jump to this part. This is because  P-frames require a reference frame (such as an I-frame) in order to render.
Another quick experiment you can do is to encode the video with a single I-frame, then encode again with an I-frame inserted every 2 seconds , and compare the size of the finished product .
B-frames (bidirectional prediction)
What about referencing both previous and subsequent frames to achieve even better compression?! Simply put, that is exactly what B-frames do.


Do it yourself: Compare videos using B-frames
You can generate two versions, one with B-frames and one without B-frames at all , and check the file size and quality.
Summary
These frame types are used to provide better compression ratios, and we'll see how this happens in the next chapter. Now, we can think of I-frames as being expensive, P-frames as being cheap, and the cheapest being B-frames.


Temporal Redundancy (Inter Prediction)
Let's explore the removal of temporal duplication . The technique to remove this type of redundancy is Inter Prediction .
We will try to spend fewer bits to encode frames 0 and 1, which are consecutive in time.


We can do a subtraction: we simply subtract frame 1 from frame 0 to get the residual, so we only need to encode the residual.


But there is an even better way to save bits. First, we treat frame 0 as a collection of blocks, and then we try to match the blocks of frame 0 with blocks in frame 1. We can think of this as motion estimation.
Wikipedia: Block Motion Compensation
"Motion compensation is a way of describing the difference between adjacent frames (adjacent in coding order, which is not necessarily adjacent in playback order): specifically, it describes how each small block of a previously coded frame moves to some position in the current frame."


We would expect the ball to move from x=0, y=25 to x=6, y=26; the x and y values form the motion vector. A further way to save bits is to encode only the difference between the two positions, so the final motion vector becomes x=6 (6-0), y=1 (26-25).
In practice, the ball would be cut into n partitions, but the process is the same.
Objects in a frame move in three dimensions; the ball, for instance, gets smaller as it moves into the background. It is normal not to find a perfect match when we look for a matching block. Here is an image of our motion estimation overlaid on the actual values.


But we can see that when we use motion prediction , the amount of data encoded is less than when using simple residual frame techniques.
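
Here is a minimal sketch of the block-matching idea behind motion estimation (Python with numpy; an exhaustive search with the sum of absolute differences as the cost, and a toy frame, block size and search range chosen just for illustration):

```python
import numpy as np

def best_match(prev_frame, block, top, left, search=8):
    """Exhaustive block matching: find the offset (dy, dx) at which `block`
    (taken from the current frame at (top, left)) best matches a region of
    `prev_frame`, using the sum of absolute differences (SAD) as the cost."""
    h, w = block.shape
    best = (0, 0, float("inf"))
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + h > prev_frame.shape[0] or x + w > prev_frame.shape[1]:
                continue
            sad = int(np.abs(prev_frame[y:y+h, x:x+w].astype(int) - block.astype(int)).sum())
            if sad < best[2]:
                best = (dy, dx, sad)
    return best

# Toy example: a bright 4x4 "ball" that moves 2 pixels to the right between frames.
prev_frame = np.zeros((32, 32), dtype=np.uint8)
cur_frame = np.zeros((32, 32), dtype=np.uint8)
prev_frame[10:14, 4:8] = 255
cur_frame[10:14, 6:10] = 255

block = cur_frame[10:14, 6:10]
print(best_match(prev_frame, block, 10, 6))  # -> (0, -2, 0): the block came from 2 pixels to the left
```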

You can play around with these concepts using jupyter .
Do It Yourself: Viewing Motion Vectors
We can use ffmpeg to generate a video that includes inter predictions (motion vectors) .


Or we can use  Intel® Video Pro Analyzer (you need to pay, but there is also a free trial version that can only view the first 10 frames).

How do video codecs work?

What is it? It is the software or hardware used to compress or decompress digital video. Why? People need to improve video quality under limited bandwidth or storage space. Remember when we calculated the bandwidth required for a video at 30 frames per second, 24 bits per pixel and 480x240 resolution? It was 82.944 Mbps without compression. Delivering HD/FullHD/4K over TV broadcast or the Internet is only possible thanks to video codecs. How does it work? We will briefly describe the main techniques below.

Video Codecs vs Containers
A common beginner mistake is to confuse a digital video codec with a digital video container. We can think of a container as a wrapper format that holds metadata about the video (and possibly audio), with the compressed video as its payload.
Usually, the extension of a video file indicates its container. For example, the file video.mp4 is probably an MPEG-4 Part 14 container, and a file named video.mkv is probably a Matroska container. To be completely sure about the codec and container format, we can use ffmpeg or mediainfo.

History

Before we jump into the inner workings of common codecs, let's step back and look at some older video codecs.

The video codec H.261 was born in 1990 (technically 1988) and was designed to work at data rates of 64 kbit/s. It already used concepts like chroma subsampling, macroblocks, and so on. In 1995, the H.263 video codec standard was published and continued to be developed until 2001.

In 2003, the first version of H.264/AVC was completed. In the same year, a company called TrueMotion released their royalty-free lossy video codec called VP3. In 2008, Google bought this company and released VP8 the same year. In December 2012, Google released VP9, which is supported by roughly 3/4 of the browser market (mobile browsers included).

AV1  is  a new video codec designed by the Alliance for Open Media (AOMedia)  consisting of companies such as Google, Mozilla, Microsoft, Amazon, Netflix, AMD, ARM, NVidia, Intel, Cisco, etc. It is royalty-free and open source. The first version  0.1.0 reference codec was released on April 7, 2016 .

The Birth of AV1
In early 2015, Google was working on VP10, Xiph (Mozilla) was working on Daala, and Cisco open-sourced its royalty-free video codec called Thor.
Then MPEG LA announced annual royalty caps for HEVC (H.265) that were 8 times higher than those for H.264, and soon afterwards they changed the terms again.


P.S. If you want to learn more about the history of codecs, you need to understand the basics behind video compression patents .

Generic codec

Next we describe the main mechanisms behind generic video codecs; most of these concepts are practical and used by modern codecs such as VP9, AV1 and HEVC. Be warned: we are going to simplify things a lot. Sometimes we will use real examples (mostly H.264) to demonstrate the techniques.

Step 1 - Image Partitioning

The first step is to divide the frame into several partitions , sub-partitions and even more.

But why? There are many reasons, for example, when we segment the image, we can handle the prediction more accurately, using smaller partitions for small moving parts and larger partitions for static backgrounds.

Typically, codecs organize these partitions into slices (or tiles), macroblocks (or coding tree units), and many sub-partitions. The maximum size of these partitions varies: HEVC uses 64x64 while AVC uses 16x16, but the sub-partitions can be divided down to 4x4.

Remember the classification of frames we learned about ? You can apply these concepts to blocks as well , so we can have I slices, B slices, I macroblocks, and so on.
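
As a simple sketch (Python with numpy, using a fixed 16x16 macroblock size as in AVC and ignoring the adaptive splitting that real codecs do), this is how a frame can be cut into blocks:

```python
import numpy as np

def split_into_blocks(frame, block_size=16):
    """Split a (height x width) luma frame into block_size x block_size macroblocks,
    returning (top-left position, block) pairs."""
    h, w = frame.shape
    blocks = []
    for top in range(0, h, block_size):
        for left in range(0, w, block_size):
            blocks.append(((top, left), frame[top:top + block_size, left:left + block_size]))
    return blocks

frame = np.random.randint(0, 256, (64, 64), dtype=np.uint8)  # a toy 64x64 luma frame
blocks = split_into_blocks(frame)
print(len(blocks), blocks[0][1].shape)                       # 16 macroblocks of 16x16 each
```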

Do It Yourself: View Partitions
We can also use  Intel® Video Pro Analyzer (for a fee, but there is also a free trial that only looks at the first 10 frames). Here is  an analysis of the VP9 partition .

Step Two - Prediction

Once we have the partitions, we can make predictions over them. For inter prediction, we need to send the motion vectors and the residual; for intra prediction, we need to send the prediction direction and the residual.
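
To illustrate intra prediction, here is a minimal sketch (Python with numpy) of a simple "vertical" mode that copies the row of pixels above the block, similar in spirit to one of the H.264 intra modes; the residual is whatever the prediction misses:

```python
import numpy as np

def intra_predict_vertical(above_row, block_size=4):
    """'Vertical' intra prediction: repeat the reconstructed row of pixels just above the block."""
    return np.tile(above_row, (block_size, 1))

above = np.array([100, 110, 120, 130], dtype=np.int16)   # neighboring pixels already decoded
actual = np.array([[101, 111, 119, 131],
                   [ 99, 112, 121, 129],
                   [100, 109, 120, 130],
                   [102, 110, 122, 128]], dtype=np.int16)

prediction = intra_predict_vertical(above)
residual = actual - prediction   # only the prediction direction and this small residual get encoded
print(residual)
```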

Step Three - Conversion

After we get the residual block (predicted partition - real partition), we can transform it in a way that tells us which pixels we can discard while still maintaining the overall quality. There are several transforms that provide this behavior.

Although there are other transforms , we focus on the discrete cosine transform (DCT). The main functions of DCT  are:

  • It converts a block of pixels into a block of frequency coefficients of the same size.
  • It compacts the energy, making it easier to eliminate spatial redundancy.
  • It is reversible, which means you can get the pixels back.
On February 2, 2017, FM Bayer and RJ Cintra published their paper: A DCT-like transform for image compression requires only 14 additions .

Don't worry if you don't understand the benefit of each point, we'll try to run some experiments to see the real value.

Let's look at the following pixel block (8x8):

Here is its rendered block image (8x8):

When we apply DCT to this block of pixels  , we get the following block of coefficients (8x8):

 

Then if we render this coefficient block, we get this image:

As you can see, it looks nothing like the original image, but we might notice that the first coefficient is very different from all the others. This first coefficient is known as the DC component and represents all the samples in the input array, somewhat like an average.

This block of coefficients has an interesting property: the high-frequency part is separated from the low-frequency part.

In an image, most of the energy is concentrated in the low frequencies, so if we convert the image into frequency coefficients and discard the high-frequency ones, we can reduce the amount of data needed to describe the image without sacrificing too much image quality.

Frequency refers to the speed at which a signal changes.

Let's learn this through an experiment: we will use the DCT to convert the original image into its frequency coefficients and then discard the least important ones.

First, we convert it to its frequency domain .

 Then we discard part (67%) of the coefficients, mainly its lower right corner.

 

We then reconstruct the image from this partially discarded block of coefficients (remember, the transform must be invertible) and compare it with the original.

As we can see, it closely resembles the original image, though it introduces many differences from it. We discarded 67.1875% of the coefficients and still got something at least similar to the original. We could discard the coefficients more intelligently to get better image quality, but that is a topic for later.
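
Here is a minimal sketch of that experiment (Python with numpy and scipy; an 8x8 block and a crude "keep only the top-left quarter" rule standing in for the smarter selection a real codec would use):

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct2(block):
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def idct2(coeffs):
    return idct(idct(coeffs, axis=0, norm='ortho'), axis=1, norm='ortho')

# A toy 8x8 pixel block (a smooth gradient of values).
block = np.arange(64).reshape(8, 8).astype(float) * 3 + 40

coeffs = dct2(block)                  # frequency domain: energy piles up in the top-left
mask = np.zeros((8, 8))
mask[:4, :4] = 1                      # keep only low frequencies (25% of the coefficients)
reconstructed = idct2(coeffs * mask)  # invert the transform after discarding 75%

print(np.round(coeffs[:2, :2]))              # the big DC component sits at [0, 0]
print(np.abs(block - reconstructed).mean())  # the error stays small relative to the pixel values
```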

Using all pixels to form each coefficient
It is important to note that each coefficient does not map directly to a single pixel; each one is a weighted sum over all the pixels. This amazing graphic shows how the first and second coefficients are calculated, using a unique basis as the weights.


Source: https://web.archive.org/web/20150129171151/https://www.iem.thm.de/telekom-labor/zinke/mk/mpeg2beg/whatisit.htm
You can also try to visualize the DCT by looking at simple images formed from its basis functions. For example, here is the character A being formed by applying each coefficient weight.

Do it yourself: Discard different coefficients
You can play around with the DCT transform.


Step 4 - Quantization
When we discarded some coefficients in the previous step (the transform), we already did a form of quantization. In this step we selectively throw away information (the lossy part); put simply, we quantize the coefficients to achieve compression.
How do we quantize a block of coefficients? A simple approach is uniform quantization: we take the block, divide every coefficient by a single value (say 10), and round the result.


How do we reverse (dequantize) this block of coefficients? By multiplying each value by the same number (10) we divided by earlier.


This is not the best approach because it doesn't take the importance of each coefficient into account. Instead of a single value, we can use a quantization matrix that exploits the properties of the DCT, quantizing the lower-right part more and the upper-left part less. JPEG uses a similar approach; you can look at its source code to see this matrix.
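
A minimal sketch of uniform quantization and dequantization (Python with numpy; the step size of 10 matches the example above, and the rounding error is exactly the lossy part):

```python
import numpy as np

def quantize(coeffs, step=10):
    """Uniform quantization: divide by a step size and round to the nearest integer."""
    return np.round(coeffs / step).astype(int)

def dequantize(levels, step=10):
    """Reverse the process by multiplying back; the rounding error is lost forever."""
    return levels * step

coeffs = np.array([[-305, 12, 4, -1],
                   [  23, -9, 2,  0],
                   [   5,  1, 0,  0],
                   [  -2,  0, 0,  0]])

levels = quantize(coeffs)
restored = dequantize(levels)
print(levels)             # small integers: cheap to entropy-code
print(coeffs - restored)  # the quantization error (at most step/2 per value)
```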

Do it yourself: Quantization
You can play around with quantization.


Step 5 - Entropy Encoding
After we quantize the data (blocks/slices/frames), we can still compress it in a lossless way. There are many methods (algorithms) that can be used to compress data. We will briefly experience a few of them, and you can read this great book to understand them in depth: Understanding Compression: Data Compression for Modern Developers .
VLC Encoding
Let's assume we have a stream of symbols: a, e, r and t, with their probabilities (from 0 to 1) given by the table below.

We can assign different binary codes, (preferably) smaller codes for the most likely (occurring characters), and larger codes for the least likely (occurring characters).

Let's compress the stream eat. If we spent 8 bits per character, without any compression this would take 24 bits. But in this scheme, we replace each character with its code and save space.


The first step is to encode the character e as 10; the second character a is appended (not mathematically added), giving [10][0]; finally we encode the third character t, which results in the compressed bitstream [10][0][1110], or 1001110. This needs only 7 bits (about 3.4 times less space than the original).


Note that each code must be a unique prefix code; Huffman coding can help you find such codes. Despite some issues, video codecs still offer this method, and it is the compression algorithm behind many applications.
Both the encoder and the decoder must know this (encoded) character table, so you need to send this table too.
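
A minimal sketch of this prefix-code idea in plain Python (the codes for e, a and t follow the example above; the code assumed for r is only illustrative):

```python
# Prefix codes for the example symbols (shorter codes for more probable symbols).
CODES = {"a": "0", "e": "10", "r": "110", "t": "1110"}
DECODE = {code: symbol for symbol, code in CODES.items()}

def vlc_encode(stream):
    return "".join(CODES[symbol] for symbol in stream)

def vlc_decode(bits):
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in DECODE:          # prefix property: no code is a prefix of another
            out.append(DECODE[buffer])
            buffer = ""
    return "".join(out)

encoded = vlc_encode("eat")
print(encoded, len(encoded), "bits")  # 1001110, 7 bits instead of 24
print(vlc_decode(encoded))            # eat
```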


Arithmetic Coding
Let us assume we have a stream of symbols: a, e, r, s and t, with probabilities given by the following table.
Symbol:      a     e     r     s     t
Probability: 0.3   0.3   0.15  0.05  0.2

Given this table, we can build intervals containing all the possible symbols, ordered by how often they appear.


Let's encode the stream eat. We pick the subrange where the first character e lies, between 0.3 and 0.6 (not including 0.6), and then we divide this subrange again using the same proportions as before.


Let's continue encoding our stream eat: the second character a now lies in the interval 0.3 to 0.39, and then we encode the last character t in the same way, arriving at the final subrange 0.354 to 0.372.


We just need to choose a number within the last subrange, 0.354 to 0.372; let's pick 0.36, though we could choose any number in that subrange. With this number alone we will be able to recover the original stream eat. It is as if we drew a line within the nested intervals to encode our stream.

The reverse process (aka decoding) is just as easy, with the number  0.36  and our original interval, we can do the same thing, but now use this number to restore the encoded stream.
In the first interval, we found that the number falls into a subinterval, therefore, this subinterval is our first character, now we split this subinterval again and do the same process as before. We will notice that  0.36  falls into  the interval of a  , and we repeat this process until we get the last character  t (forming our original encoded stream eat).
Both the encoder and decoder must know the character probability table, therefore, you need to transmit this table too.
Pretty clever, isn't it? Some video codecs use this technique (or at least offer it as an option).
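
A minimal sketch of the interval narrowing in plain Python (using the probability table above, with the symbols laid out in an order that reproduces the subranges quoted in the example):

```python
# Cumulative ranges derived from the probabilities; this layout reproduces the
# subranges in the text (e: 0.3-0.6, then a: 0.3-0.39, then t: 0.354-0.372).
RANGES = {"a": (0.00, 0.30), "e": (0.30, 0.60),
          "t": (0.60, 0.80), "r": (0.80, 0.95), "s": (0.95, 1.00)}

def arithmetic_encode(stream):
    low, high = 0.0, 1.0
    for symbol in stream:
        span = high - low
        sym_low, sym_high = RANGES[symbol]
        low, high = low + span * sym_low, low + span * sym_high
    return low, high              # any number inside [low, high) encodes the stream

def arithmetic_decode(number, length):
    out, low, high = [], 0.0, 1.0
    for _ in range(length):
        span = high - low
        for symbol, (sym_low, sym_high) in RANGES.items():
            if low + span * sym_low <= number < low + span * sym_high:
                out.append(symbol)
                low, high = low + span * sym_low, low + span * sym_high
                break
    return "".join(out)

low, high = arithmetic_encode("eat")
print(round(low, 3), round(high, 3))  # 0.354 0.372
print(arithmetic_decode(0.36, 3))     # eat
```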
This article certainly leaves out many details, reasons and trade-offs about how to losslessly compress the quantized bitstream; as a developer you should learn more. People new to video coding can also look into other entropy coding algorithms such as ANS.

Do it yourself: CABAC vs CAVLC
You can generate two streams, one using CABAC and the other using CAVLC , and compare the time to generate each and the final size.


Step 6 - Bitstream Formatting
After all these steps, we need to pack the compressed frames together with the context needed to decode them. We must explicitly inform the decoder about the decisions taken by the encoder, such as color depth, color space, resolution, prediction information (motion vectors, intra prediction direction), profile, level, frame rate, frame type, frame number, and more.


We are going to take a quick look at the H.264 bitstream. The first step is to generate a minimal H.264* bitstream, which we can do using this repo and ffmpeg.

./s/ffmpeg -i /files/i/minimal.png -pix_fmt yuv420p /files/v/minimal_yuv420.h264
* By default ffmpeg adds all its encoding parameters as an SEI NAL; we will define what a NAL is shortly.

This command will generate a raw h264 bitstream with a single frame , 64x64 and colorspace yuv420 , using the image below as a frame .

H.264 Bitstream

The AVC (H.264) standard specifies that information will be transmitted in macroframes (in network terms), called  NAL (Network Abstraction Layer). The main goal of NAL is to provide a "network-friendly" representation of video, which must be suitable for TV (stream-based), Internet (packet-based), etc.

Synchronization markers are used to define the boundaries of NAL units. The value of each synchronization marker is fixed at  0x00 0x00 0x01 , except for the first marker, which has a value of  0x00 0x00 0x00 0x01 . If we run hexdump on the resulting h264 bitstream  , we can identify at least three NALs at the beginning of the file.

As we said before, the decoder needs to know not only the image data but also details of the video, such as the frame size, colors, parameters used, and so on. The first bits of each NAL define its classification and type.

Usually, the first NAL of the bitstream is  SPS , and this type of NAL is responsible for conveying general encoding parameters such as configuration, layer, resolution, etc.

If we skip the first sync marker, we can know the type of the first  NAL by decoding the first byte .

For example, the first byte after the synchronization marker is 01100111: the first bit (0) is the forbidden_zero_bit field, the next two bits (11) are the nal_ref_idc field, which indicates whether this NAL is a reference, and the remaining 5 bits (00111) are the nal_unit_type field, in this case an SPS (7) NAL unit.

The second byte (binary=01100100, hex=0x64, dec=100) of an SPS NAL is the profile_idc field, which tells us which profile the encoder used. In this case it is the constrained high profile, a high profile without support for B (bi-predictive) slices.

When we read the H.264 bitstream specification for the SPS NAL, we find many values, each with a parameter name, a category and a descriptor; for example, look at the fields pic_width_in_mbs_minus_1 and pic_height_in_map_units_minus_1.

Parameter name                    Descriptor
pic_width_in_mbs_minus_1          ue(v)
pic_height_in_map_units_minus_1   ue(v)

ue(v): unsigned integer, Exp-Golomb-coded

If we do some math with the values of these fields, we can work out the resolution. We can represent a width of 1920 using pic_width_in_mbs_minus_1 with the value 119, since (119 + 1) * macroblock_size = 120 * 16 = 1920; once again, to save space, we encode 119 instead of 1920.

If we inspect the generated video again with a binary view (e.g. xxd -b -c 11 v/minimal_yuv420.h264), we can skip ahead to the NAL that carries the frame itself.

We can see its first 6 bytes: 01100101 10001000 10000100 00000000 00100001 11111111. We already know that the first byte tells us the NAL type; in this case (00101) it is an IDR slice (5), and we can inspect it further:

According to the specification, we can decode the slice type ( slice_type ), frame number ( frame_num ) and other important fields.

To read the value of some fields (ue(v), me(v), se(v) or te(v)), we need a specific decoder based on Exponential-Golomb coding. This method is very efficient for encoding variable values, especially when there are many default values.
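
As a minimal sketch in plain Python, the ue(v) rule is: count the leading zeros, skip the terminating '1', read that many more bits, and the value is 2^zeros - 1 plus those bits:

```python
def decode_ue(bits, pos=0):
    """Decode one unsigned Exp-Golomb value ue(v) from a string of '0'/'1' bits.

    Returns (value, next_position)."""
    zeros = 0
    while bits[pos] == "0":          # count leading zeros
        zeros += 1
        pos += 1
    pos += 1                         # skip the terminating '1'
    suffix = bits[pos:pos + zeros]   # read `zeros` more bits
    pos += zeros
    return (1 << zeros) - 1 + (int(suffix, 2) if suffix else 0), pos

print(decode_ue("1"))              # -> (0, 1): small values take very few bits
print(decode_ue("0000001111000"))  # -> (119, 13): the pic_width_in_mbs_minus_1 for a width of 1920
```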

The values ​​of slice_type  and  frame_num in this video   are 7 (I slice) and 0 (first frame).

We can regard the bitstream as a protocol; if you want to learn more about this bitstream, please refer to the ITU H.264 specification. This macro diagram shows where the image data (the compressed YUV) resides.

We can explore other bitstreams, like the VP9 bitstream, H.265 (HEVC) or our new friend AV1. Are they all alike? No, but once you have learned one, it is much easier to learn the others.

Do it yourself: Check the H.264 bitstream
We can generate a single frame of video and   check its H.264 bitstream using mediainfo . In fact, you can even look at the source code for parsing h264(AVC) video streams .


We can also use  Intel® Video Pro Analyzer , which requires payment, but there is also a free trial version that can only view the first 10 frames, which is enough for learning purposes.


Looking back
We have seen that many modern codecs use the model we just learned. In fact, let's look at the block diagram of the Thor video codec; it contains all the steps we have studied. You should now have a much better understanding of innovations and papers in the field of digital video.


Earlier we calculated that we would need 139GB to store a one-hour video file at 720p and 30fps. If we use the techniques we have learned here, such as inter and intra prediction, transform, quantization, entropy coding and others, then, assuming we spend 0.031 bits per pixel, we can achieve the same perceived quality in only 367.82MB instead of 139GB.

We chose to spend 0.031 bits per pixel for this estimate.

How H.265 achieves better compression than H.264

Now that we know more about how codecs work, it's easy to understand how new codecs can deliver higher resolution video using less data volume.

We'll compare AVC and HEVC, keeping in mind that there's almost always a tradeoff between compression ratio and more CPU cycles (complexity).

HEVC has larger and more partition (and sub-partition) options than AVC, more intra prediction directions, improved entropy coding and more; all these improvements make H.265 able to compress about 50% more than H.264.

Common Architecture for Online Streaming

Progressive download and adaptive streaming

Content protection

We can secure the video with a simple token authentication system. Users need to have a valid token to play the video, and the CDN will reject requests from users without tokens. It is very similar to the authentication system of most websites.

Using only a token authentication system, users can still download and redistribute videos. DRM systems can be used to avoid this.

In practice, people usually use both technologies to provide authorization and authentication.

DRM

Main systems

DRM stands for Digital Rights Management, a method of providing copyright protection for digital media , such as digital video and audio. Although used in many contexts, it is not universally accepted .

Creators of content (mostly studios) want to protect their intellectual property and keep their digital media from unauthorized distribution.

We will describe DRM in a simple, abstract way

There is an existing piece of content C1 (for example an HLS or DASH video stream), a player P1 (for example shaka-clappr, exo-player or iOS) installed on a device D1 (for example a smartphone, TV or desktop/laptop), using a DRM system DRM1 (for example FairPlay Streaming, PlayReady or Widevine).

 Content C1 is encrypted by DRM1 with a symmetric key K1 to generate encrypted content C'1

The player P1 on device D1 has an asymmetric key pair consisting of a private key PRK1 (this key is protected (1) and only D1 knows its content) and a public key PUK1.

(1) Protected: this protection can be done in hardware, for example by storing the key in a special (read-only) chip that works like a black box for decryption, or in software (a lower level of security). DRM systems provide ways to identify which kind of protection a device uses.

When the player P1 wants to play the encrypted content C'1, it needs to negotiate with the DRM system DRM1: it sends the public key PUK1 to DRM1, and DRM1 returns K1 encrypted with PUK1. By design, the result is something only D1 can decrypt.

K1P1D1 = enc(K1, PUK1)

P1 uses its local DRM system (which may be implemented with dedicated hardware and software, such as an SoC). That system can decrypt the content using its private key PRK1: it first decrypts K1P1D1 to recover the symmetric key K1, and ideally K1 is never exported outside of (secure) memory.

K1 = dec(K1P1D1, PRK1)

 P1.play(dec(C'1, K1))
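
The whole flow can be modeled with a toy script (assuming Python with the third-party cryptography package; RSA-OAEP stands in for the device key pair and Fernet for the symmetric content key; this is only an illustration, not how any real DRM system is implemented):

```python
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes
from cryptography.fernet import Fernet

# Device D1: an asymmetric key pair (PRK1 stays on the device, PUK1 is shared).
prk1 = rsa.generate_private_key(public_exponent=65537, key_size=2048)
puk1 = prk1.public_key()

# DRM1: encrypt the content C1 with a symmetric key K1 -> C'1.
k1 = Fernet.generate_key()
c1_encrypted = Fernet(k1).encrypt(b"raw video bytes (C1)")

# License exchange: DRM1 wraps K1 with the device's public key PUK1.
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
k1p1d1 = puk1.encrypt(k1, oaep)            # K1P1D1 = enc(K1, PUK1)

# Player P1 on D1: unwrap K1 with PRK1, then decrypt and "play" the content.
k1_recovered = prk1.decrypt(k1p1d1, oaep)  # K1 = dec(K1P1D1, PRK1)
print(Fernet(k1_recovered).decrypt(c1_encrypted))  # P1.play(dec(C'1, K1))
```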

Forwarded from: leandromoreira/digital_video_introduction (github.com/leandromoreira/digital_video_introduction/blob/master/README-cn.md)
