Application Practice of Video Remote Control Based on 5G Networks: Low-Latency Video Technology and Applications

This talk is divided into three parts. The first part introduces the key technologies involved in low-latency video, including low-latency video codecs, video transmission, low-latency video processing frameworks, and video capture and display. The second part focuses on the requirements that weak networks in the 5G environment place on low-latency video, including weak-network detection, congestion control, and so on. The last part draws on actual test results to introduce application examples in scenarios such as remote port control and remote driving.

Text: Shen Can

Organized by: LiveVideoStack

The development of network technology has driven the discussion of low latency. In the past, network delay was relatively high and chip processing took time, so latency was never ideal. With advances in technology, improvements in chip processing power, and the development of networks, low-latency video can now be used in some special scenarios. So the main content I want to cover today is: first, lay out the problems of "low latency"; then introduce how to solve these problems and the related technologies; and finally introduce the application scenarios of low-latency video.

-01-

Problems Facing Low-Latency Video

[Slide: latency requirements in different scenarios]

Latency is a key technical indicator in video production. In ordinary face-to-face real-time communication, a delay of about 200 ms is generally satisfactory, because the response speed of human-to-human communication is not that fast. However, for human-to-machine or machine-to-machine interaction, the latency requirements are stricter, because a machine responds much faster than a human. For example, for a remotely controlled medium-speed engineering vehicle, a delay of about 100 ms usually meets the control requirements. A typical example is the remote control of the mechanical arm on the vehicle.

Games also have relatively high latency requirements: preferably below 100 ms, and ideally around 50 or 60 ms. Although a game is controlled by a human, it involves more intense scenes such as driving and shooting, and even a little delay can lose the game. So delay is very critical for user experience. In remote driving, the vehicle speed is usually around 60 km/h, and a delay of about 60 ms translates into a control error of about 1 meter. Another example is controlling a drone over a highway: the "smart road" requires that the operator be notified of accidents or obstacles ahead while controlling it, so the delay needs to reach about 30 ms to meet the user's requirements.
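To see where that 1-meter figure comes from, here is the arithmetic (just the numbers above converted to consistent units):

    \Delta d = v \cdot t_{delay} = 60~\mathrm{km/h} \times 60~\mathrm{ms} = 16.7~\mathrm{m/s} \times 0.06~\mathrm{s} \approx 1~\mathrm{m}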

[Slide: defining latency]

So how do you define latency?

End-to-end delay covers everything from video generation and capture to display (in cloud gaming scenarios, starting from the moment the video is generated): the video is encoded and sent, transmitted over the network, received, and finally displayed. This whole link constitutes the end-to-end delay. The delay of a middle segment can also be calculated separately, for example from the sending end to the receiving end, excluding the head and tail. Many links in the middle introduce delay, so the delay of each link needs to be measured, and it is necessary to consider what each indicator can achieve before the complete delay can be defined. The source of the content also matters: if the content is generated in the cloud, that delay must be counted too. Technical indicators therefore need to take all aspects into account. That is the definition.
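Put formally, end-to-end delay can be modeled as the sum of per-link delays (a simplified additive model; the stage names below follow the links discussed in the rest of this section):

    T_{e2e} = T_{capture} + T_{preproc} + T_{encode} + T_{send} + T_{network} + T_{receive} + T_{decode} + T_{postproc} + T_{display}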

[Slide: estimated delay of each link]

This list can be used to estimate how much delay each link contributes and which links are prone to delay. The first link is sampling. At 30 frames per second there is an interval of about 33 ms. This process is itself a source of delay, because capture happens at a fixed interval, and that fixed interval is the sampling delay. Next is media pre-processing, mainly the delay introduced by the ISP chip for operations such as noise reduction and distortion correction, where both computation and data transfer take time. This part depends on the chip's processing capability, and we can estimate it as well.

Then come the send buffer, network transmission, and receive buffer. These also take time, because network bandwidth is limited; if bandwidth were unlimited, transmission time could be ignored. The problems this causes will be described in detail later. This part is the main source of delay.

After that is post-processing. Post-processing, like pre-processing, belongs to media processing; its delay is relatively fixed and does not fluctuate much. The delay of the network portion in the middle fluctuates the most, changing with network conditions and the environment, and also depending on whether packets are lost. Finally there is display: after decoding, frames enter the display buffer and are shown, and there is also a waiting delay before display. Adding all of this up, with current system capabilities and excluding the sampling delay, the total is roughly 20 ms to several hundred ms.
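As a rough illustration of this kind of budgeting, here is a minimal Python sketch; every per-stage number is a placeholder assumption for a hypothetical 30 fps pipeline, not a measurement from the talk:

    # Rough end-to-end latency budget in milliseconds.
    # All values are illustrative assumptions, not measurements.
    budget_ms = {
        "capture_interval": 33,  # 1000 ms / 30 fps
        "isp_preprocess": 5,     # noise reduction, distortion correction
        "encode": 10,
        "send_buffer": 5,
        "network": 20,           # fluctuates with conditions and packet loss
        "receive_buffer": 5,
        "decode": 8,
        "postprocess": 5,
        "display_wait": 10,      # up to one refresh interval at 100 Hz
    }

    total = sum(budget_ms.values())
    print(f"end-to-end: {total} ms "
          f"({total - budget_ms['capture_interval']} ms excluding sampling)")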

[Slide: sources of delay]

Let's analyze the causes of delay one by one. The first is computational overhead. Processing involves complex numerical computation, such as white balance, noise reduction, encoding, and decoding. These operations are computationally intensive and take time, ranging from a few milliseconds to tens of milliseconds. This part is the overhead of computation.

The second source is transmission delay. The network itself has delay: no matter what the network is, light takes time to propagate and routers take time to process, so whatever transmission method is used, it takes time. Video data is relatively large; a large video frame ranges from hundreds of kilobits to several megabits, and it is transmitted over limited bandwidth. On a network of about 50 or 100 Mbps, transmitting a frame of hundreds of kilobits takes a few milliseconds, and a 2 Mbit frame takes tens of milliseconds. From the waveform diagram of video transmission data shown later, it can be seen that a video stream is unstable: the size of each frame differs considerably, so the time each frame needs in transit is uncertain, the arrival time of each frame differs rather than being fixed, and arrival times are therefore unpredictable. Uncompressed video reaches the level of about 1 Gbps, meaning the data is this large before encoding and after decoding, and displaying and processing it on the terminal device also takes time. So transmission delay is an aspect that must be considered.
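The serialization delay of a single frame can be estimated directly, using a simplified model that ignores propagation and queuing:

    t_{tx} = S / B, \qquad \text{e.g.}\quad 250~\mathrm{kbit} / 100~\mathrm{Mbps} = 2.5~\mathrm{ms}, \qquad 2~\mathrm{Mbit} / 100~\mathrm{Mbps} = 20~\mathrm{ms}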

There is also resistance to packet loss. When network quality is poor, a "price" must be paid to deal with network jitter and packet loss. Usually, to protect the user experience, that price is paid in delay: delay is added to resist packet loss, for example through retransmission. This part of the delay is variable and keeps changing with the network state. Then there is the delay of task scheduling. Stringing the above steps together requires arranging different modules and tasks like an assembly line. If the pipeline is not arranged well, there will be waiting: the data has not arrived yet, it cannot be processed, and processing can only start once it arrives. That waiting is also delay.

[Slide: the triangle of latency, bandwidth, and computing power]

Now look at the relationship between latency, bandwidth, and computing power. They form a triangle and constrain one another. If you limit computing power to save cost, delay may increase; if you give more computing power, delay or bandwidth occupation can drop. If bandwidth is ample, delay will certainly decrease, because less time is spent on transmission; with high bandwidth, packet-loss resistance is also easier, so the extra delay needed to resist weak networks is smaller. Delay can ultimately be expressed as a function of bandwidth, computing power, and video quality. High video quality costs more delay and more computing power; high computing power means high cost, and high bandwidth occupation also means high cost, so a balance must be found: under given quality and delay requirements, how much cost is reasonable to spend. It is not necessary to choose the best chip, the strongest processing power, or the highest bandwidth configuration. It is a balancing relationship, and it must be balanced and optimized. Of course, video quality and delay also conflict: if the quality requirement is high, the bit rate will be high.

[Slide: video bitrate waveforms under different content]

Look at the plot of the video streams. Different lines represent different conditions, some games, some speeches and chats. The curves change under different conditions, and the bit stream is not very regular: although there is a long-term average, in the short term it moves up and down. This is because the complexity of the video varies, and content complexity is tied to the bit stream. The more complicated the content, or the more violently the camera shakes, or the more the distance changes, the more the bit rate naturally rises; a bandwidth cap (say 4 Mbps) cannot hold it down. The encoder must keep quality within a certain range, so when the amount of information is large, the bit stream inevitably grows. If the bit rate is forced down instead, the encoder's output quality becomes poor and you will see many unclear, blurred pictures. Content complexity also includes complexity within a frame: if the content inside a frame is very rich, it also drives up the output bit stream. So the video stream fluctuates quite violently.

Another factor that affects the bit rate is the video's I frames and P frames. A key frame is generally 5 or 10 times the size of a P frame, depending on content complexity and intra-frame complexity. If intra-frame complexity is high, 100 or 200 kbit is normal for an I frame, while a P frame may be only one tenth of that. Because the stream fluctuates violently, each frame spends a different time in transit; even when the network is good, reception times differ and the arrival delay varies. Some frames may be very large and must be split into hundreds of packets, and transmission is not complete until every packet has arrived, so the time spent in transit varies greatly. At the same time, stream fluctuations affect the network itself: if several cameras transmit on the same channel and their peaks coincide, they crowd each other out, waiting time increases greatly, and more unpredictable delays appear. All of these issues need to be considered.

[Slide: display pipeline timing]

The display also adds delay. After the video is decoded, it is rendered by the GPU and then displayed by the operating system. The figure shows the display process on Android; it is similar on other systems. There is a display service that refreshes at a fixed interval. If frame "1" is to be displayed with low delay, it must be shown at the first interval point; all processing and rendering, for example overlaying a piece of text, must finish before that point so the frame can be refreshed in the next period. Besides the delay in the figure, there is also the panel's own refresh time, which is relatively short, on some displays only 1 ms. The operating system part is controlled by software, and its delay is tied to the refresh rate: at 100 Hz, for example, there is a 10 ms interval. This is a fixed delay imposed by the operating system and the refresh interval. Look at frame "3" in the figure: it spans two periods, so it waits longer, because the display will not refresh it before its processing completes. So the display link also contributes delay.
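In the simplest case, a frame that just misses a refresh point waits for the next one, so the software-side display wait is bounded by the refresh interval (frames whose processing spans several periods, like "3" above, wait correspondingly longer):

    0 \le t_{wait} < 1 / f_{refresh}, \qquad f_{refresh} = 100~\mathrm{Hz} \;\Rightarrow\; t_{wait} < 10~\mathrm{ms}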

The above are the problems facing low-latency video; now let's see how to solve them.

-02-

Solving the Problems of Low-Latency Video

[Slide: solution overview: source and transmission]

I boil it down to two parts: transmission and source. Transmission mainly means low-latency transmission, that is, how to make transmission fast, stable, and resistant to packet loss. The source side covers everything from the processing in front of the camera to the final display. If all of these problems can be solved, the goal of solving the delay problem is achieved.

[Slide: parallel video processing and encoding]

First, parallel video processing and encoding. A sampled video frame can be divided into strips: the image can be split horizontally or into quarters, and the resulting blocks can be computed in parallel. Parallelizing the image is the main means of reducing delay. Parallelization is done through division: each block can be computed and processed separately, including in the subsequent encoding, where the same method applies, so parallelism reduces the delay to a fraction of the original. Of course, to achieve parallelization each block is processed independently, without association, which causes some problems, for example with white balance: some processing algorithms need to compute statistics over the whole frame first, or do compensation afterwards, because there are correlations within and between frames, including in encoding. So even when the frame is divided for computation, such steps must still run over the full frame.
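Here is a minimal Python sketch of the idea, assuming the frame is split into horizontal strips that are filtered and encoded independently; process_and_encode_slice is a stand-in, not a real codec API:

    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    NUM_SLICES = 4  # split the frame into 4 horizontal strips

    def process_and_encode_slice(slice_data: np.ndarray) -> bytes:
        # Stand-ins for per-slice pre-processing and encoding.
        # Whole-frame steps (e.g. white balance statistics) must
        # still be computed globally before this point.
        denoised = slice_data          # placeholder for ISP-style filtering
        return denoised.tobytes()      # placeholder for slice encoding

    def encode_frame_parallel(frame: np.ndarray) -> list[bytes]:
        slices = np.array_split(frame, NUM_SLICES, axis=0)
        with ThreadPoolExecutor(max_workers=NUM_SLICES) as pool:
            # Each strip is processed independently, so the frame's
            # processing time approaches 1/NUM_SLICES of the serial time.
            return list(pool.map(process_and_encode_slice, slices))

    frame = np.zeros((1080, 1920, 3), dtype=np.uint8)  # dummy 1080p frame
    encoded_slices = encode_frame_parallel(frame)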

[Slide: problems introduced by parallelization]

What problems does parallelization bring? One is that after partitioning, the compression ratio of each slice suffers: besides the extra header overhead, cross-references between slices are reduced, which lowers the compression ratio. In addition, during decoding, filtering needs to cross block boundaries, so decoding parallelizes less well than encoding. Parallelization is therefore generally done at encoding, while decoder parallelism is relatively weak. In image processing there is also the problem of consistency between blocks, such as light intensity when some areas are darker; a unified, whole-frame computation is needed to know how to treat the other blocks.

So if parallelization is done well, encoding time and pre-processing time can be greatly reduced: make full use of the computing power and cut the delay through parallelism.

[Slide: dynamically adjusting the key-frame interval]

Another technique is to dynamically adjust the key-frame interval to reduce stream fluctuation. Looking at the picture above, if key frames from two streams appear at the same moment, the delay of both frames is affected: there is a large instantaneous burst, and at the receiving end those frames seem to take especially long in transit while other frames are relatively small. To ensure smoothness and avoid playback that alternates between too fast and too slow, buffering and waiting are needed; to play back smoothly rather than constantly fast-forwarding and slowing down, every frame ends up carrying extra delay, which dynamic buffering can only compensate slowly. This situation, simultaneous key frames, is what we want to avoid.

In delay-sensitive situations, the fewer key frames the better, because fewer key frames make the stream easier to smooth, and the transit time of each frame fluctuates less. Of course, some scenarios require a fixed interval, say a key frame every few seconds. If no terminal joins the viewing mid-stream, the interval can be made longer, which helps, and the compression ratio is also higher. But if a scenario requires short intervals, at least try to avoid key frames appearing simultaneously: when multiplexing several video channels over the same link, stagger the key frames to reduce I-frame collisions, as in the sketch below.
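A minimal sketch of such staggering, assuming N streams share one link and use the same GOP length; the scheduling details are illustrative, not from the talk:

    GOP_LENGTH = 120   # frames between key frames (e.g. 4 s at 30 fps)
    NUM_STREAMS = 4    # cameras multiplexed on the same link

    def is_keyframe(stream_id: int, frame_index: int) -> bool:
        # Offset each stream's I-frame phase so their peaks never coincide.
        offset = stream_id * (GOP_LENGTH // NUM_STREAMS)
        return (frame_index + offset) % GOP_LENGTH == 0

    # No two streams produce an I frame on the same frame index:
    for frame_index in range(GOP_LENGTH):
        keyed = [s for s in range(NUM_STREAMS) if is_keyframe(s, frame_index)]
        assert len(keyed) <= 1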

[Slide: transmit data as soon as it is encoded]

During transmission, encoded data is sent immediately to minimize waiting: however much data is ready is transmitted right away, without waiting for the whole frame to finish encoding. Since parallel encoding produces data piece by piece, sending it as it emerges reduces waiting. That is the transmission side of the matter.
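A small sketch of this pipelining, with a placeholder slice encoder and a caller-supplied send function (both hypothetical):

    from typing import Callable, Iterator

    def encode_slices(slices: list[bytes]) -> Iterator[bytes]:
        # Placeholder for a slice encoder: yields each encoded slice as
        # soon as it is ready instead of waiting for the whole frame.
        for s in slices:
            yield s  # a real encoder would compress here

    def transmit_frame(slices: list[bytes],
                       send: Callable[[bytes], None]) -> None:
        # Overlap encoding and sending: each slice goes out immediately,
        # so network transfer starts before the frame finishes encoding.
        for payload in encode_slices(slices):
            send(payload)

    transmit_frame([b"slice0", b"slice1"],
                   send=lambda p: print(len(p), "bytes sent"))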

[Slide: resisting packet loss: FEC and retransmission]

In some scenarios the delay is mainly fixed, while in others it varies, because it depends on the network, and the path through the network is uncertain: where congestion or problems occur brings great uncertainty. Packet-loss resistance generally uses forward error correction (FEC) or retransmission, or a combination of the two. In a delay-sensitive scenario, FEC can be applied first to reduce the packet loss rate, but it cannot eliminate loss entirely; that depends on the loss model. If losses are very uniform, FEC can bring them down to almost nothing, but usually FEC alone cannot eliminate loss completely. It does reduce the loss rate, so retransmissions after that become relatively rare. The more retransmissions, the more RTTs are spent in transit, the more freezing occurs, subsequent video reception and display are affected, and a lot of data waits in the buffer.

Moreover, FEC's redundancy rate is quite high, which wastes and consumes extra bandwidth. But it is a balance: if you are willing to spend bandwidth, delay can be reduced; if bandwidth is precious and you do not want to spend much, delay may increase, and FEC may be added sparingly or not at all, because sometimes it brings no benefit. When loss is heavy or the loss model does not meet certain conditions, it is more cost-effective not to use FEC. This is a balance: if the network state can be detected, and the strategy can be adjusted dynamically according to the current packet loss rate and RTT, then it becomes possible to know which method is the most cost-effective in which scenario.

The communication engine itself can cater to various preferences. For example, you can tell the engine that the scenario is delay-sensitive, or that bandwidth and delay need to be balanced, and it can make policy adjustments to reach the final balance between delay and bandwidth consumption.
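As a toy illustration of such a policy, here is a sketch that picks a strategy from the measured loss rate and RTT; the thresholds and strategy names are invented for illustration, not taken from any real engine:

    from enum import Enum, auto

    class LossStrategy(Enum):
        NONE = auto()          # network is clean; spend no extra bandwidth
        FEC = auto()           # delay-sensitive: pay bandwidth to avoid waiting
        RETRANSMIT = auto()    # RTT is short enough that resending is cheap
        FEC_PLUS_ARQ = auto()  # heavy/bursty loss: combine both

    def choose_strategy(loss_rate: float, rtt_ms: float,
                        delay_sensitive: bool) -> LossStrategy:
        # Illustrative thresholds; a real engine would tune these dynamically.
        if loss_rate < 0.001:
            return LossStrategy.NONE
        if delay_sensitive:
            # Retransmission costs at least one extra RTT, so prefer FEC
            # unless losses are too heavy for FEC alone to pay off.
            return (LossStrategy.FEC if loss_rate < 0.10
                    else LossStrategy.FEC_PLUS_ARQ)
        # Bandwidth-sensitive: retransmit when the RTT penalty is small.
        return (LossStrategy.RETRANSMIT if rtt_ms < 30
                else LossStrategy.FEC_PLUS_ARQ)

    print(choose_strategy(loss_rate=0.02, rtt_ms=15, delay_sensitive=True))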

[Slide: intelligent rate adaptation]

There is also intelligent rate adaptation. The receiver feeds back the arrival time of each packet, and from the difference between receive and send times the sender can estimate the network state, ultimately obtaining information such as available bandwidth, current packet loss rate, and delay. Once the network state is perceived, the bit rate can be adjusted dynamically. On a mobile network, bit rate and bandwidth fluctuate: under interference, blockage, or weather changes, the wireless signal can suffer a large proportion of packet loss or a bandwidth drop for a period of time. When wireless noise rises, the whole network's bandwidth falls and the loss rate rises. So the network must be monitored dynamically, estimating what state it is in and obtaining parameters such as the loss rate, while at the same time perceiving the video content.

By content I mean content complexity, which can be represented by temporal complexity and spatial complexity. Temporal complexity is the change between frames, such as how severe the motion is; spatial complexity is the picture complexity within a frame. Deciding the bit rate dynamically from these factors makes the rate more reasonable, closer to the optimum, and reduces stuttering and delay.

Intelligent rate adaptation adjusts the bit rate in real time according to these conditions. The bit rate must not be adjusted too frequently, changing from one instant to the next within a second; that is also problematic, and there must be a limit on the adjustment cadence. As the figure shows, with a reasonable adjustment period the bit rate can recover as network bandwidth recovers. Of course, the recovery is slow while the decline is fast: to minimize stuttering, the rate must drop quickly and rise slowly.
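A minimal sketch of that "drop fast, rise slow" rule, assuming the sender periodically receives a bandwidth estimate and loss rate from receiver feedback (all constants are illustrative):

    def update_bitrate(current_kbps: float, estimated_bw_kbps: float,
                       loss_rate: float) -> float:
        # Drop fast: on loss, or when we exceed the estimate, cut hard.
        if loss_rate > 0.02 or current_kbps > estimated_bw_kbps:
            return max(200.0, 0.7 * estimated_bw_kbps)  # floor at 200 kbps
        # Rise slow: probe upward in small steps, never jumping
        # straight back to the full estimate.
        return min(estimated_bw_kbps, current_kbps * 1.05)

    # Example: bandwidth dips, then recovers.
    rate = 4000.0
    for bw, loss in [(4000, 0.0), (1500, 0.05), (1500, 0.0),
                     (4000, 0.0), (4000, 0.0)]:
        rate = update_bitrate(rate, bw, loss)
        print(f"bw={bw} kbps loss={loss:.2f} -> send at {rate:.0f} kbps")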

[Slide: shortest transmission path selection]

Another problem is the shortest transmission path. Solving things only at the two ends does not solve everything, because the time spent in transit depends on which network nodes the path traverses. If conditions between network nodes can be probed, and the delay and bandwidth of different paths obtained, the minimum-delay path can be found and network transmission can select its route autonomously. On top of the end-to-end media resistance to weak networks described above, packet-loss protection is applied both end-to-end and between pairs of media nodes; integrating all of this yields a complete solution, with every node optimized.
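Finding that minimum-delay route is a classic shortest-path problem; here is a sketch using Dijkstra's algorithm over probed per-hop delays (the topology and numbers are invented for illustration):

    import heapq

    def min_delay_path(graph: dict, src: str, dst: str):
        # graph[u] = {v: measured one-way delay in ms}
        dist, prev = {src: 0.0}, {}
        heap = [(0.0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == dst:
                break
            if d > dist.get(u, float("inf")):
                continue  # stale heap entry
            for v, hop_ms in graph[u].items():
                nd = d + hop_ms
                if nd < dist.get(v, float("inf")):
                    dist[v], prev[v] = nd, u
                    heapq.heappush(heap, (nd, v))
        path, node = [dst], dst
        while node != src:
            node = prev[node]
            path.append(node)
        return list(reversed(path)), dist[dst]

    # Illustrative relay mesh with probed per-hop delays (ms):
    mesh = {
        "sender":   {"relayA": 8, "relayB": 15},
        "relayA":   {"relayB": 3, "receiver": 20},
        "relayB":   {"receiver": 9},
        "receiver": {},
    }
    # -> (['sender', 'relayA', 'relayB', 'receiver'], 20.0)
    print(min_delay_path(mesh, "sender", "receiver"))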

-03-

Low-Latency Video Application Cases

Finally, let's talk about the application scenarios.

[Slide: 5G industry application scenarios]

5G has brought changes to industry applications. In the past, much control ran over wires such as network cables and optical fiber. With mobile networks, scenarios that involve movement or where cabling is difficult become new application scenarios: production lines, industrial applications, autonomous driving, remote control, and more dangerous settings such as underground mine vehicles. Industrial parks, ports, coal mines, hospitals, and smart communities are all potential applications in the future. This is a benefit of the mobile network: communication devices can solve these problems over the air.

[Slide: ZTE 5G industrial gateway]

This is a gateway made by ZTE, with a 5G antenna on top. It can connect several cameras as well as industrial and vehicle control ports, and it has a rich set of interfaces. It can take SDI cameras, whose latency is relatively low, and also IP cameras, which deliver video that is already encoded and packetized; the latency of an IP camera depends on the camera's own capture and encoding delay. Finally, for industrial equipment, the front-end machinery can be controlled from a remote console. Because of the 5G antenna, everything can be controlled remotely.

[Slide: transcoding and composition gateway]

Another option is a gateway that transcodes the video captured by the front-end cameras. After transcoding, the video goes out through the antenna to the base station and is transmitted back. This is another of our devices, a gateway that can transcode and also composite: pictures can be composed inside it. This composition gateway complements the previous device.

[Slide: case 1, coal mine remote control]

Case 1 is a coal mine. On the left is the mining machinery. In the shaft, if the gateway is mounted on the vehicle and a few cameras are placed on the machinery, the remote control center can see the video from the mine, so no one needs to drive down below; the control center operates it. This machine's delay requirement is estimated at around 100 ms, which is sufficient since it does not move fast, and it removes workers from a dangerous underground environment. This is a valuable application: reducing labor cost and risk through remote control.

[Slide: unmanned container trucks]

Unmanned container trucks. This is similar to the previous vehicle: it can carry several SDI or IP cameras, and control is completed through the vehicle. In between, 5G base stations and core network equipment are deployed. There are two 5G modules inside for redundancy, providing resistance to weak networks; it supports ONVIF cameras and IP camera access. Once the video arrives locally, it can go over wire.

[Slide: port gantry crane remote control]

This scene is at a harbor. The port has gantry cranes, and 20 or 30 cameras can be placed on each crane, all operated by people in the control room. This kind of scene requires the operator to be able to switch cameras, with some zooming in after transmission, so the driver can control the equipment better. In the past this work was done over optical fiber or network cable; because the environment is relatively harsh, that carries a certain maintenance cost, and replacing it with wireless reduces maintenance relatively. It provides a delay of 50 to 80 ms, of which the wireless segment is about 10 ms, which solves remote control for the quay crane scenario. The advantage is that the driver can switch cameras freely while controlling the gantry crane, and production efficiency improves further with this intelligence.

[Slide: Internet of Vehicles]

There is also the Internet of Vehicles. One scenario is in the vehicle itself, and another is the roadside unit. The roadside unit needs to send back video collected on the road; computation is performed in the edge equipment room, and the results are reported to the related vehicles on the road, forming vehicle-road cooperation.

That's all for my sharing, thank you.



Source: blog.csdn.net/vn9PLgZvnPs1522s82g/article/details/131355934