Baidu Programmers' Guide to Avoiding Development Pitfalls (3)


In the first two installments we shared pitfalls from our daily work in front-end and mobile development; interested readers can jump to them via the recommended reading links at the end of this article. In this issue we cover three topics: Golang object pools to reduce GC pressure, concurrency control in FFmpeg, and static vs. dynamic graphs in Paddle. We hope they help you sharpen your skills.

01 Golang object pools to reduce GC pressure

sync.Pool is Go's built-in object pool. It caches temporary objects so they do not have to be created over and over, which saves allocations and relieves pressure on the GC. Objects cached in a sync.Pool may be reclaimed at any time without notice, so it must not be used to store long-lived objects. sync.Pool is not only concurrency-safe, it is also lock-free: it relies on the CAS operations in the atomic package, and these atomic operations, which sit closer to the CPU and the operating system, are enough to replace locks in its concurrent scenarios.

1.1 Usage

When a sync.Pool is initialized, the user provides an object constructor New. Get takes an object out of the pool and Put returns it to the pool; overall, usage is quite simple.
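A minimal usage sketch (the pooled type, *bytes.Buffer, is just an illustrative choice):

package main

import (
    "bytes"
    "fmt"
    "sync"
)

var bufPool = sync.Pool{
    // New is called when Get finds no cached object in the pool.
    New: func() interface{} {
        return new(bytes.Buffer)
    },
}

func main() {
    buf := bufPool.Get().(*bytes.Buffer) // take an object out of the pool
    buf.Reset()                          // always reset state left over from a previous user
    buf.WriteString("hello sync.Pool")
    fmt.Println(buf.String())
    bufPool.Put(buf) // hand the object back so it can be reused
}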

1.2 Principle

In the GMP scheduling model, seen from the thread dimension, the logic on a P is executed by a single thread at a time, which creates the conditions for handling concurrency among the goroutines on that P, and sync.Pool takes full advantage of this property of GMP. For the same sync.Pool, every P has its own local object pool, poolLocal, which stores that P's local objects. Each poolLocal consists of a private object and a shared poolChain. poolChain is a linked list whose nodes point to ring buffers. A ring buffer is used because the ring structure makes memory reuse easy, and because a ringBuffer is a contiguous block of memory, which is friendly to the CPU cache.

poolChain stores the head and tail of each ringBuffer. head and tail are not two independent variables; there is only a single uint64 variable, headTail, which packs head and tail together: the high 32 bits are head and the low 32 bits are tail. This is in fact a very common lock-free optimization. A poolDequeue may be accessed by several Ps at the same time, for example through the object-stealing logic in Get, and that introduces concurrency problems. For instance, when only one slot is left in the ring buffer, i.e. head - tail = 1, and several Ps access the ring buffer simultaneously with no concurrency measure in place, both Ps could take the object, which is clearly not what we want. Without introducing a Mutex, sync.Pool relies on the CAS operations in the atomic package: both Ps may read the object, but when headTail is finally updated, only one P's CAS succeeds and the other's fails.
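To make the packing concrete, here is a small illustrative sketch of the technique (not the actual sync.Pool source): head and tail share one uint64 word, so a single CAS on that word decides which contender wins.

package main

import (
    "fmt"
    "sync/atomic"
)

const dequeueBits = 32

// pack combines head and tail into a single uint64 word: high 32 bits = head, low 32 bits = tail.
func pack(head, tail uint32) uint64 {
    const mask = 1<<dequeueBits - 1
    return uint64(head)<<dequeueBits | uint64(tail&mask)
}

// unpack splits the packed word back into head and tail.
func unpack(ptrs uint64) (head, tail uint32) {
    const mask = 1<<dequeueBits - 1
    head = uint32(ptrs >> dequeueBits)
    tail = uint32(ptrs & mask)
    return
}

func main() {
    // One element left in the ring: head - tail == 1.
    headTail := pack(3, 2)

    // Two contenders read the same snapshot, but only the first CAS on the
    // packed word succeeds; the second CAS, using the stale snapshot, fails.
    old := atomic.LoadUint64(&headTail)
    head, tail := unpack(old)
    ok1 := atomic.CompareAndSwapUint64(&headTail, old, pack(head, tail+1))
    ok2 := atomic.CompareAndSwapUint64(&headTail, old, pack(head, tail+1))
    fmt.Println(ok1, ok2) // true false
}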

02 Concurrency control in FFmpeg

2.1 Problem description

A recent exploratory project required splicing and composing videos. Because the source clips differ in format, resolution and bitrate, getting a smooth splice means first aligning the video dimensions and unifying the encoding format and bitrate. After some research into FFmpeg commands and a series of conversion experiments (video cropping, padding, scaling and so on), everything behaved as expected. But once the commands were integrated into the real business scenario, we first hit memory exhaustion that got the process killed, then a fully loaded CPU that made tasks run far too long or even fail, and further upgrading the CPU configuration did not improve things much. So we started investigating thread control for FFmpeg commands.

2.2 FFmpeg thread control

As a powerful multimedia processing tool, FFmpeg is built from several feature-rich libraries. It processes a multimedia file roughly in these stages: demux the container into streams, decode the streams into raw frames, transform the frames with filters, then encode and mux them into the output file.

The key computational steps are decoding, encoding and the data transformations in between. FFmpeg exposes three parameters for thread control, which the FFmpeg documentation describes as follows:

-filter_threads nb_threads (global) 

Defines how many threads are used to process a filter pipeline. Each pipeline will produce a thread pool with this many threads available for parallel processing. The default is the number of available CPUs.

filter_threads controls the threading of simple filters; the default number of threads is the number of available CPU cores.

-filter_complex_threads nb_threads (global) 

Defines how many threads are used to process a filter_complex graph. Similar to filter_threads but used for -filter_complex graphs only. The default is the number of available CPUs.

filter_complex_threads controls the threading of complex filter graphs; the default is likewise the number of available CPU cores.

threads integer (decoding/encoding,video) 

Set the number of threads to be used, in case the selected codec implementation supports multi-threading. 

Possible values: 

‘auto, 0’: automatically select the number of threads to set 

Default value is ‘auto’.

threads controls the threading of the encoder/decoder, provided the chosen codec implementation supports multi-threading. Its default is described in the documentation with nothing more than "automatically"; I could not find a concrete explanation of this parameter anywhere, so I ran some experiments based on our business scenario.
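For orientation, a hedged sketch of where the three options sit on a command line (the file names, filter parameters and thread counts here are placeholders, not the actual business command):

ffmpeg -filter_complex_threads 2 -i input.mp4 \
       -filter_complex "[0:v]crop=1072:1072,scale=720:720,gblur=sigma=20[v]" \
       -map "[v]" -threads 2 -y output.mp4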

Using the time command to look at the elapsed time and CPU utilization of each FFmpeg invocation, the results on a 4-core machine were as follows (input files and the filter graph are elided):

-i  -filter_complex  -threads 1 -y   4.54s user 0.17s system 110% cpu 4.278 total
-i  -filter_complex  -threads 2 -y   4.61s user 0.29s system 189% cpu 2.581 total
-i  -filter_complex  -threads 4 -y   4.92s user 0.22s system 257% cpu 1.993 total
-i -filter_complex -threads 6 -y 4.73s user 0.21s system 302% cpu 1.634 total
-i -filter_complex -threads 8 -y 4.72s user 0.19s system 315% cpu 1.552 total
-i  -filter_complex  -y   4.72s user 0.22s system 306% cpu 1.614 total
-i  -filter_complex  -y -filter_complex_threads 1 -y   4.63s user 0.13s system 316% cpu 1.504 total
-i  -filter_complex  -y -filter_complex_threads 2 -y   4.62s user 0.20s system 304% cpu 1.583 total
-i  -filter_complex  -y -filter_complex_threads 4 -y   4.58s user 0.27s system 303% cpu 1.599 total


The experiments show that, for my crop + scale + gblur size-alignment pipeline, there is essentially no room for filter parallelism: without explicit thread control, increasing filter_complex_threads only adds system time and brings almost no gain in total latency or CPU utilization. For the encode/decode part, more threads do raise CPU utilization and cut latency, but the overall numbers are clearly not linear. For a single command, setting threads to 2 is roughly the best trade-off between CPU consumption and elapsed time.

2.3 Summary

1. FFmpeg is a compute-intensive tool with a large appetite for CPU. It provides three parallelism parameters that control concurrency for different kinds of work, but whether a particular command can actually be parallelized depends on how it is implemented, so each case has to be analyzed on its own;

2. Encoding and decoding are the key stages of FFmpeg video processing and also the most time- and CPU-consuming ones; using the threads parameter well lets you speed up processing considerably while keeping CPU usage under control.

03 Static graphs and dynamic graphs in Paddle

The concepts of static graph and dynamic graph

Static graph: analogous to C++, compiled first and run afterwards, so there are two phases, compile time and runtime. At compile time the complete model has to be defined in advance; Paddle generates a ProgramDesc and then optimizes it with a transpiler. At runtime, an executor runs the ProgramDesc.


Dynamic graph: analogous to Python, there is no compilation phase, so the model does not have to be defined in advance. Each line of network code you write immediately yields the corresponding computation result.

Pros and cons:

Static graph: Paddle initially supported only static graphs, so the related tooling and documentation are plentiful, and performance is also better than with dynamic graphs; debugging, however, is more troublesome.
Dynamic graph: convenient to debug, and the model structure can be adjusted dynamically, but execution efficiency is lower.

Question 1: How to tell whether you are currently in static graph mode or dynamic graph mode

  • Static graph mode: the code uses the static module, or it builds an executor and runs the defined model with executor.run(program).

  • Dynamic graph mode: the code uses the dygraph module. Starting from Paddle 2.0, dynamic graph mode is enabled by default.

  • Note: some APIs support only static graphs or only dynamic graphs. For example, APIs that involve reading variable values generally support only dynamic graphs. When errors mention imperative/dygraph and the like, check whether a dynamic-graph API is being called in static graph mode.

import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.fluid.dygraph.base import to_variable

print (paddle.__version__) # 2.1.1

# static graph mode
main_program = fluid.Program()
startup_program = fluid.Program()
paddle.enable_static()
with fluid.program_guard(main_program=main_program, startup_program=startup_program):
    data_x = np.ones([2, 2], np.float32)
    data_y = np.ones([2, 2], np.float32)
    # in static graph mode, build placeholders
    x = fluid.layers.data(name='x', shape=[2], dtype='float32')
    y = fluid.layers.data(name='y', shape=[2], dtype='float32')
    x = fluid.layers.elementwise_add(x, y)
    print ('In static mode, after calling layers.data, x = ', x)
    # at this point the runtime value cannot be printed; the output is: In static mode, after calling layers.data, x =  var elementwise_add_0.tmp_0 : LOD_TENSOR.shape(-1, 2).dtype(float32).stop_gradient(False)
    place = fluid.CPUPlace()
    exe = fluid.Executor(place=place)
    exe.run(fluid.default_startup_program())
    data_after_run = exe.run(fetch_list=[x], feed={'x': data_x, 'y': data_y})
    print ('In static mode, data after run:', data_after_run)
    #In static mode, data after run: [array([[2., 2.],[2., 2.]], dtype=float32)]


# dynamic graph mode
with fluid.dygraph.guard():
    x = np.ones([2, 2], np.float32)
    y = np.ones([2, 2], np.float32)
    # in dynamic graph mode, convert numpy ndarray data to Variable type
    x = fluid.dygraph.to_variable(x)
    y = fluid.dygraph.to_variable(y)
    print ('In DyGraph mode, after calling dygraph.to_variable, x = ', x)
    # In DyGraph mode, after calling dygraph.to_variable, x =  Tensor(shape=[2, 2], dtype=float32, place=CUDAPlace(0), stop_gradient=True,[[1., 1.],[1., 1.]])
    x = fluid.layers.elementwise_add(x,y)
    print ('In DyGraph mode, data after run:', x.numpy())
    #In DyGraph mode, data after run: [[2. 2.] [2. 2.]]

Question 2: How to debug in static graph mode

  • Generally, use fluid.layers.Print() to create a print operator that prints the contents of the tensor it is attached to, as sketched below.
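A minimal sketch of what that could look like in the static-graph example above (reusing the x and y placeholders from that snippet; the message string is arbitrary):

# after computing x inside the static-graph program:
x = fluid.layers.elementwise_add(x, y)
# Print returns its input, so it can be spliced into the graph in place;
# the tensor's content is printed when the executor actually runs this op
x = fluid.layers.Print(x, message='x after elementwise_add')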

Question 3: How to convert a dynamic graph to a static graph

  • Given the pros and cons above, dynamic graph mode can be used during model development, and static graph mode for training and inference.

  • Decorate the functions that need to be converted with @paddle.jit.to_static, or call paddle.jit.to_static() on the whole network to transform it.

import numpy as np
import paddle
import paddle.fluid as fluid
from paddle.jit import to_static

class MyNet(paddle.nn.Layer):
    def __init__(self):
        super(MyNet, self).__init__()
        self.fc = fluid.dygraph.Linear(input_dim=4, output_dim=2, act="relu")

    @to_static
    def forward(self, x, y):
        x = fluid.dygraph.to_variable(x)
        x = self.fc(x)
        y = fluid.dygraph.to_variable(y)
        loss = fluid.layers.cross_entropy(input=x, label=y)
        return loss


net = MyNet()
x = np.ones([16, 4], np.float32)
y = np.ones([16, 1], np.int64)
net.eval()
out = net(x, y)

Recommended reading:

Baidu Programmers' Guide to Avoiding Development Pitfalls (mobile)

Baidu Programmers' Guide to Avoiding Development Pitfalls (front-end)

Baidu engineers teach you tips to quickly improve R&D efficiency

Baidu front-line engineers talk about the ever-changing cloud native

[Technical Gas Station] Revealing the large-scale implementation of Baidu's intelligent test

[Technical Gas Station] Talking about the three stages of Baidu's intelligent test


Source: juejin.im/post/7085620918998794276