Using Vivado HLS to Implement a Median Filter and Sorting Network for Video Processing

Original address: http://xilinx.eetrend.com/article/6799?page=6

Author: Daniele Bagni, DSP Specialist, Xilinx

The high-level synthesis capability of Vivado will help you design better sorting networks for embedded video applications.

From handheld devices to automobiles to security systems, more and more applications now have embedded video capabilities. Each new generation demands more features and better image quality. For some design teams, however, good image quality is not easy to achieve.

As a Xilinx field applications engineer specializing in DSP design, I am often asked about IP and efficient implementations of video filtering. I have found that with the high-level synthesis (HLS) capability of the latest Vivado® Design Suite, it is easy to implement an efficient median filter based on a sorting network in any Xilinx 7 series All Programmable device.

Before discussing the method in detail, let's first look at some of the image-integrity challenges designers face and the filtering techniques commonly used to address them.

Most digital imaging systems pick up noise during image acquisition or transmission. For example, the sensor and circuitry of a scanner or digital camera can produce several types of random noise. Random bit errors in a communication channel, or errors in the analog-to-digital conversion, can cause particularly troublesome "impulse noise." This noise is often called salt-and-pepper noise, because it shows up on the display as random white or black dots on the image surface and seriously degrades image quality (Figure 1).

Figure 1 - Input image affected by impulse noise. Only 2% of the pixels are corrupted, but that is enough to seriously degrade the image quality.

To reduce image noise, video engineers usually apply a spatial filter in the design. These filters replace or enhance poorly rendered pixels using the values of the pixels surrounding the noisy ones. Spatial filters are either linear or nonlinear. The most commonly used linear filter is the mean filter, which replaces each pixel with the mean value of its neighboring pixels. In this way, a poorly rendered pixel is improved using the average of other pixels in the image. A mean filter removes image noise quickly, acting as a low-pass filter. However, this usually comes with a side effect: edges throughout the image become blurred.

In most cases, nonlinear filtering is a better choice than the linear mean filter. Nonlinear filters are particularly good at removing impulse noise. The most common nonlinear filters are order-statistic filters, and the most popular order-statistic filter is the median filter.

Median filters are widely used in video and image processing because they offer excellent noise reduction with considerably less blurring than linear smoothing filters of the same size. Like the mean filter, the median filter considers each pixel in the image in turn and looks at its neighbors to decide whether the pixel is representative of its surroundings. However, instead of replacing the pixel value with the mean of the neighboring pixel values, the median filter replaces it with their median. Because the median must be the actual value of one of the pixels in the neighborhood, the median filter does not create new, unrealistic pixel values when the window straddles an edge (which avoids the edge blurring of the mean filter). For this reason, the median filter is better than any other filter at preserving sharp edges. To compute the median, the filter first sorts all the pixel values in the window into numerical order and then replaces the pixel being filtered with the middle value (if the neighborhood contains an even number of pixels, the average of the two middle values is used).

For example, assume a 3x3 pixel window centered on a pixel of value 229, with the following values:
39 225 83
5 229 204
164 61 57

Sorting the pixels gives the ordered list 5 39 57 61 83 164 204 225 229.

The median is the middle value of the list, namely 83. This value replaces the original value 229 in the output image. Figure 2 shows the result of applying a 3x3 median filter to the noisy input image of Figure 1. The larger the window around the pixel being filtered, the more pronounced the filtering effect.
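
The worked example above can be checked with a few lines of C. This is only a minimal sketch: the function name median9 and the use of insertion sort are assumptions made for illustration, and this is not the hardware-oriented method discussed later in the article.

```c
/* Hypothetical helper: median of a 3x3 window (9 values),
   found by sorting a local copy of the window. */
static unsigned char median9(const unsigned char window[9])
{
    unsigned char s[9];
    int i, j;

    for (i = 0; i < 9; i++)
        s[i] = window[i];

    /* Plain insertion sort; fine for illustration, not for hardware. */
    for (i = 1; i < 9; i++) {
        unsigned char key = s[i];
        for (j = i - 1; j >= 0 && s[j] > key; j--)
            s[j + 1] = s[j];
        s[j + 1] = key;
    }
    return s[4]; /* middle element of the 9 sorted values */
}
```

Calling median9 on the window { 39, 225, 83, 5, 229, 204, 164, 61, 57 } returns 83, the value that replaces 229 in the output image.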

Figure 2 - The same image after filtering with a 3x3 median filter; the impulse noise has been completely removed.

Thanks to its excellent noise-reduction capability, the median filter is also widely used in the interpolation stages of scan-rate video conversion systems, for example as a motion-compensated interpolator that converts the field rate of interlaced video signals from 50 Hz to 100 Hz, or as an edge-oriented interpolator in interlaced-to-progressive conversion. For a more detailed introduction to median filters, the interested reader can refer to [1] and [2].

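The most critical step in using a median filter is choosing the sorting method that produces the ordered list of pixels from which each output pixel is generated. Sorting can consume a large number of clock cycles.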

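Xilinx now offers high-level synthesis in the Vivado Design Suite. I usually tell people that a simple yet effective way to design a median filter in C is to base it on the concept of a sorting network. We can use Vivado HLS [3] to obtain real-time performance on the FPGA fabric of the Zynq®-7000 All Programmable SoC [4].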

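In what follows, we assume an image format of 8 bits per pixel, 1,920 pixels per line and 1,080 lines per frame at a frame rate of 60 Hz, so the minimum pixel rate is at least 124 MHz. To make the design somewhat more challenging, however, I will ask the Vivado HLS tool for a target clock frequency of 200 MHz, and I will be happy with anything comfortably above 124 MHz (real video signals also contain blanking data, so the clock rate is higher than the active pixels alone require).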
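
The 124 MHz figure follows directly from the frame geometry. As a small sketch (the function name active_pixel_rate_hz is an assumption for illustration), the active pixel rate can be computed as:

```c
/* Active pixel rate in Hz for a given frame size and frame rate.
   Blanking intervals are ignored, so a real video clock is higher. */
static unsigned long active_pixel_rate_hz(unsigned long width,
                                          unsigned long height,
                                          unsigned long frames_per_second)
{
    return width * height * frames_per_second;
}
```

For 1,920 x 1,080 pixels at 60 Hz this gives 124,416,000 pixels per second, that is, about 124 MHz when one pixel is processed per clock cycle.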

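What Is a Sorting Network?
Sorting is the process of rearranging the elements of an array into ascending or descending order. It is one of the most important operations in many embedded computing systems.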

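Because sorting plays a key role in so many applications, a large body of scientific literature has analyzed the complexity and speed of well-known sorting methods such as bubble sort, shell sort, merge sort and quicksort. Quicksort is the fastest sorting algorithm for large data sets [5], while bubble sort is the simplest. All of these techniques are usually meant to run as software tasks on a RISC CPU, performing only one comparison at a time. Their workload is not constant but depends on how much of the input data is already partially sorted. For example, for a set of N samples to be sorted, the computational complexity of quicksort is assumed to be N^2, N log N and N log N in the worst, average and best cases, respectively. For bubble sort, the corresponding figures are N^2, N^2 and N. I must admit that I have not found a unanimous view of these complexity figures. But all the articles I have read on the subject seem to agree on one point: computing the complexity of a sorting algorithm is not simple. That in itself seems a good reason to look for an alternative.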

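In image processing, we need deterministic behavior from the sorting method so that the output picture is produced at constant throughput. Therefore, none of the algorithms above is an ideal candidate for an FPGA design with Vivado HLS.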

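Sorting networks achieve higher speed through parallel execution. The basic building block of a sorting network is the comparator, a simple component that sorts two data values, a and b, and places the maximum and the minimum on its top and bottom outputs, respectively, swapping them if necessary. The advantage of sorting networks over classical sorting algorithms is that the number of comparators is fixed for a given number of inputs. This makes sorting networks easy to implement in FPGA hardware. Figure 3 shows a sorting network for five samples (generated with Xilinx System Generator [6]). Note that the processing latency is exactly five clock cycles, independent of the values of the input samples. Note also that the five parallel output signals on the right contain the sorted data, with the maximum at the top and the minimum at the bottom.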

Figure 3 - Block diagram of a sorting network for five input samples. The larger blocks are comparators (each with one clock cycle of latency); the small squares are delay elements.
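
To make the comparator idea concrete, here is a minimal C sketch of a five-input network. It assumes an odd-even transposition arrangement, which also needs five stages for five inputs; the network drawn in Figure 3 may be wired differently, and the names comparator and sort5_network are hypothetical.

```c
#define NUM_IN 5  /* number of inputs, as in Figure 3 */

/* Comparator: afterwards *top holds the larger value, *bottom the smaller. */
static void comparator(int *top, int *bottom)
{
    if (*top < *bottom) {
        int tmp = *top;
        *top = *bottom;
        *bottom = tmp;
    }
}

/* Sorts x[] in place, largest value first, using NUM_IN comparator stages. */
static void sort5_network(int x[NUM_IN])
{
    int stage, i;

    for (stage = 0; stage < NUM_IN; stage++) {
        /* Even stages compare pairs (0,1) and (2,3);
           odd stages compare pairs (1,2) and (3,4). */
        for (i = stage % 2; i + 1 < NUM_IN; i += 2)
            comparator(&x[i], &x[i + 1]);
    }
}
```

The sequence of comparisons never depends on the data: this sketch always performs ten comparisons in five stages, which is what makes the latency of such a network constant in hardware.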

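Implementing a median filter with a sorting network in C is straightforward, as the code in Figure 4 shows. The Vivado HLS directives are embedded in the C code itself (#pragma HLS). Vivado HLS needs only two optimization directives to generate optimal RTL code. The first pipelines the whole function with an initiation interval (II) of 1, so that the output pixel rate equals the FPGA clock rate. The second partitions the pixel window into separate registers so that all the data can be accessed in parallel at the same time, which increases bandwidth.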

Figure 4 - Implementation of a median filter through a sorting network in C
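
Since the figure is not reproduced here, the following is a sketch of what such a function could look like, not the author's exact code. It assumes an odd-even transposition network, an 8-bit pix_t and the function name median; the choice of ARRAY_PARTITION for the second directive is also an assumption. A regular C compiler ignores the HLS pragmas, so the function can be tested in software as well.

```c
#define KMED 3                 /* window side: 3 for a 3x3 filter */
#define NPIX (KMED * KMED)     /* number of pixels to sort */

typedef unsigned char pix_t;   /* 8-bit pixel (assumed) */

/* Median of a KMED x KMED window via an odd-even transposition network. */
pix_t median(pix_t window[NPIX])
{
#pragma HLS PIPELINE II=1
#pragma HLS ARRAY_PARTITION variable=window complete dim=1
    pix_t z[NPIX];
    int stage, i;

    /* Work on a local copy so the caller's window is left untouched. */
    for (i = 0; i < NPIX; i++)
        z[i] = window[i];

    /* NPIX stages; each stage is a row of independent comparators. */
    for (stage = 0; stage < NPIX; stage++) {
        for (i = stage % 2; i + 1 < NPIX; i += 2) {
            pix_t lo = (z[i] < z[i + 1]) ? z[i] : z[i + 1];
            pix_t hi = (z[i] < z[i + 1]) ? z[i + 1] : z[i];
            z[i] = lo;
            z[i + 1] = hi;
        }
    }
    return z[NPIX / 2]; /* middle element of the sorted list */
}
```

With the 3x3 example window used earlier in the article (39, 225, 83, 5, 229, 204, 164, 61, 57), this function returns 83.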

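The Top-Level Function
The code snippet in Figure 5 is an elementary implementation of the median filter, which we use as a reference. The innermost loop is pipelined so as to produce one output pixel every clock cycle. To generate a latency estimation report, we need the TRIPCOUNT directive to tell the Vivado HLS compiler how many iterations loops L1 and L2 may perform, because they are "unbounded." That is, assuming the design can handle at run time image resolutions below the maximum allowed resolution of 1,920 x 1,080 pixels, the limits of these loops are the picture height and width, both of which are unknown at compile time.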

Figure 5 - Initial Vivado HLS code, which does not take the behavior of the video line buffers into account
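
As an illustration of this kind of reference code, here is a hedged sketch rather than the listing from the figure. The function name ref_median, the stand-in median helper, the tripcount values and the decision to leave border pixels unwritten are all assumptions.

```c
#define KMED       3
#define NPIX       (KMED * KMED)
#define MAX_HEIGHT 1080
#define MAX_WIDTH  1920

typedef unsigned char pix_t;

/* Stand-in median (assumed): sorts a copy with compare-and-swap stages. */
static pix_t median(const pix_t window[NPIX])
{
    pix_t z[NPIX];
    int stage, i;

    for (i = 0; i < NPIX; i++)
        z[i] = window[i];
    for (stage = 0; stage < NPIX; stage++)
        for (i = stage % 2; i + 1 < NPIX; i += 2)
            if (z[i] > z[i + 1]) {
                pix_t t = z[i];
                z[i] = z[i + 1];
                z[i + 1] = t;
            }
    return z[NPIX / 2];
}

/* Reference filter: reads every window straight from the input image. */
void ref_median(pix_t in_pix[MAX_HEIGHT][MAX_WIDTH],
                pix_t out_pix[MAX_HEIGHT][MAX_WIDTH],
                short height, short width)
{
    short r, c, i, j;
    pix_t window[NPIX];

    L1: for (r = KMED / 2; r < height - KMED / 2; r++) {
#pragma HLS LOOP_TRIPCOUNT min=1 max=1080
        L2: for (c = KMED / 2; c < width - KMED / 2; c++) {
#pragma HLS LOOP_TRIPCOUNT min=1 max=1920
#pragma HLS PIPELINE II=1
            /* Gather the KMED x KMED neighborhood centered on (r, c).
               These reads touch KMED different image rows per pixel. */
            for (i = 0; i < KMED; i++)
                for (j = 0; j < KMED; j++)
                    window[i * KMED + j] =
                        in_pix[r - KMED / 2 + i][c - KMED / 2 + j];
            out_pix[r][c] = median(window);
        }
    }
}
```

Each output pixel in this sketch needs nine reads from the image memory, spread over three rows, which is the access pattern that limits the throughput of a version without line buffers.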

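In the C code, the window of pixels to be filtered accesses different rows of the image. As a result, the benefit of exploiting memory locality to reduce memory bandwidth requirements is limited. Although Vivado HLS can synthesize the code, the throughput is not optimal, as Figure 6 shows. The initiation interval of loop L1_L2 (a result of the innermost loop L2 being fully unrolled, which the HLS compiler does automatically) is five clock cycles instead of one, so the resulting output data rate cannot sustain real-time performance. This is also clear from the maximum latency of the whole function. With a target clock period of 5 nanoseconds, computing one output image takes 10,368,020 cycles, which means a frame rate of 19.2 Hz rather than 60 Hz. As described in detail in [7], the Vivado HLS designer must explicitly code the behavior of the video line buffers into the C model used to generate the RTL, because the HLS tool cannot automatically insert new memories into the user code.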

Figure 6 - Vivado HLS performance estimate for the elementary reference median filter when used as the top-level function; throughput is far from optimal.

Figure 7 - Vivado HLS performance estimate for the top-level filter function; the frame rate is 86.4 Hz, which exceeds our requirement.

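The C code of the new top-level function is shown in Figure 8. Since the current pixel at coordinates (row, column) is denoted in_pix[r][c], a sliding window has to be created around the output pixel to be filtered at coordinates (r-1, c-1). For a 3x3 window, the result is out_pix[r-1][c-1]. Note that for window sizes of 5x5 or 7x7, the output pixel coordinates would be (r-2, c-2) and (r-3, c-3), respectively. The static array line_buffer stores KMED video lines, as many as the number of vertical samples in the median filter (three in the current case); and thanks to the static C keyword, the Vivado HLS compiler can automatically map its contents into FPGA dual-port Block RAM (BRAM) elements.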

Figure 8 - New top-level C code, which takes the behavior of the video line buffers into account
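
The line-buffer idea can be sketched in plain C as follows. This is an illustrative reconstruction under stated assumptions, not the listing from the figure: the function name top_median, the stand-in median helper, the tripcount values and the exact pragma options are assumed, while in_pix, out_pix, line_buffer, KMED and pix_t are the names used in the text.

```c
#define KMED       3
#define NPIX       (KMED * KMED)
#define MAX_HEIGHT 1080
#define MAX_WIDTH  1920

typedef unsigned char pix_t;

/* Stand-in median (assumed): sorts a copy with compare-and-swap stages. */
static pix_t median(const pix_t window[NPIX])
{
    pix_t z[NPIX];
    int stage, i;

    for (i = 0; i < NPIX; i++)
        z[i] = window[i];
    for (stage = 0; stage < NPIX; stage++)
        for (i = stage % 2; i + 1 < NPIX; i += 2)
            if (z[i] > z[i + 1]) {
                pix_t t = z[i];
                z[i] = z[i + 1];
                z[i + 1] = t;
            }
    return z[NPIX / 2];
}

void top_median(pix_t in_pix[MAX_HEIGHT][MAX_WIDTH],
                pix_t out_pix[MAX_HEIGHT][MAX_WIDTH],
                short height, short width)
{
#pragma HLS INTERFACE ap_fifo port=in_pix
#pragma HLS INTERFACE ap_fifo port=out_pix
    static pix_t line_buffer[KMED][MAX_WIDTH];
#pragma HLS ARRAY_PARTITION variable=line_buffer complete dim=1
    pix_t window[NPIX] = { 0 };
    short r, c, i, j;

    L1: for (r = 0; r < height; r++) {
#pragma HLS LOOP_TRIPCOUNT min=1 max=1080
        L2: for (c = 0; c < width; c++) {
#pragma HLS LOOP_TRIPCOUNT min=1 max=1920
#pragma HLS PIPELINE II=1
            /* Shift column c of the line buffers up by one row and
               store the new input pixel in the bottom row. */
            for (i = 0; i < KMED - 1; i++)
                line_buffer[i][c] = line_buffer[i + 1][c];
            line_buffer[KMED - 1][c] = in_pix[r][c];

            /* Slide the window one column to the left, then load its
               new right-hand column from the line buffers. */
            for (i = 0; i < KMED; i++) {
                for (j = 0; j < KMED - 1; j++)
                    window[i * KMED + j] = window[i * KMED + j + 1];
                window[i * KMED + KMED - 1] = line_buffer[i][c];
            }

            /* The window is valid once KMED rows and columns have arrived;
               for KMED = 3 the result belongs at (r-1, c-1). */
            if (r >= KMED - 1 && c >= KMED - 1)
                out_pix[r - KMED / 2][c - KMED / 2] = median(window);
        }
    }
}
```

In this sketch each input pixel is read exactly once, in raster order, and every other value needed by the window comes from the line buffers, which is what allows the input to be treated as a stream. Border pixels of the output are simply left unwritten here.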

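Only a few HLS directives are then needed to achieve real-time performance. The innermost loop L2 is pipelined so as to produce one output pixel every clock cycle. The input and output image arrays in_pix and out_pix are mapped to FIFO streaming interfaces in the RTL. The line_buffer array is partitioned into KMED separate arrays, so that the Vivado HLS compiler maps each one to a separate dual-port BRAM. Because more ports are then available, more load/store operations can be performed (each dual-port BRAM can complete two load or store operations per cycle). Figure 7 shows the Vivado HLS performance estimation report. The maximum latency is now 2,073,618 clock cycles. With an estimated clock period of 5.58 ns, we get a frame rate of 86.4 Hz. That exceeds what we need! Loop L1_L2 achieves II=1, as we hoped. Note that two BRAMs are needed to hold the KMED line buffer memories.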
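
The frame-rate figures quoted from the reports follow from the latency and the clock period. A small sketch of the arithmetic (the helper name frame_rate_hz is an assumption):

```c
/* Frame rate in Hz, given the latency of one frame in clock cycles
   and the clock period in nanoseconds. */
static double frame_rate_hz(double cycles_per_frame, double clock_period_ns)
{
    return 1.0e9 / (cycles_per_frame * clock_period_ns);
}
```

With 2,073,618 cycles at 5.58 ns, this gives about 86.4 Hz. The same formula applied to the reference design, 10,368,020 cycles at 5 ns, gives roughly 19.3 Hz, which the article quotes as 19.2 Hz.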

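Architectural Exploration with High-Level Synthesis
In my view, one of the best features of Vivado HLS is the creative design freedom it gives you to explore different design architectures and performance trade-offs, either by changing the tool's optimization directives or by changing the C code itself. Both operations are very simple and take little time.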

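What if we need a larger median filter window, say 5x5 instead of 3x3? We only have to change the definition of KMED in the C code from "3" to "5" and run Vivado HLS again. Figure 9 shows the HLS comparison report obtained by synthesizing the median filter routine standalone for the three window sizes 3x3, 5x5 and 7x7. In all three cases the routine is fully pipelined (II=1) and meets the target clock period; the latency is 9, 25 and 49 clock cycles, respectively, as one would expect from a sorting network. Clearly, because the amount of data to be sorted grows from 9 to 25 and then to 49, the resources used (flip-flops and lookup tables) grow accordingly.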

Figure 9 - Comparison of Vivado HLS performance estimates for the standalone median filter function with the three window sizes 3x3, 5x5 and 7x7
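
The growth in resources can be made tangible by counting comparators. The sketch below assumes an odd-even transposition network with one stage per input, which is consistent with the reported latencies of 9, 25 and 49 cycles but is not confirmed by the reports themselves; the function name network_comparators is hypothetical.

```c
/* Number of comparators in an odd-even transposition network with n inputs:
   n stages, each comparing disjoint neighboring pairs, alternating between
   pairs that start at an even index and pairs that start at an odd index. */
static int network_comparators(int n)
{
    int stage, i, count = 0;

    for (stage = 0; stage < n; stage++)
        for (i = stage % 2; i + 1 < n; i += 2)
            count++;
    return count;
}
```

Under this assumption the 3x3, 5x5 and 7x7 windows need 36, 300 and 1,176 comparators, that is, n(n-1)/2 for n inputs, so the logic grows roughly with the square of the number of pixels in the window.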

Because the standalone function is fully pipelined, the latency of the top-level function remains constant, while the clock frequency decreases slightly as the window size increases. So far we have discussed only the Zynq-7000 All Programmable SoC as the target device, but with Vivado HLS we can easily try different target devices in the same project. For example, if we synthesize the same 3x3 median filter design for a Kintex®-7 325T, the resource usage after place and route is two BRAMs, one DSP48E, 1,323 flip-flops and 705 lookup tables (LUTs) at a clock and data rate of 403 MHz, whereas on the Zynq SoC the design uses two BRAMs, one DSP48E, 751 flip-flops and 653 LUTs at a clock and data rate of 205 MHz.

Finally, if we want to see the resource utilization of a 3x3 median filter working on a gray image with 11 bits per sample instead of 8, we can change the definition of the pix_t data type by applying the ap_int C++ type, which lets us specify fixed-point numbers of arbitrary bit width. We only need to recompile the project with the C preprocessor symbol GRAY11 enabled. In this case, the estimated resource usage on the Zynq SoC is four BRAMs, one DSP48E, 1,156 flip-flops and 1,407 LUTs. Figure 10 shows the synthesis estimation reports for these last two cases.

Figure 10 - Vivado HLS comparison report for the 3x3 top-level function in two cases: 11-bit processing on the 7Z02 device and 8-bit processing on a 7K325 device

Within a Few Days
We have also seen how simple it ultimately is to generate timing and area estimates for median filters with different window sizes and even different numbers of bits per pixel. In particular, for the 3x3 (or 5x5) median filter, the RTL automatically generated by Vivado HLS takes up very little area on the Zynq SoC device (-1 speed grade), with an FPGA clock frequency after place and route of 206 MHz (188 MHz for the 5x5 version) and an effective data rate of 206 (or 188) MSPS.

Obtaining these results took a total design time of only five working days. Most of that time went into building the MATLAB® and C models rather than into running the Vivado HLS tool itself; the latter took less than two working days.


Origin blog.csdn.net/zhipao6108/article/details/90759124