Freeing the Helix MP3 decoding library from assembly instructions so that it runs on any processor

1. Main motivation

Helix is a widely used MP3 audio decoding library, but the original code optimizes its low-level routines only for ARM and x86 processors, and ships low-level code for those two architectures alone. Far more processor architectures are in common use, such as Intel FPGA soft-core NIOS II processors, Texas Instruments TMS320F28 series digital signal processors, and Infineon TC2 series MCUs with TC1 processors. In recent years in particular, domestic MCUs have sprung up like mushrooms, many of them built on RISC-V or other non-ARM cores: the K210, the Lianshengde W801 and W806 based on the Pingtouge XT804 processor, the Jiangsu Hengqin CH57 series, the ESP8266 and ESP32 series, the Zhongke Lanxun AB32VG1, and so on. Code that was written for ARM-based MCUs and uses assembly for performance cannot be ported to these chips. When I was developing on the Zhongke Lanxun AB32VG1, the build reported an error after I configured the Helix library: the low-level layer called code written in ARM assembly, which kept this capable domestic MCU from working as an MP3 player.

The following pictures show the Helix decoding library being added to the software package, the error reported at compile time, and the location of the error. The second picture clearly shows ARM instructions, not to mention the Cortex-M3 note written at the top of the file.

 

 

2. Solutions

Every MCU based on a RISC-V processor comes with a GCC toolchain that compiles C or C++. So as long as the functions that could not be ported because they use assembly are re-implemented in C or C++, they can run on any processor. (The extra performance cost of switching to C or C++ is not considered for now; the change only spends more CPU time to do the same work.)

3. Detailed modification process

It is actually very simple. No matter how complex a processor is, it has only one job, processing data, and only two ways of doing it: logical operations and arithmetic operations. Once you understand what each instruction in the assembly file does, you can rewrite it in C.

3.1 First find all assembly files in the Helix library

The search turns up two assembly files, so we only need to re-implement the functions of these two files in C.

3.2 Then I want to be lazy 

Although these two assembly files are not very long, I still want to be lazy. Is there a way to get it done in ten minutes?

There is!

Reading the low-level files of Helix, we find that the library supports x86 processors as well as ARM processors, and that the low-level file for the x86 version is already implemented in C. This file is polyphase.c (in the Helix directory, one level above the original assembly files).

Compiling directly with this file still reports an error, this time pointing to assembly.h. It turns out that several frequently used calculations are optimized with inline assembly. Because that assembly targets x86, the build still fails on our processor.

We are back to converting assembly to C, but assembly.h contains only 4 assembly functions, and each of them does something extremely simple. Converted to C, none of them takes more than three lines.

The specific changes are as follows:

32-bit by 32-bit multiplication with a 64-bit product, returning the high 32 bits (the commented-out part is the original x86 assembly)

static __inline int MULSHIFT32(int x, int y)	
{
//    __asm {
//		mov		eax, x
//	    imul	y
//	    mov		eax, edx
//	}
	/* Widen both operands to 64 bits BEFORE multiplying, then keep the high 32 bits */
	long long temp;
	temp =  (long long)x * (long long)y;
	return (int)(temp >> 32);
}

64-bit multiply-accumulate (the commented-out part is the original x86 assembly)

static __inline Word64 MADD64(Word64 sum, int x, int y)
{
//	unsigned int sumLo = ((unsigned int *)&sum)[0];
//	int sumHi = ((int *)&sum)[1];

//	__asm {
//		mov		eax, x
//		imul	y
//		add		eax, sumLo
//		adc		edx, sumHi
//	}

	return sum + (Word64)x * (Word64)y;

	/* equivalent to return (sum + ((__int64)x * y)); */
}

64-bit logical left shift (commented out is the original assembly code)

static __inline Word64 SHL64(Word64 x, int n)
{
//	unsigned int xLo = ((unsigned int *)&x)[0];
//	int xHi = ((int *)&x)[1];
//	unsigned char nb = (unsigned char)n;

//	if (n < 32) {
//		__asm {
//			mov		edx, xHi
//			mov		eax, xLo
//			mov		cl, nb
//			shld    edx, eax, cl
//			shl     eax, cl
//		}
//	} else if (n < 64) {
//		/* shl masks cl to 0x1f */
//		__asm {
//			mov		edx, xLo
//			mov		cl, nb
//			xor     eax, eax
//			shl     edx, cl
//		}
//	} else {
//		__asm {
//			xor		edx, edx
//			xor		eax, eax
//		}
//	}
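	/* A shift count >= 64 is undefined in C; the assembly above returned 0 in that case */
	return (n < 64) ? (Word64)((unsigned long long)x << n) : 0;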
}

64-bit arithmetic right shift (commented out is the original assembly code)

static __inline Word64 SAR64(Word64 x, int n)
{
//	unsigned int xLo = ((unsigned int *)&x)[0];
//	int xHi = ((int *)&x)[1];
//	unsigned char nb = (unsigned char)n;

//	if (n < 32) {
//		__asm {
//			mov		edx, xHi
//			mov		eax, xLo
//			mov		cl, nb
//			shrd	eax, edx, cl
//			sar		edx, cl
//		}
//	} else if (n < 64) {
//		/* sar masks cl to 0x1f */
//		__asm {
//			mov		edx, xHi
//			mov		eax, xHi
//			mov		cl, nb
//			sar		edx, 31
//			sar		eax, cl
//		}
//	} else {
//		__asm {
//			sar		xHi, 31
//			mov		eax, xHi
//			mov		edx, xHi
//		}
//	}
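	/* A shift count >= 64 is undefined in C; the assembly above returned the sign fill */
	return x >> (n < 64 ? n : 63);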
}
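Before flashing anything, the four replacements can be spot-checked on a PC. The following is only a sketch: the `_c` function names and `check_helpers` are made up for illustration, `int64_t` stands in for the library's `Word64`, and the handling of shift counts of 64 or more mirrors what the commented-out assembly does rather than anything the library documents.

```c
#include <assert.h>
#include <stdint.h>

/* High 32 bits of the signed 64-bit product x * y. */
static inline int32_t mulshift32_c(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * (int64_t)y) >> 32);
}

/* sum + x * y, with the product widened to 64 bits before the add. */
static inline int64_t madd64_c(int64_t sum, int32_t x, int32_t y)
{
    return sum + (int64_t)x * (int64_t)y;
}

/* Left shift; a count of 64 or more gives 0, like the x86 assembly. */
static inline int64_t shl64_c(int64_t x, int n)
{
    assert(n >= 0);
    /* Shift as unsigned so that no signed overflow can occur. */
    return (n < 64) ? (int64_t)((uint64_t)x << n) : 0;
}

/* Arithmetic right shift; a count of 64 or more gives the sign fill.
 * Assumes the compiler shifts negative values arithmetically (GCC does). */
static inline int64_t sar64_c(int64_t x, int n)
{
    assert(n >= 0);
    return x >> (n < 64 ? n : 63);
}

/* Returns the number of failed spot checks (0 means all passed). */
static int check_helpers(void)
{
    int failures = 0;
    failures += mulshift32_c(0x40000000, 0x40000000) != 0x10000000;
    failures += mulshift32_c(-0x40000000, 0x40000000) != -0x10000000;
    failures += madd64_c(1, 0x7fffffff, 2) != 0xffffffffLL;
    failures += shl64_c(1, 40) != (1LL << 40);
    failures += shl64_c(1, 64) != 0;
    failures += sar64_c(-(1LL << 40), 8) != -(1LL << 32);
    failures += sar64_c(-1, 100) != -1;
    return failures;
}
```

Calling `check_helpers()` from a small test program and confirming it returns 0 is a quick way to catch a missing cast before listening for it on the hardware.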

That's it! It really is that simple.

Finally, add a macro definition in front of the conditional compilation block in this file, so that the compiler treats our processor as an x86 processor and the 4 functions we just changed take effect.
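The trick relies on how an `#if`/`#elif` chain picks a platform branch. The sketch below shows only the mechanism. `HELIX_DEMO_X86`, `demo_branch` and `selected_branch` are made-up names, not what the real assembly.h uses, so check the conditions in that file for the macro it actually tests.

```c
#include <assert.h>

/* Hypothetical stand-in for the macro the real header checks.
 * It must be defined before the #if chain below is reached. */
#define HELIX_DEMO_X86 1

enum demo_branch { DEMO_BRANCH_X86_C = 1, DEMO_BRANCH_ARM_ASM = 2 };

#if defined(HELIX_DEMO_X86)
/* The "x86" branch, which now holds the plain-C helpers. */
static enum demo_branch selected_branch(void) { return DEMO_BRANCH_X86_C; }
#elif defined(__arm__)
/* The branch that would pull in ARM assembly. */
static enum demo_branch selected_branch(void) { return DEMO_BRANCH_ARM_ASM; }
#else
#error "no low-level implementation for this platform"
#endif
```

Because the first matching condition wins, defining the macro ahead of the chain is enough to select the C branch on any target. The same definition could also be passed on the compiler command line with `-D`.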

4. Test results

After this modification, playback works normally, with no effect on sound quality. The author did hit one snag during testing: the C code above casts the int operands to long long before multiplying. Without this conversion the product is computed in 32 bits and overflows before it is widened, and no sound comes out at all!
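A small experiment shows why the cast matters. This is a sketch with hypothetical function names. The "narrow" variant models the broken code with unsigned arithmetic, because overflowing a plain int multiplication is undefined behavior in C and cannot be demonstrated reliably.

```c
#include <assert.h>
#include <stdint.h>

/* Correct: widen to 64 bits first, then multiply. */
static int32_t mulshift32_wide(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * (int64_t)y) >> 32);
}

/* Model of the broken variant: the product is formed in 32 bits and
 * widened only afterwards, so its upper half is already lost. */
static int32_t mulshift32_narrow(int32_t x, int32_t y)
{
    uint32_t low = (uint32_t)x * (uint32_t)y; /* only the low 32 bits survive */
    int64_t temp = (int32_t)low;              /* widened too late */
    return (int32_t)(temp >> 32);             /* sign bits only: 0 or -1 */
}
```

For `x = y = 0x40000000` the wide version returns `0x10000000`, while the narrow model returns 0. In this model every result collapses to 0 or -1, which fits the silence the author observed.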

5. Performance loss comparison and estimation

5.1 Comparison method:

Run the MP3 decoding function in a FreeRTOS task, check the CPU usage of that task, and compare the usage with the low-level layer implemented in assembly and in C.

5.2 Estimation method:

Multiply the known performance of the test platform's processor at its rated operating frequency by the CPU usage. The result is the minimum performance required to run this decoding library.

5.3 Start the test:

The test track is: Ah Si - Love You at 105°C.mp3

Its bit rate is 320 kbps, the highest bit rate for common MP3 files.

The test platform is the STM32H743VBT6. Its processor is a Cortex-M7 running at 480 MHz, with a performance of 1027 DMIPS.

The figure below shows the low-level layer implemented in C. The CPU usage is 15%.

The figure below shows the low-level layer implemented with the original ARM assembly instructions. The CPU usage is 5%.

The C version works, but its CPU usage is 3 times that of the assembly version.

From this we can calculate:

With the low-level layer in assembly, the processor needs at least 1027 DMIPS * 5% = 51.35 DMIPS.

With the low-level layer in C, the processor needs at least 1027 DMIPS * 15% = 154.05 DMIPS.

This also explains indirectly why MP3 decoding can run on the STM32F103: its processor is a Cortex-M3 running at 72 MHz with a performance of 90 DMIPS, which is above the 51.35 DMIPS the assembly version needs.
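The estimate above is simple enough to script. The sketch below uses hypothetical helper names. The inputs are the figures quoted in this section, and the result is only as good as the assumption that load scales linearly with DMIPS.

```c
#include <assert.h>

/* Required performance = platform DMIPS x measured CPU share (in percent). */
static double required_dmips(double platform_dmips, double cpu_percent)
{
    assert(cpu_percent >= 0.0 && cpu_percent <= 100.0);
    return platform_dmips * cpu_percent / 100.0;
}

/* Returns 1 if a target with the given DMIPS rating meets the estimate. */
static int target_is_fast_enough(double target_dmips, double needed_dmips)
{
    return target_dmips >= needed_dmips;
}
```

With these helpers, `required_dmips(1027.0, 5.0)` gives about 51.35 and `required_dmips(1027.0, 15.0)` gives about 154.05. A 90 DMIPS part passes the first check and fails the second.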

6. Putting it into practice

Go back to RT-Thread Studio and the Zhongke Lanxun AB32VG1 project, and modify the code in the same way.

Note: RT-Thread Studio manages project source files completely differently from KEIL5. RT-Thread Studio displays every source file in the workspace. To keep a file out of the build without deleting it, you need to exclude it through the file filter. The file filter is used as shown below. We need to filter out the two assembly source files, then find the entry for polyphase.c in the filter and click Remove, so that the C implementation polyphase.c shows up in the project.

 

By default, the build reports an error. On closer inspection, the data segment has overflowed its memory region. Just modify the link.lds file: since the data segment is too small, shrink some of the other regions and give the space to the data segment. The author's settings are as follows, for reference only.

After the modification, the build passes.

Downloaded, booted, and the SD card mounted successfully.

MP3 player function

Playback succeeded, so the small wish of porting Helix to a domestic MCU has come true. The actual playback quality is very poor and stutters badly, though. As calculated earlier, running the decoder with the C low-level layer needs at least 154.05 DMIPS, and the AB32VG1 runs at only 120 MHz. To run this decoder at that clock, the processor's efficiency must exceed 154.05 / 120 ≈ 1.28 DMIPS/MHz. (For reference: 1.25 DMIPS/MHz for Cortex-M3 and M4, 2.14 DMIPS/MHz for Cortex-M7.)
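The efficiency threshold can be worked out the same way for any clock speed. The sketch below uses hypothetical helper names and only the DMIPS and MHz figures quoted in this article.

```c
#include <assert.h>

/* Minimum core efficiency (DMIPS/MHz) needed to reach needed_dmips
 * at a clock of clock_mhz. */
static double min_dmips_per_mhz(double needed_dmips, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return needed_dmips / clock_mhz;
}

/* DMIPS that a core delivers at a given efficiency and clock. */
static double delivered_dmips(double dmips_per_mhz, double clock_mhz)
{
    return dmips_per_mhz * clock_mhz;
}
```

For example, `min_dmips_per_mhz(154.05, 120.0)` is roughly 1.28. A 120 MHz core with Cortex-M3/M4 efficiency (1.25 DMIPS/MHz) would deliver 150 DMIPS, just under the 154.05 DMIPS estimate.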

 

 

 


Origin blog.csdn.net/Fairchild_1947/article/details/123150107