ARM bare-metal development tutorial based on the STM32MP157 Linux development board, Part 4: Cortex-A7 core storage system and pipeline (serialized)

Foreword:

At present, the ARM Cortex-A7 bare-metal development documents and videos have been upgraded twice and are continuously updated to make the content richer and the explanations more detailed. The development platform used throughout is the Huaqing Yuanjian FS-MP1A development board (an STM32MP157 development board).

In addition to Cortex-A7 bare-metal development, the FS-MP1A development board has other tutorial series, including Cortex-M4 development, FreeRTOS, Linux basics and application development, Linux system porting, Linux driver development, hardware design, artificial-intelligence machine vision, Qt application programming, and comprehensive Qt project practice. Upgrades to the documents and videos for the Linux system porting and Linux driver development chapters are also planned, so stay tuned!

More information about the development board can be obtained by leaving a message in the comment area.

Cortex-A7 core storage system and pipeline

Storage System Overview

The ARM memory system is organized in multiple levels, which can be divided into core level, chip level, board level, and peripheral level, as shown in the figure below.

 

Each level has its specific storage media. The following compares the storage performance of the specific media at each level.

⚫ Core-level registers. The processor's register file can be thought of as the top of the memory hierarchy. These registers are integrated into the processor core and provide the fastest memory access in the system. A typical ARM processor has many 32-bit registers whose access time is on the order of nanoseconds.

⚫ Chip-level tightly coupled memory (TCM, available on some processors) is memory added to compensate for the unpredictability of Cache accesses. TCM is fast SRAM located close to the core that guarantees the number of clock cycles needed for instruction fetches and data operations, which is important for real-time algorithms that require deterministic behavior. The TCM appears in the memory address map and can be accessed as fast memory.

⚫ Chip-level on-chip Cache memory has a capacity between 8KB and 32KB and an access time of about 10ns. In high-performance ARM systems there may also be a second-level off-chip Cache with a capacity of several hundred KB and an access time of tens of ns.

⚫ Board-level DRAM. Main memory may be a few MB to tens of MB of dynamic memory with an access time of about 100ns.

⚫ Peripheral-level backing store, usually a hard disk, may range from hundreds of MB to tens of GB, with an access time of tens of milliseconds.

The storage management hardware inside the processor core mainly includes the Cache, the MMU, and the write buffer, together with coprocessor CP15, which controls these storage-related units. The figure below is a simple structural diagram.

 

Cortex-A7 core memory

The previous chapter covered the registers of the Cortex-A7 core. This chapter gives an overall picture of the storage structure of the Cortex-A7 core in the STM32MP1, as shown in the figure below.

 

It can be seen that the Cortex-A7 core has a two-level Cache, organized as a Harvard-style Cache (the early ARM7 used a von Neumann structure): instructions and data can interact with the ICache and DCache at the same time (this is also a requirement for pipelines of 5 or more stages, as the pipeline chapter will explain). A software sketch for inspecting this hierarchy follows, and the structure itself is shown in the figure below:
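As an aside, CP15 exposes identification registers that describe each cache level. A minimal debugging sketch, assuming the standard ARMv7-A CP15 encodings (r1 is reused here and can be inspected in the debugger after each mrc):

mrc p15, 1, r1, c0, c0, 1    // read CLIDR: reports what type of cache exists at each level
mov r1, #0                   // 0 selects the level-1 data/unified cache (level << 1 | InD)
mcr p15, 2, r1, c0, c0, 0    // write CSSELR: choose which cache CCSIDR will describe
isb                          // make the selection take effect before reading CCSIDR
mrc p15, 1, r1, c0, c0, 0    // read CCSIDR: line size, associativity, number of sets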

 

Memory Management Unit (MMU)

When building a multitasking embedded system, it is desirable to have an easy way to program, load, and run individual tasks. Most embedded systems today no longer use custom-built control programs but instead use an operating system to simplify this process. More capable operating systems rely on a hardware memory management unit (MMU) to do this.

A key service provided by the MMU is to enable each task to run as a separate program in its own private memory space. Under the control of an operating system with an MMU, a running task does not need to know anything about the storage requirements of other, unrelated tasks, which simplifies the design of each task.

The MMU provides the resources needed for virtual memory (a re-mapping of the system's physical memory, which can be viewed as an address space separate from physical memory). Acting as a translator, the MMU converts the virtual addresses of programs and data (the link addresses assigned at compile time) into physical addresses, that is, addresses in physical main memory. This translation allows multiple programs to run using the same virtual addresses while each is stored at a different location in physical memory.

Memory thus has two kinds of addresses: virtual addresses and physical addresses. Virtual addresses are assigned by the compiler and linker when a program is located; physical addresses are used to access the actual main memory hardware (where the program physically resides).

 

 

MMU enable sequence:

mrc p15, 0, r1, c1, c0, 0    // read CP15 c1, the control register
orr r1, r1, #0x1             // set the M bit
mcr p15, 0, r1, c1, c0, 0    // write back the control register: MMU enabled

MMU disable sequence:

mrc p15, 0, r1, c1, c0, 0    // read CP15 c1, the control register
bic r1, r1, #0x1             // clear the M bit
mcr p15, 0, r1, c1, c0, 0    // write back the control register: MMU disabled
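One hedged note: enabling the MMU is only meaningful once a valid translation table exists. A minimal sketch of the two CP15 writes that normally precede the enable sequence (ttb_base is a hypothetical 16KB-aligned first-level table defined elsewhere):

ldr r1, =ttb_base            // hypothetical first-level translation table (16KB aligned)
mcr p15, 0, r1, c2, c0, 0    // write TTBR0: tell the MMU where the table lives
mov r1, #0x3
mcr p15, 0, r1, c3, c0, 0    // write DACR: domain 0 = manager (no permission checks)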

Cache and write buffer

Cache is a small but very fast memory that keeps copies of the most recently used memory data. To the programmer, the Cache is transparent: it automatically decides which data to keep and which to overwrite. The Cache is now usually implemented on the same chip as the processor. Caches work because of program locality: at any given time, the processor tends to execute the same instructions (such as a loop) repeatedly on the same region of data (such as the stack).

Cache is often used together with a write buffer, a very small first-in-first-out (FIFO) memory located between the processor core and main memory. The purpose of the write buffer is to free the processor core and Cache from slow main-memory write operations. When the CPU writes to main memory, it first puts the data into the write buffer; since the write buffer is very fast, the write appears to complete very quickly. The write buffer then writes the data to the corresponding locations in main memory, at a slower rate, while the CPU continues with other work.
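When software must be certain that buffered writes have actually reached memory (for example, before handing a buffer to another bus master), ARMv7 provides a barrier instruction that waits for the write buffer to drain; a minimal sketch:

str r1, [r0]    // a write that initially lands in the write buffer
dsb             // data synchronization barrier: stalls until the buffered write completes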

Introducing the Cache and the write buffer greatly improves the performance of the storage system, but it also brings some problems. For example, because copies of the same data can exist at different physical locations in the system, data may become inconsistent; and because the write buffer optimizes write traffic, some writes may not be performed in the order the programmer expects, causing errors. Therefore, when you later study driver development, pay attention to how the MMU manages the peripheral hardware register space: the C and B bits in the page-table entries control these behaviors and must be set carefully. For peripheral registers, you generally choose not to support the Cache or the write buffer. Note that memory is managed in sections or pages, and different pages can have different attribute settings.
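For illustration only (the 0x50000000 base address, the registers used, and the ttb_base symbol are all hypothetical), a first-level section descriptor in the ARMv7 short-descriptor format that identity-maps one 1MB block of peripheral registers with C=0 and B=0 could be built like this:

ldr r1, =0x50000000          // hypothetical peripheral address, 1MB-aligned (VA = PA here)
ldr r2, =0x00000C02          // AP=0b11 (full access), domain 0, C=0, B=0, type=section
orr r2, r2, r1               // fold the section base address (bits 31:20) into the descriptor
ldr r3, =ttb_base            // hypothetical 16KB-aligned first-level translation table
str r2, [r3, r1, lsr #18]    // entry offset = (VA >> 20) * 4, i.e. VA >> 18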

 

 

Of course, the Cache can be turned on or off as a whole.

The ICache can be enabled separately once the overall Cache switch is turned on, and it can also be used when the MMU is not enabled.

The DCache, however, depends on the MMU: only after the MMU is enabled does the DCache take effect, under the MMU's control.

To turn the instruction Cache on and off, use a general-purpose register to interact with the c1 control register in the CP15 coprocessor. The figure below shows the c1 register.

 

We again add the following code to start.S of the imported c_led project, then debug and observe the behavior (because r0 is already used by the program, use the r1 register instead).

/****** Cache test ******/
mrc p15, 0, r1, c1, c0, 0
orr r1, r1, #(1 << 2)     // set the C bit: turn on the Cache as a whole
orr r1, r1, #(1 << 12)    // set the I bit: turn on the ICache
mcr p15, 0, r1, c1, c0, 0
/****** End test ******/
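One hedged caveat: on ARMv7-A, a write to this control register is not guaranteed to affect the instructions that follow until a context synchronization event, so a conservative version of the sequence above ends with a barrier:

mcr p15, 0, r1, c1, c0, 0
isb                          // context synchronization: the new setting takes effect here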

 

At the breakpoint, you can see that the value of r1 is 0x5187f, and the corresponding C and I bits are both 1, indicating that the ICache was already enabled; in other words, the added code does not change the original state. Next, run the program and note the blinking frequency of the LED on the FS-MP1A development board. Note also that the M bit is 1, meaning the MMU is already enabled (which is what allows the DCache to take effect), as decoded below.
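As a quick sanity check, 0x5187f can be decoded bit by bit against the control register layout:

0x5187f = 0101 0001 1000 0111 1111b
bit 0  (M) = 1  ->  MMU enabled
bit 2  (C) = 1  ->  Cache enabled as a whole
bit 12 (I) = 1  ->  ICache enabled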

Next, test the effect of disabling the ICache on the program. At the same place as before, add an instruction to turn the ICache off.

/****** Cache test ******/
mrc p15, 0, r1, c1, c0, 0
orr r1, r1, #(1 << 2)     // set the C bit: turn on the Cache as a whole
orr r1, r1, #(1 << 12)    // set the I bit: turn on the ICache
bic r1, r1, #(1 << 12)    // clear the I bit: turn the ICache back off
//bic r1, r1, #(1 << 2)   // clear the C bit: turn off the Cache as a whole
mcr p15, 0, r1, c1, c0, 0
/****** End test ******/

After compiling and running, you can observe that with the ICache turned off, the LED blinks noticeably more slowly.

The concept and principle of the pipeline

The processor executes each instruction in a series of steps, typically as follows:

1. Read the instruction from memory (fetch).

2. Decode it to identify which instruction it is (decode).

3. Read the instruction's operands (these usually come from the register file) (reg).

4. Combine the operands to produce the result or a memory address (ALU).

5. If necessary, access memory to load or store data (mem).

6. Write the result back to the register file (res).

Not all instructions require every one of these steps; however, most instructions require more than one. The steps tend to use different hardware resources; for example, the ALU may be used only in step 4. Therefore, if one instruction does not start until the previous one has finished, only a small portion of the processor's hardware is in use during each step.

There is a way to significantly improve hardware utilization and processor throughput: start executing the next instruction before the current one finishes. This is commonly referred to as pipelining, and it is the mechanism by which RISC processors execute instructions. With a pipeline, other instructions can be decoded and executed while the next instruction is being fetched, thereby speeding up execution. A pipeline can be thought of as a car assembly line, where each stage performs only its own specialized task.

Using the above sequence of operations, the processor can be organized so that as soon as one instruction finishes step (1) and moves to step (2), the next instruction begins step (1). In principle, such a pipeline should be 6 times faster than execution without overlap; due to limitations of the hardware itself, the real speedup falls somewhat short of this ideal. A cycle-by-cycle sketch follows.
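Using the simple 3-stage model introduced in the next section (fetch, decode, execute) as an example, three instructions overlap like this:

cycle:    1       2       3       4       5
insn 1:   fetch   decode  exec
insn 2:           fetch   decode  exec
insn 3:                   fetch   decode  exec

Once the pipeline is full, one instruction completes every cycle, even though each individual instruction still takes 3 cycles from fetch to completion.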

 

Classification of pipelines

3-stage pipeline

ARM processors up through the ARM7 use a simple 3-stage pipeline, consisting of the following stages.

1. Instruction fetch (fetch): Load an instruction from memory.

2. Decode (decode): Identify the instruction to be executed and prepare the data-path control signals for the next cycle. In this stage, the instruction occupies only the decode logic, not the data path.

3. Execute (execute): Process the instruction and write the result back to a register.

The figure below shows how instructions flow through the 3-stage pipeline.

We again add the following code to start.S of the imported c_led project, then debug and observe.

 

 

/**** pipeline test begin ****/
mov r1, pc
/**** pipeline test end ****/

The result of the operation is as follows:

 

It can be seen that the value in R1 was assigned during the execute stage: R1 = 0xc20000a8 is the value of PC during the execute stage, while the address of the instruction itself is 0xc20000a0. This shows that, in the execute stage, the instruction's own address = PC - 8. (Note: as described later, although the Cortex-A7 has an 8-stage pipeline, it still follows this rule.)
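This PC+8 behavior in ARM state underlies a common position-independent idiom; a minimal sketch (the label and register are chosen for illustration):

here:
    sub r1, pc, #8    // r1 = address of 'here' itself, since pc reads as here + 8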

When the processor executes simple data-processing instructions, the pipeline completes on average 1 instruction per clock cycle. A single instruction still takes 3 clock cycles to complete, so there is a latency of 3 clock cycles, but the throughput is 1 instruction per cycle. The following case is the best case for the 3-stage pipeline.

 

In this example, 6 instructions are executed in 6 clock cycles; all operands are in registers (single-cycle execution), so the cycles per instruction (CPI) = 1.

In the 3-stage pipeline, when LDR and STR instructions need to access memory, the behavior is as shown in the figure below:

 

In this example, 4 instructions are executed in 6 cycles, so CPI = 6/4 = 1.5.

5-stage pipeline

All processors must meet the demand for higher performance. Up to the ARM7, the 3-stage pipeline used in ARM cores was very cost-effective. To obtain still higher performance, however, the organization of the processor had to be reconsidered. There are two ways to improve performance.

Increase the clock frequency. A higher clock frequency shortens the instruction cycle time, which requires simplifying the logic of each pipeline stage; the number of pipeline stages must therefore increase.

Reduce the average cycles per instruction (CPI). This requires re-implementing the operations that occupy more than one pipeline slot in the 3-stage ARM so that they take fewer cycles, or reducing the stalls caused by instruction dependencies, or a combination of both.

The 3-stage pipeline ARM core accesses memory every clock cycle, either to fetch an instruction or to transfer data. Simply exploiting the few unused memory cycles gives no obvious performance improvement. To improve the CPI, the memory system must deliver more than one datum per clock cycle, either by delivering more than 32 bits per cycle from a single memory or by providing separate memories for instructions and data.

For these reasons, higher-performance ARM cores use a 5-stage pipeline with separate instruction and data memories. Splitting instruction execution into 5 parts instead of 3 allows a higher clock frequency, and the separate instruction and data memories significantly reduce the core's CPI.

A typical 5-stage pipeline is used in the ARM9TDMI, which includes the following pipeline stages.

1. Instruction fetch (fetch): An instruction is fetched from memory and placed in the instruction pipeline.

2. Decode (decode): The instruction is decoded and its register operands are read from the register file. The register file has 3 read ports, so most ARM instructions can read all their operands in 1 cycle.

3. Execute (execute): One operand is shifted and the result is generated in the ALU. If the instruction is a load or store, the memory address is computed in the ALU.

4. Buffer/data (mem): Data memory is accessed if required; otherwise the ALU result is simply buffered for 1 clock cycle.

5. Write-back (write-back): The results of the instruction are written back to the register file, including any data loaded from memory.

The execution of instructions in the 5-stage pipeline is shown in the figure below.

 

 

During program execution, the PC value seen by software follows the 3-stage pipeline convention. In a 5-stage pipeline, operands are read one stage earlier, which would naturally produce a different value (PC+4 instead of PC+8). Such a code incompatibility cannot be tolerated, so the 5-stage pipeline ARM fully emulates the 3-stage behavior: the PC value incremented in the fetch stage is passed straight into the decode stage through the pipeline register between the two stages, and since the next instruction's PC+4 equals the current instruction's PC+8, the correct R15 is obtained without extra hardware.

8-stage pipeline

The Cortex-A7 has an 8-stage pipeline, but few implementation details are publicly documented, so it can only be introduced briefly here. From the classic ARM series to the current Cortex series, the structure of ARM processors has grown more complex, but the relationship between the instruction address and the PC seen by software has not changed: no matter how many pipeline stages there are, the current PC can still be determined by the original 3-stage pipeline's rule. This is mainly for software compatibility, and it can be expected that the cores ARM launches later will also preserve this property. Interested readers can consult the relevant documentation themselves.

Factors Affecting Pipeline Performance

interlock

During typical program processing, it is often the case that the result of one instruction is used as an operand of the next. For example, consider the following instruction sequence:

LDR R4, [R7]
ORR R8, R3, R4    ; generates an interlock on the 5-stage pipeline

 

The example shows that the pipeline stalls, because the result of the first instruction is not yet available when the second instruction needs its operand. The second instruction must wait (interlock) until the result is produced. In this example, 6 instructions are executed in 7 clock cycles, so CPI = 7/6 ≈ 1.2.

However, if the order of ORR R8, R3, R4 and AND R6, R3, R1 is exchanged, a better pipeline schedule is obtained without affecting the program's result, as sketched below. When you later study the Linux kernel, you will meet the concept of memory barriers, which are sometimes used precisely to prevent the compiler from reordering instructions like this.
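A minimal sketch of the reordering (only the three instructions involved are shown; the rest of the 6-instruction sequence is unchanged):

LDR R4, [R7]      // R4 arrives late, at the end of the mem stage
AND R6, R3, R1    // independent of R4: fills the slot that previously stalled
ORR R8, R3, R4    // R4 is ready by now; no interlock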

 

In this example, 6 instructions are executed in 6 clock cycles, CPI = 1.

jump instruction

Jump instructions also disrupt the pipeline, because the fetch of subsequent instructions depends on the computation of the jump target and must wait. Moreover, by the time the jump instruction has been decoded and confirmed to be a jump, the fetches behind it have already happened, so the instructions already prefetched into the pipeline must be discarded. If the jump target is computed in the ALU stage, two instructions from the original instruction stream have already been fetched before the target is known.
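A hedged illustration on the 3-stage model (the registers and labels are invented for the example): by the time the branch below is resolved in its execute stage, the two instructions behind it have already been fetched and must be discarded.

    b skip            // target computed in the execute (ALU) stage
    add r0, r0, #1    // already fetched behind the branch: flushed, never executed
    sub r1, r1, #1    // already fetched behind the branch: flushed, never executed
skip:
    mov r2, #0        // fetching restarts here after the two discarded slots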

 

Obviously, pipelining is most efficient when all instructions are executed in similar steps. If the processor's instructions are so complex that each instruction behaves differently from the next, then it's difficult to pipeline.

summary

For the Cache and MMU, only functional introductions and simple test experiments have been given so far; for now, you only need to understand what they do. After laying this foundation, you can go on to study the internal structure of the Cache and its control methods, such as the Cache's replacement and lockdown mechanisms, and the MMU's first-level and second-level page tables and permission management. These topics map directly onto later work in embedded Linux: driver development, kernel optimization, memory management, and real-time tuning, where they are very valuable. Some of the hard problems encountered when writing drivers may also trace back to this part of the storage system.


Origin blog.csdn.net/u014170843/article/details/130083168