Summary of ARM synchronization primitives

http://infocenter.arm.com/help/topic/com.arm.doc.dht0008a/DHT0008A_arm_synchronization_primitives.pdf

 

1. Software synchronization

When access to a shared resource must be limited to one agent at a time, the software that uses it must be synchronized.

Shared resources: shared memory, peripherals

Agent: processor, process, thread

 

Access is managed through atomic modification of state variables that represent the shared resources. As seen by other agents, an atomic modification either succeeds completely or fails completely; no intermediate state is ever visible.

 

On simple systems, this can be done by disabling interrupts around critical sections of code. On modern multitasking, multi-core systems, this approach is neither efficient nor safe. Modern architectures therefore provide hardware synchronization primitives that update memory locations atomically and safely.

 

Software synchronization interface

OS and platform libraries hide the low-level hardware primitives from application developers behind hardware-independent functions. These functions form part of the Application Programming Interface (API). There are two common types of software synchronization primitive:

Mutex: A variable with two states, locked and unlocked. Attempting to lock an already locked mutex blocks execution until the agent holding the mutex unlocks it. Mutexes are sometimes called locks or binary semaphores.

Semaphore: A counter that can be incremented and decremented atomically. Attempting to decrement a semaphore whose value is less than 1 blocks execution until another agent increments it.

In addition to blocking operations, APIs can define non-blocking interfaces that return an error condition immediately instead of blocking when execution of the requested operation fails.
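
The non-blocking variants described above can be sketched with C11 atomics. This is only an illustration under stated assumptions: the names sketch_mutex, sketch_sem and their functions are hypothetical and do not belong to any real OS API.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical types for illustration only. */
typedef struct { atomic_int locked; } sketch_mutex;  /* 0 = unlocked, 1 = locked */
typedef struct { atomic_int count; } sketch_sem;

/* Non-blocking lock: true on success, false if the mutex is already locked. */
static bool sketch_mutex_trylock(sketch_mutex *m)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&m->locked, &expected, 1);
}

static void sketch_mutex_unlock(sketch_mutex *m)
{
    atomic_store(&m->locked, 0);
}

/* Non-blocking decrement: fails instead of blocking when the count is < 1. */
static bool sketch_sem_trywait(sketch_sem *s)
{
    int c = atomic_load(&s->count);
    while (c >= 1) {
        /* On failure, c is refreshed with the current count and re-checked. */
        if (atomic_compare_exchange_weak(&s->count, &c, c - 1))
            return true;
    }
    return false;
}

static void sketch_sem_post(sketch_sem *s)
{
    atomic_fetch_add(&s->count, 1);
}
```

A blocking version would wrap the try functions in a loop or hand the waiting agent to the scheduler; that part is OS-specific and is left out here.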

 

Synchronization in multitasking systems

In a multitasking operating system, every synchronization operation must be guaranteed to behave correctly even if it is interrupted by a context switch. When synchronization with other processors is not required, software can achieve this by disabling interrupts while it updates synchronization variables. Inside the OS kernel this can be an effective method, but for application software the overhead of the required system calls makes it impractical.

 

Synchronization in multiprocessor systems

Multi-core and multi-processor systems introduce a further problem: locking a mutex or modifying a semaphore must be atomic across the whole system. This requires the system to maintain global state that tracks active synchronization operations.

 

Historical synchronization primitives for ARM architecture

The SWP and SWPB instructions atomically swap a 32-bit word or a byte between a register and memory. Starting with the ARMv6 architecture, ARM deprecates SWP and SWPB, which means that future architectures are not guaranteed to support them. ARM strongly recommends using the new synchronization primitives instead.
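
For illustration only, a swap-based lock of the kind SWP was used for can be written with the C11 atomic_exchange function. The helper names are hypothetical, and on ARMv6 and later a compiler is not expected to emit SWP for this code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical helpers: 0 = free, 1 = held. */
static bool swap_lock_try(atomic_uint *lock)
{
    /* Atomically write 1 and get the previous value back, as a swap would. */
    return atomic_exchange(lock, 1u) == 0u;
}

static void swap_lock_release(atomic_uint *lock)
{
    atomic_store(lock, 0u);
}
```

If the previous value was 0, the caller now holds the lock; if it was 1, the swap changed nothing and the caller must try again later.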

 

New in the ARMv6 architecture

The ARMv6 architecture introduces the concept of exclusive access to memory locations, which provides more flexible atomic memory updates. It also introduces memory types, memory access ordering rules, and barrier instructions for explicitly ordering memory accesses.

 

 

2. Exclusive access

The ARMv6 architecture introduces Load-Link and Store-Conditional instructions in the form of the Load-Exclusive and Store-Exclusive synchronization primitives, LDREX and STREX. Starting with ARMv6T2, these instructions are available in both the ARM and Thumb instruction sets. Load-Exclusive and Store-Exclusive provide flexible, scalable synchronization and replace the deprecated SWP and SWPB instructions.

 

LDREX and STREX

The LDREX and STREX instructions split the atomic update of a memory location into two separate steps. Together with the exclusive monitors, which track exclusive memory accesses, they provide atomic updates. Load-Exclusive and Store-Exclusive must only access memory regions marked as Normal.

 

LDREX loads a word from memory and initializes the state of the exclusive monitors to track the synchronization operation. For example, LDREX R1, [R0] performs a Load-Exclusive from the address in R0, places the value in R1, and updates the exclusive monitor (there may be more than one).

 

STREX conditionally stores a word to memory. If the exclusive monitors permit the store, memory is updated and 0 is written to the destination register to indicate success. If the exclusive monitors do not permit the store, memory is not updated and 1 is written to the destination register to indicate failure. This allows software to branch on the success or failure of the memory operation. For example, STREX R2, R1, [R0] performs a Store-Exclusive: it conditionally stores the value of R1 to the address in R0, and R2 reports success or failure.
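
The load, modify, conditional store, retry pattern can be sketched in portable C11. This is an analogy rather than the instructions themselves: compilers targeting ARM cores with exclusive access typically lower a weak compare-exchange to an LDREX/STREX pair, but that mapping is an assumption about the toolchain. The helper name atomic_add_u32 is hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: atomically add delta to *addr and return the new value. */
static uint32_t atomic_add_u32(_Atomic uint32_t *addr, uint32_t delta)
{
    /* Plays the role of the Load-Exclusive. */
    uint32_t old = atomic_load_explicit(addr, memory_order_relaxed);

    /* Plays the role of the Store-Exclusive: it may fail, in which case
       old is refreshed with the current value and the update is retried. */
    while (!atomic_compare_exchange_weak_explicit(addr, &old, old + delta,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
    }
    return old + delta;
}
```

The weak form is used on purpose: like STREX, it is allowed to fail even when no other agent touched the location, so the caller must always loop.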

 

Optional exclusive access size

The ARMv6K architecture introduces byte, halfword, and doubleword variants of LDREX and STREX:

LDREXB, STREXB

LDREXH, STREXH

LDREXD, STREXD

The ARMv7 architecture adds these variants to the Thumb instruction set in the A and R profiles. ARMv7-M supports the byte and halfword variants in addition to the word form, but not the doubleword variants. ARMv6-M does not support exclusive access.

 

The architecture requires that each Load-Exclusive instruction is used only with its corresponding Store-Exclusive instruction. For example, LDREXB must be paired only with STREXB.

 

Exclusive monitors

An exclusive monitor is a simple state machine with two states: open and exclusive. To support synchronization between processors, a system must implement two sets of monitors, local and global. A Load-Exclusive operation moves the relevant monitors to the exclusive state, and a Store-Exclusive operation checks these monitors to determine whether it can complete successfully. A Store-Exclusive succeeds only if all the monitors it accesses are in the exclusive state.
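
The two-state machine can be modelled in a few lines of C. This is a toy software model, not a description of real hardware; the type and function names are hypothetical, and it tracks a single monitor without any address checking.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: the names are hypothetical. */
enum monitor_state { MON_OPEN, MON_EXCLUSIVE };

struct sim_monitor { enum monitor_state state; };

/* Load-Exclusive: read the value and move the monitor to exclusive. */
static uint32_t sim_ldrex(struct sim_monitor *mon, const uint32_t *addr)
{
    mon->state = MON_EXCLUSIVE;
    return *addr;
}

/* Store-Exclusive: returns 0 and writes on success, returns 1 and leaves
   memory untouched on failure. The monitor ends up open in both cases. */
static int sim_strex(struct sim_monitor *mon, uint32_t *addr, uint32_t value)
{
    bool ok = (mon->state == MON_EXCLUSIVE);
    mon->state = MON_OPEN;
    if (ok)
        *addr = value;
    return ok ? 0 : 1;
}
```

In this model a Store-Exclusive without a preceding Load-Exclusive always fails, and a second Store-Exclusive after a successful one fails as well.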

Figure: local monitor and global monitor

 

Local monitors

Each processor that supports exclusive access has one local monitor. Exclusive accesses to memory locations marked as non-shared are checked only against the local monitor. Exclusive accesses to memory locations marked as shared are checked against both the local monitor and the global monitor.

For example, if software executing on a Cortex-A8 processor must synchronize applications that run locally, it can do so with a mutex placed in non-shared memory. The Load-Exclusive and Store-Exclusive instructions then access only the local monitor.

 

A local monitor can be implemented to tag an address for exclusive use, or it can contain a state machine that only tracks the issuing of Load-Exclusive and Store-Exclusive instructions. This means that a Store-Exclusive to a shared memory location may succeed even if the preceding Load-Exclusive was to a completely different address. Portable code must therefore make no assumptions about address checking for exclusive accesses.

 

If the memory location is cacheable, synchronization may be done without crossing an external bus and will not be visible to external observers, such as other processors in the system.

 

Global monitor

A global monitor tracks exclusive accesses to memory regions marked as shared. Any Store-Exclusive operation targeting shared memory must check both its local monitor and the global monitor before deciding whether to update memory.

 

For example, if software executing on one processor must synchronize its operations with software executing on another processor, it can do so through a mutex located in shared memory. The Load-Exclusive and Store-Exclusive instructions then access both the local and the global monitor.

 

A global monitor, or part of one, may be implemented in combination with the local monitor, for example in a system that implements cache coherency management.

 

The global monitor can tag one address for each processor in the system that supports exclusive access. When a processor completes a Load-Exclusive from a shared location, the global monitor tags that address for exclusive use by that processor. The following events reset the global monitor entry for processor N to the open state:

  1. Processor N performs a Load-Exclusive from a different location.
  2. Another processor successfully performs a store or a Store-Exclusive to the address tagged for exclusive use by processor N.

 

Other events may clear the global exclusive monitor, but they are implementation dependent, and portable code should not rely on these features.
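
The per-processor tagging and the two reset events listed above can be modelled as follows. This is a toy model under stated assumptions: the names are hypothetical, addresses are compared exactly instead of by reservation granule, ordinary stores are not modelled, and no implementation-dependent clearing events are included.

```c
#include <stdint.h>

#define SIM_NUM_CPUS 2   /* assumed processor count for the toy model */

/* One entry per processor; exclusive != 0 means the entry is in the
   exclusive state for the tagged address. */
struct sim_global_monitor {
    int exclusive[SIM_NUM_CPUS];
    uintptr_t tagged[SIM_NUM_CPUS];
};

/* Load-Exclusive by cpu: tag the address, replacing any earlier tag
   (this covers event 1). */
static void sim_global_ldrex(struct sim_global_monitor *g, int cpu, uintptr_t addr)
{
    g->exclusive[cpu] = 1;
    g->tagged[cpu] = addr;
}

/* Store-Exclusive by cpu: returns 0 on success, 1 on failure.
   A successful store opens every entry tagged with addr, including the
   storing processor's own entry (this covers event 2). */
static int sim_global_strex(struct sim_global_monitor *g, int cpu, uintptr_t addr)
{
    if (!g->exclusive[cpu] || g->tagged[cpu] != addr)
        return 1;
    for (int i = 0; i < SIM_NUM_CPUS; i++)
        if (g->tagged[i] == addr)
            g->exclusive[i] = 0;
    return 0;
}
```

When two processors both load-exclusive the same address, only the first Store-Exclusive succeeds; the loser sees 1 and has to restart from its Load-Exclusive.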

 

If a region configured as shared is not associated with a global monitor, Store-Exclusive operations to that region always fail and return 1 in the destination register.

 

Exclusive Reservation Granule

When an exclusive monitor tags an address, the smallest region that can be tagged for exclusive access is called the Exclusive Reservation Granule (ERG). Its size is implementation-defined, in the range of 8 to 2048 bytes, in multiples of two bytes. Portable code must not assume a particular size.
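
To make the granule idea concrete, the following sketch checks whether two addresses would fall into the same granule. The helper name is hypothetical, erg_bytes is an assumed granule size that real portable code cannot know, and granules are assumed to be aligned to their own size.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if a and b lie in the same granule of
   erg_bytes bytes, assuming granules are aligned to their size. */
static bool same_reservation_granule(uintptr_t a, uintptr_t b, size_t erg_bytes)
{
    return (a / erg_bytes) == (b / erg_bytes);
}
```

With an 8-byte granule, 0x1000 and 0x1004 share a granule while 0x1000 and 0x1008 do not; with a 2048-byte granule, everything from 0x1000 to 0x17FF shares one.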

 

Resetting monitors

When the OS performs a context switch, it must reset the local monitor to the open state to prevent false positives. ARMv6K introduces the Clear-Exclusive instruction, CLREX, to reset the local monitor.

 

In the ARMv6 base architecture and ARMv6T2, the local monitor must be reset by performing a dummy Store-Exclusive to a dedicated address.

 

The state of the monitors after a data abort exception is architecturally undefined. ARM therefore recommends that the exception-handling code execute a CLREX or a dummy Store-Exclusive instruction.

 

If a context switch schedules a process out after it has executed a Load-Exclusive but before it executes the Store-Exclusive, the Store-Exclusive returns a false negative when the process resumes, and memory is not updated. This does not affect program functionality, because the process can immediately retry the operation.
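
The retry behaviour after a monitor reset can be shown with a small toy model. Everything here is a simulation with hypothetical names: sim_clrex stands in for what the OS does on a context switch, and the "preemption" is forced on the first attempt only.

```c
#include <stdint.h>

/* Toy local monitor: 1 = exclusive, 0 = open. */
struct sim_local_monitor { int exclusive; };

static uint32_t sim_ldrex(struct sim_local_monitor *m, const uint32_t *addr)
{
    m->exclusive = 1;
    return *addr;
}

/* Returns 0 on success, 1 on failure; memory changes only on success. */
static int sim_strex(struct sim_local_monitor *m, uint32_t *addr, uint32_t value)
{
    int failed = !m->exclusive;
    m->exclusive = 0;
    if (!failed)
        *addr = value;
    return failed;
}

/* Stand-in for the monitor reset performed on a context switch. */
static void sim_clrex(struct sim_local_monitor *m) { m->exclusive = 0; }

/* Increment *addr with a simulated context switch during the first attempt.
   Returns the number of attempts that were needed. */
static int sim_increment_with_preemption(struct sim_local_monitor *m, uint32_t *addr)
{
    int attempts = 0;
    int failed;
    do {
        uint32_t v = sim_ldrex(m, addr);
        if (attempts == 0)
            sim_clrex(m);   /* preempted between the load and the store */
        attempts++;
        failed = sim_strex(m, addr, v + 1);
    } while (failed);
    return attempts;
}
```

The first Store-Exclusive fails without touching memory, the loop starts again from the Load-Exclusive, and the second attempt succeeds, so the value is incremented exactly once.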

 

For the above reasons, ARM recommends:

Keep the Load-Exclusive and the Store-Exclusive no more than 128 bytes apart.

Do not perform explicit cache maintenance operations or data accesses between the Load-Exclusive and the Store-Exclusive.

 

3. Memory barrier

To ensure a consistent view of memory, the architecture requires software to perform a Data Memory Barrier (DMB) operation:

Between acquiring a resource (for example, locking a mutex or decrementing a semaphore) and accessing that resource;

Before making a resource available (for example, unlocking a mutex or incrementing a semaphore).
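
The two barrier positions can be sketched with a simple spinlock in C11. This is a hedged illustration with hypothetical names; the assumption is that the compiler lowers a sequentially consistent fence to a data memory barrier on ARM, which is typical but toolchain-dependent.

```c
#include <stdatomic.h>

/* Hypothetical spinlock sketch: 0 = unlocked, 1 = locked. */
typedef struct { atomic_int locked; } barrier_lock;

static void barrier_lock_acquire(barrier_lock *l)
{
    int expected = 0;
    /* Relaxed atomics: ordering comes from the explicit fence below. */
    while (!atomic_compare_exchange_weak_explicit(&l->locked, &expected, 1,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
        expected = 0;   /* a failed exchange overwrote expected */
    }
    /* Barrier between acquiring the resource and accessing it. */
    atomic_thread_fence(memory_order_seq_cst);
}

static void barrier_lock_release(barrier_lock *l)
{
    /* Barrier before making the resource available again. */
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&l->locked, 0, memory_order_relaxed);
}
```

Accesses to the protected resource go between the two calls, so they cannot be observed before the lock is taken or after it is released.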

 

Before ARMv7, the data memory barrier was available as a CP15 operation; ARMv7 provides it as a dedicated instruction.

 

Use on multi-core systems

Synchronization operations built on Load-Exclusive and Store-Exclusive are written the same way for single-core and multi-core systems, but on multi-core systems you must be aware of some system-level implications.

 

Systems with coherency management

ARM MPCore multi-core processors contain a Snoop Control Unit (SCU), which maintains coherency between the processors' level 1 data caches for memory regions marked as shared. In such a system, each core's local monitor operates in conjunction with the SCU to provide a combined local and global monitor for synchronization operations in memory regions marked as coherent.

 

This can occasionally cause false negatives when multiple processors try to access synchronization variables within the same ERG block at the same time, as well as delays while data is transferred between caches. For performance reasons, it is useful to place frequently accessed synchronization variables apart in memory, separated by at least the ERG size.
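
One way to keep synchronization variables apart is to give each its own aligned slot. The sketch below pads to 2048 bytes, the largest ERG mentioned earlier, because portable code cannot know the real value; the macro and type names are hypothetical, and a real system would likely pick a smaller, platform-specific figure.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: pad to the largest ERG size mentioned in the text. */
#define ASSUMED_MAX_ERG_BYTES 2048

/* Hypothetical type: the alignment forces each lock into its own
   granule-sized, granule-aligned slot. */
struct padded_lock {
    _Alignas(ASSUMED_MAX_ERG_BYTES) atomic_int locked;
};

/* Static storage, so the large alignment does not depend on stack support. */
static struct padded_lock lock_table[2];
```

The cost is memory: each lock now occupies a full slot, which is why this is usually reserved for the few variables that are contended heavily.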

 

Systems without coherency management

Memory regions used for synchronization between processors or between cores must be marked as shared. When coherency management is unavailable or disabled, these regions cannot be cached, and a global monitor must be implemented to allow synchronization.

 

 

 

 


Origin blog.csdn.net/konga/article/details/103447512