PCIe Basics

PCIe bus overview

     With the development of modern processor technology, replacing parallel buses with high-speed differential buses has become the general trend in interconnect design. Compared with single-ended parallel signaling, high-speed differential signaling can run at much higher clock frequencies, so a bus built from far fewer signal lines can achieve the bandwidth that previously required many single-ended parallel data lines.

     The PCI bus uses a parallel bus structure in which all devices on the same bus share the bus bandwidth, while the PCIe bus uses high-speed differential signaling with an end-to-end connection, so each PCIe link connects exactly two devices. This gives PCIe a different topology from the PCI bus. Besides the different connection model, the PCIe bus also borrows techniques from network communication, such as multiple data routing methods, multi-lane data transfer, and message-based data transfer, and it takes QoS (Quality of Service) into full consideration during data transmission.

 

Basics of the PCIe bus

       Unlike the PCI bus, the PCIe bus uses an end-to-end connection: only one device can be attached to each end of a PCIe link, one acting as the data sender and the other as the data receiver. Beyond the physical link itself, the PCIe bus defines multiple layers that the sender traverses when transmitting data and the receiver traverses when receiving it. This layering is similar to a network protocol stack.

1.1 End-to-end data transfer

      The PCIe link uses the "end-to-end data transmission method". Both the transmitter and the receiver contain TX (transmit logic) and RX (receive logic). The structure is shown in Figure 41.


      As shown in the figure above, each data path (Lane) on the physical link of the PCIe bus carries two sets of differential signals, four signal lines in total. One differential pair connects the TX logic of one end to the RX logic of the other end; it forms the transmit link of the first device and, at the same time, the receive link of the second. The other differential pair runs in the opposite direction, connecting the RX logic of the first device to the TX logic of the second, forming the first device's receive link and the second device's transmit link. A PCIe link can consist of multiple lanes.

      The electrical specification for high-speed differential signaling requires a capacitor in series with the transmit end for AC coupling; this capacitor is known as an AC coupling capacitor. The PCIe link uses differential signals for data transmission. A differential signal consists of two signals, D+ and D-. The receiver compares the difference between these two signals to determine whether the sender is transmitting a logic "1" or a logic "0".

       Compared with single-ended signals, differential signals have much stronger noise immunity, because the two traces of a differential pair are routed with "equal length", "equal width" and "close spacing" on the same layer. External interference noise is therefore loaded onto D+ and D- with the "same value" at the "same time"; ideally the difference between the two remains unchanged, so the noise has little impact on the received logic value. This is why differential signaling can run at higher bus frequencies.

       In addition, differential signaling effectively suppresses EMI (Electromagnetic Interference). Since D+ and D- are routed very close together with equal amplitudes and opposite polarities, the electromagnetic fields they couple to the ground plane are equal in magnitude and cancel each other out, so a differential pair radiates little interference. Of course, the disadvantages of differential signaling are also obvious: it takes two signal lines to transmit one bit of data, and the routing constraints are relatively strict.
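The noise-cancellation argument above can be put into a tiny worked example (an illustrative sketch, not from the original text): common-mode noise coupled equally onto D+ and D- drops out when the receiver takes the difference.

```python
# Illustrative sketch: a differential receiver recovers the logic level
# from the difference of the two wires, so noise that lands equally on
# both wires (common-mode noise) cancels out.
def receive(d_plus: float, d_minus: float) -> int:
    """Recover the logic level from a differential pair."""
    return 1 if (d_plus - d_minus) > 0 else 0

# Transmit a logic "1": D+ = 1.0 V, D- = 0.0 V
noise = 0.8  # volts of interference, coupled onto BOTH wires equally
bit = receive(1.0 + noise, 0.0 + noise)
print(bit)  # the difference is still +1.0 V, so the bit is received as 1
```

A single-ended receiver comparing one wire against a fixed threshold would have been corrupted by the same 0.8 V of noise.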

       A PCIe link can be composed of multiple lanes. Currently, a PCIe link can support 1, 2, 4, 8, 12, 16 and 32 lanes, namely ×1, ×2, ×4, ×8, ×12, ×16 and ×32 wide PCIe link. The bus frequency used on each lane is related to the version of the PCIe bus used.

      The first PCIe bus specification was V1.0, followed by V1.0a, V1.1, V2.0 and V2.1. At present, the latest released specification of the PCIe bus is V2.1; V3.0 is under development and is expected to be released in 2010. Different PCIe specifications define different bus frequencies and link encoding methods, as shown in Table 41.

Table 41 Relationship between PCIe bus specification and bus frequency and encoding

| PCIe bus specification | Bus frequency[1] | Peak bandwidth of a single lane | Encoding |
| --- | --- | --- | --- |
| 1.x | 1.25 GHz | 2.5 GT/s | 8b/10b encoding |
| 2.x | 2.5 GHz | 5 GT/s | 8b/10b encoding |
| 3.0 | 4 GHz | 8 GT/s | 128b/130b encoding |

       As shown in the table above, different PCIe specifications use different bus frequencies and data encoding methods. The V1.x and V2.0 specifications use 8b/10b encoding at the physical layer, meaning every 10 bits on the PCIe link carry 8 bits of payload; the V3.0 specification uses 128b/130b encoding, meaning every 130 bits on the link carry 128 bits of payload.
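The effect of the encoding on per-lane effective bandwidth can be checked with a few lines of arithmetic (a sketch, not from the original text; figures follow Table 41):

```python
# Effective per-lane, per-direction bandwidth = raw transfer rate
# scaled by the encoding efficiency (payload bits / encoded bits).
def effective_gbps(raw_gt_s: float, payload_bits: int, total_bits: int) -> float:
    return raw_gt_s * payload_bits / total_bits

gen1 = effective_gbps(2.5, 8, 10)     # V1.x, 8b/10b   -> 2.0 Gbps
gen2 = effective_gbps(5.0, 8, 10)     # V2.x, 8b/10b   -> 4.0 Gbps
gen3 = effective_gbps(8.0, 128, 130)  # V3.0, 128b/130b -> ~7.88 Gbps
print(gen1, gen2, gen3)
```

This also shows why V3.0 roughly doubles V2.x's effective bandwidth despite only a 1.6× increase in raw rate: the encoding overhead drops from 20% to about 1.5%.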

       As shown in the above table, although the bus frequency used by V3.0 specification is only 4GHz, its effective bandwidth is twice that of V2.x. The following uses the V2.x specification as an example to illustrate the peak bandwidths provided by PCIe links of different widths, as shown in Table 42.

 Table 42 Peak bandwidth of PCIe bus

| Data bit width of PCIe bus | ×1 | ×2 | ×4 | ×8 | ×12 | ×16 | ×32 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Peak bandwidth (GT/s) | 5 | 10 | 20 | 40 | 60 | 80 | 160 |

        As shown in the above table, a ×32 PCIe link can provide a link bandwidth of 160GT/s, which is much higher than the peak bandwidth provided by the PCI/PCI-X bus. The upcoming PCIe V3.0 specification uses a 4GHz bus frequency, which will further increase the peak bandwidth of the PCIe link.

       The PCIe bus uses GT/s (Gigatransfers per second) to express the peak bandwidth of a PCIe link, i.e., the raw transfer rate of the link. The calculation formula is bus frequency × data bit width × 2 (data is transferred on both clock edges).
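The formula above can be sketched in code (values follow Tables 41 and 42; the ×2 factor is the double-data-rate transfer):

```python
# Raw link rate in GT/s = bus frequency (GHz) × 2 (double data rate)
# × link width in lanes.
def peak_gt_s(bus_freq_ghz: float, lanes: int) -> float:
    return bus_freq_ghz * 2 * lanes

# V2.x: 2.5 GHz bus frequency — reproduces the Table 42 row
for lanes in (1, 2, 4, 8, 12, 16, 32):
    print(f"x{lanes}: {peak_gt_s(2.5, lanes)} GT/s")
```

For V1.x the same formula gives 1.25 GHz × 2 = 2.5 GT/s per lane, matching Table 41.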

       In the PCIe bus, there are many factors that affect the effective bandwidth, so its effective bandwidth is difficult to calculate. Still, the effective bandwidth offered by the PCIe bus is much higher than that of the PCI bus. The PCIe bus also has its weaknesses, the most prominent of which is transfer latency.

       The PCIe link transfers data serially, but inside the chip the data buses are still parallel, so serial-to-parallel (and parallel-to-serial) conversion is required at the PCIe link interface, and this conversion introduces a noticeable delay. In addition, PCIe packets must pass through the transaction layer, the data link layer and the physical layer, and traversing these layers adds further delay.

      Among PCIe bus-based devices, ×1 PCIe links are the most common, while ×12 PCIe links are rare, and ×4 and ×8 PCIe devices are also rare. Intel usually integrates multiple ×1 PCIe links in the ICH to connect low-speed peripherals, and integrates a ×16 PCIe link in the MCH to connect to the graphics controller. And PowerPC processors are usually able to support ×8, ×4, ×2 and ×1 PCIe links.

       Data transmission over the PCIe physical link uses a clock-based synchronous mechanism, but there is no clock line on the physical link. Instead, the receiver contains a clock recovery module, the CDR (Clock Data Recovery), which extracts the receive clock from the incoming data stream and uses it to receive data synchronously.

      It is worth noting that, in addition to recovering a clock from the data stream, a PCIe device also uses the REFCLK+ and REFCLK- signal pair as its local reference clock. This signal pair is described below.


1.2 Signals used by the PCIe bus

        PCIe devices are powered by two power signals, Vcc and Vaux, which are rated at 3.3V. Among them, Vcc is the main power supply, the main logic modules used by PCIe devices are powered by Vcc, and some logic related to power management is powered by Vaux. In PCIe devices, some special registers are usually powered by Vaux, such as Sticky Register. At this time, even if the Vcc of the PCIe device is removed, the logic states related to power management and the contents of these special registers will not change.

       In the PCIe bus, the main reasons to use Vaux are to reduce power consumption and to shorten system resume time. Because Vaux is not removed in most cases, when the Vcc of a PCIe device is restored, the logic powered by Vaux does not need to be re-initialized, so the device can quickly return to its normal working state.

       The maximum width of a PCIe link is ×32, but in practice a ×32 link is rarely used. A processor system generally provides ×16 PCIe slots, using a total of 64 signal lines, PETp0~15, PETn0~15, PERp0~15 and PERn0~15, to form 32 differential pairs: the 16 PETxx pairs are used for the transmit link and the 16 PERxx pairs for the receive link. In addition, the PCIe bus uses the following auxiliary signals.

1 PERST# signal

        This signal is a global reset signal provided by the processor system, which uses it to reset PCIe slots and PCIe devices. A PCIe device uses this signal to reset its internal logic: when the signal is asserted, the device performs a reset operation. The PCIe bus defines several reset mechanisms; two of them, Cold Reset and Warm Reset, are implemented using this signal. See Section 1.5 for details.

2 REFCLK+ and REFCLK- signals

        In a processor system, there may be many PCIe devices, which can be connected to PCIe slots as Add-In cards, or as built-in modules, directly connected to the PCIe link provided by the processor system without going through PCIe slot. Both PCIe devices and PCIe slots have REFCLK+ and REFCLK- signals, where the PCIe slot uses this set of signals to synchronize with the processor system.

        In a processor system, dedicated logic typically provides the REFCLK+ and REFCLK- signals to the PCIe slots, as shown in Figure 42. A 100 MHz clock source is provided by a crystal oscillator and, through a one-to-many differential clock driver, fanned out into multiple in-phase clock outputs that are connected to the PCIe slots in one-to-one correspondence.

       A PCIe slot requires a reference clock of 100 MHz ± 300 ppm. The processor system must provide a reference clock for each PCIe slot, the MCH, the ICH and every Switch, and the trace length from the clock driver to each PCIe slot (or MCH, ICH and Switch) must be within 15 inches. Since signal propagation speed on a board is close to the speed of light in the medium, roughly 6 inches/ns, the transmission delay difference of the REFCLK+ and REFCLK- signals between different PCIe slots is at most about 2.5 ns.
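The 2.5 ns figure follows directly from the numbers given in the text (15-inch maximum trace length, ~6 inches/ns propagation speed; real board values vary):

```python
# Worst-case reference clock skew between slots: the longest allowed
# trace minus a zero-length trace, divided by the propagation speed.
max_trace_in = 15.0       # inches, per the text
speed_in_per_ns = 6.0     # inches/ns, per the text
max_skew_ns = max_trace_in / speed_in_per_ns
print(max_skew_ns)  # 2.5
```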

       When a PCIe device is connected to a PCIe slot as an Add-In card, the REFCLK+ and REFCLK- signals provided by the PCIe slot can be used directly, or an independent reference clock can be used, as long as the reference clock is within the range of 100MHz±300ppm. The built-in PCIe device is similar to how the Add-In card handles the REFCLK+ and REFCLK- signals, but the PCIe device can use a separate reference clock instead of the REFCLK+ and REFCLK- signals.

       In the Link Control Register of the PCIe device configuration space, there is a "Common Clock Configuration" bit. When this bit is 1, it means that the device and the peer device of the PCIe link use the "same-phase" reference clock; if it is 0, it means that the reference clock used by the device and the peer device of the PCIe link is asynchronous.       

       In a PCIe device, the default value of the "Common Clock Configuration" bit is 0. At this time, the reference clock used by the PCIe device has no connection with the peer device, and the reference clocks used by the devices at both ends of the PCIe link can be set asynchronously. This asynchronous clock setting method is especially important when using PCIe links for remote connections.
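As a sketch of how software might test this bit: in the PCIe capability structure, the Common Clock Configuration bit is bit 6 of the Link Control Register. The 16-bit register values below are made-up examples, not read from real hardware; actual code would read the device's configuration space (e.g., via sysfs on Linux).

```python
# Hedged sketch: decode the Common Clock Configuration bit (bit 6 of
# the Link Control Register) from a raw 16-bit register value.
COMMON_CLOCK_CONFIG = 1 << 6

def uses_common_clock(link_control: int) -> bool:
    """True if both ends of the link use a common (same-phase) reference clock."""
    return bool(link_control & COMMON_CLOCK_CONFIG)

print(uses_common_clock(0x0040))  # True  -> common reference clock
print(uses_common_clock(0x0000))  # False -> asynchronous reference clocks (the default)
```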

       In a processor system, if a PCIe link is used for chassis-to-chassis interconnection, the reference clocks at the two ends can be set asynchronously, so only the differential data signal lines need to run between the chassis and no reference clock needs to be distributed, which greatly reduces the difficulty of the connection.

3 WAKE# signal

        When a PCIe device has entered a sleep state and its main power has been removed, the device uses this signal to submit a wake-up request to the processor system, asking it to restore the main power Vcc. In the PCIe bus the WAKE# signal is optional, so the mechanism of waking a PCIe device with WAKE# is also optional. It is worth noting that the hardware logic generating this signal must be powered by the auxiliary supply Vaux.

        WAKE# is an open-drain signal. All PCIe devices in a processor system can wire-OR the WAKE# signal and route it to the power controller of the processor system. When a PCIe device needs to be woken up, it first asserts the WAKE# signal; after a delay, the processor system restores the main power Vcc to the device and resets it with the PERST# signal. During this time the WAKE# signal must be held low. Once the main power Vcc is up, the PERST# signal is deasserted, the reset completes, and the WAKE# signal is deasserted, ending the wake-up process.

        In addition to the WAKE# signal, a PCIe device can also use the Beacon mechanism to implement wake-up. Unlike WAKE#, Beacon is an in-band mechanism: it uses the differential signals D+ and D- themselves. The Beacon signal is DC-balanced and consists of a group of pulses generated on D+ and D-, with pulse widths between 2 ns and 16 μs. When a PCIe device is ready to exit the L2 state (a low-power state of the PCIe device), it can use the Beacon signal to submit a wake-up request.

4 SMCLK and SMDAT signals

       The SMCLK and SMDAT signals belong to the SMBus (System Management Bus) of the x86 processor system. SMBus was proposed by Intel in 1995 and consists of the SMCLK and SMDAT signals. SMBus is derived from the I2C bus, but differs from it in several ways.

      The maximum bus frequency of SMBus is 100 kHz, while the I2C bus supports bus frequencies of 400 kHz and 2 MHz. In addition, slave devices on SMBus have a timeout function: when a slave detects that the clock signal driven by the master has stayed low for more than 35 ms, it triggers a timeout reset of the slave. Under normal circumstances, the master therefore uses a minimum bus frequency of 10 kHz to avoid slave timeouts during normal operation.

      In SMBus, this timeout mechanism can be used when the master needs to reset a slave. The I2C bus can only achieve such a reset through hardware signals; on I2C, if a slave device misbehaves, the master alone cannot reset it.
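The timeout numbers above are self-consistent, which a quick calculation shows (a sketch using the figures from the text, and assuming a roughly 50% clock duty cycle):

```python
# At the recommended minimum master frequency of 10 kHz, the clock
# stays low for about half a bit period -- far below the 35 ms slave
# timeout, so normal traffic never triggers a timeout reset.
min_freq_hz = 10_000
timeout_ms = 35.0
worst_low_ms = (1 / min_freq_hz) / 2 * 1000  # half-period, in ms
print(worst_low_ms)  # 0.05 ms, with a 700x margin below the timeout
assert worst_low_ms < timeout_ms
```

To deliberately reset a slave, the master simply holds SMCLK low for longer than 35 ms.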

       SMBus also supports an Alert Response mechanism. When a slave device raises an interrupt, the interrupt is not cleared immediately; it remains pending until the master issues a command to the Alert Response Address 0b0001100.

       The differences between SMBus and I2C described above are limited to the physical and link layers. In fact, SMBus also includes a network layer, on which it defines 11 bus protocols for message transmission.

        SMBus has been widely popularized in the x86 processor system, and its main function is to manage the peripheral devices of the processor system and collect the operation information of the peripheral devices, especially some information related to intelligent power management. PCI and PCIe slots also reserve interfaces for SMBus so that PCI/PCIe devices can interact with the processor system.

        In Linux system, SMBus has been widely used, and ACPI also defines a series of commands for SMBus, which are used for communication between intelligent batteries, battery chargers and processor systems. In the Windows operating system, the description information about the external device is also obtained through SMBus.

5 JTAG signal

       JTAG (Joint Test Action Group) is an international standard test protocol, compatible with IEEE 1149.1, mainly used for internal chip testing. The vast majority of devices currently support the JTAG test standard. The JTAG signal consists of TRST#, TCK, TDI, TDO and TMS signals. Among them, TRST# is the reset signal; TCK is the clock signal; TDI and TDO correspond to the data input and data output respectively; and the TMS signal is the mode selection.

      JTAG allows multiple devices to be chained together through the JTAG interface to form a JTAG chain. Currently, FPGAs and EPLDs can use the JTAG interface for ISP (In-System Programming). Processors can also use the JTAG interface for system-level debugging, such as setting breakpoints and reading internal registers and memory. In addition, the JTAG interface can be used for "reverse engineering" to analyze the implementation details of a product, which is why it is generally not brought out in production products.

6 PRSNT1# and PRSNT2# signals

       The PRSNT1# and PRSNT2# signals are related to hot-plugging of PCIe devices. In the Add-in card based on the PCIe bus, the PRSNT1# and PRSNT2# signals are directly connected, while in the processor motherboard, the PRSNT1# signal is grounded, and the PRSNT2# signal is connected high through a pull-up resistor. Figure 43 shows the hot-plug structure of a PCIe device.

        As shown in the figure above, when no Add-In card is inserted, the PRSNT2# signal on the processor motherboard is pulled high by the pull-up resistor. When an Add-In card is inserted, its on-card connection ties the motherboard's PRSNT2# signal to the PRSNT1# signal, so PRSNT2# goes low. The hot-plug control logic on the motherboard captures this "low level", learns that an Add-In card has been inserted, and triggers the system software to handle the event.

        Removal of an Add-In card works similarly to insertion. While the Add-In card is connected to the motherboard, the motherboard's PRSNT2# signal is low; when the card is pulled out, PRSNT2# goes high. The hot-plug control logic captures this "high level", learns that the Add-In card has been removed, and triggers the system software to handle the event.
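The presence-detect behavior just described can be sketched as a few lines of logic (purely illustrative; a real hot-plug controller is hardware, and the event strings here are made up):

```python
# Sketch of PRSNT2#-based presence detection: with no card, the
# pull-up keeps PRSNT2# high; an inserted card shorts PRSNT1#
# (grounded on the motherboard) to PRSNT2#, pulling it low.
def prsnt2_level(card_inserted: bool) -> int:
    return 0 if card_inserted else 1  # 0 = low (present), 1 = high (absent)

def on_prsnt2_change(old: int, new: int) -> str:
    if old == 1 and new == 0:
        return "card inserted: notify hot-plug software"
    if old == 0 and new == 1:
        return "card removed: notify hot-plug software"
    return "no change"

print(on_prsnt2_change(prsnt2_level(False), prsnt2_level(True)))
```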

       Different processor systems handle hot-plugging of PCIe devices differently, and in a real processor system the implementation is far more complicated than the example in Figure 43. It is worth noting that, to implement the hot-swap function, the Add-In card needs a "long and short pin" (staggered gold-finger) structure.

       As shown in Figure 43, the gold fingers used by the PRSNT1# and PRSNT2# signals are half the length of the others. Therefore, when a PCIe device is inserted into the slot, the other gold fingers make contact with the slot first, and only after a delay do the PRSNT1# and PRSNT2# fingers make full contact; when the device is pulled out, these two signals disconnect from the slot first, and only after a delay do the other signals disconnect. The system software can use this delay for hot-plug processing.


1.3 Hierarchical structure of the PCIe bus

        The PCIe bus adopts a serial connection method and uses data packets (Packet) for data transmission. This structure effectively removes some sideband signals existing in the PCI bus, such as INTx and PME#. In the PCIe bus, data packets need to pass through multiple layers in the process of receiving and sending, including the transaction layer, the data link layer and the physical layer. The hierarchy of the PCIe bus is shown in Figure 44.

        The hierarchical composition of the PCIe bus is similar to a network stack, but every layer of the PCIe bus is implemented in hardware logic. In the PCIe architecture, a packet is first generated in the device core layer (Device Core), then passes through the device's Transaction Layer, Data Link Layer and Physical Layer, and is finally sent out. At the receiving end, the data passes through the physical layer, data link layer and transaction layer in turn, finally reaching the Device Core.

1 Transaction layer

        The transaction layer defines the bus transactions used by the PCIe bus, most of which are compatible with PCI bus transactions. These bus transactions can be transmitted through Switches and other devices to other PCIe devices or to the RC, and the RC can likewise use them to access PCIe devices.

       The transaction layer receives the data from the core layer of the PCIe device, encapsulates it as TLP (Transaction Layer Packet), and sends it to the data link layer. In addition, the transaction layer can also receive data packets from the data link layer and forward them to the core layer of the PCIe device.

        An important job of the transaction layer is to handle the "ordering" of the PCIe bus. In the PCIe bus, the concept of "order" is very important and difficult to understand. In the PCIe bus, the transaction layer can transmit packets out of order, which creates a lot of trouble for the design of PCIe devices. The transaction layer also uses a flow control mechanism to ensure the efficient use of the PCIe link. See Chapter 6 for a detailed description of the transaction layer.

2 Data link layer

        The data link layer ensures that packets from the transaction layer of the sender can be reliably and completely sent to the data link layer of the receiver. When packets from the transaction layer pass through the data link layer, a Sequence Number prefix and a CRC suffix will be added. The data link layer uses the ACK/NAK protocol to ensure reliable delivery of packets.

        The data link layer of the PCIe bus also defines several kinds of DLLPs (Data Link Layer Packets). DLLPs originate and terminate at the data link layer. It is worth noting that a TLP is not the same as a DLLP: a DLLP is not a TLP with a Sequence Number prefix and CRC suffix added.
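The framing described above (Sequence Number prefix plus CRC suffix around a TLP) can be sketched structurally. This is only an illustration of the packet layout: `zlib.crc32` stands in for the real LCRC, whose polynomial usage, initialization and bit ordering in PCIe differ, and the dummy TLP bytes are made up.

```python
import struct
import zlib

# Sketch of data-link-layer framing: a 2-byte prefix carrying the
# 12-bit sequence number, the TLP itself, and a 4-byte CRC suffix.
def frame_tlp(seq_num: int, tlp: bytes) -> bytes:
    prefix = struct.pack(">H", seq_num & 0x0FFF)              # 12-bit sequence number
    crc = struct.pack(">I", zlib.crc32(prefix + tlp) & 0xFFFFFFFF)
    return prefix + tlp + crc

framed = frame_tlp(1, b"\x00" * 16)  # a dummy 16-byte TLP
print(len(framed))  # 16 + 2 + 4 = 22 bytes before physical-layer encoding
```

The sequence number is what the ACK/NAK protocol references: the receiver ACKs or NAKs by sequence number, and the sender replays from its retry buffer on a NAK.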

3 Physical layer

        The physical layer is the bottom layer of the PCIe bus and connects PCIe devices together. The physical electrical characteristics of the PCIe bus determine that the PCIe link can only use an end-to-end connection. The physical layer of the PCIe bus provides a transmission medium for data communication between PCIe devices and a reliable physical environment for data transmission.

        The physical layer is the most important and most difficult component of the PCIe architecture. The physical layer of the PCIe bus defines the LTSSM (Link Training and Status State Machine) state machine, which is used by the PCIe link to manage the link state and perform link training, link recovery, and power management.

         The physical layer of the PCIe bus also defines some special "sequences". Some books call these physical-layer "sequences" PLPs (Physical Layer Packets). These sequences are used to synchronize the PCIe link and perform link management. It is worth noting that the process of sending a PLP differs from that of sending a TLP. The physical layer is almost invisible to system software, but it is still worthwhile for system programmers to understand how it works.


1.4 Extension of the data link

        PCIe links use an end-to-end data transfer method: the two ports at the ends of a link are completely equivalent, each connected to one device that both sends and receives, and each end of a PCIe link can connect only one device. Therefore, a PCIe link must be extended with a Switch before multiple devices can be connected. An example of link extension using a Switch is shown in Figure 45.

        In the PCIe bus, a Switch[2] is a special device consisting of 1 upstream port and 2~n downstream ports. The PCIe bus stipulates that the port of a Switch that connects directly or indirectly toward the RC[3] is the upstream port. In the PCIe topology the RC generally sits at the top, which is the origin of the name "upstream port". All other ports of a Switch are downstream ports; a downstream port generally connects to an EP, or to a next-level Switch to further extend the PCIe link. The PCIe link attached to an upstream port is called the upstream link, and the link attached to a downstream port is called the downstream link.

       Upstream link and downstream link are relative concepts. As shown in the figure above, the PCIe link between the Switch and EP2 is the upstream link for EP2 and the downstream link for the Switch.

       The Switch shown in the figure above contains 3 ports, one of which is the Upstream Port, and the other two are Downstream Ports. The upstream port is connected to the downstream port of the RC or other Switches, and the downstream port is connected to the upstream port of the EP or other Switches.

        In Switch, there are two concepts related to ports, namely Egress port and Ingress port. These two ports are related to the data flow through the Switch. The Egress port refers to the sending port, that is, the port used for data to leave the Switch; the Ingress port refers to the receiving port, that is, the port that data enters the Switch.

        Egress and ingress ports do not correspond to upstream and downstream ports; in a Switch, both upstream and downstream ports can act as either Egress or Ingress ports. As shown in Figure 45, when the RC writes the internal registers of EP3, the Switch's upstream port is the Ingress port and its downstream port is the Egress port; when EP3 performs a DMA write to main memory, the Switch's upstream port is the Egress port and its downstream port is the Ingress port.

        The PCIe bus also specifies a special Switch connection mode, the Crosslink connection mode. For a Switch that supports this mode, its upstream port can be connected to the upstream port of other Switches, and its downstream port can be connected to the downstream ports of other Switches.

        The main purpose of the CrossLink connection mode in the PCIe bus is to interconnect different processor systems, as shown in Figure 46. When the CrossLink mode is used, although physically an upstream/downstream port of one Switch is directly connected to an upstream/downstream port of another Switch, after the PCIe link is trained one port still acts as the upstream port and the other as the downstream port.

       Data exchange between processor system 1 and processor system 2 can be performed through the Crosslink. When the PCI-bus-domain address space or Requester ID accessed by one processor system does not belong to that system, the data is received by the Crosslink port and transmitted to the peer processor system. The P2P bridge at the Crosslink peer interface receives the data request from the other processor domain and converts it into a data request of its own processor domain.

       There are still shortcomings when using Crosslink to connect two processor systems with identical topologies. Assume the RCs of processor systems 1 and 2 in Figure 46 both use ID number 0 and both address main memory from 0x0000-0000. When processor 1 reads a segment of EP2's PCI bus space, EP2 uses ID routing to send the completion message to the PCI device with ID number 0; at this time it is the RC of processor 2, not the RC of processor 1, that receives the data from EP2. Because both RCs use ID number 0, EP2 cannot distinguish between them.

       From the above, the use of Crosslink cannot completely solve the interconnection problem of two processor systems, so some Switches support non-transparent bridge structure. This structure is similar to the implementation mechanism of the PCI bus non-transparent bridge, and this chapter will not describe it further.

       Using the non-transparent bridge only solves the problem of data path between two processors, but it is not convenient for the unified management of external devices by the NUMA structure. The ultimate solution to this problem for the PCIe bus is to use MR-IOV technology, which requires the Switch to have multiple upstream ports interconnected to different RCs. At present, PLX can provide Switch with multiple upstream ports, but some virtualization-related technologies involved in MR-IOV technology have not yet been implemented.

        Even if the MR-IOV technology can reasonably solve the data access between multiple processors and the configuration management of PCIe devices, using the PCIe bus for data transfer between two or more processor systems is still a big problem. Because the transmission delay of PCIe bus is still an important factor restricting its application in large-scale processor system interconnection.
