What exactly is InfiniBand?

Many people working in digital communications are certainly familiar with InfiniBand.

Since the beginning of the 21st century, the spread of cloud computing and big data has driven rapid growth in data centers, and InfiniBand is a key data-center technology whose role is extremely important.

Especially since the beginning of this year, the surge of large AI models represented by ChatGPT has drawn even more attention to InfiniBand, because the network used for GPT was built by Nvidia on top of InfiniBand.

So, what exactly is InfiniBand technology? Why is it so popular? What about the often-discussed "InfiniBand vs. Ethernet" debate?


In this article, Xiaozao Jun will answer these questions one by one.

█  History of InfiniBand

InfiniBand (abbreviated IB) is a highly capable communication protocol. Its name literally means "infinite bandwidth".

The story of InfiniBand's birth begins with computer architecture.

As everyone knows, modern digital computers have used the von Neumann architecture ever since their birth. In this architecture there is a CPU (arithmetic unit and controller), storage (memory and hard disk), and I/O (input/output) devices.

In the early 1990s, to support more and more peripheral devices, Intel took the lead in introducing the PCI (Peripheral Component Interconnect) bus into the standard PC architecture.


The PCI bus is, in essence, a data channel.

Soon afterwards, the Internet entered a period of rapid growth. The ever-growing scale of online business and user bases placed enormous demands on the carrying capacity of IT systems.

At the time, driven by Moore's Law, CPUs, memory, hard disks, and other components were being upgraded rapidly. The PCI bus, however, evolved slowly, severely limiting I/O performance and becoming the bottleneck of the entire system.

To solve this problem, Intel, Microsoft, and SUN led the development of the "Next Generation I/O (NGIO)" standard, while IBM, Compaq, and Hewlett-Packard led the development of "Future I/O (FIO)". These latter three companies also jointly created the PCI-X standard in 1998.

In 1999, the FIO Developers Forum and the NGIO Forum merged to form the InfiniBand Trade Association (IBTA).


Soon, in 2000, version 1.0 of the InfiniBand Architecture Specification was officially released.

Simply put, InfiniBand was born to replace the PCI bus. It introduced the RDMA protocol, offering lower latency, higher bandwidth, and greater reliability, and therefore more powerful I/O performance. (The technical details are covered later in this article.)

When it comes to InfiniBand, there is one company we must mention: the famous Mellanox.


Mellanox

In May 1999, a group of employees who had left Intel and Galileo Technologies founded a chip company in Israel and named it Mellanox.

After its founding, Mellanox joined the NGIO camp. When NGIO and FIO later merged, Mellanox moved into the InfiniBand camp. In 2001, it launched its first InfiniBand products.

In 2002, the InfiniBand camp underwent a drastic change.

That year, Intel "fled the battle" and decided to turn to the development of PCI Express (PCIe, launched in 2004). Another giant, Microsoft, also withdrew from InfiniBand development.

Although companies such as SUN and Hitachi chose to persevere, a shadow had already been cast over InfiniBand's development.

Beginning in 2003, InfiniBand turned to a new application area: computer cluster interconnect.


That year, Virginia Tech in the United States built a cluster based on InfiniBand that ranked third on the TOP500 (the world's top 500 supercomputers) list at the time.

In 2004, another important InfiniBand non-profit organization was born: OFA (the OpenFabrics Alliance).


OFA and IBTA work in cooperation: IBTA is mainly responsible for developing, maintaining, and enhancing the InfiniBand protocol standard, while OFA is responsible for developing and maintaining the InfiniBand software stack and upper-layer application APIs.

In 2005, InfiniBand found yet another new scenario: connecting storage devices.

Veteran network engineers will remember that InfiniBand and FC (Fibre Channel) were the fashionable SAN (Storage Area Network) technologies of that era. It was around this time that Xiaozao Jun first came into contact with InfiniBand.

From then on, InfiniBand gradually gained traction: its user base grew and its market share kept increasing.

By 2009, 181 systems on the TOP500 list used InfiniBand. (Gigabit Ethernet was still dominant at the time, with 259 systems.)

As InfiniBand rose, Mellanox grew with it and gradually became the leader of the InfiniBand market.


In 2010, Mellanox merged with Voltaire, leaving Mellanox and QLogic as the only major InfiniBand suppliers. Soon afterwards, in 2012, Intel acquired QLogic's InfiniBand technology and returned to the InfiniBand race.

After 2012, driven by growing demand for high-performance computing (HPC), InfiniBand technology continued to advance rapidly and its market share kept climbing.

In 2015, InfiniBand's share of the TOP500 list exceeded 50% for the first time, reaching 51.4% (257 systems).

This marked the first time InfiniBand overtook Ethernet, and InfiniBand became the preferred interconnect technology for supercomputers.

In 2013, Mellanox acquired Kotura, a silicon photonics company, and IPtronics, a maker of parallel optical interconnect chips, further strengthening its position across the industry chain. By 2015, Mellanox held 80% of the global InfiniBand market. Its business had gradually expanded from chips to network cards, switches/gateways, telecom systems, and cables and modules, making it a world-class networking provider.

Faced with InfiniBand's advance, Ethernet was not sitting still.

In April 2010, the IBTA released RoCE (RDMA over Converged Ethernet), which "ported" InfiniBand's RDMA technology onto Ethernet. In 2014, it followed up with the more mature RoCE v2.

With RoCE v2, Ethernet greatly narrowed the performance gap with InfiniBand and, combined with its inherent cost and compatibility advantages, began to fight back.

The chart below shows the share of interconnect technologies on the TOP500 list from 2007 to 2021.

[Figure: interconnect technology share of the TOP500 list, 2007-2021]

As the chart shows, from 2015 onward, 25G and faster Ethernet (the dark green line) rose quickly to become the industry's new favorite, suppressing InfiniBand for a time.

In 2019, Nvidia spent US$6.9 billion to acquire Mellanox, outbidding rivals Intel and Microsoft (who offered US$6 billion and US$5.5 billion, respectively).


Explaining the reasons for the acquisition, Nvidia CEO Jensen Huang (Huang Renxun) said:

"This is the combination of two of the world's leading high-performance computing companies, we focus on accelerated computing (accelerated computing), and Mellanox focuses on interconnect and storage."

In hindsight, Huang's decision has proved remarkably far-sighted.


As we can now see, with the rise of large AIGC models, society's demand for high-performance computing and AI computing has exploded.


Supporting such enormous demand for computing power requires high-performance computing clusters, and in terms of performance, InfiniBand is the top choice for interconnecting them.

By combining its own GPU computing advantages with Mellanox's networking advantages, Nvidia has effectively built a powerful "computing engine". In computing infrastructure, Nvidia has undoubtedly taken the lead.

Today, the competition in high-performance networking is a contest between InfiniBand and high-speed Ethernet, and the two sides are evenly matched. Vendors with deep pockets tend to choose InfiniBand, while those chasing cost-effectiveness lean toward high-speed Ethernet.

A few other technologies remain, such as IBM's BlueGene, Cray's interconnects, and Intel's OmniPath, but they basically belong to the second tier.

█  Technical principles of InfiniBand

Having covered InfiniBand's history, let's look at how it works: why is it stronger than traditional Ethernet, and how does it achieve its low latency and high performance?

  • The founding skill: RDMA

As mentioned earlier, one of InfiniBand's most prominent advantages is that it was the first to adopt the RDMA (Remote Direct Memory Access) protocol.

In traditional TCP/IP, data arriving at the network card is first copied into kernel memory and then copied again into the application's memory space; in the other direction, data is copied from application space into kernel memory and then sent out to the network through the network card.

This I/O model has to pass through kernel memory, which lengthens the data path, adds to the CPU's burden, and increases transmission latency.

[Figure: traditional TCP/IP mode vs. RDMA mode]

RDMA is, in effect, a technology that "cuts out the middleman".

RDMA's kernel-bypass mechanism allows applications and the network card to read and write data directly, reducing data transfer latency inside the server to close to 1 µs.

At the same time, RDMA's memory zero-copy mechanism allows the receiver to read data directly from the sender's memory, bypassing kernel memory entirely, which greatly reduces the CPU's burden and improves CPU efficiency.

As mentioned above, RDMA has contributed a lot to the rapid rise of InfiniBand.
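To make the contrast concrete, here is a minimal, purely illustrative Python sketch (not real RDMA code; the function names and buffers are invented for illustration) that models the two receive paths as buffer copies: the traditional path goes network card → kernel memory → application memory, while the RDMA path places data directly into application memory.

```python
# Toy model of the two receive paths described above (illustration only).

def traditional_receive(nic_buffer: bytes):
    """NIC -> kernel memory -> application memory: two copies, CPU involved."""
    kernel_buffer = bytearray(nic_buffer)   # copy 1: NIC buffer into kernel memory
    app_buffer = bytes(kernel_buffer)       # copy 2: kernel memory into application space
    return app_buffer, 2                    # payload plus number of intermediate copies

def rdma_receive(nic_buffer: bytes):
    """HCA writes straight into registered application memory: zero extra copies."""
    app_buffer = nic_buffer                 # direct placement, kernel not involved
    return app_buffer, 0

if __name__ == "__main__":
    payload = b"hello infiniband"
    for name, fn in (("TCP/IP path", traditional_receive), ("RDMA path", rdma_receive)):
        data, copies = fn(payload)
        print(f"{name}: {copies} intermediate copies, payload intact: {data == payload}")
```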

  • InfiniBand network architecture

The network topology of InfiniBand is shown in the figure below:

[Figure: InfiniBand network topology]

InfiniBand is a channel-based architecture, and its building blocks fall into four main categories:

· HCA (Host Channel Adapter)

· TCA (Target Channel Adapter)

· InfiniBand links (the connecting channels, which can be copper cable, optical fiber, or on-board links)

· InfiniBand switches and routers (for networking)

Channel adapters are what build InfiniBand channels. All transmissions begin or end at a channel adapter, in order to ensure security or to operate at a given QoS (Quality of Service) level.

A system built on InfiniBand can consist of multiple subnets, and each subnet can contain up to just over 60,000 nodes. Within a subnet, InfiniBand switches handle Layer 2 processing; between subnets, routers or bridges provide the connection.

[Figure: InfiniBand networking example]

InfiniBand's Layer 2 processing is very simple. Each InfiniBand subnet has a subnet manager, which assigns a 16-bit LID (Local Identifier) to every node. An InfiniBand switch contains multiple ports and forwards each packet from one port to another based on the LID carried in the Layer 2 Local Route Header (LRH). Apart from management packets, the switch neither consumes nor generates packets.

Thanks to this simple processing and its Cut-Through switching, InfiniBand cuts forwarding latency to under 100 ns, significantly faster than traditional Ethernet switches.

In an InfiniBand network, data is transmitted serially in packets of up to 4 KB.
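As a purely conceptual sketch (the LIDs, ports, and table contents below are made up for illustration), the LID-based Layer 2 forwarding described above can be modeled as a simple lookup table that the subnet manager programs into each switch:

```python
# Conceptual sketch of LID-based forwarding in an InfiniBand subnet
# (illustrative values only, not a real switch implementation).

LID_BITS = 16  # the subnet manager assigns 16-bit Local Identifiers

# One switch's forwarding table: destination LID -> output port.
forwarding_table = {
    0x0001: 1,   # HCA of server A reachable via port 1
    0x0002: 2,   # HCA of server B reachable via port 2
    0x0003: 7,   # next switch (towards server C) via port 7
}

def forward(dlid: int) -> int:
    """Pick the output port using only the destination LID carried in the LRH."""
    assert 0 <= dlid < 2 ** LID_BITS, "LID must fit in 16 bits"
    return forwarding_table[dlid]

print(forward(0x0002))  # -> 2; the switch neither consumes nor modifies the packet
```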

  • InfiniBand protocol stack

The InfiniBand protocol also adopts a layered design: the layers are independent of one another, with each lower layer providing services to the layer above, as shown below:

[Figure: InfiniBand protocol stack]

The physical layer defines how bits are formed into symbols on the line, and then into frames, data symbols, and the fill between packets; it also specifies the signaling protocol for constructing valid packets.

The link layer defines the packet format and the protocols for packet operations, such as flow control, route selection, encoding, and decoding.

The network layer selects routes by adding a 40-byte Global Route Header (GRH) to the packet and forwards the data accordingly.

During forwarding, routers perform only a variant CRC check, which preserves end-to-end data integrity.

[Figure: InfiniBand packet encapsulation format]

The transport layer then delivers the packet to a specified Queue Pair (QP) and tells the QP how to process it.

As you can see, InfiniBand defines its own formats for layers 1 through 4 and is a complete network protocol in its own right. End-to-end flow control is the foundation of packet transmission and reception in an InfiniBand network, and it is what makes a lossless network possible.
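As a rough back-of-the-envelope sketch: the 40-byte GRH and the 4 KB maximum payload come from the text above, while the LRH, BTH, and CRC sizes are typical values from the InfiniBand specification, used here as assumptions. With them, the per-packet header overhead can be estimated as follows:

```python
# Back-of-the-envelope estimate of InfiniBand encapsulation overhead for a
# full-size packet. Only GRH (40 B) and the 4 KB payload come from the article;
# the other field sizes are assumed typical values.

LRH  = 8     # Local Route Header (assumed)
GRH  = 40    # Global Route Header (from the text; used for routing between subnets)
BTH  = 12    # Base Transport Header (assumed)
ICRC = 4     # invariant CRC, checked end to end (assumed)
VCRC = 2     # variant CRC, recomputed hop by hop (assumed)

payload = 4096                          # maximum payload per packet (4 KB)
overhead = LRH + GRH + BTH + ICRC + VCRC
efficiency = payload / (payload + overhead)

print(f"overhead per packet : {overhead} bytes")   # 66 bytes
print(f"payload efficiency  : {efficiency:.1%}")   # ~98.4% at a 4 KB payload
```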

Since we have mentioned the QP (Queue Pair), a few more words are in order: it is the basic unit of communication in RDMA.

A queue pair consists of two queues, the SQ (Send Queue) and the RQ (Receive Queue). When a user calls the API to send or receive data, what actually happens is that work requests are placed into the QP, and the requests in the QP are then processed one by one in a polling fashion.
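Here is a minimal, purely illustrative Python sketch of the QP model just described (the class and method names are invented for illustration and are not the real verbs API): work requests are posted to the send and receive queues and then drained by polling.

```python
# Toy model of an RDMA queue pair (illustration only, not the verbs API).
from collections import deque

class QueuePair:
    def __init__(self):
        self.sq = deque()   # Send Queue: outgoing work requests
        self.rq = deque()   # Receive Queue: posted receive buffers
        self.cq = deque()   # Completion Queue: finished work requests

    def post_send(self, data: bytes):
        self.sq.append(("SEND", data))        # application posts a send work request

    def post_recv(self, buffer_id: int):
        self.rq.append(("RECV", buffer_id))   # buffer waits here until a remote SEND arrives

    def poll(self):
        """Drain pending send work requests one by one (polling), as described above."""
        while self.sq:
            wr = self.sq.popleft()
            self.cq.append(("COMPLETED", wr)) # on real hardware the HCA produces completions
        return list(self.cq)

qp = QueuePair()
qp.post_recv(buffer_id=0)
qp.post_send(b"payload")
print(qp.poll())   # -> [('COMPLETED', ('SEND', b'payload'))]
```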


  • InfiniBand link rate

InfiniBand links can use copper cables or optical cables. For different connection scenarios, dedicated InfiniBand cables are required.

InfiniBand defines multiple link speeds at the physical layer, such as 1X, 4X, and 12X. Each individual link is a four-wire serial differential connection (two wires in each direction).


Taking the early SDR (Single Data Rate) specification as an example, the raw signaling rate of a 1X link is 2.5 Gbps, a 4X link is 10 Gbps, and a 12X link is 30 Gbps.


Because of 8b/10b encoding, the actual data bandwidth of a 1X link is 2.0 Gbps. And since the link is bidirectional, the total bandwidth relative to the bus is 4 Gbps.
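The arithmetic behind those numbers follows directly from the 8b/10b encoding and the lane counts; a short sketch:

```python
# SDR link bandwidth arithmetic: with 8b/10b encoding, only 8 of every 10
# transmitted bits carry data.

SDR_LANE_RATE_GBPS = 2.5        # raw signaling rate per lane (1X)
ENCODING_EFFICIENCY = 8 / 10    # 8b/10b line coding

for lanes in (1, 4, 12):        # 1X, 4X, 12X link widths
    raw = SDR_LANE_RATE_GBPS * lanes
    data = raw * ENCODING_EFFICIENCY
    print(f"{lanes:>2}X: raw {raw:>4.1f} Gbps, data {data:>4.1f} Gbps per direction, "
          f"{2 * data:>4.1f} Gbps bidirectional")
# 1X: raw 2.5 Gbps, data 2.0 Gbps per direction, 4.0 Gbps bidirectional
```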

Over time, InfiniBand's network bandwidth has been continuously upgraded, from the early SDR, DDR, QDR, FDR, EDR, HDR, all the way to NDR, XDR, and GDR. As shown below:

[Figure: InfiniBand speed generations, from SDR to GDR]

Nvidia's latest Quantum-2 platform uses NDR 400G.

[Figure: specific rates and encoding methods]

  • InfiniBand commercial products

Finally, let's take a look at the InfiniBand commercial products on the market.

After NVIDIA acquired Mellanox, it launched its own seventh-generation NVIDIA InfiniBand architecture, NVIDIA Quantum-2, in 2021.

The NVIDIA Quantum-2 platform includes: NVIDIA Quantum-2 series switches, NVIDIA ConnectX-7 InfiniBand adapter, BlueField-3 InfiniBand DPU, and related software.


NVIDIA Quantum-2 series switches use a compact 1U design and come in air-cooled and liquid-cooled versions. The switch chip is built on a 7 nm process and packs 57 billion transistors on a single die (more than the A100 GPU). With a flexible configuration of 64 ports at 400 Gbps or 128 ports at 200 Gbps, it delivers a total bidirectional throughput of 51.2 Tbps.
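A quick sanity check of that throughput figure, using only the port counts and rates quoted above:

```python
# Aggregate throughput of the two Quantum-2 port configurations quoted above.
for ports, rate_gbps in ((64, 400), (128, 200)):
    one_way_tbps = ports * rate_gbps / 1000
    print(f"{ports} x {rate_gbps} Gbps = {one_way_tbps:.1f} Tbps per direction, "
          f"{2 * one_way_tbps:.1f} Tbps bidirectional")
# Both configurations yield 25.6 Tbps per direction, i.e. 51.2 Tbps bidirectional.
```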

NVIDIA ConnectX-7 InfiniBand adapters support PCIe Gen4 and Gen5, come in multiple form factors, and provide one or two 400 Gbps network ports.

█  Epilogue

According to industry forecasts, the InfiniBand market will reach US$98.37 billion by 2029, a 14.7-fold increase over the US$6.66 billion of 2021, with a compound annual growth rate of 40% over the forecast period (2021-2029).

Driven strongly by high-performance computing and AI computing, InfiniBand's prospects are exciting.


As for who will have the last laugh, InfiniBand or Ethernet, time will tell!


Origin blog.csdn.net/qq_38987057/article/details/132033319