SparkRDMA: Use RDMA technology to improve Spark Shuffle performance

In the MapReduce framework, Shuffle is the bridge between Map and Reduce: Reduce must go through the Shuffle stage to read the output of Map. Since the Reduce and Map tasks usually do not run on the same node, the Shuffle stage typically involves reads and writes across the network, plus some disk I/O, so the performance of Shuffle directly affects the performance and throughput of the entire program.

Like the MapReduce computing framework, Spark jobs also have a Shuffle stage. A Spark job is usually divided into stages at Shuffle boundaries, and data exchange between stages is completed through Shuffle. The whole process is illustrated below:

[Figure: how a Spark job is split into stages at Shuffle boundaries]
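To make the stage boundary concrete, here is a minimal word-count sketch (my illustration, not the original article's; the HDFS paths are placeholders). Every wide dependency such as reduceByKey forces a Shuffle and therefore a new stage:

import org.apache.spark.{SparkConf, SparkContext}

object StageBoundaryExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stage-boundary"))

    // Stage 1: narrow transformations (flatMap, map) stay in a single stage.
    val pairs = sc.textFile("hdfs:///input/words.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))

    // reduceByKey introduces a wide dependency: Spark ends Stage 1 here,
    // writes the map outputs as Shuffle files, and Stage 2 reads them
    // (usually across the network) to produce the final counts.
    val counts = pairs.reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///output/word-counts")
    sc.stop()
  }
}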
From the above brief introduction, the following conclusions can be drawn:

  • Regardless of whether it is a MapReduce or Spark job, the Shuffle operation consumes resources. The resources here include: CPU, RAM, disk and network;
  • We should avoid Shuffle operations, or at least minimize the amount of shuffled data, wherever possible — for example, as sketched below.
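A quick illustration (mine, not the original author's): both snippets below compute per-key sums over an RDD of (String, Int) pairs, but reduceByKey pre-aggregates on the map side before the Shuffle, so far less data crosses the network than with groupByKey:

// Assume pairs: org.apache.spark.rdd.RDD[(String, Int)]

// groupByKey ships every single (key, value) pair across the network,
// then sums on the reduce side:
val slowSums = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values locally within each partition first
// (map-side combine), so only one partial sum per key per partition
// is shuffled:
val fastSums = pairs.reduceByKey(_ + _)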
Currently, the latest Spark ships with only one built-in Shuffle implementation: org.apache.spark.shuffle.sort.SortShuffleManager, selected through the parameter spark.shuffle.manager. This is the standard Spark Shuffle implementation, and internally it relies on the Netty framework. This article does not go into how Shuffle is implemented inside Spark; instead, it introduces the community's improvements to Shuffle.

RDMA technology

Before going any further, let's first cover some basic background.

Traditional TCP socket data transmission requires many steps: data is first copied from the source application into the host's socket buffer, then into the transport protocol driver, then to the NIC driver, and finally the NIC sends the data over the network to the target host's NIC. The target host then passes the data up through the same layers to its application. The whole process looks like this:
[Figure: traditional TCP/IP transmission path, with data copied at each layer]
As the picture shows, a large part of network transmission time is spent copying data. If the data to be transferred is large, this stage can easily account for a large share of the whole job's running time. So, is there a way to skip these layer-by-layer copies and let the target host fetch data directly from the source host's memory? There is: that is exactly what RDMA technology does.

RDMA (Remote Direct Memory Access) is a direct memory-access technology that transfers data straight from the memory of one computer into the memory of another, without involving either side's operating system. This allows high-throughput, low-latency network communication, and is especially suitable for massively parallel computer clusters (this paragraph is excerpted from Wikipedia: Remote Direct Memory Access). RDMA has the following characteristics:

  • Zero-copy
  • Direct hardware interface, bypassing the kernel and the TCP/IP stack
  • Sub-microsecond latency
  • Flow control and reliability are offloaded to hardware

With RDMA, data transmission looks like the following:

[Figure: RDMA transmission path, reading directly from remote memory and bypassing the kernel]
As the figure shows, with RDMA the source and target hosts exchange data directly from each other's memory even though they are across the network, which clearly speeds up the whole process.

SparkRDMA

OK, with the basics covered, we can get to the topic of this article. SparkRDMA ShuffleManager (GitHub: https://github.com/Mellanox/SparkRDMA), developed and open-sourced by Mellanox Technologies, uses RDMA technology so that Spark jobs shuffle data over RDMA instead of standard TCP. The official SparkRDMA wiki introduces it as follows:


SparkRDMA is a high-performance, scalable and efficient ShuffleManager plugin for Apache Spark. It utilizes RDMA (Remote Direct Memory Access) technology to reduce CPU cycles needed for Shuffle data transfers. It reduces memory usage by reusing memory for transfers instead of copying data multiple times down the traditional TCP-stack.

As described, SparkRDMA plugs into Spark's ShuffleManager interface and uses RDMA underneath. Its test results show that shuffling data over RDMA is 2.18 times faster than the standard implementation!
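For context, SparkRDMA works by implementing Spark's pluggable ShuffleManager interface. Roughly — this is a simplified rendering of the Spark 2.x trait, not SparkRDMA's actual code — a plugin has to provide the following:

// Simplified from org.apache.spark.shuffle.ShuffleManager (Spark 2.x).
trait ShuffleManager {
  // Driver side: register a shuffle and get a handle to pass to tasks.
  def registerShuffle[K, V, C](shuffleId: Int, numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle

  // Map tasks obtain a writer for their output.
  def getWriter[K, V](handle: ShuffleHandle, mapId: Int,
      context: TaskContext): ShuffleWriter[K, V]

  // Reduce tasks obtain a reader for a range of map outputs; this is
  // where an RDMA-based implementation can fetch remote blocks directly
  // from the map side's memory instead of over TCP.
  def getReader[K, C](handle: ShuffleHandle, startPartition: Int,
      endPartition: Int, context: TaskContext): ShuffleReader[K, C]

  def unregisterShuffle(shuffleId: Int): Boolean
  def shuffleBlockResolver: ShuffleBlockResolver
  def stop(): Unit
}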

[Figure: SparkRDMA vs. standard Shuffle performance comparison]
SparkRDMA's developers have submitted an SPIP to the Spark community: [SPARK-22229] SPIP: RDMA Accelerated Shuffle Engine, along with a detailed design document. Judging from the community's response, however, it will not be integrated into the Spark codebase any time soon.

Install and use

To use SparkRDMA, you need Apache Spark 2.0.0, 2.1.0, or 2.2.0, Java 8, and a network that supports RDMA (such as RoCE or InfiniBand).

The SparkRDMA project provides pre-compiled jar packages for each supported Spark version, which can be downloaded from the project's GitHub releases. After decompressing the archive, you get the following four files:

  • spark-rdma-1.0-for-spark-2.0.0-jar-with-dependencies.jar
  • spark-rdma-1.0-for-spark-2.1.0-jar-with-dependencies.jar
  • spark-rdma-1.0-for-spark-2.2.0-jar-with-dependencies.jar
  • libdisni.so
In addition, libdisni.so must be installed on every node of the Spark cluster; of the jar packages, you only need the one matching your Spark version. Once the files are deployed, add the SparkRDMA module to Spark's runtime environment with the following settings:

spark.driver.extraClassPath   /path/to/SparkRDMA/spark-rdma-1.0-for-spark-2.0.0-jar-with-dependencies.jar
spark.executor.extraClassPath /path/to/SparkRDMA/spark-rdma-1.0-for-spark-2.0.0-jar-with-dependencies.jar

To enable the SparkRDMA shuffle manager plugin, we also need to change the value of spark.shuffle.manager by adding the following configuration to $SPARK_HOME/conf/spark-defaults.conf:

spark.shuffle.manager org.apache.spark.shuffle.rdma.RdmaShuffleManager
Everything else works the same as running Spark normally.
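Alternatively (a sketch of standard spark-submit usage; the paths, class name, and jar are placeholders), the same settings can be passed on the command line for a single job:

spark-submit \
  --conf spark.driver.extraClassPath=/path/to/SparkRDMA/spark-rdma-1.0-for-spark-2.2.0-jar-with-dependencies.jar \
  --conf spark.executor.extraClassPath=/path/to/SparkRDMA/spark-rdma-1.0-for-spark-2.2.0-jar-with-dependencies.jar \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.rdma.RdmaShuffleManager \
  --class com.example.MyApp \
  my-app.jar

Remember that libdisni.so still has to be present on every node, in a directory the JVM searches for native libraries (java.library.path).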
