Design and implementation of broadcast variables in Spark

How Spark broadcasts

Spark has historically used two broadcast methods:

One transmits data over the HTTP protocol;

The other transmits data over a BitTorrent-like protocol.

In the latest Spark version, however, the HTTP method has been dropped (see the PR at https://github.com/apache/spark/pull/10531). Spark switched to TorrentBroadcast as the default in version 1.1, and HttpBroadcast has not been updated since then. According to the related documents, HttpBroadcast is to be removed completely in Spark 2.0, leaving TorrentBroadcast as the only implementation of broadcast variables. The choice is not hard-coded, though, and extensibility is retained: BroadcastFactory is a trait and TorrentBroadcastFactory is just one implementation of it, which follows the dependency inversion principle (depend on abstractions, not on concrete implementations). Another implementation could be added easily, but I guess none will appear for a while. Since HttpBroadcast is obsolete, we only discuss TorrentBroadcast here. See the picture below.
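The dependency inversion point can be sketched in a few lines. This is a small Python model, not Spark's Scala code: the names BroadcastFactory and TorrentBroadcastFactory mirror the ones mentioned above, while new_broadcast, ToyBroadcast and the BroadcastManager constructor are assumptions made up for illustration.

```python
from abc import ABC, abstractmethod


class BroadcastFactory(ABC):
    """Abstraction the manager depends on (stands in for the Scala trait)."""

    @abstractmethod
    def new_broadcast(self, value, broadcast_id):
        """Create a broadcast handle for `value` (hypothetical signature)."""


class ToyBroadcast:
    """Hypothetical handle; it only records which factory produced it."""

    def __init__(self, value, broadcast_id, kind):
        self.value = value
        self.id = broadcast_id
        self.kind = kind


class TorrentBroadcastFactory(BroadcastFactory):
    def new_broadcast(self, value, broadcast_id):
        return ToyBroadcast(value, broadcast_id, kind="torrent")


class BroadcastManager:
    """Knows only the abstract factory, so implementations can be swapped."""

    def __init__(self, factory: BroadcastFactory):
        self._factory = factory
        self._next_id = 0

    def new_broadcast(self, value):
        handle = self._factory.new_broadcast(value, self._next_id)
        self._next_id += 1
        return handle


manager = BroadcastManager(TorrentBroadcastFactory())
first = manager.new_broadcast([1, 2, 3])
second = manager.new_broadcast("abc")
```

Plugging in a different implementation would mean writing one more subclass of the abstract factory; the manager itself would not change.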

[Figure: Gami Valley Big Data.png]

You can see that different data blocks come from different nodes, and the nodes together form a network. While you download you are also uploading, so as you benefit from the downloads others provide you are contributing too, and in the end everyone benefits.

Let's take a look at the BitTorrent protocol, as defined on Wikipedia:

The BitTorrent protocol (BT for short, commonly known as bit torrent or BT download) is a network protocol used for file sharing in peer-to-peer networks. Unlike a point-to-point transfer, it works from user group to user group (peer-to-peer): the more users download the same file, the faster that file downloads. A client that keeps uploading after it has finished downloading "shares" the file and becomes a seed for other client nodes, so uploading and downloading happen at the same time.

Key points

1. To download a file, the downloader first obtains the corresponding torrent file and then downloads with BT client software.

2. The publisher divides the file into virtual blocks of equal size and writes the index information and hash check code of each block into the torrent (seed) file.

3. A Tracker maintains the meta information, and through the Tracker every client can find the other downloaders closest to it.

4. When downloading, the BT client first parses the torrent file to get the Tracker address and then connects to the Tracker server. The Tracker responds to the request with the IPs of the other downloaders (including the publisher). The downloader then connects to those peers. Guided by the torrent file, the two sides tell each other which blocks they already have and exchange the data the other side lacks. No other server takes part at this stage, and the traffic that would otherwise load a single line is spread out, which reduces the burden on the server.

5. Each time the downloader gets a block, it computes the block's hash and compares it with the check code in the torrent file. If they match, the block is correct; if they differ, the block must be downloaded again. This rule ensures the correctness of the downloaded content.
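Points 2 and 5 can be sketched as follows. This is a simplified Python illustration rather than a BitTorrent client: real torrents store SHA-1 digests of much larger pieces inside a structured metadata file, and the function names make_torrent and verify_piece are invented here.

```python
import hashlib

PIECE_SIZE = 4  # tiny on purpose; real torrents use much larger pieces


def make_torrent(data: bytes, piece_size: int = PIECE_SIZE):
    """Return the per-piece SHA-1 digests, indexed by piece number."""
    pieces = [data[i:i + piece_size] for i in range(0, len(data), piece_size)]
    return [hashlib.sha1(p).hexdigest() for p in pieces]


def verify_piece(index: int, piece: bytes, torrent) -> bool:
    """True if the downloaded piece matches the digest in the torrent."""
    return hashlib.sha1(piece).hexdigest() == torrent[index]


data = b"hello bittorrent"
torrent = make_torrent(data)
good = verify_piece(0, b"hell", torrent)  # intact piece 0
bad = verify_piece(0, b"hexl", torrent)   # corrupted piece 0
```

A piece that fails the check would simply be requested again, possibly from a different peer.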

For each of the points above, how does Spark do it?

Under the hood, TorrentBroadcast uses BlockManager. To download a data block, it first asks the master for the location of that block.

When a large variable is written as a broadcast variable, the input data is split into many small pieces through ChunkedByteBufferOutputStream. With zipWithIndex, each piece gets a unique identifier of the form broadcast_broadcastId_pieceId, which serves as its BlockId when the piece is stored in BlockManager. A checksum is also computed for each piece.
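The write path can be modeled roughly like this. It is a Python sketch under stated assumptions, not Spark's implementation: a plain dict stands in for BlockManager, Adler-32 is assumed as the checksum, the exact id format is only modeled on the description above, and blockify is a made-up name.

```python
import zlib


def blockify(data: bytes, broadcast_id: int, block_size: int):
    """Split `data` into pieces and key them by a broadcast-style block id."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    store = {}      # stands in for BlockManager: block id -> bytes
    checksums = []  # one checksum per piece, in piece order
    for piece_id, chunk in enumerate(chunks):  # enumerate plays zipWithIndex
        store[f"broadcast_{broadcast_id}_piece{piece_id}"] = chunk
        checksums.append(zlib.adler32(chunk))  # assumed checksum function
    return store, checksums


store, checksums = blockify(b"0123456789", broadcast_id=7, block_size=4)
```

With a 10-byte input and a block size of 4, the last piece is shorter than the others, which is fine because pieces are only ever concatenated in index order.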

Acting as the tracker, BlockManagerMaster maintains the meta information of all blocks and knows the executor and storage level of each data block. The broadcast variable keeps the BlockIds of all its own pieces. When the broadcast variable is read through the value method, those BlockIds are taken out, and for each BlockId the set of locations is obtained from BlockManagerMaster. The location set is shuffled, addresses on the same host are tried first (so the transfer can go over the loopback interface), and then the remaining addresses are taken from the shuffled set in order and tried one by one. Because the addresses are randomized, executors do not all fetch data from the driver, which spreads the pressure off the driver.
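The location ordering described above might look like this in a minimal Python sketch. The function name order_locations and the (host, port) tuples are assumptions for illustration; Spark's own logic lives in its Scala BlockManager code and may differ in detail.

```python
import random


def order_locations(locations, local_host, rng=random):
    """Shuffle candidate locations, then move same-host ones to the front."""
    shuffled = list(locations)
    rng.shuffle(shuffled)
    same_host = [loc for loc in shuffled if loc[0] == local_host]
    others = [loc for loc in shuffled if loc[0] != local_host]
    return same_host + others


locations = [("driver", 7000), ("host-a", 7001), ("host-b", 7002), ("host-a", 7003)]
ordered = order_locations(locations, "host-a", random.Random(42))
```

A reader would then try the entries of ordered one by one until a fetch succeeds. Since the driver is just one entry in a shuffled list, it is no longer the first choice for every executor.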

After a piece is fetched, its checksum is used to check whether the piece is damaged. If it is intact, the pieces are put together in order.
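The read side can be sketched the same way. This is a Python model with assumptions: Adler-32 stands in for the checksum, fetch is a hypothetical callback that returns the bytes of one piece, and a failed check raises an error here, whereas a real reader would rather try another location.

```python
import zlib


def read_pieces(fetch, num_pieces, checksums):
    """Fetch pieces 0..num_pieces-1, verify each one, and join them in order."""
    parts = []
    for piece_id in range(num_pieces):
        piece = fetch(piece_id)
        if zlib.adler32(piece) != checksums[piece_id]:
            raise ValueError(f"corrupt piece {piece_id}")
        parts.append(piece)
    return b"".join(parts)


pieces = [b"spar", b"k br", b"oadc", b"ast"]
checksums = [zlib.adler32(p) for p in pieces]
restored = read_pieces(lambda i: pieces[i], len(pieces), checksums)

# Flip one byte in piece 2 and confirm that the damage is detected.
damaged = list(pieces)
damaged[2] = b"oXdc"
try:
    read_pieces(lambda i: damaged[i], len(damaged), checksums)
    detected = False
except ValueError:
    detected = True
```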

Comparing the two, the processes are similar: Spark's implementation basically follows the ideas of BitTorrent throughout.

[Figure: The idea of BitTorrent.png]

Take a look at the picture above. At the beginning every executor gets data from the driver, but once other executors hold data blocks, every executor has the chance to get blocks from them, which spreads the pressure off the driver. In a word, the more executors are downloading, the faster the download.
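A toy simulation makes the effect visible. It is a deliberately simplified Python model, not a measurement of Spark: executors fetch one after another rather than concurrently, every holder of a piece is equally likely to be chosen, and simulate and its parameters are invented for this sketch.

```python
import random


def simulate(num_executors, num_pieces, seed=0):
    """Count how many piece fetches the driver serves vs. the executors."""
    rng = random.Random(seed)
    # piece id -> nodes currently holding it; the driver starts with everything
    holders = {p: ["driver"] for p in range(num_pieces)}
    served = {"driver": 0, "executors": 0}
    for executor in range(num_executors):
        name = f"executor-{executor}"
        # each executor asks for the pieces in its own random order
        for piece in rng.sample(range(num_pieces), num_pieces):
            source = rng.choice(holders[piece])  # pick a random location
            served["driver" if source == "driver" else "executors"] += 1
            holders[piece].append(name)  # the fetcher now serves the piece too
    return served, holders


served, holders = simulate(num_executors=20, num_pieces=10)
```

The first executor has no choice but to read every piece from the driver. After that, the driver's share of each fetch shrinks as the number of holders grows, so most of the 200 fetches in this run are served by executors.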

How to use Spark broadcast variables

[Figure: The use of broadcast variables.png]

The small demo above broadcasts an array with broadcast and then uses the array variable inside a task. The array resides on the executor, so it does not have to be transmitted again every time a task is scheduled to run.

As we can see, using a broadcast comes down to two calls: sc.broadcast defines the broadcast variable, and broadcasted.value, the value method of the broadcast variable, returns the real array.

When the SparkContext is initialized, a broadcastManager is initialized in SparkEnv, and its initialization method now uses TorrentBroadcastFactory by default. When sc.broadcast is called, a TorrentBroadcast is created through the factory pattern. At that point the write operation runs: the data is divided into small blocks and written to BlockManager. broadcasted itself is just an instance of TorrentBroadcast and holds no array data; it only maintains the meta information of the data, that is, a set of BlockIds. This instance is serialized and passed to the executor, and calling its value method on the executor triggers the read of the real data from BlockManager.
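The "handle without data" idea can be illustrated with a small model. This is a Python sketch, not Spark code: a module-level dict stands in for BlockManager, JSON stands in for Spark's serializer, and ToyTorrentBroadcast with its write, serialize and deserialize methods is hypothetical.

```python
import json

BLOCK_STORE = {}  # stands in for the BlockManager shared across the cluster


class ToyTorrentBroadcast:
    """Handle that carries block ids only; the data lives in BLOCK_STORE."""

    def __init__(self, block_ids):
        self.block_ids = block_ids
        self._cached = None

    @classmethod
    def write(cls, value, broadcast_id, block_size=1024):
        """Driver side: serialize the value, split it into pieces, store them."""
        data = json.dumps(value).encode("utf-8")
        block_ids = []
        for piece_id, start in enumerate(range(0, len(data), block_size)):
            block_id = f"broadcast_{broadcast_id}_piece{piece_id}"
            BLOCK_STORE[block_id] = data[start:start + block_size]
            block_ids.append(block_id)
        return cls(block_ids)

    def serialize(self) -> bytes:
        """What travels with a task: meta information, no array data."""
        return json.dumps(self.block_ids).encode("utf-8")

    @classmethod
    def deserialize(cls, payload: bytes):
        return cls(json.loads(payload.decode("utf-8")))

    @property
    def value(self):
        """Executor side: the first call reads the pieces and rebuilds the value."""
        if self._cached is None:
            data = b"".join(BLOCK_STORE[b] for b in self.block_ids)
            self._cached = json.loads(data.decode("utf-8"))
        return self._cached


array = list(range(1000))
broadcasted = ToyTorrentBroadcast.write(array, broadcast_id=0)
payload = broadcasted.serialize()                        # shipped with the task
on_executor = ToyTorrentBroadcast.deserialize(payload)   # rebuilt on the executor
```

The payload that travels with the task holds only a handful of block ids, and the array is rebuilt the first time value is read on the receiving side.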
