Spark Kernel Analysis-Communication Architecture 3 (6)

3. Spark communication architecture

Spark is a distributed computing framework, so the design of its nodes and the way they communicate with each other are an important part of it.
Spark initially used Akka as its internal communication component. In Spark 1.3, to solve the problem of transferring large blocks of data (such as Shuffle data), Spark introduced the Netty communication framework. By Spark 1.6, Spark could be configured to use either Akka or Netty, which means Netty could completely replace Akka. As of Spark 2, Spark has abandoned Akka entirely and uses only Netty.

Why? The official explanation is:
1) Many Spark users also use Akka, but different versions of Akka cannot communicate with each other. Users therefore have to use exactly the same Akka version as Spark, which prevents them from upgrading Akka.
2) Spark's Akka configuration is tuned for Spark itself and may conflict with the Akka configuration in the user's own code.
3) Spark uses very few Akka features, and these features are easy to implement directly. The resulting code is much smaller than Akka and easier to debug, and any bug can be fixed immediately without waiting for Akka upstream to release a new version. Moreover, because of the first point, whenever Spark upgrades Akka, users are forced to upgrade the Akka they use as well, which is unrealistic for some users.

3.1 Overview of communication components

Based on an analysis of the source code, the design can be understood as follows:
[Figure: overview of the communication components]
1) RpcEndpoint: an RPC endpoint. Spark calls each node (Client/Master/Worker) an RPC endpoint, and each one implements the RpcEndpoint interface. Each endpoint defines its own messages and business logic according to its needs. To send a message or make a request (ask), it goes through the Dispatcher.
2) RpcEnv: the RPC context. The environment that each RPC endpoint depends on at runtime is called the RpcEnv.
3) Dispatcher: the message distributor. It routes messages that an RPC endpoint needs to send, or that arrive from a remote RPC endpoint, to the corresponding inbox or outbox. If the recipient is the local endpoint, the message is stored in the inbox; otherwise it is placed in the outbox.
4) Inbox: the inbox for incoming messages. One local endpoint corresponds to one inbox. Each time the Dispatcher stores a message in an Inbox, it adds the corresponding EndpointData to its internal queue of pending receivers. In addition, when the Dispatcher is created, it starts a separate thread that polls this receiver queue and consumes the inbox messages.
5) Outbox: the outbox for outgoing messages. One remote endpoint corresponds to one outbox. After a message is put into the Outbox, it is sent out through the TransportClient. Putting the message into the outbox and sending it happen in the same thread. The main reason is that remote messages come in two kinds, RpcOutboxMessage and OneWayOutboxMessage, and a message that requires a response is sent directly so that its result can be processed.
6) TransportClient: the Netty communication client. Based on the receiver information of the Outbox message, it sends a request to the corresponding remote TransportServer.
7) TransportServer: the Netty communication server. Each RPC endpoint has one TransportServer. After receiving a remote message, it calls the Dispatcher to distribute the message to the corresponding inbox.
Note:
1) The dotted line between TransportClient and TransportServer represents communication between two RpcEnvs; this is not shown separately in the figure.
2) Each Outbox has one TransportClient; this is not shown separately in the figure.
3) An RpcEnv contains two RpcEndpoints: one is the RPC endpoint that the node itself starts, and the other is RpcEndpointVerifier.
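The routing rule described for the Dispatcher can be sketched as follows. This is a minimal Python model of the idea, not Spark's Scala implementation: the `Address`, `Message`, and `Dispatcher` classes, the `post` method, and the host names are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Address:
    host: str
    port: int


@dataclass
class Message:
    receiver_address: Address
    receiver_name: str
    body: str


@dataclass
class Dispatcher:
    """Simplified model: one inbox per local endpoint, one outbox per remote address."""
    local_address: Address
    inboxes: dict = field(default_factory=dict)
    outboxes: dict = field(default_factory=dict)

    def register(self, name):
        # Each local endpoint gets its own inbox.
        self.inboxes[name] = deque()

    def post(self, msg):
        # Recipient is local: store the message in that endpoint's inbox.
        if msg.receiver_address == self.local_address:
            self.inboxes[msg.receiver_name].append(msg)
            return "inbox"
        # Recipient is remote: queue the message in the outbox for that address.
        self.outboxes.setdefault(msg.receiver_address, deque()).append(msg)
        return "outbox"


master = Address("master-host", 7077)   # hypothetical addresses
worker = Address("worker-host", 8081)
dispatcher = Dispatcher(local_address=worker)
dispatcher.register("Worker")

local_route = dispatcher.post(Message(worker, "Worker", "local status check"))
remote_route = dispatcher.post(Message(master, "Master", "RegisterWorker"))
print(local_route, remote_route)
```

In this sketch a real outbox would hand the queued message to a network client; here it only records the message so that the routing decision stays visible.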

3.2 Endpoint startup process

The startup process is as follows:

[Figure: Endpoint startup process]
After an Endpoint is started, an OnStart message is added to its Inbox by default. When each endpoint (Master/Worker/Client) consumes the OnStart message, it performs its own additional startup processing.

When an Endpoint starts, a TransportServer is started by default, and once startup completes, a synchronous check of RPC availability is performed (askSync-BoundPortsRequest).

As the distributor, the Dispatcher holds handles to the Inboxes, Outboxes, and related objects, and stores the associated processing state. The structure is roughly as follows:

[Figure: internal structure of the Dispatcher]
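As a rough illustration of the startup behavior and the Dispatcher's bookkeeping, the sketch below registers an endpoint, seeds its inbox with OnStart, and schedules it on a receiver queue. It is a simplified Python model; the class layout and the `register_rpc_endpoint` method name are assumptions for this example and do not reproduce Spark's code.

```python
import queue
from collections import deque


class Inbox:
    def __init__(self, endpoint_name):
        self.endpoint_name = endpoint_name
        # OnStart is placed in the inbox first, so it is the first message consumed.
        self.messages = deque(["OnStart"])


class Dispatcher:
    def __init__(self):
        self.endpoints = {}             # endpoint name -> Inbox
        self.receivers = queue.Queue()  # inboxes that have pending messages

    def register_rpc_endpoint(self, name):
        if name in self.endpoints:
            raise ValueError(f"an endpoint named {name} is already registered")
        inbox = Inbox(name)
        self.endpoints[name] = inbox
        # Schedule the inbox so that a message loop picks up OnStart.
        self.receivers.put(inbox)
        return inbox


dispatcher = Dispatcher()
dispatcher.register_rpc_endpoint("Master")

pending = dispatcher.receivers.get_nowait()
first_message = pending.messages.popleft()
print(pending.endpoint_name, first_message)
```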

3.3 Endpoint Send & Ask process

The Endpoint message send and ask (request) process is as follows:

[Figure: Endpoint Send & Ask process]
Depending on business needs, an Endpoint produces four kinds of messages, a combination of two dimensions: whether the message is a send or an ask, and whether the receiver is the endpoint itself or another endpoint.
1) OneWayMessage: send + self. Stored directly in the inbox.
2) OneWayOutboxMessage: send + non-self. Stored in the outbox and sent directly.
3) RpcMessage: ask + self. Stored directly in the inbox, together with a LocalNettyRpcCallContext that must be called back before returning.
4) RpcOutboxMessage: ask + non-self. Stored in the outbox and sent directly; a callback is required before returning.
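The two-dimensional combination above maps onto a small decision function. The following Python sketch only illustrates the classification; the `Kind` enum and the `classify` function are hypothetical names, while the four message names come from the list above.

```python
from enum import Enum


class Kind(Enum):
    ONE_WAY_MESSAGE = "OneWayMessage"
    ONE_WAY_OUTBOX_MESSAGE = "OneWayOutboxMessage"
    RPC_MESSAGE = "RpcMessage"
    RPC_OUTBOX_MESSAGE = "RpcOutboxMessage"


def classify(expects_reply: bool, receiver_is_local: bool) -> Kind:
    """Pick the message kind from (send or ask) x (self or non-self)."""
    if expects_reply:  # ask
        return Kind.RPC_MESSAGE if receiver_is_local else Kind.RPC_OUTBOX_MESSAGE
    # send
    return Kind.ONE_WAY_MESSAGE if receiver_is_local else Kind.ONE_WAY_OUTBOX_MESSAGE


for expects_reply in (False, True):
    for receiver_is_local in (True, False):
        kind = classify(expects_reply, receiver_is_local)
        print(expects_reply, receiver_is_local, kind.value)
```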

3.4 Endpoint receive process

The process of receiving Endpoint messages is as follows:

[Figure: Endpoint receive process]
In the figure above, ServerBootstrap is the Netty server bootstrap and SocketChannel is the Netty data channel.
The figure covers two processes: TransportServer startup and message reception.

3.5 Endpoint Inbox processing flow

The core of Spark's Endpoint design is the Inbox and the Outbox. The key points of the Inbox are:
1) The internal processing flow is split into multiple message instructions (InboxMessage), which are stored in the Inbox.
2) When the Dispatcher is started, it starts a thread named [dispatcher-event-loop] that scans the Inboxes for pending InboxMessages and calls the Endpoint to handle each one according to its InboxMessage type.
3) When the Dispatcher is started, an InboxMessage of type OnStart is stored in the Inbox by default, and the Endpoint performs its additional startup work when it handles the OnStart instruction. Everything the three endpoints (Client/Master/Worker) do after starting derives from handling OnStart, so the OnStart instruction can be seen as the origin of their mutual communication.

[Figure: Endpoint Inbox processing flow]
The message instruction types are roughly as follows:
1) OnStart/OnStop
2) RpcMessage/OneWayMessage
3) RemoteProcessDisconnected/RemoteProcessConnected/RemoteProcessConnectionError
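A minimal sketch of how an inbox might be drained by message type is shown below. It is written in Python for brevity and covers only the first two groups of types listed above. The `EchoEndpoint` class, the tuple encoding of messages, and the `process` function are assumptions for illustration; in Spark the corresponding logic lives in Scala, and the callback method names here only mirror the RpcEndpoint concepts.

```python
from collections import deque


class EchoEndpoint:
    """Hypothetical endpoint that records which callbacks were invoked."""

    def __init__(self):
        self.log = []

    def on_start(self):
        self.log.append("on_start")

    def receive(self, content):
        self.log.append(f"receive:{content}")

    def receive_and_reply(self, content, reply):
        self.log.append(f"receive_and_reply:{content}")
        reply(content.upper())

    def on_stop(self):
        self.log.append("on_stop")


def process(inbox, endpoint):
    """Drain the inbox, calling the endpoint method that matches each message type."""
    while inbox:
        kind, *payload = inbox.popleft()
        if kind == "OnStart":
            endpoint.on_start()
        elif kind == "OneWayMessage":
            endpoint.receive(payload[0])
        elif kind == "RpcMessage":
            content, reply = payload
            endpoint.receive_and_reply(content, reply)
        elif kind == "OnStop":
            endpoint.on_stop()
        else:
            raise ValueError(f"unknown inbox message: {kind}")


replies = []
inbox = deque([
    ("OnStart",),
    ("OneWayMessage", "heartbeat"),
    ("RpcMessage", "ping", replies.append),  # the reply callback collects the answer
    ("OnStop",),
])
endpoint = EchoEndpoint()
process(inbox, endpoint)
print(endpoint.log, replies)
```

The sketch runs the loop on the calling thread; the dedicated dispatcher thread described earlier would call something like `process` for each inbox it takes from the receiver queue.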

3.6 Endpoint image

[Figure: Endpoint image]

Origin blog.csdn.net/qq_44696532/article/details/135390417