The principle behind PySpark: on the Driver side, Python calls Java methods through Py4j; on the Executor side, for every Task running concurrently there is a corresponding pyspark.worker process.

The principle behind PySpark

Spark is mainly developed in Scala. To make it easier to integrate with other systems without introducing Scala-related dependencies, some parts, such as the External Shuffle Service, are implemented in Java. Overall, Spark is implemented in JVM languages and runs inside the JVM. Besides the Scala/Java development interfaces, however, Spark also provides development interfaces for Python, R, and other languages. To keep the core implementation independent, Spark only adds wrappers at the periphery to support development in these languages. This article introduces the implementation principle behind Python on Spark and analyzes how a PySpark application starts up and runs.

Spark runtime architecture

First, let's review Spark's basic runtime architecture, as shown below, where the orange parts represent the JVM. At runtime a Spark application is divided into a Driver and Executors: the Driver is responsible for overall scheduling and the UI, while the Executors are responsible for running Tasks. Spark can be deployed on a variety of resource management systems, such as Yarn, Mesos, etc., and Spark itself also implements a simple Standalone (independent deployment) resource manager, so it can run without relying on any other resource management system. For more details, please refer to "Spark Scheduler internal principles of analysis".

spark-structure

A Spark user application runs on the Driver (to some degree, the user program is the Spark Driver program). After Spark's scheduling, it is packaged into Tasks, and the Task information is then sent to Executors for execution. The Task information includes code logic as well as data; the Executor does not directly run the user's application code.
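To make the Driver/Executor split concrete, here is a minimal PySpark driver program, given only as a sketch (the application name and the local master URL are illustrative assumptions). The script itself runs in the Driver; only the functions inside the transformations are shipped to Executors as part of Tasks.

    # Minimal PySpark driver program (illustrative sketch).
    from pyspark import SparkConf, SparkContext

    if __name__ == "__main__":
        conf = SparkConf().setAppName("pyspark-demo").setMaster("local[2]")
        sc = SparkContext(conf=conf)

        # The code in this script runs in the Driver; only the packaged Task
        # logic (e.g. the lambdas below) is executed on Executors.
        counts = (sc.parallelize(["a", "b", "a", "c"])
                    .map(lambda w: (w, 1))
                    .reduceByKey(lambda x, y: x + y)
                    .collect())
        print(counts)
        sc.stop()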

PySpark runtime architecture

In order not to break Spark's existing runtime architecture, Spark wraps a layer of Python API around the periphery and uses Py4j to implement the interaction between Python and Java, thereby making it possible to write Spark applications in Python. The runtime architecture is shown in the figure below.

pyspark-structure

The white parts are newly added Python processes. On the Driver side, Python calls Java methods through Py4j, which essentially "maps" the user-written PySpark program onto the JVM: for example, when the user instantiates a Python SparkContext object in PySpark, a Scala SparkContext object is eventually instantiated in the JVM. On the Executor side, Py4j is not needed, because the Task logic that runs on the Executor is sent from the Driver as serialized bytecode, which may contain user-defined Python functions or Lambda expressions; Py4j cannot make Java call methods implemented in Python, so in order to run these Python functions or Lambda expressions on the Executor side, a separate Python process has to be started for each Task, and the functions or Lambda expressions are handed to that Python process for execution over a socket. The overall cross-language interaction flow is shown below; solid lines represent method calls, and dashed lines represent returned results.

pyspark-call
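A small sketch of the two sides just described (names and values are illustrative): the RDD bookkeeping calls go through Py4j to the JVM, while the Python function itself is pickled into the Task and only executes later inside a pyspark.worker process on the Executor.

    # Sketch: driver-side Py4j calls vs. executor-side Python execution.
    from pyspark import SparkContext

    sc = SparkContext(appName="py4j-vs-worker-demo")  # mapped to a JVM SparkContext via Py4j

    rdd = sc.parallelize(range(10), numSlices=2)      # RDD bookkeeping happens in the JVM

    def add_one(x):
        # This function cannot run inside the JVM; it is serialized (pickled)
        # with the Task and executed in a pyspark.worker process on an Executor.
        return x + 1

    result = rdd.map(add_one).map(lambda x: x * 2).collect()
    print(result)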

The following is a detailed analysis of how the PySpark Driver starts up and runs, and how Tasks are run on the Executor side.

Driver-side operating principle

When we submit a pyspark program via spark-submit, the Python script and its dependencies are uploaded first and a Driver is requested. Once the Driver resource is granted, the JVM is launched via PythonRunner (which contains the main method), as shown below.

pyspark-driver-runtime

The main function of the PythonRunner entry point mainly does two things:

  • Start a Py4j GatewayServer, which the launched Python process connects back to (see the sketch after this list)
  • Run the user-uploaded Python script via a Java Process
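As a conceptual sketch of the Python side of this connection (not the actual PySpark bootstrap code; the port is an illustrative assumption, and in a real launch the port chosen by PythonRunner's GatewayServer is handed to the Python process):

    # Conceptual sketch: a Python process connecting to a Py4j GatewayServer.
    from py4j.java_gateway import JavaGateway, GatewayParameters

    gateway = JavaGateway(gateway_parameters=GatewayParameters(port=25333))   # port assumed for illustration
    jvm = gateway.jvm                                        # entry point for calling into the JVM
    print(jvm.java.lang.System.getProperty("java.version")) # JVM methods can now be invoked from Python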

After the user's Python script starts, it first instantiates the Python version of the SparkContext object. During instantiation it does two things:

  • Instantiate a Py4j GatewayClient and connect to the Py4j GatewayServer in the JVM; all subsequent Java calls made from Python go through this Py4j Gateway
  • Instantiate the SparkContext object in the JVM through the Py4j Gateway (see the sketch after this list)
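This mapping can be observed from the Python side. The following is only an illustrative sketch that peeks at internal, underscore-prefixed attributes, which may differ between Spark versions:

    # Sketch: the Py4j plumbing behind the Python SparkContext (internal attributes).
    from pyspark import SparkContext

    sc = SparkContext(appName="gateway-demo")

    print(type(sc._gateway))  # Py4j JavaGateway client connected to the GatewayServer
    print(type(sc._jvm))      # Py4j view of the JVM, used to call Java from Python
    print(sc._jsc)            # Py4j proxy for the JavaSparkContext living in the JVM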

After the two steps above, the SparkContext object is fully initialized and the Driver is up; it begins to request Executor resources and to schedule tasks. The series of processing logic defined in the user's Python script finally triggers a Job submission when an action method is reached. The Job is submitted by calling the Java method PythonRDD.runJob directly through Py4j; mapped into the JVM, this is forwarded to sparkContext.runJob. After the Job finishes, the JVM opens a local Socket and waits for the Python process to pull the results; correspondingly, after calling PythonRDD.runJob, the Python process pulls the results back through that Socket.
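Seen from the Python side, this whole path is hidden behind an ordinary action call. The sketch below only annotates, as comments, the steps described above (method names are taken from the text):

    # Sketch: an action as seen from the Python side, with the submission path annotated.
    from pyspark import SparkContext

    sc = SparkContext(appName="runjob-demo")
    rdd = sc.parallelize(range(100)).map(lambda x: x % 7)

    # collect() is an action:
    #   1. Python calls the JVM's PythonRDD.runJob through Py4j
    #   2. in the JVM this is forwarded to sparkContext.runJob, which runs the Job
    #   3. when the Job finishes, the JVM opens a local Socket serving the results
    #   4. the Python process connects to that Socket and pulls the results back
    result = rdd.collect()
    print(result[:10])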

Pulling the Driver part out of the earlier runtime architecture diagram, as shown below: the main function of the PythonRunner entry point launches the JVM and Python processes, with the JVM process shown in orange and the Python process in white. The Python process submits Jobs by calling Java methods through Py4j, and Job results are pulled back to the Python process through a local Socket. One more point: for large amounts of data, such as broadcast variables, the Python process and the JVM process exchange data through the local file system in order to reduce inter-process data transfer.

pyspark-driver
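Broadcast variables, the case mentioned above where larger data is exchanged between the Python process and the JVM via the local file system rather than over the Py4j connection, look like this in user code (a sketch; the data and names are illustrative):

    # Sketch: a broadcast variable created on the Driver and read inside pyspark.worker.
    from pyspark import SparkContext

    sc = SparkContext(appName="broadcast-demo")

    lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})       # materialized by the driver-side Python process

    rdd = sc.parallelize(["a", "b", "c", "a"])
    mapped = rdd.map(lambda k: lookup.value.get(k, 0))    # read back inside pyspark.worker on the Executors
    print(mapped.collect())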

Executor-side operating principle

For ease of explanation, take Spark on Yarn as an example. When the Driver obtains Executor resources, the JVM is launched via CoarseGrainedExecutorBackend (which contains the main method); after starting some necessary services, it waits for the Driver to dispatch Tasks. Before any Task has been dispatched, there is no Python process on the Executor side. When a Task dispatched by the Driver is received, the Executor's internal execution process is as shown in the figure below.

pyspark-executor-runtime

After the Executor receives a Task, it runs the Task via launchTask, which eventually calls the compute method of PythonRDD to process one partition of data. The computation flow of PythonRDD's compute method is roughly divided into three steps (a small observation sketch follows the list):

  • If the pyspark.daemon background Python process does not yet exist, it is started via a Java Process; note that there is only one pyspark.daemon background process per Executor. Otherwise, a Socket connection is made directly to pyspark.daemon, requesting it to start a pyspark.worker process to run the user-defined Python function or Lambda expression. pyspark.daemon is a typical multi-process server: for each incoming Socket request it forks a pyspark.worker process to handle it, so however many Tasks run concurrently on an Executor, there will be the same number of corresponding pyspark.worker processes.
  • Next, a separate thread is started to feed data to the pyspark.worker process, and pyspark.worker calls the user-defined Python function or Lambda expression to do the computation.
  • While data is being fed on one side, the computation results of pyspark.worker are pulled back through the Socket on the other side.
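The pyspark.worker processes described above can be observed directly from user code. This is only an illustrative sketch (names and partition counts are assumptions); the PIDs it prints belong to the forked worker processes, not to the Executor JVM:

    # Sketch: observing the pyspark.worker processes that handle the partitions.
    import os
    from pyspark import SparkContext

    sc = SparkContext(appName="worker-pid-demo")

    pids = (sc.parallelize(range(8), numSlices=4)
              .mapPartitions(lambda it: [os.getpid()])
              .collect())
    print(sorted(set(pids)))  # distinct pyspark.worker PIDs that processed the partitions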

Pulling the Executor part out of the earlier runtime architecture diagram, as shown below: the orange parts are JVM processes and the white parts are Python processes. Each Executor has one shared pyspark.daemon process, which is responsible for receiving Task requests and forking a pyspark.worker process to handle each Task; the actual data processing happens in pyspark.worker, and the Task in the JVM communicates with it quite frequently over a local Socket.

pyspark-executor
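Spark also exposes a couple of configuration properties related to this daemon/worker model. The sketch below sets them through SparkConf; the property names come from Spark's configuration, but the values are only illustrative assumptions:

    # Sketch: Python-worker related configuration (values are illustrative).
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("worker-conf-demo")
            .set("spark.python.worker.reuse", "true")    # reuse workers across Tasks instead of forking a new one each time
            .set("spark.python.worker.memory", "512m"))  # memory per Python worker for aggregation before spilling
    sc = SparkContext(conf=conf)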

Summary

Overall, PySpark uses Py4j to let Python call Java and thereby drive the Spark application; at runtime it still essentially runs in the JVM, and results returned from Java to Python go through local Sockets. This architecture keeps the Spark core code independent, but in big-data scenarios the frequent data communication between the Python processes and the JVM causes considerable performance loss, and in bad cases the application can even get stuck. It is therefore recommended to use PySpark with caution for large-scale machine learning or Streaming scenarios, and to prefer writing applications in native Scala/Java; for simple offline tasks with small data volumes, PySpark is fine for quick development and submission.

Please indicate the source when reposting. Permalink: http://sharkdtu.com/posts/pyspark-internal.html
