Spark data locality

Reprinted from: https://www.cnblogs.com/jxhd1/p/6702224.html

Spark data locality --> how to use it for performance tuning

1. Spark data locality: move the computation, not the data

2. Data locality levels in Spark:

The locality levels of a TaskSetManager are divided into the following five levels:
    PROCESS_LOCAL
    NODE_LOCAL
    NO_PREF
    RACK_LOCAL
    ANY
    
    PROCESS_LOCAL    process-local: the data the task needs is in the same Executor process as the task itself
 
 
    NODE_LOCAL       node-local: slightly slower than PROCESS_LOCAL, because the data has to be passed between processes or read from disk
                                        Case 1: the data the task needs is in a different Executor process on the same Worker
                                        Case 2: the data the task needs is on the local disk, or in HDFS with a block that happens to be on the same node
                                  When Spark reads its data from HDFS, the best achievable data locality level is NODE_LOCAL
 
 
    NO_PREF          no preference: the data is equally fast to access wherever it comes from, so no location is preferred. For example, Spark SQL reading data from MySQL
 
    RACK_LOCAL       rack-local: the data is on a different node in the same rack, so it must be transferred over the network, plus file I/O; slower than NODE_LOCAL
                                         Case 1: the data the task needs is in an Executor on another Worker in the same rack (e.g. Worker2)
                                         Case 2: the data the task needs is on the disk of another Worker in the same rack (e.g. Worker2)
 
    ANY              cross-rack: the data is on a node in a different rack; this is the slowest case
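For reference, Spark's own source defines these levels as an ordered enumeration, org.apache.spark.scheduler.TaskLocality, where a smaller value means better locality. Below is a minimal sketch modeled on that definition; the isAllowed helper mirrors how the scheduler compares a level against a constraint. Treat it as an illustration, not a copy of Spark's code:

    // Sketch modeled on org.apache.spark.scheduler.TaskLocality.
    // Declaration order defines the ordering: PROCESS_LOCAL is best, ANY is worst.
    object TaskLocality extends Enumeration {
      val PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY = Value

      // A task may run at `condition` if it is at least as local as `constraint`.
      def isAllowed(constraint: Value, condition: Value): Boolean =
        condition <= constraint
    }

    // Example: under a NODE_LOCAL constraint, PROCESS_LOCAL is allowed, RACK_LOCAL is not:
    //   TaskLocality.isAllowed(TaskLocality.NODE_LOCAL, TaskLocality.PROCESS_LOCAL) == true
    //   TaskLocality.isAllowed(TaskLocality.NODE_LOCAL, TaskLocality.RACK_LOCAL)    == false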
            

3. Who is responsible for data locality in Spark?

         DAGScheduler and TaskScheduler
 
            val rdd1 = sc.parallelize(1 to 100).cache()
            rdd1.map(i => i * 2).filter(_ > 10).count()
             Before the Driver (TaskScheduler) sends the tasks, it first has to find out which nodes rdd1 is cached on (node1, node2) --> in this step the DAGScheduler, through the cacheManager object, calls getPreferredLocations() to learn which nodes hold rdd1's cached blocks, and the TaskScheduler then sends the tasks to those nodes.
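As a self-contained sketch of this first scenario: once the cached RDD has been materialized, SparkContext.getPreferredLocs (a developer API) reports, per partition, where the scheduler would prefer to run the task. In local mode it just prints the local executor; on a cluster it shows the caching nodes. The app name, master, and data below are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object CacheLocalityDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("cache-locality-demo").setMaster("local[2]")
        val sc = new SparkContext(conf)

        val rdd1 = sc.parallelize(1 to 1000, numSlices = 4).cache()
        rdd1.count()  // materialize the cache so cached block locations exist

        // For each partition, ask the scheduler where it would prefer to run the task.
        (0 until rdd1.getNumPartitions).foreach { p =>
          println(s"partition $p -> ${sc.getPreferredLocs(rdd1, p)}")
        }
        sc.stop()
      }
    }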
 
            val rdd1 = sc.textFile("hdfs://...") // rdd1 encapsulates the locations of the blocks of this file; the TaskScheduler calls getPreferredLocations() to learn where the data for each partition lives
            rdd1.map(_.length).filter(_ > 0).count()
             Before the Driver (TaskScheduler) sends the tasks, it first has to get the locations of rdd1's data (node1, node2) --> rdd1 encapsulates the locations of the blocks that make up this file; the TaskScheduler calls getPreferredLocations() to get the location of the data for each partition, and it sends the corresponding tasks according to those locations.
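The same probe works for the HDFS case: for an RDD backed by HDFS, the preferred locations come from the block locations reported by the NameNode rather than from the cache. A short sketch, reusing sc from the previous example; the hdfs:// URI is a placeholder:

    // Preferred locations of an HDFS-backed RDD come straight from the
    // HDFS block locations, no caching required. Typically prints the
    // datanode hostnames that hold each block.
    val rdd1 = sc.textFile("hdfs://namenode:8020/path/to/file")  // placeholder URI
    rdd1.partitions.indices.foreach { p =>
      println(s"partition $p -> ${sc.getPreferredLocs(rdd1, p)}")
    }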
 
    Overall:
       Data locality in Spark is handled jointly by the DAGScheduler and the TaskScheduler.
       The DAGScheduler cuts the Job into Stages and submits the tasks of a Stage by calling submitStage, which in turn calls submitMissingTasks. submitMissingTasks determines the preferredLocations of every task that needs to be computed: it calls getPreferredLocations() to obtain each partition's preferred locations, which are exactly the preferred locations of the task for that partition. So for every task in the TaskSet handed to the TaskScheduler, the task's preferred locations match those of its partition.
      When the TaskScheduler receives the TaskSet, TaskSchedulerImpl creates a TaskSetManager for it. The TaskSetManager holds all the tasks of the TaskSet and manages their execution, which includes computing which locality levels the tasks of the TaskSetManager have, so that this information can be used when scheduling (and delay-scheduling) the tasks.

4. The data locality flow in Spark

Which node a task runs on relative to where its input data lives: below we dig into how Spark's scheduling system produces this placement. The process involves the RDD, the DAGScheduler, and the TaskScheduler; once you understand it, you have basically understood Spark's PreferredLocations (location preference) strategy.
 
    Step 1: PROCESS_LOCAL --> the TaskScheduler first sends the task to the node where its data resides.
If the task waits in Executor1 on Worker1 for 3s (3s is Spark's default wait time, set via spark.locality.wait and adjustable in SparkConf()) and is retried 5 times but still cannot run,
 
    the TaskScheduler lowers the data locality level, from PROCESS_LOCAL down to NODE_LOCAL.
 
    Step 2: NODE_LOCAL --> the TaskScheduler resends the task, this time to Executor2 on Worker1.
If the task waits in Executor2 on Worker1 for 3s and is retried 5 times but still cannot run,
 
    the TaskScheduler lowers the data locality level, from NODE_LOCAL down to RACK_LOCAL.
 
    Step 3: RACK_LOCAL --> the TaskScheduler resends the task to Executor1 on Worker2.
 
   Step 4: Once the task has been assigned, it fetches its data through the BlockManager of the Executor it runs in. If the BlockManager finds that it does not hold the data locally, it calls getRemote(): through its ConnectionManager it first establishes a connection to the ConnectionManager of the BlockManager on the node that holds the data, then fetches the data via the TransferService (the network transfer component) and ships it over the network back to the task's node (performance drops sharply at this point, with heavy network I/O eating up resources). The computed result is finally returned to the Driver.
 
Summary:
   When the TaskScheduler sends a task, it sends it to the node where the data resides, i.e. at the highest possible data locality level. If the task waits in that Executor for three seconds and has been re-sent 5 times but still cannot run, the TaskScheduler concludes that this Executor's compute resources are exhausted, lowers the data locality level by one, and resends the task to another Executor. If it still cannot run there, the level is lowered again, and so on...
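This wait-then-demote behavior is Spark's delay scheduling. A minimal illustrative sketch, loosely modeled on TaskSetManager.getAllowedLocalityLevel; the level list and the 3s wait come from the text above, everything else is made up for illustration:

    // Illustrative delay scheduling: stay at the current locality level until
    // we have waited longer than its configured wait, then demote one level.
    object DelaySchedulingSketch {
      // Ordered from best to worst, as in the steps above.
      val levels = Vector("PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY")
      val localityWaitMs = 3000L  // default of spark.locality.wait

      private var currentLevel = 0
      private var lastLaunchTime = System.currentTimeMillis()

      // Called when a resource offer arrives; returns the most local level
      // the scheduler is currently willing to accept.
      def allowedLevel(now: Long = System.currentTimeMillis()): String = {
        if (currentLevel < levels.size - 1 && now - lastLaunchTime >= localityWaitMs) {
          currentLevel += 1        // waited too long: demote one level
          lastLaunchTime = now     // restart the wait timer at the new level
        }
        levels(currentLevel)
      }
    }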
 
     If you want every task to get the best possible data locality level, the tuning knob is to lengthen the wait time. Careful, though! If you raise the wait time too far, every task may indeed get its best locality level, but the total job run time will grow as well.
  1. spark.locality.wait 3s  // the global setting; the three keys below fall back to this value by default, and an explicit setting takes precedence
  2. spark.locality.wait.process
  3. spark.locality.wait.node
  4. spark.locality.wait.rack
  5. new SparkConf().set("spark.locality.wait", "100")
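A short sketch of setting these keys in code. The values shown are examples, not recommendations; newer Spark versions accept duration strings such as "6s", and each per-level key overrides the global spark.locality.wait:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("locality-tuning")             // placeholder app name
      .set("spark.locality.wait", "6s")          // global fallback (default 3s)
      .set("spark.locality.wait.process", "6s")  // wait before giving up PROCESS_LOCAL
      .set("spark.locality.wait.node", "3s")     // wait before giving up NODE_LOCAL
      .set("spark.locality.wait.rack", "1s")     // wait before giving up RACK_LOCAL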
