Spark data locality

Reprinted from: https://www.cnblogs.com/jxhd1/p/6702224.html?utm_source=itdadao&utm_medium=referral

Spark data locality --> how to use it for performance tuning

1. Spark data locality: move the computation, not the data

2. Data locality levels in Spark:

The locality levels of a TaskSetManager are divided into the following five levels:
    PROCESS_LOCAL
    NODE_LOCAL
    NO_PREF
    RACK_LOCAL
    ANY
    PROCESS_LOCAL    process-local: the data the task needs is in the same Executor process as the task, typically cached in that Executor's memory, so no data movement is required
 
 
    NODE_LOCAL    node-local: slightly slower than PROCESS_LOCAL, because the data must cross process boundaries or be read from a file
                      Case 1: the data the task needs is in a different Executor process on the same worker node
                      Case 2: the data the task needs is on the node's local disk, or in an HDFS block that happens to reside on the same node
                  When Spark reads its input from HDFS, NODE_LOCAL is usually the best locality level it can achieve
 
 
    NO_PREF    no preferred location: the data is equally fast to access from anywhere, so there is no location preference; for example, Spark SQL reading data from MySQL over JDBC
 
    RACK_LOCAL    rack-local: the data is on a different node in the same rack, so it must be transferred over the network, with the accompanying file I/O; slower than NODE_LOCAL
                      Case 1: the task's data is in an Executor on Worker2 (while the task runs on Worker1)
                      Case 2: the task's data is on Worker2's disk
 
    ANY    cross-rack: the data is on a node in a different rack, so this is the slowest level
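
These five levels correspond to the values of Spark's developer-API enumeration org.apache.spark.scheduler.TaskLocality. A minimal sketch, assuming spark-core is on the classpath, that prints the levels in priority order and exercises the isAllowed check the scheduler applies:

    import org.apache.spark.scheduler.TaskLocality

    object LocalityLevelsDemo {
      def main(args: Array[String]): Unit = {
        // Declared in priority order: PROCESS_LOCAL < NODE_LOCAL < NO_PREF < RACK_LOCAL < ANY
        TaskLocality.values.foreach(println)

        // isAllowed(constraint, condition): may a task whose best level is `condition`
        // launch while the scheduler currently only accepts `constraint` or better?
        println(TaskLocality.isAllowed(TaskLocality.NODE_LOCAL, TaskLocality.PROCESS_LOCAL)) // true
        println(TaskLocality.isAllowed(TaskLocality.NODE_LOCAL, TaskLocality.RACK_LOCAL))    // false
      }
    }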
            

3. Who is responsible for data locality in Spark?

         The DAGScheduler and the TaskScheduler.
 
            val rdd1 = ... // some RDD
            rdd1.cache()
            rdd1.map(...).filter(...).count()
             Before the Driver's TaskScheduler sends the tasks, it must first find out which nodes rdd1 is cached on (say node1 and node2). In this step the DAGScheduler, through the cache manager, calls getPreferredLocations() to learn which nodes hold rdd1's cached partitions; the TaskScheduler then sends the tasks to those nodes.
 
            val rdd1 = sc.textFile("hdfs://...") // rdd1 encapsulates the locations of the blocks that back this file
            rdd1.map(...).filter(...).count()
             Before the Driver's TaskScheduler sends the tasks, it must first obtain the locations of rdd1's data (say node1 and node2). rdd1 already encapsulates the block locations of the file, so getPreferredLocations() returns, for each partition, where that partition's data lives, and the TaskScheduler sends the corresponding tasks to those locations.
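
The HDFS case can be observed directly from user code through RDD.preferredLocations; the cached case is resolved internally by the DAGScheduler. A minimal sketch, with a placeholder HDFS path and assuming the master URL is configured externally:

    import org.apache.spark.{SparkConf, SparkContext}

    object PreferredLocationsDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("preferred-locations-demo"))

        // An HDFS-backed RDD: each HadoopRDD partition carries the hostnames of the
        // DataNodes that hold its block, surfaced through preferredLocations().
        val rdd1 = sc.textFile("hdfs://namenode:8020/data/input.txt") // placeholder path
        rdd1.partitions.take(3).foreach { p =>
          println(s"partition ${p.index} prefers: ${rdd1.preferredLocations(p).mkString(", ")}")
        }

        // For a cached RDD the lookup is internal: the DAGScheduler asks the block
        // manager which executors hold each cached block, which is why tasks over
        // cached partitions show up as PROCESS_LOCAL in the Spark UI.
        rdd1.cache()
        rdd1.count()                 // the first action materializes the cache
        rdd1.map(_.length).count()   // this job's tasks go to the caching executors
        sc.stop()
      }
    }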
 
    In general:
       Data locality in Spark is the joint responsibility of the DAGScheduler and the TaskScheduler.
       The DAGScheduler cuts the job into stages and submits the tasks of a stage by calling submitStage, which calls submitMissingTasks. submitMissingTasks determines the preferredLocations of every task to be computed by calling getPreferredLocations() to obtain each partition's preferred locations; a task's preferred locations are exactly those of its partition. Every task handed to the TaskScheduler inside a TaskSet carries these preferred locations.
      When the TaskScheduler receives the TaskSet, the TaskSchedulerImpl creates a TaskSetManager for it. The TaskSetManager holds all the tasks of the TaskSet and manages their execution, which includes computing the locality levels of its tasks; these levels come into play when tasks are scheduled, through delay scheduling.
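
As a toy model, not Spark's actual code, the TaskSetManager's derivation of valid locality levels can be pictured as follows, assuming we can summarize, for each pending task, the executors caching its data, the hosts holding it, and their racks (all names here are hypothetical):

    object ValidLocalityLevels {
      // Hypothetical summary of where one pending task's input data lives.
      case class TaskDataLocation(cachedOnExecutors: Set[String],
                                  hosts: Set[String],
                                  racks: Set[String])

      // A level is valid for the TaskSet if at least one pending task could use it;
      // ANY is always kept as the last resort.
      def validLevels(tasks: Seq[TaskDataLocation]): Seq[String] = {
        val levels = Seq.newBuilder[String]
        if (tasks.exists(_.cachedOnExecutors.nonEmpty)) levels += "PROCESS_LOCAL"
        if (tasks.exists(_.hosts.nonEmpty))             levels += "NODE_LOCAL"
        if (tasks.exists(_.racks.nonEmpty))             levels += "RACK_LOCAL"
        levels += "ANY"
        levels.result()
      }
    }

For a task cached on an executor of node1 in rack1, validLevels returns all four levels; for a purely HDFS-backed task with no cached blocks, scheduling starts at NODE_LOCAL.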

4. The data locality flow in Spark

That is, the relationship between the node a task runs on and the location of its input data. Next we explore how Spark's scheduling system produces this placement. The process involves the RDD, the DAGScheduler, and the TaskScheduler; once you understand it, you basically understand Spark's preferredLocations (location-first) strategy.
 
    Step 1: PROCESS_LOCAL --> the TaskScheduler first sends each task to the node where its data lives.
If the task waits 3s in Executor1 of Worker1 (3s is Spark's default wait, set by spark.locality.wait and adjustable through SparkConf) and is retried 5 times but still cannot run,

    the TaskScheduler downgrades the data locality level from PROCESS_LOCAL to NODE_LOCAL.
 
    Step 2: NODE_LOCAL --> the TaskScheduler resends the task to Executor2 on Worker1.
If the task waits 3s in Executor2 of Worker1 and is retried 5 times but still cannot run,

    the TaskScheduler downgrades the data locality level again, from NODE_LOCAL to RACK_LOCAL.
 
    Step 3: RACK_LOCAL --> the TaskScheduler resends the task to Executor1 on Worker2.
 
   Step 4: Once task assignment is complete, the task fetches its data through the BlockManager inside the Executor on its worker. If the BlockManager finds that it does not hold the data locally, it calls getRemote(): its ConnectionManager first establishes a connection with the ConnectionManager of the BlockManager on the node where the data lives, the data is then pulled back over the network to the task's node through the TransferService (the network transfer component), at which point performance drops sharply and the heavy network I/O ties up resources, and finally the computed result is returned to the Driver.
 
Summary:
   When the TaskScheduler dispatches a task, it dispatches it to the node where the data lives, so the data locality level starts out at its highest. If the task waits three seconds in the Executor and still cannot run after 5 retries, the TaskScheduler concludes that this Executor's compute resources are exhausted, downgrades the data locality level, and resends the task to another Executor. If it still cannot run, the level is downgraded again, and so on.
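
The downgrade loop above can be sketched as a toy model, not Spark's real implementation: start at the best level and fall back one level whenever the configured wait expires without the task launching (canLaunchAt is a hypothetical stand-in for "a suitable executor made a resource offer"):

    object DelaySchedulingModel {
      val levels = Seq("PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY")

      // Walk the levels from best to worst, waiting up to waitMs at each one.
      def schedule(waitMs: Long, canLaunchAt: String => Boolean): String = {
        for (level <- levels) {
          val deadline = System.currentTimeMillis() + waitMs
          while (System.currentTimeMillis() < deadline) {
            if (canLaunchAt(level)) return level // launch at the best level available
            Thread.sleep(50)                     // poll for a suitable resource offer
          }
          // wait expired at this level: downgrade to the next one
        }
        "ANY" // last resort: run the task anywhere
      }
    }

With waitMs = 3000 and only rack-local resources free, schedule returns "RACK_LOCAL" after roughly six seconds of waiting, mirroring the PROCESS_LOCAL --> NODE_LOCAL --> RACK_LOCAL walk in the steps above.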
 
     If you want every task to get the best data locality level, the tuning knob is the wait time: make it longer. But note! If you increase the wait too much, the overall execution time of the job may be prolonged even though every task gets its best locality level.
  1. spark.locality.wait 3s // effectively the global setting; the per-level settings below default to it, and a manually set value takes precedence
  2. spark.locality.wait.process
  3. spark.locality.wait.node
  4. spark.locality.wait.rack
  5. new SparkConf().set("spark.locality.wait", "100")
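
Putting the knobs together, a sketch of a SparkConf that lengthens the global wait and sets the per-level overrides (the values are illustrative, not recommendations; newer Spark versions accept time strings such as "6s", older ones a plain number of milliseconds):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("locality-tuning")
      .set("spark.locality.wait", "6s")         // global default for every level
      .set("spark.locality.wait.process", "6s") // wait before giving up PROCESS_LOCAL
      .set("spark.locality.wait.node", "3s")    // wait before giving up NODE_LOCAL
      .set("spark.locality.wait.rack", "1s")    // wait before giving up RACK_LOCAL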
