[Interview] [Spark] Advanced Big Data (2)

I. Question Outline

II. Spark Core Principles

1. Overall architecture of Spark?
 - Follow-up 1: What is the role of the SparkContext?

Spark memory model
……

2. Running process of a Spark program (*3)
 - Follow-up 1: Can you elaborate on resource allocation, i.e. job scheduling?
 - Follow-up 2: What does the DAGScheduler do? (*2)
 - Follow-up 3: Explain Job, Task, and Stage separately?
 - Follow-up 4: How is very large data sent to the Executor?
 - Follow-up 5: What is the difference between a job and tasks in Hadoop?

3. Wide and narrow dependencies (*4): the concepts
 - Follow-up 1: Which operators lead to wide dependencies, and which to narrow ones? (*3)

4. How does Spark divide Stages? (*2)
 - Follow-up 1: What is the basis for dividing Stages? (*2)
 - Follow-up 2: On what basis does Spark decide a Shuffle is needed?

5. What is the Spark shuffle process, and what is the concept of shuffle?

6. Spark running modes? (*3) Which of them are cluster modes?

7. Execution flow of a Spark on Yarn job? (How is a program executed in the cluster?)
 - Follow-up 1: What is the difference between Yarn-Client and Yarn-Cluster?
 - Follow-up 2: What if there is an aggregation operation?

8. What are broadcast variables in Spark, and what are they for?
 - Follow-up 1: What are shared variables and accumulators?

II. Spark Core Principles

1. The overall architecture of Spark?

Answer:

| Module | Role | Function |
| --- | --- | --- |
| Driver | Control node of each application | Runs the main() function of the Spark Application and creates the SparkContext. |
| SparkContext | Entry point of the application; interacts with the whole cluster | Communicates with the Cluster Manager for resource application, task allocation and monitoring; also creates RDDs. (*2) |
| Cluster Manager | Cluster resource manager | Applies for and manages the resources needed to run on the Worker Nodes. |
| Worker Node | Worker node | An Application can have multiple worker nodes, which run its job tasks. |
| Executor | Execution process on a Worker Node | Runs Tasks and is responsible for keeping data in memory or on disk. |
| Task | Unit of work | The unit of work inside an Executor process; multiple Tasks make up a Stage. |
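To make the table concrete, here is a minimal sketch (assuming a local master; the object name and data are illustrative) of how the Driver creates the SparkContext, which in turn creates RDDs whose Tasks run in Executors:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ArchitectureDemo {
  def main(args: Array[String]): Unit = {
    // The Driver runs main() and creates the SparkContext, which talks to the
    // Cluster Manager ("local[2]" simply embeds everything in one JVM for testing).
    val conf = new SparkConf().setAppName("ArchitectureDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // The SparkContext creates RDDs; the resulting Tasks run in Executor
    // processes on the Worker Nodes (here, local threads).
    val rdd = sc.parallelize(1 to 100, numSlices = 4)
    println(rdd.sum())  // 5050.0

    sc.stop()
  }
}
```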
2. Running process of a Spark program (*3)


  1. Build the runtime environment. The Driver creates a SparkContext, which handles resource application, task allocation and monitoring;
  2. Allocate resources. The resource manager (Standalone/Mesos/YARN) allocates resources and starts the Executor processes;
  3. Split Stages and dispatch Tasks. The SparkContext builds a DAG, decomposes it into Stages, and sends each TaskSet to the TaskScheduler. Executors request Tasks from the SparkContext, and the TaskScheduler dispatches the Tasks to the Executors to run;
  4. Run and release. Tasks run on the Executors and report their results back to the TaskScheduler and DAGScheduler; when everything has run, all resources are released.
Follow-up 1: Can you elaborate on resource allocation, i.e. job scheduling?
  • 1) The DAGScheduler analyzes the transformation dependencies between RDDs to obtain the DAG;
  • 2) Using the DAG, it divides the Job into multiple Stages;
  • 3) Each Stage generates a TaskSet and submits it to the TaskScheduler; scheduling control is then handed to the TaskScheduler, which is responsible for distributing the tasks to workers for execution.
Follow-up 2: What does the DAGScheduler do? (*2)

It builds a Stage-level DAG from the Job based on the dependencies between RDDs, and submits each Stage to the TaskScheduler.

Follow-up 3: Explain Job, Task, and Stage separately? (*2)

| Name | Characteristics |
| --- | --- |
| Job | A parallel computation composed of multiple Tasks, triggered by an Action operator |
| Stage | Each Job is divided into multiple groups of Tasks; each group is a TaskSet, called a Stage |
| Task | The unit of computation on an Executor; multiple Tasks make up a Stage |

1 App   = n Job
1 Job   = n Stage = n TaskSet
1 Stage = n Task

1 App = n1 Job = n1*n2 Stage = n1*n2*n3 Task
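As an illustration (reusing the `sc` from the sketch above; the input path is hypothetical), a word-count style program maps onto these terms as follows:

```scala
// Narrow transformations only: these all stay in the same Stage.
val words = sc.textFile("hdfs:///tmp/input")   // hypothetical path
  .flatMap(_.split(" "))
  .map(w => (w, 1))

// reduceByKey introduces a wide dependency, so a second Stage begins here.
val counts = words.reduceByKey(_ + _)

// The Action triggers exactly one Job: here 1 Job = 2 Stages,
// and each Stage contains one Task per partition.
counts.count()
```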


Follow-up 4: How is very large data sent to the Executor?

……

Follow-up 5: What is the difference between a job and tasks in Hadoop?

Answer: A Job is the abstract encapsulation of a complete MapReduce program, while a Task is a concrete instance of one processing phase of a running job, e.g. a map task or a reduce task; both map tasks and reduce tasks have multiple instances running concurrently.

3. The concepts of wide and narrow dependencies (*6)

Answer: Transformation operations form dependencies between RDDs. If each partition of the parent RDD is used by at most one partition of the child RDD, it is called a narrow dependency; if it is used by multiple child RDD partitions, it is called a wide dependency.

Note (n > 1):
Narrow dependency: n parent RDD partitions == 1 child RDD partition
Wide dependency:   1 parent RDD partition  == n child RDD partitions

  • Narrow dependencies come in two forms: 1) one-to-one; 2) many-to-one, such as union.

  • Wide dependency: each partition of the parent RDD may be used by multiple child RDD partitions, and a child RDD partition usually corresponds to all partitions of the parent RDD.

(Figure: wide vs. narrow dependencies)

Follow-up 1: Which operators lead to wide dependencies, and which to narrow ones? (*3)
- Narrow dependencies: map, filter, union
- Wide dependencies: groupByKey, sortByKey (these produce a Shuffle)
- It depends: if the partitioner is a HashPartitioner, then join and reduceByKey are narrow dependencies.

Note:
  - There are two default partitioners, HashPartitioner and RangePartitioner; when the partitioner is a HashPartitioner, the dependency is narrow.
Why: under a HashPartitioner the same key always maps to the same partition id, so a given key lands in the same partition before and after the computation. Each parent RDD partition is then depended on by only one child RDD partition, no data needs to move, and hence the dependency is narrow.
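A small sketch of the three cases above (the data is illustrative):

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Narrow dependencies: each parent partition feeds at most one child partition.
val mapped = pairs.mapValues(_ * 2)
val merged = pairs.union(mapped)

// Wide dependency: groupByKey redistributes records by key, i.e. a Shuffle.
val grouped = pairs.groupByKey()

// "It depends": once the parent is hash-partitioned, reduceByKey can reuse
// that partitioning, so the dependency becomes narrow (no extra Shuffle).
val prePartitioned = pairs.partitionBy(new HashPartitioner(4))
val reduced = prePartitioned.reduceByKey(_ + _)
```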
4. How does Spark divide Stages? (*2)

Work backwards from the last RDD of the job, and cut a Stage boundary whenever a wide dependency is encountered.


Follow-up 1: What is the basis for dividing Stages? (*2)

Answer: the dependencies between RDDs are used to find the scheduling approach with the least overhead.

Follow-up 2: On what basis does Spark decide a Shuffle is needed?

Answer: when the data in one partition of the parent RDD may be distributed across multiple partitions of the child RDD; in that case there will be Shuffle Write and Shuffle Read operations.
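One way to see where Spark actually cuts Stages is `toDebugString`, which prints an RDD's lineage; the indentation step in the output marks a shuffle boundary. A sketch reusing the hypothetical word-count `counts` from above (the printed lineage is illustrative):

```scala
println(counts.toDebugString)
// (4) ShuffledRDD[4] at reduceByKey ...      <- new Stage (Shuffle Read side)
//  +-(4) MapPartitionsRDD[3] at map ...      <- narrow chain, same Stage
//     |  MapPartitionsRDD[2] at flatMap ...
//     |  hdfs:///tmp/input HadoopRDD[0] ...
```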


5. What is the Spark shuffle process and the concept of shuffle?

MapReduce Shuffle (Hash Shuffle V1):
In MR, the Shuffle connects the Map and Reduce phases. It involves disk reads and writes as well as network transfers, and it affects the performance and throughput of the entire program.

Serious problems:
1. A large number of files are generated, consuming memory and making IO inefficient.
2. The Reduce Task's merge operation aggregates data in a HashMap, which can cause OOM.

Spark Shuffle (Hash Shuffle V2):
To address the first problem, the File Consolidation mechanism was introduced: all Map Tasks on an Executor share one file per reduce partition, i.e. the partition files that different Map Tasks produce for the same partition are merged. This way each Executor generates at most N partition files.

Problem: this reduces the number of files, but when the number of partitions N of the downstream Stage is large, N files are still generated on each Executor; likewise, with K cores on one Executor there are still K*N writer handlers, which can easily cause OOM.

Spark Sort Shuffle V1:

To solve the problems above, Spark drew on the Shuffle handling in MR and introduced a sort-based shuffle write mechanism.

Each task no longer creates a separate file for every downstream task; instead it writes all of its results into a single file. The file is sorted first by Partition Id, and within each Partition by Key, and is written sequentially while the Map Task runs. An index file is also generated, recording the size and offset of each Partition.

In the Reduce phase, when the Reduce Task pulls data for the Combine, it no longer uses a HashMap but an ExternalAppendOnlyMap. If memory is insufficient while combining, this data structure spills to disk, which largely guarantees robustness and avoids OOM in most situations.

Note: Sort Shuffle resolves the drawbacks of Hash Shuffle, but because the Shuffle process now has to sort records, some performance is lost.

Tungsten-Sort Based Shuffle / Unsafe Shuffle

Spark 1.5.0 started the Tungsten project to optimize memory and CPU usage. Because it uses off-heap memory based on the JDK's sun.misc.Unsafe API, it is also called Unsafe Shuffle.

The approach is to store data records in a binary format and to sort the serialized binary data directly rather than Java objects. On the one hand this reduces memory usage and GC overhead; on the other it avoids repeated serialization and deserialization during the Shuffle. The sort uses a cache-efficient sorter with 8-byte pointers, turning record sorting into pointer-array sorting, which greatly improves sorting performance.

Problem: Tungsten-Sort Based Shuffle cannot be used when there are aggregation operations, and the number of partitions must not exceed a certain limit. Aggregating operators such as reduceByKey therefore cannot use it and degrade to Sort Shuffle.

On the details of the SortShuffleWriter implementation, first consider a problem: suppose we need to sort 10 billion records, but we only have 1 MB of memory and a very large disk. We cannot load all the data for a one-shot in-memory sort, so this becomes an external sorting problem. Say the 1 MB of memory holds 100 million records at a time: we sort those records in memory, write them out to disk, and repeat, producing 100 sorted files in total. How do we finally merge these 100 files into one large, globally ordered file? Take part of the head of each (sorted) file as a buffer, and put these 100 buffers into a heap for heap sort, where the comparison is between the head elements of the buffers. Then keep popping the head element of the buffer at the top of the heap, appending it to the final output file, and continuing the heap sort. Whenever a buffer becomes empty, refill it with more data from its corresponding file. In the end you obtain one large, globally ordered file.
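The merge step described above is an ordinary k-way merge; a minimal sketch, assuming each spilled "file" is already a sorted iterator of Ints:

```scala
import scala.collection.mutable

// Merge k sorted iterators into one globally sorted iterator using a heap
// keyed on each iterator's current head element (the buffer heads above).
def mergeSorted(files: Seq[Iterator[Int]]): Iterator[Int] = new Iterator[Int] {
  // Max-heap on the negated head => effectively a min-heap on the head.
  private val heap = mutable.PriorityQueue.empty[BufferedIterator[Int]](
    Ordering.by(it => -it.head))
  files.map(_.buffered).filter(_.hasNext).foreach(heap.enqueue(_))

  def hasNext: Boolean = heap.nonEmpty
  def next(): Int = {
    val it = heap.dequeue()
    val v  = it.next()                 // pop the smallest head element
    if (it.hasNext) heap.enqueue(it)   // "refill": the iterator advances itself
    v
  }
}

// mergeSorted(Seq(Iterator(1, 4), Iterator(2, 3))).toList == List(1, 2, 3, 4)
```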

How aggregation works:

Shuffle Write: each map task of the previous stage must ensure that, within the partition it is processing, all records with the same key are written to the same partition file; one map task may write to several different partition files.

Shuffle Read: each reduce task fetches its own partition files from the machines where all the tasks of the previous stage ran, which guarantees that all the values for a given key are gathered on the same node for processing and aggregation.

There are two kinds of Shuffle managers in Spark, HashShuffleManager and SortShuffleManager. Before Spark 1.2 it was HashShuffleManager; Spark 1.2 introduced SortShuffleManager; and HashShuffleManager was removed in Spark 2.0+.
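For reference, on those older 1.x versions the implementation could be chosen via configuration; a sketch (both settings are gone in Spark 2.0+, where sort is the only shuffle):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.manager", "sort")           // "hash" / "sort" / "tungsten-sort" on 1.x
  .set("spark.shuffle.consolidateFiles", "true")  // Hash Shuffle V2's File Consolidation
```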

Sort Shuffle V2

Spark 1.6.0 unified Sort Shuffle and Tungsten-Sort Based Shuffle into Sort Shuffle: if the conditions for Tungsten-Sort Based Shuffle are detected to be met, it is used automatically; otherwise Sort Shuffle is used.
Spark 2.0.0 removed Hash Shuffle, so Spark 2.0 currently has only one Shuffle implementation, namely Sort Shuffle.

6. Spark running modes? (*3) Which of them are cluster modes?

  • 1) Local: runs on a single machine, generally used for development and testing;
  • 2) Standalone: builds a Spark cluster of Master+Slave nodes, and Spark runs in that cluster;
  • 3) Spark on Yarn: the Spark client connects directly to Yarn; no separate Spark cluster needs to be built;
  • 4) Spark on Mesos: the Spark client connects directly to Mesos; no additional Spark cluster needs to be built.

Among them, Standalone, Spark on Yarn, and Spark on Mesos are cluster modes.
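The mode is selected through the master URL, which is usually supplied via spark-submit's --master flag rather than hard-coded; a sketch with illustrative host names:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("demo")
  .setMaster("local[4]")              // 1) Local: everything in one JVM, dev/testing
// .setMaster("spark://master:7077")  // 2) Standalone: Spark's own Master+Slave cluster
// .setMaster("yarn")                 // 3) Spark on Yarn
// .setMaster("mesos://master:5050")  // 4) Spark on Mesos
```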

7. What is the execution process of Spark on Yarn job? (How to execute the program in the cluster?)

7.1 Introduction

Yarn is responsible for unified resource management and can run multiple computing frameworks, such as MapReduce and Storm. For historical and performance reasons, Spark developed the Spark on Yarn running mode. Yarn's flexible resource management makes Application deployment more convenient, and Application resources are fully isolated.

Spark on Yarn is divided into Yarn-Client mode and Yarn-Cluster mode (also called Yarn-Standalone mode), depending on where the Driver runs in the cluster.

7.2 Yarn-Client

In this mode the Driver runs locally on the client, and the Spark Application interacts with the client. Because the Driver sits on the client, its status can be watched through the Web UI, by default at http://xxx:4040, while Yarn itself is accessed via http://xxx:8088.

Workflow:

  1. The Spark Yarn Client applies to the ResourceManager to start the ApplicationMaster. At the same time, the DAGScheduler and TaskScheduler are created during SparkContext initialization. (In Yarn-Client mode, the program selects YarnClientClusterScheduler and YarnClientSchedulerBackend.)

  2. After receiving the request, the ResourceManager selects a NodeManager, assigns the first Container to the application, and starts the ApplicationMaster in it. (The difference vs. YARN-Cluster: this ApplicationMaster does not run a SparkContext; it only communicates with the SparkContext for resource allocation.)

  3. After the SparkContext in the Client finishes initializing, it establishes communication with the ApplicationMaster, registers with the ResourceManager, and applies for resources (Containers) based on the task information;

  4. Once the ApplicationMaster has obtained the resources (Containers), it communicates with the corresponding NodeManagers and asks them to start a CoarseGrainedExecutorBackend in each obtained Container. After starting, each CoarseGrainedExecutorBackend registers with the SparkContext in the Client and applies for Tasks;

  5. The SparkContext in the Client assigns Tasks to the CoarseGrainedExecutorBackends for execution; each CoarseGrainedExecutorBackend runs its Tasks and reports their status to the Driver, so that the Client can track progress and restart a task when it fails;

  6. After the application finishes running, the SparkContext of the Client applies to the ResourceManager to deregister and shuts itself down.

7.3 Yarn-Cluster

In this mode, YARN will run the application in two stages:

  • 1. First start Spark's Driver as an ApplicationMaster in the YARN cluster;
  • 2. The ApplicationMaster creates the application, applies to the ResourceManager for resources for it, then starts Executors to run Tasks and monitors the whole run until completion.

Workflow:

  1. The Spark Yarn Client submits the application to YARN, including the ApplicationMaster program, the command to start it, the program to run in the Executors, and so on;

  2. After receiving the request, the ResourceManager selects a NodeManager in the cluster, assigns the first Container, and asks it to start the application's ApplicationMaster in this Container; this ApplicationMaster performs the SparkContext initialization, among other things;

  3. The ApplicationMaster registers with the ResourceManager, so that users can view the application's status directly through the ResourceManager; it then applies for resources for the tasks via the RPC protocol in a polling fashion, and monitors their status until the run ends;

  4. Once the ApplicationMaster has obtained the resources (Containers), it communicates with the corresponding NodeManagers, asking them to start a CoarseGrainedExecutorBackend in each Container; after starting, each one registers with the SparkContext in the ApplicationMaster and applies for Tasks.
     (Same as Standalone mode, except that when the SparkContext is initialized in the Spark Application, CoarseGrainedSchedulerBackend is used together with YarnClusterScheduler for task scheduling; YarnClusterScheduler is just a thin wrapper around TaskSchedulerImpl that adds waiting logic for Executors, etc.)

  5. The SparkContext in the ApplicationMaster assigns Tasks to the CoarseGrainedExecutorBackends for execution. Each CoarseGrainedExecutorBackend runs its Tasks and reports status and progress to the ApplicationMaster, so that the ApplicationMaster knows the state of each task at all times and can restart a task when it fails;

  6. After the application finishes running, the ApplicationMaster applies to the ResourceManager to deregister and shuts itself down.

Follow-up 1: What is the difference between Yarn-Client and Yarn-Cluster?

About the ApplicationMaster: in YARN, every Application instance has an ApplicationMaster process, which is the first Container started for the Application. It is responsible for interacting with the ResourceManager to request resources, and after obtaining them it tells the NodeManagers to start Containers for the application.

YARN-Cluster vs. YARN-Client: the difference lies in the ApplicationMaster process:

  • 1) In YARN-Client mode, the ApplicationMaster only requests Executors from YARN; the Client communicates with the requested Containers to schedule their work, which means the Client cannot leave;
  • 2) In YARN-Cluster mode, the Driver runs inside the AM (ApplicationMaster), which is responsible for applying to YARN for resources and supervising the job. After the user submits the job, the Client can be shut down and the job keeps running on YARN, so YARN-Cluster mode is not suitable for interactive jobs.
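In configuration terms the two differ only in the deploy mode, normally passed as spark-submit's --deploy-mode flag; a sketch using the equivalent config key:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("yarn")
  .set("spark.submit.deployMode", "client")   // Driver stays on the submitting machine
// .set("spark.submit.deployMode", "cluster") // Driver runs inside the YARN ApplicationMaster
```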
Follow-up 2: What if there is an aggregation operation?

……

8. What are broadcast variables in Spark, and what are they for?

Broadcast variables cache shared data or large variables on the nodes of the Spark cluster instead of shipping a copy with every task; subsequent computations can reuse them, which reduces network transfer and improves performance.

Additional notes:
1. Broadcast variables are read-only, which guarantees data consistency.
2. Compared with Hadoop's distributed cache, broadcast variables can be shared across jobs.
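A minimal sketch (the lookup table contents are illustrative): the table is shipped once per Executor rather than once per Task:

```scala
// Read-only lookup table, broadcast to every Executor once.
val bLookup = sc.broadcast(Map("cn" -> "China", "us" -> "United States"))

val codes = sc.parallelize(Seq("cn", "us", "cn"))
val names = codes.map(c => bLookup.value.getOrElse(c, "unknown"))
println(names.collect().mkString(", "))  // China, United States, China
```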
Follow-up 1: What is an accumulator?

An accumulator performs a global aggregation and provides distributed counting.

Note: only the Driver can read the accumulator's value; on the Task side it can only be added to.
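A minimal counting sketch using the Spark 2.x accumulator API (the record values are illustrative):

```scala
// Tasks may only add() on the Executor side; the merged value
// is readable only on the Driver.
val errorCount = sc.longAccumulator("errors")

sc.parallelize(Seq("ok", "err", "ok", "err")).foreach { rec =>
  if (rec == "err") errorCount.add(1)  // runs on Executors
}
println(errorCount.value)  // 2, read on the Driver
```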

