The latest compilation of frequently asked Spark interview knowledge points

Question guide:

1. What are the characteristics of RDD?
2. What is the difference between Map and MapPartitions?
3. Why does a Spark application start executing a job before it has obtained enough resources, and what problems can this cause?


Five features of RDD

1. A list of partitions

An RDD is a list of partitions (each partition resides on a node). When data is loaded as an RDD, data locality is generally respected (typically, one block in HDFS is loaded as one partition).

2. A function for computing each split

Each partition of the RDD has a function applied to it; applying this function is what implements the transformation from one RDD's partitions to the next.

3. A list of dependencies on other RDDs

An RDD records its dependencies on other RDDs. Dependencies are divided into wide and narrow dependencies, though not every RDD has them. They exist for fault tolerance (recomputation, cache, checkpoint): if a partition held in memory fails or is lost, it can be recomputed from its parents.

4. Optionally, a Partitioner for key-value RDDs

This one is optional. If the data stored in the RDD is in key-value form, a custom Partitioner can be supplied for repartitioning, for example one that partitions by key, so that records with the same key, even if they come from different parent partitions, end up in the same partition (see the sketch after this list).

5. Optionally, a list of preferred locations to compute each split on

Each split has a list of preferred locations where it is best computed, i.e. data locality.
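
As a minimal sketch of point 4 (assuming a SparkContext named sc is available; the class and variable names are made up for illustration), a custom Partitioner that partitions by key could look like this:

import org.apache.spark.Partitioner

// Hypothetical partitioner: routes each record to a partition by the hash of its key,
// so records with the same key always land in the same partition.
class KeyHashPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod   // keep the result non-negative
  }
}

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val repartitioned = pairs.partitionBy(new KeyHashPartitioner(4))   // key-value RDDs only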

The difference between Map and MapPartitions

map operates on each element of the RDD; mapPartitions operates on the iterator of each partition of the RDD.

Advantages of MapPartitions:

With an ordinary map, if a partition holds 10,000 records, the function is invoked and evaluated 10,000 times.
With mapPartitions, a task invokes the function only once per partition, and the function receives the whole partition's data (as an iterator) at once, so performance is relatively high. If extra objects must be created repeatedly inside map (for example, when writing RDD data to a database via JDBC, map would create a connection per element while mapPartitions creates one connection per partition), mapPartitions is far more efficient than map.
Spark SQL / DataFrame programs are optimized toward mapPartitions by default.
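
A hedged sketch of the JDBC pattern described above; createConnection and lookup are hypothetical helpers standing in for real JDBC calls, and the point is only where the connection is created:

// Per-element version: one (expensive) connection per record.
val out1 = rdd.map { x =>
  val conn = createConnection()        // hypothetical helper
  try lookup(conn, x) finally conn.close()
}

// Per-partition version: one connection per partition, reused for every record.
val out2 = rdd.mapPartitions { iter =>
  val conn = createConnection()        // hypothetical helper
  val results = iter.map(x => lookup(conn, x)).toList   // materialize before closing the connection
  conn.close()
  results.iterator
}

Note that the .toList in the second version also illustrates the memory caveat discussed in the disadvantages below: the whole partition's results live in memory at once.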

Disadvantages of MapPartitions:

With a normal map operation, one record is processed per function invocation. If memory runs short, say after 1,000 records have been processed, those already-processed records can be reclaimed by garbage collection, or room can be made in other ways.
So ordinary map operations usually do not cause OOM exceptions.
With mapPartitions, however, for a large amount of data, say a partition with 1,000,000 records, the whole partition is passed to a single function call; memory may become insufficient all at once, and since there is no way to free that space, an OOM (out-of-memory) error may occur.

Wide dependency, narrow dependency

Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD (1-to-1 or many-to-1).
Wide dependency: each partition of the parent RDD may be used by multiple partitions of the child RDD (1-to-many).
Some common wide and narrow dependencies:
Narrow dependencies: map, filter, union, mapPartitions, join (when the partitioner is a HashPartitioner)
Wide dependencies: sortByKey, join (when the partitioner is not a HashPartitioner)
Is reduceByKey a wide dependency or a narrow dependency? (TODO; see the sketch after the links below)
https://www.cnblogs.com/upupfeng/p/12344963.html
https://github.com/rohgar/scala- ... Narrow-Dependencies
https://blog.csdn.net/qq_34993631/article/details/88890669
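
One way to probe the TODO above empirically in spark-shell (a sketch): reduceByKey usually introduces a ShuffleDependency (wide), but if the parent RDD is already partitioned with the same partitioner it becomes a narrow, one-to-one dependency.

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Typical case: prints a ShuffleDependency (wide).
println(pairs.reduceByKey(_ + _).dependencies.head)

// Parent already hash-partitioned with the same partitioner: prints a OneToOneDependency (narrow).
val prePartitioned = pairs.partitionBy(new HashPartitioner(4))
println(prePartitioned.reduceByKey(_ + _).dependencies.head)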

What determines the number of partitions

https://blog.csdn.net/thriving_fcl/article/details/78072479

spark parallelism

spark.sql.shuffle.partitions (default 200) configures the number of partitions used when shuffling data for joins or aggregations in Spark SQL / DataFrames.
spark.default.parallelism is the default number of partitions in RDDs returned by transformations such as join, reduceByKey, and parallelize when not set explicitly by the user. Note that spark.default.parallelism only applies to raw RDDs and is ignored when working with DataFrames; it is not used by Spark SQL.
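
A minimal sketch of setting both knobs explicitly (the values are illustrative, not recommendations):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parallelism-demo")
  .config("spark.default.parallelism", "100")     // affects raw RDD operations
  .config("spark.sql.shuffle.partitions", "200")  // affects DataFrame / Spark SQL shuffles
  .getOrCreate()
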
https://www.jianshu.com/p/e721f002136c

Spark shared variables

In application development, when a function is passed to a Spark operation (such as map or reduce) and run on a remote cluster node, it actually operates on independent copies of all the variables it uses. These variables are copied to every machine, and in general, reading and writing shared variables across tasks would not be efficient. Nevertheless, Spark provides two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.

(1). Broadcast variables
-Broadcast variables are cached in the memory of each node, rather than being shipped with every Task
-After a broadcast variable is created, it can be used in any function that runs on the cluster
-Broadcast variables are read-only and cannot be modified after being broadcast
-For broadcasting large data sets, Spark tries to use efficient broadcast algorithms to reduce communication cost
val broadcastVar = sc.broadcast(Array(1, 2, 3)) // the method parameter is the variable to broadcast

(2). Accumulators
Accumulators only support an add operation, so they can be parallelized efficiently; they are used to implement counters and sums. Spark natively supports numeric accumulators, and users can add support for new types. Only the driver can read an accumulator's value.
https://www.jianshu.com/p/aeec7d8bc8c4
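
A short usage sketch combining the two (assuming a SparkContext named sc; the names are made up for illustration):

val broadcastVar = sc.broadcast(Array(1, 2, 3))   // read-only copy cached on each node
val matched = sc.longAccumulator("matched")       // add-only counter, readable on the driver

sc.parallelize(1 to 10).foreach { x =>
  // Executors read the broadcast value and add to the accumulator.
  if (broadcastVar.value.contains(x)) matched.add(1)
}
println(matched.value)   // 3, read back on the driver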

spark data skew

https://www.jianshu.com/p/e721f002136c

spark shuffle

https://www.jianshu.com/p/a3bb3001abae
https://www.jianshu.com/p/98a1d67bc226
hash shuffle (obsolete)
sort shuffle (the current default shuffle mechanism, which includes the bypass merge-sort path)
unsafe shuffle, also known as tungsten sort



shuffle tuning:

https://www.cnblogs.com/haozheng... b094f36b72c7d3.html

shuffle configuration details:

https://blog.csdn.net/lds_include/article/details/89197291

The difference between spark and hadoop shuffle (emphasis)

https://www.jianshu.com/p/58c7b7f3efbe

spark performance optimization

Basics: https://tech.meituan.com/2016/04/29/spark-tuning-basic.html
Advanced: https://tech.meituan.com/2016/05/12/spark-tuning-pro.html

Why does the Spark application start executing the job before it has obtained enough resources, and what problems may this cause?

If cluster resources are insufficient when the job executes, the job may start running tasks as soon as part of the Executors have been allocated, and may even finish without ever having obtained the full set of requested resources. The reason is that task scheduling and Executor resource allocation are asynchronous. If you want all resources to be acquired before the job executes, set spark.scheduler.maxRegisteredResourcesWaitingTime to a large value and spark.scheduler.minRegisteredResourcesRatio to 1.0; but weigh this against the actual situation, otherwise resources may fail to be allocated for a long time and the job may never run.
The minimum ratio of registered resources (registered resources / total expected resources) (resources are executors in YARN mode, CPU cores in standalone mode and Mesos coarse-grained mode ['spark.cores.max' value is the total expected resources for Mesos coarse-grained mode]) to wait for before scheduling begins. Specified as a double between 0.0 and 1.0. Regardless of whether the minimum ratio of resources has been reached, the maximum amount of time it will wait before scheduling begins is controlled by the config spark.scheduler.maxRegisteredResourcesWaitingTime.
Default: 0.8 for YARN mode; 0.0 for standalone mode and Mesos coarse-grained mode
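
A hedged sketch of setting these two options when building the context (the values are illustrative only):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("wait-for-resources-demo")
  // Wait until all requested resources have registered before scheduling tasks...
  .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")
  // ...but give up waiting after this timeout.
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "120s")
val sc = new SparkContext(conf)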

spark Lineage

https://blog.csdn.net/m0_37914799/article/details/85009466

spark coarse-grained and fine-grained modes

https://www.jianshu.com/p/d6df42c10a5c

spark checkpoint

https://www.jianshu.com/p/259081b0083a

spark component
 

  • Driver Program: the application's driver program, which runs the application's main function and creates the SparkContext, preparing the application's runtime environment;
  • SparkContext: used in Spark to communicate with the ClusterManager, apply for resources, and allocate and monitor tasks;
  • ClusterManager: the node responsible for managing and allocating cluster resources; it monitors the WorkerNodes via heartbeats. In Standalone mode it is the Master, and in YARN mode it is the ResourceManager;
  • WorkerNode: a slave node in the cluster, responsible for computation and control; it can start an Executor or the Driver;
  • Executor: the application's executor process; the ClusterManager allocates a process for each Executor of the application, which executes TaskSets using a thread pool;
  • Task: the basic unit of work executed on an Executor; multiple Tasks make up a Stage;
  • TaskSet: a set of related Tasks with no shuffle dependencies among them.








Spark job process

1. The application program starts the Driver program and creates a SparkContext;
2. The SparkContext applies to the resource manager for resources to run Executors; the Executor nodes start StandaloneExecutorBackend, and the Executors periodically report resource usage to the resource manager;
3. The Executors apply to the SparkContext for Tasks, and the SparkContext sends the application code to the Executors;
4. The SparkContext builds the RDD objects into a DAG (directed acyclic graph) and hands it to the DAGScheduler;
5. The DAGScheduler splits the DAG into multiple Stages, each consisting of multiple Tasks, and sends the TaskSets to the TaskScheduler;
6. The TaskScheduler submits the Tasks in each TaskSet to the Executors, which run them using thread pools; all resources are released after the Tasks finish running;

https://www.jianshu.com/p/3e1abd37aadd

sparkContext

https://www.cnblogs.com/xia520pi/p/8609602.html

memory system

http://arganzheng.life/spark-executor-memory-management.html

spark block

https://blog.csdn.net/imgxr/article/details/80129296

data locality

https://www.cnblogs.com/cc11001100/p/10301716.html

spark sql three kinds of Join

https://blog.csdn.net/wlk_328909605/article/details/82933552
https://blog.csdn.net/aa5305123/article/details/83037838

sparksql execution plan

https://www.cnblogs.com/johnny666888/p/12343338.html

The difference between RDD, DataFrame, and Dataset

A DataFrame only knows the field names, not the field types, so operations on it cannot be type-checked at compile time; for example, you can subtract something from a String and the error is only reported at execution time. A Dataset knows both the field names and the field types, so it provides stricter compile-time error checking. The relationship is like the analogy between JSON objects and class objects.
https://www.pianshen.com/article/273498711/
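
A small sketch of that compile-time difference (assuming a SparkSession named spark in spark-shell; the Person case class is made up for illustration):

import spark.implicits._

case class Person(name: String, age: Int)
val ds = Seq(Person("a", 1), Person("b", 2)).toDS()   // Dataset[Person]: fields and types are known
val df = ds.toDF()                                    // DataFrame (Dataset[Row]): only column names are known

ds.map(p => p.age + 1)        // checked at compile time; p.age is an Int
df.select($"age" + 1)         // column name and type are only checked when the query is analyzed
// df.select($"aeg" + 1)      // the typo still compiles, and fails only at runtime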

spark codegen

https://zhuanlan.zhihu.com/p/92725597

Source: https://blog.csdn.net/ytp552200ytp/article/details/108658409