Spark Usage Summary: Introduction to Big Data Basics

1. Number of partitions

Spark's input may be stored on HDFS as multiple files, and each file is made up of many blocks (HDFS Blocks).

When Spark reads these files as input, it parses them according to the InputFormat that corresponds to the data format. Generally, several blocks are merged into one input split, called an InputSplit; note that an InputSplit cannot span files.
Tasks are then generated for these input splits, with a one-to-one correspondence between InputSplits and Tasks.
Each of these Tasks is assigned to an Executor on some node of the cluster for execution.
Each node can start one or more Executors.
Each Executor consists of several cores, and each core of an Executor can execute only one Task at a time.
The result of executing a Task is one partition of the target RDD.
Note: the core here is a virtual core rather than a physical CPU core of the machine; it can be understood as a worker thread of the Executor.

The number of Tasks that can execute concurrently = the number of Executors * the number of cores per Executor. For example, 5 Executors with 4 cores each can run at most 20 Tasks at the same time.
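
A minimal sketch of where these two numbers come from (the configuration keys are the standard Spark ones; the concrete values are illustrative assumptions):

    import org.apache.spark.SparkConf

    // Illustrative values; in practice these are set per job via configuration or spark-submit.
    val conf = new SparkConf()
      .setAppName("concurrency-sketch")
      .set("spark.executor.instances", "5")   // number of Executors (assumed value)
      .set("spark.executor.cores", "4")       // cores per Executor (assumed value)

    // Maximum number of Tasks running at the same time = Executors * cores per Executor.
    val maxConcurrentTasks =
      conf.get("spark.executor.instances").toInt * conf.get("spark.executor.cores").toInt   // 5 * 4 = 20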

As for the number of partitions:

In the data-reading phase, for example sc.textFile, as many InputSplits as the input data is divided into, that many initial Tasks (and hence initial partitions) are created.
The number of partitions remains unchanged during the Map phase.
In the Reduce phase, aggregating an RDD triggers a shuffle, and the number of partitions of the resulting RDD depends on the specific operation: for example, repartition produces exactly the specified number of partitions, and many shuffle operators accept a partition count as an argument (see the sketch below).
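
A rough sketch of how the partition count evolves across these phases (the input path, local master, and partition numbers are assumptions for illustration):

    import org.apache.spark.{SparkConf, SparkContext}

    // Local master and input path are placeholders for this sketch.
    val sc = new SparkContext(new SparkConf().setAppName("partition-count-sketch").setMaster("local[4]"))

    // Read phase: one partition per InputSplit (8 here is a requested minimum).
    val lines = sc.textFile("hdfs:///data/input", 8)
    println(lines.getNumPartitions)                  // >= 8, one per InputSplit

    // Map phase: narrow transformations keep the partition count unchanged.
    val words = lines.flatMap(_.split(" ")).map(w => (w, 1))
    println(words.getNumPartitions)                  // same as lines

    // Reduce phase: shuffle operators can take an explicit partition count.
    val counts = words.reduceByKey(_ + _, 4)
    println(counts.getNumPartitions)                 // 4

    // repartition shuffles into exactly the requested number of partitions.
    println(counts.repartition(2).getNumPartitions)  // 2
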
2. Comparison of Spark deployment modes

This post compares the three deployment modes (Standalone, YARN, and Mesos); the summary is as follows:

Mesos seems to be a better choice for Spark, and it is officially recommended;
but if you run Hadoop and Spark at the same time, YARN seems to be the better choice in terms of compatibility, since it is native to Hadoop, and Spark on YARN also works well.
If you run not only Hadoop and Spark but also, say, Docker workloads on the same resource manager, Mesos seems to be more general.
Standalone seems more suitable for a small-scale computing cluster.
For the difference between client and cluster mode on YARN:
Before looking at the deeper differences between YARN-Client and YARN-Cluster, one concept needs to be clear: the Application Master. In YARN, every Application instance has an ApplicationMaster process, which is the first container started for the Application. It is responsible for dealing with the ResourceManager to request resources and, after obtaining them, telling the NodeManagers to start Containers for it. At a deeper level, the difference between YARN-Cluster and YARN-Client mode is really a difference in the ApplicationMaster process.
In YARN-Cluster mode, the Driver runs inside the AM (Application Master), which is responsible both for requesting resources from YARN and for supervising the running of the job. After the user submits the job, the Client can be closed and the job keeps running on YARN, so YARN-Cluster mode is not suitable for interactive jobs.
In YARN-Client mode, the Application Master only requests Executors from YARN, and the Client communicates with the requested Containers to schedule their work, which means the Client cannot go away.
(1) In Yarn-Cluster mode the Driver runs on one of the NodeManagers in the cluster, while in Yarn-Client mode it runs on the client machine that submitted the application;
(2) The Driver must communicate with the Executors, so Yarn-Cluster can close the Client after submitting the App, while Yarn-Client cannot;

(3) Yarn-Cluster is suitable for production environments, and Yarn-Client is suitable for interaction and debugging.
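
For reference, a minimal sketch of how the two modes are typically chosen at submission time (the application class, jar name, and resource numbers are placeholders):

    # Production: the Driver runs in the ApplicationMaster on a NodeManager; the client may exit.
    spark-submit --master yarn --deploy-mode cluster \
      --class com.example.MyApp \
      --num-executors 5 --executor-cores 4 \
      my-app.jar

    # Interactive / debugging: the Driver runs in the local client process, which must stay alive.
    spark-submit --master yarn --deploy-mode client \
      --class com.example.MyApp \
      my-app.jar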

3. Spark operating principle

A Spark application performs a series of transformations and finally triggers a job through an action. After the job is submitted, the flow is roughly as follows:

A SparkContext is built, and based on the RDD dependencies it constructs a DAG, which is submitted to the DAGScheduler for parsing.
The DAGScheduler parses the DAG in reverse, using shuffles as boundaries, dividing it into stages and computing the dependencies between the stages.
Each stage is submitted to the TaskScheduler in the form of a TaskSet; the TaskScheduler generates a TaskSetManager for it and finally dispatches the tasks to Executors for computation.
The Executor computes the tasks with multiple threads. When a task completes, its completion information is reported to the SchedulerBackend, which passes it on to the TaskScheduler.
The TaskScheduler feeds the information back to the TaskSetManager, removes the finished task, and executes the next one. At the same time the TaskScheduler inserts the completed result into the success queue, acknowledges it, and passes the success information to the TaskSetManager.
After all tasks have completed, the TaskSetManager feeds the results back to the DAGScheduler. If a task is a ResultTask, its result is handed to the JobListener; if not, the result is saved. When everything has finished running, the data is written out.
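
A small sketch of this flow from the user's side (input path and local master are placeholders): the lineage printed by toDebugString shows the shuffle boundary where the DAGScheduler will cut the stages, and the job only runs when the action is called.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("dag-sketch").setMaster("local[2]"))

    // Transformations only build up the DAG; nothing is executed yet.
    val counts = sc.textFile("hdfs:///data/input")   // input path is a placeholder
      .flatMap(_.split(" "))
      .map(w => (w, 1))
      .reduceByKey(_ + _)                            // shuffle => stage boundary for the DAGScheduler

    // The printed lineage shows the ShuffledRDD where the DAG will be cut into stages.
    println(counts.toDebugString)

    // Only the action triggers the job: DAG -> stages -> TaskSets -> TaskScheduler -> Executors.
    counts.collect()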
