Spark parallel computing

 

Aims:

  1. Explain how RDDs are distributed in Spark clusters.
  2. Analyze how Spark partitions file-based RDDs.
  3. Explain how Spark performs RDD operations in parallel.
  4. Explain how to control parallelism through partitioning.
  5. Analyze how to view and monitor tasks and stages.

First, let's take a look at how Spark works in cluster mode.

Spark cluster

The running process of a Spark program in cluster mode is as follows.

Users submit Spark jobs with spark-submit. Once a job is submitted, a SparkContext is started in the driver, and the application is handed over to the cluster manager (which can be YARN, Standalone, Mesos, local, Kubernetes, etc.). In my understanding, only the driver request triggered by the job is submitted here; the actual Spark program (the packaged code that describes how the data is processed, usually uploaded to a shared location such as HDFS or NFS) is not itself sent to the cluster master node in this step.

The cluster manager then allocates containers on the worker nodes for this driver program. Executors are launched inside those containers on the worker nodes and from then on communicate with the SparkContext in the driver program.
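As a minimal sketch of such an application (the application name, input path, and spark-submit options below are illustrative assumptions, not taken from the original post):

    # minimal_app.py -- a minimal PySpark application (illustrative sketch)
    from pyspark import SparkConf, SparkContext

    if __name__ == "__main__":
        # The SparkContext in the driver talks to the cluster manager,
        # which launches executors in containers on the worker nodes.
        conf = SparkConf().setAppName("minimal-example")
        sc = SparkContext(conf=conf)

        rdd = sc.textFile("hdfs:///data/mydata.txt")  # hypothetical input path
        print(rdd.count())  # an action, executed in parallel on the executors

        sc.stop()

    # Submitted to the cluster manager with spark-submit, for example:
    #   spark-submit --master yarn --deploy-mode cluster minimal_app.py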

RDDs in a Spark cluster

With RDDs (Resilient Distributed Datasets), Spark partitions the data across the worker nodes. You can also control the number of partitions that are created.
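As a small sketch of controlling partitioning for an in-memory RDD (the data and the partition count of 4 are made-up values, and sc is assumed to be an existing SparkContext):

    # Ask for 4 partitions explicitly when creating the RDD.
    rdd = sc.parallelize(range(100), 4)
    print(rdd.getNumPartitions())   # -> 4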

Let's try to understand how a single file is partitioned.

File partitioning: single file

Partitioning is based on the file size. You can also specify a minimum number of partitions when reading a text file, i.e. textFile(file, minPartitions). By default, a file read in a Spark cluster is split into at least 2 partitions. The more partitions, the higher the degree of parallelism.
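A small sketch of this, with a hypothetical file path and a made-up minimum partition count:

    # Read a single text file with an explicit minimum number of partitions.
    mydata = sc.textFile("hdfs:///user/me/mydata.txt", minPartitions=10)
    print(mydata.getNumPartitions())   # at least 10; possibly more, depending on file size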

File partitioning: multiple files

With the command sc.textFile('mydir/*'), each file becomes at least one partition. You can then perform file-based operations on each partition, such as parsing XML.

The next command is sc.wholeTextFiles("mydir"). It is intended for many small files and creates a key-value pair RDD (the key is the file name, the value is the file content).
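A sketch of both commands, assuming a hypothetical directory mydir that contains several small text files:

    # Each matching file contributes at least one partition of lines.
    many = sc.textFile("mydir/*")
    print(many.getNumPartitions())

    # Key-value pair RDD: (file name, full file content).
    pairs = sc.wholeTextFiles("mydir")
    print(pairs.keys().take(3))   # the first few file names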

Most RDD operations act on each element of the RDD; a smaller number act on each partition. Some operations that work on partitions are:

foreachPartition --- calls a function for each partition

mapPartitions --- creates a new RDD by executing a function on each partition of the current RDD

mapPartitionsWithIndex --- like mapPartitions, except that the function also receives the partition index

Note: the function passed to a partition operation receives an iterator over the partition's elements. For a better understanding, let's look at a few examples on RDDs.
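Here is a small sketch of mapPartitions and mapPartitionsWithIndex; the data and the helper functions are hypothetical, used only to show the iterator-based signatures:

    # 6 elements spread over 3 partitions (2 elements each).
    rdd = sc.parallelize(["a", "b", "c", "d", "e", "f"], 3)

    # mapPartitions: the function receives an iterator over one partition's
    # elements and must return an iterable.
    def count_in_partition(iterator):
        yield sum(1 for _ in iterator)

    print(rdd.mapPartitions(count_in_partition).collect())        # [2, 2, 2]

    # mapPartitionsWithIndex: the function also receives the partition index.
    def tag_with_index(index, iterator):
        return ((index, x) for x in iterator)

    print(rdd.mapPartitionsWithIndex(tag_with_index).collect())   # [(0, 'a'), (0, 'b'), (1, 'c'), ...]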

Now let's understand foreachPartition through an example.

foreachPartition

In the following example, we create a function printFirstLine that prints the first line of each partition.

Suppose we have created an RDD named myrdd. We pass printFirstLine to foreachPartition, which calls it once for every partition to print that partition's first line.
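The code of this example is not shown in the post; here is a minimal sketch of what it might look like, where the contents of myrdd are an assumption:

    # myrdd: 6 lines spread over 3 partitions (illustrative data).
    myrdd = sc.parallelize(["line1", "line2", "line3", "line4", "line5", "line6"], 3)

    def printFirstLine(iterator):
        # The function receives an iterator over one partition's elements;
        # print only the first element, if the partition is not empty.
        first = next(iterator, None)
        if first is not None:
            print(first)

    # foreachPartition is an action: it calls printFirstLine once per partition,
    # on the executors (so on a cluster the output appears in the executor logs).
    myrdd.foreachPartition(printFirstLine)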

Now that you understand the partition-level operations, the next section looks at the relationship between HDFS and data locality through an example.

HDFS and data locality

Imagine an HDFS cluster with multiple data nodes.

Now you can push the mydata file to HDFS via hdfs dfs -put mydata. Assume the file is stored on HDFS as three blocks.

After the data is saved to HDFS, you can write your Spark program. When the program starts, a SparkContext is obtained and executors are started on the data nodes. (Here we assume that the cluster manager used by the Spark program is YARN, which schedules the executors onto the corresponding data nodes.)

Using the sc.textFile command, you reference the mydata file from the executors. Since this is just a transformation, no data is loaded yet and the RDD is still empty.

When an action triggers execution, tasks on the executors load data from the HDFS blocks into partitions. The data is then processed in a distributed fashion across the executors until the action returns its result to the driver.
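A minimal sketch of this sequence, using a hypothetical HDFS path for the mydata file:

    # Transformations are lazy: nothing is read from HDFS yet.
    mydata = sc.textFile("hdfs:///user/me/mydata")
    lengths = mydata.map(lambda line: len(line))   # still lazy

    # The action triggers tasks on the executors; each task reads the HDFS block
    # that is (preferably) local to its node, and the result is sent to the driver.
    print(lengths.sum())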

Parallel operations on partitions

RDD operations are performed in parallel on each partition. Tasks are executed on the worker nodes where the data is stored.

Some operations preserve the partitioning, such as map, flatMap, and filter. Other operations repartition the data, such as reduceByKey, sortByKey, join, and groupByKey.
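A small sketch of this difference, with made-up data:

    # map keeps the partitioning; reduceByKey shuffles and may repartition.
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)], 4)

    mapped = pairs.map(lambda kv: (kv[0], kv[1] * 10))
    print(mapped.getNumPartitions())     # still 4: map preserves the partitioning

    reduced = pairs.reduceByKey(lambda x, y: x + y, 2)   # shuffle into 2 partitions
    print(reduced.getNumPartitions())    # 2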

Next, let's understand how operations are grouped into stages.

Operations in stages

Operations that can run on the same partition are executed together in a stage. Tasks within the same stage are pipelined together. Developers should be aware of how operations are grouped into stages in order to improve performance.

Here are some Spark terms:

Job: a set of tasks executed for one action (my personal understanding: the collection of all tasks serving a given purpose, which can be thought of as a DAG).

Stage: a set of tasks within a job that can be executed in parallel (my personal understanding: a set of tasks that each run on a single partition, which can be thought of as part of the optimized DAG).

Task: a single unit of work sent to an executor (my personal understanding: roughly a function applied to one partition).

Application: a collection of jobs managed by a single driver (which can be thought of as a single program script).

Next, let's see how Spark calculates stages.

How Spark calculates stages

Spark constructs a directed acyclic graph (DAG) of RDD dependencies. There are two types of dependencies:

Narrow dependency

A narrow dependency means that each partition of the child RDD depends on only one partition of the parent RDD. No shuffle between executors is needed, and the operations that create such RDDs can be grouped into a single stage. Examples: map, filter.

Wide dependency (shuffle dependency)

With a wide dependency, or shuffle dependency, many child RDD partitions depend on each partition of the parent RDD. A wide dependency marks the boundary of a new stage. Examples: reduceByKey, join, groupByKey. Next, let's look at how parallelism is controlled.

Wide-dependency operations such as reduceByKey repartition the resulting RDD. The more partitions, the more parallel tasks; if there are too few partitions, the Spark cluster will not be fully utilized.

The number of partitions can be controlled with the numPartitions parameter in the function call. You can view the Spark application UI at localhost:4040, where all Spark jobs can be monitored.
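A small sketch that ties these pieces together; the input path, the word-count logic, and the partition count of 8 are illustrative assumptions:

    # Narrow transformations stay in one stage; the wide reduceByKey starts a new one.
    words = sc.textFile("hdfs:///user/me/mydata")
    counts = (words.flatMap(lambda line: line.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b, numPartitions=8))

    print(counts.toDebugString().decode())   # shows the RDD lineage and the shuffle boundary
    counts.count()   # triggers the job; its stages and tasks appear in the UI at localhost:4040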

To sum up:

RDDs are stored in the memory of the Spark executors.

The data is partitioned across the executor JVMs.

Operations are performed in parallel on each RDD partition in the individual executors.

Operations that work on the same partitioning are pipelined together within a stage.

Operations that depend on multiple partitions are executed in separate stages.

 


Origin blog.csdn.net/wangyhwyh753/article/details/104040730