Spark Core: RDD, the core data abstraction and API [execution process, programming model: creation, conversion, output, running process]

1. The execution process

1. Read an external data source (or a collection in memory) to create an RDD;
2. Apply a series of "conversion" operations; each one generates a new RDD, which serves as input to the next conversion;
3. Apply an "action" operation to the last RDD and output the result with the specified data type and value.

Advantages: lazy evaluation, pipelining, and no need to store intermediate results.

  • RDD uses lazy evaluation: during execution, conversion operations do not perform any real computation; they only record the dependencies between RDDs. Only when an action operation is encountered is the real calculation triggered, and the final result is obtained based on the recorded dependencies.

An example of the RDD execution process is as follows:
Suppose two RDDs, A and C, are logically generated from the input, and after a series of "conversion operations" an RDD F is logically generated. The reason it is said to be logical is that no calculation has actually happened at this point: Spark has only recorded the dependencies between the RDDs. When F is to be output, an "action operation" is performed; Spark then generates a DAG from the RDD dependencies and starts the real calculation from the starting point.

  • Conversion operations do not perform any real computation; they only record the trajectory of the conversions
  • An action operation triggers the real computation from beginning to end and produces the result, as sketched below
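A minimal sketch of lazy evaluation as run in spark-shell (assuming the SparkContext sc is available; "data.txt" is a hypothetical input file):

// Conversions only record lineage; nothing is read or computed yet
val lines   = sc.textFile("data.txt")          // hypothetical input path
val lengths = lines.map(line => line.length)   // still no computation

// The action triggers the whole pipeline: read -> map -> reduce
val totalChars = lengths.reduce(_ + _)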

2. Programming model

In Spark, an RDD is represented as an object, and RDDs are converted through method calls on that object. After a series of transformations define an RDD, an action can be called to trigger its computation. Actions can return results to the application (count, collect, etc.) or save data to a storage system (saveAsTextFile, etc.). In Spark, RDD computation is deferred: it is executed only when an action is encountered, so that multiple transformations can be pipelined at runtime.

To use Spark, developers write a Driver program, which is submitted to the cluster and schedules the Workers. One or more RDDs are defined in the Driver and actions are called on them; the Workers carry out the RDD partition computation tasks.
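A minimal sketch of such a Driver program (the application name and master URL are placeholders; in a real deployment the master is usually supplied by spark-submit):

import org.apache.spark.{SparkConf, SparkContext}

object SimpleDriver {
  def main(args: Array[String]): Unit = {
    // The Driver creates the SparkContext, which schedules work on the Workers
    val conf = new SparkConf().setAppName("SimpleDriver").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // RDDs are defined in the Driver...
    val rdd = sc.parallelize(1 to 100)

    // ...and an action is called; the partition computations run on the Workers
    println(rdd.count())

    sc.stop()
  }
}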

2.1 RDD creation

There are three ways to create an RDD in Spark: from a collection, from external storage, or from another RDD.
1. Creating from a collection

To create an RDD from a collection, Spark mainly provides two functions: parallelize and makeRDD.

1) Use parallelize() to create from the collection

scala> val rdd=sc.parallelize(Array(1,2,3,4))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[199] at parallelize at <console>:26

2) Use makeRDD() to create from the collection

scala> val rdd1 = sc.makeRDD(Array(1,2,3,4,5,6,7,8))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[200] at makeRDD at <console>:26
scala> val rdd2 = sc.makeRDD(List(1,2,3,4,5,6,7,8))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[201] at makeRDD at <console>:26

  • makeRDD has one more overload: it distributes a sequence of local Scala collections to form an RDD, creating one partition per collection object, and it lets you specify preferred locations for each partition so that scheduling can be optimized at run time (see the sketch below).
  • The limitation of creating RDDs from local collections is that all of the data must already be gathered on one machine, so this approach is rarely used outside of testing and prototyping.
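A sketch of that overload (the hostnames below are hypothetical preferred locations):

// Each (element, preferred-locations) pair becomes its own partition
val rdd3 = sc.makeRDD(Seq(
  (List(1, 2, 3), Seq("host1")),   // partition 0, preferably scheduled on host1
  (List(4, 5, 6), Seq("host2"))    // partition 1, preferably scheduled on host2
))
rdd3.partitions.length             // 2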

2. Creating from a dataset in an external storage system

RDDs can be created from datasets in external storage systems, including the local file system and any data source supported by Hadoop, such as HDFS, Cassandra, HBase, etc.

val rdd1 = sc.textFile("path/to/file")
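For example, reading a local file versus a file on HDFS (the paths, the namenode address, and the minimum partition count below are illustrative):

val localRdd = sc.textFile("file:///tmp/input.txt")
val hdfsRdd  = sc.textFile("hdfs://namenode:9000/user/data/input.txt", 4)   // request at least 4 partitions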

3. Creating from another RDD

An RDD can also be created by converting an existing RDD. For the common conversion operators, see the blog post on operators.

For example:

val rdd2=rdd1.flatMap(_.split(" "))

2.2 RDD conversion
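
The individual conversion operators are covered in the operators post linked above. As a brief sketch, here are a couple of common conversions chained onto the rdd2 defined in 2.1 (conversions only build up lineage; no computation happens yet):

// map and reduceByKey are both conversions: each lazily returns a new RDD
val pairs  = rdd2.map(word => (word, 1))
val counts = pairs.reduceByKey(_ + _)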

2.3 RDD output

  • The output of an RDD is either saved to a file or returned as a result.
  • See the blog post on action operators for the common actions; a sketch of both kinds of output follows.
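A sketch of both kinds of output, continuing with the counts RDD from 2.2 (the output directory is a placeholder and must not already exist):

// Return results to the Driver program
counts.count()                      // number of distinct words
counts.collect().foreach(println)   // bring all (word, count) pairs to the Driver

// Or save the result to a storage system
counts.saveAsTextFile("output/word-counts")   // hypothetical output directory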

3. RDD operation process

Based on the earlier introduction to RDD concepts, dependencies, and stage division, combined with the basic Spark operation process introduced before, the running process of an RDD in the Spark architecture can be summarized as follows:

(1) Create RDD objects;
(2) SparkContext computes the dependencies between RDDs and builds the DAG;
(3) DAGScheduler decomposes the DAG into multiple stages, each stage containing multiple tasks; each task is distributed by the task scheduler to an Executor on a worker node for execution.
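Tying the three steps together on a small in-memory example (a sketch; the exact toDebugString output varies by Spark version):

// (1) Create the RDD object
val data = sc.parallelize(Seq("a b", "b c", "a c"))

// (2) Conversions only extend the lineage from which the DAG is built
val wordCounts = data.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
println(wordCounts.toDebugString)   // the shuffle from reduceByKey marks a stage boundary

// (3) The action triggers stage decomposition and task execution on the Executors
wordCounts.collect().foreach(println)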


Origin blog.csdn.net/weixin_45666566/article/details/112555766