Spark [Architecture, API]

1. Spark architecture

The Spark architecture is mainly composed of the following components:

  • Application: a user program built on Spark, consisting of the Driver code and the code that runs in the Executors on the cluster nodes
  • Driver program: the Driver, i.e. the process that runs the Application's main function and creates the SparkContext
  • Cluster Manager: an external service for acquiring resources on the cluster (Standalone, Mesos, YARN)
  • Worker Node: any node in the cluster that can run Application code
  • Executor: a process launched for an Application on a Worker node
  • Task: a unit of work sent to an Executor
  • Job: a parallel computation made up of multiple Tasks, triggered by a Spark action operator; one Application typically generates multiple Jobs
  • Stage: each Job is split into groups of Tasks; each group, also called a TaskSet, is a Stage
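
As a minimal sketch of how Jobs, Stages, and Tasks relate (assuming an existing SparkContext named sc, created as in section 2.1 below), the shuffle introduced by reduceByKey creates a Stage boundary, and the collect action triggers one Job made up of the Tasks of both Stages:

// A sketch only; assumes an existing SparkContext `sc` (see section 2.1)
val words = sc.parallelize(Seq("a", "b", "a", "c"), 2)   // RDD with 2 partitions
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // the shuffle splits the Job into two Stages
println(counts.toDebugString)                            // prints the lineage; indentation marks the shuffle boundary
counts.collect().foreach(println)                        // the action triggers one Job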

Runtime architecture
(Figure: Spark runtime architecture)

  • In the driver program, the SparkContext connects to different types of Cluster Managers (Standalone, YARN, Mesos) and drives the execution of the application; once connected, it obtains Executors on the cluster nodes.
  • A Worker node runs one Executor by default; this can be adjusted with SPARK_WORKER_INSTANCES
  • Each Application gets its own Executors
  • Each Task processes one RDD partition
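
As a minimal sketch of the last point (again assuming an existing SparkContext sc), an RDD created with 4 partitions is processed by 4 Tasks when an action runs:

// A sketch only; assumes an existing SparkContext `sc`
val data = sc.parallelize(1 to 100, 4)   // explicitly ask for 4 partitions
println(data.getNumPartitions)           // 4
println(data.count())                    // this action runs 4 Tasks, one per partition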

Run process

(Figures: Spark application run process)

2. Spark API

An API (Application Programming Interface) is a set of pre-defined functions that gives applications and developers access to a set of routines based on some piece of software or hardware, without having to read the source code or understand the details of its internal workings. In plain terms, code or compiled programs written by others and provided for your use are an API; if you use a function, class, or object from someone else's code (or program), you are using that API.

During development, the commonly used Spark APIs include SparkContext, SparkSession, RDD, Dataset, and DataFrame.

2.1 SparkContext

  • The main entry point of Spark
  • Connects the Driver to the Spark cluster (the Workers)
  • Only one active SparkContext may exist per JVM; stop() must be called on the existing SparkContext before a new one is created

Creating a SparkContext in IDEA; the code is as follows:

// Import packages
import org.apache.spark.{SparkConf, SparkContext}
// Create a SparkContext object
val conf = new SparkConf().setMaster("local[2]").setAppName("HelloSpark")
val sc = SparkContext.getOrCreate(conf)

  • SparkConf holds the configuration parameters for the Spark cluster. For simple applications, you only need to pass the following two parameters:

    • Cluster URL: tells Spark how to connect to the cluster. For example, "local" means run locally, "local[4]" means run locally with 4 cores, and "spark://master:7077" means run on a Spark standalone cluster
    • Application name: sets the name of the application shown in the Spark Web UI, which helps you find the application in the cluster manager's user interface.
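
As a sketch of passing these two parameters for a standalone cluster rather than local mode (the URL spark://master:7077 reuses the placeholder hostname from the example above):

// A sketch only; "master" below is a placeholder hostname
import org.apache.spark.{SparkConf, SparkContext}
val clusterConf = new SparkConf()
  .setMaster("spark://master:7077")     // cluster URL: a Spark standalone master
  .setAppName("HelloSparkOnCluster")    // application name shown in the Web UI
val clusterSc = SparkContext.getOrCreate(clusterConf)

When an application is launched with spark-submit, the same two values can instead be supplied with the --master and --name options.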

2.2 SparkSession

  • SparkSession is also an entry point into Spark. It is a new API introduced in version 2.0 that aims to provide a unified programming entry point for Spark. SparkSession integrates SparkConf, SparkContext, SQLContext, HiveContext, and StreamingContext.

  • After a SparkSession object has been created, the sparkContext and sqlContext objects can be obtained from it, so SparkSession is the recommended programming entry point from version 2.0 onward.

  • In Spark versions before 2.0, the Spark shell automatically creates a SparkContext object (sc)

  • In version 2.0+, the Spark shell additionally creates a SparkSession object (spark), as shown below:

(Figure: spark-shell startup output showing the SparkContext sc and SparkSession spark objects)

Creating a SparkSession in IDEA; the code is as follows:

// Import packages
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
// Create a SparkSession object
val spark = SparkSession.builder
  .master("local[2]")
  .appName("appName")
  .getOrCreate()
val sc: SparkContext = spark.sparkContext

2.3 RDD

  • The core of Spark and its main data abstraction: a Resilient Distributed Dataset is an immutable, partitioned collection of elements that can be operated on in parallel
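
A minimal RDD sketch, assuming the sc created in section 2.1: build an RDD from a local collection, apply a lazy transformation, and trigger it with an action.

// A sketch only; assumes the SparkContext `sc` from section 2.1
val nums = sc.parallelize(1 to 5)           // create an RDD from a local collection
val squares = nums.map(n => n * n)          // transformation (lazy)
println(squares.collect().mkString(", "))   // action: prints 1, 4, 9, 16, 25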

2.4 Dataset

  • A new abstraction introduced in Spark 1.6: a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations
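
A minimal Dataset sketch, assuming the spark session from section 2.2; the Person case class is just an illustrative type defined for this example.

// A sketch only; assumes the SparkSession `spark` from section 2.2
case class Person(name: String, age: Int)       // illustrative domain type
import spark.implicits._                        // provides the encoders and .toDS
val people = Seq(Person("Alice", 29), Person("Bob", 31)).toDS()
val adults = people.filter(p => p.age > 30)     // strongly typed: p is a Person
adults.show()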

2.5 DataFrame

  • A DataFrame is a special kind of Dataset: a Dataset organized into named columns, i.e. Dataset[Row]
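
A minimal DataFrame sketch, again assuming the spark session from section 2.2: the untyped, column-based counterpart of the Dataset example above.

// A sketch only; assumes the SparkSession `spark` from section 2.2
import spark.implicits._
val df = Seq(("Alice", 29), ("Bob", 31)).toDF("name", "age")   // rows with named columns
df.printSchema()                                               // name: string, age: integer
df.filter($"age" > 30).show()                                  // column-based (untyped) operations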

