Let's Talk About Spark

This interview-oriented post is presented in Q&A style.


Question 1: Briefly introduce the Spark architecture.

Answer 1:

Spark provides a comprehensive, unified framework for the big data processing needs of data sets and data sources with different properties (text data, graph data, etc.), whether batch data or real-time streaming data.

The core architecture of Spark is shown in the following figure:
[Figure: Spark core architecture]
Explanation of the important concepts in the above figure:

Spark Core (the second layer, labeled Apache Spark, in the figure above)
Spark Core contains the basic functionality of Spark, in particular the API that defines RDDs together with the operations and actions on them. The other Spark libraries are built on top of RDDs and Spark Core.

Note: RDD stands for Resilient Distributed Dataset.

Spark SQL
provides an API for interacting with Spark via HiveQL (the SQL dialect of Apache Hive). Each database table is treated as an RDD, and Spark SQL queries are converted into Spark operations.
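A minimal sketch of this idea, using the modern SparkSession entry point (which subsumes the SQLContext/HiveContext mentioned later in Question 3); the table name, column names, and sample rows are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; enableHiveSupport() would additionally
    // be needed to run HiveQL against real Hive tables.
    val spark = SparkSession.builder()
      .appName("spark-sql-sketch")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // Register a small in-memory dataset as a table, then query it with SQL;
    // the SQL is planned and executed as ordinary Spark operations.
    Seq(("alice", 34), ("bob", 29)).toDF("name", "age").createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```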

Spark Streaming
processes and controls real-time data streams. Spark Streaming allows programs to process real-time data like ordinary RDDs.
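A minimal sketch of a streaming job, assuming a text source on a local socket (localhost:9999) and one-second micro-batches; both are made-up parameters:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // At least two local threads: one to receive data, one to process it.
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Each one-second batch of the DStream is processed with ordinary RDD-style operations.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```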

MLlib (Machine Learning)
is a library of commonly used machine learning algorithms, implemented as Spark operations on RDDs. The library contains scalable learning algorithms, such as classification and regression, that require iterative passes over large data sets.
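A minimal sketch of the RDD-based MLlib API, training a classifier on a tiny hand-made data set (the feature values and labels are invented purely for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object MllibSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mllib-sketch").setMaster("local[*]"))

    // Tiny hand-made training set; real data would come from an RDD loaded from storage.
    val training = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
      LabeledPoint(1.0, Vectors.dense(1.0, 0.0))
    ))

    // The iterative training runs as Spark operations over the RDD of LabeledPoints.
    val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)
    println(model.predict(Vectors.dense(1.0, 0.0)))

    sc.stop()
  }
}
```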

GraphX
is a collection of algorithms and tools for manipulating graphs and for parallel graph operations and computations. GraphX extends the RDD API with operations for manipulating graphs, creating subgraphs, and accessing all vertices on a path.
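A minimal sketch of the GraphX API built on RDDs, using a made-up three-vertex graph to show subgraph extraction:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphxSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graphx-sketch").setMaster("local[*]"))

    // A property graph is built from two RDDs: one of vertices and one of edges.
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
    val graph = Graph(vertices, edges)

    // subgraph keeps only the vertices (and their edges) that satisfy the predicate.
    val sub = graph.subgraph(vpred = (id, name) => name != "carol")
    println(sub.vertices.count())

    sc.stop()
  }
}
```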


Question 2: Briefly introduce the core components of Spark.

Answer 2:

The interaction between Spark components is shown below:

[Figure: Spark component interaction]
Explanation of the important concepts in the above figure:

Cluster Manager
is the Master node in standalone mode, controlling the whole cluster and monitoring the Workers. In YARN mode this role is played by the resource manager.

Worker Node
is a slave node, responsible for controlling its compute node and starting an Executor or the Driver.

Driver
runs the main() function of the Application.

Executor
is a process started on a Worker Node for an Application; it runs that Application's tasks.
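A minimal sketch of how these roles show up in application code, assuming a hypothetical standalone cluster at spark://master-host:7077; the resource settings are illustrative values only:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ComponentsSketch {
  def main(args: Array[String]): Unit = {
    // The Driver is this JVM: it runs main() and owns the SparkContext.
    val conf = new SparkConf()
      .setAppName("components-sketch")
      .setMaster("spark://master-host:7077") // hypothetical standalone Cluster Manager (Master) address
      .set("spark.executor.memory", "2g")    // resources each Executor is granted on a Worker Node
      .set("spark.executor.cores", "2")

    val sc = new SparkContext(conf)

    // Tasks produced by this job are executed by Executor processes on the Worker Nodes.
    println(sc.parallelize(1 to 100).sum())

    sc.stop()
  }
}
```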


Question 3: Briefly introduce the Spark programming model.

Answer 3:

The Spark programming model is shown below:
[Figure: Spark programming model]

The whole process of a Spark application, from writing through submission and execution to output, is shown in the figure.

  1. Users write a Driver application program using the API provided by SparkContext (commonly used calls include textFile, sequenceFile, runJob, stop, etc.). In addition, SQLContext, HiveContext, and StreamingContext wrap SparkContext and provide APIs for SQL, Hive, and streaming computation (a minimal driver sketch follows this list).
  2. A user application submitted through SparkContext first uses BlockManager and BroadcastManager to broadcast the task's Hadoop configuration. The DAGScheduler then converts the job into RDDs organized into a DAG, which is further divided into different stages. Finally, the TaskScheduler submits the tasks to the cluster manager via the ActorSystem.
  3. The cluster manager (ClusterManager) allocates resources to the tasks, that is, it assigns specific tasks to Workers. Workers create Executors to run the tasks, and the results are saved to the Store. Standalone, YARN, Mesos, EC2, etc. can all serve as the Spark cluster manager; HDFS, Amazon S3, Tachyon, etc. can all serve as the Store.
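A minimal driver sketch for the flow above; the HDFS path is hypothetical, and the job is only submitted when the action is called:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    // The Driver application: this code runs in the driver JVM.
    val sc = new SparkContext(new SparkConf().setAppName("driver-sketch").setMaster("local[*]"))

    // textFile only builds an RDD; no data is read yet (lazy).
    val lines = sc.textFile("hdfs:///tmp/input.txt") // hypothetical input path

    // collect() is an action: the DAGScheduler builds the DAG, splits it into
    // stages, and the TaskScheduler ships the tasks to the cluster manager.
    val longLines = lines.filter(_.length > 80).collect()
    println(longLines.length)

    sc.stop()
  }
}
```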

Question 4: What is the Spark computing model?

Answer 4:

The operation of the Spark computing model is shown in the following figure:
[Figure: Spark computing model (iterative RDD computation over partitions)]

In the above figure, RDD stands for Resilient Distributed Dataset. An RDD can be regarded as a unified abstraction over various data computation models. Spark's computation process is essentially the iterative computation of RDDs, and this iterative computation works much like a pipeline: each partition (partition 1 to partition N in the figure) can be seen as one pipeline. The number of partitions depends on the configured partition count, and the data of each partition is processed by exactly one Task. All partitions can be executed in parallel on the Executors of multiple machine nodes.
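A minimal sketch of partitions and pipelined per-partition computation; the partition count and data are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-sketch").setMaster("local[4]"))

    // Ask for 4 partitions explicitly; each partition is computed by exactly one Task,
    // and the 4 tasks can run in parallel on different executor cores.
    val rdd = sc.parallelize(1 to 1000, numSlices = 4)
    println(rdd.getNumPartitions) // 4

    // The map -> filter chain is computed partition by partition, like a pipeline:
    // each task streams its own partition's records through both functions.
    println(rdd.map(_ * 2).filter(_ % 3 == 0).count())

    sc.stop()
  }
}
```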


Question 5: Describe the running process of Spark.

Answer 5:

The Spark running process is shown in the following figure:
[Figure: Spark running process]
Explanation of the above figure:

  1. Build the running environment of the Spark Application and start SparkContext.
  2. SparkContext applies to the resource manager (which can be Standalone, Mesos, or YARN) for resources to run Executors, and starts StandaloneExecutorBackend.
  3. The Executor applies to SparkContext for Tasks.
  4. SparkContext distributes the application code to the Executors.
  5. SparkContext builds the DAG graph, decomposes the DAG into Stages, and sends the TaskSets to the Task Scheduler; finally, the Task Scheduler sends the Tasks to the Executors to run.
  6. The Tasks run on the Executors, and all resources are released after they finish.

Question 6: What is the Spark RDD workflow?

Answer 6:

The Spark RDD workflow is shown in the following figure:
[Figure: Spark RDD workflow]
Explanation of the above figure:

  1. Create an RDD object.
  2. The DAGScheduler module steps in to compute the dependencies between RDDs; these dependencies form the DAG.
  3. Each Job is divided into multiple Stages. A key criterion for dividing Stages is whether the input of the current operator is fully determined; if so, the operator is placed in the same Stage, which avoids the message-passing overhead between Stages (see the sketch after this list).
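A minimal sketch of where a Stage boundary appears, under the common reading of the criterion above: operations with narrow dependencies stay in one Stage, while a shuffle (wide dependency) starts a new Stage:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stage-sketch").setMaster("local[*]"))

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

    // map has a narrow dependency: each output partition depends on exactly one
    // input partition, so the DAGScheduler keeps it in the same Stage.
    val pairs = words.map((_, 1))

    // reduceByKey needs a shuffle (wide dependency), so the DAG is cut here:
    // Stage 1 = parallelize + map, Stage 2 = the post-shuffle reduce.
    val counts = pairs.reduceByKey(_ + _)

    counts.collect().foreach(println) // the action that submits the Job
    sc.stop()
  }
}
```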

Question 7: What is a Spark RDD?

Answer 7:

(1) How to create an RDD

1) Create it from input in the Hadoop file system (or another persistent storage system compatible with Hadoop, such as Hive, Cassandra, or HBase), for example HDFS.
2) Transform a parent RDD into a new RDD.
3) Turn local (single-machine) data into a distributed RDD via parallelize or makeRDD (a sketch of all three methods follows).
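A minimal sketch of the three creation methods; the HDFS path is hypothetical, and since no action is run on that RDD the example still runs locally:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCreationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-creation-sketch").setMaster("local[*]"))

    // 1) From a Hadoop-compatible file system (hypothetical HDFS path).
    val fromHdfs = sc.textFile("hdfs:///tmp/input.txt")

    // 2) By transforming a parent RDD into a new RDD.
    val fromParent = fromHdfs.map(_.toUpperCase)

    // 3) By parallelizing local (single-machine) collections.
    val fromLocal1 = sc.parallelize(Seq(1, 2, 3))
    val fromLocal2 = sc.makeRDD(Seq(4, 5, 6))

    println(fromLocal1.count() + fromLocal2.count())
    sc.stop()
  }
}
```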

(2) The two kinds of RDD operators: Transformation and Action

RDD operators come in two kinds: Transformations and Actions.

1) Transformation: Transformations are lazily evaluated. That is, the transformation from one RDD to another RDD is not executed immediately; the computation is only triggered once an Action operation is applied.
[Figure: common Transformation operators]

2) Action: The Action operator will trigger Spark to submit a job and output the data to the Spark system.
[Figure: common Action operators]
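A minimal sketch of lazy Transformations followed by an Action that triggers the job; the numbers are arbitrary:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lazy-sketch").setMaster("local[*]"))

    val nums = sc.parallelize(1 to 10)

    // Transformations: they only record the lineage; nothing is computed yet.
    val doubled = nums.map(_ * 2)
    val big = doubled.filter(_ > 10)

    // Action: triggers Spark to submit the job and materialize the result.
    val result = big.collect()
    println(result.mkString(","))

    sc.stop()
  }
}
```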
