Big Data Technology: Spark

Chapter 1 Spark Overview

1.1 What is Spark

Review: Hadoop mainly solves two problems: the storage of massive data, and the analysis and computation of massive data.

Spark is a fast, general-purpose, scalable, memory-based big data analytics and computing engine.

Hadoop's YARN framework appeared later than the Spark framework, so Spark designed its own resource scheduling framework.

Differences between MapReduce (MR) and Spark:

1. MR is disk-based, while Spark is memory-based.

2. An MR task is a process.

3. A Spark task is a thread, executed inside an Executor process.

4. MR tasks run in containers (which expose an interface, making them easy to plug into), while Spark tasks run in Workers (used by Spark itself, with no such interface).

5. MR is suited to one-pass computation, while Spark is suited to iterative computation, as the sketch below illustrates.
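Point 5 is the crux. As a minimal illustration, here is a hypothetical Scala sketch (a toy loop, not a real algorithm) that caches a dataset in memory once and reuses it across ten passes; an equivalent chain of MR jobs would pay a full disk read and write on every pass.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("IterativeDemo").setMaster("local[*]"))

    // cache() keeps the RDD in memory, so every pass below reads from RAM;
    // a chain of MR jobs would re-read the input from disk on each pass
    val points = sc.parallelize(1 to 100000).map(_.toDouble).cache()

    var estimate = 0.0
    for (_ <- 1 to 10) {
      val e = estimate // snapshot the driver-side value for this pass
      // each iteration reuses the cached RDD and refines the running estimate
      estimate = points.map(x => math.abs(x - e)).mean()
    }
    println(s"final estimate: $estimate")
    sc.stop()
  }
}
```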

1.2 Usage Scenarios of Spark

        Offline (batch) processing, real-time (streaming) processing, machine learning, and graph computing

1.3 Comparison between Hadoop and Spark frameworks

1.4 Spark built-in modules

  • Spark Core: implements Spark's basic functionality, including modules for task scheduling, memory management, fault recovery, and interaction with storage systems. Spark Core also contains the API definition of the Resilient Distributed Dataset (RDD).

  • Spark SQL: a package for working with structured data. With Spark SQL, we can query data using SQL or the Apache Hive variant of SQL (HQL). Spark SQL supports multiple data sources, such as Hive tables, Parquet, and JSON.

  • Spark Streaming: a component for streaming computation over real-time data. It provides an API for manipulating data streams that corresponds closely to the RDD API in Spark Core.

  • Spark MLlib: a library providing common machine learning functionality, including classification, regression, clustering, and collaborative filtering, as well as supporting functions such as model evaluation and data import.

  • Spark GraphX: a component mainly used for graph-parallel computation and graph mining.

  • Cluster Manager: Spark is designed to scale computation efficiently from one compute node to thousands. To meet this requirement with maximum flexibility, Spark can run on top of various cluster managers, including Hadoop YARN, Apache Mesos, and a simple scheduler that ships with Spark, called the Standalone scheduler.
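As a rough sketch of how the first two modules appear in one application (assuming Spark 3.x with the spark-sql dependency on the classpath; the class name and data are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

object ModulesDemo {
  def main(args: Array[String]): Unit = {
    // SparkSession is the unified entry point for Spark SQL and wraps a SparkContext
    val spark = SparkSession.builder()
      .appName("ModulesDemo")
      .master("local[*]") // local mode: one worker thread per CPU core
      .getOrCreate()

    // Spark Core: build an RDD and run a simple aggregation
    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))
    println(s"sum = ${rdd.reduce(_ + _)}")

    // Spark SQL: create a DataFrame and query it with SQL
    import spark.implicits._
    val df = Seq(("alice", 20), ("bob", 25)).toDF("name", "age")
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 21").show()

    spark.stop()
  }
}
```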

1.5 Features of Spark


Chapter 2 Spark Running Mode

        Deploying Spark generally falls into two modes: single-machine mode and cluster mode.

        Most distributed frameworks support a single-machine mode, which makes it convenient for developers to debug the framework's runtime environment; in production, however, single-machine mode is not used. The Spark deployments that follow therefore use cluster mode directly.

The deployment modes currently supported by Spark are listed in detail below.

(1) Local mode: deploys a single Spark service locally.

(2) Standalone mode: Spark's own resource and task scheduling mode (commonly used in China).

(3) YARN mode: Spark uses Hadoop's YARN component for resource and task scheduling (the most commonly used in China).

(4) Mesos mode: Spark uses the Mesos platform for resource and task scheduling (rarely used in China).

2.1 Local mode

1. Local mode: single-machine installation, ready to use right after decompression.
        1. Task submission: bin/spark-submit --master local/local[*]/local[N] --class <full class name> <jar path> <args>...
            master=local: simulates execution with a single thread; only one task can run at a time.
            master=local[*]: simulates execution with one thread per CPU core; at most that many tasks run at the same time.
            master=local[N]: simulates execution with N threads; at most N tasks run at the same time. (The programmatic equivalent is sketched below.)
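The same master strings can also be set in code instead of on the command line. A minimal sketch (hypothetical class name, Spark 3.x assumed):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalModeDemo {
  def main(args: Array[String]): Unit = {
    // setMaster accepts the same strings as spark-submit's --master flag:
    // "local" = 1 thread, "local[*]" = one thread per core, "local[N]" = N threads
    val conf = new SparkConf().setAppName("LocalModeDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // this job runs on the two simulated executor threads
    println(sc.parallelize(1 to 100).sum())

    sc.stop()
  }
}
```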

2.1.1 Installation and use

To be updated.

2.2 Cluster roles

1. Master and Worker: cluster resource management
Master: responsible for resource management and allocation.
Worker: a resource node and task execution node.
Master and Worker start when the cluster starts and disappear when the cluster stops.
Master and Worker exist only in Standalone mode.

Master and Worker are Spark's daemon processes and cluster resource managers, that is, the resident background processes required for Spark to run normally in a specific mode (Standalone).


2. Driver and Executor: task execution management
Responsibilities of the Driver:
                    1. Converting the code into jobs.
                    2. Submitting tasks to the Executors for execution.
                    3. Monitoring task execution.
                    4. Displaying the program's execution status on the Web UI while it runs.
Responsibilities of the Executor: executing tasks.

What Spark executes are tasks; a task is a thread, started inside an Executor process.
The Driver and Executors start when a task is submitted and disappear when the task completes.
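This division of labor is visible directly in application code. In the hypothetical sketch below, everything at the top level of main runs in the Driver, while the function passed to map is serialized and executed as tasks (threads) inside Executors:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverExecutorDemo {
  def main(args: Array[String]): Unit = {
    // code at this level runs in the Driver: it builds the job,
    // splits it into tasks, and schedules them onto Executors
    val sc = new SparkContext(
      new SparkConf().setAppName("DriverExecutorDemo").setMaster("local[*]"))
    val offset = 42 // a driver-side value, shipped to executors inside the closure

    val result = sc.parallelize(1 to 10)
      .map { x =>
        // this closure runs as tasks (threads) inside Executor processes
        x + offset
      }
      .collect() // task results are sent back to the Driver

    println(result.mkString(", ")) // prints in the Driver process
    sc.stop()
  }
}
```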

2.3 Standalone mode

2.3.1 Installation and use

Task submission: bin/spark-submit --master spark://<master hostname>:7077,... --class <full class name> <jar path> <args>...

To be updated.

2.3.2 Operation process

Standalone mode has two submission modes, standalone-client and standalone-cluster. The main difference is where the Driver program runs.

1. Client mode

[hadoop102 spark-standalone]$ bin/spark-submit --master spark://hadoop102:7077,hadoop103:7077 --deploy-mode client --executor-memory 2G --total-executor-cores 2 --class org.apache.spark.examples.SparkPi ./examples/jars/spark-examples_2.12-3.1.3.jar 10

--deploy-mode client indicates that the Driver program runs on the local client; this is the default mode. The final argument, 10, is passed to SparkPi as the number of partitions to compute with.

2. Cluster mode

[hadoop102 spark-standalone]$ bin/spark-submit --master spark://hadoop102:7077,hadoop103:7077 --deploy-mode cluster --executor-memory 2G --total-executor-cores 2 --class org.apache.spark.examples.SparkPi ./examples/jars/spark-examples_2.12-3.1.3.jar 10

--deploy-mode cluster indicates that the Driver program runs on the cluster.

The difference between the standalone client and cluster deployment modes:
        Client mode: the Driver runs in the SparkSubmit process, so that process must stay open; if it is closed, the Driver disappears, task scheduling stops, and the program terminates.
        Cluster mode: the Driver runs on one of the Workers, so closing SparkSubmit does not affect program execution.

2.4 YARN mode

The Spark client connects directly to YARN; there is no need to build a separate Spark cluster. YARN acts as the resource scheduler.

2.4.1 Installation and use

Task submission: bin/spark-submit --master yarn --class <full class name> <jar path> <args>...

To be updated.

2.4.2 Operation process

YARN mode has two submission modes, yarn-client and yarn-cluster. The main difference is where the Driver program runs.

yarn-client: the Driver program runs on the client, which suits interactive use and debugging, where you want to see the application's output immediately.

yarn-cluster: the Driver program runs in the ApplicationMaster started by the ResourceManager, which suits production environments.

1. Client mode (default)

[hadoop102 spark-yarn]$ bin/spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkPi ./examples/jars/spark-examples_2.12-3.1.3.jar 10

2. Cluster mode

[hadoop102 spark-yarn]$ bin/spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi ./examples/jars/spark-examples_2.12-3.1.3.jar 10


The difference between the yarn client and cluster deployment modes:
        Client mode: the Driver runs in the SparkSubmit process, so that process must stay open; if it is closed, the Driver disappears, task scheduling stops, and the program terminates.
        Cluster mode: the Driver runs in the ApplicationMaster process, so closing SparkSubmit does not affect program execution.

2.5 Comparison of several modes

| Mode | Machines with Spark installed | Processes to start | Belongs to |
| --- | --- | --- | --- |
| Local | 1 | None | Spark |
| Standalone | 3 | Master and Worker | Spark |
| YARN | 1 | YARN and HDFS | Hadoop |

2.7 Port Number Summary

1) Port for viewing the tasks of the currently running Spark-shell (application Web UI): 4040

2) Spark Master internal communication port: 7077 (analogous to YARN's internal communication port 8032)

3) Spark Standalone mode Master Web UI port: 8080 (analogous to Hadoop YARN's job status port 8088); it is often changed, e.g. to 8989, to avoid port conflicts

4) Spark history server port: 18080 (analogous to the Hadoop history server port 19888)

2.8 Common Parameters of spark-submit

--master: specifies which resource scheduler the task is submitted to
--executor-memory: specifies the memory size of each Executor
--executor-cores: specifies the number of CPU cores of each Executor
--total-executor-cores: specifies the total number of CPU cores across all Executors [Standalone mode only]
--num-executors: specifies the number of Executors required by the task [YARN mode only]
--queue: specifies which resource queue the task is submitted to [YARN mode only]
--deploy-mode: specifies the deployment mode of the task [client/cluster]
--driver-memory: specifies the memory size of the Driver
--class: specifies the full class name of the object containing the main method to run
