Spark Basics

Spark overview

What is Spark

Spark is a fast, general-purpose, and scalable big-data analytics engine. It was born in 2009 at the AMPLab of the University of California, Berkeley, was open-sourced in 2010, became an Apache incubator project in June 2013, and became an Apache top-level project in February 2014. The project is written in Scala.

Spark built-in modules


  • Spark Core: implements the basic functionality of Spark, including modules for task scheduling, memory management, fault recovery, and interaction with storage systems. Spark Core also contains the API definitions for the Resilient Distributed Dataset (RDD).
  • Spark SQL: a package for working with structured data. Through Spark SQL, we can query data using SQL or the Apache Hive dialect of SQL (HQL). Spark SQL supports multiple data sources, such as Hive tables, Parquet, and JSON (see the short spark-shell sketch after this list).
  • Spark Streaming: a component for processing real-time data streams. It provides APIs for manipulating data streams that closely mirror the RDD API in Spark Core.
  • Spark MLlib: a library of common machine learning (ML) functionality, including classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import.
  • Cluster Manager: Spark is designed to scale computation efficiently from one compute node to thousands. To achieve this with maximum flexibility, Spark can run on a variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple scheduler that ships with Spark, called the Standalone scheduler.
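As a quick illustration of how these modules fit together, here is a small sketch (added here, not from the original article) that can be pasted into spark-shell, where sc and the spark session are predefined; people.json is a hypothetical input file:

scala> val rdd = sc.parallelize(Seq(1, 2, 3, 4))          // Spark Core: build an RDD
scala> rdd.map(_ * 2).reduce(_ + _)                        // Spark Core: transform and aggregate
res0: Int = 20
scala> val df = spark.read.json("people.json")             // Spark SQL: JSON is one of the supported sources
scala> df.createOrReplaceTempView("people")
scala> spark.sql("SELECT count(*) FROM people").show()     // query the data with SQL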

Spark features

Fast

Compared with Hadoop MapReduce, Spark's in-memory computations are more than 100 times faster, and its disk-based computations are more than 10 times faster. Spark implements an efficient DAG execution engine that processes data efficiently in memory; the intermediate results of a computation are kept in memory.

Easy to use

Spark provides APIs in Java, Python, and Scala, as well as more than 80 high-level operators, allowing users to quickly build different applications. Spark also supports interactive Python and Scala shells, which make it very convenient to verify solutions against a Spark cluster.

General-purpose

Spark provides a unified solution: it can be used for batch processing, interactive queries (Spark SQL), real-time stream processing (Spark Streaming), machine learning (Spark MLlib), and graph computation (GraphX).

All of these types of processing can be used seamlessly within the same application. This unified stack is very attractive: any company would prefer to handle the problems it encounters on a single platform, reducing both the labor cost of development and maintenance and the material cost of deploying the platform.

Compatibility

Spark integrates easily with other open-source products. For example, it can use Hadoop YARN or Apache Mesos for resource management and scheduling, and it can process all data sources supported by Hadoop, including HDFS, HBase, and Cassandra. This is particularly important for users who have already deployed Hadoop clusters, because they can use Spark's processing power without any data migration. Spark can also run without a third-party resource manager and scheduler: it ships with Standalone, its own resource management and scheduling framework, which further lowers the barrier to entry and makes Spark easy for everyone to deploy and use. In addition, Spark provides tools for deploying Standalone Spark clusters on EC2.

Spark run modes

Spark installation address

1. Official website: http://spark.apache.org/
2. Documentation: https://spark.apache.org/docs/2.1.1/
3. Downloads: https://spark.apache.org/downloads.html

Local mode

Overview

  • Local mode runs Spark on a single machine and is usually used for local practice and testing. The master can be set in any of the following ways (a small sketch of setting it programmatically follows this list).
  • local: all computation runs in a single thread, with no parallelism. This is the usual choice for running test code or practicing on your own machine.
  • local[K]: runs the computation with K worker threads; for example, local[4] runs 4 worker threads. Ideally, set K to the number of cores your CPU has, to make full use of its computing power.
  • local[*]: sets the number of threads to the maximum number of CPU cores automatically.
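A minimal Scala sketch of setting the master programmatically in a standalone application (not inside spark-shell, where sc already exists; the app name is arbitrary and the snippet is added here for illustration):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example: choose the master explicitly when building a SparkContext.
val conf = new SparkConf()
  .setAppName("LocalModeDemo")
  .setMaster("local[*]")            // or "local" / "local[4]", as described above
val sc = new SparkContext(conf)
println(sc.parallelize(1 to 100).count())   // prints 100
sc.stop()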

Install and use

1) Upload and unzip the spark installation package

[root@bigdata111 software]$ tar -zxvf spark-2.1.1-bin-hadoop2.7.tgz -C /opt/module/
[root@bigdata111 module]$ mv spark-2.1.1-bin-hadoop2.7 spark

2) Run the official SparkPi example

[root@bigdata111 spark]$ bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--executor-memory 1G \
--total-executor-cores 2 \
./examples/jars/spark-examples_2.11-2.1.1.jar \
100

Basic grammar

bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]

Parameter Description:

  • --master: specifies the master URL; the default is local
  • --class: the entry (main) class of your application (e.g. org.apache.spark.examples.SparkPi)
  • --deploy-mode: whether to launch the driver on a worker node (cluster) or locally as a client (client); the default is client
  • --conf: an arbitrary Spark configuration property in key=value format; if the value contains spaces, quote it as "key=value"
  • application-jar: the packaged application jar, including dependencies. The URL must be globally visible inside the cluster, e.g. an hdfs:// path on shared storage, or a file:// path present at the same location on every node
  • application-arguments: the arguments passed to the main() method (see the application sketch after this list)
  • --executor-memory 1G: sets the available memory of each executor to 1G
  • --total-executor-cores 2: sets the total number of CPU cores used across all executors to 2
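To make these options concrete, here is a hedged sketch of a tiny application: its main class would be passed to --class, its packaged jar would be the application-jar, and the command-line arguments arrive in args (the class name com.example.SimpleCount and the input path are hypothetical, not from the original article):

package com.example

import org.apache.spark.sql.SparkSession

// Hypothetical example application; package it into a jar and submit it, e.g.:
// bin/spark-submit --class com.example.SimpleCount --master local[2] simple-count.jar input
object SimpleCount {
  def main(args: Array[String]): Unit = {
    val inputPath = args(0)                     // application-arguments: the first argument
    val spark = SparkSession.builder().appName("SimpleCount").getOrCreate()
    val lines = spark.sparkContext.textFile(inputPath)
    println(s"line count: ${lines.count()}")    // a simple action so the job actually runs
    spark.stop()
  }
}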
3) View the result
The output is an approximation of PI computed with the Monte Carlo method.
4) Prepare files
[root@bigdata111 spark]$ mkdir input

Create files 1.txt and 2.txt under the input directory and enter the following content:

hello root
hello spark

5) Start spark-shell
Open another terminal window. You can visit <hostname or IP>:4040 to view the running application in the Spark web UI.

6) Run the WordCount program

scala>sc.textFile("input").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect
res0: Array[(String, Int)] = Array((hello,2), (root,1), (spark,1))
scala>

You can visit <hostname or IP>:4040 to view the running application in the Spark web UI.

Submission process

1) Task submission analysis:
Important roles:
Driver
The Spark driver is the process that runs the main method of the user program. It executes the user code that creates the SparkContext, creates RDDs, and performs RDD transformations and actions. If you use the Spark shell, a driver program is launched automatically in the background when the shell starts; it is the SparkContext object preloaded in the shell as sc. If the driver terminates, the Spark application ends as well. It is mainly responsible for:
1) Turning the user program into tasks
2) Tracking the running status of Executors
3) Scheduling tasks onto executor nodes
4) Displaying the application's running status in the UI

Executor
A Spark Executor is a worker process responsible for running the tasks of a Spark job; the tasks are independent of each other. Executors are started when the Spark application starts and live for the entire lifetime of the application. If an Executor node fails or crashes, the Spark application can continue to run: the tasks on the failed node are rescheduled onto other Executor nodes. Executors are mainly responsible for:
1) Running the tasks that make up the Spark application and returning the results to the driver process;
2) Providing in-memory storage, through its own Block Manager, for RDDs that user programs ask to cache. RDDs are cached directly inside the Executor process, so tasks can take full advantage of cached data to speed up computation at runtime (see the brief caching sketch below).
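As a small illustration of executor-side caching (a spark-shell sketch added here for clarity, not part of the original text), cache() keeps an RDD's partitions in executor memory via the Block Manager, so later actions reuse them:

scala> val words = sc.textFile("input").flatMap(_.split(" ")).cache()
scala> words.count()   // first action: reads the files and caches the partitions in executor memory
scala> words.count()   // second action: served from the executors' cached blocks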

Data flow

textFile("input"): read the input folder data of the local file;
flatMap( .split(" ")): flattening operation, map a line of data into words according to the space separator;
map((
,1)) : Operate on each element and map words to tuples;
reduceByKey( + ): aggregate and add values ​​according to key;
collect: collect data to the Driver side for display.
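For illustration, the same WordCount pipeline written step by step in spark-shell (the intermediate variable names are added here; they are not in the original one-liner) makes each stage of the data flow visible:

scala> val lines  = sc.textFile("input")            // read the local input folder
scala> val words  = lines.flatMap(_.split(" "))     // flatten each line into words
scala> val pairs  = words.map((_, 1))               // map each word to a (word, 1) tuple
scala> val counts = pairs.reduceByKey(_ + _)        // add up the values for each key
scala> counts.collect()                             // bring the results back to the Driver
res0: Array[(String, Int)] = Array((hello,2), (root,1), (spark,1))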

Standalone mode

Overview

Build a Spark cluster consisting of Master and Worker (slave) nodes; Spark tasks run inside this cluster.

Install and use

1) Enter the conf folder under the spark installation directory

[root@bigdata111 module]$ cd spark/conf/

2) Rename the configuration file templates:

[root@bigdata111 conf]$ mv slaves.template slaves
[root@bigdata111 conf]$ mv spark-env.sh.template spark-env.sh

3) Modify the slaves file and add the worker nodes:

[root@bigdata111 conf]$ vi slaves
bigdata111  // hostnames of the three machines
bigdata112 
bigdata113

4) Modify the spark-env.sh file and add the following configuration:

[root@bigdata111 conf]$ vim spark-env.sh
SPARK_MASTER_HOST=bigdata111
SPARK_MASTER_PORT=7077
 

5) Distribute spark package

[root@bigdata111 module]$ xsync spark/

After this configuration, you can distribute the spark directory with scp, or write a distribution script such as the xsync script used above.
6) Start the cluster

[root@bigdata111 spark]$ sbin/start-all.sh

Web UI: bigdata111 (or its IP address):8080

Note : If you encounter the "JAVA_HOME not set" exception, you can add the following configuration to the spark-config.sh file in the sbin directory:
export JAVA_HOME=XXXX

7) Run the official SparkPi example

[root@bigdata111 spark]$ bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://bigdata111:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
./examples/jars/spark-examples_2.11-2.1.1.jar \
100

8) Start the spark shell

/opt/module/spark/bin/spark-shell \
--master spark://bigdata111:7077 \
--executor-memory 1g \
--total-executor-cores 2

Parameter: --master spark://bigdata111:7077 specifies the master of the cluster to connect to.

JobHistoryServer configuration

1) Rename spark-defaults.conf.template

[root@bigdata111 conf]$ mv spark-defaults.conf.template spark-defaults.conf

2) Modify the spark-defaults.conf file and enable event logging:

[root@bigdata111 conf]$ vi spark-defaults.conf
spark.eventLog.enabled true
spark.eventLog.dir hdfs://bigdata111:9000/directory

Note : The directory on HDFS needs to exist in advance.
3) Modify the spark-env.sh file and add the following configuration:

[root@bigdata111 conf]$ vi spark-env.sh
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080
-Dspark.history.retainedApplications=30
-Dspark.history.fs.logDirectory=hdfs://bigdata111:9000/directory"

Parameter description:
spark.eventLog.dir: all information generated while the application runs is recorded in the path specified by this property
spark.history.ui.port=18080: the web UI is accessed on port 18080
spark.history.fs.logDirectory=hdfs://bigdata111:9000/directory: once this property is configured, there is no need to specify the path explicitly when running start-history-server.sh; the Spark History Server page only shows information under the specified path
spark.history.retainedApplications=30: the number of application history records to keep; when this value is exceeded, the oldest application information is deleted. This is the number of applications kept in memory, not the number displayed on the page.
4) Distribute the configuration files

[root@bigdata111 conf]$ xsync spark-defaults.conf
[root@bigdata111 conf]$ xsync spark-env.sh

5) Start history service

[root@bigdata111 spark]$ sbin/start-history-server.sh

6) Run the task again

[root@bigdata111 spark]$ bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://bigdata111:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
./examples/jars/spark-examples_2.11-2.1.1.jar \
100

7) View the history service at bigdata111:18080

HA configuration

1) Make sure Zookeeper is installed and running.
2) Modify the spark-env.sh file and add the following configuration:

[root@bigdata111 conf]$ vi spark-env.sh
Comment out the following lines:
#SPARK_MASTER_HOST=bigdata111
#SPARK_MASTER_PORT=7077

Add the following content:

export SPARK_DAEMON_JAVA_OPTS="
-Dspark.deploy.recoveryMode=ZOOKEEPER
-Dspark.deploy.zookeeper.url=bigdata111,bigdata112,bigdata113
-Dspark.deploy.zookeeper.dir=/spark"

3) Distribute the configuration file

[root@bigdata111 conf]$ xsync spark-env.sh

4) Start all nodes on bigdata111

[root@bigdata111 spark]$ sbin/start-all.sh

5) Start the master node separately on bigdata112

[root@bigdata112 spark]$ sbin/start-master.sh

6) Spark HA cluster access

/opt/module/spark/bin/spark-shell \
--master spark://bigdata111:7077,bigdata112:7077 \
--executor-memory 2g \
--total-executor-cores 2

Yarn mode

Overview

The Spark client connects directly to Yarn, so there is no need to build a separate Spark cluster. There are two modes, yarn-client and yarn-cluster; the main difference is where the Driver program runs.
yarn-client: the Driver runs on the client. This is suitable for interaction and debugging, when you want to see the application's output immediately.
yarn-cluster: the Driver runs inside the ApplicationMaster (AM) started by the ResourceManager (RM). This is suitable for production environments.

Install and use

1) Modify the hadoop configuration file yarn-site.xml and add the following content:

[root@bigdata111 hadoop]$ vi yarn-site.xml
<!-- Whether to start a thread that checks the amount of physical memory each task is using;
if a task exceeds its allocation, it is killed. Default: true -->
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<!-- Whether to start a thread that checks the amount of virtual memory each task is using;
if a task exceeds its allocation, it is killed. Default: true -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>

2) Modify spark-env.sh and add the following configuration:

[root@bigdata111 conf]$ vi spark-env.sh
YARN_CONF_DIR=/opt/module/hadoop-2.8.4/etc/hadoop

3) Distribute the configuration file

[root@bigdata111 conf]$ xsync /opt/module/hadoop-2.8.4/etc/hadoop/yarn-site.xml

4) Execute a program

[root@bigdata111 spark]$ bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
./examples/jars/spark-examples_2.11-2.1.1.jar \
100

Note: HDFS and YARN cluster need to be started before submitting the task.

Log View

1) Modify the configuration file spark-defaults.conf and add the following content:

spark.yarn.historyServer.address=bigdata111:18080
spark.history.ui.port=18080

2) Restart the spark history service

[root@bigdata111 spark]$ sbin/stop-history-server.sh
stopping org.apache.spark.deploy.history.HistoryServer
[root@bigdata111 spark]$ sbin/start-history-server.sh
starting org.apache.spark.deploy.history.HistoryServer, logging to
/opt/module/spark/logs/spark-itstar-org.apache.spark.deploy.histo
ry.HistoryServer-1-bigdata111.out

3) Submit the task to Yarn for execution

[root@bigdata111 spark]$ bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
./examples/jars/spark-examples_2.11-2.1.1.jar \
100

4) View the logs from the web page at bigdata111:8088

Comparison of several modes

Mode         Machines with Spark installed    Processes to start    Belongs to
Local        1                                None                  Spark
Standalone   3                                Master and Worker     Spark
Yarn         1                                Yarn and HDFS         Hadoop
