Overview of Spark's architecture (Chapter 1)

Background introduction

Spark is a lightning-fast unified analytics engine (computing framework) for processing large-scale data sets. For batch computing, Spark's performance is roughly 10-100 times that of Hadoop MapReduce, because Spark uses DAG-based (directed acyclic graph) task scheduling: a job is split into several stages, and these stages are then handed to the cluster's compute nodes for processing in batches.

A MapReduce job is divided into two steps, a map phase and a reduce phase. If the result cannot be produced in those two steps, another MapReduce job has to be run, and the data is repeatedly read from and written to disk, which reduces efficiency. Spark, by contrast, computes in memory: each job is divided into several stages, the data is read from disk once, the intermediate stages are completed directly in memory, and only the final result is written back to disk.
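
As a minimal sketch of this contrast (the HDFS paths here are illustrative, not from the original setup), counting words and then sorting them by frequency would require two chained MapReduce jobs with an intermediate write to HDFS, while in Spark it is a single pipeline whose intermediate data stays in memory:

// Inside spark-shell, where sc is already defined; the paths are assumptions.
sc.textFile("hdfs:///demo/input")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                    // step 1: count the words
  .sortBy(_._2, ascending = false)       // step 2: sort by count, still in memory
  .saveAsTextFile("hdfs:///demo/output") // only the final result is written to disk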

MapReduce VS Spark

As a first-generation data processing framework, MapReduce was designed to meet the urgent need for computation over massive data sets. Spun off from the Nutch project (a Java search engine) in 2006, it mainly addressed the problems people faced in the early days, when the understanding of big data was still rudimentary.

Over time, people began to use the MapReduce framework to implement more complex, higher-order algorithms, and these algorithms often cannot be completed in a single MapReduce pass; they require iterative computation. Since the MapReduce model always writes its results to disk, each iteration has to load the data from disk back into memory, which adds more and more latency to subsequent iterations.

Spark has grown so fast because, at the computing layer, it is significantly better than Hadoop MapReduce's disk-based iterative computation: Spark computes on data in memory, and the intermediate results of a computation can also be cached in memory. This saves time in subsequent iterations and greatly improves computing efficiency on massive data.
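
As a minimal sketch of this point (the input path and the update rule below are made up for illustration), an iterative computation can cache its working set once and reuse it across iterations instead of reloading it from disk each time:

// Inside spark-shell (sc is predefined); path and logic are illustrative.
val points = sc.textFile("hdfs:///demo/points")
  .map(_.split(",").map(_.toDouble))
  .cache()                               // keep the parsed records in memory

var score = 0.0
for (i <- 1 to 10) {
  // each iteration reuses the cached RDD; nothing is re-read from disk
  score = points.map(_.sum).reduce(_ + _) / points.count()
}
println(s"score after 10 iterations: $score")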

Beyond that, Spark's design follows a "one stack to rule them all" strategy: on top of the Spark batch engine it provides additional computing services, such as interactive queries (Spark SQL), near-real-time stream processing (Spark Streaming), machine learning (MLlib), and graph computation (GraphX).
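
As a small sketch of the "one stack" idea (the data and table name are made up), the same Spark session can serve both a batch-style computation and an interactive SQL query without switching engines:

// Inside spark-shell (spark, a SparkSession, is predefined); the data is illustrative.
import spark.implicits._

val sales = Seq(("book", 3), ("pen", 10), ("book", 5)).toDF("item", "qty")

// batch-style DataFrame API ...
sales.groupBy("item").sum("qty").show()

// ... and interactive SQL on the same engine and the same data
sales.createOrReplaceTempView("sales")
spark.sql("SELECT item, SUM(qty) AS total FROM sales GROUP BY item").show()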


Computation process (key point)

First, let's review the shortcomings of MapReduce:

1) Although MapReduce is built on the map/reduce bulk-processing programming model, its computation stages are too simple: a job is divided only into a Map stage and a Reduce stage, with no consideration for iterative computation scenarios.

2) The intermediate results of the Map tasks are stored on local disk, which causes heavy I/O and poor data read/write efficiency.

3) MapReduce submits the job first and only then applies for resources during the computation. Its execution model is also heavyweight: each unit of parallelism runs in its own JVM process.

Spark computation process

Compared with MapReduce, Spark has the following computational advantages:

1) Intelligent DAG-based task splitting: a complex computation is split into several stages, which supports iterative computation scenarios (see the sketch after this list).
2) Spark provides caching and fault-tolerance strategies: intermediate results can be stored in memory or on disk, which speeds up each stage and improves overall efficiency.
3) Spark applies for computing resources at the start of the job, and task parallelism is achieved by starting threads inside the Executor processes, which is much lighter and faster than MapReduce's per-task JVM processes.
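
As an illustrative sketch of point 1 (the input path is assumed), you can see where Spark cuts a job into stages by printing an RDD's lineage; the reduceByKey below introduces the shuffle boundary that separates two stages:

// Inside spark-shell (sc is predefined); the path is illustrative.
val counts = sc.textFile("hdfs:///demo/input")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)       // shuffle boundary: the DAG is split into two stages here

println(counts.toDebugString) // the indented lineage shows the stage boundary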

Reminder: Spark currently provides several Cluster Manager implementations, such as YARN, Standalone, Mesos, and Kubernetes. Among them, YARN and Standalone are the ones most commonly used by enterprises.

Spark installation

Spark On Yarn

Hadoop environment

Set the number of CentOS processes and files (optional)

[root@CentOS ~]# vi /etc/security/limits.conf
* soft nofile 204800
* hard nofile 204800
* soft nproc 204800
* hard nproc 204800

These settings raise the Linux limits on open files and processes to optimize performance; restart CentOS for the changes to take effect.
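
After rebooting (or logging in again), a quick check with standard Linux commands (an addition, not part of the original steps) confirms the new limits are active:

[root@CentOS ~]# ulimit -n    # open-file limit, should report 204800
[root@CentOS ~]# ulimit -u    # max user processes, should report 204800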

Configure the host name (takes effect after restart)

[root@CentOS ~]# vi /etc/hostname
CentOS
[root@CentOS ~]# reboot

Set up IP mapping

[root@CentOS ~]# vi /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.52.134 CentOS

Firewall service

# temporarily stop the service
[root@CentOS ~]# systemctl stop firewalld
[root@CentOS ~]# firewall-cmd --state
not running
# disable auto-start on boot
[root@CentOS ~]# systemctl disable firewalld

Install JDK1.8+

[root@CentOS ~]# rpm -ivh jdk-8u171-linux-x64.rpm
[root@CentOS ~]# ls -l /usr/java/
total 4
lrwxrwxrwx. 1 root root 16 Mar 26 00:56 default -> /usr/java/latest
drwxr-xr-x. 9 root root 4096 Mar 26 00:56 jdk1.8.0_171-amd64
lrwxrwxrwx. 1 root root 28 Mar 26 00:56 latest -> /usr/java/jdk1.8.0_171-amd64
[root@CentOS ~]# vi .bashrc
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin
CLASSPATH=.
export JAVA_HOME
export PATH
export CLASSPATH
[root@CentOS ~]# source ~/.bashrc

Passwordless SSH configuration

[root@CentOS ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
4b:29:93:1c:7f:06:93:67:fc:c5:ed:27:9b:83:26:c0 root@CentOS
The key's randomart image is:
+--[ RSA 2048]----+
| |
| o . . |
| . + + o .|
| . = * . . . |
| = E o . . o|
| + = . +.|
| . . o + |
| o . |
| |
+-----------------+
[root@CentOS ~]# ssh-copy-id CentOS
The authenticity of host 'centos (192.168.40.128)' can't be established.
RSA key fingerprint is 3f:86:41:46:f2:05:33:31:5d:b6:11:45:9c:64:12:8e.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'centos,192.168.40.128' (RSA) to the list of known hosts.
root@centos's password:
Now try logging into the machine, with "ssh 'CentOS'", and check in:
 .ssh/authorized_keys
to make sure we haven't added extra keys that you weren't expecting.
[root@CentOS ~]# ssh root@CentOS
Last login: Tue Mar 26 01:03:52 2019 from 192.168.40.1
[root@CentOS ~]# exit
logout
Connection to CentOS closed.

Configure HDFS|YARN

Extract hadoop-2.9.2.tar.gz into the system's /usr directory and configure the [core|hdfs|yarn|mapred]-site.xml configuration files.

[root@CentOS ~]# vi /usr/hadoop-2.9.2/etc/hadoop/core-site.xml
<!-- NameNode access entry point -->
<property>
 <name>fs.defaultFS</name>
 <value>hdfs://CentOS:9000</value>
</property>
<!-- base working directory for HDFS -->
<property>
 <name>hadoop.tmp.dir</name>
 <value>/usr/hadoop-2.9.2/hadoop-${user.name}</value>
</property>
[root@CentOS ~]# vi /usr/hadoop-2.9.2/etc/hadoop/hdfs-site.xml
<!-- block replication factor -->
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<!-- host that runs the Secondary NameNode -->
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>CentOS:50090</value>
</property>
<!-- maximum number of files a DataNode can serve at once -->
<property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
</property>
<!-- DataNode handler count (parallel processing capacity) -->
<property>
    <name>dfs.datanode.handler.count</name>
    <value>6</value>
</property>
[root@CentOS ~]# vi /usr/hadoop-2.9.2/etc/hadoop/yarn-site.xml
<!-- shuffle: the core auxiliary service of the MapReduce framework -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<!-- host that runs the ResourceManager -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>CentOS</value>
</property>
<!-- disable the physical memory check -->
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<!-- disable the virtual memory check -->
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
[root@CentOS ~]# vi /usr/hadoop-2.9.2/etc/hadoop/mapred-site.xml
<!-- run the MapReduce framework on YARN -->
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>

Configure hadoop environment variables

[root@CentOS ~]# vi .bashrc
JAVA_HOME=/usr/java/latest
HADOOP_HOME=/usr/hadoop-2.9.2
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
CLASSPATH=.
export JAVA_HOME
export CLASSPATH
export PATH
export HADOOP_HOME
[root@CentOS ~]# source .bashrc

Start Hadoop service

[root@CentOS ~]# hdfs namenode -format # create the fsimage file needed for initialization
[root@CentOS ~]# start-dfs.sh
[root@CentOS ~]# start-yarn.sh
[root@CentOS ~]# jps
122690 NodeManager
122374 SecondaryNameNode
122201 DataNode
122539 ResourceManager
122058 NameNode
123036 Jps

Visit: http://CentOS:8088 (YARN) and http://CentOS:50070 (HDFS)
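
Optionally, the services can also be verified from the command line with standard Hadoop commands (an addition, not part of the original steps):

[root@CentOS ~]# hdfs dfsadmin -report   # should show one live DataNode
[root@CentOS ~]# yarn node -list         # should show one running NodeManager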

Spark environment (using yarn)

Download spark-2.4.5-bin-without-hadoop.tgz, extract it to the /usr directory, rename the Spark directory to spark-2.4.5, and then modify the spark-env.sh and spark-defaults.conf files.

Unzip and install spark

[root@CentOS ~]# tar -zxf spark-2.4.5-bin-without-hadoop.tgz -C /usr/
[root@CentOS ~]# mv /usr/spark-2.4.5-bin-without-hadoop/ /usr/spark-2.4.5
[root@CentOS ~]# tree -L 1 /usr/spark-2.4.5/
/usr/spark-2.4.5/
├── bin         # Spark executable scripts
├── conf        # Spark configuration directory
├── data
├── examples    # official examples shipped with Spark
├── jars
├── kubernetes
├── LICENSE
├── licenses
├── NOTICE
├── python
├── R
├── README.md
├── RELEASE
├── sbin        # Spark daemon/admin scripts
└── yarn

Configure Spark service

[root@CentOS ~]# cd /usr/spark-2.4.5/
[root@CentOS spark-2.4.5]# mv conf/spark-env.sh.template conf/spark-env.sh
[root@CentOS spark-2.4.5]# vi conf/spark-env.sh
# Options read in YARN client/cluster mode
# - SPARK_CONF_DIR, Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - YARN_CONF_DIR, to point Spark towards YARN configuration files when you use YARN
# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)
HADOOP_CONF_DIR=/usr/hadoop-2.9.2/etc/hadoop
YARN_CONF_DIR=/usr/hadoop-2.9.2/etc/hadoop
SPARK_EXECUTOR_CORES=2
SPARK_EXECUTOR_MEMORY=1G
SPARK_DRIVER_MEMORY=1G
LD_LIBRARY_PATH=/usr/hadoop-2.9.2/lib/native
export HADOOP_CONF_DIR
export YARN_CONF_DIR
export SPARK_EXECUTOR_CORES
export SPARK_DRIVER_MEMORY
export SPARK_EXECUTOR_MEMORY
export LD_LIBRARY_PATH
export SPARK_DIST_CLASSPATH=$(hadoop classpath):$SPARK_DIST_CLASSPATH
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs:///spark-logs"
[root@CentOS spark-2.4.5]# mv conf/spark-defaults.conf.template conf/spark-defaults.conf
[root@CentOS spark-2.4.5]# vi conf/spark-defaults.conf
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs:///spark-logs

Now create a spark-logs directory on HDFS; the Spark history server will use it to store the event logs of past applications.

[root@CentOS ~]# hdfs dfs -mkdir /spark-logs

Start Spark history server

[root@CentOS spark-2.4.5]# ./sbin/start-history-server.sh
[root@CentOS spark-2.4.5]# jps
124528 HistoryServer
122690 NodeManager
122374 SecondaryNameNode
122201 DataNode
122539 ResourceManager
122058 NameNode
124574 Jps

Visit http://<host-ip>:18080 to open the Spark History Server.

Test the environment

./bin/spark-submit  --master yarn  --deploy-mode client --class org.apache.spark.examples.SparkPi  --num-executors 2  --executor-cores 3  /usr/spark-2.4.5/examples/jars/spark-examples_2.11-2.4.5.jar

You should see output similar to the following:

19/04/21 03:30:39 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 6609 ms on CentOS (executor 1) (1/2)
19/04/21 03:30:39 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 6403 ms on CentOS (executor 1) (2/2)
19/04/21 03:30:39 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
19/04/21 03:30:39 INFO scheduler.DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 29.116 s
19/04/21 03:30:40 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 30.317103 s
`Pi is roughly 3.141915709578548`
19/04/21 03:30:40 INFO server.AbstractConnector: Stopped Spark@41035930{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
19/04/21 03:30:40 INFO ui.SparkUI: Stopped Spark web UI at http://CentOS:4040
19/04/21 03:30:40 INFO cluster.YarnClientSchedulerBackend: Interrupting monitor thread
19/04/21 03:30:40 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors

Parameter description:

--master: the resource manager to connect to (yarn)
--deploy-mode: deployment mode, either client or cluster; it determines whether the Driver program runs locally or on the cluster (see the cluster-mode example below)
--class: the fully qualified name of the main class to run
--num-executors: the number of Executor processes to use for the computation
--executor-cores: the maximum number of CPU cores used by each Executor
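
For comparison, here is a hypothetical variant of the same submission using --deploy-mode cluster (not shown in the original text); in this mode the Driver runs inside a YARN container instead of on the client machine:

./bin/spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi --num-executors 2 --executor-cores 3 /usr/spark-2.4.5/examples/jars/spark-examples_2.11-2.4.5.jar

In cluster mode the "Pi is roughly ..." line appears in the Driver's container log (reachable through the YARN web UI at http://CentOS:8088), not in the local console.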

Spark Shell

./bin/spark-shell --master yarn --deploy-mode client --executor-cores 4 --num-executors 3
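
Once the shell is connected to YARN, a quick sanity check (the computation below is only an illustration) confirms that tasks actually run on the executors:

scala> sc.parallelize(1 to 1000, 4).map(_ * 2).sum()
res0: Double = 1001000.0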

Spark Standalone

Use Spark's built-in (standalone) cluster manager instead of YARN.

The configuration is basically the same as above; only spark-env.sh needs to change.

vim conf/spark-env.sh
# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors(e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service(e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_DAEMON_CLASSPATH, to set the classpath for all daemons
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers
SPARK_MASTER_HOST=centos
SPARK_MASTER_PORT=7077
SPARK_WORKER_CORES=4
SPARK_WORKER_INSTANCES=2
SPARK_WORKER_MEMORY=2g
export SPARK_MASTER_HOST
export SPARK_MASTER_PORT
export SPARK_WORKER_CORES
export SPARK_WORKER_MEMORY
export SPARK_WORKER_INSTANCES
export LD_LIBRARY_PATH=/usr/hadoop-2.9.2/lib/native
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs:///spark-logs"

Delete the previous log directory and recreate it

hdfs dfs -rm -r -f /spark-logs
hdfs dfs -mkdir /spark-logs

Start history-server

[root@CentOS spark-2.4.5]# ./sbin/start-history-server.sh
[root@CentOS spark-2.4.5]# jps
124528 HistoryServer
122690 NodeManager
122374 SecondaryNameNode
122201 DataNode
122539 ResourceManager
122058 NameNode
124574 Jps

Visit http://<host-ip>:18080 to open the Spark History Server.

Start the Spark standalone services (Master and Workers)

[root@CentOS spark-2.4.5]# ./sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /usr/spark-2.4.5/logs/spark-root-org.apache.spark.deploy.master.Master-1-CentOS.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /usr/spark-2.4.5/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-CentOS.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /usr/spark-2.4.5/logs/spark-root-org.apache.spark.deploy.worker.Worker-2-CentOS.out
[root@CentOS spark-2.4.5]# jps
7908 Worker
7525 HistoryServer
8165 Jps
122374 SecondaryNameNode
7751 Master
122201 DataNode
122058 NameNode
7854 Worker

Note: ./sbin/start-all.sh requires the Hadoop and Java environment variables to be configured in ~/.bashrc (vi .bashrc); otherwise you will get a "hadoop: command not found" error, together with a "JAVA_HOME is not set" error.

Configuring only the system-wide variables does not solve these problems; you must configure the .bashrc file, and remember to run source .bashrc after editing it.
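
For reference, a minimal ~/.bashrc along these lines works; it mirrors the variables configured earlier in this chapter, and the SPARK_HOME entry is an optional addition (an assumption, not required by the original steps):

JAVA_HOME=/usr/java/latest
HADOOP_HOME=/usr/hadoop-2.9.2
SPARK_HOME=/usr/spark-2.4.5
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
CLASSPATH=.
export JAVA_HOME HADOOP_HOME SPARK_HOME PATH CLASSPATH

[root@CentOS ~]# source .bashrc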

Visit http://CentOS:8080 to open the Spark Master web UI.

Test the standalone environment (the command below is written on a single line)

./bin/spark-submit  --master spark://centos:7077  --deploy-mode client --class org.apache.spark.examples.SparkPi  --total-executor-cores 6  /usr/spark-2.4.5/examples/jars/spark-examples_2.11-2.4.5.jar

If the output contains the value of Pi, the job succeeded:

`Pi is roughly 3.141915709578548`

Property description:

--master: the address of the resource manager to connect to (spark://CentOS:7077)
--deploy-mode: deployment mode, either client or cluster; it determines whether the Driver program runs locally or on the cluster
--class: the fully qualified name of the main class to run
--total-executor-cores: the total number of CPU cores (task threads) used for the computation

Spark Shell

[root@CentOS spark-2.4.5]# ./bin/spark-shell --master spark://centos:7077 --total-executor-cores 6
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).
Spark context Web UI available at http://CentOS:4040
Spark context available as 'sc' (master = spark://CentOS:7077, app id = app20200207140419-0003).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231)
Type in expressions to have them evaluated.
Type :help for more information.
scala> sc.textFile("hdfs:///demo/words").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).sortBy(_._2,true).saveAsTextFile("hdfs:///demo/results")

Complete a WordCount

Create a file, write any number of words into it, and then upload the file to HDFS.

[root@centos hadoop-2.9.2]# vi t_word
[root@centos hadoop-2.9.2]# hdfs dfs -mkdir -p /demo/words
[root@centos hadoop-2.9.2]# hdfs dfs -put t_word /demo/words
scala> sc.textFile("hdfs:///demo/words").flatMap(_.split(" ")).map(t=>(t,1)).reduceByKey(_+_).sortBy(_._2,false).collect()

result

res8: Array[(String, Int)] = Array((day,2), (good,2), (up,1), (a,1), (on,1), (demo,1), (study,1), (this,1), (is,1), (come,1), (baby,1))
.saveAsTextFile("hdfs:///demo/results")     # append this instead of .collect() to write the output to the HDFS file system
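
If you run the version that ends with saveAsTextFile, the output can be inspected with standard HDFS commands (the exact part-file names depend on the run):

[root@centos hadoop-2.9.2]# hdfs dfs -ls /demo/results
[root@centos hadoop-2.9.2]# hdfs dfs -cat /demo/results/part-*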

Origin blog.csdn.net/origin_cx/article/details/104365791