Preface
Many readers have used Hadoop, the classic distributed computing framework for big data. Its distributed file system, HDFS, makes data management easy, but its computing engine falls short of Spark's. As the saying goes, to do a good job you must first sharpen your tools. In this post we walk through deploying a Spark cluster in Docker containers.
Background
Introduction to Spark
Spark is an open-source cluster computing framework similar to Hadoop, but with some important differences. Spark keeps distributed datasets in memory, which lets it serve interactive queries and optimize iterative workloads. Because its distributed computation is memory-based, performance improves dramatically.
Spark core functions and advantages
Speed:
For in-memory workloads, Spark can run up to 100x faster than Hadoop MapReduce.
Ease of use:
Spark provides more than 80 high-level operators.
Generality:
Spark ships with a rich set of libraries, including Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX, which developers can combine seamlessly within the same application.
Multiple resource managers:
Spark runs on Hadoop YARN, Apache Mesos, and its own standalone cluster manager.
Spark running architecture
At the core of the Spark framework is a computing engine that follows a standard master-slave architecture. The Driver plays the master role and is responsible for job scheduling across the cluster; the Executors play the slave role and actually run the tasks.
After the user program creates a SparkContext, it connects to the cluster resource manager, which allocates computing resources and starts the Executors. The Driver divides the program into execution stages and then into Tasks, and sends the Tasks to the Executors.
Each Executor runs its Tasks and reports execution status back to the Driver; it also reports the resource usage of its node to the cluster resource manager.
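The scheduling flow above can be sketched in miniature with plain Python: a "driver" splits a job into tasks, hands them to a pool of "executors", and aggregates the partial results. This is only an illustration of the pattern, not Spark's actual implementation; the names `driver` and `run_task` are invented for this sketch.

```python
# Miniature sketch of the driver/executor flow described above
# (illustration only -- plain Python threads, not Spark internals).
from concurrent.futures import ThreadPoolExecutor

def run_task(partition):
    # An "executor" computes a partial result for its partition.
    return sum(partition)

def driver(data, num_tasks=4):
    # The "driver" splits the job into tasks, one per partition...
    size = max(1, (len(data) + num_tasks - 1) // num_tasks)
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    # ...dispatches them to the "executors", then aggregates the results.
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        partials = list(pool.map(run_task, partitions))
    return sum(partials)

print(driver(list(range(1, 101))))  # 5050
```

In real Spark the partitions live on different machines and the reports flow back over the network, but the divide/dispatch/aggregate shape is the same.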
Building a Spark standalone cluster
Install docker and docker-compose
# Install Docker CE
yum install docker-ce
# Check the version
docker version
# Install the docker-compose binary
curl -L https://github.com/docker/compose/releases/download/1.21.2/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose
# Make it executable
chmod +x /usr/local/bin/docker-compose
# Check the version
docker-compose version
docker-compose orchestration
docker-compose-spark.yaml
version: "3.3"
services:
  master:
    image: registry.cn-hangzhou.aliyuncs.com/senfel/spark:3.2.1
    container_name: master
    user: root
    command: "/opt/bitnami/java/bin/java -cp /opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host master --port 7077 --webui-port 8080"
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - ./python:/python
    network_mode: host
    extra_hosts:
      - "master:10.10.22.91"
      - "localhost.localdomain:127.0.0.1"
  worker1:
    image: registry.cn-hangzhou.aliyuncs.com/senfel/spark:3.2.1
    container_name: worker1
    user: root
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    network_mode: host
    extra_hosts:
      - "master:10.10.22.91"
      - "localhost.localdomain:127.0.0.1"
  worker2:
    image: registry.cn-hangzhou.aliyuncs.com/senfel/spark:3.2.1
    container_name: worker2
    user: root
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    network_mode: host
    extra_hosts:
      - "master:10.10.22.91"
      - "localhost.localdomain:127.0.0.1"
Start the containers with docker-compose
docker-compose -f docker-compose-spark.yaml up -d
Open the master web UI in a browser
http://10.10.22.91:8080/
At this point, the Spark standalone cluster is up and running.
If you also need HDFS, you can build a Hadoop cluster alongside it. I won't repeat the details here; see the earlier blog post.
Testing the cluster with the official examples
1. Pick any node to run the Pi calculation; here we use the master.
# View the Spark master container
docker ps | grep master
# Enter the container; the default working directory is /opt/bitnami/spark
docker exec -it master bash
# Run the official Pi-calculation example
./bin/spark-submit --master spark://master:7077 --class org.apache.spark.examples.SparkPi ./examples/jars/spark-examples_2.12-3.2.1.jar 1000
Parameters:
--master: the cluster to submit to
--class: the main class to run
1000: the number of partitions (slices) the sampling work is split into
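SparkPi estimates π by Monte Carlo sampling: it throws random points into the unit square and counts how many land inside the quarter circle. The same idea in plain single-process Python (a sketch of the algorithm, not the actual SparkPi source):

```python
# Monte Carlo estimate of pi -- the same idea SparkPi distributes
# across executors, here in plain single-process Python.
import random

def estimate_pi(num_samples, seed=42):
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            inside += 1
    # Ratio of areas: quarter circle / unit square = pi / 4
    return 4.0 * inside / num_samples

print(estimate_pi(100_000))  # roughly 3.14
```

Spark parallelizes exactly this loop: each partition samples independently on an executor, and the driver sums the counts.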
2. Check the execution result:
Pi is roughly 3.141485671414857
The more samples used, the more accurate the estimate of π becomes.
Final thoughts
Spark manages data with resilient distributed datasets (RDDs) and computes on them in memory, giving significantly better performance than Hadoop MapReduce. Building a standalone Spark cluster with Docker containers is fairly straightforward. Of course, you can also integrate it with Spring Boot to meet business needs and submit tasks to the cluster remotely.