Hands-on: an introduction to Spark for big data and building a standalone cluster with docker-compose

Preface

Many readers have used Hadoop, the classic distributed computing framework for big data. Its distributed file system, HDFS, is very convenient for data management, but its computing power falls short compared to Spark. As the saying goes, to do a good job you must first sharpen your tools. Today we will walk through deploying a Spark cluster in Docker containers.

Technology accumulation

Introduction to Spark

Spark is an open-source cluster computing environment similar to Hadoop, but with some important differences. Spark introduces in-memory distributed datasets, which, in addition to supporting interactive queries, also optimize iterative workloads. Because Spark performs its distributed computation in memory, performance is greatly improved.

Spark core features and advantages

Speed:
In-memory computing makes Spark up to 100 times faster than Hadoop MapReduce.
Ease of use:
Spark provides more than 80 high-level operators.
Generality:
Spark ships with a rich set of libraries, including Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX, and developers can combine these libraries seamlessly in the same application (see the sketch after this list).
Support for multiple resource managers:
Spark runs on Hadoop YARN, Apache Mesos, and its own standalone cluster manager.
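
As a quick illustration of combining these libraries, here is a minimal PySpark sketch (my own example, not from the original post) that mixes the Spark Core RDD API and Spark SQL in a single application; it assumes pyspark is installed locally or that the script is submitted to a cluster with spark-submit:

# Minimal PySpark sketch: Spark Core (RDDs) and Spark SQL in one application.
# Assumes pyspark is available (pip install pyspark) or the script is
# submitted with spark-submit.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("library-combo-demo").getOrCreate()
sc = spark.sparkContext

# Spark Core: build an RDD and transform it in parallel
rdd = sc.parallelize(range(1, 6)).map(lambda x: (x, x * x))

# Spark SQL: turn the same data into a DataFrame and query it with SQL
df = rdd.toDF(["n", "n_squared"])
df.createOrReplaceTempView("squares")
spark.sql("SELECT n, n_squared FROM squares WHERE n_squared > 4").show()

spark.stop()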

Spark running architecture

The core of the Spark framework is a compute engine. Overall, it adopts the standard master-slave structure: the Driver acts as the master and is responsible for scheduling the jobs and tasks of the whole cluster, while the Executors act as the slaves and are responsible for actually executing the tasks.

After the user program creates a SparkContext, it connects to the cluster resource manager, which allocates computing resources to the application and starts the Executors. The Driver then divides the computation into execution stages and multiple Tasks and sends the Tasks to the Executors. The Executors run the Tasks, report their execution status back to the Driver, and also report the resource usage of their nodes to the cluster resource manager.
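
This flow maps directly onto a small driver program. Below is a hypothetical PySpark sketch (not part of the original post) that assumes it runs somewhere able to reach the standalone master at spark://master:7077, for example inside the master container; the comments point out where the Driver, the cluster manager, and the Executors come into play:

# Illustrative driver program for the flow described above.
from pyspark.sql import SparkSession

# Creating the SparkSession/SparkContext makes this process the Driver;
# it registers with the cluster manager, which starts Executors for the app.
spark = (SparkSession.builder
         .appName("architecture-demo")
         .master("spark://master:7077")   # the standalone cluster manager
         .getOrCreate())
sc = spark.sparkContext

# Transformations only describe the computation...
data = sc.parallelize(range(100), numSlices=4).map(lambda x: x * 2)

# ...the action triggers the Driver to split the job into stages and tasks
# and ship them to the Executors, which run them and report status back.
print(data.reduce(lambda a, b: a + b))

spark.stop()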

Building a Spark standalone cluster

Install Docker and docker-compose

# Install Docker CE (community edition)
yum install docker-ce
# Check the Docker version
docker version
# Download the docker-compose binary
curl -L https://github.com/docker/compose/releases/download/1.21.2/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose
# Make the binary executable
chmod +x /usr/local/bin/docker-compose
# Check the docker-compose version
docker-compose version

docker-compose orchestration

docker-compose-spark.yaml

version: "3.3"
services:
  master:
    image: registry.cn-hangzhou.aliyuncs.com/senfel/spark:3.2.1
    container_name: master
    user: root
    command: " /opt/bitnami/java/bin/java -cp /opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host master --port 7077 --webui-port 8080 "
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - ./python:/python
    network_mode: host
    extra_hosts:
      - "master:10.10.22.91"
      - "localhost.localdomain:127.0.0.1"

  worker1:
    image: registry.cn-hangzhou.aliyuncs.com/senfel/spark:3.2.1
    container_name: worker1
    user: root
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    network_mode: host
    extra_hosts:
      - "master:10.10.22.91"
      - "localhost.localdomain:127.0.0.1"
  worker2:
    image: registry.cn-hangzhou.aliyuncs.com/senfel/spark:3.2.1
    container_name: worker2
    user: root
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    network_mode: host
    extra_hosts:
      - "master:10.10.22.91"
      - "localhost.localdomain:127.0.0.1"

Run the containers with docker-compose

docker-compose -f docker-compose-spark.yaml up -d

Access the master web UI in a browser:
http://10.10.22.91:8080/

At this point, the Spark standalone cluster is up and running.
Of course, if you need to integrate HDFS, you can build a Hadoop cluster as well. I won't go into the details here; please refer to my previous blog post.

Testing the Spark cluster with the official example

1. Pick any node to run the Pi calculation; here we use the master.
# View the Spark master container information
docker ps | grep master
# Enter the container; the default working directory is /opt/bitnami/spark
docker exec -it master bash
# Run the official Pi example
./bin/spark-submit --master spark://master:7077 --class org.apache.spark.examples.SparkPi ./examples/jars/spark-examples_2.12-3.2.1.jar 1000

Parameters:
--master: the URL of the cluster to submit to
--class: the main class to run
1000: the number of slices (parallel tasks); more slices means more random samples and a more accurate estimate

2. Check the execution result.
The driver log prints a line such as: Pi is roughly 3.141485671414857.
The more slices (and therefore samples) used, the more accurate the estimate of Pi.
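
For readers who prefer Python, here is a rough PySpark equivalent of the Pi example, loosely following the logic of the official pi.py. This is a hypothetical sketch: it assumes you save it as pi_estimate.py in the ./python directory that the compose file mounts into the master container at /python, and submit it from inside the container as shown in the comment.

# pi_estimate.py -- a PySpark take on the Pi example (hypothetical sketch).
# Submit from inside the master container, e.g.:
#   ./bin/spark-submit --master spark://master:7077 /python/pi_estimate.py 100
import random
import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    slices = int(sys.argv[1]) if len(sys.argv) > 1 else 10
    samples_per_slice = 100000
    n = slices * samples_per_slice

    spark = SparkSession.builder.appName("PythonPi").getOrCreate()
    sc = spark.sparkContext

    def inside(_):
        # Draw a random point in the unit square; count it if it falls
        # inside the quarter circle of radius 1.
        x, y = random.random(), random.random()
        return 1 if x * x + y * y <= 1.0 else 0

    count = sc.parallelize(range(n), slices).map(inside).reduce(lambda a, b: a + b)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()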

Write at the end

Spark manages data with resilient distributed datasets (RDDs) and performs its distributed computation in memory, so its performance is significantly better than Hadoop's. Building a standalone Spark cluster with Docker containers is relatively simple. Of course, we can also integrate it with Spring Boot to build features that meet business needs and submit tasks to the cluster remotely.
