Reference https://data-flair.training/blogs/install-apache-spark-multi-node-cluster/
Download Spark from
http://spark.apache.org/downloads.html
Prepare three nodes:
192.168.1.1 [hostname] master
192.168.1.2 [hostname] slave1
192.168.1.3 [hostname] slave2
Append the above lines to /etc/hosts on each of the three machines. Since my three machines have different domains, we set [hostname] explicitly; for example, on the master node:
192.168.1.1 xxx.localdomain master
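Putting the pieces together, a completed /etc/hosts on each node might look like the following sketch (the *.localdomain names are placeholders for whatever `hostname` reports on each machine, not values from this guide):

```
192.168.1.1 master1.localdomain master
192.168.1.2 slave1.localdomain  slave1
192.168.1.3 slave2.localdomain  slave2
```

Every node should carry all three entries, so that any node can resolve any other by the short names master, slave1, and slave2.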
To check the host name:
$ hostname
If Spark later fails to start with an "unknown hostname" error, it usually means the host name has not been set; in that case
$ hostname -i
will report the same error.
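If the host name is indeed unset or wrong, it can be fixed persistently; a sketch, assuming a systemd-based distribution where hostnamectl is available:

```shell
# On the master node (run the analogous command on each slave):
sudo hostnamectl set-hostname master

# Verify:
hostname      # should now print: master
hostname -i   # should resolve via the /etc/hosts entry for this node
```

On older distributions without systemd, editing /etc/hostname and rebooting achieves the same result.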
Installation steps:
First, set up passwordless SSH login
If ssh is not installed, install it:
sudo apt install openssh-server
Run on all three machines:
ssh-keygen -t rsa
Press Enter through the prompts to accept the defaults (key file path and file name).
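The key-generation step can also be run non-interactively; a sketch that writes into a scratch directory so it is safe to rerun (on a real node you would keep the default path ~/.ssh/id_rsa, and may prefer a non-empty passphrase):

```shell
# -t rsa : key type; -N "" : empty passphrase; -f : output file; -q : quiet
KEYDIR=$(mktemp -d)
ssh-keygen -t rsa -N "" -f "$KEYDIR/id_rsa" -q
ls "$KEYDIR"   # id_rsa  id_rsa.pub
```

The .pub file is the public half that gets copied to the other nodes in the next step.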
Copy the ~/.ssh/id_rsa.pub files from slave1 and slave2 to the master node:
scp ~/.ssh/id_rsa.pub xxx@master:~/.ssh/id_rsa.pub.slave1
scp ~/.ssh/id_rsa.pub xxx@master:~/.ssh/id_rsa.pub.slave2
Note: xxx stands for the user name; it is best that all three machines use the same user name. If necessary, create the user:
adduser xxx   # create a new user xxx
passwd xxx    # set a password for xxx
Run on the master:
cat ~/.ssh/id_rsa.pub* >> ~/.ssh/authorized_keys
scp ~/.ssh/authorized_keys xxx@slave1:~/.ssh/authorized_keys
scp ~/.ssh/authorized_keys xxx@slave2:~/.ssh/authorized_keys
Verify passwordless login from the master:
ssh slave1
ssh slave2
slave1 and slave2 should likewise be able to log in to the other two nodes without a password.
Note: the .ssh folder's permissions must be 700 and the authorized_keys file's permissions must be 600 (other permission values may not work). To fix the permissions, use
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
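The permission requirement can be checked mechanically; a small sketch using a scratch directory (on a real node you would operate on ~/.ssh itself):

```shell
# Recreate the required layout in a temporary directory and verify the modes.
DEMO=$(mktemp -d)
mkdir -p "$DEMO/.ssh"
touch "$DEMO/.ssh/authorized_keys"
chmod 700 "$DEMO/.ssh"                     # rwx for owner only
chmod 600 "$DEMO/.ssh/authorized_keys"     # rw for owner only
stat -c '%a' "$DEMO/.ssh"                  # prints 700
stat -c '%a' "$DEMO/.ssh/authorized_keys"  # prints 600
```

sshd silently refuses key authentication when these modes are looser, which is why a "working" key setup can still prompt for a password.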
Second, install the JDK, Scala, and Spark
Details omitted; for Spark, simply decompress the archive downloaded above. Remember to configure the environment variables:
export JAVA_HOME=...
export SCALA_HOME=...
export SPARK_HOME=...
export PATH=$JAVA_HOME/bin:$SCALA_HOME/bin:$SPARK_HOME/bin:$PATH
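For concreteness, the block might look like this in ~/.bashrc; the /opt prefix and version numbers below are assumptions for illustration, so substitute your actual install paths:

```shell
# Hypothetical install locations -- adjust to where you unpacked each tool.
export JAVA_HOME=/opt/jdk1.8.0_201
export SCALA_HOME=/opt/scala-2.11.12
export SPARK_HOME=/opt/spark-2.4.0-bin-hadoop2.7
export PATH=$JAVA_HOME/bin:$SCALA_HOME/bin:$SPARK_HOME/bin:$PATH
```

After editing, run `source ~/.bashrc` (or log in again) so the current shell picks up the variables.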
On the master node, enter the conf directory under SPARK_HOME:
cd conf
cp spark-env.sh.template spark-env.sh
cp slaves.template slaves
Edit the slaves file:
# localhost
slave1
slave2
Edit the spark-env.sh file:
export JAVA_HOME=...
export SPARK_WORKER_CORES=8
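A slightly fuller spark-env.sh might look like the sketch below; the memory value and install path are illustrative assumptions, not recommendations from this guide, and SPARK_MASTER_HOST also matters for the worker-registration error discussed at the end:

```shell
export JAVA_HOME=/opt/jdk1.8.0_201    # hypothetical install path
export SPARK_MASTER_HOST=192.168.1.1  # master IP from the hosts list above
export SPARK_WORKER_CORES=8           # cores each worker may use
export SPARK_WORKER_MEMORY=8g         # illustrative; size to your machines
```

All of these are standard standalone-mode settings read by the start scripts on each node.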
On slave1 and slave2, perform the same operations.
Note: it is best to keep the Spark directory the same on all three nodes, i.e. the same SPARK_HOME environment variable.
Third, start the cluster
Run on the master node:
sbin/start-all.sh
To shut down the cluster, run:
sbin/stop-all.sh
After startup, run jps on the master or on slave1/slave2 to see the Java processes. View the web interface at:
http://MASTER-IP:8080/
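Once the UI is reachable, a quick way to confirm the workers actually accept jobs is to submit the bundled SparkPi example from the master; this requires the cluster started above, and the examples jar name varies by Spark version, hence the glob:

```shell
# Submit the example application to the standalone master.
# spark://MASTER-IP:7077 must match the URL shown at the top of the web UI.
$SPARK_HOME/bin/spark-submit \
  --master spark://MASTER-IP:7077 \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 100
# On success the driver output contains a line like: Pi is roughly 3.14...
```

The job should also appear under "Completed Applications" in the web UI.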
If the master does not show any worker nodes, the workers' logs report an error like the following:
Caused by: java.io.IOException: Connecting to :7077 timed out (120000 ms) ... org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run ...
We need to add the following to $SPARK_HOME/conf/spark-env.sh on all three machines:
export SPARK_MASTER_HOST=<master ip>
Then re-run
sbin/start-all.sh