Spark cluster installation and simple use

1. Spark cluster installation

1.1. Installation

1.1.1. Machine deployment
Prepare two or more Linux servers and install JDK 1.7
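As a quick sanity check, the JDK installation on each server can be verified from the shell; the output should report a 1.7.x version:
java -version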
1.1.2. Download the Spark installation package

Download address: http://www.apache.org/dyn/closer.lua/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz
Upload and decompress the installation package
Upload the spark-1.5.2-bin-hadoop2.6.tgz installation package to the Linux server
Decompress the installation package to the specified location
tar -zxvf spark-1.5.2-bin-hadoop2.6.tgz -C /usr/local

1.1.3. Configure Spark
Go to the Spark installation directory
cd /usr/local/spark-1.5.2-bin-hadoop2.6
Go to the conf directory and rename and modify the spark-env.sh.template file
cd conf/
mv spark-env.sh.template spark-env.sh
vi spark-env.sh
Add the following configuration to the configuration file
export JAVA_HOME=/usr/java/jdk1.7.0_45
export SPARK_MASTER_IP=node1
export SPARK_MASTER_PORT=7077
Save and exit
Rename and modify the slaves.template file
mv slaves.template slaves
vi slaves
Add the child nodes (Worker nodes) in this file
node2
node3
node4
Save and exit
Copy the configured Spark to the other nodes
scp -r spark-1.5.2-bin-hadoop2.6/ node2:/usr/local/
scp -r spark-1.5.2-bin-hadoop2.6/ node3:/usr/local/
scp -r spark-1.5.2-bin-hadoop2.6/ node4:/usr/local/

The Spark cluster is now configured with 1 Master and 3 Workers. Start the Spark cluster on node1.itcast.cn:
/usr/local/spark-1.5.2-bin-hadoop2.6/sbin/start-all.sh

Execute the jps command after startup: there is a Master process on the master node and a Worker process on each child node. Log in to the Spark web UI to view the cluster status (master node): http://node1:8080/
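For reference, the jps output on the master node should look roughly like the following (PIDs will differ); the child nodes show a Worker process instead of Master:
2688 Master
2750 Jps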

So far, the Spark cluster has been installed, but there is a big problem: the Master node is a single point of failure. To solve this, we use ZooKeeper and start at least two Master nodes to achieve high availability. The configuration is simple:
Spark cluster planning: node1, node2 are masters; node3, node4, node5 are workers
Install and configure the ZooKeeper cluster, and start it
Stop all Spark services, modify the spark-env.sh configuration file, delete SPARK_MASTER_IP from it, and add the following configuration:
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1,zk2,zk3 -Dspark.deploy.zookeeper.dir=/spark"
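For reference, after this change the spark-env.sh for the HA setup might look roughly like this (JAVA_HOME and the port are reused from section 1.1.3, and zk1,zk2,zk3 stand in for your actual ZooKeeper hosts):
export JAVA_HOME=/usr/java/jdk1.7.0_45
export SPARK_MASTER_PORT=7077
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1,zk2,zk3 -Dspark.deploy.zookeeper.dir=/spark"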
1. On the node1 node, modify the slaves configuration file to specify the worker nodes
2. Execute the sbin/start-all.sh script on node1, and then execute sbin/start-master.sh on node2 to start the second Master
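A minimal sketch of these two steps, assuming Spark is installed under /usr/local/spark-1.5.2-bin-hadoop2.6 on both masters:
On node1:
/usr/local/spark-1.5.2-bin-hadoop2.6/sbin/start-all.sh
On node2:
/usr/local/spark-1.5.2-bin-hadoop2.6/sbin/start-master.sh
Once both are up, the web UI should show one Master as ALIVE and the other as STANDBY.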

2. Start Spark Shell

spark-shell is an interactive shell program that ships with Spark and makes interactive programming convenient: users can write Spark programs in Scala at this command line.

2.1. Start the Spark shell

Here you need to start the corresponding Spark cluster first, then run:
/root/apps/spark/bin/spark-shell --master spark://shizhan:7077 --executor-memory 2g --total-executor-cores 2

Parameter description:
--master spark://shizhan:7077 specifies the address of the Master
--executor-memory 2g specifies 2G of available memory for each worker
--total-executor-cores 2 specifies that 2 CPU cores are used across the entire cluster

Note:
If the master address is not specified when starting the spark shell, the shell still starts and programs still execute normally, but Spark is actually running in local mode: it only starts a process on the local machine and does not connect to the cluster.
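For example, launching the shell with no --master option (path reused from the example above) runs it in local mode:
/root/apps/spark/bin/spark-shell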

The SparkContext class has already been initialized by default as the object sc in the Spark shell; user code that needs it can use sc directly.
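As a quick check that sc is ready, a one-line Scala example can be typed at the scala> prompt (any small computation works; this one just sums the numbers 1 to 100):
sc.parallelize(1 to 100).reduce(_ + _)   // returns 5050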

2.2. Write a WordCount program in the Spark shell

1. First start HDFS
2. Upload a file to HDFS at hdfs://192.168.112.200:9000/words.txt
3. Write the Spark program in Scala in the Spark shell
sc.textFile("hdfs://192.168.112.200:9000/wordcount/input/README.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("hdfs://192.168.112.200:9000/out1")
4. Use the hdfs command to view the result
hdfs dfs -ls hdfs://node1.itcast.cn:9000/out/p*

Description:
sc is the SparkContext object, the entry point for submitting a Spark program
textFile("hdfs://node1.itcast.cn:9000/words.txt") reads the data from HDFS
flatMap(_.split(" ")) first maps, then flattens
map((_,1)) forms a tuple of each word and the number 1
reduceByKey(_+_) reduces by key and accumulates the values
saveAsTextFile("hdfs://node1.itcast.cn:9000/out") writes the result to HDFS
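To make each of these steps easier to inspect, the same job can also be written with intermediate values in the Spark shell; this is a sketch reusing the HDFS paths from the description above:
val lines  = sc.textFile("hdfs://node1.itcast.cn:9000/words.txt")   // read the file from HDFS
val words  = lines.flatMap(_.split(" "))                            // split each line on spaces and flatten into words
val pairs  = words.map((_, 1))                                      // pair each word with the count 1
val counts = pairs.reduceByKey(_ + _)                               // sum the counts for each word
counts.saveAsTextFile("hdfs://node1.itcast.cn:9000/out")            // write the result back to HDFS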
