Introduction to Spark
I recommend reading the blog post Big Data Infrastructure first.
Spark is a general-purpose computing engine designed for large-scale data processing. It is similar to MapReduce, except that MapReduce writes its intermediate results to HDFS while Spark writes them directly to memory, which makes near-real-time computation possible.
Spark is written in Scala, so it integrates seamlessly with Scala for development, while also providing Java, Python, and R interfaces.
Installation and cluster setup
Step 1: Install the environment
1. Install Java: very simple; search online for instructions if needed
2. Install a Hadoop cluster: see my blog post on building a Hadoop cluster for details
3. Install Scala: the Spark tarball ships with the Scala dependencies it needs, so no separate installation is required
4. Python 2.7 or later: only needed if you plan to use pyspark (a quick way to verify all of these prerequisites is shown below)
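As a sanity check, the standard version commands can confirm everything is in place; run them on every node (exact output depends on your versions):
java -version
hadoop version
python -V          # only matters if you plan to use pyspark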
Step 2: Download and install
1. Download Spark from the official website
Download Spark
Note: select the package built for your Hadoop version
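For example, releases can be fetched from the Apache archive; the version below is only an illustration, so substitute the release and Hadoop build that match your cluster:
wget https://archive.apache.org/dist/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz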
2. Extract the tar package
Upload the package to every node in the cluster, extract it, and set the environment variables:
export SPARK_HOME=/usr/lib/spark
export PATH=.:$HADOOP_HOME/bin:$JAVA_HOME/bin:$SPARK_HOME/bin:$PATH
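Putting the step together, a minimal sketch, assuming the example tarball from above, /usr/lib/spark as the install path, and the exports added to /etc/profile:
tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz -C /usr/lib/    # unpack
mv /usr/lib/spark-2.2.0-bin-hadoop2.7 /usr/lib/spark    # rename to match SPARK_HOME
source /etc/profile                                     # reload the environment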
3. Configure Spark
Go into the extracted Spark directory. Two files need to be configured: conf/slaves and conf/spark-env.sh.
Note that neither file exists by default; create each one by copying its template:
cp slaves.template slaves
cp spark-env.sh.template spark-env.sh
slaves
Remove the localhost line at the end and add the worker hostnames (they must be resolvable from the master, e.g. via /etc/hosts):
hadoop10
hadoop11
hadoop12
hadoop13
spark-env.sh
Add the following:
export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64
export SPARK_MASTER_IP=hadoop10
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_MEMORY=1G
SPARK_MASTER_IP and SPARK_MASTER_PORT specify the master node and the port it listens on;
SPARK_WORKER_MEMORY specifies how much memory each worker may use for computation; since Spark computes in memory, the more the better.
4. Send the configuration to the other nodes
scp -r conf/ root@hadoop11:/usr/lib/spark
scp -r conf/ root@hadoop12:/usr/lib/spark
scp -r conf/ root@hadoop13:/usr/lib/spark
5. Start Spark
cd /usr/lib/spark/sbin
./start-all.sh
To stop the cluster, run the corresponding ./stop-all.sh script.
6. Verify that the startup succeeded
6.1 Use jps to check the processes
The master node should show two processes, Master and Worker (since it also appears in slaves);
each worker node shows only a Worker process.
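For example, running jps on the master should print something like this (PIDs will differ):
jps
2400 Master
2481 Worker
2533 Jps
On the other nodes, only Worker (plus Jps itself) should appear.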
6.2 Open http://192.168.10.10:8080/ in a browser (the Spark master's web UI)
Step 3: Using the Spark cluster
As with a Hadoop cluster, the client commands for operating a Spark cluster live in Spark's bin directory.
1. spark-shell (Scala)
Enter the command:
spark-shell
# parameters can also be passed, e.g.:
spark-shell --master spark://hadoop10:7077 --executor-memory 600m
2. pyspark (Python)
Enter the command pyspark.
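Like spark-shell, pyspark accepts the same startup options, for example:
pyspark --master spark://hadoop10:7077 --executor-memory 600m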
If NameError: name 'memoryview' is not defined appears, your Python is older than 2.7 (memoryview was added in 2.7); check with python -V.
If you want a plain python process to start pyspark directly, you need to configure /etc/profile:
# let python import pyspark directly
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/pyspark:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
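To confirm the setting took effect, here is a quick local-mode check (a sketch; it needs no running cluster and should print 4950):
python -c "from pyspark import SparkContext; sc = SparkContext('local', 'check'); print(sc.parallelize(range(100)).sum()); sc.stop()"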