Spark Tutorial (1) - Getting Started and Installation

Spark Overview

I recommend reading the blog post Big Data Infrastructure first.

Spark is a general-purpose computing engine designed for large-scale data processing. It is similar to MapReduce, except that MapReduce writes its intermediate results to HDFS while Spark keeps them in memory, which makes real-time computation possible.

Spark is written in Scala and integrates seamlessly with the Scala language, and it also provides Java, Python, and R APIs.

 

Installing and Setting Up a Cluster

Step 1: Set Up the Environment

1. Install Java: straightforward; search the web for a guide if needed

2. Install a Hadoop cluster: see my blog post Hadoop cluster setup for details

3. Install Scala: the Spark tar package ships with its own Scala dependency, so no separate installation is needed

4. Python 2.7 or later: only needed if you plan to use pyspark (quick checks for all of these are shown below)
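A quick way to check that these prerequisites are in place (a sketch assuming the tools are already on your PATH; the versions on your machines will differ):

java -version
hadoop version
python -V        # only required if you plan to use pyspark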

 

Step 2: Download and Install

1. Download Spark from the official website

Download spark

Note: choose the package built for your Hadoop version

 

2. Extract the tar package

Upload the package to every node in the cluster, extract it, and set the environment variables:

export SPARK_HOME=/usr/lib/spark
export PATH=.:$HADOOP_HOME/bin:$JAVA_HOME/bin:$SPARK_HOME/bin:$PATH
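After editing the file (assuming the exports live in /etc/profile, as in the pyspark section later), reload it and confirm the variables took effect:

source /etc/profile
echo $SPARK_HOME
which spark-shell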

 

3. Configure Spark

Go into the extracted Spark directory; two files need to be configured: conf/slaves and conf/spark-env.sh.

Note that neither file exists by default; create them by copying the templates with cp:

cp slaves.template slaves
cp spark-env.sh.template spark-env.sh

 

slaves

Remove the localhost entry and add the following:

hadoop10
hadoop11
hadoop12
hadoop13

 

spark-env.sh

Add the following:

export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64
export SPARK_MASTER_IP=hadoop10
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_MEMORY=1G

SPARK_MASTER_IP and SPARK_MASTER_PORT specify the master node and the port it listens on;

SPARK_WORKER_MEMORY sets how much memory each worker may use for computation; since Spark computes in memory, more is generally better.
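If you also want to cap the CPU usage of each worker, spark-env.sh accepts a similar setting; this is optional, and the value here is only an example:

# optional: number of cores each worker may use
export SPARK_WORKER_CORES=2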

 

4. Copy the configuration to the other nodes

scp -r conf/ root@hadoop11:/usr/lib/spark
scp -r conf/ root@hadoop12:/usr/lib/spark
scp -r conf/ root@hadoop13:/usr/lib/spark

 

5. Start Spark

cd /usr/lib/spark/sbin
./start-all.sh

To stop the cluster, run the corresponding stop script.
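For example, run Spark's own script from its sbin directory (Hadoop ships start/stop scripts with the same names, so be careful which one you call):

cd /usr/lib/spark/sbin
./stop-all.sh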

 

6. Verify successful start

6.1 Check the processes with jps

The master node should show two processes, Master and Worker;

the other nodes should show only the Worker process.
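In addition to any Hadoop processes that are already running, the jps output should look roughly like this (the process IDs are illustrative):

# on the master node
2351 Master
2433 Worker

# on the other nodes
1987 Worker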

 

6.2 In a browser, visit http://192.168.10.10:8080/ (the Spark master web UI)

 

Step 3: Operate the Spark Cluster

As with a Hadoop cluster, the client commands for operating Spark are in Spark's bin directory.

1. spark-shell mode [Scala]

Enter the command:

spark-shell

# parameters can also be set
spark-shell --master spark://hadoop10:7077 --executor-memory 600m
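To confirm that the cluster actually accepts jobs without typing any Scala, the examples bundled with Spark can also be submitted from the command line (a sketch; spark://hadoop10:7077 matches the master configured above):

cd /usr/lib/spark
./bin/run-example --master spark://hadoop10:7077 SparkPi 10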

 

 

2. pyspark mode [Python]

Enter the command pyspark.

If NameError: name 'memoryview' is not defined appears, your Python version is too old; it must be 2.7 or above.

 

If you want to start pyspark directly from a Python program, you need to configure /etc/profile:

# python can call pyspark directly
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/pyspark:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
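After reloading /etc/profile, a quick way to confirm that a plain python process can import pyspark (a minimal local-mode check, not tied to the cluster):

source /etc/profile
python -c "from pyspark import SparkContext; sc = SparkContext('local', 'check'); print(sc.parallelize(range(10)).sum()); sc.stop()"   # prints 45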

 

 

 

