Spark 2.x environment preparation, compilation, deployment and operation on Linux (CentOS)

1. Spark overview

    Spark is a platform for fast and general cluster computing.

    In terms of speed, Spark extends the widely used MapReduce computing model and efficiently supports more computation patterns, including interactive queries and stream processing. Speed is very important when dealing with large datasets: being fast means we can work with data interactively instead of waiting minutes or even hours for each operation.

    One of the main features of Spark is its ability to perform computations in memory, which makes it faster. But even for complex computations that must spill to disk, Spark is still more efficient than MapReduce.
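
    As a minimal illustration of in-memory computation (a sketch assuming a spark-shell session, with README.md in the Spark home directory as in the local mode test later in this article), a cached Dataset is materialized in memory by the first action, and later actions reuse the cached data instead of re-reading the file:
scala> val textFile = spark.read.textFile("README.md")
scala> textFile.cache()          // mark the Dataset to be kept in memory
scala> textFile.count()          // the first action reads the file and fills the cache
scala> textFile.filter(line => line.contains("Spark")).count()   // reuses the in-memory data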

2. Spark Learning Website

1) Databricks website
2) Spark official website
3) GitHub website

3. Spark 2.x source code download and compilation

1) Download the Spark 2.2 source code to the /opt/softwares/ directory on the node5 node and unzip it
tar -zxf spark-2.2.0.tgz -C /opt/modules/
2) Environment required to compile Spark 2.2: Maven 3.3.9 and Java 8
3) Ways to compile the Spark source code: Maven compilation, SBT compilation (not covered here) and packaged compilation with make-distribution.sh
    a) Download JDK 8 and install it
tar -zxf jdk8u11-linux-x64.tar.gz -C /opt/modules/
    b) Configure JAVA_HOME in /etc/profile
export JAVA_HOME=/opt/modules/jdk1.8.0_11
export PATH=$PATH:$JAVA_HOME/bin
    After saving and exiting the editor, apply the changes
source /etc/profile
    c) If a preinstalled JDK prevents the new version from taking effect, find and remove it
rpm -qa | grep jdk
rpm -e --nodeps <jdk package name from the previous command>
    d) Download Maven and unzip it
tar -zxf apache-maven-3.3.9-bin.tar.gz -C /opt/modules/
    Configure MAVEN_HOME in /etc/profile
export MAVEN_HOME=/opt/modules/apache-maven-3.3.9
export PATH=$PATH:$MAVEN_HOME/bin
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=1024M -XX:ReservedCodeCacheSize=1024M"
    After saving and exiting the editor, apply the changes
source /etc/profile
    Check the Maven version
mvn -version
    e) Edit dev/make-distribution.sh and hard-code the following variables (so the script skips Maven's slow version detection), which makes compilation start faster
VERSION=2.2.0
SCALA_VERSION=2.11.8
SPARK_HADOOP_VERSION=2.6.4
# support spark on hive
SPARK_HIVE=1
4) Install some compression and decompression tools before compiling
yum install -y snappy snappy-devel bzip2 bzip2-devel lzo lzo-devel lzop openssl openssl-devel
5) Maven compilation only compiles the source code; the compiled project can then be imported into IDEA
mvn clean package -Phadoop-2.6 -Dhadoop.version=2.6.4 -Phive -Phive-thriftserver -Pyarn -DskipTests
6) Compile and package Spark with the make-distribution.sh script; the resulting package can be deployed to a production environment
./dev/make-distribution.sh --name custom-spark --tgz -Phadoop-2.6 -Dhadoop.version=2.6.4 -Phive -Phive-thriftserver -Pyarn -DskipTests
    Unzip after compiling
tar -zxf spark-2.2.0-bin-custom-spark.tgz -C /opt/modules/

4. Scala installation and environment variable configuration

1) Download
2) Unzip
tar -zxf scala-2.11.8.tgz -C /opt/modules/
3) Configure environment variables (/etc/profile)
export SCALA_HOME=/opt/modules/scala-2.11.8
export PATH=$PATH:$SCALA_HOME/bin
4) After saving and exiting the editor, apply the changes
source /etc/profile

5. Spark 2.2.0 local mode running test

1) Start the spark-shell test
./bin/spark-shell
scala> val textFile = spark.read.textFile("README.md")
textFile: org.apache.spark.sql.Dataset[String] = [value: string]

scala> textFile.count()
res0: Long = 126

scala> textFile.first()
res1: String = # Apache Spark

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.sql.Dataset[String] = [value: string]

scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15
2) Word frequency statistics
    a) Create a local file /opt/datas/stu.txt with the following content
hadoop storm spark
hbase spark flume
spark dajiangtai spark
hdfs mapreduce spark
hive hdfs solr
spark flink storm
hbase storm es	
solr dajiangtai scale
linux java scale
python spark mlib
kafka spark mysql
spark es scale
azkaban oozie mysql
storm storm storm
scale mysql es
spark spark spark
    b) spark-shell word frequency statistics
./bin/spark-shell
scala> val rdd = spark.read.textFile("/opt/datas/stu.txt")
// Word frequency statistics
scala> val lines = rdd.flatMap(x => x.split(" ")).map(x => (x, 1)).rdd.reduceByKey((a, b) => a + b).collect
// Sort by word frequency
scala> val sorted = rdd.flatMap(x => x.split(" ")).map(x => (x, 1)).rdd.reduceByKey((a, b) => a + b).map(x => (x._2, x._1)).sortByKey().map(x => (x._2, x._1)).collect
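
    The same word count can also be packaged as a standalone application instead of being typed into spark-shell. Below is a minimal sketch under stated assumptions: the object name WordCount, the local[*] master, and printing the result to stdout are illustrative choices rather than part of the original steps; the input path is the same /opt/datas/stu.txt as above.
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // spark-shell creates this session automatically as "spark"; a standalone app builds its own
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")               // local mode, matching the local mode test above
      .getOrCreate()

    // Same pipeline as the spark-shell version: split into words, pair each word with 1, reduce by key
    val counts = spark.read.textFile("/opt/datas/stu.txt")
      .rdd
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach(println)
    spark.stop()
  }
}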

6. Spark web monitoring page

Check the running Spark application through the web UI at http://node5:4040 (the page is available while spark-shell or another application is running)
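
    Port 4040 is the default UI port of a running application; if it is already occupied, Spark tries the next ports (4041, 4042, ...). In a standalone application (as in the WordCount sketch above), the UI port can also be pinned explicitly through the spark.ui.port configuration; the value 4050 below is only an example:
import org.apache.spark.sql.SparkSession

object UiPortExample {
  def main(args: Array[String]): Unit = {
    // Sketch: fix the application web UI port instead of using the default 4040
    val spark = SparkSession.builder()
      .appName("ui-port-example")
      .master("local[*]")
      .config("spark.ui.port", "4050")   // example port; any free port can be used
      .getOrCreate()

    // ... job code ...
    spark.stop()
  }
}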


