1.Spark Overview
Spark is a fast, general-purpose cluster computing platform.
In terms of speed, Spark extends the widely used MapReduce computation model while efficiently supporting additional computation modes, including interactive queries and stream processing. Speed matters when working with large data sets: it is the difference between exploring data interactively and waiting minutes or even hours for each operation to finish.
A key feature of Spark is its ability to perform computations in memory, which makes it fast. Even for complex jobs that must spill to disk, Spark is still more efficient than MapReduce.
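As a minimal sketch of what in-memory computation means in practice (assuming a running spark-shell and a hypothetical input file /opt/datas/big.txt), caching a dataset keeps it in memory so that repeated actions skip the disk read:
scala> val ds = spark.read.textFile("/opt/datas/big.txt")
scala> ds.cache()    // mark the dataset for in-memory storage
scala> ds.count()    // first action reads from disk and populates the cache
scala> ds.count()    // later actions hit the in-memory copy and run much faster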
2.Spark ecosystem
3.Spark learning sites
1) Databricks website
2) Spark official website
3) GitHub website
4.Spark 2.x source code download and compilation
1) Download the Spark 2.2 source code to the /opt/softwares directory on node bigdata-pro02.kfk.com.
Extract it:
tar -zxf spark-2.2.0.tgz -C /opt/modules/
2) Environment required to compile Spark 2.2: Maven 3.3.9 and Java 8
3) Ways to compile the Spark source: Maven, SBT (not used here), and packaging with make-distribution.sh
a) Download and install JDK 8
tar -zxf jdk8u11-linux-x64.tar.gz -C /opt/modules/
b) Configure JAVA_HOME in /etc/profile
vi /etc/profile
export JAVA_HOME=/opt/modules/jdk1.8.0_11
export PATH=$PATH:$JAVA_HOME/bin
After saving and exiting, apply the changes:
source /etc/profile
c) If the system still cannot load the new version because an older JDK is installed, remove the old one:
rpm -qa | grep jdk
rpm -e --nodeps <jdk-package-name>
which java    # if this still resolves to /usr/bin/java, delete /usr/bin/java
d) Download and unzip Maven
Download Maven
Extract Maven:
tar -zxf apache-maven-3.3.9-bin.tar.gz -C /opt/modules/
Configure MAVEN_HOME:
vi /etc/profile
export MAVEN_HOME=/opt/modules/apache-maven-3.3.9
export PATH=$PATH:$MAVEN_HOME/bin
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=1024M -XX:ReservedCodeCacheSize=1024M"   # note: Java 8 ignores -XX:MaxPermSize and only prints a warning
After saving and exiting, apply the changes:
source /etc/profile
Check the Maven version:
mvn -version
e) Edit the version variables in make-distribution.sh so the script skips detecting them with Maven, which speeds up compilation:
VERSION=2.2.0
SCALA_VERSION=2.11.8
SPARK_HADOOP_VERSION=2.5.0
# Enable Spark-on-Hive support
SPARK_HIVE=1
4) Compile the Spark source with make-distribution.sh:
./dev/make-distribution.sh --name custom-spark --tgz -Phadoop-2.5 -Phive -Phive-thriftserver -Pyarn
# note: check that the -Phadoop profile exists in your Spark version's pom.xml; the Spark 2.2 build ships hadoop-2.6 and hadoop-2.7 profiles, so -Phadoop-2.6 -Dhadoop.version=2.5.0 may be needed instead
# After compilation completes, extract the distribution tarball
tar -zxf spark-2.2.0-bin-custom-spark.tgz -C /opt/modules/
5.Scala installation and environment variable settings
1) Download Scala 2.11.8
2) Extract it:
tar -zxf scala-2.11.8.tgz -C /opt/modules/
3) Configure the environment variables:
vi /etc/profile
export SCALA_HOME=/opt/modules/scala-2.11.8
export PATH=$PATH:$SCALA_HOME/bin
4) After saving and exiting, apply the changes:
source /etc/profile
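As an optional sanity check, start the Scala REPL and ask for the version it reports:
scala
scala> util.Properties.versionString    // should print something like "version 2.11.8"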
6.Spark 2.2 local mode test
1) Start spark-shell and run a quick test:
./bin/spark-shell
scala> val textFile = spark.read.textFile("README.md")
textFile: org.apache.spark.sql.Dataset[String] = [value: string]
scala> textFile.count()
res0: Long = 126
scala> textFile.first()
res1: String = # Apache Spark
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.sql.Dataset[String] = [value: string]
scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15
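Building on the same textFile dataset, a slightly richer expression in the style of the official quick start finds the number of words in the longest line (the exact value depends on the README contents):
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)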
2) Word count (word frequency statistics)
a) Create a local file stu.txt:
vi /opt/datas/stu.txt
hadoop storm spark
hbase spark flume
spark dajiangtai spark
hdfs mapreduce spark
hive hdfs solr
spark flink storm
hbase storm es
solr dajiangtai scala
linux java scala
python spark mlib
kafka spark mysql
spark it scala
azkaban oozie mysql
storm storm storm
scala mysql is
spark spark spark
b) Run the word count in spark-shell (a more idiomatic variant appears after the two commands below):
./bin/spark-shell
scala> val rdd = spark.read.textFile("/opt/datas/stu.txt")
# Word frequency count
scala> val counts = rdd.flatMap(x => x.split(" ")).map(x => (x, 1)).rdd.reduceByKey((a, b) => a + b).collect
# Sort by word frequency: swap to (count, word), sort by key, then swap back
scala> val sorted = rdd.flatMap(x => x.split(" ")).map(x => (x, 1)).rdd.reduceByKey((a, b) => a + b).map(x => (x._2, x._1)).sortByKey().map(x => (x._2, x._1)).collect
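For comparison, a more idiomatic RDD version of the same job (a sketch; `top` is just an illustrative name, and rdd.rdd converts the Dataset to an RDD first since the source variable is named rdd) sorts by count directly with sortBy instead of swapping the tuple twice; ascending = false puts the most frequent words first:
scala> val top = rdd.rdd.flatMap(line => line.split(" ")).map(w => (w, 1)).reduceByKey(_ + _).sortBy(_._2, ascending = false).collect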
7.Spark web UI monitoring
Check the Spark service through its web UI (port 4040 is served while the driver, e.g. spark-shell, is running):
bigdata-pro01.kfk.com:4040
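The exact URL can also be confirmed from inside spark-shell (sc is the SparkContext the shell creates automatically); a small hedged check:
scala> sc.uiWebUrl    // e.g. Some(http://bigdata-pro01.kfk.com:4040)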