1. Spark overview
Spark is a platform for fast, general-purpose cluster computing.
In terms of speed, Spark extends the widely used MapReduce computing model and efficiently supports additional computation modes, including interactive queries and stream processing. Speed matters when dealing with large datasets: being fast means we can work with data interactively instead of waiting minutes or even hours for each operation.
One of Spark's main features is its ability to perform computations in memory, which makes it faster. Even for complex computations that must be done on disk, Spark is still more efficient than MapReduce.
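As a rough illustration of the in-memory model described above, the sketch below (assuming a Spark 2.x spark-shell session, where the `spark` session object is predefined) caches a dataset so that repeated actions are served from memory instead of being re-read from disk:

```scala
// Sketch, assuming a Spark 2.x spark-shell (the `spark` object is predefined there).
val ds = spark.read.textFile("README.md")
ds.cache()    // mark the dataset to be kept in memory after the first action
ds.count()    // the first action reads from disk and populates the cache
ds.count()    // later actions reuse the in-memory copy
```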
2. Spark learning websites
1) Databricks website
2) Spark official website
3) GitHub repository
3. Spark 2.x source code download and compilation
1) Download the Spark 2.2 source code to the /opt/softwares/ directory on the node5 node and extract it:
tar -zxf spark-2.2.0.tgz -C /opt/modules/
2) Environment required to compile Spark 2.2: Maven 3.3.9 and Java 8
3) Ways to compile the Spark source code: Maven compilation, SBT compilation (not used here), and packaged compilation with make-distribution.sh
a) Download JDK 8 and install it
tar -zxf jdk8u11-linux-x64.tar.gz -C /opt/modules/
b) Configure JAVA_HOME in /etc/profile
export JAVA_HOME=/opt/modules/jdk1.8.0_11
export PATH=$PATH:$JAVA_HOME/bin
After saving and exiting the editor, apply the changes:
source /etc/profile
c) If the system keeps loading a previously installed JDK version, remove it:
rpm -qa | grep jdk
rpm -e --nodeps <jdk version>
d) Download and unzip Maven
tar -zxf apache-maven-3.3.9-bin.tar.gz -C /opt/modules/
Configure MAVEN_HOME in /etc/profile:
export MAVEN_HOME=/opt/modules/apache-maven-3.3.9
export PATH=$PATH:$MAVEN_HOME/bin
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=1024M -XX:ReservedCodeCacheSize=1024M"
After saving and exiting the editor, apply the changes:
source /etc/profile
Check the Maven version:
mvn -version
e) Edit make-distribution.sh so that compilation runs faster
VERSION=2.2.0
SCALA_VERSION=2.11.8
SPARK_HADOOP_VERSION=2.6.4
# support spark on hive
SPARK_HIVE=1
4) Install some compression and decompression tools before compiling
yum install -y snappy snappy-devel bzip2 bzip2-devel lzo lzo-devel lzop openssl openssl-devel
5) Maven compilation only builds the source code; after compilation the project can be imported into IDEA
mvn clean package -Phadoop-2.6 -Dhadoop.version=2.6.4 -Phive -Phive-thriftserver -Pyarn -DskipTests
6) Compile Spark with the make-distribution.sh script; the resulting package can be deployed to a production environment
./dev/make-distribution.sh --name custom-spark --tgz -Phadoop-2.6 -Dhadoop.version=2.6.4 -Phive -Phive-thriftserver -Pyarn -DskipTests
After compiling, unzip the package:
tar -zxf spark-2.2.0-bin-custom-spark.tgz -C /opt/modules/
4. Scala installation and environment variable setup
1) Download
2) Unzip
tar -zxf scala-2.11.8.tgz -C /opt/modules/
3) Configure environment variables (/etc/profile)
export SCALA_HOME=/opt/modules/scala-2.11.8
export PATH=$PATH:$SCALA_HOME/bin
4) After saving and exiting, apply the changes:
source /etc/profile
5. Spark 2.2.0 local-mode test
1) Start spark-shell and run a quick test
./bin/spark-shell
scala> val textFile = spark.read.textFile("README.md")
textFile: org.apache.spark.sql.Dataset[String] = [value: string]
scala> textFile.count()
res0: Long = 126
scala> textFile.first()
res1: String = # Apache Spark
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.sql.Dataset[String] = [value: string]
scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15
2) Word frequency statistics
a) Create a local file stu.txt with the following content:
hadoop storm spark hbase spark flume spark dajiangtai spark hdfs mapreduce spark hive hdfs solr spark flink storm hbase storm es solr dajiangtai scale linux java scale python spark mlib kafka spark mysql spark es scale azkaban oozie mysql storm storm storm scale mysql es spark spark spark
b) Word frequency statistics in spark-shell
./bin/spark-shell
scala> val rdd = spark.read.textFile("/opt/datas/stu.txt")
// word frequency statistics
scala> val lines = rdd.flatMap(x => x.split(" ")).map(x => (x, 1)).rdd.reduceByKey((a, b) => a + b).collect
// sort by word frequency
scala> val sorted = rdd.flatMap(x => x.split(" ")).map(x => (x, 1)).rdd.reduceByKey((a, b) => a + b).map(x => (x._2, x._1)).sortByKey().map(x => (x._2, x._1)).collect
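The shape of the pipeline above can be checked without a cluster. Below is a plain-Scala-collections sketch of the same word count over a small made-up sample, where `groupBy` plus a sum stands in for Spark's `reduceByKey`:

```scala
// Plain Scala collections version of the word-count pipeline above.
// The input lines are a hypothetical sample, not the full stu.txt contents.
val text = Seq("hadoop storm spark", "hbase spark flume spark")

val counts: Map[String, Int] = text
  .flatMap(_.split(" "))                 // tokenize each line into words
  .map(w => (w, 1))                      // pair every word with a count of 1
  .groupBy(_._1)                         // group pairs by word (reduceByKey's role)
  .map { case (w, pairs) => (w, pairs.map(_._2).sum) }

// Same sort-by-frequency idea as above, done with sortBy (ascending):
val sorted = counts.toSeq.sortBy(_._2)

println(counts("spark"))   // 3 in this sample
```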