News Real-Time Analysis System: Spark 2.x Environment Preparation, Compilation, and Deployment

1.Spark Overview

Spark is a fast and general-purpose cluster computing platform.

In terms of speed, Spark extends the widely used MapReduce computation model while also supporting additional computation modes, including interactive queries and stream processing. When working with large data sets, speed matters a great deal: it lets us explore data interactively, rather than waiting minutes or even hours for every operation.

A key feature of Spark is its ability to perform computations in memory, which makes it much faster. Even for complex jobs that must spill to disk, Spark is still more efficient than MapReduce.
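As a quick illustration of in-memory computation, the following minimal sketch (run in spark-shell; the input path /opt/datas/stu.txt is simply the sample file created later in section 6) caches a Dataset so that repeated actions are served from memory instead of re-reading the file from disk:

scala> val ds = spark.read.textFile("/opt/datas/stu.txt")

scala> ds.cache()      // mark the Dataset to be kept in memory

scala> ds.count()      // first action: reads the file and fills the cache

scala> ds.count()      // later actions reuse the data already held in memory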

2.Spark ecosystem

 

3.Spark learning sites

1) Databricks website

2) Spark official website

3) GitHub website

4.Spark 2.x source code download and compilation

1) Download the Spark 2.2 source code to the /opt/softwares/ directory on the bigdata-pro02.kfk.com node.

Extract it:

tar -zxf spark-2.2.0.tgz -C /opt/modules/

2) Environment required to compile Spark 2.2: Maven 3.3.9 and Java 8

3) Ways to build the Spark source: a Maven build, an SBT build (not used here), and the packaged build via make-distribution.sh

a) Download and install JDK 8

tar -zxf jdk8u11-linux-x64.tar.gz -C /opt/modules/

b) Configure JAVA_HOME in /etc/profile

vi /etc/profile

export JAVA_HOME=/opt/modules/jdk1.8.0_11

export PATH=$PATH:$JAVA_HOME/bin

After saving and exiting, make the changes take effect:

source /etc/profile

c) If the new version cannot be loaded because an older system JDK still takes precedence, remove the old JDK:

rpm -qa | grep jdk

rpm -e --nodeps <jdk package name>

which java
# If it still resolves to /usr/bin/java, delete /usr/bin/java

d) Download and unzip Maven

Download Maven

Extract Maven:

tar -zxf apache-maven-3.3.9-bin.tar.gz -C /opt/modules/

Configure MAVEN_HOME:

vi /etc/profile

export MAVEN_HOME=/opt/modules/apache-maven-3.3.9

export PATH=$PATH:$MAVEN_HOME/bin

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=1024M -XX:ReservedCodeCacheSize=1024M"

After saving and exiting, make the changes take effect:

source /etc/profile

Check the Maven version:

mvn -version

e) Edit make-distribution.sh and hard-code the following values so the build can skip version detection and run faster:

VERSION=2.2.0

SCALA_VERSION=2.11.8

SPARK_HADOOP_VERSION=2.5.0

# Support spark on hive

SPARK_HIVE=1

4) Compile the Spark source with make-distribution.sh

./dev/make-distribution.sh --name custom-spark --tgz -Phadoop-2.5 -Phive -Phive-thriftserver -Pyarn

# After the compilation completes, extract the resulting package

tar -zxf spark-2.2.0-bin-custom-spark.tgz -C /opt/modules/

5.Scala installation and environment variable settings

1) Download Scala 2.11.8

2) Extract

tar -zxf scala-2.11.8.tgz -C /opt/modules/

3) Configure the environment variables

vi /etc/profile

export SCALA_HOME=/opt/modules/scala-2.11.8

export PATH=$PATH:$SCALA_HOME/bin

4) After saving and exiting, make the changes take effect:

source /etc/profile

6.Spark 2.x local mode test

1) Start spark-shell and run a quick test

./bin/spark-shell

scala> val textFile = spark.read.textFile("README.md")

textFile: org.apache.spark.sql.Dataset[String] = [value: string]

 

scala> textFile.count()

res0: Long = 126

 

scala> textFile.first()

res1: String = # Apache Spark

 

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))

linesWithSpark: org.apache.spark.sql.Dataset[String] = [value: string]

 

scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?

res3: Long = 15
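Dataset transformations can also be chained into more complex computations in the same spark-shell session. For example, following the pattern in the official Spark quick start (the result depends on the contents of README.md), this finds the number of words in the line with the most words:

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)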

2) Word count (word frequency statistics)

a) Create a local file stu.txt

vi /opt/datas/stu.txt

hadoop  storm   spark

hbase   spark   flume

spark   dajiangtai     spark

hdfs    mapreduce      spark

hive    hdfs    solr

spark   flink   storm

hbase   storm   es     

solr dajiangtai scala

linux   java    scala

python  spark   mlib

kafka   spark   mysql

spark it scala

azkaban oozie   mysql

storm   storm   storm

scala mysql is

spark   spark   spark

b) Word count in spark-shell

./bin/spark-shell

scala> val rdd = spark.read.textFile("/opt/datas/stu.txt")

# Word frequency statistics

scala> val lines = rdd.flatMap(x => x.split("\\s+")).map(x => (x,1)).rdd.reduceByKey((a,b) => (a+b)).collect

# Sort by word frequency

scala> val lines = rdd.flatMap(x => x.split("\\s+")).map(x => (x,1)).rdd.reduceByKey((a,b) => (a+b)).map(x => (x._2,x._1)).sortByKey().map(x => (x._2,x._1)).collect
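The same word count can also be written as a standalone Scala application rather than typed into spark-shell. Below is a minimal, self-contained sketch; the object name WordCount, the local[*] master setting, and the hard-coded input path are illustrative assumptions, not part of the original steps:

import org.apache.spark.sql.SparkSession

// Minimal standalone word-count application (illustrative sketch).
object WordCount {
  def main(args: Array[String]): Unit = {
    // spark-shell creates the session automatically; a standalone app builds its own.
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")          // local mode, matching the test in section 6
      .getOrCreate()

    val lines = spark.read.textFile("/opt/datas/stu.txt")

    // Split on whitespace, count each word, and sort by descending frequency.
    val counts = lines.rdd
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)

    counts.collect().foreach(println)
    spark.stop()
  }
}

After packaging the application with Maven or sbt, it could be launched with spark-submit and would produce the same counts as the shell session above.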

7.Spark web monitoring UI

Check the Spark service through its web UI:

bigdata-pro01.kfk.com:4040

 
