Big Data Technology: Spark Basics

Chapter 1 Spark Overview
1.1 What is Spark

1.2 Spark built-in modules

Spark Core: implements the basic functions of Spark, including modules for task scheduling, memory management, fault recovery, and interaction with storage systems. Spark Core also contains the API definition of the Resilient Distributed Dataset (RDD).
Spark SQL: the package Spark provides for working with structured data. Through Spark SQL we can query data with SQL or with the Apache Hive SQL dialect (HQL). Spark SQL supports multiple data sources, such as Hive tables, Parquet, and JSON.
Spark Streaming: the component Spark provides for streaming computation over real-time data. It offers an API for manipulating data streams that closely mirrors the RDD API in Spark Core.
Spark MLlib: a library of common machine learning (ML) functionality, including classification, regression, clustering, and collaborative filtering, plus supporting functionality such as model evaluation and data import.
Cluster Manager: Spark is designed to scale computation efficiently from one node to thousands of nodes. To achieve this with maximum flexibility, Spark can run on various cluster managers, including Hadoop YARN, Apache Mesos, and the simple scheduler that ships with Spark, called the Standalone scheduler.
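To make the division of labour between these modules concrete, here is a minimal sketch (assuming a Spark 2.x build with the spark-sql dependency on the classpath in addition to spark-core; the object name ModulesDemo is illustrative only) that touches the Spark Core RDD API and Spark SQL:

import org.apache.spark.sql.SparkSession

object ModulesDemo {
  def main(args: Array[String]): Unit = {
    // SparkSession is the unified entry point; it wraps a SparkContext (Spark Core)
    val spark = SparkSession.builder()
      .appName("ModulesDemo")
      .master("local[*]")   // local mode, only for illustration
      .getOrCreate()

    // Spark Core: the RDD API
    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4))
    println(rdd.map(_ * 2).reduce(_ + _))   // prints 20

    // Spark SQL: query structured data with SQL
    import spark.implicits._
    val df = Seq(("hello", 2), ("spark", 1)).toDF("word", "cnt")
    df.createOrReplaceTempView("t")
    spark.sql("SELECT word FROM t WHERE cnt > 1").show()

    spark.stop()
  }
}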
Spark is backed by many big data companies, including Hortonworks, IBM, Intel, Cloudera, MapR, Pivotal, Baidu, Alibaba, Tencent, JD, Ctrip, and Youku Tudou. Baidu applies Spark to its big search, direct access (Zhidahao), and Baidu Big Data businesses; Alibaba uses GraphX to build large-scale graph computation and graph mining systems and has implemented recommendation algorithms for many production systems; Tencent's Spark cluster has reached 8,000 nodes, currently the largest known Spark cluster in the world.
1.3 Spark features

Chapter 2 Spark Operation Mode
2.1 Spark Installation Address
1. Official website address
http://spark.apache.org/
2. Documentation address
https://spark.apache.org/docs/2.1.1/
3. Download link
https://spark.apache.org/downloads.html
2.2 Cluster roles
2.2.1 Master and Worker
1) Master
The Master is the leader of Spark's standalone resource scheduling system. It manages the resource information of the whole cluster, similar to the ResourceManager in the Yarn framework. Main functions:
(1) Monitor the Workers to check whether they are working normally;
(2) Manage Workers and Applications: receive Worker registrations and manage all Workers; receive Applications submitted by clients, schedule waiting Applications (FIFO), and submit them to Workers.
2) Worker
A Worker is a slave of Spark's standalone resource scheduling system; there are multiple Workers. Each Worker manages the resource information of the node it runs on, similar to the NodeManager in the Yarn framework. Main functions:
(1) Register with the Master via RegisterWorker;
(2) Send heartbeats to the Master periodically;
(3) Configure the process environment according to the Application sent by the Master, and start the StandaloneExecutorBackend (the temporary process that executes Tasks).
2.2.2 Driver and Executor
1) Driver (driver)
The Spark driver is the process that executes the main method of your program. It runs the user code that creates the SparkContext, creates RDDs, and performs RDD transformations and actions. If you use the spark-shell, a Spark driver program is automatically launched in the background when the shell starts; it is the SparkContext object preloaded in the shell as sc. If the driver program terminates, the Spark application ends as well. Mainly responsible for:
(1) Turn user programs into tasks
(2) Track the running status of Executor
(3) Schedule tasks for executor nodes
(4) Display the application's running status through the UI
2) Executor (executor)
A Spark Executor is a worker process responsible for running the tasks of a Spark job; the tasks are independent of each other. Executors are started when the Spark application starts, and they live for the whole lifetime of the application. If an Executor node fails or crashes, the Spark application can continue to run: the tasks on the failed node are rescheduled onto other Executor nodes. Mainly responsible for:
(1) Responsible for running the tasks that make up the Spark application and returning status information to the driver process;
(2) Provide in-memory storage, via its own Block Manager, for RDDs that the user program asks to cache. The RDD is cached directly inside the Executor process, so tasks can make full use of the cached data to speed up computation at runtime.
Summary: Master and Worker are Spark daemon processes, i.e. processes required for Spark to run normally in a given deploy mode. Driver and Executor are temporary processes that are started only when a specific job is submitted to the Spark cluster.
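As a concrete illustration of point (2) above, the short sketch below (a minimal example written for this article, not part of the original tutorial; the input directory is assumed to exist) asks the Executors' Block Managers to keep an RDD in memory between two actions:

import org.apache.spark.{SparkConf, SparkContext}

object CacheDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CacheDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // The Driver (this main method) only builds the RDD lineage
    val words = sc.textFile("input").flatMap(_.split(" "))

    // cache() marks the RDD so that each Executor's Block Manager keeps
    // its computed partitions in memory after the first action
    words.cache()

    println(words.count())             // first action: reads the files, fills the cache
    println(words.distinct().count())  // second action: reuses the cached partitions

    sc.stop()
  }
}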
2.3 Local mode
2.3.1 Overview

2.3.2 Installation and use
1) Upload and decompress the spark installation package
[atguigu@hadoop102 sorfware]$ tar -zxvf spark-2.1.1-bin-hadoop2.7.tgz -C /opt/module/
[atguigu@hadoop102 module]$ mv spark-2.1.1-bin-hadoop2.7 spark
2) Run the official PI example
[atguigu@hadoop102 spark]$ bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--executor-memory 1G \
--total-executor-cores 2 \
./examples/jars/spark-examples_2.11-2.1.1.jar \
100
(1) Basic syntax
bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
(2) Parameter description
--master: specifies the address of the Master;
--class: the entry class of your application (e.g. org.apache.spark.examples.SparkPi);
--deploy-mode: whether to launch the driver on a worker node (cluster) or locally as a client (client). Default: client;
--conf: any Spark configuration property in key=value format. If the value contains spaces, wrap it in quotes: "key=value";
application-jar: the packaged application jar, including dependencies. The URL must be globally visible inside the cluster, e.g. an hdfs:// shared storage path, or a file:// path that is present on every node;
application-arguments: the arguments passed to the main() method;
--executor-memory 1G: sets the available memory of each executor to 1G;
--total-executor-cores 2: sets the total number of CPU cores used by the application to 2.
3) Result
The example uses the Monte Carlo method to estimate PI.
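For reference, the sketch below shows the same Monte Carlo idea in Scala (a simplified version written for this article, not the actual org.apache.spark.examples.SparkPi source): random points are thrown into the unit square, and the fraction that lands inside the quarter circle approximates PI/4.

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object MonteCarloPi {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MonteCarloPi").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val n = 1000000
    // Sample n random points in the unit square and count how many fall
    // inside the quarter circle x^2 + y^2 <= 1; that fraction is about PI/4.
    val inside = sc.parallelize(1 to n).map { _ =>
      val x = Random.nextDouble()
      val y = Random.nextDouble()
      if (x * x + y * y <= 1) 1 else 0
    }.reduce(_ + _)

    println(s"Pi is roughly ${4.0 * inside / n}")
    sc.stop()
  }
}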

4) Prepare files
[atguigu@hadoop102 spark]$ mkdir input
Create files 1.txt and 2.txt under input and enter the following content:
hello atguigu
hello spark
5) Start spark-shell
[atguigu@hadoop102 spark]$ bin/spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/09/29 08:50:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/09/29 08:50:58 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.9.102:4040
Spark context available as 'sc' (master = local[*], app id = local-1538182253312).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/

Using Scala version 2.11.8 (Java HotSpot™ 64-Bit Server VM, Java 1.8.0_144)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
Open another terminal window:
[atguigu@hadoop102 spark]$ jps
3627 SparkSubmit
4047 Jps
You can open hadoop102:4040 in a browser to view the running program.

6) Run the WordCount program
scala> sc.textFile("input").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect
res0: Array[(String, Int)] = Array((hadoop,6), (oozie,3), (spark,3), (hive,3), (atguigu,3), (hbase,6))

scala>
You can open hadoop102:4040 in a browser to view the running program.

2.3.3 Submit process
1) Submit task analysis

2.3.4 Data flow
textFile("input"): reads the data in the local input directory;
flatMap(_.split(" ")): flattening operation; splits each line into words on the space separator;
map((_, 1)): operates on each element, mapping each word to a tuple (word, 1);
reduceByKey(_ + _): aggregates the values by key and adds them up;
collect: collects the result to the Driver for display.
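Written step by step in the same spark-shell session (where sc is already defined), the pipeline above looks like this; the intermediate variable names are added here only for readability:

scala> val lines  = sc.textFile("input")          // RDD[String], one element per line
scala> val words  = lines.flatMap(_.split(" "))   // RDD[String], one element per word
scala> val pairs  = words.map((_, 1))             // RDD[(String, Int)], e.g. ("hello", 1)
scala> val counts = pairs.reduceByKey(_ + _)      // RDD[(String, Int)], summed per word
scala> counts.collect().foreach(println)          // brings the result to the Driver and prints it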

2.4 Standalone mode
2.4.1 Overview
Build a Spark cluster composed of Master+Slave, and Spark runs in the cluster.

2.4.2 Installation and use
1) Enter the conf folder under the spark installation directory
[atguigu@hadoop102 module]$ cd spark/conf/
2) Modify the configuration file name
[atguigu@hadoop102 conf]$ mv slaves.template slaves
[atguigu@hadoop102 conf]$ mv spark-env.sh.template spark-env.sh
3) Modify the slaves file and add the worker nodes:
[atguigu@hadoop102 conf]$ vim slaves

hadoop102
hadoop103
hadoop104
4) Modify the spark-env.sh file and add the following configuration:
[atguigu@hadoop102 conf]$ vim spark-env.sh

SPARK_MASTER_HOST=hadoop102
SPARK_MASTER_PORT=7077
5) Distribute the spark package
[atguigu@hadoop102 module]$ xsync spark/
6) Start the cluster
[atguigu@hadoop102 spark]$ sbin/start-all.sh
[atguigu@hadoop102 spark]$ util.sh
atguigu@hadoop102
3330 Jps
3238 Worker
3163 Master
atguigu@hadoop103
2966 Jps
2908 Worker
atguigu@hadoop104
2978 Worker
3036 Jps
View the web UI at: hadoop102:8080
Note: If you encounter a "JAVA_HOME not set" exception, you can add the following configuration to the spark-config.sh file in the sbin directory:
export JAVA_HOME=XXXX
7) Run the official PI example
[atguigu@hadoop102 spark]$ bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
./examples/jars/spark-examples_2.11-2.1.1.jar \
100

8) Start the spark shell
/opt/module/spark/bin/spark-shell \
--master spark://hadoop102:7077 \
--executor-memory 1g \
--total-executor-cores 2
Parameter: --master spark://hadoop102:7077 specifies the Master of the cluster to connect to.
Run the WordCount program:
scala> sc.textFile("input").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect
res0: Array[(String, Int)] = Array((hadoop,6), (oozie,3), (spark,3), (hive,3), (atguigu,3), (hbase,6))

scala>
2.4.3 JobHistoryServer configuration
1) Rename spark-defaults.conf.template
[atguigu@hadoop102 conf]$ mv spark-defaults.conf.template spark-defaults.conf
2) Modify the spark-defaults.conf file to enable event logging:
[atguigu@hadoop102 conf]$ vi spark-defaults.conf
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://hadoop102:9000/directory
Note: the directory on HDFS needs to exist in advance.
3) Modify the spark-env.sh file and add the following configuration:
[atguigu@hadoop102 conf]$ vi spark-env.sh

export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080
-Dspark.history.retainedApplications=30
-Dspark.history.fs.logDirectory=hdfs://hadoop102:9000/directory"
Parameter description:
spark.eventLog.dir: all information generated while an application runs is recorded in the path specified by this property.
spark.history.ui.port=18080: the port of the history server web UI is 18080.
spark.history.fs.logDirectory=hdfs://hadoop102:9000/directory: once this property is configured, there is no need to specify the path explicitly when running start-history-server.sh; the Spark History Server page only shows the information under the specified path.
spark.history.retainedApplications=30: the number of Application histories to keep; when this value is exceeded, the oldest application information is deleted. This is the number of applications kept in memory, not the number shown on the page.
4) Distribute the configuration files
[atguigu@hadoop102 conf]$ xsync spark-defaults.conf
[atguigu@hadoop102 conf]$ xsync spark-env.sh
5) Start the history server
[atguigu@hadoop102 spark]$ sbin/start-history-server.sh
6) Execute the task again
[atguigu@hadoop102 spark]$ bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://hadoop102:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
./examples/jars/spark-examples_2.11-2.1.1.jar \
100
7) View the history server
hadoop102:18080

2.4.4 HA configuration

Figure 1: HA architecture
1) Zookeeper is installed and started normally
2) Modify the spark-env.sh file and add the following configuration:
[atguigu@hadoop102 conf]$ vi spark-env.sh

Comment out the following content:
#SPARK_MASTER_HOST=hadoop102
#SPARK_MASTER_PORT=7077
Add the following content:
export SPARK_DAEMON_JAVA_OPTS="
-Dspark.deploy.recoveryMode=ZOOKEEPER
-Dspark.deploy.zookeeper.url=hadoop102,hadoop103,hadoop104
-Dspark.deploy.zookeeper.dir=/spark"
3) Distribution configuration file
[atguigu@hadoop102 conf]$ xsync spark-env.sh
4) Start all nodes on hadoop102
[atguigu@hadoop102 spark]$ sbin/start-all.sh
5) Start a second Master separately on hadoop103
[atguigu@hadoop103 spark]$ sbin/start-master.sh
6) Spark HA cluster access
/opt/module/spark/bin/spark-shell \
--master spark://hadoop102:7077,hadoop103:7077 \
--executor-memory 2g \
--total-executor-cores 2
2.5 Yarn mode
2.5.1 Overview
The Spark client connects directly to Yarn; there is no need to build a separate Spark cluster. There are two modes, yarn-client and yarn-cluster, and the main difference is where the Driver program runs.
yarn-client: the Driver runs on the client. Suitable for interaction and debugging, when you want to see the application's output immediately.
yarn-cluster: the Driver runs inside the ApplicationMaster (AM) started by the ResourceManager (RM). Suitable for production environments.

2.5.2 Installation and use
1) Modify the hadoop configuration file yarn-site.xml and add the following content:
[atguigu@hadoop102 hadoop]$ vi yarn-site.xml


<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>

<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
2) Modify spark-env.sh and add the following configuration:
[atguigu@hadoop102 conf]$ vi spark-env.sh

YARN_CONF_DIR=/opt/module/hadoop-2.7.2/etc/hadoop
3) Distribute the configuration file
[atguigu@hadoop102 conf]$ xsync /opt/module/hadoop-2.7.2/etc/hadoop/yarn-site.xml
4) Run a program
[atguigu@hadoop102 spark]$ bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
./examples/jars/spark-examples_2.11-2.1.1.jar \
100
Note: HDFS and YARN cluster need to be started before submitting the task.
2.5.3 Log View
1) Modify the configuration file spark-defaults.conf and add the following content:
spark.yarn.historyServer.address=hadoop102:18080
spark.history.ui.port=18080
2) Restart the spark history service
[atguigu@hadoop102 spark]$ sbin/stop-history-server.sh
stopping org.apache.spark.deploy.history.HistoryServer
[atguigu@hadoop102 spark]$ sbin/start-history-server.sh
starting org.apache.spark.deploy.history.HistoryServer, logging to /opt/module/spark/logs/spark-atguigu-org.apache.spark. deploy.history.HistoryServer-1-hadoop102.out
3) Submit the task to Yarn for execution
[atguigu@hadoop102 spark]$ bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
./examples/jars/spark-examples_2.11-2.1.1.jar \
100
4) View log on Web page

2.6 Mesos mode
The Spark client connects directly to Mesos; there is no need to build a separate Spark cluster. Mesos is rarely used in China, where Yarn scheduling is much more common.
2.7 Comparison of Several Modes

Mode         Machines with Spark installed   Processes to start   Owner
Local        1                               None                 Spark
Standalone   3                               Master and Worker    Spark
Yarn         1                               Yarn and HDFS        Hadoop
Chapter 3 Case Practical Operation
The Spark shell is mostly used only for testing and verifying our programs. In a production environment, a program is usually written in an IDE, packaged into a jar, and then submitted to the cluster. The most common approach is to create a Maven project and use Maven to manage the jar dependencies.
3.1 Writing the WordCount program
1) Create a Maven project WordCount and import the dependency

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.1.1</version>
    </dependency>
</dependencies>
<build>
    <finalName>WordCount</finalName>
    <plugins>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.0.0</version>
            <configuration>
                <archive>
                    <manifest>
                        <mainClass>WordCount</mainClass>
                    </manifest>
                </archive>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
2) Write code
package com.atguigu

import org.apache.spark.{SparkConf, SparkContext}

object WordCount{

  def main(args: Array[String]): Unit = {

    //1. Create SparkConf and set the App name
    val conf = new SparkConf().setAppName("WC")

    //2. Create SparkContext; this object is the entry point for submitting a Spark App
    val sc = new SparkContext(conf)

    //3. Use sc to create an RDD and run the corresponding transformations and actions
    sc.textFile(args(0)).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_, 1).sortBy(_._2, false).saveAsTextFile(args(1))

    //4. Close the connection
    sc.stop()
  }
}
3) Package and test on the cluster
bin/spark-submit \
--class WordCount \
--master spark://hadoop102:7077 \
WordCount.jar \
/word.txt \
/out
3.2 Local debugging
Debugging a Spark program locally uses the local submit mode: the local machine is the running environment, and both Master and Worker are local. You can then set breakpoints and debug directly while the program runs, as follows.
When creating the SparkConf, set the additional property that indicates local execution:
val conf = new SparkConf().setAppName("WC").setMaster("local[*]")
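Putting this together, a minimal locally debuggable main method could look like the sketch below (an illustrative variant of the WordCount program above, assuming an input directory exists on the local machine):

import org.apache.spark.{SparkConf, SparkContext}

object LocalDebugWordCount {
  def main(args: Array[String]): Unit = {
    // local[*] keeps the Driver and Executors inside this JVM,
    // so breakpoints set in the IDE are hit directly
    val conf = new SparkConf().setAppName("WC").setMaster("local[*]")
    val sc = new SparkContext(conf)

    sc.textFile("input")             // a local directory, assumed to exist
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)

    sc.stop()
  }
}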
If the local operating system is Windows and the program uses Hadoop-related functionality, such as writing files to HDFS, you will encounter the following exception:

The cause is not a bug in the program but the use of Hadoop-related services on Windows. The solution is to unpack the attached hadoop-common-bin-2.7.3-x64.zip into any directory.

Then configure the Run Configuration in IDEA and add a HADOOP_HOME variable pointing to that directory.
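As an alternative sometimes used instead of the Run Configuration setting (this is an assumption, not from the original text), the hadoop.home.dir system property can be set in code before the SparkContext is created; the path below is only an example:

// Set this at the very beginning of main(), before creating the SparkContext.
// The path is an example; point it at the directory where the zip was unpacked.
System.setProperty("hadoop.home.dir", "D:\\hadoop-common-bin-2.7.3-x64")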
