Learn Spark Basics 1 - Spark Overview and Getting Started

Spark Summary
One: Spark Overview
1. Spark is a memory-based distributed computing framework
Apache Spark is a fast, general-purpose cluster computing system. Whereas Hadoop MapReduce stores intermediate results on disk,
Spark keeps intermediate results in memory and can operate on data in memory before it is ever written to disk.
1. Spark is an Apache open-source framework
2. The company behind Spark is Databricks
3. Spark was created to solve the problems of earlier computing systems such as MapReduce, which could not keep intermediate results in memory
4. The core of Spark is the RDD; an RDD is not only a computation framework but also a data structure

Two: Spark Features
1. Fast
Why is Spark faster than Hadoop?
1. MapReduce saves intermediate results to HDFS, whereas Spark stores data in memory first and only writes to disk when memory is insufficient (reading data from memory is faster than reading it from disk / HDFS)
2. MapReduce tasks run at the process level, while Spark tasks run at the thread level.
2. Easy to use
Spark provides APIs in Java, Scala, Python, R, SQL and other languages.
3. General-purpose
Spark provides a complete stack, including SQL execution, the imperative Dataset API, the MLlib machine learning library,
the GraphX graph computing framework, and Spark Streaming for stream processing.
4. Compatible
Spark runs on Hadoop YARN, Apache Mesos, Kubernetes and Spark Standalone clusters.
Spark can access a variety of data stores, including HBase, HDFS, Hive and Cassandra.
Summary:
1. Supports APIs in Java, Scala, Python and R
2. Scales to more than 8,000 nodes
3. Can cache data sets in memory to enable interactive data analysis (a small sketch follows this list)
4. Provides a command-line shell, reducing reaction time for exploratory data analysis
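As a rough illustration of point 3, here is a minimal Scala sketch (added here, not from the original article) that caches a filtered RDD so repeated interactive queries reuse the in-memory copy; the input path and the "ERROR" / "timeout" filters are hypothetical.

	import org.apache.spark.{SparkConf, SparkContext}

	object CacheSketch {
	  def main(args: Array[String]): Unit = {
	    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("cache-sketch"))
	    // Hypothetical input path; replace with a real file
	    val logs = sc.textFile("/tmp/app.log")
	    // Keep the filtered data set in memory so repeated queries avoid re-reading the file
	    val errors = logs.filter(_.contains("ERROR")).cache()
	    println(errors.count())                                // first action materializes and caches the RDD
	    println(errors.filter(_.contains("timeout")).count())  // served from the in-memory cache
	    sc.stop()
	  }
	}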

Three: Spark Components
1. Spark Core and Resilient Distributed Datasets (RDDs)
Spark's core functionality, the RDD, lives in the spark-core package, which is the heart of Spark.
Spark Core is the foundation of the whole of Spark; it provides distributed task scheduling and basic I/O functionality.
Spark's basic programming abstraction is the Resilient Distributed Dataset (RDD), a fault-tolerant data set that can be operated on in parallel.
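To make the RDD abstraction concrete, the following is a minimal, self-contained Scala sketch (an added illustration, not part of the original): it builds an RDD from a local collection, applies a lazy transformation, and triggers execution with an action. The object name and the local[2] master are arbitrary choices.

	import org.apache.spark.{SparkConf, SparkContext}
	import org.apache.spark.rdd.RDD

	object RddSketch {
	  def main(args: Array[String]): Unit = {
	    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("rdd-sketch"))
	    // Build an RDD from a local collection, split into 4 partitions processed in parallel
	    val numbers: RDD[Int] = sc.parallelize(1 to 10, numSlices = 4)
	    // Transformations are lazy; they only describe the computation
	    val squares = numbers.map(n => n * n)
	    // An action triggers the distributed execution
	    println(squares.reduce(_ + _)) // 385
	    sc.stop()
	  }
	}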
2. Spark SQL
On top of spark-core, Spark SQL introduces two new data abstractions called Dataset and DataFrame.
Spark SQL provides the ability to execute SQL on top of Datasets and DataFrames.
Spark SQL also provides a DSL for manipulating Datasets and DataFrames from Scala, Java, Python and other languages,
and it supports the SQL language through JDBC / ODBC servers.
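As a small illustration of working with DataFrames, the sketch below (an added example; the data and the people view name are made up) shows both the DSL style and the SQL style on the same data in Spark 2.x.

	import org.apache.spark.sql.SparkSession

	object SparkSqlSketch {
	  def main(args: Array[String]): Unit = {
	    val spark = SparkSession.builder().master("local[2]").appName("sql-sketch").getOrCreate()
	    import spark.implicits._

	    // A small DataFrame built from a local collection (columns: name, age)
	    val people = Seq(("Alice", 29), ("Bob", 35), ("Carol", 23)).toDF("name", "age")

	    // DSL style
	    people.filter($"age" > 25).select("name").show()

	    // SQL style on the same data, via a temporary view
	    people.createOrReplaceTempView("people")
	    spark.sql("SELECT name FROM people WHERE age > 25").show()

	    spark.stop()
	  }
	}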
3. Spark Streaming
Spark Streaming leverages spark-core's fast scheduling capability to run streaming analytics.
It slices the incoming data by time into small batches and runs RDD transformations on each batch.
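The following is a minimal Spark Streaming sketch (added for illustration): it slices a socket text stream into 5-second micro-batches and runs a word count on each batch as an RDD. The localhost:9999 source is an assumption, e.g. fed by `nc -lk 9999`.

	import org.apache.spark.SparkConf
	import org.apache.spark.streaming.{Seconds, StreamingContext}

	object StreamingSketch {
	  def main(args: Array[String]): Unit = {
	    val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-sketch")
	    // Slice the incoming stream into 5-second micro-batches; each batch is processed as an RDD
	    val ssc = new StreamingContext(conf, Seconds(5))
	    // Assumed text source on localhost:9999
	    val lines = ssc.socketTextStream("localhost", 9999)
	    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
	    counts.print()
	    ssc.start()
	    ssc.awaitTermination()
	  }
	}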
4. MLlib
MLlib is Spark's distributed machine learning framework. Thanks to Spark's distributed in-memory architecture, it is up to 10 times faster than the disk-based Apache Mahout on Hadoop, and it also scales extremely well.
MLlib provides many common machine learning and statistical algorithms, simplifying large-scale machine learning (a small sketch follows the list below):
summary statistics, correlation, stratified sampling, hypothesis testing, random data generation
support vector machines, regression, linear regression, logistic regression, decision trees, naive Bayes
collaborative filtering, ALS
K-means
SVD (singular value decomposition), PCA (principal component analysis)
TF-IDF, Word2Vec, StandardScaler
SGD (stochastic gradient descent), L-BFGS
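As one hedged example of the RDD-based MLlib API, the sketch below clusters a tiny in-memory set of 2-dimensional points with K-means; the data points and the choices k = 2, maxIterations = 20 are made up for illustration.

	import org.apache.spark.{SparkConf, SparkContext}
	import org.apache.spark.mllib.clustering.KMeans
	import org.apache.spark.mllib.linalg.Vectors

	object KMeansSketch {
	  def main(args: Array[String]): Unit = {
	    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("kmeans-sketch"))
	    // A tiny in-memory data set of 2-dimensional points
	    val points = sc.parallelize(Seq(
	      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
	      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2)
	    )).cache()
	    // Cluster into k = 2 groups with up to 20 iterations
	    val model = KMeans.train(points, k = 2, maxIterations = 20)
	    model.clusterCenters.foreach(println)
	    sc.stop()
	  }
	}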
5. GraphX
GraphX is a distributed graph computing framework. It provides a set of APIs for expressing graph computations, and the abstraction also allows the computation to be optimized.
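A minimal GraphX sketch (added here as an illustration) builds a small "follows" graph from vertex and edge RDDs and computes the in-degree of each vertex; the vertex names and relationships are invented.

	import org.apache.spark.{SparkConf, SparkContext}
	import org.apache.spark.graphx.{Edge, Graph}

	object GraphXSketch {
	  def main(args: Array[String]): Unit = {
	    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("graphx-sketch"))
	    // Vertices: (id, user name); edges: "follows" relationships
	    val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
	    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))
	    val graph = Graph(vertices, edges)
	    // Number of incoming edges per vertex
	    graph.inDegrees.collect().foreach(println)
	    sc.stop()
	  }
	}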

Four: Spark Cluster Structure
1. Spark cluster structure
1. Master
Responsible for cluster management and resource allocation
2. Worker
The node on which tasks are executed
executor: a process used to execute tasks
task: the work to be executed
3. Driver
1. Runs the main method
2. Creates the SparkContext object. The SparkContext is the entry point of a Spark program; it is responsible for splitting the job into tasks and scheduling them.
Running a Spark program goes through the following steps:
1. Start the Driver and create the SparkContext
2. The client program is submitted to the Driver, and the Driver requests resources from the Cluster Manager
3. Once the resource request is complete, Executors are started on the Workers
4. The Driver splits the program into Tasks and distributes them to the Executors for execution
Spark can use the following cluster management tools:
1. Spark Standalone
2. Hadoop YARN
3. Apache Mesos
4. Kubernetes
Commonly used: 2, 3, 4
2. When are the Driver and the Worker started?
1. In Standalone cluster mode:
Workers start when the cluster starts
Driver startup is divided into two modes:
1. In Client mode, the Driver runs on the client side and starts when the client starts
2. In Cluster mode, the Driver runs in a Worker, is launched when a job is submitted, and disappears when the tasks finish
2. In Spark on YARN mode:
1. There is no equivalent of the Worker; executors start inside Containers on the NodeManager
2. The Driver starts when a task is submitted and disappears when the task ends
What is the difference between client and cluster mode in Spark on YARN?
In client mode, the Driver runs on the client side and the client terminal cannot be closed; if it is closed the Driver disappears, task splitting and scheduling can no longer be performed, and the Spark job fails.
In cluster mode, the Driver is located inside the ApplicationMaster, so the client terminal may be closed.

Five: Cluster Setup (a personal build; companies typically build with CDH)
Preparation: plan the resource allocation in advance
Step 1: Download and unzip
Download the Spark installation package, choosing the build that matches your Hadoop version (to avoid version incompatibilities)
Download: https://archive.apache.org/dist/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
Upload and unzip it to a specific directory:
tar xzvf spark-2.2.0-bin-hadoop2.7.tgz -C /export/servers

Step 2: Modify the configuration file spark-env.sh
cd /export/servers/spark/conf
vim spark-env.sh
(append the following)
# Specify JAVA_HOME
export JAVA_HOME=/export/servers/jdk1.8.0
# Specify the Spark Master address
export SPARK_MASTER_HOST=amdha01
export SPARK_MASTER_PORT=7077

Step 3: Modify the configuration file slaves to specify the slave nodes, so that sbin/start-all.sh
can start every Worker in the whole cluster with a single command
Enter the configuration directory and copy the template configuration file, then modify it on that basis:
cd /export/servers/spark/conf
cp slaves.template slaves
vi slaves
Configure the addresses of all slave nodes:
amdha02
node03
...

Step 4: HistoryServer
By default, once a Spark program has finished running, its log can no longer be seen in the Web UI. The HistoryServer
provides a service that reads the log files, so that even after a program has finished we can still examine how it ran.
1. Copy spark-defaults.conf.template to spark-defaults.conf for modification
2. Append the following to the end of spark-defaults.conf; this configuration writes the Spark event log to HDFS:
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs://amdha01:8020/spark_log
spark.eventLog.compress true
3. Append the following to the end of spark-env.sh to configure the HistoryServer startup parameters,
so that the HistoryServer reads the Spark log written to HDFS when it starts:
# Specify the Spark History runtime parameters
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=4000 -Dspark.history.retainedApplications=3 -Dspark.history.fs.logDirectory=hdfs://amdha01:8020/spark_log"
4. Create the Spark log directory on HDFS:
hdfs dfs -mkdir -p /spark_log

Step 5: Distribute and run
Distribute the Spark installation package to the other machines in the cluster
Start the Spark Master, the Slaves, and the HistoryServer:
cd /export/servers/spark
sbin/start-all.sh
sbin/start-history-server.sh

Building a highly available Spark cluster
On the basis of the setup above:
Step 1: In spark-env.sh, add the Spark startup parameters and comment out the SPARK_MASTER_HOST address
# Specify JAVA_HOME
export JAVA_HOME=/export/servers/jdk1.8.0_141
# Specify the Spark Master address
# export SPARK_MASTER_HOST=node01
export SPARK_MASTER_PORT=7077
# Specify the Spark History runtime parameters
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=4000 -Dspark.history.retainedApplications=3 -Dspark.history.fs.logDirectory=hdfs://node01:8020/spark_log"
# Specify the Spark runtime parameters
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=node01:2181,node02:2181,node03:2181 -Dspark.deploy.zookeeper.dir=/spark"

Step 2: Distribute the configuration file to the entire cluster
Step 3: Start the cluster
Start the entire cluster on node01
Start a single additional Master on node02
Step 4: View the Master WebUI on node01 and node02

Spark service ports
Master WebUI -> node01:8080
Worker WebUI -> node01:8081
History Server -> node01:4000

Six: Running an Example Application
Step 1: Enter the Spark installation directory (this can be skipped if the environment variables are configured)
cd /export/servers/spark/
Step 2: Run a Spark task
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://node01:7077,node02:7077 \   (list the Masters that are actually running)
--executor-memory 1G \
--total-executor-cores 2 \
/export/servers/spark/examples/jars/spark-examples_2.11-2.2.3.jar \
100
Step 3: Result
Pi is roughly 3.141550671141551

Seven: Spark Getting-Started Case
There are two common ways to write and run Spark programs:
spark-shell
spark-submit
1. Writing WordCount
Step 1: Create a Maven project and add Scala support
Step 2: Write the Maven configuration file pom.xml (configure the dependencies; it can be copied)

		<properties>
			<scala.version>2.11.8</scala.version>
			<spark.version>2.2.0</spark.version>
			<slf4j.version>1.7.16</slf4j.version>
			<log4j.version>1.2.17</log4j.version>
		</properties>
		<dependencies>
			<dependency>
				<groupId>org.scala-lang</groupId>
				<artifactId>scala-library</artifactId>
				<version>${scala.version}</version>
			</dependency>
			<dependency>
				<groupId>org.apache.spark</groupId>
				<artifactId>spark-core_2.11</artifactId>
				<version>${spark.version}</version>
			</dependency>
			<dependency>
				<groupId>org.apache.hadoop</groupId>
				<artifactId>hadoop-client</artifactId>
				<version>2.6.0</version>
			</dependency>
			<dependency>
				<groupId>org.slf4j</groupId>
				<artifactId>jcl-over-slf4j</artifactId>
				<version>${slf4j.version}</version>
			</dependency>
			<dependency>
				<groupId>org.slf4j</groupId>
				<artifactId>slf4j-api</artifactId>
				<version>${slf4j.version}</version>
			</dependency>
			<dependency>
				<groupId>org.slf4j</groupId>
				<artifactId>slf4j-log4j12</artifactId>
				<version>${slf4j.version}</version>
			</dependency>
			<dependency>
				<groupId>log4j</groupId>
				<artifactId>log4j</artifactId>
				<version>${log4j.version}</version>
			</dependency>
			<dependency>
				<groupId>junit</groupId>
				<artifactId>junit</artifactId>
				<version>4.10</version>
				<scope>provided</scope>
			</dependency>
		</dependencies>

		<build>
			<sourceDirectory>src/main/scala</sourceDirectory>
			<testSourceDirectory>src/test/scala</testSourceDirectory>
			<plugins>

				<plugin>
					<groupId>org.apache.maven.plugins</groupId>
					<artifactId>maven-compiler-plugin</artifactId>
					<version>3.0</version>
					<configuration>
						<source>1.8</source>
						<target>1.8</target>
						<encoding>UTF-8</encoding>
					</configuration>
				</plugin>

				<plugin>
					<groupId>net.alchim31.maven</groupId>
					<artifactId>scala-maven-plugin</artifactId>
					<version>3.2.0</version>
					<executions>
						<execution>
							<goals>
								<goal>compile</goal>
								<goal>testCompile</goal>
							</goals>
							<configuration>
								<args>
									<arg>-dependencyfile</arg>
									<arg>${project.build.directory}/.scala_dependencies</arg>
								</args>
							</configuration>
						</execution>
					</executions>
				</plugin>

				<plugin>
					<groupId>org.apache.maven.plugins</groupId>
					<artifactId>maven-shade-plugin</artifactId>
					<version>3.1.1</version>
					<executions>
						<execution>
							<phase>package</phase>
							<goals>
								<goal>shade</goal>
							</goals>
							<configuration>
								<filters>
									<filter>
										<artifact>*:*</artifact>
										<excludes>
											<exclude>META-INF/*.SF</exclude>
											<exclude>META-INF/*.DSA</exclude>
											<exclude>META-INF/*.RSA</exclude>
										</excludes>
									</filter>
								</filters>
								<transformers>
									<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
										<mainClass></mainClass>
									</transformer>
								</transformers>
							</configuration>
						</execution>
					</executions>
				</plugin>
			</plugins>
		</build>
		
	Because pom.xml specifies the Scala source directories, create the directories src/main/scala and src/test/scala

Step 3: Write the code

			import org.apache.spark.rdd.RDD
			import org.apache.spark.{SparkConf, SparkContext}

			object WordCounts {
			  def main(args: Array[String]): Unit = {
				// 1. Create the Spark Context
				val conf = new SparkConf().setMaster("local[2]").setAppName("word_count")
				val sc: SparkContext = new SparkContext(conf)
				// 2. Read the file and compute the word frequencies
				val source: RDD[String] = sc.textFile("hdfs://node01:8020/dataset/wordcount.txt", 2)
				val words: RDD[String] = source.flatMap { line => line.split(" ") }
				val wordsTuple: RDD[(String, Int)] = words.map { word => (word, 1) }
				val wordsCount: RDD[(String, Int)] = wordsTuple.reduceByKey { (x, y) => x + y }
				// 3. Print the result
				wordsCount.collect().foreach(println)
			  }
			}

Step 4: Run
There are basically two ways to run a standalone Spark application: debug it directly in IDEA,
or submit it to a Spark cluster with spark-submit; Spark supports a variety of clusters, and different clusters have different operation modes.

spark-submit parameters for the cluster operation modes:
bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> [application-arguments]
1) --class: the main class to run (an object with a main method)
2) --master: the address of the resource manager; standalone: spark://amdha01:7077,amdha02:7077, YARN: yarn
3) --deploy-mode: the deploy mode <client / cluster>
4) --executor-cores: the number of CPU cores per executor <Spark on YARN only>
5) --executor-memory: the memory size of each executor
6) --num-executors: the number of executors <Spark on YARN only>
7) --driver-memory: the memory size of the Driver
8) --queue: the queue to submit to
9) --total-executor-cores: the total number of executor CPU cores <spark standalone or mesos mode>

spark-shell
1. How to start it: spark-shell --master <local[NUM] / yarn / spark://amdha01:7077>
2. spark-shell automatically creates a SparkContext and assigns it to sc (a word-count sketch follows this list)
Reading files from HDFS
1. Read methods:
1. sc.textFile("hdfs://amdha01:8020/...")
2. sc.textFile("hdfs:///user/") [requires the Hadoop configuration file path to be set in the configuration]
3. sc.textFile("/user/hdfs/...") [requires the Hadoop configuration file path to be set in the configuration]
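For completeness, here is a minimal word-count sketch as it could be typed into spark-shell, using the automatically created `sc` described above; the HDFS input path is hypothetical.

	// `sc` is already created by spark-shell, so the job can be typed in directly
	// (hypothetical input path on HDFS)
	val source = sc.textFile("hdfs://amdha01:8020/dataset/wordcount.txt")
	val counts = source.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
	counts.collect().foreach(println)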
