Spark notes, day one (installation and components)

Introduction to Spark
Spark is a fast, versatile, and scalable big-data analytics engine. It was born in 2009 at UC Berkeley's AMPLab, open-sourced in 2010, became an Apache incubator project in June 2013, and a top-level Apache project in February 2014. Today the Spark ecosystem has grown into a collection of sub-projects, including Spark SQL, Spark Streaming, GraphX, and MLlib; at its core, Spark is a memory-based parallel computing framework for big data. Because computation happens in memory, Spark improves the real-time performance of processing in big-data environments while keeping high fault tolerance and scalability, and it lets users deploy Spark on large numbers of cheap machines to form a cluster.

Why learn Spark
Intermediate result output: a MapReduce-based compute engine usually writes intermediate results to disk for storage and fault tolerance. When pipelined jobs are considered, translating certain queries into MapReduce tasks tends to produce many Stages, and these chained Stages in turn rely on the underlying file system (such as HDFS) to store the output of every Stage.


Spark is an alternative to MapReduce. It is compatible with HDFS and Hive and can be integrated into the Hadoop ecosystem, making up for MapReduce's shortcomings.

Spark features

Fast: compared with Hadoop MapReduce, Spark's in-memory computation is more than 100 times faster, and even its disk-based computation is more than 10 times faster. Spark implements an efficient DAG execution engine that processes data streams efficiently in memory.
Easy to use: Spark offers APIs for Java, Python, and Scala, along with more than 80 high-level operators, so users can build different kinds of applications quickly. Spark also ships interactive Python and Scala shells, which make it very convenient to try out solutions against a Spark cluster.
General: Spark provides a unified solution. It can be used for batch processing, interactive queries (Spark SQL), real-time stream processing (Spark Streaming), machine learning (Spark MLlib), and graph computation (GraphX), and these different kinds of processing can be combined seamlessly in the same application (a small sketch after this list shows the RDD and SQL APIs working together). A unified stack is very attractive: companies want one platform to handle whatever problems they meet, which cuts the labor cost of development and maintenance and the material cost of deploying multiple platforms.
Compatibility: Spark integrates easily with other open-source products. For example, it can use Hadoop YARN or Apache Mesos as its resource manager and scheduler, and it can process all the data sources Hadoop supports, including HDFS, HBase, and Cassandra. This matters especially for users who already run Hadoop clusters: they can use Spark's processing power without migrating any data. Spark can also run without a third-party resource manager and scheduler: its built-in Standalone mode handles resource management and scheduling itself, which further lowers the barrier to entry so that anyone can deploy and use Spark. In addition, Spark provides tools for deploying Standalone clusters on EC2.
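As a small illustration of the "General" point above, the following spark-shell sketch processes the same data first as an RDD and then with Spark SQL. This is only a sketch: it assumes a Spark 1.5-era shell where sc is predefined, and the Person case class and the "people" table name are made up for the example.

// Runs inside spark-shell, where sc (the SparkContext) is predefined.
// Person and the "people" table are illustrative names, not from the original post.
case class Person(name: String, age: Int)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
// Batch-style RDD processing...
val people = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 17)))
// ...and SQL over the same data, in the same program.
val df = people.toDF()
df.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age >= 18").show()
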
Installing Spark

Upload the Spark installation package to the Linux machine, then extract it to the target location.

tar -zxvf spark-1.5.1-bin-hadoop2.4.tgz -C /home/tyler/apps
Configuring Spark
Go to the Spark installation directory:

cd /home/tyler/apps/spark-1.5.1-bin-hadoop2.4
Enter the conf directory, then rename and edit spark-env.sh.template:

cd conf/
mv spark-env.sh.template spark-env.sh
vi spark-env.sh
Add the following to the configuration file:

export JAVA_HOME=/home/tyler/apps/jdk1.8.0_181
export SPARK_MASTER_IP=Tyler01
export SPARK_MASTER_PORT=7077
Save and exit.
Rename and edit slaves.template:

mv slaves.template slaves
vi slaves
Add the hostnames of the child nodes (the Worker nodes) to the file:

Tyler01
Tyler02
Tyler03
Save and exit.
Copy the configured Spark to the other nodes.
Send the Spark installation on Tyler01 to Tyler02 and Tyler03, respectively:

scp -r spark-1.5.1-bin-hadoop2.4 Tyler02:$PWD
scp -r spark-1.5.1-bin-hadoop2.4 Tyler03:$PWD
The Spark cluster is now configured, with one Master and three Workers. Start the Spark cluster on Tyler01:

/home/tyler/apps/spark-1.5.1-bin-hadoop2.4/sbin/start-all.sh
After starting the cluster, run the jps command: the primary node should show a Master process, and the other child nodes should show Worker processes. You can also log in to the Spark web UI (on the primary node) to view the cluster status:

http://192.168.72.110:8080/
Starting the Spark shell
spark-shell is the interactive shell program that ships with Spark. It makes interactive programming convenient: users can write Spark programs in Scala directly at the command line.

/home/tyler/apps/spark-1.5.1-bin-hadoop2.4/bin/spark-shell
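To check that the shell works, you can run a small word count. This is only a sketch: the input path below is an assumption (any text file on the node will do), and passing --master spark://Tyler01:7077 when launching spark-shell is one way to run it against the Standalone master configured above rather than locally.

// Runs inside spark-shell, where sc is predefined.
// The input path is only an example; replace it with a real text file.
val lines = sc.textFile("/home/tyler/apps/spark-1.5.1-bin-hadoop2.4/README.md")
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.take(10).foreach(println)
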
Basic components of the Spark architecture

ClusterManager: in Standalone mode this is the Master (the primary node), which controls the whole cluster and monitors the Workers; in YARN mode it is the ResourceManager.
Worker: a slave node, the compute node; it is responsible for starting an Executor or the Driver as instructed. In YARN mode this is the NodeManager, which controls the compute node.
Driver: runs the Application's main() function and creates the SparkContext.
Executor: the executor component; it runs tasks on a Worker node and starts a thread pool for running them. Each Application has its own independent set of Executors.
SparkContext: the context of the whole application; it controls the application's life cycle.
RDD: Spark's basic unit of computation. A group of RDDs can form a directed acyclic execution graph, the RDD Graph.
DAGScheduler: builds Stages from a job (Job) according to the DAG and submits the Stages to the TaskScheduler.
TaskScheduler: distributes tasks (Task) to Executors for execution.
SparkEnv: the thread-level context that stores references to important runtime components. SparkEnv creates and holds references to the following components:
MapOutputTracker: responsible for storing Shuffle meta information.
BroadcastManager: responsible for controlling broadcast variables and storing their meta information.
BlockManager: responsible for storage management, and for creating and looking up blocks.
MetricsSystem: monitors runtime performance metrics.
SparkConf: responsible for storing configuration information.
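To make these roles concrete, here is a minimal sketch of a Driver program (not from the original post): it creates a SparkConf and SparkContext, builds a small RDD graph with transformations, and triggers a job with an action, which is where the DAGScheduler and TaskScheduler come in. The application name and master URL are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object WordCountDriver {
  def main(args: Array[String]): Unit = {
    // The Driver creates the SparkContext, the context of the whole application.
    // "spark://Tyler01:7077" is a placeholder Standalone master URL.
    val conf = new SparkConf().setAppName("WordCountDriver").setMaster("spark://Tyler01:7077")
    val sc = new SparkContext(conf)

    // Transformations only build the RDD graph; no job runs yet.
    val words = sc.parallelize(Seq("spark", "hadoop", "spark"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // The action triggers the DAGScheduler and TaskScheduler to run tasks on Executors.
    counts.collect().foreach(println)

    sc.stop()
  }
}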
The overall Spark execution flow
A Client submits an application, and the Master finds a Worker on which to start the Driver. The Driver requests resources from the Master (or another resource manager) and then turns the application into an RDD Graph, a directed acyclic graph of RDDs. The DAGScheduler splits this RDD Graph into Stages and submits them to the TaskScheduler, and the TaskScheduler submits the tasks to Executors for execution. While the tasks run, the other components cooperate to make sure the whole application executes smoothly.
----------------
Disclaimer: this is an original article by the CSDN blogger "TylerPY", licensed under the CC 4.0 BY-SA copyright agreement; please include the original source link and this statement when reproducing it.
Original link: https://blog.csdn.net/TylerPY/article/details/102688635
