Spark Study (Part 1): A Preliminary Understanding of Spark

1. What is Spark

Apache Spark™ is a unified analytics engine for large-scale data processing.
It is a memory-based parallel computing framework for big data, designed to make cluster computing fast and general-purpose. Developed by the AMP Lab at the University of California, Berkeley, it is used to build large-scale, low-latency data analysis applications. Spark extends the widely used MapReduce computational model and supports more computation patterns efficiently, including interactive queries and stream processing. A key feature of Spark is its ability to compute in memory; even for complex computations that must fall back on disk, Spark is still more efficient than MapReduce.

2. Why Learn Spark

Intermediate result output: MapReduce-based computing engines usually write intermediate results to disk, both for storage and for fault tolerance. When a pipelined workload is translated into MapReduce tasks, it tends to produce many Stages, and each Stage in the chain depends on the underlying file system (such as HDFS) to store its output.

Spark is an alternative to MapReduce. It is compatible with HDFS and Hive and can be integrated into the Hadoop ecosystem, making up for MapReduce's shortcomings.
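For instance, where a chain of MapReduce jobs would write intermediate data to HDFS between steps, Spark can keep the intermediate dataset in memory and reuse it. Below is a minimal sketch in Scala; the input path and log contents are placeholder assumptions, not from the original post:

```scala
import org.apache.spark.sql.SparkSession

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CacheSketch")
      .master("local[*]") // local mode, for illustration only
      .getOrCreate()

    // Hypothetical input path; a MapReduce pipeline would have to write the
    // filtered result back to HDFS before the next job could read it.
    val cleaned = spark.sparkContext
      .textFile("hdfs:///logs/events.txt")
      .filter(_.nonEmpty)
      .cache() // keep the intermediate RDD in memory for reuse

    println(cleaned.count())                             // first action: reads disk, fills the cache
    println(cleaned.filter(_.contains("ERROR")).count()) // second action: served from memory

    spark.stop()
  }
}
```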

Two, The Four Characteristics of Spark

1. Efficiency

Spark runs workloads up to 100 times faster than Hadoop MapReduce.

Apache Spark uses a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine to deliver high performance for both batch and streaming data.

2. Ease of Use

Spark offers APIs in Java, Python, and Scala, along with more than 80 high-level operators that let users build different applications quickly. Spark also provides interactive shells for Python and Scala, making it very easy to try out approaches to a problem directly against a Spark cluster.
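For example, in the interactive Scala shell (spark-shell), a SparkContext is predefined as `sc`, so a quick experiment takes only a few lines. The file name below is a placeholder:

```scala
// Launched with: ./bin/spark-shell
// `sc` (the SparkContext) is created automatically by the shell.
val lines = sc.textFile("README.md")               // placeholder input file
val sparkLines = lines.filter(_.contains("Spark")) // lazy transformation
sparkLines.count()                                 // action: number of matching lines
sparkLines.first()                                 // action: first matching line
```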

3. Generality

Spark provides a unified solution. It can be used for batch processing, interactive queries (Spark SQL), real-time stream processing (Spark Streaming), machine learning (Spark MLlib), and graph computation (GraphX), and these different types of processing can be combined seamlessly in the same application. A unified solution is very attractive: companies want a single platform for whatever problems they encounter, which cuts the labor cost of development and maintenance as well as the material cost of deploying multiple platforms.
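As a small illustration of this unification, the sketch below loads data with one API and then queries it with SQL in the same application. The input file people.json and its name/age fields are assumptions made for the example:

```scala
import org.apache.spark.sql.SparkSession

object UnifiedDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("UnifiedDemo")
      .master("local[*]")
      .getOrCreate()

    // Batch-style load and interactive-style SQL, side by side in one program.
    val people = spark.read.json("people.json") // hypothetical file with name/age fields
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```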

4. Compatibility

Spark integrates easily with other open-source products. For example, it can use Hadoop YARN or Apache Mesos as its resource manager and scheduler, and it can process all data sources that Hadoop supports, including HDFS, HBase, Cassandra, and others. This matters especially for users who have already deployed a Hadoop cluster, because they can tap Spark's processing power without any data migration. Spark can also run without a third-party resource manager and scheduler: its built-in Standalone mode provides resource management and scheduling on its own, which further lowers the barrier to adoption and lets anyone deploy and use Spark easily. In addition, Spark provides tools for deploying Standalone clusters on EC2.

Mesos: Spark can run on Mesos (a resource-scheduling framework similar to YARN)

Standalone: Spark allocates resources by itself (Master, Worker)

YARN: Spark can run on top of YARN

Kubernetes: Spark accepts resource scheduling from Kubernetes
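In code, the cluster manager is selected through the master URL passed to Spark. The host names and ports below are placeholders; only the URL schemes are Spark's:

```scala
import org.apache.spark.SparkConf

// Choose the cluster manager via the master URL (host:port values are placeholders).
val conf = new SparkConf().setAppName("MasterUrlDemo")
  .setMaster("spark://master:7077")               // Standalone
// .setMaster("yarn")                             // YARN (cluster details come from the Hadoop config)
// .setMaster("mesos://mesos-master:5050")        // Mesos
// .setMaster("k8s://https://k8s-apiserver:6443") // Kubernetes
// .setMaster("local[*]")                         // local mode, for testing
```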

Three, Spark Components

Spark is part of BDAS (the Berkeley Data Analytics Stack), a platform for big data applications built on large-scale integration of algorithms, machines, and people. It is a solution spanning big data, cloud computing, and communication technology.

Its main components are:

Spark Core: implements the Resilient Distributed Dataset (RDD), an abstraction over distributed data; it provides application task scheduling, RPC, serialization, and compression, and exposes the APIs on which the upper-layer components are built.

Spark SQL: the Spark package for working with structured data. It lets you query data with SQL statements, and it supports many data sources, including Hive tables, Parquet, and JSON.

Spark Streaming: the Spark component for real-time computation over streaming data (a minimal example follows this list).

MLlib: a library providing implementations of common machine learning algorithms.

GraphX: a distributed graph-computing framework that performs graph computations efficiently.

BlinkDB: an approximate query engine for interactive SQL over massive data.

Tachyon: a memory-centric, highly fault-tolerant distributed file system.
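As promised above, here is the classic Spark Streaming word count as a minimal sketch. It assumes a text source on localhost:9999 (for example, one started with `nc -lk 9999`); those details are illustrative, not from the original post:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    val ssc  = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

    // Assumed source: a socket text stream on localhost:9999.
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print() // print each batch's word counts to the console

    ssc.start()
    ssc.awaitTermination()
  }
}
```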

Four, Application Scenarios

Yahoo uses Spark in its Audience Expansion product, for click-through prediction, ad hoc queries, and similar workloads.
Taobao's technical team uses Spark's machine learning support for iterative algorithms and other computations of high complexity, applying it to content recommendation and community detection.
Tencent's big data precision recommendation system leverages Spark's fast iteration to run a high-dimensional real-time parallel algorithm covering the full pipeline of real-time data collection, real-time algorithm training, and real-time prediction, ultimately applied in production to pCTR estimation for its Guangdiantong ad delivery system.
Youku Tudou applies Spark to video recommendation (graph computation) and advertising, mainly for iterative machine learning and graph computation.

Comparison of YARN and Spark Standalone Scheduling Modes

In each pair below, the YARN role is on the left and its Spark Standalone counterpart on the right:

ResourceManager ↔ Master: manages the child nodes, schedules resources, and receives task requests
NodeManager ↔ Worker: manages the current node and its child processes
YarnChild ↔ Executor: runs the real computation logic (the Tasks)
Client + ApplicationMaster ↔ SparkSubmit: submits the app, manages the job's Executors, and submits Tasks to them

Spark's Advantages over Hadoop

Although Hadoop has become the de facto standard for big data technology, it still has many shortcomings of its own. The main one is that its MapReduce computational model has high latency and cannot satisfy real-time, fast computation needs, so it is suitable only for offline batch scenarios.

Reviewing Hadoop's workflow reveals the following disadvantages:

  1. Limited expressiveness. Every computation must be recast as the two operations Map and Reduce, which does not fit all situations and makes complex data processing hard to describe;

  2. High disk IO overhead. Input data must be read from disk on every pass, and intermediate results must be written back to disk after each computation, so IO overhead is large;

  3. High latency. One computation may have to be decomposed into a series of MapReduce tasks executed in sequence, and the IO involved in handing data between tasks adds latency. Moreover, a task cannot start before its predecessor finishes, so Hadoop copes poorly with complex, multi-stage computation tasks.

Spark mainly has the following advantages:

  1. Spark's computational model also belongs to the MapReduce family, but it is not limited to Map and Reduce operations; it offers many types of dataset operations, making its programming model more flexible than MapReduce's;

  2. Spark offers in-memory computation: intermediate results can be placed directly in memory, which makes iterative computation much more efficient;

  3. Spark's task scheduling is based on DAG execution, which is superior to MapReduce's iterative execution mechanism.

  4. Spark's biggest feature is keeping the data being computed and the intermediate results in memory, which greatly reduces IO cost.

  5. Spark provides a variety of simple, high-level APIs; for an application implementing the same functionality, the amount of Spark code is usually 2 to 5 times smaller than the Hadoop equivalent, as the sketch below suggests.
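To give a feel for that conciseness claim, here is word count in a few lines of Spark Scala; the equivalent Hadoop MapReduce job typically needs a mapper class, a reducer class, and a driver. The input and output paths are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")
      .getOrCreate()

    spark.sparkContext
      .textFile("input.txt")        // placeholder input path
      .flatMap(_.split("\\s+"))     // split lines into words
      .map(word => (word, 1))
      .reduceByKey(_ + _)           // sum the counts per word
      .saveAsTextFile("counts_out") // placeholder output directory

    spark.stop()
  }
}
```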

Spark cannot completely replace Hadoop, however; it mainly replaces the MapReduce computational model within Hadoop. In fact, Spark has become well integrated into the Hadoop ecosystem and is an important member of it: it can rely on YARN for resource scheduling and management and on HDFS for distributed storage.

Spark Basic Concepts

Before explaining Spark's runtime architecture in detail, a few important concepts need to be understood:

  • RDD: short for Resilient Distributed Dataset, an abstraction of distributed memory that provides a highly restricted shared-memory model;
  • DAG: short for Directed Acyclic Graph; it reflects the dependencies between RDDs;
  • Executor: a process running on a worker node (Worker Node) that runs Tasks and stores data for the application; it is the real unit of task execution, and one Worker Node may host multiple Executors;
  • Application: a Spark application written by the user;
  • Task: the unit of work that runs on an Executor;
  • Job: a job consists of multiple RDDs and the various operations acting on them;
  • Stage: the basic unit of job scheduling; a job is divided into multiple groups of tasks, and each group is called a "stage", or also a "task set".

• Driver Program:
  - the driver, Spark's core component
  - builds the SparkContext (the entry point of a Spark application; it creates the variables that are needed and holds the cluster's configuration information)
  - converts the job submitted by the user into a DAG (similar to a data-processing flow chart)
  - divides the DAG into multiple stages according to the partitioning strategy, thereby generating a series of tasks
  - requests resources from the resource manager (RM) according to the tasks' requirements
  - submits tasks and monitors task state
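To tie these concepts together, here is a hedged sketch in Scala; the numbers and names are illustrative. One action triggers one Job; the shuffle introduced by reduceByKey cuts the DAG into two Stages, each executed as a set of Tasks on Executors:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ConceptsDemo {
  def main(args: Array[String]): Unit = {
    // The driver program builds the SparkContext, the application's entry point.
    val sc = new SparkContext(
      new SparkConf().setAppName("ConceptsDemo").setMaster("local[*]"))

    val rdd     = sc.parallelize(1 to 1000, numSlices = 4) // an RDD with 4 partitions
    val doubled = rdd.map(_ * 2)                           // narrow dependency: same stage
    val summed  = doubled.map(n => (n % 10, n))
                         .reduceByKey(_ + _)               // shuffle: stage boundary

    // The action below triggers one Job, split into two Stages of Tasks.
    println(summed.count())
    sc.stop()
  }
}
```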


Origin: blog.csdn.net/heartless_killer/article/details/104523510