This article will take your understanding of high-performance Spark to the next level.

Introduction (excerpt from "China New Telecommunications"): "In recent years, big data has become a hot topic in both industry and academia. The rapid growth of storage capacity, the substantial increase in CPU processing power, and ever-growing network bandwidth have together provided strong technical support for the big data era. From Web 1.0 to Web 2.0, every user has become a self-publisher and a provider of Internet content, and this change in the way data is generated is what has driven the arrival of the big data era."

What is big data?

Big data consists of structured and unstructured data: roughly 10% is structured data stored in various databases, and the other 90% is unstructured data such as pictures, video, e-mail, and web pages. Today, big data applications have penetrated every industry, and data-driven decision-making has greatly raised the level of intelligence in the information society. At present, domestic work concentrates on data-mining algorithms, practical applications, and related theory, and it covers a wide range of industries, including retail, manufacturing, financial services, telecommunications, Internet-related businesses, medicine, and the health sciences; the organizations involved are mainly universities, research institutes, and companies, especially in emerging Internet fields. Giants such as Alibaba, Tencent, and Baidu have played a huge role in driving the technology forward, and for big data processing these Internet giants have all adopted processing frameworks such as Hadoop and Spark.

What is Spark?

Apache Spark is a big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 by the AMPLab at the University of California, Berkeley, and became an Apache open-source project in 2010. Compared with other big data and MapReduce technologies such as Hadoop and Storm, Spark has the following advantages:

  • Spark provides a comprehensive, unified framework to manage big data processing across data sets with very different characteristics (text data, graph data, etc.) and data sources (batch data or real-time streaming data).
  • According to the official figures, Spark applications can run up to 100 times faster than Hadoop MapReduce when the data is held in memory, and up to 10 times faster when running from disk.
Architecture and ecosystem:

What is the Spark ecosystem? The Spark ecosystem is called BDAS (the Berkeley Data Analytics Stack). This article briefly introduces some of the common components of the Spark ecosystem, so that you gain a basic understanding of BDAS and know what each component can do.

Component introduction
  • Spark Core: Spark's core component. The data object it operates on is the RDD (resilient distributed dataset), and the four components above Spark Core in the stack all depend on it. Put simply, Spark Core can be regarded as the offline (batch) computing framework of the Spark ecosystem; for example, the map and reduce operators provided by Spark Core can complete the same computing tasks as the MapReduce engine (see the word-count sketch after this list).
  • Spark Streaming: the stream-computing framework of the Spark ecosystem. The data object it operates on is the DStream. Internally, Spark Streaming decomposes a stream into a series of short batch jobs, and the batch engine underneath is Spark Core: the input data is divided into segments according to the batch interval (for example, one second), each segment is turned into an RDD in Spark Core, and the DStream operations written by the user are translated into the corresponding RDD transformations. In other words, a DStream is represented internally by a sequence of RDDs over consecutive time intervals (the data received between second 0 and second 1 becomes one RDD, the data received between second 1 and second 2 becomes the next), and Spark Streaming carries out DStream operations by applying the corresponding Spark Core operators (functions) to those RDDs. A minimal streaming example appears after this list.
  • Spark SQL: simply put, Spark SQL lets users express computations in SQL. The SQL interpreter converts the SQL into Spark Core tasks, so that people who know SQL but not Spark can still process data by writing SQL; its role is similar to that of Hive in the Hadoop ecosystem. Spark SQL also provides a CLI (command-line interface) in which SQL can be written directly.
  • Spark GraphX: the graph-computation and parallel graph-processing component of the Spark ecosystem. Newer versions currently support six classic graph algorithms, including PageRank, triangle counting, maximum connected components, and shortest paths.
  • Spark MLlib: Spark's scalable machine-learning library. It packages many common algorithms, including binary classification, regression, clustering, and collaborative filtering, and is intended for machine-learning and statistical scenarios.
  • Tachyon: a distributed in-memory file system, which can be thought of as an in-memory HDFS.
  • Local, Standalone, YARN, Mesos: Spark's four deployment modes. Local mode is generally used for development and testing; Standalone is Spark's built-in resource-management framework; YARN and Mesos are two other resource-management frameworks. Choosing a deployment mode is essentially choosing which resource-management framework Spark uses.
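
To make the operator style of Spark Core concrete, here is a minimal word count written in spark-shell style, where `sc` is the SparkContext the shell predefines; the input path is only a placeholder and would need to point at a real file.

```scala
// spark-shell style sketch: `sc` is the SparkContext the shell provides.
val lines  = sc.textFile("hdfs:///tmp/input.txt")   // placeholder path
val counts = lines
  .flatMap(_.split("\\s+"))                         // transformation: split lines into words
  .map(word => (word, 1))                           // transformation: pair each word with 1
  .reduceByKey(_ + _)                               // transformation: sum counts per key (causes a shuffle)

counts.take(10).foreach(println)                    // action: triggers the actual job
```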
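
The Spark Streaming model described above can also be sketched in a few lines: a one-second batch interval turns the socket input into a DStream whose segments are RDDs. This is only an illustration; the host, port, and interval are arbitrary example values.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    // Each 1-second slice of input becomes one RDD inside the DStream.
    val ssc = new StreamingContext(conf, Seconds(1))

    // socketTextStream is used purely for illustration; host and port are assumptions.
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```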

Usually, when the amount of data to be processed exceeds what a single machine can handle (for example, our computer has 4 GB of memory but we need to process more than 100 GB of data), we can choose to compute on a Spark cluster. Sometimes the amount of data is not large, but the computation is very complex and needs a lot of time; in that case we can also take advantage of the powerful computing resources of a Spark cluster to parallelize the computation. The Spark stack is organized as follows:

  • Spark Core: contains Spark's basic functionality, in particular the API that defines RDDs and the operations and actions on them. Spark's other libraries are all built on top of RDDs and Spark Core.
  • Spark SQL: provides an API for interacting with Spark through HiveQL, the SQL dialect of Apache Hive. Each database table is treated as an RDD, and Spark SQL queries are converted into Spark operations.
  • Spark Streaming: processes and manages real-time data streams. Spark Streaming allows programs to handle real-time data in the same way as ordinary RDDs.
  • MLlib: a library of common machine-learning algorithms implemented as Spark operations on RDDs. It contains scalable learning algorithms, such as classification and regression, that need to iterate over large data sets.
  • GraphX: a set of algorithms and tools for manipulating graphs and performing parallel graph computation. GraphX extends the RDD API to include operations for creating graphs, building subgraphs, and accessing all the vertices on a path.
  • The Spark runtime architecture is composed of the following components:
  • Cluster Manager: in Standalone mode this is the Master node, which controls the entire cluster and monitors the Workers; in YARN mode it is the resource manager.
  • Worker node: a slave (compute) node, responsible for starting the Executor or Driver processes.
  • Driver: runs the main() function of the Application.
  • Executor: an executor process that runs on a Worker node on behalf of an Application.
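
The Driver role can be illustrated with a minimal driver program: main() runs in the Driver process, which plans the work and ships tasks to the Executors. This is only a sketch; the table, rows, and query are invented for the example, and it uses the SparkSession API, which wraps SparkContext.

```scala
import org.apache.spark.sql.SparkSession

object MinimalDriver {
  def main(args: Array[String]): Unit = {
    // main() runs in the Driver; local[*] keeps everything on one machine for testing.
    val spark = SparkSession.builder()
      .appName("MinimalDriver")
      .master("local[*]")
      .getOrCreate()

    // A tiny invented data set, registered as a view so it can be queried with SQL.
    val people = spark.createDataFrame(Seq(("Alice", 34), ("Bob", 29))).toDF("name", "age")
    people.createOrReplaceTempView("people")

    // The SQL is translated into Spark jobs whose tasks run on the Executors.
    spark.sql("SELECT name FROM people WHERE age > 30").show()
    spark.stop()
  }
}
```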

Spark and Hadoop

  • Hadoop has two core modules: the distributed storage module HDFS and the distributed computing module MapReduce.
  • Spark itself does not provide a distributed file system, so Spark analysis mostly relies on Hadoop's distributed file system, HDFS.
  • Both Hadoop's MapReduce and Spark can perform data computation; compared with MapReduce, Spark is faster and provides richer functionality.
  • The flow of running a Spark application is as follows:
  • The Spark Application's runtime environment is set up and a SparkContext is started.
  • The SparkContext applies to the resource manager (which can be Standalone, Mesos, or YARN) for Executor resources and starts StandaloneExecutorBackend processes (see the configuration sketch after this list).
  • The Executors request Tasks from the SparkContext.
  • The SparkContext distributes the application code to the Executors.
  • The SparkContext builds the DAG, decomposes the DAG into Stages, and sends the TaskSets to the Task Scheduler; finally the Task Scheduler sends the Tasks to the Executors to run.
  • The Tasks run on the Executors, and when they finish, all resources are released.
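
The resource-request step of this flow is driven by the application's configuration. The sketch below shows how an application might declare the Executor resources it wants before the SparkContext registers with the cluster manager; the numbers are arbitrary examples, not recommendations.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ResourceDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ResourceDemo")
      .setMaster("yarn")                     // or spark://..., mesos://..., local[*]
      .set("spark.executor.instances", "4")  // how many Executors to request (YARN)
      .set("spark.executor.memory", "2g")    // memory per Executor
      .set("spark.executor.cores", "2")      // CPU cores per Executor

    // Creating the SparkContext registers with the resource manager and requests the Executors.
    val sc = new SparkContext(conf)
    sc.stop()
  }
}
```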

Spark runtime characteristics:

  • Each Application gets its own dedicated Executor processes, which stay up for the lifetime of the Application and run Tasks in a multithreaded fashion. This isolation mechanism between Applications is advantageous both from a scheduling standpoint (each Driver schedules its own tasks) and from an execution standpoint (Tasks from different Applications run in different JVMs). Of course, it also means that data cannot be shared between Spark Applications unless it is written to an external storage system.
  • Spark is independent of the resource manager: as long as it can obtain Executor processes and keep communicating with them, it does not care which resource manager is used.
  • The Client that submits the SparkContext should be close to the Worker nodes (the nodes running the Executors), preferably in the same rack, because a great deal of information is exchanged between the SparkContext and the Executors while a Spark Application runs.
  • Tasks are optimized using data locality and speculative execution (see the configuration sketch below).
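
The two mechanisms in the last bullet are controlled by ordinary Spark settings. The snippet below is only illustrative: the values are example defaults, not tuning advice.

```scala
import org.apache.spark.SparkConf

// Illustrative settings for data locality and speculative execution.
val conf = new SparkConf()
  .setAppName("LocalityAndSpeculation")
  .set("spark.locality.wait", "3s")   // how long to wait for a data-local slot before falling back
  .set("spark.speculation", "true")   // re-launch suspiciously slow tasks on another Executor
```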

Spark operating modes:

Spark's operating modes are varied and flexible. When deployed on a single machine, Spark can run either in local mode or in pseudo-distributed mode; when deployed on a distributed cluster, there are also a number of operating modes to choose from, depending on the actual situation of the cluster. The underlying resource scheduling can either rely on an external resource-scheduling framework or use Spark's built-in Standalone mode; among external resource-scheduling frameworks, the relatively stable implementations currently include Mesos mode and Hadoop YARN mode.

  • Standalone: independent cluster mode
    Standalone mode uses Spark's own resource-scheduling framework. It adopts the typical Master/Slaves architecture, and ZooKeeper is used to provide HA for the Master.
  • The framework is structured as follows:
    This mode involves the node where the Client runs, the Master node, and the Worker nodes. The Driver can run either on the Master node or on the local Client side. When a Job is submitted with the interactive spark-shell tool, the Driver runs on the Master node; when a Job is submitted with the spark-submit tool, or the Spark task is launched from a development platform such as Eclipse or IDEA using "new SparkConf().setMaster(\"spark://master:7077\")", the Driver runs on the local Client side (see the sketch after the steps below). The execution process is as follows (reference: http://blog.csdn.net/gamer_gyt/article/details/51833681):
  1. The SparkContext connects to the Master, registers with it, and applies for resources (CPU cores and memory);
  2. The Master decides on which Workers to allocate resources, based on the SparkContext's resource request and the information reported in the Workers' heartbeats, acquires the resources on those Workers, and then starts StandaloneExecutorBackend processes;
  3. The StandaloneExecutorBackend registers with the SparkContext;
  4. The SparkContext sends the Application code to the StandaloneExecutorBackend; the SparkContext also parses the Application code, builds the DAG, and submits it to the DAG Scheduler, which decomposes it into Stages (when an Action is encountered, a Job is spawned; each Job contains one or more Stages, and Stages are generally created before external data is fetched or before a shuffle). The Stages (also called TaskSets) are then submitted to the Task Scheduler, which is responsible for assigning Tasks to the corresponding Workers and finally handing them to the StandaloneExecutorBackend for execution;
  5. The StandaloneExecutorBackend builds an Executor thread pool, begins executing Tasks, and reports to the SparkContext until the Tasks are complete;
  6. After all Tasks have finished, the SparkContext deregisters from the Master and releases the resources.
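
For reference, connecting to a standalone Master from an IDE looks roughly like the snippet below, so that the Driver stays on the local Client side as described above; the host name and port are placeholders for your own Master URL.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StandaloneClientDemo {
  def main(args: Array[String]): Unit = {
    // "spark://master:7077" is a placeholder standalone Master URL.
    val conf = new SparkConf()
      .setAppName("StandaloneClientDemo")
      .setMaster("spark://master:7077")

    val sc = new SparkContext(conf)   // registers with the Master and requests resources (step 1 above)
    sc.stop()
  }
}
```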
Summary: another major use of Spark can be described from the perspective of engineers. Here, "engineers" means the large number of software developers who use Spark to build production data-processing applications. These developers understand software-engineering concepts and principles such as encapsulation, interface design, and object-oriented programming; they usually have a degree in computer science, and they use their software-engineering skills to design and implement the software systems behind business scenarios. For engineers, Spark provides a simple way to parallelize applications across a cluster while hiding the complexity of distributed systems, network communication, and fault tolerance. The system lets engineers accomplish their tasks while still having enough control to monitor, inspect, and tune their applications, and the modular API makes it easy to reuse existing work and to test locally.

Finally, let us briefly go over a few questions that are asked very often in interviews.

Similarities and differences between the Hadoop shuffle and the Spark shuffle:

From a high-level point of view, the two are not very different. Both partition the output of the mapper (in Spark, the ShuffleMapTask) and send different partitions to different reducers (in Spark, the reducer may be the ShuffleMapTask of the next stage, or a ResultTask). The reducer uses memory as a buffer and aggregates the data while shuffling it; once all the data has been aggregated, it runs reduce() (in Spark, this may be a series of subsequent operations).

From a low-level point of view, the differences are considerable. Hadoop MapReduce is sort-based: the records entering combine() and reduce() must first be sorted. The advantage is that combine()/reduce() can process very large amounts of data, because their input can be obtained by streaming merges (the mapper first sorts each chunk of shuffled data, and the reducer merge-sorts the sorted chunks). Spark's earlier default was hash-based, usually using a HashMap to aggregate the shuffled data, without sorting the data in advance. If the data needs to be sorted, you have to call something like sortByKey() yourself; if you are a Spark 1.1 user, you can set spark.shuffle.manager to sort, which will sort the data. From Spark 1.2 onward, sort is the default shuffle implementation.

From the implementation point of view, the two also differ in many ways. The processing flow of Hadoop MapReduce is divided into clear phases: map(), spill, merge, shuffle, sort, reduce(), and so on. Each phase has its own responsibility, so the flow can be implemented phase by phase in a procedural-programming style. In Spark there are no such clearly separated phases, only a series of transformations forming different stages, so operations such as spill, merge, and aggregate have to be embedded inside the transformations.
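
The practical consequence of the hash-based behavior described above is that a Spark shuffle does not sort for you: if sorted output is needed it must be requested explicitly. A small spark-shell style sketch (the data is made up):

```scala
// `sc` is the spark-shell's predefined SparkContext; the pairs are invented sample data.
val pairs  = sc.parallelize(Seq(("b", 2), ("a", 1), ("b", 3)))

val summed = pairs.reduceByKey(_ + _)   // shuffles by key, but does not sort
val sorted = summed.sortByKey()         // explicit sort, as the text above suggests

sorted.collect().foreach(println)       // (a,1) then (b,5)
```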

MapReduce and Spark are both parallel-computing frameworks, so what are their similarities and differences?

  • The unit of work in Hadoop is called a job. A job is divided into map tasks and reduce tasks, each task runs in its own process, and when a task ends, its process ends as well.
  • In Spark, the work submitted by a user is called an application, and one application corresponds to one SparkContext. An application contains multiple jobs, and a job is generated every time an action is triggered. Jobs can be executed serially or in parallel. Each job has multiple stages; the DAGScheduler produces the stages by splitting the job along the shuffle dependencies between RDDs. Each stage contains multiple tasks, which the TaskScheduler groups into a TaskSet and distributes to the executors for execution. An executor has the same lifecycle as the application and stays alive even when no job is running, so tasks can start quickly and compute on data held in memory (see the sketch after this list).
  • A Hadoop job has only map and reduce operations, which limits its expressiveness; moreover, an MR workflow reads and writes HDFS repeatedly, causing a lot of I/O, and the relationships between multiple jobs have to be managed by hand.
  • Spark performs iterative computation in memory, its API provides many RDD operations such as join and groupBy, and it achieves good fault tolerance through the DAG (lineage).
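
The job/stage/task terminology above can be seen directly in a few lines of spark-shell code: transformations are lazy and only build up the lineage, and each action submits a separate job (whose stages and tasks then appear in the Spark UI). A sketch with invented data:

```scala
// `sc` is the spark-shell's predefined SparkContext.
val nums    = sc.parallelize(1 to 1000000)
val squares = nums.map(x => x.toLong * x)   // transformation only: nothing runs yet

val total = squares.reduce(_ + _)           // action -> job 1
val top5  = squares.top(5)                  // action -> job 2, reusing the same lineage
```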

How do you tune Spark?

Tuning Spark is fairly involved, but it can roughly be divided into three areas: 1) platform-level tuning: avoid distributing unnecessary jar packages, improve data locality, and choose efficient storage formats such as Parquet; 2) application-level tuning: filter early so that operators process less data, reduce the resource overhead of individual records, handle data skew, cache RDDs that are reused, and execute jobs in parallel where appropriate; 3) JVM-level tuning: set a suitable amount of resources, configure the JVM sensibly, enable efficient serialization methods such as Kryo, and increase off-heap memory, among other things.
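
A few of these knobs can be sketched in code. The setting names below are real Spark configuration keys, but the values, the file path, and the column names are purely illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object TuningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TuningSketch")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // efficient Kryo serialization
      .config("spark.executor.memory", "4g")                                    // a suitable amount of resources
      .getOrCreate()

    // Parquet as an efficient columnar storage format; the path is a placeholder.
    val events = spark.read.parquet("hdfs:///tmp/events.parquet")

    // Filter early so later operators see less data, and cache what is reused.
    val recent = events.filter("event_date >= '2020-01-01'")
      .persist(StorageLevel.MEMORY_AND_DISK)

    recent.count()                                  // materializes the cache
    recent.groupBy("user_id").count().show()        // reuses the cached data
    spark.stop()
  }
}
```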



Origin blog.csdn.net/weixin_44598691/article/details/105021603