Spark technical features

1. What is Hadoop? What is the difference between Hadoop and Spark?

What is Hadoop?

  • Hadoop is an open-source framework for writing and running distributed applications that process large-scale data. It is designed for offline, large-scale data analysis, not for the online transaction processing pattern of randomly reading and writing a few records. Hadoop = HDFS (the file system, i.e. the data storage layer) + MapReduce (the data processing layer). Hadoop's data sources can be in any form; compared with a relational database it performs better on semi-structured and unstructured data and processes data more flexibly, because whatever form the data takes, it is eventually converted into key/value pairs, the basic data unit. It uses the functional MapReduce model instead of SQL: SQL is a declarative query language, while MapReduce works with scripts and code. For users accustomed to SQL on relational databases, the open-source tool Hive provides a SQL-like layer on top of Hadoop.

  • Spark is a distributed computing solution.

The difference between Hadoop and Spark

1. Different levels of problem solving

  • Hadoop is essentially a distributed data infrastructure: it distributes huge data sets across multiple nodes in a cluster of ordinary computers for storage, and it can also perform offline computation on them.

  • Spark is a tool designed specifically to process big data that is already stored in a distributed way; it does not provide distributed storage itself.

2. Spark processes data far faster than MapReduce

  • Hadoop's MapReduce performs a large amount of disk I/O. It processes data step by step: read data from the cluster, perform one processing step, write the result back to the cluster, read the updated data from the cluster, perform the next step, write the result back to the cluster, and so on.

  • Spark performs all data analysis in memory in near real time: read the data from the cluster, perform all the required analytical processing, write the results back to the cluster, done (see the word-count sketch after this comparison).

3. Disaster Recovery

  • Hadoop writes the data of every processing step to disk, so it is naturally resilient to system failures.

  • Spark's data objects are stored in Resilient Distributed Datasets (RDDs) distributed across the cluster. These data objects can be kept in memory or on disk, so RDDs also provide complete disaster recovery.

4. Processing data

  • Hadoop is suited to processing static data, but handles iterative and streaming data poorly;

  • Spark improves the performance of streaming and iterative workloads by caching processed data in memory;

5. Intermediate results

  • Hadoop stores intermediate results in HDFS, and every MapReduce step has to write them out and read them back;

  • Spark keeps intermediate results in memory and only spills them to local disk (not HDFS) when memory is insufficient, avoiding a large amount of I/O and repeated disk flush/read operations.
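To make the in-memory pipeline from point 2 above concrete, here is a minimal word-count sketch in Scala, written in spark-shell style (where `sc`, the SparkContext, is predefined); the HDFS paths are hypothetical. The intermediate RDDs stay in memory and only the final result is written back, whereas the equivalent MapReduce chain would write each step to HDFS.

```scala
// spark-shell style sketch; input and output paths are hypothetical.
val lines  = sc.textFile("hdfs:///data/pages.txt")        // read from the cluster
val counts = lines
  .flatMap(_.split("\\s+"))                               // intermediate RDDs stay in memory
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs:///data/word-counts")         // write the final result back once
```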

2. Spark technical characteristics and architectural ideas

1. Spark has four technical characteristics

Simple, Fast, Scalable, Unified (i.e. general-purpose)

As background on why Spark came about: starting from the defects of MRv1 and MRv2 and optimizing in five areas, the Spark computing engine was formed, and these optimizations underpin the four characteristics above.

(1) Defects of MRv1: Hadoop 1.x used the MRv1 MapReduce programming model. MRv1 consists of three parts: the runtime environment (JobTracker and TaskTracker), the programming model (MapReduce), and the data processing engine (MapTask and ReduceTask). Its shortcomings are: ① Poor scalability: the JobTracker is responsible for both resource management and task scheduling, so it easily becomes a bottleneck when the cluster is busy. ② Poor availability: the JobTracker is a single point of failure. ③ Low resource utilization: each TaskTracker divides the resources of its node evenly into fixed "slots", which cannot adapt to the actual needs of tasks.

(2) Defects of MRv2: MRv2 reuses the programming model and data processing engine of MRv1 but refactors the runtime environment (into YARN). Its remaining shortcoming: disk I/O is still the system performance bottleneck, so it only suits offline or batch processing and cannot support iterative, interactive, or streaming workloads.

(3) Spark was born out of optimizations in five areas: ① reduce disk I/O; ② increase parallelism; ③ avoid recomputation; ④ make shuffle sorting optional; ⑤ a flexible memory management strategy.

2. Key features

① Batch/streaming data

Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java or R.

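As a minimal sketch of this unified handling (assuming a `SparkSession` named `spark`, as predefined in spark-shell, and hypothetical HDFS paths), the same schema and aggregation can be run once as a batch job and once as a Structured Streaming job:

```scala
import org.apache.spark.sql.types._

// One schema, one aggregation, two execution modes; the paths are hypothetical.
val schema = new StructType().add("user", StringType).add("amount", DoubleType)

// Batch: process the files that are already in the directory.
val batchAgg = spark.read.schema(schema).json("hdfs:///events/")
  .groupBy("user").sum("amount")
batchAgg.show()

// Streaming: the same transformation, applied continuously as new files arrive.
val streamAgg = spark.readStream.schema(schema).json("hdfs:///events/")
  .groupBy("user").sum("amount")
streamAgg.writeStream.outputMode("complete").format("console").start()
```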

② SQL analytics

Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting. Runs faster than most data warehouses.


③ Data science at scale

Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling


④ Machine learning

Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.


3. The technical architecture of Spark

Spark is a computing engine based on the Hadoop distributed file system HDFS. The core modules of the framework are:

Spark Core: contains Spark's basic functionality, including modules for task scheduling, memory management, fault recovery, and interaction with storage systems, as well as the API definition of the resilient distributed dataset (RDD).

Spark SQL: a package for working with structured data. Through Spark SQL, SQL or HQL can be used to query various data sources, such as Hive tables, Parquet, and JSON (see the sketch after this module list).

Spark Streaming: a component for stream processing of real-time data, which lets programs handle real-time data much like ordinary RDDs.

Spark MLlib: It is a library of machine learning functions, including operations such as classification, regression, clustering, and collaborative filtering. It also provides additional support functions such as model evaluation and data import.

GraphX: a collection of algorithms and tools for graphs, parallel graph operations, and graph computation. GraphX extends the RDD API with operations for manipulating graphs, creating subgraphs, and accessing all vertices along a path.
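A minimal sketch touching two of the modules above, Spark Core (RDDs) and Spark SQL, in spark-shell style (`sc` and `spark` are predefined there; the JSON path is hypothetical):

```scala
// Spark Core: the low-level RDD API.
val pairs  = sc.parallelize(Seq(("alice", 3), ("bob", 5), ("alice", 2)))
val totals = pairs.reduceByKey(_ + _)
totals.collect().foreach(println)

// Spark SQL: structured data and SQL queries on the same engine.
val people = spark.read.json("hdfs:///data/people.json")   // hypothetical path
people.createOrReplaceTempView("people")
spark.sql("SELECT age, COUNT(*) AS n FROM people GROUP BY age").show()
```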

4. The working mechanism of Spark

The Spark execution process is as follows:

1. When an Application is submitted to a Spark cluster, a Driver process is created. The Driver initializes the Application's runtime environment, starts the SparkContext, and builds the DAGScheduler and TaskScheduler;

2. The SparkContext registers with the resource manager (Standalone, Mesos, or YARN) and applies for the Executor resources the Application needs; the Executor starts a StandaloneExecutorBackend, registers with the SparkContext, and requests Tasks;

3. The Driver executes the Application code, reads the data source, and builds RDDs from the data to be processed. Every time an Action is executed, a Job is created and submitted to the DAGScheduler;

4. The DAGScheduler divides each Job into multiple stages. Each stage determines the number of Tasks from the partitions of the RDD, builds a TaskSet, and submits the TaskSet to the TaskScheduler. The TaskScheduler submits the Tasks in each TaskSet to Executors to run;

5. Each time an Executor receives a Task, it wraps it in a TaskRunner and takes a thread from its thread pool to execute it. After the ResultTasks of the last stage finish, all resources are released.
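The steps above can be seen from a small program. The sketch below (spark-shell style, hypothetical path) only annotates in comments where the Driver, DAGScheduler, and TaskScheduler come into play; it is an illustration of the flow, not instrumented output.

```scala
// In spark-shell the shell process is the Driver: it has already created the
// SparkContext (`sc`), which registered with the cluster manager and obtained Executors.

val logs   = sc.textFile("hdfs:///logs/app.log")     // hypothetical path; lazy, no Job yet
val errors = logs.filter(_.contains("ERROR"))        // still lazy: only the RDD lineage is built

// The action below creates a Job. The DAGScheduler splits it into stages at shuffle
// boundaries (none here, so a single stage), determines one Task per partition,
// wraps them in a TaskSet, and the TaskScheduler ships the Tasks to Executors.
val n = errors.count()
println(s"error lines: $n")
```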

5. Different deployment modes of Spark

When Spark is deployed on a distributed cluster, it can rely on an external resource scheduling framework (Mesos, YARN, or EC2) or use its built-in resource scheduler. Depending on the resource scheduler, the three mainstream deployment modes are:

Standalone mode uses Spark's own resource scheduling framework for cluster management, i.e. independent mode. It is the simplest mode to deploy and does not depend on any other resource management system. Its main nodes are the Driver node, the Master node, and the Worker nodes.

Spark on YARN mode runs on the Hadoop YARN framework. YARN provides unified resource management and scheduling for upper-layer applications and has become the de facto standard for big data cluster resource management. Currently only coarse-grained mode is supported: Container resources on YARN cannot be scaled dynamically, and once a Container is started its available resources do not change.

Spark on Mesos mode runs on the Apache Mesos framework and is the officially recommended mode. Apache Mesos is a more powerful distributed resource management framework responsible for allocating cluster resources. Spark running on Mesos is more flexible than on YARN: besides coarse-grained mode it also provides a fine-grained scheduling mode, so resources can be allocated on demand.
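In code, the deployment mode is selected through the master URL (in practice it is more often passed via spark-submit's --master option). A minimal sketch with hypothetical host names and ports:

```scala
import org.apache.spark.sql.SparkSession

// The master URL selects the resource manager; host names and ports are hypothetical.
val spark = SparkSession.builder()
  .appName("deployment-demo")
  .master("spark://master-host:7077")     // Standalone: Spark's own Master/Worker processes
  // .master("yarn")                      // Spark on YARN (cluster location taken from HADOOP_CONF_DIR)
  // .master("mesos://mesos-host:5050")   // Spark on Mesos
  .getOrCreate()
```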

6. Lessons from Spark's design

One of the more distinctive aspects of the Spark project is its layered approach. It does not try to do everything itself: it still uses HDFS for storage and instead extracts and improves the computing layer separately, using RDDs to handle the storage and retrieval of data during computation. The key idea is to divide the overall process sensibly and handle it layer by layer, with each project focused on solving one problem.

7. Basic architecture of Spark

From the perspective of cluster deployment, a Spark cluster consists of the following parts:

  • Cluster Manager

  • Worker

  • Executor

  • Driver

  • Application

Cluster Manager

  • The cluster manager is mainly responsible for the allocation and management of the entire cluster resources;

  • The resources allocated by the Cluster Manager are a first-level allocation: it allocates the memory and CPU on each Worker to Applications, but it is not responsible for allocating Executor resources.

  • In YARN deployment mode, this role is played by the ResourceManager

Worker

  • Worker nodes; in YARN deployment mode this role is played by the NodeManager;

  • Responsible for the following tasks:

    • Informing the Cluster Manager of its memory, CPU, and other resources through the registration mechanism

    • Create Executors

    • Further assign resources and tasks to Executors

    • Synchronize resource information and Executor status information to Cluster Manager

Executor

  • The front-line component that actually executes Tasks

  • Mainly responsible for:

    • task execution

    • Synchronizing state information with the Worker and the Driver

Driver

  • The Application's driver program; the Application communicates with the Cluster Manager and Executors through the Driver;

  • The Driver can run inside the Application (client mode), or the Application can submit it to the Cluster Manager, which arranges for a Worker to run it (cluster mode);
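To tie the components above together, the sketch below shows how an Application can declare the Executor resources that the Cluster Manager allocates on the Workers; the master URL and resource values are hypothetical, and the configuration keys are standard Spark properties.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("resource-demo")
  .master("spark://master-host:7077")      // hypothetical Standalone Cluster Manager
  .config("spark.executor.memory", "4g")   // memory granted to each Executor on a Worker
  .config("spark.executor.cores", "2")     // CPU cores granted to each Executor
  .config("spark.cores.max", "8")          // cap on total cores for this Application (Standalone/Mesos)
  .getOrCreate()

// The Driver (this process, in client mode) now communicates with the Cluster Manager and Executors.
```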

3. What are the core mechanisms of Spark?

1. RDD

1.1 Overview of RDDs

  • RDD is the cornerstone of Spark and its core abstraction for data processing. So why did RDD come about?

  • The MapReduce model is inefficient for two common kinds of workload. The first is iterative algorithms, such as ALS in machine learning or gradient descent in convex optimization, which repeatedly query and operate on a dataset or data derived from it. MapReduce is a poor fit here: even chaining multiple MapReduce jobs serially, performance and latency remain a problem, because data sharing has to go through disk. The second is interactive data mining, which MapReduce is clearly not good at.

  • What is needed is a fast model that supports iterative computation and efficient data sharing, and Spark arose to meet this need. RDD is a working-set-based model, oriented more toward workflows.

  • But both MapReduce and RDD should have features like location awareness, fault tolerance, and load balancing .

1.2. What is RDD

  • An RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents an immutable, partitionable collection whose elements can be computed in parallel.

  • In Spark, every operation on data boils down to creating RDDs, transforming existing RDDs, or calling RDD actions to evaluate them.

It can be understood from three aspects:

  • Dataset, but read-only: an RDD is read-only; to change the data in an RDD you can only create a new RDD from the existing one.

  • Distributed / partitioned: an RDD's data may be physically stored in the disks or memory of multiple nodes, i.e. multi-level storage.

  • Resilient: although the data stored in an RDD is read-only, the number of partitions can be changed.
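A minimal spark-shell style sketch of these three aspects (`sc` is predefined there):

```scala
// Partitioned: an explicit number of partitions that can be processed in parallel.
val nums = sc.parallelize(1 to 1000, numSlices = 4)
println(nums.getNumPartitions)        // 4

// Read-only: map() does not change `nums`; it derives a new RDD from it.
val squares = nums.map(n => n * n)
println(squares.sum())                // action on the derived RDD
```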

The elasticity of Spark's RDD:

  • Storage elasticity: automatic switching between memory and disk

  • Fault-tolerance elasticity: lost data can be recovered automatically

  • Computation elasticity: failed computations are retried

  • Sharding elasticity: re-partition as needed
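Two of these elasticity points can be seen directly in the API (spark-shell style; the path is hypothetical): storage elasticity via an explicit storage level, and sharding elasticity via repartitioning.

```scala
import org.apache.spark.storage.StorageLevel

// Storage elasticity: keep partitions in memory and spill to disk when memory runs out.
val words = sc.textFile("hdfs:///data/words.txt")        // hypothetical path
  .flatMap(_.split("\\s+"))
  .persist(StorageLevel.MEMORY_AND_DISK)

// Sharding elasticity: re-shard as the data volume requires.
val widened  = words.repartition(64)   // more partitions (full shuffle)
val narrowed = words.coalesce(8)       // fewer partitions (avoids a full shuffle)
println(widened.getNumPartitions)
println(narrowed.getNumPartitions)
```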

Dependencies between RDDs: based on the dependencies between them, RDDs form a directed acyclic graph (DAG) that describes the whole computation flow. In actual execution, the computation proceeds along the lineage in one go; even if a data partition is lost, it can be rebuilt through the lineage relationship.

Lineage: RDDs support only coarse-grained transformations, i.e. a single operation applied to a large number of records. The series of lineages used to create an RDD is recorded so that lost partitions can be recovered: an RDD's lineage records its metadata and transformation behavior, and when some of the RDD's partition data is lost, this information is used to recompute and restore the lost partitions.

DAG generation: a DAG (Directed Acyclic Graph) is formed as the original RDDs go through a series of transformations; the DAG is then divided into different stages according to the dependencies between the RDDs.
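The lineage and the stage boundary introduced by a shuffle can be inspected with `toDebugString` (spark-shell style; the path is hypothetical):

```scala
val counts = sc.textFile("hdfs:///data/words.txt")   // hypothetical path
  .flatMap(_.split("\\s+"))
  .map(w => (w, 1))                                  // narrow dependencies so far
  .reduceByKey(_ + _)                                // wide dependency: a new stage starts here

// Prints the lineage; the indentation marks the shuffle (stage) boundary.
println(counts.toDebugString)
```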

1.3. RDD caching

One of the reasons Spark is so fast is its ability to persist or cache datasets in memory between operations. Once an RDD is persisted, each node keeps the computed partitions in memory and reuses them in other actions on that RDD or on RDDs derived from it, which makes subsequent actions much faster. Persistence and caching of RDDs is one of Spark's most important features; it is fair to say that caching is the key to Spark's iterative algorithms and fast interactive queries.
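A minimal caching sketch (spark-shell style, hypothetical path): the first action materializes the cache, and later actions on the same RDD reuse it instead of rereading the source.

```scala
val errors = sc.textFile("hdfs:///logs/app.log")      // hypothetical path
  .filter(_.contains("ERROR"))
  .cache()                                            // shorthand for persist(MEMORY_ONLY)

errors.count()                                        // 1st action: reads the source and fills the cache
errors.filter(_.contains("timeout")).count()          // reuses the cached partitions
```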

Checkpoint

  • Besides persistence, Spark provides a checkpoint mechanism for data storage. A checkpoint (essentially writing an RDD to disk) assists lineage-based fault tolerance: when the lineage gets too long, the cost of recovery becomes too high, so it pays to checkpoint at an intermediate stage. If a node later fails and a partition is lost, redoing the lineage from the checkpointed RDD reduces the overhead. Checkpointing implements the RDD checkpoint function by writing data to the HDFS file system.

  • There is a significant difference between cache and checkpoint. Cache computes the RDD and keeps it in memory, but the RDD's dependency chain (the equivalent of a database redo log) cannot be discarded: when an executor crashes, the cached RDDs on it are lost and must be recomputed by replaying the dependency chain. Checkpoint, by contrast, saves the RDD to HDFS, which is reliable multi-replica storage, so the dependency chain can be discarded (it is cut off); this is high fault tolerance achieved through replication.
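A checkpoint sketch under the same assumptions (spark-shell style, hypothetical HDFS paths); caching before checkpointing is a common practice because the checkpoint write runs as a separate job:

```scala
// Checkpointed data is written to reliable, multi-replica storage, so the lineage
// before this point can be discarded.
sc.setCheckpointDir("hdfs:///checkpoints/demo")       // hypothetical path

val cleaned = sc.textFile("hdfs:///data/raw.txt")     // hypothetical path
  .map(_.trim)
  .filter(_.nonEmpty)

cleaned.cache()        // avoids recomputing the RDD when the checkpoint job runs
cleaned.checkpoint()   // takes effect at the next action
cleaned.count()        // triggers the normal job and then the checkpoint write
```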

2. Shuffle

Shuffle literally means "to shuffle cards". A shuffle is needed because data that shares some common characteristic must be brought together on one compute node for processing, while that data is spread across different storage nodes and processed by compute units on different nodes.

2.1.ShuffleManager

Development overview:

  • Before Spark 1.2, the default shuffle engine was HashShuffleManager. HashShuffleManager has a very serious drawback: it generates a large number of intermediate disk files, and the resulting disk I/O hurts performance.

  • Therefore, in versions after Spark 1.2, the default ShuffleManager was changed to SortShuffleManager. SortShuffleManager improves on HashShuffleManager: although each Task still produces temporary disk files during shuffle, all the temporary files are eventually merged into a single disk file, so each Task ends up with only one disk file. When the shuffle read tasks of the next stage pull their data, they read just their portion of each disk file according to an index.

The operating principle of HashShuffleManager:

Unoptimized HashShuffleManager:

  • In the shuffle write stage, after one stage finishes its computation, the data processed by each task is "classified" by key so that the next stage can run shuffle operators (such as reduceByKey). "Classifying" means hashing the key so that identical keys are written into the same disk file, and each disk file belongs to exactly one task of the downstream stage. Before being written to disk, data is first written into an in-memory buffer, which is spilled to the disk file when it fills up. The number of disk files produced by unoptimized shuffle write is staggering.

  • Shuffle read pulls and aggregates at the same time. Each shuffle read task has its own buffer and can only pull data of the buffer's size at a time, then performs aggregation and other operations through an in-memory Map.

Optimized HashShuffleManager:

  • With the consolidate mechanism enabled, during shuffle write a task no longer creates one disk file per task of the downstream stage. Instead, the concept of a shuffleFileGroup appears: each shuffleFileGroup corresponds to a batch of disk files, whose number equals the number of tasks in the downstream stage. An Executor can run as many tasks in parallel as it has CPU cores, and each task in the first parallel batch creates a shuffleFileGroup and writes its data into the corresponding disk files.

  • When the Executor's CPU cores finish one batch of tasks and start the next, the next batch reuses the existing shuffleFileGroups, including their disk files; that is, these tasks write their data into existing disk files rather than new ones. The consolidate mechanism thus lets different tasks reuse the same batch of disk files, effectively merging the disk files of multiple tasks, which greatly reduces the number of disk files and improves shuffle write performance.
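For historical context only, a hedged sketch of how the consolidate mechanism was switched on in the Spark 1.x configuration (both options belong to the old HashShuffleManager and were removed in later releases):

```scala
import org.apache.spark.SparkConf

// Legacy Spark 1.x settings (no longer available in current releases).
val conf = new SparkConf()
  .setAppName("hash-shuffle-consolidate")
  .set("spark.shuffle.manager", "hash")            // use HashShuffleManager explicitly
  .set("spark.shuffle.consolidateFiles", "true")   // enable the consolidate (shuffleFileGroup) mechanism
```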

How SortShuffleManager runs:

  • In sort-based shuffle, each Shuffle Map Task does not generate a separate file for each reducer; instead, it writes all of its results to a single file and generates an index file alongside it.
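A sketch of a shuffle under sort-based shuffle (spark-shell style, hypothetical path); the comments describe what the shuffle write produces per map task, as explained above:

```scala
// Shuffle write: under sort-based shuffle, each Shuffle Map Task of this reduceByKey
// writes one sorted data file plus one index file; the reduce-side tasks of the next
// stage use the index to read only their own segment of that file.
val clicks  = sc.textFile("hdfs:///logs/clicks.csv")   // hypothetical path
  .map(line => (line.split(",")(0), 1))                // (userId, 1)
val perUser = clicks.reduceByKey(_ + _)
perUser.take(5).foreach(println)
```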

4. Summary

Spark is an outstanding framework in the distributed field. Beyond its code-level design patterns, its architecture-level ideas are worth learning from. Whether it is splitting out the computing layer or abstracting data as RDDs, the lesson is the same: when the complexity of a field reaches a certain level, subdividing the workflow by importance and necessity and designing new logical concepts for each part is a good way to tackle highly complex problems.

reference content

1. Jianshu: Spark technical architecture, working mechanism, and installation and use

2. Spark official website: Apache Spark™ - Unified Engine for large-scale data analytics

3. Spark Shuffle explained in detail
