Spark Ecosystem Overview

The Spark ecosystem was created by the AMP Lab at the University of California, Berkeley. It is an integrated big data platform that aims to connect Algorithms, Machines, and People at large scale to power big data applications.

The AMP Lab combines big data, cloud computing, communication resources, and a variety of flexible technical approaches to sift massive, opaque data and turn it into useful information, helping people better understand the world. The ecosystem already spans machine learning, data mining, databases, information retrieval, natural language processing, and speech recognition.

Spark Core is the heart of the Spark ecosystem. It reads data from persistence layers such as HDFS, Amazon S3, and HBase, and uses Mesos, YARN, or its built-in Standalone mode as the cluster manager to schedule jobs and run Spark applications. These applications can come from different components, such as Spark Shell / Spark Submit batch processing, Spark Streaming real-time processing, Spark SQL ad hoc queries, MLlib machine learning, GraphX graph processing, and SparkR mathematical computing.
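To make the cluster-manager point concrete, here is a minimal Scala sketch (not from the original article) showing that the same driver code can target any of these cluster managers just by changing the master URL; the host names and ports below are illustrative placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ClusterManagerSketch {
  def main(args: Array[String]): Unit = {
    // The same application runs under any of Spark's cluster managers;
    // only the master URL changes (hosts below are hypothetical):
    //   Standalone: "spark://master-host:7077"
    //   Mesos:      "mesos://master-host:5050"
    //   YARN:       "yarn" (normally supplied via spark-submit --master yarn)
    //   Local test: "local[*]"
    val conf = new SparkConf()
      .setAppName("ClusterManagerSketch")
      .setMaster("local[*]") // swap for a cluster URL outside local testing

    val sc = new SparkContext(conf)
    println(s"Connected to cluster manager: ${sc.master}")
    sc.stop()
  }
}
```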

Figure 1: The Spark ecosystem

1. Spark Core

The basics of Spark Core were introduced earlier; the following summarizes the Spark core architecture.

1) Provides a distributed parallel computing framework based on the directed acyclic graph (DAG), together with a cache mechanism that supports repeated iterative computation and data sharing. This greatly reduces the overhead of re-reading data between iterations, a major performance gain for data mining and analytics that require many iterations.

2) Introduces the RDD (Resilient Distributed Dataset) abstraction: a read-only collection of objects distributed across a set of nodes. These collections are resilient; if part of a dataset is lost, it can be rebuilt from its lineage, guaranteeing high fault tolerance.

3) Moves computation to the data rather than moving the data: RDD partitions can read data blocks from HDFS on the nearest node into memory for computation.

4) Uses a multi-threaded pool model to reduce task startup overhead.

5) Adopts Akka, a fault-tolerant and highly scalable framework, for communication.
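A minimal sketch of points 1) to 3) in action, assuming Spark is available locally: it builds an RDD, caches it, and runs two actions over it, so the second action reads the cached partitions from memory instead of recomputing them.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoreSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CoreSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Build an RDD and cache it so iterative/repeated actions reuse
    // the in-memory partitions instead of re-reading the source data.
    val nums = sc.parallelize(1 to 1000000).cache()

    // Two actions over the same cached RDD; only the first one
    // triggers a full computation of the lineage (DAG).
    val sum = nums.map(_.toLong).reduce(_ + _)
    val evens = nums.filter(_ % 2 == 0).count()

    println(s"sum=$sum, even count=$evens")
    sc.stop()
  }
}
```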

2. Spark Streaming

Spark Streaming is a high-throughput, fault-tolerant system for processing real-time data streams. It can apply map-, reduce-, and join-style operations to many data sources (e.g., Kafka, Flume, Twitter, ZeroMQ, and TCP sockets), then save the results to an external file system or database, or push them to a real-time dashboard.

The core idea of Spark Streaming is to break stream computation into a series of short batch jobs, with Spark Core as the batch engine. Input data is divided into segments according to a configured time slice (e.g., one second), each segment is converted into a Spark RDD, and every transformation on a DStream in Spark Streaming becomes a transformation on the underlying RDDs, whose intermediate results are kept in memory.

Depending on business requirements, the whole stream computation can accumulate these intermediate results or write them to external storage. This tutorial covers Spark Streaming in detail later on.
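The classic word count illustrates this batching. A hedged sketch, assuming a TCP text source on localhost port 9999 (started, say, with `nc -lk 9999`) and a one-second batch interval, so each second of input becomes one RDD:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing
    val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")

    // One-second time slice: input is cut into one RDD per second
    val ssc = new StreamingContext(conf, Seconds(1))

    // Hypothetical source: a TCP socket on localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)

    // DStream transformations become RDD transformations on each batch
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```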

3. Spark SQL

Spark SQL lets developers work directly with RDDs while also querying external data stored in Hive, HBase, and similar systems. An important feature of Spark SQL is that it treats relational tables and RDDs uniformly, so developers can easily run SQL queries against external data while carrying out more complex data analysis.
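A small sketch of that unified relational view, using the `SparkSession` entry point; the table name and rows below are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

object SqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SqlSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A local collection turned into a relational view...
    val people = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")
    people.createOrReplaceTempView("people")

    // ...then queried with plain SQL
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```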

4. Spark MLlib

Spark MLlib implements common machine learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and underlying optimization primitives, and its algorithms can be extended. Spark MLlib lowers the barrier to machine learning: developers with a certain amount of theoretical knowledge can get machine learning work done. This tutorial introduces Spark MLlib further in a later section.
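As one example of the clustering support, here is a minimal k-means sketch using MLlib's DataFrame-based API; the toy feature vectors are invented for illustration:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MLlibSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MLlibSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Tiny in-memory dataset; a real job would load from HDFS, S3, etc.
    val data = Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
    ).map(Tuple1.apply).toDF("features")

    // Fit k-means, one of MLlib's built-in clustering algorithms
    val model = new KMeans().setK(2).setSeed(1L).fit(data)
    model.clusterCenters.foreach(println)

    spark.stop()
  }
}
```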

5. Spark GraphX

Spark GraphX is Spark's API for graph-parallel computation; it can be seen as a rewrite and optimization of GraphLab and Pregel on top of Spark. Compared with other distributed graph computing frameworks, GraphX's biggest contribution is providing a one-stop data solution on Spark, so that a complete pipeline of graph computations can be carried out conveniently and efficiently.

GraphX's core abstraction is the Resilient Distributed Property Graph: a directed multigraph whose vertices and edges both carry attributes. It extends the Spark RDD abstraction with two views, Table and Graph, backed by a single physical store. Each view has its own operators, which yields both flexible operation and execution efficiency.
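A short sketch of the property graph: the vertex and edge RDDs below (the names and relationships are invented) correspond to the Table view, and `Graph(...)` builds the Graph view over the same data:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphXSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GraphXSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Vertex and edge RDDs with attributes: the Table view
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

    // The Graph view over the same underlying data
    val graph = Graph(vertices, edges)

    // A graph operator: in-degree of every vertex
    graph.inDegrees.collect().foreach(println)

    sc.stop()
  }
}
```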

It is worth noting that Spark Streaming, Spark SQL, Spark MLlib, and Spark GraphX can all use Spark Core's APIs, their methods are largely interchangeable, and the data they process can be shared, enabling seamless integration of data across different applications.

