Introduction to Spark and its ecosystem

1. Introduction to Spark

Spark is a platform for fast, general-purpose cluster computing. The official website describes it as: Apache Spark™ is a unified analytics engine for large-scale data processing. Spark suits a variety of workloads that previously required separate distributed systems, including batch processing, iterative algorithms, interactive queries, and stream processing. It exposes a very rich set of APIs: in addition to easy-to-use APIs for Python, Java, Scala, and SQL and rich built-in libraries, Spark also works closely with other big data tools. For example, Spark can run on a Hadoop cluster and access any Hadoop data source, including Cassandra.
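As a minimal sketch of what the Scala API looks like in practice (the application name and input path below are only illustrative placeholders), a self-contained word-count application can be written like this:

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark on all local cores; on a real cluster the
    // master is normally supplied by spark-submit instead.
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Split lines into words, pair each word with 1, and sum the counts.
    val counts = sc.textFile("input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```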

The biggest feature of Spark is that it is memory-based, which makes data processing very fast; it is claimed to be up to 100 times faster than MapReduce for some workloads. Spark is also a unified software stack, whose composition is shown in the figure below:
[Figure: the Spark software stack]

2. Introduction to Spark Core

Spark Core implements the basic functionality of Spark, including task scheduling, memory management, fault recovery, and interaction with storage systems. Spark Core also contains the API definition of resilient distributed datasets (RDDs). An RDD represents a collection of elements distributed across multiple compute nodes that can be operated on in parallel, and it is Spark's main programming abstraction. Spark Core provides many APIs for creating and manipulating these collections.
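A minimal sketch of the RDD abstraction, runnable in spark-shell (where the SparkContext `sc` is predefined): the collection is split across partitions, transformed in parallel, and then aggregated back to the driver.

```scala
// Runnable in spark-shell, where the SparkContext `sc` already exists.
// Distribute the numbers 1..100 over 4 partitions, processed in parallel.
val numbers = sc.parallelize(1 to 100, 4)

val squares = numbers.map(n => n * n) // transformation: lazy, builds a new RDD
val total   = squares.reduce(_ + _)   // action: triggers the actual computation

println(s"Sum of squares from 1 to 100 = $total") // 338350
```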
[Figure: Spark basic architecture]

3. Introduction to Spark SQL

Spark SQL is Spark's package for working with structured data. Through Spark SQL, we can query data using SQL or the Apache Hive variant of SQL (HQL). Spark SQL supports many data sources, such as Hive tables, Parquet, and JSON. Beyond providing a SQL interface for Spark, Spark SQL lets developers intermix SQL queries with the programmatic data manipulations of RDDs, so whether they use Python, Java, or Scala, developers can combine SQL and complex data analysis within a single application.
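A minimal sketch of mixing SQL with programmatic access, runnable in spark-shell (where the SparkSession `spark` is predefined); the file `people.json` stands in for any JSON data source with name and age fields.

```scala
// Runnable in spark-shell, where the SparkSession `spark` already exists.
// Load a JSON data source into a DataFrame (the path is a placeholder).
val people = spark.read.json("people.json")

// Register the DataFrame as a temporary view so it can be queried with SQL.
people.createOrReplaceTempView("people")
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")

// The SQL result is itself a DataFrame and can be manipulated programmatically.
adults.show()
```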

4. Spark Streaming

Spark Streaming is the Spark component for stream processing of live data. Examples of data streams include web server logs in a production environment, or queues of messages made up of status updates submitted by users of a web service. Spark Streaming provides an API for manipulating data streams that closely mirrors the RDD API in Spark Core, which lowers the learning curve for programmers writing applications: whether the data lives in memory, on disk, or arrives as a real-time stream, they can handle it with the same style of code. Under the hood, Spark Streaming is designed to provide the same degree of fault tolerance, throughput, and scalability as Spark Core.
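A minimal sketch of the DStream API, runnable in spark-shell (where `sc` exists); the host and port are placeholders for a socket source such as one started with `nc -lk 9999`.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Build a streaming context that groups incoming data into 5-second batches.
val ssc = new StreamingContext(sc, Seconds(5))

// Each batch of lines from the socket is processed with operators that
// mirror the RDD API: flatMap, map, reduceByKey.
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()             // start receiving and processing data
ssc.awaitTermination()  // block until the streaming job is stopped
```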

5. SparkMLlib

Spark also includes MLlib, a library of common machine learning (ML) functionality. MLlib provides many kinds of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import. It also provides some lower-level ML primitives, including a generic gradient descent optimization algorithm. All of these methods are designed to scale out across a cluster.
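A minimal classification sketch with the DataFrame-based MLlib API, runnable in spark-shell; the four labelled points are made up purely for illustration.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

// A tiny hand-made training set: label 1.0 for "large" feature values.
val training = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.1, 0.2)),
  (0.0, Vectors.dense(0.3, 0.1)),
  (1.0, Vectors.dense(2.0, 1.9)),
  (1.0, Vectors.dense(1.8, 2.2))
)).toDF("label", "features")

// Fit a logistic regression model and inspect the learned parameters.
val lr    = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val model = lr.fit(training)
println(s"coefficients: ${model.coefficients}  intercept: ${model.intercept}")
```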

6. GraphX

GraphX is a library for manipulating graphs (such as the friend graph of a social network) and performing graph-parallel computations. Like Spark Streaming and Spark SQL, GraphX extends the Spark RDD API and lets us create a directed graph whose vertices and edges carry arbitrary properties. GraphX also supports various operations on graphs (such as subgraph for extracting subgraphs and mapVertices for transforming all vertices), as well as a library of common graph algorithms (such as PageRank and triangle counting).
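A minimal GraphX sketch, runnable in spark-shell (where `sc` exists): a three-vertex directed graph with string properties on vertices and edges, on which PageRank is run.

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Vertices carry a name, edges carry a relationship label.
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges    = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "follows"),
  Edge(3L, 1L, "follows")
))
val graph = Graph(vertices, edges)

// Run PageRank until the ranks change by less than the given tolerance.
val ranks = graph.pageRank(0.0001).vertices
ranks.collect().foreach { case (id, rank) => println(s"vertex $id -> $rank") }
```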

7. Cluster Manager

Under the hood, Spark is designed to scale computation efficiently from a single compute node up to thousands of compute nodes. To meet this requirement while retaining maximum flexibility, Spark can run on a variety of cluster managers, including Hadoop YARN, Apache Mesos, and the simple cluster manager that ships with Spark itself, called the Standalone Scheduler. If you are installing Spark on machines with no cluster manager already in place, the Standalone Scheduler offers an easy way to get started; if you already have a cluster running Hadoop YARN or Mesos, Spark's support for these cluster managers means your applications can run on those clusters as well. These different options, and how to choose the right cluster manager, will be discussed in detail later.
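A minimal sketch of how the choice of cluster manager surfaces in application code: it is simply the master URL (the host names below are placeholders, and in practice the master is usually passed via spark-submit's --master flag rather than hard-coded).

```scala
import org.apache.spark.sql.SparkSession

// Exactly one master URL would be used in practice; the commented lines
// show the forms for the different cluster managers.
val spark = SparkSession.builder()
  .appName("ClusterManagerDemo")
  .master("local[*]")                     // no cluster manager: run locally
  // .master("spark://master-host:7077")  // Standalone Scheduler shipped with Spark
  // .master("yarn")                      // Hadoop YARN (config taken from the classpath)
  // .master("mesos://mesos-host:5050")   // Apache Mesos
  .getOrCreate()

println(s"Running against master: ${spark.sparkContext.master}")
spark.stop()
```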

8. Users and uses of Spark

Spark is a general-purpose framework for cluster computing and is therefore used in a wide variety of applications. It has two main target audiences: data scientists and engineers. Looking carefully at these two groups and how they use Spark, it is clear that their typical use cases differ, but we can roughly divide those use cases into two categories: data science applications and data processing applications.

1. Data science tasks (data scientist)
2. Data processing applications (engineers)
