【Spark】What is Spark? What can it do? What are its characteristics?

1. What is Spark

Official website: http://spark.apache.org


Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Spark is a fast, general-purpose, and scalable big data analysis engine. It was born in 2009 at AMPLab, University of California, Berkeley, was open-sourced in 2010, became an Apache incubator project in June 2013, and became a top-level Apache project in February 2014. The Spark ecosystem has since grown into a collection of sub-projects, including Spark SQL, Spark Streaming, GraphX, and MLlib.

Spark is a parallel computing framework for big data built on in-memory computing. By computing in memory, Spark improves the real-time performance of data processing in big data environments while still guaranteeing high fault tolerance and high scalability, so users can deploy Spark as a cluster across large numbers of inexpensive machines.

Spark is backed by many big data companies, including Hortonworks, IBM, Intel, Cloudera, MapR, Pivotal, Baidu, Alibaba, Tencent, JD.com, Ctrip, and Youku Tudou. Baidu applies Spark to businesses such as Fengchao, Big Search, Zhidaohao, and Baidu Big Data; Alibaba has used GraphX to build large-scale graph computing and graph mining systems and to implement recommendation algorithms for many production systems; and Tencent's Spark cluster has reached 8,000 nodes, the largest known Spark cluster in the world.

2. The Role of Spark

  • Intermediate result output: MapReduce-based computing engines usually write intermediate results to disk for storage and fault tolerance. Because tasks are pipelined, a single query is often translated into multiple MapReduce stages, and these chained stages rely on the underlying file system (such as HDFS) to store the output of every stage.

  • Spark is an alternative to MapReduce that is compatible with HDFS and Hive and integrates into the Hadoop ecosystem, making up for MapReduce's shortcomings (see the sketch below).
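
To make this concrete, here is a minimal sketch (the input path and field layout are hypothetical) of a multi-stage Spark job: work that MapReduce would split into several jobs, each writing its output to HDFS, runs here as one DAG whose intermediate results stay in memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pipeline").setMaster("local[*]"))

    val logs   = sc.textFile("hdfs:///data/access.log")       // hypothetical input
    val errors = logs.filter(_.contains("ERROR"))              // narrow transformation
    val byHost = errors.map(line => (line.split(" ")(0), 1))   // assumes host is the first field
    val counts = byHost.reduceByKey(_ + _)                     // shuffle boundary -> new stage
    val ranked = counts.sortBy(-_._2)                          // another shuffle -> new stage

    // One action triggers the whole DAG; intermediate RDDs are never
    // written to HDFS between stages, unlike chained MapReduce jobs.
    ranked.take(10).foreach(println)
    sc.stop()
  }
}
```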

3. Spark Features

Fast

Compared with Hadoop's MapReduce, Spark is more than 100 times faster for memory-based operations and more than 10 times faster even for disk-based ones. Spark implements an efficient DAG (Directed Acyclic Graph) execution engine that can process data flows efficiently in memory.
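
A small spark-shell sketch of the in-memory model (sc is predefined in the shell; the path is hypothetical): caching an RDD lets a second action reuse the data from memory instead of re-reading it from disk.

```scala
// spark-shell: sc (SparkContext) is predefined.
val nums = sc.textFile("hdfs:///data/points.txt") // hypothetical: one number per line
  .map(_.toDouble)
  .cache()                                        // keep the parsed RDD in memory

val total = nums.sum()  // first action: reads the file and fills the cache
val peak  = nums.max()  // second action: served from memory, no re-read
println(s"sum=$total max=$peak")
```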

Easy to Use

Spark provides APIs for Java, Python, and Scala, along with more than 80 high-level operators that let users build different applications quickly. Spark also ships interactive Python and Scala shells, which make it very convenient to try out solutions against a Spark cluster.
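
As a taste of the interactive shells, a classic word count is a few lines in spark-shell (the input path is hypothetical):

```scala
// spark-shell: sc is predefined.
val counts = sc.textFile("hdfs:///data/input.txt") // hypothetical path
  .flatMap(_.split(" "))                           // split lines into words
  .map((_, 1))                                     // pair each word with a count of 1
  .reduceByKey(_ + _)                              // sum the counts per word

counts.take(5).foreach(println)
```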

General

Spark provides a unified solution: it can be used for batch processing, interactive queries (Spark SQL), real-time stream processing (Spark Streaming), machine learning (Spark MLlib), and graph computing (GraphX), and these different types of processing can all be combined seamlessly within the same application. This unified approach is very attractive; every company wants a single platform for the problems it encounters, reducing both the labor cost of development and maintenance and the material cost of deploying the platform.
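
A minimal sketch of that unification, mixing RDD-style batch processing and Spark SQL in one application (the names here are illustrative, not from the original post):

```scala
import org.apache.spark.sql.SparkSession

object UnifiedSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("unified").master("local[*]").getOrCreate()
    import spark.implicits._

    // Batch-style RDD processing...
    val wordCounts = spark.sparkContext
      .parallelize(Seq("spark", "hadoop", "spark"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)

    // ...then interactive SQL over the same data, in the same program.
    wordCounts.toDF("word", "count").createOrReplaceTempView("wc")
    spark.sql("SELECT word, count FROM wc ORDER BY count DESC").show()

    spark.stop()
  }
}
```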

Compatible

Spark integrates easily with other open source products. For example, it can use Hadoop's YARN or Apache Mesos as its resource manager and scheduler, and it can process all data sources supported by Hadoop, including HDFS, HBase, and Cassandra. This is especially important for users who have already deployed Hadoop clusters, because they can tap Spark's processing power without any data migration. Spark also does not have to depend on a third-party resource manager: it ships Standalone, its built-in resource management and scheduling framework, which further lowers the barrier to adoption and makes Spark easy for everyone to deploy and use. In addition, Spark provides tools for deploying Standalone Spark clusters on EC2.
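
A sketch of what that compatibility looks like in code (the path is hypothetical, and in practice the master is usually supplied via spark-submit rather than hard-coded): the application itself does not change when the cluster manager does.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("compat")
  // .master("spark://host:7077")  // Standalone cluster
  // .master("yarn")               // Hadoop YARN (normally set via spark-submit)
  .master("local[*]")              // local testing
  .getOrCreate()

// Existing Hadoop data is readable in place: no migration needed.
val lines = spark.read.textFile("hdfs:///user/data/events.log") // hypothetical path
println(lines.count())
```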

4. Spark and Hadoop

Some people say that the emergence of Spark represents the death of Hadoop. I disagree: Hadoop is a distributed system ecosystem, and it cannot be replaced by the Spark engine alone.

But it must be admitted that Spark has, to a great extent, made up for some of Hadoop's shortcomings and has had a real impact on it. Conversely, Hadoop's ecosystem, including resource scheduling and file storage, is very helpful to Spark, which is a pure engine.

Specifically, Spark makes Hadoop friendlier to use. Someone who uses Spark together with Hadoop has a very different experience from someone who uses only the tools in the Hadoop ecosystem.

First, with Spark it is no longer necessary to contort everyday operations into the two primitives of map and reduce, because Spark provides a more abstract interface, as the sketch below shows.
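
For example, a join followed by an aggregation, which would take several chained MapReduce jobs, is a handful of transformations in Spark (spark-shell sketch with made-up data):

```scala
// spark-shell: sc is predefined.
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))        // (userId, name)
val orders = sc.parallelize(Seq((1, 9.99), (1, 4.50), (2, 12.0))) // (userId, amount)

val spendByName = users.join(orders)                // (userId, (name, amount))
  .map { case (_, (name, amount)) => (name, amount) }
  .reduceByKey(_ + _)                               // total spend per user

spendByName.collect().foreach(println)
```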

Second, with Spark there is no need to wait an eternity for a query to finish. Built on RDDs that keep intermediate data in memory, Spark has strong support for low-latency, near-real-time processing.

A brief word on RDDs. An RDD (Resilient Distributed Dataset) is an abstract concept, a logical data structure. The most direct way to picture it is as one large dataframe: it may be the union of the raw data spread across all machines, or it may be a dataframe formed by an intermediate result computed partway through a job.
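
A quick sketch of what that looks like in practice (spark-shell): an RDD is partitioned across the cluster whether it comes from a local collection, a file, or an intermediate step of a computation, and transformations on it are evaluated lazily.

```scala
// spark-shell: sc is predefined.
val numbers = sc.parallelize(1 to 1000, numSlices = 4) // an RDD in 4 partitions
val doubled = numbers.map(_ * 2)                       // intermediate RDD, not yet computed

println(doubled.getNumPartitions)  // 4: partitioning carries through
println(doubled.reduce(_ + _))     // action: only now does computation run
```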

5. Spark Streaming

Why single out Spark Streaming? This is a personal fixation. What first led me to Spark was Spark Streaming, an application built on the Spark engine. I used to think of Spark as a data processing tool and wondered how it could also manage the transmission of data streams. Only after studying Spark in detail did I realize that the data processing tools I had in mind were MLlib or Shark, applications likewise built on the Spark engine.

Now to Spark Streaming itself. According to Apache's official description, Spark Streaming receives real-time data from some source, splits that data into batches, performs whatever processing is needed (such as data cleaning or data aggregation), and finally sends the results on to a target.

What is worth noting in this process is the batching: the data stream Spark transmits moves batch by batch, and both the transmission rate and the size of each batch can be tuned to the needs of the downstream receiver.
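
A minimal DStream sketch of this micro-batch model (spark-shell; the host and port are hypothetical), with a 5-second batch interval so that each 5 seconds of received lines is processed as one small batch:

```scala
// spark-shell: sc is predefined.
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc   = new StreamingContext(sc, Seconds(5))    // batch interval: 5 seconds
val lines = ssc.socketTextStream("localhost", 9999) // hypothetical live source

val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()          // emit each processed batch downstream (here: the console)

ssc.start()             // begin receiving and slicing the stream into batches
ssc.awaitTermination()
```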

Source: blog.csdn.net/u011397981/article/details/130412612