Summary of an eight-part series on big data development: Spark

Spark overview

What is Spark

Spark is a fast, versatile, and scalable in-memory computing engine for big data analytics.

Spark and Hadoop

From earlier study we already know Hadoop's MapReduce as a well-established computing framework, so why do we still need to learn a new computing framework like Spark? To answer that, we first have to look at the relationship between Spark and Hadoop.

Hadoop

  • Hadoop is an open source framework written in Java that stores massive amounts of data and runs distributed analysis applications on clusters of distributed servers.
  • HDFS, the Hadoop Distributed File System, sits at the bottom of the Hadoop ecosystem, storing all the data and supporting all Hadoop services. Its theoretical basis is Google's paper "The Google File System"; HDFS is an open source implementation of GFS.
  • MapReduce is a programming model that Hadoop implements based on Google's MapReduce paper. As Hadoop's distributed computing model, it is the core of Hadoop. With this framework, writing distributed parallel programs becomes extremely simple (a minimal sketch of the model follows this list). Combining HDFS's distributed storage with MapReduce's distributed computing, Hadoop scales out very easily when processing massive data.
  • HBase is an open source implementation of Google's Bigtable, although it differs from Bigtable in many ways. HBase is a distributed database built on HDFS that excels at real-time random reads and writes over very large data sets, and it is another very important component of Hadoop.
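
To make the MapReduce programming model concrete, here is a minimal single-machine sketch in Scala of the three phases (map, shuffle, reduce) applied to word count. This only illustrates the model; it is not the Hadoop API, and the input data is made up for the example.

```scala
// Single-machine illustration of the MapReduce model (not the Hadoop API).
object MapReduceModel {
  // Map phase: each input line is turned into (word, 1) pairs.
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(_.split("\\s+")).filter(_.nonEmpty).map(word => (word, 1))

  // Shuffle phase: the framework groups all intermediate values by key.
  def shufflePhase(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

  // Reduce phase: the values for each key are combined into a final result.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, counts) => (word, counts.sum) }

  def main(args: Array[String]): Unit = {
    val input = Seq("hello hadoop", "hello spark")
    val result = reducePhase(shufflePhase(mapPhase(input)))
    println(result) // e.g. Map(hello -> 2, hadoop -> 1, spark -> 1)
  }
}
```

In real Hadoop the map and reduce functions run on different machines and the shuffle moves data across the network, but the programmer only writes the two functions; that is what makes the model simple.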

Spark

  • Spark is a fast, versatile, and scalable big data analysis engine developed in Scala.
  • Spark Core provides the most basic and core functionality of Spark (see the sketch after this list).
  • Spark SQL is the Spark component for working with structured data. With Spark SQL, users can query data using SQL or the Hive dialect of SQL (HQL).
  • Spark Streaming is the component of the Spark platform for streaming computation over real-time data, providing a rich API for processing data streams.
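
As a rough illustration of how Spark Core and Spark SQL fit together, here is a minimal sketch using a local SparkSession and the standard public Spark APIs. The file name input.txt is a hypothetical placeholder.

```scala
import org.apache.spark.sql.SparkSession

object SparkComponentsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkComponentsDemo")
      .master("local[*]") // run locally, using all available cores
      .getOrCreate()

    // Spark Core: the RDD API, here counting words in a text file.
    // "input.txt" is a hypothetical local file for this example.
    val counts = spark.sparkContext
      .textFile("input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.take(10).foreach(println)

    // Spark SQL: expose the same data as a table and query it with SQL.
    import spark.implicits._
    counts.toDF("word", "count").createOrReplaceTempView("word_counts")
    spark.sql("SELECT word, count FROM word_counts ORDER BY count DESC LIMIT 10").show()

    spark.stop()
  }
}
```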

From the above we can see that Spark appeared relatively late and is used mainly for data computation, which is why Spark has often been considered an upgraded version of the Hadoop framework.

Spark or Hadoop

Hadoop's MapReduce framework and the Spark framework are both data processing frameworks, so how do we choose between them?

  • Hadoop MapReduce was not designed for cyclic, iterative data-flow processing, so its computational efficiency suffers in scenarios where data is reused across multiple parallel jobs (such as machine learning, graph mining, and interactive data mining algorithms). Spark came into being to address this: building on the traditional MapReduce computing framework, it optimizes the computation process to greatly speed up the execution and the reading and writing of data analysis and mining jobs, and it shrinks the computing unit into the RDD computation model, which is better suited to parallel computation and data reuse.
  • Machine learning algorithms such as ALS and gradient descent for convex optimization require repeated queries and operations over a data set or data derived from it. MapReduce does not fit this pattern: even if multiple MR jobs are chained serially, performance and latency remain problems because data sharing between jobs relies on disk (a short caching sketch follows this list). The other case is interactive data mining, which MR is clearly not good at; Spark, built on Scala, also benefits from Scala's strength in functional processing.
  • Spark is a project for fast distributed data analysis. Its core abstraction, the Resilient Distributed Dataset (RDD), provides a richer model than MapReduce and can iterate over a data set in memory many times quickly, which supports complex data mining algorithms and graph computation algorithms.
  • The fundamental difference between Spark and Hadoop is how data is communicated between multiple jobs: in Spark, data is exchanged between jobs through memory, while in Hadoop it goes through disk.
  • Spark tasks start up faster: Spark forks threads, whereas Hadoop creates new processes.
  • Spark writes data to disk only during shuffle, whereas data exchange between multiple MR jobs in Hadoop relies on disk interaction.
  • Spark's caching mechanism is more efficient than HDFS's caching mechanism.
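
To illustrate the memory-based data reuse described above, here is a minimal sketch of RDD caching in an iterative loop in the spirit of gradient descent. The file points.txt (one number per line) and the toy update rule are assumptions made for the example, not a real algorithm.

```scala
import org.apache.spark.sql.SparkSession

object CachingDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CachingDemo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Parse once, then cache: later iterations read from memory
    // instead of re-reading and re-parsing the file from disk.
    // "points.txt" is a hypothetical input file for this sketch.
    val data = sc.textFile("points.txt").map(_.toDouble).cache()

    // A toy iterative loop in the spirit of gradient descent:
    // every pass scans the same cached dataset.
    var estimate = 0.0
    for (_ <- 1 to 10) {
      val gradient = data.map(x => x - estimate).mean()
      estimate += 0.5 * gradient
    }
    println(s"estimate = $estimate")

    spark.stop()
  }
}
```

With a chain of MR jobs, each of those ten passes would write its input and output through HDFS; here the cached RDD keeps the data in memory between passes, which is exactly the difference the bullets above describe.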

From the comparison above, we can see that Spark does have advantages over MapReduce in most data computing scenarios. However, because Spark is memory-based, in an actual production environment insufficient memory resources may cause jobs to fail. In those cases MapReduce is actually the better choice, so Spark cannot completely replace MR.

Spark core module

[Figure: Spark core module overview]

Source: blog.csdn.net/weixin_44123362/article/details/130257261