Spark Learning Path, Part 1: Spark Overview


Preface

Lately, whenever I have nothing to do, I get a little restless and irritable, so I decided to find something to do and stir things up a bit.


1. What is Spark?

1. Definition

Spark is a fast, general-purpose, scalable engine for big data analytics that does its computation in memory.

2. History

2009: Born at the AMPLab of the University of California, Berkeley. The project is written in Scala.
2010: Released as open source.
June 2013: Became an Apache incubator project.
February 2014: Became a top-level Apache project.

2. Spark's built-in modules

1. Module diagram

[Figure: diagram of Spark's built-in modules; image not available]

2. Module details

Spark Core: Implements Spark's basic functionality, including task scheduling, memory management, fault recovery, and interaction with storage systems. Spark Core also contains the API definition for the Resilient Distributed Dataset (RDD).
Spark SQL: Spark's package for working with structured data. With Spark SQL, data can be queried using SQL or the Apache Hive dialect of SQL (HQL). Spark SQL supports a variety of data sources, such as Hive tables, Parquet, and JSON.
Spark Streaming: Spark's component for processing real-time data streams. It provides an API for manipulating data streams that closely mirrors the RDD API in Spark Core.
Spark MLlib: A library of common machine learning (ML) functionality, including classification, regression, clustering, and collaborative filtering, plus supporting functions such as model evaluation and data import.
Cluster Manager: Spark is designed to scale computation efficiently from a single compute node to thousands of nodes. To meet this requirement with maximum flexibility, Spark can run on a variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple scheduler that ships with Spark, called the Standalone Scheduler.
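
As a rough illustration of how the choice of cluster manager shows up in practice: a Spark application names its cluster manager through a master URL, for example via the --master option of spark-submit. The sketch below is plain Python, not Spark code. The function name cluster_manager_for is hypothetical, the host names are placeholders, and the URL forms (yarn, mesos://host:port, spark://host:port, local[N]) are the ones described in Spark's documentation, so check them against the Spark version you actually run.

```python
def cluster_manager_for(master: str) -> str:
    """Map a Spark master URL to the cluster manager it selects.

    Hypothetical helper for illustration only; Spark does its own
    parsing of the master URL internally.
    """
    if master == "yarn":
        return "Hadoop YARN"
    if master.startswith("mesos://"):
        return "Apache Mesos"
    if master.startswith("spark://"):
        return "Standalone Scheduler"
    # local, local[4], local[*]: run on one machine, no cluster manager
    if master == "local" or (master.startswith("local[") and master.endswith("]")):
        return "local mode (no cluster manager)"
    raise ValueError(f"unrecognized master URL: {master!r}")


if __name__ == "__main__":
    # "host" is a placeholder; 5050 and 7077 are the usual default ports
    for url in ["yarn", "mesos://host:5050", "spark://host:7077", "local[*]"]:
        print(url, "->", cluster_manager_for(url))
```

The point of the sketch is that application code stays the same across cluster managers; only the master URL changes.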

Spark is backed by many big data companies, including Hortonworks, IBM, Intel, Cloudera, MapR, Pivotal, Baidu, Alibaba, Tencent, JD, Ctrip, and Youku Tudou. Baidu has applied Spark to its large-scale search, Direct Number, and Baidu big data businesses. Alibaba uses GraphX to build a large-scale graph computing and graph mining system and to implement recommendation algorithms for many production systems. Tencent's Spark cluster has reached 8,000 nodes, which at the time of writing is the largest known Spark cluster in the world.

3. Spark features

1) Fast: Compared with Hadoop MapReduce, Spark's in-memory operations can be up to 100 times faster, and its disk-based operations up to 10 times faster, depending on the workload. Spark implements an efficient DAG execution engine that processes data flows in memory, and the intermediate results of a computation are kept in memory rather than written to disk between steps.
2) Easy to use: Spark offers APIs in Java, Python, and Scala, along with more than 80 high-level operators, so users can build different kinds of applications quickly. Spark also supports interactive Python and Scala shells, which make it convenient to try out a solution against a Spark cluster.
3) General: Spark provides a unified solution. It can be used for batch processing, interactive queries (Spark SQL), real-time stream processing (Spark Streaming), machine learning (Spark MLlib), and graph computing (GraphX), and these different kinds of processing can be combined seamlessly in the same application. This reduces the labor cost of development and maintenance and the hardware cost of the deployment platform.
4) Compatible: Spark integrates easily with other open source products. For example, Spark can use Hadoop YARN or Apache Mesos as its resource manager and scheduler, and it can process any data that Hadoop supports, including HDFS and HBase. This matters especially to users who have already deployed Hadoop clusters, because they can use Spark's processing power without migrating any data.
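
To make the DAG execution and in-memory intermediate results described under "Fast" more concrete, here is a minimal single-machine sketch in plain Python. It is not Spark's API or implementation: the class ToyRDD and its methods are hypothetical names chosen to resemble RDD operations. It shows only the general idea that transformations (map, filter) merely record a step in a chain of operations, and nothing is computed until an action (collect) asks for a result, at which point values flow through the whole chain in memory.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, for illustration only)."""

    def __init__(self,
                 source: Optional[List[Any]] = None,
                 parent: Optional["ToyRDD"] = None,
                 op: Optional[Callable[[Iterable[Any]], Iterator[Any]]] = None,
                 name: str = "source") -> None:
        self.source = source    # data, only set on the root node
        self.parent = parent    # previous step in the chain
        self.op = op            # how to derive this step from the parent
        self.name = name

    # Transformations: record a new step, compute nothing yet.
    def map(self, f: Callable[[Any], Any]) -> "ToyRDD":
        return ToyRDD(parent=self, op=lambda it: (f(x) for x in it), name="map")

    def filter(self, pred: Callable[[Any], bool]) -> "ToyRDD":
        return ToyRDD(parent=self, op=lambda it: (x for x in it if pred(x)),
                      name="filter")

    def lineage(self) -> List[str]:
        """Return the recorded chain of steps, from the source to this node."""
        names, node = [], self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return list(reversed(names))

    def _iterate(self) -> Iterator[Any]:
        if self.parent is None:
            return iter(self.source or [])
        return self.op(self.parent._iterate())

    # Action: runs the whole chain; values pass between steps in memory.
    def collect(self) -> List[Any]:
        return list(self._iterate())


if __name__ == "__main__":
    rdd = ToyRDD(source=[1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 20)
    print(rdd.lineage())   # the recorded plan
    print(rdd.collect())   # execution happens here
```

Because each node remembers its parent and the operation that produced it, a lost result could in principle be recomputed from the source by replaying the chain. Real Spark builds a much richer graph with partitions, stages, and shuffles, so treat this only as an intuition aid.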

Source: blog.csdn.net/weixin_44695793/article/details/113852818