A Review of Spark Basics

Table of contents

1. Basic introduction
2. Four major characteristics of Spark
   1. Fast
   2. Easy to use
   3. Strong versatility
   4. Multiple run modes
3. Spark framework modules
4. Run modes
5. Spark's architectural roles
6. Summary


1. Basic introduction


Spark is a general-purpose big data computing framework built on in-memory computing technology. Below is a brief introduction to the history of Spark.

A brief history of Spark
1. 2009: Spark was born at UC Berkeley's AMPLab as a university research project.
2. 2010: officially open-sourced under the BSD license.
3. 2012: the first Spark paper was published, and the first official version (Spark 0.6.0) was released.
4. 2013: Spark became an Apache Foundation project; Spark Streaming, Spark MLlib (machine learning), and Shark (Spark on Hadoop) were released.
5. 2014: Spark became a top-level Apache project; Spark 1.0.0 was released at the end of May; Spark GraphX (graph computing) was released, and Spark SQL was released to replace Shark.
6. 2015: the DataFrame API (for big data analysis) was launched. From 2015 on, Spark became increasingly popular in industry, and a large number of companies began deploying or using Spark to replace traditional big data computing frameworks such as MapReduce, Hive, and Storm.
7. 2016: the Dataset API (a stronger, typed data analysis interface) was launched.
8. 2017: Structured Streaming was released.
9. 2018: Spark 2.4.0 was released, and Spark had become one of the largest open source projects in the big data field.

Basic components
Spark Core: the core Spark API, providing a DAG-based distributed in-memory computing framework
Spark SQL: provides an interactive query API
Spark Streaming: real-time stream processing
Spark MLlib: machine learning API
Spark GraphX: graph computing
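
To make the component names concrete, here is a minimal PySpark sketch (Python chosen for brevity; the app name and data are illustrative, not from the original post) showing the two usual entry points: the SparkContext for the Spark Core RDD API and the SparkSession for the Spark SQL DataFrame API.

    from pyspark.sql import SparkSession

    # Build a session; local[*] runs Spark in-process using all CPU cores
    spark = (SparkSession.builder
             .appName("spark-basics-demo")   # illustrative app name
             .master("local[*]")
             .getOrCreate())

    sc = spark.sparkContext                  # Spark Core entry point (RDD API)
    rdd = sc.parallelize([1, 2, 3, 4])
    print(rdd.map(lambda x: x * 2).collect())  # -> [2, 4, 6, 8]

    df = spark.range(5)                      # Spark SQL entry point (DataFrame API)
    df.show()

    spark.stop()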

2. Four major characteristics of Spark

1. Fast


Apache Spark supports in-memory computing and executes acyclic data flows through a DAG (Directed Acyclic Graph) execution engine. The official claim is that Spark computes up to 100 times faster than Hadoop MapReduce when working in memory, and about 10 times faster when working from disk.

Compared with MapReduce, Spark's data processing differs in two main ways: first, Spark can keep intermediate results in memory instead of writing them back to disk; second, a whole job completes within a single Spark program, and its tasks run as threads inside executor processes rather than as separate processes, as in MapReduce.
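
As a hedged illustration of keeping intermediate results in memory (the dataset and sizes below are illustrative):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("cache-demo")
             .getOrCreate())
    sc = spark.sparkContext

    nums = sc.parallelize(range(1_000_000))
    squares = nums.map(lambda x: x * x).cache()  # mark the intermediate RDD to be kept in memory

    # The first action materializes and caches the data; the second reuses the
    # in-memory copy instead of recomputing the map from scratch.
    print(squares.sum())
    print(squares.max())

    spark.stop()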

2. Easy to use


Spark supports multiple languages, including Java, Scala, Python, R, and SQL.
To stay compatible with enterprise applications built on Spark 2.x, the Spark 2 line continues to receive updates.
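
One way to see the ease of use is that the same query can be written with the DataFrame API or as plain SQL. A minimal sketch (the sample data is made up):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("sql-demo")
             .getOrCreate())

    df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

    df.filter(df.age > 30).show()            # DataFrame API

    df.createOrReplaceTempView("people")     # the same query, written in SQL
    spark.sql("SELECT name, age FROM people WHERE age > 30").show()

    spark.stop()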


3. Strong versatility


The Spark Core API (core module) supports R, SQL, Python, Scala, Java, and other languages.
On top of Spark Core, Spark also provides multiple tool libraries, including Spark SQL + DataFrames, Spark Streaming, MLlib (machine learning), and GraphX (graph computing).


4. Multiple run modes


Spark supports a variety of run modes: it can run on Hadoop (YARN) and Mesos, in its own Standalone mode, and on Kubernetes in the cloud (supported since Spark 2.3). As for data sources, Spark can obtain data from HDFS, HBase, Cassandra, Kafka, and more (a short read sketch follows below):
i. File systems: local FS, HDFS, Hive, text, Parquet, ORC, JSON, CSV
ii. Relational databases (RDBMS): MySQL, Oracle, MS SQL Server
iii. NoSQL databases: HBase, Elasticsearch (ES), Redis
iv. Message queues: Kafka
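
A hedged sketch of reading from a few of these source types; every path, host, table, and credential below is an illustrative placeholder, and connectors such as Kafka or HBase additionally require their own packages on the classpath:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("sources-demo")
             .getOrCreate())

    # File systems: local or HDFS paths, many formats
    df_csv = spark.read.option("header", "true").csv("/data/input.csv")
    df_parquet = spark.read.parquet("hdfs:///warehouse/events.parquet")

    # Relational databases via JDBC (the driver jar must be available)
    df_orders = (spark.read.format("jdbc")
                 .option("url", "jdbc:mysql://db-host:3306/shop")
                 .option("dbtable", "orders")
                 .option("user", "reader")
                 .option("password", "secret")
                 .load())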

3. Spark framework modules



The framework comprises Spark Core, Spark SQL, Spark Streaming, Spark GraphX, and Spark MLlib, with the last four built on top of the Spark Core engine.
Spark Core: the core of Spark. Spark's core functionality is provided by the Spark Core module, which is the foundation Spark runs on. Spark Core uses the RDD as its data abstraction, offers APIs in Python, Java, Scala, and R, and can be programmed to batch-process massive amounts of offline data (see the word-count sketch after this list).
Spark SQL: built on Spark Core, it provides a module for processing structured data. Spark SQL supports processing data with the SQL language; Spark SQL itself targets offline computing scenarios. On top of Spark SQL, Spark also provides the Structured Streaming module, which performs streaming computation based on Spark SQL.
Spark Streaming: built on Spark Core, it provides stream computing functionality for data.
MLlib: built on Spark Core, it performs machine learning computation, with a large number of built-in machine learning libraries and algorithm APIs, making it convenient to do machine learning in a distributed computing mode.
GraphX: built on Spark Core, it performs graph computation and provides a large number of graph computing APIs, making it convenient to do graph computing in a distributed computing mode.
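
A minimal sketch of the RDD abstraction that Spark Core provides: the classic word count (the input lines are made up):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("wordcount")
             .getOrCreate())
    sc = spark.sparkContext

    lines = sc.parallelize(["to be or not to be", "that is the question"])
    counts = (lines.flatMap(lambda line: line.split())   # split lines into words
                   .map(lambda word: (word, 1))          # pair each word with 1
                   .reduceByKey(lambda a, b: a + b))     # sum the counts per word
    print(counts.collect())

    spark.stop()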

4. Run modes


Spark provides a variety of run modes, including:
Local mode (single machine), for development and testing: local mode uses one independent process and simulates the entire Spark runtime environment with multiple threads inside it.
Standalone mode (cluster): the Spark roles run as independent processes and together form the Spark cluster environment.
Hadoop YARN mode (cluster): each Spark role runs inside a YARN container, and together they form the Spark cluster environment.
Kubernetes mode (container cluster): each Spark role runs inside a Kubernetes container, and together they form the Spark cluster environment.
Cloud service mode: runs on a cloud platform.
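
In code, the run mode is typically selected through the master URL passed to the session builder (or to spark-submit). A hedged sketch; the hosts and ports below are placeholders:

    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("mode-demo")

    # Local mode: one process whose worker threads simulate the cluster
    spark = builder.master("local[*]").getOrCreate()
    spark.stop()

    # Other modes use different master URLs (placeholders, shown as comments):
    #   Standalone : .master("spark://master-host:7077")
    #   YARN       : .master("yarn")      # cluster located via HADOOP_CONF_DIR
    #   Kubernetes : .master("k8s://https://k8s-apiserver:6443")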

5. Spark's architectural roles


YARN has four main kinds of roles, viewed from two levels:

Resource management level:
Cluster resource manager (Master): ResourceManager
Single-node resource manager (Worker): NodeManager

Task computation level:
Single-task manager (Master): ApplicationMaster
Single-task executor (Worker): Task (the computing framework's work role, running inside a container)

Spark's roles:

Resource level:
Master role: cluster resource manager
Worker role: single-node resource manager

Task level:
Driver: manages a single task (application)
Executor role: performs the computation of a single task (the actual work)

Note: under normal circumstances the Executor is the role that does the work, but in special scenarios (local mode) the Driver can both manage and do the work.
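
As a hedged illustration of how these task-level roles are sized in practice (all values are illustrative, and the executor settings only take effect in cluster modes):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("roles-demo")
             .master("local[*]")                       # in local mode the Driver also does the work
             .config("spark.executor.instances", "4")  # number of Executor processes (cluster modes)
             .config("spark.executor.cores", "2")      # CPU cores per Executor
             .config("spark.executor.memory", "2g")    # memory per Executor
             .config("spark.driver.memory", "1g")      # memory for the Driver (single-task manager)
             .getOrCreate())
    spark.stop()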

6. Summary


Problems Spark solves:
computation over massive data, both offline batch processing and real-time stream computation

Spark modules:
Spark Core, Spark SQL, stream computing (Spark Streaming), graph computing (GraphX), machine learning (MLlib)

Spark characteristics:
fast, easy to use, highly versatile, multiple run modes

Spark run modes:
local mode
cluster mode
cloud mode

Spark runtime roles:
Master: cluster resource manager (similar to ResourceManager)
Worker: single-node resource manager (similar to NodeManager)
Driver: single-task manager (similar to ApplicationMaster)
Executor: single-task executor (similar to a YARN container task)

Source: blog.csdn.net/Sheenky/article/details/126321198