2021-02-28

Introduction to Spark

Spark was originally developed in 2009 at the AMP Lab of the University of California, Berkeley. It is a parallel computing framework for big data, based on in-memory computing, that can be used to build large-scale, low-latency data analysis applications.
The main features of Spark:
1. Fast execution speed
2. Easy to use (a minimal example follows this list)
3. Generality
4. Multiple deployment modes
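
To illustrate the "easy to use" point, here is a minimal word-count sketch against the standard Spark Scala API. This is not code from the original article, and the input path input.txt is a placeholder assumption.

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Start a local Spark session; "local[*]" uses all available cores.
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")
      .getOrCreate()

    // Read a text file, split lines into words, and count each word.
    val counts = spark.sparkContext
      .textFile("input.txt")            // placeholder input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)    // print a sample of the results
    spark.stop()
  }
}

The whole pipeline is a handful of lines; the equivalent logic in classic MapReduce would require considerably more boilerplate.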
[Figure: Spark architecture diagram]

[Figure: Spark ecosystem]

Spark ecosystem:
Spark's design follows the philosophy of "one software stack serves all application scenarios". It has gradually formed a complete ecosystem that provides not only an in-memory computing framework but also components for SQL ad hoc queries, real-time streaming computing, machine learning, and graph computing. Spark can be deployed on the resource manager YARN to provide a one-stop big data solution.
The Spark ecosystem is therefore sufficient to cover the three main scenarios at once: batch processing, interactive queries, and streaming data processing.
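
As a small illustration of the SQL ad hoc query capability mentioned above, the sketch below registers a DataFrame as a temporary view and queries it with Spark SQL. The file people.json and its columns name and age are hypothetical.

import org.apache.spark.sql.SparkSession

object AdHocQuery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AdHocQuery")
      .master("local[*]")
      .getOrCreate()

    // Load a JSON file into a DataFrame (hypothetical file and schema).
    val people = spark.read.json("people.json")

    // Register the DataFrame as a temporary view so SQL can reference it.
    people.createOrReplaceTempView("people")

    // Run an ad hoc SQL query directly against the view.
    spark.sql("SELECT name, age FROM people WHERE age > 21").show()

    spark.stop()
  }
}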
Spark runtime architecture:
RDD (Resilient Distributed Dataset): an abstraction of distributed memory that provides a highly restricted shared-memory model.
DAG (Directed Acyclic Graph): reflects the dependencies between RDDs.
Executor: a process that runs on a worker node and is responsible for running Tasks.
Application: a Spark application written by the user.
Task: the unit of work that runs on an Executor.
Job: a job contains multiple RDDs and the various operations applied to those RDDs.
Stage: the basic scheduling unit of a job. A job is divided into multiple groups of tasks; each group of tasks is called a stage (also called a task set) and represents a set of related tasks with no shuffle dependencies among them (see the sketch below).
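
To make these concepts concrete, here is a sketch using the standard RDD API; it is an assumed example, not code from the original article. Transformations only record dependencies in the DAG lazily; the collect() action submits a job; and the shuffle introduced by reduceByKey splits that job into two stages.

import org.apache.spark.sql.SparkSession

object JobsAndStages {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JobsAndStages")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Transformations are lazy: they only build up the DAG of RDD dependencies.
    val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
    val pairs  = words.map(w => (w, 1))   // narrow dependency: same stage
    val counts = pairs.reduceByKey(_ + _) // shuffle dependency: new stage

    // The action triggers a job; Spark splits it into two stages at the shuffle.
    counts.collect().foreach(println)

    // toDebugString prints the RDD lineage, making the stage boundary visible.
    println(counts.toDebugString)

    spark.stop()
  }
}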

Original: blog.csdn.net/weixin_46519384/article/details/114227568