02_Spark basic working principle and RDD

1. The basic working principle of Spark

Distributed: computation is split across the nodes of a cluster.
Mainly memory-based: intermediate data is kept in RAM rather than written to disk between steps.
Iterative computing: the same in-memory dataset can be processed repeatedly.
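The three points above can be illustrated with a small sketch. This is plain Python, not the Spark API: it only shows why keeping the working set in memory helps iterative jobs, where each pass reuses the same dataset instead of re-reading it from disk (as a MapReduce-style engine would between steps).

```python
# Conceptual sketch (plain Python, not Spark): an iterative computation
# that reuses an in-memory dataset on every pass.

data = list(range(1, 11))  # the "cached" dataset, loaded once into memory

# Each iteration reads the same in-memory data; no intermediate results
# are written out between passes. Here the iteration converges toward
# the mean of the dataset (5.5).
estimate = 0.0
for _ in range(20):
    estimate = sum((x + estimate) / 2 for x in data) / len(data)

print(round(estimate, 4))  # -> 5.5
```

In a disk-based engine, each of those 20 passes would round-trip through storage; keeping `data` resident in memory is what makes Spark fast for iterative workloads.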

2. RDD and its characteristics

An RDD (Resilient Distributed Dataset) is, abstractly, a collection of elements containing data. It is partitioned: the data is divided into multiple partitions, and each partition lives on a different node in the cluster, so the data in the RDD can be operated on in parallel. (Distributed)
The most important feature of an RDD is fault tolerance: it can automatically recover from node failures. If the RDD partition on a node is lost because that node fails, the RDD automatically recomputes the partition from its data source. All of this is transparent to the user.
RDD data is stored in memory by default, but when memory is insufficient, Spark automatically spills RDD data to disk. (Elastic)
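The partitioning and fault-tolerance behavior described above can be sketched in plain Python. This is not Spark's real implementation; the function and variable names (`compute_partition`, `get_partition`, `source`) are illustrative only. The key idea it shows is that a lost partition is not restored from a replica but recomputed from its lineage, invisibly to the caller.

```python
# Conceptual sketch (plain Python, not Spark internals): partitioning,
# lineage-based fault tolerance, and transparent recomputation.

source = list(range(12))   # the underlying data source
NUM_PARTITIONS = 3

def compute_partition(i):
    """Recompute partition i from the source (the 'lineage')."""
    chunk = source[i * 4:(i + 1) * 4]   # each partition gets a slice
    return [x * x for x in chunk]       # the transformation (e.g. a map)

# Build the RDD: each partition holds a slice of the transformed data,
# conceptually living on a different cluster node.
partitions = {i: compute_partition(i) for i in range(NUM_PARTITIONS)}

# Simulate a node failure: partition 1 is lost.
del partitions[1]

def get_partition(i):
    # Fault tolerance: a missing partition is rebuilt from its lineage.
    # The caller never sees the failure.
    if i not in partitions:
        partitions[i] = compute_partition(i)   # automatic recovery
    return partitions[i]

print(get_partition(1))  # -> [16, 25, 36, 49]
```

Because recovery is just "run the recorded transformation again," Spark does not need to keep redundant copies of every partition, which is much cheaper than full replication.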

3. Spark development

a. Core development: offline batch processing / latency-tolerant interactive data processing.
b. SQL queries: the underlying layer is still RDDs and operations on them.
c. Real-time computation: the underlying layer is likewise RDDs and operations on them.
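The point of b and c above is that higher-level workloads lower to the same RDD-style operator chain. The sketch below (plain Python, not Spark SQL's actual query planner) shows a SQL-shaped query expressed as the map/filter/reduce pipeline it would conceptually compile down to.

```python
# Conceptual sketch (plain Python): a SQL-style query lowered to an
# RDD-style chain of map / filter / reduce operations.

from functools import reduce

rows = [("spark", 3), ("hadoop", 1), ("spark", 2), ("flink", 4)]

# "SELECT SUM(count) FROM rows WHERE word = 'spark'" as an operator chain:
total = reduce(
    lambda a, b: a + b,                          # SUM(...)
    map(lambda row: row[1],                      # project the count column
        filter(lambda row: row[0] == "spark",    # WHERE word = 'spark'
               rows)),
    0,
)

print(total)  # -> 5
```

Whether the job is written as batch code, a SQL query, or a streaming computation, what ultimately executes on the cluster is this kind of chain of transformations over partitioned data.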


Origin: www.cnblogs.com/ytq1016/p/12682423.html