High Performance Spark Reading Notes: Chapter 2

1. Fault tolerance

Spark is fault-tolerant, meaning Spark will not fail, lose data, or return inaccurate
results in the event of a host machine or network failure. Spark's unique method of
fault tolerance is achieved because each partition of the data contains the dependency
information needed to recalculate the partition. Most distributed computing paradigms
that allow users to work with mutable objects provide fault tolerance by logging
updates or duplicating data across machines.

In contrast, Spark does not need to maintain a log of updates to each RDD or log the
actual intermediary steps, since the RDD itself contains all the dependency information
needed to replicate each of its partitions. Thus, if a partition is lost, the RDD has
enough information about its lineage to recompute it, and that computation can be
parallelized to make recovery faster.
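The lineage idea above can be sketched in plain Python (a toy model, not the Spark API): each partition stores only its parents and the transformation that derives it, so a lost partition can be recomputed instead of restored from a replica.

```python
# Toy model (not the Spark API): each partition keeps its dependency
# information (parent partitions plus the transformation), so a lost
# partition can be recomputed from lineage.

class Partition:
    def __init__(self, parents, compute):
        self.parents = parents      # parent partitions (the dependency info)
        self.compute = compute      # transformation applied to parent data
        self.data = None            # materialized data, which may be lost

    def materialize(self):
        parent_data = [p.materialize() for p in self.parents]
        self.data = self.compute(*parent_data)
        return self.data

# A source partition with no parents, then a partition derived from it.
source = Partition([], lambda: [1, 2, 3, 4])
doubled = Partition([source], lambda xs: [x * 2 for x in xs])

doubled.materialize()
doubled.data = None                 # simulate losing the partition
recovered = doubled.materialize()   # recomputed from lineage, not a log
print(recovered)                    # [2, 4, 6, 8]
```

Because each partition's recomputation only depends on its own parents, recovery of multiple lost partitions can run in parallel, which is the point the passage makes.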

2. Possibly the only option for enormous computations

On disk
RDDs, whose partitions are too large to be stored in RAM on each of the
executors, can be written to disk. This strategy is obviously slower for
repeated computations, but can be more fault-tolerant for long sequences of
transformations, and may be the only feasible option for enormous computations.
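The trade-off can be illustrated with a small spill-to-disk sketch (plain Python, not Spark's implementation; the memory budget is an arbitrary toy number):

```python
import os
import pickle
import tempfile

# Toy sketch: when a partition exceeds the memory budget, serialize it to
# disk and read it back on demand -- slower, but it always fits.
MEMORY_BUDGET = 100  # pretend we can hold at most 100 elements in RAM

def store_partition(data):
    if len(data) <= MEMORY_BUDGET:
        return ("memory", data)
    fd, path = tempfile.mkstemp(suffix=".part")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(data, f)
    return ("disk", path)

def load_partition(location):
    kind, ref = location
    if kind == "memory":
        return ref
    with open(ref, "rb") as f:     # re-reading from disk is the slow path
        return pickle.load(f)

small = store_partition(list(range(10)))    # fits in RAM
big = store_partition(list(range(1000)))    # spilled to disk
print(small[0], big[0])                     # memory disk
```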

3. Persisted RDDs use LRU cache eviction

When persisting RDDs, the default implementation of RDDs evicts the least recently
used partition (called LRU caching) if the space it takes is required to compute or
to cache a new partition.
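The eviction policy itself is easy to sketch with an `OrderedDict` (a minimal LRU cache, analogous in spirit to evicting the least recently used partition; not Spark's actual memory manager):

```python
from collections import OrderedDict

# Minimal LRU eviction sketch: when capacity is exceeded, drop the entry
# that was used least recently.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        self.entries.move_to_end(key)   # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("p0", "partition 0")
cache.put("p1", "partition 1")
cache.get("p0")                  # p0 is now the most recently used
cache.put("p2", "partition 2")   # capacity exceeded: p1 is evicted
print(list(cache.entries))       # ['p0', 'p2']
```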

4. The main properties of an RDD

Internally, Spark uses five main properties to represent an RDD:

partitions()

iterator(p, parentIters)

dependencies()

partitioner()

preferredLocations(p)
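A toy class can make the shape of this interface concrete (method names follow the list above; the bodies are illustrative stand-ins, not Spark's implementation):

```python
# Toy sketch of the five-property interface of an RDD.
class ToyRDD:
    def __init__(self, partitions_data, parent=None):
        self._data = partitions_data      # one list per partition
        self._parent = parent

    def partitions(self):
        # The partitions that make up this RDD (here, just their indices).
        return list(range(len(self._data)))

    def iterator(self, p, parent_iters):
        # Compute the elements of partition p, given iterators over the
        # corresponding parent partitions.
        return iter(self._data[p])

    def dependencies(self):
        # The parent RDDs this RDD was derived from.
        return [self._parent] if self._parent else []

    def partitioner(self):
        # None unless the RDD has a known partitioner.
        return None

    def preferredLocations(self, p):
        # Data-locality hints for partition p; empty in this toy version.
        return []

rdd = ToyRDD([[1, 2], [3, 4]])
print(rdd.partitions())            # [0, 1]
print(list(rdd.iterator(1, [])))   # [3, 4]
```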

5. Types of RDDs

PairRDDFunctions

OrderedRDDFunctions

GroupedRDDFunctions

6. About actions

Each Spark program must contain an action, since actions either bring information
back to the driver or write the data to stable storage.

Some of these actions do not scale well, since they can cause memory
errors in the driver. In general, it is best to use actions like
take, count, and reduce, which bring back a fixed amount of data
to the driver, rather than collect or sample.
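The contrast can be shown with a plain-Python toy (partitions modeled as lists; not the Spark API): take, count, and reduce send a bounded result back to the "driver", while collect materializes the whole dataset there.

```python
# Toy dataset: 50 partitions of 100 elements each (5,000 total).
partitions = [list(range(i * 100, (i + 1) * 100)) for i in range(50)]

def take(n):
    # Returns at most n elements -- bounded regardless of dataset size.
    out = []
    for part in partitions:
        for x in part:
            out.append(x)
            if len(out) == n:
                return out
    return out

def count():
    # Returns a single number, no matter how big the dataset is.
    return sum(len(part) for part in partitions)

def collect():
    # Returns the ENTIRE dataset -- this is what can exhaust driver memory.
    return [x for part in partitions for x in part]

print(len(take(5)))    # 5     -- fixed, small
print(count())         # 5000  -- one number
print(len(collect()))  # 5000  -- every element shipped to the driver
```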
 

7. Resource allocation

Spark offers two ways of allocating resources across applications: static allocation
and dynamic allocation.
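The two modes are typically chosen via configuration. The property names below are standard Spark settings; the values are only illustrative, and an application would use one mode or the other, not both:

```properties
# spark-defaults.conf sketch (values are illustrative).

# Static allocation: a fixed set of executors for the whole application.
spark.executor.instances       4
spark.executor.memory          4g
spark.executor.cores           2

# Dynamic allocation: executors are added and removed based on workload.
spark.dynamicAllocation.enabled         true
spark.dynamicAllocation.minExecutors    1
spark.dynamicAllocation.maxExecutors    10
spark.shuffle.service.enabled           true
```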

8. How partitioners affect RDDs

The same operations on RDDs with known partitioners and RDDs
without a known partitioner can result in different stage boundaries,
because there is no need to shuffle an RDD with a known
partitioner.
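Why a known partitioner removes the shuffle can be seen in a toy hash partitioner (plain Python, not Spark): when both datasets are partitioned the same way, equal keys already sit at the same partition index, so a join can proceed partition by partition with no data movement.

```python
# Toy illustration: two datasets partitioned by the same partitioner
# place equal keys in the same partition index, so they can be joined
# partition-by-partition with no shuffle.
NUM_PARTITIONS = 4

def hash_partition(pairs, n=NUM_PARTITIONS):
    parts = [[] for _ in range(n)]
    for key, value in pairs:
        parts[hash(key) % n].append((key, value))
    return parts

left = hash_partition([("a", 1), ("b", 2), ("c", 3)])
right = hash_partition([("a", 10), ("c", 30)])

# Join each pair of co-partitioned partitions locally.
joined = []
for lp, rp in zip(left, right):
    rmap = dict(rp)
    joined.extend((k, (v, rmap[k])) for k, v in lp if k in rmap)

print(sorted(joined))   # [('a', (1, 10)), ('c', (3, 30))]
```

Without a known partitioner, matching keys could live in any partition of either dataset, which is exactly what forces a shuffle and hence a new stage boundary.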



Reprinted from blog.csdn.net/ruiyiin/article/details/104738547