1, Spark Introduction
Spark is a general-purpose, large-scale data processing framework based on in-memory computing. It integrates with the Hadoop ecosystem, supports a broader range of job types and application scenarios than MapReduce, and retains MapReduce's high fault tolerance and high scalability. Spark supports offline batch processing, stream computing, and real-time analysis.
2, Why Spark is fast
Why MapReduce is slow:
- When multiple MapReduce jobs run one after another, each job depends on the intermediate results that the previous job wrote to HDFS
- When processing a complex DAG (directed acyclic graph), MapReduce incurs heavy data serialization, data copying, and disk I/O overhead
Why Spark is fast:
- Spark computes in memory and, as far as possible, avoids writing intermediate results to disk and avoids unnecessary sort and shuffle operations
- Spark caches data that is used repeatedly
- Spark is highly optimized for DAGs, in particular by dividing a job into stages and by using lazy evaluation
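
The caching and lazy evaluation points above can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark's API: the class `LazyDataset`, its methods, and the helper `load_numbers` are hypothetical names invented for illustration. Transformations only record work, an action triggers it, and a cached result is reused instead of being recomputed from the source.

```python
class LazyDataset:
    """Toy model of a lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument callable that produces a list
        self._cache_enabled = False
        self._cached = None

    def map(self, fn):
        # Transformation: only records the work, nothing runs yet.
        return LazyDataset(lambda: [fn(x) for x in self._materialize()])

    def filter(self, pred):
        # Transformation: also deferred until an action is called.
        return LazyDataset(lambda: [x for x in self._materialize() if pred(x)])

    def cache(self):
        # Ask for the result to be kept in memory after the first computation.
        self._cache_enabled = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        result = self._compute()
        if self._cache_enabled:
            self._cached = result
        return result

    def collect(self):
        # Action: triggers the actual computation.
        return list(self._materialize())

    def count(self):
        # Action: reuses the cached result when one exists.
        return len(self._materialize())


load_calls = []  # records how often the source is actually read


def load_numbers():
    # Stand-in for an expensive read from storage (assumption for this sketch).
    load_calls.append(1)
    return list(range(10))


numbers = LazyDataset(load_numbers)
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()

calls_before_action = len(load_calls)   # still 0: no action has run yet
first = evens_squared.collect()
second = evens_squared.count()
calls_after_actions = len(load_calls)   # 1: the second action hit the cache
```

In this sketch the source is read once even though two actions run, which mirrors the idea of caching data that is used repeatedly.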
3, Resilient Distributed Datasets (RDD)
Spark keeps data in distributed memory. An RDD is an abstraction over that distributed memory and provides a highly restricted memory model: logically it is a single, centralized dataset, but physically it is stored across multiple machines in the cluster.
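
The split between the logical view and the physical layout can be sketched with a toy model. This is plain Python, not Spark's API: `PartitionedDataset`, `split_into_partitions`, and the choice of contiguous chunks are assumptions made for illustration. Each inner tuple stands in for the slice of data held by one machine.

```python
def split_into_partitions(records, num_partitions):
    """Split records into contiguous chunks (an assumed layout for this sketch)."""
    records = list(records)
    size = -(-len(records) // num_partitions)  # ceiling division
    return [records[i * size:(i + 1) * size] for i in range(num_partitions)]


class PartitionedDataset:
    """Toy model: one logical dataset stored as several partitions."""

    def __init__(self, partitions):
        # Tuples are used so the contents cannot be modified in place.
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def from_records(cls, records, num_partitions):
        return cls(split_into_partitions(records, num_partitions))

    @property
    def partitions(self):
        # Physical view: what each (hypothetical) machine holds.
        return self._partitions

    def map(self, fn):
        # Each partition is processed independently and a new dataset is returned.
        return PartitionedDataset([fn(x) for x in p] for p in self._partitions)

    def collect(self):
        # Logical view: the partitions read as one dataset.
        return [x for p in self._partitions for x in p]


dataset = PartitionedDataset.from_records(range(10), num_partitions=3)
doubled = dataset.map(lambda x: x * 2)

layout = doubled.partitions
logical = doubled.collect()
```

Here `layout` shows three separate chunks while `logical` reads as one list, and `dataset` itself is left unchanged by `map`.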
4, RDD attributes and characteristics
Read-only | Created from HDFS or another persistent storage system, or by applying a transformation to a parent RDD |