Spark in-memory computing: characteristics, troubleshooting, and OOM problems

1, Introduction to Spark

Spark is a general-purpose, large-scale data processing framework based on in-memory computing. It integrates with the Hadoop ecosystem, supports more job types and a broader range of application scenarios than MapReduce, and retains MapReduce's high fault tolerance and high scalability. Spark supports offline batch processing, stream computing, and real-time analysis.

2, Why Spark is fast

    Why MapReduce is slow:

  • When several MapReduce jobs run in sequence, each job depends on intermediate results that the previous job wrote to HDFS
  • When MapReduce handles a complex DAG (directed acyclic graph), it incurs a large amount of data serialization, data copying, and disk I/O overhead

     Why Spark is fast:

  • Spark is memory-based: as far as possible it avoids writing intermediate results to disk and avoids unnecessary sort and shuffle operations
  • Spark caches data that is used repeatedly
  • Spark's handling of the DAG is highly optimized; in particular, Spark divides a job into stages and uses lazy evaluation
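The caching and lazy evaluation points can be illustrated with a small toy model in plain Python. This is only a sketch and not Spark's API: the `LazyDataset` class, its methods, and the `load` function are hypothetical names made up for the illustration. Transformations are merely recorded, the source is read only when an action (`collect`) runs, and a cached intermediate result is not recomputed.

```python
# Toy model of lazy evaluation and caching. NOT Spark's API:
# LazyDataset and load are hypothetical names used only for illustration.
class LazyDataset:
    def __init__(self, compute):
        self._compute = compute    # zero-argument callable returning an iterator
        self._use_cache = False    # set by cache()
        self._cached = None        # materialized records, filled on first use

    def _iterate(self):
        if self._cached is not None:
            return iter(self._cached)              # reuse the in-memory copy
        if self._use_cache:
            self._cached = list(self._compute())   # compute once, keep in memory
            return iter(self._cached)
        return self._compute()                     # recompute from the parent

    def map(self, fn):
        # Only records the step; nothing is computed until an action runs.
        return LazyDataset(lambda: (fn(x) for x in self._iterate()))

    def filter(self, pred):
        return LazyDataset(lambda: (x for x in self._iterate() if pred(x)))

    def cache(self):
        self._use_cache = True
        return self

    def collect(self):
        # The "action": this is what actually triggers the computation.
        return list(self._iterate())


reads = {"count": 0}

def load():
    # Stands in for an expensive read from storage.
    reads["count"] += 1
    return iter(range(10))

base = LazyDataset(load)
squares = base.map(lambda x: x * x).cache()
evens = squares.filter(lambda x: x % 2 == 0)
reads_before_action = reads["count"]   # still 0: no action has run yet

first = evens.collect()     # reads the source once and fills the cache of squares
second = squares.collect()  # served from the cache, the source is not read again
```

Without the `cache()` call, the second `collect` would walk back to `load` and read the source a second time, which is the repeated work that caching is meant to avoid.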

3, Resilient Distributed Datasets (RDD)

Spark keeps data in distributed memory. An RDD is an abstraction of that distributed memory which provides a highly restricted memory model: logically it is a single, centralized dataset, but physically it is stored across multiple machines in the cluster.
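The idea of "logically one dataset, physically many pieces" can be sketched with plain Python lists standing in for machines. This is a toy model under that assumption, not how Spark itself places data; the helper names `split_into_partitions`, `map_partitions`, and `collect` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy model: each inner list plays the role of the partition held by one machine.
def split_into_partitions(records, num_partitions):
    """Split records into num_partitions contiguous chunks of near-equal size."""
    records = list(records)
    size, extra = divmod(len(records), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(records[start:end])
        start = end
    return partitions

def map_partitions(partitions, fn):
    """Apply fn to every record, processing each partition independently."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(lambda part: [fn(x) for x in part], partitions))

def collect(partitions):
    """Present the partitions as one logical dataset again."""
    return [x for part in partitions for x in part]

partitions = split_into_partitions(range(1, 11), 3)   # physical view: 3 pieces
doubled = map_partitions(partitions, lambda x: x * 2)
logical_view = collect(doubled)                       # logical view: one dataset
```

The caller only ever reasons about `logical_view`; how the records were split and where each piece was processed stays hidden behind the abstraction.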

4, RDD attributes and characteristics   

  • Read-only
  • Created from HDFS or another persistent storage system
  • Created by applying a transformation to a parent RDD
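These properties can be mimicked with a frozen dataclass. The sketch below is a toy model, not Spark's RDD class: `ToyRDD`, `from_storage`, `lineage`, the `hdfs://example/input.txt` path, and the `fake_files` dictionary are all hypothetical and exist only for the illustration. A dataset is either read from storage or derived from a parent by a transformation, and an existing dataset cannot be modified.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Callable, Optional

# Toy model of a read-only dataset. NOT Spark's RDD class; all names are hypothetical.
@dataclass(frozen=True)
class ToyRDD:
    description: str
    compute: Callable[[], tuple]          # produces the records on demand
    parent: Optional["ToyRDD"] = None     # None when read directly from storage

    @staticmethod
    def from_storage(path, reader):
        # First way to create a dataset: read it from a persistent store.
        return ToyRDD(f"read({path})", lambda: tuple(reader(path)))

    def map(self, fn, name="map"):
        # Second way: transform a parent. The parent itself is left untouched.
        return ToyRDD(name, lambda: tuple(fn(x) for x in self.compute()), parent=self)

    def lineage(self):
        # Walk the parent links back to the storage read.
        node, steps = self, []
        while node is not None:
            steps.append(node.description)
            node = node.parent
        return list(reversed(steps))


# Hypothetical path and contents standing in for a file on HDFS.
fake_files = {"hdfs://example/input.txt": ["a", "bb", "ccc"]}

lines = ToyRDD.from_storage("hdfs://example/input.txt", lambda p: fake_files[p])
lengths = lines.map(len, name="map(len)")

try:
    lengths.description = "changed"   # attempt to modify an existing dataset
    mutated = True
except FrozenInstanceError:
    mutated = False                   # the frozen dataclass rejects the write
```

Because every derived dataset keeps a reference to its parent, `lengths.lineage()` lists the steps needed to rebuild it from the original storage read.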

Origin blog.csdn.net/as4589sd/article/details/104033138