Spark - an overview of RDDs

1. What is an RDD?

Five characteristics from the RDD source-code description: resilient, distributed, immutable, supports parallel operations, and is a partitioned dataset

Five main properties:

  • An RDD has a list of partitions; one RDD can have multiple partitions
  • A function applied to an RDD is actually applied to each split in it; one split is one partition
  • There is a series of dependencies between RDDs; for example, map creates a narrow (one-to-one) dependency, while groupByKey creates a wide (shuffle) dependency. This lineage is what lets Spark recompute lost partitions
  • (Optional) A Partitioner for key-value RDDs: groupByKey uses a HashPartitioner, while sortByKey uses a RangePartitioner
  • (Optional) Each split has a list of preferred locations (note the plural: an HDFS block is replicated across several nodes, so the split can be scheduled on any of them). The spark-shell sketch after this list pokes at these properties
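A minimal spark-shell (Scala) sketch that inspects these properties on a small pair RDD; the values in the comments assume this exact input:

```scala
// Build a pair RDD from an in-memory collection with 4 partitions.
val pairs = sc.parallelize(1 to 100, 4).map(x => (x % 10, x))

pairs.getNumPartitions                 // 1) list of partitions -> 4
// 2) the compute function: the map above runs once per split (per partition)
pairs.dependencies                     // 3) lineage: a narrow OneToOneDependency here
pairs.partitioner                      // 4) None -- no shuffle has assigned one yet

pairs.groupByKey().partitioner         // Some(HashPartitioner)
pairs.sortByKey().partitioner          // Some(RangePartitioner)

// 5) preferred locations for a split (empty for a parallelized collection;
//    for an HDFS-backed RDD this lists the nodes holding the block's replicas)
pairs.preferredLocations(pairs.partitions(0))
```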

 

2. How to create an RDD:

  • From an existing collection: sc.parallelize(collection, numberOfPartitions)
  • From a file (local, HDFS, or S3). For a local file in a distributed deployment, make sure every worker machine has the file at the same path. Both routes are shown in the sketch after this list
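A minimal sketch of both creation routes; all file paths below are hypothetical placeholders:

```scala
// Route 1: from an existing collection, here with 2 partitions.
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5), 2)

// Route 2: from files. A file:// path must exist on every worker machine.
val fromLocal = sc.textFile("file:///tmp/input.txt")
val fromHdfs  = sc.textFile("hdfs:///user/data/input.txt")
val fromS3    = sc.textFile("s3a://my-bucket/input.txt")  // needs S3 credentials configured
```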

3. Spark file read and write APIs:

sc.textFile

sc.sequenceFile

sc.wholeTextFiles

sc.newAPIHadoopFile

sc.newAPIHadoopRDD

sc.hadoopRDD

rdd.saveAsObjectFile
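A hedged usage sketch of these APIs in the Scala shell; all paths (hdfs:///data/...) are hypothetical:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val lines = sc.textFile("hdfs:///data/logs.txt")         // RDD[String], one element per line
val files = sc.wholeTextFiles("hdfs:///data/dir")        // RDD[(path, wholeFileContent)]
val seq   = sc.sequenceFile[String, Int]("hdfs:///data/seq")  // Hadoop SequenceFile

// newAPIHadoopFile takes the key, value, and InputFormat types explicitly
val raw = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/logs.txt")

lines.saveAsObjectFile("hdfs:///data/out-obj")           // write Java-serialized records
val back = sc.objectFile[String]("hdfs:///data/out-obj") // read them back
```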

 

4. Basic operations of RDD

Transformations, actions, etc. Transformations (e.g. map, filter) are lazy and only record lineage; actions (e.g. count, collect) trigger actual job execution.
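A minimal sketch of the lazy/eager split, using a small parallelized collection:

```scala
val nums = sc.parallelize(1 to 10)

val evens   = nums.filter(_ % 2 == 0)  // transformation: nothing executes yet
val squares = evens.map(x => x * x)    // still lazy; only extends the lineage

squares.count()        // action -> 5
squares.collect()      // action -> Array(4, 16, 36, 64, 100)
squares.reduce(_ + _)  // action -> 220
```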

 

 
