1. What is an RDD?
Five characteristics named in the RDD source code: resilient, distributed, immutable, operable in parallel, and partitioned
Five main properties:
- An RDD is made up of a list of partitions
- A function applied to an RDD is actually applied to each split in it, and one split is one partition
- Each RDD has a list of dependencies on other RDDs
- (Optional) A key-value RDD has a partitioner: for example, groupByKey uses a HashPartitioner, and sortByKey uses a RangePartitioner
- (Optional) Each split has a list of preferred locations (note the plural here, why?)
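The difference between hash partitioning and range partitioning can be sketched in plain Python, without Spark. This is only a toy model of the idea, not Spark's implementation: the helper names `hash_partition` and `range_partition`, the sample pairs, and the bounds `[4, 8]` are all made up for illustration.

```python
from bisect import bisect_left

def hash_partition(key, num_partitions):
    """Pick a partition from the key's hash (toy model of hash partitioning)."""
    # Python's % returns a non-negative result for a positive modulus.
    return hash(key) % num_partitions

def range_partition(key, upper_bounds):
    """Pick a partition by comparing the key with sorted upper bounds."""
    # Keys <= upper_bounds[0] go to partition 0, keys <= upper_bounds[1]
    # go to partition 1, and anything larger goes to the last partition.
    return bisect_left(upper_bounds, key)

# Hypothetical key-value data; integer keys hash to themselves in CPython.
pairs = [(7, "a"), (1, "b"), (12, "c"), (4, "d"), (9, "e"), (7, "f")]

num_partitions = 3
hashed = [[] for _ in range(num_partitions)]
for k, v in pairs:
    hashed[hash_partition(k, num_partitions)].append((k, v))

bounds = [4, 8]  # assumed split points; two bounds give three partitions
ranged = [[] for _ in range(len(bounds) + 1)]
for k, v in pairs:
    ranged[range_partition(k, bounds)].append((k, v))

print(hashed)
print(ranged)
```

In both layouts, equal keys land in the same partition, which is what a grouping operation needs. Only the range layout also keeps partitions ordered relative to each other, which is what a sort needs.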
2. How to create an RDD:
- Create from an existing collection: sc.parallelize(collection object, number of partitions)
- Create from a file (local, HDFS, or S3). If it is a local file and the job runs in a distributed environment, make sure every machine has the file at the same path
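To make the "number of partitions" argument concrete, here is a minimal plain-Python sketch of cutting a collection into contiguous, near-equal slices. The function `slice_collection` is hypothetical, and the slicing rule is an assumption for illustration; the layout Spark actually produces may differ and can be inspected with `sc.parallelize(data, 3).glom().collect()`.

```python
def slice_collection(data, num_slices):
    """Split a list into num_slices contiguous, near-equal pieces (toy model)."""
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    n = len(data)
    # Slice i covers the index range [i*n // num_slices, (i+1)*n // num_slices).
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

parts = slice_collection(list(range(10)), 3)
print(parts)
```

Every element ends up in exactly one slice, and asking for more slices than elements simply leaves some slices empty.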
3. Spark file read and write APIs:
sc.textFile, sc.sequenceFile, sc.wholeTextFiles, sc.newAPIHadoopFile, sc.newAPIHadoopRDD, sc.hadoopRDD, rdd.saveAsObjectFile
4. Basic operations of RDD
transformations, actions, etc.
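The split between transformations and actions comes down to lazy evaluation: a transformation only records what should be done, and an action triggers the computation. The sketch below models that behaviour in plain Python. `LazyDataset` and `traced_double` are hypothetical names invented for this example, and the class is a toy, not how Spark is implemented.

```python
class LazyDataset:
    """Toy model: transformations record steps, actions execute them."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new dataset, run nothing yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    # Actions: run the recorded steps over the source and return a result.
    def collect(self):
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):  # a filter step rejected the item
                    keep = False
                    break
            if keep:
                out.append(item)
        return out

    def count(self):
        return len(self.collect())

calls = []  # records every time the mapped function really runs

def traced_double(x):
    calls.append(x)
    return x * 2

numbers = LazyDataset([1, 2, 3, 4, 5])
big = numbers.map(traced_double).filter(lambda x: x > 4)

calls_before_action = len(calls)  # transformations alone ran nothing
result = big.collect()            # the action triggers the work
calls_after_action = len(calls)

print(calls_before_action, result, calls_after_action)
```

Each transformation returns a new object and leaves the original untouched, which mirrors the immutability of RDDs described in section 1.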