Spark Learning (3): Spark RDDs

1. Overview of RDDs

1.1. What is an RDD

RDD stands for Resilient Distributed Dataset. It is the most basic data abstraction in Spark, representing an immutable, partitioned collection whose elements can be computed in parallel. RDDs have the characteristics of a dataflow model: automatic fault tolerance, location-aware scheduling, and scalability. RDDs let users explicitly cache a working set in memory when executing multiple queries; subsequent queries can then reuse the working set, which greatly improves query speed.
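As a minimal sketch of the caching behavior described above (assuming a spark-shell session where `sc` is predefined; the file name `data.txt` and the filter conditions are hypothetical):

```scala
// Cache a working set in memory and reuse it across queries.
val lines = sc.textFile("data.txt")                      // hypothetical input file
val errors = lines.filter(_.contains("ERROR")).cache()   // mark the working set for in-memory caching

// The first action materializes the RDD and populates the cache...
val total = errors.count()

// ...later queries over the same RDD reuse the in-memory copy
// instead of re-reading and re-filtering the file.
val recent = errors.filter(_.contains("2023")).count()
```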

1.2. Properties of RDD

(1) A list of partitions (Partition), the basic units of the dataset. Each partition is processed by one compute task, so the number of partitions determines the granularity of parallelism. Users can specify the number of partitions when creating an RDD; if it is not specified, a default value is used, namely the number of CPU cores allocated to the program.
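For example, in a spark-shell session (where `sc` is predefined), the partition count can be left to the default or set explicitly:

```scala
// Default partitioning: typically the number of available CPU cores.
val rdd1 = sc.parallelize(1 to 10)

// Explicitly request 4 partitions via the second argument.
val rdd2 = sc.parallelize(1 to 10, 4)

rdd2.getNumPartitions   // returns 4
```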

(2) A function for computing each partition. Computation of RDDs in Spark proceeds partition by partition, and every RDD implements a compute function for this purpose. The compute function composes the partition iterators rather than saving the result of each computation.

(3) Dependencies on other RDDs. Each transformation of an RDD produces a new RDD, so RDDs form a pipeline-like chain of dependencies. When the data of some partitions is lost, Spark can recompute just those partitions through this dependency chain, instead of recomputing all partitions of the RDD.
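This dependency chain (lineage) can be inspected directly; a sketch assuming a spark-shell session:

```scala
// Each transformation produces a new RDD that remembers its parent.
val base = sc.parallelize(1 to 100)
val mapped = base.map(_ * 2)
val filtered = mapped.filter(_ > 50)

// toDebugString prints the pipeline-like chain of parent RDDs that
// Spark would use to recompute a lost partition.
println(filtered.toDebugString)
```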

(4) A Partitioner, the partitioning function of the RDD. Spark currently implements two kinds of partitioning functions: the hash-based HashPartitioner and the range-based RangePartitioner. Only key-value RDDs have a Partitioner; for non-key-value RDDs, the Partitioner is None. The Partitioner determines not only the number of partitions of the RDD itself but also the number of partitions of the parent RDD's shuffle output.
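A short sketch of this distinction, assuming a spark-shell session:

```scala
import org.apache.spark.HashPartitioner

// A key-value RDD starts without a Partitioner...
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.partitioner                                  // None

// ...until one is assigned, e.g. a hash-based partitioner with 4 partitions.
val hashed = pairs.partitionBy(new HashPartitioner(4))
hashed.partitioner                                 // Some(HashPartitioner)
hashed.getNumPartitions                            // 4
```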

(5) A list of preferred locations for each Partition. For an HDFS file, this list holds the locations of the blocks where each Partition resides. Following the principle that "moving computation is cheaper than moving data," Spark's task scheduler assigns computing tasks to the storage locations of the data blocks they process whenever possible.

1.3. A rough illustration of WordCount with RDDs

(Figure: the chain of RDDs produced by a WordCount job)

(Figure: the contents of the input file hello.txt)


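The pictured flow can be sketched in code, assuming a spark-shell session and a plain space-separated text file named hello.txt:

```scala
// Word count over hello.txt, matching the RDD chain illustrated above.
val counts = sc.textFile("hello.txt")
  .flatMap(_.split(" "))   // split each line into words
  .map((_, 1))             // pair each word with an initial count of 1
  .reduceByKey(_ + _)      // sum the counts per word

counts.collect().foreach(println)
```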
2. How to create an RDD

2.1. Creating an RDD by reading a file

RDDs can be created from datasets in external storage systems, including the local file system and any data source supported by Hadoop, such as HDFS, Cassandra, and HBase.

scala> val file = sc.textFile("workcount.txt")
file: org.apache.spark.rdd.RDD[String] = workcount.txt MapPartitionsRDD[2] at textFile at <console>:24

2.2. Creating an RDD by parallelizing a collection

An RDD can also be created from an existing Scala collection.

scala> val array = Array(1,2,3,4,5)
array: Array[Int] = Array(1, 2, 3, 4, 5)                  

scala> val rdd = sc.parallelize(array)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:26

scala>

2.3. Other ways

Other operations, such as reading from a database, can also generate RDDs.

An RDD can also be created by transforming an existing RDD.

3. RDD programming API

Spark supports two types of operations (operators) on RDDs: Transformations and Actions.

3.1. Transformation

A Transformation produces a new RDD from an existing one. Transformations are lazy (lazily evaluated): the code of a transformation operator is not executed immediately. It only actually runs when the program reaches an action operator. This design lets Spark run more efficiently.
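Laziness can be observed directly; a sketch assuming a spark-shell session in local mode:

```scala
// Transformations are lazy: nothing runs until an action is invoked.
val rdd = sc.parallelize(1 to 5)
val doubled = rdd.map { x =>
  println(s"computing $x")  // nothing is printed yet; map only records the transformation
  x * 2
}

// collect() is an action: only now does the map function actually execute,
// printing "computing 1" ... "computing 5" and returning Array(2, 4, 6, 8, 10).
doubled.collect()
```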

Common Transformations:

   
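A few of the most commonly used transformations, sketched in a spark-shell session (the sample data is hypothetical):

```scala
val nums = sc.parallelize(List(1, 2, 3, 4, 5))

nums.map(_ * 10).collect()                    // Array(10, 20, 30, 40, 50)
nums.filter(_ % 2 == 0).collect()             // Array(2, 4)
nums.flatMap(x => List(x, x)).collect()       // each element duplicated

val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))

pairs.reduceByKey(_ + _).collect()            // sums per key: ("a", 4) and ("b", 2)
pairs.sortByKey().collect()                   // ordered by key: "a" entries before "b"
```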

 


















