SparkCore basic analysis (2)

1. RDD overview

1.1 What is RDD

RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. In code it is an abstract class that represents an immutable, partitioned collection whose elements can be computed in parallel.

1.2 Properties of RDD


1) A list of partitions (Partition), the basic units of the data set;
2) A function for computing each partition;
3) A list of dependencies on other RDDs;
4) A Partitioner, i.e. the partitioning function of the RDD (for key-value RDDs);
5) A list of preferred locations for computing each partition.

1.3 RDD features

RDD represents a read-only, partitioned data set. An RDD can only be changed through transformation operations: a new RDD is derived from an existing one, and the new RDD carries the information needed to derive it from its parents. Dependencies therefore exist between RDDs, and execution is computed lazily according to this lineage. If the lineage chain grows too long, it can be cut by persisting the RDD.

1.3.1 Partition

RDDs are logically partitioned (in the same way as partitioning in Hadoop). The data of each partition exists only abstractly; during computation, the data of a partition is obtained through a compute function. If the RDD is built from an existing file system, compute reads the data from that file system; if the RDD is derived from another RDD, compute applies the conversion logic to the parent RDD's data.


1.3.2 Read only

RDDs are read-only: to change the data in an RDD, you can only create a new RDD based on the existing one.

One RDD is converted into another through a rich set of operators, rather than being limited to writing only map and reduce as in MapReduce.
There are two kinds of RDD operators. Transformations build a new RDD from an existing one and construct the RDD lineage; actions trigger the actual computation of an RDD, returning results to the application or saving the RDD to a file system.
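For example, a minimal spark-shell sketch (assuming the usual SparkContext sc): the map and filter calls below are transformations that only record lineage, and nothing is computed until the collect action runs.

val nums = sc.parallelize(1 to 4)         // create an RDD
val doubled = nums.map(_ * 2)             // transformation: only lineage is recorded
val evens = doubled.filter(_ % 4 == 0)    // transformation: still nothing computed
evens.collect()                           // action: triggers the job, returns Array(4, 8)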

1.3.3 Dependencies

RDDs are derived from one another through operators, and each new RDD carries the information needed to derive it from its parents. This lineage maintained between RDDs is also called dependency. There are two types: narrow dependency, where the partitions of the parent and child RDDs correspond one to one, and wide dependency, where each partition of the downstream RDD depends on every partition of the upstream (parent) RDD, a many-to-many relationship.

1.3.4 Caching

If the same RDD is used multiple times in an application, it can be cached. The RDD then computes its partition data from lineage only the first time it is evaluated; later uses read directly from the cache instead of recomputing from lineage, which speeds up reuse. For example, suppose RDD-1 goes through a series of transformations to obtain RDD-n, which is saved to HDFS. If an intermediate result of RDD-1 is cached in memory during this process (the cache is only materialized once an action is executed), then the subsequent conversion from RDD-1 to RDD-m no longer needs to recompute RDD-0.
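A minimal sketch of this idea in the spark-shell (the input path data.txt is a made-up example): after cache() and the first action, later actions read the cached partitions instead of recomputing the lineage.

val rdd1 = sc.textFile("data.txt").map(_.toUpperCase)   // hypothetical input file
rdd1.cache()      // mark rdd1 for caching; nothing is materialized yet
rdd1.count()      // first action: computes rdd1 and fills the cache
rdd1.collect()    // subsequent actions reuse the cached partitions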


1.3.5 CheckPoint

The lineage of an RDD naturally provides fault tolerance: when the data of a partition fails or is lost, it can be rebuilt from lineage. For long-running iterative applications, however, the lineage between RDDs grows longer and longer as iterations proceed, and once an error occurs in a later iteration the lost partition must be reconstructed through a very long lineage, which inevitably hurts performance. For this reason RDDs support checkpointing, which saves the data to persistent storage and severs the earlier lineage: an RDD that has been checkpointed no longer needs to know its parent RDDs, because it can read its data directly from the checkpoint.
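A minimal checkpoint sketch (the HDFS directory is the one used later in section 2.8; the RDD itself is a toy example):

sc.setCheckpointDir("hdfs://hadoop102:9000/checkpoint")   // where checkpoint files are written
val step = sc.parallelize(1 to 100).map(_ * 2)            // some intermediate RDD in a long lineage
step.checkpoint()    // mark it; the data is written when an action runs
step.count()         // triggers the job and materializes the checkpoint, cutting the earlier lineage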

2. RDD programming

2.1 Programming model

In Spark, RDDs are represented as objects, and RDDs are transformed through method calls on those objects. After defining an RDD through a series of transformations, you call an action to trigger its computation. An action either returns a result to the application (count, collect, etc.) or saves data to a storage system (saveAsTextFile, etc.). RDD computation is lazy: it only happens when an action is encountered, which allows multiple transformations to be pipelined at runtime.
To use Spark, a developer writes a Driver program, which is submitted to the cluster to schedule and run Workers. The Driver defines one or more RDDs and calls actions on them, while the Workers perform the RDD partition computation tasks.
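A minimal driver sketch following this model (the local[*] master and the input path input.txt are assumptions for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object MiniDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("MiniDriver")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("input.txt")   // define an RDD; nothing is computed yet
    val lengths = lines.map(_.length)      // transformation, still lazy
    println(lengths.reduce(_ + _))         // action: Workers compute the partitions, the Driver gets the result

    sc.stop()
  }
}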



2.2 Creation of RDD

There are three ways to create RDDs in Spark: creating RDDs from collections; creating RDDs from external storage; and creating RDDs from other RDDs.

2.2.1 Create from a collection

To create RDD from a collection, Spark mainly provides two functions: parallelize and makeRDD
1) Use parallelize() to create an RDD from a collection
scala> val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
2) Use makeRDD() to create an RDD from a collection
scala> val rdd1 = sc.makeRDD(Array(1,2,3,4,5,6,7,8))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at makeRDD at <console>:24

2.2.2 Creation from datasets in external storage systems

This includes the local file system as well as all data sets supported by Hadoop, such as HDFS, Cassandra, HBase, etc. This will be introduced in detail in Chapter 4.
scala> val rdd2 = sc.textFile("hdfs://hadoop102:9000/RELEASE")
rdd2: org.apache.spark.rdd.RDD[String] = hdfs://hadoop102:9000/RELEASE MapPartitionsRDD[4] at textFile at <console>:24

2.2.3 Create from other RDDs

See the transformation operators in section 2.3 below.

2.3 RDD conversion

RDD transformations are broadly divided into Value types and Key-Value types.

2.3.1 Value type

2.3.1.1 map(func) case
  1. Function: Returns a new RDD, which is composed of each input element converted by the func function
  2. Requirement: Create an RDD from the array 1-10 and multiply all elements by 2 to form a new RDD
    (1) Create
scala> var source  = sc.parallelize(1 to 10)
source: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:24

(2) Print
scala> source.collect()
res7: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
(3) Put all elements *2
scala> val mapadd = source.map(_ * 2)
mapadd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[9] at map at <console>:26
(4) Print the final result
scala> mapadd.collect()
res8: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)

2.3.1.2 mapPartitions(func) case
  1. Function: Similar to map, but runs independently on each partition (shard) of the RDD. When running on an RDD of type T, func must be of type Iterator[T] => Iterator[U]. With N elements and M partitions, map's function is called N times while mapPartitions' function is called M times: one call processes all the data of one partition at once.
  2. Requirement: Create an RDD and multiply each element by 2 to form a new RDD
    (1) Create an RDD
    scala> val rdd = sc.parallelize(Array(1,2,3,4))
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:24
    (2) Multiply each element by 2 to form a new RDD
    scala> rdd.mapPartitions(x=>x.map(_*2))
    res3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[6] at mapPartitions at <console>:27
    (3) Print the new RDD
    scala> res3.collect
    res4: Array[Int] = Array(2, 4, 6, 8)
2.3.1.3 mapPartitionsWithIndex(func) case
  1. Function: Similar to mapPartitions, but func takes an extra integer parameter representing the index of the partition, so when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U];
  2. Requirement: Create an RDD in which each element is paired with its partition index to form a new RDD of tuples
    (1) Create an RDD
    scala> val rdd = sc.parallelize(Array(1,2,3,4))
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:24
    (2) Pair each element with its partition index to form a new RDD
    scala> val indexRdd = rdd.mapPartitionsWithIndex((index,items)=>(items.map((index,_))))
    indexRdd: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[5] at mapPartitionsWithIndex at <console>:26
    (3) Print the new RDD
    scala> indexRdd.collect
    res2: Array[(Int, Int)] = Array((0,1), (0,2), (1,3), (1,4))
2.3.1.4 flatMap(func) case
  1. Function: Similar to map, but each input element can be mapped to 0 or more output elements (so func should return a sequence rather than a single element)
  2. Requirement: Create an RDD with elements 1-5 and use flatMap to create a new RDD, where each original element expands to the range from 1 to itself (1->1; 2->1,2; ...; 5->1,2,3,4,5)
    (1) Create
    scala> val sourceFlat = sc.parallelize(1 to 5)
    sourceFlat: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:24
    (2) Print
    scala> sourceFlat.collect()
    res11: Array[Int] = Array(1, 2, 3, 4, 5)
    (3) Create the new RDD from the original RDD (1->1; 2->1,2; ...; 5->1,2,3,4,5)
    scala> val flatMap = sourceFlat.flatMap(1 to _)
    flatMap: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[13] at flatMap at <console>:26
    (4) Print the new RDD
    scala> flatMap.collect()
    res12: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5)
2.3.1.5 The difference between map() and mapPartitions()
  1. map(): processes one element at a time.
  2. mapPartitions(): processes all the data of one partition at a time; the partition's data can only be released after the whole partition has been processed, so for large partitions this may cause OOM.
  3. Development guidance: when memory is plentiful, mapPartitions() is recommended for better efficiency; see the sketch below.
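A common use of mapPartitions() is to pay a per-partition setup cost once per partition instead of once per element. In the sketch below a DecimalFormat stands in for an expensive per-partition resource such as a database connection (which is what you would typically create here):

val ids = sc.parallelize(1 to 100, 4)
val tagged = ids.mapPartitions { iter =>
  // per-partition setup, paid once per partition rather than once per element
  val formatter = new java.text.DecimalFormat("0000")   // stand-in for e.g. a DB connection
  iter.map(id => "ID-" + formatter.format(id))
}
tagged.take(3)   // Array(ID-0001, ID-0002, ID-0003)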
2.3.1.6 glom case
  1. Function: Form each partition into an array and form a new RDD type RDD[Array[T]]
  2. Requirements: Create an RDD with 4 partitions, and put the data of each partition into an array
    (1) Create
    scala> val rdd = sc.parallelize(1 to 16,4)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[65] at parallelize at <console>:24
    (2) Put the data of each partition into an array and collect it to the Driver for printing
    scala> rdd.glom().collect()
    res25: Array[Array[Int]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8), Array(9, 10, 11, 12), Array(13, 14, 15, 16))
2.3.1.7 groupBy(func) case
  1. Function: Grouping: groups the elements according to the return value of the given function, putting values with the same key into one iterator.
  2. Requirement: Create an RDD and group the elements by their value modulo 2.
    (1) Create
    scala> val rdd = sc.parallelize(1 to 4)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[65] at parallelize at <console>:24
    (2) Group the elements by their value modulo 2
    scala> val group = rdd.groupBy(_%2)
    group: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[2] at groupBy at <console>:26
    (3) Print the result
    scala> group.collect
    res0: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(2, 4)), (1,CompactBuffer(1, 3)))
2.3.1.8 filter(func) case
  1. Function: filter. Returns a new RDD consisting of input elements that return true after being calculated by the func function.
  2. Requirement: Create an RDD of strings and filter out the elements containing the substring "xiao" to form a new RDD
    (1) Create
    scala> var sourceFilter = sc.parallelize(Array("xiaoming","xiaojiang","xiaohe","dazhi"))
    sourceFilter: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[10] at parallelize at <console>:24
    (2) Print
    scala> sourceFilter.collect()
    res9: Array[String] = Array(xiaoming, xiaojiang, xiaohe, dazhi)
    (3) Filter out the elements containing "xiao" to form a new RDD
    scala> val filter = sourceFilter.filter(_.contains("xiao"))
    filter: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at filter at <console>:26
    (4) Print the new RDD
    scala> filter.collect()
    res10: Array[String] = Array(xiaoming, xiaojiang, xiaohe)
2.3.1.9 sample(withReplacement, fraction, seed) case
  1. Function: Randomly sample a fraction of the data using the given random seed. withReplacement indicates whether sampling is done with replacement: true means with replacement, false means without. seed specifies the seed of the random number generator. For example, randomly sample 50% of the data from an RDD with replacement, using a random seed of 3.
  2. Requirement: Create an RDD (1-10) and sample it both with and without replacement
    (1) Create the RDD
    scala> val rdd = sc.parallelize(1 to 10)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:24
    (2) Print
    scala> rdd.collect()
    res15: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    (3) Sample with replacement
    scala> var sample1 = rdd.sample(true,0.4,2)
    sample1: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[21] at sample at <console>:26
    (4) Print the result of sampling with replacement
    scala> sample1.collect()
    res16: Array[Int] = Array(1, 2, 2, 7, 7, 8, 9)
    (5) Sample without replacement
    scala> var sample2 = rdd.sample(false,0.2,3)
    sample2: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[22] at sample at <console>:26
    (6) Print the result of sampling without replacement
    scala> sample2.collect()
    res17: Array[Int] = Array(1, 9)
2.3.1.10 distinct([numTasks])) case
  1. Function: Return a new RDD after deduplicating the source RDD. By default, only 8 parallel tasks operate, but this can be changed by passing an optional numTasks parameter.
  2. Requirement: Create an RDD and use distinct() to deduplicate it.
    (1) Create an RDD
    scala> val distinctRdd = sc.parallelize(List(1,2,1,5,2,9,6,1))
    distinctRdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[34] at parallelize at <console>:24
    (2) Deduplicate the RDD (without specifying the parallelism)
    scala> val unionRDD = distinctRdd.distinct()
    unionRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[37] at distinct at <console>:26
    (3) Print the new RDD generated after deduplication
    scala> unionRDD.collect()
    res20: Array[Int] = Array(1, 9, 5, 6, 2)
    (4) Deduplicate the RDD (specifying a parallelism of 2)
    scala> val unionRDD = distinctRdd.distinct(2)
    unionRDD: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[40] at distinct at <console>:26
    (5) Print the new RDD generated after deduplication
    scala> unionRDD.collect()
    res21: Array[Int] = Array(6, 2, 1, 9, 5)
2.3.1.11 coalesce(numPartitions) case
  1. Function: Reduce the number of partitions; used to improve execution efficiency after a large data set has been filtered down to a small one.
  2. Requirement: Create an RDD with 4 partitions and reduce its number of partitions
    (1) Create an RDD
    scala> val rdd = sc.parallelize(1 to 16,4)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[54] at parallelize at <console>:24
    (2) View the number of partitions of the RDD
    scala> rdd.partitions.size
    res20: Int = 4
    (3) Repartition the RDD
    scala> val coalesceRDD = rdd.coalesce(3)
    coalesceRDD: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[55] at coalesce at <console>:26
    (4) Check the number of partitions of the new RDD
    scala> coalesceRDD.partitions.size
    res21: Int = 3
2.3.1.12 repartition(numPartitions) case
  1. Function: Randomly shuffle all data through the network according to the number of partitions.
  2. Requirements: Create an RDD with 4 partitions, repartition it
    (1) Create an RDD
    scala> val rdd = sc.parallelize(1 to 16,4)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[56] at parallelize at <console>:24
    (2) View the number of partitions of the RDD
    scala> rdd.partitions.size
    res22: Int = 4
    (3) Repartition the RDD
    scala> val rerdd = rdd.repartition(2)
    rerdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[60] at repartition at <console>:26
    (4) Check the number of partitions of the new RDD
    scala> rerdd.partitions.size
    res23: Int = 2
2.3.1.13 The difference between coalesce and repartition
  1. coalesce repartitions the RDD and lets you choose whether to shuffle, controlled by the parameter shuffle: Boolean (false by default).
  2. repartition actually calls coalesce with shuffle = true, as the source code shows (see also the sketch below):
    def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope { coalesce(numPartitions, shuffle = true) }
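A short spark-shell sketch of the difference (partition counts shown as comments):

val rdd = sc.parallelize(1 to 16, 4)
rdd.coalesce(2).partitions.size                   // 2: partitions merged locally, no shuffle
rdd.coalesce(8).partitions.size                   // still 4: without a shuffle, coalesce cannot increase the partition count
rdd.coalesce(8, shuffle = true).partitions.size   // 8: with a shuffle the number can grow
rdd.repartition(8).partitions.size                // 8: repartition is coalesce with shuffle = true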

2.3.1.14 sortBy(func,[ascending], [numTasks]) case
  1. Function: Use func to process the data first, then sort by comparing the processed values. The default order is ascending.
  2. Requirement: Create an RDD and sort it according to different rules
    (1) Create an RDD
    scala> val rdd = sc.parallelize(List(2,1,3,4))
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[21] at parallelize at <console>:24
    (2) Sort by the elements themselves
    scala> rdd.sortBy(x => x).collect()
    res11: Array[Int] = Array(1, 2, 3, 4)
    (3) Sort by the remainder modulo 3
    scala> rdd.sortBy(x => x%3).collect()
    res12: Array[Int] = Array(3, 4, 1, 2)
2.3.1.15 pipe(command, [envVars]) case
  1. Function: Pipeline, for each partition, executes a shell script and returns the output RDD.
    Note: The script needs to be placed where the Worker node can access it.
  2. Requirements: Write a script and use pipelines to apply the script to RDD.
    (1) Write a shell script
    #!/bin/sh
    echo "AA"
    while read LINE; do
    echo ">>>"${LINE}
    done
    (2) Create an RDD with only one partition
    scala> val rdd = sc.parallelize(List("hi","Hello","how","are","you"),1)
    rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[50] at parallelize at <console>:24
    (3) Apply the script to the RDD and print
    scala> rdd.pipe("/opt/module/spark/pipe.sh").collect()
    res18: Array[String] = Array(AA, >>>hi, >>>Hello, >>>how, >>>are, >>>you)
    (4) Create an RDD with two partitions
    scala> val rdd = sc.parallelize(List("hi","Hello","how","are","you"),2)
    rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[52] at parallelize at <console>:24
    (5) Apply the script to the RDD and print
    scala> rdd.pipe("/opt/module/spark/pipe.sh").collect()
    res19: Array[String] = Array(AA, >>>hi, >>>Hello, AA, >>>how, >>>are, >>>you)

2.3.2 Double Value type interaction

2.3.2.1 union(otherDataset) case
  1. Function: Return a new RDD after union of source RDD and parameter RDD
  2. Requirements: Create two RDDs and find the union
    (1) Create the first RDD
    scala> val rdd1 = sc.parallelize(1 to 5)
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[23] at parallelize at <console>:24
    (2) Create a second RDD
    scala> val rdd2 = sc.parallelize(5 to 10)
    rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[24] at parallelize at <console>:24
    (3) Compute the union of the two RDDs
    scala> val rdd3 = rdd1.union(rdd2)
    rdd3: org.apache.spark.rdd.RDD[Int] = UnionRDD[25] at union at <console>:28
    (4) Print the union result
    scala> rdd3.collect()
    res18: Array[Int] = Array(1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10)
2.3.2.2 subtract (otherDataset) case
  1. Function: Computes the difference: elements of the source RDD that also appear in the other RDD are removed, and the remaining elements are kept.
  2. Requirement: Create two RDDs and compute the difference between the first RDD and the second RDD
    (1) Create the first RDD
    scala> val rdd = sc.parallelize(3 to 8)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[70] at parallelize at <console>:24
    (2) Create a second RDD
    scala> val rdd1 = sc.parallelize(1 to 5)
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[71] at parallelize at <console>:24
    (3) Compute the difference between the first RDD and the second RDD and print
    scala> rdd.subtract(rdd1).collect()
    res27: Array[Int] = Array(8, 6, 7)
2.3.2.3 intersection(otherDataset) case
  1. Function: Return a new RDD after finding the intersection of the source RDD and the parameter RDD.
  2. Requirements: Create two RDDs and find the intersection of the two RDDs
    (1) Create the first RDD
    scala> val rdd1 = sc.parallelize(1 to 7)
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[26] at parallelize at <console>:24
    (2) Create a second RDD
    scala> val rdd2 = sc.parallelize(5 to 10)
    rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[27] at parallelize at <console>:24
    (3) Compute the intersection of the two RDDs
    scala> val rdd3 = rdd1.intersection(rdd2)
    rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[33] at intersection at <console>:28
    (4) Print the result
    scala> rdd3.collect()
    res19: Array[Int] = Array(5, 6, 7)
2.3.2.4 cartesian(otherDataset) case
  1. Function: Cartesian product (try to avoid using it)
  2. Requirements: Create two RDDs and calculate the Cartesian product of the two RDDs
    (1) Create the first RDD
    scala> val rdd1 = sc.parallelize(1 to 3)
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[47] at parallelize at <console>:24
    (2) Create a second RDD
    scala> val rdd2 = sc.parallelize(2 to 5)
    rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[48] at parallelize at <console>:24
    (3) Compute the Cartesian product of the two RDDs and print
    scala> rdd1.cartesian(rdd2).collect()
    res17: Array[(Int, Int)] = Array((1,2), (1,3), (1,4), (1,5), (2,2), (2,3), (2,4), (2,5), (3,2), (3,3), (3,4), (3,5))
2.3.2.5 zip(otherDataset) case
  1. Function: Combine two RDDs into one RDD of key/value pairs. The two RDDs must have the same number of partitions and the same number of elements per partition, otherwise an exception is thrown.
  2. Requirements: Create two RDDs and combine the two RDDs together to form one (k, v) RDD
    (1) Create the first RDD
    scala> val rdd1 = sc.parallelize(Array(1,2,3),3)
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24
    (2) Create a second RDD (with the same number of partitions as the first)
    scala> val rdd2 = sc.parallelize(Array("a","b","c"),3)
    rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[2] at parallelize at <console>:24
    (3) Zip the first RDD with the second and print
    scala> rdd1.zip(rdd2).collect
    res1: Array[(Int, String)] = Array((1,a), (2,b), (3,c))
    (4) Zip the second RDD with the first and print
    scala> rdd2.zip(rdd1).collect
    res2: Array[(String, Int)] = Array((a,1), (b,2), (c,3))
    (5) Create a third RDD (with a different number of partitions from the first two)
    scala> val rdd3 = sc.parallelize(Array("a","b","c"),2)
    rdd3: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[5] at parallelize at <console>:24
    (6) Zip the first RDD with the third and print
    scala> rdd1.zip(rdd3).collect
    java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions: List(3, 2)
    at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1965)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
    … 48 elided

2.3.3 Key-Value type

2.3.3.1 partitionBy case
  1. Function: Partition a pairRDD. If the RDD's existing partitioner is the same as the requested one, no repartitioning is performed; otherwise a ShuffledRDD is generated, which involves a shuffle.
  2. Requirements: Create an RDD with 4 partitions and repartition it
    (1) Create an RDD
    scala> val rdd = sc.parallelize(Array((1,"aaa"),(2,"bbb"),(3,"ccc"),(4,"ddd")),4)
    rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[44] at parallelize at <console>:24
    (2) View the number of partitions of the RDD
    scala> rdd.partitions.size
    res24: Int = 4
    (3) Repartition the RDD
    scala> var rdd2 = rdd.partitionBy(new org.apache.spark.HashPartitioner(2))
    rdd2: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[45] at partitionBy at <console>:26
    (4) Check the number of partitions of the new RDD
    scala> rdd2.partitions.size
    res25: Int = 2
2.3.3.2 groupByKey case
  1. Function: groupByKey operates on each key, producing a single sequence of all the values for that key (no aggregation is performed).
  2. Requirement: Create a pairRDD, group the values of the same key into one sequence, and compute the sum of the values of each key.
    (1) Create a pairRDD
    scala> val words = Array("one", "two", "two", "three", "three", "three")
    words: Array[String] = Array(one, two, two, three, three, three)

scala> val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
wordPairsRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at map at <console>:26
(2) Group the values of the same key into one sequence
scala> val group = wordPairsRDD.groupByKey()
group: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[5] at groupByKey at <console>:28
(3) Print the result
scala> group.collect()
res1: Array[(String, Iterable[Int])] = Array((two,CompactBuffer(1, 1)), (one,CompactBuffer(1)), (three,CompactBuffer(1, 1, 1)))
(4) Compute the sum of the values of each key
scala> group.map(t => (t._1, t._2.sum))
res2: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[6] at map at <console>:31
(5) Print the result
scala> res2.collect()
res3: Array[(String, Int)] = Array((two,2), (one,1), (three,3))
2.3.3.3 The difference between reduceByKey and groupByKey

  1. reduceByKey: aggregates by key; there is a combine (pre-aggregation) step on the map side before the shuffle, and the result is an RDD[(K, V)].
  2. groupByKey: groups by key and shuffles directly, with no pre-aggregation.
  3. Development guidance: reduceByKey is preferred over groupByKey, but check that it does not break the business logic. For example, a per-key average cannot be computed by reducing the raw values directly; see the sketch below.
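A sketch of the average problem (the salary data is made up): reducing the raw values directly cannot give a per-key average, but reducing (sum, count) pairs and dividing at the end does, while still benefiting from map-side pre-aggregation.

val salaries = sc.parallelize(List(("dev", 100.0), ("dev", 200.0), ("ops", 150.0)))
val avgByKey = salaries
  .mapValues(v => (v, 1))                              // value -> (sum, count)
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))   // pre-aggregated before the shuffle
  .mapValues { case (sum, cnt) => sum / cnt }
avgByKey.collect()   // e.g. Array((dev,150.0), (ops,150.0))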
2.3.3.4 reduceByKey(func, [numTasks]) case
  1. Function: Called on an RDD of (K, V) pairs, returns an RDD of (K, V) pairs in which the values of each key are aggregated using the given reduce function. The number of reduce tasks can be set with the second, optional parameter.
  2. Requirement: Create a pairRDD and compute the sum of the values of each key
    (1) Create a pairRDD
    scala> val rdd = sc.parallelize(List(("female",1),("male",5),("female",5),("male",2)))
    rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[46] at parallelize at <console>:24
    (2) Compute the sum of the values of each key
    scala> val reduce = rdd.reduceByKey((x,y) => x+y)
    reduce: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[47] at reduceByKey at <console>:26
    (3) Print the result
    scala> reduce.collect()
    res29: Array[(String, Int)] = Array((female,6), (male,7))
2.3.3.5 aggregateByKey case

Parameter:(zeroValue:U,[partitioner: Partitioner]) (seqOp: (U, V) => U,combOp: (U, U) => U)

  1. Function: In an RDD of key-value pairs, values are grouped and merged by key. Within each partition, each value is combined with the accumulator (starting from the initial value) by the seq function, producing a new key-value pair per key; then the per-partition results for the same key are merged by the combine function (the first two values are combined, the result is combined with the next value, and so on), and the (key, result) pairs are output as a new key-value RDD.
  2. Parameter description:
    (1) zeroValue: gives each key an initial value within each partition (the initial value only applies per partition);
    (2) seqOp: the function used to iteratively fold the values into the accumulator within each partition;
    (3) combOp: the function used to merge the per-partition results.
  3. Requirement: Create a pairRDD, take the maximum value of each key within each partition, and then add these maxima up.
  4. Requirement analysis

(1) Create a pairRDD
scala> val rdd = sc.parallelize(List(("a",3),("a",2),("c",4),("b",3),("c",6),("c",8)),2)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:24
(2) Take the maximum value of each key within each partition, then add the maxima up
scala> val agg = rdd.aggregateByKey(0)(math.max(_,_), _+_)
agg: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[1] at aggregateByKey at <console>:26
(3) Print the result
scala> agg.collect()
res0: Array[(String, Int)] = Array((b,3), (a,3), (c,12))

2.3.3.6 foldByKey case

Parameters: (zeroValue: V)(func: (V, V) => V): RDD[(K, V)]
1. Function: A simplified aggregateByKey in which seqOp and combOp are the same function
2. Requirement: Create a pairRDD and compute the sum of the values of each key
(1) Create a pairRDD
scala> val rdd = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),3)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[91] at parallelize at <console>:24
(2) Compute the sum of the values of each key
scala> val agg = rdd.foldByKey(0)(_+_)
agg: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[92] at foldByKey at <console>:26
(3) Print the result
scala> agg.collect()
res61: Array[(Int, Int)] = Array((3,14), (1,9), (2,3))

2.3.3.7 combineByKey[C] case

Parameters: (createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C)
1. Function: For the same K, merge V into a set.
2. Parameter description:
(1) createCombiner: combineByKey() traverses all elements in a partition, so each element's key either has not been seen yet or is the same as the key of a previous element. If the key is new, combineByKey() uses the createCombiner() function to create the initial value of the accumulator for that key.
(2) mergeValue: If the key has already been seen while processing the current partition, the mergeValue() function merges the key's current accumulator with the new value.
(3) mergeCombiners: Since each partition is processed independently, the same key can have multiple accumulators. If two or more partitions have accumulators for the same key, the user-provided mergeCombiners() function merges the per-partition results.
3. Requirement: Create a pairRDD and calculate the mean value of each key based on the key. (First calculate the number of occurrences of each key and the sum of the corresponding values, and then divide to get the result)
4. Requirements analysis:
(1) Create a pairRDD
scala> val input = sc.parallelize(Array(("a", 88), ("b", 95), ("a", 91), ("b", 93), ("a", 95), ("b", 98)),2)
input: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[52] at parallelize at <console>:26
(2) Sum the values of the same key and record how many times the key appears, putting both into a tuple
scala> val combine = input.combineByKey((_,1),(acc:(Int,Int),v)=>(acc._1+v,acc._2+1),(acc1:(Int,Int),acc2:(Int,Int))=>(acc1._1+acc2._1,acc1._2+acc2._2))
combine: org.apache.spark.rdd.RDD[(String, (Int, Int))] = ShuffledRDD[5] at combineByKey at <console>:28
(3) Print the combined result
scala> combine.collect
res5: Array[(String, (Int, Int))] = Array((b,(286,3)), (a,(274,3)))
(4) Compute the average
scala> val result = combine.map{case (key,value) => (key,value._1/value._2.toDouble)}
result: org.apache.spark.rdd.RDD[(String, Double)] = MapPartitionsRDD[54] at map at <console>:30
(5) Print the result
scala> result.collect()
res33: Array[(String, Double)] = Array((b,95.33333333333333), (a,91.33333333333333))

2.3.3.8 sortByKey([ascending], [numTasks]) case
  1. Function: Called on an RDD of (K, V). K must implement the Ordered interface and return an RDD of (K, V) sorted by key.
  2. Requirement: Create a pairRDD and sort it by key in ascending and then descending order
    (1) Create a pairRDD
    scala> val rdd = sc.parallelize(Array((3,"aa"),(6,"cc"),(2,"bb"),(1,"dd")))
    rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[14] at parallelize at <console>:24
    (2) Sort by key in ascending order
    scala> rdd.sortByKey(true).collect()
    res9: Array[(Int, String)] = Array((1,dd), (2,bb), (3,aa), (6,cc))
    (3) Sort by key in descending order
    scala> rdd.sortByKey(false).collect()
    res10: Array[(Int, String)] = Array((6,cc), (3,aa), (2,bb), (1,dd))
2.3.3.9 mapValues case
  1. Function: For an RDD of (K, V) pairs, operates on the values only.
  2. Requirement: Create a pairRDD and append the string "|||" to each value
    (1) Create a pairRDD
    scala> val rdd3 = sc.parallelize(Array((1,"a"),(1,"d"),(2,"b"),(3,"c")))
    rdd3: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[67] at parallelize at <console>:24
    (2) Append the string "|||" to each value
    scala> rdd3.mapValues(_+"|||").collect()
    res26: Array[(Int, String)] = Array((1,a|||), (1,d|||), (2,b|||), (3,c|||))
2.3.3.10 join(otherDataset, [numTasks]) case
  1. Function: Called on RDDs of type (K, V) and (K, W), return an RDD of (K, (V, W)) in which all elements corresponding to the same key are paired together.
  2. Requirement: Create two pairRDDs and aggregate data with the same key into a tuple.
    (1) Create the first pairRDD
    scala> val rdd = sc.parallelize(Array((1,"a"),(2,"b"),(3,"c")))
    rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[32] at parallelize at <console>:24
    (2) Create a second pairRDD
    scala> val rdd1 = sc.parallelize(Array((1,4),(2,5),(3,6)))
    rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[33] at parallelize at <console>:24
    (3) Join the two RDDs and print the result
    scala> rdd.join(rdd1).collect()
    res13: Array[(Int, (String, Int))] = Array((1,(a,4)), (2,(b,5)), (3,(c,6)))
2.3.3.11 cogroup(otherDataset, [numTasks]) case
  1. Function: Called on RDDs of type (K, V) and (K, W), return an RDD of type (K, (Iterable, Iterable))
  2. Requirement: Create two pairRDDs and aggregate data with the same key into an iterator.
    (1) Create the first pairRDD
    scala> val rdd = sc.parallelize(Array((1,"a"),(2,"b"),(3,"c")))
    rdd: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[37] at parallelize at <console>:24
    (2) Create a second pairRDD
    scala> val rdd1 = sc.parallelize(Array((1,4),(2,5),(3,6)))
    rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[38] at parallelize at <console>:24
    (3) cogroup the two RDDs and print the result
    scala> rdd.cogroup(rdd1).collect()
    res14: Array[(Int, (Iterable[String], Iterable[Int]))] = Array((1,(CompactBuffer(a),CompactBuffer(4))), (2,(CompactBuffer(b),CompactBuffer(5))), (3,(CompactBuffer(c),CompactBuffer(6))))

2.3.4 Case practice

  1. Data format: timestamp, province, city, user, advertisement; fields are separated by spaces.

The sample is as follows:
1516609143867 6 7 64 16
1516609143869 9 4 75 18
1516609143869 1 7 87 12
2. Requirement: Count the TOP3 number of clicks on advertisements in each province
3. Implementation process:

package com.wxn.practice

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

// Requirement: count the Top 3 most-clicked advertisements in each province
object Practice {

  def main(args: Array[String]): Unit = {

    // 1. Initialize the Spark configuration and create the connection to Spark
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Test")
    val sc = new SparkContext(sparkConf)

    // 2. Read the data and generate the RDD: TS, Province, City, User, AD
    val line = sc.textFile("E:\\IDEAWorkSpace\\SparkTest\\src\\main\\resources\\agent.log")

    // 3. Aggregate at the finest granularity: ((Province, AD), 1)
    val provinceAdAndOne = line.map { x =>
      val fields: Array[String] = x.split(" ")
      ((fields(1), fields(3)), 1)
    }

    // 4. Count the total clicks of each advertisement in each province: ((Province, AD), sum)
    val provinceAdToSum = provinceAdAndOne.reduceByKey(_ + _)

    // 5. Use the province as key and (AD, clicks) as value: (Province, (AD, sum))
    val provinceToAdSum = provinceAdToSum.map(x => (x._1._1, (x._1._2, x._2)))

    // 6. Group all advertisements of the same province: (Province, List((AD1,sum1),(AD2,sum2)...))
    val provinceGroup = provinceToAdSum.groupByKey()

    // 7. Sort each province's advertisements by total clicks and take the top 3
    val provinceAdTop3 = provinceGroup.mapValues { x =>
      x.toList.sortWith((x, y) => x._2 > y._2).take(3)
    }

    // 8. Pull the result to the Driver and print it
    provinceAdTop3.collect().foreach(println)

    // 9. Close the connection to Spark
    sc.stop()

  }

}

POM file dependencies:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.1.1</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.1.1</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.1.1</version>
</dependency>

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.11.0.2</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.1.1</version>
</dependency>

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.27</version>
</dependency>

2.4 Action

2.4.1 reduce(func) case

  1. Function: Aggregate all elements in the RDD through the func function, first aggregate the data within the partition, and then aggregate the data between partitions.
  2. Requirements: Create an RDD and aggregate all elements to get the result
    (1) Create an RDD[Int]
    scala> val rdd1 = sc.makeRDD(1 to 10,2)
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[85] at makeRDD at <console>:24
    (2) Aggregate all elements of the RDD[Int]
    scala> rdd1.reduce(_+_)
    res50: Int = 55
    (3) Create an RDD[(String, Int)]
    scala> val rdd2 = sc.makeRDD(Array(("a",1),("a",3),("c",3),("d",5)))
    rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[86] at makeRDD at <console>:24
    (4) Aggregate all the data of the RDD[(String, Int)]
    scala> rdd2.reduce((x,y)=>(x._1 + y._1,x._2 + y._2))
    res51: (String, Int) = (adca,12)

2.4.2 collect() case

  1. Function: In the driver program, return all elements of the data set in the form of an array.
  2. Requirements: Create an RDD and collect the RDD contents to the Driver for printing
    (1) Create an RDD
    scala> val rdd = sc.parallelize(1 to 10)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
    (2) Collect the results to the Driver side
    scala> rdd.collect
    res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

2.4.3 count() case

  1. Function: Returns the number of elements in RDD
  2. Requirements: Create an RDD and count the number of entries in the RDD
    (1) Create an RDD
    scala> val rdd = sc.parallelize(1 to 10)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
    (2) Count the number of items in the RDD
    scala> rdd.count
    res1: Long = 10

2.4.4 first() case

  1. Function: Returns the first element in RDD
  2. Requirements: Create an RDD and return the first element in the RDD
    (1) Create an RDD
    scala> val rdd = sc.parallelize(1 to 10)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
    (2) Return the first element of the RDD
    scala> rdd.first
    res2: Int = 1

2.4.5 take(n) case

  1. Function: Returns an array consisting of the first n elements of the RDD
  2. Requirement: Create an RDD and return an array of its first 3 elements
    (1) Create an RDD
    scala> val rdd = sc.parallelize(Array(2,5,4,6,8,3))
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24
    (2) Take the first 3 elements of the RDD
    scala> rdd.take(3)
    res10: Array[Int] = Array(2, 5, 4)

2.4.6 takeOrdered(n) case

  1. Function: Returns an array composed of the first n elements after sorting the RDD
  2. Requirement: Create an RDD and return the first 3 elements after sorting
    (1) Create an RDD
    scala> val rdd = sc.parallelize(Array(2,5,4,6,8,3))
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24
    (2) Take the first 3 elements of the sorted RDD
    scala> rdd.takeOrdered(3)
    res18: Array[Int] = Array(2, 3, 4)

2.4.7 aggregate case (merging within and between partitions)

  1. 参数:(zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)
  2. Function: The aggregate function aggregates the elements in each partition through seqOp and the initial value, and then uses the combine function to combine the results of each partition with the initial value (zeroValue). The final type returned by this function does not need to be consistent with the element type in the RDD.
  3. Requirements: Create an RDD and add all elements to get the result
    (1) Create an RDD
    scala> var rdd1 = sc.makeRDD(1 to 10,2)
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[88] at makeRDD at <console>:24
    (2) Add all the elements of the RDD to get the result
    scala> rdd1.aggregate(0)(_+_, _+_)
    res22: Int = 55
    scala> rdd1.aggregate(1)(_+_, _+_)
    res20: Int = 58
    scala> rdd1.aggregate(2)(_+_, _+_)
    res21: Int = 61

2.4.8 fold(num)(func) case

  1. Function: folding operation, simplified operation of aggregate, seqop and combop are the same.
  2. Requirements: Create an RDD and add all elements to get the result
    (1) Create an RDD
    scala> var rdd1 = sc.makeRDD(1 to 10,2)
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[88] at makeRDD at <console>:24
    (2) Add all the elements of the RDD to get the result
    scala> rdd1.fold(0)(_+_)
    res24: Int = 55
    scala> rdd1.fold(1)(_+_)
    res23: Int = 58
    scala> rdd1.fold(2)(_+_)
    res24: Int = 61

2.4.9 saveAsTextFile(path)

Function: Save the elements of the data set to the HDFS file system or other supported file systems in the form of textfile. For each element, Spark will call the toString method to convert it to text in the file.

2.4.10 saveAsSequenceFile(path)

Function: Save the elements in the data set to the specified directory in the format of Hadoop sequencefile, which can be used in HDFS or other Hadoop-supported file systems.

2.4.11 saveAsObjectFile(path)

Function: Used to serialize elements in RDD into objects and store them in files.
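A minimal sketch of the three save actions (the file:// output paths are examples; each output directory must not already exist):

val data = sc.parallelize(List(("a", 1), ("b", 2)))
data.saveAsTextFile("file:///tmp/out_text")      // each element written via toString
data.saveAsSequenceFile("file:///tmp/out_seq")   // Hadoop SequenceFile format, for key-value RDDs
data.saveAsObjectFile("file:///tmp/out_obj")     // elements serialized as Java objects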

2.4.12 countByKey() case

  1. Function: For RDD of type (K, V), return a map of (K, Int), indicating the number of elements corresponding to each key.
  2. Requirements: Create a PairRDD and count the number of each key
    (1) Create a PairRDD
    scala> val rdd = sc.parallelize(List((1,3),(1,2),(1,4),(2,3),(3,6),(3,8)),3)
    rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[95] at parallelize at <console>:24
    (2) Count the number of elements for each key
    scala> rdd.countByKey
    res63: scala.collection.Map[Int,Long] = Map(3 -> 2, 1 -> 3, 2 -> 1)

2.4.13 foreach(func) case

  1. Function: Runs the function func on each element of the data set, typically for side effects such as updating an accumulator or writing to external storage.
  2. Requirements: Create an RDD and print each element
    (1) Create an RDD
    scala> var rdd = sc.makeRDD(1 to 5,2)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[107] at makeRDD at <console>:24
    (2) Print each element of the RDD
    scala> rdd.foreach(println(_))
    3
    4
    5
    1
    2

2.5 Function transfer in RDD

In real development we often need to define our own operations on RDDs. The key point is that initialization happens on the Driver side, while the actual program runs on the Executor side; this involves cross-process communication and therefore requires serialization. Consider the following examples:

2.5.1 Passing a method

1. Create a class

class Search(query: String) {

  // Return true if the string contains the query
  def isMatch(s: String): Boolean = {
    s.contains(query)
  }

  // Filter out the elements of the RDD that contain the query (passes a method)
  def getMatche1(rdd: RDD[String]): RDD[String] = {
    rdd.filter(isMatch)
  }

  // Filter out the elements of the RDD that contain the query (references a field in a closure)
  def getMatche2(rdd: RDD[String]): RDD[String] = {
    rdd.filter(x => x.contains(query))
  }

}

2. Create Spark main program

object SeriTest {

  def main(args: Array[String]): Unit = {

    // 1. Initialize the configuration and the SparkContext
    val sparkConf: SparkConf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(sparkConf)

    // 2. Create an RDD
    val rdd: RDD[String] = sc.parallelize(Array("hadoop", "spark", "hive", "wxn"))

    // 3. Create a Search object (the query string "h" is just an example)
    val search = new Search("h")

    // 4. Apply the first filter function and print the result
    val match1: RDD[String] = search.getMatche1(rdd)
    match1.collect().foreach(println)
  }
}

3. Run program

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2101)
    at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:387)
    at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:386)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.filter(RDD.scala:386)
    at com.wxn.Search.getMatche1(SeriTest.scala:39)
    at com.wxn.SeriTest$.main(SeriTest.scala:18)
    at com.wxn.SeriTest.main(SeriTest.scala)
Caused by: java.io.NotSerializableException: com.wxn.Search

4. Problem description
//Filter out the elements of the RDD that contain the query
def getMatche1(rdd: RDD[String]): RDD[String] = { rdd.filter(isMatch) }
The method isMatch() called here is defined in the Search class, so the call is actually this.isMatch(), where this is an instance of Search. To run the filter, the program has to serialize the Search object and send it to the Executors, but Search is not serializable.


5. Solution:
Make the class extend scala.Serializable:
class Search(query: String) extends Serializable {…}

2.5.2 Passing an attribute

1. Create Spark main program

object TransmitTest {

  def main(args: Array[String]): Unit = {

    // 1. Initialize the configuration and the SparkContext
    val sparkConf: SparkConf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(sparkConf)

    // 2. Create an RDD
    val rdd: RDD[String] = sc.parallelize(Array("hadoop", "spark", "hive", "wxn"))

    // 3. Create a Search object (the query string "h" is just an example)
    val search = new Search("h")

    // 4. Apply the second filter function and print the result
    val match1: RDD[String] = search.getMatche2(rdd)
    match1.collect().foreach(println)
  }
}

2. Run program

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2101)
    at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:387)
    at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:386)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.filter(RDD.scala:386)
    at com.wxn.Search.getMatche1(SeriTest.scala:39)
    at com.wxn.SeriTest$.main(SeriTest.scala:18)
    at com.wxn.SeriTest.main(SeriTest.scala)
Caused by: java.io.NotSerializableException: com.wxn.Search

3. Problem description
//Filter out the elements of the RDD that contain the query
def getMatche2(rdd: RDD[String]): RDD[String] = { rdd.filter(x => x.contains(query)) }
The field query used here is defined in the Search class, so the reference is actually this.query, where this is an instance of Search. The program therefore has to serialize the Search object and send it to the Executors, but Search is not serializable.


4. Solution
1) Make the class extend scala.Serializable:
class Search(query: String) extends Serializable {…}
2) Assign the class field query to a local variable.
Modify getMatche2 to:
//Filter out the elements of the RDD that contain the query
def getMatche2(rdd: RDD[String]): RDD[String] = {
  val query_ : String = this.query  // assign the class field to a local variable
  rdd.filter(x => x.contains(query_))
}


2.6 RDD dependencies

2.6.1 Lineage

RDDs support only coarse-grained transformations, i.e. single operations applied to a large number of records. To be able to recover lost partitions, an RDD records the Lineage used to create it: its metadata and the transformations applied. When some of an RDD's partition data is lost, the lost partitions can be recomputed from this information.
(1) Read an HDFS file and map the contents into tuples

scala> val wordAndOne = sc.textFile("/fruit.tsv").flatMap(_.split("\t")).map((_,1))
wordAndOne: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[22] at map at <console>:24

(2) Count the number corresponding to each key

scala> val wordAndCount = wordAndOne.reduceByKey(_+_)
wordAndCount: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[23] at reduceByKey at <console>:26

(3) View the Lineage of “wordAndOne”

scala> wordAndOne.toDebugString
res5: String =
(2) MapPartitionsRDD[22] at map at <console>:24 []
 |  MapPartitionsRDD[21] at flatMap at <console>:24 []
 |  /fruit.tsv MapPartitionsRDD[20] at textFile at <console>:24 []
 |  /fruit.tsv HadoopRDD[19] at textFile at <console>:24 []

(4) View the Lineage of “wordAndCount”

scala> wordAndCount.toDebugString
res6: String =
(2) ShuffledRDD[23] at reduceByKey at <console>:26 []
 +-(2) MapPartitionsRDD[22] at map at <console>:24 []
    |  MapPartitionsRDD[21] at flatMap at <console>:24 []
    |  /fruit.tsv MapPartitionsRDD[20] at textFile at <console>:24 []
    |  /fruit.tsv HadoopRDD[19] at textFile at <console>:24 []

(5) Check the dependency type of "wordAndOne"

scala> wordAndOne.dependencies
res7: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.OneToOneDependency@5d5db92b)

(6) Check the dependency type of "wordAndCount"

scala> wordAndCount.dependencies
res8: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.ShuffleDependency@63f3e6a8)

Note: There are two different types of relationships between an RDD and the parent RDD(s) it depends on, namely narrow dependency and wide dependency.

2.6.2 Narrow dependencies

Narrow dependency means that each partition of the parent RDD is used by at most one partition of the child RDD; it can be pictured as an "only child".


2.6.3 Wide dependencies

Wide dependency means that multiple partitions of the child RDD depend on the same partition of the parent RDD, which causes a shuffle. In contrast to the "only child" image of narrow dependency, wide dependency can be pictured as a parent partition with many children.

2.6.4 DAG

A DAG (Directed Acyclic Graph) is formed as the original RDDs go through a series of transformations. The DAG is divided into Stages according to the dependencies between RDDs. For narrow dependencies, the partition conversions are completed within a single Stage; for wide dependencies, because of the shuffle, the next computation can only start after the parent RDD has been fully processed, so wide dependencies are the boundaries used to divide Stages.


2.6.5 Task division

RDD tasks are divided into Application, Job, Stage and Task:
1) Application: initializing a SparkContext generates an Application
2) Job: each Action operator generates a Job
3) Stage: a Job is divided into different Stages according to the dependencies between RDDs; a new Stage is created whenever a wide dependency is encountered.

4) Task: a Stage corresponds to a TaskSet; the units of work of a Stage that are sent to different Executors for execution are Tasks.
Note: Each layer of Application->Job->Stage->Task has a 1 to n relationship.
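A small sketch of these relations (words.txt is a hypothetical input; the counts are what the Spark UI would typically show for this lineage):

val words = sc.textFile("words.txt")
val counts = words.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.count()     // 1st Action -> Job 0; the wide dependency of reduceByKey splits it into 2 Stages
counts.collect()   // 2nd Action -> Job 1, with its own Stages and one Task per partition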

2.7 RDD caching

An RDD can cache its computation result through the persist or cache method. By default, persist() stores the data as deserialized objects in the JVM heap (storage level MEMORY_ONLY).
However, these two methods are not cached immediately when called. Instead, when subsequent actions are triggered, the RDD will be cached in the memory of the computing node and reused later.

Looking at the source code, cache ultimately calls persist with the default storage level, which keeps a single copy in memory. Spark defines many storage levels; they are declared in object StorageLevel.

Appending "_2" to a storage level name stores the persisted data in two replicas.

The cache may be lost, or data stored in memory may be evicted due to memory pressure. RDD's cache fault tolerance guarantees that the computation still runs correctly even if the cache is lost: the lost data is recomputed through the RDD's transformations. Since RDD partitions are relatively independent, only the missing partitions need to be recomputed, not all of them.
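A minimal persist sketch with an explicit storage level (two replicas, spilling to disk when memory is short):

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000)
rdd.persist(StorageLevel.MEMORY_AND_DISK_2)   // keep two replicas, falling back to disk if needed
rdd.count()                                   // the first action materializes the persisted data
rdd.unpersist()                               // remove it from the cache when no longer needed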
(1) Create an RDD
scala> val rdd = sc.makeRDD(Array("wxn"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[19] at makeRDD at <console>:25
(2) Map the RDD to append the current timestamp, without caching
scala> val nocache = rdd.map(_.toString+System.currentTimeMillis)
nocache: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[20] at map at <console>:27
(3) Print results multiple times

scala> nocache.collect
res0: Array[String] = Array(wxn1538978275359)

scala> nocache.collect
res1: Array[String] = Array(wxn1538978282416)

scala> nocache.collect
res2: Array[String] = Array(wxn1538978283199)

(4) Convert the RDD so that each element carries the current timestamp, and cache it
scala> val cache = rdd.map(_.toString+System.currentTimeMillis).cache
cache: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[21] at map at <console>:27
(5) Print the cached results multiple times

scala> cache.collect
res3: Array[String] = Array(wxn1538978435705)                                   

scala> cache.collect
res4: Array[String] = Array(wxn1538978435705)

scala> cache.collect
res5: Array[String] = Array(wxn1538978435705)

2.8 RDD CheckPoint

In addition to persistence, Spark provides a checkpoint mechanism for data storage. A checkpoint (essentially writing the RDD to disk) assists lineage-based fault tolerance: when the lineage is too long, the cost of recovering from a failure becomes too high, so it is better to checkpoint at an intermediate stage. If a node later fails and a partition is lost, the lineage is replayed only from the checkpointed RDD, which reduces the cost. Checkpointing implements this by writing the data to the HDFS file system.
Setting a checkpoint for the current RDD creates a binary file stored in the checkpoint directory, which is set with SparkContext.setCheckpointDir(). During checkpointing, all information about the RDD's dependence on its parent RDDs is removed. The checkpoint operation is not executed immediately; an Action must be executed to trigger it.

Case practice:
(1) Set the checkpoint directory
scala> sc.setCheckpointDir("hdfs://hadoop102:9000/checkpoint")
(2) Create an RDD
scala> val rdd = sc.parallelize(Array("wxn"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[14] at parallelize at <console>:24
(3) Convert the RDD to carry the current timestamp and checkpoint it
scala> val ch = rdd.map(_+System.currentTimeMillis)
ch: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[16] at map at <console>:26

scala> ch.checkpoint
(4) Print results multiple times (the first collect also triggers the checkpoint job, which recomputes the RDD; later reads come from the checkpoint, so the result stabilizes from the second collect onward)

scala> ch.collect
res55: Array[String] = Array(wxn1538981860336)

scala> ch.collect
res56: Array[String] = Array(wxn1538981860504)

scala> ch.collect
res57: Array[String] = Array(wxn1538981860504)

scala> ch.collect
res58: Array[String] = Array(wxn1538981860504)

3. Key-value pair RDD data partitioning

Spark currently supports Hash partitioning and Range partitioning, and users can also define custom partitioners; Hash partitioning is the default. The partitioner directly determines the number of partitions of the RDD, which partition each record ends up in after a Shuffle, and the number of Reduce tasks.
Note:
(1) Only Key-Value RDDs have a partitioner; for non-Key-Value RDDs the partitioner is None.
(2) The partition IDs of an RDD range from 0 to numPartitions-1, and the ID determines which partition a value belongs to.

3.1 Get RDD partition

You can get the partitioning method of an RDD by using the partitioner attribute of the RDD. It will return a scala.Option object, the value of which can be obtained through the get method. The relevant source code is as follows:

def getPartition(key: Any): Int = key match {
  case null => 0
  case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
}

def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}

(1) Create a pairRDD
scala> val pairs = sc.parallelize(List((1,1),(2,2),(3,3)))
pairs: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[3] at parallelize at <console>:24
(2) View the RDD partitioner
scala> pairs.partitioner
res1: Option[org.apache.spark.Partitioner] = None
(3) Import the HashPartitioner class
scala> import org.apache.spark.HashPartitioner
import org.apache.spark.HashPartitioner
(4) Use HashPartitioner to repartition the RDD
scala> val partitioned = pairs.partitionBy(new HashPartitioner(2))
partitioned: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[4] at partitionBy at <console>:27
(5) View the partitioner of the repartitioned RDD
scala> partitioned.partitioner
res2: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@2)

3.2 Hash partition

The principle of HashPartitioner: for a given key, compute its hashCode and take the remainder modulo the number of partitions; if the remainder is less than 0, add the number of partitions (otherwise add 0). The resulting value is the partition ID the key belongs to.
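A worked example of this rule, reusing the nonNegativeMod logic shown in 3.1 (the numbers are made up for illustration):

scala> val numPartitions = 7
scala> val key = -5                                                      // Int.hashCode is the value itself
scala> val rawMod = key.hashCode % numPartitions                         // -5 % 7 = -5
scala> val partition = rawMod + (if (rawMod < 0) numPartitions else 0)   // -5 + 7 = 2, so key -5 goes to partition 2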
Practical operation of using Hash partition

scala> val nopar = sc.parallelize(List((1,3),(1,2),(2,4),(2,3),(3,6),(3,8)),8)
nopar: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[10] at parallelize at <console>:24

scala> nopar.partitioner
res20: Option[org.apache.spark.Partitioner] = None


scala> nopar.mapPartitionsWithIndex((index,iter)=>{ Iterator(index.toString+" : "+iter.mkString("|")) }).collect
res0: Array[String] = Array("0 : ", 1 : (1,3), 2 : (1,2), 3 : (2,4), "4 : ", 5 : (2,3), 6 : (3,6), 7 : (3,8))
 
scala> val hashpar = nopar.partitionBy(new org.apache.spark.HashPartitioner(7))
hashpar: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[12] at partitionBy at <console>:26

scala> hashpar.count
res18: Long = 6

scala> hashpar.partitioner
res21: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@7)

scala> hashpar.mapPartitions(iter => Iterator(iter.length)).collect()
res19: Array[Int] = Array(0, 3, 1, 2, 0, 0, 0)

3.3 Range partition

The drawback of HashPartitioner is that it may lead to an uneven amount of data in each partition; in the extreme case, some partitions hold all of the RDD's data.
RangePartitioner maps keys within a certain range to a given partition. It tries to keep the amount of data in each partition uniform, and the partitions themselves are ordered: every element in one partition is smaller (or larger) than every element in the next, although order within a partition is not guaranteed. The implementation works in two steps:
Step 1: sample data from the whole RDD, sort the sample, compute the maximum key of each partition, and build an Array[KEY] variable called rangeBounds;
Step 2: determine which range in rangeBounds a key falls into, which gives the partition ID of that key in the child RDD. This partitioner requires that the KEY type of the RDD be sortable.
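A small sketch of using RangePartitioner directly (the data is made up; RangePartitioner samples the given RDD to build rangeBounds):

scala> import org.apache.spark.RangePartitioner
scala> val pairs = sc.parallelize(List((9,9),(1,1),(6,6),(2,2),(4,4),(3,3)))
scala> val ranged = pairs.partitionBy(new RangePartitioner(3, pairs))
scala> ranged.mapPartitionsWithIndex((idx, it) => Iterator(idx + " : " + it.mkString("|"))).collect

Every key in partition 0 should be smaller than every key in partition 1, and so on.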

3.4 Custom partition

To implement a custom partitioner, you need to inherit the org.apache.spark.Partitioner class and implement the following three methods.
(1) numPartitions: Int: Returns the number of partitions created.
(2) getPartition(key: Any): Int: Returns the partition number of the given key (0 to numPartitions-1).
(3) equals(): Java's standard method for determining equality. Implementing it matters because Spark uses it to check whether your partitioner object equals other partitioner instances, which is how it can tell whether two RDDs are partitioned in the same way.
Requirement: Write data with the same suffix to the same file. This is achieved by partitioning the data with the same suffix into the same partition and saving the output.
(1) Create a pairRDD
scala> val data = sc.parallelize(Array((1,1),(2,2),(3,3),(4,4),(5,5),(6,6)))
data: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[3] at parallelize at <console>:24
(2) Define a custom partition class

scala> :paste
// Entering paste mode (ctrl-D to finish)
class CustomerPartitioner(numParts:Int) extends org.apache.spark.Partitioner{

  // override the number of partitions
  override def numPartitions: Int = numParts

  // override the function that returns the partition number for a key
  override def getPartition(key: Any): Int = {
    val ckey: String = key.toString
    ckey.substring(ckey.length-1).toInt%numParts
  }
}

// Exiting paste mode, now interpreting.

defined class CustomerPartitioner

(3) Repartition the RDD using a custom partition class
scala> val par = data.partitionBy(new CustomerPartitioner(2))
par: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[2] at partitionBy at <console>:27
(4) View the data distribution after repartition
scala> par.mapPartitionsWithIndex((index,items)=>items.map((index,_))).collect
res3: Array[(Int, (Int, Int))] = Array((0,(2,2)), (0,(4,4)), (0,(6,6)), (1,(1,1)), (1,(3,3)), (1,(5,5)))
Using a custom Partitioner is easy: just pass it to the partitionBy() method. There are many methods in Spark that rely on data shuffling, such as join() and groupByKey(), and they can also receive an optional Partitioner object to control how the output data is partitioned.
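Note that the CustomerPartitioner above does not override equals() or hashCode() as recommended in (3). A fuller sketch (the class name SuffixPartitioner is just illustrative) might look like this:

class SuffixPartitioner(numParts: Int) extends org.apache.spark.Partitioner {
  override def numPartitions: Int = numParts

  override def getPartition(key: Any): Int = {
    val s = key.toString
    s.substring(s.length - 1).toInt % numParts
  }

  // lets Spark tell whether two RDDs were partitioned in the same way, avoiding unnecessary shuffles
  override def equals(other: Any): Boolean = other match {
    case p: SuffixPartitioner => p.numPartitions == numPartitions
    case _ => false
  }

  override def hashCode: Int = numPartitions
}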

4. Data reading and saving

Spark's data reading and saving can be distinguished along two dimensions: file format and file system.

File formats include: text files, JSON files, CSV files, Sequence files and Object files;
file systems include: the local file system, HDFS, HBase and databases.

4.1 Reading and saving file data

4.1.1 Text file

1) Data reading: textFile(String)

scala> val hdfsFile = sc.textFile("hdfs://hadoop102:9000/fruit.txt")
hdfsFile: org.apache.spark.rdd.RDD[String] = hdfs://hadoop102:9000/fruit.txt MapPartitionsRDD[21] at textFile at <console>:24

2) Data saving: saveAsTextFile(String)
scala> hdfsFile.saveAsTextFile("/fruitOut")

4.1.2 Json file

If each line in the JSON file is a JSON record, you can read the JSON file as a text file, and then use the relevant JSON library to perform JSON parsing on each piece of data.
Note: Using RDD to read JSON files is very complicated. At the same time, SparkSQL integrates a good way to process JSON files, so SparkSQL is mostly used to process JSON files in applications.
(1) Import the package required to parse JSON
scala> import scala.util.parsing.json.JSON
(2) Upload the json file to HDFS
[wxn@hadoop102 spark]$ hadoop fs -put ./examples/src/main/resources/people.json /
(3) Read the file
scala> val json = sc.textFile("/people.json")
json: org.apache.spark.rdd.RDD[String] = /people.json MapPartitionsRDD[8] at textFile at <console>:24
(4) Parse json data
scala> val result = json.map(JSON.parseFull)
result: org.apache.spark.rdd.RDD[Option[Any]] = MapPartitionsRDD[10] at map at <console>:27
(5) Print
scala> result.collect
res11: Array[Option[Any]] = Array(Some(Map(name -> Michael)), Some(Map(name -> Andy, age -> 30.0)), Some(Map(name -> Justin, age -> 19.0)))
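As mentioned above, SparkSQL handles JSON more directly. A one-line sketch, assuming a Spark 2.x shell where the spark session is available:

scala> spark.read.json("/people.json").show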

4.1.3 Sequence file

A SequenceFile is a flat file designed by Hadoop to store key-value pairs in binary form. Spark has a dedicated interface for reading SequenceFiles: on the SparkContext you can call sequenceFile[keyClass, valueClass](path).
Note: SequenceFiles only apply to PairRDDs.
(1) Create an RDD
scala> val rdd = sc.parallelize(Array((1,2),(3,4),(5,6)))
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[13] at parallelize at <console>:24
(2) Save the RDD as a Sequence file
scala> rdd.saveAsSequenceFile("file:///opt/module/spark/seqFile")
(3) View the file
[wxn@hadoop102 seqFile]$ pwd
/opt/module/spark/seqFile

[wxn@hadoop102 seqFile]$ ll
total 8
-rw-r--r-- 1 wxn wxn 108 Oct 9 10:29 part-00000
-rw-r--r-- 1 wxn wxn 124 Oct 9 10:29 part-00001
-rw-r--r-- 1 wxn wxn   0 Oct 9 10:29 _SUCCESS

[wxn@hadoop102 seqFile]$ cat part-00000
SEQ org.apache.hadoop.io.IntWritable org.apache.hadoop.io.IntWritableط
(4) Read the Sequence file
scala> val seq = sc.sequenceFile[Int,Int]("file:///opt/module/spark/seqFile")
seq: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[18] at sequenceFile at <console>:24
(5) Print the Sequence file that was read
scala> seq.collect
res14: Array[(Int, Int)] = Array((1,2), (3,4), (5,6))

4.1.4 Object files

Object files are files saved after serializing objects, using Java's serialization mechanism. You can pass a path to the objectFile[T](path) function to read an object file and get back the corresponding RDD, and you can write one by calling saveAsObjectFile(). Because serialization is involved, the type must be specified.
(1) Create an RDD
scala> val rdd = sc.parallelize(Array(1,2,3,4))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[19] at parallelize at <console>:24
(2) Save the RDD as an Object file
scala> rdd.saveAsObjectFile("file:///opt/module/spark/objectFile")
(3) View the file
[wxn@hadoop102 objectFile]$ pwd
/opt/module/spark/objectFile

[wxn@hadoop102 objectFile]$ ll
total 8
-rw-r--r-- 1 wxn wxn 142 Oct 9 10:37 part-00000
-rw-r--r-- 1 wxn wxn 142 Oct 9 10:37 part-00001
-rw-r--r-- 1 wxn wxn   0 Oct 9 10:37 _SUCCESS

[wxn@hadoop102 objectFile]$ cat part-00000
SEQ!org.apache.hadoop.io.NullWritable"org.apache.hadoop.io.BytesWritableW@`l
(4) Read the Object file
scala> val objFile = sc.objectFile[Int]("file:///opt/module/spark/objectFile")
objFile: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[31] at objectFile at <console>:24
(5) Print the Object file that was read
scala> objFile.collect
res19: Array[Int] = Array(1, 2, 3, 4)

4.2 Reading and saving file system data

4.2.1 HDFS

Spark's ecosystem is fully compatible with Hadoop, so Spark supports the file types and database types that Hadoop supports. In addition, since Hadoop's API has an old and a new version, Spark provides two sets of creation interfaces in order to be compatible with all Hadoop versions. For external storage, hadoopRDD and newAPIHadoopRDD are the two most abstract function interfaces, and they mainly take the following four parameters.
1) Input format (InputFormat): specifies the type of the input data, such as TextInputFormat; the old and new versions use org.apache.hadoop.mapred.InputFormat and org.apache.hadoop.mapreduce.InputFormat (NewInputFormat) respectively
2) Key type: specifies the type of K in the [K, V] key-value pair
3) Value type: specifies the type of V in the [K, V] key-value pair
4) Partition value: specifies the minimum number of partitions of the RDD generated from external storage; if not specified, the system uses the default value defaultMinSplits
Note: the other creation APIs exist for the convenience of Spark application developers and are efficient, specialized implementations of these two interfaces. For example, textFile only takes a path parameter; its other parameters have default values set internally by the system.
1. Data stored in Hadoop in compressed form can be read without specifying the decompression method, because Hadoop has decompressors that infer the decompression algorithm from the suffix of the compressed file.
2. If you use Spark to read a certain type of data from Hadoop and do not know how to read it, look up how that data is read with MapReduce, then adapt the corresponding reading method to the hadoopRDD and newAPIHadoopRDD interfaces above.
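As a sketch of this lower-level interface, the same text file read earlier with textFile could be loaded through newAPIHadoopFile, supplying the InputFormat, key and value types explicitly (paths and host names follow the earlier examples):

scala> import org.apache.hadoop.io.{LongWritable, Text}
scala> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
scala> val raw = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs://hadoop102:9000/fruit.txt")
scala> raw.map { case (_, line) => line.toString }.collect   // key is the byte offset, value is the line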

4.2.2 MySQL database connection

Spark supports accessing relational databases through Java JDBC. This is done with JdbcRDD; an example follows:
(1) Add dependencies

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.27</version>
</dependency>

(2) Mysql reading:

package com

import java.sql.DriverManager

import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.{SparkConf, SparkContext}

// Connect to a MySQL database through Spark and query it
object MysqlRDD {

  def main(args: Array[String]): Unit = {

    // 1. Create the Spark configuration and the SparkContext
    val sparkConf: SparkConf = new SparkConf().setAppName("MysqlRDD").setMaster("local[*]")
    val sc = new SparkContext(sparkConf)

    // 2. Define the JDBC connection parameters
    val driver = "com.mysql.jdbc.Driver"
    val url = "jdbc:mysql://hadoop102:3306/test"
    val userName = "root"
    val passWd = "123456"

    // 3. Create the JdbcRDD
    val rdd = new JdbcRDD(sc,
      () => {
        Class.forName(driver)
        DriverManager.getConnection(url, userName, passWd)
      },                                                    // JDBC connection factory
      "select * from rddtable where id >= ? and id <= ?;",  // SQL statement with two bound placeholders
      1,                                                     // lower bound
      10,                                                    // upper bound
      1,                                                     // number of partitions; the bound range is split evenly among them
      r => (r.getInt(1), r.getString(2))                     // map each JDBC result row
    )

    // 4. Print the results
    println(rdd.count())
    rdd.foreach(println)

    sc.stop()
  }
}

MySQL write:

def main(args: Array[String]) {
  val sparkConf = new SparkConf().setMaster("local[2]").setAppName("HBaseApp")
  val sc = new SparkContext(sparkConf)
  val data = sc.parallelize(List("Female", "Male", "Female"))

  data.foreachPartition(insertData)
}

def insertData(iterator: Iterator[String]): Unit = {
  Class.forName("com.mysql.jdbc.Driver").newInstance()
  val conn = java.sql.DriverManager.getConnection("jdbc:mysql://master01:3306/rdd", "root", "hive")
  iterator.foreach(data => {
    val ps = conn.prepareStatement("insert into rddtable(name) values (?)")
    ps.setString(1, data)
    ps.executeUpdate()
  })
}

4.2.3 HBase database

Thanks to the org.apache.hadoop.hbase.mapreduce.TableInputFormat class, Spark can access HBase through the Hadoop input format. This input format returns key-value pair data, where the key type is org.apache.hadoop.hbase.io.ImmutableBytesWritable and the value type is org.apache.hadoop.hbase.client.Result.
(1) Add dependencies

<dependency>
	<groupId>org.apache.hbase</groupId>
	<artifactId>hbase-server</artifactId>
	<version>1.3.1</version>
</dependency>

<dependency>
	<groupId>org.apache.hbase</groupId>
	<artifactId>hbase-client</artifactId>
	<version>1.3.1</version>
</dependency>

(2) Read data from HBase

package com.wxn

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.hbase.util.Bytes

object HBaseSpark {

  def main(args: Array[String]): Unit = {

    // Create the Spark configuration
    val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("JdbcRDD")

    // Create the SparkContext
    val sc = new SparkContext(sparkConf)

    // Build the HBase configuration
    val conf: Configuration = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "hadoop102,hadoop103,hadoop104")
    conf.set(TableInputFormat.INPUT_TABLE, "rddtable")

    // Read data from HBase into an RDD
    val hbaseRDD: RDD[(ImmutableBytesWritable, Result)] = sc.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    val count: Long = hbaseRDD.count()
    println(count)

    // Process the rows in hbaseRDD
    hbaseRDD.foreach {
      case (_, result) =>
        val key: String = Bytes.toString(result.getRow)
        val name: String = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name")))
        val color: String = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("color")))
        println("RowKey:" + key + ",Name:" + name + ",Color:" + color)
    }

    // Stop the SparkContext
    sc.stop()
  }
}

(3) Write data to HBase

def main(args: Array[String]) {
  // Get the Spark configuration and create the SparkContext
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("HBaseApp")
  val sc = new SparkContext(sparkConf)

  // Create the HBase configuration and the job configuration
  val conf = HBaseConfiguration.create()
  val jobConf = new JobConf(conf)
  jobConf.setOutputFormat(classOf[TableOutputFormat])
  jobConf.set(TableOutputFormat.OUTPUT_TABLE, "fruit_spark")

  // Build the HBase table descriptor
  val fruitTable = TableName.valueOf("fruit_spark")
  val tableDescr = new HTableDescriptor(fruitTable)
  tableDescr.addFamily(new HColumnDescriptor("info".getBytes))

  // Create the HBase table (recreate it if it already exists)
  val admin = new HBaseAdmin(conf)
  if (admin.tableExists(fruitTable)) {
    admin.disableTable(fruitTable)
    admin.deleteTable(fruitTable)
  }
  admin.createTable(tableDescr)

  // Define how a record is converted into an HBase Put
  def convert(triple: (Int, String, Int)) = {
    val put = new Put(Bytes.toBytes(triple._1))
    put.addImmutable(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes(triple._2))
    put.addImmutable(Bytes.toBytes("info"), Bytes.toBytes("price"), Bytes.toBytes(triple._3))
    (new ImmutableBytesWritable, put)
  }

  // Create an RDD
  val initialRDD = sc.parallelize(List((1,"apple",11), (2,"banana",12), (3,"pear",13)))

  // Write the RDD contents to HBase
  val localData = initialRDD.map(convert)
  localData.saveAsHadoopDataset(jobConf)
}

5. Advanced RDD programming

5.1 Accumulator

Accumulators are used to aggregate information. Normally, when you pass a function to Spark, for example with map() or a filter() condition, it can use variables defined in the driver program, but each task running in the cluster gets a new copy of those variables, and updating the copies does not change the variable in the driver. If we want shared variables that are updated as all partitions are processed, accumulators achieve that.

5.1.1 System Accumulator

For an input log file, if we want to count the number of all blank lines in the file, we can write the following program:

scala> val notice = sc.textFile("./NOTICE")
notice: org.apache.spark.rdd.RDD[String] = ./NOTICE MapPartitionsRDD[40] at textFile at <console>:32

scala> val blanklines = sc.accumulator(0)
warning: there were two deprecation warnings; re-run with -deprecation for details
blanklines: org.apache.spark.Accumulator[Int] = 0

scala> val tmp = notice.flatMap(line => {
     |    if (line == "") {
     |       blanklines += 1
     |    }
     |    line.split(" ")
     | })
tmp: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[41] at flatMap at <console>:36

scala> tmp.count()
res31: Long = 3213

scala> blanklines.value
res32: Int = 171

Accumulators are used as follows.
Create an accumulator with an initial value by calling SparkContext.accumulator(initialValue) in the driver. The return value is an org.apache.spark.Accumulator[T] object, where T is the type of initialValue. Executor code inside a Spark closure can increment the accumulator with its += method (add in Java). The driver program can read the accumulator's value through its value attribute (value() or setValue() in Java).
Note: tasks on worker nodes cannot access the accumulator's value; from the perspective of those tasks, the accumulator is a write-only variable.
For accumulators used in actions, Spark applies each task's update to an accumulator only once. Therefore, if we want an accumulator that is absolutely reliable regardless of failures or recomputation, we must put it in an action such as foreach(). In a transformation, an accumulator may be updated more than once.
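Since sc.accumulator is deprecated (hence the warning in the session above), on Spark 2.x the same count can be done with the built-in long accumulator, and putting the update inside an action follows the note above. A sketch:

scala> val blank = sc.longAccumulator("blankLines")
scala> sc.textFile("./NOTICE").foreach(line => if (line == "") blank.add(1))
scala> blank.value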

5.1.2 Custom accumulator

Custom accumulator types have been available since 1.x, but they were cumbersome to use. Since 2.0, usability has improved greatly, and the API provides a new abstract class, AccumulatorV2, for implementing custom-type accumulators in a friendlier way. To implement one, inherit AccumulatorV2 and at least override the methods shown in the example below. The following accumulator collects some text information while the program runs and finally returns it as a Set[String].

package com.wxn.spark

import org.apache.spark.util.AccumulatorV2
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.JavaConversions._

class LogAccumulator extends org.apache.spark.util.AccumulatorV2[String, java.util.Set[String]] {
  private val _logArray: java.util.Set[String] = new java.util.HashSet[String]()

  override def isZero: Boolean = {
    _logArray.isEmpty
  }

  override def reset(): Unit = {
    _logArray.clear()
  }

  override def add(v: String): Unit = {
    _logArray.add(v)
  }

  override def merge(other: org.apache.spark.util.AccumulatorV2[String, java.util.Set[String]]): Unit = {
    other match {
      case o: LogAccumulator => _logArray.addAll(o.value)
    }
  }

  override def value: java.util.Set[String] = {
    java.util.Collections.unmodifiableSet(_logArray)
  }

  override def copy(): org.apache.spark.util.AccumulatorV2[String, java.util.Set[String]] = {
    val newAcc = new LogAccumulator()
    _logArray.synchronized {
      newAcc._logArray.addAll(_logArray)
    }
    newAcc
  }
}

// Filter out the elements that contain letters, collecting them in the accumulator
object LogAccumulator {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("LogAccumulator")
    val sc = new SparkContext(conf)

    val accum = new LogAccumulator
    sc.register(accum, "logAccum")
    val sum = sc.parallelize(Array("1", "2a", "3", "4b", "5", "6", "7cd", "8", "9"), 2).filter(line => {
      val pattern = """^-?(\d+)"""
      val flag = line.matches(pattern)
      if (!flag) {
        accum.add(line)
      }
      flag
    }).map(_.toInt).reduce(_ + _)

    println("sum: " + sum)
    for (v <- accum.value) print(v + " ")
    println()
    sc.stop()
  }
}

5.2 Broadcast variables (tuning strategy)

Broadcast variables are used to distribute larger objects efficiently: they send a large read-only value to all worker nodes for use by one or more Spark operations. They are handy, for example, when your application needs to send a large read-only lookup table to all nodes, or a large feature vector in a machine learning algorithm. Without broadcasting, if the same variable is used in multiple parallel operations, Spark sends it separately for each task.
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(35)

scala> broadcastVar.value
res33: Array[Int] = Array(1, 2, 3)
The process of using broadcast variables is as follows:
(1) Create a Broadcast[T] object by calling SparkContext.broadcast on an object of type T. Any serializable type can be broadcast this way.
(2) Access the value of the object through the value attribute (value() method in Java).
(3) The variable will only be sent to each node once and should be treated as a read-only value (modifying this value will not affect other nodes).
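A typical use is to ship a small lookup table to every node once and reference it inside transformations instead of closing over the table directly. A sketch with made-up data:

scala> val cityNames = Map(1 -> "Beijing", 2 -> "Shanghai", 3 -> "Shenzhen")   // small read-only table
scala> val bcCities = sc.broadcast(cityNames)
scala> val orders = sc.parallelize(List((1, 100.0), (3, 250.0), (2, 80.0)))    // (cityId, amount)
scala> orders.map { case (cityId, amount) => (bcCities.value.getOrElse(cityId, "unknown"), amount) }.collect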

6. Extension

6.1 RDD-related conceptual relationships

Insert image description here
The input may be stored on HDFS as multiple files, and each file consists of many blocks (Blocks). When Spark reads these files as input, it parses them with the InputFormat that matches the data format. Usually several Blocks are merged into one input split, called an InputSplit; an InputSplit cannot span files. Concrete Tasks are then generated for these splits, with a one-to-one correspondence between InputSplits and Tasks. Each of these Tasks is then assigned to an Executor on some node of the cluster for execution.
1) Each node can have one or more Executors.
2) Each Executor is composed of several cores, and each core of each Executor can only execute one Task at a time.
3) The result of each Task execution is a partition of the target RDD.
Note: The core here is a virtual core rather than the physical CPU core of the machine. It can be understood as a working thread of the Executor. The concurrency of Task execution = number of Executors * number of cores of each Executor. As for the number of partitions:
1) In the data reading stage, e.g. sc.textFile, the input is divided into InputSplits, and each InputSplit requires one initial Task.
2) The number of partitions remains unchanged during the Map phase.
3) In the Reduce stage, the aggregation of RDD will trigger the shuffle operation. The number of partitions of the aggregated RDD is related to the specific operation. For example, the repartition operation will aggregate into a specified number of partitions, and some operators are configurable.
When an RDD is computed, each partition gets one Task, so the number of RDD partitions determines the total number of Tasks, while the number of Executors applied for and the number of cores per Executor determine how many Tasks can run in parallel at the same time.
For example, if an RDD has 100 partitions, 100 Tasks are generated when it is computed. With a resource allocation of 10 computing nodes with 2 cores each, 20 Tasks can run in parallel at a time, so computing this RDD takes 5 rounds. With the same resources and 101 Tasks, it takes 6 rounds, and in the last round only one Task runs while the remaining cores idle. If the resources stay the same but the RDD has only 2 partitions, only 2 Tasks run at a time and the other 18 cores idle, wasting resources. This is why, in Spark tuning, increasing the number of RDD partitions increases task parallelism.
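The arithmetic in the example can be written out directly (the resource numbers are the assumed ones from the paragraph above):

scala> val parallelism = 10 * 2                            // 10 executors * 2 cores = 20 tasks at a time
scala> val rounds = math.ceil(100.0 / parallelism).toInt   // 100 partitions -> 5 rounds
scala> val rounds2 = math.ceil(101.0 / parallelism).toInt  // 101 partitions -> 6 rounds
scala> // an RDD with only 2 partitions can be widened, e.g. rdd.repartition(20), so that all 20 cores do work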

Origin: blog.csdn.net/qq_44696532/article/details/135380720