Spark-Core core operators

1. Data source acquisition

1. Obtain from the collection

sc.parallelize(list)
sc.makeRDD(list)
sc.makeRDD(list, 2)
val list: List[Int] = List(1, 2, 3, 4, 5)
//  Create an RDD from the List
val rdd01: RDD[Int] = sc.parallelize(list)
//  makeRDD calls parallelize internally and reads data from the collection list
val rdd02: RDD[Int] = sc.makeRDD(list)
//  2: the number of partitions is 2
val rdd03: RDD[Int] = sc.makeRDD(list, 2)

2. Create from external storage system

//	Read from a file
sc.textFile("input/1.txt")
//  Whatever data the file stores, it is read in and handled as strings
val rdd04: RDD[String] = sc.textFile("input/1.txt")

3. Create from other RDDs

After other transformation steps complete, a new RDD object is generated.

//  Each String line is repeated twice (String * Int), producing a new RDD from rdd04
val rdd05: RDD[String] = rdd04.map(_ * 2)

4. Partition rules when loading data

Create from a collection

Create from a file
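
The split is easiest to see by tagging each element with the index of the partition it lands in. The following is a minimal sketch (not from the original article; variable names are mine), assuming the same `sc` used in the examples above; the output comment is what Spark's default collection slicing produces.

//  From a collection: elements are sliced into numSlices index ranges
val fromList: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5), 2)
fromList
  .mapPartitionsWithIndex((index, items) => items.map((index, _)))
  .collect()
  .foreach(println) //  (0,1) (0,2) (1,3) (1,4) (1,5)
//  From a file: partitions follow Hadoop input splits; the second argument is only a minimum
val fromFile: RDD[String] = sc.textFile("input/1.txt", 2)
println(fromFile.getNumPartitions)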

2. Transformation operators (Transformation)

//  1. Create a SparkConf and set the app name
val conf = new SparkConf().setAppName("SparkCore").setMaster("local[*]")
//  2. Create the SparkContext; this object is the entry point for submitting a Spark app
val sc = new SparkContext(conf)
//  3. Create an RDD
val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4))
//  4. The actual transformation steps
val rdd01: RDD[Int] = rdd.map(x => x * 20)
//  5. Print the result
println(rdd01.collect().toList)
//  6. Close the connection
sc.stop()

1. Value type

1.1 map()_mapping


//  4. The actual transformation step
val rdd01: RDD[Int] = rdd.map(x => x * 20)
//  The same map written with placeholder syntax
val rdd02: RDD[Int] = rdd01.map(_ * 20)

1.2 mapPartitions()

Executes map() one partition at a time.

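The article gives no example here, so the following is a minimal sketch (not from the original; variable names are mine), assuming the `sc` from the template above. The function receives the whole partition as an Iterator and must return an Iterator.

val rddMp: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4), 2)
//  One function call per partition; the elements are mapped through the partition iterator
val rddMp01: RDD[Int] = rddMp.mapPartitions(iter => iter.map(_ * 20))
//  List(20, 40, 60, 80)
println(rddMp01.collect().toList)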

1.3 mapPartitionsWithIndex (not commonly used)

  • The functions inside operate on each partition, and the function is executed as many times as there are partitions.
  • The first parameter of the function represents the partition number.
  • The second parameter of the function represents the partition data iterator.
  /**
   * @param f                     function of (partition index, partition data iterator)
   * @param preservesPartitioning whether to preserve the parent RDD's partitioner
   */
def mapPartitionsWithIndex[U: ClassTag](
    f: (Int, Iterator[T]) => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U] = withScope {
  // body omitted (Spark source)
}
//	Tag every element with the index of the partition it belongs to
rdd03.mapPartitionsWithIndex((index, items) => {
  items.map((index, _))
})
//	Same call, additionally declaring that the partitioning is preserved
rdd03.mapPartitionsWithIndex((index, items) => {
  items.map((index, _))
}, preservesPartitioning = true)

1.4 flatMap()_flattening

Flattening (merging streams)

Function Description

  • Similar to the map operation, each element in the RDD is converted into a new element in turn by applying the f function, and encapsulated into the RDD.
  • Difference: In the flatMap operation, the return value of the f function is a collection, and each element in the collection will be split out and placed in a new RDD.


def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  // body omitted (Spark source)
}
val rdd08: RDD[List[Int]] = sc.makeRDD(List(List(1, 2), List(3, 4), List(5, 6)), 2)
val rdd09: RDD[Int] = rdd08.flatMap(list => list)
//	List(1, 2, 3, 4, 5, 6)
println(rdd09.collect().toList)

1.5 groupBy()_grouping

group

Group by the return value of the passed-in function; values corresponding to the same key are put into the same iterator.


def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
  groupBy[K](f, defaultPartitioner(this))
}

Case:

// Group the data by the remainder of dividing by 2 and collect to the Driver for printing
rdd.groupBy((x) => {
  x % 2
})
// Simplified form
rdd.groupBy(_ % 2)
val rdd10: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)
//  (0,CompactBuffer(2, 4, 6, 8))
//  (1,CompactBuffer(1, 3, 5, 7, 9))
rdd10.groupBy(_ % 2).collect().foreach(println)
val rdd11: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5), 3)
//  Group numbers that are equal into the same group
//  (3,CompactBuffer(3))
//  (4,CompactBuffer(4))
//  (1,CompactBuffer(1))
//  (5,CompactBuffer(5))
//  (2,CompactBuffer(2))
rdd11.groupBy(a => a).collect().foreach(println)

1.6 filter()_filtering

filter

Takes a function that returns a Boolean as an argument. When an RDD calls the filter method, the f function is applied to each element in the RDD; if the return value is true, the element is added to the new RDD.


rdd11.filter(a => a % 2 == 0)
rdd11.filter(_ % 2 == 0)
val rdd11: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)
val rdd110: RDD[Int] = rdd11.filter(a => a % 2 == 0)
//  List(2, 4, 6, 8)
println(rdd110.collect().toList)

1.7 distinct()_ deduplication

Deduplication

  • Deduplicate the internal elements, and put the deduplicated elements into the new RDD.
  • By default, distinct will generate the same number of partitions as the original RDD partitions.
  • Deduplicating in a distributed way is less prone to OOM than deduplicating with a HashSet collection.


//	Deduplicate
rdd.distinct()
//	Deduplicate (parallelism of 2)
rdd.distinct(2)
val rdd12: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 2, 3), 3)
//  List(3, 4, 1, 2)
println(rdd12.distinct().collect().toList)
//  List(4, 2, 1, 3) (multiple tasks increase the read parallelism)
println(rdd12.distinct(2).collect().toList)

1.8 coalesce()_ merge partition

merge partition

  • The coalesce operator has two modes: with shuffle and without shuffle.
  • Reducing the number of partitions improves execution efficiency when a large data set has been filtered down to a small one.


rdd13.coalesce(2)
rdd14.coalesce(2, shuffle = true)

Shrink partitions, with or without executing Shuffle

val rdd13: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)
val rdd14: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)
//  Shrink to 2 partitions without shuffle
val rdd131: RDD[Int] = rdd13.coalesce(2)
//  Shrink to 2 partitions and execute a shuffle
val rdd141: RDD[Int] = rdd14.coalesce(2, shuffle = true)

1.9 repartition()_ repartition

repartition

  • Execute Shuffle.
  • This operation actually executes a coalesce operation internally, and the default value of the parameter shuffle is true.
  • Whether converting an RDD with many partitions into one with fewer partitions, or the other way round, repartition works, because it always goes through the shuffle process.
  • The result is not a plain hash partitioning: day-to-day partitioning is usually hash-based, and repartition is used precisely when the hash result is unsatisfactory and the data should be broken up and redistributed.


rdd.repartition(2)
val rdd15: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)
//  Repartition into 2 partitions
val rdd151: RDD[Int] = rdd15.repartition(2)

1.10 sortBy()_sorting

Sorting

  • This operation is used to sort data.
  • Before sorting, the data can be processed by the f function, and the sort is done on the f function's result; the default order is ascending.
  • The number of partitions of the newly generated RDD after sorting equals the number of partitions of the original RDD.
  • Supports both ascending and descending order.


//	Ascending
rdd.sortBy(num => num)
//	Descending
rdd.sortBy(num => num, ascending = false)

Case:

val rdd16: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)
//  Re-sort, ascending by default
val rdd161: RDD[Int] = rdd16.sortBy(num => num)
//  Re-sort, configured as descending
val rdd162: RDD[Int] = rdd16.sortBy(num => num, ascending = false)
val rdd17: RDD[(Int, Int)] = sc.makeRDD(List((1, 2), (3, 4), (5, 6)))
//  Sort by the first value ascending, then by the second value
val rdd171: RDD[(Int, Int)] = rdd17.sortBy(num => num)

1.11 The difference between map and mapPartitions


The difference between map and mapPartitions

  • Functions target different objects
    • The function of map is to operate on each element
    • The function of mapPartitions is for each partition operation
  • The return value of the function is different
    • The function of map operates on each element and requires returning a new element. The number of new RDD elements generated by map = the number of original RDD elements
    • The function of mapPartitions is for partition operations, and it is required to return the iterator of the new partition. The number of new RDD elements generated by mapPartitions is not necessarily equal to the number of original RDD elements.
  • The timing of element memory recovery is different
    • The map can be garbage collected after the operation of the elements is completed.
    • mapPartitions must wait until all the data in the partition data iterator is processed before unified garbage collection. If the partition data is relatively large, memory overflow may occur, and map can be used instead.
val rdd02: RDD[Int] = rdd01.mapPartitions(a => a.map(b => b * 2))
val rdd03: RDD[Int] = rdd02.mapPartitions(a => a.map(_ * 2))

1.12 The difference between coalesce and repartition

  • coalesce also repartitions; you can choose whether to perform the shuffle process, determined by the parameter shuffle: Boolean = false/true.
  • repartition is actually calling coalesce to perform shuffle. The source code is as follows:
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
    coalesce(numPartitions, shuffle = true)
}
  • coalesce is generally used to shrink partitions; expanding partitions without shuffle is meaningless. repartition expands partitions and executes a shuffle.
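
As a quick illustration (a sketch, not from the original article; the variable name rdd18 is mine), assuming the `sc` from the template above:

val rdd18: RDD[Int] = sc.makeRDD(1 to 8, 4)
//  Shrinking without shuffle merges partitions in place
println(rdd18.coalesce(2).getNumPartitions)    //  2
//  repartition(8) is coalesce(8, shuffle = true), so expanding works
println(rdd18.repartition(8).getNumPartitions) //  8
//  Expanding with coalesce but without shuffle has no effect
println(rdd18.coalesce(8).getNumPartitions)    //  still 4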

2. Double-Value type

2.1 intersection()_intersection

Intersection

  • Returns a new RDD after intersecting the source RDD and the parameter RDD.
  • Use the principle of shuffle to find the intersection, and all the data needs to be placed on the disk to shuffle, which is very inefficient
  • not recommended


rdd01.intersection(rdd02)
val rdd01: RDD[Int] = sc.makeRDD(1 to 4)
val rdd02: RDD[Int] = sc.makeRDD(4 to 8)
//	Take the intersection
//	Uses shuffle to compute the intersection; all data must be spilled to disk for the shuffle, so it is inefficient and not recommended
println(rdd01.intersection(rdd02).collect().toList)

2.2 union()_union without deduplication

union without deduplication

  • Returns a new RDD after the union of the source RDD and the parameter RDD
  • Since no shuffle is used, the efficiency is high.
  • Therefore it is commonly used.


rdd01.union(rdd02)
val rdd01: RDD[Int] = sc.makeRDD(1 to 4)
val rdd02: RDD[Int] = sc.makeRDD(4 to 8)
//	No shuffle, so it is efficient and commonly used
rdd01.union(rdd02).collect().foreach(println)

2.3 subtract()_difference set

difference set

  • Computes the difference: elements that also appear in the second RDD are removed, and the remaining elements of the first RDD are kept.
  • Also uses the shuffle principle: the data of both RDDs is written to the same location to compute the difference.
  • Requires a shuffle, so it is inefficient and not recommended.
  • From rdd01's data, the elements that differ from rdd02 remain: (1, 2, 3).


//	Compute the difference between the first RDD and the second RDD and print it
rdd01.subtract(rdd02)
val rdd01: RDD[Int] = sc.makeRDD(1 to 4)
val rdd02: RDD[Int] = sc.makeRDD(4 to 8)
// Also uses shuffle: the data of both RDDs is written to the same location to compute the difference
// Requires a shuffle, inefficient, not recommended
//	Elements of rdd01 that are not in rdd02: (1, 2, 3)
rdd01.subtract(rdd02).collect().foreach(println)

2.4 zip()_zipper

zipper

  • This operation can combine elements in two RDDs in the form of key-value pairs.
  • Among them, the Key in the key-value pair is the element in the first RDD, and the Value is the element in the second RDD.
  • Combines two RDDs into an RDD of Key/Value pairs. The two RDDs must have the same number of partitions and the same number of elements per partition, otherwise an exception is thrown.


val rdd01: RDD[Int] = sc.makeRDD(Array(1, 2, 3), 3)
val rdd02: RDD[String] = sc.makeRDD(Array("a", "b", "c"), 3)
//  List((1,a), (2,b), (3,c))
println(rdd01.zip(rdd02).collect().toList)
//  List((a,1), (b,2), (c,3))
println(rdd02.zip(rdd01).collect().toList)

Counter example:

val rdd02: RDD[String] = sc.makeRDD(Array("a", "b", "c"), 3)

val rdd03: RDD[String] = sc.makeRDD(Array("a", "b"), 3)
//  Different number of elements per partition: cannot zip
//  SparkException: Can only zip RDDs with same number of elements in each partition
println(rdd03.zip(rdd02).collect().toList)
val rdd04: RDD[String] = sc.makeRDD(Array("a", "b", "c"), 2)
//  Different number of partitions: cannot zip
//  java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions: List(2, 3)
println(rdd04.zip(rdd02).collect().toList)

3. Key-Value type

3.1 partitionBy()_Repartition according to K

Repartition by K

  • Repartition K in RDD[K,V] according to the specified Partitioner;
  • If the original partitioner and the new partitioner are consistent, no repartitioning is performed; otherwise a Shuffle occurs.
  • The number of partitions will change.


//	Repartition using hash, with 2 partitions after repartitioning
rdd01.partitionBy(new HashPartitioner(2))
val rdd01: RDD[(Int, String)] = sc.makeRDD(Array((111, "aaa"), (222, "bbbb"), (333, "ccccc")), 3)
val rdd02: RDD[(Int, String)] = rdd01.partitionBy(new HashPartitioner(2))

//	Print each element together with its new partition index
//	(0,(222,bbbb))
//	(1,(111,aaa))
//	(1,(333,ccccc))
val rdd03: RDD[(Int, (Int, String))] = rdd02.mapPartitionsWithIndex((index, datas) => {
  datas.map((index, _))
})
rdd03.collect().foreach(println)

3.2 groupByKey()_regroup according to K

Regroup by K

  • groupByKey operates on each key, but only generates a seq without aggregation.
  • This operation can specify the partitioner or the number of partitions (HashPartitioner is used by default).
  • The number of partitions will not change.


rdd001.groupByKey()
val rdd001: RDD[(String, Int)] = sc.makeRDD(List(("a", 1), ("b", 5), ("a", 5), ("b", 2)))
val rdd002: RDD[(String, Iterable[Int])] = rdd001.groupByKey()
//  (a,CompactBuffer(1, 5))
//  (b,CompactBuffer(5, 2))
rdd002.collect().foreach(println)

3.3 reduceByKey()_ aggregate V according to K

aggregate V according to K

  • This operation aggregates the V values in RDD[K,V] that share the same K.
  • There are many overloaded forms, and the number of partitions of the new RDD can also be set.


rdd01.reduceByKey((v1, v2) => (v1 + v2))
val rdd01: RDD[(String, Int)] = sc.makeRDD(List(("a", 1), ("b", 5), ("a", 5), ("b", 2)))
val rdd02: RDD[(String, Int)] = rdd01.reduceByKey((v1, v2) => (v1 + v2))
//  List((a,6), (b,7))
println(rdd02.collect().toList)

3.4 aggregateByKey()_ reduction of different logic

Reductions with different logic within and between partitions


//	zeroValue (initial value): gives every key in every partition an initial value;
//	seqOp (within a partition): function used to fold each value into the running result inside a partition, starting from the initial value;
//	combOp (between partitions): function used to merge the results of the partitions.
  def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    // body omitted (Spark source)
  }
//	Initial value per partition = 0; take the maximum within a partition, sum between partitions
rdd01.aggregateByKey(0)(math.max(_, _), _ + _)
val rdd01: RDD[(String, Int)] = sc.makeRDD(List(("a", 1), ("b", 5), ("a", 5), ("b", 2)))
//	Take the maximum value for each key within each partition, then add the maxima together
val rdd02: RDD[(String, Int)] = rdd01.aggregateByKey(0)(math.max(_, _), _ + _)
//	List((a,6), (b,7))
println(rdd02.collect().toList)

3.5 sortByKey()_ sort by K

Sort by K

  • Called on a (K, V) RDD, K must implement the Ordered interface, and return a (K, V) RDD sorted by key.


//  Ascending by key (the default)
rdd01.sortByKey(ascending = true)
//  Descending by key
rdd01.sortByKey(ascending = false)
val rdd01: RDD[(Int, String)] = sc.makeRDD(Array((3, "aa"), (6, "cc"), (2, "bb"), (1, "dd")))
//  Ascending by key (the default)
println(rdd01.sortByKey(ascending = true).collect().toList)
//  Descending by key
println(rdd01.sortByKey(ascending = false).collect().toList)

3.6 mapValues()_Only operate on V

Only operate on V

  • Only operate on V for types of the form (K, V)


rdd01.mapValues(_ + "|||")
val rdd01: RDD[(Int, String)] = sc.makeRDD(Array((1, "a"), (1, "d"), (2, "b"), (3, "c")))
//  Append the string ||| to each value
//  List((1,a|||), (1,d|||), (2,b|||), (3,c|||))
println(rdd01.mapValues(_ + "|||").collect().toList)

3.7 join()_ is equivalent to SQL inner join

join() is equivalent to an inner join in SQL: keys that match are kept, keys that do not match are discarded

  • Called on RDDs of type (K, V) and (K, W), returns a (K, (V, W)) RDD of all pairs of elements corresponding to the same key.
  • Similar to join (inline) in SQL


//  Inner join by key
rdd01.join(rdd02)
val rdd01: RDD[(Int, String)] = sc.makeRDD(Array((1, "a"), (2, "b"), (3, "c")))
val rdd02: RDD[(Int, Int)] = sc.makeRDD(Array((1, 4), (2, 5), (4, 6)))
//  Inner join by key
//  List((1,(a,4)), (2,(b,5)))
println(rdd01.join(rdd02).collect().toList)

3.8 cogroup()_similar to sql full join

cogroup() is similar to a full outer join in SQL, but it also aggregates the values of the same key within each RDD

  • Called on RDDs of type (K, V) and (K, W), returns an RDD of type (K, (Iterable, Iterable)).
  • Operate the KV elements in two RDDs, and aggregate the elements in the same key in each RDD into a set.
  • Takes the union of keys.


//  cogroup: combine two RDDs, taking the union of their keys
rdd01.cogroup(rdd02)
val rdd01: RDD[(Int, String)] = sc.makeRDD(Array((1, "a"), (2, "b"), (3, "c")))
val rdd02: RDD[(Int, Int)] = sc.makeRDD(Array((1, 4), (2, 5), (4, 6)))
//  cogroup the two RDDs and print
//  List((1,(CompactBuffer(a),CompactBuffer(4))), (2,(CompactBuffer(b),CompactBuffer(5)))
//  (3,(CompactBuffer(c),CompactBuffer())), (4,(CompactBuffer(),CompactBuffer(6))))
println(rdd01.cogroup(rdd02).collect().toList)

Result processing after cogroup

val rdd01: RDD[(Int, Int)] = sc.makeRDD(Array((1, 4), (2, 1), (3, 5)))
val rdd02: RDD[(Int, Int)] = sc.makeRDD(Array((1, 4), (2, 5), (4, 6)))
//	After cogroup the values are Iterables; call sum on each to add up the values for the same key
val value1: RDD[(Int, (Iterable[Int], Iterable[Int]))] = rdd01.cogroup(rdd02)
val value: RDD[(Int, (Int, Int))] = value1.mapValues(a => {
  (a._1.sum, a._2.sum)
})

3.9 Custom Partitioner

To implement a custom partitioner, you need to inherit the org.apache.spark.Partitioner class and implement the following three methods.

  1. numPartitions: Int: Returns the number of created partitions.
  2. getPartition(key: Any): Int: Returns the partition number (0 to numPartitions-1) for the given key.
  3. equals(): Java's standard method for judging equality. The implementation of this method is very important. Spark needs to use this method to check whether your partitioner object is the same as other partitioner instances, so that Spark can determine whether the partitioning methods of two RDDs are the same.
val rdd01: RDD[(Int, String)] = sc.makeRDD(Array((1, "a"), (2, "b"), (3, "c")))
val rdd02: RDD[(Int, String)] = rdd01.partitionBy(new MyParTition(2))
println(rdd02.collect().toList)

class MyParTition(num: Int) extends Partitioner {
  //  Set the number of partitions
  override def numPartitions: Int = num

  //  The concrete partitioning logic
  override def getPartition(key: Any): Int = {
    //  Pattern match: handle different key types with different logic
    //  String: goes to partition 0. Int: key modulo the number of partitions
    key match {
      case s: String => 0
      case i: Int => i % numPartitions
      case _ => 0
    }
  }
}

3.10 Difference between reduceByKey and groupByKey

  • reduceByKey: aggregate according to the key, there is a combine (pre-aggregation) operation before shuffle, and the returned result is RDD[K,V].
  • groupByKey: group by key and shuffle directly.
  • Development guidance: prefer reduceByKey as long as it does not affect the business logic. A summation is not affected by pre-aggregation, but an average is. Later we will learn a more powerful reduction operator that can compute an average while still pre-aggregating. A word-count sketch comparing the two operators follows this list.
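
A minimal word-count sketch (not from the original article; variable names are mine), assuming the `sc` from the template above. Both operators produce the same counts, but reduceByKey combines within each partition before the shuffle:

val words: RDD[(String, Int)] = sc.makeRDD(List("a", "b", "a", "c", "b", "a"), 2).map((_, 1))
//  reduceByKey: pre-aggregates inside each partition, then shuffles the partial sums
val byReduce: RDD[(String, Int)] = words.reduceByKey(_ + _)
//  groupByKey: shuffles every (word, 1) pair, then sums the grouped values
val byGroup: RDD[(String, Int)] = words.groupByKey().mapValues(_.sum)
//  Both print List((a,3), (b,2), (c,1)) (order may vary)
println(byReduce.collect().toList)
println(byGroup.collect().toList)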

3. Action operators (Action)

An action operator triggers the execution of the entire job, because transformation operators are lazily evaluated and do not execute immediately.
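
A minimal sketch of this laziness (not from the original article; variable names are mine), assuming the `sc` from the template above: nothing runs until an action such as collect() is called.

val lazyRdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4))
//  Only builds the lineage; no job is submitted and nothing is printed yet
val mapped: RDD[Int] = lazyRdd.map { x => println(s"mapping $x"); x * 2 }
//  The action triggers the job; the "mapping ..." lines are printed by the tasks now
println(mapped.collect().toList)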

1. collect()_ returns the data set in the form of an array

Return the dataset as an array

  • In the driver, all elements of the dataset are returned as an Array.


rdd02.collect().toList

2. count()_ returns the number of elements in the RDD

Returns the number of elements in the RDD


println(rdd01.count())

3. first()_ returns the first element in the RDD

Returns the first element in the RDD


println(rdd01.first())

4. take()_ returns an array consisting of the first n elements of RDD

Returns an array consisting of the first n elements of the RDD


//	Return an array of the first 3 elements
rdd01.take(3)
val number: Array[(Int, String)] = rdd01.take(3)

5. takeOrdered()_ returns the first n elements after sorting

Returns an array of the first n elements of the RDD sorted


// returns Array(2)
sc.parallelize(Seq(10, 4, 2, 12, 3)).takeOrdered(1)

// returns Array(2, 3)
sc.parallelize(Seq(2, 3, 4, 5, 6)).takeOrdered(2)

//  List(1, 2)
val rdd02: Array[Int] = sc.makeRDD(List(1, 3, 2, 4)).takeOrdered(2)
println(rdd02.toList)

6. countByKey()_ counts the number of each key

Count the number of each key


 rdd01.countByKey()
val rdd01: RDD[(Int, String)] = sc.makeRDD(List((1, "a"), (1, "a"), (1, "a"), (2, "b"), (3, "c"), (3, "c")))
val rdd02: collection.Map[Int, Long] = rdd01.countByKey()
//  Map(1 -> 3, 2 -> 1, 3 -> 2)
println(rdd02)

7. saveAsTextFile(path)_ save as Text file

Save as Text file

  • Save the elements of the dataset as a text file to the HDFS file system or another supported file system. For each element, Spark calls its toString method to convert it to a line of text in the file.


val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4), 2)
//	Save to a local text file
rdd.saveAsTextFile("output01")

8. saveAsSequenceFile(path)_ save as Sequencefile

Save as Sequencefile

  • Save the elements in the data set to the specified directory in the format of Hadoop Sequencefile, which can be HDFS or other file systems supported by Hadoop.
  • Only kv type RDD has this operation, single value does not
val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4), 2)
//	Save as a Sequencefile (only KV RDDs support this, so map to KV first)
rdd.map((_, 1)).saveAsSequenceFile("output02")

9. saveAsObjectFile(path)_serialized into an object and saved to a file

Serialize into an object and save it to a file

  • It is used to serialize the elements in RDD into objects and store them in files.
val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4), 2)
//	Serialize to objects and save to a file
rdd.map((_, 1)).saveAsObjectFile("output03")

10. foreach()_ traverses each element in RDD

Iterate over each element in the RDD


//  Collect to the Driver, then print
rdd.collect().foreach(println)
//  Distributed printing (on the executors)
rdd.foreach(println)


Origin blog.csdn.net/weixin_44624117/article/details/132653525