Using Spark to solve classic MapReduce problems: max/min, average, and topN

Guiding questions:

  1. How do you find the maximum and minimum values with Spark?
  2. How do you compute an average with Spark?
  3. How do you solve the topN problem with Spark?

Summary

Spark is an Apache project billed as "lightning-fast cluster computing." It has a thriving open-source community and is among the most active Apache projects. Spark provides a faster, more versatile data processing platform: compared with Hadoop, programs can run up to 100 times faster in memory, or 10 times faster on disk. At the same time, Spark makes traditional MapReduce-style jobs easier and faster to develop.

1. Maximum and minimum

Finding the maximum and minimum values has always been a classic Hadoop exercise. Here we implement it with Spark, borrowing the same idea used in a MapReduce job. Without further ado, here is the code:

@Test
  def testMaxMin(): Unit = {
    val sconf = new SparkConf().setAppName("test")
    val sc = new SparkContext(sconf)
    // initialize the test data
    val data = sc.parallelize(Array(10, 7, 3, 4, 5, 6, 7, 8, 1001, 6, 2))
    // method 1: map every value to the same key, group, and scan the group
    data.map(x => ("key", x)).groupByKey().map(x => {
      var min = Integer.MAX_VALUE
      var max = Integer.MIN_VALUE
      for (num <- x._2) {
        if (num > max) {
          max = num
        }
        if (num < min) {
          min = num
        }
      }
      (max, min)
    }).collect.foreach(x => {
      println("max\t" + x._1)
      println("min\t" + x._2)
    })

    // method 2: a sneakier way, just reduce with Math.max / Math.min
    val max = data.reduce((a, b) => Math.max(a, b))
    val min = data.reduce((a, b) => Math.min(a, b))
    println("max : " + max)
    println("min : " + min)
    sc.stop()
  }

Expected results:
max: 1001
min: 2
The idea is the same as in a Hadoop MR job: give every record the same key, put the value to be compared in the value field, and then aggregate everything together with groupByKey. The second method is simpler and performs better, because reduce does not need to gather all the values for one key in a single place.
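As an aside (not in the original post), the RDD API also ships max() and min() actions, which reduce the whole exercise to two calls. A minimal sketch, assuming the same data RDD as in testMaxMin above:

    // assumption: `data` is the RDD[Int] created in testMaxMin above
    // RDD.max() / RDD.min() use the implicit Ordering[Int] and run a reduce internally
    val maxValue = data.max()   // 1001
    val minValue = data.min()   // 2
    println("max : " + maxValue + ", min : " + minValue)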

2. The average value problem

Computing the average of the values for each key is another common case. When dealing with this kind of problem in Spark, the combineByKey function is usually used; for the detailed usage please google it, and take a look at the code below:

@Test
  def testAvg(): Unit = {
    val sconf = new SparkConf().setAppName("test")
    val sc = new SparkContext(sconf)
    // initialize the test data
    val foo = sc.parallelize(List(("a", 1), ("a", 3), ("b", 2), ("b", 8)))
    // combineByKey builds a (sum, count) pair per key; google it if you need the details
    val results = foo.combineByKey(
      (v) => (v, 1),
      (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
      (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
    ).map { case (key, value) => (key, value._1 / value._2.toDouble) }
    results.collect().foreach(println)
  }

Within each partition, combineByKey sums the values of each key and counts how many there are, producing a (sum, count) pair per key. During the shuffle, the (sum, count) pairs belonging to the same key are added together, and dividing the final sum by the final count gives the mean.
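The same result can also be obtained with reduceByKey by pairing every value with a count of 1 first (this variant is my addition, not from the original post). A minimal sketch, assuming the same foo RDD as in testAvg above:

    // assumption: `foo` is the RDD[(String, Int)] created in testAvg above
    val averages = foo
      .mapValues(v => (v, 1))                                    // (key, (value, 1))
      .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))         // (key, (sum, count))
      .mapValues { case (sum, count) => sum.toDouble / count }   // (key, mean)
    averages.collect().foreach(println)   // prints (a,2.0) and (b,5.0)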

3. TopN problem

@Test
  def testTopN(): Unit = {
    val sconf = new SparkConf().setAppName("test")
    val sc = new SparkContext(sconf)
    // initialize the test data
    val foo = sc.parallelize(Array(
      ("a", 1),
      ("a", 2),
      ("a", 3),
      ("b", 3),
      ("b", 1),
      ("a", 4),
      ("b", 4),
      ("b", 2)
    ))
    // for this test we take the top 2 values per key
    val groupsSort = foo.groupByKey().map(tu => {
      val key = tu._1
      val values = tu._2
      // sort descending and keep the two largest values
      val sortValues = values.toList.sortWith(_ > _).take(2)
      (key, sortValues)
    })
    // flatten the (key, List(values)) pairs for printing
    val flattenedTopNPerGroup =
      groupsSort.flatMap({ case (key, numbers) => numbers.map(key -> _) })
    flattenedTopNPerGroup.foreach((value: Any) => {
      println(value)
    })
    sc.stop()
  }

groupByKey groups the records by key, and for each group we then take the two largest values. Expected results:
(a,4)
(a,3)
(b,4)
(b,3)
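On large data sets, groupByKey can be expensive because it pulls every value for a key into memory. A leaner sketch (my addition, not from the original post) keeps only a running top-2 list per key with aggregateByKey, assuming the same foo RDD as in testTopN above:

    // assumption: `foo` is the RDD[(String, Int)] created in testTopN above
    val n = 2
    val topNPerKey = foo.aggregateByKey(List.empty[Int])(
      (acc, v) => (v :: acc).sortWith(_ > _).take(n),          // fold one value into a partition-local top-n list
      (acc1, acc2) => (acc1 ++ acc2).sortWith(_ > _).take(n)   // merge the per-partition top-n lists
    )
    topNPerKey.collect().foreach(println)   // prints (a,List(4, 3)) and (b,List(4, 3))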

Origin blog.csdn.net/qq_43147136/article/details/84544942