Chapter 2: RDD Programming Examples

One: Finding the Top N values

We have two input files. On each line, the first number is the line number, followed by three columns of data. Let's find the Top N values of the second data column.

(1) First, read the data and create an RDD
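A minimal sketch of this step (for example in spark-shell, where sc is already available), using the same HDFS path as the complete code below; adjust it to your own environment:

val lines = sc.textFile("hdfs://localhost:9000/user/local/spark/data", 2)  // read the input files into an RDD with 2 partitions
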
(2) Filter the data and take the second column of data.

We use filter() to filter the data: line.trim().length strips the whitespace from each line and then measures its length, which removes the empty lines. A line is valid if its trimmed length is greater than 0 and it can be split by commas into 4 fields. We then split each valid line and take out the second data column.
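A sketch of this step, reusing the `lines` RDD from step (1); the intermediate name `column` is only illustrative:

val column = lines.filter(line => line.trim.length > 0 && line.split(",").length == 4)  // keep non-empty lines with exactly 4 comma-separated fields
                  .map(_.split(",")(2))  // index 2 is the second data column (the first field is the line number)
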

(3) Convert the data type and turn the data into key-value pairs

Because we have no way to sort a bare column of values, we must turn the data into key-value pairs before we can sort it with the sortByKey() method. So we use .map(x => (x.toInt, "")) to change the original String data into key-value pairs of type (Int, String).
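Continuing the sketch, the String values become (Int, String) key-value pairs:

val pairs = column.map(x => (x.toInt, ""))  // key = the numeric value, value = an empty placeholder string
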

(4) Sort and extract the keys from the key-value pairs
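Sorting by key in descending order, keeping only the keys and taking the first five gives the Top 5 values; a sketch (the rank numbers are added in the complete code below):

pairs.sortByKey(false)  // sort by key in descending order
     .map(x => x._1)    // keep only the key
     .take(5)           // take the top five values
     .foreach(println)
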
(5) Complete code

import org.apache.spark.{SparkConf, SparkContext}

object TopN {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("TopN")  // build the conf object
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("ERROR")  // set the log level so that only errors are shown
    val lines = sc.textFile("hdfs://localhost:9000/user/local/spark/data", 2)
    var num = 0  // rank counter
    lines.filter(line => line.trim.length > 0 && line.split(",").length == 4)  // keep valid lines: non-empty and exactly 4 comma-separated fields
         .map(_.split(",")(2))     // split each line and take the second data column
         .map(x => (x.toInt, ""))  // convert to Int and wrap as a key-value pair
         .sortByKey(false)         // sort by key in descending order
         .map(x => x._1)           // keep only the key
         .take(5)                  // take the top five values
         .foreach(x => {           // print the results
           num = num + 1           // rank
           println(num + "\t" + x) // print "rank<TAB>value"
         })
  }
}

Two: Finding the maximum and minimum values


package ClassicCase

import org.apache.spark.{SparkConf, SparkContext}

object case5 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("reduce")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    val fifth = sc.textFile("hdfs://192.168.109.130:8020//user/flume/ClassicCase/case5/*", 2)
    // _.trim().length > 0 filters out the blank lines; line.trim.toInt also trims each line before converting it to an Int
    fifth.filter(_.trim().length > 0).map(line => ("key", line.trim.toInt)).groupByKey().map(x => {
      var min = Integer.MAX_VALUE
      var max = Integer.MIN_VALUE
      for (num <- x._2) {
        if (num > max) {
          max = num
        }
        if (num < min) {
          min = num
        }
      }
      (max, min)  // still inside the map(): wrap the result as a (max, min) tuple
    }).collect.foreach(x => {
      println("max\t" + x._1)
      println("min\t" + x._2)
    })
  }
}
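
As a side note, the same (max, min) result can be computed without collecting every value of the key into memory, for example with reduceByKey. This is only an alternative sketch, not the approach used above:

fifth.filter(_.trim().length > 0)
     .map(line => { val n = line.trim.toInt; ("key", (n, n)) })            // each value starts as its own (max, min) candidate
     .reduceByKey((a, b) => (math.max(a._1, b._1), math.min(a._2, b._2)))  // merge candidates pairwise
     .collect
     .foreach { case (_, (max, min)) => println("max\t" + max + "\nmin\t" + min) }
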


Three: File sorting

The task: read the integers scattered across the input files, sort them all in ascending order, and output each value preceded by its rank.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.HashPartitioner

object Filesort {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Filesort")
    val sc = new SparkContext(conf)
    val dataFile = "file:///D:/测试数据/Sort3file"
    val lines = sc.textFile(dataFile, 3)  // create an RDD with 3 partitions
    var index = 0
    val result = lines.filter(_.trim().length > 0)
      .map(x => (x.trim.toInt, ""))
      .partitionBy(new HashPartitioner(1))  // merge the 3 partitions back into 1, otherwise they cannot be numbered as one sorted sequence
      .sortByKey(true)
      .map(x => { index += 1; (index, x._1) })
    result.saveAsTextFile("file:///D:/输出结果/OUTSort3File")
  }
}
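
As a side note, the mutable `index` counter only works because the data has been merged into a single partition first; Spark's zipWithIndex can attach the rank without mutable state. A sketch using the `lines` RDD from the code above (the output path is illustrative):

val ranked = lines.filter(_.trim().length > 0)
  .map(x => (x.trim.toInt, ""))
  .sortByKey(true)                                  // global ascending sort
  .zipWithIndex()                                   // attach a 0-based position that follows the sort order
  .map { case ((value, _), i) => (i + 1, value) }   // (rank, value)
ranked.saveAsTextFile("file:///D:/输出结果/OUTSort3FileZip")  // illustrative output path
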

Four: Secondary sorting

The task: given lines of the form "x y", sort them by the first number in descending order, and by the second number (also in descending order) when the first numbers are equal.

import org.apache.spark.{SparkConf, SparkContext}

object SecondarySortApp extends App {
  val conf = new SparkConf().setMaster("local").setAppName("SecondarySortApp")
  val sc = new SparkContext(conf)
  val array = Array("8 3", "5 6", "5 3", "4 9", "4 7", "3 2", "1 6")
  val rdd = sc.parallelize(array)
  rdd.map(_.split(" "))
    .map(item => (item(0).toInt, item(1).toInt))
    .map(item => (new SecondarySortKey(item._1, item._2), s"${item._1} ${item._2}"))  // key = custom sort key, value = the original line
    .sortByKey(false)  // descending order as defined by SecondarySortKey
    .foreach(x => println(x._2))
}

class SecondarySortKey(val first: Int, val second: Int) extends Ordered[SecondarySortKey] with Serializable {
  // return a negative number if this key is smaller than `that`, a positive number if it is larger
  override def compare(that: SecondarySortKey): Int = {
    if (this.first - that.first != 0) {
      // the values in the first column differ
      this.first - that.first
    } else {
      // the values in the first column are equal: compare the second column
      this.second - that.second
    }
  }
}
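
Since Scala already provides a lexicographic Ordering for tuples, the same descending order on both columns can also be expressed without a custom key class. This is only an alternative sketch, reusing the `rdd` defined above:

rdd.map(_.split(" "))
   .map(item => (item(0).toInt, item(1).toInt))
   .sortBy(identity, ascending = false)  // lexicographic: first column, then second, both descending
   .foreach { case (a, b) => println(s"$a $b") }
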

Five: Join operation

Task: find the movies whose average user rating is greater than 4 (the average rating is the sum of all user ratings divided by the number of users).

Let's look at the contents of the two files: one file contains the movie ID and name, and the other contains the movie ID and the ratings from all users.

1. First, calculate the average rating of each movie
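A sketch of this step; the file path is illustrative, and the ratings file is assumed to be ::-separated with the movie ID in the second field and the score in the third, as in the complete code below:

val rating = sc.textFile("ratings.dat")  // illustrative path, passed in as args(0) in the complete code
  .map(line => {
    val fields = line.split("::")
    (fields(1).toInt, fields(2).toDouble)  // (movie ID, one user's score)
  })
val movieScores = rating.groupByKey()
  .map(data => (data._1, data._2.sum / data._2.size))  // (movie ID, average score)
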
2. Get the movie ID and movie name
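Similarly for the movie file; a sketch in which the path and the name movieNames are illustrative:

val movieNames = sc.textFile("movies.dat")  // illustrative path, passed in as args(1) in the complete code
  .map(line => {
    val fields = line.split("::")
    (fields(0).toInt, fields(1))  // (movie ID, movie name)
  })
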
3. Join by movie ID
The result we need is (ID, NAME, SCORE). If we join on the ID directly, the joined value only contains (NAME, SCORE) and the ID is lost, so we process the data once more: the .keyBy() method generates a new key while keeping the original tuple as the value.
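The join can then be sketched as follows; keyBy keeps the original tuples as the join values, so the ID, the score and the name all survive into the result:

val result = movieScores.keyBy(tup => tup._1)   // (ID, (ID, avgScore))
  .join(movieNames.keyBy(tup => tup._1))        // (ID, ((ID, avgScore), (ID, name)))
  .filter(f => f._2._1._2 > 4.0)                // keep movies whose average score is above 4
  .map(f => (f._1, f._2._1._2, f._2._2._2))     // (ID, average score, name)
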
4. Complete code

import org.apache.spark._
import SparkContext._

object SparkJoin {
  def main(args: Array[String]): Unit = {
    if (args.length != 3) {
      println("usage is SparkJoin <rating> <movie> <output>")
      return
    }

    // 1. First calculate the average score of each movie
    val conf = new SparkConf().setAppName("SparkJoin").setMaster("local")
    val sc = new SparkContext(conf)
    val textFile = sc.textFile(args(0))
    val rating = textFile.map(line => {
      val fields = line.split("::")
      (fields(1).toInt, fields(2).toDouble)  // (movie ID, one user's score)
    })
    val movieScores = rating.groupByKey().map(data => {
      val avg = data._2.sum / data._2.size  // average score of this movie
      (data._1, avg)
    })

    // 2. Then take the movie ID and movie name
    val movies = sc.textFile(args(1))
    val movieskey = movies.map(line => {
      val fields = line.split("::")
      (fields(0).toInt, fields(1))  // (movie ID, movie name)
    }).keyBy(tup => tup._1)

    // 3. Join by movie ID and keep the records whose average score is greater than 4
    val result = movieScores.keyBy(tup => tup._1)
      .join(movieskey)
      .filter(f => f._2._1._2 > 4.0)
      .map(f => (f._1, f._2._1._2, f._2._2._2))  // (movie ID, average score, movie name)
    result.saveAsTextFile(args(2))
  }
}


Origin blog.csdn.net/weixin_45014721/article/details/109706149