Spark SQL job: rank student scores and output the top ten

Experiment type: converting an RDD to a DataFrame

Experimental requirements:

Given three sets of student data, combine the records from all three files and output the top ten students by score. The output format is: rank, class, student ID, and score.

Steps:

1. Open IDEA and create the project as follows: File —> New —> Project —> Maven —> scala3 —> main —> sql; just follow the prompts. Note: if the Scala plug-in is not yet installed in IDEA, install it before starting this experiment: File —> Settings —> Plugins —> search for Scala —> Apply.

2. Create three student data files in (.txt) format under the newly created main directory. The data I used is listed below.
3. Add the Spark SQL dependency to the pom.xml file. The snippet is as follows:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.3.2</version>
</dependency>

Note: you can use the latest spark-sql version instead. I ran this experiment some time ago, so the dependency version shown here is an older one.
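If you are on a newer environment, the equivalent coordinates for a Spark 3.x build against Scala 2.12 would look like the following (the version number here is illustrative, not from the original experiment; check Maven Central for the current release):

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.3.2</version>
</dependency>
```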

data file:

1. class1.txt file data:

1,1001,50,2018001
2,1002,60,2018001
3,1003,70,2018001
4,1004,20,2018001
5,1005,80,2018001
6,1006,66,2018001
7,1007,99,2018001

2. class2.txt file data:

1,2001,55,2018002
2,2002,56,2018002
3,2003,88,2018002
4,2004,60,2018002
5,2005,78,2018002
6,2006,62,2018002

3. class3.txt file data:

1,3001,99,2018003
2,3002,84,2018003
3,3003,59,2018003
4,3004,71,2018003
5,3005,69,2018003
6,3006,100,2018003
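Before wiring this into Spark, the intended logic can be sketched with plain Scala collections. This is a hypothetical sketch (the `TopTenSketch` object and its method names are illustrative, not part of the actual job): parse each "rowNo,studentId,score,classId" line, merge the three classes, sort by score descending, and keep at most ten rows.

```scala
object TopTenSketch {
  // Mirrors the RDD map step: pick out (classId, studentId, score)
  def parse(line: String): (Int, Int, Int) = {
    val f = line.split(",")
    (f(3).toInt, f(1).toInt, f(2).toInt)
  }

  // Union the class files' lines, sort by score descending, take ten
  def topTen(lines: Seq[String]): Seq[(Int, Int, Int)] =
    lines.map(parse).sortBy(-_._3).take(10)

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      "1,1001,50,2018001",   // from class1.txt
      "3,2003,88,2018002",   // from class2.txt
      "6,3006,100,2018003"   // from class3.txt
    )
    topTen(sample).zipWithIndex.foreach { case ((cls, id, score), i) =>
      println(s"${i + 1}\t$cls\t$id\t$score") // rank, class, student ID, score
    }
  }
}
```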

After creating the data, we can start the experiment. The experiment code is as follows.

Experimental code block:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Sample class: 班级 = class ID, 学号 = student ID, 成绩 = score
case class student(班级: Int, 学号: Int, 成绩: Int)

object sql {

  def main(args: Array[String]): Unit = {

    val spark: SparkSession = SparkSession.builder()
      .appName("test1")
      .master("local[2]")
      .getOrCreate()
    val sc: SparkContext = spark.sparkContext
    sc.setLogLevel("WARN")

    // Read the three class files; each line is "rowNo,studentId,score,classId"
    val data1: RDD[Array[String]] =
      sc.textFile("src/main/class1.txt").map(_.split(","))
    val data2: RDD[Array[String]] =
      sc.textFile("src/main/class2.txt").map(_.split(","))
    val data3: RDD[Array[String]] =
      sc.textFile("src/main/class3.txt").map(_.split(","))

    // Merge the three classes into a single RDD of student records
    val data = data1.union(data2).union(data3)
    val studentRDD: RDD[student] =
      data.map(x => student(x(3).toInt, x(1).toInt, x(2).toInt))

    // Enables the implicit RDD-to-DataFrame conversion (toDF)
    import spark.implicits._
    val studentDF: DataFrame = studentRDD.toDF("班级", "学号", "成绩")

    // Alternative without a rank column:
    // studentDF.sort(studentDF("成绩").desc).show(10)

    // Rank all students by score descending and show the top ten
    val windowSpec = Window.orderBy(col("成绩").desc)
    studentDF.withColumn("排名", row_number().over(windowSpec))
      .select("排名", "班级", "学号", "成绩")
      .show(10)

    spark.stop()
  }
}

Notes on the important parts of the code above:
1. Import the required packages and define the sample class: case class student(班级: Int, 学号: Int, 成绩: Int), whose fields are the class ID, student ID, and score.
2. import spark.implicits._ enables the implicit conversion from an RDD to a DataFrame (the toDF method).
3. desc sorts in descending order, col references a column, and orderBy specifies the sort key for the window.
4. Window.partitionBy groups rows by the given fields while keeping every row, so ranking functions apply within each group; with no partition columns, the whole DataFrame is ranked as a single group.
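The code imports both row_number and dense_rank; they differ when scores tie. A hypothetical plain-Scala illustration of the two ranking schemes (the `RankDemo` object is mine, for illustration only, and it assumes the scores are already sorted descending):

```scala
object RankDemo {
  // row_number semantics: consecutive 1, 2, 3, ... even for tied scores
  def rowNumbers(scores: Seq[Int]): Seq[Int] =
    scores.indices.map(_ + 1)

  // dense_rank semantics: tied scores share a rank, with no gaps after ties
  def denseRanks(scores: Seq[Int]): Seq[Int] =
    scores.foldLeft((Vector.empty[Int], 0, Option.empty[Int])) {
      case ((ranks, rank, prev), s) =>
        val r = if (prev.contains(s)) rank else rank + 1 // same score keeps the rank
        (ranks :+ r, r, Some(s))
    }._1
}
```

For Seq(100, 99, 99, 88), rowNumbers gives 1, 2, 3, 4 while denseRanks gives 1, 2, 2, 3; with row_number, one of the two tied students is placed ahead of the other arbitrarily.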

Screenshot of experimental results:

The screenshot shows the computed ranking for the top ten students.


Origin: blog.csdn.net/qq_62127918/article/details/130426849