1. Presenting the Task
Grouped Top-N is a common requirement in big data: group the data by one column, sort each group by a specified column, and take the first N rows of every group.
Data table t_grade
Execute the following query:
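The idea can be sketched with plain Scala collections before bringing Spark in. Note that the data below is hypothetical sample data and `TopNSketch` is a name invented for this illustration; the real task applies the same logic to a table:

```scala
// Grouped Top-N on an in-memory collection: group by name,
// sort each group's scores in descending order, keep the first N.
object TopNSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical sample data: (name, score) pairs
    val grades = Seq(
      ("Alice", 98), ("Alice", 95), ("Alice", 90), ("Alice", 82),
      ("Bob", 88), ("Bob", 79), ("Bob", 76), ("Bob", 70)
    )
    val n = 3
    val topN = grades
      .groupBy { case (name, _) => name }
      .mapValues(_.map { case (_, score) => score }.sorted(Ordering[Int].reverse).take(n))
    topN.toSeq.sortBy { case (name, _) => name }.foreach { case (name, scores) =>
      println(name + ": " + scores.mkString(", "))
    }
  }
}
```

Spark SQL expresses exactly this group-sort-take pattern declaratively, which is what the rest of this task builds.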
SELECT * FROM t_grade tg
WHERE (SELECT COUNT(*) FROM t_grade
       WHERE tg.name = t_grade.name
         AND score >= tg.score) <= 3
ORDER BY name, score DESC;
2. Completing the Task
(1) Create a Maven Project
Click Finish to create the project.
(2) Add Dependencies and Build Plugins
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>net.huawei.sql</groupId>
    <artifactId>GradeTopNBySQL</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.11.12</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.1.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.1.1</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.3.0</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.3.2</version>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
(3) Create the Log Properties File
Create the log properties file log4j.properties with the following content:
log4j.rootLogger=ERROR, stdout, logfile
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spark.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
(4) Create the GradeTopNBySQL Singleton Object
Create the net.xxx.sql package under the scala directory, then create GradeTopNBySQL.scala in that package:
package net.xjx.sql

import org.apache.spark.sql.{Dataset, SparkSession}

object GradeTopNBySQL {
  def main(args: Array[String]): Unit = {
    // Build a local SparkSession
    val spark = SparkSession.builder()
      .appName("GradeTopNBySQL")
      .master("local[*]")
      .getOrCreate()
    // Read the grades file from HDFS, one line per record
    val lines: Dataset[String] = spark.read.textFile("hdfs://master:9000/input/grades.txt")
    import spark.implicits._
    // Parse each line ("name score") into a Grade
    val gradeDS: Dataset[Grade] = lines.map(line => {
      val fields = line.split(" ")
      val name = fields(0)
      val score = fields(1).toInt
      Grade(name, score)
    })
    // Convert to a DataFrame and register it as a temporary view
    val df = gradeDS.toDF()
    df.createOrReplaceTempView("t_grade")
    // Rank scores within each name and keep the top 3 per group
    val top3 = spark.sql(
      """
        |SELECT name, score FROM
        | (SELECT name, score, row_number() OVER (PARTITION BY name ORDER BY score DESC) rank FROM t_grade) t
        | WHERE t.rank <= 3
        |""".stripMargin
    )
    top3.show()
    top3.foreach(row => println(row(0) + ": " + row(1)))
    spark.close()
  }

  case class Grade(name: String, score: Int)
}
(5) Run the Program Locally and View the Results
View the output in the console.
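Because the pom.xml builds a jar-with-dependencies, the same program can also be submitted to a cluster after running mvn package. A sketch of the command, where the Spark master URL spark://master:7077 is an assumption and the class and jar names come from this project:

```shell
# Submit the assembled jar to a standalone cluster. Note that the
# hard-coded .master("local[*]") in the source takes precedence over
# --master, so it would need to be removed from the code first.
spark-submit --class net.xjx.sql.GradeTopNBySQL \
  --master spark://master:7077 \
  target/GradeTopNBySQL-1.0-SNAPSHOT-jar-with-dependencies.jar
```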
(6) Inspect Intermediate Results Interactively
1. Read the grades file into a Dataset
Run the command: val lines: Dataset[String] = spark.read.textFile("hdfs://master:9000/input/grades.txt")
2. Define the grade case class
Run the command: case class Grade(name: String, score: Int)
3. Import implicit conversions
Run the command: import spark.implicits._
4. Create the grade Dataset
val gradeDS: Dataset[Grade] = lines.map(line => {
  val fields = line.split(" ")
  val name = fields(0)
  val score = fields(1).toInt
  Grade(name, score)
})
Run the statements above.
5. Convert the Dataset to a DataFrame
Run the command: val df = gradeDS.toDF()
6. Create a temporary view from the DataFrame
Run the command: df.createOrReplaceTempView("t_grade")
7. Query the temporary view to produce the grouped ranking
val top3 = spark.sql(
  """
    |SELECT name, score FROM
    | (SELECT name, score, row_number() OVER (PARTITION BY name ORDER BY score DESC) rank FROM t_grade) t
    | WHERE t.rank <= 3
    |""".stripMargin
)
Run the statement above.
8. Show the grouped ranking results
Run the command: top3.show()
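Note that show() prints at most 20 rows by default. To print every row in the same name: score form that the standalone program uses, the results can be brought back to the driver:

```scala
// collect() returns all rows of the result to the driver; that is
// safe here because the top-3-per-name result is small.
top3.collect().foreach(row => println(row(0) + ": " + row(1)))
```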