spark sql的练习题

1、使用Structured Streaming读取Socket数据，把单词和单词的反转组成 json 格式写入到当前目录中的file文件夹中
2、请使用Structured Streaming读取student_info文件夹写的csv文件，
2.1、统计出文件中的男女生各有多少人
2.2、统计出姓“王”男生和女生的各有多少人
3、请使用Structured Streaming读取department_info文件夹写的csv文件
3.1统计出各个院系的分别多少条信息
4、请使用spark sql读取student_score文件夹写的csv文件
4.1、统计出每个班级的最高分数
4.2、统计出男生最高分
4.3、统计出女生最高分
4.4、分别统计出男生和女生的分数前三名
4.5、统计出分数在500分以上的人数
4.7、统计出分数在300分以下的人中男女各占多少
5.请使用Spark sql读取class_info文件夹写的csv文件
5.1、统计出哪个院系的专业最多
5.2、统计出计算机学院中有多少专业
5.3、统计出经济管理学院的会计和工商管理的班级数
5.4、分别统计出每个学院的班级最多的专业
5.5、统计出班级编号以2开头的所有的专业名称
以下是sparksql的练习题
表（一）Student (学生表)
属性名数据类型可否为空含义
Sno varchar (20) 否学号
Sname varchar (20) 否学生姓名
Ssex varchar (20) 否学生性别
Sbirthday datetime 可学生出生年月
Class varchar (20) 可学生所在班级
表（二）Course（课程表）
属性名数据类型可否为空含义
Cno varchar (20) 否课程号
Cname varchar (20) 否课程名称
Tno varchar (20) 否教工编号
表（三）Score(成绩表)
属性名数据类型可否为空含义
Sno varchar (20) 否学号
Cno varchar (20) 否课程号
Degree Decimal(4,1) 可成绩
表（四）Teacher(教师表)
属性名数据类型可否为空含义
Tno varchar (20) 否教工编号
Tname varchar (20) 否教工姓名
Tsex varchar (20) 否教工性别
Tbirthday datetime 可教工出生年月
Prof varchar (20) 可职称
Depart varchar (20) 否教工所在部门
表1-2数据库中的数据
表（一）Student
Sno Sname Ssex Sbirthday class
108 丘东男 1977-09-01 95033
105 匡明男 1975-10-02 95031
107 王丽女 1976-01-23 95033
101 李军男 1976-02-20 95033
109 王芳女 1975-02-10 95031
103 陆君男 1974-06-03 95031
表（二）Course
Cno Cname Tno
3-105 计算机导论 825
3-245 操作系统 804
6-166 数字电路 856
9-888 高等数学 831
表（三）Score
Sno Cno Degree
103 3-245 86
105 3-245 75
109 3-245 68
103 3-105 92
105 3-105 88
109 3-105 76
101 3-105 64
107 3-105 91
108 3-105 78
101 6-166 85
107 6-166 79
108 6-166 81
表（四）Teacher
Tno Tname Tsex Tbirthday Prof Depart
804 李诚男 1958-12-02 副教授计算机系
856 张旭男 1969-03-12 讲师电子工程系
825 王萍女 1972-05-05 助教计算机系
831 刘冰女 1977-08-14 助教电子工程系
6、查询Student表中“95031”班或性别为“女”的同学记录。
7、以Class降序,升序查询Student表的所有记录。
8、以Cno升序、Degree降序查询Score表的所有记录。
9、查询“95031”班的学生。
10、查询Score表中的最高分的学生学号和课程号。（子查询或者排序）
11、查询每门课的平均成绩。
12、查询Score表中至少有5名学生选修的并以3开头的课程的平均分数。
13、查询分数大于70，小于90的Sno列。
14、查询所有学生的Sname、Cno和Degree列。
15、查询所有学生的Sno、Cname和Degree列。
16、查询所有学生的Sname、Cname和Degree列。
17、查询“95033”班学生的平均分。
18、查询所有选修“计算机导论”课程的“女”同学的成绩表。
19、查询选修“3-105”课程的成绩高于“109”号同学成绩的所有同学的记录。
20、查询score中选学多门课程的同学中分数为非最高分成绩的记录。
21、查询成绩高于学号为“109”、课程号为“3-105”的成绩的所有记录。
22、查询和学号为105的同学同年出生的所有学生的Sno、Sname和Sbirthday列。
23、查询“张旭“教师任课的学生成绩
24、查询选修某课程的同学人数多于4人的教师姓名。
25、查询95033班和95031班全体学生的记录。
26、查询存在有85分以上成绩的课程Cno.
27、查询出“计算机系“教师所教课程的成绩表。
28、查询“计算机系”与“电子工程系“不同职称的教师的Tname和Prof。
29、查询选修编号为“3-105“课程且成绩至少高于选修编号为“3-245”的同学的Cno、Sno和Degree,并按Degree从高到低次序排序。
30、查询选修编号为“3-105”且成绩高于选修编号为“3-245”课程的同学的Cno、Sno和Degree.
31、查询所有教师和同学的name、sex和birthday.
32、查询所有“女”教师和“女”同学的name、sex和birthday.
33、查询成绩比该课程平均成绩低的同学的成绩表。
34、查询所有任课教师的Tname和Depart.
35、查询所有未讲课的教师的Tname和Depart.
36、查询至少有2名男生的班号。
37、查询Student表中不姓“王”的同学记录。
38、查询Student表中每个学生的姓名和年龄。将函数运用到spark sql中去计算，可以直接拿String的类型计算不需要再转换成数值型默认是会转换成Double类型计算浮点型转整型
39、查询Student表中最大和最小的Sbirthday日期值。时间格式最大值,最小值
40、以班号和年龄从大到小的顺序查询Student表中的全部记录。查询结果排序
41、查询“男”教师及其所上的课程。
42、查询最高分同学的Sno、Cno和Degree列。
43、查询和“李军”同性别的所有同学的Sname.
44、查询和“李军”同性别并同班的同学Sname.
45、查询所有选修“计算机导论”课程的“男”同学的成绩表。
46、查询Student表中的所有记录的Sname、Ssex和Class列。
47、查询教师所有的单位即不重复的Depart列。
48、查询Student表的所有记录
49、查询Score表中成绩在60到80之间的所有记录。
50、查询Score表中成绩为85，86或88的记录。
代码：

1、1、使用Structured Streaming读取Socket数据，把单词和单词的反转组成 json 格式写入到当前目录中的file文件夹中
object day {
  //1、请使用Structured Streaming读取Socket数据，统计出每个单词的个数
  def main(args: Array[String]): Unit = {
    //1.创建SparkSession
    val spark: SparkSession = SparkSession.builder().master("local[*]").appName("day1").getOrCreate()
    val sc: SparkContext = spark.sparkContext
    sc.setLogLevel("WARN")
    val frame: DataFrame = spark.readStream
      .option("host", "hadoop01")
      .option("port", 9999)
      .format("socket")
      .load()
    import  spark.implicits._
    val dataDS: Dataset[String] = frame.as[String]
    val wordDF = dataDS.flatMap(_.split(" "))
      .map({ x => (x, x.reverse) }).toDF("before", "reverse")
    //wordDF.show()
    //输出数据
    wordDF.writeStream
      .format("json")
      .option("path","F:\\第四学期的大数据资料\\day02四月份资料\\第二周\\day05\\4.16号练习题50道2.0\\file")
      .option("checkpointLocation","json")//必须指定 checkpoint 目录，否则报错
      .trigger(Trigger.ProcessingTime(0))
      .start()
      .awaitTermination()
  }
}
2、请使用Structured Streaming读取student_info文件夹写的csv文件，
2.1、统计出文件中的男女生各有多少人
2.2、统计出姓“王”男生和女生的各有多少人
object day2 {
  def main(args: Array[String]): Unit = {
    //创建SparkSession
    val spark: SparkSession =
    SparkSession.builder().master("local[*]").appName("day2").getOrCreate()
    val sc: SparkContext = spark.sparkContext
    sc.setLogLevel("WARN")
    //准备数据结构
    val Schema: StructType = new StructType()
      .add("id", "integer")
      .add("name", "string")
      .add("sex", "string")
      .add("classs", "string")
      .add("date", "string")
    //接受数据
    import spark.implicits._
    val dataDF: DataFrame =
      spark.readStream.schema(Schema).csv("F:\\第四学期的大数据资料\\day02四月份资料\\第二周\\day05\\4.16号练习题50道2.0\\student_info")
    // 根据业务处理和计算数据
    val result: Dataset[Row] =
    //2.1、统计出文件中的男女生各有多少人
      dataDF.selectExpr("sex").groupBy("sex").count().sort($"count".desc)
    //2.2、统计出姓“王”男生和女生的各有多少人
    dataDF.select("name", "sex").where("name like '%王%'")
      .groupBy("sex").count().sort($"count".desc)
    //输出数据
    result.writeStream
      .format("console")
      .outputMode("complete")
      .trigger(Trigger.ProcessingTime(0))
      .start()
      .awaitTermination()
  }
}
3、请使用Structured Streaming读取department_info文件夹写的csv文件
3.1统计出各个院系的分别多少条信息
def main(args: Array[String]): Unit = {
    //创建SparkSession
    val spark: SparkSession =
      SparkSession.builder().master("local[*]").appName("day2").getOrCreate()
    val sc: SparkContext = spark.sparkContext
    sc.setLogLevel("WARN")
    //准备数据结构
    val Schema: StructType = new StructType()
      .add("id", "integer")
      .add("name", "string")
    //接受数据
    import spark.implicits._
    val dataDF: DataFrame =
      spark.readStream.schema(Schema).csv("F:\\第四学期的大数据资料\\day02四月份资料\\第二周\\day05\\4.16号练习题50道2.0\\department_info")
    // 根据业务处理和计算数据
    val result: Dataset[Row] =
    //3.1统计出各个院系的分别多少条信息
      dataDF.select("name")
        .groupBy("name").count().sort($"count".desc)
    //输出数据
    result.writeStream
      .format("console")
      .outputMode("complete")
      .trigger(Trigger.ProcessingTime(0))
      .start()
      .awaitTermination()
  }
4、请使用spark sql读取student_score文件夹写的csv文件
4.1、统计出每个班级的最高分数
4.2、统计出男生最高分
4.3、统计出女生最高分
4.4、分别统计出男生和女生的分数前三名
4.5、统计出分数在500分以上的人数
4.7、统计出分数在300分以下的人中男女各占多少
def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[*]").appName("day04").getOrCreate()
    val sc: SparkContext = spark.sparkContext
    sc.setLogLevel("WARN")
    val frame: DataFrame = spark.read.csv("F:\\第四学期的大数据资料\\day02四月份资料\\第二周\\day05\\4.16号练习题50道2.0\\student_score")
    import spark.implicits._
    //将RDD转成toDF
    val personDF: DataFrame = frame.toDF("id", "name", "sex", "classs", "score")
    //打印数据
    personDF.createOrReplaceTempView("student")
    //4.1、统计出每个班级的最高分数
    spark.sql("select classs,max(score) from student group by classs ").show()
    //4.2、统计出男生最高分
    spark.sql("select sex,max(score) from student where sex = '男' group by sex").show()
    //4.3、统计出女生最高分
    spark.sql("select sex,max(score) from student where sex = '女' group by sex").show()
    //4.4、分别统计出男生和女生的分数前三名
    spark.sql("select name,classs,score,sex,row_number() over(partition by sex order by score desc) rk from student having rk<=3 ").show()
    //4.5、统计出分数在500分以上的人数
    spark.sql("select count(score) from student where score > 500 ").show()
    //4.7、统计出分数在300分以下的人中男女各占多少
    spark.sql("select sex,count(sex) from (select * from student where score<300)  group by sex").show()
  }
5.请使用Spark sql读取class_info文件夹写的csv文件
5.1、统计出哪个院系的专业最多
5.2、统计出计算机学院中有多少专业
5.3、统计出经济管理学院的会计和工商管理的班级数
5.4、分别统计出每个学院的班级最多的专业
5.5、统计出班级编号以2开头的所有的专业名称
def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[*]").appName("day05").getOrCreate()
    val sc: SparkContext = spark.sparkContext
    sc.setLogLevel("WARN")
    val frame: DataFrame = spark.read.csv("F:\\第四学期的大数据资料\\day02四月份资料\\第二周\\day05\\4.16号练习题50道2.0\\class_info")
    import spark.implicits._
    //将RDD转成toDF
    val personDF: DataFrame = frame.toDF("id", "name", "date", "classs")
    //打印数据
    personDF.createOrReplaceTempView("classI")
    //5.1、统计出哪个院系的专业最多
    //spark.sql("select classs,max(name) from classI group by classs").show()
    //5.2、统计出计算机学院中有多少专业
    // spark.sql("select count(name) from classI where classs ='计算机学院'").show()
    //5.3、统计出经济管理学院的会计和工商管理的班级数
    //spark.sql("select subStr(name,0,2),count(*) from classI where classs ='经济管理学院' and name like '会计%' or name like '工商管理%' group by subStr(name,0,2)").show()
    //5.4、分别统计出每个学院的班级最多的专业
   // spark.sql("select max(name),classs from classI group by classs").show()
    //5.5、统计出班级编号以2开头的所有的专业名称
    spark.sql("select id,name from classI where id like '_2%'").show()
  }
以下是sparksql的练习题
package demo02

import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}

object day006 {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("day006").master("local").getOrCreate()
    import spark.implicits._
    val jdbcDF1 = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/test?characterEncoding=UTF-8")
      .option("dbtable", "student")
      .option("user", "root")
      .option("password", "root")
      .load()

    val frame1: DataFrame = jdbcDF1.toDF()
    frame1.createOrReplaceTempView("student")

    val jdbcDF2 = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/test?characterEncoding=UTF-8")
      .option("dbtable", "Score")
      .option("user", "root")
      .option("password", "root")
      .load()
    val frame2: DataFrame = jdbcDF2.toDF()
    frame2.createOrReplaceTempView("Score")

    val jdbcDF3 = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/test?characterEncoding=UTF-8")
      .option("dbtable", "Course")
      .option("user", "root")
      .option("password", "root")
      .load()
    val frame3: DataFrame = jdbcDF3.toDF()
    frame3.createOrReplaceTempView("Course")

    val jdbcDF4 = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/test?characterEncoding=UTF-8")
      .option("dbtable", "teacher")
      .option("user", "root")
      .option("password", "root")
      .load()
    val frame4: DataFrame = jdbcDF4.toDF()
    frame4.createOrReplaceTempView("teacher")

    //6、查询Student表中“95031”班或性别为“女”的同学记录。
    // spark.sql("select * from student where  Class = 95031 and Ssex ='女'").show()
    //7、以Class降序,升序查询Student表的所有记录。
    // spark.sql("select * from student  order by Class desc").show()
    // 8、以Cno升序、Degree降序查询Score表的所有记录。
    //  spark.sql("select * from Score order by Cno asc,Degree desc").show()
    //9、查询“95031”班的学生。
    // spark.sql("select * from student where Class = 95031").show()
    //    10、查询Score表中的最高分的学生学号和课程号。（子查询或者排序）
    // spark.sql("select SNO,CNO from Score where Degree=(select MAX(Degree) from Score)").show()
    //    11、查询每门课的平均成绩。
    // spark.sql("select Cno,AVG(Degree)  from Score group by Cno").show()
    //    12、查询Score表中至少有5名学生选修的并以3开头的课程的平均分数。
    //spark.sql("select AVG(Degree ) from Score where Cno like '3%' group by Cno having COUNT(Cno)>4").show()
    //    13、查询分数大于70，小于90的Sno列。
    // spark.sql("select sno from Score where Degree between 70 and 90").show()
    //    14、查询所有学生的Sname、Cno和Degree列。
    // spark.sql("select Sname,Cno,Degree from student join Score on student.Sno=Score.Sno").show()
    //    15、查询所有学生的Sno、Cname和Degree列。
    //spark.sql("select Sno,Cname,degree from Score join Course on Course.Cno=Score.Cno").show()
    //    16、查询所有学生的Sname、Cname和Degree列。
    //spark.sql("select student.Sname,Cname,degree from student join Score on student.Sno=Score.Sno join Course on Course.Cno=Score.Cno").show()
    //    17、查询“95033”班学生的平均分。
    //spark.sql("select AVG(Degree) from Score,student where student.Sno=Score.Sno and Class='95033'").show()
    //    18、查询所有选修“计算机导论”课程的“女”同学的成绩表。
    //spark.sql("select Sno,Degree from Score where Sno in (select Sno from student where Ssex='女') and Cno in (select Cno from Course where Cname='计算机导论')").show()
    //    19、查询选修“3-105”课程的成绩高于“109”号同学成绩的所有同学的记录。
    //spark.sql("select * from student,Score where Score.Cno='3-105' and student.Sno=Score.Sno and Score.Degree>(select Degree from Score where Cno='3-105' and Sno='109')").show()
    //    20、查询score中选学多门课程的同学中分数为非最高分成绩的记录。
    // spark.sql("select * from Score a where Degree <(select MAX(degree) from Score b where a.Cno=b.Cno) and Sno in(select Sno from Score group by Sno having count(*)>1)").show()
    //    21、查询成绩高于学号为“109”、课程号为“3-105”的成绩的所有记录。
    //spark.sql("select * from student,Score where student.Sno=Score.Sno and Score.Degree>(select Degree from Score where Cno='3-105' and Sno='109')").show()
    //    22、查询和学号为105的同学同年出生的所有学生的Sno、Sname和Sbirthday列。
    //spark.sql("select Sno,Sname,Sbirthday from student where year(student.Sbirthday)=(select year(Sbirthday) from student where Sno='105')").show()
    //    23、查询“张旭“教师任课的学生成绩
    //spark.sql("select Degree from Score,Teacher,Course where Teacher.Tname='张旭' and Teacher.Tno=Course.Tno and Course.Cno=Score.Cno").show()
    //    24、查询选修某课程的同学人数多于4人的教师姓名。
    //spark.sql("select Tname from Teacher where Tno in (select Tno from Course where Cno in (select Cno from Score group by Cno having COUNT(*)>4))").show()
    //    25、查询95033班和95031班全体学生的记录。
    // spark.sql("select * from student where Class='95033' or Class='95031'").show()
    //    26、查询存在有85分以上成绩的课程Cno.
    // spark.sql("select distinct cno from Score where Degree>85").show()
    //    27、查询出“计算机系“教师所教课程的成绩表。
    //spark.sql("select sno,Cno ,Degree from Score where Cno in (select Cno from Course where Tno in (select tno from Teacher where Depart='计算机系'))").show()
    //    28、查询“计算机系”与“电子工程系“不同职称的教师的Tname和Prof。
    //spark.sql("select Tname,Prof from Teacher a where Prof not in(select Prof from Teacher b where a.Depart!=b.Depart)").show()
    //    29、查询选修编号为“3-105“课程且成绩至少高于选修编号为“3-245”的同学的Cno、Sno和Degree,并按Degree从高到低次序排序。
    //spark.sql("select Cno,Sno,Degree from Score a where (select Degree from Score b where Cno='3-105' and b.Sno=a.Sno)>=(select Degree from Score c where Cno='3-245' and c.Sno=a.Sno) order by Degree desc").show()
    //    30、查询选修编号为“3-105”且成绩高于选修编号为“3-245”课程的同学的Cno、Sno和Degree.
    //spark.sql("select Cno,Sno,Degree from Score a where (select Degree from Score b where Cno='3-105' and b.Sno=a.Sno)>(select Degree from Score c where Cno='3-245' and c.Sno=a.Sno)").show()
    //    31、查询所有教师和同学的name、sex和birthday.
    //spark.sql("select distinct Sname as name,Ssex as sex,Sbirthday as birthday from student union select distinct Tname as name,Tsex as sex,Tbirthady as birthday from Teacher").show()
    //    32、查询所有“女”教师和“女”同学的name、sex和birthday.
    //spark.sql("select distinct Sname as name,Ssex as sex,Sbirthday as birthday from student where Ssex='女' union select distinct Tname as name,Tsex as sex,Tbirthady as birthday from Teacher where Tsex='女'").show()
    //    33、查询成绩比该课程平均成绩低的同学的成绩表。
    //spark.sql("select Sno,Cno,Degree from Score a where a.Degree<(select AVG(Degree) from Score b where a.Cno=b.Cno)").show()
    //    34、查询所有任课教师的Tname和Depart.
    //spark.sql("select Tname,Depart from Teacher where tno in (select tno from course where Cno in (select distinct Cno from Score))").show()
    //    35、查询所有未讲课的教师的Tname和Depart.
    //spark.sql("select Tname,Depart from Teacher where Tname not in (select distinct Tname from Teacher,Course,Score where Teacher.Tno=Course.Tno and Course.Cno=Score.Cno)").show()
    //    36、查询至少有2名男生的班号。
    //spark.sql("select Class FROM student where Ssex='男' group by Class having COUNT(*)>1").show()
    //    37、查询Student表中不姓“王”的同学记录。
    // spark.sql("select * from student where Sname not like '王%'").show()
    //    38、查询Student表中每个学生的姓名和年龄。将函数运用到spark sql中去计算，可以直接拿String的类型计算不需要再转换成数值型 默认是会转换成Double类型计算浮点型转整型
    // spark.sql("select Sname,YEAR(GETDATE())-year(Sbirthday) from student").show()
    //    39、查询Student表中最大和最小的Sbirthday日期值。 时间格式最大值,最小值
    // spark.sql("select MAX(Sbirthday) as 最大,MIN(Sbirthday) as 最小 from student").show()
    //    40、以班号和年龄从大到小的顺序查询Student表中的全部记录。 查询结果排序
    // spark.sql("select * from student order by Class desc,Sbirthday asc").show()
    //    41、查询“男”教师及其所上的课程。
    // spark.sql("select Tname,Cname from Teacher,Course where Tsex='男' and Teacher.Tno=Course.Tno").show()
    //    42、查询最高分同学的Sno、Cno和Degree列。
    //spark.sql("select Sno,Cno,Degree from Score where degree=(select MAX(Degree)from Score)").show()
    //    43、查询和“李军”同性别的所有同学的Sname.
    // spark.sql("select Sname from student where Ssex=(select Ssex from student where Sname='李军') and Sname not in ('李军')").show()
    //    44、查询和“李军”同性别并同班的同学Sname.
    //spark.sql("select Sname from student where Ssex=(select Ssex from student where Sname='李军') and Sname not in ('李军') and Class=(select Class from student where Sname='李军')").show()
    //    45、查询所有选修“计算机导论”课程的“男”同学的成绩表。
    // spark.sql("select Sno,Degree from Score where Sno in (select Sno from student where Ssex='男') and Cno in (select Cno from Course where Cname='计算机导论')").show()
    //    46、查询Student表中的所有记录的Sname、Ssex和Class列。
    //spark.sql("select Sname,Ssex,Class from student").show()
    //    47、查询教师所有的单位即不重复的Depart列。
    // spark.sql("select distinct Depart from Teacher").show()
    //    48、查询Student表的所有记录
    //spark.sql("select * from student").show()
    //    49、查询Score表中成绩在60到80之间的所有记录。
    //spark.sql("select * from Score where Degree between 60 and 80").show()
    //    50、查询Score表中成绩为85，86或88的记录。
    spark.sql("select * from Score where Degree in (85,86,88)").show()
  }
}

大数据666

发布了238 篇原创文章 · 获赞 429 · 访问量 25万+

私信关注

猜你喜欢