pyspark入门---sparksql练习

1.实验数据说明:

Student字段说明:

字段名 类型 备注
sno int 学号
sclass int 班级号
sname Date 姓名
sgender int 性别
sage int 年龄

Teacher字段说明:

字段名 类型 备注
tno int 教工号
tname int 姓名
tsex int 性别
tage int 年龄

Course字段说明:

字段名 类型 备注
cno int 课程编号
cname int 课程名称
tno int 教工号

Score字段说明:

字段名 类型 备注
sno int 学号
cno int 课程编号
degree int 成绩

2.实验要求

  1. 查询每门课程选修人数

输出字段:课程编号,人数

  1. 查询所有学生课程编号为1-101所有的成绩,并以降序排列

输出字段:学号,课程名称,成绩

  1. 查询平均成绩前3的课程

输出字段:课程编号,平均成绩

  1. 查询4003号同学的各科成绩

输出字段:学生姓名,课程名称,成绩

  1. 查询男女同学的平均成绩

输出字段:性别,平均成绩

  1. 查询每位教师所代课程名称

输出字段:教师名称,课程名称

  1. 查询编号为3-103课程的所有男同学名称及成绩

输出字段: 学生名称,课程编号,成绩

  1. 查询成绩在大于等于80分的学生名称、课程名称及分数

输出字段:学生名称,课程名称,分数

3.实验内容

from pyspark import SparkContext
courserdd=sc.textFile(r"G:\pythonProject\pc\spark\data\c5_labs_course.txt")
scorerdd=sc.textFile(r"G:\pythonProject\pc\spark\data\c5_labs_score.txt")
studentrdd=sc.textFile(r"G:\pythonProject\pc\spark\data\c5_labs_student.txt")
teacherrdd=sc.textFile(r"G:\pythonProject\pc\spark\data\c5_labs_teacher.txt")
print(courserdd.count())
print(scorerdd.count())
print(studentrdd.count())
print(teacherrdd.count())

image-20200510212509968

#通过逗号分割字段
courserdd=courserdd.map(lambda line:line.split(","))
courserdd.take(5)

image-20200510213031314

scorerdd=scorerdd.map(lambda line:line.split(","))
scorerdd.take(5)

image-20200510213127324

studentrdd=studentrdd.map(lambda line:line.split(","))
studentrdd.take(5)

image-20200510213204268

teacherrdd=teacherrdd.map(lambda line:line.split(","))
teacherrdd.take(5)

image-20200510213241469

#创建Sparksql,首先创建dataframe
from pyspark.sql import SparkSession
sqlContext=SparkSession.builder.getOrCreate()
from pyspark.sql import Row
courserows = courserdd.map(lambda p:Row( 
cno=p[0],
cname=p[1],
tno=p[2]
)
)
courserows.take(5)

image-20200510213400114

from pyspark.sql import Row
scorerows = scorerdd.map(lambda p:Row( 
sno=p[0],
cno=p[1],
degree=p[2]
)
)
scorerows.take(5)

image-20200510213900986

from pyspark.sql import Row
studentrows=studentrdd.map(lambda p:Row( 
sno=p[0],
sclass=p[1],
sname=p[2],
ssex=p[3],
sage=p[4]
)
)
studentrows.take(5)

image-20200510213934249

from pyspark.sql import Row
teacherrows = teacherrdd.map(lambda p:Row( 
tno=p[0],
tname=p[1],
tsex=p[2],
tage=p[3]
)
)
teacherrows.take(5)

image-20200510214003909

# 创建sparksql
course_df=sqlContext.createDataFrame(courserows)
course_df.printSchema()
course_df.show()
score_df=sqlContext.createDataFrame(scorerows)
score_df.printSchema()
score_df.show()
student_df=sqlContext.createDataFrame(studentrows)
student_df.printSchema()
student_df.show()
teacher_df=sqlContext.createDataFrame(teacherrows)
teacher_df.printSchema()
teacher_df.show()

image-20200510214059298

#使用sparksql
course_df.registerTempTable("course")
score_df.registerTempTable("score")
student_df.registerTempTable("student")
teacher_df.registerTempTable("teacher")
sqlContext.sql("select * from course").show()
sqlContext.sql("select * from score").show()
sqlContext.sql("select * from student").show()
sqlContext.sql("select * from teacher").show()

image-20200510214229928

# 查询每门课程选修人数
sqlContext.sql("select score.cno,count(degree) from score group by cno").show()

image-20200510214306486

# 查询所有学生课程编号为1-101所有的成绩,并以降序排列
sqlContext.sql("select score.cno,cname,degree from score,course where score.cno=course.cno and score.cno='1-101' order by degree DESC").show()

image-20200510214338516

#查询平均成绩前3的课程
sqlContext.sql("select cno,avg(degree) from score  group by cno order by avg(degree) DESC").show(3)

image-20200510214442839

# 查询4003号同学的各科成绩
sqlContext.sql("select sname,cname,degree from student,course,score where student.sno=score.sno and score.cno=course.cno and student.sno='4003'").show()

image-20200510214519151

#查询男女同学的平均成绩
sqlContext.sql("select ssex,avg(degree) from student,score where score.sno=student.sno group by ssex").show()

image-20200510214556776

#查询每位教师所代课程名称
sqlContext.sql("select teacher.tname,course.cname from teacher inner join course on teacher.tno=course.tno").show()
sqlContext.sql("select teacher.tname,course.cname from teacher left join course on teacher.tno=course.tno").show()

image-20200510214628568

#查询编号为3-103课程的所有男同学名称及成绩
sqlContext.sql("select score.degree,student.sname from student,score where score.cno='3-103'and student.sno=score.sno and student.ssex='male'").show()

image-20200510214650553

# 查询成绩在大于等于80分的学生名称、课程名称及分数
sqlContext.sql("select student.sname,course.cname,score.degree from student,score,course where score.degree>=80 and student.sno=score.sno and score.cno=course.cno").show()

image-20200510214718528

原创文章 20 获赞 14 访问量 2159

猜你喜欢

转载自blog.csdn.net/qq_42166929/article/details/106043059