Article directory
- 1. Stating the Task
- 2. Implementation Approach
- 3. Preparation
- 4. Completing the Task
  - 1. Completing the Task in Spark Shell
    - (1) Read the score file and generate an RDD
    - (2) Define a 2-tuple grade list
    - (3) Populate the 2-tuple grade list from the RDD
    - (4) Create an RDD from the 2-tuple grade list
    - (5) Reduce rdd by key to get rdd1 and calculate the total score
    - (6) Map rdd1 to rdd2 and calculate the total and average score
  - 2. Completing the Task in IntelliJ IDEA
1. Stating the Task
For the grade table below, calculate each student's total score and average score.
| Name | Chinese | Math | English |
|---|---|---|---|
| Zhang San | 78 | 90 | 76 |
| Li Si | 95 | 88 | 98 |
| Wang Wu | 78 | 80 | 60 |
2. Implementation Approach
Read the grade file to generate lines; define a 2-tuple grade list; traverse lines and fill the 2-tuple grade list; create an RDD from the 2-tuple grade list; reduce rdd by key to get rdd1, which holds each student's total score; map rdd1 to rdd2, which holds the total score and the average score.
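The steps above can be sketched on plain Scala collections before involving Spark at all. The three sample rows mirror the grade table; writing the names as single tokens (the Chinese forms 张三, 李四, 王五 are an assumption here) is needed because the parsing step splits each line on a single space.

```scala
// Sample rows mirroring the grade table (assumed one-token names)
val lines = Seq("张三 78 90 76", "李四 95 88 98", "王五 78 80 60")

// Expand each row into three (name, score) 2-tuples
val pairs = lines.flatMap { line =>
  val f = line.split(" ")
  Seq((f(0), f(1).toInt), (f(0), f(2).toInt), (f(0), f(3).toInt))
}

// "Reduce by key": sum the scores per student
val totals = pairs.groupBy(_._1).map { case (name, ps) => (name, ps.map(_._2).sum) }

// Map each (name, total) to (name, total, average) with a 2-decimal average
val results = totals.map { case (name, total) => (name, total, (total / 3.0).formatted("%.2f")) }

results.foreach(println)
```

Here `groupBy` plus `sum` plays the role that `reduceByKey` plays on the cluster; the RDD version of the same computation is built step by step in the sections below.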
3. Preparation
1. Start the HDFS service
Execute the command: start-dfs.sh
2. Start the Spark service
Execute the command: start-all.sh
3. Create a score file locally
Create a scores.txt file in the /home directory
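A minimal sketch of what scores.txt could contain, built from the grade table. The names are written as single tokens (the Chinese forms 张三/李四/王五 are an assumption) because the later parsing code splits each line on a single space; /tmp is used here so the snippet runs anywhere, while the article itself places the file in /home.

```shell
# Write a sample scores.txt matching the grade table
# (assumed layout: one student per line, four space-separated fields)
cat > /tmp/scores.txt << 'EOF'
张三 78 90 76
李四 95 88 98
王五 78 80 60
EOF
cat /tmp/scores.txt
```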
4. Upload the score file to HDFS
Create the /scores/input directory on HDFS and upload the score file to it
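The upload can be done with the standard HDFS shell; a sketch, assuming the HDFS client is configured against the master:9000 cluster used later.

```shell
# Create the target directory on HDFS and upload the local score file
hdfs dfs -mkdir -p /scores/input
hdfs dfs -put /home/scores.txt /scores/input
# Verify the upload
hdfs dfs -cat /scores/input/scores.txt
```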
4. Completing the Task
1. Completing the Task in Spark Shell
(1) Read the score file and generate an RDD
Execute the command: val lines = sc.textFile("hdfs://master:9000/scores/input/scores.txt")
(2) Define a 2-tuple grade list
Execute the command: import scala.collection.mutable.ListBuffer
Execute the command: val scores = new ListBuffer[(String, Int)]()
(3) Populate the 2-tuple grade list from the RDD
lines.collect.foreach(line => {
  val fields = line.split(" ")
  scores.append((fields(0), fields(1).toInt))
  scores.append((fields(0), fields(2).toInt))
  scores.append((fields(0), fields(3).toInt))
})
scores.foreach(println)
Execute the above code
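Note that collect pulls every line to the driver before the ListBuffer is filled there. A more idiomatic sketch builds the pairs with flatMap; it is demonstrated on a plain Seq (sample rows with assumed one-token names) so it runs without a cluster, but the same flatMap call works directly on the lines RDD and keeps the work on the executors.

```scala
// Expand each row into three (name, score) 2-tuples with flatMap,
// avoiding collect and the mutable ListBuffer
val sampleLines = Seq("张三 78 90 76", "李四 95 88 98", "王五 78 80 60")
val scorePairs = sampleLines.flatMap { line =>
  val f = line.split(" ")
  Seq((f(0), f(1).toInt), (f(0), f(2).toInt), (f(0), f(3).toInt))
}
scorePairs.foreach(println)
```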
(4) Create an RDD from the 2-tuple grade list
Execute the command: val rdd = sc.makeRDD(scores)
(5) Reduce rdd by key to get rdd1 and calculate the total score
Execute the command: val rdd1 = rdd.reduceByKey(_ + _)
(6) Map rdd1 to rdd2 and calculate the total and average score
Execute the command: val rdd2 = rdd1.map(score => (score._1, score._2, (score._2 / 3.0).formatted("%.2f")))
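With the totals implied by the grade table (244, 281 and 218), the mapping step can be checked off-cluster; a sketch applying the same map body to a plain Seq:

```scala
// Apply the same (name, total) -> (name, total, average) mapping to the
// totals implied by the sample grade table
val rdd1Like = Seq(("张三", 244), ("李四", 281), ("王五", 218))
val rdd2Like = rdd1Like.map(score => (score._1, score._2, (score._2 / 3.0).formatted("%.2f")))
rdd2Like.foreach(println)
```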
2. Completing the Task in IntelliJ IDEA
(1) Open the RDD project SparkRDDDemo
(2) Create an object to calculate the total and average scores
Create the day07 subpackage in the net.army.rdd package, then create the CalculateSumAvg object in that subpackage
package net.army.rdd.day07

import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer

/**
 * Author: 梁辰兴
 * Date: 2023/6/6
 * Function: calculate total and average scores
 */
object CalculateSumAvg {
  def main(args: Array[String]): Unit = {
    // Create the Spark configuration object
    val conf = new SparkConf()
      .setAppName("CalculateSumAvg") // set the application name
      .setMaster("local[*]") // set the master (local debugging)
    // Create the Spark context from the configuration object
    val sc = new SparkContext(conf)
    // Read the score file and generate an RDD
    val lines = sc.textFile("hdfs://master:9000/scores/input/scores.txt")
    // Define the 2-tuple grade list
    val scores = new ListBuffer[(String, Int)]()
    // Populate the 2-tuple grade list from the RDD
    lines.collect.foreach(line => {
      val fields = line.split(" ")
      scores.append((fields(0), fields(1).toInt))
      scores.append((fields(0), fields(2).toInt))
      scores.append((fields(0), fields(3).toInt))
    })
    // Create an RDD from the 2-tuple grade list
    val rdd = sc.makeRDD(scores)
    // Reduce rdd by key to get rdd1: the total score per student
    val rdd1 = rdd.reduceByKey(_ + _)
    // Map rdd1 to rdd2: the total score plus the average score
    val rdd2 = rdd1.map(score => (score._1, score._2, (score._2 / 3.0).formatted("%.2f")))
    // Print the contents of rdd2 to the console
    rdd2.collect.foreach(println)
    // Save rdd2 to the specified HDFS location
    rdd2.saveAsTextFile("hdfs://master:9000/scores/output")
    // Stop the Spark context
    sc.stop()
  }
}
(3) Run the program and view the results
Run the CalculateSumAvg program and check the console output
View the HDFS result files
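The saved result can be inspected with the HDFS shell; a sketch, assuming the output path used in the program above:

```shell
# List and print the result files written by saveAsTextFile
hdfs dfs -ls /scores/output
hdfs dfs -cat /scores/output/part-*
```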