Table of contents
0. Learning objectives of this lecture
1. Use Spark SQL to implement word frequency statistics
   1. Prepare data files
   2. Create a Maven project
   3. Modify the source program directory
   4. Add dependencies and set the source program directory
   5. Create a log properties file
   6. Create the HDFS configuration file
   7. Create a word frequency statistics singleton object
   8. Start the program and view the results
   9. Word frequency statistical data conversion flow chart
2. Use Spark SQL to calculate the total score and average score
   1. Prepare data files
   2. Create a new Maven project
   3. Modify the source program directory
   4. Add relevant dependencies and set the source program directory
   5. Create a log properties file
   6. Create the HDFS configuration file
   7. Create a singleton object for calculating the total score and average score
   8. Run the program and view the results
3. Use Spark SQL to implement group leaderboards
   1. Prepare data files
   2. Create a new Maven project
   3. Modify the source program directory
   4. Add relevant dependencies and set the source program directory
   5. Create a log properties file
   6. Create the HDFS configuration file
   7. Create a group leaderboard singleton object
   8. Run the program and view the results
4. Use Spark SQL to count daily new users
   1. Prepare data files
   2. Create a new Maven project
   3. Modify the source program directory
   4. Add relevant dependencies and set the source program directory
   5. Create a log properties file
   6. Create the HDFS configuration file
   7. Create the singleton object for counting new users
   8. Run the program and view the results
   9. Run code in Spark Shell
0. Learning objectives of this lecture
- Use Spark SQL to implement word frequency statistics
- Use Spark SQL to calculate the total score and average score
- Use Spark SQL to implement group leaderboards
- Use Spark SQL to count daily new users
1. Use Spark SQL to implement word frequency statistics
(1) Propose a task
- Word frequency statistics (word count) is the classic introductory program for learning distributed computing. There are many ways to implement it, such as MapReduce; the RDD operators provided by Spark make it even easier. This task requires implementing word frequency statistics with Spark SQL.
- word file
hello scala world
hello spark world
scala is very concise
spark is very powerful
let us learn scala and spark
we can learn them well
- Word frequency statistics
(2) Complete the task
1. Prepare data files
- Create words.txt in the /home directory
- Upload the word file to the specified directory of HDFS
2. Create a Maven project
- Create Maven project - SparkSQLWordCount
- Click the [Finish] button
3. Modify the source program directory
- Change the java directory to a scala directory
4. Add dependencies and set source program directory
- Add dependencies and set the source program directory in the pom.xml file
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>net.huawei.sql</groupId>
<artifactId>SparkSQLWordCount</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.12.15</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.1.3</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.1.3</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
</build>
</project>
5. Create a log properties file
- Create log4j.properties in the resources directory
log4j.rootLogger=ERROR, stdout, logfile
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spark.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
6. Create HDFS configuration file
- Create hdfs-site.xml in the resources directory
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<description>only config in clients</description>
<name>dfs.client.use.datanode.hostname</name>
<value>true</value>
</property>
</configuration>
7. Create a word frequency statistics singleton object
- Create the net.huawei.sql package and create a WordCount singleton object in it
package net.huawei.sql

import org.apache.spark.sql.{Dataset, SparkSession}

/**
 * Function: implement word frequency statistics with Spark SQL
 * Author: 华卫
 * Date: May 25, 2023
 */
object WordCount {
  def main(args: Array[String]): Unit = {
    // Create or get the SparkSession
    val spark = SparkSession.builder()
      .appName("SparkSQLWordCount")
      .master("local[*]")
      .getOrCreate()
    // Read the word file from HDFS
    val lines: Dataset[String] = spark.read.textFile("hdfs://master:9000/wordcount/input/words.txt")
    // Show the contents of the lines dataset
    lines.show()
    // Import the implicit conversions of the Spark session object
    import spark.implicits._
    // Split each line on spaces and flatten the result into words
    val words: Dataset[String] = lines.flatMap(_.split(" "))
    // Show the contents of the words dataset
    words.show()
    // Rename the dataset's default column "value" to "word" and convert to a DataFrame
    val df = words.withColumnRenamed("value", "word").toDF()
    // Show the contents of the DataFrame
    df.show()
    // Create a temporary view based on the DataFrame
    df.createTempView("v_words")
    // Execute a SQL group-by query to compute word frequencies
    val wc = spark.sql(
      """
        | select word, count(*) as count
        | from v_words group by word
        | order by count desc
        |""".stripMargin)
    // Show the word frequency result
    wc.show()
    // Close the session
    spark.close()
  }
}
8. Start the program and view the results
- Run the WordCount singleton object
9. Word frequency statistical data conversion flow chart
- The text file is converted into a dataset, then into a DataFrame, and finally the result DataFrame is obtained by querying the temporary view
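The same transformation chain can be sketched with plain Scala collections, with no Spark session needed; the data below is the words.txt content from above:

```scala
// Sample data: the contents of words.txt
val lines = List(
  "hello scala world",
  "hello spark world",
  "scala is very concise",
  "spark is very powerful",
  "let us learn scala and spark",
  "we can learn them well"
)
// flatMap(_.split(" ")) mirrors the Dataset step in the program above
val words = lines.flatMap(_.split(" "))
// groupBy + size mirrors "group by word / count(*)"
val wordCounts = words.groupBy(identity).map { case (w, ws) => (w, ws.size) }
// sortBy(-count) mirrors "order by count desc"
val ranked = wordCounts.toList.sortBy(-_._2)
ranked.foreach(println)
```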
2. Use Spark SQL to calculate the total score and average score
(1) Propose a task
- Given score tables for multiple subjects, such as python.txt, spark.txt, and django.txt, calculate each student's total score and average score across the three subjects
- Python score table - python.txt
1 张三丰 89
2 李孟达 95
3 唐雨涵 92
4 王晓云 93
5 张晓琳 88
6 佟湘玉 88
7 杨文达 66
8 陈燕文 98
- Spark score table - spark.txt
1 张三丰 67
2 李孟达 78
3 唐雨涵 89
4 王晓云 75
5 张晓琳 93
6 佟湘玉 70
7 杨文达 87
8 陈燕文 90
- Django score table - django.txt
1 张三丰 88
2 李孟达 93
3 唐雨涵 97
4 王晓云 87
5 张晓琳 79
6 佟湘玉 89
7 杨文达 93
8 陈燕文 95
- Expected output
1 张三丰 244 81.33
2 李孟达 266 88.67
3 唐雨涵 278 92.67
4 王晓云 255 85.00
5 张晓琳 260 86.67
6 佟湘玉 247 82.33
7 杨文达 246 82.00
8 陈燕文 283 94.33
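The grouping and aggregation that the SQL below performs can be sketched in plain Scala for a single illustrative student (张三丰, using his three scores from the tables above); the rounding step mirrors cast(avg(score) as decimal(5, 2)):

```scala
// Grade record: id, name, score (same shape as the program's case class)
case class Grade(id: Int, name: String, score: Int)
// 张三丰's scores from python.txt, spark.txt, and django.txt
val grades = List(
  Grade(1, "张三丰", 89),
  Grade(1, "张三丰", 67),
  Grade(1, "张三丰", 88)
)
// group by name, then compute sum(score) and avg(score) as in the SQL
val result = grades.groupBy(_.name).map { case (name, gs) =>
  val sum = gs.map(_.score).sum
  // cast(avg(score) as decimal(5, 2)) -> round the average to two decimals
  val avg = (BigDecimal(sum) / gs.size).setScale(2, BigDecimal.RoundingMode.HALF_UP)
  (gs.head.id, name, sum, avg)
}
result.foreach(println) // total 244, average 81.33
```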
(2) Complete the task
1. Prepare data files
- Create the three grade files on the master virtual machine
- View the contents of the three grade files
- Upload the three grade files to the /calsumavg/input directory of HDFS
2. Create a new Maven project
- Set the project information (project name, save location, group ID, artifact ID)
- Click the [Finish] button
3. Modify the source program directory
- Change the java directory to a scala directory
4. Add relevant dependencies and set the source program directory
- Add dependencies and set the source program directory in the pom.xml file
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>net.huawei.sql</groupId>
<artifactId>CalculateSumAvg</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.12.15</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.1.3</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.1.3</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
</build>
</project>
5. Create a log properties file
- Create log4j.properties in the resources directory
log4j.rootLogger=ERROR, stdout, logfile
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spark.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
6. Create HDFS configuration file
- Create hdfs-site.xml in the resources directory
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<description>only config in clients</description>
<name>dfs.client.use.datanode.hostname</name>
<value>true</value>
</property>
</configuration>
7. Create a singleton object for calculating the total score and average score
- Create a CalculateSumAverage singleton object in the net.huawei.sql package
package net.huawei.sql

import org.apache.spark.sql.{Dataset, SparkSession}

/**
 * Function: calculate total and average scores with Spark SQL
 * Author: 华卫
 * Date: May 25, 2023
 */
object CalculateSumAverage {
  def main(args: Array[String]): Unit = {
    // Create or get the Spark session object
    val spark = SparkSession.builder()
      .appName("CalculateSumAverage")
      .master("local[*]")
      .getOrCreate()
    // Read the grade files from HDFS
    val lines: Dataset[String] = spark.read.textFile("hdfs://master:9000/calsumavg/input")
    // Import implicit conversions
    import spark.implicits._
    // Create the grade dataset
    val gradeDS: Dataset[Grade] = lines.map(
      line => {
        val fields = line.split(" ")
        val id = fields(0).toInt
        val name = fields(1)
        val score = fields(2).toInt
        Grade(id, name, score)
      })
    // Convert the dataset to a DataFrame
    val df = gradeDS.toDF()
    // Create a temporary view based on the DataFrame
    df.createOrReplaceTempView("t_grade")
    // Query the temporary view to compute the total and average scores
    val result = spark.sql(
      """
        |select first(id) as id, name, sum(score) as sum,
        |       cast(avg(score) as decimal(5, 2)) as average
        |  from t_grade
        |  group by name
        |  order by id
        |""".stripMargin
    )
    // Print the totals and averages in the specified format
    result.collect.foreach(row => println(row(0) + " " + row(1) + " " + row(2) + " " + row(3)))
    // Close the Spark session
    spark.close()
  }

  // Define the grade case class
  case class Grade(id: Int, name: String, score: Int)
}
8. Run the program and view the results
- Run the CalculateSumAverage singleton object
3. Use Spark SQL to implement group leaderboards
(1) Propose a task
- Finding the top N within groups is a common requirement in big data: the data is grouped by one column, each group is sorted by another column, and the top N rows of each group are returned.
- There is a set of student grade data
张三丰 90
李孟达 85
张三丰 87
王晓云 93
李孟达 65
张三丰 76
王晓云 78
李孟达 60
张三丰 94
王晓云 97
李孟达 88
张三丰 80
王晓云 88
李孟达 82
王晓云 98
- The same student has multiple grades. We need to find the 3 highest scores for each student. The expected output:
张三丰:94
张三丰:90
张三丰:87
李孟达:88
李孟达:85
李孟达:82
王晓云:98
王晓云:97
王晓云:93
- Data table: t_grade
- Execute the query
SELECT * FROM t_grade tg
WHERE (SELECT COUNT(*) FROM t_grade
WHERE tg.name = t_grade.name
AND score >= tg.score
) <= 3 ORDER BY name, score DESC;
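The correlated subquery counts, for each row, how many scores in the same student's group are at least as large; rows where that count is at most 3 form the top 3. The same result can be sketched in plain Scala with the grade data shown above:

```scala
// The grade data from above as (name, score) pairs
val grades = List(
  ("张三丰", 90), ("李孟达", 85), ("张三丰", 87), ("王晓云", 93),
  ("李孟达", 65), ("张三丰", 76), ("王晓云", 78), ("李孟达", 60),
  ("张三丰", 94), ("王晓云", 97), ("李孟达", 88), ("张三丰", 80),
  ("王晓云", 88), ("李孟达", 82), ("王晓云", 98)
)
// group by name, sort each group's scores descending, keep the top 3
val top3 = grades.groupBy(_._1).map { case (name, rows) =>
  name -> rows.map(_._2).sorted(Ordering[Int].reverse).take(3)
}
// print in the "name: score" format of the expected output
top3.toList.sortBy(_._1).foreach { case (name, scores) =>
  scores.foreach(s => println(s"$name: $s"))
}
```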
(2) Knowledge points involved
1. Datasets and DataFrames
- See the blog post "Spark Big Data Processing Lecture Notes 4.1: Spark SQL Overview, DataFrames and Datasets"
2. Window functions
(1) Overview of window functions
- Window functions were introduced into Spark SQL and DataFrame after Spark 1.5.x. The most commonly used is row_number(), which partitions the records by one field of the table and sorts each partition by another field; following that sort order, it assigns a sequence number to each record within its partition, restarting from 1 for each partition. This makes it easy to compute a grouped top N. Its general form is:
ROW_NUMBER() OVER (PARTITION BY field1 ORDER BY field2 DESC) rank
- SQL statement for grouping to find the top 3
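A plain-Scala illustration of the numbering row_number() produces, on a small hypothetical sample: within each partition (name), rows are sorted by score descending and numbered starting from 1.

```scala
// Hypothetical sample rows: (name, score)
val rows = List(("张三丰", 90), ("张三丰", 87), ("李孟达", 85), ("李孟达", 65))
// PARTITION BY name: group the rows; ORDER BY score DESC: sort within each group;
// zipWithIndex supplies the per-partition sequence number (rank = index + 1)
val ranked = rows.groupBy(_._1).toList.flatMap { case (name, group) =>
  group.sortBy(-_._2).zipWithIndex.map { case ((_, score), i) =>
    (name, score, i + 1)
  }
}
ranked.sorted.foreach(println)
```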
(3) Complete the task
1. Prepare data files
- Create grades.txt in the /home directory
- Upload grades.txt to the /topn/input directory of HDFS
2. Create a new Maven project
- Set the project information (project name, save location, group ID, artifact ID)
- Click the [Finish] button
3. Modify the source program directory
- Change the java directory to a scala directory
4. Add relevant dependencies and set the source program directory
- Add dependencies and set the source program directory in the pom.xml file
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>net.huawei.sql</groupId>
<artifactId>SparkSQLGradeTopN</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.12.15</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.1.3</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.1.3</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
</build>
</project>
5. Create a log properties file
- Create log4j.properties in the resources directory
log4j.rootLogger=ERROR, stdout, logfile
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spark.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
6. Create HDFS configuration file
- Create hdfs-site.xml in the resources directory
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<description>only config in clients</description>
<name>dfs.client.use.datanode.hostname</name>
<value>true</value>
</property>
</configuration>
7. Create a group leaderboard singleton object
- Create a GradeTopNBySQL singleton object in the net.huawei.sql package
package net.huawei.sql

import org.apache.spark.sql.{Dataset, SparkSession}

/**
 * Function: implement a group leaderboard with Spark SQL
 * Author: 华卫
 * Date: June 15, 2022
 */
object GradeTopNBySQL {
  def main(args: Array[String]): Unit = {
    // Create or get the Spark session object
    val spark = SparkSession.builder()
      .appName("GradeTopNBySQL")
      .master("local[*]")
      .getOrCreate()
    // Read the grade file from HDFS (the path matches the /topn/input upload above)
    val lines: Dataset[String] = spark.read.textFile("hdfs://master:9000/topn/input/grades.txt")
    // Import implicit conversions
    import spark.implicits._
    // Create the grade dataset
    val gradeDS: Dataset[Grade] = lines.map(
      line => {
        val fields = line.split(" ")
        val name = fields(0)
        val score = fields(1).toInt
        Grade(name, score)
      })
    // Convert the dataset to a DataFrame
    val df = gradeDS.toDF()
    // Create a temporary view based on the DataFrame
    df.createOrReplaceTempView("t_grade")
    // Query the temporary view to build the group leaderboard
    val top3 = spark.sql(
      """
        |SELECT name, score FROM
        |  (SELECT name, score, row_number() OVER (PARTITION BY name ORDER BY score DESC) rank FROM t_grade) t
        |  WHERE t.rank <= 3
        |""".stripMargin
    )
    // Collect to the driver and print the leaderboard in the specified format
    top3.collect.foreach(row => println(row(0) + ": " + row(1)))
    // Close the Spark session
    spark.close()
  }

  // Define the grade case class
  case class Grade(name: String, score: Int)
}
8. Run the program and view the results
- Run the GradeTopNBySQL singleton object
4. Use Spark SQL to count daily new users
(1) Propose a task
- The following user access history data is given: the first column is the date the user visited the website, and the second column is the user name.
2023-05-01,mike
2023-05-01,alice
2023-05-01,brown
2023-05-02,mike
2023-05-02,alice
2023-05-02,green
2023-05-03,alice
2023-05-03,smith
2023-05-03,brian
2023-05-01 | mike | alice | brown |
2023-05-02 | mike | alice | green |
2023-05-03 | alice | smith | brian |
- We now need to count the number of new users added each day from the above data. The expected statistical result:
2023-05-01新增用户数:3
2023-05-02新增用户数:1
2023-05-03新增用户数:2
- That is, 3 new users appeared on 2023-05-01 (mike, alice, and brown), 1 new user on 2023-05-02 (green), and 2 new users on 2023-05-03 (smith and brian).
(2) Implementation ideas
- Use the inverted index method: treat the user name as a keyword and the access date as a document ID; the mapping between user names and access dates is shown in the figure below.
| | 2023-05-01 | 2023-05-02 | 2023-05-03 |
|---|---|---|---|
| mike | √ | √ | |
| alice | √ | √ | √ |
| brown | √ | | |
| green | | √ | |
| smith | | | √ |
| brian | | | √ |
- If a user corresponds to multiple visit dates, the earliest date is the user's registration date (the date on which they count as new); the other dates are repeat visits and must not be counted. So only each user's minimum visit date matters. As shown below, move each user's minimum visit date into column one; only column one is valid data, and counting how many times each date appears in column one gives the number of new users on that date.
| | column one | column two | column three |
|---|---|---|---|
| mike | 2023-05-01 | 2023-05-02 | |
| alice | 2023-05-01 | 2023-05-02 | 2023-05-03 |
| brown | 2023-05-01 | | |
| green | 2023-05-02 | | |
| smith | 2023-05-03 | | |
| brian | 2023-05-03 | | |
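The two steps above (minimum date per user, then counting those dates) can be sketched in plain Scala with the sample data:

```scala
// The access history as (date, name) pairs
val visits = List(
  ("2023-05-01", "mike"), ("2023-05-01", "alice"), ("2023-05-01", "brown"),
  ("2023-05-02", "mike"), ("2023-05-02", "alice"), ("2023-05-02", "green"),
  ("2023-05-03", "alice"), ("2023-05-03", "smith"), ("2023-05-03", "brian")
)
// Step 1: min(date) per user — the date on which the user counts as new
val firstVisit = visits.groupBy(_._2).map { case (name, vs) =>
  name -> vs.map(_._1).min
}
// Step 2: count how many users share each first-visit date
val newUsers = firstVisit.values.groupBy(identity).map { case (d, ds) => (d, ds.size) }
newUsers.toList.sorted.foreach { case (d, n) => println(s"$d new users: $n") }
```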
(3) Complete the task
1. Prepare data files
- Create users.txt in the /home directory
- Create the /newusers/input directory on HDFS first, then upload the user file to that directory
2. Create a new Maven project
- Set the project information (project name, save location, group ID, artifact ID)
- Click the [Finish] button
3. Modify the source program directory
- Change the java directory to a scala directory
4. Add relevant dependencies and set the source program directory
- Add dependencies and set the source program directory in the pom.xml file
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>net.huawei.sql</groupId>
<artifactId>SparkSQLCountNewUsers</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.12.15</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.1.3</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.1.3</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
</build>
</project>
5. Create a log properties file
- Create log4j.properties in the resources directory
log4j.rootLogger=ERROR, stdout, logfile
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spark.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
6. Create HDFS configuration file
- Create hdfs-site.xml in the resources directory
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<description>only config in clients</description>
<name>dfs.client.use.datanode.hostname</name>
<value>true</value>
</property>
</configuration>
7. Create the singleton object for counting new users
- Create a CountNewUsers singleton object in the net.huawei.sql package
package net.huawei.sql

import org.apache.spark.sql.{Dataset, SparkSession}

/**
 * Function: count daily new users with Spark SQL
 * Author: 华卫
 * Date: May 25, 2023
 */
object CountNewUsers {
  def main(args: Array[String]): Unit = {
    // Create or get the Spark session object
    val spark = SparkSession.builder()
      .appName("CountNewUsers")
      .master("local[*]")
      .getOrCreate()
    // Read the user file from HDFS
    val ds: Dataset[String] = spark.read.textFile("hdfs://master:9000/newusers/input/users.txt")
    // Import implicit conversions
    import spark.implicits._
    // Create the user dataset
    val userDS: Dataset[User] = ds.map(
      line => {
        val fields = line.split(",")
        val date = fields(0)
        val name = fields(1)
        User(date, name)
      }
    )
    // Convert the dataset to a DataFrame
    val df = userDS.toDF()
    // Create a temporary view
    df.createOrReplaceTempView("t_user")
    // Count daily new users
    val result = spark.sql(
      """
        |select date, count(name) as count from
        |  (select min(date) as date, name from t_user group by name)
        |group by date
        |""".stripMargin
    )
    // Print the statistics
    result.foreach(row => println(row(0) + "新增用户数:" + row(1)))
    // Close the Spark session
    spark.close()
  }

  // Define the User case class
  case class User(date: String, name: String)
}
8. Run the program and view the results
- Run the CountNewUsers singleton object
- Strangely, no statistics are displayed. A likely cause: result.foreach executes its println on the executors rather than the driver, so the output may never reach the driver console; collecting first, as in result.collect.foreach(...), prints on the driver instead.
9. Run code in Spark Shell
- Execute the following code to view the result