SparkSQL case study: book information analysis
- Lab environment
- Dataset introduction
- Problem description
- Experimental procedure
  - 0. Preparations
  - 1. Observe the data
  - 2. Register DataFrame as View
  - 3. Search for books whose title contains "program"
  - 4. Search for books with a rating greater than 9
  - 5. Count the number of books by each publisher
  - 6. Visualize the top 10 publishers by number of books
  - 7. Calculate the average rating of books published by various publishers
- Related resources
SparkSQL is the Spark component for processing structured data. This article uses SparkSQL to compute statistics on book information. The sample code is shown in both DSL and SQL styles, and the results are visualized with Matplotlib and Pandas.
A network-drive link to the data and source code is provided at the end of the article.
Lab environment
- Ubuntu 18.04
- PySpark 2.4.7
- JupyterLab
- Anaconda3
For setting up the experimental environment, see: https://blog.csdn.net/tangyi2008/article/details/123109198
For using JupyterLab, see: https://blog.csdn.net/tangyi2008/article/details/123761210
Dataset introduction
- Data file: book.txt
- Data fragment:
序号,书名,评分,价格,出版社,url
5173,動力取向精神醫學--臨床應用與實務,10.0 ,1200元,心灵工坊,https://book.douban.com/subject/6053667/
9929,水彩绘森活,10.0 ,29.8,人民邮电出版社,https://book.douban.com/subject/26115807/
10124,殷周金文集成(修订增补本共8册)(精),10.0 ,2400.00元,中华书局,https://book.douban.com/subject/2235855/
16628,纸雕游戏大书,10.0 ,99.00元,重庆出版集团,https://book.douban.com/subject/26673804/
19103,Michelangelo,10.0 ,$200.00 ,Taschen,https://book.douban.com/subject/2342660/
20063,一支笔的快乐涂鸦2,10.0 ,29.8,人民邮电出版社,https://book.douban.com/subject/26280062/
...
- The data contains 6 fields separated by commas (,), so the file is a standard CSV file
- The data contains Chinese characters, so the file encoding must be taken into account when reading it
- The first line of the data holds the field names in Chinese (sequence number, title, rating, price, publisher, url)
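The structure described above can be checked on the fragment itself before any Spark code is written. The following is a minimal sketch using only Python's standard csv module; the SAMPLE string holds the header and the first two records of the fragment, and the helper name parse_books is made up for illustration.

```python
import csv
import io

# Header plus the first two records of the data fragment shown above.
SAMPLE = (
    "序号,书名,评分,价格,出版社,url\n"
    "5173,動力取向精神醫學--臨床應用與實務,10.0 ,1200元,心灵工坊,"
    "https://book.douban.com/subject/6053667/\n"
    "9929,水彩绘森活,10.0 ,29.8,人民邮电出版社,"
    "https://book.douban.com/subject/26115807/\n"
)


def parse_books(text):
    """Split CSV text into (header, rows); parse_books is a hypothetical helper."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)   # the first line holds the Chinese field names
    rows = list(reader)     # every remaining line is one book
    return header, rows


header, rows = parse_books(SAMPLE)
print(len(header), header)
for row in rows:
    # The rating column carries a trailing space in the file; float() ignores it.
    print(len(row), row[1], float(row[2]))
```

For the real file, the same reader could be fed from open(path, encoding='utf-8', newline=''), which is where the encoding consideration comes in.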
Problem description
- Search for books whose title contains "program"
  - Count the matching books
- Search for books with a rating greater than 9
- Count the number of books by each publisher
  - Sort the result from largest to smallest
- Visualize the top 10 publishers by number of books
- Compute the average rating of the books of each publisher
  - Only publishers with more than 200 books in the dataset are counted
  - Sort by average rating from high to low
  - Visualize the top 10 by average rating
Experimental procedure
0. Preparations
- Create a directory named book in the home directory:
mkdir ~/book
- Upload the data file to the ~/book directory
- Use the following shell command to check the file encoding:
file ~/book/book.txt
The output shows that the file is encoded in UTF-8:
book.txt: UTF-8 Unicode text, with CRLF line terminators
You can also open the file with vim
vim ~/book/book.txt
and enter :set fileencoding in command-line mode to view the encoding
- Start JupyterLab:
cd ~/book
jupyter lab
Then create a new Notebook in JupyterLab
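The encoding check from the preparation steps can also be approximated from a notebook cell. The sketch below simply tries to decode the bytes with a list of candidate encodings; the function name detect_encoding and the candidate list (UTF-8 first, then GB18030, which is common for Chinese text) are assumptions for illustration, and a successful decode is a hint, not proof, of the real encoding.

```python
def detect_encoding(data, candidates=("utf-8", "gb18030")):
    """Return the first candidate encoding that decodes data without error."""
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue          # not this one, try the next candidate
        return encoding
    return None               # none of the candidates fit


# Stand-in bytes; for the real file use open(path, 'rb').read() instead.
utf8_bytes = "序号,书名,评分\r\n".encode("utf-8")
gbk_bytes = "序号".encode("gbk")

print(detect_encoding(utf8_bytes))
print(detect_encoding(gbk_bytes))
```

UTF-8 is tried first because it is strict: bytes written in another Chinese encoding rarely form valid UTF-8, whereas GB18030 accepts many byte sequences.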
1. Observe the data
1) Create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('sparksql-book').getOrCreate()
2) Read the file and observe the data
As the file fragment above shows, the file can be read with the csv format of the DataFrameReader. Because the field names in the file are all in Chinese, which is inconvenient when writing code, an explicit schema with English column names is specified when the data is loaded.
- A schema is a StructType made up of fields called StructFields. Each StructField has a name, a data type, a Boolean flag that specifies whether the column may contain missing or null values, and optional metadata associated with the column.
from pyspark.sql.types import StructType, StructField, StringType, FloatType
schema = StructType([
    StructField('id', StringType(), False),
    StructField('name', StringType(), True),
    StructField('rate', FloatType(), True),
    StructField('price', StringType(), True),
    StructField('publish', StringType(), True),
    StructField('url', StringType(), True),
])
books = spark.read.csv('file:///home/xiaobai/book/book.txt', header=True, schema=schema)
books.show()
+-----+-----------------------------------+----+---------+------------------+--------------------+
| id| name|rate| price| publish| url|
+-----+-----------------------------------+----+---------+------------------+--------------------+
| 5173| 動力取向精神醫學--臨床應用與實務|10.0| 1200元| 心灵工坊|https://book.doub...|
| 9929| 水彩绘森活|10.0| 29.8| 人民邮电出版社|https://book.doub...|
|10124| 殷周金文集成(修订增补本共8册)(精)|10.0|2400.00元| 中华书局|https://book.doub...|
|16628| 纸雕游戏大书|10.0| 99.00元| 重庆出版集团|https://book.doub...|
|19103| Michelangelo|10.0| $200.00 | Taschen|https://book.doub...|
|20063| 一支笔的快乐涂鸦2|10.0| 29.8| 人民邮电出版社|https://book.doub...|
|32781| 亲亲宝贝装|10.0| 28.00元|江西科学技术出版社|https://book.doub...|
|32879| Photoshop7解像|10.0| 68.00元| 海洋出版社|https://book.doub...|
|45687| 戚蓼生序本石头记|10.0| 350.00元| 人民文学出版社|https://book.doub...|
|52504| 宇宙兄弟(7)|10.0| JPY580| 講談社|https://book.doub...|
|52505| 宇宙兄弟(8)|10.0| JPY580| 講談社|https://book.doub...|
| 573| TCP\IP详解(卷1英文版)| 9.9| 45| 机械工业出版社|https://book.doub...|
| 589|计算机程序设计艺术卷1:基本算法(...| 9.9| 119.00元| 人民邮电出版社|https://book.doub...|
| 5522| 微积分和数学分析引论-第1卷| 9.9| 79.00元| 世界图书出版公司|https://book.doub...|
| 5547| PrinciplesofNeura...| 9.9| $103.41 |McGraw-HillMedical|https://book.doub...|
| 7443| 奈特人体神经解剖彩色图谱| 9.9| 138.00元| 人民卫生出版社|https://book.doub...|
| 8703| 数学、科学和认识论| 9.9| 32.00元| 商务印书馆|https://book.doub...|
| 9924| 零基础学素描| 9.9| 20元| 人民邮电出版社|https://book.doub...|
| 9926| 黑白花意3:300例超写实的花之绘| 9.9| 29.80元| 人民邮电出版社|https://book.doub...|
| 9927| 黑白画意:经典植物手绘教程| 9.9| 29.80元| 人民邮电出版社|https://book.doub...|
+-----+-----------------------------------+----+---------+------------------+--------------------+
only showing top 20 rows
- Check the type of books:
type(books)
- View the first 5 rows, in the same way as with an RDD:
books.take(5)
- View the schema of books:
books.schema
2. Register DataFrame as View
SparkSQL provides two modes of operation
- DSL
- SQL
This case will be demonstrated in two ways. In order to use the SQL method, the corresponding DataFrame needs to be registered as a View first.
books.createOrReplaceTempView('books')
registerTempTable is the obsolete way to do this; it has been deprecated since Spark 2.0.
DataFrame provides four methods for registering a view:
df.createTempView
df.createOrReplaceTempView
df.createGlobalTempView
df.createOrReplaceGlobalTempView
The differences among these four methods:
- By scope, they fall into two kinds: with Global and without. A view created without Global is visible only in the SparkSession that created it. A view created with Global is shared by all sessions of the current Spark application and is referenced through the global_temp database, for example global_temp.books.
- By creation behavior, they fall into two kinds: with Replace and without. If the target view already exists, a method with Replace overwrites it, whereas a method without Replace raises an error.
3. Search for books whose title contains "program"
- DSL
View the books whose title contains "程序" (program):
books.filter('name LIKE "%程序%"').show()
+----+------------------------------------+----+--------+--------------+--------------------+
| id| name|rate| price| publish| url|
+----+------------------------------------+----+--------+--------------+--------------------+
| 589| 计算机程序设计艺术卷1:基本算法(...| 9.9|119.00元|人民邮电出版社|https://book.doub...|
| 343| 计算机程序设计艺术(第3卷)| 9.8| 98.00元|国防工业出版社|https://book.doub...|
| 173|计算机程序设计艺术第2卷半数值算法...| 9.7| 83|清华大学出版社|https://book.doub...|
| 198| C程序设计语言| 9.7| 23.00元|清华大学出版社|https://book.doub...|
| 634| C程序设计语言| 9.7| 35.00元|机械工业出版社|https://book.doub...|
| 342| 计算机程序设计艺术| 9.6| 45.00元|机械工业出版社|https://book.doub...|
| 10| 计算机程序的构造和解释| 9.5| 45.00元|机械工业出版社|https://book.doub...|
| 328| C++程序设计语言(特别版)(英文...| 9.5| 55|高等教育出版社|https://book.doub...|
| 344| 计算机程序设计艺术(第1卷)| 9.5| 98.00元|国防工业出版社|https://book.doub...|
| 25| C程序设计语言| 9.4| 30.00元|机械工业出版社|https://book.doub...|
| 175| 计算机程序设计艺术(第2卷)| 9.4| 98.00元|国防工业出版社|https://book.doub...|
| 556| 深入Linux设备驱动程序内核机制| 9.4| 98.00元|电子工业出版社|https://book.doub...|
| 83| 计算机程序设计艺术(第1卷)| 9.3| 80.00元|清华大学出版社|https://book.doub...|
| 282| JavaScript高级程序设计(...| 9.3| 99.00元|人民邮电出版社|https://book.doub...|
| 538| JavaScript高级程序设计:第2版| 9.3| 89.00元|人民邮电出版社|https://book.doub...|
| 307| 程序设计实践| 9.2| 22|机械工业出版社|https://book.doub...|
|1025| Windows程序设计| 9.2|129.00元|北京大学出版社|https://book.doub...|
|1090| 程序设计实践| 9.2| 59.00元|机械工业出版社|https://book.doub...|
|2927| C语言程序设计| 9.2| 79.00元|人民邮电出版社|https://book.doub...|
| 36| 程序设计实践| 9.1| 20.00元|机械工业出版社|https://book.doub...|
+----+------------------------------------+----+--------+--------------+--------------------+
only showing top 20 rows
Count them:
books.filter('name LIKE "%程序%"').count()
104
- SQL
View the books whose title contains "程序":
spark.sql('select * from books where name LIKE "%程序%"').show()
Count them:
spark.sql('select count(*) from books where name LIKE "%程序%"').show()
show(n=20, truncate=True, vertical=False) prints the first 20 rows by default. Its parameters control the number of rows printed, whether long cell values are truncated, and whether each row is printed vertically.
4. Search for books with a rating greater than 9
- DSL
from pyspark.sql.functions import col
books.filter(col('rate') > 9).show()
+-----+-----------------------------------+----+---------+------------------+--------------------+
| id| name|rate| price| publish| url|
+-----+-----------------------------------+----+---------+------------------+--------------------+
| 5173| 動力取向精神醫學--臨床應用與實務|10.0| 1200元| 心灵工坊|https://book.doub...|
| 9929| 水彩绘森活|10.0| 29.8| 人民邮电出版社|https://book.doub...|
|10124| 殷周金文集成(修订增补本共8册)(精)|10.0|2400.00元| 中华书局|https://book.doub...|
|16628| 纸雕游戏大书|10.0| 99.00元| 重庆出版集团|https://book.doub...|
|19103| Michelangelo|10.0| $200.00 | Taschen|https://book.doub...|
|20063| 一支笔的快乐涂鸦2|10.0| 29.8| 人民邮电出版社|https://book.doub...|
|32781| 亲亲宝贝装|10.0| 28.00元|江西科学技术出版社|https://book.doub...|
|32879| Photoshop7解像|10.0| 68.00元| 海洋出版社|https://book.doub...|
|45687| 戚蓼生序本石头记|10.0| 350.00元| 人民文学出版社|https://book.doub...|
|52504| 宇宙兄弟(7)|10.0| JPY580| 講談社|https://book.doub...|
|52505| 宇宙兄弟(8)|10.0| JPY580| 講談社|https://book.doub...|
| 573| TCP\IP详解(卷1英文版)| 9.9| 45| 机械工业出版社|https://book.doub...|
| 589|计算机程序设计艺术卷1:基本算法(...| 9.9| 119.00元| 人民邮电出版社|https://book.doub...|
| 5522| 微积分和数学分析引论-第1卷| 9.9| 79.00元| 世界图书出版公司|https://book.doub...|
| 5547| PrinciplesofNeura...| 9.9| $103.41 |McGraw-HillMedical|https://book.doub...|
| 7443| 奈特人体神经解剖彩色图谱| 9.9| 138.00元| 人民卫生出版社|https://book.doub...|
| 8703| 数学、科学和认识论| 9.9| 32.00元| 商务印书馆|https://book.doub...|
| 9924| 零基础学素描| 9.9| 20元| 人民邮电出版社|https://book.doub...|
| 9926| 黑白花意3:300例超写实的花之绘| 9.9| 29.80元| 人民邮电出版社|https://book.doub...|
| 9927| 黑白画意:经典植物手绘教程| 9.9| 29.80元| 人民邮电出版社|https://book.doub...|
+-----+-----------------------------------+----+---------+------------------+--------------------+
only showing top 20 rows
There are several ways to refer to the rate column:
- col('rate'): requires importing the col function
- books.rate: with this form, check that the column name does not clash with the name of a DataFrame method. For example, if there is a column named count, books.count does not refer to the column, because it is the count method
- books['rate']
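The name clash described above follows from ordinary Python attribute lookup, and it can be reproduced without Spark. The class below is a toy stand-in, not PySpark's implementation: MiniFrame and its string "columns" are made up for illustration, under the assumption that attribute-style column access is served by a fallback such as __getattr__.

```python
class MiniFrame:
    """Toy stand-in for a DataFrame; NOT PySpark's actual implementation."""

    def __init__(self, columns):
        self._columns = list(columns)

    def count(self):
        # An ordinary method, playing the role of DataFrame.count().
        return 0

    def __getitem__(self, name):
        # Bracket access always means "the column called name".
        if name not in self._columns:
            raise KeyError(name)
        return "column<%s>" % name

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal lookup fails, so it is
        # never reached for names that already exist as methods.
        if name in self._columns:
            return self[name]
        raise AttributeError(name)


frame = MiniFrame(["rate", "count"])
print(frame.rate)             # falls through to the column
print(frame["count"])         # bracket access reaches the column
print(callable(frame.count))  # the method shadows the column of the same name
```

This is why the bracket form or col() is the safer choice whenever a column name might coincide with a method name.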
- SQL
spark.sql('select * from books where rate > 9').show()
5. Count the number of books by each publisher
- DSL
books.groupby('publish').count().sort(col('count').desc()).show()
or
books.groupby('publish').count().sort('count', ascending=False).show()
+----------------------+-----+
| publish|count|
+----------------------+-----+
| 人民文学出版社| 1437|
| 上海译文出版社| 1426|
| 中华书局| 1278|
| 东立出版社| 1223|
|生活·读书·新知三联书店| 1105|
| 北京大学出版社| 948|
| 译林出版社| 934|
| 商务印书馆| 917|
| 上海人民出版社| 829|
| 广西师范大学出版社| 726|
| 中国人民大学出版社| 641|
| 人民邮电出版社| 599|
| 上海古籍出版社| 590|
| 南海出版公司| 575|
| 尖端出版社| 557|
| 中信出版社| 537|
| 机械工业出版社| 519|
| 新星出版社| 511|
| 集英社| 465|
| 講談社| 426|
+----------------------+-----+
only showing top 20 rows
- SQL
spark.sql('select publish, count(1) as count from books group by publish order by count desc').show()
6. Visualize the top 10 publishers by number of books
1) Fix the display of Chinese characters
If garbled Chinese characters in the chart do not bother you, or this step seems like too much trouble, skip it.
(1) Download the font SimHei.ttf
On Ubuntu this font may be missing and needs to be downloaded manually.
It can be found through a web search (for example, on Baidu) or downloaded from the network-drive link at the end of the article.
(2) Upload the font to matplotlib's font directory
Use the following code to find the location of matplotlib's configuration file:
import matplotlib
print(matplotlib.matplotlib_fname())
Example output:
'/home/xiaobai/opt/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/matplotlibrc'
Upload the downloaded SimHei.ttf to the fonts/ttf subdirectory that sits next to this file, that is, under the mpl-data directory.
Example directory:
/home/xiaobai/opt/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/fonts/ttf
The actual directory depends on the output of the code on your machine.
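The font directory can also be derived from the configuration file path instead of being typed by hand. This is a minimal sketch, assuming the path points at the matplotlibrc bundled inside mpl-data, as in the example output above; the helper name font_dir_for is made up for illustration.

```python
import posixpath  # POSIX path rules, since the article's environment is Ubuntu


def font_dir_for(matplotlibrc_path):
    """Return the fonts/ttf directory next to a bundled matplotlibrc file."""
    mpl_data = posixpath.dirname(matplotlibrc_path)   # .../mpl-data
    return posixpath.join(mpl_data, "fonts", "ttf")


# Example path copied from the output shown above.
example = ("/home/xiaobai/opt/anaconda3/lib/python3.6/site-packages/"
           "matplotlib/mpl-data/matplotlibrc")
print(font_dir_for(example))
```

In a notebook, the argument would be matplotlib.matplotlib_fname(). If a user-level matplotlibrc exists, that function returns the user file instead, and the derived directory no longer applies.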
(3) Clear the matplotlib cache
rm -rf ~/.cache/matplotlib
2) Visualization
import matplotlib.pyplot as plt
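plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # display the minus sign correctly
df = books.groupby('publish').count().sort('count', ascending=False).toPandas()
# Keep the 10 largest rows and reverse them so that the largest is drawn at the top
df.iloc[9::-1].set_index('publish').plot.barh()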
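A horizontal bar chart draws the first row at the bottom, so the rows are reversed before plotting to put the largest publisher on top. The start index of the reversing slice decides how many rows are kept, which is easy to get off by one. The sketch below shows this with a plain Python list of made-up values standing in for the pandas DataFrame; positional iloc slices follow the same rules.

```python
# Made-up publisher counts, already sorted from largest to smallest.
ranked = [("publisher_%02d" % i, 1000 - 10 * i) for i in range(20)]

top10_reversed = ranked[9::-1]   # positions 9, 8, ..., 0
eleven_rows = ranked[10::-1]     # positions 10, 9, ..., 0

print(len(top10_reversed), len(eleven_rows))
print(top10_reversed[0])    # smallest of the ten, drawn at the bottom
print(top10_reversed[-1])   # largest entry, drawn at the top
```

A start index of 9 keeps exactly ten rows, while a start index of 10 keeps eleven.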
7. Calculate the average rating of books published by various publishers
Specific requirements are as follows:
- Only publishers with more than 200 books in the dataset are counted
- Sort by average rating from high to low
- Visualize the top 10 by average rating
1) Calculate the average score as required
- DSL
from pyspark.sql import functions as F
# Group by publisher, then compute the number of books and the average rating
pub_cnt_rate = books.groupby('publish').agg(
    F.count(F.col('id')).alias('count'),
    F.mean(F.col('rate')).alias('avg_rate'))
# Keep publishers with a count greater than 200 and sort by average rating, descending
top_avg = pub_cnt_rate.filter(F.col('count') > 200).sort('avg_rate', ascending=False)
# Show the result
top_avg.show()
+------------------+-----+-----------------+
| publish|count| avg_rate|
+------------------+-----+-----------------+
| 集英社| 465|9.001505358501147|
|中国少年儿童出版社| 207|8.999033773578883|
| 講談社| 426|8.939671380978794|
| 东立出版社| 1223|8.926819284334597|
| 小学館| 253|8.866007923608713|
| 尖端出版社| 557| 8.78402155562834|
| 台灣角川| 267|8.656554338190887|
| 上海古籍出版社| 590|8.648813579042079|
| 中华书局| 1278|8.647104871478252|
| 世界图书出版公司| 269|8.635687769567213|
| 接力出版社| 219| 8.63196347510978|
| 人民邮电出版社| 599| 8.62036727465851|
| 二十一世纪出版社| 323| 8.60743037539739|
| 時報文化| 215|8.576279125657193|
|中国建筑工业出版社| 209|8.540191401705217|
|北京十月文艺出版社| 230|8.536086980156277|
| 人民文学出版社| 1437|8.530897729498028|
| 河北教育出版社| 385|8.528571470681722|
| 商务印书馆| 917|8.522464596198196|
| 机械工业出版社| 519|8.521194627059907|
+------------------+-----+-----------------+
only showing top 20 rows
- SQL
top_avg = spark.sql('''
    select *
    from (select publish, count(1) as count, avg(rate) as avg_rate
          from books
          group by publish)
    where count > 200
    order by avg_rate desc
''')
top_avg.show()
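The subquery-plus-where form above can also be expressed with a HAVING clause, which filters groups after aggregation. The sketch below illustrates the clause with Python's built-in sqlite3 module purely so that it runs without Spark; the tiny table, the publisher names A, B and C, and the threshold MIN_BOOKS are made up. Spark SQL also supports HAVING, but this sketch has not been run against the article's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id TEXT, publish TEXT, rate REAL)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        ("1", "A", 9.0), ("2", "A", 8.0), ("3", "A", 8.5),
        ("4", "B", 9.5), ("5", "B", 9.5),
        ("6", "C", 10.0),
    ],
)

MIN_BOOKS = 1  # hypothetical threshold; the article uses 200

query = """
    SELECT publish, COUNT(1) AS book_count, AVG(rate) AS avg_rate
    FROM books
    GROUP BY publish
    HAVING COUNT(1) > ?        -- drop publishers with too few books
    ORDER BY avg_rate DESC
"""
result = conn.execute(query, (MIN_BOOKS,)).fetchall()
for row in result:
    print(row)
conn.close()
```

Publisher C has the highest rating but only one book, so it is excluded before the ordering is applied, which is the same effect the count filter has in the article's query.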
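2) Visualize the top 10
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # display the minus sign correctly
df = top_avg.toPandas()
# Keep the 10 highest-rated rows (publish and avg_rate columns), reversed so the highest is drawn at the top
df.iloc[9::-1, [0, 2]].set_index('publish').plot.barh(legend=False, xlim=(8, 10))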
Related resources
Link: https://pan.baidu.com/s/15dm0Y-H1JQE0TcvWC4kL0w?pwd=dvzj
Extraction code: dvzj