SparkSQL Case Study: Book Information Analysis


SparkSQL is a high-level Spark component for processing structured data. This article uses SparkSQL to compute statistics on book information. The sample code is shown in both DSL and SQL styles, and the results are visualized with Matplotlib and Pandas.

Download links for the data and source code are provided at the end of the article.

Lab environment

  • Ubuntu 18.04
  • PySpark 2.4.7
  • JupyterLab
  • Anaconda3

For setting up the lab environment, see: https://blog.csdn.net/tangyi2008/article/details/123109198

For the use of JupyterLab, please refer to: https://blog.csdn.net/tangyi2008/article/details/123761210

Dataset introduction

  • Data file book.txt

    Data fragment:

    序号,书名,评分,价格,出版社,url
    5173,動力取向精神醫學--臨床應用與實務,10.0 ,1200元,心灵工坊,https://book.douban.com/subject/6053667/
    9929,水彩绘森活,10.0 ,29.8,人民邮电出版社,https://book.douban.com/subject/26115807/
    10124,殷周金文集成(修订增补本共8册)(精),10.0 ,2400.00元,中华书局,https://book.douban.com/subject/2235855/
    16628,纸雕游戏大书,10.0 ,99.00元,重庆出版集团,https://book.douban.com/subject/26673804/
    19103,Michelangelo,10.0 ,$200.00 ,Taschen,https://book.douban.com/subject/2342660/
    20063,一支笔的快乐涂鸦2,10.0 ,29.8,人民邮电出版社,https://book.douban.com/subject/26280062/
    ...
    
    • Each record contains 6 fields separated by commas, so the file is a standard CSV file
    • The data contains Chinese characters, so the file encoding must be taken into account when reading it
    • The first line of the data holds the field names, written in Chinese
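
As a quick sanity check of the three points above, the file can be parsed with Python's standard csv module before it is handed to Spark. This is only a minimal sketch and not part of the original source code: the helper name read_books is hypothetical, and the two-line sample (copied from the fragment above) is written to a temporary file so that the snippet runs on its own.

```python
import csv
import os
import tempfile

def read_books(path, encoding="utf-8"):
    """Return (header, rows) parsed from a comma-separated text file."""
    # newline="" lets the csv module handle CRLF line endings itself.
    with open(path, encoding=encoding, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)   # first line: field names (Chinese in book.txt)
        rows = list(reader)
    return header, rows

# Illustrative sample only: the header and one record from the data fragment.
sample = (
    "序号,书名,评分,价格,出版社,url\r\n"
    "9929,水彩绘森活,10.0 ,29.8,人民邮电出版社,https://book.douban.com/subject/26115807/\r\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8", newline="") as tmp:
    tmp.write(sample)

header, rows = read_books(tmp.name)
os.remove(tmp.name)
print(len(header), header)
print(repr(rows[0][2]))  # the raw rating keeps its trailing space
```

With the real data, read_books('book.txt') would be called instead of the temporary file.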

Problem Description

  • Search for books whose title contains "程序" (program)
    • Count them
  • Search for books with a rating greater than 9
  • Count the number of books from each publisher
    • Sort from largest to smallest
  • Visualize the top 10 publishers by number of books
  • Compute the average rating of the books from each publisher
    • Include only publishers with more than 200 books in the dataset
    • Sort by average rating from high to low
    • Visualize the top 10 by average rating

Experimental procedure

0. Preparations

  • Create a directory named book in the home directory

    mkdir ~/book
    
  • Upload the data file to the ~/book directory

  • Use the following shell command to view the file encoding

    file ~/book/book.txt
    

    The output looks like the following, which shows that the file is encoded in UTF-8:

    book.txt: UTF-8 Unicode text, with CRLF line terminators
    

    You can also open the file in vim (vim book.txt) and enter :set fileencoding in command-line mode to view the encoding

  • Start JupyterLab

    cd ~/book
    jupyter lab
    

    Create a new Notebook in JupyterLab
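
Once the notebook is open, the result of the file command can also be double-checked from Python. The following is a rough sketch under stated assumptions: describe_text_file is a hypothetical helper (not from the original code), and it is demonstrated on a small throwaway file rather than on book.txt itself.

```python
import os
import tempfile

def describe_text_file(path):
    """Report whether a file decodes as UTF-8 and whether it uses CRLF line endings."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        raw.decode("utf-8")
        is_utf8 = True
    except UnicodeDecodeError:
        is_utf8 = False
    return {"utf8": is_utf8, "crlf": b"\r\n" in raw}

# Demo on a throwaway file; in the notebook, pass 'book.txt' instead.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write("序号,书名\r\n1,水彩绘森活\r\n".encode("utf-8"))
info = describe_text_file(tmp.name)
os.remove(tmp.name)
print(info)
```

The helper reads the whole file into memory, which is acceptable for a data file of this size.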


1. Inspect the data

1) Create SparkSession

from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('sparksql-book').getOrCreate()

2) Read the file and observe the data

As the file fragment above shows, the file can be read by setting the DataFrameReader format to csv. Because the field names are all in Chinese, which is inconvenient in code, a schema with English column names is specified when loading the data.

  • The schema is a StructType made up of fields called StructFields. Each StructField has a name, a type, and a boolean flag that specifies whether the column may contain missing or null values; optionally, the user can also attach metadata to the column.

from pyspark.sql.types import StructType, StructField, StringType, FloatType
schema = StructType([
    StructField('id', StringType(), False),
    StructField('name', StringType(), True),
    StructField('rate', FloatType(), True),
    StructField('price', StringType(), True),
    StructField('publish', StringType(), True),
    StructField('url', StringType(), True),
])
books = spark.read.csv('file:///home/xiaobai/book/book.txt', header=True, schema=schema)
books.show()
+-----+-----------------------------------+----+---------+------------------+--------------------+
|   id|                               name|rate|    price|           publish|                 url|
+-----+-----------------------------------+----+---------+------------------+--------------------+
| 5173|   動力取向精神醫學--臨床應用與實務|10.0|   1200元|          心灵工坊|https://book.doub...|
| 9929|                         水彩绘森活|10.0|     29.8|    人民邮电出版社|https://book.doub...|
|10124|  殷周金文集成(修订增补本共8册)(精)|10.0|2400.00元|          中华书局|https://book.doub...|
|16628|                       纸雕游戏大书|10.0|  99.00元|      重庆出版集团|https://book.doub...|
|19103|                       Michelangelo|10.0| $200.00 |           Taschen|https://book.doub...|
|20063|                  一支笔的快乐涂鸦2|10.0|     29.8|    人民邮电出版社|https://book.doub...|
|32781|                         亲亲宝贝装|10.0|  28.00元|江西科学技术出版社|https://book.doub...|
|32879|                     Photoshop7解像|10.0|  68.00元|        海洋出版社|https://book.doub...|
|45687|                   戚蓼生序本石头记|10.0| 350.00元|    人民文学出版社|https://book.doub...|
|52504|                      宇宙兄弟(7)|10.0|   JPY580|            講談社|https://book.doub...|
|52505|                      宇宙兄弟(8)|10.0|   JPY580|            講談社|https://book.doub...|
|  573|            TCP\IP详解(卷1英文版)| 9.9|       45|    机械工业出版社|https://book.doub...|
|  589|计算机程序设计艺术卷1:基本算法(...| 9.9| 119.00元|    人民邮电出版社|https://book.doub...|
| 5522|         微积分和数学分析引论-第1卷| 9.9|  79.00元|  世界图书出版公司|https://book.doub...|
| 5547|               PrinciplesofNeura...| 9.9| $103.41 |McGraw-HillMedical|https://book.doub...|
| 7443|           奈特人体神经解剖彩色图谱| 9.9| 138.00元|    人民卫生出版社|https://book.doub...|
| 8703|                 数学、科学和认识论| 9.9|  32.00元|        商务印书馆|https://book.doub...|
| 9924|                       零基础学素描| 9.9|     20元|    人民邮电出版社|https://book.doub...|
| 9926|     黑白花意3:300例超写实的花之绘| 9.9|  29.80元|    人民邮电出版社|https://book.doub...|
| 9927|         黑白画意:经典植物手绘教程| 9.9|  29.80元|    人民邮电出版社|https://book.doub...|
+-----+-----------------------------------+----+---------+------------------+--------------------+
only showing top 20 rows

  • Check the type of books: type(books)
  • View the first 5 rows, as with an RDD: books.take(5)
  • View the schema information of books: books.schema

2. Register the DataFrame as a view

SparkSQL provides two modes of operation

  • DSL
  • SQL

This case study demonstrates both. To use the SQL style, the DataFrame must first be registered as a view.

books.createOrReplaceTempView('books')

The older method registerTempTable has been deprecated since Spark 2.0.

A DataFrame provides four methods for registering a view:

  • df.createTempView

  • df.createOrReplaceTempView

  • df.createGlobalTempView

  • df.createOrReplaceGlobalTempView

The four methods differ along two dimensions:

  • By scope, they fall into two groups: with "Global" and without. A view registered without Global is visible only in the SparkSession that created it, while a global view is available to all sessions in the current Spark application and is accessed through the global_temp database.

  • By creation behavior, they fall into two groups: with "Replace" and without. If the target view already exists, the methods with Replace overwrite it, while the methods without Replace raise an error.

3. Search for books whose title contains "程序" (program)

  • DSL

    Show books whose title contains "程序"

    books.filter('name LIKE "%程序%" ').show()
    
    +----+------------------------------------+----+--------+--------------+--------------------+
    |  id|                                name|rate|   price|       publish|                 url|
    +----+------------------------------------+----+--------+--------------+--------------------+
    | 589| 计算机程序设计艺术卷1:基本算法(...| 9.9|119.00元|人民邮电出版社|https://book.doub...|
    | 343|         计算机程序设计艺术(第3卷)| 9.8| 98.00元|国防工业出版社|https://book.doub...|
    | 173|计算机程序设计艺术第2卷半数值算法...| 9.7|      83|清华大学出版社|https://book.doub...|
    | 198|                       C程序设计语言| 9.7| 23.00元|清华大学出版社|https://book.doub...|
    | 634|                       C程序设计语言| 9.7| 35.00元|机械工业出版社|https://book.doub...|
    | 342|                  计算机程序设计艺术| 9.6| 45.00元|机械工业出版社|https://book.doub...|
    |  10|              计算机程序的构造和解释| 9.5| 45.00元|机械工业出版社|https://book.doub...|
    | 328|  C++程序设计语言(特别版)(英文...| 9.5|      55|高等教育出版社|https://book.doub...|
    | 344|         计算机程序设计艺术(第1卷)| 9.5| 98.00元|国防工业出版社|https://book.doub...|
    |  25|                       C程序设计语言| 9.4| 30.00元|机械工业出版社|https://book.doub...|
    | 175|         计算机程序设计艺术(第2卷)| 9.4| 98.00元|国防工业出版社|https://book.doub...|
    | 556|       深入Linux设备驱动程序内核机制| 9.4| 98.00元|电子工业出版社|https://book.doub...|
    |  83|         计算机程序设计艺术(第1卷)| 9.3| 80.00元|清华大学出版社|https://book.doub...|
    | 282|         JavaScript高级程序设计(...| 9.3| 99.00元|人民邮电出版社|https://book.doub...|
    | 538|        JavaScript高级程序设计:第2版| 9.3| 89.00元|人民邮电出版社|https://book.doub...|
    | 307|                        程序设计实践| 9.2|      22|机械工业出版社|https://book.doub...|
    |1025|                     Windows程序设计| 9.2|129.00元|北京大学出版社|https://book.doub...|
    |1090|                        程序设计实践| 9.2| 59.00元|机械工业出版社|https://book.doub...|
    |2927|                       C语言程序设计| 9.2| 79.00元|人民邮电出版社|https://book.doub...|
    |  36|                        程序设计实践| 9.1| 20.00元|机械工业出版社|https://book.doub...|
    +----+------------------------------------+----+--------+--------------+--------------------+
    only showing top 20 rows
    

    Count them

    books.filter('name LIKE "%程序%" ').count()
    
    104
    
  • SQL

    Show books whose title contains "程序"

    spark.sql('select * from books where name LIKE "%程序%"').show()
    

    Count them

    spark.sql('select count(*) from books where name LIKE "%程序%"').show()
    

show(n=20, truncate=True, vertical=False) prints the first 20 rows by default; its parameters control the number of rows printed, whether long values are truncated, and whether rows are printed vertically.

4. Search for books with a rating greater than 9

  • DSL

    from pyspark.sql.functions import col
    books.filter(col('rate') > 9).show()
    
    +-----+-----------------------------------+----+---------+------------------+--------------------+
    |   id|                               name|rate|    price|           publish|                 url|
    +-----+-----------------------------------+----+---------+------------------+--------------------+
    | 5173|   動力取向精神醫學--臨床應用與實務|10.0|   1200元|          心灵工坊|https://book.doub...|
    | 9929|                         水彩绘森活|10.0|     29.8|    人民邮电出版社|https://book.doub...|
    |10124|  殷周金文集成(修订增补本共8册)(精)|10.0|2400.00元|          中华书局|https://book.doub...|
    |16628|                       纸雕游戏大书|10.0|  99.00元|      重庆出版集团|https://book.doub...|
    |19103|                       Michelangelo|10.0| $200.00 |           Taschen|https://book.doub...|
    |20063|                  一支笔的快乐涂鸦2|10.0|     29.8|    人民邮电出版社|https://book.doub...|
    |32781|                         亲亲宝贝装|10.0|  28.00元|江西科学技术出版社|https://book.doub...|
    |32879|                     Photoshop7解像|10.0|  68.00元|        海洋出版社|https://book.doub...|
    |45687|                   戚蓼生序本石头记|10.0| 350.00元|    人民文学出版社|https://book.doub...|
    |52504|                      宇宙兄弟(7)|10.0|   JPY580|            講談社|https://book.doub...|
    |52505|                      宇宙兄弟(8)|10.0|   JPY580|            講談社|https://book.doub...|
    |  573|            TCP\IP详解(卷1英文版)| 9.9|       45|    机械工业出版社|https://book.doub...|
    |  589|计算机程序设计艺术卷1:基本算法(...| 9.9| 119.00元|    人民邮电出版社|https://book.doub...|
    | 5522|         微积分和数学分析引论-第1卷| 9.9|  79.00元|  世界图书出版公司|https://book.doub...|
    | 5547|               PrinciplesofNeura...| 9.9| $103.41 |McGraw-HillMedical|https://book.doub...|
    | 7443|           奈特人体神经解剖彩色图谱| 9.9| 138.00元|    人民卫生出版社|https://book.doub...|
    | 8703|                 数学、科学和认识论| 9.9|  32.00元|        商务印书馆|https://book.doub...|
    | 9924|                       零基础学素描| 9.9|     20元|    人民邮电出版社|https://book.doub...|
    | 9926|     黑白花意3:300例超写实的花之绘| 9.9|  29.80元|    人民邮电出版社|https://book.doub...|
    | 9927|         黑白画意:经典植物手绘教程| 9.9|  29.80元|    人民邮电出版社|https://book.doub...|
    +-----+-----------------------------------+----+---------+------------------+--------------------+
    only showing top 20 rows
    
    

    There are several ways to refer to the rate column:

    • col('rate')  # requires importing the col function
    • books.rate  # take care that the column name does not clash with a DataFrame method name. For example, if there is a column called count, using books.count to refer to it fails, because books.count is a method
    • books['rate']
  • SQL

    spark.sql('select * from books where rate > 9').show()
    

5. Count the number of books by each publisher

  • DSL

    books.groupby('publish').count().sort(col('count').desc()).show()
    

    or

    books.groupby('publish').count().sort('count', ascending=False).show()
    
    +----------------------+-----+
    |               publish|count|
    +----------------------+-----+
    |        人民文学出版社| 1437|
    |        上海译文出版社| 1426|
    |              中华书局| 1278|
    |            东立出版社| 1223|
    |生活·读书·新知三联书店| 1105|
    |        北京大学出版社|  948|
    |            译林出版社|  934|
    |            商务印书馆|  917|
    |        上海人民出版社|  829|
    |    广西师范大学出版社|  726|
    |    中国人民大学出版社|  641|
    |        人民邮电出版社|  599|
    |        上海古籍出版社|  590|
    |          南海出版公司|  575|
    |            尖端出版社|  557|
    |            中信出版社|  537|
    |        机械工业出版社|  519|
    |            新星出版社|  511|
    |                集英社|  465|
    |                講談社|  426|
    +----------------------+-----+
    only showing top 20 rows
    
  • SQL

    spark.sql('select publish, count(1) as count from books group by publish order by count desc' ).show()
    

6. Visualize the top 10 publishers by number of books

1) Fix the display of Chinese characters

If garbled Chinese characters in the plots do not bother you, or this setup seems like too much trouble, skip this step.

(1) Download the font SimHei.ttf

On Ubuntu the font may be missing, in which case you need to download it yourself.

You can find it with a web search (for example, Baidu) or download it from the network disk link at the end of the article.

(2) Upload the font to the font directory of matplotlib

Use the following code to locate matplotlib's configuration file

import matplotlib
print(matplotlib.matplotlib_fname())

Example of display result:

'/home/xiaobai/opt/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/matplotlibrc'

Upload the downloaded SimHei.ttf to the fonts/ttf subdirectory of the directory that contains the file printed above.

Example directory:

/home/xiaobai/opt/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/fonts/ttf

Note that the actual directory depends on the output of the code on your machine.
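
The target directory can also be derived in code instead of by hand. The sketch below uses only os.path; font_dir_from_rc is a hypothetical helper name, and the path is the example output shown above rather than a value read from a live matplotlib installation.

```python
import os

def font_dir_from_rc(rc_path):
    """Return the fonts/ttf directory that sits next to a matplotlibrc file."""
    return os.path.join(os.path.dirname(rc_path), "fonts", "ttf")

# Example path copied from the output above; in practice pass
# matplotlib.matplotlib_fname() instead of a hard-coded string.
rc = "/home/xiaobai/opt/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/matplotlibrc"
target = font_dir_from_rc(rc)
print(target)
```

This assumes that the printed matplotlibrc is the one bundled inside mpl-data, as in the example. If a user-level matplotlibrc exists, matplotlib_fname() may point to that file instead, in which case matplotlib.get_data_path() is probably the better starting point for locating the bundled fonts.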

(3) Clear the matplotlib cache

rm -rf  ~/.cache/matplotlib

2) Visualization

import matplotlib.pyplot as plt

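plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # display the minus sign correctly

df = books.groupby('publish').count().sort('count', ascending=False).toPandas()
# Take the first 10 rows in reverse order so that the largest bar is drawn at the top
df.iloc[9::-1].set_index('publish').plot.barh()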
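
The reversed slice is used because a horizontal bar chart draws the first row at the bottom, so reversing the ranking puts the largest publisher at the top. A slice of the form [k::-1] covers positions k down to 0, that is k + 1 items, so ten rows correspond to k = 9. The following plain-Python sketch illustrates this with a made-up list in place of the DataFrame; top_n_reversed is a hypothetical helper.

```python
# Stand-in for the publisher ranking: position 0 is the largest publisher.
ranking = ["publisher_%02d" % i for i in range(1, 21)]

def top_n_reversed(items, n):
    """First n items in reverse order, equivalent to items[n-1::-1] for n >= 1."""
    return items[:n][::-1]

top10 = top_n_reversed(ranking, 10)
print(len(top10), top10[0], top10[-1])
```

The same two-step form works on a Pandas DataFrame as df.head(10).iloc[::-1], which some readers may find easier to read than a single reversed slice.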


7. Calculate the average rating of the books from each publisher

The specific requirements are as follows:

  • Include only publishers with more than 200 books in the dataset
  • Sort by average rating from high to low
  • Visualize the top 10 by average rating

1) Calculate the average score as required

  • DSL

    from pyspark.sql import functions as F
    # Group by publisher, then compute the number of books and the average rating
    pub_cnt_rate = books.groupby('publish').agg(F.count(F.col('id')).alias('count'),
                                                F.mean(F.col('rate')).alias('avg_rate'))
    # Keep publishers whose count is greater than 200, sorted by average rating in descending order
    top_avg = pub_cnt_rate.filter(F.col('count') > 200).sort('avg_rate', ascending=False)
    # Show the data
    top_avg.show()
    
    +------------------+-----+-----------------+
    |           publish|count|         avg_rate|
    +------------------+-----+-----------------+
    |            集英社|  465|9.001505358501147|
    |中国少年儿童出版社|  207|8.999033773578883|
    |            講談社|  426|8.939671380978794|
    |        东立出版社| 1223|8.926819284334597|
    |            小学館|  253|8.866007923608713|
    |        尖端出版社|  557| 8.78402155562834|
    |          台灣角川|  267|8.656554338190887|
    |    上海古籍出版社|  590|8.648813579042079|
    |          中华书局| 1278|8.647104871478252|
    |  世界图书出版公司|  269|8.635687769567213|
    |        接力出版社|  219| 8.63196347510978|
    |    人民邮电出版社|  599| 8.62036727465851|
    |  二十一世纪出版社|  323| 8.60743037539739|
    |          時報文化|  215|8.576279125657193|
    |中国建筑工业出版社|  209|8.540191401705217|
    |北京十月文艺出版社|  230|8.536086980156277|
    |    人民文学出版社| 1437|8.530897729498028|
    |    河北教育出版社|  385|8.528571470681722|
    |        商务印书馆|  917|8.522464596198196|
    |    机械工业出版社|  519|8.521194627059907|
    +------------------+-----+-----------------+
    only showing top 20 rows
    
  • SQL

    top_avg = spark.sql('''
    select * from (select publish, count(1) as count, avg(rate) as avg_rate 
    from books group by publish)
    where count > 200 order by avg_rate desc
    ''')
    top_avg.show()
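
To make explicit what both versions compute (group, aggregate, filter on the group size, then sort), here is the same logic in plain Python on a handful of made-up records. This is an illustrative sketch only: rank_publishers is a hypothetical helper, the records are invented, and a threshold of 1 is used in place of 200 to keep the example small.

```python
from collections import defaultdict

def rank_publishers(records, min_count):
    """records: iterable of (publisher, rating) pairs.

    Returns [(publisher, count, avg_rating)] for publishers with
    count > min_count, sorted by average rating in descending order.
    """
    totals = defaultdict(lambda: [0, 0.0])   # publisher -> [count, rating sum]
    for publisher, rating in records:
        totals[publisher][0] += 1
        totals[publisher][1] += rating
    kept = [(p, c, s / c) for p, (c, s) in totals.items() if c > min_count]
    return sorted(kept, key=lambda row: row[2], reverse=True)

# Made-up records; publisher B has a single book and is filtered out.
records = [("A", 9.0), ("A", 8.0), ("B", 9.5), ("C", 9.0), ("C", 9.5), ("C", 8.5)]
result = rank_publishers(records, min_count=1)
print(result)
```

The filter is applied after aggregation, which is what the outer where clause in the SQL version and the filter on the aggregated DataFrame in the DSL version both do.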
    

2) Visualize the top 10

import matplotlib.pyplot as plt

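plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # display the minus sign correctly

df = top_avg.toPandas()
# First 10 rows in reverse order; keep only the publish and avg_rate columns
df.iloc[9::-1, [0, 2]].set_index('publish').plot.barh(legend=False, xlim=(8, 10))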


Related resources

Link: https://pan.baidu.com/s/15dm0Y-H1JQE0TcvWC4kL0w?pwd=dvzj
Extraction code: dvzj

Origin blog.csdn.net/tangyi2008/article/details/124169242