Spark SQL Basics

1. Basic introduction to Spark SQL

1. What is Spark SQL

Spark SQL is one of Spark's core components, mainly used to process large-scale structured data.

Features of Spark SQL:

1). Integration: You can use SQL statements or write code, and the two can be mixed freely.

2). Unified data access: Spark SQL uses a unified API to connect to different data sources.

3). Hive compatibility: Spark SQL can be integrated with Hive. After integration, the execution engine is replaced by Spark, while metadata is still managed through Hive's metastore.

4). Standardized connectivity: Spark SQL supports JDBC/ODBC connections.

2. Similarities and differences between Spark SQL and Hive

Similarities:

① Both are distributed SQL computing engines

② Both can process large-scale structured data

③ Both can run on a YARN cluster

Differences:

① Spark SQL is built on RDDs, while Hive SQL is built on MapReduce.

② Spark SQL supports both SQL statements and code, while Hive SQL only supports SQL statements.

③ Spark SQL has no metadata management service of its own, while Hive has the metastore metadata service.

④ Spark SQL computes primarily in memory, while Hive SQL relies on disk.

3. Spark SQL data structure comparison

Explanation:

pandas DataFrame: a two-dimensional table for processing structured data on a single machine

Spark Core (RDD): can hold data of any structure and processes large-scale distributed data

Spark SQL (DataFrame): a two-dimensional table for processing large-scale distributed structured data

 

RDD: stores objects directly. For example, an RDD may store Person objects, but what data each object contains is not visible to Spark.

DataFrame: stores each field of the Person objects as a column of a structured table, so the data can be seen directly.

Dataset: stores the data of the Person objects in a structured way while retaining the object type, so it is still known that each row comes from a Person object.

Since Python does not support generics, the Dataset type cannot be used; the Python client only supports the DataFrame type.
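
A minimal sketch (assuming a local SparkSession and made-up sample records) contrasting how the same data looks as an RDD of plain objects versus a DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('rdd_vs_dataframe').master('local[*]').getOrCreate()
sc = spark.sparkContext

# RDD: Spark only sees opaque Python objects (here, tuples)
person_rdd = sc.parallelize([(1, "Tom", 20), (2, "Jerry", 18)])
print(person_rdd.collect())

# DataFrame: the same data with named columns that Spark can inspect
person_df = person_rdd.toDF(schema=["id", "name", "age"])
person_df.show()
person_df.printSchema()

spark.stop()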

2. Detailed explanation of DataFrame

1. Basic introduction to DataFrame

 

A DataFrame represents a two-dimensional table, and a two-dimensional table needs structure description information for its rows and columns.

Table structure description information (metadata, Schema): StructType object

Field: StructField object, which describes the field name, the field data type, and whether the field can be null

Row: Row object

Column: Column object, containing the field name and the field values

A StructType object is composed of multiple StructField objects, which together form the complete metadata.
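
A minimal sketch (field names are illustrative) of how a StructType is assembled from StructField objects:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# The complete schema (StructType) is built from several StructFields:
# field name, field data type, and whether the field may be null
schema = StructType([
    StructField('id', IntegerType(), False),
    StructField('name', StringType(), False),
    StructField('age', IntegerType(), True)
])

print(schema.fieldNames())   # ['id', 'name', 'age']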

2. How to construct a DataFrame

2.1 Getting a DataFrame from an RDD

from pyspark import SparkConf, SparkContext
import os
from pyspark.sql import SparkSession

# Bind the specified Python interpreter
from pyspark.sql.types import StructType, IntegerType, StringType, StructField

os.environ['SPARK_HOME'] = '/export/server/spark'
os.environ['PYSPARK_PYTHON'] = '/root/anaconda3/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/root/anaconda3/bin/python3'

if __name__ == '__main__':
    # 1- Create a SparkSession object
    spark = SparkSession.builder\
        .appName('rdd_2_dataframe')\
        .master('local[*]')\
        .getOrCreate()

    # Get the SparkContext from the SparkSession
    sc = spark.sparkContext

    # 2- Data input
    # 2.1- Create an RDD
    init_rdd = sc.parallelize(["1,李白,20","2,安其拉,18"])

    # 2.2- Convert the RDD's data into a two-dimensional structure
    new_rdd = init_rdd.map(lambda line: (
            int(line.split(",")[0]),
            line.split(",")[1],
            int(line.split(",")[2])
        )
    )

    # Convert the RDD to a DataFrame: approach 1
    # schema option 1
    schema = StructType()\
        .add('id',IntegerType(),False)\
        .add('name',StringType(),False)\
        .add('age',IntegerType(),False)


    # schema option 2
    schema = StructType([
        StructField('id',IntegerType(),False),
        StructField('name',StringType(),False),
        StructField('age',IntegerType(),False)
    ])

    # schema option 3
    schema = "id:int,name:string,age:int"

    # schema option 4
    schema = ["id","name","age"]

    init_df = spark.createDataFrame(
        data=new_rdd,
        schema=schema
    )

    # Convert the RDD to a DataFrame: approach 2
    """
        toDF: the schema argument accepts either a list of column names or a schema string
    """
    # init_df = new_rdd.toDF(schema=["id","name","age"])
    init_df = new_rdd.toDF(schema="id:int,name:string,age:int")

    # 3- Data processing
    # 4- Data output
    init_df.show()
    init_df.printSchema()

    # 5- Release resources
    sc.stop()
    spark.stop()

Scenario: An RDD can store data of any structure, while a DataFrame can only handle two-dimensional table data. In the early stages of processing with Spark, the input may be semi-structured or unstructured. We can first use the RDD API to ETL the data into a structured form, and then switch to Spark SQL, which offers higher development efficiency, for the subsequent processing and analysis.
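
A minimal sketch of this pattern (the log format and sample records are made up for illustration): clean semi-structured text with the RDD API, then hand the structured result to Spark SQL.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('rdd_etl_to_sql').master('local[*]').getOrCreate()
sc = spark.sparkContext

# Hypothetical semi-structured log lines in the form "level|user|message"
raw_rdd = sc.parallelize([
    "INFO|1001|login ok",
    "ERROR|1002|timeout",
    "bad line without separators"   # dirty record to be filtered out
])

# ETL with the RDD API: drop malformed lines and split the rest into fields
clean_rdd = raw_rdd\
    .filter(lambda line: line.count("|") == 2)\
    .map(lambda line: tuple(line.split("|")))

# Hand the structured result to Spark SQL as a DataFrame
log_df = clean_rdd.toDF(schema="level string,user_id string,message string")
log_df.show()

spark.stop()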

2.2 Creating a DataFrame from internally initialized data

from pyspark import SparkConf, SparkContext
import os

# Bind the specified Python interpreter
from pyspark.sql import SparkSession

os.environ['SPARK_HOME'] = '/export/server/spark'
os.environ['PYSPARK_PYTHON'] = '/root/anaconda3/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/root/anaconda3/bin/python3'

if __name__ == '__main__':
    print("内部初始化数据得到DataFrame。类似SparkCore中的parallelize")

    # 1- Create the top-level SparkSession object
    spark = SparkSession.builder\
        .appName('inner_create_dataframe')\
        .master('local[*]')\
        .getOrCreate()

    # 2- Data input
    """
        createDataFrame builds a DataFrame; the schema can be a DataType, a string, or a list
            String: format requirements
                Format 1: field1 field_type,field2 field_type
                Format 2 (recommended): field1:field_type,field2:field_type
                
            List: format requirements
                ["field1","field2"]
    """
    # Create a DataFrame from internally initialized data
    init_df = spark.createDataFrame(
        data=[(1,'张三',18),(2,'李四',30)],
        schema="id:int,name:string,age:int"
    )

    # init_df = spark.createDataFrame(
    #     data=[(1, '张三', 18), (2, '李四', 30)],
    #     schema="id int,name string,age int"
    # )

    # init_df = spark.createDataFrame(
    #     data=[(1, '张三', 18), (2, '李四', 30)],
    #     schema=["id","name","age"]
    # )

    # init_df = spark.createDataFrame(
    #     data=[(1, '张三', 18), (2, '李四', 30)],
    #     schema=["id:int", "name:string", "age:int"]
    # )

    # 3- Data processing
    # 4- Data output
    # Print the DataFrame's data content
    init_df.show()

    # Print the DataFrame's schema information
    init_df.printSchema()

    # 5- Release resources
    spark.stop()

Scenario: generally used for development and testing, because it can only handle small amounts of data.

Schema summary

Create a DataFrame through createDataFrame. The schema can be a string, a list, or a DataType.

1: String

Format 1: field1 field_type,field2 field_type

Format 2 (recommended): field1:field_type,field2:field_type

2: List

["field1","field2"]

3: DataType

Format 1: schema = StructType().add('id',IntegerType(),False).add('name',StringType(),False).add('age',IntegerType(),False)

Format 2: schema = StructType([StructField('id',IntegerType(),False), StructField('name',StringType(),False), StructField('age',IntegerType(),False)])

 2.3 Reading external files

Complex API

Unified API format:

sparksession.read
    .format('text|csv|json|parquet|orc|avro|jdbc|...')
    .option('k','v')
    .schema(StructType | String)
    .load('path to load')   # path of the external file; supports HDFS and local paths

Abbreviated API

Please note: all of the reading methods above have shorthand forms; Spark has built-in abbreviations for the commonly used reading schemes.

Format: spark.read.<reading method>()

For example:

df = spark.read.csv(
    path='file:///export/data/_03_spark_sql/data/stu.txt',
    header=True,
    sep=' ',
    inferSchema=True,
    encoding='utf-8'
)

2.3.1 Reading in text mode

 

from pyspark import SparkConf, SparkContext
import os
from pyspark.sql import SparkSession

# Bind the specified Python interpreter
os.environ['SPARK_HOME'] = '/export/server/spark'
os.environ['PYSPARK_PYTHON'] = '/root/anaconda3/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/root/anaconda3/bin/python3'

if __name__ == '__main__':
    print("Read file in text mode")

    # 1- Create SparkSession object
    spark = SparkSession.builder\
        .appName('text_demo')\
        .master('local[*]')\
        .getOrCreate()

    # 2- Data input
    """
        load: supports reading HDFS file system and local file system
            HDFS file system: hdfs://node1:8020/file path
            local file system: file:///file path
            
        text mode to read files Summary:
            1- No matter what the content in the file is, text will put all the content into one column for processing
            2- The default generated column name is value and the data type is string
            3- We can only modify the field value in the schema Name, nothing else can be modified
    """
    init_df = spark.read\
        .format('text')\
        .schema("my_field string")\
        .load('file:///export/data/gz16_pyspark/02_spark_sql/data/stu.txt')

    # 3- Data processing
    # 4- Data output
    init_df.show()
    init_df.printSchema()

    # 5- Release resources
    spark.stop()
 

Summary of reading files in text mode:

1- Whatever the content of the file is, text puts everything into a single column

2- The default column name is value, and its data type is string

3- Only the field name can be changed via the schema; nothing else can be modified

2.3.2 Reading in CSV mode

from pyspark import SparkConf, SparkContext
import os
from pyspark.sql import SparkSession

# Bind the specified Python interpreter
os.environ['SPARK_HOME'] = '/export/server/spark'
os.environ['PYSPARK_PYTHON'] = '/root/anaconda3/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/root/anaconda3/bin/python3'

if __name__ == '__main__':
    print("csv方式读取文件")

    # 1- 创建SparkSession对象
    spark = SparkSession.builder\
        .appName('csv_demo')\
        .master('local[*]')\
        .getOrCreate()

    # 2- Data input
    """
        Summary of reading external files in CSV format:
            1- Both the full API and the shorthand API should be mastered
            2- Description of the related parameters:
                2.1- path: path of the file to read; supports HDFS and local file paths
                2.2- schema: manually specify the metadata
                2.3- sep: specify the field separator
                2.4- encoding: specify the file encoding
                2.5- header: specify whether the first line of the file contains the field names
                2.6- inferSchema: infer data types automatically from the data content; the result may be inaccurate
    """
    # Full (complex) API style
    init_df = spark.read\
        .format('csv')\
        .schema("id int,name string,address string,sex string,age int")\
        .option("sep"," ")\
        .option("encoding","UTF-8")\
        .option("header","True")\
        .load('file:///export/data/gz16_pyspark/02_spark_sql/data/stu.txt')

    # Shorthand API style
    # init_df = spark.read.csv(
    #     path='file:///export/data/gz16_pyspark/02_spark_sql/data/stu.txt',
    #     schema="id int,name string,address string,sex string,age int",
    #     sep=' ',
    #     encoding='UTF-8',
    #     header="True"
    # )

    # init_df = spark.read.csv(
    #     path='file:///export/data/gz16_pyspark/02_spark_sql/data/stu.txt',
    #     sep=' ',
    #     encoding='UTF-8',
    #     header="True",
    #     inferSchema=True
    # )

    # 3- Data processing
    # 4- Data output
    init_df.show()
    init_df.printSchema()

    # 5- Release resources
    spark.stop()

Summary of reading external files in CSV format:

1- Description of the related parameters:

1.1 path: path of the file to read; supports HDFS and local paths

1.2 schema: manually specify the metadata

1.3 sep: specifies the separator between fields

1.4 encoding: specifies the file encoding

1.5 header: specifies whether the first line of the file contains the field names

1.6 inferSchema: automatically infers data types from the data content, but the result may not be accurate

 2.3.3 Reading in JSON mode

JSON data content:

{"id": 1, "name": "Zhang San", "age": 20}
{"id": 2, "name": "Li Si", "age": 23, "address": "Beijing"}
{"id": 3, "name": "Wang Wu", "age": 25}
{"id": 4, "name": "Zhao Liu", "age": 29}

Code:

from pyspark import SparkConf, SparkContext
import os
from pyspark.sql import SparkSession

# Bind the specified Python interpreter
os.environ['SPARK_HOME'] = '/export/server/spark'
os.environ['PYSPARK_PYTHON'] = '/root/anaconda3/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/root/anaconda3/bin/python3'

if __name__ == '__main__':
    # 1- Create a SparkSession object
    spark = SparkSession.builder\
        .appName('json_demo')\
        .master('local[*]')\
        .getOrCreate()

    # 2- Data input
    """
        Summary of reading JSON data:
            1- The schema needs to be specified manually. If a specified field name does not match a key name in the JSON, that field cannot be parsed and is filled with null
            2- For the string form of the schema in csv/json, the field name and the field data type can only be separated by a space
    """
    # init_df = spark.read.json(
    #     path='file:///export/data/gz16_pyspark/02_spark_sql/data/data.txt',
    #     schema="id2 int,name string,age int,address string",
    #     encoding='UTF-8'
    # )

    # init_df = spark.read.json(
    #     path='file:///export/data/gz16_pyspark/02_spark_sql/data/data.txt',
    #     schema="id:int,name:string,age:int,address:string",
    #     encoding='UTF-8'
    # )

    init_df = spark.read.json(
        path='file:///export/data/gz16_pyspark/02_spark_sql/data/data.txt',
        schema="id int,name string,age int,address string",
        encoding='UTF-8'
    )

    # 3- Data output
    init_df.show()
    init_df.printSchema()


    # 4- Release resources
    spark.stop()

Summary of reading JSON data:

1- The schema needs to be specified manually. If a specified field name does not match the corresponding key name in the JSON, that field cannot be parsed and is filled with null.

2- For the string form of the schema in csv/json, the field name and the field data type can only be separated by a space.

3. DataFrame related APIs

There are generally two ways to operate on a DataFrame: the DSL style and the SQL style.

SQL style: complete statistical analysis by writing SQL statements
DSL style: domain-specific language; use the DataFrame's own API (i.e., code) to complete the computation

From a usage perspective: SQL may feel more convenient at first, but once you get used to the DSL style, you will find DSL easier to use than SQL.
From Spark's perspective: the DSL style is recommended, because it makes it easier for Spark to optimize at the lower level.
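
A minimal sketch (the column names and sample data are made up) showing the same aggregation written in both styles:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('sql_vs_dsl').master('local[*]').getOrCreate()

df = spark.createDataFrame(
    data=[(1, 'Tom', 20), (2, 'Jerry', 18), (3, 'Spike', 20)],
    schema="id int,name string,age int"
)

# SQL style: register a temporary view, then write SQL against it
df.createOrReplaceTempView('student')
spark.sql("select age, count(*) as cnt from student group by age").show()

# DSL style: the same computation expressed with DataFrame API calls
df.groupBy('age').agg(F.count('id').alias('cnt')).show()

spark.stop()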

3.1 SQL related APIs

Create a view/table

 

df.createTempView('view name'): creates a temporary view (table name)
df.createOrReplaceTempView('view name'): creates a temporary view (table name); if the view already exists, it is replaced.
A temporary view can only be used in the current SparkSession


df.createGlobalTempView('view name'): creates a global view that can be used across the multiple SparkSessions running in one Spark application. When querying it, the view must be referenced as global_temp.view name. Rarely used

Execute SQL statement

spark.sql('SQL statement to execute')
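
A brief, self-contained sketch (view names and data are illustrative) of the view APIs described above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sql_view_demo').master('local[*]').getOrCreate()

df = spark.createDataFrame(
    data=[(1, 'Tom', 20), (2, 'Jerry', 18)],
    schema="id int,name string,age int"
)

# Temporary view: visible only in the current SparkSession
df.createOrReplaceTempView('stu')
spark.sql("select * from stu where age > 18").show()

# Global temporary view: shared by the SparkSessions in the same application;
# it must be referenced through the global_temp database
df.createGlobalTempView('stu_global')
spark.sql("select * from global_temp.stu_global").show()

spark.stop()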

3.2 DSL related APIs

show(): displays the data in the DataFrame; by default only the first 20 rows are shown

Parameter 1: how many rows to display; the default is 20

Parameter 2: whether to truncate long column values; by default only the first 20 characters of each value are displayed, and anything longer is cut off

printSchema(): used to print the table structure information of the current DF

select(): similar to SELECT in SQL; anything that can follow SELECT in SQL can also be expressed here

  • filter() and where(): used to filter data; where() is the one generally used in Spark SQL

  • groupBy(): used to perform grouping operations

  • orderBy(): used to perform sorting operations

DSL methods mainly accept parameters in the following forms: str | Column object | list
    str form: 'field'
    Column object:
        a field contained in the DataFrame: df['field']
        newly generated during execution: F.col('field')
    list:
        ['field1','field2',...]
        [df['field1'],df['field2']]

To support the use of SQL functions when writing Spark SQL in the DSL style, a dedicated SQL function library is provided. Just import it and use it directly.

Import the function library: import pyspark.sql.functions as F
Call the corresponding function through F. The functions supported by Spark SQL can be looked up at the following address:
https://spark.apache.org/docs/3.1.2/api/sql/index.html
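
A minimal DSL sketch (the data and column names are made up) combining the APIs above with functions from pyspark.sql.functions:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('dsl_demo').master('local[*]').getOrCreate()

df = spark.createDataFrame(
    data=[(1, 'Tom', 'Beijing', 20), (2, 'Jerry', 'Shanghai', 18), (3, 'Spike', 'Beijing', 25)],
    schema="id int,name string,address string,age int"
)

# select(): columns can be passed as str, as Column objects, or as a list
df.select('name', df['age'], F.col('address')).show()

# where()/filter(): filter rows
df.where(df['age'] > 18).show()

# groupBy() plus aggregation functions from the F library
df.groupBy('address')\
    .agg(F.count('id').alias('cnt'), F.avg('age').alias('avg_age'))\
    .show()

# orderBy(): sorting
df.orderBy(df['age'].desc()).show()

spark.stop()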
