Table of contents
0. Learning objectives of this lecture
1. Basic operation
2. The default data source
(1) Default data source Parquet
(2) Case demonstration to read Parquet files
1. Demo in Spark Shell
2. Demonstration through Scala program
3. Manually specify the data source
(1) Overview of format() and option() methods
(2) Case demonstration to read different data sources
1. Read the real estate csv file
2. Read json and save it as parquet
3. Read the jdbc data source and save it as a json file
4. Data writing mode
(1) mode() method
(2) Enumeration class SaveMode
(3) Case demonstration of different writing modes
5. Partition automatic inference
(1) Overview of partition automatic inference
(2) Partition automatic inference demonstration
1. Create four files
2. Read table data
3. Output Schema information
4. Display the content of the data frame
(3) Precautions for automatic partition inference
0. Learning objectives of this lecture
- Learn to use default data sources
- Learn to manually specify the data source
- Understand data writing modes
- Master partition automatic inference
Spark SQL supports operations on various data sources through the DataFrame interface. A DataFrame can be manipulated with the relevant transformation operators, and it can also be registered as a temporary view so that SQL queries can be run against its data.
1. Basic operation
- Spark SQL provides two common methods for loading and writing data: the load() method and the save() method. The load() method loads an external data source as a DataFrame, and the save() method writes a DataFrame to the specified data source.
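For instance, a minimal sketch of the two methods (the paths are placeholders, and spark is the SparkSession that spark-shell provides):

```scala
// load() reads the default data source (Parquet) into a DataFrame
val df = spark.read.load("hdfs://master:9000/datasource/input/users.parquet")
// save() writes the DataFrame back to a data source; the target path is a placeholder
df.write.save("hdfs://master:9000/datasource/output/users_copy")
```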
2. The default data source
(1) Default data source Parquet
- By default, the load() method and save() method only support files in Parquet format. Parquet files store data in binary form, so they cannot be read directly; a Parquet file contains both the actual data and its Schema information. The default file format can be changed via the spark.sql.sources.default parameter in the configuration file. Spark SQL can easily read Parquet files and convert their data into DataFrame datasets.
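As an illustration, the default format could also be overridden when building the SparkSession; this is only a sketch, and the value "json" is just an example:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: override the default data source format for this session;
// the built-in default is "parquet"
val spark = SparkSession.builder()
  .appName("DefaultSourceDemo")
  .master("local[*]")
  .config("spark.sql.sources.default", "json")
  .getOrCreate()
// read.load() and write.save() now use JSON instead of Parquet
```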
(2) Case demonstration to read Parquet files
- Upload the data file users.parquet to the /home directory on the master virtual machine
- Upload the data file users.parquet to the HDFS /datasource/input directory
1. Demo in Spark Shell
- Start Spark Shell and execute the command:
spark-shell --master spark://master:7077
- Load a parquet file, return a dataframe
- Execute the command: val userdf = spark.read.load("hdfs://master:9000/datasource/input/users.parquet")
- Execute the command: userdf.show() to view the content of the data frame
- Execute the command: userdf.select("name", "favorite_color").write.save("hdfs://master:9000/datasource/output") to query the specified columns of the data frame (the query result is still a data frame) and then write the result to the specified HDFS directory through the save() method
- View output results on HDFS
- In addition to using the select() method to query, you can also use the sql() method of the SparkSession object to execute SQL statements for query, and the return result of this method is still a DataFrame.
- To create a temporary view based on a data frame, execute the command:
userdf.createTempView("t_user")
- Execute SQL query, write the result to HDFS, and execute the command:
spark.sql("select name, favorite_color from t_user").write.save("hdfs://master:9000/result2")
- View output results on HDFS
Classroom exercise 1: Convert the student.txt file from Section 4.1 into student.parquet in the HDFS /datasource/input directory
- Solution: read student.txt and convert it to the data frame studentdf, save it with the save() method to the /datasource/output3 directory, then rename the generated file and copy it to the /datasource/input directory (a code sketch follows below)
- Get the student data frame
- Save the student data frame as a parquet file
- View the generated parquet file
- Copy the parquet file to the /datasource/input directory
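A possible solution sketch in the Spark Shell. The location and schema of student.txt are assumptions, since the file from Section 4.1 is not shown here; adjust the path, delimiter, and columns to the real data:

```scala
import spark.implicits._

// Read the text file and convert it to a data frame (assumed comma-separated columns)
val studentdf = spark.read.textFile("hdfs://master:9000/input/student.txt")
  .map(_.split(","))
  .map(f => (f(0).toInt, f(1), f(2), f(3).toInt))
  .toDF("id", "name", "gender", "age")

// save() writes Parquet by default; afterwards rename the generated part file to
// student.parquet and copy it to /datasource/input with hdfs dfs commands
studentdf.write.save("hdfs://master:9000/datasource/output3")
```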
Classroom exercise 2: Read the student.parquet file to get the student data frame and display its content
- Execute the command: val studentDF = spark.read.load("hdfs://master:9000/datasource/input/student.parquet")
- Execute the command: studentDF.show
2. Demonstration through Scala program
- Create a Maven project - SparkSQLDemo
- Add dependencies and plugins to the pom.xml file
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>net.huawei.sql</groupId>
<artifactId>SparkSQLDemo</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.12.15</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>3.1.3</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>3.1.3</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
</build>
</project>
- Add the HDFS configuration file to the resources directory
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<description>only config in clients</description>
<name>dfs.client.use.datanode.hostname</name>
<value>true</value>
</property>
</configuration>
- Add a log4j properties file to the resources directory
log4j.rootLogger=ERROR, stdout, logfile
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spark.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
- Create the package net.hw.sparksql and create the object ReadParquet in the package
package net.hw.sparksql
import org.apache.spark.sql.SparkSession
/**
* Function: Parquet data source
* Author: Hua Wei
* Date: May 1, 2022
*/
object ReadParquet {
def main(args: Array[String]): Unit = {
// Must be set for local debugging, otherwise a Permission Denied error is thrown
System.setProperty("HADOOP_USER_NAME", "root")
// Create or get the SparkSession
val spark = SparkSession.builder()
.appName("ReadParquet")
.master("local[*]")
.getOrCreate()
// Load the parquet file and return a DataFrame
val usersdf = spark.read.load("hdfs://master:9000/input/users.parquet")
// Display the DataFrame content
usersdf.show()
// Query the specified columns of the DataFrame and write the result to HDFS
usersdf.select("name","favorite_color")
.write.save("hdfs://master:9000/result3")
}
}
- Run the program and view the console results
- View output results in HDFS
3. Manually specify the data source
(1) Overview of format() and option() methods
- The format() method can be used to manually specify the data source. Data sources need to be specified by their fully qualified names (for example, org.apache.spark.sql.parquet), but for Spark SQL's built-in data sources their abbreviated names (JSON, Parquet, JDBC, ORC, Libsvm, CSV, Text) can also be used.
- By manually specifying the data source, DataFrame datasets can be saved to different file formats or converted between them.
- While specifying the data source, you can use the option() method to pass the required parameters to it, for example the user name and password for a JDBC data source.
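As a quick illustration (the paths are placeholders), the fully qualified name and the short name select the same built-in source, and option() passes source-specific parameters:

```scala
// Both calls use the built-in Parquet source
val df1 = spark.read.format("org.apache.spark.sql.parquet")
  .load("hdfs://master:9000/datasource/input/users.parquet")
val df2 = spark.read.format("parquet")
  .load("hdfs://master:9000/datasource/input/users.parquet")

// option() passes a parameter to the source, e.g. treating the first CSV line as a header
val csvDF = spark.read.format("csv")
  .option("header", "true")
  .load("hdfs://master:9000/input/house.csv")
```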
(2) Case demonstration to read different data sources
1. Read the real estate csv file
- View the file house.csv in the /input directory on HDFS
- In the Spark Shell, execute the command: val house_csv_df = spark.read.format("csv").load("hdfs://master:9000/input/house.csv") to read the real estate csv file and get the real estate data frame
- Execute the command: house_csv_df.show() to view the content of the real estate data frame
- As you can see, the first line of house.csv contains the field names, but after the file is converted into a data frame it becomes the first record, which is obviously unreasonable. To fix this, use the option() method to pass a parameter telling Spark that the first line is the header, not a data record.
- Execute the command: val house_csv_df = spark.read.format("csv").option("header", "true").load("hdfs://master:9000/input/house.csv")
- Execute the command: house_csv_df.show() to view the content of the real estate data frame
2. Read json and save it as parquet
- Upload people.json to the HDFS /input directory
- Execute the command: val peopledf = spark.read.format("json").load("hdfs://master:9000/input/people.json")
- Execute the command: peopledf.show()
- Execute the command: peopledf.select("name", "age").write.format("parquet").save("hdfs://master:9000/result4")
- View the generated parquet files
3. Read the jdbc data source and save it as a json file
- View the table t_user in the student database
- Execute the command:
val userdf = spark.read.format("jdbc")
.option("url", "jdbc:mysql://master:3306/student")
.option("driver", "com.mysql.jdbc.Driver")
.option("dbtable", "t_user")
.option("user", "root")
.option("password", "903213")
.load()
- An error was reported: the database driver com.mysql.jdbc.Driver could not be found
- To solve the problem, copy the database driver jar to the $SPARK_HOME/jars directory
- Distribute the database driver to the slave1 and slave2 virtual machines
- Execute the command:
val userdf = spark.read.format("jdbc")
.option("url", "jdbc:mysql://master:3306/student")
.option("driver", "com.mysql.jdbc.Driver")
.option("dbtable", "t_user")
.option("user", "root")
.option("password", "903213")
.load()
- The jdbc data source is loaded successfully, but there is a warning, which can be eliminated by setting useSSL=false in the JDBC URL
- Execute the command:
val userdf = spark.read.format("jdbc")
.option("url", "jdbc:mysql://master:3306/student?useSSL=false")
.option("driver", "com.mysql.jdbc.Driver")
.option("dbtable", "t_user")
.option("user", "root")
.option("password", "903213")
.load()
- Execute the command: userdf.show()
- Execute the command: userdf.write.format("json").save("hdfs://master:9000/result5")
- View the generated json file on the virtual machine slave1, and execute the command:
hdfs dfs -cat /result5/*
4. Data writing mode
(1) mode() method
- When writing data, you can use the mode() method to specify how existing data should be handled. The parameter of this method is the enumeration class SaveMode.
- To use the SaveMode class, you need to import org.apache.spark.sql.SaveMode
(2) Enumeration class SaveMode
- SaveMode.ErrorIfExists: the default value. When writing a DataFrame to the data source, an exception is thrown if the data already exists.
- SaveMode.Append: when writing a DataFrame to the data source, if the data or table already exists, the new data is appended to the existing data.
- SaveMode.Overwrite: when writing a DataFrame to the data source, if the data or table already exists, it is overwritten (including the schema of the data or table).
- SaveMode.Ignore: when writing a DataFrame to the data source, if the data or table already exists, nothing is written, similar to CREATE TABLE IF NOT EXISTS in SQL.
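A brief sketch of how these modes are used (peopledf here is assumed to be a data frame like the one created in the demonstration below):

```scala
import org.apache.spark.sql.SaveMode

// Overwrite whatever already exists in the target directory
peopledf.select("name").write.mode(SaveMode.Overwrite).format("json").save("hdfs://master:9000/result")
// mode() also accepts the equivalent string names "error", "append", "overwrite" and "ignore"
peopledf.select("age").write.mode("append").format("json").save("hdfs://master:9000/result")
```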
(3) Case demonstration of different writing modes
- View the data source: people.json
- Query the name column of the file and write it to the /result directory in overwrite mode; the /result directory already contains data
- Execute the command:
val peopledf = spark.read.format("json").load("hdfs://master:9000/input/people.json")
- Import the SaveMode class and execute the command: peopledf.select("name").write.mode(SaveMode.Overwrite).format("json").save("hdfs://master:9000/result")
- View the generated json file on the slave1 virtual machine
- Query the age column, write it to the HDFS /result directory in append mode, and execute the command: peopledf.select("age").write.mode(SaveMode.Append).format("json").save("hdfs://master:9000/result")
- View the additionally generated json file on the slave1 virtual machine
5. Partition automatic inference
(1) Overview of partition automatic inference
- Table partitioning is a commonly used method to optimize query efficiency in systems such as Hive (table partitioning in Spark SQL is similar to table partitioning in Hive). In a partitioned table, data is usually stored in different partition directories, and the partition directories are named in the format "partition column name=value".
- Take people as the table name and gender and country as the partition columns; a directory structure for storing the data is sketched below
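One possible layout of such a partitioned directory tree (the concrete gender and country values here are only illustrative assumptions):

```
people
├── gender=male
│   ├── country=CN
│   │   └── people.json
│   └── country=US
│       └── people.json
└── gender=female
    ├── country=CN
    │   └── people.json
    └── country=US
        └── people.json
```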
(2) Partition automatic inference demonstration
1. Create four files
Create the directory and files (as in the layout sketched above) under /home on the master virtual machine, where the directory people represents the table name, gender and country represent the partition columns, and each people.json stores the actual population data
2. Read table data
- Execute the command: spark-shell to start the Spark Shell
- Execute the command: val peopledf = spark.read.format("json").load("file:///home/people")
3. Output Schema information
- Execute the command:
peopledf.printSchema()
4. Display the content of the data frame
- Execute the command:
peopledf.show()
- It can be seen from the output Schema information and table data that when Spark SQL reads the data, it automatically infers the two partition columns gender and country and adds the values of these two columns to the data frame peopledf.
(3) Precautions for automatic partition inference
- The data type of a partition column is inferred automatically; numeric, date, timestamp, and string data types are currently supported. If you do not want the data type of partition columns to be inferred automatically, set spark.sql.sources.partitionColumnTypeInference.enabled to false in the configuration file (the default is true, meaning inference is enabled). When automatic inference is disabled, partition columns use the string data type.
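A small sketch of turning the inference off when building a SparkSession (the application name and master are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: disable automatic type inference for partition columns;
// with this setting, partition column values are read as strings
val spark = SparkSession.builder()
  .appName("PartitionInferenceDemo")
  .master("local[*]")
  .config("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
  .getOrCreate()
```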