Spark Big Data Processing Lecture Notes 4.2: Spark SQL Data Sources - Basic Operations

 

Table of contents

0. Learning objectives of this lecture

1. Basic operation

2. The default data source

(1) Default data source Parquet

(2) Case demonstration to read Parquet files

1. Demo in Spark Shell

2. Demonstration through Scala program

3. Manually specify the data source

(1) Overview of format() and option() methods

(2) Case demonstration to read different data sources

1. Read the real estate csv file

2. Read json and save it as parquet

3. Read the jdbc data source and save it as a json file

4. Data writing mode

(1) mode() method

(2) Enumeration class SaveMode

(3) Case demonstration of different writing modes

5. Partition automatic inference

(1) Overview of partition automatic inference

(2) Partition automatic inference demonstration

1. Create four files

2. Read table data

3. Output Schema information

4. Display the content of the data frame

(3) Precautions for automatic partition inference


0. Learning objectives of this lecture

  1. Learn to use default data sources
  2. Learn to manually specify the data source
  3. Understand data write modes
  4. Master partition automatic inference

Spark SQL supports operations on a variety of data sources through the DataFrame interface. A DataFrame can be manipulated with relational transformation operators, and it can also be used to create temporary views. Registering a DataFrame as a temporary view lets you run SQL queries over its data.

1. Basic operation

  • Spark SQL provides two common methods for loading and writing data: the load() method and the save() method. The load() method loads an external data source into a DataFrame, and the save() method writes a DataFrame out to the specified data source, as sketched below.
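A minimal sketch of the pattern in the Spark Shell (the paths are placeholders; adjust them to your cluster):

// load() reads the default data source (Parquet) into a DataFrame
val df = spark.read.load("hdfs://master:9000/datasource/input/users.parquet")
// save() writes the DataFrame back out in the default format
df.write.save("hdfs://master:9000/datasource/output")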

2. The default data source

(1) Default data source Parquet

  • By default, the load() and save() methods only support files in Parquet format. Parquet files store data in a binary format, so they cannot be read directly as text; a Parquet file contains both the actual data and its schema information. The default file format can be changed through the spark.sql.sources.default parameter in the configuration file (see the sketch below). Spark SQL can easily read Parquet files and convert their data into DataFrame datasets.
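The same setting can also be supplied when building the SparkSession instead of in the configuration file; a hedged sketch (the value "json" is only an example):

import org.apache.spark.sql.SparkSession

// Example only: make JSON the default format for load() and save()
val spark = SparkSession.builder()
  .appName("DefaultSourceDemo")
  .master("local[*]")
  .config("spark.sql.sources.default", "json")
  .getOrCreate()
// With the setting above, load() now expects JSON instead of Parquet
val peopledf = spark.read.load("hdfs://master:9000/input/people.json")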

(2) Case demonstration to read Parquet files

  • Upload the data file users.parquet to the /home directory on the master virtual machine
  • Upload the data file users.parquet to the HDFS /datasource/input directory

1. Demo in Spark Shell

  • Start Spark Shell by executing the command: spark-shell --master spark://master:7077
  • Load the Parquet file, which returns a DataFrame
  • Execute the command: val userdf = spark.read.load("hdfs://master:9000/datasource/input/users.parquet")
  • Execute the command: userdf.show() to view the contents of the DataFrame
  • Execute the command: userdf.select("name", "favorite_color").write.save("hdfs://master:9000/datasource/output") to query the specified columns of the DataFrame (the result is still a DataFrame), then write the result to the specified HDFS directory with the save() method
  • View output results on HDFS
  • In addition to querying with the select() method, you can also use the sql() method of the SparkSession object to run SQL statements; this method also returns a DataFrame.
  • To create a temporary view based on the DataFrame, execute the command: userdf.createTempView("t_user")
  • Execute the SQL query and write the result to HDFS with the command: spark.sql("select name, favorite_color from t_user").write.save("hdfs://master:9000/result2")
  • View output results on HDFS

Classroom exercise 1. Convert the student.txt file from Section 4.1 into student.parquet in the HDFS /datasource/input directory

  • Solution: convert student.txt into a DataFrame studentdf, save it as Parquet with the save() method to the /datasource/output3 directory, then rename the generated file and copy it to the /datasource/input directory (see the sketch after the steps below)

  • Get the student DataFrame

  • Save the student DataFrame as a Parquet file

  • View the generated Parquet file

  • Copy the Parquet file to the /datasource/input directory
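A possible Spark Shell sketch for these steps, assuming student.txt is comma-separated with columns id, name, gender and age (adjust the parsing to the actual layout used in Section 4.1):

// Read the text file and convert it to a DataFrame (assumed schema: id,name,gender,age)
val studentdf = spark.read.textFile("hdfs://master:9000/datasource/input/student.txt")
  .map(_.split(","))
  .map(a => (a(0).toInt, a(1), a(2), a(3).toInt))
  .toDF("id", "name", "gender", "age")
// Save the DataFrame as a Parquet file
studentdf.write.save("hdfs://master:9000/datasource/output3")
// Then rename and copy the single generated part file (hypothetical command), e.g.:
// hdfs dfs -cp /datasource/output3/part-*.parquet /datasource/input/student.parquet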

Classroom exercise 2. Read the student.parquet file to get a student DataFrame and display its contents

  • Execute the command: val studentDF = spark.read.load("hdfs://master:9000/datasource/input/student.parquet")
  • Execute the command: studentDF.show

2. Demonstration through Scala program

  • Create a Maven project - SparkSQLDemo
  • Add dependencies and plugins to the pom.xml file
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
         http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>net.huawei.sql</groupId>
    <artifactId>SparkSQLDemo</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.12.15</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.1.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>3.1.3</version>
        </dependency>
    </dependencies>
    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
    </build>
</project>
  • Add the HDFS configuration file to the resources directory
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <description>only config in clients</description>
        <name>dfs.client.use.datanode.hostname</name>
        <value>true</value>
    </property>
</configuration>

  • Add a log4j properties file to the resources directory

log4j.rootLogger=ERROR, stdout, logfile
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spark.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n

  • Create the package net.hw.sparksql and create the object ReadParquet in it
package net.hw.sparksql

import org.apache.spark.sql.SparkSession

/**
 * Function: Parquet data source
 * Author: 华卫
 * Date: May 1, 2022
 */
object ReadParquet {
  def main(args: Array[String]): Unit = {
    // Required for local debugging; otherwise a Permission Denied error is thrown
    System.setProperty("HADOOP_USER_NAME", "root")
    // Create or get the SparkSession
    val spark = SparkSession.builder()
      .appName("ReadParquet")
      .master("local[*]")
      .getOrCreate()
    // Load the Parquet file and return a DataFrame
    val usersdf = spark.read.load("hdfs://master:9000/input/users.parquet")
    // Display the DataFrame contents
    usersdf.show()
    // Select the specified columns of the DataFrame and write the result to HDFS
    usersdf.select("name","favorite_color")
      .write.save("hdfs://master:9000/result3")
  }
}
  • Run the program and view the console results
  • View output results in HDFS

3. Manually specify the data source

(1) Overview of format() and option() methods

  • The format() method can be used to manually specify the data source. A data source needs to be referred to by its fully qualified name (for example, org.apache.spark.sql.parquet), but Spark SQL's built-in data sources can also be referred to by their short names (json, parquet, jdbc, orc, libsvm, csv, text).
  • DataFrame datasets can be saved to or converted between different file formats by manually specifying the data source.
  • While specifying the data source, you can use the option() method to pass the required parameters to it, for example an account and password for a JDBC data source.
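A generic sketch of the pattern (the format names, options, and output path are illustrative):

// Read: manually specify the source format and pass source-specific options
val df = spark.read
  .format("csv")                 // a built-in short name; a fully qualified class name also works
  .option("header", "true")      // option() passes parameters to the chosen source
  .load("hdfs://master:9000/input/house.csv")

// Write: the same pattern applies on the output side
df.write
  .format("json")
  .save("hdfs://master:9000/output/house_json")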

(2) Case demonstration to read different data sources

1. Read the real estate csv file

  • View the house.csv file in the /input directory on HDFS
  • In the Spark Shell, execute the command: val house_csv_df = spark.read.format("csv").load("hdfs://master:9000/input/house.csv") to read the real estate CSV file and get a real estate DataFrame
  • Execute the command: house_csv_df.show() to view the contents of the real estate DataFrame
  • As you can see, the first line of house.csv is the list of field names, but after conversion into a DataFrame it has become the first record, which is clearly wrong. What to do? Use the option() method to pass a parameter telling Spark that the first line is a header, not a data record.
  • Execute the command: val house_csv_df = spark.read.format("csv").option("header", "true").load("hdfs://master:9000/input/house.csv")
  • Execute the command: house_csv_df.show() to view the contents of the real estate DataFrame
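If you also want Spark to guess the column types instead of reading every column as a string, the built-in CSV source additionally accepts an inferSchema option; a sketch:

// header=true uses the first line as column names; inferSchema=true samples the data to guess column types
val house_csv_df = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs://master:9000/input/house.csv")
// The columns now carry inferred types instead of being all strings
house_csv_df.printSchema()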

2. Read json and save it as parquet

  • Upload people.json to the HDFS /input directory
  • Execute the command: val peopledf = spark.read.format("json").load("hdfs://master:9000/input/people.json")
  • Execute the command: peopledf.show()
  • Execute the command: peopledf.select("name", "age").write.format("parquet").save("hdfs://master:9000/result4")
  • View the generated Parquet files

3. Read the jdbc data source and save it as a json file

  • View the t_user table in the student database
  • Execute the command:

val userdf = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://master:3306/student")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "t_user")  
  .option("user", "root")  
  .option("password", "903213")
  .load()
  • An error is reported: the database driver com.mysql.jdbc.Driver cannot be found
  • To solve the problem, copy the database driver JAR to the $SPARK_HOME/jars directory
  • Distribute the database driver to the slave1 and slave2 virtual machines
  • Execute the command again:
val userdf = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://master:3306/student")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "t_user")  
  .option("user", "root")  
  .option("password", "903213")
  .load()
  • The JDBC data source is now loaded successfully, but a warning appears, which can be eliminated by adding useSSL=false to the connection URL
  • Execute the command:

val userdf = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://master:3306/student?useSSL=false")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "t_user")  
  .option("user", "root")  
  .option("password", "903213")
  .load()

  • Execute the command: userdf.show()
  • Execute the command: userdf.write.format("json").save("hdfs://master:9000/result5")
  • View the generated JSON file on the slave1 virtual machine by executing the command: hdfs dfs -cat /result5/*
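The same read can also be written with the jdbc() shortcut of DataFrameReader, which takes the connection properties as a java.util.Properties object; a sketch using the same placeholder credentials as above:

import java.util.Properties

val props = new Properties()
props.setProperty("driver", "com.mysql.jdbc.Driver")
props.setProperty("user", "root")
props.setProperty("password", "903213")
// Equivalent to the format("jdbc") version shown earlier
val userdf = spark.read.jdbc("jdbc:mysql://master:3306/student?useSSL=false", "t_user", props)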

4. Data writing mode

(1) mode() method

  • When writing data, you can use the mode() method to specify how existing data should be handled. The parameter of this method is the enumeration class SaveMode.
  • To use the SaveMode class, you need to import org.apache.spark.sql.SaveMode.

(2) Enumeration class SaveMode

  • SaveMode.ErrorIfExists: the default value. When writing a DataFrame to a data source, an exception is thrown if the data already exists.
  • SaveMode.Append: When writing a DataFrame to the data source, if the data or table already exists, it will be appended on the original basis.
  • SaveMode.Overwrite: When writing a DataFrame to the data source, if the data or table already exists, it will be overwritten (including the schema of the data or table).
  • SaveMode.Ignore: When writing a DataFrame to the data source, if the data or table already exists, the content will not be written, similar to SQL CREATE TABLE IF NOT EXISTS.
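Besides the enumeration constants, mode() also accepts equivalent string names, which saves the import in quick Spark Shell sessions; a sketch of both forms (the paths reuse the earlier examples):

import org.apache.spark.sql.SaveMode

// Enumeration form
peopledf.write.mode(SaveMode.Overwrite).format("json").save("hdfs://master:9000/result")
// String form: "errorifexists" (or "error"), "append", "overwrite", "ignore"
peopledf.write.mode("append").format("json").save("hdfs://master:9000/result")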

(3) Case demonstration of different writing modes

  • View the data source: people.json
  • Query the name column and write it to the HDFS /result directory in overwrite mode; the /result directory already contains data
  • Execute the command: val peopledf = spark.read.format("json").load("hdfs://master:9000/input/people.json")
  • Import the SaveMode class and execute the command: peopledf.select("name").write.mode(SaveMode.Overwrite).format("json").save("hdfs://master:9000/result")
  • View the generated json file on the slave1 virtual machine
  • Query the age column and write it to the HDFS /result directory in append mode by executing the command: peopledf.select("age").write.mode(SaveMode.Append).format("json").save("hdfs://master:9000/result")
  • View the additionally generated json file on the slave1 virtual machine

5. Partition automatic inference

(1) Overview of partition automatic inference

  • Table partitioning is a common way to improve query efficiency in systems such as Hive (table partitioning in Spark SQL is similar to table partitioning in Hive). In a partitioned table, data is usually stored in different partition directories, and each partition directory is named in the format "partition column name=value".
  • Taking people as the table name and gender and country as the partition columns, the directory structure used to store the data is illustrated below
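An illustrative layout of such a partitioned directory tree (the partition values are examples):

people
├── gender=male
│   ├── country=US
│   │   └── people.json
│   └── country=CN
│       └── people.json
└── gender=female
    ├── country=US
    │   └── people.json
    └── country=CN
        └── people.json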

(2) Partition automatic inference demonstration

1. Create four files

  • Create the following directories and files under /home on the master virtual machine, where the people directory represents the table name, gender and country represent the partition columns, and each people.json file stores the actual population data

2. Read table data

  • Execute the command spark-shell to start the Spark Shell
  • Execute the command: val peopledf = spark.read.format("json").load("file:///home/people")

3. Output Schema information

  • Execute the command: peopledf.printSchema()

4. Display the content of the data frame

  • Execute the command: peopledf.show()
  • From the output schema information and table data, it can be seen that when reading the data, Spark SQL automatically infers the two partition columns gender and country and adds their values to the peopledf DataFrame.

(3) Precautions for automatic partition inference

  • The data types of partition columns are inferred automatically; numeric, date, timestamp, and string types are currently supported. If you do not want the data types of partition columns to be inferred automatically, set spark.sql.sources.partitionColumnTypeInference.enabled to false in the configuration file (the default is true, meaning inference is enabled). When automatic inference is disabled, partition columns are treated as the string data type, as in the sketch below.
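A sketch of turning the inference off for the current session (the same key can also be set in the configuration file):

// Disable automatic type inference for partition columns; they are then read as strings
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
val peopledf = spark.read.format("json").load("file:///home/people")
peopledf.printSchema()  // gender and country now appear as string columns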

Source: blog.csdn.net/qq_61324603/article/details/130859380