4. Spark SQL data sources

4.1 Generic load / save methods

  4.1.1 Manually specifying options

      The Spark SQL interface supports operating on a variety of data sources through DataFrames. A DataFrame can be operated on in much the same way as an RDD, and it can also be registered as a temporary table. Once a DataFrame is registered as a temporary table, you can execute SQL queries against it.
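      For example, a minimal sketch (the DataFrame and view names here are illustrative):

val tmpDF = spark.read.json("examples/src/main/resources/people.json")
tmpDF.createOrReplaceTempView("people")            // register the DataFrame as a temporary view
spark.sql("SELECT name, age FROM people").show()   // query the view with SQL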

      Spark SQL's default data source is the Parquet format. When the data source is a Parquet file, Spark SQL can perform all of its operations on it directly. The default data source format can be changed by modifying the configuration item spark.sql.sources.default.
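      For instance, a minimal sketch of changing the default format at runtime (assumes an active SparkSession named spark; the default is restored afterwards so the examples below still read Parquet):

spark.conf.set("spark.sql.sources.default", "json")     // switch the default format for this session
val jsonDefaultDF = spark.read.load("examples/src/main/resources/people.json")  // now parsed as JSON
spark.conf.set("spark.sql.sources.default", "parquet")  // restore the default used in the following examples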

val df = spark.read.load("examples/src/main/resources/users.parquet")

df.select("name","favorite_color").write.save("namesAndFavColors.parquet")

      When the source file is not in Parquet format, you need to specify the data source format manually. You can give the fully qualified name of the data source (e.g. org.apache.spark.sql.parquet); for built-in formats, the short name is enough: json, parquet, jdbc, orc, libsvm, csv, text.

      For generic loading, use the read.load method provided by SparkSession; data is saved with write and save.

val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
peopleDF.write.format("parquet").save("hdfs://master01:9000/namesAndAges.parquet")

      In addition, you can run SQL directly on the file:

val sqlDF = spark.sql("SELECT * FROM parquet.`hdfs://master01:9000/namesAndAges.parquet`")
sqlDF.show()
scala> val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
peopleDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> peopleDF.write.format("parquet").save("hdfs://master01:9000/namesAndAges.parquet")

scala> peopleDF.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

scala> val sqlDF = spark.sql("SELECT * FROM parquet.`hdfs://master01:9000/namesAndAges.parquet`")
19/07/17 11:15:11 WARN ObjectStore: Failed to get database parquet, returning NoSuchObjectException
sqlDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> sqlDF.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

  4.1.2 File Saving Options  

      Storage operations can be performed with a SaveMode, which defines how existing data is handled. Note that these save modes do not use any locking and are not atomic. In addition, when Overwrite mode is used, the original data is deleted before the new data is written. The SaveMode options are detailed in the following table:

Scala / Java                        Any Language         Meaning
SaveMode.ErrorIfExists (default)    "error" (default)    If the data already exists, throw an error
SaveMode.Append                     "append"             Append the new data to the existing data
SaveMode.Overwrite                  "overwrite"          Overwrite the existing data with the new data
SaveMode.Ignore                     "ignore"             If the data already exists, ignore the save operation
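
      For example, a brief sketch of specifying a save mode on write, reusing peopleDF and the path from the earlier example:

import org.apache.spark.sql.SaveMode

peopleDF.write.mode(SaveMode.Overwrite).parquet("hdfs://master01:9000/namesAndAges.parquet")
// Or, using the string form of the mode:
peopleDF.write.mode("append").format("parquet").save("hdfs://master01:9000/namesAndAges.parquet")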

4.2 Parquet file

      Parquet is a popular columnar storage format that can efficiently store records with nested fields.

      

  4.2.1 Parquet read and write 

      The Parquet format is widely used in the Hadoop ecosystem, and it supports all Spark SQL data types. Spark SQL provides methods to read and write Parquet files directly.

// Encoders for most common types are automatically provided by importing spark.implicits._
import spark.implicits._
  
val peopleDF = spark.read.json("examples/src/main/resources/people.json")
  
// DataFrames can be saved as Parquet files, maintaining the schema information
peopleDF.write.parquet("hdfs://master01:9000/people.parquet")
  
// Read in the parquet file created above
// Parquet files are self-describing so the schema is preserved
// The result of loading a Parquet file is also a DataFrame
val parquetFileDF = spark.read.parquet("hdfs://master01:9000/people.parquet")
  
// Parquet files can also be used to create a temporary view and then used in SQL statements
parquetFileDF.createOrReplaceTempView("parquetFile")
val namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19")
namesDF.map(attributes => "Name: " + attributes(0)).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

  4.2.2 Resolving partition information

      Partitioning a table is one of the ways to optimize data. In a partitioned table, data is stored in different directories according to the values of the partitioning columns. The Parquet data source can automatically discover and resolve this partition information. For example, population data can be stored partitioned by gender and country, using the following directory structure:
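      A typical layout (this sketch follows the partition discovery example in the Spark documentation, with gender and country as the partition columns):

path
└── to
    └── table
        ├── gender=male
        │   ├── country=US
        │   │   └── data.parquet
        │   ├── country=CN
        │   │   └── data.parquet
        │   └── ...
        └── gender=female
            ├── country=US
            │   └── data.parquet
            ├── country=CN
            │   └── data.parquet
            └── ...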

      

      By passing path/to/table to SQLContext.read.parquet or SQLContext.read.load, Spark SQL automatically resolves the partition information from the paths. The schema of the returned DataFrame looks as follows:
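      Given the layout sketched above (and name/age columns in the data files themselves), the inferred schema would look roughly like:

root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- gender: string (nullable = true)
|-- country: string (nullable = true)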

      

      Note that the data types of the partitioning columns are resolved automatically; currently, numeric types and string types are supported. Automatic type inference for partition columns is controlled by the parameter spark.sql.sources.partitionColumnTypeInference.enabled, which defaults to true. To turn the feature off, set this parameter to false; the partition columns then default to the string type and are no longer type-inferred.
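      For instance, a minimal sketch of turning type inference off at runtime (assumes an active SparkSession named spark and the illustrative path above):

spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
// Partition columns such as gender and country are now read as plain strings
val stringPartitionsDF = spark.read.parquet("path/to/table")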

  4.2.3 Schema merging

      Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema and gradually add column descriptions to it. In this way, users can end up with multiple Parquet files whose schemas differ but are mutually compatible. The Parquet data source can automatically detect this situation and merge the schemas of these files.

      Because schema merging is an expensive operation and is not needed in most cases, Spark SQL has disabled it by default since 1.5.0. It can be enabled in either of the following two ways:

      When the data source is a Parquet file, set the data source option mergeSchema to true

      Set the global SQL option spark.sql.parquet.mergeSchema to true

      An example follows:

// The SparkSession (spark) and SparkContext (sc) from the previous examples are used here.
// This is used to implicitly convert an RDD to a DataFrame.
import spark.implicits._

// Create a simple DataFrame, stored into a partition directory
val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.write.parquet("hdfs://master01:9000/data/test_table/key=1")
  
// Create another DataFrame in a new partition directory,
// adding a new column and dropping an existing column
val df2 = sc.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
df2.write.parquet("hdfs://master01:9000/data/test_table/key=2")

// Read the partitioned table
val df3 = spark.read.option("mergeSchema", "true").parquet("hdfs://master01:9000/data/test_table")
df3.printSchema()
  
  
// The final schema consists of all 3 columns in the Parquet files together
// with the partitioning column appeared in the partition directory paths.
// root
// |-- single: int (nullable = true)
// |-- double: int (nullable = true)
// |-- triple: int (nullable = true)
// |-- key : int (nullable = true)
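
      Equivalently, the second approach enables merging globally for the session instead of per read (a minimal sketch):

spark.conf.set("spark.sql.parquet.mergeSchema", "true")
val df3Global = spark.read.parquet("hdfs://master01:9000/data/test_table")  // schemas merged without the per-read option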

4.3 Hive databases

      Apache Hive is the SQL engine on Hadoop, and Spark SQL can be compiled with or without Hive support. Spark SQL with Hive support can access Hive tables, use UDFs (user-defined functions), and run the Hive query language (HiveQL/HQL), among other features. It is worth emphasizing that including the Hive libraries in Spark SQL does not require Hive to be installed beforehand. In general, it is best to build Spark SQL with Hive support so that these features are available. If you downloaded a binary release of Spark, it should already have been built with Hive support.

      To connect Spark SQL to an existing Hive deployment, you must copy hive-site.xml into Spark's configuration directory ($SPARK_HOME/conf). Spark SQL can still run even if no Hive deployment exists. Note that without a deployed Hive, Spark SQL creates its own Hive metastore, called metastore_db, in the current working directory. In addition, if you create tables with HiveQL's CREATE TABLE statement (as opposed to CREATE EXTERNAL TABLE), they are placed in the /user/hive/warehouse directory of the default filesystem (HDFS if hdfs-site.xml is configured on the classpath, otherwise the local filesystem).

import java.io.File

import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

case class Record(key: Int, value: String)

// warehouseLocation points to the default location for managed databases and tables
val warehouseLocation = new File("spark-warehouse").getAbsolutePath

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._
import spark.sql

sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sql("SELECT * FROM src").show()
// +---+-------+
// |key|  value|
// +---+-------+
// |238|val_238|
// | 86| val_86|
// |311|val_311|
// ...

// Aggregation queries are also supported.
sql("SELECT COUNT(*) FROM src").show()
// +--------+
// |count(1)|
// +--------+
// |    500 |
// +--------+

// The results of SQL queries are themselves DataFrames and support all normal functions.
val sqlDF = sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key")

// The items in DataFrames are of type Row, which allows you to access each column by ordinal.
val stringsDS = sqlDF.map {
  case Row(key: Int, value: String) => s"Key: $key, Value: $value"
}
stringsDS.show()
// +--------------------+
// |               value|
// +--------------------+
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// ...

// You can also use DataFrames to create temporary views within a SparkSession.
val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.createOrReplaceTempView("records")

// Queries can then join DataFrame data with data stored in Hive.
sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show()
// +---+------+---+------+
// |key| value|key| value|
// +---+------+---+------+
// |  2| val_2|  2| val_2|
// |  4| val_4|  4| val_4|
// |  5| val_5|  5| val_5|
// ...

  4.3.1 Embedded Hive

      To use the embedded Hive, nothing needs to be done; just use it directly. The warehouse location can be specified with --conf spark.sql.warehouse.dir=

      Note: if you use the internal (embedded) Hive, then since Spark 2.0 spark.sql.warehouse.dir specifies the location of the data warehouse. If you want to use HDFS as the warehouse path, you need to add core-site.xml and hdfs-site.xml to Spark's conf directory; otherwise the warehouse directory is only created on the master node, and queries will fail because the files cannot be found. If you need to switch to HDFS in this situation, delete the metastore and restart the cluster.

  4.3.2 External Hive

      To connect to an external Hive that is already deployed, the following steps are required:

       1) Copy or symlink hive-site.xml from the Hive installation into the conf directory under the Spark installation directory

       2) Start the spark shell, making sure to include the JDBC client driver for accessing the Hive metastore database

  $ bin/spark-shell --master spark://master01:7077 --jars mysql-connector-java-5.1.27-bin.jar

4.4 JSON datasets

      Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset[Row]. You can load a Dataset[String] or a JSON file via SparkSession.read.json(). Note that such a JSON file is not a conventional JSON file: each line must be a separate, self-contained JSON string.

{"name":"Michael"} 
{"name":"Andy", "age":30} 
{"name":"Justin", "age":19}
// Primitive types (Int, String, etc) and Product types (case classes) encoders are
// supported by importing this when creating a Dataset.
import spark.implicits._
  
// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files
val path = "examples/src/main/resources/people.json"
val peopleDF = spark.read.json(path)

// The inferred schema can be visualized using the printSchema() method
peopleDF.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by spark
val teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()
// +------+
// |  name|
// +------+
// |Justin|
// +------+

// Alternatively, a DataFrame can be created for a JSON dataset represented by
// a Dataset[String] storing one JSON object per string
val otherPeopleDataset = spark.createDataset(
"""{"name":"Hui","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val otherPeople = spark.read.json(otherPeopleDataset)
otherPeople.show()
// +---------------+----+
// |        address|name|
// +---------------+----+
// |[Columbus,Ohio]| Hui|
// +---------------+----+

4.5 JDBC 

      Spark SQL can create a DataFrame by reading data from a relational database over JDBC; after a series of computations on the DataFrame, the data can also be written back to the relational database.

      Note that the relevant database driver JAR needs to be placed on Spark's classpath.

$ bin/spark-shell --master spark://master01:7077 --jars mysql-connector-java-5.1.27-bin.jar
// Note: JDBC loading and saving can be achieved via either the load/save or jdbc methods
// Loading data from a JDBC source
val jdbcDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://master01:3306/rdd")
  .option("dbtable", "rddtable")
  .option("user", "root")
  .option("password", "hive")
  .load()

import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.put("user", "root")
connectionProperties.put("password", "hive")
val jdbcDF2 = spark.read.jdbc("jdbc:mysql://master01:3306/rdd", "rddtable", connectionProperties)

// Saving data to a JDBC source
jdbcDF.write
  .format("jdbc")
  .option("url", "jdbc:mysql://master01:3306/rdd")
  .option("dbtable", "rddtable2")
  .option("user", "root")
  .option("password", "hive")
  .save()

jdbcDF2.write.jdbc("jdbc:mysql://master01:3306/mysql", "db", connectionProperties)

// Specifying create table column data types on write
jdbcDF.write
  .option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
  .jdbc("jdbc:mysql://master01:3306/mysql", "db", connectionProperties)

 

 
