Four Common Spark SQL Data Sources (in Detail)

Generic load/save methods

Manually specify options

The Spark SQL DataFrame interface supports operating on a variety of data sources. A DataFrame can be operated on much like an RDD, and it can also be registered as a temporary table. Once a DataFrame has been registered as a temporary table, you can run SQL queries against it.

Parquet is Spark SQL's default data source format. When the data source is a Parquet file, Spark SQL can perform all of its operations on it without any extra configuration.

You can change the default data source format by modifying the configuration item spark.sql.sources.default.

scala> val df = spark.read.load("hdfs://hadoop001:9000/namesAndAges.parquet")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.select("name").write.save("names.parquet")
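
A minimal sketch of switching the default format at runtime (the configuration call and the JSON path here are only illustrative):

// Assumption: switch the session's default data source from parquet to json
spark.conf.set("spark.sql.sources.default", "json")

// load() and save() without an explicit format now use JSON
val peopleJsonDF = spark.read.load("hdfs://hadoop001:9000/people.json")

// Reset the default to parquet so the remaining examples behave as shown
spark.conf.set("spark.sql.sources.default", "parquet")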

When the data source is not a Parquet file, you need to specify the format manually. For a custom data source, give the fully qualified name (for example org.apache.spark.sql.parquet); for the built-in sources, the short names json, parquet, jdbc, orc, libsvm, csv, and text are enough.

For generic loading, use the read.load method provided by SparkSession; for saving, use write followed by save.

scala> val peopleDF = spark.read.format("json").load("hdfs://hadoop001:9000/people.json")
peopleDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string]          

scala> peopleDF.write.format("parquet").save("hdfs://hadoop001:9000/namesAndAges.parquet")
scala>

In addition, you can run SQL directly against the file:

val sqlDF = spark.sql("SELECT * FROM parquet.`hdfs://hadoop001:9000/namesAndAges.parquet`")
sqlDF.show()

File Saving Options

A SaveMode can be supplied when performing a save operation; it defines how existing data at the target is handled. Note that these save modes do not use any locking and are not atomic. In addition, when Overwrite mode is used, the original data is deleted before the new data is written out. The SaveMode values are detailed in the following table:

Scala/Java                         Any language        Meaning
SaveMode.ErrorIfExists (default)   "error" (default)   Throw an error if data already exists at the target
SaveMode.Append                    "append"            Append the new data to the existing data
SaveMode.Overwrite                 "overwrite"         Overwrite the existing data with the new data
SaveMode.Ignore                    "ignore"            If data already exists, silently skip the save
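
For example, reusing the DataFrame from the examples above, a save mode can be passed either as the enum value or as its string name (a sketch; the target path is only illustrative):

import org.apache.spark.sql.SaveMode

// Overwrite whatever already exists at the target path
peopleDF.write.mode(SaveMode.Overwrite).parquet("hdfs://hadoop001:9000/namesAndAges.parquet")

// The mode can also be given by its string name, e.g. append new rows instead
peopleDF.write.mode("append").parquet("hdfs://hadoop001:9000/namesAndAges.parquet")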

Parquet files

Reading and writing Parquet

The Parquet format is widely used in the Hadoop ecosystem, and it supports all Spark SQL data types. Spark SQL provides methods for reading and writing Parquet files directly.

// Encoders for most common types are automatically provided by importing spark.implicits._
import spark.implicits._

val peopleDF = spark.read.json("examples/src/main/resources/people.json")

// DataFrames can be saved as Parquet files, maintaining the schema information
peopleDF.write.parquet("hdfs://hadoop001:9000/people.parquet")

// Read in the parquet file created above
// Parquet files are self-describing so the schema is preserved
// The result of loading a Parquet file is also a DataFrame
val parquetFileDF = spark.read.parquet("hdfs://hadoop001:9000/people.parquet")

// Parquet files can also be used to create a temporary view and then used in SQL statements
parquetFileDF.createOrReplaceTempView("parquetFile")
val namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19")
namesDF.map(attributes => "Name: " + attributes(0)).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

Resolving partition information

Partitioning a table is one of the common ways to optimize data layout. In a partitioned table, data is stored in different directories according to the values of the partitioning columns. The Parquet data source can automatically discover and parse this partition information. For example, population data partitioned by gender and country would be stored with the following directory structure:

path
└── to
    └── table
        ├── gender=male
        │   ├── ...
        │   │
        │   ├── country=US
        │   │   └── data.parquet
        │   ├── country=CN
        │   │   └── data.parquet
        │   └── ...
        └── gender=female
            ├── ...
            │
            ├── country=US
            │   └── data.parquet
            ├── country=CN
            │   └── data.parquet
            └── ...

By passing path/to/table to either SQLContext.read.parquet or SQLContext.read.load, Spark SQL automatically extracts the partition information.
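
A minimal sketch of reading such a table, assuming the layout above is stored at hdfs://hadoop001:9000/path/to/table (an illustrative path):

// gender and country are discovered from the directory names and added as columns
val populationDF = spark.read.parquet("hdfs://hadoop001:9000/path/to/table")
populationDF.printSchema()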

The schema of the returned DataFrame is as follows:

root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- gender: string (nullable = true)
|-- country: string (nullable = true)

Note that the data types of the partitioning columns are inferred automatically; currently, numeric types and string types are supported. Automatic type inference for partition columns is controlled by the following parameter:

spark.sql.sources.partitionColumnTypeInference.enabled (default: true).

If you want to turn this feature off, simply set the parameter to false. In that case, partition columns are always treated as strings and their types are no longer inferred.
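
A minimal sketch of turning the inference off before reading (the path is again illustrative):

// With inference disabled, every partition column is read as a string,
// even if its values look numeric (e.g. key=1 becomes the string "1")
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
val stringPartitionsDF = spark.read.parquet("hdfs://hadoop001:9000/path/to/table")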

Schema merging

Like ProtocolBuffer, Avro, and Thrift, Parquet supports schema evolution. Users can start with a simple schema and gradually add more columns as needed, which may leave them with multiple Parquet files whose schemas are different but mutually compatible. The Parquet data source can automatically detect this situation and merge the schemas of these files.

Because schema merging is a relatively expensive operation and is unnecessary in most cases, it has been turned off by default since Spark SQL 1.5.0. The feature can be turned on in either of two ways:

Set the data source option mergeSchema to true when reading Parquet files, or

Set the global SQL option spark.sql.parquet.mergeSchema to true.
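
The global option can also be set at runtime on the session, which is equivalent to passing mergeSchema on every Parquet read:

spark.conf.set("spark.sql.parquet.mergeSchema", "true")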

// spark is the SparkSession from the previous examples; sc is its SparkContext.
// This import is used to implicitly convert an RDD to a DataFrame.
import spark.implicits._

// Create a simple DataFrame, stored into a partition directory
val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.write.parquet("hdfs://hadoop001:9000/data/test_table/key=1")

// Create another DataFrame in a new partition directory,
// adding a new column and dropping an existing column
val df2 = sc.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
df2.write.parquet("hdfs://hadoop001:9000/data/test_table/key=2")

// Read the partitioned table
val df3 = spark.read.option("mergeSchema", "true").parquet("hdfs://hadoop001:9000/data/test_table")
df3.printSchema()

// The final schema consists of all 3 columns in the Parquet files together
// with the partitioning column that appears in the partition directory paths.
// root
// |-- single: int (nullable = true)
// |-- double: int (nullable = true)
// |-- triple: int (nullable = true)
// |-- key : int (nullable = true)

Hive data source

Apache Hive is the SQL engine on Hadoop, and Spark SQL can be compiled with or without Hive support. Spark SQL built with Hive support can access Hive tables, UDFs (user-defined functions), the Hive Query Language (HiveQL/HQL), and so on. It should be emphasized that including the Hive libraries in Spark SQL does not require Hive to be installed. In general, it is best to build Spark SQL with Hive support so that all of these features are available. If you downloaded a pre-built binary version of Spark, it should already have been compiled with Hive support.

If you want Spark SQL to connect to an existing Hive deployment, you must copy the hive-site.xml configuration file into Spark's configuration directory ($SPARK_HOME/conf). Even without a deployed Hive, Spark SQL still runs.

Note that if you have not deployed Hive, Spark SQL will create its own Hive metastore in the current working directory, in a directory called metastore_db. Also, if you use the HiveQL CREATE TABLE statement (not CREATE EXTERNAL TABLE) to create tables, they are placed in the /user/hive/warehouse directory of your default file system (if a correct hdfs-site.xml is on the classpath, the default file system is HDFS; otherwise it is the local file system).

import java.io.File
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

case class Record(key: Int, value: String)

// warehouseLocation points to the default location for managed databases and tables
val warehouseLocation = new File("spark-warehouse").getAbsolutePath

val spark = SparkSession
.builder()
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()

import spark.implicits._
import spark.sql

sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sql("SELECT * FROM src").show()
// +---+-------+
// |key|  value|
// +---+-------+
// |238|val_238|
// | 86| val_86|
// |311|val_311|
// ...

// Aggregation queries are also supported.
sql("SELECT COUNT(*) FROM src").show()
// +--------+
// |count(1)|
// +--------+
// |    500 |
// +--------+

// The results of SQL queries are themselves DataFrames and support all normal functions.
val sqlDF = sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key")

// The items in DataFrames are of type Row, which allows you to access each column by ordinal.
val stringsDS = sqlDF.map {
case Row(key: Int, value: String) => s"Key: $key, Value: $value"
}
stringsDS.show()
// +--------------------+
// |               value|
// +--------------------+
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// ...

// You can also use DataFrames to create temporary views within a SparkSession.
val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.createOrReplaceTempView("records")

// Queries can then join DataFrame data with data stored in Hive.
sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show()
// +---+------+---+------+
// |key| value|key| value|
// +---+------+---+------+
// |  2| val_2|  2| val_2|
// |  4| val_4|  4| val_4|
// |  5| val_5|  5| val_5|
// ...

Using the embedded Hive

If you are using the embedded Hive, nothing extra is required; it works out of the box. If needed, the warehouse location can be specified with --conf:

spark.sql.warehouse.dir=

Note: when using the internal Hive, spark.sql.warehouse.dir specifies the location of the data warehouse since Spark 2.0. If you want to use HDFS as the warehouse path, you need to add core-site.xml and hdfs-site.xml to Spark's conf directory; otherwise the warehouse directory is created on the master node's local file system, and queries will fail with file-not-found errors. To switch to HDFS you also need to delete the existing metastore and restart the cluster.
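
For example, the warehouse can be pointed at HDFS when starting the shell (a sketch; the HDFS path below is only illustrative):

$ bin/spark-shell --conf spark.sql.warehouse.dir=hdfs://hadoop001:9000/spark-warehouse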

Using an external Hive

If you want to connect to an external, already deployed Hive, follow these steps:

a. Copy (or symlink) Hive's hive-site.xml into the conf directory under the Spark installation directory.

b. Open the spark shell, including the JDBC driver needed to access the database that backs the Hive metastore.

$ bin/spark-shell --master spark://hadoop001:7077 --jars mysql-connector-java-5.1.27-bin.jar
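
Once the shell is up, the tables of the connected Hive should be visible directly from Spark SQL, for example:

scala> spark.sql("SHOW TABLES").show()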

JSON datasets

Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset[Row]. Use SparkSession.read.json() to load either a Dataset[String] or a JSON file. Note that such a file is not a conventional JSON file: each line must be a separate, self-contained JSON string.

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
// Primitive types (Int, String, etc) and Product types (case classes) encoders are
// supported by importing this when creating a Dataset.
import spark.implicits._

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files
val path = "examples/src/main/resources/people.json"
val peopleDF = spark.read.json(path)

// The inferred schema can be visualized using the printSchema() method
peopleDF.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by spark
val teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()
// +------+
// |  name|
// +------+
// |Justin|
// +------+

// Alternatively, a DataFrame can be created for a JSON dataset represented by
// a Dataset[String] storing one JSON object per string
val otherPeopleDataset = spark.createDataset(
"""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val otherPeople = spark.read.json(otherPeopleDataset)
otherPeople.show()
// +---------------+----+
// |        address|name|
// +---------------+----+
// |[Columbus,Ohio]| Yin|
// +---------------+----+

JDBC

Spark SQL can create a DataFrame by reading data from a relational database via JDBC, perform a series of computations on the DataFrame, and then write the data back to the relational database.

Note that the relevant database driver needs to be placed on Spark's classpath.

$ bin/spark-shell --master spark://hadoop001:7077 --jars mysql-connector-java-5.1.27-bin.jar
// Note: JDBC loading and saving can be achieved via either the load/save or jdbc methods
// Loading data from a JDBC source
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:mysql://hadoop001:3306/rdd")
.option("dbtable", "rddtable")
.option("user", "root")
.option("password", "hive")
.load()

import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.put("user", "root")
connectionProperties.put("password", "hive")
val jdbcDF2 = spark.read
.jdbc("jdbc:mysql://hadoop001:3306/rdd", "rddtable", connectionProperties)

// Saving data to a JDBC source
jdbcDF.write
.format("jdbc")
.option("url", "jdbc:mysql://hadoop001:3306/rdd")
.option("dbtable", "rddtable2")
.option("user", "root")
.option("password", "hive")
.save()

jdbcDF2.write
.jdbc("jdbc:mysql://hadoop001:3306/mysql", "db", connectionProperties)

// Specifying create table column data types on write
jdbcDF.write
.option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
.jdbc("jdbc:mysql://hadoop001:3306/mysql", "db", connectionProperties)

Origin blog.51cto.com/14309075/2411816