Big Data Technologies: Spark SQL
One: Spark SQL Overview
-
Definition: Spark SQL is a Spark module for processing structured data. It provides two programming abstractions, DataFrame and DataSet, and acts as a distributed SQL query engine.
-
Features: easy integration, unified data access, Hive compatibility, standard data connectivity.
-
DataFrame definition: like an RDD, a DataFrame is a distributed data container. However, a DataFrame is more like a two-dimensional database table: in addition to the data itself, it also records the data's structure information, i.e. the schema.
-
DataSet definition: DataSet is an extension of the DataFrame API and is Spark's newest data abstraction.
1) It extends the DataFrame API and is the latest data abstraction in Spark.
2) It offers a user-friendly API style, with both compile-time type safety and the query-optimization characteristics of DataFrame.
3) DataSet supports encoders, which avoid deserializing whole objects when accessing off-heap data, improving efficiency.
4) A case class is used to define the structure information of a DataSet; each attribute of the case class maps directly to a field name in the DataSet.
5) DataFrame is a special DataSet whose column type is Row: DataFrame = Dataset[Row]. A DataFrame can therefore be converted to a Dataset with the as method. Row is a type, just like Car or Person; all table-structure information is represented with Row.
6) DataSet is strongly typed. For example, there can be a Dataset[Car] or a Dataset[Person].
7) A DataFrame only knows the field names, not the field types, so operations cannot be type-checked at compile time: for example, you can subtract from a String column, and the error is only reported at execution time. A DataSet knows both the field names and the field types, so it provides stricter error checking. The relationship is analogous to that between JSON objects and class objects.
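The compile-time difference described in point 7 can be sketched as follows. This is a minimal spark-shell-style sketch, not code from the original text; it assumes a running SparkSession named spark with its implicits in scope:

```scala
// Assumes a running SparkSession named `spark` (as in spark-shell).
import spark.implicits._

case class Person(name: String, age: Long)

val ds = Seq(Person("Andy", 32), Person("Justin", 19)).toDS()
val df = ds.toDF()

// DataFrame: columns are known only by name; a type error such as
// subtracting from a string column is detected only when the query is
// analyzed or executed, not at compile time:
// df.select($"name" - 1)      // compiles fine, fails at runtime

// Dataset: fields are statically typed, so the same mistake is rejected
// by the Scala compiler:
// ds.map(p => p.name - 1)     // does not compile: String has no `-`
ds.map(p => p.age + 1).show()  // compiles and runs: age is a Long
```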
Two: Spark SQL Programming
In older versions, Spark SQL provided two SQL query entry points: one named SQLContext, for Spark's own SQL queries, and one named HiveContext, for connecting to and querying Hive.
SparkSession is the newest SQL query entry point in Spark. It is essentially a combination of SQLContext and HiveContext, so the APIs available on SQLContext and HiveContext can also be used on SparkSession. SparkSession internally encapsulates a SparkContext, so the computation is actually done by the SparkContext.
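As a sketch of the last point (the app name is illustrative, not from the original text), the SparkContext wrapped by a SparkSession can be reached directly:

```scala
import org.apache.spark.sql.SparkSession

// SparkSession is the single entry point; it encapsulates a SparkContext.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("entry-point sketch")
  .getOrCreate()

// The underlying SparkContext that actually performs the computation:
val sc = spark.sparkContext
println(sc.appName)
spark.stop()
```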
-
DataFrame: in Spark SQL, SparkSession is the entry point for creating DataFrames and executing SQL. There are three ways to create a DataFrame: from a Spark data source; by conversion from an existing RDD; or as the result of a Hive table query.
1) Create from a data source
1. View the file formats from which a Spark data source can be created: spark.read
2. Read a JSON file to create a DataFrame:
scala> val df = spark.read.json("/opt/module/spark/examples/src/main/resources/people.json")
2) Convert from an existing RDD
3) Create from a Hive table query
The third approach is the focus of a later chapter.
-
SQL syntax style
1. Create a DataFrame:
scala> val df = spark.read.json("/opt/module/spark/examples/src/main/resources/people.json")
2. Create a temporary view on the DataFrame:
scala> df.createOrReplaceTempView("people")
Note: a temporary view is session-scoped; once the session ends, the view is gone. If a view should remain valid across the whole application, use a global view instead. A global view must be accessed through its fully-qualified name, e.g. global_temp.people.
3. Create a global view on the DataFrame:
scala> df.createGlobalTempView("people")
4. Query the whole table with a SQL statement:
scala> spark.sql("SELECT * FROM global_temp.people").show()
scala> spark.newSession().sql("SELECT * FROM global_temp.people").show()
-
RDD converted to DataFrame
-
Note: to operate between RDDs and DataFrames/DataSets, you need to import spark.implicits._ [here spark is not a package name but the name of the SparkSession object].
Preconditions: import the implicit conversions and create an RDD.
scala> import spark.implicits._
import spark.implicits._
scala> val peopleRDD = sc.textFile("examples/src/main/resources/people.txt")
peopleRDD: org.apache.spark.rdd.RDD[String] = examples/src/main/resources/people.txt MapPartitionsRDD[3] at textFile at <console>:27
1) Convert by specifying the schema manually:
scala> peopleRDD.map{x => val para = x.split(","); (para(0), para(1).trim.toInt)}.toDF("name","age")
-
-
DataFrame converted to RDD
-
The rdd method can be called directly.
1) Create a DataFrame:
scala> val df = spark.read.json("/opt/module/spark/examples/src/main/resources/people.json")
2) Convert the DataFrame to an RDD:
scala> val dfToRDD = df.rdd
-
Three: DataSet: a DataSet is a strongly typed collection of data, so the corresponding type information must be provided.
-
Create a DataSet
1. Create a case class:
case class Person(name: String, age: Long)
2. Create the DataSet:
val caseClassDS = Seq(Person("Andy", 32)).toDS()
- RDD converted to DataSet
Spark SQL can automatically convert an RDD that contains case classes into a DataFrame: the case class defines the structure of the table, and the case class's attributes become the table's column names via reflection.
1. Create an RDD:
val peopleRDD = sc.textFile("examples/src/main/resources/people.txt")
2. Create a case class:
case class Person(name: String, age: Long)
3. Convert the RDD to a DataSet:
scala> peopleRDD.map(line => {val para = line.split(","); Person(para(0), para(1).trim.toInt)}).toDS()
- DataSet converted to an RDD: simply call the rdd method
1) Create a DataSet:
scala> val DS = Seq(Person("Andy", 32)).toDS()
2) Convert the DataSet to an RDD:
scala> DS.rdd
- Converting between DataFrame and DataSet
1. DataFrame converted to a DataSet
1) Create a DataFrame:
scala> val df = spark.read.json("examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
2) Create a case class:
scala> case class Person(name: String, age: Long)
defined class Person
3) Convert the DataFrame to a DataSet:
scala> df.as[Person]
res14: org.apache.spark.sql.Dataset[Person] = [age: bigint, name: string]
2. DataSet converted to a DataFrame
1) Create a case class:
scala> case class Person(name: String, age: Long)
defined class Person
2) Create a DataSet:
scala> val ds = Seq(Person("Andy", 32)).toDS()
ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
3) Convert the DataSet to a DataFrame:
scala> val df = ds.toDF
Summary: converting a DataSet to a DataFrame is very simple: since the Row is already wrapped in a case class, you only need to import the implicit conversions and call toDF. Converting a DataFrame to a DataSet is also relatively simple: after importing the implicit conversions, call as[T], where the case class T makes the concrete data type of each field explicit.
-
What the three have in common
1. RDD, DataFrame, and Dataset are all distributed, resilient datasets on the Spark platform, convenient for processing very large data.
2. All three are lazily evaluated: transformations such as map are not executed at creation time; the traversal only begins when an action such as foreach is encountered.
3. All three are automatically cached in memory by Spark as the situation requires, so even with large amounts of data there is no need to worry about memory overflow.
4. All three have the concept of partitions.
5. All three share many common functions, such as filter and sorting.
6. Many operations on DataFrame and Dataset require importing this package: import spark.implicits._
7. Both DataFrame and Dataset can use pattern matching to obtain the value and type of each field.
-
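Point 7 can be sketched with the Person case class used elsewhere in this document (a sketch assuming a running SparkSession named spark):

```scala
// Assumes `spark` as in spark-shell.
import spark.implicits._

case class Person(name: String, age: Long)

val ds = Seq(Person("Andy", 32), Person("Justin", 19)).toDS()

// Pattern matching yields each field's value with its static type:
ds.map { case Person(name, age) => s"$name is $age years old" }.show()
```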
Differences between the three
1. RDD:
1) RDDs are generally used together with Spark MLlib.
2) RDDs do not support Spark SQL operations.
2. DataFrame:
1) Unlike RDD and Dataset, the type of each DataFrame row is fixed to Row; the value of each column cannot be accessed directly and can only be obtained by parsing the fields.
2) DataFrame and Dataset are generally not used together with Spark MLlib.
3) DataFrame and Dataset support Spark SQL operations such as select and groupBy, and can also be registered as temporary tables/views to execute SQL statements.
4) DataFrame and Dataset support some particularly convenient save methods, such as saving to CSV with a header, so that the name of each column is clear at a glance.
3. DataSet:
1) Dataset and DataFrame have exactly the same member functions; the only difference is the data type of each row.
2) A DataFrame can also be written Dataset[Row]: the type of each row is Row, which is unresolved, so which fields each row has, and what type each field is, cannot be known; a specific field can only be taken out with getAs or with the pattern matching mentioned in the commonalities above. In a Dataset, the type of each row is not fixed: after defining a case class, the information in each row can be accessed freely. As can be seen, a Dataset is very convenient when you need to access a field of a particular row; however, when writing highly generic functions whose row type is uncertain (it could be any case class), a Dataset cannot achieve that adaptation, and DataFrame, i.e. Dataset[Row], solves the problem better. -
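The adaptability point can be sketched as a helper that accepts any DataFrame regardless of its row shape (a sketch; the function name is illustrative, and a running SparkSession named spark is assumed):

```scala
import org.apache.spark.sql.DataFrame

// Works for ANY row shape, because DataFrame = Dataset[Row] does not fix
// the fields at compile time; values come back as untyped Any.
def firstColumnValues(df: DataFrame): Array[Any] =
  df.collect().map(row => row.get(0))

// The same helper applies to people.json, users.parquet, or any source:
val df = spark.read.json("examples/src/main/resources/people.json")
firstColumnValues(df).foreach(println)
```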
Creating a Spark SQL program and user-defined UDFs in IDEA
package com.ityouxin.sparkSql

import org.apache.spark.sql.SparkSession

object HelloWorld {
  def main(args: Array[String]): Unit = {
    // Create the SparkSession and set the app name
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()
    import spark.implicits._
    val df = spark.read.json("datas/people.json")
    df.show()
    df.filter($"age" > 21).show()
    df.createOrReplaceTempView("persons")
    spark.sql("SELECT * FROM persons where age > 21").show()
    spark.stop()
  }
}
Four: Spark SQL data source
-
General Load / Save method
-
Manually specify options
Spark SQL's DataFrame interface supports operating on a variety of data sources. A DataFrame can be operated on in the same way as RDDs, and can also be registered as a temporary view; once registered, SQL queries can be executed against it.
Spark SQL's default data source is the Parquet format. When the data source is a Parquet file, Spark SQL can easily perform all operations. The default data source format can be changed by modifying the configuration item spark.sql.sources.default.
When the source file is not in Parquet format, the data source format must be specified manually, using its full name (e.g. org.apache.spark.sql.parquet); if the format is built in, the short name is enough: json, parquet, jdbc, orc, libsvm, csv, text. SparkSession's read.load is the general method for loading data, while write and save are used to save data.
val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
val sqlDF = spark.sql("SELECT * FROM parquet.`hdfs://hadoop102:9000/namesAndAges.parquet`")
-
File Saving Options
Storage operations can be performed with a SaveMode, which defines how existing data is handled. Note that these save modes do not use any locking and are not atomic. In addition, when Overwrite mode is used, the original data is deleted before the new data is written. The SaveMode values are detailed in the following table: omitted.
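The modes can be passed to DataFrameWriter.mode; a sketch (output paths are illustrative, assuming `spark` and the people.json file from the earlier examples):

```scala
// Assumes `spark` as in spark-shell.
val df = spark.read.json("examples/src/main/resources/people.json")

// "error"/"errorifexists" is the default; "overwrite" deletes existing
// output before writing the new data (not atomic, no locking):
df.write.mode("overwrite").parquet("hdfs://hadoop102:9000/people.parquet")

// Other modes: "append" adds to existing data, "ignore" skips the write
// if the output already exists:
df.write.mode("append").format("json").save("hdfs://hadoop102:9000/people_json")
```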
-
JSON
Spark SQL can automatically infer the structure of a JSON dataset and load it as a Dataset[Row]. A JSON file can be loaded via SparkSession.read.json().
Note: this is not a traditional JSON file; each line must be a valid JSON string.
val path = "examples/src/main/resources/people.json"
val peopleDF = spark.read.json(path)
peopleDF.printSchema()
- Parquet file
Parquet is a popular columnar storage format that can efficiently store records with nested fields. The Parquet format is often used in the Hadoop ecosystem, and it supports all of Spark SQL's data types. Spark SQL provides methods to read and write the Parquet format directly.
import spark.implicits._
val peopleDF = spark.read.json("examples/src/main/resources/people.json")
peopleDF.write.parquet("hdfs://hadoop102:9000/people.parquet")
val parquetFileDF = spark.read.parquet("hdfs://hadoop102:9000/people.parquet")
parquetFileDF.createOrReplaceTempView("parquetFile")
val namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19")
namesDF.map(attributes => "Name: " + attributes(0)).show()
- JDBC
Spark SQL can create a DataFrame by reading data from a relational database via JDBC; after a series of computations on the DataFrame, the data can also be written back to the relational database.
Note: the relevant database driver class must be placed on Spark's classpath.
import java.util.Properties
// load: the default file format is determined by the parameter spark.sql.sources.default
val df = spark.read.load("datas/users.parquet")
df.show()
// spark.sql("select * from parquet.`datas/users.parquet`").show()
// spark.sql("select * from json.`datas/people.json`").show()
// JDBC
spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/day10")
  .option("dbtable", "user")
  .option("user", "root")
  .option("password", "123")
  .load().show()
println("-------------------")
val option = new Properties
option.setProperty("user", "root")
option.setProperty("password", "123")
spark.read.jdbc("jdbc:mysql://localhost:3306/day10", "user", option).show()
- Hive
Apache Hive is a SQL engine on Hadoop. Spark SQL can be compiled with or without Hive support. Spark SQL compiled with Hive support can access Hive tables, UDFs (user-defined functions), the Hive Query Language (HiveQL/HQL), and so on. It should be emphasized that including the Hive libraries in Spark SQL does not require Hive to be installed. In general, it is best to build Spark SQL with Hive support at compile time, so that these features can be used. If you download a binary build of Spark, it should already have been compiled with Hive support. -