Spark SQL from entry to proficiency

This article aims to take you from getting started with Spark SQL all the way to mastering it. It is fairly long and content-rich, so you may want to bookmark it and read it carefully.


A bit of history

Anyone familiar with Spark SQL knows that it grew out of Shark. To remain compatible with Hive, Shark reused Hive's HQL parsing, logical plan translation, and execution plan optimization; roughly speaking, only the physical execution plan was replaced, turning MapReduce jobs into Spark jobs (supplemented by in-memory columnar storage and other optimizations that have little to do with Hive).
At the same time, Shark still relied on the Hive Metastore and Hive SerDes (for compatibility with the various existing Hive storage formats).
Spark SQL, by contrast, depends on Hive only at the compatibility level: the HQL parser, the Hive Metastore, and Hive SerDes. In other words, once HQL has been parsed into an abstract syntax tree (AST), everything else is handled by Spark SQL itself. Catalyst is responsible for generating and optimizing the execution plan, and thanks to Scala's pattern matching and other functional-language features, writing execution plan optimization rules with Catalyst is far more concise than it is in Hive.
Spark SQL
Spark SQL provides several kinds of interfaces:

  1. Pure SQL text

  2. The Dataset/DataFrame API

Correspondingly, there are several clients:

For SQL text, you can use the thriftserver or the spark-sql shell.

For coding, use the DataFrame/Dataset/SQL APIs.

Introduction to the DataFrame/Dataset API

A DataFrame/Dataset is also a distributed collection of data, but unlike an RDD it carries schema information, which makes it behave much like a table.
The following figure compares Dataset/DataFrame with RDD in more detail:
[Figure: comparison of RDD, DataFrame, and Dataset]
Dataset was introduced in Spark 1.6 to combine the benefits of RDDs (strong typing, powerful lambda functions) with Spark SQL's optimized execution engine. Since Spark 2.0, DataFrame is simply a Dataset of Row:


type DataFrame = Dataset[Row]

This is why code written for Spark 1.6 and earlier often fails to compile against Spark 2+ with an error saying the DataFrame class cannot be found.
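
To make the difference concrete, here is a minimal sketch (the Person case class and the sample records are made up for illustration) that builds the same data as an RDD, as a typed Dataset, and as a DataFrame:


import org.apache.spark.sql.{DataFrame, Dataset}

// Hypothetical case class used only for this example
case class Person(name: String, age: Long)

// An RDD carries no schema: Spark only sees opaque Person objects
val rdd = spark.sparkContext.parallelize(Seq(Person("Andy", 32), Person("Justin", 19)))

import spark.implicits._
val ds: Dataset[Person] = rdd.toDS()   // strongly typed, carries a schema Catalyst can optimize against
val df: DataFrame = ds.toDF()          // a DataFrame is just Dataset[Row]
ds.printSchema()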

Basic operations


val df = spark.read.json("file:///opt/meitu/bigdata/src/main/data/people.json")
df.show()
import spark.implicits._          // enables the $"colName" column syntax used below
df.printSchema()
df.select("name").show()
df.select($"name", $"age" + 1).show()   // column expressions
df.filter($"age" > 21).show()
df.groupBy("age").count().show()
spark.stop()

Bucketing and sorting


// Save as a bucketed, sorted Hive table
df.write.bucketBy(42, "name").sortBy("age").saveAsTable("people_bucketed")
// Write out as parquet, partitioned, to the given directory
df.write.partitionBy("favorite_color").format("parquet").save("namesPartByColor.parquet")
// Save as a Hive table that is both partitioned and bucketed
df.write.partitionBy("favorite_color").bucketBy(42, "name").saveAsTable("users_partitioned_bucketed")

cube, rollup, and pivot


// cube: aggregates over every combination of the grouping columns (city+year, city, year, grand total)
sales.cube("city", "year").agg(sum("amount") as "amount").show()
// rollup: aggregates over the hierarchy of the grouping columns (city+year, city, grand total)
sales.rollup("city", "year").agg(sum("amount") as "amount").show()
// pivot can only follow groupBy
sales.groupBy("year").pivot("city", Seq("Warsaw", "Boston", "Toronto")).agg(sum("amount") as "amount").show()

SQL programming

Spark SQL lets users submit SQL text, and supports the following three ways of doing so:

  1. Spark code
  2. The spark-sql shell
  3. The thriftserver

All three support Spark SQL's own syntax and are also compatible with HQL.

    1. Coding

    First construct an SQLContext or SparkSession; this is the coding entry point for Spark SQL. Earlier versions used SQLContext or HiveContext; since Spark 2, SparkSession is recommended.


1. SQLContext
new SQLContext(sparkContext)

2. HiveContext
new HiveContext(spark.sparkContext)

3. SparkSession
// without Hive metastore support:
val spark = SparkSession.builder()
 .config(sparkConf).getOrCreate()
// with Hive metastore support:
val spark = SparkSession.builder()
 .config(sparkConf).enableHiveSupport().getOrCreate()

Usage


val df =spark.read.json("examples/src/main/resources/people.json") 
df.createOrReplaceTempView("people") 
spark.sql("SELECT * FROM people").show()

2. spark-sql script

When spark-sql starts you can set the deploy mode, resources, and so on, just as with spark-submit; run
bin/spark-sql --help to see the configuration parameters.
You need to put hive-site.xml into the ${SPARK_HOME}/conf/ directory, and then you can run a quick test:


show tables;

select count(*) from student;

3. thriftserver

The thriftserver's JDBC/ODBC implementation is similar to HiveServer2 in Hive 1.2.1. You can use Spark's beeline command to test the JDBC server (a small JDBC test from code is also sketched after the setup steps below).


Installation and deployment:
1). Start the Hive metastore
bin/hive --service metastore
2). Copy the Hive configuration files into the spark/conf/ directory
3). Start the thriftserver
sbin/start-thriftserver.sh --master yarn --deploy-mode client
   (on YARN only client mode is supported)
4). Start bin/beeline
5). Connect to the thriftserver
!connect jdbc:hive2://localhost:10001
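
Besides beeline, you can also test the thriftserver over plain JDBC from code. A minimal sketch, assuming the hive-jdbc driver is on the classpath and using the same port as the beeline example above (user and password are placeholders):


import java.sql.DriverManager

// org.apache.hive.jdbc.HiveDriver is provided by the hive-jdbc dependency (assumed to be on the classpath)
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10001", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("show tables")
while (rs.next()) println(rs.getString(1))
rs.close(); stmt.close(); conn.close()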

User-defined functions

1. UDF

Defining a UDF is very simple. For example, here is a custom UDF that returns the length of a string:


val len = udf{(str:String) => str.length}
spark.udf.register("len",len)
val ds =spark.read.json("file:///opt/meitu/bigdata/src/main/data/employees.json")
ds.createOrReplaceTempView("employees")
ds.show()
spark.sql("select len(name) from employees").show()

2. UserDefinedAggregateFunction

Define a UDAF


import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.types._

object MyAverageUDAF extends UserDefinedAggregateFunction {
 // Data types of the input arguments of this aggregate function
 def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil)
 // Data types of the values in the aggregation buffer
 def bufferSchema: StructType = {
   StructType(StructField("sum", LongType) :: StructField("count", LongType) :: Nil)
 }
 // The data type of the returned value
 def dataType: DataType = DoubleType
 // Whether this function always returns the same output for identical input
 def deterministic: Boolean = true
 // Initializes the given aggregation buffer. The buffer itself is a `Row` that, in addition to
 // standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides
 // the opportunity to update its values. Note that arrays and maps inside the buffer are still
 // immutable.
 def initialize(buffer: MutableAggregationBuffer): Unit = {
   buffer(0) = 0L
   buffer(1) = 0L
 }
 // Updates the given aggregation buffer `buffer` with new input data from `input`
 def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
   if (!input.isNullAt(0)) {
     buffer(0) = buffer.getLong(0) + input.getLong(0)
     buffer(1) = buffer.getLong(1) + 1
   }
 }
 // Merges two aggregation buffers and stores the updated buffer values back to `buffer1`
 def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
   buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
   buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
 }
 // Calculates the final result
 def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}

Using the UDAF


val ds = spark.read.json("file:///opt/meitu/bigdata/src/main/data/employees.json")
ds.createOrReplaceTempView("employees")
ds.show()
spark.udf.register("myAverage", MyAverageUDAF)
val result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees")
result.show()

3. Aggregator

Define an Aggregator


import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)

object MyAverageAggregator extends Aggregator[Employee, Average, Double] {

 // A zero value for this aggregation. Should satisfy the property that any b + zero = b
 def zero: Average = Average(0L, 0L)
 // Combine two values to produce a new value. For performance, the function may modify `buffer`
 // and return it instead of constructing a new object
 def reduce(buffer: Average, employee: Employee): Average = {
   buffer.sum += employee.salary
   buffer.count += 1
   buffer
 }
 // Merge two intermediate values
 def merge(b1: Average, b2: Average): Average = {
   b1.sum += b2.sum
   b1.count += b2.count
   b1
 }
 // Transform the output of the reduction
 def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
 // Specifies the Encoder for the intermediate value type
 def bufferEncoder: Encoder[Average] = Encoders.product
 // Specifies the Encoder for the final output value type
 def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

Usage


spark.udf.register("myAverage2", MyAverageAggregator)
import spark.implicits._
val ds = spark.read.json("file:///opt/meitu/bigdata/src/main/data/employees.json").as[Employee]
ds.show()
val averageSalary = MyAverageAggregator.toColumn.name("average_salary")
val result = ds.select(averageSalary)
result.show() 

Data sources

  1. The general load/save functions
    support multiple data formats: json, parquet, jdbc, orc, libsvm, csv, text

val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")

The default format is parquet; you can change it with the spark.sql.sources.default configuration.
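
For example, a minimal sketch of switching the default to ORC at runtime (shown only to illustrate the config key; the output path is a placeholder):


// After this, load()/save() without an explicit format() read and write ORC instead of parquet
spark.conf.set("spark.sql.sources.default", "orc")
peopleDF.write.save("/tmp/people_default_orc")
val orcDF = spark.read.load("/tmp/people_default_orc")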

  2. Parquet files

val parquetFileDF =spark.read.parquet("people.parquet") 
peopleDF.write.parquet("people.parquet")
  3. ORC files

val ds = spark.read.json("file:///opt/meitu/bigdata/src/main/data/employees.json")
ds.write.mode("append").orc("/opt/outputorc/")
spark.read.orc("/opt/outputorc/*").show(1)
  4. JSON

ds.write.mode("overwrite").json("/opt/outputjson/")
spark.read.json("/opt/outputjson/*").show()
  5. Hive tables

Spark 1.6 and earlier versions require a HiveContext to work with Hive tables.

Since Spark 2, you only need to create a SparkSession with enableHiveSupport():


val spark = SparkSession
.builder()
.config(sparkConf)
.enableHiveSupport()
.getOrCreate()

spark.sql("select count(*) from student").show()
  6. JDBC

Write to MySQL


wcdf.repartition(1).write.mode("append").option("user", "root")
 .option("password", "mdh2018@#").jdbc("jdbc:mysql://localhost:3306/test","alluxio",new Properties())

Read from MySQL


val fromMysql = spark.read.option("user", "root")
 .option("password", "mdh2018@#").jdbc("jdbc:mysql://localhost:3306/test","alluxio",new Properties())
  7. Custom data sources

Writing a custom source is relatively simple; first we need to understand how sources are loaded.

In the package that will be passed to format(), define a class named DefaultSource and implement the custom source logic there. That is enough to achieve our goal.


import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}

class DefaultSource  extends DataSourceV2 with ReadSupport {

 def createReader(options: DataSourceOptions) = new SimpleDataSourceReader()
}

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.reader.{DataReaderFactory, DataSourceReader}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class SimpleDataSourceReader extends DataSourceReader {

 def readSchema() = StructType(Array(StructField("value", StringType)))

 def createDataReaderFactories = {
   val factoryList = new java.util.ArrayList[DataReaderFactory[Row]]
   factoryList.add(new SimpleDataSourceReaderFactory())
   factoryList
 }
}

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory}

class SimpleDataSourceReaderFactory extends
 DataReaderFactory[Row] with DataReader[Row] {
 def createDataReader = new SimpleDataSourceReaderFactory()   // the factory doubles as the reader in this simple example
 val values = Array("1", "2", "3", "4", "5")

 var index = 0

 def next = index < values.length

 def get = {
   val row = Row(values(index))
   index = index + 1
   row
 }

 def close(): Unit = {}
}

Usage


val simpleDf = spark.read
 .format("bigdata.spark.SparkSQL.DataSources")   // the package that contains the DefaultSource class
 .load()

simpleDf.show()

Optimizer and execution plan

1. Introduction to the process

The overall execution flow is as follows:
[Figure: Spark SQL query execution flow]
Starting from the input API (SQL, Dataset, or DataFrame), a query passes through the unresolved logical plan, the resolved logical plan, the optimized logical plan, and candidate physical plans; one physical plan is then selected for execution based on cost. (You can inspect each of these stages with explain; see the sketch after the list below.)
Simplified, there are four parts:


1). analysis

Since Spark 2.0 the syntax tree is generated with ANTLR4; before that, Scala parser combinators were used.

2). logical optimization

Constant folding, predicate pushdown, column pruning, boolean expression simplification, and other rules.

3). physical planning

e.g. SortExec

4). Codegen

Codegen uses Scala string interpolation to generate source code, which is then compiled into Java bytecode with Janino. E.g. SortExec.
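
You can see all of these stages for a query by calling explain with the extended flag. A minimal sketch, reusing the student table from earlier:


// Prints the parsed (unresolved) logical plan, the analyzed logical plan,
// the optimized logical plan, and the selected physical plan
spark.sql("select count(*) from student").explain(true)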

2. Custom Optimizer

1). Implement
inherit Rule[LogicalPlan] (a minimal sketch is given at the end of this subsection)
2). Register


spark.experimental.extraOptimizations= Seq(MultiplyOptimizationRule)

3). Use


selectExpr("amountPaid* 1")
  3. Custom execution plan
    The goal here is to override how the count function is executed.
    1). Physical plan:
    inherit SparkPlan and implement the doExecute method
    2). Planning strategy:
    inherit SparkStrategy and implement apply (a minimal skeleton is sketched at the end of this subsection)
    3). Register it as a Spark execution strategy:

spark.experimental.extraStrategies =Seq(countStrategy)

4). Use


spark.sql("select count(*) fromtest")


Origin: blog.51cto.com/15127544/2665112