Part 4: Spark - Spark SQL principles and usage

1. Spark SQL overview

1.1 What is Spark SQL

Spark SQL is the Spark module for processing structured data. It provides a programming abstraction called the DataFrame and also acts as a distributed SQL query engine, playing a role similar to Hive.

1.2 Spark SQL features

1. Easy integration: Spark SQL ships with Spark, so nothing needs to be installed separately.
2. Unified data access: JDBC, JSON, Hive, and Parquet files are all read through the same API. Parquet is a columnar storage format; it is the default data source of Spark SQL and is also supported by Hive.
3. Compatibility with Hive: data in Hive can be read directly into Spark SQL for processing. In production, Hive is typically used as the data warehouse, and Spark reads the data from Hive and processes it.
4. Standard data connectivity: JDBC and ODBC.
5. Higher computational efficiency than Hive on MapReduce. Since Hive 2.x, Hive itself recommends Spark as the execution engine.

2. Spark SQL basic principles

2.1 Basic concepts of DataFrame and DataSet

2.1.1 DataFrame

A DataFrame is a dataset organized into named columns. Conceptually it is equivalent to a table in a relational database, with both a table structure and data, but with richer optimizations under the hood. A DataFrame can be constructed from a variety of sources, for example:
structured data files
tables in Hive
external databases or existing RDDs
The DataFrame API is available in Scala, Java, Python, and R.

Compared with an RDD, a DataFrame carries additional structural information about the data, namely the schema. An RDD is a distributed collection of Java objects, whereas a DataFrame is a distributed collection of Row objects. Besides offering richer operators than an RDD, its more important benefit is efficiency: because the schema is known, Spark can read less data and optimize the execution plan.

2.1.2 DataSet

A Dataset is a distributed collection of data. It is an interface added in Spark 1.6 that combines the advantages of RDDs (strong typing, the ability to use powerful lambda functions) with the efficiency of the Spark SQL execution engine. A DataFrame can be seen as a special Dataset, namely Dataset[Row].
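The difference between the typed and the untyped view can be sketched as follows. This is a minimal illustration, not code from the original article: the Student case class, the TypedVsUntyped object, and the sample rows are hypothetical, and a local master is assumed.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical record type used only for this illustration.
case class Student(id: Int, name: String, age: Int)

object TypedVsUntyped {
  // Typed: the lambda receives a Student, so field access is checked at compile time.
  def adults(ds: Dataset[Student]): Dataset[Student] = ds.filter(_.age >= 18)

  // Untyped: DataFrame is a type alias for Dataset[Row]; columns are resolved by name at run time.
  def adultNames(df: DataFrame): Dataset[Row] = df.filter(df("age") >= 18).select("name")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("typed vs untyped").getOrCreate()
    import spark.implicits._

    val ds = Seq(Student(1, "Tom", 20), Student(2, "Mary", 17)).toDS()
    adults(ds).show()
    adultNames(ds.toDF()).show()
    spark.stop()
  }
}
```

A misspelled field in the typed version fails to compile, while a misspelled column name in the untyped version only fails when the query is analyzed at run time.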

2.2 Ways to create a DataFrame

2.2.1 The SparkSession object

Apache Spark 2.0 introduced SparkSession, which gives users a single entry point to Spark's functionality and lets them write programs against the DataFrame and Dataset APIs. Most importantly, it reduces the number of concepts you need to know, which makes Spark easier to work with.
Before 2.0, you had to create a SparkConf and a SparkContext before interacting with Spark. From Spark 2.0 on, SparkSession provides the same functionality without explicitly creating a SparkConf, SparkContext, or SQLContext, because these objects are encapsulated in the SparkSession.
Note that in the Spark version used here, creating a SQLContext directly with new SQLContext() is deprecated (IDEA flags it as deprecated); the recommended way is to obtain the SQLContext from a SparkSession.
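As a minimal sketch of the single entry point, assuming a local master: the object name SessionEntryPoint and the application name are hypothetical, chosen only for this illustration.

```scala
import org.apache.spark.sql.SparkSession

object SessionEntryPoint {
  // Build (or reuse) a session; "local[*]" and the app name are illustrative choices.
  def build(): SparkSession =
    SparkSession.builder().master("local[*]").appName("session entry point").getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = build()
    // The older entry points remain reachable from the session.
    val sc = spark.sparkContext        // SparkContext, e.g. for textFile
    val sqlContext = spark.sqlContext  // SQLContext, without calling the deprecated constructor
    println(sc.appName)
    println(sqlContext.sparkSession eq spark)
    spark.stop()
  }
}
```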

2.2.2 Using a case class

This approach is the more common one in Scala, because case classes are a Scala feature.

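import org.apache.spark.sql.SparkSession

/**
The table t_stu has the structure:
id name age
*/

object CreateDF {
  def main(args: Array[String]): Unit = {
    // This is the current way to get hold of a SQLContext
    // 2. Create the SparkSession, setting master and appName
    val spark = SparkSession.builder().master("local").appName("createDF case class").getOrCreate()
    // 3. Get the SparkContext from the session and read the data
    val lines = spark.sparkContext.textFile("G:\\test\\t_stu.txt").map(_.split(","))

    // 4. Map the data onto the case class, i.e. onto the corresponding table columns
    val tb = lines.map(t=>emp(t(0).toInt,t(1),t(2).toInt))
    // The implicit conversions must be imported here, otherwise toDF cannot be called
    import spark.sqlContext.implicits._

    // 5. Generate the DataFrame
    val df1 = tb.toDF()

    // Equivalent to: select name from t_stu
    df1.select($"name").show()

    // Stop the SparkSession
    spark.stop()
  }
}

/* 1. Define the case class; each attribute corresponds to a column name and type in the table.
      In production, for convenience, every field is often declared as String and only
      converted to the required type when actually needed.
      This step is equivalent to defining the table structure.
*/
case class emp(id:Int,name:String,age:Int)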

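To summarize, the steps are:

1. Define a case class that describes the table structure
2. Create a SparkSession object, used to read the data
3. Map the data in the RDD onto the case class
4. Call toDF to convert the RDD into a DataFrame

2.2.3 Using StructType

This approach is the more common one in Java.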

package SparkSQLExer

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

/**
  * Creating a DataFrame, approach 2:
  * create it through the SparkSession object, with the table structure defined via StructType
  */
object CreateDF02 {
  def main(args: Array[String]): Unit = {
    val sparkS = SparkSession.builder().master("local").appName("create schema").getOrCreate()

    // 1. Define the table schema with StructType; each column is described by a StructField
    val tbSchema = StructType(List(
        StructField("id",DataTypes.IntegerType),
        StructField("name",DataTypes.StringType),
        StructField("age",DataTypes.IntegerType)
      ))

    // 2. Read the data
    val lines = sparkS.sparkContext.textFile("G:\\test\\t_stu.txt").map(_.split(","))

    // 3. Map each record to a Row object
    val rdd1 = lines.map(t=>Row(t(0).toInt,t(1),t(2).toInt))

    // 4. Combine the schema with the row data; the result is a DataFrame
    val df2 = sparkS.createDataFrame(rdd1, tbSchema)

    // Print the table structure
    df2.printSchema()

    sparkS.stop()

  }

}

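To summarize, the steps are:

1. Define the table schema with StructType; each column is described by a StructField
2. Read the data through sparkSession.sparkContext
3. Map each record to a Row object
4. Combine the StructType with the Row data, which returns a DataFrame

2.2.4 Using files with a built-in structure, such as JSON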

package SparkSQLExer

import org.apache.spark.sql.SparkSession

/**
  * Creating a DataFrame, approach 3: load data and table structure directly from a
  * self-describing file, such as a JSON file.
  * The result is a DataFrame right away.
  */
object CreateDF03 {
  def main(args: Array[String]): Unit = {
    val sparkS = SparkSession.builder().master("local").appName("create df through json").getOrCreate()

    // Reading JSON, way 1:
    val jsonDF1 = sparkS.read.json("path")

    // Reading JSON, way 2:
    val jsonDF2 = sparkS.read.format("json").load("path")

    sparkS.stop()
  }
}

This approach is the simplest: the JSON file is read directly.
sparkS.read.xxxx can read files in various formats, and every such call returns a DataFrame.
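The schema inference that happens when JSON is read can be sketched without any file on disk. This is an illustration under assumptions: the object name JsonSchemaInference and the sample records are hypothetical, and it relies on the read.json overload that accepts a Dataset[String] (available since Spark 2.2).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JsonSchemaInference {
  // In-memory JSON lines stand in for a JSON file (one object per line).
  def load(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val jsonLines = Seq(
      """{"id":1,"name":"Tom","age":20}""",
      """{"id":2,"name":"Mary","age":17}"""
    ).toDS()
    // Column names and types are inferred from the data itself.
    spark.read.json(jsonLines)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("json schema inference").getOrCreate()
    val df = load(spark)
    df.printSchema()
    df.show()
    spark.stop()
  }
}
```

No case class or StructType is needed here; the field names in the JSON become the column names.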

2.3 Operating on a DataFrame

2.3.1 DSL statements

DSL statements express SQL operations as function calls, for example:

df1.select("name").show

Example:

For convenience, the following is run directly in spark-shell:
spark-shell --master spark://bigdata121:7077

1. Print the table structure
scala> df1.printSchema
root
|-- empno: integer (nullable = true)
|-- ename: string (nullable = true)
|-- job: string (nullable = true)
|-- mgr: integer (nullable = true)
|-- hiredate: string (nullable = true)
|-- sal: integer (nullable = true)
|-- comm: integer (nullable = true)
|-- deptno: integer (nullable = true)

2. Show the data of the current DataFrame, or of a query result
scala> df1.show
+-----+------+---------+----+----------+----+----+------+
|empno| ename|      job| mgr|  hiredate| sal|comm|deptno|
+-----+------+---------+----+----------+----+----+------+
| 7369| SMITH|    CLERK|7902|1980/12/17| 800|   0|    20|
| 7499| ALLEN| SALESMAN|7698| 1981/2/20|1600| 300|    30|
| 7521|  WARD| SALESMAN|7698| 1981/2/22|1250| 500|    30|
| 7566| JONES|  MANAGER|7839|  1981/4/2|2975|   0|    20|
| 7654|MARTIN| SALESMAN|7698| 1981/9/28|1250|1400|    30|
| 7698| BLAKE|  MANAGER|7839|  1981/5/1|2850|   0|    30|
| 7782| CLARK|  MANAGER|7839|  1981/6/9|2450|   0|    10|
| 7788| SCOTT|  ANALYST|7566| 1987/4/19|3000|   0|    20|
| 7839|  KING|PRESIDENT|7839|1981/11/17|5000|   0|    10|
| 7844|TURNER| SALESMAN|7698|  1981/9/8|1500|   0|    30|
| 7876| ADAMS|    CLERK|7788| 1987/5/23|1100|   0|    20|
| 7900| JAMES|    CLERK|7698| 1981/12/3| 950|   0|    30|
| 7902|  FORD|  ANALYST|7566| 1981/12/3|3000|   0|    20|
| 7934|MILLER|    CLERK|7782| 1982/1/23|1300|   0|    10|
+-----+------+---------+----+----------+----+----+------+

3. Run a select, equivalent to select xxx from xxx where xxx
scala> df1.select("ename","sal").where("sal>2000").show
+-----+----+
|ename| sal|
+-----+----+
|JONES|2975|
|BLAKE|2850|
|CLARK|2450|
|SCOTT|3000|
| KING|5000|
| FORD|3000|
+-----+----+

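4. Operate on specific columns
To operate on a given column, prefix its name with the $ symbol; only then can operations be applied to it.
$ means: take the column out, then do something with it.
Note: this $ syntax does not work out of the box in IDEA; the fix is described below.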
scala> df1.select($"ename",$"sal",$"sal"+100).show
+------+----+-----------+
| ename| sal|(sal + 100)|
+------+----+-----------+
| SMITH| 800|        900|
| ALLEN|1600|       1700|
|  WARD|1250|       1350|
| JONES|2975|       3075|
|MARTIN|1250|       1350|
| BLAKE|2850|       2950|
| CLARK|2450|       2550|
| SCOTT|3000|       3100|
|  KING|5000|       5100|
|TURNER|1500|       1600|
| ADAMS|1100|       1200|
| JAMES| 950|       1050|
|  FORD|3000|       3100|
|MILLER|1300|       1400|
+------+----+-----------+

5. Filter rows
scala> df1.filter($"sal">2000).show
+-----+-----+---------+----+----------+----+----+------+
|empno|ename|      job| mgr|  hiredate| sal|comm|deptno|
+-----+-----+---------+----+----------+----+----+------+
| 7566|JONES|  MANAGER|7839|  1981/4/2|2975|   0|    20|
| 7698|BLAKE|  MANAGER|7839|  1981/5/1|2850|   0|    30|
| 7782|CLARK|  MANAGER|7839|  1981/6/9|2450|   0|    10|
| 7788|SCOTT|  ANALYST|7566| 1987/4/19|3000|   0|    20|
| 7839| KING|PRESIDENT|7839|1981/11/17|5000|   0|    10|
| 7902| FORD|  ANALYST|7566| 1981/12/3|3000|   0|    20|
+-----+-----+---------+----+----------+----+----+------+

6. Group and count
scala> df1.groupBy($"deptno").count.show
+------+-----+
|deptno|count|
+------+-----+
|    20|    5|
|    10|    3|
|    30|    6|
+------+-----+

As mentioned above, select($"name") does not work out of the box in the IDE. The fix is:

Add this line before the statement:
import spark.sqlContext.implicits._

The cause is a type issue: $"name" only becomes a Column once the implicit conversions are in scope.

2.3.2 SQL statements

A DataFrame object cannot execute SQL directly. You first create a view from it and then run SQL against that view.
The view needs a name, and that name then plays the role of the table name.
Views are covered in more detail later; for now only the concept is needed.
Example:

val spark = SparkSession.builder().master("local").appName("createDF case class").getOrCreate()
// ......
// Create a temporary view from the DataFrame; the view name acts as the table name
df1.createOrReplaceTempView("emp")

// Run SQL through the SparkSession object
spark.sql("select * from emp").show
spark.sql("select * from emp where sal > 2000").show
spark.sql("select deptno,count(1) from emp group by deptno").show

// Several views can be created from the same DataFrame without conflict
df1.createOrReplaceTempView("emp12345")
spark.sql("select e.deptno from emp12345 e").show
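The SQL form and the DSL form describe the same query, which the following sketch makes explicit by running one aggregation both ways. The object name SqlVsDsl, the view name emp_demo, and the three sample rows are hypothetical stand-ins for the emp table.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SqlVsDsl {
  // Hypothetical sample rows standing in for the emp table.
  def sampleEmp(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("SMITH", 800, 20), ("ALLEN", 1600, 30), ("JONES", 2975, 20))
      .toDF("ename", "sal", "deptno")
  }

  // SQL form: register a view, then query it by name.
  def countsViaSql(spark: SparkSession, emp: DataFrame): Map[Int, Long] = {
    emp.createOrReplaceTempView("emp_demo")
    spark.sql("select deptno, count(1) as cnt from emp_demo group by deptno")
      .collect().map(r => r.getInt(0) -> r.getLong(1)).toMap
  }

  // DSL form: the same aggregation expressed as method calls.
  def countsViaDsl(emp: DataFrame): Map[Int, Long] =
    emp.groupBy("deptno").count()
      .collect().map(r => r.getInt(0) -> r.getLong(1)).toMap

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sql vs dsl").getOrCreate()
    val emp = sampleEmp(spark)
    println(countsViaSql(spark, emp))
    println(countsViaDsl(emp))
    spark.stop()
  }
}
```

Which form to use is largely a matter of taste: SQL strings are convenient for existing queries, while the DSL composes more easily inside Scala code.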

2.3.3 Multi-table queries

scala> case class Dept(deptno:Int,dname:String,loc:String)
defined class Dept

scala> val lines = sc.textFile("/usr/local/tmp_files/dept.csv").map(_.split(","))
lines: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[68] at map at <console>:24

scala> val allDept = lines.map(x=>Dept(x(0).toInt,x(1),x(2)))
allDept: org.apache.spark.rdd.RDD[Dept] = MapPartitionsRDD[69] at map at <console>:28

scala> val df2 = allDept.toDF
df2: org.apache.spark.sql.DataFrame = [deptno: int, dname: string ... 1 more field]

scala> df2.create
createGlobalTempView   createOrReplaceTempView   createTempView

scala> df2.createOrReplaceTempView("dept")

scala> spark.sql("select dname,ename from emp12345,dept where emp12345.deptno=dept.deptno").show
+----------+------+                                                             
|     dname| ename|
+----------+------+
|  RESEARCH| SMITH|
|  RESEARCH| JONES|
|  RESEARCH| SCOTT|
|  RESEARCH| ADAMS|
|  RESEARCH|  FORD|
|ACCOUNTING| CLARK|
|ACCOUNTING|  KING|
|ACCOUNTING|MILLER|
|     SALES| ALLEN|
|     SALES|  WARD|
|     SALES|MARTIN|
|     SALES| BLAKE|
|     SALES|TURNER|
|     SALES| JAMES|
+----------+------+

2.4 Creating a DataSet

2.4.1 Using a case class

This works like the DataFrame case, except that toDS is called instead of toDF.

package SparkSQLExer

import org.apache.spark.sql.SparkSession

object CreateDS {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("createDF case class").getOrCreate()
    val lines = spark.sparkContext.textFile("G:\\test\\t_stu.txt").map(_.split(","))
    val tb = lines.map(t=>emp1(t(0).toInt,t(1),t(2).toInt))
    // The implicit conversions are required for toDS and for the $ column syntax
    import spark.sqlContext.implicits._
    val ds1 = tb.toDS()
    // select is lazy; show triggers the computation and prints the result
    ds1.select($"name").show()

    spark.stop()
  }
}

case class emp1(id:Int,name:String,age:Int)

2.4.2 Using a Seq of objects

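package SparkSQLExer

import org.apache.spark.sql.SparkSession

object CreateDS02 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("createDF case class").getOrCreate()
    // The implicit conversions are required, otherwise toDS cannot be called on a Seq
    import spark.sqlContext.implicits._

    // Create a sequence of emp1 objects holding the data, then convert it straight to a DataSet with toDS
    val ds1 = Seq(emp1(1,"king",20)).toDS()
    ds1.printSchema()

    spark.stop()
  }
}

// Same case class as in 2.4.1; define it only once if both examples live in the same package
case class emp1(id:Int,name:String,age:Int)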

2.4.3 Using a JSON file

Define the case class:
case class Person(name:String,age:BigInt)

Generate a DataFrame from the JSON data:
val df = spark.read.format("json").load("/usr/local/tmp_files/people.json")

Convert the DataFrame to a DataSet:
df.as[Person].show

df.as[Person] is a DataSet.
The type parameter in as[T] is normally a case class (any type with an Encoder works); it is used to map the column names.

2.5 Operating on a DataSet

The operators a DataSet supports are essentially the union of the RDD operators and the DataFrame operators.

Generate a DataFrame from emp.json:
val empDF = spark.read.json("/usr/local/tmp_files/emp.json")

scala> empDF.show
+----+------+-----+------+----------+---------+----+----+
|comm|deptno|empno| ename|  hiredate|      job| mgr| sal|
+----+------+-----+------+----------+---------+----+----+
|    |    20| 7369| SMITH|1980/12/17|    CLERK|7902| 800|
| 300|    30| 7499| ALLEN| 1981/2/20| SALESMAN|7698|1600|
| 500|    30| 7521|  WARD| 1981/2/22| SALESMAN|7698|1250|
|    |    20| 7566| JONES|  1981/4/2|  MANAGER|7839|2975|
|1400|    30| 7654|MARTIN| 1981/9/28| SALESMAN|7698|1250|
|    |    30| 7698| BLAKE|  1981/5/1|  MANAGER|7839|2850|
|    |    10| 7782| CLARK|  1981/6/9|  MANAGER|7839|2450|
|    |    20| 7788| SCOTT| 1987/4/19|  ANALYST|7566|3000|
|    |    10| 7839|  KING|1981/11/17|PRESIDENT|    |5000|
|   0|    30| 7844|TURNER|  1981/9/8| SALESMAN|7698|1500|
|    |    20| 7876| ADAMS| 1987/5/23|    CLERK|7788|1100|
|    |    30| 7900| JAMES| 1981/12/3|    CLERK|7698| 950|
|    |    20| 7902|  FORD| 1981/12/3|  ANALYST|7566|3000|
|    |    10| 7934|MILLER| 1982/1/23|    CLERK|7782|1300|
+----+------+-----+------+----------+---------+----+----+

scala> empDF.where($"sal" >= 3000).show
+----+------+-----+-----+----------+---------+----+----+
|comm|deptno|empno|ename|  hiredate|      job| mgr| sal|
+----+------+-----+-----+----------+---------+----+----+
|    |    20| 7788|SCOTT| 1987/4/19|  ANALYST|7566|3000|
|    |    10| 7839| KING|1981/11/17|PRESIDENT|    |5000|
|    |    20| 7902| FORD| 1981/12/3|  ANALYST|7566|3000|
+----+------+-----+-----+----------+---------+----+----+

Converting empDF to a DataSet requires a case class:

scala> case class Emp(empno:BigInt,ename:String,job:String,mgr:String,hiredate:String,sal:BigInt,comm:String,deptno:BigInt)
defined class Emp

scala> val empDS = empDF.as[Emp]
empDS: org.apache.spark.sql.Dataset[Emp] = [comm: string, deptno: bigint ... 6 more fields]

scala> empDS.filter(_.sal > 3000).show
+----+------+-----+-----+----------+---------+---+----+
|comm|deptno|empno|ename|  hiredate|      job|mgr| sal|
+----+------+-----+-----+----------+---------+---+----+
|    |    10| 7839| KING|1981/11/17|PRESIDENT|   |5000|
+----+------+-----+-----+----------+---------+---+----+

scala> empDS.filter(_.deptno == 10).show
+----+------+-----+------+----------+---------+----+----+
|comm|deptno|empno| ename|  hiredate|      job| mgr| sal|
+----+------+-----+------+----------+---------+----+----+
|    |    10| 7782| CLARK|  1981/6/9|  MANAGER|7839|2450|
|    |    10| 7839|  KING|1981/11/17|PRESIDENT|    |5000|
|    |    10| 7934|MILLER| 1982/1/23|    CLERK|7782|1300|
+----+------+-----+------+----------+---------+----+----+

Multi-table query:

1. Create the department table
scala> val deptRDD = sc.textFile("/usr/local/tmp_files/dept.csv").map(_.split(","))
deptRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[154] at map at <console>:24

scala> case class Dept(deptno:Int,dname:String,loc:String)
defined class Dept

scala> val deptDS = deptRDD.map(x=>Dept(x(0).toInt,x(1),x(2))).toDS
deptDS: org.apache.spark.sql.Dataset[Dept] = [deptno: int, dname: string ... 1 more field]

scala> deptDS.show
+------+----------+--------+
|deptno|     dname|     loc|
+------+----------+--------+
|    10|ACCOUNTING|NEW YORK|
|    20|  RESEARCH|  DALLAS|
|    30|     SALES| CHICAGO|
|    40|OPERATIONS|  BOSTON|
+------+----------+--------+

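2. The employee table
Same as empDS above.

empDS.join(deptDS,"deptno").where(xxxx)   joins the two tables on the deptno column
empDS.joinWith(deptDS,deptDS("deptno")===empDS("deptno"))   used when the join columns have different names in the two tables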
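The two join styles also differ in what they return, which a small sketch can show. The case classes EmpRec and DeptRec, the object name JoinVsJoinWith, and the sample rows are hypothetical, trimmed-down stand-ins for the Emp and Dept classes above.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical, reduced versions of the employee and department records.
case class EmpRec(ename: String, deptno: Int)
case class DeptRec(deptno: Int, dname: String)

object JoinVsJoinWith {
  // join on a shared column name: the result is an untyped DataFrame
  // in which the join column appears only once.
  def joined(emp: Dataset[EmpRec], dept: Dataset[DeptRec]): DataFrame =
    emp.join(dept, "deptno")

  // joinWith takes an explicit condition and keeps both sides as typed objects.
  def paired(emp: Dataset[EmpRec], dept: Dataset[DeptRec]): Dataset[(EmpRec, DeptRec)] =
    emp.joinWith(dept, emp("deptno") === dept("deptno"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("join vs joinWith").getOrCreate()
    import spark.implicits._
    val emp = Seq(EmpRec("KING", 10), EmpRec("SMITH", 20)).toDS()
    val dept = Seq(DeptRec(10, "ACCOUNTING"), DeptRec(20, "RESEARCH"), DeptRec(40, "OPERATIONS")).toDS()
    joined(emp, dept).show()
    paired(emp, dept).show()
    spark.stop()
  }
}
```

With joinWith, each result element is a pair of the original objects, so downstream code can keep using fields such as ename and dname with compile-time checking.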

2.6 Views

To operate on a DataFrame or DataSet with standard SQL statements, you must first create a view from it and then query that view through the sql function of the SparkSession. So what is a view?
A view is a virtual table: it stores no data itself and can be thought of as a named link through which the underlying data is accessed. There are two types of views:
Ordinary view: also called a local view; valid only in the current session
Global view: valid in all sessions; global views are created in a dedicated namespace, global_temp, which behaves like a database
Usage:

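val spark = SparkSession.builder().master("local").appName("createDF case class").getOrCreate()
val empDF = spark.read.json("/usr/local/tmp_files/emp.json")

Create a local view:
empDF.createOrReplaceTempView(viewName)   re-creates the view if it already exists
empDF.createTempView(viewName)   fails with an exception if the view already exists

Create a global view:
empDF.createGlobalTempView(viewName)

Run SQL against the view; the view name acts like a table name:
spark.sql("xxxxx")

Example:
empDF.createOrReplaceTempView("emp")
spark.sql("select * from emp").show

Note: once a view has been created, it can be queried through the same SparkSession object from any class, just like a table. This is very convenient: read the tables you need into DataFrames up front, register them as views, and they can then be queried from anywhere in the program.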
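The difference in scope between the two kinds of view can be sketched with a second session created from the first. The object name ViewScopes, the view names, and the sample rows are hypothetical; createOrReplaceGlobalTempView is assumed to be available (Spark 2.2 or later).

```scala
import org.apache.spark.sql.SparkSession

object ViewScopes {
  // Returns (is the local view visible in another session?, row count read via the global view).
  def demo(spark: SparkSession): (Boolean, Long) = {
    import spark.implicits._
    val df = Seq((1, "neil"), (2, "jack")).toDF("id", "name")
    df.createOrReplaceTempView("people_local")          // session-scoped
    df.createOrReplaceGlobalTempView("people_global")   // shared across sessions

    // A new session shares the SparkContext but has its own temporary views.
    val other = spark.newSession()
    val localVisible = other.catalog.tableExists("people_local")
    // Global views must be qualified with the global_temp namespace.
    val globalCount = other.sql("select * from global_temp.people_global").count()
    (localVisible, globalCount)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("view scopes").getOrCreate()
    println(demo(spark))
    spark.stop()
  }
}
```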

2.7 Data sources

A SparkSession object can read data sources in different formats:

val spark = SparkSession.builder().master("local").appName("createDF case class").getOrCreate()

In what follows, spark refers to the SparkSession created above.

2.7.1 Reading data with SparkSession

1. load
spark.read.load(path): reads the files under the given path; by default the files must be in Parquet format

2. format
spark.read.format("format").load(path): reads files in another specified format, such as json
Example:
spark.read.format("json").load(path)

3. Read other formats directly
spark.read.formatName(path) is shorthand for method 2 above, for example:
spark.read.json(path)   JSON files
spark.read.text(path)   text files

Note: all of these return a DataFrame object.

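2.7.2 Saving data

A DataFrame can be written to files in a specified format. Assume a DataFrame object named df1.

1. save
df1.write.save(path)
This saves the data as files under the given directory. The file names are generated by Spark, so when reading the data back with the methods above, specify just the directory, not a file name. The output format is Parquet. An HDFS path can be given directly; a path without a scheme resolves against the configured default file system, which is the local file system unless HDFS is configured.
For example:
df1.write.save("/test")
spark.read.load("/test")

2. Save in a specific format directly
df1.write.json(path)   saves the data as JSON; file naming works as described above

3. Specify the save mode
If no save mode is specified and the output path already exists, an error is raised.
df1.write.mode("append").json(path)
mode("append") appends to the existing data
mode("overwrite") overwrites the existing data

4. Save as a table
df1.write.saveAsTable(tableName)   saved under the spark-warehouse directory in the current directory

5. format
df1.write.format(format).save()
Writes the data through a specific format (data source), for example to save it into a MongoDB database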
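The effect of the save modes can be sketched as a small round trip into a throwaway directory. The object name SaveModes and the sample rows are hypothetical; the sketch uses the SaveMode enum, which is equivalent to passing the mode names as strings.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

object SaveModes {
  // Returns the row counts read back after an append and after an overwrite.
  def roundTrip(spark: SparkSession): (Long, Long) = {
    import spark.implicits._
    val df = Seq((1, "a"), (2, "b")).toDF("id", "v")
    // A not-yet-existing sub-directory of a temporary directory, addressed as a file: URI.
    val dir = Files.createTempDirectory("spark-save-demo").resolve("out").toUri.toString

    df.write.mode(SaveMode.ErrorIfExists).parquet(dir)  // the default mode: fails if dir exists
    df.write.mode(SaveMode.Append).parquet(dir)         // adds new files next to the old ones
    val afterAppend = spark.read.load(dir).count()      // load defaults to Parquet

    df.write.mode(SaveMode.Overwrite).parquet(dir)      // replaces the previous contents
    val afterOverwrite = spark.read.load(dir).count()
    (afterAppend, afterOverwrite)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("save modes").getOrCreate()
    println(roundTrip(spark))
    spark.stop()
  }
}
```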

2.7.3 Parquet format

Parquet is a columnar storage format; for how it works, see the earlier Hive article. It is the default storage format, used by load and save when no format is given, and it is used as described above, so that is not repeated here. What is worth discussing is a special Parquet feature: schema merging (merging of table structures). Example:

scala> val df1 = sc.makeRDD(1 to 5).map(i=>(i,i*2)).toDF("single","double")
df1: org.apache.spark.sql.DataFrame = [single: int, double: int]

scala> df1.show
+------+------+
|single|double|
+------+------+
|     1|     2|
|     2|     4|
|     3|     6|
|     4|     8|
|     5|    10|
+------+------+

scala> sc.makeRDD(1 to 5)
res8: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[26] at makeRDD at <console>:25

scala> sc.makeRDD(1 to 5).collect
res9: Array[Int] = Array(1, 2, 3, 4, 5)

// Export table 1
scala> df1.write.parquet("/usr/local/tmp_files/test_table/key=1")

scala> val df2 = sc.makeRDD(6 to 10).map(i=>(i,i*3)).toDF("single","triple")
df2: org.apache.spark.sql.DataFrame = [single: int, triple: int]

scala> df2.show
+------+------+
|single|triple|
+------+------+
|     6|    18|
|     7|    21|
|     8|    24|
|     9|    27|
|    10|    30|
+------+------+

// Export table 2
scala> df2.write.parquet("/usr/local/tmp_files/test_table/key=2")

scala> val df3 = spark.read.parquet("/usr/local/tmp_files/test_table")
df3: org.apache.spark.sql.DataFrame = [single: int, double: int ... 1 more field]

// Reading directly loses a column (triple is missing)
scala> df3.show
+------+------+---+
|single|double|key|
+------+------+---+
|     8|  null|  2|
|     9|  null|  2|
|    10|  null|  2|
|     3|     6|  1|
|     4|     8|  1|
|     5|    10|  1|
|     6|  null|  2|
|     7|  null|  2|
|     1|     2|  1|
|     2|     4|  1|
+------+------+---+

// With the option "mergeSchema" set to true, the schemas are merged
scala> val df3 = spark.read.option("mergeSchema",true).parquet("/usr/local/tmp_files/test_table")
df3: org.apache.spark.sql.DataFrame = [single: int, double: int ... 2 more fields]

scala> df3.show
+------+------+------+---+
|single|double|triple|key|
+------+------+------+---+
|     8|  null|    24|  2|
|     9|  null|    27|  2|
|    10|  null|    30|  2|
|     3|     6|  null|  1|
|     4|     8|  null|  1|
|     5|    10|  null|  1|
|     6|  null|    18|  2|
|     7|  null|    21|  2|
|     1|     2|  null|  1|
|     2|     4|  null|  1|
+------+------+------+---+

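Follow-up question: what is key? Does it have to be called key?
key is the column that distinguishes the tables. After merging it becomes a column of the merged table, with its value taken from the key=xx part of the directory name.
If the two table directories under the parent directory use different column names, they cannot be merged. The name itself is arbitrary,
e.g. one directory named key=... and the other named test=... cannot be merged; both must use key or both must use test.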
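Schema merging can also be switched on for a whole session instead of per read, through the spark.sql.parquet.mergeSchema setting. The following is a minimal sketch of that variant: the object name MergeSchemaSessionWide and the temporary directory are hypothetical, and the two tiny tables mirror the single/double/triple example above.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object MergeSchemaSessionWide {
  // Writes two partitions with different columns and returns the merged column names.
  def mergedColumns(spark: SparkSession): Set[String] = {
    import spark.implicits._
    val base = Files.createTempDirectory("merge-schema-demo") // throwaway directory
    Seq((1, 2)).toDF("single", "double").write.parquet(base.resolve("key=1").toUri.toString)
    Seq((6, 18)).toDF("single", "triple").write.parquet(base.resolve("key=2").toUri.toString)

    // Session-wide switch; it applies to every Parquet read until it is unset.
    spark.conf.set("spark.sql.parquet.mergeSchema", "true")
    try spark.read.parquet(base.toUri.toString).columns.toSet
    finally spark.conf.unset("spark.sql.parquet.mergeSchema")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("merge schema").getOrCreate()
    println(mergedColumns(spark))
    spark.stop()
  }
}
```

Schema merging is off by default because it is relatively expensive, so enabling it per read, as in the transcript above, is usually the more targeted choice.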

2.7.4 JSON files

A JSON file carries the field names along with the data. Example:

scala> val peopleDF = spark.read.json("/usr/local/tmp_files/people.json")
peopleDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> peopleDF.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)

scala> peopleDF.createOrReplaceTempView("people")

scala> spark.sql("select * from people where age=19")
res25: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> spark.sql("select * from people where age=19").show
+---+------+
|age|  name|
+---+------+
| 19|Justin|
+---+------+

scala> spark.sql("select age,count(1) from people group by  age").show
+----+--------+                                                                 
| age|count(1)|
+----+--------+
|  19|       1|
|null|       1|
|  30|       1|
+----+--------+

2.7.5 JDBC connections

A DataFrame can be written to a database over JDBC, and data can be read from a database into a DataFrame.
Examples:
1. Reading data from MySQL over JDBC:

With format(xx).option(), connection parameters such as the user name, password, and JDBC driver can be specified.

import java.util.Properties

import org.apache.spark.sql.SparkSession

object ConnMysql {
  def main(args: Array[String]): Unit = {
    val sparkS = SparkSession.builder().appName("spark sql conn mysql").master("local").getOrCreate()
    // Connecting to MySQL, way 1:
    // Create a Properties object holding the MySQL connection parameters
    val mysqlConn = new Properties()
    mysqlConn.setProperty("user","root")
    mysqlConn.setProperty("password","wjt86912572")
    // Connect over JDBC with the connection string, table name, and other parameters; returns the corresponding DataFrame
    val mysqlDF1 = sparkS.read.jdbc("jdbc:mysql://bigdata121:3306/test?serverTimezone=UTC&characterEncoding=utf-8","customer",mysqlConn)
    mysqlDF1.printSchema()
    mysqlDF1.show()
    mysqlDF1.createTempView("customer")
    sparkS.sql("select * from customer limit 2").show()

    // Connecting to MySQL, way 2 (the more commonly used one):
    val mysqlDF2 = sparkS.read.format("jdbc")
      .option("url","jdbc:mysql://bigdata121:3306/test?serverTimezone=UTC&characterEncoding=utf-8")
      .option("user","root")
      .option("password","wjt86912572")
      .option("driver","com.mysql.jdbc.Driver")
      .option("dbtable","customer").load()

    mysqlDF2.printSchema()

    sparkS.stop()
  }
}

Both of the above read data over a JDBC connection.

2. Writing data to MySQL over JDBC

This is similar to reading, with the read operation replaced by a write.

import java.util.Properties

import org.apache.spark.sql.SparkSession

object WriteToMysql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("write to mysql").master("local").getOrCreate()

    // The source file is JSON, so read it as JSON to get named columns
    val df1 = spark.read.json("G:\\test\\t_stu.json")

    // Way 1 (append, because the default mode fails if the target table already exists):
    df1.write.mode("append").format("jdbc")
      .option("url","jdbc:mysql://bigdata121:3306/test?serverTimezone=UTC&characterEncoding=utf-8")
      .option("user","root")
      .option("password","wjt86912572")
      .option("driver","com.mysql.jdbc.Driver")
      .option("dbtable","customer").save()

    // Way 2:
    val mysqlConn = new Properties()
    mysqlConn.setProperty("user","root")
    mysqlConn.setProperty("password","wjt86912572")
    df1.write.mode("append").jdbc("jdbc:mysql://bigdata121:3306/test?serverTimezone=UTC&characterEncoding=utf-8","customer",mysqlConn)

    spark.stop()
  }

}

The structure of the DataFrame must match that of the target MySQL table, including the column names.

2.7.6 Hive

1. Connecting to Hive over JDBC
This works like an ordinary JDBC connection, for example:

import java.util.Properties

import org.apache.spark.sql.SparkSession

/**
  * There are two ways to connect to Hive:
  * 1. If the Spark program is run directly in IDEA, the address of the HiveServer must be
  * given in the program as a JDBC URL, and HiveServer must be running as a background
  * service exposing port 10000. This way connects to Hive directly over JDBC.
  *
  * 2. If the program is packaged and run on the Spark cluster, the conf directory of the
  * cluster usually already contains the Hive client configuration files, so the Hive client
  * is started directly to connect to Hive. HiveServer does not need to be running.
  * This way connects to Hive through the Hive client.
  */
object ConnHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark sql conn hive").master("local").getOrCreate()
    val properties = new Properties()
    properties.setProperty("user","")
    properties.setProperty("password","")

    val hiveDF = spark.read.jdbc("jdbc:hive2://bigdata121:10000/default","customer",properties)
    hiveDF.printSchema()
    spark.stop()
  }
}

Note for this approach:
HiveServer must be running as a background service and exposing port 10000, because this approach connects to Hive directly over JDBC.

2. Connecting to Hive through the Hive client
This is the approach generally used in production: jobs are usually submitted to the cluster with spark-submit, and they then connect to Hive through the Hive client rather than over JDBC.
Note: the Hive client must be configured on the Spark nodes, and hive-site.xml must be copied into Spark's conf directory, together with Hadoop's core-site.xml and hdfs-site.xml. In addition, because the Hive client is used, the Hive server side generally needs a metastore server configured; see the Hive article for the configuration details.
A program running on the Spark cluster can then directly use

spark.sql("xxxx").show

By default such a statement reads the corresponding table from Hive; no further step is needed to connect to Hive.

Hive tables can be used the same way directly in spark-shell.
Example:

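import org.apache.spark.sql.SparkSession

object ConnHive02 {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport makes the session use the Hive metastore configured in hive-site.xml
    val spark = SparkSession.builder().appName("spark sql conn hive").enableHiveSupport().getOrCreate()
    spark.sql("select * from customer").show()
    spark.stop()
  }

}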

This operates directly on Hive tables.

2.8 Small case study: read data from Hive, analyze it, and write the results to MySQL

import java.util.Properties

import org.apache.spark.sql.SparkSession

object HiveToMysql {
  def main(args: Array[String]): Unit = {
    // Connect to Hive through the Hive client on the Spark cluster; neither JDBC nor HiveServer is needed
    val spark = SparkSession.builder().appName("hive to mysql").enableHiveSupport().getOrCreate()
    val resultDF = spark.sql("select * from default.customer")

    // The processing logic normally goes here: process the data read from Hive, then write the result to MySQL

    val mysqlConn = new Properties()
    mysqlConn.setProperty("user","root")
    mysqlConn.setProperty("password","wjt86912572")
    // Write to MySQL over JDBC
    resultDF.write.mode("append").jdbc("jdbc:mysql://bigdata121:3306/test?serverTimezone=UTC&characterEncoding=utf-8", "customer", mysqlConn)

    spark.stop()
  }

}

3. Performance optimization

3.1 Caching data in memory

First start a spark-shell:

spark-shell --master spark://bigdata121:7077

MySQL is accessed from spark-shell here, so remember to put a mysql-connector jar into Spark's jars directory.

Example:

Create a DataFrame by reading a table from MySQL:
scala> val mysqDF = spark.read.format("jdbc").option("url","jdbc:mysql://bigdata121:3306/test?serverTimezone=UTC&characterEncoding=utf-8").option("user","root").option("password","wjt86912572").option("driver","com.mysql.jdbc.Driver").option("dbtable","customer").load()
mysqDF: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]

scala> mysqDF.show
+---+------+--------------------+
| id|  name|            last_mod|
+---+------+--------------------+
|  1|  neil|2019-07-20 17:09:...|
|  2|  jack|2019-07-20 17:09:...|
|  3|martin|2019-07-20 17:09:...|
|  4|  tony|2019-07-20 17:09:...|
|  5|  eric|2019-07-20 17:09:...|
|  6|  king|2019-07-20 17:42:...|
|  7|   tao|2019-07-20 17:45:...|
+---+------+--------------------+

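The DataFrame must be registered as a table before it can be cached (registerTempTable is deprecated, hence the warning below; createOrReplaceTempView is its replacement).
scala> mysqDF.registerTempTable("customer")
warning: there was one deprecation warning; re-run with -deprecation for details

Mark the table as cacheable; no data is actually cached yet.
scala> spark.sqlContext.cacheTable("customer")

The first query reads the data from MySQL and caches it in memory.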
scala> spark.sql("select * from customer").show
+---+------+--------------------+
| id|  name|            last_mod|
+---+------+--------------------+
|  1|  neil|2019-07-20 17:09:...|
|  2|  jack|2019-07-20 17:09:...|
|  3|martin|2019-07-20 17:09:...|
|  4|  tony|2019-07-20 17:09:...|
|  5|  eric|2019-07-20 17:09:...|
|  6|  king|2019-07-20 17:42:...|
|  7|   tao|2019-07-20 17:45:...|
+---+------+--------------------+

This time the query is answered from memory.
scala> spark.sql("select * from customer").show
+---+------+--------------------+
| id|  name|            last_mod|
+---+------+--------------------+
|  1|  neil|2019-07-20 17:09:...|
|  2|  jack|2019-07-20 17:09:...|
|  3|martin|2019-07-20 17:09:...|
|  4|  tony|2019-07-20 17:09:...|
|  5|  eric|2019-07-20 17:09:...|
|  6|  king|2019-07-20 17:42:...|
|  7|   tao|2019-07-20 17:45:...|
+---+------+--------------------+

Clear the cache:
scala> spark.sqlContext.clearCache
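The same caching steps are also available through spark.catalog, which additionally lets a program check whether a table is cached. A minimal sketch, with a hypothetical object name CacheDemo, view name customer_demo, and in-memory sample rows instead of the MySQL table:

```scala
import org.apache.spark.sql.SparkSession

object CacheDemo {
  // Returns (cached after cacheTable?, cached after uncacheTable?).
  def cacheStates(spark: SparkSession): (Boolean, Boolean) = {
    import spark.implicits._
    val df = Seq((1, "neil"), (2, "jack")).toDF("id", "name")
    df.createOrReplaceTempView("customer_demo")

    spark.catalog.cacheTable("customer_demo")           // lazy: only marks the table for caching
    spark.sql("select * from customer_demo").count()    // the first action materializes the cache
    val cachedBefore = spark.catalog.isCached("customer_demo")

    spark.catalog.uncacheTable("customer_demo")         // drops just this table from the cache
    val cachedAfter = spark.catalog.isCached("customer_demo")
    (cachedBefore, cachedAfter)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("cache demo").getOrCreate()
    println(cacheStates(spark))
    spark.stop()
  }
}
```

Unlike clearCache, which empties the whole cache, uncacheTable removes a single table.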

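3.2 Tuning parameters

Parameters related to caching data in memory:
   spark.sql.inMemoryColumnarStorage.compressed
   Default: true
   Spark SQL automatically selects a compression codec for each column based on statistics of the data.

   spark.sql.inMemoryColumnarStorage.batchSize
   Default: 10000
   Batch size for columnar caching. Larger batches improve memory utilization and compression, but increase the risk of OOM (Out Of Memory) errors when caching data.

Other performance-related options (manual changes are not recommended; later versions may tune these adaptively):
   spark.sql.files.maxPartitionBytes
   Default: 128 MB
   The maximum number of bytes to pack into a single partition when reading files.

   spark.sql.files.openCostInBytes
   Default: 4 MB
   The estimated cost of opening a file, measured as the number of bytes that could be scanned in the same time. It is used when packing multiple files into one partition. Overestimating is better: partitions with small files will then be faster than partitions with bigger files (which are scheduled first).

   spark.sql.autoBroadcastJoinThreshold
   Default: 10 MB
   The maximum size in bytes of a table that will be broadcast to all worker nodes when performing a join. Setting this value to -1 disables broadcasting. Note that statistics are currently only supported for Hive Metastore tables on which ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run.

   spark.sql.shuffle.partitions
   Default: 200
   The number of partitions to use when shuffling data for joins or aggregations.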
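These parameters can be changed at run time through spark.conf. The sketch below only shows the mechanism: the object name TuningConfig is hypothetical, and the values are arbitrary illustrations, not recommendations.

```scala
import org.apache.spark.sql.SparkSession

object TuningConfig {
  // Applies a few illustrative settings to an existing session.
  def applyTo(spark: SparkSession): Unit = {
    spark.conf.set("spark.sql.shuffle.partitions", "50")                    // fewer shuffle partitions for small data
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")            // -1 disables broadcast joins
    spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "20000")  // larger cache batches
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("tuning config").getOrCreate()
    applyTo(spark)
    println(spark.conf.get("spark.sql.shuffle.partitions"))
    spark.stop()
  }
}
```

The same keys can also be passed when the session is built, via config(key, value) on the builder, or on the command line with --conf.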

Origin blog.51cto.com/kinglab/2450773