SparkSQL basic analysis (3)

1. Overview of Spark SQL

1.1 What is Spark SQL

Spark SQL is a module used by Spark to process structured data. It provides two programming abstractions: DataFrame and
DataSet, and functions as a distributed SQL query engine.
We have already learned about Hive, which converts Hive SQL into MapReduce jobs and submits them to the cluster for execution; this greatly simplifies writing MapReduce programs, but the MapReduce computing model is relatively slow. Spark SQL came into being for this reason: it converts SQL into RDD operations and submits them to the cluster for execution, so execution is very fast.

1.2 Features of Spark SQL

1) Easy integration
2) Unified data access method
3) Compatible with Hive
4) Standard data connection

1.3 What is DataFrame

Similar to RDD, DataFrame is also a distributed data container. However, DataFrame is more like a two-dimensional table of a traditional database. In addition to data, it also records the structural information of the data, that is, schema. At the same time, similar to Hive, DataFrame also supports nested data types (struct, array and map). From the perspective of API ease of use, the DataFrame API provides a set of high-level relational operations, which is more friendly and has a lower threshold than the functional RDD API.
[Figure: RDD[Person] vs. DataFrame: the DataFrame additionally carries the schema (column names and types)]

The figure above intuitively reflects the difference between DataFrame and RDD. Although the RDD[Person] on the left uses Person as a type parameter, the Spark framework itself does not understand the internal structure of the Person class. The DataFrame on the right, by contrast, provides detailed structural information, so Spark SQL knows exactly which columns the data set contains and the name and type of each column. A DataFrame provides a schema-level view of the data and can be treated like a database table. DataFrame execution is also lazy, and its performance is higher than RDD, mainly because of the
optimized execution plan: the query plan is optimized by the Spark Catalyst optimizer.

For example, consider the following query:
users.join(events,users("id") === events("uid")).filter(events("date")>"2023-01-01")
[Figure: the optimized logical plan, with the filter pushed below the join]
To illustrate query optimization, consider the example above: two DataFrames are constructed, joined, and then filtered. If this plan were executed as written, efficiency would be poor, because join is a costly operation that may also produce a large intermediate data set. If the filter is pushed below the join, so that each DataFrame is filtered first and the smaller results are then joined, execution time can be shortened significantly. This is exactly what the Spark SQL query optimizer does. In short, logical query plan optimization is a process that uses equivalence transformations from relational algebra to replace high-cost operations with low-cost ones.

1.4 What is DataSet

1) It is an extension of the DataFrame API and the newest data abstraction in Spark.
2) It has a user-friendly API style, with both type-safety checking and the query optimization features of DataFrame.
3) Dataset uses encoders, which can access off-heap data without deserializing the entire object, improving efficiency.
4) Case classes are used to define the structural information of the data in a Dataset; each attribute name in the case class maps directly to a field name in the DataSet.
5) DataFrame is a special case of Dataset: DataFrame = Dataset[Row], so a DataFrame can be converted into a Dataset with the as method. Row is a type, just like Car or Person; Row is used to represent generic table-structure information.
6) DataSet is strongly typed; for example, there can be Dataset[Car] and Dataset[Person].
7) A DataFrame only knows the field names, not their types, so operations on it cannot be type-checked at compile time; for example, performing a subtraction on a String only produces an error at execution time. A DataSet knows both the field names and the field types, so it has stricter error checking, much like the difference between a JSON object and a class object (see the sketch after this list).
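A minimal sketch of point 7, assuming the people.json file used later in this article and a spark-shell session (so spark and spark.implicits._ are already available):

case class Person(name: String, age: Long)

val df = spark.read.json("examples/src/main/resources/people.json") // DataFrame = Dataset[Row]
val ds = Seq(Person("Andy", 32), Person("Justin", 19)).toDS()       // Dataset[Person]

// DataFrame: columns are only known by name; a reference to a missing column,
// e.g. df.select("salary"), compiles and fails only at runtime with an AnalysisException.
// Dataset: field names and types are checked at compile time, so ds.map(p => p.salary)
// would not compile at all, while the line below is fully type-checked.
ds.map(p => p.age + 1).show()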

2. SparkSQL programming

2.1 SparkSession: the new starting point

In older versions, SparkSQL provided two SQL query starting points: one called SQLContext, for SQL queries provided by Spark itself, and one called HiveContext, for queries against Hive.
SparkSession is Spark's latest SQL query starting point. It is essentially a combination of SQLContext and HiveContext, so the APIs available on SQLContext and HiveContext can also be used on SparkSession. SparkSession encapsulates a SparkContext internally, so the computation is actually carried out by the SparkContext.
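As a rough sketch of this relationship (assuming a standalone application rather than the spark-shell, where the spark object already exists; the application name and master are placeholders for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("session-example")   // hypothetical application name
  .master("local[*]")           // assumption: local mode, for illustration only
  .getOrCreate()

val sc = spark.sparkContext     // the underlying SparkContext performs the actual computation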

2.2 DataFrame

2.2.1 Create

In Spark SQL, SparkSession is the entry point for creating DataFrames and executing SQL. There are three ways to create a DataFrame: from a Spark data source, by converting an existing RDD, or by querying a Hive table.
1) Create from Spark data source
(1) View the file formats supported by Spark's data sources
scala> spark.read.
csv format jdbc json load option options orc parquet schema table text textFile
(2) Read json file to create DataFrame
scala> val df = spark.read.json("/opt/module/spark/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string ]
(3) Display results

scala> df.show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

2) Convert from an existing RDD (see section 2.2.4)
3) Query from a Hive table (see section 3.5)

2.2.2 SQL style syntax (primary)

1) Create a DataFrame
scala> val df = spark.read.json("/opt/module/spark/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
2) Create a temporary view on the DataFrame
scala> df.createOrReplaceTempView("people")
3) Query the full table with a SQL statement
scala> val sqlDF = spark.sql("SELECT * FROM people")
sqlDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
4) Show the results
scala> sqlDF.show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
Note: a temporary view is scoped to the current Session; once the Session exits, the view is gone. If you need it to be valid across the whole application, use a global temporary view. Note that a global temporary view must be accessed with its fully qualified name, e.g. global_temp.people.
5) Create a global temporary view on the DataFrame
scala> df.createGlobalTempView("people")
6) Query the full table with a SQL statement
scala> spark.sql("SELECT * FROM global_temp.people").show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

scala> spark.newSession().sql("SELECT * FROM global_temp.people").show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

2.2.3 DSL style syntax (secondary)

1) Create a DataFrame
scala> val df = spark.read.json("/opt/module/spark/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
2) View the schema of the DataFrame
scala> df.printSchema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
3) Select only the "name" column
scala> df.select("name").show()
+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+
4) Select the "name" column together with "age + 1"
scala> df.select($"name", $"age" + 1).show()
+-------+---------+
|   name|(age + 1)|
+-------+---------+
|Michael|     null|
|   Andy|       31|
| Justin|       20|
+-------+---------+
5) Filter rows where "age" is greater than 21
scala> df.filter($"age" > 21).show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
6) Group by "age" and count the rows
scala> df.groupBy("age").count().show()
+----+-----+
| age|count|
+----+-----+
|  19|     1|
|null|     1|
|  30|     1|
+----+-----+
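The DSL calls above can also be chained into a single expression; a small sketch, using the same df read from people.json (the column references via $ assume import spark.implicits._, which the spark-shell provides):

df.filter($"age".isNotNull && $"age" > 20)
  .groupBy($"age")
  .count()
  .orderBy($"age")
  .show()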

2.2.4 Convert an RDD to a DataFrame

Note: if you need to convert between RDDs and DataFrames/DataSets, you must import spark.implicits._ (here spark is not a package name but the name of the SparkSession object).
Precondition: import the implicit conversions and create an RDD
scala> import spark.implicits._
import spark.implicits._

scala> val peopleRDD = sc.textFile("examples/src/main/resources/people.txt")
peopleRDD: org.apache.spark.rdd.RDD[String] = examples/src/main/resources/people.txt MapPartitionsRDD[3] at textFile at <console>:27
1) Specify the conversion manually
scala> peopleRDD.map{x=>val para = x.split(",");(para(0),para(1).trim.toInt)}.toDF("name", "age")
res1: org.apache.spark.sql.DataFrame = [name: string, age: int]
2) Convert via reflection (requires a case class)
(1) Create a case class
scala> case class People(name:String, age:Int)
(2) Convert the RDD to a DataFrame according to the case class
scala> peopleRDD.map{ x => val para = x.split(",");People(para(0),para(1).trim.toInt)}.toDF
res2: org.apache.spark.sql.DataFrame = [name: string, age: int]
3) Programmatically (for understanding only)
(1) Import the required types
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
(2) Create Schema
scala> val structType: StructType = StructType(StructField("name", StringType) :: StructField("age", IntegerType) :: Nil)
structType: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,IntegerType,true))
(3) Import the required type
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
(4) Create an RDD[Row] according to the given schema
scala> val data = peopleRDD.map{ x => val para = x.split(",");Row(para(0),para(1).trim.toInt)}
data: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[6] at map at <console>:33
(5) Create a DataFrame based on the data and the given schema
scala> val dataFrame = spark.createDataFrame(data, structType)
dataFrame: org.apache.spark.sql.DataFrame = [name: string, age: int]

2.2.5 Convert a DataFrame to an RDD

Simply call the rdd method.
1) Create a DataFrame
scala> val df = spark.read.json("/opt/module/spark/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
2) Convert DataFrame to RDD
scala> val dfToRDD = df.rdd
dfToRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[19] at rdd at <console>:29
3) Print RDD
scala> dfToRDD.collect
res13: Array[org.apache.spark.sql.Row] = Array([Michael, 29], [Andy, 30], [Justin, 19])
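When working with the resulting RDD[Row], typed values can be pulled out of each Row; a small sketch (the column name follows the schema printed above, and namesRDD is just an illustrative name):

val namesRDD = df.rdd.map(row => row.getAs[String]("name"))
namesRDD.collect().foreach(println)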

2.3 DataSet

Dataset is a strongly typed data collection, and corresponding type information needs to be provided.

2.3.1 Create

1) Create a case class
scala> case class Person(name: String, age: Long)
defined class Person
2) Create a DataSet
scala> val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]

2.3.2 Convert RDD to DataSet

SparkSQL can automatically convert an RDD containing case classes into a DataFrame. The case class defines the structure of the table, and its attribute names become the table's column names via reflection. Case classes can be nested and can contain complex types such as Seq or Array (a small sketch of this follows).
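The last point, that case classes may contain complex types such as Seq or Array, can be sketched as follows (Order is a hypothetical case class, not part of the data set used below; import spark.implicits._ is assumed as in the earlier examples):

case class Order(id: String, items: Seq[String])

val orderDS = Seq(Order("o1", Seq("a", "b")), Order("o2", Seq("c"))).toDS()
orderDS.printSchema()   // the items attribute becomes an array<string> column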
1) Create an RDD
scala> val peopleRDD = sc.textFile("examples/src/main/resources/people.txt")
peopleRDD: org.apache.spark.rdd.RDD[String] = examples/src/main/resources /people.txt MapPartitionsRDD[3] at textFile at :27
2) Create a case class
scala> case class Person(name: String, age: Long)
defined class Person
3) Convert the RDD to a DataSet
scala> peopleRDD.map( line => {val para = line.split(",");Person(para(0),para(1).trim.toInt)}).toDS()
res8: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]

2.3.3 Convert DataSet to RDD

Just call the rdd method.
1) Create a DataSet
scala> val DS = Seq(Person("Andy", 32)).toDS()
DS: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
2) Convert the DataSet to an RDD
scala> DS.rdd
res11: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[15] at rdd at <console>:28

2.4 Interoperation between DataFrame and DataSet

  1. DataFrame to DataSet
    1) Create a DataFrame
    scala> val df = spark.read.json("examples/src/main/resources/people.json")
    df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
    2) Create a case class
    scala> case class Person(name: String, age: Long)
    defined class Person
    3) Convert the DataFrame into a DataSet
    scala> df.as[Person]
    res14: org.apache.spark.sql.Dataset[Person] = [age: bigint, name: string]
  2. Convert DataSet to DataFrame
    1) Create a case class
    scala> case class Person(name: String, age: Long)
    defined class Person
    2) Create a DataSet
    scala> val ds = Seq(Person("Andy", 32)).toDS()
    ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
    3) Convert the DataSet to a DataFrame
    scala> val df = ds.toDF
    df: org.apache.spark.sql.DataFrame = [name: string, age: bigint]
    4) Show
    scala> df.show
    +----+---+
    |name|age|
    +----+---+
    |Andy| 32|
    +----+---+

2.4.1 Dataset to DataFrame

This is very simple: each case class instance is simply wrapped into a Row.
(1) Import implicit conversion
import spark.implicits._
(2) Convert
val testDF = testDS.toDF

2.4.2 DataFrame to Dataset

(1) Import implicit conversion
import spark.implicits._
(2) Create a sample class
case class Coltest(col1: String, col2: Int) extends Serializable // define field names and types
(3) Convert
val testDS = testDF.as[Coltest]
This approach uses the as method to convert to a Dataset once the type of each column is given, which is extremely convenient when the data is a DataFrame and each field needs to be processed with its type. When using these conversions, be sure to import spark.implicits._, otherwise toDF and toDS cannot be used.

2.5 RDD, DataFrame and DataSet

In Spark SQL, Spark provides us with two new abstractions, DataFrame and DataSet. How do they differ from RDD? First, look at when each was introduced:
RDD (Spark 1.0) -> DataFrame (Spark 1.3) -> Dataset (Spark 1.6)
If the same data is handed to each of these three data structures and each computes it separately, they all produce the same results. The difference lies in their execution efficiency and execution style.
In later Spark versions, DataSet will gradually replace RDD and DataFrame as the only API interface.

2.5.1 Common features among the three

1. RDD, DataFrame and Dataset are all distributed, resilient data sets on the Spark platform, and all make it convenient to process very large data.
2. All three are lazily evaluated: creating them and applying transformations such as map does not execute anything immediately; traversal only starts when an Action such as foreach is encountered.
3. All three automatically cache intermediate results according to Spark's memory situation, so even with large data volumes there is no need to worry about memory overflow.
4. All three have the concept of partitions.
5. All three share many common functions, such as filter and sort.
6. Many operations on DataFrame and Dataset require this import for support: import spark.implicits._
7. Both DataFrame and Dataset can use pattern matching to obtain the value and type of each field, for example:

DataFrame:
testDF.map {
  case Row(col1: String, col2: Int) =>
    println(col1); println(col2)
    col1
  case _ =>
    ""
}
Dataset:
case class Coltest(col1: String, col2: Int) extends Serializable // define field names and types
testDS.map {
  case Coltest(col1: String, col2: Int) =>
    println(col1); println(col2)
    col1
  case _ =>
    ""
}

2.5.2 Differences between the three

  1. RDD:
    1) RDDs are generally used together with Spark MLlib.
    2) RDDs do not support Spark SQL operations.
  2. DataFrame:
    1) Unlike RDD and Dataset, the type of every DataFrame row is fixed to Row, and column values cannot be accessed directly; each field can only be obtained by extracting it, for example:
    testDF.foreach { line =>
      val col1 = line.getAs[String]("col1")
      val col2 = line.getAs[String]("col2")
    }
    2) DataFrame and Dataset are generally not used together with Spark MLlib.
    3) Both DataFrame and Dataset support Spark SQL operations such as select and groupBy, and both can be registered as temporary tables/views and queried with SQL statements, for example:
    dataDF.createOrReplaceTempView("tmp")
    spark.sql("select ROW,DATE from tmp where DATE is not null order by DATE").show(100,false)
    4) DataFrame and Dataset also support some particularly convenient ways of saving, such as saving to CSV with a header so that the field name of each column is clear at a glance:
    // Save
    val saveoptions = Map("header" -> "true", "delimiter" -> "\t", "path" -> "hdfs://hadoop102:9000/test")
    datawDF.write.format("com.wxn.spark.csv").mode(SaveMode.Overwrite).options(saveoptions).save()
    // Read
    val options = Map("header" -> "true", "delimiter" -> "\t", "path" -> "hdfs://hadoop102:9000/test")
    val datarDF = spark.read.options(options).format("com.wxn.spark.csv").load()
    With this saving method the correspondence between field names and columns is easy to obtain, and the delimiter can be specified freely.
  3. Dataset:
    1) Dataset and DataFrame have exactly the same member functions; the only difference is the type of each row.
    2) DataFrame can also be written as Dataset[Row]: each row has type Row, which is unparsed, so it is not known what fields a row has or what type each field is; fields can only be extracted with the getAs method mentioned above or with the pattern matching described in point 7 of the common features. In a Dataset, the type of each row is not fixed; after defining a case class, the information in each row can be accessed freely.
case class Coltest(col1: String, col2: Int) extends Serializable // define field names and types
/**
 rdd
 ("a", 1)
 ("b", 1)
 ("a", 1)
**/
val test: Dataset[Coltest] = rdd.map { line =>
  Coltest(line._1, line._2)
}.toDS

test.map { line =>
  println(line.col1)
  println(line.col2)
}

As can be seen, a Dataset is very convenient when you need to access a specific field of each row. However, if you want to write highly adaptable functions, the row type of a Dataset is not fixed and may be any case class, so a single implementation cannot adapt to all of them; in that case DataFrame, i.e. Dataset[Row], solves the problem better.
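A small sketch of such a "highly adaptable" helper written against DataFrame (= Dataset[Row]) rather than a specific case class; columnAsStrings is a hypothetical name, and the sketch assumes the selected column is of string type:

import org.apache.spark.sql.DataFrame

// Works on any DataFrame, regardless of which case class (if any) it came from.
def columnAsStrings(df: DataFrame, colName: String): Array[String] =
  df.select(colName).collect().map(_.getString(0))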

2.6 IDEA creates SparkSQL program

Programs are packaged and run in IDEA in the same way as with Spark Core. A new dependency needs to be added to the Maven dependencies:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.1.1</version>
</dependency>
The program is as follows:
package com.wxn.sparksql

import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}
import org.slf4j.LoggerFactory

object HelloWorld {

  def main(args: Array[String]) {
    // Create SparkConf() and set the App name
    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    // For implicit conversions like converting RDDs to DataFrames
    import spark.implicits._

    val df = spark.read.json("examples/src/main/resources/people.json")

    // Displays the content of the DataFrame to stdout
    df.show()

    df.filter($"age" > 21).show()

    df.createOrReplaceTempView("persons")

    spark.sql("SELECT * FROM persons where age > 21").show()

    spark.stop()
  }

}

2.7 User-defined functions

In the shell, users can define custom functions through spark.udf.

2.7.1 User-defined UDF function

scala> val df = spark.read.json("examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+


scala> spark.udf.register("addName", (x:String)=> "Name:"+x)
res5: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))

scala> df.createOrReplaceTempView("people")

scala> spark.sql("Select addName(name), age from people").show()
+-----------------+----+
|UDF:addName(name)| age|
+-----------------+----+
|     Name:Michael|null|
|        Name:Andy|  30|
|      Name:Justin|  19|
+-----------------+----+
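Besides registering a UDF for use in SQL text, the same function can be applied directly in the DSL through org.apache.spark.sql.functions.udf; a sketch (the alias name_with_prefix is arbitrary, and $ assumes import spark.implicits._ as usual in the shell):

import org.apache.spark.sql.functions.udf

val addNameCol = udf((x: String) => "Name:" + x)
df.select(addNameCol($"name").as("name_with_prefix"), $"age").show()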

2.7.2 User-defined aggregate function

Both the strongly typed Dataset and the weakly typed DataFrame provide built-in aggregate functions such as count(), countDistinct(), avg(), max() and min(). In addition, users can define their own aggregate functions.
Weakly typed user-defined aggregate function: implemented by extending UserDefinedAggregateFunction. The following shows a custom aggregate function that computes the average salary.

import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

object MyAverage extends UserDefinedAggregateFunction {
  // Data type of the aggregate function's input parameters
  def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil)
  // Data types of the values in the aggregation buffer
  def bufferSchema: StructType = {
    StructType(StructField("sum", LongType) :: StructField("count", LongType) :: Nil)
  }
  // Data type of the return value
  def dataType: DataType = DoubleType
  // Whether the same input always returns the same output
  def deterministic: Boolean = true
  // Initialization
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    // holds the total of the salaries
    buffer(0) = 0L
    // holds the number of salaries
    buffer(1) = 0L
  }
  // Merge data within the same executor
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + input.getLong(0)
      buffer(1) = buffer.getLong(1) + 1
    }
  }
  // Merge data across executors
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  // Compute the final result
  def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}

// Register the function
spark.udf.register("myAverage", MyAverage)

val df = spark.read.json("examples/src/main/resources/employees.json")
df.createOrReplaceTempView("employees")
df.show()
// +-------+------+
// |   name|salary|
// +-------+------+
// |Michael|  3000|
// |   Andy|  4500|
// | Justin|  3500|
// |  Berta|  4000|
// +-------+------+

val result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees")
result.show()
// +--------------+
// |average_salary|
// +--------------+
// |        3750.0|
// +--------------+
Strongly typed user-defined aggregate function: implemented by extending Aggregator; again, it computes the average salary.
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.SparkSession
// Since this is strongly typed, case classes are used
case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)

object MyAverage extends Aggregator[Employee, Average, Double] {
  // Define a data structure holding the salary total and the salary count, both initially 0
  def zero: Average = Average(0L, 0L)
  // Combine two values to produce a new value. For performance, the function may modify `buffer`
  // and return it instead of constructing a new object
  def reduce(buffer: Average, employee: Employee): Average = {
    buffer.sum += employee.salary
    buffer.count += 1
    buffer
  }
  // Merge results from different executors
  def merge(b1: Average, b2: Average): Average = {
    b1.sum += b2.sum
    b1.count += b2.count
    b1
  }
  // Compute the output
  def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
  // Encoder for the intermediate value type, converted to a case class
  // Encoders.product is the encoder for Scala tuples and case classes
  def bufferEncoder: Encoder[Average] = Encoders.product
  // Encoder for the final output value
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

import spark.implicits._

val ds = spark.read.json("examples/src/main/resources/employees.json").as[Employee]
ds.show()
// +-------+------+
// |   name|salary|
// +-------+------+
// |Michael|  3000|
// |   Andy|  4500|
// | Justin|  3500|
// |  Berta|  4000|
// +-------+------+

// Convert the function to a `TypedColumn` and give it a name
val averageSalary = MyAverage.toColumn.name("average_salary")
val result = ds.select(averageSalary)
result.show()
// +--------------+
// |average_salary|
// +--------------+
// |        3750.0|
// +--------------+

3. SparkSQL data source

3.1 Universal load/save method

3.1.1 Manually specify options

Spark SQL's DataFrame interface supports operating on a variety of data sources. A DataFrame can be operated on like an RDD or registered as a temporary table. After registering a DataFrame as a temporary table, you can run SQL queries on it.
The default data source for Spark SQL is the Parquet format. When the data source is a Parquet file, Spark SQL can easily perform all operations. The default data source format can be changed via the configuration item spark.sql.sources.default.
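For example, the default could be switched to JSON for a session; a sketch only (the examples that follow keep Parquet as the default):

// With this setting, load()/save() without an explicit format() would read and write JSON.
spark.conf.set("spark.sql.sources.default", "json")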

val df = spark.read.load("examples/src/main/resources/users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
When the data source is not in Parquet format, the format needs to be specified manually. The data source format must be given by its full name (for example org.apache.spark.sql.parquet); for built-in formats you only need the short name: json, parquet, jdbc, orc, libsvm, csv or text.
Data can be loaded generically with the read.load method provided by SparkSession, and saved with write and save.
val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
peopleDF.write.format("parquet").save("hdfs://hadoop102:9000/namesAndAges.parquet")
In addition, SQL can be run directly on files:
val sqlDF = spark.sql("SELECT * FROM parquet.`hdfs://hadoop102:9000/namesAndAges.parquet`")
sqlDF.show()
scala> val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
peopleDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> peopleDF.write.format("parquet").save("hdfs://hadoop102:9000/namesAndAges.parquet")

scala> peopleDF.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

scala> val sqlDF = spark.sql("SELECT * FROM parquet.`hdfs://hadoop102:9000/namesAndAges.parquet`")
17/09/05 04:21:11 WARN ObjectStore: Failed to get database parquet, returning NoSuchObjectException
sqlDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> sqlDF.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

3.1.2 File saving options

SaveMode can be passed when performing save operations; it defines how existing data is handled. It is important to note that these save modes do not use any locks and are not atomic. In addition, when executing in Overwrite mode, the original data is deleted before the new data is written. The SaveMode options are as follows:
SaveMode.ErrorIfExists (default) / "error": throw an exception if data already exists at the target
SaveMode.Append / "append": append to the existing data
SaveMode.Overwrite / "overwrite": overwrite the existing data
SaveMode.Ignore / "ignore": do nothing if data already exists
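For example, a mode can be set on the writer before saving; a sketch that reuses the peopleDF and HDFS path from 3.1.1:

import org.apache.spark.sql.SaveMode

peopleDF.write
  .mode(SaveMode.Overwrite)   // or .mode("append"), .mode("ignore"), .mode("error")
  .format("parquet")
  .save("hdfs://hadoop102:9000/namesAndAges.parquet")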

3.2 JSON file

Spark SQL can automatically infer the schema of a JSON data set and load it as a Dataset[Row]. JSON files can be loaded with SparkSession.read.json().
Note: this JSON file is not a conventional JSON file; each line must be a valid JSON string.

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

// Primitive types (Int, String, etc) and Product types (case classes) encoders are
// supported by importing this when creating a Dataset.
import spark.implicits._

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files
val path = "examples/src/main/resources/people.json"
val peopleDF = spark.read.json(path)

// The inferred schema can be visualized using the printSchema() method
peopleDF.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by spark
val teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()
// +------+
// |  name|
// +------+
// |Justin|
// +------+

// Alternatively, a DataFrame can be created for a JSON dataset represented by
// a Dataset[String] storing one JSON object per string
val otherPeopleDataset = spark.createDataset(
"""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val otherPeople = spark.read.json(otherPeopleDataset)
otherPeople.show()
// +---------------+----+
// |        address|name|
// +---------------+----+
// |[Columbus,Ohio]| Yin|
// +---------------+----+

3.3 Parquet files

Parquet is a popular columnar storage format that can efficiently store records with nested fields. The Parquet format is often used in the Hadoop ecosystem, and it also supports all data types of Spark SQL. Spark SQL provides methods to directly read and store Parquet format files.

import spark.implicits._

val peopleDF = spark.read.json("examples/src/main/resources/people.json")

peopleDF.write.parquet("hdfs://hadoop102:9000/people.parquet")

val parquetFileDF = spark.read.parquet("hdfs://hadoop102:9000/people.parquet")

parquetFileDF.createOrReplaceTempView("parquetFile")

val namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19")
namesDF.map(attributes => "Name: " + attributes(0)).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

3.4 JDBC

Spark SQL can create a DataFrame by reading data from a relational database through JDBC. After a series of calculations on the DataFrame, it can also write the data back to the relational database.
Note: The relevant database driver needs to be placed in the spark classpath.

$ bin/spark-shell --master spark://hadoop102:7077 --jars mysql-connector-java-5.1.27-bin.jar

// Note: JDBC loading and saving can be achieved via either the load/save or jdbc methods
// Loading data from a JDBC source
val jdbcDF = spark.read.format("jdbc").option("url", "jdbc:mysql://hadoop102:3306/rdd").option("dbtable", "rddtable").option("user", "root").option("password", "000000").load()

val connectionProperties = new Properties()
connectionProperties.put("user", "root")
connectionProperties.put("password", "hive")
val jdbcDF2 = spark.read
.jdbc("jdbc:mysql://hadoop102:3306/rdd", "rddtable", connectionProperties)

// Saving data to a JDBC source
jdbcDF.write
.format("jdbc")
.option("url", "jdbc:mysql://hadoop102:3306/rdd")
.option("dbtable", "dftable")
.option("user", "root")
.option("password", "000000")
.save()

jdbcDF2.write
.jdbc("jdbc:mysql://hadoop102:3306/mysql", "db", connectionProperties)

3.5 Hive database

Apache Hive is the SQL engine on Hadoop, and Spark SQL can be compiled with or without Hive support. Spark SQL with Hive support can access Hive tables, use UDFs (user-defined functions), and run the Hive query language (HiveQL/HQL). It should be emphasized that including the Hive libraries in Spark SQL does not require Hive to be installed in advance. In general, it is best to compile Spark SQL with Hive support so that these features can be used. If you downloaded the binary version of Spark, it should already have been compiled with Hive support.
To connect Spark SQL to a deployed Hive, you must copy hive-site.xml to Spark's configuration directory ($SPARK_HOME/conf). Spark SQL can run even if Hive is not deployed; in that case Spark SQL creates its own Hive metastore in the current working directory, called metastore_db. In addition, if you try to create tables with the HiveQL CREATE TABLE statement (not CREATE EXTERNAL TABLE), the tables will be placed in the /user/hive/warehouse directory of your default file system (if there is a configured hdfs-site.xml on the classpath, the default file system is HDFS; otherwise it is the local file system).

import java.io.File

import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

case class Record(key: Int, value: String)

// warehouseLocation points to the default location for managed databases and tables
val warehouseLocation = new File("spark-warehouse").getAbsolutePath

val spark = SparkSession
.builder()
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()

import spark.implicits._
import spark.sql

sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sql("SELECT * FROM src").show()
// +---+-------+
// |key|  value|
// +---+-------+
// |238|val_238|
// | 86| val_86|
// |311|val_311|
// ...

// Aggregation queries are also supported.
sql("SELECT COUNT(*) FROM src").show()
// +--------+
// |count(1)|
// +--------+
// |    500 |
// +--------+

// The results of SQL queries are themselves DataFrames and support all normal functions.
val sqlDF = sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key")

// The items in DataFrames are of type Row, which allows you to access each column by ordinal.
val stringsDS = sqlDF.map {
  case Row(key: Int, value: String) => s"Key: $key, Value: $value"
}
stringsDS.show()
// +--------------------+
// |               value|
// +--------------------+
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// ...

// You can also use DataFrames to create temporary views within a SparkSession.
val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.createOrReplaceTempView("records")

// Queries can then join DataFrame data with data stored in Hive.
sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show()
// +---+------+---+------+
// |key| value|key| value|
// +---+------+---+------+
// |  2| val_2|  2| val_2|
// |  4| val_4|  4| val_4|
// |  5| val_5|  5| val_5|

3.5.1 Embedded Hive application

If you want to use the embedded Hive, nothing extra is needed; just use it. If required, the warehouse location can be passed with --conf spark.sql.warehouse.dir=<warehouse path>.
Note: when using the internal Hive, spark.sql.warehouse.dir specifies the location of the data warehouse in Spark 2.0 and later. If you need an HDFS path as the warehouse, you must add core-site.xml and hdfs-site.xml to Spark's conf directory; otherwise only a warehouse directory on the master node will be created, and files will not be found when querying. To switch to HDFS, delete the metastore and restart the cluster.
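For illustration only, the warehouse location could be passed when starting the shell like this (the HDFS path below is a placeholder, not taken from the original setup):
bin/spark-shell --conf spark.sql.warehouse.dir=hdfs://hadoop102:9000/spark-warehouse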

3.5.2 External Hive applications

If you want to connect to Hive that has been deployed externally, you need to go through the following steps.
1) Copy or soft-link hive-site.xml in Hive to the conf directory in the Spark installation directory.
2) Start the spark shell with the JDBC driver needed to access the Hive metastore database
$ bin/spark-shell --master spark://hadoop102:7077 --jars mysql-connector-java-5.1.27-bin.jar

3.5.3 Running Spark SQL CLI

Spark SQL CLI can easily run the Hive metadata service locally and execute query tasks from the command line. Execute the following command in the Spark directory to start Spark SQL CLI:
./bin/spark-sql
To configure Hive, place hive-site.xml under conf/.

4. Spark SQL practice

4.1 Data description

The data set is a goods transaction data set.

Each order may contain multiple items, each order may generate multiple transactions, and different items have different unit prices.

4.2 Loading data

tbStock:
scala> case class tbStock(ordernumber:String,locationid:String,dateid:String) extends Serializable
defined class tbStock

scala> val tbStockRdd = spark.sparkContext.textFile("tbStock.txt")
tbStockRdd: org.apache.spark.rdd.RDD[String] = tbStock.txt MapPartitionsRDD[1] at textFile at <console>:23

scala> val tbStockDS = tbStockRdd.map(_.split(",")).map(attr=>tbStock(attr(0),attr(1),attr(2))).toDS
tbStockDS: org.apache.spark.sql.Dataset[tbStock] = [ordernumber: string, locationid: string ... 1 more field]

scala> tbStockDS.show()
+------------+----------+---------+
| ordernumber|locationid|   dateid|
+------------+----------+---------+
|BYSL00000893|      ZHAO|2007-8-23|
|BYSL00000897|      ZHAO|2007-8-24|
|BYSL00000898|      ZHAO|2007-8-25|
|BYSL00000899|      ZHAO|2007-8-26|
|BYSL00000900|      ZHAO|2007-8-26|
|BYSL00000901|      ZHAO|2007-8-27|
|BYSL00000902|      ZHAO|2007-8-27|
|BYSL00000904|      ZHAO|2007-8-28|
|BYSL00000905|      ZHAO|2007-8-28|
|BYSL00000906|      ZHAO|2007-8-28|
|BYSL00000907|      ZHAO|2007-8-29|
|BYSL00000908|      ZHAO|2007-8-30|
|BYSL00000909|      ZHAO| 2007-9-1|
|BYSL00000910|      ZHAO| 2007-9-1|
|BYSL00000911|      ZHAO|2007-8-31|
|BYSL00000912|      ZHAO| 2007-9-2|
|BYSL00000913|      ZHAO| 2007-9-3|
|BYSL00000914|      ZHAO| 2007-9-3|
|BYSL00000915|      ZHAO| 2007-9-4|
|BYSL00000916|      ZHAO| 2007-9-4|
+------------+----------+---------+
only showing top 20 rows

tbStockDetail:
scala> case class tbStockDetail(ordernumber:String, rownum:Int, itemid:String, number:Int, price:Double, amount:Double) extends Serializable
defined class tbStockDetail

scala> val tbStockDetailRdd = spark.sparkContext.textFile("tbStockDetail.txt")
tbStockDetailRdd: org.apache.spark.rdd.RDD[String] = tbStockDetail.txt MapPartitionsRDD[13] at textFile at <console>:23

scala> val tbStockDetailDS = tbStockDetailRdd.map(_.split(",")).map(attr=> tbStockDetail(attr(0),attr(1).trim().toInt,attr(2),attr(3).trim().toInt,attr(4).trim().toDouble, attr(5).trim().toDouble)).toDS
tbStockDetailDS: org.apache.spark.sql.Dataset[tbStockDetail] = [ordernumber: string, rownum: int ... 4 more fields]

scala> tbStockDetailDS.show()
+------------+------+--------------+------+-----+------+
| ordernumber|rownum|        itemid|number|price|amount|
+------------+------+--------------+------+-----+------+
|BYSL00000893|     0|FS527258160501|    -1|268.0|-268.0|
|BYSL00000893|     1|FS527258169701|     1|268.0| 268.0|
|BYSL00000893|     2|FS527230163001|     1|198.0| 198.0|
|BYSL00000893|     3|24627209125406|     1|298.0| 298.0|
|BYSL00000893|     4|K9527220210202|     1|120.0| 120.0|
|BYSL00000893|     5|01527291670102|     1|268.0| 268.0|
|BYSL00000893|     6|QY527271800242|     1|158.0| 158.0|
|BYSL00000893|     7|ST040000010000|     8|  0.0|   0.0|
|BYSL00000897|     0|04527200711305|     1|198.0| 198.0|
|BYSL00000897|     1|MY627234650201|     1|120.0| 120.0|
|BYSL00000897|     2|01227111791001|     1|249.0| 249.0|
|BYSL00000897|     3|MY627234610402|     1|120.0| 120.0|
|BYSL00000897|     4|01527282681202|     1|268.0| 268.0|
|BYSL00000897|     5|84126182820102|     1|158.0| 158.0|
|BYSL00000897|     6|K9127105010402|     1|239.0| 239.0|
|BYSL00000897|     7|QY127175210405|     1|199.0| 199.0|
|BYSL00000897|     8|24127151630206|     1|299.0| 299.0|
|BYSL00000897|     9|G1126101350002|     1|158.0| 158.0|
|BYSL00000897|    10|FS527258160501|     1|198.0| 198.0|
|BYSL00000897|    11|ST040000010000|    13|  0.0|   0.0|
+------------+------+--------------+------+-----+------+
only showing top 20 rows

tbDate:
scala> case class tbDate(dateid:String, years:Int, theyear:Int, month:Int, day:Int, weekday:Int, week:Int, quarter:Int, period:Int, halfmonth:Int) extends Serializable
defined class tbDate

scala> val tbDateRdd = spark.sparkContext.textFile("tbDate.txt")
tbDateRdd: org.apache.spark.rdd.RDD[String] = tbDate.txt MapPartitionsRDD[20] at textFile at <console>:23

scala> val tbDateDS = tbDateRdd.map(_.split(",")).map(attr=> tbDate(attr(0),attr(1).trim().toInt, attr(2).trim().toInt,attr(3).trim().toInt, attr(4).trim().toInt, attr(5).trim().toInt, attr(6).trim().toInt, attr(7).trim().toInt, attr(8).trim().toInt, attr(9).trim().toInt)).toDS
tbDateDS: org.apache.spark.sql.Dataset[tbDate] = [dateid: string, years: int ... 8 more fields]

scala> tbDateDS.show()
+---------+------+-------+-----+---+-------+----+-------+------+---------+
|   dateid| years|theyear|month|day|weekday|week|quarter|period|halfmonth|
+---------+------+-------+-----+---+-------+----+-------+------+---------+
| 2003-1-1|200301|   2003|    1|  1|      3|   1|      1|     1|        1|
| 2003-1-2|200301|   2003|    1|  2|      4|   1|      1|     1|        1|
| 2003-1-3|200301|   2003|    1|  3|      5|   1|      1|     1|        1|
| 2003-1-4|200301|   2003|    1|  4|      6|   1|      1|     1|        1|
| 2003-1-5|200301|   2003|    1|  5|      7|   1|      1|     1|        1|
| 2003-1-6|200301|   2003|    1|  6|      1|   2|      1|     1|        1|
| 2003-1-7|200301|   2003|    1|  7|      2|   2|      1|     1|        1|
| 2003-1-8|200301|   2003|    1|  8|      3|   2|      1|     1|        1|
| 2003-1-9|200301|   2003|    1|  9|      4|   2|      1|     1|        1|
|2003-1-10|200301|   2003|    1| 10|      5|   2|      1|     1|        1|
|2003-1-11|200301|   2003|    1| 11|      6|   2|      1|     2|        1|
|2003-1-12|200301|   2003|    1| 12|      7|   2|      1|     2|        1|
|2003-1-13|200301|   2003|    1| 13|      1|   3|      1|     2|        1|
|2003-1-14|200301|   2003|    1| 14|      2|   3|      1|     2|        1|
|2003-1-15|200301|   2003|    1| 15|      3|   3|      1|     2|        1|
|2003-1-16|200301|   2003|    1| 16|      4|   3|      1|     2|        2|
|2003-1-17|200301|   2003|    1| 17|      5|   3|      1|     2|        2|
|2003-1-18|200301|   2003|    1| 18|      6|   3|      1|     2|        2|
|2003-1-19|200301|   2003|    1| 19|      7|   3|      1|     2|        2|
|2003-1-20|200301|   2003|    1| 20|      1|   4|      1|     2|        2|
+---------+------+-------+-----+---+-------+----+-------+------+---------+
only showing top 20 rows
Register the temporary views:
scala> tbStockDS.createOrReplaceTempView("tbStock")

scala> tbDateDS.createOrReplaceTempView("tbDate")

scala> tbStockDetailDS.createOrReplaceTempView("tbStockDetail")

4.3 Calculate the number of sales orders and total sales per year among all orders

Count the number of sales orders and the total sales amount for each year across all orders.
After joining the three tables, count(distinct a.ordernumber) gives the number of sales orders and sum(b.amount) gives the total sales amount.

SELECT c.theyear, COUNT(DISTINCT a.ordernumber), SUM(b.amount)
FROM tbStock a
	JOIN tbStockDetail b ON a.ordernumber = b.ordernumber
	JOIN tbDate c ON a.dateid = c.dateid
GROUP BY c.theyear
ORDER BY c.theyear

spark.sql("SELECT c.theyear, COUNT(DISTINCT a.ordernumber), SUM(b.amount) FROM tbStock a JOIN tbStockDetail b ON a.ordernumber = b.ordernumber JOIN tbDate c ON a.dateid = c.dateid GROUP BY c.theyear ORDER BY c.theyear").show
The results are as follows:
+-------+---------------------------+--------------------+
|theyear|count(DISTINCT ordernumber)|         sum(amount)|
+-------+---------------------------+--------------------+
|   2004|                       1094|   3268115.499199999|
|   2005|                       3828|1.3257564149999991E7|
|   2006|                       3772|1.3680982900000006E7|
|   2007|                       4885|1.6719354559999993E7|
|   2008|                       4861| 1.467429530000001E7|
|   2009|                       2619|   6323697.189999999|
|   2010|                         94|  210949.65999999997|
+-------+---------------------------+--------------------+
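The same statistics can also be expressed with the Dataset DSL instead of SQL text; a sketch (column names follow the case classes registered above, and the aliases order_count and total_amount are arbitrary):

import org.apache.spark.sql.functions._

tbStockDS.join(tbStockDetailDS, "ordernumber")
  .join(tbDateDS, "dateid")
  .groupBy("theyear")
  .agg(countDistinct("ordernumber").as("order_count"), sum("amount").as("total_amount"))
  .orderBy("theyear")
  .show()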

4.4 Calculate the sales amount of the largest order in each year

Goal: Count the sales of the largest order each year:

1) First compute the total sales amount of each order

SELECT a.dateid, a.ordernumber, SUM(b.amount) AS SumOfAmount
FROM tbStock a
	JOIN tbStockDetail b ON a.ordernumber = b.ordernumber
GROUP BY a.dateid, a.ordernumber

spark.sql("SELECT a.dateid, a.ordernumber, SUM(b.amount) AS SumOfAmount FROM tbStock a JOIN tbStockDetail b ON a.ordernumber = b.ordernumber GROUP BY a.dateid, a.ordernumber").show
The results are as follows:
+----------+------------+------------------+
|    dateid| ordernumber|       SumOfAmount|
+----------+------------+------------------+
|  2008-4-9|BYSL00001175|             350.0|
| 2008-5-12|BYSL00001214|             592.0|
| 2008-7-29|BYSL00011545|            2064.0|
|  2008-9-5|DGSL00012056|            1782.0|
| 2008-12-1|DGSL00013189|             318.0|
|2008-12-18|DGSL00013374|             963.0|
|  2009-8-9|DGSL00015223|            4655.0|
| 2009-10-5|DGSL00015585|            3445.0|
| 2010-1-14|DGSL00016374|            2934.0|
| 2006-9-24|GCSL00000673|3556.1000000000004|
| 2007-1-26|GCSL00000826| 9375.199999999999|
| 2007-5-24|GCSL00001020| 6171.300000000002|
|  2008-1-8|GCSL00001217|            7601.6|
| 2008-9-16|GCSL00012204|            2018.0|
| 2006-7-27|GHSL00000603|            2835.6|
|2006-11-15|GHSL00000741|           3951.94|
|  2007-6-6|GHSL00001149|               0.0|
| 2008-4-18|GHSL00001631|              12.0|
| 2008-7-15|GHSL00011367|             578.0|
|  2009-5-8|GHSL00014637|            1797.6|
+----------+------------+------------------+
2) Using the result of the previous query as a base table, join it with tbDate on dateid to find the sales amount of the largest order in each year
SELECT theyear, MAX(c.SumOfAmount) AS SumOfAmount
FROM (SELECT a.dateid, a.ordernumber, SUM(b.amount) AS SumOfAmount
	FROM tbStock a
		JOIN tbStockDetail b ON a.ordernumber = b.ordernumber
	GROUP BY a.dateid, a.ordernumber
	) c
	JOIN tbDate d ON c.dateid = d.dateid
GROUP BY theyear
ORDER BY theyear DESC

spark.sql("SELECT theyear, MAX(c.SumOfAmount) AS SumOfAmount FROM (SELECT a.dateid, a.ordernumber, SUM(b.amount) AS SumOfAmount FROM tbStock a JOIN tbStockDetail b ON a.ordernumber = b.ordernumber GROUP BY a.dateid, a.ordernumber ) c JOIN tbDate d ON c.dateid = d.dateid GROUP BY theyear ORDER BY theyear DESC").show
The results are as follows:
+-------+------------------+                                                    
|theyear|       SumOfAmount|
+-------+------------------+
|   2010|13065.280000000002|
|   2009|25813.200000000008|
|   2008|           55828.0|
|   2007|          159126.0|
|   2006|           36124.0|
|   2005|38186.399999999994|
|   2004| 23656.79999999997|
+-------+------------------+

4.5 Calculate the best-selling items among all orders each year

Goal: find the best-selling product of each year (the product with the highest sales amount in a given year is that year's best-selling product)

Step 1: compute the annual sales amount of each product

SELECT c.theyear, b.itemid, SUM(b.amount) AS SumOfAmount
FROM tbStock a
	JOIN tbStockDetail b ON a.ordernumber = b.ordernumber
	JOIN tbDate c ON a.dateid = c.dateid
GROUP BY c.theyear, b.itemid

spark.sql("SELECT c.theyear, b.itemid, SUM(b.amount) AS SumOfAmount FROM tbStock a JOIN tbStockDetail b ON a.ordernumber = b.ordernumber JOIN tbDate c ON a.dateid = c.dateid GROUP BY c.theyear, b.itemid").show
The results are as follows:
+-------+--------------+------------------+                                     
|theyear|        itemid|       SumOfAmount|
+-------+--------------+------------------+
|   2004|43824480810202|           4474.72|
|   2006|YA214325360101|             556.0|
|   2006|BT624202120102|             360.0|
|   2007|AK215371910101|24603.639999999992|
|   2008|AK216169120201|29144.199999999997|
|   2008|YL526228310106|16073.099999999999|
|   2009|KM529221590106| 5124.800000000001|
|   2004|HT224181030201|2898.6000000000004|
|   2004|SG224308320206|           7307.06|
|   2007|04426485470201|14468.800000000001|
|   2007|84326389100102|           9134.11|
|   2007|B4426438020201|           19884.2|
|   2008|YL427437320101|12331.799999999997|
|   2008|MH215303070101|            8827.0|
|   2009|YL629228280106|           12698.4|
|   2009|BL529298020602|            2415.8|
|   2009|F5127363019006|             614.0|
|   2005|24425428180101|          34890.74|
|   2007|YA214127270101|             240.0|
|   2007|MY127134830105|          11099.92|
+-------+--------------+------------------+
Step 2: based on step 1, find the maximum per-product sales amount for each year
SELECT d.theyear, MAX(d.SumOfAmount) AS MaxOfAmount
FROM (SELECT c.theyear, b.itemid, SUM(b.amount) AS SumOfAmount
	FROM tbStock a
		JOIN tbStockDetail b ON a.ordernumber = b.ordernumber
		JOIN tbDate c ON a.dateid = c.dateid
	GROUP BY c.theyear, b.itemid
	) d
GROUP BY d.theyear

spark.sql("SELECT d.theyear, MAX(d.SumOfAmount) AS MaxOfAmount FROM (SELECT c.theyear, b.itemid, SUM(b.amount) AS SumOfAmount FROM tbStock a JOIN tbStockDetail b ON a.ordernumber = b.ordernumber JOIN tbDate c ON a.dateid = c.dateid GROUP BY c.theyear, b.itemid ) d GROUP BY d.theyear").show
The results are as follows:
+-------+------------------+                                                    
|theyear|       MaxOfAmount|
+-------+------------------+
|   2007|           70225.1|
|   2006|          113720.6|
|   2004|53401.759999999995|
|   2009|           30029.2|
|   2005|56627.329999999994|
|   2010|            4494.0|
|   2008| 98003.60000000003|
+-------+------------------+
Step 3: join the per-product sales amounts with the yearly maximum sales amount (joining on both the year and the amount) to obtain the row for the best-selling product of each year
SELECT DISTINCT e.theyear, e.itemid, f.MaxOfAmount
FROM (SELECT c.theyear, b.itemid, SUM(b.amount) AS SumOfAmount
	FROM tbStock a
		JOIN tbStockDetail b ON a.ordernumber = b.ordernumber
		JOIN tbDate c ON a.dateid = c.dateid
	GROUP BY c.theyear, b.itemid
	) e
	JOIN (SELECT d.theyear, MAX(d.SumOfAmount) AS MaxOfAmount
		FROM (SELECT c.theyear, b.itemid, SUM(b.amount) AS SumOfAmount
			FROM tbStock a
				JOIN tbStockDetail b ON a.ordernumber = b.ordernumber
				JOIN tbDate c ON a.dateid = c.dateid
			GROUP BY c.theyear, b.itemid
			) d
		GROUP BY d.theyear
		) f ON e.theyear = f.theyear
		AND e.SumOfAmount = f.MaxOfAmount
ORDER BY e.theyear

spark.sql("SELECT DISTINCT e.theyear, e.itemid, f.maxofamount FROM (SELECT c.theyear, b.itemid, SUM(b.amount) AS sumofamount FROM tbStock a JOIN tbStockDetail b ON a.ordernumber = b.ordernumber JOIN tbDate c ON a.dateid = c.dateid GROUP BY c.theyear, b.itemid ) e JOIN (SELECT d.theyear, MAX(d.sumofamount) AS maxofamount FROM (SELECT c.theyear, b.itemid, SUM(b.amount) AS sumofamount FROM tbStock a JOIN tbStockDetail b ON a.ordernumber = b.ordernumber JOIN tbDate c ON a.dateid = c.dateid GROUP BY c.theyear, b.itemid ) d GROUP BY d.theyear ) f ON e.theyear = f.theyear AND e.sumofamount = f.maxofamount ORDER BY e.theyear").show
The results are as follows:
+-------+--------------+------------------+                                     
|theyear|        itemid|       maxofamount|
+-------+--------------+------------------+
|   2004|JY424420810101|53401.759999999995|
|   2005|24124118880102|56627.329999999994|
|   2006|JY425468460101|          113720.6|
|   2007|JY425468460101|           70225.1|
|   2008|E2628204040101| 98003.60000000003|
|   2009|YL327439080102|           30029.2|
|   2010|SQ429425090101|            4494.0|
+-------+--------------+------------------+
