DataFrame Data Operations

One. SQL-style operations

The core idea: register the DataFrame as a temporary view, and then you can run SQL statements against that view directly. There are two kinds of views: session-level views and global views.
A session-level view is only valid within the Session that created it; once that Session ends, the view is gone.
A global view is valid for the whole application.
Note that a global view must be accessed through its full path, e.g. global_temp.people

// valid globally, across the whole application
df.createGlobalTempView("stu")
spark.sql(
  """
    |select * from global_temp.stu a order by a.score desc
  """.stripMargin)
    .show()

// valid only within the current session
df.createTempView("s")
spark.sql(
  """
    |select * from s order by score
  """.stripMargin)
  .show()

val spark2 = spark.newSession()
// a global view can be accessed from the new session spark2
spark2.sql("select id,name from global_temp.stu").show()
// a session-scoped view cannot be accessed from spark2; this query fails with an analysis error (table or view not found)
spark2.sql("select id,name from s").show()

Two. DSL-style API

The DSL-style API expresses SQL semantics through programmatic method calls instead of SQL text.

Data preparation:

val df = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv("data/stu.csv")

(1) Basic select and select expressions

/**
  * Row-wise operations
  */
// use plain strings to refer to "columns"
df.select("id","name").show()

// to express a SQL expression as a string, use the selectExpr method
df.selectExpr("id+1","upper(name)").show()
// a SQL expression string passed to select would be treated as a single column name and fail
// df.select("id+1","upper(name)").show()

import spark.implicits._
// use the $ interpolator to create Column objects that represent "columns"
df.select($"id",$"name").show()

// use the leading single-quote (symbol) syntax to create Column objects
df.select('id,'name).show()

// use the col function to create Column objects
import org.apache.spark.sql.functions._
df.select(col("id"),col("name")).show()

// use the DataFrame's apply method to create Column objects
df.select(df("id"),df("name")).show()

// call Column methods directly, or call functions that return Columns, to build the computation expressions you would write in SQL
df.select('id.plus(2).leq("4").as("id2"),upper('name)).show()
df.select('id+2 <= 4 as "id2",upper('name)).show()

(2) Filter conditions

/**
  * Row-wise filtering
  */
df.where("id>4 and score>95").show()
df.where('id > 4 and 'score > 95).select("id","name","age").show()
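where is an alias of filter in the DataFrame API, so the same conditions can also be written with filter (equivalent sketch):

// filter is equivalent to where
df.filter("id>4 and score>95").show()
df.filter('id > 4 and 'score > 95).select("id","name","age").show()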

(3) Field renaming

/**
  * Field renaming
  */
// call the as method on a Column object
df.select('id as "id2",$"name".as("n2"),col("age") as "age2").show()

// write SQL aliasing syntax directly inside selectExpr
df.selectExpr("cast(id as string) as id2","name","city").show()

// call withColumnRenamed on the DataFrame to rename a specific field
df.select("id","name","age").withColumnRenamed("id","id2").show()

// call toDF on the DataFrame to reset all of the column names at once
df.toDF("id2","name","age","city2","score").show()

(4) Grouping and aggregation

/**
  * Grouping and aggregation
  */
df.groupBy("city").count().show()
df.groupBy("city").min("score").show()
df.groupBy("city").max("score").show()
df.groupBy("city").sum("score").show()
df.groupBy("city").avg("score").show()

df.groupBy("city").agg(("score","max"),("score","sum")).show()
df.groupBy("city").agg("score"->"max","score"->"sum").show()

(5) Subqueries

/**
  * Subquery
  * Equivalent to:
  * select
  * *
  * from
  * (
  *   select
  *   city,sum(score) as score
  *   from stu
  *   group by city
  * ) o
  * where score>165
  */
df.groupBy("city")
  .agg(sum("score") as "score")
  .where("score > 165")
  .select("city", "score")
  .show()

(6) Join queries
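The join examples below use two DataFrames df1 and df2 that are not prepared in this snippet. A minimal sketch of how they might be created; the column names and toy rows are assumptions chosen to fit the join conditions used below:

import spark.implicits._
// assumed layouts: df1(id, name, sex), df2(id, score, sex)
val df1 = Seq((1, "zhangsan", "m"), (2, "lisi", "f"), (3, "wangwu", "m")).toDF("id", "name", "sex")
val df2 = Seq((2, 90.0, "f"), (3, 85.0, "m"), (4, 70.0, "m")).toDF("id", "score", "sex")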

// Cartesian product
//df1.crossJoin(df2).show()

// pass a join column name to join; this requires the column to exist with the same name in both tables
df1.join(df2,"id").show()

// pass multiple join columns; all of them must exist with the same names in both tables
df1.join(df2,Seq("id","sex")).show()

// pass a custom join condition expression
df1.join(df2,df1("id") + 1 === df2("id")).show()

// a join type can also be passed: inner (default), left, right, full, left_semi, left_anti
df1.join(df2,df1("id")+1 === df2("id"),"left").show()
df1.join(df2,Seq("id"),"right").show()

/**
  * Summary:
  *    join type:      joinType: String
  *    join condition:
  *        pass the join column name(s) directly: usingColumn / usingColumns: Seq[String]   Note: the right table's join columns do not appear in the result
  *        or pass a custom join expression: Column.+(1) === Column, e.g. df1("id")+1 === df2("id")
  */

(7) Window (analytic) functions

Requirement: find the top two people by score in each city.
In SQL:

select
  id,name,age,sex,city,score
from
(
  select
    id,name,age,sex,city,score,
    row_number() over(partition by city order by score desc) as rn
  from t
) o
where rn<=2

DSL-style API implementation:

package cn.doitedu.sparksql
import org.apache.spark.sql.expressions.Window
/**
  * Implement SQL window (analytic) functions with the DSL-style API
  */
object Demo15_DML_DSLAPI_WINDOW {

  def main(args: Array[String]): Unit = {
    val spark = SparkUtil.getSpark()
    // inferSchema so that score is numeric and ordering by score behaves as intended
    val df = spark.read.option("header",true).option("inferSchema",true).csv("data/stu2.csv")
    import spark.implicits._
    import org.apache.spark.sql.functions._
    val window = Window.partitionBy('city).orderBy('score.desc)
    df.select('id,'name,'age,'sex,'city,'score,row_number().over(window) as "rn")
      .where('rn <= 2)
      .drop("rn") // 最后结果中不需要rn列,可以drop掉这个列
      //.select('id,'name,'age,'sex,'city,'score)  // 或者用select指定你所需要的列
      .show()
    spark.close()
  }
}
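row_number is only one of the ranking functions; rank and dense_rank from org.apache.spark.sql.functions can be used over the same window definition when ties should be ranked differently. A small sketch, assuming the same df and window defined inside main above:

// rank leaves gaps after ties, dense_rank does not
df.select('id, 'name, 'city, 'score,
    rank().over(window) as "rk",
    dense_rank().over(window) as "drk")
  .show()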

Three. Operating on a DataFrame with RDD operators

Motivation: in some scenarios the processing logic is hard to express with SQL syntax, and in those cases the DataFrame can be converted to an RDD and processed with RDD operators. A DataFrame is essentially a wrapper around RDD[Row], so operating on a DataFrame with RDD operators only requires knowing how to extract the data out of a Row.

(1) Extracting data from a Row, way 1: get fields by positional index

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val rdd: RDD[Row] = df.rdd
rdd.map(row=>{
  val id = row.get(0).asInstanceOf[Int]
  val name = row.getString(1)
  (id,name)
}).take(10).foreach(println)

(2) Extracting data from a Row, way 2: get fields by field name

rdd.map(row=>{
  val id = row.getAs[Int]("id")
  val name = row.getAs[String]("name")
  val age = row.getAs[Int]("age")
  val city = row.getAs[String]("city")
  val score = row.getAs[Double]("score")
  (id,name,age,city,score)
}).take(10).foreach(println)

(3) Extracting data from a Row, way 3: extract fields by pattern matching

rdd.map {
  case Row(id: Int, name: String, age: Int, city: String, score: Double) =>
    // do anything with the deconstructed fields
    (id,name,age,city,score)
}.take(10).foreach(println)
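Once the data has been processed with RDD operators, the result can be turned back into a DataFrame. A minimal sketch, assuming spark.implicits._ is in scope for toDF (the column names here are just illustrative):

// convert the tuple RDD back into a DataFrame with new column names
val resultDF = rdd
  .map(row => (row.getAs[Int]("id"), row.getAs[String]("name")))
  .toDF("id", "name")
resultDF.show()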