SparkSQL Series (Three): SparkSQL Column Operations, Window Functions, and Joins

One: SparkSQL column operations

SparkContext and initialization data:

import java.util.Arrays

import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.{DataFrame, Row, SparkSession, functions}
import org.apache.spark.sql.functions.{col, desc, length, row_number, trim, when}
import org.apache.spark.sql.functions.{countDistinct,sum,count,avg}
import org.apache.spark.sql.functions.concat
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.SaveMode
import java.util.ArrayList

object WordCount {

  def initSparkAndData() : DataFrame = {

    val sparkSession = SparkSession.builder().master("local").appName("AppName").getOrCreate()
    val javasc = new JavaSparkContext(sparkSession.sparkContext)
    val nameRDD = javasc.parallelize(Arrays.asList("{'name':'wangwu','age':'18','vip':'t'}",
      "{'name':'sunliu','age':'19','vip':'t'}","{'name':'zhangsan','age':'20','vip':'f'}"));
    val namedf = sparkSession.read.json(nameRDD)

    namedf
  }
}

Adding a column

    val data = initSparkAndData()

    // Method one: a constant value can be added
    data.select(when(col("name").isNotNull, 1).otherwise(0).as("UserGroup")).show(100)
    // Method two: build the new column only from existing columns
    data.withColumn("time", concat(col("age"), col("name"))).show(100)
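A third variant, a minimal sketch under the same data: withColumn plus lit adds a constant (or conditional) column while keeping all existing columns. The column name isAdult and the cast to int are illustrative assumptions, since the sample JSON parses age as a string.

    // Hedged sketch: lit builds a constant Column, when/otherwise makes it
    // conditional; age arrives as a string from the JSON, so cast it first
    import org.apache.spark.sql.functions.lit
    data.withColumn("isAdult", when(col("age").cast("int") >= 18, lit(1)).otherwise(lit(0))).show(100)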

Removing a column

    val data = initSparkAndData()
    data.drop("vip").show(100)
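drop also accepts several column names at once; a small sketch (the choice of columns is just for illustration):

    data.drop("vip", "age").show(100)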

Two: SparkSQL window functions

    Traditional databases offer this functionality as partition by ... order by .... Let us look at how to write the same thing in SparkSQL:

    val data = initSparkAndData()
    data.withColumn("isVsip", row_number().over(Window.partitionBy(col("vip")).orderBy(desc("name")))).show(100)

    The code above partitions the data by vip, sorts each partition by name in descending order, and writes the resulting row number into a new column isVsip. In projects, this pattern is typically used to fetch the latest record per key, for example:

    Finding each user's most recent login record:

    val sparkSession= SparkSession.builder().master("local").appName("AppName").getOrCreate()
    val javasc = new JavaSparkContext(sparkSession.sparkContext)

    val nameRDD1 = javasc.parallelize(Arrays.asList("{'name':'wangwu','time':'2019-08-12'}",
      "{'name':'sunliu','time':'2019-08-13'}","{'name':'zhangsan','time':'2019-08-14'}"));
    val namedf1 = sparkSession.read.json(nameRDD1)

    val nameRDD2 = javasc.parallelize(Arrays.asList("{'name':'wangwu','time':'2019-09-12'}",
      "{'name':'sunliu','time':'2019-08-13'}","{'name':'zhangsan', 'Time': '2019-07-14'} "));     // above all configuration data.
    Val namedf2 = sparkSession.read.json (nameRDD2)

    namedf1.union(namedf2).withColumn("max_time", row_number().over(Window.partitionBy(col("name")).orderBy(desc("time"))))
      .filter(col("max_time") ===1).show(100)
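
    An alternative sketch, hedged: when only the latest time itself is needed rather than the whole latest row, a plain aggregation can replace the window function.

    // Same test data as above: take the maximum time per name, no window needed
    import org.apache.spark.sql.functions.max
    namedf1.union(namedf2)
      .groupBy("name")
      .agg(max("time").as("max_time"))
      .show(100)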

Three: SparkSQL join operations

SparkContext and initialization data:

import java.util.Arrays

import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.{DataFrame, Row, SparkSession, functions}
import org.apache.spark.sql.functions.{col, desc, length, row_number, trim, when}
import org.apache.spark.sql.functions.{countDistinct,sum,count,avg}
import org.apache.spark.sql.functions.concat
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.SaveMode
import java.util.ArrayList


object WordCount {
  def joinTestData() = {
    val sparkSession = SparkSession.builder().master("local").appName("AppName").getOrCreate()
    val javasc = new JavaSparkContext(sparkSession.sparkContext)

    val nameRDD = javasc.parallelize(Arrays.asList("{'name':'zhangsan','age':'18','sex':'N'}", "{'name':'lisi','age':'19','sex':'F'}","{'':'','':'','':''}"));
    val nameRDD1 = javasc.parallelize(Arrays.asList("{'name':'wangwu','age':'18','vip':'t'}", "{'name':'sunliu','age':'19','vip':'t'}","{'name':'zhangsan','age':'18','vip':'f'}"));
    val data1 = sparkSession.read.json(nameRDD)
    val data2 = sparkSession.read.json(nameRDD1)

    (data1,data2)
  }
}

left, leftouter, and left_outer are all the same

    val dataTuple = joinTestData()
    val data1 = dataTuple._1
    val data2 = dataTuple._2

    val left = data1.join(data2, data1("name") === data2("name"), "left").show(100)

       result:

age   name      sex   age   name      vip
null  null      null  null  null      null
18    zhangsan  N     18    zhangsan  f
19    lisi      F     null  null      null
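
Note that joining on data1("name") === data2("name") keeps both name columns, which is why name appears twice in the result above. As a hedged aside, DataFrame.join also has an overload that takes the key as a Seq of column names and emits a single, coalesced key column:

    // USING-style join on the shared "name" column: one "name" in the output
    data1.join(data2, Seq("name"), "left").show(100)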

 

right, rightouter, and right_outer are all the same

    val dataTuple = joinTestData()
    val data1 = dataTuple._1
    val data2 = dataTuple._2

    val right = data1.join(data2, data1("name") === data2("name"), "right").show(100)

       result:

age   name      sex   age   name      vip
null  null      null  18    wangwu    t
18    zhangsan  N     18    zhangsan  f
null  null      null  19    sunliu    t

 

cross and inner are the same

    val dataTuple = joinTestData()
    val data1 = dataTuple._1
    val data2 = dataTuple._2

    val inner = data1.join(data2, data1("name") === data2("name"), "inner").show(100)

       result:

age   name      sex   age   name      vip
18    zhangsan  N     18    zhangsan  f

full, fullouter, full_outer, and outer are all the same

    val dataTuple = joinTestData()
    val data1 = dataTuple._1
    val data2 = dataTuple._2

    val full = data1.join(data2, data1("name") === data2("name"), "full").show(100)

       result:

age   name      sex   age   name      vip
null  null      null  18    wangwu    t
null  null      null  null  null      null
18    zhangsan  N     18    zhangsan  f
null  null      null  19    sunliu    t
19    lisi      F     null  null      null

 

leftsemi (an inner join that keeps only the left side's columns)

    val dataTuple = joinTestData()
    val data1 = dataTuple._1
    val data2 = dataTuple._2

    val leftsemi = data1.join(data2, data1("name") === data2("name"), "leftsemi").show(100)

    Real-world use in projects: a project often has one large table, keyed by user ID, that holds all of the users' basic information. Jobs usually have to join against this large table to obtain that information, so leftsemi is commonly used to shrink the large table first.

       result:

age   name      sex
18    zhangsan  N
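
To make that shrinking pattern concrete, a hypothetical sketch: bigUserTable, activeToday, and the key user_id do not appear in this article and are assumed names.

    // Hypothetical: keep only big-table rows whose user_id is active today;
    // leftsemi only filters, it adds no columns from activeToday
    val shrunk = bigUserTable.join(activeToday, Seq("user_id"), "leftsemi")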

leftanti (the complement of leftsemi: keeps the left-side rows that fail to match the right side)

    val dataTuple = joinTestData()
    val data1 = dataTuple._1
    val data2 = dataTuple._2

    val leftanti = data1.join(data2, data1("name") === data2("name"), "leftanti").show(100)

       result:

age   name  sex
null  null  null
19    lisi  F
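
A closing sketch: leftsemi and leftanti are complements, so together they split the left table into its matched and unmatched rows.

    val matched = data1.join(data2, data1("name") === data2("name"), "leftsemi")
    val unmatched = data1.join(data2, data1("name") === data2("name"), "leftanti")
    // matched.count() + unmatched.count() == data1.count()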

Origin www.cnblogs.com/wuxiaolong4/p/11706811.html