Big data processing framework - Spark DataFrame construction, join and null filling

1. Introduction to Spark DataFrame


A DataFrame is a Spark SQL abstraction: a distributed collection of data organized into named columns, which can be thought of as a table. The main difference from an RDD is that a DataFrame carries schema metadata, so every column of the two-dimensional dataset it represents has a name and a type.
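To make the schema point concrete, here is a minimal sketch that builds a tiny DataFrame and inspects its column names and types. The object name `SchemaSketch`, the helper `sampleDF`, and the sample rows are hypothetical and only serve as illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType}

object SchemaSketch {
  // Hypothetical sample data: two rows of (id, name)
  def sampleDF(spark: SparkSession): DataFrame =
    spark.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "name")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SchemaSketch")
      .master("local[*]")
      .getOrCreate()

    val df = sampleDF(spark)

    // Print the schema as a tree: one entry per column, with name and type
    df.printSchema()

    // The same metadata is available programmatically
    df.schema.fields.foreach(f => println(s"${f.name}: ${f.dataType}"))

    spark.stop()
  }
}
```

An RDD of the same tuples would only know that each element is an `(Int, String)`; the column names and the queryable schema are what the DataFrame adds.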

2. Constructing DataFrames

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

object AppendColDFTest {

  // Set the log level so that only errors are printed
  Logger.getLogger("org").setLevel(Level.ERROR)
  Logger.getRootLogger().setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("InDFTest")
      .master("local[*]")
      .getOrCreate()

    // Create aDF and bDF
    val aData = Seq(
      (1, 1, 10, 20, 30),
      (1, 2, 10, 20, 30),
      (2, 1, 10, 20, 20),
      (2, 2, 10, 20, 50),
      (3, 4, 10, 20, 40),
      (3, 5, 10, 20, 30),
      (3, 6, 10, 20, 30),
      (4, 1, 10, 20, 20),
      (4, 2, 10, 20, 50)
    )
    val aDF = spark.createDataFrame(aData).toDF("x", "y", "z", "p", "q")

    val bData = Seq(
      (1, 1, 5, 15, 25),
      (2, 1, 25, 55, 105),
      (3, 4, 75, 85, 95)
    )
    val bDF = spark.createDataFrame(bData).toDF("x", "y", "m", "n", "l")

  }
}
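Building a DataFrame from a `Seq` of tuples lets Spark infer the column types. When the input may already contain missing values, one alternative is to declare the schema explicitly and build the DataFrame from `Row` objects. The following is a sketch under that assumption; the object name `ExplicitSchemaSketch`, the helper `build`, and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object ExplicitSchemaSketch {
  // Build a small DataFrame whose column m is declared nullable
  def build(spark: SparkSession): DataFrame = {
    val schema = StructType(Seq(
      StructField("x", IntegerType, nullable = false),
      StructField("y", IntegerType, nullable = false),
      StructField("m", IntegerType, nullable = true)
    ))
    // Hypothetical rows; the second one has no value for m
    val rows = Seq(Row(1, 1, 5), Row(2, 1, null))
    spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ExplicitSchemaSketch")
      .master("local[*]")
      .getOrCreate()

    val df = build(spark)
    df.printSchema()
    df.show()

    spark.stop()
  }
}
```

The tuple-based construction is shorter; the explicit schema is mainly useful when nullability or exact column types need to be controlled.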

3. Joining two DataFrames

// Join aDF and bDF with a left join on the key columns x and y
val joinedDF = aDF.join(bDF, Seq("x", "y"), "left")
joinedDF.show()
+---+---+---+---+---+----+----+----+
|  x|  y|  z|  p|  q|   m|   n|   l|
+---+---+---+---+---+----+----+----+
|  1|  1| 10| 20| 30|   5|  15|  25|
|  1|  2| 10| 20| 30|null|null|null|
|  2|  1| 10| 20| 20|  25|  55| 105|
|  2|  2| 10| 20| 50|null|null|null|
|  3|  4| 10| 20| 40|  75|  85|  95|
|  3|  5| 10| 20| 30|null|null|null|
|  3|  6| 10| 20| 30|null|null|null|
|  4|  1| 10| 20| 20|null|null|null|
|  4|  2| 10| 20| 50|null|null|null|
+---+---+---+---+---+----+----+----+
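The nulls in the output come from the left join keeping every row of the left table, whether or not a matching key exists on the right. As a rough comparison, the sketch below runs an inner, a left, and a left anti join on two small tables. The object name `JoinTypeSketch`, the helper `joins`, and the data are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinTypeSketch {
  // Returns the (inner, left, left_anti) joins of two small tables on x and y
  def joins(spark: SparkSession): (DataFrame, DataFrame, DataFrame) = {
    val a = spark.createDataFrame(Seq((1, 1, 10), (1, 2, 10), (2, 1, 10))).toDF("x", "y", "z")
    val b = spark.createDataFrame(Seq((1, 1, 5), (2, 1, 25))).toDF("x", "y", "m")
    val keys = Seq("x", "y")
    (a.join(b, keys, "inner"), a.join(b, keys, "left"), a.join(b, keys, "left_anti"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JoinTypeSketch")
      .master("local[*]")
      .getOrCreate()

    val (inner, left, anti) = joins(spark)
    inner.show() // only keys present in both tables
    left.show()  // every row of a; m is null where b has no match
    anti.show()  // rows of a with no match in b, with a's columns only

    spark.stop()
  }
}
```

A left anti join is a convenient way to list exactly the rows that will end up with nulls after the left join.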

4. Filling null values

// Replace the missing values in m, n and l with 0, then keep x, y and the filled columns
val resultDF = joinedDF
  .withColumn("m", when(col("m").isNull, lit(0)).otherwise(col("m")))
  .withColumn("n", when(col("n").isNull, lit(0)).otherwise(col("n")))
  .withColumn("l", when(col("l").isNull, lit(0)).otherwise(col("l")))
  .select("x", "y", "m", "n", "l")
  .orderBy("x", "y")

// Show the final result
resultDF.show()
+---+---+---+---+---+
|  x|  y|  m|  n|  l|
+---+---+---+---+---+
|  1|  1|  5| 15| 25|
|  1|  2|  0|  0|  0|
|  2|  1| 25| 55|105|
|  2|  2|  0|  0|  0|
|  3|  4| 75| 85| 95|
|  3|  5|  0|  0|  0|
|  3|  6|  0|  0|  0|
|  4|  1|  0|  0|  0|
|  4|  2|  0|  0|  0|
+---+---+---+---+---+
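The `when(...).otherwise(...)` pattern is not the only way to fill nulls. Two shorter alternatives that should give the same result here are `na.fill` and `coalesce`. The sketch below applies both to a small left-join result; the object name `NullFillSketch`, its helper functions, and the sample data are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, lit}

object NullFillSketch {
  // A small left join that leaves m and n null for the key (1, 2)
  def joined(spark: SparkSession): DataFrame = {
    val a = spark.createDataFrame(Seq((1, 1, 10), (1, 2, 10))).toDF("x", "y", "z")
    val b = spark.createDataFrame(Seq((1, 1, 5, 15))).toDF("x", "y", "m", "n")
    a.join(b, Seq("x", "y"), "left")
  }

  // Option 1: fill the listed numeric columns in one call
  def fillWithNa(df: DataFrame): DataFrame =
    df.na.fill(0L, Seq("m", "n"))

  // Option 2: coalesce returns the first non-null argument per row
  def fillWithCoalesce(df: DataFrame): DataFrame =
    df.withColumn("m", coalesce(col("m"), lit(0)))
      .withColumn("n", coalesce(col("n"), lit(0)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("NullFillSketch")
      .master("local[*]")
      .getOrCreate()

    val df = joined(spark)
    fillWithNa(df).orderBy("x", "y").show()
    fillWithCoalesce(df).orderBy("x", "y").show()

    spark.stop()
  }
}
```

`na.fill` is the most compact when several columns share the same default, while `coalesce` also accepts another column as the fallback instead of a literal.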

Reposted from: blog.csdn.net/programmer589/article/details/131991744