Spark DataFrame and Schema: Detailed Explanation with a Practical Example

1. Concept introduction

Spark is a distributed computing framework for large-scale data processing. In Spark, a DataFrame is a distributed collection of data, similar to a table in a relational database. It provides a higher-level abstraction that lets users process data declaratively, without worrying about the underlying data layout or the complexities of distributed computing. A Schema describes the structure of the data in a DataFrame, similar to the column definitions of a table.

Let's introduce DataFrame and Schema separately:

DataFrame:

A DataFrame is a distributed collection of data organized into rows and columns, similar to a table in a traditional database or a spreadsheet. Spark's DataFrame has the following characteristics:
Distributed computing: a DataFrame is partitioned across the cluster and can be processed in parallel on multiple nodes, enabling high-performance processing of large-scale data.
Immutability: DataFrames are immutable; once created, they cannot be modified. Operations on a DataFrame instead produce new DataFrames.
Lazy evaluation: Spark evaluates DataFrame operations lazily, meaning they are not executed immediately; they are optimized and executed only when an output result is required.
Users can operate on DataFrames with SQL statements, the DataFrame API, or Spark SQL to filter, transform, aggregate, and otherwise process data. The strength of the DataFrame lies in its ease of use and its optimization capabilities: Spark optimizes the whole computation according to the execution plan of the operations to improve performance.
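As a quick illustration, here is a minimal sketch of this declarative style and of lazy evaluation; the department data, column names, and app name are invented for the example:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DataFrameSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // A tiny in-memory DataFrame; the columns are made up for this sketch
    val df = Seq(("Finance", 3000), ("Sales", 3900), ("Finance", 5000)).toDF("dept", "salary")

    // These transformations only build a logical plan; nothing runs yet (lazy evaluation)
    val avgByDept = df.filter($"salary" > 3000).groupBy("dept").agg(avg("salary").as("avg_salary"))

    // show() is an action: Spark optimizes the plan and executes it at this point
    avgByDept.show()

    spark.stop()
  }
}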

Schema:

A Schema is a structural description of the data in a DataFrame: it is a collection of metadata that defines each column's name and data type. The Schema information of a DataFrame is crucial for optimizing computation and interpreting data types correctly.
Typically the Schema is inferred automatically when the DataFrame is created, but it can also be specified explicitly in code. Specifying a Schema ensures that the data is interpreted correctly and avoids potential type-conversion errors. If the data source does not carry Schema information, or the Schema needs to be changed, you can build a custom Schema with StructType and StructField, for example a Schema with several fields of different data types such as strings, integers, and dates (see the sketch below).
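A minimal sketch of such a hand-built Schema (the field names here are chosen purely for illustration):

import org.apache.spark.sql.types.{DateType, IntegerType, StringType, StructField, StructType}

// A custom Schema with string, integer, and date columns
val employeeSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true),
  StructField("hire_date", DateType, nullable = true)
))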

When Spark reads a data source such as a CSV file, JSON data, or a database table, it tries to infer the data's Schema automatically. If the data source itself does not provide enough information, you can supply a Schema explicitly when reading, or adjust the DataFrame's Schema through subsequent transformations, as sketched below.
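For example, here is a sketch of both approaches for a CSV file; the file path is hypothetical, and employeeSchema is the custom Schema from the previous sketch:

// Let Spark sample the file and infer column types (costs an extra pass over the data)
val inferred = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/employees.csv") // hypothetical path

// Or supply the Schema explicitly to skip inference and avoid type-conversion surprises
val explicit = spark.read
  .option("header", "true")
  .schema(employeeSchema)
  .csv("/path/to/employees.csv")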

Summary: DataFrame is a powerful distributed data structure in Spark that lets users process data declaratively, while Schema describes the structure of the data in a DataFrame so that it is interpreted and processed correctly. Together these two concepts form the foundation of Spark's data processing capabilities.

2. Code example

package test.scala

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

object TestSchema {

  def getSparkSession(appName: String, localType: Int): SparkSession = {
    val builder: SparkSession.Builder = SparkSession.builder().appName(appName)
    if (localType == 1) {
      builder.master("local[8]") // Local mode with 8 cores
    }

    val spark = builder.getOrCreate() // Get an existing SparkSession or create a new one
    spark.sparkContext.setLogLevel("ERROR") // Set the Spark log level
    spark
  }

  def main(args: Array[String]): Unit = {
    println("Start TestSchema")
    val spark: SparkSession = getSparkSession("TestSchema", 1)

    // Sample data: each row carries an id, a department, and a nested (salary, location) struct
    val structureData = Seq(
      Row("36636", "Finance", Row(3000, "USA")),
      Row("40288", "Finance", Row(5000, "IND")),
      Row("42114", "Sales", Row(3900, "USA")),
      Row("39192", "Marketing", Row(2500, "CAN")),
      Row("34534", "Sales", Row(6500, "USA"))
    )

    // Explicit Schema with a nested StructType for the "properties" column
    val structureSchema = new StructType()
      .add("id", StringType)
      .add("dept", StringType)
      .add("properties", new StructType()
        .add("salary", IntegerType)
        .add("location", StringType)
      )

    val df = spark.createDataFrame(
      spark.sparkContext.parallelize(structureData), structureSchema)
    df.printSchema()
    df.show(false)

    // Inspect the Schema of the first row: print each field's name, value, and data type
    val row = df.first()
    val schema = row.schema
    val structTypeList = schema.toList
    println(structTypeList.size)
    for (structField <- structTypeList) {
      println(structField.name, row.getAs[Any](structField.name), structField.dataType)
    }
  }
}
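As a small follow-up (not part of the original listing), the nested properties column of a row can itself be read back as a Row; the snippet below is a sketch of how that could be added at the end of main, using positional access in the order the nested fields were declared:

    // Hypothetical extension: pull the nested struct out of the first row
    val props = row.getStruct(row.fieldIndex("properties"))
    println(props.getInt(0), props.getString(1)) // salary, location (schema order)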

Output

Start TestSchema
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
23/07/29 09:47:59 INFO SparkContext: Running Spark version 2.4.0
23/07/29 09:47:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/07/29 09:47:59 INFO SparkContext: Submitted application: TestSchema
23/07/29 09:47:59 INFO SecurityManager: Changing view acls to: Nebula
23/07/29 09:47:59 INFO SecurityManager: Changing modify acls to: Nebula
23/07/29 09:47:59 INFO SecurityManager: Changing view acls groups to:
23/07/29 09:47:59 INFO SecurityManager: Changing modify acls groups to:
23/07/29 09:47:59 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Nebula); groups with view permissions: Set(); users with modify permissions: Set(Nebula); groups with modify permissions: Set()
23/07/29 09:48:01 INFO Utils: Successfully started service 'sparkDriver' on port 60785.
23/07/29 09:48:01 INFO SparkEnv: Registering MapOutputTracker
23/07/29 09:48:01 INFO SparkEnv: Registering BlockManagerMaster
23/07/29 09:48:01 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
23/07/29 09:48:01 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
23/07/29 09:48:01 INFO DiskBlockManager: Created local directory at C:\Users\Nebula\AppData\Local\Temp\blockmgr-6f861361-4d98-4372-b78a-2949682bd557
23/07/29 09:48:01 INFO MemoryStore: MemoryStore started with capacity 8.3 GB
23/07/29 09:48:01 INFO SparkEnv: Registering OutputCommitCoordinator
23/07/29 09:48:01 INFO Utils: Successfully started service 'SparkUI' on port 4040.
23/07/29 09:48:01 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://LAPTOP-PEA8R2PO:4040
23/07/29 09:48:01 INFO Executor: Starting executor ID driver on host localhost
23/07/29 09:48:01 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60826.
23/07/29 09:48:01 INFO NettyBlockTransferService: Server created on LAPTOP-PEA8R2PO:60826
23/07/29 09:48:01 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
23/07/29 09:48:01 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, LAPTOP-PEA8R2PO, 60826, None)
23/07/29 09:48:01 INFO BlockManagerMasterEndpoint: Registering block manager LAPTOP-PEA8R2PO:60826 with 8.3 GB RAM, BlockManagerId(driver, LAPTOP-PEA8R2PO, 60826, None)
23/07/29 09:48:01 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, LAPTOP-PEA8R2PO, 60826, None)
23/07/29 09:48:01 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, LAPTOP-PEA8R2PO, 60826, None)


Origin: blog.csdn.net/programmer589/article/details/131991436