Spark - external data sources

1. The external Data Source API was introduced in Spark 1.2. Developers can implement the interface to plug in their own external data sources, such as Avro, CSV, JSON, Parquet, and so on (a minimal sketch of a custom source follows the list below). Implementations come from two places:


(1) External data sources that ship with Spark


(2) Data sources contributed by other developers at https://spark-packages.org/
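
The interface mentioned above lives in org.apache.spark.sql.sources: a source supplies a RelationProvider that creates relations, and a BaseRelation (typically mixed with TableScan) that exposes a schema and produces rows. Below is a minimal sketch of a hypothetical custom source that simply returns the numbers 0 until n as one column. DefaultSource is the class name Spark's lookup convention expects (format("df") resolves to the class df.DefaultSource); NumberRelation and the option name "to" are invented for illustration.

package df

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Entry point: spark.read.format("df") resolves to the class df.DefaultSource
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new NumberRelation(sqlContext, parameters.getOrElse("to", "10").toLong)
}

// A relation producing the numbers 0 until `to` as one non-nullable LongType column
class NumberRelation(val sqlContext: SQLContext, to: Long)
  extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(StructField("value", LongType, nullable = false) :: Nil)

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.range(0, to).map(Row(_))
}

With these two classes on the classpath, spark.read.format("df").option("to", "5").load().show() returns the rows 0 through 4.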


Take Avro as an example: clicking it on the spark-packages homepage jumps to the GitHub repository, https://github.com/databricks/spark-avro, whose README describes the usage in detail.

Local shell test
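
The idea, following the spark-avro README, is to launch spark-shell with the package on the classpath and read an .avro file. The version coordinates and the file path below are assumptions; use the build that matches your Spark and Scala versions:

$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-avro_2.11:4.0.0

scala> val avroDF = spark.read.format("com.databricks.spark.avro").load("file:///data/episodes.avro")
scala> avroDF.printSchema()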

2. Spark external Data Source API exercise

package df

import org.apache.spark.sql.SparkSession

object ExternalSource {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("demo").master("local").getOrCreate()

    // 1. Read JSON
    val jsonDF = spark.read.format("json").load("file:///data/spark-2.3.0-bin-hadoop2.7/examples/src/main/resources/employees.json")
    jsonDF.printSchema()

    // 2. Read Parquet
    val parquetDF = spark.read.format("parquet").load("file:///data/spark-2.3.0-bin-hadoop2.7/examples/src/main/resources/users.parquet")
    parquetDF.printSchema()

    // 3. Read CSV (people.csv in the Spark examples is semicolon-delimited
    // with a header row, so pass matching options to get named columns)
    val csvDF = spark.read.format("csv")
      .option("sep", ";")
      .option("header", "true")
      .load("file:///data/spark-2.3.0-bin-hadoop2.7/examples/src/main/resources/people.csv")
    csvDF.printSchema()

    spark.stop()
  }
}
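
The same API handles writes symmetrically: df.write.format(...).save(...) goes through the same pluggable mechanism. For example, inside main() before spark.stop(), the JSON data could be round-tripped to Parquet (the output path here is hypothetical):

    // Write the JSON DataFrame back out as Parquet via the Data Source API
    jsonDF.write.format("parquet").save("file:///tmp/employees_parquet")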
