1. The External DataSource API was introduced in Spark 1.2. Developers can implement their own external data sources against this interface, covering formats such as Avro, CSV, JSON, Parquet, etc. Implementations come from two places (both are used through the same uniform reader interface, sketched after the list below):
(1) External data sources that ship with Spark
(2) Packages contributed by other developers at https://spark-packages.org/
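Whatever the source of the implementation, it is accessed the same way: pass the format name to spark.read.format(...) / df.write.format(...). A minimal sketch (the paths here are hypothetical):

package df

import org.apache.spark.sql.SparkSession

object ReadWriteDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("demo").master("local").getOrCreate()
    // Read via a format name registered with the DataSource API
    val df = spark.read.format("json").load("/path/to/input.json")
    // Write via another format name; the DataFrame API in between is format-agnostic
    df.write.format("parquet").save("/path/to/output")
    spark.stop()
  }
}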
Take Avro as an example: clicking its entry on the spark-packages.org homepage jumps to the GitHub project https://github.com/databricks/spark-avro, whose README describes the usage in detail.
Local spark-shell test:
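Following the README, the package can be pulled onto the classpath at launch with, e.g., spark-shell --packages com.databricks:spark-avro_2.11:4.0.0 (the version coordinate is an assumption; pick the one matching your Spark and Scala build). Inside the shell:

// Inside spark-shell; the .avro path is hypothetical
val avroDF = spark.read.format("com.databricks.spark.avro").load("/path/to/episodes.avro")
avroDF.printSchema()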
2. Spark External DataSource API exercise
package df

import org.apache.spark.sql.SparkSession

object ExternalSource {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("demo").master("local").getOrCreate()

    // 1. Read JSON
    val jsonDF = spark.read.format("json")
      .load("file:///data/spark-2.3.0-bin-hadoop2.7/examples/src/main/resources/employees.json")
    jsonDF.printSchema()

    // 2. Read Parquet
    val parquetDF = spark.read.format("parquet")
      .load("file:///data/spark-2.3.0-bin-hadoop2.7/examples/src/main/resources/users.parquet")
    parquetDF.printSchema()

    // 3. Read CSV
    val csvDF = spark.read.format("csv")
      .load("file:///data/spark-2.3.0-bin-hadoop2.7/examples/src/main/resources/people.csv")
    csvDF.printSchema()

    spark.stop()
  }
}
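Note that format("csv") with no options treats every column as a string named _c0, _c1, and so on. A common refinement uses the built-in CSV source options (the sep value below assumes the bundled people.csv is semicolon-delimited, as in the Spark 2.3 examples):

    val csvDF = spark.read.format("csv")
      .option("header", "true")      // use the first line as column names
      .option("inferSchema", "true") // sample the data to guess column types
      .option("sep", ";")            // assumed delimiter of the sample file
      .load("file:///data/spark-2.3.0-bin-hadoop2.7/examples/src/main/resources/people.csv")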