Spark SQL: Load and Save Operations (Spark Study Notes, Part One)

1. Basic load and save operations

For Spark SQL DataFrames, there are common load and save operations that work regardless of which data source the DataFrame was created from.

The load operation loads data and creates a DataFrame;

The save operation saves the data in a DataFrame out to a file.

 

Scala implementation:

package **.tag.test

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object GenericLoadSave {

  def main(args: Array[String]) {

    val conf = new SparkConf()
      .setAppName("GenericLoadSave")

    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // load() reads from the default data source (parquet) into a DataFrame
    val usersDF = sqlContext.read.load("hdfs://ns1/***/users.parquet")

    // save() writes the whole DataFrame back out in the default (parquet) format
    usersDF.write.save("hdfs://ns1/home/***/nameAndFavoriteColors_scala")

    // select specific columns first, then save only those columns
    usersDF.select("name", "favorite_color").write.save("hdfs://ns1/***/nameAndFavoriteColors_scala")
  }
}

 

2. Manually specifying the data source type

Spark SQL has several built-in data source types, such as json, parquet, and jdbc. By specifying the format explicitly, you can convert data between these source types.

val df = sqlContext.read.format("json").load("people.json")

df.select("name","age").write.format("parquet").save("nameAndAges.parquet")

 

3. SaveMode

Spark SQL provides different save modes for save operations, which control what happens when data already exists at the target location.

SaveMode.ErrorIfExists (default): if data already exists at the target location, an exception is thrown.

SaveMode.Append: if data already exists at the target location, the new data is appended to it.

SaveMode.Overwrite: if data already exists at the target location, it is overwritten.

SaveMode.Ignore: if data already exists at the target location, no operation is performed and the save is silently skipped.
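As a sketch of how these modes are used, a SaveMode can be passed via mode() on the DataFrameWriter before save(). The input file and output path below are illustrative, not from the original example:

```scala
import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.spark.{SparkConf, SparkContext}

object SaveModeExample {

  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("SaveModeExample")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // hypothetical input file for illustration
    val df = sqlContext.read.format("json").load("people.json")

    // Append: add these rows to any data already at the target location
    df.write.mode(SaveMode.Append).format("parquet").save("people_parquet")

    // Overwrite: replace whatever already exists at the target location
    df.write.mode(SaveMode.Overwrite).format("parquet").save("people_parquet")

    // Ignore: do nothing if the target location already contains data
    df.write.mode(SaveMode.Ignore).format("parquet").save("people_parquet")
  }
}
```

Note that without an explicit mode() call, the writer behaves as SaveMode.ErrorIfExists, so running the same save twice against the same path throws an exception.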

 


 

Origin blog.csdn.net/limiaoiao/article/details/106527536