Error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc.) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
The problem and its solution are described below.

During development, I wanted to convert a text file already stored on HDFS into a CSV-formatted file.
The source file format is as follows (the data here is fake data I made up myself). The columns correspond to id, name, no, sp, ep:

3303	Longshun	JD8	Chibi	Zhanjiang
5426	Cheng Fan	G58	Longyan	Miaoli
The target is the default CSV format:

id,name,no,sp,ep
1309,Xiang Jing,BKZ,Shaoguan,Hubei
3507,Ning Fengchen,KY7,Heyuan,Ziyang

The processing code is as follows:
import org.apache.spark.sql.{SaveMode, SparkSession}

def main(args: Array[String]): Unit = {
  val sparkSession = SparkSession.builder().appName("Spark shell").getOrCreate()
  // input file path
  val path = "hdfs://master:9000/TestData/aviation9"
  // save path
  val savePath = "hdfs://master:9000/TestData/aviation10/"

  val file = sparkSession.read.textFile(path)
  // process the data: split each line on tabs into a 5-tuple
  val rd = file.map(line => {
    val arr = line.split("\t")
    (arr(0), arr(1), arr(2), arr(3), arr(4))
  })
  // add a header to the DataFrame
  val res = rd.toDF("id", "name", "no", "sp", "ep")
  // save the file with the header row
  res.repartition(1).write.mode(SaveMode.Append).format("csv").option("header", true).save(savePath)
}
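(The split step itself can be checked outside Spark. Below is a minimal plain-Scala sketch of the same logic as the map above, run on one of the fake records; the object name SplitSketch is made up for illustration:)

```scala
object SplitSketch extends App {
  // one fake tab-separated record, matching the sample data above
  val line = "3303\tLongshun\tJD8\tChibi\tZhanjiang"
  // same logic as the map in the Spark job: split on tabs into a 5-tuple
  val arr = line.split("\t")
  val record = (arr(0), arr(1), arr(2), arr(3), arr(4))
  println(record) // (3303,Longshun,JD8,Chibi,Zhanjiang)
}
```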
Packaging and running this always fails with the error above:

Could not find an encoder for the type stored in the Dataset. Primitive types (Int, String, etc.) and product types (case classes) are supported by importing spark.implicits._ Support for serialization of other types will be added in a future release.
So we need to supply an encoder for the tuple type stored in the Dataset ourselves. (Importing sparkSession.implicits._ would also bring in encoders for tuples of primitive types; here the encoder is added explicitly.)

Solution: add the following line of code before the map that processes the data:
implicit val matchError = org.apache.spark.sql.Encoders.tuple(
  Encoders.STRING, Encoders.STRING, Encoders.STRING, Encoders.STRING, Encoders.STRING)
The complete code is then:
import org.apache.spark.sql.{Encoders, SaveMode, SparkSession}

def main(args: Array[String]): Unit = {
  val sparkSession = SparkSession.builder().appName("Spark shell").getOrCreate()
  // input file path
  val path = "hdfs://master:9000/TestData/aviation9"
  // save path
  val savePath = "hdfs://master:9000/TestData/aviation10/"

  val file = sparkSession.read.textFile(path)
  // explicit encoder for the 5-tuple of Strings produced below
  implicit val matchError = org.apache.spark.sql.Encoders.tuple(
    Encoders.STRING, Encoders.STRING, Encoders.STRING, Encoders.STRING, Encoders.STRING)
  // process the data: split each line on tabs into a 5-tuple
  val rd = file.map(line => {
    val arr = line.split("\t")
    (arr(0), arr(1), arr(2), arr(3), arr(4))
  })
  // add a header to the DataFrame
  val res = rd.toDF("id", "name", "no", "sp", "ep")
  // save the file with the header row
  res.repartition(1).write.mode(SaveMode.Append).format("csv").option("header", true).save(savePath)
}
For example, suppose the data now has an additional time column:

5426	Cheng Fan	G56	2013-12-24 17:26:23	Longyan	Miaoli
4413	TV7	2014-04-09 20:44:25	Beipiao	Kaiyuan
If the following code is used, the same error occurs again
implicit val matchError = org.apache.spark.sql.Encoders.tuple(
  Encoders.STRING, Encoders.STRING, Encoders.STRING, Encoders.STRING, Encoders.STRING)
This is because Encoders.tuple only supports tuples of up to 5 elements. So instead we convert each line of the Dataset into a Row, turn the Dataset into an RDD, and create the DataFrame from the data plus a schema that carries the header:
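A quick plain-Scala check of why the tuple-encoder route stops working here: the new field list has six columns, one more than Encoders.tuple can take (the fields string mirrors the one used in the code below):

```scala
object FieldCountSketch extends App {
  // same comma-separated header string as in the code below
  val fields = "id,name,no,time,sp,ep"
  val field_array = fields.split(",")
  println(field_array.length) // 6: beyond the 5-element limit of Encoders.tuple
}
```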
import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

def main(args: Array[String]): Unit = {
  val sparkSession = SparkSession.builder().appName("Spark shell").getOrCreate()
  val fields = "id,name,no,time,sp,ep"
  val path = "hdfs://master:9000/TestData/aviation9"
  val savePath = "hdfs://master:9000/TestData/aviation10/"

  val file: Dataset[String] = sparkSession.read.textFile(path)
  // a kryo encoder lets the Dataset hold Row values
  implicit val matchError = org.apache.spark.sql.Encoders.kryo[Row]
  // process each line into a Row
  val ds = file.map(x => {
    val arr = x.split("\t")
    Row.fromSeq(arr.toSeq)
  })
  // build the header (schema): every column is a nullable String
  val field_array = fields.split(",")
  val schema = StructType(field_array.map(fieldName => StructField(fieldName, StringType, true)))
  // create the DataFrame from the RDD of Rows plus the schema
  val df = sparkSession.createDataFrame(ds.rdd, schema)
  df.repartition(1).write.mode(SaveMode.Append).format("csv").option("header", true).save(savePath)
}