Error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc.) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
The problem and its solution are described below.

During development, I wanted to convert a text file already stored on HDFS into a CSV-formatted file.
The source file format is as follows (the data here is fake data I made up myself). The columns correspond to id, name, no, sp, ep:

3303	Longshun	JD8	Chibi	Zhanjiang
5426	Cheng Fan	G58	Longyan	Miaoli
The target is the default CSV format:

id,name,no,sp,ep
1309,Xiang Jing,BKZ,Shaoguan,Hubei
3507,Ning Fengchen,KY7,Heyuan,Ziyang

The processing code is as follows:
import org.apache.spark.sql.{SaveMode, SparkSession}

def main(args: Array[String]): Unit = {
  val sparkSession = SparkSession.builder().appName("Spark shell").getOrCreate()
  // input file path
  val path = "hdfs://master:9000/TestData/aviation9"
  // save path
  val savePath = "hdfs://master:9000/TestData/aviation10/"

  val file = sparkSession.read.textFile(path)
  // process the data: split each line on tabs into a 5-tuple
  val rd = file.map(line => {
    val arr = line.split("\t")
    (arr(0), arr(1), arr(2), arr(3), arr(4))
  })
  // add a header to the DataFrame
  val res = rd.toDF("id", "name", "no", "sp", "ep")
  // save the file with the header row
  res.repartition(1).write.mode(SaveMode.Append).format("csv").option("header", true).save(savePath)
}
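(The split step itself can be checked outside Spark. Below is a minimal plain-Scala sketch of the same logic as the map above, run on one of the fake records; the object name SplitSketch is made up for illustration:)

```scala
object SplitSketch extends App {
  // one fake tab-separated record, matching the sample data above
  val line = "3303\tLongshun\tJD8\tChibi\tZhanjiang"
  // same logic as the map in the Spark job: split on tabs into a 5-tuple
  val arr = line.split("\t")
  val record = (arr(0), arr(1), arr(2), arr(3), arr(4))
  println(record) // (3303,Longshun,JD8,Chibi,Zhanjiang)
}
```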
Packaging and running this always fails with the error above:

Could not find an encoder for the type stored in the Dataset. Primitive types (Int, String, etc.) and product types (case classes) are supported by importing spark.implicits._ Support for serialization of other types will be added in a future release.
So we need to supply an encoder for the tuple type stored in the Dataset ourselves. (Importing sparkSession.implicits._ would also bring in encoders for tuples of primitive types; here the encoder is added explicitly.)

Solution: add the following line of code before the map that processes the data:
implicit val matchError = org.apache.spark.sql.Encoders.tuple(
  Encoders.STRING, Encoders.STRING, Encoders.STRING, Encoders.STRING, Encoders.STRING)
The complete code is then:
import org.apache.spark.sql.{Encoders, SaveMode, SparkSession}

def main(args: Array[String]): Unit = {
  val sparkSession = SparkSession.builder().appName("Spark shell").getOrCreate()
  // input file path
  val path = "hdfs://master:9000/TestData/aviation9"
  // save path
  val savePath = "hdfs://master:9000/TestData/aviation10/"

  val file = sparkSession.read.textFile(path)
  // explicit encoder for the 5-tuple of Strings produced below
  implicit val matchError = org.apache.spark.sql.Encoders.tuple(
    Encoders.STRING, Encoders.STRING, Encoders.STRING, Encoders.STRING, Encoders.STRING)
  // process the data: split each line on tabs into a 5-tuple
  val rd = file.map(line => {
    val arr = line.split("\t")
    (arr(0), arr(1), arr(2), arr(3), arr(4))
  })
  // add a header to the DataFrame
  val res = rd.toDF("id", "name", "no", "sp", "ep")
  // save the file with the header row
  res.repartition(1).write.mode(SaveMode.Append).format("csv").option("header", true).save(savePath)
}
For example, suppose the data now has an additional time column:

5426	Cheng Fan	G56	2013-12-24 17:26:23	Longyan	Miaoli
4413	TV7	2014-04-09 20:44:25	Beipiao	Kaiyuan
If the following code is used, the same error occurs again
implicit val matchError = org.apache.spark.sql.Encoders.tuple(
  Encoders.STRING, Encoders.STRING, Encoders.STRING, Encoders.STRING, Encoders.STRING)
This is because Encoders.tuple only supports tuples of up to 5 elements. So instead we convert each line of the Dataset into a Row, turn the Dataset into an RDD, and create the DataFrame from the data plus a schema that carries the header:
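A quick plain-Scala check of why the tuple-encoder route stops working here: the new field list has six columns, one more than Encoders.tuple can take (the fields string mirrors the one used in the code below):

```scala
object FieldCountSketch extends App {
  // same comma-separated header string as in the code below
  val fields = "id,name,no,time,sp,ep"
  val field_array = fields.split(",")
  println(field_array.length) // 6: beyond the 5-element limit of Encoders.tuple
}
```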
import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

def main(args: Array[String]): Unit = {
  val sparkSession = SparkSession.builder().appName("Spark shell").getOrCreate()
  val fields = "id,name,no,time,sp,ep"
  val path = "hdfs://master:9000/TestData/aviation9"
  val savePath = "hdfs://master:9000/TestData/aviation10/"

  val file: Dataset[String] = sparkSession.read.textFile(path)
  // a kryo encoder lets the Dataset hold Row values
  implicit val matchError = org.apache.spark.sql.Encoders.kryo[Row]
  // process each line into a Row
  val ds = file.map(x => {
    val arr = x.split("\t")
    Row.fromSeq(arr.toSeq)
  })
  // build the header (schema): every column is a nullable String
  val field_array = fields.split(",")
  val schema = StructType(field_array.map(fieldName => StructField(fieldName, StringType, true)))
  // create the DataFrame from the RDD of Rows plus the schema
  val df = sparkSession.createDataFrame(ds.rdd, schema)
  df.repartition(1).write.mode(SaveMode.Append).format("csv").option("header", true).save(savePath)
}