Notes on a Spark SQL union type-mismatch error

When merging two DataFrames with union in Spark SQL, I kept getting a type-mismatch error, even though, on inspection, the column names and column types of the two DataFrames were exactly the same. The error is reproduced below:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DateType

object SqlTest {

  def main(args: Array[String]): Unit = {
    // Set the log output level
    Logger.getLogger("org").setLevel(Level.ERROR)
    // Initialize the programming entry point
    val session: SparkSession = SparkSession.builder.appName("name").master("local[2]").getOrCreate()
    import session.implicits._
    // Create the first DataFrame: columns (name, date, grade)
    var df1 = List[(String, String, Integer)](
      ("李奇峰", "2019-5-12", 88),
      ("李奇峰", "2019-5-12", 81),
      ("李奇峰", "2019-5-12", 82),
      ("李奇峰", "2019-5-12", 86)
    ).toDF("name", "date", "grade")
    println("df1 output:")
    df1.show()
    // Create the second DataFrame: columns (name, grade, date)
    var df2 = List[(String, Integer, String)](
      ("李晓峰", 88, "2019-5-12"),
      ("李晓峰", 81, "2019-5-12"),
      ("李晓峰", 82, "2019-5-12"),
      ("李晓峰", 86, "2019-5-12")
    ).toDF("name", "grade", "date")
    println("df2 output:")
    df2.show()
    // Cast the date column of both DataFrames to DateType
    df1 = df1.withColumn("date", $"date".cast(DateType))
    df2 = df2.withColumn("date", $"date".cast(DateType))
    // Perform the union
    val result = df1.union(df2)
    println("Union output:")
    result.show()
  }
}

Executing the method above produces the following error:

df1 output:
+------+---------+-----+
|  name|     date|grade|
+------+---------+-----+
|李奇峰|2019-5-12|   88|
|李奇峰|2019-5-12|   81|
|李奇峰|2019-5-12|   82|
|李奇峰|2019-5-12|   86|
+------+---------+-----+

df2 output:
+------+-----+---------+
|  name|grade|     date|
+------+-----+---------+
|李晓峰|   88|2019-5-12|
|李晓峰|   81|2019-5-12|
|李晓峰|   82|2019-5-12|
|李晓峰|   86|2019-5-12|
+------+-----+---------+

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Exception in thread "main" org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. int <> date at the second column of the second table;;
'Union
:- Project [name#7, cast(date#8 as date) AS date#26, grade#9]
:  +- Project [_1#3 AS name#7, _2#4 AS date#8, _3#5 AS grade#9]
:     +- LocalRelation [_1#3, _2#4, _3#5]
+- Project [name#20, course#21, cast(date#22 as date) AS date#30]
   +- Project [_1#16 AS name#20, _2#17 AS course#21, _3#18 AS date#22]
      +- LocalRelation [_1#16, _2#17, _3#18]

	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$12$$anonfun$apply$13.apply(CheckAnalysis.scala:293)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$12$$anonfun$apply$13.apply(CheckAnalysis.scala:290)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$12.apply(CheckAnalysis.scala:290)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$12.apply(CheckAnalysis.scala:279)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:279)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:85)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:95)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:108)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
	at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
	at org.apache.spark.sql.Dataset.withSetOperator(Dataset.scala:3419)
	at org.apache.spark.sql.Dataset.union(Dataset.scala:1857)
	at SqlTest$.main(SqlTest.scala:34)
	at SqlTest.main(SqlTest.scala)

The error message says the two DataFrames have incompatible column types, yet inspection shows that their column names and types are identical. With no other leads, I looked up union on the Spark official website and found two details:

1. The union operation is not a set union: it does not remove duplicate rows.

2. The union function does not match columns by name; it matches them by position. That is, the column names of the two DataFrames may differ, and the columns at corresponding positions are merged together.
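Both details have direct remedies. A hedged sketch (assuming Spark 2.3 or later, where unionByName is available, and an existing SparkSession named session as in the code above):

```scala
// Sketch: assumes Spark 2.3+ and a SparkSession named `session` already in scope
import session.implicits._

val a = List(("李奇峰", "2019-5-12", 88)).toDF("name", "date", "grade")
val b = List(("李晓峰", 86, "2019-5-12")).toDF("name", "grade", "date")

// unionByName resolves columns by name, so the differing column order is harmless
val merged = a.unionByName(b)

// union keeps duplicate rows; chain distinct() for set-union semantics
val deduped = a.union(a).distinct()
```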

Checking the code against the second detail shows the problem: although the two DataFrames have the same column names and corresponding types, the columns are in different positions, so the columns of one DataFrame must be reordered. The modified code is as follows:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DateType

object SqlTest {

  def main(args: Array[String]): Unit = {
    // Set the log output level
    Logger.getLogger("org").setLevel(Level.ERROR)
    // Initialize the programming entry point
    val session: SparkSession = SparkSession.builder.appName("name").master("local[2]").getOrCreate()
    import session.implicits._
    // Create the first DataFrame: columns (name, date, grade)
    var df1 = List[(String, String, Integer)](
      ("李奇峰", "2019-5-12", 88),
      ("李奇峰", "2019-5-12", 81),
      ("李奇峰", "2019-5-12", 82),
      ("李奇峰", "2019-5-12", 86)
    ).toDF("name", "date", "grade")
    println("df1 output:")
    df1.show()
    // Create the second DataFrame: columns (name, grade, date)
    var df2 = List[(String, Integer, String)](
      ("李晓峰", 88, "2019-5-12"),
      ("李晓峰", 81, "2019-5-12"),
      ("李晓峰", 82, "2019-5-12"),
      ("李晓峰", 86, "2019-5-12")
    ).toDF("name", "grade", "date")
    println("df2 output:")
    df2.show()
    // Cast the date column of both DataFrames to DateType
    df1 = df1.withColumn("date", $"date".cast(DateType))
    df2 = df2.withColumn("date", $"date".cast(DateType))
    // Reorder df2's columns to match df1's column positions
    df2 = df2.select("name", "date", "grade")
    // Perform the union
    println("Union output:")
    df1.union(df2).show()
  }
}

The only change from the previous version is the single added line that reorders df2's columns:
df2 = df2.select("name","date","grade")
The results are as follows:

df1 output:
+------+---------+-----+
|  name|     date|grade|
+------+---------+-----+
|李奇峰|2019-5-12|   88|
|李奇峰|2019-5-12|   81|
|李奇峰|2019-5-12|   82|
|李奇峰|2019-5-12|   86|
+------+---------+-----+

df2 output:
+------+-----+---------+
|  name|grade|     date|
+------+-----+---------+
|李晓峰|   88|2019-5-12|
|李晓峰|   81|2019-5-12|
|李晓峰|   82|2019-5-12|
|李晓峰|   86|2019-5-12|
+------+-----+---------+

Union output:
+------+----------+-----+
|  name|      date|grade|
+------+----------+-----+
|李奇峰|2019-05-12|   88|
|李奇峰|2019-05-12|   81|
|李奇峰|2019-05-12|   82|
|李奇峰|2019-05-12|   86|
|李晓峰|2019-05-12|   88|
|李晓峰|2019-05-12|   81|
|李晓峰|2019-05-12|   82|
|李晓峰|2019-05-12|   86|
+------+----------+-----+
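When the DataFrames have many columns, hard-coding the column list in select does not scale. A more generic sketch (assuming df1 and df2 contain exactly the same set of column names) reorders one side using the other's columns array:

```scala
import org.apache.spark.sql.functions.col

// Align df2's column order with df1's before the positional union.
// Assumes df1 and df2 have identical column-name sets.
val aligned = df2.select(df1.columns.map(col): _*)
val result  = df1.union(aligned)
```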

Origin: blog.csdn.net/mrliqifeng/article/details/90598020