I am getting an "sql.AnalysisException: cannot resolve column_name" error when performing a join with the DataFrame API. The column name exists, and the same join works fine when expressed as SQL through HiveContext. Here is the code:
DataFrame df = df1
    .join(df2, df1.col("MERCHANT").equalTo(df2.col("MERCHANT")))
    .select(df2.col("MERCH_ID"), df1.col("MERCHANT"));
I have also tried the "alias" function, but I get the same "Can't resolve column name." problem, with the following exception:
resolved attribute(s) MERCH_ID#738 missing from MERCHANT#737,MERCHANT#928,MERCH_ID#929,MER_LOC#930 in operator !Project [MERCH_ID#738,MERCHANT#737];
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
Spark Version: 1.6
The problem occurs in both the Scala and Java APIs. In Scala the issue was resolved using 'alias', but in Java I am still getting the error.
From my experience it is best to avoid DataFrame.col and DataFrame.apply unless they are necessary for disambiguation, and even then aliasing is preferable. Try using independent Column objects instead:
import org.apache.spark.sql.functions;

DataFrame df = df1.alias("df1")
    .join(df2.alias("df2"), functions.col("df1.MERCHANT").equalTo(functions.col("df2.MERCHANT")))
    .select(functions.col("df2.MERCH_ID"), functions.col("df2.MERCHANT"));
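Since the join key has the same name on both sides, another option that sidesteps the duplicate-attribute resolution problem entirely is the usingColumn variant of join (available since Spark 1.4), which keeps a single MERCHANT column in the output. A minimal sketch, assuming df1 and df2 are the same DataFrames as in the question:

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.functions;

// Sketch: equi-join on the shared column name instead of an explicit
// equality expression. Spark resolves "MERCHANT" on both sides and
// emits only one MERCHANT column, so the select below is unambiguous
// without any aliases or per-DataFrame col() calls.
DataFrame df = df1
    .join(df2, "MERCHANT")
    .select(functions.col("MERCH_ID"),   // from df2
            functions.col("MERCHANT"));  // the de-duplicated join key
```

This only works for equi-joins on identically named columns; for anything else, stick with the alias approach above.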