DataFrame fails to find the column name in a join condition

abhijit nag :

I am getting an "sql.AnalysisException: cannot resolve column_name" error when performing a join with the DataFrame API, even though the column exists and the same join works fine when written as SQL through HiveContext. Here is the code:

DataFrame df = df1
  .join(df2, df1.col("MERCHANT").equalTo(df2.col("MERCHANT")))
  .select(df2.col("MERCH_ID"), df1.col("MERCHANT"));
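For comparison, a minimal sketch of the SQL route that does work for me (assuming an existing HiveContext named hiveContext; the temp table names are made up for illustration):

// Register the DataFrames as temp tables so they can be queried via SQL
df1.registerTempTable("t1");
df2.registerTempTable("t2");

// Same join expressed as SQL through HiveContext; this resolves fine
DataFrame viaSql = hiveContext.sql(
  "SELECT t2.MERCH_ID, t1.MERCHANT FROM t1 JOIN t2 ON t1.MERCHANT = t2.MERCHANT");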

I have also tried the "alias" function, but I get the same "Can't resolve column name" problem, with the following exception thrown:

resolved attribute(s) MERCH_ID#738 missing from MERCHANT#737,MERCHANT#928,MERCH_ID#929,MER_LOC#930 in operator !Project [MERCH_ID#738,MERCHANT#737];

at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)

Spark Version: 1.6

The problem occurs in both the Scala and Java Spark APIs.

In Scala the issue was resolved using 'alias', but in Java I am still getting the error.

Alper t. Turker :

In my experience it is best to avoid DataFrame.col and DataFrame.apply unless they are necessary for disambiguation (and even then, aliasing is better). Try using independent Column objects instead:

import org.apache.spark.sql.functions;

DataFrame df = df1.alias("df1")
  .join(df2.alias("df2"), functions.col("df1.MERCHANT").equalTo(functions.col("df2.MERCHANT")))
  .select(functions.col("df2.MERCH_ID"), functions.col("df2.MERCHANT"));
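If all you need is a plain equi-join on MERCHANT, another option worth trying (a sketch, assuming the usingColumn variant of join, available since Spark 1.4) is to join on the column name itself; the result then contains a single MERCHANT column, so there is nothing left to disambiguate:

// Equi-join on the shared column name; the output keeps one MERCHANT column
// plus the remaining columns of both sides, including MERCH_ID from df2
DataFrame df = df1.join(df2, "MERCHANT")
  .select("MERCH_ID", "MERCHANT");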
