Spark SQL: column names from a subquery cannot be resolved in the parent query

Scenario: a query that joins and unions several tables fails with the following error:

Error in query: Resolved attribute(s) complex_flag_code#6549,quantity#6551L,pay_time_date#6547,sales_price#6553,oms_code#6548,retail_price#6550,promotion_sku_code#6552 missing from retail_price#6178,source_platform_code#6384,promotion_policy_code#6402,pay_amount#6329,sku_code#6322,complex_flag_code#6177,order_id#6530,sales_price#6181,promotion_sku_code#6180,pay_time_date#6175,sku_type#6331,quantity#6179L,oms_code#6176,is_gift#6333L in operator !Project [order_id#6530, pay_time_date#6547, cast(oms_code#6548 as string) AS oms_code#6518, sku_code#6322, cast(complex_flag_code#6549 as string) AS complex_flag_code#6519, retail_price#6550, pay_amount#6329, cast(quantity#6551L as decimal(22,2)) AS quantity#6520, sku_type#6331, promotion_policy_code#6402, source_platform_code#6384, is_gift#6333L, promotion_sku_code#6552, sales_price#6553]. Attribute(s) with the same name appear in the operation: complex_flag_code,quantity,pay_time_date,sales_price,oms_code,retail_price,promotion_sku_code. Please check if the right attribute(s) are used.;;

By commenting out one part of the code at a time and re-running, I narrowed the problem down to the following snippet:

...
...
(
    SELECT
      comp_sku.order_id 
      ,comp_sku.quantity 
      ,comp_sku.sales_price 
      ,comp_sku.promotion_sku_code
      ,sales_tmp.order_id 
    FROM
    (
        SELECT
          order_id
          ,promotion_sku_code
          ,quantity
          ,sales_price
        FROM all_detail
        WHERE is_gift = 1 AND promotion_sku_code IS NOT NULL 
    ) comp_sku
    
    LEFT JOIN
    (
        SELECT
          order_id
          ,sku_code
        FROM sales
    ) sales_tmp
    ON comp_sku.order_id = sales_tmp.order_id AND comp_sku.promotion_sku_code = sales_tmp.sku_code 
)
...
...

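Judging from the error, my guess was this: when the code above is used as a subquery and its result is passed to the parent query, the parent query fails to resolve the columns in the subquery's result.

This reminded me of something I had once read on the Hive website: when using join or union, you must specify column aliases, otherwise data can be lost.

My guess was that columns such as comp_sku.order_id end up in the result under default names like column1, so the parent query cannot find order_id when it looks for it.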

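So I changed the code to give every selected column an explicit alias, with sales_tmp.order_id renamed to s_order_id so that it no longer shares a name with comp_sku.order_id: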

...
...
(
    SELECT
      comp_sku.order_id AS order_id
      ,comp_sku.quantity AS quantity
      ,comp_sku.sales_price AS sales_price
      ,comp_sku.promotion_sku_code AS promotion_sku_code
      ,sales_tmp.order_id AS s_order_id
    FROM
    (
        SELECT
          order_id
          ,promotion_sku_code
          ,quantity
          ,sales_price
        FROM all_detail
        WHERE is_gift = 1 AND promotion_sku_code IS NOT NULL 
    ) comp_sku
    
    LEFT JOIN
    (
        SELECT
          order_id
          ,sku_code
        FROM sales
    ) sales_tmp
    ON comp_sku.order_id = sales_tmp.order_id AND comp_sku.promotion_sku_code = sales_tmp.sku_code 
)
...
...

With this change, the error no longer occurred.
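
As a rough illustration of how a parent query could then consume the rewritten subquery, here is a minimal sketch. The original post elides the surrounding query, so the subquery alias gift_match and the outer SELECT list are hypothetical. Only the inner subquery is taken from the code above.

```sql
-- Hypothetical parent query: `gift_match` and the outer column list are
-- assumptions for illustration; the inner subquery is the fixed version.
SELECT
  gift_match.order_id
  ,gift_match.promotion_sku_code
  ,gift_match.quantity
  ,gift_match.sales_price
  ,gift_match.s_order_id  -- NULL when the LEFT JOIN found no matching row in sales
FROM
(
    SELECT
      comp_sku.order_id AS order_id
      ,comp_sku.quantity AS quantity
      ,comp_sku.sales_price AS sales_price
      ,comp_sku.promotion_sku_code AS promotion_sku_code
      ,sales_tmp.order_id AS s_order_id
    FROM
    (
        SELECT
          order_id
          ,promotion_sku_code
          ,quantity
          ,sales_price
        FROM all_detail
        WHERE is_gift = 1 AND promotion_sku_code IS NOT NULL
    ) comp_sku
    LEFT JOIN
    (
        SELECT
          order_id
          ,sku_code
        FROM sales
    ) sales_tmp
    ON comp_sku.order_id = sales_tmp.order_id AND comp_sku.promotion_sku_code = sales_tmp.sku_code
) gift_match
```

Because every output column of the subquery now has a unique name, each reference in the outer SELECT list points to exactly one column.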

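Summary:

  1. When using join or union, get into the habit of defining column aliases.
  2. The first point applies equally when programming with DataFrames (df).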
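
For the union case, the first point might look like the sketch below. Spark SQL matches the branches of a UNION by position and takes the result column names from the first branch, so giving both branches the same explicit aliases (and casts) keeps the output schema unambiguous. Combining all_detail and sales in this way, and the string type of the SKU columns, are assumptions made for illustration; they are not taken from the original query.

```sql
-- Illustrative only: the choice of tables to union and the column types
-- are assumptions, not part of the original query.
SELECT
  order_id AS order_id
  ,CAST(promotion_sku_code AS string) AS sku_code
FROM all_detail
WHERE is_gift = 1 AND promotion_sku_code IS NOT NULL

UNION ALL

SELECT
  order_id AS order_id
  ,CAST(sku_code AS string) AS sku_code
FROM sales
```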


Reposted from blog.csdn.net/x950913/article/details/106810376