Reading and Writing Hive Bucketed Tables with Spark

Original link: https://stackoverflow.com/questions/53476945/read-write-bucketed-tables-in-spark?r=SearchResults

I have a number of tables (each with roughly 100 million rows) that are stored as external Hive tables in Parquet format. The Spark job needs to join several of them on a single column, with almost no filtering. The join column has roughly two-thirds fewer unique values than there are rows.

I can see shuffles happening on the join key, and I have been trying to use bucketing/partitioning to improve join performance. My thinking is that if Spark can be made aware that each of these tables is bucketed by the same column, it can load the DataFrames and join them without shuffling. I have tried Hive bucketing, but the shuffles don't go away. (I found out later that, according to Spark's documentation, Hive bucketing is not supported, at least as of Spark 2.3.0.) Can I use Spark's bucketing feature to do this? If so, would I have to disable Hive support and read the files directly? Or could I rewrite the tables once using Spark's bucketing scheme and still be able to read them as Hive tables?
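Spark's own bucketing (bucketBy plus saveAsTable) can avoid the exchange on both sides of a sort-merge join, provided both tables are bucketed on the join column with the same number of buckets. Below is a minimal sketch of that approach; the table names (raw_customers, raw_orders, customers_bucketed, orders_bucketed) and the bucket count of 200 are illustrative assumptions, not taken from the question.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucketed-join-sketch")
  .enableHiveSupport()   // for metastore access; Spark's own bucketing metadata works either way
  .getOrCreate()

// Write both sides bucketed (and sorted) on the join key, with the same bucket count.
// bucketBy only works together with saveAsTable, not with plain save()/parquet().
spark.table("raw_customers").write
  .format("parquet")
  .bucketBy(200, "customer_key")
  .sortBy("customer_key")
  .mode("overwrite")
  .saveAsTable("customers_bucketed")

spark.table("raw_orders").write
  .format("parquet")
  .bucketBy(200, "customer_key")
  .sortBy("customer_key")
  .mode("overwrite")
  .saveAsTable("orders_bucketed")

// Join on the bucketing column. With matching bucket counts Spark can plan a
// sort-merge join with no Exchange (shuffle) on either side.
spark.table("customers_bucketed")
  .join(spark.table("orders_bucketed"), "customer_key")
  .explain()

The data files stay plain Parquet, so Hive can generally still read such tables; but Hive does not interpret Spark's bucketing metadata (the two systems use different bucketing hash functions), so from Hive's point of view the tables are unbucketed.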

EDIT: For writing out the Hive bucketed tables I was using something like:

customerDF.write
  .option("path", "/some/path")
  .mode("overwrite")
  .format("parquet")
  .bucketBy(200, "customer_key")
  .sortBy("customer_key")
  .saveAsTable("table_name")

// Note: bucketBy is only supported with saveAsTable; writing via .parquet(path)
// or .save() fails, so "gift_table" below is an illustrative placeholder name.
giftDF.write
  .partitionBy("plat", "date")
  .mode(SaveMode.Append)
  .bucketBy(20, "room_id")
  .option("path", path)
  .format("parquet")
  .saveAsTable("gift_table")
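
To check whether the bucketing metadata was actually recorded, and whether the join then avoids a shuffle, something like the following can be used. This is only a sketch: "table_name" comes from the first snippet above, and the second DataFrame is a stand-in for any other table bucketed the same way on customer_key.

// Verify the bucketing metadata written by saveAsTable.
spark.sql("DESCRIBE EXTENDED table_name").show(100, truncate = false)
// Expect entries such as "Num Buckets" (200) and "Bucket Columns" ([`customer_key`]).

// Disable broadcast joins so the bucketed sort-merge join is what gets planned.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val left  = spark.table("table_name")
val right = spark.table("table_name")   // stand-in for a second table bucketed on customer_key
left.join(right, "customer_key").explain()
// With spark.sql.sources.bucketing.enabled at its default (true), the physical plan
// should show a SortMergeJoin with no Exchange on either input.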

Reposted from blog.csdn.net/s294878304/article/details/102678658