Solving a small problem with a Spark SQL temporary table



Recently, while using Spark to handle a business scenario, I ran into a small problem. In the Scala code I used Spark SQL to query a Hive table and then filter out the rows I needed based on a batch of ids. On its own this is a very simple requirement; the following pseudo-SQL would do:
````
select * from table where id in (id1, id2, id3, id4, ..., idn)
````



But the problem is that there are a lot of id conditions, roughly tens of thousands, and an IN clause of that size is bound to cause trouble; articles online report that Hive throws an error once an IN query exceeds about 3,000 values.

How do we solve this?

There are two main solutions:


(1) Execute in batches: split the tens of thousands of ids into groups of 3,000, run one query per group, and finally merge all the query results (a rough sketch follows this list).


(2) Use a join: turn the tens of thousands of ids into a table, then join the two tables to obtain the result in a single query.
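
For completeness, here is a rough sketch of what the batching option could look like (not the approach used in this article). It assumes a SparkSession named `spark` like the one built in the demo further down; `hive_table`, `hive_id`, and the batch size of 3,000 are only illustrative:
````
// Sketch only: query the ids in chunks of 3,000 and union the partial results.
// Assumes an existing SparkSession named `spark` (see the demo below).
val ids = "1,2,3,4,5" // simulated id list
val allIds: Seq[String] = ids.split(",").toSeq

val partial = allIds.grouped(3000).map { batch =>
  val inList = batch.mkString("'", "','", "'") // 'id1','id2',...
  spark.sql(s"select * from hive_table where hive_id in ($inList)")
}

// Merge the per-batch results; any sum/distinct has to be redone on the merged set.
val merged = partial.reduce(_ union _)
````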




The second solution is preferred here because it is more flexible and easier to extend. Try not to split the data set: once it is split, the client has to do extra work to merge the result sets, such as re-running a sum or distinct. With the first approach, the final combined result set would need another sum or distinct pass.


Let's see how to use the second solution:


Since our id list is dynamic and may differ for every task, the way to apply the second approach is to turn the ids into a temporary table held in memory. Because it never needs to be persisted to disk, the table is destroyed automatically when the Spark application stops.
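
As a side note, a temporary view registered this way is scoped to the SparkSession that creates it and lives only in memory. Here is a minimal sketch of that lifecycle (the view name `ids_tmp` and the local master are made up for illustration):
````
import org.apache.spark.sql.SparkSession

object TempViewScopeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("temp view scope demo")
      .master("local[*]") // local mode, just for this sketch
      .getOrCreate()
    import spark.implicits._

    // The view lives only in the session's memory; nothing is persisted to disk.
    Seq("1", "2", "3").toDF("id").createOrReplaceTempView("ids_tmp")

    println(spark.catalog.tableExists("ids_tmp")) // true while the session is alive

    spark.stop() // the view is destroyed together with the session
  }
}
````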

Using a temporary table in Spark is very simple. We just put the id list into an RDD, convert it to a DataFrame, and register it as a temporary view; after that we can join it with any existing table in the Hive database. A demo is shown below:
````
import org.apache.spark.sql.SparkSession


object SparkSQLJoinDemo {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder()
      .appName("spark sql demo")
      .enableHiveSupport()
      .getOrCreate()

    import spark.implicits._
    import spark.sql

    sql("use hivedb") // Switch to the Hive database

    val ids = "1,2,3,4,5" // Simulated id list

    val data = ids.split(",").toSeq // Convert to a Seq

    val school_table = spark.sparkContext.makeRDD(data).toDF("id") // Specify the column name

    school_table.createOrReplaceTempView("temp_table") // Register an in-memory temporary view

    // It is assumed here that hive_table is an existing table in Hive
    val xr = sql("select t.* from hive_table t join temp_table p on t.hive_id = p.id")

    println("Row count: " + xr.count()) // Print the number of matching rows

    spark.close() // Shut down the session
  }

}
````
The `ids` string in the code above is the data we need to turn into an in-memory table. We split it into a Seq, build an RDD from it, and then convert the RDD into a DataFrame. Note that to use `toDF` you must add `import spark.implicits._`, which enables the implicit conversion to a DataFrame. While converting, we give the data the column name `id`; if there are multiple columns, we can simply pass additional comma-separated column names. Finally we register the DataFrame as an in-memory temporary view, and in the following statement we can join the existing Hive table against it directly. At the end we print the number of rows produced by the join to verify that the program runs correctly.
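
To make the multi-column case concrete, here is a small sketch that continues inside the demo's main method (the tuple values and the view name `temp_table2` are made up for illustration):
````
// Each tuple becomes one row; toDF takes one name per column.
val multi = Seq(("1", "zhangsan"), ("2", "lisi"))
  .toDF("id", "name") // two columns: id and name

multi.createOrReplaceTempView("temp_table2")
sql("select * from temp_table2").show()
````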

If you have any questions, you can scan the QR code and follow the WeChat public account "I am the siege engineer" (woshigcs) and leave a message in the background for consultation. Technical debt must not be accumulated, and neither must health debt. On the road of seeking the Way, I walk with you.
