Spark SQL on Hive: note 1

Spark SQL on Hive is very convenient: by reading the shared Hive metastore, we can use Spark SQL directly against Hive databases and tables for faster OLAP analysis.

If you want Spark to integrate Hive SQL directly, it is best to compile the Spark source code yourself:

Switch the Scala version to 2.11:
dev/change-scala-version.sh 2.11

Compile with Hive support:
mvn -Pyarn -Phive -Phive-thriftserver -Phadoop-2.7 -Dhadoop.version=2.7.3 -Dscala-2.11 -DskipTests clean package
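For Spark SQL to find the existing Hive metastore, Hive's hive-site.xml also needs to be visible to Spark; the usual way is to copy it into Spark's conf directory (the paths below are examples, adjust them to your installation):

cp $HIVE_HOME/conf/hive-site.xml spark/conf/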

Note that spark-sql can be used directly on Linux, just like the hive command: it opens an interactive terminal for ad hoc queries. To start the spark-sql interactive terminal in yarn mode:

spark/bin/spark-sql --master yarn
This note uses Spark 2.0.2. After entering the interactive terminal you can run any query or analysis you like. However, the example in this note is not terminal-based spark-sql analysis; it uses Spark SQL on Hive from Scala. Driving Spark SQL on Hive from a programming language gives much more flexibility and lets you do more, such as storing the analysis results in MySQL, HBase, or Redis, or persisting intermediate data externally during the analysis.
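For example, once inside the spark-sql shell, ordinary HiveQL works as-is (the database and table here are the ones used in the example further below):

spark-sql> use db;
spark-sql> select q_id, collect_set(kp_id) from ods_q_quest_kp_rel group by q_id limit 10;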

The program is developed in IDEA as a mixed Java + Scala project managed by Maven. Note that it is not a pure Scala project managed by sbt; downloading sbt and its dependencies is very slow inside China, though readers with unrestricted network access can try it.
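As a rough guide (not part of the original note), with Spark 2.0.2 and Scala 2.11 as above, the key Maven dependencies are org.apache.spark:spark-sql_2.11:2.0.2 and org.apache.spark:spark-hive_2.11:2.0.2 (usually with provided scope when submitting to a cluster), plus a Redis client such as redis.clients:jedis for the final write.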

Function: use Spark SQL to read Hive data, group by a certain field, collect each group's values, and store the results in Redis.

import org.apache.spark.sql.SparkSession

// Enclosing object added so the snippet compiles on its own; pick any name that fits your project.
object SparkSqlOnHiveDemo {

  def main(args: Array[String]): Unit = {

    val t0 = System.nanoTime() // start time

    val spark = SparkSession
      .builder()
      .appName("spark sql on hive")
      .enableHiveSupport() // activate Hive support
      .getOrCreate()

    import spark.implicits._
    import spark.sql

    sql("use db") // switch database
    // collect_set gathers each group's kp_id values into a set
    val ds = sql("select q_id, collect_set(kp_id) as ids from ods_q_quest_kp_rel where kp_id != 0 group by q_id")
    ds.cache() // cache it for subsequent reuse

    println("size: " + ds.collect().length) // print the number of groups

    ds.select("q_id", "ids").collect().foreach { t =>
      val key = t.getAs[String]("q_id")                      // the grouping column
      val value = t.getAs[Seq[String]]("ids").mkString(",")  // the collected set, joined as a string
      // insert into redis (see the sketch below)
    }

    val t1 = System.nanoTime()
    println("insert redis ok! Elapsed time: " + (t1 - t0) / 1000 / 1000 + "ms")

    // stop the session
    spark.stop()
  }
}
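The "insert into redis" step is left open in the original note. A minimal sketch of what the write could look like is below; the Jedis client, the localhost:6379 address, and the plain key/value layout are all assumptions, not part of the original, so adapt them to your own Redis setup:

import redis.clients.jedis.Jedis

// Minimal sketch: one driver-side connection, one string key per q_id.
val jedis = new Jedis("localhost", 6379)
try {
  ds.select("q_id", "ids").collect().foreach { t =>
    val key = t.getAs[String]("q_id")
    val value = t.getAs[Seq[String]]("ids").mkString(",")
    jedis.set(key, value) // store the collected ids as a comma-separated string
  }
} finally {
  jedis.close() // always release the connection
}

Since the rows are already collect()-ed to the driver, a single connection is enough here; for very large result sets, writing from the executors with foreachPartition (one connection per partition) would be a better fit.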


 
If you have any questions, you can follow the WeChat public account: I am the siege division (woshigcs) and leave a message in the background for consultation.
Technical debt cannot be owed, and neither can health debt. On the road of seeking the Tao, I walk with you.
