SparkSQL series (four): SparkSQL operations on databases

One: SparkSQL operating MySQL

As usual, the shared method comes first:

import java.util.Arrays

import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.{DataFrame, Row, SparkSession, functions}
import org.apache.spark.sql.functions.{col, desc, length, row_number, trim, when}
import org.apache.spark.sql.functions.{countDistinct,sum,count,avg}
import org.apache.spark.sql.functions.concat
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.SaveMode
import java.util.ArrayList


object WordCount {

  // Builds a small test DataFrame and the JDBC connection properties shared by the examples below
  def dataAndJdbcoption() = {

    val sparkSession= SparkSession.builder().master("local").getOrCreate()
    val javasc = new JavaSparkContext(sparkSession.sparkContext)

    // Build a small JSON DataFrame to use as test data
    val nameRDD1 = javasc.parallelize(Arrays.asList("{'id':'7'}","{'id':'8'}","{'id':'9'}"))
    val nameRDD1df = sparkSession.read.json(nameRDD1)

    // JDBC connection properties for the local MySQL test database
    val prop = new java.util.Properties
    prop.setProperty("user","root")
    prop.setProperty("password","123456")
    prop.setProperty("driver","com.mysql.jdbc.Driver")
    prop.setProperty("dbtable","blog")
    prop.setProperty("url","jdbc:mysql://127.0.0.1:3306/test")

    (nameRDD1df,prop)

  }

}

Read MySQL

    val df = dataAndJdbcoption()._1
    val prop = dataAndJdbcoption()._2

    val sparkSession= SparkSession.builder().master("local").getOrCreate()
    val data = sparkSession.read.format("jdbc").option("user","root").option("password","123456")
      .option("driver","com.mysql.jdbc.Driver")
      .option("url","jdbc:mysql://127.0.0.1:3306/test").option("dbtable", "blog")
      .load()
    data.show(100)
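
Since the Properties object returned by dataAndJdbcoption() already carries the url and dbtable, the same read can also be written with the read.jdbc(url, table, properties) overload; a minimal sketch reusing the prop defined above:

    // Sketch: reuse the Properties object instead of repeating each option
    val data2 = sparkSession.read.jdbc(prop.getProperty("url"), prop.getProperty("dbtable"), prop)
    data2.show(100)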

Write MySQL

  val df = dataAndJdbcoption()._1
  val prop = dataAndJdbcoption()._2
  df.write.mode(SaveMode.Append).jdbc(prop.getProperty("url"), prop.getProperty("dbtable"), prop)
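
The same write can also be expressed with the option style used in the read example; a minimal sketch assuming the same connection settings:

  // Sketch: option-style write, equivalent to the jdbc() call above
  df.write.mode(SaveMode.Append)
    .format("jdbc")
    .option("url", "jdbc:mysql://127.0.0.1:3306/test")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("user", "root")
    .option("password", "123456")
    .option("dbtable", "blog")
    .save()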

Two: SparkSQL operating Hive

Reading Hive data at the company

In practice, the job reads the files under the Hive table's location and uses them to generate the final output files.
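
A minimal sketch of this approach; the warehouse path and the Parquet storage format are assumptions, not details from the original post:

    // Sketch: read the files under the Hive table's location directly,
    // then write out the final files (path and format are assumed here)
    val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
    val tableFiles = sparkSession.read.parquet("/user/hive/warehouse/test.db/blog")
    tableFiles.write.mode(SaveMode.Overwrite).csv("/tmp/blog_output")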

Writing Hive data at the company

The job generates the data files first, then LOADs them into Hive.
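
A minimal sketch of this flow; the staging path, the source query, and the text format of the target table are assumptions:

    // Sketch: generate the data files first, then LOAD them into the Hive table
    // (assumes table1 is a text-format table whose delimiter matches the CSV output)
    val df = sparkSession.sql("select aa from sourceTable")
    df.write.mode(SaveMode.Overwrite).csv("/tmp/blog_stage")
    sparkSession.sql("LOAD DATA INPATH '/tmp/blog_stage' OVERWRITE INTO TABLE table1")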

Operating Hive data directly with SQL

    val conf = new SparkConf().setAppName("WordCount")
    // Merge small files. By default SparkSQL runs 200 tasks when writing, which produces many small files.
    // There are actually many parameters that can be tuned; see sparkSession.sql("SET -v")

    conf.set("mapreduce.input.fileinputformat.split.minsize","1024000000")
    conf.set("mapreduce.input.fileinputformat.split.maxsize","1024000000")
    conf.set("mapreduce.input.fileinputformat.split.minsize.per.node","1024000000")
    conf.set("mapreduce.input.fileinputformat.split.maxsize.per.node","1024000000")
    conf.set("mapreduce.input.fileinputformat.split.minsize.per.rack","1024000000")
    conf.set("mapreduce.input.fileinputformat.split.maxsize.per.rack","1024000000")
    val sparkSession= SparkSession.builder().enableHiveSupport().config(conf).getOrCreate()

      sparkSession.sql("insert into table table1 select aa from sparksqlTempTable")

 

    Besides the file-merging method above, there is another way to merge files:

    val dataFrame = sparkSession.sql("select aa from table ").coalesce(3) // the logs show 3 tasks
    dataFrame.createOrReplaceTempView("sparksqlTempTable")
    sparkSession.sql("insert into table table1 select aa from sparksqlTempTable")

 

    However, this method is not very practical, because most SQL jobs need to insert directly.

 

    A third method is mentioned online:

    Add a REPARTITION(4) hint to the SQL. In my experiments, though, it did not work; the syntax probably only applies to HiveSQL itself and has no effect in SparkSQL.

    Example: select /*+ REPARTITION(4) */ aa from table


Origin www.cnblogs.com/wuxiaolong4/p/11707341.html