One: SparkSQL operating MySQL
As usual, the shared helper method comes first:
import java.util.Arrays
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.{SaveMode, SparkSession}
object WordCount {
  def dataAndJdbcoption() = {
    val sparkSession = SparkSession.builder().master("local").getOrCreate()
    val javasc = new JavaSparkContext(sparkSession.sparkContext)
    // Build a small test DataFrame from JSON strings (Spark's JSON reader tolerates single quotes)
    val nameRDD1 = javasc.parallelize(Arrays.asList("{'id':'7'}", "{'id':'8'}", "{'id':'9'}"))
    val nameRDD1df = sparkSession.read.json(nameRDD1)
    // JDBC connection properties for the local MySQL instance
    val prop = new java.util.Properties
    prop.setProperty("user", "root")
    prop.setProperty("password", "123456")
    prop.setProperty("driver", "com.mysql.jdbc.Driver")
    prop.setProperty("dbtable", "blog")
    prop.setProperty("url", "jdbc:mysql://127.0.0.1:3306/test")
    (nameRDD1df, prop)
  }
}
Reading MySQL
val sparkSession = SparkSession.builder().master("local").getOrCreate()
val data = sparkSession.read.format("jdbc")
  .option("user", "root")
  .option("password", "123456")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("url", "jdbc:mysql://127.0.0.1:3306/test")
  .option("dbtable", "blog")
  .load()
data.show(100)
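The read above pulls the whole table. A common refinement is to push a query down to MySQL and split the read into parallel JDBC partitions. This is a sketch only: it assumes the same local MySQL instance and that `blog` has a numeric `id` column roughly in the range 0..100.

```scala
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().master("local").getOrCreate()
val data = sparkSession.read.format("jdbc")
  .option("user", "root")
  .option("password", "123456")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("url", "jdbc:mysql://127.0.0.1:3306/test")
  // A subquery as dbtable pushes the filter down to MySQL instead of loading the whole table
  .option("dbtable", "(select id from blog where id > 5) as t")
  // Split the read into 4 parallel JDBC partitions over the id range
  .option("partitionColumn", "id")
  .option("lowerBound", "0")
  .option("upperBound", "100")
  .option("numPartitions", "4")
  .load()
data.show()
```

`partitionColumn`/`lowerBound`/`upperBound`/`numPartitions` are standard Spark JDBC options; the bounds only control how the range is split, rows outside them are still read.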
Writing MySQL
val (df, prop) = WordCount.dataAndJdbcoption()
df.write.mode(SaveMode.Append).jdbc(prop.getProperty("url"), prop.getProperty("dbtable"), prop)
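The same write can be expressed through `format("jdbc")`, which exposes a few more knobs. A sketch, reusing the helper's DataFrame and properties; `batchsize` and `truncate` are standard Spark JDBC writer options:

```scala
import org.apache.spark.sql.SaveMode

val (df, prop) = WordCount.dataAndJdbcoption()
df.write.mode(SaveMode.Overwrite)
  .format("jdbc")
  .option("url", prop.getProperty("url"))
  .option("dbtable", prop.getProperty("dbtable"))
  .option("user", prop.getProperty("user"))
  .option("password", prop.getProperty("password"))
  .option("driver", prop.getProperty("driver"))
  // Send inserts in batches of 1000 rows per round trip
  .option("batchsize", "1000")
  // With Overwrite, TRUNCATE the existing table instead of dropping and recreating it
  .option("truncate", "true")
  .save()
```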
Two: SparkSQL operating Hive
Reading Hive at our company: in practice we read the files under the Hive table's location directly to produce the final dataset.
Writing Hive at our company: we generate the data files first and then LOAD them into the Hive table.
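The "generate files, then load" write path can be sketched as follows. The output path `/tmp/blog_out`, the source query, and the target table `table1` are all hypothetical, and the file format written must match the target table's storage format:

```scala
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
// Step 1: generate the data files (format must match how table1 is stored)
val df = sparkSession.sql("select aa from table")
df.write.mode("overwrite").format("orc").save("/tmp/blog_out")
// Step 2: LOAD moves the generated files into the table's location
sparkSession.sql("load data inpath '/tmp/blog_out' into table table1")
```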
Operating Hive data directly with SQL
val conf = new SparkConf().setAppName("WordCount")
// Merge small files: by default SparkSQL runs 200 tasks, each writing its own file,
// which produces many small files. Many other parameters can be tuned as well;
// see sparkSession.sql("SET -v")
conf.set("mapreduce.input.fileinputformat.split.minsize","1024000000")
conf.set("mapreduce.input.fileinputformat.split.maxsize","1024000000")
conf.set("mapreduce.input.fileinputformat.split.minsize.per.node","1024000000")
conf.set("mapreduce.input.fileinputformat.split.maxsize.per.node","1024000000")
conf.set("mapreduce.input.fileinputformat.split.minsize.per.rack","1024000000")
conf.set("mapreduce.input.fileinputformat.split.maxsize.per.rack","1024000000")
val sparkSession= SparkSession.builder().enableHiveSupport().config(conf).getOrCreate()
sparkSession.sql("insert into table table1 select aa from sparksqlTempTable")
Besides tuning the parameters above, there is another way to merge files:
val dataFrame = sparkSession.sql("select aa from table").coalesce(3) // the logs show 3 tasks
dataFrame.createOrReplaceTempView("sparksqlTempTable")
sparkSession.sql("insert into table table1 select aa from sparksqlTempTable")
However, this approach is not always practical: coalesce must be applied to a DataFrame, while most Hive jobs are plain SQL inserts with no DataFrame step.
A third method is mentioned online:
Add a REPARTITION(4) hint to the SQL. In my experiments it had no effect; the hint may only apply to HiveSQL itself, or require a newer Spark version, and did nothing in my SparkSQL setup.
Example: select /*+ REPARTITION(4) */ aa from table
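If the hint is not recognized, the same effect can be had on the DataFrame side with repartition, which (unlike coalesce) always shuffles and can also increase the task count. A sketch, reusing the session and the hypothetical tables from above:

```scala
// repartition(4) forces a shuffle into exactly 4 partitions, so 4 output files are written
val dataFrame = sparkSession.sql("select aa from table").repartition(4)
dataFrame.createOrReplaceTempView("sparksqlTempTable")
sparkSession.sql("insert into table table1 select aa from sparksqlTempTable")
```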