Connecting Spark to MySQL

Saving data to MySQL

Method 1: all the columns are fixed in advance

import org.apache.spark.sql.SaveMode

val prop = new java.util.Properties
prop.setProperty("user", "root")
prop.setProperty("password", "123456")

// Append the rows of df to the mytab table
df.write.mode(SaveMode.Append).jdbc("jdbc:mysql://localhost:3306/test", "mytab", prop)
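As a quick sanity check, the same URL, table name, and properties can be reused to read the table back (a minimal sketch, assuming a SparkSession named spark and that the write above has run):

// Read mytab back as a DataFrame; column types are inferred from the MySQL table metadata
val readBack = spark.read.jdbc("jdbc:mysql://localhost:3306/test", "mytab", prop)
readBack.show()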

Method 2: columns can be added or removed freely

import java.util.UUID

df.foreachPartition(p => {
  // Take one pooled connection per partition, created on the executor
  val conn = ConnectionPool.getConnection
  val stmt = conn.createStatement
  p.foreach(x => {
    // Building the SQL as a string keeps the column list easy to change
    val sql = "insert into app_id(id,date,appid,num) values (" +
      "'" + UUID.randomUUID + "'," +
      "'" + x.getInt(0) + "'," +
      "'" + x.getString(1) + "'," +
      "'" + x.getLong(2) + "'" +
      ")"
    stmt.executeUpdate(sql)
  })
  stmt.close()
  ConnectionPool.returnConnection(conn)
})
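Because the values above are spliced straight into the SQL string, this version is open to quoting bugs and SQL injection. A sketch of the same per-partition write using a PreparedStatement with batching (same app_id table; the pooled-connection helpers are the ones defined below):

import java.util.UUID

df.foreachPartition(p => {
  val conn = ConnectionPool.getConnection
  // Placeholders avoid manual quoting and let the driver batch the inserts
  val ps = conn.prepareStatement("insert into app_id(id,date,appid,num) values (?,?,?,?)")
  p.foreach(x => {
    ps.setString(1, UUID.randomUUID.toString)
    ps.setInt(2, x.getInt(0))
    ps.setString(3, x.getString(1))
    ps.setLong(4, x.getLong(2))
    ps.addBatch()
  })
  ps.executeBatch()
  ps.close()
  ConnectionPool.returnConnection(conn)
})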

The database connection pool:

package com.prince.spark.util;

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.LinkedList;

public class ConnectionPool {
    private static LinkedList<Connection> connectionQueue;

    static {
        try {
            // Register the MySQL JDBC driver once per JVM
            Class.forName("com.mysql.jdbc.Driver");
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
    }

    public synchronized static Connection getConnection() {
        try {
            // Lazily create a fixed pool of 5 connections on first use
            if (connectionQueue == null) {
                connectionQueue = new LinkedList<Connection>();
                for (int i = 0; i < 5; i++) {
                    Connection conn = DriverManager.getConnection(
                            "jdbc:mysql://192.168.1.97:3306/xiang_log?characterEncoding=utf8",
                            "root",
                            "123456"
                    );
                    connectionQueue.push(conn);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        // Note: returns null once all 5 connections are checked out
        return connectionQueue.poll();
    }

    public synchronized static void returnConnection(Connection conn) {
        connectionQueue.push(conn);
    }
}
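Note this pool's limitations: it lazily creates exactly five connections on first use, and poll() returns null once all of them are checked out, so callers must keep their concurrency at or below the pool size. The BoneCP-based pool shown later handles sizing and blocking properly.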

Method 3: sometimes you are writing a computed result, so you first have to assemble the DataFrame yourself

import java.sql.Timestamp
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Assemble the result RDD
val arrayRDD = sc.parallelize(List((num, log_date)))
// Map the result RDD to an RDD of Rows
val resultRowRDD = arrayRDD.map(p => Row(
  p._1.toInt,
  p._2.toString,
  new Timestamp(new java.util.Date().getTime)
))
// Specify the schema of each column directly with a StructType
val resultSchema = StructType(
  List(
    StructField("verify_num", IntegerType, true),
    StructField("log_date", StringType, true),      // which day's logs the result was computed from
    StructField("create_time", TimestampType, true) // when the result row was created
  )
)
// Assemble the new DataFrame
val resultDF = spark.createDataFrame(resultRowRDD, resultSchema)
// Write the result to MySQL
resultDF.write.mode("append")
  .format("jdbc")
  .option("url", "jdbc:mysql://192.168.1.97:3306/xiang_log")
  .option("dbtable", "verify") // table name
  .option("user", "root")
  .option("password", "123456")
  .save()

Reading data from MySQL

import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.{SparkConf, SparkContext}

object JdbcRDDDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("JdbcRDDDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val connection = () => {
      Class.forName("com.mysql.jdbc.Driver").newInstance()
      DriverManager.getConnection("jdbc:mysql://localhost:3306/bigdata", "root", "123456")
    }
    // Create the JdbcRDD
    val jdbcRDD = new JdbcRDD(
      sc,
      connection,
      "SELECT * FROM ta WHERE id >= ? AND id <= ?",
      1, 4, // the bounds bound to the two ? placeholders
      2,    // number of partitions
      r => { // maps each row selected from MySQL: column 1 becomes id, column 2 becomes code
        val id = r.getInt(1)
        val code = r.getString(2)
        (id, code)
      }
    )
    println(jdbcRDD.collect().toBuffer)
    sc.stop()
  }
}

JdbcRDD's primary constructor takes the following parameters.
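For reference, a sketch of that signature as it appears in the Spark 2.x source (annotations mine):

class JdbcRDD[T: ClassTag](
    sc: SparkContext,                // the SparkContext
    getConnection: () => Connection, // factory invoked on each executor to open a JDBC connection
    sql: String,                     // query containing exactly two ? placeholders for the bounds
    lowerBound: Long,                // value bound to the first ?
    upperBound: Long,                // value bound to the second ?
    numPartitions: Int,              // the [lowerBound, upperBound] range is split across this many partitions
    mapRow: (ResultSet) => T = JdbcRDD.resultSetToObjectArray _) // maps each ResultSet row to an element
  extends RDD[T](sc, Nil)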

Pitfalls

Creating objects on the driver
If you create a connection object (such as a network or database connection) on the driver and then use it inside an RDD operator function, the connection object has to be serialized and shipped from the driver to the workers. Connection objects (for example, a JDBC Connection) are generally not serializable, so this usually fails with serialization errors. Connection objects must therefore be created on the workers, never on the driver:

dstream.foreachRDD { rdd =>
  val connection = createNewConnection()  // executed on the driver
  rdd.foreach { record =>
    connection.send(record) // executed on the worker
  }
}

Creating an object for every single record
Moving the connection inside foreach fixes the serialization problem, but now a connection is created and torn down for every record:

dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    val connection = createNewConnection()
    connection.send(record)
    connection.close()
  }
}

The correct approaches:

  1. Create one connection object per RDD partition
    Creating and destroying connection objects is expensive, so doing it frequently can noticeably reduce the overall performance and throughput of the Spark job.
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}
  2. Use a connection pool to share connection objects across partitions
    The better approach: for each RDD in the DStream, call foreachPartition, take one connection object per partition, and use it to write that partition's entire data set to MySQL. This greatly reduces the number of connections created, and the pool lets them be reused across partitions and batches.
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // Static connection pool; connections are created lazily
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return the connection to the pool for reuse
  }
}

The following shows the step-by-step optimization of the database writes inside foreachRDD: from per-record foreach, to per-partition foreachPartition, to a connection pool.

package com.ruozedata.spark
import java.sql.DriverManager
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWCApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("SocketWCApp")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    // From the socket server ==> DStream
    val lines = ssc.socketTextStream("vm01", 8888)
    val result = lines.flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _)
//    result.print()

    // Segment 1: foreach — one MySQL connection per record, very wasteful
    result.foreachRDD(rdd => {
      // A connection created here would live on the driver; it cannot be
      // serialized across the network, so it must be created inside foreach
      // val connection = getConnection()
      rdd.foreach(kv => { // foreach runs on the executor
        val connection = getConnection()
        val sql = s"insert into wc(word,cnt) values ('${kv._1}', '${kv._2}')"
        connection.createStatement().execute(sql)
        connection.close()
      })
    })

    // Segment 2: optimized with foreachPartition — opening a MySQL connection
    // is expensive, so connect once per partition instead of once per record
    result.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        val connection = getConnection()
        partitionOfRecords.foreach(kv => {
          val sql = s"insert into wc(word,cnt) values ('${kv._1}', '${kv._2}')"
          connection.createStatement().execute(sql)
        })
        connection.close()
      })
    })

    // Segment 3: optimized further with a connection pool — the connection is
    // not closed after use but returned to the pool
    result.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        // if (partitionOfRecords.hasNext) {
        val connection = ConnectionPool.getConnection().get
        partitionOfRecords.foreach(kv => {
          val sql = s"insert into wc(word,cnt) values ('${kv._1}', '${kv._2}')"
          connection.createStatement().execute(sql)
        })
        // connection.close()
        ConnectionPool.returnConnection(connection)
        // }
      })
    })

    // Segment 4: windowing
    // 20-second window, sliding every 10 seconds; both durations must be
    // multiples of the 10-second batch interval
    val windowedWordCounts = result.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(20), Seconds(10))
    windowedWordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }

  def getConnection() = {
    Class.forName("com.mysql.jdbc.Driver")
    DriverManager.getConnection("jdbc:mysql://192.168.137.130:3306/rzdb?useSSL=false", "root", "syncdb123!")
  }
}
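To drive the example, start a line-oriented socket server on vm01 before launching the job (assuming netcat is available): nc -lk 8888, then type comma-separated words into it.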

To define the connection pool, first add the BoneCP dependency:

    <dependency>
      <groupId>com.jolbox</groupId>
      <artifactId>bonecp</artifactId>
      <version>0.8.0.RELEASE</version>
    </dependency>
The pool itself, wrapping BoneCP:

package com.ruozedata.spark

import java.sql.Connection
import com.jolbox.bonecp.{BoneCP, BoneCPConfig}
import org.slf4j.LoggerFactory

object ConnectionPool {
  val logger = LoggerFactory.getLogger(this.getClass)

  private val pool = {
    try {
      Class.forName("com.mysql.jdbc.Driver")
      val config = new BoneCPConfig()
      config.setUsername("root")
      config.setPassword("syncdb123!")
      config.setJdbcUrl("jdbc:mysql://192.168.137.130:3306/rzdb?useSSL=false")
      config.setMinConnectionsPerPartition(2) // minimum number of connections
      config.setMaxConnectionsPerPartition(5) // maximum number of connections
      config.setCloseConnectionWatch(true)    // watch for connections that are never closed
      Some(new BoneCP(config))
    } catch {
      case e: Exception =>
        e.printStackTrace()
        None
    }
  }

  def getConnection(): Option[Connection] = {
    pool match {
      case Some(p) => Some(p.getConnection)
      case None => None
    }
  }

  def returnConnection(connection: Connection): Unit = {
    if (null != connection) {
      // BoneCP hands out wrapped connections, so close() returns the
      // connection to the pool rather than closing the physical connection
      connection.close()
    }
  }
}
Reposted from blog.csdn.net/ThreeAspects/article/details/103813263