Spark Streaming saves data in MySQL

Spark Streaming Persistence Design Patterns

DStreams output operations

  • print: prints the first 10 elements of each batch of the DStream on the driver node; often used for development and debugging.
  • saveAsTextFiles(prefix, [suffix]): saves the contents of the DStream as text files. The file name for each batch interval is generated from prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
  • saveAsObjectFiles(prefix, [suffix]): saves the contents of the DStream as files of serialized Java objects. The file name for each batch interval is generated from prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
  • saveAsHadoopFiles(prefix, [suffix]): saves the contents of the DStream as Hadoop files. The file name for each batch interval is generated from prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
  • foreachRDD(func): the most general output operation. The function func is applied to each RDD produced from the stream. Usually func saves the data in each RDD to an external system, for example by writing the RDD to files, or pushing it to a database over a network connection. It is worth noting that func is executed in the driver process that runs the application, and it usually contains RDD actions that force the streaming RDDs to actually be computed. A minimal usage sketch follows this list.
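
As a rough illustration, assuming a DStream named wordCounts built elsewhere in the application (the HDFS path below is a placeholder), the output operations above could be used like this:

wordCounts.print() // print the first 10 elements of every batch on the driver (debugging aid)

// write each batch as text files named "prefix-TIME_IN_MS" under the given HDFS prefix
wordCounts.saveAsTextFiles("hdfs://namenode:8020/streaming/counts")

// the most general form: run an arbitrary function over each batch's RDD
wordCounts.foreachRDD(rdd => {
  // count() is an RDD action, so it forces the batch to actually be computed
  println("Batch contains " + rdd.count() + " records")
})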

Design pattern using foreachRDD

dstream.foreachRDD offers a lot of flexibility, but it also requires us to avoid several common pitfalls. The usual process for saving data to an external system is: establish a remote connection -> transfer data to the remote system over the connection -> close the connection. Following this process, the first code that comes to mind looks like this:

// Traverse each RDD in the data stream and match against the existing data
ds.foreachRDD(r => {
  println("Received " + r.count() + " records")
  if (r.count() > 0) {
    // Get the connection to MySQL on the driver (this is the pitfall explained below)
    val conn: Connection = MySqlUtil.getConnection
    // Traverse the RDD
    r.foreach(tuple => {
      insertIntoMySQL(conn, sql, tuple)
    })
    MySqlUtil.close(conn)
  }
})
The insert into the database is done as follows:

def insertIntoMySQL(con: Connection, sql: String, data: Tuple8[String, String, String, String, String, String, String, String]): Unit = {
    try {
      val ps = con.prepareStatement(sql)
      ps.setString(1, data._1)
      ps.setString(2, data._2)
      ps.setString(3, data._3)
      ps.setString(4, data._4)
      ps.setString(5, data._5)
      ps.setString(6, data._6)
      ps.setString(7, data._7)
      ps.setString(8, data._8)
      ps.executeUpdate()
      ps.close()
    } catch {
      case exception: Exception =>
        exception.printStackTrace()
    }
  }
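
The MySqlUtil helper referenced above is not shown in the original. A minimal sketch, assuming a plain JDBC connection without pooling and placeholder connection parameters:

import java.sql.{Connection, DriverManager}

object MySqlUtil {
  // Hypothetical connection parameters; adjust the URL, user and password for your environment
  private val url = "jdbc:mysql://127.0.0.1:3306/test?useUnicode=true&characterEncoding=UTF-8"
  private val user = "root"
  private val password = ""

  def getConnection: Connection = {
    Class.forName("com.mysql.jdbc.Driver")
    DriverManager.getConnection(url, user, password)
  }

  def close(conn: Connection): Unit = {
    if (conn != null && !conn.isClosed) conn.close()
  }
}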

In the earlier article on Spark pitfalls, we sorted out the roles of Spark workers and the driver. In cluster mode, the connection object in the code above would have to be serialized on the driver and shipped to the workers, but a database connection is bound to a particular machine and cannot be serialized, so this code fails with a serialization error (connection object not serializable). To avoid this error, we can establish the connection on the worker instead, as follows:

// Traverse each RDD in the data stream and match against the existing data
ds.foreachRDD(r => {
  println("Received " + r.count() + " records")
  if (r.count() > 0) {
    r.foreach(tuple => {
      // Get the connection to MySQL on the worker, once per record
      val conn: Connection = MySqlUtil.getConnection
      insertIntoMySQL(conn, sql, tuple)
      MySqlUtil.close(conn)
    })
  }
})

The problem seems to be solved. But think about it: we now open and close a connection for every single record of every RDD, which creates unnecessary load and reduces the throughput of the whole system. A better approach is to use rdd.foreachPartition to establish one connection per RDD partition (note: each partition is processed as a whole by a single task on one worker), as follows:
ds.foreachRDD(r => {
  println("Received " + r.count() + " records")
  if (r.count() > 0) {
    // Traverse the RDD, one connection per partition
    r.foreachPartition(x => {
      // Get the connection to MySQL once for the whole partition
      val conn: Connection = MySqlUtil.getConnection
      while (x.hasNext) {
        insertIntoMySQL(conn, sql, x.next())
      }
      MySqlUtil.close(conn)
    })
  }
})

In this way we reduce the overhead of repeatedly establishing connections. In practice, we usually go one step further and use a connection pool when talking to the database: by holding a static connection pool object we can reuse connections across batches and partitions, further cutting the connection setup cost and the load on the database. It is also worth noting that such a connection pool should create connections lazily, on demand, and reclaim idle or timed-out connections promptly. A sketch of this pattern follows.
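
As a rough sketch of the pooled variant, assuming a hypothetical ConnectionPool helper with getConnection/returnConnection methods (any lazily initialized static pool, such as the c3p0 pool shown later, would do):

ds.foreachRDD(r => {
  r.foreachPartition(partition => {
    // ConnectionPool is a hypothetical, lazily initialized static pool shared by tasks on the same executor
    val conn = ConnectionPool.getConnection
    partition.foreach(tuple => {
      insertIntoMySQL(conn, sql, tuple)
    })
    // Return the connection to the pool instead of closing it, so it can be reused
    ConnectionPool.returnConnection(conn)
  })
})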

Also worth noting:

  • If multiple foreachRDD calls are used in a Spark Streaming application, they are executed in program order.
  • DStream output operations are executed lazily, so if we do not include any RDD action inside foreachRDD, the system will only receive the data and then discard it. A small sketch of this follows the list.
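
For example, a rough sketch of the difference, assuming a DStream named ds:

// No RDD action inside foreachRDD: the batch is received and then silently discarded
ds.foreachRDD(rdd => {
  rdd.map(record => record) // a transformation only; nothing forces evaluation
})

// With an RDD action such as count(), the batch is actually computed
ds.foreachRDD(rdd => {
  println("Batch size: " + rdd.count())
})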

Accessing MySQL from Spark

We need a serializable class to establish the MySQL connection; here we use the C3P0 connection pool for MySQL.

A general-purpose MySQL connection class

import java.sql.Connection
import java.util.Properties

import com.mchange.v2.c3p0.ComboPooledDataSource

class MysqlPool extends Serializable {
  private val cpds: ComboPooledDataSource = new ComboPooledDataSource(true)
  private val conf = Conf.mysqlConfig
  try {
    cpds.setJdbcUrl(conf.get("url").getOrElse("jdbc:mysql://127.0.0.1:3306/test_bee?useUnicode=true&characterEncoding=UTF-8"));
    cpds.setDriverClass("com.mysql.jdbc.Driver");
    cpds.setUser(conf.get("username").getOrElse("root"));
    cpds.setPassword(conf.get("password").getOrElse(""))
    cpds.setMaxPoolSize(200)
    cpds.setMinPoolSize(20)
    cpds.setAcquireIncrement(5)
    cpds.setMaxStatements(180)
  } catch {
    case e: Exception => e.printStackTrace()
  }
  def getConnection: Connection = {
    try {
      cpds.getConnection()
    } catch {
      case ex: Exception =>
        ex.printStackTrace()
        null
    }
  }
}
object MysqlManager {
  var mysqlManager: MysqlPool = _
  def getMysqlManager: MysqlPool = {
    synchronized {
      if (mysqlManager == null) {
        mysqlManager = new MysqlPool
      }
    }
    mysqlManager
  }
}
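
The Conf object referenced above is not shown in the original. A minimal sketch, assuming the MySQL settings are simply kept in a Map whose keys match those read in MysqlPool:

object Conf {
  // Hypothetical configuration; in a real project this would be loaded from a properties file or similar
  val mysqlConfig: Map[String, String] = Map(
    "url" -> "jdbc:mysql://127.0.0.1:3306/test_bee?useUnicode=true&characterEncoding=UTF-8",
    "username" -> "root",
    "password" -> ""
  )
}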

We use c3p0 to build the MySQL connection pool; whenever we access the database, we take a connection from the pool for the data transfer.

MySQL output operation

Using the foreachRDD design pattern described earlier, the code that writes a DStream to MySQL looks like this:

dstream.foreachRDD(rdd => {
    if (!rdd.isEmpty) {
      rdd.foreachPartition(partitionRecords => {
        // Get a connection from the pool
        val conn = MysqlManager.getMysqlManager.getConnection
        val statement = conn.createStatement
        try {
          conn.setAutoCommit(false)
          partitionRecords.foreach(record => {
            val sql = "insert into table..." // the SQL statement to execute
            statement.addBatch(sql)
          })
          statement.executeBatch
          conn.commit
        } catch {
          case e: Exception =>
            // log the error here
        } finally {
          statement.close()
          conn.close()
        }
      })
    }
})
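
One caveat about the code above: building a SQL string per record and using Statement.addBatch works, but with real data it is usually safer and faster to batch a PreparedStatement instead, which avoids SQL injection and repeated statement parsing. A rough variant, assuming a placeholder table t_record(col1, col2) and records that are (String, String) pairs:

dstream.foreachRDD(rdd => {
    if (!rdd.isEmpty) {
      rdd.foreachPartition(partitionRecords => {
        val conn = MysqlManager.getMysqlManager.getConnection
        // Placeholder table and columns; replace with the real schema
        val ps = conn.prepareStatement("insert into t_record (col1, col2) values (?, ?)")
        try {
          conn.setAutoCommit(false)
          partitionRecords.foreach { case (col1, col2) =>
            ps.setString(1, col1)
            ps.setString(2, col2)
            ps.addBatch()
          }
          ps.executeBatch()
          conn.commit()
        } finally {
          ps.close()
          conn.close()
        }
      })
    }
})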

It is worth noting that:

  • When writing to MySQL we do not commit once per record; instead we submit the statements as a batch. That is why we call conn.setAutoCommit(false): committing the whole batch at once further improves MySQL write efficiency.
  • Updating indexed columns in MySQL makes updates noticeably slower. Try to avoid doing so; if it cannot be avoided, you simply have to accept the cost.
