[Repost] Spark Pitfalls: Databases (HBase + MySQL)

https://cloud.tencent.com/developer/article/1004820

Preface
When persisting computation results from Spark Streaming, we often need to write to a database to count or update values.

In a recent real-time consumer-processing task, I used Spark Streaming to process a real-time data stream and needed to write the computed results to both HBase and MySQL. This article summarizes how Spark operates on HBase and MySQL, and records some of the pits I stepped into along the way.

Spark Streaming persistence design pattern
DStream output operations
print: prints the first 10 elements of each batch of the DStream on the driver node; often used for development and debugging.
saveAsTextFiles(prefix, [suffix]): saves the DStream as text files; the file name of each interval batch is generated from prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsObjectFiles(prefix, [suffix]): saves the DStream content as files of serialized Java objects; the file name of each interval batch is generated from prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsHadoopFiles(prefix, [suffix]): saves the DStream as Hadoop files; the file name of each interval batch is generated from prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
foreachRDD(func): the most general output operation; the function func is applied to each RDD generated from the stream. Usually func saves the data of each RDD to an external system, e.g. to a file or, over a network connection, to a database. Note that func is executed in the driver process of the running application and usually contains RDD actions, which force the streaming RDDs to actually be computed.
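
As a quick illustration, here is how a few of these operations might be called on a hypothetical wordCounts DStream (the DStream itself and the HDFS path are assumptions, not part of the original post):

// wordCounts is a hypothetical DStream[(String, Long)]; the HDFS path is a placeholder
wordCounts.print() // first 10 elements of each batch, printed on the driver
wordCounts.saveAsTextFiles("hdfs:///tmp/wordcounts", "txt") // one directory per batch: wordcounts-TIME_IN_MS.txt
wordCounts.foreachRDD { rdd =>
  // the foreach action forces the batch to be computed (output goes to the executor logs)
  rdd.foreach(pair => println(pair))
}
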
Design pattern for using foreachRDD
dstream.foreachRDD gives the developer a lot of flexibility, but there are several common pitfalls to avoid when using it. The usual flow for saving data to an external system is: establish a remote connection -> send the data over the connection -> close the connection. Following this flow, the first code that comes to mind looks like this:

dstream.foreachRDD { rdd =>
  val connection = createNewConnection() // executed at the driver
  rdd.foreach { record =>
    connection.send(record) // executed at the worker
  }
}
In the previous article of this series, "Spark Pitfalls: First Try", we sorted out Spark's workers and driver. We know that in cluster mode the connection in the code above would have to be serialized on the driver and shipped to the workers, but a connection cannot be passed between machines, i.e. it is not serializable, which leads to a serialization error (connection object not serializable). To avoid this error, we create the connection on the worker instead; the code is as follows:

dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    val connection = createNewConnection()
    connection.send(record)
    connection.close()
  }
}
Problem solved, it seems? But think about it: we now create and close a connection for every single record of every RDD, which causes unnecessarily high load and reduces the throughput of the whole system.

A better way is to use rdd.foreachPartition to create a single connection per RDD partition (note: each partition is a chunk of the RDD that is processed on the same worker), as follows:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}
This reduces the load caused by frequently establishing connections. When connecting to a database we usually go one step further and use a connection pool; introducing the connection pool concept, the code is optimized as follows:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection) // return to the pool for future reuse
  }
}
By holding a static connection pool object we can reuse connections and further reduce the overhead of establishing them, thereby lowering the load. Note that, just like a database connection pool, the pool here should be lazy, creating connections on demand, and should reclaim idle connections that have timed out.
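
The ConnectionPool object is not spelled out in the original post; the following is only a minimal sketch of what a static, lazily initialized pool could look like. The JDBC URL is a placeholder and timeout reclamation is omitted; in practice a mature pool such as C3P0 (used for MySQL later in this article) is the better choice.

import java.sql.{Connection, DriverManager}
import java.util.concurrent.ConcurrentLinkedQueue

// Minimal sketch of a static, lazily initialized connection pool (illustration only).
object ConnectionPool {
  private val jdbcUrl = "jdbc:mysql://127.0.0.1:3306/test" // assumed placeholder URL
  // created lazily, the first time the object is used on an executor
  private lazy val pool = new ConcurrentLinkedQueue[Connection]()

  def getConnection(): Connection = {
    val conn = pool.poll() // reuse an idle connection if one is available
    if (conn == null || conn.isClosed) DriverManager.getConnection(jdbcUrl) else conn
  }

  def returnConnection(conn: Connection): Unit = pool.offer(conn)
}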

Also worth noting:

If several foreachRDD calls are used in one Spark Streaming program, they are executed on the DStream in program order.
The execution strategy of output operations is lazy: if we do not perform any RDD action inside foreachRDD, the system only receives the data and then discards it (see the sketch below).
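
A minimal sketch of this last point (sendToExternalSystem is a hypothetical helper, not from the original post):

dstream.foreachRDD { rdd =>
  // no RDD action is called here, so this batch is received and then simply dropped
  rdd.map(record => record.toString) // a transformation alone does not trigger computation
}

dstream.foreachRDD { rdd =>
  // foreach is an action, so the batch is actually computed and pushed out
  rdd.foreach(record => sendToExternalSystem(record))
}
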
Spark access to HBase
Above we explained the basic design pattern for writing a Spark Streaming DStream out to an external system; here we describe how to write a DStream to an HBase cluster.

HBase generic connection class
Scala connects to HBase by obtaining information through ZooKeeper, so the relevant ZooKeeper information has to be provided in the configuration, as follows:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Connection
import org.apache.hadoop.hbase.HConstants
import org.apache.hadoop.hbase.client.ConnectionFactory

object HbaseUtil extends Serializable {
  private val conf = HBaseConfiguration.create()
  private val para = Conf.hbaseConfig // Conf is our configuration class holding the HBase settings
  conf.set(HConstants.ZOOKEEPER_CLIENT_PORT, para.get("port").getOrElse("2181"))
  conf.set(HConstants.ZOOKEEPER_QUORUM, para.get("quorum").getOrElse("127-0-0-1")) // hostname(s); see the hosts discussion below
  private val connection = ConnectionFactory.createConnection(conf)

  def getHbaseConn: Connection = connection
}
According to the material we found online, we do not use a connection pool here because of the particular nature of HBase connections (a single HBase Connection object is heavyweight and thread-safe, so it is shared rather than pooled).

HBase output operation
Taking the put operation as an example, let's apply the design pattern above to the HBase output:

dstream.foreachRDD(rdd => {
  if (!rdd.isEmpty) {
    rdd.foreachPartition(partitionRecords => {
      val connection = HbaseUtil.getHbaseConn // get the shared HBase connection
      partitionRecords.foreach(data => {
        val tableName = TableName.valueOf("tableName")
        val t = connection.getTable(tableName)
        try {
          // rowKey, column, qualifier and value are assumed to be derived from data
          val put = new Put(Bytes.toBytes(rowKey)) // row key
          // column family, qualifier, value
          put.addColumn(column.getBytes, qualifier.getBytes, value.getBytes)
          t.put(put)
          // do some logging (shown on the worker)
        } catch {
          case e: Exception =>
            // log the error
            e.printStackTrace()
        } finally {
          t.close()
        }
      })
    })
    // do some logging (shown on the driver)
  }
})

For other operations on HBase, please refer to "Operating HBase under Spark (1.0.0 new API)".
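
That article is not reproduced here, but as a quick illustration, a simple read with the same 1.0 client API might look like the following sketch (table name, row key, column family and qualifier are placeholders):

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Get
import org.apache.hadoop.hbase.util.Bytes

// Minimal sketch of a Get with the HBase 1.0 client API; all names are placeholders.
val connection = HbaseUtil.getHbaseConn
val table = connection.getTable(TableName.valueOf("tableName"))
try {
  val get = new Get(Bytes.toBytes("rowKey"))
  val result = table.get(get)
  val value = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("qualifier")))
  println(value)
} finally {
  table.close()
}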

Pit-filling notes
The main pit concerns configuring HConstants.ZOOKEEPER_QUORUM when connecting to HBase:

Since an HBase connection cannot be established directly with an IP address, the hostnames usually have to be configured in the hosts file. For the code above, which uses the (arbitrary) hostname 127-0-0-1, we need a hosts entry mapping 127-0-0-1 to 127.0.0.1.
On a single machine we only need the hosts entry for the HBase node that runs ZooKeeper, but after switching to an HBase cluster we ran into a strange bug.

Problem description: when saving the DStream to HBase inside foreachRDD, the job gets stuck and no error message appears (yes, it just hangs and stops responding).

Problem analysis: the HBase cluster consists of several machines, but we had only configured the hosts entry for one of them, so when accessing HBase the Spark cluster kept trying to resolve the other nodes and hung there.

Solution: configure the IPs of all HBase nodes in the hosts file on every worker, and the problem is solved. An example is sketched below.
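
For instance, the hosts file on each worker might contain entries like the following (node names and IPs are made-up placeholders):

# /etc/hosts on every Spark worker: map every HBase/ZooKeeper node (placeholder names and IPs)
192.168.1.101 hbase-node-1
192.168.1.102 hbase-node-2
192.168.1.103 hbase-node-3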

Spark access to MySQL
Accessing MySQL is similar to accessing HBase: we also need a serializable class to establish the MySQL connection. Here we use the C3P0 connection pool for MySQL.

MySQL generic connection class
import java.sql.Connection
import java.util.Properties

import com.mchange.v2.c3p0.ComboPooledDataSource

class MysqlPool extends Serializable {
  private val cpds: ComboPooledDataSource = new ComboPooledDataSource(true)
  private val conf = Conf.mysqlConfig
  try {
    cpds.setJdbcUrl(conf.get("url").getOrElse("jdbc:mysql://127.0.0.1:3306/test_bee?useUnicode=true&characterEncoding=UTF-8"))
    cpds.setDriverClass("com.mysql.jdbc.Driver")
    cpds.setUser(conf.get("username").getOrElse("root"))
    cpds.setPassword(conf.get("password").getOrElse(""))
    cpds.setMaxPoolSize(200)
    cpds.setMinPoolSize(20)
    cpds.setAcquireIncrement(5)
    cpds.setMaxStatements(180)
  } catch {
    case e: Exception => e.printStackTrace()
  }

  def getConnection: Connection = {
    try {
      cpds.getConnection()
    } catch {
      case ex: Exception =>
        ex.printStackTrace()
        null
    }
  }
}
object MysqlManager {
  var mysqlManager: MysqlPool = _

  def getMysqlManager: MysqlPool = {
    synchronized {
      if (mysqlManager == null) {
        mysqlManager = new MysqlPool
      }
    }
    mysqlManager
  }
}
We use C3P0 to build a MySQL connection pool, and each time we access MySQL we take a connection from the pool to transfer the data.

MySQL output operation
The output of the DStream to MySQL also follows the foreachRDD design pattern described above. The code is as follows:

dstream.foreachRDD(rdd => {
  if (!rdd.isEmpty) {
    rdd.foreachPartition(partitionRecords => {
      // get a connection from the connection pool
      val conn = MysqlManager.getMysqlManager.getConnection
      val statement = conn.createStatement
      try {
        conn.setAutoCommit(false)
        partitionRecords.foreach(record => {
          val sql = "insert into table..." // the SQL statement to execute
          statement.addBatch(sql)
        })
        statement.executeBatch
        conn.commit
      } catch {
        case e: Exception =>
          // do some logging
      } finally {
        statement.close()
        conn.close()
      }
    })
  }
})
It's worth noting that:

When writing to MySQL we do not commit each record individually; instead we submit in batches, which is why we call conn.setAutoCommit(false). This further improves the efficiency of MySQL.
If we update indexed columns in MySQL, the updates will be slow. We should avoid this whenever possible; if it really cannot be avoided, just push through (T^T).
Deployment
Finally, the Maven configuration for the jars Spark needs in order to connect to MySQL and HBase:


<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>1.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-common</artifactId>
    <version>1.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-server</artifactId>
    <version>1.0.0</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.31</version>
</dependency>
<dependency>
    <groupId>c3p0</groupId>
    <artifactId>c3p0</artifactId>
    <version>0.9.1.2</version>
</dependency>
