waterdrop 1.x: importing into a ClickHouse distributed table (default method)

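First, here is what the documentation of the official ClickHouse output plugin says about distributed tables.

Official documentation: https://interestinglab.github.io/seatunnel-docs/#/zh-cn/v1/configuration/output-plugins/Clickhouse

Distributed table configuration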

ClickHouse {
    host = "localhost:8123"
    database = "nginx"
    table = "access_msg"
    cluster = "no_replica_cluster"
    fields = ["date", "datetime", "hostname", "http_code", "data_size", "ua", "request_time"]
}
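Based on the given cluster name, the plugin queries the system.clusters table to find out which nodes the table is actually distributed across. The data of a single Spark partition is written to one ClickHouse node, chosen by a random policy.

From this description we can tell that waterdrop actually writes to the local tables, and that data is allocated to nodes at random.

Now let's test it.

Before testing, make sure your ClickHouse distributed configuration is already in place. For reference, see: clickhouse集群模式配置_cakecc2008的专栏-CSDN博客 (ClickHouse cluster mode configuration, on cakecc2008's CSDN blog)

1. Create the tables: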

-- Create the local table; this must be run on every node
DROP TABLE IF EXISTS dw_local.dist_test;
CREATE TABLE dw_local.dist_test (
    id        String COMMENT 'id',
    user_name String COMMENT 'user name'
)
ENGINE = MergeTree
PRIMARY KEY (id)
ORDER BY (id);

truncate table dw_local.dist_test;
-- Create the distributed table
DROP TABLE IF EXISTS dw.dist_test;
CREATE TABLE dw.dist_test (
    id        String COMMENT 'id',
    user_name String COMMENT 'user name'
)
ENGINE = Distributed(dw_cluster, dw_local, dist_test);


select * from  dw_local.dist_test t  ;
select * from  dw.dist_test t  ;

2. Prepare the data:

vi /tmp/dist_test.csv
id,user_name
1,zhangsan
2,lisi
3,wangwu
4,lili
5,lucy
6,poli
7,lilei
8,hanmeimei
9,liudehua
10,wuyanzu

3. waterdrop configuration:

vi /tmp/dist_test.conf

spark {
  spark.app.name = "Waterdrop"
  spark.executor.instances = 1
  spark.executor.cores = 1
  spark.executor.memory = "1g"
  spark.sql.catalogImplementation = "hive"
}
input {
    file {
        path = "file:///tmp/dist_test.csv"
        format = "csv"
        options.header = "true"
        result_table_name = "dist_test"
    }
}
filter {
    repartition {
        # Repartition the data. The test data set is tiny, so without repartition
        # every row would end up on the same node (see the source analysis below).
        num_partitions = 5
    }
}
output {
    clickhouse {
        host = "10.1.99.191:8123"
        # waterdrop writes to the local table, so database must be the database that holds the local table
        database = "dw_local"
        table = "dist_test"
        cluster = "dw_cluster"
        username = "user"
        password = "password"
    } 
}

4. Run the import:

bin/start-waterdrop.sh --master local[1] --deploy-mode client --config /tmp/dist_test.conf

5. Query the data:

-- Node 1
select * from  dw_local.dist_test t  ;

Query id: ff2dfdb8-1d58-413a-8fe6-f17992630d1a

┌─id─┬─user_name─┐
│ 8  │ hanmeimei │
│ 9  │ liudehua  │
└────┴───────────┘
┌─id─┬─user_name─┐
│ 10 │ wuyanzu   │
│ 5  │ lucy      │
└────┴───────────┘
┌─id─┬─user_name─┐
│ 4  │ lili      │
│ 7  │ lilei     │
└────┴───────────┘
┌─id─┬─user_name─┐
│ 1  │ zhangsan  │
│ 2  │ lisi      │
└────┴───────────┘

-- Node 2
select * from  dw_local.dist_test t  ;

Query id: ed9ca714-d301-4691-b1bd-7fbf7f34be07

┌─id─┬─user_name─┐
│ 3  │ wangwu    │
│ 6  │ poli      │
└────┴───────────┘

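Your results will most likely differ from mine, because the policy is random. Next, let's look at the source code to see how the ClickHouse output plugin allocates data across the cluster.

6. Source code analysis:

Version 1.5.1

The official documentation tells us:

The calling structure of an Output plugin is similar to that of a Filter plugin. When it is invoked, the checkConfig method runs first to verify that the parameters passed to the plugin are correct; the prepare method is then called to set default parameter values and initialize the class's member variables; finally, the process method is called to write the Dataset[Row] data to the external data source.

So we focus on two methods, prepare and process. The analysis is written in the code comments.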

package io.github.interestinglab.waterdrop.output.batch

class Clickhouse extends BaseOutput { 

override def prepare(spark: SparkSession): Unit = {
    this.jdbcLink = String.format("jdbc:clickhouse://%s/%s", config.getString("host"), config.getString("database"))

    val balanced: BalancedClickhouseDataSource = new BalancedClickhouseDataSource(this.jdbcLink, properties)
    val conn = balanced.getConnection.asInstanceOf[ClickHouseConnectionImpl]

    this.table = config.getString("table")
    this.tableSchema = getClickHouseSchema(conn, table)

    if (this.config.hasPath("fields")) {
      this.fields = config.getStringList("fields")
      val (flag, msg) = acceptedClickHouseSchema()
      if (!flag) {
        throw new ConfigRuntimeException(msg)
      }
    }

    val defaultConfig = ConfigFactory.parseMap(
      Map(
        "bulk_size" -> 20000,
        // "retry_codes" -> util.Arrays.asList(ClickHouseErrorCode.NETWORK_ERROR.code),
        "retry_codes" -> util.Arrays.asList(),
        "retry" -> 1
      )
    )

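    // check whether the config file sets the cluster parameter
    if (config.hasPath("cluster")) {
      this.cluster = config.getString("cluster")

      // fetch the cluster info from the database; it is used later in process.
      // clusterInfo is in effect an array of shard entries
      this.clusterInfo = getClickHouseClusterInfo(conn, cluster)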
      if (this.clusterInfo.size == 0) {
        val errorInfo = s"cloud not find cluster config in system.clusters, config cluster = $cluster"
        logError(errorInfo)
        throw new RuntimeException(errorInfo)
      }
      logInfo(s"get [$cluster] config from system.clusters, the replica info is [$clusterInfo].")
    }

    config = config.withFallback(defaultConfig)
    retryCodes = config.getIntList("retry_codes")
    super.prepare(spark)
  }

  override def process(df: Dataset[Row]): Unit = {
    val dfFields = df.schema.fieldNames
    val bulkSize = config.getInt("bulk_size")
    val retry = config.getInt("retry")

    if (!config.hasPath("fields")) {
      fields = dfFields.toList
    }

    this.initSQL = initPrepareSQL()
    logInfo(this.initSQL)

    // foreachPartition iterates over the Dataset's partitions, so the "random" choice is made
    // once per partition. With only one partition, all data goes to a single shard.
    df.foreachPartition { iter =>
      var jdbcUrl = this.jdbcLink
      // if clusterInfo contains entries, we are in cluster mode
      if (this.clusterInfo != null && this.clusterInfo.size > 0) {
        //using random policy to select shard when insert data
        // core of the random policy: a random integer in [0, clusterInfo.size)
        val randomShard = (Math.random() * this.clusterInfo.size).asInstanceOf[Int]
        // use that random index to pick one shard
        val shardInfo = this.clusterInfo.get(randomShard)

        // take the host address from the shard entry; everything else still comes from the config file
        val host = shardInfo._4
        val port = getJDBCPort(this.jdbcLink)
        // the database name also comes from the config file, which is why it must be the local table's database
        val database = config.getString("database")

        // rebuild jdbcUrl; in practice only the host changes
        jdbcUrl = s"jdbc:clickhouse://$host:$port/$database"
        logInfo(s"cluster mode, select shard index [$randomShard] to insert data, the jdbc url is [$jdbcUrl].")
      } else {
        logInfo(s"single mode, the jdbc url is [$jdbcUrl].")
      }

      val executorBalanced = new BalancedClickhouseDataSource(jdbcUrl, this.properties)
      val executorConn = executorBalanced.getConnection.asInstanceOf[ClickHouseConnectionImpl]
      val statement = executorConn.createClickHousePreparedStatement(this.initSQL, ResultSet.TYPE_FORWARD_ONLY)
      var length = 0
      // add rows to the batch buffer
      while (iter.hasNext) {
        val row = iter.next()
        length += 1

        renderStatement(fields, row, dfFields, statement)
        statement.addBatch()

        // once the buffer reaches the threshold (20000 by default), write the batch to ClickHouse
        if (length >= bulkSize) {
          execute(statement, retry)
          length = 0
        }
      }

      execute(statement, retry)
    }
  }
}
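
To make the per-partition behaviour easier to see, here is a small Python simulation of the selection logic in process. It is only a sketch, not the plugin's Scala code: the function names assign_partitions and rows_per_shard are made up for illustration, and the five partitions of two rows each are an assumption that matches the test output above.

```python
import random


def assign_partitions(num_partitions, num_shards, rng):
    """Pick one shard index per partition.

    Mirrors (Math.random() * clusterInfo.size).asInstanceOf[Int]:
    a uniform float in [0, 1) scaled by the shard count and truncated.
    """
    return [int(rng.random() * num_shards) for _ in range(num_partitions)]


def rows_per_shard(partition_sizes, assignment, num_shards):
    """Sum the rows of every partition that landed on each shard."""
    totals = [0] * num_shards
    for size, shard in zip(partition_sizes, assignment):
        totals[shard] += size
    return totals


if __name__ == "__main__":
    rng = random.Random()              # unseeded, so every run can differ
    partition_sizes = [2, 2, 2, 2, 2]  # assumption: 10 rows spread evenly over 5 partitions
    assignment = assign_partitions(len(partition_sizes), 2, rng)
    print("shard chosen by each partition:", assignment)
    print("rows per shard:", rows_per_shard(partition_sizes, assignment, 2))
```

An assignment such as [0, 0, 0, 0, 1] gives 8 rows on one node and 2 on the other, which is the split observed in step 5. Rows never move individually; a whole partition always follows its single random draw.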

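7. Summary

When writing to a distributed table, the ClickHouse output plugin writes directly to the local tables, so there is no major performance problem.
Shards are allocated at random; the core code is (Math.random() * this.clusterInfo.size).asInstanceOf[Int]. More precisely, the choice is random per partition: with N partitions, each partition picks a shard once at random, and all rows of one partition always go to the same shard.
The random policy leads to two drawbacks:

1) Uneven data distribution. In my test with 50 million rows and 2 nodes, the deviation reached 16%.

2) There is no way to specify or predict which shard a row will go to, which makes later joins or group-by operations less efficient.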
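
The first drawback can be estimated with a little arithmetic. The Python sketch below assumes that all partitions hold the same number of rows and that each partition picks its shard independently and uniformly; under those assumptions the number of partitions landing on one shard follows a binomial distribution. The function names are hypothetical, and the numbers say nothing about how the 50-million-row test was partitioned.

```python
import math


def shard_share_stddev(num_partitions, num_shards):
    """Standard deviation of the fraction of partitions that one shard receives.

    Assumes equal-sized partitions and an independent, uniform shard choice
    per partition (binomial model with p = 1 / num_shards).
    """
    p = 1.0 / num_shards
    return math.sqrt(p * (1.0 - p) / num_partitions)


def prob_all_on_one_shard(num_partitions):
    """With exactly 2 shards: probability that every partition picks the same shard."""
    return 2 * 0.5 ** num_partitions


if __name__ == "__main__":
    for n in (5, 10, 100, 1000):
        print(n, "partitions -> stddev of one shard's share:", shard_share_stddev(n, 2))
    print("5 partitions, one shard left empty:", prob_all_on_one_shard(5))
```

Under this model the spread shrinks with the square root of the partition count, so repartitioning into more, smaller partitions evens out the distribution on average, but never guarantees it.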

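Note that version 2.x has no distributed-table write feature, possibly for the two reasons above.

8. Further thoughts

Since the random policy does not work well in practice, how can this problem be solved?

1. Modify the source code to add a hash policy that hashes on a specified field.

2. Without modifying the source code, how can local writes for a distributed table be implemented?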
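
As a rough illustration of the first idea, the following Python sketch routes each row to a shard by hashing one field. It is not the plugin's code and not a patch for it: shard_for_key and group_rows_by_shard are hypothetical names, and CRC32 is only an assumed, illustrative hash function.

```python
import zlib
from collections import defaultdict


def shard_for_key(key, num_shards):
    """Map a key to a shard index using a hash that is stable across processes.

    CRC32 is an illustrative choice; any deterministic hash would do.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_shards


def group_rows_by_shard(rows, key_field, num_shards):
    """Bucket rows by target shard so that each bucket can be written to one node."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[shard_for_key(row[key_field], num_shards)].append(row)
    return dict(buckets)


if __name__ == "__main__":
    rows = [
        {"id": "1", "user_name": "zhangsan"},
        {"id": "2", "user_name": "lisi"},
        {"id": "3", "user_name": "wangwu"},
    ]
    for shard, bucket in sorted(group_rows_by_shard(rows, "id", 2).items()):
        print(shard, [r["id"] for r in bucket])
```

Python's built-in hash() is salted per process for strings, which is why the sketch uses a fixed checksum instead. Because the plugin opens one connection per partition, a real implementation would also have to make every partition contain rows for a single shard only, or keep one batch per shard.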