Hands-on with Iceberg - Writing to Iceberg Tables with Flink DataStream

In the previous posts of this series we built a data lake system consisting of the Trino query engine, Iceberg as the table format, Hive Metastore for metadata management, and S3 object storage. In this post we build on that foundation and use Flink to read data from Kafka in real time and write it into the lake, then use the Trino CLI to query the ingested data. The Flink job is submitted through the Flink Operator we set up earlier.

The overall architecture of the project is shown below:

[Project architecture diagram]

Standing on the shoulders of giants

I found the following repository on GitHub. It contains several demos, including one in which Flink reads from Kafka and writes into a data lake; we only need to adapt it to our own data lake setup: github.com/spancer/fli…

See the file FlinkWriteIcebergTest.java.

Adaptation

Catalog adaptation

In the original demo project, the catalog metadata is stored on the Hadoop HDFS file system:

    // iceberg catalog identification.
    Configuration conf = new Configuration();
    Catalog catalog = new HadoopCatalog(conf);

We store metadata in the Hive Metastore (HMS) instead, so this part needs to be adapted. See the "Iceberg metadata operations" section of zhuanlan.zhihu.com/p/419636349 for reference:

    Map<String, String> properties = new HashMap<>();
    properties.put("type", "iceberg");
    properties.put("clients", "5");
    properties.put("property-version", "1");
    properties.put("warehouse", "s3a://datalake/");
    properties.put("catalog-type", "hive");
    properties.put("uri", "thrift://xxx:9083");

    Configuration conf = new Configuration();
    conf.set("fs.s3a.connection.ssl.enabled", "false");
    conf.set("fs.s3a.endpoint", "http://xxx:9000");
    conf.set("fs.s3a.access.key", "minioadmin");
    conf.set("fs.s3a.secret.key", "minioadmin");
    conf.set("fs.s3a.path.style.access", "true");
    conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
    conf.set("fs.s3a.fast.upload", "true");

    String HIVE_CATALOG = "iceberg";
    CatalogLoader catalogLoader = CatalogLoader.hive(HIVE_CATALOG, conf, properties);
    Catalog catalog = catalogLoader.loadCatalog();
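One thing to watch out for: the Iceberg table can only be created inside a database (namespace) that already exists in HMS. If it does not exist yet, it can be created through the catalog itself; a minimal sketch, where the database name default is only an example:

// Create the target database (namespace) in HMS if it is missing; "default" is an example name.
SupportsNamespaces nsCatalog = (SupportsNamespaces) catalog;
if (!nsCatalog.namespaceExists(Namespace.of("default"))) {
    nsCatalog.createNamespace(Namespace.of("default"));
}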

The following dependencies also need to be added to the pom file (hadoop-aws provides the S3AFileSystem referenced in the configuration above):

<properties>
   <hive.version>3.1.2</hive.version>
   <hadoop.version>3.2.1</hadoop.version>
   <iceberg.version>0.12.0</iceberg.version>
</properties>

<dependency>
   <groupId>org.apache.hive</groupId>
   <artifactId>hive-metastore</artifactId>
   <version>${hive.version}</version>
</dependency>

<dependency>
   <groupId>org.apache.iceberg</groupId>
   <artifactId>iceberg-hive-runtime</artifactId>
   <version>${iceberg.version}</version>
</dependency>

<dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-aws</artifactId>
   <version>${hadoop.version}</version>
</dependency>


Kafka source adaptation

    String topic = "arkevent";
    String servers = "kafka:9092";
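The demo builds its Kafka consumer from these two values; roughly like the following sketch, where the property keys are standard Kafka settings and the group id iceberg-sink is just an illustrative placeholder:

// Standard Kafka consumer settings; the group id below is an illustrative placeholder.
Properties kafkaProps = new Properties();
kafkaProps.setProperty("bootstrap.servers", servers);
kafkaProps.setProperty("group.id", "iceberg-sink");

// Consume the topic as raw JSON strings; the map function in a later section parses them into RowData.
FlinkKafkaConsumer<String> consumer =
        new FlinkKafkaConsumer<>(topic, new SimpleStringSchema(), kafkaProps);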

Namespace and table name adaptation

A namespace corresponds to the database concept.

The first argument of TableIdentifier.of is the namespace, the second is the table name.

// iceberg table identification.
TableIdentifier name =
        TableIdentifier.of(StringUtils.isEmpty(args[0]) ? "default" : args[0], StringUtils.isEmpty(args[1]) ? "table_test" : args[1]);

I pass the database name as args[0] and the table name as args[1], which makes it easy to switch targets.

Schema adaptation

Adapt the schema to the data coming from Kafka:

// iceberg table schema identification.
Schema schema = new Schema(Types.NestedField.required(1, "id", Types.StringType.get()),
        Types.NestedField.required(2, "start_time", Types.TimestampType.withoutZone()),
        Types.NestedField.required(3, "end_time", Types.TimestampType.withoutZone()),
        Types.NestedField.required(4, "float_data1", Types.FloatType.get()),
        Types.NestedField.required(5, "float_data2", Types.FloatType.get()));
        

SingleOutputStreamOperator<RowData> dataStream =
        env.addSource(consumer).map(new MapFunction<String, RowData>() {
            @Override
            public RowData map(String value) throws Exception {
                GenericRowData row = new GenericRowData(5);
                try {
                    log.info("input data {}", value);
                    JSONObject dataJson = JSON.parseObject(value);
                    row.setField(0, StringData.fromBytes(dataJson.getString("id").getBytes()));
                    row.setField(1, TimestampData.fromEpochMillis(dataJson.getLong("start_time")));
                    row.setField(2, TimestampData.fromEpochMillis(dataJson.getLong("end_time")));
                    row.setField(3, dataJson.getFloatValue("float_data1"));
                    row.setField(4, dataJson.getFloatValue("float_data2"));
                } catch (Exception e) {
                    log.error("flink data exception ", e);
                }
                return row;
            }
        });
       
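The FlinkSink call shown later needs a table and a tableLoader built from the identifier, schema and catalog above. A minimal sketch of that step, assuming an unpartitioned table that is created on first run (the demo's own code may differ in the details):

// Create the table on the first run; afterwards just load it from the catalog.
if (!catalog.tableExists(name)) {
    catalog.createTable(name, schema, PartitionSpec.unpartitioned());
}
Table table = catalog.loadTable(name);

// The TableLoader is what the Flink sink uses to (re)load the table on the task managers.
TableLoader tableLoader = TableLoader.fromCatalog(catalogLoader, name);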

Kafka consumer offset configuration

Flink offers five ways to specify the starting offset when consuming from Kafka; see zhuanlan.zhihu.com/p/94592509 and the sketch after the snippet below.

We configure the consumer to start from the offsets last committed by the consumer group for the topic, so that after the ingestion job is restarted it picks up where the previous run left off.

consumer.setStartFromGroupOffsets();
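For reference, the five start-position options on FlinkKafkaConsumer look like this; the timestamp value and the offsets map in the last two are placeholders, and in practice exactly one of these calls is used:

consumer.setStartFromGroupOffsets();            // resume from the group's committed offsets (needs group.id) -- what we use
consumer.setStartFromEarliest();                // read the topic from the beginning
consumer.setStartFromLatest();                  // only read records produced after the job starts
consumer.setStartFromTimestamp(1650000000000L); // start from a given epoch-millis timestamp (illustrative value)
// consumer.setStartFromSpecificOffsets(offsets);  // start from an explicit Map<KafkaTopicPartition, Long> of offsets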

Checkpoint configuration

While debugging, I checked the ingested data in the MinIO web UI and found that metadata was written but there were no data files. The JobManager and TaskManager logs showed no exceptions at all. I suspected a problem reading from Kafka, so I added a log statement at the location below and confirmed that the program was reading records from Kafka correctly.

SingleOutputStreamOperator<RowData> dataStream =
        env.addSource(consumer).map(new MapFunction<String, RowData>() {
            @Override
            public RowData map(String value) throws Exception {
                GenericRowData row = new GenericRowData(5);
                try {
                    log.info("input data {}", value);

So the problem had to be in the write path, but why were there no error logs? I was stuck on this for two days. Eventually I noticed that checkpointing was not configured anywhere. As mentioned in the previous post, Hands-on with Iceberg - Reading and Writing Iceberg Tables with Flink SQL, checkpointing must be configured whether you use Flink SQL or the Flink DataStream API: the Iceberg Flink sink only commits data files to the table when a checkpoint completes, so without checkpoints the metadata exists but no data ever becomes visible. Once I configured checkpointing, the problem above was solved.

Flink has four deployment modes; see: www.cnblogs.com/tencent-clo…

[Flink deployment modes diagram]

Since I submit a FlinkCluster custom resource through the flink-operator, which corresponds to the NativePerJobCluster deployment mode, I checked the FlinkCluster custom resource definition and found that it has no configuration item for checkpoints: github.com/GoogleCloud…

So the only option was to configure it in code. I followed this article: cloud.tencent.com/developer/a…

// start a checkpoint every 1000 ms
env.enableCheckpointing(1000);

// advanced options:
// set mode to exactly-once (this is the default)
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// checkpoints have to complete within one minute, or are discarded
env.getCheckpointConfig().setCheckpointTimeout(60000);
// make sure 500 ms of progress happen between checkpoints
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// enable externalized checkpoints which are retained after job cancellation
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
// fail the task if an error occurs during its checkpoint procedure
env.getCheckpointConfig().setFailOnCheckpointingErrors(true);

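Depending on how the cluster is configured, Flink may also need to be told where to persist the checkpoint data itself (state.checkpoints.dir or a state backend). If that is not already set in flink-conf, it can be set in code as well; a minimal sketch, where the path is only an example and must be visible to both the JobManager and the TaskManagers (for instance the shared /cache volume from the FlinkCluster manifest below, or an s3a:// location):

// Example only: persist checkpoint data on a location shared by all Flink pods.
env.setStateBackend(new FsStateBackend("file:///cache/flink-checkpoints"));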

overwrite

Then I ran into another problem: in the MinIO web UI I could see data files being written correctly, but a Trino query returned no rows at all.

./trino --server xxx:8080 --catalog iceberg
 select * from default.table16;

Later I read in an article that for an unpartitioned Iceberg table, INSERT OVERWRITE completely replaces the table's data. Looking at the code, I found that overwrite was indeed enabled:

// sink data to iceberg table
FlinkSink.forRowData(dataStream)
        .table(table)
        .tableLoader(tableLoader)
        .writeParallelism(1)
        .overwrite(true)
        .build();

overwrite means exactly what it says: each commit replaces the existing data!

Why would a streaming ingestion job need overwrite? My scenario has no use for it. The official documentation also covers overwrite; see: iceberg.apache.org/#flink/#ove…
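With the flag removed, the sink becomes a plain append-only streaming sink (the same call as above, minus overwrite):

// Append-only streaming sink: every completed checkpoint commits a new snapshot
// instead of replacing the table contents.
FlinkSink.forRowData(dataStream)
        .table(table)
        .tableLoader(tableLoader)
        .writeParallelism(1)
        .build();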

After removing overwrite, Trino could query the data normally.

[Screenshot: Trino query results]

FlinkCluster manifest

Configure the FlinkCluster yaml following the official documentation: github.com/GoogleCloud…

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: flink-iceberg-test
  namespace: flink
spec:
  entrypoint: flink-cluster
  serviceAccountName: default
  templates:
  - name: flink-cluster
    resource: 
      action: create 
      successCondition: 'status.components.job.state in (Succeeded)'
      failureCondition: 'status.components.job.state in (Failed, Submission_Failed, Unknow)'
      manifest: | 
        apiVersion: flinkoperator.k8s.io/v1beta1
        kind: FlinkCluster
        metadata:
          name: flinkwriteicebergtest
        spec:
          image:
            name: flink:1.11.3
          jobManager:
            ports:
              ui: 8081
            resources:
              limits:
                memory: "1024Mi"
                cpu: "200m"
            volumeMounts:
              - mountPath: /cache
                name: cache
              - mountPath: /opt/flink/log
                name: cache
            volumes:
              - name: cache
                persistentVolumeClaim:
                  claimName: flink-pv-claim
          taskManager:
            replicas: 2
            resources:
              limits:
                memory: "1024Mi"
                cpu: "200m"
            volumeMounts:
              - mountPath: /cache
                name: cache
              - mountPath: /opt/flink/log
                name: cache
            volumes:
              - name: cache
                persistentVolumeClaim:
                  claimName: flink-pv-claim
          job:
            jarFile: /cache/flink-app.jar
            className: ....iceberg.FlinkWriteIcebergTest
            args: ["default","table16","--input", "./NOTICE"]
            parallelism: 2
            autoSavepointSeconds: 30
            savepointsDir: /cache/savepoints
            restartPolicy: FromSavepointOnFailure
            initContainers:
              - name: downloader
                image: curlimages/curl
                env:
                  - name: JAR_URL
                    value: xxx.jar
                  - name: DEST_PATH
                    value: /cache/flink-app.jar
                command: ['sh', '-c', 'curl -o ${DEST_PATH} ${JAR_URL}']             
            volumeMounts:
              - mountPath: /cache
                name: cache
              - mountPath: /opt/flink/log
                name: cache
            volumes:
              - name: cache
                persistentVolumeClaim:
                  claimName: flink-pv-claim
          flinkProperties:
            taskmanager.numberOfTaskSlots: "1"
            taskmanager.memory.flink.size: "1024mb"

The yaml above defines one JobManager and two TaskManagers, which together form a Flink cluster, plus a job resource. For the full set of resource fields, see the official documentation: github.com/GoogleCloud…

The job is then submitted through the Argo Workflows web UI.

Going further

You can also write into the lake using SQL from code; see: github.com/zhangjun0x0…