[Real-time data warehouse] DWS layer: commodity theme calculation and region theme table (FlinkSQL)

1 DWS layer - Commodity Theme Calculation

1 Convert the JSON string data stream into a data stream of unified data objects

(1) Convert the order wide table stream data

// 4.6 Convert the order wide table stream data
SingleOutputStreamOperator<ProductStats> orderWideStatsDS = orderWideStrDS.map(
        new MapFunction<String, ProductStats>() {
            @Override
            public ProductStats map(String jsonStr) throws Exception {
                OrderWide orderWide = JSON.parseObject(jsonStr, OrderWide.class);
                ProductStats productStats = ProductStats.builder()
                        .sku_id(orderWide.getSku_id())
                        .order_sku_num(orderWide.getSku_num())
                        .order_amount(orderWide.getSplit_total_amount())
                        .ts(DateTimeUtil.toTs(orderWide.getCreate_time()))
                        .orderIdSet(new HashSet<>(Collections.singleton(orderWide.getOrder_id())))
                        .build();
                return productStats;
            }
        }
);
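DateTimeUtil.toTs (and toYMDHMS, used further below) is the project's date helper from an earlier part of this series and is not shown here. A minimal sketch of what it might look like, assuming the 'yyyy-MM-dd HH:mm:ss' pattern and the thread-safe java.time API:

import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.Date;

public class DateTimeUtil {
    private static final DateTimeFormatter FORMATTER =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // "yyyy-MM-dd HH:mm:ss" string -> epoch milliseconds
    public static Long toTs(String dateStr) {
        LocalDateTime localDateTime = LocalDateTime.parse(dateStr, FORMATTER);
        return localDateTime.atZone(ZoneId.systemDefault()).toInstant().toEpochMilli();
    }

    // Date -> "yyyy-MM-dd HH:mm:ss" string
    public static String toYMDHMS(Date date) {
        LocalDateTime localDateTime =
                LocalDateTime.ofInstant(date.toInstant(), ZoneId.systemDefault());
        return FORMATTER.format(localDateTime);
    }
}

A DateTimeFormatter is used instead of SimpleDateFormat because it is thread-safe and can be shared across Flink subtasks.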

(2) Convert the payment wide table stream data

// 4.7 Convert the payment wide table stream data
SingleOutputStreamOperator<ProductStats> paymentWideStatsDS = paymentWideStrDS.map(
        new MapFunction<String, ProductStats>() {
            @Override
            public ProductStats map(String jsonStr) throws Exception {
                PaymentWide paymentWide = JSON.parseObject(jsonStr, PaymentWide.class);
                ProductStats productStats = ProductStats.builder()
                        .sku_id(paymentWide.getSku_id())
                        .payment_amount(paymentWide.getSplit_total_amount())
                        .paidOrderIdSet(new HashSet<>(Collections.singleton(paymentWide.getOrder_id())))
                        .ts(DateTimeUtil.toTs(paymentWide.getCallback_time()))
                        .build();
                return productStats;
            }
        }
);

2 Merge unified data structure streams into one stream

(1) Code

// TODO 5 Merge the data from the different streams with union
DataStream<ProductStats> unionDS = clickAndDisplayStatsDS.union(
        favorStatsDS,
        cartStatsDS,
        refundStatsDS,
        commentStatsDS,
        orderWideStatsDS,
        paymentWideStatsDS
);

unionDS.print(">>>");

(2) Test

  • Start ZK, Kafka, logger.sh, ClickHouse, Redis, HDFS, HBase, Maxwell

    redis-server /home/hzy/redis2022.conf
    sudo systemctl start clickhouse-server
    
  • Run BaseLogApp

  • Run BaseDBApp

  • Run OrderWideApp

  • Run PaymentWideApp

  • Run ProductStatsApp

  • Run the jar package in the rt_applog directory (this produces the exposure count display_ct and the click count click_ct)

  • Run the jar package in the rt_dblog directory (this produces sku_id, cart_ct, favor_ct, order count, order amount, orderIdSet, etc.)

  • View the console output (if your machine is not powerful enough, test the log line and the business line separately)

3 Set the event time and watermark

// TODO 6 Specify the watermark and extract the event-time field
SingleOutputStreamOperator<ProductStats> productStatsWithWatermarkDS = unionDS.assignTimestampsAndWatermarks(
        WatermarkStrategy.<ProductStats>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                .withTimestampAssigner(
                        new SerializableTimestampAssigner<ProductStats>() {
                            @Override
                            public long extractTimestamp(ProductStats productStats, long recordTimestamp) {
                                return productStats.getTs();
                            }
                        }
                )
);

4 Grouping, windowing, aggregation

// TODO 7 Group by product id (sku_id)
KeyedStream<ProductStats, Long> keyedDS = productStatsWithWatermarkDS.keyBy(ProductStats::getSku_id);

// TODO 8 Open a 10-second tumbling event-time window
WindowedStream<ProductStats, Long, TimeWindow> windowDS = keyedDS.window(TumblingEventTimeWindows.of(Time.seconds(10)));

// TODO 9 Aggregate
SingleOutputStreamOperator<ProductStats> reduceDS = windowDS.reduce(
        new ReduceFunction<ProductStats>() {
            @Override
            public ProductStats reduce(ProductStats stats1, ProductStats stats2) throws Exception {
                stats1.setDisplay_ct(stats1.getDisplay_ct() + stats2.getDisplay_ct());
                stats1.setClick_ct(stats1.getClick_ct() + stats2.getClick_ct());
                stats1.setCart_ct(stats1.getCart_ct() + stats2.getCart_ct());
                stats1.setFavor_ct(stats1.getFavor_ct() + stats2.getFavor_ct());
                stats1.setOrder_amount(stats1.getOrder_amount().add(stats2.getOrder_amount()));

                stats1.getOrderIdSet().addAll(stats2.getOrderIdSet());
                stats1.setOrder_ct(stats1.getOrderIdSet().size() + 0L);

                stats1.setOrder_sku_num(stats1.getOrder_sku_num() + stats2.getOrder_sku_num());
                stats1.setPayment_amount(stats1.getPayment_amount().add(stats2.getPayment_amount()));

                stats1.getRefundOrderIdSet().addAll(stats2.getRefundOrderIdSet());
                stats1.setRefund_order_ct(stats1.getRefundOrderIdSet().size() + 0L);
                stats1.setRefund_amount(stats1.getRefund_amount().add(stats2.getRefund_amount()));

                stats1.getPaidOrderIdSet().addAll(stats2.getPaidOrderIdSet());
                stats1.setPaid_order_ct(stats1.getPaidOrderIdSet().size() + 0L);

                stats1.setComment_ct(stats1.getComment_ct() + stats2.getComment_ct());
                stats1.setGood_comment_ct(stats1.getGood_comment_ct() + stats2.getGood_comment_ct());

                return stats1;
            }
        },
        new ProcessWindowFunction<ProductStats, ProductStats, Long, TimeWindow>() {
            @Override
            public void process(Long skuId, Context context, Iterable<ProductStats> elements, Collector<ProductStats> out) throws Exception {
                for (ProductStats productStats : elements) {
                    productStats.setStt(DateTimeUtil.toYMDHMS(new Date(context.window().getStart())));
                    productStats.setEdt(DateTimeUtil.toYMDHMS(new Date(context.window().getEnd())));
                    productStats.setTs(new Date().getTime());
                    out.collect(productStats);
                }
            }
        }
);

5 Supplement product dimension information

Apart from the order operation, the other operations only carry the product id and no other dimension information, so the dimension attributes (SKU, SPU, category, trademark) are looked up and attached here.
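DimAsyncFunction, used in the four association steps below, is the project's generic async dimension-lookup function defined earlier in the series. A minimal sketch of what such a class might look like, assuming Flink's async I/O API and a hypothetical DimUtil.getDimInfo helper that returns one dimension row as a JSONObject (for example from HBase/Phoenix, possibly cached in Redis):

import com.alibaba.fastjson.JSONObject;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.util.Collections;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public abstract class DimAsyncFunction<T> extends RichAsyncFunction<T, T> {
    private final String tableName;        // dimension table to query, e.g. "DIM_SKU_INFO"
    private transient ExecutorService executor;

    public DimAsyncFunction(String tableName) {
        this.tableName = tableName;
    }

    @Override
    public void open(Configuration parameters) {
        executor = Executors.newFixedThreadPool(8);
    }

    // Which key to look the dimension row up by.
    public abstract String getKey(T obj);

    // How to copy the dimension attributes onto the stream element.
    public abstract void join(T obj, JSONObject dimJsonObj) throws Exception;

    @Override
    public void asyncInvoke(T obj, ResultFuture<T> resultFuture) {
        executor.submit(() -> {
            try {
                // DimUtil.getDimInfo is a hypothetical helper that reads one dimension row.
                JSONObject dimJsonObj = DimUtil.getDimInfo(tableName, getKey(obj));
                if (dimJsonObj != null) {
                    join(obj, dimJsonObj);
                }
                resultFuture.complete(Collections.singleton(obj));
            } catch (Exception e) {
                resultFuture.completeExceptionally(e);
            }
        });
    }
}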

(1) Associated commodity dimension

SingleOutputStreamOperator<ProductStats> productStatsWithSkuDS = AsyncDataStream.unorderedWait(
        reduceDS,
        new DimAsyncFunction<ProductStats>("DIM_SKU_INFO") {
            @Override
            public void join(ProductStats productStats, JSONObject dimJsonObj) throws Exception {
                productStats.setSku_name(dimJsonObj.getString("SKU_NAME"));
                productStats.setSku_price(dimJsonObj.getBigDecimal("PRICE"));
                productStats.setCategory3_id(dimJsonObj.getLong("CATEGORY3_ID"));
                productStats.setSpu_id(dimJsonObj.getLong("SPU_ID"));
                productStats.setTm_id(dimJsonObj.getLong("TM_ID"));
            }

            @Override
            public String getKey(ProductStats productStats) {
                return productStats.getSku_id().toString();
            }
        },
        60, TimeUnit.SECONDS
);

(2) Associated SPU dimensions

SingleOutputStreamOperator<ProductStats> productStatsWithSpuDS =
        AsyncDataStream.unorderedWait(productStatsWithSkuDS,
                new DimAsyncFunction<ProductStats>("DIM_SPU_INFO") {
                    @Override
                    public void join(ProductStats productStats, JSONObject jsonObject) throws Exception {
                        productStats.setSpu_name(jsonObject.getString("SPU_NAME"));
                    }

                    @Override
                    public String getKey(ProductStats productStats) {
                        return String.valueOf(productStats.getSpu_id());
                    }
                }, 60, TimeUnit.SECONDS
        );

(3) Related category dimension

SingleOutputStreamOperator<ProductStats> productStatsWithCategory3DS =
        AsyncDataStream.unorderedWait(productStatsWithSpuDS,
                new DimAsyncFunction<ProductStats>("DIM_BASE_CATEGORY3") {
                    @Override
                    public void join(ProductStats productStats, JSONObject jsonObject) throws Exception {
                        productStats.setCategory3_name(jsonObject.getString("NAME"));
                    }

                    @Override
                    public String getKey(ProductStats productStats) {
                        return String.valueOf(productStats.getCategory3_id());
                    }
                }, 60, TimeUnit.SECONDS
        );

(4) Associated brand dimension

SingleOutputStreamOperator<ProductStats> productStatsWithTmDS =
        AsyncDataStream.unorderedWait(productStatsWithCategory3DS,
                new DimAsyncFunction<ProductStats>("DIM_BASE_TRADEMARK") {
                    @Override
                    public void join(ProductStats productStats, JSONObject jsonObject) throws Exception {
                        productStats.setTm_name(jsonObject.getString("TM_NAME"));
                    }

                    @Override
                    public String getKey(ProductStats productStats) {
                        return String.valueOf(productStats.getTm_id());
                    }
                }, 60, TimeUnit.SECONDS
        );

productStatsWithTmDS.print(">>>");

(5) Test

  • Run the jar package in the rt_applog directory.

  • Run the jar package in the rt_dblog directory; run it twice so that the watermark advances and the window fires.

6 Write to ClickHouse

(1) Create a commodity theme wide table in ClickHouse

create table product_stats_2022 (
   stt DateTime,
   edt DateTime,
   sku_id UInt64,
   sku_name String,
   sku_price Decimal64(2),
   spu_id UInt64,
   spu_name String,
   tm_id UInt64,
   tm_name String,
   category3_id UInt64,
   category3_name String,
   display_ct UInt64,
   click_ct UInt64,
   favor_ct UInt64,
   cart_ct UInt64,
   order_sku_num UInt64,
   order_amount Decimal64(2),
   order_ct UInt64,
   payment_amount Decimal64(2),
   paid_order_ct UInt64,
   refund_order_ct UInt64,
   refund_amount Decimal64(2),
   comment_ct UInt64,
   good_comment_ct UInt64,
   ts UInt64
) engine = ReplacingMergeTree(ts)
partition by toYYYYMMDD(stt)
order by (stt, edt, sku_id);

(2) Add a Sink written to ClickHouse for the main program

// TODO 11 Write the results to ClickHouse
productStatsWithTmDS.addSink(
        ClickhouseUtil.getJdbcSink(
                "insert into product_stats_2022 values(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)")
);
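ClickhouseUtil.getJdbcSink is the project's JDBC sink helper defined in an earlier part of this series and not shown here. A minimal sketch of such a helper, built on Flink's flink-connector-jdbc, might look like the following; the connection URL, driver class and batch size are assumptions, and the sketch assumes every declared field of the POJO maps, in declaration order, to one '?' of the insert statement (a real implementation would additionally skip fields that are not table columns, such as the orderIdSet fields):

import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.connector.jdbc.JdbcStatementBuilder;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

import java.lang.reflect.Field;

public class ClickhouseUtil {
    public static <T> SinkFunction<T> getJdbcSink(String sql) {
        return JdbcSink.sink(
                sql,
                (JdbcStatementBuilder<T>) (ps, obj) -> {
                    // Bind every declared field of the POJO to the next '?' in declaration order.
                    Field[] fields = obj.getClass().getDeclaredFields();
                    try {
                        for (int i = 0; i < fields.length; i++) {
                            fields[i].setAccessible(true);
                            ps.setObject(i + 1, fields[i].get(obj));
                        }
                    } catch (IllegalAccessException e) {
                        throw new RuntimeException(e);
                    }
                },
                JdbcExecutionOptions.builder().withBatchSize(5).build(),
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withUrl("jdbc:clickhouse://hadoop102:8123/default")     // assumed host/port/database
                        .withDriverName("ru.yandex.clickhouse.ClickHouseDriver") // assumed driver class
                        .build()
        );
    }
}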

(3) Overall test

  • Start ZK, Kafka, logger.sh, ClickHouse, Redis, HDFS, HBase, Maxwell
  • Run BaseLogApp
  • Run BaseDBApp
  • Run OrderWideApp
  • Run PaymentWideApp
  • Run ProductStatsApp
  • Run the jar package in the rt_applog directory
  • Run the jar package in the rt_dblog directory
  • View the console output
  • Check the data in the product_stats_2022 table in ClickHouse

Note: Be sure to match the dates of the two data generation simulators, otherwise the windows will not match.

2 DWS layer - Region Theme Table (FlinkSQL)

Statistics topic   Demand indicator          Output method               Calculation source                     Source layer
area               pv                        multidimensional analysis   page_log, directly available           dwd
area               uv                        multidimensional analysis   page_log, filtered and deduplicated    dwm
area               orders (count, amount)    visual dashboard            order wide table                       dwm

The region theme mainly reflects sales in each region. In terms of business logic it is simpler than the commodity theme: there is nothing special to do beyond a light aggregation whose result is then persisted, so FlinkSQL is used to implement it.

1 Demand Analysis and Ideas

  • Define the Table flow environment
  • Define the data source as a dynamic table
  • Query the result table through SQL
  • Convert the result table into a data stream
  • Write the data stream to the target database

If the target database is one that Flink officially supports, you could also define the target table as a dynamic table and write to it with INSERT INTO. However, the Flink JDBC connector currently has no official ClickHouse dialect (it officially supports MySQL, PostgreSQL and Derby). You can implement a custom sink for connectors that are not officially supported, but that is more cumbersome.
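As a rough sketch of the last two steps (convert the result table to a stream, then write it out) under these constraints, reusing the ClickhouseUtil.getJdbcSink helper from the commodity theme above; the ProvinceStats POJO, the provinceStatsSql variable and the province_stats_2022 table with its column count are illustrative assumptions, not taken from the source:

// Sketch only: turn the SQL result table into a DataStream and reuse the JDBC sink helper.
Table resultTable = tableEnv.sqlQuery(provinceStatsSql);            // windowed aggregation SQL
DataStream<ProvinceStats> provinceStatsDS =
        tableEnv.toAppendStream(resultTable, ProvinceStats.class);  // append-only: one row per closed window
provinceStatsDS.addSink(
        ClickhouseUtil.getJdbcSink("insert into province_stats_2022 values(?,?,?,?,?,?,?,?,?,?)"));

A tumbling-window group-by produces append-only results, which is why toAppendStream is sufficient here.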

2 Add FlinkSQL-related dependencies to the pom.xml file

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-api-java-bridge_${scala.version}</artifactId>
    <version>${flink.version}</version>
</dependency>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-planner-blink_${scala.version}</artifactId>
    <version>${flink.version}</version>
</dependency>

3 Create ProvinceStatsSqlApp and define the Table execution environment

package com.hzy.gmall.realtime.app.dws;
/**
 * Region theme statistics -- SQL
 */
public class ProvinceStatsSqlApp {
    public static void main(String[] args) throws Exception {
        // TODO 1 Prepare the environment
        // 1.1 Stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 1.2 Table execution environment
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
        // 1.3 Set the parallelism
        env.setParallelism(4);

        // TODO 2 Checkpoint settings (omitted)

        env.execute();
    }
}
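For reference, the checkpoint settings omitted above are typically configured along the following lines; the interval, timeout, state-backend path and user name below are illustrative assumptions, not taken from the source:

// Illustrative checkpoint configuration (values and paths are assumptions)
env.enableCheckpointing(5000L, CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setCheckpointTimeout(60000L);
env.setStateBackend(new FsStateBackend("hdfs://hadoop102:8020/gmall/checkpoint/province_stats"));
System.setProperty("HADOOP_USER_NAME", "hzy");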

4 Add a DDL helper method to MyKafkaUtil

public static String getKafkaDDL(String topic, String groupId) {
    String ddl = "'connector' = 'kafka'," +
            "  'topic' = '"+topic+"'," +
            "  'properties.bootstrap.servers' = '"+KAFKA_SERVER+"'," +
            "  'properties.group.id' = '"+groupId+"'," +
            "  'scan.startup.mode' = 'latest-offset'," +
            "  'format' = 'json'";
    return ddl;
}

5 Define the data source as a dynamic table and specify the watermark


(1) Specifying a WATERMARK

WATERMARK defines the event-time attribute of a table and takes the form WATERMARK FOR rowtime_column_name AS watermark_strategy_expression.

rowtime_column_name declares an existing column as the table's event-time attribute. The column must be of type TIMESTAMP(3) and be a top-level column in the schema; it may also be a computed column. This corresponds to extracting the timestamp field in the DataStream API.

watermark_strategy_expression defines the watermark generation strategy. It allows arbitrary non-query expressions, including computed columns, to calculate the watermark; the return type of the expression must be TIMESTAMP(3), representing the time elapsed since the Epoch. The framework evaluates the watermark expression for every record and periodically emits the largest watermark generated so far. A new watermark is emitted only if it is non-null and greater than the previously emitted local watermark (so that watermarks stay increasing); if the current watermark equals the previous one, is null, or is smaller than the last emitted watermark, it is not emitted. Watermarks are emitted at the interval configured by pipeline.auto-watermark-interval. If the interval is 0 ms, every record generates a watermark, which is emitted whenever it is non-null and greater than the last emitted one.

When using event-time semantics, the table must contain an event-time attribute and a watermark strategy.

Flink provides several commonly used watermark strategies.

  • Strictly ascending timestamps: WATERMARK FOR rowtime_column AS rowtime_column.

    Emits a watermark equal to the largest timestamp observed so far; rows with a timestamp greater than the largest are not considered late.

  • Ascending timestamps: WATERMARK FOR rowtime_column AS rowtime_column - INTERVAL '0.001' SECOND.

    Emits a watermark equal to the largest timestamp observed so far minus 1 millisecond; rows with a timestamp greater than or equal to the largest are not considered late. This is equivalent to the monotonically increasing strategy in the DataStream API.

  • Bounded out-of-order timestamps: WATERMARK FOR rowtime_column AS rowtime_column - INTERVAL 'string' timeUnit.

    Emits a watermark equal to the largest timestamp observed so far minus the specified delay; for example, WATERMARK FOR rowtime_column AS rowtime_column - INTERVAL '5' SECOND is a watermark strategy with a 5-second delay.

In the code below, the WATERMARK FOR rowtime AS ... clause is what designates the rowtime field as the event-time (EVENT_TIME) attribute of the table.

See the Flink SQL documentation for the detailed description.

(2) System built-in functions

Convert a string to a timestamp.

TO_TIMESTAMP(string1[, string2]) converts the date-time string string1, with format string2 ('yyyy-MM-dd HH:mm:ss' by default), to a timestamp under the session time zone (specified by TableConfig). Only supported in the blink planner.

See the Flink SQL documentation for the detailed description.

(3) Alias the computed column

<computed_column_definition>:
  column_name AS computed_column_expression [COMMENT column_comment]

(4) Complete code

The field names must exactly match the JSON attribute names in the Kafka messages.

// TODO 3 Read data from the specified source (Kafka), convert it into a dynamic table, and specify the watermark
String orderWideTopic = "dwm_order_wide";
String groupId = "province_stats";
tableEnv.executeSql("CREATE TABLE order_wide (" +
        " province_id BIGINT," +
        " province_name STRING," +
        " province_area_code STRING," +
        " province_iso_code STRING," +
        " province_3166_2_code STRING," +
        " order_id STRING," +
        " split_total_amount DOUBLE," +
        " create_time STRING," +
        " rowtime as TO_TIMESTAMP(create_time)," +
        " WATERMARK FOR rowtime AS rowtime - INTERVAL '3' SECOND" +
        " ) WITH (" + MyKafkaUtil.getKafkaDDL(orderWideTopic,groupId) +")");
