Flink Real-Time Data Warehouse 03: DWS Layer Construction

DWS Layer

Design points:

(1) The design of the DWS layer is guided by the metric (indicator) system.

(2) DWS table names follow the convention dws_{data domain}_{statistical granularity}_{business process}_{statistical period (window)}, for example dws_traffic_source_keyword_page_view_window.

Note: window indicates the time range covered by the window.

Traffic Domain: Source-Keyword Granularity Page View Summary Table by Window (Flink SQL, ※)

Main task

Read data from the Kafka page view detail topic, filter out search behavior, and use a custom UDTF (one row in, multiple rows out) function to split the search content into words. Count how often each keyword appears in each window and write the results to ClickHouse.

Approach

This program is implemented with Flink SQL. Word segmentation is a one-to-many operation, which calls for a UDTF. Flink SQL has no suitable built-in function, so a custom UDTF is needed.

The logic of the custom function is implemented in code. Word segmentation needs additional dependencies; the IK tokenizer is used here to do the splitting.

Finally, writing data into ClickHouse requires further dependencies plus a ClickHouse utility class and method. The work in this section therefore falls into two parts: word segmentation and data writing.

1) word segmentation processing

Word segmentation processing is divided into eight steps, as follows:

(1) Create a word segmentation tool class

Define a word segmentation method that uses the IK tokenizer to split the input keyword into individual words and returns them as a List.

(2) Create a custom function class

Extend Flink's TableFunction class and call the segmentation method of the tool class to implement the splitting logic.

(3) Register the function

(4) Read data from the Kafka page view detail topic and assign watermarks

(5) Filter search behavior

The search behavior data meets the following three conditions:

  • The item field under the page field is not null;
  • The last_page_id under the page field is search;
  • The item_type under the page field is keyword.

(6) Split the search content into words (word segmentation)

(7) Group, window, and aggregate

Group by the split keywords, count the occurrences of each word, and add the window start time, window end time, and keyword source (source) fields. Call UNIX_TIMESTAMP() to get the current system timestamp in seconds, multiply it by 1000 to convert it to milliseconds, and use it as the version field of the ClickHouse table for deduplication.

(8) Convert the dynamic table into a stream

2) Write data to ClickHouse

(1) Create a table

To write data into ClickHouse, a table must be created first, and the first decision is which table engine to use. To keep the data free of duplicates, either ReplacingMergeTree (replacing merge tree) or ReplicatedMergeTree (replicated merge tree) can be used; both support deduplication, with the following differences:

  • ReplicatedMergeTree (replicas) deduplicates by comparing inserted "data blocks" (the data written in one batch). If two inserted batches are similar enough to meet ClickHouse's duplicate-block criteria, the later insert is discarded. Replicas exist to prevent data loss, not to deduplicate: if duplicate rows are spread across different data blocks, they are not removed. Suppose data is written to ClickHouse in batches of five, the first batch being ABCDE and the second FAGHI; as long as ClickHouse does not judge the two blocks to be duplicates, the duplicate A is still written.

  • ReplacingMergeTree requires a version field to be declared when the table is created. For rows with the same sort key (the sort key uniquely identifies a row in ClickHouse), it compares the version field: if the version field is set and its values differ, the row with the largest version value is kept; if the version field is not set or the values are equal, the last row in insertion order is kept. Deduplication happens only when parts are merged. Merges run in the background at an indeterminate time and cannot be scheduled, so there is no guarantee that the data is duplicate-free at any given moment. Running optimize table xxx final merges the partitions manually.

    ReplacingMergeTree is chosen here: although deduplication is delayed, it can be forced with optimize when necessary. That command, however, triggers a large amount of reading and writing, is very heavy for ClickHouse, and hurts performance badly. In production it is impossible to run a merge before every query, so do not rely too heavily on optimize for deduplication.
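If a manual merge really is needed, for example to verify deduplication while developing, the statement can be issued over JDBC like any other. A minimal sketch, assuming the ClickHouse driver and URL defined in GmallConfig later in this section and the table created below (the class name ManualOptimizeSketch is illustrative):

public class ManualOptimizeSketch {
    public static void main(String[] args) throws Exception {
        // Register the ClickHouse JDBC driver (same artifact as in the dependency list below)
        Class.forName("ru.yandex.clickhouse.ClickHouseDriver");
        try (java.sql.Connection conn =
                     java.sql.DriverManager.getConnection("jdbc:clickhouse://hadoop102:8123/gmall_211126");
             java.sql.Statement stmt = conn.createStatement()) {
            // Forces the parts of the table to merge so ReplacingMergeTree deduplication takes effect now.
            // Expensive for ClickHouse; avoid running it routinely in production.
            stmt.execute("optimize table dws_traffic_source_keyword_page_view_window final");
        }
    }
}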

(2) Writing method

Call Flink's JdbcSink.<T>sink(String sql, JdbcStatementBuilder<T> statementBuilder, JdbcExecutionOptions executionOptions, JdbcConnectionOptions connectionOptions) method to create a JDBC sink. It returns a SinkFunction object, which is passed to the stream's addSink() method so that the data is written to the database over JDBC. This method can only write to a single table. Its parameters are interpreted as follows.

  • sql: any DML statement.

  • statementBuilder: a JdbcStatementBuilder object used to fill in the placeholders of the database operation object (PreparedStatement). Its core method is accept(PreparedStatement preparedStatement, T obj), whose parameters are interpreted as follows.

    • preparedStatement: database operation object.

    • obj: the data object from the stream. To fill the placeholders, the placeholders in the SQL must be matched with the data in the stream. Different SQL statements, however, contain different numbers of placeholders, so the parameters cannot be bound with a loop of some fixed length. How, then, can the placeholders be associated with the data in the stream? Use the stream object (obj) passed into the method to obtain its Class object, obtain the Field objects of all its attributes via reflection, and for each field call preparedStatement.setObject() with the field's value to fill the corresponding placeholder.

    • T: Generic, specifying the data type in the stream.

  • executionOptions: SQL DML statements are executed in batches; this parameter configures how the batches are executed. Its API is as follows.

    • withBatchIntervalMs(long intervalMs): sets the batch interval in milliseconds. The default is 0, which means batching is not triggered by time.

    • withBatchSize(int size): sets the batch size (number of records); the default is 5000.

    • withMaxRetries(int maxRetries): sets the maximum number of retries; the default is 3.

    • A batch is flushed when any one of the following conditions is met:

      a. the time interval set by withBatchIntervalMs has passed since the last data insertion

      b. the amount of buffered data reaches the batch size

      c. a Flink checkpoint starts

  • connectionOptions: sets the database connection parameters.

    • withUrl: the database URL
    • withDriverName: the database driver name
    • withUsername: the username for connecting to the database
    • withPassword: the password for connecting to the database
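Putting the four parameters together, here is a minimal sketch that writes a hypothetical bean Demo into a hypothetical table t_demo using explicit setter calls; the reflection-based, annotation-aware version actually used in this chapter appears later as MyClickHouseUtil:

import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

public class JdbcSinkSketch {

    // Hypothetical bean matching a hypothetical table t_demo(word String, cnt UInt64)
    public static class Demo {
        public String word;
        public Long cnt;
    }

    public static SinkFunction<Demo> build() {
        return JdbcSink.<Demo>sink(
                // sql: a DML statement with one placeholder per column
                "insert into t_demo values(?,?)",
                // statementBuilder: fills the PreparedStatement placeholders from the stream element
                (preparedStatement, demo) -> {
                    preparedStatement.setString(1, demo.word);
                    preparedStatement.setLong(2, demo.cnt);
                },
                // executionOptions: flush every 5 records or every second, retry up to 3 times
                JdbcExecutionOptions.builder()
                        .withBatchSize(5)
                        .withBatchIntervalMs(1000L)
                        .withMaxRetries(3)
                        .build(),
                // connectionOptions: driver and URL (values taken from GmallConfig elsewhere in this chapter)
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withDriverName("ru.yandex.clickhouse.ClickHouseDriver")
                        .withUrl("jdbc:clickhouse://hadoop102:8123/gmall_211126")
                        .build());
    }
}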

(3) TransientSink

Some fields of the entity class exist only to assist metric computation and should not be written to the database. How, then, do we tell the program which fields to skip? Java reflection provides a solution: a class's Field object can call getAnnotation(Class annotationClass) to read an annotation placed above the field's declaration; if the annotation is present, the return value is not null.

Define an annotation that can be placed on fields, and add it above the declaration of every entity-class field that should not be written to the database. When binding parameters for the database operation object, check whether the annotation is present; if it is, skip the field, thereby excluding it from the write.
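The annotation class itself is not listed in this section; a minimal definition consistent with the usage described above (field target, runtime retention so that getAnnotation() can see it via reflection) could look like this:

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Marker annotation: fields carrying it are skipped when writing to ClickHouse
@Target(ElementType.FIELD)
@Retention(RetentionPolicy.RUNTIME)
public @interface TransientSink {
}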

Diagram

ClickHouse table creation statement

drop table if exists dws_traffic_source_keyword_page_view_window;
create table if not exists dws_traffic_source_keyword_page_view_window
(
    stt           DateTime,
    edt           DateTime,
    source        String,
    keyword       String,
    keyword_count UInt64,
    ts            UInt64
) engine = ReplacingMergeTree(ts)
      partition by toYYYYMMDD(stt)
      order by (stt, edt, source, keyword);

Code

1) IK tokenizer and ClickHouse dependencies

<dependency>
    <groupId>com.janeluo</groupId>
    <artifactId>ikanalyzer</artifactId>
    <version>2012_u6</version>
</dependency>
<dependency>
    <groupId>ru.yandex.clickhouse</groupId>
    <artifactId>clickhouse-jdbc</artifactId>
    <version>0.3.0</version>
    <exclusions>
        <exclusion>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
        </exclusion>
        <exclusion>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-core</artifactId>
        </exclusion>
    </exclusions>
</dependency>

2) IK word segmentation tool class KeywordUtil

public class KeywordUtil {

    public static List<String> splitKeyword(String keyword) throws IOException {

        // Collection that holds the split words
        ArrayList<String> list = new ArrayList<>();

        // Create the IK segmenter (useSmart = false -> ik_max_word granularity, true -> ik_smart)
        StringReader reader = new StringReader(keyword);
        IKSegmenter ikSegmenter = new IKSegmenter(reader, false);

        // Loop to fetch the segmented words
        Lexeme next = ikSegmenter.next();
        while (next != null) {
            String word = next.getLexemeText();
            list.add(word);

            next = ikSegmenter.next();
        }

        // Return the collection
        return list;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(splitKeyword("Flink实时数仓"));
    }

}

3) Flink SQL user-defined function SplitFunction

@FunctionHint(output = @DataTypeHint("ROW<word STRING>"))
public class SplitFunction extends TableFunction<Row> {

    public void eval(String str) {

        //        for (String s : str.split(" ")) {
        //            collect(Row.of(s, s.length()));
        //        }

        List<String> list;
        try {
            // Split the input with the IK-based utility and emit one row per word
            list = KeywordUtil.splitKeyword(str);
            for (String word : list) {
                collect(Row.of(word));
            }
        } catch (IOException e) {
            // If segmentation fails, emit the raw string as a single keyword
            collect(Row.of(str));
        }
    }
}

4) Entity class KeywordBean

@Data
@AllArgsConstructor
@NoArgsConstructor
public class KeywordBean {
    // Window start time
    private String stt;
    // Window end time
    private String edt;
    // Keyword source (fields marked with @TransientSink are not written to ClickHouse; the annotation is commented out here, so source is written)
    //@TransientSink
    private String source;
    // Keyword
    private String keyword;
    // Keyword occurrence count
    private Long keyword_count;
    // Timestamp
    private Long ts;
}

5) Constant class GmallConstant

public class GmallConstant {
    // 10 单据状态
    public static final String ORDER_STATUS_UNPAID="1001";  //未支付
    public static final String ORDER_STATUS_PAID="1002"; //已支付
    public static final String ORDER_STATUS_CANCEL="1003";//已取消
    public static final String ORDER_STATUS_FINISH="1004";//已完成
    public static final String ORDER_STATUS_REFUND="1005";//退款中
    public static final String ORDER_STATUS_REFUND_DONE="1006";//退款完成


    // 11 支付状态
    public static final String PAYMENT_TYPE_ALIPAY="1101";//支付宝
    public static final String PAYMENT_TYPE_WECHAT="1102";//微信
    public static final String PAYMENT_TYPE_UNION="1103";//银联

    // 12 评价
    public static final String APPRAISE_GOOD="1201";// 好评
    public static final String APPRAISE_SOSO="1202";// 中评
    public static final String APPRAISE_BAD="1203";//  差评
    public static final String APPRAISE_AUTO="1204";// 自动

    // 13 退货原因
    public static final String REFUND_REASON_BAD_GOODS="1301";// 质量问题
    public static final String REFUND_REASON_WRONG_DESC="1302";// 商品描述与实际描述不一致
    public static final String REFUND_REASON_SALE_OUT="1303";//   缺货
    public static final String REFUND_REASON_SIZE_ISSUE="1304";//  号码不合适
    public static final String REFUND_REASON_MISTAKE="1305";//  拍错
    public static final String REFUND_REASON_NO_REASON="1306";//  不想买了
    public static final String REFUND_REASON_OTHER="1307";//    其他

    // 14 购物券状态
    public static final String COUPON_STATUS_UNUSED="1401";//    未使用
    public static final String COUPON_STATUS_USING="1402";//     使用中
    public static final String COUPON_STATUS_USED="1403";//       已使用

    // 15退款类型
    public static final String REFUND_TYPE_ONLY_MONEY="1501";//   仅退款
    public static final String REFUND_TYPE_WITH_GOODS="1502";//    退货退款

    // 24来源类型
    public static final String SOURCE_TYPE_QUREY="2401";//   用户查询
    public static final String SOURCE_TYPE_PROMOTION="2402";//   商品推广
    public static final String SOURCE_TYPE_AUTO_RECOMMEND="2403";//   智能推荐
    public static final String SOURCE_TYPE_ACTIVITY="2404";//   促销活动


    // 购物券范围
    public static final String COUPON_RANGE_TYPE_CATEGORY3="3301";//
    public static final String COUPON_RANGE_TYPE_TRADEMARK="3302";//
    public static final String COUPON_RANGE_TYPE_SPU="3303";//

    //购物券类型
    public static final String COUPON_TYPE_MJ="3201";//满减
    public static final String COUPON_TYPE_DZ="3202";// 满量打折
    public static final String COUPON_TYPE_DJ="3203";//  代金券

    public static final String ACTIVITY_RULE_TYPE_MJ="3101";
    public static final String ACTIVITY_RULE_TYPE_DZ ="3102";
    public static final String ACTIVITY_RULE_TYPE_ZK="3103";


    public static final String KEYWORD_SEARCH="SEARCH";
    public static final String KEYWORD_CLICK="CLICK";
    public static final String KEYWORD_CART="CART";
    public static final String KEYWORD_ORDER="ORDER";

}

6) Add constants to the GmallConfig constant class

public class GmallConfig {
    // Phoenix schema name
    public static final String HBASE_SCHEMA = "GMALL211126_REALTIME";

    // Phoenix driver
    public static final String PHOENIX_DRIVER = "org.apache.phoenix.jdbc.PhoenixDriver";

    // Phoenix connection parameters
    public static final String PHOENIX_SERVER = "jdbc:phoenix:hadoop102,hadoop103,hadoop104:2181";

    // ClickHouse driver
    public static final String CLICKHOUSE_DRIVER = "ru.yandex.clickhouse.ClickHouseDriver";

    // ClickHouse connection URL
    public static final String CLICKHOUSE_URL = "jdbc:clickhouse://hadoop102:8123/gmall_211126";
}

7) ClickHouse tool class

public class MyClickHouseUtil {

    public static <T> SinkFunction<T> getSinkFunction(String sql) {
        return JdbcSink.<T>sink(sql,
                new JdbcStatementBuilder<T>() {

                    @SneakyThrows
                    @Override
                    public void accept(PreparedStatement preparedStatement, T t) throws SQLException {

                        // Use reflection to read the data held by object t
                        Class<?> tClz = t.getClass();

//                        Method[] methods = tClz.getMethods();
//                        for (int i = 0; i < methods.length; i++) {
//                            Method method = methods[i];
//                            method.invoke(t);
//                        }

                        // Get and iterate over the declared fields
                        Field[] declaredFields = tClz.getDeclaredFields();
                        int offset = 0;
                        for (int i = 0; i < declaredFields.length; i++) {

                            // Get a single field
                            Field field = declaredFields[i];
                            field.setAccessible(true);

                            // Try to read the custom annotation on the field
                            TransientSink transientSink = field.getAnnotation(TransientSink.class);
                            if (transientSink != null) {
                                // Skip fields that should not be written to ClickHouse
                                offset++;
                                continue;
                            }

                            // Read the field value
                            Object value = field.get(t);

                            // Bind the value to the corresponding placeholder (JDBC indexes start at 1)
                            preparedStatement.setObject(i + 1 - offset, value);
                        }
                    }
                },
                new JdbcExecutionOptions.Builder()
                        .withBatchSize(5)
                        .withBatchIntervalMs(1000L)
                        .build(),
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withDriverName(GmallConfig.CLICKHOUSE_DRIVER)
                        .withUrl(GmallConfig.CLICKHOUSE_URL)
                        .build());
    }

}

8) Main program

//数据流:web/app -> Nginx -> 日志服务器(.log) -> Flume -> Kafka(ODS) -> FlinkApp -> Kafka(DWD) -> FlinkApp -> ClickHouse(DWS)
//程  序:     Mock(lg.sh) -> Flume(f1) -> Kafka(ZK) -> BaseLogApp -> Kafka(ZK) -> DwsTrafficSourceKeywordPageViewWindow > ClickHouse(ZK)
public class DwsTrafficSourceKeywordPageViewWindow {

    public static void main(String[] args) throws Exception {

        //TODO 1.获取执行环境
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // 1.1 状态后端设置
//        env.enableCheckpointing(3000L, CheckpointingMode.EXACTLY_ONCE);
//        env.getCheckpointConfig().setCheckpointTimeout(60 * 1000L);
//        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(3000L);
//        env.getCheckpointConfig().enableExternalizedCheckpoints(
//                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION
//        );
//        env.setRestartStrategy(RestartStrategies.failureRateRestart(
//                3, Time.days(1), Time.minutes(1)
//        ));
//        env.setStateBackend(new HashMapStateBackend());
//        env.getCheckpointConfig().setCheckpointStorage(
//                "hdfs://hadoop102:8020/ck"
//        );
//        System.setProperty("HADOOP_USER_NAME", "atguigu");

        //TODO 2.使用DDL方式读取Kafka page_log 主题的数据创建表并且提取时间戳生成Watermark
        String topic = "dwd_traffic_page_log";
        String groupId = "dws_traffic_source_keyword_page_view_window_211126";
        tableEnv.executeSql("" +
                "create table page_log( " +
                "    `page` map<string,string>, " +
                "    `ts` bigint, " +
                "    `rt` as TO_TIMESTAMP(FROM_UNIXTIME(ts/1000)), " +
                "    WATERMARK FOR rt AS rt - INTERVAL '2' SECOND " +
                " ) " + MyKafkaUtil.getKafkaDDL(topic, groupId));

        //TODO 3.过滤出搜索数据
        Table filterTable = tableEnv.sqlQuery("" +
                " select " +
                "    page['item'] item, " +
                "    rt " +
                " from page_log " +
                " where page['last_page_id'] = 'search' " +
                " and page['item_type'] = 'keyword' " +
                " and page['item'] is not null");
        tableEnv.createTemporaryView("filter_table", filterTable);

        //TODO 4.注册UDTF & 切词
        tableEnv.createTemporarySystemFunction("SplitFunction", SplitFunction.class);
        Table splitTable = tableEnv.sqlQuery("" +
                "SELECT " +
                "    word, " +
                "    rt " +
                "FROM filter_table,  " +
                "LATERAL TABLE(SplitFunction(item))");
        tableEnv.createTemporaryView("split_table", splitTable);
        tableEnv.toAppendStream(splitTable, Row.class).print("Split>>>>>>");

        //TODO 5.分组、开窗、聚合
        Table resultTable = tableEnv.sqlQuery("" +
                "select " +
                "    'search' source, " +
                "    DATE_FORMAT(TUMBLE_START(rt, INTERVAL '10' SECOND),'yyyy-MM-dd HH:mm:ss') stt, " +
                "    DATE_FORMAT(TUMBLE_END(rt, INTERVAL '10' SECOND),'yyyy-MM-dd HH:mm:ss') edt, " +
                "    word keyword, " +
                "    count(*) keyword_count, " +
                "    UNIX_TIMESTAMP()*1000 ts " +
                "from split_table " +
                "group by word,TUMBLE(rt, INTERVAL '10' SECOND)");

        //TODO 6.将动态表转换为流
        DataStream<KeywordBean> keywordBeanDataStream = tableEnv.toAppendStream(resultTable, KeywordBean.class);
        keywordBeanDataStream.print(">>>>>>>>>>>>");

        //TODO 7.将数据写出到ClickHouse
        keywordBeanDataStream.addSink(MyClickHouseUtil.getSinkFunction("insert into dws_traffic_source_keyword_page_view_window values(?,?,?,?,?,?)"));

        //TODO 8.启动任务
        env.execute("DwsTrafficSourceKeywordPageViewWindow");

    }
}

Traffic Domain: Version-Channel-Region-Visitor Category Granularity Page View Summary Table by Window (※)

Main task

The DWS layer serves the ADS layer. Analysis of the metric system shows that the summary table in this section needs five measure fields: session count, page view count, total browsing duration, unique visitor count, and bounce session count. The task of this section is to compute these five metrics and write the dimension and measure data into the ClickHouse summary table.

Approach

The task splits into two parts: computing the statistical metrics and writing the data. Writing was covered earlier and is not repeated; only the metric computation is analyzed here.

  • Session count, page view count, and total browsing duration are all page-view related and can be derived from the DWD page view detail topic.
  • The unique visitor count is derived from the DWD unique visitor detail topic, and the bounce session count from the DWD user jump (bounce) detail topic.

The data read from the three topics is encapsulated into three streams in the program. To write the processed data into the same ClickHouse table, the three streams must have exactly the same data structure. That is easy to solve: define an entity class matching the table schema and convert the data in each stream into it. A further question is whether the three streams need to be merged. The ClickHouse table is ordered by the window times plus all dimension fields, and this sort key serves as the unique key in ClickHouse. If the three streams wrote to ClickHouse separately, then for the same unique key there would be three rows that all need to be kept, because each carries a different subset of the measures. With ReplacingMergeTree, however, rows with the same sort key are deduplicated when parts are merged and only one row survives, so data would be lost. That approach is clearly not feasible; instead the three streams are merged into one, so that each sort key yields exactly one row.

1) Prerequisite knowledge

Common multi-stream merge operators and their application scenarios are as follows.

  • union(): merges two or more streams. There is no limit on the number of streams, but the data structures of all streams must be exactly the same.

  • connect(): combines two streams. The CoProcessFunction that can be used in the process operator immediately after it is the lowest-level API for two-stream processing; with keyed state and timers it can implement all kinds of associations, such as regular joins, broadcast joins, and interval joins. connect() can only associate two streams, but places no requirements on their data structures.

  • intervalJoin(): an interval join, in which each record of one stream can be associated with the records of the other stream that arrive within a certain time range. The underlying principle, taking A.intervalJoin(B) as an example: when a record from stream A enters the operator it is saved in keyed state and a timer is registered; when the timer fires, the record is removed from state. Until then, every record from stream B can be associated with the stream A records held in state. Stream B maintains state and timers in the same way, which is how the interval join is realized. Suppose stream A's timer keeps records for 3 s and stream B's timer keeps records for 5 s. A stream A record arriving at time tA can be associated with stream B records arriving within tA - 5 s to tA + 3 s, and a stream B record arriving at time tB can be associated with stream A records arriving within tB - 3 s to tB + 5 s.

  • join(): its functionality can be replaced by the operators above, and it is rarely used nowadays.

connect(), intervalJoin(), and join() merge only two streams. Here three streams need to be merged and their data structures are identical, so union() is the more reasonable choice.
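For reference, a minimal runnable sketch of an interval join between two toy keyed streams; the stream contents, key selector, and time bounds are illustrative and simply mirror the 5 s / 3 s example above, they are not part of this program:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class IntervalJoinDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // Two toy streams of (key, event time in ms); the timestamp is taken from field f1
        DataStream<Tuple2<String, Long>> a = env
                .fromElements(Tuple2.of("k1", 1000L), Tuple2.of("k1", 9000L))
                .assignTimestampsAndWatermarks(WatermarkStrategy
                        .<Tuple2<String, Long>>forMonotonousTimestamps()
                        .withTimestampAssigner((e, ts) -> e.f1));
        DataStream<Tuple2<String, Long>> b = env
                .fromElements(Tuple2.of("k1", 3000L))
                .assignTimestampsAndWatermarks(WatermarkStrategy
                        .<Tuple2<String, Long>>forMonotonousTimestamps()
                        .withTimestampAssigner((e, ts) -> e.f1));

        // Each stream A record is matched with stream B records whose timestamps
        // fall within [a.ts - 5 s, a.ts + 3 s], as in the description above
        a.keyBy(t -> t.f0)
                .intervalJoin(b.keyBy(t -> t.f0))
                .between(Time.seconds(-5), Time.seconds(3))
                .process(new ProcessJoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, String>() {
                    @Override
                    public void processElement(Tuple2<String, Long> left, Tuple2<String, Long> right,
                                               Context ctx, Collector<String> out) {
                        out.collect(left + " <-> " + right);
                    }
                })
                .print("intervalJoin>>>");

        env.execute("IntervalJoinDemo");
    }
}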

2) Execution steps

(1) Read the page topic data and encapsulate it as a stream

(2) Compute browsing duration, page view count, and session count, and convert the data structure

Create an entity class object for each record. Set the unique visitor count and bounce session count to 0 and the page view count to 1 (every page view log adds one page view). Take the browsing duration from the log and assign it to the field of the same name in the entity class. Then check whether last_page_id is null: if it is, the page is the first page of a session, meaning a new session has started, so set the session count to 1; otherwise set it to 0. Fill in the dimension fields and set the window start and end times to empty strings. Downstream operators window the data by watermark, so an event-time field is also needed; the log generation time ts is used for this. Finally, emit the entity object downstream.

(3) Read the user bounce (jump-out) detail data

(4) Convert the data structure of the bounce stream

Encapsulate the data into the entity class. The dimension fields and the timestamp are handled the same way as in the page stream; the bounce session count is set to 1 and the remaining measures to 0. Send the data downstream.

(5) Read the unique visitor detail data

(6) Convert the data structure of the unique visitor stream

The processing is the same as for the bounce stream, except that the unique visitor count is set to 1 and the other measures to 0.

(7) Merge the three streams with union()

(8) Assign watermarks;

(9) Group by dimension field;

(10) Open windows

The timeout for detecting a bounce is 10 s. Suppose a log record is bounce data with an event time of 15 s; the bounce can only be confirmed once the watermark reaches 25 s. With a window size of 10 s this record should fall into the 10-20 s window, but by the time the record is produced the watermark is already at 25 s and its window has been closed and destroyed. The bounce session count would then always be 0, which is obviously wrong. To avoid this, the window must close late, and the delay must be greater than or equal to the bounce-detection timeout so that no bounce data is missed. Here this is done by setting the watermark's forBoundedOutOfOrderness delay to 14 s, which covers the 10 s CEP timeout plus the 2 s watermark delay used in the DWD bounce job. This seriously hurts timeliness: if the business required the delay to be half an hour, every window would close half an hour late. To compute bounce-related metrics, this negative impact on timeliness has to be accepted.

(11) Aggregate calculation

Sum the measure fields, and fill in the window start time and end time fields after each window's data has been aggregated.

In ClickHouse, ts is used as the version field for deduplication: when parts are merged, ReplacingMergeTree compares ts among rows with the same sort key and keeps the row with the largest ts. The timestamp field is therefore set to the current system time, so that when data is recomputed the result of the latest computation is the one retained.

(12) Write data to ClickHouse.

Diagram

ClickHouse table creation statement

drop table if exists dws_traffic_vc_ch_ar_is_new_page_view_window;
create table if not exists dws_traffic_vc_ch_ar_is_new_page_view_window
(
    stt     DateTime,
    edt     DateTime,
    vc      String,
    ch      String,
    ar      String,
    is_new  String,
    uv_ct   UInt64,
    sv_ct   UInt64,
    pv_ct   UInt64,
    dur_sum UInt64,
    uj_ct   UInt64,
    ts      UInt64
) engine = ReplacingMergeTree(ts)
      partition by toYYYYMMDD(stt)
      order by (stt, edt, vc, ch, ar, is_new);

Code

1) Entity class TrafficPageViewBean

@Data
@AllArgsConstructor
public class TrafficPageViewBean {
    // Window start time
    String stt;
    // Window end time
    String edt;
    // App version
    String vc;
    // Channel
    String ch;
    // Region
    String ar;
    // New/returning visitor flag
    String isNew;
    // Unique visitor count
    Long uvCt;
    // Session count
    Long svCt;
    // Page view count
    Long pvCt;
    // Total browsing duration
    Long durSum;
    // Bounce session count
    Long ujCt;
    // Timestamp
    Long ts;
}

2) Main program

//数据流:web/app -> Nginx -> 日志服务器(.log) -> Flume -> Kafka(ODS) -> FlinkApp -> Kafka(DWD)
//数据流:web/app -> Nginx -> 日志服务器(.log) -> Flume -> Kafka(ODS) -> FlinkApp -> Kafka(DWD) -> FlinkApp -> Kafka(DWD)
//数据流:web/app -> Nginx -> 日志服务器(.log) -> Flume -> Kafka(ODS) -> FlinkApp -> Kafka(DWD) -> FlinkApp -> Kafka(DWD)
//          ======> FlinkApp -> ClickHouse(DWS)

//程  序:     Mock(lg.sh) -> Flume(f1) -> Kafka(ZK) -> BaseLogApp -> Kafka(ZK)
//程  序:     Mock(lg.sh) -> Flume(f1) -> Kafka(ZK) -> BaseLogApp -> Kafka(ZK) -> DwdTrafficUserJumpDetail -> Kafka(ZK)
//程  序:     Mock(lg.sh) -> Flume(f1) -> Kafka(ZK) -> BaseLogApp -> Kafka(ZK) -> DwdTrafficUniqueVisitorDetail -> Kafka(ZK)
//          ======> DwsTrafficVcChArIsNewPageViewWindow -> ClickHouse(ZK)
public class DwsTrafficVcChArIsNewPageViewWindow {
    
    
    public static void main(String[] args) throws Exception {
    
    
        //TODO 1.获取执行环境
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // 1.1 状态后端设置
//        env.enableCheckpointing(3000L, CheckpointingMode.EXACTLY_ONCE);
//        env.getCheckpointConfig().setCheckpointTimeout(60 * 1000L);
//        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(3000L);
//        env.getCheckpointConfig().enableExternalizedCheckpoints(
//                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION
//        );
//        env.setRestartStrategy(RestartStrategies.failureRateRestart(
//                3, Time.days(1), Time.minutes(1)
//        ));
//        env.setStateBackend(new HashMapStateBackend());
//        env.getCheckpointConfig().setCheckpointStorage(
//                "hdfs://hadoop102:8020/ck"
//        );
//        System.setProperty("HADOOP_USER_NAME", "atguigu");

        //TODO 2.读取三个主题的数据创建流
        String uvTopic = "dwd_traffic_unique_visitor_detail";
        String ujdTopic = "dwd_traffic_user_jump_detail";
        String topic = "dwd_traffic_page_log";
        String groupId = "vccharisnew_pageview_window_1126";
        DataStreamSource<String> uvDS = env.addSource(MyKafkaUtil.getFlinkKafkaConsumer(uvTopic, groupId));
        DataStreamSource<String> ujDS = env.addSource(MyKafkaUtil.getFlinkKafkaConsumer(ujdTopic, groupId));
        DataStreamSource<String> pageDS = env.addSource(MyKafkaUtil.getFlinkKafkaConsumer(topic, groupId));

        //TODO 3.统一数据格式
        SingleOutputStreamOperator<TrafficPageViewBean> trafficPageViewWithUvDS = uvDS.map(line -> {
    
    
            JSONObject jsonObject = JSON.parseObject(line);
            JSONObject common = jsonObject.getJSONObject("common");

            return new TrafficPageViewBean("", "",
                    common.getString("vc"),
                    common.getString("ch"),
                    common.getString("ar"),
                    common.getString("is_new"),
                    1L, 0L, 0L, 0L, 0L,
                    jsonObject.getLong("ts"));
        });

        SingleOutputStreamOperator<TrafficPageViewBean> trafficPageViewWithUjDS = ujDS.map(line -> {
    
    
            JSONObject jsonObject = JSON.parseObject(line);
            JSONObject common = jsonObject.getJSONObject("common");

            return new TrafficPageViewBean("", "",
                    common.getString("vc"),
                    common.getString("ch"),
                    common.getString("ar"),
                    common.getString("is_new"),
                    0L, 0L, 0L, 0L, 1L,
                    jsonObject.getLong("ts"));
        });

        SingleOutputStreamOperator<TrafficPageViewBean> trafficPageViewWithPageDS = pageDS.map(line -> {
    
    
            JSONObject jsonObject = JSON.parseObject(line);
            JSONObject common = jsonObject.getJSONObject("common");

            JSONObject page = jsonObject.getJSONObject("page");
            String lastPageId = page.getString("last_page_id");
            long sv = 0L;
            if (lastPageId == null) {
    
    
                sv = 1L;
            }

            return new TrafficPageViewBean("", "",
                    common.getString("vc"),
                    common.getString("ch"),
                    common.getString("ar"),
                    common.getString("is_new"),
                    0L, sv, 1L, page.getLong("during_time"), 0L,
                    jsonObject.getLong("ts"));
        });

        //TODO 4.将三个流进行Union
        DataStream<TrafficPageViewBean> unionDS = trafficPageViewWithUvDS.union(
                trafficPageViewWithUjDS,
                trafficPageViewWithPageDS);

        //TODO 5.提取事件时间生成WaterMark
        //这里注意一下:forBoundedOutOfOrderness延迟时间要设置为14,如果为2的话会出错,因为在DWD中用户跳出事务中,使用了CEP,会产生10+2s的延迟。
        SingleOutputStreamOperator<TrafficPageViewBean> trafficPageViewWithWmDS = unionDS.assignTimestampsAndWatermarks(WatermarkStrategy.<TrafficPageViewBean>forBoundedOutOfOrderness(Duration.ofSeconds(14)).withTimestampAssigner(new SerializableTimestampAssigner<TrafficPageViewBean>() {
    
    
            @Override
            public long extractTimestamp(TrafficPageViewBean element, long recordTimestamp) {
    
    
                return element.getTs();
            }
        }));

        //TODO 6.分组开窗聚合
        WindowedStream<TrafficPageViewBean, Tuple4<String, String, String, String>, TimeWindow> windowedStream = trafficPageViewWithWmDS.keyBy(new KeySelector<TrafficPageViewBean, Tuple4<String, String, String, String>>() {
    
    
            @Override
            public Tuple4<String, String, String, String> getKey(TrafficPageViewBean value) throws Exception {
    
    
                return new Tuple4<>(value.getAr(),
                        value.getCh(),
                        value.getIsNew(),
                        value.getVc());
            }
        }).window(TumblingEventTimeWindows.of(Time.seconds(10)));

//        //增量聚合
//        windowedStream.reduce(new ReduceFunction<TrafficPageViewBean>() {
    
    
//            @Override
//            public TrafficPageViewBean reduce(TrafficPageViewBean value1, TrafficPageViewBean value2) throws Exception {
    
    
//                return null;
//            }
//        });
//        //全量聚合
//        windowedStream.apply(new WindowFunction<TrafficPageViewBean, TrafficPageViewBean, Tuple4<String, String, String, String>, TimeWindow>() {
    
    
//            @Override
//            public void apply(Tuple4<String, String, String, String> key, TimeWindow window, Iterable<TrafficPageViewBean> input, Collector<TrafficPageViewBean> out) throws Exception {
    
    
//
//            }
//        });
        SingleOutputStreamOperator<TrafficPageViewBean> resultDS = windowedStream.reduce(new ReduceFunction<TrafficPageViewBean>() {
    
    
            @Override
            public TrafficPageViewBean reduce(TrafficPageViewBean value1, TrafficPageViewBean value2) throws Exception {
    
    
                value1.setSvCt(value1.getSvCt() + value2.getSvCt());
                value1.setUvCt(value1.getUvCt() + value2.getUvCt());
                value1.setUjCt(value1.getUjCt() + value2.getUjCt());
                value1.setPvCt(value1.getPvCt() + value2.getPvCt());
                value1.setDurSum(value1.getDurSum() + value2.getDurSum());
                return value1;
            }
        }, new WindowFunction<TrafficPageViewBean, TrafficPageViewBean, Tuple4<String, String, String, String>, TimeWindow>() {
    
    
            @Override
            public void apply(Tuple4<String, String, String, String> key, TimeWindow window, Iterable<TrafficPageViewBean> input, Collector<TrafficPageViewBean> out) throws Exception {
    
    
                //获取数据
                TrafficPageViewBean next = input.iterator().next();

                //补充信息
                next.setStt(DateFormatUtil.toYmdHms(window.getStart()));
                next.setEdt(DateFormatUtil.toYmdHms(window.getEnd()));

                //修改TS
                next.setTs(System.currentTimeMillis());

                //输出数据
                out.collect(next);
            }
        });

        //TODO 7.将数据写出到ClickHouse
        resultDS.print(">>>>>>>>");
        resultDS.addSink(MyClickHouseUtil.getSinkFunction("insert into dws_traffic_vc_ch_ar_is_new_page_view_window values(?,?,?,?,?,?,?,?,?,?,?,?)"));

        //TODO 8.启动任务
        env.execute("DwsTrafficVcChArIsNewPageViewWindow");
    }
}

Traffic Domain: Page View Summary Table by Window

Main task

Read data from the Kafka page log topic and count the daily unique visitors to the home page and the product detail page.

Approach

1) Read Kafka page topic data

2) Convert data structure

Convert the data in the stream from String to JSONObject.

3) Filter data

Keep only the records whose page_id is home or good_detail; the statistics in this program concern only these two pages, so other records are of no use.

4) Assign watermarks

5) Group by mid

6) Count unique visitors to the home page and the product detail page, and convert the data structure

Use Flink state programming to keep, for each mid, the last visit date of the home page and of the product detail page. If page_id is home and the date stored in state is null or not today, set homeUvCt (home page unique visitor count) to 1 and update the date in state to today; otherwise set it to 0 and leave the state unchanged. Unique visitors to the product detail page are counted in the same way. When at least one of homeUvCt and detailUvCt is non-zero, wrap the counts and the related dimension information into the entity class and emit it downstream; otherwise discard the record.

7) Open the window

8) Aggregation

9) Write data out to ClickHouse

Diagram

ClickHouse table creation statement

drop table if exists dws_traffic_page_view_window;
create table if not exists dws_traffic_page_view_window
(
    stt               DateTime,
    edt               DateTime,
    home_uv_ct        UInt64,
    good_detail_uv_ct UInt64,
    ts                UInt64
) engine = ReplacingMergeTree(ts)
      partition by toYYYYMMDD(stt)
      order by (stt, edt);

Code

1) Entity class TrafficHomeDetailPageViewBean

@Data
@AllArgsConstructor
public class TrafficHomeDetailPageViewBean {
    // Window start time
    String stt;
    // Window end time
    String edt;
    // Home page unique visitor count
    Long homeUvCt;
    // Product detail page unique visitor count
    Long goodDetailUvCt;
    // Timestamp
    Long ts;
}

2) Main program

//数据流:web/app -> Nginx -> 日志服务器(.log) -> Flume -> Kafka(ODS) -> FlinkApp -> Kafka(DWD) -> FlinkApp -> ClickHouse(DWS)
//程  序:     Mock(lg.sh) -> Flume(f1) -> Kafka(ZK) -> BaseLogApp -> Kafka(ZK) -> DwsTrafficPageViewWindow -> ClickHouse(ZK)
public class DwsTrafficPageViewWindow {
    
    
    public static void main(String[] args) throws Exception {
    
    
        //TODO 1.获取执行环境
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // 1.1 状态后端设置
//        env.enableCheckpointing(3000L, CheckpointingMode.EXACTLY_ONCE);
//        env.getCheckpointConfig().setCheckpointTimeout(60 * 1000L);
//        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(3000L);
//        env.getCheckpointConfig().enableExternalizedCheckpoints(
//                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION
//        );
//        env.setRestartStrategy(RestartStrategies.failureRateRestart(
//                3, Time.days(1), Time.minutes(1)
//        ));
//        env.setStateBackend(new HashMapStateBackend());
//        env.getCheckpointConfig().setCheckpointStorage(
//                "hdfs://hadoop102:8020/ck"
//        );
//        System.setProperty("HADOOP_USER_NAME", "atguigu");

        //TODO 2.读取 Kafka 页面日志主题数据创建流
        String topic = "dwd_traffic_page_log";
        String groupId = "dws_traffic_page_view_window_211126";
        DataStreamSource<String> kafkaDS = env.addSource(MyKafkaUtil.getFlinkKafkaConsumer(topic, groupId));

        //TODO 3.将每行数据转换为JSON对象并过滤(首页与商品详情页)
        SingleOutputStreamOperator<JSONObject> jsonObjDS = kafkaDS.flatMap(new FlatMapFunction<String, JSONObject>() {
    
    
            @Override
            public void flatMap(String value, Collector<JSONObject> out) throws Exception {
    
    
                //转换为JSON对象
                JSONObject jsonObject = JSON.parseObject(value);
                //获取当前页面id
                String pageId = jsonObject.getJSONObject("page").getString("page_id");
                //过滤出首页与商品详情页的数据
                if ("home".equals(pageId) || "good_detail".equals(pageId)) {
    
    
                    out.collect(jsonObject);
                }
            }
        });

        //TODO 4.提取事件时间生成Watermark
        SingleOutputStreamOperator<JSONObject> jsonObjWithWmDS = jsonObjDS.assignTimestampsAndWatermarks(WatermarkStrategy.<JSONObject>forBoundedOutOfOrderness(Duration.ofSeconds(2)).withTimestampAssigner(new SerializableTimestampAssigner<JSONObject>() {
    
    
            @Override
            public long extractTimestamp(JSONObject element, long recordTimestamp) {
    
    
                return element.getLong("ts");
            }
        }));

        //TODO 5.按照Mid分组
        KeyedStream<JSONObject, String> keyedStream = jsonObjWithWmDS.keyBy(json -> json.getJSONObject("common").getString("mid"));

        //TODO 6.使用状态编程过滤出首页与商品详情页的独立访客
        SingleOutputStreamOperator<TrafficHomeDetailPageViewBean> trafficHomeDetailDS = keyedStream.flatMap(new RichFlatMapFunction<JSONObject, TrafficHomeDetailPageViewBean>() {
    
    

            private ValueState<String> homeLastState;
            private ValueState<String> detailLastState;

            @Override
            public void open(Configuration parameters) throws Exception {
    
    

                StateTtlConfig ttlConfig = new StateTtlConfig.Builder(Time.days(1))
                        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                        .build();

                ValueStateDescriptor<String> homeStateDes = new ValueStateDescriptor<>("home-state", String.class);
                ValueStateDescriptor<String> detailStateDes = new ValueStateDescriptor<>("detail-state", String.class);

                //设置TTL
                homeStateDes.enableTimeToLive(ttlConfig);
                detailStateDes.enableTimeToLive(ttlConfig);

                homeLastState = getRuntimeContext().getState(homeStateDes);
                detailLastState = getRuntimeContext().getState(detailStateDes);
            }

            @Override
            public void flatMap(JSONObject value, Collector<TrafficHomeDetailPageViewBean> out) throws Exception {
    
    

                //获取状态数据以及当前数据中的日期
                Long ts = value.getLong("ts");
                String curDt = DateFormatUtil.toDate(ts);
                String homeLastDt = homeLastState.value();
                String detailLastDt = detailLastState.value();

                //定义访问首页或者详情页的数据
                long homeCt = 0L;
                long detailCt = 0L;

                //如果状态为空或者状态时间与当前时间不同,则为需要的数据
                if ("home".equals(value.getJSONObject("page").getString("page_id"))) {
    
    
                    if (homeLastDt == null || !homeLastDt.equals(curDt)) {
    
    
                        homeCt = 1L;
                        homeLastState.update(curDt);
                    }
                } else {
    
    
                    if (detailLastDt == null || !detailLastDt.equals(curDt)) {
    
    
                        detailCt = 1L;
                        detailLastState.update(curDt);
                    }
                }

                //满足任何一个数据不等于0,则可以写出
                if (homeCt == 1L || detailCt == 1L) {
    
    
                    out.collect(new TrafficHomeDetailPageViewBean("", "",
                            homeCt,
                            detailCt,
                            ts));
                }
            }
        });

        //TODO 7.开窗聚合
        SingleOutputStreamOperator<TrafficHomeDetailPageViewBean> resultDS = trafficHomeDetailDS.windowAll(TumblingEventTimeWindows.of(org.apache.flink.streaming.api.windowing.time.Time.seconds(10))).reduce(new ReduceFunction<TrafficHomeDetailPageViewBean>() {
    
    
            @Override
            public TrafficHomeDetailPageViewBean reduce(TrafficHomeDetailPageViewBean value1, TrafficHomeDetailPageViewBean value2) throws Exception {
    
    
                value1.setHomeUvCt(value1.getHomeUvCt() + value2.getHomeUvCt());
                value1.setGoodDetailUvCt(value1.getGoodDetailUvCt() + value2.getGoodDetailUvCt());
                return value1;
            }
        }, new AllWindowFunction<TrafficHomeDetailPageViewBean, TrafficHomeDetailPageViewBean, TimeWindow>() {
    
    
            @Override
            public void apply(TimeWindow window, Iterable<TrafficHomeDetailPageViewBean> values, Collector<TrafficHomeDetailPageViewBean> out) throws Exception {
    
    
                //获取数据
                TrafficHomeDetailPageViewBean pageViewBean = values.iterator().next();
                //补充字段
                pageViewBean.setTs(System.currentTimeMillis());
                pageViewBean.setStt(DateFormatUtil.toYmdHms(window.getStart()));
                pageViewBean.setEdt(DateFormatUtil.toYmdHms(window.getEnd()));
                //输出数据
                out.collect(pageViewBean);
            }
        });

        //TODO 8.将数据写出到ClickHouse
        resultDS.print(">>>>>>>>>>>");
        resultDS.addSink(MyClickHouseUtil.getSinkFunction("insert into dws_traffic_page_view_window values(?,?,?,?,?)"));

        //TODO 9.启动任务
        env.execute("DwsTrafficPageViewWindow");
    }
}

User Domain: User Login Summary Table by Window

Main task

Read data from the Kafka page log topic and count the number of seven-day returning users and the number of unique users for the day.

Approach

Previously active users who have been inactive for a while (churned) and become active again today are called returning users. Here the total number of returning users is counted. The rule is: a user who logs in today and whose previous login was at least 7 full days ago counts as a returning user. For example, a user whose last login date is 2022-02-01 and who logs in again on 2022-02-09 has gone 7 whole days without logging in and is counted as returning.

1) Read the Kafka page topic data

2) Convert data structure

The data in the stream is converted from String to JSONObject.

3) Filter data

The statistical metrics are user-related, so only the records whose uid is not null are useful. In addition, a login happens in one of two situations:

  • the user is logged in automatically as soon as the app is opened;
  • the user opens the app without logging in, browses some pages, then jumps to the login page and logs in midway through the session.

In case (1) the login happens on the first page of the session, so such pages are kept; in case (2) it happens on the login page, after which the user necessarily jumps to some other page, so keeping the first page after login captures the login of case (2).

To sum up, keep the page records whose uid is not null and whose last_page_id is either null or login.

4) Assign watermarks

5) Group by uid

The login records of different users are irrelevant to each other and are handled separately.

6) Count the number of returning users and the number of unique users

Use Flink state programming to record the user's last login date.

  • If the last login date in state is not null, make a further judgment:

    • If the last login date is not today, set the unique user count uuCt to 1, update the last login date in state to today, and then judge further:

      a) if the difference between today and the last login date is greater than or equal to 8 days, set the returning user count backCt to 1;

      b) otherwise set backCt to 0.

    • If the last login date is today, both uuCt and backCt are 0; such a record does not affect the result, so it is discarded rather than sent downstream.

  • If the last login date in state is null, set uuCt to 1 and backCt to 0, and update the last login date in state to today.

7) Open windows and aggregate

Sum the measure fields, fill in the window start and end times, and set the timestamp field to the current system time for ClickHouse deduplication.

8) Write to ClickHouse

Diagram

ClickHouse table creation statement

drop table if exists dws_user_user_login_window;
create table if not exists dws_user_user_login_window
(
    stt     DateTime,
    edt     DateTime,   
    back_ct UInt64,
    uu_ct   UInt64,
    ts      UInt64
) engine = ReplacingMergeTree(ts)
      partition by toYYYYMMDD(stt)
      order by (stt, edt);

Code

1) Entity class UserLoginBean

@Data
@AllArgsConstructor
public class UserLoginBean {
    // Window start time
    String stt;

    // Window end time
    String edt;

    // Returning user count
    Long backCt;

    // Unique user count
    Long uuCt;

    // Timestamp
    Long ts;
}

2) Main program

//数据流:web/app -> Nginx -> 日志服务器(.log) -> Flume -> Kafka(ODS) -> FlinkApp -> Kafka(DWD) -> FlinkApp -> ClickHouse(DWS)
//程  序:     Mock(lg.sh) -> Flume(f1) -> Kafka(ZK) -> BaseLogApp -> Kafka(ZK) -> DwsUserUserLoginWindow -> ClickHouse(ZK)
public class DwsUserUserLoginWindow {
    
    
    public static void main(String[] args) throws Exception {
    
    

        //TODO 1.获取执行环境
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // 1.1 状态后端设置
//        env.enableCheckpointing(3000L, CheckpointingMode.EXACTLY_ONCE);
//        env.getCheckpointConfig().setCheckpointTimeout(60 * 1000L);
//        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(3000L);
//        env.getCheckpointConfig().enableExternalizedCheckpoints(
//                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION
//        );
//        env.setRestartStrategy(RestartStrategies.failureRateRestart(
//                3, Time.days(1), Time.minutes(1)
//        ));
//        env.setStateBackend(new HashMapStateBackend());
//        env.getCheckpointConfig().setCheckpointStorage(
//                "hdfs://hadoop102:8020/ck"
//        );
//        System.setProperty("HADOOP_USER_NAME", "atguigu");

        //TODO 2.读取Kafka 页面日志主题创建流
        String topic = "dwd_traffic_page_log";
        String groupId = "dws_user_login_window_211126";
        DataStreamSource<String> kafkaDS = env.addSource(MyKafkaUtil.getFlinkKafkaConsumer(topic, groupId));

        //TODO 3.转换数据为JSON对象并过滤数据
        SingleOutputStreamOperator<JSONObject> jsonObjDS = kafkaDS.flatMap(new FlatMapFunction<String, JSONObject>() {
    
    
            @Override
            public void flatMap(String value, Collector<JSONObject> out) throws Exception {
    
    
                //转换为JSON对象
                JSONObject jsonObject = JSON.parseObject(value);
                //获取UID以及上一跳页面
                String uid = jsonObject.getJSONObject("common").getString("uid");
                String lastPageId = jsonObject.getJSONObject("page").getString("last_page_id");
                //当UID不等于空并且上一跳页面为null或者为"login"才是登录数据
                if (uid != null && (lastPageId == null || lastPageId.equals("login"))) {
    
    
                    out.collect(jsonObject);
                }
            }
        });

        //TODO 4.提取事件时间生成Watermark
        SingleOutputStreamOperator<JSONObject> jsonObjWithWmDS = jsonObjDS.assignTimestampsAndWatermarks(WatermarkStrategy.<JSONObject>forBoundedOutOfOrderness(Duration.ofSeconds(2)).withTimestampAssigner(new SerializableTimestampAssigner<JSONObject>() {
    
    
            @Override
            public long extractTimestamp(JSONObject element, long recordTimestamp) {
    
    
                return element.getLong("ts");
            }
        }));

        //TODO 5.按照uid分组
        KeyedStream<JSONObject, String> keyedStream = jsonObjWithWmDS.keyBy(json -> json.getJSONObject("common").getString("uid"));

        //TODO 6.使用状态编程获取独立用户以及七日回流用户
        SingleOutputStreamOperator<UserLoginBean> userLoginDS = keyedStream.flatMap(new RichFlatMapFunction<JSONObject, UserLoginBean>() {
    
    

            private ValueState<String> lastLoginState;

            @Override
            public void open(Configuration parameters) throws Exception {
    
    
                lastLoginState = getRuntimeContext().getState(new ValueStateDescriptor<String>("last-login", String.class));
            }

            @Override
            public void flatMap(JSONObject value, Collector<UserLoginBean> out) throws Exception {
    
    

                //获取状态日期以及当前数据日期
                String lastLoginDt = lastLoginState.value();
                Long ts = value.getLong("ts");
                String curDt = DateFormatUtil.toDate(ts);

                //定义当日独立用户数&七日回流用户数
                long uv = 0L;
                long backUv = 0L;

                if (lastLoginDt == null) {
    
    
                    uv = 1L;
                    lastLoginState.update(curDt);
                } else if (!lastLoginDt.equals(curDt)) {
    
    

                    uv = 1L;
                    lastLoginState.update(curDt);

                    if ((DateFormatUtil.toTs(curDt) - DateFormatUtil.toTs(lastLoginDt)) / (24 * 60 * 60 * 1000L) >= 8) {
    
    
                        backUv = 1L;
                    }
                }

                if (uv != 0L) {
    
    
                    out.collect(new UserLoginBean("", "",
                            backUv, uv, ts));
                }
            }
        });

        //TODO 7.开窗聚合
        SingleOutputStreamOperator<UserLoginBean> resultDS = userLoginDS.windowAll(TumblingEventTimeWindows.of(Time.seconds(10)))
                .reduce(new ReduceFunction<UserLoginBean>() {
    
    
                    @Override
                    public UserLoginBean reduce(UserLoginBean value1, UserLoginBean value2) throws Exception {
    
    
                        value1.setBackCt(value1.getBackCt() + value2.getBackCt());
                        value1.setUuCt(value1.getUuCt() + value2.getUuCt());
                        return value1;
                    }
                }, new AllWindowFunction<UserLoginBean, UserLoginBean, TimeWindow>() {
    
    
                    @Override
                    public void apply(TimeWindow window, Iterable<UserLoginBean> values, Collector<UserLoginBean> out) throws Exception {
    
    
                        UserLoginBean next = values.iterator().next();

                        next.setEdt(DateFormatUtil.toYmdHms(window.getEnd()));
                        next.setStt(DateFormatUtil.toYmdHms(window.getStart()));
                        next.setTs(System.currentTimeMillis());

                        out.collect(next);
                    }
                });

        //TODO 8.将数据写出到ClickHouse
        resultDS.print(">>>>>>>>>>");
        resultDS.addSink(MyClickHouseUtil.getSinkFunction("insert into dws_user_user_login_window values(?,?,?,?,?)"));

        //TODO 9.启动任务
        env.execute("DwsUserUserLoginWindow");
    }
}

User domain user registration window summary table

main mission

Read data from the DWD layer user registration topic, count the number of registered users in each window, and write the results into ClickHouse.

Idea analysis

1) Read Kafka user registration topic data

2) Convert data structure

String converted to JSONObject.

3) Set the watermark

4) Window opening and aggregation

5) Write out to ClickHouse

diagram

ClickHouse table creation statement

drop table if exists dws_user_user_register_window;
create table if not exists dws_user_user_register_window
(
    stt         DateTime,
    edt         DateTime,
    register_ct UInt64,
    ts          UInt64
) engine = ReplacingMergeTree(ts)
      partition by toYYYYMMDD(stt)
      order by (stt, edt);

the code

1) Entity class UserRegisterBean

@Data
@AllArgsConstructor
public class UserRegisterBean {
    
    
    // 窗口起始时间
    String stt;
    // 窗口终止时间
    String edt;
    // 注册用户数
    Long registerCt;
    // 时间戳
    Long ts;
}

2) Main program

//数据流:Web/app -> nginx -> 业务服务器(Mysql) -> Maxwell -> Kafka(ODS) -> FlinkApp -> Kafka(DWD) -> FlinkApp -> ClickHouse(DWS)
//程  序:Mock  ->  Mysql  ->  Maxwell -> Kafka(ZK)  ->  DwdUserRegister -> Kafka(ZK) -> DwsUserUserRegisterWindow -> ClickHouse(ZK)
public class DwsUserUserRegisterWindow {
    
    

    public static void main(String[] args) throws Exception {
    
    

        //TODO 1.获取执行环境
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // 1.1 状态后端设置
//        env.enableCheckpointing(3000L, CheckpointingMode.EXACTLY_ONCE);
//        env.getCheckpointConfig().setCheckpointTimeout(60 * 1000L);
//        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(3000L);
//        env.getCheckpointConfig().enableExternalizedCheckpoints(
//                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION
//        );
//        env.setRestartStrategy(RestartStrategies.failureRateRestart(
//                3, Time.days(1), Time.minutes(1)
//        ));
//        env.setStateBackend(new HashMapStateBackend());
//        env.getCheckpointConfig().setCheckpointStorage(
//                "hdfs://hadoop102:8020/ck"
//        );
//        System.setProperty("HADOOP_USER_NAME", "atguigu");

        //TODO 2.读取Kafka DWD层用户注册主题数据创建流
        String topic = "dwd_user_register";
        String groupId = "dws_user_user_register_window_211126";
        DataStreamSource<String> kafkaDS = env.addSource(MyKafkaUtil.getFlinkKafkaConsumer(topic, groupId));

        //TODO 3.将每行数据转换为JavaBean对象
        SingleOutputStreamOperator<UserRegisterBean> userRegisterDS = kafkaDS.map(line -> {
    
    
            JSONObject jsonObject = JSON.parseObject(line);

            //yyyy-MM-dd HH:mm:ss
            String createTime = jsonObject.getString("create_time");

            return new UserRegisterBean("",
                    "",
                    1L,
                    DateFormatUtil.toTs(createTime, true));
        });

        //TODO 4.提取时间戳生成Watermark
        SingleOutputStreamOperator<UserRegisterBean> userRegisterWithWmDS = userRegisterDS.assignTimestampsAndWatermarks(WatermarkStrategy.<UserRegisterBean>forBoundedOutOfOrderness(Duration.ofSeconds(2)).withTimestampAssigner(new SerializableTimestampAssigner<UserRegisterBean>() {
    
    
            @Override
            public long extractTimestamp(UserRegisterBean element, long recordTimestamp) {
    
    
                return element.getTs();
            }
        }));

        //TODO 5.开窗聚合
        SingleOutputStreamOperator<UserRegisterBean> resultDS = userRegisterWithWmDS.windowAll(TumblingEventTimeWindows.of(Time.seconds(10)))
                .reduce(new ReduceFunction<UserRegisterBean>() {
    
    
                    @Override
                    public UserRegisterBean reduce(UserRegisterBean value1, UserRegisterBean value2) throws Exception {
    
    
                        value1.setRegisterCt(value1.getRegisterCt() + value2.getRegisterCt());
                        return value1;
                    }
                    
                    
                }, new AllWindowFunction<UserRegisterBean, UserRegisterBean, TimeWindow>() {
    
    
                    @Override
                    public void apply(TimeWindow window, Iterable<UserRegisterBean> values, Collector<UserRegisterBean> out) throws Exception {
    
    
                        UserRegisterBean userRegisterBean = values.iterator().next();

                        userRegisterBean.setTs(System.currentTimeMillis());
                        userRegisterBean.setStt(DateFormatUtil.toYmdHms(window.getStart()));
                        userRegisterBean.setEdt(DateFormatUtil.toYmdHms(window.getEnd()));

                        out.collect(userRegisterBean);
                    }
                });

        //TODO 6.将数据写出到ClickHouse
        resultDS.print(">>>>>>>>>>");
        resultDS.addSink(MyClickHouseUtil.getSinkFunction("insert into dws_user_user_register_window values(?,?,?,?)"));

        //TODO 7.启动任务
        env.execute("DwsUserUserRegisterWindow");
    }
}

Transaction domain cart-add window summary table

main mission

Read the cart-add detail data from Kafka, count the number of unique users who add items to the cart in each window each day, and write the results into ClickHouse.

Idea analysis

1) Read data from the Kafka cart-add detail topic

2) Convert data structure

Convert the data in the stream from String to JSONObject.

3) Set the watermark

4) Group by user id

5) Filter for unique cart-add users

Use Flink state programming to maintain the date of each user's last cart-add in the state.

If the last cart-add date is null or differs from the current date, keep the record and update the state; otherwise discard it.

6) Window opening and aggregation

The number of records in the window is the number of unique cart-add users. Supplement the window start and end times, set the timestamp field to the current system time, and send the result downstream.

7) Write data to ClickHouse.

diagram

ClickHouse table creation statement

drop table if exists dws_trade_cart_add_uu_window;
create table if not exists dws_trade_cart_add_uu_window
(
    stt            DateTime,
    edt            DateTime,
    cart_add_uu_ct UInt64,
    ts             UInt64
) engine = ReplacingMergeTree(ts)
      partition by toYYYYMMDD(stt)
      order by (stt, edt);

the code

1) Entity class CartAddUuBean

@Data
@AllArgsConstructor
public class CartAddUuBean {
    
    
    // 窗口起始时间
    String stt;

    // 窗口闭合时间
    String edt;

    // 加购独立用户数
    Long cartAddUuCt;

    // 时间戳
    Long ts;
}

2) Main program

//数据流:Web/app -> nginx -> 业务服务器(Mysql) -> Maxwell -> Kafka(ODS) -> FlinkApp -> Kafka(DWD) -> FlinkApp -> ClickHouse(DWS)
//程  序:Mock  ->  Mysql  ->  Maxwell -> Kafka(ZK)  ->  DwdTradeCartAdd -> Kafka(ZK) -> DwdTradeCartAdd -> ClickHouse(ZK)
public class DwsTradeCartAddUuWindow {
    
    
    public static void main(String[] args) throws Exception {
    
    

        //TODO 1.获取执行环境
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // 1.1 状态后端设置
//        env.enableCheckpointing(3000L, CheckpointingMode.EXACTLY_ONCE);
//        env.getCheckpointConfig().setCheckpointTimeout(60 * 1000L);
//        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(3000L);
//        env.getCheckpointConfig().enableExternalizedCheckpoints(
//                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION
//        );
//        env.setRestartStrategy(RestartStrategies.failureRateRestart(
//                3, Time.days(1), Time.minutes(1)
//        ));
//        env.setStateBackend(new HashMapStateBackend());
//        env.getCheckpointConfig().setCheckpointStorage(
//                "hdfs://hadoop102:8020/ck"
//        );
//        System.setProperty("HADOOP_USER_NAME", "atguigu");

        //TODO 2.读取 Kafka DWD层 加购事实表
        String topic = "dwd_trade_cart_add";
        String groupId = "dws_trade_cart_add_uu_window_211126";
        DataStreamSource<String> kafkaDS = env.addSource(MyKafkaUtil.getFlinkKafkaConsumer(topic, groupId));

        //TODO 3.将数据转换为JSON对象
        SingleOutputStreamOperator<JSONObject> jsonObjDS = kafkaDS.map(JSON::parseObject);

        //TODO 4.提取事件时间生成Watermark
        SingleOutputStreamOperator<JSONObject> jsonObjWithWmDS = jsonObjDS.assignTimestampsAndWatermarks(WatermarkStrategy.<JSONObject>forBoundedOutOfOrderness(Duration.ofSeconds(2)).withTimestampAssigner(new SerializableTimestampAssigner<JSONObject>() {
    
    
            @Override
            public long extractTimestamp(JSONObject element, long recordTimestamp) {
    
    

                String operateTime = element.getString("operate_time");

                if (operateTime != null) {
    
    
                    return DateFormatUtil.toTs(operateTime, true);
                } else {
    
    
                    return DateFormatUtil.toTs(element.getString("create_time"), true);
                }
            }
        }));

        //TODO 5.按照user_id分组
        KeyedStream<JSONObject, String> keyedStream = jsonObjWithWmDS.keyBy(json -> json.getString("user_id"));

        //TODO 6.使用状态编程提取独立加购用户
        SingleOutputStreamOperator<CartAddUuBean> cartAddDS = keyedStream.flatMap(new RichFlatMapFunction<JSONObject, CartAddUuBean>() {
    
    

            private ValueState<String> lastCartAddState;

            @Override
            public void open(Configuration parameters) throws Exception {
    
    

                StateTtlConfig ttlConfig = new StateTtlConfig.Builder(Time.days(1))
                        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                        .build();

                ValueStateDescriptor<String> stateDescriptor = new ValueStateDescriptor<>("last-cart", String.class);
                stateDescriptor.enableTimeToLive(ttlConfig);

                lastCartAddState = getRuntimeContext().getState(stateDescriptor);
            }

            @Override
            public void flatMap(JSONObject value, Collector<CartAddUuBean> out) throws Exception {
    
    

                //获取状态数据以及当前数据的日期
                String lastDt = lastCartAddState.value();
                String operateTime = value.getString("operate_time");
                String curDt = null;
                if (operateTime != null) {
    
    
                    curDt = operateTime.split(" ")[0];
                } else {
    
    
                    String createTime = value.getString("create_time");
                    curDt = createTime.split(" ")[0];
                }

                if (lastDt == null || !lastDt.equals(curDt)) {
    
    
                    lastCartAddState.update(curDt);
                    out.collect(new CartAddUuBean(
                            "",
                            "",
                            1L,
                            null));
                }
            }
        });

        //TODO 7.开窗、聚合
        SingleOutputStreamOperator<CartAddUuBean> resultDS = cartAddDS.windowAll(TumblingEventTimeWindows.of(org.apache.flink.streaming.api.windowing.time.Time.seconds(10)))
                .reduce(new ReduceFunction<CartAddUuBean>() {
    
    
                    @Override
                    public CartAddUuBean reduce(CartAddUuBean value1, CartAddUuBean value2) throws Exception {
    
    
                        value1.setCartAddUuCt(value1.getCartAddUuCt() + value2.getCartAddUuCt());
                        return value1;
                    }
                }, new AllWindowFunction<CartAddUuBean, CartAddUuBean, TimeWindow>() {
    
    
                    @Override
                    public void apply(TimeWindow window, Iterable<CartAddUuBean> values, Collector<CartAddUuBean> out) throws Exception {
    
    
                        CartAddUuBean next = values.iterator().next();

                        next.setEdt(DateFormatUtil.toYmdHms(window.getEnd()));
                        next.setStt(DateFormatUtil.toYmdHms(window.getStart()));
                        next.setTs(System.currentTimeMillis());

                        out.collect(next);
                    }
                });

        //TODO 8.将数据写出到ClickHouse
        resultDS.print(">>>>>>>>>>>");
        resultDS.addSink(MyClickHouseUtil.getSinkFunction("insert into dws_trade_cart_add_uu_window values (?,?,?,?)"));

        //TODO 9.启动任务
        env.execute("DwsTradeCartAddUuWindow");
    }
}

Transaction domain payment window summary table (※)

main mission

Read the transaction domain payment success topic data from Kafka, and count the number of unique users who paid successfully and the number of users who paid successfully for the first time.

Idea analysis

As mentioned in the DWD layer, a retract stream is produced when the order detail table data is generated. In the dataset produced by the left join, there may be multiple records with the same unique key; this has already been explained and will not be repeated. The retracted records appear as null values in Kafka and can be filtered out with a simple check. What remains to be considered is how to deduplicate the rest of the data.

Analyzing how the retract stream is generated, we can see that the record with complete fields must be produced later than the incomplete ones. To ensure correct statistics, we should keep the record with the most complete field content, which, per the discussion above, is the one generated last. To filter this data by time, we first need to obtain each record's generation time.

1) Knowledge reserve

FlinkSQL provides several functions that can get the current timestamp:

  • localtimestamp: returns the current timestamp in the local time zone, and the return type is TIMESTAMP(3). In stream processing mode, the time is calculated once for each record. Whereas in batch mode, the time is calculated only once at the start of the query, and the same time is used for all data.
  • current_timestamp: returns the current timestamp in the local time zone, and the return type is TIMESTAMP_LTZ(3). In stream processing mode, the time is calculated once for each record. Whereas in batch mode, the time is calculated only once at the start of the query, and the same time is used for all data.
  • now(): Same as current_timestamp.
  • current_row_timestamp(): Returns the current timestamp in the local time zone, and the return type is TIMESTAMP_LTZ(3). No matter in stream processing mode or batch processing mode, the time is calculated once for each row of data.

Function test. The query statement is as follows:

tableEnv.sqlQuery("select localtimestamp," +
                "current_timestamp," +
                "now()," +
                "current_row_timestamp()")
                .execute()
                .print();

The query results are as follows:

+----+-------------------------+-------------------------+-------------------------+-------------------------+
| op |          localtimestamp |       current_timestamp |                  EXPR$2 |                  EXPR$3 |
+----+-------------------------+-------------------------+-------------------------+-------------------------+
| +I | 2022-04-13 20:42:28.529 | 2022-04-13 20:42:28.529 | 2022-04-13 20:42:28.529 | 2022-04-13 20:42:28.529Z |
+----+-------------------------+-------------------------+-------------------------+-------------------------+
1 row in set

The dynamic table here runs in stream processing mode, so any of the four functions could be used; current_row_timestamp() is selected.
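
For reference, a minimal sketch of how the chosen function can be exposed as the row_op_ts column that the deduplication code below compares (the table and the other column names here are illustrative assumptions, not the project's actual DDL):

// Hedged sketch: expose the generation time as a row_op_ts column so that downstream
// deduplication can compare it ("order_detail" and its columns are placeholder names).
Table detailWithTs = tableEnv.sqlQuery(
        "select id, order_id, current_row_timestamp() as row_op_ts from order_detail");
tableEnv.createTemporaryView("order_detail_with_ts", detailWithTs);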

2) Time comparison tool

The generation time obtained in the dynamic table is accurate to milliseconds. The date formatting tool class provided earlier cannot convert such date strings to timestamps, so the generation times of two records cannot be compared by converting them directly. Therefore a separate utility class is wrapped to compare TIMESTAMP_LTZ(3) values. The comparison logic splits the time at the decimal point: the part before it has the format yyyy-MM-dd HH:mm:ss and can be converted to a timestamp and compared directly; if that part is equal, the part after the decimal point is converted to an integer and compared. This realizes the comparison of TIMESTAMP_LTZ(3) times.

3) Deduplication ideas

After obtaining the data generation time, the next question to consider is how to obtain the data with the latest generation time. Here are three ideas:

  • Group by the unique key and open a window; before the window closes, compare the generation times of all records in the window, send the record with the latest generation time downstream, and discard the rest (see the sketch after this list).
  • Group by the unique key and, for each key, maintain state and a timer. When the state is null, register the timer and store the record in the state. Afterwards, compare each incoming record's generation time with the one in the state and keep only the newer record in the state. If the two generation times are equal (system time precision is insufficient), keep the record that entered the operator later: because the program's parallelism equals the number of Kafka partitions, data within a key arrives in order, so the later record is the newer one. (Refer to the transaction domain payment window summary table.)
  • If the downstream requirements do not use any field from the right table of the left join, simply keep the first record and output it. (Refer to the transaction domain order window summary table.)
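
A minimal sketch of the first idea, assuming a JSON stream jsonObjWithWmDS that already carries watermarks and a row_op_ts field, and using the TimestampLtz3CompareUtil class defined later in this section (the 5-second window is an arbitrary example; this project ultimately uses ideas two and three):

// Idea 1 (sketch only): group by the unique key, collect records in a short event-time
// window, and keep only the record whose row_op_ts is the latest.
SingleOutputStreamOperator<JSONObject> latestPerKeyDS = jsonObjWithWmDS
        .keyBy(json -> json.getString("order_detail_id"))
        .window(TumblingEventTimeWindows.of(Time.seconds(5)))
        .reduce((v1, v2) -> TimestampLtz3CompareUtil.compare(
                v1.getString("row_op_ts"), v2.getString("row_op_ts")) > 0 ? v1 : v2);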

Choose among these ideas according to the requirement. For this requirement the first two ideas are both feasible, and idea two is chosen here. (Note: idea three could actually be used for this requirement as well.)

The data in this section comes from the Kafka topic dwd_trade_pay_detail_suc, whose data is obtained by inner joining the three tables payment_info, dwd_trade_order_detail, and base_dic; this process does not produce duplicates, so any duplicates in this table come from the order detail table. The data of dwd_trade_order_detail comes from dwd_trade_order_pre_process, which uses a left join when generating data and therefore contains both null records and duplicate records. The Kafka connector used to read the order detail table filters out the null records, and that program only filters without deduplicating. As a result, the payment success detail table contains no null data, but it does contain duplicate records for the same unique key order_detail_id, so it only needs to be deduplicated.

4) Implementation steps

(1) Read data from Kafka payment success detail topic

(2) Convert data structure

String converted to JSONObject.

(3) Group by unique key

(4) Deduplication

Same as above.

(5) Set the watermark and group by user_id

(6) Count the number of independent payers and new payers

Use Flink state programming to maintain the user's last payment date in the state.

If the last payment date is null, set both the number of first-time payers and the number of unique payers to 1. Otherwise set the number of first-time payers to 0 and check whether the last payment date is the current day: if not, set the number of unique payers to 1, otherwise 0. Finally, update the payment date in the state to the current day.

(7) Window opening and aggregation

The measurement fields are summed, the window start time and end time fields are supplemented, and the ts field is set to the current system timestamp.

(8) Write out to ClickHouse

diagram


ClickHouse table creation statement

drop table if exists dws_trade_payment_suc_window;
create table if not exists dws_trade_payment_suc_window
(
    stt                           DateTime,
    edt                           DateTime,
    payment_suc_unique_user_count UInt64,
    payment_new_user_count        UInt64,
    ts                            UInt64
) engine = ReplacingMergeTree(ts)
      partition by toYYYYMMDD(stt)
      order by (stt, edt);

the code

1) Entity class TradePaymentWindowBean

@Data
@AllArgsConstructor
public class TradePaymentWindowBean {
    
    
    // 窗口起始时间
    String stt;

    // 窗口终止时间
    String edt;

    // 支付成功独立用户数
    Long paymentSucUniqueUserCount;

    // 支付成功新用户数
    Long paymentSucNewUserCount;

    // 时间戳
    Long ts;
}

2) FlinkSQL time data type TimestampLtz3 comparison tool class TimestampLtz3CompareUtil

public class TimestampLtz3CompareUtil {
    
    

    // 数据格式 2022-04-01 10:20:47.302Z
    // 数据格式 2022-04-01 10:20:47.041Z
    // 数据格式 2022-04-01 10:20:47.410Z
    // 数据格式 2022-04-01 10:20:47.41Z
    public static int compare(String timestamp1, String timestamp2) {
    
    

        // 1. 去除末尾的时区标志,'Z' 表示 0 时区
        String cleanedTime1 = timestamp1.substring(0, timestamp1.length() - 1);
        String cleanedTime2 = timestamp2.substring(0, timestamp2.length() - 1);

        // 2. 提取小于 1秒的部分
        String[] timeArr1 = cleanedTime1.split("\\.");
        String[] timeArr2 = cleanedTime2.split("\\.");
        String microseconds1 = new StringBuilder(timeArr1[timeArr1.length - 1])
                .append("000").toString().substring(0, 3);
        String microseconds2 = new StringBuilder(timeArr2[timeArr2.length - 1])
                .append("000").toString().substring(0, 3);

        int micro1 = Integer.parseInt(microseconds1);
        int micro2 = Integer.parseInt(microseconds2);

        // 3. 提取 yyyy-MM-dd HH:mm:ss 的部分
        String date1 = timeArr1[0];
        String date2 = timeArr2[0];
        Long ts1 = DateFormatUtil.toTs(date1, true);
        Long ts2 = DateFormatUtil.toTs(date2, true);

        // 4. 获得精确到毫秒的时间戳
        long microTs1 = ts1 + micro1;
        long microTs2 = ts2 + micro2;

        long divTs = microTs1 - microTs2;

        return divTs < 0 ? -1 : divTs == 0 ? 0 : 1;
    }

    public static void main(String[] args) {
    
    
        System.out.println(compare("2022-04-01 11:10:55.042Z","2022-04-01 11:10:55.041Z"));
        //System.out.println(Integer.parseInt("095"));
    }
}

3) Main program


//数据流:Web/app -> nginx -> 业务服务器(Mysql) -> Maxwell -> Kafka(ODS) -> FlinkApp -> Kafka(DWD) -> FlinkApp -> Kafka(DWD) -> FlinkApp -> Kafka(DWD) -> FlinkApp -> ClickHouse(DWS)
//程  序:Mock  ->  Mysql  ->  Maxwell -> Kafka(ZK)  ->  DwdTradeOrderPreProcess -> Kafka(ZK) -> DwdTradeOrderDetail -> Kafka(ZK) -> DwdTradePayDetailSuc -> Kafka(ZK) -> DwsTradePaymentSucWindow -> ClickHouse(ZK)
public class DwsTradePaymentSucWindow {
    
    

    public static void main(String[] args) throws Exception {
    
    

        //TODO 1.获取执行环境
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1); //生产环境中设置为Kafka主题的分区数

        //1.1 开启CheckPoint
        //env.enableCheckpointing(5 * 60000L, CheckpointingMode.EXACTLY_ONCE);
        //env.getCheckpointConfig().setCheckpointTimeout(10 * 60000L);
        //env.getCheckpointConfig().setMaxConcurrentCheckpoints(2);
        //env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 5000L));

        //1.2 设置状态后端
        //env.setStateBackend(new HashMapStateBackend());
        //env.getCheckpointConfig().setCheckpointStorage("hdfs://hadoop102:8020/211126/ck");
        //System.setProperty("HADOOP_USER_NAME", "atguigu");

        //1.3 设置状态的TTL  生产环境设置为最大乱序程度
        //tableEnv.getConfig().setIdleStateRetention(Duration.ofSeconds(5));

        //TODO 2.读取DWD层成功支付主题数据创建流
        String topic = "dwd_trade_pay_detail_suc";
        String groupId = "dws_trade_payment_suc_window_211126";
        DataStreamSource<String> kafkaDS = env.addSource(MyKafkaUtil.getFlinkKafkaConsumer(topic, groupId));

        //TODO 3.将数据转换为JSON对象
        SingleOutputStreamOperator<JSONObject> jsonObjDS = kafkaDS.flatMap(new FlatMapFunction<String, JSONObject>() {
    
    
            @Override
            public void flatMap(String value, Collector<JSONObject> out) throws Exception {
    
    
                try {
    
    
                    JSONObject jsonObject = JSON.parseObject(value);
                    out.collect(jsonObject);
                } catch (Exception e) {
    
    
                    System.out.println(">>>>>>>" + value);
                }
            }
        });

        //TODO 4.按照订单明细id分组
        KeyedStream<JSONObject, String> jsonObjKeyedByDetailIdDS = jsonObjDS.keyBy(json -> json.getString("order_detail_id"));

        //TODO 5.使用状态编程保留最新的数据输出
        SingleOutputStreamOperator<JSONObject> filterDS = jsonObjKeyedByDetailIdDS.process(new KeyedProcessFunction<String, JSONObject, JSONObject>() {
    
    

            private ValueState<JSONObject> valueState;

            @Override
            public void open(Configuration parameters) throws Exception {
    
    
                valueState = getRuntimeContext().getState(new ValueStateDescriptor<JSONObject>("value-state", JSONObject.class));
            }

            @Override
            public void processElement(JSONObject value, Context ctx, Collector<JSONObject> out) throws Exception {
    
    

                //获取状态中的数据
                JSONObject state = valueState.value();

                //判断状态是否为null
                if (state == null) {
    
    
                    valueState.update(value);
                    ctx.timerService().registerProcessingTimeTimer(ctx.timerService().currentProcessingTime() + 5000L);
                } else {
    
    
                    String stateRt = state.getString("row_op_ts");
                    String curRt = value.getString("row_op_ts");

                    int compare = TimestampLtz3CompareUtil.compare(stateRt, curRt);

                    if (compare != 1) {
    
    
                        valueState.update(value);
                    }
                }
            }

            @Override
            public void onTimer(long timestamp, OnTimerContext ctx, Collector<JSONObject> out) throws Exception {
    
    
                super.onTimer(timestamp, ctx, out);
                //输出并清空状态数据
                JSONObject value = valueState.value();
                out.collect(value);

                valueState.clear();
            }
        });

        //TODO 6.提取事件时间生成Watermark
        SingleOutputStreamOperator<JSONObject> jsonObjWithWmDS = filterDS.assignTimestampsAndWatermarks(WatermarkStrategy.<JSONObject>forBoundedOutOfOrderness(Duration.ofSeconds(2)).withTimestampAssigner(new SerializableTimestampAssigner<JSONObject>() {
    
    
            @Override
            public long extractTimestamp(JSONObject element, long recordTimestamp) {
    
    
                String callbackTime = element.getString("callback_time");
                return DateFormatUtil.toTs(callbackTime, true);
            }
        }));

        //TODO 7.按照user_id分组
        KeyedStream<JSONObject, String> keyedByUidDS = jsonObjWithWmDS.keyBy(json -> json.getString("user_id"));

        //TODO 8.提取独立支付成功用户数
        SingleOutputStreamOperator<TradePaymentWindowBean> tradePaymentDS = keyedByUidDS.flatMap(new RichFlatMapFunction<JSONObject, TradePaymentWindowBean>() {
    
    

            private ValueState<String> lastDtState;

            @Override
            public void open(Configuration parameters) throws Exception {
    
    
                lastDtState = getRuntimeContext().getState(new ValueStateDescriptor<String>("last-dt", String.class));
            }

            @Override
            public void flatMap(JSONObject value, Collector<TradePaymentWindowBean> out) throws Exception {
    
    

                //取出状态中以及当前数据的日期
                String lastDt = lastDtState.value();
                String curDt = value.getString("callback_time").split(" ")[0];

                //定义当日支付人数以及新增付费人数
                long paymentSucUniqueUserCount = 0L;
                long paymentSucNewUserCount = 0L;

                //判断状态日期是否为null
                if (lastDt == null) {
    
    
                    paymentSucUniqueUserCount = 1L;
                    paymentSucNewUserCount = 1L;
                    lastDtState.update(curDt);
                } else if (!lastDt.equals(curDt)) {
    
    
                    paymentSucUniqueUserCount = 1L;
                    lastDtState.update(curDt);
                }

                //返回数据
                if (paymentSucUniqueUserCount == 1L) {
    
    
                    out.collect(new TradePaymentWindowBean("",
                                                           "",
                                                           paymentSucUniqueUserCount,
                                                           paymentSucNewUserCount,
                                                           null));
                }
            }
        });

        //TODO 9.开窗、聚合
        SingleOutputStreamOperator<TradePaymentWindowBean> resultDS = tradePaymentDS.windowAll(TumblingEventTimeWindows.of(Time.seconds(10)))
            .reduce(new ReduceFunction<TradePaymentWindowBean>() {
    
    
                @Override
                public TradePaymentWindowBean reduce(TradePaymentWindowBean value1, TradePaymentWindowBean value2) throws Exception {
    
    
                    value1.setPaymentSucUniqueUserCount(value1.getPaymentSucUniqueUserCount() + value2.getPaymentSucUniqueUserCount());
                    value1.setPaymentSucNewUserCount(value1.getPaymentSucNewUserCount() + value2.getPaymentSucNewUserCount());
                    return value1;
                }
            }, new AllWindowFunction<TradePaymentWindowBean, TradePaymentWindowBean, TimeWindow>() {
    
    
                @Override
                public void apply(TimeWindow window, Iterable<TradePaymentWindowBean> values, Collector<TradePaymentWindowBean> out) throws Exception {
    
    

                    TradePaymentWindowBean next = values.iterator().next();

                    next.setTs(System.currentTimeMillis());
                    next.setEdt(DateFormatUtil.toYmdHms(window.getEnd()));
                    next.setStt(DateFormatUtil.toYmdHms(window.getStart()));

                    out.collect(next);
                }
            });

        //TODO 10.将数据写出到ClickHouse
        resultDS.print(">>>>>>>");
        resultDS.addSink(MyClickHouseUtil.getSinkFunction("insert into dws_trade_payment_suc_window values(?,?,?,?,?)"));

        //TODO 11.启动任务
        env.execute("DwsTradePaymentSucWindow");

    }
}

Transaction domain order window summary table (※)

main mission

Read data from the Kafka order detail topic, deduplicate it, count the number of unique users who placed orders and the number of new order users for the day, encapsulate the results as an entity class, and write them into ClickHouse.

Idea analysis

1) Read data from Kafka order details topic

2) Convert data structure

The Kafka order detail topic is produced by reading the order preprocessing topic through the Kafka connector and filtering it. The Kafka connector filters out null records, so the order detail topic contains no null data and the records can be converted directly.

3) Group by order_detail_id

order_detail_id is the unique key of the data.

4) Deduplicate the data with the same order_detail_id

Deduplicate the data according to the scheme mentioned above.

5) Set the watermark

6) Group by user id

7) Calculate the value of the measure field

(1) The number of unique users who placed orders and the number of new order users on the same day

Use Flink state programming to maintain the date of the user's last order in the state.

If the last order date is null, set both the number of first-time order users and the number of unique order users to 1. Otherwise set the number of first-time order users to 0 and check whether the last order date is the current day: if not, set the number of unique order users to 1, otherwise 0. Finally, update the order date in the state to the current day.

(2) The remaining measure fields can take their values directly from the data in the stream.

8) Window opening and aggregation

The measurement fields are summed, the window start time and end time fields are supplemented, and the ts field is set to the current system timestamp.

9) Write out to ClickHouse

diagram


Create table statement

drop table if exists dws_trade_order_window;
create table if not exists dws_trade_order_window
(
    stt                          DateTime,
    edt                          DateTime,
    order_unique_user_count      UInt64,
    order_new_user_count         UInt64,
    order_activity_reduce_amount Decimal(38, 20),
    order_coupon_reduce_amount   Decimal(38, 20),
    order_origin_total_amount    Decimal(38, 20),
    ts                           UInt64
) engine = ReplacingMergeTree(ts)
      partition by toYYYYMMDD(stt)
      order by (stt, edt);

the code

1) Entity class TradeOrderBean

@Data
@AllArgsConstructor
@Builder
public class TradeOrderBean {
    
    
    // 窗口起始时间
    String stt;

    // 窗口关闭时间
    String edt;

    // 下单独立用户数
    Long orderUniqueUserCount;

    // 下单新用户数
    Long orderNewUserCount;

    // 下单活动减免金额
    Double orderActivityReduceAmount;

    // 下单优惠券减免金额
    Double orderCouponReduceAmount;

    // 下单原始金额
    Double orderOriginalTotalAmount;

    // 时间戳
    Long ts;
}

2) Main program

//数据流:Web/app -> nginx -> 业务服务器(Mysql) -> Maxwell -> Kafka(ODS) -> FlinkApp -> Kafka(DWD) -> FlinkApp -> Kafka(DWD) -> FlinkApp -> ClickHouse(DWS)
//程  序:Mock  ->  Mysql  ->  Maxwell -> Kafka(ZK)  ->  DwdTradeOrderPreProcess -> Kafka(ZK) -> DwdTradeOrderDetail -> Kafka(ZK) -> DwsTradeOrderWindow -> ClickHouse(ZK)
public class DwsTradeOrderWindow {
    
    

    public static void main(String[] args) throws Exception {
    
    

        //TODO 1.获取执行环境
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1); //生产环境中设置为Kafka主题的分区数

        //1.1 开启CheckPoint
        //env.enableCheckpointing(5 * 60000L, CheckpointingMode.EXACTLY_ONCE);
        //env.getCheckpointConfig().setCheckpointTimeout(10 * 60000L);
        //env.getCheckpointConfig().setMaxConcurrentCheckpoints(2);
        //env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 5000L));

        //1.2 设置状态后端
        //env.setStateBackend(new HashMapStateBackend());
        //env.getCheckpointConfig().setCheckpointStorage("hdfs://hadoop102:8020/211126/ck");
        //System.setProperty("HADOOP_USER_NAME", "atguigu");

        //1.3 设置状态的TTL  生产环境设置为最大乱序程度
        //tableEnv.getConfig().setIdleStateRetention(Duration.ofSeconds(5));

        //TODO 2.读取Kafka DWD层下单主题数据创建流
        String topic = "dwd_trade_order_detail";
        String groupId = "dws_trade_order_window_211126";
        DataStreamSource<String> kafkaDS = env.addSource(MyKafkaUtil.getFlinkKafkaConsumer(topic, groupId));

        //TODO 3.将每行数据转换为JSON对象
        SingleOutputStreamOperator<JSONObject> jsonObjDS = kafkaDS.flatMap(new FlatMapFunction<String, JSONObject>() {
    
    
            @Override
            public void flatMap(String value, Collector<JSONObject> out) throws Exception {
    
    
                try {
    
    
                    JSONObject jsonObject = JSON.parseObject(value);
                    out.collect(jsonObject);
                } catch (Exception e) {
    
    
                    System.out.println("Value>>>>>>>>" + value);
                }
            }
        });

        //TODO 4.按照 order_detail_id 分组
        KeyedStream<JSONObject, String> keyedByDetailIdDS = jsonObjDS.keyBy(json -> json.getString("id"));

        //TODO 5.针对 order_detail_id 进行去重(保留第一条数据即可)
        SingleOutputStreamOperator<JSONObject> filterDS = keyedByDetailIdDS.filter(new RichFilterFunction<JSONObject>() {
    
    

            private ValueState<String> valueState;

            @Override
            public void open(Configuration parameters) throws Exception {
    
    

                StateTtlConfig ttlConfig = new StateTtlConfig.Builder(Time.seconds(5))
                    .setUpdateType(StateTtlConfig.UpdateType.OnReadAndWrite)
                    .build();
                ValueStateDescriptor<String> stateDescriptor = new ValueStateDescriptor<>("is-exists", String.class);
                stateDescriptor.enableTimeToLive(ttlConfig);

                valueState = getRuntimeContext().getState(stateDescriptor);
            }

            @Override
            public boolean filter(JSONObject value) throws Exception {
    
    

                //获取状态数据
                String state = valueState.value();

                //判断状态是否为null
                if (state == null) {
    
    
                    valueState.update("1");
                    return true;
                } else {
    
    
                    return false;
                }
            }
        });

        //TODO 6.提取事件时间生成Watermark
        SingleOutputStreamOperator<JSONObject> jsonObjWithWmDS = filterDS.assignTimestampsAndWatermarks(WatermarkStrategy.<JSONObject>forBoundedOutOfOrderness(Duration.ofSeconds(2)).withTimestampAssigner(new SerializableTimestampAssigner<JSONObject>() {
    
    
            @Override
            public long extractTimestamp(JSONObject element, long recordTimestamp) {
    
    
                return DateFormatUtil.toTs(element.getString("create_time"), true);
            }
        }));

        //TODO 7.按照 user_id 分组
        KeyedStream<JSONObject, String> keyedByUidDS = jsonObjWithWmDS.keyBy(json -> json.getString("user_id"));

        //TODO 8.提取独立下单用户
        SingleOutputStreamOperator<TradeOrderBean> tradeOrderDS = keyedByUidDS.map(new RichMapFunction<JSONObject, TradeOrderBean>() {
    
    

            private ValueState<String> lastOrderDtState;

            @Override
            public void open(Configuration parameters) throws Exception {
    
    
                lastOrderDtState = getRuntimeContext().getState(new ValueStateDescriptor<String>("last-order", String.class));
            }

            @Override
            public TradeOrderBean map(JSONObject value) throws Exception {
    
    

                //获取状态中以及当前数据的日期
                String lastOrderDt = lastOrderDtState.value();
                String curDt = value.getString("create_time").split(" ")[0];

                //定义当天下单人数以及新增下单人数
                long orderUniqueUserCount = 0L;
                long orderNewUserCount = 0L;

                //判断状态是否为null
                if (lastOrderDt == null) {
    
    
                    orderUniqueUserCount = 1L;
                    orderNewUserCount = 1L;

                    lastOrderDtState.update(curDt);
                } else if (!lastOrderDt.equals(curDt)) {
    
    
                    orderUniqueUserCount = 1L;
                    lastOrderDtState.update(curDt);
                }

                //取出下单件数以及单价
                Integer skuNum = value.getInteger("sku_num");
                Double orderPrice = value.getDouble("order_price");

                Double splitActivityAmount = value.getDouble("split_activity_amount");
                if (splitActivityAmount == null) {
    
    
                    splitActivityAmount = 0.0D;
                }
                Double splitCouponAmount = value.getDouble("split_coupon_amount");
                if (splitCouponAmount == null) {
    
    
                    splitCouponAmount = 0.0D;
                }

                return new TradeOrderBean("", "",
                                          orderUniqueUserCount,
                                          orderNewUserCount,
                                          splitActivityAmount,
                                          splitCouponAmount,
                                          skuNum * orderPrice,
                                          null);
            }
        });

        //TODO 9.开窗、聚合
        SingleOutputStreamOperator<TradeOrderBean> resultDS = tradeOrderDS.windowAll(TumblingEventTimeWindows.of(org.apache.flink.streaming.api.windowing.time.Time.seconds(10)))
            .reduce(new ReduceFunction<TradeOrderBean>() {
    
    
                @Override
                public TradeOrderBean reduce(TradeOrderBean value1, TradeOrderBean value2) throws Exception {
    
    
                    value1.setOrderUniqueUserCount(value1.getOrderUniqueUserCount() + value2.getOrderUniqueUserCount());
                    value1.setOrderNewUserCount(value1.getOrderNewUserCount() + value2.getOrderNewUserCount());
                    value1.setOrderOriginalTotalAmount(value1.getOrderOriginalTotalAmount() + value2.getOrderOriginalTotalAmount());
                    value1.setOrderActivityReduceAmount(value1.getOrderActivityReduceAmount() + value2.getOrderActivityReduceAmount());
                    value1.setOrderCouponReduceAmount(value1.getOrderCouponReduceAmount() + value2.getOrderCouponReduceAmount());
                    return value1;
                }
            }, new AllWindowFunction<TradeOrderBean, TradeOrderBean, TimeWindow>() {
    
    
                @Override
                public void apply(TimeWindow window, Iterable<TradeOrderBean> values, Collector<TradeOrderBean> out) throws Exception {
    
    
                    TradeOrderBean tradeOrderBean = values.iterator().next();

                    tradeOrderBean.setTs(System.currentTimeMillis());
                    tradeOrderBean.setEdt(DateFormatUtil.toYmdHms(window.getEnd()));
                    tradeOrderBean.setStt(DateFormatUtil.toYmdHms(window.getStart()));

                    out.collect(tradeOrderBean);
                }
            });

        //TODO 10.将数据写出到ClickHouse
        resultDS.print(">>>>>>>>>>>");
        resultDS.addSink(MyClickHouseUtil.getSinkFunction("insert into dws_trade_order_window values(?,?,?,?,?,?,?,?)"));

        //TODO 11.启动任务
        env.execute("DwsTradeOrderWindow");

    }
}

Transaction domain user-SPU granularity order window summary table (※)

main mission

Read data from the Kafka order detail topic, filter out null data, deduplicate by the unique key, associate dimension information, group by dimension, count the order count and order amount in each dimension and window, and write the data into the ClickHouse transaction domain brand-category-user-SPU granularity order window summary table.

Idea analysis

Compared with the DWS layer wide table mentioned above, this program adds dimension association operations.

Dimension tables are saved in HBase, and the query method must first be supplemented in the PhoenixUtil tool class.

PhoenixUtil query method ideas

This program needs to obtain the values of the related dimension fields from HBase given a known primary key and table name. With this information we can concatenate the query statement, pass it to the query method, and execute the six JDBC steps inside the method: register the driver, obtain a connection, precompile the SQL (obtain the statement object), execute, parse the result set, and close resources. This retrieves the data.

The query result must be passed back to the caller as a return value. What type should the return value be? There may be multiple result rows, so it should be a collection. Next, what should the element type of the collection be? Each row may have multiple fields, so there are two options: a tuple or an entity class. Both schemes are analyzed below.

(1) tuple

If the query results of each row are encapsulated in tuples, there are two strategies:

  • Pass the number of elements in the tuple to the method, and then use switch ... case ... to call the corresponding tuple API for different numbers of elements to encapsulate the query results;
  • Pass the Class object of the tuple to the method, and assign the query result to the tuple object through reflection.

The problem with the first strategy is a lot of repetitive code: the same handling logic must be written for every branch. The problem with the second is that the type information of the tuple elements is lost. An implementation example of the second strategy is as follows:

public class TupleTest {
    
    
    public static void main(String[] args) throws InstantiationException, IllegalAccessException {
    
    
        Class<Tuple3> tuple3Class = Tuple3.class;
        Tuple3 tuple3 = tuple3Class.newInstance();
        Field[] declaredFields = tuple3Class.getDeclaredFields();
        // declaredFields[0] is the static serialVersionUID field, so start at index 1 (f0, f1, f2)
        for (int i = 1; i < declaredFields.length; i++) {
    
    
            Field declaredField = declaredFields[i];
            declaredField.setAccessible(true);
            declaredField.set(tuple3, (char)('a' + i));
        }
        System.out.println(tuple3);
    }
}

The result is as follows:

(b,c,d)

Since there is no type information of the tuple elements, the set method of the Field object can only be called to assign values, resulting in the tuple element types being Object, which may cause inconvenience to downstream data processing.

In addition, the maximum number of elements in the tuple provided by Flink is 25, which will cause problems when there are too many query result fields.


(2) Entity class

Pass the Class object of the entity class into the method through parameters, and assign the query result to the entity class object through reflection.

Based on the above analysis, a custom entity class is chosen as the collection element: each row of the query result corresponds to one entity object, and all objects are put into a List and returned to the caller.

Phoenix dimension query diagram:

Bypass cache optimization

External data source queries are often the performance bottleneck of stream processing. In this program, every query has to connect to HBase, and the data must be serialized, transmitted over the network, and deserialized, which seriously hurts timeliness. These queries can be optimized with a bypass cache.

The bypass cache mode is a very common on-demand cache mode. All requests access the cache first. If the cache hits, the data is directly obtained and returned to the requester. If there is a miss, the database is queried, and after the result is obtained, it is returned and written to the cache for subsequent requests.

(1) The bypass cache strategy should pay attention to two points

  • The cache should set an expiration time, otherwise cold data will stay in the cache, wasting resources.
  • It is necessary to consider whether the dimension data will change, and if there is a change, the cache must be actively cleared.

(2) Selection of cache

There are generally two options: an in-heap cache or an independent cache service (Memcached, Redis).

  • In-heap cache: better performance and efficiency because the data access path is shorter, but it is hard to manage, and other processes cannot maintain the cached data.
  • Independent cache service (Redis, Memcached): incurs overhead such as connection creation and network IO, so it is slightly slower than an in-heap cache, but the performance is acceptable. It is easier to maintain and scale, and better suited to scenarios where the data changes and the data volume is large. Here an independent cache service is chosen, with Redis as the cache medium.

(3) Implementation steps

  • Query the cache first:

    If the cached result is not null, return it directly.

    If the cached result is null, query the Phoenix table.

  • If the Phoenix query result is not empty, write it into the cache and return it.

    Otherwise, prompt that there is no corresponding dimension data.

Note: cached data needs an expiration time, which is set to one day in this program. In addition, if the source table data changes, the corresponding cache entry must be deleted. To implement this, the dimension splitting program is modified as follows:

  • In the processElement method of MyBroadcastFunction, add the operation type field to the JSON object.
  • Add a deleteCached method to the DimUtil tool class to delete the cache entry for changed data (a sketch follows after this list).
  • In the invoke method of MyPhoenixSink, add a check on the operation type: if the type is update, clear the corresponding cache entry.
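
For illustration, a hedged sketch of what deleteCached might look like, assuming the same "DIM:tableName:id" key format used by getDimInfo later in this section:

// Sketch only: drop the Redis entry for a changed dimension row so the next lookup
// falls back to Phoenix. Assumes the "DIM:tableName:id" key format used by getDimInfo.
public static void deleteCached(String tableName, String key) {
    String redisKey = "DIM:" + tableName + ":" + key;
    Jedis jedis = JedisUtil.getJedis();
    try {
        jedis.del(redisKey);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        jedis.close();   // return the connection to the pool
    }
}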

Bypass cache diagram


Asynchronous I/O

In the process of Flink stream processing, it is often necessary to interact with external systems, such as completing the dimension fields in the fact table through the dimension table.

By default, in a Flink operator, a single parallel subtask can only interact with the external system in a synchronous manner: send the request to the external storage, block on IO, wait for the request to return, and then continue to send the next request. This method spends a lot of time waiting for the result.

In order to improve the processing efficiency, there are two ways of thinking.

  • Increase the parallelism of operators, but more resources are required.
  • Asynchronous I/O.

Flink introduced Async I/O in 1.2 to make IO operations asynchronous. In asynchronous mode, a single parallel subtask can send multiple requests continuously, and process the requests in the order they are returned. After sending the requests, there is no need for blocking waiting, which saves a lot of waiting time and greatly improves the efficiency of stream processing.

Async I/O is a feature contributed by Alibaba to the community, and it has been widely requested. It can be used to solve the problem that network delay becomes a system bottleneck when interacting with external systems.

**Asynchronous query actually entrusts the query operation of the dimension table to a separate thread pool to complete, so that it will not be blocked by a certain query, so a single parallel subtask can send multiple requests continuously, thereby improving concurrency efficiency. ** For operations involving network IO, the performance loss caused by request waiting can be significantly reduced.
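
The mechanics can be sketched with a bare-bones RichAsyncFunction that hands the blocking lookup to its own thread pool and answers through ResultFuture (the class name, pool size and the wiring shown in the comment are assumptions, not the project's final code):

// Minimal async lookup sketch: a RichAsyncFunction that delegates the blocking query to a
// dedicated thread pool. It could be wired in with, e.g.:
//   AsyncDataStream.unorderedWait(stream, new AsyncLookupSketch(), 60, TimeUnit.SECONDS);
public class AsyncLookupSketch extends RichAsyncFunction<String, String> {

    private transient ExecutorService executor;

    @Override
    public void open(Configuration parameters) {
        executor = Executors.newFixedThreadPool(8);   // pool for the blocking calls
    }

    @Override
    public void asyncInvoke(String key, ResultFuture<String> resultFuture) {
        executor.submit(() -> {
            // the blocking external query (e.g. DimUtil.getDimInfo) would run here
            String dim = "dim-of-" + key;
            resultFuture.complete(Collections.singletonList(dim));
        });
    }

    @Override
    public void close() {
        executor.shutdown();
    }
}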

For example, suppose you write an HTTP server with Windows IOCP. An HTTP request for index.html arrives and is assigned to a server thread. The thread finds that index.html is not in memory, so it issues a synchronous IO request to read it from disk and is suspended, no longer counting toward the number of active threads. A second request arrives and goes the same way, then 1000 more, all suspended. When the disk finally returns the data, those 1000 threads all become RUNNABLE at the same time and overwhelm the CPU. The kernel's unified management of IO and the thread pool exists to solve exactly this kind of problem.

But the disadvantage is that if the thread pool is in the kernel mode, the overhead of each scheduling is very high, making it not suitable for lightweight computing tasks (such as matrix multiplication).

Asynchronous IO Diagram


Template Method Design Pattern

(1) Definition

Define the skeleton of the core algorithm in the parent class and defer the specific implementation to subclasses. The template class must be an abstract class that contains a concrete execution flow (made up of abstract methods or ordinary methods); these methods may themselves be inherited from a higher-level template.

(2) Advantages

Without changing the parent class's core algorithm skeleton, each subclass can provide its own implementation. We only need to focus on the logic of the specific methods, not on the overall execution flow.

This program defines the template class DimAsyncFunction, which lays out the specific process of dimension association (a sketch follows after this list):

  • Get the dimension primary key according to the objects in the stream.
  • Get the dimension object according to the dimension primary key.
  • Use the query results from the previous step to complete the dimension information of the objects in the stream.
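
A hedged sketch of such a template (the class name, the getDimKey/join method names and the pool size are assumptions for illustration; the real DimAsyncFunction may differ):

// Template-method sketch: the skeleton (thread pool + cache-aware query + merge) sits in the
// abstract class; subclasses only say which key to use and how to merge the dimension row.
public abstract class DimAsyncFunctionSketch<T> extends RichAsyncFunction<T, T> {

    private final String tableName;
    private transient DruidDataSource dataSource;
    private transient ExecutorService executor;

    public DimAsyncFunctionSketch(String tableName) {
        this.tableName = tableName;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        dataSource = DruidDSUtil.createDataSource();
        executor = Executors.newFixedThreadPool(8);
    }

    // step 1: extract the dimension primary key from the stream object
    protected abstract String getDimKey(T input);

    // step 3: merge the queried dimension row into the stream object
    protected abstract void join(T input, JSONObject dimInfo);

    @Override
    public void asyncInvoke(T input, ResultFuture<T> resultFuture) {
        executor.submit(() -> {
            try (DruidPooledConnection connection = dataSource.getConnection()) {
                // step 2: query the dimension row (Redis first, then Phoenix)
                JSONObject dimInfo = DimUtil.getDimInfo(connection, tableName, getDimKey(input));
                if (dimInfo != null) {
                    join(input, dimInfo);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            resultFuture.complete(Collections.singletonList(input));
        });
    }
}
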
Execution steps and diagrams

(1) Read data from Kafka order details topic

(2) Convert data structure

(3) Deduplication according to the unique key

(4) Convert data structure

JSONObject is converted to the entity class TradeTrademarkCategoryUserSpuOrderBean.

(5) Supplement the dimension information related to grouping

Associate the sku_info table to obtain tm_id, category3_id, spu_id.

(6) Set the watermark

(7) Grouping, windowing, aggregation

Group by dimension information, sum the measurement fields, and supplement the window start time and end time after the window is closed. Set the timestamp to the current system time.

(8) Dimension association: supplement the dimension fields not related to grouping (a wiring sketch follows after these steps)

  • Associate spu_info table to get spu_name.
  • Associate the base_trademark table to obtain tm_name.
  • Associate the base_category3 table, get the name (the name of the third-level category), and get the category2_id.
  • Associate the base_category2 table to obtain the name (second-level category name) and category1_id.
  • Associate the base_category1 table to obtain the name (first-level category name).

(9) Write out to ClickHouse.
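
A hedged usage sketch for the spu_name association in step (8), wired with Flink's AsyncDataStream and the DimAsyncFunctionSketch template above (the stream name, the bean getters and the timeout are assumptions; the Phoenix column name is assumed to be upper-case):

// Sketch: attach spu_name by asynchronously joining DIM_SPU_INFO after the windowed aggregation.
SingleOutputStreamOperator<TradeTrademarkCategoryUserSpuOrderBean> withSpuNameDS =
        AsyncDataStream.unorderedWait(
                reduceDS,   // assumed name of the aggregated stream
                new DimAsyncFunctionSketch<TradeTrademarkCategoryUserSpuOrderBean>("DIM_SPU_INFO") {
                    @Override
                    protected String getDimKey(TradeTrademarkCategoryUserSpuOrderBean input) {
                        return input.getSpuId();
                    }

                    @Override
                    protected void join(TradeTrademarkCategoryUserSpuOrderBean input, JSONObject dimInfo) {
                        input.setSpuName(dimInfo.getString("SPU_NAME"));
                    }
                },
                100, TimeUnit.SECONDS);   // timeout for a single async request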

diagram

ClickHouse table creation statement

drop table if exists dws_trade_user_spu_order_window;
create table if not exists dws_trade_user_spu_order_window
(
    stt            DateTime,
    edt            DateTime,
    trademark_id   String,
    trademark_name String,
    category1_id   String,
    category1_name String,
    category2_id   String,
    category2_name String,
    category3_id   String,
    category3_name String,
    user_id        String,
    spu_id         String,
    spu_name       String,
    order_count    UInt64,
    order_amount   Decimal(38, 20),
    ts             UInt64
) engine = ReplacingMergeTree(ts)
      partition by toYYYYMMDD(stt)
      order by (stt, edt, spu_id, spu_name, user_id);

the code

(1) Supplement Jedis-related dependencies

<dependency>
    <groupId>redis.clients</groupId>
    <artifactId>jedis</artifactId>
    <version>3.3.0</version>
</dependency>

(2) Phoenix query data method queryList()

public class JdbcUtil {
    
    
    public static <T> List<T> queryList(Connection connection, String sql, Class<T> clz, boolean underScoreToCamel) throws SQLException, IllegalAccessException, InstantiationException, InvocationTargetException {
    
    

        //创建集合用于存放结果数据
        ArrayList<T> result = new ArrayList<>();

        //编译SQL语句
        PreparedStatement preparedStatement = connection.prepareStatement(sql);

        //执行查询
        ResultSet resultSet = preparedStatement.executeQuery();

        //获取查询的元数据信息
        ResultSetMetaData metaData = resultSet.getMetaData();
        int columnCount = metaData.getColumnCount();

        //遍历结果集,将每行数据转换为T对象并加入集合   行遍历
        while (resultSet.next()) {
    
    

            //创建T对象
            T t = clz.newInstance();

            //列遍历,并给T对象赋值
            for (int i = 0; i < columnCount; i++) {
    
    

                //获取列名与列值
                String columnName = metaData.getColumnName(i + 1);
                Object value = resultSet.getObject(columnName);

                //判断是否需要进行下划线与驼峰命名转换
                if (underScoreToCamel) {
    
    
                    columnName = CaseFormat.LOWER_UNDERSCORE.to(CaseFormat.LOWER_CAMEL, columnName.toLowerCase());
                }

                //赋值
                BeanUtils.setProperty(t, columnName, value);
            }

            //将T对象放入集合
            result.add(t);
        }

        resultSet.close();
        preparedStatement.close();

        //返回集合
        return result;
    }

    public static void main(String[] args) throws Exception {
    
    

        DruidDataSource dataSource = DruidDSUtil.createDataSource();
        DruidPooledConnection connection = dataSource.getConnection();

        List<JSONObject> jsonObjects = queryList(connection,
                "select * ct from GMALL211027_REALTIME.DIM_BASE_TRADEMARK where id='1'",
                JSONObject.class,
                true);

        for (JSONObject jsonObject : jsonObjects) {
    
    
            System.out.println(jsonObject);
        }

        connection.close();

    }
}

(3) Jedis tool class JedisUtil

public class JedisUtil {
    
    

    private static JedisPool jedisPool;

    private static void initJedisPool() {
    
    
        JedisPoolConfig poolConfig = new JedisPoolConfig();
        poolConfig.setMaxTotal(100);
        poolConfig.setMaxIdle(5);
        poolConfig.setMinIdle(5);
        poolConfig.setBlockWhenExhausted(true);
        poolConfig.setMaxWaitMillis(2000);
        poolConfig.setTestOnBorrow(true);
        jedisPool = new JedisPool(poolConfig, "hadoop102", 6379, 10000);
    }

    public static Jedis getJedis() {
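        // Note: this lazy initialization is not synchronized; if several threads call
        // getJedis() for the first time concurrently, more than one pool could be created.
        // A synchronized block (as used in ThreadPoolUtil below) would avoid that.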
    
    
        if (jedisPool == null) {
    
    
            initJedisPool();
        }
        // 获取Jedis客户端
        return jedisPool.getResource();
    }

    public static void main(String[] args) {
    
    
        Jedis jedis = getJedis();
        String pong = jedis.ping();
        System.out.println(pong);
    }

}

(4) Dimension query tool class DimUtil

public class DimUtil {
    
    

    public static JSONObject getDimInfo(Connection connection, String tableName, String key) throws InvocationTargetException, SQLException, InstantiationException, IllegalAccessException {
    
    

        //先查询Redis
        Jedis jedis = JedisUtil.getJedis();
        String redisKey = "DIM:" + tableName + ":" + key;
        String dimJsonStr = jedis.get(redisKey);
        if (dimJsonStr != null) {
    
    
            //重置过期时间
            jedis.expire(redisKey, 24 * 60 * 60);
            //归还连接
            jedis.close();
            //返回维度数据
            return JSON.parseObject(dimJsonStr);
        }

        //拼接SQL语句
        String querySql = "select * from " + GmallConfig.HBASE_SCHEMA + "." + tableName + " where id='" + key + "'";
        System.out.println("querySql>>>" + querySql);

        //查询数据
        List<JSONObject> queryList = JdbcUtil.queryList(connection, querySql, JSONObject.class, false);

        //将从Phoenix查询到的数据写入Redis
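        // Note: if the key does not exist in the dimension table, queryList is empty and
        // get(0) below throws an exception; production code may want to check for that first.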
        JSONObject dimInfo = queryList.get(0);
        jedis.set(redisKey, dimInfo.toJSONString());
        //设置过期时间
        jedis.expire(redisKey, 24 * 60 * 60);
        //归还连接
        jedis.close();

        //返回结果
        return dimInfo;
    }

    public static void delDimInfo(String tableName, String key) {
    
    
        //获取连接
        Jedis jedis = JedisUtil.getJedis();
        //删除数据
        jedis.del("DIM:" + tableName + ":" + key);
        //归还连接
        jedis.close();
    }

    public static void main(String[] args) throws Exception {
    
    

        DruidDataSource dataSource = DruidDSUtil.createDataSource();
        DruidPooledConnection connection = dataSource.getConnection();

        long start = System.currentTimeMillis();
        JSONObject dimInfo = getDimInfo(connection, "DIM_BASE_TRADEMARK", "18");
        long end = System.currentTimeMillis();
        JSONObject dimInfo2 = getDimInfo(connection, "DIM_BASE_TRADEMARK", "18");
        long end2 = System.currentTimeMillis();

        System.out.println(dimInfo);
        System.out.println(dimInfo2);

        System.out.println(end - start);  //159  127  120  127  121  122  119
        System.out.println(end2 - end);   //8  8  8  1  1  1  1  0  0.5

        connection.close();

    }

}

(5) Modify the processElement method in MyBroadcastFunction

Add the operation type field to the data sent downstream so that it can be used to invalidate the cache: when the operation type is update, the corresponding Redis cache entry is deleted.

@Override
public void processElement(JSONObject jsonObj, ReadOnlyContext readOnlyContext, Collector<JSONObject> out) throws Exception {
    
    
    ReadOnlyBroadcastState<String, TableProcess> tableConfigState = readOnlyContext.getBroadcastState(tableConfigDescriptor);
    // 获取配置信息
    String sourceTable = jsonObj.getString("table");
    TableProcess tableConfig = tableConfigState.get(sourceTable);
    if (tableConfig != null) {
    
    
        JSONObject data = jsonObj.getJSONObject("data");
        // 获取操作类型
        String type = jsonObj.getString("type");
        String sinkTable = tableConfig.getSinkTable();

        // 根据 sinkColumns 过滤数据
        String sinkColumns = tableConfig.getSinkColumns();
        filterColumns(data, sinkColumns);

        // 将目标表名加入到主流数据中
        data.put("sinkTable", sinkTable);

        // 将操作类型加入到 JSONObject 中
        data.put("type", type);

        out.collect(data);
    }
}

(6) Modify the invoke method of the Phoenix sink class (shown here as DimSinkFunction): read the operation type, delete the Redis cache entry when the type is update, and remove the type field from the JSON object before writing to HBase.

public class DimSinkFunction extends RichSinkFunction<JSONObject> {
    
    
    private DruidDataSource druidDataSource = null;

    @Override
    public void open(Configuration parameters) throws Exception {
    
    
        druidDataSource = DruidDSUtil.createDataSource();
    }

    //value:{"database":"gmall-211126-flink","table":"base_trademark","type":"update","ts":1652499176,"xid":188,"commit":true,"data":{"id":13,"tm_name":"atguigu"},"old":{"logo_url":"/aaa/aaa"},"sinkTable":"dim_xxx"}
    //value:{"database":"gmall-211126-flink","table":"order_info","type":"update","ts":1652499176,"xid":188,"commit":true,"data":{"id":13,...},"old":{"xxx":"/aaa/aaa"},"sinkTable":"dim_xxx"}
    @Override
    public void invoke(JSONObject value, Context context) throws Exception {
    
    
        //获取连接
        DruidPooledConnection connection = druidDataSource.getConnection();

        String sinkTable = value.getString("sinkTable");
        JSONObject data = value.getJSONObject("data");

        //获取数据类型
        String type = value.getString("type");
        //如果为更新数据,则需要删除Redis中的数据
        if ("update".equals(type)) {
    
    
            DimUtil.delDimInfo(sinkTable.toUpperCase(), data.getString("id"));
        }

        //写出数据
        PhoenixUtil.upsertValues(connection, sinkTable, data);

        //归还连接
        connection.close();
    }

}

(7) Template method design pattern template interface DimJoinFunction

public interface DimJoinFunction<T> {
    
    

    String getKey(T input);

    void join(T input, JSONObject dimInfo);
}

(8) Thread pool tool class ThreadPoolUtil

public class ThreadPoolUtil {
    
    

    private static volatile ThreadPoolExecutor threadPoolExecutor; // volatile is needed for safe double-checked locking

    private ThreadPoolUtil() {
    
    
    }

    public static ThreadPoolExecutor getThreadPoolExecutor() {
    
    

        if (threadPoolExecutor == null) {
    
    
            synchronized (ThreadPoolUtil.class) {
    
    
                if (threadPoolExecutor == null) {
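                    // corePoolSize=4, maximumPoolSize=20, keepAliveTime=100s for idle non-core threads;
                    // note: with an unbounded LinkedBlockingDeque the pool will in practice never grow past the 4 core threads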
    
    
                    threadPoolExecutor = new ThreadPoolExecutor(4,
                            20,
                            100,
                            TimeUnit.SECONDS,
                            new LinkedBlockingDeque<>());
                }
            }
        }

        return threadPoolExecutor;
    }
}

(9) Asynchronous IO function DimAsyncFunction

public abstract class DimAsyncFunction<T> extends RichAsyncFunction<T, T> implements DimJoinFunction<T> {
    
    

    private DruidDataSource dataSource;
    private ThreadPoolExecutor threadPoolExecutor;

    private String tableName;

    public DimAsyncFunction() {
    
    
    }

    public DimAsyncFunction(String tableName) {
    
    
        this.tableName = tableName;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
    
    
        dataSource = DruidDSUtil.createDataSource();
        threadPoolExecutor = ThreadPoolUtil.getThreadPoolExecutor();
    }

    @Override
    public void asyncInvoke(T input, ResultFuture<T> resultFuture) throws Exception {
    
    

        threadPoolExecutor.execute(new Runnable() {
    
    
            @Override
            public void run() {
    
    

                try {
    
    
                    //获取连接
                    DruidPooledConnection connection = dataSource.getConnection();

                    //查询维表获取维度信息
                    String key = getKey(input);
                    JSONObject dimInfo = DimUtil.getDimInfo(connection, tableName, key);

                    //将维度信息补充至当前数据
                    if (dimInfo != null) {
    
    
                        join(input, dimInfo);
                    }

                    //归还连接
                    connection.close();

                    //将结果写出
                    resultFuture.complete(Collections.singletonList(input));

                } catch (Exception e) {
    
    
                    e.printStackTrace();
                    System.out.println("关联维表失败:" + input + ",Table:" + tableName);
                    //resultFuture.complete(Collections.singletonList(input));
                }
            }
        });
    }


    @Override
    public void timeout(T input, ResultFuture<T> resultFuture) throws Exception {
    
    
        System.out.println("TimeOut:" + input);
    }
}

(10) Entity class TradeUserSpuOrderBean

@Data
@AllArgsConstructor
@NoArgsConstructor
@Builder
public class TradeUserSpuOrderBean {
    
    
    // 窗口起始时间
    String stt;
    // 窗口结束时间
    String edt;
    // 品牌 ID
    String trademarkId;
    // 品牌名称
    String trademarkName;
    // 一级品类 ID
    String category1Id;
    // 一级品类名称
    String category1Name;
    // 二级品类 ID
    String category2Id;
    // 二级品类名称
    String category2Name;
    // 三级品类 ID
    String category3Id;
    // 三级品类名称
    String category3Name;

    // 订单 ID
    @TransientSink
    Set<String> orderIdSet;

    // sku_id
    @TransientSink
    String skuId;

    // 用户 ID
    String userId;
    // spu_id
    String spuId;
    // spu 名称
    String spuName;
    // 下单次数
    Long orderCount;
    // 下单金额
    Double orderAmount;
    // 时间戳
    Long ts;
}

(11) Main program

//数据流:Web/app -> nginx -> 业务服务器(Mysql) -> Maxwell -> Kafka(ODS) -> FlinkApp -> Kafka(DWD) -> FlinkApp -> Kafka(DWD) -> FlinkApp -> ClickHouse(DWS)
//程  序:Mock  ->  Mysql  ->  Maxwell -> Kafka(ZK)  ->  DwdTradeOrderPreProcess -> Kafka(ZK) -> DwdTradeOrderDetail -> Kafka(ZK) -> DwsTradeUserSpuOrderWindow(Phoenix-(HBase-HDFS、ZK)、Redis) -> ClickHouse(ZK)
public class DwsTradeUserSpuOrderWindow {
    
    

    public static void main(String[] args) throws Exception {
    
    

        //TODO 1.获取执行环境
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1); //生产环境中设置为Kafka主题的分区数

        //1.1 开启CheckPoint
        //env.enableCheckpointing(5 * 60000L, CheckpointingMode.EXACTLY_ONCE);
        //env.getCheckpointConfig().setCheckpointTimeout(10 * 60000L);
        //env.getCheckpointConfig().setMaxConcurrentCheckpoints(2);
        //env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 5000L));

        //1.2 设置状态后端
        //env.setStateBackend(new HashMapStateBackend());
        //env.getCheckpointConfig().setCheckpointStorage("hdfs://hadoop102:8020/211126/ck");
        //System.setProperty("HADOOP_USER_NAME", "atguigu");

        //1.3 设置状态的TTL  生产环境设置为最大乱序程度
        //tableEnv.getConfig().setIdleStateRetention(Duration.ofSeconds(5));

        //TODO 2.读取Kafka DWD层下单主题数据创建流
        String topic = "dwd_trade_order_detail";
        String groupId = "dws_trade_user_spu_order_window_211126";
        DataStreamSource<String> kafkaDS = env.addSource(MyKafkaUtil.getFlinkKafkaConsumer(topic, groupId));

        //TODO 3.将每行数据转换为JSON对象
        SingleOutputStreamOperator<JSONObject> jsonObjDS = kafkaDS.flatMap(new FlatMapFunction<String, JSONObject>() {
    
    
            @Override
            public void flatMap(String value, Collector<JSONObject> out) throws Exception {
    
    
                try {
    
    
                    JSONObject jsonObject = JSON.parseObject(value);
                    out.collect(jsonObject);
                } catch (Exception e) {
    
    
                    System.out.println("Value>>>>>>>>" + value);
                }
            }
        });

        //TODO 4.按照 order_detail_id 分组
        KeyedStream<JSONObject, String> keyedByDetailIdDS = jsonObjDS.keyBy(json -> json.getString("id"));

        //TODO 5.针对 order_detail_id 进行去重(保留第一条数据即可)
        SingleOutputStreamOperator<JSONObject> filterDS = keyedByDetailIdDS.filter(new RichFilterFunction<JSONObject>() {
    
    

            private ValueState<String> valueState;

            @Override
            public void open(Configuration parameters) throws Exception {
    
    

                StateTtlConfig ttlConfig = new StateTtlConfig.Builder(Time.seconds(5))
                        .setUpdateType(StateTtlConfig.UpdateType.OnReadAndWrite)
                        .build();
                ValueStateDescriptor<String> stateDescriptor = new ValueStateDescriptor<>("is-exists", String.class);
                stateDescriptor.enableTimeToLive(ttlConfig);

                valueState = getRuntimeContext().getState(stateDescriptor);
            }

            @Override
            public boolean filter(JSONObject value) throws Exception {
    
    

                //获取状态数据
                String state = valueState.value();

                //判断状态是否为null
                if (state == null) {
    
    
                    valueState.update("1");
                    return true;
                } else {
    
    
                    return false;
                }
            }
        });

        //TODO 6.将数据转换为JavaBean对象
        SingleOutputStreamOperator<TradeUserSpuOrderBean> tradeUserSpuDS = filterDS.map(jsonObject -> {
    
    
            HashSet<String> orderIds = new HashSet<>();
            //使用orderIds这个set,是因为是按照SPU的粒度来算订单数的,
            // 而如果存在{orderId:1,skuId:1,spuId:1},{orderId:1,skuId:2,spuId:1}这样的数据应该算一条
            orderIds.add(jsonObject.getString("order_id"));

            return TradeUserSpuOrderBean.builder()
                    .skuId(jsonObject.getString("sku_id"))
                    .userId(jsonObject.getString("user_id"))
                    .orderAmount(jsonObject.getDouble("split_total_amount"))
                    .orderIdSet(orderIds)
                    .ts(DateFormatUtil.toTs(jsonObject.getString("create_time"), true))
                    .build();
        });
        tradeUserSpuDS.print("tradeUserSpuDS>>>>>>>>>>>>>>");

        //TODO 7.关联sku_info维表 补充 spu_id,tm_id,category3_id
//        tradeUserSpuDS.map(new RichMapFunction<TradeUserSpuOrderBean, TradeUserSpuOrderBean>() {
    
    
//            @Override
//            public void open(Configuration parameters) throws Exception {
    
    
//                //创建Phoenix连接池
//            }
//            @Override
//            public TradeUserSpuOrderBean map(TradeUserSpuOrderBean value) throws Exception {
    
    
//                //查询维表,将查到的信息补充至JavaBean中
//                return null;
//            }
//        });
        SingleOutputStreamOperator<TradeUserSpuOrderBean> tradeUserSpuWithSkuDS = AsyncDataStream.unorderedWait(
                tradeUserSpuDS,
                new DimAsyncFunction<TradeUserSpuOrderBean>("DIM_SKU_INFO") {
    
    
                    @Override
                    public String getKey(TradeUserSpuOrderBean input) {
    
    
                        return input.getSkuId();
                    }

                    @Override
                    public void join(TradeUserSpuOrderBean input, JSONObject dimInfo) {
    
    
                        input.setSpuId(dimInfo.getString("SPU_ID"));
                        input.setTrademarkId(dimInfo.getString("TM_ID"));
                        input.setCategory3Id(dimInfo.getString("CATEGORY3_ID"));
                    }
                },
                100, TimeUnit.SECONDS);

        tradeUserSpuWithSkuDS.print("tradeUserSpuWithSkuDS>>>>>>>>>>>");

        //TODO 8.提取事件时间生成Watermark
        SingleOutputStreamOperator<TradeUserSpuOrderBean> tradeUserSpuWithWmDS = tradeUserSpuWithSkuDS.assignTimestampsAndWatermarks(WatermarkStrategy.<TradeUserSpuOrderBean>forBoundedOutOfOrderness(Duration.ofSeconds(2)).withTimestampAssigner(new SerializableTimestampAssigner<TradeUserSpuOrderBean>() {
    
    
            @Override
            public long extractTimestamp(TradeUserSpuOrderBean element, long recordTimestamp) {
    
    
                return element.getTs();
            }
        }));

        //TODO 9.分组、开窗、聚合
        KeyedStream<TradeUserSpuOrderBean, Tuple4<String, String, String, String>> keyedStream = tradeUserSpuWithWmDS.keyBy(new KeySelector<TradeUserSpuOrderBean, Tuple4<String, String, String, String>>() {
    
    
            @Override
            public Tuple4<String, String, String, String> getKey(TradeUserSpuOrderBean value) throws Exception {
    
    
                return new Tuple4<>(value.getUserId(),
                        value.getSpuId(),
                        value.getTrademarkId(),
                        value.getCategory3Id());
            }
        });
        SingleOutputStreamOperator<TradeUserSpuOrderBean> reduceDS = keyedStream.window(TumblingEventTimeWindows.of(org.apache.flink.streaming.api.windowing.time.Time.seconds(10)))
                .reduce(new ReduceFunction<TradeUserSpuOrderBean>() {
    
    
                    @Override
                    public TradeUserSpuOrderBean reduce(TradeUserSpuOrderBean value1, TradeUserSpuOrderBean value2) throws Exception {
    
    
                        value1.getOrderIdSet().addAll(value2.getOrderIdSet());
                        value1.setOrderAmount(value1.getOrderAmount() + value2.getOrderAmount());
                        return value1;
                    }
                }, new WindowFunction<TradeUserSpuOrderBean, TradeUserSpuOrderBean, Tuple4<String, String, String, String>, TimeWindow>() {
    
    
                    @Override
                    public void apply(Tuple4<String, String, String, String> key, TimeWindow window, Iterable<TradeUserSpuOrderBean> input, Collector<TradeUserSpuOrderBean> out) throws Exception {
    
    

                        TradeUserSpuOrderBean userSpuOrderBean = input.iterator().next();

                        userSpuOrderBean.setTs(System.currentTimeMillis());
                        userSpuOrderBean.setOrderCount((long) userSpuOrderBean.getOrderIdSet().size());
                        userSpuOrderBean.setStt(DateFormatUtil.toYmdHms(window.getStart()));
                        userSpuOrderBean.setEdt(DateFormatUtil.toYmdHms(window.getEnd()));

                        out.collect(userSpuOrderBean);
                    }
                });

        //TODO 10.关联spu,tm,category维表补充相应的信息
        //10.1 关联SPU表
        SingleOutputStreamOperator<TradeUserSpuOrderBean> reduceWithSpuDS = AsyncDataStream.unorderedWait(reduceDS,
                new DimAsyncFunction<TradeUserSpuOrderBean>("DIM_SPU_INFO") {
    
    
                    @Override
                    public String getKey(TradeUserSpuOrderBean input) {
    
    
                        return input.getSpuId();
                    }

                    @Override
                    public void join(TradeUserSpuOrderBean input, JSONObject dimInfo) {
    
    
                        input.setSpuName(dimInfo.getString("SPU_NAME"));
                    }
                }, 100, TimeUnit.SECONDS);

        //10.2 关联Tm表
        SingleOutputStreamOperator<TradeUserSpuOrderBean> reduceWithTmDS = AsyncDataStream.unorderedWait(reduceWithSpuDS,
                new DimAsyncFunction<TradeUserSpuOrderBean>("DIM_BASE_TRADEMARK") {
    
    
                    @Override
                    public String getKey(TradeUserSpuOrderBean input) {
    
    
                        return input.getTrademarkId();
                    }

                    @Override
                    public void join(TradeUserSpuOrderBean input, JSONObject dimInfo) {
    
    
                        input.setTrademarkName(dimInfo.getString("TM_NAME"));
                    }
                }, 100, TimeUnit.SECONDS);

        //10.3 关联Category3
        SingleOutputStreamOperator<TradeUserSpuOrderBean> reduceWithCategory3DS = AsyncDataStream.unorderedWait(reduceWithTmDS,
                new DimAsyncFunction<TradeUserSpuOrderBean>("DIM_BASE_CATEGORY3") {
    
    
                    @Override
                    public String getKey(TradeUserSpuOrderBean input) {
    
    
                        return input.getCategory3Id();
                    }

                    @Override
                    public void join(TradeUserSpuOrderBean input, JSONObject dimInfo) {
    
    
                        input.setCategory3Name(dimInfo.getString("NAME"));
                        input.setCategory2Id(dimInfo.getString("CATEGORY2_ID"));
                    }
                }, 100, TimeUnit.SECONDS);

        //10.4 关联Category2
        SingleOutputStreamOperator<TradeUserSpuOrderBean> reduceWithCategory2DS = AsyncDataStream.unorderedWait(reduceWithCategory3DS,
                new DimAsyncFunction<TradeUserSpuOrderBean>("DIM_BASE_CATEGORY2") {
    
    
                    @Override
                    public String getKey(TradeUserSpuOrderBean input) {
    
    
                        return input.getCategory2Id();
                    }

                    @Override
                    public void join(TradeUserSpuOrderBean input, JSONObject dimInfo) {
    
    
                        input.setCategory2Name(dimInfo.getString("NAME"));
                        input.setCategory1Id(dimInfo.getString("CATEGORY1_ID"));
                    }
                }, 100, TimeUnit.SECONDS);

        //10.5 关联Category1
        SingleOutputStreamOperator<TradeUserSpuOrderBean> reduceWithCategory1DS = AsyncDataStream.unorderedWait(reduceWithCategory2DS,
                new DimAsyncFunction<TradeUserSpuOrderBean>("DIM_BASE_CATEGORY1") {
    
    
                    @Override
                    public String getKey(TradeUserSpuOrderBean input) {
    
    
                        return input.getCategory1Id();
                    }

                    @Override
                    public void join(TradeUserSpuOrderBean input, JSONObject dimInfo) {
    
    
                        input.setCategory1Name(dimInfo.getString("NAME"));
                    }
                }, 100, TimeUnit.SECONDS);

        //TODO 11.将数据写出到ClickHouse
        reduceWithCategory1DS.print(">>>>>>>>>>>>>>>>>");
        reduceWithCategory1DS.addSink(MyClickHouseUtil.getSinkFunction("insert into dws_trade_user_spu_order_window values(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"));

        //TODO 12.启动
        env.execute("DwsTradeUserSpuOrderWindow");

    }

}

Transaction Domain Province Granularity Order Summary Table for Each Window

main mission

Read the order details data from Kafka, filter null data and deduplicate the data according to the unique key, count the number of orders and order amounts in each province and window, and write the data into the ClickHouse transaction domain province-level order window summary table.

Idea analysis

1) Read data from Kafka order details topic

2) Convert data structure

3) Deduplication according to the unique key

4) Convert data structure

JSONObject is converted to the entity class TradeProvinceOrderWindow.

5) Set the watermark

6) Group by province ID

province_id can uniquely identify the data.

7) Open the window

8) Aggregate calculation

The measure fields are summed and the window start and end times are supplemented after the window is closed. Set the timestamp to the current system time.

9) Associated province information

Complete the province name field.

10) Write out to ClickHouse.

diagram


ClickHouse table creation statement

drop table if exists dws_trade_province_order_window;
create table if not exists dws_trade_province_order_window
(
    stt           DateTime,
    edt           DateTime,
    province_id   String,
    province_name String,
    order_count   UInt64,
    order_amount  Decimal(38, 20),
    ts            UInt64
) engine = ReplacingMergeTree(ts)
      partition by toYYYYMMDD(stt)
      order by (stt, edt, province_id);

the code

(1) Entity class

@Data
@AllArgsConstructor
@Builder
public class TradeProvinceOrderWindow {
    
    
    // 窗口起始时间
    String stt;

    // 窗口结束时间
    String edt;

    // 省份 ID
    String provinceId;

    // 省份名称
    @Builder.Default
    String provinceName = "";

    // 累计下单次数
    Long orderCount;

    // 订单 ID 集合,用于统计下单次数
    @TransientSink
    Set<String> orderIdSet;

    // 累计下单金额
    Double orderAmount;

    // 时间戳
    Long ts;
}

(2) Main program

//数据流:Web/app -> nginx -> 业务服务器(Mysql) -> Maxwell -> Kafka(ODS) -> FlinkApp -> Kafka(DWD) -> FlinkApp -> Kafka(DWD) -> FlinkApp -> ClickHouse(DWS)
//程  序:Mock  ->  Mysql  ->  Maxwell -> Kafka(ZK)  ->  DwdTradeOrderPreProcess -> Kafka(ZK) -> DwdTradeOrderDetail -> Kafka(ZK) -> DwsTradeProvinceOrderWindow(Phoenix-(HBase-HDFS、ZK)、Redis) -> ClickHouse(ZK)

public class DwsTradeProvinceOrderWindow {
    
    

    public static void main(String[] args) throws Exception {
    
    

        //TODO 1.获取执行环境
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1); //生产环境中设置为Kafka主题的分区数

        //1.1 开启CheckPoint
        //env.enableCheckpointing(5 * 60000L, CheckpointingMode.EXACTLY_ONCE);
        //env.getCheckpointConfig().setCheckpointTimeout(10 * 60000L);
        //env.getCheckpointConfig().setMaxConcurrentCheckpoints(2);
        //env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 5000L));

        //1.2 设置状态后端
        //env.setStateBackend(new HashMapStateBackend());
        //env.getCheckpointConfig().setCheckpointStorage("hdfs://hadoop102:8020/211126/ck");
        //System.setProperty("HADOOP_USER_NAME", "atguigu");

        //1.3 设置状态的TTL  生产环境设置为最大乱序程度
        //tableEnv.getConfig().setIdleStateRetention(Duration.ofSeconds(5));

        //TODO 2.读取Kafka DWD层下单主题数据创建流
        String topic = "dwd_trade_order_detail";
        String groupId = "dws_trade_province_order_window_211126";
        DataStreamSource<String> kafkaDS = env.addSource(MyKafkaUtil.getFlinkKafkaConsumer(topic, groupId));

        //TODO 3.将每行数据转换为JSON对象
        SingleOutputStreamOperator<JSONObject> jsonObjDS = kafkaDS.flatMap(new FlatMapFunction<String, JSONObject>() {
    
    
            @Override
            public void flatMap(String value, Collector<JSONObject> out) throws Exception {
    
    
                try {
    
    
                    JSONObject jsonObject = JSON.parseObject(value);
                    out.collect(jsonObject);
                } catch (Exception e) {
    
    
                    System.out.println("Value>>>>>>>>" + value);
                }
            }
        });

        //TODO 4.按照 order_detail_id 分组、去重(取最后一条数据)
        KeyedStream<JSONObject, String> keyedByDetailIdDS = jsonObjDS.keyBy(json -> json.getString("id"));
        SingleOutputStreamOperator<JSONObject> filterDS = keyedByDetailIdDS.process(new KeyedProcessFunction<String, JSONObject, JSONObject>() {
    
    

            private ValueState<JSONObject> valueState;

            @Override
            public void open(Configuration parameters) throws Exception {
    
    
                valueState = getRuntimeContext().getState(new ValueStateDescriptor<JSONObject>("value-state", JSONObject.class));
            }

            @Override
            public void processElement(JSONObject value, Context ctx, Collector<JSONObject> out) throws Exception {
    
    
                //取出状态中的数据
                JSONObject lastValue = valueState.value();

                //判断状态数据是否为null
                if (lastValue == null) {
    
    
                    valueState.update(value);
                    long processingTime = ctx.timerService().currentProcessingTime();
                    ctx.timerService().registerProcessingTimeTimer(processingTime + 5000L);
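                    // the 5s processing-time timer fires in onTimer(), which emits the latest
                    // record kept in state for this order_detail_id and then clears the state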
                } else {
    
    

                    //取出状态数据以及当前数据中的时间字段
                    String lastTs = lastValue.getString("row_op_ts");
                    String curTs = value.getString("row_op_ts");

                    if (TimestampLtz3CompareUtil.compare(lastTs, curTs) != 1) {
    
    
                        valueState.update(value);
                    }
                }
            }

            @Override
            public void onTimer(long timestamp, OnTimerContext ctx, Collector<JSONObject> out) throws Exception {
    
    
                //输出数据并清空状态
                out.collect(valueState.value());
                valueState.clear();
            }
        });

        //TODO 5.将每行数据转换为JavaBean
        SingleOutputStreamOperator<TradeProvinceOrderWindow> provinceOrderDS = filterDS.map(line -> {
    
    

            HashSet<String> orderIdSet = new HashSet<>();
            orderIdSet.add(line.getString("order_id"));

            return new TradeProvinceOrderWindow("", "",
                    line.getString("province_id"),
                    "",
                    0L,
                    orderIdSet,
                    line.getDouble("split_total_amount"),
                    DateFormatUtil.toTs(line.getString("create_time"), true));
        });

        //TODO 6.提取时间戳生成Watermark
        SingleOutputStreamOperator<TradeProvinceOrderWindow> tradeProvinceWithWmDS = provinceOrderDS.assignTimestampsAndWatermarks(WatermarkStrategy.<TradeProvinceOrderWindow>forBoundedOutOfOrderness(Duration.ofSeconds(2)).withTimestampAssigner(new SerializableTimestampAssigner<TradeProvinceOrderWindow>() {
    
    
            @Override
            public long extractTimestamp(TradeProvinceOrderWindow element, long recordTimestamp) {
    
    
                return element.getTs();
            }
        }));

        //TODO 7.分组开窗聚合
        SingleOutputStreamOperator<TradeProvinceOrderWindow> reduceDS = tradeProvinceWithWmDS.keyBy(TradeProvinceOrderWindow::getProvinceId)
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                .reduce(new ReduceFunction<TradeProvinceOrderWindow>() {
    
    
                    @Override
                    public TradeProvinceOrderWindow reduce(TradeProvinceOrderWindow value1, TradeProvinceOrderWindow value2) throws Exception {
    
    
                        value1.getOrderIdSet().addAll(value2.getOrderIdSet());
                        value1.setOrderAmount(value1.getOrderAmount() + value2.getOrderAmount());
                        return value1;
                    }
                }, new WindowFunction<TradeProvinceOrderWindow, TradeProvinceOrderWindow, String, TimeWindow>() {
    
    
                    @Override
                    public void apply(String s, TimeWindow window, Iterable<TradeProvinceOrderWindow> input, Collector<TradeProvinceOrderWindow> out) throws Exception {
    
    

                        TradeProvinceOrderWindow provinceOrderWindow = input.iterator().next();

                        provinceOrderWindow.setTs(System.currentTimeMillis());
                        provinceOrderWindow.setOrderCount((long) provinceOrderWindow.getOrderIdSet().size());
                        provinceOrderWindow.setStt(DateFormatUtil.toYmdHms(window.getStart()));
                        provinceOrderWindow.setEdt(DateFormatUtil.toYmdHms(window.getEnd()));

                        out.collect(provinceOrderWindow);
                    }
                });
        reduceDS.print("reduceDS>>>>>>>>>>>>");

        //TODO 8.关联省份维表补充省份名称字段
        SingleOutputStreamOperator<TradeProvinceOrderWindow> reduceWithProvinceDS = AsyncDataStream.unorderedWait(reduceDS,
                new DimAsyncFunction<TradeProvinceOrderWindow>("DIM_BASE_PROVINCE") {
    
    
                    @Override
                    public String getKey(TradeProvinceOrderWindow input) {
    
    
                        return input.getProvinceId();
                    }

                    @Override
                    public void join(TradeProvinceOrderWindow input, JSONObject dimInfo) {
    
    
                        input.setProvinceName(dimInfo.getString("NAME"));
                    }
                }, 100, TimeUnit.SECONDS);

        //TODO 9.将数据写出到ClickHouse
        reduceWithProvinceDS.print("reduceWithProvinceDS>>>>>>>>>>>");
        reduceWithProvinceDS.addSink(MyClickHouseUtil.getSinkFunction("insert into dws_trade_province_order_window values(?,?,?,?,?,?,?)"));

        //TODO 10.启动任务
        env.execute("DwsTradeProvinceOrderWindow");

    }

}

Transaction Domain Brand-Category-User Granularity Refund Summary Table for Each Window

main mission

Read the refund (chargeback) detail data from Kafka, filter null data and deduplicate according to the unique key, associate dimension information, group by dimension, count the number of refund orders in each dimension and window, and write the data into the ClickHouse transaction domain brand-category-user granularity refund summary table for each window.

Idea analysis

1) Read data from Kafka chargeback details topic

2) Convert data structure

JSONObject is converted to the entity class TradeTrademarkCategoryUserRefundBean.

3) Supplementary dimension information related to grouping

Associate the sku_info table to obtain tm_id and category3_id.

4) Set the watermark

5) Grouping, windowing, aggregation

Group by dimension information, sum the measurement fields, and supplement the window start time and end time after the window is closed. Set the timestamp to the current system time.

6) Supplementary dimension information not related to grouping

7) Write out to ClickHouse

diagram

ClickHouse table creation statement

drop table if exists dws_trade_trademark_category_user_refund_window;
create table if not exists dws_trade_trademark_category_user_refund_window
(
    stt            DateTime,
    edt            DateTime,
    trademark_id   String,
    trademark_name String,
    category1_id   String,
    category1_name String,
    category2_id   String,
    category2_name String,
    category3_id   String,
    category3_name String,
    user_id        String,
    refund_count   UInt64,
    ts             UInt64
) engine = ReplacingMergeTree(ts)
      partition by toYYYYMMDD(stt)
      order by (stt, edt, trademark_id, trademark_name, category1_id,
                category1_name, category2_id, category2_name, category3_id, category3_name, user_id);

the code

(1) Entity class TradeTrademarkCategoryUserRefundBean

@Data
@AllArgsConstructor
@Builder
public class TradeTrademarkCategoryUserRefundBean {
    
    
    // 窗口起始时间
    String stt;
    // 窗口结束时间
    String edt;
    // 品牌 ID
    String trademarkId;
    // 品牌名称
    String trademarkName;
    // 一级品类 ID
    String category1Id;
    // 一级品类名称
    String category1Name;
    // 二级品类 ID
    String category2Id;
    // 二级品类名称
    String category2Name;
    // 三级品类 ID
    String category3Id;
    // 三级品类名称
    String category3Name;

    // 订单 ID
    @TransientSink
    Set<String> orderIdSet;

    // sku_id
    @TransientSink
    String skuId;

    // 用户 ID
    String userId;
    // 退单次数
    Long refundCount;
    // 时间戳
    Long ts;

    public static void main(String[] args) {
    
    
        TradeTrademarkCategoryUserRefundBean build = builder().build();
        System.out.println(build);
    }
}

(2) Main program

//数据流:Web/app -> nginx -> 业务服务器(Mysql) -> Maxwell -> Kafka(ODS) -> FlinkApp -> Kafka(DWD) -> FlinkApp -> ClickHouse(DWS)
//程  序:Mock  ->  Mysql  ->  Maxwell -> Kafka(ZK)  ->  DwdTradeOrderRefund -> Kafka(ZK) -> DwsTradeTrademarkCategoryUserRefundWindow(Phoenix(HBase-HDFS、ZK)、Redis) -> ClickHouse(ZK)
public class DwsTradeTrademarkCategoryUserRefundWindow {
    
    

    public static void main(String[] args) throws Exception {
    
    

        //TODO 1.获取执行环境
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1); //生产环境中设置为Kafka主题的分区数

        //1.1 开启CheckPoint
        //env.enableCheckpointing(5 * 60000L, CheckpointingMode.EXACTLY_ONCE);
        //env.getCheckpointConfig().setCheckpointTimeout(10 * 60000L);
        //env.getCheckpointConfig().setMaxConcurrentCheckpoints(2);
        //env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 5000L));

        //1.2 设置状态后端
        //env.setStateBackend(new HashMapStateBackend());
        //env.getCheckpointConfig().setCheckpointStorage("hdfs://hadoop102:8020/211126/ck");
        //System.setProperty("HADOOP_USER_NAME", "atguigu");

        //1.3 设置状态的TTL  生产环境设置为最大乱序程度
        //tableEnv.getConfig().setIdleStateRetention(Duration.ofSeconds(5));

        //TODO 2.读取 Kafka DWD层 退单主题数据
        String topic = "dwd_trade_order_refund";
        String groupId = "dws_trade_trademark_category_user_refund_window";
        DataStreamSource<String> kafkaDS = env.addSource(MyKafkaUtil.getFlinkKafkaConsumer(topic, groupId));

        //TODO 3.将每行数据转换为JavaBean
        SingleOutputStreamOperator<TradeTrademarkCategoryUserRefundBean> tradeTmCategoryUserDS = kafkaDS.map(line -> {
    
    

            JSONObject jsonObject = JSON.parseObject(line);

            HashSet<String> orderIds = new HashSet<>();
            orderIds.add(jsonObject.getString("order_id"));

            return TradeTrademarkCategoryUserRefundBean.builder()
                    .skuId(jsonObject.getString("sku_id"))
                    .userId(jsonObject.getString("user_id"))
                    .orderIdSet(orderIds)
                    .ts(DateFormatUtil.toTs(jsonObject.getString("create_time"), true))
                    .build();
        });

        //TODO 4.关联sku_info维表补充 tm_id以及category3_id
        SingleOutputStreamOperator<TradeTrademarkCategoryUserRefundBean> tradeWithSkuDS = AsyncDataStream.unorderedWait(
                tradeTmCategoryUserDS,
                new DimAsyncFunction<TradeTrademarkCategoryUserRefundBean>("DIM_SKU_INFO") {
    
    
                    @Override
                    public String getKey(TradeTrademarkCategoryUserRefundBean input) {
    
    
                        return input.getSkuId();
                    }

                    @Override
                    public void join(TradeTrademarkCategoryUserRefundBean input, JSONObject dimInfo) {
    
    
                        input.setTrademarkId(dimInfo.getString("TM_ID"));
                        input.setCategory3Id(dimInfo.getString("CATEGORY3_ID"));
                    }
                }, 100, TimeUnit.SECONDS);

        //TODO 5.分组开窗聚合
        SingleOutputStreamOperator<TradeTrademarkCategoryUserRefundBean> reduceDS = tradeWithSkuDS.assignTimestampsAndWatermarks(WatermarkStrategy.<TradeTrademarkCategoryUserRefundBean>forBoundedOutOfOrderness(Duration.ofSeconds(2)).withTimestampAssigner(new SerializableTimestampAssigner<TradeTrademarkCategoryUserRefundBean>() {
    
    
            @Override
            public long extractTimestamp(TradeTrademarkCategoryUserRefundBean element, long recordTimestamp) {
    
    
                return element.getTs();
            }
        })).keyBy(new KeySelector<TradeTrademarkCategoryUserRefundBean, Tuple3<String, String, String>>() {
    
    
            @Override
            public Tuple3<String, String, String> getKey(TradeTrademarkCategoryUserRefundBean value) throws Exception {
    
    
                return new Tuple3<>(value.getUserId(),
                        value.getTrademarkId(),
                        value.getCategory3Id()
                );
            }
        }).window(TumblingEventTimeWindows.of(Time.seconds(10)))
                .reduce(new ReduceFunction<TradeTrademarkCategoryUserRefundBean>() {
    
    
                    @Override
                    public TradeTrademarkCategoryUserRefundBean reduce(TradeTrademarkCategoryUserRefundBean value1, TradeTrademarkCategoryUserRefundBean value2) throws Exception {
    
    
                        value1.getOrderIdSet().addAll(value2.getOrderIdSet());
                        return value1;
                    }
                }, new WindowFunction<TradeTrademarkCategoryUserRefundBean, TradeTrademarkCategoryUserRefundBean, Tuple3<String, String, String>, TimeWindow>() {
    
    
                    @Override
                    public void apply(Tuple3<String, String, String> stringStringStringTuple3, TimeWindow window, Iterable<TradeTrademarkCategoryUserRefundBean> input, Collector<TradeTrademarkCategoryUserRefundBean> out) throws Exception {
    
    

                        TradeTrademarkCategoryUserRefundBean refundBean = input.iterator().next();

                        refundBean.setTs(System.currentTimeMillis());
                        refundBean.setEdt(DateFormatUtil.toYmdHms(window.getEnd()));
                        refundBean.setStt(DateFormatUtil.toYmdHms(window.getStart()));
                        refundBean.setRefundCount((long) refundBean.getOrderIdSet().size());

                        out.collect(refundBean);
                    }
                });

        //TODO 6.关联维表补充其他字段
        SingleOutputStreamOperator<TradeTrademarkCategoryUserRefundBean> reduceWithTmDS = AsyncDataStream.unorderedWait(
                reduceDS,
                new DimAsyncFunction<TradeTrademarkCategoryUserRefundBean>("DIM_BASE_TRADEMARK") {
    
    
                    @Override
                    public String getKey(TradeTrademarkCategoryUserRefundBean input) {
    
    
                        return input.getTrademarkId();
                    }

                    @Override
                    public void join(TradeTrademarkCategoryUserRefundBean input, JSONObject dimInfo) {
    
    
                        input.setTrademarkName(dimInfo.getString("TM_NAME"));
                    }
                }, 100, TimeUnit.SECONDS);

        SingleOutputStreamOperator<TradeTrademarkCategoryUserRefundBean> reduceWith3DS = AsyncDataStream.unorderedWait(
                reduceWithTmDS,
                new DimAsyncFunction<TradeTrademarkCategoryUserRefundBean>("DIM_BASE_CATEGORY3") {
    
    
                    @Override
                    public String getKey(TradeTrademarkCategoryUserRefundBean input) {
    
    
                        return input.getCategory3Id();
                    }

                    @Override
                    public void join(TradeTrademarkCategoryUserRefundBean input, JSONObject dimInfo) {
    
    
                        input.setCategory3Name(dimInfo.getString("NAME"));
                        input.setCategory2Id(dimInfo.getString("CATEGORY2_ID"));
                    }
                }, 100, TimeUnit.SECONDS);

        SingleOutputStreamOperator<TradeTrademarkCategoryUserRefundBean> reduceWith2DS = AsyncDataStream.unorderedWait(
                reduceWith3DS,
                new DimAsyncFunction<TradeTrademarkCategoryUserRefundBean>("DIM_BASE_CATEGORY2") {
    
    
                    @Override
                    public String getKey(TradeTrademarkCategoryUserRefundBean input) {
    
    
                        return input.getCategory2Id();
                    }

                    @Override
                    public void join(TradeTrademarkCategoryUserRefundBean input, JSONObject dimInfo) {
    
    
                        input.setCategory2Name(dimInfo.getString("NAME"));
                        input.setCategory1Id(dimInfo.getString("CATEGORY1_ID"));
                    }
                }, 100, TimeUnit.SECONDS);

        SingleOutputStreamOperator<TradeTrademarkCategoryUserRefundBean> reduceWith1DS = AsyncDataStream.unorderedWait(
                reduceWith2DS,
                new DimAsyncFunction<TradeTrademarkCategoryUserRefundBean>("DIM_BASE_CATEGORY1") {
    
    
                    @Override
                    public String getKey(TradeTrademarkCategoryUserRefundBean input) {
    
    
                        return input.getCategory1Id();
                    }

                    @Override
                    public void join(TradeTrademarkCategoryUserRefundBean input, JSONObject dimInfo) {
    
    
                        input.setCategory1Name(dimInfo.getString("NAME"));
                    }
                }, 100, TimeUnit.SECONDS);

        //TODO 7.将数据写出到ClickHouse
        reduceWith1DS.print(">>>>>>>>>>>>");
        reduceWith1DS.addSink(MyClickHouseUtil.getSinkFunction("insert into dws_trade_trademark_category_user_refund_window values(?,?,?,?,?,?,?,?,?,?,?,?,?)"));

        //TODO 8.启动任务
        env.execute("DwsTradeTrademarkCategoryUserRefundWindow");

    }

}

Origin blog.csdn.net/qq_44766883/article/details/131001708