[Real-time data warehouse] DWM layer design pattern and unique visitor (UV) calculation

I. Design of the DWS and DWM layers

1 Design ideas

Earlier, the data was split into separate Kafka topics by stream splitting. To decide how to process it next, we first have to decide which metrics need to be calculated in real time.

Real-time computing differs from offline computing in that its development and operations costs are very high. So consider, based on the actual situation, whether you really need a large, comprehensive middle layer like the one in an offline data warehouse.

If not, plan up front which metrics must be calculated in real time. The DWS layer is where these metrics are output as subject-oriented wide tables.

2 DWS layer requirement analysis

| Statistics topic | Metric | Output method | Calculation source | Source layer |
|---|---|---|---|---|
| Visitor | pv | Large-screen dashboard | Directly available from page_log | dwd |
| | uv | Large-screen dashboard | page_log, filtered and deduplicated | dwm |
| | Bounce details | Large-screen dashboard | Determined from page_log behavior | dwm |
| | Page entry count | Large-screen dashboard | Requires identifying the start-of-visit flag | dwd |
| | Continuous visit duration | Large-screen dashboard | Directly available from page_log | dwd |
| Product | Click | Multidimensional analysis | Directly available from page_log | dwd |
| | Favorite | Multidimensional analysis | Favorites table | dwd |
| | Add to cart | Multidimensional analysis | Shopping cart table | dwd |
| | Place order | Large-screen dashboard | Order wide table | dwm |
| | Pay | Multidimensional analysis | Payment wide table | dwm |
| | Refund | Multidimensional analysis | Refund table | dwd |
| | Comment | Multidimensional analysis | Comment table | dwd |
| Region | pv | Multidimensional analysis | Directly available from page_log | dwd |
| | uv | Multidimensional analysis | page_log, filtered and deduplicated | dwm |
| | Place order | Large-screen dashboard | Order wide table | dwm |
| Keyword | Search keywords | Large-screen dashboard | Directly available from the page access log | dwd |
| | Clicked-product keywords | Large-screen dashboard | Re-aggregation of product-topic order data | dws |
| | Ordered-product keywords | Large-screen dashboard | Re-aggregation of product-topic order data | dws |

Of course, real projects will have more requirements. Here, the real-time calculations mainly serve the large-screen dashboard.

The DWM layer exists mainly to serve the DWS layer. Some requirements involve a nontrivial amount of computation on the way from the DWD layer to the DWS layer, and the results of that computation are likely to be reused by several DWS topics. That shared intermediate work on DWD data forms the DWM layer. The business logic involved here is: visit UV calculation, bounce detail calculation, the order wide table, and the payment wide table.

II. DWM layer: UV calculation

1 Requirement analysis and approach

UV stands for Unique Visitor. In real-time computing it can also be called DAU (Daily Active Users), because UV in real-time computing usually means the number of visitors on the current day.

To identify the day's visitors from the user behavior log, two things are needed:

  • First, identify the first page the visitor opens, which marks the visitor entering the application.
  • Second, because a visitor can enter the application several times in one day, deduplicate within the day.
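
The two rules above can be sketched in plain Java, without Flink. This is only an illustration under stated assumptions: the `PageView` class, the `countDailyUv` method, and the UTC day boundary are hypothetical and not part of the original project. Only the field names `mid`, `last_page_id`, and `ts` come from the page log.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified view of one page_log record.
final class PageView {
    final String mid;        // device id (common.mid)
    final String lastPageId; // page.last_page_id, null on the first page of a session
    final long ts;           // event time in epoch milliseconds

    PageView(String mid, String lastPageId, long ts) {
        this.mid = mid;
        this.lastPageId = lastPageId;
        this.ts = ts;
    }
}

final class DailyUvSketch {
    // Counts unique visitors for one calendar day (the UTC day boundary is an assumption).
    static int countDailyUv(List<PageView> views, LocalDate day) {
        Set<String> seen = new HashSet<>();
        for (PageView v : views) {
            // Rule 1: only a page with no previous page counts as entering the application.
            if (v.lastPageId != null && !v.lastPageId.isEmpty()) {
                continue;
            }
            LocalDate d = Instant.ofEpochMilli(v.ts).atZone(ZoneOffset.UTC).toLocalDate();
            if (!d.equals(day)) {
                continue;
            }
            // Rule 2: a device that enters several times in one day is counted once.
            seen.add(v.mid);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        long ts = 1670158358000L; // 2022-12-04 12:52:38 UTC
        List<PageView> views = Arrays.asList(
                new PageView("mid_13", null, ts),
                new PageView("mid_13", "good_detail", ts + 1_000),
                new PageView("mid_13", null, ts + 60_000),
                new PageView("mid_14", null, ts + 2_000));
        System.out.println(countDailyUv(views, LocalDate.of(2022, 12, 4))); // prints 2
    }
}
```

The batch version keeps every device id for the day in one set. The streaming version later in this article keeps one small piece of state per device instead, which is what makes expiring old state possible.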


2 Read data from Kafka

The workflow is: consume the dwd_page_log topic from Kafka, convert each record from String to JSONObject, and print the result.

(1) Code implementation

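import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
// MyKafkaUtil is the project's own Kafka helper class; import it from your project package.

public class UnionVistorApp {
    public static void main(String[] args) throws Exception {
        //TODO 1 Set up the basic environment
        //1.1 Stream execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        //1.2 Set the parallelism
        env.setParallelism(4);

        //TODO 2 Checkpoint settings
//        //2.1 Enable checkpointing
//        env.enableCheckpointing(5000L, CheckpointingMode.EXACTLY_ONCE);
//        //2.2 Set the checkpoint timeout
//        env.getCheckpointConfig().setCheckpointTimeout(60000L);
//        //2.3 Set the restart strategy
//        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3,3000L));
//        //2.4 Decide whether checkpoints are retained after the job is cancelled
//        env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
//        //2.5 Set the state backend -- memory, file system, or RocksDB
//        env.setStateBackend(new FsStateBackend("hdfs://hadoop101:8020/ck/gmall"));
//        //2.6 Specify the user that operates on HDFS
//        System.setProperty("HADOOP_USER_NAME","hzy");

        //TODO 3 Read data from Kafka
        //3.1 Declare the topic to consume and the consumer group
        String topic = "dwd_page_log";
        String groupId = "union_visitor_app_group";
        //3.2 Get the Kafka consumer
        FlinkKafkaConsumer<String> kafkaSource = MyKafkaUtil.getKafkaSource(topic, groupId);
        //3.3 Read the data into a stream
        DataStreamSource<String> kafkaDS = env.addSource(kafkaSource);

        //TODO 4 Convert the records: String -> JSONObject
        SingleOutputStreamOperator<JSONObject> jsonObjDS = kafkaDS.map(JSON::parseObject);

        jsonObjDS.print(">>>");
        env.execute();
    }
}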

(2) Test

Components that need to be running: ZooKeeper, Kafka, the log simulator jar, logger.sh, BaseLogApp, and UnionVistorApp.

  • Start logger.sh, ZooKeeper, and Kafka
  • Run BaseLogApp in IntelliJ IDEA
  • Run UnionVistorApp in IntelliJ IDEA
  • Check the console output

Execution flow:

Simulated data -> log collection server -> Kafka ODS layer (ods_base_log) -> BaseLogApp splits the stream -> dwd_page_log -> UnionVistorApp reads and prints

The output looks like this:

BaseLogApp

start stream::3> {"common":{"ar":"110000","uid":"45","os":"Android 11.0","ch":"360","is_new":"1","md":"Xiaomi Mix2 ","mid":"mid_13","vc":"v2.1.134","ba":"Xiaomi"},"start":{"entry":"install","open_ad_skip_ms":0,"open_ad_ms":5918,"loading_time":1480,"open_ad_id":11},"ts":1670158358000}
exposure stream::1> {"display_type":"query","page_id":"good_detail","item":"10","item_type":"sku_id","pos_id":4,"order":4,"ts":1670158358000}
exposure stream::3> {"display_type":"query","page_id":"good_detail","item":"1","item_type":"sku_id","pos_id":5,"order":6,"ts":1670158358000}
exposure stream::1> {"display_type":"query","page_id":"good_detail","item":"5","item_type":"sku_id","pos_id":1,"order":5,"ts":1670158358000}
main stream::3> {"common":{"ar":"110000","uid":"45","os":"Android 11.0","ch":"360","is_new":"1","md":"Xiaomi Mix2 ","mid":"mid_13","vc":"v2.1.134","ba":"Xiaomi"},"page":{"page_id":"cart","during_time":15330,"last_page_id":"good_detail"},"ts":1670158358000}

UnionVistorApp

>>>:2> {"common":{"ar":"110000","uid":"45","os":"Android 11.0","ch":"360","is_new":"1","md":"Xiaomi Mix2 ","mid":"mid_13","vc":"v2.1.134","ba":"Xiaomi"},"page":{"page_id":"good_detail","item":"3","during_time":9775,"item_type":"sku_id","last_page_id":"good_list","source_type":"query"},"displays":[{"display_type":"recommend","item":"10","item_type":"sku_id","pos_id":4,"order":1},{"display_type":"recommend","item":"3","item_type":"sku_id","pos_id":1,"order":2},{"display_type":"promotion","item":"2","item_type":"sku_id","pos_id":4,"order":3},{"display_type":"query","item":"8","item_type":"sku_id","pos_id":1,"order":4},{"display_type":"query","item":"10","item_type":"sku_id","pos_id":5,"order":5},{"display_type":"query","item":"1","item_type":"sku_id","pos_id":5,"order":6}],"actions":[{"item":"3","action_id":"favor_add","item_type":"sku_id","ts":1670158362887}],"ts":1670158358000}
>>>:4> {"common":{"ar":"110000","uid":"45","os":"Android 11.0","ch":"360","is_new":"1","md":"Xiaomi Mix2 ","mid":"mid_13","vc":"v2.1.134","ba":"Xiaomi"},"page":{"page_id":"trade","item":"4,6,10","during_time":5294,"item_type":"sku_ids","last_page_id":"cart"},"ts":1670158358000}
>>>:3> {"common":{"ar":"110000","uid":"45","os":"Android 11.0","ch":"360","is_new":"1","md":"Xiaomi Mix2 ","mid":"mid_13","vc":"v2.1.134","ba":"Xiaomi"},"page":{"page_id":"cart","during_time":15330,"last_page_id":"good_detail"},"ts":1670158358000}

(3) Summary

Execution flow:

  • The log simulator jar generates log data
  • The simulated log data is sent to Nginx for load balancing
  • Nginx forwards the requests to three log collection services
  • The three log collection services receive the log data and send it to the Kafka topic ods_base_log
  • BaseLogApp reads from ods_base_log and splits the stream
    • Startup log: dwd_start_log
    • Exposure log: dwd_display_log
    • Page log: dwd_page_log
  • UnionVistorApp reads from the dwd_page_log topic
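
The splitting step in this flow can be pictured with a small routing sketch. BaseLogApp itself is not shown in this article, so everything here is an assumption: the `LogRouterSketch` class and `targetTopics` method are hypothetical, records are modeled as plain maps rather than JSONObject, and the rule "one message per exposure entry" is inferred from the sample output above. Only the three topic names come from the article.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LogRouterSketch {
    // Returns the target topic of every message that one parsed log record would produce.
    static List<String> targetTopics(Map<String, Object> log) {
        List<String> topics = new ArrayList<>();
        if (log.containsKey("start")) {
            // Startup logs go to their own topic.
            topics.add("dwd_start_log");
            return topics;
        }
        // Anything that is not a startup log is treated as a page log.
        topics.add("dwd_page_log");
        Object displays = log.get("displays");
        if (displays instanceof List) {
            // Assumption: each exposure entry is emitted as a separate message.
            for (Object ignored : (List<?>) displays) {
                topics.add("dwd_display_log");
            }
        }
        return topics;
    }

    public static void main(String[] args) {
        Map<String, Object> pageLog = new HashMap<>();
        pageLog.put("page", Collections.singletonMap("page_id", "good_detail"));
        pageLog.put("displays", Arrays.asList("display_1", "display_2"));
        // prints [dwd_page_log, dwd_display_log, dwd_display_log]
        System.out.println(targetTopics(pageLog));
    }
}
```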

3 UV filtering: unique visitor calculation

(1) Implementation ideas

  • First use keyBy to group the records by mid, so that each group holds the visit state of one device
  • After grouping, use keyed state to record the date of the last visit, and implement a RichFilterFunction to do the filtering
  • Override the open method to initialize the state
  • Override the filter method to do the filtering
    • Records whose last_page_id is not empty can be dropped immediately: if there is a previous page, this is not the first page the user entered.
    • The state records the date of the last visit. If lastVisitDate is today, the device has already visited today and the record is dropped. If it is empty or not today, there has been no visit yet today and the record is kept.
    • Because the state is only used to check for an earlier visit today, it is of little use once the day is over. enableTimeToLive sets an expiration time of 1 day so that the state does not keep growing.
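
To make the interplay of the filter and the one-day expiry concrete, here is a minimal plain-Java simulation of the idea. It is not Flink code: the `UvFilterSketch` class, its `HashMap` standing in for keyed state, and the explicit `now` parameter (standing in for the processing-time clock that Flink's state TTL is based on) are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class UvFilterSketch {
    private static final long TTL_MILLIS = 24L * 60 * 60 * 1000; // 1 day, as described above

    // Per-device state: the last visit date plus the time the state was written.
    private static final class Entry {
        final String date;
        final long writtenAt;

        Entry(String date, long writtenAt) {
            this.date = date;
            this.writtenAt = writtenAt;
        }
    }

    // One entry per mid; plays the role of keyed ValueState in this sketch.
    private final Map<String, Entry> stateByMid = new HashMap<>();

    // Returns true when the record should be kept as a UV record.
    boolean filter(String mid, String lastPageId, String visitDate, long now) {
        // Not the first page of the visit: drop it.
        if (lastPageId != null && !lastPageId.isEmpty()) {
            return false;
        }
        Entry e = stateByMid.get(mid);
        if (e != null && now - e.writtenAt >= TTL_MILLIS) {
            // Expired state is removed and treated as absent.
            stateByMid.remove(mid);
            e = null;
        }
        if (e != null && e.date.equals(visitDate)) {
            return false; // already counted today
        }
        // First visit today: remember the date; writing restarts the TTL clock.
        stateByMid.put(mid, new Entry(visitDate, now));
        return true;
    }

    public static void main(String[] args) {
        UvFilterSketch f = new UvFilterSketch();
        System.out.println(f.filter("mid_13", null, "20221204", 0L));          // true
        System.out.println(f.filter("mid_13", null, "20221204", 1_000L));      // false
        System.out.println(f.filter("mid_13", "cart", "20221205", 2_000L));    // false
        System.out.println(f.filter("mid_13", null, "20221205", 90_000_000L)); // true
    }
}
```

Note that the TTL is not what makes the filter correct: a stale date from an earlier day never equals today's date. The expiry only bounds how much state is held for devices that do not come back.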

(2) Code implementation

        //TODO 5 Group the data by device id (mid)
        KeyedStream<JSONObject, String> keyedDS = jsonObjDS.keyBy(jsonObj -> jsonObj.getJSONObject("common").getString("mid"));

        //TODO 6 Filter
        // Why the state must expire: suppose a user visits once in June and once in November,
        // two visits in total between June and November. Keeping the June visit state
        // until it is updated in November would waste a lot of resources,
        // so the visit date is kept in state and the state is expired on a schedule.
        SingleOutputStreamOperator<JSONObject> filterDS = keyedDS.filter(
                new RichFilterFunction<JSONObject>() {
                    // State that holds the date of the last visit
                    private ValueState<String> lastVistDateState;
                    // Date formatter
                    private SimpleDateFormat sdf;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        sdf = new SimpleDateFormat("yyyyMMdd");
                        ValueStateDescriptor<String> valueStateDescriptor = new ValueStateDescriptor<>("lastVistDateState", String.class);
                        // Note: UV can be extended to daily-active statistics. The state is only
                        // used to check whether the device has already visited today,
                        // so it has no value once the day is over.
                        // The state TTL is therefore set to 1 day.
                        // Granularity is the day; hours, minutes and seconds are not recorded.
                        // Time here is org.apache.flink.api.common.time.Time.
                        StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(Time.days(1))
                                // Default: the expiration time is refreshed when the state is created or written
//                                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                                // Default: expired state is never returned to the caller, even if it has not been cleaned up yet
//                                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                                .build();
                        valueStateDescriptor.enableTimeToLive(ttlConfig);
                        lastVistDateState = getRuntimeContext().getState(valueStateDescriptor);
                    }

                    @Override
                    public boolean filter(JSONObject jsonObj) throws Exception {
                        // Records reached from another page are dropped immediately
                        String lastPageId = jsonObj.getJSONObject("page").getString("last_page_id");
                        if (lastPageId != null && lastPageId.length() > 0) {
                            return false;
                        }
                        // Get the last visit date from state
                        String lastVisitDate = lastVistDateState.value();
                        String curVisitDate = sdf.format(jsonObj.getLong("ts"));
                        // equals is null-safe here: a missing state never matches today's date
                        if (curVisitDate.equals(lastVisitDate)) {
                            // Already visited today
                            return false;
                        } else {
                            // Not visited yet today
                            lastVistDateState.update(curVisitDate);
                            return true;
                        }
                    }
                }
        );

        filterDS.print(">>>");
        env.execute();
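
One detail worth keeping in mind: the filter derives the visit date from `ts` with `SimpleDateFormat`, which uses the JVM's default time zone unless one is set, so where "today" starts depends on where the job runs. The sketch below shows one way to make the day boundary explicit with `java.time`. The `VisitDateSketch` class and `visitDate` method are hypothetical, and `Asia/Shanghai` is only an example zone, not something the original code specifies.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

final class VisitDateSketch {
    // DateTimeFormatter is immutable and thread-safe, so one shared instance is enough.
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Formats an epoch-millisecond timestamp as a yyyyMMdd day in the given zone.
    static String visitDate(long ts, ZoneId zone) {
        return DAY.format(Instant.ofEpochMilli(ts).atZone(zone));
    }

    public static void main(String[] args) {
        long ts = 1670169600000L; // 2022-12-04T16:00:00Z
        System.out.println(visitDate(ts, ZoneId.of("UTC")));           // 20221204
        System.out.println(visitDate(ts, ZoneId.of("Asia/Shanghai"))); // 20221205
    }
}
```

The same instant falls on different days in the two zones, which would change which visit counts as the first of the day.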

4 Write to Kafka

Write the filtered UV records to the Kafka topic dwm_unique_visitor.

//TODO 7 Write the filtered UV data back to Kafka's DWM layer (add this sink before the env.execute() call)
filterDS.map(jsonObj -> jsonObj.toJSONString()).addSink(
        MyKafkaUtil.getKafkaSink("dwm_unique_visitor")
);

5 Test

# Start logger.sh, ZooKeeper, and Kafka
# Run BaseLogApp in IntelliJ IDEA
# Run UnionVistorApp in IntelliJ IDEA
# Check the console output and the Kafka topic dwm_unique_visitor

The overall execution flow of the program is as follows:

Simulated data -> log collection server -> Kafka ODS layer (ods_base_log) -> BaseLogApp splits the stream -> dwd_page_log -> UnionVistorApp reads and filters -> written back to Kafka's DWM layer (dwm_unique_visitor)

Origin blog.csdn.net/weixin_43923463/article/details/128321938