[Real-time data warehouse] DWS layer positioning and DWS-layer visitor theme calculation (PV, UV, bounce count, entry count, continuous visit duration)

1st Design of the DWS and DWM layers

1 Design ideas

So far, the data has been split into separate Kafka topics by diverting it with side outputs and similar means. Before deciding how to process the data next, we have to think about which indicators need to be calculated in real time.

Real-time computing differs from offline computing: its development and operations costs are much higher. So we must consider, based on the actual situation, whether it is really necessary to build a large, all-encompassing middle layer like an offline data warehouse.

If a large, all-encompassing layer is not needed, we should instead make an overall plan for the indicators that must be calculated in real time and output them as theme-oriented wide tables; these form the DWS layer.

2 Sorting out the requirements

In practice there will be more requirements than those listed below. Here, the real-time processing mainly serves the visual dashboard (large screen).

| Statistics topic | Metric | Output | Calculation source | Source layer |
| --- | --- | --- | --- | --- |
| Visitor | PV | visual dashboard | directly available from page_log | dwd |
| Visitor | UV | visual dashboard | page_log filtered and deduplicated | dwm |
| Visitor | Bounce count | visual dashboard | judged from page_log behavior | dwm |
| Visitor | Page entry count | visual dashboard | requires the start-of-visit flag | dwd |
| Visitor | Continuous visit duration | visual dashboard | directly available from page_log | dwd |
| Merchandise | Clicks | multidimensional analysis | directly available from page_log | dwd |
| Merchandise | Exposures | multidimensional analysis | directly available from page_log | dwd |
| Merchandise | Favorites | multidimensional analysis | favorites table | dwd |
| Merchandise | Add to cart | multidimensional analysis | shopping cart table | dwd |
| Merchandise | Orders | visual dashboard | order wide table | dwm |
| Merchandise | Payments | multidimensional analysis | payment wide table | dwm |
| Merchandise | Refunds | multidimensional analysis | refund table | dwd |
| Merchandise | Comments | multidimensional analysis | comment table | dwd |
| Region | PV | multidimensional analysis | directly available from page_log | dwd |
| Region | UV | multidimensional analysis | page_log filtered and deduplicated | dwm |
| Region | Orders | visual dashboard | order wide table | dwm |
| Keywords | Search keywords | visual dashboard | directly available from the page access log | dwd |
| Keywords | Clicked product keywords | visual dashboard | re-aggregated from merchandise-topic orders | dws |
| Keywords | Ordered product keywords | visual dashboard | re-aggregated from merchandise-topic orders | dws |

3 DWS layer positioning

Light aggregation: the DWS layer has to serve many real-time queries, and if it kept fully detailed data the query pressure would be too high.

Organizing the real-time data by theme makes it easier to manage and also reduces the number of dimension lookups.

2nd DWS Layer - Visitor Theme Calculation

| Statistics topic | Metric | Output | Calculation source | Source layer |
| --- | --- | --- | --- | --- |
| Visitor | PV | visual dashboard | directly available from page_log | dwd |
| Visitor | UV | visual dashboard | page_log filtered and deduplicated | dwm |
| Visitor | Bounce count | visual dashboard | judged from page_log behavior | dwm |
| Visitor | Page entry count | visual dashboard | requires the start-of-visit flag | dwd |
| Visitor | Continuous visit duration | visual dashboard | directly available from page_log | dwd |

Designing a DWS-layer table really comes down to two things: dimensions and measures (fact data).

  • Measures: PV, UV, bounce count, entry count (session_count), and continuous visit duration
  • Dimensions: the fields that matter most for analysis (channel, region, version, new/returning user flag), used as the aggregation key

1 Requirement analysis and approach

  • Read each detailed topic and turn it into a data stream.
  • Convert the streams to a common format and merge (union) them into one stream.
  • Aggregate the merged stream by dimension; the size of the aggregation window determines the timeliness of the data.
  • Write the aggregation results to the database.

To merge the three streams, we define an entity class VisitorStats with the fields [channel, region, version, new/returning user flag, PV, UV, bounce count, entry count (session_count), continuous visit duration]. Each source stream is mapped to this class as follows:

dwd_page_log
	new VisitorStats(channel, region, version, new/returning flag, 1L, 0L, 0L, 1L, XXXXL)

dwm_unique_visitor
	new VisitorStats(channel, region, version, new/returning flag, 0L, 1L, 0L, 0L, 0L)

dwm_user_jump_detail
	new VisitorStats(channel, region, version, new/returning flag, 0L, 0L, 1L, 0L, 0L)

The overall process is as follows:

(Figure: overall processing flow)

2 Function implementation

(1) Create VisitorStatsApp and read each stream from Kafka

a Code

package com.hzy.gmall.realtime.app.dws;

import com.hzy.gmall.realtime.utils.MyKafkaUtil; // utility built earlier in the series
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

/**
 * Visitor theme statistics (DWS)
 */
public class VisitorStatsApp {
    public static void main(String[] args) throws Exception {
        // TODO 1 Prepare the basic environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);

        // TODO 2 Read data from Kafka
        // 2.1 Declare the topics to read and the consumer group
        String pageViewSourceTopic = "dwd_page_log";
        String uniqueVisitSourceTopic = "dwm_unique_visitor";
        String userJumpDetailSourceTopic = "dwm_user_jump_detail";
        String groupId = "visitor_stats_app";

        // 2.2 Get the Kafka consumers
        FlinkKafkaConsumer<String> pvSource = MyKafkaUtil.getKafkaSource(pageViewSourceTopic, groupId);
        FlinkKafkaConsumer<String> uvSource = MyKafkaUtil.getKafkaSource(uniqueVisitSourceTopic, groupId);
        FlinkKafkaConsumer<String> ujdSource = MyKafkaUtil.getKafkaSource(userJumpDetailSourceTopic, groupId);

        // 2.3 Read the data and wrap it in streams
        DataStreamSource<String> pvStrDS = env.addSource(pvSource);
        DataStreamSource<String> uvStrDS = env.addSource(uvSource);
        DataStreamSource<String> ujdStrDS = env.addSource(ujdSource);

        pvStrDS.print("1111");
        uvStrDS.print("2222");
        ujdStrDS.print("3333");

        env.execute();
    }
}
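MyKafkaUtil.getKafkaSource comes from earlier parts of this series and is not shown in this section. For reference, a minimal sketch of such a utility, assuming a plain string-deserializing consumer; the package name and the broker list hadoop102:9092,hadoop103:9092,hadoop104:9092 are assumptions, adjust them to your cluster:

package com.hzy.gmall.realtime.utils;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Properties;

public class MyKafkaUtil {
    // assumed broker list; adjust to your environment
    private static final String KAFKA_SERVER = "hadoop102:9092,hadoop103:9092,hadoop104:9092";

    // returns a string-deserializing Kafka consumer for the given topic and consumer group
    public static FlinkKafkaConsumer<String> getKafkaSource(String topic, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", KAFKA_SERVER);
        props.setProperty("group.id", groupId);
        return new FlinkKafkaConsumer<>(topic, new SimpleStringSchema(), props);
    }
}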

b Test

Start ZooKeeper, Kafka, logger.sh, BaseLogApp, UnionVistorApp, UserJumpDetailAPP, and VisitorStatsApp, run the simulated user-behavior log generation script, and check whether the three kinds of records above are printed.

(2) Merge the data streams

Merge the data streams into one stream of objects with the same format.

The core operator for merging streams is union. However, union requires that all input streams have the same type, so the data structures must be aligned before the union.

a Define the theme wide-table entity class VisitorStats

package com.hzy.gmall.realtime.beans;

import lombok.AllArgsConstructor;
import lombok.Data;

/**
 * Desc: Visitor statistics entity class, containing the dimensions and measures
 */
@Data
@AllArgsConstructor
public class VisitorStats {
    // statistics window start time
    private String stt;
    // statistics window end time
    private String edt;
    // dimension: version
    private String vc;
    // dimension: channel
    private String ch;
    // dimension: region
    private String ar;
    // dimension: new/returning user flag
    private String is_new;
    // measure: unique visitor count
    private Long uv_ct = 0L;
    // measure: page view count
    private Long pv_ct = 0L;
    // measure: entry (session) count
    private Long sv_ct = 0L;
    // measure: bounce count
    private Long uj_ct = 0L;
    // measure: continuous visit duration
    private Long dur_sum = 0L;
    // statistics timestamp
    private Long ts;
}

b Convert the structure of each stream that is read

The jsonStr that is read has the following format (this example is a page log from dwd_page_log):
{
    "common": {
        "ar": "530000",
        "uid": "9",
        "os": "Android 11.0",
        "ch": "vivo",
        "is_new": "1",
        "md": "Xiaomi Mix2 ",
        "mid": "mid_6",
        "vc": "v2.1.134",
        "ba": "Xiaomi"
    },
    "page": {
        "page_id": "home",
        "item":"9",
        "during_time":15839,
        "item_type":"sku_id",
        "last_page_id":"home",
        "source_type":"query"
    },
    "displays": [
        {
            "display_type": "activity",
            "item": "1",
            "item_type": "activity_id",
            "pos_id": 5,
            "order": 1
        }
    ],
    "ts": 1670913783000
}
The code is as follows:
// TODO 3 Convert the data in each stream: jsonStr -> VisitorStats
// 3.1 Conversion of the dwd_page_log stream
SingleOutputStreamOperator<VisitorStats> pvStatsDS = pvStrDS.map(
        new MapFunction<String, VisitorStats>() {
            @Override
            public VisitorStats map(String jsonStr) throws Exception {
                JSONObject jsonObj = JSON.parseObject(jsonStr);
                JSONObject commonJsonObj = jsonObj.getJSONObject("common");
                JSONObject pageJsonObj = jsonObj.getJSONObject("page");
                VisitorStats visitorStats = new VisitorStats(
                        "",
                        "",
                        commonJsonObj.getString("vc"),
                        commonJsonObj.getString("ch"),
                        commonJsonObj.getString("ar"),
                        commonJsonObj.getString("is_new"),
                        0L,
                        1L,
                        0L,
                        0L,
                        pageJsonObj.getLong("during_time"),
                        jsonObj.getLong("ts")
                );
                // If this is the start of a new session (no last_page_id), add 1 to the entry count
                String lastPageId = pageJsonObj.getString("last_page_id");
                if (lastPageId == null || lastPageId.length() == 0) {
                    visitorStats.setSv_ct(1L);
                }
                return visitorStats;
            }
        }
);
// 3.2 Conversion of the dwm_unique_visitor stream
SingleOutputStreamOperator<VisitorStats> uvStatsDS = uvStrDS.map(
        new MapFunction<String, VisitorStats>() {
            @Override
            public VisitorStats map(String jsonStr) throws Exception {
                JSONObject jsonObj = JSON.parseObject(jsonStr);
                JSONObject commonJsonObj = jsonObj.getJSONObject("common");
                VisitorStats visitorStats = new VisitorStats(
                        "",
                        "",
                        commonJsonObj.getString("vc"),
                        commonJsonObj.getString("ch"),
                        commonJsonObj.getString("ar"),
                        commonJsonObj.getString("is_new"),
                        1L,
                        0L,
                        0L,
                        0L,
                        0L,
                        jsonObj.getLong("ts")
                );
                return visitorStats;
            }
        }
);
// 3.3 Conversion of the dwm_user_jump_detail stream
SingleOutputStreamOperator<VisitorStats> ujdStatsDS = ujdStrDS.map(
        new MapFunction<String, VisitorStats>() {
            @Override
            public VisitorStats map(String jsonStr) throws Exception {
                JSONObject jsonObj = JSON.parseObject(jsonStr);
                JSONObject commonJsonObj = jsonObj.getJSONObject("common");
                VisitorStats visitorStats = new VisitorStats(
                        "",
                        "",
                        commonJsonObj.getString("vc"),
                        commonJsonObj.getString("ch"),
                        commonJsonObj.getString("ar"),
                        commonJsonObj.getString("is_new"),
                        0L,
                        0L,
                        0L,
                        1L,
                        0L,
                        jsonObj.getLong("ts")
                );
                return visitorStats;
            }
        }
);
// TODO 4 Union the three converted streams
DataStream<VisitorStats> unionDS = pvStatsDS.union(uvStatsDS, ujdStatsDS);

unionDS.print(">>>");
The output looks like:
VisitorStats(stt=, edt=, vc=v2.1.134, ch=xiaomi, ar=110000, is_new=0, uv_ct=0, pv_ct=1, sv_ct=0, uj_ct=0, dur_sum=8283, ts=1670918057000)

c Aggregate by dimension

Approach

Because windowed aggregation is involved, the event time and watermark need to be specified.

Should multiple detail records with the same dimensions be counted together?

  • Keying by mid alone would not compress the data noticeably, because a single device (mid) produces very little activity per unit of time (compression only pays off when the data volume is large enough or the time unit is long enough).
  • Therefore the four commonly used statistical dimensions are used for aggregation: channel, new/returning user flag, app version, and region (province/city).
  • Measures include: launches, daily active users (first launch of the day), pages visited, new users, bounces, average page dwell time, and total visit duration.
  • Aggregation window: 10 seconds.
Code
// TODO 5 Assign watermarks and extract the event-time field
SingleOutputStreamOperator<VisitorStats> visitorStatsWithWatermarkDS = unionDS.assignTimestampsAndWatermarks(
        WatermarkStrategy.<VisitorStats>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                .withTimestampAssigner(
                        new SerializableTimestampAssigner<VisitorStats>() {
                            @Override
                            public long extractTimestamp(VisitorStats visitorStats, long recordTimestamp) {
                                return visitorStats.getTs();
                            }
                        }
                )
);

// TODO 6 Key the stream by dimensions
// Dimensions: version, channel, region, new/returning user flag; the grouping key is a Tuple4
KeyedStream<VisitorStats, Tuple4<String, String, String, String>> keyedDS = visitorStatsWithWatermarkDS.keyBy(
        new KeySelector<VisitorStats, Tuple4<String, String, String, String>>() {
            @Override
            public Tuple4<String, String, String, String> getKey(VisitorStats visitorStats) throws Exception {
                return Tuple4.of(
                        visitorStats.getVc(),
                        visitorStats.getCh(),
                        visitorStats.getAr(),
                        visitorStats.getIs_new()
                );
            }
        }
);

// TODO 7 Apply windows to the keyed data
// Each key gets its own windows; keys do not affect each other
WindowedStream<VisitorStats, Tuple4<String, String, String, String>, TimeWindow> windowDS = keyedDS.window(TumblingEventTimeWindows.of(Time.seconds(10)));

// TODO 8 Aggregate
// Incrementally reduce the elements of each window pairwise
SingleOutputStreamOperator<VisitorStats> reduceDS = windowDS.reduce(
        new ReduceFunction<VisitorStats>() {
            @Override
            public VisitorStats reduce(VisitorStats stats1, VisitorStats stats2) throws Exception {
                // add up the measures pairwise
                stats1.setPv_ct(stats1.getPv_ct() + stats2.getPv_ct());
                stats1.setUv_ct(stats1.getUv_ct() + stats2.getUv_ct());
                stats1.setSv_ct(stats1.getSv_ct() + stats2.getSv_ct());
                stats1.setDur_sum(stats1.getDur_sum() + stats2.getDur_sum());
                stats1.setUj_ct(stats1.getUj_ct() + stats2.getUj_ct());
                return stats1;
            }
        },
        new ProcessWindowFunction<VisitorStats, VisitorStats, Tuple4<String, String, String, String>, TimeWindow>() {
            @Override
            public void process(Tuple4<String, String, String, String> key, Context context, Iterable<VisitorStats> elements, Collector<VisitorStats> out) throws Exception {
                // fill in the window start/end time fields
                for (VisitorStats visitorStats : elements) {
                    visitorStats.setStt(DateTimeUtil.toYMDHMS(new Date(context.window().getStart())));
                    visitorStats.setEdt(DateTimeUtil.toYMDHMS(new Date(context.window().getEnd())));
                    // use the current system time as the statistics timestamp
                    visitorStats.setTs(System.currentTimeMillis());
                    // emit the processed record downstream
                    out.collect(visitorStats);
                }
            }
        }
);

reduceDS.print(">>>");
The output looks like:
>>>:1> VisitorStats(stt=2022-12-13 16:48:40, edt=2022-12-13 16:48:50, vc=v2.1.134, ch=Appstore, ar=500000, is_new=0, uv_ct=0, pv_ct=8, sv_ct=1, uj_ct=0, dur_sum=77934, ts=1670921334767)
>>>:2> VisitorStats(stt=2022-12-13 16:48:40, edt=2022-12-13 16:48:50, vc=v2.1.134, ch=xiaomi, ar=110000, is_new=0, uv_ct=1, pv_ct=7, sv_ct=1, uj_ct=0, dur_sum=78835, ts=1670921334767)
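DateTimeUtil.toYMDHMS is another helper from earlier in the series and is not shown in this section. A minimal sketch, assuming it simply formats a java.util.Date as yyyy-MM-dd HH:mm:ss (the package name is an assumption; DateTimeFormatter is used because, unlike SimpleDateFormat, it is thread-safe):

package com.hzy.gmall.realtime.utils;

import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.Date;

public class DateTimeUtil {
    // DateTimeFormatter is immutable and thread-safe
    private static final DateTimeFormatter FORMATTER =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // format a Date (e.g. a window start/end) as "yyyy-MM-dd HH:mm:ss"
    public static String toYMDHMS(Date date) {
        LocalDateTime ldt = LocalDateTime.ofInstant(date.toInstant(), ZoneId.systemDefault());
        return FORMATTER.format(ldt);
    }
}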

(3) Write to the OLAP database

Reference: ClickHouse official website.

# Start the ClickHouse server
sudo systemctl start clickhouse-server
# Start the client (-m allows multi-line statements)
clickhouse-client -m

Why write to ClickHouse? ClickHouse is a database designed for statistical analysis over large amounts of data. It can store massive data while still responding quickly, and it supports standard SQL, which makes it flexible and easy to use.

For a detailed installation guide and introduction to the ClickHouse database, see the official website referenced above.

a Prepare the ClickHouse table

create table visitor_stats_2022 (
    stt DateTime,
    edt DateTime,
    vc String,
    ch String,
    ar String,
    is_new String,
    uv_ct UInt64,
    pv_ct UInt64,
    sv_ct UInt64,
    uj_ct UInt64,
    dur_sum UInt64,
    ts UInt64
) engine = ReplacingMergeTree(ts)
  partition by toYYYYMMDD(stt)
  order by (stt, edt, is_new, vc, ch, ar);

The ReplacingMergeTree engine is chosen mainly because it helps keep writes to the table idempotent.

  • partition by converts the date into a number (e.g. 20201126) for partitioning, so query conditions should include the stt field whenever possible.
  • Within a partition, rows with the same order by key are considered duplicates; during merges they are deduplicated and the row with the largest ts is kept.
  • The order and names of the columns in the CREATE TABLE statement must match the order and names of the fields of VisitorStats.

b Add the ClickHouse dependencies

<dependency>
    <groupId>ru.yandex.clickhouse</groupId>
    <artifactId>clickhouse-jdbc</artifactId>
    <version>0.3.0</version>
    <exclusions>
        <exclusion>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
        </exclusion>
        <exclusion>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-core</artifactId>
        </exclusion>
    </exclusions>
</dependency>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-jdbc_${scala.version}</artifactId>
    <version>${flink.version}</version>
</dependency>

Here, flink-connector-jdbc is Flink's official, general-purpose JDBC sink connector. As long as the corresponding JDBC driver is added, Flink can use it with any database that supports JDBC (Phoenix, for example, can also be written this way). However, this JDBC sink only supports writing one stream to one table; writing one stream to many tables has to be implemented in a custom way, as was done earlier for the dimension data.

Although this JDBC sink is one-to-one only, it uses prepared statements internally and supports batched submission, which speeds up writing.
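The actual sink code is not included in this section. As an illustration only, here is a minimal sketch of how reduceDS could be wired to the visitor_stats_2022 table with Flink's generic JdbcSink; the host hadoop102, port 8123, and database default in the JDBC URL are assumptions, adjust them to your ClickHouse installation:

// TODO 9 Write the aggregated results to ClickHouse (sketch)
// additionally requires: org.apache.flink.connector.jdbc.JdbcSink, JdbcStatementBuilder,
//                        JdbcExecutionOptions, JdbcConnectionOptions
reduceDS.addSink(
        JdbcSink.sink(
                // one "?" per column, in the same order as the visitor_stats_2022 columns / VisitorStats fields
                "insert into visitor_stats_2022 values(?,?,?,?,?,?,?,?,?,?,?,?)",
                (JdbcStatementBuilder<VisitorStats>) (ps, stats) -> {
                    ps.setString(1, stats.getStt());
                    ps.setString(2, stats.getEdt());
                    ps.setString(3, stats.getVc());
                    ps.setString(4, stats.getCh());
                    ps.setString(5, stats.getAr());
                    ps.setString(6, stats.getIs_new());
                    ps.setLong(7, stats.getUv_ct());
                    ps.setLong(8, stats.getPv_ct());
                    ps.setLong(9, stats.getSv_ct());
                    ps.setLong(10, stats.getUj_ct());
                    ps.setLong(11, stats.getDur_sum());
                    ps.setLong(12, stats.getTs());
                },
                // batched submission: flush every 5 records or every 2 seconds
                JdbcExecutionOptions.builder()
                        .withBatchSize(5)
                        .withBatchIntervalMs(2000L)
                        .build(),
                // assumed connection info; adjust host/port/database to your environment
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withDriverName("ru.yandex.clickhouse.ClickHouseDriver")
                        .withUrl("jdbc:clickhouse://hadoop102:8123/default")
                        .build()
        )
);

After the job has run, the results can be checked in clickhouse-client with a simple select on visitor_stats_2022.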

The completed functionality is as follows:

(Figure: the completed VisitorStatsApp)

Source: blog.csdn.net/weixin_43923463/article/details/128322370