[Real-time data warehouse] DWS layer positioning and DWS-layer visitor theme calculation (PV, UV, bounce count, entry count, continuous visit duration)

1st Design of the DWS and DWM layers

1 Design ideas

So far, the data has been split into separate Kafka topics by diverting it with side outputs and similar means. Before deciding how to process the data next, we have to think about which indicators need to be calculated in real time.

Real-time computing differs from offline computing: its development and operations costs are much higher. So we must consider, based on the actual situation, whether it is really necessary to build a large, all-encompassing middle layer like an offline data warehouse.

If a large, all-encompassing layer is not needed, we should instead make an overall plan for the indicators that must be calculated in real time and output them as theme-oriented wide tables; these form the DWS layer.

2 Sorting out the requirements

In practice there will be more requirements than those listed below. Here, the real-time processing mainly serves the visual dashboard (large screen).

| Statistics topic | Metric | Output | Calculation source | Source layer |
| --- | --- | --- | --- | --- |
| Visitor | PV | visual dashboard | directly available from page_log | dwd |
| Visitor | UV | visual dashboard | page_log filtered and deduplicated | dwm |
| Visitor | Bounce count | visual dashboard | judged from page_log behavior | dwm |
| Visitor | Page entry count | visual dashboard | requires the start-of-visit flag | dwd |
| Visitor | Continuous visit duration | visual dashboard | directly available from page_log | dwd |
| Merchandise | Clicks | multidimensional analysis | directly available from page_log | dwd |
| Merchandise | Exposures | multidimensional analysis | directly available from page_log | dwd |
| Merchandise | Favorites | multidimensional analysis | favorites table | dwd |
| Merchandise | Add to cart | multidimensional analysis | shopping cart table | dwd |
| Merchandise | Orders | visual dashboard | order wide table | dwm |
| Merchandise | Payments | multidimensional analysis | payment wide table | dwm |
| Merchandise | Refunds | multidimensional analysis | refund table | dwd |
| Merchandise | Comments | multidimensional analysis | comment table | dwd |
| Region | PV | multidimensional analysis | directly available from page_log | dwd |
| Region | UV | multidimensional analysis | page_log filtered and deduplicated | dwm |
| Region | Orders | visual dashboard | order wide table | dwm |
| Keywords | Search keywords | visual dashboard | directly available from the page access log | dwd |
| Keywords | Clicked product keywords | visual dashboard | re-aggregated from merchandise-topic orders | dws |
| Keywords | Ordered product keywords | visual dashboard | re-aggregated from merchandise-topic orders | dws |

3 DWS layer positioning

Light aggregation: the DWS layer has to serve many real-time queries, and if it kept fully detailed data the query pressure would be too high.

Organizing the real-time data by theme makes it easier to manage and also reduces the number of dimension lookups.

2nd DWS Layer - Visitor Theme Calculation

| Statistics topic | Metric | Output | Calculation source | Source layer |
| --- | --- | --- | --- | --- |
| Visitor | PV | visual dashboard | directly available from page_log | dwd |
| Visitor | UV | visual dashboard | page_log filtered and deduplicated | dwm |
| Visitor | Bounce count | visual dashboard | judged from page_log behavior | dwm |
| Visitor | Page entry count | visual dashboard | requires the start-of-visit flag | dwd |
| Visitor | Continuous visit duration | visual dashboard | directly available from page_log | dwd |

Designing a DWS-layer table really comes down to two things: dimensions and measures (fact data).

  • Measures: PV, UV, bounce count, entry count (session_count), and continuous visit duration
  • Dimensions: the fields that matter most for analysis (channel, region, version, new/returning user flag), used as the aggregation key

1 Requirement analysis and approach

  • Read each detailed topic and turn it into a data stream.
  • Convert the streams to a common format and merge (union) them into one stream.
  • Aggregate the merged stream by dimension; the size of the aggregation window determines the timeliness of the data.
  • Write the aggregation results to the database.

To merge the three streams, we define an entity class VisitorStats with the fields [channel, region, version, new/returning user flag, PV, UV, bounce count, entry count (session_count), continuous visit duration]. Each source stream is mapped to this class as follows:

dwd_page_log
	new VisitorStats(channel, region, version, new/returning flag, 1L, 0L, 0L, 1L, XXXXL)

dwm_unique_visitor
	new VisitorStats(channel, region, version, new/returning flag, 0L, 1L, 0L, 0L, 0L)

dwm_user_jump_detail
	new VisitorStats(channel, region, version, new/returning flag, 0L, 0L, 1L, 0L, 0L)

The overall process is as follows:

(Figure: overall processing flow)

2 Function implementation

(1) Create VisitorStatsApp and read each stream from Kafka

a Code

package com.hzy.gmall.realtime.app.dws;

import com.hzy.gmall.realtime.utils.MyKafkaUtil; // utility built earlier in the series
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

/**
 * Visitor theme statistics (DWS)
 */
public class VisitorStatsApp {
    public static void main(String[] args) throws Exception {
        // TODO 1 Prepare the basic environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);

        // TODO 2 Read data from Kafka
        // 2.1 Declare the topics to read and the consumer group
        String pageViewSourceTopic = "dwd_page_log";
        String uniqueVisitSourceTopic = "dwm_unique_visitor";
        String userJumpDetailSourceTopic = "dwm_user_jump_detail";
        String groupId = "visitor_stats_app";

        // 2.2 Get the Kafka consumers
        FlinkKafkaConsumer<String> pvSource = MyKafkaUtil.getKafkaSource(pageViewSourceTopic, groupId);
        FlinkKafkaConsumer<String> uvSource = MyKafkaUtil.getKafkaSource(uniqueVisitSourceTopic, groupId);
        FlinkKafkaConsumer<String> ujdSource = MyKafkaUtil.getKafkaSource(userJumpDetailSourceTopic, groupId);

        // 2.3 Read the data and wrap it in streams
        DataStreamSource<String> pvStrDS = env.addSource(pvSource);
        DataStreamSource<String> uvStrDS = env.addSource(uvSource);
        DataStreamSource<String> ujdStrDS = env.addSource(ujdSource);

        pvStrDS.print("1111");
        uvStrDS.print("2222");
        ujdStrDS.print("3333");

        env.execute();
    }
}
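MyKafkaUtil.getKafkaSource comes from earlier parts of this series and is not shown in this section. For reference, a minimal sketch of such a utility, assuming a plain string-deserializing consumer; the package name and the broker list hadoop102:9092,hadoop103:9092,hadoop104:9092 are assumptions, adjust them to your cluster:

package com.hzy.gmall.realtime.utils;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Properties;

public class MyKafkaUtil {
    // assumed broker list; adjust to your environment
    private static final String KAFKA_SERVER = "hadoop102:9092,hadoop103:9092,hadoop104:9092";

    // returns a string-deserializing Kafka consumer for the given topic and consumer group
    public static FlinkKafkaConsumer<String> getKafkaSource(String topic, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", KAFKA_SERVER);
        props.setProperty("group.id", groupId);
        return new FlinkKafkaConsumer<>(topic, new SimpleStringSchema(), props);
    }
}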

b Test

Start ZooKeeper, Kafka, logger.sh, BaseLogApp, UnionVistorApp, UserJumpDetailAPP, and VisitorStatsApp, run the simulated user-behavior log generation script, and check whether the three kinds of records above are printed.

(2) Merge the data streams

Merge the data streams into one stream of objects with the same format.

The core operator for merging streams is union. However, union requires that all input streams have the same type, so the data structures must be aligned before the union.

a Define the theme wide-table entity class VisitorStats

package com.hzy.gmall.realtime.beans;

import lombok.AllArgsConstructor;
import lombok.Data;

/**
 * Desc: Visitor statistics entity class, containing the dimensions and measures
 */
@Data
@AllArgsConstructor
public class VisitorStats {
    // statistics window start time
    private String stt;
    // statistics window end time
    private String edt;
    // dimension: version
    private String vc;
    // dimension: channel
    private String ch;
    // dimension: region
    private String ar;
    // dimension: new/returning user flag
    private String is_new;
    // measure: unique visitor count
    private Long uv_ct = 0L;
    // measure: page view count
    private Long pv_ct = 0L;
    // measure: entry (session) count
    private Long sv_ct = 0L;
    // measure: bounce count
    private Long uj_ct = 0L;
    // measure: continuous visit duration
    private Long dur_sum = 0L;
    // statistics timestamp
    private Long ts;
}

b Convert the structure of each stream that is read

The jsonStr that is read has the following format (this example is a page log from dwd_page_log):
{
    "common": {
        "ar": "530000",
        "uid": "9",
        "os": "Android 11.0",
        "ch": "vivo",
        "is_new": "1",
        "md": "Xiaomi Mix2 ",
        "mid": "mid_6",
        "vc": "v2.1.134",
        "ba": "Xiaomi"
    },
    "page": {
        "page_id": "home",
        "item":"9",
        "during_time":15839,
        "item_type":"sku_id",
        "last_page_id":"home",
        "source_type":"query"
    },
    "displays": [
        {
            "display_type": "activity",
            "item": "1",
            "item_type": "activity_id",
            "pos_id": 5,
            "order": 1
        }
    ],
    "ts": 1670913783000
}
The code is as follows:
// TODO 3 Convert the data in each stream: jsonStr -> VisitorStats
// 3.1 Conversion of the dwd_page_log stream
SingleOutputStreamOperator<VisitorStats> pvStatsDS = pvStrDS.map(
        new MapFunction<String, VisitorStats>() {
            @Override
            public VisitorStats map(String jsonStr) throws Exception {
                JSONObject jsonObj = JSON.parseObject(jsonStr);
                JSONObject commonJsonObj = jsonObj.getJSONObject("common");
                JSONObject pageJsonObj = jsonObj.getJSONObject("page");
                VisitorStats visitorStats = new VisitorStats(
                        "",
                        "",
                        commonJsonObj.getString("vc"),
                        commonJsonObj.getString("ch"),
                        commonJsonObj.getString("ar"),
                        commonJsonObj.getString("is_new"),
                        0L,
                        1L,
                        0L,
                        0L,
                        pageJsonObj.getLong("during_time"),
                        jsonObj.getLong("ts")
                );
                // If this is the start of a new session (no last_page_id), add 1 to the entry count
                String lastPageId = pageJsonObj.getString("last_page_id");
                if (lastPageId == null || lastPageId.length() == 0) {
                    visitorStats.setSv_ct(1L);
                }
                return visitorStats;
            }
        }
);
// 3.2 Conversion of the dwm_unique_visitor stream
SingleOutputStreamOperator<VisitorStats> uvStatsDS = uvStrDS.map(
        new MapFunction<String, VisitorStats>() {
            @Override
            public VisitorStats map(String jsonStr) throws Exception {
                JSONObject jsonObj = JSON.parseObject(jsonStr);
                JSONObject commonJsonObj = jsonObj.getJSONObject("common");
                VisitorStats visitorStats = new VisitorStats(
                        "",
                        "",
                        commonJsonObj.getString("vc"),
                        commonJsonObj.getString("ch"),
                        commonJsonObj.getString("ar"),
                        commonJsonObj.getString("is_new"),
                        1L,
                        0L,
                        0L,
                        0L,
                        0L,
                        jsonObj.getLong("ts")
                );
                return visitorStats;
            }
        }
);
// 3.3 Conversion of the dwm_user_jump_detail stream
SingleOutputStreamOperator<VisitorStats> ujdStatsDS = ujdStrDS.map(
        new MapFunction<String, VisitorStats>() {
            @Override
            public VisitorStats map(String jsonStr) throws Exception {
                JSONObject jsonObj = JSON.parseObject(jsonStr);
                JSONObject commonJsonObj = jsonObj.getJSONObject("common");
                VisitorStats visitorStats = new VisitorStats(
                        "",
                        "",
                        commonJsonObj.getString("vc"),
                        commonJsonObj.getString("ch"),
                        commonJsonObj.getString("ar"),
                        commonJsonObj.getString("is_new"),
                        0L,
                        0L,
                        0L,
                        1L,
                        0L,
                        jsonObj.getLong("ts")
                );
                return visitorStats;
            }
        }
);
// TODO 4 Union the three converted streams
DataStream<VisitorStats> unionDS = pvStatsDS.union(uvStatsDS, ujdStatsDS);

unionDS.print(">>>");
The output looks like:
VisitorStats(stt=, edt=, vc=v2.1.134, ch=xiaomi, ar=110000, is_new=0, uv_ct=0, pv_ct=1, sv_ct=0, uj_ct=0, dur_sum=8283, ts=1670918057000)

c Aggregate by dimension

Approach

Because windowed aggregation is involved, the event time and watermark need to be specified.

Should multiple detail records with the same dimensions be counted together?

  • Keying by mid alone would not compress the data noticeably, because a single device (mid) produces very little activity per unit of time (compression only pays off when the data volume is large enough or the time unit is long enough).
  • Therefore the four commonly used statistical dimensions are used for aggregation: channel, new/returning user flag, app version, and region (province/city).
  • Measures include: launches, daily active users (first launch of the day), pages visited, new users, bounces, average page dwell time, and total visit duration.
  • Aggregation window: 10 seconds.
Code
// TODO 5 Assign watermarks and extract the event-time field
SingleOutputStreamOperator<VisitorStats> visitorStatsWithWatermarkDS = unionDS.assignTimestampsAndWatermarks(
        WatermarkStrategy.<VisitorStats>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                .withTimestampAssigner(
                        new SerializableTimestampAssigner<VisitorStats>() {
                            @Override
                            public long extractTimestamp(VisitorStats visitorStats, long recordTimestamp) {
                                return visitorStats.getTs();
                            }
                        }
                )
);

// TODO 6 Key the stream by dimensions
// Dimensions: version, channel, region, new/returning user flag; the grouping key is a Tuple4
KeyedStream<VisitorStats, Tuple4<String, String, String, String>> keyedDS = visitorStatsWithWatermarkDS.keyBy(
        new KeySelector<VisitorStats, Tuple4<String, String, String, String>>() {
            @Override
            public Tuple4<String, String, String, String> getKey(VisitorStats visitorStats) throws Exception {
                return Tuple4.of(
                        visitorStats.getVc(),
                        visitorStats.getCh(),
                        visitorStats.getAr(),
                        visitorStats.getIs_new()
                );
            }
        }
);

// TODO 7 Apply windows to the keyed data
// Each key gets its own windows; keys do not affect each other
WindowedStream<VisitorStats, Tuple4<String, String, String, String>, TimeWindow> windowDS = keyedDS.window(TumblingEventTimeWindows.of(Time.seconds(10)));

// TODO 8 Aggregate
// Incrementally reduce the elements of each window pairwise
SingleOutputStreamOperator<VisitorStats> reduceDS = windowDS.reduce(
        new ReduceFunction<VisitorStats>() {
            @Override
            public VisitorStats reduce(VisitorStats stats1, VisitorStats stats2) throws Exception {
                // add up the measures pairwise
                stats1.setPv_ct(stats1.getPv_ct() + stats2.getPv_ct());
                stats1.setUv_ct(stats1.getUv_ct() + stats2.getUv_ct());
                stats1.setSv_ct(stats1.getSv_ct() + stats2.getSv_ct());
                stats1.setDur_sum(stats1.getDur_sum() + stats2.getDur_sum());
                stats1.setUj_ct(stats1.getUj_ct() + stats2.getUj_ct());
                return stats1;
            }
        },
        new ProcessWindowFunction<VisitorStats, VisitorStats, Tuple4<String, String, String, String>, TimeWindow>() {
            @Override
            public void process(Tuple4<String, String, String, String> key, Context context, Iterable<VisitorStats> elements, Collector<VisitorStats> out) throws Exception {
                // fill in the window start/end time fields
                for (VisitorStats visitorStats : elements) {
                    visitorStats.setStt(DateTimeUtil.toYMDHMS(new Date(context.window().getStart())));
                    visitorStats.setEdt(DateTimeUtil.toYMDHMS(new Date(context.window().getEnd())));
                    // use the current system time as the statistics timestamp
                    visitorStats.setTs(System.currentTimeMillis());
                    // emit the processed record downstream
                    out.collect(visitorStats);
                }
            }
        }
);

reduceDS.print(">>>");
The output looks like:
>>>:1> VisitorStats(stt=2022-12-13 16:48:40, edt=2022-12-13 16:48:50, vc=v2.1.134, ch=Appstore, ar=500000, is_new=0, uv_ct=0, pv_ct=8, sv_ct=1, uj_ct=0, dur_sum=77934, ts=1670921334767)
>>>:2> VisitorStats(stt=2022-12-13 16:48:40, edt=2022-12-13 16:48:50, vc=v2.1.134, ch=xiaomi, ar=110000, is_new=0, uv_ct=1, pv_ct=7, sv_ct=1, uj_ct=0, dur_sum=78835, ts=1670921334767)
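DateTimeUtil.toYMDHMS is another helper from earlier in the series and is not shown in this section. A minimal sketch, assuming it simply formats a java.util.Date as yyyy-MM-dd HH:mm:ss (the package name is an assumption; DateTimeFormatter is used because, unlike SimpleDateFormat, it is thread-safe):

package com.hzy.gmall.realtime.utils;

import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.Date;

public class DateTimeUtil {
    // DateTimeFormatter is immutable and thread-safe
    private static final DateTimeFormatter FORMATTER =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // format a Date (e.g. a window start/end) as "yyyy-MM-dd HH:mm:ss"
    public static String toYMDHMS(Date date) {
        LocalDateTime ldt = LocalDateTime.ofInstant(date.toInstant(), ZoneId.systemDefault());
        return FORMATTER.format(ldt);
    }
}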

(3) Write to the OLAP database

Reference: ClickHouse official website.

# Start the ClickHouse server
sudo systemctl start clickhouse-server
# Start the client (-m allows multi-line statements)
clickhouse-client -m

Why write to ClickHouse? ClickHouse is a database designed for statistical analysis over large amounts of data. It can store massive data while still responding quickly, and it supports standard SQL, which makes it flexible and easy to use.

For a detailed installation guide and introduction to the ClickHouse database, see the official website referenced above.

a Prepare the ClickHouse table

create table visitor_stats_2022 (
    stt DateTime,
    edt DateTime,
    vc String,
    ch String,
    ar String,
    is_new String,
    uv_ct UInt64,
    pv_ct UInt64,
    sv_ct UInt64,
    uj_ct UInt64,
    dur_sum UInt64,
    ts UInt64
) engine = ReplacingMergeTree(ts)
  partition by toYYYYMMDD(stt)
  order by (stt, edt, is_new, vc, ch, ar);

The ReplacingMergeTree engine is chosen mainly because it helps keep writes to the table idempotent.

  • partition by converts the date into a number (e.g. 20201126) for partitioning, so query conditions should include the stt field whenever possible.
  • Within a partition, rows with the same order by key are considered duplicates; during merges they are deduplicated and the row with the largest ts is kept.
  • The order and names of the columns in the CREATE TABLE statement must match the order and names of the fields of VisitorStats.

b Add the ClickHouse dependencies

<dependency>
    <groupId>ru.yandex.clickhouse</groupId>
    <artifactId>clickhouse-jdbc</artifactId>
    <version>0.3.0</version>
    <exclusions>
        <exclusion>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
        </exclusion>
        <exclusion>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-core</artifactId>
        </exclusion>
    </exclusions>
</dependency>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-jdbc_${scala.version}</artifactId>
    <version>${flink.version}</version>
</dependency>

Here, flink-connector-jdbc is Flink's official, general-purpose JDBC sink connector. As long as the corresponding JDBC driver is added, Flink can use it with any database that supports JDBC (Phoenix, for example, can also be written this way). However, this JDBC sink only supports writing one stream to one table; writing one stream to many tables has to be implemented in a custom way, as was done earlier for the dimension data.

Although this JDBC sink is one-to-one only, it uses prepared statements internally and supports batched submission, which speeds up writing.
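The actual sink code is not included in this section. As an illustration only, here is a minimal sketch of how reduceDS could be wired to the visitor_stats_2022 table with Flink's generic JdbcSink; the host hadoop102, port 8123, and database default in the JDBC URL are assumptions, adjust them to your ClickHouse installation:

// TODO 9 Write the aggregated results to ClickHouse (sketch)
// additionally requires: org.apache.flink.connector.jdbc.JdbcSink, JdbcStatementBuilder,
//                        JdbcExecutionOptions, JdbcConnectionOptions
reduceDS.addSink(
        JdbcSink.sink(
                // one "?" per column, in the same order as the visitor_stats_2022 columns / VisitorStats fields
                "insert into visitor_stats_2022 values(?,?,?,?,?,?,?,?,?,?,?,?)",
                (JdbcStatementBuilder<VisitorStats>) (ps, stats) -> {
                    ps.setString(1, stats.getStt());
                    ps.setString(2, stats.getEdt());
                    ps.setString(3, stats.getVc());
                    ps.setString(4, stats.getCh());
                    ps.setString(5, stats.getAr());
                    ps.setString(6, stats.getIs_new());
                    ps.setLong(7, stats.getUv_ct());
                    ps.setLong(8, stats.getPv_ct());
                    ps.setLong(9, stats.getSv_ct());
                    ps.setLong(10, stats.getUj_ct());
                    ps.setLong(11, stats.getDur_sum());
                    ps.setLong(12, stats.getTs());
                },
                // batched submission: flush every 5 records or every 2 seconds
                JdbcExecutionOptions.builder()
                        .withBatchSize(5)
                        .withBatchIntervalMs(2000L)
                        .build(),
                // assumed connection info; adjust host/port/database to your environment
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withDriverName("ru.yandex.clickhouse.ClickHouseDriver")
                        .withUrl("jdbc:clickhouse://hadoop102:8123/default")
                        .build()
        )
);

After the job has run, the results can be checked in clickhouse-client with a simple select on visitor_stats_2022.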

The completed functionality is as follows:

(Figure: the completed VisitorStatsApp)

Source: blog.csdn.net/weixin_43923463/article/details/128322370