Flink from Entry to Abandonment (12): Pitfalls in Event-Driven Scenarios from Real Enterprise Practice (1)

Requirement Background

One day, Xiao Ming clocked in at the company at 10 a.m. He poured a cup of hot water to wet his throat, put on his beloved noise-canceling headphones, opened the music player, browsed the news, and settled in to wait quietly for lunch at 11:30. Suddenly his message box lit up. Experience told him it was either a new requirement or a data problem. Sure enough, an ops colleague from Department A had messaged him: they wanted to analyze each channel's real-time traffic for the current day, presented as a dashboard. The requirement looked straightforward, so Xiao Ming agreed on the spot and promised to deliver it before the end of the day.

Design

Building on the company's existing Flink 1.12.0 SQL stack, Xiao Ming's design reads events from Kafka, joins them with channel dimension data, writes the final results to MySQL, and displays them through Superset. The overall data flow is very simple.

(figure: end-to-end data flow diagram)

Engineering Practice

The dimension join here uses CDC mode, because the ops team wants to see and analyze traffic for every channel, and a Temporal Join could silently drop channels: for example, if a new channel C is onboarded but has produced no traffic yet, the row pv=0, uv=0 for channel C would never be emitted.
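For contrast, a Temporal Join version of the query might look like the sketch below. This is a hypothetical illustration, assuming the Kafka source additionally declared a processing-time attribute via `proc_time as proctime()`; it is not the approach used in this post. Because the fact stream drives the join, a channel that emits no events simply never appears in the output:

```sql
-- Hypothetical Temporal Join variant (NOT the approach used here):
-- the join fires only on fact-stream events, so a silent channel C
-- produces no output row at all, not even pv=0, uv=0.
select
  t2.channel_code,
  count(*) as pv,
  count(distinct t2.user_id) as uv
from real_dwd_flow_info_from_kafka as t2
join real_dim_channel_code_from_mysql
  for system_time as of t2.proc_time as t1
  on t2.channel_code = t1.channel_code
group by t2.channel_code;
```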

-- Kafka Source
-- Note: upsert-kafka requires the primary-key column to be declared,
-- so unique_id is added to the schema (and the missing comma after
-- device_id is fixed).
create table real_dwd_flow_info_from_kafka(
    unique_id string,
    visit_time timestamp,
    channel_code string,
    user_id string,
    url string,
    device_id string,
    primary key(unique_id) not enforced
) with (
  'connector' = 'upsert-kafka',
  'topic' = 'real_dwd_flow_info',
  'properties.bootstrap.servers' = 'bootstrap:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);
 
-- Mysql Sink
create table real_dw_flow_index_info(
  channel string,
  pv int,
  uv int,
  primary key(channel) not enforced
) with (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://localhost:3306/test?serverTimezone=Asia/Shanghai',
  'table-name' = 'flow_index_info',
  'username' = 'user_name',
  'password' = 'password'
);
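On the MySQL side, the physical table flow_index_info must exist before the job runs. A minimal sketch of its DDL, with column types inferred from the Flink sink definition above (the VARCHAR length is an assumption):

```sql
-- Hypothetical MySQL DDL matching the Flink sink schema
CREATE TABLE flow_index_info (
  channel VARCHAR(64) NOT NULL,
  pv INT,
  uv INT,
  PRIMARY KEY (channel)
);
```

The primary key on channel lets the JDBC sink work in upsert mode, so each new result overwrites the previous row for that channel instead of appending.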

-- Dim Data
create table real_dim_channel_code_from_mysql(
  channel_code string,
  channel_name string,
  primary key(channel_code) not enforced
) with (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'user_name',
  'password' = 'password',
  'database-name' = 'test',
  'table-name' = 'dim_channel',
  'debezium.event.processing.failure.handling.mode' = 'warn',
  'debezium.snapshot.locking.mode' = 'none'
);

-- Compute the metrics; a Regular Join is used here
insert into real_dw_flow_index_info
select
  t1.channel_code,
  sum(case when substr(cast(visit_time as string),1,10) = substr(cast(LOCALTIMESTAMP as string),1,10) then 1 else 0 end) as pv,
  count(distinct case when substr(cast(visit_time as string),1,10) = substr(cast(LOCALTIMESTAMP as string),1,10) then user_id end) as uv
from real_dim_channel_code_from_mysql t1
left join real_dwd_flow_info_from_kafka t2
on t1.channel_code = t2.channel_code
group by t1.channel_code;

With the logic above, the day's real-time traffic metrics are computed continuously.

(figure: dashboard showing per-channel metrics)

As the figure shows, the moment the ops team onboards a new channel G through configuration, it appears on the dashboard.

Stepping into the Pit and Filling It

Stepping in:
   After handing the finished requirement over to the ops team, Xiao Ming left work happily. Unfortunately, when he arrived the next morning, the ops colleagues reported that the data was wrong: channel F's numbers had not changed and were still stuck at yesterday's values. For a data engineer, nothing is scarier than being told the data is wrong. Investigation showed that channel F had gone down for various reasons, so no traffic had been coming in.
   Given Flink's event-driven nature, this is easy to understand: because channel F emitted no events, Flink never triggered a new computation for it, so the result stayed frozen in the state produced by the last event. In particular, once the date rolled over, nothing ever recomputed channel F's pv and uv down to 0 for the new day.
Filling it:
    With the cause identified, the fix is to drive events manually or on a schedule so that a computation is triggered. The overall data flow is therefore adjusted as shown below:

(figure: adjusted data flow with an offline dimension-refresh job)

That is, an offline job is folded into the original pipeline: every day in the early morning it re-extracts the dimension data and writes it back into the dimension table, so the update propagates through the CDC source and triggers a fresh computation for every channel.
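One simple way to implement the nightly refresh, assuming the dimension table carries an update_time column (a hypothetical column, not shown in the DDL above): touching every row writes UPDATE events into the MySQL binlog, which the mysql-cdc source picks up and replays into the join.

```sql
-- Hypothetical nightly job, scheduled shortly after midnight:
-- rewriting update_time changes every row, so MySQL logs one
-- binlog UPDATE event per channel; the CDC source re-emits each
-- dimension record, forcing the aggregation to recompute and the
-- new day's pv/uv to come out as 0 for channels with no traffic.
UPDATE dim_channel SET update_time = NOW();
```

Note that the update must actually change a value: in row-based binlog mode, a no-op update would not generate the change events the CDC source depends on.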

Origin blog.csdn.net/qq_28680977/article/details/122149515