Flink from Getting Started to Giving Up (12): Enterprise Practice, Pitfalls in Event-Driven Scenarios (Part 1)

Requirement Background

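One day, Xiao Ming clocked in at 10 a.m., had a cup of hot water to soothe his throat, opened his music player, put on his beloved noise-cancelling headphones, and browsed the news while quietly waiting for lunch at 11:30. Suddenly a message popped up. Xiao Ming figured it was either a new requirement or a data problem. It was a colleague from Operations Department A, who wanted to analyze the real-time traffic of each channel for the current day, delivered as a dashboard. Xiao Ming thought it could not be simpler, agreed on the spot, and promised to finish before the end of the day.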

Solution Design

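Xiao Ming used the company's existing Flink 1.12.0 SQL setup to read data from Kafka and compute the statistics, joined it with the channel dimension data, wrote the final results to MySQL, and displayed them with Superset. The overall data flow is very simple.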

Engineering Practice

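The dimension join here uses CDC mode, because the operations team wants to see the traffic of every channel. With a Temporal Join, some channels could be missing from the result: for example, if a new channel C is onboarded and has not yet generated any traffic, no row with pv=0 and uv=0 would be produced for it.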
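For contrast, the Temporal Join variant that was ruled out might look like the sketch below. The table names `flow_events` and `dim_channel_lookup`, the `proc_time` column, and the consumer group id are assumptions made for illustration and are not part of the original job. Because the fact stream drives the lookup, a channel that has no events never produces an output row, which is the data loss described above.

```sql
-- Hypothetical append-only fact table with a processing-time attribute
create table flow_events(
    visit_time timestamp,
    channel_code string,
    user_id string,
    proc_time as proctime()
) with (
  'connector' = 'kafka',
  'topic' = 'real_dwd_flow_info',
  'properties.bootstrap.servers' = 'bootstrap:9092',
  'properties.group.id' = 'flow_lookup_demo',  -- assumed group id
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json'
);

-- Hypothetical JDBC-backed lookup table over the same MySQL dimension table
create table dim_channel_lookup(
    channel_code string,
    channel_name string
) with (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://localhost:3306/test',
  'table-name' = 'dim_channel',
  'username' = 'user_name',
  'password' = 'password'
);

-- Each incoming event looks up its channel; channels without events yield nothing
select
  d.channel_code,
  count(*) as pv,
  count(distinct e.user_id) as uv
from flow_events as e
join dim_channel_lookup for system_time as of e.proc_time as d
  on e.channel_code = d.channel_code
group by d.channel_code;
```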

-- Kafka Source
create table real_dwd_flow_info_from_kafka(
    visit_time timestamp,
    channel_code string,
    user_id string,
    url string,
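    device_id string,
    unique_id string,  -- assumed key column: the primary key below referenced it, but it was never declared
    primary key(unique_id) not enforced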
)with( 
  'connector' = 'upsert-kafka',
  'topic' = 'real_dwd_flow_info',
  'properties.bootstrap.servers' = 'bootstrap:9092',
  'key.format' = 'json',
  'value.format' = 'json'
 );
 
-- MySQL Sink
create table real_dw_flow_index_info(
   channel string,
   pv int,
   uv bigint,  -- COUNT returns BIGINT in Flink SQL
   primary key(channel) not enforced
)with(
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://localhost:3306/test?serverTimezone=Asia/Shanghai',
  'table-name' = 'flow_index_info',
  'username' = 'user_name',
  'password' = 'password'
);

-- Dimension data (MySQL CDC)
create table real_dim_channel_code_from_mysql(
  channel_code string,
  channel_name string,
  primary key(channel_code) not enforced
) WITH(
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'user_name',
  'password' = 'password',
  'database-name' = 'test',
  'table-name' = 'dim_channel',
  'debezium.event.processing.failure.handling.mode' = 'warn',
  'debezium.snapshot.locking.mode' = 'none'
);

-- Compute the metrics; a Regular Join is used here
insert into real_dw_flow_index_info
select 
  t1.channel_code,
  -- count only rows whose visit date (yyyy-MM-dd prefix) equals the current date
  sum(case when substr(cast(visit_time as string),1,10) = substr(cast(LOCALTIMESTAMP as string),1,10) then 1 else 0 end) as pv,
  count(distinct case when substr(cast(visit_time as string),1,10) = substr(cast(LOCALTIMESTAMP as string),1,10) then user_id end) as uv
from real_dim_channel_code_from_mysql t1
left join real_dwd_flow_info_from_kafka t2
on t1.channel_code = t2.channel_code
group by t1.channel_code;
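With the logic above, the current day's traffic is computed in real time. As shown in the figure above, when the operations team configures a new channel G, it shows up on the dashboard immediately.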
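As a small illustration of why the new channel appears right away, onboarding channel G amounts to an insert on the MySQL side, roughly like the sketch below. The channel name value is made up for the example. The CDC source should pick the row up as an insert event, the left join finds no matching traffic, and the aggregates evaluate to pv = 0 and uv = 0 for G.

```sql
-- Run against MySQL (database test), not in Flink.
-- 'Channel G' is a placeholder name for illustration only.
insert into dim_channel (channel_code, channel_name)
values ('G', 'Channel G');
```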

Pitfalls and Fixes

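The pitfall:
After delivering the feature to the operations team before the end of the day, Xiao Ming happily went home. Unfortunately, first thing the next morning the operations colleague reported that the data was wrong: channel F's numbers had not changed and were still stuck at yesterday's values. For anyone working with data, nothing is scarier than being told the data is wrong. Investigation showed that channel F had suffered an outage for various reasons, so no traffic had been coming in.
This is easy to understand given Flink's event-driven nature: since no events arrived for channel F, Flink never recomputed or reset the result for that channel, so the value stayed at the state left by the last event.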
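The fix:
Once the cause is known, the remedy is to generate events, manually or on a schedule, that trigger the computation. The data flow is therefore adjusted as shown in the figure below: an offline step is added to the original flow, which extracts the dimension table data in the early morning and writes it back to the dimension table, so that the CDC source triggers a recomputation.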
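One possible way to implement the nightly write-back is sketched below. It is not necessarily the author's exact job. It assumes a hypothetical `refresh_time` column on `dim_channel`, because, as far as I know, MySQL does not write a row change to the binlog when an update leaves the row unchanged, so rewriting identical values may not produce any CDC event. Touching a dedicated column guarantees a real change on every row.

```sql
-- Run against MySQL by an offline scheduler shortly after midnight.
-- refresh_time is a hypothetical column, added once beforehand with:
--   alter table dim_channel add column refresh_time datetime;
-- Every row changes, so every channel gets an update event in the binlog.
update dim_channel
set refresh_time = now();
```

The Flink CDC table does not need to declare `refresh_time`; the update event for each row is expected to flow through the join and drive a fresh evaluation for every channel, including those with no traffic.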