Ctrip's real-time intelligent detection platform in practice



Speaker: Pan Guoqing @ Ctrip

Edited by: Wang Zhao

Content source: Flink Forward

Production platform: DataFun



Guide: Combining real-time computing with deep learning can solve concrete business problems. This talk introduces Ctrip's real-time intelligent detection platform, built on TensorFlow and Flink. It is divided into four parts:

1. Background
2. What is Prophet
3. AI and Real Time
4. Challenges and Future

Background

Every company has a monitoring platform, and most of them monitor and alert according to rules. Metric warnings and rule-based alarms are generally built on statistical methods. For example, a year-on-year or month-on-month rise or fall in a metric is checked against a configured threshold or percentage. This causes several problems:

  • Rule-based alarms are complicated to configure
  • Rule-based alarms perform relatively poorly
  • Rules are relatively expensive to maintain
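To make the limitation concrete, here is a minimal sketch of the kind of rule described above: a percentage-drop check against a reference value such as the same time last week. The function name and the 30% default are hypothetical, chosen only for illustration, and are not taken from Ctrip's systems.

```python
def rule_alarm(current, reference, max_drop_pct=30.0):
    """Fire when `current` is more than `max_drop_pct` percent below `reference`.

    `reference` is the comparison value, for example the same minute last week.
    The 30% default is an illustrative assumption, not a value from the talk.
    """
    if reference <= 0:
        return False  # no meaningful baseline to compare against
    change_pct = (current - reference) / reference * 100.0
    return change_pct < -max_drop_pct


print(rule_alarm(600, 1000))  # a 40% drop exceeds the 30% threshold
print(rule_alarm(800, 1000))  # a 20% drop stays within the threshold
```

Every metric needs its own threshold under this scheme, which is where the configuration and maintenance costs listed above come from.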

Beyond these, Ctrip has problems of its own. There are three company-level monitoring systems, and additional monitoring platforms are built for different business scenarios, so the company has more than a dozen monitoring platforms of various sizes. Building and configuring monitoring on each of them is very cumbersome for users. To address these problems, Ctrip built the real-time platform Prophet.

What is Prophet

Prophet is a one-stop anomaly detection solution. It was mainly inspired by Facebook's Prophet, but what it does differs in many ways.

1. Prophet's one-stop anomaly detection solution includes:

  • Works on time series data
  • Takes monitoring platforms, not end users, as the integration target; all rule-based alarms are retired and replaced with intelligent alarms
  • Uses deep learning algorithms for intelligent anomaly alarms
  • Uses a real-time computing engine for real-time anomaly warnings


2. Prophet's system architecture

The bottom layer is Hadoop. YARN serves as the resource scheduling engine and mainly runs the Flink jobs, and HDFS mainly stores the models trained by TensorFlow. The middle layer is the engine layer. Data has to arrive in a message queue in real time, for which Kafka is used. Flink is the computing engine for real-time anomaly warnings, and TensorFlow is the deep learning training engine. The data is also stored in a time series database. The top layer is the platform layer, which provides services externally: Clog collects job logs, Muise is the real-time computing platform, Qconfig provides the configuration items the jobs need, and Hickwall is a simple monitoring and alarm platform.

3. Why Flink

There are many real-time computing engines. Flink was chosen for the following reasons:

  • Efficient state management: anomaly detection has to store a lot of state, and Flink's built-in state backends store intermediate state well
  • A rich set of windows, such as tumbling windows and sliding windows. Ctrip uses sliding windows, which are explained further below
  • Support for several time semantics; Event Time is generally used
  • Different levels of fault-tolerance semantics
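The difference between tumbling and sliding windows can be illustrated without any Flink dependency. The sketch below is plain Python, not Flink API code: it computes which windows of a given size and slide contain an event timestamp, assuming non-negative integer timestamps in arbitrary units. A tumbling window is the special case where the slide equals the size.

```python
def sliding_window_starts(timestamp, size, slide):
    """Return the start times of every window [start, start + size) that contains `timestamp`.

    Windows begin at multiples of `slide`. With slide == size this is a tumbling window,
    so each timestamp belongs to exactly one window.
    """
    if size <= 0 or slide <= 0:
        raise ValueError("size and slide must be positive")
    start = timestamp - (timestamp % slide)  # latest window start at or before the event
    starts = []
    while start > timestamp - size:
        starts.append(start)
        start -= slide
    return starts


# Tumbling: a 10-unit window, event at t=7 falls into the single window starting at 0.
print(sliding_window_starts(7, size=10, slide=10))
# Sliding: 10-unit windows that advance by 5, event at t=25 falls into two windows.
print(sliding_window_starts(25, size=10, slide=5))
```

With a slide smaller than the size, each event contributes to several overlapping windows, which is what lets a detector re-evaluate the most recent points every time new data arrives.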


4. Prophet's operating workflow

What do users need to do? Prophet is transparent to users, who never configure anything on Prophet itself. They only need to configure monitoring alarms on the monitoring platform they already use and select intelligent alarms. All subsequent work happens between that monitoring platform and the intelligent alarm platform.

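Users configure metrics on their monitoring platform, and the monitoring platform synchronizes those metrics to Prophet. When Prophet receives a new metric, it trains a model for it with TensorFlow. Real-time data is written to Kafka. For historical data, Prophet uses an interface provided by the user if one exists; otherwise it trains on the data accumulated in the message queue. Once training finishes, the model is uploaded to HDFS and the configuration is updated. The configuration center passes the update to Flink, which has to load the corresponding model. The incoming real-time data is also saved to the time series database, because later anomaly detection needs it.

That is the model training part of the flow. When training completes, the Flink job detects the configuration update and tries to load the new model. It consumes data from Kafka in real time and produces prediction results. Anomaly alarm results are written back to Kafka, and each monitoring platform consumes the messages to pick up its own alarms. The whole process is transparent to users.

AI and Real Time

1. Challenges of intelligent detection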

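  • Few negative samples: anomalies occur infrequently
  • Many types of business metrics: orders, payments, and so on
  • Many metric shapes: periodic fluctuation, stable, non-periodic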

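To address these problems, we tried a number of deep learning algorithms:

RNN and LSTM require one model per metric. The model predicts the trend of the current data set, and the predicted data set is compared with the actual one to detect anomalies. Because every metric needs its own model, this consumes considerable resources; the benefit is relatively high accuracy.

With a DNN, one model can handle all business scenarios. The problem is that feature extraction is more complicated: features have to be extracted from metrics at different frequencies, and users must label large amounts of data to decide which cases count as anomalies, which is fairly complicated.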

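2. Model training process

Ctrip's business releases a new version roughly every two weeks, so each business metric is retrained every two weeks, and the model's training data likewise follows a two-week cycle.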


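Data preprocessing. This handles empty or null values. The data also contains many anomalous intervals, so the anomalous values in those intervals are replaced with previously predicted values. Holiday data is replaced as well: holidays are more complicated and are handled separately, so this model is trained mainly on ordinary-day data.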
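The replacement step can be sketched as follows. This is a simplified illustration, assuming the series is a list in which missing points are `None`, that a prediction is available for every position, and that anomalous (or holiday) intervals are given as index ranges. The function and parameter names are hypothetical.

```python
def clean_series(actual, predicted, replace_ranges):
    """Return a copy of `actual` that is safe to train on.

    Missing points (None) and every index inside `replace_ranges` are replaced
    with the earlier prediction for that position.
    `replace_ranges` is a list of (start, end) index pairs, end exclusive.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    cleaned = list(actual)
    for i, value in enumerate(cleaned):
        if value is None:
            cleaned[i] = predicted[i]
    for start, end in replace_ranges:
        for i in range(max(start, 0), min(end, len(cleaned))):
            cleaned[i] = predicted[i]
    return cleaned


actual = [10, None, 12, 0, 0, 15]      # one gap, then a two-point outage
predicted = [10, 11, 12, 13, 14, 15]
print(clean_series(actual, predicted, [(3, 5)]))
```

Replacing bad intervals rather than dropping them keeps the series evenly spaced, which sequence models generally expect.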

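Feature extraction. Features at different time scales, or frequency features, are extracted, and a classification model is then trained to determine what type of metric this is, for example periodic or non-periodic. Different models are used for different types of metric.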
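The talk does not describe the features or the classifier in detail. As a stand-in, the sketch below uses a much simpler heuristic, autocorrelation at a candidate period, to sort a series into the three shapes mentioned earlier (stable, periodic, non-periodic). The thresholds and function names are assumptions made for illustration, and this heuristic is not the trained classification model the platform uses.

```python
def autocorrelation(series, lag):
    """Autocorrelation of `series` at `lag`, normalized by the total variance."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        return 0.0
    covariance = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return covariance / variance


def classify_shape(series, period, threshold=0.5, stable_band=0.05):
    """Label a series as 'stable', 'periodic', or 'non-periodic' (illustrative thresholds)."""
    mean = sum(series) / len(series)
    if max(series) - min(series) <= stable_band * abs(mean):
        return "stable"  # the series barely moves relative to its level
    if autocorrelation(series, period) >= threshold:
        return "periodic"
    return "non-periodic"


wave = [0, 1, 2, 1] * 6               # repeats every 4 points
print(classify_shape(wave, period=4))
print(classify_shape([5] * 24, period=4))
```

The resulting label could then be used to choose which model family handles the metric.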

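3. Dynamic model loading

After a model is trained and uploaded, the configuration center is notified. When the Flink job receives the notification, it pulls the model from HDFS. To spread the models evenly across the Task Managers, all monitored metrics are distributed evenly across the Task Managers by id.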
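A minimal sketch of id-based distribution, assuming integer metric ids and a fixed number of parallel slots. It uses a plain modulo for clarity; Flink's own key partitioning is more involved, so this illustrates the idea only and is not how the job is actually wired.

```python
def assign_slot(metric_id, parallelism):
    """Map a metric id to one of `parallelism` slots (plain modulo, for illustration)."""
    if parallelism <= 0:
        raise ValueError("parallelism must be positive")
    return metric_id % parallelism


def distribute(metric_ids, parallelism):
    """Group metric ids by the slot that would load and serve their models."""
    slots = {slot: [] for slot in range(parallelism)}
    for metric_id in metric_ids:
        slots[assign_slot(metric_id, parallelism)].append(metric_id)
    return slots


for slot, ids in distribute(range(10), parallelism=3).items():
    print(slot, ids)
```

Because each metric always maps to the same slot, only that slot needs to hold the metric's model in memory.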

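4. Real-time data consumption and prediction

To detect anomalies in real time, the job consumes the current real-time data from the Kafka message queue, using Flink Event Time with sliding windows. Monitoring supports many time granularities. Taking one-minute granularity over ten minutes as an example, the Flink job opens a window of 10 time points. Once ten minutes of data have accumulated, prediction can start: the previous five points are used to predict the next point, and the input slides forward one point at a time, yielding five pairs of actual and predicted values.

In practice it is not that simple. A lot of data goes missing, and some of it can never be consumed again, for example data lost for good because of network jitter. Missing data has to be imputed, using the mean or standard deviation to replace the missing points. If the data in an interval is anomalous, it is replaced with the predicted values from the previous batch, which then serve as model input to produce a new prediction.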
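The rolling comparison over a 10-point window can be sketched in plain Python. The predictor below is a simple mean of the lookback points and is only a placeholder for the trained TensorFlow model; the function names are hypothetical, and missing points (`None`) are imputed with the mean of the points that did arrive.

```python
def impute_missing(window):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in window if v is not None]
    if not present:
        raise ValueError("window contains no data")
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in window]


def mean_predictor(history):
    """Placeholder for the trained model: predict the next point as the mean of `history`."""
    return sum(history) / len(history)


def rolling_compare(window, predict=mean_predictor, lookback=5):
    """Return (actual, predicted) pairs, predicting each point from the `lookback` points before it."""
    values = impute_missing(window)
    pairs = []
    for i in range(lookback, len(values)):
        pairs.append((values[i], predict(values[i - lookback:i])))
    return pairs


window = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # ten one-minute points
for actual, predicted in rolling_compare(window):
    print(actual, predicted)
```

A 10-point window with a lookback of five yields exactly five pairs, matching the description above.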

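5. Real-time anomaly detection

① Judgment based on anomaly type and sensitivity. Different metrics have different anomaly types: for some a drop is anomalous, for others a rise. There is also a sensitivity setting of high, medium, or low. At high sensitivity a single jitter is treated as an anomaly; at medium sensitivity only consecutive jitters are.

② Judgment based on the deviation between the predicted set and the actual set. To mark an interval as anomalous, it is compared with the same time in the previous cycle; if the deviation is large, the interval is considered anomalous.

③ Judgment based on the mean and standard deviation of historical data from the same period. A potential anomaly is also compared with data from historical cycles to confirm whether it really is an anomaly.

The techniques described above can all be applied to the following scenario: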

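Common problems: Users have too many metrics to monitor, across many dimensions. A single metric may have different statistics such as max and min, so the number of monitored metrics grows large. In addition, users have limited capacity and can hardly review monitoring alarms every day.

Causes of anomalies: Anomalies are usually caused by technical problems, for example a bug in a newly released version that makes business drop. In a few cases the cause is external, such as a call to an external link or service that goes down and takes the calling service with it.

Solution: Users label the detection results that Prophet provides, marking whether each result is correct. These labels are used in Prophet's later model training to improve the data set.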
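The layered judgment from step 5 (sensitivity, deviation from the prediction, then confirmation against same-period history) can be sketched as below. The tolerance, the number of consecutive deviations per sensitivity level, and the 3-sigma rule are all illustrative assumptions. In particular, the talk only says that high sensitivity reacts to a single jitter and medium to consecutive ones; the value for low is a guess.

```python
from statistics import mean, stdev

# Consecutive deviating points required per sensitivity level (the "low" value is assumed).
REQUIRED_RUN = {"high": 1, "medium": 2, "low": 3}


def deviates(actual, predicted, direction, tolerance_pct):
    """True when `actual` departs from `predicted` in the monitored direction."""
    if predicted == 0:
        return False
    change = (actual - predicted) / abs(predicted) * 100.0
    if direction == "drop":
        return change < -tolerance_pct
    if direction == "rise":
        return change > tolerance_pct
    raise ValueError("direction must be 'drop' or 'rise'")


def sensitivity_check(pairs, direction, sensitivity, tolerance_pct=20.0):
    """True when enough consecutive (actual, predicted) pairs deviate for this sensitivity."""
    needed = REQUIRED_RUN[sensitivity]
    run = 0
    for actual, predicted in pairs:
        run = run + 1 if deviates(actual, predicted, direction, tolerance_pct) else 0
        if run >= needed:
            return True
    return False


def history_check(actual, same_period_history, k=3.0):
    """True when `actual` lies more than k standard deviations from the historical mean."""
    mu = mean(same_period_history)
    sigma = stdev(same_period_history)  # needs at least two historical points
    return abs(actual - mu) > k * sigma


def is_anomaly(pairs, direction, sensitivity, same_period_history):
    """Flag an anomaly only if the prediction check and the history check both agree."""
    if not sensitivity_check(pairs, direction, sensitivity):
        return False
    return history_check(pairs[-1][0], same_period_history)


pairs = [(100, 100), (98, 100), (60, 100), (55, 100)]
print(is_anomaly(pairs, "drop", "medium", [100, 102, 98, 101]))
```

Requiring the history check to agree is one way to keep a noisy metric, whose past values at the same time already vary widely, from raising alarms on every dip.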


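6. Holiday scenarios

Holidays raise the following problems:

① Different businesses rise or fall differently. Ctrip's flight and train tickets, for example, generally rise to a certain level before a holiday and gradually fall during it, while hotels rise considerably during the holiday. So the trend differs from business to business.

② Large rises easily cause missed alarms. Where a metric rises sharply, anomalies can be missed. For example, if last week's peak was 1,000 orders and this holiday week's peak is 2,000, a 50% drop only brings the metric level with last week, so the model may not detect it.

③ Large falls easily cause false alarms. If last week saw 1,000 orders and this week falls to 500, that is a normal value, but a continued fall will produce false alarms.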

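④ Small businesses run many promotions, so their metrics fluctuate sharply.

Ctrip has prepared for these holiday problems in several ways. A table of each year's holidays is maintained. When the program determines that the next holiday is one week away, it automatically extracts a metric's data for the different holidays of the past two years, computes how similar each is to the values of the current period, and fits the current data to the past data. A new model is then trained on the current and historical data.

Prophet now covers essentially all of Ctrip's business lines and most important business metrics, and all company-level system monitoring platforms are connected. It covers 95% of anomalies, with an alarm accuracy of 75%. Every incoming data point triggers real-time consumption and prediction, alarm latency is at the millisecond level, and the number of alarms is about ten times lower than before.

The effectiveness comparison is based on data from April to May 2019. Prophet's hit rate reaches 90%, while rule-based alarms reach only 74%.

In the alarm-count comparison, Prophet produces roughly 5 to 10 times fewer alarms than rules do.

Challenges and Future

1. Challenges encountered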

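  • High resource consumption: one model per metric, so the number of models equals the number of metrics
  • Strong holiday effects: business metrics trend differently on holidays, which hurts alarm accuracy
  • Not applicable to every scenario: sharply fluctuating non-periodic metrics cannot be handled, for example during big promotions and campaigns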

We have been making improvements to address these challenges.

2. Future prospects

① General model. The DNN model was not analyzed in depth above; all of the processes and processing logic described so far are for LSTM. A DNN can serve as a single model for all monitored metrics. Its accuracy is lower than LSTM's, but it covers more scenarios. Important business metrics such as orders and payments use LSTM, and the rest can use the DNN model.

② Holiday algorithm in production. Holidays are aligned, and data from previous holidays is weighted and used as training data. The holiday algorithm has now been running for more than half a year.

③ Cover all monitoring platforms, connecting more platforms and metrics. Currently 70%-80% of the monitoring platforms are covered.

④ Flink jobs have performance metrics of their own. In the future we plan to apply intelligent alarms to the platform itself for self-monitoring and self-alerting, to achieve better results.

About the speaker:

Pan Guoqing, Big Data R&D Manager at Ctrip. He joined Ctrip's big data platform team in 2016 and led the Muise platform's architecture upgrade and technical evolution from Storm to Spark Streaming to Flink. He is currently responsible for the architecture design and development of Ctrip's real-time intelligent anomaly detection. He has 5 years of R&D experience in big data and works on the research and promotion of real-time computing.






Origin blog.51cto.com/15060460/2675347