Implementing Real-Time Features for Risk Control with Flink

Background

Introduction to Risk Control

In the 21st century, the information age has made the Internet industry grow far faster than other industries. Once a business model proves viable and profitable, capital pours in immediately, pushing more companies to enter the market and replicate and iterate rapidly in the hope of becoming the next "industry leader".

Players who enter the market backed by capital feel little financial pressure, so they focus on business growth and tend to ignore business risk. Even a company as strong as Pinduoduo was swarmed by armies of "wool-pulling" bonus hunters and lost tens of millions of yuan.

Risk control, i.e. risk management, is a management process covering the definition, measurement, and evaluation of risks and the strategies for handling them. Its purpose is to minimize avoidable risks, costs, and losses [1].

Feature Platform Introduction

Internet companies face black and gray market attacks around the clock. The business security team assesses the risky points in a business flow in advance, then sets up checkpoints there to collect the relevant business information and decide whether the current request is risky. Expert experience (prevention and control strategies) is accumulated through this long-term confrontation.

Deploying strategies requires the support of features, so what is a feature?
Features fall into basic features, derived features, statistical features, and so on. Examples:

  • Basic features: obtained directly from the business, such as the order amount, the buyer's mobile phone number, the buyer's address, the seller's address, etc.
  • Derived features: require secondary computation, such as the distance from buyer to seller, or the first 3 digits of the mobile phone number.
  • Statistical features: require real-time statistics, such as the number of orders placed by a given mobile phone number within 5 minutes, or the number of orders over 20,000 yuan within 10 minutes.

As the business grew rapidly, pure expert experience could no longer satisfy risk identification needs, and the addition of the algorithm team made interception more accurate. By unifying the algorithm engineering framework, the algorithm department solved the systematic problem of model and feature iteration, greatly improving iteration efficiency.

By function, the algorithm platform divides into three parts: model service, model training, and the feature platform. Model service provides online model inference, model training produces trained models, and the feature platform provides data support for features and samples. This article focuses on the challenges the feature platform encountered in real-time computation during its construction, and the optimization ideas behind it.

Challenges and Solutions

Challenges

In the early stage of the business, we could meet the feature requirements raised by strategy personnel through hard coding, and collaboration worked well. But as the business accelerated, business lines multiplied, marketing grew more complex, and users and requests increased exponentially, the early hard-coding approach exposed many problems: strategies were scattered and unmanageable, logic was strongly coupled to the business, strategy iteration speed was capped by development, and hand-off costs were high. We urgently needed a feature management platform that could be configured online, hot-updated, and used for rapid trial and error.

The shortcomings of the old framework

Real-time framework 1.0: built on the Flink DataStream API

If you are familiar with the Flink DataStream API, you will notice that Flink's design naturally fits the real-time feature computation scenario in risk control. Only a few simple steps are needed to compute a statistical indicator, as shown in the figure below.

[Figure: Flink DataStream flow diagram]

The sample code for real-time feature statistics is as follows:

// Data stream, e.g. a Kafka topic
DataStream<ObjectNode> dataStream = ...

SingleOutputStreamOperator<AllDecisionAnalyze> windowOperator = dataStream
        // filter
        .filter(this::filterStrategy)
        // data conversion
        .flatMap(this::convertData)
        // configure the watermark
        .assignTimestampsAndWatermarks(timestampAndWatermarkAssigner(config))
        // key by grouping strategy
        .keyBy(this::keyByStrategy)
        // 5-minute tumbling window
        .window(TumblingEventTimeWindows.of(Time.seconds(300)))
        // custom aggregate function; internal logic is user-defined
        .aggregate(AllDecisionAnalyzeCountAgg.create(), AllDecisionAnalyzeWindowFunction.create());
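
The heavy lifting happens in the final aggregate step. The article does not show AllDecisionAnalyzeCountAgg itself, so the following is only a hedged sketch of a counting AggregateFunction of the same shape; the CountAgg name and the simplified ObjectNode input type are illustrative, not the production code:

import com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.flink.api.common.functions.AggregateFunction;

// Minimal counting aggregate: one Long accumulator per window
public class CountAgg implements AggregateFunction<ObjectNode, Long, Long> {

    public static CountAgg create() {
        return new CountAgg();
    }

    @Override
    public Long createAccumulator() {
        return 0L; // every window starts at zero
    }

    @Override
    public Long add(ObjectNode value, Long accumulator) {
        return accumulator + 1; // one increment per event
    }

    @Override
    public Long getResult(Long accumulator) {
        return accumulator; // the window's final count
    }

    @Override
    public Long merge(Long a, Long b) {
        return a + b; // merge partial counts, e.g. for session windows
    }
}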

Shortcomings of the 1.0 framework:

  • Features depend heavily on developer coding: simple statistical features can be abstracted, but anything slightly more complex must be custom-built
  • Iteration is slow: a strategy request goes through product scheduling, R&D development, and test assurance; the full cycle takes at least two weeks
  • Features are strongly coupled: tasks are hard to split, one job carries too much logic, and new feature logic can affect previously stable indicators

Overall, 1.0 fit the early stage of the business well, but as the business developed, development speed gradually became the bottleneck; it was not a sustainable, manageable real-time feature cleaning architecture.

Real-time framework 2.0: built on Flink SQL

The weakness of the 1.0 architecture is that requirements and development live in different language systems. How can requirements be translated efficiently, or better, how can strategy personnel configure feature cleaning logic and ship it themselves? At an iteration pace of one to two weeks, the black and gray market may have changed beyond recognition by the time a feature goes live.

At this point our R&D team turned to Flink SQL. SQL is the most widely used language for data analysis and a basic skill for analysts, strategy, and operations staff; arguably, SQL is one of the cheapest ways to translate requirements into implementations.

Here is an example Flink SQL implementation:

-- error log monitoring
-- Kafka source
CREATE TABLE rcp_server_log (
    thread varchar,
    level varchar,
    loggerName varchar,
    message varchar,
    endOfBatch varchar,
    loggerFqcn varchar,
    instant varchar,
    threadId varchar,
    threadPriority varchar,
    appName varchar,
    triggerTime as LOCALTIMESTAMP,
    proctime as PROCTIME(),

    WATERMARK FOR triggerTime AS triggerTime - INTERVAL '5' SECOND
) WITH (
    'connector.type' = 'kafka',
    'connector.version' = '0.11',
    'connector.topic' = '${sinkTopic}',
    'connector.startup-mode' = 'latest-offset',
    'connector.properties.group.id' = 'streaming-metric',
    'connector.properties.bootstrap.servers' = '${sinkBootstrapServers}',
    'connector.properties.zookeeper.connect' = '${sinkZookeeperConnect}',
    'update-mode' = 'append',
    'format.type' = 'json'
);

-- Creation of sink_feature_indicator omitted here; refer to the source table
-- Decision distribution per business line, by day and by city
INSERT INTO sink_feature_indicator
SELECT
    level,
    loggerName,
    COUNT(*)
FROM rcp_server_log
WHERE
    (level <> 'INFO' AND `appName` <> 'AppTestService')
    OR loggerName <> 'com.test'
GROUP BY
    TUMBLE(triggerTime, INTERVAL '5' SECOND),
    level,
    loggerName;

During the development of the Flink SQL support platform, we encountered the following problems:

  • If each SQL statement cleans one indicator, the same data source is consumed over and over, wasting resources
  • Merging SQL statements that share a source greatly increases job complexity, and the boundary between jobs becomes impossible to define
  • Deploying a new SQL statement requires stopping and restarting the job; if the job also carries many stable indicators, every release becomes a risk point

Technical Implementation

Pain Point Summary

[Figure: Business & R&D pain point map]

Real-Time Computing Architecture

Strategy and algorithm personnel observe real-time and offline data every day to judge whether risk exists online. For risky scenarios they design prevention and control strategies, which reach the R&D side as real-time feature development work. The delivery speed, quality, and ease of use of real-time features therefore determine how quickly online risk scenarios can be plugged.

Before the construction of a unified real-time feature computing platform, the output of real-time features mainly has the following problems:

  • Slow delivery, slow iteration: a requirement travels from strategy to product, then to R&D, then to test, then to online observation for stability; the pace is extremely slow
  • Strong coupling, where one change shakes everything: monster jobs carry features for many businesses all mixed together, with no priority guarantees
  • Duplicate development: without a unified real-time feature management platform, many features already existed under different names, causing great waste

The most important part of building such a platform is "abstracting the whole process". The goal is a platform that is usable, easy to use, and pleasant to use. Following this idea, we distilled the pain points of real-time feature development into "templating + configuration": the platform provides a template for creating real-time features, and on top of that template users generate the features they need through simple configuration.

[Figure: Flink real-time computing architecture diagram]

Computing Layer

  • Data source cleaning: different data sources are abstracted behind Flink Connectors, producing standardized output for downstream use
  • Data splitting: 1-to-N splitting; one real-time message may contain multiple messages, so data fission is needed
  • Dynamic configuration: cleaning logic can be updated or added without stopping the job, with feature-related cleaning logic distributed to tasks
  • Script loading: Groovy support with hot updates
  • RTC: Real-Time Calculate, the real-time feature computation module, a highly abstracted and encapsulated module
  • Task perception: tasks are isolated by feature business domain, priority, and stability, decoupling businesses from each other

Service Layer

  • Unified query SDK: a unified query SDK for real-time features that shields the underlying implementation logic

Based on the unified Flink real-time computing architecture, we redesigned the real-time feature cleaning architecture:

[Figure: Flink real-time computing data flow diagram]

Feature Configuration & Storage/Reading

The underlying storage of a feature should be "atomic", i.e. the smallest indivisible unit. Why design it this way? Real-time statistical features are tied to window size, and different strategy personnel have different window-size requirements for the same feature. Examples:

  • Trusted device determination: the login window for the current mobile phone number should be moderate, not too short, to resist noise
  • Withdrawal fraud determination: the login window for the current mobile phone number should be as short as possible, since a login quickly followed by a withdrawal, combined with other dimensions, locates risk fast

Based on the above, we urgently needed a general real-time feature read module that satisfies strategy personnel's need for arbitrary windows and, at the same time, R&D's need for rapid configuration and cleaning. The refactored feature configuration module is shown below.

[Figure: Feature configuration abstraction module]

A real-time feature definition contains the following fields (an example configuration is sketched after the list):

  • Unique feature identifier
  • Feature name
  • Window support: sliding, tumbling, or fixed-size windows
  • Time slice unit: minute, hour, day, week
  • Main attributes: the grouping columns; there can be several
  • Dependent attributes: inputs consumed by the aggregation function, e.g. the basic features needed for deduplication
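
For illustration only, a minimal sketch of what such a definition could look like in code; the FeatureConfig record and its field names are hypothetical stand-ins, not the platform's actual schema:

import java.util.List;

// Hypothetical shape of a feature definition; fields mirror the list above
public record FeatureConfig(
        String featureId,                 // unique feature identifier
        String featureName,               // human-readable feature name
        String windowType,                // SLIDING / TUMBLING / FIXED
        String sliceUnit,                 // MINUTE / HOUR / DAY / WEEK
        List<String> mainAttributes,      // grouping columns, possibly several
        List<String> dependentAttributes  // inputs consumed by the aggregation
) {
    public static void main(String[] args) {
        // "orders placed by a mobile number within 5 minutes" from the introduction
        FeatureConfig cfg = new FeatureConfig(
                "ORDER_CNT_BY_MOBILE_5MIN",
                "order count per mobile number",
                "TUMBLING",
                "MINUTE",
                List.of("buyerMobile"),
                List.of("orderId"));
        System.out.println(cfg);
    }
}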

Business risk control has a tight latency budget: most scenarios must complete within 100 ms, and real-time feature retrieval gets an even smaller share. Past development experience says feature RT must stay within 10 ms for strategy execution not to time out, so we use Redis as the storage layer to keep performance from becoming a bottleneck. The read path is sketched below.
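
Because features are stored as atomic time slices, a window of any length can be assembled at read time by merging slices. A minimal sketch of this read path, assuming the Jedis client and a hypothetical key layout feature:{featureId}:{mainAttr}:{minuteSlice} (the real key scheme is not shown in this article):

import redis.clients.jedis.Jedis;

public class AtomicFeatureReader {

    private final Jedis jedis = new Jedis("localhost", 6379); // placeholder address

    /**
     * Sum per-minute atomic counters over the last N minutes,
     * yielding an arbitrary-window count from fixed atomic slices.
     */
    public long readCount(String featureId, String mainAttr, int windowMinutes) {
        long nowMinute = System.currentTimeMillis() / 60_000;
        long total = 0;
        for (long m = nowMinute - windowMinutes + 1; m <= nowMinute; m++) {
            String value = jedis.get("feature:" + featureId + ":" + mainAttr + ":" + m);
            if (value != null) {
                total += Long.parseLong(value);
            }
        }
        return total;
    }
}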

Hot Deployment of Cleaning Scripts

As mentioned above, the real-time feature computation module depends heavily on the main and dependent attributes carried in upstream messages, and this is exactly where R&D has to step in. If a main attribute field is missing from a message, R&D must derive and backfill it, which means another code release, and we are back to the original problem: the Flink job has to be restarted again and again, which is clearly unacceptable.
At this point we thought of Groovy: could Flink + Groovy hot-deploy code directly? The answer is yes!

Since we abstracted the computation flow graph of the whole Flink job, the operators themselves never change, i.e. the DAG is fixed; what changes is the cleaning logic for the events handled inside the operators. So as long as the association between events and cleaning scripts, and the cleaning code itself, can be changed, hot deployment completes without restarting the Flink job.

The core logic of Groovy hot deployment is shown in the figure:

[Figure: Cleaning script configuration and loading]

R&D or strategy personnel add cleaning scripts in the management backend (the operating system), which stores them in the database. The script cache module inside the Flink job then senses that a script has been added or modified (how it senses this is covered in the overall process below):

  • warm up: the first run of a script is slow, so scripts are pre-executed at startup or on cache update, guaranteeing that real traffic hits scripts that execute fast
  • cache: caches the already-loaded Groovy scripts
  • Push/Poll: the cache is updated through both push and pull modes to guarantee no update is ever lost (a pull-side sketch follows this list)
  • router: script routing, guaranteeing that every message finds its corresponding script and executes it
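
On the pull side, a plain scheduled poll is enough as a safety net for missed pushes. A minimal sketch, assuming a hypothetical ScriptStore abstraction over the database; GroovyScriptLoader is a stand-in name for the class holding the buildScript method shown in the next snippet:

import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ScriptCachePoller {

    /** Placeholder for the management backend's script table */
    public interface ScriptStore {
        Map<String, String> fetchAll(); // scriptKey -> Groovy source
    }

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    /** Poll periodically; buildScript caches and warms up each script */
    public void start(ScriptStore store) {
        scheduler.scheduleWithFixedDelay(
                () -> store.fetchAll().values().forEach(GroovyScriptLoader::buildScript),
                0, 30, TimeUnit.SECONDS);
    }
}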

Core script-loading code:

import groovy.lang.GroovyClassLoader;
import groovy.lang.GroovyObject;

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.springframework.util.DigestUtils;

// Cache loaded scripts; otherwise unbounded class loading eventually triggers a metaspace OutOfMemoryError
private final static Map<String, GroovyObject> groovyObjectCache = new ConcurrentHashMap<>();

/**
 * Load a script, reusing the cached instance when possible
 * @param script the Groovy source code
 * @return the instantiated script object
 */
public static GroovyObject buildScript(String script) {
    if (StringUtils.isEmpty(script)) {
        throw new RuntimeException("script is empty");
    }

    String cacheKey = DigestUtils.md5DigestAsHex(script.getBytes());
    if (groovyObjectCache.containsKey(cacheKey)) {
        log.debug("groovyObjectCache hit");
        return groovyObjectCache.get(cacheKey);
    }

    GroovyClassLoader classLoader = new GroovyClassLoader();
    try {
        Class<?> groovyClass = classLoader.parseClass(script);
        GroovyObject groovyObject = (GroovyObject) groovyClass.newInstance();
        classLoader.clearCache();

        groovyObjectCache.put(cacheKey, groovyObject);
        log.info("groovy buildScript success: {}", groovyObject);
        return groovyObject;
    } catch (Exception e) {
        throw new RuntimeException("buildScript error", e);
    } finally {
        try {
            classLoader.close();
        } catch (IOException e) {
            log.error("close GroovyClassLoader error", e);
        }
    }
}
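
Once loaded, a script is a plain GroovyObject whose cleaning method can be dispatched dynamically. A usage sketch; the script body and the method name clean are illustrative, not the production scripts:

// featureParamMap is the main/dependent attribute map from the standardized message
String script =
        "class OrderCleaner {\n" +
        "    def clean(Map<String, Object> msg) {\n" +
        "        msg['mobilePrefix'] = msg['buyerMobile']?.toString()?.take(3)\n" +
        "        return msg\n" +
        "    }\n" +
        "}";

GroovyObject cleaner = buildScript(script);
Map<String, Object> cleaned = (Map<String, Object>) cleaner.invokeMethod("clean", featureParamMap);
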
Standard Messages & Cleaning Procedures

The message dimensions the strategies need to count are complex and span multiple businesses, and R&D itself also has real-time feature needs for monitoring, so the data sources behind real-time features are diverse. Fortunately, Flink supports many data sources, and for the special ones we only need to implement a Flink connector ourselves. Taking Kafka as an example, here is how the overall process cleans real-time statistical features.

First, the overall risk control data flow: multiple business scenarios connect to the risk control middle platform, whose core internal links are the decision engine, the rule engine, and the feature service.
Each business request decision is recorded asynchronously, and a Kafka message is emitted for real-time feature computation and offline event tracking.
[Figure: Risk control core data flow diagram]

Standardized message templates

After the Flink real-time computing job receives an MQ message, it first parses it into the standardized message template. Different topics carry different formats, e.g. JSON, CSV, and heterogeneous formats (such as error log messages: space-separated, with fields that themselves contain JSON objects).

To make unified processing by downstream operators easy, the standardized message structure is as follows:

public class RcpPreProcessData {

    /**
     * Channel; the topic name can be used directly
     */
    private String channel;

    /**
     * Message category; channel + eventCode should uniquely identify a message type
     */
    private String eventCode;

    /**
     * All main and dependent attributes
     */
    private Map<String, Object> featureParamMap;

    /**
     * The original raw message
     */
    private ObjectNode node;
}

Message Fission

A "rich message" may contain a large amount of business information, and some real-time features may need to be counted separately. For example, a business request risk control context message includes whether the message is rejected, that is, how many policy rules are hit, and the hit rules are an array, which may contain multiple hit rules. At this time, if you want to correlate other attribute statistics based on a hit rule, you need to use message fission, changing from 1 to N.

Message fission logic is written as Groovy scripts in the operations backend. Cleaning scripts are located by channel (parent) + eventCode (child). The "parent" logic applies to everything under the current channel, avoiding the tedium of configuring N eventCodes one by one, while the "child" logic applies to one specific eventCode.

Message Cleaning & Pruning

Cleaning a message means knowing which main and dependent attributes the features require; cleaning with a clear purpose is more precise. Cleaning scripts are located the same way as above, by channel + eventCode. The cleaned main and dependent attributes are stored in featureParamMap for downstream real-time computation.

Note that up to this point the original message has been passed all the way down, but once the main and dependent attributes are confirmed, the original message is no longer needed. At that point we "prune" it to save the I/O traffic consumed during RPC calls.

At this point an original message has been reduced to channel (channel), eventCode (event type), and featureParamMap (all main and dependent attributes); downstream operators need exactly this information, and nothing more, to compute.

Real-Time Computing

Like the two operators above, the real-time computing operator relies on channel + eventCode to find the corresponding real-time feature metadata; one event may map to several feature configurations. After a feature configuration is filled in on the operations platform, the cache update mechanism distributes it to the tasks quickly, the key constructor generates the corresponding key, and the result is sinked directly to Redis.
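
For concreteness, a sketch of what the write path could look like: the key combines feature id, main-attribute values, and the time slice, and the sink increments the atomic counter in Redis. The FeatureIncrement carrier, the key layout, and the Jedis usage are assumptions mirroring the read-side sketch above, not the platform's actual sink:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import redis.clients.jedis.Jedis;

public class AtomicFeatureSink extends RichSinkFunction<AtomicFeatureSink.FeatureIncrement> {

    /** Hypothetical carrier of one aggregated increment */
    public static class FeatureIncrement {
        public String featureId;
        public String mainAttr;        // concatenated main-attribute values
        public long eventTimeMillis;
        public long delta;             // aggregated count for this slice
    }

    private transient Jedis jedis;

    @Override
    public void open(Configuration parameters) {
        jedis = new Jedis("localhost", 6379); // placeholder address
    }

    @Override
    public void invoke(FeatureIncrement value, Context context) {
        long minuteSlice = value.eventTimeMillis / 60_000;
        // Hypothetical key layout, mirroring the read-side sketch
        String key = "feature:" + value.featureId + ":" + value.mainAttr + ":" + minuteSlice;
        jedis.incrBy(key, value.delta);
        jedis.expire(key, 7 * 24 * 3600); // retain slices as long as the largest window needs
    }

    @Override
    public void close() {
        if (jedis != null) {
            jedis.close();
        }
    }
}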

Task Troubleshooting & Tuning Ideas

Task troubleshooting rests on comprehensive monitoring; Flink provides many useful metrics for locating problems. Below is a list of common task anomalies that I hope will be helpful.

Troubleshooting TaskManager Full GC

Possible causes:

  • Large windows: 90% of TM memory blowups are caused by overly large windows
  • Memory leaks: custom operators that hold caches can easily cause memory to balloon

Solutions:

  • Set window sizes sensibly and allocate TM memory to match (the Flink 1.10 default is 1 GB); keep aggregated data in Flink state managed by the state backend rather than in hand-rolled object caches
  • Attach heap snapshots to investigate anomalies; analysis tools such as MAT take some tuning experience but locate problems quickly

Flink Job Backpressure

Possible causes:

  • Data skew: 90% of backpressure is caused by data skew
  • Improper parallelism: the data volume or the computation cost of a single operator was misestimated

Solutions:

  • For data skew, see the next section
  • For parallelism, instrument the message path with markers to measure the per-operator cost

Data Skew

Core ideas:

  • Append a random number to the key, then keyBy on the new key; the keys are spread out across partitions, avoiding data skew
  • A second keyBy on the restored key aggregates the partial results

Core code for scattering keys:

public class KeyByRouter {

    private final static String SPLIT_CHAR = "#";

    /**
     * Don't scatter too widely, or the second-stage
     * aggregation will itself be skewed
     *
     * @param sourceKey the original key
     * @return the scattered key
     */
    public static String randomKey(String sourceKey) {
        int endExclusive = (int) Math.pow(2, 7);
        return sourceKey + SPLIT_CHAR + (RandomUtils.nextInt(0, endExclusive) + 1);
    }

    public static String restoreKey(String randomKey) {
        if (StringUtils.isEmpty(randomKey)) {
            return null;
        }

        return randomKey.split(SPLIT_CHAR)[0];
    }
}
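
Putting KeyByRouter to work, the two-stage aggregation looks roughly like this; a sketch only, assuming watermarks are assigned upstream and using illustrative window sizes:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// (key, 1L) pairs flow in, e.g. one per event
DataStream<Tuple2<String, Long>> events = ...

events
        // stage 1: scatter hot keys and pre-aggregate per scattered key
        .map(t -> Tuple2.of(KeyByRouter.randomKey(t.f0), t.f1))
        .returns(Types.TUPLE(Types.STRING, Types.LONG))
        .keyBy(t -> t.f0)
        .window(TumblingEventTimeWindows.of(Time.seconds(60)))
        .sum(1)
        // stage 2: restore the original key and merge the partial counts
        .map(t -> Tuple2.of(KeyByRouter.restoreKey(t.f0), t.f1))
        .returns(Types.TUPLE(Types.STRING, Types.LONG))
        .keyBy(t -> t.f0)
        .window(TumblingEventTimeWindows.of(Time.seconds(60)))
        .sum(1);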

Suspending a Job While Preserving State Fails

Possible causes:

  • The job is already under backpressure, so checkpoints probably fail too; the savepoint taken when suspending with state will then certainly fail
  • The job state is very large and the savepoint times out
  • The job's checkpoint timeout is set too short, so the job discards the savepoint before it completes

Solutions:

  • Set the checkpoint timeout in code as long as practical, e.g. 10 minutes; jobs with large state can set it even larger (see the sketch below)
  • If the job does not need to preserve state, just suspend it and restart
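
For reference, the checkpoint timeout is configured on the job's CheckpointConfig; a minimal sketch with a 10-minute timeout:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000L); // checkpoint every 60 s
// Give checkpoints/savepoints of large-state jobs enough time to complete
env.getCheckpointConfig().setCheckpointTimeout(10 * 60 * 1000L);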

Summary and Outlook

This article introduced our currently stable and viable real-time computing architecture from three angles: the evolution of the real-time feature cleaning framework, feature configurability, and hot deployment of feature cleaning logic. After nearly two years of iteration, the current architecture performs best in stability, resource utilization, and performance overhead, providing strong support for business strategy and algorithm personnel.

In the future, we expect feature configuration to return to SQL. Although today's configuration is simple enough, it is still our own "domain-specific language", which carries a learning cost for new strategy and product personnel. What we expect is configuration in a universal language like SQL, similar to offline Hive queries, shielding the complex underlying computation and helping the business develop better.

References:
[1] Risk management: https://zh.wikipedia.org/wiki/%E9%A3%8E%E9%99%A9%E7%AE%A1%E7%90%86
