Thoughts on business risk control: how to build identification, defense, and decision-making systems

Introduction: Over the three years in which the epidemic disrupted the rhythm of daily life, "cutting costs and increasing efficiency" has been a recurring topic among enterprises. Examples include letting employees feel the chill, removing the green plants from the office, and lowering the food standard in the cafeteria. As far as operating costs are concerned, reducing the share of limited resources stolen by black and gray market operators (bonus hunters known as the "wool party", CAPTCHA-solving platforms, etc.) is undoubtedly one of the most effective ways to cut costs.

According to incomplete statistics, China currently has more than 30,000 black market gangs, and the annual profit of these gangs exceeds 3 million. The annual corporate losses caused by the black market can exceed 100 billion, and 61.5% of online traffic comes from black and gray market activity. Over nearly 10 years of continuous confrontation with the black and gray market, Geetest has concluded that it is characterized by "high efficiency, high speed, and large scale". Black and gray market operators may cheat on behavior, using automated trajectory scripts to simulate the operation paths of real users; they may cheat on devices, using emulators, cloud-controlled and group-controlled device farms, and interface-cracking programs to take part at scale in marketing campaigns run by enterprises; and they may cheat on identity, maintaining tens of thousands of secondary accounts and disguising them within business flows.

Faced with black market techniques that cheat on behavior, device, and identity at once, assessing the credibility of a piece of traffic comprehensively requires an identification, defense, and decision-making system built on those same three dimensions: behavior, device, and identity.

Behavioral trajectory model

In 2012, Geetest, known as the global leader in interactive security, was the first to propose recognizing human-computer interaction through biological traces. The trajectory model behind this method has been iterated for 10 years and still holds the main line of defense across the major CAPTCHA forms. It identifies and stops black and gray market attacks from behavioral data and plays a key defensive role during the campaigns of major top customers. As the customer base expands, the trajectory model is fed more and more data, and its accuracy and effectiveness keep improving. So how is a model based on the user's biological trajectory built?

1. Collection of samples

Readers familiar with AI or big data will know that sample data is critical in the early stage of building a model. During the cold-start period, obtaining sample data is often one of the biggest difficulties in modeling. When behavior verification first launched, its novel sliding interaction and innovative trajectory recognition concept attracted many customers in a short time. Major websites began to deploy behavior verification, and for a while it spread virally in a healthy way. Real users and machine attack scripts alike "swiped" one after another to try to pass what was then the interactive verification tool with the best experience. The initial sample data accumulated gradually in this way.

2. Building a model

With trajectory sample data in hand, the next step is to build a trajectory recognition model that outputs a discrimination result in real time whenever a sliding action is completed. For ease of understanding, the multi-dimensional trajectory features are simplified here into a vector a(x, y) of two features. Suppose a trajectory identification function F has been obtained (ignore for now how it is obtained). When the curve of F is drawn in the two-dimensional coordinate system, trajectories falling to the upper left of the curve are those of real people, and trajectories falling to the lower right are those of machines.

 

As the model is applied, one day a red dot (machine) will fall in the green human area, and likewise a green dot (human) will fall in the red machine area. If the online system takes only two actions on the model's result, block or allow, then missed attacks and false blocks will occur. The identification function F must then be optimized continuously until human and machine trajectories are separated as completely as possible.
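The two-dimensional simplification can be sketched in a few lines of Python. This is a minimal illustration only: the boundary y = x stands in for the identification function F, and the sample points, labels, and function names are all made up for this sketch rather than taken from the real model.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def f_boundary(x: float) -> float:
    """Hypothetical identification function F: the straight line y = x."""
    return x


def classify(point: Point) -> str:
    """Points above (upper left of) the boundary are human, the rest machine."""
    x, y = point
    return "human" if y > f_boundary(x) else "machine"


def count_errors(samples: List[Tuple[Point, str]]) -> Tuple[int, int]:
    """Return (missed machines, wrongly blocked humans) under block/allow."""
    missed = sum(1 for p, label in samples
                 if label == "machine" and classify(p) == "human")
    blocked = sum(1 for p, label in samples
                  if label == "human" and classify(p) == "machine")
    return missed, blocked


# Made-up labeled trajectories, already reduced to two features each.
SAMPLES = [
    ((0.2, 0.9), "human"),
    ((0.3, 0.7), "human"),
    ((0.8, 0.1), "machine"),
    ((0.9, 0.4), "machine"),
    ((0.4, 0.6), "machine"),  # red dot that lands in the green human area
    ((0.7, 0.5), "human"),    # green dot that lands in the red machine area
]

print(count_errors(SAMPLES))  # (1, 1)
```

One machine point lands on the human side and one human point lands on the machine side, which is the crossover that makes further optimization of F necessary.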

3. Optimizing the model

When trajectory points cross over in the coordinate system, the model must be optimized so that the identification function corrects errors more promptly and accurately. Here a CNN is used so that the model can evolve on its own, adapting to and learning different trajectory features in order to distinguish them accurately. Some readers may wonder: if the green and red points are dense enough, is it possible that the identification function F cannot cleanly separate the human and machine trajectories into two independent regions of the two-dimensional plane, so that they cannot be told apart? The answer is yes. This possibility exists, and it also exists once the two-dimensional features are restored to high-dimensional features. In that case, relying on a CNN alone falls short.

This is where a clustering model comes in handy. The biggest difference from the CNN can be understood simply: if the CNN relies on the identification function F to divide trajectories into two regions, clustering divides them into multiple clusters. Because machine trajectories are usually distributed in relatively tight clusters, the core idea is to find where the trajectories are dense, define a unit around each dense location, and use those units for blocking. Model optimization is a process of continuous exploration and experimentation. An average daily data volume of more than 1.4 billion provides the precondition for optimizing and iterating the models, and combining multiple models makes up for the shortcomings of any single one, so that each machine attack can be defended against more accurately.
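The idea of turning dense locations into blocking units can be sketched with a simple grid-density pass. This is an assumption-laden stand-in for whatever clustering algorithm is actually used: the cell size, the density threshold, the sample points, and all function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

Point = Tuple[float, float]
Cell = Tuple[int, int]


def cell_of(point: Point, cell_size: float) -> Cell:
    """Map a 2D feature point to the grid cell that contains it."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))


def dense_cells(points: Iterable[Point], cell_size: float,
                min_points: int) -> Set[Cell]:
    """Cells holding at least min_points trajectories become blocking units."""
    counts = Counter(cell_of(p, cell_size) for p in points)
    return {cell for cell, n in counts.items() if n >= min_points}


def is_blocked(point: Point, banned: Set[Cell], cell_size: float) -> bool:
    """A new trajectory is blocked if it falls inside a dense unit."""
    return cell_of(point, cell_size) in banned


# Made-up data: four tightly packed machine trajectories, three scattered humans.
POINTS = [
    (0.81, 0.12), (0.82, 0.13), (0.83, 0.11), (0.84, 0.14),
    (0.15, 0.92), (0.35, 0.65), (0.55, 0.75),
]
BANNED = dense_cells(POINTS, cell_size=0.1, min_points=3)

print(BANNED)                                  # {(8, 1)}
print(is_blocked((0.85, 0.15), BANNED, 0.1))   # True
print(is_blocked((0.15, 0.92), BANNED, 0.1))   # False
```

A fixed grid is the simplest possible density estimate; it is used here only because it makes the "dense location becomes a unit" idea visible without any third-party library.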

Device portrait model

With the development of the Internet, mobile apps have penetrated almost every scene of daily life. A mobile phone is all it takes to join marketing campaigns, complete game tasks, and collect rewards. Enterprises intend those rewards to go to real target users who take part, but black and gray market operators run mobile devices in batches to claim them, which both destroys fairness and defeats the enterprise's actual marketing purpose. Identifying cheating at the device level is therefore particularly important. Device cheating by the black market falls into two main types: installing risk tools and modifying device parameters.

1. Device fingerprint

A device fingerprint is a unique identifier generated for the terminal device of an Internet user. To be highly stable and hard to tamper with, the device fingerprint uses weak-feature attribution: it does not rely on highly sensitive information such as IMEI and IDFA, and it complies with privacy policy requirements. Multiple complementary algorithm models are built over more than 100 data features, and together they produce a unique device identifier. The identifier stays the same across restarts, uninstalls, reinstalls, and modifications of hardware parameters. In scenarios such as new-user acquisition and vote boosting, it helps identify abnormal behavior such as one device running many accounts, abuse through secondary accounts, and traffic inflation.
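One way to see how an identifier can survive a changed parameter is to match each new observation against known devices by feature similarity instead of hashing the raw values. The sketch below is not Geetest's algorithm: the similarity measure (share of matching features), the 0.7 threshold, the feature names, and the `FingerprintRegistry` class are all assumptions made for illustration.

```python
import uuid
from typing import Dict, Optional

Features = Dict[str, str]


def similarity(a: Features, b: Features) -> float:
    """Share of feature keys on which both observations agree."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    same = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
    return same / len(keys)


class FingerprintRegistry:
    """Hypothetical store that reuses an ID when weak features mostly match."""

    def __init__(self, threshold: float = 0.7) -> None:
        self.threshold = threshold
        self.known: Dict[str, Features] = {}

    def identify(self, features: Features) -> str:
        best_id: Optional[str] = None
        best_score = 0.0
        for device_id, stored in self.known.items():
            score = similarity(features, stored)
            if score > best_score:
                best_id, best_score = device_id, score
        if best_id is not None and best_score >= self.threshold:
            self.known[best_id] = dict(features)  # keep the latest observation
            return best_id
        new_id = uuid.uuid4().hex
        self.known[new_id] = dict(features)
        return new_id


# Made-up weak features; none of them is a sensitive identifier.
BASE = {"screen": "1080x2400", "cpu": "8-core", "fonts": "set-17",
        "timezone": "UTC+8", "sensors": "acc,gyro,mag"}
TAMPERED = dict(BASE, cpu="4-core")  # one parameter modified
OTHER = {"screen": "720x1600", "cpu": "4-core-b", "fonts": "set-03",
         "timezone": "UTC+1", "sensors": "acc"}

registry = FingerprintRegistry()
first_id = registry.identify(BASE)
print(registry.identify(TAMPERED) == first_id)  # True
print(registry.identify(OTHER) == first_id)     # False
```

With five features and one of them changed, the similarity is 0.8, which clears the assumed threshold, so the tampered observation resolves to the same identifier.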

2. Device environment detection

Device fingerprints alone cannot identify every kind of cheating. If a risk status label can be attached to each device, its risk level can be perceived at any time. The latest device portrait no longer outputs a fixed risk score from a single comparison against a device blacklist, as traditional products do. Instead it follows the principle of "real-time detection, real-time confrontation, and real-time update": a risk detection model is built from historical behavior, real-time risk, and device attribution, and it reports the device's current risk status and risk labels accurately. Compared with a risk score, a direct indicator such as 0 or 1 gives the enterprise a clearer signal for how to handle the device and removes the hesitation caused by score-band boundaries.
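A binary status plus labels might look like the following sketch. The signal fields and label strings are hypothetical examples chosen for illustration; the real model's inputs and label vocabulary are not described in this article.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DeviceSignals:
    # Hypothetical example signals, not an actual product schema.
    past_abuse_events: int     # historical behavior
    is_emulator: bool          # real-time risk
    risk_tool_installed: bool  # real-time risk
    params_modified: bool      # device parameters look tampered with


def assess_device(signals: DeviceSignals) -> Tuple[int, List[str]]:
    """Return (risk status, risk labels); status is 1 if any label applies."""
    labels: List[str] = []
    if signals.past_abuse_events > 0:
        labels.append("history_abuse")
    if signals.is_emulator:
        labels.append("emulator")
    if signals.risk_tool_installed:
        labels.append("risk_tool")
    if signals.params_modified:
        labels.append("params_modified")
    return (1 if labels else 0), labels


clean = DeviceSignals(0, False, False, False)
farm_phone = DeviceSignals(2, False, True, True)

print(assess_device(clean))       # (0, [])
print(assess_device(farm_phone))  # (1, ['history_abuse', 'risk_tool', 'params_modified'])
```

The caller acts on the 0 or 1 directly, and the labels explain why the device was flagged, so no score threshold has to be tuned on the business side.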

 

It is worth mentioning that as regulation tightens, device-level risk control faces data compliance risks, and risk control systems that rely on highly sensitive data such as IMEI, IDFA, and MAC address are destined to be phased out. More and more black and gray market operators have also begun to cheat with customized phones. For example, a phone reporting a certain brand was found to contain characters that do not belong to that brand. Identifying cheating with customized devices requires a new scheme, which in turn needs a large volume of black and white sample data and a device information database broad enough to cover the mainstream devices on the market.

Account portrait model

Under the online real-name registration policy, a mobile phone number has almost become the online identity of a real user. 90% of the Geetest account portrait model is built around the mobile phone number. The model provides two main capabilities: an account risk level (low, medium, high) and account risk labels.

Consider the following situation in a registration and login scenario. A user with the mobile phone number 187xxxx1234 registers for an app, and the device used has the fingerprint AAA. One day later, the account logs in to the app again, but this time the device fingerprint is BBB. During the 618 event a week later, the account enters the app once more, and this time the login device is an emulator. Three devices were used on three occasions, and during the campaign the login came from an emulator, a high-risk virtual environment. This indicates that the account is most likely a secondary account registered by the black and gray market, and that its main purpose in entering the app is to collect the 618 campaign rewards. For ease of understanding, the registration, second login, and final login are drawn in the following figure:

Accounts and devices form one-to-many relationships like the one above. We can formulate all the policies and rules usable for analysis and then assess the reputation of every account to mark its risk level. First, every account is given an initial score. Then a set of account-related rules and features is defined, and points are deducted from or added to the account depending on which rules it triggers. The resulting score determines the account's risk level.
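The scoring procedure can be sketched as a small rule table applied to an initial score. Everything numeric here is an assumption made for illustration: the initial score of 100, the point values, the level thresholds, the rule names, and the `EMU` fingerprint standing in for the emulator login.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Account:
    phone: str
    device_fingerprints: List[str]  # fingerprints seen for this account
    emulator_logins: int            # logins made from an emulator
    age_days: int                   # days since registration


# Each rule: (label, predicate, points added when the rule is triggered).
Rule = Tuple[str, Callable[[Account], bool], int]

RULES: List[Rule] = [
    ("multi_device", lambda a: len(set(a.device_fingerprints)) >= 3, -30),
    ("emulator_login", lambda a: a.emulator_logins > 0, -40),
    ("long_lived", lambda a: a.age_days >= 365, 10),
]

INITIAL_SCORE = 100  # assumed starting score for every account


def score_account(account: Account) -> Tuple[int, str, List[str]]:
    """Return (score, risk level, labels of the triggered rules)."""
    score = INITIAL_SCORE
    triggered: List[str] = []
    for label, predicate, delta in RULES:
        if predicate(account):
            score += delta
            triggered.append(label)
    if score >= 80:
        level = "low"
    elif score >= 50:
        level = "medium"
    else:
        level = "high"
    return score, level, triggered


# The account from the example: three devices in about a week, one an emulator.
suspect = Account("187xxxx1234", ["AAA", "BBB", "EMU"], 1, 8)
print(score_account(suspect))  # (30, 'high', ['multi_device', 'emulator_login'])
```

Returning the triggered rule labels alongside the level keeps the result explainable, which matters once the labels are passed on to the business side.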

When our business data is sufficient and industry coverage is wide enough, device fingerprints, mobile phone numbers, IP addresses, and other identifiers are linked into a cross-industry, cross-device relationship network, forming a unique relationship graph.
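A relationship graph of this kind can be explored with a union-find pass that groups phone numbers connected through shared devices or IP addresses. This is a sketch under stated assumptions: the `phone:`, `device:`, and `ip:` prefixes, the sample links, and the function names are hypothetical, and a production graph would carry far richer edge data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def find(parent: Dict[str, str], node: str) -> str:
    """Find the root of a node, halving the path along the way."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def account_groups(links: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group phone numbers that are connected through any shared entity."""
    parent: Dict[str, str] = {}
    for a, b in links:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b
    groups: Dict[str, Set[str]] = defaultdict(set)
    for node in list(parent):
        if node.startswith("phone:"):
            groups[find(parent, node)].add(node)
    return sorted(groups.values(), key=len, reverse=True)


# Made-up observations: (phone number, device fingerprint or IP it was seen with).
LINKS = [
    ("phone:A", "device:AAA"),
    ("phone:B", "device:AAA"),
    ("phone:B", "ip:203.0.113.7"),
    ("phone:C", "ip:203.0.113.7"),
    ("phone:D", "device:DDD"),
]

for group in account_groups(LINKS):
    print(sorted(group))
# ['phone:A', 'phone:B', 'phone:C']
# ['phone:D']
```

Accounts A and C never share a device directly, yet they end up in one group through B, which is the kind of indirect link a single-account rule cannot see.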

As the rules are put into use, we continuously obtain the risk level and triggered rules of each mobile phone number. After desensitizing the rules, we build a labeling system on top of them, so that along with the account's risk level we also return its risk labels, helping enterprises make clearer decisions.

 

Conclusion

Behavior, device, and identity are the three elements of traffic governance. Geetest relies on this three-element model to defend against every abnormal attack, combining the security models of the three dimensions with a dynamic scheduling engine to protect the enterprise. When any model detects a wool party member, the system not only calls up a security tool in real time for secondary verification but also sends the relevant labels back to the business server so that the business side can decide further. This greatly reduces the wool party's profit efficiency and success rate, until they can no longer cover their costs and finally give up and leave empty-handed.
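The "any model triggers" logic can be summarized in a short sketch. The model names, label strings, and the `Decision` structure are hypothetical; the actual scheduling engine is not described in this article.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Decision:
    secondary_verification: bool             # call up a security tool now?
    labels: List[str] = field(default_factory=list)  # sent to the business server


def decide(model_labels: Dict[str, List[str]]) -> Decision:
    """Combine the three models: any raised label triggers verification."""
    labels = [f"{model}:{label}"
              for model in sorted(model_labels)
              for label in model_labels[model]]
    return Decision(secondary_verification=bool(labels), labels=labels)


result = decide({
    "behavior": [],
    "device": ["emulator"],
    "account": ["multi_device"],
})
print(result.secondary_verification)  # True
print(result.labels)                  # ['account:multi_device', 'device:emulator']
```

The function only reports what was detected; what to do with a flagged request beyond secondary verification is left to the business side, as in the text.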

Origin blog.csdn.net/geek_wh2016/article/details/127011636