User portrait series - Tencent anti-brush lead: an e-commerce anti-brush architecture based on user-portrait big data

Tencent Big Data Processing Platform - Rubik's Cube

Our team has developed a big data processing and analysis platform called Rubik's Cube. At the bottom layer it integrates MySQL, MongoDB, Spark, Hadoop and other technologies. At the user level, routine analysis only requires writing some simple SQL statements and completing some configuration.

On this platform we collect data from social, e-commerce, payment, gaming and other scenarios, build models on that data to find out which accounts and behaviors are malicious, and precipitate the results.

The precipitated data that is meaningful to security is used in two ways: it is stored on the Rubik's Cube platform for offline auditing and modeling, and it is also turned into a real-time service that online systems can query.

1. Tencent user portrait precipitation method

A portrait is essentially a set of tags attached to accounts, devices, and so on.

User portrait = tagging

Here we mainly tag from the perspective of security. For an IP portrait, for example, we mark whether the IP is a proxy IP. Tags like this directly support our strategies.

Take the QQ portrait as an example. Suppose a QQ account only logs in to IM and never logs in to other Tencent services, does not chat, frequently adds friends and is frequently deleted by them, and either has not opened QQ space (Qzone) or has opened it and receives many comments but never replies. An account like this is generally tagged as a farmed QQ account (pornography, dating, marketing). We also attach other tags to QQ accounts.

The categories and details of the tags need to be set by the risk control staff. For example, geographic location is marked by province, and gender is marked as male or female. Other detailed rules follow the same pattern.
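The QQ example above amounts to a rule over account behavior. The following is a minimal sketch of such a rule, not Tencent's actual implementation: the `QQActivity` fields, the thresholds, and the tag name `farmed_account` are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class QQActivity:
    """Hypothetical per-account activity summary; every field name is an assumption."""
    logs_into_other_services: bool
    chat_messages_sent: int
    friend_requests_sent: int
    times_deleted_by_friends: int
    qzone_opened: bool
    qzone_comments_received: int
    qzone_replies_sent: int


def tag_account(a: QQActivity) -> set[str]:
    """Return security tags for one account, using illustrative thresholds."""
    tags: set[str] = set()
    # Only uses IM: no other services and no chatting.
    im_only = not a.logs_into_other_services and a.chat_messages_sent == 0
    # Adds friends often and is often deleted by them (thresholds are assumed).
    pushy = a.friend_requests_sent >= 50 and a.times_deleted_by_friends >= 10
    # Qzone never opened, or opened with many comments and no replies.
    silent_qzone = (not a.qzone_opened) or (
        a.qzone_comments_received >= 20 and a.qzone_replies_sent == 0
    )
    if im_only and pushy and silent_qzone:
        tags.add("farmed_account")
    return tags
```

In practice the thresholds and the tag vocabulary would be set by the risk control staff, as described above.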

Let's take a look at Tencent's IP portrait. The logic of precipitation is as follows:

[Figure: IP portrait precipitation logic]

Most businesses limit the frequency and number of requests per IP, so to get around these limits the black industry inevitably uses large numbers of proxy IPs.

Since identifying proxy IPs is so important, let's use them as an example and walk through how Tencent identifies a proxy IP.

There are essentially four techniques for deciding whether an IP is a proxy IP:

  1. Reverse detection: Scan whether the IP has opened ports that proxy servers often open, such as 80 and 8080. An ordinary user IP is unlikely to have these ports open.
  2. X-Forwarded-For in the HTTP header: An IP running an HTTP proxy can be identified this way; if the request carries XFF information, the IP is a proxy IP.
  3. Keep-alive message: If the request carries a keep-alive message with Proxy-Connection, the IP is a proxy IP.
  4. Check the ports on the IP: If an IP has an open port above 10000, the IP is most likely problematic. It is almost impossible for an ordinary home IP to have such a high port open.
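The four checks above can be combined into a single function that reports which signals fired. This is only a sketch under stated assumptions: the function name `proxy_signals`, the signal names, and the idea of passing in an already collected set of open ports are illustrative, and the actual port scanning is left out.

```python
from __future__ import annotations

COMMON_PROXY_PORTS = {80, 8080}   # ports named in the list above
HIGH_PORT_THRESHOLD = 10000       # "port greater than 10000"


def proxy_signals(headers: dict[str, str], open_ports: set[int]) -> list[str]:
    """Return the names of the proxy indicators that apply to one IP.

    headers: HTTP request headers seen from the IP.
    open_ports: ports found open on the IP by a separate scan (assumed input).
    """
    names = {name.lower() for name in headers}  # header names are case-insensitive
    signals = []
    if open_ports & COMMON_PROXY_PORTS:
        signals.append("proxy_port_open")
    if "x-forwarded-for" in names:
        signals.append("x_forwarded_for")
    if "proxy-connection" in names:
        signals.append("proxy_connection")
    if any(port > HIGH_PORT_THRESHOLD for port in open_ports):
        signals.append("high_port_open")
    return signals
```

For example, a request carrying both `X-Forwarded-For` and `Proxy-Connection` from an IP with ports 8080 and 31280 open triggers all four signals, while a plain request from an IP with no open ports triggers none.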

The proxy IP detection methods above are almost all public knowledge, but blindly scanning the IPs of the entire network risks being blocked and is also very inefficient.

Therefore, in addition to using web crawlers to crawl proxy IPs, we speed up collection in the following way: business models flag malicious IPs (the black industry is more likely to use proxy IPs), and protocol scans then determine whether those IPs are proxy IPs. Every day Tencent finds millions of malicious IPs, most of which are proxy IPs.
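The point of this approach is to scan a small candidate set instead of the whole network. A minimal sketch, assuming a hypothetical `probe` callable that performs the protocol scan for one IP and returns whether it behaves like a proxy:

```python
from __future__ import annotations

from typing import Callable, Iterable


def find_proxies(
    crawled: Iterable[str],
    flagged: Iterable[str],
    probe: Callable[[str], bool],
) -> set[str]:
    """Scan only candidate IPs and return those confirmed as proxies.

    crawled: proxy IPs gathered by web crawlers.
    flagged: IPs marked malicious by business models.
    probe: hypothetical scanner; returns True if the IP acts as a proxy.
    """
    candidates = set(crawled) | set(flagged)  # de-duplicate before scanning
    return {ip for ip in candidates if probe(ip)}
```

Because `probe` is passed in, the candidate selection can be exercised with a stub scanner before any real scanning is wired up.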

2. Overview of Tencent User Portrait Categories

[Figure: overview of Tencent user portrait categories]

3. Defense logic

[Figure: defense logic]

The real-time system is developed in C/C++, and all data is stored in shared memory. Compared with other systems, a security system has its own special circumstances, so here we can adopt a "lossy" design, which greatly reduces development cost and difficulty.

Data consistency: with multiple machines each using shared memory, how is data consistency ensured?

In fact, a security policy does not need strong data consistency.

From a security perspective, risk is itself a probability value and is inherently uncertain, so a small amount of data inconsistency does not affect the overall result.

However, a security system also has its own characteristics. It generally faces relatively large bursts of traffic, so we need to set up various emergency switches, and these switches must be quick and convenient to flip through WeChat, SMS and similar channels, so that the impact does not spread to the back-end system.
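One way to picture this tolerance for inconsistency is a counter that each machine updates locally and shares with its peers only occasionally. The sketch below is an assumption-laden illustration, not the C/C++ shared-memory system described here: `LossyNode` and its methods are hypothetical names, and in-process Python objects stand in for separate machines.

```python
from collections import Counter


class LossyNode:
    """One machine's approximate view of per-key hit counts (hypothetical)."""

    def __init__(self):
        self.counts = Counter()   # this node's current view, possibly stale
        self.pending = Counter()  # local hits not yet shared with peers

    def hit(self, key):
        """Record a local event without contacting any other node."""
        self.counts[key] += 1
        self.pending[key] += 1

    def drain(self):
        """Hand over unsent local hits and reset the pending buffer."""
        deltas, self.pending = self.pending, Counter()
        return deltas

    def merge(self, deltas):
        """Fold in hits reported by a peer; a late or lost batch is tolerated."""
        self.counts.update(deltas)

    def is_risky(self, key, threshold):
        """Decide on the local, approximate count only."""
        return self.counts[key] >= threshold
```

Between exchanges the nodes disagree on the exact count, but a threshold decision based on an approximate count is usually still acceptable when the risk score is itself a probability.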
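An emergency switch of this kind can be sketched as a guard in front of the back-end call. Everything below is hypothetical: the `EmergencySwitch` class, the `max_qps` limit, and the choice to fall back to `"allow"` when the switch is off are assumptions made for illustration, and the WeChat or SMS trigger is represented only by the `turn_off` and `turn_on` methods.

```python
class EmergencySwitch:
    """Hypothetical kill switch protecting a back-end risk service."""

    def __init__(self, max_qps):
        self.enabled = True
        self.max_qps = max_qps  # assumed traffic ceiling for the back end

    def turn_off(self):
        """Would be invoked by an operator command, e.g. via WeChat or SMS."""
        self.enabled = False

    def turn_on(self):
        self.enabled = True

    def allows(self, current_qps):
        """Back-end calls are permitted only when on and under the ceiling."""
        return self.enabled and current_qps <= self.max_qps


def check_request(request, switch, current_qps, backend, fallback="allow"):
    """Query the back end unless the switch says to shed load."""
    if not switch.allows(current_qps):
        return fallback  # skip the back end so the burst does not reach it
    return backend(request)
```

Whether the fallback should allow or reject traffic depends on the business; the sketch only shows that the back end is bypassed while the switch is off or traffic exceeds the ceiling.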

http://www.36dsj.com/archives/35887
