Data Capture and Risk Control in Internet Finance

Internet finance should make full use of Internet technology to obtain user data efficiently, analyze that massive data, grant credit to customers accurately, screen out malicious high-risk users, reduce default rates, and thereby achieve risk control.

Risk control is mainly divided into two stages: data capture and data mining. Below are a few aspects of what I know about data capture on the Internet:

1. Device identification:

Device identification means precisely identifying the devices a user uses to go online, including computers, mobile phones, and tablets. Device identification enables anti-fraud and account association. For example, if different accounts log in from the same device, those accounts can be considered related: if one of them is overdue, the others should also be treated as high-risk users. Similarly, if a large number of accounts log in from the same device, it can be inferred that these accounts are high-risk fraudulent accounts.
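
Below is a minimal sketch of how such association rules might be applied, assuming device IDs have already been resolved by an identification service. The records and the threshold are purely illustrative:

```python
from collections import defaultdict

# Illustrative login records: (device_id, account_id, is_overdue)
logins = [
    ("dev-1", "alice", False),
    ("dev-1", "bob", True),
    ("dev-2", "carol", False),
]

MAX_ACCOUNTS_PER_DEVICE = 5  # illustrative threshold

accounts_by_device = defaultdict(set)
overdue_devices = set()
for device_id, account_id, is_overdue in logins:
    accounts_by_device[device_id].add(account_id)
    if is_overdue:
        overdue_devices.add(device_id)

high_risk = set()
for device_id, accounts in accounts_by_device.items():
    # Rule 1: accounts sharing a device with an overdue account are high risk.
    if device_id in overdue_devices:
        high_risk |= accounts
    # Rule 2: too many accounts on one device suggests organized fraud.
    if len(accounts) > MAX_ACCOUNTS_PER_DEVICE:
        high_risk |= accounts

print(sorted(high_risk))  # ['alice', 'bob']: both logged in from dev-1
```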

Device identification comes in two forms: client and web page. A client such as a mobile app can read the phone's unique device code (on Apple phones this appears to be restricted), while on web pages it is mainly done through page JS scripts and back-end TCP packet parsing. The domestic device-identification and anti-fraud service providers I know of are "Tongdun" and "Tongfudun".
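
On the web side, one common idea is to hash a set of browser and connection attributes (collected by a page script or from the request itself) into a stable fingerprint. A minimal sketch with entirely illustrative attributes; real services like those above combine many more signals:

```python
import hashlib
import json

def device_fingerprint(attrs: dict) -> str:
    """Hash browser/connection attributes into a short device ID.

    The attributes here (user agent, screen size, timezone) are
    illustrative; commercial services use many more signals.
    """
    canonical = json.dumps(attrs, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

print(device_fingerprint({
    "user_agent": "Mozilla/5.0 ...",
    "screen": "1920x1080",
    "timezone": "UTC+8",
}))
```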

Normal user behaviors, such as the number of logins per day, common login locations, weekly transaction amounts, and habitual shopping times, fall within a basically fixed range, while abnormal user behavior deviates significantly from it. By computing these behavioral indicators for each user and comparing them with normal values, suspicious cases can be found.
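
One simple way to make "significantly different from normal" precise is a z-score test against the user's own history. A minimal sketch with made-up numbers; the threshold is illustrative:

```python
import statistics

def is_suspicious(history, today, threshold=3.0):
    """Flag a value more than `threshold` standard deviations away
    from the user's historical mean (threshold is illustrative)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > threshold

daily_logins = [2, 3, 2, 4, 3, 2, 3]    # one user's recent history
print(is_suspicious(daily_logins, 25))  # True: 25 logins in a day is anomalous
print(is_suspicious(daily_logins, 3))   # False: within the normal range
```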

2. Targeted data capture:

Targeted data capture means extracting valuable data that is openly available on the Internet, such as lists of delinquent persons published by the courts.

Blacklisted users published on P2P websites can be crawled in a targeted way by analyzing the structure of the web pages; the captured data is then parsed and used as a fraud-evidence base.

In Java, you can use HttpClient to fetch pages and Jsoup with XPath to parse them; in Python there is the Scrapy crawler framework. For pages that load content asynchronously via JS, you can try Java's HtmlUnit, which can simulate a browser executing scripts.
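
As a minimal sketch of targeted capture with Scrapy — the URL and CSS selectors are hypothetical and must be adapted to the actual page structure of the site being crawled:

```python
import scrapy

class BlacklistSpider(scrapy.Spider):
    """Crawl a (hypothetical) published blacklist page."""
    name = "blacklist"
    start_urls = ["http://example.com/blacklist"]  # placeholder URL

    def parse(self, response):
        # The selectors depend entirely on the target page's structure.
        for row in response.css("table.blacklist tr"):
            yield {
                "name": row.css("td.name::text").get(),
                "id_number": row.css("td.id::text").get(),
            }
        # Follow pagination if the page has it.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```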

I also recently came across another crawler framework: pyspider (http://docs.pyspider.org/en/latest/). It runs as a service and lets you configure scheduled tasks to crawl web page data.
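
A minimal pyspider handler, adapted from the pattern in its documentation; the start URL is a placeholder:

```python
from pyspider.libs.base_handler import BaseHandler, config, every

class Handler(BaseHandler):
    @every(minutes=24 * 60)  # scheduled task: re-crawl once a day
    def on_start(self):
        self.crawl("http://example.com/blacklist", callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)  # treat results as fresh for ten days
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {"url": response.url, "title": response.doc("title").text()}
```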

3. Simulated login:

Simulated login sets up a proxy site between the user and the real website: the account and password the user fills in are submitted to the proxy site, whose back end then logs in to the real website on the user's behalf, so the proxy site can obtain information that is only visible after login. This is fairly complicated. Many websites, especially large e-commerce sites, deploy anti-fraud and anti-robot strategies against simulated login, and many pages also execute encrypted scripts; sites may examine keyboard input data and mouse movement trajectories, and when the login device or IP looks abnormal for that user they add extra anti-machine authentication. As a result the login success rate is low and unstable.

The technical approach to simulated login that I know of is Selenium 2. Selenium supports multiple languages for web automation testing and can use language scripts to drive a browser to operate web pages automatically.

(Reference: http://www.cnblogs.com/dingmy/p/3438084.html)
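
A minimal sketch of driving a browser with Selenium's Python bindings; the login URL and form field names are hypothetical and depend on the target site:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # requires a matching ChromeDriver install
try:
    driver.get("http://example.com/login")  # placeholder login page
    # Field names depend on the real site's login form.
    driver.find_element(By.NAME, "username").send_keys("user@example.com")
    driver.find_element(By.NAME, "password").send_keys("secret")
    driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
    # Pages that require authentication can now be fetched in this session.
    print(driver.title)
finally:
    driver.quit()
```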

Finally, I would like to add some of my own understanding. What matters most in Internet finance is risk control. The Internet serves a huge number of users, so user information cannot be gathered through traditional offline methods; it has to be obtained online, which is efficient and low-cost. Internet companies such as Alibaba and Tencent have accumulated years of user behavior data, so they can grant credit and control risk accurately and efficiently for their own user bases; for example, they can infer a user's spending power and interests from past consumption behavior. That amounts to granting credit to acquaintances. Internet finance companies, by contrast, must on the one hand accumulate their own user data and on the other hand grant credit to strangers, which requires acquiring large amounts of user data.

For example, by recording when users log in to the website, the data can be clustered to find which online time periods correspond to high repayment rates and which to low ones. At the same time, data obtained from multiple dimensions can verify the authenticity of user information: if a user provides a home address, the app can report the user's latitude and longitude, and if the two roughly match, the information can be considered true.

There are many ways to obtain data. The challenge is to build a flexible, configurable, and efficient data-acquisition framework, and then to use data mining and machine learning to build an efficient, automated risk-control engine. I believe a risk-control engine (crawler framework + rule engine) will bring a huge leap to Internet finance, just as search engines did for the Internet.
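
The address check above can be sketched as a simple distance comparison: geocode the declared address (geocoding service not shown) and compare it with the coordinates reported by the app. All numbers here are illustrative:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

declared = (31.2304, 121.4737)  # geocoded from the address the user provided
reported = (31.2400, 121.4800)  # latitude/longitude reported by the app

MATCH_RADIUS_KM = 5.0  # illustrative tolerance
print(haversine_km(*declared, *reported) < MATCH_RADIUS_KM)  # True: roughly matches
```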
