Data Mining # A Summary of Financial Risk Control

As a finance undergraduate from a 211 university, I interned at two fintech companies during college, doing data analysis and mining in the financial risk-control direction. Now, more than a year later, I am taking the opportunity to organize what I learned and write a short summary of financial risk-control knowledge.

The first internship, company A, came before I went abroad on exchange at the end of my junior year, in the big-data risk-control department of what is said to be the country's only bank-licensed financial services company for small and medium-sized banks. The company provides small and medium banks with core business systems, internet banking systems, an outreach service platform, big-data services and operations, and risk-control support (a full range of IT solutions), along with diversified services such as training, consulting, and cooperative innovation.

The second internship, company B, came after the exchange ended, in the first semester of senior year, in the anti-fraud department of a fintech company in Chengdu. The company helps banks and other financial institutions solve the problems they meet in retail transformation: "how to acquire customers," "how to manage risk," "how to keep operations running," "how to support the systems," covering lifecycle management before, during, and after the loan.

The pre-loan process of a traditional lending business:

When a user comes to apply for a loan, the application first goes through fraud detection, where individual subjective fraud and gang fraud are rejected; then the person's credit is assessed; finally, a limit model calculates the loan amount while maximizing profit.

Coincidentally, my internships at companies B and A happened to cover exactly the fraud-detection and credit-assessment stages of the pre-loan process.

B: Fraud Detection

Knowledge graph direction

There are many ways to use relationship graphs, mainly divided into two categories:

The first is direct extraction of network features: extracting first-degree or second-degree association features for an upstream rule system or risk-assessment model to use.

Fraud scenarios demand real-time decisions, so these indicators must be extracted in real time. Some of them, such as second-degree association, are very expensive to compute in the general case; on a dynamic graph, approximation algorithms and pre-computation are usually required. Let me explain second-degree association with an example. In online loan applications, the applications that share a contact phone number with the target application are its first-degree associations; if those applications in turn share an address with certain other applications, those other applications form second-degree associations with the target application.

Some simple indicators, such as whether a first-degree or second-degree associated node has touched the blacklist, turn out to be very effective for fraud detection in actual practice.
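
As a rough illustration of how such association features might be computed, here is a minimal networkx sketch; the graph, node names, and the blacklist set are all made-up assumptions:

```python
import networkx as nx

# Hypothetical graph linking applications to the phones/addresses they use.
G = nx.Graph()
G.add_edges_from([
    ("app_1", "phone_A"), ("app_2", "phone_A"),   # app_1, app_2 share a phone
    ("app_2", "addr_X"), ("app_3", "addr_X"),     # app_2, app_3 share an address
])
blacklist = {"app_3"}                             # assumed known-fraud labels

def degree_neighbors(g, node):
    """Return (first-degree, second-degree) application associations of `node`."""
    media_1 = set(g.neighbors(node))              # shared phones/addresses
    apps_1 = {a for m in media_1 for a in g.neighbors(m)} - {node}
    media_2 = {m for a in apps_1 for m in g.neighbors(a)} - media_1
    apps_2 = {a for m in media_2 for a in g.neighbors(m)} - apps_1 - {node}
    return apps_1, apps_2

apps_1, apps_2 = degree_neighbors(G, "app_1")
print(apps_1)                                     # {'app_2'}: first degree
print(apps_2)                                     # {'app_3'}: second degree
print(bool(apps_2 & blacklist))                   # True: second-degree blacklist touch
```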

The second category is deep mining of network information. Deep mining usually starts from computing connected subgraphs; for financial applications, where social attributes are weak, a large connected subgraph may play a role in revealing a fraud network. On this basis we can go further and find communities (Community Detection). A community found this way is not the same as a connected subgraph; it is a stricter measure of cohesion. In addition, propagating known fraud labels through the graph (sometimes called "staining") in order to obtain more fraud labels is also an important application of relationship graphs.

Label propagation algorithm

· Label propagation algorithm

Label propagation is a graph-based semi-supervised learning method; the basic idea is to use the label information of labeled nodes to predict the labels of unlabeled nodes.

A complete-graph model is built from the relationships between samples. In the complete graph, nodes include both labeled and unlabeled data, edges represent the similarity between two nodes, and a node's label is passed to other nodes according to similarity. Labeled data acts like a source that can label the unlabeled data: the greater the similarity, the more easily a label spreads. The algorithm is simple to implement, runs quickly, has low complexity, classifies well, and is nicely interpretable.

· Label propagation in anti-fraud

Customers confirmed as fraudulent through investigation are marked as "bad" nodes; the label propagation algorithm then uses this labeled information to predict the fraud-risk level of unlabeled nodes. Edges indicate the similarity between two nodes, risk levels are passed to other nodes according to similarity, and each node's risk level is shown visually by color in the graph.

For example, take three people: Xiao Ming, Xiao Hong, and Xiao Wang, who are good friends. If we know that Xiao Ming borrows money and does not repay it, and Xiao Hong also borrows and does not repay, then as their friend, Xiao Wang's probability of not repaying is relatively higher than the average person's. This is the same reasoning as the old saying about birds of a feather flocking together.
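
A minimal sketch of the propagation idea over a networkx graph (not the exact production algorithm; the friendships, edge-weight similarities, and iteration count are invented for illustration):

```python
import networkx as nx

# Toy friendship graph; edge weight = similarity between two people (made up).
G = nx.Graph()
G.add_weighted_edges_from([
    ("xiaoming", "wang", 0.9),
    ("xiaohong", "wang", 0.8),
    ("wang", "zhang", 0.3),
    ("zhang", "li", 0.9),
])
seed = {"xiaoming": 1.0, "xiaohong": 1.0, "li": 0.0}   # known labels: 1 = fraud

score = {n: seed.get(n, 0.0) for n in G}
for _ in range(30):                        # iterate until roughly stable
    nxt = {}
    for n in G:
        if n in seed:                      # clamp the labeled nodes
            nxt[n] = seed[n]
            continue
        total = sum(d["weight"] for _, _, d in G.edges(n, data=True))
        nxt[n] = sum(score[m] * d["weight"]
                     for _, m, d in G.edges(n, data=True)) / total
    score = nxt

print(score)   # 'wang' converges near 0.88 (risky); 'zhang' near 0.22
```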

PageRank algorithm

· PageRank algorithm

PageRank, abbreviated PR, is an algorithm developed by Google, mainly used to rank websites by assessing their reliability and importance; it is one of the factors taken into account when ranking pages.

The PageRank algorithm rests mainly on two assumptions: first, the in-link quantity assumption (the more inbound links a page has, the more important it is); second, the in-link quality assumption (links from high-quality pages confer more weight on the pages they point to). On these two assumptions, PageRank assigns every page an initial weight and, following the link relationships between pages, iterates until each page's weight stabilizes. Nodes with high weight are generally considered more reliable websites.

· PageRank in anti-fraud

PageRank is a method for identifying the importance of nodes in a complex relationship network. In the initial stage, every node in the network is given the same PageRank value; the nodes' PageRank values are then updated again and again according to the network's transition probability matrix until they stabilize, yielding each node's final PageRank value. Drawing on fraud background data and machine-learning modeling experience, the final PageRank values of the nodes in the complex relationship network are segmented into high, medium, and low bands, and suspected fraud populations are looked for in the high band.

Here, the greater the weight, the greater the risk. And what is the weight? Roughly, how many people know you, how much contact you have with others, how "significant" you are in the network. Put plainly, there are more bad guys in the highly active population than in the inactive population; the good guys are generally more low-key.
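
A minimal sketch of the banding idea with networkx; the contact graph is a random stand-in and the quantile cut-offs are arbitrary assumptions:

```python
import networkx as nx

# Stand-in for a real contact network (edge = two people interact).
G = nx.barabasi_albert_graph(n=200, m=2, seed=42)

pr = nx.pagerank(G, alpha=0.85)            # iterate until the values stabilize

# Split final PageRank values into high / medium / low bands by quantile.
values = sorted(pr.values())
hi_cut = values[int(0.9 * len(values))]    # top 10% treated as the high band
lo_cut = values[int(0.5 * len(values))]

band = {n: "high" if v >= hi_cut else "medium" if v >= lo_cut else "low"
        for n, v in pr.items()}
suspects = [n for n, b in band.items() if b == "high"]
print(len(suspects), "nodes land in the high band for review")
```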

Community detection algorithm

· Community detection algorithm

Community detection algorithms can use various statistical indicators of the network to mine communities with close internal relations. Community detection here mainly uses GN, SLPA, Newman, and other algorithms to cluster and mine suspected fraud gangs in the complex relationship network.

· Community detection in anti-fraud

Take the GN algorithm as an example. First compute the initial edge betweenness of the complex network (the number of shortest paths between all pairs of nodes that pass through a given edge) and the Q value (modularity: a conventional measure of the quality of a community division). Remove the edge with the highest betweenness and recompute the network's Q value; if the Q value is larger than before the split, update the current network division and Q value; otherwise split again, repeating until the segmentation of the network is complete. Nodes within each resulting community are highly similar; by locating the distribution of fraud nodes across the divided communities, other highly similar people can be mined as suspected fraudsters.

For example, suppose you notice two groups of people in your neighborhood: one group dances in the square every day, the other furtively attends "classes" every day. Both groups may go to the bank to borrow money. If the square dancers who borrowed money all repaid it, then a square dancer who has not borrowed yet will, once they do borrow, probably repay; if the class-goers who borrowed money did not repay it, then a class-goer who has not borrowed yet will, once they do borrow, probably not repay.
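
A minimal runnable sketch of the GN procedure described above, using networkx's built-in girvan_newman and modularity (the two-clique toy graph stands in for two "gangs"):

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

# Two tight cliques joined by one bridge edge: a stand-in for two gangs.
G = nx.barbell_graph(5, 0)

best_q, best_split = -1.0, None
for split in girvan_newman(G):        # each step removes the highest-betweenness edge
    q = modularity(G, split)          # Q value of the current division
    if q > best_q:
        best_q, best_split = q, split
    else:
        break                         # Q stopped improving: keep the previous split

print(round(best_q, 3))                 # modularity of the chosen division
print([sorted(c) for c in best_split])  # two communities of five nodes each
```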

 

During the internships, I combined a bit of machine-learning modeling with graph algorithms. Most of the project's features were built with knowledge-graph algorithms; combined with LR and XGBoost, the workhorses of financial risk control, detecting fraud becomes a relatively straightforward binary (0/1) classification task.
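
A rough sketch of that setup: hypothetical graph-derived features feeding a plain 0/1 classifier (all data and feature names are invented; swapping xgboost.XGBClassifier in place of LogisticRegression is a one-line change):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical per-applicant features, e.g.
# [pagerank, degree_centrality, second_degree_blacklist_hits]
rng = np.random.default_rng(0)
X = rng.random((1000, 3))
y = (X[:, 2] + 0.2 * rng.random(1000) > 0.9).astype(int)   # toy fraud label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))  # holdout AUC
```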

Here are some tricks involved in the modeling process.

1. Building a network graph with networkx

networkx, which appeared in May 2002, is a graph-theory and complex-network modeling tool developed in Python. It has common graph and complex-network analysis algorithms built in, and makes it easy to analyze complex network data and do simulation modeling.

networkx supports creating simple undirected graphs, directed graphs, and multigraphs; it has many standard graph algorithms built in; nodes can be any data and edges can carry arbitrary data; it is feature-rich and easy to use.

networkx can store network data in standardized and non-standardized formats, generate classical networks and a variety of random networks, analyze network structure, build network models, design new network algorithms, draw networks, and so on.

A graph is a mathematical model that depicts a set of discrete things, each pair linked in some way, using dots and lines.

Graphs and networks form an important field with quite a few concepts and definitions, such as directed graphs and networks (Directed Graphs and Networks), undirected graphs and networks, and so on.

Graphs are everywhere in the real world: transportation maps, tourist maps, flowcharts, and so on. Here we consider only graphs made up of points and the lines between them.

A graph can describe many things in real life: points can represent intersections, and the lines connecting points can represent roads, which makes it easy to depict a transportation network.

Definition of a graph: a Graph contains a set of nodes and a set of edges.

In NetworkX, a node can be any hashable object (any object except None), and an edge can be associated with any object: a text string, an image, an XML object, even another Graph or a custom node object.

Note: the Python object None cannot be used as a node.

Nodes and edges can store attributes in rich dictionaries, along with data of any other type.

  • Graph: an undirected graph (undirected Graph), ignoring the direction of edges between two nodes.
  • DiGraph: a directed graph (directed Graph), i.e., edges are treated as directional.
  • MultiGraph: an undirected multigraph, i.e., more than one edge is allowed between two nodes, including edges connecting a vertex to itself.
  • MultiDiGraph: the directed version of a multigraph.

https://www.cnblogs.com/minglex/p/9205160.html
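
A quick sketch of the four classes and of attribute storage (toy identifiers):

```python
import networkx as nx

G = nx.Graph()                       # undirected graph
G.add_node("app_1", amount=5000)     # nodes carry attribute dictionaries
G.add_edge("app_1", "phone_A", relation="contact")   # so do edges

D = nx.DiGraph()                     # directed graph
D.add_edge("a", "b")                 # edge a -> b only

M = nx.MultiGraph()                  # undirected multigraph: parallel edges
M.add_edge("a", "b", key="call")
M.add_edge("a", "b", key="sms")      # MultiDiGraph is the directed version

print(G.nodes(data=True))            # [('app_1', {'amount': 5000}), ('phone_A', {})]
print(M.number_of_edges("a", "b"))   # 2
```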

2. Various features of the network graph (these too can be built with NetworkX)

pagerank

https://www.cnblogs.com/jpcflyer/p/11180263.html

Authority & Hub

An Authority page is a high-quality page related to some topic or field; a Hub page is a page containing many links that point to high-quality Authority pages. The hao123 homepage, for instance, is a typical high-quality Hub page.

Degree centrality

The greater a node's degree, the higher its degree centrality, and the more important the node is in the network.
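
A small sketch computing the kinds of features above on a stock example graph (bundling them into one tuple per node is my own choice):

```python
import networkx as nx

G = nx.karate_club_graph()            # classic toy network as stand-in data

dc = nx.degree_centrality(G)          # degree / (n - 1)
hubs, auths = nx.hits(G)              # Hub and Authority scores
pr = nx.pagerank(G)

# One feature row per node, ready for a rule system or downstream model.
features = {n: (dc[n], hubs[n], auths[n], pr[n]) for n in G}
print(features[0])
```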

3. DeepWalk: encoding features for each node in the network graph

Solving problems with machine-learning algorithms requires a lot of information, but in real-world networks the available information is often rather sparse, which keeps traditional machine-learning algorithms from being applied widely to networks. (Ps: traditional machine-learning classification assumes a mapping from a sample's attributes to its class label, but nodes in real networks often carry little attribute information, so conventional methods do not fit networks.)

DeepWalk is a fairly basic algorithm in network representation learning: it learns a vector representation for each network vertex (that is, it learns feature attributes from the graph structure, the number of attributes being the vector's dimensionality), so that traditional machine-learning algorithms can be applied to solve the problem.

  • innovation:

Modeled on the word2vec language model, skip-gram is used to learn node vector representations. A network node is treated as a word in the language model, and node sequences (obtained by random walks) simulate sentences in the language, serving as the input to skip-gram.

  • feasibility:

    Proof of feasibility for the above assumption: when the degrees of the graph's nodes follow a power-law distribution (colloquially, there are few high-degree nodes and many low-degree ones), the frequency with which vertices appear in short random walks also follows a power-law distribution (i.e., many nodes appear with low frequency); since word frequencies in natural language follow a similar distribution, the assumption is workable. (Ps: to demonstrate validity, the authors studied the YouTube social network and Wikipedia articles, comparing the frequency of nodes in short random walks with the frequency of words in the articles, and found the two essentially alike; both follow a power-law distribution.)

  • process:

    Random walks + the skip-gram language model

    Short node sequences are obtained by random walks, and the node vector representations are updated via skip-gram.

  • Random Walk

    Random Walk obtains local network information from truncated random-walk sequences and uses it to learn node vector representations.

    The walks in DeepWalk are completely random. By varying the random-walk strategy, the node2vec algorithm was later derived, fixing the problem that the node similarity defined by DeepWalk does not reflect the original network structure well.

  • skip-gram language model

    Skip-gram is a model that uses a word to predict its context, learning vector representations by maximizing the co-occurrence probability of words within a window. Extended here, a node is used to predict its context, and the order in which nodes appear in the sequence is ignored; nodes with the same context get similar representations. (Ps: the more often two nodes appear together in a sequence, the more similar the two nodes are.)

    Node similarity measure: the degree of context similarity (the second-order similarity in LINE).

    By the independence assumption, the co-occurrence probability can be rewritten as a product of conditional probabilities (a sketch of the formulation appears after this list).

    For each vertex in the sequence, compute the conditional probability, i.e., the log-probability of the other nodes in the sequence appearing given this node, and update the node's vector representation via stochastic gradient descent.

    Φ(v_j) is the vector representation of the current node. Hierarchical Softmax is used to factor and speed up the computation of the conditional probability.
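
To make the skip-gram paragraphs above concrete, here is a sketch of the factorization as it is usually stated for DeepWalk (window size $w$; $\Phi$ maps a vertex to its vector):

$$
\Pr\big(\{v_{i-w},\dots,v_{i+w}\}\setminus v_i \mid \Phi(v_i)\big)=\prod_{\substack{j=i-w \\ j\neq i}}^{i+w}\Pr\big(v_j \mid \Phi(v_i)\big)
$$

Skip-gram then minimizes the negative log of this product by stochastic gradient descent. Below is a minimal runnable DeepWalk sketch on a toy graph; the walk count, walk length, and embedding size are arbitrary choices, and it assumes gensim >= 4 (whose Word2Vec takes vector_size):

```python
import random
import networkx as nx
from gensim.models import Word2Vec   # assumes gensim >= 4

def random_walk(g, start, length=10):
    """Truncated uniform random walk starting at `start`."""
    walk = [start]
    for _ in range(length - 1):
        nbrs = list(g.neighbors(walk[-1]))
        if not nbrs:
            break
        walk.append(random.choice(nbrs))
    return [str(n) for n in walk]     # Word2Vec expects string tokens

G = nx.karate_club_graph()            # toy stand-in for a relationship network
walks = [random_walk(G, n) for n in G for _ in range(20)]   # 20 walks per node

# sg=1 selects skip-gram; hs=1 selects Hierarchical Softmax, as in the paper.
model = Word2Vec(walks, vector_size=64, window=5, min_count=0, sg=1, hs=1)
print(model.wv["0"][:5])              # first components of node 0's embedding
```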

4. sklearn: CountVectorizer text feature extraction

CountVectorizer is one of the common feature-counting classes, a text feature extraction method. For each training text, it considers only the frequency with which each word appears in that text.

CountVectorizer converts the words in a text into a term-frequency matrix, counting each word's occurrences via the fit_transform function.

https://blog.csdn.net/pit3369/article/details/95643392
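
A minimal usage sketch (toy corpus; get_feature_names_out assumes scikit-learn >= 1.0):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the loan was repaid on time",
          "the loan was not repaid not even partially"]
vec = CountVectorizer()
X = vec.fit_transform(corpus)        # learn the vocabulary and count words

print(vec.get_feature_names_out())   # the vocabulary, one column per word
print(X.toarray())                   # term-frequency matrix, one row per text
```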

5. Sample imbalance

In samples collected from the real world, imbalance between positive and negative classes is a very common problem. A classifier can often reach nearly 90% Accuracy while its Recall on the minority class is only around 10%, which is very bad for correctly picking out minority-class samples.

For example, in a new-user promotion campaign, when predicting whether a user will register, non-registering users are usually the overwhelming majority; the positive-to-negative ratio is commonly 1:99 or even more extreme. In general, once the positive-to-negative ratio exceeds 1:3, the classifier already leans toward the negative class (negative-class Recall gets too high while positive-class Recall is low, yet overall Accuracy still looks fine). In that situation we can call the classifier a failure, because it cannot locate the positive-class population we are after.

https://zhuanlan.zhihu.com/p/28850865
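
One simple, commonly used mitigation is re-weighting classes in the loss; a minimal sketch with invented 1:99 data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Toy data with roughly a 1:99 positive-to-negative ratio.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
signal = X[:, 0] + 0.5 * rng.normal(size=10_000)
y = (signal > np.quantile(signal, 0.99)).astype(int)

# class_weight='balanced' re-weights samples inversely to class frequency,
# trading a little overall Accuracy for much better positive-class Recall.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
print(classification_report(y, clf.predict(X), digits=3))
```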

 

A: Credit Assessment

Industry people often speak of the A card, B card, and C card. The A card is the application scorecard: it steps in at application time and decides whether to lend. The B card is the in-loan behavior scorecard, which monitors your credit status and decides whether to raise your limit, or whether to cut off your loan. The C card is the post-loan scorecard, generally of three kinds: the roll-rate (delinquency migration) model, the repayment-rate model, and the lost-contact early-warning model.
Roll rate: predicts whether your delinquency will migrate from M1 to M2.
Lost-contact early-warning model: for banks and lenders, the fear is sometimes not that you won't repay; if you go overdue they can still earn penalty interest. What they fear more is the customer going silent and disappearing entirely, so a lost-contact warning model is built to estimate whether you might become unreachable.
Repayment-rate model: note that this model does not predict whether you will repay, but the probability that you repay after future collection actions. Post-loan collection costs manpower and money; if a light touch (a text message, a phone call) gets someone to pay, there is no need to send the gold-chain-wearing collectors to the door. To balance costs and design different collection playbooks, this is the model that does the work.

The data sources are mainly telecom-operator data, event-tracking data, online data, credit-bureau data, offline submitted data, and so on.

Traditional financial risk control mainly uses financial data with strong credit attributes, generally around 20 data dimensions, and scores them to identify a customer's repayment ability and willingness. Roughly ten dimensions correlate strongly with credit, including age, occupation, income, education, employer, borrowing status, real estate, cars, work unit, and repayment records. Financial firms score the data the user submits, arrive at the applicant's credit score, and decide from the score whether to lend and how much. Other credit-related data include region, product, wealth-management habits, industry, payment method, payment records, amount, time, frequency, and so on.

Big-data risk control in internet finance does not completely replace traditional risk control; in practice it enriches traditional risk control's data dimensions. Internet risk control still first uses financial data with strong credit attributes to judge a borrower's repayment ability and willingness, then supplements it with behavioral data whose credit attributes are weaker, generally using correlation analysis of the data to judge the borrower's credit situation, with data models revealing the relationship between certain behavioral features and credit risk.

When internet finance companies use big data for risk control, they identify borrower risk from multi-dimensional data. The more credit-related data goes into assessing a borrower's risk, the more fully that risk is revealed, the more objective the credit score becomes, and the closer it gets to the borrower's actual risk. Common big-data risk-control methods in internet finance include the following:

1. Verifying the borrower's identity

The five factors for verifying a borrower's identity are name, mobile number, ID number, bank-card number, and home address. A company can use GuoZhengTong data to verify the name and ID number, UnionPay data to verify the bank-card number and name, and telecom-operator data to verify the mobile number, name, ID number, and home address.

If the borrower is a fraudster, all five of these can be bought. That is when face recognition is needed: the idea is to call the GuoZhengTong / public-security API and match a photo or video the applicant shoots in real time against the ID photo on file with the police, using face recognition to verify that the applicant is the borrower in person.

Other ways of verifying customers include having them show credit cards and transaction records from other banks, or verifying their diplomas and identity credentials.

2. Analyzing submitted information to identify fraud

Most loan applications have moved from offline to online; in internet finance especially, consumer loans and student loans are applied for mainly online.

In an online application, the applicant fills in multi-dimensional information as the lender requires, such as household-registration address, residential address, employer, work phone, employer name, and so on. A fraudulent user's entries tend to show patterns, and a company can identify fraud from anomalous entries: the same residential-community name filled in across different cities; the same work phone for different employers in different cities; the same street address for different employers; the same employer name; even the same floor and unit number. Some fill in fake communities, addresses, employer names, and phone numbers.

If a company spots repeated information and phone numbers, the probability that the applicant is a fraudster is high.

3. Analyzing online application behavior to identify fraud

Fraudulent users usually prepare the basic user information in advance and, during the application, fill it in rapidly and work in batches, applying on many websites and raising their application volume to obtain more loans.

A company can use an SDK or JS to collect the applicant's behavior at every step, timing how long the customer spends reading the terms, filling in information, applying for the loan, and so on. Application times far shorter than a normal customer's are a red flag: say, filling in the address in under 2 seconds, reading the terms in under 3 seconds, or completing the application in under 20 seconds. The time of day also matters: applicants who apply after 11 p.m. generally show higher fraud and default rates.

Such anomalous application behavior may reveal a tendency toward fraud; a company can combine it with other information to judge whether the customer is fraudulent.

4. Using blacklists and greylists to identify risk

The main risk internet finance companies face is deliberate fraud: around 70% of credit losses come from applicants' deliberate fraud. Of overdue or defaulted loans, at least around 30% can be recovered, and some of the rest can be pursued through collection agencies; the recovery rate on M2 delinquency is around 20%.

Close to a hundred companies in the market do work related to personal credit reporting; their main business models are fraud identification, greylist identification, and customer credit scoring. In fraud identification, one important reference is the blacklist. The leading big-data risk-control companies hold nearly 10 million blacklist entries, most of them deadbeat lists accumulated over the past decade or more; the genuinely valuable entries number around two million.

Blacklists come from the historical defaulters of private lending, online P2P, credit-card companies, micro-lenders, and the like; a large share of those people no longer borrow, so their reference value is limited. The other main source is collection agencies: collection success rates generally run below 30% (for M3 and beyond), which yields many blacklist entries.

A greylist holds customers who are overdue but have not yet defaulted (overdue less than 3 months). A greylist also signals multi-platform borrowing: the applicant borrows on several lending platforms, with total borrowing far beyond their repayment capacity.

Blacklists and greylists are good risk-control tools, but each credit-reporting company's list covers only part of the market total, so many internet finance companies have to plug into multiple risk-control vendors to get more blacklist entries and raise the hit rate. The central bank and the Shanghai Commission of Economy and Informatization are working with a number of internet finance companies to build a unified blacklist platform, but many firms are reluctant to contribute their own blacklists: those lists are lessons bought with real money. Besides, if outsiders learned how large a platform's blacklist is, it would hurt the company's reputation, lower its valuation, and make investors question the platform's risk-control capability.

5. Using mobile-device data to identify fraud

A special kind of behavioral data is mobile-device anti-fraud data. A company can use a device's location data to verify whether the workplace and residence the customer submitted are real, and can spot multi-platform borrowing risk from the activity of the apps installed on the device.

Fraudulent users generally apply for loans with emulators, and mobile big data can detect whether a borrower is using one. Fraudulent users also show some typical traits, for example many devices clustered in one area all applying for loans together. Fraud devices install no everyday or utility apps, only loan-related ones, and may also carry password-cracking or other malicious software.


Fraudulent users may also keep swapping SIM cards and phones; the binding time and frequency between SIM card and phone can flag some of them. Fraudsters also buy long-obsolete phones whose operating systems are badly outdated and whose installed apps are all old versions. These traits can identify some fraudulent users.

6. Using consumption records for scoring

Besides spotting bad actors, big-data risk control can also assess a borrower's repayment ability. Traditional finance judged repayment ability from the borrower's income, but some customers have income beyond wages, such as investment income or consulting fees. Others may get financial support from parents, partners, or friends, giving them higher payment capacity.

By the traditional playbook, a housewife who stays home to look after the family might be judged to have weak repayment ability and be denied a loan, yet her husband may earn a high income, with the household's daily spending under her control. In such cases, consumption data is what demonstrates her repayment ability.

Commonly used consumption records include bank-card spending, e-commerce purchases, utility-payment records, and big-ticket purchases. Flight records, mobile-phone bills, and special memberships can also be referenced: counts of first-class flights, the level of property-management fees, golf-club spending, yacht-club membership fees, luxury-brand memberships, luxury-car 4S-store purchase records, and similar consumption data can serve as important references in credit scoring.

Internet finance's main customers are the average-Joe ("diaosi") demographic, whose e-commerce, travel, and refueling records can all serve as grounds for assessing their credit. Some internet finance companies specialize in analyzing personal e-commerce consumption data: once the customer authorizes a login to the e-commerce site, tools can scrape the customer's entire purchase history and aggregate and score it.

7. Using social relationships to assess credit

Birds of a feather flock together. In general, the friends of someone with good credit also have good credit, and the friends of someone with bad credit also have low credit scores.

A borrower's credit can be evaluated by referencing the credit scores of the friends they contact regularly, usually sampling the people they phone most often. The credit scores of the few most frequent contacts (no more than six) are assessed; the highest and the lowest scores are dropped, and the average of the rest is taken to judge the borrower's credit. This approach is quite challenging: judging personal credit from phone numbers alone is not very trustworthy. It is generally used only for fraud identification, matching the frequently dialed numbers against the blacklist library; a hit means the applicant is higher-risk and warrants further investigation.
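
A tiny sketch of that trimmed-mean rule (the sample scores are invented):

```python
def borrower_score_from_contacts(friend_scores):
    """Average the scores of the most frequent contacts (at most six),
    dropping one highest and one lowest, as described above."""
    s = sorted(friend_scores[:6])
    trimmed = s[1:-1] if len(s) > 2 else s   # need 3+ scores to trim both ends
    return sum(trimmed) / len(trimmed)

print(borrower_score_from_contacts([620, 700, 480, 655, 710, 690]))  # 666.25
```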

8. Using the borrower's social attributes and behavior to assess credit

Experience from past internet-finance risk control shows that borrowers with a partner and children have lower default rates;

older borrowers default more than younger ones, with those around 50 defaulting the most and those around 30 the least; borrowers whose loans go to household consumption and education have low default rates;

people who declare monthly income above 30,000 default more than those who declare below 15,000;

people who have borrowed many times default less than first-time borrowers.

People who habitually skip utility and property-management payments have higher default rates.

People who change jobs frequently and have unstable income have higher default rates.

People who often take part in public-welfare activities and hold memberships in various organizations have low default rates.

People who change mobile numbers frequently default far more than those who stick to one number.

Applicants who are often online at midnight, post to Weibo very late, keep irregular hours, and constantly move between cities default about 30% more than others.

People who deliberately conceal their past and contact details, filling in only sparse information, default about 20% more than those who fill in rich information.

People who borrow over long terms have roughly 20% higher overdue and default probability than short-term borrowers. Borrowers who own a car default about 10% less than those who do not.

9. Using judicial information to assess risk

People involved with drugs or gambling, or who have public-security penalties on record, tend not to have good credit, especially gambling- and drug-involved individuals. They are a high-risk population: once they obtain a loan, its use is uncontrollable and the loan may never be repaid.

To find suspects involved with drugs and gambling, local public-security data can be used, though that is difficult. Mobile-device location data can also identify them to a degree: if a device often turns up at gambling venues, or in gambling regions such as Macau, in the middle of the night, the applicant's gambling risk is higher. Certain regions of China also have parts of the local population working in gambling-related trades; when an applicant's stated address or device location falls in such regions, that deserves attention. People involved with gambling or drugs also tend to lack stable work or fixed income, so frequent job changes or stretches with no income deserve attention. Their activity patterns are distinctive too, frequently out in the small hours and frequently staying in local hotels, all of which can be identified with reference to mobile big data.

In sum, big-data risk control in internet finance draws on users' social behavior and social-attribute data, making up to some extent for the thin data dimensions of traditional risk control, so fraudulent customers can be identified and customers' risk levels evaluated more comprehensively. Internet finance companies control credit risk by analyzing applicants' social behavior data, lending to qualified borrowers and keeping the funds safe.

 

 
