Python case study: analyzing fraudulent user behavior in financial marketing campaigns

This afternoon I worked through a Python application case: analyzing fraudulent user behavior in the marketing campaigns of a financial platform. The data comes from a contest on the DC competition platform (https://www.dcjingsai.com), 2018年甜橙金融杯大数据建模大赛; the detailed field descriptions are available on the contest page. In the tag table, 1 marks a "wool party" user (羊毛党, someone who farms promotional rewards) and 0 marks a normal user.

First, data import

import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
from pymining import itemmining, assocrules, perftesting, seqmining
import pyecharts as pe

# Transaction records, operation records, and user labels
rt = pd.read_csv(r"E:\transaction_train_new.csv", sep=",")
ro = pd.read_csv(r"E:\operation_train_new.csv", sep=",")
rtt = pd.read_csv(r"E:\tag_train_new.csv", sep=",")

Then, data preprocessing and simple statistics

# Data preprocessing and simple statistics
rt = pd.merge(rt, rtt)
ro = pd.merge(ro, rtt)  # attach the labels to both behavior tables for later analysis
z1 = ro.day.astype(str)
z2 = rt.day.astype(str)  # convert the day column to strings
ro["time"] = pd.to_datetime("2018-01-" + z1 + " " + ro.time)
rt["time"] = pd.to_datetime("2018-01-" + z2 + " " + rt.time)  # combine day and time of day into a standard datetime
# Count total users, wool-party users, transaction records, and operation records
print(len(rtt.UID.values), len(rtt[rtt.Tag == 1].UID.values), len(rt), len(ro))
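The merge and timestamp steps can be tried on a small scale first. The sketch below uses invented toy tables (`tags`, `ops`) in place of the contest files and assumes the `time` column holds `HH:MM:SS` strings; it also zero-pads the day and passes an explicit format, which is a precaution not present in the code above.

```python
import pandas as pd

# Toy stand-ins for the contest tables; all values are invented for illustration.
tags = pd.DataFrame({"UID": [1, 2, 3], "Tag": [0, 1, 0]})
ops = pd.DataFrame({
    "UID": [1, 2, 2, 3],
    "day": [1, 1, 2, 15],
    "time": ["08:30:00", "23:59:59", "00:00:01", "12:00:00"],
})

# Inner join on the shared UID column attaches each user's label to every row.
ops = pd.merge(ops, tags, on="UID")

# Build a full timestamp from the day of month and the time of day.
day_str = ops["day"].astype(str).str.zfill(2)
ops["time"] = pd.to_datetime("2018-01-" + day_str + " " + ops["time"],
                             format="%Y-%m-%d %H:%M:%S")

print(ops)
print(len(tags), len(tags[tags.Tag == 1]), len(ops))
```

After this step every row carries both a `Tag` and a proper datetime, so the later `groupby` calls can split by label directly.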

Second, analysis of the general characteristics of fraudulent users.

Two aspects are analyzed. 1. Wool-party users usually perform few general operations but many transactions. 2. Wool-party users usually share bank accounts, devices, and other information across several user IDs.

1. Behavior timing analysis

plt.plot(ro[ro.Tag == 1].groupby("day").size())
plt.plot(ro[ro.Tag == 0].groupby("day").size())  # daily operation counts for the two user types

In the figures, the blue line is the wool party and the yellow line is ordinary users. Wool-party users perform fewer general operations and more transactions; that is, they try to get as much benefit as possible at the lowest possible cost.

plt.plot(rt[rt.Tag == 1].groupby("day").size())
plt.plot(rt[rt.Tag == 0].groupby("day").size())  # daily transaction counts for the two user types
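Raw daily counts also depend on how many users each group contains, so a per-user rate can make the comparison easier to read. The following is only a sketch on invented toy data; the helper `per_user_rate` and the `txn_to_ops` ratio are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Invented toy records: one row per operation / transaction, labelled by Tag.
ops = pd.DataFrame({"UID": [1, 1, 1, 2, 3, 3, 3, 3],
                    "Tag": [0, 0, 0, 1, 0, 0, 0, 0]})
txn = pd.DataFrame({"UID": [1, 2, 2, 2, 3],
                    "Tag": [0, 1, 1, 1, 0]})

def per_user_rate(df):
    """Average number of rows per distinct user, split by Tag."""
    return df.groupby("Tag").size() / df.groupby("Tag")["UID"].nunique()

summary = pd.DataFrame({
    "ops_per_user": per_user_rate(ops),
    "txn_per_user": per_user_rate(txn),
})
# A high ratio means many transactions relative to general operations.
summary["txn_to_ops"] = summary["txn_per_user"] / summary["ops_per_user"]
print(summary)
```

On the real tables the same function could be applied to `ro` and `rt` after the labels have been merged in.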

2. Wool-party multi-account behavior analysis

# Wool-party users run multiple accounts: several user IDs share one bank account, device, or phone
def cl(x):
    return set(x.UID.values)
z2 = rt[rt.acc_id2.notnull()].groupby("acc_id2").apply(cl)  # set of users seen under each transfer account acc_id2
p2 = Counter(z2.apply(len).values)  # how many acc_id2 values are shared by each number of users
plt.loglog(list(p2.keys()), list(p2.values()), "o")  # roughly a power-law curve; the data is clearly anomalous
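The counting step can be checked without plotting. This sketch uses an invented toy table and `groupby(...)["UID"].nunique()`, which is assumed here as a shorter equivalent of collecting a set per group and taking its length.

```python
import pandas as pd
from collections import Counter

# Invented toy transactions; None means the account field is missing.
txn = pd.DataFrame({
    "UID":     [1, 2, 3, 4, 4, 5, 6],
    "acc_id2": ["a", "a", "a", "b", "b", "c", None],
})

# Distinct users per account; groupby drops rows whose key is missing.
users_per_account = txn.groupby("acc_id2")["UID"].nunique()

# How many accounts are shared by exactly k users.
size_distribution = Counter(users_per_account.tolist())
print(users_per_account.to_dict())
print(dict(size_distribution))
```

Plotting the keys and values of `size_distribution` on log-log axes gives the same kind of picture as the `plt.loglog` call above.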

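# Collect the user IDs under every acc_id2 shared by more than 3 users; these are suspected wool-party IDs
z4 = set()
for i in z2.values:
    if len(i) > 3:
        z4 = z4 | i
z5 = set(rt[rt.Tag == 1].UID.values)
print(len(z4), len(z5), len(z4 & z5))  # predicted wool-party count, actual wool-party count, correctly predicted count
845 3993 725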
# Likewise, group users by any column x0; groups with more than x1 users are suspected wool-party users
def u1(x0, x1):
    def cl(x):
        return set(x.UID.values)
    z2 = rt.groupby(x0).apply(cl)
    p2 = Counter(z2.apply(len).values)
    plt.loglog(list(p2.keys()), list(p2.values()), "o")
    z4 = set()
    for i in z2.values:
        if len(i) > x1:
            z4 = z4 | i
    return [z4, len(z4), len(z5), len(z4 & z5)]

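The function u1(x0, x1) analyzes how users share the same value of column x0: when more than x1 users share one value, all of them are treated as wool-party users.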
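The same rule can be written without the plotting side effect and without relying on a global label set. The function name `flag_shared` and the toy data below are hypothetical, chosen only to illustrate the thresholding step.

```python
import pandas as pd

def flag_shared(df, column, threshold):
    """Return the UIDs whose value in `column` is shared by more than `threshold` users."""
    users_per_value = df.groupby(column)["UID"].nunique()
    shared_values = users_per_value[users_per_value > threshold].index
    return set(df.loc[df[column].isin(shared_values), "UID"])

# Invented toy data: device "d1" is used by three different users.
txn = pd.DataFrame({
    "UID":          [1, 2, 3, 4, 5],
    "device_code1": ["d1", "d1", "d1", "d2", "d3"],
})
suspects = flag_shared(txn, "device_code1", 2)
print(suspects)
```

Returning only the set keeps the rule separate from its evaluation, so the flagged set can be compared with the labelled users in a later step.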

Wool-party behavior is analyzed under "acc_id1", "acc_id2", "acc_id3", "device_code1", and "device_code2".

y1 = u1("acc_id1", 3)  # more than 3 users sharing one acc_id1
y1[1:4]
[845, 3993, 725]
y2 = u1("acc_id2", 2)
y2[1:4]
[333, 3993, 322]
y3 = u1("acc_id3", 3)  # same rule applied to acc_id3
y3[1:4]
[298, 3993, 287]
de1 = u1("device_code1", 4)  # more than 4 users on the same device code: suspected wool party
de1[1:4]
[1338, 3993, 809]
de2 = u1("device_code2", 4)  # same rule applied to device_code2
de2[1:4]
[1023, 3993, 805]

Finally, combine the screening results of these 5 indicators.

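w = y1[0] | y2[0] | y3[0] | de1[0] | de2[0]  # union of all flagged users
print(len(w), len(z5), len(w & z5))
f0 = len(w & z5) / len(w)   # precision
f1 = len(w & z5) / len(z5)  # recall
f2 = f0 * f1 * 2 / (f0 + f1)  # F1 score
print(f0, f1, f2)  # these simple conditions alone already reach an F1 above 0.4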

1967 3993 1282
0.6517539400101677 0.3210618582519409 0.4302013422818792
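The three printed values are precision, recall, and the F1 score of the combined rule. As a quick check, they can be recomputed from the three counts alone; the helper name `precision_recall_f1` is introduced here only for illustration.

```python
def precision_recall_f1(n_flagged, n_actual, n_correct):
    """Precision, recall, and F1 from the sizes of the flagged, actual, and overlapping sets."""
    precision = n_correct / n_flagged
    recall = n_correct / n_actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts taken from the output above: flagged, actual, correctly flagged.
p, r, f = precision_recall_f1(1967, 3993, 1282)
print(round(p, 4), round(r, 4), round(f, 4))  # 0.6518 0.3211 0.4302
```

The rule is fairly precise but misses about two thirds of the labelled users, which is why the F1 score stays near 0.43.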

 


Origin www.cnblogs.com/dahongbao/p/11073697.html