Analysis of Credit Card Fraud

1. Projects

The dataset contains 284,807 transactions a year in September, European users place in two days, including 492 fraud. Exploration project through descriptive analysis of fraud-related features and patterns, and then by machine learning algorithms to create predictive models, parameter adjustment, and select a model by confusion matrix and other methods.

2. Data Cleansing

2.1 Importing Data

Library (data.table) # fread () 

# Loading data 
Cards <-fread ( " D: /R/practise/....../creditcard.csv " ) # raw data set has 143MB, operating data packet data.table, function fread reading speed is nearly 10 times the native function.

2.2 Overview of Data

View the data overall, variable type, missing values, outliers, distribution type, etc.

  library(VIM) # aggr()

# View the data overall situation, establish a data dictionary 
str (Cards) the Summary (Cards)
aggr (Cards) # visualization of missing values # other ways to determine missing values: the Table ( IS .Na (Cards)) # View sample data distribution prop.table ( table (cards $ Class)) # positive cases only of 0. The 001 727 486 qplot (X = Cards $ Class, Data = Cards, GEOM = " bar " , Fill = Cards $ Class) # visualization

Data dictionary

Missing Values: None

Sample Distribution: 0.001727486 only positive class, the proportion is very low, whereby trained model, insensitive to the positive class (fraud) case, would prefer to identify normal trading. (See later) we need to make adjustments to the training set time in the training model

2.3 Data Processing

Shaping second time conversion data # - hours 
Cards $ Time_new <-Round (Cards $ Time / 3600 , 0 ) # time in seconds is converted to units of hours 
Cards $ Time <- NULL original time variable data set # longer valid, it can be deleted 
Cards <-cards%>% SELECT (class, Everything ()) # class variable is adjusted to the first column
 class (class Cards $) 
Cards $ class <- AS .factor (Cards $ class) variable # class conversion factor of type 

# resolve imbalances data

cards_0 <-cards [cards $ Class == '0'] # normal non-fraudulent transactions
cards_1 <-cards [cards $ Class == '1'] # Fraud Case

index1<-sample(1:nrow(cards_0),size = nrow(cards_1))
# 合并数据集
cards_new<-rbind(cards_1,cards_0[index1])
# 归一化处理

library(caret)

standard <- preProcess(cards_new, method = 'range') # caret包中函数,range表示区间为[-1:1]
card_new2 <- predict(standard, cards_new)
traindata_2 <- card_new2[index2, ] # index2 见52行
testdata_2 <- card_new2[-index2, ]

2.4 数据划分

# 对新数据集进行做训练集、测试集的划分
index2<-createDataPartition(cards_new$Class,times = 1,p=.7,list = F)
train_cards<-cards_new[index2,] # 训练集
test_cards<-cards_new[-index2,] # 测试集

3.数据探索——描述性统计分析

数据集中第2列到第28列都是PCA数据,无法做有效描述性统计分析,故这里主要基于时间、金额等进行分析。

# 诈骗案发生时间、金额分布
p1<-ggplot(cards_1,aes(x=factor(Time_new),fill=factor(Time_new)))+
  geom_bar(stat = 'count')+
  labs(title = 'Time Distribution of\nFraud Cases',x='Time',y='Count')+
  theme_classic()+
  theme(legend.position='none')

# 计算每小时诈骗案涉及金额
# cards_1$Time_new<-as.factor(as.character(cards_1$Time_new))
Amount_money<-rowsum(cards_1$Amount,group=cards_1$Time_new) # 数据准备
Amount_c<-Amount_money[,1]
Time_c<-0:47
DF_money<-data.frame(Time_c,Amount_c)
p2<-ggplot(DF_money,aes(x=factor(Time_c),y=Amount_c,fill=Time_c))+
  geom_bar(stat = 'identity')+
  labs(title = 'Distribution bar chart of \nfraud amount',x='Time')+
  theme(legend.position='none')
library(Rmisc)
multiplot(p1,p2,cols = 1)

描述性分析结论:

从总趋势说,第一天多余第二天(但因为没有更多的数据,更长的时间轴,这一点无法得出有效结论);

整体看每小时平均诈骗数量在10次以下,涉案金额在500欧元左右;

从时间上说,诈骗发生次数在一天中的后半夜居多;

从涉案金额来说,而平均诈骗金额基本与次数相对应。其中第二天凌晨两点发生次数最多,30余起,对应的金额也是最大,超过3000欧元。

4.数据建模

4.1 随机森林模型

# 使用五折交叉验证方法建立随机森林模型
set.feed(1234) model_rf
<- train(Class ~., data = train_cards, method = 'rf', trControl = trainControl(method = 'cv', number = 5, selectionFunction = 'oneSE')) # 利用训练集建立随机森林模型 pred_model<-predict(model_rf,test_cards[,-1]) # 测试模型 confusionMatrix(data=pred_model,reference=test_cards$Class) # 准确率为94.9% plot(varImp(model_rf)) # 查看模型中变量的重要程度

查看模型:

查看模型中变量的重要程度:

 

 4.2 K近邻值算法(KNN算法)

results = c()      # 创建一个空向量
for(i in 1:10) {
  set.seed(1234)
  pred_knn <- knn(train=traindata_2[,2:30], test=testdata_2[,2:30], cl=traindata_2$Class, i)
  Table <- table(pred_knn, testdata_2$Class)
  accuracy <- sum(diag(Table))/sum(Table) # diag()提取对角线上的值
  results <- c(results, accuracy)
}

plot(x = 1:10, y = results, type = 'b', col = 'blue', xlab = 'k', ylab = 'accuracy')

# 根据图像k=4时,准确率最高。
pred_knn <- knn(train=traindata_2[,2:30], test=testdata_2[,2:30], cl=traindata_2$Class, 4)
confusionMatrix(pred_knn,testdata_2$Class) # 准确率为95.24%

 

 4.3 模型评估

 通过随机森林、k近邻算法分别得到model_rf、pred_knn两个模型,下面根据查全率、查准率和kappa三个维度评估模型。

 

5.结论

 模型pred_knn在查全率、查准率以及kappa三个评估维度上都高于model_rf,其中查准率达到95.24%,在134个诈骗案中,只有1个判断错误,可用性比较高。

 

 补充

为了方便描述,添加包没有放在文前

# 加载添加包
library(data.table) # fread()
library(VIM)        # aggr()
library(caret)      # 机器学习相关
library(dplyr)      # select()
library(ggplot2)    # ggplot/qplot
library(Rmisc)      # multiplot()
library(randomForest) # 随机森林
library(class)       # knn模型

 

Guess you like

Origin www.cnblogs.com/Grayling/p/11297576.html