Let's look at the data first:
- Time: number of seconds elapsed between each transaction (over two days)
- Amount: amount of money for this transaction
- Class: fraud or not-fraud
Introduction
From: https://www.kaggle.com/nikitaivanov/getting-high-sensitivity-for-imbalanced-data (mainly uses two approaches: SMOTE and clustering)
In this notebook we will try to predict fraud transactions from a given data set. Given that the data is imbalanced, standard metrics for evaluating classification algorithms (such as accuracy) are misleading. We will focus on the following metrics: sensitivity (true positive rate) and specificity (true negative rate). Of course, they depend on each other, so we want to find an optimal trade-off between them. Such a trade-off usually depends on the application of the algorithm, and in the case of fraud detection I would prefer high sensitivity (i.e., given that a transaction is fraudulent, I want to detect it with high probability).
For dealing with skewed data I am going to use the SMOTE algorithm. In short, the idea is to create synthetic samples (as opposed to oversampling with replacement) by finding a sample's nearest neighbours (KNN), calculating the difference between the sample and a neighbour, multiplying this difference by a random number between 0 and 1, and adding the result to the initial sample. For this purpose we are going to use the `SMOTE` function from the `DMwR` package.
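The interpolation step can be sketched in plain NumPy. The `smote_sketch` helper below and its parameters are illustrative, not the `DMwR` implementation:

```python
import numpy as np

def smote_sketch(X_minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by interpolating toward
    one of the k nearest minority neighbours (a minimal SMOTE sketch)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X))
        # distances from sample i to all minority samples
        d = np.linalg.norm(X - X[i], axis=1)
        # indices of the k nearest neighbours (excluding the sample itself)
        nn = np.argsort(d)[1:k + 1]
        j = rng.choice(nn)
        gap = rng.random()  # random number in [0, 1)
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)
```

Each synthetic point lies on the line segment between a minority sample and one of its nearest minority neighbours, which is what keeps SMOTE samples inside the minority region instead of merely duplicating existing points.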
The algorithms I am going to implement are Support Vector Machine (SVM), logistic regression and random forest. Models will be trained on the original and SMOTEd data, and their performance will be measured on the entire data set.
As a bonus, we are going to have some fun and use the K-means centroids of the negative examples together with the original positive examples as a new dataset, train our algorithms on it, and compare results.
## Loading required packages
```r
library(ggplot2)  # visualization
library(caret)    # model training
library(dplyr)    # data manipulation
library(kernlab)  # SVM
library(nnet)     # models (logit, neural nets)
library(DMwR)     # SMOTE
```

## Load data

```r
d = read.csv("../input/creditcard.csv")
n = ncol(d)
str(d)
d$Class = ifelse(d$Class == 0, 'No', 'Yes') %>% as.factor()
```
It is always a good idea first to plot a response variable to check for skewness in data:
```r
qplot(x = d$Class, geom = 'bar') + xlab('Fraud (Yes/No)') + ylab('Number of transactions')
```
Classification on the original data
Keeping in mind that the data is highly skewed we proceed. First split the data into training and test sets.
```r
idx = createDataPartition(d$Class, p = 0.7, list = F)
d[, -n] = scale(d[, -n])  # perform scaling
train = d[idx, ]
test = d[-idx, ]
```
Calculate baseline accuracy for future reference
```r
blacc = nrow(d[d$Class == 'No', ]) / nrow(d) * 100
cat('Baseline accuracy:', blacc)
```
To begin with, let's train our models on the original dataset to see what we get if we use unbalanced data. Due to the computational limitations of my laptop, I will only run logistic regression for this purpose.
```r
m1 = multinom(data = train, Class ~ .)
p1 = predict(m1, test[, -n], type = 'class')
cat(' Accuracy of the model', mean(p1 == test[, n]) * 100, '\n',
    'Baseline accuracy', blacc)
```
Though the model's accuracy (99.92%) might look impressive at first glance, in fact it isn't. Simply predicting 'not a fraud' for all transactions would give 99.83% accuracy. To really evaluate the model's performance we need to check the confusion matrix.
```r
confusionMatrix(p1, test[, n], positive = 'Yes')
```
From the confusion matrix we see that while the model has high accuracy (99.92%) and high specificity (99.98%), it has a low sensitivity of 64%. In other words, only 64% of all fraudulent transactions were detected.
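To see why accuracy alone is so misleading here, consider a quick sketch; the counts below are illustrative, chosen only to roughly match the ~0.17% fraud rate of this dataset:

```python
import numpy as np

# 10,000 transactions, 17 of them fraud (~0.17%, roughly this dataset's ratio)
y_true = np.zeros(10_000, dtype=int)
y_true[:17] = 1

# a useless model that always predicts "not fraud"
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()
print(accuracy)  # 0.9983 despite catching zero frauds
```

A classifier that never flags anything already scores 99.83%, so any meaningful comparison has to look at sensitivity and specificity instead.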
Classification on the SMOTEd data
Now let's preprocess our data using the SMOTE algorithm:
```r
table(d$Class)  # check initial distribution

newData <- SMOTE(Class ~ ., d, perc.over = 500, perc.under = 100)
table(newData$Class)  # check SMOTEd distribution
```
To train an SVM (with RBF kernel) we are going to use the `train` function from the `caret` package. It allows us to choose the optimal parameters of the model (cost and sigma in this case). Cost refers to the penalty for misclassifying examples, and sigma is a parameter of the RBF kernel which measures similarity between examples. To choose the best model we use 5-fold cross-validation. We then evaluate our model on the entire data set.
```r
gr = expand.grid(C = c(1, 50, 150), sigma = c(0.01, 0.05, 1))
tr = trainControl(method = 'cv', number = 5)
m2 = train(data = newData, Class ~ ., method = 'svmRadial',
           trControl = tr, tuneGrid = gr)
m2
```
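For reference, the same RBF-SVM tuning pattern can be sketched with scikit-learn's `GridSearchCV`. The toy dataset stands in for the SMOTEd data, and caret's `sigma` plays the role of scikit-learn's `gamma`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# toy data standing in for the SMOTEd training set
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.7, 0.3], random_state=0)

# same grid shape as the caret example: cost C and RBF width gamma
grid = {'C': [1, 50, 150], 'gamma': [0.01, 0.05, 1]}
search = GridSearchCV(SVC(kernel='rbf'), grid, cv=5)  # 5-fold CV, as in the text
search.fit(X, y)
print(search.best_params_)
```

As in caret, the grid search refits the best parameter combination on the full training data, so `search` can be used directly for prediction afterwards.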
As we see, the best tuning parameters are C = 50 and sigma = 0.05.
Let's look at a confusion matrix
```r
p2 = predict(m2, d[, -n])
confusionMatrix(p2, d[, n], positive = 'Yes')
```
(Numbers may differ due to randomness of k-fold cv)
As expected, we were able to achieve a sensitivity of 99.59%. In other words, out of all fraudulent transactions we correctly detected 99.59% of them. This came at the price of slightly lower accuracy (in comparison to the first model) - 97.95% vs. 99.92% - and lower specificity - 97.94% vs. 99.98%. The main disadvantage is the low positive predictive value (i.e., given that the prediction is positive, the probability that the true state is positive), which in this case is 7.74% vs. 85% for the initial (unbalanced-data) model. As mentioned in the beginning, one should choose a model that matches certain goals. If the goal is to correctly identify fraudulent transactions even at the price of a low positive predictive value (which I believe is the case here), then the latter model (based on SMOTEd data) should be used. Looking at the confusion matrix we see that almost all fraudulent transactions were correctly identified and only about 2.5% of legitimate transactions were mislabeled as fraudulent.
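The three metrics discussed above follow directly from the confusion-matrix counts, and a small helper makes the trade-off explicit. The counts below are illustrative, chosen to mimic the reported percentages, not the notebook's exact matrix:

```python
def rates(tn, fp, fn, tp):
    """Sensitivity, specificity and positive predictive value
    from the four cells of a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    ppv = tp / (tp + fp)          # positive predictive value
    return sensitivity, specificity, ppv

# illustrative counts: high sensitivity bought with many false positives
sens, spec, ppv = rates(tn=278000, fp=5800, fn=2, tp=490)
print(round(sens, 4), round(spec, 4), round(ppv, 4))
```

With ~5,800 false positives against only ~490 true positives, PPV collapses even though sensitivity and specificity both look excellent, which is exactly the pattern described above for the SMOTEd model.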
I'm planning to try a couple more models and also use a more sophisticated algorithm that uses the K-means centroids of the majority class as samples for non-fraudulent transactions.
```r
library(randomForest)

m3 = randomForest(data = newData, Class ~ .)
p3 = predict(m3, d[, -n])
confusionMatrix(p3, d[, n], positive = 'Yes')
```
Random forest performs really well: sensitivity of 100% and high specificity (more than 99%). All fraudulent transactions were detected and less than 1% of all transactions were falsely classified as fraud. Hence, random forest + SMOTE should be considered as the final model.
K-means centroids as a new sample
Out of curiosity, let's take another approach to dealing with imbalanced data. We are going to separate the positive and negative examples and, from the latter, extract centroids (generated using K-means clustering). The number of clusters will be equal to the number of positive examples. We then use these centroids together with the positive examples as a new sample. (The idea is to cluster the majority class into k points, where k is the number of fraud samples.)
```r
neg = d[d$Class == 'No', ]     # negative examples
pos = d[d$Class == 'Yes', ]    # positive examples
n_pos = sum(d$Class == 'Yes')  # number of positive examples
clus = kmeans(neg[, -n], centers = n_pos, iter.max = 100)  # perform K-means
neg = as.data.frame(clus$centers)  # extract centroids as new sample
neg$Class = 'No'
newData = rbind(neg, pos)  # merge positive and negative examples
newData$Class = factor(newData$Class)
```
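A rough Python analogue of this centroid step, on toy data, can be sketched with scikit-learn's `KMeans`; all names and sizes here are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_neg = rng.normal(0.0, 1.0, size=(200, 2))  # toy "normal" transactions
X_pos = rng.normal(3.0, 1.0, size=(10, 2))   # toy "fraud" transactions

# replace the majority class by as many centroids as there are positives
km = KMeans(n_clusters=len(X_pos), n_init=10, random_state=0).fit(X_neg)
X_new = np.vstack([km.cluster_centers_, X_pos])
y_new = np.array([0] * len(X_pos) + [1] * len(X_pos))
print(X_new.shape)  # (20, 2): a perfectly balanced training set
```

The centroids summarize the majority class, so the resulting training set is balanced by construction, at the cost of discarding most of the majority class's detail.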
We run random forest on the new dataset, `newData`, and check the confusion matrix.
```r
m4 = randomForest(data = newData, Class ~ .)
p4 = predict(m4, d[, -n])
confusionMatrix(p4, d[, n], positive = 'Yes')
```
Well, while sensitivity is still 100%, specificity dropped to 72%, leading to a large fraction of false positive predictions. Learning on the data transformed with the SMOTE algorithm gave much better results.
From: https://www.kaggle.com/themlguy/undersample-and-oversample-approach-explored
```python
# This Python 3 environment comes with many helpful analytics libraries installed.
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python

import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.
```
```python
import matplotlib.pyplot as plt
import seaborn as sns
# note: sklearn.cross_validation was removed; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (confusion_matrix, recall_score, precision_recall_curve,
                             auc, roc_curve, roc_auc_score, classification_report)
```
```python
creditcard_data = pd.read_csv("../input/creditcard.csv")
creditcard_data['Amount'] = StandardScaler().fit_transform(
    creditcard_data['Amount'].values.reshape(-1, 1))
creditcard_data.drop(['Time'], axis=1, inplace=True)
```
```python
def generatePerformanceReport(clf, X_train, y_train, X_test, y_test, bool_):
    if bool_ == True:  # fit only when the classifier has not been trained yet
        clf.fit(X_train, y_train.values.ravel())
    pred = clf.predict(X_test)
    cnf_matrix = confusion_matrix(y_test, pred)
    tn, fp, fn, tp = cnf_matrix.ravel()
    print('---------------------------------')
    print('Length of training data:', len(X_train))
    print('Length of test data:', len(X_test))
    print('---------------------------------')
    print('True positives:', tp)
    print('True negatives:', tn)
    print('False positives:', fp)
    print('False negatives:', fn)
    # sns.heatmap(cnf_matrix, cmap="coolwarm_r", annot=True, linewidths=0.5)
    print('----------------------Classification report--------------------------')
    print(classification_report(y_test, pred))
```
```python
# generate 50%, 66%, 75% proportions of normal indices to be combined with fraud indices
# (i.e., after undersampling, normal transactions make up 50%, 66% or 75% of the data)

# undersampled data
normal_indices = creditcard_data[creditcard_data['Class'] == 0].index
fraud_indices = creditcard_data[creditcard_data['Class'] == 1].index
for i in range(1, 4):
    # simple random undersampling: draw i * len(fraud_indices) normal indices without replacement
    normal_sampled_data = np.array(np.random.choice(normal_indices,
                                                    i * len(fraud_indices), replace=False))
    undersampled_indices = np.concatenate([fraud_indices, normal_sampled_data])
    undersampled_data = creditcard_data.loc[undersampled_indices]
    print('length of undersampled data ', len(undersampled_data))
    print('% of fraud transactions in undersampled data ',
          len(undersampled_data.loc[undersampled_data['Class'] == 1]) / len(undersampled_data))
    # get feature and label data
    feature_data = undersampled_data.loc[:, undersampled_data.columns != 'Class']
    label_data = undersampled_data.loc[:, undersampled_data.columns == 'Class']
    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.30)
    for j in [LogisticRegression(), SVC(), RandomForestClassifier(n_estimators=100)]:
        clf = j
        print(j)
        generatePerformanceReport(clf, X_train, y_train, X_test, y_test, True)
    # the above evaluates on X_test, which is part of the undersampled data;
    # now use the remaining rows of the dataset as a test set
    remaining_indices = [k for k in creditcard_data.index if k not in undersampled_data.index]
    testdf = creditcard_data.loc[remaining_indices]
    testdf_label = testdf.loc[:, testdf.columns == 'Class']
    testdf_feature = testdf.loc[:, testdf.columns != 'Class']
    generatePerformanceReport(clf, X_train, y_train, testdf_feature, testdf_label, False)
```
```python
# oversampled data
normal_sampled_indices = creditcard_data.loc[creditcard_data['Class'] == 0].index
oversampled_data = creditcard_data.loc[normal_sampled_indices]
fraud_data = creditcard_data.loc[creditcard_data['Class'] == 1]
# the oversampling here simply replicates the fraud samples 300 times
oversampled_data = pd.concat([oversampled_data] + [fraud_data] * 300, ignore_index=True)
print('length of oversampled_data data ', len(oversampled_data))
print('% of fraud transactions in oversampled_data data ',
      len(oversampled_data.loc[oversampled_data['Class'] == 1]) / len(oversampled_data))
# get feature and label data
feature_data = oversampled_data.loc[:, oversampled_data.columns != 'Class']
label_data = oversampled_data.loc[:, oversampled_data.columns == 'Class']
X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.30)
for j in [LogisticRegression(), RandomForestClassifier(n_estimators=100)]:
    clf = j
    print(j)
    generatePerformanceReport(clf, X_train, y_train, X_test, y_test, True)
# evaluate on rows of the original data that are not in the oversampled data
remaining_indices = [k for k in creditcard_data.index if k not in oversampled_data.index]
testdf = creditcard_data.loc[remaining_indices]
testdf_label = testdf.loc[:, testdf.columns == 'Class']
testdf_feature = testdf.loc[:, testdf.columns != 'Class']
generatePerformanceReport(clf, X_train, y_train, testdf_feature, testdf_label, False)
```
The random forest classifier with the oversampled approach performs better than the undersampled approach!
Overall conclusion: random forest + oversampling (either direct replication or SMOTE, with a fraud-to-normal ratio of about 1:3) works quite well!