先看数据：

特征如下：

Time

Number of seconds elapsed between each transaction (over two days)

numeric

No description provided

numeric

No description provided

numeric

No description provided

numeric

No description provided

numeric

No description provided

numeric

No description provided

numeric

No description provided

numeric

No description provided

numeric

No description provided

numeric

V10

No description provided

numeric

V11

No description provided

numeric

V12

No description provided

numeric

V13

No description provided

numeric

V14

No description provided

numeric

V15

No description provided

numeric

V16

No description provided

numeric

V17

No description provided

numeric

V18

No description provided

numeric

V19

No description provided

numeric

V20

No description provided

numeric

V21

No description provided

numeric

V22

No description provided

numeric

V23

No description provided

numeric

V24

No description provided

numeric

V25

No description provided

numeric

V26

No description provided

numeric

V27

No description provided

numeric

V28

abc

numeric

Amount

Amount of money for this transaction

numeric

Class

Fraud or Not-Fraud

boolean

只有Amount没有做标准化处理（mean不为0！！！）：见：https://www.kaggle.com/mlg-ulb/creditcardfraud/data

Introduction

from：https://www.kaggle.com/nikitaivanov/getting-high-sensitivity-for-imbalanced-data 主要使用了smote和聚类两种思路！

In this notebook we will try to predict fraud transactions from a given data set. Given that the data is imbalanced, standard metrics for evaluating classification algorithm (such as accuracy) are invalid. We will focus on the following metrics: Sensitivity (true positive rate) and Specificity (true negative rate). Of course, they are dependent on each other, so we want to find optimal trade-off between them. Such trade-off usually depends on the application of the algorithm, and in case of fraud detection I would prefer to see high sensitivity (e.g. given that a transaction is fraud, I want to be able to detect it with high probability).

For dealing with skewed data I am going to use SMOTE algorithm. In two words, the idea is to create synthetic samples (in opposite to oversampling with replacement) through finding nearest examples (KNN), calculating difference between them, multiplying this difference by a random number between 0 and 1 and adding the result to the initial sample. For this purpose we are going to use SMOTE function from DMwR package.

Algorithms I am going to implement are Support Vector Machine (SVM), Logistic regression and Random Forest. Models will be trained on the original and SMOTEd data and their performance will be measured on the entire data set.

As a bonus, we are going to have some fun and use K-means centroids of the negative examples together with the original positive examples as a new dataset and train our algorithm on it. We then compare results.

##Loading required packeges 
library(ggplot2) #visualization  
library(caret) #train model library(dplyr) #data manipulation library(kernlab) #svm library(nnet) #models (logit, neural nets) library(DMwR) #SMOTE data ##Load data d = read.csv("../input/creditcard.csv") n = ncol(d) str(d) d$Class = ifelse(d$Class == 0, 'No', 'Yes') %>% as.factor()

Loading required package: lattice

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Attaching package: ‘kernlab’

The following object is masked from ‘package:ggplot2’:

    alpha

Loading required package: grid

'data.frame':	284807 obs. of  31 variables:
 $ Time  : num  0 0 1 1 2 2 4 7 7 9 ...
 $ V1    : num  -1.36 1.192 -1.358 -0.966 -1.158 ...
 $ V2    : num  -0.0728 0.2662 -1.3402 -0.1852 0.8777 ...
 $ V3    : num  2.536 0.166 1.773 1.793 1.549 ...
 $ V4    : num  1.378 0.448 0.38 -0.863 0.403 ...
 $ V5    : num  -0.3383 0.06 -0.5032 -0.0103 -0.4072 ...
 $ V6    : num  0.4624 -0.0824 1.8005 1.2472 0.0959 ...
 $ V7    : num  0.2396 -0.0788 0.7915 0.2376 0.5929 ...
 $ V8    : num  0.0987 0.0851 0.2477 0.3774 -0.2705 ...
 $ V9    : num  0.364 -0.255 -1.515 -1.387 0.818 ...
 $ V10   : num  0.0908 -0.167 0.2076 -0.055 0.7531 ...
 $ V11   : num  -0.552 1.613 0.625 -0.226 -0.823 ...
 $ V12   : num  -0.6178 1.0652 0.0661 0.1782 0.5382 ...
 $ V13   : num  -0.991 0.489 0.717 0.508 1.346 ...
 $ V14   : num  -0.311 -0.144 -0.166 -0.288 -1.12 ...
 $ V15   : num  1.468 0.636 2.346 -0.631 0.175 ...
 $ V16   : num  -0.47 0.464 -2.89 -1.06 -0.451 ...
 $ V17   : num  0.208 -0.115 1.11 -0.684 -0.237 ...
 $ V18   : num  0.0258 -0.1834 -0.1214 1.9658 -0.0382 ...
 $ V19   : num  0.404 -0.146 -2.262 -1.233 0.803 ...
 $ V20   : num  0.2514 -0.0691 0.525 -0.208 0.4085 ...
 $ V21   : num  -0.01831 -0.22578 0.248 -0.1083 -0.00943 ...
 $ V22   : num  0.27784 -0.63867 0.77168 0.00527 0.79828 ...
 $ V23   : num  -0.11 0.101 0.909 -0.19 -0.137 ...
 $ V24   : num  0.0669 -0.3398 -0.6893 -1.1756 0.1413 ...
 $ V25   : num  0.129 0.167 -0.328 0.647 -0.206 ...
 $ V26   : num  -0.189 0.126 -0.139 -0.222 0.502 ...
 $ V27   : num  0.13356 -0.00898 -0.05535 0.06272 0.21942 ...
 $ V28   : num  -0.0211 0.0147 -0.0598 0.0615 0.2152 ...
 $ Amount: num  149.62 2.69 378.66 123.5 69.99 ...
 $ Class : int  0 0 0 0 0 0 0 0 0 0 ...

It is always a good idea first to plot a response variable to check for skewness in data:

qplot(x = d$Class, geom = 'bar') + xlab('Fraud (Yes/No)') + ylab('Number of transactions')

Classification on the original data

Keeping in mind that the data is highly skewed we proceed. First split the data into training and test sets.

idx = createDataPartition(d$Class, p = 0.7, list = F) d[, -n] = scale(d[, -n]) #perform scaling train = d[idx, ] test = d[-idx, ]

Calculate baseline accuracy for future reference

blacc = nrow(d[d$Class == 'No', ])/nrow(d)*100 cat('Baseline accuracy:', blacc)

Baseline accuracy: 99.82725

To begin with, let's train our models on the original dataset to see what we get if use unbalanced data. Due to computational limitations of my laptop, I will only run logistic regression for this purpose.

m1 = multinom(data = train, Class ~ .) p1 = predict(m1, test[, -n], type = 'class') cat(' Accuracy of the model', mean(p1 == test[, n])*100, '\n', 'Baseline accuracy', blacc)

# weights:  32 (31 variable)
initial  value 138189.980799 
final  value 31315.159746 
converged
 Accuracy of the model 99.92744 
 Baseline accuracy 99.82725

Though accuracy (99.92%) of the model might look impressive at a first glance, in fact it isn't. Simply predicting 'not a fraud' for all transactions will give 99.83% accuracy. To really evaluate model's perfomance we need to check confusion matrix.

confusionMatrix(p1, test[, n], positive = 'Yes')

Confusion Matrix and Statistics

          Reference
Prediction    No   Yes
       No  85287    55
       Yes     7    92
                                          
               Accuracy : 0.9993          
                 95% CI : (0.9991, 0.9994)
    No Information Rate : 0.9983          
    P-Value [Acc > NIR] : 1.779e-15       
                                          
                  Kappa : 0.7476          
 Mcnemar's Test P-Value : 2.387e-09       
                                          
            Sensitivity : 0.625850        
            Specificity : 0.999918        
         Pos Pred Value : 0.929293        
         Neg Pred Value : 0.999356        
             Prevalence : 0.001720        
         Detection Rate : 0.001077        
   Detection Prevalence : 0.001159        
      Balanced Accuracy : 0.812884        
                                          
       'Positive' Class : Yes

From the confusion matrix we see that while model has high accuracy (99.92%) and high specificity (99.98%), it has low sensitivity of 64%. In other words, only 64% of all fraudulent transactions were detected.

Classification on the SMOTEd data

Now let's preprocess our data using SMOTE algorithm:

table(d$Class) #check initial distribution
newData <- SMOTE(Class ~ ., d, perc.over = 500,perc.under=100) table(newData$Class) #check SMOTed distribution

    No    Yes 
284315    492

  No  Yes 
2460 2952

To train SVM (with RBF kernel) we are going to use train function from caret package. It allows to choose optimal parameters of the model (cost and sigma in this case). Cost refers to penalty for misclassifying examples and sigma is a parameter of RBF which measures similarity between examples. To choose best model we use 5-fold cross-validation. We then evaluate our model on the entire data set.

gr = expand.grid(C = c(1, 50, 150), sigma = c(0.01, 0.05, 1)) tr = trainControl(method = 'cv', number = 5) m2 = train(data = newData, Class ~ ., method = 'svmRadial', trControl = tr, tuneGrid = gr) m2

Support Vector Machines with Radial Basis Function Kernel 

5412 samples
  30 predictor
   2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 4330, 4329, 4329, 4330, 4330 
Resampling results across tuning parameters:

  C    sigma  Accuracy   Kappa    
    1  0.01   0.9445668  0.8891865
    1  0.05   0.9626774  0.9250408
    1  1.00   0.9672934  0.9344234
   50  0.01   0.9717300  0.9430408
   50  0.05   0.9863262  0.9723782
   50  1.00   0.9695108  0.9388440
  150  0.01   0.9789351  0.9574955
  150  0.05   0.9850335  0.9697552
  150  1.00   0.9695108  0.9388440

Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.05 and C = 50.

As wee see, best tuning parameters are C = 50 and sigma = 0.05

Let's look at a confusion matrix

p2 = predict(m2, d[, -n]) confusionMatrix(p2, d[, n], positive = 'Yes')

Confusion Matrix and Statistics

          Reference
Prediction     No    Yes
       No  278470      2
       Yes   5845    490
                                        
               Accuracy : 0.9795        
                 95% CI : (0.9789, 0.98)
    No Information Rate : 0.9983        
    P-Value [Acc > NIR] : 1             
                                        
                  Kappa : 0.1408        
 Mcnemar's Test P-Value : <2e-16        
                                        
            Sensitivity : 0.995935      
            Specificity : 0.979442      
         Pos Pred Value : 0.077348      
         Neg Pred Value : 0.999993      
             Prevalence : 0.001727      
         Detection Rate : 0.001720      
   Detection Prevalence : 0.022243      
      Balanced Accuracy : 0.987688      
                                        
       'Positive' Class : Yes

(Numbers may differ due to randomness of k-fold cv)

As expected we were able to achieve sensitivity of 99.59%. In other words, out of all fraudulent transactions we correctly detected 99.59% of them. This came in price of slightly lower accuracy (in comparison to the first model) - 97.95% vs. 99.92% and lower specificity 97.94% vs. 99.98%. The main disadvantage is low level of positive predicted value (i.e. given that prediction is positive, what is probability that the true state is positive) which this case is 7.74% vs. 85% for initial (unbalanced dataset) model. As was mentioned in the beginning, one should choose a model that matches certain goals. If the goal is to correctly identify fraudulent transactions even in price of low positive predicted value (which I believe the case), then the latter model (based on SMOTed data) should be used. Looking at confusion matrix we see that almost all fraudulent transactions were correctly identified and only 2.5% were mislabeled as fraudulent.

I'm planning to try couple more models and also use more sophisticated algorithm that uses K-means centroids of the majority class as samples for non fraudulent transactions.

m3 = randomForest(data = newData, Class ~ .) p3 = predict(m3, d[, -n]) confusionMatrix(p3, d[, n], positive = 'Yes')

Error in eval(expr, envir, enclos): could not find function "randomForest"
Traceback:

library(randomForest)
m3 = randomForest(data = newData, Class ~ .) p3 = predict(m3, d[, -n]) confusionMatrix(p3, d[, n], positive = 'Yes')

randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:dplyr’:

    combine

The following object is masked from ‘package:ggplot2’:

    margin

Confusion Matrix and Statistics

          Reference
Prediction     No    Yes
       No  282105      0
       Yes   2210    492
                                          
               Accuracy : 0.9922          
                 95% CI : (0.9919, 0.9926)
    No Information Rate : 0.9983          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.306           
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 1.000000        
            Specificity : 0.992227        
         Pos Pred Value : 0.182087        
         Neg Pred Value : 1.000000        
             Prevalence : 0.001727        
         Detection Rate : 0.001727        
   Detection Prevalence : 0.009487        
      Balanced Accuracy : 0.996113        
                                          
       'Positive' Class : Yes

Random forest performs really well. Sensitivity 100% and high specificity (more than 99%). All fraudulent transactions were detected and less than 1% of all transactions were falsely classified as fraud. Hence, Random Forest + SMOTE algorithm shloud be considered as final model.

K-means centroids as a new sample

For curiosity, let's take another approach in dealing with imbalanced data. We are going to separate the examples for positive and negative and from the latter one extract centroids (generated using K-means clustering). Number of clusters will be equal to the number of positive examples. We then use these centroids together with positive examples as a new sample.（思路就是聚类，将major class聚类为k个点，其中k为欺诈信用卡的样本数!）

neg = d[d$Class == 'No', ] #negative examples pos = d[d$Class == 'Yes', ] #positive examples n_pos = sum(d$Class == 'Yes') #calculate number of positive examples clus = kmeans(neg[, -n], centers = n_pos, iter.max = 100) #perform K-means neg = as.data.frame(clus$centers) #extract centroids as new sample neg$Class = 'No' newData = rbind(neg, pos) #merge positive and negative examples newData$Class = factor(newData$Class)

We run random forest on the new dataset, newData, and check confusion matrix.

m4 = randomForest(data = newData, Class ~ .) p4 = predict(m4, d[, -n]) confusionMatrix(p4, d[, n], positive = 'Yes')

Confusion Matrix and Statistics

          Reference
Prediction     No    Yes
       No  210086      0
       Yes  74229    492
                                         
               Accuracy : 0.7394         
                 95% CI : (0.7378, 0.741)
    No Information Rate : 0.9983         
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.0097         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 1.000000       
            Specificity : 0.738920       
         Pos Pred Value : 0.006584       
         Neg Pred Value : 1.000000       
             Prevalence : 0.001727       
         Detection Rate : 0.001727       
   Detection Prevalence : 0.262357       
      Balanced Accuracy : 0.869460       
                                         
       'Positive' Class : Yes

Well, while sensitivity is still 100%, specificity dropped to 72% leading to a big fraction of false positive predictions. Learning on the data that was transformed using SMOTE algorithm gave much better results.

from：https://www.kaggle.com/themlguy/undersample-and-oversample-approach-explored

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv) # Input data files are available in the "../input/" directory. # For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory import os print(os.listdir("../input")) # Any results you write to the current directory are saved as output.

['creditcard.csv']

import matplotlib.pyplot as plt
from sklearn.cross_validation import train_test_split from sklearn.linear_model import LogisticRegression from sklearn.model_selection import GridSearchCV from sklearn.svm import SVC from sklearn.ensemble import RandomForestClassifier from sklearn.preprocessing import StandardScaler import seaborn as sns from sklearn.metrics import confusion_matrix,recall_score,precision_recall_curve,auc,roc_curve,roc_auc_score,classification_report

/opt/conda/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

creditcard_data=pd.read_csv("../input/creditcard.csv")

creditcard_data['Amount']=StandardScaler().fit_transform(creditcard_data['Amount'].values.reshape(-1, 1)) creditcard_data.drop(['Time'], axis=1, inplace=True)

def generatePerformanceReport(clf,X_train,y_train,X_test,y_test,bool_): if bool_==True: clf.fit(X_train,y_train.values.ravel()) pred=clf.predict(X_test) cnf_matrix=confusion_matrix(y_test,pred) tn, fp, fn, tp=cnf_matrix.ravel() print('---------------------------------') print('Length of training data:',len(X_train)) print('Length of test data:', len(X_test)) print('---------------------------------') print('True positives:',tp) print('True negatives:',tn) print('False positives:',fp) print('False negatives:',fn) #sns.heatmap(cnf_matrix,cmap="coolwarm_r",annot=True,linewidths=0.5) print('----------------------Classification report--------------------------') print(classification_report(y_test,pred))

#generate 50%, 66%, 75% proportions of normal indices to be combined with fraud indices 也就是说采样后的黑白样本比例是：0.5,0.66,0.75
#undersampled data
normal_indices=creditcard_data[creditcard_data['Class']==0].index fraud_indices=creditcard_data[creditcard_data['Class']==1].index for i in range(1,4): normal_sampled_data=np.array(np.random.choice(normal_indices, i*len(fraud_indices),replace=False)) #a random sample is generated from normal_indices 主要是随机欠采样 undersampled_data=np.concatenate([fraud_indices, normal_sampled_data]) undersampled_data=creditcard_data.iloc[undersampled_data] print('length of undersampled data ', len(undersampled_data)) print('% of fraud transactions in undersampled data ',len(undersampled_data.loc[undersampled_data['Class']==1])/len(undersampled_data)) #get feature and label data feature_data=undersampled_data.loc[:,undersampled_data.columns!='Class'] label_data=undersampled_data.loc[:,undersampled_data.columns=='Class'] X_train, X_test, y_train, y_test=train_test_split(feature_data,label_data,test_size=0.30) for j in [LogisticRegression(),SVC(),RandomForestClassifier(n_estimators=100)]: clf=j print(j) generatePerformanceReport(clf,X_train,y_train,X_test,y_test,True) #the above code classifies X_test which is part of undersampled data #now, let us consider the remaining rows of dataset and use that as test set remaining_indices=[i for i in creditcard_data.index if i not in undersampled_data.index] testdf=creditcard_data.iloc[remaining_indices] testdf_label=creditcard_data.loc[:,testdf.columns=='Class'] testdf_feature=creditcard_data.loc[:,testdf.columns!='Class'] generatePerformanceReport(clf,X_train,y_train,testdf_feature,testdf_label,False)

length of undersampled data  984
% of fraud transactions in undersampled data  0.5
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
---------------------------------
Length of training data: 688
Length of test data: 296
---------------------------------
True positives: 144
True negatives: 134
False positives: 11
False negatives: 7
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       0.95      0.92      0.94       145
          1       0.93      0.95      0.94       151

avg / total       0.94      0.94      0.94       296

---------------------------------
Length of training data: 688
Length of test data: 284807
---------------------------------
True positives: 461
True negatives: 270879
False positives: 13436
False negatives: 31
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      0.95      0.98    284315
          1       0.03      0.94      0.06       492 #可以看到LR在测试数据集上表现并不好

avg / total       1.00      0.95      0.97    284807

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
---------------------------------
Length of training data: 688
Length of test data: 296
---------------------------------
True positives: 144
True negatives: 140
False positives: 5
False negatives: 7
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       0.95      0.97      0.96       145
          1       0.97      0.95      0.96       151

avg / total       0.96      0.96      0.96       296

---------------------------------
Length of training data: 688
Length of test data: 284807
---------------------------------
True positives: 463
True negatives: 267084
False positives: 17231
False negatives: 29
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      0.94      0.97    284315
          1       0.03      0.94      0.05       492 #看来svm在测试数据集上也不行啊

avg / total       1.00      0.94      0.97    284807

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
---------------------------------
Length of training data: 688
Length of test data: 296
---------------------------------
True positives: 144
True negatives: 142
False positives: 3
False negatives: 7
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       0.95      0.98      0.97       145
          1       0.98      0.95      0.97       151

avg / total       0.97      0.97      0.97       296

---------------------------------
Length of training data: 688
Length of test data: 284807
---------------------------------
True positives: 485
True negatives: 275060
False positives: 9255
False negatives: 7
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      0.97      0.98    284315
          1       0.05      0.99      0.09       492 #Rf也不行？？？？

avg / total       1.00      0.97      0.98    284807

length of undersampled data  1476
% of fraud transactions in undersampled data  0.3333333333333333
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
---------------------------------
Length of training data: 1033
Length of test data: 443
---------------------------------
True positives: 130
True negatives: 291
False positives: 5
False negatives: 17
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       0.94      0.98      0.96       296
          1       0.96      0.88      0.92       147

avg / total       0.95      0.95      0.95       443

---------------------------------
Length of training data: 1033
Length of test data: 284807
---------------------------------
True positives: 442
True negatives: 278887
False positives: 5428
False negatives: 50
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      0.98      0.99    284315
          1       0.08      0.90      0.14       492

avg / total       1.00      0.98      0.99    284807

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
---------------------------------
Length of training data: 1033
Length of test data: 443
---------------------------------
True positives: 133
True negatives: 286
False positives: 10
False negatives: 14
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       0.95      0.97      0.96       296
          1       0.93      0.90      0.92       147

avg / total       0.95      0.95      0.95       443

---------------------------------
Length of training data: 1033
Length of test data: 284807
---------------------------------
True positives: 453
True negatives: 274909
False positives: 9406
False negatives: 39
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      0.97      0.98    284315
          1       0.05      0.92      0.09       492

avg / total       1.00      0.97      0.98    284807

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
---------------------------------
Length of training data: 1033
Length of test data: 443
---------------------------------
True positives: 128
True negatives: 293
False positives: 3
False negatives: 19
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       0.94      0.99      0.96       296
          1       0.98      0.87      0.92       147

avg / total       0.95      0.95      0.95       443

---------------------------------
Length of training data: 1033
Length of test data: 284807
---------------------------------
True positives: 473
True negatives: 281560
False positives: 2755
False negatives: 19
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      0.99      1.00    284315
          1       0.15      0.96      0.25       492

avg / total       1.00      0.99      0.99    284807

length of undersampled data  1968
% of fraud transactions in undersampled data  0.25
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
---------------------------------
Length of training data: 1377
Length of test data: 591
---------------------------------
True positives: 116
True negatives: 451
False positives: 5
False negatives: 19
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       0.96      0.99      0.97       456
          1       0.96      0.86      0.91       135

avg / total       0.96      0.96      0.96       591

---------------------------------
Length of training data: 1377
Length of test data: 284807
---------------------------------
True positives: 433
True negatives: 282245
False positives: 2070
False negatives: 59
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      0.99      1.00    284315
          1       0.17      0.88      0.29       492

avg / total       1.00      0.99      1.00    284807

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
---------------------------------
Length of training data: 1377
Length of test data: 591
---------------------------------
True positives: 118
True negatives: 447
False positives: 9
False negatives: 17
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       0.96      0.98      0.97       456
          1       0.93      0.87      0.90       135

avg / total       0.96      0.96      0.96       591

---------------------------------
Length of training data: 1377
Length of test data: 284807
---------------------------------
True positives: 445
True negatives: 279369
False positives: 4946
False negatives: 47
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      0.98      0.99    284315
          1       0.08      0.90      0.15       492

avg / total       1.00      0.98      0.99    284807

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
---------------------------------
Length of training data: 1377
Length of test data: 591
---------------------------------
True positives: 112
True negatives: 455
False positives: 1
False negatives: 23
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       0.95      1.00      0.97       456
          1       0.99      0.83      0.90       135

avg / total       0.96      0.96      0.96       591

---------------------------------
Length of training data: 1377
Length of test data: 284807
---------------------------------
True positives: 469
True negatives: 283466
False positives: 849
False negatives: 23
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      1.00      1.00    284315
          1       0.36      0.95      0.52       492

avg / total       1.00      1.00      1.00    284807

整体来看，因为欠采样只是用了一个模型，因此预测效果很差！！！因为没有用到全量数据特征，所以在全部数据集上表现并不好！

#oversampled_data data
normal_sampled_indices=creditcard_data.loc[creditcard_data['Class']==0].index oversampled_data=creditcard_data.iloc[normal_sampled_indices] fraud_data=creditcard_data.loc[creditcard_data['Class']==1] oversampled_data=oversampled_data.append([fraud_data]*300, ignore_index=True) #此处过采样处理是直接将欺诈样本复制300份！！！ print('length of oversampled_data data ', len(oversampled_data)) print('% of fraud transactions in oversampled_data data ',len(oversampled_data.loc[oversampled_data['Class']==1])/len(oversampled_data)) #get feature and label data feature_data=oversampled_data.loc[:,oversampled_data.columns!='Class'] label_data=oversampled_data.loc[:,oversampled_data.columns=='Class'] X_train, X_test, y_train, y_test=train_test_split(feature_data,label_data,test_size=0.30) for j in [LogisticRegression(),RandomForestClassifier(n_estimators=100)]: clf=j print(j) generatePerformanceReport(clf,X_train,y_train,X_test,y_test,True) #the above code classifies X_test which is part of undersampled data #now, let us consider the remaining rows of dataset and use that as test set remaining_indices=[i for i in creditcard_data.index if i not in oversampled_data.index] testdf=creditcard_data.iloc[remaining_indices] testdf_label=creditcard_data.loc[:,testdf.columns=='Class'] testdf_feature=creditcard_data.loc[:,testdf.columns!='Class'] generatePerformanceReport(clf,X_train,y_train,testdf_feature,testdf_label,False)

length of oversampled_data data  431915
% of fraud transactions in oversampled_data data  0.3417339059768704 最后复制后的欺诈样本比例为白样本的33%
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
---------------------------------
Length of training data: 302340
Length of test data: 129575
---------------------------------
True positives: 39803
True negatives: 84311
False positives: 1027
False negatives: 4434
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       0.95      0.99      0.97     85338
          1       0.97      0.90      0.94     44237

avg / total       0.96      0.96      0.96    129575

---------------------------------
Length of training data: 302340
Length of test data: 284807
---------------------------------
True positives: 444
True negatives: 281055
False positives: 3260
False negatives: 48
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      0.99      0.99    284315
          1       0.12      0.90      0.21       492 #效果也不咋的啊！

avg / total       1.00      0.99      0.99    284807

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
---------------------------------
Length of training data: 302340
Length of test data: 129575
---------------------------------
True positives: 44237
True negatives: 85327
False positives: 11
False negatives: 0
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     85338
          1       1.00      1.00      1.00     44237

avg / total       1.00      1.00      1.00    129575

---------------------------------
Length of training data: 302340
Length of test data: 284807
---------------------------------
True positives: 492
True negatives: 284304
False positives: 11
False negatives: 0
----------------------Classification report--------------------------
             precision    recall  f1-score   support

          0       1.00      1.00      1.00    284315
          1       0.98      1.00      0.99       492 #随机森林还是不错的！！！

avg / total       1.00      1.00      1.00    284807

Random forest classifier with oversampled approach performs better compared to undersampled approach！！！

综合结论就是：随机森林+过采样（直接复制或者smote后，黑白比例1:3）效果比较好！

kaggle 欺诈信用卡预测——不平衡训练样本的处理方法综合结论就是：随机森林+过采样（直接复制或者smote后，黑白比例1:3）效果比较好！记得在smote前一定要先做标准化！！！

先看数据：

Introduction

Classification on the original data

Classification on the SMOTEd data

K-means centroids as a new sample

猜你喜欢

kaggle 欺诈信用卡预测——不平衡训练样本的处理方法 综合结论就是：随机森林+过采样（直接复制或者smote后，黑白比例1:3）效果比较好！记得在smote前一定要先做标准化！！！

先看数据：

Introduction

Classification on the original data

Classification on the SMOTEd data

K-means centroids as a new sample

猜你喜欢

kaggle 欺诈信用卡预测——不平衡训练样本的处理方法综合结论就是：随机森林+过采样（直接复制或者smote后，黑白比例1:3）效果比较好！记得在smote前一定要先做标准化！！！