Kaggle Hands-On Project: Credit Card Fraud Detection

Step 1: Understand the topic and determine the task

Credit card fraud refers to intentionally using a forged or invalid credit card, fraudulently using someone else's credit card to obtain property, or maliciously overdrawing one's own credit card.

Common credit card fraud uses include:

1. Fraudulent use of a lost card: a card is generally lost in one of three ways. First, the issuing bank loses the card while sending it to the cardholder, so the card is never received; second, the cardholder fails to keep the card safe and loses it; third, the card is stolen by criminals.
2. Fake applications: applying for a credit card with someone else's information, or deliberately applying with false information, such as a forged ID card or a false employer or home address.
3. Counterfeit cards: according to statistics, more than 60% of international credit card fraud cases involve counterfeit cards. Counterfeit-card fraud is typically gang-based, covering the whole chain from stealing card information and manufacturing counterfeit cards to selling them and using them to commit crimes. Forgers often use the latest technical means to steal real card information, for example with miniature recorders or by abusing the terminal function of an authorization machine. Once real card information has been stolen, counterfeit cards are manufactured in batches and then sold on to commit crimes, generating huge profits.

This project uses historical credit card transaction data and, through data preprocessing, variable selection, modeling, analysis and prediction, builds a simple credit card anti-fraud prediction model that can detect credit card theft early.

Determining the task: build an anti-fraud prediction model from users' historical information, so that when a new card transaction arrives, the model can accurately predict whether it is a normal transaction or a fraudulent one, i.e. whether the cardholder's credit card has been stolen.

Step 2: Scenario Analysis

**Analyzing the scenario:** determine whether the problem is supervised or unsupervised, and whether it is binary or multi-class classification. Different scenarios call for different algorithms.
First, the data we have is labeled: each sample is marked as non-fraudulent (0) or fraudulent (1), so this is a supervised learning scenario.
Second, whether a cardholder's transaction is fraudulent or not is a binary question, so binary classification algorithms apply (for example logistic regression, which handles binary classification).
Analyzing the data: the dataset has 31 columns in total. Features V1 to V28 have already been processed with PCA, so we leave them as they are. The Time and Amount features have value ranges very different from the other features, so feature scaling is needed to bring all features onto the same scale. In terms of data quality, there are no garbled or empty values. The last column, Class, is the target column, and the first 30 columns are the feature columns.

How to handle the data for the final validation: all the data is labeled, so the model trained on the training set can be evaluated with cross-validation. 70% of the data is used for training and 30% for prediction and evaluation.

The business scenario is summarized as follows:
1. We learn from historical records to predict whether a cardholder's card will be used fraudulently, so we choose the Logistic Regression algorithm for this supervised binary classification scenario.
2. The data is structured, so no feature abstraction is required; whether feature scaling is needed will be decided later.

Step 3: Data Preprocessing

The original data is a set of personal transaction records, but for privacy reasons it has already been processed with something like PCA, so the feature data has already been extracted. The next task is to build a model that achieves the best detection performance. Although we do not need to perform feature extraction ourselves, the challenge is still considerable.
The dataset contains only numeric input variables that are the result of a PCA transformation. Unfortunately, due to confidentiality, the original features and further background information cannot be provided. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features not transformed by PCA are 'Time' and 'Amount'. The feature 'Time' contains the number of seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount, which can be used for instance-dependent cost-sensitive learning. The feature 'Class' is the response variable: it takes the value 1 if the card was used fraudulently and 0 otherwise. There are 31 columns in total: 30 feature columns plus the target.

Check for missing values

No missing values were found. If there were missing values, they could be replaced with the median or the mean.
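A minimal sketch of this check with pandas (assuming the Kaggle file is named `creditcard.csv`; column names follow the dataset description above):

```python
import pandas as pd

# Load the Kaggle credit card dataset (file name assumed)
df = pd.read_csv('creditcard.csv')

# Count missing values per column; all zeros means no imputation is needed
print(df.isnull().sum())

# If a column did contain missing values, the median (or mean) could be filled in, e.g.:
# df['Amount'] = df['Amount'].fillna(df['Amount'].median())
```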

View sample class distribution

Look at the class distribution of the samples to check whether the dataset is imbalanced, i.e. whether the numbers of samples in the different classes differ greatly.
Here the numbers of positive and negative samples differ enormously.
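A quick way to inspect the class counts (a sketch, reusing the `df` loaded above; the target column is `Class`):

```python
# Count how many normal (0) and fraudulent (1) samples there are
counts = df['Class'].value_counts()
print(counts)
print('fraud ratio: {:.4%}'.format(counts[1] / counts.sum()))
```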
Imbalanced data: an imbalanced dataset is likely to produce a model that predicts the '0' class accurately but the '1' class poorly.

Note that in the data analysis stage the number of samples is not adjusted yet, because doing so would interfere with the subsequent feature engineering. Sample imbalance is handled in the model training phase, when the model is trained on the training samples.

Data imbalance solution

To deal with the imbalance problem, we can work at two levels: the data level and the algorithm level.
1. Data
Downsampling (relatively simple to implement): since the two classes are imbalanced, reduce the majority class until both classes have the same number of samples (i.e. cut the number of 0s down to the number of 1s), so the data is no longer imbalanced. In other words, delete samples from the class that has too many.

A very simple implementation: from the samples belonging to class 0, randomly select exactly as many samples as there are in class 1, making the larger class as small as the smaller one (see the code sketch after the list below).

Oversampling: make the smaller class as large as the larger one.

1. Data level: undersampling, oversampling, or a combination of the two
2. Algorithm level: ensemble learning, cost-sensitive learning, feature selection
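A minimal sketch of the random undersampling idea described above (reusing `df`; it keeps all fraud rows and draws an equally sized random subset of the normal rows):

```python
import pandas as pd

# Random undersampling: keep every fraud sample, draw the same number of normal samples
fraud = df[df['Class'] == 1]
normal = df[df['Class'] == 0].sample(n=len(fraud), random_state=42)

# Combine and shuffle so the two classes are mixed
balanced = pd.concat([fraud, normal]).sample(frac=1, random_state=42)
print(balanced['Class'].value_counts())
```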

Change the evaluation criteria: the following metrics can provide deeper insight into the model's performance than plain classification accuracy.

Confusion matrix: tabulates the predictions against the true classes, showing the correct predictions (on the diagonal) and the types of incorrect predictions (which classes the errors are assigned to);
Precision: a measure of classification exactness;
Recall: a measure of classification completeness;
F1-score (or F-score): the weighted average of precision and recall.
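All of these metrics are available in scikit-learn; a small sketch on hypothetical labels `y_true` and predictions `y_pred`:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [0, 0, 0, 1, 1, 0, 1, 0]   # hypothetical ground truth
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]   # hypothetical model output

print(confusion_matrix(y_true, y_pred))        # rows: actual class, columns: predicted class
print('precision:', precision_score(y_true, y_pred))
print('recall   :', recall_score(y_true, y_pred))
print('f1       :', f1_score(y_true, y_pred))
```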

Use a different algorithm



Data standardization/normalization

The values in the Amount column are relatively large, so they will be standardized or normalized later; otherwise the model would mistakenly give the feature with larger values a larger weight. The scales of the features should be balanced as much as possible.

Step 4: Feature Engineering

1) Check the feature distributions and delete features whose distributions are similar across the different classes

1) First look at the distribution of transaction amounts for fraudulent and normal transactions, i.e. the distribution of this feature for each class. The horizontal axis is the amount and the vertical axis is the number of transactions.
The two distributions are quite different, so this feature should not be deleted. The amounts of fraudulent transactions are relatively small compared with those of normal card users, which suggests that fraudsters prefer small purchases so as not to attract the cardholder's attention.
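A sketch of how such a comparison could be plotted with matplotlib (column names as in the dataset description; bin count is arbitrary):

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(8, 6))

# Transaction amount histograms, one panel per class
ax1.hist(df[df['Class'] == 1]['Amount'], bins=50)
ax1.set_title('Fraudulent transactions')
ax2.hist(df[df['Class'] == 0]['Amount'], bins=50)
ax2.set_title('Normal transactions')

for ax in (ax1, ax2):
    ax.set_ylabel('Number of transactions')
ax2.set_xlabel('Amount')
plt.tight_layout()
plt.show()
```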

2) Time distribution of normal card swiping and fraudulent swiping


Generally speaking, there is not much difference between the time distributions of normal and fraudulent transactions, so we can speculate that thieves deliberately place fraudulent transactions within the peak hours of normal card use to reduce the risk of being detected. Therefore this feature can be dropped when building the prediction model.

3) View other feature distributions

We keep only the feature variables whose distributions differ clearly between the two credit card states (normal vs. fraudulent). Therefore the variables V8, V13, V15, V20, V21, V22, V23, V24, V25, V26, V27 and V28 are eliminated; this is also consistent with what we observed earlier with the correlation plot. The variable Time is eliminated as well.
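A sketch of dropping these columns (the list is taken from the selection above):

```python
# Remove features whose distributions look almost identical for both classes,
# plus Time, which showed no useful separation either
cols_to_drop = ['V8', 'V13', 'V15', 'V20', 'V21', 'V22',
                'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Time']
df = df.drop(columns=cols_to_drop)
```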


2) Feature scaling

It is clear that the values of the Amount feature differ greatly from those of the other features, so feature scaling is performed.
To eliminate the influence of different scales, the 'Amount' feature is standardized into a new feature 'normAmount'.
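A sketch of this scaling step with scikit-learn's StandardScaler (the new column name `normAmount` follows the text above):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# reshape(-1, 1) because the scaler expects a 2-D array of shape (n_samples, n_features)
df['normAmount'] = scaler.fit_transform(df['Amount'].values.reshape(-1, 1)).ravel()
df = df.drop(columns=['Amount'])
```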

3) Feature Importance Analysis

Use the feature importances of a random forest to rank the features.
Note that when random forest importance is used to rank features, it tends to inflate the importance of continuous, high-cardinality features: for example, an index column running from 0 to 100000 would be ranked as important even though it has nothing to do with our target. The results therefore need to be interpreted carefully.
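A sketch of ranking features this way (the split of `df` into `X` and `y` is illustrative; the caveat about continuous, high-cardinality features still applies when reading the output):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X = df.drop(columns=['Class'])
y = df['Class']

rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X, y)

# Sort features by importance score, highest first
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```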

Step 5: Model Training

Option 1: first split into training and test sets, then downsample/oversample only the training set, keeping the test set as normal samples.
Option 2: downsample first, then split into training and test sets. In this case the test set is also resampled and cannot accurately reflect the model's real performance.

Handling Imbalanced Samples

1) Simple way - downsampling

As seen above, the balanced dataset we built is relatively small, so its own test split is not large enough to represent the full sample. Therefore the test set drawn from the original dataset is used for the actual evaluation, which better matches the original data distribution.

2) Using the SMOTE algorithm to oversample the data
The basic idea of SMOTE is to analyze the minority-class samples, artificially synthesize new samples based on them, and add those to the dataset. Only the training set is oversampled.
Note: oversampling significantly reduces the number of false positives, so when data is imbalanced it is more common to generate data than to discard it; the trade-off is that more data also means longer training times.
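A minimal sketch of option 1 above: split first, then apply SMOTE only to the training portion (using the imbalanced-learn package and the `X`, `y` from the earlier sketch; variable names are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Split first so the test set keeps the original, imbalanced distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Oversample only the training data: SMOTE synthesizes new minority-class samples
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
print(pd.Series(y_train_res).value_counts())
```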

K-fold cross-validation and grid search to find the best model parameters

The best value found is C = 10.
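A sketch of how such a search over the regularization strength C might look, scoring on recall (grid values are illustrative; it reuses the resampled training data from the sketch above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 1, 10, 100]}   # candidate regularization strengths

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring='recall',   # we care most about catching fraudulent transactions
    cv=5)
grid.fit(X_train_res, y_train_res)

print('best C:', grid.best_params_)
best_lr = grid.best_estimator_
```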

Step 6: Model Evaluation

Different target tasks call for different model metrics.
Here we want to detect as many of the users with fraudulent transactions as possible, so the recall rate is used.

What is our goal? We want to detect the abnormal samples.
For example, suppose a hospital asks us to find the cancer patients among 1000 patients, and that 990 of those 1000 people are cancer-free while only 10 have cancer; we need to detect those 10 people.
If we measured the model by accuracy, a model that misses all 10 of them would still reach 990/1000 = 99% accuracy, yet such a model has no value. This point is very important.

Because different evaluation methods yield different answers, we must choose the most appropriate one for the nature of the problem.

Train and test with downsampled data


Use downsampled data to train and test (the effect of different thresholds on the results)

The threshold should be set neither too high nor too low; the more appropriate the threshold, the better the model performs.
Recall metric in the testing dataset: 0.9727891156462585
Recall metric in the testing dataset: 0.9523809523809523
Recall metric in the testing dataset: 0.9319727891156463
Recall metric in the testing dataset: 0.9319727891156463
Recall metric in the testing dataset: 0.9319727891156463
Recall metric in the testing dataset: 0.9251700680272109
Recall metric in the testing dataset: 0.8979591836734694
Recall metric in the testing dataset: 0.8775510204081632
Recall metric in the testing dataset: 0.8639455782312925
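A sketch of the kind of loop that produces recall values like those above: predict class probabilities, then sweep the decision threshold (it reuses the fitted model and test split from the earlier sketches; threshold values are illustrative):

```python
import numpy as np
from sklearn.metrics import recall_score

# Probability of the positive (fraud) class for each test sample
proba = best_lr.predict_proba(X_test)[:, 1]

for threshold in np.arange(0.1, 1.0, 0.1):
    y_pred = (proba >= threshold).astype(int)
    print('threshold {:.1f} -> recall metric in the testing dataset: {:.4f}'.format(
        threshold, recall_score(y_test, y_pred)))
```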


Use downsampled data for training, raw data for testing


It can be seen that although the recall rate is relatively high, 8794 normal samples are misclassified as fraudulent. This is the downside of downsampling: it turns many correct samples into "wrong" ones.
Note: with a model trained on the downsampled dataset, the recall is relatively high but there are still many false positives; this is an inherent drawback of the downsampling strategy.

K-fold cross-validation on raw data


Use raw data for training and testing


View ROC curve

ROC curve: the receiver operating characteristic curve, which shows how the true positive rate (sensitivity) changes against the false positive rate (1 - specificity). The closer the ROC curve is to the top-left corner, the more accurate the prediction.
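A sketch of drawing the ROC curve with scikit-learn, using the test-set probabilities from the previous sketch:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

fpr, tpr, _ = roc_curve(y_test, proba)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label='ROC curve (AUC = {:.3f})'.format(roc_auc))
plt.plot([0, 1], [0, 1], linestyle='--')   # diagonal = random guessing
plt.xlabel('False positive rate')
plt.ylabel('True positive rate (recall)')
plt.legend()
plt.show()
```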

Logistic regression threshold selection

For the logistic regression algorithm we can also specify a threshold: the probability above which the final result is treated as a positive (or negative) sample. Different thresholds can have a large impact on the results.

A small threshold makes the model very strict: it would rather flag a normal transaction than let a fraud slip through, so most samples are treated as abnormal, giving a high recall but somewhat lower precision.

A larger threshold makes the model looser, which leads to a very low recall and somewhat higher precision.

In summary, when we use the logistic regression algorithm, we also need to choose the most appropriate threshold according to the actual application scenario!

There is no fixed rule for setting the threshold; it is mostly a business decision (since different thresholds affect the recall and precision rates), depending on which metric the business wants to prioritize. For example, credit card fraud detection is a business that prefers a higher recall rate, meaning it tries to block every possibly fraudulent transaction.

If you want a high recall rate, set the threshold low: better to flag a hundred normal transactions by mistake than to let one fraud through.

Oversampling

Summary

1. Determine the task.
2. Analyze the scenario: supervised or unsupervised, binary or multi-class classification, and which algorithm to use.
3. Preprocess the data: check for missing values, check whether the class sizes differ greatly, and standardize/normalize.
4. Feature engineering: delete features whose distributions barely differ between classes, and construct new features.
5. Train the model: split into training and test sets first, then handle the class distribution. If the data is imbalanced, consider downsampling the training set (keeping in mind that downsampling will turn many correct samples into false positives) or oversampling; then train on the training set and use k-fold cross-validation or grid search to find the optimal model parameters.
6. Evaluate the model: different tasks call for different metrics; this project uses recall, i.e. how many of the users whose cards were stolen we actually detect.
7. Evaluate the results: if the metric is good, the model can be used; if not, try other approaches. For logistic regression, different thresholds also affect the final result, so find the best threshold; if even the best threshold does not give a satisfactory metric, try other algorithms and other methods.

This project can also be framed as anomaly detection, for example using an autoencoder to detect anomalies.


Origin blog.csdn.net/weixin_45942265/article/details/119296543