XGBoost Article
The data here is taken form the Data Hackathon3.x - http://datahack.analyticsvidhya.com/contest/data-hackathon-3x
Import Libraries:
In [1]:
import pandas as pd
import numpy as np import xgboost as xgb from xgboost.sklearn import XGBClassifier from sklearn import cross_validation, metrics from sklearn.grid_search import GridSearchCV import matplotlib.pylab as plt %matplotlib inline from matplotlib.pylab import rcParams rcParams['figure.figsize'] = 12, 4
Load Data:
The data has gone through following pre-processing:
- City variable dropped because of too many categories
- DOB converted to Age | DOB dropped
- EMI_Loan_Submitted_Missing created which is 1 if EMI_Loan_Submitted was missing else 0 | EMI_Loan_Submitted dropped
- EmployerName dropped because of too many categories
- Existing_EMI imputed with 0 (median) - 111 values were missing
- Interest_Rate_Missing created which is 1 if Interest_Rate was missing else 0 | Interest_Rate dropped
- Lead_Creation_Date dropped because made little intuitive impact on outcome
- Loan_Amount_Applied, Loan_Tenure_Applied imputed with missing
- Loan_Amount_Submitted_Missing created which is 1 if Loan_Amount_Submitted was missing else 0 | Loan_Amount_Submitted dropped
- Loan_Tenure_Submitted_Missing created which is 1 if Loan_Tenure_Submitted was missing else 0 | Loan_Tenure_Submitted dropped
- LoggedIn, Salary_Account removed
- Processing_Fee_Missing created which is 1 if Processing_Fee was missing else 0 | Processing_Fee dropped
- Source - top 2 kept as is and all others combined into different category
- Numerical and One-Hot-Coding performed
In [2]:
train = pd.read_csv('train_modified.csv') test = pd.read_csv('test_modified.csv')
In [3]:
train.shape, test.shape
Out[3]:
In [4]:
target='Disbursed'
IDcol = 'ID'
In [5]:
train['Disbursed'].value_counts()
Out[5]:
Define a function for modeling and cross-validation
This function will do the following:
- fit the model
- determine training accuracy
- determine training AUC
- determine testing AUC
- update n_estimators with cv function of xgboost package
- plot Feature Importance
In [6]:
test_results = pd.read_csv('test_results.csv') def modelfit(alg, dtrain, dtest, predictors,useTrainCV=True, cv_folds=5, early_stopping_rounds=50): if useTrainCV: xgb_param = alg.get_xgb_params() xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values) xgtest = xgb.DMatrix(dtest[predictors].values) cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds, metrics='auc', early_stopping_rounds=early_stopping_rounds, show_progress=False) alg.set_params(n_estimators=cvresult.shape[0]) #Fit the algorithm on the data alg.fit(dtrain[predictors], dtrain['Disbursed'],eval_metric='auc') #Predict training set: dtrain_predictions = alg.predict(dtrain[predictors]) dtrain_predprob = alg.predict_proba(dtrain[predictors])[:,1] #Print model report: print "\nModel Report" print "Accuracy : %.4g" % metrics.accuracy_score(dtrain['Disbursed'].values, dtrain_predictions) print "AUC Score (Train): %f" % metrics.roc_auc_score(dtrain['Disbursed'], dtrain_predprob) # Predict on testing data: dtest['predprob'] = alg.predict_proba(dtest[predictors])[:,1] results = test_results.merge(dtest[['ID','predprob']], on='ID') print 'AUC Score (Test): %f' % metrics.roc_auc_score(results['Disbursed'