Logistic Regression User Portrait: User Responsiveness Prediction (2)

Logistic regression chapter

The data set carries over from the previous section. The analysis labels each user as a high-response user or not, and uses logistic regression to predict user responsiveness, yielding a probability of response for each user. For linear regression, see the previous chapter.
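For reference, logistic regression produces this probability through the logistic (sigmoid) function, the same equation the model output prints at the end of this chapter:

P(y = 1 | x) = 1 / (1 + exp(-XBeta))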

1 Read and preview data

Load and read the data. The data is still desensitized.

file_path<-"data_response_model.csv" #change the location

# read in data
options(stringsAsFactors = F)
raw<-read.csv(file_path)   #read in your csv data
str(raw)     #check the variable types
View(raw)    #take a quick look at the data
summary(raw) #take a quick look at the summary of variable

# response variable
View(table(raw$dv_response))  #Y counts
View(prop.table(table(raw$dv_response))) #Y proportions

As determined by the business, the target y is the response flag dv_response; check its distribution.

2 Divide data

The data is again divided into three parts: the training (build) sample, an out-of-sample (OOS) validation set, and an out-of-time (OOT) validation set.

#Separate Build Sample
train<-raw[raw$segment=='build',]  #select the build sample; in practice it should be drawn at random when you build the model
View(table(train$segment))         #check segment
View(table(train$dv_response))     #check Y distribution  
View(prop.table(table(train$dv_response))) #check Y distribution

#Separate OOS validation Sample
test<-raw[raw$segment=='inval',] #select the out-of-sample (OOS) validation sample
View(table(test$segment))       #check segment
View(prop.table(table(test$dv_response))) #check Y distribution

#Separate OOT validation Sample
validation<-raw[raw$segment=='outval',] #select the out-of-time (OOT) validation sample
View(table(validation$segment)) #check segment
View(prop.table(table(validation$dv_response))) #check Y distribution
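The split above relies on a pre-assigned segment column. As the build-sample comment notes, the samples should be drawn at random when you build the model yourself; a minimal sketch of such a random split (the 60/20/20 proportions and the segment_random column are assumptions for illustration):

set.seed(123)  #make the random split reproducible
raw$segment_random<-sample(c('build','inval','outval'),nrow(raw),replace=T,prob=c(0.6,0.2,0.2))  #hypothetical 60/20/20 assignment
table(raw$segment_random)  #check the resulting sample sizes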

3 Profiling

Sum the response flag over the data: high-response customers are coded 1 and low-response customers 0, so the sum gives the number of high-response customers, the length gives the total record count, and their ratio gives the overall average response rate.

# overall performance
overall_cnt=nrow(train)   #calculate the total count
overall_resp=sum(train$dv_response)  #calculate the total responders count
overall_resp_rate=overall_resp/overall_cnt  #calculate the response rate
overall_perf<-c(overall_count=overall_cnt,overall_responders=overall_resp,overall_response_rate=overall_resp_rate) #combine
#the same in one line: overall_perf<-c(overall_cnt=nrow(train),overall_resp=sum(train$dv_response),overall_resp_rate=sum(train$dv_response)/nrow(train))
View(t(overall_perf))  #take a look at the summary

The grouping done here is the same as the lift-chart division in the previous chapter; it can also be written as a SQL GROUP BY to compare each group's average response rate against the overall response rate.
Before calling library(), install the plyr package; to write SQL, also install sqldf:
install.packages("plyr")
install.packages("sqldf")

library(plyr)              #call plyr
?ddply
prof<-ddply(train,.(hh_gender_m_flag),summarise,cnt=length(rid),res=sum(dv_response)) #group by hh_gender_m_flag
View(prof)  #check the result
tt=aggregate(train[,c("hh_gender_m_flag","rid")],by=list(train[,c("hh_gender_m_flag")]),length) #the same group-by via aggregate
View(tt) 
#calculate the probablity
#prop.table(as.matrix(prof[,-1]),2)
#t(t(prof)/colSums(prof))

prof1<-within(prof,{res_rate<-res/cnt 
index<-res_rate/overall_resp_rate*100
percent<-cnt/overall_cnt
})  #add response_rate,index, percentage

View(prof1)   #check the result

library(sqldf)  #sqldf runs SQL queries against data frames
#multiply by 1.0 so the integer division becomes floating-point division
sqldf("select hh_gender_m_flag, count(*) as cnt, sum(dv_response) as res, 1.0*sum(dv_response)/count(*) as res_rate from train group by 1")

Missing values can also be used as part of a feature, and their lift can be compared in the same way.

var_data<-train  #assumption: var_data is the sample being profiled, here the training set
nomissing<-data.frame(var_data[!is.na(var_data$em_months_last_open),])  #select the records without missing values
missing<-data.frame(var_data[is.na(var_data$em_months_last_open),])     #select the records with missing values

###################################### numeric Profiling:missing part   #############################################################

missing2<-ddply(missing,.(em_months_last_open),summarise,cnt=length(dv_response),res=sum(dv_response)) #group by em_months_last_open
View(missing2)  

missing_perf<-within(missing2,{res_rate<-res/cnt 
index<-res_rate/overall_resp_rate*100
percent<-cnt/overall_cnt
var_category<-c('unknown')
})   #summary

View(missing_perf)

The non-missing data is now divided into 10 equal-frequency groups by deciles; compute each group's record count and number of high-response customers separately.

nomissing_value<-nomissing$em_months_last_open  #put the nomissing values into a variable

#method1:equal frequency
nomissing$var_category<-cut(nomissing_value,unique(quantile(nomissing_value,(0:10)/10)),include.lowest = T) #separate into 10 groups with equal record counts
class(nomissing$var_category)
View(table(nomissing$var_category))  #take a look at the 10 category
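Method 1 above bins by equal frequency. The original shows no method 2, but the natural counterpart is equal-width binning; a sketch (the var_category_ew column is an assumption for illustration):

#method2 (an assumed counterpart, not in the original): equal width
nomissing$var_category_ew<-cut(nomissing_value,breaks=10)  #10 intervals of equal width
View(table(nomissing$var_category_ew))  #bin sizes will be uneven, unlike method 1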

prof2<-ddply(nomissing,.(var_category),summarise,cnt=length(dv_response),res=sum(dv_response)) #group by the 10 groups
View(prof2)

Again, the lift is computed for each of the 10 groups by comparing the group's average response rate with the overall rate; an index above 100 marks a group that performs above the overall population.

nonmissing_perf<-within(prof2,
                        {res_rate<-res/cnt 
                        index<-res_rate/overall_resp_rate*100
                        percent<-cnt/overall_cnt
                        })  #add resp_rate,index,percent
View(nonmissing_perf)

#set missing_perf and non-missing_Perf together
View(missing_perf)
View(nonmissing_perf)
em_months_last_open_perf<-rbind(nonmissing_perf,missing_perf[,-1]) #set 2 data together
View(em_months_last_open_perf)


4 Missing value and outlier handling

Common rules of thumb by missing rate (a minimal imputation sketch follows the list):

1. Under 3%: delete directly, or impute with the median or mean.
2. 3%-20%: delete, or impute with kNN or EM regression.
3. 20%-50%: multiple imputation.
4. 50%-80%: treat the missing values as their own category.
5. Above 80%: discard the variable; the data is too unreliable and the analysis error would be very large.
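A minimal sketch of rule 1, assuming em_count_valid has a low missing rate (the variable choice and the em_count_valid_med column are illustrative):

miss_rate<-mean(is.na(train$em_count_valid))  #proportion of missing values
miss_rate  #compare against the 3% rule of thumb above
med<-median(train$em_count_valid,na.rm=T)  #median of the observed values
train$em_count_valid_med<-ifelse(is.na(train$em_count_valid),med,train$em_count_valid)  #median imputation
summary(train$em_count_valid_med)  #no NA's should remain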

Outliers are usually handled by capping (winsorization).

Numeric variables:

train$m2_em_count_valid <- ifelse(is.na(train$em_count_valid) == T, 2,      #when em_count_valid is missing ,then assign 2
                                  ifelse(train$em_count_valid <= 1, 1,         #when em_count_valid<=1 then assign 1
                                         ifelse(train$em_count_valid >=10, 10, #when em_count_valid>=10 then assign 10
                                                train$em_count_valid)))        #when 1<em_count_valid<10 and not missing then assign the raw value

summary(train$m2_em_count_valid)  #do a summary
summary(train$m1_EM_COUNT_VALID)  #do a summary
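The bounds above are fixed by business rules; a common alternative caps at percentiles. A sketch, with the 1%/99% cutoffs and the em_count_valid_cap column as assumptions:

p<-quantile(train$em_count_valid,c(0.01,0.99),na.rm=T)  #1st and 99th percentiles
train$em_count_valid_cap<-pmin(pmax(train$em_count_valid,p[1]),p[2])  #cap both tails
summary(train$em_count_valid_cap)  #check the capped distribution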

5 Model Fitting

Select the variables most valuable to the business:

library(picante)  #call picante
var_list<-c('dv_response','m1_POS_NUM_ORDERS_24MO',
            'm1_POS_NUM_ORDERS',
            'm1_SH_MNTHS_LAST_INQUIRED',
            'm1_POS_SP_QTY_24MO',
            'm1_POS_REVENUE_TOTAL',
            'm1_POS_LAST_ORDER_DPA',
            'm1_POS_MARGIN_TOTAL',
            'm1_pos_mo_btwn_fst_lst_order',
            'm1_POS_REVENUE_BASE',
            'm1_POS_TOT_REVPERSYS',
            'm1_EM_COUNT_VALID',
            'm1_EM_NUM_OPEN_30',
            'm1_POS_MARGIN_TOTAL_12MO',
            'm1_EX_AUTO_USED0005_X5',
            'm1_SH_INQUIRED_LAST3MO',
            'm1_EX_AUTO_USED0005_X789',
            'm1_HH_INCOME',
            'm1_SH_INQUIRED_LAST12MO',
            'm1_POS_LAST_TOTAL_REVENUE',
            'm1_EM_ALL_OPT_OUT_FLAG',
            'm1_POS_REVENUE_TOTAL_6MO',
            'm1_EM_MONTHS_LAST_OPEN',
            'm1_POS_MNTHS_LAST_ORDER',
            'm1_WEB_MNTHS_SINCE_LAST_SES')  #put the variables you want to do correlation analysis here

Build a correlation coefficient matrix and filter variables by their correlations to control collinearity, keeping one representative of each correlated cluster or recoding with dummy variables; for logistic regression, the information value (IV) can also be used to select variables (a sketch follows the correlation code below).

corr_var<-train[, var_list]  #select all the variables you want to do correlation analysis
str(corr_var)                #check the variable type
correlation<-data.frame(cor.table(corr_var,cor.method = 'pearson'))  #do the correlation
View(correlation)
cor_only=data.frame(row.names(correlation),correlation[, 1:ncol(corr_var)])  #select correlation result only
View(cor_only)
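As mentioned above, the IV can screen variables for logistic regression. A minimal sketch on the var_category bins built in section 3 (assuming nomissing is still in memory, every bin contains both responders and non-responders, and the iv_tbl name and cutoff readings are conventions, not from the original):

iv_tbl<-ddply(nomissing,.(var_category),summarise,res=sum(dv_response),non=sum(1-dv_response)) #responders/non-responders per bin
iv_tbl<-within(iv_tbl,{
  pct_res<-res/sum(res)          #share of all responders in the bin
  pct_non<-non/sum(non)          #share of all non-responders in the bin
  woe<-log(pct_res/pct_non)      #weight of evidence
  iv_bin<-(pct_res-pct_non)*woe  #bin-level IV contribution
})
sum(iv_tbl$iv_bin)  #total IV; a common reading: <0.02 useless, 0.1-0.3 medium, >0.3 strong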

After selecting the variables to place in the model:

var_list<-c('m1_WEB_MNTHS_SINCE_LAST_SES',
            'm1_POS_MNTHS_LAST_ORDER',
            'm1_POS_NUM_ORDERS_24MO',
            'm1_pos_mo_btwn_fst_lst_order',
            'm1_EM_COUNT_VALID',
            'm1_POS_TOT_REVPERSYS',
            'm1_EM_MONTHS_LAST_OPEN',
            'm1_POS_LAST_ORDER_DPA'
)   #put the variables you want to try in model here

Fit the model on unstandardized data first, then use stepwise regression to filter variables:

mods<-train[,c('dv_response',var_list)]  #select Y and variables you want to try
str(mods)

(model_glm<-glm(dv_response~.,data=mods,family =binomial(link ="logit")))  #logistic model
model_glm
#Stepwise
library(MASS)
model_sel<-stepAIC(model_glm,direction ="both")  #using both backward and forward stepwise selection
model_sel
model_sel_summary<-summary(model_sel)  #summarize the selected model (avoid shadowing base summary())
model_summary<-data.frame(var=rownames(model_sel_summary$coefficients),model_sel_summary$coefficients) #do the model summary
View(model_summary)

Model the standardized data next; standardization makes it easy to compare how strongly each variable influences y.

#variable importance
#standardize variable

#?scale
mods2<-scale(train[,var_list],center=T,scale=T) 
mods3<-data.frame(dv_response=c(train$dv_response),mods2[,var_list])
# View(mods3)
(model_glm2<-glm(dv_response~.,data=mods3,family =binomial(link ="logit")))  #logistic model
(summary2<-summary(model_glm2)) 
model_summary2<-data.frame(var=rownames(summary2$coefficients),summary2$coefficients) #do the model summary
View(model_summary2)
model_summary2_f<-model_summary2[model_summary2$var!='(Intercept)',]
model_summary2_f$contribution<-abs(model_summary2_f$Estimate)/(sum(abs(model_summary2_f$Estimate)))
View(model_summary2_f)


6 Model Evaluation

Check the VIF values from a regression fit (a common rule of thumb flags VIF above 5-10 as serious collinearity):

#Variable VIF
fit <- lm(dv_response~., data=mods)  #regression model
#install.packages('car')  #install the car package to calculate VIF
require(car)  #load car
vif=data.frame(vif(fit)) #get Vif
var_vif=data.frame(var=rownames(vif),vif) #get variables and corresponding Vif
View(var_vif)

Produce the correlation coefficient matrix:

#variable correlation
cor_tbl<-data.frame(cor.table(mods,cor.method = 'pearson')) #calculate the correlation (avoid shadowing base cor())
correlation<-data.frame(variables=rownames(cor_tbl),cor_tbl[, 1:ncol(mods)]) #keep the correlation columns only
View(correlation)

Finally, draw the ROC curve for the model and examine its performance.

library(ROCR)

#### test data####
pred_prob<-predict(model_glm,test,type='response') #predict Y
pred_prob
pred<-prediction(pred_prob,test$dv_response) #put predicted Y and actual Y together
pred@predictions  #the predicted probabilities stored in the prediction object
str(pred)         #inspect the S4 object (View() does not work on S4 objects)
perf<-performance(pred,'tpr','fpr')   #performance: true positive rate vs. false positive rate
perf
par(mar=c(5,5,2,2),xaxs = "i",yaxs = "i",cex.axis=1.3,cex.lab=1.4)  #set the graph parameter

#AUC value
auc <- performance(pred,"auc")
unlist(slot(auc,"y.values"))

#plotting the ROC curve
plot(perf,col="black",lty=3, lwd=3,main='ROC Curve')
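A diagonal reference line (an addition, not in the original) makes the ROC plot easier to read; it marks a random classifier:

abline(a=0,b=1,lty=2)  #random-classifier baseline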


#plot Lift chart
perf<-performance(pred,'lift','rpp')  #lift vs. rate of positive predictions
plot(perf,col="black",lty=3, lwd=3,main='Lift Chart')


7 Overall user-group lift chart


pred<-predict(model_glm,train,type='response')   #Predict Y
decile<-cut(pred,unique(quantile(pred,(0:10)/10)),labels=10:1, include.lowest = T)  #Separate into 10 groups
sum<-data.frame(actual=train$dv_response,pred=pred,decile=decile) #put actual Y,predicted Y,Decile together

decile_sum<-ddply(sum,.(decile),summarise,cnt=length(actual),res=sum(actual)) #group by decile
decile_sum2<-within(decile_sum,
                     {res_rate<-res/cnt 
                    index<-100*res_rate/(sum(res)/sum(cnt))
                     })  #add resp_rate,index
decile_sum3<-decile_sum2[order(decile_sum2[,1],decreasing=T),] #order decile
View(decile_sum3)

The deciles split the customer base into groups of equal record counts; reading across the ten groups shows the lift of the actual response rate, with the top deciles performing well above the overall average. Finally, output the regression equation:

ss <- summary(model_glm)  #put model summary together
ss
which(names(ss)=="coefficients")

#XBeta
#Y = 1/(1+exp(-XBeta))
#output the model equation
gsub("\\+-","-",gsub('\\*\\(Intercept)','',paste(ss[["coefficients"]][,1],rownames(ss[["coefficients"]]),collapse = "+",sep = "*"))) 
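As the comments above note, Y = 1/(1+exp(-XBeta)); a quick sketch confirming that applying the sigmoid to the linear predictor reproduces predict(type='response'):

xbeta<-predict(model_glm,train,type='link')  #XBeta, the linear predictor
manual<-1/(1+exp(-xbeta))                    #apply the logistic function by hand
head(cbind(manual,predict(model_glm,train,type='response')))  #the two columns should match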

