R language 26-Prosper loan data analysis 2

Univariate analysis

First, we analyze the platform's basic customer information, including location, credit status, and reasons for borrowing, to understand the general characteristics and tendencies of the target customers:

  • Region Distribution:
library(ggplot2)
ggplot(data=subset(data,!data$BorrowerState==""),
       aes(x=BorrowerState))+geom_bar(fill="pink",color="black")+
  theme(axis.text = element_text(size = 5) )

Figure 1
The chart shows that the company's customers are concentrated in California, New York, Florida, Texas, and Illinois, which are well ahead of the other states. It may be worth increasing marketing in the remaining states to develop new customers. Prosper is headquartered in San Francisco, which may also explain why California has the largest number of users.
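To back up the reading of the bar chart with numbers, the state counts can also be tabulated directly. The following is a minimal sketch on a small made-up vector of state codes; the `states` values are hypothetical stand-ins for `data$BorrowerState`, not real Prosper data.

```r
# Hypothetical stand-in for data$BorrowerState; "" marks a missing state.
states <- c("CA", "NY", "CA", "TX", "", "FL", "CA", "NY", "IL", "TX", "CA")

# Drop the empty entries, count borrowers per state, and sort descending.
state_counts <- sort(table(states[states != ""]), decreasing = TRUE)

# Keep the three states with the most borrowers.
top_states <- head(state_counts, 3)
print(top_states)
```

On the real data, replacing `states` with `data$BorrowerState` should give the ranking that the bar chart shows visually.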

  • Delinquency analysis:
# DelinquenciesLast7Years is numeric, so missing values are filtered with is.na()
ggplot(data=subset(data,!is.na(DelinquenciesLast7Years)),
       aes(x=DelinquenciesLast7Years))+geom_bar(fill="orange",color="black")+
  theme(axis.text = element_text(size = 5) )+scale_x_continuous(limits = c(-1,50))

Figure 2

  • Customer employment situation:
ggplot(aes(EmploymentStatus),data = subset(data,!(data$EmploymentStatus==""))) + 
  geom_bar(color="black",fill=I("#B2DFEE"),width = 0.5) +
  theme(axis.text.x=element_text(angle = 90,hjust = 1,vjust=0,size=8))

Figure 3
Most of the platform's customers are employed or working full-time, meaning they have jobs and a stable income.

  • Customer credit inquiries:
# Helper: histogram of the column named by varname.
# Note: aes_string() is deprecated in recent ggplot2 releases but still works.
bar_plot <- function(varname, binwidth) {
  return(ggplot(aes_string(x = varname), data = data) + geom_histogram(binwidth = binwidth))
}

# Zoom the x-axis to the 95th percentile and mark it with a dashed red line
bar_plot('InquiriesLast6Months',1)+
  coord_cartesian(xlim=c(0,quantile(data$InquiriesLast6Months,probs = 0.95,
                                    na.rm = TRUE)))+
  geom_vline(xintercept = quantile(data$InquiriesLast6Months, 
                                     probs = 0.95, na.rm = TRUE), 
             linetype = "dashed", color = "red")+
  theme(panel.background =element_rect(fill="white"))

Figure 4
The number of recent credit inquiries reflects how many times the borrower has recently applied for credit; to some extent, more inquiries suggest a more urgent need for funds. The figure shows that 95% of customers had fewer than 5 inquiries in the last six months.
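The dashed line in the plot is the 95th percentile, which can also be computed on its own, together with the share of customers below a given threshold. This is a minimal sketch on made-up numbers; the `inquiries` vector is a hypothetical stand-in for `data$InquiriesLast6Months`.

```r
# Hypothetical stand-in for data$InquiriesLast6Months, including one NA.
inquiries <- c(0, 0, 1, 1, 1, 2, 2, 3, 4, NA, 12)

# 95th percentile, ignoring missing values (the red dashed line in the plot).
cutoff <- quantile(inquiries, probs = 0.95, na.rm = TRUE)

# Share of non-missing customers with fewer than 5 inquiries.
share_below_5 <- mean(inquiries < 5, na.rm = TRUE)

print(cutoff)
print(share_below_5)
```

The same two calls should work for any numeric column of the dataset, since `quantile()` and `mean()` both accept `na.rm = TRUE`.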

  • Customer debt-to-income ratio:
# Histogram of the debt-to-income ratio, zoomed to the 95th percentile
bar_plot('DebtToIncomeRatio',0.04)+
  coord_cartesian(xlim=c(0,quantile(data$DebtToIncomeRatio,probs = 0.95,
                                    na.rm = TRUE)))+
  geom_vline(xintercept = quantile(data$DebtToIncomeRatio, 
                                     probs = 0.95, na.rm = TRUE), 
             linetype = "dashed", color = "red")+
  theme(panel.background =element_rect(fill="white"))

Figure 5
The lower the debt-to-income ratio, the greater the borrower's ability to repay the loan. For 95% of the platform's customers the ratio is below 0.5, so overall the customers' debt-to-income ratio is relatively low.

  • Customer's monthly income:
bar_plot('StatedMonthlyIncome',425)+
  scale_x_continuous(limits = (c(0,15000)),breaks = seq(0,15000,500))+
  geom_vline(xintercept = 5000, linetype = "dashed", color = "red")+
  geom_vline(xintercept = 3000, linetype = "dashed", color = "red")+
  theme(panel.background =element_rect(fill="white"))+
  theme(axis.text.x=element_text(angle = 90,hjust = 1,vjust=0,size=8))

Figure 6
As the figure shows, most borrowers have a monthly income between 3,000 and 5,000 dollars.
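A claim like this can be checked numerically by computing the fraction of borrowers inside the income band. The sketch below uses a made-up `income` vector as a hypothetical stand-in for `data$StatedMonthlyIncome`; the numbers are illustrative only.

```r
# Hypothetical stand-in for data$StatedMonthlyIncome (US dollars).
income <- c(2500, 3200, 3800, 4100, 4500, 4900, 5200, 6000, 8300, 15000)

# Fraction of borrowers whose stated income lies in [3000, 5000].
share_in_band <- mean(income >= 3000 & income <= 5000)

# The median is less sensitive than the mean to a few very high incomes.
income_summary <- c(share = share_in_band,
                    median = median(income),
                    mean = mean(income))
print(income_summary)
```

Comparing the median with the mean also gives a quick sense of how much the long right tail of the income distribution pulls the average upward.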

  • Loan reasons:
ggplot(data,aes(x=ListingCategory..numeric.))+
  geom_bar(color="black",fill=I("#70DBDB"))+scale_x_continuous(breaks = c(0:20))+scale_y_sqrt()

Figure 7
This analysis shows that loans are mainly concentrated in categories 1, 0, and 7. Because the meaning of each category code is not given here, the specific purpose of the loans is not clear; it can be looked up in the full documentation of the dataset.

  • Platform User creditworthiness (rating / score):
library(gridExtra)
# Convert the rating columns to ordered factors so the bars run from AA to HR
data$creditlevel <- factor(data$creditlevel,ordered=TRUE,levels = c("AA","A","B","C","D","E","HR"))
data$CreditGrade <- factor(data$CreditGrade,ordered=TRUE,levels = c("AA","A","B","C","D","E","HR"))
data$ProsperRating..Alpha. <- factor(data$ProsperRating..Alpha.,ordered=TRUE,
                                     levels = c("AA","A","B","C","D","E","HR"))

p1 <- ggplot(data,aes(x=creditscore))+
  geom_histogram(binwidth=20,color="black",fill=I("#DBDB70"))+
  scale_x_continuous(limits = c(400,900))

p2 <- ggplot(data=subset(data,data$CreditGrade!=""& data$CreditGrade!="NC"),aes(x=CreditGrade))+
  geom_bar(color="black",fill=I("#7093DB"))+
  xlab("creditlevel(pre2009)")

p3 <- ggplot(data=subset(data,data$ProsperRating..Alpha.!=""),
             aes(x=ProsperRating..Alpha.))+
  geom_bar(color="black",fill=I("#E9C2A6"))+
  xlab("creditlevel(after2009)")

p4 <- ggplot(data=subset(data,!is.na(data$creditlevel)),aes(x=creditlevel))+
  geom_bar(color="black",fill=I("#EAADEA"))
grid.arrange(p1,p2,p3,p4,ncol = 1)

Figure 8
The plots of customers' credit scores and ratings show a roughly normal distribution. Credit scores fall mainly in the 650-750 range, and credit ratings are concentrated in B, C, and D. After 2009, the separation between the top-end A and AA users and the bottom-end E and HR users is clearer.
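The rating plots rely on the ratings being ordered factors, so that the bars appear from AA to HR rather than alphabetically. The following is a minimal sketch of that behavior on a made-up `ratings` vector (hypothetical values, not Prosper data).

```r
# Rating scale from best (AA) to worst (HR), as used in the plots above.
rating_levels <- c("AA", "A", "B", "C", "D", "E", "HR")

# Hypothetical ratings; "" stands for a missing rating.
ratings <- c("C", "B", "HR", "AA", "C", "D", "B", "C", "")

# Values that are not listed in `levels` (here "") become NA.
ratings_f <- factor(ratings, levels = rating_levels, ordered = TRUE)

# Counts come back in level order, including levels with zero observations.
rating_counts <- table(ratings_f)
print(rating_counts)

# Ordered factors support comparisons: ratings ranked before "C" (AA, A, B).
n_before_c <- sum(ratings_f < "C", na.rm = TRUE)
print(n_before_c)
```

Because the first level is treated as the lowest, `ratings_f < "C"` selects the better ratings AA, A, and B in this ordering.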

Origin blog.csdn.net/xiuxiuxiu666/article/details/104246663