Visual analysis of loan data from various countries (with Python code)

Article directory

Dataset introduction (with access link)

1. Data preview

2. Data visualization with maps

1. Make a thematic map to show the loan status of different countries (average loan amount)

2. The map shows the poverty index in different regions

3. Use Basemap to plot the 7 regions with the highest funding amount in India (from high to low)

3. Data visualization (bar chart, line chart, treemap, pie chart, heat map)

1. Loan purpose

2. Repayment method

3. Loan needs (number and amount of loans by country)

4. Make a line chart to show the loan amount and approval amount, as well as the changing trend of the ratio of the two

5. View the distribution of all loan amounts

6. See the ratio of men to women among all borrowers

7. Look at the average loan amount for men and women

Summary


Dataset introduction (with access link)

Dataset link: https://pan.baidu.com/s/1g9HvEpBuanQMNlz6u6Hl8g  
Extraction code: 2tv9 

data_kiva_loans.csv: main loan information (loan amount, borrower gender, repayment method, country and region, etc.);

data_kiva_mpi_region_locations.csv: regional location data (latitude and longitude of the loan location and other information);

data_loan_themes_by_region.csv: loan data broken down by region (poverty index MPI, world region names, and other information)


1. Data preview

Import package: 

# Import the packages used below
import pandas as pd
import numpy as np
import squarify
import seaborn  # statistical plotting library built on matplotlib
import matplotlib.pyplot as plt
import plotly.offline
import plotly.graph_objs
from mpl_toolkits.basemap import Basemap
from numpy import array

# Suppress warning messages
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline
# The %matplotlib inline magic embeds matplotlib figures directly in the notebook,
# so plt.show() is not needed to display them

Import Data:

kiva_loans_data = pd.read_csv("data_kiva_loans.csv")  # main loan data
kiva_mpi_locations_data = pd.read_csv("data_kiva_mpi_region_locations.csv")  # regional location data
loan_themes_by_region_data = pd.read_csv("data_loan_themes_by_region.csv")  # loan data broken down by region

# Print the shape (rows, columns) of each dataset
print("kiva_loans_data shape:",kiva_loans_data.shape)
print("kiva_mpi_locations_data shape:",kiva_mpi_locations_data.shape)
print("loan_themes_by_region_data shape:",loan_themes_by_region_data.shape)

Use info() to display basic information about the DataFrame:

kiva_loans_data.info()  # index range, column names, non-null counts, column dtypes, and memory usage

Use describe() to view descriptive statistics for the DataFrame:

kiva_loans_data.describe()  # by default summarizes int and float columns: non-null count, mean, standard deviation, min, quartiles, max

describe(include=["O"]) summarizes the object-type columns instead: non-null count, number of distinct values, the most frequent value, and how often it occurs.

kiva_loans_data.describe(include=["O"])
# Note: the character inside the quotes is the uppercase letter O, not the digit 0
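
To make the four rows of that object summary concrete, here is a minimal sketch on a toy table. The column names and values are made up for illustration and are not taken from the Kiva files; the columns are created with dtype=object explicitly so that include=["O"] selects them.

```python
import pandas as pd

# Toy data (hypothetical values, not from the Kiva dataset)
toy = pd.DataFrame({
    "country": pd.Series(["Kenya", "Kenya", "Peru", None], dtype=object),
    "sector": pd.Series(["Food", "Retail", "Food", "Food"], dtype=object),
    "amount": [100, 250, 300, 50],
})

# Only the object columns are summarized; the numeric "amount" column is skipped
summary = toy.describe(include=["O"])
print(summary)
```

The rows are count (non-null values), unique (distinct values), top (most frequent value), and freq (occurrences of the top value). The missing country is excluded from count.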

2. Data visualization with maps

1. Make a thematic map to show the loan status of different countries (average loan amount)

# Compute the average approved amount per country, sorted from high to low
# (国家 = country, 审批金额 = approved amount)
countries_funded_amount=kiva_loans_data.groupby('国家')['审批金额'].mean().sort_values(ascending=False)

plt.figure(figsize=(15,8),dpi=600)
# Describe the trace: pass in the values to plot
data=[dict(
      type='choropleth',
      locations=countries_funded_amount.index,
      locationmode='country names',
      z=countries_funded_amount.values,
      colorscale='Reds',
      colorbar=dict(title='Average loan amount (USD)'))]

# layout defines the figure layout
layout=dict(title='Loans by country on Kiva')
# Combine the data and layout parts into a figure dict
fig=dict(data=data,layout=layout)
# validate=False skips plotly's validation of the figure, so invalid properties do not raise an error
plotly.offline.iplot(fig,validate=False)
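
The per-country average that feeds the map can be checked on a small table first. The sketch below uses made-up English column names and values (they are assumptions for illustration, not the dataset's real headers) and selects the amount column before calling mean(), so non-numeric columns are never averaged.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "country": ["Kenya", "Peru", "Kenya", "Peru", "India"],
    "funded_amount": [100.0, 300.0, 200.0, 500.0, 50.0],
})

# Average funded amount per country, highest first
avg_by_country = (
    toy.groupby("country")["funded_amount"]
       .mean()
       .sort_values(ascending=False)
)
print(avg_by_country)
```

The index of the result holds the country names and the values hold the averages, which is the shape the choropleth trace expects for locations and z.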

Result (figure): map visualization

2. The map shows the poverty index in different regions

data=[dict(
      type="scattergeo",  # scatter plot on geographic coordinates
      lat = kiva_mpi_locations_data["纬度"],  # latitude
      lon = kiva_mpi_locations_data["经度"],  # longitude
      text = kiva_mpi_locations_data["位置名称"],  # location name to display
      marker =dict(
           color = kiva_mpi_locations_data['贫苦指数MPI'],  # color each point by its poverty index (MPI)
           colorbar=dict(title="Multidimensional Poverty Index (MPI)")
      ))]
layout=dict(title = 'Poverty index by region')
fig = dict(data=data, layout=layout)
plotly.offline.iplot(fig)

Result (figure): scatter map of the poverty index (MPI)

3. Use Basemap to plot the 7 regions with the highest funding amount in India (from high to low)

# Show the 7 regions in India with the highest funding amount
temp = pd.DataFrame(loan_themes_by_region_data[loan_themes_by_region_data["国家"]=='India'])
print("The 7 regions in India with the highest funding amount (from high to low)")
top_cities = temp.sort_values(by="金额",ascending=False)  # 金额 = amount
top7_cities=top_cities.head(7)
top7_cities

 

# Plot the top 7 funded regions on a map of India
plt.figure(figsize=(20,15),dpi=600)
seaborn.set_theme(style="whitegrid",font='SimHei',font_scale=1)  # set font and font scale
# Draw the map with Basemap
# projection sets the map projection; projection='lcc' (Lambert conformal conic) suits a regional map bounded by latitude and longitude
# resolution sets the level of boundary detail, e.g. 'l' (low); it is not set here
# llcrnrlon, llcrnrlat, urcrnrlon, urcrnrlat are the longitude and latitude of the lower-left and upper-right corners
# lat_0 and lon_0 set the center of the projection
india_map=Basemap(width=4500000,height=900000,projection="lcc",
                   llcrnrlon=67,llcrnrlat=5,urcrnrlon=99,urcrnrlat=37,lat_0=28,lon_0=77)
india_map.drawmapboundary()  # fillcontinents() and drawmapboundary() can also fill colors
india_map.drawcountries()
india_map.drawcoastlines()  # draw the coastlines
lg=array(top7_cities["经度"])  # longitude
lt=array(top7_cities["纬度"])  # latitude
pt=array(top7_cities["金额"])  # amount
nc=array(top7_cities["地区"])  # region name
x,y= india_map(lg,lt)  # convert longitude/latitude to map projection coordinates
# Min-max normalization: amount_sizes = (value - min) / (max - min) * 100
a=int(top7_cities["金额"].max())
b=int(top7_cities["金额"].min())
amount_sizes=top7_cities["金额"].apply(lambda x:int(x-b)/(a-b)*100)
# Set marker position, size, shape, and color
plt.scatter(x,y,s=amount_sizes,marker="o",c=amount_sizes)
# Label each point with its region name
for ncs,xpt,ypt in zip(nc,x,y):
    plt.text(xpt+60000,ypt+30000,ncs)
plt.title("The 7 regions in India with the highest funding amount")
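
The marker sizes above come from min-max normalization, which maps the smallest amount to 0 and the largest to 100. A marker of size 0 is not visible, so the lowest-ranked region can disappear from the plot. The following is a minimal sketch of the same scaling in plain Python; the helper name min_max_scale and its low/high parameters are hypothetical and only illustrate how a non-zero lower bound keeps every point visible.

```python
def min_max_scale(values, low=0.0, high=100.0):
    """Linearly rescale values so that min -> low and max -> high."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # All values are equal: place them in the middle of the target range
        return [(low + high) / 2 for _ in values]
    span = largest - smallest
    return [low + (v - smallest) / span * (high - low) for v in values]

amounts = [500, 1500, 2500]  # toy amounts, not from the dataset
sizes = min_max_scale(amounts)                    # smallest marker has size 0
visible_sizes = min_max_scale(amounts, 20, 120)   # smallest marker stays visible
print(sizes, visible_sizes)
```

Passing visible_sizes as the s argument of plt.scatter would keep the ranking of the markers while avoiding a zero-size point.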

 

3. Data visualization (bar chart, line chart, treemap, pie chart, heat map)

1. Loan purpose

In the dataset, the loan purpose is stored in the 贷款领域一级分类 column (level-1 loan sector classification).

Use seaborn's barplot to draw a multicolored bar chart, and set the font and figure size; see the code for details.

sector_num = kiva_loans_data['贷款领域一级分类'].value_counts()  # count the loans in each level-1 sector
plt.figure(figsize=(15,8),dpi=600)  # figure size in inches and resolution in dots per inch
plt.rcParams['font.sans-serif']=['SimHei']  # font that can render Chinese labels
seaborn.barplot(x=sector_num.values, y=sector_num.index)  # horizontal bar chart
# Annotate each bar with its count; enumerate() yields (position, count) pairs
for i, v in enumerate(sector_num.values):
    plt.text(v,i,v)  # x position, y position, and the text to draw
plt.xticks(rotation='vertical')  # rotate the x-axis tick labels to vertical
plt.xlabel('Count')
plt.ylabel('Loan sector (level-1 classification)')
plt.title('Distribution of loan sectors (level-1 classification)')

Agriculture, food, and retail account for the most loans, which is consistent with the dataset containing many underdeveloped and developing countries.

Create a treemap to view the distribution of the level-1 loan sector classification.

Use squarify to draw the treemap:
sizes: the value for each category of the discrete variable, which determines the area of each block in the treemap;
label: the label to show on each block;
value: the numeric label to show on each block;

plt.figure(figsize=(15,8),dpi=600)

squarify.plot(sizes=sector_num.values,label=sector_num.index, value=sector_num.values)
plt.title('Loan counts by level-1 sector')

Result: in the level-1 classification, agriculture, food, retail, and other household-related uses are the most common loan sectors.

2. Repayment method

count = kiva_loans_data['还款方式'].value_counts()  # 还款方式 = repayment method
seaborn.set_theme(style="whitegrid",font='SimHei',font_scale=5)  # set font and font scale before plotting
plt.figure(figsize=(50,20),dpi=600)
seaborn.barplot(x=count.values, y=count.index)
for i, v in enumerate(count.values):
    plt.text(v,i,v,fontsize=19)
plt.xlabel('Count')
plt.ylabel('Repayment method')
plt.title("Distribution of repayment methods")

Result: monthly and irregular repayments account for the majority of loans. Monthly repayment is the most common method and weekly repayment is the least common.

Draw a point plot to show how the loan amount for each repayment method changes over time:

# Convert the object-type date column to pandas datetime
kiva_loans_data['日期']=pd.to_datetime(kiva_loans_data['日期'])

# Reduce each date to a monthly period (year-month)
kiva_loans_data['日期(年月)']=kiva_loans_data['日期'].dt.to_period("M")

plt.figure(figsize=(15,8),dpi=600)
seaborn.set_theme(style="whitegrid",font='SimHei',font_scale=2)  # set font and font scale
# hue draws one colored line per repayment method
# pointplot shows an estimate of central tendency (the mean by default) for each x value,
# with error bars indicating the uncertainty of that estimate
g1=seaborn.pointplot(x='日期(年月)',y='贷款金额',
                    data=kiva_loans_data,hue='还款方式')
g1.set_xticklabels(g1.get_xticklabels(),rotation=90)  # rotate the x-axis labels by 90 degrees
g1.set_title("Loan amount by year and month")
g1.set_xlabel("Time (year-month)")
g1.set_ylabel("Loan amount (USD)")
# Repayment methods: monthly, irregular, bullet (single lump-sum repayment), weekly
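
The date handling in this step can be tried on a few rows before running it on the full dataset. This is a minimal sketch with made-up dates, amounts, and English column names (all assumptions for illustration); it converts strings to datetimes, reduces them to monthly periods, and computes the per-month mean, which is the quantity the point plot draws by default.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "date": ["2014-01-05", "2014-01-20", "2014-02-03"],
    "loan_amount": [100.0, 300.0, 250.0],
})

toy["date"] = pd.to_datetime(toy["date"])          # strings -> datetime64
toy["year_month"] = toy["date"].dt.to_period("M")  # datetime -> monthly period

# Mean loan amount per month
monthly_mean = toy.groupby("year_month")["loan_amount"].mean()
print(monthly_mean)
```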

3. Loan needs (number and amount of loans by country)

# Take the 10 countries that appear most often
count = kiva_loans_data['国家'].value_counts().head(10)

seaborn.set_theme(style="whitegrid",font='SimHei',font_scale=2)  # set font and font scale before plotting
plt.figure(figsize=(15,8),dpi=600)
seaborn.barplot(x=count.values, y=count.index)
for i, v in enumerate(count.values):
    plt.text(v,i,v,fontsize=19)
plt.xlabel('Number of loans')
plt.ylabel('Country')
plt.title("Loan demand (by country)")

 

The Philippines has the largest demand for loans, and the countries with the highest demand are developing and underdeveloped countries.

Word cloud:

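# Handle missing values: keep only the non-missing entries of the country column
names = kiva_loans_data['国家'][~pd.isnull( kiva_loans_data['国家'])]
myText0=' '.join(names)
from wordcloud import WordCloud  # word cloud
# max_font_size caps the largest font in the cloud; width and height set the canvas size
# generate(text) builds the word cloud from the text
wordcloud = WordCloud(max_font_size=50, width=600, height=300).generate(myText0)
plt.figure(figsize=(15,8),dpi=600)
plt.imshow(wordcloud)  # display the generated word cloud
plt.title("Word cloud")
plt.axis('off')  # hide the axes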
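
Joining the country names with spaces means that multi-word names may be broken into separate words when the text is tokenized. One possible alternative, sketched below with made-up names, is to count whole names first; the resulting mapping could then be passed to WordCloud.generate_from_frequencies instead of generate. The sketch itself only uses the standard library and does not draw anything.

```python
from collections import Counter

# Toy country column with one missing value (hypothetical data)
countries = ["Philippines", "Kenya", "El Salvador", "Philippines",
             None, "El Salvador", "El Salvador"]

names = [c for c in countries if isinstance(c, str)]  # drop missing values

# Whole-name frequencies: each country stays a single entry
frequencies = Counter(names)

# For comparison: splitting the joined text breaks "El Salvador" in two
split_tokens = Counter(" ".join(names).split())

print(frequencies.most_common())
print(split_tokens.most_common())
```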

Draw a heat map to show how the loan amount in each country changes over time:

# Extract the year with dt.year
kiva_loans_data['年份']=kiva_loans_data['日期'].dt.year
# Group by country and year, and sum the loan amount
loan=kiva_loans_data.groupby(['国家','年份'])['贷款金额'].sum().unstack()  # unstack() pivots the years into columns

plt.figure(figsize=(15,8),dpi=600)
loan=loan.sort_values(by=2014,ascending=False)  # sort countries by their 2014 total, descending
loan=loan.fillna(0)  # fill missing country-year combinations with 0
temp=seaborn.heatmap(loan,cmap='Reds')  # draw the heat map
plt.show()
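
The table behind the heat map has one row per country and one column per year. A minimal sketch of that reshaping on toy data follows; the English column names and the values are assumptions for illustration only.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "country": ["Kenya", "Kenya", "Peru", "Peru", "Kenya"],
    "year": [2014, 2015, 2014, 2014, 2014],
    "loan_amount": [100, 200, 300, 50, 25],
})

# Sum per (country, year), then pivot the years into columns
table = toy.groupby(["country", "year"])["loan_amount"].sum().unstack()

# Sort by the 2014 column and replace missing combinations with 0
table = table.sort_values(by=2014, ascending=False).fillna(0)
print(table)
```

Peru has no loans in 2015 in this toy table, so that cell is missing after unstack() and becomes 0 after fillna(0).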

4. Make a line chart to show the loan amount and approval amount, as well as the changing trend of the ratio of the two

# Convert the application time to pandas datetime
kiva_loans_data["申请时间"]=pd.to_datetime(kiva_loans_data["申请时间"])
# Use the application time as the index
kiva_loans_data.index=pd.to_datetime(kiva_loans_data["申请时间"])

plt.figure(figsize=(15,8),dpi=600)
# Aggregate the loan amount by week
ax=kiva_loans_data["贷款金额"].resample("W").sum().plot()
# Aggregate the approved amount by week
ax=kiva_loans_data["审批金额"].resample("W").sum().plot()
ax.set_xlabel("Time")
ax.set_ylabel("Amount ($)")
# Set the range of the x axis
ax.set_xlim((pd.to_datetime(kiva_loans_data["申请时间"].min()),
             pd.to_datetime(kiva_loans_data["申请时间"].max())))
# Set the legend labels for the two lines
ax.legend(["Requested amount","Approved amount"])
plt.title("Trend of requested and approved amounts")

 

Change in the ratio of approved amount to requested amount over time:

plt.figure(figsize=(15,8),dpi=600)
# Compute the weekly ratio of approved amount to requested amount
ax=(kiva_loans_data["审批金额"].resample("W").sum()/kiva_loans_data["贷款金额"].resample("W").sum()).plot()

ax.set_ylabel("Ratio of approved to requested amount")
ax.set_xlabel("Time")
# Set the range of the x axis
ax.set_xlim((pd.to_datetime(kiva_loans_data["申请时间"].min()),
             pd.to_datetime(kiva_loans_data["申请时间"].max())))
#ax.set_ylim(0,1)
plt.title("Approval ratio")
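
The weekly aggregation and the ratio can be verified on a handful of rows. The sketch below uses made-up dates, amounts, and English column names (assumptions for illustration); resample("W") groups rows into weeks ending on Sunday.

```python
import pandas as pd

# Toy data indexed by application time (hypothetical values)
index = pd.to_datetime(["2014-01-06", "2014-01-08", "2014-01-14", "2014-01-15"])
toy = pd.DataFrame(
    {"loan_amount": [100, 300, 200, 200],
     "funded_amount": [100, 200, 200, 200]},
    index=index,
)

# Weekly totals, then the share of the requested amount that was approved
weekly = toy.resample("W").sum()
weekly["funded_ratio"] = weekly["funded_amount"] / weekly["loan_amount"]
print(weekly)
```

Summing within each week before dividing gives the ratio of weekly totals, which is what the plot above shows; it is not the same as averaging per-loan ratios.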

5. View the distribution of all loan amounts

plt.figure(figsize=(15,8),dpi=600)
# Take the approved-amount column and sort it from small to large
plt.scatter(range(len(kiva_loans_data['审批金额'])),np.sort(kiva_loans_data['审批金额'].values))
plt.xlabel('Index')
plt.ylabel('Loan amount')
plt.title("Distribution of loan amounts")

Result: most loans are small (0 to 20,000), and very few loans exceed 20,000.
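
The claim that most loans fall below a threshold can also be stated as a number rather than read off the plot. This is a minimal sketch in plain Python on made-up amounts; the helper name share_at_or_below is hypothetical.

```python
def share_at_or_below(amounts, threshold):
    """Return the fraction of amounts that are <= threshold."""
    if not amounts:
        raise ValueError("amounts must not be empty")
    return sum(1 for a in amounts if a <= threshold) / len(amounts)

# Toy amounts (hypothetical values, not from the dataset)
amounts = [25, 100, 250, 500, 1200, 3000, 50000]

sorted_amounts = sorted(amounts)          # the y values of the sorted scatter plot
share = share_at_or_below(amounts, 20000)
print(f"{share:.1%} of the toy loans are at or below 20,000")
```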

6. See the ratio of men to women among all borrowers

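gender_list = []  # collects one gender entry per individual borrower
for gender in kiva_loans_data["贷款人性别"].values:
    if str(gender) != "nan":  # skip missing values
        # split(",") separates the borrowers of a group loan, strip() removes surrounding spaces,
        # and extend() appends all resulting entries to the list
        gender_list.extend([lst.strip() for lst in gender.split(",")])
temp_data = pd.Series(gender_list).value_counts()
temp_data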

labels=np.array(temp_data.index)  # gender labels
sizes=np.array((temp_data.values/temp_data.values.sum())*100)  # share of each gender in percent
plt.figure(figsize=(15,8),dpi=600)
# Build the pie trace
trace=plotly.graph_objs.Pie(labels=labels,values=sizes)
layout=plotly.graph_objs.Layout(title='Borrower gender')
# Put the trace in a list
data=[trace]
# Combine the data and layout parts into a Figure object
fig=plotly.graph_objs.Figure(data=data,layout=layout)
# plotly.offline.iplot embeds the figure in the notebook
plotly.offline.iplot(fig)

Result: blue marks female borrowers and red marks male borrowers. Female borrowers far outnumber male borrowers.
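
The counting step relies on each cell holding a comma-separated list of genders, one entry per borrower in the loan. The sketch below reproduces that logic with the standard library on made-up cells; the function name count_genders is hypothetical.

```python
from collections import Counter

def count_genders(cells):
    """Count individual borrowers by gender across comma-separated cells."""
    counts = Counter()
    for cell in cells:
        if not isinstance(cell, str):
            continue  # skip missing values such as NaN or None
        counts.update(part.strip() for part in cell.split(",") if part.strip())
    return counts

# Toy column: single borrowers, one group loan, and two missing cells
cells = ["female", "female, female, male", float("nan"), "male", None]

counts = count_genders(cells)
total = sum(counts.values())
shares = {gender: n * 100 / total for gender, n in counts.items()}
print(counts, shares)
```

A group loan with three borrowers contributes three entries, so the shares describe individual borrowers rather than loans.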

7. Look at the average loan amount for men and women

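loan_sex1=kiva_loans_data[['贷款人性别','审批金额']]

# Gender handling: if n borrowers share one loan, split it into n rows.
# str.split(',', expand=True) splits each string on commas into separate columns (a DataFrame);
# stack() then reshapes those columns into one long Series
loan_sex2=loan_sex1["贷款人性别"].str.split(',',expand=True).stack()

# reset_index(level=1, drop=True) discards the inner index level created by stack(),
# keeping only the original row index
loan_sex3=loan_sex2.reset_index(level=1,drop=True).rename('贷款人性别1')

# drop(..., axis=1) removes the original gender column (axis=1 operates on columns)
# join merges the two objects on their index (left join by default)
loan_sex4=loan_sex1.drop('贷款人性别',axis=1).join(loan_sex3)

# Use the .str accessor's strip() to remove leading and trailing spaces
loan_sex4['贷款人性别1']=loan_sex4['贷款人性别1'].str.strip()

loan_sex5=loan_sex4.index.value_counts().sort_index()  # number of borrowers per loan, sorted by index
# 总金额 = total amount, 人数 = number of borrowers, 实际金额 = amount per borrower
loan_sex6=pd.concat([loan_sex4['贷款人性别1'],loan_sex4['审批金额'],loan_sex5],axis=1,keys=['贷款人性别1','总金额','人数'])
loan_sex6['实际金额']=loan_sex6['总金额']/loan_sex6['人数']
loan_sex6.head(10)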

loan_sex6.loc[loan_sex6['贷款人性别1'] =='nan','贷款人性别1']=np.nan  # treat the string 'nan' as missing
# Compute the average loan amount per borrower for each gender
sex_mean=pd.DataFrame(loan_sex6.groupby(['贷款人性别1'])['实际金额'].mean()).reset_index()
print(sex_mean)
plt.figure(figsize=(15,8),dpi=600)
seaborn.barplot(x=sex_mean['贷款人性别1'],y=sex_mean['实际金额'])
plt.title("Average funded amount by gender")
plt.xlabel("Gender")
plt.ylabel("Average loan amount")

Result: the average loan amount for men is higher than for women.
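
The same per-borrower split can also be written with DataFrame.explode, which is a different technique from the split/stack/join steps used in this article. The sketch below is only an illustration on toy data with made-up English column names; it divides each loan evenly among its borrowers and then averages by gender.

```python
import pandas as pd

# Toy data (hypothetical values): one row per loan
toy = pd.DataFrame({
    "borrower_genders": ["female", "female, male", None, "male, male"],
    "funded_amount": [300.0, 400.0, 100.0, 1000.0],
})

per_person = toy.dropna(subset=["borrower_genders"]).copy()
per_person["gender"] = per_person["borrower_genders"].str.split(",")
per_person["group_size"] = per_person["gender"].str.len()   # borrowers per loan
per_person = per_person.explode("gender")                   # one row per borrower
per_person["gender"] = per_person["gender"].str.strip()

# Split each loan evenly among its borrowers, then average by gender
per_person["amount_per_person"] = (
    per_person["funded_amount"] / per_person["group_size"]
)
mean_by_gender = per_person.groupby("gender")["amount_per_person"].mean()
print(mean_by_gender)
```

The even split is an assumption carried over from the calculation above; the data does not say how a group loan is actually divided among its members.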


Summary

This article uses many forms of visualization, of which map visualization is the most novel. Whenever latitude and longitude data are available, the data can be visualized on a map.

Origin blog.csdn.net/weixin_50706330/article/details/127216890