[Big Data Foundation] Spark data processing and analysis based on credit card overdue data

https://dblab.xmu.edu.cn/blog/2707/

Experiment procedure

Data preprocessing

The experimental dataset comes from the credit card scoring model data published on the Hejing community. The analysis uses the file cs-training.csv, which contains 150,000 records with 11 attribute columns.
Each record contains the following fields:
| No. | Field name | Field meaning | Example |
| --- | --- | --- | --- |
| (1) | SeriousDlqin2yrs | Whether the borrower is overdue | 0, 1 |
| (2) | RevolvingUtilizationOfUnsecuredLines | Total balance of credit card and personal credit line | 0.766126609 |
| (3) | Age | Age | 45, 20, 30 |
| (4) | NumberOfTime30-59DaysPastDueNotWorse | Number of times the borrower was 30-59 days past due | 0, 2, 3 |
| (5) | DebtRatio | Debt ratio | 0.802982129 |
| (6) | MonthlyIncome | Monthly income | 9120, 3000 |
| (7) | NumberOfOpenCreditLinesAndLoans | Number of outstanding loans | 0, 4, 13 |
| (8) | NumberOfTimes90DaysLate | Number of times the borrower was more than 90 days past due | 0, 1, 3 |
| (9) | NumberRealEstateLoansOrLines | Number of real estate loans | 3, 6 |
| (10) | NumberOfTime60-89DaysPastDueNotWorse | Number of times the borrower was 60-89 days past due | 0, 3 |
| (11) | NumberOfDependents | Number of dependents in the household | 0, 1, 3 |

The data is preprocessed with the pandas library. Four attributes are excluded from processing and analysis: total balance of credit card and personal credit line, debt ratio, number of outstanding loans, and number of times more than 90 days past due.
The specific processing steps are as follows:
(1) Read the data
(2) Check whether the data has duplicate values, and remove them
(3) Check the missing rate of each field, and fill the missing values with the mean value
(4) Select the attributes to be studied, and delete the attributes that are not studied
(5) Save the file locally
The data is preprocessed with the code file data_preprocessing.py, whose content is as follows:

import pandas as pd

# Read the data
df = pd.read_csv("~/Desktop/cs-training.csv")

# Count duplicate rows, then remove them
# (drop_duplicates returns a new DataFrame, so the result must be assigned)
print("duplicate rows:", df.duplicated().sum())
df = df.drop_duplicates()

# Show the missing rate of each field
print(df.isnull().mean())
# Fill missing values with the column mean
for col in df.columns[df.isnull().any()]:
    df[col] = df[col].fillna(df[col].mean())

# Drop the columns that are not analyzed
columns = ["RevolvingUtilizationOfUnsecuredLines","DebtRatio","NumberOfOpenCreditLinesAndLoans","NumberOfTimes90DaysLate"]
df = df.drop(columns=columns)

# Save locally
df.to_csv("~/OverDue/data.csv")
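As a small illustration of step (3), the sketch below computes the per-column missing rate and applies mean filling on a toy DataFrame. The toy values and the names `toy`, `missing_rate`, and `filled` are made up for this example; they are not taken from cs-training.csv.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are invented for illustration).
toy = pd.DataFrame({
    "MonthlyIncome": [9120.0, None, 3000.0, None],
    "NumberOfDependents": [2.0, 1.0, None, 0.0],
    "age": [45, 40, 38, 30],
})

# Fraction of missing values in each column.
missing_rate = toy.isnull().mean()

# Fill every column that has gaps with that column's own mean.
filled = toy.copy()
for col in filled.columns[filled.isnull().any()]:
    filled[col] = filled[col].fillna(filled[col].mean())

print(missing_rate)
print(filled)
```

Assigning the result of `fillna` back to the column avoids relying on in-place modification of a column selection, which newer pandas versions warn about.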


Upload files to HDFS file system

# Start Hadoop
cd /usr/local/hadoop
./sbin/start-dfs.sh
# Create the /OverDue directory in HDFS
./bin/hdfs dfs -mkdir /OverDue
# Upload the file to HDFS
./bin/hdfs dfs -put ~/OverDue/data.csv /OverDue/data.csv


Data processing and analysis using Spark

We will use the Python programming language and the Spark big data framework to process and analyze the dataset "data.csv". The specific steps are as follows:
(1) Read the data file in the HDFS file system to generate a DataFrame
(2) Modify the column names
(3) Overall statistics of credit card overdue status
(4) Combined statistics of age and credit card overdue status
(5) Combined statistics of the two past-due records and the current credit card overdue status
(6) Combined statistics of the number of real estate mortgages and credit card overdue status
(7) Combined statistics of the number of family members and credit card overdue status
(8) Combined statistics of monthly income and credit card overdue status
(9) Return the statistical data to the data visualization file data_web.py
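For the age statistics in step (4), the analysis script buckets ages with Spark's `Column.between`, which includes both bounds, so an age equal to an inner edge such as 30 falls into two neighbouring groups. The library-free sketch below contrasts that behaviour with half-open bins. The helper names `inclusive_bins` and `half_open_bin` are hypothetical and only illustrate the idea; they are not part of the original code.

```python
from bisect import bisect_right

# Same edges and group labels as used for the age analysis.
edges = [0, 30, 45, 60, 75, 100]
labels = ["0-30", "30-45", "45-60", "60-75", "75-100"]

def inclusive_bins(age):
    """Groups an age falls into when both bounds are inclusive (like between)."""
    return [labels[i] for i in range(len(labels)) if edges[i] <= age <= edges[i + 1]]

def half_open_bin(age):
    """Single group [low, high) for an age; the last group also includes 100."""
    if not edges[0] <= age <= edges[-1]:
        return None  # outside the covered range
    index = min(bisect_right(edges, age) - 1, len(labels) - 1)
    return labels[index]

for age in (29, 30, 100, 101):
    print(age, inclusive_bins(age), half_open_bin(age))
```

With inclusive bounds the per-group counts can add up to more than the number of records, while ages above 100 are not counted in any group under either scheme.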

# Upload data1.csv to /user/hadoop, the HDFS path read by the single-file version of the program
./bin/hdfs dfs -put ~/OverDue/data1.csv /user/hadoop

The content of the code file data_analysis.py is as follows:

from pyspark.sql import SparkSession
from pyspark import SparkContext,SparkConf
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql import functions
 
def analyse(filename):
    # Read the data
    spark = SparkSession.builder.config(conf = SparkConf()).getOrCreate()
    df = spark.read.format("csv").option("header","true").load(filename)
 
    # Rename the columns
    df = df.withColumnRenamed('SeriousDlqin2yrs','y')
    df = df.withColumnRenamed('NumberOfTime30-59DaysPastDueNotWorse','30-59days')
    df = df.withColumnRenamed('NumberOfTime60-89DaysPastDueNotWorse','60-89days')
    df = df.withColumnRenamed('NumberRealEstateLoansOrLines','RealEstateLoans')
    df = df.withColumnRenamed('NumberOfDependents','families')
 
    # List of results returned to data_web.py
    all_list = []
    # Overall credit card overdue analysis
    # 10026 people overdue, 139974 not overdue, 150000 in total
    total_y = []
    for i in range(2):
        total_y.append(df.filter(df['y'] == i).count())
    all_list.append(total_y)
 
    # Age analysis
    df_age = df.select(df['age'],df['y'])
    agenum = []
    # Bin edges; between() is inclusive at both ends, so an age equal to an
    # inner edge (30, 45, 60, 75) is counted in both neighbouring groups
    bins = [0,30,45,60,75,100]
    # Count the people in each age group
    for i in range(5):
        agenum.append(df_age.filter(df['age'].between(bins[i],bins[i+1])).count())
    all_list.append(agenum)
    # Count the non-overdue and overdue people in each age group
    age_y = []
    for i in range(5):
        y0 = df_age.filter(df['age'].between(bins[i],bins[i+1])).\
            filter(df['y']=='0').count()
        y1 = df_age.filter(df['age'].between(bins[i],bins[i+1])).\
            filter(df['y']=='1').count()
        age_y.append([y0,y1])
    all_list.append(age_y)
 
    # Current overdue counts for people who have past-due records
    df_pastDue = df.select(df['30-59days'],df['60-89days'],df['y'])
    # 30-59 days: 23982 people, 4985 overdue, 18997 not overdue
    numofpastdue = []
    numofpastdue.append(df_pastDue.filter(df_pastDue['30-59days'] > 0).count())
    y_numofpast1 = []
    for i in range(2):
        x = df_pastDue.filter(df_pastDue['30-59days'] > 0).\
            filter(df_pastDue['y'] == i).count()
        y_numofpast1.append(x)
    # 60-89 days: 7604 people, 2770 overdue, 4834 not overdue
    numofpastdue.append(df_pastDue.filter(df_pastDue['60-89days'] > 0).count())
    y_numofpast2 = []
    for i in range(2):
        x = df_pastDue.filter(df_pastDue['60-89days'] > 0).\
            filter(df_pastDue['y'] == i).count()
        y_numofpast2.append(x)
    # Both records: 4393 people, 1907 overdue, 2486 not overdue
    numofpastdue.append(df_pastDue.filter(df_pastDue['30-59days'] > 0).
                        filter(df_pastDue['60-89days'] > 0).count())
    y_numofpast3 = []
    for i in range(2):
        x = df_pastDue.filter(df_pastDue['30-59days'] > 0).\
            filter(df_pastDue['60-89days'] > 0).filter(df_pastDue['y'] == i).count()
        y_numofpast3.append(x)
    all_list.append(numofpastdue)
    all_list.append(y_numofpast1)
    all_list.append(y_numofpast2)
    all_list.append(y_numofpast3)
 
    # Real estate mortgage analysis
    df_Loans = df.select(df['RealEstateLoans'],df['y'])
    # Number of people without and with a real estate mortgage
    numofrealandnoreal = []
    numofrealandnoreal.append(df_Loans.filter(df_Loans['RealEstateLoans']==0).count())
    numofrealandnoreal.append(df_Loans.filter(df_Loans['RealEstateLoans']>0).count())
    all_list.append(numofrealandnoreal)
    # No real estate mortgage: 56188 people, 4672 overdue, 51516 not overdue
    norealnum = []
    for i in range(2):
        x = df_Loans.filter(df_Loans['RealEstateLoans']==0).\
            filter(df_Loans['y'] == i).count()
        norealnum.append(x)
    all_list.append(norealnum)
    # With real estate mortgage: 93812 people, 5354 overdue, 88458 not overdue
    realnum = []
    for i in range(2):
        x = df_Loans.filter(df_Loans['RealEstateLoans']>0).\
            filter(df_Loans['y'] == i).count()
        realnum.append(x)
    all_list.append(realnum)
 
    # Dependents analysis
    df_families = df.select(df['families'],df['y'])
    # Count the people with and without dependents
    nofamiliesAndfamilies = []
    nofamiliesAndfamilies.append(df_families.filter(df_families['families']>0).count())
    nofamiliesAndfamilies.append(df_families.filter(df_families['families']==0).count())
    all_list.append(nofamiliesAndfamilies)
    # With dependents: 59174 people, 4752 overdue, 54422 not overdue
    y_families = []
    y_families.append(df_families.filter(df_families['families']>0).
                      filter(df_families['y']==0).count())
    y_families.append(df_families.filter(df_families['families']>0).
                      filter(df_families['y']==1).count())
    all_list.append(y_families)
    # Without dependents: 90826 people, 5274 overdue, 85552 not overdue
    y_nofamilies = []
    y_nofamilies.append(df_families.filter(df_families['families']==0).
                        filter(df_families['y']==0).count())
    y_nofamilies.append(df_families.filter(df_families['families']==0).
                        filter(df_families['y']==1).count())
    all_list.append(y_nofamilies)
 
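    # Monthly income analysis
    df_income = df.select(df['MonthlyIncome'],df['y'])
    # Get the mean: agg returns a Row object first, from which the mean is extracted
    mean_income = df_income.agg(functions.avg(df_income['MonthlyIncome'])).head()[0]
    # Income distribution: 105854 people do not exceed the mean of 6670, 44146 people exceed it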
    numofMeanincome = []
    numofMeanincome.append(df_income.filter(df['MonthlyIncome'] < mean_income).count())
    numofMeanincome.append(df_income.filter(df['MonthlyIncome'] > mean_income).count())
    all_list.append(numofMeanincome)
    # Overdue status for income not above the mean: 97977 not overdue, 7877 overdue
    y_NoMeanIncome = []
    y_NoMeanIncome.append(df_income.filter(df['MonthlyIncome'] < mean_income).filter(df['y']==0).count())
    y_NoMeanIncome.append(df_income.filter(df['MonthlyIncome'] < mean_income).filter(df['y']==1).count())
    all_list.append(y_NoMeanIncome)
    # Overdue status for income above the mean: 41997 not overdue, 2149 overdue
    y_MeanIncome = []
    y_MeanIncome.append(df_income.filter(df['MonthlyIncome'] > mean_income).filter(df['y']==0).count())
    y_MeanIncome.append(df_income.filter(df['MonthlyIncome'] > mean_income).filter(df['y']==1).count())
    all_list.append(y_MeanIncome)
 
    # Return the results to data_web.py for visualization
    return all_list
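The counts quoted in the comments of data_analysis.py become easier to compare once they are turned into overdue rates. The sketch below does that arithmetic in plain Python. The counts are copied from those comments, while the group names and the helper `overdue_rate` are labels chosen for this illustration only.

```python
# (overdue, not overdue) counts copied from the comments in data_analysis.py.
groups = {
    "all records": (10026, 139974),
    "30-59 day record": (4985, 18997),
    "60-89 day record": (2770, 4834),
    "both records": (1907, 2486),
    "no real estate mortgage": (4672, 51516),
    "with real estate mortgage": (5354, 88458),
    "with dependents": (4752, 54422),
    "without dependents": (5274, 85552),
    "income not above mean": (7877, 97977),
    "income above mean": (2149, 41997),
}

def overdue_rate(overdue, not_overdue):
    """Share of people in a group who are overdue."""
    return overdue / (overdue + not_overdue)

rates = {name: overdue_rate(*counts) for name, counts in groups.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.2%}")
```

Read this way, the quoted counts suggest that people with earlier past-due records are overdue far more often than the population as a whole, and that the group with income above the mean has a lower overdue rate than the group at or below it.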

Data visualization

The third-party Python library pyecharts (version 1.7.0) is used as the visualization tool. Bar charts and pie charts present the analysis results in detail.
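Each pie chart is fed a list of [name, value] pairs, and the label formatter's `{d}` placeholder shows each slice's percentage of the total. The plain-Python sketch below builds such pairs and computes the same shares by hand, without importing pyecharts. The counts are the overall totals quoted in the analysis script's comments, and the English labels are chosen for this example.

```python
labels = ["Not overdue", "Overdue"]
counts = [139974, 10026]  # overall totals quoted in the analysis comments

# Pie.add takes the series data as a list of [name, value] pairs.
data_pair = [list(pair) for pair in zip(labels, counts)]

# Percentage of each slice, comparable to what "{d}" displays on the chart.
total = sum(counts)
shares = [round(100 * count / total, 2) for count in counts]

print(data_pair)
print(shares)
```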
The content of the code file data_web.py is as follows:

from pyecharts.charts import Bar
from pyecharts.charts import Pie
from pyecharts.charts import Page
from pyecharts import options as opts
import data_analysis

# -------- Shared chart helpers --------
def make_pie(series_name, title, counts, attr=("Not overdue", "Overdue")):
    # Pie chart of [not overdue, overdue] counts, labelled with value and percentage
    pie = (
        Pie()
            .add(series_name, [list(z) for z in zip(attr, counts)])
            .set_global_opts(title_opts=opts.TitleOpts(title=title))
            .set_series_opts(
            tooltip_opts=opts.TooltipOpts(trigger="item", formatter="{a} <br/>{b}: {c} ({d}%)"),
            label_opts=opts.LabelOpts(formatter="{b}: {c} ({d}%)")
            )
    )
    return pie

def make_bar(attr, counts, title):
    # Single-series bar chart of people counts
    bar = (
        Bar()
            .add_xaxis(attr)
            .add_yaxis("People", counts)
            .set_global_opts(title_opts=opts.TitleOpts(title=title))
    )
    return bar

def render_page(filename, charts):
    # Put the charts on one page, in order, and write the HTML file
    page = Page()
    for chart in charts:
        page.add(chart)
    page.render(filename)

# -------- Overall overdue counts --------
def draw_total(total_list):
    return make_pie("Overall overdue count", "Overall overdue distribution", total_list)

# -------- Age and overdue counts --------
def draw_age(age_list,y_ageList):
    # all_list is the module-level result list created in the main block
    total_pie = draw_total(all_list[0])
    attr = ["0-30", "30-45", "45-60", "60-75", "75-100"]
    y0_agenum = [pair[0] for pair in y_ageList]
    y1_agenum = [pair[1] for pair in y_ageList]
    bar = (
        Bar()
            .add_xaxis(attr)
            .add_yaxis("Total people", age_list)
            .add_yaxis("Not overdue", y0_agenum)
            .add_yaxis("Overdue", y1_agenum)
            .set_global_opts(title_opts=opts.TitleOpts(title="Overdue status by age group"))
    )
    # One pie chart per age group
    pies = [make_pie("Age " + group, "Overdue status, age " + group, y_ageList[i])
            for i, group in enumerate(attr)]
    render_page('age_OverDue.html', [bar, total_pie] + pies)

# -------- Past-due records and overdue counts --------
def draw_pastdue(numofpastdue,pastdue1num,pastdue2num,pastdue12num):
    total_pie = draw_total(all_list[0])
    attr = ["People with 30-59 day records", "People with 60-89 day records", "People with both records"]
    bar = make_bar(attr, numofpastdue, "People with past-due records")
    title1 = "Overdue status of people with short-term past-due records"
    title2 = "Overdue status of people with long-term past-due records"
    title3 = "Overdue status of people with both short- and long-term records"
    pie1 = make_pie(title1, title1, pastdue1num)
    pie2 = make_pie(title2, title2, pastdue2num)
    pie3 = make_pie(title3, title3, pastdue12num)
    render_page('pastDue_OverDue.html', [bar, total_pie, pie1, pie2, pie3])

# -------- Real estate mortgages and overdue counts --------
def draw_realestateLoans(numofrealornoreal,y_norealnum,y_realnum):
    total_pie = draw_total(all_list[0])
    attr = ["No real estate mortgage", "With real estate mortgage"]
    bar = make_bar(attr, numofrealornoreal, "Distribution of real estate mortgages")
    title1 = "Overdue status of people without a real estate mortgage"
    title2 = "Overdue status of people with a real estate mortgage"
    pie1 = make_pie(title1, title1, y_norealnum)
    pie2 = make_pie(title2, title2, y_realnum)
    render_page('realestateLoans_OverDue.html', [bar, total_pie, pie1, pie2])

# -------- Dependents and overdue counts --------
def draw_families(nofamiliesAndfamilies,y_families,y_nofamilies):
    total_pie = draw_total(all_list[0])
    attr = ["With dependents", "Without dependents"]
    bar = make_bar(attr, nofamiliesAndfamilies, "Distribution of people with and without dependents")
    title1 = "Overdue status of people without dependents"
    title2 = "Overdue status of people with dependents"
    pie1 = make_pie(title1, title1, y_nofamilies)
    pie2 = make_pie(title2, title2, y_families)
    render_page('families_OverDue.html', [bar, total_pie, pie1, pie2])

# -------- Monthly income and overdue counts --------
def draw_income(numofMeanincome,y_NoMeanIncome,y_MeanIncome):
    total_pie = draw_total(all_list[0])
    attr = ["Income not above the mean", "Income above the mean"]
    bar = make_bar(attr, numofMeanincome, "Distribution of income relative to the mean")
    title1 = "Overdue status of people with income not above the mean"
    title2 = "Overdue status of people with income above the mean"
    pie1 = make_pie(title1, title1, y_NoMeanIncome)
    pie2 = make_pie(title2, title2, y_MeanIncome)
    render_page('meanIncome_OverDue.html', [bar, total_pie, pie1, pie2])

if __name__ == '__main__':
    print("Program started")
    Filename = "/OverDue/data.csv"
    all_list = data_analysis.analyse(Filename)
    # Age vs. overdue status
    draw_age(all_list[1],all_list[2])
    # Past-due records vs. overdue status
    draw_pastdue(all_list[3],all_list[4],all_list[5],all_list[6])
    # Number of real estate mortgages vs. overdue status
    draw_realestateLoans(all_list[7],all_list[8],all_list[9])
    # Number of dependents vs. overdue status
    draw_families(all_list[10],all_list[11],all_list[12])
    # Monthly income vs. overdue status
    draw_income(all_list[13],all_list[14],all_list[15])
    print("Program finished")
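The main block addresses the results by position, for example `all_list[13]` for the income split. As a reading aid, the sketch below lists the 16 positions under the variable names that data_analysis.py appends, in order. The constant `LAYOUT` and the helper `label_results` are hypothetical additions for illustration and are not part of the original program.

```python
# Names follow the variables appended to all_list in data_analysis.py, in order.
LAYOUT = [
    "total_y",                # 0: [not overdue, overdue] over all records
    "agenum",                 # 1: people per age group
    "age_y",                  # 2: [not overdue, overdue] per age group
    "numofpastdue",           # 3: people with 30-59 day, 60-89 day, and both records
    "y_numofpast1",           # 4: overdue split, 30-59 day record
    "y_numofpast2",           # 5: overdue split, 60-89 day record
    "y_numofpast3",           # 6: overdue split, both records
    "numofrealandnoreal",     # 7: people without / with a real estate mortgage
    "norealnum",              # 8: overdue split, no mortgage
    "realnum",                # 9: overdue split, with mortgage
    "nofamiliesAndfamilies",  # 10: people with / without dependents
    "y_families",             # 11: overdue split, with dependents
    "y_nofamilies",           # 12: overdue split, without dependents
    "numofMeanincome",        # 13: people not above / above the mean income
    "y_NoMeanIncome",         # 14: overdue split, income not above the mean
    "y_MeanIncome",           # 15: overdue split, income above the mean
]

def label_results(all_list):
    """Pair each entry of all_list with its name (hypothetical helper)."""
    if len(all_list) != len(LAYOUT):
        raise ValueError(f"expected {len(LAYOUT)} entries, got {len(all_list)}")
    return dict(zip(LAYOUT, all_list))

# Placeholder values only; real values come from data_analysis.analyse().
demo = label_results(list(range(16)))
print(demo["numofMeanincome"])
```

Looking entries up by name makes a call such as `draw_income(...)` easier to check against the analysis code than counting list positions.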

Full code

The single-file version concatenates the two listings above: the whole of data_analysis.py, followed by data_web.py without its import data_analysis line. Only the entry point differs. It reads data1.csv from /user/hadoop in HDFS and calls analyse() directly:

if __name__ == '__main__':
    print("Program started")
    Filename = "hdfs://localhost:8020/user/hadoop/data1.csv"
    all_list = analyse(Filename)
    # Age vs. overdue status
    draw_age(all_list[1],all_list[2])
    # Past-due records vs. overdue status
    draw_pastdue(all_list[3],all_list[4],all_list[5],all_list[6])
    # Number of real estate mortgages vs. overdue status
    draw_realestateLoans(all_list[7],all_list[8],all_list[9])
    # Number of dependents vs. overdue status
    draw_families(all_list[10],all_list[11],all_list[12])
    # Monthly income vs. overdue status
    draw_income(all_list[13],all_list[14],all_list[15])
    print("Program finished")

Running the program generates the result files whose names end in OverDue.html.

Data Visualization Results

# Enter the OverDue directory
cd ~/OverDue
# Submit data_web.py with spark-submit
/usr/local/spark/bin/spark-submit --master local ~/OverDue/data_web.py

Number of dependents

[Figures: charts from families_OverDue.html]

Overdue records

[Figures: charts from pastDue_OverDue.html]

Number of real estate mortgages

[Figures: charts from realestateLoans_OverDue.html]

Monthly income

[Figures: charts from meanIncome_OverDue.html]

Overall

[Figures: overall overdue distribution]

Age

[Figures: charts from age_OverDue.html]

Origin blog.csdn.net/Algernon98/article/details/130051632