[Data Visualization] Major Assignment (Data Visualization of a Postgraduate Entrance Examination Target College)


Preface

  • Display the geographical location of the university in the form of a map.
  • Present the university's recent admission results, headcount information, professional teaching staff, and examination subjects and content for its computer-related majors in the postgraduate entrance examination (or college entrance examination) using appropriate bar charts, line charts, pie charts, etc. The charts must clearly present changes in the different data so that a viewer can quickly extract the information.
    • Admission results, admission number information, professional teaching staff
    • Bar chart, line chart, pie chart
  • For multi-attribute, multi-dimensional, multi-relationship data of personal interest, such as tutors and their research directions, use visualization methods such as relationship graphs and word clouds to express the data clearly and effectively.
    • Research directions (relationship graph + word cloud)
  • Other open-ended parts

1. Data introduction

1.1 Basic information

  • School name: Shandong University of Technology
  • Geographical location: Zibo City, Shandong Province, 36.810315 North Latitude, 117.999601 East Longitude
  • Institution: School of Computer Science and Technology

1.2 Postgraduate entrance examination information

Information on first-choice admissions to Shandong University of Technology's Computer Science and Technology programs (academic master's + professional master's) from 2020 to 2022 was collected from the Internet. The fields include: re-examination college code, re-examination college, name, preliminary examination number, re-examination major code, re-examination major name, research direction code, learning form, first choice/transfer, preliminary examination score, comprehensive interview score, total score, ranking, admission result, and remarks. Note that this data does not come from the official website (the information on the school's official website has been closed off), so the data contains errors.

import PyPDF2
import pytesseract
import pandas as pd
import os

# Configure the OCR engine (if needed)
# pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'  # path to the Tesseract OCR binary

# Convert a PDF file to text
def pdf_to_text(pdf_path):
    text = ""
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        num_pages = len(reader.pages)
        for page in range(num_pages):
            pdf_page = reader.pages[page]
            text += pdf_page.extract_text() or ""  # extract_text() may return None for image-only pages
    return text

# Recognize text with OCR
def ocr_text(image_path):
    text = pytesseract.image_to_string(image_path)
    return text

# Save the text as an Excel file
def save_text_as_excel(text, output_path):
    lines = text.split('\n')
    data = [line.split() for line in lines if line.strip()]
    df = pd.DataFrame(data)
    df.to_excel(output_path, index=False)

# Main routine
def pdf_to_excel(pdf_folder, output_folder):
    pdf_files = [f for f in os.listdir(pdf_folder) if f.endswith('.pdf')]

    for pdf_file in pdf_files:
        pdf_path = os.path.join(pdf_folder, pdf_file)
        text = pdf_to_text(pdf_path)

        # Recognize text with OCR instead (if needed)
        # image_path = 'image.png'  # convert the PDF pages to image files first (optional)
        # text = ocr_text(image_path)

        excel_file = pdf_file.replace('.pdf', '.xlsx')
        output_path = os.path.join(output_folder, excel_file)
        save_text_as_excel(text, output_path)

    print("转换完成!")

# Run the conversion
pdf_folder = '/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/Data/'
output_folder = '/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/Output/'
pdf_to_excel(pdf_folder, output_folder)
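
The commented-out OCR branch above assumes the PDF pages have already been exported as image files, which the script does not show. A minimal sketch of that conversion step (not part of the original code; it assumes the pdf2image package and its poppler dependency are installed, and the helper name pdf_to_images is my own) could look like this:

# Sketch only: export PDF pages as PNGs so ocr_text() can be applied to scanned PDFs.
import os
from pdf2image import convert_from_path  # assumption: pdf2image is installed

def pdf_to_images(pdf_path, image_folder):
    """Save each page of the PDF as a PNG and return the image paths."""
    image_paths = []
    for i, page in enumerate(convert_from_path(pdf_path)):
        image_path = os.path.join(image_folder, f"page_{i + 1}.png")
        page.save(image_path, "PNG")
        image_paths.append(image_path)
    return image_paths

# Usage with the OCR helper defined above:
# text = "".join(ocr_text(p) for p in pdf_to_images(pdf_path, output_folder))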

[Screenshot: Excel files produced from the PDF conversion]

  • Other relevant information:

1.3 Tutor information

Graduate tutor information was collected from the official website of the School of Computer Science and Technology of Shandong University of Technology.

  • Fields obtained: name, position, main study and work resume, main research direction, social part-time jobs and honorary titles, main courses taught and main teaching awards, main scientific research achievements and awards
  • Crawler code (tutor team listing page):
import time
import requests
from lxml import etree
import pandas as pd

def scrape_website(url, dataframe):
    # Fetch the page
    response = requests.get(url)

    # Check whether the request succeeded
    if response.status_code == 200:
        # Parse the HTML with lxml
        html = response.text
        tree = etree.HTML(html)

        # Dictionary to hold the scraped fields
        data = {}

        # Basic information (name and title)
        item1 = tree.xpath('/html/body/div[4]/div/div[2]/div/div[1]/div[2]/h2//text()')
        item2 = tree.xpath('/html/body/div[4]/div/div[2]/div/div[1]/div[2]/h3//text()')
        data['Item 1'] = item1
        data['Item 2'] = item2

        # Main education and work experience
        data1 = tree.xpath('/html/body/div[4]/div/div[2]/div/div[2]/div/p//text()')
        data['Main Education and Work Experience'] = data1

        # Main research areas
        data2 = tree.xpath('/html/body/div[4]/div/div[2]/div/div[3]/div/p//text()')
        data['Main Research Areas'] = data2

        # Social positions and honorary titles
        data3 = tree.xpath('/html/body/div[4]/div/div[2]/div/div[4]/div/p//text()')
        data['Social Positions and Honors'] = data3

        # Main courses taught and teaching awards
        data4 = tree.xpath('/html/body/div[4]/div/div[2]/div/div[5]/div/p//text()')
        data['Main Courses and Teaching Awards'] = data4

        # Main research achievements and awards
        data5 = tree.xpath('/html/body/div[4]/div/div[2]/div/div[6]/div/p//text()')
        data['Main Research Achievements and Awards'] = data5

        # Convert to a one-row DataFrame and append it to the accumulated DataFrame
        new_dataframe = pd.DataFrame([data])
        dataframe = pd.concat([dataframe, new_dataframe], ignore_index=True)

        return dataframe

    else:
        print("请求失败")
        return dataframe  # keep the data collected so far if this request fails

def scrape_url(url):
    # Fetch the page
    response = requests.get(url)

    # Check whether the request succeeded
    if response.status_code == 200:
        # Parse the HTML with lxml
        html = response.text
        tree = etree.HTML(html)
        # Extract all tutor page links
        links = tree.xpath('//*[@id="wp_content_w3_0"]//@href')
        for link in links:
            print("链接:", link)
        return links
    else:
        print("请求失败")
        return []  # no links if the listing page could not be fetched

# Empty DataFrame to accumulate tutor information
df = pd.DataFrame()

# Run the crawler
links = scrape_url("https://jsjxy.sdut.edu.cn/7534/list.htm")
for link in links:
    print(link)
    df = scrape_website(link, df)
    time.sleep(1)

# Join list-valued cells into comma-separated strings
df = df.applymap(lambda x: ', '.join(x) if isinstance(x, list) else x)

# Save the cleaned data to an Excel file
df.to_excel("导师信息.xlsx", index=False)

print('########################### Over  ###########################')
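
One caveat: the href values returned by scrape_url may be relative paths rather than full URLs, depending on how the listing page is written. A minimal, hedged guard (my own addition, using the standard library's urllib.parse.urljoin) would normalize the links before they are passed to scrape_website:

# Sketch only: make every extracted link absolute before scraping it.
from urllib.parse import urljoin

base_url = "https://jsjxy.sdut.edu.cn/7534/list.htm"
links = [urljoin(base_url, link) for link in scrape_url(base_url)]
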
  • Data Display:
    [Screenshot: collected tutor information spreadsheet (导师信息.xlsx)]

2. Preprocessing and analysis

2.1 Data preprocessing

2.1.1 Postgraduate entrance examination information preprocessing

  • Remove the title/header rows
  • Delete empty rows (see the sketch after the filtering script below)
  • Remove duplicate rows
  • Keep only rows whose re-examination college is "计算机学院" or "计算机科学与技术学院" (School of Computer Science / School of Computer Science and Technology)
  • Save the filtered data into three Excel files named "2020jsj", "2021jsj", and "2022jsj" respectively.
import pandas as pd

# Data file paths and names
data_path = "/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/Data/"
file_names = ["2020year.xlsx", "2021year.xlsx", "2022year.xlsx"]
save_names = ["2020jsj.xlsx", "2021jsj.xlsx", "2022jsj.xlsx"]

for i in range(len(file_names)):
    file_path = data_path + file_names[i]
    save_path = data_path + save_names[i]

    # Read the Excel file
    df = pd.read_excel(file_path)

    # Filter condition: keep School of Computer Science rows
    condition = ((df["复试学院"] == "计算机学院") | (df["复试学院"] == "计算机科学与技术学院"))

    # Apply the filter
    filtered_data = df[condition]

    # Save the filtered data to a new Excel file
    filtered_data.to_excel(save_path, index=False)

[Screenshot: filtered data for the School of Computer Science]
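
The empty-row and duplicate handling listed in the preprocessing steps is not shown in the filtering script above. A minimal sketch of that step (applied to the same yearly files before the college filter) could be:

# Sketch only: drop empty and duplicate rows before applying the college filter above.
import pandas as pd

data_path = "/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/Data/"
df = pd.read_excel(data_path + "2020year.xlsx")
df = df.dropna(how="all")     # delete completely empty rows
df = df.drop_duplicates()     # handle duplicate rows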

To address irregular recognition in some of the PDFs converted to Excel:

  • Copy rows into a new Excel table based on the "序号" (serial number) column; rows whose serial number is non-numeric or empty are skipped.
import pandas as pd

# Read the original Excel file
file_path = '/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/Data/2020year.xlsx'
df = pd.read_excel(file_path)

# Path of the new Excel file
new_file_path = '/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/Data/new_data.xlsx'
new_df = pd.DataFrame(columns=df.columns)  # new DataFrame with the original file's headers

# Keep only rows whose "序号" (serial number) value is numeric
for index, row in df.iterrows():
    value = row['序号']
    if pd.notnull(value) and str(value).isdigit():
        new_df = pd.concat([new_df, row.to_frame().T], ignore_index=True)

# Save the data to the new Excel file
new_df.to_excel(new_file_path, index=False)

2.1.2 Tutor information preprocessing

  • Delete empty rows
  • Remove duplicate rows
  • Split each field into separate entries
import pandas as pd
import re

# Read the Excel file
file_path = '/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/导师信息.xlsx'
df = pd.read_excel(file_path)

# Drop completely empty rows
df = df.dropna(how='all')

# Drop duplicate rows
df = df.drop_duplicates()

# Create a new Excel workbook
output_file = '/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/导师信息处理后.xlsx'
writer = pd.ExcelWriter(output_file, engine='xlsxwriter')

# Process each tutor's record
for index, row in df.iterrows():
    # Tutor name
    teacher_name = row['Item 1']

    # Create a worksheet named after the tutor
    teacher_sheet = writer.book.add_worksheet(teacher_name)

    # Write the header row
    headers = ['姓名', '职位', '主要学习工作简历', '主要研究方向', '社会兼职及荣誉称号', '主讲课程及主要教学奖励',
               '主要科研成果及奖励']
    for col_index, header in enumerate(headers):
        teacher_sheet.write(0, col_index, header)

    # Split each column's text on "," "、" ";" and write the pieces down the column of the new worksheet
    for col_index, value in enumerate(row):
        if pd.notnull(value):
            data_list = [x.strip() for x in re.split('[,、;]', str(value))]
            for i, data in enumerate(data_list):
                teacher_sheet.write(i + 1, col_index, data)

# Save and close the workbook
writer.close()

  • Processing result:
    [Screenshot: per-tutor worksheets after preprocessing]

Some irregularities remain in part of the preprocessed data. After additional manual cleanup, the "plus" version of the data is obtained, containing: name, position, main study and work resume, main research direction, social part-time jobs and honorary titles, and main courses taught.

2.2 Data analysis


3. Visualization methods and results

3.1 Visualization method

  • Geographical location display: pyecharts map display tool
  • 2020-2022 first-choice admission results
    • Score distribution from 2020 to 2022: preliminary test scores, comprehensive interview scores, and total scores (bar chart)
    • Score distribution from 2020 to 2022: preliminary test scores, comprehensive interview scores, and total scores (pie chart with timeline carousel)
    • Comparison of the lowest and highest scores in 2020-2022: preliminary test scores, comprehensive interview scores, and total scores (box plot)
  • First-choice applicant numbers from 2020 to 2022: line chart of applicant and admission counts
  • Examination subjects and content: screenshot of the Excel table
  • Professional teaching team: research directions (relationship graph + word cloud)

3.2 Visual results display

3.2.1 Basic information

  • Geographical location
from pyecharts.charts import Geo
from pyecharts import options as opts
from pyecharts.globals import GeoType


def test_geo():
    g = Geo()
    # Choose the map to display (Shandong province)
    g.add_schema(maptype="山东")
    # Register the point and its name with add_coordinate(name, lng, lat)
    g.add_coordinate('山东理工大学', 117.999601, 36.810315)
    # Attach a data value to the point
    data_pair = [('山东理工大学', 10)]
    # Add the data to the map as an effect-scatter series
    g.add('', data_pair, type_=GeoType.EFFECT_SCATTER, symbol_size=5)
    # Styling: show the point label
    g.set_series_opts(label_opts=opts.LabelOpts(is_show=True))
    return g


g = test_geo()
# Render to HTML, saved in the same directory as this script
g.render('坐标标注.html')

[Screenshot: rendered map of Shandong with Shandong University of Technology marked]

3.2.2 Postgraduate entrance examination information

Score distribution chart from 2020 to 2022: preliminary test scores, comprehensive interview scores, and total scores (bar chart)

  • Preliminary test score distribution
import pandas as pd
from pyecharts.charts import Bar
from pyecharts import options as opts

# File path and file names
file_path = "/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/Data/"
file_names = ["2020jsj.xlsx", "2021jsj.xlsx", "2022jsj.xlsx"]

# Score bins
score_ranges = [225, 250, 275, 300, 325, 350, 375, 400, 425]

# Counts per score bin for each year
score_counts = [[0] * (len(score_ranges) - 1) for _ in range(len(file_names))]

# Tally the counts file by file
for idx, file_name in enumerate(file_names):
    file = file_path + file_name
    df = pd.read_excel(file)

    # Preliminary examination score column
    scores = df["初试成绩"]

    # Count the scores in each bin
    for score in scores:
        for i in range(len(score_ranges) - 1):
            if score_ranges[i] <= score < score_ranges[i + 1]:
                score_counts[idx][i] += 1
                break

# Build the bar chart
bar = (
    Bar()
    .add_xaxis([str(range_start) + '-' + str(range_end) for range_start, range_end in zip(score_ranges[:-1], score_ranges[1:])])
)

# Add one data series per year
for idx, file_name in enumerate(file_names):
    bar.add_yaxis(file_name[:-5], score_counts[idx], stack="stack{}".format(idx))

# Global options
bar.set_global_opts(
    title_opts=opts.TitleOpts(title="初试成绩分布"),
    xaxis_opts=opts.AxisOpts(name="分数段"),
    yaxis_opts=opts.AxisOpts(name="人数"),
)

# Render the chart
bar.render("score_distribution.html")

[Screenshot: preliminary test score distribution bar chart]

  • Comprehensive interview score distribution
import pandas as pd
from pyecharts.charts import Bar
from pyecharts import options as opts

# File path and file names
file_path = "/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/Data/"
file_names = ["2020jsj.xlsx", "2021jsj.xlsx", "2022jsj.xlsx"]

# Score bins
score_ranges = list(range(70, 91, 2))

# Counts per score bin for each year
score_counts = [[0] * (len(score_ranges) - 1) for _ in range(len(file_names))]

# Tally the counts file by file
for idx, file_name in enumerate(file_names):
    file = file_path + file_name
    df = pd.read_excel(file)

    # Comprehensive interview score column
    scores = df["综合面试成绩"]

    # Count the scores in each bin
    for score in scores:
        for i in range(len(score_ranges) - 1):
            if score_ranges[i] <= score < score_ranges[i + 1]:
                score_counts[idx][i] += 1
                break

# Build the bar chart
bar = (
    Bar()
    .add_xaxis([str(range_start) + '-' + str(range_end) for range_start, range_end in zip(score_ranges[:-1], score_ranges[1:])])
)

# Add one data series per year
for idx, file_name in enumerate(file_names):
    bar.add_yaxis(file_name[:-5], score_counts[idx], stack="stack{}".format(idx))

# Global options
bar.set_global_opts(
    title_opts=opts.TitleOpts(title="综合面试成绩分布"),
    xaxis_opts=opts.AxisOpts(name="分数段"),
    yaxis_opts=opts.AxisOpts(name="人数"),
)

# Render the chart
bar.render("综合面试成绩分布.html")

[Screenshot: comprehensive interview score distribution bar chart]

  • Total score distribution
import pandas as pd
from pyecharts.charts import Bar
from pyecharts import options as opts

# File path and file names
file_path = "/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/Data/"
file_names = ["2020jsj.xlsx", "2021jsj.xlsx", "2022jsj.xlsx"]

# Score bins
score_ranges = list(range(60, 82, 2))

# Counts per score bin for each year
score_counts = [[0] * (len(score_ranges) - 1) for _ in range(len(file_names))]

# Tally the counts file by file
for idx, file_name in enumerate(file_names):
    file = file_path + file_name
    df = pd.read_excel(file)

    # Total score column
    total_scores = df["总成绩"]

    # Count the scores in each bin
    for score in total_scores:
        for i in range(len(score_ranges) - 1):
            if score_ranges[i] <= score < score_ranges[i + 1]:
                score_counts[idx][i] += 1
                break

# Build the bar chart
bar = (
    Bar()
    .add_xaxis([str(range_start) + '-' + str(range_end) for range_start, range_end in zip(score_ranges[:-1], score_ranges[1:])])
)

# Add one data series per year
for idx, file_name in enumerate(file_names):
    bar.add_yaxis(file_name[:-5], score_counts[idx], stack="stack{}".format(idx))

# Global options
bar.set_global_opts(
    title_opts=opts.TitleOpts(title="总成绩分布"),
    xaxis_opts=opts.AxisOpts(name="分数段"),
    yaxis_opts=opts.AxisOpts(name="人数"),
)

# Render the chart
bar.render("总成绩分布.html")

[Screenshot: total score distribution bar chart]
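
The three bar-chart scripts above differ only in the score column, the bin edges, and the chart titles. As a hedged refactoring sketch (the helper plot_score_distribution and its parameters are my own and are not part of the original code), the shared logic could be factored into a single function:

# Sketch only: one parameterized helper covering the three bar charts above.
import pandas as pd
from pyecharts.charts import Bar
from pyecharts import options as opts

FILE_PATH = "/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/Data/"
FILE_NAMES = ["2020jsj.xlsx", "2021jsj.xlsx", "2022jsj.xlsx"]


def plot_score_distribution(column, score_ranges, title, output_html):
    """Count scores per bin for each year and render a stacked bar chart."""
    counts = [[0] * (len(score_ranges) - 1) for _ in FILE_NAMES]
    for idx, name in enumerate(FILE_NAMES):
        scores = pd.read_excel(FILE_PATH + name)[column]
        for score in scores:
            for i in range(len(score_ranges) - 1):
                if score_ranges[i] <= score < score_ranges[i + 1]:
                    counts[idx][i] += 1
                    break

    bar = Bar().add_xaxis(
        [f"{lo}-{hi}" for lo, hi in zip(score_ranges[:-1], score_ranges[1:])]
    )
    for idx, name in enumerate(FILE_NAMES):
        bar.add_yaxis(name[:-5], counts[idx], stack=f"stack{idx}")
    bar.set_global_opts(
        title_opts=opts.TitleOpts(title=title),
        xaxis_opts=opts.AxisOpts(name="分数段"),
        yaxis_opts=opts.AxisOpts(name="人数"),
    )
    bar.render(output_html)


# The three charts above, reproduced with the helper:
plot_score_distribution("初试成绩", [225, 250, 275, 300, 325, 350, 375, 400, 425],
                        "初试成绩分布", "score_distribution.html")
plot_score_distribution("综合面试成绩", list(range(70, 91, 2)),
                        "综合面试成绩分布", "综合面试成绩分布.html")
plot_score_distribution("总成绩", list(range(60, 82, 2)),
                        "总成绩分布", "总成绩分布.html")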

Pie chart of score distribution from 2020 to 2022: preliminary test scores, comprehensive interview scores, and total scores (pie chart)

  • Preliminary test results
import pandas as pd
from pyecharts.charts import Pie, Timeline
from pyecharts import options as opts

# File path and file names
file_path = "/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/Data/"
file_names = ["2020jsj.xlsx", "2021jsj.xlsx", "2022jsj.xlsx"]

# Score bins
score_ranges = list(range(225, 425, 25))  # Updated score ranges

# Counts per score bin for each year
score_counts = [[0] * (len(score_ranges) - 1) for _ in range(len(file_names))]

# Tally the counts file by file
for idx, file_name in enumerate(file_names):
    file = file_path + file_name
    df = pd.read_excel(file)

    # Preliminary examination score column
    initial_scores = df["初试成绩"]

    # Count the scores in each bin
    for score in initial_scores:
        for i in range(len(score_ranges) - 1):
            if score_ranges[i] <= score < score_ranges[i + 1]:
                score_counts[idx][i] += 1
                break

# Create the timeline chart
timeline = Timeline()

# One pie chart per year
for idx, file_name in enumerate(file_names):
    # Build the pie chart
    pie = (
        Pie()
        .add(
            series_name="分数段",
            data_pair=[(str(range_start) + '-' + str(range_end), count) for range_start, range_end, count in zip(score_ranges[:-1], score_ranges[1:], score_counts[idx])],
            radius="50%"
        )
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
        .set_global_opts(title_opts=opts.TitleOpts(title="初试成绩分布"), legend_opts=opts.LegendOpts(orient="vertical", pos_top="15%", pos_right="2%"))
    )

    # Add this year's chart to the timeline
    timeline.add(pie, file_name[:-5])

# Render the chart
timeline.render("初试成绩分布Pie.html")

[Screenshot: preliminary test score distribution pie chart (timeline)]

  • Comprehensive interview results
import pandas as pd
from pyecharts.charts import Pie, Timeline
from pyecharts import options as opts

# File path and file names
file_path = "/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/Data/"
file_names = ["2020jsj.xlsx", "2021jsj.xlsx", "2022jsj.xlsx"]

# Score bins
score_ranges = list(range(60, 91, 5))  # Updated score ranges

# Counts per score bin for each year
score_counts = [[0] * (len(score_ranges) - 1) for _ in range(len(file_names))]

# Tally the counts file by file
for idx, file_name in enumerate(file_names):
    file = file_path + file_name
    df = pd.read_excel(file)

    # Comprehensive interview score column
    interview_scores = df["综合面试成绩"]

    # Count the scores in each bin
    for score in interview_scores:
        for i in range(len(score_ranges) - 1):
            if score_ranges[i] <= score < score_ranges[i + 1]:
                score_counts[idx][i] += 1
                break

# Create the timeline chart
timeline = Timeline()

# One pie chart per year
for idx, file_name in enumerate(file_names):
    # Build the pie chart
    pie = (
        Pie()
        .add(
            series_name="分数段",
            data_pair=[(str(range_start) + '-' + str(range_end), count) for range_start, range_end, count in zip(score_ranges[:-1], score_ranges[1:], score_counts[idx])],
            radius="50%"
        )
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
        .set_global_opts(title_opts=opts.TitleOpts(title="综合面试成绩分布"), legend_opts=opts.LegendOpts(orient="vertical", pos_top="15%", pos_right="2%"))
    )

    # Add this year's chart to the timeline
    timeline.add(pie, file_name[:-5])

# Render the chart
timeline.render("综合面试成绩分布Pie.html")

[Screenshot: comprehensive interview score distribution pie chart (timeline)]

  • Overall result
import pandas as pd
from pyecharts.charts import Pie, Timeline
from pyecharts import options as opts

# File path and file names
file_path = "/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/Data/"
file_names = ["2020jsj.xlsx", "2021jsj.xlsx", "2022jsj.xlsx"]

# Score bins
score_ranges = list(range(60, 80, 5))  # Updated score ranges

# Counts per score bin for each year
score_counts = [[0] * (len(score_ranges) - 1) for _ in range(len(file_names))]

# Tally the counts file by file
for idx, file_name in enumerate(file_names):
    file = file_path + file_name
    df = pd.read_excel(file)

    # Total score column
    total_scores = df["总成绩"]

    # Count the scores in each bin
    for score in total_scores:
        for i in range(len(score_ranges) - 1):
            if score_ranges[i] <= score < score_ranges[i + 1]:
                score_counts[idx][i] += 1
                break

# Create the timeline chart
timeline = Timeline()

# One pie chart per year
for idx, file_name in enumerate(file_names):
    # Build the pie chart
    pie = (
        Pie()
        .add(
            series_name="分数段",
            data_pair=[(str(range_start) + '-' + str(range_end), count) for range_start, range_end, count in zip(score_ranges[:-1], score_ranges[1:], score_counts[idx])],
            radius="50%"
        )
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
        .set_global_opts(title_opts=opts.TitleOpts(title="总成绩分布"), legend_opts=opts.LegendOpts(orient="vertical", pos_top="15%", pos_right="2%"))
    )

    # Add this year's chart to the timeline
    timeline.add(pie, file_name[:-5])

# Render the chart
timeline.render("总成绩分布Pie.html")

[Screenshot: total score distribution pie chart (timeline)]

First-choice applicant numbers from 2020 to 2022: line chart of the change in applicant and admission counts

import pandas as pd
from pyecharts import options as opts
from pyecharts.charts import Line

# Read the Excel files
df_2020 = pd.read_excel('/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/Data/2020jsj.xlsx')
df_2021 = pd.read_excel('/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/Data/2021jsj.xlsx')
df_2022 = pd.read_excel('/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/Data/2022jsj.xlsx')

# Compute the total applicant count and the admitted count for each year
total_counts = [len(df_2020), len(df_2021), len(df_2022)]
admitted_counts = [
    len(df_2020[df_2020['录取结果'] == '拟录取']),
    len(df_2021[df_2021['录取结果'] == '拟录取']),
    len(df_2022[df_2022['录取结果'] == '拟录取'])
]

# Build the line chart
line = (
    Line()
    .add_xaxis(['2020', '2021', '2022'])
    .add_yaxis('总人数', total_counts, markline_opts=opts.MarkLineOpts(data=[opts.MarkLineItem(type_="average")]))
    .add_yaxis('录取人数', admitted_counts, markline_opts=opts.MarkLineOpts(data=[opts.MarkLineItem(type_="average")]))
    .set_global_opts(title_opts=opts.TitleOpts(title='总人数和录取人数变化折线图'),
                     yaxis_opts=opts.AxisOpts(name='人数'),
                     xaxis_opts=opts.AxisOpts(name='年份'))
)

# Save as an HTML file (open it in a browser to view)
line.render('/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/录取人数变化.html')

[Screenshot: line chart of applicant and admission counts]

Comparison of the lowest and highest scores in 2020-2022: preliminary test scores, comprehensive interview scores, and total scores (box plot)

import pandas as pd
import matplotlib.pyplot as plt

# Use a Chinese-capable font (SimHei)
plt.rcParams['font.sans-serif'] = 'SimHei'

# Read the data
data_2020 = pd.read_excel('/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/Data/2020jsj.xlsx').dropna()
data_2021 = pd.read_excel('/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/Data/2021jsj.xlsx').dropna()
data_2022 = pd.read_excel('/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/Data/2022jsj.xlsx').dropna()

# Extract the required columns
score_2020 = data_2020['初试成绩']
score_2021 = data_2021['初试成绩']
score_2022 = data_2022['初试成绩']

interview_2020 = data_2020['综合面试成绩']
interview_2021 = data_2021['综合面试成绩']
interview_2022 = data_2022['综合面试成绩']

total_2020 = data_2020['总成绩']
total_2021 = data_2021['总成绩']
total_2022 = data_2022['总成绩']

# Draw the box plots
plt.figure(figsize=(10, 6))

# Preliminary examination scores
plt.subplot(1, 3, 1)
plt.boxplot([score_2020, score_2021, score_2022])
plt.xticks([1, 2, 3], ['2020', '2021', '2022'])
plt.title('初试成绩')

# Comprehensive interview scores
plt.subplot(1, 3, 2)
plt.boxplot([interview_2020, interview_2021, interview_2022])
plt.xticks([1, 2, 3], ['2020', '2021', '2022'])
plt.title('综合面试成绩')

# Total scores
plt.subplot(1, 3, 3)
plt.boxplot([total_2020, total_2021, total_2022])
plt.xticks([1, 2, 3], ['2020', '2021', '2022'])
plt.title('总成绩')

plt.tight_layout()
plt.show()


[Screenshot: box plots of preliminary test, interview, and total scores]

3.2.3 Tutor information

  • Tutors and their corresponding research directions
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
# Use a Chinese-capable font (SimHei)
plt.rcParams['font.sans-serif'] = 'SimHei'

# Read the Excel workbook (one sheet per tutor)
file_path = '/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/Data/导师信息处理plus.xlsx'
df = pd.read_excel(file_path, sheet_name=None)

# Create an empty directed graph
graph = nx.DiGraph()

# Add tutor and research-direction nodes
for sheet_name, sheet_data in df.items():
    tutor_name = sheet_name
    research_directions = sheet_data['主要研究方向'].dropna().tolist()

    # Tutor node
    graph.add_node(tutor_name, node_type='tutor')

    # Research-direction nodes
    for direction in research_directions:
        graph.add_node(direction, node_type='research_direction')

        # Edge from the tutor to the research direction
        graph.add_edge(tutor_name, direction)

# Draw the relationship graph
plt.figure(figsize=(12, 8))
pos = nx.spring_layout(graph, seed=42)
node_colors = {'tutor': 'lightblue', 'research_direction': 'lightgreen'}

nx.draw_networkx_nodes(graph, pos, node_color=[node_colors[graph.nodes[node]['node_type']] for node in graph.nodes()])
nx.draw_networkx_labels(graph, pos, font_size=10, font_color='black')
nx.draw_networkx_edges(graph, pos, arrowstyle='->', arrowsize=10)

plt.axis('off')
plt.show()

[Screenshot: tutor and research direction relationship graph]

  • Research direction word cloud chart
import pandas as pd
from collections import Counter
from pyecharts import options as opts
from pyecharts.charts import WordCloud

# Read the Excel workbook (one sheet per tutor)
file_path = '/Users/liuhao/MyProject/PycharmProject/DataVisualization/Project1/Data/导师信息处理plus.xlsx'
df = pd.read_excel(file_path, sheet_name=None)

# Count how often each research direction occurs
research_directions = []
for sheet_name, sheet_data in df.items():
    research_directions.extend(sheet_data['主要研究方向'].dropna().tolist())

research_direction_counts = Counter(research_directions)

# Build the word cloud data pairs
wordcloud_data = [(key, value) for key, value in research_direction_counts.items()]

# Create the word cloud
wordcloud = (
    WordCloud()
    .add(series_name="研究方向", data_pair=wordcloud_data, word_size_range=[20, 100])
    .set_global_opts(title_opts=opts.TitleOpts(title="研究方向词云图"))
)

# Render the word cloud to an HTML file
wordcloud.render("wordcloud.html")

[Screenshot: research direction word cloud]


4. Summary

This project uses Python web crawlers, Pandas, pyecharts, Matplotlib, and other tools to collect and visualize the postgraduate entrance examination information and tutor information for the Computer Science and Technology major at Shandong University of Technology, and it presents and analyzes the major's admission information for the three years from 2020 to 2022.
Through this project I learned to crawl data from the web, preprocess it with data-processing tools such as Pandas, and visualize it with tools such as pyecharts and Matplotlib, using bar charts, line charts, pie charts, box plots, relationship graphs, word clouds, and other forms of display.


5. Appendix


Source: blog.csdn.net/Lenhart001/article/details/131271640