Django project: visual analysis of data on Chinese universities

Today I would like to share a simple data analysis and visualization project that I recently built to practice Django. Below is a brief introduction.

Project name: Chinese University Data Visualization Analysis Project.

Features implemented:

1: Login and registration (user information is saved in a MySQL database).

2: A data crawler fetches the relevant data and stores it in CSV files.

3: Data visualization and analysis (e.g. regional distribution of universities, proportion of undergraduate vs. junior-college institutions in each province, distribution of 211/985 universities across provinces, national admission-score rankings, direct lookup of national university data, a ranking by number of majors offered nationwide, the proportion of each school type, the top five majors for employment and salary levels across industries, etc.).

4: Middleware that improves page security by preventing cross-page access when no account information is cached in the session.

5: A cleaner front-end layout and a custom animated 404 error page.
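Item 4 in the list above can be sketched as a minimal login-check middleware. This is only a sketch under assumed names (the class name, the '/login/' and '/register/' paths, and the 'username' session key are my assumptions, not the project's actual code); with Django installed, the tuple returned below would instead be `redirect('login')` from django.shortcuts, and the class would be registered in settings.MIDDLEWARE.

```python
# Minimal sketch of a session-check middleware (assumed names throughout).
# With Django, the tuple return would be `return redirect('login')`.

class LoginRequiredMiddleware:
    WHITE_LIST = {'/login/', '/register/'}  # pages reachable without a login

    def __init__(self, get_response):
        self.get_response = get_response  # the next handler in the chain

    def __call__(self, request):
        # Block any non-whitelisted page when no username is cached in the session.
        if request.path not in self.WHITE_LIST and not request.session.get('username'):
            return ('redirect', '/login/')  # stands in for redirect('login')
        return self.get_response(request)  # otherwise continue normally
```

This way every view is protected in one place instead of repeating the session check at the top of each view function.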

Technologies involved:

Front end (HTML, CSS, JavaScript), a MySQL database, the ECharts visualization library, Python fundamentals plus web crawling and data analysis, and the Django template framework.


Next is the project structure:

data: stores the crawler output and ready-made CSV files.

middleware: middleware (blocks cross-page access when no account information is cached in the session).

spilder: crawler implementation.

static: front-end assets (CSS, JS).

templates: front-end HTML pages.

models: database models.

views: data-analysis business logic.

That covers the main structure of the project; the remaining files need no explanation here.


Next I will take one of the analysis pages, explain in detail how it is implemented, and show how the back-end data is passed to and displayed on the front-end page.

On this page I built pie charts of the number of universities in each province and city nationwide. Let's analyze them one by one.

Let's first look at the chart on the left, the proportion of universities per province:

Step 1: Go to the ECharts official website and pick the style in which the data will be displayed; for example, choose the first pie-chart example ("Access From" of a certain site), as follows:

Opening its code editor, we can see that the data lives in the data field (a list with many small dictionaries nested inside):

So we need to process our final data into the same form as that JS snippet on ECharts (small dictionaries nested in a list).

Step 2: Now comes the important part: how do we extract our data from the CSV file (or database) and process it into the form we need?

The general idea first: the ECharts sample dictionaries look like { value: 1048, name: 'Search Engine' }. Mapped onto our project, the value field should hold the number of universities in a given province, and the name field should hold that province's name.

For convenience, here is the data-processing code, followed by a detailed explanation:

def schooladdress(request):
    username = request.session.get("username")
    userInfo = User.objects.get(username=username)
    selected_field_city = '省份'
    selected_field_city_second = '城市'
    selected_values_numschool = []
    selected_values_numschool_second = []
    with open(filename, 'r', encoding='utf-8') as csv_file:
        csv_reader = csv.DictReader(csv_file)
        for row in csv_reader:
            selected_values_numschool.append(row.get(selected_field_city))
            selected_values_numschool_second.append(row.get(selected_field_city_second))
    distData = {}
    distData_second={}
    for job in selected_values_numschool:
        if distData.get(job, -1) == -1:
            distData[job] = 1
        else:
            distData[job] += 1
    result = []
    for k, v in distData.items():
        result.append({
            'value': v,
            "name": k
        })
    for job in selected_values_numschool_second:
        if distData_second.get(job, -1) == -1:
            distData_second[job] = 1
        else:
            distData_second[job] += 1
    result_second = []
    for k, v in distData_second.items():
        result_second.append({
            'value': v,
            "name": k
        })
    sorted_data = sorted(result_second, key=lambda x: x['value'], reverse=True)[:50]  # sort descending by count and keep the top 50 dicts

    result_second = sorted_data  # keep the sorted result

    python_data_address=result
    python_data_address_second=result_second

    return render(request,'schooladdress.html',{"html_data_address":python_data_address,
                                                "html_data_address_second":python_data_address_second,
                                                "userInfo":username})

We assign the province column name to the variable selected_field_city and create an empty list selected_values_numschool. We then iterate over the target CSV file, calling selected_values_numschool.append(row.get(selected_field_city)) to gather every province entry from the file into one list, e.g. ['Henan', 'Anhui', 'Shandong', 'Zhejiang', 'Jiangsu', 'Henan', 'Anhui', 'Guangdong', 'Fujian', 'Sichuan', 'Henan', ...]. Next we create an empty dictionary and loop over the list:

for job in selected_values_numschool:
    if distData.get(job, -1) == -1:
        distData[job] = 1
    else:
        distData[job] += 1

This snippet means: if the dictionary does not yet contain the province name (get returns -1, i.e. not present), its count is set to 1; if it already exists, its count is incremented by 1. The result looks like: {'Henan': 120, 'Anhui': 110, 'Shandong': 100, ...}
result = []
for k, v in distData.items():
    result.append({'value': v, "name": k})

Then we create an empty list and iterate over the dictionary we just built with k, v, assigning each entry's value v (the count) to the value field and its key k (the province name) to the name field, then appending each small dictionary to the new result list. The final form is: [{'value': 120, 'name': 'Henan'}, {'value': 110, 'name': 'Anhui'}, {'value': 100, 'name': 'Shandong'}, ...]

Suppose we call this variable html_data_pro = [{'value': 120, 'name': 'Henan'}, {'value': 110, 'name': 'Anhui'}, {'value': 100, 'name': 'Shandong'}, ...]

That is exactly the data format we need (consistent with ECharts).
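The counting and reshaping steps above can also be condensed with collections.Counter; the province names here are sample values for illustration only:

```python
from collections import Counter

# sample province column, as it would be read from the CSV
provinces = ['Henan', 'Anhui', 'Shandong', 'Henan', 'Anhui', 'Henan']

# Counter does the same tallying as the dict-based loop above
dist = Counter(provinces)  # Counter({'Henan': 3, 'Anhui': 2, 'Shandong': 1})

# reshape into the list-of-dicts form that ECharts expects
result = [{'value': v, 'name': k} for k, v in dist.items()]
print(result)
# [{'value': 3, 'name': 'Henan'}, {'value': 2, 'name': 'Anhui'}, {'value': 1, 'name': 'Shandong'}]
```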

Step 3: At this point we have the data we need. The next step is to pass it to the front-end JS. Since the front-end layout has already been built with Bootstrap, we copy the ECharts sample code directly into the JS section of the HTML page and replace its first data field (the blue area marked in the picture above) with {{ html_data_pro | safe }}.
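Note that {{ html_data_pro | safe }} inserts the Python repr of the list, which happens to parse as JavaScript here. A more robust alternative (my suggestion, not what the project does) is to serialize the list to JSON in the view before rendering:

```python
import json

# the processed data from the view (sample values)
html_data_pro = [{'value': 120, 'name': 'Henan'}, {'value': 110, 'name': 'Anhui'}]

# json.dumps guarantees valid JS literals (double quotes, no Python-only syntax);
# ensure_ascii=False keeps Chinese province names readable in the page source
context = {'html_data_pro': json.dumps(html_data_pro, ensure_ascii=False)}

# the template still uses:  data: {{ html_data_pro | safe }}
print(context['html_data_pro'])
```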

With that, the provincial proportion chart is complete. For the city proportion chart, we only need to sort and slice the data we obtained: sorted_data = sorted(result_second, key=lambda x: x['value'], reverse=True)[:50] keeps the 50 cities with the most universities, and the result is then rendered on the front end with the Django template language.


The above is the implementation of the regional-distribution page. I will fill in the data-processing ideas for the other pages next time; for now, here is the key code:

views:

import django
from django.core.paginator import Paginator
from django.views.decorators.csrf import csrf_protect
from myapp.models import User
django.setup()
import hashlib
from django.shortcuts import render, redirect,reverse
import csv
from .error import *
import json
import os
import pandas as pd
path = os.path.abspath('.')
filename = os.path.join(path, 'myapp', 'data', 'school.csv')  # join path components so the path works on any OS
print(filename)

def login(request):
    if request.method == 'GET':
        return render(request, 'login.html')
    else:
        uname = request.POST.get('username')
        pwd = request.POST['password']
        md5 = hashlib.md5()
        md5.update(pwd.encode())
        pwd = md5.hexdigest()
        try:
            user = User.objects.get(username=uname,password=pwd)
            request.session['username'] = user.username
            return redirect('index')
        except User.DoesNotExist:
            return errorResponse(request, 'Incorrect username or password!')

# registration page
def registry(request):
    if request.method == 'GET':
        return render(request, 'register.html')
    else:
        uname = request.POST.get('username')
        pwd = request.POST.get('password')
        checkPWD = request.POST.get('checkPassword')
        try:
            User.objects.get(username=uname)
        except User.DoesNotExist:
            if not uname or not pwd or not checkPWD:
                return errorResponse(request, 'Fields must not be empty!')
            if pwd != checkPWD:
                return errorResponse(request, 'The two passwords do not match!')
            md5 = hashlib.md5()
            md5.update(pwd.encode())
            pwd = md5.hexdigest()
            User.objects.create(username=uname, password=pwd)
            return redirect('login')
        return errorResponse(request, 'This username is already registered')
#
def schoolgrade(request):
    username = request.session.get("username")
    userInfo = User.objects.get(username=username)
    BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))  # project root directory
    file_path = os.path.join(BASE_DIR, 'myapp/data/school.csv')  # build the file path
    data = pd.read_csv(file_path)
    list_211=[]
    list_985=[]
    # distribution of 211/985 counts per province
    unique_values = data['省份'].unique()

    for i in unique_values:
        list_211.append(len(data[(data['211'] == '是') & (data['省份'] == i)]))
        list_985.append(len(data[(data['985'] == '是') & (data['省份'] == i)]))
    html_data_grade1 = list_211
    html_data_grade2 = list_985
    html_data_city = unique_values
    print(html_data_city, html_data_grade1, html_data_grade2)
    return render(request,"schoolgrade.html",{"html_data_city":html_data_city,"html_data_grade1":html_data_grade1,"html_data_grade2":html_data_grade2,"userInfo":username})


#
def schooladdress(request):
    username = request.session.get("username")
    userInfo = User.objects.get(username=username)
    selected_field_city = '省份'
    selected_field_city_second = '城市'
    selected_values_numschool = []
    selected_values_numschool_second = []
    with open(filename, 'r', encoding='utf-8') as csv_file:
        csv_reader = csv.DictReader(csv_file)
        for row in csv_reader:
            selected_values_numschool.append(row.get(selected_field_city))
            selected_values_numschool_second.append(row.get(selected_field_city_second))
    distData = {}
    distData_second={}
    for job in selected_values_numschool:
        if distData.get(job, -1) == -1:
            distData[job] = 1
        else:
            distData[job] += 1
    result = []
    for k, v in distData.items():
        result.append({
            'value': v,
            "name": k
        })
    for job in selected_values_numschool_second:
        if distData_second.get(job, -1) == -1:
            distData_second[job] = 1
        else:
            distData_second[job] += 1
    result_second = []
    for k, v in distData_second.items():
        result_second.append({
            'value': v,
            "name": k
        })
    sorted_data = sorted(result_second, key=lambda x: x['value'], reverse=True)[:50]  # sort descending by count and keep the top 50 dicts

    result_second = sorted_data  # keep the sorted result

    python_data_address=result
    python_data_address_second=result_second

    return render(request,'schooladdress.html',{"html_data_address":python_data_address,
                                                "html_data_address_second":python_data_address_second,
                                                "userInfo":username})
#
def school_score(request):
    username = request.session.get("username")
    userInfo = User.objects.get(username=username)
    BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))  # project root directory
    file_path = os.path.join(BASE_DIR, 'myapp/data/广西.csv')  # build the file path
    data = pd.read_csv(file_path)  # read the file

    # sort descending by the score-line column
    sorted_df = data.sort_values('2022分数线', ascending=False)

    # take the rows with the 10 highest scores
    top_10 = sorted_df.head(10)

    # lists for the chart data and the legend names
    result = []
    name_list = []  # renamed from `list` to avoid shadowing the built-in
    # iterate over the top rows and put the needed fields into a dict
    for index, row in top_10.iterrows():
        dictionary = {
            'name': row['名称'],
            'value': row['2022分数线'][4:10]
            # add any other required fields here
        }
        name_list.append(dictionary['name'])
        result.append(dictionary)

    print(result)
    print(name_list)
    html_data_score = result
    legend = name_list
    return render(request,'schoolscore.html',{'html_data_score':html_data_score,"legend":legend,"userInfo":username})

#
def school_benzhuan(request):
    username = request.session.get("username")
    userInfo = User.objects.get(username=username)
    BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))  # project root directory
    file_path = os.path.join(BASE_DIR, 'myapp/data/school.csv')  # build the file path
    # data = pd.read_csv(file_path)

    df = pd.read_csv(file_path, encoding='utf-8')
    new_df = df[["省份", "水平层次"]]
    new_df_grouped = new_df.groupby("省份")
    undergraduate_res_ls = []
    specialty_res_ls = []
    undergraduate_low_dict = {'name': "0~33", 'children': []}
    undergraduate_middle_dict = {'name': "33~66", 'children': []}
    undergraduate_upper_dict = {'name': '66~100', 'children': []}
    specialty_low_dict = {'name': "0~33", 'children': []}
    specialty_middle_dict = {'name': "33~66", 'children': []}
    specialty_upper_dict = {'name': '66~100', 'children': []}
    for i, j in new_df_grouped:
        area_dict = {'name': i}
        undergraduate_rate = j.loc[j["水平层次"] == "普通本科"].count() / j.value_counts().sum()
        specialty_rate = j.loc[j["水平层次"] == "专科(高职)"].count() / j.value_counts().sum()
        undergraduate_rate = float(undergraduate_rate[0])
        specialty_rate = float(specialty_rate[0])
        if undergraduate_rate < 0.33:
            undergraduate_low_dict['children'].append(area_dict)
        elif undergraduate_rate < 0.66:
            undergraduate_middle_dict['children'].append(area_dict)
        else:
            undergraduate_upper_dict['children'].append(area_dict)
        if specialty_rate < 0.33:
            specialty_low_dict['children'].append(area_dict)
        elif specialty_rate < 0.66:
            specialty_middle_dict['children'].append(area_dict)
        else:
            specialty_upper_dict['children'].append(area_dict)
    undergraduate_res_ls.append(undergraduate_low_dict)
    undergraduate_res_ls.append(undergraduate_middle_dict)
    undergraduate_res_ls.append(undergraduate_upper_dict)
    specialty_res_ls.append(specialty_low_dict)
    specialty_res_ls.append(specialty_middle_dict)
    specialty_res_ls.append(specialty_upper_dict)
    print(undergraduate_res_ls)
    print('*' * 100)
    print(specialty_res_ls)
    html_data_ben=undergraduate_res_ls
    html_data_zhuan=specialty_res_ls
    return render(request,'schoolbenzhuan.html',{'html_data_ben':html_data_ben,'html_data_zhuan':html_data_zhuan,"userInfo":username})
def read_csv_file(file_path):
    data = []
    with open(file_path, 'r', encoding='utf-8') as csv_file:
        reader = csv.DictReader(csv_file)
        for row in reader:
            data.append(row)
    return data
def school_list(request):
    username = request.session.get("username")
    userInfo = User.objects.get(username=username)
    BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))  # project root directory
    csv_file_path = os.path.join(BASE_DIR, 'myapp/data/school.csv')  # build the file path

    schools = read_csv_file(csv_file_path)

    # get the selected province
    selected_province = request.GET.get('province')
    all_provinces = set([school['省份'] for school in schools])
    # filter the school data
    if selected_province:
        schools = [school for school in schools if school['省份'] == selected_province]
    # handle pagination
    paginator = Paginator(schools, 10)  # 10 records per page
    page_number = request.GET.get('page')
    page = paginator.get_page(page_number)

    context = {
        'schools': page,
        'selected_province': selected_province,
        'all_provinces': all_provinces,
         "userInfo":username
    }

    return render(request, 'schoolget.html', context)
def cancel(request):
    username = request.session.get("username")
    obj = User.objects.get(username=username)
    obj.delete()
    return render(request, 'login.html')

def workrate(request):
    username = request.session.get("username")
    return render(request,'workrate.html',{ "userInfo":username})
def schoolstate(request):
    username = request.session.get("username")
    userInfo = User.objects.get(username=username)
    BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))  # project root directory
    file_path = os.path.join(BASE_DIR, 'myapp/data/school.csv')  # build the file path
    # count the occurrences of each value in the school-type column
    data = pd.read_csv(file_path)
    value_counts0 = data['办学类型'].value_counts().iloc[0]
    value_counts1 = data['办学类型'].value_counts().iloc[1]
    value_counts2 = data['办学类型'].value_counts().iloc[2]
    total = value_counts0 + value_counts1 + value_counts2
    print(value_counts0, value_counts1, value_counts2, total)

    return render(request,'schoolstate.html',{"html_data_state0":value_counts0,'html_data_state1':value_counts1,
                                              "html_data_state2":value_counts2,"total":total,
                                                "userInfo":username})
def salary(request):
    username = request.session.get("username")
    BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))  # project root directory
    file_path = os.path.join(BASE_DIR, 'myapp/data/salary.csv')  # build the file path
    data = pd.read_csv(file_path)
    def get_column_values(csv_file, column_name):
        values = []
        with open(csv_file, 'r', encoding='utf-8') as file:
            reader = csv.DictReader(file)
            for row in reader:
                values.append(row[column_name])
        return values

    csv_file = file_path
    column_major = '行业'
    column_noself = '非私营单位就业人员平均工资(元)'
    column_self = '私营单位就业人员平均工资(元)'
    column_values = get_column_values(csv_file, column_major)[::-1]
    column_noself_values = get_column_values(csv_file, column_noself)[::-1]
    column_self_values = get_column_values(csv_file, column_self)[::-1]

    print(column_values)
    print(column_noself_values)
    print(column_self_values)
    html_data_major=column_values
    html_data_noself=column_noself_values
    html_data_self=column_self_values
    return render(request,'salary.html',{'html_data_major':html_data_major,'html_data_noself':html_data_noself,
                                         'html_data_self':html_data_self,"userInfo":username})


def majornum(request):
    return render(request, 'majornum.html', {})

Once the view functions are written, they are simply invoked through their URL routes.
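For reference, the URL wiring might look like the sketch below. The route paths and names are assumptions inferred from the view names and the redirect('index') / redirect('login') calls above, not the project's actual urls.py:

```python
# myapp/urls.py (sketch; paths and route names are assumed)
from django.urls import path
from myapp import views

urlpatterns = [
    path('login/', views.login, name='login'),
    path('register/', views.registry, name='register'),
    path('schooladdress/', views.schooladdress, name='schooladdress'),
    path('schoolgrade/', views.schoolgrade, name='schoolgrade'),
    path('schoolscore/', views.school_score, name='schoolscore'),
    path('schoolget/', views.school_list, name='schoolget'),
]
```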

Finally, the process and code for drawing a word cloud with matplotlib and wordcloud:

  1. Import the required packages, such as numpy, PIL, matplotlib, wordcloud, pandas, jieba, etc.
  2. To perform word segmentation processing on the text, you can use the jieba word segmentation library to split the text into words and remove stop words.
  3. Count the frequency of occurrence of each word and generate a word frequency dictionary.
  4. Generate a word cloud graph based on the word frequency dictionary, and you can set the shape, color, font and other parameters of the word cloud graph.
  5. Save the generated word cloud image locally or display it on the web page.
import matplotlib.pyplot as plt
import pandas as pd
import jieba
from wordcloud import WordCloud
import os
# read the CSV file and preprocess the data
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
file_path = os.path.join(BASE_DIR, 'myapp/data/professional.csv')
df = pd.read_csv(file_path)
text = ''.join(df['专业名称'].tolist())
stopwords = set(open('stopwords.txt', 'r', encoding='utf-8').read().split('\n'))
words = [word for word in jieba.cut(text) if word not in stopwords]

# print(words)
# convert the processed words into a frequency dictionary
word_dict = {}
for word in words:
    if word in word_dict:
        word_dict[word] += 1
    else:
        word_dict[word] = 1
# generate the word cloud
# wordcloud = WordCloud(background_color='white').generate_from_frequencies(word_dict)
wordcloud = WordCloud(font_path='msyh.ttc', background_color='white', width=800, height=600)
wordcloud.generate_from_frequencies(word_dict)
plt.imshow(wordcloud)
plt.axis('off')
plt.savefig('ciyun.jpg')
plt.show()

That's all for today; I'll continue editing tomorrow and publish this for now. If you need this project, send me a private message, and if anything in it falls short, please let me know. Off to bed now.

Origin blog.csdn.net/Abtxr/article/details/133692627