The following is a visual analysis of student data, covering classroom participation and academic performance.
1: Import modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
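# Use the SimHei font so that Chinese labels render correctly in the plots
plt.rcParams['font.sans-serif']=['simhei']
plt.rcParams['font.serif'] = ['simhei']
# Silence library warnings to keep the output readable
import warnings
warnings.filterwarnings('ignore')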
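2: Load the data and print the first four rows
from matplotlib.font_manager import FontProperties
# Load SimHei (Windows font path) so seaborn can render the Chinese labels
myfont=FontProperties(fname=r'C:\Windows\Fonts\SimHei.ttf',size=12)
sns.set(font=myfont.get_name())
# Forward slashes avoid invalid escape sequences such as '\d' in the path string
df = pd.read_csv('./data/StudentPerformance.csv')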
df.head(4)
The columns have the following meanings:
gender: student's gender
NationalITy: nationality
PlaceofBirth: place of birth
StageID: school stage (lower level, middle school, high school)
GradeID: grade
SectionID: class section
Topic: subject
Semester: semester
Relation: guardian responsible for the student
raisedhands: number of times the student raised a hand in class
VisITedResources: number of times the student viewed online courseware
AnnouncementsView: number of times the student viewed announcements
Discussion: number of times the student took part in classroom discussions
ParentAnsweringSurvey: whether the parents filled out the school's questionnaire
ParentschoolSatisfaction: parents' satisfaction with the school
StudentAbsenceDays: student absence days
Class: performance level (L, M or H)
3: Data visualization analysis
Next, rename the table columns to Chinese
df.rename(columns={'gender':'性别','NationalITy':'国籍','PlaceofBirth':'出生地',
'StageID':'学段','GradeID':'年级','SectionID':'班级','Topic':'科目',
'Semester':'学期','Relation':'监管人','raisedhands':'举手次数',
'VisITedResources':'浏览课件次数','AnnouncementsView':'浏览公告次数',
'Discussion':'讨论次数','ParentAnsweringSurvey':'父母问卷',
'ParentschoolSatisfaction':'家长满意度','StudentAbsenceDays':'缺勤次数',
'Class':'成绩'},inplace=True)
df.columns
Display the values of the semester and school stage columns
Then translate the values
df.replace({'lowerlevel':'小学','MiddleSchool':'中学','HighSchool':'高中'},inplace=True)
# Assign the result back; inplace=True on a column selection may not update df in newer pandas
df['性别'] = df['性别'].replace({'M':'男','F':'女'})
df['学期'] = df['学期'].replace({'S':'春季','F':'秋季'})
df.head(4)
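The renaming and value replacement above can be tried on a small made-up frame. This is only a minimal sketch, not the code used on the real data set: the three rows are invented for illustration, and the result of `replace` is assigned back to each column. Replacing per column matters here because the code `F` appears in both the gender and the semester column with different meanings.

```python
import pandas as pd

# Made-up rows that mimic three columns of StudentPerformance.csv (illustrative only).
sample = pd.DataFrame({
    'gender': ['M', 'F', 'M'],
    'StageID': ['lowerlevel', 'MiddleSchool', 'HighSchool'],
    'Semester': ['F', 'S', 'F'],
})

# rename() returns a new frame; columns not listed in the mapping are left alone.
sample = sample.rename(columns={'gender': '性别', 'StageID': '学段', 'Semester': '学期'})

# Column-level replace, assigned back, so 'F' can be mapped differently per column.
sample['性别'] = sample['性别'].replace({'M': '男', 'F': '女'})
sample['学期'] = sample['学期'].replace({'S': '春季', 'F': '秋季'})
sample['学段'] = sample['学段'].replace(
    {'lowerlevel': '小学', 'MiddleSchool': '中学', 'HighSchool': '高中'})

print(sample)
```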
Check for missing values
df.isnull().sum()
View summary statistics
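No code is shown for this step. A common way to get summary statistics in pandas is `DataFrame.describe()`, so the sketch below assumes that is what was intended. The numbers are invented and do not come from the real data set.

```python
import pandas as pd

# Invented values for two of the numeric columns (not taken from the real CSV).
sample = pd.DataFrame({
    '举手次数': [10, 20, 30, 40],
    '讨论次数': [5, 15, 25, 35],
})

# describe() reports count, mean, std, min, quartiles and max for numeric columns.
stats = sample.describe()
print(stats)

# Missing values per column, using the same check as the previous step.
missing = sample.isnull().sum()
print(missing)
```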
Then draw a count plot by performance level
sns.countplot(x = '成绩', order = ['L', 'M', 'H'], data = df, linewidth=2,edgecolor=sns.color_palette("dark",4))
Then draw a count plot by gender
sns.countplot(x = '性别', order = ['女', '男'],data = df)
Draw a count plot by subject
# sns.set() resets the style, so pass style='whitegrid' in the same call
sns.set(style='whitegrid',rc={'figure.figsize':(16,8)},font=myfont.get_name(),font_scale=1.5)
sns.countplot(x = '科目', data = df)
Draw a count plot of performance levels by subject
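The plotting code for this step is not shown. Following the pattern of the other steps, one plausible call is `sns.countplot(x='科目', hue='成绩', data=df, hue_order=['L','M','H'])`, but that is an assumption. The counts such a plot would draw can be checked with pandas alone, as in this sketch on invented rows:

```python
import pandas as pd

# Invented rows; '科目' is the subject and '成绩' the performance level (L/M/H).
sample = pd.DataFrame({
    '科目': ['Math', 'Math', 'Math', 'IT', 'IT', 'Arabic'],
    '成绩': ['L', 'M', 'M', 'H', 'H', 'L'],
})

# Count rows per (subject, level) pair, then pivot the levels into columns.
counts = (sample.groupby(['科目', '成绩']).size()
                .unstack(fill_value=0)
                .reindex(columns=['L', 'M', 'H'], fill_value=0))
print(counts)
```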
Draw a count plot by gender and performance level
sns.countplot(x = '性别', hue = '成绩',data = df, order = ['女', '男'], hue_order = ['L', 'M', 'H'])
View the distribution of performance levels by class section
sns.countplot(x = '班级', hue='成绩', data=df, hue_order = ['L','M','H'])
# Although each class has relatively few students, no class stands out in its share of high performers, so this feature can be dropped
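A count plot shows absolute numbers, so comparing shares between classes of different sizes is easier with a normalized table. The sketch below uses invented rows and `pd.crosstab` with `normalize='index'`. It is one possible way to get the ratios, not necessarily how the conclusion above was reached.

```python
import pandas as pd

# Invented rows; '班级' is the class section (A/B/C) and '成绩' the performance level.
sample = pd.DataFrame({
    '班级': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C'],
    '成绩': ['L', 'M', 'M', 'H', 'L', 'H', 'M', 'H'],
})

# normalize='index' divides each row by its total, giving shares within each section.
share = pd.crosstab(sample['班级'], sample['成绩'], normalize='index')
share = share[['L', 'M', 'H']]  # put the levels in a readable order
print(share.round(2))
```

If the H column is similar across sections, the section carries little information about performance, which is the argument for dropping it.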
Analyze how four behavior measures relate to performance level
# Relationship between four in-class and after-class behavior measures and performance level
fig, axes = plt.subplots(2,2,figsize=(14,10))
sns.barplot(x='成绩', y='浏览课件次数',data=df,order=['L','M','H'],ax=axes[0,0])
sns.barplot(x='成绩', y='浏览公告次数',data=df,order=['L','M','H'],ax=axes[0,1])
sns.barplot(x='成绩', y='举手次数',data=df,order=['L','M','H'],ax=axes[1,0])
sns.barplot(x='成绩', y='讨论次数',data=df,order=['L','M','H'],ax=axes[1,1])
# By default, sns.barplot plots the mean of each group
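Since the bar heights are group means, they can be reproduced with a plain `groupby`. This minimal sketch uses invented numbers. The error bars that seaborn draws on top of each bar are confidence intervals and are not computed here.

```python
import pandas as pd

# Invented hand-raise counts for six students across the three performance levels.
sample = pd.DataFrame({
    '成绩': ['L', 'L', 'M', 'M', 'H', 'H'],
    '举手次数': [10, 20, 40, 50, 70, 90],
})

# Mean per level: the bar heights a default sns.barplot would show.
means = sample.groupby('成绩')['举手次数'].mean().reindex(['L', 'M', 'H'])
print(means)
```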
Analyze discussion participation of students at different performance levels
# Relationship between discussion count and performance level
sns.set(rc={'figure.figsize':(8,6)},font=myfont.get_name(),font_scale=1.5)
sns.boxplot(x='成绩',y='讨论次数',data=df,order=['L','M','H'])
Analyze the relationship between the number of hand raises and the number of discussions joined
# Relationship between the quantified behavior measures
# fig,axes = plt.subplots(2,1,figsize=(10,10))
sns.regplot(x='举手次数',y='讨论次数',order=4,data=df)
# With the subplots above, pass ax=axes[0] to the call above and add a second plot:
# sns.regplot(x='浏览公告次数',y='浏览课件次数',order=4,data=df,ax=axes[1])
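With `order=4`, the regression line is a degree-4 polynomial fitted by least squares. The same kind of fit can be sketched directly with `np.polyfit`. The data below is invented and follows a known quartic, so the recovered coefficients can be compared with the true ones.

```python
import numpy as np

# Invented data: y follows a known quartic in x, so the fit can be checked.
x = np.linspace(0.0, 1.0, 21)
true_coeffs = [2.0, -1.0, 0.5, 3.0, 1.0]   # highest power first
y = np.polyval(true_coeffs, x)

# Least-squares polynomial of degree 4, the kind of curve order=4 asks for.
coeffs = np.polyfit(x, y, deg=4)
fitted = np.polyval(coeffs, x)
print(np.round(coeffs, 3))
```

On real, noisy data a degree-4 curve can bend to follow noise at the edges of the range, so it is worth comparing it with a lower order before reading much into its shape.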
Analyze the correlations among courseware views, hand raises, announcement views, and discussions
# Correlation matrix
corr = df[['浏览课件次数','举手次数','浏览公告次数','讨论次数']].corr()
corr
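`DataFrame.corr()` computes the Pearson correlation by default. To make the meaning of the values concrete, here is a small sketch on invented columns where the relationships are exact, so the coefficients come out at the extremes of the range.

```python
import pandas as pd

# Invented columns with exact linear relationships (illustrative only).
sample = pd.DataFrame({
    '举手次数': [1, 2, 3, 4, 5],
    '讨论次数': [2, 4, 6, 8, 10],     # exactly twice the first column
    '浏览公告次数': [5, 4, 3, 2, 1],  # decreases exactly as the first increases
})

# The result is a symmetric matrix with 1.0 on the diagonal.
sample_corr = sample.corr()
print(sample_corr)
```

Values near 1 mean two measures rise together, values near -1 mean one falls as the other rises, and values near 0 mean little linear relationship.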
Finally, visualize the correlation matrix with a heat map
# Correlation matrix visualization
sns.heatmap(corr,xticklabels=corr.columns,yticklabels=corr.columns)