Python Data Visualization Final Course Project: Visualizing Exam Scores Across Subjects

Case requirements

1. Training points
(1) Master the use of the seaborn or matplotlib library for data visualization
(2) Master the writing of a visual analysis report
2. Requirement description
In real life, students' grades are affected by many factors. In teaching research, beyond analyzing the test results of each subject, a deeper analysis of other student information (such as family background, gender, diet, and test preparation) can help teachers better understand how students perform on exams. The student test score dataset contains 8 fields and 1,000 records; the fields are described in the table below.
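Table: Field information description in the student test score dataset (column names as they appear in the CSV)
性别: gender (text)
民族: ethnic group label such as A, B, C (text)
父母教育程度: parents' education level (text)
午餐: lunch standard, 标准 = standard or 自由/减少 = free/reduced (text)
课程完成情况: course completion status, 完成 = completed or 未完成 = not completed (text)
数学成绩: math score (integer)
阅读成绩: reading score (integer)
写作成绩: writing score (integer)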
The goals are to understand how students of different genders perform in mathematics, reading, and writing; whether parents' education level affects those scores; whether the lunch standard affects performance; and whether sufficient test preparation helps improve it. Answering these questions requires reading, processing, and visually analyzing the student test score dataset.

(1) Use the pandas library to read the file, view the characteristics and descriptive statistics of the raw data, and check for null values.
(2) Take the reading score, math score, and writing score fields from the data frame, sum them to get each student's total score, and divide the total by 3 to get the average score.
(3) Set the passing mark for each course to 60 points, judge whether each student passes each course (Pass/Fail), and add the new data columns pass_reading, pass_math, and pass_writing.
(4) Judge each student's overall status. If any of the three courses is Fail, the overall status is Fail. Add the result as the new data column status.
(5) For students whose overall status is Pass, assign a 5-level grade based on the average score: 90 or above is excellent, 80 to 89 is good, 70 to 79 is medium, 60 to 69 is pass, and everything else is fail.
(6) Draw visual graphics.
Draw a horizontal bar chart of parents' education level
Draw a pie chart of the distribution of grades for all students
Draw a histogram of the score distribution for each subject
Draw a count plot of parents' education level and whether the prerequisite course was completed
Draw a boxplot of grades and gender distribution
Draw a scatterplot of lunch standard and total score, classified by gender
Draw a correlation heat map of the features

Background

This project sets out to understand how students of different genders perform in mathematics, reading, and writing, and whether parental education, lunch standards, and test preparation affect student performance. Doing so requires reading, processing, and visualizing the student test score dataset. Education plays a central role in many people's lives, and as data analysis and visualization tools have matured, they are increasingly applied in education to help schools, parents, and governments manage and make decisions. Python is widely used for this kind of work because of its data analysis and visualization libraries. This article walks through a Python script that analyzes student characteristics and performance, with the aim of exploring how different characteristics relate to student performance and giving schools and parents a reference for educational decisions.

1. Processing data

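import pandas as pd

# Read the file using GBK encoding
df = pd.read_csv("StudentsPerformance.csv", encoding="gbk")

# View the first few rows of the data
print(df.head())

# View basic information: the dtype and non-null count of each column.
# Note: df.info() prints its own report and returns None, so wrapping it
# in print() also prints "None" after the report.
print(df.info())

# View summary statistics for each numeric column (mean, standard deviation, max, min, etc.)
print(df.describe())

# Check for null values
print(df.isnull().sum())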

This code uses the pandas library to inspect a CSV file called "StudentsPerformance.csv". It does the following:
read_csv() reads the data from "StudentsPerformance.csv" and stores it in a DataFrame object called df.
df.head() displays the first five rows of the DataFrame, which is convenient for an initial look at the data.
df.info() prints basic information about the DataFrame, including the number of columns, each column name, the number of non-null values in each column, and the data types.
df.describe() prints summary statistics for each numeric column, such as mean, standard deviation, maximum, and minimum.
df.isnull().sum() counts the missing values in each column, which shows whether the DataFrame has any missing values.

Result:
  性别 民族 父母教育程度     午餐 课程完成情况  数学成绩  阅读成绩  写作成绩
0  女  B   学士学位     标准    未完成    72    72    74
1  女  C  大学未毕业     标准     完成    69    90    88
2  女  B   硕士学位     标准    未完成    90    95    93
3  男  A  副学士学位  自由/减少    未完成    47    57    44
4  男  C  大学未毕业     标准    未完成    76    78    75
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   性别      1000 non-null   object
 1   民族      1000 non-null   object
 2   父母教育程度  1000 non-null   object
 3   午餐      1000 non-null   object
 4   课程完成情况  1000 non-null   object
 5   数学成绩    1000 non-null   int64 
 6   阅读成绩    1000 non-null   int64 
 7   写作成绩    1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB
None
             数学成绩         阅读成绩         写作成绩
count  1000.00000  1000.000000  1000.000000
mean     66.08900    69.169000    68.054000
std      15.16308    14.600192    15.195657
min       0.00000    17.000000    10.000000
25%      57.00000    59.000000    57.750000
50%      66.00000    70.000000    69.000000
75%      77.00000    79.000000    79.000000
max     100.00000   100.000000   100.000000
性别        0
民族        0
父母教育程度    0
午餐        0
课程完成情况    0
数学成绩      0
阅读成绩      0
写作成绩      0
dtype: int64
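
The per-column null counts above are all zero. When only a yes/no answer is needed, the same check can be collapsed into a single boolean. The following is a minimal sketch on a made-up toy frame (the values, including the deliberately missing one, are assumptions for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only).
toy = pd.DataFrame({
    "数学成绩": [72, 69, None],   # one value deliberately missing
    "阅读成绩": [72, 90, 95],
})

# Per-column count of missing values, as in the article.
missing_per_column = toy.isnull().sum()

# Single yes/no answer: is anything missing anywhere in the frame?
has_missing = bool(toy.isnull().values.any())

print(missing_per_column)
print("Any missing values:", has_missing)
```

On the article's dataset the boolean would be False, since every column reports 0 missing values.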

2. Select the three fields of reading score, math score and writing score, and calculate the total score and average score

# Select the three score columns (positions 5-7: math, reading, writing) and compute the total and average
df['总分'] = df.iloc[:, 5:8].sum(axis=1)
df['平均分'] = df['总分'] / 3

# View the result
print(df[['阅读成绩', '数学成绩', '写作成绩', '总分', '平均分']].head())

This code selects the math, reading, and writing score fields from the DataFrame object df and calculates each student's total and average score:

df.iloc[:, 5:8] selects the 6th, 7th, and 8th columns (数学成绩, 阅读成绩, 写作成绩), and sum(axis=1) sums each row. The result is added to df as each student's total score (总分).

Each student's average score (平均分) is the total divided by 3, and it is also added to df.

df[['阅读成绩', '数学成绩', '写作成绩', '总分', '平均分']] selects these five fields from df, and head() shows the first five rows.

In short, the code adds two derived columns, the total and the average score, to the original data for later statistics and analysis, and then prints the selected fields so the results can be checked.
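
Selecting by position with iloc depends on the column order in the CSV. As an alternative, the columns can be selected by name. This is a minimal sketch on a small toy frame, not the author's code; the list name score_cols is a hypothetical helper introduced here:

```python
import pandas as pd

# Toy frame with the same score columns as the dataset (illustrative values).
toy = pd.DataFrame({
    "数学成绩": [72, 69, 90],
    "阅读成绩": [72, 90, 95],
    "写作成绩": [74, 88, 93],
})

# Hypothetical helper list: the three score columns, referenced by name.
score_cols = ["数学成绩", "阅读成绩", "写作成绩"]

# Selecting by name keeps the result independent of column order.
toy["总分"] = toy[score_cols].sum(axis=1)
toy["平均分"] = toy[score_cols].mean(axis=1)

print(toy)
```

Using mean(axis=1) avoids hard-coding the divisor 3, so the code stays correct if a subject is added to score_cols.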

Result:
   阅读成绩  数学成绩  写作成绩   总分        平均分
0    72    72    74  218  72.666667
1    90    69    88  247  82.333333
2    95    90    93  278  92.666667
3    57    47    44  148  49.333333
4    78    76    75  229  76.333333

3. Set the pass line to 60 points, and use a lambda function to judge whether each student has passed each course

# Set the pass line to 60 points and use a lambda function to judge whether each student passed each course
df['pass_reading'] = df['阅读成绩'].apply(lambda x: 'Pass' if x >= 60 else 'Fail')
df['pass_math'] = df['数学成绩'].apply(lambda x: 'Pass' if x >= 60 else 'Fail')
df['pass_writing'] = df['写作成绩'].apply(lambda x: 'Pass' if x >= 60 else 'Fail')

# Print the modified data frame
print(df.head())

This code sets the pass line to 60 points, uses a lambda function to determine whether each student has passed reading, mathematics, and writing, and stores the results in three new columns: pass_reading, pass_math, and pass_writing:

The apply method runs the lambda function on each value of the reading, math, and writing score columns. If the score is greater than or equal to 60, the value is 'Pass'; otherwise it is 'Fail'. The result is assigned to a new column of df.

Because df now has the three new columns, it is printed again so the results can be checked.

These per-course flags are the basis for the overall pass/fail status computed in the next step.

  性别 民族 父母教育程度     午餐 课程完成情况  数学成绩  阅读成绩  写作成绩   总分        平均分 pass_reading  \
0  女  B   学士学位     标准    未完成    72    72    74  218  72.666667         Pass   
1  女  C  大学未毕业     标准     完成    69    90    88  247  82.333333         Pass   
2  女  B   硕士学位     标准    未完成    90    95    93  278  92.666667         Pass   
3  男  A  副学士学位  自由/减少    未完成    47    57    44  148  49.333333         Fail   
4  男  C  大学未毕业     标准    未完成    76    78    75  229  76.333333         Pass   

  pass_math pass_writing  
0      Pass         Pass  
1      Pass         Pass  
2      Pass         Pass  
3      Fail         Fail  
4      Pass         Pass  
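
The per-course check can also be written without apply and lambda. The sketch below is an alternative, not the author's method: it uses numpy.where on a toy frame, and the names PASS_MARK and new_cols are hypothetical helpers introduced for illustration.

```python
import numpy as np
import pandas as pd

PASS_MARK = 60  # pass line from the requirements

# Toy frame (illustrative values, including scores right at the boundary).
toy = pd.DataFrame({
    "数学成绩": [72, 47, 60],
    "阅读成绩": [72, 57, 59],
    "写作成绩": [74, 44, 80],
})

# Hypothetical mapping: new flag column -> source score column.
new_cols = {
    "pass_math": "数学成绩",
    "pass_reading": "阅读成绩",
    "pass_writing": "写作成绩",
}

# np.where evaluates the whole column at once instead of row by row.
for new_col, score_col in new_cols.items():
    toy[new_col] = np.where(toy[score_col] >= PASS_MARK, "Pass", "Fail")

print(toy)
```

The comparison uses >=, so a score of exactly 60 counts as Pass, matching the lambda version.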

4. Use the apply() function and a lambda expression to judge whether each student passes overall, and store the result in a new data column

# Use apply() and a lambda expression to judge each student's overall status and store it in a new column
df['情况'] = df.apply(lambda x: 'Pass' if x['pass_reading'] == 'Pass' and x['pass_math'] == 'Pass' and x['pass_writing'] == 'Pass' else 'Fail', axis=1)

# Print the modified data frame
print(df.head())

This code uses the apply() function and a lambda expression to decide whether each student's overall status is Pass, and stores the result in a new column '情况' (status) of df:

apply() with axis=1 passes each row of df to the lambda function. For each student, the lambda checks whether reading, mathematics, and writing are all 'Pass'. If all three pass, the student passes overall; otherwise the student fails overall.

The resulting 'Pass' or 'Fail' is stored in the new column '情况'.

The modified df is printed so the results can be checked.

Result:
  性别 民族 父母教育程度     午餐 课程完成情况  数学成绩  阅读成绩  写作成绩   总分        平均分 pass_reading  \
0  女  B   学士学位     标准    未完成    72    72    74  218  72.666667         Pass   
1  女  C  大学未毕业     标准     完成    69    90    88  247  82.333333         Pass   
2  女  B   硕士学位     标准    未完成    90    95    93  278  92.666667         Pass   
3  男  A  副学士学位  自由/减少    未完成    47    57    44  148  49.333333         Fail   
4  男  C  大学未毕业     标准    未完成    76    78    75  229  76.333333         Pass   

  pass_math pass_writing    情况  
0      Pass         Pass  Pass  
1      Pass         Pass  Pass  
2      Pass         Pass  Pass  
3      Fail         Fail  Fail  
4      Pass         Pass  Pass  
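
The all-three-must-pass rule can also be expressed as a column-wise comparison. This is a minimal sketch of one possible alternative on toy data; flag_cols and all_passed are hypothetical names introduced here:

```python
import pandas as pd

# Toy per-course flags (illustrative values).
toy = pd.DataFrame({
    "pass_reading": ["Pass", "Fail", "Pass"],
    "pass_math": ["Pass", "Fail", "Pass"],
    "pass_writing": ["Pass", "Fail", "Fail"],
})

flag_cols = ["pass_reading", "pass_math", "pass_writing"]

# True only for rows where every per-course flag equals 'Pass'.
all_passed = (toy[flag_cols] == "Pass").all(axis=1)

# Map the boolean result back to the article's labels.
toy["情况"] = all_passed.map({True: "Pass", False: "Fail"})

print(toy)
```

The third toy student passes two courses but fails writing, so the overall status is Fail, which is the behavior requirement (4) describes.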

5. Calculate the total and average score of each student, use a conditional expression to assign a grade, and store the result in a new data column

# Compute each student's total and average score (same values as in step 2),
# assign a grade with a chained conditional expression, and store it in a new column
df['总分'] = df['数学成绩'] + df['阅读成绩'] + df['写作成绩']
df['平均分'] = df['总分'] / 3
df['等级'] = df.apply(lambda x: '优秀' if x['平均分'] >= 90 else '良好' if x['平均分'] >= 80 else '中等' if x['平均分'] >= 70 else '及格' if x['平均分'] >= 60 else '不及格', axis=1)

# Print the modified data frame
print(df.head())

This code calculates each student's total and average score, assigns a grade with a chained conditional expression, and stores the result in the new column '等级' (grade) of df:

First, the total and average scores are calculated and stored in the '总分' and '平均分' columns. These are the same values computed in step 2.

apply() with axis=1 passes each row to the lambda function. The conditional expression maps the average score to a grade: 90 or above is 优秀 (excellent), 80 or above is 良好 (good), 70 or above is 中等 (medium), 60 or above is 及格 (pass), and anything lower is 不及格 (fail). The result is stored in the new column '等级'.

Note that the grade here depends only on the average score; the overall status in '情况' is not consulted, so a student who fails one course can still receive a passing grade.

The modified df is printed so the results can be checked.

  性别 民族 父母教育程度     午餐 课程完成情况  数学成绩  阅读成绩  写作成绩   总分        平均分 pass_reading  \
0  女  B   学士学位     标准    未完成    72    72    74  218  72.666667         Pass   
1  女  C  大学未毕业     标准     完成    69    90    88  247  82.333333         Pass   
2  女  B   硕士学位     标准    未完成    90    95    93  278  92.666667         Pass   
3  男  A  副学士学位  自由/减少    未完成    47    57    44  148  49.333333         Fail   
4  男  C  大学未毕业     标准    未完成    76    78    75  229  76.333333         Pass   

  pass_math pass_writing    情况   等级  
0      Pass         Pass  Pass   中等  
1      Pass         Pass  Pass   良好  
2      Pass         Pass  Pass   优秀  
3      Fail         Fail  Fail  不及格  
4      Pass         Pass  Pass   中等  
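
Requirement (5) restricts the five-level grade to students whose overall status is Pass. The following sketch shows one way this could be done with pandas.cut. It is an assumption about how the requirement might be implemented, not the author's code; the toy values and the names bins, labels, and grade are introduced for illustration:

```python
import pandas as pd

# Toy averages and overall statuses (illustrative values).
toy = pd.DataFrame({
    "平均分": [92.7, 82.3, 72.7, 65.0, 49.3, 60.0],
    "情况": ["Pass", "Pass", "Pass", "Fail", "Fail", "Pass"],
})

# Bin edges for the five levels; the open upper edge keeps an average of 100 in the top bin.
bins = [0, 60, 70, 80, 90, float("inf")]
labels = ["不及格", "及格", "中等", "良好", "优秀"]

# right=False makes each bin closed on the left: [60, 70), [70, 80), ...
grade = pd.cut(toy["平均分"], bins=bins, labels=labels, right=False)

# Keep the binned grade only for students who passed overall; everyone else is 不及格.
toy["等级"] = grade.astype(str).where(toy["情况"] == "Pass", "不及格")

print(toy)
```

In this toy data the fourth student has an average of 65 but an overall status of Fail, so the grade is 不及格 rather than 及格. That is the one case where this sketch and the chained conditional expression disagree.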

6. Data visualization

1. Count the number of students at each parental education level, and draw a horizontal bar chart

import seaborn as sns
import matplotlib.pyplot as plt

# Use a font that can render the Chinese labels
plt.rcParams['font.sans-serif'] = ['SimHei']
# Count the students at each parental education level and draw a horizontal bar chart
edu_counts = df['父母教育程度'].value_counts()
# barh draws horizontal bars (plt.bar would draw vertical ones)
plt.barh(edu_counts.index, edu_counts.values)
plt.xlabel('人数')
plt.ylabel('父母教育程度')
plt.title('父母受教育程度的水平柱状图')
plt.show()

2. Count the number of people who pass and fail, and draw a pie chart

# Count the students who pass and fail overall, and draw a pie chart
pass_counts = df['情况'].value_counts()

labels = ['及格', '不及格']
sizes = [pass_counts['Pass'], pass_counts['Fail']]
plt.pie(sizes,
        labels=labels,
        autopct='%1.1f%%')

plt.title('全体学生成绩分布饼图')
plt.axis('equal')
plt.show()
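
Indexing pass_counts['Fail'] raises a KeyError when a category is absent, for example if no student failed. A minimal sketch of a more defensive count, using a toy status column (an assumption for illustration) and reindex:

```python
import pandas as pd

# Toy status column in which nobody failed (illustrative values only).
status = pd.Series(["Pass", "Pass", "Pass"], name="情况")

# reindex guarantees both categories exist, filling absent ones with 0.
pass_counts = status.value_counts().reindex(["Pass", "Fail"], fill_value=0)

# Sizes in a fixed order, ready to hand to plt.pie together with the labels.
sizes = [int(pass_counts["Pass"]), int(pass_counts["Fail"])]

print(sizes)
```

Fixing the order with reindex also keeps the sizes aligned with the labels list regardless of which category is more frequent.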

3. Draw a histogram of math scores, reading scores, and writing scores

# Draw overlaid histograms of the math, reading, and writing scores
plt.hist(df['数学成绩'], bins=10, alpha=0.5, label='math score')
plt.hist(df['阅读成绩'], bins=10, alpha=0.5, label='reading score')
plt.hist(df['写作成绩'], bins=10, alpha=0.5, label='writing score')

plt.legend(loc='upper right')
plt.xlabel('成绩')
plt.ylabel('人数')
plt.title('各科成绩分布直方图')
plt.show()

4. Draw a count plot of parents' education level and whether the prerequisite course was completed

# Draw a count plot of parents' education level, split by whether the prerequisite course was completed
sns.countplot(x='父母教育程度',
              data=df,
              hue='课程完成情况')
plt.title('父母受教育程度与前置课程是否完成统计分类图')
plt.show()

5. Draw a boxplot of grades and gender distribution

# Draw a box plot of total score by grade level, split by gender
sns.boxplot(x='等级',
            y='总分',
            hue='性别',
            data=df)
plt.title('成绩评级与性别分布箱线图')
plt.show()

6. Draw a scatterplot of lunch standard and total score, classified by gender

# Draw a scatterplot of lunch standard against total score, colored by gender
sns.scatterplot(x='午餐',
                y='总分',
                hue='性别',
                data=df)
plt.title('午餐标准与总成绩的性别分类散点图')
plt.show()
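
Because the lunch standard is categorical, the scatterplot stacks points in vertical strips, and group differences can be hard to read. A numeric summary can complement the chart. The sketch below uses made-up toy values, not results from the dataset, and the name summary is introduced for illustration:

```python
import pandas as pd

# Toy frame (illustrative values only, not the real data).
toy = pd.DataFrame({
    "午餐": ["标准", "标准", "自由/减少", "自由/减少"],
    "性别": ["女", "男", "女", "男"],
    "总分": [218, 229, 200, 148],
})

# Mean total score for each lunch standard and gender combination,
# pivoted so that each gender becomes a column.
summary = toy.groupby(["午餐", "性别"])["总分"].mean().unstack("性别")

print(summary)
```

Running the same groupby on df would give the group means behind the scatterplot.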

7. Calculate the correlation coefficient between each feature and draw a heat map

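# Correlate the numeric columns only: the text columns (gender, lunch, etc.)
# cannot be correlated directly and make df.corr() raise an error on recent pandas versions
corr = df.select_dtypes(include='number').corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('各特征的相关热力图')
plt.show()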
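
The correlation matrix only covers numeric columns. If a text feature such as gender should appear in the heat map, it has to be encoded as numbers first. The following sketch shows one possible encoding with integer category codes on toy data; this is an assumption for illustration and not part of the original script:

```python
import pandas as pd

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "性别": ["女", "女", "男", "男"],
    "总分": [218, 247, 148, 229],
})

encoded = toy.copy()
# Assumption: one integer code per category; the numbering itself is arbitrary.
encoded["性别"] = encoded["性别"].astype("category").cat.codes

# All columns are numeric now, so corr() covers the encoded feature too.
corr = encoded.corr()

print(corr)
```

Integer codes are only meaningful for two-valued or ordered categories. For a feature with several unordered values, the codes impose an artificial order, and the resulting coefficient should not be read as a real correlation.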

7. Summary

Through analysis and visualization of the student data, this Python script explores the relationship between student characteristics and grades. Two observations the author draws from the charts are that parents' education levels are generally fairly high and that female students have noticeably higher total scores than male students. These observations can serve as a reference for schools, parents, and governments when designing education programs.

In addition, Matplotlib and Seaborn make the results more intuitive: charts show the data distributions and the relationships between fields more clearly than raw tables, which makes the data easier to analyze and interpret.

Note that these conclusions come from a single dataset and do not necessarily apply to every situation; the actual context still needs to be considered. Further exploration of the data may reveal deeper patterns that offer more guidance for education.

In conclusion, this script is a starting point for data analysis in education and for studying the relationship between student characteristics and grades. In future student management and teaching practice, data can be used further to support students' growth and development.

8. Complete code

The code below collects all of the steps above: reading the data, inspecting it, computing the total and average scores, judging whether each course and the overall status are passed, assigning grades, and drawing the bar chart, pie chart, histogram, count plot, box plot, scatter plot, and heat map.

import pandas as pd

# Read the file using GBK encoding
df = pd.read_csv("StudentsPerformance.csv", encoding="gbk")

# View the first few rows of the data
print(df.head())

# View basic information: the dtype and non-null count of each column
# (df.info() prints its own report and returns None, so print() also emits "None")
print(df.info())

# View summary statistics for each numeric column (mean, standard deviation, max, min, etc.)
print(df.describe())

# Check for null values
print(df.isnull().sum())


# Select the three score columns (positions 5-7: math, reading, writing) and compute the total and average
df['总分'] = df.iloc[:, 5:8].sum(axis=1)
df['平均分'] = df['总分'] / 3

# View the result
print(df[['阅读成绩', '数学成绩', '写作成绩', '总分', '平均分']].head())


# Set the pass line to 60 points and use a lambda function to judge whether each student passed each course
df['pass_reading'] = df['阅读成绩'].apply(lambda x: 'Pass' if x >= 60 else 'Fail')
df['pass_math'] = df['数学成绩'].apply(lambda x: 'Pass' if x >= 60 else 'Fail')
df['pass_writing'] = df['写作成绩'].apply(lambda x: 'Pass' if x >= 60 else 'Fail')

# Print the modified data frame
print(df.head())

# Use apply() and a lambda expression to judge each student's overall status and store it in a new column
df['情况'] = df.apply(lambda x: 'Pass' if x['pass_reading'] == 'Pass' and x['pass_math'] == 'Pass' and x['pass_writing'] == 'Pass' else 'Fail', axis=1)

# Print the modified data frame
print(df.head())

# Compute each student's total and average score (same values as above),
# assign a grade with a chained conditional expression, and store it in a new column
df['总分'] = df['数学成绩'] + df['阅读成绩'] + df['写作成绩']
df['平均分'] = df['总分'] / 3
df['等级'] = df.apply(lambda x: '优秀' if x['平均分'] >= 90 else '良好' if x['平均分'] >= 80 else '中等' if x['平均分'] >= 70 else '及格' if x['平均分'] >= 60 else '不及格', axis=1)

# Print the modified data frame
print(df.head())

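import seaborn as sns
import matplotlib.pyplot as plt

# Use a font that can render the Chinese labels
plt.rcParams['font.sans-serif'] = ['SimHei']
# Count the students at each parental education level and draw a horizontal bar chart
edu_counts = df['父母教育程度'].value_counts()
# barh draws horizontal bars (plt.bar would draw vertical ones)
plt.barh(edu_counts.index, edu_counts.values)
plt.xlabel('人数')
plt.ylabel('父母教育程度')
plt.title('父母受教育程度的水平柱状图')
plt.show()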

# Count the students who pass and fail overall, and draw a pie chart
pass_counts = df['情况'].value_counts()

labels = ['及格', '不及格']
sizes = [pass_counts['Pass'], pass_counts['Fail']]

plt.pie(sizes,
        labels=labels,
        autopct='%1.1f%%')

plt.title('全体学生成绩分布饼图')
plt.axis('equal')
plt.show()

# Draw overlaid histograms of the math, reading, and writing scores
plt.hist(df['数学成绩'], bins=10, alpha=0.5, label='math score')
plt.hist(df['阅读成绩'], bins=10, alpha=0.5, label='reading score')
plt.hist(df['写作成绩'], bins=10, alpha=0.5, label='writing score')

plt.legend(loc='upper right')
plt.xlabel('成绩')
plt.ylabel('人数')
plt.title('各科成绩分布直方图')
plt.show()

# Draw a count plot of parents' education level, split by whether the prerequisite course was completed
sns.countplot(x='父母教育程度',
              data=df,
              hue='课程完成情况')
plt.title('父母受教育程度与前置课程是否完成统计分类图')
plt.show()

# Draw a box plot of total score by grade level, split by gender
sns.boxplot(x='等级',
            y='总分',
            hue='性别',
            data=df)
plt.title('成绩评级与性别分布箱线图')
plt.show()

# Draw a scatterplot of lunch standard against total score, colored by gender
sns.scatterplot(x='午餐',
                y='总分',
                hue='性别',
                data=df)
plt.title('午餐标准与总成绩的性别分类散点图')
plt.show()

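# Compute the correlation coefficients between the numeric features and draw a heat map.
# The text columns are excluded because df.corr() raises an error on them in recent pandas versions.
corr = df.select_dtypes(include='number').corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('各特征的相关热力图')
plt.show()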

Appendix: Data Files

The CSV data used in this case

Origin blog.csdn.net/m0_62338174/article/details/130460787