Python Comprehensive Case - Student Data Visualization

In recent years, data analysis and visualization have become important tools in many fields. In education, analyzing and visualizing data on student performance and behavior helps us understand how students are learning, identify problems, and improve teaching. This article walks through a comprehensive Python case that uses the Pandas and Seaborn libraries to clean, preprocess, and visualize student data, and explores the relationship between student performance and learning behavior. The case shows how Python is applied to data analysis and visualization, and offers one approach to educational data analysis.

1. Obtain data

To read the StudentPerformance.csv file, use the pandas library. Install pandas first with the following command:

pip install pandas

Once installed, the StudentPerformance.csv file can be read using the following code:

import pandas as pd

df = pd.read_csv('StudentPerformance.csv')

print(df.head())

Here, pd.read_csv('StudentPerformance.csv') reads the CSV file and returns a pandas DataFrame object, and df.head() returns the first 5 rows of the DataFrame.
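
Before going further, it can help to check what was actually loaded. The sketch below uses a tiny in-memory CSV with made-up rows in place of StudentPerformance.csv (the sample values are assumptions for illustration only) and prints the shape, column names, and dtypes:

```python
import io

import pandas as pd

# Made-up sample standing in for StudentPerformance.csv (illustration only).
sample_csv = io.StringIO(
    "gender,StageID,raisedhands,Class\n"
    "M,lowerlevel,15,M\n"
    "F,middleschool,70,H\n"
    "M,highschool,10,L\n"
)

df = pd.read_csv(sample_csv)

print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column names as read from the header row
print(df.dtypes)            # data type inferred for each column
```

With the real file, the same three calls show at a glance whether the header row and the numeric columns were parsed as expected.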

2. Rename the columns to Chinese

To change the table's column names to Chinese, use the rename() function from the pandas library:

import pandas as pd

df = pd.read_csv('StudentPerformance.csv')

# List of new column names (the last column, the grade level, becomes '成绩')
new_columns = ['性别', '民族', '出生地', '学段', '年级', '班级', '主题', '学期', '与家长关系', '举手次数', '上课用品查看次数', '公告查看次数', '参与讨论次数', '家长是否回答调查问卷', '家长对学校满意度', '学生缺勤天数', '成绩']

# Pair each original column name with its new name in a dictionary
rename_dict = dict(zip(df.columns, new_columns))

# Rename the columns with rename()
df.rename(columns=rename_dict, inplace=True)

print(df.head())

Here, the new_columns list stores the new Chinese column names, which can be adjusted as needed. The zip() function pairs each original column name with its new Chinese name, and rename() then applies the mapping.
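
One thing to watch with this approach: zip() stops at the shorter of its two inputs, so a new-name list of the wrong length leaves some columns unrenamed without any error, and a repeated name produces two columns with the same label. A minimal sketch of a guard, where build_rename_dict is a hypothetical helper (not part of pandas) and the three-column frame is made up:

```python
import pandas as pd


def build_rename_dict(old_names, new_names):
    """Pair old and new column names, refusing mismatched or duplicate names.

    Hypothetical helper for illustration; not part of pandas.
    """
    old_names = list(old_names)
    new_names = list(new_names)
    if len(old_names) != len(new_names):
        raise ValueError(
            f"expected {len(old_names)} new names, got {len(new_names)}"
        )
    if len(set(new_names)) != len(new_names):
        raise ValueError("new column names must be unique")
    return dict(zip(old_names, new_names))


# Made-up three-column frame standing in for the student data.
df = pd.DataFrame({"gender": ["M", "F"], "Topic": ["IT", "Math"], "Class": ["L", "H"]})

rename_dict = build_rename_dict(df.columns, ["性别", "主题", "成绩"])
df = df.rename(columns=rename_dict)  # returns a new frame instead of using inplace=True
print(df.columns.tolist())
```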

3. Display the values of semester and school stage

# Show the distinct values of the semester and school stage columns
print('Semester values:', df['学期'].unique())
print('School stage values:', df['学段'].unique())
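
unique() only lists the distinct values. If the frequencies matter too, value_counts() and nunique() are the usual companions. A small sketch on made-up values (the real columns come from StudentPerformance.csv):

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "学期": ["F", "S", "F", "F"],
    "学段": ["lowerlevel", "middleschool", "middleschool", "highschool"],
})

# unique() lists the distinct values in order of first appearance.
print(df["学期"].unique())

# value_counts() also reports how often each value occurs, most frequent first.
semester_counts = df["学期"].value_counts()
print(semester_counts)

# nunique() gives just the number of distinct values.
print(df["学段"].nunique())
```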

4. Modify data

Change lowerlevel, middleschool, and highschool in the school stage column to Chinese, change M and F in the gender column to Chinese, and change S and F in the semester column to spring and autumn.

To do this, use the replace() method to replace the three stage values with 小学, 初中, and 高中. Likewise, replace() can replace M and F with 男 and 女, and S and F with 春季 and 秋季. Here is sample code:

import pandas as pd

df = pd.read_csv('StudentPerformance.csv')

# Replace the school stage values
df.loc[:, 'StageID'] = df['StageID'].replace({'lowerlevel': '小学', 'middleschool': '初中', 'highschool': '高中'})

# Replace the gender values
df.loc[:, 'gender'] = df['gender'].replace({'M': '男', 'F': '女'})

# Replace the semester values
df.loc[:, 'Semester'] = df['Semester'].replace({'S': '春季', 'F': '秋季'})

print(df.head())

Here, we use loc[] to select each entire column and the replace() method to modify its values. In the printed DataFrame, lowerlevel, middleschool, and highschool in the StageID column are replaced by 小学, 初中, and 高中; M and F in the gender column are replaced by 男 and 女; and S and F in the Semester column are replaced by 春季 and 秋季.
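
replace() matches values exactly, so a value spelled differently from the dictionary key is left as it is, with no warning. The sketch below contrasts replace() with map() on made-up stage values, where 'MiddleSchool' is deliberately capitalized differently from the key to show the effect:

```python
import pandas as pd

# Made-up stage values; 'MiddleSchool' is deliberately spelled differently
# from the mapping key to show what happens to unmatched values.
stages = pd.Series(["lowerlevel", "MiddleSchool", "highschool"])
mapping = {"lowerlevel": "小学", "middleschool": "初中", "highschool": "高中"}

replaced = stages.replace(mapping)  # unmatched values are left unchanged
mapped = stages.map(mapping)        # unmatched values become NaN

print(replaced.tolist())
print(mapped.isna().tolist())

# Values the mapping did not cover, worth checking before plotting.
unmatched = sorted(set(stages) - set(mapping))
print(unmatched)
```

Comparing the mapping keys against the output of unique() from step 3 is a quick way to catch this kind of mismatch.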

5. Check for missing data

To check for missing data in a DataFrame, use the isnull() method to test whether each cell is empty. It returns a DataFrame of boolean values, where True indicates that the cell is empty and False that it is not. Chaining the sum() method then counts how many missing values there are in each column. Here is sample code:

df.isnull().sum()
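
To see what this returns, here is a small sketch on a made-up frame with two gaps. It also shows two closely related checks: the share of missing cells per column and the rows that contain a gap.

```python
import numpy as np
import pandas as pd

# Made-up frame with two gaps, for illustration only.
df = pd.DataFrame({
    "gender": ["M", "F", None, "F"],
    "raisedhands": [15, np.nan, 40, 70],
    "Class": ["L", "M", "M", "H"],
})

missing_counts = df.isnull().sum()            # number of missing cells per column
missing_share = df.isnull().mean()            # fraction of missing cells per column
rows_with_gaps = df[df.isnull().any(axis=1)]  # rows containing at least one gap

print(missing_counts)
print(missing_share)
print(rows_with_gaps)
```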

6. Draw a count histogram by grade

Here, we set plt.rcParams['font.sans-serif'] to SimHei so that it replaces the default font globally and Chinese text can be displayed. We then use plt.xlabel() and plt.ylabel() to label the x and y axes, plt.title() to add the chart title, and plt.show() to display the chart.

Here is the complete sample code:

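import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('StudentPerformance.csv')
# Count the students at each grade level; the 'Class' column holds the
# grades L, M, and H (the dataset has no 'TotalScore' column)
counts = df['Class'].value_counts().reindex(['L', 'M', 'H'])
plt.rcParams['font.sans-serif'] = ['SimHei']

plt.bar(counts.index, counts.values)
plt.xlabel('Grade')
plt.ylabel('Number')
plt.title('Distribution of Student Grades')
plt.show()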

Running the above code produces a count histogram by grade. The labels in this chart are written in English rather than Chinese.
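
SimHei is not installed on every machine, and when it is missing Chinese labels show up as empty boxes. A hedged sketch of one way to handle this: look through the fonts Matplotlib has found and use the first available one from a candidate list. The candidate names below are common CJK fonts and are an assumption, not a complete list.

```python
import matplotlib.pyplot as plt
from matplotlib import font_manager

# Candidate CJK-capable fonts; which ones exist depends on the machine.
candidates = ["SimHei", "Microsoft YaHei", "PingFang SC", "Noto Sans CJK SC"]

# Names of all fonts Matplotlib has discovered on this system.
installed = {font.name for font in font_manager.fontManager.ttflist}
chosen = next((name for name in candidates if name in installed), None)

if chosen is not None:
    # Put the chosen font first so Chinese labels render instead of empty boxes.
    plt.rcParams["font.sans-serif"] = [chosen] + list(plt.rcParams["font.sans-serif"])
    # Use the ASCII hyphen for negative numbers, since some CJK fonts lack the Unicode minus.
    plt.rcParams["axes.unicode_minus"] = False

print("Using font:", chosen)
```

If chosen is None, no candidate was found, and keeping the labels in English is the simplest fallback.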

7. Draw a count histogram by gender

Here is the code to draw a count histogram from the StudentPerformance.csv file, grouped by gender:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Read the dataset
df = pd.read_csv('StudentPerformance.csv')

# Draw the count histogram
sns.countplot(data=df, x='gender')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.title('Number of students by gender')

# Show the figure
plt.show()

In the above code, we use the countplot() function of the Seaborn library to plot the count histogram and set the x parameter to gender to group the students by gender. We also add axis labels and a title to make the chart easier to understand.

Note that the dataset file StudentPerformance.csv must be in the current working directory; otherwise, specify the correct file path.
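
A wrong path is the most common reason these snippets fail, so it can be worth failing early with a message that shows where Python is looking. In the sketch below, load_students is a hypothetical helper, and a throwaway file with made-up rows stands in for the real dataset:

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_students(csv_path):
    """Read the CSV, failing early with a clear message if the path is wrong.

    Hypothetical helper for illustration.
    """
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; current working directory is {Path.cwd()}"
        )
    return pd.read_csv(path)


# Demonstrate with a throwaway file holding made-up rows.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "StudentPerformance.csv"
    csv_path.write_text("gender,Class\nM,L\nF,H\n", encoding="utf-8")
    df = load_students(csv_path)

print(df)
```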

8. Draw a count histogram by subject

Use Seaborn's countplot() function to draw a count histogram grouped by subject. Here is sample code:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Read the dataset
df = pd.read_csv('StudentPerformance.csv')

# Draw the count histogram
sns.countplot(data=df, x='Topic')
plt.xlabel('Subject')
plt.ylabel('Count')
plt.title('Number of students by subject')

# Show the figure
plt.show()

In the above code, we use the countplot() function of the Seaborn library to plot the count histogram and set the x parameter to Topic to group the students by subject. We also add axis labels and a title to make the chart easier to understand.

If the subject names on the x-axis overlap, we can also rotate the tick labels by adding the following before plt.show():

# Set the x-axis label rotation angle and alignment
plt.xticks(rotation=45, ha='right')

9. Draw a count histogram of different grades by subject

import pandas as pd
import matplotlib.pyplot as plt

# Read the dataset
df = pd.read_csv('StudentPerformance.csv')

# Select the subject and grade columns
cols = ['Topic', 'Class']
topic_cols = df[cols]

# Group by subject and grade, then count the students in each group
counts_df = topic_cols.groupby(['Topic', 'Class']).size().reset_index(name='counts')

# Create a grid of subplots with 4 rows and 3 columns
fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(12, 12))

# List of all subjects
subjects = ['IT', 'French', 'Arabic', 'Science', 'English', 'Biology', 'Spanish', 'Chemistry', 'Geology', 'Quran', 'Math', 'History']

# Draw the count histogram for each subject in its own subplot
for i, subject in enumerate(subjects):
    # Get the data for the current subject
    subject_df = counts_df[counts_df['Topic'] == subject]

    # Compute the row and column index of the subplot
    row_index = i // 3
    col_index = i % 3

    # Draw the count histogram in the current subplot
    axes[row_index, col_index].bar(subject_df['Class'], subject_df['counts'], width=0.5)
    axes[row_index, col_index].set_xlabel('class')
    axes[row_index, col_index].set_ylabel('count')
    axes[row_index, col_index].set_title(f'{subject} 不同成绩的人数分布')

# Adjust the spacing between subplots and the outer margins
plt.subplots_adjust(wspace=0.3, hspace=0.5, top=0.95, bottom=0.05, left=0.05, right=0.95)

# Show the figure
plt.show()

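If the Chinese subplot titles show up as empty boxes, the cause is the font: set plt.rcParams['font.sans-serif'] = ['SimHei'] before plotting.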
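
The same subject-by-grade counts can also be produced as a single table, which is handy for checking the numbers behind the twelve subplots. A minimal sketch using pd.crosstab() on made-up rows:

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Topic": ["IT", "IT", "Math", "Math", "Math", "French"],
    "Class": ["L", "M", "H", "H", "M", "L"],
})

# Rows are subjects, columns are grade levels, cells are student counts.
counts = pd.crosstab(df["Topic"], df["Class"])

# Put the grade levels in L, M, H order; fill_value covers absent levels.
counts = counts.reindex(columns=["L", "M", "H"], fill_value=0)

print(counts)
```

Unlike the groupby().size() result, this table includes a 0 for subject and grade combinations that have no students.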

10. Draw a histogram of counts by gender and grades

Use Seaborn's countplot() function to plot a count histogram grouped by gender and grade. Here is sample code:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

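# Read the dataset
df = pd.read_csv('StudentPerformance.csv')

# Draw the count histogram; the grade levels (L, M, H) are in the 'Class' column
sns.countplot(data=df, x='gender', hue='Class', hue_order=['L', 'M', 'H'])
plt.xlabel('Gender')
plt.ylabel('Count')
plt.title('Number of students by gender and grade')

# Set the legend position
plt.legend(loc='upper right')

# Show the figure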
plt.show()

In the above code, we use the countplot() function of the Seaborn library to draw the count histogram, setting the x parameter to gender and the hue parameter to the grade column so that the students are grouped by gender and grade. We also add axis labels and a title to make the chart easier to understand.

Additionally, we use Matplotlib's legend() function to set the legend position. Setting the loc (location) parameter to upper right places the legend in the upper right corner.

10. View the distribution ratio of grades by class

Use the countplot() function to achieve this task. It accepts a dataset and several keyword parameters that specify what goes on the x-axis, what goes on the y-axis, how bars are colored, and so on.

The following complete code uses sns.countplot() to display how many students in each class have the L, M, and H grades:

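import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Read the dataset
df = pd.read_csv('StudentPerformance.csv')

# Plot the number of students at each grade level in every class.
# The freshly read CSV still has its original column names, assumed here to be
# 'SectionID' (班级) and 'Class' (成绩); adjust them if your file differs.
sns.countplot(x='SectionID', hue='Class', hue_order=['L', 'M', 'H'], data=df)

# Show the figure
plt.show()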
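
The chart above shows counts. Since this step is about the distribution ratio, it may also help to compute the share of each grade within a class directly. A sketch on made-up rows, where the column names 'SectionID' (class) and 'Class' (grade) are assumptions about the CSV:

```python
import pandas as pd

# Made-up rows; 'SectionID' (class) and 'Class' (grade level) are assumed
# column names.
df = pd.DataFrame({
    "SectionID": ["A", "A", "A", "A", "B", "B"],
    "Class": ["L", "M", "M", "H", "L", "H"],
})

# normalize='index' divides each row by its total, so every class sums to 1.
ratios = pd.crosstab(df["SectionID"], df["Class"], normalize="index")
ratios = ratios.reindex(columns=["L", "M", "H"], fill_value=0)

print(ratios.round(2))
```

Ratios make classes of different sizes comparable, which raw counts do not.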

11. Analyze the correlation between the four behaviors (the number of courseware views, the number of announcement views, the number of hands raised, and the number of discussions) and grades

To analyze how the four behavior indicators relate to grades, we use the barplot() function to display the difference in each indicator across the grade levels (L, M, H). This can be done by computing the average of the four indicators within each grade level. The barplot() function accepts a dataset and several keyword parameters that specify what goes on the x-axis, what goes on the y-axis, how bars are colored, and so on.

Here is the full code, which uses sns.barplot() to show the difference in the four indicators across grade levels:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

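# Read the dataset
df = pd.read_csv('StudentPerformance.csv')

# Average the four behavior indicators within each grade level.
# The columns are selected before mean() so that text columns are not averaged;
# the original CSV column names are used because the file was just re-read.
behavior_cols = ['VisITedResources', 'AnnouncementsView', 'raisedhands', 'Discussion']
mean_df = df.groupby('Class')[behavior_cols].mean().reset_index()

# Draw one bar plot per indicator against the grade level
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
sns.barplot(x='Class', y='VisITedResources', data=mean_df, order=['L', 'M', 'H'], ax=axes[0, 0])
sns.barplot(x='Class', y='AnnouncementsView', data=mean_df, order=['L', 'M', 'H'], ax=axes[0, 1])
sns.barplot(x='Class', y='raisedhands', data=mean_df, order=['L', 'M', 'H'], ax=axes[1, 0])
sns.barplot(x='Class', y='Discussion', data=mean_df, order=['L', 'M', 'H'], ax=axes[1, 1])

# Show the figure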
plt.show()
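
To make the averaging step concrete, here is a small worked sketch on made-up numbers, using two of the four indicators. It prints the table of per-grade means that the bar plots are drawn from:

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Class": ["L", "L", "M", "M", "H", "H"],
    "raisedhands": [10, 20, 40, 60, 80, 90],
    "Discussion": [5, 15, 30, 50, 70, 90],
})

behavior_cols = ["raisedhands", "Discussion"]

# Select the numeric columns before aggregating so text columns are never averaged.
mean_df = df.groupby("Class")[behavior_cols].mean()

# Order the rows from low to high grade.
mean_df = mean_df.reindex(["L", "M", "H"])

print(mean_df)
```

sns.barplot() also averages by default when it is given the raw rows, and in that case it additionally draws an error bar for each mean.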

12. Analyze the discussions of students with different grades

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Read the dataset
df = pd.read_csv('StudentPerformance.csv')

# Draw scatterplots with regression fit lines, one panel per grade level
sns.lmplot(x='Discussion', y='raisedhands', data=df, hue='Class', col='Class', col_wrap=3)

# Show the figure
plt.show()

In this example, we use the lmplot() function to plot a scatterplot of the number of discussions (Discussion) against the number of hands raised (raisedhands), and use color coding to indicate students with different grades. By setting the col and col_wrap parameters, we split the image into three subplots, one per grade level.
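
The line in each panel is, by default, an ordinary least-squares straight line fitted to that panel's points. To read off the slope and intercept as numbers, a comparable fit can be sketched with np.polyfit() on made-up data (this reproduces the idea, not Seaborn's internal code):

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Class": ["L", "L", "L", "H", "H", "H"],
    "Discussion": [10, 20, 30, 40, 60, 80],
    "raisedhands": [5, 10, 15, 50, 70, 90],
})

# Fit raisedhands = slope * Discussion + intercept separately for each grade level.
fits = {}
for grade, group in df.groupby("Class"):
    slope, intercept = np.polyfit(group["Discussion"], group["raisedhands"], deg=1)
    fits[grade] = (slope, intercept)
    print(f"{grade}: slope={slope:.2f}, intercept={intercept:.2f}")
```

A steeper slope in one grade level means that, in that group, each additional discussion goes with a larger increase in hands raised.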

13. Analyze the correlation between the number of hands raised and the number of discussions

You can use a scatterplot to analyze the correlation between the number of hands raised and the number of discussions attended. Here are some possible steps:

  1. Read the dataset and select the two features to be analyzed (number of hands raised and number of discussions participated).
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Read the dataset
df = pd.read_csv('StudentPerformance.csv')

# Select the two features to analyze
hand_raised = df['raisedhands']
discussions = df['Discussion']
  2. Use the sns.scatterplot() function to draw a scatterplot, and the sns.regplot() function to add a regression fit line.
# Draw the scatterplot
sns.scatterplot(x=hand_raised, y=discussions)

# Add the regression fit line
sns.regplot(x=hand_raised, y=discussions, scatter=False)

# Show the figure
plt.show()
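
The scatterplot shows the trend visually. To put a number on it, the Pearson correlation coefficient can be computed with Series.corr(). A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "raisedhands": [10, 20, 30, 40, 50],
    "Discussion": [12, 18, 33, 41, 46],
})

# Pearson correlation: +1 is a perfect rising line, 0 no linear trend,
# -1 a perfect falling line. method='pearson' is the default.
r = df["raisedhands"].corr(df["Discussion"])

print(f"Pearson r = {r:.3f}")
```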

14. Analyze the correlation between the number of courseware browsing, the number of hands raised, the number of browsing announcements, and the number of discussions, write out the correlation matrix, and visualize it.

To analyze and visualize the correlation between these four features, first use the corr() function to calculate the correlation coefficient matrix between them, and then use Seaborn's heatmap() function to draw the heat map.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Read the dataset
df = pd.read_csv('StudentPerformance.csv')

# Select the four features to analyze
features = ['VisITedResources', 'raisedhands', 'AnnouncementsView', 'Discussion']
data = df[features]

# Compute the correlation coefficient matrix
corr_matrix = data.corr()

# Draw the heat map
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')

# Set the title and axis labels
plt.title('Correlation Matrix')
plt.xlabel('Features')
plt.ylabel('Features')

# Show the figure
plt.show()

This code selects four features in the dataset and uses the corr() function to calculate the correlation coefficient matrix between them. It then uses the heatmap() function to draw the heat map, sets the title and axis labels, and finally calls show() to display the image.

For these four features (VisITedResources, raisedhands, AnnouncementsView, Discussion), the correlation between them is relatively strong. The correlation between VisITedResources and raisedhands is the highest, followed by the correlation between VisITedResources and AnnouncementsView. The correlation between Discussion and the other three features is relatively low.

This conclusion can help us better understand the relationship between these features, and provide a reference for further analysis and prediction of students' learning performance. For example, if a student frequently participates in discussions in class but never browses resources or checks announcements, more guidance and support may be needed to help them more fully grasp the course content.
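
Rankings like the one above can be read off the heat map, but they can also be extracted from the matrix in code. The sketch below, on made-up values with three of the feature names, lists every pair of features once and sorts the pairs by the strength of their correlation:

```python
from itertools import combinations

import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "VisITedResources": [10, 20, 30, 40, 50],
    "raisedhands": [12, 22, 29, 41, 52],
    "Discussion": [50, 10, 40, 20, 30],
})

corr_matrix = df.corr()

# Collect each pair of distinct features once (the matrix is symmetric).
pairs = pd.Series({
    (a, b): corr_matrix.loc[a, b]
    for a, b in combinations(corr_matrix.columns, 2)
})

# Rank by absolute value so strong negative correlations are not overlooked.
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)

print(pairs)
strongest_pair = pairs.index[0]
print("Strongest pair:", strongest_pair)
```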

Summary

This comprehensive Python case explores the relationship between student performance and learning behavior through visual analysis of student data. We used Pandas to process the data, and Matplotlib and Seaborn to draw the charts. Specifically, we did the following:

  1. Read the student dataset: Use the read_csv() function to read the data file in CSV format and store the data in a DataFrame object.
  2. Data cleaning and preprocessing: Prepare the data for analysis and visualization, for example by renaming columns, replacing coded values, and checking for missing values.
  3. Data visualization: Draw multiple charts to show the relationship between student performance and learning behavior, including:
    • Count histograms by grade, gender, and subject: Show how the students are distributed across each category.
    • Per-subject grade charts: Show the number of students at each grade level in every subject.
    • Bar charts of the four learning behaviors by grade: Show how the average of each behavior differs across the L, M, and H grade levels.
    • Scatterplots with regression lines: Show how the number of hands raised relates to the number of discussions.
    • Correlation coefficient heat map of the four learning behavior characteristics: Shows the magnitude and direction of the correlation between the four learning behavior characteristics.

Through the above work, we can more intuitively understand the relationship between students' learning performance and learning behavior, and also provide a reference for subsequent data analysis and prediction.

Origin blog.csdn.net/m0_62338174/article/details/130460281