Data analysis, chapter 5 after-class training: visual analysis with the Matplotlib, seaborn, and pyecharts libraries (answer to task 3)

Experiment name: Visual analysis with the Matplotlib, seaborn, and pyecharts libraries

Experiment time: 2023.5.10

1. Purpose of the experiment

1. Master the basic syntax of pyplot.
2. Master how to draw pie charts.
3. Master how to draw box plots.
4. Master how to draw subplots.
5. Master how to draw column charts.
6. Master the use of related functions in the NumPy library.
7. Master how to draw categorical scatter plots.
8. Master how to draw linear regression fit plots.
9. Master how to draw heat maps.
10. Master how to draw funnel charts.
11. Master how to draw word clouds.

2. Experimental equipment or materials

Laptop, Anaconda software

3. Experimental principle

1. Requirement description
After the final exam, the school collects the students' final exam results together with other characteristics and saves them as a student grade table (student grade.xlsx). The table has 7 features: gender, lunch, exam course preparation, mathematics score, reading score, writing score, and total score. Part of the data is shown in Table 5-40. To understand how the total scores are distributed, they are divided into four grades, "fail", "pass", "good", and "excellent", according to the ranges 0~150, 150~200, 200~250, and 250~300. Check the proportion of students in each range, and check the dispersion of the three single-subject scores by drawing a box plot.

2. Requirement description
To find out whether the three features, parents' education level, lunch, and exam course preparation, are related to the total score, use the data from training 1 to compute the average total score of the students for each value of the three features. Draw a line chart to examine the relationship between parents' education level and the total score, draw column charts to examine the relationship between lunch, exam course preparation, and the total score, and analyze the results.
3. Requirement description
The Air Quality Index (AQI) is a figure that describes air quality quantitatively. Air quality reflects the degree of air pollution, which is judged from the concentration of pollutants in the air. Air pollution is a complex phenomenon, and the concentration of air pollutants is affected by many factors.
Part of the AQI data of a certain city from January to September 2020 is shown in Table 5-41.

This training draws a categorical scatter plot and a regression fit plot from the data shown in Table 5-41, and analyzes the relationship between PM2.5 concentration and AQI as well as the distribution of AQI across quality levels. It also draws a heat map to analyze the correlation between the air quality indicators and AQI.
4. Requirement description
A shopping mall has placed 5 vending machines in different locations, numbered A, B, C, D, and E, and recorded the sales data of each vending machine in June 2017. To understand how each commodity sells, the sales are grouped by second-level category, the sales of the top 5 commodity categories are counted, and a funnel chart is drawn. A word cloud is also drawn from the sales volume and commodity names.

4. Experimental content and steps

Task 1: Analyze the distribution and dispersion of student test score characteristics.

1. Implementation ideas and steps
(1) Use the pandas library to read student test score data.
(2) Divide the total student test scores into 4 intervals, calculate the number of students in each interval, and draw a pie chart of the distribution of the total student test scores.
(3) Extract the students' three single-subject scores, and draw a box plot of the dispersion of the students' test scores.
(4) Analyze the distribution of students' total test scores and the dispersion of three single-subject scores.
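Step (2) can be sketched with NumPy as follows. This is only an illustration, not the original solution: the function name `count_score_intervals` and the sample scores are made up, and it assumes that each interval includes its upper bound (so 150 counts as "fail" and 200 as "pass") and that all scores lie between 0 and 300.

```python
import numpy as np


def count_score_intervals(total_scores):
    """Count scores in [0, 150], (150, 200], (200, 250], (250, 300]."""
    # Inner boundaries between the four grades (assumed right-inclusive).
    inner_edges = [150, 200, 250]
    # With right=True, digitize returns i such that edges[i-1] < x <= edges[i].
    interval_index = np.digitize(total_scores, inner_edges, right=True)
    # Tally how many scores landed in each of the 4 intervals.
    return np.bincount(interval_index, minlength=4)


# Hypothetical sample scores, chosen to sit on and around the boundaries.
sample_scores = np.array([0, 150, 151, 200, 201, 250, 251, 300])
counts = count_score_intervals(sample_scores)
shares = counts / counts.sum() * 100  # percentage per interval, as shown in a pie chart
```

The resulting `counts` array can be passed directly to `plt.pie`, which normalizes the values into proportions itself.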

Task 2: Analyze the relationship between student test scores and various characteristics.

2. Implementation ideas and steps
(1) Create a canvas and add subplots.
(2) Use the mean function in the NumPy library to compute the average total score of the students for each value of the three features: parents' education level, lunch, and exam course preparation.
(3) Draw the corresponding line chart or column chart on each subplot.
(4) Analyze the relationship between the three characteristics and the total test score.
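Step (2) amounts to a grouped mean. One possible way to write it with boolean masks and `np.mean` is sketched below; the helper name `mean_by_group` and the sample data are hypothetical and only show the idea.

```python
import numpy as np


def mean_by_group(labels, scores):
    """Return {group label: mean score rounded to 2 decimals}."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    result = {}
    for group in np.unique(labels):  # unique labels, in sorted order
        mask = labels == group       # True for the rows that belong to this group
        result[str(group)] = round(float(np.mean(scores[mask])), 2)
    return result


# Hypothetical lunch labels and total scores (not taken from the data set).
lunch = ['standard', 'reduced', 'standard', 'reduced']
total = [200, 150, 220, 170]
lunch_means = mean_by_group(lunch, total)
```

The same helper could be applied to the parents' education level and exam preparation columns, so one function covers all three features.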

Task 3: Analyze the correlation between various air quality indexes.

3. Implementation ideas and steps
(1) Use pandas to read the city's AQI statistics for January to September 2020.
(2) Solve the Chinese display problem by setting the font to SimHei, and solve the problem that the minus sign "-" is displayed as a square when saving the image.
(3) Draw a categorical scatter plot of the air quality levels.
(4) Draw the linear regression fit plot of PM2.5 concentration and AQI.
(5) Calculate the correlation coefficient.
(6) Draw a heat map of the correlation of air quality characteristics.
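The fit in step (4) is an ordinary least-squares straight line (seaborn's `regplot` fits a first-order regression by default). A minimal sketch of the same computation with `np.polyfit` is shown below; `fit_line` is a hypothetical helper and the readings are invented, not taken from Table 5-41.

```python
import numpy as np


def fit_line(x, y):
    """Least-squares straight line y = slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, deg=1)  # degree 1 = straight line
    return float(slope), float(intercept)


# Made-up readings that lie exactly on the line aqi = 2 * pm25 + 5.
pm25 = np.array([10.0, 20.0, 30.0, 40.0])
aqi = np.array([25.0, 45.0, 65.0, 85.0])
slope, intercept = fit_line(pm25, aqi)
```

With real data the points scatter around the line, and the slope estimates how much AQI changes per unit of PM2.5 concentration.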

Task 4: Drawing Interactive Basic Graphics

4. Implementation ideas and steps
(1) Obtain commodity sales data.
(2) Compute the sales of each commodity category, grouped by second-level category.
(3) Count the sales volume of each commodity.
(4) Set the series options and global options, and draw a funnel chart of the top 5 commodity categories by sales.
(5) Set the series options and global options, and draw a word cloud of commodity sales volume and commodity name.
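Steps (2) to (4) first need the per-category totals and the top 5 of them. The sketch below shows that aggregation with the standard library only; the function `top_categories`, the category names, and the quantities are all hypothetical.

```python
from collections import Counter


def top_categories(records, n=5):
    """Sum the quantity per second-level category and return the n largest."""
    totals = Counter()
    for category, quantity in records:
        totals[category] += quantity
    # most_common returns (category, total) pairs, largest total first.
    return totals.most_common(n)


# Hypothetical (second-level category, quantity sold) records.
sales_records = [
    ('beverages', 3), ('dairy', 5), ('biscuits', 4), ('beverages', 4),
    ('snacks', 2), ('dairy', 1), ('noodles', 3), ('candy', 1),
]
top5 = top_categories(sales_records)
```

A list of (name, value) pairs like `top5` is the kind of input usually handed to the `add` method of pyecharts charts such as `Funnel` and `WordCloud`; the exact options should be checked against the pyecharts documentation.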

5. Experimental results and analysis

Task one:


This pie chart shows the distribution of the students' total test scores across the 4 intervals. The 150~200 interval ("pass") holds the largest share of students, 34.3%. Since no other interval is larger, at least about two thirds of the students scored at the "pass" level or above, which indicates that the overall performance of this class is good.
A box plot is a statistical graphic that shows the distribution of a set of data. A box plot contains the following:

  1. Upper edge (Max): the largest value in the data. In Matplotlib's default box plot it is the largest value no more than 1.5 box heights above Q3; points beyond it are drawn as outliers.
  2. Lower edge (Min): the smallest value in the data, with the same rule applied below Q1.
  3. Median: the middle value of the data after sorting by size.
  4. Upper quartile (Q3): the value below which three quarters of the sorted data fall; it is the top of the box.
  5. Lower quartile (Q1): the value below which one quarter of the sorted data falls; it is the bottom of the box.
  6. Box: the span between the lower and upper quartiles. Its height is the interquartile range (IQR = Q3 - Q1).
  7. Whiskers: the lines that extend from the box to the upper and lower edges.
    The role of the box plot:
  1. It displays the distribution of the data visually, including the extreme values, the median, and the quartiles, so that the data can be understood more intuitively.
  2. Outliers can be identified and dealt with quickly; in some cases outliers have a dramatic impact on data analysis.
  3. It can compare the distributions of different groups of data, which is important in data analysis and decision-making.
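The quantities listed above can be computed directly, which helps when reading a box plot. The sketch below assumes the common 1.5 x IQR rule for whiskers (Matplotlib's default); `box_plot_stats` and the sample data are hypothetical.

```python
import numpy as np


def box_plot_stats(data, whis=1.5):
    """Quartiles, whisker ends, and outliers under the whis * IQR rule."""
    data = np.sort(np.asarray(data, dtype=float))
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1                  # height of the box
    low_fence = q1 - whis * iqr    # points below this are outliers
    high_fence = q3 + whis * iqr   # points above this are outliers
    inside = data[(data >= low_fence) & (data <= high_fence)]
    outliers = data[(data < low_fence) | (data > high_fence)]
    return {
        'q1': float(q1), 'median': float(median), 'q3': float(q3),
        'iqr': float(iqr),
        'whisker_low': float(inside.min()),    # lower edge
        'whisker_high': float(inside.max()),   # upper edge
        'outliers': outliers.tolist(),
    }


# Made-up scores with one obvious outlier.
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 100])
```

Here the value 100 lies above the upper fence (7 + 1.5 x 4 = 13), so it is reported as an outlier and the upper whisker stops at 8.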

Task two:


According to the line graph, it can be seen that there are differences in the average total test scores of students under different parental education levels. Children of parents with a master's degree had the highest average overall test score of 220.8, while children of parents who did not graduate from high school had the lowest average overall test score of 189.29. This suggests that the level of education of parents has a certain impact on children's academic performance.
In addition, it can be seen that children of parents with bachelor's and associate's degrees have relatively higher average total test scores, while children of parents who did not graduate from college have lower average total test scores. This may be because parents with bachelor's and associate's degrees have higher expectations and better educational resources for their children's education, while parents without college degrees may not provide adequate support and resources.
In short, parents' education level has a certain impact on children's academic performance, but it is not the only factor. There are other factors such as personal talent and learning attitude that will also affect academic performance.

Task three:


It can be seen that the air quality index (AQI) differs from one time point to another. The highest AQI value is 203, which falls in the severely polluted range, and the lowest is 22, which is excellent. Most AQI values are low and a few are high, that is, the air quality is good most of the time, with occasional periods of poor air quality.
In addition, the AQI value corresponds to the air quality level. When the AQI is between 0 and 50, the air quality level is excellent; between 51 and 100, good; between 101 and 150, lightly polluted; between 151 and 200, moderately polluted; between 201 and 300, severely polluted. The air quality level at a given time can therefore be judged from the AQI value.
In a word, air quality index (AQI) is an important index to measure air quality, and the change of its value reflects the change of air quality, which has an important impact on people's health and life.
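The correspondence between AQI values and air quality levels can be written as a small lookup. This sketch assumes the bands 0-50, 51-100, 101-150, 151-200, and 201-300 (the moderately polluted band is taken to be 151-200); the function name `aqi_level` and the label for values above 300 are assumptions for illustration.

```python
from bisect import bisect_left

# Upper bound of each band and the matching level name.
AQI_UPPER_BOUNDS = [50, 100, 150, 200, 300]
AQI_LEVELS = [
    'excellent', 'good', 'lightly polluted',
    'moderately polluted', 'severely polluted',
    'above 300 (outside the bands listed here)',  # assumed catch-all label
]


def aqi_level(aqi):
    """Map an AQI value to its air quality level."""
    if aqi < 0:
        raise ValueError('AQI cannot be negative')
    # bisect_left finds the first band whose upper bound is >= aqi.
    return AQI_LEVELS[bisect_left(AQI_UPPER_BOUNDS, aqi)]


highest = aqi_level(203)  # the highest value mentioned in the analysis
lowest = aqi_level(22)    # the lowest value mentioned in the analysis
```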


It can be seen that the AQI value and the PM2.5 content correspond to each other. PM2.5 refers to particulate matter in the air with a diameter less than or equal to 2.5 microns, which is one of the main components of air pollution. The data show that the AQI value and the PM2.5 content move together most of the time, that is, both tend to be low or both tend to be high. This shows that the PM2.5 content is one of the important factors affecting the AQI value.
In addition, it can be seen that both the AQI value and the PM2.5 content have large fluctuations, that is, there are large differences between the AQI value and the PM2.5 content at different time points. This may be due to different sources of air pollution, changes in meteorological conditions and other factors.
In a word, AQI value and PM2.5 content are two important indicators to reflect air quality, and their changes reflect changes in air quality, which have an important impact on people's health and life.

The correlation analysis of air quality characteristics can provide us with the relationship between different pollutant indicators, which is helpful to identify the main pollution sources and formulate specific environmental science policies. We can use the Pearson correlation coefficient to measure the linear relationship between variables, which ranges from -1 to +1, where 0 means no linear relationship, -1 means a perfect negative correlation, and +1 means a perfect positive correlation.
As can be seen from the figure above, the following data are correlated:
PM2.5 content and PM10 content: the correlation coefficient is 0.97, and the two are strongly positively correlated.
NO2 content and PM2.5 content, PM10 content: the correlation coefficients are 0.79 and 0.78 respectively, and NO2 content is positively correlated with PM2.5 content and PM10 content.
O3_8h content and AQI: The correlation coefficient is 0.6, and AQI is positively correlated with O3_8h content.
It should be noted that the correlation coefficient can only reflect the linear relationship between variables, but cannot represent the nonlinear relationship, so other factors need to be considered comprehensively in practical applications.
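To make the meaning of the coefficient concrete, the sketch below computes the Pearson correlation from its definition and shows a case where a strong nonlinear relationship gives a coefficient of zero. The helper `pearson_r` and the numbers are illustrative only; they are not taken from the AQI data.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))


x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
r_linear = pearson_r(x, 3 * x + 1)   # perfect positive linear relationship
r_negative = pearson_r(x, -x)        # perfect negative linear relationship
r_quadratic = pearson_r(x, x ** 2)   # y depends fully on x, but not linearly
r_numpy = float(np.corrcoef(x, 3 * x + 1)[0, 1])  # NumPy's built-in, for comparison
```

The quadratic case is the reason for the caution above: `y = x ** 2` is completely determined by `x`, yet its Pearson coefficient on this symmetric sample is zero.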

Task four:


The funnel chart shows the best-selling categories: dairy products, beverages, jerky, biscuits, and functional drinks. These are everyday items, so the mall can stock more of them in the vending machines.


It can be seen from the picture above that C'estbon purified water is the best-selling product, followed by soy milk, hot dog sausage and shrimp crackers.
A word cloud is a chart that visualizes text data: words are drawn according to their frequency, and font size or color indicates their importance. Word clouds are usually applied to large volumes of text to surface the key information or themes quickly. They are widely used in sentiment analysis, public opinion analysis, market research, news reporting, and other fields.

6. Conclusion and experience

As common libraries for visual analysis, matplotlib, seaborn and pyecharts provide powerful, flexible and easy-to-use tools, support the generation of various chart types, and help us better understand the results of data analysis.
In practice, matplotlib is one of the most commonly used plotting libraries. It offers a rich set of visualization functions and supports many chart types and custom styles, so it meets most plotting needs. For statistical plots with less code and more polished defaults, seaborn is a good first choice. It is built on top of matplotlib and provides higher-level plotting functions that make data analysis and visualization easier and more elegant. For interactive scenarios, consider pyecharts, a Python visualization library based on ECharts that generates dynamic, interactive charts and makes data display more vivid and intuitive.
In general, the choice of visual analysis library should depend on the data type and visualization requirements. There are many types of graphics used in data analysis, such as bar charts, line charts, pie charts, scatter plots, heat maps, and so on. To accurately characterize the data and draw meaningful conclusions and inferences, it is necessary to choose the correct visualization tools and graphics types. In addition, beautiful visualization effects will make telling and sharing the results of data analysis more attractive and convincing. Therefore, proficiency in the use of visual analysis libraries is an essential skill to improve analysts' work efficiency and data analysis quality.

Task 1 code:
import os

import numpy as np
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = 'SimHei'  # use SimHei so Chinese labels render
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
os.makedirs('./tmp', exist_ok=True)  # make sure the folder used by savefig exists
data = np.load(r'./data/student_grade.npz', encoding='ASCII',
               allow_pickle=True)
columns = data['arr_0']
values = data['arr_1']


# Score variables (the last four columns are math, reading, writing, total)
sum_score = values[:, -1]
math_grade = values[:, -4]
reading_grade = values[:, -3]
writing_grade = values[:, -2]
all_grade = values[:, -1]
student_id = np.arange(len(values))

# Count the students in each total-score interval
grade_0_150 = 0
grade_150_200 = 0
grade_200_250 = 0
grade_250_300 = 0

for i in range(len(values)):
    if 0 <= values[i, -1] <= 150:  # include a total score of exactly 0
        grade_0_150 += 1
    elif 150 < values[i, -1] <= 200:
        grade_150_200 += 1
    elif 200 < values[i, -1] <= 250:
        grade_200_250 += 1
    elif 250 < values[i, -1] <= 300:
        grade_250_300 += 1

all_stu_grade = [grade_0_150, grade_150_200, grade_200_250, grade_250_300]

# Pie chart
# Draw the pie chart of the overall distribution of total test scores
p = plt.figure(figsize=(9, 9))  # set up the canvas
label = ['不及格', '及格', '良好', '优秀']  # fail, pass, good, excellent
explode = [0.01, 0.01, 0.01, 0.01]  # offset of each wedge, as a fraction of the radius
plt.pie(all_stu_grade, explode=explode, labels=label,
        autopct='%1.1f%%', textprops={'fontsize': 15})  # draw the pie chart
plt.title('学生考试总成绩的总体分布情况饼图', fontsize=20)
plt.savefig('./tmp/学生考试总成绩的总体分布情况饼图.png')
plt.show()

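# Box plots
# Draw the box plot of the dispersion of total test scores
p = plt.figure(figsize=(12, 8))
label = ['总成绩']  # total score
box_data = np.asarray(sum_score, dtype=float)  # cast the object column to float
# showmeans=True is needed for meanline=True to have any effect.
# Matplotlib 3.9+ prefers tick_labels= over labels=.
plt.boxplot(box_data, notch=True, labels=label,
            showmeans=True, meanline=True)  # draw the box plot
plt.xlabel('学生考试科目')
plt.ylabel('学生考试总分数')
plt.title('学生考试总成绩的总体分散情况箱线图', fontsize=20)
plt.savefig('./tmp/学生考试总成绩的总体分散情况箱线图.png')
plt.show()


# Draw the box plot of the dispersion of the three single-subject scores
p = plt.figure(figsize=(12, 8))
label = ['数学成绩', '阅读成绩', '写作成绩']  # math, reading, writing
box_data = [np.asarray(math_grade, dtype=float),
            np.asarray(reading_grade, dtype=float),
            np.asarray(writing_grade, dtype=float)]
plt.boxplot(box_data, notch=True, labels=label,
            showmeans=True, meanline=True)  # draw the box plot
plt.xlabel('学生考试科目')
plt.ylabel('学生考试分数')
plt.title('学生各项考试成绩的总体分散情况箱线图', fontsize=20)
plt.savefig('./tmp/学生各项考试成绩的总体分散情况箱线图.png')
plt.show()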

# Task 2
import os

import numpy as np
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = 'SimHei'  # use SimHei so Chinese labels render
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
os.makedirs('./tmp', exist_ok=True)  # make sure the folder used by savefig exists
data = np.load(r'./data/student_grade.npz', encoding='ASCII', allow_pickle=True)
columns = data['arr_0']
values = data['arr_1']

# Collect the total scores for each level of parents' education (column 2)
master = []
bachelor = []
undergraduate_college = []
associate = []
highschool = []
undergraduate_highschool = []
for i in range(len(values)):
    if values[i, 2] == '硕士学位':  # master's degree
        master.append(values[i, -1])
    elif values[i, 2] == '学士学位':  # bachelor's degree
        bachelor.append(values[i, -1])
    elif values[i, 2] == '大学未毕业':  # did not finish college
        undergraduate_college.append(values[i, -1])
    elif values[i, 2] == '副学士学位':  # associate's degree
        associate.append(values[i, -1])
    elif values[i, 2] == '高中毕业':  # high school graduate
        highschool.append(values[i, -1])
    elif values[i, 2] == '高中未毕业':  # did not finish high school
        undergraduate_highschool.append(values[i, -1])
# Mean total score per education level, rounded to two decimals
mean_master = round(np.mean(master), 2)
mean_bachelor = round(np.mean(bachelor), 2)
mean_undergraduate_college = round(np.mean(undergraduate_college), 2)
mean_associate = round(np.mean(associate), 2)
mean_highschool = round(np.mean(highschool), 2)
mean_undergraduate_highschool = round(np.mean(undergraduate_highschool), 2)

# Put the means in a list, in the order used for the x axis
mean_education_grade = [mean_master, mean_bachelor,
                        mean_undergraduate_college, mean_associate,
                        mean_highschool, mean_undergraduate_highschool]


# Collect the total scores for each lunch type (column 3)
standard = []
reduced = []
for i in range(len(values)):
    if values[i, 3] == '标准':  # standard lunch
        standard.append(values[i, -1])
    else:  # every other value is treated as free/reduced lunch
        reduced.append(values[i, -1])
# Mean total score per lunch type
mean_standard = round(np.mean(standard), 2)
mean_reduced = round(np.mean(reduced), 2)
mean_lunch_grade = [mean_standard, mean_reduced]

# Collect the total scores by exam preparation status (column 4)
completed = []
uncompleted = []
for i in range(len(values)):
    if values[i, 4] == '完成':  # preparation completed
        completed.append(values[i, -1])
    else:  # every other value is treated as not completed
        uncompleted.append(values[i, -1])
# Mean total score for completed and uncompleted exam preparation
mean_completed = round(np.mean(completed), 2)
mean_uncompleted = round(np.mean(uncompleted), 2)
mean_prepartion_grade = [mean_completed, mean_uncompleted]
# print(mean_prepartion_grade)


p = plt.figure(figsize=(13, 13))  # set up the canvas
# Subplot 1: line chart, spanning the top row
ax1 = p.add_subplot(2, 1, 1)
label = ['硕士学位', '学士学位', '大学未毕业', '副学士学位', '高中毕业', '高中未毕业']
plt.plot(range(6), mean_education_grade)  # draw the line chart
plt.xlabel('父母教育水平')
plt.ylabel('学生平均考试总成绩')
plt.xticks(range(6), label)
plt.title('学生平均考试总成绩与父母教育水平关系折线图')  # title now says line chart

# Subplot 2: column chart, bottom left
ax2 = p.add_subplot(2, 2, 3)
label = ['标准', '免费/简单']  # standard, free/reduced
plt.bar(range(2), mean_lunch_grade, width=0.4)  # draw the column chart
plt.xlabel('午餐情况')
plt.ylabel('学生平均考试总成绩')
plt.xticks(range(2), label)
plt.title('学生平均考试总成绩与午餐情况关系柱形图')
# Subplot 3: column chart, bottom right
ax3 = p.add_subplot(2, 2, 4)
label = ['已完成', '未完成']  # completed, not completed
plt.bar(range(2), mean_prepartion_grade, width=0.4)  # draw the column chart
plt.xlabel('考试课程准备情况')
plt.ylabel('学生平均考试总成绩')
plt.xticks(range(2), label)
plt.title('学生平均考试总成绩与考试课程准备情况关系柱形图')

plt.savefig('./tmp/学生考试总成绩与各个特征关系图.png')
plt.show()
  

# Task 3
import warnings

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Chinese font settings
plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese text with the SimHei font
plt.rcParams['axes.unicode_minus'] = False  # keep the minus sign '-' from showing as a square in saved images
# Ignore warnings
warnings.filterwarnings('ignore')
# Read the data
data = pd.read_csv('./data/aqi.csv')


# -------------------- Categorical scatter plot of air quality levels --------------------
# Draw the plot inside the style context so the whitegrid style applies to it
with sns.axes_style('whitegrid'):
    ax = sns.stripplot(x='质量等级', y='AQI', data=data, jitter=True)
ax.set_title('2020年芜湖市空气质量等级分类图')
plt.show()


# -------------------- Linear regression fit of AQI against PM2.5 --------------------
ax = sns.regplot(x='PM2.5含量(ppm)', y='AQI', data=data)
ax.set_title('2020年芜湖市空气质量指数PM2.5与AQI回归拟合图')
plt.show()


# Compute the correlation coefficients.
# numeric_only=True (pandas 1.5+) skips text columns such as the quality level.
corr_data = data.corr(numeric_only=True)
# -------------------- Heat map of feature correlations --------------------
ax = sns.heatmap(corr_data, annot=True)
ax.set_title('2020年芜湖市空气质量特征相关性热力图')
plt.show()



Origin blog.csdn.net/weixin_48676558/article/details/130596890