Use Python to add a histogram and a normal distribution test to a grade sheet

Use Python data analysis tools to analyze student grades (D:\Grade Sheet.xlsx; the workbook contains two courses, see Figure 1). Each sheet includes the student number, name, usual grade, final grade, and overall evaluation. The task is to compute segmented statistics for the "overall evaluation" column: the mean, the standard deviation, the distribution of grades by band, and a one-sample KS test for normality (5% significance level, two-sided). The statistical results are inserted into the original workbook, so that the analysis of student grades is fully automated.

Grade bands: 0~59 is "fail", 60~69 is "pass", 70~79 is "medium", 80~89 is "good", and 90~100 is "excellent". The proportion of students in each band is also calculated.
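The banding rule can be sketched with pandas' pd.cut. This is only an illustration under stated assumptions: the scores are made up, and the upper edge of 101 is chosen here so that a score of 100 lands in the "excellent" band.

```python
import pandas as pd

# Hypothetical overall-evaluation scores, for illustration only.
scores = pd.Series([45, 59, 60, 69, 70, 85, 90, 100])

# right=False makes each bin closed on the left, e.g. [60, 70).
# The last edge is 101 so that a score of 100 falls in "excellent".
bins = [0, 60, 70, 80, 90, 101]
labels = ["fail", "pass", "medium", "good", "excellent"]
bands = pd.cut(scores, bins=bins, labels=labels, right=False)

# Count and proportion of students per band, kept in band order.
counts = bands.value_counts(sort=False)
proportions = counts / len(scores)
print(pd.DataFrame({"count": counts, "proportion": proportions}))
```

Using half-open bins means a fractional score such as 59.5 is still classified as "fail" rather than falling between two bands.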

Figure 1 Structure of D:\Grade Sheet.xlsx, which contains two courses: "Modern Computer Network" and "VB Programming"

1. Processing Excel tables and data analysis tools

To edit an Excel table with Python, a third-party library is required. The commonly used libraries are:

Legacy Excel files (*.xls, the format of Excel 2003 and earlier): generally use xlrd to read, xlwt to write, and xlutils to copy, combining the three libraries.

Excel files in the *.xlsx format (introduced with Excel 2007) can be read and edited with openpyxl.

pandas, SciPy, NumPy, etc. are generally used for data analysis.

2. Basic application of openpyxl module

1. Install

(1) Execute at the Windows command prompt:

pip install openpyxl

(2) In PyCharm, open the Python interpreter settings, click the "+" in the upper right corner, search for openpyxl, and click the Install Package button.

2. openpyxl module

The three major components of the openpyxl module: 1) workbook, 2) worksheet, 3) cell.

(1) Load a workbook that already exists locally

exl = openpyxl.load_workbook(filename)

exl is the workbook object.

(2) Get the worksheet

exl_sht = exl[sheet_name]

exl_sht is the worksheet object; it is looked up by its sheet name (a string) in the workbook object exl.

(3) Get the cell

escel = exl_sht.cell(row, column)

escel is the cell object, exl_sht is the worksheet object, row is the row number, and column is the column number. Both numbers start from 1, and the content of the cell is read or written through escel.value.
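The three steps can be tried end to end with the minimal sketch below. To stay self-contained it first creates a small workbook in a temporary directory instead of opening D:\Grade Sheet.xlsx; the file name, column layout, and values are assumptions made for the example.

```python
import os
import tempfile

import openpyxl

# Build a tiny workbook so the example is self-contained; the file name,
# layout and values are hypothetical stand-ins for the grade sheet.
path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Modern Computer Network"
ws.append(["student number", "name", "overall evaluation"])
ws.append([1001, "Student A", 86])
wb.save(path)

# (1) load the workbook, (2) get the worksheet by name, (3) get a cell.
exl = openpyxl.load_workbook(path)
exl_sht = exl["Modern Computer Network"]
escel = exl_sht.cell(row=2, column=3)   # rows and columns are 1-based
print(escel.value)

# Writing goes through the same cell objects; save() persists the change.
exl_sht.cell(row=3, column=1).value = "mean"
exl_sht.cell(row=3, column=3).value = 86.0
exl.save(path)
```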

3. Python's NumPy, pandas, and SciPy

NumPy: N-dimensional array container and basic numerical computing module, mainly for matrix operations and pure mathematics.

pandas: Table container. It provides a one-dimensional data structure named Series and a two-dimensional data structure named DataFrame, a table structure well suited to statistical analysis.

SciPy: Scientific computing library built on NumPy. It provides ready-made routines for tasks such as statistics, optimization, integration, and interpolation; the KS test used in this article is in scipy.stats.
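As a rough illustration of how NumPy and SciPy divide the work for this task, the sketch below computes the mean and standard deviation with NumPy and runs a one-sample, two-sided KS test with scipy.stats.kstest. The scores are made up, and using the sample standard deviation (ddof=1) is an assumption; the article does not say which definition it uses.

```python
import numpy as np
from scipy import stats

# Hypothetical overall-evaluation scores, for illustration only.
scores = np.array([52, 61, 66, 70, 73, 75, 78, 80, 82, 85, 88, 91, 95],
                  dtype=float)

mean = scores.mean()
std = scores.std(ddof=1)   # sample standard deviation (an assumption)

# One-sample KS test against a normal distribution N(mean, std).
# kstest is two-sided by default.
statistic, p_value = stats.kstest(scores, "norm", args=(mean, std))

alpha = 0.05               # 5% significance level
verdict = "normality not rejected" if p_value > alpha else "normality rejected"
print(f"mean={mean:.2f} std={std:.2f} D={statistic:.3f} p={p_value:.3f} -> {verdict}")
```

Because the mean and standard deviation are estimated from the same data that is being tested, the p-value tends to be optimistic, so a result near the 5% threshold should be read with caution.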

4. Introduction to pandas

The name pandas comes from "panel data" and "Python data analysis". pandas is an open source third-party Python library built on NumPy, and together with NumPy and Matplotlib it enjoys the reputation of being one of the "three Musketeers" of data analysis. pandas has become an essential tool for Python data analysis, and its stated goal is to become the most powerful and flexible open source data analysis tool available in any language.

In general, use the following statement to import the pandas module:

import pandas as pd

It has almost become an unwritten rule to abbreviate pandas as pd. Therefore, as long as readers see pd, they should think that this is pandas.

There are two main data structures in pandas: Series and DataFrame.

(1) Series

A Series holds one-dimensional data, that is, a single labeled column.

(2) DataFrame

A DataFrame holds two-dimensional data; each of its columns is a Series.

These two data structures are sufficient to handle most typical use cases in finance, statistics, social sciences, engineering, etc. First, let's look at Series.

Series

A Series is an object similar to a one-dimensional array in NumPy. It consists of a set of data of any type and a set of data labels (i.e., an index) associated with it. Take the simplest example:

import pandas as pd

print(pd.Series([1,3,5,7,9]))

The above code will print something like Figure 2:

Figure 2 Series structure 

In Figure 2, the left column is the index of the data, which by default increases sequentially from 0. The right column is the corresponding data, and the last line indicates the data type.

You can use a dictionary to create the data and a custom index at the same time. pandas automatically uses the keys of the dictionary as the index and the values of the dictionary as the corresponding data.

import pandas as pd

print(pd.Series({'a':1,'b':3,'c':5,'d':7,'e':9}))

The running result is shown in Figure 3:

 

Figure 3 Series structure custom index

Data in a Series is accessed in much the same way as list and dictionary elements in Python: use square brackets with an index (key) to get the data.
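A short sketch of these access patterns, reusing the dictionary-based Series from above:

```python
import pandas as pd

s = pd.Series({'a': 1, 'b': 3, 'c': 5, 'd': 7, 'e': 9})

print(s['c'])          # access by label, like a dictionary key
print(s.iloc[0])       # access by position, like a list index
print(s[['a', 'e']])   # a list of labels returns a new Series
print(s.mean())        # statistics are available as methods
```

Using .iloc for positional access keeps the intent explicit when the index labels are themselves integers.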

DataFrame

Series is one-dimensional data, while DataFrame is two-dimensional data. DataFrame can be imagined as a two-dimensional table. The table has two dimensions of rows and columns, so it is two-dimensional data.

Each column of the DataFrame is a Series.

The syntax for creating a DataFrame object is:

df = pd.DataFrame(data, index, columns, dtype, copy)

Among them: data is the data itself, which can be a list, dictionary, NumPy array, Series object, etc.; index is the row index; columns is the column index; dtype is the data type of the columns; copy controls whether the input data is copied.

Because the table contains Chinese characters, which are displayed wider than English letters and digits, the columns of a printed DataFrame can be misaligned; call pd.set_option() to align them. If you run the code in Jupyter, the DataFrame is rendered as a table automatically and this setting is not needed.

There are many ways to construct a DataFrame. The most common one is to pass in a dictionary of lists of equal length; that is, each value in the dictionary is a list, and all lists must have the same length.
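The article does not show which options it passes to pd.set_option(); a minimal sketch, assuming the two display options pandas provides for East Asian character widths, could look like this (the names and scores are made up):

```python
import pandas as pd

# Treat East Asian wide characters as two columns wide when printing.
pd.set_option('display.unicode.east_asian_width', True)
pd.set_option('display.unicode.ambiguous_as_wide', True)

# Hypothetical data with Chinese column names and values.
df = pd.DataFrame({'姓名': ['张三', '李四'], '总评': [86, 92]})
print(df)
```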

In this way, a table is obtained, and the key of the dictionary will be used as the column name (column index) of the table. The leftmost is the row index, if not specified, it will increase sequentially from 0 by default. Of course, you can specify the index parameter to customize the row index when building the DataFrame.

For example:

 

Short tabular data can be written directly in the function call; larger tabular data is better stored in a variable such as data first and then passed to the function. The row index of the DataFrame is defined by the index parameter, and the column index by the columns parameter. The running result of the above code is shown in Figure 4 (the red text, arrows, and rectangular boxes were added to aid understanding):
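The two approaches can be sketched as follows. The names and scores are made up for illustration and are not the data shown in Figure 4.

```python
import pandas as pd

# Way 1: short data written directly in the call
# (a dictionary of lists of equal length; keys become column names).
df1 = pd.DataFrame({'name': ['Student A', 'Student B', 'Student C'],
                    'usual': [85, 72, 90],
                    'final': [78, 65, 94]})

# Way 2: larger data stored in a variable first, then passed in
# together with explicit row and column indexes.
data = [[85, 78], [72, 65], [90, 94]]
df2 = pd.DataFrame(data,
                   index=['Student A', 'Student B', 'Student C'],
                   columns=['usual', 'final'])

print(df1)
print(df2)
print(type(df2['final']))   # each column is a Series
```

In df1 the row index defaults to 0, 1, 2, while df2 uses the names as row labels, so a single value can be read with df2.loc['Student B', 'final'].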

  

Figure 4 Two ways to define the DataFrame structure

For other operations, please refer to the notes in the program. The complete program is as follows:
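As a rough guide to the overall workflow, the sketch below is one possible way to combine the pieces. It is not the author's listing. It makes several assumptions: a demo workbook is generated in a temporary directory instead of opening D:\Grade Sheet.xlsx, the overall evaluation is in column E, results are written starting at column G, the histogram is drawn with openpyxl's BarChart, and analyze_sheet is a hypothetical helper name.

```python
import os
import tempfile

import numpy as np
import openpyxl
import pandas as pd
from openpyxl.chart import BarChart, Reference
from scipy import stats

BINS = [0, 60, 70, 80, 90, 101]
LABELS = ["fail", "pass", "medium", "good", "excellent"]


def analyze_sheet(ws, score_col, out_col):
    """Write mean, std, KS result and a band histogram next to the scores."""
    # Collect numeric scores below the header row of the given column.
    scores = np.array([c.value for c in ws[score_col][1:]
                       if isinstance(c.value, (int, float))], dtype=float)
    mean, std = scores.mean(), scores.std(ddof=1)
    d_stat, p_value = stats.kstest(scores, "norm", args=(mean, std))
    counts = pd.cut(pd.Series(scores), bins=BINS, labels=LABELS,
                    right=False).value_counts(sort=False)

    # Summary rows: label in out_col, value in the column to its right.
    rows = [("mean", round(float(mean), 2)), ("std", round(float(std), 2)),
            ("KS p-value", round(float(p_value), 4)),
            ("normal at 5%", "yes" if p_value > 0.05 else "no"),
            ("band", "count")]
    rows += [(label, int(counts[label])) for label in LABELS]
    for i, (key, value) in enumerate(rows, start=1):
        ws.cell(row=i, column=out_col).value = key
        ws.cell(row=i, column=out_col + 1).value = value

    # Bar chart over the band counts (header in row 5, data in rows 6..10).
    chart = BarChart()
    chart.title = "Grade distribution"
    data = Reference(ws, min_col=out_col + 1, min_row=5, max_row=10)
    cats = Reference(ws, min_col=out_col, min_row=6, max_row=10)
    chart.add_data(data, titles_from_data=True)
    chart.set_categories(cats)
    ws.add_chart(chart, ws.cell(row=12, column=out_col).coordinate)
    return mean, std, p_value


# Build a small demo workbook; sheet name, layout and scores are hypothetical.
path = os.path.join(tempfile.mkdtemp(), "grades_demo.xlsx")
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "VB Programming"
ws.append(["student number", "name", "usual", "final", "overall"])
rng = np.random.default_rng(0)
for i, s in enumerate(rng.normal(75, 10, 40).clip(0, 100).round()):
    ws.append([1000 + i, f"Student {i}", None, None, float(s)])
wb.save(path)

# Analyze every sheet (one per course) and write the results back.
wb = openpyxl.load_workbook(path)
for sheet in wb.worksheets:
    mean, std, p_value = analyze_sheet(sheet, score_col="E", out_col=7)
wb.save(path)
```

Looping over wb.worksheets is what lets the same code handle both courses in the workbook without naming the sheets explicitly.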

 

The running results are shown in Figure 5 and Figure 6; the grade data is now analyzed automatically.

Figure 5 Effect of adding histogram and distribution information in the report card of "Modern Computer Network" (refer to Figure 1) 

 

Figure 6 The effect of adding histogram and distribution information in the report card of "VB Programming" (refer to Figure 1)


Origin blog.csdn.net/hz_zhangrl/article/details/128514341