Python data analysis (4)--operating Excel files

1 Manipulate Excel files - multiple implementation methods

In day-to-day work, Excel is often used to process data. Excel has powerful formulas, but many tasks can only be semi-automated with them. Python can automate many of these routine tasks and greatly improve efficiency.

openpyxl: reads and writes .xlsx files only; supports creating, reading, updating, and deleting content.
xlwings: reads and writes both .xlsx and .xls files; supports creating, reading, updating, and deleting content.
xlsxwriter: writes .xlsx files only; it cannot read or modify existing files.

Comparing the three, you might conclude that xlsxwriter is the weakest library. It is not: the other two cannot match it for writing. Its strength is output (many chart types, images, table style changes, and so on).

1.1 xlsxwriter library stores data to excel

xlsxwriter is a Python module for creating Excel XLSX files. It can write text, numbers, formulas, and hyperlinks to multiple worksheets in an Excel 2007+ XLSX file, and it supports formatting and many other features.

Advantages:

Rich feature set:
Supports font settings, foreground and background colors, borders, view zoom, cell merging, autofilter, freeze panes, formulas, data validation, cell comments, and row height and column width settings
Supports writing large files

Limitation:

Reading and modifying existing files, XLS files, and pivot tables are not supported
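
Several of the features listed above can be combined in one short script. The following is only a sketch: the output path, sheet name, and sample rows are made up for illustration and are not part of the original article.

```python
import os
import tempfile

import xlsxwriter

# Hypothetical output location, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "features_demo.xlsx")

workbook = xlsxwriter.Workbook(path)
worksheet = workbook.add_worksheet("demo")

header = workbook.add_format({"bold": True, "bg_color": "#DDEBF7", "border": 1})
title = workbook.add_format({"bold": True, "align": "center"})

worksheet.merge_range("A1:C1", "Monthly expenses", title)      # cell merging
worksheet.write_row("A2", ["Item", "Category", "Amount"], header)

rows = [["Rent", "Fixed", 1000], ["Gas", "Variable", 100], ["Food", "Variable", 300]]
for offset, record in enumerate(rows):
    worksheet.write_row(2 + offset, 0, record)                 # data lands in rows 3 to 5

worksheet.set_column("A:C", 14)                                # column width
worksheet.set_row(0, 24)                                       # row height of the title row
worksheet.freeze_panes(2, 0)                                   # keep title and header visible
worksheet.autofilter(1, 0, 1 + len(rows), 2)                   # filter over header + data
worksheet.data_validation(
    "B3:B5", {"validate": "list", "source": ["Fixed", "Variable"]}
)                                                              # data validation (dropdown)
worksheet.write_comment("C3", "Paid on the first of the month")  # cell comment
worksheet.write_formula("C6", "=SUM(C3:C5)")                   # formula

workbook.close()
```

Opening the resulting file in Excel should show a merged title row, a frozen and filterable header, and a dropdown in the Category column.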

1. Install xlsxwriter

pip install XlsxWriter -i https://pypi.douban.com/simple

2. Common operations

Add worksheet style

bold = workbook.add_format({
    'bold': True,           # bold font
    'border': 1,            # cell border style (1 = thin line)
    'align': 'left',        # horizontal alignment
    'valign': 'vcenter',    # vertical alignment
    'fg_color': '#F4B084',  # cell fill color
    'text_wrap': True,      # wrap text inside the cell
})

Write cell data

# Write a single cell
# row: row index, col: column index (both zero-based), data: value to write, bold: cell format
worksheet1.write(row, col, data, bold)

# Write a whole row or a whole column
# 'A1': start cell, filled left to right; data: a list of values; bold: cell format
worksheet1.write_row('A1', data, bold)

# 'A1': start cell, filled top to bottom; data: a list of values; bold: cell format
worksheet1.write_column('A1', data, bold)
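
To make the two addressing styles concrete, here is a small self-contained sketch. The file path and values are invented for the example; the point is that (row, col) pairs are zero-based while A1 notation is one-based, and that `xlsxwriter.utility` can convert between them.

```python
import os
import tempfile

import xlsxwriter
from xlsxwriter.utility import xl_cell_to_rowcol, xl_rowcol_to_cell

# Hypothetical output location, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "write_demo.xlsx")

workbook = xlsxwriter.Workbook(path)
worksheet1 = workbook.add_worksheet()
bold = workbook.add_format({"bold": True})

worksheet1.write_row("A1", ["Item", "Amount"], bold)     # fills A1 and B1
worksheet1.write_column("A2", ["Rent", "Gas", "Food"])   # fills A2, A3, A4
worksheet1.write_column(1, 1, [1000, 100, 300])          # (1, 1) is B2: fills B2, B3, B4
worksheet1.write(4, 0, "Total", bold)                    # (4, 0) is A5
worksheet1.write("B5", "=SUM(B2:B4)")                    # a string starting with "=" is written as a formula

workbook.close()

# Converting between the two notations
a5 = xl_rowcol_to_cell(4, 0)        # "A5"
b2 = xl_cell_to_rowcol("B2")        # (1, 1)
print(a5, b2)
```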

Insert picture

# First argument: the cell where the image is anchored; second argument: the path to the image file (an absolute path here)
worksheet1.insert_image('A1', 'f:\\1.jpg')

Write hyperlink

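# "TargetSheetName" stands for the worksheet to link to; string is the text displayed in the cell
worksheet1.write_url(row, col, "internal:%s!A1" % "TargetSheetName", string="Displayed link text")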
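
One common use of internal links is an index sheet that jumps to every other worksheet. The sketch below assumes made-up sheet names and a temporary output path; the sheet name is wrapped in single quotes inside the link so that names containing spaces would also work.

```python
import os
import tempfile

import xlsxwriter

# Hypothetical output location and sheet names, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "links_demo.xlsx")

workbook = xlsxwriter.Workbook(path)
index = workbook.add_worksheet("Index")
for name in ["Rent", "Food"]:
    workbook.add_worksheet(name)

# One internal link per worksheet after the index sheet
targets = workbook.worksheets()[1:]
for row, sheet in enumerate(targets):
    sheet_name = sheet.get_name()
    index.write_url(row, 0, f"internal:'{sheet_name}'!A1", string=sheet_name)

# An external link uses the same method with an ordinary URL
index.write_url(len(targets) + 1, 0, "https://xlsxwriter.readthedocs.io/", string="XlsxWriter docs")

sheet_names = [sheet.get_name() for sheet in workbook.worksheets()]
workbook.close()
```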

Insert chart

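chart = workbook.add_chart({'type': 'column'})

# 'type' selects the chart type. Examples:
# area: area chart, bar: bar chart (horizontal), column: column chart, doughnut: doughnut chart,
# line: line chart, pie: pie chart, scatter: scatter chart, radar: radar chart, stock: stock chart
# Note: workbook.add_chartsheet() creates a separate sheet that holds only a chart; it does not take a type.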
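
A chart object only becomes visible once it has a data series and is inserted into a worksheet. The following sketch uses invented data, sheet name, and output path to show the usual sequence: write the data, create the chart, add a series that points at the data, then insert the chart.

```python
import os
import tempfile

import xlsxwriter

# Hypothetical output location and data, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "chart_demo.xlsx")

workbook = xlsxwriter.Workbook(path)
worksheet = workbook.add_worksheet("data")

worksheet.write_row("A1", ["Item", "Amount"])
worksheet.write_column("A2", ["Rent", "Gas", "Food", "Gym"])
worksheet.write_column("B2", [1000, 100, 300, 50])

chart = workbook.add_chart({"type": "column"})
chart.add_series({
    "name": "Expenses",
    # Range lists are [sheet name, first_row, first_col, last_row, last_col], zero-based
    "categories": ["data", 1, 0, 4, 0],
    "values": ["data", 1, 1, 4, 1],
})
chart.set_title({"name": "Monthly expenses"})

worksheet.insert_chart("D2", chart)   # top-left corner of the chart goes at D2
workbook.close()
```

Changing `"column"` to another type such as `"pie"` or `"line"` is enough to switch the chart style for the same data.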

Get all worksheets of the current Excel file: workbook.worksheets()
Close the Excel file: workbook.close()
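
Forgetting `workbook.close()` means the file is never written. As a sketch (with a made-up path and sheet names), `Workbook` can also be used as a context manager, which closes the file automatically when the block ends:

```python
import os
import tempfile

import xlsxwriter

# Hypothetical output location, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "context_demo.xlsx")

with xlsxwriter.Workbook(path) as workbook:
    workbook.add_worksheet("summary")
    workbook.add_worksheet("details")

    # worksheets() returns the worksheet objects in creation order
    names = [sheet.get_name() for sheet in workbook.worksheets()]

    # A worksheet can also be looked up by name
    workbook.get_worksheet_by_name("details").write("A1", "written via lookup")

# Leaving the with-block closes the workbook and writes the file
print(names)
```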

3. Create a simple XLSX file

Let's say we have some data on monthly expenses and we want to convert it to an Excel XLSX file:

import xlsxwriter as xw
 
 
def xw_toExcel(data, fileName):
    """Store data in an Excel file with xlsxwriter."""
    workbook = xw.Workbook(fileName)                # create the workbook
    worksheet1 = workbook.add_worksheet("sheet1")   # create a worksheet
    worksheet1.activate()                           # make it the active sheet
    title = ['No.', 'Expense item', 'Amount']       # header row
    worksheet1.write_row('A1', title)               # write the header starting at A1
    row = 1                                         # data starts on the second row
    col = 0
    for record in data:                             # write the records one row at a time
        worksheet1.write(row, col, record["id"])
        worksheet1.write(row, col+1, record["name"])
        worksheet1.write(row, col+2, record["expenses"])
        row += 1
    workbook.close()  # close the workbook and write the file
 
 
# ------------- sample data -------------
test_data = [
    {"id": 1, "name": "Rent", "expenses": 1000},
    {"id": 2, "name": "Gas", "expenses": 100},
    {"id": 3, "name": "Food", "expenses": 300},
    {"id": 4, "name": "Gym", "expenses": 50},
]
file_name = 'test.xlsx'
xw_toExcel(test_data, file_name)

4. Write different types of data to xlsx

Building on the previous section, add a date column and write it to the xlsx file:

from datetime import datetime
import xlsxwriter as xw


def xw_toExcel(data, fileName):
    """Store data in an Excel file with xlsxwriter."""
    workbook = xw.Workbook(fileName)                # create the workbook
    worksheet1 = workbook.add_worksheet("sheet1")   # create a worksheet
    bold = workbook.add_format({'bold': True})      # bold format for headers
    money_format = workbook.add_format({'num_format': '$#,##0'})    # number format for amounts
    date_format = workbook.add_format({'num_format': 'yyyy-mm-dd'}) # number format for dates
    worksheet1.set_column(1, 1, 15)                 # widen the second column
    worksheet1.activate()                           # make it the active sheet
    title = ['Date', 'Expense item', 'Amount']      # header row
    worksheet1.write_row('A1', title, bold)         # write the header starting at A1
    row = 1                                         # data starts on the second row
    col = 0
    for record in data:                             # write the records one row at a time
        date = datetime.strptime(record["date"], "%Y-%m-%d")
        worksheet1.write_datetime(row, col, date, date_format)
        worksheet1.write_string(row, col+1, record["name"])
        worksheet1.write_number(row, col+2, record["expenses"], money_format)
        row += 1

    # Write a formula; row now equals the Excel row number of the last data row,
    # so the range is C2:C5 for the four sample records
    worksheet1.write(row, 0, 'Total', bold)
    worksheet1.write_formula(row, 2, f'=SUM(C2:C{row})', money_format)

    workbook.close()  # close the workbook and write the file
 
# ------------- sample data -------------
test_data = [
    {"date": "2023-10-24", "name": "Rent", "expenses": 1000},
    {"date": "2023-10-25", "name": "Gas", "expenses": 100},
    {"date": "2023-10-27", "name": "Food", "expenses": 300},
    {"date": "2023-10-30", "name": "Gym", "expenses": 50},
]
file_name = 'test_1.xlsx'
xw_toExcel(test_data, file_name)

5. Write data queried from a database to the xlsx file

import pymysql
from datetime import datetime
import xlsxwriter

# Create the MySQL connection
conn = pymysql.connect(host='localhost', port=3306, user='root', passwd='xxxxxx',db='school')
cursor = conn.cursor()

sql1 = "select cou_name, cou_credit from tb_course"
cursor.execute(sql1)

rows = cursor.fetchall()
fields = cursor.description     # column metadata, one entry per selected column

# Create a workbook and a worksheet
workbook = xlsxwriter.Workbook('course.xlsx')
worksheet = workbook.add_worksheet()

# Add a bold format
bold = workbook.add_format({'bold': True})

# Write the header
worksheet.write('A1', 'course', bold)
worksheet.write('B1', 'course_credit', bold)


# Data fills cells (1, 0) to (len(rows), len(fields) - 1):
# the row count comes from the query result, the column count from fields
for row in range(1, len(rows)+1):
    for col in range(0, len(fields)):
        worksheet.write(row, col, u'%s' % rows[row-1][col])   # every value is written as text
workbook.close()

# Close the cursor and the connection
cursor.close()
conn.close()
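
The script above needs a running MySQL server and writes every value as text. The following sketch shows the same flow with two deliberate substitutions that are not in the original: it uses the standard-library `sqlite3` module with an in-memory table (invented rows, same table and column names) so it can run anywhere, and it takes the header from `cursor.description` and passes values to `write_row` unchanged so that numbers stay numeric in Excel.

```python
import os
import sqlite3
import tempfile

import xlsxwriter

# Stand-in database: an in-memory SQLite table with made-up rows (assumption for this sketch)
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("create table tb_course (cou_name text, cou_credit integer)")
cursor.executemany("insert into tb_course values (?, ?)", [("Math", 4), ("Physics", 3)])

cursor.execute("select cou_name, cou_credit from tb_course")
rows = cursor.fetchall()
# In the DB-API, the first element of each description entry is the column name
headers = [column[0] for column in cursor.description]

# Hypothetical output location, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "course.xlsx")
workbook = xlsxwriter.Workbook(path)
worksheet = workbook.add_worksheet()
bold = workbook.add_format({"bold": True})

worksheet.write_row(0, 0, headers, bold)            # header row from the query metadata
for row_index, record in enumerate(rows, start=1):
    worksheet.write_row(row_index, 0, record)       # values keep their Python types

workbook.close()
cursor.close()
conn.close()
```

With pymysql the query and the writing loop would look the same; only the connection setup differs.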


1.2 pandas library stores data to excel

pandas is built on NumPy arrays and makes data preprocessing, cleaning, and analysis faster and easier. pandas is designed for tabular and heterogeneous data, while NumPy is better suited to homogeneous numerical arrays. pandas has two main data structures: Series and DataFrame.

A Series is a one-dimensional, array-like object. It consists of a sequence of values (of any NumPy data type) and an associated set of data labels, its index; the two are exposed as the values and index attributes. The index labels can be used to select a single value or a group of values from a Series.

A DataFrame is a tabular data structure in which each column can hold a different value type. It is the most commonly used pandas object. A DataFrame has both a row index and a column index, and it can be viewed as a dictionary of Series that share the same index. Data in a DataFrame is stored in one or more two-dimensional blocks rather than in lists, dictionaries, or other one-dimensional data structures.
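
A short sketch may make the two structures concrete. The labels and numbers below are invented for illustration.

```python
import pandas as pd

# A Series: values plus an index of labels
s = pd.Series([1000, 100, 300], index=["Rent", "Gas", "Food"])
single = s["Gas"]                  # select one value by label
subset = s[["Rent", "Food"]]       # select a group of values by a list of labels

# A DataFrame: columns of different types sharing one row index
df = pd.DataFrame({
    "name": ["Rent", "Gas", "Food"],
    "expenses": [1000, 100, 300],
})
column = df["expenses"]            # each column is itself a Series
total = column.sum()

print(s.index.tolist(), s.values.tolist())
print(df.dtypes)
```

Selecting a column from the DataFrame returns a Series with the same row index, which is why a DataFrame can be thought of as a dictionary of Series.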

import pandas as pd

def pd_to_excel(data, file_name):
    """Store data in an Excel file with pandas."""
    ids = []
    names = []
    prices = []
    for item in data:
        ids.append(item["id"])
        names.append(item["name"])
        prices.append(item["expenses"])
    df_data = {
        'No.': ids,
        'Expense item': names,
        'Amount': prices
    }
    df = pd.DataFrame(df_data)
    df.to_excel(file_name, index=False)   # index=False leaves out the row index column

# ------------- sample data -------------
test_data = [
    {"id": 1, "name": "Rent", "expenses": 1000},
    {"id": 2, "name": "Gas", "expenses": 100},
    {"id": 3, "name": "Food", "expenses": 300},
    {"id": 4, "name": "Gym", "expenses": 50},
]
file_name = 'test_2.xlsx'
pd_to_excel(test_data, file_name)
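
Because the input is already a list of dictionaries, the per-column lists are optional. As an alternative sketch (the English column labels, sheet name, and temporary path are assumptions made for this example), `pd.DataFrame` can take the list directly and `rename` can set the header text. Reading the file back assumes openpyxl is installed, since pandas uses it to read .xlsx files.

```python
import os
import tempfile

import pandas as pd

test_data = [
    {"id": 1, "name": "Rent", "expenses": 1000},
    {"id": 2, "name": "Gas", "expenses": 100},
    {"id": 3, "name": "Food", "expenses": 300},
    {"id": 4, "name": "Gym", "expenses": 50},
]

# Build the DataFrame straight from the list of dicts, then relabel the columns
df = pd.DataFrame(test_data).rename(
    columns={"id": "No.", "name": "Expense item", "expenses": "Amount"}
)

# Hypothetical output location, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "test_2.xlsx")
df.to_excel(path, sheet_name="expenses", index=False)

# Read the sheet back to check what was written
round_trip = pd.read_excel(path, sheet_name="expenses")
print(round_trip)
```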

1.3 openpyxl library stores data to excel

Install: pip install openpyxl==2.2.6

Open existing file

from openpyxl import load_workbook
wb2 = load_workbook('filename.xlsx')

Get letters from numbers, get numbers from letters

from openpyxl.utils import get_column_letter, column_index_from_string
 
# Column number to column letter
print(get_column_letter(2)) # B
# Column letter to column number
print(column_index_from_string('D')) # 4

Delete worksheet

# Option 1: pass the worksheet object
wb.remove(sheet)
# Option 2: delete by worksheet name
del wb[sheet.title]

View table name and select table (sheet)

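# Show all sheet names
print(wb.sheetnames)
# ['Sheet2', 'New Title', 'Sheet1']

# Iterate over all sheets
for sheet in wb:
    print(sheet.title)

# A sheet name can be used as a key
ws3 = wb["New Title"]
# Older, deprecated spelling of the same lookup: ws4 = wb.get_sheet_by_name("New Title")
# Both lookups return the same worksheet object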

Set cell style

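from openpyxl.styles import Font, Alignment

# Font
## 24-point DengXian (等线), bold and italic, red. Assign a Font object to the cell's font attribute.
## 'FF0000' is the hex RGB value for red; it avoids depending on the colors.RED constant.
bold_italic_24_font = Font(name='等线', size=24, italic=True, color='FF0000', bold=True)
sheet['A1'].font = bold_italic_24_font

# Alignment

## Use the cell's alignment attribute; here B1 is centered both horizontally and vertically.
## Besides center, values such as right and left are also available.
sheet['B1'].alignment = Alignment(horizontal='center', vertical='center')

## Row height and column width
### Height of row 2
sheet.row_dimensions[2].height = 40
### Width of column C
sheet.column_dimensions['C'].width = 30

# Merging and unmerging cells
## Merging takes the top-left cell of the range as the base and absorbs the other cells into one large cell.
## Unmerging leaves the value of the large cell in that top-left position.
sheet.merge_cells('B1:G1') # merge several cells in one row
sheet.merge_cells('A1:C3') # merge a rectangular range
# After merging, write only to the top-left cell, i.e. the coordinate to the left of the colon.
# If several of the merged cells held data, only the top-left value is kept and the rest are discarded.
# In other words, if nothing was written to the top-left cell before merging, the merged cell is empty.
# Unmerge the cells; the value stays in A1.
sheet.unmerge_cells('A1:C3')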
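
The merge rules described above can be checked in memory without saving a file. This is a sketch that assumes a recent openpyxl release (3.x); older versions may keep the values of the non-top-left cells. The cell contents are invented.

```python
from openpyxl import Workbook

wb = Workbook()
sheet = wb.active

sheet["A1"] = "top-left"
sheet["B2"] = "inside"          # will be discarded by the merge

sheet.merge_cells("A1:C3")
merged_ranges = [cell_range.coord for cell_range in sheet.merged_cells.ranges]
top_left_after_merge = sheet["A1"].value
inner_after_merge = sheet["B2"].value       # merged-away cells report no value

sheet.unmerge_cells("A1:C3")
ranges_after_unmerge = len(sheet.merged_cells.ranges)
top_left_after_unmerge = sheet["A1"].value  # the value stays in A1
inner_after_unmerge = sheet["B2"].value     # the discarded value does not come back

print(merged_ranges, top_left_after_merge, inner_after_merge)
print(ranges_after_unmerge, top_left_after_unmerge, inner_after_unmerge)
```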

Create an XLSX file

import openpyxl as op

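def op_to_excel(data, file_name):
    '''Store data in an Excel file with openpyxl.'''
    wb = op.Workbook()          # create the workbook object
    ws = wb['Sheet']            # get the default worksheet, which is named 'Sheet'
    ws.append(['No.', 'Expense item', 'Amount']) # add the header row
    for item in data:
        d = item["id"], item["name"], item["expenses"]
        ws.append(d)        # append one row per record
    wb.save(file_name)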

# ------------- sample data -------------
test_data = [
    {"id": 1, "name": "Rent", "expenses": 1000},
    {"id": 2, "name": "Gas", "expenses": 100},
    {"id": 3, "name": "Food", "expenses": 300},
    {"id": 4, "name": "Gym", "expenses": 50},
]
file_name = 'test_3.xlsx'
op_to_excel(test_data, file_name)
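
Unlike xlsxwriter, openpyxl can also reopen a file and change it. The sketch below assumes a recent openpyxl release (`delete_rows` and `values_only` are not available in very old versions) and uses a temporary path and invented rows. It creates a small file, then reloads it to modify a cell, delete a row, append a row, and read the result.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

# Hypothetical output location, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "edit_demo.xlsx")

# Create a small file first
wb = Workbook()
ws = wb.active
ws.append(["No.", "Expense item", "Amount"])
ws.append([1, "Rent", 1000])
ws.append([2, "Gas", 100])
wb.save(path)

# Reopen it and change the contents
wb2 = load_workbook(path)
ws2 = wb2.active
ws2["C2"] = 1100                 # modify: update the Rent amount
ws2.delete_rows(3)               # delete: remove spreadsheet row 3 (the Gas record)
ws2.append([3, "Food", 300])     # add: append a new record after the last row

# Read: collect the data rows as tuples of plain values
rows = list(ws2.iter_rows(min_row=2, values_only=True))
wb2.save(path)
print(rows)
```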

Summary
The most suitable scenarios for each library:

If you want to do more with Excel files without going through the Excel GUI, choose between openpyxl and xlsxwriter;
If you need to do scientific computing or process large amounts of data, pandas+xlsxwriter or pandas+openpyxl is recommended;
If you want to script Excel and know Python but not VBA, consider xlwings or DataNitro;
win32com is very capable in both features and performance, and suits readers with Windows programming experience. However, it is essentially a wrapper around Windows COM and lacks complete documentation, so it can be painful for beginners.
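
As an illustration of the pandas+xlsxwriter combination, pandas can write the data while xlsxwriter handles formatting through the same writer object. This is a sketch with invented data, sheet name, and output path; it assumes both pandas and xlsxwriter are installed.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "No.": [1, 2, 3],
    "Expense item": ["Rent", "Gas", "Food"],
    "Amount": [1000, 100, 300],
})

# Hypothetical output location, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "pandas_xlsxwriter.xlsx")

with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
    df.to_excel(writer, sheet_name="expenses", index=False)

    # The underlying xlsxwriter objects are available on the writer
    workbook = writer.book
    worksheet = writer.sheets["expenses"]

    money = workbook.add_format({"num_format": "$#,##0"})
    worksheet.set_column("B:B", 16)           # widen the item column
    worksheet.set_column("C:C", 12, money)    # currency format for the amount column
    worksheet.freeze_panes(1, 0)              # keep the header row visible
```

The file is written when the with-block exits, so there is no separate save or close call.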

2 Format conversion

2.1 .mat to .csv

import os

import pandas as pd
from scipy import io

# Walk the data folder and convert every .mat file found
for dirname, _, filenames in os.walk('./data'):
    for filename in filenames:
        # Split the file name into its base name and extension
        (file, ext) = os.path.splitext(filename)
        if ext.lower() != '.mat':
            continue    # skip files that are not .mat (for example, CSVs from an earlier run)

        path = os.path.join(dirname, filename)
        print(path)
        # 1. Load the file (returns a dict mapping variable names to arrays)
        matfile = io.loadmat(path)
        # 2. Take the data: the last entry in the dict is assumed to be the variable of interest
        datafile = list(matfile.values())[-1]
        # 3. Build a tabular structure; data is the content of the table
        dfdata = pd.DataFrame(data=datafile)
        # 4. Build the .csv output path next to the source file
        datapath = os.path.join(dirname, file + '.csv')
        # 5. Save as .csv
        dfdata.to_csv(datapath, index=False)
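
Taking the last value of the loaded dictionary works when the file holds a single variable, but `loadmat` also returns metadata entries whose names start with a double underscore. The following sketch writes a small .mat file first (the variable name `signal` and the path are invented) and then filters the metadata out by name, which is one possible way to pick the real variables explicitly.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from scipy import io

# Hypothetical file and variable name, used only for this illustration
mat_path = os.path.join(tempfile.mkdtemp(), "example.mat")
io.savemat(mat_path, {"signal": np.arange(6).reshape(2, 3)})

mat = io.loadmat(mat_path)
all_keys = sorted(mat.keys())     # includes __globals__, __header__, __version__

# Keep only the real variables
variables = {name: value for name, value in mat.items() if not name.startswith("__")}

# Convert each variable to its own CSV file next to the .mat file
csv_paths = []
for name, array in variables.items():
    csv_path = os.path.splitext(mat_path)[0] + "_" + name + ".csv"
    pd.DataFrame(array).to_csv(csv_path, index=False)
    csv_paths.append(csv_path)

print(all_keys)
print(csv_paths)
```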

2.2 .csv to .npy

import pandas as pd
import numpy as np

# Read the csv with pandas first
data = pd.read_csv("xxxx.csv")
# Then save it as .npy with numpy
np.save("xxx.npy", data)
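
A round trip shows what survives the conversion. In this sketch the file names and values are invented: the CSV is created first so the example runs on its own, then it is converted and loaded back. The column names are not stored in the .npy file, only the values.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical file locations and contents, used only for this illustration
folder = tempfile.mkdtemp()
csv_path = os.path.join(folder, "example.csv")
npy_path = os.path.join(folder, "example.npy")

pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]}).to_csv(csv_path, index=False)

# csv -> DataFrame -> NumPy array -> npy
data = pd.read_csv(csv_path)
array = data.to_numpy()           # mixed int/float columns become one float array
np.save(npy_path, array)

# Load it back: same shape and values, but no column names
loaded = np.load(npy_path)
print(loaded.shape, loaded.dtype)
```

If the CSV contains text columns, the array has object dtype and `np.load` needs `allow_pickle=True` to read it back, so purely numeric data is the safer fit for .npy.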
