Python merge multiple sheets of excel

Article Directory

Introduction

Because there are a large number of excel report summary processing tasks every day, I wrote a script to handle it.

It is to find the specific sheet in each excel, read the specific columns of these sheets and merge them into one sheet.

Because the data of each sheet is not the same, it is a little troublesome. The following uses openpyxl and pandas to process.

openpyxl way

Use openpyxl to implement the merge logic yourself, which is a bit more troublesome. It is worth noting that there may be formulas in excel. The following methods can be used when reading excel:

load_workbook(data_file_path, data_only=True)

Using data_only=True, you can get the value of the formula after calculation, instead of the formula itself, because the formula itself is merged in another sheet, the formula may be invalid or even incorrect.

Here is a sample code for reference only:

"""
 pip install openpyxl
"""
from openpyxl import load_workbook
from openpyxl import Workbook
import os
import re

# 模板文件
TEMPLATE_FILE = r'H:\合并\合并模板.xlsx'
# 合并结果文件
RESULT_FILE = r'H:\合并\结果.xlsx'
# 数据文件目录
DATA_ROOT_DIR = r"H:\合并"

# 文件名称正则
DATA_FILE_REG = r"(.*?)-合同导入台账\d{8}.xlsx"


# 获取要处理的文件
def get_deal_file_map():
    file_sn_map = {
    
    }
    fs = os.listdir(DATA_ROOT_DIR)
    for f in fs:
        match = re.match(DATA_FILE_REG, f)
        if match:
            city = match.group(1)
            sn = 2
            if city == '成都':
                sn = 4
            elif city == '杭州':
                sn = 3
            file_sn_map[os.path.join(DATA_ROOT_DIR, f)] = sn
    return file_sn_map


# 规范化列名
def get_normal_column_name(origin_col_name):
    if origin_col_name:
        start = origin_col_name.find("（")
        if start == -1:
            return origin_col_name.strip()
        else:
            return origin_col_name[0:start].strip()


# 获取列名与列坐标的映射
def get_col_name_coordinate_map(sheet_row):
    name_coor_map = {
    
    }
    for cell in sheet_row:
        # name_coor_map[get_normal_column_name(cell.value)] = cell.column_letter
        name_coor_map[get_normal_column_name(cell.value)] = cell.column
    return name_coor_map


# 获取模板文件的列名与列坐标映射
def get_template_name_coordinate_map(template_file_path):
    template_wbook = load_workbook(template_file_path)
    table = template_wbook[template_wbook.sheetnames[0]]
    row = table[1:1]
    return get_col_name_coordinate_map(row)


def deal_data_content():
    """
        合并文件内容
    """
    dfile_sn_map = get_deal_file_map()
    save_book = Workbook()
    wsheet = save_book.active
    wsheet.title = 'merge-data'
    tmp_col_coor_map = get_template_name_coordinate_map(TEMPLATE_FILE)
    wsheet.append(list(tmp_col_coor_map.keys()))
    line = 2
    for data_file_path in dfile_sn_map.keys():
        sheet_num = dfile_sn_map[data_file_path]
        wbook = load_workbook(data_file_path, data_only=True)

        names = wbook.sheetnames

        for i in range(0, sheet_num):
            table = wbook[names[i]]
            row = table[1:1]
            data_col_coor_map = get_col_name_coordinate_map(row)
            use_col = data_col_coor_map.keys() & tmp_col_coor_map.keys()
            for row in table.iter_rows(min_row=2, values_only=True):
                rcol_index = data_col_coor_map['城市']
                city = row[rcol_index - 1]
                if (city is None) or len(city.strip()) == 0:
                    continue
                for col_name in use_col:
                    rcol_index = data_col_coor_map[col_name]
                    wcol_index = tmp_col_coor_map[col_name]
                    wsheet.cell(line, wcol_index, row[rcol_index - 1])
                line += 1
    save_book.save(RESULT_FILE)


if __name__ == '__main__':
    deal_data_content()

pandas way

Compared to using openpyxl directly, using pandas is much more convenient, just use the concat method directly.

pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,keys=None, levels=None, names=None,verify_integrity=False,copy=True)

Parameter meaning

parameter	meaning
objs	kist，Series、DataFrame、Panel
axis	The default is 0, connect by line
join	inner, outer, the default is "outer"
keys	list, the outermost layer to build a hierarchical index, if it is a multi-index, use a tuple
levels	list, used to build a specific level of MultiIndex
names	list, the name of the level in the result hierarchy index
copy	boolean, the default is True. If False, do not copy data unnecessarily
join_axes	Will be discarded, it is recommended to use reindex on the result set
ignore_index	boolean, the default is False. If True, ignore index
verify_integrity	boolean, the default is False. Check whether the newly connected axis contains duplicates

Let's look directly at the example:

# coding:utf-8
import pandas as pd

# 读取指定文件的指定sheet
df1 = pd.read_excel(r'H:\merge\cd-contract-charge-1-20200807.xlsx', header=0, sheet_name=0)
df2 = pd.read_excel(r'H:\merge\cd-contract-charge-2-20200807.xlsx', header=0, sheet_name=1)
df3 = pd.read_excel(r'H:\merge\cd-contract-charge-3-20200807.xlsx', header=0, sheet_name=2)
df4 = pd.read_excel(r'H:\merge\hz-contract-charge-1-20200807.xlsx', header=0, sheet_name=0)
df5 = pd.read_excel(r'H:\merge\hz-contract-charge-2-20200807.xlsx', header=0, sheet_name=1)

# 按行拼接
data = pd.concat([df1, df2, df3, df4, df5], sort=False, ignore_index=True)

# 选择需要的列
header = ['日期', '合同号', '城市', '姓名', 'charge']
data = data.loc[:, header]

# 将结果写到值得excel文件
data.to_excel(r'H:\merge\result.xlsx', index=False)

Mainly to read excel files. For reading and writing pandas files, you can refer to: pandas reading and writing files

In addition to using the concat method, you can also use the append method. The append method is a special concat method, that is, when the concat parameter axis=0, it is also the default value of the concat method axis.

Since pandas is used, of course, some data filtering, filling, and conversion operations can also be done by the way.