[Python] Merging data from multiple Excel sheets: an example

Background

I have a large number of Excel reports to summarize every day, so I wrote a script to handle them.

The script finds specific sheets in each Excel file, reads specific columns from those sheets, and merges them into a single sheet.

Because the layout of each sheet differs, this is a little troublesome. Below, the task is handled first with openpyxl and then with pandas.

1. The openpyxl method

With openpyxl you have to implement the merge logic yourself, which is somewhat tedious. Note that the Excel files may contain formulas, so load the workbook like this:

load_workbook(data_file_path, data_only=True)

With data_only=True, a cell returns the value computed from its formula (the result cached when the file was last saved in Excel) rather than the formula text. This matters because a formula copied into another sheet may no longer resolve, or may even produce a wrong result.
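
One caveat is worth checking before relying on data_only=True. As far as I know, openpyxl does not evaluate formulas itself; it only reads the result that Excel stored in the file. The sketch below assumes a workbook that was written by openpyxl and never opened in Excel, so no cached result exists. The file name demo_formula.xlsx and the helper roundtrip_formula are made up for this illustration.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook


def roundtrip_formula(data_only):
    """Save a workbook containing a formula, reload it, and return A3's value."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "demo_formula.xlsx")  # hypothetical file name
        wb = Workbook()
        ws = wb.active
        ws["A1"] = 1
        ws["A2"] = 2
        ws["A3"] = "=A1+A2"
        wb.save(path)

        loaded = load_workbook(path, data_only=data_only)
        value = loaded.active["A3"].value
        loaded.close()
        return value


print(roundtrip_formula(data_only=False))  # prints =A1+A2 (the formula text)
print(roundtrip_formula(data_only=True))   # prints None (no cached result in this file)
```

If the report files are produced by Excel itself, the cached results should be present, but a None in a formula column is a hint that the file was generated by a tool that did not store them.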

Here is sample code, for reference only:

"""
 pip install openpyxl
"""
from openpyxl import load_workbook
from openpyxl import Workbook
import os
import re

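# Template file (its first row defines the output columns)
TEMPLATE_FILE = r'H:\合并\合并模板.xlsx'
# Merged result file
RESULT_FILE = r'H:\合并\结果.xlsx'
# Directory containing the data files
DATA_ROOT_DIR = r"H:\合并"

# File name pattern: <city>-合同导入台账<8 digits>.xlsx
# (the dot is escaped and the pattern is anchored so only .xlsx files match)
DATA_FILE_REG = r"(.*?)-合同导入台账\d{8}\.xlsx$"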


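# Collect the files to process, mapped to the number of leading sheets to read
def get_deal_file_map():
    file_sn_map = {}
    fs = os.listdir(DATA_ROOT_DIR)
    for f in fs:
        match = re.match(DATA_FILE_REG, f)
        if match:
            city = match.group(1)
            # Default: read the first 2 sheets of the workbook
            sn = 2
            if city == '成都':    # Chengdu: first 4 sheets
                sn = 4
            elif city == '杭州':  # Hangzhou: first 3 sheets
                sn = 3
            file_sn_map[os.path.join(DATA_ROOT_DIR, f)] = sn
    return file_sn_map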


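# Normalize a column name: drop a "(...)" suffix and surrounding whitespace
def get_normal_column_name(origin_col_name):
    if origin_col_name:
        # Header cells may hold non-string values, so convert before searching
        origin_col_name = str(origin_col_name)
        start = origin_col_name.find("(")
        if start == -1:
            return origin_col_name.strip()
        else:
            return origin_col_name[0:start].strip()
    # Empty header cells map to None
    return None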


# Build a mapping from normalized column name to column number
def get_col_name_coordinate_map(sheet_row):
    name_coor_map = {}
    for cell in sheet_row:
        # cell.column is the 1-based column number
        # (cell.column_letter would give 'A', 'B', ... instead)
        name_coor_map[get_normal_column_name(cell.value)] = cell.column
    return name_coor_map


# Build the column name to column number mapping for the template file
def get_template_name_coordinate_map(template_file_path):
    template_wbook = load_workbook(template_file_path)
    table = template_wbook[template_wbook.sheetnames[0]]
    row = table[1]  # first row of the first sheet holds the headers
    return get_col_name_coordinate_map(row)


def deal_data_content():
    """
        Merge the contents of all data files into a single sheet.
    """
    dfile_sn_map = get_deal_file_map()
    save_book = Workbook()
    wsheet = save_book.active
    wsheet.title = 'merge-data'
    tmp_col_coor_map = get_template_name_coordinate_map(TEMPLATE_FILE)
    # Write each header into its template column so headers and data stay aligned
    for col_name, wcol_index in tmp_col_coor_map.items():
        wsheet.cell(1, wcol_index, col_name)
    line = 2  # next row to write in the result sheet
    for data_file_path, sheet_num in dfile_sn_map.items():
        # data_only=True reads cached formula results instead of formula text
        wbook = load_workbook(data_file_path, data_only=True)

        # Only the first sheet_num sheets of each workbook are merged
        for name in wbook.sheetnames[:sheet_num]:
            table = wbook[name]
            data_col_coor_map = get_col_name_coordinate_map(table[1])
            # Columns present in both the data sheet and the template
            use_col = data_col_coor_map.keys() & tmp_col_coor_map.keys()
            for row in table.iter_rows(min_row=2, values_only=True):
                # '城市' is the "city" column; rows without a city are skipped
                rcol_index = data_col_coor_map['城市']
                city = row[rcol_index - 1]
                if city is None or len(str(city).strip()) == 0:
                    continue
                for col_name in use_col:
                    rcol_index = data_col_coor_map[col_name]
                    wcol_index = tmp_col_coor_map[col_name]
                    # Column numbers are 1-based, the row tuple is 0-based
                    wsheet.cell(line, wcol_index, row[rcol_index - 1])
                line += 1
    save_book.save(RESULT_FILE)


if __name__ == '__main__':
    deal_data_content()
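
The column-matching step in the script above can be tried in isolation, without any Excel files. The following is a minimal sketch using plain lists in place of sheet rows. The helper names (normalize, column_map, remap_row) and the sample headers other than 城市 are invented for this illustration and do not come from the original script.

```python
def normalize(name):
    """Drop a "(...)" suffix and surrounding whitespace; None for empty headers."""
    if not name:
        return None
    return str(name).split("(", 1)[0].strip()


def column_map(header):
    """Map each normalized header name to its 1-based column number."""
    return {normalize(name): i for i, name in enumerate(header, start=1)}


def remap_row(row, data_cols, template_cols):
    """Reorder one data row into template column order; unmatched columns stay None."""
    out = [None] * max(template_cols.values())
    for name in data_cols.keys() & template_cols.keys():
        out[template_cols[name] - 1] = row[data_cols[name] - 1]
    return out


# Hypothetical headers: the template order differs from the data sheet order
template_cols = column_map(["城市", "contract_no", "amount"])
data_cols = column_map(["amount(CNY)", "note", "城市 "])

print(remap_row([100, "n/a", "成都"], data_cols, template_cols))
# prints ['成都', None, 100]
```

Only columns whose normalized names appear in both headers are copied, which is why the "note" value is dropped and "contract_no" stays empty.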

2. The pandas method

  • Compared with using openpyxl directly, pandas is much more convenient: the concat method does the merging for you.

    pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)
    
  • Parameters

    parameter         meaning
    objs              list (or dict) of Series or DataFrame objects to concatenate
    axis              default 0: concatenate along rows
    join              'inner' or 'outer'; default 'outer'
    keys              list; builds a hierarchical index with the keys as the outermost level; use tuples for multiple levels
    levels            list; specific levels to use when building the MultiIndex
    names             list; names for the levels of the resulting hierarchical index
    copy              boolean, default True; if False, do not copy data unnecessarily
    join_axes         removed in pandas 1.0; call reindex on the result instead
    ignore_index      boolean, default False; if True, discard the original index and renumber the rows
    verify_integrity  boolean, default False; if True, check the new concatenated axis for duplicates
  • Example

    import pandas as pd

    # Merge every sheet of one workbook.
    # sheet_name=None reads all sheets into a dict of {sheet name: DataFrame}.
    data = pd.read_excel('C:\\Users\\Rose\\Desktop\\财务费用.xlsx', sheet_name=None)
    # Concatenate all sheets in one call instead of growing a DataFrame in a loop
    newdata = pd.concat(list(data.values()))
    newdata.to_excel('C:\\Users\\Rose\\Desktop\\财务合并数据.xlsx', index=False)
    

    Besides concat, older code often uses the append method. DataFrame.append was a special case of concat with axis=0 (which is also concat's default), but it was deprecated in pandas 1.4 and removed in pandas 2.0, so prefer concat in new code.

    Since pandas is already in use, data filtering, filling, and type conversion can be done along the way as well.
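
To make the concat parameters and the clean-up steps mentioned above more concrete, here is a small sketch. It uses two in-memory DataFrames as stand-ins for sheets read from a workbook; the column names (city, amount, note) and the sheet names are assumptions made up for this example.

```python
import pandas as pd

# Two small frames standing in for the dict returned by read_excel(sheet_name=None)
sheets = {
    "sheet1": pd.DataFrame({"city": ["A", "B"], "amount": ["1", "2"]}),
    "sheet2": pd.DataFrame({"city": ["C", None], "note": ["x", "y"]}),
}
frames = list(sheets.values())

# join="outer" (the default) keeps the union of columns; missing cells become NaN
outer = pd.concat(frames, ignore_index=True)
# join="inner" keeps only the columns that every frame shares
inner = pd.concat(frames, join="inner", ignore_index=True)
# keys adds an outer index level recording which sheet each row came from
keyed = pd.concat(frames, keys=list(sheets.keys()))

# Filtering, filling, and conversion on the merged result
cleaned = outer[outer["city"].notna()].copy()                  # drop rows without a city
cleaned["amount"] = cleaned["amount"].fillna("0").astype(int)  # fill gaps, then text -> int
cleaned["note"] = cleaned["note"].fillna("")                   # empty string for missing notes

print(list(outer.columns))  # prints ['city', 'amount', 'note']
print(list(inner.columns))  # prints ['city']
print(keyed.index.tolist())
print(cleaned)
```

Whether outer or inner is the better choice depends on whether columns missing from some sheets should survive as empty cells or be dropped from the merged result.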

Summary

This article covered two ways to merge data from multiple Excel sheets. Try them yourself, and if you have questions, feel free to discuss them in the comments.

Origin blog.csdn.net/u011397981/article/details/130069887