Introduction
I have to consolidate a large pile of Excel reports every day, so I wrote a script to do the work. The job is to find a specific sheet in each workbook, read specific columns from those sheets, and merge everything into a single sheet. Because the layout of each sheet differs, this is slightly fiddly. Below, the task is solved first with openpyxl and then with pandas.
The openpyxl way
Implementing the merge logic yourself with openpyxl takes a bit more work. One thing worth noting is that the source workbooks may contain formulas. When reading such a workbook, use:

```python
load_workbook(data_file_path, data_only=True)
```

With data_only=True you get the calculated value of each formula rather than the formula text itself. That matters here, because a formula string copied into another sheet may become invalid or simply wrong.
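One caveat worth demonstrating: openpyxl does not evaluate formulas itself. With data_only=True it returns the value cached by Excel at the last save, so a file that was produced by a script and never opened in a spreadsheet application has no cached values at all, and formula cells read back as None. A minimal sketch (the filename is made up):

```python
from openpyxl import Workbook, load_workbook

# Build a workbook containing a formula and save it,
# without any spreadsheet application ever calculating it.
wb = Workbook()
ws = wb.active
ws["A1"] = 2
ws["A2"] = 3
ws["A3"] = "=SUM(A1:A2)"
wb.save("formula_demo.xlsx")

# Default mode returns the formula text itself.
formula_view = load_workbook("formula_demo.xlsx")
print(formula_view.active["A3"].value)   # "=SUM(A1:A2)"

# data_only=True returns the cached result -- None here, because
# no application has ever calculated and re-saved this file.
value_view = load_workbook("formula_demo.xlsx", data_only=True)
print(value_view.active["A3"].value)     # None
```

So data_only=True is the right choice for files saved by Excel, but be aware of the None case for files generated purely by scripts.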
Here is a sample code for reference only:
"""
pip install openpyxl
"""
from openpyxl import load_workbook
from openpyxl import Workbook
import os
import re
# 模板文件
TEMPLATE_FILE = r'H:\合并\合并模板.xlsx'
# 合并结果文件
RESULT_FILE = r'H:\合并\结果.xlsx'
# 数据文件目录
DATA_ROOT_DIR = r"H:\合并"
# 文件名称正则
DATA_FILE_REG = r"(.*?)-合同导入台账\d{8}.xlsx"
# 获取要处理的文件
def get_deal_file_map():
file_sn_map = {
}
fs = os.listdir(DATA_ROOT_DIR)
for f in fs:
match = re.match(DATA_FILE_REG, f)
if match:
city = match.group(1)
sn = 2
if city == '成都':
sn = 4
elif city == '杭州':
sn = 3
file_sn_map[os.path.join(DATA_ROOT_DIR, f)] = sn
return file_sn_map
# 规范化列名
def get_normal_column_name(origin_col_name):
if origin_col_name:
start = origin_col_name.find("(")
if start == -1:
return origin_col_name.strip()
else:
return origin_col_name[0:start].strip()
# 获取列名与列坐标的映射
def get_col_name_coordinate_map(sheet_row):
name_coor_map = {
}
for cell in sheet_row:
# name_coor_map[get_normal_column_name(cell.value)] = cell.column_letter
name_coor_map[get_normal_column_name(cell.value)] = cell.column
return name_coor_map
# 获取模板文件的列名与列坐标映射
def get_template_name_coordinate_map(template_file_path):
template_wbook = load_workbook(template_file_path)
table = template_wbook[template_wbook.sheetnames[0]]
row = table[1:1]
return get_col_name_coordinate_map(row)
def deal_data_content():
"""
合并文件内容
"""
dfile_sn_map = get_deal_file_map()
save_book = Workbook()
wsheet = save_book.active
wsheet.title = 'merge-data'
tmp_col_coor_map = get_template_name_coordinate_map(TEMPLATE_FILE)
wsheet.append(list(tmp_col_coor_map.keys()))
line = 2
for data_file_path in dfile_sn_map.keys():
sheet_num = dfile_sn_map[data_file_path]
wbook = load_workbook(data_file_path, data_only=True)
names = wbook.sheetnames
for i in range(0, sheet_num):
table = wbook[names[i]]
row = table[1:1]
data_col_coor_map = get_col_name_coordinate_map(row)
use_col = data_col_coor_map.keys() & tmp_col_coor_map.keys()
for row in table.iter_rows(min_row=2, values_only=True):
rcol_index = data_col_coor_map['城市']
city = row[rcol_index - 1]
if (city is None) or len(city.strip()) == 0:
continue
for col_name in use_col:
rcol_index = data_col_coor_map[col_name]
wcol_index = tmp_col_coor_map[col_name]
wsheet.cell(line, wcol_index, row[rcol_index - 1])
line += 1
save_book.save(RESULT_FILE)
if __name__ == '__main__':
deal_data_content()
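The header-mapping trick used above (normalize each header cell, then map the name to its column index) can be exercised on a tiny in-memory workbook. The sheet contents and the normalize helper below are illustrative stand-ins for the script's own data and get_normal_column_name:

```python
from openpyxl import Workbook

def normalize(name):
    # Same idea as get_normal_column_name: drop any parenthetical suffix.
    if name:
        start = name.find("(")
        return name.strip() if start == -1 else name[:start].strip()

wb = Workbook()
ws = wb.active
ws.append(["city (code)", "name", "charge (CNY)"])
ws.append(["cd", "alice", 10])

# ws[1] is the header row; cell.column is the 1-based column index.
col_map = {normalize(c.value): c.column for c in ws[1]}
print(col_map)   # {'city': 1, 'name': 2, 'charge': 3}

# Look a value up by normalized column name, as the merge loop does.
row = next(ws.iter_rows(min_row=2, values_only=True))
print(row[col_map["charge"] - 1])   # 10
```

This is why the merge works even when the same logical column sits at a different position, or carries a different parenthetical suffix, in each source sheet.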
The pandas way
Compared with implementing the merge yourself in openpyxl, pandas is much more convenient: the whole merge is a single concat call.

```python
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False, copy=True)
```

(Note that join_axes was deprecated and has been removed in newer pandas versions.)
Parameter meanings

| parameter | meaning |
|---|---|
| objs | list of Series/DataFrame objects to concatenate |
| axis | default 0, i.e. concatenate row-wise |
| join | 'inner' or 'outer'; default 'outer' |
| keys | list used to build a hierarchical index at the outermost level; use tuples for a MultiIndex |
| levels | list used to build specific levels of the MultiIndex |
| names | list of names for the levels in the resulting hierarchical index |
| copy | boolean, default True; if False, do not copy data unnecessarily |
| join_axes | deprecated; use reindex on the result instead |
| ignore_index | boolean, default False; if True, discard the original index and renumber |
| verify_integrity | boolean, default False; check whether the new concatenated axis contains duplicates |
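To make a few of these parameters concrete, here is a toy sketch; the column names and keys below are made up for illustration:

```python
import pandas as pd

df_a = pd.DataFrame({"city": ["cd", "cd"], "charge": [10, 20]})
df_b = pd.DataFrame({"city": ["hz"], "amount": [30]})

# join='outer' (the default) keeps the union of columns, filling gaps with NaN.
outer = pd.concat([df_a, df_b], ignore_index=True, sort=False)
print(list(outer.columns))   # ['city', 'charge', 'amount']
print(len(outer))            # 3

# join='inner' keeps only the columns common to every input.
inner = pd.concat([df_a, df_b], join="inner", ignore_index=True)
print(list(inner.columns))   # ['city']

# keys= builds a hierarchical index tagging which input each row came from.
tagged = pd.concat([df_a, df_b], keys=["fileA", "fileB"])
print(tagged.loc["fileB", "city"].tolist())   # ['hz']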
Let's look directly at the example:
```python
# coding:utf-8
import pandas as pd

# Read the relevant sheet from each file
# (sheet_name accepts a zero-based index or a sheet name)
df1 = pd.read_excel(r'H:\merge\cd-contract-charge-1-20200807.xlsx', header=0, sheet_name=0)
df2 = pd.read_excel(r'H:\merge\cd-contract-charge-2-20200807.xlsx', header=0, sheet_name=1)
df3 = pd.read_excel(r'H:\merge\cd-contract-charge-3-20200807.xlsx', header=0, sheet_name=2)
df4 = pd.read_excel(r'H:\merge\hz-contract-charge-1-20200807.xlsx', header=0, sheet_name=0)
df5 = pd.read_excel(r'H:\merge\hz-contract-charge-2-20200807.xlsx', header=0, sheet_name=1)

# Concatenate row-wise
data = pd.concat([df1, df2, df3, df4, df5], sort=False, ignore_index=True)

# Keep only the columns we need (date, contract no., city, name, charge)
header = ['日期', '合同号', '城市', '姓名', 'charge']
data = data.loc[:, header]

# Write the result to the target excel file
data.to_excel(r'H:\merge\result.xlsx', index=False)
```
Most of the work is simply reading the Excel files; for more on reading and writing files with pandas, see: pandas reading and writing files.

Besides concat, you can also use the append method. append is essentially a special case of concat with axis=0, which is also concat's default. (Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so concat is the safer choice going forward.)

And since pandas is already loaded, you can do some data filtering, filling, and conversion along the way.
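As a small sketch of that last point, here is a made-up cleanup chain (the column names are illustrative, not from the script above):

```python
import pandas as pd

data = pd.DataFrame({
    "city": ["cd", None, "hz"],
    "charge": [10.0, 5.0, None],
})

# Drop rows with no city, fill missing charges with 0,
# and add a derived integer column -- all in one chain.
cleaned = (
    data.dropna(subset=["city"])
        .fillna({"charge": 0})
        .assign(charge_cents=lambda d: (d["charge"] * 100).astype(int))
)
print(cleaned["city"].tolist())          # ['cd', 'hz']
print(cleaned["charge_cents"].tolist())  # [1000, 0]
```

Doing this kind of cleanup right after the concat keeps the whole pipeline in one readable pass.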