Batch-merging Excel tables with Python

1. The problem

Several Excel tables have been collected and now need to be merged into one summary table. The tables are not uniform, though: many contain empty columns, and although the column names are the same, the column order differs from file to file, so simply stacking them does not work. A single Excel table looks like this:

[Figure: a single Excel table]

2. Solving the problem

At first I considered using openpyxl, but given the complexity of the problem, I decided to use os and pandas instead.

Step 1: Use os to list all Excel files in the current directory, and build a list of their file names with a list comprehension:

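import os
# Keep .xlsx workbooks; skip the temporary lock files Excel creates for open workbooks (named "~$...")
files = [file for file in os.listdir(".") if file.endswith(".xlsx") and not file.startswith("~$")]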
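
The file-listing step can also be wrapped in a small helper. This is only a sketch under stated assumptions: the function name `list_excel_files` and its `exclude` parameter are hypothetical and not part of the original script, and the default excluded name is the merged output file the article writes later, so that a second run does not merge its own result.

```python
from pathlib import Path


def list_excel_files(folder=".", exclude=("合并.xlsx",)):
    """Return the sorted .xlsx file names found directly inside `folder`.

    Hypothetical helper: skips sub-directories, Excel lock files ("~$...")
    and any file name listed in `exclude`.
    """
    names = []
    for path in Path(folder).iterdir():
        name = path.name
        if not path.is_file():
            continue  # ignore directories, even if they end in .xlsx
        if not name.lower().endswith(".xlsx"):
            continue  # not an Excel workbook
        if name.startswith("~$"):
            continue  # lock file Excel creates while a workbook is open
        if name in exclude:
            continue  # e.g. the merged output of a previous run
        names.append(name)
    return sorted(names)
```

Calling `list_excel_files(".")` would then take the place of the list comprehension above. Sorting the names is a design choice made here so that the files are merged in the same order on every run.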

Step 2: Import pandas, read the Excel tables one by one, and drop the empty columns:

import pandas as pd
lst = []
for file in files:
    df = pd.read_excel(file, index_col=None, header=0)
    df1 = df.dropna(how='all', axis=1, inplace=False)  # inplace=False returns a new DataFrame and leaves df unchanged
    lst.append(df1)
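
To see what `dropna(how='all', axis=1)` does, here is a minimal sketch on a made-up DataFrame. The column names and values are illustrative assumptions, not data from the original tables.

```python
import pandas as pd

# Made-up sample: "empty" has no values at all, "score" is only partly empty
df = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "empty": [None, None],
    "score": [90, None],
})

# how="all" drops a column only when every value in it is missing;
# axis=1 applies the check to columns rather than rows
cleaned = df.dropna(how="all", axis=1)

print(list(cleaned.columns))  # ['name', 'score']
```

The partly filled `score` column survives because only columns that are completely empty are dropped.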

Step 3: Use pandas concat to merge the DataFrames by column name, then write the result to an Excel file. The complete code is as follows:

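import os
import pandas as pd

lst = []
# Collect .xlsx workbooks, skipping Excel lock files ("~$...") and the merged output of a previous run
files = [file for file in os.listdir(".")
         if file.endswith(".xlsx") and not file.startswith("~$") and file != "合并.xlsx"]
for file in files:
    df = pd.read_excel(file, index_col=None, header=0)
    # Drop columns that are entirely empty; inplace=False returns a new DataFrame
    df1 = df.dropna(how='all', axis=1, inplace=False)
    lst.append(df1)
# axis=0 stacks the rows and aligns columns by name; ignore_index=True rebuilds the row index
save_data = pd.concat(lst, axis=0, ignore_index=True)
# Write the result without the index column, keeping the header row
save_data.to_excel("合并.xlsx", index=False, header=True)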
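
The reason concat copes with the differing column order is that it aligns columns by name, not by position. A minimal sketch with made-up DataFrames (the names and values are illustrative assumptions):

```python
import pandas as pd

# Same column names in a different order, plus one table with an extra column
a = pd.DataFrame({"name": ["Ann"], "score": [90]})
b = pd.DataFrame({"score": [75], "name": ["Bob"]})
c = pd.DataFrame({"name": ["Cid"], "city": ["Zhengzhou"]})

# axis=0 stacks rows; columns are matched by name and missing cells become NaN
merged = pd.concat([a, b, c], axis=0, ignore_index=True)

print(list(merged.columns))     # ['name', 'score', 'city']
print(merged["name"].tolist())  # ['Ann', 'Bob', 'Cid']
```

Cells that a table does not supply, such as the `city` of the first two rows, are filled with NaN and end up as blank cells in the written Excel file.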

3. The merged Excel table

The merged Excel table is shown below. The result is good: the original column headers and layout are kept, and the rows are neatly aligned.

[Figure: the merged Excel table]

4. Reflections

  1. I am used to openpyxl and have tended to avoid pandas, mainly because pandas is more complicated. It can solve complex problems, though, so it is worth studying properly when the opportunity comes.
  2. pandas is powerful. It is not easy to learn, but it helps solve many practical problems. Its main disadvantage is that a program built on it becomes large once packaged and takes longer to package: the ten or so lines of code above came to about 90 MB after packaging. The packaged program still runs reasonably fast, and the merged data is tidier and more consistently laid out, which makes later analysis and processing easier.
  3. Finally, I want to emphasize that learning a Python package works best when it is project-oriented and starts from a practical problem. With a basic understanding of pandas in place, trying things boldly, verifying them carefully, and practicing while learning brings the most benefit.

Origin blog.csdn.net/henanlion/article/details/130692020