[Introduction to Artificial Intelligence] Reading and writing Excel with Pandas

Pandas reads Excel files with its read_excel method.

  • When only one sheet is read, a DataFrame is returned;
  • When several sheets are read at once, a dictionary with the structure {name: DataFrame} is returned.
  • Note 1: DataFrame is a tabular data type that preserves the row-and-column structure of the data.
  • Note 2: Reading several sheets at once makes subsequent operations less convenient, so reading one sheet at a time is recommended.

1. Use read_excel to read Excel files

1. Basic parameters of read_excel

1.1 io

  • Specifies the Excel file to read; it can be a path or a file object.
  • pandas may sometimes fail to parse a file whose path contains Chinese characters; passing a file object instead solves this.
import pandas as pd

file = 'xxxx.xlsx'
f = open(file, 'rb')
df = pd.read_excel(f, sheet_name='Sheet1')
f.close()  # Remember to close the file manually.
# ------------- with statement -------------------
with open(file, 'rb') as f:
    df = pd.read_excel(f, sheet_name='Sheet1')  # the file is closed automatically on exit

1.2 sheet_name=0:

  • Specifies the worksheet(s) to read; it can be a str, int, list, or None. The default is 0.
  • str: the worksheet name, for example sheet_name='Test1';
  • int: the worksheet position, counted from 0, for example sheet_name=0;
  • list: several worksheets, for example sheet_name=[0, 2, 'Test3'];
  • None: all worksheets.
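
The effect of each sheet_name type on the return value can be sketched as follows. The file name demo.xlsx, the sheet names Test1 and Test2, and the sample data are assumptions made for illustration, and the openpyxl engine is assumed to be installed.

```python
import os
import tempfile

import pandas as pd

# Build a small two-sheet workbook to read back (hypothetical sample data).
path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
with pd.ExcelWriter(path) as writer:
    pd.DataFrame({"id": [1, 2]}).to_excel(writer, sheet_name="Test1", index=False)
    pd.DataFrame({"id": [3, 4]}).to_excel(writer, sheet_name="Test2", index=False)

by_name = pd.read_excel(path, sheet_name="Test1")       # str: a single DataFrame
by_position = pd.read_excel(path, sheet_name=1)         # int: the second sheet, a single DataFrame
several = pd.read_excel(path, sheet_name=[0, "Test2"])  # list: dict keyed by the items passed in
everything = pd.read_excel(path, sheet_name=None)       # None: dict of all sheets, keyed by sheet name

print(type(by_name).__name__)
print(list(several.keys()))
print(list(everything.keys()))
```

Note that with a list, the dictionary keys are exactly the items passed in, so a position stays an int key and a name stays a str key.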

1.3 header=0:

  • header specifies which row is used as the header row, that is, as the column names of the data. By default, the first row (index 0) is used.
  • If the table header spans more than one row, pass a list to specify several header rows, such as header=[0,1].
  • If the data has no header row, use header=None.

Note: once the header row is specified, read_excel starts reading data from the row after the last header row.

1.4 names=None

  • Mostly used when the file has no header row. Pass a list of column names through names. In that case header=None must be given explicitly; otherwise the first row of data is consumed as the header and lost.
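
A minimal sketch of the lost-row problem described above. The file name no_header.xlsx, the column names, and the sample rows are assumptions for illustration; openpyxl is assumed to be installed.

```python
import os
import tempfile

import pandas as pd

# Write a sheet that has no header row (hypothetical sample data).
path = os.path.join(tempfile.mkdtemp(), "no_header.xlsx")
pd.DataFrame([[1, "Tom"], [2, "Ann"]]).to_excel(path, header=False, index=False)

# Without header=None the first data row is consumed as the header row,
# then replaced by names, so one row of data disappears.
wrong = pd.read_excel(path, names=["id", "name"])

# With header=None every row is treated as data.
right = pd.read_excel(path, header=None, names=["id", "name"])

print(len(wrong), len(right))
```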

1.5 index_col=None

  • Specifies which column(s) to use as the row labels (index) of the DataFrame. Pass an int or a list of ints. If a list is passed, the columns are combined into a MultiIndex.

In practice, it is more often used to get rid of an unwanted column.

1.6 skiprows=0

  • Skips rows at the start of the file. Pass an int (the number of rows to skip from the beginning) or a list of ints (the 0-indexed numbers of the rows to skip).
  • When an int n is passed, the first n rows are skipped and reading starts after them.
  • When a list of ints is passed, the rows with those numbers (counted from 0) are skipped. For example, skiprows=[0,1,3,5] skips rows 1, 2, 4, and 6 of the file.

1.7 skipfooter=0

  • Specifies the number of rows to skip at the end of the sheet, counted from the bottom up, so the last few rows are not read. Pass an int; the default is 0.
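
The following sketch shows skiprows and skipfooter on a sheet with a title row above the header and a totals row at the bottom. The layout, the file name report.xlsx, and the values are invented for illustration; openpyxl is assumed to be installed.

```python
import os
import tempfile

import pandas as pd

# File rows (0-indexed): 0 title, 1 header, 2-4 data, 5 totals.
rows = [
    ["Report title", None],
    ["id", "score"],
    [1, 90],
    [2, 80],
    [3, 70],
    ["total", 240],
]
path = os.path.join(tempfile.mkdtemp(), "report.xlsx")
pd.DataFrame(rows).to_excel(path, header=False, index=False)

# int: skip the first row, so the next row becomes the header; drop the last row.
trimmed = pd.read_excel(path, skiprows=1, skipfooter=1)

# list: skip file rows 0 and 3 specifically (row 3 holds id 2).
picked = pd.read_excel(path, skiprows=[0, 3], skipfooter=1)

print(trimmed["id"].tolist())
print(picked["id"].tolist())
```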

1.8 dtype=None

  • Specifies the data type of individual columns. Pass a dictionary that maps column names to types, for example {'A': np.int64, 'B': str}.
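
As a small illustration of dtype, the sketch below reads the same sheet with and without a type mapping. The file name typed.xlsx and the columns id and score are assumptions; openpyxl is assumed to be installed.

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "typed.xlsx")
pd.DataFrame({"id": [1, 2], "score": [90, 85]}).to_excel(path, index=False)

# Default: pandas infers integer columns.
inferred = pd.read_excel(path)

# Explicit: read id as text and score as float.
typed = pd.read_excel(path, dtype={"id": str, "score": float})

print(inferred.dtypes)
print(typed.dtypes)
```

Reading identifier-like columns as str is a common choice because it keeps them from being treated as numbers in later steps.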

1.9 Other parameters

  • usecols=None: the columns to parse; by default all columns are parsed.

  • squeeze=False: Boolean, default False. When set to True and the parsed data has only one column, a Series is returned. Note that squeeze was deprecated in pandas 1.4 and removed in pandas 2.0.

  • nrows=None: int, default None. Parse only the specified number of rows.
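
A short sketch of usecols and nrows, with DataFrame.squeeze("columns") shown as one way to get a Series on pandas versions where the squeeze parameter is no longer available. The file name people.xlsx and the sample data are assumptions; openpyxl is assumed to be installed.

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "people.xlsx")
pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Tom", "Ann", "Bob", "Eve"],
    "score": [90, 80, 70, 60],
}).to_excel(path, index=False)

# Keep two columns by name and only the first two data rows.
subset = pd.read_excel(path, usecols=["id", "score"], nrows=2)

# usecols also accepts Excel column letters.
by_letter = pd.read_excel(path, usecols="A,C")

# A single-column result converted to a Series after reading.
scores = pd.read_excel(path, usecols=["score"]).squeeze("columns")

print(subset)
print(type(scores).__name__)
```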

2. Obtaining numerical values from data - values

read_excel returns a DataFrame. To work with the underlying data as a NumPy array, for example in numerical calculations, use the values attribute.

  • df.values returns all the data as a two-dimensional ndarray;
  • df.index.values returns the row index as a one-dimensional ndarray;
  • df.columns.values returns the column index as a one-dimensional ndarray (when the data has a header, these are the header labels).
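
The three attributes above can be tried on a small in-memory DataFrame. The column names and numbers are made up for illustration. The pandas documentation also recommends DataFrame.to_numpy() as an equivalent of values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "score": [90.0, 80.0, 70.0]})

data = df.values              # two-dimensional ndarray of all cells
row_labels = df.index.values  # one-dimensional ndarray of row labels
col_labels = df.columns.values  # one-dimensional ndarray of column labels

print(data.shape)
print(row_labels.tolist())
print(col_labels.tolist())
print(np.mean(df["score"].values))
```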

3. Data slice access - loc, iloc

  • Data used: the original image is not available; the code below assumes a table whose columns include id, name, and scalar.
  1. The loc method selects values by row and column labels (names or index labels); slices include both end points (closed interval).
  • It is very similar to ordinary slicing, except that it is addressed by index labels.
data.loc[1] # Select the row labeled 1 (the second row with the default index); returns a pandas.core.series.Series
data.loc[1].values # Same row; returns a numpy.ndarray
data.loc[1,'id':'name'] # In that row, select the columns from label 'id' through 'name'; returns a pandas.core.series.Series
data.loc[1,'id':'name'].values # Same selection; returns a numpy.ndarray
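  • You can also filter by a condition
# Keep the rows whose 'scalar' value is greater than 60
data5 = data.loc[ data.scalar > 60]
# A column selection can be combined with the filter: keep the id, name, and scalar columns of the rows whose scalar is greater than 60
data1 = data.loc[ data.scalar > 60, ["id","name","scalar"]]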
  2. The iloc method selects values by the integer positions [row, column]; slices include the start and exclude the end (half-open interval).
  • It behaves almost exactly like ordinary slicing, but it cannot be addressed by index labels.
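
A few iloc calls on a stand-in table. The DataFrame below is hypothetical; it only borrows the column names id, name, and scalar from the loc examples.

```python
import pandas as pd

# Hypothetical stand-in for the example table.
data = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "name": ["Tom", "Ann", "Bob", "Eve"],
    "scalar": [55, 72, 90, 61],
})

second_row = data.iloc[1]               # second row, as a Series
id_and_name = data.iloc[1, 0:2]         # columns at positions 0 and 1 (position 2 is excluded)
first_two = data.iloc[0:2]              # rows at positions 0 and 1
corners = data.iloc[[0, 3], [1, 2]]     # rows 0 and 3, columns 1 and 2

print(id_and_name.tolist())
print(corners)
```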

Summary: The difference between the two methods is that loc treats its arguments as labels, while iloc treats them as integer positions.

  • With a header, the column labels are strings: select columns by label with loc and by integer position with iloc;
  • Without a header, the column labels are the default integers 0, 1, 2, ..., which coincide with the positions, so loc and iloc accept the same integers; in slicing, loc uses a closed interval and iloc a half-open interval.
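
The label-versus-position distinction becomes visible once the row labels no longer match the positions, for example after filtering. The sketch below uses a hypothetical table with the columns id, name, and scalar.

```python
import pandas as pd

data = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "name": ["Tom", "Ann", "Bob", "Eve"],
    "scalar": [55, 72, 90, 61],
})

# Filtering keeps the original row labels 1, 2, 3.
passed = data.loc[data.scalar > 60]
by_label = passed.loc[1]      # the row labeled 1
by_position = passed.iloc[1]  # the second remaining row

# Slicing: loc includes the end label, iloc excludes the end position.
closed = data.loc[0:2]
half_open = data.iloc[0:2]

# Without a header the column labels are 0, 1, ..., so both accept the same integers.
no_header = pd.DataFrame([[1, "Tom"], [2, "Ann"]])
same = no_header.loc[:, 0].equals(no_header.iloc[:, 0])

print(by_label["name"], by_position["name"])
print(len(closed), len(half_open))
print(same)
```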

2. Write data to an Excel file

There are two ways to write: the to_excel method or the ExcelWriter class.

1. to_excel method

  • It writes data into one sheet. Make sure the Excel file is closed (not open in another program) while writing.
import pandas

# Specify the data with a dictionary
data = {
    '名字': ['张三', '李四'],
    '分数': [100, 100]}
# Convert the data to a DataFrame so it can be written to Excel
df = pandas.DataFrame(data)
# Write the data to the given file
df.to_excel('1.xlsx', sheet_name='Sheet1', index=False)  # index=False: do not write the index

2. ExcelWriter()

  • It can write several tables to different sheets of the same Excel file in one go.
  1. The following code writes two sheets, Sheet1 and Sheet2, to 1.xlsx.
df1 = pandas.DataFrame({
    '名字': ['张三', '王四'], '分数': [100, 100]})
df2 = pandas.DataFrame({
    '年龄': ['18', '19'], '性别': ['男', '女']})

with pandas.ExcelWriter('1.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1', index=False)
    df2.to_excel(writer, sheet_name='Sheet2', index=False)
  2. Add a sheet
  • A new sheet can be added through the mode parameter of ExcelWriter.
  • mode defaults to w. Changing it to a lets you add a sheet to an Excel file that already contains sheets.
df3 = pandas.DataFrame({
    '新增表': ['1', '2']})
with pandas.ExcelWriter('1.xlsx', mode='a') as writer:
    df3.to_excel(writer, sheet_name='Sheet3', index=False)
  3. Overwrite an existing sheet in Excel
  • To rewrite one sheet in an existing file, you cannot simply write a sheet with the same name: in the default w mode the whole file is overwritten and the other sheets are lost. Use the following method instead.
df4 = pandas.DataFrame({
    'test': ['2017002038', '2017003024']})
with pandas.ExcelWriter('1.xlsx', mode='a', engine='openpyxl') as writer:
    wb = writer.book  # openpyxl.workbook.workbook.Workbook holding all sheets
    wb.remove(wb['Sheet3'])  # Remove the sheet to be overwritten
    df4.to_excel(writer, sheet_name='Sheet3', index=False)  # Sheet3 now holds the contents of df4
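
On newer pandas versions, ExcelWriter also accepts an if_sheet_exists argument in append mode, which may be a simpler way to replace one sheet. The sketch below is not the method used above; it assumes pandas 1.3 or later with openpyxl installed, and the file name book.xlsx and the sample data are invented for illustration.

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "book.xlsx")

# Start with a workbook that has two sheets (hypothetical data).
with pd.ExcelWriter(path) as writer:
    pd.DataFrame({"a": [1]}).to_excel(writer, sheet_name="Sheet1", index=False)
    pd.DataFrame({"old": [1, 2]}).to_excel(writer, sheet_name="Sheet3", index=False)

# Replace only Sheet3; the other sheets are kept.
df4 = pd.DataFrame({"test": ["2017002038", "2017003024"]})
with pd.ExcelWriter(path, mode="a", engine="openpyxl",
                    if_sheet_exists="replace") as writer:
    df4.to_excel(writer, sheet_name="Sheet3", index=False)

sheets = pd.read_excel(path, sheet_name=None, dtype=str)
print(list(sheets.keys()))
print(sheets["Sheet3"])
```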
  4. Append data to an existing sheet
  • To append rows to a sheet, read the original data first, combine it with the new data, and write the result back.
import pandas as pd
# Data to append
df5 = {
    '名字': ['李五', '赵六', '孙七'],
    '分数': [90, 60, 68]}
df5 = pd.DataFrame(df5)

# Read the data already stored in the sheet
original_data = pd.read_excel('1.xlsx', sheet_name='Sheet1')
# Combine the old data with the new data
save_data = pd.concat([original_data, df5], axis=0)

# Overwrite the original sheet
with pd.ExcelWriter('1.xlsx', mode='a', engine='openpyxl') as writer:
    wb = writer.book  # openpyxl.workbook.workbook.Workbook holding all sheets
    wb.remove(wb['Sheet1'])  # Remove the sheet to be overwritten
    save_data.to_excel(writer, sheet_name='Sheet1', index=False)  # Sheet1 now holds the contents of save_data

Origin blog.csdn.net/qq_44928822/article/details/129002504