Working with Excel Sheets in Python

Excel is a popular and powerful spreadsheet program for Windows. The openpyxl module enables Python programs to read and modify Excel spreadsheet files. For example, you might have a boring task where you need to copy some data from one spreadsheet and paste it into another. Or maybe you need to pick a few rows out of thousands and modify them slightly based on a certain condition. Or you might need to look through hundreds of departmental budget spreadsheets to find deficits. These boring tasks can all be done in Python.

Excel document

An Excel spreadsheet file is called a workbook. A workbook is saved in a file with the extension .xlsx. Each workbook can contain multiple sheets (also called worksheets). The sheet the user is currently viewing (or the sheet last viewed before closing Excel) is called the active sheet. Each sheet has columns (addressed by letters starting with A) and rows (addressed by numbers starting with 1). The box at a particular column and row is called a cell. Each cell can contain a number or a text value. The grid of cells and the data in them make up a sheet.

Install the openpyxl module

Python does not come with openpyxl, so you must install it yourself. Open a command line and run the following command to install it:

pip install openpyxl

Read Excel documents

We'll use a spreadsheet called example.xlsx, which can be downloaded at Automate the Boring Stuff with Python. As shown in the figure below, there are 3 default sheets named Sheet1, Sheet2 and Sheet3, which Excel provides automatically for new workbooks (different operating systems and spreadsheet programs may provide different default sheets).

Open Excel document with openpyxl module

After importing the openpyxl module, you can use the openpyxl.load_workbook() function to open Excel documents.

The openpyxl.load_workbook() function accepts a filename and returns a Workbook object. The Workbook object represents the entire Excel file (example.xlsx), similar to the way a File object represents an open text file.
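As a minimal sketch of this call, the example below first writes an empty stand-in workbook to a temporary folder so that it runs without the real example.xlsx. The temporary path and the empty contents are assumptions made for illustration only.

```python
import os
import tempfile

import openpyxl

# Create an empty placeholder workbook on disk so there is a file to open.
# (This stands in for the real example.xlsx, which is not bundled here.)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'example.xlsx')
openpyxl.Workbook().save(path)

# load_workbook() takes a filename and returns a Workbook object.
wb = openpyxl.load_workbook(path)
print(type(wb).__name__)
```

With the real file in the current working directory, the call would simply be openpyxl.load_workbook('example.xlsx').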

Get worksheets from a workbook

A list of all sheet names in the workbook can be obtained through the sheetnames property of the workbook object.

Each sheet is represented by a Worksheet object, which can be obtained by indexing into the workbook (using the sheet name string). The active sheet of the workbook can be obtained through the active property of the Workbook object. After getting the Worksheet object, you can get its name through the title attribute.
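The following sketch illustrates these attributes. It builds a three-sheet workbook in memory as a stand-in for example.xlsx, so the sheet names are set by hand here rather than read from the real file.

```python
import openpyxl

# In-memory stand-in for example.xlsx with three sheets.
wb = openpyxl.Workbook()        # a new workbook starts with one sheet
wb.active.title = 'Sheet1'
wb.create_sheet('Sheet2')
wb.create_sheet('Sheet3')

# sheetnames lists every sheet name in the workbook.
print(wb.sheetnames)            # ['Sheet1', 'Sheet2', 'Sheet3']

# Index the workbook with a sheet name to get a Worksheet object.
sheet = wb['Sheet3']
print(sheet.title)              # Sheet3

# The active attribute returns the active Worksheet.
active_sheet = wb.active
print(active_sheet.title)       # Sheet1
```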

Get cells from a sheet

After getting the Worksheet object, you can access the Cell object by name.

The Cell object has a value property that contains the value stored in this cell. The Cell object also has row, column, and coordinate properties that provide information about the location of the cell.
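A minimal sketch of these attributes follows. It uses a blank in-memory workbook with 'Apples' written to B1 as a stand-in for example.xlsx, and it assumes a recent openpyxl release, where the column attribute is an integer.

```python
import openpyxl

# Stand-in sheet: only cell B1 is filled in.
wb = openpyxl.Workbook()
sheet = wb.active
sheet['B1'] = 'Apples'

# Access a Cell object by its name, then inspect it.
cell = sheet['B1']
print(cell.value)       # Apples
print(cell.row)         # 1
print(cell.column)      # 2
print(cell.coordinate)  # B1
```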

As shown above, access the value property of the Cell object of cell B1 to get the string 'Apples'. The row attribute gives the integer 1, the column attribute gives 2, and the coordinate attribute gives 'B1'.

You can also get a cell by calling the sheet's cell() method and passing integers for its row and column keyword arguments.
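Here is a short sketch of cell() with integer row and column arguments, which is convenient inside a loop. The three rows of fruit data are placeholders invented for this example.

```python
import openpyxl

wb = openpyxl.Workbook()
sheet = wb.active
# Placeholder data: each tuple becomes one row, starting at row 1.
for record in [('Apples', 73), ('Cherries', 85), ('Pears', 14)]:
    sheet.append(record)

# Row 1, column 2 is the same cell as 'B1'.
cell = sheet.cell(row=1, column=2)
print(cell.value)       # 73

# Integer indexes make it easy to walk down a column.
first_column = []
for i in range(1, sheet.max_row + 1):
    first_column.append(sheet.cell(row=i, column=1).value)
print(first_column)     # ['Apples', 'Cherries', 'Pears']
```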

The size of the sheet can be determined from the max_column and max_row attributes of the Worksheet object.

sheet.max_column, sheet.max_row
# (3, 7)

Convert between column letters and numbers

To convert from letters to numbers, call the openpyxl.utils.column_index_from_string() function. To convert from numbers to letters, call the openpyxl.utils.get_column_letter() function.

After importing these two functions from the openpyxl.utils module, you can call get_column_letter(), passing in an integer like 27, to find out the letter name of column 27. The column_index_from_string() function does the opposite: you pass in the letter name of a column, and it returns the number of that column.
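A brief sketch of both conversions:

```python
from openpyxl.utils import column_index_from_string, get_column_letter

# Number to letter.
print(get_column_letter(1))            # A
print(get_column_letter(27))           # AA

# Letter to number.
print(column_index_from_string('A'))   # 1
print(column_index_from_string('AA'))  # 27
```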

Get rows and columns from a table

You can slice a Worksheet object to get all Cell objects in a row, column, or rectangular area in the spreadsheet. You can then loop through all the cells in this slice.
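As a sketch, the example below fills a 3 by 3 area with placeholder values (invented for illustration) and then slices the sheet from A1 to C3.

```python
import openpyxl

wb = openpyxl.Workbook()
sheet = wb.active
# Placeholder data for the 3 x 3 area A1:C3.
for record in [(1, 2, 3), (4, 5, 6), (7, 8, 9)]:
    sheet.append(record)

# Slicing the sheet returns a tuple of rows; each row is a tuple of cells.
area = sheet['A1':'C3']
for row_of_cells in area:
    for cell in row_of_cells:
        print(cell.coordinate, cell.value)
    print('--- END OF ROW ---')
```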

This tuple contains 3 tuples: each tuple represents 1 row, from the top to the bottom of the specified area. Each of these 3 inner tuples contains a row of Cell objects in the specified range, from the leftmost cell to the rightmost. This slice of the worksheet contains all the Cell objects from A1 to C3, from the upper left cell to the lower right cell.

You can get a row or column in the table as follows
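One way to do this, sketched here with invented placeholder data, is to index the sheet with a column letter or a row number.

```python
import openpyxl

wb = openpyxl.Workbook()
sheet = wb.active
# Placeholder data: three rows, two columns.
for record in [('Apples', 73), ('Cherries', 85), ('Pears', 14)]:
    sheet.append(record)

# A column letter returns a tuple of the cells in that column.
column_b = sheet['B']
print([cell.value for cell in column_b])   # [73, 85, 14]

# A row number returns a tuple of the cells in that row.
row_2 = sheet[2]
print([cell.value for cell in row_2])      # ['Cherries', 85]
```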

Write to Excel document

openpyxl also provides some methods to write data. This means that your program can create and edit spreadsheet files. Using Python, creating a spreadsheet with thousands of rows of data is very simple.

Create and save Excel documents

Call the openpyxl.Workbook() function to create a new, empty Workbook object.
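A minimal sketch of creating a workbook and renaming its single default sheet follows. The new title 'Budget' is an arbitrary name chosen for this example.

```python
import openpyxl

# A new workbook starts with a single sheet named 'Sheet'.
wb = openpyxl.Workbook()
print(wb.sheetnames)        # ['Sheet']

# Rename the sheet by assigning to its title attribute.
sheet = wb.active
sheet.title = 'Budget'
print(wb.sheetnames)        # ['Budget']
```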

When the Workbook object or its sheets and cells are modified, the spreadsheet file is not saved until the workbook's save() method is called.

wb.save('demo.xlsx')

Create and delete worksheets

Using the create_sheet() and remove() methods, you can add or delete worksheets in a workbook.

The create_sheet() method returns a new Worksheet object, named SheetX by default, and adds it as the last worksheet of the workbook. Alternatively, you can use the index and title keyword arguments to specify the position and name of the new worksheet.

The remove() method accepts a Worksheet object as its argument, not a string of the sheet name.
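The sketch below walks through both methods on a fresh workbook. The sheet titles are arbitrary names picked for illustration.

```python
import openpyxl

wb = openpyxl.Workbook()
print(wb.sheetnames)    # ['Sheet']

# With no arguments, the new sheet gets a default name and goes last.
wb.create_sheet()
print(wb.sheetnames)    # ['Sheet', 'Sheet1']

# index and title control where the sheet goes and what it is called.
wb.create_sheet(index=0, title='First Sheet')
wb.create_sheet(index=2, title='Middle Sheet')
print(wb.sheetnames)    # ['First Sheet', 'Sheet', 'Middle Sheet', 'Sheet1']

# remove() takes the Worksheet object itself, so look it up by name first.
wb.remove(wb['Middle Sheet'])
print(wb.sheetnames)    # ['First Sheet', 'Sheet', 'Sheet1']
```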

Write values to cells

Writing a value to a cell is much like writing a value to a key in a dictionary.
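As a sketch, the example below assigns two placeholder values, saves the workbook, and loads it again to confirm the values were stored. It writes demo.xlsx into a temporary folder, which is an assumption made so the example does not touch the current directory.

```python
import os
import tempfile

import openpyxl

wb = openpyxl.Workbook()
sheet = wb.active

# Assign to a cell name, much like assigning to a dictionary key.
sheet['A1'] = 'Hello world!'
sheet['B2'] = 42
print(sheet['A1'].value)    # Hello world!

# Nothing is written to disk until save() is called.
path = os.path.join(tempfile.mkdtemp(), 'demo.xlsx')
wb.save(path)

# Reload the file to check that the values were stored.
saved_sheet = openpyxl.load_workbook(path).active
print(saved_sheet['B2'].value)    # 42
```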

If you open the demo.xlsx file, you can see that the values have been written to it.

Project: Read data from spreadsheet

Suppose you have a spreadsheet with data from the 2010 U.S. Census. You have the rather boring task of going through thousands of rows to calculate the total population and the number of census tracts in each county (a census tract is a geographic area defined for the purpose of a census). Each row in the sheet represents a census tract. The file is called censuspopdata.xlsx and can be downloaded at Automate the Boring Stuff with Python.

There is only one sheet in the censuspopdata.xlsx spreadsheet, called "Population by Census Tract". Each row holds data for one census tract. The columns are the census tract number (A), the state abbreviation (B), the county name (C), and the population of the census tract (D).

In this project, we need to write a script that reads data from a census spreadsheet and does the following:

Read data from Excel sheet
Calculate the number of census tracts in each county
Calculate the total population of each county
Print the results

This means that the code needs to do the following tasks:

Open Excel document with openpyxl module and read cells
Calculate all census tract and population data, save them into one data structure
Using the pprint module, write the data structure to a text file with a .py extension

The complete code looks like this:

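import pprint

import openpyxl

# Open the Excel document with the openpyxl module
wb = openpyxl.load_workbook('censuspopdata.xlsx')
sheet = wb['Population by Census Tract']
# Holds the number of census tracts and the population of each county
countyData = {}

# Start at row 2, skipping the first row
for row in range(2, sheet.max_row + 1):
    state = sheet['B' + str(row)].value
    county = sheet['C' + str(row)].value
    pop = sheet['D' + str(row)].value

    # Make sure the key for this state exists
    countyData.setdefault(state, {})
    # Make sure the key for this county in this state exists
    countyData[state].setdefault(county, {'tracts': 0, 'pop': 0})

    # Each row represents one census tract, so add one
    countyData[state][county]['tracts'] += 1
    # Add this tract's population to the county total
    countyData[state][county]['pop'] += int(pop)

# Save the data to a file
with open('census2010.py', 'w') as resultFile:
    resultFile.write('allData = ' + pprint.pformat(countyData))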

The data saved to the census2010.py file looks like this:
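The real file is too long to reproduce here. To give a rough idea of its shape, the sketch below applies the same tallying logic to three invented rows. The state code, county names, and populations are made-up placeholders, not census figures.

```python
import pprint

# Invented (state, county, population) rows standing in for spreadsheet rows.
rows = [
    ('XX', 'Alpha', 100),
    ('XX', 'Alpha', 250),
    ('XX', 'Beta', 75),
]

countyData = {}
for state, county, pop in rows:
    countyData.setdefault(state, {})
    countyData[state].setdefault(county, {'tracts': 0, 'pop': 0})
    countyData[state][county]['tracts'] += 1
    countyData[state][county]['pop'] += int(pop)

# The text written to the file is a single assignment statement.
text = 'allData = ' + pprint.pformat(countyData)
print(text)
```

The result is a dictionary keyed by state, then by county, where each county holds its tract count and total population. Because the text is valid Python, the saved file can later be imported like any other module.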

Reference

Automate the Boring Stuff with Python
