While we don't often think of spreadsheets as programming tools, nearly everyone uses them to organize information into two-dimensional data structures, perform calculations with formulas, and produce output in the form of graphs. In the next two chapters, we'll integrate Python into two popular spreadsheet applications: Microsoft Excel and Google Sheets.
Excel is a popular and powerful spreadsheet application for Windows. The openpyxl module allows your Python programs to read and modify Excel spreadsheet files. For example, you might have the tedious task of copying certain data from one spreadsheet and pasting it into another. Or you might have to go through thousands of rows and pick out just a handful of them to make small edits based on some criteria. Or you might have to look through hundreds of spreadsheets of departmental budgets, searching for any that show a deficit. These are exactly the kinds of boring, mindless spreadsheet tasks that Python can do for you.
Although Excel is proprietary software from Microsoft, there are free alternatives that run on Windows, macOS, and Linux. Both LibreOffice Calc and OpenOffice Calc work with Excel's xlsx spreadsheet file format, which means the openpyxl module can also handle spreadsheets from these applications. You can download the software from www.libreoffice.org and www.openoffice.org, respectively. Even if you already have Excel installed on your computer, you may find these programs easier to use. However, the screenshots in this chapter are all from Excel 2010 on Windows 10.
Excel document
First, let's review some basic definitions. An Excel spreadsheet document is called a workbook, and a single workbook is saved in a single file with the .xlsx extension. Each workbook can contain multiple sheets (also called worksheets). The sheet the user is currently viewing (or last viewed before closing Excel) is called the active sheet.
Each sheet has columns (addressed by letters starting at A) and rows (addressed by numbers starting at 1). A box at a particular column and row is called a cell. Each cell can contain a number or text value. The grid of cells with data makes up a sheet.
Install the openpyxl module
Python doesn't come with OpenPyXL, so you'll have to install it. Follow the instructions for installing third-party modules in Appendix A; the name of the module is openpyxl.
This book uses version 2.6.2 of OpenPyXL. It's important to install this version by running pip install --user -U openpyxl==2.6.2, because newer versions of OpenPyXL are not compatible with the information in this book. To test that the installation is correct, enter the following in the interactive shell:
>>> import openpyxl
If the module is installed correctly, this should produce no error messages. Remember to import the openpyxl module before running the interactive shell examples in this chapter, or you'll get a NameError: name 'openpyxl' is not defined error.
You can find the full documentation for OpenPyXL at openpyxl.readthedocs.org.
Read Excel documents
The examples in this chapter will use a spreadsheet named example.xlsx. You can either create the spreadsheet yourself or download it from nostarch.com/automatestuff2. Figure 13-1 shows the tabs for the three default sheets, named Sheet1, Sheet2, and Sheet3, that Excel automatically provides for new workbooks. (The number of default sheets created may vary by operating system and spreadsheet program.)
Figure 13-1: The workbook's sheet tabs are in the lower left corner of Excel.
Sheet 1 in the example file should look like Table 13-1. (If you didn't download example.xlsx from the website, you should enter this data into the sheet yourself.)
Table 13-1: The example.xlsx Spreadsheet
Row | A | B | C |
---|---|---|---|
1 | 4/5/2015 1:34:02 PM | Apples | 73 |
2 | 4/5/2015 3:41:23 AM | Cherries | 85 |
3 | 4/6/2015 12:46:51 PM | Pears | 14 |
4 | 4/8/2015 8:59:43 AM | Oranges | 52 |
5 | 4/10/2015 2:07:00 AM | Apples | 152 |
6 | 4/10/2015 6:10:37 PM | Bananas | 23 |
7 | 4/10/2015 2:40:46 AM | Strawberries | 98 |
Now that we have our example spreadsheet, let's see how we can manipulate it with the openpyxl module.
Open Excel documents with OpenPyXL
Once you've imported the openpyxl module, you'll be able to use the openpyxl.load_workbook() function. Enter the following in the interactive shell:
>>> import openpyxl
>>> wb = openpyxl.load_workbook('example.xlsx')
>>> type(wb)
<class 'openpyxl.workbook.workbook.Workbook'>
The openpyxl.load_workbook() function takes a filename and returns a value of the Workbook data type. This Workbook object represents the Excel file, a bit like how a File object represents an opened text file.
Remember that example.xlsx needs to be in the current working directory in order for you to work with it. You can find out what the current working directory is by importing os and calling os.getcwd(), and you can change the current working directory with os.chdir().
Get sheet from workbook
By accessing the sheetnames attribute, you can get a list of all the sheet names in the workbook. Enter the following in the interactive shell:
>>> import openpyxl
>>> wb = openpyxl.load_workbook('example.xlsx')
>>> wb.sheetnames # The workbook's sheets' names.
['Sheet1', 'Sheet2', 'Sheet3']
>>> sheet = wb['Sheet3'] # Get a sheet from the workbook.
>>> sheet
<Worksheet "Sheet3">
>>> type(sheet)
<class 'openpyxl.worksheet.worksheet.Worksheet'>
>>> sheet.title # Get the sheet's title as a string.
'Sheet3'
>>> anotherSheet = wb.active # Get the active sheet.
>>> anotherSheet
<Worksheet "Sheet1">
Each sheet is represented by a Worksheet object, which you can obtain by using square brackets with the sheet name string, like a dictionary key. Finally, you can use the active attribute of a Workbook object to get the workbook's active sheet. The active sheet is the sheet that's on top when the workbook is opened in Excel. Once you have the Worksheet object, you can get its name from the title attribute.
Get cell from worksheet
Once you have a Worksheet object, you can access a Cell object by its name. Enter the following in the interactive shell:
>>> import openpyxl
>>> wb = openpyxl.load_workbook('example.xlsx')
>>> sheet = wb['Sheet1'] # Get a sheet from the workbook.
>>> sheet['A1'] # Get a cell from the sheet.
<Cell 'Sheet1'.A1>
>>> sheet['A1'].value # Get the value from the cell.
datetime.datetime(2015, 4, 5, 13, 34, 2)
>>> c = sheet['B1'] # Get another cell from the sheet.
>>> c.value
'Apples'
>>> # Get the row, column, and value from the cell.
>>> 'Row %s, Column %s is %s' % (c.row, c.column, c.value)
'Row 1, Column B is Apples'
>>> 'Cell %s is %s' % (c.coordinate, c.value)
'Cell B1 is Apples'
>>> sheet['C1'].value
73
The Cell object has a value attribute that contains, unsurprisingly, the value stored in that cell. Cell objects also have row, column, and coordinate attributes that provide location information for the cell.
Here, accessing the value attribute of our Cell object for cell B1 gives us the string 'Apples'. The row attribute gives us the integer 1, the column attribute gives us 'B', and the coordinate attribute gives us 'B1'.
OpenPyXL will automatically interpret the dates in column A and return them as datetime values rather than strings. The datetime data type is explained further in Chapter 17.
Specifying a column by letter can be tricky to program, especially because after column Z, the columns start using two letters: AA, AB, AC, and so on. As an alternative, you can get a cell by calling the sheet's cell() method and passing integers for its row and column keyword arguments. The first row or column integer is 1, not 0. Continue the interactive shell example by entering the following:
>>> sheet.cell(row=1, column=2)
<Cell 'Sheet1'.B1>
>>> sheet.cell(row=1, column=2).value
'Apples'
>>> for i in range(1, 8, 2): # Go through every other row:
...     print(i, sheet.cell(row=i, column=2).value)
...
1 Apples
3 Pears
5 Apples
7 Strawberries
As you can see, using the sheet's cell() method and passing it row=1 and column=2 gets you a Cell object for cell B1, just like specifying sheet['B1'] did. Then, using the cell() method and its keyword arguments, you can write a for loop to print the values of a series of cells.
Say you want to go down column B and print the value in every cell with an odd row number. By passing 2 for the range() function's step parameter, you can get cells from every second row (in this case, all the odd-numbered rows). The for loop's i variable is passed for the row keyword argument to the cell() method, while 2 is always passed for the column keyword argument. Note that the integer 2, not the string 'B', is passed.
You can determine the size of the sheet with the Worksheet object's max_row and max_column attributes. Enter the following in the interactive shell:
>>> import openpyxl
>>> wb = openpyxl.load_workbook('example.xlsx')
>>> sheet = wb['Sheet1']
>>> sheet.max_row # Get the highest row number.
7
>>> sheet.max_column # Get the highest column number.
3
Note that the max_column attribute is an integer rather than the letter that appears in Excel.
Conversion between column letters and numbers
To convert from letters to numbers, call the openpyxl.utils.column_index_from_string() function. To convert from numbers to letters, call the openpyxl.utils.get_column_letter() function. Enter the following in the interactive shell:
>>> import openpyxl
>>> from openpyxl.utils import get_column_letter, column_index_from_string
>>> get_column_letter(1) # Translate column 1 to a letter.
'A'
>>> get_column_letter(2)
'B'
>>> get_column_letter(27)
'AA'
>>> get_column_letter(900)
'AHP'
>>> wb = openpyxl.load_workbook('example.xlsx')
>>> sheet = wb['Sheet1']
>>> get_column_letter(sheet.max_column)
'C'
>>> column_index_from_string('A') # Get A's number.
1
>>> column_index_from_string('AA')
27
After you import these two functions from the openpyxl.utils module, you can call get_column_letter() and pass it an integer like 27 to figure out what the letter name of the 27th column is. The function column_index_from_string() does the reverse: you pass it the letter name of a column, and it tells you what number that column is. You don't need to have a workbook loaded to use these functions. If you want, you can load a workbook, get a Worksheet object, and use a Worksheet attribute like max_column to get an integer. Then, you can pass that integer to get_column_letter().
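The conversion follows a base-26 scheme with no zero digit (A is 1 and Z is 26), which is why column 27 becomes 'AA'. The following is a minimal sketch of that arithmetic in plain Python. The function names letter_from_index() and index_from_letter() are made up for illustration, and this is not OpenPyXL's own implementation; in real code, prefer the functions from openpyxl.utils.

```python
def letter_from_index(index):
    """Convert a 1-based column number to letters, e.g. 27 -> 'AA'."""
    if index < 1:
        raise ValueError('column numbers start at 1')
    letters = ''
    while index > 0:
        # Shift to 0-based so that 26 maps to 'Z' instead of rolling over.
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord('A') + remainder) + letters
    return letters


def index_from_letter(letters):
    """Convert column letters to a 1-based column number, e.g. 'AA' -> 27."""
    index = 0
    for char in letters.upper():
        if not 'A' <= char <= 'Z':
            raise ValueError('invalid column letter: %r' % char)
        index = index * 26 + (ord(char) - ord('A') + 1)
    return index


print(letter_from_index(27))    # AA
print(index_from_letter('AA'))  # 27
```

The two functions are inverses of each other, so converting a number to letters and back returns the original number.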
Get rows and columns from worksheet
You can slice Worksheet objects to get all the Cell objects in a row, column, or rectangular area of the spreadsheet. Then you can loop over all the cells in the slice. Enter the following in the interactive shell:
>>> import openpyxl
>>> wb = openpyxl.load_workbook('example.xlsx')
>>> sheet = wb['Sheet1']
>>> tuple(sheet['A1':'C3']) # Get all cells from A1 to C3.
((<Cell 'Sheet1'.A1>, <Cell 'Sheet1'.B1>, <Cell 'Sheet1'.C1>), (<Cell
'Sheet1'.A2>, <Cell 'Sheet1'.B2>, <Cell 'Sheet1'.C2>), (<Cell 'Sheet1'.A3>,
<Cell 'Sheet1'.B3>, <Cell 'Sheet1'.C3>))
>>> for rowOfCellObjects in sheet['A1':'C3']: # ➊
...     for cellObj in rowOfCellObjects: # ➋
...         print(cellObj.coordinate, cellObj.value)
...     print('--- END OF ROW ---')
A1 2015-04-05 13:34:02
B1 Apples
C1 73
--- END OF ROW ---
A2 2015-04-05 03:41:23
B2 Cherries
C2 85
--- END OF ROW ---
A3 2015-04-06 12:46:51
B3 Pears
C3 14
--- END OF ROW ---
Here, we specify that we want the Cell objects in the rectangular area from A1 to C3, and we get a Generator object containing the Cell objects in that area. To help us visualize this Generator object, we can use tuple() on it to display its Cell objects in a tuple.
This tuple contains three tuples: one for each row, from the top of the desired area to the bottom. Each of these three inner tuples contains the Cell objects in one row of our desired area, from the leftmost cell to the rightmost. So overall, our slice of the sheet contains all the Cell objects in the area from A1 to C3, starting from the top-left cell and ending with the bottom-right cell.
To print the values of each cell in the area, we use two for loops. The outer for loop goes over each row in the slice ➊. Then, for each row, the nested for loop goes through each cell in that row ➋.
To access the values of cells in a particular row or column, you can also use a Worksheet object's rows and columns attributes. These attributes must be converted to lists with the list() function before you can use square brackets and an index with them. Enter the following in the interactive shell:
>>> import openpyxl
>>> wb = openpyxl.load_workbook('example.xlsx')
>>> sheet = wb.active
>>> list(sheet.columns)[1] # Get second column's cells.
(<Cell 'Sheet1'.B1>, <Cell 'Sheet1'.B2>, <Cell 'Sheet1'.B3>, <Cell 'Sheet1'.
B4>, <Cell 'Sheet1'.B5>, <Cell 'Sheet1'.B6>, <Cell 'Sheet1'.B7>)
>>> for cellObj in list(sheet.columns)[1]:
...     print(cellObj.value)
Apples
Cherries
Pears
Oranges
Apples
Bananas
Strawberries
Using the rows attribute on a Worksheet object will give you a tuple of tuples. Each of these inner tuples represents a row and contains the Cell objects in that row. The columns attribute also gives you a tuple of tuples, with each of the inner tuples containing the Cell objects in a particular column. For example.xlsx, since there are 7 rows and 3 columns, rows gives us a tuple of 7 tuples (each containing 3 Cell objects), and columns gives us a tuple of 3 tuples (each containing 7 Cell objects).
To access one particular tuple, you can refer to it by its index in the larger tuple. For example, to get the tuple that represents column B, you use list(sheet.columns)[1]. To get the tuple containing the Cell objects in column A, you'd use list(sheet.columns)[0]. Once you have a tuple representing one row or column, you can loop through its Cell objects and print their values.
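The relationship between the two views can be sketched without a spreadsheet. In the analogy below, plain tuples of values stand in for the tuples of Cell objects (an assumption made so the example runs without example.xlsx), and transposing the rows with zip() produces the column view of the same data.

```python
# Plain values stand in for Cell objects; the data mirrors columns B and C
# of the first three rows of Table 13-1.
rows = (
    ('Apples', 73),
    ('Cherries', 85),
    ('Pears', 14),
)

# zip(*rows) regroups the same values by column instead of by row.
columns = tuple(zip(*rows))

print(len(rows), len(rows[0]))        # 3 rows, each with 2 values
print(len(columns), len(columns[0]))  # 2 columns, each with 3 values
print(columns[0])                     # ('Apples', 'Cherries', 'Pears')
```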
Workbooks, sheets, and cells
As a quick recap, here's a list of all the functions, methods, and data types involved in reading a cell from a spreadsheet file:
- Import the openpyxl module.
- Call the openpyxl.load_workbook() function.
- Get a Workbook object.
- Use the active or sheetnames attribute.
- Get a Worksheet object.
- Use indexing or the cell() sheet method with the row and column keyword arguments.
- Get a Cell object.
- Read the Cell object's value attribute.
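The steps above can be sketched end to end as follows. So that the example does not depend on example.xlsx, it first builds and saves a tiny workbook in a temporary folder; the sheet name 'Fruit', the file name recap.xlsx, and the variable names are made up for illustration.

```python
import os
import tempfile

import openpyxl

# Build a tiny workbook so there is something to read back.
wb = openpyxl.Workbook()
sheet = wb.active
sheet.title = 'Fruit'
sheet['A1'] = 'Apples'
sheet['B1'] = 73
sheet['A2'] = 'Cherries'
sheet['B2'] = 85

path = os.path.join(tempfile.mkdtemp(), 'recap.xlsx')
wb.save(path)

# The recap steps: load the workbook, pick a sheet, then read cells.
loaded = openpyxl.load_workbook(path)
fruit_sheet = loaded['Fruit']
values = []
for row_num in range(1, fruit_sheet.max_row + 1):
    # Both access styles from this chapter: cell() and indexing by name.
    name = fruit_sheet.cell(row=row_num, column=1).value
    count = fruit_sheet['B' + str(row_num)].value
    values.append((name, count))

print(loaded.sheetnames)
print(values)
```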
Project: Reading data from a spreadsheet
Say you have a spreadsheet of data from the 2010 US Census, and you have the tedious task of going through its thousands of rows to count both the total population and the number of census tracts for each county. (A census tract is simply a geographic area defined for the purposes of the census.) Each row represents a single census tract. We named the spreadsheet file censuspopdata.xlsx, and you can download it from nostarch.com/automatestuff2. Its contents look like Figure 13-2.
Figure 13-2: The censuspopdata.xlsx spreadsheet
Even though Excel can calculate the sum of multiple selected cells, you'd still have to select the cells for each of the 3,000-plus counties. Even if it takes just a few seconds to calculate a county's population by hand, this would take hours to do for the whole spreadsheet.
In this project, you'll write a script that reads a census spreadsheet file and calculates statistics for each county in seconds.
This is what your program does:
- Read data from an Excel spreadsheet
- Count the number of census tracts in each county
- Count the total population of each county
- Print the results
This means your code needs to do the following:
- Open and read the cells of an Excel document with the openpyxl module.
- Calculate all the tract and population data and store it in a data structure.
- Write the data structure to a text file with the .py extension using the pprint module.
Step One: Reading Spreadsheet Data
There is just one sheet in the censuspopdata.xlsx spreadsheet, named 'Population by Census Tract', and each row holds the data for a single census tract. The columns are the tract number (A), the state abbreviation (B), the county name (C), and the population of the tract (D).
Open a new file editor tab and enter the following code. Save the file as readCensusExcel.py.
#! python3
# readCensusExcel.py - Tabulates population and number of census tracts for
# each county.
import openpyxl, pprint # ➊
print('Opening workbook...')
wb = openpyxl.load_workbook('censuspopdata.xlsx') # ➋
sheet = wb['Population by Census Tract'] # ➌
countyData = {}
# TODO: Fill in countyData with each county's population and tracts.
print('Reading rows...')
for row in range(2, sheet.max_row + 1): # ➍
    # Each row in the spreadsheet has data for one census tract.
    state = sheet['B' + str(row)].value
    county = sheet['C' + str(row)].value
    pop = sheet['D' + str(row)].value
# TODO: Open a new text file and write the contents of countyData to it.
This code imports the openpyxl module, as well as the pprint module that you'll use to print the final county data ➊. Then it opens the censuspopdata.xlsx file ➋, gets the sheet with the census data ➌, and begins iterating over its rows ➍.
Note that you've also created a variable named countyData, which will contain the populations and number of tracts you calculate for each county. Before you can store anything in it, though, you should determine exactly how you'll structure the data inside it.
Step 2: Populate the data structure
The data structure stored in countyData will be a dictionary with state abbreviations as its keys. Each state abbreviation will map to another dictionary, whose keys are strings of the county names in that state. Each county name will in turn map to a dictionary with just two keys, 'tracts' and 'pop'. These keys map to the number of census tracts and the population for the county. For example, the dictionary will look similar to this:
{'AK': {'Aleutians East': {'pop': 3141, 'tracts': 1},
        'Aleutians West': {'pop': 5561, 'tracts': 2},
        'Anchorage': {'pop': 291826, 'tracts': 55},
        'Bethel': {'pop': 17013, 'tracts': 3},
        'Bristol Bay': {'pop': 997, 'tracts': 1},
        --snip--
If the previous dictionary were stored in countyData, the following expressions would evaluate like this:
>>> countyData['AK']['Anchorage']['pop']
291826
>>> countyData['AK']['Anchorage']['tracts']
55
More generally, the countyData dictionary's keys will look like this:
countyData[state abbrev][county]['tracts']
countyData[state abbrev][county]['pop']
Now that you know how countyData will be structured, you can write the code that fills it with the county data. Add the following code to the bottom of your program:
#! python3
# readCensusExcel.py - Tabulates population and number of census tracts for
# each county.
--snip--
for row in range(2, sheet.max_row + 1):
    # Each row in the spreadsheet has data for one census tract.
    state = sheet['B' + str(row)].value
    county = sheet['C' + str(row)].value
    pop = sheet['D' + str(row)].value
    # Make sure the key for this state exists.
    countyData.setdefault(state, {}) # ➊
    # Make sure the key for this county in this state exists.
    countyData[state].setdefault(county, {'tracts': 0, 'pop': 0}) # ➋
    # Each row represents one census tract, so increment by one.
    countyData[state][county]['tracts'] += 1 # ➌
    # Increase the county pop by the pop in this census tract.
    countyData[state][county]['pop'] += int(pop) # ➍
# TODO: Open a new text file and write the contents of countyData to it.
The last two lines of code in the loop perform the actual calculation work, incrementing the value for tracts ➌ and increasing the value for pop ➍ for the current county on each iteration of the for loop.
The other code is there because you cannot add a county dictionary as the value for a state abbreviation key until the key itself exists in countyData. (That is, countyData['AK']['Anchorage']['tracts'] += 1 will cause an error if the 'AK' key doesn't exist yet.) To make sure the state abbreviation key exists in your data structure, you need to call the setdefault() method to set a value if one does not already exist for state ➊.
Just as the countyData dictionary needs a dictionary as the value for each state abbreviation key, each of those dictionaries needs its own dictionary as the value for each county key ➋. Each of those dictionaries in turn needs the keys 'tracts' and 'pop', which start with the integer value 0. (If you ever lose track of the dictionary structure, look back at the example dictionary at the start of this section.)
Since setdefault() does nothing if the key already exists, you can call it on every iteration of the for loop without a problem.
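The setdefault() pattern can be tried on its own, without the spreadsheet. The sketch below runs the same four statements over a short list of made-up (state, county, population) rows; the numbers are illustrative and are not census figures.

```python
# Made-up rows in the same (state, county, population) layout as the
# spreadsheet's columns B, C, and D.
sample_rows = [
    ('AK', 'Anchorage', 100),
    ('AK', 'Anchorage', 250),
    ('AK', 'Bethel', 40),
    ('AL', 'Autauga', 75),
]

county_data = {}
for state, county, pop in sample_rows:
    # Create the state and county entries only if they are missing.
    county_data.setdefault(state, {})
    county_data[state].setdefault(county, {'tracts': 0, 'pop': 0})
    # Each row is one tract, so count it and add its population.
    county_data[state][county]['tracts'] += 1
    county_data[state][county]['pop'] += int(pop)

print(county_data['AK']['Anchorage'])  # {'tracts': 2, 'pop': 350}
```

Because the second 'Anchorage' row finds its keys already present, both setdefault() calls leave the running totals untouched.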
Step 3: Write the result to a file
After the for loop has finished, the countyData dictionary will contain all of the population and tract information keyed by county and state. At this point, you could write more code to save it to a text file or another Excel spreadsheet. For now, let's just use the pprint.pformat() function to write the countyData dictionary value as one big string to a file named census2010.py. Add the following code to the bottom of your program (making sure to keep it unindented so that it stays outside the for loop):
#! python3
# readCensusExcel.py - Tabulates population and number of census tracts for
# each county.
--snip--
for row in range(2, sheet.max_row + 1):
    --snip--
# Open a new text file and write the contents of countyData to it.
print('Writing results...')
resultFile = open('census2010.py', 'w')
resultFile.write('allData = ' + pprint.pformat(countyData))
resultFile.close()
print('Done.')
The pprint.pformat() function produces a string that itself is formatted as valid Python code. By outputting it to a text file named census2010.py, you've generated a Python program from your Python program! This may seem complicated, but the advantage is that you can now import census2010.py just like any other Python module. In the interactive shell, change the current working directory to the folder with your newly created census2010.py file and then import it:
>>> import os
>>> import census2010
>>> census2010.allData['AK']['Anchorage']
{'pop': 291826, 'tracts': 55}
>>> anchoragePop = census2010.allData['AK']['Anchorage']['pop']
>>> print('The 2010 population of Anchorage was ' + str(anchoragePop))
The 2010 population of Anchorage was 291826
The readCensusExcel.py program is throwaway code: once you have its results saved to census2010.py, you won't need to run the program again. Whenever you need the county data, you can just run import census2010.
Computing these numbers by hand would take hours; this program does it in seconds. Using OpenPyXL, you can effortlessly extract information saved to Excel spreadsheets and perform calculations on them. You can download the full program from .
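The write-then-import round trip can be checked with a small stand-in dictionary. This sketch writes to a temporary folder and reads the file back with runpy.run_path() instead of an import statement, purely so that it runs regardless of the current working directory; the file name census_sample.py and the variable names are made up for illustration.

```python
import os
import pprint
import runpy
import tempfile

# A tiny stand-in for countyData.
county_data = {'AK': {'Anchorage': {'pop': 291826, 'tracts': 55}}}

# Write the dictionary as Python source, as readCensusExcel.py does.
path = os.path.join(tempfile.mkdtemp(), 'census_sample.py')
with open(path, 'w') as result_file:
    result_file.write('allData = ' + pprint.pformat(county_data))

# Run the generated file and read the variable back out of it.
namespace = runpy.run_path(path)
restored = namespace['allData']

print(restored == county_data)
```

The with statement closes the file automatically, which is why no explicit close() call is needed here.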
Ideas for Similar Programs
Many businesses and offices use Excel to store various types of data, and it's not uncommon for spreadsheets to become large and unwieldy. Any program that parses an Excel spreadsheet has a similar structure: it loads the spreadsheet file, prepares some variables or data structures, and then iterates over each row in the spreadsheet. Such a program can do the following:
- Compare data across multiple rows in a spreadsheet.
- Open multiple Excel files and compare data between spreadsheets.
- Check the spreadsheet for blank rows or invalid data and alert the user if so.
- Read data from a spreadsheet and use it as input to a Python program.
Write an Excel document
OpenPyXL also provides methods for writing data, which means your program can create and edit spreadsheet files. Using Python, it is very simple to create spreadsheets with thousands of rows of data.
Create and save an Excel document
Call the openpyxl.Workbook() function to create a new, blank Workbook object. Enter the following in the interactive shell:
>>> import openpyxl
>>> wb = openpyxl.Workbook() # Create a blank workbook.
>>> wb.sheetnames # It starts with one sheet.
['Sheet']
>>> sheet = wb.active
>>> sheet.title
'Sheet'
>>> sheet.title = 'Spam Bacon Eggs Sheet' # Change title.
>>> wb.sheetnames
['Spam Bacon Eggs Sheet']
The workbook will start off with a single sheet named Sheet. You can change the name of the sheet by storing a new string in its title attribute.
Any time you modify the Workbook object or its sheets and cells, the spreadsheet file will not be saved until you call the save() workbook method. Enter the following in the interactive shell (with example.xlsx in the current working directory):
>>> import openpyxl
>>> wb = openpyxl.load_workbook('example.xlsx')
>>> sheet = wb.active
>>> sheet.title = 'Spam Spam Spam'
>>> wb.save('example_copy.xlsx') # Save the workbook.
Here, we change the name of our sheet. To save our changes, we pass a filename as a string to the save() method. Passing a different filename than the original, such as 'example_copy.xlsx', saves the changes to a copy of the spreadsheet.
Whenever you edit a spreadsheet loaded from a file, you should save the new, edited spreadsheet with a different filename than the original file. This way, you can still use the original spreadsheet file in case an error in the code causes the newly saved file to contain incorrect or corrupt data.
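One way to follow this advice consistently is to derive the copy's name from the original. The helper below is a small sketch using pathlib; copy_filename() is a hypothetical name, not an OpenPyXL function, and the '_copy' suffix is simply the convention used in the example above.

```python
from pathlib import Path


def copy_filename(filename, suffix='_copy'):
    """Return a sibling filename such as 'example_copy.xlsx'."""
    # Hypothetical helper for illustration; OpenPyXL does not provide this.
    path = Path(filename)
    return str(path.with_name(path.stem + suffix + path.suffix))


print(copy_filename('example.xlsx'))  # example_copy.xlsx
```

The result could then be passed to the workbook's save() method in place of a hardcoded string.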
Create and delete worksheets
Sheets can be added to and removed from a workbook with the create_sheet() method and the del operator. Enter the following in the interactive shell:
>>> import openpyxl
>>> wb = openpyxl.Workbook()
>>> wb.sheetnames
['Sheet']
>>> wb.create_sheet() # Add a new sheet.
<Worksheet "Sheet1">
>>> wb.sheetnames
['Sheet', 'Sheet1']
>>> # Create a new sheet at index 0.
>>> wb.create_sheet(index=0, title='First Sheet')
<Worksheet "First Sheet">
>>> wb.sheetnames
['First Sheet', 'Sheet', 'Sheet1']
>>> wb.create_sheet(index=2, title='Middle Sheet')
<Worksheet "Middle Sheet">
>>> wb.sheetnames
['First Sheet', 'Sheet', 'Middle Sheet', 'Sheet1']
The create_sheet() method returns a new Worksheet object named SheetX, which by default is set to be the last sheet in the workbook. Optionally, the index and name of the new sheet can be specified with the index and title keyword arguments.
Continue the previous example by entering:
>>> wb.sheetnames
['First Sheet', 'Sheet', 'Middle Sheet', 'Sheet1']
>>> del wb['Middle Sheet']
>>> del wb['Sheet1']
>>> wb.sheetnames
['First Sheet', 'Sheet']
You can use the del operator to delete a sheet from a workbook, just like you can use it to delete a key-value pair from a dictionary.
After adding or removing sheets in the workbook, remember to call the save() method to save the changes.
Write value to cell
Writing a value to a cell is very similar to writing a value to a key in a dictionary. Enter the following in the interactive shell:
>>> import openpyxl
>>> wb = openpyxl.Workbook()
>>> sheet = wb['Sheet']
>>> sheet['A1'] = 'Hello, world!' # Edit the cell's value.
>>> sheet['A1'].value
'Hello, world!'
If you have the cell's coordinate as a string, you can use it just like a dictionary key on the Worksheet object to specify which cell to write to.
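When the data comes from a Python list rather than from typed-in coordinates, the cell() method with integer row and column arguments is often more convenient. The following is a minimal sketch that writes a small table into a new workbook; the header names and values are made up for illustration.

```python
import openpyxl

# Illustrative data: a header row followed by two data rows.
table = [
    ('Produce', 'Pounds sold'),
    ('Potatoes', 21.6),
    ('Okra', 38.6),
]

wb = openpyxl.Workbook()
sheet = wb.active

# Spreadsheet rows and columns are numbered from 1, so both
# enumerations start at 1.
for row_num, row_values in enumerate(table, start=1):
    for col_num, value in enumerate(row_values, start=1):
        sheet.cell(row=row_num, column=col_num).value = value

print(sheet['A2'].value, sheet['B2'].value)
print(sheet.max_row, sheet.max_column)
```

Nothing is written to disk here; the workbook exists only in memory until its save() method is called.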
Project: Update Spreadsheet
In this project, you'll write a program to update cells in a spreadsheet of produce sales. Your program will look through the spreadsheet, find specific kinds of produce, and update their prices. Download this spreadsheet from nostarch.com/automatestuff2. Figure 13-3 shows what the spreadsheet looks like.
Figure 13-3: A spreadsheet of produce sales
Each row represents an individual sale. The columns are the type of produce sold (A), the cost per pound of that produce (B), the number of pounds sold (C), and the total revenue from the sale (D). The TOTAL column is set to an Excel formula such as =ROUND(B3 * C3, 2), which multiplies the cost per pound by the number of pounds sold and rounds the result to the nearest cent. With this formula, the cells in the TOTAL column will automatically update themselves if there is a change in column B or C.
Now imagine that the prices of garlic, celery, and lemons were entered incorrectly, leaving you with the tedious task of going through thousands of rows in this spreadsheet to update the cost per pound for every garlic, celery, and lemon row. You can't do a simple find-and-replace on the price, because there might be other items with the same price that you don't want to mistakenly "correct." For thousands of rows, this would take hours to do by hand. But you can write a program that does it in seconds.
Your program does the following:
- Loop over all the rows
- Change the price if the row is garlic, celery, or lemon
This means your code needs to do the following:
- Open the spreadsheet file.
- For each row, check whether the value in column A is Celery, Garlic, or Lemon.
- If it is, update the price in column B.
- Save the spreadsheet to a new file (so that you don't lose the old spreadsheet, just in case).
Step 1: Build a data structure with update information
The prices you need to update are as follows:
- Celery 1.19
- Garlic 3.07
- Lemon 1.27
You could write code like this:
if produceName == 'Celery':
    cellObj = 1.19
if produceName == 'Garlic':
    cellObj = 3.07
if produceName == 'Lemon':
    cellObj = 1.27
Hardcoding the produce and updated price data like this is inelegant. If you need to update the spreadsheet again with different prices or different produce, you will have to change a lot of the code. Every time you change code, you risk introducing bugs.
A more flexible solution would be to store the correct price information in a dictionary and write code to use this data structure. In a new file editor tab, enter the following code:
#! python3
# updateProduce.py - Corrects costs in produce sales spreadsheet.
import openpyxl
wb = openpyxl.load_workbook('produceSales.xlsx')
sheet = wb['Sheet']
# The produce types and their updated prices
PRICE_UPDATES = {'Garlic': 3.07,
                 'Celery': 1.19,
                 'Lemon': 1.27}
# TODO: Loop through the rows and update the prices.
Save this as updateProduce.py. If you need to update the spreadsheet again, you'll need to update only the PRICE_UPDATES dictionary, not any other code.
Step Two: Check All Rows and Update Incorrect Prices
The next part of the program loops through all the rows in the spreadsheet. Add the following code to the bottom of updateProduce.py:
#! python3
# updateProduce.py - Corrects costs in produce sales spreadsheet.
--snip--
# Loop through the rows and update the prices.
# range() stops before its end value, so use max_row + 1 to include the last row.
for rowNum in range(2, sheet.max_row + 1): # skip the first row # ➊
    produceName = sheet.cell(row=rowNum, column=1).value # ➋
    if produceName in PRICE_UPDATES: # ➌
        sheet.cell(row=rowNum, column=2).value = PRICE_UPDATES[produceName]
wb.save('updatedProduceSales.xlsx') # ➍
We loop through the rows starting at row 2, since row 1 is just the header ➊. The cell in column 1 (that is, column A) will be stored in the variable produceName ➋. If produceName exists as a key in the PRICE_UPDATES dictionary ➌, then you know this is a row whose price must be corrected. The correct price will be in PRICE_UPDATES[produceName].
Notice how clean using PRICE_UPDATES makes the code. Only one if statement, rather than code like if produceName == 'Garlic':, is needed for every type of produce to update. And since the code uses the PRICE_UPDATES dictionary instead of hardcoding the produce names and updated costs into the for loop, you modify only the PRICE_UPDATES dictionary and not the code if the produce sales spreadsheet needs additional changes.
After going through the entire spreadsheet and making changes, the code saves the Workbook object to updatedProduceSales.xlsx ➍. It doesn't overwrite the old spreadsheet, just in case there's a bug in your program and the updated spreadsheet is wrong. After checking that the updated spreadsheet looks right, you can delete the old spreadsheet.
You can download the complete source code for this program from nostarch.com/automatestuff2.
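The dictionary-lookup logic can be tested separately from the spreadsheet. In the sketch below, a list of made-up [produce, cost per pound] rows stands in for columns A and B of produceSales.xlsx, and a counter (an addition for illustration, not part of updateProduce.py) reports how many rows were corrected.

```python
PRICE_UPDATES = {'Garlic': 3.07, 'Celery': 1.19, 'Lemon': 1.27}

# Made-up rows standing in for columns A and B of the spreadsheet.
rows = [
    ['Potatoes', 0.86],
    ['Garlic', 1.50],
    ['Lemon', 9.99],
]

changed = 0
for row in rows:
    produce_name = row[0]
    # One membership test covers every kind of produce in the dictionary.
    if produce_name in PRICE_UPDATES:
        row[1] = PRICE_UPDATES[produce_name]
        changed += 1

print(rows)
print(changed, 'rows updated')
```

Adding another correction requires only a new key-value pair in PRICE_UPDATES; the loop stays the same.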
Ideas for Similar Programs
Since many office workers use Excel spreadsheets all the time, a program that can automatically edit and write Excel files could be very useful. Such a program can do the following:
- Read data from one spreadsheet and write it to parts of other spreadsheets.
- Read data from a website, text file, or clipboard, and write it to a spreadsheet.
- Automatically "clean" data in spreadsheets. For example, it can use regular expressions to read phone numbers in multiple formats and compile them into a single standard format.
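As a sketch of the data-cleaning idea in the last item, the function below rewrites US-style phone numbers into one format. The regular expression, the target format XXX-XXX-XXXX, and the name normalize_phone() are all assumptions chosen for this example; real data would likely need a more careful pattern.

```python
import re

# Matches numbers such as 415-555-1234, (415) 555-1234, or 415.555.1234.
PHONE_REGEX = re.compile(r'\(?(\d{3})\)?[\s.-]*(\d{3})[\s.-]*(\d{4})')


def normalize_phone(text):
    """Return the number as 'XXX-XXX-XXXX', or None if no number is found."""
    match = PHONE_REGEX.search(text)
    if match is None:
        return None
    return '-'.join(match.groups())


for raw in ['(415) 555-1234', '415.555.1234', '4155551234', 'n/a']:
    print(raw, '->', normalize_phone(raw))
```

A program could call a function like this on each cell's value and write the result back to the cell, leaving cells that return None for a person to review.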
Set the font style of the cell
Styling certain cells, rows, or columns can help you emphasize important areas in your spreadsheet. In the produce spreadsheet, for example, your program could apply bold text to the potato, garlic, and parsnip rows. Or perhaps you want to italicize every row with a cost per pound greater than $5. Styling parts of a large spreadsheet by hand would be tedious, but your programs can do it instantly.
To customize font styles in cells, import the Font() function from the openpyxl.styles module.
from openpyxl.styles import Font
This allows you to type Font() instead of openpyxl.styles.Font(). (See "Importing Modules" on page 47 for a review of this style of import statement.)
The following example creates a new workbook and sets cell A1 to 24 point italic font. Enter the following in the interactive shell:
>>> import openpyxl
>>> from openpyxl.styles import Font
>>> wb = openpyxl.Workbook()
>>> sheet = wb['Sheet']
>>> italic24Font = Font(size=24, italic=True) # Create a font. # ➊
>>> sheet['A1'].font = italic24Font # Apply the font to A1. # ➋
>>> sheet['A1'] = 'Hello, world!'
>>> wb.save('styles.xlsx')
In this example, Font(size=24, italic=True) returns a Font object, which is stored in italic24Font ➊. The keyword arguments to Font(), size and italic, configure the Font object's styling information. When sheet['A1'].font is assigned the italic24Font object ➋, all of that font styling information is applied to cell A1.
Font Objects
To set font attributes, you pass keyword arguments to Font(). Table 13-2 shows the possible keyword arguments for the Font() function.
Table 13-2: Keyword Arguments for Font Objects
Keyword argument | Data type | Description |
---|---|---|
name | String | The font name, such as 'Calibri' or 'Times New Roman' |
size | Integer | The point size |
bold | Boolean | True, for bold font |
italic | Boolean | True, for italic font |
You can call Font() to create a Font object and store that Font object in a variable. You then assign that variable to a Cell object's font attribute. For example, this code creates various font styles:
>>> import openpyxl
>>> from openpyxl.styles import Font
>>> wb = openpyxl.Workbook()
>>> sheet = wb['Sheet']
>>> fontObj1 = Font(name='Times New Roman', bold=True)
>>> sheet['A1'].font = fontObj1
>>> sheet['A1'] = 'Bold Times New Roman'
>>> fontObj2 = Font(size=24, italic=True)
>>> sheet['B3'].font = fontObj2
>>> sheet['B3'] = '24 pt Italic'
>>> wb.save('styles.xlsx')
Here, we store a Font object in fontObj1 and then set the A1 Cell object's font attribute to fontObj1. We repeat the process with another Font object to set the font of a second cell. After you run this code, cells A1 and B3 in the spreadsheet will have the custom font styles shown in Figure 13-4.
Figure 13-4: Spreadsheet with custom font styles
For cell A1, we set the font name to 'Times New Roman' and set bold to True, so our text appears in bold Times New Roman. We didn't specify a size, so the openpyxl default, 11, is used. In cell B3, our text is italic, with a size of 24; we didn't specify a font name, so the openpyxl default, Calibri, is used.
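The same Font objects can be applied by rule rather than cell by cell. The following is a minimal sketch of the "italicize every row that costs more than $5 per pound" idea from the start of this section; the sample produce data and the $5 threshold are made up for illustration, and the workbook is kept in memory rather than saved.

```python
import openpyxl
from openpyxl.styles import Font

wb = openpyxl.Workbook()
sheet = wb.active

# Made-up sample data: produce name and cost per pound.
sample_rows = [('Produce', 'Cost per pound'),
               ('Potatoes', 0.86),
               ('Ginger', 6.20),
               ('Garlic', 3.07)]
for row in sample_rows:
    sheet.append(row)

italic_font = Font(italic=True)

# Skip the header in row 1, then style every cell in each expensive row.
for row_cells in sheet.iter_rows(min_row=2, max_row=sheet.max_row):
    cost = row_cells[1].value
    if cost > 5:
        for cell in row_cells:
            cell.font = italic_font

print(sheet['A3'].font.italic)
```

Calling wb.save() with a filename afterward would write the styled rows to disk, just as in the shell examples above.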
Formulas
Excel formulas, which begin with an equal sign, can configure cells to contain values calculated from other cells. In this section, you'll use the openpyxl module to programmatically add formulas to cells, just like any normal value. For example:
>>> sheet['B9'] = '=SUM(B1:B8)'
This will store =SUM(B1:B8) as the value in cell B9. This sets the B9 cell to a formula that calculates the sum of the values in cells B1 through B8. You can see this in Figure 13-5.
Figure 13-5: Cell B9 contains the formula =SUM(B1:B8), which adds the cells B1 through B8.
Excel formulas are set up just like any other text value in a cell. Enter the following in the interactive shell:
>>> import openpyxl
>>> wb = openpyxl.Workbook()
>>> sheet = wb.active
>>> sheet['A1'] = 200
>>> sheet['A2'] = 300
>>> sheet['A3'] = '=SUM(A1:A2)' # Set the formula.
>>> wb.save('writeFormula.xlsx')
The cells in A1 and A2 are set to 200 and 300, respectively. The value in cell A3 is set as a formula that sums the values in A1 and A2. When the spreadsheet is opened in Excel, A3 displays its value as 500.
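Because a formula is stored as ordinary text, a program can build it from values it already knows, such as the last row that contains data. The sketch below assumes a single column of made-up numbers; sum_formula() is a hypothetical helper, not an openpyxl function.

```python
import openpyxl


def sum_formula(column_letter, first_row, last_row):
    """Build an Excel SUM formula string for one column, e.g. '=SUM(A1:A3)'."""
    return f'=SUM({column_letter}{first_row}:{column_letter}{last_row})'


wb = openpyxl.Workbook()
sheet = wb.active

# Made-up values in A1 through A3.
for value in (200, 300, 150):
    sheet.append([value])

last_data_row = sheet.max_row            # 3 for this sample data
total_cell = f'A{last_data_row + 1}'     # the cell just below the data
sheet[total_cell] = sum_formula('A', 1, last_data_row)

# openpyxl stores the formula text; it does not calculate the result.
print(total_cell, sheet[total_cell].value)
```

The total is computed only when a spreadsheet application opens the saved file and evaluates the formula.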
Excel formulas offer a level of programmability for spreadsheets but can quickly become unmanageable for complicated tasks. For example, even if you're deeply familiar with Excel formulas, it's a headache to try to decipher what =IFERROR(TRIM(IF(LEN(VLOOKUP(F7, Sheet2!$A$1:$B$10000, 2, FALSE))>0,SUBSTITUTE(VLOOKUP(F7, Sheet2!$A$1:$B$10000, 2, FALSE), " ", ""),"")), "") actually does. Python code is much more readable.
Adjusting Rows and Columns
In Excel, resizing rows and columns is as easy as clicking and dragging the edge of a row or column heading. But if you need to set the size of a row or column based on the contents of a cell, or if you want to set a size across a large spreadsheet file, it's much faster to write a Python program to do it.
Rows and columns can also be completely hidden. Or they can be "frozen" in place so they are always visible on screen and appear on every page when the spreadsheet is printed (handy for headers).
Setting Row Height and Column Width
Worksheet objects have row_dimensions and column_dimensions attributes that control row heights and column widths. Enter the following into the interactive shell:
>>> import openpyxl
>>> wb = openpyxl.Workbook()
>>> sheet = wb.active
>>> sheet['A1'] = 'Tall row'
>>> sheet['B2'] = 'Wide column'
>>> # Set the height and width:
>>> sheet.row_dimensions[1].height = 70
>>> sheet.column_dimensions['B'].width = 20
>>> wb.save('dimensions.xlsx')
A sheet's row_dimensions and column_dimensions are dictionary-like values; row_dimensions contains RowDimension objects, and column_dimensions contains ColumnDimension objects. In row_dimensions, you can access one of the objects using the number of the row (in this case, 1 or 2). In column_dimensions, you can access one of the objects using the letter of the column (in this case, A or B).
The dimensions.xlsx spreadsheet looks like Figure 13-6.
Figure 13-6: Row 1 and Column B are set to a larger height and width
Once you have the RowDimension object, you can set its height. Once you have the ColumnDimension object, you can set its width. The row height can be set to an integer or float value between 0 and 409. This value represents the height measured in points, where one point equals 1/72 of an inch. The default row height is 12.75. The column width can be set to an integer or float value between 0 and 255. This value represents the number of characters at the default font size (11 point) that can be displayed in the cell. The default column width is 8.43 characters. Columns with a width of 0 and rows with a height of 0 are hidden from the user.
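Setting a width based on a column's contents, as mentioned at the start of this section, could look like the following sketch. It treats one character of text as one unit of width, which is only an approximation, and the PADDING value and sample text are arbitrary choices made for illustration.

```python
import openpyxl
from openpyxl.utils import get_column_letter

wb = openpyxl.Workbook()
sheet = wb.active
sheet.append(['Produce', 'Notes'])
sheet.append(['Garlic', 'Price corrected after supplier update'])

PADDING = 2  # extra characters of breathing room; an arbitrary choice

for index, column_cells in enumerate(sheet.iter_cols(), start=1):
    # Length of the longest non-empty value in this column.
    longest = max((len(str(cell.value)) for cell in column_cells
                   if cell.value is not None), default=0)
    letter = get_column_letter(index)
    # Widths are capped at 255, the upper limit described above.
    sheet.column_dimensions[letter].width = min(longest + PADDING, 255)

print(sheet.column_dimensions['A'].width, sheet.column_dimensions['B'].width)
```

Proportional fonts make some characters wider than others, so treat the result as a reasonable starting width rather than an exact fit.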
Merging and Unmerging Cells
A rectangular area of cells can be merged into a single cell with the merge_cells() sheet method. Enter the following into the interactive shell:
>>> import openpyxl
>>> wb = openpyxl.Workbook()
>>> sheet = wb.active
>>> sheet.merge_cells('A1:D3') # Merge all these cells.
>>> sheet['A1'] = 'Twelve cells merged together.'
>>> sheet.merge_cells('C5:D5') # Merge these two cells.
>>> sheet['C5'] = 'Two merged cells.'
>>> wb.save('merged.xlsx')
The argument to merge_cells() is a single string of the top-left and bottom-right cells of the rectangular area to be merged: 'A1:D3' merges 12 cells into a single cell. To set the value of these merged cells, simply set the value of the top-left cell of the merged group.
When you run this code, merged.xlsx will look like Figure 13-7.
Figure 13-7: Merged cells in a spreadsheet
To unmerge cells, call the unmerge_cells() sheet method. Enter the following into the interactive shell:
>>> import openpyxl
>>> wb = openpyxl.load_workbook('merged.xlsx')
>>> sheet = wb.active
>>> sheet.unmerge_cells('A1:D3') # Split these cells up.
>>> sheet.unmerge_cells('C5:D5')
>>> wb.save('merged.xlsx')
If you save your changes, then view the spreadsheet, you'll see that the merged cells have reverted to separate cells.
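If you'd rather confirm the result from Python than by opening the file, the worksheet keeps track of its merged areas. The sketch below works on an in-memory workbook and relies on the sheet's merged_cells.ranges collection, whose items expose the range as a string through their coord attribute; the variable names are only for illustration.

```python
import openpyxl

wb = openpyxl.Workbook()
sheet = wb.active

sheet.merge_cells('A1:D3')
sheet['A1'] = 'Twelve cells merged together.'
merged_after_merge = [cell_range.coord
                      for cell_range in sheet.merged_cells.ranges]

sheet.unmerge_cells('A1:D3')
merged_after_unmerge = [cell_range.coord
                        for cell_range in sheet.merged_cells.ranges]

print(merged_after_merge)     # the merged areas while A1:D3 is merged
print(merged_after_unmerge)   # the merged areas after unmerging
print(sheet['A1'].value)      # the top-left cell keeps its value
```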
Freezing Panes
For spreadsheets too large to be displayed all at once, it's helpful to "freeze" a few of the top rows or leftmost columns onscreen. Frozen column or row headers, for example, are always visible to the user even as they scroll through the spreadsheet. These are known as freeze panes. In OpenPyXL, each Worksheet object has a freeze_panes attribute that can be set to a Cell object or a string of a cell's coordinates. Note that all rows above and all columns to the left of this cell will be frozen, but the row and column of the cell itself will not be frozen.
To unfreeze all panes, set freeze_panes to None or 'A1'. Table 13-3 shows which rows and columns will be frozen for some example settings of freeze_panes.
Table 13-3: Frozen Pane Examples
freeze_panes setting | Rows and columns frozen |
---|---|
sheet.freeze_panes = 'A2' | Row 1 |
sheet.freeze_panes = 'B1' | Column A |
sheet.freeze_panes = 'C1' | Columns A and B |
sheet.freeze_panes = 'C2' | Row 1 and columns A and B |
sheet.freeze_panes = 'A1' or sheet.freeze_panes = None | No frozen panes |
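The rule behind the table can be checked with a few lines of plain Python. The sketch below uses a hypothetical helper, describe_freeze(), that is not part of openpyxl; it only computes how many rows and columns a given freeze_panes setting would freeze, assuming an uppercase coordinate string such as 'C2'.

```python
import re


def describe_freeze(coordinate):
    """Return (frozen_row_count, frozen_column_count) for a freeze_panes value."""
    if coordinate is None:
        return (0, 0)
    match = re.fullmatch(r'([A-Z]+)(\d+)', coordinate)
    if match is None:
        raise ValueError(f'not a cell coordinate: {coordinate!r}')
    letters, row_number = match.group(1), int(match.group(2))
    # Convert column letters to a 1-based index (A=1, ..., Z=26, AA=27).
    column_index = 0
    for letter in letters:
        column_index = column_index * 26 + (ord(letter) - ord('A') + 1)
    # Everything above the row and to the left of the column is frozen.
    return (row_number - 1, column_index - 1)


for setting in ('A2', 'B1', 'C1', 'C2', 'A1', None):
    print(setting, describe_freeze(setting))
```

Each result matches a row of the table: 'C2', for example, gives one frozen row and two frozen columns.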
Make sure you have the produce sales spreadsheet. Then enter the following in the interactive shell:
>>> import openpyxl
>>> wb = openpyxl.load_workbook('produceSales.xlsx')
>>> sheet = wb.active
>>> sheet.freeze_panes = 'A2' # Freeze the rows above A2.
>>> wb.save('freezeExample.xlsx')
If you set the freeze_panes attribute to 'A2', row 1 will always be visible, no matter where the user scrolls in the spreadsheet. You can see this in Figure 13-8.
Figure 13-8: With freeze_panes set to 'A2', row 1 is always visible, even as the user scrolls down.
Charts
OpenPyXL supports creating bar, line, scatter, and pie charts using the data in a sheet's cells. To make a chart, you need to do the following:
- Create a Reference object from a rectangular selection of cells.
- Create a Series object by passing in the Reference object.
- Create a Chart object.
- Append the Series object to the Chart object.
- Add the Chart object to the Worksheet object, optionally specifying which cell should be the top-left corner of the chart.
The Reference object requires some explanation. You create Reference objects by calling the openpyxl.chart.Reference() function and passing three arguments:
- The Worksheet object containing your chart data.
- A tuple of two integers, representing the top-left cell of the rectangular selection of cells containing your chart data: the first integer in the tuple is the row, and the second is the column. Note that 1 is the first row, not 0.
- A tuple of two integers, representing the bottom-right cell of the rectangular selection of cells containing your chart data: the first integer in the tuple is the row, and the second is the column.
The interactive shell example later in this section expresses the same two corners as the keyword arguments min_col, min_row, max_col, and max_row instead of tuples.
Figure 13-9 shows some sample coordinate parameters.
Figure 13-9: From left to right: (1, 1), (10, 1); (3, 2), (6, 4); (5, 3), (5, 3)
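To see which cells those coordinate pairs select, it can help to translate them into the familiar A1-style notation. The sketch below does this in plain Python; column_letter() and to_range() are hypothetical helpers written for this illustration (openpyxl's own openpyxl.utils.get_column_letter() performs the same letter conversion).

```python
def column_letter(index):
    """Convert a 1-based column index to letters (1 -> 'A', 27 -> 'AA')."""
    letters = ''
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord('A') + remainder) + letters
    return letters


def to_range(top_left, bottom_right):
    """Convert two (row, column) tuples to an A1-style range string."""
    (top, left), (bottom, right) = top_left, bottom_right
    return f'{column_letter(left)}{top}:{column_letter(right)}{bottom}'


# The three selections from Figure 13-9.
print(to_range((1, 1), (10, 1)))
print(to_range((3, 2), (6, 4)))
print(to_range((5, 3), (5, 3)))
```

The first pair covers A1 through A10, the second covers B3 through D6, and the third is the single cell C5.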
Enter this interactive shell example to create a bar chart and add it to a spreadsheet:
>>> import openpyxl
>>> wb = openpyxl.Workbook()
>>> sheet = wb.active
>>> for i in range(1, 11): # create some data in column A
...     sheet['A' + str(i)] = i
...
>>> refObj = openpyxl.chart.Reference(sheet, min_col=1, min_row=1, max_col=1,
... max_row=10)
>>> seriesObj = openpyxl.chart.Series(refObj, title='First series')
>>> chartObj = openpyxl.chart.BarChart()
>>> chartObj.title = 'My Chart'
>>> chartObj.append(seriesObj)
>>> sheet.add_chart(chartObj, 'C5')
>>> wb.save('sampleChart.xlsx')
This will produce a spreadsheet similar to Figure 13-10.
Figure 13-10: Spreadsheet with chart added
We've created a bar chart by calling openpyxl.chart.BarChart(). You can also create line charts, scatter charts, and pie charts by calling openpyxl.chart.LineChart(), openpyxl.chart.ScatterChart(), and openpyxl.chart.PieChart().
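Switching chart types changes only the constructor call; the other steps stay the same. As a minimal sketch, here is the same workflow producing a line chart. The data (the squares of 1 through 10) and the titles are made up for illustration, and the workbook is left unsaved.

```python
import openpyxl
from openpyxl.chart import LineChart, Reference, Series

wb = openpyxl.Workbook()
sheet = wb.active
for i in range(1, 11):          # made-up data: squares of 1 through 10
    sheet['A' + str(i)] = i * i

# Step 1: a Reference to the rectangular selection A1:A10.
ref_obj = Reference(sheet, min_col=1, min_row=1, max_col=1, max_row=10)
# Step 2: a Series built from that Reference.
series_obj = Series(ref_obj, title='Squares')

# Steps 3 and 4: create the chart and append the series to it.
chart_obj = LineChart()
chart_obj.title = 'My Line Chart'
chart_obj.append(series_obj)

# Step 5: place the chart with its top-left corner at C5.
sheet.add_chart(chart_obj, 'C5')
print(len(chart_obj.series))
```

Calling wb.save() with a filename would then write the workbook, chart included, to disk.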
Summary
Often, the hard part of processing information is not the processing itself, but simply converting the data into a format suitable for the program. But once the spreadsheet is loaded into Python, extracting and manipulating the data is much faster than doing it by hand.
You can also generate spreadsheets as output from your programs. So if a colleague needs your text file or PDF of thousands of sales contacts transferred to a spreadsheet file, you won't have to tediously copy and paste it all into Excel.
Equipped with the openpyxl module and some programming knowledge, you'll find processing even the biggest spreadsheets a piece of cake.
In the next chapter, we'll look at how to use Python to interact with another spreadsheet program: the popular online Google Sheets app.
Practice Questions
For the following questions, imagine you have a Workbook object in the variable wb, a Worksheet object in sheet, a Cell object in cell, a Comment object in comm, and an Image object in img.
- What does the openpyxl.load_workbook() function return?
- What does the wb.sheetnames workbook attribute contain?
- How would you retrieve the Worksheet object for a sheet named 'Sheet1'?
- How would you retrieve the Worksheet object for the workbook's active sheet?
- How would you retrieve the value in the cell C5?
- How would you set the value in the cell C5 to "Hello"?
- How would you retrieve the cell's row and column as integers?
- What do the sheet.max_column and sheet.max_row sheet attributes hold, and what is the data type of these attributes?
- If you needed to get the integer index for column 'M', what function would you need to call?
- If you needed to get the string name for column 14, what function would you need to call?
- How can you retrieve a tuple of all the Cell objects from A1 to F1?
- How would you save the workbook to the filename example.xlsx?
- How do you set a formula in a cell?
- If you want to retrieve the result of a cell's formula instead of the cell's formula itself, what must you do first?
- How would you set the height of row 5 to 100?
- How would you hide column C?
- What is a freeze pane?
- What five functions and methods do you have to call to create a bar chart?
Practice Projects
For practice, write programs that perform the following tasks.
Multiplication Table Maker
Create a program multiplicationTable.py that takes a number N from the command line and creates an N × N multiplication table in an Excel spreadsheet. For example, when the program is run like this:
py multiplicationTable.py 6
... it should create a spreadsheet that looks like Figure 13-11.
Figure 13-11: Multiplication table generated in spreadsheet
Row 1 and column A should be used for labels and should be in bold.
Blank Row Inserter
Create a program blankRowInserter.py that takes two integers and a filename string as command-line arguments. Let's call the first integer N and the second integer M. Starting at row N, the program should insert M blank rows into the spreadsheet. For example, when the program is run like this:
python blankRowInserter.py 3 2 myProduce.xlsx
...the "before" and "after" spreadsheets should look like Figure 13-12.
Figure 13-12: Before (left) and after (right) the two blank rows are inserted at row 3
You can write this program by reading in the contents of the spreadsheet. Then, when writing out the new spreadsheet, use a for loop to copy the rows above row N unchanged. For row N and every row after it, add M to the row number in the output spreadsheet.
Spreadsheet Cell Inverter
Write a program to reverse the rows and columns of cells in a spreadsheet. For example, the value in row 5, column 3 will be in row 3, column 5 (and vice versa). This should be done for all cells in the spreadsheet. For example, a "before" and "after" spreadsheet would look similar to Figure 13-13.
Figure 13-13: Spreadsheet before (top) and after (bottom) inversion
You can write this program by using nested for loops to read the spreadsheet's data into a list of lists data structure. This data structure could have sheetData[x][y] for the cell at column x and row y. Then, when writing out the new spreadsheet, use sheetData[y][x] for the cell at column x and row y.
Text Files to Spreadsheet
Write a program to read in the contents of several text files (you can make the text files yourself) and insert those contents into a spreadsheet, with one line of text per row. The lines of the first text file will be in the cells of column A, the lines of the second text file will be in the cells of column B, and so on.
Use the readlines() File object method to return a list of strings, one string per line in the file. For the first file, output the first line to column 1, row 1. The second line should be written to column 1, row 2, and so on. The next file that is read with readlines() will be written to column 2, the next file to column 3, and so on.
Spreadsheet to Text Files
Write a program that performs the tasks of the previous program in reverse order: the program should open a spreadsheet and write the cells of column A into one text file, the cells of column B into another text file, and so on.