Automate the Boring Stuff with Python, Second Edition: Chapter 13, Working with Excel Spreadsheets

Original: https://automatetheboringstuff.com/2e/chapter13/

While we don't often think of spreadsheets as programming tools, nearly everyone uses them to organize information into two-dimensional data structures, perform calculations with formulas, and produce output in the form of graphs. In the next two chapters, we'll integrate Python into two popular spreadsheet applications: Microsoft Excel and Google Sheets.

Excel is a popular and powerful spreadsheet application for Windows. The openpyxl module allows your Python programs to read and modify Excel spreadsheet files. For example, you may have the tedious task of copying some data from one spreadsheet and pasting it into another. Or you may have to go through thousands of rows and pick out only a small subset of them to make small edits based on some criteria. Or you may have to look through hundreds of spreadsheets of departmental budgets, looking for any deficits. These are exactly the kinds of boring, mindless spreadsheet tasks that Python can do for you.

Although Excel is proprietary software from Microsoft, there are free alternatives that run on Windows, macOS, and Linux. Both LibreOffice Calc and OpenOffice Calc work with Excel's .xlsx spreadsheet file format, which means that the openpyxl module can also handle spreadsheets from these applications. You can download the software from www.libreoffice.org and www.openoffice.org, respectively. Even if you already have Excel installed on your computer, you may find these programs easier to use. However, the screenshots in this chapter are all from Excel 2010 on Windows 10.

Excel document

First, let's review some basic definitions: an Excel spreadsheet document is called a workbook, and a single workbook is saved in a file with the .xlsx extension. Each workbook can contain multiple sheets (also called worksheets). The sheet that the user is currently viewing (or was last viewing before closing Excel) is called the active sheet.

Each sheet has columns (addressed by letters starting at A) and rows (addressed by numbers starting at 1). A box at a particular column and row is called a cell. Each cell can contain a number or text value. The grid of cells containing data makes up a sheet.

Install the openpyxl module

Python does not come with OpenPyXL, so you have to install it. Follow the instructions for installing third-party modules in Appendix A; the name of the module is openpyxl.

This book uses version 2.6.2 of OpenPyXL. It is important to install this version by running pip install --user -U openpyxl==2.6.2, because newer versions of OpenPyXL are not compatible with the information in this book. To test that the installation is correct, enter the following in the interactive shell:

>>> import openpyxl

If the module was installed correctly, there should be no error messages. Remember to import the openpyxl module before running the interactive shell examples in this chapter, otherwise you will get a NameError: name 'openpyxl' is not defined error.

You can find the full documentation for OpenPyXL at openpyxl.readthedocs.org.

Read Excel documents

The examples in this chapter will use a spreadsheet named example.xlsx. You can create the spreadsheet yourself or download it from nostarch.com/automatestuff2. Figure 13-1 shows the tabs for the three default sheets, named Sheet1, Sheet2, and Sheet3, that Excel automatically provides for new workbooks. (The number of default sheets created may vary by operating system and spreadsheet program.)

Figure 13-1: The workbook's sheet tabs are in the lower left corner of Excel.

Sheet 1 in the example file should look like Table 13-1. (If you did not download example.xlsx from the website, you should enter this data into the sheet yourself.)

Table 13-1: The example.xlsx Spreadsheet

     A                       B              C
1    4/5/2015 1:34:02 PM     Apples         73
2    4/5/2015 3:41:23 AM     Cherries       85
3    4/6/2015 12:46:51 PM    Pears          14
4    4/8/2015 8:59:43 AM     Oranges        52
5    4/10/2015 2:07:00 AM    Apples         152
6    4/10/2015 6:10:37 PM    Bananas        23
7    4/10/2015 2:40:46 AM    Strawberries   98

Now that we have our example spreadsheet, let's see how to manipulate it with the openpyxl module.

Open Excel documents with OpenPyXL

Once you have imported the openpyxl module, you can use the openpyxl.load_workbook() function. Enter the following in the interactive shell:

>>> import openpyxl
>>> wb = openpyxl.load_workbook('example.xlsx')
>>> type(wb)
<class 'openpyxl.workbook.workbook.Workbook'>

The openpyxl.load_workbook() function takes a filename and returns a value of the Workbook data type. This Workbook object represents the Excel file, a bit like how a File object represents an opened text file.

Remember that example.xlsx needs to be in the current working directory in order for you to work with it. You can find out what the current working directory is by importing os and calling os.getcwd(), and you can change the current working directory with os.chdir().
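Because a relative filename is looked up in the current working directory, it can help to check where Python is looking before loading the file. The following is a minimal sketch that uses only the standard library; the helper name find_workbook is a hypothetical illustration and is not part of OpenPyXL.

```python
import os
from pathlib import Path


def find_workbook(filename):
    """Return the absolute path of filename in the current working directory.

    find_workbook is a hypothetical helper used only for illustration.
    """
    path = Path(os.getcwd()) / filename
    if not path.is_file():
        raise FileNotFoundError(
            f'{filename} was not found in {os.getcwd()}; '
            'call os.chdir() first or pass a full path instead.'
        )
    return path
```

With a helper like this, a call such as openpyxl.load_workbook(str(find_workbook('example.xlsx'))) would fail with a message that names the folder that was searched, which is usually the first thing to check.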

Get sheet from workbook

By accessing the sheetnames attribute, you can get a list of all sheet names in the workbook. Enter the following in the interactive shell:

>>> import openpyxl
>>> wb = openpyxl.load_workbook('example.xlsx')
>>> wb.sheetnames # The workbook's sheets' names.
['Sheet1', 'Sheet2', 'Sheet3']
>>> sheet = wb['Sheet3'] # Get a sheet from the workbook.
>>> sheet
<Worksheet "Sheet3">
>>> type(sheet)
<class 'openpyxl.worksheet.worksheet.Worksheet'>
>>> sheet.title # Get the sheet's title as a string.
'Sheet3'
>>> anotherSheet = wb.active # Get the active sheet.
>>> anotherSheet
<Worksheet "Sheet1">

Each sheet is represented by a Worksheet object, which you can get by using square brackets with the sheet name string, like a dictionary key. Finally, you can use the active attribute of a Workbook object to get the workbook's active sheet. The active sheet is the sheet that is on top when the workbook is opened in Excel. Once you have a Worksheet object, you can get its name from the title attribute.

Get cells from a worksheet

Once you have a Worksheet object, you can access a Cell object by its name. Enter the following in the interactive shell:

>>> import openpyxl
>>> wb = openpyxl.load_workbook('example.xlsx')
>>> sheet = wb['Sheet1'] # Get a sheet from the workbook.
>>> sheet['A1'] # Get a cell from the sheet.
<Cell 'Sheet1'.A1>
>>> sheet['A1'].value # Get the value from the cell.
datetime.datetime(2015, 4, 5, 13, 34, 2)
>>> c = sheet['B1'] # Get another cell from the sheet.
>>> c.value
'Apples'
>>> # Get the row, column, and value from the cell.
>>> 'Row %s, Column %s is %s' % (c.row, c.column_letter, c.value)
'Row 1, Column B is Apples'
>>> 'Cell %s is %s' % (c.coordinate, c.value)
'Cell B1 is Apples'
>>> sheet['C1'].value
73

The Cell object has a value attribute which, as expected, contains the value stored in that cell. Cell objects also have row, column, column_letter, and coordinate attributes that provide location information for the cell.

Here, accessing the value attribute of the Cell object for cell B1 gives us the string 'Apples'. The row attribute gives us the integer 1, the column_letter attribute gives us 'B', and the coordinate attribute gives us 'B1'. (In OpenPyXL 2.6 and later, the column attribute itself is the integer 2 rather than the letter.)

OpenPyXL will automatically interpret the dates in column A and return them as datetime values rather than strings. The datetime data type is explained further in Chapter 17.

Specifying columns by letter can be tricky to program, especially because after column Z, the columns start using two letters: AA, AB, AC, and so on. As an alternative, you can get a cell by using the sheet's cell() method and passing integers for its row and column keyword arguments. The first row or column integer is 1, not 0. Continue the interactive shell example by typing:

>>> sheet.cell(row=1, column=2)
<Cell 'Sheet1'.B1>
>>> sheet.cell(row=1, column=2).value
'Apples'
>>> for i in range(1, 8, 2): # Go through every other row:
...     print(i, sheet.cell(row=i, column=2).value)
...
1 Apples
3 Pears
5 Apples
7 Strawberries

As you can see, using the sheet's cell() method and passing it row=1 and column=2 gets you the Cell object for cell B1, just as specifying sheet['B1'] did. Then, using the cell() method and its keyword arguments, you can write a for loop to print the values of a series of cells.

Suppose you want to go down column B and print the value in every cell with an odd row number. By passing 2 for the step parameter of the range() function, you can get cells from every other row (in this case, all the odd-numbered rows). The for loop's i variable is passed as the row keyword argument to the cell() method, while 2 is always passed as the column keyword argument. Note that the integer 2 is passed, not the string 'B'.

You can use the Worksheet object's max_row and max_column attributes to determine the size of the sheet. Enter the following in the interactive shell:

>>> import openpyxl
>>> wb = openpyxl.load_workbook('example.xlsx')
>>> sheet = wb['Sheet1']
>>> sheet.max_row # Get the highest row number.
7
>>> sheet.max_column # Get the highest column number.
3

Note that the max_column attribute is an integer, not the letter that appears in Excel.

Conversion between column letters and numbers

To convert from letters to numbers, call the openpyxl.utils.column_index_from_string() function. To convert from numbers to letters, call the openpyxl.utils.get_column_letter() function. Enter the following in the interactive shell:

>>> import openpyxl
>>> from openpyxl.utils import get_column_letter, column_index_from_string
>>> get_column_letter(1) # Translate column 1 to a letter.
'A'
>>> get_column_letter(2)
'B'
>>> get_column_letter(27)
'AA'
>>> get_column_letter(900)
'AHP'
>>> wb = openpyxl.load_workbook('example.xlsx')
>>> sheet = wb['Sheet1']
>>> get_column_letter(sheet.max_column)
'C'
>>> column_index_from_string('A') # Get A's number.
1
>>> column_index_from_string('AA')
27

After importing these two functions from the openpyxl.utils module, you can call get_column_letter() and pass it an integer like 27 to figure out what the letter name of the 27th column is. The function column_index_from_string() does the opposite: you pass it the letter name of a column, and it tells you what number that column is. Using these functions does not require loading a workbook. If you want, you can load a workbook, get a Worksheet object, and use a Worksheet attribute like max_column to get an integer. You can then pass that integer to get_column_letter().
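To see why column 27 becomes 'AA', it may help to look at the arithmetic behind the conversion. The sketch below is a standard-library illustration of the idea (a base-26 scheme with no zero digit), not OpenPyXL's actual implementation; the function names column_letter and column_index are made up for this example.

```python
import string


def column_letter(index):
    """Convert a 1-based column number to letters (1 -> 'A', 27 -> 'AA')."""
    if index < 1:
        raise ValueError('column numbers start at 1')
    letters = ''
    while index > 0:
        # Shift to 0-based so that 26 maps to 'Z' instead of rolling over.
        index, remainder = divmod(index - 1, 26)
        letters = string.ascii_uppercase[remainder] + letters
    return letters


def column_index(letters):
    """Convert column letters to a 1-based column number ('AA' -> 27)."""
    if not letters:
        raise ValueError('at least one column letter is required')
    index = 0
    for char in letters.upper():
        if char not in string.ascii_uppercase:
            raise ValueError(f'invalid column letter: {char!r}')
        index = index * 26 + (ord(char) - ord('A') + 1)
    return index
```

In real programs the functions in openpyxl.utils are the better choice; this version only shows that each extra letter multiplies the running total by 26.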

Get rows and columns from worksheet

You can slice Worksheet objects to get all the Cell objects in a row, column, or rectangular area of the spreadsheet. You can then loop over all the cells in the slice. Enter the following in the interactive shell:

   >>> import openpyxl
   >>> wb = openpyxl.load_workbook('example.xlsx')
   >>> sheet = wb['Sheet1']
   >>> tuple(sheet['A1':'C3']) # Get all cells from A1 to C3.
   ((<Cell 'Sheet1'.A1>, <Cell 'Sheet1'.B1>, <Cell 'Sheet1'.C1>), (<Cell
   'Sheet1'.A2>, <Cell 'Sheet1'.B2>, <Cell 'Sheet1'.C2>), (<Cell 'Sheet1'.A3>,
   <Cell 'Sheet1'.B3>, <Cell 'Sheet1'.C3>))
   >>> for rowOfCellObjects in sheet['A1':'C3']: # ➊
   ...     for cellObj in rowOfCellObjects: # ➋
   ...         print(cellObj.coordinate, cellObj.value)
   ...     print('--- END OF ROW ---')
   A1 2015-04-05 13:34:02
   B1 Apples
   C1 73
   --- END OF ROW ---
   A2 2015-04-05 03:41:23
   B2 Cherries
   C2 85
   --- END OF ROW ---
   A3 2015-04-06 12:46:51
   B3 Pears
   C3 14
   --- END OF ROW ---

Here, we specify that we want the Cell objects in the rectangular area from A1 to C3, and we get a Generator object containing the Cell objects in that area. To help us visualize this Generator object, we can use tuple() on it to display its Cell objects in a tuple.

This tuple contains three tuples: one for each row, from the top of the desired area to the bottom. Each of these three inner tuples contains the Cell objects in one row of our desired area, from the leftmost cell to the rightmost. So overall, our slice of the sheet contains all the Cell objects in the area from A1 to C3, starting from the top-left cell and ending with the bottom-right cell.

To print the values of each cell in the area, we use two for loops. The outer for loop goes over each row in the slice ➊. Then, for each row, the nested for loop goes through each cell in that row ➋.

To access the values of cells in a particular row or column, you can also use a Worksheet object's rows and columns attributes. These attributes must be converted to lists with the list() function before you can use square brackets and an index with them. Enter the following in the interactive shell:

>>> import openpyxl
>>> wb = openpyxl.load_workbook('example.xlsx')
>>> sheet = wb.active
>>> list(sheet.columns)[1] # Get second column's cells.
(<Cell 'Sheet1'.B1>, <Cell 'Sheet1'.B2>, <Cell 'Sheet1'.B3>, <Cell 'Sheet1'.
B4>, <Cell 'Sheet1'.B5>, <Cell 'Sheet1'.B6>, <Cell 'Sheet1'.B7>)
>>> for cellObj in list(sheet.columns)[1]:
...     print(cellObj.value)
...
Apples
Cherries
Pears
Oranges
Apples
Bananas
Strawberries

Using the rows attribute on a Worksheet object will give you a tuple of tuples. Each inner tuple represents a row and contains the Cell objects in that row. The columns attribute also gives you a tuple of tuples, with each inner tuple containing the Cell objects in a particular column. For example.xlsx, since there are 7 rows and 3 columns, rows gives us a tuple of 7 tuples (each containing 3 Cell objects), and columns gives us a tuple of 3 tuples (each containing 7 Cell objects).

To access one particular tuple, refer to it by its index in the larger tuple. For example, to get the tuple that represents column B, use list(sheet.columns)[1]. To get the tuple containing the Cell objects in column A, use list(sheet.columns)[0]. Once you have a tuple representing one row or column, you can loop over its Cell objects and print their values.

Workbooks, Sheets, Cells

As a quick recap, here's a list of all the functions, methods, and data types involved in reading cells from a spreadsheet file:

  1. Import the openpyxl module.
  2. Call the openpyxl.load_workbook() function.
  3. Get a Workbook object.
  4. Use the active or sheetnames attribute.
  5. Get a Worksheet object.
  6. Use indexing or the cell() sheet method with row and column keyword arguments.
  7. Get a Cell object.
  8. Read the Cell object's value attribute.
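The steps above can be sketched end to end as follows. So that the sketch does not depend on example.xlsx, it first builds a tiny stand-in workbook in a temporary folder; the filename recap_demo.xlsx and the two data rows are assumptions made purely for illustration.

```python
import os
import tempfile

import openpyxl                              # Step 1: import the module.

# Build a small stand-in file (hypothetical name and data).
path = os.path.join(tempfile.mkdtemp(), 'recap_demo.xlsx')
wb = openpyxl.Workbook()
wb.active.append(['Apples', 73])
wb.active.append(['Cherries', 85])
wb.save(path)

wb = openpyxl.load_workbook(path)            # Steps 2-3: get a Workbook object.
sheet = wb[wb.sheetnames[0]]                 # Steps 4-5: get a Worksheet object.
by_name = sheet['A1']                        # Steps 6-7: a Cell via indexing...
by_number = sheet.cell(row=2, column=2)      # ...or via cell() with row/column.
print(by_name.value, by_number.value)        # Step 8: read the value attribute.
```

Both ways of reaching a cell return the same kind of Cell object, so the choice between them mostly depends on whether the program is working with coordinate strings or with integer row and column numbers.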

Project: Reading data from a spreadsheet

Suppose you have a spreadsheet of data from the 2010 US Census, and you have the tedious task of going through its thousands of rows to count both the total population and the number of census tracts for each county. (A census tract is simply a geographic area defined for the purposes of the census.) Each row represents a single census tract. We'll name the spreadsheet file censuspopdata.xlsx, and you can download it from nostarch.com/automatestuff2. Its contents look like Figure 13-2.

Figure 13-2: The censuspopdata.xlsx spreadsheet

Although Excel can calculate the sum of multiple selected cells, you would still have to select the cells for each of the 3,000-plus counties. Even if calculating one county's population by hand takes only a few seconds, doing so for the entire spreadsheet would take hours.

In this project, you'll write a script that reads a census spreadsheet file and calculates statistics for each county in seconds.

This is what your program does:

  1. Reads the data from the Excel spreadsheet
  2. Counts the number of census tracts in each county
  3. Counts the total population of each county
  4. Prints the results

This means your code needs to do the following:

  1. Open and read the cells of an Excel document with the openpyxl module.
  2. Calculate all the tract and population data and store it in a data structure.
  3. Write the data structure to a text file with the .py extension using the pprint module.

Step One: Reading Spreadsheet Data

There is only one sheet in the censuspopdata.xlsx spreadsheet, named 'Population by Census Tract', and each row holds the data for a single census tract. The columns are the tract number (A), the state abbreviation (B), the county name (C), and the population of the tract (D).

Open a new file editor tab and enter the following code. Save the file as readCensusExcel.py.

   #! python3
   # readCensusExcel.py - Tabulates population and number of census tracts for
   # each county.
   import openpyxl, pprint # ➊
   print('Opening workbook...')
   wb = openpyxl.load_workbook('censuspopdata.xlsx') # ➋
   sheet = wb['Population by Census Tract'] # ➌
   countyData = {}
   # TODO: Fill in countyData with each county's population and tracts.
   print('Reading rows...')
   for row in range(2, sheet.max_row + 1): # ➍
       # Each row in the spreadsheet has data for one census tract.
       state  = sheet['B' + str(row)].value
       county = sheet['C' + str(row)].value
       pop    = sheet['D' + str(row)].value
   # TODO: Open a new text file and write the contents of countyData to it.

This code imports the openpyxl module, as well as the pprint module that you'll use to print the final county data ➊. It then opens the censuspopdata.xlsx file ➋, gets the sheet with the census data ➌, and starts iterating over its rows ➍.

Note that you have also created a variable named countyData, which will contain the populations and numbers of tracts you calculate for each county. However, before you can store anything in it, you should determine exactly how you will structure the data inside it.

Step 2: Populate the data structure

The data structure stored in countyData will be a dictionary with state abbreviations as its keys. Each state abbreviation will map to another dictionary, whose keys are strings of the county names in that state. Each county name will in turn map to a dictionary with just two keys, 'tracts' and 'pop'. These keys map to the number of census tracts and the population for the county. For example, the dictionary will look similar to this:

{'AK': {'Aleutians East': {'pop': 3141, 'tracts': 1},
        'Aleutians West': {'pop': 5561, 'tracts': 2},
        'Anchorage': {'pop': 291826, 'tracts': 55},
        'Bethel': {'pop': 17013, 'tracts': 3},
        'Bristol Bay': {'pop': 997, 'tracts': 1},
        --snip--

If the previous dictionary were stored in countyData, the following expressions would evaluate like this:

>>> countyData['AK']['Anchorage']['pop']
291826
>>> countyData['AK']['Anchorage']['tracts']
55

More generally, the countyData dictionary's keys will look like this:

countyData[state abbrev][county]['tracts']
countyData[state abbrev][county]['pop']

Now that you know how countyData will be structured, you can write the code that fills it with the county data. Add the following code to the bottom of your program:

#! python3
# readCensusExcel.py - Tabulates population and number of census tracts for
# each county.
--snip--
for row in range(2, sheet.max_row + 1):
     # Each row in the spreadsheet has data for one census tract.
     state  = sheet['B' + str(row)].value
     county = sheet['C' + str(row)].value
     pop    = sheet['D' + str(row)].value
     # Make sure the key for this state exists.
     countyData.setdefault(state, {}) # ➊
     # Make sure the key for this county in this state exists.
     countyData[state].setdefault(county, {'tracts': 0, 'pop': 0}) # ➋
     # Each row represents one census tract, so increment by one.
     countyData[state][county]['tracts'] += 1 # ➌
     # Increase the county pop by the pop in this census tract.
     countyData[state][county]['pop'] += int(pop) # ➍
# TODO: Open a new text file and write the contents of countyData to it.

The last two lines of code perform the actual calculation work, incrementing the value for tracts ➌ and increasing the value for pop ➍ for the current county on each iteration of the for loop.

The other code is there because you cannot add a county dictionary as the value for a state abbreviation key until the key itself exists in countyData. (That is, countyData['AK']['Anchorage']['tracts'] += 1 will cause an error if the 'AK' key does not exist yet.) To make sure the state abbreviation key exists in your data structure, you need to call the setdefault() method to set a value for state if one does not already exist ➊.

Just as the countyData dictionary needs a dictionary as the value for each state abbreviation key, each of those dictionaries needs its own dictionary as the value for each county key ➋. Each of those dictionaries in turn needs the keys 'tracts' and 'pop', starting with the integer value 0. (If you lose track of the dictionary structure, look back at the example dictionary at the start of this section.)

Since setdefault() does nothing if the key already exists, you can call it on every iteration of the for loop without a problem.
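The setdefault() pattern can be tried without the census spreadsheet at all. The sketch below runs the same bookkeeping on a few made-up (state, county, population) tuples; the sample figures are hypothetical and only stand in for values that would be read from columns B, C, and D.

```python
import pprint

# Hypothetical sample rows in (state, county, population) order.
sample_rows = [
    ('AK', 'Anchorage', 3500),
    ('AK', 'Anchorage', 4200),
    ('AK', 'Bethel', 1800),
    ('AL', 'Autauga', 2600),
]

county_data = {}
for state, county, pop in sample_rows:
    # Create the state's dictionary the first time the state appears.
    county_data.setdefault(state, {})
    # Create the county's counters the first time the county appears.
    county_data[state].setdefault(county, {'tracts': 0, 'pop': 0})
    county_data[state][county]['tracts'] += 1
    county_data[state][county]['pop'] += int(pop)

pprint.pprint(county_data)
```

The second Anchorage row finds both keys already present, so the two setdefault() calls leave the existing counters alone and only the += lines change anything.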

Step 3: Write the result to a file

After the for loop has finished, the countyData dictionary will contain all of the population and tract information keyed by county and state. At this point, you could write more code to save it to a text file or another Excel spreadsheet. For now, let's just use the pprint.pformat() function to write the countyData dictionary value as one big string to a file named census2010.py. Add the following code to the bottom of your program (make sure to keep it unindented so that it stays outside the for loop):

#! python3
# readCensusExcel.py - Tabulates population and number of census tracts for
# each county.
--snip--
for row in range(2, sheet.max_row + 1):
--snip--
# Open a new text file and write the contents of countyData to it.
print('Writing results...')
resultFile = open('census2010.py', 'w')
resultFile.write('allData = ' + pprint.pformat(countyData))
resultFile.close()
print('Done.')

The pprint.pformat() function produces a string that is itself formatted as valid Python code. By outputting it to a text file named census2010.py, you've generated a Python program from your Python program! This may seem complicated, but the advantage is that you can now import census2010.py just like any other Python module. In the interactive shell, change the current working directory to the folder containing your newly created census2010.py file, then import it:

>>> import os
>>> import census2010
>>> census2010.allData['AK']['Anchorage']
{'pop': 291826, 'tracts': 55}
>>> anchoragePop = census2010.allData['AK']['Anchorage']['pop']
>>> print('The 2010 population of Anchorage was ' + str(anchoragePop))
The 2010 population of Anchorage was 291826

The readCensusExcel.py program was throwaway code: once you have its results saved to census2010.py, you won't need to run the program again. Whenever you need the county data, you can just run import census2010.

Computing these numbers by hand would take hours; this program does it in seconds. Using OpenPyXL, you can effortlessly extract information saved in Excel spreadsheets and perform calculations on it. You can download the full program from nostarch.com/automatestuff2.
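Importing the generated file runs it as Python code. If you would rather read the data back without executing anything, one alternative (a sketch, not the approach used in this chapter) is to parse the text with ast.literal_eval() from the standard library. The miniature dictionary and the census_demo.py filename below are assumptions for illustration only.

```python
import ast
import os
import pprint
import tempfile

# Hypothetical miniature of countyData, used only for this illustration.
county_data = {'AK': {'Anchorage': {'pop': 291826, 'tracts': 55}}}

result_path = os.path.join(tempfile.mkdtemp(), 'census_demo.py')
# The with statement closes the file even if write() raises an exception.
with open(result_path, 'w') as result_file:
    result_file.write('allData = ' + pprint.pformat(county_data))

# Read the text back and parse only the literal after 'allData = '.
with open(result_path) as result_file:
    source = result_file.read()
restored = ast.literal_eval(source.split('=', 1)[1].strip())
print(restored == county_data)
```

This works because pprint.pformat() emits a plain dictionary literal; literal_eval() accepts only such literals and refuses function calls or other executable code.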

Ideas for Similar Programs

Many businesses and offices use Excel to store various types of data, and it's not uncommon for spreadsheets to become large and unwieldy. Any program that parses an Excel spreadsheet has a similar structure: it loads the spreadsheet file, prepares some variables or data structures, and then iterates over each row in the spreadsheet. Such a program can do the following:

  • Compare data across multiple rows in a spreadsheet.
  • Open multiple Excel files and compare data between spreadsheets.
  • Check the spreadsheet for blank rows or invalid data and alert the user if so.
  • Read data from a spreadsheet and use it as input to a Python program.
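As one example of the third idea, the following sketch flags blank rows and rows whose last value is not a number. It works on plain tuples of cell values, so it runs without a spreadsheet; in a real program the rows might come from something like sheet.iter_rows(values_only=True) in OpenPyXL 2.6 or later. The function name find_problem_rows, the rule that the last column holds a quantity, and the sample data are all assumptions made for this illustration.

```python
def find_problem_rows(rows, first_row_number=1):
    """Return (row_number, reason) pairs for blank rows and bad quantities.

    rows is any iterable of tuples of cell values; the last item of each
    tuple is assumed (for this sketch) to be a numeric quantity.
    """
    problems = []
    for row_number, row in enumerate(rows, start=first_row_number):
        if all(value is None or value == '' for value in row):
            problems.append((row_number, 'blank row'))
        elif isinstance(row[-1], bool) or not isinstance(row[-1], (int, float)):
            problems.append((row_number, 'non-numeric quantity'))
    return problems


# Hypothetical sample data shaped like rows read from a spreadsheet.
sample = [('Apples', 73), (None, None), ('Pears', 'fourteen'), ('Oranges', 52)]
print(find_problem_rows(sample))
```

Keeping the check separate from the spreadsheet-reading code makes it easy to test with hand-written rows before pointing it at a large file.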

Write an Excel document

OpenPyXL also provides methods for writing data, which means your program can create and edit spreadsheet files. Using Python, it is very simple to create spreadsheets with thousands of rows of data.

Create and save an Excel document

Call the openpyxl.Workbook() function to create a new, blank Workbook object. Enter the following in the interactive shell:

>>> import openpyxl
>>> wb = openpyxl.Workbook() # Create a blank workbook.
>>> wb.sheetnames # It starts with one sheet.
['Sheet']
>>> sheet = wb.active
>>> sheet.title
'Sheet'
>>> sheet.title = 'Spam Bacon Eggs Sheet' # Change title.
>>> wb.sheetnames
['Spam Bacon Eggs Sheet']

The workbook will start off with a single sheet named Sheet. You can change the name of the sheet by storing a new string in its title attribute.

Whenever you modify the Workbook object or its sheets and cells, the spreadsheet file will not be saved until you call the save() workbook method. Enter the following in the interactive shell (with example.xlsx in the current working directory):

>>> import openpyxl
>>> wb = openpyxl.load_workbook('example.xlsx')
>>> sheet = wb.active
>>> sheet.title = 'Spam Spam Spam'
>>> wb.save('example_copy.xlsx') # Save the workbook.

Here, we change the name of our sheet. To save our changes, we pass a filename as a string to the save() method. Passing a different filename than the original, such as 'example_copy.xlsx', saves the changes to a copy of the spreadsheet.

Whenever you edit a spreadsheet loaded from a file, you should save the new, edited spreadsheet with a different filename than the original file. This way, you can still use the original spreadsheet file in case an error in the code causes the newly saved file to contain incorrect or corrupt data.
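One way to make this habit automatic is to derive the new filename from the original. The following is a small standard-library sketch; copy_filename is a hypothetical helper written for this illustration and is not an OpenPyXL function.

```python
from pathlib import Path


def copy_filename(original, suffix='_copy'):
    """Return a sibling filename such as example_copy.xlsx for example.xlsx.

    copy_filename is a hypothetical helper, not part of OpenPyXL.
    """
    path = Path(original)
    candidate = path.with_name(path.stem + suffix + path.suffix)
    counter = 2
    # Avoid overwriting an earlier copy by appending a counter if needed.
    while candidate.exists():
        candidate = path.with_name(f'{path.stem}{suffix}{counter}{path.suffix}')
        counter += 1
    return str(candidate)
```

A call such as wb.save(copy_filename('example.xlsx')) would then write example_copy.xlsx the first time and example_copy2.xlsx the next, leaving the original file untouched.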

Create and delete worksheets

Sheets can be added to and removed from a workbook with the create_sheet() method and the del operator. Enter the following in the interactive shell:

>>> import openpyxl
>>> wb = openpyxl.Workbook()
>>> wb.sheetnames
['Sheet']
>>> wb.create_sheet() # Add a new sheet.
<Worksheet "Sheet1">
>>> wb.sheetnames
['Sheet', 'Sheet1']
>>> # Create a new sheet at index 0.
>>> wb.create_sheet(index=0, title='First Sheet')
<Worksheet "First Sheet">
>>> wb.sheetnames
['First Sheet', 'Sheet', 'Sheet1']
>>> wb.create_sheet(index=2, title='Middle Sheet')
<Worksheet "Middle Sheet">
>>> wb.sheetnames
['First Sheet', 'Sheet', 'Middle Sheet', 'Sheet1']

The create_sheet() method returns a new Worksheet object named SheetX, which by default is set to be the last sheet in the workbook. Optionally, the index and name of the new sheet can be specified with the index and title keyword arguments.

Continue the previous example by entering:

>>> wb.sheetnames
['First Sheet', 'Sheet', 'Middle Sheet', 'Sheet1']
>>> del wb['Middle Sheet']
>>> del wb['Sheet1']
>>> wb.sheetnames
['First Sheet', 'Sheet']

You can use the del operator to delete a sheet from a workbook, just like you can use it to delete a key-value pair from a dictionary.

After adding or removing sheets in the workbook, remember to call the save() method to save the changes.

Write values to cells

Writing a value to a cell is very similar to writing a value to a key in a dictionary. Enter the following in the interactive shell:

>>> import openpyxl
>>> wb = openpyxl.Workbook()
>>> sheet = wb['Sheet']
>>> sheet['A1'] = 'Hello, world!' # Edit the cell's value.
>>> sheet['A1'].value
'Hello, world!'

If you have the cell's coordinate as a string, you can use it just like a dictionary key on the Worksheet object to specify which cell to write to.
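When the data is already in a Python list, it is often easier to address cells by integer row and column than to build coordinate strings. The sketch below writes a small table with the cell() method; the sample rows are hypothetical.

```python
import openpyxl

# Hypothetical sample data; the first tuple is a header row.
rows = [
    ('Produce', 'Pounds'),
    ('Apples', 73),
    ('Cherries', 85),
]

wb = openpyxl.Workbook()
sheet = wb.active
for row_number, row in enumerate(rows, start=1):          # Rows start at 1.
    for column_number, value in enumerate(row, start=1):  # So do columns.
        sheet.cell(row=row_number, column=column_number).value = value

print(sheet['A2'].value, sheet['B3'].value)
```

Starting both enumerate() calls at 1 matches the spreadsheet's numbering, so no off-by-one adjustment is needed inside the loop. Nothing is written to disk until save() is called on the workbook.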

Project: Update Spreadsheet

In this project, you'll write a program to update cells in a spreadsheet of produce sales. Your program will look through the spreadsheet, find specific kinds of produce, and update their prices. Download this spreadsheet from https://nostarch.com/automatestuff2/. Figure 13-3 shows what the spreadsheet looks like.

Figure 13-3: A spreadsheet of produce sales

Each row represents an individual sale. The columns are the type of produce sold (A), the cost per pound of that produce (B), the number of pounds sold (C), and the total revenue from the sale (D). The TOTAL column is set to the Excel formula =ROUND(B3 * C3, 2), which multiplies the cost per pound by the number of pounds sold and rounds the result to the nearest cent. With this formula, the cells in the TOTAL column will update automatically if column B or C changes.

Now imagine that the prices of garlic, celery, and lemons were entered incorrectly, leaving you with the boring task of going through thousands of rows in this spreadsheet to update the cost per pound for every garlic, celery, and lemon row. You can't do a simple find-and-replace on the price, because there might be other items with the same price that you don't want to "correct" by mistake. For thousands of rows, doing it by hand would take hours. But you can write a program to do it in seconds.

Your program does the following:

  1. Loops over all the rows
  2. Changes the price if the row is for garlic, celery, or lemons

This means your code needs to do the following:

  1. Open the spreadsheet file.
  2. For each row, check whether the value in column A is Celery, Garlic, or Lemon.
  3. If it is, update the price in column B.
  4. Save the spreadsheet to a new file (so that you don't lose the old spreadsheet, just in case).

Step 1: Build a data structure with update information

The prices you need to update are as follows:

  • Celery 1.19
  • Garlic 3.07
  • Lemon 1.27

You can write code like this:

if produceName == 'Celery':
    cellObj = 1.19
if produceName == 'Garlic':
    cellObj = 3.07
if produceName == 'Lemon':
    cellObj = 1.27

It's kind of unsightly to hardcode produce and updated price data like this. If you need to update the spreadsheet again with different prices or different products, you will have to modify a lot of code. Every time you modify the code, you run the risk of introducing bugs.

A more flexible solution would be to store the correct price information in a dictionary and write code to use this data structure. In a new file editor tab, enter the following code:

#! python3
# updateProduce.py - Corrects costs in produce sales spreadsheet.
import openpyxl
wb = openpyxl.load_workbook('produceSales.xlsx')
sheet = wb['Sheet']
# The produce types and their updated prices
PRICE_UPDATES = {'Garlic': 3.07,
                 'Celery': 1.19,
                 'Lemon': 1.27}
# TODO: Loop through the rows and update the prices.

Save this as updateProduce.py. If you need to update the spreadsheet again, you only need to update the PRICE_UPDATES dictionary, not any other code.

Step Two: Check All Rows and Update Incorrect Prices

The next part of the program iterates over all the rows in the spreadsheet. Add the following code to the bottom of updateProduce.py:

   #! python3
   # updateProduce.py - Corrects costs in produce sales spreadsheet.
   --snip--
   # Loop through the rows and update the prices.
   for rowNum in range(2, sheet.max_row + 1):    # skip the first row # ➊
       produceName = sheet.cell(row=rowNum, column=1).value # ➋
       if produceName in PRICE_UPDATES: # ➌
          sheet.cell(row=rowNum, column=2).value = PRICE_UPDATES[produceName]
   wb.save('updatedProduceSales.xlsx') # ➍

We loop through the rows starting at row 2, since row 1 is just the header ➊. The value of the cell in column 1 (that is, column A) will be stored in the variable produceName ➋. If produceName exists as a key in the PRICE_UPDATES dictionary ➌, then you know this is a row whose price must be corrected. The correct price will be in PRICE_UPDATES[produceName].

Notice how clean using PRICE_UPDATES makes the code. Only one if statement, rather than code like if produceName == 'Garlic':, is needed for every type of produce to update. And since the code uses the PRICE_UPDATES dictionary instead of hardcoding the produce names and updated costs into the for loop, only the PRICE_UPDATES dictionary needs to be modified, not the code, if the produce sales spreadsheet requires additional changes.

After going through the entire spreadsheet and making changes, the code saves the Workbook object to updatedProduceSales.xlsx ➍. It doesn't overwrite the old spreadsheet, just in case your program has a bug and the updated spreadsheet is wrong. After checking that the updated spreadsheet looks correct, you can delete the old spreadsheet.

You can download the full source code of this program from nostarch.com/automatestuff2.
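To try the update logic without the downloaded file, the loop can be run against a small workbook built in memory. In the sketch below the header names, rows, and starting prices are hypothetical stand-ins for produceSales.xlsx. Note that range() excludes its stop value, so the loop uses sheet.max_row + 1 to make sure the last row is visited.

```python
import openpyxl

PRICE_UPDATES = {'Garlic': 3.07, 'Celery': 1.19, 'Lemon': 1.27}

# Build a small stand-in sheet; the rows and prices are hypothetical.
wb = openpyxl.Workbook()
sheet = wb.active
sheet.append(['PRODUCE', 'COST PER POUND', 'POUNDS SOLD', 'TOTAL'])
sheet.append(['Potatoes', 0.86, 21.6, '=ROUND(B2 * C2, 2)'])
sheet.append(['Garlic', 1.00, 5.0, '=ROUND(B3 * C3, 2)'])
sheet.append(['Lemon', 1.00, 2.5, '=ROUND(B4 * C4, 2)'])

# Skip the header in row 1 and include the final row of data.
for row_num in range(2, sheet.max_row + 1):
    produce_name = sheet.cell(row=row_num, column=1).value
    if produce_name in PRICE_UPDATES:
        sheet.cell(row=row_num, column=2).value = PRICE_UPDATES[produce_name]

print([sheet.cell(row=r, column=2).value for r in range(2, sheet.max_row + 1)])
```

The formulas in column D are stored as text and left untouched; a spreadsheet application recalculates them when the saved file is opened.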

Ideas for Similar Programs

Since many office workers use Excel spreadsheets all the time, a program that can automatically edit and write Excel files could be very useful. Such a program can do the following:

  • Read data from one spreadsheet and write it to parts of other spreadsheets.
  • Read data from a website, text file, or clipboard, and write it to a spreadsheet.
  • Automatically "clean" data in spreadsheets. For example, it can use regular expressions to read phone numbers in multiple formats and compile them into a single standard format.
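The "cleaning" idea in the last bullet could be sketched as follows. The regular expression assumes ten-digit, US-style phone numbers with optional separators, and normalize_phone() is a hypothetical helper written for this example, not part of OpenPyXL.

```python
import re
import openpyxl

# Assumed pattern: ten-digit US-style numbers with optional separators.
PHONE_RE = re.compile(r'\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})')

def normalize_phone(text):
    """Return text in 415-555-1234 form, or unchanged if it does not match."""
    match = PHONE_RE.fullmatch(str(text).strip())
    if match is None:
        return text
    return '-'.join(match.groups())

# Made-up data: a header followed by numbers in several formats.
wb = openpyxl.Workbook()
sheet = wb.active
for raw in ['PHONE', '(415) 555-1234', '415.555.9999', '4155550000', 'n/a']:
    sheet.append([raw])

# Rewrite every cell in column A below the header.
for rowNum in range(2, sheet.max_row + 1):
    cell = sheet.cell(row=rowNum, column=1)
    cell.value = normalize_phone(cell.value)
```

Values that do not look like phone numbers are left alone rather than guessed at, which makes it easy to review them by hand afterward.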

Setting the Font Style of Cells

Styling certain cells, rows, or columns can help you emphasize important areas in your spreadsheet. In the produce spreadsheet, for example, your program could apply bold text to the potato, garlic, and parsnip rows. Or perhaps you want to italicize every row with a cost per pound greater than $5. Styling parts of a large spreadsheet by hand would be tedious, but your programs can do it instantly.

To customize font styles in cells, import the Font() function from the openpyxl.styles module.

from openpyxl.styles import Font

This allows you to type Font() instead of openpyxl.styles.Font(). (See "Importing Modules" on page 47 to review this style of import statement.)

The following example creates a new workbook and sets cell A1 to 24 point italic font. Enter the following in the interactive shell:

  >>> import openpyxl
  >>> from openpyxl.styles import Font
  >>> wb = openpyxl.Workbook()
  >>> sheet = wb['Sheet']
  >>> italic24Font = Font(size=24, italic=True) # Create a font. # ➊
  >>> sheet['A1'].font = italic24Font # Apply the font to A1. # ➋
  >>> sheet['A1'] = 'Hello, world!'
  >>> wb.save('styles.xlsx')

In this example, Font(size=24, italic=True) returns a Font object, which is stored in italic24Font ➊. The keyword arguments to Font(), size and italic, configure the Font object's styling information. When sheet['A1'].font is assigned the italic24Font object ➋, all of that font styling information is applied to cell A1.

Font Objects

To set font attributes, you pass keyword arguments to Font(). Table 13-2 shows the possible keyword arguments for the Font() function.

Table 13-2: Keyword Arguments for Font Objects

Keyword argument   Data type   Description
name               String      The font name, such as 'Calibri' or 'Times New Roman'
size               Integer     The point size
bold               Boolean     True, for bold font
italic             Boolean     True, for italic font

You can call Font() to create a Font object and store that Font object in a variable. You then assign that variable to a Cell object's font attribute. For example, this code creates various font styles:

>>> import openpyxl
>>> from openpyxl.styles import Font
>>> wb = openpyxl.Workbook()
>>> sheet = wb['Sheet']
>>> fontObj1 = Font(name='Times New Roman', bold=True)
>>> sheet['A1'].font = fontObj1
>>> sheet['A1'] = 'Bold Times New Roman'
>>> fontObj2 = Font(size=24, italic=True)
>>> sheet['B3'].font = fontObj2
>>> sheet['B3'] = '24 pt Italic'
>>> wb.save('styles.xlsx')

Here, we store a Font object in fontObj1 and then set the A1 Cell object's font attribute to fontObj1. We repeat the process with another Font object to set the font of a second cell. After you run this code, cells A1 and B3 in the spreadsheet will have the custom font styles, as shown in Figure 13-4.

image

Figure 13-4: Spreadsheet with custom font styles

For cell A1, we set the font name to 'Times New Roman' and set bold to True, so our text appears in bold Times New Roman. We didn't specify a size, so the openpyxl default, 11, is used. In cell B3, our text is italic, with a size of 24; we didn't specify a font name, so the openpyxl default, Calibri, is used.
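The styling ideas mentioned at the start of this section (bold text for certain produce rows, italics for rows costing more than $5 per pound) might be sketched like this. The sample rows and the BOLD_NAMES set are made up for the illustration.

```python
import openpyxl
from openpyxl.styles import Font

# Made-up produce data: a header row followed by name and cost per pound.
wb = openpyxl.Workbook()
sheet = wb.active
for row in [('PRODUCE', 'COST PER POUND'), ('Potatoes', 0.86),
            ('Saffron', 7.25), ('Garlic', 3.07)]:
    sheet.append(row)

boldFont = Font(bold=True)
italicFont = Font(italic=True)
BOLD_NAMES = {'Potatoes', 'Garlic', 'Parsnips'}  # hypothetical selection

for rowNum in range(2, sheet.max_row + 1):
    name = sheet.cell(row=rowNum, column=1).value
    cost = sheet.cell(row=rowNum, column=2).value
    if cost > 5:
        font = italicFont       # expensive rows are italicized
    elif name in BOLD_NAMES:
        font = boldFont         # selected produce rows are bold
    else:
        continue                # leave all other rows unstyled
    for cell in sheet[rowNum]:  # every cell in this row
        cell.font = font
```

The same Font object can be assigned to as many cells as you like, so the two fonts are created once, outside the loop.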

Formulas

Excel formulas, which begin with an equal sign, can configure cells to contain values calculated from other cells. In this section, you'll use the openpyxl module to programmatically add formulas to cells, just like any normal value. For example:

>>> sheet['B9'] = '=SUM(B1:B8)'

This will store =SUM(B1:B8) as the value in cell B9. This sets cell B9 to a formula that calculates the sum of the values in cells B1 through B8. You can see this in Figure 13-5.

image

Figure 13-5: Cell B9 contains the formula =SUM(B1:B8), which adds the cells B1 through B8.

Excel formulas are set up just like any other text value in a cell. Enter the following in the interactive shell:

>>> import openpyxl
>>> wb = openpyxl.Workbook()
>>> sheet = wb.active
>>> sheet['A1'] = 200
>>> sheet['A2'] = 300
>>> sheet['A3'] = '=SUM(A1:A2)' # Set the formula.
>>> wb.save('writeFormula.xlsx')

Cells A1 and A2 are set to 200 and 300, respectively. The value in cell A3 is set to a formula that sums the values in A1 and A2. When the spreadsheet is opened in Excel, A3 displays its value as 500.
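Because a formula is just text, a program can build it from the size of the data. The following sketch, using made-up amounts, places a SUM formula directly below the last filled row. Note that OpenPyXL only stores the formula text; the result is calculated when the file is opened in a spreadsheet application.

```python
import openpyxl

wb = openpyxl.Workbook()
sheet = wb.active
for rowNum, amount in enumerate([200, 300, 450], start=1):
    sheet.cell(row=rowNum, column=1).value = amount

# Build the formula text from the current number of rows.
lastRow = sheet.max_row
totalCell = sheet.cell(row=lastRow + 1, column=1)
totalCell.value = '=SUM(A1:A' + str(lastRow) + ')'

# The cell holds the formula string itself, not a computed number.
print(totalCell.coordinate, totalCell.value)
```

If more rows are added to the data later, rerunning the program produces a formula that covers the new range without any manual editing.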

Excel formulas offer a level of programmability for spreadsheets but can quickly become unmanageable for complicated tasks. For example, even if you're deeply familiar with Excel formulas, it's a headache to try to decipher what =IFERROR(TRIM(IF(LEN(VLOOKUP(F7, Sheet2!$A$1:$B$10000, 2, FALSE))>0,SUBSTITUTE(VLOOKUP(F7, Sheet2!$A$1:$B$10000, 2, FALSE), " ", ""),"")), "") actually does. Python code is much more readable.

Adjusting Rows and Columns

In Excel, resizing rows and columns is as easy as clicking and dragging the edge of a row or column heading. But if you need to set the size of a row or column based on the contents of a cell, or if you want to set a size across a large spreadsheet file, it's much faster to write a Python program to do it.

Rows and columns can also be completely hidden. Or they can be "frozen" in place so they are always visible on screen and appear on every page when the spreadsheet is printed (handy for headers).

Setting Row Height and Column Width

Worksheet objects have row_dimensions and column_dimensions attributes that control row heights and column widths. Enter the following in the interactive shell:

>>> import openpyxl
>>> wb = openpyxl.Workbook()
>>> sheet = wb.active
>>> sheet['A1'] = 'Tall row'
>>> sheet['B2'] = 'Wide column'
>>> # Set the height and width:
>>> sheet.row_dimensions[1].height = 70
>>> sheet.column_dimensions['B'].width = 20
>>> wb.save('dimensions.xlsx')

A sheet's row_dimensions and column_dimensions are dictionary-like values; row_dimensions contains RowDimension objects, and column_dimensions contains ColumnDimension objects. In row_dimensions, you can access one of the objects using the number of the row (in this case, 1 or 2). In column_dimensions, you can access one of the objects using the letter of the column (in this case, A or B).

The dimensions.xlsx spreadsheet looks like Figure 13-6.

image

Figure 13-6: Row 1 and Column B are set to a larger height and width

Once you have the RowDimension object, you can set its height. Once you have the ColumnDimension object, you can set its width. The row height can be set to an integer or float value between 0 and 409. This value represents the height measured in points, where one point equals 1/72 of an inch. The default row height is 12.75. The column width can be set to an integer or float value between 0 and 255. This value represents the number of characters at the default font size (11 point) that can be displayed in the cell. The default column width is 8.43 characters. Columns with widths of 0 or rows with heights of 0 are hidden from the user.
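Setting a width based on cell contents could look like the sketch below. It sizes each column to its longest value plus a little padding; the PADDING amount and the sample data are assumptions, and counting characters is only an approximation of how wide the text is actually drawn.

```python
import openpyxl

wb = openpyxl.Workbook()
sheet = wb.active
sheet.append(['ID', 'A much longer description column'])
sheet.append([1, 'short'])

PADDING = 2      # assumed extra characters of breathing room
MAX_WIDTH = 255  # the upper limit for column widths noted above

for column in sheet.iter_cols():
    # Longest text in this column, ignoring empty cells.
    longest = max((len(str(cell.value)) for cell in column
                   if cell.value is not None), default=0)
    letter = column[0].column_letter
    sheet.column_dimensions[letter].width = min(longest + PADDING, MAX_WIDTH)
```

Here column A ends up narrow and column B wide enough for its header, without either width being chosen by hand.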

Merging and Unmerging Cells

A rectangular area of cells can be merged into a single cell with the merge_cells() sheet method. Enter the following in the interactive shell:

>>> import openpyxl
>>> wb = openpyxl.Workbook()
>>> sheet = wb.active
>>> sheet.merge_cells('A1:D3') # Merge all these cells.
>>> sheet['A1'] = 'Twelve cells merged together.'
>>> sheet.merge_cells('C5:D5') # Merge these two cells.
>>> sheet['C5'] = 'Two merged cells.'
>>> wb.save('merged.xlsx')

The argument to merge_cells() is a single string of the top-left and bottom-right cells of the rectangular area to be merged: 'A1:D3' merges 12 cells into a single cell. To set the value of these merged cells, simply set the value of the top-left cell of the merged group.

When you run this code, merged.xlsx will look like Figure 13-7.

image

Figure 13-7: Merged cells in a spreadsheet

To unmerge cells, call the unmerge_cells() sheet method. Enter the following in the interactive shell:

>>> import openpyxl
>>> wb = openpyxl.load_workbook('merged.xlsx')
>>> sheet = wb.active
>>> sheet.unmerge_cells('A1:D3') # Split these cells up.
>>> sheet.unmerge_cells('C5:D5')
>>> wb.save('merged.xlsx')

If you save your changes, then view the spreadsheet, you'll see that the merged cells have reverted to separate cells.
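If you don't know in advance which areas of a sheet are merged, the sheet's merged_cells attribute can tell you. The following is a minimal sketch that lists the merged ranges of an in-memory workbook and then unmerges each one; it assumes a reasonably recent version of OpenPyXL, in which each merged range has a coord attribute.

```python
import openpyxl

wb = openpyxl.Workbook()
sheet = wb.active
sheet.merge_cells('A1:D3')
sheet['A1'] = 'Twelve cells merged together.'
sheet.merge_cells('C5:D5')

# Collect the coordinates first so the sheet isn't changed while looping over it.
merged = sorted(cellRange.coord for cellRange in sheet.merged_cells.ranges)
print(merged)

for coords in merged:
    sheet.unmerge_cells(coords)
print(len(sheet.merged_cells.ranges))
```

The list of coordinate strings is built before any unmerging happens, which avoids modifying the collection of merged ranges while iterating over it.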

Freezing Panes

For spreadsheets too large to be displayed all at once, it's helpful to "freeze" a few of the top rows or leftmost columns onscreen. Frozen column or row headers, for example, are always visible to the user even as they scroll through the spreadsheet. These are known as freeze panes. In OpenPyXL, each Worksheet object has a freeze_panes attribute that can be set to a Cell object or a string of a cell's coordinates. Note that all rows above and all columns to the left of this cell will be frozen, but the row and column of the cell itself will not be frozen.

To unfreeze all panes, set freeze_panes to None or 'A1'. Table 13-3 shows which rows and columns will be frozen for some example settings of freeze_panes.

Table 13-3: Frozen Pane Examples

freeze_panes setting                                       Rows and columns frozen
sheet.freeze_panes = 'A2'                                  Row 1
sheet.freeze_panes = 'B1'                                  Column A
sheet.freeze_panes = 'C1'                                  Columns A and B
sheet.freeze_panes = 'C2'                                  Row 1 and columns A and B
sheet.freeze_panes = 'A1' or sheet.freeze_panes = None     No frozen panes

Make sure you have the produce sales spreadsheet. Then enter the following in the interactive shell:

>>> import openpyxl
>>> wb = openpyxl.load_workbook('produceSales.xlsx')
>>> sheet = wb.active
>>> sheet.freeze_panes = 'A2' # Freeze the rows above A2.
>>> wb.save('freezeExample.xlsx')

If you set the freeze_panes attribute to 'A2', row 1 will always be visible, no matter where the user scrolls in the spreadsheet. You can see this in Figure 13-8.

image

Figure 13-8: With freeze_panes set to 'A2', row 1 is always visible, even as the user scrolls down.
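The freeze_panes attribute can also be read back, which is a quick way to check a setting without opening the file. This short sketch tries a coordinate string, a Cell object, and None in turn on an empty in-memory workbook.

```python
import openpyxl

wb = openpyxl.Workbook()
sheet = wb.active

sheet.freeze_panes = 'C2'         # freeze row 1 and columns A and B
first = sheet.freeze_panes

sheet.freeze_panes = sheet['A2']  # a Cell object works as well
second = sheet.freeze_panes

sheet.freeze_panes = None         # unfreeze everything
third = sheet.freeze_panes

print(first, second, third)
```

Whichever form you assign, reading the attribute gives back a coordinate string, or None when nothing is frozen.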

Charts

OpenPyXL supports creating bar, line, scatter, and pie charts using the data in a sheet's cells. To make a chart, you need to do the following:

  1. Create a Reference object from a rectangular selection of cells.
  2. Create a Series object by passing in the Reference object.
  3. Create a Chart object.
  4. Append the Series object to the Chart object.
  5. Add the Chart object to the Worksheet object, optionally specifying which cell should be the top-left corner of the chart.

The Reference object requires some explaining. You create Reference objects by calling the openpyxl.chart.Reference() function and passing the following arguments:

  1. The Worksheet object containing your chart data.
  2. The min_col and min_row keyword arguments: two integers giving the column and row of the top-left cell of the rectangular selection of cells containing your chart data. Note that 1 is the first row and the first column, not 0.
  3. The max_col and max_row keyword arguments: two integers giving the column and row of the bottom-right cell of that rectangular selection.

Figure 13-9 shows some sample coordinate parameters.

image

Figure 13-9: From left to right, the top-left and bottom-right cells of each selection as (row, column) pairs: (1, 1), (10, 1); (3, 2), (6, 4); (5, 3), (5, 3)

Enter this interactive shell example to create a bar chart and add it to a spreadsheet:

>>> import openpyxl
>>> wb = openpyxl.Workbook()
>>> sheet = wb.active
>>> for i in range(1, 11): # create some data in column A
...     sheet['A' + str(i)] = i
...
>>> refObj = openpyxl.chart.Reference(sheet, min_col=1, min_row=1, max_col=1,
...                                   max_row=10)
>>> seriesObj = openpyxl.chart.Series(refObj, title='First series')
>>> chartObj = openpyxl.chart.BarChart()
>>> chartObj.title = 'My Chart'
>>> chartObj.append(seriesObj)
>>> sheet.add_chart(chartObj, 'C5')
>>> wb.save('sampleChart.xlsx')

This will produce a spreadsheet similar to Figure 13-10.

image

Figure 13-10: Spreadsheet with chart added

We've created a bar chart by calling openpyxl.chart.BarChart(). You can also create line charts, scatter charts, and pie charts by calling openpyxl.chart.LineChart(), openpyxl.chart.ScatterChart(), and openpyxl.chart.PieChart().
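The same five steps apply to the other chart types. As an illustration, this sketch draws a line chart with two series, one per column of made-up data. The chart title, the series titles, and the filename sampleLineChart.xlsx are arbitrary choices for the example.

```python
import openpyxl
from openpyxl.chart import LineChart, Reference, Series

wb = openpyxl.Workbook()
sheet = wb.active
for i in range(1, 11):
    sheet.cell(row=i, column=1).value = i      # column A: 1 to 10
    sheet.cell(row=i, column=2).value = i * i  # column B: their squares

chartObj = LineChart()
chartObj.title = 'Linear vs. squared'

# One Reference and one Series per column of data.
for colNum, title in [(1, 'n'), (2, 'n squared')]:
    refObj = Reference(sheet, min_col=colNum, min_row=1,
                       max_col=colNum, max_row=10)
    chartObj.append(Series(refObj, title=title))

sheet.add_chart(chartObj, 'D2')  # top-left corner of the chart at D2
wb.save('sampleLineChart.xlsx')  # hypothetical output filename
```

Appending more than one Series object to the same chart is how you plot several lines on shared axes.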

Summary

Often, the hard part of processing information is not the processing itself, but simply converting the data into a format suitable for the program. But once the spreadsheet is loaded into Python, extracting and manipulating the data is much faster than doing it by hand.

You can also generate spreadsheets as output from your programs. So if colleagues need your text file or PDF of thousands of sales contacts transferred to a spreadsheet file, you won't have to tediously copy and paste it all into Excel.

Equipped with the openpyxl module and some programming knowledge, you'll find processing even the biggest spreadsheets a piece of cake.

In the next chapter, we'll look at how to use Python to interact with another spreadsheet program: the popular online Google Sheets app.

Practice Questions

For the following questions, imagine you have a Workbook object in the variable wb, a Worksheet object in sheet, a Cell object in cell, a Comment object in comm, and an Image object in img.

  1. What does the openpyxl.load_workbook() function return?

  2. What does the wb.sheetnames workbook attribute contain?

  3. How would you retrieve the Worksheet object for a sheet named 'Sheet1'?

  4. How would you retrieve the Worksheet object for the workbook's active sheet?

  5. How would you retrieve the value in cell C5?

  6. How would you set the value in cell C5 to "Hello"?

  7. How would you retrieve a cell's row and column as integers?

  8. What do the sheet.max_column and sheet.max_row sheet attributes hold, and what is the data type of these attributes?

  9. If you needed to get the integer index for column 'M', what function would you need to call?

  10. If you needed to get the string name for column 14, what function would you need to call?

  11. How can you retrieve a tuple of all the Cell objects from A1 to F1?

  12. How would you save the workbook to the filename example.xlsx?

  13. How do you set a formula in a cell?

  14. If you want to retrieve the result of a cell's formula instead of the cell's formula itself, what must you do first?

  15. How would you set the height of row 5 to 100?

  16. How would you hide column C?

  17. What is a freeze pane?

  18. What five functions and methods do you have to call to create a bar chart?

Practice Projects

For practice, write programs that perform the following tasks.

Multiplication Table Maker

Create a program multiplicationTable.py that takes a number N from the command line and creates an N × N multiplication table in an Excel spreadsheet. For example, when the program is run like this:

py multiplicationTable.py 6

...it should create a spreadsheet that looks like Figure 13-11.

image

Figure 13-11: Multiplication table generated in spreadsheet

Row 1 and column A should be used for labels and should be in bold.

Blank Row Inserter

Create a program blankRowInserter.py that takes two integers and a filename string as command line arguments. Let's call the first integer N and the second integer M. Starting at row N, the program should insert M blank rows into the spreadsheet. For example, when the program is run like this:

python blankRowInserter.py 3 2 myProduce.xlsx

...the "before" and "after" spreadsheets should look like Figure 13-12.

image

Figure 13-12: Before (left) and after (right) the two blank rows are inserted at row 3

You can write this program by reading in the contents of the spreadsheet. Then, when writing out the new spreadsheet, use a for loop to copy the rows above row N unchanged. For the remaining rows, add M to the row number in the output spreadsheet.
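One possible sketch of the row-shifting logic is shown below, leaving out the command line handling and file loading that the full project requires. The function name insert_blank_rows() and the sample produce names are hypothetical; rows above row n keep their row number, and every other row is written m rows further down.

```python
import openpyxl

def insert_blank_rows(sourceSheet, n, m):
    """Copy sourceSheet into a new workbook with m blank rows at row n."""
    newWb = openpyxl.Workbook()
    newSheet = newWb.active
    for row in sourceSheet.iter_rows():
        for cell in row:
            # Rows above n keep their number; the rest shift down by m.
            newRowNum = cell.row if cell.row < n else cell.row + m
            newSheet.cell(row=newRowNum, column=cell.column).value = cell.value
    return newWb

# Small in-memory demonstration with made-up data.
wb = openpyxl.Workbook()
sheet = wb.active
for name in ['Potatoes', 'Okra', 'Fava beans', 'Watermelon']:
    sheet.append([name])

newWb = insert_blank_rows(sheet, 3, 2)
result = newWb.active
```

In the result, 'Fava beans' moves from row 3 to row 5, and rows 3 and 4 are left empty.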

Spreadsheet Cell Inverter

Write a program to reverse the rows and columns of cells in a spreadsheet. For example, the value in row 5, column 3 will be in row 3, column 5 (and vice versa). This should be done for all cells in the spreadsheet. For example, a "before" and "after" spreadsheet would look similar to Figure 13-13.

image

Figure 13-13: Spreadsheet before (top) and after (bottom) inversion

You can write this program by using nested for loops to read the spreadsheet's data into a list of lists data structure. This data structure could have sheetData[x][y] for the cell at column x and row y. Then, when writing out the new spreadsheet, use sheetData[y][x] for the cell at column x and row y.
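The list of lists approach might be sketched as follows, using a small made-up sheet held in memory. Since Python lists start at index 0 while spreadsheet rows and columns start at 1, sheetData[x][y] in this sketch holds the cell at column x + 1 and row y + 1.

```python
import openpyxl

wb = openpyxl.Workbook()
sheet = wb.active
sheet.append(['ITEM', 'SOLD'])
sheet.append(['Eggplant', 334])
sheet.append(['Cucumber', 252])

# Read the data column by column into a list of lists.
sheetData = []
for x in range(1, sheet.max_column + 1):
    column = []
    for y in range(1, sheet.max_row + 1):
        column.append(sheet.cell(row=y, column=x).value)
    sheetData.append(column)

# Write it back with the coordinates swapped: old column x becomes new row x.
newWb = openpyxl.Workbook()
newSheet = newWb.active
for x, column in enumerate(sheetData, start=1):
    for y, value in enumerate(column, start=1):
        newSheet.cell(row=x, column=y).value = value
```

The original sheet has three rows and two columns, so the inverted sheet has two rows and three columns.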

Text Files to Spreadsheet

Write a program to read in the contents of several text files (you can make the text files yourself) and insert those contents into a spreadsheet, with one line of text per row. The lines of the first text file will be in the cells of column A, the lines of the second text file will be in the cells of column B, and so on.

Use the readlines() File object method to return a list of strings, one string per line in the file. For the first file, output the first line to column 1, row 1. The second line should be written to column 1, row 2, and so on. The next file that is read with readlines() will be written to column 2, the next file to column 3, and so on.
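The column-per-file layout could be sketched like this. To keep the example self-contained, io.StringIO objects with made-up contents stand in for opened text files; in a real program you would call open() on each filename instead.

```python
import io
import openpyxl

# io.StringIO objects stand in for opened text files here.
fakeFiles = [io.StringIO('apple\nbanana\n'),
             io.StringIO('red\nyellow\ngreen\n')]

wb = openpyxl.Workbook()
sheet = wb.active
for colNum, fileObj in enumerate(fakeFiles, start=1):
    # readlines() returns one string per line, newline included.
    for rowNum, line in enumerate(fileObj.readlines(), start=1):
        sheet.cell(row=rowNum, column=colNum).value = line.rstrip('\n')
```

Files of different lengths simply produce columns of different lengths; here column B has one more filled cell than column A.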

Spreadsheet to Text Files

Write a program that performs the tasks of the previous program in reverse order: the program should open a spreadsheet and write the cells of column A into one text file, the cells of column B into another text file, and so on.

Origin blog.csdn.net/wizardforcel/article/details/129931312