Parsing Excel Files with Ruby
In this article, I review several existing Ruby libraries for accessing Excel files in different formats. The focus is on reading Excel files, but I also briefly discuss modifying and writing them.
If you can't wait to see the code, head over to the project I published on GitHub, which contains code snippets for reading Excel files with the libraries mentioned in this article.
Excel file types
Before we get into the different Ruby libraries, let’s talk about Excel files. It is important to identify the type of Excel files that you are going to be using. There are two main types: legacy files and the newer OOXML file format introduced in Microsoft Office 2007.
There is a nice description of the differences on Wikipedia. The TL;DR version is that the legacy file format includes files with the following extensions:
File name extension | Explanation |
---|---|
.xls | Legacy Excel file format |
.xlt | Legacy Excel template format |
.xlm | Legacy Excel macro file format |
Microsoft Excel 2007 abandoned the legacy binary format and switched to the Office Open XML (OOXML) format that is used today. These files use the following extensions:
File name extension | Explanation |
---|---|
.xlsx | OOXML Excel file |
.xltx | OOXML Excel template |
.xlsm | OOXML Excel file with macros |
It is important to determine which Excel file format (legacy or OOXML) you will be dealing with. If you work in Excel itself, you can often convert between the formats. In my scenario, however, the Excel files came from an external source whose format I could not control, and I did not want to rely on manual format conversion. Nor is conversion necessary: the modern .xlsx format can generally be opened by other spreadsheet software, for example Numbers and LibreOffice.
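When files arrive from an external source, a first rough check is the file extension. The following is a minimal sketch of that idea, not part of any library: `excel_format` and the two constants are hypothetical names, and the template extension is assumed to be .xltx, the one Excel uses for OOXML templates. An extension can of course be wrong, so treat the result as a hint only.

```ruby
# Extensions for the two families of Excel files discussed above.
LEGACY_EXTENSIONS = %w[.xls .xlt .xlm].freeze
OOXML_EXTENSIONS  = %w[.xlsx .xltx .xlsm].freeze

# Hypothetical helper: classify a path as :legacy, :ooxml or :unknown,
# based only on its (case-insensitive) file extension.
def excel_format(path)
  ext = File.extname(path).downcase
  return :legacy if LEGACY_EXTENSIONS.include?(ext)
  return :ooxml  if OOXML_EXTENSIONS.include?(ext)

  :unknown
end

puts excel_format('report.XLSX') # prints "ooxml"
puts excel_format('old.xls')     # prints "legacy"
```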
Ruby Excel libraries
There are many Ruby libraries for accessing Excel files, perhaps too many. When I researched them, I spent a lot of time figuring out what each one could and could not do. I found the following questions useful when evaluating a library:
- Which Excel file formats does it support?
- Does it support reading, writing, or both?
- Can it handle huge files? Quickly?
- Does it have to read from a file, or can it also read from a stream?
Depending on your application, some or all of these questions may be very important.
Choosing the right library
The following table details the features of six different Ruby libraries for accessing Excel files:
Library | License | Supports .xlsx | Supports .xls | Capability |
---|---|---|---|---|
axlsx | MIT | yes | no | write |
rubyXL | MIT | yes | no | read/write |
roo | MIT | yes | yes | read |
creek | MIT | yes | no | read |
spreadsheet | GPLv3 | no | yes | read/write |
simple_xlsx_reader | MIT | yes | no | read |
Depending on your needs, one or more of these libraries may help. Consider the following usage scenarios:
Writing .xlsx files
If you need to write .xlsx files, axlsx is a good choice. It supports writing cell values and generating charts. If you need a lightweight library, rubyXL is a good option.
Reading .xlsx files
If you only need to read .xlsx files, you can choose among rubyXL, roo, creek and simple_xlsx_reader. roo is a very popular choice because it also supports the legacy .xls format. However, if you care about speed, creek and simple_xlsx_reader are clearly better at handling large files. If you want to read data from an IO stream (rather than a file), rubyXL is the only choice.
Reading and writing .xlsx files
If you need to both read and write .xlsx files, you have two options. You can use rubyXL, which supports both reading and writing. Alternatively, you can use two different libraries, one for reading and one for writing.
Reading and writing legacy Excel files
Supporting the legacy .xls format is more constrained. If you only need to support legacy .xls files, I recommend spreadsheet, which supports both reading and writing. If you also need to support the .xlsx format, I would rather use a second gem for that, unless you only need to read files. In that case you can choose roo, which reads both the legacy and the modern formats.
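The scenarios above amount to filtering the feature table by what you need. As a rough illustration only, the table can be written down as plain Ruby data and queried; `Library`, `LIBRARIES` and `candidates` are hypothetical names made up for this sketch, and the values simply mirror the table in this article.

```ruby
# One row of the feature table: which formats a library supports and
# whether it can read and/or write.
Library = Struct.new(:name, :xlsx, :xls, :read, :write)

LIBRARIES = [
  Library.new('axlsx',              true,  false, false, true),
  Library.new('rubyXL',             true,  false, true,  true),
  Library.new('roo',                true,  true,  true,  false),
  Library.new('creek',              true,  false, true,  false),
  Library.new('spreadsheet',        false, true,  true,  true),
  Library.new('simple_xlsx_reader', true,  false, true,  false)
].freeze

# Hypothetical helper: names of the libraries that satisfy every
# requirement passed as true; requirements left false are ignored.
def candidates(xlsx: false, xls: false, read: false, write: false)
  LIBRARIES.select do |lib|
    (!xlsx || lib.xlsx) && (!xls || lib.xls) &&
      (!read || lib.read) && (!write || lib.write)
  end.map(&:name)
end

p candidates(xlsx: true, read: true, write: true) # prints ["rubyXL"]
p candidates(xls: true, read: true)               # prints ["roo", "spreadsheet"]
```

A lookup like this does not capture performance or stream support, so it only narrows the list; the final choice still depends on the considerations discussed in the text.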
The good news is that whichever library you end up choosing, the code to open and read a file is simple and looks very similar from one library to the next. For example, here is the code using creek.

    require 'creek'

    # Open the workbook and walk every row of every worksheet.
    workbook = Creek::Book.new 'path/to/file.xlsx'
    worksheets = workbook.sheets
    worksheets.each do |worksheet|
      worksheet.rows.each do |row|
        row_cells = row.values
        # do something with row_cells
      end
    end

The project I published on GitHub contains sample code that reads .xlsx files with each of these libraries.
Performance
If you need to read huge amounts of data from Excel files, you may want to compare the performance of the libraries. I quickly put together a somewhat rough performance test and ran it against the four libraries in the table above that can read the .xlsx format.
I created sample .xlsx files containing 500, 10,000, 50,000, 200,000 and 500,000 rows of data. I then ran code to read each file (that is, to read every row of data in the file). The code used to read the sample files with each library can be found here.
I ran each library against each file three times and recorded the average time (the variation between runs was small).
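The measurement procedure described here (several runs, averaged) can be sketched with Ruby's standard `Benchmark` module. This is not the original test program: `average_runtime` is a hypothetical helper, and the block below is a stand-in workload. In a real test the block would open a sample file with one of the libraries and read every row.

```ruby
require 'benchmark'

# Hypothetical helper: run the given block `runs` times and return the
# mean wall-clock time in seconds.
def average_runtime(runs: 3)
  times = Array.new(runs) { Benchmark.realtime { yield } }
  times.sum / runs
end

# Stand-in workload so the sketch runs without any Excel gem installed.
seconds = average_runtime(runs: 3) { 100_000.times.map { |i| i * 2 } }
puts format('average over 3 runs: %.4f s', seconds)
```

Averaging a few runs smooths out one-off effects such as a cold file cache, which matters most for the smaller sample files.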
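rubyXL and roo perform roughly the same, taking more than two minutes to read the 500,000-row Excel file. creek and simple_xlsx_reader are much faster, needing less than a minute to read the same 500,000-row file.
I hope this article gives you some guidance on accessing Excel files with Ruby. If you use a library that I did not mention and you like it, please let me know.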