[python] Processing tabular data with the Tablib library

Tablib is a Python library for working with tabular data in formats such as Excel, CSV, and JSON. It offers a simple yet powerful way to read, write, filter, and transform such data, and its consistent API makes conversion between formats straightforward. For example, Tablib can import data from an Excel sheet into Python objects, convert it to JSON or CSV, and then run further processing or analysis on it. Tablib also supports common operations such as sorting, filtering, and merging data. The official Tablib repository is: tablib, and the official Tablib documentation is: tablib-doc.

Tablib requires Python 3.6 or later. Install it with:

pip install tablib

import tablib
# Check the version
tablib.__version__
'3.4.0'

1 Tablib usage

1.1 Table creation

Create data structure

# Create a table
data = tablib.Dataset()
type(data)
tablib.core.Dataset
# View the table title; it is empty by default
data.title 
# Set the table title
data.title = 'data'
data.title
'data'

Data addition

# Append rows
data.append(['John', 28])
data.append(['Tom', 16])
data.append(['Jane', 32])
# View the data as a list of rows
data.dict
# or
# print(data)
[['John', 28], ['Tom', 16], ['Jane', 32]]
# Add a header row
data.headers = ['Name', 'Age']
# data.dict
print(data)
Name|Age
----|---
John|28 
Tom |16 
Jane|32 
# Add a column
# Its length must match the current number of rows
data.append_col(['USA', 'UK','UK'], header='Country')
# data.dict
print(data)
Name|Age|Country
----|---|-------
John|28 |USA    
Tom |16 |UK     
Jane|32 |UK     

Select row or column

# Select the first row
data[0]
('John', 28, 'USA')
# Select the third column of the first row
data[0][2]
'USA'
# Select a column by name
data['Age']
[28, 16, 32]
# Get the headers
data.headers
['Name', 'Age', 'Country']
# Get a column by index
data.get_col(0)
['John', 'Tom', 'Jane']

Delete row or column

# Delete a column
del data['Country']
# Delete rows (here, all but the last)
# del data[:-1]
print(data)
Name|Age
----|---
John|28 
Tom |16 
Jane|32 

Advanced row and column operations

# Transpose the table
transposed_data = data.transpose()
print(transposed_data)
Name|John|Tom|Jane
----|----|---|----
Age |28  |16 |32  
# Read the data dimensions
data.width,data.height
(2, 3)
# Sort by a field
# Sort by age in descending order
data = data.sort("Age",reverse=True)
print(data)
Name|Age
----|---
Jane|32 
John|28 
Tom |16 
# Compute the average age
ages = data['Age']
float(sum(ages)) / len(ages)
25.333333333333332
# Remove the first row
tmp = data.lpop()
print(data)
Name|Age
----|---
John|28 
Tom |16 
# Insert a row at the top
data.lpush(list(tmp))
print(data)
Name|Age
----|---
Jane|32 
John|28 
Tom |16 
# Insert a column at the far left
new_column = ['Engineer', 'Doctor','Doctor']
data.lpush_col(new_column, header='Profession')
print(data)
Profession|Name|Age
----------|----|---
Engineer  |Jane|32 
Doctor    |John|28 
Doctor    |Tom |16 
# Remove the last row
# data.pop()
# print(data)
# Remove duplicate rows
# Create a dataset
data = tablib.Dataset()
data.headers = ['Name', 'Age']
data.append(['Alice', 25])
data.append(['Alice', 30])
data.append(['Alice', 25])  # duplicate row

# Remove duplicate rows; a row counts as a duplicate only if all its values match
data.remove_duplicates()

print(data)
Name |Age
-----|---
Alice|25 
Alice|30 

Table merge

# Create two tables
data1 = tablib.Dataset()
data1.headers = ['Name', 'Age']
data1.append(['Alice', 25])
data1.append(['Bob', 30])

data2 = tablib.Dataset()
data2.headers = ['Name', 'Occupation']
data2.append(['Alice', 'Engineer'])
data2.append(['Bob', 'Doctor'])
# Merge by rows
# Use the stack method to join the two tables; the first table's headers are kept
stacked_data = data1.stack(data2)
print(stacked_data)
Name |Age     
-----|--------
Alice|25      
Bob  |30      
Alice|Engineer
Bob  |Doctor  
# Merge by columns
# Both tables must have the same number of rows
# Use the stack_cols method to join the columns of the two tables
stacked_cols_data = data1.stack_cols(data2)
print(stacked_cols_data)
Name |Age|Name |Occupation
-----|---|-----|----------
Alice|25 |Alice|Engineer  
Bob  |30 |Bob  |Doctor    

1.2 Data import and export

Data output

Tablib lets users export data to different formats according to their needs, which makes it easy to integrate with other tools. Exporting returns the data in the target format (a string, bytes, or an object) rather than saving it to a local file. Supported formats include, but are not limited to:

  • CSV: A common plain-text table format in which fields are separated by commas.
  • JSON: A common data exchange format that stores data in the form of key-value pairs.
  • Excel: Spreadsheet format that can contain multiple worksheets and supports features such as formulas and charts; exporting to it requires additional libraries.
  • YAML: A human-readable data serialization format commonly used in configuration files.
  • HTML: The markup language used to create web pages.
  • Pandas DataFrame: Pandas is another Python library used for data processing and analysis. Tablib supports exporting data to Pandas DataFrame.

Tablib provides two ways to export data to another format: call the export method, or access the attribute named after the format. As shown below, both data.export('csv') and data.csv return the CSV representation of a Dataset:

data.export('csv')
data.csv

The specific sample code is as follows:

# Create a table
data = tablib.Dataset()
data.headers = ['Name', 'Age']
data.append(['John', 28])
data.append(['Tom', 16])
data.append(['Jane', 32])
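# Export as a CSV string
data_csv = data.export('csv')
print(type(data_csv))

# Write the data to a local CSV file
# newline='' keeps the line endings already present in the exported text
with open('data.csv', 'w', newline='') as f:
    f.write(data_csv)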
<class 'str'>
# Export as a JSON string
data_json = data.export('json')
type(data_json)

# Parse the JSON string into Python objects
import json
data_json = json.loads(data_json)
print(data_json)
[{'Name': 'John', 'Age': 28}, {'Name': 'Tom', 'Age': 16}, {'Name': 'Jane', 'Age': 32}]
# Save the dataset as a JSON file
with open('data.json', 'w') as f:
    f.write(data.export('json'))
# Save as a YAML file
# Requires PyYAML (pip install pyyaml)
with open('data.yml', 'w') as f:
    f.write(data.export('yaml'))
# Save the dataset as an xls file; note the 'wb' mode
# Extra libraries are required:
# pip install xlrd
# pip install xlwt
with open('data.xls', 'wb') as f:
    f.write(data.export('xls'))

# xlsx export additionally requires openpyxl
# pip install openpyxl
with open('data.xlsx', 'wb') as f:
    f.write(data.export('xlsx'))
# Convert to HTML
# Requires the MarkupPy library
# pip install MarkupPy 
with open('data.html', 'w') as f:
    f.write(data.export('html'))
# Convert to a pandas DataFrame
# Requires the pandas library
df = data.export('df')
df.head()
Name Age
0 John 28
1 Tom 16
2 Jane 32

Data import

Tablib can also import files in various formats to initialize a Dataset, as follows:

with open('data.csv', 'r') as fh:
    imported_data = tablib.Dataset().load(fh)
print(imported_data)
Name|Age
----|---
John|28 
Tom |16 
Jane|32 

For tabular formats such as CSV, you can also choose not to treat the first row as the header row, as shown below:

with open('data.csv', 'r') as fh:
    # headers=False: do not treat the first row as headers
    imported_data = tablib.Dataset().load(fh, headers=False)
print(imported_data)
Name|Age
John|28 
Tom |16 
Jane|32 

For xls and xlsx files, which can contain multiple sheets, only the first sheet is loaded by default. Open the file in rb mode. See the next section for multi-sheet management.

with open('data.xls', 'rb') as fh:
    imported_data =  tablib.Dataset().load(fh, 'xls')
print(imported_data)

Name|Age 
----|----
John|28.0
Tom |16.0
Jane|32.0

1.3 Multi-table management

In Tablib, a Databook is a data structure for organizing and managing multiple data tables (Datasets). It provides a convenient way to operate on several tables together.

Create Databook

# Create a Databook
databook = tablib.Databook()
# Create the first data table
data_table1 = tablib.Dataset()

# Set the table's headers and data
data_table1.headers = ['Name', 'Age']
data_table1.append(['John', 25])
data_table1.append(['Alice', 30])
# Set the table title
data_table1.title = "table1"

# Add the table to the Databook
databook.add_sheet(data_table1)
# Create the second data table
data_table2 = tablib.Dataset()

# Set the table's headers and data
data_table2.headers = ['Name', 'Age']
data_table2.append(['Jane', 34])
data_table2.append(['Mike', 14])
# Set the table title
data_table2.title = "table2"

# Add the table to the Databook
databook.add_sheet(data_table2)
# A Databook can also be created in one step from existing tables
tablib.Databook((data_table1, data_table2))
<databook object>

View databook

# View the number of sheets
databook.size
2
# View the sheets
databook.sheets()
[<table1 dataset>, <table2 dataset>]

Get table based on index

for index,table in enumerate(databook.sheets()):
    print(f" \ntable{index}")
    print(table)
table0
Name |Age
-----|---
John |25 
Alice|30 
 
table1
Name|Age
----|---
Jane|34 
Mike|14 

Save and import

Databook supports saving xlsx and xls files, but importing only supports xlsx files.

# Save as an xlsx file
with open('databook.xlsx', 'wb') as f:
    f.write(databook.export('xlsx'))
# Import multiple sheets
with open(r'databook.xlsx', 'rb') as fh:
    databook = tablib.Databook().load(fh, 'xlsx')
print(databook.sheets())
[<table1 dataset>, <table2 dataset>]

1.4 Advanced use

Dynamic columns

Tablib lets you create and manage dynamic columns in a table. Such a column does not need to be defined in advance: instead of a list of values, you pass a function that receives each row and returns the value for that row. Columns can be added, removed, and modified at any time as needed. The example below fills a column using a random function:

# Import the data
with open('data.csv', 'r') as fh:
    data = tablib.Dataset().load(fh)
print(data)
Name|Age
----|---
John|28 
Tom |16 
Jane|32 
import random

# Assign a random grade
def random_grade(row):
    # Use a different range depending on the age in the given row
    if int(row[1]) > 30:
        return (random.randint(59,100)/100.0)
    else:
        return (random.randint(60,99)/100.0)

data.append_col(random_grade, header='Grade')
print(data)
Name|Age|Grade
----|---|-----
John|28 |0.65 
Tom |16 |0.99 
Jane|32 |0.79 

Data filtering

Tablib provides the filter method, which selects rows by the tags assigned to them when they are added to the Dataset.

fruits = tablib.Dataset()  

fruits.headers = ['name', 'color'] 
# Append rows with tags, e.g. fruit and sour
fruits.append(['tomato', 'red'], tags=['fruit', 'sour']) 
fruits.append(['strawberry', 'red'], tags=['fruit', 'sweet' ]) 
fruits.append(['corn', 'yellow'], tags=['vegetable', 'sweet']) 

# Tags are not carried over when exporting to other formats
print(fruits.yaml)
- {color: red, name: tomato}
- {color: red, name: strawberry}
- {color: yellow, name: corn}
# Keep rows tagged vegetable
fruits.filter(['vegetable']).df  
name color
0 corn yellow
# Keep rows tagged vegetable or sweet
fruits.filter(['vegetable', 'sweet']).df  
name color
0 strawberry red
1 corn yellow
# Keep rows tagged fruit, then of those keep rows tagged sour
fruits.filter(['fruit']).filter(['sour']).df  
name color
0 tomato red

Separators

Tablib provides the append_separator method, which adds a separator row to Excel output, as follows:

# Test data for Daniel and Suzie
daniel_tests = [
    ('11/24/09', 'Apple', 'Red'),
    ('05/24/10', 'Banana', 'Yellow')
]

suzie_tests = [
    ('11/24/09', 'Orange', 'Orange'),
    ('05/24/10', 'Grapes', 'Purple')
]

# Create a new dataset
tests = tablib.Dataset()
tests.headers = ['Date', 'Fruit Name', 'Color']

# Add a separator with a label
tests.append_separator('Fruits A')  
for test_row in daniel_tests:
   tests.append(test_row)

# Add an empty separator
tests.append_separator('')  
for test_row in suzie_tests:
   tests.append(test_row)

# Write the dataset to disk in xls format
with open('fruits.xls', 'wb') as f:
    f.write(tests.export('xls'))

Reading the xls file back shows that separator rows have been inserted between the data rows.


# Import the pandas library
import pandas as pd

# Read the xls file into a DataFrame
pd.read_excel('fruits.xls', keep_default_na=False)

Date Fruit Name Color
0 Fruits A
1 11/24/09 Apple Red
2 05/24/10 Banana Yellow
3
4 11/24/09 Orange Orange
5 05/24/10 Grapes Purple

Format columns

Tablib provides the add_formatter method to attach a custom formatting function to a column of a Dataset; the function is applied to that column's values when the data is exported.

# Create an empty Dataset object
data = tablib.Dataset()

# Add data to the Dataset
data.headers = ['name','age','role']
data.append(['John', 28, 'Developer'])
data.append(['Amy', 25, 'Designer'])

# Define a custom formatting function
def custom_formatter(val):
    if isinstance(val, int):
        return f'Age: {val}'
    elif isinstance(val, str):
        return val.upper()
    else:
        return str(val)

# Attach the custom formatter to the Dataset
# The first argument can be a column index
data.add_formatter(0,custom_formatter)
# If headers are set, a column name can be used instead
data.add_formatter('age',custom_formatter)

# Export the data; the formatters are applied on export
data.df
name age role
0 JOHN Age: 28 Developer
1 AMY Age: 25 Designer

Create subtable

# Create a table
data = tablib.Dataset()
data.headers = ['Name', 'Age','Profession']
data.append(['Alice', 25, 'Doctor'])
data.append(['Bob', 30, 'Doctor'])
data.append(['Jack', 28, 'Engineer'])
# The subset method selects a subset of an existing dataset
# rows holds row indices; rows=[0, 2] selects rows 0 and 2
sub_data = data.subset(rows=[0, 2], cols=['Name', 'Profession'])
sub_data.df
Name Profession
0 Alice Doctor
1 Jack Engineer

2 Reference

Reprinted from: blog.csdn.net/LuohenYJ/article/details/134709069