How pandas reads and writes source data


Foreword

When it comes to pandas, every data analyst is familiar with it. Day-to-day analysis inevitably involves reading and writing data, and the sources come in many forms: csv files, Excel workbooks, relational databases, and so on.

To support these sources, pandas provides corresponding read and write methods. This article summarizes how to read and write the common types of data.

Reading and writing of csv and excel

csv file reading

Let's start with csv files. The function for reading csv content is read_csv. It accepts many parameters; here are the most commonly used:

pandas.read_csv(filepath_or_buffer, header="infer", index_col=None, usecols=None, nrows=None)
  • filepath_or_buffer: path or URL of the csv file
  • header: int or list of int; by default pandas infers whether a header row exists. An int such as header=3 uses the fourth row as the header (rows are 0-indexed).
  • index_col: int or str. index_col=4 sets the fifth column (0-indexed) as the index column.
  • usecols: list-like or callable, optional. Selects a subset of columns to read.
  • nrows: int, optional. Number of rows to read; useful for large files when you don't want to load all the data.

For example:

import pandas as pd
df = pd.read_csv('/data/a.csv')

Use the third row (header=2, since rows are 0-indexed) as the header, and read only 5000 rows:

import pandas as pd
df = pd.read_csv('/data/a.csv', header=2, nrows=5000)
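As a sketch of usecols and nrows working together (the file content here is made up and read from an in-memory buffer instead of a real path):

```python
import io
import pandas as pd

# In-memory CSV standing in for /data/a.csv (hypothetical content)
csv_text = "id,name,score\n1,a,90\n2,b,85\n3,c,70\n"

# Read only the id and score columns, and only the first 2 rows
df = pd.read_csv(io.StringIO(csv_text), usecols=["id", "score"], nrows=2)
print(df.shape)          # (2, 2)
print(list(df.columns))  # ['id', 'score']
```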

csv file writing

The function is to_csv; the main parameters are:

df.to_csv(path_or_buf, header=True, index=True)
  • path_or_buf: path of the file to write to
  • header: whether to write the header row, default True
  • index: whether to write the index, default True

For example:

import pandas as pd
df = pd.read_csv('/data/a.csv')
df['col1'] = 1
# write the result to file b
df.to_csv('/data/b.csv', index=False)
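A quick way to see the effect of header and index is to call to_csv without a path, in which case it returns the csv content as a string (the DataFrame here is made up):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# With no path argument, to_csv returns the csv text as a string
csv_text = df.to_csv(index=False)
print(csv_text.splitlines())  # ['col1,col2', '1,3', '2,4']
```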

excel file reading

The function is read_excel; the main parameters are:

pandas.read_excel(io, sheet_name=0,header=0,index_col=None)
  • io: file path or URL
  • sheet_name: str, int, or list; default 0, meaning the first sheet. A str loads the sheet with that name; a list loads each listed sheet and returns a dict of DataFrames.
  • header: int or list of int, default 0
  • index_col: int or list of int, default None. Column(s) to use as the index.

Example:

import pandas as pd
# read the sheet named '测试数据' (test data) from a.xlsx
df = pd.read_excel('data/a.xlsx', sheet_name='测试数据')
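To load every sheet at once, pass sheet_name=None, which returns a dict mapping sheet names to DataFrames. A self-contained sketch, building a small two-sheet workbook in memory as a stand-in for a real .xlsx file (reading/writing .xlsx requires the openpyxl package):

```python
import io
import pandas as pd

# Build a hypothetical two-sheet workbook in an in-memory buffer
buf = io.BytesIO()
with pd.ExcelWriter(buf) as writer:
    pd.DataFrame({"a": [1, 2]}).to_excel(writer, sheet_name="s1", index=False)
    pd.DataFrame({"b": [3]}).to_excel(writer, sheet_name="s2", index=False)
buf.seek(0)

# sheet_name=None loads every sheet and returns a dict of DataFrames
sheets = pd.read_excel(buf, sheet_name=None)
print(list(sheets))  # ['s1', 's2']
```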

excel file writing

The function is to_excel; the main parameters are:

df.to_excel(excel_writer, sheet_name='Sheet1', index=True)
  • excel_writer: str or ExcelWriter object
  • sheet_name: str, name of the sheet
  • index: bool, default True; whether to write the index

A short example:

import pandas as pd

df = pd.read_excel('data/a.xlsx', sheet_name='测试数据')
with pd.ExcelWriter('/data/b.xlsx') as writer:
    df.to_excel(writer, sheet_name='测试数据', index=False)

JSON files are read and written much like excel and csv; see the official documentation for details.
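For instance, to_json/read_json mirror the csv pair (the DataFrame below is made up, and orient="records" is just one of several supported layouts):

```python
import io
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# orient="records" produces a JSON list of row objects
json_text = df.to_json(orient="records")
print(json_text)  # [{"id":1,"name":"a"},{"id":2,"name":"b"}]

# Round-trip back into a DataFrame
df2 = pd.read_json(io.StringIO(json_text), orient="records")
print(df2.equals(df))  # True
```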

Reading and writing databases

Databases are another very common read/write scenario; here we take MySQL as the example.

Reading from MySQL

The function is read_sql; the common parameters are:

pandas.read_sql(sql, con, index_col=None)
  • sql: the SQL query to run
  • con: database connection object, typically a sqlalchemy engine or a sqlite3 connection
  • index_col: column(s) to set as the index

For the MySQL connection we usually use sqlalchemy as the connection object; see the example below:

from sqlalchemy import create_engine
import pandas as pd

# fill in the MySQL connection URL
con = "mysql+pymysql://{user}:{pwd}@{host}:{port}/{db}?charset=utf8"
engine = create_engine(con, connect_args={'connect_timeout': 20})

sql = "select id, name from users"
df = pd.read_sql(sql, engine)
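read_sql also accepts a plain sqlite3 connection, which makes for a fully self-contained sketch (the table and data below are made up):

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for MySQL
con = sqlite3.connect(":memory:")
con.execute("create table users (id integer, name text)")
con.executemany("insert into users values (?, ?)", [(1, "a"), (2, "b")])

# index_col turns the id column into the DataFrame index
df = pd.read_sql("select id, name from users", con, index_col="id")
print(df.shape)  # (2, 1)
```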

Writing to MySQL

The function is to_sql; the common parameters are:

DataFrame.to_sql(name, con, if_exists='fail', index=True, index_label=None, method=None)
  • name: name of the target MySQL table
  • con: connection object
  • if_exists: fail/replace/append, default fail
  • index: default True; whether to write the index
  • index_label: column label(s) for the index
  • method: custom callable that controls how rows are inserted

Example:

from sqlalchemy import create_engine
import pandas as pd

# fill in the MySQL connection URL
con = "mysql+pymysql://{user}:{pwd}@{host}:{port}/{db}?charset=utf8"
engine = create_engine(con, connect_args={'connect_timeout': 20})

sql = "select id, name from users"
df = pd.read_sql(sql, engine)

# write the data into table a
df.to_sql('a', con=engine, if_exists="append", index=False)

Note: to append rows to an existing table, pass if_exists="append". Setting if_exists="replace" will drop the table and its data, then recreate it before inserting.
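The append behavior is easy to see against an in-memory SQLite database as a stand-in for MySQL (the table and data are made up):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# The first call creates the table; if_exists="append" then adds more rows
df.to_sql("users", con, if_exists="append", index=False)
df.to_sql("users", con, if_exists="append", index=False)

out = pd.read_sql("select count(*) as n from users", con)
print(int(out["n"][0]))  # 4
```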

Updating MySQL data

In real work we often need to update existing rows. Plain to_sql cannot do this on its own, so we need another approach.

  • The first option is to pass a custom callable as to_sql's method parameter; we won't go into the details here.

  • The second option is to combine pandas with plain MySQL statements and update the data in steps. One approach:

  1. Create a temporary MySQL table and insert the rows to update into it with pandas
  2. Update the target table from the temporary table with a SQL statement
  3. Drop the temporary table
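A sketch of the first option, the custom method callable: pandas calls it with the table object, the live connection, the column names, and an iterator of row tuples. The demo below uses SQLite's INSERT OR REPLACE as a stand-in; on MySQL the same hook would typically issue INSERT ... ON DUPLICATE KEY UPDATE instead.

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # stand-in for the MySQL engine
with engine.begin() as c:
    c.execute(text("create table users (id integer primary key, name text)"))

def upsert(table, conn, keys, data_iter):
    # conn is a SQLAlchemy connection; grab a raw DBAPI cursor from it
    cur = conn.connection.cursor()
    cols = ", ".join(keys)
    marks = ", ".join(["?"] * len(keys))
    cur.executemany(
        f"insert or replace into {table.name} ({cols}) values ({marks})",
        list(data_iter),
    )

pd.DataFrame({"id": [1, 2], "name": ["a", "b"]}).to_sql(
    "users", engine, if_exists="append", index=False, method=upsert)
# the duplicate key (id=1) updates the row instead of raising an error
pd.DataFrame({"id": [1], "name": ["a2"]}).to_sql(
    "users", engine, if_exists="append", index=False, method=upsert)

out = pd.read_sql("select name from users order by id", engine)
print(out["name"].tolist())  # ['a2', 'b']
```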

Example code:

import pandas as pd
from sqlalchemy import create_engine

# fill in the MySQL connection URL
con = "mysql+pymysql://{user}:{pwd}@{host}:{port}/{db}?charset=utf8"
engine = create_engine(con, connect_args={'connect_timeout': 20})

sql = "select * from a where status=0"
df = pd.read_sql(sql, con=engine)
df['status'] = 1

# create the temporary table
sql = "create table tmp_data (xxx)"
...

# update the target table (MySQL uses UPDATE ... JOIN, not UPDATE ... FROM)
update_sql = """
UPDATE table_to_update AS f
     JOIN tmp_data AS t ON f.id = t.id
     SET f.m3 = t.m3
"""
...

# drop the temporary table
del_sql = "drop table tmp_data"
...
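The "..." steps above boil down to executing raw SQL through the engine; with SQLAlchemy that looks like the following (sketched against an in-memory SQLite database, so the column list is simplified):

```python
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # stand-in for the MySQL engine

# engine.begin() opens a transaction and commits it on success
with engine.begin() as conn:
    conn.execute(text("create table tmp_data (id integer, m3 integer)"))
    conn.execute(text("insert into tmp_data values (1, 10)"))
    conn.execute(text("drop table tmp_data"))

# afterwards the temporary table is gone
with engine.connect() as conn:
    tables = [r[0] for r in conn.execute(
        text("select name from sqlite_master where type='table'"))]
print(tables)  # []
```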

Summary

This article covered how pandas reads and writes the most common data sources; for other sources, the official documentation describes the corresponding methods.


Origin juejin.im/post/7136067514705903653