Python's Pandas module: read/write CSV, Excel, and HDF5 files; get stock data

Read/write data file

Read/write CSV file

CSV files are comma-separated text files, often used as intermediate files for data exchange between programs. Pandas provides the read_csv() function and the to_csv() method to read and write CSV files.
Assuming there is a mobile.csv file, the content is as follows:
,apple, huawei, oppo
January,1100,1250,800
February,1050,1300,850
March,1200,1328,750

df = pd.read_csv("mobile.csv", encoding='cp936', index_col=0)   # read the file
The file mobile.csv contains Chinese text and was saved with the GBK (cp936) character encoding,
so the same encoding must be specified when reading it. If it is omitted, Pandas reads the file
as UTF-8 by default, which produces a decode error like the following:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 2: invalid continuation byte
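The encoding pitfall is easy to reproduce. The sketch below writes a small GBK-encoded file (the file name mobile_gbk.csv is just for this demo) and shows that reading with the matching encoding works while the UTF-8 default fails:

```python
import pandas as pd

# Create a small GBK (cp936) encoded CSV, like the original mobile.csv
with open("mobile_gbk.csv", "w", encoding="cp936") as f:
    f.write(",apple,huawei,oppo\n一月,1100,1250,800\n")

# Matching encoding: reads fine
df = pd.read_csv("mobile_gbk.csv", encoding="cp936", index_col=0)

# Default utf-8: fails on the GBK bytes of the Chinese label
err = None
try:
    pd.read_csv("mobile_gbk.csv", index_col=0)
except UnicodeDecodeError as e:
    err = e

print(df.loc["一月", "apple"], type(err).__name__)
```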

index_col=0 makes column 0 of the file the index labels; row 0 of mobile.csv is automatically parsed as the column names.
The data file m2.csv, shown below, contains only data with no column names or labels; when reading it, the names parameter can supply the column names.
1100,1250,800
1050,1300,850
1200,1328,750
# Note: this file contains only digits and no Chinese text, so encoding can be omitted
df2 = pd.read_csv("m2.csv", names=['apple', 'huawei', 'oppo'])
# Alternatively, step 1: read the file first;
df3 = pd.read_csv("m2.csv", header=None)
# step 2: then set columns and index
df3.columns = ['apple', 'huawei', 'oppo']
df3.index = ['January', 'February', 'March']

header=None tells Pandas the file has no column names. Without this parameter, the first row of the file, "1100,1250,800", would be wrongly parsed as column names.
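A quick check of that behavior, using an in-memory StringIO in place of m2.csv:

```python
import pandas as pd
from io import StringIO

data = "1100,1250,800\n1050,1300,850\n1200,1328,750\n"

# Without header=None, the first data row is swallowed as column names:
bad = pd.read_csv(StringIO(data))
# With header=None, all three rows stay data and columns are numbered 0..2:
good = pd.read_csv(StringIO(data), header=None)

print(len(bad), len(good))  # 2 3
```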

Some files are separated by whitespace rather than commas. The content of the file m4.txt is as follows:
1100 1250 800
1050 1300 850
1200 1328 750

df4 = pd.read_csv("m4.txt", sep=r"\s+", header=None)

The values in m4.txt are separated by varying numbers of spaces or tabs, so the parameter sep=r"\s+" is specified. "\s+" is a regular expression matching one or more whitespace characters.

# skip the first 2 rows
pd.read_csv("score.txt", skiprows=2, sep=r'\s+', encoding='cp936')
pd.read_csv("datafile", skiprows=[0, 2])                # skip rows 0 and 2
pd.read_csv("datafile", skipfooter=2, engine='python')  # skip the last 2 rows
pd.read_csv("datafile", nrows=10)                       # read only the first 10 rows
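These row-selection parameters can be combined; a small sketch with an in-memory file (the "title line"/"units line" text stands in for header junk to be skipped):

```python
import pandas as pd
from io import StringIO

text = "title line\nunits line\n1100,1250,800\n1050,1300,850\n1200,1328,750\n"

df = pd.read_csv(StringIO(text), skiprows=2, header=None)      # drop the 2 junk lines
tail = pd.read_csv(StringIO(text), skiprows=2, header=None,
                   skipfooter=1, engine="python")              # also drop the last line
first = pd.read_csv(StringIO(text), skiprows=2, header=None, nrows=2)

print(len(df), len(tail), len(first))  # 3 2 2
```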

If the file contains date data, the parse_dates parameter can parse those columns as dates. The content of stock.txt is as follows; its header row holds the Chinese column names 交易日 开盘 最高 最低 收盘 成交量 (trading day, open, high, low, close, volume):
交易日 开盘 最高 最低 收盘 成交量
2019/03/22 18.09 18.63 18.02 18.15 43760812
2019/03/23 18.16 18.35 18.06 18.13 27830796
2019/03/24 18.11 18.11 17.68 17.72 27448272

df = pd.read_csv('stock.txt', parse_dates=['交易日'], encoding='cp936', sep=r'\s+', index_col='交易日')
After parsing, df.index is a datetime index, which makes later queries and statistics over date ranges convenient.
df.index    # the index dtype is datetime64[ns]
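A datetime index supports partial-string indexing, e.g. selecting a whole month with df.loc["2019-03"]. A sketch using an in-memory sample in the same layout as stock.txt (one date moved to April so the month slice has something to exclude):

```python
import pandas as pd
from io import StringIO

raw = ("交易日 开盘 最高 最低 收盘 成交量\n"
       "2019/03/22 18.09 18.63 18.02 18.15 43760812\n"
       "2019/03/23 18.16 18.35 18.06 18.13 27830796\n"
       "2019/04/01 18.11 18.11 17.68 17.72 27448272\n")

df = pd.read_csv(StringIO(raw), sep=r"\s+", parse_dates=["交易日"], index_col="交易日")

march = df.loc["2019-03"]      # partial-string indexing: all rows in March 2019
print(df.index.dtype, len(march))
```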

The DataFrame object has a to_csv() method that saves its data to a CSV file.

df.to_csv("d1.csv", encoding='cp936')           # save to disk, comma-separated by default
df.to_csv("d2.txt", encoding='cp936', sep=' ')  # use a space separator instead
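A round trip through to_csv() and read_csv() recovers the original frame, provided the same encoding and index_col are used on the way back:

```python
import pandas as pd

df = pd.DataFrame({"apple": [1100, 1050], "huawei": [1250, 1300]},
                  index=["January", "February"])

df.to_csv("d1.csv", encoding="cp936")                    # save
back = pd.read_csv("d1.csv", encoding="cp936", index_col=0)  # load again

print(back.equals(df))  # True
```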

Read/write Excel files

Pandas provides the read_excel() function and the to_excel() method to read/write spreadsheets.
# By default, reads the first worksheet, with the first row as the column names; index_col=0 makes column A (column 0) the index labels

df1 = pd.read_excel("mobile.xlsx", index_col=0)    # read an Excel file

# sheet_name selects a particular worksheet in the workbook
df2 = pd.read_excel("mobile.xlsx", sheet_name='Q2')

Use the to_excel() method to save a DataFrame to an Excel file.
df1.to_excel("a1.xlsx")               # save to an Excel file

Several DataFrames can be saved to different worksheets of one Excel file.
After the statements below run, the data of df1 and df2 are stored in test.xlsx in worksheets named "Q1" and "Q2".
from pandas import ExcelWriter
with ExcelWriter("test.xlsx") as writer:
    df1.to_excel(writer, sheet_name='Q1')
    df2.to_excel(writer, sheet_name='Q2')
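The reverse is also possible: read_excel() with sheet_name=None returns every worksheet as a dict of DataFrames. A sketch, assuming the openpyxl engine is installed for .xlsx support (the code checks for it and does nothing otherwise):

```python
import importlib.util
import pandas as pd

df1 = pd.DataFrame({"apple": [1100, 1050]})
df2 = pd.DataFrame({"apple": [1300, 1280]})

# to_excel/read_excel need an engine such as openpyxl for .xlsx files
if importlib.util.find_spec("openpyxl"):
    with pd.ExcelWriter("test.xlsx") as writer:
        df1.to_excel(writer, sheet_name="Q1")
        df2.to_excel(writer, sheet_name="Q2")

    # sheet_name=None reads every worksheet into a dict keyed by sheet name
    sheets = pd.read_excel("test.xlsx", sheet_name=None, index_col=0)
    print(sorted(sheets))  # ['Q1', 'Q2']
```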

Read/write HDF5 files

HDF (Hierarchical Data Format) is a file format used to store and organize large amounts of data.

# If store.h5 does not exist it is created first; if it already exists, it is opened
store = pd.HDFStore('store.h5')
store['dfa'] = df1      # save df1 into the file under the key dfa
store['dfb'] = df2      # save df2 into the file under the key dfb
store.close()           # close the file

The saved data can be read back with the following commands.
store = pd.HDFStore('store.h5')  # open the file
df3 = store['dfa']               # retrieve data by the key dfa
df4 = store['dfb']               # retrieve data by the key dfb
store.close()                    # close the file

As the examples show, an HDF file stores and retrieves data much like a dictionary: everything is accessed through keys.
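HDFStore also works as a context manager, which closes the file automatically, and pd.read_hdf() offers a one-shot read by key. A sketch, assuming the PyTables package ("tables") that backs HDFStore is installed (the code checks for it and does nothing otherwise):

```python
import importlib.util
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3]})

# HDFStore requires PyTables; guard so the sketch degrades gracefully
if importlib.util.find_spec("tables"):
    with pd.HDFStore("demo.h5") as store:   # context manager closes the file
        store["dfa"] = df1
        print(store.keys())                 # keys carry a leading slash: ['/dfa']

    df3 = pd.read_hdf("demo.h5", "dfa")     # one-shot read by key
    print(df3.equals(df1))
```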

Get stock data (using Yahoo Finance and tushare)

First install on the command line:  conda install pandas-datareader
# Yahoo Finance (slow / error-prone)
import pandas_datareader.data as web
# Shanghai tickers use the suffix .ss, Shenzhen .sz; columns returned: 'High', 'Low', 'Open', 'Close', 'Volume'
d1 = web.get_data_yahoo('600030.ss', start='2020-02-01', end='2020-02-29')
d2 = web.get_data_yahoo('002522.sz', start='2020-02-01', end='2020-02-29')
d3 = web.get_data_yahoo('BA', start='2020-02-01', end='2020-03-29')
d4 = web.get_data_yahoo(['MSFT', 'AAPL'], start='2020-02-01', end='2020-02-29')

First install on the command line:  pip install tushare
import tushare as ts    # fast
hs300 = ts.get_k_data('hs300', start='2020-01-01', end='2020-02-29')
k2522 = ts.get_k_data('002522', start='2020-01-01', end='2020-02-29')
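Once K-line data is fetched, a typical follow-up step is computing daily percentage returns from the close column. Since live quotes need network access, this sketch uses a small hand-made close-price series shaped like the fetched data:

```python
import pandas as pd

# Stand-in for a fetched K-line frame: close prices indexed by trading day
k = pd.DataFrame({"close": [18.15, 18.13, 17.72]},
                 index=pd.to_datetime(["2019-03-22", "2019-03-23", "2019-03-24"]))

# Daily return = today's close / yesterday's close - 1
returns = k["close"].pct_change().dropna()
print(returns)
```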

Origin: blog.csdn.net/qq_43416157/article/details/106878987