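Dask can "load" datasets that are larger than the memory available to the compute pool, and then process them much as you would with data tools such as Pandas and NumPy. It hides the batched loading and computation, so developers can focus on the logic of the data itself. To use it locally, just run pip install dask (or pip install "dask[dataframe]" to also pull in pandas, which the DataFrame API used below requires); see the link above for usage. The short example below runs on a machine with 16 GB of RAM and processes 20 GB of futures tick data stored in multiple CSV files (actually using only 2 GB of memory):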
import dask.dataframe as dd
# Build a lazy DataFrame over all CSV files; column names are supplied explicitly
df = dd.read_csv(
    "../future-quotation/2020.8.3~2020.9.18.期货全市场行情数据/DataTimeStream/*.csv",
    names=[
        "localtime",
        "InstrumentID",
        "TradingDay",
        "ActionDay",
        "UpdateTime",
        "UpdateMillisec",
        "LastPrice",
        "Volume",
        "HighestPrice",
        "LowestPrice",
        "OpenPrice",
        "ClosePrice",
        "AveragePrice",
        "AskPrice1",
        "AskVolume1",
        "BidPrice1",
        "BidVolume1",
        "UpperLimitPrice",
        "LowerLimitPrice",
        "OpenInterest",
        "Turnover",
        "PreClosePrice",
        "PreOpenInterest",
        "PreSettlementPrice",
    ],
    # Force float64 so dtype inference from the first rows cannot pick int
    dtype={
        "AveragePrice": "float64",
        "OpenInterest": "float64",
        "PreOpenInterest": "float64",
        "BidPrice1": "float64",
        "ClosePrice": "float64",
        "HighestPrice": "float64",
        "LastPrice": "float64",
        "LowestPrice": "float64",
        "OpenPrice": "float64",
        "PreClosePrice": "float64",
        "Turnover": "float64",
    },
)
# Keep only the grouping keys and the level-1 ask/bid price and volume
df = df.drop(columns=[
    "localtime",
    "ActionDay",
    "UpdateTime",
    "UpdateMillisec",
    "LastPrice",
    "Volume",
    "HighestPrice",
    "LowestPrice",
    "OpenPrice",
    "ClosePrice",
    "AveragePrice",
    "UpperLimitPrice",
    "LowerLimitPrice",
    "OpenInterest",
    "Turnover",
    "PreClosePrice",
    "PreOpenInterest",
    "PreSettlementPrice",
])
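# Per-tick price * volume for the best ask and the best bid
df["AskInterest"] = df["AskPrice1"] * df["AskVolume1"]
df["BidInterest"] = df["BidPrice1"] * df["BidVolume1"]
# Sum every remaining column per instrument and trading day
df = df.groupby(["InstrumentID", "TradingDay"]).sum()
# Sum(price * volume) / sum(volume): the volume-weighted average ask and bid price
df["AskIndex"] = df["AskInterest"] / df["AskVolume1"]
df["BidIndex"] = df["BidInterest"] / df["BidVolume1"]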
# compute() returns a pandas DataFrame; keep it so the task graph runs only once
result = df.compute()
result.head()
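To make the "batched loading and computation" that Dask hides a little more concrete, here is a minimal sketch in plain Python of the same idea: aggregate each chunk into partial sums, merge the partial sums, and divide only at the end. It is not how Dask is implemented, only an illustration of the pattern. The in-memory strings in FILES, the two function names, and the sample values are all hypothetical stand-ins for the real tick files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data standing in for two tick files.
# Columns: InstrumentID, TradingDay, AskPrice1, AskVolume1
FILES = [
    "IF2009,20200803,10.0,1\nIF2009,20200803,20.0,3\n",
    "IF2009,20200803,30.0,1\nIC2009,20200803,5.0,2\n",
]


def partial_sums(rows):
    """Aggregate one chunk into per-group sums of price*volume and volume."""
    acc = defaultdict(lambda: [0.0, 0])
    for instrument, day, price, volume in rows:
        entry = acc[(instrument, day)]
        entry[0] += float(price) * int(volume)
        entry[1] += int(volume)
    return acc


def weighted_ask_price(files):
    """Merge the per-chunk partial sums, then divide once at the end."""
    total = defaultdict(lambda: [0.0, 0])
    for text in files:  # only one chunk is processed at a time
        chunk = partial_sums(csv.reader(io.StringIO(text)))
        for key, (notional, volume) in chunk.items():
            total[key][0] += notional
            total[key][1] += volume
    # Assumes every group has non-zero total volume
    return {key: notional / volume for key, (notional, volume) in total.items()}


result = weighted_ask_price(FILES)
print(result)
```

Because sums can be merged across chunks in any order, no chunk needs to see the others, which is what lets a tool like Dask keep memory use far below the total data size.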