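Reading large files in chunks with pandas.read_csv

The code below reads the CSV data file from the "DaGuan Cup" (达观杯) competition. Source: "Loading Big Data: With a Cute Read Progress Bar".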

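import pandas as pd
from tqdm import tqdm


# @execution_time
def reader_pandas(file, chunkSize=100000, partitions=10 ** 4):
    # iterator=True returns a TextFileReader instead of loading the whole file.
    reader = pd.read_csv(file, iterator=True)
    chunks = []
    # The bar counts chunks read, out of at most `partitions` chunks.
    with tqdm(range(partitions), 'Reading ...') as t:
        for _ in t:
            try:
                # Read the next chunk of up to chunkSize rows.
                chunk = reader.get_chunk(chunkSize)
                chunks.append(chunk)
            except StopIteration:
                # The file is exhausted, so stop early.
                break
    # Merge all chunks into one DataFrame with a fresh 0..n-1 index.
    return pd.concat(chunks, ignore_index=True)

print(reader_pandas("./data/train_set.csv"))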

Output:

D:\software\Anaconda3\python.exe D:/Competitions/DaGuanBei/test.py
Reading ...:   0%|          | 2/10000 [00:41<79:10:31, 28.51s/it] 
            id  ...  class
0            0  ...     14
1            1  ...      3
2            2  ...     12
3            3  ...     13
4            4  ...     12
5            5  ...     13
6            6  ...      1
7            7  ...     10
8            8  ...     10
9            9  ...     19
10          10  ...     18
11          11  ...      7
12          12  ...      9
13          13  ...      4
14          14  ...     17
15          15  ...      9
16          16  ...     13
17          17  ...     10
18          18  ...     10
19          19  ...     14
20          20  ...     10
21          21  ...      9
22          22  ...      1
23          23  ...      2
24          24  ...     13
25          25  ...      1
26          26  ...      7
27          27  ...     17
28          28  ...     10
29          29  ...      8
...        ...  ...    ...
102247  102247  ...      9
102248  102248  ...     18
102249  102249  ...     13
102250  102250  ...      9
102251  102251  ...      1
102252  102252  ...     14
102253  102253  ...     12
102254  102254  ...     11
102255  102255  ...     19
102256  102256  ...      2
102257  102257  ...      4
102258  102258  ...      3
102259  102259  ...      6
102260  102260  ...      9
102261  102261  ...      1
102262  102262  ...     18
102263  102263  ...      6
102264  102264  ...      8
102265  102265  ...     16
102266  102266  ...     18
102267  102267  ...     15
102268  102268  ...      3
102269  102269  ...      3
102270  102270  ...      3
102271  102271  ...      8
102272  102272  ...     14
102273  102273  ...      8
102274  102274  ...     12
102275  102275  ...      4
102276  102276  ...     11

[102277 rows x 4 columns]

Process finished with exit code 0

The code above uses pandas' read_csv(). Its default parameter sep=',' sets the delimiter to a comma, which matches the comma-separated format of CSV files.
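As a small illustration of the sep parameter, the sketch below parses the same table from a comma-separated string and from a semicolon-separated one. The sample data and variable names are made up for the example, and io.StringIO stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical sample data, not taken from the competition file.
comma_text = "id,class\n0,14\n1,3\n"
semicolon_text = "id;class\n0;14\n1;3\n"

# sep defaults to ',', so a plain CSV needs no extra argument.
df_comma = pd.read_csv(io.StringIO(comma_text))

# Any other delimiter has to be passed explicitly.
df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=";")

print(df_comma.equals(df_semicolon))
```

If sep were left at its default for the semicolon-separated text, each line would be parsed as a single column.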

iterator : boolean, default False

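Returns a TextFileReader object so the file can be processed chunk by chunk.

iterator=True means the file is read chunk by chunk rather than all at once.

reader.get_chunk(chunkSize) reads the next chunk of up to chunkSize rows; once the file is exhausted it raises StopIteration, which the loop above uses as its exit condition.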
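The get_chunk loop can be tried on a tiny in-memory table. This is only a sketch: the 10-row sample and the chunk size of 4 are arbitrary choices for the example. The last two lines show the chunksize parameter of read_csv, which also returns a TextFileReader and lets a plain for loop do the same job.

```python
import io

import pandas as pd

# Hypothetical 10-row table standing in for a large file.
csv_text = "id,class\n" + "".join(f"{i},{i % 5}\n" for i in range(10))

reader = pd.read_csv(io.StringIO(csv_text), iterator=True)
sizes = []
while True:
    try:
        chunk = reader.get_chunk(4)  # up to 4 rows per call
    except StopIteration:
        break  # no rows left
    sizes.append(len(chunk))
print(sizes)

# Alternative: pass chunksize and iterate over the reader directly.
total = sum(len(c) for c in pd.read_csv(io.StringIO(csv_text), chunksize=4))
print(total)
```

The final chunk is simply shorter than the others when the row count is not a multiple of the chunk size.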

The tqdm module prints the progress bar while the file is being read; see the references for details.
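One possible variation, sketched here under the assumption that the total row count is known in advance, is to let the bar count rows instead of chunks so that the percentage reflects real progress. The names total_rows and bar are illustrative, and the data is a made-up in-memory table.

```python
import io

import pandas as pd
from tqdm import tqdm

# Hypothetical 10-row table; total_rows is assumed to be known beforehand.
csv_text = "id,class\n" + "".join(f"{i},{i % 5}\n" for i in range(10))
total_rows = 10

chunks = []
with tqdm(total=total_rows, desc="Reading", unit="rows") as bar:
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
        chunks.append(chunk)
        bar.update(len(chunk))  # advance by the number of rows just read

df = pd.concat(chunks, ignore_index=True)
print(df.shape)
```

When the row count is not known, tqdm can be created without total, in which case it shows a running count and rate rather than a percentage.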

References:

pandas.read_csv parameters explained

pandas.read_csv: reading large files in chunks

Python's tqdm module
