0096 - [Python package] - modin - multi-core acceleration for pandas

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/leadingsci/article/details/89278102

Quick tip: speeding up pandas with Modin

https://python.freelycode.com/contribution/detail/1454

GitHub

https://github.com/modin-project/modin

Documentation

https://modin.readthedocs.io/en/latest/pandas_supported.html#list-of-other-supported-operations-available-on-import


1. Installation

pip install modin

During installation, a prompt said that cpython needed to be installed.

2. Usage: change one line of code

# import pandas as pd
import modin.pandas as pd
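
Because the change is a single import line, a script can also pick the library at run time. The following is a minimal sketch and not part of the original post: the helper name `load_dataframe_library` is hypothetical, and it assumes that falling back to plain pandas is acceptable when modin is not installed.

```python
import importlib
import importlib.util


def load_dataframe_library():
    """Return modin.pandas if modin is installed, else pandas, else None."""
    # find_spec returns None for a top-level package that is not installed.
    if importlib.util.find_spec("modin") is not None:
        return importlib.import_module("modin.pandas")
    if importlib.util.find_spec("pandas") is not None:
        return importlib.import_module("pandas")
    return None


pd = load_dataframe_library()
print("DataFrame library:", pd.__name__ if pd is not None else "none installed")
```

The rest of the script can then keep using `pd.read_csv(...)` and `pd.DataFrame(...)` unchanged, whichever library was loaded.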

Example 1:

import modin.pandas as pd
import numpy as np

frame_data = np.random.randint(0, 100, size=(2**10, 2**8))
df = pd.DataFrame(frame_data)

4. Speed-up

import modin.pandas as pd

df = pd.read_csv("my_dataset.csv")


5. Test on a file

1. File size

-rw-r--r-- 1 toucan toucan 289K Dec 20 17:17 IthaGenes_variations_export_all.csv

2. Reading with pandas

# Run: python read_pandas.py

$cat read_pandas.py
from timeit import default_timer as timer
import pandas as pd
from functools import reduce

# run 2 iterations of read_csv to get an average

time = []
for i in range(0,2):
    start = timer()
    df = pd.read_csv("IthaGenes_variations_export_all.csv")
    end = timer()
    time.append(end - start)

time_read = reduce(lambda x,y : x + y,time)/len(time)
print(time_read)

Output:

$python read_pandas.py
0.009299777299929701
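
The averaging loop in the script above can be factored into a small reusable helper, so that the pandas and modin runs share the same timing code and the same iteration count. This is only a sketch: the function name `average_seconds` is hypothetical, and the workload used here is a stand-in so that the block runs without any CSV file.

```python
from statistics import mean
from timeit import default_timer as timer


def average_seconds(func, repeats):
    """Call func() `repeats` times and return the mean wall-clock time."""
    durations = []
    for _ in range(repeats):
        start = timer()
        func()
        durations.append(timer() - start)
    return mean(durations)


# Stand-in workload; in the benchmark this would be something like
# lambda: pd.read_csv("IthaGenes_variations_export_all.csv")
elapsed = average_seconds(lambda: sum(range(100_000)), repeats=5)
print(elapsed)
```

Passing the same `repeats` value for both libraries keeps the two averages comparable.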

3. Reading with modin

# Run: python read_modin.py

$cat read_pandas.py
from timeit import default_timer as timer
import modin.pandas as pd
from functools import reduce

# run 10 iterations of read_csv to get an average

time = []
for i in range(0,10):
    start = timer()
    df = pd.read_csv("IthaGenes_variations_export_all.csv")
    end = timer()
    time.append(end - start)

time_read = reduce(lambda x,y : x + y,time)/len(time)
print(time_read)

Output:

$python read_pandas.py
Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-13_10-37-43_6323/logs.
Waiting for redis server at 127.0.0.1:35024 to respond...
Waiting for redis server at 127.0.0.1:62923 to respond...
Starting Redis shard with 10.0 GB max memory.
WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 6283886592 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
Starting the Plasma object store with 7.0 GB memory using /tmp.
0.20180192090001584

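Question

On this small file modin was about 20 times slower than pandas. Is that because the input file is too small, or because the laptop is short on memory, so that modin cannot show its advantage? The virtual machine is configured with 12 GB of memory, but it has only 2 cores and 4 threads.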
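
To keep track of the two factors raised here (input size and available cores), it can help to record them next to each timing. The snippet below is a minimal sketch using only the standard library. The helper name `describe_benchmark_setup` is hypothetical, and the 1 MiB temporary file is a placeholder for a real dataset.

```python
import os
import tempfile


def describe_benchmark_setup(path):
    """Return (logical CPU count, file size in MiB) for a benchmark input."""
    cpus = os.cpu_count()  # logical CPUs; may be None if it cannot be determined
    size_mib = os.path.getsize(path) / (1024 * 1024)
    return cpus, size_mib


# Placeholder input: a 1 MiB temporary file instead of a real dataset.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 1024 * 1024)

cpus, size_mib = describe_benchmark_setup(tmp.name)
os.remove(tmp.name)
print(f"logical CPUs: {cpus}, input size: {size_mib:.1f} MiB")
```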


Test 2

1. File size
Switched to a larger file (791M).

-rw-rw-r-- 1 toucan toucan 791M Apr 13 10:43 hapmap_3.3_hg19_pop_stratified_af.vcf

2. Reading with pandas

$cat read_pandas.py
from timeit import default_timer as timer
import pandas as pd
from functools import reduce

# run 22 iterations of read_csv to get an average

time = []
for i in range(0,22):
    start = timer()
    df = pd.read_csv("hapmap_3.3_hg19_pop_stratified_af.vcf",sep="\t")
    end = timer()
    time.append(end - start)

time_read = reduce(lambda x,y : x + y,time)/len(time)
print(time_read)

Output:

$python read_pandas.py
12.275232009363746

3. Reading with modin

$cat read_pandas.py
from timeit import default_timer as timer
import modin.pandas as pd
from functools import reduce

# run 22 iterations of read_csv to get an average

time = []
for i in range(0,22):
    start = timer()
    df = pd.read_csv("hapmap_3.3_hg19_pop_stratified_af.vcf",sep="\t")
    end = timer()
    time.append(end - start)

time_read = reduce(lambda x,y : x + y,time)/len(time)
print(time_read)

In top, the job does not show up as a python process; the screenshot that showed the actual process name is not available.

Output:

$python read_pandas.py
Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-13_10-52-14_6531/logs.
Waiting for redis server at 127.0.0.1:48416 to respond...
Waiting for redis server at 127.0.0.1:29343 to respond...
Starting Redis shard with 10.0 GB max memory.
WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 6283886592 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
Starting the Plasma object store with 7.0 GB memory using /tmp.
WARNING: Logging before InitGoogleLogging() is written to STDERR
W0413 10:52:51.888617  6546 node_manager.cc:245] Last heartbeat was sent 524 ms ago
W0413 10:52:56.548642  6546 node_manager.cc:245] Last heartbeat was sent 539 ms ago
W0413 10:53:03.599262  6546 node_manager.cc:245] Last heartbeat was sent 532 ms ago
W0413 10:53:07.556165  6546 node_manager.cc:245] Last heartbeat was sent 782 ms ago
W0413 10:53:08.947691  6546 node_manager.cc:245] Last heartbeat was sent 636 ms ago
W0413 10:53:17.075079  6546 node_manager.cc:245] Last heartbeat was sent 643 ms ago
W0413 10:53:19.810811  6546 node_manager.cc:245] Last heartbeat was sent 804 ms ago
W0413 10:53:20.800647  6546 node_manager.cc:245] Last heartbeat was sent 513 ms ago
W0413 10:53:22.806788  6546 node_manager.cc:245] Last heartbeat was sent 699 ms ago
W0413 10:54:00.502030  6546 node_manager.cc:245] Last heartbeat was sent 585 ms ago
W0413 10:54:10.019619  6546 node_manager.cc:245] Last heartbeat was sent 513 ms ago
W0413 10:54:24.286998  6546 node_manager.cc:245] Last heartbeat was sent 732 ms ago
W0413 10:54:28.974217  6546 node_manager.cc:245] Last heartbeat was sent 865 ms ago
W0413 10:54:44.903314  6546 node_manager.cc:245] Last heartbeat was sent 537 ms ago
W0413 10:54:45.480008  6546 node_manager.cc:245] Last heartbeat was sent 576 ms ago
W0413 10:54:50.389829  6546 node_manager.cc:245] Last heartbeat was sent 556 ms ago
W0413 10:54:52.274536  6546 node_manager.cc:245] Last heartbeat was sent 522 ms ago
W0413 10:54:52.873443  6546 node_manager.cc:245] Last heartbeat was sent 599 ms ago
W0413 10:55:15.301537  6546 node_manager.cc:245] Last heartbeat was sent 1008 ms ago
W0413 10:55:16.863193  6546 node_manager.cc:245] Last heartbeat was sent 1129 ms ago
W0413 10:55:18.049829  6546 node_manager.cc:245] Last heartbeat was sent 603 ms ago
W0413 10:55:24.432444  6546 node_manager.cc:245] Last heartbeat was sent 959 ms ago
W0413 10:56:02.659128  6546 node_manager.cc:245] Last heartbeat was sent 643 ms ago
W0413 10:56:09.559237  6546 node_manager.cc:245] Last heartbeat was sent 607 ms ago
W0413 10:56:12.926802  6546 node_manager.cc:245] Last heartbeat was sent 595 ms ago
W0413 10:56:14.754364  6546 node_manager.cc:245] Last heartbeat was sent 830 ms ago
W0413 10:56:17.414083  6546 node_manager.cc:245] Last heartbeat was sent 526 ms ago
W0413 10:56:21.293486  6546 node_manager.cc:245] Last heartbeat was sent 539 ms ago
W0413 10:56:23.624935  6546 node_manager.cc:245] Last heartbeat was sent 576 ms ago
W0413 10:56:25.183625  6546 node_manager.cc:245] Last heartbeat was sent 703 ms ago
W0413 10:57:06.594352  6546 node_manager.cc:245] Last heartbeat was sent 544 ms ago
W0413 10:57:09.569542  6546 node_manager.cc:245] Last heartbeat was sent 693 ms ago
W0413 10:57:12.113721  6546 node_manager.cc:245] Last heartbeat was sent 506 ms ago
W0413 10:57:13.748317  6546 node_manager.cc:245] Last heartbeat was sent 690 ms ago
W0413 10:57:18.617753  6546 node_manager.cc:245] Last heartbeat was sent 1032 ms ago
W0413 10:57:25.745839  6546 node_manager.cc:245] Last heartbeat was sent 580 ms ago
W0413 10:57:38.555815  6546 node_manager.cc:245] Last heartbeat was sent 1772 ms ago
W0413 10:58:38.673028  6546 node_manager.cc:245] Last heartbeat was sent 1484 ms ago
W0413 10:58:59.426427  6546 node_manager.cc:245] Last heartbeat was sent 2125 ms ago
21.333674854863624

After a wait of several minutes, the script reported an average of about 21 seconds per read.

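Conclusion

In the tests on this virtual machine, modin showed no clear advantage and was in fact slower than pandas (about 21.3 s versus 12.3 s per read of the 791M file), since the VM has few CPU cores.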
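
The size of the slowdown can be worked out directly from the averages printed in this post. The snippet below is just that arithmetic; the dictionary labels are made up for readability. Note that the small-file runs are not strictly comparable, because the pandas script averaged 2 iterations and the modin script averaged 10.

```python
# Average seconds per read, copied from the outputs shown in this post.
timings = {
    "289K csv": {"pandas": 0.009299777299929701, "modin": 0.20180192090001584},
    "791M vcf": {"pandas": 12.275232009363746, "modin": 21.333674854863624},
}

# Ratio above 1 means modin was slower than pandas.
ratios = {name: t["modin"] / t["pandas"] for name, t in timings.items()}
for name, ratio in ratios.items():
    print(f"{name}: modin took {ratio:.2f}x as long as pandas")
```

On these numbers modin took roughly 21.7 times as long as pandas on the small file and roughly 1.74 times as long on the large one.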


Reposted from blog.csdn.net/leadingsci/article/details/89278102