Accelerating pandas groupby

In everyday financial data processing and model building, pandas' groupby is used constantly. A previous article described what groupby does:

https://blog.csdn.net/qtlyx/article/details/80515077

However, as we all know, Python has the GIL, which means true multithreaded parallelism is not an option. So what do we do if we want to parallelize a groupby? We can use multiple processes: a module called joblib lets us compute the groupby pieces in parallel and then combine the results, which has a bit of a map-reduce feel to it.
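As a quick illustration of the joblib pattern before applying it to groupby (a toy example, not from the original article; the function name `square` is made up here):

```python
from joblib import Parallel, delayed

def square(x):
    return x * x

# n_jobs=2 runs the calls in two worker processes;
# results come back in the same order as the inputs.
results = Parallel(n_jobs=2)(delayed(square)(i) for i in range(5))
print(results)  # [0, 1, 4, 9, 16]
```

`delayed(func)(args)` just packages up the call; `Parallel` dispatches the packaged calls to workers and collects the return values into a list.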

Our scenario is this: we want to compute a rolling beta for the return series of a set of funds. The ordinary way is to groupby fund code, take each group the groupby returns, and compute the beta for it. Here is the beta-calculation function:

def beta_cal_mult(one_fund_df):
    ll = list()
    for ind in range(len(one_fund_df)):
        # Rolling 20-day window of one fund's returns.
        one_fund_df_sub = one_fund_df.iloc[ind:ind + 20]
        ll.append(cross_regression(one_fund_df_sub, ['NAV_ADJ_RETURN1'], ['bench_mark_return']).params['bench_mark_return'])

    # Shift so each beta is aligned with the last day of its window.
    one_fund_df['beta_mult'] = pd.Series(ll).shift(19).tolist()
    return one_fund_df

In this code, one_fund_df is the return time series of a single fund; NAV_ADJ_RETURN1 is the fund return used in the beta calculation, and bench_mark_return is the benchmark return it is regressed against.

So, suppose our data looks like this:

date        code       FUND_FUNDSCALE  NAV_ADJ_RETURN1  bench_mark_return
2015-01-05  000001.OF    4.059972e+09         1.782683           2.270838
2015-01-06  000001.OF    4.059972e+09         0.583820           0.699535
...               ...             ...              ...                ...
2018-11-21  960033.OF    5.157890e+07         0.073306           0.447395
2018-11-22  960033.OF    5.157890e+07        -0.324404          -0.179186
2018-11-23  960033.OF    5.157890e+07        -1.496063          -3.140655
2018-11-26  960033.OF    5.157890e+07        -0.234479          -0.156276
2018-11-27  960033.OF    5.157890e+07         0.486085           0.243327
2018-11-28  960033.OF    5.157890e+07         1.041888           1.251801
2018-11-29  960033.OF    5.157890e+07        -1.394150          -1.790077

Normally, we would groupby on code and then apply the function above. If your machine has multiple cores, you will find that while it runs only one core is fully used and the rest sit largely idle. Now suppose the data is large and the server has a 50-core CPU; in that scenario the waste is hard to bear.

The code below shows how to do the computation in parallel. The idea is very simple: pandas' groupby returns an iterator whose items are the individual groups, i.e. slices of the DataFrame. We can hand those groups to multiple processes and, at the end, concatenate all the partial results back together.
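To see what that iterator yields, here is a minimal demonstration on synthetic data (the column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'code': ['a', 'a', 'b'], 'x': [1, 2, 3]})

# Iterating over a GroupBy yields (group_key, sub-DataFrame) pairs;
# these sub-DataFrames are exactly what the parallel version farms out.
for name, group in df.groupby('code'):
    print(name, len(group))
# a 2
# b 1
```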

import pandas as pd
from joblib import Parallel, delayed
import multiprocessing
import statsmodels.api as sm


def cross_regression(df_temp, y_name, x_list, constant=True):
    # OLS regression of y_name on x_list (both given as lists of column names).
    df = df_temp.dropna()
    y = df[y_name]
    x = df[x_list]
    X = sm.add_constant(x) if constant else x
    results = sm.OLS(y, X, hasconst=constant).fit()
    return results

def beta_cal_mult(one_fund_df):
    ll = list()
    for ind in range(len(one_fund_df)):
        one_fund_df_sub = one_fund_df.iloc[ind:ind + 20]
        ll.append(cross_regression(one_fund_df_sub, ['NAV_ADJ_RETURN1'], ['bench_mark_return']).params['bench_mark_return'])

    one_fund_df['beta_mult'] = pd.Series(ll).shift(19).tolist()
    # print(pd.Series(ll).shift(19))
    return one_fund_df

def applyParallel(dfGrouped, func):
    retLst = Parallel(n_jobs=multiprocessing.cpu_count())(delayed(func)(group) for name, group in dfGrouped)
    return pd.concat(retLst)



data_df = pd.read_hdf('test.h5')
multi_res = applyParallel(data_df.iloc[:10000].groupby('code'), beta_cal_mult)
multi_res.to_hdf('fil.h5', key='data')

The core of the code above is this line:

multi_res = applyParallel(data_df.groupby('code'), beta_cal_mult)

Without parallelism it would have been:

multi_res = data_df.groupby('code').apply(beta_cal_mult)

Instead, we now use the applyParallel function:

def applyParallel(dfGrouped, func):
    retLst = Parallel(n_jobs=multiprocessing.cpu_count())(delayed(func)(group) for name, group in dfGrouped)
    return pd.concat(retLst)

Here joblib's Parallel invokes a function across multiple processes; the n_jobs parameter is the number of CPU cores to use. The group in the generator expression is one of the groups produced by the groupby iterator, i.e. a slice of the original DataFrame, and it is what gets passed into func.
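A toy end-to-end check, on synthetic data rather than the fund data from the article, that the parallel path produces the same result as the serial loop (the `demean` function is a stand-in for beta_cal_mult):

```python
import pandas as pd
from joblib import Parallel, delayed

def demean(group):
    # Per-group transform: subtract the group mean, like beta_cal_mult
    # this takes a sub-DataFrame and returns it with a new column filled in.
    group = group.copy()
    group['y'] = group['y'] - group['y'].mean()
    return group

df = pd.DataFrame({'code': ['a', 'a', 'b', 'b'], 'y': [1.0, 3.0, 2.0, 4.0]})

# Serial: iterate the groupby ourselves and concatenate.
serial = pd.concat(demean(g) for _, g in df.groupby('code'))

# Parallel: same iteration, but each group is processed in a worker.
parallel = pd.concat(Parallel(n_jobs=2)(
    delayed(demean)(g) for _, g in df.groupby('code')))

assert serial.equals(parallel)
```

Because Parallel returns results in submission order, the concatenated frames line up exactly.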

When the amount of data is large, this kind of parallel processing can save more time than you might expect. It would be well worth having such a feature built into pandas itself.


Origin blog.csdn.net/qtlyx/article/details/87301224