Building a high-performance factor computing platform with DolphinDB and Python Celery

Factor mining is at the heart of quantitative finance research and trading. In the traditional workflow, Python reads data from relational databases (such as SQL Server or Oracle) and performs the factor calculations itself. As the scale of securities trading keeps expanding and the volume of transaction data surges, users place ever higher demands on the performance of the factor computing platform. A traditional factor computing pipeline faces the following problems:

  • Growing factor data volumes expose the performance bottlenecks of Python as the computing engine;
  • It is unclear how the existing computing framework can be replaced seamlessly, without disrupting the current workflow.

This tutorial focuses on the business scenario of batch factor calculation and introduces DolphinDB as the core computing engine of a traditional factor platform. The platform built around DolphinDB consists of a data synchronization module, a batch factor calculation module and a task scheduling module: dataX serves as the data synchronization tool, responsible for synchronizing both historical and incremental data from relational databases into DolphinDB; DolphinDB serves as the factor calculation and storage module; and Celery serves as the task scheduling framework.

The DolphinDB factor computing platform can provide business departments with real-time factor calculation, large-scale batch calculation, and historical factor query services. After introducing DolphinDB, the platform not only meets the requirements of high-, medium- and low-frequency factor calculation, but also offers rich APIs and ETL tools for seamless integration. The following sections take factor No. 1 (WQAlpha1) in the WorldQuant 101 Alpha factor library as an example to walk through the construction of the entire factor computing platform.

1. Overall structure

The factor platform based on DolphinDB and the Python Celery framework mainly consists of: dataX-based data ingestion from SQL Server (the historical data synchronization module) into DolphinDB (the data storage and calculation module); the definition and invocation of DolphinDB factor functions; and the Celery framework (the scheduling module), which calls the factor functions with the given parameters and finally presents the results as a DataFrame. Its overall architecture is shown below:

1.1 SQL Server Overview

  • Introduction:
    SQL Server is a relational database management system developed and promoted by Microsoft.
    SQL Server serves as the original data source in this architecture.
  • Tutorial, download and installation:
    For the use, download and installation of SQL Server, please refer to the official SQL Server documentation.

1.2 Overview of dataX

  • Introduction:
    dataX is an offline synchronization tool for heterogeneous data sources. It enables efficient data synchronization between a wide range of heterogeneous data sources, including MySQL, Oracle, SQL Server and PostgreSQL.
    In this tutorial, the dolphindbwriter plug-in for dataX is used to import SQL Server data into DolphinDB.
  • Tutorial, download and installation:
    For the use and installation of dataX, please refer to the dataX guide. To download it, please click dataX.

1.3 DolphinDB overview

  • Introduction:
    DolphinDB is a high-performance time-series data processing framework, used here to compute high-frequency factors and to store factor data.
    In this tutorial, DolphinDB is the main tool for factor calculation; its function view feature is used to predefine the factor function, which is then called from Python through DolphinDB's Python API.
  • Tutorial, download and installation:
    For the DolphinDB installation guide, please refer to DolphinDB Installation and User Guide, and use the download link to download it. For calling the Python API, please refer to Python API for DolphinDB.

1.4 Celery overview

  • Introduction:
    Celery is a simple, flexible and reliable distributed asynchronous task queue developed in Python. It processes tasks asynchronously: a message broker (Broker) dispatches tasks to the task execution units (Workers), and the execution results are stored in a result backend (Backend).
    Celery has the following advantages:

  • It allows requests to be initiated and processed asynchronously, making concurrent execution in Python much easier;

  • It integrates easily with components such as RabbitMQ and DolphinDB, and scales well.

In this tutorial, Celery is used as the task scheduling framework, with redis serving as both the message broker and the result backend, to schedule the factor calculation tasks.

Note: To prevent errors such as TypeError: __init__() got an unexpected keyword argument 'username', it is recommended to uninstall the kombu library installed by default alongside Celery and install version 5.1.0 of kombu instead.

2. Environment deployment

Note:
1. This tutorial describes a test-environment deployment, so the DolphinDB service is deployed as a single node. For the deployment steps, please refer to the DolphinDB single-node deployment tutorial;
2. The Celery version used in this tutorial is 4.3.0.

  • Hardware environment:

  • Software environment:

3. Development and use cases

3.1 Data introduction

This tutorial uses the daily closing prices of multiple stocks from 2020.01.01 to 2021.01.01, 544,174 records in total. The following is the structure of the closing price table in SQL Server and DolphinDB:
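The original table layout is not reproduced here; based on the table-creation script and the import configuration in Section 3.3, the table contains three columns: SecurityID (the stock code, SYMBOL type in DolphinDB), TradeDate (the trading date, DATE), and Value (the closing price, DOUBLE).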

3.2 Introduction to Business Scenarios and Indicators

This tutorial selects factor No. 1, WQAlpha1, in the WorldQuant 101 Alpha factor library as the calculation case. For details about the factor library and how to reference this factor, please refer to the WorldQuant 101 Alpha factor library documentation.
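For reference, Alpha#1 is defined in the 101 Formulaic Alphas paper roughly as:

    rank(Ts_ArgMax(SignedPower(((returns < 0) ? stddev(returns, 20) : close), 2.0), 5)) - 0.5

i.e. a cross-sectional rank, recentred by subtracting 0.5, of the position within the last 5 days at which the squared value of either the 20-day return volatility (on down days) or the close price peaked. In DolphinDB this factor is provided by the WQAlpha1 function of the wq101alpha module used in Section 3.4.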

3.3 dataX synchronizes SQL Server data to DolphinDB

This section describes how to synchronize SQL Server data to DolphinDB.

Note:
1. This tutorial assumes that the SQL Server database holding the data has already been built; its construction is not covered below;
2. The single-node DolphinDB service deployed in this tutorial listens on port 8848.

  • DolphinDB database table construction:
    Before importing data into DolphinDB, the database and table must be created in advance on the deployed DolphinDB service. Execute the following DolphinDB script to create the database dfs://tick_close and its data table tick_close:

    dbName = "dfs://tick_close"
    tbName = "tick_close"
    if(existsDatabase(dbName)){
        dropDatabase(dbName)
    }
    db = database(dbName, RANGE, date(datetimeAdd(2000.01M, 0..50*12, 'M')))
    name = `SecurityID`TradeDate`Value
    type = `SYMBOL`DATE`DOUBLE
    schemaTable = table(1:0, name, type)
    db.createPartitionedTable(table=schemaTable, tableName=tbName, partitionColumns=`TradeDate)
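
    To confirm that the database and table were created, a quick check can also be run through the DolphinDB Python API — a minimal sketch, assuming the admin account and port 8848 used throughout this tutorial:

    import dolphindb as ddb

    s = ddb.session()
    s.connect("127.0.0.1", 8848, "admin", "123456")
    # should print True once the partitioned table has been created
    print(s.run("existsTable('dfs://tick_close', 'tick_close')"))
    s.close()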

  • Write an import configuration file:
    Before starting dataX to run an import, a configuration file in JSON format must be written to describe the data sources involved in the synchronization.
    In general, each data table to be synchronized needs its own configuration file. In this tutorial, the following tick_close.json file is written for the tick_close data table:

    {
        "job": {
            "content": [
                {
                    "writer": {
                        "parameter": {
                            "dbPath": "dfs://tick_close",
                            "tableName": "tick_close",
                            "batchSize": 100,
                            "userId": "admin",
                            "pwd": "123456",
                            "host": "127.0.0.1",
                            "table": [
                                {
                                    "type": "DT_SYMBOL",
                                    "name": "SecurityID"
                                },
                                {
                                    "type": "DT_DATE",
                                    "name": "TradeDate"
                                },
                                {
                                    "type": "DT_DOUBLE",
                                    "name": "Value"
                                }
                            ],
                            "port": 8848
                        },
                        "name": "dolphindbwriter"
                    },
                    "reader": {
                        "name": "sqlserverreader",
                        "parameter": {
                            "username": "SA",
                            "password": "Sa123456",
                            "column": [ "*" ],
                            "connection": [
                                {
                                    "table": [ "tick_close" ],
                                    "jdbcUrl": [ "jdbc:sqlserver://127.0.0.1:1234;DatabaseName=tick_close" ]
                                }
                            ]
                        }
                    }
                }
            ],
            "setting": {
                "speed": {
                    "channel": 1
                }
            }
        }
    }

Note: The data synchronization in this tutorial is a one-off full synchronization of historical data. If incremental synchronization is required in practice, two additional settings, saveFunctionName and saveFunctionDef, should be added to the writer configuration; see the tutorial on data import based on the DataX tool.

  • Execute the data import command:
    Enter the dataX bin directory and execute the following command to import the tick_close data into the DolphinDB data table:

    $ python datax.py ../conf/tick_close.json

Parameter explanation:

  • datax.py: The script used to start dataX, required
  • ../conf/tick_close.json: The path to store the configuration file, required

Expected output:
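After the import completes, an optional sanity check can be run through the DolphinDB Python API to confirm the imported row count — a minimal sketch, assuming the same connection parameters as in Section 3.4:

import dolphindb as ddb

s = ddb.session()
s.connect("127.0.0.1", 8848, "admin", "123456")
# row count of the imported table; should match the 544,174 records mentioned in Section 3.1
print(s.run("select count(*) from loadTable('dfs://tick_close', 'tick_close')"))
s.close()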

3.4 Celery framework triggers DolphinDB predefined function calculation

This section describes how to implement the factor function in DolphinDB script and how to use the Celery framework to call it and trigger the calculation.

  • Setting up redis as the message broker and result backend:

The Celery framework needs a message broker to deliver messages for task scheduling, and a result backend to store task results. In this tutorial, redis is used for both roles and is deployed on port 6379. In practice, users can choose their own tools and deployment methods; the redis deployment process is omitted here.
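Before wiring Celery to redis, it is worth confirming that the redis service is reachable. A minimal sketch using the redis Python client (installed in the next step), assuming redis listens on localhost:6379 as configured below:

import redis

# db 1 and db 2 are used later as the Celery broker and backend databases
r = redis.Redis(host="localhost", port=6379, db=1)
print(r.ping())   # True if the redis service is reachable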

  • DolphinDB factor function implementation process:

Log in to the machine, or connect to the DolphinDB service with the DolphinDB GUI or the VS Code plug-in, and predefine the factor function with DolphinDB script. This tutorial wraps factor No. 1 (WQAlpha1) of the WorldQuant 101 Alpha factor library as the example. The code is as follows:

/**
 * Factor: WQAlpha1, factor No. 1 in the WorldQuant 101 Alpha factor library
 * Parameters:
 *     security_id: STRING VECTOR, vector of stock codes
 *     begin_date: DATE, start date of the interval
 *     end_date: DATE, end date of the interval
 */
use wq101alpha
defg get_alpha1(security_id, begin_date, end_date){
    if (typestr(security_id) == 'STRING VECTOR' && typestr(begin_date) == `DATE && typestr(end_date) == `DATE){
        tick_list = select * from loadTable("dfs://tick_close", "tick_close") where TradeDate >= begin_date and TradeDate <= end_date and SecurityID in security_id
        alpha1_list = WQAlpha1(panel(tick_list.TradeDate, tick_list.SecurityID, tick_list.Value))
        return table(alpha1_list.rowNames() as TradeDate, alpha1_list)
    }
    else {
        print("What you have entered is a wrong type")
        return `NULLValue
    }
}

Parameter explanation:

  • Request parameters:

  • Return parameters:

Calling a DolphinDB predefined function from Python through the Python API differs from calling it inside the server session where it was defined. To let the Python API call the function defined in DolphinDB script, this tutorial uses DolphinDB's function view mechanism (functionView): the function is first added as a function view, and the execution permission on that view is then granted to the researcher's account (the admin user does not need a grant). The code is as follows:

// add the function to the function views
addFunctionView(get_alpha1)
// grant user xxx the permission to execute the function view
grant("xxx", VIEW_EXEC, "get_alpha1")
  • Building the Celery project that calls the factor function:

This section describes how to build the project based on the Celery framework. In this tutorial Celery is installed with pip; log in to the machine and execute the following command (other installation methods also work):

$ pip install celery==4.3.0 && pip install redis==3.2.0

Note: If an error such as TypeError: __init__() got an unexpected keyword argument 'username' occurs, the kombu library installed alongside Celery has an incompatible version. It is recommended to uninstall the existing kombu and execute pip3 install kombu==5.1.0 to install version 5.1.0.

After installing the required libraries, execute the following commands in a specific directory to build the project directory structure and required files:

$ mkdir celery_project && touch celery_project/tasks.py celery_project/app.py

Execute the tree ./celery_project command to view the project directory structure:

./celery_project
├── app.py
└── tasks.py

0 directories, 2 files

The contents of the two files are as follows:

tasks.py: This file creates a session with DolphinDB, wraps the call to the predefined DolphinDB function, and declares the wrapped function as a task that can be scheduled asynchronously by the Celery framework.

First, import the required Python libraries:

from celery import Celery
import dolphindb as ddb
import numpy as np
import pandas as pd
from datetime import datetime

Next, use DolphinDB's Python API to establish a session with the DolphinDB service deployed earlier:

s = ddb.session()
s.connect("127.0.0.1", 8848, "admin", "123456")

At the same time, instantiate a Celery object and set related configurations:

app = Celery(
    'celeryApp',
    broker='redis://localhost:6379/1',
    backend='redis://localhost:6379/2'
)
app.conf.update(
    task_serializer='pickle',
    accept_content=['pickle'], 
    result_serializer='pickle',
    timezone='Asia/Shanghai',
    enable_utc=True,
)

Note: Because the calculation involves passing and returning datetime and DataFrame data, Celery's default json serializer cannot handle these types; therefore the task_serializer, accept_content and result_serializer parameters must be set to use pickle.

Finally, wrap the call to the DolphinDB predefined function in a Python function and add the @app.task() decorator to declare it as an asynchronous task that Celery can invoke:

@app.task()
def get_alpha1(security_id, begin_date, end_time):
    return s.run("get_alpha1", security_id, begin_date, end_time)

Note: Here the parameters are passed as native Python data types. For the mapping between Python and DolphinDB data types and how parameters are passed, please refer to Section 1.3 of Python API for DolphinDB.
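As a quick way to see how a Python argument arrives on the DolphinDB side, the server's typestr function can be run on the uploaded values — a small sketch that could be run right after creating the session s in tasks.py:

print(s.run("typestr", ["600020", "600021"]))          # STRING VECTOR
print(s.run("typestr", np.datetime64("2020-01-01")))   # DATE

These are exactly the types checked by the get_alpha1 function defined above.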

  • app.py: This file sends asynchronous task requests to the Celery framework, i.e. it triggers the task declared in tasks.py.

The code is shown below. A loop calls the delay() function to send two task requests to the Celery framework and prints the task id in each iteration:

import numpy as np
from tasks import get_alpha1
security_id_list=[["600020", "600021"],["600022", "600023"]]
if __name__ == '__main__':
  for i in security_id_list:
    result = get_alpha1.delay(i, np.datetime64('2020-01-01'), np.datetime64('2020-01-31'))
    print(result)
  • Running the factor calculation tasks through Celery:

Execute the following statement on the command line to run the worker side of the Celery framework:

$ celery -A tasks worker --loglevel=info

Expected output:

 -------------- celery@cnserver9 v4.3.0 (rhubarb)
---- **** -----
--- * ***  * -- Linux-3.10.0-1160.53.1.el7.x86_64-x86_64-with-centos-7.9.2009-Core 2022-11-11 00:10:34
-- * - **** ---
- ** ---------- [config]
- ** ---------- .> app:         celeryApp:0x7f597a1d4e48
- ** ---------- .> transport:   redis://localhost:6379/1
- ** ---------- .> results:     redis://localhost:6379/2
- *** --- * --- .> concurrency: 64 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery


[tasks]
  . tasks.get_alpha1

[2022-11-11 00:10:37,413: INFO/MainProcess] Connected to redis://localhost:6379/1
[2022-11-11 00:10:37,437: INFO/MainProcess] mingle: searching for neighbors
[2022-11-11 00:10:38,465: INFO/MainProcess] mingle: all alone
[2022-11-11 00:10:38,488: INFO/MainProcess] celery@cnserver9 ready.

Since this command keeps the worker running in interactive mode, open a new session on the machine, enter the Celery project directory, and execute the following command to send the asynchronous task requests to the Celery framework:

$ python3 app.py

Expected output:

400a3024-65a1-4ba6-b8a9-66f6558be242
cd830360-e866-4850-aba0-3a07e8738f78

Switching back to the worker terminal, we can now see the execution status of the asynchronous tasks and the returned results:

 -------------- celery@cnserver9 v4.3.0 (rhubarb)
---- **** -----
--- * ***  * -- Linux-3.10.0-1160.53.1.el7.x86_64-x86_64-with-centos-7.9.2009-Core 2022-11-11 00:10:34
-- * - **** ---
- ** ---------- [config]
- ** ---------- .> app:         celeryApp:0x7f597a1d4e48
- ** ---------- .> transport:   redis://localhost:6379/1
- ** ---------- .> results:     redis://localhost:6379/2
- *** --- * --- .> concurrency: 64 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
 -------------- [queues]
                .> celery           exchange=celery(direct) key=celery


[tasks]
  . tasks.get_alpha1

[2022-11-11 00:10:37,413: INFO/MainProcess] Connected to redis://localhost:6379/1
[2022-11-11 00:10:37,437: INFO/MainProcess] mingle: searching for neighbors
[2022-11-11 00:10:38,465: INFO/MainProcess] mingle: all alone
[2022-11-11 00:10:38,488: INFO/MainProcess] celery@cnserver9 ready.
[2022-11-11 00:12:44,365: INFO/MainProcess] Received task: tasks.get_alpha1[400a3024-65a1-4ba6-b8a9-66f6558be242]
[2022-11-11 00:12:44,369: INFO/MainProcess] Received task: tasks.get_alpha1[cd830360-e866-4850-aba0-3a07e8738f78]
[2022-11-11 00:12:44,846: INFO/ForkPoolWorker-63] Task tasks.get_alpha1[400a3024-65a1-4ba6-b8a9-66f6558be242] succeeded in 0.04292269051074982s:    TradeDate  600020  600021
0  2020-01-01     NaN     NaN
1  2020-01-02     NaN     NaN
2  2020-01-03     NaN     NaN
3  2020-01-06     NaN     NaN
4  2020-01-07     0.5     0.0
5  2020-01-08     0.5     0.0
6  2020-01-09     0.0     0.5
7  2020-01-10     0.0     0.5
8  2020-01-13     0.0     0.5
9  2020-01-14     0.0     0.5
10 2020-01-15     0.5     0.0
11 2020-01-16     0.5     0.0
12 2020-01-17     0.5     0.0
13 2020-01-20     0.5     0.0
14 2020-01-21     0.0     0.5
15 2020-01-22     0.5     0.0
16 2020-01-23     0.5     0.0
17 2020-01-24     0.5     0.0
18 2020-01-27     0.5     0.0
19 2020-01-28     0.0     0.5
20 2020-01-29     0.0     0.5
21 2020-01-30     0.0     0.5
22 2020-01-31     0.0     0.5

[2022-11-11 00:12:45,054: INFO/ForkPoolWorker-1] Task tasks.get_alpha1[cd830360-e866-4850-aba0-3a07e8738f78] succeeded in 0.06510275602340698s:     TradeDate  600022  600023
0  2020-01-01     NaN     NaN
1  2020-01-02     NaN     NaN
2  2020-01-03     NaN     NaN
3  2020-01-06     NaN     NaN
4  2020-01-07     0.0     0.0
5  2020-01-08     0.0     0.0
6  2020-01-09     0.0     0.0
7  2020-01-10     0.0     0.0
8  2020-01-13     0.0     0.0
9  2020-01-14     0.0     0.0
10 2020-01-15     0.0     0.5
11 2020-01-16     0.0     0.0
12 2020-01-17     0.0     0.5
13 2020-01-20     0.5     0.0
14 2020-01-21     0.5     0.0
15 2020-01-22     0.5     0.0
16 2020-01-23     0.5     0.0
17 2020-01-24     0.0     0.5
18 2020-01-27     0.0     0.0
19 2020-01-28     0.5     0.0
20 2020-01-29     0.5     0.0
21 2020-01-30     0.5     0.0
22 2020-01-31     0.5     0.0

After the tasks finish, the task results stored in redis can also be inspected.
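
A straightforward way to do this is through Celery's own result API rather than raw redis keys — a minimal sketch, where the task id is one of the ids printed by app.py:

from celery.result import AsyncResult

from tasks import app

# replace the id with one of the ids printed by app.py
res = AsyncResult("400a3024-65a1-4ba6-b8a9-66f6558be242", app=app)
print(res.status)   # e.g. SUCCESS
print(res.result)   # the DataFrame returned by get_alpha1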

Note: Asynchronous task requests can also be sent when the Celery worker is not started; in that case only the task id is returned, and the execution status and results cannot be viewed until a worker is started.

4. Summary

This tutorial has focused on introducing DolphinDB into a traditional factor computing platform to resolve its performance bottlenecks. In our tests, combining the asynchronous task scheduling of the Celery framework with the performance of DolphinDB's integrated computing and storage provides a practical solution for production use.

Due to limited space, other operations related to SQL Server, dataX, DolphinDB and the Celery framework could not be covered in more detail; users will need to adjust them to their actual situation. Feedback on any mistakes or shortcomings in this tutorial is welcome.

Appendix
