Data migration tutorial | From PostgreSQL/Greenplum to DolphinDB

PostgreSQL is an open-source relational database management system (RDBMS) and one of the most widely used open-source databases. It allows users to extend its functionality with custom functions, data types, and indexes, supports ACID transactions, and uses multi-version concurrency control (MVCC) to manage concurrent access, offering good security and scalability. However, PostgreSQL still faces bottlenecks in areas such as high concurrency and horizontal scaling.

DolphinDB is an efficient, distributed data management and analysis platform. It integrates a powerful programming language (with built-in SQL support and APIs for Python, Java, and other languages) and a high-capacity, high-speed streaming data analysis system, providing a one-stop solution for fast storage, retrieval, analysis, and computation of massive data, especially time-series data. It is easy to operate, highly scalable, fault tolerant, and supports high-concurrency multi-user access, making it suitable for a wide range of large-scale data processing scenarios.

This article provides a concise reference for users who need to migrate data from PostgreSQL to DolphinDB. The tutorial also applies to other databases derived from PostgreSQL, such as Greenplum; implementation details may differ, so always defer to the official documentation of the database you are actually using.

The overall framework for migrating data from PostgreSQL to DolphinDB is built around the two implementation methods described below.

1. Implementation method

There are two methods to migrate data from PostgreSQL to DolphinDB:

1.1 ODBC plug-in

The ODBC (Open Database Connectivity) plug-in is an open-source plug-in provided by DolphinDB that accesses PostgreSQL through the ODBC interface. Because the plug-in is driven by DolphinDB script and runs in the same process space as the DolphinDB server, it can efficiently write PostgreSQL data into DolphinDB.

The ODBC plug-in provides the following functions. For detailed usage, refer to odbc/README_CN.md · Zhejiang Zhiyu Technology Co., Ltd./DolphinDBPlugin - Gitee. A minimal usage sketch follows the list.

  • odbc::connect(connStr, [dataBaseType])
  • odbc::close(conn)
  • odbc::query(connHandle or connStr, querySql, [t], [batchSize], [transform])
  • odbc::execute(connHandle or connStr, SQLstatements)
  • odbc::append(connHandle, tableData, tablename, [createTableIfNotExist], [insertIgnore])
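
The typical workflow is: connect to the source, query it into a DolphinDB table (optionally appending directly into a DFS table), then close the connection. A minimal sketch, assuming a local PostgreSQL instance with the ticksh table from section 3.2 and the credentials used later in this tutorial:

// adjust ServerPath to your installation
loadPlugin("ServerPath/plugins/odbc/PluginODBC.txt")
conn = odbc::connect("Driver={PostgreSQL};Server=127.0.0.1;Port=5432;Database=postgres;Uid=postgres;Pwd=postgres;", `PostgreSQL)
t = odbc::query(conn, "select * from ticksh limit 10")   // read a small sample into an in-memory table
odbc::close(conn)                                         // release the connection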

1.2 DataX driver

DataX is an extensible data synchronization framework that abstracts synchronization between different data sources into Reader plug-ins, which read data from a source, and Writer plug-ins, which write data to a target. In theory, the DataX framework can synchronize data between any supported source and target types.

DolphinDB provides an open-source driver based on the DataX Reader/Writer framework. The DolphinDBWriter plug-in implements writing data into DolphinDB. By combining DataX's existing reader plug-ins with the DolphinDBWriter plug-in, you can synchronize data from various data sources into DolphinDB. Users can also include the DataX driver package in a Java project to develop their own migration software from PostgreSQL data sources to DolphinDB.

2. Application requirements

Much of the data stored in PostgreSQL can be synchronized to DolphinDB using either of the two methods above. The practical case in this article is based on one trading day of tick (trade) data from 2021.01.04, about 27.21 million rows.

3. Migration cases and operation steps

3.1 Environment configuration

The following database versions and plug-ins were used in this case:

  • PostgreSQL version: PostgreSQL 13.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
  • unixODBC version: 2.3.7
  • DolphinDB Server version: 2.00.10 (2023.07.18)
  • DolphinDB GUI version: 1.30.22.1

DolphinDB Server 2.00.10 ships with the ODBC plug-in, located in the <HomeDir>/plugins directory of the server, and it can be loaded and used directly. If the odbc folder does not exist in <HomeDir>/plugins, download it from the following link:

Zhejiang Zhiyu Technology Co., Ltd./DolphinDBPlugin [Branch: release200.10]

Please note that the DolphinDB ODBC plug-in version number must be consistent with the Server version number, otherwise errors may occur. For example, if the version number of DolphinDB Server is 2.00.10.X, you must use the ODBC plug-in of the release200.10 branch.
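
To confirm which plug-in branch matches your deployment, you can check the running server version directly in DolphinDB:

version()
// e.g. "2.00.10 2023.07.18", which corresponds to the release200.10 plug-in branch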

  • Notice: if you want to compile the plug-in yourself, please refer to Chapter 2 of odbc/README_CN.md · Zhejiang Zhiyu Technology Co., Ltd./DolphinDBPlugin - Gitee.

3.2 Create database and table

The PostgreSQL table creation statement is as follows:

create table ticksh(
  SecurityID         varchar(20),
  TradeTime          TIMESTAMP,
  TradePrice         NUMERIC(38,4),
  TradeQty           NUMERIC(38),
  TradeAmount        NUMERIC(38,4),
  BuyNo              NUMERIC(38),
  SellNo             NUMERIC(38),
  TradeIndex         NUMERIC(38),
  ChannelNo          NUMERIC(38),
  TradeBSFlag        varchar(10),
  BizIndex           integer
);

When designing the database and table schema, you need to consider the fields, types, and volume of the actual data, as well as whether partitioning is needed in DolphinDB, the partitioning scheme, and whether to use the OLAP or TSDB storage engine. For guidance on these design choices, refer to the DolphinDB Database Partitioning Tutorial.

In this example, the content of the DolphinDB database and table creation file createTable.dos is as follows:

def createTick(dbName, tbName){
	if(existsDatabase(dbName)){
		dropDatabase(dbName)
	}
	// composite partitioning: VALUE by trade date, then HASH(10) by SecurityID, stored with the TSDB engine
	db1 = database(, VALUE, 2020.01.01..2021.01.01)
	db2 = database(, HASH, [SYMBOL, 10])
	db = database(dbName, COMPO, [db1, db2], , "TSDB")
	name = `SecurityID`TradeTime`TradePrice`TradeQty`TradeAmount`BuyNo`SellNo`ChannelNo`TradeIndex`TradeBSFlag`BizIndex
	type = `SYMBOL`TIMESTAMP`DOUBLE`INT`DOUBLE`INT`INT`INT`INT`SYMBOL`INT
	schemaTable = table(1:0, name, type)
	db.createPartitionedTable(table=schemaTable, tableName=tbName, partitionColumns=`TradeTime`SecurityID, compressMethods={TradeTime:"delta"}, sortColumns=`SecurityID`TradeTime, keepDuplicates=ALL)
}

dbName="dfs://TSDB_tick"
tbName="tick"
createTick(dbName, tbName)
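
After running createTable.dos, a quick way to confirm that the partitioned table exists with the expected schema is:

pt = loadTable("dfs://TSDB_tick", "tick")
pt.schema().colDefs    // lists column names and types of the newly created table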

The mapping relationship of data fields migrated from PostgreSQL to DolphinDB is as follows:

| PostgreSQL field meaning | PostgreSQL field | PostgreSQL data type | DolphinDB field meaning | DolphinDB field | DolphinDB data type |
| --- | --- | --- | --- | --- | --- |
| Security code | SecurityID | VARCHAR(20) | Security code | SecurityID | SYMBOL |
| Trade time | TradeTime | TIMESTAMP | Trade time | TradeTime | TIMESTAMP |
| Trade price | TradePrice | NUMERIC(38,4) | Trade price | TradePrice | DOUBLE |
| Trade quantity | TradeQty | NUMERIC(38) | Trade quantity | TradeQty | INT |
| Trade amount | TradeAmount | NUMERIC(38,4) | Trade amount | TradeAmount | DOUBLE |
| Buy order index | BuyNo | NUMERIC(38) | Buy order index | BuyNo | INT |
| Sell order index | SellNo | NUMERIC(38) | Sell order index | SellNo | INT |
| Trade index | TradeIndex | NUMERIC(38) | Trade index | TradeIndex | INT |
| Channel number | ChannelNo | NUMERIC(38) | Channel number | ChannelNo | INT |
| Trade direction | TradeBSFlag | VARCHAR(10) | Trade direction | TradeBSFlag | SYMBOL |
| Business index | BizIndex | INTEGER | Business index | BizIndex | INT |

3.3 Migration via ODBC

3.3.1 Install ODBC driver

In this example, the operating system of the server where DolphinDB is deployed is CentOS.

step1: Install the unixODBC library. The PostgreSQL ODBC driver depends on unixODBC, so install it first with the following command on CentOS:

# Install the unixODBC library
yum install unixODBC unixODBC-devel
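
step2: Install the PostgreSQL ODBC driver (psqlODBC). On CentOS it is typically available from the system repositories; the package name below is the common one and may differ on other distributions. It provides the psqlodbcw.so library referenced in the configuration in step3:

# Install the PostgreSQL ODBC driver (package name may vary by distribution)
yum install postgresql-odbc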

step3: Configure the ODBC configuration files

1) /etc/odbcinst.ini sets the driver library path used by each ODBC driver. Configure the odbcinst.ini file as follows (if the file does not exist, create it manually):

[PostgreSQL]
Description     = ODBC for PostgreSQL
Driver          = /usr/lib/psqlodbcw.so
Setup           = /usr/lib/libodbcpsqlS.so
Driver64        = /usr/lib64/psqlodbcw.so
Setup64         = /usr/lib64/libodbcpsqlS.so
FileUsage       = 1

2) /etc/odbc.ini defines the ODBC data sources (which driver to use, which database to connect to, and so on). For more configuration options, refer to the ODBC connection string documentation. The Driver value must match the section name in square brackets on the first line of /etc/odbcinst.ini. Add the following content (if the file does not exist, create it manually):

[postgresql] 					// ODBC data source name
Description = PostgreSQL ODBC	// description of the data source
Driver = PostgreSQL				// driver name (section name in odbcinst.ini)
Database = postgres				// database name
Servername = 127.0.0.1			// IP address of the server running PostgreSQL
UserName = postgres				// database user name
Password = postgres				// database password
Port = 5432					    // port of the PostgreSQL server
ReadOnly = 0					// disable read-only mode
ConnSettings = set client_encoding to UTF8	// client encoding
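
Optionally, verify that the driver and data source are registered before testing the connection (standard unixODBC commands):

odbcinst -q -d    # list installed ODBC drivers
odbcinst -q -s    # list configured data sources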

step4: Test the ODBC connection

Log in to the PostgreSQL database through ODBC with isql. A successful connection produces the following output:

isql postgresql postgres postgres  # the last two arguments are the user name and password

/**********output********/
+---------------------------------------+
| Connected!                            |
|                                       |
| sql-statement                         |
| help [tablename]                      |
| quit                                  |
|                                       |
+---------------------------------------+

After a successful login, the SQL> prompt appears and you can enter SQL statements to operate on the database:

SQL>
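
For example, you can run a quick count at the prompt to confirm that the ticksh table from section 3.2 is reachable:

SQL> select count(*) from ticksh;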
  • Notice: for other frequently asked questions, refer to Chapter 5 (Precautions and FAQs) of ODBC_plugin_user_guide.md · Zhejiang Zhiyu Technology Co., Ltd./Tutorials_CN - Gitee.

3.3.2 Synchronize data

step1: Run the following command to load the ODBC plug-in (replace ServerPath with your actual server path):

loadPlugin("ServerPath/plugins/odbc/PluginODBC.txt")

step2: Run the following command to establish a connection to PostgreSQL (the first parameter is the ODBC connection string connStr; modify it for your own environment, see the ODBC connection string reference):

conn = odbc::connect("Driver={PostgreSQL};Server=*;Port=5432;Database=postgres;Uid=postgres;Pwd=postgres;", `PostgreSQL)

step3: Run the following command to start synchronizing data

def transForm(mutable msg){
	msg.replaceColumn!(`TradeQty, int(msg[`TradeQty]))
	msg.replaceColumn!(`BuyNo, int(msg[`BuyNo]))
	msg.replaceColumn!(`SellNo, int(msg[`SellNo]))
	msg.replaceColumn!(`ChannelNo, int(msg[`ChannelNo]))
	msg.replaceColumn!(`TradeIndex, int(msg[`TradeIndex]))
	msg.replaceColumn!(`BizIndex, int(msg[`BizIndex]))
	return msg
}

def syncData(conn, dbName, tbName, dt){
	sql = "select SecurityID, TradeTime, TradePrice, TradeQty, TradeAmount, BuyNo, SellNo, ChannelNo, TradeIndex, TradeBSFlag, BizIndex from ticksh"
	if(!isNull(dt)) {
		// filter by trade date; interpolate dt into the SQL string in a format PostgreSQL accepts
		sql = sql + " where TradeTime::date = '" + temporalFormat(dt, "yyyy-MM-dd") + "'"
	}
	odbc::query(conn, sql, loadTable(dbName, tbName), 100000, transForm)
}

dbName="dfs://TSDB_tick"
tbName="tick"
syncData(conn, dbName, tbName, NULL)

There are 27,211,975 rows in total, and the synchronization takes about 597 seconds.
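
After the synchronization finishes, you can compare row counts on the DolphinDB side (assuming the database and table created in section 3.2):

select count(*) from loadTable("dfs://TSDB_tick", "tick")
// expected result: 27211975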

step4: Background multi-tasking data synchronization

In this tutorial, the DolphinDB database is partitioned by day. If you need to synchronize data for multiple days, you can submit multiple tasks to the background:

for(dt in 2021.01.04..2021.01.05){
	submitJob(`syncPostgreTick, `syncPostgreTick, syncData, conn, dbName, tbName, dt)
}
// check the background jobs
select * from getRecentJobs() where jobDesc = `syncPostgreTick
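
Besides getRecentJobs, each background job's log and return value can be inspected individually; pass the actual job ID shown in the jobId column (DolphinDB may append a suffix when the same ID is submitted repeatedly):

getJobMessage(`syncPostgreTick)    // log output of the job
getJobReturn(`syncPostgreTick)     // return value, or the exception if the job failed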

3.4 Migration via DataX

3.4.1 Deploying DataX

Download the DataX package from the DataX download page and extract it to a directory of your choice. Then run the DataX self-test:

cd datax/bin
python datax.py /opt/datax/job/job.json

An error may occur during the self-test, reporting that the configuration file contains invalid plug-in information (a ...plugin.json file does not exist). This is caused by temporary files left in the reader and writer plug-in directories, which interfere with DataX. Remove them as follows:

# Replace /datax/plugin/... below with your own DataX installation directory
find /datax/plugin/reader/ -type f -name "._*er" | xargs rm -rf
find /datax/plugin/writer/ -type f -name "._*er" | xargs rm -rf

After the self-test succeeds, copy the entire ./dist/dolphindbwriter directory from the DataX-DolphinDBWriter source code into DataX's plugin/writer directory, after which the writer plug-in is ready to use.
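
For example, assuming DataX is installed under /opt/datax and the DataX-DolphinDBWriter source has been built in the current directory (both paths are assumptions; adjust them to your environment):

# install the DolphinDBWriter plug-in into DataX
cp -r ./dist/dolphindbwriter /opt/datax/plugin/writer/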

3.4.2 Executing DataX tasks

step1: Configure json file

The configuration file pgddb.json is shown below; place it in a directory of your choice. In this tutorial it is placed in the datax/job directory.

{
    "job": {
            "content": [{
                    "writer": {
                            "name": "dolphindbwriter",
                            "parameter": {
                                    "userId": "admin",
                                    "pwd": "123456",
                                    "host": "10.0.0.80",
                                    "port": 8848,
                                    "dbPath": "dfs://TSDB_tick",
                                    "tableName": "Tick",
                                    "table": [
                                        {
                                             "type": "DT_SYMBOL",
                                             "name": "SecurityID"
                                        },
                                        {
                                            "type": "DT_TIMESTAMP",
                                            "name": "TradeTime"
                                        },
                                        {
                                            "type": "DT_DOUBLE",
                                            "name": "TradePrice"
                                        },
                                        {
                                            "type": "DT_INT",
                                            "name": "TradeQty"
                                        },
                                        {
                                            "type": "DT_DOUBLE",
                                            "name": "TradeAmount"
                                        },
                                        {
                                            "type": "DT_INT",
                                            "name": "BuyNo"
                                        },
                                        {
                                            "type": "DT_INT",
                                            "name": "SellNo"
                                        },
                                        {
                                            "type": "DT_INT",
                                            "name": "ChannelNo"
                                        },
                                        {
                                            "type": "DT_INT",
                                            "name": "TradeIndex"
                                        },
                                        {
                                            "type": "DT_SYMBOL",
                                            "name": "TradeBSFlag"
                                        },
                                        {
                                            "type": "DT_INT",
                                            "name": "BizIndex"
                                        }
                                    ]                            
                            }
                    },
                    "reader": {
                            "name": "postgresqlreader",
                            "parameter": {
                                    "username": "postgres",
                                    "column": ["SecurityID", "TradeTime", "TradePrice", "TradeQty", "TradeAmount", "BuyNo", "SellNo", "ChannelNo", "TradeIndex", "TradeBSFlag", "BizIndex"],
                                    "connection": [{
                                            "table": ["ticksh"],
                                            "jdbcUrl": ["jdbc:postgresql:postgres"]
                                    }],
                                    "password": "postgres",
                                    "where": ""
                            }                            
                    }
            }],
            "setting": {
                    "speed": {
                            "channel": 1
                    }
            }
    }
}
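
If only part of the table needs to be migrated (for example a single trading day), the where parameter of postgresqlreader can carry a filter clause; a sketch of the relevant line (the date cast syntax assumes PostgreSQL):

"where": "TradeTime::date = '2021-01-04'"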

step2: Run the DataX task from the Linux terminal:

cd ./datax
python bin/datax.py --jvm=-Xmx8g job/pgddb.json

step3: View DataX synchronization results

Job start time                  : 2023-08-29 14:19:53
Job end time                    : 2023-08-29 14:26:33
Total elapsed time              :                400s
Average throughput              :            4.08MB/s
Record write speed              :          68029rec/s
Total records read              :            27211975
Total read/write failures       :                   0

4. Baseline performance

Both the ODBC plug-in and the DataX driver were used to migrate the same 27.21 million rows. The migration times are compared in the following table:

| ODBC plug-in | DataX |
| --- | --- |
| 597.54s | 400s |

In summary, both the ODBC plug-in and DataX can migrate data from PostgreSQL to DolphinDB, but each has its own strengths and weaknesses:

  • The ODBC plug-in is easy to use and well suited to customized data imports, but it is less convenient for operations and maintenance.
  • DataX requires writing relatively complex import configurations, but it is flexible to extend, well suited to batch imports, convenient to monitor, and has rich community support.

Users can choose the appropriate import method based on their data volume and engineering convenience.
