Practical guide | Time series database partitioning tutorial (1)

1. Why partition the data?

Partitioning a database can greatly reduce system response latency and increase data throughput. Specifically, partitioning has the following benefits:

  • Partitioning makes large tables easier to manage. Maintenance operations on a subset of the data are more efficient because they touch only the required data rather than the entire table. A good partitioning strategy also reduces the amount of data scanned, because a query reads only the partitions relevant to it. When all the data sits in a single partition, queries, calculations, and other operations are all limited by the disk I/O bottleneck.
  • Partitioning allows the system to make full use of all resources. With a good partitioning scheme, parallel and distributed computing can use every node to complete a task that would otherwise run on a single node. When a task can be divided into independent subtasks that each access a different partition, efficiency improves.
  • Partitioning increases the availability of the system. Replicas of a partition are usually stored on different physical nodes, so if one partition becomes unavailable, the system can still read the replicas on other nodes and keep the job running.
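
The first benefit, scanning only the relevant data, can be illustrated with a small Python sketch. This is a conceptual model, not DolphinDB code: the toy rows and the function names scan_unpartitioned and scan_partitioned are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: (trading_date, price) rows, not taken from the tutorial.
rows = [
    ("2017.08.07", 10.0), ("2017.08.07", 10.5),
    ("2017.08.08", 11.0),
    ("2017.08.09", 11.2), ("2017.08.09", 11.4), ("2017.08.09", 11.6),
]

# Group the rows into one partition per date.
partitions = defaultdict(list)
for date, price in rows:
    partitions[date].append((date, price))

def scan_unpartitioned(target_date):
    """Examine every row; return (matching rows, number of rows examined)."""
    matches = [r for r in rows if r[0] == target_date]
    return matches, len(rows)

def scan_partitioned(target_date):
    """Examine only the partition for target_date; return (matching rows, number of rows examined)."""
    part = partitions.get(target_date, [])
    return list(part), len(part)

full_matches, full_cost = scan_unpartitioned("2017.08.08")
pruned_matches, pruned_cost = scan_partitioned("2017.08.08")
print(full_cost, pruned_cost)
```

Both scans return the same rows, but the partitioned scan examines only the rows stored under the requested date.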

2. Partition method

DolphinDB supports a variety of partitioning methods: range partition (RANGE), hash partition (HASH), value partition (VALUE), list partition (LIST), composite partition (COMPO).

  • Range partitioning creates a partition for each interval and places all records whose partition-column values fall in that interval into it. It is the most commonly used and recommended partitioning method.
  • Hash partitioning applies a hash function to the partition column, which makes it easy to create a specified number of partitions.
  • Value partitioning creates a partition for each value, such as each stock trading date or each stock trading month.
  • List partitioning creates partitions from user-enumerated lists of values, which is more flexible than value partitioning.
  • Composite partitioning suits very large data sets whose queries often involve two or more partition columns. Each partition column can use range, value, or list partitioning. For example, partition by value on the stock trading date and by range on the stock code.

We can use the database function to create a database.

Syntax: database(directory, [partitionType], [partitionScheme], [locations])

Parameters

directory: The directory where the database is saved. DolphinDB has three types of databases: in-memory databases, databases on local disk, and databases on a distributed file system. To create an in-memory database, leave directory empty; to create a local database, directory should be a directory on the local file system; to create a database on a distributed file system, directory should start with "dfs://". The examples in this tutorial use "dfs://" paths.

partitionType: The partition type. There are 5 types: range partition (RANGE), hash partition (HASH), value partition (VALUE), list partition (LIST), and composite partition (COMPO).

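partitionScheme: The partition scheme. Its form depends on the partition type:

  • RANGE: a vector in which each pair of adjacent elements defines one partition, for example 0 5 10.
  • HASH: a tuple holding the data type of the partition column and the number of partitions, for example [INT, 2].
  • VALUE: a vector in which each element defines one partition, for example 2000.01M..2016.12M.
  • LIST: a tuple of vectors in which each vector defines one partition, for example [`IBM`ORCL`MSFT, `GOOG`FB].
  • COMPO: a vector of database handles, one for each partition column, for example [dbDate, dbID].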

locations: Specifies the node on which each partition is located. The locations parameter cannot be used for a database on a distributed file system or for a composite partition (COMPO) database.


2.1 Range partition

A range partition is determined by the partition vector. Each pair of adjacent elements in the partition vector defines an interval that includes the starting value but excludes the ending value.
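
The half-open interval rule can be sketched in Python as a lookup over the partition vector. This is a conceptual illustration, not DolphinDB's implementation; the function name range_partition_index is hypothetical.

```python
from bisect import bisect_right

def range_partition_index(value, boundaries):
    """Return the index i of the interval [boundaries[i], boundaries[i+1])
    that contains value, or None if value lies outside every interval."""
    if value < boundaries[0] or value >= boundaries[-1]:
        return None
    # bisect_right counts the boundaries <= value, so subtracting 1
    # gives the index of the interval's starting boundary.
    return bisect_right(boundaries, value) - 1

boundaries = [0, 5, 10]  # mirrors the partition vector 0 5 10

for v in (0, 4, 5, 9, 10):
    print(v, range_partition_index(v, boundaries))
```

The value 5 lands in the second interval because each interval includes its starting value, and 10 belongs to no interval because the ending value is excluded.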

In the following example, the database db has two partitions: [0,5) and [5,10). The function createPartitionedTable creates the partitioned table pt in db with ID as the partition column, and append! writes the table t into it.

n=1000000
ID=rand(10, n)
x=rand(1.0, n)
t=table(ID, x)
db=database("dfs://rangedb", RANGE,  0 5 10)

pt = db.createPartitionedTable(t, `pt, `ID)
pt.append!(t);

pt=loadTable(db,`pt)
select count(x) from pt



2.2 Hash partition

A hash partition applies a hash function to the partition column to generate partitions. Hash partitioning is an easy way to generate a specified number of partitions. Note, however, that it cannot guarantee equally sized partitions, especially when the values of the partition column are skewed. In addition, when a query looks up a continuous range of values on the partition column, hash partitioning is less efficient than range or value partitioning.
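
Both caveats can be seen in a small Python sketch. It assumes a stand-in hash for integer keys (the remainder modulo the partition count), which is not necessarily the hash function DolphinDB uses; the name hash_partition_index and the sample data are hypothetical.

```python
from collections import Counter

def hash_partition_index(value, num_partitions):
    """Stand-in hash for integer keys: remainder modulo the partition count."""
    return value % num_partitions

# Evenly distributed IDs: each of 0..9 appears 100 times.
uniform_ids = list(range(10)) * 100
# Skewed IDs: 0 appears 910 times, each of 1..9 appears 10 times.
skewed_ids = [0] * 910 + list(range(1, 10)) * 10

uniform_sizes = dict(Counter(hash_partition_index(v, 2) for v in uniform_ids))
skewed_sizes = dict(Counter(hash_partition_index(v, 2) for v in skewed_ids))

# Partitions touched by a query over the continuous ID range 0..4.
touched = {hash_partition_index(v, 2) for v in range(0, 5)}

print(uniform_sizes, skewed_sizes, sorted(touched))
```

With evenly distributed IDs the two partitions are the same size, while the skewed IDs leave one partition far larger than the other. The continuous range 0..4 is spread over both hash partitions, whereas a range partition [0,5) would hold it in one.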

In the following example, the database db has two hash partitions. The function createPartitionedTable creates the partitioned table pt in db with ID as the partition column, and append! writes the table t into it.

n=1000000
ID=rand(10, n)
x=rand(1.0, n)
t=table(ID, x)
db=database("dfs://hashdb", HASH,  [INT, 2])

pt = db.createPartitionedTable(t, `pt, `ID)
pt.append!(t);

pt=loadTable(db,`pt)
select count(x) from pt


2.3 Value partition

A value partition uses a single value to represent a partition. The following example defines 204 partitions, one for each month from January 2000 to December 2016.

n=1000000
month=take(2000.01M..2016.12M, n)
x=rand(1.0, n)
t=table(month, x)

db=database("dfs://valuedb", VALUE, 2000.01M..2016.12M)

pt = db.createPartitionedTable(t, `pt, `month)
pt.append!(t)

pt=loadTable(db,`pt)
select count(x) from pt

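
The figure of 204 partitions can be checked with a little arithmetic, sketched here in Python. The helper months_inclusive is hypothetical, and the rows-per-partition estimate assumes that take cycles through the months in order when filling the 1,000,000 rows.

```python
def months_inclusive(start_year, start_month, end_year, end_month):
    """Number of calendar months from start to end, counting both ends."""
    return (end_year - start_year) * 12 + (end_month - start_month) + 1

num_partitions = months_inclusive(2000, 1, 2016, 12)

# Assumption: the 1,000,000 rows are dealt out to the months cyclically,
# so each month gets the quotient and the first `extra` months get one more.
rows_per_partition, extra = divmod(1_000_000, num_partitions)

print(num_partitions, rows_per_partition, extra)
```

Under that assumption every partition holds roughly the same number of rows, which is the even spread a value partition on month gives when the data covers each month equally.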


2.4 List partition

In a list partition, a list of several elements represents one partition. The following example has two partitions: the first contains 3 stock codes and the second contains 2.

n=1000000
ticker = rand(`MSFT`GOOG`FB`ORCL`IBM,n);
x=rand(1.0, n)
t = table (ticker, x)

db=database("dfs://listdb", LIST, [`IBM`ORCL`MSFT, `GOOG`FB])
pt = db.createPartitionedTable(t, `pt, `ticker)
pt.append!(t)

pt=loadTable(db,`pt)
select count(x) from pt

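
The mapping from a stock code to its list partition can be sketched in Python as a dictionary lookup. This is a conceptual model of the scheme above, not DolphinDB's implementation; the names partition_lists and list_partition_index are hypothetical.

```python
# Mirrors the partition scheme [`IBM`ORCL`MSFT, `GOOG`FB].
partition_lists = [["IBM", "ORCL", "MSFT"], ["GOOG", "FB"]]

# Invert the scheme: each ticker maps to the index of the list that names it.
ticker_to_partition = {
    ticker: index
    for index, tickers in enumerate(partition_lists)
    for ticker in tickers
}

def list_partition_index(ticker):
    """Return the partition index for ticker, or None if no list names it."""
    return ticker_to_partition.get(ticker)

for t in ("MSFT", "FB", "AAPL"):
    print(t, list_partition_index(t))
```

A ticker that appears in none of the lists, such as AAPL here, has no partition in this model.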


2.5 Composite partition

A composite (COMPO) partition can define 2 or 3 partition columns. Each column is partitioned independently by range (RANGE), value (VALUE), or list (LIST). The partition columns are logically parallel: none is subordinate to or takes priority over another.

n=1000000
ID=rand(100, n)
dates=2017.08.07..2017.08.11
date=rand(dates, n)
x=rand(10.0, n)
t=table(ID, date, x)

dbDate = database(, VALUE, 2017.08.07..2017.08.11)
dbID=database(, RANGE, 0 50 100)
db = database("dfs://compoDB", COMPO, [dbDate, dbID])
pt = db.createPartitionedTable(t, `pt, `date`ID)
pt.append!(t)

pt=loadTable(db,`pt)
select count(x) from pt

The example above creates 5 value partitions, one for each date.


Within the 20170807 partition, there are 2 range partitions.

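
The way the two partition columns combine can be sketched in Python: 5 date values times 2 ID ranges gives 10 partitions in total. This is a conceptual model, not DolphinDB's implementation; the function name compo_partition is hypothetical.

```python
from bisect import bisect_right
from itertools import product

# Value partitions on date and range partitions on ID, as in the example.
dates = ["2017.08.07", "2017.08.08", "2017.08.09", "2017.08.10", "2017.08.11"]
id_boundaries = [0, 50, 100]
id_ranges = list(zip(id_boundaries[:-1], id_boundaries[1:]))  # [(0, 50), (50, 100)]

# Every (date, ID range) pair is one partition.
all_partitions = list(product(dates, id_ranges))

def compo_partition(date, id_value):
    """Return the (date, (low, high)) partition holding the row, or None
    if the date or the ID falls outside the scheme."""
    if date not in dates:
        return None
    if not (id_boundaries[0] <= id_value < id_boundaries[-1]):
        return None
    return date, id_ranges[bisect_right(id_boundaries, id_value) - 1]

print(len(all_partitions))
print(compo_partition("2017.08.07", 49))
print(compo_partition("2017.08.07", 50))
```

A query that filters on both date and ID needs to touch only one of these partitions, while a query that filters on only one column still narrows the search to a fraction of them.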


Database partitioning tutorial (2) will introduce the principles of database partitioning and special partitioning schemes.

Origin blog.51cto.com/15022783/2562523