DolphinDB Time Series Database: A General Computing Tutorial

DolphinDB not only stores data in a distributed manner, but also provides strong support for distributed computing. With the general distributed computing framework provided by the system, users can implement efficient distributed algorithms through scripts without worrying about the underlying implementation. This article explains the important concepts and related functions of the DolphinDB general computing framework in detail and provides a wide range of usage scenarios and examples.


1. Data Source

A data source is a basic concept in the general computing framework of the DolphinDB database. It is a special type of data object: a meta description of data. By executing a data source, users can obtain data entities such as tables, matrices, and vectors. In DolphinDB's distributed computing framework, lightweight data source objects, rather than large data entities, are transmitted to remote nodes for subsequent computation, which greatly reduces network traffic.

In DolphinDB, users typically use the sqlDS function to generate data sources from a SQL expression. This function does not query the table directly; instead it returns the meta code of one or more SQL sub-queries, that is, the data sources. Users can then use the Map-Reduce framework, passing in the data sources and a calculation function, to distribute tasks to the nodes where the data sources reside, complete the calculations in parallel, and summarize the results.

Several commonly used ways of obtaining data sources are described in detail in sections 3.1 through 3.4 of this article.


2. Map-Reduce framework

The Map-Reduce functions are the core of DolphinDB's general distributed computing framework.

2.1 mr function

The syntax of DolphinDB's Map-Reduce function mr is mr(ds, mapFunc, [reduceFunc], [finalFunc], [parallel=true]). It accepts a set of data sources and a mapFunc function as parameters, distributes the computing tasks to the nodes where the data sources reside, and processes the data in each data source through mapFunc. The optional parameter reduceFunc combines the return values of mapFunc cumulatively: it first combines the results of the first two map calls, then combines that result with the result of the third map call, and so on, thereby accumulating and summarizing the mapFunc results. If there are M map calls, the reduce function is called M-1 times. The optional parameter finalFunc further processes the return value of reduceFunc.
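As a quick illustration of the reduce step, here is a minimal hypothetical sketch (assuming the DFS table t created in the sampling example below): each map call returns the row count of its data source, and the reduce function + adds those counts pairwise to produce the total row count.

ds = sqlDS(<select * from t>)                        // one data source per partition
totalRows = mr(ds, def(t){ return t.rows() }, +)     // with M data sources, + is applied M-1 times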

The official documentation contains an example of distributed least squares linear regression implemented with mr. This article uses the following example to show how a single mr call can randomly sample one tenth of the data in each partition of a distributed table:

// create the database and the DFS table
db = database("dfs://sampleDB", VALUE, `a`b`c`d)
t = db.createPartitionedTable(table(100000:0, `sym`val, [SYMBOL,DOUBLE]), `tb, `sym)
n = 3000000
t.append!(table(rand(`a`b`c`d, n) as sym, rand(100.0, n) as val))

// define the map function
def sampleMap(t) {
    sampleRate = 0.1
    rowNum = t.rows()
    sampleIndex = (0..(rowNum - 1)).shuffle()[0:int(rowNum * sampleRate)]
    return t[sampleIndex]
}

ds = sqlDS(<select * from t>)              // create the data source
res = mr(ds, sampleMap, , unionAll)        // execute the computation

In the example above, the user-defined sampleMap function accepts a table (the data in a data source) as its parameter and randomly returns one tenth of its rows. The mr call in this example has no reduceFunc parameter, so the return values of the map calls are placed in a tuple and passed to finalFunc, that is, unionAll. unionAll merges the tables returned by the map calls into a sequentially partitioned distributed table.
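Because the result is itself a table, it can be queried directly; for example, a hypothetical sanity check on the number of sampled rows:

select count(*) from res    // expected to be roughly n/10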

2.2 imr function

The DolphinDB database provides imr, an iterative computing function based on the Map-Reduce model. Compared with mr, it supports iterative computation: each iteration uses the result of the previous iteration together with the input data set, so more complex algorithms can be implemented. Iterative computation requires initial values for the model parameters and a termination criterion. Its syntax is imr(ds, initValue, mapFunc, [reduceFunc], [finalFunc], terminateFunc, [carryover=false]), where initValue is the initial value for the first iteration. The mapFunc parameter is a function whose parameters are the data source entity and the output of the final function in the previous iteration; for the first iteration, it is the initial value given by the user. The reduceFunc parameter behaves as in the mr function. The finalFunc function accepts two parameters: the first is the output of the final function in the previous iteration (for the first iteration, the initial value given by the user), and the second is the output of the reduce function. The terminateFunc parameter determines whether the iteration terminates; it accepts two parameters, the output of the reduce function in the previous iteration and the output of the reduce function in the current iteration, and the iteration stops when it returns true. The carryover parameter indicates whether a map function call produces an object to be passed to the next map function call. If carryover is true, the map function has three parameters, the last of which is the carried object, and its output is a tuple whose last element is the carried object; in the first iteration the carried object is NULL.
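Before the larger example below, a minimal hypothetical sketch (not the official example) may help make these parameters concrete. It approximates the median of a numeric column by bisection: each map call counts how many values in its data source fall at or below the midpoint of the current search range, the reduce function + adds the per-source counts, the final function halves the search range accordingly, and the terminate function stops once the fraction of values at or below the midpoint is within tol of 0.5. The caller supplies the column name and an initial range [lo, hi] assumed to contain all values:

def medMap(t, lastFinal, colName) {
    midVal = (lastFinal[0] + lastFinal[1]) \ 2.0
    return [(t[colName] <= midVal).sum(), t.rows()]    // [count at or below the midpoint, total count]
}

def medFinal(lastFinal, reduceRes) {
    lo = lastFinal[0]
    hi = lastFinal[1]
    midVal = (lo + hi) \ 2.0
    if ((reduceRes[0] \ reduceRes[1]) < 0.5)
        return [midVal, hi]      // the median lies above the midpoint
    else
        return [lo, midVal]      // the median lies at or below the midpoint
}

def medTerm(prev, curr, tol) {
    return abs((curr[0] \ curr[1]) - 0.5) <= tol
}

def approxMedian(ds, colName, lo, hi, tol) {
    finalRange = imr(ds, [lo, hi], medMap{, , colName}, +, medFinal, medTerm{, , tol})
    return (finalRange[0] + finalRange[1]) \ 2.0
}

The result is only an approximation whose accuracy is controlled by tol.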

The official documentation also contains a full example of computing the median of distributed data with imr. This article provides a more complex example, logistic regression using Newton's method, to show the application of imr in machine learning algorithms.

def myLrMap(t, lastFinal, yColName, xColNames, intercept) {
    placeholder, placeholder, theta = lastFinal
    if (intercept)
        x = matrix(t[xColNames], take(1.0, t.rows()))
    else
        x = matrix(t[xColNames])
    xt = x.transpose()
    y = t[yColName]
    scores = dot(x, theta)
    p = 1.0 \ (1.0 + exp(-scores))
    err = y - p
    w = p * (1.0 - p)
    logLik = (y * log(p) + (1.0 - y) * log(1.0 - p)).flatten().sum()
    grad = xt.dot(err)                   // compute the gradient vector
    wx = each(mul{w}, x)
    hessian = xt.dot(wx)                 // compute the Hessian matrix
    return [logLik, grad, hessian]
}

def myLrFinal(lastFinal, reduceRes) {
    placeholder, placeholder, theta = lastFinal
    logLik, grad, hessian = reduceRes
    deltaTheta = solve(hessian, grad)    // deltaTheta equals hessian^-1 * grad, i.e. it solves the equation hessian * deltaTheta = grad
    return [logLik, grad, theta + deltaTheta]
}

def myLrTerm(prev, curr, tol) {
    placeholder, grad, placeholder = curr
    return grad.flatten().abs().max() <= tol
}

def myLr(ds, yColName, xColNames, intercept, initTheta, tol) {
    logLik, grad, theta = imr(ds, [0, 0, initTheta], myLrMap{, , yColName, xColNames, intercept}, +, myLrFinal, myLrTerm{, , tol})
    return theta
}

In the example above, the map function computes the gradient vector and the Hessian matrix under the current coefficients for the data in a data source. The reduce function adds the map results, which is equivalent to computing the gradient vector and Hessian matrix of the entire data set. The final function updates the coefficients using the aggregated gradient vector and Hessian matrix, completing one round of iteration. The terminate function's criterion is whether the absolute value of the largest component of the gradient vector in this iteration is no greater than the parameter tol.
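A hypothetical call of myLr might look like the following, assuming a table t with a binary label column y and feature columns x0 and x1 (the column names, initial coefficients, and tolerance here are illustrative):

ds = sqlDS(<select x0, x1, y from t>)
theta = myLr(ds, `y, `x0`x1, true, [0.0, 0.0, 0.0], 0.000001)    // 3 coefficients: x0, x1 and the intercept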

This example can be further optimized through a data source conversion operation to improve performance; see section 3.6 for details.

As a frequently used analysis tool, distributed logistic regression is provided as a built-in function in DolphinDB. The built-in version (logisticRegression) offers more functionality.


3. Data source related functions

DolphinDB provides the following commonly used methods to obtain data sources:

3.1 sqlDS function

The sqlDS function creates a list of data sources based on the input SQL meta code. If the data table in the SQL query has n partitions, sqlDS generates n data sources. If the SQL query does not contain any partitioned table, sqlDS returns a tuple containing only one data source.

sqlDS is an efficient way to convert a SQL expression into data sources. Users only need to provide the SQL expression, without paying attention to how the data is distributed, and can use the returned data sources to execute distributed algorithms. The example below shows how to use sqlDS to run olsEx, a distributed least squares regression, on the data in a DFS table.

// create the database and the DFS table
db = database("dfs://olsDB", VALUE, `a`b`c`d)
t = db.createPartitionedTable(table(100000:0, `sym`x`y, [SYMBOL,DOUBLE,DOUBLE]), `tb, `sym)
n = 3000000
t.append!(table(rand(`a`b`c`d, n) as sym, 1..n + norm(0.0, 1.0, n) as x, 1..n + norm(0.0, 1.0, n) as y))

ds = sqlDS(<select x, y from t>)    // create the data source
olsEx(ds, `y, `x)                   // execute the computation
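Since the table t above uses a VALUE partition over the four symbols, the returned data source list is expected to contain four elements, one per partition; a hypothetical check:

ds.size()    // one data source per partition, 4 in this example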


3.2 repartitionDS function

The data sources produced by sqlDS are automatically generated by the system according to how the data is partitioned. Sometimes users need to impose constraints on the data sources, for example to re-specify the data partitioning when fetching data in order to reduce the amount of computation, or to use only a subset of the partitions. The repartitionDS function repartitions data sources for these purposes.

The repartitionDS function generates new, repartitioned data sources from the input SQL meta code according to the specified column name, partition type, partition scheme, and so on.

The following code provides an example of repartitionDS. In this example, the DFS table t has the fields deviceId, time, and temperature, of types SYMBOL, DATETIME, and DOUBLE respectively. The database uses a two-level (composite) partition: the first level is a VALUE partition on time, one partition per day; the second level is a HASH partition on deviceId with 20 buckets.

Suppose we need to compute the 95th percentile of temperature grouped by deviceId. Directly writing the query select percentile(temperature,95) from t group by deviceId will not work, because the percentile function does not have a Map-Reduce implementation.

One solution is to load all the required fields locally and compute the 95th percentile there, but when the data volume is too large, computing resources may be insufficient. repartitionDS offers another solution: repartition the table on deviceId according to its original HASH partitioning scheme, so that each new partition corresponds to all the data of one HASH partition in the original table. Then compute the 95th percentile temperature within each new partition via the mr function and finally merge the results.

// create the database
deviceId = "device" + string(1..100000)
db1 = database("", VALUE, 2019.06.01..2019.06.30)
db2 = database("", HASH, INT:20)
db = database("dfs://repartitionExample", COMPO, [db1, db2])

// create the DFS table
t = db.createPartitionedTable(table(100000:0, `deviceId`time`temperature, [SYMBOL,DATETIME,DOUBLE]), `tb, `deviceId`time)
n = 3000000
t.append!(table(rand(deviceId, n) as deviceId, 2019.06.01T00:00:00 + rand(86400 * 10, n) as time, 60 + norm(0.0, 5.0, n) as temperature))

// repartition
ds = repartitionDS(<select deviceId, temperature from t>, `deviceId)
// execute the computation
res = mr(ds, def(t) { return select percentile(temperature, 95) from t group by deviceId}, , unionAll{, false})

The correctness of this result is guaranteed: because the new partitions generated by repartitionDS are based on the original partitioning of deviceId, the deviceId values in different data sources do not overlap, so combining the per-partition results yields the correct answer.


3.3 textChunkDS function

The textChunkDS function can divide a text file into several data sources so that distributed computations can be performed on the data represented by the file. Its syntax is textChunkDS(filename, chunkSize, [delimiter=','], [schema]), where the filename, delimiter, and schema parameters are the same as those of the loadText function. The chunkSize parameter indicates the amount of data in each data source in MB and can be an integer from 1 to 2047.

The following example is another implementation of the olsEx example from the official documentation. It generates several data sources from a text file through the textChunkDS function, each 100MB in size, converts the generated data sources, and then executes the olsEx function to compute the least squares parameters:

ds = textChunkDS("c:/DolphinDB/Data/USPrices.csv", 100)
ds.transDS!(USPrices -> select VOL\SHROUT as VS, abs(RET) as ABS_RET, RET, log(SHROUT*(BID+ASK)\2) as SBA from USPrices where VOL>0)
rs=olsEx(ds, `VS, `ABS_RET`SBA, true, 2)

For the data source conversion operation transDS!, please refer to section 3.6.


3.4 Data source interface provided by third-party data sources

Some plug-ins that load third-party data, such as the HDF5 plug-in, provide interfaces for generating data sources. Users can execute distributed algorithms directly on the data sources they return, without first importing the third-party data into memory or saving it as a disk or distributed table.

DolphinDB's HDF5 plug-in provides the hdf5DS function, whose dsNum parameter specifies the number of data sources to generate. The following example generates 10 data sources from an HDF5 file and computes the sample variance of the first column of the result through the Map-Reduce framework:

ds = hdf5::hdf5DS("large_file.h5", "large_table", , 10)

def varMap(t) {
    column = t.col(0)
    return [column.sum(), column.sum2(), column.count()]
}

def varFinal(result) {
    sum, sum2, count = result
    mu = sum \ count
    populationVar = sum2 \ count - mu * mu
    sampleVar = populationVar * count \ (count - 1)
    return sampleVar
}

sampleVar = mr(ds, varMap, +, varFinal)


3.5 Data source cache

A data source can have zero, one, or multiple locations. A data source with zero locations is a local data source. When there are multiple locations, the locations are backups of one another, and the system randomly selects one location to perform the distributed computation. When the data source is instructed to cache its data object, the system selects the location where the data was successfully retrieved last time.

Users can instruct the system to cache a data source or to clear its cache. For iterative computing algorithms (such as machine learning algorithms), caching the data can greatly improve computing performance. When system memory is insufficient, cached data is cleared; if this happens, the system can recover the data, because the data source contains all the meta descriptions and data conversion functions.

The functions related to data source caching are listed below; a brief usage sketch follows the list:

  • cacheDS!: Instructs the system to cache the data source
  • clearCacheDS!: Instructs the system to clear the cache after the next execution of the data source
  • cacheDSNow: Executes and caches the data source immediately and returns the total number of cached rows
  • clearCacheDSNow: Clears the data source and its cache immediately
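The following is a minimal hypothetical sketch of how these functions might be used around an iterative computation. It assumes a table t with columns x0, x1, x2, x3, y (illustrative names), and that, like transDS! in the examples of section 3.6, the caching functions can be applied to the tuple of data sources returned by sqlDS:

ds = sqlDS(<select x0, x1, x2, x3, y from t>)
ds.cacheDS!()               // ask the system to cache the retrieved data
// ... run an iterative algorithm such as imr or randomForestRegressor on ds ...
ds.clearCacheDSNow()        // release the cached data once it is no longer needed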


3.6 Data source conversion

A data source object can also contain multiple data conversion functions to further process the retrieved data. The system executes these conversion functions in turn, with the output of one function serving as the input (and the only input) of the next.

Including the data conversion functions in the data source is usually more efficient than performing the conversion inside the core computing operation (that is, the map function). If the retrieved data only needs to be computed once there is no performance difference, but for iterative computation on data sources with cached data objects the difference is huge: if the conversion is in the core computing operation, it is performed in every iteration, whereas if it is in the data source, it is performed only once. The transDS! function adds conversion functions to data sources.

For example, before executing the iterative machine learning function randomForestRegressor, a user may need to fill in missing values in the data manually (DolphinDB's random forest algorithm has built-in missing value handling). In this case, transDS! can be used to process the data source as follows: for each feature column, fill in missing values with the column's average value. Assuming that columns x0, x1, x2, x3 of the table are the independent variables and column y is the dependent variable, this can be implemented as follows:

ds = sqlDS(<select x0, x1, x2, x3, y from t>)
ds.transDS!(def (mutable t) {
    update t set x0 = nullFill(x0, avg(x0)), x1 = nullFill(x1, avg(x1)), x2 = nullFill(x2, avg(x2)), x3 = nullFill(x3, avg(x3))
    return t
})

randomForestRegressor(ds, `y, `x0`x1`x2`x3)

Another example of data source conversion is the script implementation of logistic regression described in section 2.2. In that implementation, each map function call includes fetching the relevant columns from the data source's table and converting them into matrices, which means these operations occur in every iteration. In fact, every iteration uses the same input matrices, so the conversion only needs to be done once. Therefore, transDS! can be used to convert the data source into a triple containing the x, xt and y matrices:

def myLrTrans(t, yColName, xColNames, intercept) {
    if (intercept)
        x = matrix(t[xColNames], take(1.0, t.rows()))
    else
        x = matrix(t[xColNames])
    xt = x.transpose()
    y = t[yColName]
    return [x, xt, y]
}

def myLrMap(input, lastFinal) {
    x, xt, y = input
    placeholder, placeholder, theta = lastFinal
    // the rest of the computation is the same as in section 2.2
}

// the myLrFinal and myLrTerm functions are the same as in section 2.2

def myLr(mutable ds, yColName, xColNames, intercept, initTheta, tol) {
    ds.transDS!(myLrTrans{, yColName, xColNames, intercept})
    logLik, grad, theta = imr(ds, [0, 0, initTheta], myLrMap, +, myLrFinal, myLrTerm{, , tol})
    return theta
}
