Mixed-Paradigm Programming with the DolphinDB Scripting Language for Time-Series Databases

Developing big data applications requires not only a distributed database that can hold massive amounts of data and a distributed computing framework that can efficiently use multiple cores and nodes, but also a programming language that integrates organically with the distributed database and the distributed computing framework, delivers high performance, extends easily, and is expressive enough for rapid development and modeling. DolphinDB drew inspiration from the popular SQL and Python languages and designed a scripting language for big data processing. This tutorial explains how to quickly develop big data analysis applications through mixed-paradigm programming. It also shows how DolphinDB's programming language (hereinafter referred to as DolphinDB) integrates with the database and with distributed computing.


1. Vectorized Programming

Vectorized programming is the most basic programming paradigm in DolphinDB. Most DolphinDB functions accept vectors as input parameters. Based on the return value, these functions fall into two types: aggregate functions, which return a scalar, and vector functions, which return a vector of the same length as the input.
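
For example, avg is an aggregate function and cumsum is a vector function:

x = 1 2 3 4
avg(x)       // aggregate function: returns the scalar 2.5
cumsum(x)    // vector function: returns a vector of the same length, [1,3,6,10]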

Vectorization operations have three main advantages:

  • Concise code
  • Significantly reduced interpretation overhead of the scripting language
  • Many algorithms can be optimized

Time-series data can usually be represented by a vector, and each column of a columnar database used for data analysis can also be represented by a vector. Whether used as an in-memory computing engine or as an analytical data warehouse, DolphinDB is especially well suited to vectorized programming for time-series analysis.

Take the addition of two vectors of length 10 million as a simple example. The imperative for loop is not only verbose but also more than a hundred times slower than the vectorized version.

n = 10000000
a = rand(1.0, n)
b = rand(1.0, n)

// Using a for loop:
c = array(DOUBLE, n)
for(i in 0 : n)
    c[i] = a[i] + b[i]
    
// Using vectorized programming:
c = a + b

Vectorized programming is essentially batch processing of homogeneous data. Not only can instructions be vectorized at the compilation stage, but many algorithms can also be optimized. Take the moving average, one of the most frequently used sliding-window indicators for time-series data, as an example. If the total data volume is n and the window size is k, computing each window from scratch has time complexity O(nk). But once the moving average of one window has been computed, only one data point changes in the next window, so adjusting for that single point is enough to obtain the new window's moving average, which brings the complexity of batch calculation down to O(n). In DolphinDB, most sliding-window functions have been optimized in this way, with performance close to O(n). These functions include mmax, mmin, mimax, mimin, mavg, msum, mcount, mstd, mvar, mrank, mcorr, mcovar, mbeta and mmed. In the following example, the optimized mavg function is more than 300 times faster than applying the avg function to each window.

n = 10000000
a = rand(1.0, n)
window = 60

// Apply avg to each window separately:
timer moving(avg, a, window);

Time elapsed: 4039.23 ms

// Batch calculation with the mavg function:
timer mavg(a, window);

Time elapsed: 12.968 ms
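
For intuition, the incremental idea behind these optimizations can be sketched with a plain loop. This is a simplified illustration only, not how the built-in mavg is implemented: keep a running sum, add the point entering the window and subtract the point leaving it.

def mavgSketch(a, window){
    n = a.size()
    result = array(DOUBLE, n)
    s = 0.0
    for(i in 0:n){
        s += a[i]
        // subtract the point that just left the window
        if(i >= window) s -= a[i - window]
        // positions before the first full window are not meaningful (mavg returns NULL for them)
        if(i >= window - 1) result[i] = s / window
    }
    return result
}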

Vectorized programming also has its limitations. First, not every operation can be vectorized. In machine learning and statistical analysis, some scenarios can only process data row by row and cannot be expressed as vector calculations. For these cases, DolphinDB plans to introduce just-in-time (JIT) compilation in subsequent versions, which will compile row-by-row code written with for statements into machine code at runtime and significantly improve performance.

Second, vectorized calculation usually loads an entire vector into contiguous memory; Matlab and R have the same requirement. Sometimes, because of memory fragmentation, a large enough contiguous block cannot be found. To cope with fragmentation, DolphinDB introduces the big array, which combines physically discontinuous memory blocks into a logically continuous vector. Whether the system uses a big array is decided dynamically and is transparent to the user. Typically, scanning a big array loses 1% to 5% in performance compared with contiguous memory, and random access loses about 20% to 30%. Here DolphinDB trades an acceptable small performance loss for higher system availability.


2. SQL programming

SQL is a problem-oriented language: the user describes the problem, and the SQL engine produces the result. Usually the SQL engine is part of the database, and other systems communicate with it through JDBC, ODBC or a native API. The SQL statements of the DolphinDB scripting language not only support standard SQL but also add many extensions for big data analysis, especially time-series analysis, which greatly simplify code and are easy to use.


2.1 Integration of SQL and programming languages

In DolphinDB, the scripting language and SQL are seamlessly integrated. This integration is mainly reflected in the following aspects:

  • A SQL statement is an expression and a subset of the DolphinDB language. It can be assigned directly to a variable or passed as a function parameter (see the short example after this list).
  • Variables and functions created in the current context can be used in SQL statements. If the SQL statement involves distributed tables, these variables and functions are automatically serialized to the corresponding nodes.
  • A SQL statement is no longer just a string; it is code that can be generated dynamically.
  • SQL statements can operate not only on tables but also on other data structures such as scalars, vectors, matrices, sets and dictionaries. Tables can be converted to and from other data structures.
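
A minimal illustration of the first point (the small table here is made up for the example):

t = table(1 2 3 as id, 10.5 20.3 30.1 as x)
avgX = select avg(x) from t    // the result of the SQL expression is assigned to a variable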

Please note that the DolphinDB programming language is case sensitive. All SQL keywords in DolphinDB must be lowercase.

In the following example, first generate an employee salary table:

empWages = table(take(1..10, 100) as id, take(2017.10M + 1..10, 100).sort() as month, take(5000 5500 6000 6500, 100) as wage); 

Then calculate the average salary of a given group of employees. The list of employees is stored in a local variable empIds.

empIds = 3 4 6 7 9
select avg(wage) from empWages where id in empIds group by id;
id avg_wage
-- --------
3  5500
4  6000
6  6000
7  5500
9  5500

In addition to the average wage, we also display the name of each employee. The names are looked up from the dictionary empNames.

empNames = dict(1..10, `Alice`Bob`Jerry`Jessica`Mike`Tim`Henry`Anna`Kevin`Jones)
select empNames[first(id)] as name, avg(wage) from empWages where id in empIds group by id;
id name    avg_wage
-- ------- --------
3  Jerry   5500
4  Jessica 6000
6  Tim     6000
7  Henry   5500
9  Kevin   5500

In the above two examples, the where clause and the select clause of the SQL statement use an array and a dictionary defined in the context, so a problem that would otherwise require a subquery and a multi-table join is solved with a simple hash table lookup. When the SQL involves a distributed database, these context variables are automatically serialized to the corresponding nodes. This not only makes the code more concise and readable, but also improves performance.

The table returned by a SQL select statement can be assigned directly to a local variable for further processing and analysis. DolphinDB also introduces the exec keyword. Compared with select, an exec statement can return a matrix, a vector or a scalar, which is more convenient for data analysis. In the following example, exec is used together with pivot by to return a matrix directly.

exec first(wage) from empWages pivot by month, id;

         1    2    3    4    5    6    7    8    9    10
         ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
2017.11M|5000 5500 6000 6500 5000 5500 6000 6500 5000 5500
2017.12M|6000 6500 5000 5500 6000 6500 5000 5500 6000 6500
2018.01M|5000 5500 6000 6500 5000 5500 6000 6500 5000 5500
2018.02M|6000 6500 5000 5500 6000 6500 5000 5500 6000 6500
2018.03M|5000 5500 6000 6500 5000 5500 6000 6500 5000 5500
2018.04M|6000 6500 5000 5500 6000 6500 5000 5500 6000 6500
2018.05M|5000 5500 6000 6500 5000 5500 6000 6500 5000 5500
2018.06M|6000 6500 5000 5500 6000 6500 5000 5500 6000 6500
2018.07M|5000 5500 6000 6500 5000 5500 6000 6500 5000 5500
2018.08M|6000 6500 5000 5500 6000 6500 5000 5500 6000 6500
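
Without pivot by, exec returns a vector when selecting a single column and a scalar when selecting an aggregate (shown here on the same empWages table):

exec wage from empWages where id = 3    // a vector of wages for employee 3
exec avg(wage) from empWages            // a scalar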


2.2 Friendly support for Panel Data

SQL's group by clause divides the data into groups, and each group produces one value, i.e. one row. Therefore, a group by clause generally reduces the number of rows significantly.

After panel data is grouped, each group is usually itself a time series: for example, when grouping by stock, the data within each group is the price series of one stock. When processing panel data, we sometimes want to keep the number of rows in each group, i.e. generate one value for every row within the group, such as a return series derived from a price series, or a moving-average series derived from a price series. Other database systems (such as SQL Server and PostgreSQL) solve this with window functions. DolphinDB instead introduces the context by clause to handle panel data. Compared with window functions, context by has more concise syntax and a more systematic design (together with group by and pivot by it forms three clauses for grouped data processing), and it is also more expressive, as shown in the following aspects:

  • It can be used not only in queries with select, but also with update to update data.
  • Most database systems only allow existing fields of the table to be used for grouping in window functions; the context by clause can use any existing field as well as computed fields.
  • Window functions are limited to a small set of functions. context by places no restriction on the functions used and also accepts arbitrary expressions, such as combinations of multiple functions.
  • context by can be used together with the having clause to filter rows within each group.

Assuming that the trades table records the end-of-day price of each stock for each day, we can use context by to conveniently calculate each stock's daily return and each day's return ranking. First, group by stock code and calculate the daily return of each stock; here we assume the data is already arranged in chronological order.

update trades set ret = ratios(price) - 1.0 context by sym;

Group by date and calculate each stock's daily return ranking in descending order:

select date, sym, ret, rank(ret, false) + 1 as rank from trades where isValid(ret) context by date;

Select the 10 stocks with the highest return each day:

select date, sym, ret from trades where isValid(ret) context by date having rank(ret, false) < 10;

Below we use a more complex, practical example to demonstrate how the context by clause efficiently solves panel data problems. The paper 101 Formulaic Alphas introduced 101 quantitative alpha factors used by WorldQuant, a top Wall Street quantitative hedge fund. A fund company implemented these factors in C#. The representative factor 98 not only nests multiple indicators over the vertical (time-series) dimension, but also uses ranking information over the horizontal (cross-sectional) dimension, requiring several hundred lines of code. Computing alpha factor 98 over roughly 9 million rows of 10 years of historical data for more than 3,000 Chinese stocks took about 30 minutes. Using DolphinDB instead, as shown below, only 4 lines of core code and about 2 seconds were needed, a performance improvement of nearly three orders of magnitude.

def alpha98(stock){
	t = select code, valueDate, adv5, adv15, open, vwap from stock order by valueDate
	// cross-sectional ranks within each trading day
	update t set rank_open = rank(open), rank_adv15 = rank(adv15) context by valueDate
	// time-series (moving-window) indicators within each stock
	update t set decay7 = mavg(mcorr(vwap, msum(adv5, 26), 5), 1..7), decay8 = mavg(mrank(9 - mimin(mcorr(rank_open, rank_adv15, 21), 9), true, 7), 1..8) context by code
	return select code, valueDate, rank(decay7)-rank(decay8) as A98 from t context by valueDate 
}

2.3 Friendly support for time series data

DolphinDB's database stores data in columns and uses vectorized programming for calculations, which is naturally friendly to time-series data.

  • DolphinDB supports time types of multiple precisions. High-frequency data can easily be downsampled with SQL statements to lower frequencies such as seconds, minutes and hours, and the bar function together with the group by clause can aggregate data into bars of any time interval (see the sketch after this list).
  • DolphinDB supports modeling the sequential relationships in time-series data, including lead, lag, sliding windows and cumulative windows. More importantly, the indicators and functions commonly used in this kind of modeling have been optimized, with performance 1 to 2 orders of magnitude better than other systems.
  • DolphinDB provides efficient and commonly used table joins specially designed for time series: asof join and window join.
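
A minimal sketch of the first point, assuming a trades table with columns sym, time (of SECOND type) and price; bar(time, 300) groups the data into 5-minute bars:

// open-high-low-close bars per stock at 5-minute intervals
select first(price) as open, max(price) as high, min(price) as low, last(price) as close from trades group by sym, bar(time, 300)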

We use a simple example to explain window join. Suppose we need to calculate, for a group of people, the average wage over the previous three months as of a given point in time. This can be done simply with window join (wj). Please refer to the user manual for a detailed explanation of the window join function.

p = table(1 2 3 as id, 2018.06M 2018.07M 2018.07M as month)
s = table(1 2 1 2 1 2 as id, 2018.04M + 0 0 1 1 2 2 as month, 4500 5000 6000 5000 6000 4500 as wage)
select * from wj(p, s, -3:-1,<avg(wage)>,`id`month)

id month    avg_wage
-- -------- -----------
1  2018.06M 5250
2  2018.07M 4833.333333
3  2018.07M

In other database systems, the above problem can be solved with an equi join (on the id field), a non-equi join (on the month field) and a group by clause. However, besides being more complicated to write, such solutions are more than two orders of magnitude slower than DolphinDB's window join.

Window join has a wide range of applications in financial analysis. A classic one is joining the trades table with the quotes table to calculate transaction costs.

The following is the trades table, which may be unpartitioned or partitioned by date and stock code:

sym  date       time         price  qty
---- ---------- ------------ ------ ---
IBM  2018.06.01 10:01:01.005 143.19 100
MSFT 2018.06.01 10:01:04.006 107.94 200

The following is the quotes table, which may be unpartitioned or partitioned by date and stock code:

sym  date       time         bid    ask    bidSize askSize
---- ---------- ------------ ------ ------ ------- -------
IBM  2018.06.01 10:01:01.006 143.18 143.21 400     200
MSFT 2018.06.01 10:01:04.010 107.92 107.97 800     100

Use asof join to find the most recent quote for each trade, and use the mid price of that quote as the benchmark for transaction cost:

dateRange = 2018.05.01 : 2018.08.01
select sum(abs(price - (bid+ask)/2.0)*qty)/sum(price*qty) as cost from aj(trades, quotes, `date`sym`time) where date between dateRange group by sym;

Use window join to find the quotes in the 10 milliseconds before each trade, and use the average mid price as the benchmark for transaction cost:

select sum(abs(price - mid)*qty)/sum(price*qty) as cost from pwj(trades, quotes, -10:0, <avg((bid + ask)/2.0) as mid>,`date`sym`time) where date between dateRange group by sym;


2.4 Other extensions of SQL

To meet the needs of big data analysis, DolphinDB has made many other extensions to SQL. Here we illustrate a few commonly used ones.

  • User-defined functions can be used in SQL on the local node or in a distributed environment without compilation, packaging or deployment.
  • As shown in section 5.4, SQL in DolphinDB is tightly integrated with the distributed computing framework, making in-database analytics more convenient and efficient.
  • DolphinDB supports composite columns, which can output the multiple return values of a complex analysis function into one row of a table.

To use composite columns in SQL statements, the output of the function must be simple key-value pairs or an array. If it is neither, a user-defined function can be used to convert the result. Please refer to the user manual for detailed usage of composite columns.

factor1=3.2 1.2 5.9 6.9 11.1 9.6 1.4 7.3 2.0 0.1 6.1 2.9 6.3 8.4 5.6
factor2=1.7 1.3 4.2 6.8 9.2 1.3 1.4 7.8 7.9 9.9 9.3 4.6 7.8 2.4 8.7
t=table(take(1 2 3, 15).sort() as id, 1..15 as y, factor1, factor2);

Run an ols regression y = alpha + beta1*factor1 + beta2*factor2 for each id, and output the parameters alpha, beta1 and beta2.

select ols(y, [factor1,factor2], true, 0) as `alpha`beta1`beta2 from t group by id;

id alpha     beta1     beta2
-- --------- --------- ---------
1  1.063991  -0.258685 0.732795
2  6.886877  -0.148325 0.303584
3  11.833867 0.272352  -0.065526

To output R2 along with the parameters, wrap the result with a user-defined function.

def myols(y,x){
    r=ols(y,x,true,2)
    return r.Coefficient.beta join r.RegressionStat.statistics[0]
}
select myols(y,[factor1,factor2]) as `alpha`beta1`beta2`R2 from t group by id;

id alpha     beta1     beta2     R2
-- --------- --------- --------- --------
1  1.063991  -0.258685 0.732795  0.946056
2  6.886877  -0.148325 0.303584  0.992413
3  11.833867 0.272352  -0.065526 0.144837


3. Imperative Programming

Like mainstream scripting languages (Python, JavaScript, etc.) and compiled, strongly typed languages (C++, C, Java), DolphinDB supports imperative programming, that is, telling the computer step by step what to do. DolphinDB currently supports 18 kinds of statements (see Chapter 5 of the user manual for details), including the most commonly used assignment statements, the branch statement if..else, and the loop statements for and do..while.

DolphinDB supports assignment to a single variable and to multiple variables.

x = 1 2 3
y = 4 5
y += 2
x, y = y, x //swap the value of x and y
x, y =1 2 3, 4 5

DolphinDB currently supports the loop statements for and do..while. The for statement can iterate over a pair (a left-closed, right-open interval), a vector, a matrix or a table.

Sum of 1 to 100:

s = 0
for(x in 1:101) s += x
print s

Sum the elements in the array:

s = 0;
for(x in 1 3 5 9 15) s += x
print s

Print the mean value of each column of the matrix:

m = matrix(1 2 3, 4 5 6, 7 8 9)
for(c in m) print c.avg()

For each row of the table, calculate the product of the price and qty columns:

t= table(["TV set", "Phone", "PC"] as productId, 1200 600 800 as price, 10 20 7 as qty)
for(row in t) print row.productId + ": " + row.price * row.qty

DolphinDB's branch statement if..else is consistent with other languages.

if(condition){
    <true statements>
}
else{
     <false statements>
}

When dealing with massive data, it is not recommended to use control statements (for, if..else) to process the data row by row. These control statements are better suited to the processing and scheduling of upper-level modules; for the lower-level data processing modules, vectorized programming, functional programming and SQL programming are recommended instead.

4. Functional Programming

DolphinDB supports most features of functional programming, including:

  • Pure functions
  • User-defined functions (UDFs)
  • Lambda functions
  • Higher-order functions
  • Partial application
  • Closures

For details, please refer to Chapter 7 of the user manual.


4.1 Custom functions and lambda functions

Custom functions can be created in DolphinDB, with or without a name (the latter are usually lambda functions). The functions created are pure functions: only the input parameters can affect the output. Unlike Python, a function body can only reference the function's parameters and local variables defined inside the function, and cannot use variables defined outside the function body. From a software engineering perspective this sacrifices some syntactic-sugar flexibility, but it greatly benefits software quality.

def getWeekDays(dates){
    return dates[def(x):weekday(x) between 1:5]
}

getWeekDays(2018.07.01 2018.08.01 2018.09.01 2018.10.01)

[2018.08.01, 2018.10.01] 

In the example above, we defined a function getWeekDays, which accepts a set of dates and returns those that fall between Monday and Friday. The implementation uses vector filtering, which accepts a Boolean unary function as the filter; here we define a lambda function for that purpose.


4.2 Higher-order functions

Higher-order functions are functions that accept another function as a parameter. In DolphinDB, higher-order functions are mainly used as template functions for data processing, and the first parameter is usually another function that performs the concrete processing. For example, suppose object A has m elements and object B has n elements; a common processing pattern is to compute pairwise over every element of A and every element of B, producing an m*n matrix. DolphinDB abstracts this pattern into the higher-order function cross. DolphinDB provides many similar template functions, including all, any, each, loop, eachLeft, eachRight, eachPre, eachPost, accumulate, reduce, groupby, contextby, pivot, cross, moving, rolling, etc.
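
As a minimal illustration of the pattern that cross abstracts:

cross(+, 1 2 3, 10 20)    // applies + to every pair of elements from the two vectors and returns the results as a matrix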

In the following example, we use three higher-order functions and just three lines of code to calculate the pairwise correlation between stocks from tick-level trading data.

The simulation generates 10,000,000 data points (stock code, trading time and price):

n=10000000
syms = rand(`FB`GOOG`MSFT`AMZN`IBM, n)
time = 09:30:00.000 + rand(21600000, n)
price = 500.0 + rand(500.0, n)

Use the pivot function to generate a stock price matrix, where each column is a stock and each row is one minute:

priceMatrix = pivot(avg, price, time.minute(), syms)

Use each together with the ratios function on every column of the price matrix to convert prices into returns:

retMatrix = each(ratios, priceMatrix) - 1

Use cross together with the corr function to calculate the correlation between the returns of every pair of stocks:

corrMatrix = cross(corr, retMatrix, retMatrix)

The result is:

     AMZN      FB        GOOG      IBM       MSFT
     --------- --------- --------- --------- ---------
AMZN|1         0.015181  -0.056245 0.005822  0.084104
FB  |0.015181  1         -0.028113 0.034159  -0.117279
GOOG|-0.056245 -0.028113 1         -0.039278 -0.025165
IBM |0.005822  0.034159  -0.039278 1         -0.049922
MSFT|0.084104  -0.117279 -0.025165 -0.049922 1


4.3 Partial Application

Partial application means generating a new function by fixing some or all of the parameters of an existing function. In DolphinDB, function calls use parentheses () and partial application uses curly braces {}. The ratios function used in the example in section 4.2 is implemented as the partial application eachPre{ratio} of the higher-order function eachPre.

The following two lines of code are equivalent:

retMatrix = each(ratios, priceMatrix) - 1
retMatrix = each(eachPre{ratio}, priceMatrix) - 1

Partial application is often used with higher-order functions. Higher-order functions usually have specific requirements on certain parameters, and partial application can ensure that all parameters meet those requirements. For example, to calculate the correlation between a vector a and each column of a matrix m, the function corr can be used together with the higher-order function each. But if the vector and the matrix are passed to each directly as corr's parameters, the system will try to correlate a single element of the vector with a column of the matrix, causing an error. Instead, partial application can bind corr and the vector a into a new function corr{a}, which each then applies to every column of the matrix, as shown below. A for statement could also solve this problem, but the code would be longer and slower.

a = 12 14 18
m = matrix(5 6 7, 1 3 2, 8 7 11)

Use each and partial application to calculate the correlation between vector a and each column in the matrix:

each(corr{a}, m)

Use a for statement to solve the same problem:

cols = m.columns()
c = array(DOUBLE, cols)
for(i in 0:cols)
    c[i] = corr(a, m[i])

Another handy use of partial application is keeping a function in state. Usually we want functions to be stateless, i.e. the output is completely determined by the input parameters. But sometimes a stateful function is desirable. For example, in stream computing the user usually supplies a message handler that receives a new message and returns a result. If we want the handler to return the running average of all the data received so far, partial application solves it.

def cumavg(mutable stat, newNum){
    stat[0] = (stat[0] * stat[1] + newNum)/(stat[1] + 1)
    stat[1] += 1
    return stat[0]
}

msgHandler = cumavg{0.0 0.0}
each(msgHandler, 1 2 3 4 5)

[1,1.5,2,2.5,3]


5. Remote procedure call programming

Remote procedure call (RPC) is one of the most commonly used pieces of infrastructure in distributed systems. DolphinDB's distributed file system, distributed database and distributed computing framework all use DolphinDB's own RPC system. Through RPC, DolphinDB's scripting language can execute code on remote machines. DolphinDB's RPC has the following characteristics:

  • Not only can functions registered on the remote machine be executed, but locally defined functions can also be serialized to the remote node for execution. Code run on the remote machine has the same privileges as the user currently logged in locally.
  • Function parameters can be regular scalars, vectors, matrices, sets, dictionaries and tables, as well as functions, including user-defined functions.
  • Either an exclusive connection between two nodes or a shared connection between cluster data nodes can be used.


5.1 Use remoteRun to execute remote functions

DolphinDB uses xdb to create a connection to a remote node. The remote node can be any node running DolphinDB and does not have to be part of the current cluster. After the connection is created, the functions registered on the remote node or locally customized functions can be executed on the remote node.

h = xdb("localhost", 8081);

Execute a script on the remote node:

remoteRun(h, "sum(1 3 5 7)");
16

The above remote call can also be abbreviated as:

h("sum(1 3 5 7)");
16

Execute a function registered on the remote node:

h("sum", 1 3 5 7);
16

Execute a locally defined function on the remote node:

def mysum(x) : reduce(+, x)
h(mysum, 1 3 5 7);
16

Create a shared table sales on the remote node (localhost:8081):

h("share table(2018.07.02 2018.07.02 2018.07.03 as date, 1 2 3 as qty, 10 15 7 as price) as sales");

If the local custom function depends on other custom functions, the dependencies are automatically serialized to the remote node:

defg salesSum(tableName, d): select mysum(price*qty) from objByName(tableName) where date=d
h(salesSum, "sales", 2018.07.02);
40


5.2 Use rpc to execute remote functions

The other way DolphinDB makes remote procedure calls is the rpc function. It takes the name of the remote node, the definition of the function to execute, and the required parameters. rpc can only be used between the controller and data nodes of the same cluster, but it does not create a new connection; it reuses existing network connections. This saves network resources and avoids the latency of establishing new connections, which matters when a node serves many users. The rpc function can execute only one function on the remote node; to run a script, wrap it in a custom function. The following examples must be run in a DolphinDB cluster, where nodeB is the alias of a remote node on which the shared table sales already exists.

rpc("nodeB", salesSum, "sales",2018.07.02);
40

When using rpc, to improve readability, it is recommended to use partial application to combine the function and its parameters into a new function that takes no parameters.

rpc("nodeB", salesSum{"sales", 2018.07.02});
40

master is the alias of the controller node. DolphinDB users can only be created on the controller:

rpc("master", createUser{"jerry", "123456"});

The parameter required by the rpc function can also be another function, including built-in functions and custom functions:

rpc("nodeB", reduce{+, 1 2 3 4 5});
15


5.3 Use other functions to execute remote functions indirectly

Both remoteRun and rpc can execute locally defined functions on a remote node. This is the biggest difference between DolphinDB's RPC subsystem and other RPC systems, in which the RPC client can usually only call registered functions that the remote node has already exposed. In big data analysis, data scientists often raise new interface requirements as research projects evolve; if they have to wait for the IT department to release a new API, development efficiency and cycle time suffer badly. To execute a custom function on a remote node, the function must currently be written in DolphinDB script. Executing custom functions remotely also places higher demands on data security, so user access rights must be planned and configured carefully. If users are restricted to registered functions only, access management becomes very simple: deny external users access to all data and only authorize them to call registered view functions.

In addition to the remoteRun and rpc functions themselves, DolphinDB provides many functions that use remote procedure calls indirectly. For example, olsEx uses rpc to run linear regression on a distributed database. pnodeRun runs the same function in parallel on multiple nodes of the cluster and merges the results, which is very useful in cluster management.

Each data node returns the last 10 running or completed batch jobs:

pnodeRun(getRecentJobs{10});

Return the last 10 completed SQL queries on nodes nodeA and nodeB:

pnodeRun(getCompletedQueries{10}, `nodeA`nodeB);

Clear the cache on all data nodes:

pnodeRun(clearAllCache);


5.4 Distributed Computing

The function mr is used to develop distributed computing based on map-reduce, and imr for distributed computing based on iterative map-reduce. The user only needs to specify the distributed data source and the core functions, such as the map function, the reduce function and the final function. Below we demonstrate examples that use distributed data to compute a linear regression and an approximate median.

n=10000000
x1 = pow(rand(1.0,n), 2)
x2 = norm(3.0, 1.0, n)
y = 0.5 + 3 * x1 - 0.5*x2 + norm(0.0, 1.0, n)
t=table(rand(10, n) as id, y, x1, x2)

login(`admin,"123456")
db = database("dfs://testdb", VALUE, 0..9)
db.createPartitionedTable(t, "sample", "id").append!(t)

Use the custom map function myOLSMap, the built-in reduce function (+), the custom final function myOLSFinal, and the built-in map-reduce framework function mr to construct a function myOLSEx that runs linear regression on distributed data sources.

def myOLSMap(table, yColName, xColNames){
    // build the design matrix with an intercept column of 1s
    x = matrix(take(1.0, table.rows()), table[xColNames])
    xt = x.transpose();
    // each data source chunk returns X'X and X'y
    return xt.dot(x), xt.dot(table[yColName])
}

def myOLSFinal(result){
    // result holds the element-wise sums of X'X and X'y over all map calls (the reduce step is +)
    xtx = result[0]
    xty = result[1]
    // solve for the regression coefficients: (X'X)^-1 (X'y)
    return xtx.inv().dot(xty)[0]
}

def myOLSEx(ds, yColName, xColNames){
  // map over the distributed data source, sum the partial results with +, then finalize
  return mr(ds, myOLSMap{, yColName, xColNames}, +, myOLSFinal)
}

Use user-defined distributed algorithms and distributed data sources to calculate linear regression coefficients:

sample = loadTable("dfs://testdb", "sample")
myOLSEx(sqlDS(<select * from sample>), `y, `x1`x2);
[0.4991, 3.0001, -0.4996]

Use the built-in ols function on the unpartitioned data to calculate the linear regression coefficients and obtain the same result:

ols(y, [x1,x2],true);
[0.4991, 3.0001, -0.4996]

In the next example, we construct an algorithm that calculates the approximate median of data on a distributed data source. The basic idea is to use the bucketCount function to count, on each node, how many values fall into each bucket of a given range, and then sum the counts across nodes. This tells us which interval the median falls into. If the interval is not yet small enough, it is subdivided further until it is smaller than the required precision. Because the algorithm needs multiple iterations, we use the iterative computing framework imr.

// count how many values of the column fall into each of 1024 buckets over the current range
def medMap(data, range, colName): bucketCount(data[colName], double(range), 1024, true)

def medFinal(range, result){
    // accumulate the bucket counts and locate the bucket containing the median
    x= result.cumsum()
    index = x.asof(x[1025]/2.0)
    ranges = range[1] - range[0]
    if(index == -1)
        // the median lies below the current range: extend the range downward
        return (range[0] - ranges*32):range[1]
    else if(index == 1024)
        // the median lies above the current range: extend the range upward
        return range[0]:(range[1] + ranges*32)
    else{
        // narrow the range to the single bucket containing the median
        interval = ranges / 1024.0
        startValue = range[0] + (index - 1) * interval
        return startValue : (startValue + interval)
    }
}

def medEx(ds, colName, range, precision){
    // stop iterating once the candidate range is narrower than the required precision;
    // the final result is the midpoint of that range
    termFunc = def(prev, cur): cur[1] - cur[0] <= precision
    return imr(ds, range, medMap{,,colName}, +, medFinal, termFunc).avg()
}

Use the above approximate median algorithm to calculate the median of distributed data:

sample = loadTable("dfs://testdb", "sample")
medEx(sqlDS(<select y from sample>), `y, 0.0 : 1.0, 0.001);
-0.052973

Use the built-in med function to calculate the median of unpartitioned data:

med(y);
-0.052947


6. Metaprogramming

Metaprogramming refers to using program code to create program code that can be executed dynamically. The purpose of metaprogramming is generally to delay code execution or to generate code dynamically.

DolphinDB supports using metaprogramming to create expressions dynamically, such as function-call expressions and SQL query expressions. Many business details cannot be determined at coding time. For example, for a customized report, the complete SQL query expression can only be determined at run time, once the customer has chosen the table, fields and field formats.

Delayed code execution generally falls into these situations:

  • Providing a callback function
  • Delaying execution to create opportunities for global optimization
  • Describing the problem at coding time, but realizing it at run time

There are two ways for DolphinDB to implement metaprogramming. One is to use a pair of angle brackets <> to indicate the dynamic code that needs to be executed later, and the other is to use functions to create various expressions. Commonly used functions for metaprogramming include objByName, sqlCol, sqlColAlias, sql, expr, eval, partial, makeCall.

Use <> to generate dynamic expressions for deferred execution:

a = <1 + 2 * 3>
a.typestr();
CODE

a.eval();
7

Use functions to generate dynamic expressions for deferred execution:

a = expr(1, +, 2, *, 3)
a.typestr();
CODE

a.eval();
7

Metaprogramming can be used to generate customized reports. The user's input includes a table, field names and the corresponding format strings. In the following example, a SQL expression is dynamically generated and executed based on the input table, fields, formats and filter condition.

def generateReport(tbl, colNames, colFormat, filter){
	colCount = colNames.size()
	colDefs = array(ANY, colCount)
	for(i in 0:colCount){
		if(colFormat[i] == "") 
			colDefs[i] = sqlCol(colNames[i])
		else
			colDefs[i] = sqlCol(colNames[i], format{,colFormat[i]})
	}
	return sql(colDefs, tbl, filter).eval()
}

Generate a simulated 100-row table:

t = table(1..100 as id, (1..100 + 2018.01.01) as date, rand(100.0, 100) as price, rand(10000, 100) as qty);

Specify the filter condition, fields and formats to produce the customized report. The filter condition is written with metaprogramming.

generateReport(t, ["id","date","price","qty"], ["000","MM/dd/yyyy", "00.00", "#,###"], < id<5 or id>95 >);

id  date       price qty
--- ---------- ----- -----
001 01/02/2018 50.27 2,886
002 01/03/2018 30.85 1,331
003 01/04/2018 17.89 18
004 01/05/2018 51.00 6,439
096 04/07/2018 57.73 8,339
097 04/08/2018 47.16 2,425
098 04/09/2018 27.90 4,621
099 04/10/2018 31.55 7,644
100 04/11/2018 46.63 8,383

Some built-in DolphinDB functions take parameters that require metaprogramming. In a window join, one or more aggregate functions must be specified for the windowed data set of the right table, together with the parameters those functions need when they run. Because the problem is described and executed in two different stages, metaprogramming is used to achieve delayed execution.

t = table(take(`ibm, 3) as sym, 10:01:01 10:01:04 10:01:07 as time, 100 101 105 as price)
q = table(take(`ibm, 8) as sym, 10:01:01+ 0..7 as time, 101 103 103 104 104 107 108 107 as ask, 98 99 102 103 103 104 106 106 as bid)
wj(t, q, -2 : 1, < [max(ask), min(bid), avg((bid+ask)*0.5) as avg_mid]>, `time);

sym time     price max_ask min_bid avg_mid
--- -------- ----- ------- ------- -------
ibm 10:01:01 100   103     98      100.25
ibm 10:01:04 101   104     99      102.625
ibm 10:01:07 105   108     103     105.625

Another built-in DolphinDB feature that uses metaprogramming is updating in-memory partitioned tables. Of course, updating, deleting and sorting in-memory partitioned tables can also be done with SQL statements.

Create an in-memory database partitioned by date, and simulate the trades table:

db = database("", VALUE, 2018.01.02 2018.01.03)
date = 2018.01.02 2018.01.02 2018.01.02 2018.01.03 2018.01.03 2018.01.03
t = table(`IBM`MSFT`GOOG`FB`IBM`MSFT as sym, date, 101 103 103 104 104 107 as price, 0 99 102 103 103 104 as qty)
trades = db.createPartitionedTable(t, "trades", "date").append!(t);

Delete the records where qty is 0, and sort each partition in ascending order of trade value (price*qty):

trades.erase!(<qty=0>).sortBy!(<price*qty>);

Add a new field logPrice:

trades[`logPrice]=<log(price)>;

Update the trade quantity of stock IBM:

trades[`qty, <sym=`IBM>]=<qty+100>;


7. Summary

DolphinDB is a programming language born for data analysis. Unlike other data analysis languages such as Matlab, SAS and pandas, DolphinDB is tightly integrated with its distributed database and distributed computing framework, and is inherently capable of processing massive amounts of data. DolphinDB supports SQL programming, functional programming and metaprogramming. The language is concise, flexible and expressive, which greatly improves the development efficiency of data scientists. With vectorized and distributed computing, it also runs extremely fast.
