What is lightweight big data technology?

1. The form of big data

Popular big data technologies include Hadoop, Storm, Hive, and Spark. These are large-cluster solutions suited to very large enterprises with massive data volumes; in fact, most of them originated at leading Internet companies. In many scenarios, however, the data is large but still far below that scale: a small cluster, or even no cluster at all, is enough to process it, and there is neither as much hardware nor as many maintenance staff available. This is where lightweight big data technology is needed.

2. The SPL domestic database is here!

There are not many lightweight big data technologies, and esProc SPL is the best among them. SPL is an open-source Java computing library for big data. Its code is concise, its architecture is light, and it is easy to integrate. It also provides a high-performance storage format and supports both single-machine parallel computing and multi-machine cluster computing, so it can make full use of the hardware in a small cluster.

2.1 Lightweight Architecture

The SPL architecture is lightweight: there is no complex computing framework and no dependence on an external environment. When a cluster is not required, the SPL jar package can be embedded and used for computing directly, without starting a server. SPL's cluster computing has no heavy central system either. Just start the SPL service on a few node machines, which can be PCs, Linux machines, servers, workstations, or notebooks with different configurations and operating systems, and then run the cluster computing code on any machine. Simple cluster computing code looks like this:

A
1 =["192.168.1.11:8281","192.168.1.12:8281","192.168.1.13:8281","192.168.1.14:8281"]
2 =file("Orders.ctx":[1,2,3,4],A1)
3 =A2.open().cursor@m(Client,Amount,OrderDate)
4 =A3.groups(year(OrderDate),Client;sum(Amount))

This code performs grouping and aggregation across the cluster. Splitting the task and merging the results puts far less load on a machine than the computation done on the nodes, so the code can be executed on any node or in the integrated environment.
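
The reason the merging step is cheap can be shown with a minimal sketch in plain Java. This is not SPL's API or its internal implementation: the class name, the "year|client" key format, and the sample values are all hypothetical. It assumes each node has already reduced its own rows to a small map of partial sums, so the coordinating machine only has to add those maps together.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain Java, not the SPL API. Each "node" is assumed to
// return partial sums per key; the coordinator merges these small maps.
final class PartialAggregateMerge {

    // Merge per-node partial sums into one result. Addition is associative,
    // so adding partial sums gives the same totals as summing all rows at once.
    static Map<String, Double> merge(List<Map<String, Double>> partials) {
        Map<String, Double> total = new TreeMap<>();
        for (Map<String, Double> partial : partials) {
            partial.forEach((key, sum) -> total.merge(key, sum, Double::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical partial results from two nodes, keyed by "year|client".
        Map<String, Double> node1 = Map.of("2021|ClientA", 100.0, "2021|ClientB", 40.0);
        Map<String, Double> node2 = Map.of("2021|ClientA", 25.0, "2022|ClientB", 10.0);
        System.out.println(merge(List.of(node1, node2)));
    }
}
```

The work in merge grows with the number of distinct groups, not with the number of detail rows, which is why it can run on any machine.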

2.2 Lightweight JDBC interface

SPL provides a lightweight JDBC interface for easy integration with Java. For example, save the algorithm above as an SPL script file, then call it from Java by its file name, in the same way as a stored procedure:

…

Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn = DriverManager.getConnection("jdbc:esproc:local://");
CallableStatement statement = conn.prepareCall("{call groupQuery(?, ?)}");
statement.setObject(1, "2021-01-01");
statement.setObject(2, "2021-12-31");
statement.execute();
...

For big data computing, SPL also offers a number of high-performance storage mechanisms and algorithms, which lets it perform much better than most SQL-based big data platforms. Operations that would otherwise require a Hadoop/Spark cluster can often be handled by a single machine in SPL.

2.3 High-performance storage format

SPL provides a high-performance storage format called the group table. The group table is carefully designed, so its storage density and computing performance are higher than those of ordinary formats. It is compressed by default and is good at storing large data, especially when field values repeat often. In addition to row storage, the group table also supports column storage, which suits calculations that touch only a few fields of a wide table and can greatly improve both the compression ratio and the computing performance:

A
1 =file("Orders.ctx")
2 =A1.open().cursor(Client,Amount,OrderDate;OrderDate>=arg1 && OrderDate<=arg2)
3 =A2.groups(year(OrderDate),Client;sum(Amount))

2.4 Parallel Computing

The SPL group table supports parallel computing. Simply add the option @m to the cursor function to take full advantage of multi-core CPUs:

A
1 =file("Orders.ctx")
2 =A1.open().cursor@m(Client,Amount,OrderDate;OrderDate>=arg1 && OrderDate<=arg2)
3 =A2.groups(year(OrderDate),Client;sum(Amount))
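
The general idea behind segmented parallel aggregation can be sketched in plain Java. This is only an illustration under stated assumptions, not how SPL implements @m: the Order class, the method names, and the segment count are hypothetical. The rows are split into contiguous segments, each segment is aggregated independently (possibly on a different core), and the small per-segment results are combined.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.IntStream;

// Illustrative only: plain Java, not the SPL API.
final class SegmentedParallelSum {

    // Hypothetical row type standing in for one order record.
    static final class Order {
        final String client;
        final double amount;

        Order(String client, double amount) {
            this.client = client;
            this.amount = amount;
        }
    }

    // Sum amounts per client within one contiguous segment of rows.
    static Map<String, Double> sumSegment(List<Order> segment) {
        Map<String, Double> partial = new HashMap<>();
        for (Order o : segment) {
            partial.merge(o.client, o.amount, Double::sum);
        }
        return partial;
    }

    // Split rows into `segments` contiguous ranges, aggregate the ranges on a
    // parallel stream, and combine the per-segment maps into one result.
    static Map<String, Double> sumByClient(List<Order> rows, int segments) {
        int size = rows.size();
        return IntStream.range(0, segments)
                .parallel()
                .mapToObj(s -> sumSegment(rows.subList(
                        (int) ((long) size * s / segments),
                        (int) ((long) size * (s + 1) / segments))))
                .collect(HashMap::new,
                        (acc, partial) -> partial.forEach((k, v) -> acc.merge(k, v, Double::sum)),
                        (left, right) -> right.forEach((k, v) -> left.merge(k, v, Double::sum)));
    }

    public static void main(String[] args) {
        List<Order> rows = List.of(new Order("A", 10), new Order("B", 5),
                new Order("A", 7), new Order("C", 1), new Order("B", 2));
        System.out.println(sumByClient(rows, 2));
    }
}
```

Because each segment only reads its own range and writes its own map, the segments need no locking; the only shared step is the final merge.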

2.5 Cursor traversal multiplexing

Traversal is time-consuming in big data computing. SPL supports cursor traversal multiplexing, so multiple results can be computed while traversing the data only once:

A
1 =file("Orders.ctx")
2 =A1.open().cursor(Client, Amount, OrderDate)
3 =channel(A2).groups(year(OrderDate);max(Amount))
4 =A2.groups(Client;sum(Amount))
5 =A3.result()
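
The benefit of traversing once can be seen in a small plain-Java sketch. It is not the SPL channel mechanism itself, and the Order and Result classes are hypothetical. It assumes the data arrives as a one-shot iterator, as it would when read from disk, and updates both aggregates for every row it sees.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Illustrative only: plain Java, not the SPL API.
final class SinglePassAggregates {

    // Hypothetical row type standing in for one order record.
    static final class Order {
        final int year;
        final String client;
        final double amount;

        Order(int year, String client, double amount) {
            this.year = year;
            this.client = client;
            this.amount = amount;
        }
    }

    // Holds both results produced by one traversal.
    static final class Result {
        final Map<Integer, Double> maxByYear = new HashMap<>();
        final Map<String, Double> sumByClient = new HashMap<>();
    }

    // Consume the iterator exactly once and update both aggregates per row,
    // instead of scanning the data once for each aggregate.
    static Result aggregate(Iterator<Order> rows) {
        Result r = new Result();
        while (rows.hasNext()) {
            Order o = rows.next();
            r.maxByYear.merge(o.year, o.amount, Double::max);
            r.sumByClient.merge(o.client, o.amount, Double::sum);
        }
        return r;
    }

    public static void main(String[] args) {
        List<Order> rows = List.of(new Order(2021, "A", 10.0),
                new Order(2021, "B", 30.0), new Order(2022, "A", 5.0));
        Result r = aggregate(rows.iterator());
        System.out.println(r.maxByYear + " " + r.sumByClient);
    }
}
```

When the data is too large for memory, reading it is usually the dominant cost, so each extra aggregate added to the same pass costs little compared with a second scan.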

2.6 Pre-aggregation

Like many OLAP servers, the SPL group table supports pre-aggregation: several commonly used aggregate results are cached in advance, and at query time the cached results are either returned directly or aggregated a second time, depending on the query, which improves computing performance. For example, the following code uses pre-aggregated data to perform a high-speed calculation:

A
1 =file("fact.ctx")
2 =A1.open().cgroups(dim1,dim2;sum(fact1),sum(fact2))
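
The two ways of using cached results, returning them directly or aggregating them a second time, can be illustrated with a minimal plain-Java sketch. It does not reflect how the group table stores its pre-aggregated data; the class, the Row type, and the dimension values are hypothetical. Sums are cached at the (dim1, dim2) level when the object is built, and later queries never touch the detail rows again.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: plain Java, not the SPL API.
final class PreAggregatedCube {

    // Hypothetical detail row: two dimensions and one fact value.
    static final class Row {
        final String dim1;
        final String dim2;
        final double fact;

        Row(String dim1, String dim2, double fact) {
            this.dim1 = dim1;
            this.dim2 = dim2;
            this.fact = fact;
        }
    }

    // Cached sums, keyed by dim1 and then by dim2.
    private final Map<String, Map<String, Double>> cache = new HashMap<>();

    // Scan the detail rows once and cache the sum for every (dim1, dim2) pair.
    PreAggregatedCube(List<Row> detail) {
        for (Row r : detail) {
            cache.computeIfAbsent(r.dim1, k -> new HashMap<>())
                    .merge(r.dim2, r.fact, Double::sum);
        }
    }

    // Query at the cached level: the answer is a direct lookup.
    double sum(String dim1, String dim2) {
        return cache.getOrDefault(dim1, Map.of()).getOrDefault(dim2, 0.0);
    }

    // Query at a coarser level: aggregate the cached sums a second time.
    double sumByDim1(String dim1) {
        double total = 0;
        for (double v : cache.getOrDefault(dim1, Map.of()).values()) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        PreAggregatedCube cube = new PreAggregatedCube(List.of(
                new Row("east", "tv", 10.0), new Row("east", "tv", 5.0),
                new Row("east", "phone", 2.0), new Row("west", "tv", 7.0)));
        System.out.println(cube.sum("east", "tv") + " " + cube.sumByDim1("east"));
    }
}
```

Secondary aggregation works here because sums of sums are still correct; an aggregate such as an average would need the cached sum and count rather than the cached average.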

2.7 Association calculation

When a small dimension table is associated with a large fact table, the whole dimension table can be loaded into the memory of each node, while the large fact table is stored across multiple nodes as a cluster group table. Associating the in-memory dimension table with the fact table in external storage improves computing performance:

A
1 =["192.168.1.11:8281","192.168.1.12:8281","192.168.1.13:8281","192.168.1.14:8281"]
2 =file("Orders.ctx":[1,2,3,4],A1)
3 =A2.open().cursor@m(SellerId, Amount)
4 =file("Employees.ctx",A2).open().memory()
5 =A3.join(SellerId,A4,Name,Dept)
6 =A5.groups(Dept;sum(Amount))
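
Why an in-memory dimension table speeds up the association can be sketched in plain Java. This is an illustration rather than SPL's join implementation, and the Sale class and sample data are hypothetical. The dimension table is assumed to fit in a hash map keyed by seller ID, while the fact rows are streamed through an iterator and looked up one by one. Dropping fact rows with no matching seller is a choice made for this sketch only.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain Java, not the SPL API.
final class DimensionLookupJoin {

    // Hypothetical fact row: one sale made by one seller.
    static final class Sale {
        final int sellerId;
        final double amount;

        Sale(int sellerId, double amount) {
            this.sellerId = sellerId;
            this.amount = amount;
        }
    }

    // Stream the fact rows once; resolve each row's department with a hash
    // lookup in the in-memory dimension table, then sum amounts per department.
    static Map<String, Double> sumByDept(Map<Integer, String> deptBySeller,
                                         Iterator<Sale> sales) {
        Map<String, Double> result = new TreeMap<>();
        while (sales.hasNext()) {
            Sale s = sales.next();
            String dept = deptBySeller.get(s.sellerId);
            if (dept != null) { // sketch-only choice: skip unmatched rows
                result.merge(dept, s.amount, Double::sum);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, String> deptBySeller = Map.of(1, "Sales", 2, "Sales", 3, "R&D");
        List<Sale> sales = List.of(new Sale(1, 10.0), new Sale(2, 5.0),
                new Sale(3, 8.0), new Sale(9, 100.0));
        System.out.println(sumByDept(deptBySeller, sales.iterator()));
    }
}
```

The fact table is never sorted or held in memory; each row costs one hash lookup, so the association adds little to the cost of the scan itself.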

2.8 Association calculation of large main and sub-tables

When a large main table is associated with a large sub-table, both tables can be stored across multiple nodes as cluster group tables, each sorted by the association field. During calculation, the two tables can then be associated with an ordered merge, which improves computing performance:

A B
1 =["192.168.1.11:8281","192.168.1.12:8281","192.168.1.13:8281","192.168.1.14:8281"]
2 =file("orders.ctx":[1,2,3,4],A1) =file("orderdetail.ctx",A2)
3 =A2.open().cursor@m() =B2.open().cursor(;;A3)
4 =joinx(A3:m,ID;B3:c,ID)
5 =A4.groups(m.Client;sum(c.Amount))
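
The ordered merge itself can be illustrated with a short plain-Java sketch. It is not SPL's joinx implementation, and the Order and Detail classes are hypothetical. It assumes both lists are sorted ascending by order ID and that each ID appears at most once in the main table. Under those assumptions, one forward pass over each table is enough.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain Java, not the SPL API.
final class OrderedMergeJoin {

    // Hypothetical main-table row.
    static final class Order {
        final int id;
        final String client;

        Order(int id, String client) {
            this.id = id;
            this.client = client;
        }
    }

    // Hypothetical sub-table row; many details may share one orderId.
    static final class Detail {
        final int orderId;
        final double amount;

        Detail(int orderId, double amount) {
            this.orderId = orderId;
            this.amount = amount;
        }
    }

    // Both inputs must be sorted ascending by order ID. The detail index `d`
    // only moves forward, so each table is read exactly once.
    static Map<String, Double> sumByClient(List<Order> orders, List<Detail> details) {
        Map<String, Double> result = new TreeMap<>();
        int d = 0;
        for (Order o : orders) {
            // Skip detail rows whose key is smaller than the current order.
            while (d < details.size() && details.get(d).orderId < o.id) {
                d++;
            }
            // Consume every detail row that belongs to the current order.
            while (d < details.size() && details.get(d).orderId == o.id) {
                result.merge(o.client, details.get(d).amount, Double::sum);
                d++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Order> orders = List.of(new Order(1, "A"), new Order(2, "B"), new Order(3, "A"));
        List<Detail> details = List.of(new Detail(1, 10.0), new Detail(1, 5.0),
                new Detail(2, 7.0), new Detail(3, 1.0), new Detail(4, 9.0));
        System.out.println(sumByClient(orders, details));
    }
}
```

Unlike a hash join, this needs no hash table for either side, which is why it remains practical when both tables are too large for memory.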

For big data computing, SPL also supports cluster computing with large dimension tables, lets you customize the task size and specify the degree of parallelism, allows efficient execution paths to be designed, and supports both external-storage and in-memory fault tolerance. In addition, SPL supports a variety of data sources, including files, RDBs, NoSQL, and big data platforms, and supports mixed computing across them, which often saves the tedious and time-consuming format conversion and import/export steps in big data computing.

3. SPL information

If you are interested in this domestic database, you can read two simple introductions I wrote:
Learning the domestic SPL database like learning Excel: a zero-basics introduction (1)
Learning the domestic SPL database like learning Excel: a zero-basics introduction (2)

Depending on everyone's feedback and needs, I will decide whether to keep writing introductory tutorials.

The following is the official information, if you are interested, you can check it out:


Origin blog.csdn.net/weixin_46211269/article/details/124403796