DCM: A new member of the middleware family

What is DCM

Modern applications deal with data constantly: report statistics, data analysis, business processing, and so on. The mainstream means of data processing are still the technologies represented by the relational database. Although high-level languages (such as Java) can hard-code all kinds of calculations, they are far less convenient than the database (SQL), so databases still play an important role in contemporary data processing.

However, with the development of information technology and the rise of architectures and concepts such as storage-computing separation, microservices, pre-computation, and edge computing, the heavy and closed database finds it increasingly difficult to cope. A database requires data to be loaded before it can be computed, but faced with rich and diverse data sources, loading everything is inefficient and resource-intensive and cannot guarantee real-time results; some data is used only temporarily yet still has to be persisted in the database. Moreover, in scenarios such as microservices and edge computing, where computing power needs to sit on the application side, a database is hard to embed.

In this context, a data computing technology that does not rely on the database, offers open computing capabilities, and can be embedded and integrated with applications would solve these problems well. This is Data Computing Middleware (DCM). The application scenarios of DCM are very wide, almost ubiquitous: it plays an important role in optimizing application development, implementing microservices, replacing stored procedures, decoupling applications from databases, assisting ETL, computing over diverse data sources, preparing data for BI reporting, and more. Almost any scenario involving application data interaction can use DCM to improve the application structure and raise development and computing efficiency.

DCM application scenarios

Optimize application development

Data processing logic inside an application can only be implemented by coding, and native Java is often awkward for the job because it lacks the necessary structured computation class library; even the later Stream API (or Kotlin) does not improve things much. ORM technology relieves the difficulty to some extent, but it still lacks professional structured data types, set operations are inconvenient, code for reading and writing the database is cumbersome, and complex calculations are hard to express. These shortcomings mean that ORM often fails to improve, and may even greatly reduce, the efficiency of developing business logic. Such implementations also create structural problems: computing logic written in Java must be deployed together with the main application, resulting in tight coupling, and the lack of hot-deployment support makes development and operations troublesome.

Using DCM features such as agile computing, easy integration, and hot switching to implement data processing logic in place of Java solves these problems well: it improves development efficiency, optimizes the application structure, decouples the computing module, and supports hot deployment.
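
As a hedged sketch of the difference (the data source name mysqlDB, the orders table, and its fields are hypothetical), logic that would take a page of Java or a tangle of ORM calls becomes a few lines of SPL:

A1	=connect("mysqlDB")	// connect to a configured data source (hypothetical name)
A2	=A1.query("select * from orders")	// read the table as a structured table sequence
A3	=A2.select(amount > 1000 && year(odate) == 2022)	// filter with set operations
A4	=A3.groups(client; sum(amount):total)	// group and aggregate in one step
A5	>A1.close()	// release the connection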

Diverse data source calculation

Modern applications also often face diverse data sources. Handling them through the database requires loading first, which is inefficient and cannot guarantee data freshness. Different sources have their own strengths: the RDB has strong computing power but weak I/O throughput; NoSQL has high I/O efficiency but weak computing power; file data such as text has no computing power at all but is very flexible to use. Forcing all this data into one repository forfeits the advantages of the original sources.

With DCM's multi-source mixed computing capability, data from RDB, text, Excel, JSON, XML, NoSQL, and other network interfaces can be computed together directly. This guarantees the freshness of data and the real-time quality of calculation, while retaining the advantages of each data source and letting each play to its strengths.
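
A minimal sketch of mixed computation in SPL (the file name, data source name, and fields are hypothetical): join a CSV file with a MySQL table and aggregate across them.

A1	=file("orders.csv").import@tc()	// read a CSV with a title row
A2	=connect("mysqlDB").query("select id, name from clients")	// read from an RDB
A3	=join(A1:o, clientid; A2:c, id)	// mix the two sources with a join
A4	=A3.groups(c.name; sum(o.amount):total)	// aggregate over the joined result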

Microservice implementation

Current microservice implementations still rely heavily on Java and the database for data processing. Java is complex for this purpose and cannot hot-swap, while the database is confined by its "base": multi-source data can only be computed after being loaded, flexibility is very low, data freshness cannot be guaranteed, and the advantages of the various data sources cannot be exploited.

Embedding the integrable DCM into each link of the data middle office or microservices, to handle data collection and arrangement, data processing, and pre-computation, exploits the advantages of multiple data sources through an open computing system and increases flexibility. Problems such as multi-source data processing, real-time computation, and hot deployment are then easily solved.


Stored procedure replacement

In the past, stored procedures were often used to implement complex calculations or pre-organize data. Stored procedures have certain advantages for in-database computation, but the drawbacks are just as obvious: they lack portability, they are hard to edit and debug, creating and using them requires high privileges and raises security issues, and stored procedures that serve front-end applications tightly couple the database to the application.

Moving stored procedures out of the database and into the application through DCM yields an "outside-database stored procedure", leaving the database mainly for storage. Decoupling stored procedures from the database resolves the various problems they cause.
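
As a sketch of what such an externalized procedure might look like (the script name, data source, and logic are hypothetical), a multi-step stored procedure becomes a plain SPL script file that the application invokes through JDBC just like a stored procedure:

// getClientStats.splx (hypothetical), invoked as {call getClientStats(?)}
A1	=connect("mysqlDB").query("select * from orders where odate >= ?", arg1)	// arg1 is the script parameter
A2	=A1.select(amount > 0)	// step 1: cleanse invalid rows
A3	=A2.groups(clientid; count(~):cnt, sum(amount):total)	// step 2: summarize
A4	return A3	// return the result set to the caller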


Report and BI data preparation

Providing data preparation for reports is an important DCM scenario. Using the database to prepare report data suffers from high implementation difficulty and tight coupling, while reporting tools themselves lack the computing power for many complex calculations. With its strong computing power outside the database, DCM can serve as a dedicated data computation layer for reporting: it decouples reporting from the database and reduces the database's burden, while making up for the reporting tool's computational weakness. With the logic layered this way, report development and maintenance become much cleaner.


Intermediate table elimination

Sometimes, to speed up queries, data is pre-processed into result tables stored in the database: intermediate tables. Complex calculations may likewise need to persist intermediate results as intermediate tables, and diverse source data often must be landed in intermediate tables before it can join in-database computation. As with stored procedures, once an intermediate table exists it may be used by multiple applications (modules), tightly coupling applications to the database. And because intermediate tables cannot be safely deleted, they accumulate; too many of them create database capacity and performance problems, since storing them takes space and refreshing them consumes database computing resources.

With DCM, intermediate tables can be moved out into the file system and computed by DCM, decoupling from the database and reducing its storage and computation burden. The key is that DCM gives files computing power, so tables that lived in the database mainly to borrow its computing power no longer need to. Once DCM supplies the computation, the storage form no longer matters, and externalizing intermediate tables to files is the better choice.
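
A sketch (file and table names hypothetical) of externalizing an intermediate table: persist a query result as an SPL binary file once, then let later tasks compute from the file instead of a database table.

A1	=connect("mysqlDB").query("select * from detailed_orders")	// the query that used to feed an intermediate table
A2	>file("mid_orders.btx").export@b(A1)	// persist it as a binary set file outside the database
A3	=file("mid_orders.btx").import@b()	// any later task reads the file...
A4	=A3.groups(area; sum(amount):total)	// ...and computes on it directly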


T+0 query

When data accumulates past a certain volume, queries against the production database start to affect transactions, so large amounts of historical data are stripped off into a separate historical database, separating cold data from hot. Querying the full data then requires cross-database queries, hot/cold data routing, and so on. Databases handle cross-database queries poorly, especially across heterogeneous databases: they are inefficient, data transmission is unstable, and scalability is low, so they cannot realize T+0 (real-time full-data) queries well.

DCM solves these problems. With independent and complete computing capabilities, it can fetch and compute from different databases, so it adapts well to heterogeneous sources, and it can even decide, according to each database's resource status, whether a given calculation runs inside the database or in DCM, which is very flexible. On the implementation side, DCM's agile computing also simplifies the complex calculations in T+0 queries and improves development efficiency.
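
A sketch of a T+0 query in SPL (data source names, table, and fields hypothetical): read cold data from the historical database and hot data from production, then compute over the union.

A1	=connect("oracleHist").query("select area, amount from orders where odate < ?", date(now()))	// cold history
A2	=connect("mysqlProd").query("select area, amount from orders where odate >= ?", date(now()))	// today's hot data
A3	=A1 | A2	// concatenate cold and hot data
A4	=A3.groups(area; sum(amount):total)	// full-data (T+0) aggregation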

ETL

ETL cleans and transforms data and then loads it into the target. But since the source data may come from many places (text, database, web) and its quality is uneven, the E and T steps involve a large amount of computation. Today, data sources other than the database rarely have that computing power, so the data must first be loaded into the database before E and T can run, turning ETL into LET. Loading large amounts of useless raw data consumes storage and easily causes capacity problems; pushing the cleaning and transformation work onto the database lengthens processing time, on top of the time spent storing raw, uncleaned data. The limited ETL time window can then run out, and an ETL job that does not finish on time affects the next day's business.

Introducing DCM into the ETL job lets cleaning (E), transformation (T), and loading (L) run in their proper order, solving the problems of LET. With DCM's open computing capabilities, multi-source data is cleaned and transformed outside the database; DCM's strong computing power handles the complex calculations; finally the prepared data is loaded into the target, achieving true ETL.
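
A compact sketch of the E-T-L order in SPL (file, data source, table, and field names hypothetical; the database-write call follows my understanding of SPL's update interface):

A1	=file("raw_orders.txt").import@t()	// E: extract from a text source
A2	=A1.select(amount != null && amount > 0)	// E: cleanse invalid rows
A3	=A2.run(client = upper(trim(client)))	// T: transform fields in place
A4	=connect("targetDB").update@i(A3, orders)	// L: load into the target table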

DCM characteristics

It can be seen that DCM's application scenarios are very wide. To handle them well, what characteristics should an excellent DCM have?

Compatible

First of all, DCM needs good compatibility: it should work across platforms and run well under various operating systems, cloud platforms, and application servers, since this determines how widely DCM can be used.

Compatibility also means supporting diverse data sources: whatever the source, it can be used directly and mixed into computation, which requires DCM to be sufficiently open.

Hot-deploy

Data processing is a high-frequency and not very stable scenario: as the business develops, computing tasks are constantly added and modified. This requires DCM to support hot deployment, so that changes to data processing logic take effect without restarting the application (service).

High performance (Efficient)

Computing performance is key in data computing scenarios and sometimes becomes the main focus. DCM should process data efficiently, providing guarantee mechanisms such as a high-performance computing library, high-performance storage schemes, and parallel computing.

Agility

Agility requires that a DCM implement data processing logic quickly and possess complete computing capabilities. Especially in complex computing scenarios, the processing should be expressible in simple enough code that still runs efficiently. This requires DCM to provide an agile programming mechanism and an easy-to-use development environment.

Scalable

When computing capacity cannot meet demand, the DCM should be able to scale out flexibly. Scalability is very important for contemporary applications; it determines DCM's upper limit.

Embeddable

DCM should integrate well with the application, acting as a computing engine inside it and being packaged and deployed with the application as one of its parts. The application thereby gains strong computing power of its own, and once it no longer relies heavily on the database it can cope well with storage-computing separation, microservices, edge computing, and similar scenarios. Good integration is also another facet of agility: DCM stays light and can be embedded and used together with the application anytime, anywhere.

Put the initials of these characteristics together and you get CHEASE, very close to CHEESE, and DCM indeed acts like the cheese in a hamburger: leave it out and both the taste and the nutrition suffer.


Whether a technology can serve as an ideal DCM can thus be examined against the CHEASE criteria. Let's look at how well some mainstream technologies satisfy them.

State of the art

SQL

The database is the main home of SQL. Databases usually have strong computing power, and the performance of some leading databases is very strong as well, basically meeting the high-performance (E) requirement. However, the database is closed: data can only be computed after being loaded into it, which serves diverse data source scenarios poorly, so compatibility (C) is weak.

As for integration (E), most databases are used standalone, and the few embeddable ones (such as SQLite) often fall short in functionality and performance, so the database hardly meets the integration requirement.

As a dedicated set-computing language, SQL is very convenient for simple calculations, but complex calculations become cumbersome to express and often require many layers of nesting. In real business one frequently sees "long" SQL of thousands of lines, hard to write and inconvenient to maintain, so SQL does not meet the agility (A) requirement.

Database-like Hadoop technologies share the same problems: their closedness brings poor compatibility, insufficient agility, and essentially no integration. Although they beat databases on scalability, they still do not meet DCM's requirements. Spark performs somewhat better, but Scala does not support hot deployment and is inconvenient for complex calculations, while Spark SQL inherits SQL's problems. These technologies are all too heavyweight to satisfy DCM's needs for agility, integration, and hot deployment.

Java

As a general-purpose programming language, Java runs well across platforms and can reach multiple data sources through coding, so its compatibility (C) is very good. And since most applications are developed in Java, integration (E) is no problem either.

But Java's shortcomings are also obvious. As a compiled language it cannot achieve hot deployment (H). Lacking the necessary structured computation class library, even a simple grouped aggregation takes dozens of lines of code, let alone complex calculations. Java hard-coding is indeed common for data processing in microservice architectures, but the implementations are far more complicated than SQL, so agility (A) is severely lacking. The Stream API introduced in Java 8 did not substantially improve the computing capability (Kotlin has the same problem).

In theory Java can implement all kinds of high-performance algorithms, but if the code serves only one application or project, building and packaging those algorithms is too large an investment. So from the standpoint of practical use, Java does not offer the high-performance (E) characteristic, and scalability (S) suffers from the same problem. On the whole, Java is difficult to use as an excellent DCM technology.

Python

As a popular computing technology, Python must be mentioned. Python's compatibility (C) is strong: it is cross-platform and supports many data sources, and its rich data processing packages make it extremely widely applicable.

Compared with Java and similar technologies, Python has considerable advantages in structured data processing, but it is hardly perfect, particularly for complex calculations such as order-based grouping, so Python is slightly lacking in agility (A).

Beyond that, the performance (E) of Pandas often fails to meet requirements, especially on large data volumes; this has much to do with algorithm implementation efficiency, as agile syntax alone does not produce high-performance algorithms. Scalability (S) is likewise unsatisfactory: as with Java, Python as a programming language would need heavy investment to scale well.

Python's biggest problem is integration (E): it is hard to integrate with existing applications. Inter-service calls through patterns such as sidecars are possible, but they are inherently far from DCM's requirement of being embedded in-process with the application. Python's main arena was never enterprise application development the way Java's is; in the end, professional work needs professional tools.

Professional data computing middleware SPL

The open source esProc SPL is a professional data computing middleware. It has complete computing capabilities that do not depend on the database, and its open computing capabilities can mix and compute diverse data sources. SPL is interpreted, so it naturally supports hot deployment, and its good integrability makes it easy to embed into applications, giving them strong computing power and fully realizing the value of DCM.

Compatibility

SPL is developed in Java, so its cross-platform capability matches Java's: it runs well on various operating systems and cloud platforms. For multiple data sources, SPL's open computing capability connects to RDB, NoSQL, CSV, Excel, JSON/XML, Hadoop, RESTful, Webservice, and more, computing them directly without loading them into storage first, so both data freshness and real-time calculation are well guaranteed.

This multi-source computing support solves what the database could not do: computing across sources and over outside-database data. Together with SPL's complete computing capability and its syntax, which is more concise than SQL, the application gains computing power equivalent to (indeed beyond) a database's.

Besides its native computing syntax, SPL also provides SQL support (roughly equivalent to the SQL92 standard), so text, Excel, NoSQL, and other non-RDB sources can be queried in SQL, a great convenience for application developers who are familiar with SQL.
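
A one-line sketch (file name and fields hypothetical) of querying a CSV file directly in SQL from SPL:

A1	$select client, sum(amount) as total from orders.csv group by client	// plain SQL against a file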

Only with such an open computing system can a DCM be compatible enough to adapt to more application scenarios.

Hot deployment

SPL adopts an interpreted execution mechanism and naturally supports hot deployment. This is very friendly to businesses such as reporting and microservices, where calculation logic is less stable and frequently needs to be added or modified.

High performance

For performance, SPL provides many high-performance algorithms and high-performance storage mechanisms. In the intermediate-table-elimination and ETL scenarios described above, data is often stored in files outside the database; storing it in SPL's own file formats achieves much higher performance than open formats such as text.

SPL offers two storage formats: the set file and the group table. The set file uses compression (smaller footprint, faster reading) and stores data types (no type parsing on read, faster still), and it supports a double-increment segmentation mechanism that tolerates appended data, so a segmentation strategy easily enables parallel computing and secures performance. The group table supports columnar storage, a great advantage when a calculation touches only a few columns (fields); it also implements a min-max index and supports double-increment segmentation, so it enjoys the benefits of column storage while making parallel speed-up even easier.
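
A brief sketch of writing both formats (source, file, and field names hypothetical; the group-table calls follow my understanding of the SPL storage API):

A1	=connect("mysqlDB").cursor("select * from orders")	// fetch source data as a cursor
A2	>file("orders.btx").export@b(A1)	// write a binary set file
A3	=file("orders.ctx").create@y(#id, client, area, amount)	// create a columnar group table
A4	>A3.append(file("orders.btx").cursor@b())	// append the data into the group table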

SPL also builds in many high-performance algorithms. Take the common TopN operation: SPL treats TopN as a kind of aggregation, converting a high-complexity full sort into a low-complexity aggregate computation, and the approach extends to grouped cases as well.

A1	=file("data.ctx").create().cursor()
A2	=A1.groups(;top(10,amount))	// top 10 orders over the whole set
A3	=A1.groups(area;top(10,amount))	// top 10 orders in each area

There is no sort keyword in these statements and no big-sort action occurs; the syntax for TopN over the whole set and within groups is essentially the same, and both run fast. SPL has many algorithms of this kind.

SPL also makes parallel computing easy, taking advantage of multiple CPUs. Many SPL functions, such as file reading, filtering, and sorting, provide a parallel option: just add @m and the computation is parallelized automatically, simply and conveniently. Parallel programs can also be written explicitly, improving performance through multithreading.
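
A sketch of both styles (file name and fields hypothetical; the segmented-cursor syntax follows my understanding of SPL and should be checked against the docs):

A1	=file("orders.txt").import@tm()	// @m option: read and parse the file in parallel
A2	=A1.groups(area; sum(amount):total)	// compute on the loaded data
A3	=fork to(4)	// explicit parallelism: fork 4 threads
B3	=file("orders.txt").cursor@t(;A3:4).groups(area; sum(amount):total)	// each thread handles one segment
A4	=A3.conj().groups(area; sum(total):total)	// merge the partial results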

Agility

SPL provides a native computing syntax and a concise, easy-to-use IDE. The IDE is convenient for coding and debugging, and the result of each step of a procedural computation can be viewed in real time. Since each cell's result is referenced by its cell name, there is usually no need to define variables, which is simple and convenient.

Meanwhile, SPL's rich computing class library, covering grouping and aggregation, loops, filtering, set operations, ordered computation, and more, makes structured data calculation especially convenient.

SPL is particularly good at complex calculations: computations that require several layers of nested SQL are often very easy in SPL. For example, based on a table of stock price records, for how many days did a stock rise the longest in a row? SPL is much simpler than SQL here.

The SQL for this nests three layers deep and is confusing to read, let alone write; the SPL follows natural thinking and needs only three lines.
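
A sketch of those three lines (the table and field names are assumed; this reconstructs the example the text refers to):

A1	=connect("db").query("select price from stock_record order by trade_date")	// records in date order
A2	=a=0	// a counts the current streak of rising days
A3	=A1.max(a=if(price > price[-1], a+1, 0))	// the answer is the largest value the counter reaches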

Good agility does more than improve development efficiency; many high-performance algorithms also become easy to implement through SPL. An algorithm must not only be conceived but also implemented, and the easier the implementation, the better. SPL provides that possibility.

Scalability

For scenarios with high computing performance requirements, SPL can also be deployed as a standalone computing service supporting multi-machine distributed clusters, with load balancing and fault tolerance mechanisms; when computing resources reach their ceiling, computing power grows through horizontal scaling.

In distributed computing, users can customize the data distribution and redundancy scheme to suit the characteristics of the data and the computing tasks, effectively reducing the amount of data transferred between nodes and achieving higher performance through controllable data distribution.

SPL adopts a centerless cluster design: the cluster has no permanent master node, and programmers control in code which nodes participate in a computation, effectively avoiding a single point of failure. At the same time, SPL assigns tasks according to each node's idleness (number of threads), striking an effective balance between load and resources.
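
As a hedged sketch of code-controlled node participation (node addresses and script name hypothetical; the callx usage follows my understanding of SPL's cluster interface), a main script might spread subtasks over an explicit node list:

A1	=["192.168.0.101:8281", "192.168.0.102:8281"]	// nodes chosen by the program, not by a master
A2	=callx("subtask.splx", to(8); A1)	// run 8 subtasks across the listed nodes
A3	=A2.conj().groups(area; sum(total):total)	// merge the subtask results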

For fault tolerance, SPL provides two data redundancy mechanisms, external-storage redundancy and in-memory backup ("spare-tire") redundancy, and it supports computing fault tolerance as well: when a node fails, its computing tasks automatically migrate to other nodes to be completed.

Integration

On the integration side, where DCM meets the application, SPL provides standard JDBC/ODBC/RESTful interfaces, and the application can request SPL computation results just as it would call a stored procedure.


Logically, SPL as a DCM sits between the application and the data sources, providing computing services upward and shielding the differences among diverse data sources downward, fully playing the role of DCM.

An example of calling an SPL script through JDBC:

import java.sql.*;

// Load the esProc JDBC driver and call an SPL script the way a stored procedure is called
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn = DriverManager.getConnection("jdbc:esproc:local://");
// "splscript" is the name of the SPL script file being invoked
CallableStatement st = conn.prepareCall("{call splscript(?, ?)}");
st.setObject(1, 3000);
st.setObject(2, 5000);
ResultSet result = st.executeQuery();   // execute() returns a boolean, so use executeQuery() for a result set

Taken together, measured against the six DCM characteristics (CHEASE), SPL is very balanced in every respect and well ahead of the other technologies overall, making it an ideal choice for DCM.


SPL Information

If you are interested in SPL, you are welcome to add the assistant (WeChat ID: SPL-helper) and join the SPL technical exchange group.
