Why Big Data Platforms Regress to Relational Data Models

Let's start with the answer: because a better model has not been found.

Next, the reasons. Let's first look at what big data platforms actually do.

Reasons

Structured data computing remains a top priority

Big data platforms exist mainly to meet the needs of massive data storage and analysis. The storage need is real: besides the structured data generated by production and operations, there is a great deal of unstructured data such as audio and video, which takes up enormous space; on some big data platforms more than 80% of the stored data is unstructured. But storing data is not enough: data only generates value when it is used, and that requires analysis.

Big data analysis should be discussed in two parts: structured and unstructured data.

Structured data is mainly the business data generated in an enterprise's production and operations, and can fairly be called the enterprise's core data. Before big data platforms, enterprises relied mostly or entirely on this data. As business accumulates, this data keeps growing, and traditional database solutions face serious challenges; a big data platform naturally has to solve the analysis problem for this core data.

Big data platforms also leave more room for imagination: unstructured data such as logs, images, audio, and video that could not be exploited before can now generate value, and this calls for unstructured data analysis. Compared with core business data analysis, unstructured data analysis looks more like icing on the cake. Even so, unstructured data analysis does not exist in isolation; it is usually accompanied by a large amount of structured data processing. Collecting unstructured data usually means collecting related structured data at the same time, such as the producer, production time, category, and duration of audio and video; some unstructured data is also turned into structured data after processing, such as the visitor IP, access time, and key search terms extracted from web logs. So-called unstructured data analysis is in fact often aimed at this accompanying structured data.

Structured data analysis remains a top priority for big data platforms. Structured data processing technology is relatively mature; the most familiar example is the relational database, built on the relational data model and accessed through SQL.

SQL is still the most widely used structured data computing technology

Returning to SQL is the current trend in big data computing syntax. In the Hadoop ecosystem, the early Pig Latin has been all but eliminated, while Hive has remained strong; on Spark, Spark SQL is used far more than Scala (Scala is easy to learn but hard to master, and as a compiled language it does not support hot deployment, which brings its own inconveniences). Other emerging big data computing systems also generally adopt SQL as their preferred syntax. After several years of melee, SQL has gradually regained the initiative.

There are probably two reasons for this phenomenon:

1. There is nothing better to use

Relational databases are so popular that programmers are thoroughly familiar with SQL; even their habits of thought are SQL-like. Writing common queries in SQL is fairly simple. Although SQL is inconvenient for complex procedural calculations or order-related operations, the alternative technologies are not much better: for operations that are hard to express in SQL, they also force you to write rather complex code, much like writing a UDF. Since it is troublesome either way, one might as well keep using SQL.

2. Strong support from big data vendors

The technical essence of big data is high performance, and SQL is the key battleground of the performance competition. Comparing performance only makes sense when facing the same operations; overly specialized and complex operations involve too many factors to cleanly evaluate the capability of the platform itself. SQL, by contrast, has international benchmark standards, the TPC series, that all users can understand and that give clear comparability, so vendors also focus their performance optimization on SQL.

Compatibility with SQL makes porting easier

The benefits of making a big data platform SQL-compatible are obvious. SQL is widely used and known by many programmers, so sticking with SQL avoids a lot of learning cost. Much front-end software also supports SQL, so a SQL-based big data platform fits easily into this ready-made ecosystem. The traditional databases that big data platforms aim to replace use SQL syntax as well, so compatibility is good and migration costs are relatively low.

That covers why big data platforms return to the relational data model. So what are the problems with continuing to use the relational data model and SQL?

Problems

Low performance

The biggest problem with continuing to use SQL is that it is difficult to obtain the high performance that big data computing needs most.

SQL lacks some necessary data types and operation definitions, which makes it impossible to describe certain high-performance algorithms; one can only hope the computing engine optimizes well at the engineering level. Traditional commercial databases, after decades of development, have accumulated rich optimization experience, yet even so many scenarios remain hard to optimize: problems at the theoretical level are indeed difficult to solve in engineering. The optimization experience of emerging big data platforms falls far short of traditional databases'; without an advantage in algorithms, they can only rely on adding more machines to improve performance. In addition, SQL is poor at describing processes and at specifying execution paths. High performance often requires a specially optimized execution path, which in SQL means adding many special modifiers for manual intervention; at that point it would be more direct to just use procedural syntax. This, too, hinders writing high-performance code in SQL.

When SQL was invented, computer hardware was weak, and to be practical SQL had to fit the hardware of the time. As a result, SQL has difficulty fully exploiting the hardware of contemporary computers, specifically large memory, parallelism, and clusters. A JOIN in SQL matches records by key value, but with large memory records can be matched directly by address, with no hash computation or comparison, improving performance considerably. SQL data tables are unordered; segment-based parallelism is easy for single-table calculations, but for multi-table joins only fixed, pre-made segmentation generally works, and synchronized dynamic segmentation is hard to achieve. In theory SQL does not distinguish dimension tables from fact tables, and defines JOIN simply as a Cartesian product followed by filtering; a large-table JOIN therefore inevitably triggers a hash shuffle that consumes a lot of network resources, and when the cluster has too many nodes, the delay caused by network transfer can outweigh the benefit of the extra nodes.
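To make the by-address idea concrete, here is a minimal Java sketch; the Customer and Order classes are illustrative, not any real platform's API. With the dimension table held in memory, each fact record's foreign key is resolved once into an object reference, after which every "join" is just a pointer dereference, with no hash computation or key comparison:

import java.util.*;

public class PointerJoin {
    static class Customer { final String region; Customer(String r) { region = r; } }
    static class Order {
        final int customerId; final double amount;
        Customer customer;                  // resolved reference, not a key
        Order(int id, double amt) { customerId = id; amount = amt; }
    }

    public static void main(String[] args) {
        Map<Integer, Customer> dim = Map.of(1, new Customer("East"), 2, new Customer("West"));
        List<Order> facts = List.of(new Order(1, 120.0), new Order(2, 80.0), new Order(1, 45.5));

        // One-time resolution: the only hash lookups happen here.
        for (Order o : facts) o.customer = dim.get(o.customerId);

        // Every later "join" is a pointer hop: no hashing, no comparison.
        for (Order o : facts)
            System.out.println(o.customer.region + ": " + o.amount);
    }
}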

For a concrete example, suppose we want the top 10 rows by x out of 100 million rows. In SQL this is written as:

select top 10 x,y from T order by x desc

This statement contains an order by; executed literally it implies a full sort, and sorting is very slow. In fact we can devise an algorithm that avoids a full sort, but it cannot be described in SQL; we can only count on the database optimizer. For a simple case like this, many commercial databases can indeed optimize it, using an algorithm without a full sort, and performance is usually good. But make the situation only slightly more complicated, say, taking the top 10 within each group, written with window functions and a subquery:

select * from
    (select *, row_number() over (partition by y order by x desc) rn from T)
where rn<=10

At this point the database optimizer gets dizzy: it cannot guess the purpose of the statement and can only faithfully execute the sorting logic (the words order by are still there), causing performance to drop steeply.
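The algorithm the optimizer is being trusted to find is easy to state procedurally. Here is a minimal Java sketch, with the row layout {x, y} assumed for illustration: scan once while keeping a 10-element min-heap, which costs O(n log 10) instead of the O(n log n) of a full sort:

import java.util.*;

public class Top10 {
    // Returns the 10 rows with the largest x in one pass, without a full sort.
    static List<double[]> top10(Iterable<double[]> rows) {       // row = {x, y}
        PriorityQueue<double[]> heap =
                new PriorityQueue<>(Comparator.comparingDouble((double[] r) -> r[0]));
        for (double[] row : rows) {
            heap.offer(row);
            if (heap.size() > 10) heap.poll();                   // evict the current smallest
        }
        List<double[]> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingDouble((double[] r) -> r[0]).reversed());
        return result;                                           // only 10 rows ever get sorted
    }

    public static void main(String[] args) {
        List<double[]> rows = new ArrayList<>();
        Random rnd = new Random(42);
        for (int i = 0; i < 1_000_000; i++) rows.add(new double[]{rnd.nextDouble(), i});
        top10(rows).forEach(r -> System.out.printf("x=%.6f y=%.0f%n", r[0], r[1]));
    }
}

For the grouped case the same idea works with one small heap per group key, which is precisely the execution plan that the window-function query above makes so hard for the optimizer to discover.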

Low development efficiency

SQL is not only slow to run; its development efficiency is not high either, especially for complex calculations, where SQL implementations become very cumbersome. For example, to find the longest streak of consecutive rising days for each stock from a table of stock records, the SQL (Oracle dialect) looks like this:

select code, max(ContinuousDays) - 1
from (
    select code, NoRisingDays, count(*) ContinuousDays
    from (
        select code,
            sum(RisingFlag) over (partition by code order by day) NoRisingDays
        from (
            select code, day,
                case when price>lag(price) over (partition by code order by day)
                    then 0 else 1 end RisingFlag
            from stock ) )
    group by code, NoRisingDays )
group by code

The implementation is extremely convoluted; never mind writing it, it takes half a day just to read it.

SQL also struggles with procedural computation. What is procedural computation? Computation that cannot be finished in one step but needs multiple passes, especially when the order of the data matters.

Let's take a few examples:

The proportion of users who logged in for more than one hour in total within a week, excluding misoperations where a single session lasted less than 10 seconds

The distribution of the longest streak of consecutive credit-card spending days over the last three months, taking into account the promotion that grants triple points after 10 consecutive days

Within a month, how many users completed the sequence of viewing a product, adding it to the cart, and buying it within 24 hours, and how many gave up at some intermediate step?

...

(These examples are simplified for ease of understanding; real requirements are considerably more complicated.)

Procedural operations like these are very hard to write in SQL and usually require UDFs to complete. And when the SQL itself cannot be written, SQL's value is greatly diminished; a procedural sketch of the first example follows.
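To show what stepwise means, here is a hedged Java sketch of the first example; the Session schema and the handling of edge cases are assumptions for illustration. Drop sub-10-second sessions, total the durations per user, then apply the one-hour threshold, one explicit step at a time:

import java.time.Duration;
import java.util.*;
import java.util.stream.*;

public class LoginRatio {
    record Session(String uid, Duration length) {}        // one login record, schema assumed

    // Proportion of users whose total weekly login time exceeds one hour,
    // ignoring sessions under 10 seconds (treated as misoperations).
    static double longLoginRatio(List<Session> week) {
        Map<String, Duration> perUser = week.stream()
                .filter(s -> s.length().getSeconds() >= 10)           // step 1: drop misoperations
                .collect(Collectors.toMap(Session::uid, Session::length,
                        Duration::plus));                             // step 2: total per user
        long overAnHour = perUser.values().stream()
                .filter(d -> d.compareTo(Duration.ofHours(1)) > 0)    // step 3: apply threshold
                .count();
        return perUser.isEmpty() ? 0 : (double) overAnHour / perUser.size();
    }

    public static void main(String[] args) {
        List<Session> week = List.of(
                new Session("u1", Duration.ofMinutes(40)),
                new Session("u1", Duration.ofMinutes(25)),            // u1 totals 65 minutes
                new Session("u2", Duration.ofSeconds(5)),             // misoperation, dropped
                new Session("u2", Duration.ofMinutes(30)));
        System.out.println(longLoginRatio(week));                     // prints 0.5
    }
}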

Low development efficiency leads to low performance

Complex SQL also tends to execute very inefficiently, which brings us back to performance. Development efficiency and computing performance are in fact closely related: many performance problems are essentially caused by low development efficiency.

Optimization of complex SQL works poorly: after a few layers of nesting the database engine gets lost and no longer knows how to optimize. Expecting the platform's automatic optimization to speed up such complex operations is unreliable; the fundamental means is to write high-performance algorithms. In procedural operations, for example, intermediate results often must be kept for reuse; SQL has to use temporary tables for this, and the extra IO hurts performance. Engine optimization cannot solve such problems; the calculation process itself must be rewritten.
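As a minimal illustration (the Order shape is assumed), the Java sketch below computes a filtered intermediate result once and reuses it in memory for two different aggregations; SQL would typically have to materialize it as a temporary table and pay the IO both writing and reading it:

import java.util.*;
import java.util.stream.*;

public class Reuse {
    record Order(String area, double amount) {}           // illustrative schema

    public static void main(String[] args) {
        List<Order> orders = List.of(
                new Order("East", 1500), new Order("West", 800),
                new Order("East", 2300), new Order("West", 1200));

        // Intermediate result: computed once, kept in memory.
        List<Order> big = orders.stream()
                .filter(o -> o.amount() > 1000).collect(Collectors.toList());

        // Reuse 1: total amount of the big orders.
        double total = big.stream().mapToDouble(Order::amount).sum();
        // Reuse 2: count of big orders per area, without re-reading the source.
        Map<String, Long> byArea = big.stream()
                .collect(Collectors.groupingBy(Order::area, Collectors.counting()));

        System.out.println(total + " " + byArea);
    }
}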

In essence, therefore, improving performance comes down to reducing development difficulty. Software cannot make hardware faster; all it can do is design algorithms of lower complexity, and if those algorithms can be implemented quickly and cheaply, the goal of higher performance is reached. If the syntax system makes high-performance algorithms hard or even impossible to describe, forcing programmers into higher-complexity algorithms, then performance is hard to improve. Optimizing the SQL engine does nothing to reduce SQL's development difficulty: the SQL syntax system is what it is, no amount of performance tuning changes the difficulty of development, and many high-performance algorithms still cannot be expressed, so computing performance cannot be substantially improved.

Writing UDFs can indeed improve performance in many scenarios, but it is difficult, and hand-written code cannot benefit from the SQL engine's optimization. Often the complete operation cannot even be written as a UDF: only the interfaces the platform provides can be used, and its data types must still live inside the SQL framework, which limits the high-performance algorithms that can be implemented.

The fundamental solution is for the big data platform to adopt a genuinely better syntax.

Solution

The open source esProc SPL makes a good replacement for, and extension of, SQL. As a computing language designed for big data platforms, it preserves SQL's strengths while remedying its weaknesses.

SPL is a professional open source data computing engine that provides its own independent syntax. The whole system does not depend on the relational data model, so it makes major breakthroughs in many respects, especially development efficiency and computing performance. Let's look at which features of SPL suit contemporary big data platforms.

Strong integration

First, integration: no matter how good SPL is, it would be useless if it could not work with the big data platform. Using SPL on a big data platform is in fact very convenient: introducing a single jar (also open source, so you can use it however you want) is enough. SPL provides a standard JDBC driver through which you can execute SPL script text directly or call SPL script files.

…
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn = DriverManager.getConnection("jdbc:esproc:local://");
// Execute an SPL script directly:
// Statement st = conn.createStatement();
// ResultSet rs = st.executeQuery("=100.new(~:baseNum,~*~:square2)");
// Call an SPL script file:
CallableStatement st = conn.prepareCall("{call SplScript(?, ?)}");
st.setObject(1, 3000);
st.setObject(2, 5000);
ResultSet result = st.executeQuery();
...
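A hedged continuation of the snippet, assuming the called script returns a two-column result set; reading it is ordinary JDBC:

// Iterate the result like any other JDBC result set
// (column names and count depend on what SplScript returns).
while (result.next()) {
    System.out.println(result.getObject(1) + "\t" + result.getObject(2));
}
result.close();
conn.close();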

Efficient development

Agile syntax

For structured data computing, SPL provides an independent syntax and a rich computing library, and supports stepwise procedural computation, making complex calculations simple to implement. The earlier example of the longest streak of rising days for a stock is implemented in SPL as follows:

A
1  =db.query("select * from stock order by day")
2  =A1.group@i(price<price[-1]).max(~.len())-1

Sort by trading day, use group@i to put each run of consecutively rising records into one group, then take the longest group length minus 1: that is the maximum number of consecutive rising days.

Another example: from the user login records, list the interval between each user's last two logins:

A
1  =ulogin.groups(uid;top(2,-logtime))    last two login records per user
2  =A1.new(uid,#2(1).logtime-#2(2).logtime:interval)    compute the interval

SPL syntax naturally supports this kind of stepwise writing, making procedural computation convenient.

SPL also provides a rich computing library that simplifies operations further.

Intuitive and easy-to-use development environment

SPL also comes with a simple, easy-to-use development environment, with single-step execution, breakpoints, and a WYSIWYG result preview window, which raises development efficiency further.

Multiple data source support

SPL also supports diverse data sources, which can be used directly. Compared with big data platforms that require data to be loaded into the platform before it can be computed, SPL's system is more open.

Some data sources supported by SPL (still expanding...)

Beyond that, SPL supports mixed computation over multiple data sources, bringing each source's strengths into play and making the big data platform more open. Working directly against multiple sources is also simpler to develop, improving development efficiency further.

Hot swapping

SPL is interpreted, so it naturally supports hot swapping, a major benefit for big data platforms built on Java. SPL-based computing logic can be written, modified, and maintained without restarting anything and takes effect immediately, making development and operations more convenient.

High computing performance

As noted earlier, high performance and high development efficiency are essentially the same problem, and SPL's concise syntax makes it easier to write high-performance algorithms. SPL also provides many high-performance storage formats and algorithmic mechanisms, so algorithms and storage schemes that are hard to implement in SQL become easy in SPL. The keys to software performance are precisely algorithms and storage.

Take the TopN operation discussed earlier. SPL treats TopN as a form of aggregation, which turns a high-complexity full sort into a low-complexity aggregation and widens the range of applicable cases as well:

A
1  =file("data.ctx").create().cursor()
2  =A1.groups(;top(10,amount))    top 10 orders overall
3  =A1.groups(area;top(10,amount))    top 10 orders in each area

No sort keyword appears in these statements, and no full sort takes place. The syntax for TopN over the whole set and TopN within each group is essentially the same, and both run fast.

Here are some high-performance computing cases implemented with SPL:

Open source SPL speeds up the query of insurance company group insurance details by 2000+ times

Open source SPL improves bank self-service analysis from 5 concurrency to 100 concurrency

Open source SPL accelerates bank user profile customer group intersection calculation by 200+ times

Open source SPL optimizes bank precomputed fixed queries into real-time flexible queries

Open source SPL turns pre-correlation of bank mobile account inquiries into real-time correlation

Open source SPL speeds up bank capital position reporting by 20+ times

Open-source SPL speeds up bank loan agreements by 10+ times

Open source SPL optimizes insurance company runs from 2 hours to 17 minutes

Open source SPL speeds up bank POS transaction reports by 30+ times

Open source SPL speeds up bank loan approval tasks by 150+ times

Open source SPL speeds up balance sheet reports by 60 times

A few more words: SPL is not based on the relational data model but adopts an innovative theoretical system that breaks new ground at the theoretical level. Space does not permit saying much more here; the article introducing SPL as a database language that is simple to write and fast to run gives more detail, and interested readers can search for it and download SPL themselves.

