Performance comparison of open source analytics database ClickHouse and open source esProc SPL

ClickHouse vs Oracle

The open-source analytics database ClickHouse is known for being fast. Is that true? Let's verify it with comparison tests.

First, we compare ClickHouse (CH) with the Oracle database (ORA) in the same software and hardware environment. The benchmark is the internationally recognized TPC-H, which defines 22 SQL statements (Q1 to Q22) over 8 tables. The test runs on a single machine with 12 threads, and the total data size is 100 GB. The TPC-H SQL statements are fairly long, so they are not listed here.

Q1 is a simple single-table traversal with grouping and aggregation. The comparison results are as follows:

[Figure: Q1 test results, CH vs. ORA]

CH computes Q1 faster than ORA, which shows that CH's columnar storage is well implemented and that its single-table traversal is fast. ORA's main disadvantage is its row storage, which makes it clearly slower here.

But how does CH perform when the computation becomes more complex? Consider Q2, Q3, and Q7 of TPC-H. The test results are as follows:

[Figure: Q2, Q3, and Q7 test results, CH vs. ORA]

As the computation becomes more complex, CH's performance drops significantly. Q2 involves little data, so columnar storage makes little difference and CH performs about the same as ORA. Q3 involves a large amount of data, and CH, benefiting from columnar storage, beats ORA. Q7 also involves a large amount of data, but the calculation is complex and CH falls behind ORA.

Whether complex calculations run fast depends mainly on how good the optimization engine is. Columnar storage gives CH a huge storage advantage, yet it is still overtaken by ORA with row storage, which shows that CH's algorithm optimization is far weaker than ORA's.

Q8 of TPC-H is more complex still, with a multi-table join inside a subquery. CH ran for more than 2000 seconds without producing a result and had probably hung, while ORA finished in 192 seconds. Q9 adds a LIKE condition to the subquery of Q8; CH reported an out-of-memory error outright, while ORA finished in 234 seconds. CH also failed to run some other complex queries, so an overall comparison is not possible.

Both CH and ORA are SQL-based, yet CH cannot run statements that ORA can optimize, which further indicates that CH's optimization engine is relatively weak.

CH is rumored to be good only at single-table traversal and to be unable to outrun even MySQL once joins are involved. That does not seem far from the truth. Anyone planning to use CH should weigh how widely this kind of scenario actually applies.

esProc SPL debuts

The open-source esProc SPL also advertises high performance, so let's bring it into the comparison.

Again we run TPC-H:

[Figure: TPC-H test results, CH vs. ORA vs. SPL]

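For the more complex operations Q2, Q3, and Q7, SPL runs faster than both CH and ORA. For Q8 and Q9, where CH produced no result, SPL took 37 seconds and 68 seconds respectively, also faster than ORA. The reason is that SPL can use better algorithms, whose computational complexity is lower than that of the SQL optimized by ORA and far lower than that of the SQL executed by CH. Combined with columnar storage, SPL, which is developed in Java, ends up beating CH and ORA, which are implemented in C/C++.

We can roughly conclude that esProc SPL performs very well on both simple and complex calculations.

However, on a simple operation like Q1, CH still has a slight edge over SPL. This seems to further support the earlier conclusion that CH is particularly good at simple traversal.

But wait, SPL has a secret weapon.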

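The enterprise edition of SPL provides a columnar cursor mechanism, so let's run another comparison: on 800 million rows, perform the simplest grouping and aggregation and compare SPL (using the columnar cursor) with CH. (The machine used here has a slightly lower configuration than the one used for the TPC-H tests, so the measurements differ, but what matters here is the relative values.)

The CH SQL statement for the simple grouping and aggregation is: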

SQL1:

SELECT mod(id, 100) AS Aid, max(amount) AS Amax
FROM test.t
GROUP BY mod(id, 100)

The result of this test is shown below:

[Figure: simple grouping and aggregation (SQL1) test results, CH vs. SPL]

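With the columnar cursor, SPL's performance on simple traversal with grouping matches that of CH. If the columnar cursor were also used in the TPC-H Q1 test, SPL would reach the same performance as CH.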

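During the test we found that the 800 million rows occupy 15 GB of disk as text, 5.4 GB in CH, and 8 GB in SPL. Both CH and SPL therefore use compressed storage, with CH achieving the higher compression ratio, which further confirms that CH's storage engine is indeed well made. Still, SPL can match CH's performance, which shows that SPL does well in both storage engine and algorithm optimization, and that its high-performance computing capability is more balanced.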
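
The compression ratios implied by these figures can be worked out directly. The snippet below is only arithmetic on the three sizes quoted above; the variable names are made up for illustration.

```python
# Sizes quoted in the text for the 800 million rows, in GB.
text_gb = 15.0  # plain text file
ch_gb = 5.4     # stored in ClickHouse
spl_gb = 8.0    # stored in SPL

# Compression ratio = uncompressed text size / stored size.
ch_ratio = text_gb / ch_gb
spl_ratio = text_gb / spl_gb

print(f"CH compression ratio:  {ch_ratio:.2f}")
print(f"SPL compression ratio: {spl_ratio:.2f}")
```

So CH stores the data in a bit more than a third of the text size, and SPL in a bit more than half.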

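The current version of SPL is written in Java, and after reading data Java is slow to generate the objects used for computation, a problem that CH, developed in C++, does not have. For complex operations, reading accounts for a small share of the total time, so the drag from slow object creation is not yet obvious. For simple traversal, reading accounts for a large share, which is why SPL was slower than CH in the earlier test. The columnar cursor optimizes the reading scheme: it no longer generates small objects one by one, so the number of object creations drops sharply and the gap closes. Looking at storage alone, neither SPL nor CH has a clear advantage over the other.

Next, a comparison test on a regular TopN. The CH SQL is: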

SQL2:

SELECT * FROM test.t ORDER BY amount DESC LIMIT 100

The comparison results are as follows:

[Figure: regular TopN (SQL2) test results, CH vs. SPL]

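Taken at face value, SQL2 computes TopN the conventional way: sort all the data, then take the first N rows. With a large data volume, a real full sort performs very badly. The SQL2 results suggest that CH, like SPL, has optimized this and does not do a full sort, so both are fast, with SPL slightly faster.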
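
To make the difference concrete, here is a minimal Python sketch of a TopN that avoids the full sort by keeping only N candidates in a small heap. It is the textbook technique, not the actual implementation of CH or SPL (the article does not describe either), and the function names and sample data are assumptions for illustration.

```python
import heapq
import random


def top_n_by_heap(amounts, n):
    """Return the n largest values in descending order.

    A min-heap of at most n candidates is kept. Each value is compared
    with the smallest candidate, so the data is traversed once and is
    never fully sorted.
    """
    heap = []
    for value in amounts:
        if len(heap) < n:
            heapq.heappush(heap, value)
        elif value > heap[0]:
            # Replace the smallest candidate with the larger value.
            heapq.heapreplace(heap, value)
    return sorted(heap, reverse=True)


def top_n_by_full_sort(amounts, n):
    """Baseline: sort everything, then keep the first n."""
    return sorted(amounts, reverse=True)[:n]


# Hypothetical sample data standing in for the amount column.
random.seed(0)
amounts = [random.randint(1, 1_000_000) for _ in range(100_000)]

assert top_n_by_heap(amounts, 100) == top_n_by_full_sort(amounts, 100)
```

For M rows, the heap version does on the order of M·log N work with N values in memory, while the full sort does on the order of M·log M work and needs all M values at once.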

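In other words, esProc SPL comes out ahead on both simple and complex operations.

Further gaps

And the gap does not stop there.

As mentioned earlier, CH and ORA both use SQL and are both based on the relational model, so both face the problem of SQL optimization. The TPC-H tests show that CH cannot optimize some scenarios that ORA can, and sometimes cannot produce a result at all. So for calculations that even ORA cannot optimize, CH will fare no better. For example, suppose we change the simple grouping and aggregation of SQL1 into two grouped aggregations whose results are then joined. The CH SQL looks roughly like this: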

SQL3:

SELECT *
FROM (
SELECT mod(id, 100) AS Aid, max(amount) AS Amax
FROM test.t
GROUP BY mod(id, 100)
) A
JOIN (
SELECT floor(id / 200000) AS Bid, min(amount) AS Bmin
FROM test.t
GROUP BY floor(id / 200000)
) B
ON A.Aid = B.Bid

In this case, the comparison shows CH's computation time doubling while SPL's stays unchanged:

[Figure: SQL3 test results, CH vs. SPL]

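This is because SPL uses not only the columnar cursor but also a traversal reuse mechanism, which computes multiple grouping results in a single traversal and so greatly reduces disk access. SQL, which CH uses, cannot express such an operation, so everything depends on CH's own optimization ability. Since CH's algorithm optimization is weak, its optimization engine had no effect in this test and the data had to be traversed twice, so the run time doubled.

The SPL code for traversal reuse is simple, roughly as follows: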

    A                                                B
1   =file("topn.ctx").open().cursor@mv(id,amount)
2   cursor A1                                        =A2.groups(id%100:Aid;max(amount):Amax)
3   cursor                                           =A3.groups(id\200000:Bid;min(amount):Bmin)
4   =A2.join@i(Aid,A3:Bid,Bid,Bmin)
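
The idea of traversal reuse can be sketched in Python: both aggregations of SQL3 are updated while the rows stream past once, and the two small result sets are joined afterwards. This only illustrates the idea and is not SPL's implementation; the function names and the generated data are assumptions, and ids are assumed to be non-negative so that integer division matches floor(id / 200000).

```python
def two_groupings_one_pass(rows):
    """Compute both grouped aggregations of SQL3 in a single traversal.

    rows: iterable of (id, amount) pairs, consumed exactly once.
    Returns a list of (key, Amax, Bmin) for keys where Aid == Bid.
    """
    a_max = {}  # Aid = id % 100      -> max(amount)
    b_min = {}  # Bid = id // 200000  -> min(amount)
    for row_id, amount in rows:
        aid = row_id % 100
        bid = row_id // 200000
        if aid not in a_max or amount > a_max[aid]:
            a_max[aid] = amount
        if bid not in b_min or amount < b_min[bid]:
            b_min[bid] = amount
    # Inner join of the two small result sets on Aid == Bid.
    return [(k, a_max[k], b_min[k]) for k in sorted(a_max) if k in b_min]


def make_rows(n):
    """Hypothetical data source that yields (id, amount) one row at a time."""
    for row_id in range(1, n + 1):
        yield row_id, (row_id * 7919) % 1000


# A generator can only be consumed once, so this really is one traversal.
result = two_groupings_one_pass(make_rows(400_000))
for key, amax, bmin in result:
    print(key, amax, bmin)
```

When the table is far larger than memory, reading it once instead of twice is what keeps the run time flat as more groupings are added.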

Next, change the regular TopN of SQL2 into a TopN within each group. The corresponding SQL is:

SQL4:

SELECT
   gid,
   groupArray(100)(amount) AS amount
FROM
(
   SELECT
      mod(id, 10) AS gid,
      amount
   FROM test.topn
   ORDER BY
      gid ASC,
      amount DESC
) AS a
GROUP BY gid

The comparison results for the grouped TopN test are as follows:

[Figure: grouped TopN (SQL4) test results, CH vs. SPL]

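CH's grouped TopN is 42 times slower than its regular TopN, which suggests that CH most likely performed a sort in this case. In other words, once the situation gets more complicated, CH's optimization engine again fails to help. Unlike SQL, SPL treats TopN as an aggregation whose computing logic is the same as that of sum and count: the source data only needs to be traversed once. Grouped TopN thus becomes just like grouped sum or count, and sorting is avoided. As a result, SPL computes the grouped TopN 22 times faster than CH.

And the SPL code for grouped TopN is not complicated either: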

A
1 =file("topn.ctx").open().cursor@mv(id,amount)
2 =A1.groups(id%10:gid;top(10;-amount)).news(#2;gid,~.amount)
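
Treating TopN as an aggregation can be sketched in Python as follows: each group keeps its own small heap, which is updated row by row exactly as a running sum or count would be. This illustrates the idea rather than SPL's internals; the function name, the group key and the sample data are assumptions.

```python
import heapq
import random
from collections import defaultdict


def grouped_top_n(rows, n, group_key):
    """Top n amounts per group, computed in one traversal without sorting rows.

    rows: iterable of (id, amount); group_key maps an id to its group.
    Each group holds a min-heap of at most n candidates, updated per row
    in the same way a per-group sum or count would be updated.
    """
    heaps = defaultdict(list)
    for row_id, amount in rows:
        heap = heaps[group_key(row_id)]
        if len(heap) < n:
            heapq.heappush(heap, amount)
        elif amount > heap[0]:
            heapq.heapreplace(heap, amount)
    # Only the small per-group results are sorted, never the source rows.
    return {gid: sorted(heap, reverse=True) for gid, heap in heaps.items()}


# Hypothetical sample data: group by id % 10, top 10 amounts per group.
random.seed(1)
rows = [(i, random.randint(1, 10**6)) for i in range(50_000)]
result = grouped_top_n(rows, n=10, group_key=lambda row_id: row_id % 10)
for gid in sorted(result):
    print(gid, result[gid][:3])
```

Memory use is bounded by the number of groups times n, regardless of how many rows are traversed.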

More than just speed

Now let's look at funnel analysis, a common calculation in e-commerce systems. The SPL code remains concise:

    A                                B
1   =["etype1","etype2","etype3"]    =file("event.ctx").open()
2   =B1.cursor(id,etime,etype;etime>=date("2021-01-10") && etime<date("2021-01-25") && A1.contain(etype) && …)
3   =A2.group(id).(~.sort(etime))    =A3.new(~.select@1(etype==A1(1)):first,~:all).select(first)
4   =B3.(A1.(t=if(#==1,t1=first.etime,if(t,all.select@1(etype==A1.~ && etime>t && etime<t1+7).etime, null))))
5   =A4.groups(;count(~(1)):STEP1,count(~(2)):STEP2,count(~(3)):STEP3)

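CH's SQL cannot implement this kind of calculation, so let's take ORA as the example and look at how a three-step funnel is written in SQL: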

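with e1 as (
    select gid,1 as step1,min(etime) as t1
    from T
    where etime>= to_date('2021-01-10', 'yyyy-MM-dd') and etime<to_date('2021-01-25', 'yyyy-MM-dd')
       and eventtype='eventtype1' and …
    group by 1
),
e2 as (
    select gid,1 as step2,min(e1.t1) as t1,min(e2.etime) as t2
    from T as e2
    inner join e1 on e2.gid = e1.gid
    where e2.etime>= to_date('2021-01-10', 'yyyy-MM-dd') and e2.etime<to_date('2021-01-25', 'yyyy-MM-dd') and e2.etime > t1
       and e2.etime < t1 + 7
       and eventtype='eventtype2' and …
    group by 1
),
e3 as (
    select gid,1 as step3,min(e2.t1) as t1,min(e3.etime) as t3
    from T as e3
    inner join e2 on e3.gid = e2.gid
    where e3.etime>= to_date('2021-01-10', 'yyyy-MM-dd') and e3.etime<to_date('2021-01-25', 'yyyy-MM-dd') and e3.etime > t2
       and e3.etime < t1 + 7
       and eventtype='eventtype3' and …
    group by 1
)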
select
    sum(step1) as step1,
    sum(step2) as step2,
    sum(step3) as step3
from
    e1
    left join e2 on e1.gid = e2.gid
    left join e3 on e2.gid = e3.gid

The ORA SQL takes more than 30 lines and is quite hard to understand. The code also depends on the number of funnel steps: every additional step needs another subquery. The SPL code is much simpler by comparison, and the same code handles any number of steps.
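
To show why a procedural formulation does not grow with the number of steps, here is a minimal Python sketch of the same funnel logic: group events by user, order them by time, then walk the list of steps. It follows the structure of the SPL code above but is not a translation of it. The function name, the 7-day default window parameter and the sample events are assumptions, and the date-range and other filters are assumed to have been applied already.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def funnel_counts(events, steps, window=timedelta(days=7)):
    """Count how many users reach each funnel step.

    events: iterable of (user_id, etime, etype), already filtered by date.
    steps:  ordered list of event types; any length works.
    A step counts only if it occurs after the previous step and within
    `window` of the user's first step-1 event.
    """
    by_user = defaultdict(list)
    for user_id, etime, etype in events:
        if etype in steps:
            by_user[user_id].append((etime, etype))

    counts = [0] * len(steps)
    for user_events in by_user.values():
        user_events.sort()  # order each user's events by time
        t1 = None           # time of the first step-1 event
        t = None            # time of the most recent step reached
        for i, step in enumerate(steps):
            if i == 0:
                t = next((et for et, ty in user_events if ty == step), None)
                t1 = t
            else:
                t = next((et for et, ty in user_events
                          if ty == step and et > t and et < t1 + window), None)
            if t is None:
                break  # later steps cannot be reached
            counts[i] += 1
    return counts


# Hypothetical sample events for four users.
d = lambda day: datetime(2021, 1, day)
events = [
    (1, d(10), "etype1"), (1, d(11), "etype2"), (1, d(12), "etype3"),
    (2, d(10), "etype1"), (2, d(20), "etype2"),   # step 2 outside the window
    (3, d(10), "etype2"), (3, d(11), "etype1"), (3, d(12), "etype3"),
    (4, d(10), "etype2"),                         # never starts the funnel
]
print(funnel_counts(events, ["etype1", "etype2", "etype3"]))
```

Adding a fourth step only means passing a longer `steps` list; the code itself does not change.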

SQL this complex is laborious to write, and optimizing its performance is out of the question.

CH's SQL is far weaker than ORA's and basically cannot express logic this complex, so the computation has to be written externally in C++. In that case only CH's storage engine is being used. External computation in C++ may well perform well, but the development cost is very high. There are many similar examples that CH cannot implement directly.

To sum up: CH is really fast in some simple scenarios (such as single-table traversal), with about the same performance as SPL. But high-performance computing cannot be judged on simple cases alone; a range of scenarios has to be weighed. For complex operations, SPL not only far outperforms CH but is also much simpler to code. SPL covers the full range of high-performance data computing scenarios, which can fairly be called a complete victory over CH.

Origin juejin.im/post/7145741592031133732