[Big Data Warehouse] Kudu Performance Test Report Analysis

This article is published by NetEase Cloud.

 

The main purpose of this post is not to walk through Kudu's performance numbers, but to analyze why Kudu's scan performance turned out to be so bad. When Kudu was first promoted, all sorts of impressive technology was advertised: columnar storage, Bloom filters, compression, in-place updates, B+ trees, MVCC...

 

Here is a comparison chart of a subset of the TPC-DS test results for Kudu and Parquet:

A side-by-side comparison makes the difference obvious. The y-axis is elapsed time in seconds. The yellow bars, which represent Kudu, are far too tall; in other words, Kudu takes far longer and its performance is far worse.

 

Boss: Why is Kudu's performance so poor? Me: I don't know...

 

I really didn't know the reason at the time. I was busy running tests and collecting metrics and had no time to analyze, let alone dig into two large, unfamiliar systems: Impala and Kudu. It was quite embarrassing :(

 

After all the TPC-DS test cases had finished there was a lull, so I spent a few days tracking down the cause: reading materials, flipping through documentation, and googling around. I won't describe the whole process here, only the findings.

 

We know that Impala ships with an interactive management tool, impala-shell, which has a profile command. After each SQL statement is executed, you can retrieve its execution plan and the timing statistics for every node. Since both the Kudu and the Parquet tests used Impala as the compute engine, could we get some clues from these profiles?
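For example, here is a minimal sketch of what that looks like in impala-shell (the host name and the query are placeholders, not taken from the actual test):

    $ impala-shell -i impalad-host:21000
    [impalad-host:21000] > select count(*) from store_sales;   -- run any query
    [impalad-host:21000] > profile;   -- print the runtime profile of the query just executed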

 

So I took query7 and query40 from the comparison above, executed each against Kudu and against Parquet, and collected their profiles, four files in total, for analysis. You may not believe it, but a single profile is huge, close to 10,000 lines per file; would you still have the confidence to analyze them by hand? (The query40 profiles can be found in the attachment below.) I was stunned at first. Then, almost by accident, I opened Beyond Compare, which I had often used to diff code, and compared the two query40 profiles (Kudu vs. Parquet). Scrolling down a bit to the execution plan section, I really did find treasure!
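For reference, collecting and diffing the profiles does not have to be done by hand. The sketch below is one way to script it; it assumes the queries are saved as query7.sql and query40.sql and that the Kudu and Parquet tables live in databases named tpcds_kudu and tpcds_parquet (the -p / --show_profiles option makes impala-shell print the runtime profile after each query):

    # run each query against both storage formats and save the output
    for q in query7 query40; do
      impala-shell -i impalad-host:21000 -d tpcds_kudu    -f ${q}.sql -p > ${q}_kudu.profile
      impala-shell -i impalad-host:21000 -d tpcds_parquet -f ${q}.sql -p > ${q}_parquet.profile
    done

    # then compare the two query40 profiles with any diff tool
    diff query40_kudu.profile query40_parquet.profile | less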

 

 

In the plan section, the Parquet query has runtime filters while the Kudu query has none; scrolling further down, the same difference shows up at the corresponding disk scan nodes.
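To show what that difference looks like, here is a schematic Impala plan fragment rather than the actual profile from this test (table and column names follow the TPC-DS schema; sizes and row counts are omitted). On the Parquet side, the hash join builds a runtime filter and the scan node applies it; on the Kudu side, at the time of the test, the scan node carried no such filter:

    -- Parquet: the join produces runtime filter RF000, the scan consumes it
    02:HASH JOIN [INNER JOIN, BROADCAST]
    |  hash predicates: ss_sold_date_sk = d_date_sk
    |  runtime filters: RF000 <- d_date_sk
    |
    00:SCAN HDFS [tpcds_parquet.store_sales]
       runtime filters: RF000 -> ss_sold_date_sk

    -- Kudu: the same join, but the scan node has no "runtime filters" line,
    -- so every row is shipped back to Impala and filtered there
    00:SCAN KUDU [tpcds_kudu.store_sales]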

 

 

The result sets returned by the two disk scans are also different! A runtime filter lets Impala push the join keys from the build side of a hash join down to the probe-side scan, so rows that cannot match are skipped at scan time; without it, the scan has to return every row to Impala for filtering. No wonder that during the comparison tests, queries on the Kudu cluster generated heavy disk I/O and network transfer, while the load on the Parquet side stayed relatively low. See the point now?

 

Why doesn't Kudu have a runtime filter? I searched Kudu's JIRA project and found nothing. Then I tried Impala's JIRA project and did find something: Matthew Jacobs, an Impala/Kudu development engineer at Cloudera, had filed two tickets on exactly this, IMPALA-3741 and IMPALA-4252.

 


Seeing this, the problem was basically clear and the answer was there, but I was still not satisfied, so without thinking twice I registered an account and raised a bug ticket in their JIRA, IMPALA-4719, to double-check whether it is supported (the normal approach would have been to ask on the user mailing list first, so consider it me helping them test the permission settings of their JIRA =_=).

 

Later, I re-read Kudu's official documentation. There were actually some clues between the lines, but they didn't attract enough attention at the time:

 

That's all for this article. I hope you can take away a little something from it, thank you!

 

 

Learn about NetEase Cloud:
NetEase Cloud Official Website: https://www.163yun.com/
New User Gift Package: https://www.163yun.com/gift
NetEase Cloud Community: https://sq.163yun.com/

 
