SQL on Hadoop: Technical Analysis (Part 1)

 
2016-07-12 · Wang Sen · Hadoop Technology Learning

Background

The birth of Hadoop marked an epoch-making data revolution, but the entrenched legacy of the relational database era has placed many obstacles in the way of Hadoop truly occupying the database field. Support for SQL (and especially PL/SQL) has long been an urgent problem for Hadoop big data platforms seeking to replace legacy data systems. SQL support has always been one of the demands enterprise users care about most, and it is an important criterion when they choose a Hadoop platform.

 

Since the advent of Hive, SQL on Hadoop systems have blossomed, becoming ever faster and more feature-complete. The more mainstream ones today include Impala, Spark SQL, HAWQ, Tez, Drill, Presto, and Tajo. In this article we survey these systems from a technical perspective, as a reference for subsequent technology selection.

 

System Architecture: Runtime Framework vs. MPP

Among SQL on Hadoop systems there are two mainstream architectures. One builds a query engine on top of a runtime framework, with Hive as the typical case; the other imitates the MPP database architecture, as Impala and HAWQ do. The former takes an existing runtime framework and layers SQL on top of it; the latter is an integrated query engine. One sometimes hears the claim that the latter architecture is better than the former, at least in terms of performance. Is that really so?

 

Generally speaking, an important evaluation criterion for SQL on Hadoop systems is speed, meaning low query latency. After Hive became popular, so-called interactive query requirements gradually emerged, because neither BI systems nor ad-hoc queries can be served in an offline batch manner. Many vendors tried to solve this problem, hence Impala, HAWQ, and the like. Meanwhile, through continued development, Hive itself can now run on DAG frameworks, not only Tez but also Spark. From the perspective of task execution, MPP-class engines actually run in a way similar to the DAG model. The main features are as follows:

DAG vs. MR: the main advantage is that intermediate results are not written to disk (unless memory is insufficient).

Pipelined computation: the result of an upstream stage is immediately pushed or pulled to the next stage for processing. For example, in a multi-table join, as soon as rows are produced by joining the first two tables, they are sent directly to the join with the third table; MR, by contrast, must wait for the first two tables to be completely joined before the join with the third table can begin.

Efficient IO: local scans incur no extra overhead, and disk bandwidth is fully utilized.

Thread-level concurrency: by contrast, each task in MR must start its own JVM, which incurs high startup latency and consumes substantial resources.
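The difference between MR-style stage materialization and DAG-style pipelining described above can be sketched with a toy example (this is an illustration only, not any real engine's code) using Python generators, where the pipelined version never builds an intermediate result:

```python
# Toy contrast: MapReduce-style stage materialization vs. DAG-style pipelining.

def scan(rows):
    for row in rows:
        yield row

def filter_stage(rows, predicate):
    for row in rows:
        if predicate(row):
            yield row

def project(rows, columns):
    for row in rows:
        yield {c: row[c] for c in columns}

def run_materialized(rows):
    # MR style: every stage fully materializes its output
    # (in a real MR job, written to disk between stages).
    stage1 = list(scan(rows))
    stage2 = list(filter_stage(stage1, lambda r: r["amount"] > 10))
    return list(project(stage2, ["id"]))

def run_pipelined(rows):
    # DAG/pipeline style: each row flows through all stages one at a
    # time; no intermediate result is ever materialized.
    return list(project(
        filter_stage(scan(rows), lambda r: r["amount"] > 10), ["id"]))

data = [{"id": 1, "amount": 5},
        {"id": 2, "amount": 20},
        {"id": 3, "amount": 30}]

assert run_materialized(data) == run_pipelined(data) == [{"id": 2}, {"id": 3}]
```

Both versions compute the same answer; the pipelined form simply avoids holding (or persisting) each stage's full output, which is the core of the DAG advantage listed above.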

 

Of course, the MPP model also has its disadvantages. One is limited scalability, a conclusion already drawn in the relational database era; the other is poor fault tolerance: in Impala, if anything goes wrong while a query is running, the entire query fails.

 

Storage Format

 

The mainstream storage formats are ORC, Parquet, and CarbonData, a format recently developed by Huawei's big data team. Prototype test data shows CarbonData outperforming Parquet, mainly because the CarbonData format builds in many indexes: during a query scan, these indexes quickly filter out irrelevant data and reduce the amount of data scanned. Tencent's Hermes is a similar system. In addition, Impala, which currently stores data in Parquet, now has a new option: the Kudu + Impala solution, which Cloudera claims delivers very fast query analysis while also supporting operations such as data updates.
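The index-based filtering described above can be sketched with a minimal min/max (zone-map) example, the kind of block-level metadata that columnar formats like ORC, Parquet, and CarbonData keep so a scan can skip whole blocks without reading them (a simplified illustration, not the actual on-disk layout of any of these formats):

```python
# Toy zone-map pruning: each block records the min/max of its values, so a
# range predicate can discard blocks from metadata alone, before any read.

blocks = [
    {"min": 0,   "max": 99,  "values": list(range(0, 100))},
    {"min": 100, "max": 199, "values": list(range(100, 200))},
    {"min": 200, "max": 299, "values": list(range(200, 300))},
]

def scan_with_index(blocks, lo, hi):
    """Return values in [lo, hi], skipping blocks whose [min, max] range
    cannot contain a match; pruned blocks are never read at all."""
    out, blocks_read = [], 0
    for b in blocks:
        if b["max"] < lo or b["min"] > hi:
            continue                      # pruned by metadata alone
        blocks_read += 1
        out.extend(v for v in b["values"] if lo <= v <= hi)
    return out, blocks_read

result, blocks_read = scan_with_index(blocks, 150, 160)
# Only the middle block is actually scanned; the other two are pruned.
assert result == list(range(150, 161))
assert blocks_read == 1
```

This is exactly the mechanism behind "reducing the amount of data scanned": the predicate is evaluated against cheap per-block statistics first, and only surviving blocks are decoded.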

 

Resource Control

 

 

In SQL on Hadoop solutions, another aspect customers focus on is resource control, which in the Hadoop ecosystem means integration with YARN. For example, the current Impala version does not support managing distributed resources through YARN, but judging from Impala's roadmap, YARN integration is already an important goal. Spark SQL, HAWQ, and others can already be integrated with YARN.
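As a concrete illustration of what YARN integration looks like in practice, a Spark SQL session can be submitted under YARN so that executor count, memory, and the scheduler queue are all governed by the cluster's resource manager. This is a hedged sketch: the queue name `analytics`, the sizing numbers, and the table `web_logs` are hypothetical, not taken from the article.

```shell
# Run a Spark SQL query under YARN resource control (illustrative values).
spark-sql \
  --master yarn \
  --queue analytics \
  --num-executors 4 \
  --executor-memory 4g \
  --executor-cores 2 \
  -e "SELECT COUNT(*) FROM web_logs"
```

Here YARN's scheduler decides when and where the four executors launch, which is precisely the kind of cluster-wide resource governance the paragraph above describes.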

 

 

Summary

 

 

 

SQL on Hadoop technology is developing ever faster, and competition among vendors grows increasingly fierce. Which technology offers better performance and lower query latency still has to be analyzed and chosen according to the business scenario.

Every technology has the scenarios it suits best. Combined with the technical analysis above, the key to improving query performance is reducing the amount of data scanned.

 

 


 

 

 

 

 
 
