Introduction to Impala for Big Data

1. Introduction

Cloudera has released Impala, an open source project for real-time queries. Measurements across several products show that it answers queries 3 to 90 times faster than the original Hive SQL running on MapReduce. Impala is modeled on Google Dremel but offers more SQL functionality than Dremel does. It uses the same metadata, SQL syntax, ODBC driver, and user interface (Hue Beeswax) as Hive, so CDH users get a single platform for both batch and real-time queries. The file formats supported at present are text files and SequenceFiles, which can be compressed with Snappy, GZIP, or BZIP; Snappy performs best. Other formats such as Avro, RCFile, LZO text, and Doug Cutting's Trevni will be supported in the official release.



 

2. Overview  

1. Impala is an open source implementation based on Dremel, one of Google's three newer papers, and is similar to Shark and Drill. It is developed and open sourced by Cloudera. It builds on Hive, computes in memory, and is designed with data warehouse workloads in mind, which gives it the advantages of real-time response, batch processing, and high concurrency. For CDH users it is the preferred real-time query and analysis engine for petabyte-scale data.

2. Impala is a real-time big data query and analysis engine built on Hive. It uses Hive's metadata store directly, which means that Impala's metadata is kept in the Hive metastore. Impala is also compatible with Hive's SQL dialect: it implements a subset of Hive's SQL semantics, and its functionality is still being extended.
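
Because both engines read the same metastore, a table defined in Hive can be queried from Impala without being redefined. The sketch below is only an illustration under several assumptions that are not in the original article: the third-party impyla package is installed, an Impala daemon is reachable on port 21050, the Impala release in use supports `INVALIDATE METADATA`, and a hypothetical table such as `default.web_logs` was created through Hive.

```python
def statements_for_hive_table(table):
    """Return the statements to run in Impala for a table created in Hive."""
    # The table name is interpolated directly, so accept only plain identifiers.
    parts = table.split(".")
    if not all(part.isidentifier() for part in parts):
        raise ValueError(f"not a plain table name: {table!r}")
    return [
        f"INVALIDATE METADATA {table}",   # reload the definition from the Hive metastore
        f"SELECT COUNT(*) FROM {table}",  # then query it like any other Impala table
    ]


def count_rows(host, table, port=21050):
    """Run the statements against an Impala daemon (assumes impyla is installed)."""
    from impala.dbapi import connect  # imported lazily: third-party dependency

    conn = connect(host=host, port=port)
    try:
        cur = conn.cursor()
        for sql in statements_for_hive_table(table):
            cur.execute(sql)
        return cur.fetchall()[0][0]
    finally:
        conn.close()
```

Calling `count_rows("impala-host.example", "default.web_logs")` would return the row count; the host name here is a placeholder.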

 

3. Features  

1. Computes in memory, enabling interactive real-time query and analysis of petabyte-scale data.

2. Does not use MapReduce; it is written in C++ with hardware-specific optimizations such as the use of SSE instructions.

3. Compatible with Hive SQL, so migration is seamless.

4. Uses LLVM to compile the code for a query at run time, avoiding the overhead of generic, general-purpose code paths.

5. Supports the SQL-92 standard and has its own parser and optimizer.

6. Has the characteristics of a data warehouse and can analyze the data already stored in Hive directly.

7. Uses an I/O scheduling mechanism that supports data locality.

8. Supports columnar storage.

9. Supports remote access through JDBC/ODBC.
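
Feature 8 is easiest to see with a toy comparison. The following sketch is plain Python, not Impala code, and the table and its column names are invented for illustration. It stores the same three records row by row and column by column, then counts how many stored values have to be touched to sum a single column.

```python
rows = [
    {"user_id": 1, "country": "US", "bytes": 120},
    {"user_id": 2, "country": "DE", "bytes": 300},
    {"user_id": 3, "country": "US", "bytes": 80},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_from_rows(records, name):
    """Row layout: every field of every record is read to reach one column."""
    touched = sum(len(record) for record in records)
    return sum(record[name] for record in records), touched


def sum_from_columns(columns, name):
    """Column layout: only the requested column is read."""
    values = columns[name]
    return sum(values), len(values)


columns = to_columns(rows)
print(sum_from_rows(rows, "bytes"))        # (500, 9)
print(sum_from_columns(columns, "bytes"))  # (500, 3)
```

Both layouts give the same total, but the column layout reads a third of the values here; for wide tables where a query touches only a few columns the gap grows accordingly.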

 

 

 

4. Disadvantages 

1. Computes in memory, so it depends heavily on the amount of memory available.

2. It is written in C++, which makes its internals hard to read and modify for ordinary users.

3. It is built on Hive and tightly coupled to it, so the two have to coexist.

4. In practice, once a table has more than 10,000 partitions, performance degrades seriously and problems become likely.

5. It is not as stable as Hive.
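
The partition limit in item 4 can be turned into a simple guard. This is a minimal sketch: the 10,000 threshold is the figure quoted above rather than an official limit, and the table names and partition counts are made up. In practice the counts would have to come from the Hive metastore, for example by counting the output of Hive's `SHOW PARTITIONS`.

```python
PARTITION_WARN_THRESHOLD = 10_000  # figure quoted in the text, not an official limit


def tables_over_threshold(partition_counts, threshold=PARTITION_WARN_THRESHOLD):
    """Return (table, count) pairs whose count exceeds the threshold, largest first."""
    over = [(table, n) for table, n in partition_counts.items() if n > threshold]
    return sorted(over, key=lambda item: item[1], reverse=True)


# Hypothetical partition counts per table.
counts = {"web_logs": 26_280, "orders": 365, "clicks": 10_000}
print(tables_over_threshold(counts))  # [('web_logs', 26280)]
```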
