What is Impala

Impala is a query system developed by Cloudera. It provides SQL semantics and can query petabyte-scale data stored in Hadoop's HDFS and in HBase. The existing Hive system also provides SQL semantics, but Hive executes queries on the MapReduce engine, so it remains a batch system and cannot easily meet the needs of interactive queries. Impala's defining feature, and its biggest selling point, is speed.
 
Advantages
  1. Impala does not need to write intermediate results to disk, which saves a great deal of I/O overhead.
  2. It avoids the overhead of starting MapReduce jobs. MapReduce is slow to launch tasks (the default heartbeat interval is 3 seconds), whereas Impala schedules work directly through its own service processes, which is much faster.
  3. Impala abandons MapReduce entirely, since that paradigm is a poor fit for SQL queries. Instead, like Dremel, it starts from scratch with ideas borrowed from MPP parallel databases. This allows more query optimization and avoids unnecessary shuffle, sort, and similar overheads.
  4. Runtime code is compiled with LLVM, which avoids the unnecessary overhead of supporting general-purpose code paths.
  5. It is implemented in C++ with many hardware-specific optimizations, such as the use of SSE instructions.
  6. Its I/O scheduling supports data locality: data and computation are placed on the same machine wherever possible, which reduces network overhead.
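
The data-locality idea in point 6 can be illustrated with a small sketch. This is not Impala's actual scheduler. It is a simplified greedy assignment written for illustration, and the function name `assign_scan_ranges`, the block ids, and the host names are all hypothetical. Each block is given to the least-loaded executor that already holds a replica of it, and a remote read happens only when no executor holds a replica.

```python
def assign_scan_ranges(block_replicas, hosts):
    """Greedy locality-aware assignment (illustrative only, not Impala's algorithm).

    block_replicas: dict mapping a block id to the hosts that store a replica.
    hosts: hosts that run a (hypothetical) query executor.
    Returns (assignment, remote_reads), where assignment maps block id -> host.
    """
    load = {h: 0 for h in hosts}   # number of blocks assigned to each executor
    assignment = {}
    remote_reads = 0

    for block in sorted(block_replicas):
        # Executors that can read this block from their own local disk.
        local = [h for h in block_replicas[block] if h in load]
        if local:
            candidates = local
        else:
            # No executor holds a replica, so the read must cross the network.
            candidates = hosts
            remote_reads += 1
        # Choose the least-loaded candidate; break ties by host name.
        chosen = min(candidates, key=lambda h: (load[h], h))
        assignment[block] = chosen
        load[chosen] += 1

    return assignment, remote_reads


if __name__ == "__main__":
    replicas = {
        "blk_1": ["node-a", "node-b"],
        "blk_2": ["node-a", "node-c"],
        "blk_3": ["node-b", "node-c"],
        "blk_4": ["node-d"],  # node-d runs no executor in this example
    }
    plan, remote = assign_scan_ranges(replicas, ["node-a", "node-b", "node-c"])
    for block, host in plan.items():
        print(block, "->", host)
    print("remote reads:", remote)
```

In this example only `blk_4` requires a remote read, because its single replica lives on a host without an executor. The other three blocks are each scanned on a machine that already stores them, which is the network saving that point 6 describes.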
