【Introduction to Apache Tajo】


Apache Tajo is a robust relational, distributed data warehouse system for big data on Apache Hadoop. Tajo is designed for low-latency, scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load) on large data sets stored on HDFS (Hadoop Distributed File System) and other data sources. By supporting SQL standards and leveraging advanced database techniques, Tajo directly controls distributed execution and data flow, which opens up a variety of query evaluation strategies and optimization opportunities.

Tajo's design is similar to that of Tenzing. It draws on the strengths of both MapReduce and traditional databases, so it retains the scalability and fault tolerance that Hive offers while aiming for much better performance than Hive.


Tajo Features

Fast and Efficient

Fully distributed SQL query processing engine

Advanced query optimization such as cost-based and progressive query optimization

Interactive analysis on reasonably sized data sets


Fault tolerance and dynamic scheduling for long-running queries

Out-of-core algorithms for data sets larger than main memory


ANSI/ISO SQL standard compliance

Hive MetaStore access support

JDBC driver support

Various file formats support, such as CSV, JSON, RCFile, SequenceFile, ORC and Parquet


User-defined functions

Interactive shell

Convenient Backup/Restore utility

Asynchronous/Synchronous Java API
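
Of the features above, "out-of-core algorithms" may be the least self-explanatory: the idea is to process data that does not fit in memory by spilling sorted pieces to disk and merging them afterwards. The following is a minimal, generic sketch of an external merge sort in Python. It is not Tajo's implementation, and the function name external_sort and its chunk_size parameter are assumptions made for illustration only.

```python
import heapq
import tempfile


def external_sort(values, chunk_size):
    """Yield ints in sorted order while buffering at most chunk_size of them."""
    runs = []   # temp files, each holding one sorted run
    chunk = []  # in-memory buffer

    def spill():
        # Sort the buffered values and write them to disk as one sorted run.
        chunk.sort()
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(f"{v}\n" for v in chunk)
        run.seek(0)
        runs.append(run)
        chunk.clear()

    for v in values:
        chunk.append(v)
        if len(chunk) >= chunk_size:
            spill()
    if chunk:
        spill()

    # k-way merge: heapq.merge pulls one line at a time from each run.
    readers = [(int(line) for line in run) for run in runs]
    try:
        yield from heapq.merge(*readers)
    finally:
        for run in runs:
            run.close()


if __name__ == "__main__":
    print(list(external_sort([5, 3, 9, 1, 7, 2], chunk_size=2)))
```

Only one chunk plus one pending value per run is held in memory at a time, which is what lets this style of algorithm scale past main memory.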



Tajo adopts a master-worker architecture with the following components:

  1) TajoMaster: Provides query services to clients and manages the QueryMasters.

  2) QueryMaster: Responsible for parsing, optimizing, and executing a single query. It coordinates multiple task runner workers to carry out the query's computation.
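
The division of labor described above can be sketched with a toy simulation. This is not Tajo code: the class names only mirror the roles, the methods submit, plan, and execute are hypothetical, the "query" is reduced to summing a list of numbers, and a thread pool stands in for the task runner workers.

```python
from concurrent.futures import ThreadPoolExecutor


class QueryMaster:
    """Toy stand-in for a QueryMaster: plans one query and drives its tasks."""

    def __init__(self, pool):
        self.pool = pool  # shared pool standing in for task runner workers

    def plan(self, rows, num_tasks):
        # Parsing and optimization are reduced here to splitting the input
        # into roughly equal partitions, one per task.
        size = max(1, -(-len(rows) // num_tasks))  # ceiling division
        return [rows[i:i + size] for i in range(0, len(rows), size)]

    def execute(self, rows, num_tasks=4):
        tasks = self.plan(rows, num_tasks)
        # Each worker computes a partial sum; the QueryMaster merges them.
        partials = self.pool.map(sum, tasks)
        return sum(partials)


class TajoMaster:
    """Toy stand-in for TajoMaster: accepts queries, one QueryMaster each."""

    def __init__(self, num_workers=4):
        self.pool = ThreadPoolExecutor(max_workers=num_workers)
        self.query_masters = {}  # query id -> QueryMaster
        self.next_id = 0

    def submit(self, rows):
        query_id = self.next_id
        self.next_id += 1
        qm = QueryMaster(self.pool)
        self.query_masters[query_id] = qm
        return query_id, qm.execute(rows)

    def shutdown(self):
        self.pool.shutdown()


if __name__ == "__main__":
    master = TajoMaster(num_workers=3)
    qid, total = master.submit(list(range(1, 101)))  # think: SELECT sum(x)
    print(qid, total)  # prints "0 5050"
    master.shutdown()
```

The point of the split is that the master stays lightweight: it only hands out QueryMasters, while each QueryMaster owns the planning and the fan-out of a single query.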
