Demystifying Apache HAWQ: a powerful SQL-on-Hadoop engine

1. Basic introduction to HAWQ


HAWQ is a Hadoop-native, massively parallel SQL analytics engine designed for analytical applications. Like other relational databases, it accepts SQL and returns a result set, but it offers massively parallel processing capabilities that many traditional and Hadoop-based databases lack. Its main features are as follows:

  1. Comprehensive standards support: ANSI SQL, OLAP extensions, and standard JDBC/ODBC, more complete than other Hadoop SQL engines.
  2. MPP (massively parallel processing) performance, several times faster than other SQL-on-Hadoop engines.
  3. A mature parallel query optimizer. The optimizer is a critical part of a parallel SQL engine and has a large impact on performance, especially for complex queries.
  4. ACID transaction support: something many existing Hadoop-based SQL engines cannot offer, and important for ensuring data consistency.
  5. Dynamic data streaming engine: a UDP-based high-speed interconnect.
  6. Elastic execution engine: the number of nodes and segments used to execute a query is determined by the size of the query.
  7. Multiple partition methods and multi-level partitioning, such as List and Range partitions. Partitioned tables help performance considerably: for example, if you only need the most recent month of data, the query only scans the partition holding that month (see the sketch after this list).
  8. Multiple compression methods: snappy, gzip, quicklz, RLE, etc.
  9. UDF (user-defined function) support in multiple languages: Java, Python, C/C++, Perl, R, etc.
  10. Dynamic capacity expansion: nodes can be added on demand, in seconds, according to storage or compute needs.
  11. Multi-level resource and workload management: integration with the external resource manager YARN; management of CPU, memory, and other resources; multi-level resource queues; and a convenient DDL management interface.
  12. Access to data in HDFS and other systems: various HDFS formats (Text, SequenceFile, Avro, Parquet, etc.) and external systems such as HBase; users can also develop their own plugins to access new data sources.
  13. Native support for the MADLib machine learning and data mining library: easy to use and high performance.
  14. Seamless integration with the Hadoop ecosystem: storage, resources, installation and deployment (Ambari), data formats, access, and more.
  15. Complete security and privilege management: Kerberos, plus authorization at the database, table, and other levels.
  16. Support for a variety of third-party tools, such as Tableau, SAS, and the newer Apache Zeppelin.
  17. Fast access libraries for HDFS and YARN: libhdfs3 and libyarn (which other projects can also use).
  18. Deployment on-premises, in virtualized environments, or in the cloud.
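
To make the partitioning item above concrete, here is a minimal sketch of a range-partitioned table using the PostgreSQL-derived DDL that HAWQ inherits from Greenplum. The table and column names are hypothetical, and the exact syntax should be checked against the HAWQ documentation for your version:

    -- Hypothetical fact table partitioned by month (Range partitioning).
    CREATE TABLE sales (
        id        bigint,
        amount    numeric,
        sale_date date
    )
    DISTRIBUTED BY (id)
    PARTITION BY RANGE (sale_date)
    (
        START (date '2016-01-01') INCLUSIVE
        END   (date '2017-01-01') EXCLUSIVE
        EVERY (INTERVAL '1 month')
    );

    -- A query that only needs the most recent month scans just that partition:
    SELECT sum(amount)
    FROM sales
    WHERE sale_date >= date '2016-12-01';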

Let me explain what "native" means when we call HAWQ a native Hadoop SQL engine. "Native" is mainly reflected in the following aspects:

  1. All data is stored on HDFS; no connector mode is needed.
  2. High scalability: like other Hadoop components, it is highly scalable, and with high performance.
  3. Native code access: like other Hadoop projects, HAWQ is an Apache project; users are free to download, use, and contribute to it, unlike pseudo open source software.
  4. Transparency: software is developed the Apache way; all feature development and discussion are public, and users are free to participate.
  5. Native management: it can be deployed through Ambari, obtain resources from YARN, and run in the same cluster as other Hadoop components.

Key benefits provided by HAWQ:

  • A comparison of HAWQ with similar open source and closed source products is shown in Figure 1:

(Figure 1)

  • Another comparison with similar open source and closed source products is shown in Figure 2:

(Figure 2)

HAWQ History and Current Status:

  1. Ideas and prototype system (2011): the GOH phase (Greenplum Database On HDFS).
  2. HAWQ 1.0 Alpha (2012): tried by many large overseas customers, whose performance tests at the time showed it to be hundreds of times faster than Hive. This pushed HAWQ 1.0 to be released as an official product.
  3. HAWQ 1.0 GA (early 2013): changed the traditional MPP database architecture, including transactions, fault tolerance, metadata management, and more.
  4. HAWQ 1.X (2014 to Q2 2015): added enterprise-level features such as Parquet storage, a new optimizer, Kerberos, and Ambari installation and deployment, with customers around the world.
  5. HAWQ 2.0 Alpha released and HAWQ accepted as an Apache incubator project: the system architecture was redesigned for cloud environments, with dozens of advanced features including the elastic execution engine, advanced resource management, YARN integration, and scaling in seconds. The version now open-sourced at Apache is the latest 2.0 Alpha, and all future development happens at Apache.

2. Apache HAWQ system architecture


Let me introduce the system architecture of HAWQ. Figure 3 presents the main components of a typical HAWQ cluster. There are several master nodes: the HAWQ master node, the HDFS master node (NameNode), and the YARN master node (ResourceManager). The HAWQ metadata service currently runs inside the HAWQ master node and will become a separate service in future versions. The other nodes are slave nodes; each slave node runs an HDFS DataNode, a YARN NodeManager, and a HAWQ segment. When executing a query, a HAWQ segment starts multiple QEs (Query Executors), which run inside resource containers.

(Figure 3)

Figure 4 is the internal architecture diagram of HAWQ:

(Figure 4)

As the diagram shows, the HAWQ master node contains several important components: the query parser (parser/analyzer), the optimizer, the resource manager, the resource agent, the HDFS metadata cache, the fault tolerance service, the query dispatcher, and the metadata service. A physical segment is installed on each slave node. During query execution, the elastic execution engine starts multiple virtual segments for a query and runs them concurrently, and data is exchanged between nodes through the interconnect (the high-speed internal network). If a query starts 1000 virtual segments, the query has been divided evenly into 1000 tasks that execute in parallel, so the number of virtual segments indicates the query's degree of parallelism. That degree of parallelism is determined dynamically by the elastic execution engine based on the query's cost and the current resource usage. Below I explain the role of each component and the relationships between them:

  1. Query parser: parses the query and checks syntax and semantics. The resulting query tree is passed to the optimizer.
  2. Optimizer: accepts the query tree and generates a query plan. A query may have a huge number of equivalent query plans whose execution performance varies widely; the optimizer's job is to find a well-optimized one.
  3. Resource manager: dynamically requests resources from the global resource manager (such as YARN) through the resource agent, caches them, and returns them when they are no longer needed. Resources are cached mainly to reduce the interaction cost between HAWQ and the global resource manager: HAWQ supports millisecond-level queries, and if every small query had to ask the global resource manager for resources, performance would suffer. The resource manager must also ensure that a query does not use more resources than it was allocated; otherwise queries would interfere with each other and could make the whole system unavailable.
  4. HDFS metadata cache: used by HAWQ to decide which segments scan which parts of a table. HAWQ dispatches computation to where the data resides, so it needs to match the locality of computation and data, which requires the location information of HDFS blocks. That information lives on the HDFS NameNode, and having every query hit the NameNode would turn it into a bottleneck, so HAWQ keeps an HDFS metadata cache on the master node.
  5. Fault tolerance service: detects which nodes are available and which are not; unavailable machines are excluded from the resource pool.
  6. Query dispatcher: after the optimizer produces a plan, the dispatcher sends the plan to each node for execution and coordinates the entire query execution process. The query dispatcher is the glue of the whole parallel system.
  7. Metadata service: stores HAWQ's metadata, including database and table information and access permissions. The metadata service is also the key to implementing distributed transactions.
  8. High-speed interconnect: transfers data between nodes; implemented in software on top of UDP.

Query execution


After understanding the various components, let's look at the main flow of a query (see Figure 5).

(Figure 5)

After the user submits a query through JDBC/ODBC, the query parser produces a query tree, and the optimizer generates a query plan from it. The dispatcher interacts with the resource manager to obtain resources, decomposes the query plan, and dispatches the plan to the segment executors for execution. The final result is returned to the user.

Let me briefly take a look at what a parallel query plan looks like. Figure 6 shows a specific example.

(Figure 6)

This query contains a join, an expression, and an aggregate. There are two query plans in the figure. In a nutshell, the difference between a parallel query plan and a serial one is the addition of Motion operators. A Motion is responsible for exchanging data between nodes, and underneath it is implemented by the high-speed interconnect. We can see three kinds of Motion here:

  1. Redistribute Motion: redistributes data according to hash key values.
  2. Broadcast Motion: broadcasts data to all nodes.
  3. Gather Motion: gathers data together.

The query plan on the left shows the case where the lineitem and orders tables are distributed on the join key: lineitem is hashed on l_orderkey and orders on o_orderkey, so no redistribution is needed when the two tables are joined. The query plan on the right shows a case where the data does need to be redistributed; compared with the plan on the left, it has one extra Motion node.
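
As a rough illustration of the distribution-key point above, the following sketch creates lineitem and orders distributed on their join keys, so a join on those keys can avoid a Redistribute Motion. The column lists are abbreviated and hypothetical rather than the exact schema from the figure:

    -- Both tables hash-distributed on the join key (abbreviated, hypothetical columns).
    CREATE TABLE lineitem (
        l_orderkey      bigint,
        l_extendedprice numeric,
        l_discount      numeric
    ) DISTRIBUTED BY (l_orderkey);

    CREATE TABLE orders (
        o_orderkey  bigint,
        o_orderdate date
    ) DISTRIBUTED BY (o_orderkey);

    -- Matching rows are already co-located, so this join needs no Redistribute Motion;
    -- a Gather Motion collects the final result. EXPLAIN shows the Motion nodes in the plan.
    EXPLAIN
    SELECT o.o_orderdate, sum(l.l_extendedprice * (1 - l.l_discount))
    FROM lineitem l JOIN orders o ON l.l_orderkey = o.o_orderkey
    GROUP BY o.o_orderdate;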

Elastic execution engine


The elastic execution engine has several key design points: complete separation of storage and computation, stateless segments, and how resources are used. Separating storage from computation lets us dynamically start any number of virtual segments to execute a query. Stateless segments make the cluster easier to scale: keeping state consistent across a large cluster is difficult, so we use stateless segments. Resource usage covers how many resources to request based on the cost of the query and how to use those resources effectively, for example by optimizing data locality. HAWQ has carefully optimized designs for each of these parts.

Metadata service


The metadata service runs on the HAWQ master node. It provides metadata storage and query services to the other components. Its external interface is CaQL (Catalog Query Language). The language CaQL supports is a subset of SQL, including single-table selects, counts, multi-row deletes, and single-row inserts and updates. CaQL is designed as a subset of SQL because we eventually want to split the metadata service out of the master node as a standalone service; a simple subset is sufficient for a metadata service and is easier to scale.
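
CaQL itself is an internal interface rather than something users call directly, but the statement classes it supports map to very simple SQL. The lines below are purely illustrative of those classes, with hypothetical catalog table and column names, not literal CaQL calls:

    -- Illustrative only: the statement classes the CaQL subset covers.
    SELECT * FROM catalog_tables WHERE relname = 't1';        -- single-table select
    SELECT count(*) FROM catalog_tables;                      -- count
    DELETE FROM catalog_attributes WHERE attrelid = 16384;    -- multi-row delete
    INSERT INTO catalog_tables (relname) VALUES ('t2');       -- single-row insert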

High-speed interconnect


The role of the high-speed interconnect is to exchange large amounts of data between nodes. The HAWQ interconnect is based on UDP. You may ask why we don't use TCP: in fact, we support both TCP and UDP, and the TCP implementation came before the UDP one. We developed a UDP-based protocol because we ran into a problem that TCP could not solve well. Figure 7 shows an example of the interconnect.

(Figure 7)

In the example, the executor processes on each node form data exchange pipelines. Suppose there are 1000 processes on each node and 1000 nodes; these processes need to communicate with each other, which would mean millions of connections per node. TCP cannot support this many connections efficiently, so we developed an interconnect protocol based on UDP. With UDP the operating system does not guarantee reliability or ordered delivery, so our design has to provide the following properties:

  1. Reliability: lost packets are retransmitted when packet loss occurs.
  2. Ordering: packets are ultimately delivered to the receiver in order.
  3. Flow control: if the sender's rate is not controlled, the receiver can be overwhelmed and overall network performance can drop sharply.
  4. Performance and scalability: these are the reasons we needed to replace TCP in the first place.
  5. Support for multiple platforms.

(Figure 8)

Figure 8 shows the state machine of our UDP-based interconnect implementation. The design also has to eliminate deadlocks; details can be found in the references.

Transaction management


Transactions are a very important property of a data management system. Most SQL-on-Hadoop engines do not support transactions, and it is very hard for programmers to guarantee transactional behavior and data consistency on their own.

HAWQ supports all ACID transaction properties and supports snapshot isolation. Transactions are coordinated and controlled by the master node using a swimlane model: concurrent inserts, for example, each use their own swimlane and do not conflict with one another. Consistency is ensured by recording the logical length of the file when a transaction commits; if the transaction fails, the garbage data at the end of the file must be rolled back and removed. HDFS originally did not support truncate; the truncate functionality that HDFS now provides was added based on HAWQ's requirements.
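
As a small, hedged illustration of the user-visible side of this, HAWQ accepts standard SQL transaction blocks; the table and values below are hypothetical:

    -- Hypothetical example: inserts inside an explicit transaction.
    BEGIN;
    INSERT INTO sales VALUES (1001, 99.50, date '2016-12-15');
    INSERT INTO sales VALUES (1002, 12.00, date '2016-12-16');
    -- On failure, ROLLBACK discards the appended data (internally the file is logically
    -- truncated back to its committed length); otherwise COMMIT makes it visible
    -- under snapshot isolation.
    COMMIT;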

Resource manager


HAWQ supports three levels of resource management:

  1. Global resource management: YARN can be integrated to share cluster resources with other systems; Mesos and others will be supported in the future.
  2. HAWQ internal resource management: supports resource management at the query, user, and other levels.
  3. Operator-level resource management: resources can be allocated to and enforced for individual operators.

HAWQ now supports multi-level resource queues, which can be easily defined and modified through DDL (a sketch follows the component list below). Figure 9 shows the main architecture of the HAWQ resource manager:

(Figure 9)

The various components in the resource manager function as follows:

  1. Request handler: receives resource requests from the query dispatcher process.
  2. Resource allocator: responsible for allocating resources.
  3. Resource pool: holds the current state of all resources.
  4. Policy store: holds all allocation policies; policies will be customizable in the future.
  5. Resource agent: interacts with the global resource manager.
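
Below is a minimal sketch of the resource queue DDL mentioned above. The queue name, role, and attribute values are made up, and the exact attribute names (for example PARENT, MEMORY_LIMIT_CLUSTER, CORE_LIMIT_CLUSTER, ACTIVE_STATEMENTS) should be checked against the HAWQ documentation for your version:

    -- Hypothetical queue under the root queue, capped at 20% of cluster memory and
    -- cores, with at most 10 concurrent statements; then a role is assigned to it.
    CREATE RESOURCE QUEUE etl_queue WITH (
        PARENT = 'pg_root',
        MEMORY_LIMIT_CLUSTER = 20%,
        CORE_LIMIT_CLUSTER = 20%,
        ACTIVE_STATEMENTS = 10
    );

    ALTER ROLE etl_user RESOURCE QUEUE etl_queue;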

Storage module


HAWQ supports several internally optimized storage formats, such as AO (append-only) and Parquet, and provides a MapReduce InputFormat so external systems can access the data directly. Other storage formats are accessed through the extension framework, and users can develop their own plugins for user-specific formats. Compression, multi-level partitioning, and other features are also supported.
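
As a hedged sketch of these storage options, HAWQ's CREATE TABLE accepts Greenplum-style WITH clauses for append-only and Parquet storage and for compression. The tables and parameter values below are illustrative and should be checked against the documentation for your HAWQ version:

    -- Append-only row-oriented table with quicklz compression (illustrative).
    CREATE TABLE events_ao (
        event_id bigint,
        payload  text
    ) WITH (appendonly = true, compresstype = quicklz)
    DISTRIBUTED BY (event_id);

    -- Parquet table with snappy compression (illustrative).
    CREATE TABLE events_parquet (
        event_id bigint,
        payload  text
    ) WITH (appendonly = true, orientation = parquet, compresstype = snappy)
    DISTRIBUTED BY (event_id);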

MADLib


(Figure 10)

As shown in Figure 10, MADLib is a fairly complete parallel machine learning and data mining library that supports a wide variety of machine learning and statistical analysis methods. HAWQ supports MADLib natively. MADLib is now an independent Apache project and contains essentially all of the commonly used machine learning methods.
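
Below is a hedged sketch of what calling MADLib from HAWQ looks like, using its linear regression functions. The table, columns, and output names are hypothetical, and the exact function signatures should be checked against the MADLib documentation:

    -- Train a linear regression model on a hypothetical table houses(price, size, bedrooms),
    -- writing the coefficients to houses_linregr, then apply the model.
    SELECT madlib.linregr_train(
        'houses',                     -- source table
        'houses_linregr',             -- output (model) table
        'price',                      -- dependent variable
        'ARRAY[1, size, bedrooms]'    -- independent variables
    );

    SELECT h.price,
           madlib.linregr_predict(m.coef, ARRAY[1, h.size, h.bedrooms]) AS predicted_price
    FROM houses h, houses_linregr m;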

3. HAWQ short- and medium-term plans


The HAWQ team is focused on the 2.0 GA release in the short term. In the longer term, we plan to work on the following:

  1. Disaster recovery across data centers
  2. Distributed index support
  3. Snapshot support
  4. Integrate with more other ecosystems
  5. Support for new hardware to further improve performance: GPUs and more

4. Contributing to the Apache HAWQ community


HAWQ is an Apache open source project, and we hope more people from the community will participate. Contributions are not limited to code: you can also contribute tests, documentation, JIRA bug reports, feature requests, and more. At present there are not many contributors to Apache open source communities from China, and I hope everyone can work together to advance the domestic open source community.

Contributing to an Apache project is fairly simple: open a JIRA in our Apache JIRA system ( https://issues.apache.org/jira/browse/HAWQ ) and describe your proposed solution. If it is code, you can submit a pull request on GitHub. For the specific steps, please refer to our process on the Apache wiki ( https://cwiki.apache.org/confluence/display/HAWQ ). After you submit the code, a HAWQ committer will work with you to get the code merged. If you have contributed enough and want to become an Apache committer, the HAWQ PMC has a voting process to ensure fairness.

Development discussions for all features take place on JIRA and the mailing lists. Here are the main Apache HAWQ URLs and the mailing lists you can subscribe to:

  1. Website: http://hawq.incubator.apache.org/
  2. Wiki: https://cwiki.apache.org/confluence/display/HAWQ
  3. Repo: https://github.com/apache/incubator-hawq.git
  4. JIRA: https://issues.apache.org/jira/browse/HAWQ
  5. Mailing lists: [email protected] and [email protected]; to subscribe, send mail to [email protected] and [email protected]

We also have a "Big Data Community" in China that organizes meetups to discuss the latest developments in HAWQ and related ecosystems. The meetup page is http://www.meetup.com/Big-Data-Community-China/ and the "Big Data Community" blog is http://blog.csdn.net/bigdatacommunity (which has more technical articles on HAWQ and big data).
