An Overview and Comparison of Apache HAWQ

I. HAWQ: History and Current Status

  1. The idea and prototype system (2011): the GOH stage (Greenplum Database On HDFS).

  2. HAWQ 1.0 Alpha (2012): tried out by a large number of overseas customers, whose performance tests showed it to be hundreds of times faster than Hive. This pushed HAWQ 1.0 toward an official product release.

  3. HAWQ 1.0 GA (early 2013): reworked the traditional MPP database architecture, including transactions, fault tolerance, and metadata management.

  4. HAWQ 1.x releases (2014 to Q2 2015): added a number of required enterprise-grade features, such as Parquet storage, a new query optimizer, Kerberos support, and Ambari-based installation and deployment. Gained customers worldwide.

  5. HAWQ 2.0 Alpha released and the project entered the Apache Incubator: the system architecture was redesigned for cloud environments, with dozens of advanced features including an elastic execution engine, advanced resource management, YARN integration, and scaling out in seconds. The latest version available in the Apache open-source repository is 2.0 Alpha, and future development will continue at Apache.

II. Introduction to HAWQ

HAWQ is a Hadoop-native, massively parallel SQL analysis engine aimed at analytical applications. Like other relational databases, it accepts SQL and returns a result set, but it offers many features that traditional massively parallel processing databases and other engines lack. Let us look at the aspects that define a first-class SQL-on-Hadoop engine and see how HAWQ compares.

1. Rich and fully standard-compliant SQL

HAWQ is 100% compliant with the ANSI SQL specification and supports SQL-92, SQL-99, SQL-2003 and OLAP extensions, building on PostgreSQL and running on Hadoop. It includes correlated subqueries, window functions, rollup aggregations, and a wide range of scalar and aggregate functions. Users can connect to HAWQ through ODBC and JDBC. For businesses, the benefit is that the huge ecosystem of business-intelligence, data-analysis, and data-visualization tools can be used with HAWQ right out of the box, because the system fully supports the SQL standard. In addition, analytical applications written against HAWQ can easily be ported to other SQL-standard-compliant engines, and vice versa. This prevents vendor lock-in for the enterprise and fosters innovation while keeping operational risk under control.
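As a minimal, hypothetical illustration of this SQL surface, the sketch below connects to HAWQ over its PostgreSQL-compatible protocol with the pg8000 Python driver (one of the packages installed in the build steps later in this article) and runs a SQL:2003 window-function query. The connection settings and the sales table are placeholders, not part of any HAWQ sample schema.

    # Minimal sketch: run an ANSI SQL window-function query against HAWQ
    # through its PostgreSQL-compatible interface. The connection settings
    # and the "sales" table are illustrative assumptions.
    import pg8000

    conn = pg8000.connect(user="gpadmin", host="hawq-master",
                          port=5432, database="postgres")
    cur = conn.cursor()

    # Rank each region's orders by amount -- a SQL:2003 window function.
    cur.execute("""
        SELECT region, order_id, amount,
               RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank
        FROM sales
    """)
    for row in cur.fetchall():
        print(row)

    conn.close()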

2. TPC-DS compliance

TPC-DS defines 99 query templates (covering, for example, point lookups, reporting, iterative, OLAP, and data-mining queries) for a variety of operational requirements and query complexities. A mature PostgreSQL-based SQL-on-Hadoop system needs to support and correctly execute most of these queries in order to handle the full range of analytical workloads and use cases. The benchmark runs 111 queries generated from the 99 TPC-DS templates. Based on the number of queries supported, a bar-chart comparison of common SQL-on-Hadoop systems measures two levels of compliance: (1) the number of queries each system can optimize (that is, return a query plan for), and (2) the number of queries it can execute to completion and return results for.

HAWQ's SQL support is built on a scalable data-warehouse code base, and HAWQ successfully completes all 111 queries. More details on these results can be found in the paper on a modular query optimizer architecture for big data published at the ACM SIGMOD international conference on management of data.

3. Flexible and efficient joins

HAWQ incorporates the most advanced cost-based SQL query optimizer in the industry and pioneered this area of SQL on Hadoop. The optimizer is designed on the basis of the research results published in the modular query optimizer architecture for big data paper. HAWQ can produce execution plans that make optimal use of Hadoop cluster resources, regardless of the size of the data or the complexity of the query. The cost functions in the optimizer can also be configured for a specific environment: version, hardware, CPU, IOPS, and so on. HAWQ has been shown to quickly find good plans for demanding queries that join more than 50 tables, making it the best SQL-on-Hadoop data-discovery and query engine in the industry. This allows organizations to use HAWQ, at a significantly lower cost, to take over large-scale data-analysis workloads from traditional enterprise data warehouses.
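To see what plan the cost-based optimizer chooses for a join, you can prefix a query with EXPLAIN, just as in PostgreSQL. The sketch below (hypothetical orders and customers tables, the same placeholder connection settings as above) simply prints the plan text returned by HAWQ.

    # Sketch: print the optimizer's plan for a two-table join using EXPLAIN.
    import pg8000

    conn = pg8000.connect(user="gpadmin", host="hawq-master",
                          port=5432, database="postgres")
    cur = conn.cursor()
    cur.execute("""
        EXPLAIN
        SELECT c.name, SUM(o.amount)
        FROM orders o
        JOIN customers c ON o.customer_id = c.id
        GROUP BY c.name
    """)
    for (plan_line,) in cur.fetchall():   # each row of EXPLAIN output is one text column
        print(plan_line)
    conn.close()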

4. Linear scalability that accelerates Hadoop queries

HAWQ is designed for PB-scale SQL-on-Hadoop operation. Data is stored directly on HDFS, and the SQL query engine has been carefully optimized for the performance characteristics of HDFS. A key design goal for executing SQL on Hadoop is to minimize the cost of data transfer during joins. HAWQ addresses this critical requirement with Dynamic Pipelining, enabling interactive queries over data in HDFS. Dynamic Pipelining is a parallel data-flow framework that combines several unique technologies:

  • An adaptive, UDP-based high-speed interconnect.

  • A runtime execution environment that underlies all SQL query processing and is tuned for big-data workloads.

  • Runtime resource management that ensures queries complete even when other demanding queries are running on a heavily loaded cluster.

  • A seamless data-distribution mechanism that brings together the parts of a data set frequently used by a particular query.

The performance analysis in the modular query optimizer architecture paper shows that, for Hadoop-based analytical and data-warehouse workloads, HAWQ is one to two orders of magnitude faster than existing Hadoop query engines. The improvement is mainly due to Dynamic Pipelining and the power of HAWQ's cost-based query optimizer. This lets HAWQ help enterprises significantly reduce the cost of offloading enterprise data-warehouse workloads.

5. Integrated advanced analytics and machine learning

In addition to joins and aggregations over tables, data analysis usually requires statistical, mathematical, and machine-learning algorithms, such as principal component analysis and curve fitting, whose code must be restructured to run efficiently in a parallel environment. This is becoming a basic requirement for any SQL-on-Hadoop solution. HAWQ provides these capabilities through the scalable, open-source in-database analytics library MADlib, extending SQL on Hadoop through user-defined functions. HAWQ also supports user-defined functions (UDFs) written in PL/R, PL/Python, and PL/Java for specifying custom machine-learning methods. For user scenarios with such requirements, this makes it possible to embed advanced machine-learning analytics in ordinary analytical workloads.
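As a hedged example of what an in-database MADlib call can look like, the sketch below trains a linear-regression model from SQL. It assumes MADlib has been installed into a schema named madlib in the target database; the houses table and its columns are invented for illustration.

    # Sketch: train a linear-regression model with MADlib from SQL.
    # Assumes MADlib is installed in the "madlib" schema; table/columns are illustrative.
    import pg8000

    conn = pg8000.connect(user="gpadmin", host="hawq-master",
                          port=5432, database="postgres")
    cur = conn.cursor()
    cur.execute("""
        SELECT madlib.linregr_train(
            'houses',                         -- source table
            'houses_linregr',                 -- output model table
            'price',                          -- dependent variable
            'ARRAY[1, size_sqft, num_rooms]'  -- independent variables
        )
    """)
    # The model table produced by the training call holds the coefficients.
    cur.execute("SELECT coef, r2 FROM houses_linregr")
    print(cur.fetchall())
    conn.commit()
    conn.close()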

6. Data federation capabilities

A SQL-on-Hadoop engine that can federate external data sources offers greater flexibility, allowing data from various sources to be combined and analyzed together. Typical federation targets include enterprise data warehouses, HDFS, HBase, and Hive, ideally exploiting the inherent parallelism of the SQL-on-Hadoop engine. HAWQ's data-federation functionality is provided by a module called the Pivotal eXtension Framework (PXF). Beyond common data-federation capabilities, PXF gives SQL on Hadoop additional, industry-leading abilities (a sketch of defining a PXF external table follows the list below):

  • Low latency over arbitrarily large data sets: PXF uses a smart fetcher that pushes filters down to Hive and HBase. Query workload is pushed down to the federated data stores, reducing data-movement latency and improving performance as much as possible, which matters especially for interactive queries.

  • Scalable and customizable: the PXF API provides a framework for customers to develop new connectors for their own data stores, keeping the engine loosely coupled to the data and avoiding the data-restructuring work that analytical end-use cases often require.

  • Efficient: PXF uses ANALYZE to collect statistics on external data. These statistics about federated data sources feed the cost-based optimizer, helping it build more efficient queries in a federated environment.
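The following sketch shows what defining and querying a PXF external table can look like. The namenode host, the PXF port (51200 is the commonly documented default), the HDFS path, and the HdfsTextSimple profile are assumptions for illustration; adjust them to your cluster.

    # Sketch: expose a set of CSV files in HDFS as a PXF external table and query it.
    import pg8000

    conn = pg8000.connect(user="gpadmin", host="hawq-master",
                          port=5432, database="postgres")
    cur = conn.cursor()
    cur.execute("""
        CREATE EXTERNAL TABLE ext_sales (id int, region text, amount float8)
        LOCATION ('pxf://namenode:51200/data/sales/*.csv?PROFILE=HdfsTextSimple')
        FORMAT 'TEXT' (DELIMITER ',')
    """)
    # Once defined, the external table can be queried like any other table.
    cur.execute("SELECT region, SUM(amount) FROM ext_sales GROUP BY region")
    print(cur.fetchall())
    conn.commit()
    conn.close()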

7. High availability and fault tolerance

HAWQ supports a variety of business workloads, which makes it a preferred SQL-on-Hadoop solution. It allows concurrent transactional activity on Hadoop, with user isolation and rollback when errors occur. HAWQ's fault-tolerance, reliability, and high-availability features tolerate failures at the disk and node level. These capabilities ensure business continuity and make it possible to migrate more business-critical analytics to run on HAWQ.

8. Native Hadoop file format support

HAWQ supports AVRO, Parquet, and its own native file format on HDFS. This reduces the need for ETL during ingestion as far as possible and is achieved through HAWQ's schema-on-read processing. Reducing ETL and data-movement requirements directly lowers the cost of ownership of the analytics solution.
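A minimal sketch of creating a Parquet-backed table with snappy compression follows; the events_parquet table is an invented example, and the storage options shown (APPENDONLY, ORIENTATION, COMPRESSTYPE) follow HAWQ's table WITH-clause conventions.

    # Sketch: create a table stored on HDFS in Parquet format with snappy compression.
    import pg8000

    conn = pg8000.connect(user="gpadmin", host="hawq-master",
                          port=5432, database="postgres")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE events_parquet (
            event_id   bigint,
            event_time timestamp,
            payload    text
        )
        WITH (APPENDONLY=true, ORIENTATION=parquet, COMPRESSTYPE=snappy)
        DISTRIBUTED BY (event_id)
    """)
    conn.commit()
    conn.close()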

9. Native Hadoop management through Apache Ambari

HAWQ uses Apache Ambari as the basis for management and configuration. With the appropriate Ambari plug-in, HAWQ can be managed by Ambari like any other common Hadoop service, so IT teams no longer need two management interfaces, one for Hadoop and one for HAWQ. This lets companies focus on their use cases and minimizes the effort required for support tasks such as configuration and management. Moreover, Ambari is a fully open-source Hadoop configuration and management tool, which eliminates vendor lock-in and reduces business risk.

10. Compatibility with Hortonworks Hadoop

To further follow the pace of the ODP (Open Data Platform) alliance, HAWQ is seamlessly compatible with Hortonworks' HDP big-data platform, allowing companies that have already invested in HDP to enjoy all the benefits of the industry's most advanced SQL-on-Hadoop solution. HAWQ also supports Pivotal's own Hadoop distribution, Pivotal HD.

11. Other key features of HAWQ

(1) Elastic execution engine: the number of nodes and segments used to execute a query can be chosen according to the size of the query.
(2) Support for several partitioning methods and multi-level partitioning, such as List and Range partitions. Partitioned tables help performance greatly; for example, if you only want to access the most recent month of data, the query only needs to scan the partition holding that month (see the sketch after this list).
(3) Support for several compression methods: snappy, gzip, quicklz, RLE, and so on.
(4) Dynamic expansion: capacity can be added on demand, according to storage or compute requirements, with new nodes joining in seconds.
(5) Multi-level load and resource management: integration with the external YARN resource manager; management of CPU and memory resources; multi-level resource queues; convenient DDL management interfaces.
(6) Improved security and permission management: Kerberos, plus authorization management at every level (databases, tables, and so on).
(7) Support for a variety of third-party tools, such as Tableau, SAS, and the newer Apache Zeppelin.
(8) Fast access libraries for HDFS and YARN: libhdfs3 and libyarn (which other projects can also use).
(9) Support for deployment on bare metal, in virtualized environments, or in the cloud.
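As a sketch of items (2) and (3) above, the following creates a monthly range-partitioned table with quicklz compression, so a query restricted to last month's data only scans one partition; the web_logs table and its date range are invented for illustration.

    # Sketch: a range-partitioned, compressed table (one partition per month).
    import pg8000

    conn = pg8000.connect(user="gpadmin", host="hawq-master",
                          port=5432, database="postgres")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE web_logs (
            log_id   bigint,
            log_date date,
            url      text
        )
        WITH (APPENDONLY=true, COMPRESSTYPE=quicklz)
        DISTRIBUTED BY (log_id)
        PARTITION BY RANGE (log_date)
        (
            START (date '2015-01-01') INCLUSIVE
            END   (date '2016-01-01') EXCLUSIVE
            EVERY (INTERVAL '1 month')
        )
    """)
    conn.commit()
    conn.close()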

III. How HAWQ is "native" to Hadoop

  1. Data is stored on HDFS; no connector mode is needed.

  2. Scalability: like other Hadoop components, it offers high scalability along with high performance.

  3. Native code access: like other Hadoop projects, HAWQ is an Apache project. Users are free to download it, use it, and contribute to it, unlike other pseudo-open-source software.

  4. Transparency: development follows the Apache way; all features are developed and discussed in the open, and users can participate freely.

  5. Native management: it can be deployed with Ambari, have resources allocated from YARN, and run on the same cluster as other Hadoop components.

IV. Building HAWQ

1. HAWQ architecture

HAWQ's distributed system architecture follows the classic master-slave pattern and is divided into the following three services:

  • HAWQ master: the HAWQ master is the entry point of the whole system. It is responsible for receiving and authenticating client connection requests, handling SQL commands submitted by clients, parsing and optimizing queries, dispatching queries to the Segment nodes in the cluster while distributing the load sensibly, coordinating the sub-query results returned by each Segment node, and returning the final result to the client program after the last round of processing. Internally, the HAWQ master consists of the HAWQ Resource Manager, HAWQ Catalog Service, HAWQ Fault Tolerance Service, HAWQ Dispatcher, and other components. The HAWQ master also maintains the global system catalog, a collection of system tables that holds the metadata of the HAWQ cluster; the master stores no user data, because all data lives on HDFS.

  • HAWQ segment: HAWQ segments are the compute nodes of a HAWQ cluster and are responsible for the massively parallel processing of queries. A segment node itself stores no data or metadata; all data to be processed resides on the underlying HDFS, and the segment, which is responsible only for computation, is stateless. When the HAWQ master dispatches a SQL request, it attaches the relevant metadata, which contains the HDFS URLs of the tables to be processed; the segment then accesses the data it needs through those HDFS URLs.

  • PXF agent: PXF (HAWQ Extension Framework) is an extensible framework that allows HAWQ to access data in external systems. PXF ships with built-in connectors for HDFS files, HBase tables, and Hive tables, and it can also integrate with HCatalog to access Hive tables directly. PXF lets users develop new connectors for other data stores and parallel processing engines. The PXF agent is the PXF service process and needs to be deployed on the Segment nodes of the cluster.


In a HAWQ cluster, the master node must run the HAWQ master, the HDFS namenode, and the YARN resourcemanager, while each slave node must run a HAWQ segment, a PXF agent, an HDFS datanode, and a YARN nodemanager. A HAWQ cluster stores its data directly on HDFS and can integrate with YARN for compute resource management (HAWQ also provides a standalone mode that does not depend on YARN; a HAWQ cluster deployed through Ambari defaults to standalone mode, which users can change manually on the Ambari page). HAWQ can also query HBase and Hive tables through PXF.

Users interact with a HAWQ cluster by connecting to the HAWQ master node; client programs can use the database command-line tool (psql) or connect to HAWQ through APIs such as JDBC and ODBC.

Like most Hadoop components, HAWQ offers both command-line and API access. Because HAWQ inherits the Greenplum/PostgreSQL technology stack, PostgreSQL's psql is the natural command-line client for connecting to HAWQ and submitting SQL queries against its tables, and HAWQ supports ODBC and JDBC as programming interfaces, so third-party programs can access HAWQ via JDBC/ODBC.
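For instance, a minimal third-party client written in Python could look like the sketch below. It uses the pg8000 driver (installed in the build steps further on) to speak the PostgreSQL protocol that HAWQ inherits; the host, port, user, and database names are placeholders.

    # Sketch: connect to the HAWQ master as a PostgreSQL-compatible endpoint.
    import pg8000

    conn = pg8000.connect(user="gpadmin", host="hawq-master",
                          port=5432, database="postgres")
    cur = conn.cursor()
    cur.execute("SELECT version()")
    print(cur.fetchone()[0])
    conn.close()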

2. Build steps

  • Prepare the operating system environment

    • Install CentOS 7, set the host name, turn off the firewall, and disable SELinux:

      chkconfig iptables off

      chkconfig ip6tables off

      systemctl stop firewalld.service

      sestatus                   # check the current SELinux status

      vi /etc/selinux/config     # disable SELinux (set SELINUX=disabled)

  • Prepare the software environment

    • curl -L"https://bintray.com/wangzw/rpm/rpm" -o/etc/yum.repos.d/bintray-wangzw-rpm.repo

      yum install -y epel-release

      yum makecache

      yum install -y man passwd sudo tar which git mlocate links make bzip2 net-tools \

      autoconf automake libtool m4 gcc gcc-c++ gdb bison flex cmake gperf maven indent \

      libuuid-devel krb5-devel libgsasl-devel expat-devel libxml2-devel \

      perl-ExtUtils-Embed pam-devel python-devel libcurl-devel snappy-devel \

      thrift-devel libyaml-devel libevent-devel bzip2-devel openssl-devel \

      openldap-devel protobuf-devel readline-devel net-snmp-devel apr-devel \

      libesmtp-devel xerces-c-devel python-pip json-c-devel libhdfs3-devel \

      apache-ivy java-1.7.0-openjdk-devel \

      openssh-clients openssh-server

      yum install -y postgresql-devel

      pip --retries=50 --timeout=300 install pg8000 simplejson unittest2 pycrypto pygresql pyyaml lockfile paramiko psi

      pip --retries=50 --timeout=300 install http://darcs.idyll.org/~t/projects/figleaf-0.6.1.tar.gz

      pip --retries=50 --timeout=300 install http://sourceforge.net/projects/pychecker/files/pychecker/0.8.19/pychecker-0.8.19.tar.gz/download

      yum erase -y postgresql postgresql-libs postgresql-devel

  • Download incubator-hawq

  • Install libyarn

    • cd depends/libyarn/

      mkdir build

      cd build

      ../bootstrap --prefix=/usr/local/

      make

      sudo make install

      # Copy the resulting *.so files to /usr/lib, or create the following symlinks:

      ln -s /usr/local/libyarn.so /usr/lib

      ln -s /usr/local/libyarn.so.1 /usr/lib

      ln -s /usr/local/libyarn.so.0.1.10 /usr/lib

      ldconfig

  • Configure and compile HAWQ (keep the network connection available during the build)

    • ./configure --prefix=/hawq

      make

      make install

  • Clone the virtual machine for two more nodes

    • After the steps above are complete, save the virtual machine as hawq1 and clone it twice. On each clone, adjust the host name, IP address, and other host-specific settings according to your host list, then ping each machine from the others to confirm that the network between the VMs is working.

  • Install Hadoop 2.x

    • Install and configure Hadoop 2.x across the nodes (passwordless trust can be set up with gpssh, for example), then start it and verify that it runs.

  • Getting started with HAWQ

    • source /install/dir/greenplum_path.sh

      hawq init cluster

      hawq stop/restart/start cluster


Source: www.cnblogs.com/ruanjianwei/p/12156068.html