Slim down your data warehouse architecture: free trial of Hologres 5000 CU

Built on the innovative HSAP (Hybrid Serving/Analytical Processing) architecture, Hologres can consolidate the OLAP systems (Greenplum, Presto, Impala, ClickHouse) and KV database/serving systems (HBase, Redis) in your existing data warehouse architecture into a single big data compute engine, providing fast, unified offline and real-time analysis.

Core product advantages:

1. Simplify the data warehouse architecture, reducing the cost of data movement and of maintaining multiple systems

2. Strong real-time query performance, setting a new TPC-H 30,000 GB world record

3. Integrated lake-warehouse querying: analyze offline MaxCompute data with zero ETL

Tutorial overview

Using the TPC-H dataset stored in MaxCompute and GitHub public event data, this tutorial guides you through creating a Hologres database, creating external and internal tables, importing data into the internal tables, and querying both internal and external tables in Hologres, Alibaba Cloud's real-time data warehouse. Hologres responds to these queries with extremely low latency.

Prepare environment and resources

Before starting the tutorial, prepare the environment and resources as follows:

  1. A VPC and a vSwitch have been created. For details, see Create a VPC and a vSwitch.

  2. Visit the Alibaba Cloud free trial page. Click the Login/Register button at the top right of the page and follow the prompts to log in (if you already have an Alibaba Cloud account), register an account (if you do not), or complete real-name verification (personal or enterprise, as required by the trial product).

  3. After logging in, choose Big Data Computing > Data Computing and Analysis in the product category, and click Try Now on the Hologres real-time data warehouse card.

  4. Configure the parameters on the panel that appears to try the Hologres real-time data warehouse. This tutorial uses the parameter values in the following table as an example; keep the default values for parameters that are not mentioned.

Parameter            Example value
Region               China East 1 (Hangzhou)
Instance Type        General-purpose
Compute Resources    8-core 32 GB (compute nodes: 1)
VPC                  Select the VPC created in step 1.
vSwitch              Select the vSwitch created in step 1.
Instance Name        hologres_test
Resource Group       Default resource group
  5. Read and select the service agreement, click Try Now, and follow the on-page prompts to complete the trial application.
  6. Click Go to console to start the trial.

Create a database

Quickly create a database in Hologres to store the sample data used in the subsequent queries.

  1. Log in to the Hologres management console and click Instance List in the left navigation pane.

  2. On the Instance List page, click the name of the target instance.

  3. In the left navigation pane of the instance details page, click Database Management.

  4. On the DB Authorization page, click Add Database in the upper-right corner.

  5. In the New Database dialog box, configure the following parameters.

Parameter                 Description
Instance Name             The Hologres instance on which to create the database. By default, the currently logged-in instance is shown; you can also select another Hologres instance from the drop-down list.
Database Name             Set to holo_tutorial in this example.
Simple Permission Policy  The permission policy for the new database. For more information about permission policies, see the following:
                          - SPM: Simple Permission Model. Authorization is granted at the database granularity through four roles: admin (administrator), developer, writer (reader/writer), and viewer (analyst). A small set of permission-management functions lets you manage objects in the database conveniently and securely.
                          - SLPM: Schema-Level Permission Model. Authorization is granted at the schema granularity through the roles <db>.admin (database administrator), <db>.<schema>.developer, <db>.<schema>.writer (reader/writer), and <db>.<schema>.viewer (analyst), providing finer-grained control than SPM.
                          - Expert: Hologres is compatible with PostgreSQL and uses the same permission system as PostgreSQL.
  6. Click OK.
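Under the Expert policy, authorization works exactly as in PostgreSQL. A minimal sketch (the role name dev_user is hypothetical and must already exist):

```sql
-- Expert model: plain PostgreSQL-style authorization.
-- "dev_user" is a hypothetical role used only for illustration.
GRANT USAGE ON SCHEMA public TO dev_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO dev_user;
```

Under SPM or SLPM, use the model's built-in roles instead of issuing grants directly.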

Create tables

After the database is successfully created, you need to create the corresponding tables in the database.

  1. Log on to the database.


    1. On the top menu bar of the DB Authorization page, click Metadata Management.

    2. On the Metadata Management page, double-click the name of the database you created in the directory tree on the left, and then click OK.

  2. Create external tables.
    1. On the SQL Editor page, click the icon in the upper-left corner.

    2. Create external tables that use the TPC-H dataset. The TPC-H data is provided by TPC; for more information, see TPC.
      On the new temporary Query page, select the instance name and database you created, enter the following sample code in the SQL editor, and click Run.
      The sample SQL statements create external tables mapped to tables such as odps_customer_10g and odps_lineitem_10g in the MaxCompute public project MAXCOMPUTE_PUBLIC_DATA for subsequent queries.

DROP FOREIGN TABLE IF EXISTS odps_customer_10g;
DROP FOREIGN TABLE IF EXISTS odps_lineitem_10g;
DROP FOREIGN TABLE IF EXISTS odps_nation_10g;
DROP FOREIGN TABLE IF EXISTS odps_orders_10g;
DROP FOREIGN TABLE IF EXISTS odps_part_10g;
DROP FOREIGN TABLE IF EXISTS odps_partsupp_10g;
DROP FOREIGN TABLE IF EXISTS odps_region_10g;
DROP FOREIGN TABLE IF EXISTS odps_supplier_10g;
IMPORT FOREIGN SCHEMA "MAXCOMPUTE_PUBLIC_DATA#default" LIMIT to
(
    odps_customer_10g,
    odps_lineitem_10g,
    odps_nation_10g,
    odps_orders_10g,
    odps_part_10g,
    odps_partsupp_10g,
    odps_region_10g,
    odps_supplier_10g
) 
FROM SERVER odps_server INTO public OPTIONS(if_table_exist 'error',if_unsupported_type 'error');
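After the statement succeeds, you can optionally sanity-check one of the new foreign tables with a small preview query (not part of the original tutorial steps):

```sql
-- Preview a few rows through the foreign-table mapping;
-- the data itself remains in MaxCompute.
SELECT * FROM public.odps_region_10g LIMIT 5;
```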
    3. Create an external table that uses GitHub public event data. The data is provided by GitHub. For more information, see Offline and Real-time Integration Practice Based on GitHub's Public Event Dataset.
      Click the icon in the upper-left corner. On the new temporary Query page, select the instance name and database you created, enter the following sample code in the SQL editor, and click Run. The sample SQL statement creates an external table mapped to the table dwd_github_events_odps under the schema github_events in the MaxCompute public project MAXCOMPUTE_PUBLIC_DATA for subsequent queries.
DROP FOREIGN TABLE IF EXISTS dwd_github_events_odps;
IMPORT FOREIGN SCHEMA "MAXCOMPUTE_PUBLIC_DATA#github_events" LIMIT to
(
    dwd_github_events_odps
) 
FROM SERVER odps_server INTO public OPTIONS(if_table_exist 'error',if_unsupported_type 'error');
  3. Create internal tables.
    1. On the SQL Editor page, click the icon in the upper-left corner.

    2. Create internal tables for the TPC-H dataset.
      On the new temporary Query page, select the instance name and database you created, enter the following statements in the SQL editor, and click Run.
      The sample SQL statements create tables named LINEITEM, ORDERS, PARTSUPP, PART, CUSTOMER, SUPPLIER, NATION, and REGION for subsequent data storage.

DROP TABLE IF EXISTS LINEITEM;
BEGIN;
CREATE TABLE LINEITEM
(
    L_ORDERKEY      BIGINT      NOT NULL,
    L_PARTKEY       INT         NOT NULL,
    L_SUPPKEY       INT         NOT NULL,
    L_LINENUMBER    INT         NOT NULL,
    L_QUANTITY      DECIMAL(15,2) NOT NULL,
    L_EXTENDEDPRICE DECIMAL(15,2) NOT NULL,
    L_DISCOUNT      DECIMAL(15,2) NOT NULL,
    L_TAX           DECIMAL(15,2) NOT NULL,
    L_RETURNFLAG    TEXT        NOT NULL,
    L_LINESTATUS    TEXT        NOT NULL,
    L_SHIPDATE      TIMESTAMPTZ NOT NULL,
    L_COMMITDATE    TIMESTAMPTZ NOT NULL,
    L_RECEIPTDATE   TIMESTAMPTZ NOT NULL,
    L_SHIPINSTRUCT  TEXT        NOT NULL,
    L_SHIPMODE      TEXT        NOT NULL,
    L_COMMENT       TEXT        NOT NULL,
    PRIMARY KEY (L_ORDERKEY,L_LINENUMBER)
);
CALL set_table_property('LINEITEM', 'clustering_key', 'L_SHIPDATE,L_ORDERKEY');
CALL set_table_property('LINEITEM', 'segment_key', 'L_SHIPDATE');
CALL set_table_property('LINEITEM', 'distribution_key', 'L_ORDERKEY');
CALL set_table_property('LINEITEM', 'bitmap_columns', 'L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_LINENUMBER,L_RETURNFLAG,L_LINESTATUS,L_SHIPINSTRUCT,L_SHIPMODE,L_COMMENT');
CALL set_table_property('LINEITEM', 'dictionary_encoding_columns', 'L_RETURNFLAG,L_LINESTATUS,L_SHIPINSTRUCT,L_SHIPMODE,L_COMMENT');
CALL set_table_property('LINEITEM', 'time_to_live_in_seconds', '31536000');
COMMIT;
DROP TABLE IF EXISTS ORDERS;
BEGIN;
CREATE TABLE ORDERS
(
    O_ORDERKEY      BIGINT      NOT NULL PRIMARY KEY,
    O_CUSTKEY       INT         NOT NULL,
    O_ORDERSTATUS   TEXT        NOT NULL,
    O_TOTALPRICE    DECIMAL(15,2) NOT NULL,
    O_ORDERDATE     timestamptz NOT NULL,
    O_ORDERPRIORITY TEXT        NOT NULL,
    O_CLERK         TEXT        NOT NULL,
    O_SHIPPRIORITY  INT         NOT NULL,
    O_COMMENT       TEXT        NOT NULL
);
CALL set_table_property('ORDERS', 'segment_key', 'O_ORDERDATE');
CALL set_table_property('ORDERS', 'distribution_key', 'O_ORDERKEY');
CALL set_table_property('ORDERS', 'bitmap_columns', 'O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT');
CALL set_table_property('ORDERS', 'dictionary_encoding_columns', 'O_ORDERSTATUS,O_ORDERPRIORITY,O_CLERK,O_COMMENT');
CALL set_table_property('ORDERS', 'time_to_live_in_seconds', '31536000');
COMMIT;
DROP TABLE IF EXISTS PARTSUPP;
BEGIN;
CREATE TABLE PARTSUPP
(
    PS_PARTKEY    INT    NOT NULL,
    PS_SUPPKEY    INT    NOT NULL,
    PS_AVAILQTY   INT    NOT NULL,
    PS_SUPPLYCOST DECIMAL(15,2) NOT NULL,
    PS_COMMENT    TEXT   NOT NULL,
    PRIMARY KEY(PS_PARTKEY,PS_SUPPKEY)
);
CALL set_table_property('PARTSUPP', 'distribution_key', 'PS_PARTKEY');
CALL set_table_property('PARTSUPP', 'colocate_with', 'LINEITEM');
CALL set_table_property('PARTSUPP', 'bitmap_columns', 'PS_PARTKEY,PS_SUPPKEY,PS_AVAILQTY,PS_COMMENT');
CALL set_table_property('PARTSUPP', 'dictionary_encoding_columns', 'PS_COMMENT');
CALL set_table_property('PARTSUPP', 'time_to_live_in_seconds', '31536000');
COMMIT;
DROP TABLE IF EXISTS PART;
BEGIN;
CREATE TABLE PART
(
    P_PARTKEY     INT    NOT NULL PRIMARY KEY,
    P_NAME        TEXT   NOT NULL,
    P_MFGR        TEXT   NOT NULL,
    P_BRAND       TEXT   NOT NULL,
    P_TYPE        TEXT   NOT NULL,
    P_SIZE        INT    NOT NULL,
    P_CONTAINER   TEXT   NOT NULL,
    P_RETAILPRICE DECIMAL(15,2) NOT NULL,
    P_COMMENT     TEXT   NOT NULL
);
CALL set_table_property('PART', 'distribution_key', 'P_PARTKEY');
CALL set_table_property('PART', 'bitmap_columns', 'P_PARTKEY,P_SIZE,P_NAME,P_MFGR,P_BRAND,P_TYPE,P_CONTAINER,P_COMMENT');
CALL set_table_property('PART', 'dictionary_encoding_columns', 'P_NAME,P_MFGR,P_BRAND,P_TYPE,P_CONTAINER,P_COMMENT');
CALL set_table_property('PART', 'time_to_live_in_seconds', '31536000');
COMMIT;
DROP TABLE IF EXISTS CUSTOMER;
BEGIN;
CREATE TABLE CUSTOMER
(
    C_CUSTKEY    INT    NOT NULL PRIMARY KEY,
    C_NAME       TEXT   NOT NULL,
    C_ADDRESS    TEXT   NOT NULL,
    C_NATIONKEY  INT    NOT NULL,
    C_PHONE      TEXT   NOT NULL,
    C_ACCTBAL    DECIMAL(15,2) NOT NULL,
    C_MKTSEGMENT TEXT   NOT NULL,
    C_COMMENT    TEXT   NOT NULL
);
CALL set_table_property('CUSTOMER', 'distribution_key', 'C_CUSTKEY');
CALL set_table_property('CUSTOMER', 'bitmap_columns', 'C_CUSTKEY,C_NATIONKEY,C_NAME,C_ADDRESS,C_PHONE,C_MKTSEGMENT,C_COMMENT');
CALL set_table_property('CUSTOMER', 'dictionary_encoding_columns', 'C_NAME,C_ADDRESS,C_PHONE,C_MKTSEGMENT,C_COMMENT');
CALL set_table_property('CUSTOMER', 'time_to_live_in_seconds', '31536000');
COMMIT;
DROP TABLE IF EXISTS SUPPLIER;
BEGIN;
CREATE TABLE SUPPLIER
(
    S_SUPPKEY   INT    NOT NULL PRIMARY KEY,
    S_NAME      TEXT   NOT NULL,
    S_ADDRESS   TEXT   NOT NULL,
    S_NATIONKEY INT    NOT NULL,
    S_PHONE     TEXT   NOT NULL,
    S_ACCTBAL   DECIMAL(15,2) NOT NULL,
    S_COMMENT   TEXT   NOT NULL
);
CALL set_table_property('SUPPLIER', 'distribution_key', 'S_SUPPKEY');
CALL set_table_property('SUPPLIER', 'bitmap_columns', 'S_SUPPKEY,S_NAME,S_ADDRESS,S_NATIONKEY,S_PHONE,S_COMMENT');
CALL set_table_property('SUPPLIER', 'dictionary_encoding_columns', 'S_NAME,S_ADDRESS,S_PHONE,S_COMMENT');
CALL set_table_property('SUPPLIER', 'time_to_live_in_seconds', '31536000');
COMMIT;
DROP TABLE IF EXISTS NATION;
BEGIN;
CREATE TABLE NATION(
  N_NATIONKEY INT NOT NULL PRIMARY KEY,
  N_NAME text NOT NULL,
  N_REGIONKEY INT NOT NULL,
  N_COMMENT text NOT NULL
);
CALL set_table_property('NATION', 'distribution_key', 'N_NATIONKEY');
CALL set_table_property('NATION', 'bitmap_columns', 'N_NATIONKEY,N_NAME,N_REGIONKEY,N_COMMENT');
CALL set_table_property('NATION', 'dictionary_encoding_columns', 'N_NAME,N_COMMENT');
CALL set_table_property('NATION', 'time_to_live_in_seconds', '31536000');
COMMIT;
DROP TABLE IF EXISTS REGION;
BEGIN;
CREATE TABLE REGION
(
    R_REGIONKEY INT  NOT NULL PRIMARY KEY,
    R_NAME      TEXT NOT NULL,
    R_COMMENT   TEXT
);
CALL set_table_property('REGION', 'distribution_key', 'R_REGIONKEY');
CALL set_table_property('REGION', 'bitmap_columns', 'R_REGIONKEY,R_NAME,R_COMMENT');
CALL set_table_property('REGION', 'dictionary_encoding_columns', 'R_NAME,R_COMMENT');
CALL set_table_property('REGION', 'time_to_live_in_seconds', '31536000');
COMMIT;
    3. Create an internal table for GitHub public event data.
      Click the icon in the upper-left corner. On the new temporary Query page, select the instance name and database you created, enter the following sample code in the SQL editor, and click Run.
      The sample SQL statement creates an internal table named gh_event_data and sets the distribution_key, event_time_column, and clustering_key table properties for subsequent data import and high-performance queries.
DROP TABLE IF EXISTS gh_event_data;
BEGIN;
CREATE TABLE gh_event_data (
    id bigint,
    actor_id bigint,
    actor_login text,
    repo_id bigint,
    repo_name text,
    org_id bigint,
    org_login text,
    type text,
    created_at timestamp with time zone NOT NULL,
    action text,
    iss_or_pr_id bigint,
    number bigint,
    comment_id bigint,
    commit_id text,
    member_id bigint,
    rev_or_push_or_rel_id bigint,
    ref text,
    ref_type text,
    state text,
    author_association text,
    language text,
    merged boolean,
    merged_at timestamp with time zone,
    additions bigint,
    deletions bigint,
    changed_files bigint,
    push_size bigint,
    push_distinct_size bigint,
    hr text,
    month text,
    year text,
    ds text
);
CALL set_table_property('public.gh_event_data', 'distribution_key', 'id');
CALL set_table_property('public.gh_event_data', 'event_time_column', 'created_at');
CALL set_table_property('public.gh_event_data', 'clustering_key', 'created_at');
COMMIT;
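To verify the properties that set_table_property applied, you can query Hologres's system catalog. A hedged sketch: the view name hologres.hg_table_properties is an assumption based on the Hologres documentation; confirm its availability on your instance version:

```sql
-- Inspect the table properties set above (view name per Hologres docs;
-- verify that it exists on your instance version).
SELECT property_key, property_value
FROM hologres.hg_table_properties
WHERE table_name = 'gh_event_data';
```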

Import sample data

After the internal tables are created, you can import data into them through the following steps. External tables do not store data in Hologres; they store only field mappings. Through the external tables, Hologres can directly query the data stored in the MaxCompute public project MAXCOMPUTE_PUBLIC_DATA.

  1. On the SQL Editor page, click the icon in the upper-left corner.

  2. Import the TPC-H dataset.
    On the new temporary Query page, select the instance name and database you created, enter the following sample code in the SQL editor, and click Run.
    The sample SQL statements import data from tables such as public.odps_customer_10g and public.odps_lineitem_10g in the MaxCompute public project MAXCOMPUTE_PUBLIC_DATA into the internal tables with the corresponding names for subsequent queries.

INSERT INTO public.customer SELECT * FROM public.odps_customer_10g ;
INSERT INTO public.lineitem SELECT * FROM public.odps_lineitem_10g ;
INSERT INTO public.nation SELECT * FROM public.odps_nation_10g ;
INSERT INTO public.orders SELECT * FROM public.odps_orders_10g ;
INSERT INTO public.part SELECT * FROM public.odps_part_10g ;
INSERT INTO public.partsupp SELECT * FROM public.odps_partsupp_10g ;
INSERT INTO public.region SELECT * FROM public.odps_region_10g ;
INSERT INTO public.supplier SELECT * FROM public.odps_supplier_10g ;
vacuum nation;
vacuum region;
vacuum supplier;
vacuum customer;
vacuum part;
vacuum partsupp;
vacuum orders;
vacuum lineitem;
analyze nation;
analyze region;
analyze lineitem;
analyze orders;
analyze customer;
analyze part;
analyze partsupp;
analyze supplier;
analyze lineitem (l_orderkey,l_partkey,l_suppkey);
analyze orders (o_custkey);
analyze partsupp(ps_partkey,ps_suppkey);
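As an optional sanity check (not in the original tutorial), you can confirm that the import populated the internal tables before running the benchmark queries:

```sql
-- Row counts of the two largest TPC-H tables after import.
SELECT 'lineitem' AS tbl, count(*) AS rows FROM lineitem
UNION ALL
SELECT 'orders', count(*) FROM orders;
```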
  3. Import GitHub public event data.
    Click the icon in the upper-left corner. On the new temporary Query page, select the instance name and database you created, enter the following sample code in the SQL editor, and click Run.
    The sample SQL statement imports the previous day's data from the table dwd_github_events_odps in the MaxCompute public project MAXCOMPUTE_PUBLIC_DATA into the internal table for subsequent queries. Because trial resources are limited, it is recommended that you import and query no more than the most recent 15 days of data.
INSERT INTO gh_event_data
SELECT
    *
FROM
    dwd_github_events_odps
WHERE
    ds >= (CURRENT_DATE - interval '1 day')::text;
analyze gh_event_data;
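Before writing analyses, it can help to confirm the time window that was actually imported. An optional check (not in the original tutorial):

```sql
-- Confirm the imported time range and row count.
SELECT min(created_at) AS earliest,
       max(created_at) AS latest,
       count(*)        AS events
FROM gh_event_data;
```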

Query the data in the table

  1. On the SQL Editor page, click the icon in the upper-left corner.

  2. Query the TPC-H dataset.
    On the new temporary Query page, select the instance name and database you created, enter the following sample code in the SQL editor, and click Run.
    The following SQL queries the internal tables. To query the external tables instead, replace the table names in the code with the external table names.
    For the 22 query statements derived from TPC-H, see the query documentation.

select
        l_returnflag,
        l_linestatus,
        sum(l_quantity) as sum_qty,
        sum(l_extendedprice) as sum_base_price,
        sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
        sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
        avg(l_quantity) as avg_qty,
        avg(l_extendedprice) as avg_price,
        avg(l_discount) as avg_disc,
        count(*) as count_order
from
        lineitem
where
        l_shipdate <= date '1998-12-01' - interval '120' day
group by
        l_returnflag,
        l_linestatus
order by
        l_returnflag,
        l_linestatus;
  3. Query the GitHub public event data. Click the icon in the upper-left corner. On the new temporary Query page, select the instance name and database you created, enter the following sample code in the SQL editor, and click Run. This tutorial provides a few simple analysis statements; you can design other analyses based on the fields in the table. The following SQL queries the internal table. To query the external table instead, replace the table name in the code with the external table name.
    • Query yesterday's most active projects.
SELECT
    repo_name,
    COUNT(*) AS events
FROM
    gh_event_data
WHERE
    created_at >= CURRENT_DATE - interval '1 day'
GROUP BY
    repo_name
ORDER BY
    events DESC
LIMIT 5;
    • Query the most active developers yesterday.
SELECT
    actor_login,
    COUNT(*) AS events
FROM
    gh_event_data
WHERE
    created_at >= CURRENT_DATE - interval '1 day'
    AND actor_login NOT LIKE '%[bot]'
GROUP BY
    actor_login
ORDER BY
    events DESC
LIMIT 5;
    • Query yesterday's programming language ranking.
SELECT
    language,
    count(*) total
FROM
    gh_event_data
WHERE
    created_at > CURRENT_DATE - interval '1 day'
    AND language IS NOT NULL
GROUP BY
    language
ORDER BY
    total DESC
LIMIT 10;
    • Query the ranking of the number of new stars added to the project yesterday (the scene of canceling the star is not considered).
SELECT
    repo_id,
    repo_name,
    COUNT(actor_login) total
FROM
    gh_event_data
WHERE
    type = 'WatchEvent'
    AND created_at > CURRENT_DATE - interval '1 day'
GROUP BY
    repo_id,
    repo_name
ORDER BY
    total DESC
LIMIT 10;
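The same pattern extends to other GitHub event types. As an additional hedged example (not in the original tutorial), ranking yesterday's most-forked repositories via the ForkEvent type:

```sql
-- Yesterday's most-forked repositories.
SELECT
    repo_name,
    COUNT(*) AS forks
FROM
    gh_event_data
WHERE
    type = 'ForkEvent'
    AND created_at > CURRENT_DATE - interval '1 day'
GROUP BY
    repo_name
ORDER BY
    forks DESC
LIMIT 10;
```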

Finish

After completing the preceding operations, you have successfully queried data in Hologres. After a query succeeds, the Result tab appears below the temporary Query page, showing results like the following.

  • Example query results based on the TPC-H dataset.

  • Example query results based on GitHub public event data:

    • Yesterday's most active projects

    • Yesterday's most active developers

    • Yesterday's programming language ranking

    • Yesterday's new-star ranking for projects

  • Free trial of Hologres: 5000 CU of compute and 20 GB of storage. Go to trial >>

  • Learn more about Hologres: https://www.aliyun.com/product/bigdata/hologram

Welcome to join the Hologres developer community.

Origin my.oschina.net/u/5583868/blog/10082981