Big Data Apache Hive SQL Basics (Introduction to HQL)

Hive SQL is an essential skill for almost every analyst in the Internet industry. Anyone who has interviewed at a large tech company has probably been asked Hive optimization questions. A solid HQL foundation therefore matters: it makes daily analysis work faster and smoother, and it helps you land a better offer when you change jobs.

This article is an introduction to Hive and mainly covers the basic syntax of Hive SQL. It aims to be concise, so that you can get started with HQL in about 30 minutes.

Throughout the article, HQL is compared with relational database SQL from several angles, so it suits readers who already have some SQL background. (If you have not yet learned basic SQL, see "w3c school - SQL" first to get up to speed quickly.)

---------- An article on Hive optimization will follow; stay tuned.

1. Introduction to Hive

Simply put, Hive is a data warehouse tool based on Hadoop.

Hive's computation is based on MapReduce, a computing model implemented by Hadoop. MapReduce splits a computing task into multiple processing units and distributes them across a cluster of commodity or server-grade machines, which reduces cost and improves horizontal scalability.

Hive's data is stored on a Hadoop distributed file system, namely HDFS.

To be clear, as a data warehouse tool Hive has three "can'ts" compared with an RDBMS (relational database):

  1. It cannot respond in real time like an RDBMS; Hive queries have high latency;
  2. It cannot run transactional workloads like an RDBMS; classic Hive tables have no transaction mechanism;
  3. It cannot perform row-level changes (insert, update, delete) like an RDBMS. (Newer Hive versions add optional ACID transactional tables, but batch-style reads and writes remain the norm.)

In addition, Hive is a "looser" world than an RDBMS. For example:

  • In practice Hive columns rarely use fixed-length character types; text is usually stored as the variable-length STRING type;
  • Hive uses schema on read. It does not validate data when it is written to a table; validation happens at read time, and values that do not match the declared format are returned as NULL.
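
The schema-on-read behavior can be seen with a small sketch. The table name people, the file path /tmp/people.csv, and the file contents are all hypothetical and only serve as an illustration:

```sql
-- Hypothetical table over comma-delimited text files.
CREATE TABLE IF NOT EXISTS people (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Loading only copies the file into the table directory; nothing is validated here.
-- Assume /tmp/people.csv contains the two lines "1,Alice" and "abc,Bob".
LOAD DATA LOCAL INPATH '/tmp/people.csv' INTO TABLE people;

-- The mismatch surfaces at read time: "abc" cannot be parsed as INT,
-- so that row comes back with id = NULL instead of raising an error.
SELECT id, name FROM people;
```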

2. Hive query statement

The general SELECT syntax in Hive is almost the same as in RDBMS SQL such as MySQL. The syntax is listed below without detailed explanation. This section focuses on techniques that are specific to Hive and that I use in my daily work, for your reference.

2.1 SELECT syntax and clause order

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY order_condition]
[DISTRIBUTE BY distribute_condition [SORT BY sort_condition] ]
[LIMIT number]
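
As a concrete illustration of the clause order, here is a minimal sketch against a hypothetical requests table with os, device, and city columns (the filter value is made up):

```sql
-- Top 10 operating systems by request count for one city.
SELECT os, COUNT(*) AS request_cnt
FROM requests
WHERE city = 'Beijing'
GROUP BY os
ORDER BY request_cnt DESC
LIMIT 10;

-- DISTRIBUTE BY sends rows with the same os to the same reducer;
-- SORT BY then orders rows within each reducer (not globally).
SELECT os, device, city
FROM requests
DISTRIBUTE BY os
SORT BY os, city;
```

ORDER BY produces one global ordering, while DISTRIBUTE BY with SORT BY only guarantees order inside each reducer's output, which is usually cheaper on large data.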

2.2 Multi-dimensional aggregation: grouping sets/cube/rollup

An example illustrates what the three do and how they differ. Suppose requests is the back-end request table, and we need aggregates over three different dimensions: How many requests are there in total? How many requests are there for each operating system and device? How many requests are there in each city?

Without multi-dimensional aggregation, this takes three queries combined with UNION ALL,

SELECT NULL, NULL, NULL, COUNT(*)
FROM requests
UNION ALL
SELECT os, device, NULL, COUNT(*)
FROM requests GROUP BY os, device
UNION ALL
SELECT null, null, city, COUNT(*)
FROM requests GROUP BY city;

Using grouping sets,

SELECT os, device, city, COUNT(*)
FROM requests
GROUP BY os, device, city GROUPING SETS((os, device), (city), ());

CUBE enumerates every possible combination of the specified columns as grouping sets, while ROLLUP generates grouping sets as a hierarchy of aggregation levels. For example,

GROUP BY CUBE(a, b, c)
-- is equivalent to:
GROUPING SETS((a,b,c),(a,b),(a,c),(b,c),(a),(b),(c),())

GROUP BY ROLLUP(a, b, c)
-- is equivalent to:
GROUPING SETS((a,b,c),(a,b),(a),())
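
Applied to the requests table from the example above, the two shorthands might look like the sketch below. It uses the WITH ROLLUP / WITH CUBE form found in the Apache Hive documentation; whether the CUBE(...) / ROLLUP(...) function form is also accepted depends on the engine and version, so check that against your environment:

```sql
-- Hierarchical totals: (os, device, city), (os, device), (os), ()
SELECT os, device, city, COUNT(*) AS request_cnt
FROM requests
GROUP BY os, device, city WITH ROLLUP;

-- All 8 combinations of the three columns
SELECT os, device, city, COUNT(*) AS request_cnt
FROM requests
GROUP BY os, device, city WITH CUBE;
```

In the result, a column that is not part of a given grouping set is returned as NULL.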

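2.3 Selecting columns with a regular expression

Although this is described as selecting columns, it is mostly used to exclude them. For example, the backquoted column specification `(num|uid)?+.+` selects every column except num and uid. In Hive 0.13 and later this only works when hive.support.quoted.identifiers is set to none.

In addition, the WHERE clause can use regular expressions: WHERE a RLIKE b, or WHERE a REGEXP b.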
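
Both uses can be sketched as follows. The table user_log and its columns num, uid, and page are hypothetical names chosen for illustration:

```sql
-- Needed in Hive 0.13+ so that backquoted names are treated as regular expressions.
SET hive.support.quoted.identifiers=none;

-- Select every column of user_log except num and uid.
SELECT `(num|uid)?+.+`
FROM user_log;

-- Keep only rows whose page value is "home" followed by one or more digits.
SELECT *
FROM user_log
WHERE page RLIKE '^home[0-9]+$';
```

The patterns follow Java regular expression syntax.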

2.4 Lateral View (one row becomes multiple rows)

Lateral View is used together with table-generating functions (UDTFs) such as explode, often combined with split. It turns one row into multiple rows, and the resulting rows can then be aggregated.

Suppose there is a table pageAds with two columns: the first is pageid (a string), and the second is adid_list, a collection of ad IDs separated by commas.

We now need to count how many times each ad appears across all pages. First use Lateral View + explode to split the rows, then group and aggregate as usual.

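-- Assumes adid_list is an ARRAY column; if it is stored as a comma-separated
-- STRING, use explode(split(adid_list, ',')) instead.
SELECT pageid, adid
FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid;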
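
The second step, grouping and aggregating the exploded rows, could look like the following sketch. It assumes adid_list is an ARRAY column; the alias occurrences is an illustrative name:

```sql
-- Count how many times each ad appears across all pages.
SELECT adid, COUNT(*) AS occurrences
FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid
GROUP BY adid;
```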

2.5 Window functions

Hive offers a rich set of window functions, which not every RDBMS has. (MySQL, for example, did not support window functions before version 8.0, so a grouped ranking required complicated SQL with user-defined variables.)

The most commonly used window function is row_number() over(partition by col order by col_2), which numbers the rows within each group in the specified order.
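
A typical use is picking the first row per group. The sketch below assumes a hypothetical orders table with user_id, order_time, and amount columns, and returns each user's most recent order:

```sql
SELECT user_id, order_time, amount
FROM (
  SELECT user_id,
         order_time,
         amount,
         -- Number each user's orders, newest first.
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_time DESC) AS rn
  FROM orders
) t
WHERE rn = 1;
```

Changing the filter to rn <= 3 would keep the three most recent orders per user instead.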

I won't go into detail about the other window functions here; the topic is large enough for a separate article. I recommend the "window function" documentation of Alibaba Cloud MaxCompute, which is very detailed.

2.6 Code reuse

  • CTE reuse: with t1 as (...);
  • Alibaba Cloud MaxCompute supports SQL Script mode, which allows variables created with @var := to be reused.

-- MaxCompute script mode: store a query result in a table variable.
@var := select
          shop_id
        from shop
        where ...;

-- A CTE must be followed directly by the statement that uses it.
with t1 as (
    select user_id
    from user
    where ...
)
select *
from user_shop
where user_id in (select user_id from t1)
and shop_id in (select shop_id from @var);

3. Hive definition statement (DDL)

3.1 Hive table creation statement format

Method 1: Independent declaration

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [DEFAULT value] [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name [, col_name, ...]) [SORTED BY (col_name [ASC | DESC] [, col_name [ASC | DESC] ...])] INTO number_of_buckets BUCKETS] 
[STORED BY StorageHandler] -- external tables only
[WITH SERDEPROPERTIES (Options)] -- external tables only
[LOCATION OSSLocation] -- external tables only
[LIFECYCLE days]
[AS select_statement];

Method 2: Copy directly from an existing table

CREATE TABLE [IF NOT EXISTS] table_name
 LIKE existing_table_name;

The key clauses are explained below:

  • [EXTERNAL]: Declares an external table, typically used when the table data is shared by multiple tools. Dropping an external table deletes only the metadata, not the data.
  • col_name data_type: Define data_type rigorously. Do not take the lazy route of using string for values that are really bigint, double, etc., or the data will eventually cause problems. (Colleagues on my team have made this mistake.)
  • [IF NOT EXISTS]: Without this option, creating a table whose name already exists returns an error. With it, the statement is ignored if the table already exists and creates the table otherwise.
  • [DEFAULT value]: Specifies the column's default value. When an INSERT does not supply the column, the default value is written.
  • [PARTITIONED BY]: Specifies the partition columns of the table. With partitioning, adding partitions, updating data in a partition, and reading a partition do not require a full table scan, which improves processing efficiency.
  • [LIFECYCLE]: The life cycle of the table. For a partitioned table, each partition has the same life cycle as the table. (This clause is MaxCompute syntax; Apache Hive has no LIFECYCLE clause.)
  • [AS select_statement]: Populates the new table directly from the result of a SELECT statement.

Simple example: create a table sale_detail to store sales records, using the sale date sale_date and the sales region region as partition columns.

create table if not exists sale_detail
(
shop_name     string,
customer_id   string,
total_price   double
)
partitioned by (sale_date string, region string);

Once the table is created, you can view its definition with desc,

desc <table_name>;
desc extended <table_name>; -- view extended information, including external table details

If you don't remember the full table name, you can search for it within a database (db) using show tables,

use db_name;
show tables 'tb.*'; -- tb.* is a regular expression
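
Building on sale_detail, the two remaining ways of creating a table (copying a definition with LIKE and creating from a query with AS) might be sketched as follows. The table names sale_detail_bak and big_sales and the price threshold are illustrative assumptions:

```sql
-- Copy only the table definition (columns and partition columns), no data.
CREATE TABLE IF NOT EXISTS sale_detail_bak
LIKE sale_detail;

-- Create a new table from a query result (CTAS).
-- The result here is a plain, non-partitioned table;
-- sale_date and region become ordinary columns.
CREATE TABLE IF NOT EXISTS big_sales AS
SELECT shop_name, customer_id, total_price, sale_date, region
FROM sale_detail
WHERE total_price > 1000;
```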

3.2 Hive delete table statement format

DROP TABLE [IF EXISTS] table_name;  -- drop a table
ALTER TABLE table_name DROP [IF EXISTS] PARTITION (partition_col1 = partition_col_value1, ...);  -- drop a partition

3.3 Hive change table definition statement format

ALTER TABLE table_name RENAME TO table_name_new;  -- rename the table
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION (partition_col1 = partition_col_value1 ...);  -- add a partition
ALTER TABLE table_name ADD COLUMNS (col_name1 type1 comment 'XXX');  -- add a column, defining its type and comment
ALTER TABLE table_name CHANGE COLUMN old_col_name new_col_name column_type COMMENT column_comment;  -- change a column's name and comment
ALTER TABLE table_name SET lifecycle days;  -- change the life cycle
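
Applied to the sale_detail table from section 3.1, these statements might look like the sketch below. The partition values and the discount column are made-up examples:

```sql
-- Add a partition for one day and region.
ALTER TABLE sale_detail
  ADD IF NOT EXISTS PARTITION (sale_date = '2023-01-01', region = 'china');

-- Add a column with a type and a comment.
ALTER TABLE sale_detail
  ADD COLUMNS (discount DOUBLE COMMENT 'discount applied to the sale');

-- Rename that column and update its comment.
ALTER TABLE sale_detail
  CHANGE COLUMN discount discount_rate DOUBLE COMMENT 'discount rate between 0 and 1';

-- Drop the partition again.
ALTER TABLE sale_detail
  DROP IF EXISTS PARTITION (sale_date = '2023-01-01', region = 'china');
```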

4. Hive operation statement

Hive insert statement format,

INSERT OVERWRITE|INTO TABLE tablename [PARTITION (partcol1=val1, ...)]
select_statement
FROM from_statement;

The key clauses are explained below:

  • into|overwrite: into appends data directly to the table or partition; overwrite first clears the existing data in the table or partition and then inserts the new data.
  • [PARTITION (partcol1=val1, ...)]: Partition values must be constants; expressions such as function calls are not allowed.

Regarding PARTITION, there are two ways to insert: into a specified partition and into dynamic partitions,

  • Output to a specified partition: the partition value is given directly in the INSERT statement, and the data is inserted into that partition.
  • Output to dynamic partitions: the INSERT statement gives only the partition column names, not their values. The values are supplied by the SELECT clause, and the system automatically writes each row into the partition that matches its partition column values.
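
The two insertion modes can be sketched against the sale_detail table from section 3.1. The source table sale_staging and the partition values are hypothetical; the two SET statements are the Hive settings that enable fully dynamic partitions:

```sql
-- Specified (static) partition: the partition values are constants in the statement.
INSERT OVERWRITE TABLE sale_detail PARTITION (sale_date = '2023-01-01', region = 'china')
SELECT shop_name, customer_id, total_price
FROM sale_staging
WHERE sale_date = '2023-01-01' AND region = 'china';

-- Dynamic partitions: only the partition column names are listed.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition columns come last in the SELECT list, in the same order
-- as in the PARTITION clause; each row lands in its matching partition.
INSERT OVERWRITE TABLE sale_detail PARTITION (sale_date, region)
SELECT shop_name, customer_id, total_price, sale_date, region
FROM sale_staging;
```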

That concludes this introduction to Hive. I hope it is helpful to you as an analyst.

Origin blog.csdn.net/Wis57/article/details/129924474