Syntax
[WITH CommonTableExpression (, CommonTableExpression)*] (Note: Only available starting with Hive 0.13.0)
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT [offset,] rows]
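Notes
- A SELECT statement can be part of a union query or a subquery of another query.
- table_reference indicates the input to the query. It can be a regular table, a view, a join construct, or a subquery.
- Table names and column names are case-insensitive.
- As of Hive 0.13, the FROM clause is optional, for example: SELECT 1+1
- Use SELECT current_database() to see the current database.
- To specify a database, either qualify the table name with the database name ("db_name.table_name", as of Hive 0.7) or issue a USE statement before the query (as of Hive 0.6).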
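The last three notes can be tried out with a few short statements. This is a minimal sketch: the database name mydb is hypothetical, and employee is the sample table used in the examples in this article.

```sql
-- FROM is optional as of Hive 0.13
SELECT 1+1;

-- Show the database the session is currently using
SELECT current_database();

-- Option 1: qualify the table with its database (mydb is a hypothetical name)
SELECT name FROM mydb.employee;

-- Option 2: switch the current database first, then use the bare table name
USE mydb;
SELECT name FROM employee;
```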
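ALL and DISTINCT Clauses
The ALL and DISTINCT options specify whether duplicate rows are returned. If neither is given, the default is ALL, which returns every matching row. DISTINCT removes duplicate rows from the result set. SELECT DISTINCT * is supported as of Hive 1.1.0.
Example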
> SELECT DISTINCT name, work_place FROM employee;
+----------+-------------------------+
| name | work_place |
+----------+-------------------------+
| Lucy | ["Vancouver"] |
| Michael | ["Montreal","Toronto"] |
| Shelley | ["New York"] |
| Will | ["Montreal"] |
+----------+-------------------------+
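To contrast the two options, the sketch below runs the same projection with ALL and with DISTINCT. Assuming the four sample employee rows used throughout this article (two Male, two Female), the first query should return one row per employee and the second one row per distinct gender.

```sql
-- ALL is the default, so this is the same as omitting the keyword
SELECT ALL gender_age.gender FROM employee;

-- DISTINCT drops duplicate values from the result
SELECT DISTINCT gender_age.gender FROM employee;
```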
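HAVING Clause
The HAVING clause was added in Hive 0.7.0 to filter on aggregate results. Using HAVING avoids wrapping the GROUP BY query in a subquery just to filter on the aggregate.
Example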
> SELECT
gender_age.age
FROM employee
GROUP BY gender_age.age
HAVING count(*)=1;
+-----------------+
| gender_age.age |
+-----------------+
| 27 |
| 30 |
| 35 |
| 57 |
+-----------------+
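For comparison, the same filter written without HAVING needs a subquery around the GROUP BY. This is a sketch of the equivalent form; the aliases age, cnt, and t are introduced here for illustration.

```sql
-- Aggregate in the inner query, then filter on the aggregate in the outer WHERE
SELECT t.age
FROM (
  SELECT gender_age.age AS age, count(*) AS cnt
  FROM employee
  GROUP BY gender_age.age
) t
WHERE t.cnt = 1;
```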
Using the IF Function or a CASE WHEN Expression
> SELECT
CASE WHEN gender_age.gender = 'Female' THEN 'Ms.'
ELSE 'Mr.' END as title, name,
IF(array_contains(work_place, 'New York'), 'US', 'CA') as country
FROM employee;
+-------+---------+---------+
| title | name | country |
+-------+---------+---------+
| Mr. | Michael | CA |
| Mr. | Will | CA |
| Ms. | Shelley | US |
| Ms. | Lucy | CA |
+-------+---------+---------+
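Partition-Based Queries
In general, a SELECT query scans the entire table (other than for sampling). If a table was created with the PARTITIONED BY clause, a query can do partition pruning and scan only the part of the table relevant to the partitions the query specifies. Hive performs partition pruning when the partition predicates are specified in the WHERE clause or in the ON clause of a JOIN. For example, if table page_views is partitioned on column date, the following query retrieves only rows for dates between 2008-03-01 and 2008-03-31.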
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31'
If page_views is joined with another table, dim_users, the partition range can be specified in the ON clause:
SELECT page_views.*
FROM page_views JOIN dim_users
ON (page_views.user_id = dim_users.id AND page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31')
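To make the setup behind these queries concrete, here is a hedged sketch of how such a table might be declared and how pruning could be checked. Only user_id and the date partition column come from the text; the url column and the column types are assumptions. DATE is a reserved word in recent Hive versions, so the sketch quotes it with backticks.

```sql
-- Hypothetical DDL: url and the column types are assumed for illustration
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (`date` STRING);

-- List the partitions that currently exist
SHOW PARTITIONS page_views;

-- EXPLAIN DEPENDENCY reports the tables and partitions a query would read,
-- which is one way to see whether the date predicate pruned partitions
EXPLAIN DEPENDENCY
SELECT page_views.*
FROM page_views
WHERE page_views.`date` >= '2008-03-01' AND page_views.`date` <= '2008-03-31';
```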
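Regex-Based Column Specification
A SELECT statement can take a regex-based column specification in Hive releases prior to 0.13.0. In 0.13.0 and later, this feature is available only if the configuration property hive.support.quoted.identifiers is set to none.
Example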
> SET hive.support.quoted.identifiers=none;
> SELECT `^work.*` FROM employee; -- select the columns whose names start with work
+-------------------------+
| employee.work_place |
+-------------------------+
| ["Montreal","Toronto"] |
| ["Montreal"] |
| ["New York"] |
| ["Vancouver"] |
+-------------------------+
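Another use of this feature, following the exclusion pattern given in the Hive language manual, is to select every column except a few. The sketch below applies that pattern to the sample employee table and should return all columns other than work_place.

```sql
SET hive.support.quoted.identifiers=none;

-- The regex matches every column name except work_place
SELECT `(work_place)?+.+` FROM employee;
```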
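Conditional Queries
It is very common to narrow down a result set with condition clauses such as LIMIT, WHERE, IN/NOT IN, and EXISTS/NOT EXISTS.
LIMIT Clause
The LIMIT clause constrains the number of rows returned by a SELECT statement. LIMIT takes one or two numeric arguments, both of which must be non-negative integer constants. With two arguments (supported as of Hive 2.0.0), the first specifies the offset of the first row to return and the second specifies the maximum number of rows to return. With a single argument, the value is the maximum number of rows and the offset defaults to 0.
Example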
> SELECT name FROM employee LIMIT 2;
+----------+
| name |
+----------+
| Michael |
| Will |
+----------+
> SELECT name FROM employee LIMIT 2,2;
+----------+
| name |
+----------+
| Shelley |
| Lucy |
+----------+
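Without ORDER BY, Hive does not guarantee which rows a LIMIT returns, so the output above depends on the order in which the rows happen to be read. A sketch of a deterministic variant, assuming Hive 2.0.0 or later for the offset form:

```sql
-- Skip the first 2 names in alphabetical order, then return at most 2
SELECT name FROM employee ORDER BY name LIMIT 2,2;
```

With the four sample names, this should return Shelley and Will.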
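WHERE Clause
The WHERE clause is followed by a boolean expression that filters the query result. As of Hive 0.13, some types of subqueries are supported in the WHERE clause.
Example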
> SELECT name, work_place FROM employee WHERE name = 'Michael';
+----------+-------------------------+
| name | work_place |
+----------+-------------------------+
| Michael | ["Montreal","Toronto"] |
+----------+-------------------------+
All the condition clauses can be used together, but the other clauses must come after the WHERE clause:
> SELECT name, work_place FROM employee WHERE name = 'Michael' LIMIT 1;
+----------+-------------------------+
| name | work_place |
+----------+-------------------------+
| Michael | ["Montreal","Toronto"] |
+----------+-------------------------+
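As an illustration of a subquery inside WHERE (Hive 0.13 or later), the sketch below keeps only the employees whose name also appears in a second table. The table employee_hr and its name column are hypothetical and are not defined in this article.

```sql
SELECT e.name, e.work_place
FROM employee e
WHERE e.name IN (SELECT h.name FROM employee_hr h)  -- employee_hr is a hypothetical table
LIMIT 10;
```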
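IN/NOT IN
IN/NOT IN is used as an expression that checks whether a value belongs to the set specified by IN or NOT IN. As of Hive 2.1.0, IN and NOT IN also support more than one column.
Example
> SELECT name FROM employee WHERE gender_age.age in (27, 30);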
+----------+
| name |
+----------+
| Michael |
| Shelley |
+----------+
Multiple columns
> SELECT name, gender_age
FROM employee
WHERE (gender_age.gender, gender_age.age) IN (('Female', 27), ('Male', 27 + 3));
+----------+-------------------------------+
| name | gender_age |
+----------+-------------------------------+
| Michael | {"gender":"Male","age":30} |
| Shelley | {"gender":"Female","age":27} |
+----------+-------------------------------+
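NOT IN is the negated form. As a sketch against the same sample data, the query below keeps the employees whose age is neither 27 nor 30, which should leave the two employees not returned by the single-column IN example.

```sql
SELECT name FROM employee WHERE gender_age.age NOT IN (27, 30);
```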
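EXISTS/NOT EXISTS
A subquery used with EXISTS/NOT EXISTS is expected to reference both inner and outer expressions, that is, to be correlated. The subquery in the example below filters only on b and does not reference the outer alias a, so the condition has the same value for every outer row and all rows are returned.
> SELECT name, gender_age.gender as gender FROM employee a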
WHERE EXISTS (
SELECT * FROM employee b
WHERE b.gender_age.gender = 'Male' );
+----------+---------+
| name | gender |
+----------+---------+
| Michael | Male |
| Will | Male |
| Shelley | Female |
| Lucy | Female |
+----------+---------+
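A correlated form of the query above ties the subquery to the outer row through alias a. In the sketch below, the predicate on a.gender_age.gender is an addition for illustration; with the sample data it should keep only the rows whose gender is Male.

```sql
SELECT name, gender_age.gender AS gender
FROM employee a
WHERE EXISTS (
  SELECT *
  FROM employee b
  WHERE a.gender_age.gender = b.gender_age.gender  -- correlation with the outer row
    AND b.gender_age.gender = 'Male'
);
```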
References
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-LogicalOperators
Book: Apache Hive Essentials, Second Edition (by Dayong Du), Chapter 4 and Chapter 6