HiveQL basic queries

1. Basic SELECT operation

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list [HAVING condition]]
[ CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY | ORDER BY col_list]
]
[LIMIT number]
• The ALL and DISTINCT options control how duplicate records are handled. The default is ALL, which returns all records; DISTINCT removes duplicate records
• WHERE condition
• Similar to the WHERE condition in traditional SQL
• Supports AND and OR; BETWEEN is supported as of version 0.9
• Supports IN and NOT IN
• Does not support EXISTS or NOT EXISTS
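The supported WHERE operators can be sketched as follows. This is only an illustration: it assumes the invites table used in the examples further down, with a numeric column foo and a partition column ds, and the literal values are made up.

```sql
-- AND / OR combined in one condition
SELECT a.foo FROM invites a WHERE a.ds = '<DATE>' AND (a.foo > 0 OR a.foo < -100);

-- BETWEEN (Hive 0.9 and later)
SELECT a.foo FROM invites a WHERE a.foo BETWEEN 10 AND 20;

-- IN and NOT IN with a literal list
SELECT a.foo FROM invites a WHERE a.foo IN (1, 2, 3);
SELECT a.foo FROM invites a WHERE a.foo NOT IN (1, 2, 3);
```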
Difference between ORDER BY and SORT BY
• ORDER BY sorts globally, so it runs with only one reduce task
• SORT BY sorts only locally, within each reducer
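A minimal sketch of the difference, assuming a table test with an amount column (as in the top-k example) and a hypothetical rep_id column that is not part of the original text:

```sql
-- ORDER BY: one total order over the whole result (single reducer)
SELECT * FROM test ORDER BY amount DESC;

-- SORT BY: rows are ordered only within each reducer.
-- DISTRIBUTE BY sends all rows with the same rep_id to the same reducer.
SELECT * FROM test DISTRIBUTE BY rep_id SORT BY rep_id, amount DESC;

-- CLUSTER BY rep_id is shorthand for DISTRIBUTE BY rep_id SORT BY rep_id
SELECT * FROM test CLUSTER BY rep_id;
```

With more than one reducer, the SORT BY output is sorted per reducer output file, not across the whole result.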

LIMIT
• LIMIT restricts the number of records returned
SELECT * FROM t1 LIMIT 5;
• Implementing a top-k query
• The following statements return the 5 sales representatives with the largest sales amounts. Setting the number of reduce tasks to 1 makes the SORT BY order global.
SET mapred.reduce.tasks = 1;
SELECT * FROM test SORT BY amount DESC LIMIT 5;
• REGEX column specification
The SELECT statement can use a regular expression to select columns. The following statement selects all columns except ds and hr:
SELECT `(ds|hr)?+.+` FROM test
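In Hive 0.13.0 and later, backquoted names are treated as plain identifiers by default, so the regular-expression form is interpreted only when hive.support.quoted.identifiers is set to none. A short sketch, reusing the test table from the statement above:

```sql
-- Needed on Hive 0.13.0 and later; earlier releases accept the regex form directly
SET hive.support.quoted.identifiers=none;

-- All columns except ds and hr
SELECT `(ds|hr)?+.+` FROM test;
```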

For example, query by condition:
hive> SELECT a.foo FROM invites a WHERE a.ds='<DATE>';

Write the query result to an HDFS directory:
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='<DATE>';

Write the query result to a local directory:
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;

Select all columns into a table or a local directory:
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a WHERE a.ds='<DATE>';
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;

Insert the aggregate results of one table into another table:
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
JOIN
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

Insert data from one source table into multiple tables and directories (multi-insert):
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;


Stream rows through an external script (here /bin/cat) and insert the output into a table:
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';

2. Partition-based query

• An ordinary SELECT query scans the entire table. If the table was created with a PARTITIONED BY clause, the query can take advantage of partition pruning (input pruning) and read only the relevant partitions
• In the current Hive implementation, partition pruning is enabled only when the partition predicate appears in the WHERE clause closest to the FROM clause
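As an illustration of partition pruning, the sketch below uses a hypothetical page_views table partitioned by a ds string column; the table and column names are assumptions, not part of the original text.

```sql
-- Hypothetical table, partitioned by date string ds
CREATE TABLE page_views (user_id BIGINT, url STRING)
PARTITIONED BY (ds STRING);

-- The partition predicate sits in the WHERE clause next to FROM,
-- so only the partitions for March 2008 need to be read
SELECT pv.user_id, pv.url
FROM page_views pv
WHERE pv.ds >= '2008-03-01' AND pv.ds <= '2008-03-31';
```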

3. Join

Syntax
join_table:
table_reference JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition


table_reference:
table_factor
| join_table


table_factor:
tbl_name [alias]
| table_subquery alias
| ( table_references )


join_condition:
ON equality_expression ( AND equality_expression )*


equality_expression:
expression = expression
• Hive supports only equality joins, outer joins, and left semi joins. Hive does not support non-equality joins, because they are very difficult to translate into map/reduce tasks

• The LEFT, RIGHT and FULL OUTER keywords control how rows without a match in the join are handled
• LEFT SEMI JOIN is a more efficient implementation of IN/EXISTS subqueries
• In a join, each map/reduce task works as follows: the reducer buffers the rows of every table in the join sequence except the last one, then streams the rows of the last table through and writes the results to the file system. In practice, the largest table should therefore be listed last


Key points to keep in mind when writing join queries

Only equality joins are supported
• SELECT a.* FROM a JOIN b ON (a.id = b.id)
• SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department)
• More than two tables can be joined, for example:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

• If every table in the join is joined on the same key, the join is converted into a single map/reduce task
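The effect of the join key on the number of map/reduce tasks can be sketched with the same tables a, b and c used above:

```sql
-- b.key1 appears in both join conditions: a single map/reduce task.
-- a and b are buffered in the reducer; c, listed last, is streamed.
SELECT a.val, b.val, c.val
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1);

-- b.key1 in the first condition, b.key2 in the second: two map/reduce tasks.
-- The first joins a with b; the second joins that result with c.
SELECT a.val, b.val, c.val
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2);
```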
LEFT, RIGHT and FULL OUTER

Example
•SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key)

• To restrict the output of a join, put the filter condition in the WHERE clause or in the join (ON) clause
• Partitioned tables are a common source of confusion:
• SELECT c.val, d.val FROM c LEFT OUTER JOIN d ON (c.key=d.key)
WHERE c.ds='2009-07-07' AND d.ds='2009-07-07'
• If a row of table c has no matching row in table d, every column of d is NULL in the joined row, including the ds column. The condition on d.ds in the WHERE clause then discards all rows of c that have no match in d, so the LEFT OUTER part of the join has no effect on the result
• Solution: move the partition conditions into the ON clause
• SELECT c.val, d.val FROM c LEFT OUTER JOIN d
ON (c.key=d.key AND d.ds='2009-07-07' AND c.ds='2009-07-07')


LEFT SEMI JOIN
• The limitation of LEFT SEMI JOIN is that the table on the right side of the JOIN can be referenced only in the ON clause, not in the WHERE clause, the SELECT clause, or anywhere else.

• SELECT a.key, a.val FROM a WHERE a.key IN (SELECT b.key FROM b);
can be rewritten as:
SELECT a.key, a.val FROM a LEFT SEMI JOIN b ON (a.key = b.key)
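A sketch of the restriction, assuming table b has a partition column ds (an assumption made only for this example):

```sql
-- Valid: the filter on the right-hand table b is part of the ON clause
SELECT a.key, a.val
FROM a LEFT SEMI JOIN b ON (a.key = b.key AND b.ds = '2009-07-07');

-- Not valid: b is referenced outside the ON clause
-- SELECT a.key, a.val
-- FROM a LEFT SEMI JOIN b ON (a.key = b.key)
-- WHERE b.ds = '2009-07-07';
```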


UNION ALL
• Combines the results of multiple SELECT statements; every SELECT must return the same columns (same number and names)

•select_statement UNION ALL select_statement UNION ALL select_statement ...
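A minimal sketch, reusing tables a and b with columns key and val from the join examples. The union is wrapped in a FROM subquery, which is the form that older Hive releases (before 0.13.0) require; the alias u is chosen only for this example.

```sql
SELECT u.key, u.val
FROM (
  SELECT a.key AS key, a.val AS val FROM a
  UNION ALL
  SELECT b.key AS key, b.val AS val FROM b
) u;
```

Both branches return the columns key and val, so the combined result has a single consistent schema.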
