The nine most error-prone Hive SQL constructs, explained, with usage notes

Tip for reading this article: it rewards slow, careful reading. Skim ten lines at a glance and you will miss many valuable details.

This article was first published on the WeChat public account: Learn Big Data in Five Minutes (五分钟学大数据).

Preface

SQL is the most commonly used tool when building data warehouses and doing data analysis. Its syntax is concise and easy to understand, and the mainstream big data frameworks (Hive, Spark, Flink, and others) all support it, so SQL plays an irreplaceable role in the big data field and deserves close attention.

If you are unfamiliar with SQL, or simply not careful when using it, it is easy to get query results wrong. Let's look at a few error-prone SQL statements and the precautions they call for.


1. decimal

In addition to common types such as int, double, and string, Hive supports the decimal type, which stores exact numeric values and is often used for fields that represent amounts of money.

Precautions:

- decimal(11,2) means up to 11 digits in total, of which 2 are decimal places, leaving up to 9 digits for the integer part;
- if the integer part exceeds 9 digits, the field becomes null; if it does not, the value is displayed as-is;
- if there are fewer than 2 decimal digits, the value is zero-padded on the right to 2 places; if there are more than 2, the excess digits are rounded off;
- you can also write decimal with no precision at all, which defaults to decimal(10,0): a 10-digit integer with no decimal places.
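The rules above can be sketched with a throwaway table (the table name t_dec and the sample values are made up for illustration):

```sql
-- decimal(11,2): up to 11 digits, 2 of them decimals, so 9 integer digits
CREATE TABLE t_dec (amount DECIMAL(11,2));

INSERT INTO t_dec VALUES
  (123456789.1),     -- stored as 123456789.10  (zero-padded to 2 decimal places)
  (123456789.999),   -- stored as 123456790.00  (excess decimals rounded)
  (12345678901.5);   -- integer part has 11 digits > 9, so stored as null
```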

2. location

When creating a table, you can use location to point it at a file or a folder:
create table stu(id int, name string) location '/user/stu2';

Precautions:

When you use location at table creation and point it at a folder, Hive loads every file in that folder. If the table is not partitioned, the folder must not contain sub-folders, otherwise an error is reported.
When the table is partitioned, for example partitioned by (day string), each sub-folder under that folder is one partition, named in the day=20201123 format. In that case run msck repair table score; to repair the table metadata; once it succeeds, all the data is visible in the table.
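Putting the notes above together as a sketch (the table name score, its columns, and the HDFS paths are assumed, following the article's examples):

```sql
-- A partitioned table pointed at a folder that already holds
-- sub-folders named day=20201123, day=20201124, ...
CREATE TABLE score (s_id STRING, s_score INT)
PARTITIONED BY (day STRING)
LOCATION '/user/score';

-- Register those existing folders as partitions in the metastore:
MSCK REPAIR TABLE score;
```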

3. load data vs load data local

Load a file from HDFS:
load data inpath '/hivedatas/techer.csv' into table techer;

Load a file from the local file system:
load data local inpath '/user/test/techer.csv' into table techer;

Precautions:

  1. load data local loads from the local file system; the file is copied to HDFS.
  2. load data loads from HDFS; the file is moved, not copied, into Hive's directory, because Hive assumes the file already has its HDFS replicas (3 by default), so another copy is unnecessary.
  3. If the table is partitioned, loading without specifying a partition reports an error.
  4. If you load a file whose name already exists in the target directory, the new file is automatically renamed.
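For precaution 3 in particular, the partition has to be named in the load statement itself; a sketch (the path, table, and partition value are assumed):

```sql
load data inpath '/hivedatas/score.csv' into table score partition (day='20201123');
```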

4. drop 和 truncate

Drop a table:
drop table score1;

Empty a table:
truncate table score2;

Precautions:

If the recycle bin (trash) is enabled in HDFS, the data of a dropped table can be recovered from the trash; the table structure cannot, and must be recreated by hand. A table emptied by truncate does not go to the trash, so truncated data cannot be restored.
Therefore use truncate with great caution: once a table is emptied, nothing short of physical recovery will bring the data back.
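Hive also offers a PURGE clause on drop table that skips the trash deliberately; a sketch:

```sql
-- With PURGE the data bypasses the HDFS trash and is gone immediately,
-- even when the recycle bin is enabled, so use it even more cautiously
drop table score1 purge;
```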

5. join types

INNER JOIN: only rows for which matching data exists in both tables under the join condition are kept.
select * from techer t inner join course c on t.t_id = c.t_id; -- inner can be omitted

LEFT OUTER JOIN: every row from the left table is returned, plus the matching rows from the right.
select * from techer t left join course c on t.t_id = c.t_id; -- outer can be omitted

RIGHT OUTER JOIN: every row from the right table is returned, plus the matching rows from the left.
select * from techer t right join course c on t.t_id = c.t_id;

FULL OUTER JOIN: returns all records from both tables; wherever one side has no matching value for the specified field, NULL is used instead.
SELECT * FROM techer t FULL JOIN course c ON t.t_id = c.t_id;

Precautions:

  1. Hive 2 already supports non-equi joins: the on condition may use > and <, and may combine conditions with or (earlier versions only allowed = and and after on, not >, <, or or).
  2. If Hive's execution engine is MapReduce, each join launches one job, so a SQL statement with multiple joins launches multiple jobs.
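A minimal sketch of a non-equi join that Hive 2 accepts (the tables follow the article's examples; c_id is an assumed column of course):

```sql
-- Both a < comparison and an or in the on clause:
select *
from techer t
join course c
  on t.t_id < c.c_id or t.t_id = c.t_id;
```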

Note: joining tables with a comma (,) between them is the same as an inner join, for example:

select tableA.id, tableB.name from tableA, tableB where tableA.id = tableB.id;
and
select tableA.id, tableB.name from tableA join tableB on tableA.id = tableB.id;

There is no difference in execution efficiency; only the notation differs. The comma form is the SQL-89 standard and join ... on is the SQL-92 standard. With the comma form, the where clause carries both the join condition and any filters; with join, the on clause carries the join condition and where carries the filters.

6. left semi join

Why single this one out? Because it behaves differently from the other join statements:
it does the same job as in/exists, as a more efficient implementation of them.
SELECT A.* FROM A where id in (select id from B)

SELECT A.* FROM A left semi join B ON A.id=B.id

The two SQL statements above return exactly the same result; the second just executes more efficiently.

Precautions:

  1. The limitation of left semi join: filter conditions on the right-hand table can only appear in the on clause, not in the where clause, the select clause, or anywhere else.
  2. The join conditions after on in a left semi join can only be equalities, nothing else.
  3. left semi join passes only the right table's join key to the map phase, so the final select can only reference columns of the left table.
  4. Because left semi join is an in (keySet) relation, duplicate rows in the right table do not multiply the output: each matching left-table row is emitted once.
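A sketch of precautions 1 and 3 (tables follow the article's examples):

```sql
-- Allowed: conditions on the right table go in the on clause,
-- filters on the left table can stay in where
SELECT t.* FROM techer t
left semi join course c ON t.t_id = c.t_id
WHERE t.t_id > 0;

-- Not allowed: right-table (course) columns in select or where, e.g.
--   SELECT t.*, c.t_id FROM techer t left semi join course c ON t.t_id = c.t_id;
```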

7. Null values in aggregate functions

Hive supports the common aggregate functions count(), max(), min(), sum(), avg(), and so on.

Precautions:

Pay attention to null values when aggregating:

count(*) counts all rows, including those with null values;
count(id) does not count rows whose id is null;
min ignores null, unless every value is null, in which case the result is null;
avg ignores null: the sum is divided by the count of non-null values only.

The points above require special attention; null values are the most common cause of silently wrong results.
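A sketch of these rules on an assumed table t whose id column holds the values 1, 2, and null:

```sql
select count(*)  as all_rows,   -- 3: every row counts, null included
       count(id) as non_null,   -- 2: the null id is skipped
       avg(id)   as mean        -- 1.5: (1 + 2) / 2 non-null values, not / 3
from t;
```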

8. Null values in operators

Hive supports the common arithmetic operators (+, -, *, /),
the comparison operators (>, <, =),
and the logical operators (in, not in).

Pay special attention to null values when using any of these operators.

Precautions:

  1. Adding or subtracting column values within a row yields null if any operand is null.
    Example: suppose a product table (product):

| id | price | dis_amount |
|----|-------|------------|
| 1  | 100   | 20         |
| 2  | 120   | null       |

The meaning of each field: id (product id), price (price), dis_amount (discount amount)

I want to calculate the actual price of each product after the discount; the SQL is as follows:

select id, price - dis_amount as real_amount from product;

The results are as follows:

| id | real_amount |
|----|-------------|
| 1  | 80          |
| 2  | null        |

For the product with id=2, dis_amount is null, so the computed real_amount is null, which is wrong.

We can handle the null value explicitly; the SQL is as follows:

select id, price - coalesce(dis_amount, 0) as real_amount from product;

With coalesce handling the null value, the result is correct.

coalesce returns the first of its arguments that is not null.
In the SQL above: if dis_amount is not null it is returned; if it is null, 0 is returned instead.
  2. The comparison operators exclude null: for example, id < 10 does not match rows whose id is null.

  3. not in excludes null: for example, city not in ('Beijing','Shanghai') matches rows whose city is neither Beijing, Shanghai, nor null.
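When the null rows should be kept as well, a common rewrite (table name t assumed) adds an explicit null check:

```sql
select * from t
where city not in ('Beijing', 'Shanghai') or city is null;
```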

9. and vs or

When a SQL filter or expression combines multiple conditions or operations, precedence matters: multiplication and division outrank addition and subtraction, and operators of equal precedence evaluate left to right, whichever comes first. What about and and or? They look like equals that evaluate left to right, but in fact and has higher precedence than or.

Precautions:

Example:
the same product table (product) again:

| id | classify | price |
|----|----------|-------|
| 1  | 电器     | 70    |
| 2  | 电器     | 130   |
| 3  | 电器     | 80    |
| 4  | 家具     | 150   |
| 5  | 家具     | 60    |
| 6  | 食品     | 120   |

(classify values: 电器 = electronics, 家具 = furniture, 食品 = food)

I want to find the products in the 电器 (electronics) or 家具 (furniture) categories whose price is above 100; the SQL:

select * from product where classify = '电器' or classify = '家具' and price > 100;

The result:

| id | classify | price |
|----|----------|-------|
| 1  | 电器     | 70    |
| 2  | 电器     | 130   |
| 3  | 电器     | 80    |
| 4  | 家具     | 150   |

The result is wrong: every electronics row came back. The reason is that and has higher precedence than or, so the SQL above actually finds the rows where classify = '家具' and price > 100 first, and then adds every row where classify = '电器'.

The correct SQL simply adds parentheses, so the bracketed part is evaluated first:

select * from product where (classify = '电器' or classify = '家具') and price > 100;
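An equivalent rewrite that sidesteps the and/or precedence trap entirely is an in list:

```sql
select * from product where classify in ('电器', '家具') and price > 100;
```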



Origin blog.51cto.com/14932245/2588986