Notes on Handy SQL Tricks

1. Pass a conditional expression to an aggregate function

Nest an if expression inside count(): it returns 1 when the condition is met and null otherwise. Because count() ignores null values, the result is the same as filtering with where first and then running count(*).

-- Use if as the argument to count
select count(if(year='2020',1,null)) from pos_rival;

-- Filter first, then count
select count(1) from pos_rival where year='2020';

-- Both return the same result
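
To see why the two forms agree, the sketch below runs both against a small temporary view. The view name pos_rival_demo and its rows are made up for illustration, and the inline VALUES syntax assumes Spark SQL.

```sql
-- Hypothetical sample data standing in for pos_rival (Spark SQL syntax assumed)
CREATE OR REPLACE TEMPORARY VIEW pos_rival_demo AS
SELECT * FROM VALUES
    ('2020', '10', 5.0),
    ('2020', '20', 7.5),
    ('2019', '10', 3.0)
AS t(year, brand, price);

-- if() yields 1 for the two 2020 rows and null for the 2019 row;
-- count() skips nulls, so this returns 2
SELECT count(if(year = '2020', 1, null)) AS cnt FROM pos_rival_demo;

-- Filtering first keeps the same two rows, so this also returns 2
SELECT count(1) AS cnt FROM pos_rival_demo WHERE year = '2020';
```

The conditional form scans every row, while the where form can let the engine skip data, so the conditional form pays off mainly when several conditions share one pass over the table.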

Building on this, a single statement can compute several count or sum results at once, for example:

-- Count the rows where year=2020 and brand=10
-- Sum price over all rows where year=2020
select 
    count(if(year='2020' and brand='10',1,null)) as cnt,
    sum(case when year='2020' then price else 0 end) as total_price
from
    pos_rival;
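
As a worked check of the combined statement, the sketch below applies the same two expressions to three made-up rows. The view name, the rows, and the output aliases are assumptions for illustration, and the inline VALUES syntax assumes Spark SQL.

```sql
-- Hypothetical sample data standing in for pos_rival (Spark SQL syntax assumed)
CREATE OR REPLACE TEMPORARY VIEW pos_rival_demo AS
SELECT * FROM VALUES
    ('2020', '10', 5.0),
    ('2020', '20', 7.5),
    ('2019', '10', 3.0)
AS t(year, brand, price);

-- One pass over the data, two differently filtered aggregates:
--   rows_2020_brand_10 = 1    (only the first row meets both conditions)
--   price_2020         = 12.5 (5.0 + 7.5; the 2019 row contributes 0)
SELECT
    count(if(year = '2020' AND brand = '10', 1, null)) AS rows_2020_brand_10,
    sum(CASE WHEN year = '2020' THEN price ELSE 0 END)  AS price_2020
FROM pos_rival_demo;
```

Note the different fallbacks: count needs null so that non-matching rows are skipped, while sum uses 0 so that the total is 0 rather than null when the table has rows but none of them match.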

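2. Exclude columns with a regular expression

Hive and spark-sql support selecting columns with a regular expression, for example:

-- Regex column support must be enabled first, otherwise the query fails
-- (on Hive, the corresponding setting is: SET hive.support.quoted.identifiers=none;)
SET spark.sql.parser.quotedRegexColumnNames=true;

-- Select every column except rk
SELECT `(rk)?+.+`
FROM
pos_rival
;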
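
The pattern is easier to trust once it is clear why it skips rk. The sketch below, which assumes Spark SQL and uses a made-up view called rank_demo, walks through the match and extends the same idea to excluding more than one column.

```sql
-- Hypothetical one-row view; the column names are for illustration only
CREATE OR REPLACE TEMPORARY VIEW rank_demo AS
SELECT * FROM VALUES
    (1, '10', 5.0, '2020-01-01')
AS t(rk, brand, price, dt);

SET spark.sql.parser.quotedRegexColumnNames=true;

-- (rk)?+ possessively consumes the text "rk" if the name starts with it,
-- and .+ then needs at least one more character. A column named exactly
-- rk leaves nothing for .+ and cannot match; every other name matches.
-- Returns the columns brand, price, dt
SELECT `(rk)?+.+` FROM rank_demo;

-- Alternation excludes several columns at once.
-- Returns the columns brand, price
SELECT `(rk|dt)?+.+` FROM rank_demo;
```

While the setting is on, every backquoted name in a select list is treated as a pattern, so it may be worth turning it back off after the query if other statements rely on backquoted identifiers.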

3. Adjust the number of files generated by Hive and spark-sql

Both Spark's parallelism and Hive's number of reducers affect the number of output files, but changing either one also affects performance. In pure SQL you also cannot call Spark's DataFrame coalesce operator directly to reduce the number of partitions.

Instead, you can use DISTRIBUTE BY to control the number of files generated.

-- DISTRIBUTE BY a constant: all rows go to a single partition, so one file is produced
INSERT OVERWRITE TABLE ldldwd.pos_rival

SELECT *
FROM ldlsrc.pos_rival

DISTRIBUTE BY 1
;

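-- DISTRIBUTE BY a random number: rows are spread randomly across partitions, and one file is produced per partition that receives data
-- CAST(RAND() * 5 AS INT) below yields a random integer from 0 to 4, so up to five files are produced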
INSERT OVERWRITE TABLE ldldwd.pos_rival

SELECT *
FROM ldlsrc.pos_rival

DISTRIBUTE BY CAST(RAND() * 5 AS INT)
;
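
To check what the random expression actually produces before using it in an INSERT, a query along the following lines can help. It is only a sketch: it assumes Spark SQL, where range(1000) is a built-in table function, and the bucket and row_count names are made up.

```sql
-- RAND() returns a value in [0, 1), so RAND() * 5 lies in [0, 5)
-- and the cast to INT truncates it to one of 0, 1, 2, 3, 4
SELECT bucket, count(*) AS row_count
FROM (
    SELECT CAST(RAND() * 5 AS INT) AS bucket
    FROM range(1000)
) t
GROUP BY bucket
ORDER BY bucket;
-- Expect buckets 0 to 4 with roughly 200 rows each; exact counts differ per run
```

Each distinct bucket value is hashed to a shuffle partition, so two values can occasionally land in the same partition and the write may then produce fewer than five files. Changing the multiplier from 5 to N gives N distinct values and therefore at most N files.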

4. Use WITH ... AS to improve SQL readability

Syntactically, WITH ... AS pulls individual subqueries out of a SQL statement and gives each one a name. The main statement can then refer to those names instead of repeating the subqueries.

For example:

-- Name the result of the query on pos_rival as a
-- Name the result of the query on pos_rival_brand as b
-- Join a with b

with a as (select * from pos_rival where dt>='2020-01-01'),
b as (select * from pos_rival_brand where dt>='2020-01-01')
select * from a left join b on a.brand=b.brand;
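
The same shape can be tried without any real tables. In the sketch below the rows and the brand_name column are made up for illustration, and the inline VALUES syntax assumes Spark SQL. The WITH keyword appears only once; each further named subquery is separated by a comma.

```sql
WITH a AS (
    -- Hypothetical rows standing in for pos_rival
    SELECT * FROM VALUES ('10', 5.0), ('20', 7.5) AS t(brand, price)
),
b AS (
    -- Hypothetical rows standing in for pos_rival_brand
    SELECT * FROM VALUES ('10', 'Acme') AS t(brand, brand_name)
),
-- A later named subquery may refer to the ones defined before it
joined AS (
    SELECT a.brand, a.price, b.brand_name
    FROM a LEFT JOIN b ON a.brand = b.brand
)
SELECT * FROM joined ORDER BY brand;
-- Returns ('10', 5.0, 'Acme') and ('20', 7.5, NULL):
-- brand 20 has no match in b, so the left join fills brand_name with NULL
```

Listing explicit columns in the join step, as done here, also avoids ending up with two columns both named brand in the final result.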

 

Origin blog.csdn.net/x950913/article/details/106941355