HiveSQL programming template and precautions

Table of Contents

0 Preface

1 Template code

2 Code analysis and precautions


0 Preface

Hive is an essential tool for data warehouse engineers and data analysts. In practice, Hive SQL is often wrapped in a shell script, which makes it easy for a scheduling tool to run the job.

1 Template code

#!/bin/bash
lastday=`date --date '-1days' +%Y-%m-%d` # get yesterday's date
if [ "$2" != "" ];then
   lastday=$2
fi;
input_para="hive" # default launch mode
if [ "$1" != "" ];then
   input_para=$1
fi;
# How the UDFs are registered
sqlFun1="CREATE TEMPORARY FUNCTION avgCalShock AS 'jttl.jxresearch.com.hive.udf.avgCalShock' using jar 'hdfs:///phm/JTTL_ETL_COMMON/jx-yjy-udfs-1.0-SNAPSHOT.jar'";
sqlFun2="CREATE TEMPORARY FUNCTION maxCalShock AS 'jttl.jxresearch.com.hive.udf.maxCalShock' using jar 'hdfs:///phm/JTTL_ETL_COMMON/jx-yjy-udfs-1.0-SNAPSHOT.jar'";
sqlFun3="CREATE TEMPORARY FUNCTION minCalShock AS 'jttl.jxresearch.com.hive.udf.minCalShock' using jar 'hdfs:///phm/JTTL_ETL_COMMON/jx-yjy-udfs-1.0-SNAPSHOT.jar'";

input_table="phmdwdb.dwd_iot_phm_switch_shock"; # input table name
output_table="phmdwdb.dwd_phm_switch_shock_event"; # output table name
log_dir=${output_table:8};

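# Get the Hive connection mode (in this setup the parameter is passed in by Oozie; it could also be stored on HDFS and read from there)
option=`echo ${input_para} | awk -F '_' '{print $1}' | sed 's/[[:space:]]//g'`
# Get the parameters the SQL needs
guoche_top_num=`echo ${input_para} | awk -F '_' '{print $3}' | sed 's/[[:space:]]//g'`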


hive_home='/usr/idp/current/hive-client/bin';


sql="
 insert overwrite TABLE $output_table
 PARTITION (compute_day='$lastday') -- Note: when the SQL references a shell variable as a string, it must be wrapped in single quotes here; with double quotes the result is incorrect
 select
       XXX
      ,XXX
      ,XXX
 from ${input_table}
 where from_unixtime(cast(substr(msg_time,1,10) as bigint),'yyyy-MM-dd')='$lastday'
 
 ;
";

if [ "$option" = "beeline" ];then
   # Get the HiveServer2 address
   hive_addr=`hadoop fs -cat /phm/JTTL_ETL_COMMON/jdbc.properties | grep hive_addr | awk -F '=' '{print $2}' | sed s/[[:space:]]//g`
   hive_url="${hive_addr}/phmdwdb"
   cd $hive_home
   beeline -u $hive_url -e "$sqlFun1;$sqlFun2;$sqlFun3;$sql"  >>/tmp/$log_dir.log  2>&1 ;
fi

if [ "$option" = "hive" ];then

   hive -e "$sqlFun1;$sqlFun2;$sqlFun3;$sql"  >>/tmp/$log_dir.log  2>&1 ;
fi
# Note: use hive -e "$sql", not hive -e $sql; $sql must be wrapped in double quotes. Also, the log file name must be on the same line as >> and $sql.
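
The parameter handling in the template can be tried on its own, without Hive. The following is a minimal sketch, assuming the parameter string has the layout mode_something_number; the sample value beeline_x_10 and its middle field are placeholders invented for illustration, not values from the original job.

```shell
#!/bin/bash
# Hypothetical parameter string; in the template it arrives as $1 from the scheduler.
input_para="beeline_x_10"

# Field 1 selects the connection mode, field 3 is a value the SQL needs.
# The sed expression is quoted so the shell cannot glob-expand the brackets.
option=$(echo "${input_para}" | awk -F '_' '{print $1}' | sed 's/[[:space:]]//g')
guoche_top_num=$(echo "${input_para}" | awk -F '_' '{print $3}' | sed 's/[[:space:]]//g')

# Derive the log name by dropping the database prefix up to the first dot.
# For this table name it yields the same value as ${output_table:8}.
output_table="phmdwdb.dwd_phm_switch_shock_event"
log_dir=${output_table#*.}

echo "option=${option} guoche_top_num=${guoche_top_num} log=/tmp/${log_dir}.log"
```

Stripping the prefix with #*. does not depend on the database name being exactly seven characters long, which the fixed offset 8 does.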

2 Code analysis and precautions

(1) The hive -e approach

hive -e "SQL to execute". This approach lets us put the SQL statements to be executed inside quotation marks. It suits longer statements and is the most direct method when a task needs to be scheduled: variable parameters (such as a date) can be defined in the shell, and the script can then be automated through the scheduling system.

When SQL run through hive -e contains a vertical bar (a special character), extra escape characters are needed

Let's first look at the interactive command line.

Suppose we want to extract each user's city and gender with the split function. We might write:

select split(location_city, '|')[0] as city,split(location_city, '|')[1] as gender from test_0102;

The result is shown in the figure below.

Obviously this is not the result we want, because the vertical bar is a special character. Let's add an escape character and look again.

select 
split(location_city, '\|')[0] as city,
split(location_city, '\|')[1] as gender
from test_0102;

The results are as follows:

 The results did not change and still do not meet expectations. Now add one more escape character.

select 
split(location_city, '\\|')[0] as city,
split(location_city, '\\|')[1] as gender
from test_0102;

The results are as follows:

To sum up the experiment, we get the result we want by adding two backslashes (\\). The vertical bar has to be escaped because split treats its separator as a regular expression, where | means alternation. Two backslashes are needed because there are two layers of unescaping: the first escape character is consumed in the Hive -> MapReduce step and the second during MapReduce compilation.

Now let's look at hive -e. If you use two escape characters directly, the output is still split incorrectly.

 This is because hive -e adds one more layer of escaping, from the shell to Hive, so one extra escape character is required. In fact, the result is also correct with four escape characters. The general rule is:

Let y be the number of escape characters required.
If the target string contains '\', then y = 2^n
If the target string does not contain '\', then y = n
n: the number of cross-framework calls, counted up to the final Java compilation. The two cases above are the most common.

Reference link: https://blog.csdn.net/lt793843439/article/details/91492088
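
The shell layer of this escaping can be checked without running Hive at all. The sketch below only prints the text that hive -e would receive after the shell has processed a double-quoted string; the variable names two, three and four are made up for illustration.

```shell
#!/bin/bash
# Inside double quotes the shell turns \\ into \ and leaves \| untouched.
two="select split(location_city, '\\|')[0] as city from test_0102"
three="select split(location_city, '\\\|')[0] as city from test_0102"
four="select split(location_city, '\\\\|')[0] as city from test_0102"

# printf with %s prints each string exactly as Hive would receive it.
printf '%s\n' "$two" "$three" "$four"

# Count the backslashes that survive the shell layer.
count_backslashes() { printf '%s' "$1" | tr -cd '\\' | wc -c; }
echo "two: $(count_backslashes "$two"), three: $(count_backslashes "$three"), four: $(count_backslashes "$four")"
```

With two backslashes only one reaches Hive, while three and four both leave two, which matches the observation that three or four escape characters work under hive -e.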

Correspondingly, if you encounter a double vertical bar, you need to escape each vertical bar separately. For example, suppose we want to split the skills column of the data above. The corresponding forms are as follows:

  • hive command line: two escape characters for each vertical bar

  • hive -e: Three escape characters for each vertical bar (four is also OK)

(2) When hive -e writes a result file, the file name and the redirection operator must be on the same line

When hive -e executes hiveSQL, you can use the redirection operator (>) to write the query result to a file.

hive -e "use dac_twelve_dev;select split(location_city, '\\\|')[0] as city,split(location_city, '\\\|')[1] as gender from test_0102;" > test_0102.txt
cat test_0102.txt
北京  男
上海  女
北京  男
广州  女
西安  男

Note that the closing double quotation mark, the redirection operator, and the result file name must all be on the same line. Otherwise the result file may not be generated as expected, as in the following three forms.

# Form 1
hive -e "your SQL" >
test_0102.txt
# Form 2
hive -e "your SQL"
>
test_0102.txt
# Form 3
hive -e "your SQL"
> test_0102.txt

Of the three forms above, the first fails with: -bash: syntax error near unexpected token `newline'. The second prints the result on the screen and then reports the same error. The third prints the result on the screen without an error, but the final result file contains no data.
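
The empty-file case can be reproduced without Hive. This is a small sketch that uses echo as a stand-in for hive -e and writes into a temporary directory; the file names are made up for illustration.

```shell
#!/bin/bash
outdir=$(mktemp -d)

# Same line: the output of the command is written to the file.
echo "city gender" > "$outdir/same_line.txt"

# Separate lines: the command prints to the screen, and the bare
# redirection on the next line only creates an empty file.
echo "city gender"
> "$outdir/next_line.txt"

# Show the size of both files in bytes.
wc -c "$outdir/same_line.txt" "$outdir/next_line.txt"
```

The shell treats each line as its own command, so a redirection on a later line is no longer attached to the command that produced the output.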

(3) Watch out for the asterisk when printing hiveSQL in the shell

When running hiveSQL in scheduling, shell script files are generally used. In the script, define the time variable first, then define the SQL statement, and finally execute the SQL using hive -e. Similar to the following:

yesterday=`date -d "now -1 day" +%Y-%m-%d`
hql="select * from xxt_able where ds='${yesterday}'"
echo $hql # wrong; the correct form is echo "$hql"
hive -e $hql > result.txt # also wrong; the correct form is hive -e "$hql"

Note that if the statement defined in hql contains an asterisk (*) or another special symbol, you must use "$hql" rather than $hql so that echo prints the statement as written and we can check whether the time variable was substituted correctly. Otherwise the shell treats * as a wildcard and prints the names of all files in the current directory in its place, and hive also reports an error at run time.
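
The wildcard expansion is easy to reproduce in a scratch directory. In the sketch below the directory, the two file names and the fixed date are assumptions made for illustration.

```shell
#!/bin/bash
# Work in an empty temporary directory containing two known files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch a.txt b.txt

hql="select * from xxt_able where ds='2021-01-04'"

# Unquoted: the shell splits the string into words and expands * to file names.
unquoted=$(echo $hql)
# Quoted: the string is passed through unchanged.
quoted=$(echo "$hql")

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

The same expansion happens to hive -e $hql, which additionally passes only the first word to -e, so the quoted form is needed there as well.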

(4) Other options when running hive

  • The -S option suppresses the MapReduce log

When hiveSQL is executed and a MapReduce job is needed, progress messages such as map=100%,reduce=33% appear on the screen. The more complex the task, the longer the log. Although this helps us follow the progress of the task, sometimes we want to hide it. Running hive in silent mode with hive -S -e "SQL statement" achieves this: only the hive startup log appears on the screen, with no log of the MapReduce process.

  • The -v option prints the SQL that is actually executed (used for debugging SQL; at the shell script layer, sh -v plus the script name is usually used for debugging)

This option can be used to verify the exact SQL that is executed when scheduling the tasks mentioned earlier. Assuming we have defined the yesterday variable in advance, the -v option prints the SQL with the variable value substituted, which can replace the echo "$hql" approach. (The SQL reports an error here because, to demonstrate the variable, we referenced a ds field that does not exist in the table.)
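
One way to switch between the two options from a script is sketched below. The helper hive_flag_for and its mode names quiet and debug are hypothetical, introduced only for illustration; the sketch prints the command line instead of running hive, so it can be tried on a machine without Hive.

```shell
#!/bin/bash
# Hypothetical helper: map a run mode to the hive option described above.
hive_flag_for() {
  case "$1" in
    quiet) echo "-S" ;;  # silent mode, hides the MapReduce progress log
    debug) echo "-v" ;;  # verbose mode, prints the SQL actually executed
    *)     echo "" ;;    # default: no extra option
  esac
}

yesterday="2021-01-04"   # fixed date, assumed for the example
hql="select * from xxt_able where ds='${yesterday}'"

flag=$(hive_flag_for debug)
# Print the command that would be run; in a real job it would be:
#   hive ${flag} -e "$hql" > result.txt
echo "hive ${flag} -e \"${hql}\""
```

Leaving ${flag} unquoted in the real command is deliberate here: when the flag is empty it disappears instead of being passed to hive as an empty argument.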

(5) A shell variable referenced in a SQL script as a string must be wrapped in single quotation marks

   1) The difference between single quotes and double quotes in the shell

  • Edit the script
#!/bin/bash
do_date=$1

echo '$do_date'
echo "$do_date"
echo "'$do_date'"
echo '"$do_date"'
echo ""$do_date""
echo ''$do_date''
echo `date`
  • Run the script; the results are as follows
[root@bigdata-1 dan_test]# ./test3.sh '2021-01-04'
$do_date
2021-01-04
'2021-01-04'
"$do_date"
2021-01-04
2021-01-04
2021年 01月 04日 星期一 20:39:41 CST
  • To sum up:
  • Double quotes: the value of the $ variable is substituted.
  • Single quotes: the content is output as is, without substituting variable values.

(1) Single quotes do not substitute the variable value; the content is output as is.

(2) Double quotes substitute the variable value.

(3) Single quotes nested inside double quotes: the variable value is substituted, and the single quotes are displayed as ordinary characters.

(4) Double quotes nested inside single quotes: the variable value is not substituted, and everything inside the single quotes, including the double quotes, is output as a literal string.

(5) Double quotes directly inside double quotes: the variable value is substituted but the double quotes are not displayed. The adjacent quotes pair up as empty strings and cancel out, which is equivalent to ${variable}.

(6) Single quotes directly inside single quotes: the variable value is substituted but the single quotes are not displayed. The adjacent quotes pair up as empty strings and cancel out, which is equivalent to ${variable}.

(7) Backquotes (`): the command inside the backquotes is executed.

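Overall summary: when double quotes and single quotes alternate, look at the outer layer. If the outer layer is double quotes, the variable value is substituted and the single quotes are displayed. If the outer layer is single quotes, the content is output as is.

        When double quotes or single quotes appear in pairs: the quotes, whether single or double, cancel out, which is equivalent to ${variable}.

        When double and single quotes alternate but the double or single quotes appear an even number of times, look mainly at the outer layer. If the outer layer is double quotes, the double quotes cancel out, the value of $ is expanded, and what is displayed is the content left after the double quotes are removed (rarely used).

        If the outer layer is single quotes, the single quotes cancel out, the value of $ is expanded, and what is displayed is the content left after the single quotes are removed (rarely used).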
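#!/bin/bash
a=110
# Caution: the unquoted spaces between the quoted segments split this line into
# separate words, so bash treats sql11=" " as a temporary assignment and tries to
# run the next word as a command; sql11 is empty for the echo below.
sql11=" " " '$a' " " "
echo $sql11

#!/bin/bash
a=110
sql11=' " " " '$a' " " " '
echo $sql11

#!/bin/bash
a=110
sql11=' " " "$a" " " '
echo $sql11

#!/bin/bash
a=110
# Caution: as in the first script, the unquoted spaces split this line into separate words.
sql11=" ' " " "$a" " " ' "
echo $sql11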

When the outer quotes do not alternate but appear consecutively, the behavior is the same as when quotes appear in pairs.
 

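#!/bin/bash
a=110
# Caution: the unquoted spaces around """"$a"""" split this line into separate words.
sql11=' ' """"$a"""" ' '
echo $sql11

#!/bin/bash
a=110
sql11=' " '$a' " '
echo $sql11

#!/bin/bash
a=110
sql11=" ' "$a" ' " # this is the form that actually gets used
echo $sql11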
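
Because echo $sql11 collapses whitespace, it can hide what a variable really contains. The sketch below prints two of the forms above between square brackets so that every space and quote is visible; the variable names outer_double and outer_single are made up for illustration.

```shell
#!/bin/bash
a=110

# Outer double quotes: two quoted segments with $a between them.
outer_double=" ' "$a" ' "
# Outer single quotes: the same idea with the quote types swapped.
outer_single=' " '$a' " '

# Quote the variables and wrap them in brackets to show the exact contents.
printf '[%s]\n' "$outer_double" "$outer_single"
```

Both assignments work because no unquoted space appears between the quoted segments, so each right-hand side stays a single word.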

The following script:

2) SQL scripts often need to output a date containing "-" as a string. In this case, wrap the variable in single quotation marks rather than double quotation marks, as in compute_day='$lastday' in the template above.

Where you filter by day, you must add quotation marks, otherwise the value is not recognized as a date string. When outputting a date, you must also add quotation marks, otherwise the output is garbled.
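
The effect of the single quotes can be seen by printing the SQL text before it is handed to Hive. This is a minimal sketch with a fixed date standing in for the computed lastday; the select statement is an illustrative query against the template's output table, not part of the original job.

```shell
#!/bin/bash
lastday="2021-01-04"   # fixed date, assumed for the example
output_table="phmdwdb.dwd_phm_switch_shock_event"

# Single quotes inside the double-quoted string: the shell still expands
# $lastday, and the quotes reach Hive as part of the SQL text.
sql_quoted="select * from ${output_table} where compute_day='$lastday'"

# No quotes around the variable: Hive receives a bare 2021-01-04.
sql_bare="select * from ${output_table} where compute_day=$lastday"

printf '%s\n' "$sql_quoted" "$sql_bare"
```

Printing both strings makes it easy to confirm that only the first one carries the date as a quoted string literal.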

 

 

Reference link: https://mp.weixin.qq.com/s/_18inMSkJKCBCCYWbDLJPw

Origin blog.csdn.net/godlovedaniel/article/details/112196874