[Hive] Notes on Hive data types and data type conversion

One, Hive data types

1. Numeric data type

Type | Storage and range | Notes
TINYINT | 1-byte signed integer, -128 to 127 | Range is very small; rarely used
SMALLINT | 2-byte signed integer, -32,768 to 32,767 | Rarely used
INT/INTEGER | 4-byte signed integer, -2,147,483,648 to 2,147,483,647 | INTEGER is only available from Hive 2.2.0 and is generally not used
BIGINT | 8-byte signed integer, -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 | Up to 19 digits; used when INT is too small
FLOAT | 4-byte single-precision floating-point number, -3.4028235E38 to 3.4028235E38 | About 7 significant digits, e.g. 3.14159
DOUBLE | 8-byte double-precision floating-point number, -1.797693E+308 to 1.797693E+308 | 15 to 16 significant digits; larger range and precision than FLOAT, e.g. 3.14159
DECIMAL | Up to 38 digits of precision |

Notes on the DECIMAL type

1. DECIMAL is declared as decimal(precision, scale), where precision is the total number of digits and scale is the number of digits after the decimal point. If they are omitted, precision defaults to 10 and scale defaults to 0. Digits beyond the scale are rounded off, so with the default scale of 0 the fractional part is dropped and the integer part is rounded, in this example up by 1.

> select CAST(12345.523456 AS DECIMAL) ;  
+--------+
|  _c0   |
+--------+
| 12346  |
+--------+

2. If the value has more digits than the declared precision allows, the result is not truncated; it is NULL, even though DECIMAL can hold up to 38 digits. With the default precision of 10, the value below has 11 integer digits, so the whole result is NULL.


> select CAST(12345678910.523456 AS DECIMAL) ;
+-------+
|  _c0  |
+-------+
| NULL  |
+-------+

3. When other data is cast to decimal with an explicit precision and scale, digits beyond the scale are rounded off.

> select CAST(123456789.1234567 AS DECIMAL(20,5)); 
+------------------+
|       _c0        |
+------------------+
| 123456789.12346  |
+------------------+
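
The three DECIMAL behaviours above (default precision and scale, NULL on overflow, rounding to the declared scale) can be mimicked outside Hive. The following is a minimal Python sketch, not Hive's implementation: the helper name cast_decimal is hypothetical, and it assumes half-up rounding and treats precision as the total number of digits after rounding.

```python
from decimal import Decimal, ROUND_HALF_UP


def cast_decimal(text, precision=10, scale=0):
    """Model of CAST(text AS DECIMAL(precision, scale)); None stands for NULL."""
    # Quantum with `scale` fractional digits, e.g. scale=5 -> Decimal('0.00001').
    quantum = Decimal(1).scaleb(-scale)
    # Assumption: digits beyond the scale are rounded half up.
    rounded = Decimal(text).quantize(quantum, rounding=ROUND_HALF_UP)
    # Too many digits for the declared precision: NULL instead of truncation.
    if len(rounded.as_tuple().digits) > precision:
        return None
    return rounded


print(cast_decimal("12345.523456"))              # 12346
print(cast_decimal("12345678910.523456"))        # None
print(cast_decimal("123456789.1234567", 20, 5))  # 123456789.12346
```

The three calls correspond to the three queries in this section, in the same order.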

2. Character data type

Type | Notes
STRING | For strings of arbitrary length; prefer STRING whenever it is sufficient
VARCHAR | Variable length with a declared maximum (1 to 65,535) that must be specified; values longer than the maximum are truncated, so data may be lost
CHAR | Fixed length, which must be specified; the maximum length (255) is much smaller than that of VARCHAR

Notes on the VARCHAR type

1. varchar has a declared maximum length that must be specified. Casting a longer value truncates it to that length, so data may be lost.

> select CAST("ABCDEFGHICD" AS VARCHAR(10));
+-------------+
|     _c0     |
+-------------+
| ABCDEFGHIC  |
+-------------+

2. When creating a table with a varchar column, the length must be specified, otherwise an error is reported. If the specified length is too small, inserted data is truncated to that length.

> create table test_varchar(id varchar(10));

> insert overwrite table test_varchar  values ('123456789122'); 
> select * from test_varchar;
+------------------+
| test_varchar.id  |
+------------------+
| 1234567891       |
+------------------+

3. Date data type

Type | Notes
TIMESTAMP | Available since Hive 0.8.0. Values carry no time zone and are stored as an offset from the UNIX epoch; the UDFs to_utc_timestamp and from_utc_timestamp are provided for time zone conversion. All existing datetime UDFs (month, day, year, hour, etc.) work with TIMESTAMP. TIMESTAMP values can be converted from integer, floating-point, and string data. Usage is shown below; it is not used much in actual development.
DATE | Available since Hive 0.12.0. A DATE value describes a particular year/month/day in the form YYYY-MM-DD, for example DATE '2013-01-01'. DATE has no time-of-day component. The supported range is 0000-01-01 to 9999-12-31, depending on the support of the underlying Java Date type. DATE can only be converted to or from DATE, TIMESTAMP, or STRING.
INTERVAL | Available since Hive 1.2.0; not used much in actual development.

Notes on the TIMESTAMP type

1. A timestamp column can store time values supplied as floating-point, integer, or string data. In the first example below, the integer 12334324 is read as milliseconds since the epoch, which gives 1970-01-01 03:25:34.324.

> create table test_timestamp(
a int,
b bigint,
c timestamp
);

> insert overwrite table test_timestamp 
select 1,2,12334324 from test_timestamp  limit 2;
> select * from test_timestamp;
+-------------------+-------------------+--------------------------+
| test_timestamp.a  | test_timestamp.b  |     test_timestamp.c     |
+-------------------+-------------------+--------------------------+
| 1                 | 2                 | 1970-01-01 03:25:34.324  |
| 1                 | 2                 | 1970-01-01 03:25:34.324  |
+-------------------+-------------------+--------------------------+

> insert overwrite table test_timestamp 
select 3,4,"2019-05-22 21:23:34" from test_timestamp  limit 2;
> select * from test_timestamp;
+-------------------+-------------------+------------------------+
| test_timestamp.a  | test_timestamp.b  |    test_timestamp.c    |
+-------------------+-------------------+------------------------+
| 3                 | 4                 | 2019-05-22 21:23:34.0  |
| 3                 | 4                 | 2019-05-22 21:23:34.0  |
+-------------------+-------------------+------------------------+
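
The first result above, 1970-01-01 03:25:34.324, is consistent with the integer 12334324 being read as milliseconds since the Unix epoch. The following small Python check reproduces that arithmetic, assuming no time zone offset; how integers map to timestamps may differ between Hive versions and settings, so this is a sketch rather than a statement of Hive's rules.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

# Interpret the inserted integer as milliseconds since the epoch.
as_millis = EPOCH + timedelta(milliseconds=12334324)
# For contrast: the same integer interpreted as seconds.
as_seconds = EPOCH + timedelta(seconds=12334324)

print(as_millis.isoformat(sep=" ", timespec="milliseconds"))  # 1970-01-01 03:25:34.324
print(as_seconds.isoformat(sep=" "))                          # 1970-05-23 18:12:04
```

Only the millisecond reading matches the value shown in the query output.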

4. Other data types

Type | Notes
BOOLEAN | Boolean type: TRUE or FALSE
BINARY | Byte array for storing variable-length binary data

5. Composite data types

Type | Notes
STRUCT | A collection of named fields, whose types can differ
MAP | A collection of key-value pairs; the key must be a primitive type, and the value can be of any type
ARRAY | An ordered collection of elements that all have the same type

Two, Hive data type conversion

1. Rules of implicit conversion

  • Hive supports both implicit conversion and explicit conversion.
  • For example, when two numbers of different types are compared, one INT and the other SMALLINT, the SMALLINT value is implicitly converted to INT. An INT value cannot be implicitly converted to SMALLINT or TINYINT; this returns an error unless the cast operation is used.
  • Any integer type can be implicitly converted to a wider type. TINYINT, SMALLINT, INT, BIGINT, FLOAT, and STRING can all be implicitly converted to DOUBLE.
  • The BOOLEAN type cannot be converted to any other data type; a forced conversion returns NULL.

[Figure: data type conversion table]

2. Conversion within the same type family

"The same type family" means types that are all numeric, all date types, and so on.
Conversion within a family follows an "upward conversion" rule: when a value of a narrower type is compared or combined with a value of a wider type (one with a larger range), it is implicitly converted to the wider type before the calculation.
For example, when comparing 1 and 1.23, 1 is automatically converted to 1.0 for the comparison.

3. Conversion between different data types

Explicit conversion function cast()
Usage: cast(value as type), where value is the data to convert, AS is a fixed keyword, and type is the target type.

> select 
cast("1223" as double),
cast("456.23" as int),
cast("1.99" as int),
cast("abc" as int) ,
cast(456.23 as decimal(9,2));
+---------+------+------+-------+---------+
|   _c0   | _c1  | _c2  |  _c3  |   _c4   |
+---------+------+------+-------+---------+
| 1223.0  | 456  | 1    | NULL  | 456.23  |
+---------+------+------+-------+---------+
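
The `cast("1.99" as int)` result above shows that the cast drops the fractional part rather than rounding. The Python sketch below contrasts truncation with floor and round on sample values. It models the arithmetic only; the helper name int_conversions is hypothetical, and the behaviour for negative values (truncation toward zero) is an assumption that is not demonstrated by the query output in this article.

```python
import math


def int_conversions(text):
    """Return the three candidate integer results for a numeric string."""
    value = float(text)
    return {
        "trunc": math.trunc(value),  # drop the fractional part (what the cast shows)
        "floor": math.floor(value),  # round toward negative infinity
        "round": round(value),       # round to the nearest integer
    }


for text in ("456.23", "1.99", "-1.99"):
    print(text, int_conversions(text))
```

The sample values avoid exact halves on purpose: Python's round() resolves a tie such as 2.5 to the nearest even integer, which may differ from Hive's round().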

Notes on using the cast() function

  • cast only succeeds when the conversion is valid (see the data type conversion table); otherwise the result is NULL. For example, "abc" is obviously not a number, so converting it to int fails and returns NULL.

  • Using cast to convert a wider type to a narrower one cuts the value off, which loses precision or can even produce wrong results. For converting floating-point data to int, the round() or floor() function is recommended rather than cast. When the floating-point value 456.23 is cast to int, the fractional part is simply dropped and only the integer part is kept. To preserve the value, cast 456.23 to decimal(9,2) instead, but the precision of the decimal must be defined: it is the total number of digits on both sides of the decimal point.

  • Date type conversion: among the date types, only DATE, TIMESTAMP, and STRING can be converted to one another.

Valid conversion | Result
cast(date as date) | Returns the same date value
cast(timestamp as date) | The year/month/day of the timestamp is determined from the local time zone and returned as a date
cast(string as date) | If the string is in YYYY-MM-DD format, the corresponding date is returned; otherwise the result is NULL
cast(date as timestamp) | A timestamp for midnight of the date's year/month/day is generated, based on the local time zone
cast(date as string) | The year/month/day of the date is converted to a string in YYYY-MM-DD format

# cast(timestamp as date)
# Show the current timestamp: current_timestamp()
# Show the current date: current_date()

> select current_timestamp();
+--------------------------+
|           _c0            |
+--------------------------+
| 2023-05-09 11:24:12.067  |
+--------------------------+

> select current_date();
+-------------+
|     _c0     |
+-------------+
| 2023-05-09  |
+-------------+

## cast(timestamp as date)
> select cast(current_timestamp() as date);
+-------------+
|     _c0     |
+-------------+
| 2023-05-09  |
+-------------+

## cast(date as timestamp)
> select cast(current_date() as timestamp);
+------------------------+
|          _c0           |
+------------------------+
| 2023-05-09 00:00:00.0  |
+------------------------+

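Three, Notes for practical use

1. Pitfalls with string and bigint types

Pitfall 1:
Error: Suppose table 1 has a string field a1 with values such as "420001053411844" and "000001053411844", table 2 has a bigint field a2, and the two tables must be joined on a1 and a2. If the join condition is cast(a1 as bigint) = a2, an a1 value of "000001053411844" becomes 1053411844, so rows that should not be joined are matched.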

> desc test_table1;
+-----------+------------+----------+
| col_name  | data_type  | comment  |
+-----------+------------+----------+
| a1        | string     |          |
+-----------+------------+----------+
> select * from test_table1;
+------------------+
|  test_table1.a1  |
+------------------+
| 420001053411844  |
| 000001053411844  |
+------------------+

> desc test_table2;
+-----------+------------+----------+
| col_name  | data_type  | comment  |
+-----------+------------+----------+
| a2        | bigint     |          |
+-----------+------------+----------+
> select * from test_table2;
+------------------+
|  test_table2.a2  |
+------------------+
| 420001053411844  |
| 1053411844       |
+------------------+

> SELECT
a.a1 a1,
b.a2 a2
from test_table1 a
join test_table2 b 
on cast(a.a1 as bigint) = b.a2;
+------------------+------------------+
|        a1        |        a2        |
+------------------+------------------+
| 420001053411844  | 420001053411844  |
| 000001053411844  | 1053411844       |
+------------------+------------------+

Cause and note: do not join directly on cast(a1 as bigint) = a2, because an a1 value of "000001053411844" becomes 1053411844 and the match is wrong.

> select cast("000001053411844" as int);
+-------------+
|     _c0     |
+-------------+
| 1053411844  |
+-------------+

Solution: convert the bigint field to string before joining.

> SELECT
a.a1 a1,
b.a2 a2
from test_table1 a
join test_table2 b 
on a.a1 =  cast(b.a2 as string);
+------------------+------------------+
|        a1        |        a2        |
+------------------+------------------+
| 420001053411844  | 420001053411844  |
+------------------+------------------+
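
The effect of the two join conditions can be reproduced with the sample values from the tables above. This Python sketch only models the comparison logic, with int() standing in for cast(... as bigint) and str() for cast(... as string); the variable names are illustrative.

```python
table1_a1 = ["420001053411844", "000001053411844"]  # string column a1
table2_a2 = [420001053411844, 1053411844]            # bigint column a2

# Join on cast(a1 as bigint) = a2: leading zeros vanish, so a false match appears.
join_as_bigint = [(a1, a2) for a1 in table1_a1 for a2 in table2_a2 if int(a1) == a2]

# Join on a1 = cast(a2 as string): the strings are compared exactly as written.
join_as_string = [(a1, a2) for a1 in table1_a1 for a2 in table2_a2 if a1 == str(a2)]

print(join_as_bigint)  # two pairs, including ('000001053411844', 1053411844)
print(join_as_string)  # one pair
```

The string comparison keeps the leading zeros significant, which is why only the first row survives.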

Side note: as long as the field is of type bigint, it apparently cannot store 000001053411844; however it is inserted, it becomes 1053411844.

> create table test_table2 (a2 bigint);
> insert overwrite table test_table2 select * from test_table1 ;

> load data local inpath '/apps/wqf/cdc_model/data/data_20230509.txt' overwrite into table wqf.test_table2;
> insert overwrite table test_table2  values (420001053411844),(000001053411844);
> insert overwrite table test_table2  values ("420001053411844"),("000001053411844");
> insert overwrite table test_table2  select * from test_table1;
> select * from test_table2  ;
+------------------+
|  test_table2.a2  |
+------------------+
| 420001053411844  |
| 1053411844       |
+------------------+

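Pitfall 2:
Error: Suppose table 1 has a string field a1 with values such as "150970594253582620", table 2 has a bigint field a2 with values such as 150970594253582621, and the two tables must be joined on a1 and a2. These two rows should not be joined, yet contrary to expectation they match.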

> create table test_table1 (a1 string);
> insert overwrite table test_table1  select "150970594253582620";

> create table test_table2 (a2 bigint);
> insert overwrite table test_table2  select 150970594253582621;


> SELECT
a.a1 a1,
b.a2 a2
from test_table1 a
join test_table2 b 
on a.a1= b.a2;
+---------------------+---------------------+
|         a1          |         a2          |
+---------------------+---------------------+
| 150970594253582620  | 150970594253582621  |
+---------------------+---------------------+

Cause and note: a1 of table1 is a string and a2 of table2 is a bigint. When a string is compared with a bigint, both are implicitly converted to double. A bigint is 8 bytes, holds up to 19 digits, and ranges from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. A double is also 8 bytes and ranges from -1.797693E+308 to 1.797693E+308, but it has only 15 to 16 significant digits. Both 18-digit values therefore lose precision, round to the same double, and compare as equal.

Solution: cast a1 and a2 explicitly to the same type.
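
On common platforms Python's float is the same IEEE 754 double-precision format as Hive's DOUBLE, so the collision can be checked in Python with the two sample values. This is a sketch of the arithmetic, not of Hive's join code; str() and int() stand in for explicit casts to string and bigint.

```python
a1 = "150970594253582620"   # string column value
a2 = 150970594253582621     # bigint column value

# Implicit conversion: both sides become doubles and land on the same value.
print(float(a1) == float(a2))   # True
print(int(float(a1)))           # 150970594253582624

# Explicit casts to one common type keep the two values distinct.
print(a1 == str(a2))            # False
print(int(a1) == a2)            # False

# Consecutive integers are only guaranteed to be exact in a double up to 2**53.
print(2**53)                    # 9007199254740992
```

Near 1.5E17 adjacent doubles are 32 apart, so two integers that differ by 1 usually map to the same double.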

Summary of string and bigint pitfalls:
1. When comparing a bigint with a string, cast both sides explicitly to the same type. Beyond checking that the conversion is allowed, watch two things. First, what the field actually contains: when a string is converted to an integer type, check whether any values have leading zeros or a trailing decimal part, as in pitfall 1. Second, the numeric precision of the fields being joined or converted: check whether precision will be lost, as in pitfall 2.


Reference article:
https://juejin.cn/post/7039162114157756430

Origin blog.csdn.net/sodaloveer/article/details/130555485