One, Hive data types
1. Numeric data type
Type | Supported range | Notes |
---|---|---|
TINYINT | 1-byte signed integer, range: -128 to 127 | Range is very small; rarely used |
SMALLINT | 2-byte signed integer, range: -32,768 to 32,767 | Rarely used |
INT/INTEGER | 4-byte signed integer, range: -2,147,483,648 to 2,147,483,647 | The INTEGER alias is only available from Hive 2.2.0 and is generally not used |
BIGINT | 8-byte signed integer, range: -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 (up to 19 digits); used when INT is too small | |
FLOAT | 4-byte single-precision floating-point number, range: -3.4028235E38 to 3.4028235E38, with about 7 significant digits | e.g. 3.14159 |
DOUBLE | 8-byte double-precision floating-point number, range: -1.797693E+308 to 1.797693E+308, with 15 to 16 significant digits; wider range and higher precision than FLOAT | e.g. 3.114159 |
DECIMAL | Can store up to 38 digits | |
Notes on the DECIMAL type
1. Declare a decimal as decimal(precision, scale): precision is the total number of digits, and scale is the number of digits after the decimal point. If neither is specified, the default is decimal(10,0), that is, precision 10 and scale 0. Fractional digits beyond the scale are rounded off. With the default scale of 0, the value below is rounded to an integer, and the fractional part .523456 carries 1 into the integer part.
> select CAST(12345.523456 AS DECIMAL) ;
+--------+
| _c0 |
+--------+
| 12346 |
+--------+
2. If the integer part of the value does not fit in the declared decimal, the result is not truncated; it is NULL, even though decimal can hold up to 38 digits. With the default decimal(10,0), at most 10 integer digits fit. The value below has 11 integer digits, so the whole result is NULL.
> select CAST(12345678910.523456 AS DECIMAL) ;
+-------+
| _c0 |
+-------+
| NULL |
+-------+
3. When casting other types to decimal with an explicit precision and scale, fractional digits beyond the scale are rounded off.
> select CAST(123456789.1234567 AS DECIMAL(20,5));
+------------------+
| _c0 |
+------------------+
| 123456789.12346 |
+------------------+
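The three DECIMAL results above can be reproduced with a small Python model. This is only a sketch of the behavior shown in the console output, not Hive's implementation: the function name `hive_decimal_cast` is hypothetical, and half-up rounding is an assumption that happens to match the examples.

```python
from decimal import Decimal, ROUND_HALF_UP


def hive_decimal_cast(value, precision=10, scale=0):
    """Model CAST(value AS DECIMAL(precision, scale)) as described above.

    Assumption: fractional digits beyond `scale` are rounded half up, and a
    value whose integer part needs more than (precision - scale) digits
    yields None (standing in for NULL).
    """
    quantum = Decimal(1).scaleb(-scale)  # e.g. scale=5 -> Decimal('0.00001')
    rounded = Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)
    integer_part = abs(int(rounded))
    integer_digits = len(str(integer_part)) if integer_part else 0
    if integer_digits > precision - scale:
        return None
    return rounded


print(hive_decimal_cast("12345.523456"))              # default decimal(10,0)
print(hive_decimal_cast("12345678910.523456"))        # 11 integer digits
print(hive_decimal_cast("123456789.1234567", 20, 5))  # decimal(20,5)
```

Passing the values as strings keeps the sketch free of binary floating-point noise before rounding.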
2. Character data type
Type | Description |
---|---|
STRING | For strings of arbitrary length. Prefer string whenever it is sufficient. |
VARCHAR | Variable length with a declared maximum (1 to 65535). The length must be specified; a value longer than the declared length is truncated on conversion, so data may be lost. |
CHAR | Fixed length. The length must be specified, and the maximum length of char (255) is much smaller than that of varchar. |
Notes on the VARCHAR type
1. varchar has a declared maximum length, which must be specified. A value longer than that length is truncated on conversion, so data may be lost.
> select CAST("ABCDEFGHICD" AS VARCHAR(10));
+-------------+
| _c0 |
+-------------+
| ABCDEFGHIC |
+-------------+
2. When creating a table with a varchar column, the length must be specified, otherwise an error is reported. If the specified length is too small, inserted data is silently truncated to that length.
> create table test_varchar(id varchar(10));
> insert overwrite table test_varchar values ('123456789122');
> select * from test_varchar;
+------------------+
| test_varchar.id |
+------------------+
| 1234567891 |
+------------------+
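The truncation in both VARCHAR examples amounts to keeping the first n characters. A minimal Python sketch, assuming a hypothetical helper `cast_varchar` that only models the truncation and nothing else about Hive's varchar handling:

```python
def cast_varchar(text, max_length):
    """Model CAST(text AS VARCHAR(max_length)): keep the first max_length characters."""
    return text[:max_length]


print(cast_varchar("ABCDEFGHICD", 10))   # the CAST example
print(cast_varchar("123456789122", 10))  # the INSERT example
```

The characters beyond the declared length are dropped without any warning, which is why a too-small length loses data silently.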
3. Date data type
Type | Description |
---|---|
TIMESTAMP | 1. Available since Hive 0.8.0. Values are timezoneless and stored as an offset from the UNIX epoch; convenience UDFs (to_utc_timestamp, from_utc_timestamp) are provided for time zone conversion. 2. All existing datetime UDFs (month, day, year, hour, etc.) work with the TIMESTAMP data type. TIMESTAMP values can also be converted from integer, floating-point, and string data. It is not used very often in actual development. |
DATE | Available since Hive 0.12.0. DATE values describe a specific year/month/day in the format YYYY-MM-DD, for example DATE'2013-01-01'. Date types have no time component. The supported range is 0000-01-01 to 9999-12-31, depending on the native support of the Java Date type. Date types can only be converted to or from Date, Timestamp, or String types. |
INTERVAL | Available since Hive 1.2.0; rarely used in actual development. |
Notes on the TIMESTAMP type
1. A timestamp column stores date and time values, and accepts input given as floating-point, integer, or string data.
> create table test_timestamp(
a int,
b bigint,
c timestamp
);
> insert overwrite table test_timestamp
select 1,2,12334324 from test_timestamp limit 2;
> select * from test_timestamp;
+-------------------+-------------------+--------------------------+
| test_timestamp.a | test_timestamp.b | test_timestamp.c |
+-------------------+-------------------+--------------------------+
| 1 | 2 | 1970-01-01 03:25:34.324 |
| 1 | 2 | 1970-01-01 03:25:34.324 |
+-------------------+-------------------+--------------------------+
> insert overwrite table test_timestamp
select 3,4,"2019-05-22 21:23:34" from test_timestamp limit 2;
> select * from test_timestamp;
+-------------------+-------------------+------------------------+
| test_timestamp.a | test_timestamp.b | test_timestamp.c |
+-------------------+-------------------+------------------------+
| 3 | 4 | 2019-05-22 21:23:34.0 |
| 3 | 4 | 2019-05-22 21:23:34.0 |
+-------------------+-------------------+------------------------+
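The first result above, 1970-01-01 03:25:34.324 for the integer 12334324, is consistent with the integer being read as milliseconds since the Unix epoch. The Python sketch below checks only that arithmetic. It assumes UTC and does not model Hive's time zone handling or any version-specific setting that may change how integers are interpreted.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def millis_to_timestamp(millis):
    """Interpret an integer as milliseconds since 1970-01-01 00:00:00 (assumed UTC)."""
    return EPOCH + timedelta(milliseconds=millis)


print(millis_to_timestamp(12334324))
```

12334324 ms is 3 hours, 25 minutes, and 34.324 seconds, which matches the time-of-day part of the stored value.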
4. Other data types
Type | Description |
---|---|
BOOLEAN | Boolean type: TRUE or FALSE |
BINARY | Byte array for storing variable-length binary data. |
5. Composite data types
Type | Description |
---|---|
STRUCT | A collection of named fields, which may have different types |
MAP | A collection of key-value pairs; the key must be a primitive type, and the value can be of any type |
ARRAY | An ordered collection of elements that all have the same type |
Two, Hive data type conversion
1. Implicit conversion rules
- Hive supports both implicit conversions and explicit conversions.
- For example, when comparing two numbers of different data types, if one is INT and the other is SMALLINT, the SMALLINT value is implicitly converted to INT. An INT value cannot be implicitly converted to SMALLINT or TINYINT; that returns an error unless you use the cast operation.
- Any integer type can be implicitly converted to a wider type. TINYINT, SMALLINT, INT, BIGINT, FLOAT, and STRING can all be implicitly converted to DOUBLE.
- BOOLEAN cannot be implicitly converted to any other data type; converting it requires an explicit cast.
Data Type Conversion Table
2. Conversion between the same data type
"The same data type" here means types in the same family, such as the numeric data types or the date data types.
Conversion within a family follows the "upward conversion" rule: when a value of a narrower type is compared or computed with a value of a wider type (a type with a larger range), the narrower value is implicitly converted to the wider type before the operation is carried out.
For example, when comparing 1 and 1.23, Hive automatically converts 1 to 1.0 and then compares.
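The upward conversion rule can be pictured as picking the wider of two types from an ordered list. The Python sketch below is a simplified model, not Hive's resolution logic: the ordering list `NUMERIC_ORDER` and the helper `common_type` are hypothetical, and DECIMAL and STRING are left out on purpose.

```python
# Numeric types ordered from narrowest to widest (assumed ordering for this sketch).
NUMERIC_ORDER = ["TINYINT", "SMALLINT", "INT", "BIGINT", "FLOAT", "DOUBLE"]


def common_type(left, right):
    """Return the wider of two numeric type names; both operands are converted to it."""
    return max(left, right, key=NUMERIC_ORDER.index)


print(common_type("SMALLINT", "INT"))  # the SMALLINT side is widened
print(common_type("INT", "DOUBLE"))    # the INT side is widened

# Python widens in a similar way when an int is compared with a float.
print(1 < 1.23)
```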
3. Conversion between different data types
Explicit conversion function cast()
Usage: cast(value as type), where value is the data to convert, AS is a fixed keyword, and type is the target type.
> select
cast("1223" as double),
cast("456.23" as int),
cast("1.99" as int),
cast("abc" as int) ,
cast(456.23 as decimal(9,2));
+---------+------+------+-------+---------+
| _c0 | _c1 | _c2 | _c3 | _c4 |
+---------+------+------+-------+---------+
| 1223.0 | 456 | 1 | NULL | 456.23 |
+---------+------+------+-------+---------+
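The string-to-int results in the query above (456, 1, and NULL) can be modeled in a few lines of Python. This is a sketch of the observed behavior only; the helper name `cast_to_int` is hypothetical, and None stands in for NULL.

```python
def cast_to_int(text):
    """Model cast(text as int): drop the fractional part, or return None if not numeric."""
    try:
        return int(float(text))  # int() truncates toward zero, it does not round
    except (ValueError, OverflowError):
        return None


for sample in ["456.23", "1.99", "abc"]:
    print(sample, "->", cast_to_int(sample))
```

Note that "1.99" becomes 1, not 2: the fractional part is discarded rather than rounded.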
Precautions for using the cast() function
- A cast succeeds only if the value meets the conversion conditions (see the data type conversion table); otherwise the result is NULL. For example, "abc" is clearly not a number, so casting it to int fails and returns NULL.
- If you use cast to convert a higher type to a lower type, the value is simply cut down, losing precision or even producing a wrong result. For example, casting the floating-point value 456.23 to int drops the fractional part and keeps only the integer part. To convert floating-point numbers to integers, round() or floor() is preferable to cast. To keep the fractional digits instead, cast 456.23 to decimal(9,2); the precision of the decimal must be defined, and it is the total number of digits on both sides of the decimal point.
- Date data type conversion: for date data types, conversion is only possible between Date, Timestamp, and String.
Valid conversion | Result |
---|---|
cast(date as date) | Returns the same date value |
cast(timestamp as date) | The year/month/day of the timestamp is determined based on the local time zone and returned as a date value |
cast(string as date) | If the string is in YYYY-MM-DD format, the corresponding date value is returned; otherwise the result is NULL |
cast(date as timestamp) | Based on the local time zone, generates a timestamp value for midnight of the date's year/month/day |
cast(date as string) | The year/month/day represented by the date is converted to a string in YYYY-MM-DD format |
# cast(timestamp as date)
# Show the current timestamp: current_timestamp()
# Show the current date: current_date()
> select current_timestamp();
+--------------------------+
| _c0 |
+--------------------------+
| 2023-05-09 11:24:12.067 |
+--------------------------+
> select current_date();
+-------------+
| _c0 |
+-------------+
| 2023-05-09 |
+-------------+
## cast(timestamp as date)
> select cast(current_timestamp() as date);
+-------------+
| _c0 |
+-------------+
| 2023-05-09 |
+-------------+
## cast(date as timestamp)
> select cast(current_date() as timestamp);
+------------------------+
| _c0 |
+------------------------+
| 2023-05-09 00:00:00.0 |
+------------------------+
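The date conversions in the table and the console output above can be mirrored with Python's datetime module. This is a rough sketch with hypothetical helper names; it ignores time zones entirely, whereas the table notes that Hive's timestamp/date conversions depend on the local time zone.

```python
from datetime import date, datetime


def cast_string_to_date(text):
    """Model cast(string as date): parse YYYY-MM-DD, or return None (NULL) otherwise."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None


def cast_timestamp_to_date(ts):
    """Model cast(timestamp as date): keep only year/month/day."""
    return ts.date()


def cast_date_to_timestamp(d):
    """Model cast(date as timestamp): midnight of the given day."""
    return datetime(d.year, d.month, d.day)


print(cast_string_to_date("2023-05-09"))
print(cast_string_to_date("09/05/2023"))
print(cast_timestamp_to_date(datetime(2023, 5, 9, 11, 24, 12, 67000)))
print(cast_date_to_timestamp(date(2023, 5, 9)))
```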
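Three, Practical considerations
1. Pitfalls with string and bigint types
Pitfall 1:
Problem: Suppose table 1 has a field a1 of type string, with values such as "420001053411844" and "000001053411844", and table 2 has a field a2 of type bigint, and a1 needs to be joined to a2. If you join directly on cast(a1 as bigint) = a2, the a1 value "000001053411844" becomes 1053411844, so rows that should not join end up matching.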
> desc test_table1;
+-----------+------------+----------+
| col_name | data_type | comment |
+-----------+------------+----------+
| a1 | string | |
+-----------+------------+----------+
> select * from test_table1;
+------------------+
| test_table1.a1 |
+------------------+
| 420001053411844 |
| 000001053411844 |
+------------------+
> desc test_table2;
+-----------+------------+----------+
| col_name | data_type | comment |
+-----------+------------+----------+
| a2 | bigint | |
+-----------+------------+----------+
> select * from test_table2;
+------------------+
| test_table2.a2 |
+------------------+
| 420001053411844 |
| 1053411844 |
+------------------+
> SELECT
a.a1 a1,
b.a2 a2
from test_table1 a
join test_table2 b
on cast(a.a1 as bigint) = b.a2;
+------------------+------------------+
| a1 | a2 |
+------------------+------------------+
| 420001053411844 | 420001053411844 |
| 000001053411844 | 1053411844 |
+------------------+------------------+
Cause and caution: do not join cast(a1 as bigint) directly against a2, because an a1 value of "000001053411844" becomes 1053411844, which produces incorrect matches.
> select cast("000001053411844" as int);
+-------------+
| _c0 |
+-------------+
| 1053411844 |
+-------------+
Solution: convert the bigint column to string before joining.
> SELECT
a.a1 a1,
b.a2 a2
from test_table1 a
join test_table2 b
on a.a1 = cast(b.a2 as string);
+------------------+------------------+
| a1 | a2 |
+------------------+------------------+
| 420001053411844 | 420001053411844 |
+------------------+------------------+
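The difference between the two join conditions can be reproduced outside Hive. The following Python sketch uses plain lists in place of the two tables (the variable names are illustrative) and compares joining on the integer value with joining on the string value.

```python
table1 = ["420001053411844", "000001053411844"]  # a1 values, stored as strings
table2 = [420001053411844, 1053411844]           # a2 values, stored as integers

# Join on cast(a1 as bigint) = a2: the leading zeros disappear during conversion.
join_on_int = [(a1, a2) for a1 in table1 for a2 in table2 if int(a1) == a2]

# Join on a1 = cast(a2 as string): the leading zeros stay significant.
join_on_str = [(a1, a2) for a1 in table1 for a2 in table2 if a1 == str(a2)]

print(join_on_int)
print(join_on_str)
```

Converting the integer side to a string is lossless, while converting the string side to an integer discards the leading zeros, which is why only the second condition keeps the two identifiers distinct.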
Aside: as long as the column is of type bigint, it apparently cannot store 000001053411844; whichever way the value is inserted, it becomes 1053411844.
> create table test_table2 (a2 bigint);
> insert overwrite table test_table2 select * from test_table1 ;
> load data local inpath '/apps/wqf/cdc_model/data/data_20230509.txt' overwrite into table wqf.test_table2;
> insert overwrite table test_table2 values (420001053411844),(000001053411844);
> insert overwrite table test_table2 values ("420001053411844"),("000001053411844");
> insert overwrite table test_table2 select * from test_table1;
> select * from test_table2 ;
+------------------+
| test_table2.a2 |
+------------------+
| 420001053411844 |
| 1053411844 |
+------------------+
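Pitfall 2:
Problem: Suppose table 1 has a field a1 of type string, with values such as "150970594253582620", and table 2 has a field a2 of type bigint, with values such as 150970594253582621, and a1 needs to be joined to a2. These two rows should not join, yet contrary to expectation they match.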
> create table test_table1 (a1 string);
> insert overwrite table test_table1 select "150970594253582620";
> create table test_table2 (a2 bigint);
> insert overwrite table test_table2 select 150970594253582621;
> SELECT
a.a1 a1,
b.a2 a2
from test_table1 a
join test_table2 b
on a.a1= b.a2;
+---------------------+---------------------+
| a1 | a2 |
+---------------------+---------------------+
| 150970594253582620 | 150970594253582621 |
+---------------------+---------------------+
Cause and caution: a1 in table1 is a string and a2 in table2 is a bigint. When a string is compared with a bigint in the join condition, both are implicitly converted to double. A bigint is 8 bytes, has a range of -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807, and holds up to 19 digits exactly. A double is also 8 bytes, with a range of about -1.797693E+308 to 1.797693E+308, but it carries only 15 to 16 significant digits. Both 18-digit values therefore lose precision in the conversion and end up equal.
Solution: explicitly cast a1 and a2 to the same type before comparing, for example by converting the bigint side to string as in pitfall 1.
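The precision loss can be checked directly, because Python's float is an IEEE 754 double like Hive's DOUBLE. This sketch only demonstrates the arithmetic; it does not run any Hive code.

```python
a1 = "150970594253582620"  # string column value
a2 = 150970594253582621    # bigint column value

# Compared as exact integers, the two values differ by one.
print(int(a1) == a2)

# Compared as doubles, both round to the same nearest representable value.
print(float(a1) == float(a2))
print(int(float(a1)), int(float(a2)))

# Compared as strings, they are distinct again.
print(a1 == str(a2))
```

At this magnitude adjacent doubles are 32 apart, so two integers that differ by 1 can easily collapse to the same double.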
Summary of string and bigint type pitfalls:
1. When comparing bigint and string values, explicitly cast both to the same type. Beyond checking whether the types can be converted at all, watch two things. First, check what the field being converted actually contains: especially when converting a string to an integer type, check whether any values have leading zeros or a trailing decimal part, as in pitfall 1. Second, check the actual numeric precision of the fields being joined or converted, and whether precision will be lost, as in pitfall 2.
Reference article:
https://juejin.cn/post/7039162114157756430