0458-Analysis of Hive Data Type Checking

Fayson's GitHub:
https://github.com/fayson/cdhproject

1

Purpose of this article

When using Hive you will run into data type checking problems that involve the schema: the number of columns and the data type of each column. A traditional relational database enforces strict rules, so data must match the defined schema before it can be stored. Hive, by contrast, does not care about the format or content of the underlying data; it only applies the schema definition and converts the data when it is read. This is where data type conversion problems arise. In this article Fayson analyzes how to find the rows in a table that fail type conversion, and how Hive handles empty values and NULL.

  • Test environment

1. RedHat 7.2

2. CM and CDH version 5.15.0

2

Test data preparation

  1. The table creation statement is as follows:
create table test_cast (id int, age int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;


  2. The test data is as follows:
[root@cdh2 ~]# vim test1.dat
1,23
2,24c
3,32d
4,30
5,NULL


  3. Load the test data into the test_cast table and view the data in the table

Fayson defined both the id and age fields of the table as int. The age column of the loaded sample data contains non-numeric values, so type conversion fails for those rows and the table displays them as NULL when queried.
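The read-time behavior described above can be imitated outside Hive. The following is a minimal Python sketch, not Hive's actual implementation: the helper names `read_as_int` and `read_rows` are hypothetical, and the conversion rule (only optionally signed digit strings become integers) is a simplifying assumption about how a text field is converted to int.

```python
import re

# Simplified stand-in for Hive's read-time conversion of a text field to int.
# Assumption: only optionally signed digit strings convert; anything else
# yields None, which corresponds to the NULL that Hive displays.
_INT_PATTERN = re.compile(r"[+-]?[0-9]+")


def read_as_int(field):
    """Return the field as an int, or None when it is not an integer."""
    return int(field) if _INT_PATTERN.fullmatch(field) else None


def read_rows(lines):
    """Split comma-delimited lines and apply an (id int, age int) schema."""
    rows = []
    for line in lines:
        raw_id, raw_age = line.split(",", 1)
        rows.append((read_as_int(raw_id), read_as_int(raw_age)))
    return rows


# Same content as test1.dat above.
sample = ["1,23", "2,24c", "3,32d", "4,30", "5,NULL"]
for row in read_rows(sample):
    print(row)
```

In this sketch the file content is never rejected or modified; rows 2, 3 and 5 simply come back with an age of None, which mirrors the point that the schema is applied only when the data is read.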

3

Finding data with type conversion errors

Hive itself has no mechanism to validate data. To retrieve the rows in a table whose type conversion failed, combine the nvl and cast functions to test whether the conversion succeeded, as follows:

  1. Use nvl and cast to mark the rows whose conversion failed, SQL as follows:
select id,nvl(cast(age as int), "error") age from test_cast;

2. Write the rows with type conversion errors into a new table, SQL as follows:

create table  test_exception as
 select * from (select id,nvl(cast(age as int), 'error') age from test_cast) as b where b.age='error';


The same rows can also be found with cast and an is null filter, SQL as follows:

create table test_exception as 
select * from (select id,nvl(cast(age as int), age) age from test_cast) as b where b.age is null;


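View the retrieved rows with type conversion errors

3. View the type conversion error data written to HDFS

With the approach above we can retrieve the rows in the test_cast table whose age column failed type conversion, then use the ID of each row to look up the corresponding raw data and find the cause. This process raises another question: how does Hive handle NULL and empty values? Fayson covers how Hive treats these two values next.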
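The nvl and cast combination, together with the lookup of the raw value by ID, can be sketched in Python against the raw file lines. This is an illustration under assumptions rather than Hive code: `cast_to_int`, `nvl` and `find_conversion_errors` are hypothetical helpers, and the cast rule is simplified to accept only optionally signed digit strings.

```python
import re


def cast_to_int(value):
    """Simplified cast(... as int): None when the text is not an integer."""
    if value is not None and re.fullmatch(r"[+-]?[0-9]+", value):
        return int(value)
    return None


def nvl(value, default):
    """Return default when value is None, in the spirit of Hive's nvl."""
    return default if value is None else value


def find_conversion_errors(lines):
    """Return (id, raw_age) pairs for rows whose age fails the cast."""
    errors = []
    for line in lines:
        raw_id, raw_age = line.split(",", 1)
        # Same test as the SQL: a failed cast is replaced by 'error'.
        if nvl(cast_to_int(raw_age), "error") == "error":
            errors.append((cast_to_int(raw_id), raw_age))
    return errors


raw_lines = ["1,23", "2,24c", "3,32d", "4,30", "5,NULL"]
print(find_conversion_errors(raw_lines))
```

Because the sketch works on the raw lines, it reports the offending original text next to each ID, which is the information needed to track down the cause of the failed conversion.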

4

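Handling of NULL and empty values in Hive

From the process above we can see that Hive displays rows with type conversion errors as NULL when queried, but after those rows are written into a new table the data file shows \N. What if our data contains the string 'NULL'? Hive stores NULL as \N by default, so how do we retrieve rows that hold the string NULL?

1. Create a test table and prepare test data, SQL as follows: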

create table test_null (id int, age string) 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

The test data is as follows:

[root@cdh2 ~]# vim test1.dat
1,23
2,24c
3,32d
4,30
5,NULL
6,\N

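2. Load the data into the test_null table; the result is displayed as follows:

As shown in the figure above, rows 5 and 6 both display as NULL, so the query output alone cannot tell us which row has a genuinely empty age.

3. Specifying query conditions retrieves the empty value and the NULL string separately

Using is null retrieves the row stored as \N (the row with id 6)

Using ='NULL' retrieves the row holding the string NULL (the row with id 5)

4. Hive uses the serialization.null.format parameter to store and identify NULL. Setting this parameter to NULL on the table makes the text NULL represent an empty value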

alter table test_null set serdeproperties ('serialization.null.format' = 'NULL');


Insert a row whose age is NULL into the table

insert into test_null values(7,NULL);

View the data now displayed in the table

View the inserted data in HDFS; it shows NULL
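The role of the NULL marker in the steps above can be illustrated with a small Python sketch. It is a simplified model of a text-format table, not Hive's SerDe: `deserialize` and `serialize` are hypothetical helpers, and the only rule modeled is that a field equal to the configured marker is treated as NULL on read, while a NULL value is written as the marker.

```python
def deserialize(line, null_format=r"\N"):
    """Parse one 'id,age' line; an age equal to null_format becomes None."""
    raw_id, raw_age = line.split(",", 1)
    return int(raw_id), (None if raw_age == null_format else raw_age)


def serialize(row, null_format=r"\N"):
    """Write one (id, age) row; a None age is written as null_format."""
    row_id, age = row
    return f"{row_id},{null_format if age is None else age}"


# Rows 5 and 6 of the test data above.
lines = ["5,NULL", r"6,\N"]

# Default marker: row 6 is a real NULL, row 5 holds the string 'NULL'.
print([deserialize(line) for line in lines])

# With the marker switched to NULL, the roles of the two rows are reversed.
print([deserialize(line, null_format="NULL") for line in lines])

# A row inserted with a NULL age is written to the file as the marker text.
print(serialize((7, None), null_format="NULL"))
```

In this model the file bytes never change when the marker is altered; only their interpretation does, which is why the same two rows swap roles between the two reads.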

5

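Summary

1. Hive does not validate data types when data is put or loaded into a table. When data is inserted with insert into table select…, rows with type errors are written to the table as the empty value \N (\N is the default NULL marker in Hive)

2. serialization.null.format specifies how Hive stores and identifies NULL. It can be left at the default \N, or set to NULL or ''

3. If a table contains a large number of NULL values, the Hive data files will contain a large number of \N entries, which wastes storage space. In that case serialization.null.format can be set to ''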

alter table test_null set serdeproperties('serialization.null.format' = '');


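After rows with NULL values are inserted, the HDFS data file is stored as follows

The way Hive stores and identifies NULL can be specified in the create table statement, or changed on an existing table with alter. The create table form is as follows: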

create table test_null_1 (id int, age string) 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
NULL DEFINED AS ''
STORED AS TEXTFILE;
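The storage argument in point 3 of the summary can be checked with simple arithmetic. The Python sketch below uses a hypothetical table of 1000 rows in which every age is NULL, and the helper `data_file_bytes` is an illustrative assumption about the comma-delimited text layout, not a Hive API.

```python
def data_file_bytes(rows, null_format):
    """Size in bytes of comma-delimited text with None written as null_format."""
    lines = []
    for row in rows:
        fields = [null_format if value is None else str(value) for value in row]
        lines.append(",".join(fields) + "\n")
    return len("".join(lines).encode("utf-8"))


# Hypothetical table: 1000 rows, every age is NULL.
rows = [(i, None) for i in range(1000)]

default_size = data_file_bytes(rows, r"\N")
empty_size = data_file_bytes(rows, "")

# The default marker costs two bytes for every NULL field.
print(default_size - empty_size)
```

One trade-off to keep in mind with this setting: once the marker is '', an empty string in the data file reads back as NULL, so the two can no longer be told apart.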


Origin blog.csdn.net/Hadoop_SC/article/details/104097452