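Symptom
Hive uses TEZ as its default execution engine. After records are inserted into a table, count() returns a value that does not match the actual number of rows. Running the same count() with MR as the execution engine returns the correct number.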
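The comparison between the two engines can be reproduced in a single session. This is a minimal sketch: it assumes the session is allowed to override hive.execution.engine and reuses the table test1 from the example further down.

```sql
-- Run the count on the default engine (TEZ in this setup).
set hive.execution.engine=tez;
select count(*) from test1;

-- Run the same count on MapReduce and compare the two numbers.
set hive.execution.engine=mr;
select count(*) from test1;
```

If the two numbers differ, the data itself is probably fine and the discrepancy comes from how the query is answered.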
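Solution
Running count() on TEZ is very fast because it skips the MapReduce job entirely, but the result is wrong. The most likely explanation is that some mechanism under TEZ answers count() directly from the table statistics instead of scanning the data, and those statistics are out of date, so count(*) returns a stale value.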
hive> select count(*) from test1;
OK
1131921
Inspecting the table definition shows that the value returned by count(*) is the same as numRows in the table properties:
hive> show create table test1;
OK
CREATE TABLE `test1`(
`pripid` string,
`uniscid` string,
`entname` string,
...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://hadoop1/apps/hive/warehouse/default.db/test1'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='28',
'numRows'='1131921',
'rawDataSize'='685459303',
'totalSize'='2323131590',
'transient_lastDdlTime'='1531319725')
Time taken: 0.227 seconds, Fetched: 48 row(s)
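A shorter way to check only the relevant values is to query the table properties directly instead of reading the whole definition. This is a sketch that assumes the Hive version in use supports the single-property form of SHOW TBLPROPERTIES.

```sql
-- Row count recorded in the metastore statistics.
show tblproperties test1("numRows");

-- Flag telling Hive that it considers the stored statistics accurate.
show tblproperties test1("COLUMN_STATS_ACCURATE");
```

When numRows equals the suspicious count(*) result and the accuracy flag is set, stale statistics are a plausible cause.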
After running the ANALYZE command to recompute the table statistics, count(*) returns the correct result:
hive> analyze table test1 compute statistics;
Query ID = trafodion_20180711104240_02eb6fb5-f53c-454f-aa1e-8c6ca157b21c
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1531148517927_2403)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 146 146 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 12.62 s
--------------------------------------------------------------------------------
Table test1 stats: [numFiles=28, numRows=5562243, totalSize=2323131590, rawDataSize=2317569347]
OK
Time taken: 14.247 seconds
hive> select count(*) from test1;
OK
5562243
Time taken: 0.045 seconds, Fetched: 1 row(s)
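If recomputing statistics after every load is not practical, a possible alternative is to stop Hive from answering the query out of the metastore. The sketch below assumes that the stale answer comes from the hive.compute.query.using.stats setting, which lets Hive answer queries such as count(*) purely from stored statistics. The original session does not confirm that this setting is enabled, so treat it as something to verify on your cluster.

```sql
-- Assumption: hive.compute.query.using.stats is what produces the stale answer.
-- Turn it off for this session so count(*) scans the data.
set hive.compute.query.using.stats=false;
select count(*) from test1;
```

The query then launches a real job and takes noticeably longer than the 0.045 seconds shown above, but its result no longer depends on numRows being current.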