Hive: locating the source file and position of queried data

When a SELECT statement in Hive returns results, the user usually cannot tell which file, or which position in that file, a given row came from. Hive accounts for this by providing three virtual columns that can be referenced in a query like ordinary columns:

  1. INPUT__FILE__NAME: the full path of the file read by the map task.
  2. BLOCK__OFFSET__INSIDE__FILE: for RCFile or SequenceFile (block-compressed formats), the offset within the file of the first byte of the current block; for TextFile, the offset within the file of the first byte of the current line.
  3. ROW__OFFSET__INSIDE__BLOCK: for RCFile and SequenceFile, the row number within the block; for TextFile, it is displayed as 0.
    NOTE: To display ROW__OFFSET__INSIDE__BLOCK, you must first run set hive.exec.rowoffset = true;

When dirty data turns up in a data set, these columns can be used to locate exactly where the dirty records are. It is a very effective way to troubleshoot.
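
As a minimal sketch of that troubleshooting workflow, the query below lists the file and byte offset of every suspicious row in the temp.temp_text_file_name table used in the rest of this post. The filter is only an assumption about what counts as dirty here (a missing or empty channel); substitute whatever condition matches the bad records in your own data.

```sql
-- Assumed definition of "dirty": channel is missing or empty.
select input__file__name,
       block__offset__inside__file,
       content_name,
       channel
  from temp.temp_text_file_name
 where channel is null
    or trim(channel) = '';
```

For a TextFile table the offset is the position of the first byte of the line, so the reported file can be opened and the record inspected at that position.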

Create a textfile table

create table temp.temp_text_file_name (
content_name  string,
channel       string
)
row format delimited fields terminated by '\t'
stored as textfile;

Query the file location and offsets. If hive.exec.rowoffset is not set to true, the query reports the following error:

Error: Error while compiling statement: FAILED: SemanticException [Error 10004]: Line 1:69 Invalid table alias or column reference 'row__offset__inside__block': (possible column names are: content_name, channel) (state=42000,code=10004)

Set hive.exec.rowoffset to true, then run the query:

set hive.exec.rowoffset = true;

select content_name, input__file__name, block__offset__inside__file, row__offset__inside__block
  from temp.temp_text_file_name
 limit 10;

+---------------+----------------------------------------------------+------------------------------+-----------------------------+--+
| content_name  |                 input__file__name                  | block__offset__inside__file  | row__offset__inside__block  |
+---------------+----------------------------------------------------+------------------------------+-----------------------------+--+
| CCTV10科教      | hdfs://hive/warehouse/temp.db/temp_text_file_name/temp_test.csv | 0                            | 0                           |
| CCTV10科教高清    | hdfs://hive/warehouse/temp.db/temp_text_file_name/temp_test.csv | 20                           | 0                           |
| CCTV11戏曲      | hdfs://hive/warehouse/temp.db/temp_text_file_name/temp_test.csv | 46                           | 0                           |
| CCTV12社会与法    | hdfs://hive/warehouse/temp.db/temp_text_file_name/temp_test.csv | 66                           | 0                           |
| CCTV13新闻      | hdfs://hive/warehouse/temp.db/temp_text_file_name/temp_test.csv | 92                           | 0                           |
| CCTV14少儿      | hdfs://hive/warehouse/temp.db/temp_text_file_name/temp_test.csv | 112                          | 0                           |
| CCTV14少儿高清    | hdfs://hive/warehouse/temp.db/temp_text_file_name/temp_test.csv | 132                          | 0                           |
| CCTV15音乐      | hdfs://hive/warehouse/temp.db/temp_text_file_name/temp_test.csv | 158                          | 0                           |
| CCTV1综合       | hdfs://hive/warehouse/temp.db/temp_text_file_name/temp_test.csv | 178                          | 0                           |
| CCTV1综合高清     | hdfs://hive/warehouse/temp.db/temp_text_file_name/temp_test.csv | 196                          | 0                           |
+---------------+----------------------------------------------------+------------------------------+-----------------------------+--+

Create a table stored as ORC

create table temp.temp_orc_file_name (
content_name  string,
channel       string
)
row format delimited fields terminated by '\t'
stored as orc;

Insert sample data

insert into table temp.temp_orc_file_name
select * from temp.temp_text_file_name;

Query

select content_name, input__file__name, block__offset__inside__file, row__offset__inside__block
  from temp.temp_orc_file_name
 limit 10;

+---------------+----------------------------------------------------+------------------------------+-----------------------------+--+
| content_name  |                 input__file__name                  | block__offset__inside__file  | row__offset__inside__block  |
+---------------+----------------------------------------------------+------------------------------+-----------------------------+--+
| CCTV10科教      | hdfs://hive/warehouse/temp.db/temp_orc_file_name/000000_0 | 24                           | 0                           |
| CCTV10科教高清    | hdfs://hive/warehouse/temp.db/temp_orc_file_name/000000_0 | 48                           | 0                           |
| CCTV11戏曲      | hdfs://hive/warehouse/temp.db/temp_orc_file_name/000000_0 | 73                           | 0                           |
| CCTV12社会与法    | hdfs://hive/warehouse/temp.db/temp_orc_file_name/000000_0 | 97                           | 0                           |
| CCTV13新闻      | hdfs://hive/warehouse/temp.db/temp_orc_file_name/000000_0 | 121                          | 0                           |
| CCTV14少儿      | hdfs://hive/warehouse/temp.db/temp_orc_file_name/000000_0 | 146                          | 0                           |
| CCTV14少儿高清    | hdfs://hive/warehouse/temp.db/temp_orc_file_name/000000_0 | 170                          | 0                           |
| CCTV15音乐      | hdfs://hive/warehouse/temp.db/temp_orc_file_name/000000_0 | 194                          | 0                           |
| CCTV1综合       | hdfs://hive/warehouse/temp.db/temp_orc_file_name/000000_0 | 219                          | 0                           |
| CCTV1综合高清     | hdfs://hive/warehouse/temp.db/temp_orc_file_name/000000_0 | 243                          | 0                           |
+---------------+----------------------------------------------------+------------------------------+-----------------------------+--+
10 rows selected (0.29 seconds)

Create a SequenceFile table, insert data, and query it

create table temp.temp_seq_file_name (
content_name  string,
channel       string
)
row format delimited fields terminated by '\t'
stored as sequencefile;

insert into table temp.temp_seq_file_name
select * from temp.temp_text_file_name;

select content_name, input__file__name, block__offset__inside__file, row__offset__inside__block
  from temp.temp_seq_file_name
 order by row__offset__inside__block desc
 limit 10;

+---------------+----------------------------------------------------+------------------------------+-----------------------------+--+
| content_name  |                 input__file__name                  | block__offset__inside__file  | row__offset__inside__block  |
+---------------+----------------------------------------------------+------------------------------+-----------------------------+--+
| CCTV1综合高清     | hdfs://hive/warehouse/temp.db/temp_seq_file_name/000000_0 | 391                          | 0                           |
| CCTV1综合       | hdfs://hive/warehouse/temp.db/temp_seq_file_name/000000_0 | 361                          | 0                           |
| CCTV央视音乐      | hdfs://hive/warehouse/temp.db/temp_seq_file_name/000000_0_copy_1 | 1818                         | 0                           |
| CCTV14少儿高清    | hdfs://hive/warehouse/temp.db/temp_seq_file_name/000000_0 | 291                          | 0                           |
| CCTV14少儿      | hdfs://hive/warehouse/temp.db/temp_seq_file_name/000000_0 | 259                          | 0                           |
| CCTV13新闻      | hdfs://hive/warehouse/temp.db/temp_seq_file_name/000000_0 | 227                          | 0                           |
| CCTV12社会与法    | hdfs://hive/warehouse/temp.db/temp_seq_file_name/000000_0 | 189                          | 0                           |
| CCTV11戏曲      | hdfs://hive/warehouse/temp.db/temp_seq_file_name/000000_0 | 157                          | 0                           |
| CCTV央视文化精品    | hdfs://hive/warehouse/temp.db/temp_seq_file_name/000000_0_copy_1 | 1760                         | 0                           |
| CCTV10科教      | hdfs://hive/warehouse/temp.db/temp_seq_file_name/000000_0 | 87                           | 0                           |
+---------------+----------------------------------------------------+------------------------------+-----------------------------+--+
10 rows selected (26.585 seconds)
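
The result above mixes rows from two files, 000000_0 and 000000_0_copy_1. A quick way to see how the table's rows are spread across its files is to group by the virtual column. This is a sketch that assumes virtual columns can be used in GROUP BY like ordinary columns in your Hive version; the aliases row_count, first_offset, and last_offset are chosen here for illustration.

```sql
-- Row count and offset range per underlying file.
select input__file__name,
       count(*)                         as row_count,
       min(block__offset__inside__file) as first_offset,
       max(block__offset__inside__file) as last_offset
  from temp.temp_seq_file_name
 group by input__file__name;
```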

Open question, still to be verified and explained: why is row__offset__inside__block always 0, whether the table is stored as TextFile, ORC, or SequenceFile?
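
One way to probe this question, sketched here rather than verified, is to count the rows whose row offset is nonzero in each of the three tables instead of looking at only ten rows. hive.exec.rowoffset must be enabled first, as noted earlier; the alias nonzero_rows is an illustrative name.

```sql
set hive.exec.rowoffset = true;

-- Each query counts the rows with a nonzero row offset in one table.
select count(*) as nonzero_rows
  from temp.temp_text_file_name
 where row__offset__inside__block <> 0;

select count(*) as nonzero_rows
  from temp.temp_orc_file_name
 where row__offset__inside__block <> 0;

select count(*) as nonzero_rows
  from temp.temp_seq_file_name
 where row__offset__inside__block <> 0;
```

If all three counts come back as 0, the column is constant for this data set, though that alone would not explain why.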

Origin blog.csdn.net/lz6363/article/details/89498340