Hive index

  • Mechanisms and principles

    • The purpose of an index is to speed up queries that filter on a specified column of a Hive table

    • Without an index, Hive has to load the entire table or an entire partition when it runs a query, and then process all of that data. With an index on a column, a query that filters on that column only needs to load and process some of the files

    • As in a traditional relational database, an index improves query speed at a cost: creating it consumes additional resources, and storing it requires more disk space

    • A Hive index is actually an index table (a physical Hive table). It stores the values of the indexed column, the HDFS path of the file that contains each value, and the offset of that value within the data file

    • When a query filters on the indexed column, Hive first runs a MapReduce job against the index table. That job applies the filter condition to the indexed column, looks up the HDFS file paths and offsets recorded for the matching values, and writes them to a file on HDFS. The query job then uses that file to filter the original data files it takes as input
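
The two-stage lookup described above can be sketched in a few lines of Python. This is only a toy model, not Hive's implementation: the file paths, the row layout, and the helper names `build_index` and `query_with_index` are all hypothetical, and an in-memory dictionary stands in for both HDFS and the index table.

```python
from collections import defaultdict

# Hypothetical stand-in for HDFS: file path -> list of (offset, row) pairs.
# Each row is (user_id, city); "city" (position 1) is the indexed column.
DATA_FILES = {
    "/warehouse/users/part-00000": [(0, (1, "Beijing")), (18, (2, "Shanghai"))],
    "/warehouse/users/part-00001": [(0, (3, "Shenzhen")), (19, (4, "Beijing"))],
    "/warehouse/users/part-00002": [(0, (5, "Shanghai"))],
}


def build_index(files, column):
    """Build the 'index table': column value -> {file path: [offsets]}."""
    index = defaultdict(lambda: defaultdict(list))
    for path, rows in files.items():
        for offset, row in rows:
            index[row[column]][path].append(offset)
    return index


def query_with_index(files, index, value):
    """Look up the index first, then read only the rows it points to."""
    # Stage 1: ask the index which files and offsets hold the value.
    locations = index.get(value, {})
    # Stage 2: read only those offsets from only those files.
    result = []
    for path, offsets in sorted(locations.items()):
        rows_by_offset = dict(files[path])
        result.extend(rows_by_offset[offset] for offset in offsets)
    return result, sorted(locations)


index = build_index(DATA_FILES, column=1)
rows, files_read = query_with_index(DATA_FILES, index, "Shanghai")
print(rows)        # rows whose city is "Shanghai"
print(files_read)  # only the files that had to be opened
```

In this sketch the query for "Shanghai" never touches `part-00001`, which is the saving the index is meant to provide. The cost is also visible: `build_index` has to scan every file once, and the index itself takes extra space.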

  • Advantages

    • Avoids full table scans and the resources they waste

    • Can speed up statements that contain GROUP BY

  • Disadvantages

    • The process of using an index is cumbersome

    • An additional job is required to scan the index table

    • The index is not refreshed automatically: if the table's data changes, the index table has to be rebuilt manually
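
The last point can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not Hive's behavior in detail: the table is a plain list, `build_index` is a hypothetical helper, and the rebuild is an ordinary function call (in Hive versions that support indexes, the rebuild is done with an `ALTER INDEX ... REBUILD` statement).

```python
def build_index(rows, column):
    """Map each value of the column to the positions of the rows holding it."""
    index = {}
    for position, row in enumerate(rows):
        index.setdefault(row[column], []).append(position)
    return index


# Hypothetical table of (user_id, city) rows, indexed on city (position 1).
table = [(1, "Beijing"), (2, "Shanghai")]
index = build_index(table, column=1)

# New data arrives after the index was built; the index is not updated.
table.append((3, "Shanghai"))

# A lookup through the stale index only knows about the old rows.
stale = [table[p] for p in index.get("Shanghai", [])]

# Manual rebuild: after it, the index covers the new row as well.
index = build_index(table, column=1)
fresh = [table[p] for p in index.get("Shanghai", [])]

print(stale)
print(fresh)
```

Here the stale lookup has no entry for the row that was appended later, and only the rebuilt index points to it. How a real Hive deployment treats a stale index may differ; the sketch only shows why the rebuild step is needed.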

Origin www.cnblogs.com/xiangyuguan/p/11403824.html