Hive (data warehouse tool)

Hive is a data warehouse tool built on Hadoop. It can map structured data files to database tables and provides a simple SQL query capability by converting SQL statements into MapReduce jobs for execution. Its advantage is a low learning cost: simple MapReduce statistics can be implemented quickly with SQL-like statements, without developing dedicated MapReduce applications, which makes it well suited to statistical analysis of data warehouses.

 

Definition of Hive

Hive is a data warehouse infrastructure built on top of Hadoop. It provides a set of tools for extract-transform-load (ETL) and a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. Hive defines a simple SQL-like query language, called HQL, that allows SQL-savvy users to query data. The language also allows developers familiar with MapReduce to plug in custom mappers and reducers for complex analysis tasks that the built-in mappers and reducers cannot handle.
Hive does not impose a specialized data format. Hive works well on top of Thrift, controls delimiters, and also allows the user to specify the data format.

Hive application scenarios

Hive is built on top of Hadoop, which is designed for static batch processing. Hadoop usually has high latency and incurs considerable overhead in job submission and scheduling, so Hive cannot provide low-latency, fast queries over large-scale datasets; for example, a Hive query over a dataset of a few hundred MB typically has minute-level latency.
Hive is therefore not suitable for applications that require low latency, such as online transaction processing (OLTP). The Hive query process strictly follows the Hadoop MapReduce job execution model: Hive converts the user's HiveQL statement into a MapReduce job through its interpreter and submits it to the Hadoop cluster, and Hadoop monitors the job and returns the result to the user. Hive is not designed for online transaction processing and does not provide real-time queries or row-level data updates. The best use case for Hive is batch jobs over large datasets, for example, web log analysis.

Hive design features

Hive is a data warehouse processing tool that wraps Hadoop underneath and uses the SQL-like HiveQL language for data queries. All Hive data is stored in a Hadoop-compatible file system (e.g., Amazon S3 or HDFS). Hive does not modify data while loading it; it only moves the data into the directory that Hive has configured in HDFS. Consequently, Hive does not support rewriting or appending data: all data is determined at load time. The design features of Hive are as follows.
● Indexes are supported to speed up data queries.
● Multiple storage types are supported, e.g., plain text files and files in HBase.
● Metadata is kept in a relational database, which greatly reduces the time spent on semantic checks during queries.
● Data already stored in the Hadoop file system can be used directly.
● A large number of built-in user-defined functions (UDFs) are available for manipulating dates, strings, and other data, and users can add custom UDFs for operations the built-in functions cannot perform.
● Queries use a SQL-like syntax and are converted into MapReduce jobs that run on the Hadoop cluster (see the example after this list).
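As a small illustration of the SQL-like query mode and the built-in functions, the following query (the access_log table and log_time column are hypothetical) would be compiled by Hive into a MapReduce job that counts records per day using the built-in to_date function:
SELECT to_date(log_time) AS day, count(*) AS hits
FROM access_log
GROUP BY to_date(log_time);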

Hive architecture

The Hive architecture mainly consists of the following parts:
User interfaces
There are three main user interfaces: the CLI, the Client, and the WUI. The CLI is the most commonly used; starting the CLI also starts a Hive instance. The Client is Hive's client, which the user uses to connect to the Hive Server; when starting in Client mode, you must specify the node where the Hive Server runs and start the Hive Server on that node. The WUI accesses Hive through a browser.
Metadata storage
Hive stores its metadata in a database such as MySQL or Derby. The metadata includes table names, the columns and partitions of each table and their attributes, table attributes (whether a table is external, etc.), the directory where each table's data is stored, and so on.
Interpreter, Compiler, Optimizer, Executor
The interpreter, compiler, and optimizer carry out lexical analysis, syntax analysis, compilation, optimization, and query plan generation for an HQL statement. The generated query plan is stored in HDFS and subsequently executed by MapReduce calls.
Hadoop
Hive data is stored in HDFS, and most queries are completed by MapReduce (simple queries such as select * from tbl do not generate MapReduce jobs).

Hive data storage

First, Hive has no special data storage format, nor does it automatically build indexes on the data. Users can organize tables in Hive quite freely: they only need to tell Hive the column separator and row separator used in the data when creating a table, and Hive can then parse the data.
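A minimal sketch of this (the page_views table, its columns, and the delimiters are illustrative): declaring the separators at creation time is all Hive needs in order to parse the underlying files later.
CREATE TABLE page_views (
  user_id STRING,
  url STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n';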
Second, all data in Hive is stored in HDFS, and Hive contains the following data models: Table, External Table, Partition, and Bucket.
A Table in Hive is conceptually similar to a table in a database, and each Table has a corresponding directory in HDFS that stores its data. For example, for a table pvs, its path in HDFS is /wh/pvs, where wh is the data warehouse directory specified by ${hive.metastore.warehouse.dir} in hive-site.xml; all Table data (excluding External Table data) is stored in this directory.
A Partition corresponds roughly to a dense index on the partition column in a database, but Partitions are organized very differently in Hive. In Hive, each Partition of a table corresponds to a subdirectory under the table's directory, and all of the Partition's data is stored in that subdirectory. For example, if the pvs table has two partition columns, ds and ctry, then the HDFS subdirectory for ds=20090801, ctry=US is /wh/pvs/ds=20090801/ctry=US, and the HDFS subdirectory for ds=20090801, ctry=CA is /wh/pvs/ds=20090801/ctry=CA.
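A hedged sketch of how the pvs partitions above could be declared and populated (the non-partition columns and the local file path are hypothetical):
CREATE TABLE pvs (userid STRING, url STRING)
PARTITIONED BY (ds STRING, ctry STRING);

LOAD DATA LOCAL INPATH '/path/to/pvs_us.txt'
INTO TABLE pvs PARTITION (ds='20090801', ctry='US');
-- the loaded file ends up under /wh/pvs/ds=20090801/ctry=US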
Buckets compute a hash over a specified column and split the data according to the hash value, with the goal of parallelism; each bucket corresponds to one file. For example, if the user column is distributed into 32 buckets, the hash of the user value is computed first; rows whose hash value is 0 are stored in the HDFS directory /wh/pvs/ds=20090801/ctry=US/part-00000, and rows whose hash value is 20 are stored in /wh/pvs/ds=20090801/ctry=US/part-00020.
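A hedged sketch of declaring such bucketing at table creation time (the userid column name is illustrative; on older Hive versions, hive.enforce.bucketing must be enabled so that inserts actually produce one file per bucket):
CREATE TABLE pvs_bucketed (userid STRING, url STRING)
PARTITIONED BY (ds STRING, ctry STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;

SET hive.enforce.bucketing = true;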
An External Table points to data that already exists in HDFS and can also have Partitions. It is organized in the metadata in the same way as a Table, but the actual data is stored quite differently.
  • For a Table, there is a table creation step and a data loading step (the two can be completed in a single statement). During loading, the actual data is moved into the data warehouse directory, and subsequent access to the data goes directly to that directory. When the table is deleted, both its data and its metadata are deleted.
  • An External Table involves only one step: loading the data and creating the table are completed at the same time (CREATE EXTERNAL TABLE ... LOCATION). The actual data is stored in the HDFS path given after LOCATION and is not moved into the data warehouse directory. When an External Table is deleted, only the metadata is removed; the data itself is not deleted.
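A hedged sketch of an External Table (the table name, columns, and HDFS path are illustrative):
CREATE EXTERNAL TABLE ext_pvs (userid STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/existing/pvs';
-- DROP TABLE ext_pvs removes only the metadata; the files under /data/existing/pvs remain.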

Hive installation and configuration

You can download a packaged stable release of Hive, or download the source code and build it yourself.
Installation requirements
  1. Java 1.6, Java 1.7 or higher.
  2. Hadoop 2.x (preferred) or 1.x; Hive 0.13 also supports Hadoop 0.20.x and 0.23.x.
  3. Linux, macOS, or Windows. The following assumes a Linux system.
Install the packaged Hive
Download the packaged Hive release from an Apache mirror, then unpack the archive:
tar -xzvf hive-x.y.z.tar.gz
Set the HIVE_HOME environment variable:
cd hive-x.y.z
export HIVE_HOME=$(pwd)
Add Hive's bin directory to the PATH:
export PATH=$HIVE_HOME/bin:$PATH
Compile Hive source code
Download the Hive source code.
To compile with Maven, you need to download and install Maven first.
 
Take Hive version 0.13 as an example
  1. Compile Hive 0.13 against Hadoop 0.23 or higher:
    $ cd hive
    $ mvn clean install -Phadoop-2,dist
    $ cd packaging/target/apache-hive-{version}-SNAPSHOT-bin/apache-hive-{version}-SNAPSHOT-bin
    $ ls
    LICENSE NOTICE README.txt RELEASE_NOTES.txt
    bin/ (all the shell scripts)
    lib/ (required jar files)
    conf/ (configuration files)
    examples/ (sample input and query files)
    hcatalog/ (hcatalog installation)
    scripts/ (upgrade scripts for hive-metastore)
  2. Compile Hive against Hadoop 0.20:
    $ cd hive
    $ ant clean package
    $ cd build/dist
    $ ls
    LICENSE NOTICE README.txt RELEASE_NOTES.txt
    bin/ (all the shell scripts)
    lib/ (required jar files)
    conf/ (configuration files)
    examples/ (sample input and query files)
    hcatalog/ (hcatalog installation)
    scripts/ (upgrade scripts for hive-metastore)
Run Hive
Running Hive depends on Hadoop, so HADOOP_HOME must be configured first:
export HADOOP_HOME=<hadoop-install-dir>
Create the /tmp directory and the /user/hive/warehouse directory (aka hive.metastore.warehouse.dir) for Hive on HDFS; then you can run Hive.
Set HIVE_HOME before running Hive:
export HIVE_HOME=<hive-install-dir>
Start Hive from the command-line window:
$HIVE_HOME/bin/hive
If it starts successfully, you will see the Hive command-line prompt.

Hive basic syntax

Hive basic data types

Hive supports integer and floating-point types of various widths, a Boolean type, and a string type with no length limit, for example TINYINT, SMALLINT, BOOLEAN, FLOAT, DOUBLE, and STRING. As in other SQL dialects, these basic type names are reserved words.
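A minimal sketch using these basic types (the table and column names are hypothetical):
CREATE TABLE user_stats (
  id INT,
  age TINYINT,
  active BOOLEAN,
  score DOUBLE,
  name STRING
);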

Hive collection data types

Columns in Hive can use the struct, map, and array collection data types. Most relational databases do not support these collection types because they break normalized form; in a relational database, the same structure is modeled by creating appropriate foreign-key associations between multiple tables. In a big data system, the advantage of collection types is higher data throughput and fewer addressing operations, which improves query speed.
Example of creating a table using the collection data types:
CREATE TABLE STUDENTINFO
(
NAME STRING,
FAVORITE ARRAY<STRING>,
COURSE MAP<STRING,FLOAT>,
ADDRESS STRUCT<CITY:STRING,STREET:STRING>
);
Query syntax: SELECT S.NAME, S.FAVORITE[0], S.COURSE["ENGLISH"], S.ADDRESS.CITY FROM STUDENTINFO S;

Hive partitioned tables

Create a partitioned table: create table employee (name string, age int, sex string) partitioned by (city string) row format delimited fields terminated by '\t';
Load data into a partition: load data local inpath '/usr/local/lee/employee' into table employee partition (city='hubei');
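A query whose WHERE clause filters on the partition column reads only that partition's directory; for example, a hedged sketch reusing the employee table above:
select name, age from employee where city = 'hubei';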

Commonly used Hive optimization methods

1. Join optimization: when joining three or more tables, if every ON clause uses the same join key, only one MapReduce job is generated.
2. Join optimization: when joining multiple tables, list them from left to right in increasing order of size. Reason: while processing each row, Hive caches the earlier tables and streams the last table through the computation.
3. Add a partition filter to the WHERE clause.
4. Use LEFT SEMI JOIN rather than INNER JOIN when the semantics allow; the former is more efficient because, for a given row in the left table, Hive stops scanning the right table as soon as a match is found (see the sketch after this list).
5. If one of the tables is small enough to fit in memory, the join can be completed while the other table is scanned, and the reduce phase is skipped. Enable this with set hive.auto.convert.join=true; and configure the maximum size of a table to be treated as small with set hive.mapjoin.smalltable.size=2500000;. If you always want these two settings, put them in the $HOME/.hiverc file.
6. Multiple outputs from the same data: several aggregations derived from one data source do not need to rescan the source for each aggregation.
For example: insert overwrite table student select * from employee; insert overwrite table person select * from employee;
This can be optimized to: from employee insert overwrite table student select * insert overwrite table person select *;
7. LIMIT tuning: a LIMIT clause normally executes the entire query and then returns only part of the result; set hive.limit.optimize.enable=true; lets Hive work from a sample of the source data instead.
8. Enable parallel execution. A job may contain many stages, some of which have no dependencies on one another and can run in parallel; enabling parallel execution lets the job finish sooner. Set the property: set hive.exec.parallel=true;
9. Hive's strict mode forbids queries in three cases.
a: On a partitioned table, a query is not allowed to run unless the WHERE clause contains a filter on the partition column.
b: An ORDER BY statement must include a LIMIT clause, because ORDER BY sends all results through a single reduce task.
c: Queries that would produce a Cartesian product are restricted.
10. Set reasonable numbers of map and reduce tasks.
11. JVM reuse. The number of times a JVM is reused can be set in Hadoop's mapred-site.xml.
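A minimal sketch of item 4, using hypothetical orders and valid_users tables: a LEFT SEMI JOIN expresses an existence check, returning each left-side row at most once and stopping the scan of the right table after the first match (note that columns of the right table may not appear in the SELECT list):
SELECT o.order_id, o.amount
FROM orders o
LEFT SEMI JOIN valid_users u ON (o.user_id = u.user_id);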
