[Hive] (1) Introduction

Study notes, carried over from the Shiyanlou ("Laboratory Building") courses

Table of Contents

1, Experiment introduction

2, What is Hive

3, Hive architecture

4, Differences between Hive and relational databases

5, Hive application scenarios

6, Hive data storage and metadata storage


1, Experiment introduction

⭐ Theory covered

  • Hive definition
  • Hive architecture
  • Differences between Hive and relational databases
  • Hive application scenarios
  • Hive storage

⭐ Hands-on topics

  • HiveQL
  • Data ETL
  • Metadata store

⭐ Experiment environment

  • Hive 1.2.1
  • Hadoop 2.7.3
  • Xfce Terminal

 

2, What is Hive

Hive is a data warehouse architecture built on top of the Hadoop file system. It provides many features for managing a data warehouse: data ETL (extract, transform, load) tools, data storage management, and the ability to query and analyze large data sets. Hive also defines a SQL-like language, HiveQL, which lets users perform operations similar to SQL: it maps structured data files onto database tables and provides simple SQL query capabilities. Hive also lets developers plug in their own Mapper and Reducer operations, and it converts SQL statements into MapReduce tasks to run, which makes it a strong complement to the MapReduce framework.
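To make the SQL-to-MapReduce idea concrete, here is a minimal Python sketch of how a query such as "SELECT page, COUNT(*) FROM logs GROUP BY page" could be expressed as a map step and a reduce step. It is a single-process illustration only, not what Hive's compiler actually generates; the table "logs", the column "page", and the function names are all hypothetical.

```python
from collections import defaultdict

def map_phase(rows):
    """Mapper: emit a (key, 1) pair for every input row."""
    for row in rows:
        yield row["page"], 1

def shuffle(pairs):
    """Group mapper output by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the values of each key, which gives COUNT(*) per group."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical rows of a "logs" table.
logs = [{"page": "/home"}, {"page": "/about"}, {"page": "/home"}]

counts = reduce_phase(shuffle(map_phase(logs)))
print(counts)  # {'/home': 2, '/about': 1}
```

On a real cluster the map and reduce steps run on many machines in parallel, and the grouping step is carried out by the MapReduce framework itself.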

 

3, Hive architecture

Hive is an important part of the Hadoop ecosystem; it began as a Hadoop subproject and is now a top-level Apache project. The figure below gives a rough picture of where Hive sits relative to Hadoop.

[Figure: the Hadoop ecosystem]

The figure above shows the layers of the Hadoop ecosystem. The architecture of Hive itself is as follows:

[Figure: Hive architecture]

From the figure we can see that Hive consists mainly of the following parts:

  • User interfaces, including the CLI, JDBC/ODBC, and the WebUI
  • Metadata store, usually kept in a relational database such as MySQL or Derby
  • Interpreter, compiler, optimizer, and executor
  • Hadoop, with HDFS used for storage and MapReduce used for computation

 

4, Differences between Hive and relational databases

Hive resembles a traditional relational database in many ways (for example, it supports a SQL interface), but its reliance on HDFS and MapReduce underneath means that its architecture differs from that of a traditional relational database. These differences in turn shape the features Hive supports, and therefore how Hive is used.

We can list a few simple differences:

  • Hive and relational databases store data in different file systems: Hive uses Hadoop's HDFS (Hadoop Distributed File System), whereas a relational database uses the server's local file system;
  • Hive uses MapReduce as its computing model, whereas a relational database uses a computing model of its own design;
  • Relational databases are designed for real-time queries in business applications, whereas Hive is designed for data mining over massive data sets and has poor real-time performance; this difference means the application scenarios of Hive and relational databases differ greatly;
  • Hive can easily expand its storage and computing capacity, a property inherited from Hadoop, whereas relational databases are much weaker in this respect.

5, Hive application scenarios

Having compared Hive with traditional relational databases, we can readily work out which scenarios Hive suits.

Hive is built on Hadoop, which is designed for static batch processing. Hadoop usually has high latency and incurs considerable overhead when jobs are submitted and scheduled. Hive therefore cannot deliver low-latency, fast queries over large data sets.

Hive is not suitable for applications that require low latency, such as online transaction processing (OLTP). Hive query execution strictly follows the Hadoop MapReduce job execution model: Hive uses its interpreter to convert the user's HiveQL statements into MapReduce jobs and submits them to the Hadoop cluster, Hadoop monitors the jobs as they run, and the results are then returned to the user. Hive is not designed for online transaction processing, and it does not provide real-time queries or row-level data updates.

Hive is best suited to batch jobs over large data sets, for example web log analysis.

6, Hive data storage and metadata storage

Hive's storage is built on top of the Hadoop file system. Hive has no dedicated data storage format of its own, and by default it does not index the data, so users can organize Hive tables very freely: it is enough to tell Hive the column separator of the data when creating a table, and Hive can then parse the data.
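The idea that a declared column separator is enough to read a plain data file can be sketched in Python as follows. This is a rough illustration of the idea rather than Hive's own parser; the schema, the tab separator, and the sample lines are assumptions made up for the example.

```python
def parse_line(line, schema, sep="\t"):
    """Split one raw text line on the separator and cast each field by the declared schema."""
    fields = line.rstrip("\n").split(sep)
    return {name: cast(value) for (name, cast), value in zip(schema, fields)}

# Hypothetical table definition: column names and types, declared apart from the data file.
schema = [("id", int), ("name", str), ("score", float)]

# Raw lines as they might sit in a plain text file.
raw_lines = ["1\talice\t90.5\n", "2\tbob\t78.0\n"]

rows = [parse_line(line, schema) for line in raw_lines]
print(rows)
```

The data file itself is never rewritten; only the table definition says how its lines should be interpreted at read time.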

Hive has four main data models: Table, External Table, Partition, and Bucket.

Conceptually, a Hive table is not essentially different from a table in a database, and each Hive table has a corresponding storage directory. An external table points to data that already exists in HDFS, and it can also be partitioned. Each Hive partition corresponds to an index on the partition column in a database, but Hive organizes partitions differently from a traditional relational database. With buckets, Hive computes a hash of a specified column and splits the data by the hash value, so that each bucket corresponds to one file.
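A small Python sketch may help picture how partitions and buckets lay data out. It assumes the usual Hive convention of one "column=value" subdirectory per partition under the table directory, and it uses CRC32 purely as a stand-in hash, since Hive's own hash function is different. The table directory, column names, and values are hypothetical.

```python
import zlib

def partition_dir(table_dir, column, value):
    """Each partition is a subdirectory named column=value under the table directory."""
    return f"{table_dir}/{column}={value}"

def bucket_index(value, num_buckets):
    """Hash the bucketing column and take it modulo the bucket count (stand-in hash)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

table_dir = "/user/hive/warehouse/logs"  # hypothetical table directory
num_buckets = 4

# Rows with the same user_id always land in the same bucket file.
for user_id in ["u1", "u2", "u1"]:
    path = partition_dir(table_dir, "dt", "2019-09-21")
    print(path, "bucket", bucket_index(user_id, num_buckets))
```

Because the bucket is a pure function of the column value, equal values are always stored together, which is what makes bucketed data convenient for sampling and joins.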

Because Hive's metadata may be updated, modified, and read constantly, it is clearly not suited to storage in the Hadoop file system. Hive currently stores its metadata in an RDBMS such as MySQL or Derby. This can also be seen in the Hive architecture figure above.
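Why an RDBMS fits the metadata can be illustrated with a toy metadata store. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL or Derby, and the table and column names are hypothetical; they are not the real Hive metastore schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for MySQL/Derby
conn.execute(
    "CREATE TABLE table_meta (name TEXT PRIMARY KEY, location TEXT, col_sep TEXT)"
)

# Registering a table only records where its data lives and how to parse it.
conn.execute(
    "INSERT INTO table_meta VALUES (?, ?, ?)",
    ("logs", "/user/hive/warehouse/logs", "\t"),
)

# Metadata changes are small row-level updates, which an RDBMS handles well.
conn.execute(
    "UPDATE table_meta SET location = ? WHERE name = ?", ("/data/logs", "logs")
)

location = conn.execute(
    "SELECT location FROM table_meta WHERE name = ?", ("logs",)
).fetchone()[0]
print(location)  # /data/logs
```

Small, frequent reads and in-place updates like these are exactly what HDFS handles poorly, since it is built for large files that are written once and read in bulk.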

 

 


Origin blog.csdn.net/YYIverson/article/details/101111426