After reading this article, don't say you don't know Hive

Contents

What is Hive in Hadoop?

Hive mainly consists of three core parts

Key Features of Hive

Advantages of Hive

Disadvantages of Hive

Architecture of Hive

Hive mode

Map Reduce mode:

Relationship between Hive and relational databases

What is the difference between Hive and Pig

Difference between Hive and Hbase

Hadoop is a must-have skill for big data developers


Hadoop is one of the most popular software frameworks for storing and processing big data, and Hive is a tool built on top of Hadoop that makes this data easier to query and analyze.

What is Hive in Hadoop?

Hive is a data warehouse system built on Hadoop. It provides a rich SQL dialect for analyzing data stored in the Hadoop Distributed File System (HDFS): it maps structured data files to database tables and offers SQL-style querying over them. Hive converts SQL statements into MapReduce jobs and runs them on the cluster, so you can get the analysis you need simply by writing a query. This SQL dialect is called Hive SQL (HQL), and it lets users who are not familiar with MapReduce query, summarize, and analyze data with ordinary SQL.
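To make the SQL-to-MapReduce idea concrete, here is a minimal Python sketch of how an aggregation such as SELECT dept, COUNT(*) FROM employees GROUP BY dept could be broken into map, shuffle, and reduce phases. The table, its columns, and the sample rows are made up for illustration, and the plans Hive actually generates are considerably more involved.

```python
from collections import defaultdict

# Hypothetical rows of a tab-delimited file that Hive would map to a table
# employees(name STRING, dept STRING).
RAW_LINES = [
    "alice\tsales",
    "bob\tengineering",
    "carol\tsales",
    "dave\tengineering",
    "erin\tsales",
]


def map_phase(lines):
    """Emit one (dept, 1) pair per row, as a mapper for COUNT(*) GROUP BY dept might."""
    for line in lines:
        _name, dept = line.split("\t")
        yield dept, 1


def shuffle_phase(pairs):
    """Group values by key; the framework does this between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key to get the per-group count."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_dept(lines):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))
```

Calling count_by_dept(RAW_LINES) gives the per-department row counts, which is the same answer the GROUP BY query would return.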

Hive mainly consists of three core parts

1. Hive clients: Hive provides a variety of drivers for different kinds of applications. For example, it provides a Thrift client for Thrift-based applications, as well as JDBC and ODBC drivers.

2. Hive services: Hive services handle all client interaction with Hive. For example, if a client wants to run a query, it must talk to a Hive service.

3. Hive storage and computing: Data storage relies on HDFS, and data computation relies on MapReduce.

Key Features of Hive

● Supports creating indexes to speed up queries (note that index support was removed in Hive 3.0).

● Supports different storage types, e.g. plain text files and data stored in HBase.

● Keeping metadata in a relational database greatly reduces the time spent performing semantic checks during queries.

● Data stored in the Hadoop file system can be used directly.

● Provides a large number of built-in functions for working with dates, strings, and other data-mining operations, and lets users write their own user-defined functions (UDFs) for operations the built-in functions cannot handle.

● SQL-like query mode, which converts SQL queries into MapReduce jobs and executes them on the Hadoop cluster.
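As one illustration of extending Hive with custom logic, Hive can stream rows through an external script using the TRANSFORM clause: rows arrive on standard input as tab-separated text, and the script writes tab-separated rows back. The sketch below shows the row-level logic such a script might contain. The script name normalize_email.py, the users table, and its (user_id, email) columns are assumptions made for this example.

```python
def transform_line(line):
    """Lower-case the email column of one tab-separated (user_id, email) row."""
    user_id, email = line.rstrip("\n").split("\t")
    # Hive passes NULL values to streaming scripts as the literal string \N,
    # so leave that marker untouched.
    if email != "\\N":
        email = email.strip().lower()
    return f"{user_id}\t{email}"


def transform_lines(lines):
    """Apply transform_line to every non-empty input line."""
    for line in lines:
        if line.strip():
            yield transform_line(line)

# In a real streaming script you would finish with something like:
#     import sys
#     for out in transform_lines(sys.stdin):
#         print(out)
```

With the script registered through ADD FILE normalize_email.py, a query along the lines of SELECT TRANSFORM(user_id, email) USING 'python normalize_email.py' AS (user_id, email) FROM users would pass each row through it.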

Advantages of Hive

The interface uses SQL-like syntax, which makes development fast, simple, and easy to pick up.

It spares developers from writing MapReduce code by hand, which lowers the learning cost.

Hive's execution latency is relatively high, so it is typically used for data analysis workloads that do not require real-time responses.

Hive's strength is processing large datasets. It has no advantage on small datasets, because its execution latency is relatively high.

Hive supports user-defined functions, and users can implement their own functions according to their own needs.

Disadvantages of Hive

1. HQL's expressive power is limited

(1) Iterative algorithms cannot be expressed

(2) It is not well suited to data mining

2. Hive's efficiency is relatively low

(1) The MapReduce jobs that Hive generates automatically are often not well optimized

(2) Hive is difficult to tune, and tuning is only possible at a coarse granularity

Architecture of Hive

Hive mode

Depending on the number of Hadoop data nodes and the size of the data, Hive can run in two different modes: local mode and MapReduce mode.

Local mode:

Hadoop is installed in pseudo-distributed mode with only one data node

The amount of data is small and limited to a single local machine

Processing is faster because the local machine only holds small datasets

Map Reduce mode:

Hadoop has multiple data nodes and data is distributed across these different nodes

Users need to process much larger datasets

MapReduce is the default mode of Hive.
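Hive can also pick local mode automatically for small jobs when the hive.exec.mode.local.auto property is enabled. The Python sketch below mimics that decision in simplified form. The thresholds are modeled on the commonly documented defaults of hive.exec.mode.local.auto.inputbytes.max (128 MB) and hive.exec.mode.local.auto.input.files.max (4), which are assumptions here and may differ between Hive versions. The function name and signature are hypothetical.

```python
# Assumed defaults; check the configuration of your own Hive version.
MAX_LOCAL_INPUT_BYTES = 128 * 1024 * 1024
MAX_LOCAL_INPUT_FILES = 4


def choose_mode(total_input_bytes, num_input_files, num_reducers, auto_local=True):
    """Return "local" or "mapreduce" for a job, mimicking automatic local mode.

    A job is treated as small enough for local mode only if its input size,
    its number of input files, and its number of reducers are all small.
    """
    small_enough = (
        total_input_bytes < MAX_LOCAL_INPUT_BYTES
        and num_input_files <= MAX_LOCAL_INPUT_FILES
        and num_reducers <= 1
    )
    return "local" if auto_local and small_enough else "mapreduce"
```

For example, a 10 MB job over two files with one reducer would run locally under these assumptions, while a multi-gigabyte job would be sent to the cluster.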

Relationship between Hive and relational databases

What is the difference between Hive and Pig

Different users

  • Data analysts favor Apache Hive
  • Programmers and researchers prefer Apache Pig

Different languages

  • Hive uses HQL, a declarative language that is a variant of SQL
  • Pig uses its own procedural data-flow language, called Pig Latin

Different data processing

  • Hive works with structured data
  • Pig works with structured and semi-structured data

Cluster operation

  • Hive runs on the server side of the cluster
  • Pig runs on the client side of the cluster

Partitioning

  • Hive supports partitioning
  • Pig does not support partitioning
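Hive stores each partition of a table in its own directory, named with column=value segments, so a query that filters on partition columns only needs to read the matching directories. The Python sketch below illustrates that pruning idea. The warehouse paths and the dt and country partition columns are hypothetical.

```python
# Hypothetical HDFS layout of a table partitioned by dt and country.
PARTITION_DIRS = [
    "/warehouse/logs/dt=2024-01-01/country=US",
    "/warehouse/logs/dt=2024-01-01/country=DE",
    "/warehouse/logs/dt=2024-01-02/country=US",
    "/warehouse/logs/dt=2024-01-02/country=DE",
]


def parse_partition(path):
    """Extract {column: value} from the key=value segments of a partition path."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)


def prune(paths, **wanted):
    """Keep only the directories whose partition values match every filter."""
    return [
        p for p in paths
        if all(parse_partition(p).get(col) == val for col, val in wanted.items())
    ]
```

Here prune(PARTITION_DIRS, dt="2024-01-02") keeps only the two directories for that date, which is roughly what a WHERE dt = '2024-01-02' filter lets Hive do before it reads any data.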

Loading speed

  • Hive loads data more slowly, but executes queries faster
  • Pig loads faster

Difference between Hive and Hbase

  • HBase is an open source, column-oriented database management system that runs on top of the Hadoop Distributed File System (HDFS)
  • Hive is a query engine, while HBase is a data storage system for unstructured data
  • Hive is mainly used for batch processing; HBase is widely used for transaction processing
  • HBase supports real-time processing and real-time queries; Hive is used only for analytical queries
  • Hive runs on top of Hadoop, and HBase runs on top of HDFS
  • Hive is not a database, whereas HBase is a NoSQL database
  • Hive has a schema model; HBase does not
  • Finally, Hive is suited to high-latency operations, while HBase is primarily used for low-latency operations
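The batch versus low-latency contrast can be pictured with a small Python sketch: a Hive-style query reads every row and aggregates, while an HBase-style lookup fetches one row directly by its row key. This is only an analogy that uses in-memory data. The orders, field names, and function names are made up, and neither system works this simply.

```python
from collections import defaultdict

# Hypothetical order records used for both access patterns.
ORDERS = [
    {"order_id": "o1", "customer": "alice", "amount": 30},
    {"order_id": "o2", "customer": "bob", "amount": 15},
    {"order_id": "o3", "customer": "alice", "amount": 5},
]


def total_by_customer(rows):
    """Hive-style batch query: read every row, then aggregate."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


def build_row_key_store(rows):
    """HBase-style layout: rows addressed directly by a row key."""
    return {row["order_id"]: row for row in rows}


def get(store, row_key):
    """HBase-style point lookup: fetch one row by key without scanning."""
    return store.get(row_key)
```

The aggregate touches all rows regardless of table size, which is acceptable for analysis, while the keyed lookup touches a single row, which is what low-latency serving needs.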

Hadoop is a must-have skill for big data developers

If you want to work in data development, Hadoop is a basic skill that you need to learn, and a certification is a good way to show employers what you can do.

The Simplilearn Big Data Hadoop certification training covers data processing, functional programming, Apache Spark, parallel processing, Spark RDD optimization techniques, Spark SQL, and other big data knowledge and skills. The course content is aligned with the Cloudera CCA175 certification and lays a solid foundation for your career.


Origin blog.csdn.net/simplilearnCN/article/details/124256402