[Data Development] Big data platform architecture, introduction to Hive/THive

1. Big data engine

A big data engine is a software system used to process large-scale data. Commonly used big data engines include Hadoop, Spark, Hive, Pig, Flink, and Storm.
Among them, Hive is a data warehouse tool built on Hadoop: it maps structured data onto Hadoop's distributed file system and provides SQL-like query functionality.
Compared with traditional databases, Hive's advantage is that it can handle massive data volumes and run on inexpensive hardware. At the same time, Hive's query language is similar to SQL, making it easy to learn and use.
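
As a minimal, hedged HiveQL sketch of what "SQL-like query functionality" looks like (the table name, columns, and storage settings below are hypothetical, not from the original article):

    -- Define a table over delimited files in HDFS, then query it with familiar SQL.
    CREATE TABLE IF NOT EXISTS page_views (
      user_id   BIGINT,
      url       STRING,
      view_time TIMESTAMP
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- Hive compiles this aggregate into distributed jobs behind the scenes.
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10;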

Compared with traditional databases, big data engines differ in the following ways:
1. Data volume: Traditional databases usually process small-scale data, while big data engines can handle massive amounts of data.
2. Processing method: Traditional databases use transaction processing, while big data engines use batch processing or stream processing.
3. Hardware requirements: Traditional databases require high-performance hardware support, while big data engines can run on cheap hardware.
4. Data type: Traditional databases usually handle structured data, while big data engines can handle structured, semi-structured and unstructured data.
In short, a big data engine is a software system designed to process massive amounts of data. Compared with traditional databases, it has higher data processing capabilities and more flexible data processing methods.

Comparison of data processing methods

  • Batch processing: Batch processing treats a batch of data as a whole and usually runs offline. It can handle large volumes of data with high throughput but high latency, making it suitable for scenarios that require analysis over the full data set, such as data warehouses and offline computing.
  • Stream processing: Stream processing is a real-time data processing method that takes data streams as input, processes them in real time, and outputs results continuously. It offers low latency and is suitable for scenarios that require real-time computation, such as real-time monitoring and real-time recommendations.

Data type comparison:

  • Semi-structured data: Semi-structured data sits between structured and unstructured data. It has some structure, but the structure is not as strictly defined as in structured data. Semi-structured data is usually stored in formats such as XML, JSON, or YAML; examples include web pages and logs.
  • Unstructured data: Unstructured data has no fixed structure, for example text, images, audio, and video. It is usually difficult to process with traditional relational databases and requires big data technology for processing and analysis.
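
To make the semi-structured case concrete, here is a hedged HiveQL sketch (the table and JSON layout are hypothetical) using Hive's built-in get_json_object() function to pull fields out of raw JSON strings:

    -- Each row holds one raw JSON document as a string.
    CREATE TABLE IF NOT EXISTS raw_events (payload STRING);

    -- Extract individual fields with JSONPath-style expressions.
    SELECT get_json_object(payload, '$.user.id')   AS user_id,
           get_json_object(payload, '$.eventType') AS event_type
    FROM raw_events;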

Comparison of Hadoop, Hive and Spark
Hadoop, Hive, and Spark are all open-source frameworks for big data processing, but they have different characteristics and uses.

  • Hadoop is a distributed computing framework mainly used to store and process large-scale data sets. It includes two main components, HDFS (Hadoop Distributed File System) and MapReduce, which provide distributed storage and computation with high reliability and fault tolerance.
  • Hive is a data warehouse tool built on Hadoop. It provides SQL-like query functionality and maps structured data onto Hadoop's distributed file system. Hive implements query and analysis by converting SQL statements into MapReduce tasks, which makes data processing and analysis convenient.
  • Spark is a fast, versatile, and scalable big data processing engine that supports both batch and stream processing and provides high-level APIs such as Spark SQL, Spark Streaming, MLlib, and GraphX. Spark improves computing performance through in-memory computing and RDDs (resilient distributed datasets), and can handle larger-scale data and more complex computing tasks.
  • In general, Hadoop provides the distributed storage and computing infrastructure, Hive provides SQL-like query functionality, and Spark provides more advanced data processing and analysis capabilities.
  • They can be used together, for example using Hadoop as the underlying storage and computing infrastructure, Hive for data query and analysis, and Spark for more advanced data processing and analysis.
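
As a hedged illustration of how they combine: when Spark is built with Hive support and shares the Hive metastore, the same SQL can often be submitted unchanged to Hive or to Spark SQL (the table below is the hypothetical one from the earlier sketch):

    -- Run in the Hive CLI/Beeline (compiled to MapReduce or Tez jobs) or
    -- in spark-sql (executed by Spark) against the same metastore table.
    SELECT url, COUNT(DISTINCT user_id) AS unique_visitors
    FROM page_views
    GROUP BY url;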


2. What is Hive / THive

What is Hive?

  • Hive is a data warehouse tool based on Hadoop.
  • It provides a SQL-like query language called HiveQL for querying and analyzing large-scale data sets.
  • Hive maps structured data onto Hadoop's distributed file system and distributed processing engine, allowing users to query data with a SQL-like language that Hive converts into execution jobs such as MapReduce tasks.
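
As a hedged illustration of this translation (using the hypothetical page_views table from earlier), Hive's EXPLAIN statement prints the execution plan a query is compiled into, including its MapReduce (or Tez/Spark) stages:

    -- Show the stages Hive would run for this query, without executing it.
    EXPLAIN
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url;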

What is THive?

  • THive is an open source Hive JDBC driver that allows users to connect to Hive from any JDBC-enabled tool (such as Tableau, Excel, etc.).
  • THive is therefore not a data warehouse tool, but a JDBC driver for Hive.

In short, Hive and THive are two different things: Hive is a data warehouse tool, while THive is a JDBC driver for Hive.

Hive engine classification

  • Speed ranking: THive on MapReduce < THive on Spark < Presto
  • Hive itself can run on different execution engines, such as MapReduce, Tez, and Spark. MapReduce is Hadoop's default engine, while Tez is a faster engine that applies higher-level optimization techniques.
  • THive on MapReduce is a variant of THive that uses MapReduce as its computing engine. MapReduce is Hadoop's default computing engine; it can process large-scale data sets, but it is slow.
  • THive on Spark is a variant of THive that uses Spark as its computing engine. Spark is a fast distributed computing engine that performs computations in memory and is therefore faster than MapReduce. THive on Spark can provide faster query speeds and better performance.
  • Presto is a distributed SQL query engine that can query multiple data sources, including Hive, MySQL, PostgreSQL, etc. Presto is very fast and can handle PB-scale data. Unlike Hive, Presto does not convert queries into MapReduce tasks, which gives it faster query speeds and better performance.
  • Therefore, Hive, THive on Spark, THive on MapReduce, and Presto are all tools for querying and analyzing large-scale data sets, but they use different computing engines, so their performance and query speed also differ.
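
As a hedged sketch of engine switching in stock Hive (a THive deployment may expose this differently), the hive.execution.engine property selects mr, tez, or spark per session:

    -- Choose the execution engine for this session, then run a query.
    SET hive.execution.engine=spark;

    SELECT url, COUNT(*) AS views
    FROM page_views   -- hypothetical table from the earlier sketch
    GROUP BY url;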


3. Data storage: MySQL => HDFS => data warehouse

MySQL => HDFS => data warehouse

  • The data warehouse has stronger data processing capabilities, but places stricter requirements on data format and the like.
  • MySQL is lightweight and holds smaller data volumes, but supports rich formats and definable functions.
  • MySQL and the data warehouse store structured data, while HDFS stores unstructured data.

HDFS (Hadoop Distributed File System) and MySQL are two different types of data storage systems. They have the following differences:

  1. Data type: HDFS is suitable for storing large-scale unstructured data, such as logs, images, audio, videos, etc., while MySQL is suitable for storing structured data, such as table data.
  2. Storage method: HDFS is a distributed file system in which data is divided into multiple blocks and stored on different servers, while MySQL is a relational database system in which data is stored in tables.
  3. Storage capacity: HDFS can store massive amounts of data and can expand storage capacity by adding new servers, while MySQL has a relatively small storage capacity and requires more advanced hardware support to expand storage capacity.
  4. Data processing method: HDFS uses batch processing for data processing, which is suitable for offline data processing and analysis, while MySQL supports real-time query and update, and is suitable for online data processing and interactive query.
  5. Data security: HDFS provides data redundancy and backup mechanisms to ensure high reliability and fault tolerance of data, while MySQL requires backup and replication to ensure data security.

In short, HDFS and MySQL are two different types of data storage systems suitable for different data storage and processing scenarios. HDFS is suitable for storing large-scale unstructured data, such as logs, images, audio, videos, etc., while MySQL is suitable for storing structured data, such as table data.

A data warehouse is a system used to store and manage enterprise data. It can integrate data from different sources into a unified data model for data analysis and decision support. Compared with HDFS and MySQL, a data warehouse has the following differences:

  1. Data type: Data warehouses usually store structured data, such as tabular data, while HDFS is suitable for storing large-scale unstructured data, such as logs, images, audio, videos, etc., and MySQL can store structured data and semi-structured data.

  2. Data integration: Data warehouse can integrate data from different sources into a unified data model for data analysis and decision support, while HDFS and MySQL can usually only store and process data from a single source.

  3. Data processing methods: Data warehouses usually use OLAP (Online Analytical Processing), supporting complex multi-dimensional analysis and data mining (see the sketch after this list), while MySQL usually uses OLTP (Online Transaction Processing), supporting real-time queries and updates, and HDFS is oriented toward batch processing.

  4. Storage capacity: HDFS can store massive amounts of data and can expand capacity by adding new servers. MySQL's storage capacity is relatively small and requires more advanced hardware to expand, and data warehouses likewise require high-performance hardware to store and process large-scale data.

In short, data warehouse, HDFS and MySQL are all different types of data storage and processing systems, suitable for different data storage and processing scenarios. Data warehouse is suitable for storing and processing structured data and supports complex multi-dimensional analysis and data mining. HDFS is suitable for storing large-scale unstructured data. MySQL is suitable for storing structured data and semi-structured data.
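
As a hedged sketch of the OLAP-style multi-dimensional analysis mentioned above (the sales table and its columns are hypothetical), HiveQL's GROUPING SETS computes several GROUP BY combinations in a single pass:

    -- Totals by (region, product), by region alone, and a grand total.
    SELECT region, product, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region, product
    GROUPING SETS ((region, product), region, ());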

Exporting data from MySQL to HDFS and then importing it from HDFS into the data warehouse mainly involves the following steps:

  1. Data extraction: Extract data from MySQL into HDFS, usually with Sqoop. Sqoop implements extraction through MapReduce jobs: it first splits the data into multiple slices, then runs a map task over each slice to convert the rows into Hadoop's input format and write them to HDFS.

  2. Data conversion: Convert and clean the extracted data so that it conforms to the data warehouse's data model and data quality requirements. ETL (Extract-Transform-Load) tools such as Apache NiFi or Talend are usually used for this step. ETL tools can perform format conversion, data cleaning, data merging, and other operations to convert the data into the format required by the data warehouse.

  3. Data loading: Load the converted data into the data warehouse, usually with the warehouse's ETL tool, such as ODI (Oracle Data Integrator) or Informatica. ETL tools load the converted data into the data warehouse and perform data verification and quality control to ensure accuracy and completeness.

  4. Data modeling: Model the data in the warehouse for data analysis and decision support. Data modeling is usually done with ER modeling tools such as ERwin or PowerDesigner, which model entities, attributes, relationships, and so on according to the warehouse's needs.

  5. Data analysis: Carry out data analysis and decision support in the data warehouse, usually with BI (Business Intelligence) tools such as Tableau or QlikView. BI tools extract data from the warehouse and perform analysis and visualization for decision support and business analysis.

In short, exporting data from MySQL to HDFS and then importing it into the data warehouse requires multiple steps of extraction, conversion, loading, modeling, and analysis, and involves applying a variety of technologies and tools to process and analyze data efficiently, accurately, and reliably.
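
To make steps 1-3 concrete on the Hive side, here is a hedged HiveQL sketch (all table names, columns, and the HDFS path are hypothetical, and step 1 is assumed to have been done by Sqoop, leaving delimited files in HDFS):

    -- Expose the raw extracted files to Hive without moving them.
    CREATE EXTERNAL TABLE IF NOT EXISTS staging_orders (
      order_id    BIGINT,
      customer_id BIGINT,
      amount      DOUBLE,
      order_date  STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/staging/orders';

    -- A cleaned, columnar warehouse table, partitioned by date.
    CREATE TABLE IF NOT EXISTS dw_orders (
      order_id    BIGINT,
      customer_id BIGINT,
      amount      DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS ORC;

    -- Allow partition values to be derived from the data itself.
    SET hive.exec.dynamic.partition.mode=nonstrict;

    -- Convert, clean, and load (steps 2-3) in one statement.
    INSERT OVERWRITE TABLE dw_orders PARTITION (order_date)
    SELECT order_id, customer_id, amount, order_date
    FROM staging_orders
    WHERE order_id IS NOT NULL;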
