Touge Big Data Assignment 6: Hive

Extracurricular homework six: Hive


1. Alibaba Cloud Yunqi Lab: complete the "EMR-based offline data analysis" experiment on the Alibaba Cloud official experiment platform (Yunqi Lab, Alibaba Cloud Developer Community), or install Hive on your own virtual machine. See the installation steps below for details.

Experimental requirements:

Complete textbook section 9.6, Hive basic operations. Rename the Hive data table emrusers to your name followed by the last four digits of your student ID. Take a screenshot showing how many rows the table contains, including the Hive command used. Then complete the three Huawei Cloud experiments listed below, including the "Hive Data Statistics" experiment; in its final multi-table statistics step, append your name to the names of the two tables, and take screenshots of the HQL statements and their execution results.
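For reference, a minimal sketch of the row-count query (substitute your renamed table for emrusers):

    SELECT COUNT(*) FROM emrusers;  -- returns the number of rows in the table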

"Hive Creates a Data Warehouse" KooLabs Cloud Experiment_Online Experiment_Cloud Practice_Cloud Computing Experiment_AI Experiment_Huawei Cloud Official Experiment Platform-Huawei Cloud " Hive Data Query" KooLabs Cloud Experiment_Online Experiment_Cloud Practice_Cloud Computing Experiment_AI Experiment_Huawei Cloud Official Experiment Platform-Huawei Cloud "Hive Data Statistics" KooLabs Cloud Experiment_Online Experiment_Cloud Practice_Cloud Computing Experiment_AI Experiment_Huawei Cloud Official Experiment Platform-Huawei Cloud

3. Briefly answer the "Classroom Assessment" questions

1. Is HiveQL a SQL language? Answer: HiveQL is a SQL-like query language: its syntax resembles SQL but differs in places. HiveQL supports familiar SQL constructs such as SELECT, FROM, WHERE, GROUP BY, HAVING, and ORDER BY, and it also adds keywords and syntax that standard SQL does not have, such as CLUSTER BY, DISTRIBUTE BY, SORT BY, and LATERAL VIEW.
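For illustration, a small sketch of two Hive-specific constructs (the table and columns are hypothetical, and tags is assumed to be an ARRAY<STRING> column):

    -- DISTRIBUTE BY controls which reducer each row is sent to; SORT BY orders rows within each reducer
    SELECT col1, col2 FROM mytable DISTRIBUTE BY col1 SORT BY col2;
    -- LATERAL VIEW with explode() flattens an array column into one row per element
    SELECT col1, tag FROM mytable LATERAL VIEW explode(tags) t AS tag;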

2. If no database is specified, which database does Hive create tables in? If you do not specify a database when creating a table, Hive creates it in the default database, which is named default. You can switch the current database with the USE statement, e.g. USE my_database; switches the current database to my_database. You can also write database.table_name in the CREATE TABLE statement to explicitly specify the database the table belongs to; for example, CREATE TABLE my_database.my_table (col1 INT, col2 STRING); creates the table my_table in the my_database database.
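Putting these together, a minimal sketch (database and table names are made up for illustration):

    CREATE DATABASE IF NOT EXISTS my_database;
    USE my_database;                                -- unqualified tables now go into my_database
    CREATE TABLE my_table (col1 INT, col2 STRING);  -- created in my_database
    CREATE TABLE default.other_table (col1 INT);    -- qualified name: created in default instead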

3. How do you create a table in Hive? How do you create a partitioned table? The syntax for creating a table in Hive is:

    CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] table_name
      [(col_name data_type [COMMENT col_comment], ...)]
      [COMMENT table_comment]
      [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
      [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
      [ROW FORMAT row_format]
      [STORED AS file_format]
      [TBLPROPERTIES (property_name=property_value, ...)]

To create a partitioned table, add the PARTITIONED BY clause to the CREATE TABLE statement to specify the partition keys, for example:

    CREATE TABLE my_table (col1 INT, col2 STRING)
    PARTITIONED BY (col3 STRING, col4 INT);
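Once the partitioned table exists, writing into a specific partition can be sketched as follows (the partition values are hypothetical; INSERT ... VALUES requires Hive 0.14 or later):

    -- add rows to one partition; partition columns are set in the PARTITION spec, not the VALUES list
    INSERT INTO TABLE my_table PARTITION (col3='2023', col4=6) VALUES (1, 'a');
    -- or register an empty partition whose data will be loaded separately
    ALTER TABLE my_table ADD PARTITION (col3='2023', col4=7);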

4. How do you load data in Hive? What separates the columns? What separates the rows? You can load data from HDFS into a Hive table with:

    LOAD DATA INPATH 'hdfs_path' INTO TABLE table_name;

where hdfs_path is the HDFS path where the data is stored and table_name is the name of the target table. Before loading, make sure the data format matches the table structure. Hive supports multiple data formats, including delimited text, CSV, and various serialization formats. By default, Hive's delimited text format separates columns with the Ctrl-A control character ('\001') and rows with newlines. To use other delimiters, specify ROW FORMAT DELIMITED with FIELDS TERMINATED BY when creating the table, for example:

    CREATE TABLE mytable (column1 STRING, column2 INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

This creates a table named mytable whose columns are separated by commas; data loaded into it is parsed according to the specified delimiter.
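For completeness, a sketch of the two LOAD variants (both paths are hypothetical):

    -- LOCAL reads from the local filesystem and copies the file into the warehouse directory
    LOAD DATA LOCAL INPATH '/tmp/mydata.csv' INTO TABLE mytable;
    -- without LOCAL the path is in HDFS and the file is moved, not copied;
    -- OVERWRITE replaces any existing data in the table
    LOAD DATA INPATH '/user/hadoop/mydata.csv' OVERWRITE INTO TABLE mytable;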

5. How do you query Hive data? What can the query conditions be? To query data in Hive, use the SELECT statement, for example:

    SELECT * FROM mytable WHERE column1 = 'value';

where mytable is the table to query, column1 is the column being filtered, and 'value' is the query condition. You can specify multiple conditions as needed and combine them with the logical operators AND and OR. You can also save the query results to another table:

    INSERT OVERWRITE TABLE myoutput SELECT * FROM mytable WHERE column1 = 'value';

This queries all records in mytable where column1 equals 'value' and saves the results to the myoutput table.

6. How do you compute statistics in Hive? What do the statistical keywords in HiveQL statements mean? Hive statistics can use the following keywords:

  • COUNT: returns the number of rows, or of values in a specified column.
  • SUM: returns the sum of a numeric column.
  • AVG: returns the average of a numeric column.
  • MAX: returns the maximum value of a column.
  • MIN: returns the minimum value of a column.

These keywords can be used in HiveQL statements to perform statistical analysis on the data.
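A sketch that combines these aggregates in one grouped query (table and column names are hypothetical):

    SELECT column1,
           COUNT(*)     AS row_cnt,
           SUM(column2) AS total,
           AVG(column2) AS mean,
           MAX(column2) AS max_val,
           MIN(column2) AS min_val
    FROM mytable
    GROUP BY column1;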

4. Exercise 9.8

  1. Describe the interrelationship between Hive and the other components of the Hadoop ecosystem. Hive is tightly integrated with the other components of the Hadoop ecosystem (HDFS, YARN, MapReduce, etc.): it stores its data in HDFS, relies on YARN to manage resources and jobs, and by default uses MapReduce to execute queries.
  2. Please briefly describe the difference between Hive and traditional databases. Hive is a data warehouse built on the Hadoop ecosystem. Compared with a traditional relational database, Hive is better suited to big data: it favors high-latency loading and large-scale batch processing, and it does not support transaction processing.
  3. Please briefly describe Hive's access methods. Hive provides several access methods, including the CLI (command-line interface), JDBC/ODBC drivers, and web interfaces, which can be chosen according to specific needs.
  4. Please give a brief introduction to several important components of Hive.
  • HCatalog: a table and storage management layer that exposes Hive's metadata so other Hadoop components can read and write Hive-managed data.
  • Metastore: stores metadata information of Hive tables.
  • Hive service (HiveServer2): serves as the interface for clients to communicate with Hive.
  • Hive driver: responsible for converting HiveQL into MapReduce tasks.
  5. Please briefly describe the execution process of a query submitted to Hive (see the EXPLAIN sketch after these steps).
  • The client submits a HiveQL query, which is handed to the Hive service.
  • The Hive service passes the statement to the Hive driver, which invokes the parser to analyze it and build an execution plan.
  • The execution plan is decomposed into a series of MapReduce tasks and submitted to YARN for execution.
  • After the MapReduce task is executed, the results are returned to the Hive service and finally to the command line interface or client application.
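To inspect the plan Hive builds before it is handed to MapReduce, you can prefix a query with EXPLAIN (the table name here is hypothetical):

    -- prints the stage graph and operators of the execution plan without running the query
    EXPLAIN SELECT column1, COUNT(*) FROM mytable GROUP BY column1;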
  6. Please briefly describe the Hive HA principle. Hive HA runs multiple Hive instances and coordinates them through the ZooKeeper high-availability framework; when the primary node goes down, a standby node immediately takes over its role, achieving rapid automatic failover.
  7. Please briefly describe the main role of the Impalad process. The Impalad process handles query requests received from clients, including parsing, optimizing, and executing queries. It also interacts with the Metastore so that Impala can obtain metadata information.
  8. Please compare the similarities and differences between Hive and Impala.
  • Differences: Hive executes queries as MapReduce jobs, while Impala executes queries directly against the stored files, achieving much lower latency; Hive supports only offline batch processing, while Impala supports real-time queries.
  • Similarities: Hive and Impala are both data warehouses for the Hadoop ecosystem, and both use the SQL-like language HiveQL.
  9. Please briefly describe the role of the State Store. The State Store (the statestored process) tracks the health, location, and resource status of the Impalad processes in an Impala cluster and broadcasts this state to all nodes, so that queries are not scheduled onto failed Impalads.
  10. Please briefly describe the specific process by which Impala executes a query.
  • The client connects to an Impala Daemon and sends a query request.
  • The Impala Daemon parses the request and obtains the metadata needed to plan the query.
  • The Impala Daemon builds an execution plan for the query, divides it into fragments, and assigns each fragment to a node in the Impala cluster for execution.
  • Each node executes its fragments and returns partial results; the coordinating Impala Daemon combines them into the final result set and returns it to the client.
  11. Please list the three collection data types supported by Hive columns (a short usage sketch follows the list).
  • ARRAY: an ordered collection of elements that all share one data type.
  • MAP: a collection of key-value pairs; the keys are primitive types and all values share one data type.
  • STRUCT: a record-like group of named fields, which may have different data types.
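A minimal sketch of declaring and reading the three collection types (all names are made up for illustration):

    CREATE TABLE complex_demo (
      tags    ARRAY<STRING>,
      props   MAP<STRING, INT>,
      address STRUCT<city:STRING, zip:STRING>
    );
    -- arrays are indexed from 0, maps by key, struct fields with dot notation
    SELECT tags[0], props['height'], address.city FROM complex_demo;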
  12. Please give some examples of common Hive operations and their basic syntax.
  • Create a table: CREATE TABLE table_name (column1 datatype, column2 datatype, ...);
  • Query data: SELECT [ALL | DISTINCT] select_expr, select_expr, ... FROM table [WHERE where_condition] [GROUP BY col_list] [HAVING having_condition] [ORDER BY col_list [ASC | DESC]] [LIMIT number];
  • Load data: LOAD DATA [LOCAL] INPATH 'input_path' [OVERWRITE] INTO TABLE table_name;
  • Compute statistics: use the COUNT, SUM, AVG, MAX, and MIN keywords for statistical analysis.
  • Create a partitioned table: CREATE TABLE table_name (column1 datatype, column2 datatype, ...) PARTITIONED BY (partition_column1 datatype, partition_column2 datatype, ...);
  • Specify delimiters: ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';


Origin blog.csdn.net/qq_50530107/article/details/131261069