Explore Apache Hive: a journey through database operations

Copyright Notice

  • The content of this blog is based on my personal study notes from the Dark Horse Programmer course. All copyrights belong to Dark Horse Programmer or the related rights holders. This blog is intended only for personal learning and exchange, not for commercial use.
  • I try my best to ensure accuracy when organizing my study notes, but I cannot guarantee that the content is complete or up to date; it may become outdated over time or require updating.
  • If you are Dark Horse Programmer or a related rights holder and believe any content infringes your copyright, please contact me promptly and I will delete it or make the necessary modifications immediately.
  • Other readers should abide by relevant laws, regulations and ethical principles when reading this blog, treat the content with caution, and bear any resulting risks and responsibilities themselves.

1. Hive database operations

  • Create database
    create database if not exists myhive;
    use myhive;
    
  • View database details
    desc database myhive;
    
  • Databases are essentially folders on HDFS. The default database storage path on HDFS is /user/hive/warehouse
  • Create a database and specify the hdfs storage location
-- Use the LOCATION keyword to specify the database's storage path on HDFS
create database myhive2 location '/myhive2';
  • Delete an empty database. If there is a data table under the database, an error will be reported.
drop database myhive;
  • Forcefully delete the database, including all tables under the database.
drop database myhive2 cascade;

2. Hive data table operations

2.1 Table operation syntax and data types

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
	[(col_name data_type [COMMENT col_comment], ...)]
	[COMMENT table_comment]
	[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
	[CLUSTERED BY (col_name, col_name, ...)
	[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
	[ROW FORMAT row_format]
	[STORED AS file_format]
	[LOCATION hdfs_path]
  • EXTERNAL: Create an external table
  • PARTITIONED BY: partition table
  • CLUSTERED BY: bucket table
  • STORED AS: storage format
  • LOCATION: storage location
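  • As a hedged illustration (the table and columns are made up for this example), a single statement can combine several of these clauses:
create external table if not exists web_logs (
	ip string,
	url string
) comment 'example combining the optional clauses'
partitioned by (dt string)
clustered by (ip) into 3 buckets
row format delimited fields terminated by '\t'
stored as textfile
location '/tmp/web_logs';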


2.2 Hive table classification

  • In Apache Hive, you can create different types of tables, including internal tables (Managed Table), external tables (External Table), partitioned tables (Partitioned Table) and bucketed tables (Bucketed Table).
  1. Internal table (Managed Table):
    • Internal tables, also known as managed tables, are Hive’s default table type.
    • Both data and metadata are managed by Hive and stored in Hive's default file system (usually the Hadoop distributed file system). When an internal table is deleted, Hive also deletes related data and metadata.
    • Internal tables are suitable for situations where the data set is completely managed and controlled by Hive.
  2. External Table:
    • External tables keep their data in an external storage location, such as the Hadoop Distributed File System (HDFS) or a cloud storage service such as Amazon S3, while Hive manages only the metadata.
    • Unlike internal tables, when an external table is deleted, Hive deletes only the metadata, not the actual data. This makes external tables suitable for sharing data with other systems or for bringing existing data into Hive.
  3. Partitioned Table:
    • Partitioned tables divide data into different partitions for storage based on one or more partition keys (such as date, region, etc.), so that data can be managed and queried more efficiently.
    • Partitioned tables allow users to load only data from a specific partition when querying, rather than having to load the entire table.
    • Partitioned tables are suitable for scenarios where data is organized and queried according to certain rules.
  4. Bucketed Table:
    • Bucketed tables further subdivide data, often on top of partitioned tables.
    • A bucketed table divides each partition (or the whole table) into a fixed number of buckets, assigning data to buckets by hashing a specific column.
    • Bucketing can improve query performance, especially for frequent join operations on the bucketed column. Bucketed tables are often used in conjunction with partitioned tables.

  • Summary
    • Internal tables suit data sets completely managed by Hive; external tables suit sharing data with, or importing data from, other systems; partitioned tables suit organizing and querying data by specific rules; and bucketed tables further subdivide data to improve query performance.

2.3 Internal table Vs external table

Aspect                          Internal table (Managed Table)                     External Table
Create syntax                   CREATE TABLE table_name ...                        CREATE EXTERNAL TABLE table_name ... LOCATION ...
Storage location                Managed by Hive in its default warehouse path      Any location, specified via the LOCATION keyword
Metadata and data               Hive manages and controls both                     Hive manages only the metadata, not the actual data
Behavior when dropping table    Deletes both metadata and stored data              Deletes only the table's metadata; data stays in place
Applicable scenarios            Data sets fully managed and controlled by Hive     Sharing data with other systems; importing existing data
Sharability with other tools    Not suited to sharing data with other tools        Can temporarily associate with externally produced data
  1. Internal tables (Managed Tables) are tables managed and controlled by Hive, and data and metadata are stored and managed by Hive. Deleting an internal table removes related data and metadata. Suitable for scenarios where data is completely managed and controlled by Hive.
  2. An external table is a table associated with external data. The data storage location can be anywhere and is specified by the LOCATION keyword. When you delete an external table, only the metadata is deleted, not the actual data. Suitable for sharing data with other systems or introducing existing data.

2.4 Internal table operations

2.4.1 Create internal table

  • Internal table creation syntax
    CREATE TABLE table_name ...
    
  • Demo
    1. Create a basic table
    create database if not exists myhive;
    use myhive;
    create table if not exists stu(id int, name string);
    insert into stu values (1, "zhangsan"), (2, "wangwu");
    select * from stu;
    
    2. View the table's data storage on HDFS
    hadoop fs -ls /user/hive/warehouse/myhive.db/stu
    hadoop fs -cat /user/hive/warehouse/myhive.db/stu/*
    


2.4.2 Other ways to create internal tables

  • Create a table based on query results
CREATE TABLE table_name AS select_statement;
-- Example
create table stu3 as select * from stu2;
  • Create a table based on an existing table structure
CREATE TABLE table_name LIKE existing_table_name;
-- Example
create table stu4 like stu2;
  • Use DESC FORMATTED table_name to view the table type and details
DESC FORMATTED stu2;

2.4.3 Data delimiter


  • The data also exists as plain text files on HDFS. Oddly, the id and name columns appear to have no separator and are squeezed together.
  • The default column delimiter is "\001", a special ASCII control character that cannot be typed on a keyboard.
  • Some text editors display it as SOH.
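  • A quick way to make the invisible delimiter visible (a sketch; the path assumes the stu table created above): pipe the raw HDFS file through cat -A, which renders \001 as ^A.
hadoop fs -cat /user/hive/warehouse/myhive.db/stu/* | cat -A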

2.4.4 Custom separator

create table if not exists stu2(id int, name string) row format delimited fields terminated by '\t';
  • row format delimited fields terminated by '\t' indicates that columns are separated by \t

2.4.5 Delete internal tables

  • Deleting an internal table will delete related data and metadata
drop table table_name;

2.5 External table operations

2.5.1 Create external table

  • Create external table syntax
CREATE EXTERNAL TABLE table_name ... LOCATION ... 
  • External tables are created with the EXTERNAL keyword. Conceptually, they are not owned by Hive and are only used to temporarily associate data.
  • External tables and their data are independent of each other: you can create the table first and then move data into the LOCATION it specifies, or place the data first and then create a table whose LOCATION points at it.

2.5.2 Operation demonstration

  1. Create a new file on Linux, test_external.txt, and fill in the following content, with the data columns separated by \t:
1	hello
2	world
3	hadoop

2.5.3 Demonstrate creating a table first and then moving the data

  1. Demonstration: first create the external table, then move the data into its LOCATION directory
    • First run hadoop fs -ls /tmp to confirm that the directory /tmp/test_ext1 does not exist
    • Create the external table:
    create external table test_ext1(id int, name string) row format delimited fields terminated by '\t' location '/tmp/test_ext1';
    
    • After the table is created, querying it returns no data:
    select * from test_ext1;
    
    • Upload the data file, then query again to see the results:
hadoop fs -put test_external.txt /tmp/test_ext1/
select * from test_ext1;

2.5.4 Demonstration of storing data first and then creating tables

hadoop fs -mkdir /tmp/test_ext2
hadoop fs -put test_external.txt /tmp/test_ext2/
create external table test_ext2(id int,name string) row format delimited fields terminated by '\t' location '/tmp/test_ext2';
select * from test_ext2; 

2.5.5 Delete external table

  • Delete external table statement
DROP TABLE table_name;
  • Note: the DROP TABLE statement deletes only the external table's metadata (table structure, partition information, location, and so on), not the actual data associated with it. If you also want to remove the data, manually delete the files or directories stored at the external location, as sketched below.
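  • For example, reusing the /tmp/test_ext1 directory from the demo above, the leftover data can be removed by hand (a sketch; this is separate from DROP TABLE itself):
hadoop fs -rm -r /tmp/test_ext1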

2.6 Hive internal and external transformation

  • View the table type: desc formatted table_name;

  • Hive can easily convert internal and external tables through SQL statements.

  • Convert internal table to external table

    alter table table_name set tblproperties('EXTERNAL'='TRUE');
    
  • Convert external table to internal table

    alter table table_name set tblproperties('EXTERNAL'='FALSE');
    
  • Please note: ('EXTERNAL'='FALSE') and ('EXTERNAL'='TRUE') are fixed syntax and are case-sensitive!
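  • A minimal round trip to verify the conversion, assuming the stu2 table from earlier exists:

    desc formatted stu2;                                     -- Table Type: MANAGED_TABLE
    alter table stu2 set tblproperties('EXTERNAL'='TRUE');
    desc formatted stu2;                                     -- Table Type: EXTERNAL_TABLE
    alter table stu2 set tblproperties('EXTERNAL'='FALSE');  -- back to an internal table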

2.7 Hive data loading and exporting

2.7.1 Data loading-LOAD syntax

  • Syntax
    LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2, ...)];
  • Note: when loading data from HDFS (i.e. without the LOCAL keyword), the source data file disappears; it is essentially moved into the table's directory.
  • Example
    load data local inpath '/home/hadoop/search_log.txt' into table myhive.test_load;
    load data inpath '/tmp/search_log.txt' overwrite into table myhive.test_load;
    

2.7.2 Data loading-insert select syntax

  • Syntax
INSERT [OVERWRITE | INTO] TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
  • Inserts the results of a SELECT query into another table. The table queried by SELECT can be either an internal or an external table.
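  • A short sketch of both variants (test_load2 is hypothetical, created here only for illustration):
-- INTO appends to the target table
create table test_load2 like test_load;
insert into table test_load2 select * from test_load;
-- OVERWRITE replaces the target table's contents
insert overwrite table test_load2 select * from test_load;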

2.7.3 Data loading - choice of two syntaxes

  • Rule of thumb: load data simply moves files into place without running MapReduce, so prefer it when the data already exists as correctly formatted files; insert select runs a query, so use it when the data must be transformed or comes from another table.

2.7.4 hive table data export-insert overwrite method

  • Syntax:
insert overwrite [local] directory 'path' select_statement1 FROM from_statement;

  • Export the data in a Hive table to any other destination, such as a local Linux disk, HDFS, or MySQL.

  • Export the results of the query locally - using the default column delimiter

insert overwrite local directory '/home/hadoop/export1' select * from test_load;
  • Export query results to local - specify column delimiter
insert overwrite local directory '/home/hadoop/export2' row format delimited fields terminated by '\t' select * from test_load;
  • Export the query results to HDFS (without the local keyword)
insert overwrite directory '/tmp/export' row format delimited fields terminated by '\t' select * from test_load;

2.7.5 hive table data export-hive shell

  • Basic syntax: hive -e "statement" or hive -f script.sql, redirecting the output to a file with >
bin/hive -e "select * from myhive.test_load;" > /home/hadoop/export3/export4.txt
bin/hive -f export.sql > /home/hadoop/export4/export4.txt
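  • For the -f form, export.sql is simply a script file holding the statement to run; its contents here are an assumption:
-- export.sql
select * from myhive.test_load;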

2.8 Partition table

  • In Hive, a partitioned table is a table that logically partitions data according to the value of a specific column. Partitioned tables can speed up queries and improve data management efficiency.
  • At the same time, Hive also supports multiple fields as partitions, and multiple partitions have hierarchical relationships.

2.8.1 Create partition

  • Create partition table syntax
    CREATE TABLE table_name (
       column1 data_type,
       column2 data_type,
       ...
    )
    PARTITIONED BY (partition_column1 data_type, partition_column2 data_type, ...);
    

  • Demo
    create table score(
    		sid string,
    		cid string,
    		sscore int
    	)
    partitioned by (month string)
    row format delimited fields terminated by '\t';
    
  • Create a table with multiple partitions
    create table score2 (sid string, c_id string, sscore int)
    partitioned by (year string, month string, day string)
    row format delimited fields terminated by '\t';
    

2.8.2 Loading data

  • Load data into partitioned table
    load data local inpath '/export/server/hivedatas/score.txt' into table score 
    partition (month='202006');
    
  • Load data into a multi-partitioned table
    load data local inpath '/export/server/hivedatas/score.txt' into table score2
    partition(year='2020',month='06',day='01');
    
  • Insert data into partitioned table
    INSERT INTO TABLE table_name PARTITION (partition_column1 = value1, partition_column2 = value2, ...)
    VALUES (value1, value2, ...);
    
    INSERT INTO TABLE sales PARTITION (year = 2023, month = 9)
    VALUES (1, 'Product A', '2023-09-08', 100.0);
    

2.8.3 View partitions

  • View partitions
    show partitions score;
    

2.8.4 Add partition

  • add a partition
    alter table score add partition(month='202005');
    
  • Add multiple partitions at the same time
    alter table score add partition(month='202004') partition(month='202003');
    
  • Note: after adding a partition, an additional folder appears under the table's directory in HDFS, as the check below shows.
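  • For example (a hedged check, assuming the default warehouse path), list the table directory and look for a month=202005 subfolder:
hadoop fs -ls /user/hive/warehouse/myhive.db/score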

2.8.5 Modify partition location

ALTER TABLE table_name PARTITION (partition_column1 = value1, partition_column2 = value2, ...)
SET LOCATION '/new/partition/location';
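  • A concrete instance for the score table (the new path is made up for this sketch). Note that this only updates the partition's metadata; existing files are not moved:
alter table score partition (month='202005') set location '/tmp/score_202005';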

2.8.6 Modify partition value

alter table table_name partition(month='202005') rename to partition(month='201105');

2.8.7 Delete partition

alter table table_name drop partition(month='202006');
  • Modifying and deleting partitions only changes Hive's metadata; it does not modify the data content in HDFS! [Modifying partitions is not recommended]

2.9 Bucket table

  • Bucketing, like partitioning, is also a tuning method that changes the storage mode of the table to optimize the table.
  • But unlike partitioning, which splits the table into different subfolders for storage, bucketing splits the table into a fixed number of different files for storage.

2.9.1 Create bucket table

  • Enable bucketing enforcement (automatically matches the number of reduce tasks to the number of buckets)
set hive.enforce.bucketing=true;
  • Create bucket table
create table course (
	c_id string,
	c_name string,
	t_id string
) clustered by(c_id) into 3 buckets 
row format delimited fields terminated by '\t';

2.9.2 Bucket table data loading

  • A bucketed table cannot be loaded with load data; data can only be loaded via insert select:
  1. Create a temporary table (either external table or internal table), and load data into the table through load data
  2. Then insert data from the temporary table into the bucket table through insert select
  • Create a normal table:
create table course_common (
	c_id string,
	c_name string,
	t_id string
) row format delimited fields terminated by '\t';
  • Load data into a normal table
    load data local inpath '/export/server/hivedatas/course.txt' into table course_common;
    
  • Load data into the bucketed table via insert overwrite
    insert overwrite table course select * from course_common cluster by(c_id);
    

2.9.3 Reason explanation

  • Why can a bucketed table only be loaded via insert select, and not with load data?

Without bucketing settings, inserting (loading) data is simply a matter of placing the data into:

  • the table's own folder (when the table has no partitions)

  • the folder of the specified partition (when the table has partitions)

  • Once bucketing is configured, for example with 3 buckets, the table (or each partition within it) stores its data in exactly 3 files. When data is inserted, it must be split into 3 parts and written into the three bucket files.

  • Question: how is the data divided into three parts, and by what rule?

  • The split is determined by the hash of the bucketed column's value modulo the number of buckets. Since load data does not trigger MapReduce, there is no computation step (the hash algorithm cannot run); it merely moves files, which is why it cannot be used to load data into a bucketed table.
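  • A rough SQL sketch of that rule, approximating (not exactly reproducing) Hive's internal computation:
-- With 3 buckets, which bucket would each row of course_common land in?
select c_id, abs(hash(c_id)) % 3 as bucket_no
from course_common;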

2.9.4 Performance improvement of bucket table

  • Partitioned tables improve performance by reducing the amount of data scanned, provided the partition column is specified in the query.
  • Bucketed tables improve performance for operations on the bucketed column: filtering, JOINs, and grouping can all benefit.
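  • For example, with the course table bucketed on c_id above, a filter on the bucket column lets Hive hash the literal and read only the one matching bucket file out of three (a sketch; '01' is a made-up id):
select * from course where c_id = '01';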

2.10 Modify table

  • Table rename
alter table old_table_name rename to new_table_name;
  • Modify table properties
ALTER TABLE table_name SET TBLPROPERTIES table_properties;
table_properties: (property_name=property_value, property_name=property_value, ...)
  • For example, ALTER TABLE table_name SET TBLPROPERTIES("EXTERNAL"="TRUE") toggles the internal/external table attribute
  • For example, ALTER TABLE table_name SET TBLPROPERTIES('comment' = new_comment) modifies the table comment
  • Add column
alter table table_name add columns(v1 int,v2 string); 
  • Modify column name
alter table table_name change v1 v1new int;
  • Delete table
drop table table_name;
  • Clear table
-- Only internal tables can be truncated
truncate table table_name;

2.11 Complex types

2.11.1 array type

  • Hive supports many data types. Besides the basic types such as int, string, varchar, and timestamp, there are also complex types: array (array type), map (mapping type), and struct (structure type).

  • The content of the data_for_array_type.txt file is as follows

    zhangsan	beijing,shanghai,tianjin,hangzhou
    wangwu	changchun,chengdu,wuhan,beijin
    
    • Description: the name and locations are separated by a tab, and the elements within locations are separated by commas.
  • Create table statement

    create table test_array(
        name string,
        work_locations array<string>)
    row format delimited fields terminated by '\t'
    COLLECTION ITEMS TERMINATED BY ',';
    
    • row format delimited fields terminated by '\t' indicates that the column separator is \t.
    • COLLECTION ITEMS TERMINATED BY ',' indicates that the separator between collection (array) elements is a comma.
  • Import Data

    load data local inpath '/home/hadoop/data_for_array_type.txt' overwrite into table itheima.test_array;
    
  • Commonly used array type queries:

    -- Query all data
    select * from test_array;
    -- Query the first element of the work_locations array
    select name, work_locations[0] location from test_array;
    -- Query the number of elements in the work_locations array
    select name, size(work_locations) location from test_array;
    -- Query rows whose work_locations array contains 'tianjin'
    select * from test_array where array_contains(work_locations,'tianjin');
    

2.11.2 map type

  • The map type is: Key-Value data format.
  • Consider the following data file, in which the members field holds key-value data. The field separator is ","; the separator between map entries is "#"; and the KV separator inside the map is ":".
1,林杰均,father:林大明#mother:小甜甜#brother:小甜,28
2,周杰伦,father:马小云#mother:黄大奕#brother:小天,22
3,王葱,father:王林#mother:如花#sister:潇潇,29
4,马大云,father:周街轮#mother:美美,26
  1. Create table statement
    create table test_map(
        id int,
        name string,
        members map<string,string>,
        age int)
    row format delimited fields terminated by ','
    COLLECTION ITEMS TERMINATED BY '#'
    MAP KEYS TERMINATED BY ':';
    
  • MAP KEYS TERMINATED BY ':' indicates that each key and its value are separated by ':'
  2. Import Data
    load data local inpath '/home/hadoop/data_for_map_type.txt' overwrite into table test_map;
    
  3. Common queries
    -- Query all rows
    select * from test_map;
    -- Query the values for the map keys "father" and "mother"
    select id, name, members["father"] father, members["mother"] mother, age from test_map;
    -- Query all map keys with the map_keys function; the result is an array
    select id, name, map_keys(members) as relation from test_map;
    -- Query all map values with the map_values function; the result is an array
    select id, name, map_values(members) as relation from test_map;
    -- Query the number of KV pairs in the map
    select id, name, size(members) num from test_map;
    -- Query rows whose map keys include 'brother'
    select * from test_map where array_contains(map_keys(members), 'brother');
    

2.11.3 struct type

  • The struct type is a composite type that stores multiple sub-columns within one column, where each sub-column has its own name and type.

  • Consider the following data file. Description: '#' separates the fields; ':' separates the struct's sub-columns

1#周杰轮:11
2#林均杰:16
3#刘德滑:21
4#张学油:26
5#蔡依临:23
  1. Create table statement
    create table test_struct(
        id string,
        info struct<name:string,age:int>
        )
    row format delimited fields terminated by '#'
    COLLECTION ITEMS TERMINATED BY ':';
    
  2. Import Data
    load data local inpath '/home/hadoop/data_for_struct_type.txt' into table test_struct;
    
  3. Common queries
    -- Query all rows
    select * from test_struct;
    -- Use column_name.subcolumn_name to pull a sub-column out of the struct
    select id, info.name from test_struct;
    

2.11.4 Summary of three structures

Type      Definition                                          Access
array     array<type>: an ordered list of same-type elements  column[index]
map       map<key_type, value_type>: key-value pairs          column[key]
struct    struct<name:type, ...>: named, typed sub-columns    column.subcolumn

Origin: blog.csdn.net/yang2330648064/article/details/132515595