Hive: data types, DDL (data definition language), DML (data manipulation)

Data types

Basic data types

Hive data type | Java data type | Description                                                  | Example
TINYINT        | byte           | 1-byte signed integer                                        | 20
SMALLINT       | short          | 2-byte signed integer                                        | 20
INT            | int            | 4-byte signed integer                                        | 20
BIGINT         | long           | 8-byte signed integer                                        | 20
BOOLEAN        | boolean        | Boolean type, true or false                                  | TRUE, FALSE
FLOAT          | float          | Single-precision floating point                              | 3.14159
DOUBLE         | double         | Double-precision floating point                              | 3.14159
STRING         | string         | String; a character set can be specified, and single or double quotes may be used | 'Good morning', "Good afternoon"
TIMESTAMP      | -              | Time type                                                    | -
BINARY         | -              | Byte array                                                   | -

    Hive's STRING type corresponds to a database VARCHAR: a variable-length string. Unlike VARCHAR, however, it cannot declare a maximum number of characters; in theory it can store up to 2 GB of characters.

Collection data types

Data type | Description
STRUCT    | Similar to a struct in the C language; element content is accessed with "dot" notation. For example, if a column's data type is STRUCT{first STRING, last STRING}, the first element can be referenced as field.first
MAP       | A MAP is a set of key-value tuples; data is accessed using array notation. For example, if a column's data type is MAP and its key->value pairs are 'first'->'John' and 'last'->'Doe', the last element can be referenced as fieldname['last']
ARRAY     | An array is a collection of variables of the same type and name. The variables are called elements of the array, and each element is numbered starting from zero. For example, if the array value is ['John','Doe'], the second element can be referenced as arrayname[1]
create table test(
	name string,
	friends array<string>,
	children map<string, int>,
	address struct<street:string, city:string>
)
row format delimited
fields terminated by ','                -- column delimiter
collection items terminated by '_'      -- delimiter for MAP, STRUCT, and ARRAY elements
map keys terminated by ':'              -- delimiter between key and value in a MAP
lines terminated by '\n';               -- row delimiter
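    For example, a data line matching the delimiters above might look like this (the values are illustrative):

John,Jane_Jim,Tom:3_Amy:1,Main St_Springfield

    The collection elements can then be accessed with the notation described earlier:

select friends[1], children['Tom'], address.city from test;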

Type conversion

    Hive's basic data types can be implicitly converted, similar to Java's type promotion. For example, TINYINT is automatically converted to INT. Hive will not perform the reverse conversion: INT will not be automatically converted to TINYINT.
    The implicit type conversion rules are as follows:

  1. Any integer type can be implicitly converted to a broader type, such as TINYINT can be converted to INT, INT can be converted to BIGINT.
  2. All integer types, FLOAT and STRING types can be implicitly converted to DOUBLE.
  3. TINYINT, SMALLINT, INT can all be converted to FLOAT.
  4. The BOOLEAN type cannot be converted to any other type.

    Use the CAST operator to perform explicit data type conversion. For example, CAST('1' AS INT) converts the string '1' into the integer 1. If the conversion fails, as when executing CAST('X' AS INT), the expression returns NULL.
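    A short sketch of both cases (assuming a Hive version that supports SELECT without a FROM clause):

select cast('1' as int) + 1;   -- explicit conversion succeeds: returns 2
select cast('X' as int);       -- conversion fails: returns NULL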

DDL: data definition language

Create database

create database if not exists hive;

    Create a database and specify its storage location on HDFS; when a location is given, the default *.db folder will not be created

create database if not exists hive location '/hive';

Query database

    List all databases

show databases;

    Display database information

desc database hive;

    Show database details

desc database extended hive;

Modify and delete the database

    Only a database's DBPROPERTIES key-value pairs can be modified; other metadata, such as the database name and the directory where the database is located, cannot be changed.

alter database hive set dbproperties('createtime'='20200528');
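    The change can be verified with the extended desc shown earlier; the new key appears in the parameters field:

desc database extended hive;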

    Delete an empty database

drop database if exists hive;

    Use cascade to force deletion of a database that is not empty

drop database if exists hive cascade;

Create table

    Create table syntax

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]

    A plain create table

create table if not exists student(
	id int, name string
)
row format delimited fields terminated by '\t'
stored as textfile
location '/hive/warehouse/student';

    Create a table from a query result (the query result is loaded into the new table)

create table if not exists student as select id,name from person;

    Create a table based on an existing table structure

create table if not exists student like person;

    Query the table's type; the Table Type field in the output shows whether it is MANAGED_TABLE or EXTERNAL_TABLE

desc formatted student;

Internal and external tables

    Tables created by default are so-called managed tables, sometimes called internal tables. When we delete a managed table, Hive also deletes the table's data.
    The EXTERNAL keyword lets users create an external table and specify the path to the actual data (LOCATION) at creation time. When Hive creates an internal table, it moves the data to the path pointed to by the data warehouse; when an external table is created, Hive only records the path where the data is located and makes no change to the data's location. When a table is deleted, an internal table's metadata and data are deleted together, while for an external table only the metadata is deleted, never the data.
    Application scenario: collected website logs are regularly streamed into HDFS text files every day. Heavy statistical analysis is done on top of the external table (the raw log table), while the intermediate and result tables are stored as internal tables, with data entering them via SELECT + INSERT.
    Managed tables and external tables can be converted into each other. Note: 'EXTERNAL'='TRUE' and 'EXTERNAL'='FALSE' must be uppercase.

alter table student set tblproperties('EXTERNAL'='TRUE');
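    Converting back to a managed table uses the same statement with 'FALSE':

alter table student set tblproperties('EXTERNAL'='FALSE');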

Partition table

    A partition table corresponds to an independent folder on the HDFS file system, and all of the partition's data files live under that folder. A partition in Hive is a subdirectory, dividing a large data set into small ones according to business needs. At query time, the required partition is selected through an expression in the WHERE clause, which makes queries much more efficient. The partition layout looks like this:

/hive/warehouse/log_partition/20170702/20170702.log
/hive/warehouse/log_partition/20170703/20170703.log
/hive/warehouse/log_partition/20170704/20170704.log

    Create partition table

create table log_partition(
	dname string, loc string
)
partitioned by (month string)
row format delimited fields terminated by '\t';

    Load data into the partition table

load data local inpath '/opt/test.txt' into table log_partition partition(month='202005');

    Query the data in the partition table

select * from log_partition where month='202005';

    Add a partition

alter table log_partition add partition(month='202005');

    Add multiple partitions, separated by spaces

alter table log_partition add partition(month='202005') partition(month='202006');

    Delete partition

alter table log_partition drop partition(month='202005');

    Delete multiple partitions, separated by commas

alter table log_partition drop partition(month='202005'),partition(month='202006');

    View how many partitions the partition table has

show partitions log_partition;

    View the partition table structure

desc formatted log_partition;

    Create a secondary partition table

create table log_partition(
	dname string, loc string
)
partitioned by (month string,day string)
row format delimited fields terminated by '\t';

    Load data into the secondary partition table

load data local inpath '/opt/test.txt' into table log_partition partition(month='202005',day='28');
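    Queries against a secondary partition filter on both partition columns:

select * from log_partition where month='202005' and day='28';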

Data and metadata

    When Hive queries data, it first consults the metadata and then fetches the actual data based on it. A Hive query therefore requires two things: 1. the metadata exists; 2. the actual data exists. It does not matter whether the data or the metadata comes first; as long as both conditions are met, the data can be queried. If either is missing, the query returns nothing.

hadoop fs -mkdir -p /hive/warehouse/test            # create the table's directory on HDFS
hadoop fs -put /opt/test.txt /hive/warehouse/test   # upload the data file (data before metadata)
msck repair table test;                             -- repair the table's metadata to match HDFS
select * from test;

Modify table

    Rename table

alter table student rename to pupil;

    Add column

alter table pupil add columns(grade int);

    Change a column; note that the new column name must be followed by its type

alter table pupil change column name last_name string;

    Replace columns; REPLACE replaces all of the table's fields.

alter table pupil replace columns(id int, name string);

    Delete table

drop table pupil;

DML: data manipulation

Data import

  1. Load data into the table (Load)

    Load local files to table hive

load data local inpath '/opt/test.txt' into table hive;

    Load HDFS files into table hive

load data inpath '/test.txt' into table hive;

    Load data to overwrite the existing data in the table

load data inpath '/test.txt' overwrite into table hive;

  2. Insert data into the table (Insert)

    Basic insert: literal values, or the result of a query (source_table below is an illustrative name)

insert into table student partition(month='202005') values(1,'abc');
insert into table student partition(month='202005') select id, name from source_table;
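    To replace the partition's existing data rather than append, standard Hive also supports insert overwrite (same illustrative source table):

insert overwrite table student partition(month='202005') select id, name from source_table;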

    Create table and load data (As Select)

create table if not exists student as select id, name from person;

    Specify the path of the data through Location when creating the table

create table if not exists student(
	id int, name string
)
row format delimited fields terminated by '\t'
location '/hive/warehouse/student';

hadoop fs -put /opt/test.txt /hive/warehouse/student
select * from student;

    Import data into a specified Hive table. Note: export first, then import; export generates a metadata file when exporting the table, and import needs that metadata.

export table temptable to '/hive/warehouse/export/student';
import table student partition(month='202005') from '/hive/warehouse/export/student';

Data output

    Use insert to export the query results to a local directory

insert overwrite local directory '/opt/test' select * from student;

    Export the query results to a local directory with a specified format

insert overwrite local directory '/opt/test' row format delimited fields terminated by '\t' select * from student;

    Export the query results to HDFS (without the local keyword)

insert overwrite directory '/opt/test' select * from student;

    Export to the local filesystem with a Hadoop command

hadoop fs -get /hive/warehouse/test.txt /opt/test.txt

    Export with the hive shell command

hive -e 'select * from student;' > /opt/student.txt
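    For longer statements, the hive CLI can also execute a script file with -f (the script path is illustrative):

hive -f /opt/export_student.sql > /opt/student.txt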

    Export to HDFS

export table student to '/hive/warehouse/student';

Clear table data (truncate)

    Truncate can only clear managed (internal) tables; it cannot delete data from external tables

truncate table student;
