Big Data Offline Phase 06: HQL Data Definition Language (DDL) Overview

The role of DDL syntax

Data Definition Language (DDL) is the part of SQL used to create, delete, and modify the structure of objects inside a database. These database objects include databases (schemas), tables, views, indexes, and so on. The core syntax consists of CREATE, ALTER, and DROP. DDL does not operate on the data inside tables.

In some contexts, the term is also rendered as data description language, because it describes the fields and structure of database tables.

DDL usage in Hive

The syntax of Hive SQL (HQL) is very similar to standard SQL; they are basically the same, so users who already know SQL can pick up Hive SQL painlessly. When learning HQL, however, pay special attention to Hive's own unique syntax, such as partition-related DDL operations.

Given Hive's design and typical usage, the CREATE syntax (especially CREATE TABLE) is the most important part of DDL to learn and master. Whether a table is built correctly directly determines whether the data files are mapped successfully, which in turn determines whether the data can be analyzed with SQL at all. In plain terms: no table means no data, and with no data, what is there to analyze?

Choosing the right direction is often more important than trying blindly.


Hive DDL table building basics

Complete table creation syntax tree

  • Upper-case words are the keywords of the table-creation syntax, each used to enable a particular feature.
  • Syntax enclosed in square brackets [] is optional.
  • | indicates that you choose exactly one of the alternatives it separates.
  • The clause order in a CREATE TABLE statement must follow the grammar rules, sketched below.
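
A simplified sketch of Hive's CREATE TABLE grammar, condensed from the Apache Hive language manual (less common clauses such as SKEWED BY are omitted here):

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)];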

Detailed explanation of Hive data types

Overall overview

The data types in Hive refer to the column field types in Hive tables. Hive data types fall into two broad categories: native (primitive) data types and complex data types.

Native data types include: numeric types, date/time types, string types, and miscellaneous types.

Complex data types include: array, map, struct, and union.

Regarding Hive data types, note the following:

  • Type names are not case sensitive;
  • In addition to SQL data types, Java data types such as string are also supported;
  • int and string are the most widely used, and most functions support them;
  • Complex data types usually need to be used together with the delimiter-specification syntax;
  • If the defined data type is inconsistent with the file contents, Hive attempts an implicit conversion, but success is not guaranteed.

Native data types

Hive's native data types include numeric types (TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL), date/time types (TIMESTAMP, DATE, INTERVAL), string types (STRING, VARCHAR, CHAR), and miscellaneous types (BOOLEAN, BINARY).

Among them, INT and STRING see the widest use. For detailed descriptions, refer to the language manual:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

Complex data types

Hive's complex data types are ARRAY, MAP, STRUCT, and UNIONTYPE.

Among them, ARRAY and MAP see the widest use. For detailed descriptions, refer to the language manual:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types


Implicit and explicit data type conversion

Like SQL, HQL supports both implicit and explicit type conversions.

Converting a native type from a narrower type to a wider type is called an implicit conversion; the reverse is not allowed implicitly.

The implicit conversions allowed between types are listed in the conversion matrix of the language manual:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
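
For instance, a quick illustration of implicit widening (here the literal 1 is an INT and 2.0 is a DOUBLE):

select 1 + 2.0;   -- the INT operand is implicitly widened to DOUBLE; returns 3.0

The reverse direction (DOUBLE to INT) never happens implicitly; it requires an explicit CAST.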

Explicit type conversions use the CAST function.

For example, CAST('100' AS INT) converts the string '100' to the integer value 100. If the cast fails, as in CAST('INT' AS INT), the function returns NULL.
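
A few more CAST examples that can be run in any Hive session:

select cast('100' as int);   -- 100
select cast('INT' as int);   -- NULL, the cast fails
select cast(3.7 as int);     -- 3, the fractional part is truncated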


Hive read and write file mechanism

What is SerDe

SerDe is an abbreviation of Serializer and Deserializer, used for serialization and deserialization. Serialization is the process of converting an object into a byte sequence; deserialization is the process of converting a byte sequence back into an object.

Hive uses SerDe (and FileFormat) to read and write row objects.

Note that the "key" part is ignored when reading, and is always a constant when writing. Essentially, row objects are stored in the "value".

You can use desc formatted tablename to view the table's SerDe information. The defaults are shown below.
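
A minimal sketch, using the t_archer table built in the case study later in this article; the class names shown are Hive's documented defaults for a plain text table:

desc formatted t_archer;
-- Key lines from the output:
-- SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
-- InputFormat:    org.apache.hadoop.mapred.TextInputFormat
-- OutputFormat:   org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat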

Hive read and write file process

Hive file reading mechanism: first call InputFormat (default TextInputFormat), which returns key-value records (by default, one line corresponds to one record). Then call the SerDe's Deserializer (default LazySimpleSerDe) to split the value of each record into fields according to the delimiter.

Hive file writing mechanism: when writing a row to a file, first call the SerDe's Serializer (default LazySimpleSerDe) to convert the row object into a byte sequence, then call OutputFormat to write the data into the HDFS file.


SerDe related syntax

In Hive's table creation statement, the syntax related to SerDe is:
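
A condensed sketch of the row_format alternatives from the create-table grammar shown earlier:

ROW FORMAT DELIMITED
    [FIELDS TERMINATED BY char]
    [COLLECTION ITEMS TERMINATED BY char]
    [MAP KEYS TERMINATED BY char]
    [LINES TERMINATED BY char]
| ROW FORMAT SERDE serde_name
    [WITH SERDEPROPERTIES (property_name=property_value, ...)]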

ROW FORMAT is the governing keyword; one of DELIMITED or SERDE is chosen.

If DELIMITED is used, the default LazySimpleSerDe class processes the data. If the data file format is special, you can use ROW FORMAT SERDE serde_name to specify another SerDe class, including user-defined SerDe classes.

LazySimpleSerDe separator specification

LazySimpleSerDe is Hive's default SerDe class. It has four sub-clauses for specifying the delimiter between fields, between collection elements, between map keys and values, and between lines. When building a table, use whichever of them the data requires.

Default delimiter

If a table is created without a ROW FORMAT clause, the default field separator is '\001', a non-printable character in the ASCII encoding that cannot be typed directly on a keyboard.

In the vim editor, press Ctrl+v and then Ctrl+a to enter '\001'; it is displayed as ^A.

In some text editors it appears as SOH (Start of Heading).

Hive data storage path

Default storage path

The default storage path for Hive tables is specified by the hive.metastore.warehouse.dir property in the ${HIVE_HOME}/conf/hive-site.xml configuration file. The default value is: /user/hive/warehouse.

Under this path, files are stored in folders corresponding to the database and table they belong to.
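
You can confirm the configured value from a Hive session; assuming a default installation, this prints the warehouse path:

set hive.metastore.warehouse.dir;
-- hive.metastore.warehouse.dir=/user/hive/warehouse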

Specify the storage path

When building a table, you can use the LOCATION clause to change where the table's data is stored on HDFS, which makes loading data into the table more flexible and convenient.

Syntax: LOCATION '<hdfs_location>'.

For data files that have already been generated, specifying their path with LOCATION is especially convenient.
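
A hypothetical sketch (the HDFS path below is illustrative, not from the original article); the table maps whatever files already sit in that directory:

create table t_player_ext(
    id int,
    name string
)
row format delimited fields terminated by '\t'
location '/data/honor_of_kings/player';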


Case: Honor of Kings

Native data type case

The file archer.txt records information about the archer heroes of the mobile game "Honor of Kings". Its content is shown below; the separator between fields is the tab character \t. The task is to create a table in Hive and map the file successfully.

1	后羿	5986	1784	396	336	remotely	archer
2	马可波罗	5584	200	362	344	remotely	archer
3	鲁班七号	5989	1756	400	323	remotely	archer
4	李元芳	5725	1770	396	340	remotely	archer
5	孙尚香	6014	1756	411	346	remotely	archer
6	黄忠	5898	1784	403	319	remotely	archer
7	狄仁杰	5710	1770	376	338	remotely	archer
8	虞姬	5669	1770	407	329	remotely	archer
9	成吉思汗	5799	1742	394	329	remotely	archer
10	百里守约	5611	1784	410	329	remotely	archer	assassin

Field meanings: id, name (hero name), hp_max (maximum HP), mp_max (maximum mana), attack_max (maximum physical attack), defense_max (maximum physical defense), attack_range (attack range), role_main (main role), role_assist (secondary role).

Analysis: the fields are all basic types, and the field order must be respected. The separator between fields is a tab character, which must be specified with the ROW FORMAT syntax.


Create table statement:

-- create the database and switch to it
create database honor_of_kings;
use honor_of_kings;

-- ddl create table
create table t_archer(
    id int comment "ID",
    name string comment "hero name",
    hp_max int comment "maximum HP",
    mp_max int comment "maximum mana",
    attack_max int comment "maximum physical attack",
    defense_max int comment "maximum physical defense",
    attack_range string comment "attack range",
    role_main string comment "main role",
    role_assist string comment "secondary role"
) comment "Honor of Kings archer info"
row format delimited fields terminated by "\t";

After the table is built successfully, a folder for the table is generated under Hive's default storage path. Upload the archer.txt file into that table folder.

hadoop fs -put archer.txt /user/hive/warehouse/honor_of_kings.db/t_archer

Run a query to confirm that the data has been mapped successfully.
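
For example:

select * from t_archer;
select count(*) from t_archer;   -- should return 10, one row per line of archer.txt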

Think about it: compared with MySQL, is inserting data into Hive row by row equally convenient?


Complex data type case

The file hot_hero_skin_price.txt records skin price information for popular heroes of the mobile game "Honor of Kings".

1,孙悟空,53,西部大镖客:288-大圣娶亲:888-全息碎片:0-至尊宝:888-地狱火:1688
2,鲁班七号,54,木偶奇遇记:288-福禄兄弟:288-黑桃队长:60-电玩小子:2288-星空梦想:0
3,后裔,53,精灵王:288-阿尔法小队:588-辉光之辰:888-黄金射手座:1688-如梦令:1314
4,铠,52,龙域领主:288-曙光守护者:1776
5,韩信,52,飞衡:1788-逐梦之影:888-白龙吟:1188-教廷特使:0-街头霸王:888

Fields: id, name (hero name), win_rate (winning rate), skin_price (skin and price)

Analysis: the first three fields are native data types, and the last field is the complex type map. The delimiter between fields, the delimiter between collection elements, and the delimiter between map key-value pairs must all be specified.

Create table statement:

create table t_hot_hero_skin_price(
    id int,
    name string,
    win_rate int,
    skin_price map<string,int>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';

After the table is created successfully, upload the hot_hero_skin_price.txt file to the corresponding table folder.

hadoop fs -put hot_hero_skin_price.txt /user/hive/warehouse/honor_of_kings.db/t_hot_hero_skin_price

Run a query to confirm that the data has been mapped successfully.

Think about it: if the last field were defined as a plain string type, would it be as convenient to use later?
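
This is where the map type pays off: individual skins can be addressed by key, and built-in map functions apply. A quick sketch using a key taken from the sample data:

select name, skin_price['白龙吟'] as price
from t_hot_hero_skin_price where id = 5;   -- 韩信, 1188

select name, map_keys(skin_price)
from t_hot_hero_skin_price;                -- list each hero's skin names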


Default delimiter case

The file team_ace_player.txt records information about the ace players of major teams in the mobile game "Honor of Kings".

Fields: id, team_name (team name), ace_player_name (ace player name)

Analysis: the fields are all native data types, and the separator between fields is \001, so the ROW FORMAT clause can be omitted when creating the table, because Hive's default separator is exactly \001.

Create table statement :

create table t_team_ace_player(
    id int,
    team_name string,
    ace_player_name string
);

After the table is created successfully, upload the team_ace_player.txt file to the corresponding table folder.

hadoop fs -put team_ace_player.txt /user/hive/warehouse/honor_of_kings.db/t_team_ace_player

Run a query to confirm that the data has been mapped successfully.
