Thirty, Hive data types and commonly used attribute configuration

In the last article, we deployed Hive on the server and stored its Metastore on MySQL. This article introduces the data types of Hive and some commonly used attribute configurations. Pay attention to the column "Break the Cocoon and Become a Butterfly-Big Data" to see more related content~


Table of Contents

One, Hive data types

1.1 Basic data types

1.2 Collection data types

1.2.1 Introduction

1.2.2 Example

1.3 Data type conversion

Two, commonly used attribute configuration

2.1 HiveServer2 service

2.2 Hive interactive commands

2.3 Other Hive commands

2.4 Commonly used attribute configuration

2.4.1 Display header information in query results

2.4.2 Display the current database name

2.4.3 Configure the location of the Hive data warehouse

2.4.4 Configure the storage location of Hive running log information

2.4.5 Priority of parameter configuration


 

One, Hive data types

1.1 Basic data types

Hive has ten basic data types, as follows:

| Hive data type | Corresponding Java type | Length / description |
|---|---|---|
| TINYINT | byte | 1-byte signed integer |
| SMALLINT | short | 2-byte signed integer |
| INT | int | 4-byte signed integer |
| BIGINT | long | 8-byte signed integer |
| BOOLEAN | boolean | Boolean, true or false |
| FLOAT | float | Single-precision floating-point number |
| DOUBLE | double | Double-precision floating-point number |
| STRING | string | Character sequence; a character set may be specified, and single or double quotes may be used. Equivalent to the database varchar type |
| TIMESTAMP | — | Time type |
| BINARY | — | Byte array |

1.2 Collection data types

1.2.1 Introduction

| Data type | Description | Syntax example |
|---|---|---|
| STRUCT | Element contents are accessed with "dot" notation. For example, if a column c has the type STRUCT{one STRING, two STRING}, the first element can be referenced as c.one. | struct(), e.g. struct<person:string, city:string> |
| MAP | A MAP is a collection of key-value tuples, and data can be accessed with array notation. For example, if a column m has the type MAP and holds the key-value pairs 'one'->'xzw' and 'two'->'yxy', the last element can be obtained as m['two']. | map(), e.g. map<string, int> |
| ARRAY | An array is an ordered collection of elements of the same type. Each element has an index, which starts from zero. For example, if a column a holds the array value ['one', 'two'], the second element can be referenced as a[1]. | array(), e.g. array<string> |

Hive has three complex data types: STRUCT, MAP, and ARRAY. STRUCT is similar to a struct in C: it encapsulates a named set of fields. ARRAY and MAP are similar to Array and Map in Java. Complex data types allow arbitrary levels of nesting.

1.2.2 Example

1. Suppose we have the following data:

[{
    "name": "xzw",
    "loc": ["qd", "zb"],
    "city": {
        "ta": 4,
        "qd": 3
    },
    "subject": {
        "dm": "Python",
        "reg": "bigdata"
    }
},
{
    "name": "yxy",
    "loc": ["bj", "sh"],
    "city": {
        "bj": 1,
        "sh": 3
    },
    "subject": {
        "dm": "Java",
        "reg": "AI"
    }
}]

2. First, we need to construct the data file to be imported into Hive. The data file looks like this:

xzw,qd|zb,ta:4|qd:3,Python|bigdata
yxy,bj|sh,bj:1|sh:3,Java|AI

It is worth noting that the element separators for MAP, STRUCT, and ARRAY can all be the same character; here "|" is used. Place the constructed test.txt file in the /root/files directory.

3. Create Hive table

create table test(
name string,
loc array<string>,
city map<string, int>,
subject struct<dm:string, reg:string>
)
row format delimited fields terminated by ','
collection items terminated by '|'
map keys terminated by ':'
lines terminated by '\n';

4. Load data into the Hive table

load data local inpath '/root/files/test.txt' into table test;

5. Query test
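The screenshot of the query results is omitted here; as a minimal sketch, queries against the three collection types of the test table above might look like this (column names match the DDL; index and key access follow the rules from section 1.2.1):

```sql
-- array: elements are indexed from zero
select name, loc[1] from test;

-- map: look up a value by key (a missing key yields NULL)
select name, city['qd'] from test;

-- struct: access a field with dot notation
select name, subject.dm from test;
```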

1.3 Data type conversion

Hive data types can be implicitly converted according to the following rules:

(1) Any integer type can be implicitly converted to a wider type; for example, TINYINT can be converted to INT, and INT can be converted to BIGINT.
(2) All integer types, FLOAT, and STRING can be implicitly converted to DOUBLE.
(3) TINYINT, SMALLINT, and INT can all be converted to FLOAT.
(4) BOOLEAN cannot be converted to any other type.
(5) CAST can be used to perform explicit type conversion. For example, CAST('1' AS INT) converts the string '1' to the integer 1; if the cast fails, as with CAST('X' AS INT), the expression returns NULL.
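As a quick illustration of implicit widening and explicit CAST, the following can be run in the Hive CLI:

```sql
-- implicit conversion: the integer operand is promoted to DOUBLE
select 1 + 2.0;

-- explicit conversion with CAST: the string '1' becomes the integer 1
select cast('1' as int) + 1;

-- a failed cast returns NULL rather than raising an error
select cast('X' as int);
```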

Two, commonly used attribute configuration

Before explaining the common attribute configurations, let's first look at how to access and connect to Hive, along with some of Hive's commonly used interactive commands. This lays the groundwork for the attribute configurations discussed afterwards.

2.1 HiveServer2 service

HiveServer2 (HS2) is a server interface that enables remote clients to execute queries on Hive and retrieve results. In other words, you can use JDBC to access Hive through the HiveServer2 service. The following is how to start the HiveServer2 service.

1. First start the HiveServer2 service with the following command:

hiveserver2

2. Start beeline

beeline

3. Connect to HiveServer2 service

Connect to the HiveServer2 service through the following command:

!connect jdbc:hive2://master:10000
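Alternatively, the JDBC URL can be passed directly when launching beeline, which combines steps 2 and 3 (master:10000 is the host and port used above; the username root is an assumption, substitute your own):

```shell
beeline -u jdbc:hive2://master:10000 -n root
```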

2.2 Hive interactive commands

You can use the following commands to view the interactive commands available in Hive:

hive -help

For the commonly used interactive commands, see another post of mine: "Hive calls sql files through -f and transfers parameters".
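Two of the most frequently used options shown by hive -help are -e (run a quoted SQL statement) and -f (run a SQL file); a brief sketch (the file path is just an example):

```shell
# run a statement without entering the interactive CLI
hive -e 'select * from test;'

# run the statements in a SQL file
hive -f /root/files/test.sql
```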

2.3 Other Hive commands

1. View the HDFS file system from the Hive command line

dfs -ls /;

2. View the local file system on the command line

!ls /root/files;

3. View the history of commands entered in Hive

Hive records the command history in a file named .hivehistory in the user's home directory (here, /root), as shown in the following figure:

2.4 Commonly used attribute configuration

2.4.1 Display header information in query results

Add the following configuration in hive-site.xml:

<property>
	<name>hive.cli.print.header</name>
	<value>true</value>
</property>

2.4.2 Display the current database name

Add the following configuration in hive-site.xml:

<property>
	<name>hive.cli.print.current.db</name>
	<value>true</value>
</property>

2.4.3 Configure the location of the Hive data warehouse

By default, the data warehouse is located under the /user/hive/warehouse path on HDFS. No folder is created in the warehouse directory for the default database, default; a table that belongs to the default database gets its folder directly under the warehouse directory. Adding the following configuration to hive-site.xml solves this problem:

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse/default</value>
  <description>location of default database for the warehouse</description>
</property>

And modify the execution permissions:

hdfs dfs -chmod g+w /user/hive/warehouse/default

Create a new table in the default database for testing:

It can be found that the newly created tables in the default database appear in the default directory on hdfs:

2.4.4 Configure the storage location of Hive running log information

By default, Hive log information is stored in the /tmp/root/ directory:

To put the log information in a location of your choice, modify the hive-log4j.properties file. One caveat: some readers have found that this configuration file does not exist in Hive's conf directory. In that case, you need to perform the following steps before modifying the configuration file:
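The missing file usually exists only as a template, so copy it first (assuming $HIVE_HOME points at your Hive installation; on Hive 2.x/3.x the template is named hive-log4j2.properties.template instead):

```shell
cd $HIVE_HOME/conf
cp hive-log4j.properties.template hive-log4j.properties
```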

At this point, you can modify the corresponding parameters in the configuration file:
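For reference, the parameter to change looks roughly like the following (the target directory is just an example; in the log4j2-based file shipped with newer Hive versions, the key is property.hive.log.dir):

```properties
hive.log.dir=/opt/module/hive/logs
```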

2.4.5 Priority of parameter configuration

In the parameter configuration of Hive, there are three parameter configuration methods, as follows.

1. Through configuration files. The default configuration file is hive-default.xml, and the user-defined configuration file is hive-site.xml; user-defined settings override the defaults. In addition, Hive also reads the Hadoop configuration, because Hive is started as a Hadoop client, and the Hive configuration overrides the Hadoop configuration. Settings in the configuration files apply to all Hive processes started on the machine.

2. Through set on the command line, for example: set mapred.reduce.tasks=100;. This changes the setting only temporarily; it is lost the next time Hive starts.

3. Through the -hiveconf parameter when starting Hive, for example: hive -hiveconf mapred.reduce.tasks=10. Likewise, this change is temporary and is lost the next time Hive starts.

The priority of the three methods above, from lowest to highest, is: configuration file < command-line parameter (-hiveconf) < parameter declaration (set).
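A minimal sketch of checking and overriding a parameter inside the Hive CLI, using the mapred.reduce.tasks example from above (any parameter name works the same way):

```sql
-- show the current value of the parameter
set mapred.reduce.tasks;

-- override it for the current session only
set mapred.reduce.tasks=100;

-- verify the new value
set mapred.reduce.tasks;
```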

 

This article is coming to an end. If you ran into any problems along the way, feel free to leave a comment and tell me what you encountered~

Origin blog.csdn.net/gdkyxy2013/article/details/111029760