Big data study notes (4)

1. Hive

1.1 Data Warehouse

1.1.1 Database and Data Warehouse

1) A database is transaction-oriented, while a data warehouse is subject-oriented;
2) A database generally stores current business data, while a data warehouse generally stores historical data;
3) Database design avoids redundancy as much as possible and is generally tailored to a specific business application; data warehouse design deliberately introduces redundancy and follows the analysis requirements, analysis dimensions, and analysis metrics;
4) A database is designed to capture data, while a data warehouse is designed to analyze data.

Data warehouses arose in a world that already had large numbers of databases; they exist to further mine those data resources so that data users can make decisions.

1.1.2 The layered architecture of the data warehouse

[Figure: the layered architecture of a data warehouse]
As the architecture diagram shows, the bottom layer is the data source layer: the data in a warehouse generally comes from several different sources, which may be document files or databases. The middle layer is the data warehouse itself, and the top layer is the data application layer. Data flows into the warehouse from the bottom up and then supports the applications above, so the data warehouse can be understood as an intermediate, integrated data management platform.

Getting data from the various sources into the warehouse, together with the conversion and movement of data inside the warehouse, constitutes the process of ETL (Extract, Transform, Load). ETL is the pipeline of the data warehouse and can be thought of as its blood supply: it keeps the data in the warehouse metabolizing, and most of the effort in day-to-day warehouse management and maintenance goes into keeping ETL running normally and stably.
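For instance, here is a minimal sketch of an ETL-style step expressed in HiveQL (the staging_orders and dw_orders tables and their columns are hypothetical): rows are extracted from a raw staging table, transformed, and loaded into a warehouse table.

-- Extract from a staging table, transform (cleanse and derive fields),
-- and load the result into a partition of a warehouse table
INSERT OVERWRITE TABLE dw_orders PARTITION (dt = '2020-06-06')
SELECT
    order_id,
    upper(trim(customer_name)) AS customer_name,  -- cleanse
    price * quantity           AS total_amount    -- derive
FROM staging_orders
WHERE order_date = '2020-06-06';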

1.1.3 Metadata Management of Data Warehouse

Metadata mainly records the definitions of the models in the data warehouse, the mapping relationships between the various layers, the data status of the warehouse, and the running status of ETL tasks. It is generally stored and managed in a unified metadata repository.

Metadata is an important part of the data warehouse management system. Metadata management runs through the entire process of data warehouse construction and directly affects the construction, use and maintenance of the data warehouse.

Its main functions:

  • It defines the mappings from the source systems to the data warehouse, the data conversion rules, the logical structure of the warehouse, the data update rules, the history of data imports, and the loading schedule;
  • When the warehouse is used, users access data through the metadata, which clarifies the meaning of data items and supports customized reports;
  • Managing the scale and complexity of a data warehouse depends on correct metadata management, including adding and removing external data sources, changing data cleaning methods, controlling erroneous queries, and scheduling backups;
Metadata can be divided into technical metadata and business metadata. Technical metadata serves the IT staff who develop and manage the data warehouse; it describes the data involved in warehouse development, management, and maintenance, including data source information, data conversion descriptions, the warehouse model, data cleaning and update rules, data mappings, and access permissions. Business metadata serves management and business analysts; it describes data from a business perspective, including business terms, what data the warehouse contains, where the data is located, and its availability, helping business users better understand what data is available in the warehouse and how to use it.

As this shows, metadata not only defines the schema, sources, and extraction and conversion rules of the data in the warehouse, but is also the foundation on which the entire warehouse system runs: it links the otherwise loose components of the system into an organic whole.

1.2 Introduction to Hive

1.2.1 Basic concepts of Hive

Hive is a data warehouse tool built on top of Hadoop, used mainly for data extraction, transformation, and loading. Hive can map structured data files to tables and provides SQL query capabilities: it converts SQL statements into MapReduce jobs for execution, with HDFS underneath providing the data storage.
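For example, a minimal sketch (the file path, table, and columns are hypothetical): a tab-delimited file on HDFS is mapped to a table, and an ordinary SQL query over it is compiled into a MapReduce job.

-- Map a tab-delimited file on HDFS to a table
CREATE EXTERNAL TABLE page_views (user_id STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/page_views';

-- This aggregation runs as a MapReduce job over the files in HDFS
SELECT url, count(*) AS pv FROM page_views GROUP BY url;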

1.2.2 Basic architecture of Hive

[Figure: the basic architecture of Hive]

  • User interfaces: include the CLI, JDBC/ODBC, and a web GUI. The CLI is the shell command line; JDBC/ODBC allow Java and other clients to connect to Hive through its drivers; the web GUI accesses Hive through a browser.
  • Metadata storage: usually a relational database such as MySQL or Derby. Hive keeps its metadata there; it includes table information, column information, partition information, and the directory where each table's data is stored.
  • Interpreter, compiler, optimizer, executor: take an HQL statement through lexical analysis, syntax analysis, compilation, optimization, and query plan generation. The generated query plan is stored in HDFS and subsequently executed as MapReduce jobs.
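To see what the compiler and optimizer produce, Hive's EXPLAIN command prints the generated query plan (shown here on the hypothetical page_views table from the sketch above):

-- Print the stages of the generated plan, including the map/reduce stages
explain select url, count(*) as pv from page_views group by url;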

1.2.3 The relationship between Hive and traditional databases

Aspect | Hive | RDBMS
Query language | HQL | SQL
Data storage | HDFS | raw device or local file system
Execution | MapReduce | its own executor
Data latency | high | low
Data scale | large | small
Use cases | statistical analysis of data | persistent business data storage

1.3 Hive installation

  • Step 1: Download the compressed package and unzip it;
cd /export/softwares/
tar -zxvf apache-hive-3.1.0-bin.tar.gz -C ../servers/
  • Step 2: Modify the hive configuration file;
cd /export/servers/apache-hive-3.1.0-bin/conf
cp hive-env.sh.template hive-env.sh

# Set HADOOP_HOME
HADOOP_HOME=/export/servers/hadoop-3.1.1
# Path to the Hive configuration files
export HIVE_CONF_DIR=/export/servers/apache-hive-3.1.0-bin/conf

hive-site.xml:

cd /export/servers/apache-hive-3.1.0-bin/conf
vim hive-site.xml

Configuration file content:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
	<property>
		<name>javax.jdo.option.ConnectionUserName</name>
		<value>root</value>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionPassword</name>
		<value>123456</value>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionURL</name>
		<value>jdbc:mysql://node03:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
	</property>
	<property>
		<name>javax.jdo.option.ConnectionDriverName</name>
		<value>com.mysql.jdbc.Driver</value>
	</property>
	<property>
		<name>hive.metastore.schema.verification</name>
		<value>false</value>
	</property>
	<property>
		<name>datanucleus.schema.autoCreateAll</name>
		<value>true</value>
	</property>
	<property>
		<name>hive.server2.thrift.bind.host</name>
		<value>node03.hadoop.com</value>
	</property>
</configuration>
  • Step 3: Configure hive environment variables;
# Edit the system profile
sudo vim /etc/profile

export HIVE_HOME=/export/servers/apache-hive-3.1.0-bin
export PATH=$HIVE_HOME/bin:$PATH
  • Step 4: Install MySQL (via the MariaDB packages) on node03 and start the MySQL service;
# Install the MySQL-related packages
yum install mariadb mariadb-server
 
# Start the MySQL service
systemctl start mariadb

# Set the root password and other security options
mysql_secure_installation

# Grant privileges to the user (run inside the mysql client)
grant all privileges on *.* to 'root'@'%' identified by 'root' with grant option;
flush privileges;
  • Step 5: Add the mysql driver package to the lib directory of hive;
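For example (a sketch; the jar's file name depends on the connector version you downloaded):

# Copy the MySQL JDBC driver into Hive's lib directory
cp mysql-connector-java-5.1.38.jar /export/servers/apache-hive-3.1.0-bin/lib/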

  • Step 6: Start the Hadoop services, then run the hive command. If the hive> prompt appears, the installation succeeded.

1.4 Basic operation of Hive

1.4.1 Database operations

  • Create database
create database [if not exists] database_name;
  • Create a database and specify the storage location
create database database_name location 'hdfs_path';

For example:
create database myhive location '/myhive';

The default storage location for Hive databases is specified by the hive.metastore.warehouse.dir parameter in the hive-site.xml configuration file.
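You can verify the value from the Hive CLI: issuing set with a parameter name prints its current value, and /user/hive/warehouse is the usual out-of-the-box default.

-- Print the effective warehouse location
set hive.metastore.warehouse.dir;
-- typical output: hive.metastore.warehouse.dir=/user/hive/warehouse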

  • Modify the database

Hive can only modify some basic properties of a database; it cannot modify its metadata (such as the database name or location).

alter database database_name set dbproperties('property_name'='property_value');
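For example (the property name and value here are arbitrary):

alter database myhive set dbproperties('createtime'='20200606');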
  • View the details of the database
desc database [extended] database_name;

If the extended parameter is specified, the detailed information of the database will be displayed.

  • Delete database
drop database database_name [cascade];

If the database is not empty, you need to specify the cascade parameter.

1.4.2 Table operations

1.4.2.1 The basic syntax of creating a table

The basic syntax for creating a table:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
  • EXTERNAL: creates an external table. For an external table you specify the actual storage path of the data with LOCATION, so dropping the table deletes only its metadata, never the data. If EXTERNAL is not specified, an internal (managed) table is created by default, and dropping it deletes both the metadata and the data.
  • ROW FORMAT: sets the data format. When building the table, the user can rely on a built-in SerDe or specify a custom one. If neither ROW FORMAT nor ROW FORMAT DELIMITED is given, the built-in SerDe is used. The user also defines the table's columns, and Hive uses the SerDe to determine the column data (see the sketch after this list);

SerDe is short for Serialize/Deserialize. Hive performs table reads and writes through serialization and deserialization. Serialization converts the Java objects Hive works with into a byte sequence that can be written to HDFS, or into a stream format that other systems can recognize; deserialization converts a string or binary stream back into Java objects Hive can work with. For example, a select statement deserializes the data it reads from HDFS, while an insert statement serializes the data before writing it to HDFS.

  • STORED AS: the storage format of the file. Use STORED AS TEXTFILE for plain text; if the data should be compressed, use STORED AS SEQUENCEFILE. Hive's default storage format is TEXTFILE, which can be changed via the hive.default.fileformat configuration parameter;
  • CLUSTERED BY: buckets the table by some column(s). Buckets divide the data at a finer granularity than partitions: Hive takes the hash of the column value and its remainder modulo the number of buckets to decide which bucket a record is stored in;
  • PARTITIONED BY: creates a partitioned table. The partition field names must not clash with the table's column names, otherwise an error is reported;
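As mentioned above, here is a minimal sketch of ROW FORMAT with an explicit SerDe (OpenCSVSerde ships with Hive; the table and column names are made up for illustration):

-- Parse CSV files with the built-in OpenCSVSerde
CREATE TABLE csv_users (id STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"')
STORED AS TEXTFILE;

Note that OpenCSVSerde treats every column as a STRING, so values must be cast for numeric processing.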

1.4.2.2 Field Type

Type | Description | Example | Version notes
BOOLEAN | true/false | TRUE |
TINYINT | 1-byte signed integer, -128~127 | 1Y |
SMALLINT | 2-byte signed integer, -32768~32767 | 1S |
INT | 4-byte signed integer | 1 |
BIGINT | 8-byte signed integer | 1L |
FLOAT | 4-byte single-precision floating-point number | 1.0 |
DOUBLE | 8-byte double-precision floating-point number | 1.0 | DOUBLE PRECISION alias from Hive 2.2.0+
DECIMAL | arbitrary-precision signed decimal | 1.0 | Hive 0.11.0+ (38 digits); precision and scale definable from Hive 0.13.0+
NUMERIC | same as DECIMAL | 1.0 | Hive 3.0.0+
STRING | variable-length string | "a" or 'a' |
VARCHAR | variable-length string | "a" or 'a' | Hive 0.12.0+
CHAR | fixed-length string | "a" or 'a' | Hive 0.13.0+
BINARY | byte array | | Hive 0.8.0+
TIMESTAMP | timestamp with millisecond precision | 1287897987312 | Hive 0.8.0+
DATE | date | '2020-06-06' | Hive 0.12.0+
INTERVAL | time interval | | Hive 1.2.0+
ARRAY | array of elements of the same type | array(1,2,3,4,5) | Hive 0.14+
MAP | collection of key-value pairs | map('a',1,'b',2) | Hive 0.14+
STRUCT | structure holding fields of different types | struct(1,'Xiaobai',18) |
UNIONTYPE | a value from a limited set of types | | Hive 0.7.0+

For specific types, please refer to: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
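The complex types are easiest to see in a table definition. A small sketch (all table and column names are hypothetical):

-- A table mixing primitive and complex types
CREATE TABLE person (
    name    STRING,
    friends ARRAY<STRING>,
    scores  MAP<STRING, INT>,
    address STRUCT<city:STRING, street:STRING>
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';

-- Index arrays, look up map keys, and use dot notation for structs
SELECT name, friends[0], scores['math'], address.city FROM person;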

1.4.2.3 Create Table Operation

use myhive;

-- Create a simple table
create table company(id int, name string);

-- Create a table and specify the field delimiter
create table if not exists company2 (id int, name string)
row format delimited fields terminated by '\t';

-- Create tables based on an existing table t1
create table t2 as select * from t1;   -- copies both the structure and the data
create table t3 like t1;               -- copies only the structure

1.4.2.4 Partitioned Tables

When a table holds a very large amount of data, we can partition its data, for example by date. This improves the efficiency of data queries.

  • Create a partition table:
-- Create a partitioned table
create table score(s_id string,c_id string, s_score int) partitioned by (year string, month string) row format delimited fields terminated by '\t' location '/score';

Above we created a partitioned table score, partitioned by the fields year and month. Rows with the same year and month values are stored in the same partition.

It should be emphasized that the partition columns year and month are not real columns stored in the table; they are pseudo-columns that we define ourselves. Hive does not allow an existing table column to be used as a partition column.

  • View the partition table directory:
hdfs dfs -ls /score/year=2020/month=6/
  • View partition:
show partitions score;
  • Load data to the partition table:
load data local inpath '/export/data/score.csv' into table score partition (year='2020',month='6');
  • View partition data:
select * from score where year = '2020' and month = '6';

Querying partition data means using the partition fields as query conditions. Once the partitions are specified, Hive no longer scans the whole table but reads directly from the specified partitions, which improves query efficiency.
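To confirm which partitions a query will actually read, one option is Hive's EXPLAIN DEPENDENCY command, which lists a query's input tables and partitions:

-- Only the partition year=2020/month=6 should appear under input_partitions
explain dependency select * from score where year = '2020' and month = '6';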

  • Add partition:
alter table score add partition(year='2020', month='5');
  • Delete partition:
alter table score drop partition(year='2020', month='5');

1.4.2.5 Bucketed Tables

The difference from partitioning is that partition columns are not actual columns in the table, whereas bucketing is based on actual table columns: the data is divided into different buckets according to the specified field.

  • How does Hive determine which bucket the data is allocated to?

Hive takes the hash of the column value modulo the number of buckets to decide which bucket a record is stored in. For example, if the name column is divided into 3 buckets, the hash of each name value is taken modulo 3 and the record is placed according to the result: 0 goes to the first file, 1 to the second, 2 to the third, and so on.
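A rough way to see the assignment for yourself (a sketch, not Hive's exact internal formula; hash() and pmod() are Hive built-in functions, and course/c_id follow the bucketing examples below):

-- Compute the bucket each c_id value would map to with 3 buckets
select c_id, pmod(hash(c_id), 3) as bucket_no from course;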

If you want to use the bucketing function, you must first enable bucketing:

set hive.enforce.bucketing=true;

Then set the number of reducers:

set mapreduce.job.reduces = 3;

When creating a table, specify the number of buckets:

-- When creating the table, bucket by the c_id column into 3 buckets
create table ... clustered by(c_id) into 3 buckets row format delimited fields terminated by '\t';

Insert data into the bucket table:

insert overwrite table course select * from course_two cluster by (c_id);

View bucket data:

-- View the data in the first bucket
select * from course tablesample(bucket 1 out of 3 on c_id);

1.4.2.6 Other Table Operations

  • Modify the table:
-- Rename a table
alter table old_table_name rename to new_table_name;

-- Show the table structure
desc tablename;

-- Add columns
alter table tablename add columns (column_name column_type, ...);

-- Drop the table
drop table tablename;

Source: blog.csdn.net/zhongliwen1981/article/details/106515900