Big data Hive knowledge points explained in plain language and in detail, Lao Liu is really attentive (1)


Foreword: Lao Liu dares not claim his writing is great, but he does promise to explain the knowledge points he has reviewed in plain language and in as much detail as possible, and he refuses to repeat the rote wording in the materials without forming his own understanding!

01 Hive knowledge points (1)

Point 1: The concept of data warehouse

Since Hive is a data warehouse tool built on Hadoop, Lao Liu will first cover some data warehouse basics and then move on to Hive.

A data warehouse, as the name suggests, is a warehouse for storing data. A warehouse is different from a factory: a warehouse only stores things, it does not produce or consume them.

To put it succinctly, the data warehouse itself neither produces nor consumes data. The data comes from outside and is provided for outside use. It is mainly used for data analysis and provides support for enterprise decision-making.

Point 2: Characteristics of Data Warehouse

Data warehouse has 4 characteristics:

Subject-oriented: a data warehouse is built with a purpose, around the subjects it is meant to analyze;

Integrated: the data it uses is brought together from different sources;

Non-volatile: in other words, the data inside generally does not change once it is loaded;

Time-variant: that is to say, as time goes on, the data in the warehouse and the analyses run on it will also change.

Point 3: The difference between data warehouse and database

As you can see from the data warehouse concept mentioned earlier, the two are quite different.

First, as an example, every transaction a customer makes at a bank is written into the database and recorded, which is equivalent to using the database for bookkeeping.

The data warehouse is the data platform of the analysis system. It obtains data from the transaction system, summarizes and processes it, and provides decision-makers with some basis for decision-making.

For example: how many transactions did a certain branch of a certain bank handle this month, and what is the branch's current deposit balance? If deposits are low but consumer transactions are frequent, it may be necessary to set up an ATM in that area.

The next thing to say is that the difference between a database and a data warehouse is essentially the difference between OLTP and OLAP.

Operational processing, i.e. OLTP (online transaction processing), also called a transaction-oriented processing system, handles the day-to-day online operations of a specific business in the database, usually querying and modifying individual records. What people care about here is response time, data security, integrity, and concurrency.

Analytical processing, i.e. OLAP (online analytical processing), generally analyzes historical data around certain subjects to support management decision-making.

To sum up, the emergence of data warehouses is not intended to replace databases.

The database is a transaction-oriented design, and the data warehouse is a subject-oriented design.

The database generally stores business data, and the data warehouse generally stores historical data.

The database is designed to capture data, and the data warehouse is designed to analyze data.

Another point: a data warehouse is built when a large number of databases already exist, in order to further mine data resources and support decision-making.

Point 4: Data warehouse layering

First of all, the data warehouse can be divided into three layers:

Source Data Layer (ODS): It is mainly used to keep our original data;

Data warehouse layer (DW): It mainly cleans the data from the source data layer and then uses it for data analysis. Most of the work is written in this layer.

Data Application Layer (APP): It is mainly used for various displays of data.

So why do you want to layer the data warehouse?

First of all, think about how we usually solve a very complex problem: don't we break it down into many small problems? And isn't each small problem relatively easy compared with the big one?

To summarize, layering the data warehouse is equivalent to splitting one complex job into multiple simple jobs. The processing logic of each layer is relatively simple and easy to understand, so it is easier to ensure the correctness of each step; even if the data contains an error, we can relatively easily find where it occurred and fix it quickly.

Layering the data warehouse also trades space for time: a large amount of preprocessing is done up front so that the overall system becomes more efficient.
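
To make the layering more concrete, here is a minimal HiveQL sketch. The database and table names (ods, dw, app, ods_order and so on) are hypothetical, purely to show how data flows from one layer to the next:

-- Hypothetical layered databases, names are for illustration only
create database if not exists ods;
create database if not exists dw;
create database if not exists app;

-- ODS: keep the raw data exactly as it arrived
create table if not exists ods.ods_order(order_id string, amount string, create_time string)
row format delimited fields terminated by '\t';

-- DW: clean the ODS data, e.g. cast types and drop bad rows
create table if not exists dw.dw_order as
select order_id, cast(amount as double) as amount, create_time
from ods.ods_order
where order_id is not null;

-- APP: aggregate the cleaned data for display
create table if not exists app.app_order_daily as
select substr(create_time, 1, 10) as dt, sum(amount) as total_amount
from dw.dw_order
group by substr(create_time, 1, 10);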

Point 5: What is hive

In a nutshell, because MapReduce code is complex to write, Hive is a tool that converts SQL statements into MapReduce tasks, which greatly simplifies MR development.

It can also be said that Hive's main job is to translate the SQL statements we write into MR tasks that run on YARN; Hive can be simply understood as a client for MR.
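
As a tiny illustration, a one-line aggregation like the one below would otherwise require a hand-written mapper and reducer; Hive compiles it into a MapReduce job for us. The score table (columns s_id, c_id, s_score) is the one used later in this article, so take it as an assumed example here:

-- Hive turns this group-by into an MR job behind the scenes
select c_id, avg(s_score) as avg_score
from score
group by c_id;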

Point 6: The difference between hive and database

There are many differences; the main one worth mentioning is that Hive does not support record-level insert, delete, and update operations.

Earlier versions of Hive did not support inserts, deletes, or updates at all and only supported queries; current versions support them as well.

But in actual work these record-level inserts, deletes, and updates are not used; only the query operation select is used.
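
For reference, and as an assumption about the "current versions support them" statement above: record-level update and delete in Hive only work on transactional (ACID) tables, which must be stored as ORC, are usually bucketed, and require Hive transactions to be enabled in the cluster configuration. A rough sketch, not something you would normally do in practice:

-- Requires ACID support (transaction manager, concurrency) enabled in hive-site.xml
create table student_acid(id int, name string)
clustered by (id) into 2 buckets
stored as orc
tblproperties('transactional'='true');

insert into student_acid values(1, 'zhangsan');
update student_acid set name = 'lisi' where id = 1;
delete from student_acid where id = 1;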

For the rest, go search by yourself.

Hive only has the appearance of a SQL database; its application scenarios are completely different. Since its execution engine, MapReduce, is extremely slow, Hive is only suitable for offline data processing.

Point 7: hive architecture

User interface: provides users with various ways to access Hive: through JDBC, through hiveserver2, or through the hive shell (see the example after this list);

Parser: mainly used to parse the SQL syntax;

Compiler: compiles the parsed SQL into MR tasks;

Optimizer: provides a certain amount of optimization, automatically tuning the SQL statements we write, though its tuning ability is limited;

Executor: submits the MR tasks to YARN for execution;

The underlying Hadoop: data is stored on HDFS, computation is done with MR, and it runs on YARN.
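
As a small example of the user-interface part, here are two common ways of reaching Hive. The host name node03 is an assumption, and 10000 is simply the usual default hiveserver2 port:

# Open the hive shell directly
hive

# Connect to hiveserver2 over JDBC with beeline (host and port are assumptions)
beeline -u jdbc:hive2://node03:10000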

Point 8: Hive data types

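Hive's commonly used data types include the primitive types int, bigint, float, double, string, boolean, and timestamp, plus the complex types array, map, and struct. Here is a small sketch; the table name and columns are made up purely for illustration:

create table if not exists myhive.type_demo(
  id int,
  name string,
  score double,
  enrolled boolean,
  hobbies array<string>,
  scores map<string,int>,
  address struct<city:string,street:string>
);
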
Point 9: DDL operation of hive

Some people may think Hive's DDL operations can always be looked up on Baidu or in the materials later, so there is no need to memorize them at all. But in Lao Liu's opinion, at least a few commonly used commands must be remembered. How embarrassing would it be to have to search Baidu every time!

First talk about hive database operations:

1. Create a database
create database if not exists db_hive;
2. Show all databases
show databases;
3. Query for a database by name
show databases like 'gmall';
4. View database details
desc database gmall;
5. Show extended database details
desc database extended gmall;
6. Switch the current database
use gmall;
7. Drop a database
If the database to drop may not exist, it is best to use if exists to check whether it exists first
drop database if exists gmall;
If there are tables in the database, cascade is needed to force-drop it
drop database if exists gmall cascade;

Next, let's talk about Hive's table DDL operations:

Hive has a formal table-creation syntax, but Lao Liu does not recommend reading the full syntax directly; it is best to learn it step by step through examples.

Hive table creation is divided into internal tables and external tables. First, we will talk about creating internal tables.

1. Create a table directly
First switch to the database you want to use
use myhive;
create table stu(id int,name string);

2. Create a table from an AS query: the result of the subquery is stored in the new table, so it comes with data
create table if not exists myhive.stu1 as select id, name from stu;

3. Create a table based on an existing table's structure
create table if not exists myhive.stu2 like stu;

4. Query the table's type
desc formatted myhive.stu;

Querying the table's type with desc formatted returns a lot of information, which will be described later. Don't worry!

In general, the most common practice is to create an internal table while specifying the field separator, the file storage format, and the data storage location. Note that the data storage location refers to a location on HDFS; don't get this wrong. Yes, Lao Liu got it wrong at first!

The creation code is as follows:

create table if not exists myhive.stu3(id int ,name string)
row format delimited fields terminated by '\t' stored as textfile location '/user/stu2';

Now let's create an external table. First we must know what an external table is and how it differs from an internal table.

Because an external table points at data under some other HDFS path that is loaded into the table, Hive does not consider itself the exclusive owner of that data, so when the table is dropped the data remains on HDFS and is not deleted.

When creating an external table, you need to add the external keyword. The location clause is optional: if specified, it is the specific directory where the data is stored; if not, the default directory /user/hive/warehouse is used.

The creation code is as follows:

create external table myhive.teacher (t_id string,t_name string) row format delimited fields terminated by '\t';

Summarize the difference between internal and external tables:

1. The external keyword needs to be added when the external table is created.

2. When an internal table is dropped, both the table's metadata and its actual data are deleted; when an external table is dropped, only the table's metadata is deleted, the actual data is still there, and the table can be recovered later.

So when do we generally use internal and external tables?

Because dropping an internal table also deletes the HDFS data files, create an internal table when you are sure the table is for your own exclusive use and nobody else needs it; if other people also need to use the table's data files, create an external table.

Generally, external tables are used in the ODS layer of the data warehouse, and internal tables are used in the DW layer of the data warehouse.

After the table is created, how do we import data? Generally, the load method is used to load data into an internal or external table, rather than insert.

Data can be loaded either from the local file system or from HDFS. Note that "local" here refers to the Linux file system.

① Load data from the local system to the table

First create a directory on the local system, place the data file in it, and then load the file into the table.

mkdir -p /kkb/install/hivedatas
load data local inpath '/kkb/install/hivedatas/teacher.csv' into table myhive.teacher;

Note that importing from the local system requires adding the local keyword;

② Import data from hdfs

First create a directory on HDFS, upload the data file to it, and then load the file into the table.

hdfs dfs -mkdir -p /kkb/hdfsload/hivedatas
hdfs dfs -put teacher.csv /kkb/hdfsload/hivedatas
# Execute inside the hive client
load data inpath '/kkb/hdfsload/hivedatas' overwrite into table myhive.teacher;

Point 10: hive partition table

Partitioning in Hive means dividing the data by directory: table data is stored in different folders, and later queries read only the relevant directories instead of doing a full scan, which improves query efficiency.

Create partition table syntax:

create table score(s_id string,c_id string, s_score int) partitioned by (month string) row format delimited fields terminated by '\t';

Create a table with multiple levels of partitions:

create table score2 (s_id string,c_id string, s_score int) partitioned by (year string,month string,day string) row format delimited fields terminated by '\t';

The next step is to load data into the partitioned table. Lao Liu thinks this needs to be mastered, so please take it seriously!

Load data into the partition table

load data local inpath '/kkb/install/hivedatas/score.csv' into table score partition(month='201806');

Load data into the multi-level partitioned table

load data local inpath '/kkb/install/hivedatas/score.csv' into table score2 partition(year='2018',month='06',day='01');
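
After loading, you can list the partitions and let queries read only the partition directories they need; a couple of quick checks using the tables created above:

-- List the partitions registered for the table
show partitions score;

-- Only the month='201806' directory is scanned, not the whole table
select * from score where month = '201806';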

Point 11: Comprehensive exercises

This is one part of the materials that Lao Liu thinks is genuinely good. After going through so many basic DDL operations, the only way to remember them faster is to do a small exercise. Below is a small exercise on basic Hive operations.

Requirement description: there is a file score.csv with three fields, s_id string, c_id string, and s_score int, all separated by \t, stored in the cluster directory /scoredatas/day=20180607. A new file is generated every day and stored under the corresponding date folder. The file must be shared with others and cannot be moved. As required, create a corresponding Hive table, load the data into the table, perform statistical analysis, and make sure that dropping the table does not delete the data.

From these requirements we can tell that an external partitioned table is needed. What's interesting is that when Lao Liu looked at the materials, the approach is not to create the table first and then import the data, but to import the data first and then create the table, which is quite interesting.

cd /kkb/install/hivedatas/
hdfs dfs -mkdir -p /scoredatas/day=20180607
hdfs dfs -put score.csv /scoredatas/day=20180607/

After the data file is uploaded, create an external partitioned table and specify the data storage directory.

create external table score4(s_id string, c_id string,s_score int) partitioned by (day string) row format delimited fields terminated by '\t' location '/scoredatas';

Querying the data, we find that there is no data in the table.
Here is the situation: if we first create the partitioned table and then import data through load, the table will definitely contain data;

But if the file is placed directly in the corresponding location on HDFS, then even though the storage location specified by the table matches the location of the uploaded data, no partition metadata has been recorded in MySQL, so the table shows no data. You need to repair the table so that the metadata in MySQL is refreshed.
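
A minimal sketch of that repair, using the score4 table from this exercise: either let Hive rescan the table's directory and register any missing partitions, or add the partition by hand.

-- Scan the table location on HDFS and register missing partitions in the metastore
msck repair table score4;

-- Or add the single partition explicitly
alter table score4 add partition(day='20180607') location '/scoredatas/day=20180607';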

One more knowledge point: Hive's metadata lives in MySQL. There is a hive database in MySQL that stores the corresponding metadata tables, 53 tables in total.
Of course, as Lao Liu said before, if you create the table first and then import data through load, the table will definitely contain data.

02 Summary

Hive knowledge points are mainly about practice; a lot of hands-on work during learning is needed to truly master them. Lao Liu has tried his best to explain the first part of Hive knowledge in plain language, hoping it helps everyone. If you have anything to say, you can directly contact the official account: Lao Liu who works hard!

If you think Lao Liu wrote well, please give me a thumbs up!
