Data warehouse notes

 

1. Data warehouse

2. Hive introduction

3. Hive operations

4. Hive parameters

5. Hive functions (UDF)

6. Hive data compression

7. Hive storage formats

8. Combining compression and storage

9. Hive tuning

 

1. Data Warehouse

Data warehouse: a system for storing large amounts of historical data. Abbreviated DW or DWH (data warehouse); it provides decision support for the business.

A data warehouse neither produces nor consumes data; it only stores it.

Analogy with a granary: it neither produces nor consumes grain, it only stores grain.

Data warehouse features:

1) Subject-oriented: data from different systems is brought together and analyzed around a business subject.

2) Integrated: data from multiple subsystems is integrated before analysis.

3) Non-volatile: data in the warehouse is rarely deleted.

4) Time-variant: the data is updated (appended to) over time.

The difference between a data warehouse and a database:

OLTP (On-Line Transaction Processing): handles transactional operations on live data; this is what databases are mainly used for.

OLAP (On-Line Analytical Processing): analyzes historical data to provide decision support for the enterprise; this is what data warehouses are used for.

Data warehouse layered architecture:

1) Source data layer (ODS layer, Operational Data Store): the raw, original data.

2) Data warehouse layer (DW): data that has been processed by ETL and is ready for analysis.

ETL: Extract, Transform, Load

3) Data application layer (APP): visual presentation of the analysis results (charts and reports).

Reasons for layering:

Trade space for time: intermediate layers are stored so that later queries run faster.

Data warehouse metadata management:

Metadata: the correspondence (mapping) between the data and the data model.

 

2. Hive basic concepts

1) Hive introduction

Hive is a data warehouse tool on Hadoop for analyzing structured data. It uses a SQL-like language, HQL (Hive SQL).

Structured data: e.g. data in a relational database, or text files with a fixed format split by delimiters.

Semi-structured data: e.g. JSON, XML.

Unstructured data: e.g. video.

Hive data storage: the data is actually stored on HDFS.

Hive SQL queries: the SQL is ultimately converted into MapReduce jobs that carry out the query.
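The generated execution plan can be inspected with explain; for example (using the score table from the examples below):

explain select count(1) from score;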

Why use Hive:

​ select * from user;

​ select * from product left join order on product.id = order.pid;

Hive lets you complete analyses like these in SQL instead of hand-writing MapReduce code.

Hive features:

1) Scalability

The Hadoop cluster can scale out by adding nodes.

2) Extensibility

Related functionality can be extended, for example with UDFs.

3) Fault tolerance

If a node has problems, the SQL can still run to completion.

Hive architecture

1) Hive user interfaces

Hive clients: shell/CLI, JDBC.

2) Hive parser

Compiler: parses and splits the SQL.

Optimizer: optimizes the SQL execution plan.

Executor: runs the query and returns the results.

3) Hive execution engine and storage:

Execution: MapReduce.

Storage: the data ultimately lands on HDFS.

The relationship between Hive and Hadoop

Queries are executed as MapReduce jobs; data is stored on HDFS.

Comparing Hive with traditional databases

Hive SQL is really for offline analysis, not transactions.

Hive data storage

Hive data is ultimately stored on HDFS.

Hive storage formats: TextFile, SequenceFile, Parquet, ORC.

Hive installation (see the separate document)

Hive interaction methods

1) bin/hive client connection:

bin/hive

2) Hive JDBC connection

bin/hive --service hiveserver2 # to use a JDBC connection, the hiveserver2 service must be started first

Start it in the background:

nohup bin/hive --service hiveserver2 2>&1 &

bin/beeline

!connect jdbc:hive2://node03:10000 # Hive's JDBC URL

Hive username: the same as the user that installed Hadoop, e.g. root

Hive password: any password

Note: before using beeline, connect once with bin/hive so that the metadata tables are generated automatically in the MySQL metastore.

3) Hive command-line parameters

hive -e "select * from test;" # run an inline SQL statement

hive -f hive.sql # run a SQL script file
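For reference, a minimal hive.sql script for the -f option might look like this (a sketch; it assumes the myhive database and a test table exist):

use myhive;

select * from test;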

3. Hive basic operations

1. Create a database

​ create database if not exists myhive;

​ use myhive;

The default database storage location is controlled by this hive-site.xml property:

<property>

<name>hive.metastore.warehouse.dir</name>

<value>/user/hive/warehouse</value>

</property>

You can change the default location by editing hive-site.xml.

Specify the database storage location when creating it:

create database myhive2 location '/myhive2';

Modify a database

Note: only database properties can be modified; the database name and storage location cannot be changed.

​ alter database myhive2 set dbproperties('createtime'='20180611');

View information:

​ desc database myhive2;

​ desc database extended myhive2;

 

Drop a database:

drop database myhive2 cascade; # cascade also drops any tables inside the database
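Without cascade, the drop succeeds only when the database is empty:

drop database myhive2; # fails if the database still contains tables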

2) Table operations

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name # EXTERNAL creates an external table

[(col_name data_type [COMMENT col_comment], ...)] # columns and their types

[COMMENT table_comment] # table comment

[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] # PARTITIONED BY partitions the table: splits it into folders

[CLUSTERED BY (col_name, col_name, ...) # CLUSTERED BY buckets the table: splits it into files

[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] # SORTED BY sorts rows within each bucket

[ROW FORMAT row_format] # ROW FORMAT sets the row format; the default field delimiter is '\001'

[STORED AS file_format] # STORED AS sets the storage format

[LOCATION hdfs_path] # LOCATION sets the data storage path

1) Managed tables (internal tables)

Managed tables are managed by Hive: when the table is dropped, its data is deleted as well.

A table created without the external keyword is an internal table.

Create a table and specify the delimiter between fields:

create table if not exists stu2(id int, name string) row format delimited fields terminated by '\t' stored as textfile location '/user/stu2';
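To confirm whether a table is managed or external, desc formatted shows its Table Type (MANAGED_TABLE or EXTERNAL_TABLE):

desc formatted stu2;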

 

2) External tables

An external table's data is not managed by Hive itself: dropping the table does not delete the data.

Creating an external table requires the external keyword.

Create the teacher external table:

​ create external table teacher (t_id string,t_name string) row format delimited fields terminated by '\t';

Create the student external table:

​ create external table student (s_id string,s_name string,s_birth string , s_sex string ) row format delimited fields terminated by '\t';

Data loading:

​ load data local inpath '/export/servers/hivedatas/student.csv' into table student;

Adding overwrite overwrites the existing data:

​ load data local inpath '/export/servers/hivedatas/student.csv' overwrite into table student;

Load data from HDFS:

​ load data inpath '/hivedatas/teacher.csv' into table teacher;

Partitioned tables:

Data is divided into directories according to the specified field. The core idea: divide and conquer.

Syntax: partitioned by (month string)

Create a partitioned table:

​ create table score(s_id string,c_id string,s_score int) partitioned by (month string) row format delimited fields terminated by '\t';

Loading data:

​ load data local inpath '/export/servers/hivedatas/score.csv' into table score partition (month='201806');

Querying several partitions together can be done with union all:

select * from score where month = '201806' union all select * from score where month = '201805';

 

View partitions:

show partitions score;

Adding a partition

​ alter table score add partition(month='201805');

Delete partition

​ alter table score drop partition(month = '201806');

 

Requirement: a file score.csv is stored in the cluster directory /scoredatas/month=201806, and a new file is generated every day and stored under the folder for the corresponding date. The files are shared with other users and must not be moved. The task: create a matching Hive table, load the data into it, run the statistical analysis, and then drop the table without the data being deleted.

1) Use an external table (dropping it keeps the data).

2) Use a partitioned table: month=201806.

3) Specify the storage directory.

create external table score3(s_id string, c_id string, s_score int) partitioned by (month string) row format delimited fields terminated by '\t' location '/scoredatas';

Repair the table, i.e. establish the mapping between the table and the data files already under its directory:

msck repair table score3;
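A quick sanity check that the partitions were registered (a sketch, assuming the score3 table above):

show partitions score3;

select count(1) from score3 where month = '201806';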

Bucketed tables

The data is divided by the specified field into different files.

Syntax: clustered by (id) into 3 buckets

Enable Hive's bucketing feature:

​ set hive.enforce.bucketing=true;

Set the number of reducers (to match the bucket count):

​ set mapreduce.job.reduces=3;

Create the bucketed table:

​ create table course (c_id string,c_name string,t_id string) clustered by(c_id) into 3 buckets row format delimited fields terminated by '\t';

Data cannot be loaded into a bucketed table with load; instead it is inserted indirectly, by querying it from a normal table.

Create a regular table

​ create table course_common (c_id string,c_name string,t_id string) row format delimited fields terminated by '\t';

Load the data into the bucketed table with insert overwrite:

​ insert overwrite table course select * from course_common cluster by(c_id);
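One payoff of bucketing is efficient sampling: HiveQL's tablesample clause can read a single bucket instead of the whole table. A sketch against the course table above:

select * from course tablesample(bucket 1 out of 3 on c_id);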

 

Other table operations

1) Rename a table:

​ alter table score4 rename to score5;

2) Add columns:

​ alter table score5 add columns (mycol string, mysco string);

3) View the table structure:

desc score5;

4) Update a column:

​ alter table score5 change column mysco mysconew int;

5) View the table structure again:

desc score5;

6) Drop a table:

​ drop table test;

Loading data

1) Load data with the load statement:

​ load data local inpath '/export/servers/hivedatas/score.csv' overwrite into table score partition(month='201806');

2) Load data via a query (insert ... select):

create table score4 like score;

insert overwrite table score4 partition(month = '201806') select s_id,c_id,s_score from score;

Multi-insert mode: read the source table once and write into several tables.

Source table score has columns s_id, c_id, s_score.

Sub-table score_first takes s_id, c_id; sub-table score_second takes c_id, s_score.

from score

insert overwrite table score_first partition(month='201806') select s_id,c_id

insert overwrite table score_second partition(month = '201806') select c_id,s_score;
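This assumes the target tables score_first and score_second already exist; a sketch of matching definitions (column types taken from the score table):

create table score_first (s_id string, c_id string) partitioned by (month string) row format delimited fields terminated by '\t';

create table score_second (c_id string, s_score int) partitioned by (month string) row format delimited fields terminated by '\t';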

Empty a table's data:

​ truncate table score6;

2. Hive query syntax

SELECT [ALL | DISTINCT] select_expr, select_expr, ... # columns to return

FROM table_reference # table to query

[WHERE where_condition] # filter conditions

[GROUP BY col_list [HAVING condition]] # grouping, then filtering of the groups

[CLUSTER BY col_list # CLUSTER BY: distribute and sort by the same column

| DISTRIBUTE BY col_list # DISTRIBUTE BY: partition the query data across reducers

# ORDER BY: global ordering (only one reducer, globally sorted)

# SORT BY: partial ordering, sorts within each partition

# DISTRIBUTE BY is used together with SORT BY

# CLUSTER BY(id) = DISTRIBUTE BY(id) + SORT BY(id)

]

[LIMIT number] # limit the number of rows returned

 

Full-table query

select * from score;

Query specific columns:

select s_id ,c_id from score;

Column aliases

1) Rename a column.

2) Ease of calculation.

3) The alias follows the column name; the keyword AS can optionally be added between the column name and the alias.

select s_id as myid ,c_id from score;

 

Common Functions

1) Find the total number of rows (count)

select count(1) from score;

2) Find the maximum score (max)

​ select max(s_score) from score;

3) Find the minimum score (min)

select min(s_score) from score;

4) Find the sum of the scores (sum)

select sum(s_score) from score;

5) Determine the average score (avg)

​ select avg(s_score) from score;

 

LIMIT statement

A typical query can return many rows of data; the LIMIT clause restricts the number of rows returned.

select * from score limit 3;

 

WHERE statement

1) The WHERE clause filters out the rows that do not satisfy the conditions.

2) The WHERE clause comes immediately after the FROM clause.

3) Practical example

Query the data with a score greater than 60:

select * from score where s_score > 60;

 

LIKE and RLIKE

1) Use the LIKE operator to select similar values.

2) The selection criteria may contain characters or numbers:

% represents zero or more characters (any characters).

_ represents exactly one character.

3) RLIKE is a Hive extension of this clause: it specifies matching conditions with Java regular expressions, a more powerful matching language.

4) Practical examples

(1) Find all scores beginning with 8:

​ select * from score where s_score like '8%';

(2) Find all scores whose second digit is 9:

select * from score where s_score like '_9%';

(3) Find all scores that contain a 9:

​ select * from score where s_score rlike '[9]';
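Because RLIKE takes full Java regular expressions, more precise matches are possible; for example, scores from 80 to 89 (a sketch):

select * from score where s_score rlike '^8[0-9]$';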

 

 

GROUP BY statement

The GROUP BY statement is usually used together with aggregate functions: the result is grouped by one or more columns, and an aggregation is then performed on each group.

Practical examples:

(1) Calculate the average score for each student

​ select s_id ,avg(s_score) from score group by s_id;

(2) Calculate the highest score for each student:

select s_id ,max(s_score) from score group by s_id;

HAVING statement

1) The difference between having and where

(1) where acts on the columns of the table to select rows; having acts on the columns of the query result to filter the data.

(2) Aggregate functions cannot be written after where; they can be used after having.

(3) having can only be used with group by grouping statistics statements.

2) Practical examples:

Find the average score of each student:

​ select s_id ,avg(s_score) from score group by s_id;

Find the students whose average score is greater than 85:

select s_id ,avg(s_score) avgscore from score group by s_id having avgscore > 85;

 

JOIN

1) Equi-joins

For associations between multiple tables, Hive supports only equi-joins, e.g. join on o.pid = p.id.

2) Table aliases:

​ select * from teacher t join course c on t.t_id = c.t_id;

Inner join (INNER JOIN)

Inner join: only rows that match the join condition in both tables are kept.

select * from teacher t inner join course c on t.t_id = c.t_id;

Left outer join (LEFT OUTER JOIN)

Left outer join: all records from the table on the left of the JOIN operator that satisfy the WHERE clause are returned.

Query the courses corresponding to each teacher:

select * from teacher t left join course c on t.t_id = c.t_id;

 

Right outer join (RIGHT OUTER JOIN)

Right outer join: all records from the table on the right of the JOIN operator that satisfy the WHERE clause are returned.

select * from teacher t right join course c on t.t_id = c.t_id;

Full outer join (FULL OUTER JOIN)

Full outer join: returns all records from both tables that satisfy the conditions of the WHERE clause. If a table has no matching value for a given field, NULL is used in its place.

SELECT * FROM teacher t FULL JOIN course c ON t.t_id = c.t_id ;

 

9. Sorting

1) Global ordering

order by: the number of reducers must be one; the ordering is global.

Sorting by an alias

Order by the average score:

​ select s_id ,avg(s_score) avg from score group by s_id order by avg;

Sorting multiple columns

Order by the student id and then by the average score:

​ select s_id ,avg(s_score) avg from score group by s_id order by s_id,avg;

2) sort by: partial ordering (within each reducer)

(1) Set the number of reducers:

​ set mapreduce.job.reduces=3;

(2) View the number of reducers:

​ set mapreduce.job.reduces;

(3) Order the query results by score:

​ select * from score sort by s_score;

(4) Write the query results, ordered by score, to a local directory:

​ insert overwrite local directory '/export/servers/hivedatas/sort' select * from score sort by s_score;

 

3) Partitioned sorting (DISTRIBUTE BY)

distribute by: sends rows to particular reducers according to the specified field for processing.

Set the number of reducers, so that s_id values are divided among the corresponding reducers:

​ set mapreduce.job.reduces=7;

Partition the data with distribute by and sort within each partition:

​ insert overwrite local directory '/export/servers/hivedatas/sort' select * from score distribute by s_id sort by s_score;

4) cluster by

Partitions by the specified field and sorts within each partition (ascending only).

When distribute by and sort by use the same field, cluster by can be used instead.
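For example, the following two queries are equivalent (a sketch using the score table):

select * from score distribute by s_id sort by s_id;

select * from score cluster by s_id;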

4. Hive parameter configuration

Parameters can generally be set in three ways:

1) Configuration file: hive-site.xml

2) Command-line arguments:

​ bin/hive -hiveconf hive.root.logger=INFO,console

3) Parameter declaration (the set command):

​ set mapred.reduce.tasks=100;

This setting is session-level scope.

Order of precedence:

parameter declaration > command-line arguments > configuration file (hive-site.xml)

5. Hive functions

1) Built-in functions

(1) List the built-in functions:

​ hive> show functions;

(2) Show how to use a built-in function:

​ hive> desc function upper;

(3) Show the detailed usage of a built-in function:

​ hive> desc function extended upper;

2) Hive custom functions

(1) UDF (User-Defined Function): one row in, one row out.

(2) UDAF (User-Defined Aggregate Function): many rows in, one row out.

(3) UDTF (User-Defined Table-Generating Function): one row in, many rows out.

Used together with lateral view explode().
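A quick illustration of one-in-many-out with the built-in UDTF explode (a sketch; the person table and its hobbies array column are hypothetical):

select explode(split('a,b,c', ',')); # returns three rows: a, b, c

select name, hobby from person lateral view explode(hobbies) t as hobby;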
