1. Data Warehouse
2. Hive Introduction
3. Hive Operations
4. Hive Parameters
5. Hive Functions (UDF)
6. Hive Data Compression
7. Hive Storage Formats
8. Combining Compression and Storage
9. Hive Tuning
1. Data Warehouse
Data warehouse: used for storing large amounts of historical data. Abbreviated DW or DWH (Data Warehouse); it provides decision support for the business.
A data warehouse neither produces nor consumes data; it only stores data.
Analogy: a granary neither consumes nor produces grain; it only stores grain.
Data warehouse characteristics:
1) Subject-oriented: data from across the systems is integrated and analyzed around business subjects.
2) Integrated: data from multiple subsystems is combined for analysis.
3) Non-volatile: data in the warehouse is not easily deleted.
4) Time-variant: the data is updated over time.
Differences between a data warehouse and a database:
OLTP (On-Line Transaction Processing): processes transactional data; the typical workload of a database.
OLAP (On-Line Analytical Processing): analyzes historical data and provides decision support for the enterprise; the typical workload of a data warehouse.
Layered data warehouse architecture:
1) Data source layer (ODS layer, operational data store): the rawest data.
2) Data warehouse layer: data processed through ETL, plus the analyzed data.
ETL: Extract, Transform, Load.
3) Data application layer: visual presentation of the analysis results (charts and reports).
Why layer:
Trade space for time.
Data warehouse metadata management:
Metadata: the correspondence (mapping) between the data and the data model.
2. The Basic Concepts of Hive
1) Hive introduction
Hive is a data warehouse tool built on Hadoop for analyzing structured data. It uses a SQL-like language, HQL (Hive SQL).
Structured data: data like that in a relational database, or text in a fixed format (fields separated by delimiters).
Semi-structured data: e.g. JSON, XML.
Unstructured data: e.g. video.
Data storage in Hive: the data is actually stored on HDFS.
SQL queries in Hive: the SQL is ultimately converted into MapReduce jobs that perform the query.
Why use Hive:
select * from user;
select * from product left join order on product.id = order.pid;
With Hive, data analysis can be done entirely in SQL.
Hive features:
1) Scalable
Hadoop cluster nodes can be scaled out.
2) Extensible
Related functionality can be extended, e.g. with UDFs.
3) Fault-tolerant
If a node has problems, SQL can still be executed.
Hive architecture
1) Hive user interfaces
Hive clients: shell/CLI, JDBC.
2) Hive parser
Compiler: splits and parses the SQL.
Optimizer: optimizes the SQL execution plan.
Executor: queries the data and returns the results.
3) Hive execution engine and storage:
Execution: MapReduce.
Storage: the data is ultimately stored on HDFS.
Relationship between Hive and Hadoop:
Execution is done by MapReduce; storage is on HDFS.
Comparison between Hive and traditional databases:
Hive SQL is in practice used for offline analysis.
Hive data storage:
Hive data is ultimately stored on HDFS.
Hive storage formats: TextFile, SequenceFile, Parquet, ORC.
Hive installation (see the documentation)
Interacting with Hive
1) bin/hive
Connect with the bin/hive client.
2) Hive JDBC connection
bin/hive --service hiveserver2   # to use a JDBC connection, the hiveserver2 service must be started first
Start it in the background:
nohup bin/hive --service hiveserver2 2>&1 &
bin/beeline
!connect jdbc:hive2://node03:10000   # Hive's JDBC URL
Hive username: the same user that installed Hadoop, e.g. root
Hive password: any password
Note: before using beeline, connect once with bin/hive so that the metadata is automatically generated in the MySQL database.
3) Passing parameters to Hive
hive -e "select * from test;"
hive -f hive.sql   # execute a script file
3. Basic Hive Operations
1. Create a database
create database if not exists myhive;
use myhive;
The default database storage location:
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
The default storage location can be changed in hive-site.xml.
Create a database at a specified storage location:
create database myhive2 location '/myhive2';
Modify a database
Note: only a database's properties can be modified; its name and storage location cannot be changed.
alter database myhive2 set dbproperties('createtime'='20180611');
View information:
desc database myhive2;
desc database extended myhive2;
Delete the database:
drop database myhive2 cascade;   # cascade also drops any tables in the database
2) Table operations
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name   # EXTERNAL creates an external table
[(col_name data_type [COMMENT col_comment], ...)]   # fields and field types
[COMMENT table_comment]   # table comment
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]   # PARTITIONED BY: partitioned table, one folder per partition
[CLUSTERED BY (col_name, col_name, ...)   # CLUSTERED BY: bucketed table, one file per bucket
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]   # SORTED BY: sort order within buckets
[ROW FORMAT row_format]   # ROW FORMAT: field delimiter; the default is '\001'
[STORED AS file_format]   # STORED AS: storage format
[LOCATION hdfs_path]   # LOCATION: storage path
1) Managed tables (internal tables)
Tables managed by Hive: when the table is dropped, the data is also deleted.
A table created without the external keyword is an internal table.
Create a table and specify the field delimiter:
create table if not exists stu2(id int ,name string) row format delimited fields terminated by '\t' stored as textfile location '/user/stu2';
2) External tables
The data is shared by more than just Hive. Dropping the table does not delete the data.
Creating an external table requires the external keyword.
Create the teacher external table:
create external table teacher (t_id string,t_name string) row format delimited fields terminated by '\t';
Create the student external table:
create external table student (s_id string,s_name string,s_birth string , s_sex string ) row format delimited fields terminated by '\t';
Data loading:
load data local inpath '/export/servers/hivedatas/student.csv' into table student;
Adding overwrite replaces the existing data:
load data local inpath '/export/servers/hivedatas/student.csv' overwrite into table student;
Load data that is already on HDFS:
load data inpath '/hivedatas/teacher.csv' into table teacher;
Partitioned tables:
Data is divided into directories according to a specified field. Core idea: divide and conquer.
Syntax: partitioned by (month string)
Create a partitioned table:
create table score(s_id string,c_id string,s_score int) partitioned by (month string) row format delimited fields terminated by '\t';
Loading data:
load data local inpath '/export/servers/hivedatas/score.csv' into table score partition (month='201806');
A query across partitions can be implemented with union all:
select * from score where month = '201806' union all select * from score where month = '201805';
View the partitions:
show partitions score;
Adding a partition
alter table score add partition(month='201805');
Delete partition
alter table score drop partition(month = '201806');
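Partitions can also be nested on several fields, each adding one more directory level. A sketch, assuming a hypothetical score2 table with year/month/day fields (not part of the examples above):

```sql
-- Hypothetical multi-level partitioned table; each load must name every
-- partition field, producing nested year=/month=/day= directories on HDFS.
create table score2(s_id string, c_id string, s_score int)
partitioned by (year string, month string, day string)
row format delimited fields terminated by '\t';

load data local inpath '/export/servers/hivedatas/score.csv'
into table score2 partition(year='2018', month='06', day='01');
```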
Requirement: a file score.csv is stored on the cluster under the directory /scoredatas/month=201806. A new file is generated every day and stored under the folder for the corresponding date. The files are shared with other users and must not be moved. Task: create a corresponding Hive table, load the data into it for analysis, and ensure that dropping the table does not delete the data.
1) Use an external table.
2) Use a partitioned table: month=201806.
3) Specify the storage directory.
create external table score4(s_id string, c_id string, s_score int) partitioned by (month string) row format delimited fields terminated by '\t' location '/scoredatas';
Repair the table, i.e. establish the mapping between the table and its data files:
msck repair table score4;
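The same mapping can also be established manually, one partition at a time, instead of letting msck repair discover it:

```sql
-- Manually register the partition directory that already exists on HDFS;
-- equivalent to what msck repair discovers automatically for this layout.
alter table score4 add partition(month='201806');
```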
Bucketed tables
The data is divided according to a specified field into different files.
Syntax: clustered by (c_id) into 3 buckets
Enable Hive's bucketing feature:
set hive.enforce.bucketing=true;
Set the number of reducers:
set mapreduce.job.reduces=3;
Create the bucketed table:
create table course (c_id string,c_name string,t_id string) clustered by(c_id) into 3 buckets row format delimited fields terminated by '\t';
Data cannot be loaded into a bucketed table with load; it must be inserted indirectly by querying data from a normal table.
Create a regular table
create table course_common (c_id string,c_name string,t_id string) row format delimited fields terminated by '\t';
Load the data into the bucketed table with insert overwrite:
insert overwrite table course select * from course_common cluster by(c_id);
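One practical use of bucketing is sampling: Hive's tablesample clause can read just one bucket instead of the whole table. A sketch against the course table above:

```sql
-- Read only the 1st of the 3 buckets of course.
-- "bucket x out of y" picks bucket x when y equals the table's bucket count.
select * from course tablesample(bucket 1 out of 3 on c_id);
```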
Other table operations
1) Rename a table:
alter table score4 rename to score5;
2) Add columns:
alter table score5 add columns (mycol string, mysco string);
3) View the table structure:
desc score5;
4) Update a column:
alter table score5 change column mysco mysconew int;
5) View the table structure:
desc score5;
6) Drop a table:
drop table test;
Loading data
1) Load data with the load command:
load data local inpath '/export/servers/hivedatas/score.csv' overwrite into table score partition(month='201806');
2) Load data via a query:
create table score4 like score;
insert overwrite table score4 partition(month = '201806') select s_id,c_id,s_score from score;
Multi-insert mode:
Source table: score (s_id, c_id, s_score)
Split into: score_first (s_id, c_id) and score_second (c_id, s_score)
from score
insert overwrite table score_first partition(month='201806') select s_id,c_id
insert overwrite table score_second partition(month = '201806') select c_id,s_score;
Empty table data:
truncate table score6;
2. Hive query syntax
SELECT [ALL | DISTINCT] select_expr, select_expr, ...   # columns to query
FROM table_reference   # table to query
[WHERE where_condition]   # filter condition
[GROUP BY col_list [HAVING condition]]   # group, then filter the groups
[CLUSTER BY col_list
| DISTRIBUTE BY col_list   # DISTRIBUTE BY: partition the query data
# ORDER BY: global ordering (only one reducer)
# SORT BY: partial ordering within each partition
# DISTRIBUTE BY is used together with SORT BY
# CLUSTER BY (id) = DISTRIBUTE BY (id) + SORT BY (id)
]
[LIMIT number]   # LIMIT: restrict the number of rows returned
Full-table query
select * from score;
Select specific columns in the query
select s_id ,c_id from score;
Column aliases
1) Rename a column.
2) Ease calculation.
3) The alias follows the column name; the keyword AS can be added between the column name and the alias.
select s_id as myid ,c_id from score;
Common functions
1) Total row count (count):
select count(1) from score;
2) Maximum score (max):
select max(s_score) from score;
3) Minimum score (min):
select min(s_score) from score;
4) Sum of scores (sum):
select sum(s_score) from score;
5) Average score (avg):
select avg(s_score) from score;
LIMIT statement
A typical query can return many rows of data. The LIMIT clause restricts the number of rows returned.
select * from score limit 3;
WHERE statement
1) The WHERE clause filters out rows that do not satisfy the condition.
2) The WHERE clause comes immediately after the FROM clause.
3) Hands-on example:
Query the scores greater than 60:
select * from score where s_score > 60;
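Besides >, the usual comparison operators also work in where; a few sketches on the same score table (the value ranges are illustrative):

```sql
-- Scores in a closed range:
select * from score where s_score between 80 and 100;
-- Scores in an explicit set of values:
select * from score where s_score in (80, 90);
-- Rows with a missing score:
select * from score where s_score is null;
```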
LIKE and RLIKE
1) Use the LIKE operator to select similar values.
2) The selection criteria can contain characters or numbers:
% matches zero or more characters (any characters).
_ matches exactly one character.
3) RLIKE is a Hive extension that lets you specify matching conditions with the more powerful Java regular expression language.
4) Hands-on examples:
(1) Find all scores beginning with 8:
select * from score where s_score like '8%';
(2) Find all scores whose second character is 9:
select * from score where s_score like '_9%';
(3) Find all scores containing a 9:
select * from score where s_score rlike '[9]';
GROUP BY statement
The GROUP BY statement is usually used together with aggregate functions: rows are grouped by one or more columns, and an aggregation is applied to each group.
Hands-on examples:
(1) Compute each student's average score:
select s_id ,avg(s_score) from score group by s_id;
(2) Compute each student's highest score:
select s_id ,max(s_score) from score group by s_id;
HAVING statement
1) Differences between having and where:
(1) where acts on the columns of the table and selects rows; having acts on the columns of the query result and filters it.
(2) Aggregate functions cannot be written after where, but can be used after having.
(3) having can only be used with group by statements.
2) Hands-on example:
Average score per student:
select s_id ,avg(s_score) from score group by s_id;
Students whose average score is greater than 85:
select s_id ,avg(s_score) avgscore from score group by s_id having avgscore > 85;
JOIN
1) Equi-joins
For joins between multiple tables, Hive supports only equi-joins: join ... on o.pid = p.id
2) Table aliases:
select * from teacher t join course c on t.t_id = c.t_id;
Inner join (INNER JOIN)
Inner join: only the rows in both tables that match the join condition are kept.
select * from teacher t inner join course c on t.t_id = c.t_id;
Left outer join (LEFT OUTER JOIN)
Left outer join: all records of the table on the left of the JOIN operator that satisfy the WHERE clause are returned.
Query each teacher's corresponding courses:
select * from teacher t left join course c on t.t_id = c.t_id;
Right outer join (RIGHT OUTER JOIN)
Right outer join: all records of the table on the right of the JOIN operator that satisfy the WHERE clause are returned.
select * from teacher t right join course c on t.t_id = c.t_id;
Full outer join (FULL OUTER JOIN)
Full outer join: all records of both tables that satisfy the WHERE clause are returned. Where a table has no matching value for a field, it is replaced with NULL.
SELECT * FROM teacher t FULL JOIN course c ON t.t_id = c.t_id ;
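Joins can also be chained across more than two tables. A sketch joining teacher, course, and score (assuming score's c_id matches course.c_id, as in the tables loaded above):

```sql
-- Chained left joins: teachers, their courses, and the scores in them.
select *
from teacher t
left join course c on t.t_id = c.t_id
left join score s on c.c_id = s.c_id;
```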
9. Sorting
1) Global ordering
order by: the number of reducers must be one; the result is globally ordered.
Sorting by an alias:
Order by the average score:
select s_id ,avg(s_score) avg from score group by s_id order by avg;
Sorting by multiple columns
Order by student id and average score:
select s_id ,avg(s_score) avg from score group by s_id order by s_id,avg;
2) Sort By: partial ordering
1) Set the number of reducers:
set mapreduce.job.reduces=3;
2) View the number of reducers:
set mapreduce.job.reduces;
3) Sort the query results by score:
select * from score sort by s_score;
4) Write the query results (sorted by score) into files:
insert overwrite local directory '/export/servers/hivedatas/sort' select * from score sort by s_score;
3) Partitioned sorting (DISTRIBUTE BY)
distribute by: sends rows to reducers for processing according to the specified field.
Set the number of reducers, so that rows are divided among the reducers by s_id:
set mapreduce.job.reduces=7;
Partition the data with distribute by:
insert overwrite local directory '/export/servers/hivedatas/sort' select * from score distribute by s_id sort by s_score;
4) CLUSTER BY
Partitions by the specified field and sorts by it (ascending only).
When the distribute by field and the sort by field are the same, cluster by can be used instead.
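The equivalence can be written out directly; the following two queries distribute and order the rows the same way:

```sql
select * from score cluster by s_id;
-- is equivalent to:
select * from score distribute by s_id sort by s_id;
```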
4. Hive Shell Parameters
Parameters can be set in three ways:
1) The configuration file hive-site.xml.
2) Command-line arguments:
bin/hive -hiveconf hive.root.logger=INFO,console
3) Parameter declarations:
set mapred.reduce.tasks=100;
These settings have session-level scope.
Order of precedence:
parameter declarations > command-line arguments > configuration file (hive-site.xml)
5. Hive Functions
1) Built-in functions
1) Show the built-in functions:
hive> show functions;
2) Show the usage of a built-in function:
hive> desc function upper;
3) Show the detailed usage of a built-in function:
hive> desc function extended upper;
2) Hive custom functions
1) UDF (User-Defined Function): one row in, one row out.
2) UDAF (User-Defined Aggregate Function): many rows in, one row out.
3) UDTF (User-Defined Table-Generating Function): one row in, many rows out.
e.g. lateral view explode()
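Once a custom function has been packaged into a jar, it is registered from the Hive session with add jar and create temporary function. The jar path, function name, and class name below are placeholders, not real artifacts:

```sql
-- Hypothetical jar path and class name, for illustration only.
add jar /export/servers/hivedatas/my_udf.jar;
create temporary function my_upper as 'com.example.hive.MyUpperUDF';
-- Use it like any built-in function:
select my_upper(s_id) from score;
```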