Hive summary and optimization

1. What is Hive?

Hive is a data warehouse tool built on Hadoop. It maps structured data files onto database tables and provides a simple SQL query capability, translating SQL statements into MapReduce jobs for execution. Its advantage is a low learning cost: simple MapReduce-style statistics can be produced quickly with SQL-like statements, without developing dedicated MapReduce applications, which makes it well suited to statistical analysis over a data warehouse.
Hive is a data warehouse infrastructure built on top of Hadoop. It provides a set of tools for data extraction, transformation, and loading (ETL), and a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. Hive defines a simple SQL-like query language called HQL, which lets users familiar with SQL query the data. The same language also lets developers familiar with MapReduce plug in custom mappers and reducers for complex analysis tasks that the built-in ones cannot handle.

2. Hive's three modes
Databases: Access, Visual FoxPro, SQL Server, MySQL, SQLite, PostgreSQL, Oracle;
Local mode: connects to an in-memory Derby database; generally used for unit tests.
Single-user mode: connects to a database over the network; this is the most commonly used mode.
Multi-user mode (remote server mode): used by non-Java clients to access the metastore. A MetaStoreServer is started on the server side, and clients reach the metastore database through it using the Thrift protocol.

3. Hive components
Metastore: stores the metadata (databases and table structures, columns).
  By analogy with a file: everything other than the file's content is its metadata (in HDFS, file metadata is kept on the NameNode).
CLI: the client, Hive's command-line shell.
JDBC: the driver used to connect programmatically.
WebGUI: a browser-based interface.


There are three main user interfaces: CLI, JDBC, and WebGUI. The most commonly used is the CLI; when the CLI starts, a copy of Hive starts with it. The client is the Hive client and connects to a HiveServer; when starting in client mode you must point to the node where HiveServer runs, and HiveServer must already be started on that node. The WUI accesses Hive through a browser.
Hive stores its metadata (databases, tables) in the tables of a real database such as MySQL or Derby. Hive metadata includes the table name, the table's columns and partitions and their attributes, table attributes (whether it is an external table, etc.), and the directory where the table's data lives. The database (MySQL) does not store Hive's records.
The interpreter, compiler, and optimizer take a SQL query through lexical analysis, syntax analysis, compilation, optimization, and query-plan generation. The generated query plan is stored in HDFS and then executed by calling MapReduce.
Hive data is stored in HDFS. Most of the querying and computation is done by MapReduce (trivial queries such as select * from tbl do not generate MapReduce jobs).
Metadata (the databases and tables you see in Hive) lives in the real database (MySQL); the records live on HDFS. In ordinary MySQL, by contrast, tables, records, and databases are all stored in the local file system (e.g. NTFS).
Hive's client connects to the server using the Thrift protocol, playing a role similar to HTTP/HTTPS; the payloads transferred are relatively small.

4. Type

 

5. CRUD
# rename a table
alter table old_table_name rename to new_table_name;
# rename a column
alter table table_name change old_column_name new_column_name data_type;
# add new columns (replace columns works the same way)
alter table table_name
add columns
(
sex smallint,
updateTime timestamp
);

6. File format
MySQL's underlying storage file formats:
ibd: the table engine is InnoDB; transactions are supported by default;
myd, myi, sdi: the engine is MyISAM; transactions are not supported;
# start a transaction
start transaction;
# ... any number of SQL statements (updates, inserts, deletes) ...
# commit: confirm everything is fine and make the changes permanent
commit;
# rollback: undo the changes
rollback;

7. Hive underlying file formats
Avro: a row-oriented format;
ORC: an optimized row-columnar format;
Parquet: a columnar format supporting nested (parent-child) structures;

Summary: the higher the compression ratio, the more disk space is saved, but CPU and memory (machine performance) and time are spent compressing and decompressing.
The lower the compression ratio (or with no compression at all), the more space is wasted, but there is no compression/decompression cost, which saves machine performance.


The load data command only applies when the table's file format is textFile.
If the table's format is textFile, you can upload a txt file directly into the directory the table points to and query it right away;
you can even create the directory first, upload the file, and then create the table.
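
A minimal sketch of this, with a hypothetical table and path (adjust names and delimiters to your data):

-- textFile is the default storage format
create table t_log (id int, msg string)
row format delimited fields terminated by ',' stored as textfile;
-- load a local txt file into the table's directory
load data local inpath '/root/log.txt' into table t_log;
-- or put a file into the table's HDFS directory first and query it directly
-- hdfs dfs -put log.txt /user/hive/warehouse/t_log/
select * from t_log;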

8. View the table structure
desc table_name;
desc extended table_name;
desc formatted table_name;
 
9. Create table --as
10. Create table --like
# A table created with like has the same structure as the source table but contains no data; create table ... as select copies both the structure and the data.
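
A minimal sketch of both (a_king is the example table used in the window-function examples below):

-- as: copies both the structure and the data of the query result
create table a_king_copy as select * from a_king;
-- like: copies only the structure; the new table is empty
create table a_king_like like a_king;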
11. Truncate
truncate table: all the records in the table are removed by deleting the data files directly.
delete: equivalent to removing some of the content inside the file.
12. Create a table with complex types
-- address: a container (array) with a generic element type
address array<string>,
-- hobby: a map container with generic key and value types
hobby map<string,string>,

-- how array elements are split
collection items terminated by '-'
-- how map keys and values are split
map keys terminated by ':'

The array delimiter and the map delimiter are not tied to any particular column: every array column is split with '-', and every map uses ':' between key and value.
A map is a set of key-value pairs: ':' separates a key from its value, and '-' separates one key-value pair from the next.
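
A minimal sketch of a full table definition using these delimiters (table and column names are made up for the example):

create table t_person (
  name string,
  address array<string>,
  hobby map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
stored as textfile;
-- a matching line of data would look like:
-- tom,beijing-shanghai,sport:basketball-music:piano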

13. Partition
Partition rules: specify the columns to partition by; partition columns must not also appear inside the parentheses of the column list. There can be multiple partition columns:
partitioned by (sex string, address string)
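
A minimal sketch of a partitioned table and of loading one partition (hypothetical names and path):

create table t_user_part (id int, name string)
partitioned by (sex string, address string)
row format delimited fields terminated by ',';
-- load data into one specific partition
load data local inpath '/root/user.txt'
into table t_user_part partition (sex='m', address='beijing');
-- filtering on the partition columns avoids a full-table scan
select * from t_user_part where sex = 'm' and address = 'beijing';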

14. Window functions
-- the query adds a computed value to each record in the result
-- the result set is divided into groups (partitions)
-- lead: looks downward; the value of a following row is placed on the current row
-- lag: looks upward; by default it goes back one row, and the offset can be specified
select *, lag(name, 2) over (partition by dynastyId) from a_king;
over: the window specification; it takes a snapshot of the result set of the currently executed SQL statement and the function is computed over those results.
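
For comparison, a lead call under the same assumptions (a_king and dynastyId as in the lag example above):

-- show each king together with the name of the next king in the same dynasty
select *, lead(name, 1) over (partition by dynastyId order by id) from a_king;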
Analysis functions:

-- Requirement: count the kings per dynasty
select dynastyId, count(*) from a_king group by dynastyId;
-- row_number: group the query results and number the rows within each group
select *, row_number() over (partition by dynastyId) from a_king;
-- for each dynasty, list only the first two emperors (top-n)
select * from (select *, row_number() over (partition by dynastyId) as bh from a_king) t
where bh < 3;
select *, row_number() over (partition by dynastyId order by id desc) as bh from a_king;

15. Special queries
grouping sets;
cube;
rollup;
The GROUPING__ID function:
for each column, if that column has been aggregated in a given row, the value 1 is generated for that row in the result set, otherwise the value is 0.
When a column is not being grouped on, its value is displayed as null, which can conflict with genuine null values in that column, so we need a way to tell whether a null means "not part of this grouping" or "the value really is null". (Work out the permutations and combinations and you will see that grouping__id is effectively the binary encoding of which columns are being aggregated.)
Suppose the table has three columns a, b, c, and the statistics only group by a and b (group by a, b). Column c takes no part, so Hive displays c as null. The problem: if c itself is null, it also shows up as null and cannot be distinguished; and if c originally had a value, it is shown as null anyway, which is misleading because it effectively hides the original data.
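
A minimal sketch of grouping sets together with GROUPING__ID, on a hypothetical table t_stat(a, b, c):

select a, b, GROUPING__ID, count(*)
from t_stat
group by a, b
grouping sets ((a, b), a, b, ());
-- equivalent to the union of group by (a, b), group by a, group by b, and a global total;
-- GROUPING__ID encodes which columns were aggregated in each row, so a null that means
-- "not grouped on this column" can be told apart from a value that is really null.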

16. UDF / UDTF / UDAF
UDF: User Defined Function; an ordinary function: one row in, one row out.
UDAF: User Defined Aggregate Function, e.g. count, sum, avg; many input rows, one output row.
UDTF: User Defined Table-Generating Function; one input row, many output rows.
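
The built-in functions follow the same three categories; a quick sketch against the hypothetical t_person table from the complex-types example above:

-- UDF style: one row in, one row out
select upper(name) from t_person;
-- UDAF style: many rows in, one row out
select count(*) from t_person;
-- UDTF style: one row in, many rows out (explode flattens the hobby map into key/value rows)
select explode(hobby) from t_person;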

What if you want to execute multiple SQL statements? Write them into a .sql file:
bin/hive -f '/root/demo.sql'
-S: silent mode, suppresses the extra messages printed during execution;
# execute a sql file stored on HDFS
bin/hive -f hdfs://jh/hw/hive_sql.sql


Hive (ETL: extraction, transformation, loading) optimization:
1. HQL statement optimization:
1. After strict mode is enabled, three kinds of query are restricted:
   1. A query on a partitioned table without a filter on the partition field cannot be executed.
   2. order by without a limit clause cannot be executed.
   3. Cartesian product queries are restricted: a join must have an on (or where) condition.
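   A minimal sketch of turning strict mode on for a session (on recent Hive versions the equivalent knobs are the hive.strict.checks.* properties; check your version):

   set hive.mapred.mode=strict;
   -- examples of statements that are now rejected:
   -- select * from t_user_part;          -- partitioned table without a partition filter
   -- select * from a_king order by id;   -- order by without limit
   -- select * from a_king, t_person;     -- Cartesian product join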
2. Learn to read explain and explain extended; the latter additionally prints the abstract syntax tree.
  What the HQL parser does is turn the HQL statement into an abstract syntax tree.
  A stage is a MapReduce job; a job can contain multiple map and reduce tasks.
  An HQL statement contains one or more stages, and the stages depend on each other.
  By default Hive executes one stage at a time; stages without dependencies between them can also be set to run in parallel.
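  For example, with the a_king table from the window-function section:

  explain select dynastyId, count(*) from a_king group by dynastyId;
  explain extended select dynastyId, count(*) from a_king group by dynastyId;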
3. Limit performance tuning:
  whether to enable limit optimization
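  A sketch of the related switch (property name as commonly documented; confirm it against your Hive version):

  set hive.limit.optimize.enable=true;   -- let simple limit queries sample the input instead of scanning it all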
4. Join optimization:
   1. Always let the small table drive the large table.
   2. For Cartesian product joins, try to add where filtering (in strict mode it is mandatory).
   3. When querying a partitioned table, add the partition field filter to the where condition.
5. group by: grouping, generally used with aggregate functions;
   having further filters the grouped result set.
   order by: global sort.
   sort by: local sort (within each reducer).
6. distinct: deduplication.
7. union all: concatenates multiple result sets without deduplicating or re-sorting;
   union: concatenates multiple result sets, deduplicating and sorting.
8. Job number optimization:
   by default Hive generates a job for every query, subquery, or group by statement;
   keep the number of jobs as small as possible.
9. Sort as little as possible:
    sort operations consume a lot of CPU and lengthen the SQL response time.

10. Avoid select * where possible:
    select only the fields you actually use, to avoid unnecessary waste of resources.
11. Prefer join over sub-queries:
    although join performance is not great, it is much better than MySQL-style sub-queries.
12. Prioritize optimizing high-concurrency SQL rather than the "big" SQL that runs infrequently:
    in terms of destructive power, high-concurrency SQL always hits faster and gives the system no breathing room,
    while low-frequency SQL at least leaves us time to react.
13. Optimize globally rather than making one-sided adjustments:
    SQL optimization cannot be done for a single statement in isolation; all of the SQL must be considered together.
14. Use or as little as possible:
    when multiple conditions coexist with or in the where clause, the MySQL optimizer does not produce a good execution plan;
    if necessary, use union all or union instead.
15. Always give each table an ID:
    set the ID as the primary key, preferably an int type, with the auto_increment flag set.
16. Configure more servers.
 
2. MapReduce optimization:
17. Local mode:
   Hive runs on MapReduce, which can be deployed pseudo-distributed, standalone, or fully distributed;
   there is also a local mode: with local-mode computation enabled, the input may be at most 128 MB and at most 4 input files are allowed.
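   A sketch of the corresponding session settings (standard property names; the values shown match the limits above):

   set hive.exec.mode.local.auto=true;                       -- let Hive run small jobs in local mode
   set hive.exec.mode.local.auto.inputbytes.max=134217728;   -- max total input size, 128 MB
   set hive.exec.mode.local.auto.input.files.max=4;          -- max number of input files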
18. Parallel execution:
   enable parallel execution of jobs: when there are no dependencies between stages, they can run at the same time. The degree of parallelism is set separately; the maximum number of parallel threads defaults to 8.
   Turning on parallelism consumes more cluster resources in exchange for execution speed, so enable it only for the jobs where it is appropriate.
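   A sketch of the settings (standard property names; adjust the thread count to your cluster):

   set hive.exec.parallel=true;              -- run independent stages of a query in parallel
   set hive.exec.parallel.thread.number=8;   -- how many stages may run at the same time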
 
19. The number of mappers and reducers:
   reduce the number of maps by merging small files;
   set the split-size properties to reduce the number of splits; when a single file is very large, increase the number of maps appropriately;
   set the number of reducers explicitly; if it is not controlled, Hive decides it from the size of the data output by the map stage.
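   A sketch of the usual knobs (property names as commonly used; check them against your Hadoop/Hive version):

   set mapreduce.input.fileinputformat.split.maxsize=268435456;   -- smaller value => more splits, more maps
   set hive.exec.reducers.bytes.per.reducer=268435456;            -- data volume handled by each reducer
   set mapreduce.job.reduces=10;                                   -- or fix the reducer count explicitly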
20. Combine small files:
    a large number of small files easily becomes a bottleneck on the storage side, puts pressure on HDFS, and hurts processing efficiency.
    This effect can be eliminated by merging the result files of Map and Reduce.
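    A sketch of the merge settings (standard hive.merge.* properties; verify the defaults on your version):

    set hive.merge.mapfiles=true;             -- merge small files produced by map-only jobs
    set hive.merge.mapredfiles=true;          -- merge small files produced by map-reduce jobs
    set hive.merge.size.per.task=268435456;   -- target size of the merged files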
21. JVM reuse:
    by default each JVM runs 1 task, i.e. the default number of tasks per JVM is 1.
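    A sketch (MapReduce property; the older name was mapred.job.reuse.jvm.num.tasks):

    set mapreduce.job.jvm.numtasks=10;   -- let one JVM run up to 10 tasks before exiting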

3. Data skew:
Symptom: the task progress stays at 99% (or 100%) for a long time; the task monitoring page shows that only a few (1 or 2) reduce subtasks have not finished, because the amount of data they process is far larger than that of the others.
1. Skew caused by an uneven key distribution makes the HQL run far too long.
  Solution: scatter the skewed keys, aggregate, and then merge the partial results.
2. group by can easily cause data skew:
    # whether to enable the optimization for skewed data in group by queries
    hive.groupby.skewindata=true;
    # whether to enable map-side aggregation for group by
    hive.map.aggr=true;
3. Data skew caused by null values:
  Solution: assign new key values to the nulls (see the sketch after this list).
4. Data skew caused by joining columns of different data types:
  Solution: cast the numeric type to a string type.
5. When null values appear in a table, add an is not null filter (or handle them when building the table).
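
A common sketch for item 3, remapping null join keys so they spread across reducers instead of all landing on one (tables log and users and the string column user_id are hypothetical):

select *
from log a
left join users b
  on case when a.user_id is null then concat('null_', rand()) else a.user_id end = b.user_id;
-- the random suffix never matches b.user_id, so null rows still get no match,
-- but they are no longer all shuffled to the same reducer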

4. Data aspect:
1. Hive tables are internal (managed) tables by default: when one is deleted, both the metadata and the data on HDFS are deleted.
   Deleting an external table only deletes the metadata.

2. Partitioning exists to keep Hive from brute-force scanning a whole large table.
   The partition field is a pseudo-column: it is recorded in the metadata and in the directory names, not inside the data files on HDFS.
   The large table is scattered into multiple small ones to improve query efficiency.
3. Partitioning uses fields outside the table's column list (partitioned by); bucketing uses fields that are in the table (clustered by).
   Buckets are used to disperse big data across multiple files.
4. Partition pruning: unnecessary partitions can be skipped during the query.
  
5. Hive storage and compression: the storage format is generally orcfile / parquet; the compression codec is usually SNAPPY.
    The compression format can be set before table creation: set parquet.compression=SNAPPY;
    or the file format and compression can be set at table creation time: stored as orc tblproperties ('orc.compression'='SNAPPY')

              

Origin blog.csdn.net/weixin_43777152/article/details/109229676