Chapter One: First Look at Hive

Data mining -> writing algorithms

Data analysis -> Hive (also Pig, HBase): lowers the difficulty of data analysis
Framework: supplements (extends) the infrastructure
1. The Hive framework:
1. Open-sourced by Facebook to analyze massive volumes of structured logs.
A structured log, as the name suggests, is not a free-format log but follows a fixed structure, for example one JSON object per log line.
2. Hive is a data warehouse tool built on Hadoop. It maps structured data files (text, SequenceFile, RCFile) to tables and provides SQL-like queries.
Essence: HQL (Hive Query Language), SQL -> MapReduce
A HiveQL statement is converted into an MR program.

Hive is a data warehouse (an architecture for large volumes of data), not a database (a piece of software).
1) The data Hive processes is stored in HDFS.
2) Hive's data analysis is implemented on MapReduce.
3) The resulting programs run on Yarn.
Hadoop components:
1. Common: underlying support components, common modules
2. HDFS: distributed storage for big data
3. MapReduce: distributed computing for big data (data analysis)
1. Input
2. Map
3. Shuffle (distribution)
4. Reduce
5. Output

Sorting: Hive: order by (HQL)
         MR: requires writing many files

4. Yarn: Provides resource allocation functions.
Summary:
1) The data processed by Hive is stored in HDFS.
2) The underlying implementation of Hive analysis data is MapReduce.
3) The execution program runs on Yarn.
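
The five MapReduce phases listed above (input, map, shuffle, reduce, output) can be sketched in a few lines of plain Python. This is only an in-memory illustration of the data flow, not how Hadoop runs a job; the function names (map_phase, shuffle_phase, reduce_phase, run_job) are made up for this sketch.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each group of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(records):
    """Input -> map -> shuffle -> reduce -> output, all in one process."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    print(run_job(["NBA CBA", "CBA"]))
```

In Hive, the same count is a single group by query; the point of the sketch is only to show what that query has to be translated into.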
2. Advantages and disadvantages of Hive:
Advantages:
1. The interface uses SQL-like syntax, which makes quick data analysis possible.
2. It avoids writing MapReduce by hand, which reduces development cost.
3. Hive has high execution latency (because the underlying analysis is carried out by MapReduce), so it suits the analysis of static data; real-time analysis is not achievable.
4. Hive targets big data and handles small data sets poorly.
5. Hive supports user-defined functions.
Disadvantages:
1. HQL has limited expressive power and cannot express iterative computations.
2. Hive only suits the analysis of static data; it is not good at real-time analysis or data mining.
3. Hive's execution efficiency is relatively low, because the conversion from HQL to MR is not smart enough.
4. Tuning Hive is fairly troublesome (hive-site.xml).

3. Hive architecture

Metadata:
Hive only handles data analysis and does not store data itself. Once a big data file is mapped to a Hive table, the attributes of that table are called metadata, and this metadata must be stored in a separate relational database (Derby by default; MySQL recommended).
The Metastore holds no actual data, only metadata: which databases and tables exist in Hive, along with table schemas, directories, partitions, indexes, and namespaces. The actual data is stored on HDFS.

Hive components (hive-site.xml): convert HQL -> MR program
1. HQL parser
Converts the HQL string into an abstract syntax tree (AST) using ANTLR, then analyzes the AST, for example checking whether the table exists, whether the fields exist, and whether the HQL syntax is valid.
2. HQL compiler
Generates the logical execution plan from the syntax tree.
3. Hive optimizer
Optimizes the logical execution plan.
4. Executor
Converts the logical execution plan into a physical execution plan and runs it.

4. Hive compared with Hadoop and MySQL
Hadoop:
1. Hadoop handles both data storage and data computation; Hive handles only data computation.
2. For data analysis, Hive uses HQL while Hadoop uses MapReduce.
3. HQL is ultimately executed as MapReduce.
MySQL:
1. MySQL supports data update operations; Hive in principle does not.
2. Hive data is stored in HDFS and its metadata in the metastore database. MySQL data is stored on the local host.
3. Hive does not process or even scan the data while loading it, so it builds no index on any key. To find the values that match a condition, Hive has to scan the entire data set, so access latency is high. Because it runs on MapReduce, Hive can read data in parallel, so even without indexes it still has an advantage when accessing large amounts of data. A database usually builds an index on one or several columns, so it is efficient and low-latency when fetching a small amount of data under specific conditions. Because of its high access latency, Hive is not suitable for online queries.
4. Execution latency: MySQL has low execution latency (on small data);
Hive has high execution latency, and its parallelism only pays off at large data volumes.
5. Scalability: thanks to UDFs (user-defined functions),
Hive is highly extensible;
MySQL is less so.
6. Hive is built for large data scale.
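
Point 3 above contrasts Hive's full scan with a database's index lookup. A minimal Python sketch of that difference follows; it is an in-memory illustration only, and the helper names (full_scan, build_index, index_lookup) and sample rows are invented for the example.

```python
def full_scan(rows, key, value):
    """Hive-style access: inspect every row to find the matches."""
    return [row for row in rows if row[key] == value]


def build_index(rows, key):
    """Database-style preparation: map each key value to its row positions."""
    index = {}
    for pos, row in enumerate(rows):
        index.setdefault(row[key], []).append(pos)
    return index


def index_lookup(rows, index, value):
    """Database-style access: jump straight to the matching rows."""
    return [rows[pos] for pos in index.get(value, [])]


if __name__ == "__main__":
    rows = [
        {"temp": "NBA", "sj": "2020-3-1 10:0:0"},
        {"temp": "CBA", "sj": "2020-3-1 10:0:0"},
        {"temp": "CBA", "sj": "2020-3-1 10:0:0"},
    ]
    index = build_index(rows, "temp")
    # Both approaches return the same rows; only the work done differs.
    print(full_scan(rows, "temp", "CBA") == index_lookup(rows, index, "CBA"))
```

The scan touches every row on each query, while the index costs work up front (at load time) and then answers selective queries cheaply, which is the trade-off Hive declines to make when loading data.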
5. Hive compared with relational databases
The only similarity is that HQL has SQL-like syntax.

Example: statistics
data collection
"NBA" "2020-3-1 10:0:0" "IP" "host"
"CBA" "2020-3-1 10:0:0" "IP" "host"
"CBA" "2020-3-1 10:0:0" "IP" "host"
data cleaning
"NBA"
"CBA"
"CBA"
data formatting
NBA
CBA
CBA
data analysis
NBA 1
CBA 2
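
The four stages above (collection, cleaning, formatting, analysis) can be reproduced with a short Python sketch. It assumes each raw log line starts with the quoted keyword, as in the sample data; the function names clean, fmt, and analyze are invented for this illustration.

```python
from collections import Counter

# Data collection: the three raw log lines from the example.
RAW_LOGS = [
    '"NBA" "2020-3-1 10:0:0" "IP" "host"',
    '"CBA" "2020-3-1 10:0:0" "IP" "host"',
    '"CBA" "2020-3-1 10:0:0" "IP" "host"',
]


def clean(line):
    """Data cleaning: keep only the first quoted field (the keyword)."""
    end = line.index('"', 1)  # position of the closing quote
    return line[: end + 1]


def fmt(field):
    """Data formatting: strip the quotes and any stray whitespace."""
    return field.strip('"').strip()


def analyze(keywords):
    """Data analysis: count how often each keyword occurs."""
    return Counter(keywords)


counts = analyze(fmt(clean(line)) for line in RAW_LOGS)

if __name__ == "__main__":
    for keyword, n in counts.items():
        print(keyword, n)
```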
Example: find the keyword that occurs most often in a given time period
Hive process:
1. Hive: create the table

temp   sj
"NBA"  "2020-3-1 10:0:0"
"CBA"  "2020-3-1 10:0:0"
"CBA"  "2020-3-1 10:0:0"

2. Write the HQL query

-- the subquery in FROM is given an alias (t), which Hive expects
select temp, count(*) as cs from (
  select temp from tb1 where sj > ****** and sj < ********
) t
group by temp order by cs desc
limit 1;
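
To try the shape of this query without a Hive installation, the sketch below runs an equivalent statement against an in-memory SQLite database from Python. SQLite is only a stand-in here, not Hive. The sketch assumes sj holds zero-padded timestamps so that string comparison orders them correctly, and the time bounds passed in are hypothetical, since the original query leaves them elided. The function name top_keyword is invented for the example.

```python
import sqlite3


def top_keyword(rows, start, end):
    """Return (keyword, count) for the most frequent keyword in (start, end)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE tb1 (temp TEXT, sj TEXT)")
        conn.executemany("INSERT INTO tb1 VALUES (?, ?)", rows)
        cur = conn.execute(
            "SELECT temp, COUNT(*) AS cs FROM "
            "(SELECT temp FROM tb1 WHERE sj > ? AND sj < ?) AS t "
            "GROUP BY temp ORDER BY cs DESC LIMIT 1",
            (start, end),
        )
        return cur.fetchone()  # None when no row falls inside the window
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [
        ("NBA", "2020-03-01 10:00:00"),
        ("CBA", "2020-03-01 10:00:00"),
        ("CBA", "2020-03-01 10:00:00"),
    ]
    # Hypothetical window covering the whole sample day.
    print(top_keyword(sample, "2020-03-01 00:00:00", "2020-03-02 00:00:00"))
```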

Origin blog.csdn.net/weixin_44703894/article/details/114477034