Basic concepts of Hadoop's three major components, plus Hive

Hadoop's three major components:

Distributed file system: HDFS -- stores huge numbers of files across many servers (distributed storage)

Distributed computing framework: MapReduce -- parallel computation across many machines (distributed computing)

Distributed resource scheduling platform: YARN -- schedules the MapReduce programs of many users and allocates computing resources sensibly

Getting started with Hive

Hive is built on top of Hadoop.
Hive parses HQL queries and its optimizer generates the query plan; all of Hive's data is stored on Hadoop.
The query plan is converted into MapReduce tasks (jobs) and executed on Hadoop (some queries need no MR task, e.g. SELECT * FROM table).
Both Hive and Hadoop use UTF-8 encoding.
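As a sketch of that behavior (using the `invites` example table defined later in these notes), a bare SELECT only reads the table's files, while an aggregate is compiled into a MapReduce job:

```sql
-- Reads the table's files directly; no MapReduce job is launched.
SELECT * FROM invites;

-- Needs aggregation across files, so Hive compiles it into a MapReduce job.
SELECT bar, count(*) FROM invites GROUP BY bar;
```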
 

Common databases (database, abbreviated DB)

Relational databases (data organized as two-dimensional tables linked by the relationships between them):
mysql, oracle, sqlServer, postgresql (a "little Oracle")
Non-relational databases: mongodb, hbase, redis

Data warehouse (data warehouse, abbreviated DW); data processing falls into two broad categories (OLAP and OLTP, see below)

Differences between a data warehouse and a database:
A data warehouse holds a large amount of data; a database holds a small amount.
Writing new data into a warehouse can be very slow and complicated, and existing data cannot be modified or deleted; warehouses are generally used only for high-volume queries.
A database supports small-scale CRUD (create, read, update, delete).
A data warehouse is used for analysis (OLAP) and is mainly read; a database is mainly used to process transactions (OLTP) and is mainly written.
OLAP: online analytical processing
OLTP: online transaction processing
 
Which data warehouses are there?
Hive, EMR (Alibaba), TDW (Tencent), InfoSphere (IBM), and more; there are at least dozens of data warehouse products.
 
Where does data warehouse data come from?
1. Logs, including application logs, system logs, and web logs (tomcat, nginx, apache)
2. Databases
3. External sources (web crawlers, companies' external interfaces)
 
Why Hive?
1. Open source
2. Free
3. Built on Hadoop, with the same UTF-8 encoding format
 
Tip: ETL refers to data extraction, transformation, and loading.
  
 
Bridged mode: a static IP; moving to a different location means changing the IP, because the virtual machine sits on the same network segment as the host.
NAT mode: a dynamic IP; the LAN adapter has its own virtual gateway, so even at a different location the IP need not change.
192.168.1.1 is typically the network gateway; data transmission usually has to pass through the gateway.
DNS: Domain Name Service, which resolves domain names.
When looking up an address, the system normally first checks its hosts file for a matching entry, and only falls back to a DNS lookup if none is found.
Tip: the role of the subnet mask is to determine whether two IPs are on the same network segment.
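A minimal shell sketch of that last point, assuming the common home mask 255.255.255.0 (a /24 segment like the 192.168.1.x example above), under which the first three octets are the network part:

```shell
#!/bin/sh
# With mask 255.255.255.0, two IPs are on the same network segment
# exactly when their first three octets match.
same_segment() {
  # ${1%.*} strips the last octet, leaving the network part, e.g. 192.168.1
  [ "${1%.*}" = "${2%.*}" ] && echo same || echo different
}

same_segment 192.168.1.10 192.168.1.20    # prints: same
same_segment 192.168.1.10 192.168.2.10    # prints: different
```

For other masks the comparison would need real bitwise AND of IP and mask; this shortcut only works for /24.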
 
hdfs dfs -mkdir /xxx

Hive basic syntax

Basic data types include: TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP (timestamp: year, month, day, and time), DECIMAL (exact decimal; precision is guaranteed not to be lost, used for money-related values), CHAR, VARCHAR, DATE (date)
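Why DECIMAL for money: FLOAT and DOUBLE are binary floating point and cannot represent values like 0.1 exactly. A quick illustration outside Hive, using awk (whose numbers are IEEE doubles, like Hive's DOUBLE):

```shell
# 0.1 and 0.2 have no exact binary representation, so their sum drifts:
awk 'BEGIN { printf "%.17f\n", 0.1 + 0.2 }'
# prints 0.30000000000000004, not 0.30000000000000000
```

DECIMAL stores an exact scaled integer instead, so sums of money never pick up this kind of rounding error.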
 

Metadata (Metadata)

Metadata is data that describes data. For example, the contents of a file are data, while the information describing that file, such as its size, location, access time, and modification time, is metadata.

In Hive, the data is the tables themselves.
Hive's metadata describes those tables: each table's location, its fields, its database name, and so on; this metadata is stored in MySQL.
 
Tip: the jps command lists the Java processes currently running on the machine.
 
Start everything: start-all.sh
 
HQL statements
Create a table:
hive> create table nanjing(id int, name string) row format delimited fields terminated by ',';
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
PARTITIONED: partitioned
Partitions can be nested into multiple levels.
A partition field is not a field inside the table itself.
There can be any number of partitions.
Example: CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (day STRING);
load data local inpath '/tmp/record' into table record partition(...);
Hive's warehouse directory on HDFS: /user/hive/warehouse
 
Hive internal and external tables
A table is created as an internal (managed) table by default.
A table created with the EXTERNAL keyword is an external table.
The difference:
When an internal table is dropped, its data and metadata are deleted together.
When an external table is dropped, only the metadata is deleted; the data is still retained.
External tables are safer than internal tables, but deleting them is more troublesome.
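A minimal sketch of the two kinds (the table names and the HDFS path are made up for illustration):

```sql
-- Internal (managed) table: DROP TABLE deletes both the metadata and the data.
CREATE TABLE managed_demo (id INT, name STRING);

-- External table: DROP TABLE deletes only the metadata in MySQL;
-- the files under /data/demo remain on HDFS.
CREATE EXTERNAL TABLE external_demo (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/demo';
```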
 
Load local data while specifying the partition:
hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
Load data from HDFS while specifying the partition (upload it first with hadoop fs -put record/1.txt):
hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
 
 
Hive execution engines
1. local: native mode; MR runs locally rather than on YARN, i.e. not distributed execution
2. mr: MapReduce
3. spark
4. tez
 
Common exceptions explained
ParseException: a syntax (parsing) error
SemanticException: a semantic error
 
Hive is a tool: its data is stored in HDFS, and its metadata is stored in MySQL.
 
Hive query process: a query first goes to the metadata, looking up the table in the MySQL database to find its storage location in HDFS; Hive then reads the file from HDFS and displays the result in the Hive tool.
 
Set the MySQL password:
mysqladmin -u root password 'yourpassword'
 
When you create a database in Hive:
the DBS table (in the MySQL metastore) stores the directory path of each Hive database;
the TBLS table stores basic information about Hive tables.
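These can be inspected directly in the MySQL metastore database (often named `hive` or `metastore`, depending on your setup; the column names below come from the standard Hive metastore schema):

```sql
-- Each Hive database and its HDFS directory:
SELECT NAME, DB_LOCATION_URI FROM DBS;

-- Each table, its type (MANAGED_TABLE vs EXTERNAL_TABLE), and owning database:
SELECT TBL_NAME, TBL_TYPE, DB_ID FROM TBLS;
```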
 

Origin www.cnblogs.com/zzok/p/11351500.html