HiveQL: Data Definition
Databases in Hive
A Hive database is essentially a catalog, or namespace, of tables.
Create a database:
hive> CREATE DATABASE [IF NOT EXISTS] financials;
Hive creates a directory for each database. Tables in the database are stored in subdirectories of the database directory. The exception is the default database, which has no directory of its own.
The database directory is created under the top-level directory specified by the property hive.metastore.warehouse.dir.
Database directory names end in .db.
For example, creating the database financials causes Hive to create the directory /user/hive/warehouse/financials.db.
Override the default database location
hive> CREATE DATABASE financials
    > LOCATION '/my/preferred/directory';
Add a descriptive comment and view it
hive> CREATE DATABASE financials
    > COMMENT 'Holds all financial tables';
hive> DESCRIBE DATABASE financials;
financials  Holds all financial tables  hdfs://master-server/user/hive/warehouse/financials.db
In this URI, master-server is the authority: the DNS name of the master node (the NameNode) plus an optional port number.
Hive takes the master server name and port number from the fs.default.name property in the Hadoop configuration files, which are found under the $HADOOP_HOME/conf directory.
hdfs:///user/hive/warehouse/financials.db is equivalent to hdfs://master-server/user/hive/warehouse/financials.db
Remarks:
For completeness: when the user specifies a relative path, Hive places it under the user's home directory in the distributed file system (for HDFS, hdfs:///user/<user-name>). If the user is running in local mode, the current working directory is used as the parent of the relative path instead.
For portability, the server name and port number are normally omitted. They are specified only when referring to another distributed file system instance.
Add key-value properties to a database
hive> CREATE DATABASE financials
    > WITH DBPROPERTIES ('creator'='Mark', 'date'='2012-01-02');
hive> DESCRIBE DATABASE financials;
financials  hdfs://master-server/user/hive/warehouse/financials.db
hive> DESCRIBE DATABASE EXTENDED financials;
financials  hdfs://master-server/user/hive/warehouse/financials.db  {date=2012-01-02, creator=Mark}
USE -> switch the user's current working database
hive > USE financials;
Remarks:
There is no command that shows the user which database is the current working database. Hive has no concept of nested databases, so it is always safe to repeat the USE command.
Delete Database
hive > DROP DATABASE [IF EXISTS] financials;
Remarks:
By default, Hive does not allow users to drop a database that contains tables. Either drop the tables first and then drop the database, or append the keyword CASCADE to the command so that Hive drops the tables in the database itself.
hive > DROP DATABASE IF EXISTS financials CASCADE;
Using the keyword RESTRICT instead of CASCADE is the same as the default behavior. When a database is dropped, its directory is also deleted.
Alter a database (there is no way to delete or "reset" a database property)
hive> ALTER DATABASE financials SET DBPROPERTIES ('edited-by'='Joe');
Tables
Create a table
CREATE TABLE [IF NOT EXISTS] xxx_db.xxx (...) COMMENT '...' LOCATION '...' TBLPROPERTIES ('key'='value');
In most cases, the main role of TBLPROPERTIES is to attach additional documentation to the table in key-value form.
Hive automatically adds two table properties: last_modified_by, which holds the user name of the last user to modify the table, and last_modified_time, which holds the time of the last modification in seconds since the epoch.
By default, Hive always creates the table's directory under the directory of the database the table belongs to (e.g. /user/hive/warehouse/mydb.db/employees).
The default database is the exception: it has no directory of its own under /user/hive/warehouse, so its table directories are created directly under /user/hive/warehouse.
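As an illustration, the CREATE TABLE skeleton might be filled in as follows. This is only a sketch: the database mydb (assumed to exist already), the table employees, its columns, and the property values are all hypothetical, and the clause order assumed here is the one given in the Hive language manual (LOCATION before TBLPROPERTIES).

```sql
-- Hypothetical database, table, and column names, for illustration only.
CREATE TABLE IF NOT EXISTS mydb.employees (
  name         STRING COMMENT 'Employee name',
  salary       FLOAT  COMMENT 'Employee salary',
  subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
  deductions   MAP<STRING, FLOAT> COMMENT 'Deduction name to percentage',
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT> COMMENT 'Home address'
)
COMMENT 'Description of the table'
LOCATION '/user/hive/warehouse/mydb.db/employees'
TBLPROPERTIES ('creator' = 'me', 'created_at' = '2012-01-02 10:00:00');
```

The LOCATION shown is the default one for this table, so the clause could be omitted without changing where the data lives.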
Copy a table schema (without copying the data)
CREATE TABLE IF NOT EXISTS XXX_db.XXX2 LIKE XXX_db.XXX;
Display table information
SHOW TABLES [IN db];
DESCRIBE EXTENDED|FORMATTED xxx_db.xxx[.colXXX];
Managed tables and external tables
Managed table -> Hive controls the lifecycle of the data. By default, Hive stores these tables in subdirectories of the directory defined by the property hive.metastore.warehouse.dir. [e.g. when a managed table is dropped, Hive deletes the data in the table]
Because Hive uses schema on read, it has virtually no ability to manage the contents of a managed table.
External table ->
CREATE EXTERNAL TABLE [IF NOT EXISTS] ...
The EXTERNAL keyword tells Hive that the table is external. Because the table is external, Hive does not consider itself the owner of the data. Dropping the table therefore does not delete the data; only the metadata describing the table is deleted.
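A short sketch of the external-table behavior described here. The table stocks, its columns, and the path /data/stocks are hypothetical, and the example assumes comma-separated text files already exist under that path.

```sql
-- Hypothetical table over files that some other tool wrote to /data/stocks.
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
  exchange_name STRING,
  symbol        STRING,
  ymd           STRING,
  price_close   FLOAT,
  volume        INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';

-- Removes only the table metadata; the files under /data/stocks stay in place.
DROP TABLE IF EXISTS stocks;
```

Re-running the CREATE EXTERNAL TABLE statement after the drop should make the same files queryable again, since they were never touched.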
Partitions
Partitioned managed tables and partitioned external tables
Hive creates subdirectories that reflect the partition structure under the table directory, e.g.:
hdfs://master_server/user/hive/warehouse/mydb.db/employees
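To make the layout concrete, here is a minimal sketch of a partitioned variant of a hypothetical employees table. The database mydb, the columns, and the partition values are assumptions for illustration, and the example assumes no table of that name exists yet.

```sql
-- Partition columns are declared in PARTITIONED BY, not in the main column list.
CREATE TABLE mydb.employees (
  name   STRING,
  salary FLOAT
)
PARTITIONED BY (country STRING, state STRING);

-- Once partitions exist, each one maps to a nested key=value directory, for example:
--   .../mydb.db/employees/country=CA/state=AB
--   .../mydb.db/employees/country=US/state=CA
```

The partition columns behave like ordinary columns in queries, but their values are stored in the directory names rather than in the data files.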
Remarks:
A query that scans all partitions may trigger an enormous MapReduce job. Hive can be set to "strict" mode, in which a query against a partitioned table is rejected unless its WHERE clause contains a filter on a partition column.
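A sketch of how strict mode might be used from the CLI. It assumes a hypothetical table employees partitioned by (country, state), and it assumes the hive.mapred.mode property found in older Hive releases; newer releases may control the same check through different settings.

```sql
-- Turn on strict mode for the current session.
set hive.mapred.mode=strict;

-- Rejected in strict mode: no partition filter in the WHERE clause.
SELECT e.name, e.salary FROM employees e LIMIT 100;

-- Accepted: the WHERE clause restricts the partition column country.
SELECT e.name, e.salary FROM employees e WHERE e.country = 'US' LIMIT 100;

-- Restore the permissive behavior.
set hive.mapred.mode=nonstrict;
```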
Show the partitions under a given partition key
SHOW PARTITIONS EMPLOYEES PARTITION(COUNTRY='CN');
Create a partition by loading data
LOAD DATA LOCAL INPATH '${env:HOME}/CALIFORNIA-EMPLOYEES' INTO TABLE EMPLOYEES PARTITION(COUNTRY='CN', STATE='BK');
Hive does not care whether the directory for a partition exists or whether it contains any files. If the partition directory does not exist or contains no files, a query that filters on that partition simply returns no results.
The storage format of a table is specified explicitly with STORED AS. The user can also specify the various delimiters when creating the table.
TEXTFILE means that all fields are encoded as letters, digits, and other characters, including characters from international character sets.
With TEXTFILE, each line is treated as a separate record.
SEQUENCEFILE and RCFILE use binary encoding and compression to optimize disk space usage and I/O bandwidth.
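The following sketch spells out a text table with its delimiters written explicitly. The table and column names are hypothetical; the delimiter values shown are Hive's defaults for text files (Ctrl-A between fields, Ctrl-B between collection items, Ctrl-C between a map key and its value, newline between records), so omitting the whole ROW FORMAT clause should give the same result.

```sql
-- Hypothetical table; every delimiter below is the default and is shown only for clarity.
CREATE TABLE employees_text (
  name         STRING,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```

Replacing TEXTFILE with SEQUENCEFILE or RCFILE would select one of the binary formats instead.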
Hive uses an InputFormat object to split an input stream into records and an OutputFormat object to format records into an output stream. It uses a SerDe to parse each record into columns when reading and to encode columns into a record when writing.
Hive's WITH SERDEPROPERTIES clause lets users pass configuration information to the SerDe.
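As a sketch of passing configuration to a SerDe, the example below uses the RegexSerDe class that is bundled with many Hive releases (in some older releases it lives in the contrib package instead, so the class name is an assumption about the installed version). The table, its columns, and the regular expression are hypothetical.

```sql
-- Hypothetical table over log lines of the form "<level> <message>".
-- RegexSerDe maps each capture group in input.regex to one column, in order,
-- and expects every column to be a STRING.
CREATE TABLE log_lines (
  level   STRING,
  message STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) (.*)'
)
STORED AS TEXTFILE;
```

Here the InputFormat implied by TEXTFILE still splits the file into lines; the SerDe only decides how each line becomes columns.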
Delete table
DROP TABLE IF EXISTS employees;
Remarks:
Hadoop Trash feature
If the user turns on this feature (it is off by default), deleted data is moved to the .Trash directory under the user's home directory in the distributed file system, which in HDFS is /user/$USER/.Trash. To enable it, set fs.trash.interval to a reasonable positive integer; the value is the interval between "trash checkpoints", in minutes. (Not every version supports this feature.) If data is deleted by mistake, the user can re-create the table and its partitions and then move the deleted files from .Trash back to the correct directories to restore the data.
Alter a table
ALTER TABLE -- modifies only the table metadata
Rename Table
ALTER TABLE xxx RENAME TO new_xxx;
Add, modify, and drop partitions
ALTER TABLE XXX ADD IF NOT EXISTS PARTITION (...) LOCATION '...';
ALTER TABLE XXX PARTITION (...) SET LOCATION '...'; -- changes the partition path; does not move the data and does not delete the old data
ALTER TABLE XXX DROP IF EXISTS PARTITION (...);
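A concrete sketch of the three partition statements. It assumes a hypothetical external table log_messages partitioned by (year INT, month INT, day INT); all paths are placeholders.

```sql
-- Register a partition whose files already live under the given path.
ALTER TABLE log_messages ADD IF NOT EXISTS
  PARTITION (year = 2012, month = 1, day = 2)
  LOCATION '/logs/2012/01/02';

-- Point the partition at a new location; existing files are neither moved nor deleted.
ALTER TABLE log_messages PARTITION (year = 2012, month = 1, day = 2)
  SET LOCATION 'hdfs:///archive/logs/2012/01/02';

-- Drop the partition; for an external table the files stay in place.
ALTER TABLE log_messages DROP IF EXISTS
  PARTITION (year = 2012, month = 1, day = 2);
```

Because SET LOCATION changes only metadata, copying the files to the new path is a separate step that has to happen outside these statements.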
Modify table properties
ALTER TABLE XXX SET TBLPROPERTIES ('key'='value');
Modify the storage properties
ALTER TABLE XXX PARTITION (...) SET FILEFORMAT SEQUENCEFILE;
Remarks:
SERDEPROPERTIES lets the user customize the properties of the various SerDe implementations.
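For instance, a sketch of changing a SerDe property on an existing table. It assumes a hypothetical text table xxx that uses Hive's default SerDe, which reads its field delimiter from the field.delim property.

```sql
-- Tell the default SerDe to split fields on commas from now on.
-- Only metadata changes; files already in the table are not rewritten.
ALTER TABLE xxx SET SERDEPROPERTIES ('field.delim' = ',');
```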
Column operations
Modify column information
ALTER TABLE xxx CHANGE COLUMN old_col new_col col_type COMMENT '...' AFTER other_col;
Add columns
ALTER TABLE xxx ADD COLUMNS (...);
Replace / remove columns
ALTER TABLE XXX REPLACE COLUMNS (...);
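The column operations might look like this in practice. The sketch assumes a hypothetical table log_messages with columns (hms INT, severity STRING, server STRING, message STRING); every name and comment is illustrative.

```sql
-- Rename hms to hours_minutes_seconds and move it after severity.
-- The type must be restated even when it does not change.
ALTER TABLE log_messages
CHANGE COLUMN hms hours_minutes_seconds INT
COMMENT 'The hours, minutes, and seconds part of the timestamp'
AFTER severity;

-- Append two columns after the existing ones.
ALTER TABLE log_messages ADD COLUMNS (
  app_name   STRING COMMENT 'Application name',
  session_id BIGINT COMMENT 'The current session id'
);

-- Replace the whole column list; columns not listed here are removed.
ALTER TABLE log_messages REPLACE COLUMNS (
  hours_mins_secs INT    COMMENT 'Hour, minute, and seconds from the timestamp',
  severity        STRING COMMENT 'The message severity',
  message         STRING COMMENT 'The rest of the message'
);
```

All three statements change only the table metadata, so the user remains responsible for making sure the data files match the new column layout.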
ALTER TABLE ... TOUCH ...; -- triggers hooks
ALTER TABLE ... ARCHIVE(/UNARCHIVE) PARTITION (...); -- packs the files in the partition into a Hadoop archive (HAR)
ALTER TABLE ... ENABLE(/DISABLE) NO_DROP/OFFLINE; -- prevents the partition from being dropped or queried