A summary of commonly used Hive SQL commands, a must-have reference for big data developers

Hive is a key component of the Hadoop ecosystem: a data warehouse tool for managing and analyzing data. It lets you analyze data stored in the HDFS distributed file system with SQL, by mapping structured data files to database tables and providing SQL-like queries over them.

This SQL dialect is Hive SQL (HQL). Hive parses each SQL statement and converts it into MapReduce tasks to run, so users who are not familiar with MapReduce can still query, aggregate, and analyze the data with a SQL-style language.

First, basic commands

1, database operations

  • show databases; # list all databases
  • use database_name; # switch to a database
  • show tables; # list all tables
  • desc table_name; # display the table structure
  • show partitions table_name; # list the partitions of a table
  • show create table table_name; # display the statement that creates the table

2, creating and modifying tables

  • use xxdb; create table xxx; # internal (managed) table
  • create table xxx like yyy; # create a table with the same structure as an existing table
  • use xxdb; create external table xxx; # external table
  • use xxdb; create external table xxx (l int) partitioned by (d string); # partitioned table
  • alter table table_name set tblproperties ('EXTERNAL' = 'TRUE'); # convert an internal table to an external table
  • alter table table_name set tblproperties ('EXTERNAL' = 'FALSE'); # convert an external table to an internal table
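
The statements above are abbreviated and leave out the column definitions. A fuller sketch might look like the following; the table names orders, orders_copy and orders_ext and all columns except l and d are assumptions made up for illustration.

```sql
-- Hypothetical database, table and column names throughout.
CREATE DATABASE IF NOT EXISTS xxdb;
USE xxdb;

-- Internal (managed) table: Hive owns both the metadata and the data files.
CREATE TABLE IF NOT EXISTS orders (
  order_id  BIGINT,
  amount    DECIMAL(10, 2),
  orderdate STRING
);

-- Copy only the structure of an existing table; no rows are copied.
CREATE TABLE IF NOT EXISTS orders_copy LIKE orders;

-- Partitioned external table; the partition column d is not repeated in the column list.
CREATE EXTERNAL TABLE IF NOT EXISTS orders_ext (
  l INT
)
PARTITIONED BY (d STRING);

-- Convert the managed copy to an external table, then back again.
ALTER TABLE orders_copy SET TBLPROPERTIES ('EXTERNAL' = 'TRUE');
ALTER TABLE orders_copy SET TBLPROPERTIES ('EXTERNAL' = 'FALSE');
```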

3, field types

  • Basic types: tinyint, smallint, int, bigint, float, decimal, boolean, string
  • Complex types: struct, array, map

Second, commonly used functions

  • length() # returns the length of a string
  • trim() # removes leading and trailing spaces
  • lower(), upper() # case conversion
  • reverse() # reverses a string
  • cast(expr as type) # type conversion
  • substring(string A, int start, int len) # extracts len characters of A starting at position start (counted from 1)
  • split(string str, string pat) # splits str by the pattern pat and returns the resulting array of strings
  • coalesce(v1, v2, v3, ...) # returns the first non-null value in the list; returns null if all values are null
  • from_unixtime(unix_timestamp(), 'yyyy-MM-dd HH:mm:ss') # returns the current time
  • instr(string str, string search_str) # returns the position of search_str within str (returns 0 if not found)
  • concat(string A, string B, string C, ...) # string concatenation
  • concat_ws(string sep, string A, string B, string C, ...) # string concatenation with the custom separator sep
  • str_to_map(string A, string item_pat, string dict_pat) # converts a string to a map; item_pat separates the key-value pairs, dict_pat separates each key from its value
  • map_keys(map m) # returns the keys of a map as an array
  • datediff(date1, date2) # date comparison; returns the number of days from date2 to date1, e.g. datediff('${cur_date}', d)
  • explode(colname) # splits a complex array or map value in one Hive row into multiple rows
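
A few of these functions can be tried on literal values, as in the sketch below. It assumes a Hive version that allows SELECT without a FROM clause (0.13 or later); the column aliases are made up for illustration.

```sql
-- Literal inputs only, so no table is needed.
SELECT
  length('hive')                       AS str_len,        -- 4
  upper(trim('  hive  '))              AS cleaned,        -- HIVE
  reverse('abc')                       AS reversed,       -- cba
  cast('12' AS INT) + 1                AS casted,         -- 13
  substring('hadoop', 1, 3)            AS sub_str,        -- had
  instr('hadoop', 'do')                AS found_at,       -- 3
  coalesce(NULL, 'fallback')           AS first_non_null, -- fallback
  concat_ws('-', 'a', 'b', 'c')        AS joined,         -- a-b-c
  datediff('2016-07-18', '2016-07-10') AS day_diff;       -- 8

-- Parse 'k1:v1,k2:v2' into a map, then return its keys (k1 and k2) as an array.
SELECT map_keys(str_to_map('k1:v1,k2:v2', ',', ':')) AS key_list;

-- Turn the array produced by split() into one row per element: a, b, c.
SELECT explode(split('a,b,c', ',')) AS elem;
```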

Third, related concepts

1, Hive

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL-like queries.

2, basic components

  • User interfaces: CLI is the shell command line; JDBC/ODBC is Hive's Java interface; WebGUI accesses Hive through a browser.
  • Metadata storage: usually kept in a relational database such as MySQL or Derby. Hive metadata includes table names, the columns and partitions of each table and their properties, table properties (such as whether the table is external), the directory holding each table's data, and so on.
  • Interpreter, compiler, optimizer: these take an HQL query through lexical analysis, parsing, compilation, optimization, and query plan generation. The generated query plan is stored in HDFS and then executed by MapReduce.

The relationship between Hive and Hadoop can therefore be understood as follows: the user issues a SQL query to Hive, Hive generates a query plan and stores it in HDFS, and MapReduce then executes it.
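
The generated query plan can be inspected with Hive's EXPLAIN statement. The sketch below assumes a hypothetical table pvs_demo, created here only so that the example is self-contained.

```sql
-- Hypothetical table and columns, for illustration only.
CREATE TABLE IF NOT EXISTS pvs_demo (userid BIGINT, ctry STRING);

-- EXPLAIN prints the plan (its stages and operators) instead of running the query.
EXPLAIN
SELECT ctry, count(*) AS pv
FROM pvs_demo
GROUP BY ctry;
```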

3, table

A Table in Hive is conceptually similar to a table in a database. Each Table has a corresponding directory in Hive where its data is stored. For example, a table pvs has the HDFS path /wh/pvs, where wh is the data warehouse directory specified by ${hive.metastore.warehouse.dir} in hive-site.xml. All Table data (not including External Tables) is stored under this directory.

4, partition

A Partition corresponds to a dense index on the partition column in a database, but partitions are organized very differently in Hive than in a database. In Hive, each Partition of a table corresponds to a subdirectory under the table's directory, and all data of that Partition is stored in the corresponding directory.
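
The directory layout can be sketched with a small partitioned table. The table name pvs_part and its columns are assumptions; the partition columns ds and ctry follow the path example used in this article.

```sql
-- Hypothetical table modeled on the pvs example; column names are assumptions.
CREATE TABLE IF NOT EXISTS pvs_part (
  userid BIGINT,
  url    STRING
)
PARTITIONED BY (ds STRING, ctry STRING);

-- Each combination of partition values gets its own subdirectory,
-- e.g. .../pvs_part/ds=20090801/ctry=US under the warehouse directory.
ALTER TABLE pvs_part ADD IF NOT EXISTS PARTITION (ds = '20090801', ctry = 'US');

SHOW PARTITIONS pvs_part;

-- Filtering on the partition columns lets Hive read only the matching directories.
SELECT count(*) FROM pvs_part WHERE ds = '20090801' AND ctry = 'US';
```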

5, buckets

Buckets compute a hash of a specified column and split the data according to the hash value, for the purpose of parallelism; each Bucket corresponds to one file. For example, to spread the user column across 32 buckets, Hive first computes the hash of the user column value. Rows whose hash value is 0 go to the HDFS file /wh/pvs/ds=20090801/ctry=US/part-00000, and rows whose hash value is 20 go to /wh/pvs/ds=20090801/ctry=US/part-00020.
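
A bucketed table along these lines might be declared as in the sketch below. The table name pvs_bucketed and the column userid are assumptions (userid is used instead of user to avoid clashing with a keyword).

```sql
-- Hypothetical table; 32 buckets as in the example above.
CREATE TABLE IF NOT EXISTS pvs_bucketed (
  userid BIGINT,
  url    STRING
)
PARTITIONED BY (ds STRING, ctry STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;

-- Sample a single bucket out of the 32 instead of scanning the whole partition.
SELECT s.userid, s.url
FROM pvs_bucketed TABLESAMPLE (BUCKET 1 OUT OF 32 ON userid) s
WHERE s.ds = '20090801' AND s.ctry = 'US';
```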

6, external table

An External Table points to data that already exists in HDFS, and it can have Partitions. Its metadata is organized in the same way as that of a Table, but the actual data is stored quite differently.

For a Table, creation and data loading are two processes (both can be done in a single statement). While loading, the actual data is moved into the data warehouse directory, and later access to the data happens directly in the warehouse directory. When the table is dropped, both its data and its metadata are deleted.

An External Table has only one process: loading the data and creating the table happen at the same time (CREATE EXTERNAL TABLE ...... LOCATION). The actual data stays in the HDFS path given after LOCATION and is not moved into the data warehouse directory. Dropping an External Table deletes only the metadata; the data in the table is not actually deleted.
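
The difference on drop can be sketched as follows. The table name pvs_ext, its columns, the tab-delimited row format, and the path /data/pvs_ext are all assumptions for illustration.

```sql
-- Hypothetical table name, columns, file format, and HDFS path.
CREATE EXTERNAL TABLE IF NOT EXISTS pvs_ext (
  userid BIGINT,
  url    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/pvs_ext';

-- The files already under /data/pvs_ext are queried in place; nothing is moved.
SELECT count(*) FROM pvs_ext;

-- Removes only the metadata; the files under /data/pvs_ext remain in HDFS.
DROP TABLE pvs_ext;
```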

7, full data and incremental data

Look at the partition information. If the partition size keeps growing over time, the latest partition holds the full data. If the partition size moves up and down over time, each partition holds incremental data.

Fourth, similarities and differences between HQL and SQL

1, common differences between HQL and SQL

  • select distinct must be followed by explicit field names;
  • join conditions support only equality conditions, and or is not supported;
  • Subqueries cannot be used in the select list;
  • HQL has no UNION (before Hive 1.2.0); distinct + union all can be used to achieve the same result;
  • HQL statements are separated by semicolons, so every statement must end with a semicolon;
  • HQL string comparison is stricter and is sensitive to case and spaces; when comparing, upper(trim(a)) = upper(trim(b)) is recommended;
  • When filtering on dates, to_date() is recommended, such as: to_date(orderdate) = '2016-07-18';
  • Field names that are keywords must be wrapped in backticks, such as select `exchange` from xxdb.xxtb;
  • There is only a single dot between the database name and the table/view name, as in xx_db.xx_tb.
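
Several of these points can be combined in one sketch. The columns a, b, and orderdate on xxdb.xxtb and the tables t1 and t2 are assumptions for illustration.

```sql
-- Hypothetical columns on xxdb.xxtb: `exchange`, a, b, orderdate.
SELECT DISTINCT `exchange`                 -- keyword used as a field name, so it is backtick-quoted
FROM xxdb.xxtb                             -- single dot between database and table
WHERE upper(trim(a)) = upper(trim(b))      -- comparison that ignores case and surrounding spaces
  AND to_date(orderdate) = '2016-07-18';   -- date filter through to_date()

-- UNION emulated as DISTINCT over UNION ALL (tables t1 and t2 are hypothetical).
SELECT DISTINCT u.id
FROM (
  SELECT id FROM t1
  UNION ALL
  SELECT id FROM t2
) u;
```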

2, HQL does not support update on ordinary (non-transactional) tables; it can be emulated with union all + left join (is null).

  • Take the incremental data;
  • Left join yesterday's full-data partition to the incremental data on the primary key, and keep only the rows where the increment's primary key is null (i.e., full-data rows that do not appear in the increment);
  • Combine the data from steps 1 and 2 and overwrite the latest partition with it, which achieves the update.
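
The three steps above can be sketched as a single statement. The tables full_tb and incr_tb, the columns id and val, and the partition values are assumptions for illustration; both tables are assumed to be partitioned by ds.

```sql
-- Hypothetical tables: full_tb holds one full snapshot per ds partition,
-- incr_tb holds the rows that changed on each day.
INSERT OVERWRITE TABLE full_tb PARTITION (ds = '20160718')
SELECT u.id, u.val
FROM (
  -- Step 1: the incremental rows.
  SELECT i.id, i.val
  FROM incr_tb i
  WHERE i.ds = '20160718'

  UNION ALL

  -- Step 2: yesterday's full rows that have no match in the increment.
  SELECT f.id, f.val
  FROM full_tb f
  LEFT JOIN (SELECT id FROM incr_tb WHERE ds = '20160718') i2
    ON f.id = i2.id
  WHERE f.ds = '20160717'
    AND i2.id IS NULL
) u;  -- Step 3: the combined rows overwrite the latest partition.
```

The increment is filtered inside a subquery rather than in the outer WHERE clause, so that the partition filter on the right-hand table does not discard the unmatched rows the left join is meant to keep.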

3, HQL does not support delete on ordinary (non-transactional) tables; it can be emulated with not exists / left join (is null).

  • Take the primary keys of the data to be deleted (Table B);
  • Left join the full-data partition (Table A) to B on the primary key, keep only the rows where B's primary key is null, and then insert overwrite them directly into the new partition.
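
A minimal sketch of this delete emulation follows. The tables a_tb and b_tb, the columns id and val, and the partition values are assumptions for illustration.

```sql
-- Hypothetical tables: a_tb holds the full data (partitioned by ds),
-- b_tb holds the primary keys of the rows to delete.
INSERT OVERWRITE TABLE a_tb PARTITION (ds = '20160718')
SELECT a.id, a.val
FROM a_tb a
LEFT JOIN b_tb b
  ON a.id = b.id
WHERE a.ds = '20160717'
  AND b.id IS NULL;   -- keep only rows whose key is not in the delete list
```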

For people who already know SQL, moving to Hive SQL is relatively easy: most of the syntax carries over, and only a small part of the functions behave differently.

Origin blog.csdn.net/mnbvxiaoxin/article/details/104310499