Big Data: Hive Is Actually Not Difficult at All. From Getting Started to Giving Up? No Such Thing

Hive

First, let's explain what Hive is. Some people will think: isn't Hive just writing SQL? Yes, Hive's query language looks much like SQL in syntax and structure, the two differ very little, and you could even say that using Hive is writing SQL. But then the questions arrive: is it really SQL? How does it differ from a SQL database? How does it differ from, and relate to, other traditional offline databases? Don't worry about this string of questions; we will work through them one by one.

1. Hive was implemented and open-sourced by Facebook.
2. It is a data warehouse tool built on Hadoop.
3. It can map structured data to database tables.
4. It provides HQL (Hive SQL) queries.
5. The underlying data is stored on HDFS.
6. In essence, Hive converts SQL statements into MapReduce jobs and runs them.
7. It lets users unfamiliar with MapReduce use HQL to process and compute structured data on HDFS, and it is suited to offline batch computation.
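Points 3 and 5 can be made concrete with a small sketch. The Python below is not Hive code; it only models the idea of projecting a table schema onto a delimited text file at read time. The column names, the tab delimiter, and the sample rows are all assumptions made up for illustration.

```python
import csv
import io

# Hypothetical schema: (column name, converter). Not taken from a real Hive table.
SCHEMA = [("order_id", int), ("client_type", str), ("amount", float)]

def read_table(text, schema, delimiter="\t"):
    """Project a schema onto raw delimited text, yielding one dict per row.

    Fields that are missing or fail conversion become None, loosely
    modelling how a schema-on-read system surfaces bad fields as NULL.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        row = {}
        for i, (name, convert) in enumerate(schema):
            try:
                row[name] = convert(fields[i])
            except (IndexError, ValueError):
                row[name] = None
        rows.append(row)
    return rows

# Stand-in for a file stored on HDFS (assumed content).
RAW = "1\tiphone\t9.5\n2\tandroid\tn/a\n3\tipad\n"

table = read_table(RAW, SCHEMA)
for row in table:
    print(row)
```

The raw text never changes; only the schema decides how each line is interpreted, which is why the same files can be queried as a table without being loaded into a database first.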

Bill Inmon, the father of the data warehouse, proposed a widely accepted definition in his book "Building the Data Warehouse", published in 1991: a data warehouse is a subject-oriented, integrated, relatively stable (non-volatile) collection of data that reflects historical change (time-variant) and is used to support management decision making.

Hive relies on HDFS to store data and converts HQL into MapReduce jobs for execution. Hive is therefore a Hadoop-based data warehouse tool: in essence, a MapReduce computing framework on top of HDFS that analyzes and manages the data stored in HDFS.

Background of Hive

In the era of big data, maintaining massive amounts of data in traditional relational databases became very expensive. What could be done? Hive was born at this point. It was first open-sourced by Facebook, initially to solve the problem of computing statistics over massive structured log data. It is an ETL (Extraction-Transformation-Loading) tool for building a data warehouse on Hadoop: computation uses MapReduce (MR) and storage uses HDFS.

Hive defines a SQL-like query language called HQL. It is similar to SQL but not identical, and it is generally used for offline data processing (via MapReduce). Hive can be thought of as a translator from HQL to MR.
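To see what "translating HQL into MR" means, here is a rough Python sketch of how a query such as `select client_type, count(*) from t group by client_type` breaks down into map, shuffle, and reduce steps. It is a single-process illustration under made-up data, not the execution plan Hive actually generates.

```python
from collections import defaultdict

def map_phase(rows):
    """Emit one (key, 1) pair per input row, as a mapper would."""
    for row in rows:
        yield row["client_type"], 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the grouped values for each key, as a reducer would."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical input rows.
rows = [
    {"client_type": "iphone"},
    {"client_type": "android"},
    {"client_type": "iphone"},
]

counts = reduce_phase(shuffle(map_phase(rows)))
print(counts)
```

Writing even this trivial aggregation by hand takes three functions, which hints at why a one-line HQL statement is attractive.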

The Apache Hive data warehouse software makes it easy to read, write, and manage large data sets in distributed storage using SQL. Structure can be projected onto data that is already stored. A command-line tool and a JDBC driver are provided to connect users to Hive. The background comes down to the following points:

  1. MapReduce programming is inconvenient
  2. Files on HDFS lack field (schema) information

Hive's position in the Hadoop ecosystem

[Figure: Hive's position in the Hadoop ecosystem]

Hive Architecture

[Figures: Hive architecture]

Having said all that, you now have a rough idea of what Hive is. So why use Hive? What is unique about it? Why not just use MapReduce? First, let's look at the problems of using MapReduce directly:

  • The learning cost for staff is too high;
  • Project schedules are too short;
  • Implementing complex query logic in MapReduce is too difficult.

With Hive:

  • Friendly interface: the operation interface uses SQL-like syntax, which enables rapid development.
  • Lower learning cost: there is no need to write MapReduce, which reduces the learning cost for developers.
  • Better extensibility: the cluster can be scaled freely without restarting the service, and user-defined functions are supported.

Now let's look at the advantages and disadvantages of Hive:

Advantages

  • Scalability (horizontal and vertical): Hive clusters can be scaled freely, generally without restarting the service. Horizontal scaling spreads the load by adding machines to the cluster. Vertical scaling upgrades a single server, for example the CPU from an i7-6700K with 4 cores and 8 threads to 8 cores and 16 threads, or memory from 64 GB to 128 GB.
  • Extensibility: Hive supports user-defined functions, so users can implement their own functions as needed.
  • Good fault tolerance: even if a node fails, the SQL statement can still run to completion.

Disadvantages

  • By default Hive does not support record-level insert, update, and delete, but users can create a new table from a query or write query results to a file (the version used here, hive-2.3.2, supports record-level insert).
  • Hive query latency is high, because starting a MapReduce job takes a long time, so it cannot be used in interactive query systems.
  • By default Hive does not support transactions (because it has no record-level update or delete). It is mainly used for OLAP (online analytical processing) rather than OLTP (online transaction processing), the two major categories of data processing.

To sum up:

Hive looks like a SQL database, but its application scenarios are completely different: Hive is only suitable for offline statistical analysis of massive data, that is, the data warehouse.

Hive is also very easy to use. Let's look at the functions it offers.

Relational Operators

  • Equality comparison: =
    • Syntax: A = B. If expression A or expression B is NULL, returns NULL; if expression A equals expression B, TRUE; otherwise FALSE
  • Inequality comparison: <>
    • Syntax: A <> B. If expression A or expression B is NULL, returns NULL; if expression A is not equal to expression B, TRUE; otherwise FALSE
  • Less-than comparison: <
    • Syntax: A < B. If expression A or expression B is NULL, returns NULL; if expression A is less than expression B, TRUE; otherwise FALSE
  • Less-than-or-equal comparison: <=
    • Syntax: A <= B. If expression A or expression B is NULL, returns NULL; if expression A is less than or equal to expression B, TRUE; otherwise FALSE
  • Greater-than-or-equal comparison: >=
    • Syntax: A >= B. If expression A or expression B is NULL, returns NULL; if expression A is greater than or equal to expression B, TRUE; otherwise FALSE
  • Null check: IS NULL
    • Syntax: A IS NULL. If the value of expression A is NULL, TRUE; otherwise FALSE
  • Non-null check: IS NOT NULL
    • Syntax: A IS NOT NULL. If the value of expression A is NULL, FALSE; otherwise TRUE
  • LIKE comparison: LIKE
    • Syntax: A [NOT] LIKE B. If string A or string B is NULL, returns NULL; if string A matches the pattern B, TRUE; otherwise FALSE. In B, the character "_" stands for any single character and the character "%" stands for any number of characters.
      Example: select * from dw.topic_order where partition_pay_date = '2016-04-22' and client_type like 'ip%'; ## matches all strings beginning with ip
      Note: to escape a special character, use two backslashes (\\)
  • Java regular expression comparison: RLIKE / REGEXP
    • Syntax: A RLIKE B, or A REGEXP B. If string A or string B is NULL, returns NULL; if string A matches the Java regular expression B, TRUE; otherwise FALSE
      Example: select * from dw.topic_order where partition_pay_date = '2016-04-22' and client_type rlike '^android*'; ## regexp can be used in place of rlike
      Note: in rlike / regexp, the wildcard '%' matches only a literal '%' character, and '_' matches only a literal '_' character
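The difference between LIKE patterns and RLIKE regular expressions, and the way NULL propagates through both, can be sketched in Python. This is a model of the rules described above, not Hive's implementation: `None` stands in for NULL, Python's `re` module stands in for Java regular expressions (the two dialects differ in places), escape handling is left out, and the assumption that RLIKE succeeds when the pattern matches anywhere in the string is mine.

```python
import re

def sql_like(a, pattern):
    """LIKE: '_' matches one character, '%' matches any run of characters.

    Returns None when either argument is None, modelling SQL NULL.
    """
    if a is None or pattern is None:
        return None
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, a, flags=re.DOTALL) is not None

def sql_rlike(a, pattern):
    """RLIKE: true if the regular expression matches somewhere in the string."""
    if a is None or pattern is None:
        return None
    return re.search(pattern, a) is not None

print(sql_like("iphone", "ip%"))     # LIKE wildcard
print(sql_like(None, "ip%"))         # NULL in, NULL out
print(sql_rlike("iphone", "^ip"))    # regex anchor
print(sql_rlike("iphone", "%"))      # '%' is a literal in a regex
```

The last call returns False because "iphone" contains no literal percent sign, which is exactly the pitfall the note above warns about.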

Date Functions

  • UNIX timestamp to date: from_unixtime
    • Syntax: from_unixtime(bigint unixtime[, string format]). Converts a UNIX timestamp (the number of seconds from 1970-01-01 00:00:00 UTC to the specified time) to a time string in the current time zone, in the given format
      Example: select from_unixtime(1323308943, 'yyyyMMdd') from dual; ## returns 20111208
  • Date to year: year
    • Syntax: year(string date). Returns the year of the date.
  • Date to month: month
    • Syntax: month(string date). Returns the month of the date.
  • Date to day: day
    • Syntax: day(string date). Returns the day of the date.
  • Date to hour: hour
    • Syntax: hour(string date). Returns the hour of the date.
  • Date to minute: minute
    • Syntax: minute(string date). Returns the minute of the date.
      Example: select minute('2011-12-08 10:03:01') from dual; ## returns 3
  • Date to second: second
    • Syntax: second(string date). Returns the second of the date.
      Example: select second('2011-12-08 10:03:01') from dual; ## returns 1
  • Date to week of year: weekofyear
    • Syntax: weekofyear(string date). Returns the week of the year in which the date falls.
      Example: select weekofyear('2011-12-08 10:03:01') from dual; ## returns 49
  • Date difference: datediff
    • Syntax: datediff(string enddate, string startdate). Returns the number of days from the start date to the end date (end date minus start date).
      Example: select datediff('2012-12-08', '2012-05-09') from dual; ## returns 213
  • Date addition: date_add
    • Syntax: date_add(string startdate, int days). Returns the date that is days days after startdate.
      Example: select date_add('2012-12-08', 10) from dual; ## returns 2012-12-18
  • Date subtraction: date_sub
    • Syntax: date_sub(string startdate, int days). Returns the date that is days days before startdate.
      Example: select date_sub('2012-12-08', 10) from dual; ## returns 2012-11-28
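The return values quoted in the examples above can be cross-checked with Python's standard library. The functions below are illustrative stand-ins that reuse the Hive names, not Hive itself. Two assumptions: weekofyear is taken to follow ISO 8601 week numbering, and the timestamp is formatted in UTC rather than the session time zone (with Python's `%Y%m%d` in place of the Java-style `yyyyMMdd`).

```python
from datetime import date, datetime, timedelta, timezone

def datediff(enddate, startdate):
    """Days from startdate to enddate (end minus start)."""
    return (date.fromisoformat(enddate) - date.fromisoformat(startdate)).days

def date_add(startdate, days):
    """Date that lies `days` days after startdate, as a yyyy-MM-dd string."""
    return (date.fromisoformat(startdate) + timedelta(days=days)).isoformat()

def date_sub(startdate, days):
    """Date that lies `days` days before startdate."""
    return date_add(startdate, -days)

def weekofyear(ts):
    """ISO 8601 week number of a 'yyyy-MM-dd HH:mm:ss' timestamp (assumed rule)."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").isocalendar()[1]

def from_unixtime_utc(unixtime, fmt="%Y%m%d"):
    """Format a UNIX timestamp in UTC; Hive uses the current time zone instead."""
    return datetime.fromtimestamp(unixtime, tz=timezone.utc).strftime(fmt)

print(datediff("2012-12-08", "2012-05-09"))   # 213
print(date_add("2012-12-08", 10))             # 2012-12-18
print(date_sub("2012-12-08", 10))             # 2012-11-28
print(weekofyear("2011-12-08 10:03:01"))      # 49
print(from_unixtime_utc(1323308943))          # 20111208
```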

Conditional Functions

  • If function: if
    • Syntax: if(boolean testCondition, T valueTrue, T valueFalseOrNull)
      Description: when testCondition is TRUE, returns valueTrue; otherwise returns valueFalseOrNull.
      Example: select if(app_name = 'group', object_id, null) as deal_id from dw.topic_order where partition_pay_date = '2016-04-22'
  • First non-null value: COALESCE
    • Syntax: COALESCE(T v1, T v2, ...)
      Description: returns the first non-NULL value among the arguments; if every value is NULL, returns NULL.
      Example: select coalesce(uuid, '') as uuid from dw.topic_order where partition_pay_date = '2016-04-22'
  • Conditional expression: CASE
    • Syntax: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
      Description: if a equals b, returns c; if a equals d, returns e; otherwise returns f.
      Example: select object_id, user_id, uuid, case when client_type like 'ip%' then 'ios' when client_type like 'andr%' then 'android' else 'other' end as utm_medium from dw.topic_order where partition_pay_date = '2016-04-22'
      Note: of the three, case when is the most general and can express any kind of condition; if comes next and handles a two-way choice; coalesce can only distinguish null from non-null.
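The three conditional constructs can be modelled in a few lines of Python to show how they differ in expressive power. The names sql_if and utm_medium are hypothetical helpers written for this sketch; `None` stands in for NULL, and a NULL condition is assumed to take the "false" branch.

```python
def sql_if(condition, value_true, value_false_or_null):
    """Two-way choice: only a condition that is exactly True picks value_true."""
    return value_true if condition is True else value_false_or_null

def coalesce(*values):
    """Return the first argument that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None

def utm_medium(client_type):
    """Python rendering of the CASE WHEN example: the first matching branch wins."""
    if client_type is not None and client_type.startswith("ip"):
        return "ios"
    if client_type is not None and client_type.startswith("andr"):
        return "android"
    return "other"

print(sql_if(True, 42, None))        # 42
print(coalesce(None, ""))            # empty string
print(utm_medium("android_pad"))     # android
print(utm_medium(None))              # other
```

coalesce(x, default) is equivalent to sql_if(x is not None, x, default), and either can be written as a CASE expression, which matches the ordering given in the note above.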

Statistical Functions

  • Count: count
    • Syntax: COUNT(*), COUNT(expr), COUNT(DISTINCT expr[, expr...]). count(*) returns the number of rows retrieved, including rows that contain NULL values; count(expr) returns the number of non-NULL values of the specified field; count(DISTINCT expr[, expr...]) returns the number of distinct non-NULL values of the specified field
  • Sum: sum
    • Syntax: sum(col), sum(DISTINCT col). sum(col) returns the sum of col over the result set; sum(DISTINCT col) returns the sum of the distinct values of col
  • Average: avg
    • Syntax: avg(col), avg(DISTINCT col). avg(col) returns the average of col over the result set; avg(DISTINCT col) returns the average of the distinct values of col
  • Minimum: min
    • Syntax: min(col). Returns the minimum value of the col field in the result set
  • Maximum: max
    • Syntax: max(col). Returns the maximum value of the col field in the result set
  • Median and percentiles: percentile
    • Syntax: percentile(BIGINT col, p). Computes the exact pth percentile; p must be between 0 and 1. The col field currently supports only integers; floating-point types are not supported.
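How NULL values affect the aggregates is easiest to see on a small column. The Python sketch below uses `None` for NULL and a made-up column. The percentile function assumes linear interpolation between neighbouring ranks when the position is not a whole number; that rule is an assumption of this sketch, not something stated above.

```python
import math

def count_star(values):
    """count(*): every row counts, NULL or not."""
    return len(values)

def count(values):
    """count(col): only non-NULL values count."""
    return sum(1 for v in values if v is not None)

def count_distinct(values):
    """count(DISTINCT col): distinct non-NULL values."""
    return len({v for v in values if v is not None})

def sql_sum(values):
    """sum(col): NULLs are skipped; NULL when there is nothing to add."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None

def avg(values):
    """avg(col): mean of the non-NULL values."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) / len(non_null) if non_null else None

def percentile(values, p):
    """Exact p-th percentile (0 <= p <= 1) with linear interpolation (assumed)."""
    data = sorted(v for v in values if v is not None)
    if not data:
        return None
    pos = (len(data) - 1) * p
    lower, upper = math.floor(pos), math.ceil(pos)
    return data[lower] + (data[upper] - data[lower]) * (pos - lower)

col = [3, None, 1, 3, 8]  # hypothetical integer column with one NULL
print(count_star(col), count(col), count_distinct(col))
print(sql_sum(col), avg(col), percentile(col, 0.5))
```

Note that avg divides by the number of non-NULL values (4 here), not by the number of rows (5), so avg(col) and sum(col) / count(*) can disagree.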

Those are some of Hive's functions. There are many more; I have listed only the ones I consider most important. They look a lot like SQL, don't they? It really is not difficult. Next, let's talk about the similarities and differences between Hive and databases.

Because Hive uses the SQL-like query language HQL, it is easy to mistake Hive for a database. In fact, from a structural point of view, apart from the similar query language, Hive and databases have nothing in common. Databases can be used in online applications, whereas Hive is designed for the data warehouse. Keeping this in mind helps in understanding Hive's characteristics from an application point of view.

The table below compares Hive with a database:

[Table image: comparison of Hive and a database]

Summary: Hive looks like a SQL database, but its application scenarios are completely different: Hive is only suitable for offline statistical analysis of massive data, that is, the data warehouse.

Origin blog.csdn.net/weixin_44598691/article/details/105016295