Hive functions (this article covers what you need to use them)

Table of contents

Querying function information:

(1) View all built-in functions of the system

(2) View all functions related to strings

(3) Query the details of a function

Single-row functions:

(1) Arithmetic functions:

(2) Numeric functions:

(3) String functions:

(4) Date functions:

(5) Flow control functions:

(6) Collection functions:

(7) Advanced aggregate functions:

Explode functions:

(1) explode: (the most commonly used explode function)

(2) posexplode:

(3) inline function:

(4) Lateral View:

Window functions:

Syntax: window, row-based:

Syntax: window, value-based:

Syntax: window, partition:

Syntax: window, defaults:

Window functions: cross-row value functions:

(1) lead and lag:

(2) first_value and last_value

Window functions: ranking functions:

Functions in Hive serve the same purpose as functions in Java or MySQL: they are built-in routines that each perform a specific task.

Querying function information:

(1) View all built-in functions of the system

show functions;

(2) View all functions related to strings

show functions like '*string*';

(3) Query the details of a function

desc function extended substring;

Single-row functions:

A single-row function takes one row as input and produces one row as output.

(1) Arithmetic functions:

        (1) Bitwise AND

select 3&2;

How it works: the inputs are decimal numbers. Hive converts both to binary, performs a bitwise AND on the two binary numbers, and converts the result back to decimal for output. Here 3 is 11 and 2 is 10 in binary, so the result is 10, which is 2 in decimal.

(2) Numeric functions:

        (1) round function (rounding)

By default, round rounds to the nearest integer; an optional second argument sets the number of decimal places to keep.

select round(3.345,1);

(3) String functions:

        (1) substring function (extracts part of a string)

The full form takes three parameters:

Parameter 1: the string

Parameter 2: the start position. Positions start at 1. A positive number counts from left to right; a negative number counts from right to left, starting at -1.

Parameter 3: the number of characters to extract. If omitted, the substring runs to the end of the string.

select substring('facesbook',5);

select substring('facesbook',-4);
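
The two queries above omit the third parameter. As a small illustration (the length values are chosen only for this example), the three-parameter form limits how many characters are returned:

```sql
-- Start at position 5 ('s') and take 3 characters
select substring('facesbook', 5, 3);   -- 'sbo'

-- Start 4 characters from the end ('b') and take 2 characters
select substring('facesbook', -4, 2);  -- 'bo'
```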

 

         (2) replace function (replacement function)

As the name suggests, this function replaces every occurrence of a substring with another string. The parameters are the source string, the substring to find, and the replacement.

select replace('xiaotangtongxue','x','X');

 

 (3) regexp_replace function (regular expression replacement):

This replaces the parts of a string that match a regular expression.

regexp_replace(string A, string B, string C)

Parameter 1: the source string

Parameter 2: the regular expression (if it contains \, check whether the backslash needs to be escaped)

Parameter 3: the replacement string

select regexp_replace("abcd-123-abcd","[0-9]{1,}","&");

(4) regexp (regular expression matching):

Returns true if the string matches the regular expression, otherwise returns false.

(Both like and regexp perform matching, so why use regexp?)

This question came up while I was learning. After checking some references, I reached the following conclusions:

like is only suitable for simple fuzzy matching, such as strings that begin or end with specific characters or that contain a fixed substring.

Regular expressions handle more complex patterns and are more flexible.

The following case, which tests whether a string contains one or more digits, is awkward to express with like:

select 'dfsaaaa1234' regexp "[0-9]{1,}";

(5) repeat (repeat string):

select repeat("123",3);

(6) split (split function):

The delimiter passed to split in Hive is a regular expression, not a literal string, so special characters such as . must be escaped

select split("192.168.10.102","\\.");

(7) nvl(A,B) (replace null value)

If the value of A is not null, return A, otherwise return B. 

select nvl(null,0);
select nvl(4,0);

(8) concat: concatenate strings

concat(string A, string B, string C, ……)

Concatenate the strings A, B, C, ... into a single string

select concat("1","-","a","b");

(9) concat_ws: concatenate strings or a string array with a specified delimiter

 concat_ws(string A, string B, ... | array<string>)

Use delimiter A to join multiple strings, or all elements of an array.

select concat_ws("-","qq","weixin","bb","cc");
select concat_ws("-",array("aa","bb","cc","dd"));

(10) get_json_object: (parse a JSON string)

The function takes two parameters

Parameter 1: the JSON string to parse

Parameter 2: the path of the value to extract. The path starts with $, which refers to the JSON string passed as parameter 1, followed by . and the location to look up.

select get_json_object('[{"name":"大海海","sex":"男","age":"25"},{"name":"小宋宋","sex":"男","age":"47"}]','$.[0].name');

 (4) Date functions:

(1) unix_timestamp: returns the timestamp of the current or specified time

Timestamp: the timestamp we usually refer to is the UNIX timestamp, the number of seconds elapsed since January 1, 1970 00:00:00 UTC (Coordinated Universal Time, which can be thought of as time zone 0 and avoids time zone ambiguity). It represents the difference between a point in time and the UNIX epoch, and is usually stored as an integer.

With no arguments, it returns the timestamp of the current time:

select unix_timestamp();

Get the timestamp for a specified time:

select unix_timestamp('2022/08/08 08-08-08','yyyy/MM/dd HH-mm-ss');

The first argument is the time to convert, and the second is the format in which that time is written.

(Note that when a given time is converted to a timestamp, it is interpreted as a time in time zone 0 (UTC), not in the local time zone.)

(2) from_unixtime: convert a UNIX timestamp (the number of seconds from 1970-01-01 00:00:00 UTC to the specified time) to a time string in the current time zone

The first parameter is the timestamp to convert

The second parameter is the output format you need (can be omitted)
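
The article gives no query for from_unixtime, so here is a minimal sketch. The timestamp value is borrowed from the from_utc_timestamp query in this article, and the format pattern in the second query is only an illustration. No output is shown because it depends on the Hive version and time zone settings.

```sql
-- One argument: the result uses the default format yyyy-MM-dd HH:mm:ss
select from_unixtime(1659946088);

-- Two arguments: the second one sets the output format
select from_unixtime(1659946088, 'yyyy/MM/dd HH:mm');
```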

(3) from_utc_timestamp function: convert a UTC timestamp to the time in a given time zone

The first parameter: an integer timestamp in milliseconds. The timestamps we use are in seconds, so multiply by 1000.

The second parameter: a string containing the time zone code

Note that the timestamp in seconds is an int, so it must be cast to bigint before multiplying by 1000 to prevent overflow

select from_utc_timestamp(cast(1659946088 as bigint)*1000,'GMT+8');

The result ends with a run of trailing zeros. To remove them, pass the result to the date formatting function and specify the format you want.
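
One way to do that, sketched here with date_format (described later in this section) and a format pattern chosen only for illustration:

```sql
-- Wrap the converted time in date_format to drop the trailing zeros
select date_format(
         from_utc_timestamp(cast(1659946088 as bigint)*1000, 'GMT+8'),
         'yyyy-MM-dd HH:mm:ss');
```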

(4) current_date: the current date (current time zone)

select current_date;

(5) current_timestamp: the current date and time, precise to the millisecond (current time zone)

select current_timestamp;

(6) month: Get the month in the specified date

select month('2022-08-08 08:08:08');

(7) day: Get the day in the date

(8) hour: Get the hour in the date
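
day and hour are called the same way as month. A short illustration, reusing the date string from the month query:

```sql
select day('2022-08-08 08:08:08');   -- 8
select hour('2022-08-08 08:08:08');  -- 8
```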

 (9) datediff: the number of days between two dates (the end date minus the start date)

datediff(string enddate, string startdate)

select datediff('2021-08-08','2022-10-09');

Because the first argument here is the earlier date, the result is negative.

(10) date_add: date plus days

Syntax: date_add(string startdate, int days)

Returns the date that falls the given number of days after startdate

select date_add('2022-08-08',2);  

(11) date_sub: date minus days
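
date_sub mirrors date_add. A minimal illustration, using the same start date as the date_add query:

```sql
-- Two days before 2022-08-08
select date_sub('2022-08-08', 2);  -- 2022-08-06
```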

(12) date_format: format a standard date as a string in the specified format

select date_format('2022-08-08','yyyy年-MM月-dd日');

(5) Flow control functions:

(1) case when: conditional function

Syntax 1: case when a then b [when c then d]* [else e] end

Each when is followed by a condition

Description: If a is true, return b; if c is true, return d; otherwise return e

select case when 'tan'='tanh' then 'great' when 'xia'='xia' then 'great2' when 'con'='con' then 'candy' end;

Syntax 2: case a when b then c [when d then e]* [else f] end

Description: If a is equal to b, then return c; if a is equal to d, then return e; otherwise return f

select case 'tan' when 'tanh' then 'great' when 'xia' then 'great2' when 'con' then 'candy' else 'none match' end;

(2) if: conditional function, similar to the ternary operator in Java

Syntax: if(boolean testCondition, T valueTrue, T valueFalseOrNull)

Description: When the condition testCondition is true, return valueTrue; otherwise return valueFalseOrNull

select if(10 > 5,'correct','wrong');

(6) Collection functions:

(1) size: the number of elements in a collection

select size(friends) from test;

(2) map: Create a map collection

Syntax: map (key1, value1, key2, value2, …)

Description: Build a map type based on the input key and value pairs

select map('xiaohai',1,'dahai',2); 

 (3) map_keys: returns the keys of the map

select map_keys(map('xiaohai',1,'dahai',2));

(4) map_values: returns the values of the map

select map_values(map('xiaohai',1,'dahai',2));

(5) array: create an array

Syntax: array(val1, val2, ...)

Description: Build an array from the input values

 select array('1','2','3','4');

 (6) array_contains: Determine whether an element is contained in the array

select array_contains(array('a','b','c','d'),'e');

(7) sort_array: Sort the elements in the array

 select sort_array(array('a','d','c'));

(8) struct: create a struct

Build a struct from the input values. Only the values are given; the fields are named col1, col2, ... automatically.

select struct('name','age','weight');

 (9) named_struct: create a struct with the given field names and values

select named_struct('name','xiaosong','age',18,'weight',80);

(7) Advanced aggregate functions:

(1) collect_list: collect the values into a list; the result is not deduplicated

select collect_list(job) from employee;

(2) collect_set: collect the values into a set; the result is deduplicated

select collect_set(job) from employee;

Example:

Count the employees hired in each month and list their names
select month(replace(hiredate,'/','-')), count(1), collect_list(name) from employee group by month(replace(hiredate,'/','-'));

Explode functions:

Explode functions are UDTFs (user-defined table-generating functions): they take one row of input and output one or more rows

Note: (explode takes an array as input; it also accepts a map, as in case 2)

(1) explode: (the most commonly used explode function)

Function: given an array, it expands the elements of the array into multiple rows

Case 1:

select explode(array("1","b","c")) as item;

Case 2:

select explode(`map`("a",1,"b",2)) as (key,value);

 

(2) posexplode:

 Function: returns two columns of data (the position of each element and the element itself)

select posexplode(`array`("a","b","c")) as (pos,item);

(3) inline function:

Function: expands an array of structs into rows, with one column per struct field

select inline(array( named_struct("id",1,"name","zs"),
                      named_struct("id",2,"name","ls"),
                      named_struct("id",3,"name","txc")))
    as (id,name);

  (4) Lateral View:

An explode function on its own expands a single value into rows. Lateral View applies the explode function to each row of the source table and joins the resulting rows back to that source row. The query then uses this joined virtual table as its source table.

For example, in a clause such as lateral view explode(hobbies) tmp as hobby (the column name hobbies is only illustrative):

 Parameter 1:

tmp is the alias of the table produced by the explode function

 Parameter 2:

hobby is the name given to the column of the tmp table; if there is more than one column, separate the names with commas

Explode function case:

select type, count(1)
from movie_info
lateral view explode(split(category,',')) tmp as type
group by type;

Window functions:

A window function is a combination of a window and a function. The window defines the range of rows to compute over, and the function defines the calculation applied to the rows in that range.

Most aggregate functions can be used as window functions (aggregate functions are many-to-one: many input rows produce one value)

There are two ways to define a window: (1) row-based (by row position) (2) value-based (by the value of a column)
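
As a minimal sketch of the "function + window" idea, the query below uses sum as the function and an empty over () as the window, which covers every row. The order_info table and its columns are hypothetical sample data built inline so the query runs on its own.

```sql
-- Hypothetical sample data, defined inline
with order_info as (
  select 1 as order_id, 10 as order_amount
  union all
  select 2 as order_id, 20 as order_amount
  union all
  select 3 as order_id, 30 as order_amount
)
select order_id,
       order_amount,
       -- window = all rows, function = sum
       sum(order_amount) over () as total_amount
from order_info;
-- every row keeps its own order_id and order_amount, and total_amount is 60 on each row
```

Unlike group by, the window form does not collapse the rows: each input row is still returned, with the aggregated value attached.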

 

Syntax: window, row-based:

A row-based window is defined by row positions. Because the computation is distributed (for example, the data is split across MapReduce tasks), the rows that reach the window are not guaranteed to be in the order of the original table. When defining a row-based window, use order by on some column so that the rows have a definite order.
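
A hedged sketch of a row-based window, written with the rows between ... and ... clause. The order_info table and its columns are hypothetical sample data, not taken from the original article.

```sql
-- Hypothetical sample data, defined inline
with order_info as (
  select 1 as order_id, 10 as order_amount
  union all
  select 2 as order_id, 20 as order_amount
  union all
  select 3 as order_id, 30 as order_amount
  union all
  select 4 as order_id, 40 as order_amount
)
select order_id,
       order_amount,
       -- from the first row up to the current row: a running total
       sum(order_amount) over (order by order_id
                               rows between unbounded preceding and current row) as running_total,
       -- the previous row, the current row and the next row
       sum(order_amount) over (order by order_id
                               rows between 1 preceding and 1 following) as neighbor_total
from order_info;
-- running_total for order_id 1..4:  10, 30, 60, 100
-- neighbor_total for order_id 1..4: 30, 60, 90, 70
```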


Syntax: window, value-based:

 In a value-based window, order by selects the column whose values define the window boundaries

Note: (when the boundaries are written as n preceding or n following, the order by column must be a numeric type)
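
A hedged sketch of a value-based window, written with range between ... and .... Here the boundaries are computed from the value of order_amount rather than from row positions. The order_info table and its columns are hypothetical sample data.

```sql
-- Hypothetical sample data, defined inline
with order_info as (
  select 1 as order_id, 10 as order_amount
  union all
  select 2 as order_id, 20 as order_amount
  union all
  select 3 as order_id, 30 as order_amount
  union all
  select 4 as order_id, 40 as order_amount
)
select order_id,
       order_amount,
       -- rows whose order_amount lies in [current value - 10, current value]
       sum(order_amount) over (order by order_amount
                               range between 10 preceding and current row) as amount_in_range
from order_info;
-- amount_in_range for order_amount 10, 20, 30, 40: 10, 30, 50, 70
```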

Syntax: window, partition:

Partition: a partition column can be specified when defining the window; the window is then computed separately within each partition

 partition by names the partition column (rows with different values in this column fall into different partitions)
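
A hedged sketch of partitioning: the running total restarts for each user_id. The order_info table and its columns are hypothetical sample data.

```sql
-- Hypothetical sample data, defined inline
with order_info as (
  select 1 as order_id, 'u1' as user_id, 10 as order_amount
  union all
  select 2 as order_id, 'u1' as user_id, 20 as order_amount
  union all
  select 3 as order_id, 'u2' as user_id, 30 as order_amount
  union all
  select 4 as order_id, 'u2' as user_id, 40 as order_amount
)
select order_id,
       user_id,
       order_amount,
       -- each user_id is its own partition, so the total restarts per user
       sum(order_amount) over (partition by user_id
                               order by order_id
                               rows between unbounded preceding and current row) as user_running_total
from order_info;
-- u1 (order_id 1, 2): 10, 30
-- u2 (order_id 3, 4): 30, 70
```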

 Syntax: window, defaults:

 "Default" here refers to what applies when a keyword is omitted from the over clause of a window function
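
As a sketch of what these defaults amount to, the query below pairs each shortened form with the explicit form that the Hive windowing documentation describes as its equivalent. If partition by is also omitted, the whole result set is treated as one partition. The order_info table and its columns are hypothetical sample data.

```sql
-- Hypothetical sample data, defined inline
with order_info as (
  select 1 as order_id, 10 as order_amount
  union all
  select 2 as order_id, 20 as order_amount
  union all
  select 3 as order_id, 30 as order_amount
)
select order_id,
       -- order by present, window clause omitted ...
       sum(order_amount) over (order by order_id) as short_form_1,
       -- ... defaults to a value-based window ending at the current row
       sum(order_amount) over (order by order_id
                               range between unbounded preceding and current row) as explicit_1,
       -- order by and window clause both omitted ...
       sum(order_amount) over () as short_form_2,
       -- ... defaults to the whole partition
       sum(order_amount) over (rows between unbounded preceding
                               and unbounded following) as explicit_2
from order_info;
```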

 Window functions: cross-row value functions:

(1) lead and lag:

Function: get the value of a column from a row a given number of rows before (lag) or after (lead) the current row

 Note: lag and lead do not support custom window frames; in over, you only specify the partitioning and ordering that are needed.
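
A hedged sketch of lag and lead. Both take the column, the row offset, and a default value to return when no such row exists. The order_info table, its columns, and the default values are hypothetical.

```sql
-- Hypothetical sample data, defined inline
with order_info as (
  select 1 as order_id, 'u1' as user_id, '2022-01-01' as order_date
  union all
  select 2 as order_id, 'u1' as user_id, '2022-01-02' as order_date
  union all
  select 3 as order_id, 'u2' as user_id, '2022-01-03' as order_date
  union all
  select 4 as order_id, 'u2' as user_id, '2022-01-04' as order_date
)
select order_id,
       user_id,
       order_date,
       -- value from 1 row earlier in the same partition, or the default
       lag(order_date, 1, '1970-01-01') over (partition by user_id order by order_date) as prev_order_date,
       -- value from 1 row later in the same partition, or the default
       lead(order_date, 1, '9999-12-31') over (partition by user_id order by order_date) as next_order_date
from order_info;
-- order_id 1: prev 1970-01-01, next 2022-01-02
-- order_id 2: prev 2022-01-01, next 9999-12-31
```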

 (2) first_value and last_value

Function: get the first or last value of the specified column within the window

(The function is evaluated row by row over each row's own window, so the result is not necessarily the same for every row)

 These two functions support custom window frames
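
A hedged sketch of first_value and last_value with an explicit window frame covering the whole partition. The order_info table and its columns are hypothetical sample data.

```sql
-- Hypothetical sample data, defined inline
with order_info as (
  select 1 as order_id, 'u1' as user_id, '2022-01-01' as order_date
  union all
  select 2 as order_id, 'u1' as user_id, '2022-01-02' as order_date
  union all
  select 3 as order_id, 'u2' as user_id, '2022-01-03' as order_date
  union all
  select 4 as order_id, 'u2' as user_id, '2022-01-04' as order_date
)
select order_id,
       user_id,
       order_date,
       first_value(order_date) over (partition by user_id order by order_date
                                     rows between unbounded preceding
                                     and unbounded following) as first_order_date,
       last_value(order_date) over (partition by user_id order by order_date
                                    rows between unbounded preceding
                                    and unbounded following) as last_order_date
from order_info;
-- u1 rows: first 2022-01-01, last 2022-01-02
-- u2 rows: first 2022-01-03, last 2022-01-04
```

The explicit frame matters for last_value: if it is left out while order by is present, the window ends at the current row, so last_value returns the current row's value here instead of the last value in the partition.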


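 Window functions: ranking functions:

Commonly used ranking functions are: rank, dense_rank, row_number

The three differ in how they handle ties: rank gives tied rows the same rank and skips the following ranks (1, 1, 3), dense_rank gives tied rows the same rank without skipping (1, 1, 2), and row_number numbers the rows consecutively regardless of ties (1, 2, 3)

Function: calculate ranking

Note: rank, dense_rank, row_number do not support custom window frames.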
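
A hedged sketch comparing the three ranking functions on data that contains a tie. The score_info table and its columns are hypothetical sample data.

```sql
-- Hypothetical sample data with a tie on score, defined inline
with score_info as (
  select 'a' as stu, 90 as score
  union all
  select 'b' as stu, 90 as score
  union all
  select 'c' as stu, 80 as score
)
select stu,
       score,
       rank()       over (order by score desc) as rk,
       dense_rank() over (order by score desc) as dense_rk,
       row_number() over (order by score desc) as rn
from score_info;
-- a and b (score 90): rk 1, dense_rk 1, rn 1 and 2 (which of the two gets 1 is arbitrary)
-- c (score 80):       rk 3, dense_rk 2, rn 3
```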


Origin blog.csdn.net/m0_61469860/article/details/131445243