Functions in Hive serve the same purpose as the built-in functions in Java or MySQL: each one is a predefined routine that performs a specific task.
Querying information about functions:
(1) View all built-in functions of the system
show functions;
(2) View all functions related to string
show functions like '*string';
(3) Query the details of a function
desc function extended substring;
Single-row functions:
A single-row function takes one row as input and produces one row as output.
(1) Arithmetic operation function:
(1) Bitwise AND
select 3&2;
How it works: the inputs are decimal numbers. Internally the two numbers are converted to binary, the bitwise AND is applied to the two binary values, and the result is converted back to decimal for output.
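To make the binary step visible, the built-in bin function can print the binary form of each operand and of the result. This is only an illustrative sketch; the column aliases are made up for the example.

```sql
-- 3 is 11 in binary and 2 is 10. AND-ing them bit by bit gives 10, which is 2.
-- Expected row: 11, 10, 2, 10
select bin(3)     as a_binary,
       bin(2)     as b_binary,
       3 & 2      as and_decimal,
       bin(3 & 2) as and_binary;
```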
(2) Numerical function:
(1) round function (rounding)
By default the round function rounds to the nearest integer. An optional second argument specifies how many decimal places to keep.
select round(3.345,1);
(3) String function:
(1) substring function (extracts part of a string)
The full form takes three parameters:
Parameter 1: the string
Parameter 2: the position to start from. Positions start at 1. A positive number counts from left to right, starting at 1; a negative number counts from right to left, starting at -1.
Parameter 3: the number of characters to extract. If omitted, the substring runs to the end of the string.
select substring('facesbook',5);
select substring('facesbook',-4);
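Neither query above uses the third parameter. A small sketch of how the length argument behaves with a positive and a negative start position:

```sql
-- Positions in facesbook: f=1 a=2 c=3 e=4 s=5 b=6 o=7 o=8 k=9
-- Start at position 5 and take 3 characters. Expected: sbo
select substring('facesbook', 5, 3);
-- Start 4 characters from the right and take 2 characters. Expected: bo
select substring('facesbook', -4, 2);
```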
(2) replace function (replacement function)
As the name suggests, this function replaces every occurrence of a substring with another string. The parameters are the source string, the substring to find, and the replacement:
select replace('xiaotangtongxue','x','X');
(3) regexp_replace function (regular replacement):
Replaces the parts of a string that match a regular expression.
regexp_replace(string A, string B, string C)
Parameter one: the main string
Parameter two: the regular expression (if the pattern contains \, check whether it needs to be escaped)
Parameter three: the replacement string
select regexp_replace("abcd-123-abcd","[0-9]{1,}","&");
(4) regexp (regular matching):
Returns true if the string matches the regular expression, otherwise returns false.
(like also does matching, so why use regexp?)
That was my own question while learning. After looking into it, I came to the following conclusion:
like is only suitable for simple fuzzy matching, such as strings that begin with, end with, or contain a fixed piece of text.
Regular expressions are more flexible and can describe much more complex patterns.
The following check, whether a string contains one or more consecutive digits, is awkward to express with like:
select 'dfsaaaa1234' regexp "[0-9]{1,}";
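A side-by-side sketch of the difference, using made-up sample strings: like can only test for fixed text, while regexp can test for a class of characters.

```sql
-- like can say: ends with the literal text 1234. Expected: true
select 'dfsaaaa1234' like '%1234';
-- regexp can say: ends with any four digits. Expected: true
select 'dfsaaaa1234' regexp '[0-9]{4}$';
-- The last four characters are not all digits. Expected: false
select 'dfsaaaa12x4' regexp '[0-9]{4}$';
```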
(5) repeat (repeat string):
select repeat("123",3);
(6) split (split function):
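The delimiter passed to split in Hive is not a literal string but a regular expression, so characters with a special meaning in regular expressions (such as .) must be escaped.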
select split("192.168.10.102","\\.");
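Because the delimiter is a regular expression, one call can split on several different separators. A short sketch with made-up input:

```sql
-- A character class splits on either a comma or a semicolon.
-- Expected: ["a","b","c"]
select split('a,b;c', '[,;]');
-- With the dot escaped, the address is cut into its four parts.
-- Expected: 4
select size(split('192.168.10.102', '\\.'));
```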
(7) nvl(A,B) (replace null value)
If the value of A is not null, return A, otherwise return B.
select nvl(null,0);
select nvl(4,0);
(8) concat: concatenate strings
concat(string A, string B, string C, ……)
Concatenates A, B, C, ... into a single string
select concat("1","-","a","b");
(9) concat_ws: concatenate strings or string arrays with specified delimiters
concat_ws(string A, string…| array(string))
Use delimiter A to concatenate multiple strings, or all elements of an array.
select concat_ws("-","qq","weixin","bb","cc");
select concat_ws("-",array("aa","bb","cc","dd"));
(10) get_json_object: (parse a JSON string)
The function takes two parameters
Parameter 1: the JSON string to parse
Parameter 2: the path of the value to extract, written with a leading $ that stands for the JSON string passed as parameter 1
select get_json_object('[{"name":"大海海","sex":"男","age":"25"},{"name":"小宋宋","sex":"男","age":"47"}]','$.[0].name');
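The same function also works on a single JSON object and on arrays nested inside it. A minimal sketch; the JSON content here is invented for illustration:

```sql
-- Top-level field of a JSON object. Expected: xiaohai
select get_json_object('{"name":"xiaohai","age":25,"friends":["dahai","xiaosong"]}', '$.name');
-- Second element of an array nested inside the object. Expected: xiaosong
select get_json_object('{"name":"xiaohai","age":25,"friends":["dahai","xiaosong"]}', '$.friends[1]');
-- A path that does not exist. Expected: NULL
select get_json_object('{"name":"xiaohai"}', '$.sex');
```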
(4) Date function: (year, month, day)
(1) unix_timestamp: returns the timestamp of the current or specified time
Timestamp: the timestamp we usually refer to is the Unix timestamp, the number of seconds elapsed since 1970-01-01 00:00:00 UTC (Coordinated Universal Time, which can be thought of as time zone 0 and makes timestamps independent of local time zones). It expresses the distance between a point in time and the Unix epoch, and is usually stored as an integer.
With no arguments, the function returns the timestamp of the current time:
select unix_timestamp();
Get the timestamp for a specified time:
select unix_timestamp('2022/08/08 08-08-08','yyyy/MM/dd HH-mm-ss');
The first argument is the time to convert, and the second argument is the format in which that time is written.
(Note that the given time is interpreted as a time in time zone 0, i.e. UTC, when it is converted into a timestamp, not as a time in the local time zone.)
(2) from_unixtime: converts a Unix timestamp (the number of seconds from 1970-01-01 00:00:00 UTC to the specified time) into a formatted time in the current time zone
The first parameter is the timestamp to be converted
The second parameter is the specific format you need (can be omitted)
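A short sketch of both forms, reusing the timestamp 1659946088 from this article (2022-08-08 08:08:08 in UTC). The displayed time depends on how the Hive version and session handle time zones, so no output is shown here.

```sql
-- Default output format: yyyy-MM-dd HH:mm:ss
select from_unixtime(1659946088);
-- Custom output format: date part only
select from_unixtime(1659946088, 'yyyy-MM-dd');
```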
(3) from_utc_timestamp function: converts a UTC timestamp to the time in a given time zone
The first parameter: an integer timestamp in milliseconds. The timestamps used above are in seconds, so multiply by 1000.
The second parameter: a string containing the time zone code
Note that a timestamp in seconds is an int. Cast it to bigint before multiplying by 1000 so that the result does not overflow.
select from_utc_timestamp(cast(1659946088 as bigint)*1000,'GMT+8');
The result ends with a long run of trailing zeros. To remove them, pass the result to a date formatting function such as date_format and specify the format you want.
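One way to do that, as a sketch: wrap the call in date_format. The pattern string is an example choice, not the only option.

```sql
-- Format the converted time without the trailing zeros.
select date_format(
         from_utc_timestamp(cast(1659946088 as bigint) * 1000, 'GMT+8'),
         'yyyy-MM-dd HH:mm:ss'
       );
```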
(4) current_date: returns the current date (current time zone)
select current_date;
(5) current_timestamp: returns the current date and time, precise to the millisecond (current time zone)
select current_timestamp;
(6) month: Get the month in the specified date
select month('2022-08-08 08:08:08');
(7) day: Get the day in the date
(8) hour: Get the hour in the date
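day and hour are called the same way as month. A small sketch with a sample time whose parts are all different, so each result is easy to tell apart:

```sql
-- Expected: 9
select day('2022-08-09 10:11:12');
-- Expected: 10
select hour('2022-08-09 10:11:12');
```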
(9) datediff: the number of days between two dates (the end date minus the start date)
datediff(string enddate, string startdate)
select datediff('2021-08-08','2022-10-09');
Here the first argument is the earlier date, so the result is negative (-427).
(10) date_add: date plus days
Syntax: date_add(string startdate, int days)
Returns the date obtained by adding the given number of days to startdate
select date_add('2022-08-08',2);
(11) date_sub: date minus days
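date_sub takes the same arguments as date_add and moves the date backwards instead. A minimal sketch:

```sql
-- Two days before 2022-08-08. Expected: 2022-08-06
select date_sub('2022-08-08', 2);
-- Crossing a month boundary. Expected: 2022-02-28
select date_sub('2022-03-01', 1);
```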
(12) date_format: formats a standard date as a string in the specified format
select date_format('2022-08-08','yyyy年-MM月-dd日');
(5) Flow control functions
(1) case when: conditional judgment function
Syntax 1: case when a then b [when c then d]* [else e] end
Each when is followed by a condition.
Description: If a is true, return b; if c is true, return d; otherwise return e
select case when 'tan'='tanh' then '棒' when 'xia'='xia' then '棒2' when 'con'='con' then '糖' end;
Syntax 2: case a when b then c [when d then e]* [else f] end
Description: If a is equal to b, then return c; if a is equal to d, then return e; otherwise return f
select case 'tan' when 'tanh' then '棒' when 'xia' then '棒2' when 'con' then '糖' else '都不对' end;
(2) if: conditional judgment, similar to the ternary operator in Java
Syntax: if(boolean testCondition, T valueTrue, T valueFalseOrNull)
Description: When the condition testCondition is true, return valueTrue; otherwise return valueFalseOrNull
select if(10 > 5,'正确','错误');
(6) Collection functions:
(1) size: the number of elements in the collection
select size(friends) from test;
(2) map: Create a map collection
Syntax: map (key1, value1, key2, value2, …)
Description: Build a map type based on the input key and value pairs
select map('xiaohai',1,'dahai',2);
(3) map_keys: returns the keys of the map
select map_keys(map('xiaohai',1,'dahai',2));
(4) map_values: returns the values of the map
select map_values(map('xiaohai',1,'dahai',2));
(5) array: creates an array
Syntax: array(val1, val2, ...)
Description: Builds an array from the input arguments
select array('1','2','3','4');
(6) array_contains: Determine whether an element is contained in the array
select array_contains(array('a','b','c','d'),'e');
(7) sort_array: Sort the elements in the array
select sort_array(array('a','d','c'));
(8) struct: creates a struct
Builds a struct from the input arguments. Only the values are given; the fields are named col1, col2, ... automatically.
select struct('name','age','weight');
(9) named_struct: creates a struct with explicit field names and values
select named_struct('name','xiaosong','age',18,'weight',80);
(7) Advanced aggregation functions
(1) collect_list: collects the values into a list; duplicates are kept
select collect_list(job) from employee;
(2) collect_set: collects the values into a set; duplicates are removed
select collect_set(job) from employee;
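The difference is easiest to see on data that contains a duplicate. The sketch below uses a small inline data set (employee_sample is a hypothetical name, not the employee table used in this article) so that it runs without any existing table.

```sql
-- Three rows, with the value dev appearing twice.
with employee_sample as (
  select 'dev' as job
  union all select 'dev'
  union all select 'ops'
)
-- all_jobs has 3 elements (dev twice), distinct_jobs has 2 elements.
-- The order of the elements is not guaranteed.
select collect_list(job) as all_jobs,
       collect_set(job)  as distinct_jobs
from employee_sample;
```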
Example:
Count the employees hired in each month and list their names
select month(replace(hiredate,'/','-')) , count(1),collect_list(name) from employee group by month(replace(hiredate,'/','-'));
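Explode functions:
Explode functions are UDTFs (table-generating functions): they take one row of data as input and output one or more rows
Note: the value being exploded must be a collection, typically an array (a map is also accepted, as in case two below)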
(1) explode: (the most commonly used explode function)
Function: takes an array and expands its elements into multiple rows
Case one:
select explode(array("1","b","c")) as item;
Case two:
select explode(`map`("a",1,"b",2)) as (key,value);
(2) posexplode:
Function: returns two columns of data, the position (index) of each element and the element itself
select posexplode(`array`("a","b","c")) as (pos,item);
(3) inline function:
Function: explodes an array of structs into rows, with one column per struct field
select inline(array( named_struct("id",1,"name","zs"),
named_struct("id",2,"name","ls"),
named_struct("id",3,"name","txc")))
as (id,name);
(4) Lateral View:
On its own, an explode function only expands a single row of data. Lateral View applies the explode function to every row of the source table and joins the generated rows back to the row they came from. The resulting virtual table is then used as the source table of the query.
Parameter 1:
tmp is the alias of the table produced by the explode function
Parameter 2:
hobby is the column alias for the columns of the tmp table; if there is more than one column, separate the aliases with commas
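A minimal sketch of where tmp and hobby appear in a query. The person data set and its name and hobbies columns are hypothetical and built inline, so the query does not depend on an existing table.

```sql
-- Two source rows, each carrying an array of hobbies.
with person as (
  select 'xiaohai' as name, array('reading', 'running') as hobbies
  union all
  select 'dahai', array('chess')
)
-- tmp is the alias of the exploded table, hobby is the alias of its column.
-- Each source row is joined to its own exploded elements, giving 3 rows.
select p.name, tmp.hobby
from person p
lateral view explode(p.hobbies) tmp as hobby;
```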
Explode function case:
select type, count(1)
from movie_info
lateral view explode(split(category,',')) tmp as type
group by type;
Window functions:
A window function combines a window with a function. The window defines the range of rows to compute over, and the function defines the calculation applied to the rows in that range.
Syntax:
function(arguments) over (window definition)
Most aggregate functions can be used as window functions (each call is still a many-to-one calculation over the rows of its window)
There are two ways to define a window: (1) row-based (by row position) (2) value-based (by value)
Syntax: window, row-based
The query runs as a MapReduce job, so the data is split during the calculation and the rows do not necessarily reach the window in the order of the original table. A row-based window therefore needs an order by on some column, so that the rows are in a defined order when the window range is applied.
Syntax example:
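A minimal sketch of a row-based window. The order_info data set and its columns are hypothetical and defined inline so that the query is self-contained.

```sql
with order_info as (
  select 1 as order_id, 10 as amount
  union all select 2, 20
  union all select 3, 30
  union all select 4, 40
)
-- For each row, add the amount of the previous row and of the current row.
-- Expected sums for order_id 1 to 4: 10, 30, 50, 70
select order_id,
       amount,
       sum(amount) over (order by order_id
                         rows between 1 preceding and current row) as sum_prev_and_current
from order_info;
```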
Syntax: window, value-based
In a value-based window, order by selects the column whose values define the window range.
Note: when the range is written with n preceding or n following, the order by column must be a numeric type.
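A minimal sketch of a value-based window, again on a hypothetical inline order_info data set. The window of each row contains the rows whose amount lies between the current amount minus 10 and the current amount, regardless of how many rows that is.

```sql
with order_info as (
  select 1 as order_id, 10 as amount
  union all select 2, 20
  union all select 3, 25
  union all select 4, 50
)
-- amount 10 covers values 0 to 10, amount 20 covers 10 to 20,
-- amount 25 covers 15 to 25, amount 50 covers 40 to 50.
-- Expected sums for order_id 1 to 4: 10, 30, 45, 50
select order_id,
       amount,
       sum(amount) over (order by amount
                         range between 10 preceding and current row) as sum_in_value_range
from order_info;
```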
Syntax: window, partition
Partition: a partition column can be specified when defining the window, and the window is then computed separately inside each partition.
partition by names the partition column (rows with different values of this column belong to different partitions)
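A minimal sketch of a partitioned window. The order_info data set and the user_id column are hypothetical; the running total restarts for every user.

```sql
with order_info as (
  select 1 as order_id, 'u1' as user_id, 10 as amount
  union all select 2, 'u1', 20
  union all select 3, 'u2', 30
  union all select 4, 'u2', 40
)
-- Running total per user.
-- Expected for u1: 10, 30 and for u2: 30, 70
select order_id,
       user_id,
       amount,
       sum(amount) over (partition by user_id
                         order by order_id
                         rows between unbounded preceding and current row) as running_total_per_user
from order_info;
```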
Syntax: window, defaults
The defaults describe what happens when keywords are left out of the window definition.
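As a sketch of the usual defaults in Hive: without partition by, all rows form one partition; without order by and without a frame, the window is the whole partition; with order by but no frame, the window is range between unbounded preceding and current row. The inline order_info data set is hypothetical.

```sql
with order_info as (
  select 1 as order_id, 10 as amount
  union all select 2, 20
  union all select 3, 30
)
-- total_all_rows: 60 on every row (whole data set is the window).
-- running_default and running_explicit: 10, 30, 60 (same frame, written two ways).
select order_id,
       amount,
       sum(amount) over ()                  as total_all_rows,
       sum(amount) over (order by order_id) as running_default,
       sum(amount) over (order by order_id
                         range between unbounded preceding and current row) as running_explicit
from order_info;
```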
Window functions: cross-row value functions
(1) lead and lag:
Function: get the value of a column from a row above the current row (lag) or below it (lead)
Syntax:
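A minimal sketch, using a hypothetical inline order_info data set. The three arguments are the column, the number of rows to move, and the default value used when no such row exists.

```sql
with order_info as (
  select 1 as order_id, 10 as amount
  union all select 2, 20
  union all select 3, 30
)
-- previous_amount for order_id 1 to 3: 0, 10, 20
-- next_amount for order_id 1 to 3: 20, 30, 0
select order_id,
       amount,
       lag(amount, 1, 0)  over (order by order_id) as previous_amount,
       lead(amount, 1, 0) over (order by order_id) as next_amount
from order_info;
```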
Note: lag and lead do not support custom window ranges. Inside over you only specify whether to partition and how to sort.
(2) first_value and last_value
Function: get the first value/last value of the specified column within the window
(The window is evaluated row by row, starting from the first row, so the result is not necessarily the same for every row)
Syntax:
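A minimal sketch on a hypothetical inline order_info data set. It also shows why the window range matters for last_value: with order by and no explicit range, the window ends at the current row, so last_value simply returns the current row's value.

```sql
with order_info as (
  select 1 as order_id, 10 as amount
  union all select 2, 20
  union all select 3, 30
)
-- first_amount: 10 on every row
-- last_amount: 30 on every row (window spans the whole data set)
-- last_amount_default_window: 10, 20, 30 (window ends at the current row)
select order_id,
       amount,
       first_value(amount) over (order by order_id
                                 rows between unbounded preceding and unbounded following) as first_amount,
       last_value(amount)  over (order by order_id
                                 rows between unbounded preceding and unbounded following) as last_amount,
       last_value(amount)  over (order by order_id) as last_amount_default_window
from order_info;
```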
Unlike lag and lead, these two functions do support custom window ranges.
Window functions: ranking functions
Commonly used ranking functions are: rank, dense_rank, row_number
The three differ in how they treat ties: rank gives tied rows the same rank and leaves a gap after them, dense_rank gives tied rows the same rank without a gap, and row_number always assigns distinct consecutive numbers.
Function: calculate ranking
Note: rank, dense_rank, row_number do not support custom window ranges.
Syntax:
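A minimal sketch that puts the three functions side by side on data with a tie. The score_info data set and its columns are hypothetical and defined inline.

```sql
with score_info as (
  select 'a' as name, 90 as score
  union all select 'b', 90
  union all select 'c', 80
  union all select 'd', 70
)
-- Scores in descending order: 90, 90, 80, 70
-- rk:       1, 1, 3, 4
-- dense_rk: 1, 1, 2, 3
-- row_num:  1, 2, 3, 4 (which of the two tied rows gets 1 is not guaranteed)
select name,
       score,
       rank()       over (order by score desc) as rk,
       dense_rank() over (order by score desc) as dense_rk,
       row_number() over (order by score desc) as row_num
from score_info;
```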